Large Language Models (LLMs) have achieved remarkable success on difficult tasks across many domains, but that success comes at the price of high computational cost and inference latency. As developers and third parties customize these models, the need for efficient inference has grown. Many efforts reduce inference cost through model compression techniques such as pruning and distillation; however, these techniques either require labeled data or are time-consuming, since the compressed model must be retrained to regain its accuracy.
In this paper, we propose a gradient-free structured pruning framework that uses only unlabeled data. Evaluations on the GLUE and SQuAD benchmarks with BERT-Base and DistilBERT illustrate the effectiveness of the proposed approach. Using only the weights of the pre-trained model and unlabeled data, and in a matter of minutes on a single GPU, up to 40% of the original FLOP count can be removed with less than a 4% accuracy loss across all tasks considered.
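To make the idea concrete, the sketch below prunes the intermediate neurons of a toy feed-forward block using only activations computed on unlabeled inputs, with no gradients and no retraining. It is a minimal illustration under assumed toy shapes and a simple mean-activation importance score; it is not the paper's criterion, only the general recipe of scoring structured units on unlabeled data and dropping the least important ones.

```python
import numpy as np

# Minimal sketch of gradient-free structured pruning on a toy feed-forward
# block. This is NOT the algorithm from the paper; it only illustrates the
# idea: score structured units (here, intermediate neurons) from activations
# on unlabeled data, then remove the lowest-scoring ones.

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256                   # assumed toy hidden/intermediate sizes
W1 = rng.normal(size=(d_model, d_ff))     # stand-ins for pre-trained weights
W2 = rng.normal(size=(d_ff, d_model))

def ffn(x, w1, w2):
    """Feed-forward block: ReLU(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2

# "Unlabeled data": a batch of input representations with no labels attached.
X = rng.normal(size=(1024, d_model))

# Score each intermediate neuron by the mean magnitude of its activation
# (one simple gradient-free importance proxy, used here only for illustration).
acts = np.maximum(X @ W1, 0.0)            # shape (1024, d_ff)
scores = acts.mean(axis=0)                # one score per neuron

# Keep the top 60% of neurons, i.e. remove roughly 40% of this block's FLOPs.
keep = np.argsort(scores)[-int(0.6 * d_ff):]
W1_pruned, W2_pruned = W1[:, keep], W2[keep, :]

# Structured pruning shrinks the matrices themselves, so the FLOP savings are
# realized directly at inference time rather than hidden behind a sparse mask.
print("kept neurons:", len(keep), "of", d_ff)
print("output difference (L2):",
      np.linalg.norm(ffn(X, W1, W2) - ffn(X, W1_pruned, W2_pruned)))
```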
I am a Senior Research Scientist at Google DeepMind. My research interests lie in the broad areas of machine learning and combinatorial optimization. Currently, my focus is on improving the capabilities of Large Language Models (LLMs); specifically, I am leading efforts on efficient inference, reasoning, and planning for LLMs. I have published more than 40 peer-reviewed papers at scientific venues such as Nature, NeurIPS, ICML, ICLR, VLDB, and SIGMOD. Before joining Google, I was a postdoctoral researcher in the Data Management, Exploration and Mining (DMX) group at Microsoft Research.
My research interests are in the broad areas of social network analysis, graph mining, machine learning, data mining, and databases. I completed my PhD in the Department of Computer Science and Engineering at the University of Texas at Arlington under the supervision of Dr. Gautam Das in the Database Exploration Lab (DBXLAB). My PhD research focused on data exploration and analysis over online community networks such as GooglePlus, Twitter, and Amazon; I worked on novel problems with practical impact, whose solutions often involved designing new techniques or adapting techniques from fields such as graph theory, algorithms, and statistics.
The Google AI Residency gave me the opportunity to collaborate with brilliant researchers on challenging machine learning problems. Many important real-world datasets are in the form of graphs or networks: social networks, knowledge graphs, protein-interaction networks, and the World Wide Web, to name a few. Most of my current research is devoted to generalizing neural network models to such real-world datasets, where the goal is to exploit the graph structure of the data in the training process.