Large language models (LLMs) have changed the AI landscape in recent months. The pace of innovation in AI has been unprecedented, unlike that of any other field of science. This fast-paced progress has been possible mainly due to open science and open access. However, research on benchmarking and evaluation tools has not kept up with the growth in the capabilities of these powerful models.
I will discuss the LLM landscape, where open and closed-access models coexist, and the pros and cons of each. Next, I will focus on current tooling for evaluating LLMs’ capabilities and vulnerabilities. Finally, I will discuss open challenges and exciting research problems in taming the wild west of LLMs.
I am a Research Lead at Hugging Face 🤗. Currently, I am part of the team building an open-source alternative to ChatGPT called H4. While building such a powerful LLM, I am thinking deeply about evaluation, specifically about the following problems: 1. Designing and implementing methods for evaluating emergent capabilities, 2. Characterizing the Pareto front across different choices of safety vs. usefulness, and 3. Developing quantitative measures for the trade-offs between these choices.
My interests and expertise lie at the intersection of Model Evaluation, Robustness, and Interpretability. My long-term research agenda includes 1. Dataset curation for understanding training and evaluation data, 2. Human-in-the-loop training and evaluation for model robustness, and 3. Repurposing LLMs and PLMs to make them more usable in the real world. Check out my HF Spaces to see what I am building.
Before joining Hugging Face, I was a Senior Research Scientist at Salesforce Research, where I worked with Richard Socher and Caiming Xiong on commonsense reasoning and interpretability in NLP. I led a small team focused on building robust natural language generation models. Prior to working at Salesforce, I completed my Ph.D. in the Machine Learning Research Group of the Department of Computer Science at the University of Texas at Austin. I worked with my advisor, Prof. Ray Mooney, on problems in NLP, Vision, and at the intersection of the two. I have also worked on problems in Explainable AI (XAI), where I proposed a scalable approach to generating visual explanations for ensemble methods using the localization maps of the component systems. Evaluating explanations is also a challenging problem, and I proposed two novel evaluation metrics that do not require human-generated ground truth.
I completed my MS in CS with a thesis advised by Jason Baldridge on new-topic detection using topical alignment of tweets based on their authors and recipients.