Training large AI language models is challenging, requiring a deep understanding of natural language processing, machine learning, and distributed computing. In this talk, we will share lessons learned from training models with billions of parameters across hundreds of GPUs. We will discuss the challenges of handling massive amounts of data, designing effective model architectures, optimizing training procedures, and managing computational resources.
This talk is suitable for ML researchers, practitioners, and anyone curious about the “sausage making” behind training large language models.
Sandeep is an engineering director leading Databricks MosaicAI Model Training products. His key interests and expertise include large language model training and fine-tuning for GenAI, applying neural networks to structured data for predictive AI tasks, and building reliable, performant accelerated computing infrastructure.
Sandeep and his team built the MosaicAI Training platform, on which the Mosaic Research team trained the state-of-the-art MPT, DBRX, and ImageAI models. Currently, Sandeep is focused on building products that help every enterprise build “intelligence” on its secure enterprise data.
Previously, Sandeep led teams that built the Amazon SageMaker Model Training platform, supporting hundreds of enterprises on their journeys from data to intelligent models in production.