Reliability is a critical challenge for deploying AI systems in production. As with conventional software, moving an AI system from prototype to production is far from trivial, requiring robust testing, debugging, and governance. Unlike conventional systems, however, whose components have clear specifications, AI systems, particularly those built on large language models, are black boxes, making failures difficult to detect and diagnose.
To address this challenge, we present two efforts: LMArena, a system that evaluates LLMs on real human prompts to measure both accuracy and the impact of response style, and MAST, a taxonomy and dataset that reveal why multi-agent systems fail, including poor specifications, misalignment, and weak verification. We argue that achieving reliability requires transforming AI development into a true engineering discipline, grounded in better specifications, which are essential for building, debugging, and verifying modular, robust systems.
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor's Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and serves as the Executive Chairman of Databricks and Anyscale.
His current research focuses on AI systems and cloud computing, and he has contributed to numerous open-source projects, including SkyPilot, vLLM, Chatbot Arena, Ray, and Apache Spark.
Ion is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has co-founded four companies: LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).