As adoption of generative AI accelerates and agentic AI systems add new inference demands, the greatest challenge lies in scaling workloads from prototypes to production, where cost, latency, and the complexity of GPU management often stall deployment. This talk explores essential strategies, including quantization, batching, caching, and hardware-aware optimization, that bridge the gap between research prototypes and production-grade performance and reliability.
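To make one of these strategies concrete: continuous batching schedules requests at the granularity of individual decoding steps, so a finished request leaves the batch immediately and a waiting one takes its slot. The toy simulation below is an illustrative sketch of that scheduling idea only; the Request type, token counts, and step loop are hypothetical simplifications, not FriendliAI's implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int        # request id
    remaining: int  # output tokens still to generate

def continuous_batching(pending: deque, max_batch: int = 4) -> dict:
    """Toy simulation of continuous (iteration-level) batching.

    Each engine step decodes one token for every active request.
    Finished requests retire immediately and waiting requests fill
    the freed slots, so no GPU slot idles waiting for the slowest
    request in the batch, unlike static (request-level) batching.
    """
    active, finish_step, step = [], {}, 0
    while pending or active:
        # Refill freed slots from the waiting queue at every step.
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        step += 1
        for req in active:
            req.remaining -= 1  # decode one token for each request
        # Retire requests the moment they finish, mid-batch.
        for req in [r for r in active if r.remaining == 0]:
            finish_step[req.rid] = step
            active.remove(req)
    return finish_step

# Requests with very different output lengths share GPU slots efficiently:
# short requests finish early instead of waiting for the longest one.
reqs = deque(Request(i, n) for i, n in enumerate([2, 8, 3, 8, 1, 5]))
print(continuous_batching(reqs))
```

Under static batching, the same six requests would hold their slots until every member of their batch finished; here each request's completion time depends only on when it entered and how many tokens it needed.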
Drawing on lessons from large-scale deployments, we highlight how these strategies enable developers to achieve higher throughput, lower costs, and predictable outcomes. We conclude by showing how these principles are realized in FriendliAI, powered by a purpose-built inference stack that abstracts infrastructure complexity and consistently delivers unmatched performance at scale.
Byung-Gon Chun is the Founder and CEO of FriendliAI, leading innovations that make AI deployment more efficient and scalable. With decades of experience at the intersection of AI platforms and distributed systems, he blends academic rigor with practical leadership to advance AI performance and impact. He pioneered continuous batching, now the industry standard for LLM inference.
Byung-Gon is currently on leave from Seoul National University, where he is a Professor of Computer Science and Engineering. His prior research experience spans Facebook, Microsoft, Yahoo!, and Intel. His work has received global recognition, including the ACM SIGOPS Hall of Fame Award, the EuroSys Test of Time Award, and research honors from Google, Microsoft, Amazon, and Facebook. He holds a Ph.D. from UC Berkeley, an M.S. from Stanford, and B.S./M.S. degrees from Seoul National University.