While the model stack has evolved rapidly to reason across images, text, audio, and video, most data stacks are still catching up. Traditional data engines like Spark and Trino are designed and optimized for structured, tabular data rather than for unstructured data and rich media like images and audio, and they fall short of the needs of today's multimodal workloads.
In this talk, we explore what it means to build a multimodal data lake designed to handle the demands of modern AI workloads, such as heterogeneous datasets spanning images, text, and tabular data, so that teams can unlock new ML/AI capabilities without being limited by traditional, tabular-only systems. We'll dive into Daft, an open-source, Python-native unified data engine purpose-built for modern AI/ML data workflows, and the architectural patterns that let it process data efficiently and at scale across modalities.
We’ll walk through:
Join us as we demonstrate how to build a multimodal data lake using Daft on your existing infrastructure. Your data layer should be flexible enough for experimentation and also powerful enough to scale.
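To make this concrete, here is a minimal sketch of the kind of multimodal pipeline Daft enables, keeping tabular columns and images in a single dataframe. The URLs and column names are hypothetical placeholders, and the expression methods shown (url.download, image.decode, image.resize) are from Daft's documented expression API; treat this as an illustrative example rather than material from the talk itself.

    import daft
    from daft import col

    # A tiny in-memory example; in practice you might start from
    # daft.read_parquet("s3://...") over an existing data lake.
    df = daft.from_pydict({
        "image_url": [
            "https://example.com/cat.jpg",  # hypothetical URLs
            "https://example.com/dog.jpg",
        ],
        "label": ["cat", "dog"],
    })

    # Tabular and multimodal operations live in the same dataframe:
    # download the bytes, decode them into images, and resize them.
    df = (
        df.with_column("image_bytes", col("image_url").url.download())
          .with_column("image", col("image_bytes").image.decode())
          .with_column("thumbnail", col("image").image.resize(64, 64))
    )

    df.show()

Because Daft understands image types natively and executes lazily, the image columns go through the same optimized, scalable execution path as ordinary filters, joins, and aggregations.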
Jay Chia is one of the Co-Founders of Eventual, the company behind the Daft open-source project.
Prior to Eventual, he was a software engineer working on ML infrastructure for computational biology at Freenome and self-driving cars at Lyft, building large-scale data and computing platforms for diverse industries.
Jay hails from the sunny island nation of Singapore, and, fun fact, he used to command a platoon of tanks in the Singapore military.