While the model stack has evolved rapidly to reason across images, text, audio, and video, most data stacks are still catching up. Traditional data engines like Spark and Trino are designed and optimized for structured, tabular data rather than for unstructured data and rich media like images and audio, and they fall short of the needs of today's multimodal workloads.
In this talk, we explore what it means to build a multimodal data lake designed to handle the demands of modern AI workloads, such as heterogeneous datasets spanning images, text, and tabular data, so that teams can unlock new ML/AI capabilities without being limited by traditional, tabular-only systems. We'll dive into Daft, an open-source, Python-native unified data engine purpose-built for modern AI/ML data workflows, and the architectural patterns that let it process data efficiently and at scale across modalities.
We’ll walk through:
Join us as we demonstrate how to build a multimodal data lake using Daft on your existing infrastructure. Your data layer should be flexible enough for experimentation and also powerful enough to scale.
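To make this concrete, here is a minimal sketch of the kind of multimodal pipeline Daft enables, keeping tabular columns and images in a single dataframe. The URLs and column names are hypothetical placeholders, and the expression methods shown (url.download, image.decode, image.resize) are from Daft's documented expression API; treat this as an illustrative example rather than material from the talk itself.

    import daft
    from daft import col

    # A tiny in-memory example; in practice you might start from
    # daft.read_parquet("s3://...") over an existing data lake.
    df = daft.from_pydict({
        "image_url": [
            "https://example.com/cat.jpg",  # hypothetical URLs
            "https://example.com/dog.jpg",
        ],
        "label": ["cat", "dog"],
    })

    # Tabular and multimodal operations live in the same dataframe:
    # download the bytes, decode them into images, and resize them.
    df = (
        df.with_column("image_bytes", col("image_url").url.download())
          .with_column("image", col("image_bytes").image.decode())
          .with_column("thumbnail", col("image").image.resize(64, 64))
    )

    df.show()

Because Daft understands image types natively and executes lazily, the image columns go through the same optimized, scalable execution path as ordinary filters, joins, and aggregations.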
Jay Chia is one of the Co-Founders of Eventual, the company behind the Daft open-source project.
Prior to Eventual, he was a software engineer working on ML infrastructure for computational biology at Freenome and self-driving cars at Lyft, building large-scale data and computing platforms for diverse industries.
Jay hails from the sunny island nation of Singapore, and, fun fact, he used to command a platoon of tanks in the Singapore military.