Using Large Language Models for Extract, Transform, and Load on AI Data:
An MLtwist Brief

MLtwist Background

MLtwist is a managed platform providing no-code data pipelines for AI use cases. Its customers are primarily non-engineers, who benefit from the flexibility the MLtwist platform provides.

Large language models for Extract, Transform, and Load

Large language models (LLMs) bring a number of advantages to a no-code extract, transform, and load (ETL) solution, especially for a non-technical audience: a natural language interface, an iterative environment, and semantic understanding of both data and code.

Natural Language Interface

Designing an ETL workflow today typically requires the use of one or more programming languages, often one for building the workflow graph and another for building the components run by the graph. For a non-engineer, this can be daunting and error-prone.

The rise of no-code ETL tools, such as Matillion, Azure Data Factory, and AWS Glue, demonstrates the demand for quick and easy ETL. However, the tradeoff is in flexibility. Though these tools purport to be flexible, in reality they either connect to many different data sources or offer a library of chainable low-level primitives that still requires rigorous work and customization.

LLMs provide an alternative that has not previously been available. From a simple text prompt, they can generate complicated code that solves real business problems. 

For example, the following prompt [Write a function in python to split a jpeg image into 12 approximately equal tiles. Show the code only, no explanation.] will correctly generate a Python function to split a jpeg into tiles. Previously, such components would have required an engineer to write. Now they can be produced by anyone needing this transformation. While engineers are still required to test and validate components, the transformative effect of this ability cannot be overstated.
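As a rough illustration of what such a generated component might contain, here is a minimal sketch of the geometry step only: computing the crop boxes for a 3 x 4 grid. The function name `tile_boxes` and the 3 x 4 layout are assumptions made for this example, not the output of any particular LLM.

```python
def tile_boxes(width, height, rows=3, cols=4):
    """Return (left, upper, right, lower) boxes splitting an image into rows * cols tiles."""
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be positive")
    if width < cols or height < rows:
        raise ValueError("image is too small for the requested grid")
    # Integer edges spread any remainder across tiles, so tile sizes
    # differ by at most one pixel in each dimension.
    x_edges = [width * c // cols for c in range(cols + 1)]
    y_edges = [height * r // rows for r in range(rows + 1)]
    return [
        (x_edges[c], y_edges[r], x_edges[c + 1], y_edges[r + 1])
        for r in range(rows)
        for c in range(cols)
    ]


# Example: a 1000 x 750 pixel image split into 12 tiles.
boxes = tile_boxes(1000, 750)
```

With an imaging library such as Pillow, each box could then be passed to `Image.crop` to produce the actual tile; that step is left out here so the sketch has no external dependencies.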

Iterative Environment

The chat-like nature of current LLM interfaces such as ChatGPT or Bard also benefits ETL development. 

Traditionally, data pipelines were built for batch processing, which leads to a long feedback cycle. More recently, notebook-driven development has allowed AI engineers to get quicker feedback and develop in a more iterative manner. 

However, non-engineers are still relegated to giving requirements to an engineer, waiting for a solution, and then repeating the process, again leading to long feedback cycles. The chat-based interface is much closer to a notebook in nature. When building a workflow graph, the LLM ETL user can specify requirements step by step, testing the code directly in the platform and then moving to the next step.

The LLM ETL platform will retain the context of the conversation and be able to augment a previous step or create a new one as desired. Upon completion, a user can ask the platform to serialize the workflow for future use. 
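The step-by-step flow described above could be sketched as follows. This is a hypothetical illustration: the `WorkflowSession` class, its method names, and the JSON layout are assumptions for this example and do not describe the MLtwist platform's actual interface.

```python
import json


class WorkflowSession:
    """Hypothetical container for workflow steps accumulated during a chat session."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, step_id, description, depends_on=()):
        """Append a new step; dependencies must refer to steps already defined."""
        known = {step["id"] for step in self.steps}
        if step_id in known:
            raise ValueError(f"duplicate step: {step_id}")
        missing = [dep for dep in depends_on if dep not in known]
        if missing:
            raise ValueError(f"unknown dependencies: {missing}")
        self.steps.append(
            {"id": step_id, "description": description, "depends_on": list(depends_on)}
        )

    def revise_step(self, step_id, description):
        """Augment a previous step in place, keeping its position in the graph."""
        for step in self.steps:
            if step["id"] == step_id:
                step["description"] = description
                return
        raise KeyError(step_id)

    def serialize(self):
        """Serialize the workflow so it can be stored and reused later."""
        return json.dumps({"name": self.name, "steps": self.steps}, indent=2)


session = WorkflowSession("image-prep")
session.add_step("load", "Read JPEG files from the input folder")
session.add_step("tile", "Split each image into 12 tiles", depends_on=["load"])
session.revise_step("tile", "Split each image into 12 tiles, prefixed with the source file name")
saved = session.serialize()
```

In a real platform each step would be created or revised by the LLM in response to a user message; here the calls are written out by hand to show the retained context and the final serialization.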

Semantic Understanding

Not only can LLMs generate code; they can also be used for semantic understanding of existing code and data. The prompt created by the end user can be augmented by the platform to include information about previous workflows and components. 

For example, instead of generating code, the engine handling the image-tiling prompt above could first query a vector database of existing components and use the company's preferred component if one exists. An entire workflow could likewise be broken down into a vectorized form and used to find similar workflows that can be reused or modified.
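A minimal sketch of this lookup is shown below. It is an illustration under loud assumptions: the word-count `embed` function is a toy stand-in for a real embedding model, the in-memory `COMPONENTS` dictionary stands in for a vector database, and the component names and the 0.5 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical registry of existing components and their descriptions.
COMPONENTS = {
    "split_image_tiles": "split a jpeg image into equal tiles",
    "resize_image": "resize an image to a fixed width and height",
    "csv_to_json": "convert a csv file into json records",
}


def find_component(prompt, threshold=0.5):
    """Return the best-matching component name, or None if nothing is similar enough."""
    query = embed(prompt)
    best_name, best_score = None, 0.0
    for name, description in COMPONENTS.items():
        score = cosine(query, embed(description))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


match = find_component(
    "Write a function in python to split a jpeg image into 12 approximately equal tiles"
)
no_match = find_component("send an email to the finance team")
```

When `find_component` returns `None`, the engine would fall back to generating new code; otherwise it would reuse the matched component.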

Anecdotally, we have seen companies with multiple workflows from different parts of the organization waste computing and storage resources by recomputing similar or identical data. These organizations often form data engineering groups and create processes to prevent recomputation, but an LLM-powered workflow engine backed by a vector database could provide much of that for free. 

From a data perspective, LLMs can understand metadata and enforce data policies. Assuming the company has set up policies on what data can be used for what purposes, the LLM can help understand both the purpose and the data involved. 

For example, ChatGPT can determine that firstname, first_name, and first-name likely represent the same data in a CSV file and also that they are personally identifiable information (PII). With the right prompt engineering, the LLM could identify potentially sensitive data in any flow and ensure the flow is tagged correctly.
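The tagging step can be sketched with a simple rule-based stand-in for the judgment an LLM would make. The `PII_FIELDS` set and the function names are assumptions for illustration only; an LLM-backed version would infer these categories from context rather than from a fixed list.

```python
import re

# Hypothetical list of canonical field names treated as PII in this example.
PII_FIELDS = {"firstname", "lastname", "email", "phone"}


def canonical(column):
    """Lowercase and drop separators so spelling variants collapse to one key."""
    return re.sub(r"[^a-z0-9]", "", column.lower())


def tag_columns(columns):
    """Map each column header to its canonical form and a PII flag."""
    return {
        column: {"canonical": canonical(column), "pii": canonical(column) in PII_FIELDS}
        for column in columns
    }


tags = tag_columns(["firstname", "first_name", "first-name", "order_total"])
```

Here the three spellings of the first-name column share one canonical key and are flagged as PII, while `order_total` is not.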

In another example, the LLM can also understand customer data formats and ensure that inputs are parsed correctly, limits and batch sizes are enforced, and outputs are formatted in the customer’s standard data format. The more the customer uses the platform, the better equipped the platform will be to anticipate the customer’s needs. 

By integrating enterprise workflows and metadata with an LLM and vector database, the ETL platform can create a moat via network economies, counter-positioning, and switching costs. More enterprise employees using the tool means more components, workflows, and metadata available. This is a clear network effect where it is easy to get your work done because others are indirectly assisting you as they complete theirs. 

Counter-positioning arises because most no-code ETL tools have put a lot of time and effort into a limited set of components and a drag-and-drop GUI. Changing to an LLM would directly threaten their existing model and require migration and retraining of their users, which will most likely delay their investment in this space.

Finally, the cost of switching away from an LLM ETL tool will be high, as the workflows and components were all generated by the tool and there is no in-house development team to help migrate to a new system.

Putting it all together

Combining a natural language interface, an iterative environment, and semantic understanding allows a user of this type of platform to describe a workflow while the platform finds or generates the correct workflow, finds or generates the correct components for it, adds the necessary metadata, and makes important suggestions on the overall approach. We believe that tools such as LangChain and AutoGPT already prove the feasibility of this concept.

In conclusion, MLtwist sees this development as a generational shift in how no-code ETL tools are envisioned. We have also been exploring this space as a concierge for companies with precise needs, acting as an LLM of sorts as we build out a component and workflow library tuned for AI pipelines. We have been doing the things that don’t scale and developing a viable business model. 

The next step is to scale with LLMs and show the transformational effect they will have on the industry.

David Smith, Co-founder & CEO, MLtwist

Before founding MLtwist, David Smith held leadership roles at Google and Oracle and went through four acquisitions. He is focused on enabling end-to-end pipes for strategic data and has launched first-of-a-kind data partnerships with Oracle, Google, JD Power, and Twitter. David holds a Bachelor of Science in Computer Science and Engineering from UC Davis.