Generative AI Model Data Pre-Training on Kubernetes: A Use Case St... Alexey Roytman & Anish Asthana

Alexey Roytman, Anish Asthana

KubeCon + CloudNativeCon Europe 2025 · Session

This talk, presented by Anish Asthana (Red Hat) and Alexey Roytman (IBM Research), covers **foundation model data engineering** on Kubernetes. It addresses the challenges of preparing massive datasets for training large language models (LLMs), focusing on how to scale complex data pre-processing workflows from local development environments to production-grade, cloud-native infrastructure. The speakers share their experiences and solutions, highlighting the use of **Ray** for distributed computing and **Kubeflow Pipelines (KFP)** for orchestration on Kubernetes.
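To make the "local to distributed" idea concrete, here is a minimal sketch of a per-shard data-cleaning step fanned out over a worker pool. It is not taken from the talk: the transform (`clean_shard`), its `min_chars` threshold, and the sample shards are hypothetical, and the standard-library `ThreadPoolExecutor` stands in for a distributed runtime such as Ray so the example runs without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def clean_shard(docs, min_chars=20):
    """Hypothetical transform: drop short documents and exact duplicates in one shard."""
    seen, kept = set(), []
    for text in docs:
        # Collapse runs of whitespace so trivially different copies hash the same.
        normalized = " ".join(text.split())
        if len(normalized) < min_chars:
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(normalized)
    return kept


def run_pipeline(shards, workers=4):
    """Apply the transform to every shard in parallel; results keep shard order."""
    # Stand-in for a distributed runtime: each shard is an independent unit of work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_shard, shards))


# Illustrative input only; real shards would be files in object storage.
shards = [
    [
        "Kubernetes schedules the worker pods.",
        "Kubernetes  schedules the worker pods.",
        "too short",
    ],
    ["Each shard is processed independently."],
]
cleaned = run_pipeline(shards)
```

Because each shard is handled independently, the same function could in principle be submitted as a task to a distributed runtime instead of a local pool. Deduplication here is only within a shard; removing duplicates across shards would need an extra coordination step.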

AI review

This talk delivers a robust, production-validated architectural blueprint for tackling the immense challenges of foundation model data engineering on Kubernetes. By expertly integrating Ray, KubeRay, and Kubeflow Pipelines, the speakers demonstrate a scalable and reproducible workflow capable of processing terabytes of data and billions of documents. The introduction of the open-source Data Preparation Kit (DPK) further solidifies its value, offering a standardized, framework-agnostic solution that significantly lowers the barrier to entry for complex data transformations, making this a…
