On Distributed Larger-Than-Memory Subset Selection with Pairwise Submodular Functions

Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam

Conference on Machine Learning and Systems 2025 · Day 4 · Session 9: Parallel and Distributed Systems

Training models on massive, often petabyte-scale datasets is a significant cost in large-scale machine learning. This talk, presented by Maximilian Böther as joint work between Google and ETH, introduces an approach to **distributed larger-than-memory subset selection using pairwise submodular functions**. The core problem is to efficiently identify a small, representative subset of a huge dataset that can serve as a proxy for the full dataset during model training and experimentation: training on the subset should give an accurate indication of a model's performance on the entire dataset, at a fraction of the computational cost.
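To make the objective concrete, here is a minimal single-machine sketch of the kind of pairwise submodular selection the talk generalizes: greedy maximization of a facility-location function, where a subset's value is how well it "covers" every point under some pairwise similarity. The function name, the cosine similarity, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def greedy_facility_location(similarity: np.ndarray, budget: int) -> list[int]:
    """Greedy maximization of f(S) = sum_i max_{j in S} similarity[i, j].

    f is monotone submodular when the similarities are nonnegative, so
    plain greedy enjoys the classic (1 - 1/e) approximation guarantee.
    """
    n = similarity.shape[0]
    selected: list[int] = []
    # best_cover[i]: similarity of point i to its closest selected point so far
    best_cover = np.zeros(n)
    for _ in range(budget):
        # Marginal gain of each candidate j: how much adding j improves
        # total coverage summed over all n points.
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-pick an already-selected point
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, similarity[:, j])
    return selected

# Toy usage: 500 random embeddings, nonnegative cosine similarity.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = np.clip(X @ X.T, 0.0, None)
subset = greedy_facility_location(sim, budget=25)
print(subset)
```

Note that even this toy version materializes the full n × n similarity matrix and keeps all coverage state in a single machine's memory, which is precisely what stops working at petabyte scale and motivates the distributed, larger-than-memory formulation the talk addresses.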

AI review

Solid systems-oriented ML research from a Google/ETH team that takes on a real and underappreciated problem (submodular subset selection when even the selected subset does not fit in RAM) and proposes a principled distributed solution. The "bounding" insight is genuinely useful, and the honesty about the failure mode of naive partitioning is refreshing. But the write-up stays largely at the level of explaining the concept rather than showing how you would actually wire this up, and the experimental validation on CIFAR does not fully de-risk the petabyte-scale claims.