1,001 Ways to Accelerate Python with CUDA Kernels | NVIDIA GTC 2025

Leo Fang

NVIDIA GTC 2025 · Session

In this GTC 2025 presentation, Leo Fang, NVIDIA's Python CUDA Tech Lead, surveys a spectrum of approaches for accelerating Python applications with **CUDA kernels**. The talk, part of a series dedicated to CUDA Python, focuses on authoring high-performance kernels from within the Python ecosystem. Fang emphasizes NVIDIA's overarching mission to foster a robust, interoperable CUDA Python environment in which developers can freely mix and match Python packages and CUDA functionality. Using the pedagogical example of **segmented reduction**, a common operation in data processing and machine learning, Fang walks through the evolution from traditional C++ CUDA kernel development to a rich landscape of Python-centric tools and programming models, all while aiming to preserve C++-level performance. The talk is aimed at Python developers who want to push performance on NVIDIA GPUs without abandoning their preferred language or grappling with the complexities of C++ compilation workflows.
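For readers unfamiliar with the running example: a segmented reduction applies a reduction (here, a sum) independently to each contiguous segment of an array, with segment boundaries given by an offsets array. The talk's GPU implementations are not reproduced here, but the operation itself can be illustrated on the CPU with NumPy's `np.add.reduceat` (a minimal sketch; the example data is invented for illustration):

```python
import numpy as np

# Segmented reduction: reduce (sum) each contiguous segment of `values`.
# `offsets` holds the start index of every segment; segment i spans
# values[offsets[i]:offsets[i+1]] (the last segment runs to the end).
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
offsets = np.array([0, 2, 5])  # segments: [1,2], [3,4,5], [6]

segment_sums = np.add.reduceat(values, offsets)
print(segment_sums)  # [ 3. 12.  6.]
```

The GPU versions discussed in the talk compute the same result, but must additionally decide how to map segments and elements onto threads, warps, and blocks, which is precisely the design space the various Python kernel-authoring tools expose.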

AI review

Leo Fang gives a competent, technically honest survey of the CUDA Python kernel authoring landscape using segmented reduction as a running example. The breadth is genuinely useful — if you didn't know NVGLink or CUDA Cooperative existed, you do now — but the talk stays firmly in survey mode throughout. No benchmarks, no reproducible examples, and several of the most interesting tools are described as 'still under development.' Solid orientation talk for Python engineers who want to understand what's possible on modern NVIDIA hardware without writing C++, but it doesn't go deep enough on any…
