In a notable development, Tri Dao, co-author of FlashAttention, has teamed up with two Ph.D. students from Princeton University to introduce a new kernel library called QuACK. What sets this library apart is that it is written entirely in Python using CuTe-DSL, with no CUDA C++ code involved. Beyond challenging the assumption that peak GPU performance requires C++, the library achieves a 33%-50% speedup on H100 GPUs over PyTorch baselines such as torch.compile and Liger.
Dao notes that high efficiency in memory-intensive kernels is not some elusive "secret"; it comes down to getting a handful of key details right. Chief among them is understanding the thread and memory hierarchies of modern accelerators. As GPU performance optimization continues to evolve, developers can now achieve substantial speedups in a far friendlier environment using CuTe-DSL, a Python-based domain-specific language.
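To give a flavor of what this looks like in practice, here is a minimal elementwise-add sketch modeled on the CuTe-DSL examples shipped with CUTLASS 4.x. This is not QuACK's code; the module paths, decorators, and launch API shown (`cutlass.cute`, `@cute.kernel`, `@cute.jit`, `from_dlpack`) are assumptions based on the public examples and may differ between releases:

```python
# Minimal CuTe-DSL sketch (assumed API, modeled on the CUTLASS 4.x
# Python examples; consult those examples for the authoritative form).
import torch
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack


@cute.kernel
def add_kernel(gA: cute.Tensor, gB: cute.Tensor, gC: cute.Tensor):
    # The CUDA thread hierarchy is exposed directly, but in Python.
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    idx = bidx * bdim + tidx   # one thread per element
    m, n = gA.shape
    mi, ni = idx // n, idx % n
    gC[mi, ni] = gA[mi, ni] + gB[mi, ni]


@cute.jit
def add(mA: cute.Tensor, mB: cute.Tensor, mC: cute.Tensor):
    m, n = mA.shape
    threads = 256
    # Grid sized exactly; assumes m * n is divisible by the block size.
    add_kernel(mA, mB, mC).launch(
        grid=[(m * n) // threads, 1, 1],
        block=[threads, 1, 1],
    )


a = torch.randn(1024, 1024, device="cuda")
b = torch.randn_like(a)
c = torch.empty_like(a)
add(from_dlpack(a), from_dlpack(b), from_dlpack(c))
```

The point of the sketch is the programming model: indexing, tiling, and launches are ordinary Python, so iterating on a kernel no longer requires a C++ compile cycle.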
The work quickly drew attention from industry experts. Vijay, a senior architect on NVIDIA's CUTLASS team, praised it and noted that CuTe-DSL's design lets experts like Dao get peak performance out of GPUs, hinting at more announcements in this area later this year. Horace He of the PyTorch team also expressed strong interest, highlighting the library's advantages for long-sequence workloads.
To help other developers reproduce these results, the QuACK authors have written a detailed tutorial covering the concrete steps and code. The tutorial stresses that efficient GPU training and inference require optimizing two classes of kernels: compute-intensive and memory-intensive. Since optimizations for matrix multiplication and attention are already mature, the tutorial focuses on memory-intensive kernels.
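To make that distinction concrete, here is a back-of-the-envelope arithmetic-intensity comparison in plain Python. The H100 figures used (roughly 989 BF16 TFLOPS and 3.35 TB/s of HBM3 bandwidth) are approximate spec-sheet numbers, included only for illustration:

```python
# Arithmetic intensity = FLOPs performed per byte moved to/from memory.
# A kernel below the hardware "ridge point" is memory-bound.
PEAK_FLOPS = 989e12            # ~H100 BF16 dense FLOP/s (approximate)
PEAK_BW = 3.35e12              # ~H100 HBM3 bytes/s (approximate)
RIDGE = PEAK_FLOPS / PEAK_BW   # ~295 FLOPs/byte

def matmul_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                               # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(bytes_per_elem=2):
    # e.g. c = a + b: one FLOP per element, three elements moved
    return 1 / (3 * bytes_per_elem)

print(f"ridge point:     {RIDGE:6.0f} FLOPs/byte")
print(f"4096^3 matmul:   {matmul_intensity(4096, 4096, 4096):6.0f} FLOPs/byte (compute-bound)")
print(f"elementwise add: {elementwise_intensity():6.2f} FLOPs/byte (memory-bound)")
```

A large matmul sits far above the ridge point, so faster arithmetic helps it; an elementwise kernel sits far below, so only moving bytes faster helps.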
The authors explain that memory-intensive kernels have low arithmetic intensity, performing only a few floating-point operations per byte of data moved, so their throughput is limited by how many bytes per second the memory system can deliver rather than by compute. By carefully exploiting the GPU's memory hierarchy and hardware features, they pushed these kernels to near "speed-of-light" throughput, that is, close to the hardware's peak memory bandwidth.
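This gives a natural figure of merit: a memory-bound kernel can never run faster than total bytes moved divided by peak bandwidth, so achieved bandwidth as a fraction of peak measures how close to "light speed" it is. A small illustrative calculation (the bandwidth figure is approximate and the measured runtime is hypothetical):

```python
# "Speed of light" for a memory-bound kernel: runtime >= bytes / peak BW,
# so percent of peak bandwidth is the score to maximize.
PEAK_BW = 3.35e12  # ~H100 HBM3 bytes/s (approximate)

def speed_of_light_us(bytes_moved, peak_bw=PEAK_BW):
    return bytes_moved / peak_bw * 1e6

def pct_of_peak(bytes_moved, measured_us, peak_bw=PEAK_BW):
    return 100 * speed_of_light_us(bytes_moved, peak_bw) / measured_us

# Example: a softmax over an 8192 x 16384 bf16 tensor reads and writes
# every element once -> 2 bytes/elem in each direction.
bytes_moved = 2 * 2 * 8192 * 16384
print(f"light-speed runtime: {speed_of_light_us(bytes_moved):.0f} us")
print(f"a hypothetical 200 us kernel reaches {pct_of_peak(bytes_moved, 200):.0f}% of peak")
```

Under these numbers the floor is about 160 microseconds, so a hypothetical 200-microsecond kernel would be running at roughly 80% of peak bandwidth, with the remaining gap being what careful use of the memory hierarchy can recover.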