A new kernel library called QuACK has recently emerged, developed by FlashAttention author Tri Dao together with two Ph.D. students from Princeton University. What sets QuACK apart is that it is written entirely in Python using CuTe-DSL, with no CUDA C++ code involved. Despite that, it achieves a 33%-50% speedup over PyTorch's torch.compile and the Liger kernel library on H100 GPUs.
Tri Dao notes that making memory-bound kernels run fast is no well-guarded secret; it comes down to getting the key details right, and above all to understanding the thread and memory hierarchy of modern accelerators. With CuTe-DSL, a Python-based domain-specific language, developers can now do this kind of performance tuning in a far more approachable environment.
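The article itself shows no code at this point, but the thread-hierarchy idea can be sketched in plain Python. The snippet below uses Numba's CUDA bindings as a stand-in for illustration (an assumption; QuACK's actual kernels are written in CuTe-DSL): the grid-of-blocks-of-threads launch mirrors the hardware hierarchy, and consecutive threads touch consecutive elements so global-memory accesses coalesce.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scaled_add(x, y, out, alpha):
    # One element per thread; the launch grid mirrors the GPU's
    # block/thread hierarchy.
    i = cuda.grid(1)  # blockIdx.x * blockDim.x + threadIdx.x
    if i < x.size:
        # Adjacent threads read adjacent elements, so loads coalesce
        # into wide global-memory transactions.
        out[i] = alpha * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.empty_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
scaled_add[blocks, threads_per_block](x, y, out, np.float32(2.0))
```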
The work has quickly garnered attention from industry experts. Vijay, a senior architect on NVIDIA's CUTLASS team, praises CuTe-DSL's design for letting experts like Tri Dao drive GPUs at full efficiency with ease, and hints at more developments in this area to be released later this year. Horace He of the PyTorch team expresses strong interest in the work, particularly its advantages for long-sequence processing.
To help more developers benefit, QuACK's creators have written a detailed tutorial walking through the implementation steps, with code ready to use directly. The tutorial points out that efficient GPU model training and inference requires optimizing both compute-bound and memory-bound kernels. Since matrix multiplication and attention have already been heavily optimized in prior work, this effort focuses on memory-bound kernels.
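To make the compute-bound/memory-bound distinction concrete, here is a back-of-the-envelope sketch (not from the tutorial; the H100 peak numbers are assumed specs): arithmetic intensity is FLOPs per byte of memory traffic, and a kernel whose intensity falls well below the machine's balance point is memory-bound.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

# Matmul C = A @ B with N x N fp16 matrices:
# ~2*N^3 FLOPs over ~3*N^2 elements * 2 bytes of traffic.
N = 4096
matmul_ai = arithmetic_intensity(2 * N**3, 3 * N * N * 2)  # ~1365 FLOPs/B

# Elementwise add out = x + y on M fp32 elements:
# M FLOPs over 3*M elements * 4 bytes of traffic.
M = 1 << 24
add_ai = arithmetic_intensity(M, 3 * M * 4)  # ~0.08 FLOPs/B

# Assumed H100 SXM specs: ~989 TFLOP/s dense fp16 compute, ~3.35 TB/s HBM3.
# Their ratio is the balance point: kernels far below it are memory-bound.
ridge = 989e12 / 3.35e12  # ~295 FLOPs/B
print(f"matmul: {matmul_ai:.0f} FLOPs/B, add: {add_ai:.2f} FLOPs/B, "
      f"balance point: {ridge:.0f} FLOPs/B")
```

Matmul sits far above the balance point (compute-bound), while elementwise kernels sit orders of magnitude below it, which is why data movement dominates their runtime.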
The authors explain that memory-bound kernels have low arithmetic intensity (few floating-point operations per byte of data moved), so their throughput depends on how many bytes per second they can move rather than on raw compute. By carefully exploiting the GPU's memory hierarchy and hardware features, they push the performance of memory-bound kernels close to the "speed-of-light" limit set by peak memory bandwidth.
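One way to quantify "speed of light" is to compare a kernel's achieved bandwidth against the hardware peak. Below is a minimal PyTorch timing sketch, assuming an H100's roughly 3.35 TB/s HBM3 peak (a figure not given in the article):

```python
import torch

def achieved_bandwidth(fn, bytes_moved, iters=100):
    """Time a CUDA kernel with events and return sustained GB/s."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters  # ms -> s per call
    return bytes_moved / seconds / 1e9

x = torch.randn(16384, 8192, device="cuda")
# An ideal softmax reads and writes every fp32 element exactly once.
traffic = 2 * x.numel() * x.element_size()
bw = achieved_bandwidth(lambda: torch.softmax(x, dim=-1), traffic)
PEAK_GBPS = 3350  # assumed H100 SXM HBM3 peak
print(f"{bw:.0f} GB/s achieved = {100 * bw / PEAK_GBPS:.0f}% of speed of light")
```

A memory-bound kernel that sustains a high fraction of this peak has essentially nothing left to gain, which is the sense in which QuACK's kernels approach the speed-of-light limit.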