SSpMM: Efficiently Scalable SpMM Kernels Across Multiple Generations of Tensor Cores


Abstract

Sparse-Dense Matrix-Matrix Multiplication (SpMM) has emerged as a foundational primitive in HPC and AI. Recent work has sought to accelerate SpMM by harnessing the powerful Tensor Cores of modern GPUs, yet existing methods frequently suffer performance degradation when ported across different Tensor Core architectures. Recognizing that scalable SpMM across multiple generations of Tensor Cores relies on the effective use of general-purpose instructions, we develop an SpMM library named SSpMM. A key obstacle is the conflict between granularity and performance in current Tensor Core instructions. To resolve it, we introduce the Transpose Mapping Scheme, which implements fine-grained kernels using coarse-grained instructions, and we further propose the Register Shuffle Method to enhance performance. Finally, we introduce Sparse Vector Compression, a technique that enables our kernels to handle both structured and unstructured sparsity. Experiments on four generations of Tensor Core GPUs, using over 3,000 sparse matrices from well-established matrix collections, show that SSpMM achieves average speedups of 2.04×, 2.81×, 2.07×, and 1.87×, respectively, over the state-of-the-art SpMM solution. Furthermore, we have integrated SSpMM into PyTorch, achieving a 1.81× speedup in end-to-end Transformer inference compared to cuDNN.
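For readers unfamiliar with the "general-purpose instructions" the abstract refers to, the sketch below shows one portable interface to Tensor Core matrix-multiply-accumulate: the CUDA WMMA API, which has been available since the Volta generation. This is only an illustrative dense 16x16x16 tile multiply written for this edit; it is not the SSpMM kernel, and the kernel name and launch configuration are assumptions.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative sketch only (not SSpMM): one warp multiplies a single
// 16x16x16 tile through the portable WMMA interface to Tensor Cores.
// A and B hold FP16 inputs; C accumulates in FP32.
__global__ void wmma_tile_16x16x16(const half* A, const half* B, float* C) {
    // Per-warp register fragments for the operands and the accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // C := 0
    wmma::load_matrix_sync(a_frag, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Assumed launch: a single warp suffices for one tile, e.g.
//   wmma_tile_16x16x16<<<1, 32>>>(dA, dB, dC);
```

Because this interface is exposed uniformly across Tensor Core generations, kernels written against it (rather than against generation-specific sparse instructions) are the kind of building block on which portable SpMM implementations can be based.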