NVIDIA GPU Matrix Multiplication

2025-01-08

cuda

Use shared memory, with each thread computing a 4×4 sub-matrix of the output.

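Pay attention to how the workload is divided among threads. __syncthreads() is needed both before and after the sub-matrix computation: before, so that the shared-memory tiles are fully loaded, and after, so that no thread overwrites a tile that other threads are still reading.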
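
The idea above can be sketched roughly as follows. This is a minimal sketch, not a tuned implementation. The tile sizes (BM, BN, BK), the kernel name sgemm_4x4, the 16×16 thread block, the row-major layout, and the requirement that M, N, and K are multiples of the tile sizes are all assumptions made for illustration; none of them come from the original note.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// All sizes below are illustrative assumptions.
constexpr int TM = 4, TN = 4;    // each thread computes a TM x TN sub-matrix of C
constexpr int BM = 64, BN = 64;  // each block computes a BM x BN tile of C
constexpr int BK = 16;           // depth of the shared-memory tiles

// C = A * B, row-major; A is M x K, B is K x N.
// Assumes M % BM == 0, N % BN == 0, K % BK == 0 and a (BN/TN) x (BM/TM) thread block.
__global__ void sgemm_4x4(const float* A, const float* B, float* C, int M, int N, int K) {
    __shared__ float As[BM][BK];
    __shared__ float Bs[BK][BN];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int tid = ty * blockDim.x + tx;
    const int nthreads = blockDim.x * blockDim.y;
    const int rowBase = blockIdx.y * BM;  // first row of this block's C tile
    const int colBase = blockIdx.x * BN;  // first column of this block's C tile

    float acc[TM][TN] = {};  // per-thread 4x4 accumulator, kept in registers

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Workload split for loading: all threads cooperatively copy both tiles.
        for (int i = tid; i < BM * BK; i += nthreads) {
            int r = i / BK, c = i % BK;
            As[r][c] = A[(rowBase + r) * K + (k0 + c)];
        }
        for (int i = tid; i < BK * BN; i += nthreads) {
            int r = i / BN, c = i % BN;
            Bs[r][c] = B[(k0 + r) * N + (colBase + c)];
        }
        __syncthreads();  // before compute: tiles must be fully loaded

        for (int k = 0; k < BK; ++k)
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += As[ty * TM + i][k] * Bs[k][tx * TN + j];

        __syncthreads();  // after compute: nobody may still be reading the tiles
    }

    // Each thread writes its own 4x4 sub-matrix back to global memory.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(rowBase + ty * TM + i) * N + (colBase + tx * TN + j)] = acc[i][j];
}

int main() {
    const int M = 128, N = 128, K = 128;  // small sizes chosen for a quick check
    std::vector<float> hA(M * K), hB(K * N), hC(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = (i % 7) * 0.25f;
    for (int i = 0; i < K * N; ++i) hB[i] = (i % 5) * 0.5f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(BN / TN, BM / TM);  // 16 x 16 threads
    dim3 grid(N / BN, M / BM);
    sgemm_4x4<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a plain CPU reference.
    float maxErr = 0.0f;
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[r * K + k] * hB[k * N + c];
            maxErr = fmaxf(maxErr, fabsf(ref - hC[r * N + c]));
        }
    printf("max abs error: %g\n", maxErr);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

With this layout a 16×16 thread block covers a 64×64 tile of C, so each thread loads 4 elements of each shared tile per iteration and performs 16 multiply-adds per k step. It should compile with something like nvcc -o sgemm sgemm.cu, where the file name is only a placeholder.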