Introducing Grouped GEMM APIs in cuBLAS and More Performance Updates
“Grouped GEMM APIs can be viewed as a generalization of the batched APIs: they allow GEMMs with different matrix sizes, transpositions, and scaling factors to be grouped and parallelized in a single kernel launch.
One example where this approach provides a speedup is the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs. In this example, the grouped GEMM API achieves a 1.2x speedup over naively looping over calls to the batched GEMM API.
This is impressive because the current grouped GEMM kernels only leverage warp-level MMA instructions, yet they have been shown to compete with the batched GEMM kernels, which leverage warp group-level MMA (wgmma) instructions…”
Source: developer.nvidia.com/blog/introducing-grouped-gemm-apis-in-cublas-and-more-performance-updates/
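
For context, here is a minimal sketch of what a grouped call can look like; this is not code from the blog post. It assumes the cublasSgemmGroupedBatched entry point available in recent cuBLAS releases (12.5+), that the A/B/C pointer arrays live in device memory as with the existing batched API, and purely illustrative shapes; error checking and cleanup are omitted.

```c
// Sketch: two groups with different shapes/transposes in one grouped call.
// Assumes cuBLAS >= 12.5; shapes and names here are illustrative only.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Per-group parameter arrays have length group_count. Group 0 holds two
    // 32x32x32 GEMMs; group 1 holds one 64x16x128 GEMM with B transposed.
    const int group_count = 2;
    const int group_size[] = {2, 1};
    const cublasOperation_t transa[] = {CUBLAS_OP_N, CUBLAS_OP_N};
    const cublasOperation_t transb[] = {CUBLAS_OP_N, CUBLAS_OP_T};
    const int m[] = {32, 64}, n[] = {32, 16}, k[] = {32, 128};
    const int lda[] = {32, 64}, ldb[] = {32, 16}, ldc[] = {32, 64};
    const float alpha[] = {1.0f, 1.0f}, beta[] = {0.0f, 0.0f};

    // Pointer arrays have one entry per problem (2 + 1 = 3), ordered group
    // by group. Allocate a device buffer per matrix.
    enum { PROBLEMS = 3 };
    const int gidx[PROBLEMS] = {0, 0, 1};  // group index of each problem
    float *hA[PROBLEMS], *hB[PROBLEMS], *hC[PROBLEMS];
    for (int p = 0; p < PROBLEMS; ++p) {
        int g = gidx[p];
        cudaMalloc(&hA[p], sizeof(float) * lda[g] * k[g]);
        cudaMalloc(&hB[p], sizeof(float) * ldb[g] *
                           (transb[g] == CUBLAS_OP_N ? n[g] : k[g]));
        cudaMalloc(&hC[p], sizeof(float) * ldc[g] * n[g]);
    }

    // As with the batched API, the pointer arrays themselves are copied to
    // device memory before the call (assumption; see lead-in).
    float **dA, **dB, **dC;
    cudaMalloc(&dA, sizeof(hA)); cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMalloc(&dB, sizeof(hB)); cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    cudaMalloc(&dC, sizeof(hC)); cudaMemcpy(dC, hC, sizeof(hC), cudaMemcpyHostToDevice);

    // One launch covers all three GEMMs across both groups.
    cublasStatus_t st = cublasSgemmGroupedBatched(
        handle, transa, transb, m, n, k, alpha,
        (const float *const *)dA, lda,
        (const float *const *)dB, ldb, beta,
        dC, ldc, group_count, group_size);
    printf("status: %d\n", (int)st);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    return 0;
}
```

The split in the parameter layout is what makes the generalization work: shape, transpose, leading-dimension, and scaling arrays carry one entry per group, while the A/B/C pointer arrays carry one entry per problem, so a single launch can cover heterogeneous GEMMs that the uniform batched API would otherwise require separate calls for.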