DeepSeek Unveils DeepGEMM: ~300-Line Kernel Powers V3 and R1 Models

In its third open-source release of the week, DeepSeek-AI has launched DeepGEMM, an FP8 General Matrix Multiplication (GEMM) acceleration library that powers both DeepSeek-V3 and R1 training and inference. The library pairs high performance with a deliberately small, readable codebase.

Outstanding Performance with Clean Design

DeepGEMM delivers impressive computational efficiency, reaching over 1350 FP8 TFLOPS on NVIDIA Hopper architecture GPUs. In practice, that means faster model training, faster inference, and lower compute costs for AI development.

What sets DeepGEMM apart is its economy: the core kernel consists of roughly 300 lines of code, yet it matches or even surpasses extensively expert-tuned libraries in performance. This makes DeepGEMM not only a high-performance computation library but also an excellent learning resource for anyone studying Hopper FP8 matrix multiplication and optimization techniques.

Key Features

  • Incredible Performance: Achieves 1350+ FP8 TFLOPS on Hopper GPUs
  • Minimal Dependencies: Clean codebase with tutorial-like simplicity
  • Just-In-Time Compilation: All kernels are compiled at runtime, so no kernel pre-compilation is needed at install time
  • Versatile Support: Works with both dense models and Mixture of Experts (MoE) architectures
  • Multiple MoE Layouts: Supports both contiguous and masked data layouts (illustrated in the sketch after this list)
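
To make the two MoE layouts concrete, here is a minimal pure-PyTorch sketch of how grouped inputs might be arranged. The hidden size, token counts, 128-row alignment, and per-expert capacity are illustrative assumptions, not values prescribed by DeepGEMM.

```python
# Conceptual illustration of the two MoE layouts (pure PyTorch).
# Hidden size, token counts, the 128-row alignment, and the per-expert
# capacity are assumptions for clarity, not values prescribed by DeepGEMM.
import torch

hidden, align = 2048, 128
tokens_per_expert = [300, 50, 200, 180]          # ragged token counts per expert
num_experts = len(tokens_per_expert)

# Contiguous layout: each expert's tokens form a segment, segments are
# concatenated along the M axis, and each segment is padded to the alignment.
segments = []
for count in tokens_per_expert:
    padded = -(-count // align) * align          # ceil(count / align) * align
    seg = torch.zeros(padded, hidden)
    seg[:count] = torch.randn(count, hidden)     # valid tokens; the rest is padding
    segments.append(seg)
contiguous = torch.cat(segments)                 # (sum of padded segment sizes, hidden)

# Masked layout: every expert gets a fixed-capacity slot, and a per-expert
# count tensor records how many rows are actually valid, so tensor shapes
# stay constant from step to step.
capacity = 384
masked = torch.zeros(num_experts, capacity, hidden)
masked_m = torch.tensor(tokens_per_expert)
for e, count in enumerate(tokens_per_expert):
    masked[e, :count] = torch.randn(count, hidden)

print(contiguous.shape, masked.shape, masked_m.tolist())
```

Roughly speaking, the variable-size contiguous form suits training and prefill, while the fixed-shape masked form suits decoding, where tensor shapes must stay stable (for example under CUDA graphs).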

Advanced Technical Innovations

DeepGEMM incorporates several cutting-edge techniques:

  • Fine-grained Scaling: Uses the per-block scaling scheme introduced in DeepSeek-V3 to make better use of FP8’s limited dynamic range (see the numeric sketch after this list)
  • Two-level Accumulation: Promotes intermediate tensor-core results to CUDA cores for higher-precision accumulation
  • Persistent Warp Specialization: Overlaps data movement, tensor-core MMA instructions, and CUDA-core promotion across specialized warps
  • Tensor Memory Accelerator (TMA): Fully utilizes Hopper architecture’s TMA capabilities for faster data access
  • Unified Block Scheduler with Rasterization: Improves L2 cache reuse rates
  • FFMA SASS Interleaving: Deep assembly-level optimizations for maximum hardware utilization
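
As a rough illustration of the first two points, the following pure-PyTorch sketch simulates the numerics of fine-grained FP8 scaling with FP32 accumulation. It mimics the math only; the block size of 128 and the scaling choices are assumptions, and the real kernels do this on tensor cores with CUDA-core promotion rather than in a Python loop.

```python
# Numeric simulation of fine-grained FP8 scaling with FP32 accumulation
# (block size of 128 is an assumption; the real kernels run on tensor cores
# and promote partial results on CUDA cores rather than looping in Python).
import torch

FP8_MAX = 448.0   # largest finite value representable in float8_e4m3fn
BLOCK = 128       # per-block scaling granularity used for this illustration

def quantize_per_block(x):
    """Quantize along the last dim in blocks of BLOCK columns, one scale per block."""
    m, k = x.shape
    xb = x.view(m, k // BLOCK, BLOCK)
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4) / FP8_MAX
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.view(m, k), scale.squeeze(-1)        # scales: (m, k // BLOCK)

def fp8_gemm_sim(a, b):
    """C = A @ B^T with per-block dequantization and FP32 accumulation."""
    qa, sa = quantize_per_block(a)
    qb, sb = quantize_per_block(b)
    m, k = a.shape
    out = torch.zeros(m, b.shape[0], dtype=torch.float32)
    for i in range(k // BLOCK):                   # accumulate block by block in FP32
        cols = slice(i * BLOCK, (i + 1) * BLOCK)
        da = qa[:, cols].to(torch.float32) * sa[:, i:i + 1]
        db = qb[:, cols].to(torch.float32) * sb[:, i:i + 1]
        out += da @ db.t()                        # higher-precision accumulation step
    return out

a, b = torch.randn(256, 512), torch.randn(128, 512)
print((fp8_gemm_sim(a, b) - a @ b.t()).abs().max())   # small error despite FP8 storage
```

Running the snippet shows that per-block scales keep the overall quantization error small even though each individual FP8 value carries very limited precision.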

Getting Started

DeepGEMM is now available under the MIT license. To get started, you’ll need:

  • NVIDIA Hopper architecture GPU (sm_90a)
  • Python 3.8+
  • CUDA 12.3+ (12.8+ recommended for best performance)
  • PyTorch 2.1+
  • CUTLASS 3.6+ (can be cloned via Git submodule)
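
Before installing, it can help to confirm that the environment matches the list above. A small, optional check (not part of DeepGEMM itself) could look like this:

```python
# Optional sanity check against the requirements above (not part of DeepGEMM).
import torch

print("PyTorch:", torch.__version__)               # expect 2.1+
print("CUDA (torch build):", torch.version.cuda)   # expect 12.3+, ideally 12.8+
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("GPU:", torch.cuda.get_device_name(), f"(sm_{major}{minor})")
    print("Hopper-class (sm_90)?", (major, minor) >= (9, 0))
else:
    print("No CUDA device visible")
```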

The library can be installed with just a few simple commands and integrated into Python projects with a single import statement.
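
For orientation only, an integration might look roughly like the sketch below. The function name and tensor layouts are based on the project’s README; treat the exact shapes and scale-layout requirements as assumptions and verify them against the repository before use.

```python
# Illustrative only: function name and layouts per the DeepGEMM README;
# exact scale-tensor layout/alignment requirements may differ, check the repo.
import torch
import deep_gemm

m, k, n = 128, 7168, 4096   # example shapes; k and n assumed divisible by 128

# FP8 operands, each paired with per-block FP32 scaling factors
x_fp8 = (torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn),
         torch.ones(m, k // 128, device="cuda", dtype=torch.float32))
y_fp8 = (torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn),
         torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32))
out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)

# D = X @ Y^T written into `out` as BF16 ("nt": X normal, Y transposed).
# NOTE: the library may require a specific memory layout for the scale tensors;
# see the repository's tests for the exact preparation steps.
deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, y_fp8, out)
```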

Benchmarks

Performance tests across matrix shapes commonly used in DeepSeek-V3/R1 training and inference show significant speedups over an expert-tuned CUTLASS-based baseline:

  • For normal GEMMs used in dense models: up to 2.7x speedup
  • For grouped GEMMs with contiguous layout in MoE models: consistent 1.1-1.2x speedup
  • For grouped GEMMs with masked layout in MoE models: roughly 1.2x speedup across various configurations
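
For context on how figures like these are computed: a GEMM of shape (m, n, k) performs roughly 2 × m × n × k floating-point operations, so TFLOPS is that count divided by the measured kernel time. Below is a generic CUDA-event timing sketch with illustrative shapes, using a BF16 torch.matmul as a stand-in for whichever kernel is being measured.

```python
# Generic CUDA-event timing harness; shapes are illustrative and a BF16
# torch.matmul stands in for whichever kernel is being measured.
import torch

def tflops(m, n, k, seconds):
    return 2 * m * n * k / seconds / 1e12

def time_cuda(fn, iters=100):
    """Average per-call time of a CUDA operation using CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                                           # warm-up (one-time setup)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1e3 / iters   # milliseconds -> seconds per call

m, n, k = 4096, 7168, 2048
a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, k, device="cuda", dtype=torch.bfloat16)
elapsed = time_cuda(lambda: a @ b.t())
print(f"{tflops(m, n, k, elapsed):.1f} TFLOPS")
```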

DeepGEMM’s GitHub repository is now available at: https://github.com/deepseek-ai/DeepGEMM

This library draws inspiration from the CUTLASS project while maintaining a focus on simplicity and usability, making advanced GPU optimization accessible to a broader audience of developers and researchers.