DeepSeek's open-source week
DeepSeek released several libraries this week. Many of the techniques were already in earlier papers, so the main shift is that the implementation is now available.
Table of Contents
- Day 1: FlashMLA - Multi-Head Latent Attention decoding CUDA kernels
- Day 2: DeepEP - CUDA kernels for Expert Parallelism inside MoE models
- Day 3: DeepGEMM - FP8 Matrix Multiplication (GEMM) library
- Day 4: DualPipe and EPLB - Pipeline Parallelism and Load Balancer Strategies
- Day 5: 3FS and Smallpond - Parallel file system for high-throughput data access
- Day 6: DeepSeek-V3/R1 - How to Operate at 545% Profit Margin
Day 1: FlashMLA - Efficient Multi-Head Latent Attention
Before getting into FlashMLA, a quick refresher on why KV cache optimization matters for LLM inference.
When generating autoregressively, naïve decoding recomputes self-attention over the full prefix at each step: step $t$ costs $O(t^2)$ attention work, totaling $O(T^3)$ for $T$ tokens. KV caching stores keys/values from prior steps so the new token only computes its own $k_t, v_t$ and attends once to cached keys: step $t$ becomes $O(t)$, totaling $O(T^2)$. The trade-off is $O(T \cdot n_{\text{layers}} \cdot n_{\text{heads}} \cdot d_{\text{head}})$ memory to store caches across layers and heads (for both $K$ and $V$).
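These costs can be seen in a toy single-head decoder (a minimal numpy sketch; the dimensions and random weights are illustrative only):

```python
import numpy as np

def attend(q, K, V):
    # q: (d,), K/V: (t, d) -> standard scaled dot-product attention
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, T = 4, 6
x = rng.standard_normal((T, d))            # token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Naive decoding: re-project K/V for the whole prefix at every step
naive = []
for t in range(1, T + 1):
    prefix = x[:t]
    K, V = prefix @ Wk, prefix @ Wv        # recomputed from scratch each step
    naive.append(attend(x[t - 1] @ Wq, K, V))

# KV-cached decoding: project only the newest token, append to the cache
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached = []
for t in range(T):
    K_cache = np.vstack([K_cache, (x[t] @ Wk)[None]])
    V_cache = np.vstack([V_cache, (x[t] @ Wv)[None]])
    cached.append(attend(x[t] @ Wq, K_cache, V_cache))

assert np.allclose(naive, cached)          # identical outputs, far less recomputation
```

The outputs match exactly; the only thing the cache changes is how much work each step repeats.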
To reduce the memory burden of KV caches, several approaches have been proposed. Unlike previous approaches like Multi-Query Attention (MQA) [1] or Group-Query Attention (GQA) [1] that reduce memory usage through head sharing, MLA [1] employs a clever low-rank latent compression technique that preserves the benefits of multiple attention heads.

FlashMLA implements Multi-Head Latent Attention (MLA) for Hopper GPUs, targeting variable-length sequences. Its CUDA kernel reaches up to ~3 TB/s memory-bound throughput. On H100 (HBM3 ~3.3 TB/s), that's close to the hardware limit. For comparison, FlashAttention 2 and 3 hit between 35 and 75% bandwidth utilization [1].
One interesting implementation detail is double buffering for shared key/value tensors.
FLASH_ASSERT(params.k_ptr == params.v_ptr); // Shared_KV
// ...
if (n_block % 2 == 1) {
    // Double buffer for sK
    constexpr int sK_offset = size(sK);
    tSrK.data() = tSrK.data() + sK_offset / 8;
    tOrVt.data() = tOrVt.data() + sK_offset / 8;
}
- `FLASH_ASSERT(params.k_ptr == params.v_ptr)` shows that the Key and Value tensors share the same memory space, halving the memory requirement.
- `if (n_block % 2 == 1)` implements ping-pong buffering by toggling on block-number parity: while one buffer is used for computation, the other can load data.
- The division by 8 in both pointer adjustments (`tSrK.data() + sK_offset / 8` and `tOrVt.data() + sK_offset / 8`) directly implements MLA's latent compression ratio, allowing the code to work with the compressed representations while maintaining proper memory alignment.
- Adjusting both Key and Value tensor pointers by the same offset keeps each key matched to its corresponding value even after switching buffers.
FlashMLA supports BF16 and FP16 precision and uses a paged KV cache with a block size of 64, meaning each memory block stores key-value data for 64 tokens. The paged approach divides the KV cache into fixed-size blocks that can be dynamically allocated and deallocated to handle variable-length sequences [1]. In their Hopper-targeted kernels, 64 tokens is used as the block size; this is a design choice and can vary across implementations.
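The bookkeeping behind a paged KV cache can be sketched as a toy block allocator (hypothetical Python; the class and method names are made up, only the block size of 64 matches FlashMLA):

```python
BLOCK_SIZE = 64  # tokens per block, as in FlashMLA's paged KV cache

class PagedKVCache:
    """Map each sequence to a list of fixed-size blocks, allocated on demand."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}   # seq_id -> list of physical block ids
        self.seq_len = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block_id, offset)."""
        blocks = self.block_table.setdefault(seq_id, [])
        n = self.seq_len.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full (or none yet)
            blocks.append(self.free_blocks.pop())
        self.seq_len[seq_id] = n + 1
        return blocks[-1], n % BLOCK_SIZE

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks += self.block_table.pop(seq_id, [])
        self.seq_len.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(130):                          # 130 tokens -> ceil(130/64) = 3 blocks
    cache.append_token(seq_id=0)
assert len(cache.block_table[0]) == 3
cache.free(0)
assert len(cache.free_blocks) == 8
```

Because blocks are fixed-size and recycled, variable-length sequences never fragment the cache the way per-sequence contiguous allocations would.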
Quiz on FlashMLA
What is the primary advantage of MLA over MQA/GQA approaches?
Day 2: DeepEP - Expert Parallelism Communication Library
In MoE architectures [1, 1, 1], each input token is routed to multiple experts based on its characteristics. For example, DeepSeek-V3 [1] has 671 billion parameters distributed across 257 experts (1 shared expert and 256 routed experts). Each token activates 8 different experts, but to manage communication costs, each token is limited to accessing experts on at most 4 nodes.

DeepEP is a communication library that optimizes communication patterns in MoE models, specifically the "dispatch" and "combine" operations, which route data between GPUs.
- Dispatch: send token embeddings from their original GPU to the GPUs that host the selected experts.
- Combine: After the experts process the tokens, gather the results back to their original GPUs.
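Stripped of GPU communication, dispatch and combine are a scatter and a gather keyed by expert id. A single-process sketch (pure Python; the token/expert assignments are invented for illustration):

```python
# 4 "GPUs", 8 experts (2 per GPU); each token picks its top-2 experts
tokens = {0: "t0", 1: "t1", 2: "t2", 3: "t3"}        # token id -> embedding (stub)
topk = {0: [1, 6], 1: [0, 3], 2: [6, 7], 3: [2, 5]}  # token id -> chosen experts
expert_gpu = lambda e: e // 2                         # experts 0-1 on GPU 0, etc.

# Dispatch: route each token's embedding to the GPUs hosting its experts
inbox = {gpu: [] for gpu in range(4)}
for tok, experts in topk.items():
    for e in experts:
        inbox[expert_gpu(e)].append((tok, e, tokens[tok]))

# Experts "process" their tokens (stub: tag the embedding with the expert id)
outputs = [(tok, f"{emb}@e{e}") for gpu in inbox for tok, e, emb in inbox[gpu]]

# Combine: gather per-token expert outputs back to the token's original GPU
combined = {}
for tok, out in outputs:
    combined.setdefault(tok, []).append(out)

assert sorted(combined[0]) == ["t0@e1", "t0@e6"]      # both expert results returned
assert all(len(v) == 2 for v in combined.values())    # top-2 routing -> 2 results each
```

DeepEP's job is doing this scatter/gather at line rate over NVLink and RDMA rather than in a Python dict.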
These communication steps can become the bottleneck. With $N$ GPUs, naive all-to-all routing tends toward $O(N^2)$ traffic. Without overlap between communication and compute, GPUs wait.
To fix this, DeepEP keeps three simple tactics:
- Route on the fastest links. Inside a node it uses NVLink; across nodes it uses RDMA. Each token touches at most four nodes to tame cross‑node traffic.
- Pick the right kernel for the job. Training/prefill kernels mix NVLink and RDMA, saturate NVLink, and reach ~90% of 400 Gbps RDMA (≈43–47 GB/s). Inference uses RDMA‑only low‑latency kernels. As experts scale from 8 to 256, dispatch latency rises ~19%; combine stays ~318–369 μs with 80–92% RDMA use.
- Spend most SMs on math. About 20 of 132 SMs handle communication; the rest stay free for compute.
The library provides a PyTorch-hook-based interface for overlapping communication and computation: the transfer starts immediately in the background, but the received tensor is only guaranteed valid after the hook is explicitly called, so compute can proceed in the meantime.
def low_latency_dispatch(hidden_states: torch.Tensor, topk_idx: torch.Tensor,
                         num_max_dispatch_tokens_per_rank: int, num_experts: int):
    global _buffer

    # Do MoE dispatch, compatible with CUDA graph (you can restore some buffer status once you replay)
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
                                     async_finish=False, return_recv_hook=True)

    # NOTES: the actual tensor is received only after calling `hook()`,
    # it is useful for double-batch overlapping, but without any SM occupation
    # If overlap is not desired, set `return_recv_hook=False`
    # Later, you can use our GEMM library to do the computation with this specific format
    return recv_hidden_states, recv_expert_count, handle, event, hook


def low_latency_combine(hidden_states: torch.Tensor,
                        topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: Tuple):
    global _buffer

    # Do MoE combine, compatible with CUDA graph (you can restore some buffer status once you replay)
    combined_hidden_states, event_overlap, hook = \
        _buffer.low_latency_combine(hidden_states, topk_idx, topk_weights, handle,
                                    async_finish=False, return_recv_hook=True)

    # NOTES: the same behavior as described in the dispatch kernel
    return combined_hidden_states, event_overlap, hook
The implementation also uses traffic isolation and adaptive routing. Traffic isolation separates workloads (normal kernels, low-latency kernels, and other workloads) across different virtual lanes to prevent interference. Adaptive routing distributes traffic across multiple network paths to avoid congestion.
Quiz on DeepEP
What are the two primary operations in MoE that DeepEP optimizes?
Day 3: DeepGEMM - FP8 Matrix Multiplication
Next up: matrix multiplication. General Matrix Multiplication (GEMM) operations are the backbone of LLMs (or any NN, really).
DeepGEMM is a specialized library for FP8 matrix multiplication on Hopper GPUs. At only ~300 lines of CUDA code, it outperforms NVIDIA's CUTLASS in several scenarios:
- Up to 2.7× speedup for small or irregular matrices (particularly important for inference)
- 1.0×-1.7× speedup for large, compute-bound matrices
- 1.1×-1.2× speedup for MoE grouped GEMMs
The key enablers are:
Fine-grained FP8 quantization
Used in the DeepSeek-V3 paper [1] and conceptually inspired by Rouhani et al. [1], DeepGEMM uses a fine-grained quantization strategy to deal with FP8's limited dynamic range, which often causes overflow and underflow. Instead of a single global scaling factor, it applies scaling at a finer granularity, which improves both quantization accuracy and numerical stability.

DeepGEMM quantizes activations and weights using separate scaling factors at different granularities. Activations are quantized per token across groups of 128 channels (1 × 128 tiles), while weights are quantized per 128 input channels by 128 output channels (128 × 128 blocks).
Formally, given an input activation matrix $X \in \mathbb{R}^{M \times K}$ and a weight matrix $W \in \mathbb{R}^{K \times N}$, the quantization is performed as follows:

$$\tilde{X}_{ij} = \mathrm{FP8}\!\left(\frac{X_{ij}}{s^{x}_{i,\,\lceil j/128 \rceil}}\right), \qquad \tilde{W}_{ij} = \mathrm{FP8}\!\left(\frac{W_{ij}}{s^{w}_{\lceil i/128 \rceil,\,\lceil j/128 \rceil}}\right)$$

Here, $s^{x}$ and $s^{w}$ are scaling factors stored in BF16 precision, corresponding to each $1 \times 128$ activation tile and $128 \times 128$ weight block, respectively.

The resulting FP8 GEMM operation is computed as:

$$Y_{ij} \approx \sum_{k} \tilde{X}_{ik}\, \tilde{W}_{kj}\; s^{x}_{i,\,\lceil k/128 \rceil}\, s^{w}_{\lceil k/128 \rceil,\,\lceil j/128 \rceil}$$
This improves numerical stability over naive FP8 quantization in two ways: smaller scaling groups accommodate activation outliers better, and intermediate FP8 results are periodically promoted to FP32 on CUDA cores to avoid accumulation errors.
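The scaling bookkeeping can be simulated with numpy (a toy sketch of the 1×128 / 128×128 granularity; the actual rounding to FP8 e4m3 is omitted since numpy has no FP8 dtype, so the check below verifies only the scale accounting):

```python
import numpy as np

FP8_MAX = 448.0   # max normal value of FP8 e4m3
TILE = 128

def quant_activations(X):
    """One scale per 1x128 activation tile (per token, per 128-channel group)."""
    Xg = X.reshape(X.shape[0], -1, TILE)                   # (M, K/128, 128)
    s = np.abs(Xg).max(axis=2, keepdims=True) / FP8_MAX    # (M, K/128, 1)
    return (Xg / s).reshape(X.shape), s.squeeze(2)

def quant_weights(W):
    """One scale per 128x128 weight block."""
    K, N = W.shape
    Wb = W.reshape(K // TILE, TILE, N // TILE, TILE)
    s = np.abs(Wb).max(axis=(1, 3), keepdims=True) / FP8_MAX
    return (Wb / s).reshape(K, N), s.squeeze((1, 3))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 256))
W = rng.standard_normal((256, 256))

Xq, sx = quant_activations(X)   # sx: (4, 2)  -> one scale per 1x128 tile
Wq, sw = quant_weights(W)       # sw: (2, 2)  -> one scale per 128x128 block

# Dequantized GEMM: accumulate one 128-wide K-group at a time with its scales
Y = np.zeros((4, 256))
for g in range(256 // TILE):
    k = slice(g * TILE, (g + 1) * TILE)
    for nb in range(256 // TILE):
        n = slice(nb * TILE, (nb + 1) * TILE)
        Y[:, n] += (Xq[:, k] @ Wq[k, n]) * sx[:, g:g + 1] * sw[g, nb]

assert np.allclose(Y, X @ W)    # exact here because the FP8 rounding step is skipped
```

With real FP8 rounding the reconstruction is approximate; the per-tile scales are what keep the error small in the presence of activation outliers.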
Mixture-of-Experts Support
DeepGEMM has two grouped GEMM implementations for MoE models, each targeting a different stage. The first is a contiguous layout for training and prefilling, where tokens assigned to each expert are concatenated. The second is a masked layout for inference (especially with CUDA graphs), which processes only valid tokens per expert.
In the DeepSeek-V3 paper [1], this showed a 1.1-1.2× speedup over tuned CUTLASS implementations.
Unaligned Block Sizes
Most GEMM libraries use power-of-2 block sizes (128, 256). DeepGEMM also supports unaligned block sizes like 112, which makes a real difference for GPU utilization on irregular matrix shapes common during inference. For example, with M=256, N=7168, this lets you use 128 SMs instead of just 112.
The number of thread blocks, and hence SM utilization on a 132-SM H100, can be calculated as:

$$\text{num\_blocks} = \left\lceil \frac{M}{\text{BLOCK\_M}} \right\rceil \times \left\lceil \frac{N}{\text{BLOCK\_N}} \right\rceil$$

With traditional aligned block sizes (BLOCK_M=128, BLOCK_N=128): $\lceil 256/128 \rceil \times \lceil 7168/128 \rceil = 2 \times 56 = 112$ blocks, so only 112 of 132 SMs are busy ($\approx 84.8\%$).

With unaligned block sizes (BLOCK_M=128, BLOCK_N=112): $2 \times \lceil 7168/112 \rceil = 2 \times 64 = 128$ blocks, using 128 of 132 SMs ($\approx 97.0\%$).

This 12.2-percentage-point increase in SM utilization directly translates to performance gains.
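The arithmetic is easy to verify (132 SMs on H100; a single-wave launch is assumed):

```python
import math

def sm_utilization(M, N, block_m, block_n, num_sms=132):
    """Blocks launched and the fraction of SMs they occupy (single wave)."""
    blocks = math.ceil(M / block_m) * math.ceil(N / block_n)
    return blocks, min(blocks, num_sms) / num_sms

aligned_blocks, aligned_util = sm_utilization(256, 7168, 128, 128)
unaligned_blocks, unaligned_util = sm_utilization(256, 7168, 128, 112)

assert (aligned_blocks, unaligned_blocks) == (112, 128)
print(f"aligned:   {aligned_blocks} blocks -> {aligned_util:.1%} of SMs")
print(f"unaligned: {unaligned_blocks} blocks -> {unaligned_util:.1%} of SMs")
```

Note that 7168 happens to be an exact multiple of 112 (64 × 112), which is exactly the kind of shape-specific fit DeepGEMM's JIT compiler searches for.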
Just-In-Time Compilation
DeepGEMM generates kernels at runtime, specialized to the exact dimensions of each matrix multiplication. Traditional GEMM libraries pre-compile generic kernels for arbitrary matrix shapes, which means overhead from branches, loops, and extra logic. DeepGEMM instead treats matrix dimensions (M, N, K) as compile-time constants, so the compiler can fully unroll loops, eliminate boundary checks, and optimize register allocation for that specific shape.
DeepGEMM also picks optimal parameters per shape automatically: block dimensions (including unaligned sizes like 112), pipeline stage count based on arithmetic intensity, and Tensor Memory Accelerator (TMA) cluster size. This matters most for smaller matrices used during inference, where generic kernels waste cycles on unnecessary logic.
Instruction-Level Optimization
At the lowest level, DeepGEMM optimizes CUDA instruction scheduling by modifying the yield and reuse bits of FFMA (Fused Multiply-Add) instructions directly in the compiled binary, improving warp-level parallelism through instruction interleaving.
Each CUDA instruction includes control bits that influence scheduling: the yield bit determines whether the SM can yield the current warp after executing the instruction, and the reuse bit controls whether registers can be immediately reused. By adjusting these bits based on patterns found through binary analysis, DeepGEMM controls the execution schedule so that memory operations (loading matrix elements) overlap with tensor core compute, which overlaps with CUDA core accumulation. This works because the GPU has separate units for these operations that can run concurrently, but only when explicitly told to do so.
Quiz on DeepGEMM
In DeepGEMM's fine-grained scaling approach for FP8 matrix multiplication, how are scaling factors applied?
Day 4: Optimized Parallelism Strategies (DualPipe and EPLB)
With the compute primitives covered, Day 4 turns to orchestrating them across GPUs: DualPipe for pipeline parallelism, and EPLB for expert load balancing.
DualPipe: Bidirectional Pipeline Parallelism
DualPipe is a bidirectional pipeline parallelism algorithm that fully overlaps forward and backward computation-communication phases, cutting pipeline bubbles.
In traditional pipeline parallelism, you split a large NN across multiple GPUs, with each GPU handling a different segment of the model. The problem is that GPUs often end up waiting around, either for data from earlier stages during the forward pass or for gradients from later stages during backpropagation. These idle periods are what we call pipeline bubbles.
Example: Consider a model split across 2 GPUs, where GPU 2 can't start processing the forward pass of batch A until GPU 1 finishes its part. During backprop, the same issue happens in reverse—GPUs wait for gradients from downstream GPUs. These waiting periods create idle gaps, or "bubbles", in the pipeline. The more pipeline stages you have (like in very large models), the more bubbles you get, making scaling inefficient.
DualPipe addresses this by running micro-batches in both directions simultaneously. Forward passes of new batches overlap with backward passes of previous batches, cutting idle time by overlapping communication with computation.
The figure shows DualPipe scheduling with 8 pipeline parallel ranks and 20 micro-batches running in two directions. Cells sharing a black border indicate overlapping computation and communication.
To overlap communication and computation, DualPipe breaks data transfers into smaller chunks ("micro-batch streaming"), so computation can start on early chunks while later ones are still transferring. By using multiple CUDA streams, communication and computation run asynchronously on separate GPU threads. Further, the backward pass is split into two distinct parts: the input-gradient computation produces gradients to pass upstream, while the weight-gradient computation produces gradients for updating the current layer's parameters.
By separating these, input gradients can be sent upstream immediately, speeding up the backward pass in earlier pipeline stages.
Traditional pipelines require $F + B$ time per micro-batch (forward and backward sequentially), while DualPipe approaches $F\&B$, the time of one forward and one backward chunk executed in mutual overlap. The efficiency difference is substantial:
- Traditional 1F1B: Bubble time = $(PP - 1)(F + B)$
- With $PP = 8$, $F = B = 10$: $7 \times 20 = 140$ units of time wasted
- DualPipe: Bubble time = $\left(\frac{PP}{2} - 1\right)(F\&B + B - 3W)$
- With $PP = 8$, $F\&B = 12$, $B = 10$, $W = 3$: $3 \times 13 = 39$ units of time wasted
Where:
- $PP$ denotes the number of pipeline parallel stages
- $F$ denotes the execution time of a forward chunk
- $B$ denotes the execution time of a full backward chunk
- $W$ denotes the execution time of a "backward for weights" chunk
- $F\&B$ denotes the execution time of two mutually overlapped forward and backward chunks
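Plugging in the quiz's numbers (a quick check in Python, using the bubble formulas from the DeepSeek-V3 paper):

```python
def bubble_1f1b(pp, F, B):
    """1F1B: every rank waits (pp - 1) forward+backward slots."""
    return (pp - 1) * (F + B)

def bubble_dualpipe(pp, B, W, FandB):
    """DualPipe: half the ranks' worth of bubbles, shrunk by overlap."""
    return (pp // 2 - 1) * (FandB + B - 3 * W)

PP, F, B, W, FandB = 8, 10, 10, 3, 12
b1 = bubble_1f1b(PP, F, B)             # (8-1) * (10+10) = 140
b2 = bubble_dualpipe(PP, B, W, FandB)  # (4-1) * (12+10-9) = 39
assert (b1, b2) == (140, 39)
print(f"1F1B bubble: {b1}, DualPipe bubble: {b2} ({1 - b2 / b1:.0%} reduction)")
```

For these numbers the bubble shrinks from 140 to 39 units, well over the 50% reduction claimed below.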
DualPipe cuts pipeline bubbles by over 50% in many scenarios. The trade-off is memory. While traditional 1F1B pipelines store activations for approximately micro-batches total, DualPipe needs to maintain activations for concurrent forward and backward passes.
The per-device activation storage increases from $PP$ to approximately $PP + 1$ micro-batches. This trade-off works when training speed is the primary bottleneck and the memory increase remains manageable with distributed training.
DualPipe also has a variant called DualPipeV, a V-shape schedule derived through a "cut-in-half" procedure originally introduced by Sea AI Lab. It halves device requirements: $PP$ pipeline stages run on $PP/2$ devices, with each device hosting two stages.
Example DualPipeV scheduling for 4 PP ranks (8 PP stages) and 10 micro-batches.
EPLB: Expert Parallelism Load Balancer
When using expert parallelism, one common challenge is evenly distributing the computational load across GPUs. The Expert Parallelism Load Balancer (EPLB) tackles this by duplicating heavily used experts.
EPLB provides two distinct strategies depending on the specific scenario:
- Hierarchical load balancing: when server nodes evenly divide expert groups. First spread groups across nodes. Inside each node, duplicate hot experts onto separate GPUs. This mirrors group-limited expert routing from DeepSeek-V3 [1], which keeps tokens within a group to localize work. Keep a group's experts on the same node to cut cross-node traffic. Best during prefilling, when expert-parallel is smaller.
- Global load balancing: when the hierarchy isn't feasible. Duplicate experts regardless of group and spread replicas across all GPUs. Suits decoding, where expert-parallel is larger.
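The core idea of the global strategy (replicate the hottest experts, then pack replicas evenly) can be sketched as a greedy heuristic (hypothetical Python; EPLB's actual algorithm differs in its details):

```python
import heapq

def replicate_and_pack(loads, num_extra_replicas, num_gpus):
    """Greedily replicate hot experts, then pack replicas onto GPUs."""
    # 1) Each extra replica goes to the expert with the highest per-replica load
    replicas = {e: 1 for e in range(len(loads))}
    for _ in range(num_extra_replicas):
        hottest = max(replicas, key=lambda e: loads[e] / replicas[e])
        replicas[hottest] += 1
    # 2) Pack replicas (heaviest first) onto the currently lightest GPU
    heap = [(0.0, g, []) for g in range(num_gpus)]
    heapq.heapify(heap)
    per_replica = sorted(((loads[e] / r, e) for e, r in replicas.items()
                          for _ in range(r)), reverse=True)
    for load, e in per_replica:
        g_load, g, assigned = heapq.heappop(heap)
        heapq.heappush(heap, (g_load + load, g, assigned + [e]))
    return replicas, sorted(heap)

loads = [90, 10, 10, 10, 10, 10, 10, 10]         # expert 0 is "hot"
replicas, gpus = replicate_and_pack(loads, num_extra_replicas=4, num_gpus=4)
assert replicas[0] == 5                           # all extra replicas go to the hot expert
gpu_loads = [l for l, _, _ in gpus]
assert max(gpu_loads) - min(gpu_loads) <= 10      # per-GPU load is roughly even
```

The hierarchical strategy adds one constraint on top of this: replicas of a group's experts must stay within one node.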
Here's an illustration showing how EPLB assigns experts across nodes and GPUs:
A two-layer MoE model with 12 experts per layer. Adding 4 redundant experts per layer gives 16 expert replicas total, distributed across 2 nodes with 4 GPUs each.
Quiz on DualPipe and EPLB
According to the pipeline bubble formula, how does DualPipe's bubble time compare to 1F1B when PP=8, F=B=10, W=3, and F&B=12?
Day 5: 3FS and Smallpond - Data Infra
Fire-Flyer File System (3FS)
Traditional file systems like HDFS or Lustre are not well-suited for pretraining workflows because they rely on data locality, requiring compute nodes to be physically close to their data.
3FS addresses this via a fully disaggregated architecture. It aggregates the bandwidth and storage capacity of thousands of high-speed SSDs across hundreds of storage nodes, allowing compute nodes to access data quickly without worrying about exact data placement.
Read throughput over time on a 180-node 3FS cluster, peaking at ~7 TiB/s.
3FS also has a strong consistency model, powered by Chain Replication with Apportioned Queries (CRAQ). Unlike traditional chain replication, which bottlenecks reads at the tail node, CRAQ lets any node in the chain serve reads directly. Nodes maintain both fully committed ("clean") and in-progress ("dirty") data versions. Clean data can be served instantly, while dirty data triggers a quick version check with the tail node. This gives you both high throughput and strong consistency, which matters for AI workloads where data inconsistencies can cause subtle bugs or degrade model quality.
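The clean/dirty read path can be sketched in a few lines (toy Python; real CRAQ also handles chain propagation, version numbers, and failure recovery):

```python
class CraqNode:
    """A chain node: serves clean reads locally, checks the tail for dirty keys."""
    def __init__(self, tail=None):
        self.clean = {}    # key -> committed value
        self.dirty = {}    # key -> in-flight value (write not yet acked by tail)
        self.tail = tail if tail is not None else self

    def write(self, key, value):
        # Simplified: the value is propagating down the chain; until the tail's
        # ack comes back (commit), this node only marks it dirty.
        self.dirty[key] = value

    def commit(self, key):
        self.clean[key] = self.dirty.pop(key)
        self.tail.clean[key] = self.clean[key]

    def read(self, key):
        if key in self.dirty:              # dirty: one round-trip to the tail
            return self.tail.clean.get(key)
        return self.clean.get(key)         # clean: served locally, no round-trip

tail = CraqNode()
mid = CraqNode(tail=tail)

mid.clean["a"] = 1; tail.clean["a"] = 1
mid.write("a", 2)
assert mid.read("a") == 1   # write in flight: tail's committed version wins
mid.commit("a")
assert mid.read("a") == 2   # committed everywhere: served locally
```

The payoff is that in read-heavy AI workloads most keys are clean most of the time, so nearly all reads avoid the tail entirely.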
Some performance numbers:
- ~6.6 TiB/s total read throughput on a 180-node cluster
- Up to 40 GiB/s per client node for KVCache lookups
KVCache read throughput: peak reads reach up to 40 GiB/s per client node (dotted line), with average reads shown as the solid line, over a 30-minute period.
3FS is particularly useful for:
- Dataloaders: Efficient random access to training samples across multiple nodes, simplifying data loading logic.
- Checkpointing: Fast, parallel checkpointing essential for training large models.
- KV Caching for Inference: A cheaper alternative to DRAM-based caching with high throughput and more storage. Compute nodes fetch cached key-value pairs on-the-fly from 3FS with minimal latency penalty, so you don't need to keep everything in RAM.
Consider training a language model on a 10 TB text dataset. Without 3FS, the dataset might need to be sharded across local disks on each training machine, and each GPU would rely on local storage to avoid network slowdown. With 3FS, all 10 TB can be placed in the file system, and every node can read from it as needed:
- On each training machine (say you have 16 of them), you mount 3FS at `/data`.
- In your training code, you load files from `/data/my_corpus/shardX.txt` (where X can be part of the dataset). Each DataLoader worker opens a file via 3FS and reads a chunk of data; 3FS internally fetches that chunk from the appropriate storage server (possibly several in parallel if the file's chunks span servers).
- If each server can do, say, 50 GB/s, and you have 10 servers, an aggregate of 500 GB/s is available. Each of the 16 nodes could in theory pull >30 GB/s if needed; in practice each node might use 5-10 GB/s to saturate its GPUs, which 3FS provides easily.
- Training proceeds without ever being starved for data. GPUs stay busy with compute rather than waiting on CPU/disk loading.
At checkpoint time, the trainer on each node saves its model partition to a file in /data/checkpoints/epoch10_rank7.ckpt. All nodes do this simultaneously. 3FS directs each write to perhaps different storage targets (to spread out load) and writes happen in parallel. The result is that you might checkpoint in, say, 30 seconds what used to take 5 minutes on an NFS.
Smallpond
Smallpond is a distributed data processing framework built on DuckDB, an open-source in-process analytical database optimized for complex queries.
For teams coming from Apache Spark or Hadoop, the appeal is simplicity. No long-running services, no complex dependencies. It provides a high-level API built on Ray:
import smallpond
sp = smallpond.init()
df = sp.read_parquet("path/to/dataset/*.parquet")
df = df.repartition(10)
df = df.map("x + 1")
df.write_parquet("path/to/output")
Day 6: DeepSeek-V3/R1 - Large-scale Cross-node Expert Parallelism and How to Operate at 545% Profit Margin
On the final day, DeepSeek published details about their inference system for DeepSeek-V3/R1. This is an unusually detailed look at a production inference system.
Architecture diagram of DeepSeek's production system.
- Only 8 of 256 experts fire per layer, so parallelism matters.
- Prefill: Routed Expert EP32 + MLA/Shared Expert DP32 across 4 nodes; each GPU hosts 9 routed and 1 shared expert.
- Decode: Routed Expert EP144 + MLA/Shared Expert DP144 across 18 nodes; each GPU hosts 2 routed experts.
- FP8 for feed-forward matrix multiplications and dispatch; BF16 for MLA and combine.
- Dual-batch overlap hides cross-node EP latency: split each batch into two microbatches and run them alternately so compute of one hides the other's communication.
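The benefit of alternating two microbatches can be modeled with toy timing arithmetic (a deliberate simplification; real gains depend on how well the kernels actually overlap):

```python
def total_time(n_microbatches, compute, comm, overlap):
    """Toy cost model: each microbatch needs `compute` units of math and
    `comm` units of cross-node EP communication."""
    if not overlap:
        # Serial: every microbatch pays compute + comm back to back
        return n_microbatches * (compute + comm)
    # Dual-batch overlap: while one microbatch computes, the other communicates,
    # so the steady-state cost per microbatch is max(compute, comm)
    return compute + comm + (n_microbatches - 1) * max(compute, comm)

assert total_time(8, compute=10, comm=8, overlap=False) == 144
assert total_time(8, compute=10, comm=8, overlap=True) == 88
```

When compute and communication times are close, overlap hides nearly half the wall-clock cost, which is exactly the regime cross-node EP sits in.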

In the decoding phase, stage durations are inherently unbalanced. DeepSeek handles this by splitting the attention layer into two steps and fitting them into a five-stage pipeline to, again, overlap computation and communication.

DeepSeek uses three load balancers: the Prefill Load Balancer equalizes token counts and compute, the Decode Load Balancer handles KVCache usage disparities, and the Expert-Parallel Load Balancer duplicates hot experts across GPUs.
DeepSeek's inference system scales resources dynamically based on daily load patterns: more GPUs during peak hours, fewer during off-peak. In a 24-hour statistical period, the combined peak node occupancy for V3 and R1 inference services reached 278 nodes, with an average occupancy of 226.75 nodes (each containing 8 H800 GPUs). Assuming the leasing cost of one H800 GPU is $2 per hour, the total daily cost amounts to $87,072.

Within a typical 24-hour period:
- Total input tokens: 608B, of which 342B tokens (56.3%) hit the on-disk KV cache
- Total output tokens: 168B
- Average output speed: 20-22 tokens per second
- Average KVCache length per output token: 4,989 tokens
- Average prefilling throughput per H800 node: ~73.7k tokens/s (including cache hits)
- Average decoding throughput per H800 node: ~14.8k tokens/s
The punchline is that this lets DeepSeek achieve a theoretical cost-profit margin of 545% while being roughly 96% cheaper than OpenAI's o1:
- For input tokens with cache hit, R1 costs $0.14 per million
- For input tokens with cache miss, R1 costs $0.55 per million, while o1 costs $15.00 per million
- For output tokens, R1 costs $2.19 per million, while o1 costs $60.00 per million
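The 545% figure can be roughly reproduced from the numbers above (assuming, as DeepSeek's calculation does, that every token is billed at R1 API prices):

```python
# Daily cost: 226.75 nodes x 8 H800s x $2/GPU-hour x 24h
cost = 226.75 * 8 * 2 * 24                  # = $87,072

# Theoretical daily revenue at R1 pricing (prices are per million tokens)
input_hit = 342e9 * 0.14 / 1e6              # cached input tokens
input_miss = (608e9 - 342e9) * 0.55 / 1e6   # uncached input tokens
output = 168e9 * 2.19 / 1e6                 # output tokens
revenue = input_hit + input_miss + output   # ~= $562k

margin = revenue / cost - 1                 # close to the quoted 545%
assert abs(cost - 87072) < 1
print(f"revenue ~= ${revenue:,.0f}/day, margin ~= {margin:.0%}")
```

The small discrepancy from exactly 545% comes from rounding in the published token counts.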
However, DeepSeek notes that their actual revenue is substantially lower than these theoretical calculations because:
- DeepSeek-V3's pricing is significantly lower than R1
- Only a subset of services are monetized (web and APP access remain free)
- Automatic nighttime discounts are applied during off-peak hours