GitHub Monthly Report: 2025-12-01 to 2025-12-31
📅 Engineering Report (2025-12-01 - 2025-12-31)
🚀 Executive Summary
December 2025 was defined by the industry-wide scramble to support and optimize DeepSeek-V3 and V3.2 architectures. Almost every major training and serving framework (Primus, vLLM, SGLang, Megatron-LM) released updates specifically targeting these models, particularly focusing on MoE routing, SparseMLA, and FP8/FP4 quantization recipes.
Simultaneously, HuggingFace Transformers released the Release Candidate for v5.0, signaling massive breaking changes to tokenizers and weight loading APIs that will impact the entire downstream ecosystem. In the hardware landscape, NVIDIA began solidifying software support for Blackwell (GB200) across PyTorch AO and vLLM, while AMD aggressively countered with optimized configs for MI300X/MI355X in Primus and AITER backend integrations in serving engines.
AMD Related Updates
- DeepSeek on ROCm: AMD’s Primus framework released v0.6.0, explicitly adding configs for DeepSeek-V3 (16B & 671B) on MI300X and MI355X.
- Serving Optimization: Both vLLM and SGLang merged significant ROCm-specific optimizations, including support for DeepSeek V3.2 SparseMLA, FP8 MLA decoding, and the integration of the AITER attention backend.
- Tooling: IRLens, an intermediate representation analysis tool, was open-sourced within the Primus repository.
- Infrastructure: ROCm/MAD introduced multi-node support, crucial for scaling large training jobs on AMD clusters.
Competitive Analysis
- NVIDIA Blackwell Support: vLLM v0.13.0 and PyTorch/AO v0.15.0 introduced specific optimizations for Blackwell Ultra (SM103) and GB200. PyTorch/AO reported a 1.2x speedup for MXFP8 MoE training on GB200 compared to BF16.
- FP4/FP8 Maturation: NVIDIA’s TransformerEngine v2.10 and Megatron-LM are heavily pushing NVFP4 training recipes and CUDA graphs for quantized weights, aiming to lock in low-precision training advantages.
- Transformers v5: The shift to v5.0 in the HuggingFace ecosystem (removing slow tokenizers, dynamic weight loading) will force all hardware vendors to validate their integrations against the new API surface immediately.
📂 Category Updates
🟥 AMD Ecosystem
[AMD-AGI/Primus]
- Key Activity:
- [2025-12-20] 🚨 RELEASE: v0.6.0 - Support for DeepSeek-V3, Docker v25.11, and IRLens open-source.
- [2025-12-04] RELEASE: v0.5.0 - TorchTitan FP8 configs and Qwen3 support.
- Details:
- DeepSeek Support: Added configs for DeepSeek-V3 16B & 671B specifically for MI300X and MI355X architectures.
- New Tools: Open-sourced the IRLens tool for IR analysis and added a Runner Library/CLI for better UX.
- Refactor: Training configs were reorganized by GPU architecture (MI300 vs MI355).
- Metrics: 149 PRs, 1 Issue (Healthy maintenance, high feature velocity)
[ROCm/MAD]
- Key Activity:
- [2025-12-XX] Added Multi-node support feature.
- Details:
- Introduced multi-node support to enable larger scale distributed training tests.
- Released pytorch-xdit:v25.13 and Pytorch/Megatron-LM training v25.11 scripts.
- Metrics: 11 PRs, 1 Issue
[AMD-AGI/TraceLens]
- Key Activity:
- [2025-12-23] Added support for Rocprofv3 profile data.
- Details:
- Enabled GPU-only trace support in performance report generation.
- Metrics: 8 PRs, 2 Issues
🔥 PyTorch Ecosystem
[pytorch/ao] (Architecture Optimization)
- Key Activity:
- [2025-12-22] 🚨 RELEASE: v0.15.0 - MXFP8 MoE training & Llama4 Scout support.
- Details:
- Blackwell Performance: Demonstrated 1.2x speedup on 64-node GB200 cluster using MXFP8 MoE training vs BF16.
- Features: Added Safetensors support and FqnToConfig for parameter-level quantization targeting (crucial for MoE models; see the sketch below).
- Deprecations: Removed config functions like int4_weight_only in favor of class-based configs.
- Metrics: 0 PRs (Repo snapshot), 18 Issues
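The FqnToConfig feature called out above targets quantization by fully-qualified name, which is what makes per-expert handling in MoE models practical. A minimal sketch of that pattern using the module-level ModuleFqnToConfig API that already exists in torchao; the parameter-level FqnToConfig from the release notes presumably follows the same dict-of-FQNs shape (model and FQN choices here are illustrative, not from the release notes):

```python
# Illustrative sketch: FQN-keyed quantization targeting in torchao.
# ModuleFqnToConfig (module-level) is the pre-existing API; the release's
# FqnToConfig presumably extends the same pattern to parameter-level FQNs.
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig, ModuleFqnToConfig

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
)

# Quantize only the second linear layer, addressed by its fully-qualified name --
# the same targeting mechanism matters for hitting individual MoE experts.
config = ModuleFqnToConfig({"1": Int8WeightOnlyConfig()})
quantize_(model, config)
```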
[pytorch/torchtitan]
- Key Activity:
- [2025-12-26] 🚨 RELEASE: v0.2.1 - DeepEP integration and GPT-OSS enablement.
- Details:
- Integrated DeepEP for expert parallelism.
- Added Context Parallelism for Flux model training.
- Enabled node-limited routing for MoE.
- Metrics: 76 PRs, 20 Issues
[meta-pytorch/monarch]
- Key Activity:
- [2025-12-22] 🚨 RELEASE: v0.2.0 - Focus on supervision, shutdown semantics, and logging.
- Details:
- Major hardening of actor supervision hierarchy and recursive shutdown reliability.
- Removed legacy v0 codepaths and decoupled binary dependency on PyTorch (Python-only dep now).
- Metrics: 0 PRs (Repo snapshot), 0 Issues
🟩 NVIDIA Ecosystem
[NVIDIA/Megatron-LM]
- Key Activity:
- [2025-12-17] 🚨 RELEASE: core_v0.15.0 - DeepSeek V3 / MoE optimizations.
- Details:
- MoE: Added DTensor support for Expert Parallelism (EP) and DeepSeek V3 (DSv3) modules.
- Performance: Fused QKV preprocessing (3x speedup).
- Model Support: Added YaRN support for GPT-OSS and FP8 initialization for MTP.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
[NVIDIA/TransformerEngine]
- Key Activity:
- [2025-12-11] 🚨 RELEASE: v2.10
- Details:
- Added support for NVFP4 training recipe for GroupedLinear module.
- Enabled CUDA graphs for quantized weights with Tensor Parallelism.
- Added FSDP2 support with quantized weights.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
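For context, Transformer Engine recipes are applied through the fp8_autocast context manager. Below is a minimal sketch using the existing DelayedScaling FP8 recipe; the NVFP4 GroupedLinear recipe added in v2.10 presumably plugs into the same mechanism (layer shapes and recipe settings are illustrative):

```python
# Minimal Transformer Engine FP8 training step (illustrative; requires a
# CUDA GPU with FP8 support). The NVFP4 GroupedLinear recipe from v2.10
# presumably slots into the same fp8_autocast mechanism.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
inp = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.float().sum().backward()
```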
🚀 Inference & Serving
[vllm-project/vllm]
- Key Activity:
- [2025-12-19] 🚨 RELEASE: v0.13.0 - DeepSeek V3/V3.2, Blackwell Ultra support.
- [2025-12-03] 🚨 RELEASE: v0.12.0 - EAGLE Spec Decode, AMD AITER backend.
- Details:
- DeepSeek: Massive optimizations including DeepEP high-throughput CUDA graphs and DeepGEMM fused kernels.
- AMD ROCm: Integrated AITER attention backend and enabled DeepSeek V3.2 SparseMLA on ROCm.
- Hardware: Added support for NVIDIA Blackwell Ultra (SM103) and CUDA 13.
- Breaking: Removed the xformers backend and upgraded to PyTorch 2.9.0.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
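Serving one of these checkpoints still goes through the standard vLLM entry points. A minimal offline-inference sketch follows; the model name, parallelism degree, and the ROCm AITER toggle are illustrative assumptions rather than values from the release notes:

```python
# Minimal vLLM offline-inference sketch (illustrative). On ROCm, the AITER
# attention backend is typically opted into via an environment variable
# (assumed here to be VLLM_ROCM_USE_AITER) before engine construction.
import os
os.environ.setdefault("VLLM_ROCM_USE_AITER", "1")  # ignored on non-ROCm builds

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # illustrative checkpoint
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain SparseMLA in one paragraph."], params)
print(outputs[0].outputs[0].text)
```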
[sgl-project/sglang]
- Key Activity:
- [2025-12-24] 🚨 RELEASE: Gateway v0.3.0 - Go implementation and Unified Inference Gateway.
- [2025-12-03] 🚨 RELEASE: v0.5.6 - DeepSeek V3.2 support and Blockwise diffusion.
- Details:
- DeepSeek: Full support for V3.2 and V3.2 Speciale.
- Gateway: Introduced a Go implementation of the SGLang Model Gateway and “Unified Inference Gateway” mode.
- Diffusion: Added support for Flux2 and Z-image.
- AMD: Optimized DeepSeek-R1 model with RMSNorm + FP8 quant fusion on ROCm.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
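A minimal sketch of driving one of these models through SGLang's offline Engine API; the model path, parallelism degree, and sampling parameters below are illustrative:

```python
# Illustrative SGLang offline-engine usage; the Engine wraps the same runtime
# that sglang.launch_server exposes over HTTP.
import sglang as sgl

llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V3.2-Exp", tp_size=8)
outputs = llm.generate(
    ["Summarize the V3.2 architecture changes in two sentences."],
    {"temperature": 0.7, "max_new_tokens": 128},
)
print(outputs[0]["text"])
llm.shutdown()
```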
[THUDM/slime]
- Key Activity:
- [2025-12-12] 🚨 RELEASE: v0.2.1 - VLM + FSDP true on-policy training.
- Details:
- Enabled True on-policy training for Qwen3-VL (dense) using FSDP.
- Added PD-disaggregation support during rollout.
- Upgraded to SGLang v0.5.6 backend.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
🤗 Community & Research Tools
[huggingface/transformers]
- Key Activity:
- [2025-12-01] 🚨 RELEASE: v5.0.0rc0 - Major Version Candidate.
- Details:
- Dynamic Weight Loading: New API to reshape/merge/split layers during loading (critical for quantization/sharding).
- Tokenizer Refactor: Consolidated “slow” and “fast” tokenizers; removed legacy files.
- Breaking: Dropped torchscript and torch.fx support. Removed load_in_4bit/8bit args in favor of quantization configs (see the sketch below).
- Metrics: 0 PRs (Repo snapshot), 0 Issues
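The load_in_4bit/8bit removal noted above pushes callers onto the explicit quantization-config path that already exists in current Transformers releases. A minimal sketch (checkpoint name is illustrative):

```python
# Quantization requested via an explicit config object instead of the removed
# load_in_4bit / load_in_8bit keyword arguments (checkpoint is illustrative).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)
```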
[tile-ai/tilelang]
- Key Activity:
- [2025-12-31] 🚨 RELEASE: v0.1.7.post2
- Details:
- Added support for DeepSeek sparse MLA backward via split-H.
- Adapted GEMM v2 for CuteDSL backend.
- Enabled FlashAttention-2 forward pass on AMD MI300X.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
[deepspeedai/DeepSpeed]
- Key Activity:
- [2025-12-09] 🚨 RELEASE: v0.18.3
- Details:
- Added support for the Muon optimizer (with separate learning rates).
- Added Qwen2.5 to the AutoTP model list.
- Metrics: 0 PRs (Repo snapshot), 0 Issues
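A hedged sketch of how the new optimizer would be selected through DeepSpeed's standard config dict; the "Muon" type string and the separate-learning-rate parameter names are assumptions drawn from the release notes, not verified against the v0.18.3 API:

```python
# Hedged sketch: optimizer selection via the standard DeepSpeed config dict.
# The "Muon" type string and the separate learning-rate keys are assumptions.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Muon",                            # assumed type string
        "params": {"lr": 2e-2, "adam_lr": 3e-4},   # hypothetical separate LRs
    },
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```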