📅 Engineering Report (2025-12-01 - 2025-12-31)

🚀 Executive Summary

December 2025 was dominated by rapid, ecosystem-wide integration of the DeepSeek-V3 and V3.2 models. Most of the major training and inference frameworks tracked here (Primus, vLLM, SGLang, Megatron-LM) prioritized support for these architectures, with particular focus on SparseMLA and FP8 optimizations.

Hugging Face Transformers released the first release candidate for v5.0, a major shift that introduces dynamic weight loading and removes the legacy “slow” tokenizers. AMD’s Primus shipped two significant releases (v0.5.0, v0.6.0) targeting Qwen3 and DeepSeek-V3 support on MI300X/MI355X. vLLM and SGLang continued to compete on inference, with vLLM adding Blackwell Ultra support and SGLang introducing a unified inference gateway.

- Primus Acceleration: AMD-AGI/Primus released v0.6.0, adding MI355X- and MI300X-specific configurations for DeepSeek-V3 (16B & 671B) and Qwen3. This points to day-one support for flagship open models on AMD hardware.
- ROCm in vLLM: vLLM v0.12.0 and v0.13.0 brought substantial AMD updates, including support for DeepSeek V3.2 SparseMLA, FP8 MLA decoding, and the AITER attention backend. This significantly narrows the feature gap with CUDA for these high-demand models.
- Triton Optimizations: Work on WarpPipeliner for AMD architectures in the upstream Triton repo suggests ongoing low-level kernel optimization to improve occupancy.
- Issues: A memory access fault on gfx1151 (Strix Halo) was reported in the core ROCm repo, highlighting early friction with new APU architectures.

Competitive Analysis

- NVIDIA Blackwell Support: vLLM v0.13.0 officially added support for NVIDIA Blackwell Ultra (SM103/GB300) via CUDA 13. Having this hardware supported in open-source inference engines puts pressure on AMD to make MI325X/MI355X support equally frictionless.
- NVFP4 Ecosystem: NVIDIA is aggressively pushing NVFP4 (4-bit floating point) into the training stack. Megatron-LM Core v0.15.0 and TransformerEngine v2.10 added native support for NVFP4 training recipes and zero padding.
- MXFP8 Scaling: PyTorch AO v0.15.0 demonstrated a 1.2x end-to-end training speedup on GB200 clusters using MXFP8 MoE kernels compared to BF16, setting a new bar for training efficiency benchmarks.

📂 Category Updates

🔴 AMD Ecosystem

AMD-AGI/Primus

Key Activity:
- [2025-12-20] 🚨 RELEASE: v0.6.0: Major update adding support for MI355X configs, DeepSeek-V3 (16B/671B), and Qwen3.
- [2025-12-04] 🚨 RELEASE: v0.5.0: Added TurboAttention fixes and initial Qwen3/DeepSeek support.
Details:
- DeepSeek Support: Added configs for DeepSeek-V3 16B & 671B specifically optimized for MI300X and MI355X.
- Megatron Backend: Primus-Megatron now uses PrimusTurboSpecProvider for model backends.
- TorchTitan: Enabled dynamic model parameter overrides via CLI and added Mock HF dataset support.
Metrics: 149 PRs, 1 Issue
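
To make the TorchTitan item above concrete: overriding model parameters from the command line usually comes down to mapping dotted `key=value` strings onto a nested config. The sketch below is a generic illustration of that idea, not Primus's or TorchTitan's implementation; `apply_overrides`, `parse_value`, and the config keys are hypothetical names chosen for this example.

```python
import ast
import copy


def parse_value(text):
    """Interpret a CLI string as a Python literal, falling back to the raw string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied.

    Hypothetical helper for illustration only.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand so new keys can be added.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Made-up base config and overrides, as they might arrive from argv.
base = {"model": {"num_layers": 32, "hidden_size": 4096}, "training": {"lr": 3e-4}}
updated = apply_overrides(
    base, ["model.num_layers=4", "training.lr=1e-5", "model.name=tiny"]
)
```

The base config is deep-copied, so the defaults stay untouched and the same base can be reused across several experiment variants.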

ROCm/ROCm

Key Activity:
- [2025-12-XX] Bug reporting and documentation fixes.
Details:
- Bug: “Memory Access Fault on gfx1151 (Strix Halo)” indicates teething issues with new APU hardware.
- Bug: HSA Queue Creation fails on ARM64 with RDNA3 GPUs.
Metrics: 65 PRs, 30 Issues

AMD-AGI/TraceLens

Key Activity:
- [2025-12-23] Added support for rocprofv3 profile data.
Details:
- Enables GPU-only trace support in performance report generation.
Metrics: 8 PRs, 2 Issues

🔥 PyTorch Ecosystem

huggingface/transformers

Key Activity:
- [2025-12-01] 🚨 RELEASE: v5.0.0rc0: The first major version bump in 5 years.
Details:
- Dynamic Weight Loading: New API to reshape/merge/split layers during loading (crucial for quantization/parallelism).
- Tokenizer Overhaul: Removal of “slow” (Python-based) tokenizers in favor of Rust-backed implementations; unified encode/decode APIs.
- Deprecations: Dropped TorchScript and torch.fx support in favor of Dynamo and torch.export.
Metrics: 439 PRs, 138 Issues
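
The "reshape/merge/split layers during loading" idea in the Dynamic Weight Loading item can be pictured with a toy example: a checkpoint stores separate Q, K, and V projection weights, while the runtime model wants one fused matrix (or the reverse). This is a minimal pure-Python sketch of the concept only. It does not use or reproduce the Transformers v5 API, and `merge_rows`, `split_rows`, and the checkpoint keys are hypothetical.

```python
def merge_rows(*matrices):
    """Concatenate weight matrices along the output (row) dimension."""
    width = len(matrices[0][0])
    merged = []
    for matrix in matrices:
        if any(len(row) != width for row in matrix):
            raise ValueError("all matrices must share the same input width")
        merged.extend(list(row) for row in matrix)
    return merged


def split_rows(matrix, sizes):
    """Split a fused matrix back into consecutive row blocks of the given sizes."""
    if sum(sizes) != len(matrix):
        raise ValueError("sizes must add up to the number of rows")
    parts, start = [], 0
    for size in sizes:
        parts.append(matrix[start:start + size])
        start += size
    return parts


# Hypothetical checkpoint: three projections, 2 output rows each, input width 3.
checkpoint = {
    "q_proj.weight": [[1, 2, 3], [4, 5, 6]],
    "k_proj.weight": [[7, 8, 9], [10, 11, 12]],
    "v_proj.weight": [[13, 14, 15], [16, 17, 18]],
}

# Merge at load time into one fused QKV matrix ...
fused = merge_rows(
    checkpoint["q_proj.weight"],
    checkpoint["k_proj.weight"],
    checkpoint["v_proj.weight"],
)
# ... and split again, e.g. to hand each block to a different consumer.
q, k, v = split_rows(fused, [2, 2, 2])
```

The same split step is what makes the technique relevant to parallelism: a row block can be assigned to a different device instead of a different projection.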

pytorch/torchtitan

Key Activity:
- [2025-12-26] 🚨 RELEASE: v0.2.1: Features DeepEP integration and MoE enhancements.
Details:
- MoE: Enabled Pipeline Parallel (PP) and Expert Parallel (EP) overlap; added node-limited routing support.
- Flux: Added Context Parallelism to Flux model training.
Metrics: 76 PRs, 20 Issues
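
For readers unfamiliar with node-limited routing (the MoE item above): the idea, as used in DeepSeek-V3-style routers, is to cap how many nodes a token's experts may live on, which bounds cross-node traffic. The sketch below is a simplified single-token version. It assumes experts are laid out evenly across nodes and ranks each node by the sum of its strongest experts; it is not TorchTitan's code, and `node_limited_topk` is a hypothetical name.

```python
def node_limited_topk(scores, experts_per_node, max_nodes, top_k):
    """Pick `top_k` experts for one token, drawn from at most `max_nodes` nodes.

    scores: per-expert affinity scores; expert i is assumed to live on
    node i // experts_per_node.
    """
    if len(scores) % experts_per_node != 0:
        raise ValueError("experts must divide evenly across nodes")
    num_nodes = len(scores) // experts_per_node
    per_node_k = max(1, top_k // max_nodes)

    # Score each node by the sum of its strongest `per_node_k` experts.
    node_scores = []
    for node in range(num_nodes):
        local = scores[node * experts_per_node:(node + 1) * experts_per_node]
        node_scores.append(sum(sorted(local, reverse=True)[:per_node_k]))

    # Keep only the best `max_nodes` nodes.
    ranked_nodes = sorted(range(num_nodes), key=lambda n: node_scores[n], reverse=True)
    allowed_nodes = set(ranked_nodes[:max_nodes])

    # Ordinary top-k, restricted to experts on the allowed nodes.
    candidates = [e for e in range(len(scores)) if e // experts_per_node in allowed_nodes]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]


# 8 experts on 4 nodes (2 per node); made-up affinities for one token.
scores = [0.9, 0.1, 0.8, 0.7, 0.85, 0.0, 0.2, 0.3]
selected = node_limited_topk(scores, experts_per_node=2, max_nodes=2, top_k=4)
nodes_used = {e // 2 for e in selected}
```

With these scores, an unrestricted top-4 would pick experts on three nodes (expert 4 has the second-highest score). The node limit drops it in favor of a weaker expert on an already-selected node, trading a little routing quality for less cross-node communication.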

pytorch/ao

Key Activity:
- [2025-12-22] 🚨 RELEASE: v0.15.0: Focus on MXFP8 and Safetensors.
Details:
- MXFP8: Demonstrated 1.2x speedup for MoE training on GB200 clusters vs BF16.
- Safetensors: Native support for saving/loading quantized checkpoints.
Metrics: 0 PRs (Release notes only in data), 18 Issues
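
A quick note on reading the 1.2x figure above: a 1.2x speedup divides wall-clock time by 1.2, which removes about 17% of the run time, not 20%. The arithmetic is sketched below; the 100-hour baseline is a made-up number for illustration and does not come from the release notes.

```python
def wall_clock_after_speedup(baseline_hours, speedup):
    """Run time after applying a throughput speedup factor."""
    return baseline_hours / speedup


baseline_hours = 100.0  # hypothetical BF16 run length, not a reported figure
mxfp8_hours = wall_clock_after_speedup(baseline_hours, 1.2)
time_saved_fraction = 1.0 - mxfp8_hours / baseline_hours
```

Here `mxfp8_hours` comes out at roughly 83.3 and `time_saved_fraction` at roughly 0.167.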

🟩 NVIDIA Ecosystem

NVIDIA/Megatron-LM

Key Activity:
- [2025-12-17] 🚨 RELEASE: core_v0.15.0: Heavy optimization for MoE and quantization.
Details:
- Performance: Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup).
- NVFP4: Implemented NVFP4 Zero Padding for MoE.
- MoE: Added DTensor support for Expert Parallelism (EP).
Metrics: 0 PRs (Release notes only in data), 0 Issues
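
The "precomputed RoPE caches" part of the Performance item can be illustrated in isolation: instead of recomputing sines and cosines for every token and step, the tables are built once and indexed by position. The sketch below covers only that caching idea, in pure Python and with the interleaved channel-pair convention as an assumption. It is not Megatron-LM's fused kernel, and `build_rope_cache` and `apply_rope` are hypothetical names.

```python
import math


def build_rope_cache(max_positions, head_dim, base=10000.0):
    """Precompute (cos, sin) tables once: one row per position, one column per channel pair."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
    return cos, sin


def apply_rope(vector, position, cache):
    """Rotate consecutive channel pairs of `vector` using the cached tables."""
    cos, sin = cache
    out = []
    for i in range(0, len(vector), 2):
        c, s = cos[position][i // 2], sin[position][i // 2]
        x, y = vector[i], vector[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Build the cache once, then reuse it for every query/key vector.
cache = build_rope_cache(max_positions=16, head_dim=4)
q = [1.0, 0.0, 0.5, -0.5]
q_rot = apply_rope(q, position=3, cache=cache)
```

Because each pair is rotated rather than scaled, the vector's norm is unchanged, and the dot product between a rotated query and a rotated key depends only on the distance between their positions.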

NVIDIA/TransformerEngine

Key Activity:
- [2025-12-11] 🚨 RELEASE: v2.10: NVFP4 Training Recipes.
Details:
- Added support for NVFP4 training recipe for GroupedLinear module.
- Added support for FSDP2 with quantized weights.
Metrics: 0 PRs, 0 Issues
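
As background for the NVFP4 items above, the sketch below shows the general shape of block-scaled 4-bit floating-point quantization: values are grouped into small blocks, each block gets its own scale, and every element is rounded to the nearest value on the E2M1 grid. It is a simplified illustration under stated assumptions, not TransformerEngine's recipe. In particular, the scale is kept as an ordinary Python float here, whereas the real format stores scales in reduced precision, and `quantize_block` and `dequantize_block` are hypothetical names.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block length for this sketch


def quantize_block(block):
    """Return (scale, codes), where codes are signed values on the E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        magnitude = min(E2M1_MAGNITUDES, key=lambda g: abs(target - g))
        codes.append(-magnitude if x < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block's scale and grid codes."""
    return [scale * c for c in codes]


# Made-up block of 16 values spanning negative and positive numbers.
values = [0.03 * i - 0.2 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(values)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

The grid is coarsest near its top (the gap between 4 and 6), so the worst-case rounding error in a block is about one scale unit. Per-block scaling keeps that unit small for blocks whose values are small.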

⚡ Inference & Serving

vllm-project/vllm

Key Activity:
- [2025-12-19] 🚨 RELEASE: v0.13.0: DeepSeek optimizations and Blackwell Ultra support.
- [2025-12-03] 🚨 RELEASE: v0.12.0: EAGLE speculative decoding and ROCm expansion.
Details:
- DeepSeek: Massive optimizations (DeepGEMM fused layout, DeepEP High-Throughput CUDA graph).
- Hardware: Support for SM103 (Blackwell Ultra/GB300).
- ROCm: AITER quantization kernels, torch.compile layernorm/silu + FP8 quant.
Metrics: 0 PRs (Release notes only in data), 0 Issues

sgl-project/sglang

Key Activity:
- [2025-12-24] 🚨 RELEASE: gateway-v0.3.0: Unified Inference Gateway (IGW).
- [2025-12-03] 🚨 RELEASE: v0.5.6: DeepSeek V3.2 support.
Details:
- Gateway: IGW supports gRPC, HTTP, and OpenAI routers in a single deployment.
- Models: Support for DeepSeek V3.2/V3.2 Speciale, Flux2, and Z-Image.
- WASM: Added WASM middleware support for custom routing logic.
Metrics: 0 PRs (Release notes only in data), 0 Issues

tile-ai/tilelang

Key Activity:
- [2025-12-07] 🚨 RELEASE: v0.1.7: Heavy compiler optimizations.
Details:
- Integrated Z3 in TVM Arith Analyzer.
- Added support for Huawei Ascend chips.
- Optimized MHA varlen forward pass.
Metrics: 173 PRs, 55 Issues

🔵 JAX & Google Ecosystem

AI-Hypercomputer/maxtext

Key Activity:
- [2025-12-30] 🚨 RELEASE: v1.5.0 & v1.4.0 Tutorials.
Details:
- Added support for Muon optimizer.
- Added packing support for Context Parallelism (Ring Attention).
Metrics: 131 PRs, 12 Issues
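
The packing item above refers to placing several short sequences in one fixed-length row so that less compute is spent on padding. A minimal sketch of the bookkeeping is shown below: each row carries segment ids (so attention can be masked at sequence boundaries) and positions that restart per sequence. This is a generic greedy packer for illustration, not MaxText's implementation, and `pack_sequences` and the field names are hypothetical.

```python
def _pad_row(row, row_length, pad_id):
    """Pad a partially filled row; padding gets segment id 0 and position 0."""
    pad = row_length - len(row["tokens"])
    return {
        "tokens": row["tokens"] + [pad_id] * pad,
        "segment_ids": row["segment_ids"] + [0] * pad,
        "positions": row["positions"] + [0] * pad,
    }


def pack_sequences(sequences, row_length, pad_id=0):
    """Greedily pack token sequences, in order, into rows of `row_length`."""
    rows = []
    row = {"tokens": [], "segment_ids": [], "positions": []}
    for seq in sequences:
        if len(seq) > row_length:
            raise ValueError("sequence longer than row_length")
        if len(row["tokens"]) + len(seq) > row_length:
            # Current row is full enough: close it and start a new one.
            rows.append(_pad_row(row, row_length, pad_id))
            row = {"tokens": [], "segment_ids": [], "positions": []}
        segment_id = row["segment_ids"][-1] + 1 if row["segment_ids"] else 1
        row["tokens"].extend(seq)
        row["segment_ids"].extend([segment_id] * len(seq))
        row["positions"].extend(range(len(seq)))
    if row["tokens"]:
        rows.append(_pad_row(row, row_length, pad_id))
    return rows


# Three made-up sequences packed into rows of 6 tokens.
packed = pack_sequences([[11, 12, 13], [21, 22], [31, 32, 33, 34]], row_length=6)
```

When a packed row is later sharded across devices for context parallelism, the segment ids travel with the tokens, so each shard can still tell which tokens belong to the same original sequence.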

🛠️ Other Notable Updates

- deepspeedai/DeepSpeed: v0.18.3 released. Added separate learning rates for the Muon optimizer and relaxed tolerances for ROCm FP8 tests.
- deepseek-ai/DeepEP: Added auto-detection for Blackwell GPU architecture (sm_100a).
- THUDM/slime: v0.2.1 released. Enabled true on-policy training for VLM + FSDP (Qwen3-VL).