GitHub Monthly Report: 2025-12-01 to 2025-12-31
Engineering Report (2025-12-01 - 2025-12-31)
Executive Summary
December 2025 was a pivotal month characterized by the rapid ecosystem-wide integration of DeepSeek-V3 and V3.2 models. Virtually every major training and inference framework (Primus, vLLM, SGLang, Megatron-LM) prioritized support for these architectures, specifically focusing on SparseMLA and FP8 optimizations.
Hugging Face Transformers released the first Release Candidate for v5.0, marking a massive shift with dynamic weight loading and the removal of legacy "slow" tokenizers. AMD's Primus saw two significant releases (v0.5.0, v0.6.0) aggressively targeting Qwen3 and DeepSeek-V3 support on MI300X/MI355X. vLLM and SGLang continue to battle for inference supremacy, with vLLM adding Blackwell Ultra support and SGLang introducing a unified enterprise gateway.
AMD Related Updates
- Primus Acceleration: AMD-AGI/Primus released v0.6.0, adding specific configurations for MI355X and MI300X for DeepSeek-V3 (16B & 671B) and Qwen3. This indicates day-one support for flagship open models on AMD hardware.
- ROCm in vLLM: vLLM v0.12.0 and v0.13.0 brought substantial AMD updates, including support for DeepSeek v3.2 SparseMLA, FP8 MLA decoding, and the AITER attention backend. This significantly narrows the feature gap with CUDA for these specific high-demand models.
- Triton Optimizations: Specific work on WarpPipeliner for AMD architectures in the upstream Triton repo suggests ongoing low-level kernel optimization to improve occupancy.
- Issues: An issue regarding memory access faults on gfx1151 (Strix Halo) was reported in the core ROCm repo, highlighting early friction with new APU architectures.
Competitive Analysis
- NVIDIA Blackwell Support: vLLM v0.13.0 officially added support for NVIDIA Blackwell Ultra (SM103/GB300) via CUDA 13. This hardware availability in open-source inference engines puts pressure on AMD to ensure MI325/355 support is equally frictionless.
- NVFP4 Ecosystem: NVIDIA is aggressively pushing NVFP4 (4-bit floating point) into the training stack. Megatron-LM Core v0.15.0 and TransformerEngine v2.10 added native support for NVFP4 training recipes and zero padding.
- MXFP8 Scaling: PyTorch AO v0.15.0 demonstrated a 1.2x end-to-end training speedup on GB200 clusters using MXFP8 MoE kernels compared to BF16, setting a new bar for training efficiency benchmarks.
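The MXFP8 result above rests on block-scaled low-precision formats: each small block of values shares a single power-of-two scale, so quantization error stays bounded per block while hardware kernels consume a compact layout. A toy pure-Python sketch of that per-block scaling idea (block size, rounding scheme, and helper names here are illustrative, not torchao's actual kernels):

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_block(block):
    """Quantize one block with a shared power-of-two scale (MX-style).

    The scale is chosen so the block's max magnitude lands inside the
    FP8 E4M3 dynamic range; elements are then rounded coarsely to
    mimic the 3-bit mantissa. A toy model, not bit-exact FP8.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # power-of-two scale, like the E8M0 shared exponent in MX formats
    scale = 2.0 ** math.ceil(math.log2(amax / FP8_E4M3_MAX))
    quantized = []
    for x in block:
        v = x / scale
        if v != 0.0:
            # keep ~3 fractional bits of the significand (E4M3-like)
            e = math.floor(math.log2(abs(v)))
            step = 2.0 ** (e - 3)
            v = round(v / step) * step
        quantized.append(v)
    return scale, quantized

def dequantize_block(scale, quantized):
    return [v * scale for v in quantized]

data = [0.11, -3.2, 7.5, 0.004]
scale, q = quantize_block(data)
recovered = dequantize_block(scale, q)
```

The reported speedups come from hardware that consumes this kind of layout natively; the shared scale keeps relative error in each block roughly at the element format's mantissa resolution.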
Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-12-20] RELEASE: v0.6.0: Major update adding support for MI355X configs, DeepSeek-V3 (16B/671B), and Qwen3.
- [2025-12-04] RELEASE: v0.5.0: Added TurboAttention fixes and initial Qwen3/DeepSeek support.
- Details:
- DeepSeek Support: Added configs for DeepSeek-V3 16B & 671B specifically optimized for MI300X and MI355X.
- Megatron Backend: Primus-Megatron now uses `PrimusTurboSpecProvider` for model backends.
- TorchTitan: Enabled dynamic model parameter overrides via CLI and added Mock HF dataset support.
- Metrics: 149 PRs, 1 Issue
ROCm/ROCm
- Key Activity:
- [2025-12-XX] Bug reporting and documentation fixes.
- Details:
- Bug: "Memory Access Fault on gfx1151 (Strix Halo)" indicates teething issues with new APU hardware.
- Bug: HSA Queue Creation fails on ARM64 with RDNA3 GPUs.
- Metrics: 65 PRs, 30 Issues
AMD-AGI/TraceLens
- Key Activity:
- [2025-12-23] Rocprofv3 profile data support.
- Details:
- Enables GPU-only trace support in performance report generation.
- Metrics: 8 PRs, 2 Issues
PyTorch Ecosystem
huggingface/transformers
- Key Activity:
- [2025-12-01] RELEASE: v5.0.0rc0: The first major version bump in 5 years.
- Details:
- Dynamic Weight Loading: New API to reshape/merge/split layers during loading (crucial for quantization/parallelism).
- Tokenizer Overhaul: Removal of "slow" (Python-based) tokenizers in favor of Rust-backed implementations; unified `encode`/`decode` APIs.
- Deprecations: Dropped `torchscript` and `torch.fx` support in favor of `dynamo` and `export`.
- Metrics: 439 PRs, 138 Issues
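Dynamic weight loading is about transforming tensors while a checkpoint streams in, e.g. splitting a fused QKV projection into separate q/k/v weights so they can be quantized or sharded independently. A minimal sketch of that split, using plain nested lists and invented key names rather than the actual Transformers v5 hook API:

```python
def split_fused_qkv(state_dict, key="attn.qkv.weight", hidden=4):
    """Toy loader hook: split a fused QKV weight into q/k/v entries.

    Weights are nested lists standing in for tensors; the key names
    and hook-style signature are illustrative only, not the real
    Transformers v5 interface.
    """
    fused = state_dict.pop(key)          # shape: (3 * hidden, hidden)
    prefix = key.rsplit(".", 2)[0]       # e.g. "attn"
    state_dict[f"{prefix}.q.weight"] = fused[:hidden]
    state_dict[f"{prefix}.k.weight"] = fused[hidden:2 * hidden]
    state_dict[f"{prefix}.v.weight"] = fused[2 * hidden:]
    return state_dict

# 12x4 fused matrix: rows 0-3 are Q, 4-7 are K, 8-11 are V
ckpt = {"attn.qkv.weight": [[float(i)] * 4 for i in range(12)]}
ckpt = split_fused_qkv(ckpt)
```

Doing this at load time (rather than materializing the fused tensor first) is what makes the approach attractive for quantization and tensor parallelism.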
pytorch/torchtitan
- Key Activity:
- [2025-12-26] RELEASE: v0.2.1: Features DeepEP integration and MoE enhancements.
- Details:
- MoE: Enabled Pipeline Parallel (PP) and Expert Parallel (EP) overlap; added node-limited routing support.
- Flux: Added Context Parallelism to Flux model training.
- Metrics: 76 PRs, 20 Issues
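Node-limited routing caps how many nodes a token's chosen experts may span, trimming cross-node all-to-all traffic. A pure-Python sketch of the idea (the expert/node layout and function name are invented for illustration; real routers operate on batched tensors):

```python
def node_limited_topk(scores, experts_per_node, max_nodes, k):
    """Pick top-k experts while touching at most `max_nodes` nodes.

    Mirrors the idea of group/node-limited MoE routing: first rank
    nodes by the best expert score they host, then run the top-k
    only over experts on the chosen nodes.
    """
    n_nodes = len(scores) // experts_per_node
    node_scores = [
        max(scores[n * experts_per_node:(n + 1) * experts_per_node])
        for n in range(n_nodes)
    ]
    keep = sorted(range(n_nodes), key=lambda n: node_scores[n], reverse=True)[:max_nodes]
    allowed = {
        e for n in keep for e in range(n * experts_per_node, (n + 1) * experts_per_node)
    }
    ranked = sorted(allowed, key=lambda e: scores[e], reverse=True)
    return ranked[:k]

# 8 experts, 2 per node; route to 3 experts spanning at most 2 nodes
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.6, 0.4]
picked = node_limited_topk(scores, experts_per_node=2, max_nodes=2, k=3)
```

The trade-off is a slightly less optimal expert mix in exchange for bounded communication fan-out per token.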
pytorch/ao
- Key Activity:
- [2025-12-22] RELEASE: v0.15.0: Focus on MXFP8 and Safetensors.
- Details:
- MXFP8: Demonstrated 1.2x speedup for MoE training on GB200 clusters vs BF16.
- Safetensors: Native support for saving/loading quantized checkpoints.
- Metrics: 0 PRs (release notes only in data), 18 Issues
NVIDIA Ecosystem
NVIDIA/Megatron-LM
- Key Activity:
- [2025-12-17] RELEASE: core_v0.15.0: Heavy optimization for MoE and quantization.
- Details:
- Performance: Fused QKV preprocessing with precomputed RoPE caches (3x preprocessing speedup).
- NVFP4: Implemented NVFP4 Zero Padding for MoE.
- MoE: Added DTensor support for Expert Parallelism (EP).
- Metrics: 0 PRs (release notes only in data), 0 Issues
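The RoPE-cache optimization amounts to computing each position's cos/sin rotation angles once per sequence and reusing them across layers and heads instead of recomputing per projection. A simplified sketch of such a precomputed cache (shapes reduced to paired dimensions; this is not Megatron's fused kernel):

```python
import math

def build_rope_cache(max_pos, dim, base=10000.0):
    """Precompute (cos, sin) tables for rotary position embeddings.

    One table per position, shared by every layer/head that applies
    RoPE; computing it once is the preprocessing win the release
    notes describe.
    """
    cache = []
    for pos in range(max_pos):
        row = []
        for i in range(0, dim, 2):
            theta = pos / (base ** (i / dim))
            row.append((math.cos(theta), math.sin(theta)))
        cache.append(row)
    return cache

def apply_rope(vec, pos, cache):
    """Rotate consecutive (even, odd) pairs of `vec` by the cached angles."""
    out = []
    for j, (c, s) in enumerate(cache[pos]):
        x, y = vec[2 * j], vec[2 * j + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

cache = build_rope_cache(max_pos=16, dim=4)
q = apply_rope([1.0, 0.0, 1.0, 0.0], pos=0, cache=cache)
```

Because the rotation is a pure function of position, the cache is also a natural target for fusion with QKV preprocessing.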
NVIDIA/TransformerEngine
- Key Activity:
- [2025-12-11] RELEASE: v2.10: NVFP4 Training Recipes.
- Details:
- Added support for the NVFP4 training recipe for the `GroupedLinear` module.
- Added support for FSDP2 with quantized weights.
- Metrics: 0 PRs, 0 Issues
Inference & Serving
vllm-project/vllm
- Key Activity:
- [2025-12-19] RELEASE: v0.13.0: DeepSeek optimizations and Blackwell Ultra support.
- [2025-12-03] RELEASE: v0.12.0: Eagle speculative decoding and ROCm expansion.
- Details:
- DeepSeek: Massive optimizations (DeepGEMM fused layout, DeepEP High-Throughput CUDA graph).
- Hardware: Support for SM103 (Blackwell Ultra/GB300).
- ROCm: AITER quantization kernels; `torch.compile` fusion of layernorm/SiLU with FP8 quantization.
- Metrics: 0 PRs (release notes only in data), 0 Issues
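Speculative decoding schemes like Eagle have a cheap draft model propose several tokens that the target model then verifies in a single forward pass. A sketch of the simple greedy-verification variant (Eagle itself verifies a token tree; this linear version is for illustration only):

```python
def verify_draft(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification.

    `draft_tokens` are the k tokens the draft model proposed;
    `target_argmax[i]` is the target model's greedy token at each of
    the k+1 positions (the extra one allows a bonus token). We keep
    the longest agreeing prefix, then emit the target's own token at
    the first disagreement.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_argmax[i]:
            accepted.append(tok)
        else:
            accepted.append(target_argmax[i])  # correction token
            return accepted
    accepted.append(target_argmax[len(draft_tokens)])  # bonus token
    return accepted

# draft guessed [5, 7, 9]; target agrees on 5, 7 but wanted 2 next
out = verify_draft([5, 7, 9], [5, 7, 2, 4])
```

Every call emits at least one token, so in the worst case the scheme degrades to ordinary decoding rather than losing ground.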
sgl-project/sglang
- Key Activity:
- [2025-12-24] RELEASE: gateway-v0.3.0: Unified Inference Gateway (IGW).
- [2025-12-03] RELEASE: v0.5.6: DeepSeek V3.2 support.
- Details:
- Gateway: IGW supports gRPC, HTTP, and OpenAI routers in a single deployment.
- Models: Support for DeepSeek V3.2/V3.2 Speciale, Flux2, and Z-image.
- WASM: Added WASM middleware support for custom routing logic.
- Metrics: 0 PRs (release notes only in data), 0 Issues
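The unified gateway's core idea is a single deployment fronting several request protocols and dispatching each to the right backend handler. A toy dispatcher sketch (handler names and the request shape are invented; SGLang's IGW layers load balancing and WASM middleware on top of this pattern):

```python
def make_gateway(handlers):
    """Single-entry gateway that dispatches requests by protocol.

    `handlers` maps a protocol name to a callable; unknown protocols
    get a 400-style error. Purely illustrative of the routing idea.
    """
    def gateway(request):
        proto = request.get("protocol")
        if proto not in handlers:
            return {"status": 400, "error": f"unsupported protocol: {proto}"}
        return handlers[proto](request)
    return gateway

gw = make_gateway({
    "http":   lambda r: {"status": 200, "via": "http"},
    "grpc":   lambda r: {"status": 200, "via": "grpc"},
    "openai": lambda r: {"status": 200, "via": "openai"},
})
resp = gw({"protocol": "grpc", "body": "..."})
```

Centralizing dispatch like this is what lets one deployment serve gRPC, HTTP, and OpenAI-compatible clients without separate routing tiers.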
tile-ai/tilelang
- Key Activity:
- [2025-12-07] RELEASE: v0.1.7: Heavy compiler optimizations.
- Details:
- Integrated Z3 in TVM Arith Analyzer.
- Added support for Huawei Ascend chips.
- Optimized MHA varlen forward pass.
- Metrics: 173 PRs, 55 Issues
JAX & Google Ecosystem
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-12-30] RELEASE: v1.5.0 & v1.4.0 Tutorials.
- Details:
- Added support for Muon optimizer.
- Added packing support for Context Parallelism (Ring Attention).
- Metrics: 131 PRs, 12 Issues
Other Notable Updates
- deepspeedai/DeepSpeed: v0.18.3 released. Added separate learning rates for the Muon optimizer and relaxed tolerances for ROCm FP8 tests.
- deepseek-ai/DeepEP: Added auto-detection for Blackwell GPU architecture (sm_100a).
- THUDM/slime: v0.2.1 released. Enabled true on-policy training for VLM + FSDP (Qwen3-VL).