GitHub Monthly Report: 2025-12-01 to 2025-12-31
🚀 Executive Summary
December 2025 was defined by the rapid industry-wide adoption and optimization of DeepSeek-V3/V3.2 architectures. Almost every major training and inference framework (Primus, vLLM, SGLang, Megatron-LM) rushed to support DeepSeek’s MoE (Mixture of Experts) requirements, in particular SparseMLA and FP8/MXFP8 quantization strategies.
AMD significantly accelerated its software stack maturity with Primus v0.5.0 and v0.6.0, creating a unified training interface for MI300X and the upcoming MI355X. Concurrently, the inference ecosystem saw massive updates with vLLM v0.13.0 and SGLang v0.5.6, both of which integrated substantial AMD ROCm optimizations (AITER backend, Matrix Core intrinsics).
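The MoE architectures mentioned above rely on a learned router that activates only a few experts per token. As a stdlib-only sketch of the top-k gating step (the expert count and k here are illustrative, not any model's actual configuration):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def topk_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's router scores over 8 experts; each token is dispatched
# only to the chosen experts, which is what makes MoE compute sparse.
route = topk_route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9], k=2)
```

The renormalization step ensures the selected experts' outputs are combined with weights that sum to 1, even though most experts were skipped.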
AMD Related Updates
- DeepSeek on ROCm: Primus added specific configurations for training DeepSeek-V3 (16B & 671B) on MI300X and MI355X. vLLM and SGLang added support for DeepSeek V3.2 + SparseMLA and FP8 MLA decode specifically for ROCm.
- Primus Maturity: Two major releases (v0.5.0, v0.6.0) introduced a robust Runner CLI, organized configs by GPU architecture (MI300/MI355), and integrated TorchTitan for modern training workflows. The repository health is excellent (143 PRs closed).
- Tilelang Optimization: The new compiler stack (v0.1.7) enabled Flash Attention 2 forward pass on MI300X and refactored matrix core intrinsics, critical for custom kernel performance.
- Profiling: TraceLens now supports `rocprofv3` profile data, enhancing observability for AMD hardware.
Competitive Analysis
- NVIDIA’s Blackwell Prep: TransformerEngine v2.10 and Megatron-LM Core v0.15.0 introduced NVFP4 (4-bit floating point) training recipes, preparing software stacks for the Blackwell generation (GB200).
- PyTorch AO (Architecture Optimization): Release v0.15.0 highlighted MXFP8 MoE training, claiming a 1.2x speedup over BF16 on GB200 clusters. This sets a new bar for training efficiency that AMD software must match.
- Inference Latency Wars: vLLM and SGLang are in a tight race. vLLM’s “Model Runner V2” and SGLang’s “JIT Kernels” are both targeting overhead reduction to maximize token throughput on H100/H200s.
📂 Category Updates
🔴 AMD Ecosystem
[AMD-AGI/Primus]
- Key Activity:
- [2025-12-20] 🚨 RELEASE: v0.6.0 - Major feature update focusing on CLI tooling, Docker v25.11, and DeepSeek configurations.
- [2025-12-04] 🚨 RELEASE: v0.5.0 - initial TorchTitan integration and DeepSeek/Qwen3 model support.
- Details:
- Model Support: Added Qwen3 (0.6B/1.7B/32B) and DeepSeek-V3 (16B/671B) configs for MI300X/MI355X.
- Architecture: Refactored configs to be organized by GPU architecture. Added a new “Runner Library” and CLI for unified patch execution.
- Backends: Enhanced TorchTitan backend with dynamic model parameter overrides and mock dataset support. Improved Megatron backend with `transformer_engine` 2.4 adaptation.
- Metrics: 149 New PRs, 143 Closed PRs (High Velocity)
[AMD-AGI/TraceLens]
- Key Activity:
- [2025-12-23] Added support for Rocprofv3 profile data.
- Metrics: 8 New PRs, 8 Closed PRs
[tile-ai/tilelang]
- Key Activity:
- [2025-12-31] 🚨 RELEASE: v0.1.7.post2
- [2025-12-07] 🚨 RELEASE: v0.1.7
- Details:
- AMD Specific: Enabled Flash Attention 2 (FA2) forward pass on AMD MI300X. Refactored `MatrixCoreIntrinEmitter` for AMD.
- General: Added support for BFloat16 in CUDA codegen, integrated Z3 prover for arithmetic analysis, and added `T.gemm_v2` support.
- Metrics: 173 New PRs, 166 Closed PRs
[ROCm/ROCm]
- Key Activity:
- Addressed specific hardware bugs, including memory faults on `gfx1151` (Strix Halo) and HSA queue creation failures on ARM64+RDNA3.
- Metrics: 65 New PRs, 69 Closed PRs
🔥 PyTorch Ecosystem
[pytorch/torchtitan]
- Key Activity:
- [2025-12-26] 🚨 RELEASE: v0.2.1
- Details:
- Parallelism: Enabled Pipeline Parallelism (PP) and Expert Parallelism (EP) overlap for MoE. Integrated DeepEP. Added Context Parallelism for Flux models.
- Models: Enabled GPT-OSS and Qwen3.
- Experimental: Added “Compiler Toolkit” prototypes and auto-parallel experiments.
- Metrics: 76 New PRs, 64 Closed PRs
[pytorch/ao] (Architecture Optimization)
- Key Activity:
- [2025-12-22] 🚨 RELEASE: v0.15.0
- Details:
- MXFP8: Highlighted 1.2x training speedup for MoE models using MXFP8 on GB200 clusters.
- Quantization: Added `FqnToConfig` to target quantization by parameter name (crucial for MoE layers like `gate_up_proj`). Added safetensors support.
- Metrics: 131 New PRs, 120 Closed PRs
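Targeting quantization by fully-qualified parameter name (FQN) matters for MoE models because expert weights, attention weights, and the output head usually want different recipes. The sketch below illustrates the general pattern with stdlib glob matching; `FQN_CONFIGS`, `resolve_config`, and the recipe names are hypothetical and are not torchao's actual `FqnToConfig` API.

```python
from fnmatch import fnmatchcase

# Hypothetical per-parameter config table: glob pattern -> quantization recipe.
# Patterns are checked in order, so specific rules come before the "*" default.
FQN_CONFIGS = [
    ("*.experts.*.gate_up_proj", "mxfp8"),
    ("*.experts.*.down_proj", "mxfp8"),
    ("lm_head.weight", None),        # keep the output head in high precision
    ("*", "fp8_rowwise"),            # default for everything else
]

def resolve_config(param_name):
    """Return the recipe of the first pattern matching this parameter name."""
    for pattern, cfg in FQN_CONFIGS:
        if fnmatchcase(param_name, pattern):
            return cfg
    return None
```

For example, `resolve_config("model.layers.0.experts.3.gate_up_proj")` selects the MoE expert recipe, while unmatched parameters fall through to the default.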
⚡ Inference & Serving
[vllm-project/vllm]
- Key Activity:
- [2025-12-19] 🚨 RELEASE: v0.13.0
- [2025-12-03] 🚨 RELEASE: v0.12.0
- Details:
- Core: Upgraded to PyTorch 2.9.0 and CUDA 12.9. Introduced “Model Runner V2” (experimental) for lower latency.
- AMD ROCm: Added DeepSeek v3.2 + SparseMLA support, FP8 MLA decode, and AITER attention backend.
- Models: Added support for AudioFlamingo3, JAIS 2, and latent MoE architectures.
- Performance: Batch invariant BMM optimization yielded ~18% throughput improvement on DeepSeek-V3.1.
- Metrics: 439 New PRs, 402 Closed PRs
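The FP8 paths above all hinge on one idea: store values in an 8-bit float format (E4M3, whose largest finite value is 448) plus a higher-precision scale. The following is a minimal stdlib-only sketch of per-tensor scaled FP8 quantization; the E4M3 rounding here is approximate (it ignores subnormals and exponent underflow) and is not any framework's actual kernel.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def round_to_fp8_e4m3(x):
    """Round x to a nearby FP8 E4M3 value (3 mantissa bits; subnormals
    and exponent underflow are ignored for simplicity)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))          # abs(x) == m * 2**e, with m in [0.5, 1)
    m = round(m * 16) / 16             # keep 3 explicit mantissa bits
    return sign * min(math.ldexp(m, e), E4M3_MAX)

def fp8_quantize(tensor):
    """Per-tensor scaling: map the max |value| onto the FP8 range, then round."""
    amax = max(abs(v) for v in tensor) or 1.0
    scale = amax / E4M3_MAX
    return [round_to_fp8_e4m3(v / scale) for v in tensor], scale
```

Dequantization is just multiplying each stored value by `scale`; the largest-magnitude element maps exactly onto the end of the FP8 range, and everything else carries a small relative rounding error.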
[sgl-project/sglang]
- Key Activity:
- [2025-12-24] 🚨 RELEASE: gateway-v0.3.0
- [2025-12-03] 🚨 RELEASE: v0.5.6
- Details:
- Gateway: Major refactor including a Go implementation of the gateway, new metrics architecture, and unified Inference Gateway (IGW) mode for Kubernetes.
- DeepSeek: Full support for V3.2 including “Speciale” variants and FP8 optimizations.
- Features: Added support for Blockwise diffusion language models and new diffusion models like Flux2.
- Metrics: 0 New PRs (Data limitation in input, but releases indicate high activity)
[THUDM/slime]
- Key Activity:
- [2025-12-12] 🚨 RELEASE: v0.2.1
- Details:
- Training: Enabled true on-policy training for VLM (Qwen3-VL) using FSDP.
- Architecture: Added PD-disaggregation support during rollout and DP-attention support in rollout routing replay.
- Metrics: 0 New PRs (Data limitation in input)
⚔️ NVIDIA Ecosystem
[NVIDIA/Megatron-LM]
- Key Activity:
- [2025-12-17] 🚨 RELEASE: core_v0.15.0
- Details:
- Performance: Fused QKV preprocessing (3x speedup).
- MoE: Added HybridEP backend to Flex Dispatcher and NVFP4 Zero Padding for MoE.
- Inference: Added CUDA Graph runner lookup table cache (up to 2x speedup).
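Fusing QKV means issuing one GEMM against the concatenated `[Wq | Wk | Wv]` weights instead of three separate launches, then splitting the output; the 3x figure above refers to Megatron's fused preprocessing, while the sketch below only illustrates the fusion pattern at the shape level, using plain Python lists.

```python
def matmul(x, w):
    """Naive matmul: x is [n][d], w is [d][m], result is [n][m]."""
    return [[sum(xi[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for xi in x]

def fused_qkv(x, wq, wk, wv):
    """One matmul against the concatenated [Wq | Wk | Wv], then split."""
    w = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]  # concat columns
    out = matmul(x, w)
    dq, dk = len(wq[0]), len(wk[0])
    q = [row[:dq] for row in out]
    k = [row[dq:dq + dk] for row in out]
    v = [row[dq + dk:] for row in out]
    return q, k, v
```

The fused result is mathematically identical to the three separate projections; the win on real hardware comes from launching one large kernel instead of three small ones.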
[NVIDIA/TransformerEngine]
- Key Activity:
- [2025-12-11] 🚨 RELEASE: v2.10
- Details:
- Added support for the NVFP4 training recipe for `GroupedLinear`.
- Added support for CUDA graphs when using quantized weights with Tensor Parallelism.
📦 JAX & Other
[jax-ml/jax]
- Key Activity:
- [2025-12-18] 🚨 RELEASE: jax-v0.8.2
- Details:
- Routine deprecations (e.g., `jax.lax.pvary`). No major feature shifts.
[huggingface/transformers]
- Key Activity:
- [2025-12-01] 🚨 RELEASE: v5.0.0rc0
- Details:
- Major Version: First major release in 5 years.
- Architecture: Shifted to a dynamic weight-loading API (`WeightConverter`). Consolidated tokenizer backends (moving away from separate “fast/slow” files).
- Clean-up: Removed the `load_in_4bit`/`load_in_8bit` arguments in favor of `quantization_config`. Removed `torchscript` and `torch.fx` support.
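A dynamic weight-loading API generally works by remapping checkpoint parameter names on the fly instead of shipping converted copies of every checkpoint. The sketch below shows that key-remapping idea with hypothetical rename rules; `RENAME_RULES` and `convert_keys` are illustrative and are not the actual `WeightConverter` interface.

```python
import re

# Hypothetical rename rules: old checkpoint key pattern -> new key template.
RENAME_RULES = [
    (re.compile(r"^transformer\.h\.(\d+)\."), r"model.layers.\1."),
    (re.compile(r"\.attn\.c_attn\."), r".self_attn.qkv_proj."),
]

def convert_keys(state_dict):
    """Rewrite old-style parameter names into a new naming scheme,
    applying every rule in order to each key."""
    out = {}
    for name, tensor in state_dict.items():
        for pattern, repl in RENAME_RULES:
            name = pattern.sub(repl, name)
        out[name] = tensor
    return out
```

Applying the rules in sequence lets one table handle both structural renames (layer prefixes) and per-module renames (attention projections) without touching the tensors themselves.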