GitHub Monthly Report: 2025-12-01 to 2025-12-31
Engineering Report (2025-12-01 - 2025-12-31)
Executive Summary
December 2025 was a high-activity month for AI engineering ecosystems, driven largely by the race to optimize and train the next generation of frontier architectures (DeepSeek-V3/R1 and Qwen3) alongside rapid advances in quantization (MXFP8, NVFP4). Maintenance health across the board is strong: major repositories (PyTorch, XLA, SGLang, and Primus) showed excellent PR closure rates (e.g., Primus closed 143/149 PRs, XLA closed 1143/1213), indicating robust CI/CD and active maintainer communities keeping pace with fast innovation cycles.
A coordinated push across PyTorch, TransformerEngine, and SGLang added support for complex MoE (Mixture of Experts) paradigms, fully sync-free operations, and compute/communication overlap mechanisms (Pipeline Parallelism / Expert Parallelism).
AMD Related Updates
- DeepSeek-V3 and Qwen3 on MI300X/MI355X: AMD-AGI/Primus shipped back-to-back major releases (v0.5.0 & v0.6.0) that directly bring DeepSeek-V3 (16B & 671B) and Qwen3 configurations to MI300X and MI355X hardware.
- MoE & Memory Optimizations: Primus successfully implemented fully sync-free MoE stage 3 and Megatron's All-to-All/DeepEP overlap in pipelines, heavily reducing communication overheads for massive MoE models.
- Ecosystem Integrations: TorchTitan added an automated ROCm workflow on cron schedule (Main branch). TileLang integrated FlashAttention-2 forward passes specifically for AMD MI300X and updated its CI to ROCm 7.1.
- New AMD Tools: AMD officially open-sourced the IRLens tool inside Primus to assist with profiling nested sub-computations and intermediate representations.
Competitive Analysis
- NVIDIA's Heavy Push for NVFP4/GB200: NVIDIA's Megatron-LM and TransformerEngine saw synchronized releases focusing heavily on NVFP4 (Zero Padding for MoE, GroupedLinear recipes) and expanding CUDA Graphs to cover quantized weights with Tensor Parallelism.
- PyTorch AO + Blackwell/Crusoe Benchmarks: The PyTorch/AO v0.15.0 release highlighted MXFP8 MoE training on a 64-node GB200 Crusoe cluster, citing a 1.2x end-to-end training speedup over BF16 with identical convergence. This sets a high performance bar for MXFP8 implementations going forward.
- SGLang Dominance in Serving: SGLang pushed three major releases, establishing industry-first WASM Middleware support, a Unified Inference Gateway (IGW) mode, and immediate optimization for DeepSeek V3.2. Their rapid adoption of new architectures makes them a formidable benchmark target for AMD's internal serving stacks.
Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-12-04] RELEASE: v0.5.0
- [2025-12-20] RELEASE: v0.6.0 (Building Docker v25.11)
- Details:
- Model Enablement: Added DeepSeek-V3 (16B & 671B), Qwen3 (0.6B/1.7B/32B), and Llama 3.3 configs for MI300X and MI355X architectures.
- MoE & Communications: Integrated Megatron's A2A and DeepEP overlap in pipeline; supported fully sync-free MoE stage 3.
- Turbo Integration: Megatron backend now uses PrimusTurboSpecProvider, added Turbo RMSNorm patches, and Turbo FP8 grouped GEMM.
- Tooling: Open-sourced the IRLens tool. Re-architected the Runner CLI with patch execution systems and dynamic model parameter overrides.
- Metrics: 149 PRs, 1 issue (very healthy maintenance: 143 PRs closed)
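The overlap work above boils down to pipelining: kick off the next chunk's token-dispatch All-to-All while the experts compute on the current chunk. A minimal pure-Python sketch of that scheduling idea, with stand-in functions (nothing below is Primus or Megatron API):

```python
import threading

def fake_all_to_all(tokens):
    # Stand-in for a token-dispatch All-to-All; the real thing is a collective.
    return list(reversed(tokens))

def expert_compute(chunk):
    # Stand-in for grouped-GEMM expert work on an already-dispatched chunk.
    return [t * 2 for t in chunk]

def overlapped_moe_step(chunks):
    """Overlap the All-to-All for chunk i+1 with expert compute on chunk i."""
    results = []
    dispatched = fake_all_to_all(chunks[0])
    for i in range(len(chunks)):
        comm_result = {}
        comm_thread = None
        if i + 1 < len(chunks):
            # Launch the next dispatch while we compute on the current chunk.
            # (Joined below before i changes, so capturing i is safe.)
            comm_thread = threading.Thread(
                target=lambda: comm_result.update(
                    next=fake_all_to_all(chunks[i + 1])))
            comm_thread.start()
        results.append(expert_compute(dispatched))
        if comm_thread:
            comm_thread.join()
            dispatched = comm_result["next"]
    return results
```

In the real implementations the "thread" is a separate compute/communication stream, and DeepEP replaces the dispatch with fused intranode/internode kernels.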
ROCm/ROCm & ROCm/MAD
- Key Activity:
- [2025-12-01 to 12-31] Heavy bug squashing and documentation updates for xDiT.
- Details:
- Investigating Memory Access Faults on gfx1151 (Strix Halo) and HSA Queue Creation failures on ARM64 RDNA3 GPUs.
- MAD repo merged PyTorch/Megatron-LM training v25.11 and multi-node support features.
- Metrics: ROCm: 65 PRs, 30 issues. MAD: 11 PRs, 1 issue.
AMD-AGI/TraceLens
- Key Activity:
- [2025-12-23] Added Rocprofv3 profile data support.
- Details:
- Enabled GPU-only trace support in perf report generation. Tracking feature requests for direct Sharepoint file loading.
- Metrics: 8 PRs, 2 issues
PyTorch Ecosystem
pytorch/torchtitan
- Key Activity:
- [2025-12-26] RELEASE: v0.2.1
- Details:
- Features: Rewrote parallel_dims using DeviceMesh unflatten. Integrated DeepEP for advanced MoE routing.
- Models: Added Context Parallelism to Flux model training and GPT-OSS enablement.
- ROCm: Added automated ROCm workflow on cron schedule.
- Metrics: 76 PRs, 20 issues
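An automated workflow on a cron schedule follows standard GitHub Actions syntax; a minimal sketch of the shape such a ROCm job takes (the file name, runner label, and test command below are illustrative, not torchtitan's actual workflow):

```yaml
# .github/workflows/rocm-nightly.yml  (hypothetical file name)
name: rocm-nightly
on:
  schedule:
    - cron: "0 6 * * *"    # daily at 06:00; GitHub cron runs in UTC
  workflow_dispatch: {}    # also allow manual triggering
jobs:
  test:
    runs-on: linux.rocm.gpu        # illustrative self-hosted runner label
    steps:
      - uses: actions/checkout@v4
      - run: python -m pytest tests/   # placeholder test command
```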
pytorch/ao
- Key Activity:
- [2025-12-22] RELEASE: v0.15.0
- Details:
- Performance: Highlighted MXFP8 MoE training on GB200 clusters yielding a 1.2x speedup over BF16.
- Features: Safetensors enablement for torchao models (integrated with HF and vLLM). Allowed FQN (Fully Qualified Name) specific parameter quantization via FqnToConfig for complex MoE models.
- BC Breaking: Cleaned up and standardized config names (e.g., int4_weight_only -> Int4WeightOnlyConfig).
- Metrics: 0 PRs (tracked externally), 18 issues
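FQN-keyed quantization maps fully qualified parameter names to per-module configs; the pattern-matching idea can be sketched in pure Python (the fnmatch patterns and config names here are illustrative, not torchao's exact FqnToConfig semantics):

```python
from fnmatch import fnmatch

# Hypothetical per-pattern config table, most-specific-first.
FQN_TO_CONFIG = [
    ("model.layers.*.mlp.experts.*", "int4_weight_only"),  # MoE expert weights
    ("model.layers.*.self_attn.*",   "int8_weight_only"),
    ("*",                            None),                # leave the rest alone
]

def config_for(fqn):
    """Return the first config whose FQN pattern matches (None = no quant)."""
    for pattern, cfg in FQN_TO_CONFIG:
        if fnmatch(fqn, pattern):
            return cfg
    return None
```

In torchao the matched values are config objects such as Int4WeightOnlyConfig rather than strings, but the FQN-to-config lookup is the same idea.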
pytorch/pytorch
- Key Activity:
- [2025-12-01 to 12-31] Ongoing core maintenance.
- Details:
- Merged ROCm fixes for AMDSMI return types. Tracking Inductor Aot-Autograd fusion bugs and Varlen attention window exposure.
- Metrics: 1811 PRs, 500 issues (healthy: 1732 PRs closed)
meta-pytorch/monarch
- Key Activity:
- [2025-12-22] RELEASE: v0.2.0
- Details:
- Focused on correctness, robustness, and K8s readiness. Strict supervision hierarchy enforced, HostMesh lifecycle control refined, and legacy v0 completely removed.
NVIDIA Ecosystem
NVIDIA/Megatron-LM
- Key Activity:
- [2025-12-17] RELEASE: core_v0.15.0
- Details:
- Performance: Fused QKV preprocessing with precomputed RoPE caches yielding 10-14% E2E speedup. Added CPU activation offloading via TE.
- MoE: DTensor support for EP and DSv3 modules. Implemented NVFP4 Zero Padding for MoE.
- FSDP: Enabled joint training of parallel modules.
- Inference: Added CUDA Graph runner lookup table cache (up to 2x E2E speedup).
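The precomputed-RoPE-cache optimization amounts to hoisting the cos/sin tables out of the per-layer hot path so every layer reuses one cache. A pure-Python sketch of the rotation and its cache (illustrative only; Megatron fuses this with QKV preprocessing on-device):

```python
import math

def build_rope_cache(seq_len, head_dim, base=10000.0):
    """Precompute cos/sin once per (seq_len, head_dim); reused by every layer."""
    half = head_dim // 2
    inv_freq = [base ** (-2 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(seq_len)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(seq_len)]
    return cos, sin

def apply_rope(x, pos, cos, sin):
    """Rotate pairs (x[2i], x[2i+1]) by the cached angle for position `pos`."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        c, s = cos[pos][i], sin[pos][i]
        out[2 * i]     = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out
```

Because the rotation is orthogonal, it preserves vector norms; the speedup comes purely from not recomputing the trig tables per layer and per step.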
NVIDIA/TransformerEngine
- Key Activity:
- [2025-12-11] RELEASE: v2.10
- Details:
- PyTorch: Added NVFP4 training recipe for GroupedLinear. CUDA graphs now supported for quantized weights + Tensor Parallelism. Added SWA (Sliding Window Attention) with Context Parallelism.
- JAX: Added support for concurrent Data Parallelism (DP) and Fully-Sharded Data Parallelism (FSDP).
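Sliding Window Attention limits each query to a band of recent keys; with Context Parallelism that band is additionally sharded across ranks. The causal banded mask itself is simple to state (a pure-Python illustration, not TransformerEngine's implementation):

```python
def swa_mask(seq_len, window):
    """Causal sliding-window mask: query i may attend keys in (i-window, i]."""
    return [[i - window < j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```

Each row has at most `window` True entries, which is what caps attention cost at O(seq_len * window) instead of O(seq_len^2).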
JAX / XLA Ecosystem
jax-ml/jax
- Key Activity:
- [2025-12-18] RELEASE: jax-v0.8.2
- Details:
- Deprecated jax.lax.pvary and all symbols in jax.interpreters.pxla.
- Tracer objects no longer inherit from jax.Array at runtime.
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-12-12] RELEASE: maxtext-tutorial-v1.4.0
- [2025-12-30] RELEASE: maxtext-tutorial-v1.5.0
- Details:
- Added Muon optimizer integration. Merged packing support for Context Parallelism (Ring Attention).
- Metrics: 131 PRs, 12 issues (healthy)
openxla/xla
- Key Activity:
- [2025-12-01 to 12-31] High volume optimization and automated refactoring.
- Details:
- Tracking a severe segmentation fault in HLO HTML dumper due to stale fusion states.
- Metrics: 1213 PRs, 16 issues (highly active: 1143 PRs closed)
Serving, Inference & RLHF Ecosystem
sgl-project/sglang
- Key Activity:
- [2025-12-03] RELEASE: v0.5.6
- [2025-12-10] RELEASE: gateway-v0.2.4
- [2025-12-24] RELEASE: gateway-v0.3.0
- Details:
- Gateway v0.3.0: A major architectural shift: released a Unified Inference Gateway (IGW) mode to handle entire fleets from a single gateway. Added UUID-based Worker Resource Management and a completely redesigned 6-layer metrics architecture.
- Gateway v0.2.4: Industry-first WASM Middleware support. Full OpenTelemetry integration.
- Engine v0.5.6: Day-0 support for DeepSeek V3.2 / Speciale. Integrated JIT kernels. Support for new blockwise diffusion language models (Flux2, Z-image).
- Metrics: High PR volume (300+ merged across these releases)
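UUID-based worker management decouples a worker's identity from its network address, so workers can restart, move, or share a URL without confusing the gateway. A minimal registry sketch (names and fields are illustrative, not the sglang gateway API):

```python
import uuid

class WorkerRegistry:
    """Track inference workers by opaque UUID rather than by URL."""

    def __init__(self):
        self._workers = {}

    def register(self, url, model):
        worker_id = str(uuid.uuid4())   # identity decoupled from address
        self._workers[worker_id] = {"url": url, "model": model, "healthy": True}
        return worker_id

    def deregister(self, worker_id):
        return self._workers.pop(worker_id, None)

    def healthy_workers(self, model):
        return [wid for wid, w in self._workers.items()
                if w["model"] == model and w["healthy"]]
```

Two workers behind the same URL (e.g. during a rolling restart) get distinct IDs, which is the property a URL-keyed registry cannot provide.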
THUDM/slime
- Key Activity:
- [2025-12-01] RELEASE: v0.2.0.post1
- [2025-12-12] RELEASE: v0.2.1
- Details:
- Added true on-policy training on Qwen3-VL (dense) using VLM + FSDP.
- Integrated PD-disaggregation support during rollout and DP-attention support in rollout routing replay (R3). Upgraded backend dependency to SGLang v0.5.6.
xdit-project/xDiT
- Key Activity:
- [2025-12-01 to 12-31] Model and backend expansion.
- Details:
- Added support for HunyuanVideo-1.5 and Z-Image Turbo. Integrated custom attention backend support.
- Metrics: 7 PRs, 3 issues
Compilers & Low-Level Optimizations
tile-ai/tilelang
- Key Activity:
- [2025-12-07] RELEASE: v0.1.7
- [2025-12-24] RELEASE: v0.1.7.post1
- [2025-12-31] RELEASE: v0.1.7.post2
- Details:
- Hardware Adapters: Enabled FlashAttention-2 forward on AMD MI300X, fixed MLA autotune for ROCm, updated CI to ROCm-7.1, and introduced Huawei Ascend support.
- Core Enhancements: Added CuTeDSL backend, integrated Z3 in TVM Arith Analyzer, added JIT lazy execution, and supported FP8 to FP32 vectorized casts.
- Metrics: 173 PRs, 55 issues (healthy: 166 PRs closed)
triton-lang/triton
- Key Activity:
- [2025-12-01 to 12-31] General compiler passes and hardware specific lowering.
- Details:
- AMD WarpPipeliner received support for single block execute regions in UpdateAsyncWaitCount.
- Tracking frontend bugs producing invalid IR with SSA values in constant-range loops.
- Metrics: 238 PRs, 33 issues (healthy: 232 PRs closed)
deepspeedai/DeepSpeed
- Key Activity:
- [2025-12-09] RELEASE: v0.18.3
- Details:
- Added Muon optimizer support (allowing separate muon_lr and adam_lr).
- Relaxed tolerances for FP8 unit tests specifically for ROCm (FP16 and BF16 cases). Added Qwen2.5 to the AutoTP model list.
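The split learning rates suggest a DeepSpeed config shaped roughly like the sketch below. The muon_lr/adam_lr key names come from the release note above; the surrounding structure is the standard DeepSpeed optimizer block, and the values are placeholders to be checked against the v0.18.3 docs:

```json
{
  "train_batch_size": 256,
  "optimizer": {
    "type": "Muon",
    "params": {
      "muon_lr": 2e-2,
      "adam_lr": 3e-4,
      "weight_decay": 0.1
    }
  },
  "bf16": { "enabled": true }
}
```

Muon applies to 2-D weight matrices, so the separate adam_lr covers the parameters (embeddings, norms, biases) that fall back to Adam-style updates.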