GitHub Monthly Report: 2026-01-01 to 2026-01-31
📊 Engineering Report (2026-01-01 - 2026-01-31)
📋 Executive Summary
January 2026 was characterized by a massive push in foundational infrastructure to support next-generation hardware (NVIDIA Blackwell and AMD ROCm 7.0/MI350X) and aggressive optimization in model serving engines. vLLM released v0.14.0, making async scheduling the default, while SGLang released v0.5.7 with significant throughput gains and a new Gateway (v0.3.1). FBGEMM released v1.4.0, serving as a critical enablement layer for both NVIDIA B200 and AMD MI350X architectures. Quantization (specifically FP8 and NVFP4) remains the primary optimization vector across all libraries.
AMD-Related Updates
- FBGEMM GPU v1.4.0 Support: The new release of FBGEMM includes significant AMD-specific updates, including support for ROCm 7.0, the gfx950 architecture, and MI350X FP8 Triton patches. This is a critical upstream dependency update for PyTorch performance on AMD hardware.
- Serving Engine Optimizations:
- vLLM: The v0.14.0 release includes AITER RMSNorm fusion and MTP support for AITER MLA on ROCm.
- SGLang: Added support for AMD GPUs in SGLang-Diffusion, fixed regressions for DeepSeekV3 on MI300, and re-enabled AITER kernels.
- TileLang & Primus: `tile-ai/tilelang` added preshuffle FP8 GEMM examples for AMD and opened ROCm CI testing. `AMD-AGI/Primus` added FP8 support to GEMM benchmarks.
- Gap Analysis: A new issue in `pytorch/ao` (TorchAO) highlights "Unit Test Gaps for ROCm," indicating an area requiring engineering focus to ensure parity with CUDA.
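To make that gap concrete, here is a minimal sketch of the kind of backend-parity unit test such an issue calls for, in plain PyTorch/pytest; the op and tolerance are illustrative, not taken from the TorchAO issue.

```python
import pytest
import torch

# torch.version.hip is set only on ROCm builds; both backends share the
# torch.cuda.* API surface, so one test body can cover CUDA and ROCm.
BACKEND = "rocm" if torch.version.hip else "cuda"

@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA or ROCm GPU")
def test_matmul_matches_cpu_reference():
    x = torch.randn(64, 64)
    y = torch.randn(64, 64)
    out_gpu = (x.cuda() @ y.cuda()).cpu()
    # Same tolerance for both backends: a ROCm result that needs a looser
    # bound than CUDA is exactly the kind of gap the issue is about.
    torch.testing.assert_close(out_gpu, x @ y, rtol=1e-4, atol=1e-4,
                               msg=f"{BACKEND} matmul diverged from CPU reference")
```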
Competitive Analysis
- NVIDIA Blackwell (B200/GB200) Readiness (a capability-gating sketch follows this list):
- FBGEMM: Added Cutlass FMHA kernels and grouped GEMM optimizations specifically for Blackwell (B200).
- vLLM: Added B300 Blackwell MoE configs and Qwen3Moe B200 Triton configs.
- TransformerEngine: v2.11 improved NVFP4 quantization (a key Blackwell feature) and added JAX Triton bindings.
- JAX Ecosystem: JAX v0.9.0 was released with `jax.thread_guard` for multi-controller safety. NVIDIA continues to tighten JAX integration via TransformerEngine bindings.
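For orientation, a hedged sketch of how downstream libraries typically gate these architecture-specific paths, using standard PyTorch device-query APIs. The mapping of compute capability 10.x to B200-class parts (SM100/SM103) and the gfx architecture names are assumptions drawn from the items above, not from any one repo's dispatch code.

```python
import torch

def gpu_arch_family() -> str:
    """Rough device triage of the kind serving engines use to pick kernels."""
    if not torch.cuda.is_available():
        return "cpu"
    if torch.version.hip is not None:
        # ROCm: gcnArchName reports e.g. "gfx942" (MI300) or "gfx950" (MI350X).
        return torch.cuda.get_device_properties(0).gcnArchName
    major, minor = torch.cuda.get_device_capability(0)
    if major >= 10:   # assumption: Blackwell data-center parts (SM100/SM103)
        return "blackwell"
    if major == 9:    # Hopper (SM90)
        return "hopper"
    return f"sm_{major}{minor}"

# Illustrative dispatch: enable NVFP4 kernels only on Blackwell-class devices.
# if gpu_arch_family() == "blackwell": ...
```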
📂 Category Updates
🟢 AMD Ecosystem
- Repos: `AMD-AGI/Primus`, `AMD-AGI/TraceLens`, `AMD-AGI/GEAK-agent`, `ROCm/ROCm`, `ROCm/MAD`
- Key Activity:
- [2026-01-14] Primus: Added FP8 Support to GEMM Benchmarks.
- [2026-01-16] TraceLens: Documentation updates for performance reporting columns.
- [2026-01-xx] ROCm/MAD: Refactored vLLM integration and added Qwen/Qwen3-Coder models.
- Details:
- ROCm/ROCm: Significant user-reported issues regarding Windows 11 (RX 7900 XTX) crashes and PyTorch "No HIP GPUs available" errors suggest installation/driver stability friction on consumer hardware (a triage snippet follows this section).
- TraceLens: Identified issues with the performance model for attention implementations and multi-node breakdowns in collective reports.
- Metrics: 67 New PRs / 68 Closed PRs (Healthy velocity)
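Given those "No HIP GPUs available" reports, a minimal triage snippet using only standard PyTorch APIs; the group-membership note reflects common Linux setup guidance, not a diagnosis of the specific issues above.

```python
import torch

# 1. Was this wheel built against ROCm at all? CPU-only and CUDA wheels
#    both report hip=None, which leads to "No HIP GPUs available" errors.
print("torch:", torch.__version__, "| hip:", torch.version.hip)

# 2. Does the runtime actually see a GPU? On ROCm this still goes
#    through the torch.cuda namespace.
print("gpu visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

# If hip is set but no GPU is visible, check the driver install and that
# the user is in the `render`/`video` groups (a common Linux pitfall);
# Windows consumer-card support is where the ROCm/ROCm reports cluster.
```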
🔥 PyTorch Ecosystem
- Repos: `pytorch/pytorch`, `pytorch/torchtitan`, `pytorch/ao`, `pytorch/FBGEMM`
- Key Activity:
- ๐จ [2026-01-09] FBGEMM: Release v1.4.0 (Major Feature Release).
- [2026-01-xx] pytorch/pytorch: High-volume maintenance (1000+ PRs); addressed Dynamo ONNX export issues and ROCm CI environment variable fixes (`rocm_env.sh`).
- [2026-01-xx] torchtitan: Draft PR for a LoRA converter and support for Kimi 1t.
- Details:
- FBGEMM v1.4.0 Highlights:
- Blackwell: B200 architecture support via Cutlass FMHA kernels.
- ROCm: ROCm 7.0 support, MI350X FP8 Triton patches, and gfx950 support.
- Quantization: FP8 embedding weights support, MXFP4 quantization with inline PTX (an FP8 numerics sketch follows this section).
- TorchAO: Identified a caching issue in which lingering CUDA state interferes with device placement under Ray, plus gaps in ROCm unit-test coverage.
- Metrics: 1115 New PRs / 1095 Closed PRs (Very High Velocity)
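As context for the FP8 items above, a minimal sketch of row-wise FP8 (E4M3) weight quantization in plain PyTorch. FBGEMM's production kernels (and its MXFP4/inline-PTX path) are far more involved; treat this as a reference for the numerics only.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_rowwise_fp8(w: torch.Tensor):
    """Per-row scales keep outlier rows from wrecking every other row's range."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale  # store the fp8 payload plus fp32 (or fp16) scales

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_fp8.to(torch.float32) * scale

w = torch.randn(1000, 128)               # e.g. an embedding table
w_fp8, scale = quantize_rowwise_fp8(w)
err = (dequantize(w_fp8, scale) - w).abs().max()
print(f"max abs round-trip error: {err:.4f}")
```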
🤖 NVIDIA Ecosystem
- Repos: `NVIDIA/Megatron-LM`, `NVIDIA/TransformerEngine`
- Key Activity:
- ๐จ [2026-01-15] TransformerEngine: Release v2.11.
- [2026-01-08] Megatron-LM: Release core_v0.15.2.
- Details:
- TransformerEngine v2.11:
- NVFP4: Improved Random Hadamard Transform (RHT) caching and quantization performance.
- JAX: Added Triton kernel bindings for JAX, enabling custom Triton kernels in JAX workflows.
- FP8: Enabled the reference Current Scaling recipe for FP8 training (a sketch of the recipe follows this section).
- Metrics: 0 New PRs / 0 Closed PRs (Stable/maintenance mode in public repo)
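"Current scaling" means deriving the FP8 scale from the tensor's live amax rather than from a delayed amax history. A minimal sketch of both recipes in plain PyTorch; the function names are illustrative and TransformerEngine's actual implementation differs.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def current_scaling(x: torch.Tensor) -> torch.Tensor:
    # Scale from this tensor's own amax: no history and no staleness,
    # at the cost of an extra reduction before every cast.
    return x.abs().amax().clamp(min=1e-12) / FP8_MAX

def delayed_scaling(amax_history: torch.Tensor) -> torch.Tensor:
    # Classic FP8 recipe: reuse the max of recent iterations' amaxes, which
    # keeps the reduction off the critical path but can lag sudden spikes.
    return amax_history.max().clamp(min=1e-12) / FP8_MAX

x = torch.randn(4096, 4096)
x_fp8 = (x / current_scaling(x)).to(torch.float8_e4m3fn)
```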
⚡ Serving & Inference
- Repos: `vllm-project/vllm`, `sgl-project/sglang`, `volcengine/verl`, `THUDM/slime`
- Key Activity:
- ๐จ [2026-01-20] vLLM: Release v0.14.0.
- ๐จ [2026-01-01] SGLang: Release v0.5.7 (followed by Gateway v0.3.1).
- [2026-01-05] Verl: Release v0.7.0.
- [2026-01-18] Slime: Release v0.2.2.
- Details:
- vLLM v0.14.0:
- Breaking: Async scheduling is now enabled by default (a usage sketch follows this section).
- Hardware: Support for SM103, B300 Blackwell MoE configs, and Qwen3Moe B200 Triton configs.
- ROCm: AITER RMSNorm fusion and MTP for AITER MLA support.
- SGLang v0.5.7:
- Performance: 10-12x improvement in cache-aware routing via Radix Tree optimizations (a toy routing sketch follows this section).
- AMD: Set `--dit-layerwise-offload true` to reduce peak VRAM usage; significant latency reduction for `Qwen-Image-Edit`.
- Verl v0.7.0: Integrated Megatron-Bridge, removed SPMD rollout mode, and added support for VLA models.
- Slime v0.2.2: Added Int4-QAT training and full Rollout Routing Replay (R3) support with DeepEP.
- Metrics: High activity across all serving repos; vLLM notably merged ~660 commits.
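For upgraders, the async-scheduling default mostly requires no action; a minimal offline-inference sketch against the stable vLLM API, with the model name illustrative and the opt-out spelling flagged as an assumption to verify on your build.

```python
from vllm import LLM, SamplingParams

# v0.14.0: async scheduling (overlapping CPU-side scheduling with GPU
# execution) is on by default; no flag is needed for the new behavior.
llm = LLM(model="Qwen/Qwen3-8B")  # model choice illustrative
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize January's vLLM release in one line."], params)
print(outputs[0].outputs[0].text)

# If a workload misbehaves under the new scheduler, recent builds expose an
# opt-out (engine arg `async_scheduling` / CLI `--async-scheduling`); treat
# the exact spelling as an assumption and confirm against `vllm serve --help`.
```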
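The cache-aware routing gain comes from matching each request's token prefix against a per-worker radix tree and sending the request to the worker with the longest cached prefix. A toy Python sketch of that idea; an uncompressed trie stands in for SGLang's radix tree, and all names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # token -> TrieNode

class PrefixIndex:
    """Toy per-worker prefix index approximating the Gateway's radix tree."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_len(self, tokens) -> int:
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, n = node.children[t], n + 1
        return n

def route(workers: dict, tokens) -> str:
    # Pick the worker whose KV cache already covers the most prefix tokens.
    return max(workers, key=lambda w: workers[w].match_len(tokens))

workers = {"worker-a": PrefixIndex(), "worker-b": PrefixIndex()}
workers["worker-a"].insert([1, 2, 3, 4])
assert route(workers, [1, 2, 3, 9]) == "worker-a"  # 3-token cached prefix wins
```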
🛠️ Compilers & Languages
- Repos: `triton-lang/triton`, `tile-ai/tilelang`, `jax-ml/jax`
- Key Activity:
- ๐จ [2026-01-20] JAX: Release v0.9.0.
- [2026-01-18] TileLang: Release v0.1.7.post3.
- Details:
- JAX v0.9.0: Added `jax.thread_guard` for multi-controller safety. Removed `jax_pmap_no_rank_reduction` (no-rank-reduction is now the default).
- TileLang: Added conversion from `cutlass::float_e4m3` to `tl::float_e4m3`, plus a preshuffle FP8 GEMM example on AMD and newly opened ROCm CI testing (an FP8-downcast kernel sketch follows this section).
- Metrics: TileLang active with 35+ PRs merged.
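Both the TileLang FP8 GEMM example and the `cutlass::float_e4m3` to `tl::float_e4m3` conversion revolve around in-kernel 8-bit float casts. A minimal hedged Triton sketch of a scale-and-downcast kernel; `tl.float8e4nv` availability and the fp8 store path depend on GPU architecture and Triton version.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_to_fp8(x_ptr, out_ptr, scale, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Divide by the scale, then downcast; tl.float8e4nv corresponds to
    # torch.float8_e4m3fn (support varies by GPU arch / Triton build).
    y = (x / scale).to(tl.float8e4nv)
    tl.store(out_ptr + offs, y, mask=mask)

def to_fp8(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    scale = x.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    scale_to_fp8[grid](x, out, scale.item(), n, BLOCK=1024)
    return out
```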
🤗 Transformers & Models
- Repos: `huggingface/transformers`
- Key Activity:
- [2026-01-16] Transformers: Release v4.57.6 (Patch).
- [2026-01-08] Transformers: Release v5.0.0rc2.
- Details:
- v5.0.0 RC: Focus on MoE performance (batched and grouped experts), dynamic weight loading (faster loading on device), and quantization fixes (FP8/FBGEMM/Quanto support).
- Breaking Change: The default `dtype` for `from_pretrained` is now `auto`, so checkpoints load in their stored precision instead of being upcast to fp32 (a sketch follows this section).
- Metrics: Steady patch release cycle preparing for major v5.0 launch.
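In practice the `dtype="auto"` default means checkpoints now load in whatever precision they were saved in; pinning the old fp32 behavior is one kwarg. A minimal sketch; whether your installed release spells the argument `dtype` or the older `torch_dtype` should be verified against its documentation.

```python
import torch
from transformers import AutoModelForCausalLM

# v5 default: weights load in the dtype stored in the checkpoint
# (e.g. bf16 for most modern releases) rather than being upcast to fp32.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# To keep the old v4 behavior, pin the dtype explicitly. Newer releases
# accept `dtype=`; older ones spell it `torch_dtype=`.
model_fp32 = AutoModelForCausalLM.from_pretrained("gpt2", dtype=torch.float32)
```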