GitHub Monthly Report: 2026-01-01 to 2026-01-31
Executive Summary
January 2026 was a pivotal month for the AI infrastructure ecosystem, marked by major version releases across almost every foundational library. PyTorch released v2.10, introducing Python 3.14 support and significant Intel/AMD backend upgrades. ROCm 7.2.0 launched with support for RDNA4 architecture and MI350/MI355 optimizations. HuggingFace Transformers hit v5.0, a massive refactor focusing on dynamic weight loading and MoE performance. In the serving layer, vLLM (v0.15) and SGLang (v0.5.8) aggressively optimized for DeepSeek V3/V3.2 architectures and NVIDIA Blackwell hardware.
AMD Related Updates
- ROCm 7.2.0 Release: Introduced support for RDNA4-based GPUs (Radeon AI PRO, RX 9060) and extended OS support for Instinct MI355X/MI350X.
- PyTorch 2.10 Integration: Native support for ROCm 7.0 and 7.1, along with specific optimizations for MI300/MI355 architectures and hipBLASLt GEMM improvements.
- Triton 3.6.0: Added initial skeleton support for GFX1250 (RDNA4), improved WMMA support, and switched to official ROCm 7 docker images.
- Inference Optimizations: vLLM and SGLang both pushed significant AMD-specific updates, including a high-performance all2all backend for Expert Parallelism (MoRI EP) on ROCm and fixes for DeepSeek V3 on MI300.
- FBGEMM Support: FBGEMM v1.4.0/1.5.0 added specific optimizations for MI350 embedding forward/backward passes and updated build scripts for AMD CPUs/GPUs.
Competitive Analysis
- NVIDIA Blackwell Readiness: The software stack is now actively preparing for B200/Blackwell. FBGEMM, vLLM, and Triton all merged architecture-specific optimizations (TMA support, FP4 quantization, Blockwise GEMM).
- Intel XPU Surge: PyTorch 2.10 and Transformers v5.0 included a massive amount of Intel XPU code, enabling Flash Attention 2, scaled matmuls, and custom operators on Windows, signaling a strong push for Intel's discrete GPU ecosystem.
- Transformers v5.0: HuggingFace's major version bump significantly refactored tokenization (moving to Rust-backed fast tokenizers by default) and optimized Mixture-of-Experts (MoE) inference, directly benefiting complex architectures like DeepSeek and Mixtral.
Category Updates
AMD Ecosystem
[ROCm/ROCm]
- Key Activity:
- [2026-01-23] RELEASE: rocm-7.2.0
- Details:
- New Hardware: Support for RDNA4 (Radeon AI PRO R9600D, RX 9060 XT LP) and RDNA3 (RX 7700).
- Software: Introduced Node Power Management (NPM) for multi-GPU nodes (MI355X/MI350X).
- Libraries: hipTensor added a software-managed plan cache; rocSHMEM added a GPUDirect Async backend; MIGraphX added MXFP8/MXFP4 support.
- Deprecation: ROCm Offline Installer Creator is deprecated in favor of the Runfile Installer.
- Metrics: 44 PRs, 50 Issues
[AMD-AGI/Primus]
- Key Activity:
- [2026-01-29] Docs updated to use ./primus-cli.
- Details:
- New Megatron-LM SFT trainer added with offline datasets.
- Discussions regarding compatibility with ROCm 6.2+.
- Metrics: 58 PRs, 1 Issue
[AMD-AGI/TraceLens]
- Key Activity:
- [2026-01-16] Added comprehensive performance report documentation.
- Details:
- Work on TraceDiff algorithm for Python functions.
- Issue flagged regarding broken Performance Model for attention implementations.
- Metrics: 20 PRs, 5 Issues
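The TraceDiff idea, comparing per-function timings between two trace runs to surface regressions, can be sketched in a few lines. This is a hypothetical illustration; the function name `trace_diff` and the dict-of-timings input shape are assumptions, not the TraceLens API.

```python
def trace_diff(baseline, candidate, threshold_ms=0.0):
    """Compare per-function timings (ms) between two trace runs and
    report functions whose time changed by more than threshold_ms.
    Hypothetical sketch; not the actual TraceLens TraceDiff API."""
    diffs = {}
    for fn, base_ms in baseline.items():
        if fn in candidate:
            delta = candidate[fn] - base_ms
            if abs(delta) > threshold_ms:
                diffs[fn] = delta
    return diffs

base = {"attention": 12.0, "mlp": 8.0, "layernorm": 1.0}
cand = {"attention": 15.5, "mlp": 8.0, "layernorm": 0.8}
print(trace_diff(base, cand, threshold_ms=0.1))  # attention regressed, layernorm improved
```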
PyTorch Ecosystem
[pytorch/pytorch]
- Key Activity:
- [2026-01-21] RELEASE: v2.10.0
- Details:
- Python 3.14 Support: torch.compile() now supports Python 3.14.
- Intel XPU: Major expansion of XPU support (FP8, scaled matmul).
- ROCm: Enabled grouped GEMM via CK, added torch.version.rocm, and updated to support ROCm 7.0/7.1.
- Features: New varlen_attn() op, efficient eigenvalue decompositions, and a new ComplexTensor subclass.
- Metrics: 1895 PRs, 509 Issues
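Variable-length ("varlen") attention operates on packed sequences: tokens from several sequences are concatenated and cumulative sequence lengths mark the boundaries, so no padding is needed. A pure-Python sketch of the pattern using scalar per-token features; this is illustrative only and not the PyTorch `varlen_attn()` API.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def varlen_attention(q, k, v, cu_seqlens):
    """Attention over packed sequences: cu_seqlens holds cumulative
    sequence lengths, and each token attends only within its own
    segment. Scalar per-token features for simplicity."""
    out = [0.0] * len(q)
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        for i in range(start, end):
            scores = [q[i] * k[j] for j in range(start, end)]
            weights = softmax(scores)
            out[i] = sum(w * v[j] for w, j in zip(weights, range(start, end)))
    return out

# Two packed sequences of lengths 3 and 2
q = [1.0, 0.5, 0.2, 0.9, 0.1]
k = [0.3, 0.8, 0.5, 0.4, 0.7]
v = [1.0, 2.0, 3.0, 4.0, 5.0]
print(varlen_attention(q, k, v, [0, 3, 5]))
```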
[pytorch/vision]
- Key Activity:
- [2026-01-21] RELEASE: v0.25.0
- Details:
- Compatible with PyTorch 2.10.
- Added SanitizeKeyPoints transform.
- Fixed GIF decoder issues.
- Metrics: 0 PRs, 0 Issues
[pytorch/FBGEMM]
- Key Activity:
- [2026-01-27] RELEASE: v1.5.0
- [2026-01-09] RELEASE: v1.4.0
- Details:
- Blackwell: Enabled CUDA 13 builds, added Paged Attention for FMHA CUTLASS Blackwell.
- AMD: Added MI350 performance optimizations; updated OSS build scripts for AMD.
- Quantization: MXFP8 grouped GEMM improvements.
- Metrics: 0 PRs, 0 Issues
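Paged attention, which FBGEMM now provides for the FMHA CUTLASS Blackwell path, stores the KV cache in fixed-size physical pages indexed through a per-sequence page table, avoiding large contiguous allocations. A toy single-process sketch of the bookkeeping (names and layout are illustrative, not FBGEMM's API):

```python
PAGE_SIZE = 4

class PagedKVCache:
    """Toy paged KV cache: logical token positions map to fixed-size
    physical pages via a per-sequence page table."""
    def __init__(self):
        self.pages = []        # physical pages, each a list of KV entries
        self.page_table = {}   # seq_id -> list of page indices

    def append(self, seq_id, kv):
        table = self.page_table.setdefault(seq_id, [])
        # Allocate a fresh page when the sequence has none or its last is full
        if not table or len(self.pages[table[-1]]) == PAGE_SIZE:
            self.pages.append([])
            table.append(len(self.pages) - 1)
        self.pages[table[-1]].append(kv)

    def tokens(self, seq_id):
        """Reassemble the logical (contiguous) view of a sequence."""
        return [kv for p in self.page_table.get(seq_id, []) for kv in self.pages[p]]

cache = PagedKVCache()
for t in range(6):
    cache.append("seq0", t)
print(cache.tokens("seq0"))           # logical view: [0, 1, 2, 3, 4, 5]
print(len(cache.page_table["seq0"]))  # 2 pages: one full, one half-full
```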
[meta-pytorch/monarch]
- Key Activity:
- [2026-01-30] RELEASE: v0.3.0
- Details:
- Kubernetes: Added KubernetesJob API for distributed training on K8s.
- SPMD: Added interactive SPMD development workflow via monarch.job.spmd.
- Performance: Experimental Queue Dispatch Mode for Rust-Python interop.
- Metrics: 0 PRs, 0 Issues
HuggingFace & Transformers
[huggingface/transformers]
- Key Activity:
- [2026-01-26] RELEASE: v5.0.0
- [2026-01-16] RELEASE: v4.57.6
- Details:
- Major Refactor: Dynamic weight loading API (WeightConverter), simplified tokenization (consolidated backends).
- MoE: Significant performance improvements for Mixture-of-Experts models.
- Defaults: dtype now defaults to "auto" (respects saved format) instead of float32; report_to defaults to "none".
- New Models: SAM3, LFM2 MoE, VideoLlama 3, GLM-ASR, GLM 4.7 Flash.
- Deprecation: Removed torchscript and torch.fx support.
- Metrics: 468 PRs, 119 Issues
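The MoE speedups center on the router, which sends each token to a small top-k subset of experts and renormalizes their gate weights. A toy version of top-k gating (illustrative only; not the Transformers implementation):

```python
import math

def topk_route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their
    softmax gate weights over just those experts. Toy sketch of the
    routing used by MoE models such as Mixtral or DeepSeek."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(idx, exps)]

# Token with 4 candidate experts; experts 1 and 3 win
print(topk_route([0.1, 2.0, -1.0, 1.5], k=2))
```

Only the chosen k experts run their FFN for that token, which is where the compute savings (and the optimization pressure) come from.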
Inference & Serving
[vllm-project/vllm]
- Key Activity:
- [2026-01-29] RELEASE: v0.15.0
- [2026-01-20] RELEASE: v0.14.0
- Details:
- Core: Async scheduling enabled by default. PyTorch 2.9.1 required.
- Models: DeepSeek V3/V3.2 support, Molmo2, GLM-Lite.
- Hardware: FlashInfer MLA is default backend on Blackwell. High-performance MoRI EP backend for AMD ROCm.
- Quantization: Deprecated the DeepSpeedFp8, RTN, and HQQ quantization methods.
- Metrics: 0 PRs, 0 Issues
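The idea behind async scheduling is to overlap request intake with batch execution instead of blocking on one request at a time. A toy asyncio sketch of the pattern (illustrative only; not the vLLM scheduler):

```python
import asyncio

async def scheduler(queue, results):
    """Toy async scheduler: repeatedly drain whatever requests have
    arrived and run them as one batch, so intake overlaps execution."""
    while True:
        batch = [await queue.get()]
        while not queue.empty():
            batch.append(queue.get_nowait())
        done = batch[-1] is None  # sentinel marks shutdown
        if done:
            batch.pop()
        for req in batch:
            results.append(req.upper())  # stand-in for model execution
        if done:
            return

async def main():
    queue, results = asyncio.Queue(), []
    worker = asyncio.create_task(scheduler(queue, results))
    for req in ["hello", "world"]:
        await queue.put(req)
    await queue.put(None)  # sentinel: no more requests
    await worker
    return results

print(asyncio.run(main()))  # ['HELLO', 'WORLD']
```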
[sgl-project/sglang]
- Key Activity:
- [2026-01-23] RELEASE: v0.5.8
- [2026-01-09] RELEASE: gateway-v0.3.1
- Details:
- Optimization: 1.5x faster diffusion, 65% faster TTFT for GLM4-MoE.
- DeepSeek: Optimized Context Parallelism for DeepSeek V3.2.
- Routing: Cache-aware routing updated to be 10-12x faster with 99% less memory usage (v0.3.1).
- Metrics: 0 PRs, 0 Issues
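Cache-aware routing sends a request to the worker whose KV cache already covers the longest prefix of the prompt, maximizing cache reuse. A minimal sketch of the matching logic (data layout and names are assumptions, not the sgl-router implementation):

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_caches):
    """Toy cache-aware router: pick the worker whose cached sequences
    share the longest prefix with the request (hypothetical sketch)."""
    return max(range(len(worker_caches)),
               key=lambda i: max((shared_prefix_len(request_tokens, c)
                                  for c in worker_caches[i]), default=0))

caches = [
    [[1, 2, 3]],        # worker 0 has a 3-token prefix cached
    [[1, 2, 3, 4, 5]],  # worker 1 has a longer matching prefix
]
print(route([1, 2, 3, 4, 9], caches))  # -> 1
```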
Compilers & Kernels
[triton-lang/triton]
- Key Activity:
- [2026-01-21] RELEASE: v3.6.0
- Details:
- AMD: Initial skeleton support for GFX1250 (RDNA4), improved WMMA support, and async copy support.
- NVIDIA: Added native FP4 scaled dot and MXFP FP8 scaled dot for SM120 (Blackwell).
- Proton: New profiling tool features including global memory support and intra-kernel call stacks.
- Metrics: 186 PRs, 34 Issues
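Triton's FP4 scaled-dot support, like MIGraphX's MXFP4 path in ROCm 7.2, builds on microscaling formats: a block of elements shares one power-of-two scale, and each element is stored as a 4-bit FP4 (E2M1) value. A toy quantize/dequantize sketch assuming the standard E2M1 value set (not any library's actual kernel):

```python
import math

# The 15 representable FP4 (E2M1) magnitudes: 0, +/-0.5 ... +/-6
FP4_VALUES = sorted([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                     -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def mxfp4_quantize(block):
    """Toy MXFP4 sketch: the block shares one power-of-two scale and
    each element rounds to the nearest FP4 value. Illustrative only."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 1.0, [0.0] * len(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)  # 2 = FP4 max exponent
    return scale, [min(FP4_VALUES, key=lambda v: abs(x / scale - v)) for x in block]

def dequantize(scale, codes):
    return [scale * c for c in codes]

scale, codes = mxfp4_quantize([0.5, 1.0, 3.0, 6.0])
print(scale, dequantize(scale, codes))  # 1.0 [0.5, 1.0, 3.0, 6.0]
```

Values exactly representable in FP4 round-trip losslessly; everything else incurs rounding error bounded by the block's shared scale.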
[tile-ai/tilelang]
- Key Activity:
- [2026-01-18] RELEASE: v0.1.7.post3
- Details:
- Added support for CUDA 13.1 build in CI.
- Implemented T.sync_warp and T.shfl_sync.
- Improved support for AMD preshuffle FP8 GEMM.
- Metrics: 133 PRs, 38 Issues
[openxla/xla]
- Key Activity:
- High maintenance activity.
- Details:
- Updates to nanobind versions.
- Fixes for macOS cross-compilation.
- Metrics: 1274 PRs, 13 Issues
Training & RL
[volcengine/verl]
- Key Activity:
- [2026-01-05] RELEASE: v0.7.0
- Details:
- Engines: Megatron engine is production-ready; FSDP support solidifying.
- Rollout: Removed SPMD rollout mode; default changed to server mode.
- Models: Added support for DeepSeek-R1-Zero on Ascend NPU and Qwen3-Next.
- Algorithm: Added CISPO and SAPO algorithms.
- Metrics: 0 PRs, 0 Issues
[deepspeedai/DeepSpeed]
- Key Activity:
- [2026-01-30] RELEASE: v0.18.5
- [2026-01-07] RELEASE: v0.18.4
- Details:
- AMD/ROCm: Improved support and bug fixes.
- ZeRO-3: Fixes for gradient checkpointing and sequential allgather optimization.
- Core: Fixes for MPS (Mac) support.
- Metrics: 54 PRs, 16 Issues
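The ZeRO-3 fixes revolve around the partition-then-allgather pattern: each rank persistently stores only its shard of the parameters, and the full set is reassembled on demand before use. A single-process sketch of the bookkeeping (illustrative; not DeepSpeed's API):

```python
def partition(params, world_size):
    """Round-robin shard a flat parameter list across ranks
    (ZeRO-3 style: each rank keeps only its own shard)."""
    return [params[rank::world_size] for rank in range(world_size)]

def allgather(shards):
    """Reassemble the full parameter list from all shards, inverting
    the round-robin partition (stand-in for the collective)."""
    world_size = len(shards)
    full = [None] * sum(len(s) for s in shards)
    for rank, shard in enumerate(shards):
        for i, p in enumerate(shard):
            full[rank + i * world_size] = p
    return full

params = list(range(10))
shards = partition(params, world_size=4)
print(allgather(shards) == params)  # round-trips exactly: True
```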
NVIDIA Ecosystem
[NVIDIA/TransformerEngine]
- Key Activity:
- [2026-01-15] RELEASE: v2.11
- Details:
- Enabled reference Current Scaling recipe for FP8 training in PyTorch.
- Added Triton kernel bindings for JAX.
- Support for Context Parallelism (CP) for THD format in JAX.
- Metrics: 0 PRs, 0 Issues
[NVIDIA/Megatron-LM]
- Key Activity:
- [2026-01-08] RELEASE: core_v0.15.2
- Details:
- Routine maintenance releases.
- Metrics: 0 PRs, 0 Issues
JAX Ecosystem
[jax-ml/jax]
- Key Activity:
- [2026-01-20] RELEASE: v0.9.0
- Details:
- Added jax.thread_guard for multi-controller detection.
- jax.export now supports explicit sharding.
- Removed jax_collectives_common_channel_id.
- Metrics: 512 PRs, 82 Issues