GitHub Monthly Report: 2026-01-01 to 2026-01-31
Engineering Report (2026-01-01 - 2026-01-31)
Executive Summary
January 2026 was a pivotal month for AI infrastructure, marked by major version releases across almost every foundational library. ROCm 7.2.0, PyTorch 2.10, Hugging Face Transformers v5.0, and Triton 3.6 were all released, signaling a new baseline for the 2026 software stack.
The primary engineering theme this month was Next-Gen Hardware Readiness and Inference Optimization. NVIDIA's Blackwell architecture support has fully landed in upstream libraries (FBGEMM, vLLM, Triton), while AMD expanded consumer card support (RDNA4) and solidified the MI350X software ecosystem. Inference engines (vLLM, SGLang) are aggressively optimizing for DeepSeek architectures and Speculative Decoding.
AMD-Related Updates
- ROCm 7.2.0 Released: This is a significant update introducing support for RDNA4 architectures (Radeon AI PRO R9600D, RX 9060) and extending SLES support to MI350X/MI355X.
- Triton 3.6 & RDNA4: The new Triton release includes initial skeleton support for `gfx1250` (RDNA4), alongside specific optimizations for MI300 (FP4->BF16 conversion) and GFX950.
- Software Ecosystem Expansion:
  - PyTorch 2.10: Added `torch.version.rocm` distinct from `torch.version.hip`, enabled AOTriton runtime compilation, and added `gfx1150`/`gfx1151` support.
  - FBGEMM: Updated OSS build scripts to support AMD variants and added MI350 performance optimizations for embedding forward/backward passes.
  - vLLM: Added the high-performance MoRI EP (Expert Parallel) backend for ROCm and Flash Attention Triton backend support for consumer RDNA3/4 GPUs.
Competitive Analysis
- NVIDIA Blackwell Maturity: Support for the Blackwell architecture (B200/GB200) is now ubiquitous across the stack.
- Triton 3.6: Added TMEM (Tensor Memory) bitwidth/layout broadcasting and specific optimizations for SM100/SM120.
- FBGEMM: Enabled CUDA 13 builds, added support for Blackwell CUTLASS attention kernels, and lazy TMEM allocation.
- vLLM: FlashInfer MLA is now the default backend on Blackwell.
- Intel XPU Momentum: PyTorch 2.10 expanded support for Intel GPUs (Panther Lake), enabling FP8 and complex MatMul. `meta-pytorch/torchcomms` also added XCCL backend support.
- DeepSeek Optimization Race: Both vLLM and SGLang are aggressively optimizing for DeepSeek V3/V3.2 architectures (FP8 KV cache, Multi-Token Prediction), indicating this model family is a current industry standard benchmark.
Category Updates
AMD Ecosystem
[ROCm/ROCm]
- Key Activity:
- [2026-01-23] Major Release: ROCm 7.2.0.
- Details:
- Hardware: Support added for RDNA4 (Radeon AI PRO R9600D, RX 9060).
- Features: Node Power Management (NPM) for MI350X/MI355X.
- HIP: Graph node scaling (doorbell ring optimization) and `hipCC` deprecation (moves to direct `amdclang` invocation).
- Libraries: `hipTensor` added software-managed plan cache; `rocSHMEM` added GPUDirect Async (GDA) support.
- Metrics: 50 New Issues, 44 New PRs
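A software-managed plan cache amortizes the expensive planning step (kernel selection, tiling) across repeated problem shapes. A minimal Python sketch of the idea (our own names, not hipTensor's C++ API):

```python
from functools import lru_cache

# Minimal plan-cache sketch: memoize planning by problem shape so repeated
# contractions reuse plans instead of re-running kernel selection.
@lru_cache(maxsize=128)
def build_plan(shape_a, shape_b):
    # Stand-in for an expensive planning step.
    return {"shapes": (shape_a, shape_b), "kernel": "tile_128x128"}

p1 = build_plan((64, 64), (64, 32))
p2 = build_plan((64, 64), (64, 32))
assert p1 is p2                      # second call served from the cache
print(build_plan.cache_info().hits)  # -> 1
```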
[AMD-AGI/Primus]
- Key Activity:
- [2026-01-29] Documentation updates for CLI usage.
- [2026-01-14] Integration of Megatron-LM SFT trainer.
- Details:
- Added support for Megatron-LM SFT trainer with offline datasets and multi-turn conversation support.
- Addressed questions regarding compatibility with ROCm 6.2/6.3/6.4.
- Metrics: 58 New PRs, 1 New Issue
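SFT with multi-turn conversations typically packs each conversation into one sample and masks the loss to assistant spans. A hedged sketch of the pattern; the template and names here are illustrative, not Primus/Megatron-LM's actual format:

```python
# Hedged sketch: concatenate role-tagged turns into one SFT sample and
# build a per-character loss mask that is True only on assistant spans.
# Illustrative template, not Primus/Megatron-LM's real one.
def build_sft_sample(turns):
    text, loss_mask = [], []
    for role, msg in turns:
        piece = f"<{role}>{msg}</{role}>"
        text.append(piece)
        loss_mask.extend([role == "assistant"] * len(piece))
    return "".join(text), loss_mask

sample, mask = build_sft_sample([("user", "hi"), ("assistant", "yo")])
print(sample)  # -> <user>hi</user><assistant>yo</assistant>
```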
[AMD-AGI/TraceLens]
- Key Activity:
- [2026-01-16] Added comprehensive documentation for performance reports.
- Details:
- Work on Python function support for TraceDiff algorithm.
- Fixes for consistency in `get_kernel_launchers`.
- Metrics: 20 New PRs, 5 New Issues
PyTorch Ecosystem
[pytorch/pytorch]
- Key Activity:
- [2026-01-21] Major Release: v2.10.0.
- Details:
- Core: Python 3.14 support for `torch.compile`.
- Inductor: "Combo-kernels" horizontal fusion to reduce launch overhead.
- Intel: Expanded support for Panther Lake (Windows/Linux) and FP8 ops.
- ROCm: Grouped GEMM support via CK, AOTriton runtime compile enabled, and `gfx1150`/`gfx1151` support added to `hipblaslt`.
- Metrics: 1895 New PRs, 510 New Issues
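Horizontal fusion ("combo-kernels") combines several independent small ops into a single launch instead of paying launch overhead per op. Conceptually, with a pure-Python stand-in for a kernel launch (not Inductor's actual codegen):

```python
# Horizontal-fusion concept: one combined "launch" applies each op to its
# own independent input, replacing one launch per small op.
# Pure-Python illustration, not Inductor's generated code.
def combo_kernel(inputs, ops):
    return [[op(v) for v in x] for x, op in zip(inputs, ops)]

out = combo_kernel([[1, 2], [3, 4]], [lambda v: v + 1, lambda v: v * 2])
print(out)  # -> [[2, 3], [6, 8]]
```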
[pytorch/FBGEMM]
- Key Activity:
- [2026-01-27] Release: v1.5.0.
- [2026-01-09] Release: v1.4.0.
- Details:
- v1.5.0: CUDA 13 and Blackwell support (lazy TMEM allocation). Added MI350 optimization for embedding layers.
- v1.4.0: FP8 embedding weights support. FP4 grouped API for Torch.
- Metrics: 78 New PRs, 3 New Issues
[huggingface/transformers]
- Key Activity:
- [2026-01-26] Major Release: v5.0.0.
- [2026-01-08] Multiple Release Candidates leading to v5.
- Details:
- Architecture: Shift to "fast" tokenizers by default. New dynamic weight loading API (`WeightConverter`).
- Defaults: `dtype` now defaults to `auto` in `from_pretrained`.
- Performance: Significant MoE performance improvements.
- New Models: Support added for SAM3, LFM2-MoE, VideoLlama 3, GLM-ASR, and GLM 4.7 Flash.
- Metrics: 467 New PRs, 131 Closed Issues
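A `dtype` default of `auto` means the load path takes precision from the checkpoint instead of forcing float32. The resolution rule can be sketched as follows (our simplification, not Transformers' internals):

```python
# Sketch of "auto" dtype resolution: honor the checkpoint's stored dtype
# unless the caller requests a dtype explicitly. A simplification of the
# v5 default described above, not Transformers' actual implementation.
def resolve_dtype(requested, checkpoint_dtype):
    return checkpoint_dtype if requested == "auto" else requested

print(resolve_dtype("auto", "bfloat16"))     # -> bfloat16
print(resolve_dtype("float32", "bfloat16"))  # -> float32
```

The practical effect is that bf16/fp16 checkpoints no longer silently upcast (and double their memory) on load unless asked to.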
[volcengine/verl]
- Key Activity:
- [2026-01-05] Release: v0.7.0.
- Details:
- Integrated Megatron-Bridge to support LoRA/PEFT for trillion-parameter reasoning RL.
- Added support for Qwen3-Next and GPT-OSS models.
- Removed SPMD rollout mode in favor of server mode.
- Metrics: 0 New PRs reported (data indicates repo movement/maintenance; release notes show high activity).
Serving & Inference
[vllm-project/vllm]
- Key Activity:
- [2026-01-29] Release: v0.15.0.
- [2026-01-20] Release: v0.14.0.
- Details:
- Architecture: Async scheduling is now enabled by default. New Model Runner V2 enhancements (M-RoPE).
- Hardware: FlashInfer MLA is default on NVIDIA Blackwell. Added MoRI EP backend for AMD ROCm.
- Quantization: Deprecated DeepSpeedFp8 and RTN. Added MXFP4 W4A16 support.
- Models: Support for DeepSeek V3.1, Molmo2, and GLM-Lite.
- Metrics: 0 New PRs reported (likely a data-stream limit given the repo's high volume).
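The win from async scheduling is overlap: the scheduler prepares batch N+1 on the CPU while batch N executes, instead of alternating serially. A toy asyncio sketch of that pipeline shape (illustrative only, not vLLM's scheduler):

```python
import asyncio

async def schedule(batch_id):       # stand-in for CPU-side batch preparation
    await asyncio.sleep(0.01)
    return f"plan-{batch_id}"

async def execute(plan):            # stand-in for GPU execution of a batch
    await asyncio.sleep(0.01)
    return f"done-{plan}"

async def pipeline(n):
    results = []
    next_plan = await schedule(0)
    for i in range(n):
        exec_task = asyncio.create_task(execute(next_plan))
        if i + 1 < n:
            next_plan = await schedule(i + 1)   # overlap prep with execution
        results.append(await exec_task)
    return results

print(asyncio.run(pipeline(3)))
# -> ['done-plan-0', 'done-plan-1', 'done-plan-2']
```

With equal step costs, the overlapped version approaches half the wall time of a strictly serial schedule-then-execute loop.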
[sgl-project/sglang]
- Key Activity:
- [2026-01-23] Release: v0.5.8.
- [2026-01-09] Release: Gateway v0.3.1.
- Details:
- Performance: 1.5x faster diffusion inference. 65% faster TTFT for GLM4-MoE.
- Models: Day 0 support for GLM 4.7 Flash and LFM2. DeepSeek V3.2 optimization via Context Parallelism.
- Gateway: 10-12x performance improvement in cache-aware routing (Radix Tree optimization).
- Metrics: 0 New PRs reported (likely a data-stream limit; release notes show high volume).
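Cache-aware routing sends each request to the worker whose cached sequences share the longest prefix with the prompt, maximizing KV-cache reuse. The Gateway does this lookup with a Radix Tree; a linear-scan stand-in shows the selection rule:

```python
# Illustrative cache-aware routing: pick the worker with the longest cached
# prefix of the prompt. A linear scan stands in for the Gateway's Radix Tree.
def shared_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt, worker_caches):
    # worker_caches: {worker_id: [cached token sequences]}
    def best_for(seqs):
        return max((shared_prefix_len(prompt, s) for s in seqs), default=0)
    return max(worker_caches, key=lambda w: best_for(worker_caches[w]))

caches = {"w0": [[1, 2, 3]], "w1": [[1, 2, 9], [7]]}
print(route([1, 2, 3, 4], caches))  # -> w0
```

The Radix Tree turns the per-worker scan into a single prefix walk, which is where the reported 10-12x routing speedup comes from.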
Compilers & Kernels
[triton-lang/triton]
- Key Activity:
- [2026-01-21] Major Release: v3.6.0.
- Details:
- AMD: Initial skeleton support for `gfx1250` (RDNA4). FP4->BF16 optimized conversion for MI300.
- NVIDIA: Blackwell TMEM bitwidth and layout broadcasting support. Native FP4 scaled dot for SM120.
- Core: Multidimensional batch support in `tl.dot`.
- Metrics: 186 New PRs, 34 New Issues
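Multidimensional batch support in `tl.dot` means leading dimensions act as batch dimensions while the last two contract. A plain-Python reference for the shape rule only (the real `tl.dot` lowers to hardware MMA instructions):

```python
# Semantics of a batched dot: for each batch index, multiply an M x K tile
# by a K x N tile. Reference for the shape rule, not a performant kernel.
def batched_dot(a, b):
    out = []
    for am, bm in zip(a, b):                 # iterate the batch dimension
        K, N = len(bm), len(bm[0])
        out.append([[sum(row[k] * bm[k][n] for k in range(K))
                     for n in range(N)] for row in am])
    return out

a = [[[1, 0], [0, 1]]]       # batch of one 2x2 identity tile
b = [[[5, 6], [7, 8]]]
print(batched_dot(a, b))     # -> [[[5, 6], [7, 8]]]
```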
[tile-ai/tilelang]
- Key Activity:
- [2026-01-18] Release v0.1.7.post3.
- Details:
- Added support for DeepSeek V3.2 sparse MLA forward kernels.
- Implemented `cp.reduce.async.bulk.tensor`.
- Unified the `@jit` and `@lazy_jit` decorators.
- Metrics: 133 New PRs, 38 New Issues
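Unifying `@jit` and `@lazy_jit` typically means one decorator with a laziness switch: compile at decoration time, or defer to first call. A sketch of that decorator pattern (our names and toy "compilation"; tilelang's real decorator compiles GPU kernels):

```python
# One decorator, two compilation policies: eager (at decoration time) or
# lazy (on first call). Sketch of the unification pattern only.
def jit(fn=None, *, lazy=False):
    def wrap(f):
        state = {"impl": None if lazy else f}   # eager: "compile" now
        def runner(*args, **kwargs):
            if state["impl"] is None:
                state["impl"] = f               # lazy: "compile" on first call
            return state["impl"](*args, **kwargs)
        return runner
    return wrap if fn is None else wrap(fn)

@jit
def add(x, y):
    return x + y

@jit(lazy=True)
def mul(x, y):
    return x * y

print(add(2, 3), mul(2, 3))  # -> 5 6
```

Supporting both `@jit` and `@jit(lazy=True)` from one callable is what lets the two old decorators collapse into a single entry point.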
NVIDIA Ecosystem
[NVIDIA/TransformerEngine]
- Key Activity:
- [2026-01-15] Release: v2.11.
- Details:
- Enabled reference Current Scaling recipe for FP8 training.
- Added JAX Triton kernel bindings.
- Improved performance of NVFP4 quantization and FSDP2 all-gather.
- Metrics: 0 New PRs reported.
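"Current scaling" derives the FP8 quantization scale from the tensor's own amax at this step, rather than from a delayed, history-based amax. A simplified illustration (not TransformerEngine's recipe code):

```python
# Current-scaling sketch for FP8: scale = format_max / amax(tensor), with
# amax taken from the tensor being quantized right now.
# Simplified illustration, not TransformerEngine's implementation.
FP8_E4M3_MAX = 448.0  # largest representable magnitude in e4m3

def current_scale(values):
    amax = max(abs(v) for v in values)
    return FP8_E4M3_MAX / amax if amax > 0 else 1.0

s = current_scale([0.5, -2.0, 1.0])
print(s)  # -> 224.0
# Quantize as clamp(round(v * s)) in e4m3; dequantize as q / s.
```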
[deepspeedai/DeepSpeed]
- Key Activity:
- [2026-01-30] Release v0.18.5.
- [2026-01-07] Release v0.18.4.
- Details:
- Fixed gradient checkpointing with ZeRO-3.
- Improved AMD ROCm support (fixes for testcases).
- Added sequential allgather optimization for ZeRO-3.
- Metrics: 0 New PRs reported.
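A sequential all-gather for ZeRO-3 materializes one layer's parameter shards at a time, bounding peak memory instead of gathering every layer up front. A toy generator sketch (illustrative only, not DeepSpeed's implementation):

```python
# Toy ZeRO-3 sequential all-gather: yield one fully gathered layer at a
# time so the caller can use it and free it before the next gather.
def sequential_allgather(layer_shards):
    # layer_shards: {layer_name: {rank: shard_list}}
    for layer, shards in layer_shards.items():
        full = [v for _, shard in sorted(shards.items()) for v in shard]
        yield layer, full            # use this layer, then release it

shards = {"fc1": {0: [1, 2], 1: [3, 4]}, "fc2": {1: [7], 0: [6]}}
print(dict(sequential_allgather(shards)))
# -> {'fc1': [1, 2, 3, 4], 'fc2': [6, 7]}
```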