GitHub Monthly Report: 2025-11-01 to 2025-11-30
April 18, 2026
Engineering Report (2025-11-01 - 2025-11-30)
Executive Summary
November 2025 was a pivotal month for next-generation hardware adoption across the AI ecosystem. Concurrent, foundational software releases enabled the newest architectures from both AMD (MI350X/MI355X) and NVIDIA (Blackwell GB300/SM100). The month also showed a strong shift toward advanced Reinforcement Learning (RL) and multi-token prediction frameworks, with major architecture overhauls in verl and slime to support asynchronous PPO and FSDP at scale. Maintenance health across the tracked repositories is strong: large codebases like XLA, Triton, and MaxText are closing PRs at roughly the rate they are opened (about 96%, 99%, and 106% of new PRs closed, respectively), indicating sustainable engineering velocity.
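The PR closure rates for XLA, Triton, and MaxText can be checked against the Metrics lines reported later in this document. The following is a minimal sketch; `pr_counts` and `closure_ratio` are illustrative names introduced here, and the numbers are copied from this report rather than fetched from GitHub.

```python
# PR counts copied from the Metrics lines of this report: (new, closed).
pr_counts = {
    "openxla/xla": (1149, 1103),
    "triton-lang/triton": (216, 214),
    "AI-Hypercomputer/maxtext": (177, 188),
}


def closure_ratio(new_prs: int, closed_prs: int) -> float:
    """Closed PRs divided by new PRs; 1.0 means the open backlog held steady."""
    if new_prs == 0:
        raise ValueError("no new PRs in the period")
    return closed_prs / new_prs


for repo, (new, closed) in pr_counts.items():
    print(f"{repo}: {closure_ratio(new, closed):.2f}")
# openxla/xla: 0.96
# triton-lang/triton: 0.99
# AI-Hypercomputer/maxtext: 1.06
```

A ratio just under 1.0 means the open-PR backlog grew slightly over the month; a ratio above 1.0 means it shrank.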
AMD Related Updates
- ROCm 7.1 Series Dominates: AMD dropped both ROCm 7.1.0 and 7.1.1 this month, officially extending support and telemetry to the MI325X, MI350X, and MI355X hardware lines.
- Virtualization & Enterprise Ready: Substantial updates were made to KVM SR-IOV support, adding Ubuntu 24.04 and RHEL 10.1 as Guest OSes for MI300X, greatly expanding enterprise deployment flexibility.
- Performance Upgrades: The introduction of "Origami" for GEMM kernel selection reduces tuning time and improves out-of-the-box matrix multiplication performance. Additionally, CK/AITER fused-attn now natively supports padding, eliminating previous Transformer Engine workarounds.
- Ecosystem Recognition: torchtitan officially acknowledged AMD forks, and TraceLens merged JAX/HLO kernel busy-time calculations, signaling deepening integration of AMD hardware into upstream open-source profiling and training workflows.
Competitive Analysis
- NVIDIA Blackwell is Here: Competing ecosystems shipped day-zero software enablement for Blackwell (SM100 / GB300). TransformerEngine v2.9 added DeepSeek v3-style FP8 block scaling, xformers shipped a dedicated cutlass FMHA op for Blackwell, and triton released a hotfix (v3.5.1) specifically to resolve GB300 compilation targets.
- NVFP4 Expansion: NVIDIA is pushing lower-precision training aggressively. TransformerEngine introduced JAX support for the NVFP4 training recipe, continuing the trend of squeezing memory bandwidth.
- DeepSeek Compute/Comms Overlap: xformers integrated FW+BW pass overlaps specifically mimicking DeepSeek-style communication optimizations, showing how quickly hardware libraries are adapting to recent Chinese MoE architectural trends.
Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-11-17] Added CLI auto-discovery for subcommands and reorganized backend documentation.
- [2025-11-14] Major documentation restructuring.
- Details:
- [2025-11-17] Merged the moe package v1.2 PR.
- [2025-11-17] Added support for custom model config and model args via CLI for the maxtext backend.
- Metrics: 52 New PRs (45 Closed), 1 New Issue (1 Closed)
AMD-AGI/TraceLens
- Key Activity:
- [2025-11-14] JAX integration enhancements for GPU profiling.
- Details:
- [2025-11-14] Addressed kernel busy time calculation based on HLO ops for JAX.
- [2025-11-14] Classified memsets in memcpy within the GPUEventAnalyzer.
- Metrics: 17 New PRs (19 Closed), 14 New Issues (9 Closed)
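Kernel busy time is commonly computed as the length of the union of kernel execution intervals, so that overlapping kernels are not double counted. The sketch below illustrates that general idea only; it is not TraceLens code, the function names and the microsecond units are assumptions, and the HLO-op-based calculation merged in TraceLens may group or filter events differently.

```python
from typing import Iterable, List, Tuple

# (start, end) timestamps of one kernel execution; units are assumed to be microseconds.
Interval = Tuple[float, float]


def merge_intervals(events: Iterable[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(events):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def busy_time(events: Iterable[Interval]) -> float:
    """Total time during which at least one kernel was running."""
    return sum(end - start for start, end in merge_intervals(events))


# Hypothetical kernel events: the first two overlap between t=5 and t=10.
kernels = [(0.0, 10.0), (5.0, 12.0), (20.0, 25.0)]
print(busy_time(kernels))  # 17.0
```

Summing raw durations for the same events would give 22.0, which overstates how long the GPU was actually occupied.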
ROCm/ROCm
- Key Activity:
- [2025-11-26] RELEASE: rocm-7.1.1
- [2025-11-03] RELEASE: rocm-7.1.0
- Details:
- [2025-11-26] Added RHEL 10.1 support, extended Debian 13 support to MI355X/MI350X, and added Ubuntu 24.04 as a Guest OS in KVM SR-IOV. Fixed an MI325X SR-IOV Mode 1 reset issue.
- [2025-11-03] Enabled nested tile partitioning in HIP cooperative groups (matching CUDA). Delivered hipBLASLt optimizations for FP8/TF32 on MI350X/MI355X.
- [2025-11-30] A community member reported poor performance on the 7900 XTX after upgrading to 7.1.1.
- Metrics: 63 New PRs (70 Closed), 47 New Issues (27 Closed)
ROCm/MAD
- Key Activity:
- [2025-11-14] Minor script and documentation fixes for vLLM environments.
- Details:
- [2025-11-14] Fixed unified attention typo in vllm script.
- Metrics: 5 New PRs (5 Closed), 0 New Issues
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- [2025-11-12] RELEASE: v2.9.1
- Details:
- [2025-11-12] Fixed a massive memory regression in F.conv3d with bfloat16 inputs.
- [2025-11-12] torch.compile improvements: fixed bugs compiling Gemma, cached get_free_symbol_uses, and fixed registration design for inductor graph partition for vLLM.
- Metrics: 1481 New PRs (1328 Closed), 1007 New Issues (411 Closed)
pytorch/torchtitan
- Key Activity:
- [2025-11-06] Added official blurb for the AMD fork.
- Details:
- [2025-11-14] Added ability to precompile torchtitan models and skipped varlen integration testing on ROCm.
- Metrics: 88 New PRs (76 Closed), 28 New Issues (10 Closed)
pytorch/vision & pytorch/audio
- Key Activity:
- [2025-11-12] RELEASE: v0.24.1 (Vision) & v2.9.1 (Audio)
- Details:
- [2025-11-12] Patch releases issued exclusively to ensure compatibility with PyTorch 2.9.1.
NVIDIA & General Deep Learning Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- [2025-11-11] RELEASE: v2.9
- Details:
- [2025-11-11] Added PyTorch support for DeepSeek v3-style FP8 block scaling on NVIDIA Blackwell (SM100).
- [2025-11-11] Added JAX support for the NVFP4 training recipe.
- [2025-11-11] Added CPU offload support for all attention layouts.
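Block scaling assigns one scale factor per small block of values instead of one per tensor, so a single outlier only degrades the precision of its own block. The sketch below shows the scale computation in plain Python under stated assumptions: it uses 1D blocks of 128 elements (DeepSeek-V3 describes 1x128 tiles for activations and 128x128 blocks for weights), it targets the FP8 E4M3 maximum of 448, it does not perform the actual FP8 cast, and `block_scales` and `scale_blocks` are hypothetical names, not Transformer Engine APIs.

```python
from typing import List, Tuple

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def block_scales(values: List[float], block_size: int = 128) -> List[float]:
    """One scale per contiguous block, chosen so the block's max maps to E4M3_MAX."""
    scales = []
    for i in range(0, len(values), block_size):
        amax = max(abs(v) for v in values[i:i + block_size])
        # An all-zero block needs no scaling; avoid dividing by zero later.
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_blocks(values: List[float], block_size: int = 128) -> Tuple[List[float], List[float]]:
    """Divide each value by its block's scale; returns (scaled values, scales)."""
    scales = block_scales(values, block_size)
    scaled = [v / scales[i // block_size] for i, v in enumerate(values)]
    return scaled, scales


# Tiny example with block_size=4: the outlier -896.0 only affects the second block.
data = [1.0, -2.0, 0.5, 4.0, 100.0, -896.0, 3.0, 7.0]
scaled, scales = scale_blocks(data, block_size=4)
print(scales[1])  # 2.0
```

Dequantization multiplies each value by its block's scale again, which is why the scales are returned alongside the scaled data.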
jax-ml/jax
- Key Activity:
- [2025-11-18] RELEASE: jax-v0.8.1
- Details:
- [2025-11-18] jax.lax.linalg.eigh now accepts an implementation argument to select between QR, Jacobi, and QDWH implementations.
- [2025-11-18] Fixed a bug causing eigh to fail for large matrices on GPUs.
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-11-06] RELEASE: maxtext-v0.1.0
- [2025-11-20] RELEASE: maxtext-tutorial-v1.3.0
- Details:
- [2025-11-06] MaxText published its first official PyPI package, transitioning to a highly accessible installation method for JAX LLM training.
- Metrics: 177 New PRs (188 Closed), 10 New Issues (3 Closed)
triton-lang/triton
- Key Activity:
- [2025-11-12] RELEASE: v3.5.1
- Details:
- [2025-11-12] Emergency hotfix to repair sm103 (GB300 / Blackwell) support that was broken in the 3.5.0 release.
- [2025-11-22] Added backend support for out-of-tree TTIR/TTGIR passes.
- Metrics: 216 New PRs (214 Closed), 26 New Issues (20 Closed)
facebookresearch/xformers
- Key Activity:
- [2025-11-12] RELEASE: v0.0.33
- Details:
- [2025-11-12] Added a cutlass FMHA op specifically for Blackwell GPUs.
- [2025-11-12] Implemented Forward + Backward pass overlap to support DeepSeek-like communication/compute overlap workflows.
openxla/xla
- Key Activity:
- [2025-11-14] Significant codebase refactoring around IR emitters.
- Details:
- [2025-11-14] Merged ir_emitter and ir_emitter_nested for XLA:GPU.
- Metrics: 1149 New PRs (1103 Closed), 9 New Issues (17 Closed)
LLM & RL Training Frameworks
volcengine/verl
- Key Activity:
- [2025-11-14] RELEASE: v0.6.1
- Details:
- [2025-11-14] Introduced the "Fully Async Policy Trainer" to decouple the Trainer and Rollouter, allowing asynchronous PPO sample generation.
- [2025-11-14] Added the TransferQueue Data System for asynchronous streaming data management during post-training.
- [2025-11-14] Megatron updates: supported 1f1b_overlap, Qwen3VL MoE/Dense, and Qwen2.5 with context parallelism.
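Decoupling a trainer from a rollouter is, at its core, a producer/consumer arrangement: sample generation feeds a queue and training drains it, so neither side waits for the other to finish a full step. The toy sketch below shows that pattern with Python's standard `asyncio.Queue`. It is an illustration under assumptions, not verl's implementation or API; the function names, the batch layout, and the `max_queued` bound are all hypothetical.

```python
import asyncio
from typing import List, Optional


async def rollouter(queue: asyncio.Queue, num_batches: int) -> None:
    """Produce sample batches and enqueue them without waiting for training steps."""
    for step in range(num_batches):
        await asyncio.sleep(0)  # stand-in for generation work
        await queue.put({"step": step, "samples": [step] * 4})
    await queue.put(None)  # sentinel: no more batches


async def trainer(queue: asyncio.Queue) -> List[int]:
    """Consume batches as they arrive and record which steps were trained on."""
    consumed: List[int] = []
    while True:
        batch: Optional[dict] = await queue.get()
        if batch is None:
            break
        await asyncio.sleep(0)  # stand-in for a policy update
        consumed.append(batch["step"])
    return consumed


async def run(num_batches: int = 3, max_queued: int = 2) -> List[int]:
    # A bounded queue limits how far generation can run ahead of training.
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queued)
    _, consumed = await asyncio.gather(rollouter(queue, num_batches), trainer(queue))
    return consumed


print(asyncio.run(run()))  # [0, 1, 2]
```

The queue bound is the main design knob in a setup like this: a larger bound keeps the generator busier, while a smaller one keeps samples closer to the current policy.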
THUDM/slime
- Key Activity:
- [2025-11-28] RELEASE: v0.2.0
- Details:
- [2025-11-28] Massive update introducing a Fully Sharded Data Parallel (FSDP) based training backend.
- [2025-11-28] Added native support for PPO, FP8 full stack (train + infer), and Multi-Token Prediction (MTP) during RL.
- [2025-11-28] Alleviated train-inference mismatch using Rollout Routing Replay (R3).
llm-d/llm-d
- Key Activity:
- [2025-11-26] RELEASE: v0.4.0
- [2025-11-06] RELEASE: v0.3.1
- Details:
- [2025-11-26] Shipped robust component updates including llm-d-inference-scheduler, CPU variants (llm-d-cpu), and the Gateway API Inference Extension.
- [2025-11-06] Refactored Dockerfiles to bash scripts, unified GKE image into core CUDA image, and added experimental ARM support.
xdit-project/xDiT
- Key Activity:
- [2025-11-12] RELEASE: 0.4.5
- Details:
- [2025-11-12] Added Apple Silicon (MPS) support and Wan 2.X I2V models.
- [2025-11-28] Upgraded to support FLUX.2 natively via diffusers format.
- Metrics: 8 New PRs (8 Closed), 5 New Issues (2 Closed)
deepspeedai/DeepSpeed
- Key Activity:
- [2025-11-05] RELEASE: v0.18.2
- Details:
- [2025-11-05] Added fp32 weight deduplication under torch autocast and ZeRO3.
- [2025-11-05] Enhanced Ulysses Sequence Parallelism (UlyssesSP) API for variable sequence lengths.