GitHub Monthly Report: 2025-10-01 to 2025-10-31
April 18, 2026
Engineering Report (2025-10-01 - 2025-10-31)
Executive Summary
October 2025 was a monumental month for AI engineering and infrastructure, marked by major architectural shifts and the rollout of next-generation hardware support across the stack. The software ecosystem is aggressively pivoting to support upcoming AI accelerators, specifically AMD's MI350 series (CDNA4) and NVIDIA's Blackwell architecture.
Major Ecosystem Shifts:
- Framework Updates: 🚨 PyTorch 2.9.0 and 🚨 JAX v0.8.0 were both released, bringing massive foundational changes (e.g., PyTorch moving to a Python 3.10 minimum, JAX deprecating the traditional `pmap` in favor of `shard_map`).
- Compiler Advancements: 🚨 Triton v3.5.0 delivered comprehensive backend upgrades for both AMD (GFX950) and NVIDIA (Hopper/Blackwell).
- Maintenance Health: Overall ecosystem maintenance health is exceptionally strong. Repositories like PyTorch (1,798 PRs closed / 1,900 opened) and OpenXLA (1,338 PRs closed / 1,339 opened) are showing near 1:1 open-to-close ratios, demonstrating highly efficient CI/CD pipelines and active maintainer bandwidth.
AMD Related Updates
- TheRock Build System & MI350 Support: AMD released 🚨 ROCm 7.9.0 as a technology preview introducing the "TheRock" build system. This marks a massive transition to ManyLinux_2_28 compliance, architecture-specific Python packages, and an open release process. Crucially, it brings official software support for the Instinct MI355X/MI350X (CDNA4/gfx950).
- TraceLens Goes Public: AMD-AGI/TraceLens 🚨 v0.4.0 was released and officially switched to a public repository. This dramatically improves AMD's profiling ecosystem, introducing a new UI, JAX data analyses, and deep GEMM performance modeling via Gemmologist.
- Triton GFX950 Support: Triton 🚨 v3.5.0 introduced comprehensive support for AMD's GFX950 (MI350), including MFMA scale support, scale preshuffling, and software emulation for non-gfx942 architectures.
- Primus Expansion: AMD-AGI/Primus saw two rapid releases (🚨 v0.3.0, 🚨 v0.4.0) adding zero-bubble pipeline parallelism, DeepEP sync-free MoE, and support for massive models (Llama 3.1 405B, Grok 1 & 2, Qwen2.5).
- Broader Framework Adoption: Third-party frameworks like `rtp-llm` and `tilelang` proactively released specific optimizations and pre-built wheels for AMD ROCm platforms.
Competitive Analysis
- NVIDIA Blackwell Dominance: NVIDIA's software stack is hyper-focused on Blackwell (SM100/SM120). TransformerEngine (v2.7 & v2.8) and Megatron-LM released extensive support for NVFP4 training recipes, MXFP8 grouped GEMMs, and MoE router fusions specifically targeted at extracting maximum TFLOPS from Blackwell.
- PyTorch Native Blackwell Support: `pytorch/ao` 🚨 v0.14.1 introduced prototype MoE training and NVFP4 Quantization-Aware Training (QAT) explicitly for Blackwell GPUs, showing NVIDIA's tight integration with upstream PyTorch.
- Intel XPU Catch-up: Intel successfully landed `FlexAttention` support for XPU within PyTorch 2.9.0 and quantization ops in `pytorch/ao`, ensuring they remain in the hardware-agnostic conversation.
- Custom Frameworks Emerging: Meta open-sourced `monarch`, a distributed programming framework using actor-based concurrency and RDMA transfers, indicating a push by hyperscalers to bypass traditional collective-communication bottlenecks.
Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-10-18] 🚨 RELEASE: v0.4.0
- [2025-10-15] 🚨 RELEASE: v0.3.0
- Details:
- Rapid iterations focusing on performance and scale. Added zero-bubble pipeline parallelism and DeepEP Token Dispatcher (sync-free MoE).
- Enabled compilation and added configs for Llama-3.1 (8B/70B/405B), Grok1, Grok2, and Qwen2.5 models.
- Added unified shell entry scripts for Slurm, container, and direct modes, updating the default Docker images to `v25.9_gfx942`.
- Metrics: 46 New PRs, 50 Closed PRs, 1 New Issue, 0 Closed Issues (Excellent merge velocity)
AMD-AGI/TraceLens
- Key Activity:
- [2025-10-17] 🚨 RELEASE: v0.4.0
- Details:
- Repo switched to public. Major leap for AMD profiling tools.
- Added a dedicated TraceLens UI, JAX data analyses, and integrated "gemmologist" for modeling GEMM efficiencies.
- Added SDPA/Flash Attention changes for the performance model, top-k ops support, and TE linear GEMM analysis.
- Metrics: 48 New PRs, 40 Closed PRs, 45 New Issues, 34 Closed Issues (High community engagement post-public launch)
ROCm/ROCm
- Key Activity:
- [2025-10-20] 🚨 RELEASE: therock-7.9.0 (Technology Preview)
- [2025-10-10] 🚨 RELEASE: rocm-7.0.2
- Details:
- 7.9.0: Introduces the "TheRock" build system. Supports MI355X/MI350X (CDNA4), MI325X, and Ryzen AI Max. Slims down the SDK and shifts to ManyLinux_2_28 compliance for broader portability.
- 7.0.2: Adds support for RDNA4 (Radeon RX 9060). Introduces the ROCm Life Science (ROCm-LS) toolkit, PyTorch 2.8 support, RAG AI enablement, and `gsplat` support.
- Metrics: 105 New PRs, 101 Closed PRs, 54 New Issues, 40 Closed Issues (Strong maintenance health)
ROCm/MAD
- Key Activity:
- [2025-10-31] Maintenance and feature updates.
- Details:
- Merged MaxText Training v25.9 and added training/inference tags.
- Metrics: 8 New PRs, 10 Closed PRs, 0 New Issues
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- [2025-10-15] 🚨 RELEASE: v2.9.0
- Details:
- Breaking: Minimum Python version is now 3.10. The default ONNX exporter now uses `dynamo=True`.
- Features: Symmetric memory for multi-GPU kernels, FlexAttention on Intel GPUs, and expanded wheel variants including ROCm, XPU, and CUDA 13.
- Added OCP Micro-scaling Format (mx-fp8/mx-fp4) support for ROCm and introduced the Muon optimizer natively.
- Metrics: 1900 New PRs, 1798 Closed PRs, 529 New Issues, 548 Closed Issues (Outstanding repository health)
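The OCP Micro-scaling (MX) formats mentioned above pair low-precision elements with one shared scale per small block. As a rough illustration of that idea only (a toy sketch, not the OCP spec or PyTorch's implementation), a block quantizer with a shared power-of-two scale and an fp4-style (e2m1) value grid might look like:

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # e2m1 magnitudes

def mx_quantize_block(block):
    """Toy MX-style quantizer: one shared power-of-two scale per block
    (real MX uses 32-element blocks), elements snapped to an fp4-like grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Smallest power-of-two scale such that amax/scale fits the grid's max (6.0).
    scale = 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1]))

    def snap(v):
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        return math.copysign(mag * scale, v)

    return scale, [snap(v) for v in block]
```

For a block whose largest magnitude is 6.0 the shared scale is 1.0, and values off the grid round to the nearest representable point.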
pytorch/torchtitan
- Key Activity:
- [2025-10-18] 🚨 RELEASE: v0.2.0
- Details:
- Bumped dependencies to `torch-2.10.0.dev20251019` and `torchao-0.15.0.dev`.
- Added deepseek_v3 experiments and deterministic RL training experiments with vLLM.
- Metrics: 174 New PRs, 167 Closed PRs, 24 New Issues, 17 Closed Issues
pytorch/ao
- Key Activity:
- [2025-10-13] 🚨 RELEASE: v0.14.1
- Details:
- Massive Blackwell focus: Prototype MoE training on Blackwell GPUs and NVFP4 Quantization-Aware Training (QAT).
- Introduced `_scaled_grouped_mm`, dynamically quantizing inputs for substantial speedups on Llama4 and DeepSeekV3.
- Metrics: 142 New PRs, 135 Closed PRs, 26 New Issues, 13 Closed Issues
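Dynamic input quantization, as used by the grouped-GEMM path above, picks scales from the live activation values at runtime rather than from offline calibration. A minimal per-tensor int8 sketch of that idea (illustrative only, not torchao's kernel):

```python
def dynamic_quant_int8(x):
    # Choose the scale from the live value range, so the largest
    # activation maps to the edge of the int8 grid.
    amax = max(abs(v) for v in x) or 1.0
    scale = amax / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale

def dequant(q, scale):
    # Recover approximate real values from the int8 codes.
    return [v * scale for v in q]
```

The per-group variant applies the same recipe independently to each expert's input block, which is what makes it a good fit for MoE-style grouped GEMMs.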
pytorch/vision & pytorch/audio
- Key Activity:
- [2025-10-15] 🚨 RELEASE: vision v0.24.0 & audio v2.9.0
- Details:
- Vision: Improved KeyPoints and Rotated boxes support; efficient `box_area` and `box_iou` ops.
- Audio: Migrated the `load()` and `save()` APIs to rely on TorchCodec.
- Metrics: Minimal PR/Issue activity recorded post-release.
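For reference, `box_iou` computes intersection-over-union over corner-format boxes; a plain-Python sketch of the per-pair computation (torchvision's version is vectorized over whole tensors of boxes):

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    # Intersection rectangle; empty overlap clamps to zero width/height.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0, and partial overlaps fall in between.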
Meta AI Ecosystem
meta-pytorch/monarch
- Key Activity:
- [2025-10-22] 🚨 RELEASE: v0.1.0
- Details:
- Initial Release: A new distributed programming framework for PyTorch.
- Uses actor-based concurrency, fault-tolerant supervision, and high-performance RDMA transfers (zero-copy, one-sided tensor communication via libibverbs).
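monarch's API is its own; purely to illustrate the actor-model style described above (each actor owns a private mailbox and processes messages sequentially on its own thread), here is a toy Python sketch with invented names, unrelated to monarch's actual interface:

```python
import queue
import threading

class Actor:
    """Toy actor: one mailbox, one worker thread, in-order message handling."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._results = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # The single worker thread drains the mailbox one message at a time,
        # so the actor's state is never touched concurrently.
        while True:
            fn, arg = self._mailbox.get()
            self._results.put(fn(arg))

    def send(self, fn, arg):
        self._mailbox.put((fn, arg))

    def recv(self):
        return self._results.get()
```

The appeal for distributed training is that supervision and fault handling attach naturally to actors, rather than to a global collective where one failed rank stalls everyone.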
meta-pytorch/torchcomms & meta-pytorch/torchforge
- Key Activity:
- Documentation updates adjusting setup workflows, READMEs, and RCCL backend prerequisites.
NVIDIA Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- [2025-10-07] 🚨 RELEASE: v2.8
- [2025-10-01] 🚨 RELEASE: v2.7
- Details:
- Unlocked Blackwell's potential with NVFP4 training recipes and FP8 attention features.
- Supported FlashAttention v3 for MLA with context parallelism and added multi-tensor swizzle kernels for MXFP8 grouped GEMMs.
NVIDIA/Megatron-LM
- Key Activity:
- [2025-10-08] 🚨 RELEASE: core_v0.14.0
- Details:
- Heavy MoE optimizations: Expert Parallel A2A Overlapping, CP/recompute for MTP, and MoE router fusion.
- Added support for DeepSeek and Qwen configs, optimizer offloading for DSV3 FP8 training.
JAX, Google & XLA Ecosystem
jax-ml/jax
- Key Activity:
- [2025-10-15] 🚨 RELEASE: jax-v0.8.0
- Details:
- Breaking Change: `jax.pmap` is in maintenance mode; the default implementation now relies on `jax.jit` and `jax.shard_map`.
- Removed legacy host callbacks, DLPack, and XLA extension functions.
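A minimal migration sketch for a pmap-style elementwise function: the per-shard body stays the same, but it is wrapped with `shard_map` over an explicit device mesh and composed with `jax.jit` (the mesh here spans however many devices are visible; on a single CPU device the whole array is one shard):

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

try:
    from jax import shard_map                      # newer JAX: top-level API
except ImportError:
    from jax.experimental.shard_map import shard_map  # older JAX fallback

mesh = Mesh(jax.devices(), axis_names=("d",))

def scale(block):
    # Runs once per shard of the input, pmap-style.
    return block * 2.0

f = jax.jit(shard_map(scale, mesh=mesh, in_specs=P("d"), out_specs=P("d")))
y = f(jnp.arange(8.0))  # input sharded along "d", outputs reassembled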
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-10-25] ๐จ RELEASE: maxtext-tutorial-v1.1.0 (and v1.0.0)
- [2025-10-18] ๐จ RELEASE: tpu-recipes-v0.1.5
- Details:
- Minor workflow and recipe alignment releases for TPU integration.
openxla/xla
- Key Activity:
- High-volume maintenance.
- Details:
- Deprecated
TypeInfoconstructor in favor ofXLA_FFI_TypeInfoalias. Actively debugging low utilization forlax.ragged_all_to_allon 8x B200 setups.
- Deprecated
-
Metrics: 1339 New PRs 1338 Closed PRs 16 New Issues 10 Closed Issues (Remarkable automated merge volume)
Inference, Compilers & LLM Frameworks
triton-lang/triton
- Key Activity:
- [2025-10-21] ๐จ RELEASE: v3.5.0
- Details:
- AMD Backend: Full GFX950 (MI350) support added (scaled MFMA, OpSel implementation, scale preshuffling). FP8 emulation enabled for non-gfx942 architectures.
- NVIDIA Backend: Hopper WGMMA with async wait, Blackwell TMEM support, and FP8 MMAv2 optimizations (~1.9x speedup).
- Introduced generic swizzling for
convert_layoutto minimize shared memory bank conflicts.
-
Metrics: 232 New PRs 202 Closed PRs 45 New Issues 43 Closed Issues
volcengine/verl
- Key Activity:
- [2025-10-15] ๐จ RELEASE: v0.6.0
- Details:
- Transitioned agentic RL rollout from SPMD to Server Mode (fully supporting vLLM/SGLang online serving).
- Introduced FSDP + Ulysses backend model engine prototype. Added support for Qwen3 VL and DeepEyes recipes.
tile-ai/tilelang
- Key Activity:
- [2025-10-31] ๐จ RELEASE: v0.1.6.post2
- Details:
- Added AMD MLA autotuning for ROCm. Support for Huawei Ascend chips. Added fast sine/cosine definitions in CUDA templates and Metal backend support.
-
Metrics: 152 New PRs 154 Closed PRs 95 New Issues 62 Closed Issues
deepspeedai/DeepSpeed
- Key Activity:
- [2025-10-23] ๐จ RELEASE: v0.18.1 & v0.18.0
- Details:
- Added ZenFlow code for Stage 3, DeepCompile ZeRO-3 (robust allgather for uneven shards), and SuperOffload.
- Integrated Ulysses HF Accelerate and DataStates-LLM asynchronous checkpointing.
alibaba/rtp-llm
- Key Activity:
- [2025-10-31] ๐จ RELEASE: v0.2.0
- Details:
- Major framework update supporting PD Disaggregation, Speculative Decoding, and DeepEP.
- Supports massive models: DeepSeek v3/R1, Qwen3, and Llama 4. Provided direct wheel support for AMD ROCm.
llm-d/llm-d
- Key Activity:
- [2025-10-10] ๐จ RELEASE: v0.3.0
- Details:
- Increased support for specialized hardware backends (TPU, XPU, AWS EFA).
- Updated upstream vLLM inference to v0.11.0.
Other Frameworks (vLLM, SGLang, xDiT, slime, miles, DeepEP)
- Primarily documentation, tutorial updates, and minor bug tracking during this period. xDiT added support for Wan2.X I2V models and upgraded Hunyuanvideo for new diffusers formats.