GitHub Monthly Report: 2025-10-01 to 2025-10-31
📅 Engineering Report (2025-10-01 - 2025-10-31)
🚀 Executive Summary
October 2025 was a pivotal month characterized by major foundational releases across the entire AI stack. PyTorch released v2.9.0, introducing CUDA 13 support and symmetric memory primitives. ROCm introduced a new build infrastructure (“TheRock”) via a technology preview in v7.9.0, signaling a shift in how the SDK is delivered. Triton v3.5.0 arrived with critical support for AMD’s next-gen MI350 (GFX950) architecture and NVIDIA’s Blackwell. In the inference space, vLLM v0.11.0 officially deprecated its V0 engine, moving entirely to the V1 architecture for improved performance and modularity.
AMD Related Updates
- 🚨 ROCm Build System Overhaul: ROCm 7.9.0 “TheRock” is a technology preview that introduces a new build and release infrastructure. It creates a versioning discontinuity (7.9 vs. the 7.0 stream) and moves toward a more open, predictable 6-week release cadence with ManyLinux compliance.
- Next-Gen Hardware Support: Triton v3.5.0 added comprehensive support for the GFX950 (MI350) architecture, including MFMA scale support and scale preshuffling, ensuring software readiness for upcoming hardware.
- Ecosystem Maturity: TraceLens (profiling) is now public (v0.4.0) with JAX support, and Primus (training) v0.4.0 added zero-bubble pipeline parallelism and Grok model support, strengthening the training story on AMD GPUs.
- Inference Updates: vLLM v0.11.0 updated its AMD backend to target ROCm 7.0 base, ensuring the inference engine keeps pace with the SDK.
Competitive Analysis
- NVIDIA Blackwell Readiness: Competitor tooling is aggressively optimizing for the Blackwell architecture. TransformerEngine v2.8 and pytorch/ao v0.14.1 introduced support for NVFP4 training recipes and MoE optimizations specifically for Blackwell.
- CUDA 13 & Ecosystem: PyTorch 2.9.0 and vLLM v0.11.0 both added support for CUDA 13, indicating the software stack is moving to the next major CUDA version.
- Distributed Training: Meta introduced Monarch, a new distributed programming framework, and Verl v0.6.0 advanced RLHF capabilities with SGLang/vLLM native server integration, pushing the boundaries of post-training infrastructure.
📂 Category Updates
🔴 AMD Ecosystem
[ROCm/ROCm]
- Key Activity:
- [2025-10-20] 🚨 RELEASE: therock-7.9.0 (Technology Preview)
- [2025-10-10] RELEASE: rocm-7.0.2
- Details:
- ROCm 7.9.0: Introduces “TheRock” build system. Changes include ManyLinux_2_28 compliance, architecture-specific Python packages, and a slimmed-down SDK. Note: No upgrade path from 7.0 stream; intended for dev preview. Supports MI355X/MI350X (CDNA4) and MI300 series.
- ROCm 7.0.2: Added support for RHEL 10.0, Oracle Linux 10. Enabled RAG AI support and gsplat (Gaussian splatting).
- Metrics: 105 PRs, 54 Issues (High Activity)
[AMD-AGI/Primus]
- Key Activity:
- [2025-10-18] RELEASE: v0.4.0
- [2025-10-15] RELEASE: v0.3.0
- Details:
- v0.4.0: Added a Python-based primus CLI entrypoint, support for Zero Bubble Pipeline Parallelism, and support for the Grok-1 and Grok-2 models.
- Optimization: Enabled compile for Llama-3.1 (8B/70B/405B).
- Integration: Updated Torchtitan support and synced with upstream.
- Metrics: 46 PRs, 1 Issue
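Zero Bubble Pipeline Parallelism hides pipeline idle time by splitting each backward pass into the input-gradient computation (B), which stays on the critical path, and the weight-gradient computation (W), which can be deferred into ticks where a stage would otherwise sit idle. A minimal pure-Python sketch of that scheduling idea (illustrative only; this is not Primus's scheduler, and the unit costs are assumptions):

```python
# Sketch: deferring W (weight grads) into bubbles shortens the schedule.
# Costs are illustrative: F/B/W each take 1 tick; classic backward is B+W fused.

def classic_schedule(num_micro):
    # Backward is one fused unit of cost 2 (B+W together), and the stage
    # waits 1 idle tick (a bubble) between microbatch backwards.
    t = 0
    for _ in range(num_micro):
        t += 2      # fused B+W
        t += 1      # bubble: waiting on the next microbatch's grads
    return t

def zero_bubble_schedule(num_micro):
    # B (cost 1) stays on the critical path; each deferred W (cost 1)
    # runs in the tick that was previously a bubble.
    t = 0
    deferred_w = 0
    for _ in range(num_micro):
        t += 1              # B only
        deferred_w += 1     # W queued for later
        t += 1              # former bubble now runs one deferred W
        deferred_w -= 1
    return t + deferred_w   # flush any W still pending at the end

print(classic_schedule(4))      # 12 ticks
print(zero_bubble_schedule(4))  # 8 ticks
```

The same microbatch work finishes in fewer ticks because no tick is wasted waiting; real schedulers additionally interleave forwards and bound activation memory.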
[AMD-AGI/TraceLens]
- Key Activity:
- [2025-10-17] 🚨 RELEASE: v0.4.0 (Repo switched to Public)
- Details:
- New Features: Added TraceLens UI, JAX analysis reporting, and gemmologist integration for modeling GEMM efficiencies.
- Perf Modeling: Added performance models for aten::bmm, flash_attn, and aiter::fmha_v3.
- Usability: Added TraceDiff API for comparing traces.
- Metrics: 48 PRs, 45 Issues
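The core idea behind GEMM efficiency modeling can be sketched with simple arithmetic: an (M, N, K) GEMM performs 2·M·N·K FLOPs, so achieved throughput is FLOPs divided by measured time, and efficiency is that throughput divided by the device peak. A hedged sketch (the function names and the peak-TFLOP/s figure are illustrative, not gemmologist's API):

```python
def gemm_tflops(m, n, k, time_ms):
    """Achieved TFLOP/s for an (M, N, K) GEMM measured over time_ms."""
    flops = 2 * m * n * k          # one multiply + one add per MAC
    return flops / (time_ms * 1e-3) / 1e12

def gemm_efficiency(m, n, k, time_ms, peak_tflops):
    """Fraction of device peak achieved by the measured GEMM."""
    return gemm_tflops(m, n, k, time_ms) / peak_tflops

# Example: a 4096^3 GEMM in 0.5 ms against a hypothetical 1000 TFLOP/s peak
eff = gemm_efficiency(4096, 4096, 4096, 0.5, peak_tflops=1000.0)
print(f"{eff:.1%}")  # 27.5%
```

Real models also account for memory-bound shapes (small M, N, or K) where the roofline, not peak compute, bounds the achievable rate.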
[triton-lang/triton]
- Key Activity:
- [2025-10-21] 🚨 RELEASE: v3.5.0
- Details:
- AMD Support: GFX950 (MI350) Support added (MFMA scale, buffer load/store). Added ChainedDot schedule and Ping-Pong transformation for AMD backend.
- NVIDIA Support: Warp specialization enabled for persistent matmul/FA. Blackwell specific optimizations (TMEM support, FP8 MMAv2).
- Core: Mutations disallowed in language. Ragged TMA support added.
- Metrics: 232 PRs, 45 Issues
[tile-ai/tilelang]
- Key Activity:
- [2025-10-31] RELEASE: v0.1.6.post2
- Details:
- Last release supporting Python 3.8.
- Added support for Huawei Ascend chips.
- Implemented WGMMA for T.gemm_v2.
- Added Metal backend support.
- Metrics: 152 PRs, 95 Issues
🔥 PyTorch Ecosystem
[pytorch/pytorch]
- Key Activity:
- [2025-10-15] 🚨 RELEASE: v2.9.0
- Details:
- System: Minimum Python version is now 3.10. CUDA 13.0 support added.
- Features: Symmetric memory primitives. FlexAttention on Intel GPUs. Muon optimizer introduced.
- Deprecations: torch.onnx.dynamo_export removed.
- Hardware: Enabled Linux aarch64 binary wheels across all CUDA versions.
- Metrics: 1900 PRs, 529 Issues (Very High Activity)
[pytorch/ao] (Architecture Optimization)
- Key Activity:
- [2025-10-13] RELEASE: v0.14.1
- Details:
- Blackwell: Prototype support for NVFP4 Quantization Aware Training (QAT) and MoE training on Blackwell GPUs.
- Optimization: Added _scaled_grouped_mm for MoE training speedups.
- Metrics: 142 PRs, 26 Issues
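NVFP4-style quantization stores 4-bit floating-point (E2M1) values that share a scale per small block, so the few representable magnitudes are stretched to fit each block's dynamic range. A pure-Python sketch of that block-scaling idea (illustrative only, not torchao's implementation; real NVFP4 uses 16-element blocks with FP8 scales):

```python
# E2M1 (FP4) representable magnitudes; sign is stored separately.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a small block to signed E2M1 values with one shared scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0            # map the block's max onto E2M1's max (6.0)
    q = []
    for x in block:
        # Round |x|/scale to the nearest representable FP4 magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        q.append(mag if x >= 0 else -mag)
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.25, -0.4, 0.9, 2.4]
q, s = quantize_block(block)
approx = dequantize_block(q, s)
```

Because the scale is chosen per block rather than per tensor, outliers in one block do not crush the precision of values elsewhere.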
[pytorch/torchtitan]
- Key Activity:
- [2025-10-18] RELEASE: v0.2.0
- Details:
- Aligned with PyTorch 2.10 (dev) and TorchAO 0.15 (dev).
- Consolidated DeepSeek V3 experiments.
- Metrics: 174 PRs, 24 Issues
[meta-pytorch/monarch]
- Key Activity:
- [2025-10-22] 🚨 RELEASE: v0.1.0 (Initial Release)
- Details:
- New distributed programming framework for PyTorch based on scalable actor messaging and RDMA transfers.
- Experimental status.
- Metrics: 0 PRs, 0 Issues (in data snapshot; likely higher in reality for a new launch)
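The actor model Monarch is built around can be illustrated generically: each actor owns a mailbox and processes its messages sequentially, isolated from other actors. This is a pure-Python sketch of the general pattern, not Monarch's API (all class and method names here are hypothetical):

```python
import queue
import threading

class Actor:
    """Minimal actor: a mailbox drained in order by one worker thread."""
    def __init__(self, handler):
        self.inbox = queue.Queue()
        self.handler = handler
        self.results = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:          # sentinel: shut down
                break
            self.results.append(self.handler(msg))

    def send(self, msg):
        self.inbox.put(msg)          # non-blocking message send

    def join(self):
        self.send(None)
        self._thread.join()

# Messages to one actor are processed strictly in send order.
doubler = Actor(lambda x: 2 * x)
for i in range(3):
    doubler.send(i)
doubler.join()
print(doubler.results)  # [0, 2, 4]
```

Frameworks like Monarch scale this pattern across processes and hosts, with RDMA moving bulk tensor data outside the message path.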
🟢 NVIDIA Ecosystem
[NVIDIA/Megatron-LM]
- Key Activity:
- [2025-10-08] RELEASE: core_v0.14.0
- Details:
- Inference: Multi-batch CUDA Graphs for Dynamic Inference.
- MoE: Active optimization for Blackwell Platform. Added MoE router fusion and expert parallel A2A overlapping.
- Comm: Added HyperCommGrid for N-dimensional communication grids.
- Metrics: 0 PRs (Internal development model, mirrored to GitHub)
[NVIDIA/TransformerEngine]
- Key Activity:
- [2025-10-07] RELEASE: v2.8
- [2025-10-01] RELEASE: v2.7
- Details:
- v2.8: Added support for NVFP4 training recipe and FP8 attention with current scaling.
- v2.7: FP8 performance improvements and support for the cublasMP backend.
- Metrics: 0 PRs (Internal development model)
🔵 JAX & Google Ecosystem
[jax-ml/jax]
- Key Activity:
- [2025-10-15] 🚨 RELEASE: jax-v0.8.0
- Details:
- Breaking: jax.pmap is in maintenance mode; users are encouraged to use jax.shard_map and jax.jit.
- Removed: The jax.experimental.host_callback and jax.util modules were removed.
- Features: Default nonsymmetric eigendecomposition on NVIDIA GPUs now uses cusolver.
- Metrics: 0 PRs (Snapshot limitation)
[AI-Hypercomputer/maxtext]
- Key Activity:
- [2025-10-24] RELEASE: maxtext-tutorial-v1.0.0
- [2025-10-18] RELEASE: tpu-recipes-v0.1.5
- Details:
- Released specific versions for tutorials and TPU recipes.
- Docs updated for installation guides.
- Metrics: 132 PRs, 12 Issues
⚡ Inference & Serving
[vllm-project/vllm]
- Key Activity:
- [2025-10-02] 🚨 RELEASE: v0.11.0
- Details:
- Major Architecture Shift: V0 engine removed entirely. V1 is the only engine.
- Hardware: ROCm 7.0 support added. NVIDIA Blackwell BF16 fused MoE support.
- Features: CUDA graph mode FULL_AND_PIECEWISE is now the default. Added support for Qwen3-VL and DeepSeek-V3.2-Exp.
- Quantization: Support for NVFP4 on dense models.
- Metrics: 0 PRs (Snapshot limitation, activity is actually very high)
[llm-d/llm-d]
- Key Activity:
- [2025-10-10] RELEASE: v0.3.0
- Details:
- Increased support for specialized hardware backends (TPU, XPU).
- Added support for DOKS (DigitalOcean Kubernetes).
- Metrics: 0 PRs (Snapshot limitation)
[volcengine/verl] (Hybrid Training/Inference)
- Key Activity:
- [2025-10-15] RELEASE: v0.6.0
- Details:
- Architecture: Prototype Model Engine using FSDP + Ulysses.
- Rollout: Migrated SGLang and vLLM to native server mode for agentic RL.
- Algorithms: Added Token-level and Sequence-level importance sampling (TIS).
- Metrics: 0 PRs (Snapshot limitation)
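Token-level and sequence-level importance sampling differ only in where the likelihood ratio between the rollout (behavior) policy and the current training policy is applied: once per token, or once per sequence as the product of the token ratios. A hedged pure-Python sketch of that distinction (the inputs are illustrative log-probabilities, not verl's API):

```python
import math

def token_level_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(t) / pi_old(t)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """One ratio for the whole sequence: the product of token ratios."""
    return math.exp(sum(logp_new) - sum(logp_old))

logp_old = [-1.0, -2.0, -0.5]   # behavior (rollout) policy log-probs
logp_new = [-0.8, -2.1, -0.4]   # current training policy log-probs

per_token = token_level_ratios(logp_new, logp_old)
per_seq = sequence_level_ratio(logp_new, logp_old)
```

Token-level weighting bounds the variance contributed by any single token, while sequence-level weighting is unbiased for sequence returns but can blow up on long rollouts; RLHF stacks typically also clip these ratios.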