📅 Engineering Report (2025-10-01 - 2025-10-31)

🚀 Executive Summary

October 2025 was a pivotal month characterized by major foundational releases across the entire AI stack. PyTorch released v2.9.0, introducing CUDA 13 support and symmetric memory primitives. ROCm introduced a new build infrastructure (“TheRock”) via a technology preview in v7.9.0, signaling a shift in how the SDK is delivered. Triton v3.5.0 arrived with critical support for AMD’s next-gen MI350 (GFX950) architecture and NVIDIA’s Blackwell. In the inference space, vLLM v0.11.0 removed its V0 engine entirely, moving to the V1 architecture for improved performance and modularity.

  • 🚨 ROCm Build System Overhaul: ROCm 7.9.0 “TheRock” is a technology preview that introduces a new build and release infrastructure, creating a versioning discontinuity (7.9 alongside the 7.0 stream) and moving toward a more open, predictable six-week release cadence with manylinux compliance.
  • Next-Gen Hardware Support: Triton v3.5.0 added comprehensive support for the GFX950 (MI350) architecture, including MFMA scale support and scale preshuffling, ensuring software readiness for upcoming hardware.
  • Ecosystem Maturity: TraceLens (profiling) is now public (v0.4.0) with JAX support, and Primus (training) v0.4.0 added zero-bubble pipeline parallelism and Grok model support, strengthening the training story on AMD GPUs.
  • Inference Updates: vLLM v0.11.0 updated its AMD backend to target ROCm 7.0 base, ensuring the inference engine keeps pace with the SDK.

Competitive Analysis

  • NVIDIA Blackwell Readiness: Competitor tooling is aggressively optimizing for the Blackwell architecture. TransformerEngine v2.8 and pytorch/ao v0.14.1 introduced support for NVFP4 training recipes and MoE optimizations specifically for Blackwell.
  • CUDA 13 & Ecosystem: PyTorch 2.9.0 and vLLM v0.11.0 both added support for CUDA 13, indicating the software stack is moving to the next major CUDA version.
  • Distributed Training: Meta introduced Monarch, a new distributed programming framework, and Verl v0.6.0 advanced RLHF capabilities with SGLang/vLLM native server integration, pushing the boundaries of post-training infrastructure.

📂 Category Updates

🔴 AMD Ecosystem

[ROCm/ROCm]

  • Key Activity:
    • [2025-10-20] 🚨 RELEASE: therock-7.9.0 (Technology Preview)
    • [2025-10-10] RELEASE: rocm-7.0.2
  • Details:
    • ROCm 7.9.0: Introduces the “TheRock” build system. Changes include manylinux_2_28 compliance, architecture-specific Python packages, and a slimmed-down SDK. Note: there is no upgrade path from the 7.0 stream; intended as a developer preview. Supports MI355X/MI350X (CDNA4) and the MI300 series.
    • ROCm 7.0.2: Added support for RHEL 10.0 and Oracle Linux 10. Enabled RAG AI support and gsplat (Gaussian splatting).
  • Metrics: 105 PRs, 54 Issues (High Activity)

[AMD-AGI/Primus]

  • Key Activity:
    • [2025-10-18] RELEASE: v0.4.0
    • [2025-10-15] RELEASE: v0.3.0
  • Details:
    • v0.4.0: Added Python-based primus CLI entrypoint. Added support for Zero Bubble Pipeline Parallelism. Added support for Grok-1 and Grok-2 models.
    • Optimization: Enabled compile for Llama-3.1 (8B/70B/405B).
    • Integration: Updated Torchtitan support and synced with upstream.
  • Metrics: 46 PRs, 1 Issue
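
The zero-bubble idea can be sketched in plain Python (an illustration of the general technique, not Primus's implementation). Each backward pass is split into an input-gradient step ("B"), which sits on the critical path back to the previous pipeline stage, and a weight-gradient step ("W"), which has no downstream consumer and can therefore be deferred into otherwise-idle pipeline slots:

```python
# Conceptual sketch of zero-bubble pipeline parallelism for one stage,
# using a toy scalar "layer" y = w * x (not Primus's actual scheduler).

def forward(w, x):
    return w * x

def backward_input(w, grad_out):
    # "B" step: gradient w.r.t. the input, needed immediately by the
    # previous pipeline stage (critical path).
    return grad_out * w

def backward_weight(x, grad_out):
    # "W" step: gradient w.r.t. the weight. Nothing downstream depends on
    # it before the optimizer step, so it can be deferred.
    return grad_out * x

w = 3.0
grads_to_prev = []   # B results, sent upstream right away
deferred_w = []      # W work, queued to fill pipeline bubbles
for x, g in [(1.0, 0.5), (2.0, 0.25)]:   # two microbatches
    forward(w, x)
    grads_to_prev.append(backward_input(w, g))
    deferred_w.append(lambda x=x, g=g: backward_weight(x, g))

grad_w = sum(f() for f in deferred_w)    # flushed into idle slots, then accumulated
```

Deferring the W steps is what lets a zero-bubble scheduler fill the warm-up/cool-down bubbles of a conventional 1F1B schedule.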

[AMD-AGI/TraceLens]

  • Key Activity:
    • [2025-10-17] 🚨 RELEASE: v0.4.0 (Repo switched to Public)
  • Details:
    • New Features: Added TraceLens UI, JAX analysis reporting, and gemmologist integration for modeling GEMM efficiencies.
    • Perf Modeling: Added performance models for aten::bmm, flash_attn, and aiter::fmha_v3.
    • Usability: Added TraceDiff API for comparing traces.
  • Metrics: 48 PRs, 45 Issues

[triton-lang/triton]

  • Key Activity:
    • [2025-10-21] 🚨 RELEASE: v3.5.0
  • Details:
    • AMD Support: GFX950 (MI350) support added (MFMA scale, buffer load/store). Added ChainedDot schedule and Ping-Pong transformation for the AMD backend.
    • NVIDIA Support: Warp specialization enabled for persistent matmul/FA. Blackwell specific optimizations (TMEM support, FP8 MMAv2).
    • Core: Mutations are now disallowed in the language; ragged TMA support added.
  • Metrics: 232 PRs, 45 Issues

[tile-ai/tilelang]

  • Key Activity:
    • [2025-10-31] RELEASE: v0.1.6.post2
  • Details:
    • Last release supporting Python 3.8.
    • Added support for Huawei Ascend chips.
    • Implemented WGMMA for T.gemm_v2.
    • Added Metal backend support.
  • Metrics: 152 PRs, 95 Issues

🔥 PyTorch Ecosystem

[pytorch/pytorch]

  • Key Activity:
    • [2025-10-15] 🚨 RELEASE: v2.9.0
  • Details:
    • System: Minimum Python version is now 3.10. CUDA 13.0 support added.
    • Features: Symmetric memory primitives. FlexAttention on Intel GPUs. Muon optimizer introduced.
    • Deprecations: torch.onnx.dynamo_export removed.
    • Hardware: Linux aarch64 binary wheels enabled across all CUDA versions.
  • Metrics: 1900 PRs, 529 Issues (Very High Activity)

[pytorch/ao] (Architecture Optimization)

  • Key Activity:
    • [2025-10-13] RELEASE: v0.14.1
  • Details:
    • Blackwell: Prototype support for NVFP4 Quantization Aware Training (QAT) and MoE training on Blackwell GPUs.
    • Optimization: Added _scaled_grouped_mm for MoE training speedups.
  • Metrics: 142 PRs, 26 Issues
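
As background for the NVFP4 bullet above: NVFP4 stores values as 4-bit E2M1 numbers with one shared scale per 16-element block (FP8 E4M3 in the real recipe), and QAT "fake-quantizes" tensors through that grid during training so the model adapts to FP4 rounding error. A hedged stdlib sketch of the fake-quantization step, not torchao's implementation:

```python
# Non-negative magnitudes representable in FP4 E2M1 (sign handled separately).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_nvfp4_block(block):
    """Quantize one block (up to 16 floats) to E2M1 with a shared scale,
    then dequantize. The real recipe stores the scale in FP8 E4M3; it is
    kept in full precision here for clarity."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0  # map the block's max magnitude onto the largest FP4 value
    def snap(v):
        mag = min(abs(v) / scale, 6.0)
        q = min(E2M1, key=lambda g: abs(g - mag))  # round to nearest grid point
        return (q if v >= 0 else -q) * scale
    return [snap(v) for v in block]

out = fake_quant_nvfp4_block([0.1, -0.6, 1.2, 3.3])
print(out)
```

Small values round to the coarse FP4 grid while the block maximum is preserved exactly, which is the error profile QAT trains the model to tolerate.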

[pytorch/torchtitan]

  • Key Activity:
    • [2025-10-18] RELEASE: v0.2.0
  • Details:
    • Aligned with PyTorch 2.10 (dev) and TorchAO 0.15 (dev).
    • Consolidated DeepSeek V3 experiments.
  • Metrics: 174 PRs, 24 Issues

[meta-pytorch/monarch]

  • Key Activity:
    • [2025-10-22] 🚨 RELEASE: v0.1.0 (Initial Release)
  • Details:
    • New distributed programming framework for PyTorch based on scalable actor messaging and RDMA transfers.
    • Experimental status.
  • Metrics: 0 PRs, 0 Issues (Snapshot limitation; actual activity likely higher for a new launch)

🟢 NVIDIA Ecosystem

[NVIDIA/Megatron-LM]

  • Key Activity:
    • [2025-10-08] RELEASE: core_v0.14.0
  • Details:
    • Inference: Multi-batch CUDA Graphs for Dynamic Inference.
    • MoE: Active optimization for Blackwell Platform. Added MoE router fusion and expert parallel A2A overlapping.
    • Comm: Added HyperCommGrid for N-Dimensional Communication Grids.
  • Metrics: 0 PRs (Internal development model, mirrored to GitHub)

[NVIDIA/TransformerEngine]

  • Key Activity:
    • [2025-10-07] RELEASE: v2.8
    • [2025-10-01] RELEASE: v2.7
  • Details:
    • v2.8: Added support for NVFP4 training recipe and FP8 attention with current scaling.
    • v2.7: FP8 performance improvements and support for cublasMP backend.
  • Metrics: 0 PRs (Internal development model)

🔵 JAX & Google Ecosystem

[jax-ml/jax]

  • Key Activity:
    • [2025-10-15] 🚨 RELEASE: jax-v0.8.0
  • Details:
    • Breaking: jax.pmap is now in maintenance mode; users are encouraged to migrate to jax.shard_map and jax.jit.
    • Removed: the jax.experimental.host_callback and jax.util modules.
    • Features: Default nonsymmetric eigendecomposition on NVIDIA GPUs now uses cusolver.
  • Metrics: 0 PRs (Snapshot limitation)

[AI-Hypercomputer/maxtext]

  • Key Activity:
    • [2025-10-24] RELEASE: maxtext-tutorial-v1.0.0
    • [2025-10-18] RELEASE: tpu-recipes-v0.1.5
  • Details:
    • Released specific versions for tutorials and TPU recipes.
    • Docs updated for installation guides.
  • Metrics: 132 PRs, 12 Issues

⚡ Inference & Serving

[vllm-project/vllm]

  • Key Activity:
    • [2025-10-02] 🚨 RELEASE: v0.11.0
  • Details:
    • Major Architecture Shift: V0 engine removed entirely. V1 is the only engine.
    • Hardware: ROCm 7.0 support added. NVIDIA Blackwell BF16 fused MoE support.
    • Features: CUDA graph mode FULL_AND_PIECEWISE is now default. Added support for Qwen3-VL, DeepSeek-V3.2-Exp.
    • Quantization: Support for NVFP4 on dense models.
  • Metrics: 0 PRs (Snapshot limitation; actual activity is very high)

[llm-d/llm-d]

  • Key Activity:
    • [2025-10-10] RELEASE: v0.3.0
  • Details:
    • Increased support for specialized hardware backends (TPU, XPU).
    • Added support for DOKS (DigitalOcean Kubernetes).
  • Metrics: 0 PRs (Snapshot limitation)

[volcengine/verl] (Hybrid Training/Inference)

  • Key Activity:
    • [2025-10-15] RELEASE: v0.6.0
  • Details:
    • Architecture: Prototype Model Engine using FSDP + Ulysses.
    • Rollout: Migrated SGLang and vLLM to native server mode for agentic RL.
    • Algorithms: Added Token-level and Sequence-level importance sampling (TIS).
  • Metrics: 0 PRs (Snapshot limitation)
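
The token-level vs sequence-level distinction can be illustrated with a toy example (a sketch of the general importance-sampling idea, not verl's code). Given per-token log-probabilities from the rollout ("old") policy and the current ("new") policy, token-level IS applies one ratio per token, while sequence-level IS applies a single ratio equal to the product of the token ratios:

```python
import math

# Hypothetical per-token log-probs for one sampled response.
old_logps = [-1.0, -2.0, -0.5]   # rollout (behavior) policy
new_logps = [-0.9, -2.2, -0.4]   # current policy being optimized

# Token-level IS: one ratio per token, weighting that token's advantage.
token_ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

# Sequence-level IS: a single ratio for the whole response, equal to the
# product of the token ratios (exp of the summed log-prob difference).
seq_ratio = math.exp(sum(new_logps) - sum(old_logps))
```

Token-level ratios give lower-variance, per-position credit assignment; the sequence-level ratio is the exact correction but its variance grows with response length.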