📅 Engineering Report (2025-10-01 - 2025-10-31)

🚀 Executive Summary

October 2025 was a pivotal month characterized by major foundational releases across the entire AI stack. PyTorch released v2.9.0, introducing CUDA 13 support and symmetric memory primitives. ROCm introduced a new build infrastructure (“TheRock”) via a technology preview in v7.9.0, signaling a shift in how the SDK is delivered. Triton v3.5.0 arrived with critical support for AMD’s next-gen MI350 (GFX950) architecture and NVIDIA’s Blackwell. In the inference space, vLLM v0.11.0 removed its V0 engine entirely, moving to the V1 architecture for improved performance and modularity.

  • 🚨 ROCm Build System Overhaul: ROCm 7.9.0 “TheRock” is a technology preview that introduces a new build and release infrastructure, creating a versioning discontinuity (7.9 alongside the 7.0 stream) and moving toward a more open, predictable six-week release cadence with manylinux compliance.
  • Next-Gen Hardware Support: Triton v3.5.0 added comprehensive support for the GFX950 (MI350) architecture, including MFMA scale support and scale preshuffling, ensuring software readiness for upcoming hardware.
  • Ecosystem Maturity: TraceLens (profiling) is now public (v0.4.0) with JAX support, and Primus (training) v0.4.0 added zero-bubble pipeline parallelism and Grok model support, strengthening the training story on AMD GPUs.
  • Inference Updates: vLLM v0.11.0 updated its AMD backend to target ROCm 7.0 base, ensuring the inference engine keeps pace with the SDK.

Competitive Analysis

  • NVIDIA Blackwell Readiness: Competitor tooling is aggressively optimizing for the Blackwell architecture. TransformerEngine v2.8 and pytorch/ao v0.14.1 introduced support for NVFP4 training recipes and MoE optimizations specifically for Blackwell.
  • CUDA 13 & Ecosystem: PyTorch 2.9.0 and vLLM v0.11.0 both added support for CUDA 13, indicating the software stack is moving to the next major CUDA version.
  • Distributed Training: Meta introduced Monarch, a new distributed programming framework, and Verl v0.6.0 advanced RLHF capabilities with SGLang/vLLM native server integration, pushing the boundaries of post-training infrastructure.

📂 Category Updates

🔴 AMD Ecosystem

[ROCm/ROCm]

  • Key Activity:
    • [2025-10-20] 🚨 RELEASE: therock-7.9.0 (Technology Preview)
    • [2025-10-10] RELEASE: rocm-7.0.2
  • Details:
    • ROCm 7.9.0: Introduces the “TheRock” build system. Changes include manylinux_2_28 compliance, architecture-specific Python packages, and a slimmed-down SDK. Note: there is no upgrade path from the 7.0 stream; intended as a developer preview. Supports MI355X/MI350X (CDNA4) and the MI300 series.
    • ROCm 7.0.2: Added support for RHEL 10.0 and Oracle Linux 10. Enabled RAG AI support and gsplat (Gaussian splatting).
  • Metrics: 105 PRs, 54 Issues (High Activity)

[AMD-AGI/Primus]

  • Key Activity:
    • [2025-10-18] RELEASE: v0.4.0
    • [2025-10-15] RELEASE: v0.3.0
  • Details:
    • v0.4.0: Added Python-based primus CLI entrypoint. Added support for Zero Bubble Pipeline Parallelism. Added support for Grok-1 and Grok-2 models.
    • Optimization: Enabled compile for Llama-3.1 (8B/70B/405B).
    • Integration: Updated Torchtitan support and synced with upstream.
  • Metrics: 46 PRs, 1 Issue
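
The zero-bubble idea can be sketched in plain Python (an illustration of the general technique, not Primus's implementation). Each backward pass is split into an input-gradient step ("B"), which sits on the critical path back to the previous pipeline stage, and a weight-gradient step ("W"), which has no downstream consumer and can therefore be deferred into otherwise-idle pipeline slots:

```python
# Conceptual sketch of zero-bubble pipeline parallelism for one stage,
# using a toy scalar "layer" y = w * x (not Primus's actual scheduler).

def forward(w, x):
    return w * x

def backward_input(w, grad_out):
    # "B" step: gradient w.r.t. the input, needed immediately by the
    # previous pipeline stage (critical path).
    return grad_out * w

def backward_weight(x, grad_out):
    # "W" step: gradient w.r.t. the weight. Nothing downstream depends on
    # it before the optimizer step, so it can be deferred.
    return grad_out * x

w = 3.0
grads_to_prev = []   # B results, sent upstream right away
deferred_w = []      # W work, queued to fill pipeline bubbles
for x, g in [(1.0, 0.5), (2.0, 0.25)]:   # two microbatches
    forward(w, x)
    grads_to_prev.append(backward_input(w, g))
    deferred_w.append(lambda x=x, g=g: backward_weight(x, g))

grad_w = sum(f() for f in deferred_w)    # flushed into idle slots, then accumulated
```

Deferring the W steps is what lets a zero-bubble scheduler fill the warm-up/cool-down bubbles of a conventional 1F1B schedule.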

[AMD-AGI/TraceLens]

  • Key Activity:
    • [2025-10-17] 🚨 RELEASE: v0.4.0 (Repo switched to Public)
  • Details:
    • New Features: Added TraceLens UI, JAX analysis reporting, and gemmologist integration for modeling GEMM efficiencies.
    • Perf Modeling: Added performance models for aten::bmm, flash_attn, and aiter::fmha_v3.
    • Usability: Added TraceDiff API for comparing traces.
  • Metrics: 48 PRs, 45 Issues

[triton-lang/triton]

  • Key Activity:
    • [2025-10-21] 🚨 RELEASE: v3.5.0
  • Details:
    • AMD Support: GFX950 (MI350) support added (MFMA scale, buffer load/store). Added ChainedDot schedule and Ping-Pong transformation for the AMD backend.
    • NVIDIA Support: Warp specialization enabled for persistent matmul/FA. Blackwell specific optimizations (TMEM support, FP8 MMAv2).
    • Core: Mutations are now disallowed in the language; ragged TMA support added.
  • Metrics: 232 PRs, 45 Issues

[tile-ai/tilelang]

  • Key Activity:
    • [2025-10-31] RELEASE: v0.1.6.post2
  • Details:
    • Last release supporting Python 3.8.
    • Added support for Huawei Ascend chips.
    • Implemented WGMMA for T.gemm_v2.
    • Added Metal backend support.
  • Metrics: 152 PRs, 95 Issues

🔥 PyTorch Ecosystem

[pytorch/pytorch]

  • Key Activity:
    • [2025-10-15] 🚨 RELEASE: v2.9.0
  • Details:
    • System: Minimum Python version is now 3.10. CUDA 13.0 support added.
    • Features: Symmetric memory primitives. FlexAttention on Intel GPUs. Muon optimizer introduced.
    • Deprecations: torch.onnx.dynamo_export removed.
    • Hardware: Linux aarch64 binary wheels enabled across all CUDA versions.
  • Metrics: 1900 PRs, 529 Issues (Very High Activity)

[pytorch/ao] (Architecture Optimization)

  • Key Activity:
    • [2025-10-13] RELEASE: v0.14.1
  • Details:
    • Blackwell: Prototype support for NVFP4 Quantization Aware Training (QAT) and MoE training on Blackwell GPUs.
    • Optimization: Added _scaled_grouped_mm for MoE training speedups.
  • Metrics: 142 PRs, 26 Issues
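
As background for the NVFP4 bullet above: NVFP4 stores values as 4-bit E2M1 numbers with one shared scale per 16-element block (FP8 E4M3 in the real recipe), and QAT "fake-quantizes" tensors through that grid during training so the model adapts to FP4 rounding error. A hedged stdlib sketch of the fake-quantization step, not torchao's implementation:

```python
# Non-negative magnitudes representable in FP4 E2M1 (sign handled separately).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_nvfp4_block(block):
    """Quantize one block (up to 16 floats) to E2M1 with a shared scale,
    then dequantize. The real recipe stores the scale in FP8 E4M3; it is
    kept in full precision here for clarity."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0  # map the block's max magnitude onto the largest FP4 value
    def snap(v):
        mag = min(abs(v) / scale, 6.0)
        q = min(E2M1, key=lambda g: abs(g - mag))  # round to nearest grid point
        return (q if v >= 0 else -q) * scale
    return [snap(v) for v in block]

out = fake_quant_nvfp4_block([0.1, -0.6, 1.2, 3.3])
print(out)
```

Small values round to the coarse FP4 grid while the block maximum is preserved exactly, which is the error profile QAT trains the model to tolerate.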

[pytorch/torchtitan]

  • Key Activity:
    • [2025-10-18] RELEASE: v0.2.0
  • Details:
    • Aligned with PyTorch 2.10 (dev) and TorchAO 0.15 (dev).
    • Consolidated DeepSeek V3 experiments.
  • Metrics: 174 PRs, 24 Issues

[meta-pytorch/monarch]

  • Key Activity:
    • [2025-10-22] 🚨 RELEASE: v0.1.0 (Initial Release)
  • Details:
    • New distributed programming framework for PyTorch based on scalable actor messaging and RDMA transfers.
    • Experimental status.
  • Metrics: 0 PRs, 0 Issues (Snapshot limitation; actual activity likely higher for a new launch)

🟢 NVIDIA Ecosystem

[NVIDIA/Megatron-LM]

  • Key Activity:
    • [2025-10-08] RELEASE: core_v0.14.0
  • Details:
    • Inference: Multi-batch CUDA Graphs for Dynamic Inference.
    • MoE: Active optimization for Blackwell Platform. Added MoE router fusion and expert parallel A2A overlapping.
    • Comm: Added HyperCommGrid for N-Dimensional Communication Grids.
  • Metrics: 0 PRs (Internal development model, mirrored to GitHub)

[NVIDIA/TransformerEngine]

  • Key Activity:
    • [2025-10-07] RELEASE: v2.8
    • [2025-10-01] RELEASE: v2.7
  • Details:
    • v2.8: Added support for NVFP4 training recipe and FP8 attention with current scaling.
    • v2.7: FP8 performance improvements and support for cublasMP backend.
  • Metrics: 0 PRs (Internal development model)

🔵 JAX & Google Ecosystem

[jax-ml/jax]

  • Key Activity:
    • [2025-10-15] 🚨 RELEASE: jax-v0.8.0
  • Details:
    • Breaking: jax.pmap is now in maintenance mode; users are encouraged to migrate to jax.shard_map and jax.jit.
    • Removed: the jax.experimental.host_callback and jax.util modules.
    • Features: Default nonsymmetric eigendecomposition on NVIDIA GPUs now uses cusolver.
  • Metrics: 0 PRs (Snapshot limitation)

[AI-Hypercomputer/maxtext]

  • Key Activity:
    • [2025-10-24] RELEASE: maxtext-tutorial-v1.0.0
    • [2025-10-18] RELEASE: tpu-recipes-v0.1.5
  • Details:
    • Released specific versions for tutorials and TPU recipes.
    • Docs updated for installation guides.
  • Metrics: 132 PRs, 12 Issues

⚡ Inference & Serving

[vllm-project/vllm]

  • Key Activity:
    • [2025-10-02] 🚨 RELEASE: v0.11.0
  • Details:
    • Major Architecture Shift: V0 engine removed entirely. V1 is the only engine.
    • Hardware: ROCm 7.0 support added. NVIDIA Blackwell BF16 fused MoE support.
    • Features: CUDA graph mode FULL_AND_PIECEWISE is now default. Added support for Qwen3-VL, DeepSeek-V3.2-Exp.
    • Quantization: Support for NVFP4 on dense models.
  • Metrics: 0 PRs (Snapshot limitation; actual activity is very high)

[llm-d/llm-d]

  • Key Activity:
    • [2025-10-10] RELEASE: v0.3.0
  • Details:
    • Increased support for specialized hardware backends (TPU, XPU).
    • Added support for DOKS (DigitalOcean Kubernetes).
  • Metrics: 0 PRs (Snapshot limitation)

[volcengine/verl] (Hybrid Training/Inference)

  • Key Activity:
    • [2025-10-15] RELEASE: v0.6.0
  • Details:
    • Architecture: Prototype Model Engine using FSDP + Ulysses.
    • Rollout: Migrated SGLang and vLLM to native server mode for agentic RL.
    • Algorithms: Added Token-level and Sequence-level importance sampling (TIS).
  • Metrics: 0 PRs (Snapshot limitation)
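
The token-level vs sequence-level distinction can be illustrated with a toy example (a sketch of the general importance-sampling idea, not verl's code). Given per-token log-probabilities from the rollout ("old") policy and the current ("new") policy, token-level IS applies one ratio per token, while sequence-level IS applies a single ratio equal to the product of the token ratios:

```python
import math

# Hypothetical per-token log-probs for one sampled response.
old_logps = [-1.0, -2.0, -0.5]   # rollout (behavior) policy
new_logps = [-0.9, -2.2, -0.4]   # current policy being optimized

# Token-level IS: one ratio per token, weighting that token's advantage.
token_ratios = [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

# Sequence-level IS: a single ratio for the whole response, equal to the
# product of the token ratios (exp of the summed log-prob difference).
seq_ratio = math.exp(sum(new_logps) - sum(old_logps))
```

Token-level ratios give lower-variance, per-position credit assignment; the sequence-level ratio is the exact correction but its variance grows with response length.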