GitHub Monthly Report: 2025-10-01 to 2025-10-31
📅 Engineering Report (2025-10-01 - 2025-10-31)
🚀 Executive Summary
October 2025 was a pivotal month for infrastructure modernization and next-generation hardware preparation across the AI ecosystem. AMD launched a technology preview of “TheRock” (ROCm 7.9.0), signaling a major shift toward standard Linux packaging and a slimmed-down SDK, while simultaneously open-sourcing the TraceLens profiling tool. PyTorch released v2.9.0, introducing symmetric memory primitives and officially supporting Python 3.14.
In the serving landscape, vLLM v0.11.0 marked the end of an era by removing the legacy V0 engine entirely, committing fully to the V1 architecture. Triton v3.5.0 arrived with critical support for AMD’s upcoming MI350 (GFX950) architecture and NVIDIA’s Blackwell, reflecting the industry’s rush to support the next wave of accelerators.
AMD-Related Updates
- ROCm Build System Overhaul: The release of ROCm 7.9.0 (“TheRock”) is a technology preview that introduces ManyLinux compliance and a slimmed-down SDK. This addresses long-standing developer friction regarding installation and portability across Linux distributions.
- TraceLens Goes Public: AMD-AGI open-sourced TraceLens (v0.4.0), a tool for analyzing profiler traces. This provides developers with much-needed visibility into PyTorch and JAX workloads on AMD hardware, including roofline analysis.
- MI350 Preparation: Triton v3.5.0 added explicit support for GFX950 (MI350 series), including MFMA scale support and scale preshuffling, ensuring the software ecosystem is ready for the next-gen hardware launch.
- Primus & Training: Primus (AMD’s training stack) introduced Zero-Bubble Pipeline Parallelism and aligned closely with TorchTitan updates, showing continued maturation in large-scale training capabilities.
Competitive Analysis
- NVIDIA Blackwell Readiness: NVIDIA’s ecosystem is heavily optimizing for the Blackwell architecture. TransformerEngine v2.8 and TorchAO v0.14.1 both introduced NVFP4 support and specific optimizations for MoE training on Blackwell GPUs. Triton v3.5.0 also added Warp Specialization enhancements specifically for Hopper and Blackwell.
- PyTorch Intel Support: PyTorch v2.9.0 enabled FlexAttention on Intel GPUs and added XPU wheel support, signaling Intel’s continued push to be a viable third option in the PyTorch upstream.
- Agentic RL Architectures: The Verl framework (v0.6.0) refactored its architecture to support “Server Mode” for rollouts, decoupling the inference engine (vLLM/SGLang) from the training loop. This architectural shift is designed to support multi-turn “Agentic RL,” a growing trend that AMD software stacks must accommodate efficiently.
📂 AMD Ecosystem
[ROCm/ROCm]
- Key Activity:
- [2025-10-20] 🚨 RELEASE: therock-7.9.0 (Technology Preview) - Introduction of “TheRock” build system.
- [2025-10-10] RELEASE: rocm-7.0.2 - Stable maintenance release.
- Details:
- ROCm 7.9.0: This is a preview of the future build infrastructure. It features ManyLinux_2_28 compliance (single build for multiple distros), architecture-specific Python packages, and a predictable 6-week release cadence. Note: No upgrade path from 7.0 stream yet.
- ROCm 7.0.2: Added support for RDNA4 (Radeon RX 9060) and newer kernels/OS (Ubuntu 24.04.3, RHEL 10).
- Future Deprecations: ROCm-EP (Execution Provider) for ONNX Runtime is being deprecated in favor of MIGraphX EP.
- Metrics: 105 PRs, 54 Issues
[AMD-AGI/TraceLens]
- Key Activity:
- [2025-10-17] 🚨 RELEASE: v0.4.0 - Repository switched to public.
- Details:
- Added `roofline_analyzer` for performance modeling.
- Integrated `gemmologist` for modeling GEMM efficiencies.
- Added support for JAX performance analysis.
- Implemented `TraceDiff` API for comparing traces.
- Metrics: 48 PRs, 45 Issues
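The `roofline_analyzer` feature models where a kernel sits relative to a device's compute and memory roofs. The classical roofline model behind that kind of analysis can be sketched in a few lines; this is an illustrative sketch of the standard formula, not TraceLens's actual API, and the GPU figures in the example are hypothetical:

```python
def roofline_attainable_tflops(peak_tflops, peak_bw_tbs, intensity):
    """Classical roofline: attainable throughput is the lesser of the
    compute roof and the memory roof (bandwidth x FLOPs-per-byte)."""
    return min(peak_tflops, peak_bw_tbs * intensity)

def is_memory_bound(peak_tflops, peak_bw_tbs, intensity):
    # A kernel is memory-bound when its arithmetic intensity sits left
    # of the ridge point (peak_tflops / peak_bw_tbs FLOPs per byte).
    return intensity < peak_tflops / peak_bw_tbs

# Hypothetical accelerator: 1000 TFLOP/s peak, 5 TB/s HBM bandwidth.
# A kernel at 300 FLOPs/byte hits the compute roof (compute-bound);
# one at 100 FLOPs/byte is capped at 500 TFLOP/s (memory-bound).
```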
[AMD-AGI/Primus]
- Key Activity:
- [2025-10-18] RELEASE: v0.4.0
- [2025-10-15] RELEASE: v0.3.0
- Details:
- Added support for Zero-Bubble Pipeline Parallelism.
- Added support for Grok-1 and Grok-2 models.
- Support for `torchtitan` with Primus-Turbo.
- Aligned FP8 linear arguments with Megatron.
- Metrics: 46 PRs, 1 Issue
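Zero-Bubble Pipeline Parallelism targets the idle warm-up/cool-down slots that a conventional synchronous 1F1B schedule leaves on each stage; it does so by splitting each backward pass into an activation-gradient part (critical path) and a weight-gradient part that can be deferred into those slots. A minimal sketch of the 1F1B bubble fraction being attacked (illustrative arithmetic, not Primus's scheduler):

```python
def pipeline_bubble_fraction(stages, microbatches):
    """Idle fraction of a synchronous 1F1B pipeline schedule:
    (p - 1) warm-up/cool-down slots out of (m + p - 1) total slots
    for p stages and m microbatches."""
    return (stages - 1) / (microbatches + stages - 1)

# 4 stages, 12 microbatches -> 20% bubble under plain 1F1B.
# Zero-bubble schedules back-fill those (stages - 1) slots with the
# deferred weight-gradient passes, driving this fraction toward zero.
```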
📂 PyTorch Ecosystem
[pytorch/pytorch]
- Key Activity:
- [2025-10-15] 🚨 RELEASE: v2.9.0
- Details:
- Symmetric Memory: Enables easy programming of multi-GPU kernels.
- FlexAttention: Enabled on Intel GPUs and optimized for X86 CPUs (Flash Decoding).
- Compatibility: Minimum Python version is now 3.10. Dropped MPS support for macOS versions below 14.
- Export: `torch.onnx.export` now defaults to `dynamo=True`.
- Metrics: 1900 PRs, 529 Issues
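FlexAttention's key idea is that variants like causal masking or ALiBi become a small callback that rewrites each raw attention score, instead of a new fused kernel. The pure-Python reference below mirrors that `score_mod(score, q_idx, kv_idx)` programming model on plain lists; it is a conceptual sketch, not the `torch.nn.attention.flex_attention` API itself:

```python
import math

NEG_INF = float("-inf")

def attention_with_score_mod(q, k, v, score_mod):
    """Reference attention where every raw score passes through a user
    callback, mirroring FlexAttention's score_mod programming model.
    q, k, v: lists of equal-length float vectors (one per position)."""
    dim = len(q[0])
    out = []
    for qi, qvec in enumerate(q):
        # Raw dot-product scores, each rewritten by score_mod.
        scores = [score_mod(sum(a * b for a, b in zip(qvec, kvec)) / math.sqrt(dim),
                            qi, ki)
                  for ki, kvec in enumerate(k)]
        m = max(scores)
        if m == NEG_INF:  # fully masked row
            out.append([0.0] * dim)
            continue
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        out.append([sum(p * v[ki][d] for ki, p in enumerate(probs))
                    for d in range(dim)])
    return out

def causal(score, q_idx, kv_idx):
    # Masking is just "return -inf for disallowed pairs".
    return score if kv_idx <= q_idx else NEG_INF
```

The compiler-backed version fuses the callback into one kernel; the semantics are the same.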
[pytorch/torchtitan]
- Key Activity:
- [2025-10-18] RELEASE: v0.2.0
- Details:
- Updated dependencies to PyTorch 2.10 (dev) and TorchAO 0.15 (dev).
- Consolidated DeepSeek V3 experiments.
- Work initiated on “Inductor Light Mode”.
- Metrics: 174 PRs, 24 Issues
[pytorch/ao] (Architecture Optimization)
- Key Activity:
- [2025-10-13] RELEASE: v0.14.1
- Details:
- Blackwell Support: Added prototype MoE training support on Blackwell GPUs and NVFP4 QAT (Quantization Aware Training).
- Performance: `_scaled_grouped_mm` added as a drop-in replacement for `torch._grouped_mm`, offering ~1.4x speedups for MoE training.
- Metrics: 142 PRs, 26 Issues
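A grouped matmul is the core primitive of MoE training: tokens are pre-sorted by routed expert, and each variable-size group is multiplied by its own expert's weight matrix in one fused launch. The pure-Python reference below shows those semantics on nested lists; the actual `_scaled_grouped_mm` kernel additionally fuses the quantization scaling and runs on GPU:

```python
def grouped_mm(tokens, group_sizes, expert_weights):
    """Reference grouped matmul: tokens are pre-sorted by expert, and
    group_sizes[i] consecutive tokens are multiplied by expert_weights[i].
    tokens: list of d_in-vectors; expert_weights[i]: d_in x d_out matrix
    as a list of rows."""
    out, start = [], 0
    for size, w in zip(group_sizes, expert_weights):
        d_out = len(w[0])
        for tok in tokens[start:start + size]:
            out.append([sum(x * w[k][j] for k, x in enumerate(tok))
                        for j in range(d_out)])
        start += size
    return out
```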
[meta-pytorch/monarch]
- Key Activity:
- [2025-10-22] 🚨 RELEASE: v0.1.0 - Initial Public Release.
- Details:
- A new distributed programming framework for PyTorch based on Actors.
- Features high-performance RDMA transfers and supervision trees for fault tolerance.
- Metrics: 0 PRs, 0 Issues
📂 Compilers & Kernels
[triton-lang/triton]
- Key Activity:
- [2025-10-21] 🚨 RELEASE: v3.5.0
- Details:
- AMD: Added GFX950 (MI350) support including MFMA scale support. Added “ChainedDot” schedule and ping-pong transformations.
- NVIDIA: Added Warp Specialization enhancements for Hopper/Blackwell and TMEM support.
- Language: Disallowed mutations in the language to fix semantic issues.
- Metrics: 232 PRs, 45 Issues
[tile-ai/tilelang]
- Key Activity:
- [2025-10-31] RELEASE: v0.1.6.post2
- Details:
- Final release supporting Python 3.8.
- Added support for Huawei Ascend chips.
- Implemented WGMMA for `T.gemm_v2`.
- Metrics: 152 PRs, 95 Issues
📂 Inference & Serving
[vllm-project/vllm]
- Key Activity:
- [2025-10-02] 🚨 RELEASE: v0.11.0
- Details:
- Architecture Change: Completely removed the V0 engine. V1 is now the only engine.
- Performance: `FULL_AND_PIECEWISE` is now the default CUDA graph mode.
- Hardware: Added support for ROCm 7.0.
- Quantization: FP8 per-token-group quantization and NVFP4 support for dense models.
- Metrics: 0 PRs (Repo activity tracked via releases)
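Per-token-group FP8 quantization computes one scale per small, contiguous group of channels within each token, so outlier channels don't force the whole activation row into a coarse scale. A minimal sketch of the scale computation against the e4m3 dynamic range (illustrative only; vLLM's kernels fuse this on GPU with format-specific rounding):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def per_token_group_scales(activations, group_size):
    """One scale per contiguous group of `group_size` channels within
    each token, mapping each group's max magnitude onto the FP8 range.
    Quantization then stores x / scale in FP8 and the scale separately."""
    scales = []
    for token in activations:
        row = []
        for i in range(0, len(token), group_size):
            amax = max(abs(x) for x in token[i:i + group_size])
            row.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
        scales.append(row)
    return scales
```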
[sgl-project/sglang]
- Key Activity:
- [2025-10-26] RELEASE: v0.5.4
- Details:
- DeepSeek: Full-set optimizations for DeepSeek-V3.2 (MTP, PD-Disagg).
- Features: Beta support for Overlap Scheduler (Speculative Decoding) and Piecewise CUDA graph for prefill.
- Integration: KTransformer integration and Native ModelOpt quantization support.
- Metrics: 0 PRs (Repo activity tracked via releases)
[volcengine/verl]
- Key Activity:
- [2025-10-15] RELEASE: v0.6.0
- Details:
- Architecture Shift: Rollout engine transitioned from SPMD mode to Server Mode (supporting separate SGLang/vLLM servers). This is optimized for multi-turn agentic RL.
- Algorithms: Introduced GSPO and Token-level TIS (Truncated Importance Sampling).
- Metrics: 0 PRs (Repo activity tracked via releases)
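Token-level truncated importance sampling corrects for the rollout policy (now a separate vLLM/SGLang server) drifting from the trainer's policy: each token's likelihood ratio is clipped from above so a few stale tokens can't blow up gradient variance. A minimal sketch of the weight computation (illustrative; not Verl's actual implementation):

```python
import math

def truncated_is_weights(logp_new, logp_old, cap):
    """Per-token truncated importance weights exp(logp_new - logp_old),
    clipped from above at `cap` to bound variance when the behavior
    (rollout) policy has drifted from the trainer's current policy."""
    return [min(math.exp(a - b), cap) for a, b in zip(logp_new, logp_old)]
```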
📂 NVIDIA Ecosystem
[NVIDIA/Megatron-LM]
- Key Activity:
- [2025-10-08] RELEASE: core_v0.14.0
- Details:
- MoE: Heavy optimization for fine-grained MoE on Blackwell. Added expert parallel A2A overlapping.
- Inference: FP8 inference support with padded tensors.
- Communication: Added `HyperCommGrid` for flexible N-dimensional communication groups.
- Metrics: 0 PRs (Repo activity tracked via releases)
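An N-dimensional communication grid factors the flat world of ranks into axes (data, tensor, pipeline, expert parallel, ...), and a collective along one axis spans exactly the ranks that agree on every other coordinate. The sketch below shows the rank/coordinate bookkeeping such an abstraction manages; the function names are hypothetical and this is not the `HyperCommGrid` API:

```python
def rank_to_coords(rank, dims):
    """Row-major decomposition of a flat rank into N-D grid coordinates."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return list(reversed(coords))

def axis_group(rank, dims, axis):
    """Ranks sharing every coordinate with `rank` except along `axis`:
    the set an axis collective (e.g. expert-parallel all-to-all) spans."""
    coords = rank_to_coords(rank, dims)
    world = 1
    for d in dims:
        world *= d
    return [r for r in range(world)
            if all(c == coords[i]
                   for i, c in enumerate(rank_to_coords(r, dims))
                   if i != axis)]

# In a 2x3 grid (axis 0 of size 2, axis 1 of size 3), rank 4 sits at
# (1, 1); its axis-1 group is ranks [3, 4, 5], its axis-0 group [1, 4].
```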
[NVIDIA/TransformerEngine]
- Key Activity:
- [2025-10-07] RELEASE: v2.8
- Details:
- Added support for NVFP4 training recipe.
- Added support for FP8 attention with current scaling recipes.
- Added `nvte_rmsnorm_bwd_add` for fusing RMSNorm and Add operations.
- Metrics: 0 PRs (Repo activity tracked via releases)