GitHub Monthly Report: 2025-09-01 to 2025-09-30
🚀 Executive Summary
September 2025 was a pivotal month characterized by major infrastructure updates across the ecosystem. ROCm 7.0.0 was released, marking a significant architectural shift by decoupling the AMDGPU driver from the ROCm stack and introducing support for the next-generation MI350/MI355X series. Concurrently, Primus (AMD’s training framework) reached v0.2.0 with Llama 4 configurations, signaling readiness for future model architectures.
On the competitive front, vLLM released v0.10.2 with massive updates, including native Blackwell (SM100) optimizations and a mandatory upgrade to PyTorch 2.8.0. PyTorch AO (Architecture Optimization) introduced NVFP4 and FP8 quantization support, directly targeting NVIDIA’s latest hardware capabilities. DeepSpeed introduced the Muon optimizer, continuing the trend of integrating cutting-edge research into training stacks.
AMD Related Updates
- 🚨 ROCm 7.0.0 Release: This major release introduces support for the Instinct MI355X and MI350X GPUs. Crucially, the AMD GPU Driver is now versioned and distributed separately from the ROCm software stack, simplifying driver management.
- Primus v0.2.0: AMD’s unified training framework released v0.2.0, adding Llama 4 configurations, unifying backend CLIs, and integrating “LightMegatron” for cleaner config-based integration.
- JAX on Windows: JAX documentation was updated to reflect experimental ROCm support under WSL2, lowering the barrier to entry for Windows developers using AMD hardware.
- vLLM Support: The new vLLM release enables Pipeline Parallelism on ROCm using Ray, and adds TorchAO quantization enablement for ROCm platforms.
Competitive Analysis
- NVIDIA Blackwell Readiness: vLLM v0.10.2 includes specific kernels for SM100 (Blackwell), including FlashInfer CUTLASS MoE FP8 backends and MXFP4 support.
- Quantization Wars: PyTorch AO v0.13.0 added prototype support for NVFP4 (NVIDIA’s 4-bit floating-point format) and FP8 Quantization-Aware Training (QAT). This tight integration with NVIDIA-specific low-precision formats puts competitive pressure on other vendors’ software-side quantization support.
- DeepSpeed Innovations: DeepSpeed v0.17.6 enables the Muon Optimizer, a new optimization technique gaining traction in research, ensuring NVIDIA users have immediate access to the latest training methods.
- Broadening Hardware Support: TileLang added support for Huawei Ascend chips, indicating a diversifying hardware landscape for compiler backends beyond the AMD/NVIDIA duopoly.
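To make the quantization formats above concrete: NVFP4-style quantization maps each small block of values onto a 4-bit E2M1 grid with a shared per-block scale. A toy, dependency-free sketch of that idea (the grid values follow the E2M1 layout; real NVFP4 stores the scale in FP8 and uses fixed 16-element blocks, both simplified here):

```python
# Toy sketch of block-scaled FP4 (E2M1) quantization in the spirit of NVFP4.
# Simplified: the per-block scale is kept as a Python float, whereas real
# NVFP4 stores it in FP8 (E4M3) and uses fixed 16-element blocks.

# The 8 non-negative magnitudes representable in E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a block of floats to (scale, codes); each code is a
    (magnitude, sign) pair indexing the signed E2M1 grid."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # largest value maps to +/-6
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(g - target))  # nearest point
        codes.append((mag, -1.0 if x < 0 else 1.0))
    return scale, codes

def dequantize_block(scale, codes):
    return [sign * mag * scale for mag, sign in codes]

vals = [0.01, -0.3, 0.7, 1.2, -2.5, 0.0, 0.9, -0.05]
scale, codes = quantize_block(vals)
print(dequantize_block(scale, codes))  # block max -2.5 round-trips exactly
```

The per-block scale is what lets a 4-bit grid cover weights with very different dynamic ranges; the competitive question is which vendors ship fast kernels for these layouts.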
📂 Category Updates
🔴 AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-09-11] 🚨 RELEASE: v0.2.0 - Significant update unifying backend CLIs and enhancing MoE support.
- Details:
- [2025-09-11] Added Llama 4 configurations (17B models), suggesting proactive tuning for unreleased/upcoming open weights.
- [2025-09-11] Integration of `LightMegatronPretrainTrainer` and config export support.
- [2025-09-11] Fused MoE router scatter logic added to `primus_turbo`.
- Metrics: 40 PRs, 1 issue
ROCm/ROCm
- Key Activity:
- [2025-09-16] 🚨 RELEASE: rocm-7.0.0 - Major release supporting MI350/MI355X.
- [2025-09-17] RELEASE: rocm-7.0.1 - Quality release resolving CPER memory page reporting issues.
- Details:
- [2025-09-16] Driver Separation: AMD GPU Driver (amdgpu) is now distributed separately from the ROCm stack.
- [2025-09-16] AI Frameworks: Added support for PyTorch 2.7, JAX 0.6.0, and OCP `FP8` data types in vLLM.
- [2025-09-16] Compilers: Introduction of the AMD Next-Gen Fortran compiler (`llvm-flang`).
- Metrics: 162 PRs, 45 issues
AMD-AGI/TraceLens
- Key Activity:
- [2025-09-XX] Focus on kernel launchers and graph mode testing.
- Details:
- [2025-09-XX] Added support for `trtllm::cublas_scaled_mm` and an initial `test_graph_mode.py`.
- Metrics: 17 PRs, 31 issues
🔥 PyTorch & Frameworks
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-09-02] 🚨 RELEASE: v0.13.0 - Major feature release for quantization.
- Details:
- [2025-09-02] NVFP4 Support: Prototype support for NVIDIA’s custom 4-bit float format.
- [2025-09-02] MXFP8 Speedups: Achieved 1.2x speedup vs BF16 on NVIDIA B200 for pretraining.
- [2025-09-02] Dropped support for PyTorch versions older than 2.6.
- Metrics: 0 PRs, 31 issues
pytorch/pytorch
- Key Activity:
- High-volume maintenance and Python version cleanup.
- Details:
- [2025-09-24] Removal of dead code related to Python 3.9.
- [2025-09-05] macOS deployment target updated to 11.0.
- Metrics: 1799 PRs, 649 issues
deepspeedai/DeepSpeed
- Key Activity:
- [2025-09-19] 🚨 RELEASE: v0.17.6 - Feature release.
- Details:
- [2025-09-19] Enabled Muon Optimizer.
- [2025-09-24] Documentation update for “SuperOffload” release.
- [2025-09-19] Added RISC-V 64 CPU support in shared memory communications.
- Metrics: 0 PRs, 0 issues
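For context on what Muon does differently from Adam-style optimizers: it orthogonalizes the momentum matrix (via a Newton-Schulz iteration) before applying the update. A dependency-free toy sketch using the classic cubic iteration — the DeepSpeed/Muon implementations use a tuned quintic variant, and the 2x2 matrices and function names here are purely illustrative:

```python
# Toy sketch of the core of a Muon-style update: orthogonalize the momentum
# matrix with a Newton-Schulz iteration, then step in that direction.
# Cubic iteration for clarity (real Muon uses a tuned quintic variant);
# pure-Python matrices to stay dependency-free.

def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the nearest orthogonal matrix to g (its polar factor)."""
    # Normalize so singular values fall inside the cubic iteration's
    # convergence region (0, sqrt(3)).
    fro = sum(x * x for row in g for x in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)  # X X^T X
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

def muon_step(weights, momentum, lr=0.1):
    """One Muon-style update: W <- W - lr * orth(momentum)."""
    o = newton_schulz_orthogonalize(momentum)
    return [[w - lr * o[i][j] for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

o = newton_schulz_orthogonalize([[2.0, 0.5], [0.3, 1.0]])
# o @ o^T is now approximately the identity matrix
```

The appeal is that the update direction has uniform "strength" across all directions of the weight matrix, which is why it is attracting interest for large-matrix layers in transformers.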
🟢 Inference & Serving
vllm-project/vllm
- Key Activity:
- [2025-09-13] 🚨 RELEASE: v0.10.2 - Massive update with 740 commits.
- Details:
- [2025-09-13] Breaking Change: Upgrade to PyTorch 2.8.0.
- [2025-09-13] Blackwell Support: DeepGEMM Linear and MoE FP8 backends for SM100.
- [2025-09-13] ROCm Updates: Enabled Pipeline Parallelism with Ray on ROCm; enabled TorchAO quantization.
- [2025-09-13] Native aarch64 support: Enabled usage on GB200 platforms.
- Metrics: 0 PRs, 0 issues
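As background on the pipeline-parallelism feature: the technique splits a model's layers into sequential stages, with micro-batches flowing from stage to stage (vLLM schedules the stages across GPUs via Ray). A toy, GPU-free sketch of just the dataflow — the stage split, the four "layers", and the function names are illustrative, not vLLM's API:

```python
# Toy sketch of pipeline parallelism's dataflow: a model's layers are
# split into sequential stages, and each input flows stage to stage.
# Pure Python, no GPUs; real engines run the stages concurrently on
# different devices and overlap micro-batches.

def make_stage(layers):
    def stage(x):
        for f in layers:
            x = f(x)
        return x
    return stage

# A "model" of four layers, split into two pipeline stages.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
stage0 = make_stage(layers[:2])   # would live on GPU 0
stage1 = make_stage(layers[2:])   # would live on GPU 1

def pipeline_run(microbatches):
    # Chaining the stages gives the same result as the unsplit model;
    # the win in a real engine is concurrency, not the math.
    return [stage1(stage0(x)) for x in microbatches]

print(pipeline_run([1, 2, 3]))  # → [1, 9, 25]
```

Ray's role in the vLLM implementation is to place and coordinate these stages across processes and nodes, which is what the ROCm enablement above unlocks on AMD multi-GPU systems.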
NVIDIA/TransformerEngine
- Key Activity:
- [2025-09-15] 🚨 RELEASE: v2.6 - Performance and fusion updates.
- Details:
- [2025-09-15] Added ONNX export support.
- [2025-09-15] Support for gradient accumulation fusion with FSDP (Megatron-core).
- Metrics: 0 PRs, 0 issues
🔵 JAX Ecosystem
jax-ml/jax
- Key Activity:
- [2025-09-16] RELEASE: v0.7.2 - Maintenance and deprecation release.
- Details:
- [2025-09-23] ROCm Update: Documentation updated to reflect experimental ROCm support under WSL2.
- [2025-09-16] Added CUDA 13 to installation documentation.
- Metrics: 0 PRs, 0 issues
AI-Hypercomputer/maxtext
- Key Activity:
- Documentation and CI hardening.
- Details:
- [2025-09-25] Added user guide and tests for GPT-OSS.
- Metrics: 0 PRs, 0 issues
🌐 Emerging Tech
tile-ai/tilelang
- Key Activity:
- [2025-09-19] RELEASE: v0.1.6 - Compiler stack updates.
- Details:
- [2025-09-29] Announced support for Huawei Ascend chips, broadening the compiler’s backend targets beyond NVIDIA/AMD.
- Metrics: 0 PRs, 0 issues