GitHub Monthly Report: 2025-09-01 to 2025-09-30
📅 Engineering Report (2025-09-01 - 2025-09-30)
🚀 Executive Summary
September 2025 marked a pivotal month for AI infrastructure with the release of ROCm 7.0, a major version update introducing support for the MI350/MI355X series and next-generation OCP data types (FP4/FP6/FP8). In parallel, the inference landscape saw a massive update with vLLM v0.10.2, which upgraded to PyTorch 2.8 and introduced significant architectural changes for both NVIDIA Blackwell and AMD ROCm support. The quantization ecosystem is moving aggressively toward low-precision formats (FP4/FP6), with both PyTorch AO and the ROCm libraries adding native support to handle next-gen LLM sizes. Notably, the appearance of "Llama 4" configurations in both AMD Primus and Google MaxText repositories suggests ecosystem preparation for an imminent model-generation shift.
AMD Related Updates
- 🚨 Major Release ROCm 7.0.0 & 7.0.1: This is the most significant update of the year, formally introducing support for AMD Instinct MI355X and MI350X GPUs. It decouples the AMDGPU driver from the ROCm stack and enables KVM Passthrough for the new accelerators.
- Primus v0.2.0 Release: AMD’s training framework (Primus) released v0.2.0, adding specific configurations for Llama 4 (17B/128E) and integrating Megatron-LM backends, signaling readiness for next-gen model training on AMD hardware.
- OCP Data Type Standardization: ROCm 7.0 libraries (hipBLASLt, MIOpen, etc.) now natively support Open Compute Project (OCP) Microscaling formats (MX) for FP4, FP6, and FP8, aligning with industry standards for low-precision training/inference.
- TileLang Adoption: The compiler project TileLang added specific support for AMD MI300 series (Flash Attention examples) and ROCm architecture detection, expanding the third-party compiler ecosystem for AMD.
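The OCP Microscaling (MX) formats mentioned above pair a small block of elements with one shared power-of-two scale, so each element needs only a few bits. A minimal pure-Python sketch of MXFP4-style block quantization (the E2M1 value grid and scale rule follow the OCP MX spec; this is an illustration, not the ROCm implementation):

```python
import math

# Representable non-negative values of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bits)
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mx_block(block):
    """MX-style quantization of one block: a shared power-of-two scale
    plus one FP4 E2M1 code per element. Returns (scale, codes), where
    codes are the per-element E2M1 values before rescaling."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: 2^(floor(log2(amax)) - emax_elem), with emax_elem = 2 for E2M1
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to E2M1's max magnitude
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, x))
    return scale, codes

def dequantize_mx_block(scale, codes):
    return [scale * c for c in codes]

block = [0.07, -1.3, 2.9, 0.01] * 8  # one 32-element block
scale, codes = quantize_mx_block(block)
recon = dequantize_mx_block(scale, codes)
```

Small values (0.07 here) collapse to zero while the block's largest magnitudes keep roughly one bit of mantissa precision, which is why MX quantization is applied per small block rather than per tensor.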
Competitive Analysis
- NVIDIA Blackwell Readiness: NVIDIA’s ecosystem is hardening for the Blackwell (SM100) architecture. vLLM v0.10.2 and TransformerEngine v2.6 added specific kernels for Blackwell, including MXFP4 MoE support, although FlashMLA was temporarily disabled on Blackwell due to compatibility issues.
- Proprietary vs. Open Quantization: While ROCm adopted OCP standards for FP4/FP6, PyTorch AO’s release (v0.13.0) introduced prototype support for NVFP4 (NVIDIA specific FP4) alongside standard formats, highlighting a divergence in low-precision data type strategies.
- Hugging Face Framework Consolidation: Hugging Face `transformers` removed library-wide support for TensorFlow and JAX, consolidating heavily around PyTorch. This solidifies PyTorch (and by extension, the ROCm PyTorch stack) as the critical path for model deployment.
📂 Category Updates
AMD Ecosystem
ROCm/ROCm
- Key Activity:
- 🚨 [2025-09-16] RELEASE: rocm-7.0.0 - Major release introducing MI355X/MI350X support.
- 🚨 [2025-09-17] RELEASE: rocm-7.0.1 - Patch release fixing out-of-bound CPERs for bad memory pages.
- Details:
- Hardware Support: Added MI355X and MI350X support across the stack (libraries, compiler, profilers).
- Architecture: The AMD GPU driver (`amdgpu`) is now distributed separately from the ROCm software stack.
- Precision: Full support for OCP `FP4`, `FP6`, and `FP8` data types in the HIP runtime and math libraries.
- Ecosystem: Added official support for Ray and llama.cpp.
- Metrics: 162 PRs (High Activity), 45 New Issues
AMD-AGI/Primus
- Key Activity:
- 🚨 [2025-09-11] RELEASE: v0.2.0 - Unified config/backend CLI and Turbo backend updates.
- Details:
- Llama 4 Readiness: Multiple PRs merged adding initial configs for Llama 4 17B variants (Maverick and Scout).
- Features: Added `LightMegatronPretrainTrainer`, fused MoE router scatter logic, and Async TP adaptation for TE 2.x.
- Performance: Updated Turbo Grouped GEMM for BF16/FP16.
- Metrics: 40 PRs (Perfect Close Rate), 1 New Issue
AMD-AGI/TraceLens
- Key Activity:
- High volume of new issues identifying API gaps.
- Details:
- Added support for `trtllm::cublas_scaled_mm` and initial graph-mode tests.
- Discussions opened regarding renaming kernel launchers to "leaf ops" and clarifying UID vs. Event APIs.
- Metrics: 17 PRs, 31 New Issues
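For context on the `trtllm::cublas_scaled_mm` op referenced above: a scaled matmul multiplies low-precision (e.g., FP8 or integer) operands, accumulates in higher precision, and applies per-tensor dequantization scales to the output. A pure-Python sketch of the arithmetic (the function and argument names here are illustrative, not TraceLens or TensorRT-LLM APIs):

```python
def scaled_mm(a_q, b_q, scale_a, scale_b):
    """Scaled matrix multiply: low-precision operand codes a_q, b_q with
    per-tensor scales. Accumulate in full precision, then apply the
    combined scale once per output element."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(a_q[i][k] * b_q[k][j] for k in range(inner))  # high-precision accumulate
            out[i][j] = acc * scale_a * scale_b                     # dequantize once per output
    return out
```

Applying the scales after accumulation (rather than to each operand element) is what lets the inner loop run entirely in the cheap low-precision format.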
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- Preparing for upcoming release cycles.
- Details:
- Heavy focus on `dynamo` and export functionality.
- Windows: Updated the libuv version.
- macOS: Updated the deployment target to 11.0.
- Metrics: 1799 PRs, 649 New Issues
pytorch/ao (Architecture Optimization)
- Key Activity:
- 🚨 [2025-09-02] RELEASE: v0.13.0-rc8
- Details:
- Quantization: Added prototype NVFP4 (NVIDIA FP4) and FP8 QAT support.
- Performance: Achieved 1.2x speedup on NVIDIA B200 (Blackwell) using MXFP8 with torchtitan.
- Deprecation: Dropped support for PyTorch 2.5 and older.
- Metrics: 0 PRs (Release Month), 31 New Issues
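Context for the FP8 QAT item: quantization-aware training runs the forward pass through a quantize-dequantize round trip so the network learns with quantization error, while the master weights stay in full precision. A minimal sketch of E4M3 rounding and fake quantization (illustrative only; torchao's actual QAT APIs differ, and subnormals are ignored here):

```python
import math

def round_to_e4m3(x):
    """Round a float to the nearest FP8 E4M3 value (normal range only)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), 448.0)      # 448 is E4M3's largest finite value
    e = max(math.floor(math.log2(mag)), -6)  # -6: smallest normal exponent
    ulp = 2.0 ** (e - 3)          # 3 mantissa bits => spacing 2^(e-3)
    return sign * round(mag / ulp) * ulp

def fake_quantize(w, scale):
    """Quantize-dequantize: the numeric core of QAT. The forward pass
    sees FP8-rounded weights; gradients pass straight through to the
    full-precision master copy."""
    return round_to_e4m3(w / scale) * scale
```

Note how the spacing between representable values grows with magnitude: near 0.25 the step is 1/64, while near 448 it is 32.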
pytorch/torchtitan
- Key Activity:
- Focus on FSDP2 maturity.
- Details:
- Added support for `sync_module_states` in FSDP2.
- Ported true BF16 training into forge experiments.
- Refactored attention interfaces.
- Metrics: 78 PRs, 30 New Issues
meta-pytorch/monarch
- Key Activity:
- 🚨 [2025-09-03] RELEASE: v0.0.0 - Initial public release on PyPI.
- Metrics: 0 PRs, 0 New Issues
Serving & Inference
vllm-project/vllm
- Key Activity:
- 🚨 [2025-09-13] RELEASE: v0.10.2 - Massive update with breaking changes.
- Details:
- Core: Upgraded to PyTorch 2.8.0.
- Hardware: Added native `aarch64` support for GB200.
- AMD ROCm: Added pipeline parallelism with Ray on ROCm (#24275) and TorchAO quantization support (#24400).
- Blackwell: Enabled DeepGEMM Linear and MXFP4 MoE for Blackwell, but disabled FlashMLA due to compatibility issues.
- Model Support: Added support for Qwen2.5-Omni, LFM2, Ovis2.5, and InternVL3.5.
- Metrics: 0 PRs (Post-release window), 0 New Issues (Data artifact; actual activity likely higher)
deepspeedai/DeepSpeed
- Key Activity:
- 🚨 [2025-09-19] RELEASE: v0.17.6
- Details:
- Optimization: Enabled Muon Optimizer.
- Features: Added “SuperOffload” and support for non-ZeRO modes.
- DeepCompile: Fixes for IPG bucket clearing and gradient buffer access.
- Metrics: 49 PRs, 32 New Issues
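The Muon optimizer mentioned above replaces the elementwise Adam-style update with an approximately orthogonalized momentum matrix, typically computed by a few Newton-Schulz iterations. A pure-Python sketch of the classic cubic Newton-Schulz orthogonalization (Muon itself uses a tuned quintic variant; all names here are illustrative, not DeepSpeed APIs):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the orthogonal polar factor of G: normalize by the
    Frobenius norm so every singular value lies in (0, 1], then iterate
    X <- 1.5*X - 0.5*(X X^T X), which pushes each singular value toward 1
    without ever computing an SVD."""
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

# Orthogonalizing a positive diagonal matrix drives its entries toward 1
X = newton_schulz_orthogonalize([[3.0, 0.0], [0.0, 1.0]])
```

The iteration uses only matrix multiplies, which is why it maps well onto GPU GEMM hardware compared with an explicit SVD.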
deepseek-ai/DeepEP
- Key Activity:
- 🚨 [2025-09-16] RELEASE: v1.2.1
- Details:
- Added support for CUDA Graph in internode dispatch (normal kernel).
- Added permute extension to hybrid-ep.
- Metrics: 26 PRs, 28 New Issues
Compilers & Languages
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-09-19] RELEASE: v0.1.6
- Details:
- AMD Support: Added ROCm architecture detection and Flash Attention examples for MI300 series.
- Huawei Support: Added support for Ascend NPU IR.
- Features: Added support for 1D TMA (Tensor Memory Accelerator) operations.
- Metrics: 99 PRs, 46 New Issues
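The Flash Attention examples added for MI300 rely on the streaming (online) softmax trick: keys and values are visited one tile at a time with a running max and normalizer, so the full row of attention scores is never materialized. A pure-Python sketch for a single query row (illustrative of the algorithm, not TileLang code):

```python
import math

def flash_attention_row(q, keys, values, scale):
    """One query row of flash-attention-style streaming softmax.
    Maintains a running score max (m), running normalizer (l), and an
    unnormalized output accumulator; each new key rescales the old state."""
    m = float("-inf")            # running max of scores seen so far
    l = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0]) # unnormalized weighted sum of values
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        correction = math.exp(m - m_new) if m != float("-inf") else 0.0
        p = math.exp(s - m_new)
        l = l * correction + p
        acc = [a * correction + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

out = flash_attention_row(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # keys
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # values
    1.0,
)
```

Because the state is only (m, l, acc), a kernel can stream arbitrarily long key/value sequences through fast on-chip memory, which is the property the MI300 examples exploit.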
triton-lang/triton
- Key Activity:
- Stabilization and bug fixes.
- Details:
- Addressed a B200 (Blackwell) `flex_attention_fwd` performance regression.
- Investigated issues with casting `float16` to `float8e5`.
- Metrics: 270 PRs, 40 New Issues
openxla/xla
- Key Activity:
- Ongoing maintenance.
- Details:
- Issues reported with profiling HLO Ops on GPU.
- XLA:TPU translation issues reported for `mhlo.acosh`.
- Metrics: 1282 PRs, 6 New Issues
Other Frameworks
huggingface/transformers
- Key Activity:
- Strategic framework shift.
- Details:
- [2025-09-18] Fully removed TensorFlow and JAX support library-wide.
- Added Bengali language README and updated formatting.
- Metrics: 514 PRs, 145 New Issues
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-09-15] RELEASE: v2.6
- Details:
- Fusion: Added gradient accumulation fusion for FSDP (Megatron-Core).
- MoE: Optimized permute fusion kernels.
- Export: Added support for ONNX export.
- Metrics: 0 PRs (Repo snapshot), 0 New Issues
jax-ml/jax
- Key Activity:
- 🚨 [2025-09-16] RELEASE: jax-v0.7.2
- Details:
- Minimum supported NumPy is now 2.0.
- Updated documentation for experimental ROCm support on WSL2.
- Removed `jax2tf` non-native serialization support.
- Metrics: 668 PRs, 77 New Issues