📅 Engineering Report (2025-09-01 - 2025-09-30)

🚀 Executive Summary

September 2025 was dominated by the release of ROCm 7.0, which decouples the AMD GPU driver from the ROCm stack and adds support for next-generation hardware (MI350X/MI355X). The AMD ecosystem also saw the Primus training framework mature with v0.2.0, which adds initial “Llama 4” configurations.

On the framework side, PyTorch AO (Architecture Optimization) is pushing quantization forward with new QAT APIs and NVFP4 prototypes, and has published MI300X benchmark results. NVIDIA’s TransformerEngine shipped improved MXFP8 quantization kernels and ONNX export, while DeepSpeed added a new optimizer implementation (Muon).

  • 🚨 ROCm 7.0 Release: The most critical update of the month. It introduces support for MI355X, MI350X, and MI325X GPUs. Key features include support for OCP Microscaling formats (FP4/FP6/FP8), a move to LLVM 20 for the compiler, and the separation of the amdgpu driver from the ROCm user-space stack.
  • Primus v0.2.0: The unified training framework released v0.2.0, adding significant features for MoE (Mixture of Experts) and initial configurations for Llama 4 (Scout/Maverick).
  • JAX on Windows: JAX documentation was updated to reflect experimental support for WSL2 on ROCm, lowering the barrier to entry for developers on Windows machines.
  • TileLang Adoption: The tile-ai/tilelang compiler project integrated AMD GPU CI and ROCm architecture detection, signaling growing third-party compiler support for AMD hardware.

Competitive Analysis

  • NVIDIA TransformerEngine v2.6: NVIDIA continues to refine the FP8 ecosystem with improved MXFP8 kernels and ONNX export support, reinforcing its position in efficient Transformer training and inference workflows.
  • DeepSpeed Innovation: The v0.17.6 release introduces the Muon Optimizer and SuperOffload, continuing the project's focus on optimization for large-scale model training.
  • PyTorch AO & NVFP4: PyTorch AO is prototyping support for NVFP4 (NVIDIA’s 4-bit float), preparing the software stack for future Blackwell-architecture precision capabilities.
  • Triton Benchmarking: Open issues in triton-lang/triton about a flex attention regression on B200 (Blackwell) indicate that tuning for NVIDIA’s newest silicon is under way.

📂 Category Updates

🟢 AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • 🚨 [2025-09-11] Release v0.2.0: Significant update to the unified training framework.
    • Maintenance Health: Excellent (40 PRs opened, 40 closed).
  • Details:
    • Llama 4 Readiness: PRs merged adding initial Llama 4 configs (Scout-17B, Maverick-17B).
    • MoE Support: Enhanced Fused MoE router scatter logic and load balancing.
    • Integration: Added LightMegatronPretrainTrainer and config export support.
  • Metrics: 40 PRs 1 Issue
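
The Maintenance Health labels in this report rest on opened versus closed PR counts. As a rough illustration, the ratio behind them could be computed as below. The function name `closure_ratio` is hypothetical, and the report does not define exact thresholds for labels such as Excellent or Very High.

```python
def closure_ratio(opened, closed):
    """Return closed/opened PRs, or None when no PRs were opened.

    Hypothetical helper for illustration; not part of any tracked repo.
    """
    if opened == 0:
        # Entries such as "0 PRs 0 Issues" carry no signal.
        return None
    return closed / opened


# Figures taken from this report.
primus = closure_ratio(40, 40)    # AMD-AGI/Primus
tilelang = closure_ratio(99, 95)  # tile-ai/tilelang
print(primus, round(tilelang, 3))
```

A ratio near 1.0 means the backlog is not growing within the reporting window. It says nothing about review latency or PR size.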

ROCm/ROCm

  • Key Activity:
    • 🚨 [2025-09-16] Release v7.0.0: Major version release.
    • [2025-09-17] Release v7.0.1: Immediate patch release resolving CPER (Common Platform Error Record) issues.
    • Maintenance Health: High activity (162 New PRs).
  • Details:
    • Hardware: Support added for Instinct MI355X, MI350X, and MI325X.
    • Architecture: AMD GPU Driver (amdgpu) is now distributed separately from the ROCm software stack.
    • Precision: Added support for OCP FP4, FP6, and FP8 data types.
    • Compiler: Updated to AMD Clang version 20.0.0.
  • Metrics: 162 PRs 45 Issues
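
To make the OCP Microscaling idea concrete, here is a minimal sketch of block-scaled FP4 quantization. It assumes the OCP MX layout in which a block of up to 32 elements shares one power-of-two scale and each element is stored as FP4 E2M1. It is plain Python for illustration only, not ROCm code. The function names are hypothetical, and rounding ties resolve to the smaller magnitude here as a simplification.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest power of two in E2M1 (4.0 = 2**2)
BLOCK_SIZE = 32  # elements sharing one scale in the MX formats


def quantize_block(values):
    """Return (shared_exponent, quantized_elements) for one block."""
    assert len(values) <= BLOCK_SIZE
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Pick the shared power-of-two scale from the largest magnitude.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Rebuild approximate values from the shared exponent and elements."""
    return [q * 2.0 ** shared_exp for q in quantized]


exp, q = quantize_block([12.0, 4.0, -2.0, 0.0])
print(exp, q, dequantize_block(exp, q))
```

The design point is that the scale costs 8 bits per block rather than per element, so FP4 storage stays close to 4 bits per value while still covering a wide dynamic range across blocks.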

AMD-AGI/TraceLens

  • Key Activity:
    • Active development on kernel launchers and graph mode testing.
  • Details:
    • Added support for trtllm::cublas_scaled_mm.
    • Clarified UID vs Event in API.
  • Metrics: 17 PRs 31 Issues

tile-ai/tilelang

  • Key Activity:
    • 🚨 [2025-09-19] Release v0.1.6: Kernel language updates.
    • Maintenance Health: Very High (99 New PRs, 95 Closed).
  • Details:
    • AMD Integration: Added ROCm architecture detection and AMD GPU CI pipelines.
    • Features: Added Flash Attention example for MI300 series.
  • Metrics: 99 PRs 46 Issues
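
ROCm architecture detection generally comes down to finding the `gfx` target name of the installed GPU (MI300-series parts report `gfx942`). The sketch below parses such names out of text in the style of `rocminfo` output. It is an assumption-laden illustration, not TileLang's actual detection logic; `parse_gfx_targets` and the sample text are hypothetical.

```python
import re


def parse_gfx_targets(rocminfo_text):
    """Return unique gfx target names in order of first appearance."""
    seen = []
    for match in re.finditer(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen


# Illustrative text only; real tool output contains many more fields.
SAMPLE = """
  Name:                    gfx942
  Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
  Name:                    gfx90a
"""

print(parse_gfx_targets(SAMPLE))
```

In practice the text would come from running the ROCm tooling on the host, and a build system would map the detected name to a compiler target flag.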

🟠 PyTorch Ecosystem

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • 🚨 [2025-09-02] Release v0.13.0-rc8: Focus on Quantization Aware Training (QAT).
  • Details:
    • New Features: Added simple multi-step QAT API and prototype NVFP4 support.
    • Performance: Updated Float8 README with AMD MI300X benchmark results.
    • Deprecation: Dropped support for PyTorch 2.5 and older.
  • Metrics: 0 PRs 31 Issues (Post-release window)
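
For readers unfamiliar with QAT, the core mechanism is fake quantization: values are rounded to a low-precision grid in the forward pass while gradients flow through as if no rounding happened (the straight-through estimator). The sketch below shows that idea for a symmetric 4-bit integer grid. It is a generic illustration in plain Python, not the torchao API; all names and the choice of int4 are assumptions.

```python
def fake_quantize(x, scale, qmin=-8, qmax=7):
    """Round x to the nearest grid point, clamp, and map back to float."""
    q = round(x / scale)          # Python rounds exact ties to even
    q = max(qmin, min(qmax, q))   # clamp to the int4 range
    return q * scale


def ste_gradient(x, scale, qmin=-8, qmax=7):
    """Straight-through estimator: pass gradient only inside the range."""
    return 1.0 if qmin <= x / scale <= qmax else 0.0


print(fake_quantize(0.26, 0.25), fake_quantize(10.0, 0.25))
print(ste_gradient(0.26, 0.25), ste_gradient(10.0, 0.25))
```

Training against the rounded values lets the model adapt its weights to the quantization error before the real low-precision kernels are used at inference time.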

pytorch/pytorch

  • Key Activity:
    • Standard high-volume maintenance.
  • Details:
    • Cleanup of Python 3.9 dead code.
    • Updated the macOS deployment target to 11.0.
  • Metrics: 1799 PRs 649 Issues

pytorch/torchtitan

  • Key Activity:
    • FSDP2 (Fully Sharded Data Parallel 2) enhancements.
  • Details:
    • Added support for sync_module_states in FSDP2.
    • Refactored attention masks to be arguments to the model.
  • Metrics: 78 PRs 30 Issues

meta-pytorch/monarch

  • Key Activity:
    • 🚨 [2025-09-03] Release v0.0.0: First public release of the Monarch project.
  • Metrics: 0 PRs 0 Issues

🔵 NVIDIA / Competition

NVIDIA/TransformerEngine

  • Key Activity:
    • 🚨 [2025-09-15] Release v2.6: Feature update for Transformer acceleration.
  • Details:
    • Export: Added support for ONNX export.
    • Performance: Improved MXFP8 quantization kernels and KV caching kernels.
    • Fusion: Added gradient accumulation fusion for FSDP (megatron-core).
  • Metrics: 0 PRs 0 Issues

deepspeedai/DeepSpeed

  • Key Activity:
    • 🚨 [2025-09-19] Release v0.17.6: Optimization toolkit update.
  • Details:
    • Optimization: Enabled Muon Optimizer.
    • Memory: Added SuperOffload support.
    • DeepCompile: Fixes for IPG bucket clearing and training hangs.
  • Metrics: 49 PRs 32 Issues
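
As background on Muon: as publicly described, the optimizer replaces each 2D momentum update with an approximately orthogonalized version computed by a Newton-Schulz iteration. The sketch below shows that iteration in plain Python. The coefficients are the ones used in the public Muon reference code, to the best of my knowledge. This is not DeepSpeed's implementation, and the helper names are hypothetical.

```python
import math

# Quintic polynomial coefficients from the public Muon reference code (assumed).
A, B, C = 3.4445, -4.7750, 2.0315


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Push the singular values of g toward 1 without an explicit SVD."""
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / (norm + eps) for v in row] for row in g]
    for _ in range(steps):
        gram = matmul(x, transpose(x))   # X X^T
        gram2 = matmul(gram, gram)       # (X X^T)^2
        poly = [[B * p + C * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        # X <- A*X + (B*XX^T + C*(XX^T)^2) X
        x = [[A * xv + pv for xv, pv in zip(rx, rp)]
             for rx, rp in zip(x, px)]
    return x


# Singular values 3 and 1 on input; both end up near 1 on output.
print(newton_schulz([[3.0, 0.0], [0.0, 1.0]]))
```

The iteration does not converge exactly to 1; it trades precision for speed and leaves singular values in a band around 1, which is reported to be sufficient for the optimizer's purpose.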

triton-lang/triton

  • Key Activity:
    • Compiler backend improvements.
  • Details:
    • Debug info lowering to cubin/hsaco.
    • Issues tracked regarding B200 (Blackwell) flex attention regression.
  • Metrics: 270 PRs 40 Issues

🟣 JAX / Google Ecosystem

jax-ml/jax

  • Key Activity:
    • 🚨 [2025-09-16] Release v0.7.2: Maintenance and ecosystem alignment.
  • Details:
    • Dependency: Minimum NumPy version raised to 2.0.
    • AMD Support: Documentation updated to reflect experimental support for WSL2 on ROCm.
    • Removal: jax.dlpack.from_dlpack no longer accepts DLPack capsules.
  • Metrics: 668 PRs 77 Issues

AI-Hypercomputer/maxtext

  • Key Activity:
    • Documentation and CI improvements.
  • Details:
    • Added user guide for GPT-OSS.
    • Migrated Llama4 layers to NNX.
  • Metrics: 159 PRs 7 Issues

⚪ Other Frameworks

deepseek-ai/DeepEP

  • Key Activity:
    • 🚨 [2025-09-16] Release v1.2.1.
  • Details:
    • Added permute extension to hybrid-ep.
    • Support CUDA Graph for internode dispatch.
  • Metrics: 26 PRs 28 Issues

vllm-project/vllm

  • Key Activity:
    • Documentation maintenance.
  • Details:
    • Removed Neuron install docs (backend support dropped).
  • Metrics: 0 PRs 0 Issues