📅 Engineering Report (2025-05-01 - 2025-05-31)

🚀 Executive Summary

May 2025 was a significant month for infrastructure releases. AMD launched ROCm 6.4.1, introducing RDNA4 support and the new ROCm Data Science Toolkit (ROCm-DS). PyTorch AO released v0.11.0 with prototype Mixture-of-Experts (MoE) quantization and PT2 export quantization flows. NVIDIA released TransformerEngine v2.3, which enables FP8 weights under FSDP and adds support for the RTX 5090. Activity in PyTorch Core remains extremely high (1480 PRs, 754 issues), while the JAX ecosystem saw a patch release, v0.6.1.

  • ROCm 6.4.1 Release: The major highlight of the month. Key features include support for RDNA4 GPUs (Radeon AI PRO R9700, RX 9070/9060 XT), the introduction of DPX partition mode for MI300X, and the beta launch of the ROCm Data Science Toolkit (ROCm-DS).
  • Tooling Deprecations: AMD announced that its legacy tools are deprecated, effective Q1 2026: ROCm SMI in favor of AMD SMI, and ROCTracer and ROCProfiler in favor of ROCprofiler-SDK.
  • PyTorch AO Integration: The PyTorch AO v0.11.0 release explicitly included optimizations for ROCm, specifically “preshuffled weight mm,” indicating active upstream contribution to PyTorch quantization libraries.
  • Primus & TraceLens: Internal AMD tooling saw healthy activity, with Primus updating Megatron-LM submodules and TraceLens enabling batch GEMM functionality via Gemmologist.

Competitive Analysis

  • NVIDIA FP8 Dominance: With TransformerEngine v2.3, NVIDIA has enabled FP8 weights when using Fully Sharded Data Parallel (FSDP). Keeping sharded weights in FP8 can reduce both memory footprint and communication volume, which is a meaningful competitive advantage for training LLMs at scale. The release also adds support for the RTX 5090, extending the library to consumer Blackwell GPUs.
  • MoE Quantization: PyTorch AO v0.11.0 introduced prototype support for quantizing MoE modules. As MoE architectures (like Mixtral and DeepSeek) dominate the efficient LLM landscape, this tooling is vital.
  • DeepSeek Optimization: Both NVIDIA (TransformerEngine) and the broader ecosystem (RTP-LLM, torchtitan) are actively optimizing for DeepSeek architectures, particularly Float8 block scaling and training-loop stability.

📂 Category Updates

AMD Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-05-21] 🚨 RELEASE: rocm-6.4.1
    • [2025-05-21] Documentation updates for tools and versioning.
  • Details:
    • Hardware Support: Added RDNA4 support (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT).
    • Features: DPX partition mode for MI300X; beta introduction of the ROCm Data Science Toolkit (ROCm-DS).
    • Deprecations: Announced end-of-life plans for ROCm SMI (replaced by AMD SMI) and legacy profiling tools (replaced by ROCprofiler-SDK).
    • Fixes: Resolved MSCCL initialization failures and ROCm SMI uninstallation issues on RHEL/SLES.
  • Metrics: 116 PRs 36 Issues

AMD-AGI/Primus

  • Key Activity:
    • [2025-05-23] Megatron-LM submodule update (bumped to May 2025 version).
    • [2025-05-20] Config unification and README refactoring.
  • Details:
    • New PR: Update mixtral pretrain configs.
    • Fix: Resolved interleaved virtual pipeline training errors in Megatron integration.
  • Metrics: 17 PRs 0 Issues

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-05-xx] Performance modeling improvements and bug fixes.
  • Details:
    • Feature: Enabling batch GEMM through gemmologist.
    • Feature Request: Expose functionality for plotting roofline models.
    • Fix: Made input strides optional to prevent performance model crashes.
  • Metrics: 44 PRs 11 Issues
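
The batch GEMM and roofline items above rest on simple arithmetic that a short sketch can make concrete. This is not TraceLens or Gemmologist code: the function names are hypothetical, and the peak compute and bandwidth figures are placeholder values, not the specifications of any real accelerator.

```python
def batch_gemm_flops(batch, m, n, k):
    """FLOPs for a batched GEMM: one multiply and one add per (m, n, k) triple."""
    return 2 * batch * m * n * k


def batch_gemm_bytes(batch, m, n, k, bytes_per_elem=2):
    """Minimum memory traffic: read A (m x k) and B (k x n), write C (m x n)."""
    return batch * (m * k + k * n + m * n) * bytes_per_elem


def roofline_tflops(flops, bytes_moved, peak_tflops, peak_tbps):
    """Attainable throughput = min(compute roof, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved  # FLOPs per byte
    return min(peak_tflops, peak_tbps * intensity)


# Hypothetical accelerator: 1000 TFLOP/s peak compute, 5 TB/s memory bandwidth.
PEAK_TFLOPS = 1000.0
PEAK_TBPS = 5.0

small = (8, 64, 64, 64)        # low arithmetic intensity
large = (8, 4096, 4096, 4096)  # high arithmetic intensity

for shape in (small, large):
    f = batch_gemm_flops(*shape)
    b = batch_gemm_bytes(*shape)
    print(shape, round(roofline_tflops(f, b, PEAK_TFLOPS, PEAK_TBPS), 1))
```

Under these assumed numbers the small shape sits on the bandwidth-limited slope of the roofline, while the large shape reaches the compute roof. A performance model draws this distinction before comparing against measured kernel times.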

ROCm/MAD

  • Key Activity:
    • Updates to benchmarking scripts.
  • Details:
    • Update: vLLM 05/27 release integration.
    • Update: Profiling path scripts for vLLM.
  • Metrics: 14 PRs 0 Issues

PyTorch Ecosystem

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-05-09] 🚨 RELEASE: v0.11.0
    • [2025-05-22] QAT and FSDP integration discussions.
  • Details:
    • Feature: MoE Quantization prototype (Int8/Int4 weight-only).
    • Feature: PT2 Export Quantization APIs.
    • AMD Specific: Added “preshuffled weight mm” for ROCm.
    • Benchmarks: Added microbenchmarking framework for inference APIs.
  • Metrics: 0 PRs 22 Issues (Note: the repo snapshot showed 0 PRs, likely because of the release freeze, but the release notes indicate significant activity).
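
For readers unfamiliar with weight-only quantization, the idea behind the Int8 scheme listed above can be sketched in plain Python. This is not the torchao API. It assumes symmetric per-row (per-output-channel) absmax scaling, which is one common choice, and the expert names and weight values are hypothetical.

```python
def quantize_int8_per_row(weight):
    """Symmetric per-row int8 quantization.

    Returns (rows of int8 values, per-row scales). Illustration only.
    """
    q_rows, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0  # avoid division by zero
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate float weights from int8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical weights for two "experts", each a small 2 x 4 matrix.
experts = {
    "expert_0": [[0.10, -0.50, 0.25, 1.27], [0.0, 0.02, -0.04, 0.01]],
    "expert_1": [[2.0, -1.0, 0.5, 0.0], [-0.3, 0.3, -0.15, 0.15]],
}

for name, w in experts.items():
    q, s = quantize_int8_per_row(w)
    w_hat = dequantize(q, s)
    max_err = max(abs(a - b) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb))
    print(name, q, f"max abs error = {max_err:.4f}")
```

Each expert's weights are quantized independently, so activations stay in their original precision and only the stored weights shrink. In an MoE model the expert weights account for most of the parameters, which is why weight-only schemes are a natural fit.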

pytorch/pytorch

  • Key Activity:
    • [2025-05-14] Documentation and URL fixes.
    • High-volume continuous integration and bug-fix work.
  • Details:
    • Issue: FP8 scaled mm lowering ignores scale_result argument.
    • Issue: Inconsistent results for torch.amin() between CPU and GPU.
    • Fix: NumPy compatibility for 2D small list indices.
  • Metrics: 1480 PRs 754 Issues

pytorch/torchtitan

  • Key Activity:
    • DeepSeek model training loop hardening.
  • Details:
    • Fix: DeepSeek router collapse in the training loop.
    • Feature: [WIP] SimpleFSDP support for HSDP/DDP + TP.
  • Metrics: 65 PRs 25 Issues

NVIDIA Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-05-14] 🚨 RELEASE: v2.3
  • Details:
    • Feature: Enabled FP8 weights with FSDP (PyTorch).
    • Hardware: Added support for RTX 5090.
    • Feature: Float8 block scaling recipe (DeepSeek v3 paper implementation).
    • Performance: CPU offloading of activation tensors when using FP8 attention.
  • Metrics: 0 PRs 0 Issues
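
Block scaling assigns one scale factor to each fixed-size block of a tensor instead of one per tensor, so an outlier only affects the block it sits in. The following is a minimal sketch, not TransformerEngine's API. It assumes the FP8 E4M3 format (largest finite magnitude 448) and a block length of 128 along one dimension, and it models only scaling and clamping, not FP8 rounding. All function names are hypothetical.

```python
E4M3_MAX = 448.0   # largest finite magnitude in FP8 E4M3
BLOCK = 128        # assumed block length for this sketch


def block_scales(values, block=BLOCK, fp8_max=E4M3_MAX):
    """Return one scale per block so that each block's max maps to fp8_max."""
    scales = []
    for start in range(0, len(values), block):
        amax = max(abs(v) for v in values[start:start + block])
        scales.append(amax / fp8_max if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(values, scales, block=BLOCK, fp8_max=E4M3_MAX):
    """Divide each element by its block's scale and clamp to the FP8 range."""
    out = []
    for i, v in enumerate(values):
        scaled = v / scales[i // block]
        out.append(max(-fp8_max, min(fp8_max, scaled)))
    return out


# Two blocks: the first holds small values, the second contains one outlier.
x = [0.01 * (i % 7 - 3) for i in range(128)] + [1.0] * 127 + [1000.0]

per_block = block_scales(x)
per_tensor_scale = max(abs(v) for v in x) / E4M3_MAX

print("per-block scales:", per_block)
print("per-tensor scale:", per_tensor_scale)
```

With a single per-tensor scale, the outlier in the second block would squeeze the first block's values into a tiny fraction of the representable range. With per-block scales, the first block still uses the full range.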

NVIDIA/apex

  • Key Activity:
    • [2025-05-07] Major code cleanup.
  • Details:
    • Removal: Legacy features amp, ddp, and rnn were removed (users should use the upstream PyTorch implementations).
  • Metrics: 0 PRs 0 Issues

JAX Ecosystem

jax-ml/jax

  • Key Activity:
    • [2025-05-21] 🚨 RELEASE: jax-v0.6.1
  • Details:
    • Feature: Added jax.lax.axis_size.
    • Change: Re-enabled version checking for CUDA package dependencies.
    • Change: jax.ShapeDtypeStruct is now immutable.
  • Metrics: 0 PRs 0 Issues

Serving & Frameworks

vllm-project/vllm

  • Key Activity:
    • Documentation overhaul and reorganization.
  • Details:
    • [2025-05-24] Reorganized user guide and updated external links.
  • Metrics: 0 PRs 0 Issues

volcengine/verl

  • Key Activity:
    • RLHF/PPO method updates.
  • Details:
    • Feature: Added support for PF-PPO.
    • Change: Set default actor entropy coefficient to 0 (Breaking Change).
  • Metrics: 0 PRs 0 Issues
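
To show why the entropy coefficient default matters, here is a simplified sketch of how an entropy bonus typically enters an actor loss. It is not verl's implementation: the function names, the probability distribution, the loss value, and the non-zero coefficient are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def actor_loss(policy_loss, probs, entropy_coeff):
    """Total actor loss with an entropy bonus: L = L_policy - c * H(pi)."""
    return policy_loss - entropy_coeff * entropy(probs)


probs = [0.7, 0.2, 0.1]   # hypothetical token distribution
pg_loss = 0.5             # hypothetical policy-gradient loss value

print(actor_loss(pg_loss, probs, entropy_coeff=0.0))    # entropy bonus disabled
print(actor_loss(pg_loss, probs, entropy_coeff=0.001))  # hypothetical non-zero coefficient
```

With a coefficient of 0 the entropy term contributes nothing to the loss. Training runs that relied on an implicit non-zero default would therefore need to set the coefficient explicitly to keep their previous behavior.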

triton-lang/triton

  • Key Activity:
    • Build configuration adjustments.
  • Details:
    • Update: Added (and then reverted/adjusted) options to remove debug info from modules.
    • Doc: Removed instructions to build PyTorch for Blackwell (likely upstreamed or changed).
  • Metrics: 0 PRs 0 Issues