📅 Engineering Report (2025-05-01 - 2025-05-31)

🚀 Executive Summary

May 2025 was a pivotal month, characterized by major version releases across the hardware-software stack. AMD significantly expanded hardware accessibility with ROCm 6.4.1, introducing support for RDNA4 consumer and workstation GPUs. PyTorch AO released v0.11.0, democratizing Mixture-of-Experts (MoE) quantization. On the competitive front, NVIDIA updated TransformerEngine to support the recently launched RTX 5090 and the FP8 block-scaling recipe from the DeepSeek-V3 paper. Across the board, there is a heavy engineering focus on optimizing for DeepSeek architectures and enabling FP8/quantization workflows.

  • ROCm 6.4.1 Release: This is a major accessibility update. ROCm now supports RDNA4 architecture (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT). This expands the development surface area significantly beyond datacenter GPUs.
  • New Tooling (ROCm-DS): Introduction of the ROCm Data Science toolkit (early access), aiming to accelerate standard data science workloads on AMD hardware.
  • Memory Optimization: AMD Instinct MI300X now supports the DPX compute partition mode under the NPS2 memory mode, allowing finer-grained compute and memory partitioning.
  • Ecosystem Integration: The AMD-AGI/Primus repo is actively updating Megatron-LM submodules and Mixtral configs, ensuring AMD hardware can train the latest MoE models effectively.

Competitive Analysis

  • NVIDIA & Consumer Hardware: NVIDIA’s TransformerEngine v2.3 explicitly added support for the RTX 5090. The software stack is catching up to the new consumer flagship almost immediately after its launch.
  • DeepSeek Adoption: NVIDIA added support for Float8 block scaling recipes specifically cited in the DeepSeek V3 paper. pytorch/torchtitan and DeepEP also saw specific DeepSeek-related engineering (router collapse fixes, IBGDA optimizations), highlighting how this model architecture is driving library development.
  • Quantization Pressure: The PyTorch AO v0.11.0 release introduces MoE quantization and PT2 Export quantization. This sets a new standard for ease of use in quantizing complex architectures; AMD tools (such as quantization in MIGraphX) will need to ensure parity with these PyTorch-native flows.
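
The weight-only flows above all reduce each weight to a low-bit integer plus a per-tensor (or per-channel) scale. A minimal pure-Python sketch of a symmetric int8 round trip, illustrative only and not the torchao API:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: w_q = clamp(round(w / scale)), scale = max|w| / 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values: w ≈ w_q * scale."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
deq = dequantize_int8(q, scale)
# round-trip error is bounded by half a quantization step
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, deq))
```

Half the bytes per weight versus fp16 (a quarter for int4), at the cost of this bounded rounding error, is what drives the ease-of-use pressure noted above.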

📂 Category Updates

AMD Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-05-21] 🚨 RELEASE: rocm-6.4.1
    • [2025-05-21] Doc updates regarding tool versioning.
  • Details:
    • Hardware Support: Added RDNA4 support (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT).
    • New Feature: ROCm Data Science toolkit (ROCm-DS) introduced.
    • Feature: DPX partition mode under NPS2 memory mode for MI300X.
    • Deprecation Warning: ROCm SMI and the hipcc/hipconfig Perl scripts are slated for removal in future releases.
  • Metrics: 116 New PRs, 36 New Issues

AMD-AGI/Primus

  • Key Activity:
    • [2025-05-23] Megatron-LM submodule update (moved to May 2025 version).
    • [2025-05-20] Documentation refactor and config unification.
  • Details:
    • Optimization: Updates to Mixtral pretrain configs.
    • Fix: Resolved interleaved virtual pipeline training errors in Megatron context.
  • Metrics: 17 New PRs, 0 New Issues

AMD-AGI/TraceLens

  • Key Activity:
    • Active development on performance modeling and plotting.
  • Details:
    • Feature Request: Expose roofline-plotting functionality in the package.
    • Improvement: Enable batched GEMM through gemmologist.
  • Metrics: 44 New PRs, 11 New Issues
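
Roofline plotting of the kind TraceLens exposes reduces to one relation: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A sketch with hypothetical device numbers:

```python
def roofline_attainable_gflops(peak_gflops, mem_bw_gbs, arithmetic_intensity):
    """Attainable GFLOP/s = min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)

# Hypothetical device: 1300 GFLOP/s peak, 1600 GB/s memory bandwidth
low_ai = roofline_attainable_gflops(1300.0, 1600.0, 0.25)   # memory-bound region
high_ai = roofline_attainable_gflops(1300.0, 1600.0, 64.0)  # compute-bound region
assert low_ai == 400.0 and high_ai == 1300.0
```

Kernels plotted below the diagonal (bandwidth) line are memory-bound; those hitting the flat (peak) line are compute-bound, which is exactly what roofline charts surface.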

ROCm/MAD

  • Key Activity:
    • Integration with vLLM profiling.
  • Details:
    • Update: vLLM 05/27 release support and profiling script path updates.
  • Metrics: 14 New PRs, 0 New Issues

PyTorch Ecosystem

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-05-09] 🚨 RELEASE: v0.11.0
  • Details:
    • Major Feature: Support for MoE Quantization (Mixture-of-Experts).
    • Major Feature: PyTorch 2 Export Quantization (PT2E).
    • Tooling: New microbenchmarking framework for inference APIs.
    • Performance: Benchmarks provided for Mixtral-MoE on H100 showing significant memory savings with int4/int8 weight-only quantization.
  • Metrics: 0 New PRs (post-release lull), 22 New Issues
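
The memory savings reported for Mixtral-MoE follow directly from weight bit-width. A back-of-envelope sketch (the parameter count is approximate, and real deployments add scale/zero-point overhead):

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate weight memory; ignores quantization-scale overhead."""
    return n_params * bits_per_weight / 8

params = 46.7e9  # Mixtral-8x7B total parameters (approximate)
bf16_gib = weight_bytes(params, 16) / 2**30
int8_gib = weight_bytes(params, 8) / 2**30
int4_gib = weight_bytes(params, 4) / 2**30
# weight-only int8 halves weight memory vs bf16; int4 quarters it
assert abs(bf16_gib / int8_gib - 2.0) < 1e-9
assert abs(bf16_gib / int4_gib - 4.0) < 1e-9
```

Since MoE models carry most of their parameters in rarely-activated experts, weight-only quantization is especially effective there: memory drops by 2-4x while per-token compute is largely unchanged.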

pytorch/pytorch

  • Key Activity:
    • High-volume maintenance and bug fixing.
  • Details:
    • Issue: FP8 scaled-mm lowering ignores the scale_result argument.
    • Issue: Inconsistent results from torch.amin() between CPU and GPU.
    • Fix: NumPy compatibility for indexing with small 2-D lists.
  • Metrics: 1,480 New PRs, 754 New Issues (high velocity)
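
The scale_result issue concerns the contract of a scaled matrix multiply: the output must be rescaled by the supplied factor, so a lowering that drops it silently changes numerics. A pure-Python sketch of the intended semantics (not the torch._scaled_mm implementation):

```python
def scaled_mm(a, b, scale_a, scale_b, scale_result=1.0):
    """C = (scale_a * A) @ (scale_b * B), then multiplied by scale_result."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(a[i][k] * b[k][j] for k in range(inner))
            out[i][j] = acc * scale_a * scale_b * scale_result
    return out

a = [[1.0, 2.0]]
b = [[3.0], [4.0]]
# dropping scale_result (the reported bug) would yield 11.0 here, not 5.5
assert scaled_mm(a, b, 1.0, 1.0, scale_result=0.5) == [[5.5]]
```

In FP8 training these scales are what map the narrow FP8 range back to full-precision magnitudes, so losing one of them corrupts results rather than crashing.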

pytorch/torchtitan

  • Key Activity:
    • Focus on DeepSeek training stability and parallelism.
  • Details:
    • Bug: “Router collapse on deepseek training loop” reported.
    • Feature WIP: Adding support for HSDP (Hybrid Sharded Data Parallel) + TP (Tensor Parallel).
  • Metrics: 65 New PRs, 25 New Issues
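
Router collapse, as reported for the DeepSeek training loop, means the gating network funnels nearly all tokens to one expert. A monitoring sketch that flags it from per-expert token counts (hypothetical helpers, not torchtitan code):

```python
import math

def router_entropy(expert_counts):
    """Shannon entropy (nats) of the expert-load distribution; low = imbalanced."""
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def is_collapsed(expert_counts, threshold=0.9):
    """Collapse heuristic: one expert receives almost all tokens."""
    return max(expert_counts) / sum(expert_counts) >= threshold

healthy = [25, 25, 25, 25]   # balanced routing across 4 experts
collapsed = [97, 1, 1, 1]    # nearly everything routed to expert 0
assert not is_collapsed(healthy) and is_collapsed(collapsed)
assert router_entropy(healthy) > router_entropy(collapsed)
```

Tracking such a statistic during training is why load-balancing auxiliary losses exist; a collapsing router wastes most expert capacity while overloading one.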

NVIDIA Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-05-14] 🚨 RELEASE: v2.3
  • Details:
    • Hardware: Added support for RTX 5090.
    • DeepSeek: Added the Float8 block-scaling recipe from the DeepSeek-V3 paper for Hopper GPUs.
    • Feature: Enabled FP8 weights when using FSDP.
    • Deprecation: CPU offloading of weight tensors is deprecated.
  • Metrics: 0 New PRs reported (release-focused)
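
Block scaling, as in the DeepSeek-V3 recipe, assigns one scale per fixed-size block of elements rather than one per tensor, bounding quantization error locally. A pure-Python sketch assuming FP8 E4M3 with a maximum finite magnitude of 448 (not the TransformerEngine API):

```python
FP8_E4M3_MAX = 448.0  # max finite magnitude representable in E4M3

def block_scales(values, block_size=128):
    """One scale per block: scale = max|x| / FP8 max, so x / scale fits in FP8."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scales.append(amax / FP8_E4M3_MAX if amax else 1.0)
    return scales

values = [0.5] * 128 + [400.0] * 128  # two blocks with very different ranges
scales = block_scales(values)
# each block's scaled values stay within the FP8 range
assert all(abs(v) / s <= FP8_E4M3_MAX + 1e-6
           for v, s in [(0.5, scales[0]), (400.0, scales[1])])
```

With a single per-tensor scale, the 400.0 block would force the 0.5 block down into just a few FP8 codes; per-block scales keep both ranges well resolved.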

NVIDIA/apex

  • Key Activity:
    • [2025-05-07] Removal of legacy features.
  • Details:
    • Cleanup: Removed the legacy amp, ddp, and rnn implementations, signaling that users must migrate to the upstream PyTorch equivalents (e.g., torch.amp and DistributedDataParallel).
  • Metrics: 0 New PRs (maintenance mode)

JAX Ecosystem

jax-ml/jax

  • Key Activity:
    • [2025-05-21] 🚨 RELEASE: jax-v0.6.1
  • Details:
    • Feature: Added jax.lax.axis_size.
    • Fix: Re-enabled CUDA package dependency version checking.
    • Issue: FFI breaks if sm_90a (Hopper) is targeted.
  • Metrics: 595 New PRs, 79 New Issues

AI-Hypercomputer/maxtext

  • Key Activity:
    • Model conversion and compiler updates.
  • Details:
    • Issue: Llama3 checkpoint conversion to the Hugging Face format is broken.
    • Update: Draft PR for v7x AOT (Ahead-of-Time compilation) support.
  • Metrics: 126 New PRs, 6 New Issues

Frameworks & Compilers

facebookresearch/xformers

  • Key Activity:
    • Hardware backend updates.
  • Details:
    • Support: Blackwell enablement underway.
    • Support: Updates merged from ROCm/xformers (indicating active AMD maintenance).
  • Metrics: 3 New PRs, 4 New Issues

deepspeedai/DeepSpeed

  • Key Activity:
    • Training stability fixes.
  • Details:
    • Bug: Resuming Trainer.train fails with automatic tensor parallelism.
    • CI: Fixed CI hangs on Torch 2.7.
  • Metrics: 25 New PRs, 25 New Issues

deepseek-ai/DeepEP

  • Key Activity:
    • Communication optimization.
  • Details:
    • Optimization: Use IBGDA (InfiniBand GPUDirect Async) only, with lighter-weight barriers.
    • Fix: Low-latency P2P code cleanup.
  • Metrics: 8 New PRs, 25 New Issues

triton-lang/triton

  • Key Activity:
    • Kernel updates and documentation churn.
  • Details:
    • Kernel: Implemented attention kernels for head dims 64 and 128 (d64/d128).
    • Doc Change: Removed the instructions for building PyTorch for Blackwell (unclear whether support was dropped or is now native).
  • Metrics: 308 New PRs, 33 New Issues
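
The new d64/d128 attention kernels compute standard scaled dot-product attention, softmax(QKᵀ/√d)V. A pure-Python reference of that math, useful as a correctness oracle for small shapes (illustrative, not the Triton kernel):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for one head; q, k, v are lists of d-dim rows."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        probs = softmax(scores)
        out.append([sum(p * vj[c] for p, vj in zip(probs, v))
                    for c in range(len(v[0]))])
    return out

# one query attending to two identical keys -> output is the mean of the values
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0, 0.0], [4.0, 0.0]]
assert attention(q, k, v) == [[3.0, 0.0]]
```

Specializing kernels to fixed head dims like 64 and 128 lets the generated code keep a full query row and accumulator in registers, which is why these sizes get dedicated implementations.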