📅 Engineering Report (2025-05-01 - 2025-05-31)

🚀 Executive Summary

May 2025 was a pivotal month, characterized by major version releases across the hardware-software stack. AMD significantly expanded hardware accessibility with ROCm 6.4.1, introducing support for RDNA4 consumer and workstation GPUs. PyTorch AO released v0.11.0, democratizing Mixture-of-Experts (MoE) quantization. On the competitive front, NVIDIA updated TransformerEngine to support the recently launched RTX 5090 and the FP8 block-scaling recipe from the DeepSeek-V3 paper. Across the board, there is a heavy engineering focus on optimizing for DeepSeek architectures and enabling FP8/quantization workflows.

  • ROCm 6.4.1 Release: This is a major accessibility update. ROCm now supports RDNA4 architecture (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT). This expands the development surface area significantly beyond datacenter GPUs.
  • New Tooling (ROCm-DS): Introduction of the ROCm Data Science toolkit (early access), aiming to accelerate standard data science workloads on AMD hardware.
  • Memory Optimization: AMD Instinct MI300X now supports the DPX compute partition mode under the NPS2 memory mode, allowing finer-grained compute and memory partitioning.
  • Ecosystem Integration: The AMD-AGI/Primus repo is actively updating Megatron-LM submodules and Mixtral configs, ensuring AMD hardware can train the latest MoE models effectively.

Competitive Analysis

  • NVIDIA & Consumer Hardware: NVIDIA’s TransformerEngine v2.3 explicitly added support for the RTX 5090. The software stack is catching up to the new consumer flagship almost immediately after its launch.
  • DeepSeek Adoption: NVIDIA added support for Float8 block scaling recipes specifically cited in the DeepSeek V3 paper. pytorch/torchtitan and DeepEP also saw specific DeepSeek-related engineering (router collapse fixes, IBGDA optimizations), highlighting how this model architecture is driving library development.
  • Quantization Pressure: The PyTorch AO v0.11.0 release introduces MoE quantization and PT2 Export quantization. This sets a new standard for ease of use in quantizing complex architectures; AMD tools (such as quantization in MIGraphX) will need to ensure parity with these PyTorch-native flows.
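
The weight-only flows above all reduce each weight to a low-bit integer plus a per-tensor (or per-channel) scale. A minimal pure-Python sketch of a symmetric int8 round trip, illustrative only and not the torchao API:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8: w_q = clamp(round(w / scale)), scale = max|w| / 127."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values: w ≈ w_q * scale."""
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.08, 0.9]
q, scale = quantize_int8(weights)
deq = dequantize_int8(q, scale)
# round-trip error is bounded by half a quantization step
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, deq))
```

Half the bytes per weight versus fp16 (a quarter for int4), at the cost of this bounded rounding error, is what drives the ease-of-use pressure noted above.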

📂 Category Updates

AMD Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-05-21] 🚨 RELEASE: rocm-6.4.1
    • [2025-05-21] Doc updates regarding tool versioning.
  • Details:
    • Hardware Support: Added RDNA4 support (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT).
    • New Feature: ROCm Data Science toolkit (ROCm-DS) introduced.
    • Feature: DPX partition mode under NPS2 memory mode for MI300X.
    • Deprecation Warning: ROCm SMI and the hipcc/hipconfig Perl scripts are slated for removal in future releases.
  • Metrics: 116 New PRs, 36 New Issues

AMD-AGI/Primus

  • Key Activity:
    • [2025-05-23] Megatron-LM submodule update (moved to May 2025 version).
    • [2025-05-20] Documentation refactor and config unification.
  • Details:
    • Optimization: Updates to Mixtral pretrain configs.
    • Fix: Resolved interleaved virtual pipeline training errors in Megatron context.
  • Metrics: 17 New PRs, 0 New Issues

AMD-AGI/TraceLens

  • Key Activity:
    • Active development on performance modeling and plotting.
  • Details:
    • Feature Request: Expose roofline-plotting functionality in the package.
    • Improvement: Enable batched GEMM through gemmologist.
  • Metrics: 44 New PRs, 11 New Issues
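
Roofline plotting of the kind TraceLens exposes reduces to one relation: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A sketch with hypothetical device numbers:

```python
def roofline_attainable_gflops(peak_gflops, mem_bw_gbs, arithmetic_intensity):
    """Attainable GFLOP/s = min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_gflops, mem_bw_gbs * arithmetic_intensity)

# Hypothetical device: 1300 GFLOP/s peak, 1600 GB/s memory bandwidth
low_ai = roofline_attainable_gflops(1300.0, 1600.0, 0.25)   # memory-bound region
high_ai = roofline_attainable_gflops(1300.0, 1600.0, 64.0)  # compute-bound region
assert low_ai == 400.0 and high_ai == 1300.0
```

Kernels plotted below the diagonal (bandwidth) line are memory-bound; those hitting the flat (peak) line are compute-bound, which is exactly what roofline charts surface.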

ROCm/MAD

  • Key Activity:
    • Integration with vLLM profiling.
  • Details:
    • Update: vLLM 05/27 release support and profiling script path updates.
  • Metrics: 14 New PRs, 0 New Issues

PyTorch Ecosystem

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-05-09] 🚨 RELEASE: v0.11.0
  • Details:
    • Major Feature: Support for MoE Quantization (Mixture-of-Experts).
    • Major Feature: PyTorch 2 Export Quantization (PT2E).
    • Tooling: New microbenchmarking framework for inference APIs.
    • Performance: Benchmarks provided for Mixtral-MoE on H100 showing significant memory savings with int4/int8 weight-only quantization.
  • Metrics: 0 New PRs (post-release lull), 22 New Issues
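
The memory savings reported for Mixtral-MoE follow directly from weight bit-width. A back-of-envelope sketch (the parameter count is approximate, and real deployments add scale/zero-point overhead):

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate weight memory; ignores quantization-scale overhead."""
    return n_params * bits_per_weight / 8

params = 46.7e9  # Mixtral-8x7B total parameters (approximate)
bf16_gib = weight_bytes(params, 16) / 2**30
int8_gib = weight_bytes(params, 8) / 2**30
int4_gib = weight_bytes(params, 4) / 2**30
# weight-only int8 halves weight memory vs bf16; int4 quarters it
assert abs(bf16_gib / int8_gib - 2.0) < 1e-9
assert abs(bf16_gib / int4_gib - 4.0) < 1e-9
```

Since MoE models carry most of their parameters in rarely-activated experts, weight-only quantization is especially effective there: memory drops by 2-4x while per-token compute is largely unchanged.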

pytorch/pytorch

  • Key Activity:
    • High-volume maintenance and bug fixing.
  • Details:
    • Issue: FP8 scaled-mm lowering ignores the scale_result argument.
    • Issue: Inconsistent results from torch.amin() between CPU and GPU.
    • Fix: NumPy compatibility for indexing with small 2-D lists.
  • Metrics: 1,480 New PRs, 754 New Issues (high velocity)
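
The scale_result issue concerns the contract of a scaled matrix multiply: the output must be rescaled by the supplied factor, so a lowering that drops it silently changes numerics. A pure-Python sketch of the intended semantics (not the torch._scaled_mm implementation):

```python
def scaled_mm(a, b, scale_a, scale_b, scale_result=1.0):
    """C = (scale_a * A) @ (scale_b * B), then multiplied by scale_result."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(a[i][k] * b[k][j] for k in range(inner))
            out[i][j] = acc * scale_a * scale_b * scale_result
    return out

a = [[1.0, 2.0]]
b = [[3.0], [4.0]]
# dropping scale_result (the reported bug) would yield 11.0 here, not 5.5
assert scaled_mm(a, b, 1.0, 1.0, scale_result=0.5) == [[5.5]]
```

In FP8 training these scales are what map the narrow FP8 range back to full-precision magnitudes, so losing one of them corrupts results rather than crashing.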

pytorch/torchtitan

  • Key Activity:
    • Focus on DeepSeek training stability and parallelism.
  • Details:
    • Bug: “Router collapse on deepseek training loop” reported.
    • Feature WIP: Adding support for HSDP (Hybrid Sharded Data Parallel) + TP (Tensor Parallel).
  • Metrics: 65 New PRs, 25 New Issues
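
Router collapse, as reported for the DeepSeek training loop, means the gating network funnels nearly all tokens to one expert. A monitoring sketch that flags it from per-expert token counts (hypothetical helpers, not torchtitan code):

```python
import math

def router_entropy(expert_counts):
    """Shannon entropy (nats) of the expert-load distribution; low = imbalanced."""
    total = sum(expert_counts)
    probs = [c / total for c in expert_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

def is_collapsed(expert_counts, threshold=0.9):
    """Collapse heuristic: one expert receives almost all tokens."""
    return max(expert_counts) / sum(expert_counts) >= threshold

healthy = [25, 25, 25, 25]   # balanced routing across 4 experts
collapsed = [97, 1, 1, 1]    # nearly everything routed to expert 0
assert not is_collapsed(healthy) and is_collapsed(collapsed)
assert router_entropy(healthy) > router_entropy(collapsed)
```

Tracking such a statistic during training is why load-balancing auxiliary losses exist; a collapsing router wastes most expert capacity while overloading one.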

NVIDIA Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-05-14] 🚨 RELEASE: v2.3
  • Details:
    • Hardware: Added support for RTX 5090.
    • DeepSeek: Added the Float8 block-scaling recipe from the DeepSeek-V3 paper for Hopper GPUs.
    • Feature: Enabled FP8 weights when using FSDP.
    • Deprecation: CPU offloading of weight tensors is deprecated.
  • Metrics: 0 New PRs reported (release-focused)
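
Block scaling, as in the DeepSeek-V3 recipe, assigns one scale per fixed-size block of elements rather than one per tensor, bounding quantization error locally. A pure-Python sketch assuming FP8 E4M3 with a maximum finite magnitude of 448 (not the TransformerEngine API):

```python
FP8_E4M3_MAX = 448.0  # max finite magnitude representable in E4M3

def block_scales(values, block_size=128):
    """One scale per block: scale = max|x| / FP8 max, so x / scale fits in FP8."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scales.append(amax / FP8_E4M3_MAX if amax else 1.0)
    return scales

values = [0.5] * 128 + [400.0] * 128  # two blocks with very different ranges
scales = block_scales(values)
# each block's scaled values stay within the FP8 range
assert all(abs(v) / s <= FP8_E4M3_MAX + 1e-6
           for v, s in [(0.5, scales[0]), (400.0, scales[1])])
```

With a single per-tensor scale, the 400.0 block would force the 0.5 block down into just a few FP8 codes; per-block scales keep both ranges well resolved.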

NVIDIA/apex

  • Key Activity:
    • [2025-05-07] Removal of legacy features.
  • Details:
    • Cleanup: Removed the legacy amp, ddp, and rnn implementations, signaling that users must migrate to the upstream PyTorch equivalents (e.g., torch.amp and DistributedDataParallel).
  • Metrics: 0 New PRs (maintenance mode)

JAX Ecosystem

jax-ml/jax

  • Key Activity:
    • [2025-05-21] 🚨 RELEASE: jax-v0.6.1
  • Details:
    • Feature: Added jax.lax.axis_size.
    • Fix: Re-enabled CUDA package dependency version checking.
    • Issue: FFI breaks if sm_90a (Hopper) is targeted.
  • Metrics: 595 New PRs, 79 New Issues

AI-Hypercomputer/maxtext

  • Key Activity:
    • Model conversion and compiler updates.
  • Details:
    • Issue: Llama3 checkpoint conversion to the Hugging Face format is broken.
    • Update: Draft PR for v7x AOT (Ahead-of-Time compilation) support.
  • Metrics: 126 New PRs, 6 New Issues

Frameworks & Compilers

facebookresearch/xformers

  • Key Activity:
    • Hardware backend updates.
  • Details:
    • Support: Blackwell enablement underway.
    • Support: Updates merged from ROCm/xformers (indicating active AMD maintenance).
  • Metrics: 3 New PRs, 4 New Issues

deepspeedai/DeepSpeed

  • Key Activity:
    • Training stability fixes.
  • Details:
    • Bug: Resuming Trainer.train fails with automatic tensor parallelism.
    • CI: Fixed CI hangs on Torch 2.7.
  • Metrics: 25 New PRs, 25 New Issues

deepseek-ai/DeepEP

  • Key Activity:
    • Communication optimization.
  • Details:
    • Optimization: Use IBGDA (InfiniBand GPUDirect Async) only, with lighter-weight barriers.
    • Fix: Low-latency P2P code cleanup.
  • Metrics: 8 New PRs, 25 New Issues

triton-lang/triton

  • Key Activity:
    • Kernel updates and documentation churn.
  • Details:
    • Kernel: Implemented attention kernels for head dims 64 and 128 (d64/d128).
    • Doc Change: Removed the instructions for building PyTorch for Blackwell (unclear whether support was dropped or is now native).
  • Metrics: 308 New PRs, 33 New Issues
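
The new d64/d128 attention kernels compute standard scaled dot-product attention, softmax(QKᵀ/√d)V. A pure-Python reference of that math, useful as a correctness oracle for small shapes (illustrative, not the Triton kernel):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V for one head; q, k, v are lists of d-dim rows."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        probs = softmax(scores)
        out.append([sum(p * vj[c] for p, vj in zip(probs, v))
                    for c in range(len(v[0]))])
    return out

# one query attending to two identical keys -> output is the mean of the values
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[2.0, 0.0], [4.0, 0.0]]
assert attention(q, k, v) == [[3.0, 0.0]]
```

Specializing kernels to fixed head dims like 64 and 128 lets the generated code keep a full query row and accumulator in registers, which is why these sizes get dedicated implementations.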