📅 Engineering Report (2025-05-01 - 2025-05-31)

🚀 Executive Summary

May 2025 was a pivotal month for the AMD software ecosystem: the release of ROCm 6.4.1 significantly expands hardware support (RDNA4 / Radeon RX 9000 series) and introduces the ROCm Data Science Toolkit.

Across the broader landscape, the Deepseek-V3/R1 architectures are driving immediate tooling updates. NVIDIA’s TransformerEngine v2.3 added Deepseek-specific Float8 block scaling, while PyTorch’s torchtitan and DeepEP (Deepseek’s own repo) saw active development to stabilize training loops for these models. Additionally, PyTorch AO v0.11.0 was released with major features for Mixture-of-Experts (MoE) quantization, critical for efficient inference of modern sparse models.

  • 🚨 Major Release: ROCm 6.4.1: This release introduces support for RDNA4 architecture (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT) on Linux, significantly widening the accessibility of ROCm for developers on consumer/workstation hardware.
  • New Tooling: Launch of the ROCm Data Science Toolkit (ROCm-DS), an open-source collection for accelerating data science workloads.
  • MI300X Optimization: Added DPX partition mode under NPS2 memory mode for Instinct MI300X.
  • PyTorch AO Integration: The new PyTorch AO v0.11.0 release includes specific ROCm optimizations, notably “preshuffled weight mm” support, indicating upstream collaboration is yielding performance improvements.
  • Primus & Training: AMD-AGI/Primus updated its Megatron-LM submodule and refined Mixtral pretrain configs, focusing on stabilizing large-scale training pipelines.

Competitive Analysis

  • NVIDIA TransformerEngine v2.3: NVIDIA has moved quickly to support Deepseek-V3-specific recipes (Float8 block scaling) on Hopper GPUs. They also added support for the RTX 5090, signaling readiness for next-gen consumer hardware.
  • Quantization Moves: PyTorch AO’s new MoE quantization (Int8/Int4 weight-only) allows large sparse models (like Mixtral) to run on fewer GPUs. This lowers the barrier to entry for competitors using standard PyTorch export flows.
  • Triton Lang: Active development on “Gluon” attention kernels for d64/d128 head dimensions suggests optimization for specific model architectures is ongoing at the kernel level.

📂 Category Updates

AMD Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-05-21] RELEASE: rocm-6.4.1
    • [2025-05-21] Documentation updates for AI Developer Hub and multiple Docker image benchmarks (LLM Foundry, vLLM, PyTorch).
  • Details:
    • Hardware: Added support for RDNA4 (Radeon RX 9000 series).
    • Components: HIP upgraded to 6.4.1; hipBLASLt to 0.12.1.
    • Deprecation Warning: ROCm SMI and ROCTracer/ROCProfiler are being phased out in favor of AMD SMI and ROCprofiler-SDK, respectively.
  • Metrics: 116 PRs, 36 Issues

AMD-AGI/Primus

  • Key Activity:
    • [2025-05-23] Updated the Megatron-LM submodule, advancing the pinned snapshot from March to May 2025.
    • [2025-05-20] Refactored README and unified config usage for Megatron pretrain scripts.
  • Details:
    • Highlight PR: Fixed interleaved virtual-pipeline training errors, with new unit tests.
    • Highlight PR: Updated Mixtral pretrain configurations.
  • Metrics: 17 PRs, 0 Issues

AMD-AGI/TraceLens

  • Key Activity:
    • Focus on roofline-model plotting and batch GEMM support.
  • Details:
    • New PR: Enabling batch gemm through gemmologist.
    • New Issue: Request to expose functionality for plotting roofline within the package.
  • Metrics: 44 PRs, 11 Issues
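For context on the roofline request above: a roofline model caps attainable throughput at the minimum of the compute roof and the memory roof (bandwidth × arithmetic intensity). A minimal sketch with illustrative peak numbers, not TraceLens’s actual implementation:

```python
# Roofline model: attainable throughput is the lesser of the compute roof
# and the memory roof. Peak numbers here are illustrative, not measured.

PEAK_GFLOPS = 1000.0   # hypothetical peak compute throughput (GFLOP/s)
PEAK_BW_GBS = 100.0    # hypothetical peak memory bandwidth (GB/s)

def attainable_gflops(arith_intensity):
    """Attainable GFLOP/s for a kernel with the given arithmetic
    intensity (FLOPs per byte moved to/from memory)."""
    return min(PEAK_GFLOPS, PEAK_BW_GBS * arith_intensity)

def ridge_point():
    """Arithmetic intensity at which a kernel becomes compute-bound."""
    return PEAK_GFLOPS / PEAK_BW_GBS

# A kernel at 1 FLOP/byte is memory-bound; at 100 FLOP/byte, compute-bound.
memory_bound = attainable_gflops(1.0)     # 100.0 GFLOP/s
compute_bound = attainable_gflops(100.0)  # 1000.0 GFLOP/s
```

Plotting attainable_gflops over a log-spaced range of intensities yields the familiar roofline curve; the ridge point marks where the two roofs meet.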

ROCm/MAD

  • Key Activity:
    • Integration with vLLM profiling.
  • Details:
    • New PR: Update path to use profiling scripts in vLLM.
  • Metrics: 14 PRs, 0 Issues

sgl-project/sglang

  • Key Activity:
    • CI maintenance for AMD.
  • Details:
    • [2025-05-16] Fix amd ci.
  • Metrics: 0 PRs, 0 Issues (Doc/CI updates only)

PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • High-volume maintenance, including fixes for FP8 scaling and CPU/GPU consistency.
  • Details:
    • New Issue: FP8 scaled mm lowering ignores scale_result argument.
    • New Issue: Inconsistent result of torch.amin() in CPU vs GPU.
  • Metrics: 1480 PRs, 754 Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-05-09] RELEASE: v0.11.0
  • Details:
    • Feature: Added support for Mixture-of-Experts (MoE) quantization.
    • Feature: PyTorch 2 Export Quantization (PT2E).
    • AMD Specific: [ROCm] preshuffled weight mm (PR #2044).
  • Metrics: 0 PRs, 22 Issues (post-release)
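To make the weight-only idea concrete: weights are stored as int8 with one floating-point scale per output row and dequantized on the fly, while activations stay in higher precision. A from-scratch sketch of symmetric per-row int8 quantization, not torchao’s actual code path:

```python
# Symmetric per-row int8 weight-only quantization, from scratch.
# Each weight row is stored as int8 plus one float scale; activations
# stay in floating point and weights are dequantized at matmul time.

def quantize_rows(w):
    """w: list of rows (lists of floats) -> (int8 rows, per-row scales)."""
    q_rows, scales = [], []
    for row in w:
        amax = max(abs(x) for x in row) or 1.0  # guard for all-zero rows
        scale = amax / 127.0                    # map amax to int8 max
        q_rows.append([round(x / scale) for x in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rows(q_rows, scales):
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.4, -1.0, 0.25], [2.0, 0.0, -2.0]]
q, s = quantize_rows(w)
w_hat = dequantize_rows(q, s)
# Round-trip error is bounded by half a quantization step per row.
max_err = max(abs(a - b) for row, row_hat in zip(w, w_hat)
              for a, b in zip(row, row_hat))
```

Int4 weight-only typically follows the same scheme with a scale of amax / 7 (signed int4 range) plus packing two nibbles per byte.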

pytorch/torchtitan

  • Key Activity:
    • Stabilizing training loops for Deepseek models and extending FSDP support.
  • Details:
    • New Issue: “[Deepseek] Router collapse on deepseek training loop”.
    • New PR: “[WIP][SimpleFSDP] Add support for hsdp/ddp + tp”.
  • Metrics: 65 PRs, 25 Issues
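“Router collapse” refers to the router funneling nearly all tokens to a few experts. A common countermeasure (not necessarily what torchtitan uses) is a Switch-Transformer-style load-balancing auxiliary loss; a minimal sketch:

```python
# Switch-Transformer-style load-balancing auxiliary loss for an MoE router:
# loss = num_experts * sum_i f_i * p_i, where f_i is the fraction of tokens
# dispatched to expert i and p_i is the mean router probability for expert i.

def load_balance_loss(router_probs, assignments, num_experts):
    """router_probs: per-token lists of softmax probabilities over experts.
    assignments: per-token index of the expert each token was dispatched to.
    """
    n = len(assignments)
    f = [assignments.count(e) / n for e in range(num_experts)]
    p = [sum(tok[e] for tok in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Uniform routing over 2 experts -> loss == 1.0 (the minimum).
balanced = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], 2)
# Collapsed routing (everything to expert 0) -> loss grows toward num_experts.
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]], [0, 0], 2)
```

The loss bottoms out at 1.0 under perfectly uniform routing and grows toward num_experts as routing collapses, so adding a small multiple of it to the training loss pushes the router back toward balance.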

NVIDIA Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-05-14] RELEASE: v2.3
  • Details:
    • Hardware: Added support for RTX 5090.
    • Feature: Added support for Float8 block scaling recipe (Deepseek v3 paper) for Hopper GPUs.
    • Optimization: Sped up import via lazy compilation using torch.compile.
  • Metrics: 0 PRs, 0 Issues (release focused)
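On the Float8 block-scaling recipe: instead of one scale per tensor, each tile gets its own scale chosen so the tile’s max magnitude maps to the FP8 E4M3 maximum of 448. A simplified 1-D sketch of the scaling step only (the block size is illustrative, actual FP8 mantissa rounding is omitted, and this is not TransformerEngine’s implementation):

```python
# Block-wise FP8-style scaling in 1-D: each block of values gets its own
# scale so the block's absolute max maps to the FP8 E4M3 maximum (448).
# Real FP8 also rounds mantissas; here we model only the scaling step.

E4M3_MAX = 448.0
BLOCK = 4  # illustrative block size; real recipes use much larger tiles

def block_scales(x):
    """Per-block scale factors: amax(block) / E4M3_MAX."""
    scales = []
    for i in range(0, len(x), BLOCK):
        amax = max(abs(v) for v in x[i:i + BLOCK]) or 1.0
        scales.append(amax / E4M3_MAX)
    return scales

def scale_blocks(x):
    """Divide each block by its scale so values land in [-448, 448]."""
    scales = block_scales(x)
    out = [v / scales[i // BLOCK] for i, v in enumerate(x)]
    return out, scales

# Two blocks with very different magnitudes each fill the FP8 range.
x = [0.1, -0.2, 0.05, 0.15, 100.0, -50.0, 25.0, 10.0]
scaled, scales = scale_blocks(x)
```

Per-block scales keep outlier values in one tile from crushing the dynamic range of every other tile, which is what makes FP8 viable for Deepseek-style training.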

NVIDIA/Megatron-LM

  • Key Activity:
    • Maintenance and documentation.
  • Details:
    • [2025-05-08] Updated setup instructions.
  • Metrics: 0 PRs, 0 Issues

JAX & XLA

openxla/xla

  • Key Activity:
    • TPU bug fixes and graph fusion logic updates.
  • Details:
    • New PR: Drop restriction that XnnGraph fusions can be grown only from root.
    • New Issue: TPU X64 rewriting issue on c64.
  • Metrics: 1344 PRs, 8 Issues

AI-Hypercomputer/maxtext

  • Key Activity:
    • Llama3 conversion fixes and ahead-of-time (AOT) compilation support.
  • Details:
    • New Issue: Llama3 conversion to HF does not work.
    • New PR: [DRAFT] v7x AOT support.
  • Metrics: 126 PRs, 6 Issues

Inference & Kernels

triton-lang/triton

  • Key Activity:
    • Kernel implementations and debug tooling.
  • Details:
    • New PR: [Gluon] Implement attention kernels for d64 and d128.
    • New PR: Check for shared memory limitation when converting ConvertLayoutOp.
  • Metrics: 308 PRs, 33 Issues

tile-ai/tilelang

  • Key Activity:
    • JIT refactoring and vectorization fixes.
  • Details:
    • [2025-05-18] Refactor tilelang.jit to support a faster/flexible kernel cache.
    • New Issue: Automatic vectorization and T.Cast() conflicts in quant kernel.
  • Metrics: 78 PRs, 17 Issues

xdit-project/xDiT

  • Key Activity:
    • Parallelism features for diffusion models.
  • Details:
    • New PR: ADD FP8 forward for FA3 (FlashAttention3).
    • New PR: add Lumina-Next CFG parallel script.
  • Metrics: 5 PRs, 10 Issues

deepseek-ai/DeepEP

  • Key Activity:
    • Communication kernel optimization for Mixture-of-Experts.
  • Details:
    • New PR: Use IBGDA (InfiniBand GPUDirect Async) only and lighter barrier.
    • New Issue: Questions regarding “SM-free normal kernels” in roadmap.
  • Metrics: 8 PRs, 25 Issues