GitHub Monthly Report: 2025-05-01 to 2025-05-31
📅 Engineering Report (2025-05-01 - 2025-05-31)
🚀 Executive Summary
May 2025 was a significant month for infrastructure releases across the board. AMD launched ROCm 6.4.1, introducing RDNA4 support and the new ROCm Data Science Toolkit (ROCm-DS). PyTorch AO released v0.11.0 with major strides in Mixture-of-Experts (MoE) quantization and PT2 export flows. NVIDIA released TransformerEngine v2.3, heavily optimizing FP8 workflows with FSDP support and adding RTX 5090 support. Activity in PyTorch Core remains extremely high, while the JAX ecosystem saw a minor version bump to v0.6.1.
AMD Related Updates
- ROCm 6.4.1 Release: The major highlight of the month. Key features include support for RDNA4 GPUs (Radeon AI PRO R9700, RX 9070/9060 XT), the introduction of DPX partition mode for MI300X, and the beta launch of the ROCm Data Science Toolkit (ROCm-DS).
- Tooling Deprecations: AMD signaled the deprecation of legacy tools (ROCm SMI, ROCTracer, ROCProfiler) in favor of the new ROCprofiler-SDK and AMD SMI, effective Q1 2026.
- PyTorch AO Integration: The PyTorch AO v0.11.0 release explicitly included optimizations for ROCm, specifically “preshuffled weight mm,” indicating active upstream contribution to PyTorch quantization libraries.
- Primus & TraceLens: Internal AMD tooling saw healthy activity, with Primus updating Megatron-LM submodules and TraceLens enabling batch GEMM functionality via Gemmologist.
Competitive Analysis
- NVIDIA FP8 Dominance: With TransformerEngine v2.3, NVIDIA has enabled FP8 weights when using Fully Sharded Data Parallel (FSDP). This is a critical competitive advantage for training LLMs at scale. They also added support for the RTX 5090, keeping the software stack current with the newest consumer hardware.
- MoE Quantization: PyTorch AO v0.11.0 introduced prototype support for quantizing MoE modules. As MoE architectures (like Mixtral and DeepSeek) dominate the efficient LLM landscape, this tooling is vital.
- DeepSeek Optimization: Both NVIDIA (TransformerEngine) and the broader ecosystem (RTP-LLM, torchtitan) are actively optimizing for DeepSeek architectures, specifically regarding Float8 block scaling and training loops; the block-scaling idea is sketched below.
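For orientation, a minimal PyTorch sketch of the per-block scaling idea behind DeepSeek-style FP8 recipes. The 128x128 tile shape and scale rule are illustrative assumptions for this sketch, not TransformerEngine's actual implementation.

```python
import torch

# Illustrative tile size and scale rule -- assumptions for this sketch,
# not TransformerEngine's actual block-scaling implementation.
BLOCK = 128
FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def blockwise_fp8_quantize(x: torch.Tensor):
    """Quantize a 2-D tensor to FP8 with one scale per BLOCK x BLOCK tile."""
    rows, cols = x.shape
    assert rows % BLOCK == 0 and cols % BLOCK == 0
    # tiles[i, :, j, :] is the (i, j) BLOCK x BLOCK tile of x.
    tiles = x.reshape(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    # One scale per tile, so each tile spans the FP8 dynamic range.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True)
    scale = (amax / FP8_E4M3_MAX).clamp(min=1e-12)
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.squeeze(1).squeeze(-1)

def blockwise_fp8_dequantize(q: torch.Tensor, scale: torch.Tensor):
    rows, cols = q.shape
    tiles = q.to(torch.float32).reshape(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    return (tiles * scale[:, None, :, None]).reshape(rows, cols)

x = torch.randn(256, 256)
q, s = blockwise_fp8_quantize(x)
err = (blockwise_fp8_dequantize(q, s) - x).abs().max()
print(f"max abs reconstruction error: {err:.4f}")  # bounded by e4m3 precision
```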
📂 Category Updates
AMD Ecosystem
ROCm/ROCm
- Key Activity:
- [2025-05-21] 🚨 RELEASE: rocm-6.4.1
- [2025-05-21] Documentation updates for tools and versioning.
- Details:
- Hardware Support: Added RDNA4 support (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT).
- Features: DPX partition mode for MI300X; Introduction of ROCm Data Science Toolkit.
- Deprecations: Announced end-of-life plans for ROCm SMI (replaced by AMD SMI) and legacy profiling tools (replaced by ROCprofiler-SDK); a minimal AMD SMI sketch follows the metrics below.
- Fixes: Resolved MSCCL initialization failures and ROCm SMI uninstallation issues on RHEL/SLES.
- Metrics: 116 PRs, 36 Issues
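For teams migrating off ROCm SMI, a minimal device-enumeration sketch using the AMD SMI Python bindings. Function names follow the amdsmi package as I understand it and should be verified against the installed ROCm release.

```python
# Minimal device enumeration with the AMD SMI Python bindings (the
# successor to ROCm SMI). Verify the exact API against your ROCm docs.
from amdsmi import (
    amdsmi_init,
    amdsmi_shut_down,
    amdsmi_get_processor_handles,
    AmdSmiException,
)

def list_gpus():
    amdsmi_init()
    try:
        handles = amdsmi_get_processor_handles()
        print(f"Found {len(handles)} GPU(s)")
    except AmdSmiException as err:
        print(f"AMD SMI query failed: {err}")
    finally:
        amdsmi_shut_down()

if __name__ == "__main__":
    list_gpus()
```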
AMD-AGI/Primus
- Key Activity:
- [2025-05-23] Megatron-LM submodule update (bumped to May 2025 version).
- [2025-05-20] Config unification and README refactoring.
- Details:
- New PR: Update mixtral pretrain configs.
- Fix: Resolved interleaved virtual pipeline training errors in Megatron integration.
- Metrics: 17 PRs, 0 Issues
AMD-AGI/TraceLens
- Key Activity:
- [2025-05-xx] Performance modeling improvements and bug fixes.
- Details:
- Feature: Enabled batch GEMM through gemmologist.
- Feature Request: Exposed functionality for plotting roofline models (a generic roofline sketch follows the metrics below).
- Fix: Made input strides optional to prevent performance model crashes.
- Metrics: 44 PRs, 11 Issues
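The roofline items above refer to internal TraceLens/Gemmologist tooling; as a generic illustration of what a roofline plot encodes, here is a minimal sketch with placeholder hardware ceilings (not measured numbers for any particular GPU).

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder ceilings -- not measured numbers for any particular GPU.
PEAK_FLOPS = 1.0e15   # attainable compute, FLOP/s
PEAK_BW = 3.0e12      # attainable memory bandwidth, bytes/s

# Arithmetic intensity (FLOPs per byte moved), spanning memory- to compute-bound.
ai = np.logspace(-2, 4, 200)
# Classic roofline: attainable perf = min(peak compute, AI * peak bandwidth).
attainable = np.minimum(PEAK_FLOPS, ai * PEAK_BW)

plt.loglog(ai, attainable)
plt.axvline(PEAK_FLOPS / PEAK_BW, linestyle="--", label="ridge point")
plt.xlabel("Arithmetic intensity (FLOP/byte)")
plt.ylabel("Attainable FLOP/s")
plt.title("Roofline model (placeholder ceilings)")
plt.legend()
plt.savefig("roofline.png")
```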
ROCm/MAD
- Key Activity:
- Updates to benchmarking scripts.
- Details:
- Update: vLLM 05/27 release integration.
- Update: Profiling path scripts for vLLM.
- Metrics: 14 PRs, 0 Issues
PyTorch Ecosystem
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-05-09] 🚨 RELEASE: v0.11.0
- [2025-05-22] QAT and FSDP integration discussions.
- Details:
- Feature: MoE Quantization prototype (Int8/Int4 weight-only); a hedged weight-only quantization sketch follows the metrics below.
- Feature: PT2 Export Quantization APIs.
- AMD Specific: Added “preshuffled weight mm” for ROCm.
- Benchmarks: Added microbenchmarking framework for inference APIs.
- Metrics: 0 PRs, 22 Issues (note: the repository snapshot showed 0 PRs, likely due to a release freeze/merge window, but the release notes indicate significant activity).
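The MoE path in v0.11.0 is a prototype and its entry points may still move. Below is a minimal sketch of torchao's stable weight-only quantization flow applied to a plain expert-style MLP, assuming the quantize_/int8_weight_only entry points; the MoE-specific prototype API is not shown.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

# A stand-in expert MLP; the MoE-module prototype targets modules like this
# inside a router, but the stable API below is shown on a plain module to
# keep the sketch self-contained.
expert = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.SiLU(),
    nn.Linear(4096, 1024),
)

# In-place swap of Linear weights for int8 weight-only quantized versions.
quantize_(expert, int8_weight_only())

x = torch.randn(2, 1024)
with torch.inference_mode():
    y = expert(x)
print(y.shape)
```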
pytorch/pytorch
- Key Activity:
- [2025-05-14] Documentation and URL fixes.
- Ongoing high-volume CI maintenance and bug fixing.
- Details:
- Issue: FP8 scaled mm lowering ignores the scale_result argument.
- Issue: Inconsistent results for torch.amin() between CPU and GPU (a generic cross-device check is sketched after the metrics below).
- Fix: NumPy compatibility for 2-d small list indices.
- Metrics: 1480 PRs, 754 Issues
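The torch.amin() report concerns numerical disagreement between backends. A minimal sketch of the kind of cross-device comparison used to surface such issues follows; it is not the exact reproducer from the upstream issue.

```python
import torch

# Compare a reduction across devices; the general pattern behind
# CPU-vs-GPU mismatch reports, not the upstream issue's reproducer.
def compare_amin(shape=(1024, 1024), dim=1):
    x_cpu = torch.randn(*shape)
    cpu_out = torch.amin(x_cpu, dim=dim)
    if torch.cuda.is_available():
        gpu_out = torch.amin(x_cpu.cuda(), dim=dim).cpu()
        print("match:", torch.equal(cpu_out, gpu_out))
    else:
        print("CUDA not available; skipped GPU comparison")

compare_amin()
```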
pytorch/torchtitan
- Key Activity:
- DeepSeek model training loop hardening.
- Details:
- Fix: DeepSeek router collapse on training loop.
- Feature: [WIP] SimpleFSDP support for HSDP/DDP + TP.
- Metrics: 65 PRs, 25 Issues
NVIDIA Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- [2025-05-14] 🚨 RELEASE: v2.3
- Details:
- Feature: Enabled FP8 weights with FSDP (PyTorch); a minimal FP8 usage sketch follows the metrics below.
- Hardware: Added support for RTX 5090.
- Feature: Float8 block scaling recipe (DeepSeek v3 paper implementation).
- Performance: CPU offloading for activation tensors (FP8 attention).
- Metrics: 0 PRs, 0 Issues
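For orientation on the FP8 items above, a minimal sketch of TransformerEngine's PyTorch fp8_autocast path using the long-standing DelayedScaling recipe; the v2.3 block-scaling recipe is configured through its own recipe class not shown here, and an FP8-capable GPU is required.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# DelayedScaling is TE's long-standing FP8 recipe; the DeepSeek-style block
# scaling added in v2.3 uses a separate recipe class (not shown here).
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(16, 1024, device="cuda")

# GEMMs inside this context run in FP8 with TE's scaling bookkeeping.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)
# For the FSDP + FP8-weights path mentioned above, the TE module would
# additionally be wrapped in torch.distributed.fsdp.FullyShardedDataParallel
# under a torchrun launch; that setup is omitted here.
```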
NVIDIA/apex
- Key Activity:
- [2025-05-07] Major code cleanup.
- Details:
- Removal: Removed legacy features amp, ddp, and rnn; users should migrate to the upstream PyTorch implementations (a minimal torch.amp sketch follows the metrics below).
- Metrics: 0 PRs, 0 Issues
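For apex.amp users, the upstream replacement is torch.amp. A minimal mixed-precision training-step sketch follows, assuming a CUDA device and a recent PyTorch; the model and optimizer are placeholders. (The removed apex.ddp is likewise covered upstream by torch.nn.parallel.DistributedDataParallel.)

```python
import torch
import torch.nn as nn

# Upstream replacement for the removed apex.amp: torch.amp autocast + GradScaler.
model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

def train_step(x, target):
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)          # unscale gradients, then optimizer step
    scaler.update()
    return loss.item()

x = torch.randn(32, 512, device="cuda")
print(train_step(x, torch.randn(32, 512, device="cuda")))
```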
JAX Ecosystem
jax-ml/jax
- Key Activity:
- [2025-05-21] 🚨 RELEASE: jax-v0.6.1
- Details:
- Feature: Added jax.lax.axis_size.
- Change: Re-enabled version checking for CUDA package dependencies.
- Change: jax.ShapeDtypeStruct is now immutable (see the usage sketch after the metrics below).
- Metrics: 0 PRs, 0 Issues
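The ShapeDtypeStruct change affects code that mutated spec objects in place. A minimal sketch of the common, unaffected usage with jax.eval_shape:

```python
import jax
import jax.numpy as jnp

# ShapeDtypeStruct is a lightweight spec: shape + dtype, no data.
spec = jax.ShapeDtypeStruct((8, 128), jnp.float32)

def f(x):
    return jnp.tanh(x) @ jnp.ones((128, 16), jnp.float32)

# eval_shape traces f abstractly, so only the spec (not real data) is needed.
out_spec = jax.eval_shape(f, spec)
print(out_spec.shape, out_spec.dtype)  # (8, 16) float32

# As of jax v0.6.1 the struct is immutable: mutating its fields in place
# now fails instead of silently rewriting the spec.
```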
Serving & Frameworks
vllm-project/vllm
- Key Activity:
- Documentation overhaul and reorganization.
- Details:
- [2025-05-24] Reorganized user guide and updated external links.
- Metrics: 0 PRs, 0 Issues
volcengine/verl
- Key Activity:
- RLHF/PPO method updates.
- Details:
- Feature: Added support for PF-PPO.
- Change: Set default actor entropy coefficient to 0 (Breaking Change).
- Metrics: 0 PRs, 0 Issues
triton-lang/triton
- Key Activity:
- Build configuration adjustments.
- Details:
- Update: Added (and then reverted/adjusted) options to remove debug info from modules.
- Doc: Removed instructions to build PyTorch for Blackwell (likely upstreamed or changed).
- Metrics: 0 PRs, 0 Issues