GitHub Monthly Report: 2025-05-01 to 2025-05-31
📅 Engineering Report (2025-05-01 - 2025-05-31)
🚀 Executive Summary
May 2025 was a pivotal month characterized by major version releases across the hardware-software stack. AMD significantly expanded hardware accessibility with ROCm 6.4.1, introducing support for RDNA4 consumer and workstation GPUs. PyTorch AO released v0.11.0, democratizing Mixture-of-Experts (MoE) quantization. On the competitive front, NVIDIA updated TransformerEngine to support the RTX 5090 and the Float8 recipes from the DeepSeek V3 paper. Across the board, engineering effort is concentrated on optimizing for DeepSeek architectures and enabling FP8/quantization workflows.
AMD Related Updates
- ROCm 6.4.1 Release: This is a major accessibility update. ROCm now supports RDNA4 architecture (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT). This expands the development surface area significantly beyond datacenter GPUs.
- New Tooling (ROCm-DS): Introduction of the ROCm Data Science toolkit (early access), aiming to accelerate standard data science workloads on AMD hardware.
- Memory Optimization: AMD Instinct MI300X now supports DPX partition mode under NPS2 memory mode, allowing for finer-grained memory control.
- Ecosystem Integration: The `AMD-AGI/Primus` repo is actively updating its Megatron-LM submodule and Mixtral configs, ensuring AMD hardware can train the latest MoE models effectively.
Competitive Analysis
- NVIDIA & Consumer Hardware: NVIDIA’s TransformerEngine v2.3 explicitly added support for the RTX 5090, bringing the consumer Blackwell flagship into the FP8 training stack rather than limiting that stack to datacenter parts.
- DeepSeek Adoption: NVIDIA added support for Float8 block scaling recipes specifically cited in the DeepSeek V3 paper. `pytorch/torchtitan` and `DeepEP` also saw DeepSeek-specific engineering (router collapse fixes, IBGDA optimizations), highlighting how this model architecture is driving library development.
- Quantization Pressure: The PyTorch AO v0.11.0 release introduces MoE quantization and PT2 Export quantization. This sets a new standard for ease of use in quantizing complex architectures; AMD tools (such as quantization in MIGraphX, the ROCm/AMDMIGraphX repo) will need to ensure parity with these PyTorch-native flows.
📂 Category Updates
AMD Ecosystem
ROCm/ROCm
- Key Activity:
- [2025-05-21] 🚨 RELEASE: rocm-6.4.1
- [2025-05-21] Doc updates regarding tool versioning.
- Details:
- Hardware Support: Added RDNA4 support (Radeon AI PRO R9700, RX 9070 XT, RX 9060 XT).
- New Feature: ROCm Data Science toolkit (ROCm-DS) introduced.
- Feature: DPX partition mode under NPS2 memory mode for MI300X.
- Deprecation Warning: ROCm SMI and the hipcc Perl scripts are slated for removal in future releases.
- Metrics: 116 New PRs, 36 New Issues
AMD-AGI/Primus
- Key Activity:
- [2025-05-23] Megatron-LM submodule update (moved to May 2025 version).
- [2025-05-20] Documentation refactor and config unification.
- Details:
- Optimization: Updates to Mixtral pretrain configs.
- Fix: Resolved interleaved virtual pipeline training errors in Megatron context.
- Metrics: 17 New PRs, 0 New Issues
AMD-AGI/TraceLens
- Key Activity:
- Active development on performance modeling and plotting.
- Details:
- Feature Request: Plotting roofline functionality exposed in the package.
- Improvement: Enabling batched GEMM through `gemmologist`.
- Metrics: 44 New PRs, 11 New Issues
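For context on the roofline functionality being exposed: a roofline plot caps a kernel's attainable throughput at the lesser of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch of that formula, in plain Python; the hardware figures below are illustrative placeholders, not measured numbers or TraceLens internals:

```python
def roofline(peak_flops, peak_bw, intensity):
    """Attainable performance (FLOP/s) under the roofline model.

    peak_flops: machine peak compute (FLOP/s)
    peak_bw:    peak memory bandwidth (bytes/s)
    intensity:  arithmetic intensity of the kernel (FLOP/byte)
    """
    return min(peak_flops, peak_bw * intensity)

# Illustrative numbers only:
peak_flops = 1.3e15   # 1.3 PFLOP/s
peak_bw = 5.3e12      # 5.3 TB/s

# A kernel at 10 FLOP/byte sits under the bandwidth slope (memory-bound);
# a kernel at very high intensity hits the flat compute roof.
print(roofline(peak_flops, peak_bw, 10.0))
print(roofline(peak_flops, peak_bw, 1e6))
```

The "knee" of the plot sits at `peak_flops / peak_bw` FLOP/byte; kernels left of it are memory-bound, right of it compute-bound.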
ROCm/MAD
- Key Activity:
- Integration with vLLM profiling.
- Details:
- Update: vLLM 05/27 release support and profiling script path updates.
- Metrics: 14 New PRs, 0 New Issues
PyTorch Ecosystem
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-05-09] 🚨 RELEASE: v0.11.0
- Details:
- Major Feature: Support for MoE Quantization (Mixture-of-Experts).
- Major Feature: PyTorch 2 Export Quantization (PT2E).
- Tooling: New microbenchmarking framework for inference APIs.
- Performance: Benchmarks provided for Mixtral-MoE on H100 showing significant memory savings with int4/int8 weight-only quantization.
- Metrics: 0 New PRs (post-release lull), 22 New Issues
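The weight-only quantization behind those Mixtral-MoE memory savings can be illustrated without the library: store weights as int8 plus a floating-point scale, and dequantize at matmul time. The sketch below is a conceptual per-tensor symmetric scheme in plain Python, not torchao's actual implementation (which operates per-channel/per-group on real tensors):

```python
def quantize_int8(weights):
    # Symmetric per-tensor int8: one fp scale, weights rounded into [-127, 127].
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate fp weights from int8 codes.
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than fp32; rounding error per weight <= scale / 2
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

The memory saving comes from keeping only `q` (1 byte/weight) plus one scale, while activations stay in higher precision.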
pytorch/pytorch
- Key Activity:
- High-volume maintenance and bug fixing.
- Details:
- Issue: FP8 scaled mm lowering ignores the `scale_result` argument.
- Issue: Inconsistent results in `torch.amin()` between CPU and GPU.
- Fix: NumPy compatibility for 2D small-list indices.
- Metrics: 1480 New PRs, 754 New Issues (High Velocity)
pytorch/torchtitan
- Key Activity:
- Focus on DeepSeek training stability and parallelism.
- Details:
- Bug: “Router collapse on deepseek training loop” reported.
- Feature WIP: Adding support for HSDP (Hybrid Sharded Data Parallel) + TP (Tensor Parallel).
- Metrics: 65 New PRs, 25 New Issues
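Router collapse is the MoE failure mode where the router sends nearly all tokens to a handful of experts, leaving the rest untrained. The standard mitigation in Switch-Transformer-style stacks (torchtitan's specific fix is not detailed in the report) is an auxiliary load-balancing loss that is minimized when dispatch is uniform; a plain-Python sketch:

```python
def load_balance_loss(router_probs, expert_assignment, num_experts):
    """Switch-Transformer-style auxiliary loss (sketch).

    router_probs:      per-token probability lists over experts
    expert_assignment: top-1 expert index chosen for each token
    Returns num_experts * sum_e f_e * P_e, which is minimized (value 1.0)
    when both token counts and probability mass spread evenly.
    """
    n = len(router_probs)
    # f_e: fraction of tokens dispatched to expert e
    f = [sum(1 for a in expert_assignment if a == e) / n for e in range(num_experts)]
    # P_e: mean router probability assigned to expert e
    p = [sum(probs[e] for probs in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(fe * pe for fe, pe in zip(f, p))

balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
assert balanced < collapsed  # collapse raises the auxiliary loss
```

In training, this term is added (with a small coefficient) to the task loss, penalizing the router whenever dispatch skews toward a few experts.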
NVIDIA Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- [2025-05-14] 🚨 RELEASE: v2.3
- Details:
- Hardware: Added support for RTX 5090.
- DeepSeek: Added Float8 block scaling recipe (Deepseek v3 paper) for Hopper GPUs.
- Feature: Enabled FP8 weights when using FSDP.
- Deprecation: CPU offloading of weight tensors is deprecated.
- Metrics: 0 New PRs reported (Release focused)
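The idea behind the DeepSeek V3 block-scaling recipe is that each small block of a tensor gets its own FP8 scale, so an outlier only coarsens quantization within its own block rather than the whole tensor. A minimal sketch, assuming FP8 E4M3 (max magnitude 448) and 1-D blocks of 128 values; TransformerEngine's actual recipe uses 2-D tiles and fused kernels:

```python
FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def blockwise_scales(values, block=128):
    """One scale per block, chosen so the block's max magnitude
    maps onto the FP8 E4M3 range (sketch only)."""
    scales = []
    for i in range(0, len(values), block):
        amax = max(abs(v) for v in values[i:i + block]) or 1.0
        scales.append(amax / FP8_E4M3_MAX)
    return scales

# A single outlier in the second block only coarsens that block:
data = [0.01] * 128 + [100.0] + [0.01] * 127
s = blockwise_scales(data)
assert s[0] < s[1]  # first block keeps a fine-grained scale
```

With a single per-tensor scale, the `100.0` outlier would force every small weight toward zero after rounding; per-block scales preserve resolution elsewhere.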
NVIDIA/apex
- Key Activity:
- [2025-05-07] Removal of legacy features.
- Details:
- Cleanup: Removed the legacy `amp`, `ddp`, and `rnn` implementations, signaling that users must move to the upstream PyTorch equivalents.
- Metrics: 0 New PRs (Maintenance mode)
JAX Ecosystem
jax-ml/jax
- Key Activity:
- [2025-05-21] 🚨 RELEASE: jax-v0.6.1
- Details:
- Feature: Added `jax.lax.axis_size`.
- Fix: Re-enabled CUDA package dependency version checking.
- Issue: FFI breaks if `sm_90a` (Hopper) is targeted.
- Metrics: 595 New PRs, 79 New Issues
AI-Hypercomputer/maxtext
- Key Activity:
- Model conversion and compiler updates.
- Details:
- Issue: Llama3 conversion to HF format broken.
- Update: Draft PR for v7x AOT (Ahead-of-Time compilation) support.
- Metrics: 126 New PRs, 6 New Issues
Frameworks & Compilers
facebookresearch/xformers
- Key Activity:
- Hardware backend updates.
- Details:
- Support: Enabling Blackwell support.
- Support: Updates from `ROCm/xformers` merged, indicating active AMD maintenance.
- Metrics: 3 New PRs, 4 New Issues
deepspeedai/DeepSpeed
- Key Activity:
- Training stability fixes.
- Details:
- Bug: `Trainer.train` resume fails with automatic tensor parallelism.
- CI: Fixed CI hangs on Torch 2.7.
- Metrics: 25 New PRs, 25 New Issues
deepseek-ai/DeepEP
- Key Activity:
- Communication optimization.
- Details:
- Optimization: Use IBGDA (InfiniBand GPUDirect Async) only, with lighter barriers.
- Fix: Low-latency P2P code cleanup.
- Metrics: 8 New PRs, 25 New Issues
triton-lang/triton
- Key Activity:
- Kernel updates and documentation churn.
- Details:
- Kernel: Implement attention kernels for d64 and d128.
- Doc Change: Removed the instructions for building PyTorch for Blackwell (unclear whether support was dropped or is now available out of the box).
- Metrics: 308 New PRs, 33 New Issues
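The d64/d128 work specializes kernels for attention head dimensions 64 and 128, the common sizes in current LLMs. The math those kernels implement is standard scaled dot-product attention, sketched here for a single query in plain Python (illustrative only, not the Triton kernel):

```python
import math

def attention(q, k, v):
    """Scaled dot-product attention for one query vector (sketch).

    q:    query of length d (the kernels specialize d = 64 or 128)
    k, v: lists of key/value vectors, one per sequence position
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in k]
    m = max(scores)                        # subtract max for numerical stability
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    w = [x / z for x in e]                 # softmax over sequence positions
    # Weighted sum of value vectors
    return [sum(wi * vec[j] for wi, vec in zip(w, v)) for j in range(len(v[0]))]

# With identical keys, attention reduces to a uniform average of the values:
out = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 0.0]])
```

Fixing `d` at compile time lets a kernel pick tile shapes and register usage for that head size, which is why 64 and 128 get dedicated variants.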