GitHub Monthly Report: 2025-04-01 to 2025-04-30
📅 Engineering Report (2025-04-01 - 2025-04-30)
🚀 Executive Summary
April 2025 was dominated by the release of PyTorch v2.7.0, which triggered a cascade of updates across the entire AI ecosystem to ensure compatibility. Major libraries including torchvision, torchaudio, xformers, and NVIDIA’s TransformerEngine all pushed releases to align with the new PyTorch core.
This month also saw quantization and low-bit optimization mature: TorchAO v0.10.0 moved low-bit optimizers to official support and introduced a prototype for mxfp8 training on NVIDIA B200. In the compiler space, TileLang v0.1.4 shipped substantial AMD-specific support, including a DeepSeek MLA implementation, marking a strong community contribution to the ROCm ecosystem.
AMD Related Updates
- TileLang v0.1.4 Release: This compiler project added significant support for AMD GPUs, including a FlashMLA (DeepSeek) implementation, ROCm Dockerfile creation, and support for `T.gemm` with `transpose_b=False` on the AMD backend.
- FBGEMM GenAI Support: FBGEMM v1.2.0 added preliminary ROCm OSS build support for GenAI operations, signaling that Meta's optimization libraries are broadening their support for AMD hardware.
- ROCm Documentation & Tooling: The `ROCm/ROCm` repo is preparing for version 6.4 (updating tooling docs). Users reported naming inconsistencies (`amd-smi` vs `amd_smi`) in 6.3.x, which are being tracked.
- Performance Tooling: `AMD-AGI/TraceLens` is integrating “gemmologist” for modeling GEMM efficiencies and added new functionality to calculate TFLOPs and GB/s in node replay.
- Model Support: `ROCm/MAD` added support for Mosaic-ML MPT-30B training and JAX MaxText v25.5.
Competitive Analysis
- NVIDIA Blackwell Support: PyTorch v2.7.0 explicitly notes “NVIDIA Blackwell Architecture Support” as a prototype feature. TorchAO v0.10.0 also released prototype end-to-end training support for mxfp8 on NVIDIA B200.
- TransformerEngine v2.2: NVIDIA released v2.2 to support PyTorch 2.7 and added CPU offloading. However, they explicitly noted that the RTX 5090 is currently unsupported for FP8 execution, with a fix planned for v2.3.
- Triton Compiler: Ongoing work in `triton-lang` is focusing heavily on Hopper architecture optimizations (TMA load hoisting, WGMMA), indicating continued heavy optimization for NVIDIA’s H100/H200 series.
📂 Category Updates
🟢 PyTorch Ecosystem
🚨 pytorch/pytorch
- Key Activity:
- [2025-04-23] RELEASE v2.7.0 launched with an extensive changelog.
- Details:
- Highlights: Beta `torch.compile` support for Torch Function Modes, prototype Blackwell support, FlexAttention on x86 CPUs, and enhanced Intel GPU acceleration.
- Deprecations: Dropped support for CUDA 12.4 (CI moved to 12.8) and Triton < 2.2.0.
- Issues: High compilation time variance reported on benchmark dashboards.
- Metrics: 1428 New PRs, 767 New Issues (Very High Velocity)
🚨 pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-04-07] RELEASE v0.10.0 focusing on quantization maturity.
- Details:
- Features: Low Bit Optimizers moved to official support. Prototype support for mxfp8 training on NVIDIA B200. Introduction of PARQ (Piecewise-Affine Regularized Quantization).
- Perf: Low-bit CPU and MPS kernels are now pip installable from source.
- Metrics: 142 New PRs, 27 New Issues
🚨 pytorch/vision
- Key Activity:
- [2025-04-23] RELEASE v0.22.0.
- Details:
- Deprecation Warning: Video decoding/encoding capabilities are deprecated and will move to TorchCodec in v0.25.
- Optimization: Significant NMS speed-up on CUDA for high box counts.
- Metrics: 14 New PRs, 22 New Issues
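For readers unfamiliar with why box count matters for NMS, the following is a minimal pure-Python sketch of greedy non-maximum suppression. It is a reference illustration only, not torchvision's CUDA kernel; the names `iou` and `greedy_nms` are hypothetical. The pairwise IoU checks grow roughly quadratically with the number of boxes, which is the regime where a faster kernel helps most.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is dropped when its IoU with an already kept box
    exceeds iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Compare against every box kept so far (quadratic in the worst case).
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    demo_scores = [0.9, 0.8, 0.7]
    print(greedy_nms(demo_boxes, demo_scores, iou_threshold=0.5))
```

In the demo, the second box overlaps the first heavily and is suppressed at a threshold of 0.5, while the distant third box survives.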
🚨 pytorch/FBGEMM
- Key Activity:
- [2025-04-27] RELEASE v1.2.0.
- Details:
- GenAI: GenAI ops now packaged separately. Added preliminary ROCm OSS build support.
- TBE (Table Batched Embeddings): Added support for `int64_t` table indices on GPU.
- Metrics: 0 New PRs (Release focused)
pytorch/audio
- Key Activity:
- [2025-04-24] RELEASE v2.7.0.
- Details:
- Strategy Shift: The project announced it is transitioning into a maintenance phase to reduce redundancies with the wider PyTorch ecosystem.
- Metrics: 0 New PRs, 5 New Issues
pytorch/torchtitan
- Key Activity:
- [2025-04-07] Added llama4 as an experiment.
- Details:
- WIP: Float8 rowwise all-gather implementation.
- CI: Updated to use PyTorch nightly, indicating close tracking of core development.
- Metrics: 84 New PRs, 36 New Issues
🔴 AMD & ROCm Ecosystem
ROCm/ROCm
- Key Activity:
- [2025-04-11] Documentation updates targeting ROCm 6.4.
- Details:
- Issues: Users noted `amd-smi` changed to `amd_smi` in 6.3.x, causing scripts to break. Flash-attention import errors reported with ComfyUI.
- PRs: Updates to vLLM docker tags for benchmarking.
- Metrics: 101 New PRs, 48 New Issues
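Scripts affected by the `amd-smi` / `amd_smi` naming change could tolerate either spelling by probing for both. The sketch below assumes the inconsistency shows up as the name of an executable on `PATH`, which the report does not confirm; `find_first_tool` is a hypothetical helper, not part of any ROCm tooling.

```python
import shutil
from typing import Callable, Iterable, Optional


def find_first_tool(
    candidates: Iterable[str] = ("amd-smi", "amd_smi"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None.

    `which` is injectable so the lookup can be tested without the
    tool being installed.
    """
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    tool = find_first_tool()
    print(tool if tool else "neither amd-smi nor amd_smi found on PATH")
```

Trying the hyphenated name first is an arbitrary choice here; reorder `candidates` to match whichever spelling a given ROCm release actually installs.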
AMD-AGI/Primus
- Key Activity:
- [2025-04-25] Documentation and Runner updates.
- Details:
- Highlights: Added tensile tuning examples and fixed preflight scripts.
- Metrics: 24 New PRs, 0 New Issues
AMD-AGI/TraceLens
- Key Activity:
- Focus on performance modeling and replay accuracy.
- Details:
- Features: Integrated `gemmologist` for modeling GEMM efficiencies. Added TFLOPs and GB/s calculations to node replay.
- Metrics: 27 New PRs, 18 New Issues
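As background for the TFLOPs and GB/s figures, the standard back-of-the-envelope model for a dense GEMM is sketched below. This is not TraceLens's implementation; `gemm_throughput` is a hypothetical name, and the byte count assumes the idealized case where each of A, B, and C crosses memory exactly once.

```python
from typing import Tuple


def gemm_throughput(m: int, n: int, k: int, seconds: float,
                    bytes_per_element: int = 2) -> Tuple[float, float]:
    """Estimate (TFLOP/s, GB/s) for C[m, n] = A[m, k] @ B[k, n].

    FLOPs: 2 * m * n * k (one multiply and one add per inner-product term).
    Bytes: A and B read once, C written once (idealized, no re-reads).
    bytes_per_element defaults to 2, i.e. fp16/bf16.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / seconds / 1e12, bytes_moved / seconds / 1e9


if __name__ == "__main__":
    tflops, gbps = gemm_throughput(4096, 4096, 4096, seconds=1e-3)
    print(f"{tflops:.1f} TFLOP/s, {gbps:.1f} GB/s")
```

Real kernels often move more data than this lower bound because of tiling and cache misses, so the GB/s value is best read as a floor on traffic rather than a measurement.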
ROCm/MAD
- Key Activity:
- Model addition.
- Details:
- Models: Added Mosaic-ML MPT-30B training model and JAX-MaxText v25.5.
- Metrics: 12 New PRs, 0 New Issues
🟢 NVIDIA Ecosystem
🚨 NVIDIA/TransformerEngine
- Key Activity:
- [2025-04-28] RELEASE v2.2.
- Details:
- Compatibility: Added PyTorch 2.7 support.
- Features: Support for CPU offloading with Megatron-Core distributed optimizers.
- Limitations: Explicitly noted RTX 5090 is unsupported for FP8 (fix coming in v2.3).
- Metrics: 0 New PRs, 0 New Issues
facebookresearch/xformers
- Key Activity:
- [2025-04-28] RELEASE v0.0.30.
- Details:
- Requirement: Now requires PyTorch >= 2.7.
- Feature: Added support for local attention on the Flash3 backend (H100).
- Metrics: 0 New PRs, 0 New Issues
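With several libraries now enforcing a PyTorch floor (for example the `>= 2.7` requirement above), a downstream project may want to check an installed version string before importing. The following is a minimal standard-library sketch; `release_tuple` and `meets_minimum` are hypothetical helpers, and the parsing deliberately ignores local and pre-release suffixes. Where the third-party `packaging` library is available, `packaging.version.Version` is the more robust choice.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.7.0+cu128' -> (2, 7, 0).

    Suffixes such as '+cu128' or '.dev20250401' are dropped, so a
    pre-release of 2.7.0 is treated as 2.7.0 (a known simplification).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed release is at least the minimum release."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.7' and '2.7.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    for candidate in ("2.6.0", "2.7.0+cu128", "2.10.1"):
        print(candidate, meets_minimum(candidate, "2.7"))
```

Comparing integer tuples rather than raw strings avoids the classic mistake of ranking "2.10" below "2.7".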
🟡 Compiler & Languages (Triton, XLA, TileLang)
🚨 tile-ai/tilelang
- Key Activity:
- [2025-04-18] RELEASE v0.1.4.
- Details:
- AMD Focus: Significant updates for AMD GPU support, including DeepSeek MLA implementation, ROCm Dockerfiles, and fixes for composable kernel include paths.
- New Models: Added BitNet 1.58b examples.
- Metrics: 103 New PRs, 26 New Issues
triton-lang/triton
- Key Activity:
- Refactoring configuration and attention partitioning.
- Details:
- Refactor: Renaming `config.py` to `knobs.py` and introducing a new config module.
- Issues: Reported duplicated reductions in TTIR -> TTGIR and Hopper TMA load hoisting issues.
- Metrics: 259 New PRs, 45 New Issues
openxla/xla
- Key Activity:
- High volume maintenance and backend tweaks.
- Details:
- Arch: Re-enabled SVE on Aarch64 backend.
- Issues: AOT compilation issues reported with GEMM rewriter.
- Metrics: 1510 New PRs, 13 New Issues
🔵 Inference & Serving
huggingface/transformers
- Key Activity:
- [2025-04-07] Dropped support for PyTorch 2.0.
- Metrics: 535 New PRs, 199 New Issues
xdit-project/xDiT
- Key Activity:
- Multiple point releases (0.4.3.post2/3).
- Details:
- Features: Added sparse sage attention support and `xFuserLongContextAttention` sync flags.
- Metrics: 7 New PRs, 7 New Issues
AI-Hypercomputer/JetStream
- Key Activity:
- Multi-LoRA support.
- Details:
- Feature: Supported Multi-LoRA inferencing via JetStream server. Added llama benchmarks.
- Metrics: 32 New PRs, 1 New Issue