📅 Engineering Report (2025-04-01 - 2025-04-30)

🚀 Executive Summary

April 2025 was dominated by the release of PyTorch v2.7.0, which triggered a wave of compatibility releases across the ecosystem. torchvision, torchaudio, xformers, and NVIDIA’s TransformerEngine all shipped releases aligned with the new PyTorch core.

The month also saw quantization and low-bit optimization mature: TorchAO v0.10.0 moved low-bit optimizers to official support and introduced prototype mxfp8 training on NVIDIA B200. In the compiler space, TileLang v0.1.4 shipped with substantial AMD GPU support, including a DeepSeek MLA implementation, marking a strong community contribution to the ROCm ecosystem.

  • TileLang v0.1.4 Release: This compiler project added significant support for AMD GPUs, including a FlashMLA (DeepSeek) implementation, ROCm Dockerfiles, and support for T.gemm with transpose_b=False on the AMD backend.
  • FBGEMM GenAI Support: FBGEMM v1.2.0 added preliminary ROCm OSS build support for GenAI operations, signaling broader support for AMD hardware in Meta’s optimization libraries.
  • ROCm Documentation & Tooling: The ROCm/ROCm repo is preparing for version 6.4, with tooling documentation updates under way. Users reported naming inconsistencies (amd-smi vs. amd_smi) in 6.3.x, which are being tracked.
  • Performance Tooling: AMD-AGI/TraceLens is integrating “gemmologist” to model GEMM efficiency and now calculates TFLOPs and GB/s in node replay.
  • Model Support: ROCm/MAD added support for MosaicML MPT-30B training and JAX MaxText v25.5.

Competitive Analysis

  • NVIDIA Blackwell Support: PyTorch v2.7.0 explicitly notes “NVIDIA Blackwell Architecture Support” as a prototype feature. TorchAO v0.10.0 also released prototype end-to-end training support for mxfp8 on NVIDIA B200.
  • TransformerEngine v2.2: NVIDIA released v2.2 with PyTorch 2.7 support and CPU offloading. However, NVIDIA explicitly noted that the RTX 5090 is currently unsupported for FP8 execution, with a fix planned for v2.3.
  • Triton Compiler: Ongoing work in triton-lang centers on Hopper architecture optimizations (TMA load hoisting, WGMMA), indicating continued investment in NVIDIA’s H100/H200 series.

📂 Category Updates

🟢 PyTorch Ecosystem

🚨 pytorch/pytorch

  • Key Activity:
    • [2025-04-23] RELEASE v2.7.0, with an extensive changelog.
  • Details:
    • Highlights: torch.compile support for Torch Function Modes (beta), prototype Blackwell support, FlexAttention on x86 CPUs, and enhanced Intel GPU acceleration.
    • Deprecations: Dropped support for CUDA 12.4 (CI moved to 12.8) and Triton < 2.2.0.
    • Issues: High compilation time variance reported on benchmark dashboards.
  • Metrics: 1428 New PRs, 767 New Issues (Very High Velocity)

🚨 pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-04-07] RELEASE v0.10.0 focusing on quantization maturity.
  • Details:
    • Features: Low Bit Optimizers moved to official support. Prototype support for mxfp8 training on NVIDIA B200. Introduction of PARQ (Piecewise-Affine Regularized Quantization).
    • Perf: Low-bit CPU and MPS kernels are now pip installable from source.
  • Metrics: 142 New PRs, 27 New Issues

🚨 pytorch/vision

  • Key Activity:
    • [2025-04-23] RELEASE v0.22.0.
  • Details:
    • Deprecation Warning: Video decoding/encoding capabilities are deprecated and slated for removal in v0.25; users are encouraged to migrate to TorchCodec.
    • Optimization: Significant NMS speed-up on CUDA for high box counts.
  • Metrics: 14 New PRs, 22 New Issues
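
For context on why box count matters for NMS, a minimal pure-Python sketch of greedy non-maximum suppression follows. It is a reference illustration only, not torchvision’s CUDA kernel: the function names iou and nms and the (x1, y1, x2, y2) box layout are assumptions made for this example. Each candidate is compared against every box already kept, so the work grows roughly quadratically with the number of boxes.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is dropped when its IoU with an already-kept box exceeds the threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Pairwise check against every kept box: the quadratic hot spot.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    demo_scores = [0.9, 0.8, 0.7]
    print(nms(demo_boxes, demo_scores, iou_threshold=0.5))
```

In the demo, the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5 and kept at a threshold of 0.7.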

🚨 pytorch/FBGEMM

  • Key Activity:
    • [2025-04-27] RELEASE v1.2.0.
  • Details:
    • GenAI: GenAI ops now packaged separately. Added preliminary ROCm OSS build support.
    • TBE (Table Batched Embeddings): Added support for int64_t table indices on GPU.
  • Metrics: 0 New PRs (Release focused)

pytorch/audio

  • Key Activity:
    • [2025-04-24] RELEASE v2.7.0.
  • Details:
    • Strategy Shift: The project announced it is transitioning into a maintenance phase to reduce redundancies with the wider PyTorch ecosystem.
  • Metrics: 0 New PRs, 5 New Issues

pytorch/torchtitan

  • Key Activity:
    • [2025-04-07] Added Llama 4 as an experiment.
  • Details:
    • WIP: Float8 rowwise all-gather implementation.
    • CI: Updated to use PyTorch nightly, indicating close tracking of core development.
  • Metrics: 84 New PRs, 36 New Issues

🔴 AMD & ROCm Ecosystem

ROCm/ROCm

  • Key Activity:
    • [2025-04-11] Documentation updates targeting ROCm 6.4.
  • Details:
    • Issues: Users noted that amd-smi changed to amd_smi in 6.3.x, breaking scripts. Flash-attention import errors were reported with ComfyUI.
    • PRs: Updates to vLLM docker tags for benchmarking.
  • Metrics: 101 New PRs, 48 New Issues
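
For scripts affected by the amd-smi / amd_smi naming issue above, one defensive pattern is to probe for both spellings at runtime instead of hard-coding one. The sketch below is illustrative only: resolve_cli and AMD_SMI_NAMES are hypothetical names, and the two spellings come from the user reports, not from a verified ROCm release.

```python
import shutil
from typing import Iterable, Optional

# Candidate spellings taken from the user reports; treat them as assumptions.
AMD_SMI_NAMES = ("amd-smi", "amd_smi")


def resolve_cli(candidates: Iterable[str]) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    tool = resolve_cli(AMD_SMI_NAMES)
    if tool is None:
        print("No SMI tool found on PATH under either name")
    else:
        print(f"Using SMI tool at {tool}")
```

Probing in a fixed order keeps the behavior predictable on hosts where both names are present.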

AMD-AGI/Primus

  • Key Activity:
    • [2025-04-25] Documentation and Runner updates.
  • Details:
    • Highlights: Added tensile tuning examples and fixed preflight scripts.
  • Metrics: 24 New PRs, 0 New Issues

AMD-AGI/TraceLens

  • Key Activity:
    • Focus on performance modeling and replay accuracy.
  • Details:
    • Features: Integrated gemmologist for modeling GEMM efficiencies. Added TFLOPs and GB/s calculations to node replay.
  • Metrics: 27 New PRs, 18 New Issues
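
The TFLOPs and GB/s figures mentioned above are typically derived from operation shapes and measured durations. The sketch below shows the standard arithmetic for a dense GEMM; it is not TraceLens’s implementation, the function names are hypothetical, and the byte count assumes each operand is read or written exactly once (no cache reuse modeling).

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C[m, n] = A[m, k] @ B[k, n]: one multiply and one add per term."""
    return 2 * m * n * k


def gemm_bytes(m: int, n: int, k: int, bytes_per_element: int = 2) -> int:
    """Minimum memory traffic: read A and B once, write C once (assumption)."""
    return (m * k + k * n + m * n) * bytes_per_element


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved compute throughput in TFLOP/s for a measured duration."""
    return gemm_flops(m, n, k) / seconds / 1e12


def achieved_gbps(m: int, n: int, k: int, seconds: float, bytes_per_element: int = 2) -> float:
    """Achieved memory throughput in GB/s for a measured duration."""
    return gemm_bytes(m, n, k, bytes_per_element) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: a 4096 x 4096 x 4096 fp16 GEMM that took 1 ms.
    print(f"{achieved_tflops(4096, 4096, 4096, 1e-3):.1f} TFLOP/s")
    print(f"{achieved_gbps(4096, 4096, 4096, 1e-3):.1f} GB/s")
```

For the hypothetical 1 ms measurement this works out to about 137.4 TFLOP/s and about 100.7 GB/s.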

ROCm/MAD

  • Key Activity:
    • Model addition.
  • Details:
    • Models: Added MosaicML MPT-30B training model and JAX MaxText v25.5.
  • Metrics: 12 New PRs, 0 New Issues

🟢 NVIDIA Ecosystem

🚨 NVIDIA/TransformerEngine

  • Key Activity:
    • [2025-04-28] RELEASE v2.2.
  • Details:
    • Compatibility: Added PyTorch 2.7 support.
    • Features: Support for CPU offloading with Megatron-Core distributed optimizers.
    • Limitations: Explicitly noted RTX 5090 is unsupported for FP8 (fix coming in v2.3).
  • Metrics: 0 New PRs, 0 New Issues

facebookresearch/xformers

  • Key Activity:
    • [2025-04-28] RELEASE v0.0.30.
  • Details:
    • Requirement: Now requires PyTorch >= 2.7.
    • Feature: Added support for local attention on the Flash3 backend (H100).
  • Metrics: 0 New PRs, 0 New Issues
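
Minimum-version requirements such as the PyTorch >= 2.7 floor above can be checked before installation or at startup. The sketch below uses only the standard library; parse_release, meets_minimum, and installed_version are hypothetical helper names, and the parser deliberately ignores pre-release and local tags (so a 2.7.0 development build is treated as 2.7.0), which is a simplifying assumption.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.7.0+cu128' -> (2, 7, 0)."""
    public = version.split("+", 1)[0]  # drop local tags such as +cu128
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment (e.g. 'dev2025...')
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare releases after zero-padding, so '2.7' and '2.7.0' are equal."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("torch")
    if found is None:
        print("torch is not installed")
    else:
        print(f"torch {found} meets >= 2.7: {meets_minimum(found, '2.7')}")
```

Where the packaging library is available, its Version class handles pre-release ordering more rigorously than this simplified parser.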

🟡 Compiler & Languages (Triton, XLA, TileLang)

🚨 tile-ai/tilelang

  • Key Activity:
    • [2025-04-18] RELEASE v0.1.4.
  • Details:
    • AMD Focus: Significant updates for AMD GPU support, including a DeepSeek MLA implementation, ROCm Dockerfiles, and fixes for Composable Kernel include paths.
    • New Models: Added BitNet 1.58b examples.
  • Metrics: 103 New PRs, 26 New Issues

triton-lang/triton

  • Key Activity:
    • Refactoring configuration and attention partitioning.
  • Details:
    • Refactor: Renaming config.py to knobs.py and introducing a new config module.
    • Issues: Reported duplicated reductions in TTIR -> TTGIR and Hopper TMA load hoisting issues.
  • Metrics: 259 New PRs, 45 New Issues

openxla/xla

  • Key Activity:
    • High volume maintenance and backend tweaks.
  • Details:
    • Arch: Re-enabled SVE on the AArch64 backend.
    • Issues: AOT compilation issues reported with GEMM rewriter.
  • Metrics: 1510 New PRs, 13 New Issues

🔵 Inference & Serving

huggingface/transformers

  • Key Activity:
    • [2025-04-07] Dropped support for PyTorch 2.0.
  • Metrics: 535 New PRs, 199 New Issues

xdit-project/xDiT

  • Key Activity:
    • Multiple point releases (0.4.3.post2 and 0.4.3.post3).
  • Details:
    • Features: Added sparse sage attention support and xFuserLongContextAttention sync flags.
  • Metrics: 7 New PRs, 7 New Issues

AI-Hypercomputer/JetStream

  • Key Activity:
    • Multi-LoRA support.
  • Details:
    • Feature: Added multi-LoRA inference support in the JetStream server. Added llama benchmarks.
  • Metrics: 32 New PRs, 1 New Issue