📅 Engineering Report (2025-03-01 - 2025-03-31)

🚀 Executive Summary

March 2025 was a pivotal month for cross-ecosystem compatibility. The most significant trend was the expansion of AMD GPU support into high-profile third-party libraries: xDiT (DiT inference) and Verl (RLHF training) both released major versions explicitly adding AMD/ROCm backends.

Within the ROCm ecosystem, the focus remains on enabling heavy workloads, with ROCm/MAD releasing containers for JAX (MaxText) and Megatron-LM. Meanwhile, PyTorch and XLA continued heavy infrastructure refactoring, with PyTorch modernizing build scripts and XLA refining GPU paths for TensorFlow.

  • Third-Party Adoption (Critical):
    • 🚨 xDiT v0.4.3 officially added AMD GPU support, enabling high-performance Diffusion Transformer inference on ROCm.
    • 🚨 Verl v0.3.0 added AMD support for vLLM and FSDP backends, opening up PPO/GRPO RLHF training pipelines for AMD hardware.
  • Model Enablement: ROCm/MAD introduced new docker containers for JAX training (MaxText) and Megatron-LM v25.4, targeting enterprise-grade LLM training.
  • Tooling Maturity: TraceLens v0.3 and Primus saw significant refactoring for scalability and core stability, indicating a maturing tooling ecosystem.

Competitive Analysis

  • JAX/Google Agility: MaxText rapidly integrated DeepSeek instruction support and Gemma3, showing that the JAX ecosystem is keeping pace with PyTorch on cutting-edge model releases.
  • DeepSeek Infrastructure: DeepEP (DeepSeek’s Expert Parallelism library) removed its NVLink low-latency plan while adding BF16 support for low-latency kernels, potentially signaling a shift in how the project handles low-latency GPU-to-GPU communication or a focus on standardizing precision.
  • Compiler Wars: Triton and TileLang remain highly active. TileLang is solving complex layout issues for GQA decoding, while Triton is grappling with TMA descriptor hangs, highlighting the complexity of stabilizing H100-era features.

📂 Category Updates

🟢 AMD Ecosystem (Internal & Official)

AMD-AGI/Primus

  • Key Activity:
    • [2025-03-31] Refactored shell scripts for better maintainability.
    • [2025-03-25] Added support for MTP (Multi-Token Prediction) and updated documentation.
  • Details:
    • Focus on data preprocessing pipelines and core architectural refactors.
  • Metrics: 12 New PRs, 11 Closed PRs, 0 New Issues

AMD-AGI/TraceLens

  • Key Activity:
    • [2025-03-25] Documentation updates for v0.3.
    • [2025-03-06] Documentation updates for v0.2.
  • Details:
    • Implementation of an alternative subtract_intervals function optimized specifically for scalability, crucial for profiling large-scale distributed runs.
  • Metrics: 25 New PRs, 25 Closed PRs, 0 New Issues
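
As a rough illustration of what interval subtraction means in a profiling context (for example, removing the time covered by one set of events from another), here is a minimal sketch. It is not TraceLens's implementation: the signature, the (start, end) tuple representation, and the merge_intervals helper are all assumptions made for this example. Sorting both inputs once and sweeping through them keeps the cost near O((n + m) log(n + m)) rather than comparing every pair.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (start, end) representation


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty or inverted intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(base: List[Interval], remove: List[Interval]) -> List[Interval]:
    """Return the parts of `base` that are not covered by `remove`."""
    base = merge_intervals(base)
    remove = merge_intervals(remove)
    result: List[Interval] = []
    j = 0  # first removal interval that can still affect the current base interval
    for start, end in base:
        # Removal intervals ending before this base interval starts are done for good.
        while j < len(remove) and remove[j][1] <= start:
            j += 1
        cursor = start
        k = j
        while k < len(remove) and remove[k][0] < end:
            r_start, r_end = remove[k]
            if r_start > cursor:
                result.append((cursor, r_start))  # uncovered gap before this removal
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
            k += 1
        if cursor < end:
            result.append((cursor, end))  # uncovered tail of the base interval
    return result


print(subtract_intervals([(0, 10), (20, 30)], [(2, 4), (8, 22), (25, 26)]))
# prints [(0, 2), (4, 8), (22, 25), (26, 30)]
```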

ROCm/ROCm (Platform)

  • Key Activity:
    • Addressed runtime instability (system suspension crashes).
    • Investigated consumer card support (RX 6750 GRE) for Stable Diffusion.
  • Details:
    • CI Updates: Added Ninja build generation for 12 components and new dependencies for rocprof-compute.
  • Metrics: 61 New PRs, 46 New Issues, 64 Closed PRs (High maintainer responsiveness)

ROCm/MAD (Model Automation & Deployment)

  • Key Activity:
    • [2025-03-12] Unified vLLM docker with v0.7.3.
  • Details:
    • Added README and support for JAX-training (MaxText v25.4).
    • Added Megatron-LM training docker v25.4.
  • Metrics: 6 New PRs, 5 Closed PRs

🔥 PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • [2025-03-24] Major build system cleanup: Removed pre-CXX11 ABI logic.
    • Continuous integration updates for macOS (MKLDNN support).
  • Details:
    • Standardization efforts: torch.reshape() now supports the copy kwarg (Python Array API standard).
    • ONNX issues tracked regarding CompositeImplicitAutograd ops preservation.
  • Metrics: 1432 New PRs, 708 New Issues

pytorch/torchtitan

  • Key Activity:
    • [2025-03-24] Script execution standardization (running as modules).
    • [2025-03-06] Legal clearance for additional datasets.
  • Details:
    • Work in progress on Contiguous Group GEMM kernels.
    • Refactoring loss functions to support chunked loss.
    • Discussions on supporting Context Parallelism on Turing-generation GPUs.
  • Metrics: 100 New PRs, 27 New Issues
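
One common reading of "chunked loss" is computing cross-entropy over slices of the token sequence so that the full logits matrix is never materialized at once. The sketch below illustrates that idea in plain Python. It is not torchtitan's code: the function names, the list-of-lists layout, and the chunk_size parameter are hypothetical, chosen only to show that the chunked result matches the unchunked one.

```python
import math
from typing import List, Sequence


def project(hidden_rows: Sequence[Sequence[float]],
            weight: Sequence[Sequence[float]]) -> List[List[float]]:
    """Compute logits = hidden @ weight^T for a small slice of tokens."""
    return [[sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
            for row in hidden_rows]


def cross_entropy_sum(logits: Sequence[Sequence[float]],
                      targets: Sequence[int]) -> float:
    """Sum of per-token cross-entropy, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total


def chunked_cross_entropy(hidden: Sequence[Sequence[float]],
                          weight: Sequence[Sequence[float]],
                          targets: Sequence[int],
                          chunk_size: int) -> float:
    """Mean cross-entropy, materializing logits for one chunk at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not targets or len(hidden) != len(targets):
        raise ValueError("hidden and targets must be non-empty and equal length")
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        stop = start + chunk_size
        logits = project(hidden[start:stop], weight)  # only this chunk's logits exist
        total += cross_entropy_sum(logits, targets[start:stop])
    return total / len(targets)
```

Peak memory for the logits scales with chunk_size times vocabulary size instead of sequence length times vocabulary size, which is the usual motivation for this kind of refactor.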

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-03-10] Promoted Low Bit Optimizers out of prototype status.
  • Details:
    • Issues tracked regarding custom CUDA op dispatch failures and static quantized model saving.
  • Metrics: 0 New PRs (likely a data discrepancy), 25 New Issues

🌐 Google / JAX / XLA

AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-03-22] Updated for Gemma3 announcement.
    • [2025-03-18] Added DeepSeek instruction support.
  • Details:
    • Refactored prefill packing into a Python module.
    • Implemented un-sharding of QKV on the head dimension.
    • Addressed bugs in Mixture-of-Experts (MoE) load balancing loss reporting.
  • Metrics: 171 New PRs, 6 New Issues
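
For readers unfamiliar with the MoE load-balancing loss mentioned above, the sketch below computes the Switch-Transformer-style auxiliary loss, N * sum_i(f_i * P_i), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. This is an assumption about the general technique, not a description of MaxText's code, which may use a different formulation (for example top-k routing or a scaling coefficient). The function name and input layout are hypothetical.

```python
from typing import Sequence


def load_balancing_loss(router_probs: Sequence[Sequence[float]]) -> float:
    """Auxiliary loss N * sum_i f_i * P_i with top-1 routing.

    router_probs[t][e] is the router probability of expert e for token t.
    The value is 1.0 when routing is perfectly balanced and grows as
    tokens concentrate on fewer experts.
    """
    if not router_probs:
        raise ValueError("router_probs must contain at least one token")
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts        # tokens dispatched to each expert
    prob_sums = [0.0] * num_experts   # summed router probability per expert
    for probs in router_probs:
        top_expert = max(range(num_experts), key=probs.__getitem__)
        counts[top_expert] += 1
        for expert, p in enumerate(probs):
            prob_sums[expert] += p
    return num_experts * sum(
        (counts[e] / num_tokens) * (prob_sums[e] / num_tokens)
        for e in range(num_experts)
    )


balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.8, 0.2]])
print(balanced, collapsed)  # the collapsed routing yields the larger loss
```

A reporting bug in such a metric typically shows up as a logged value that does not match the term actually added to the training loss, which is why it helps to have a small reference computation like this to compare against.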

openxla/xla

  • Key Activity:
    • [2025-03-10] Documentation cleanup.
  • Details:
    • Fixes for TensorFlow GPU builds.
    • Autotuning improvements to centralize entry version updates.
    • Issues regarding tanhf optimization on aarch64 (SVE implementation).
  • Metrics: 1132 New PRs, 14 New Issues

⚙️ GenAI Infrastructure & Compilers

volcengine/verl (RLHF)

  • Key Activity:
    • 🚨 [2025-03-30] RELEASE v0.3.0.post0:
      • AMD Support: Added support for AMD GPUs via vLLM and FSDP backends.
      • New Algorithms: PRIME, RLOO, ReMax, and FIRE sampling.
      • SGLang: Integration available for preview.
      • Models: Support for Qwen2.5-VL and GRPO.
  • Metrics: 0 New PRs (Release focused), 0 New Issues

xdit-project/xDiT (Diffusion Transformers)

  • Key Activity:
    • 🚨 [2025-03-20] RELEASE v0.4.3:
      • AMD Support: PR #477 explicitly added AMD GPU support.
      • Features: Added SDXL CFG parallel support, TeaCache, and FBCache.
    • [2025-03-26] v0.4.3.post1 released with minor bugfixes.
  • Details:
    • Resolved issues regarding dit_parallel_size and parallel_world_size mismatches.
  • Metrics: 14 New PRs, 16 New Issues

deepseek-ai/DeepEP

  • Key Activity:
    • [2025-03-27] Strategic Shift: Removed NVLink low-latency plan.
    • [2025-03-10] Added BF16 support for low-latency kernels.
  • Details:
    • Investigating deadlocks on H20 GPUs and multi-node A100 setups.
  • Metrics: 7 New PRs, 65 New Issues

triton-lang/triton

  • Key Activity:
    • [2025-03-28] Build system updates (Pin cmake < 4).
  • Details:
    • Backend updates to newer LLVM project commits.
    • Critical issue: Loading from TMA descriptor hangs.
  • Metrics: 215 New PRs, 55 New Issues

huggingface/transformers

  • Key Activity:
    • [2025-03-21] Installation documentation updates.
  • Details:
    • Added fast image processor for ZoeDepth.
    • Fixes for Trainer data parallelism crashing on 1-dimensional tensors.
    • Addressed warnings when loading DeepSeek-V3.
  • Metrics: 461 New PRs, 205 New Issues