📅 Engineering Report (2025-03-01 - 2025-03-31)

🚀 Executive Summary

March 2025 was a pivotal month for ecosystem expansion, particularly third-party adoption of AMD hardware. While the internal ROCm teams focused on CI/CD refactoring and on updated Docker environments for vLLM and Megatron-LM, two external projects shipped releases that explicitly add AMD GPU support: volcengine/verl (RLHF framework) and xdit-project/xDiT (DiT inference).

In the kernel/compiler space, TileLang shipped a feature-heavy release (v0.1.3) that adds FlashAttention3 Backward and FP8 GEMM kernels, positioning it as a rapidly maturing TDSL.

  • Third-Party Adoption: volcengine/verl released v0.3.0 with official AMD support for vLLM and FSDP backends. Simultaneously, xdit-project/xDiT v0.4.3 added specific support for AMD GPUs.
  • Infrastructure & Tooling: ROCm/MAD unified their vLLM Docker images and added support for Megatron-LM training (v25.4).
  • Stability: The ROCm/ROCm core repository is tracking a critical issue involving runtime crashes after system suspension. Separately, it landed significant CI improvements (Ninja build generation).

Competitive Analysis

  • Kernel Compilation: TileLang (v0.1.3) is moving rapidly, adding FlashAttention3 Backward, DeepGEMM, and FP8 GEMM kernels. This indicates growing competition in the custom kernel generation space that Triton currently dominates.
  • Communication Libraries: deepseek-ai/DeepEP removed their “NVLink low-latency plan” from documentation while adding BF16 support for low-latency kernels. This may signal that the project is shifting its optimization focus away from features tied to a proprietary interconnect.
  • PyTorch Optimization: pytorch/ao promoted “Low Bit Optim” out of prototype status, signaling that the feature is now considered mature.

📂 Category Updates

AMD Ecosystem (ROCm & Tools)

🚨 REPO: xdit-project/xDiT

  • Key Activity:
    • [2025-03-20] RELEASE: v0.4.3 - Added AMD GPU support.
    • [2025-03-03] RELEASE: v0.4.2 - Added Tensor Parallelism for Step-Video-T2V.
  • Details:
    • The v0.4.3 release includes a specific PR (#477) adding AMD GPU support, broadening the hardware compatibility for this Diffusion Transformer inference engine.
    • Added support for SDXL (CFG parallel only) and Ray-based parallel inference.
  • Metrics: 0 New PRs 0 New Issues (Data reflects release notes)

REPO: ROCm/ROCm

  • Key Activity:
    • [High Volume] Significant CI/CD overhaul and issue tracking.
  • Details:
    • New Issue: Investigating ROCm runtime crashes after system suspension.
    • CI Updates: Added Ninja build generation for 12 components and added dependency on rocprof-sdk.
  • Metrics: 61 New PRs 46 New Issues

REPO: ROCm/MAD

  • Key Activity:
    • [2025-03-12] Docker container unification.
  • Details:
    • Unified vLLM docker with v0.7.3.
    • Added README for JAX training (maxtext-v25.4) and a new Megatron-LM training docker (v25.4).
  • Metrics: 6 New PRs 0 New Issues

REPO: AMD-AGI/Primus

  • Key Activity:
    • [2025-03-31] Documentation and Script refactoring.
  • Details:
    • Refactored shell scripts and added data preprocessing support.
  • Metrics: 12 New PRs 0 New Issues

REPO: AMD-AGI/TraceLens

  • Key Activity:
    • [2025-03-25] Documentation updates for v0.3.
  • Details:
    • Optimization of the subtract_intervals function for scalability.
  • Metrics: 25 New PRs 0 New Issues
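
For context on why subtract_intervals is sensitive to scale: removing one set of time intervals from another is quadratic if every pair is compared, but close to linear after sorting if both lists are swept once. The sketch below is illustrative only and is not TraceLens's implementation. The signature, the (start, end) tuple representation, and the merge_intervals helper are assumptions made for this example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # assumed representation: half-open (start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch (hypothetical helper)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty or inverted intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(base: List[Interval], cut: List[Interval]) -> List[Interval]:
    """Return the parts of `base` not covered by `cut`, using one sorted sweep."""
    base = merge_intervals(base)
    cut = merge_intervals(cut)
    result: List[Interval] = []
    j = 0  # index of the first cut that can still affect the current base interval
    for start, end in base:
        cursor = start
        # Cuts ending at or before `start` cannot affect this or any later base interval.
        while j < len(cut) and cut[j][1] <= cursor:
            j += 1
        k = j
        while k < len(cut) and cut[k][0] < end:
            c_start, c_end = cut[k]
            if c_start > cursor:
                result.append((cursor, c_start))  # uncovered gap before this cut
            cursor = max(cursor, c_end)
            if cursor >= end:
                break
            k += 1
        if cursor < end:
            result.append((cursor, end))  # uncovered tail of the base interval
    return result


remaining = subtract_intervals([(0, 10), (20, 30)], [(2, 4), (8, 22), (25, 26)])
```

After the initial sort, each cut is skipped at most once by j and visited a bounded number of times by k, so the sweep itself is roughly linear in the number of intervals.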

PyTorch Ecosystem

REPO: pytorch/pytorch

  • Key Activity:
    • [Massive Scale] 1400+ new PRs indicate high-velocity maintenance.
  • Details:
    • CI: Build MacOS CI with MKLDNN.
    • New Features: Discussion on supporting a copy kwarg in torch.reshape to align with the Python array API standard.
    • Issues: ONNX decomposition failing to preserve custom CompositeImplicitAutograd ops.
  • Metrics: 1432 New PRs 708 New Issues

REPO: pytorch/torchtitan

  • Key Activity:
    • [2025-03-24] Script module execution updates.
  • Details:
    • Work in progress on Contiguous Group GeMM kernels.
    • Discussion/Issue regarding Context Parallel support on Turing GPUs.
  • Metrics: 100 New PRs 27 New Issues

REPO: pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-03-10] Feature Promotion.
  • Details:
    • Promoted Low Bit Optim out of prototype.
    • Moved torchao/_models to benchmarks/_models, cleaning up the core library.
  • Metrics: 0 New PRs 25 New Issues

RLHF & Training Frameworks

🚨 REPO: volcengine/verl

  • Key Activity:
    • [2025-03-30] RELEASE: v0.3.0.post0 - AMD Support Added.
  • Details:
    • AMD Support: Now available for vLLM and FSDP backends.
    • New Algorithms: Added PRIME, RLOO, remax, and FIRE sampling.
    • Engine: SGLang integration available for preview. Upgraded Megatron (v0.11) and vLLM (v0.8.2).
  • Metrics: 0 New PRs 0 New Issues (Data reflects release notes)

REPO: AI-Hypercomputer/maxtext

  • Key Activity:
    • [2025-03-22] Model Support updates.
  • Details:
    • Added DeepSeek instruction support.
    • Updated README to reflect Gemma3 announcements.
  • Metrics: 0 New PRs 0 New Issues

Compilers & Kernels

🚨 REPO: tile-ai/tilelang

  • Key Activity:
    • [2025-03-23] RELEASE: v0.1.3 - Major feature expansion.
  • Details:
    • New Kernels: FlashAttention3 Backward, DeepGEMM, FP8 GEMM, GQA Backward.
    • Features: Async Pipeline inference, Auto-tuning config performance trace, TMA Store Synchronization.
    • Refactor: Phased out mandatory LLVM dependency (now optional).
  • Metrics: 0 New PRs 0 New Issues (Data reflects release notes)

REPO: triton-lang/triton

  • Key Activity:
    • [2025-03-28] Build system constraints.
  • Details:
    • Pinned cmake < 4 to prevent build issues.
  • Metrics: 0 New PRs 0 New Issues
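
To check whether a local toolchain falls under a pin like the one above, a small pre-flight check can parse the first line of `cmake --version` and compare the major version. This is a hypothetical helper, not part of Triton's build system; the function names and the default upper bound of 4 are assumptions that mirror the "cmake < 4" constraint.

```python
import re
from typing import Tuple


def parse_cmake_version(version_output: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from text like 'cmake version 3.31.6'."""
    match = re.search(r"cmake version (\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError(f"unrecognized cmake version output: {version_output!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def satisfies_pin(version: Tuple[int, int, int], upper_major: int = 4) -> bool:
    """True when the version is strictly below the pinned major version."""
    return version[0] < upper_major


ok_version = parse_cmake_version("cmake version 3.31.6")
too_new = parse_cmake_version("cmake version 4.0.0")
```

In practice the input string would come from running `cmake --version` (for example through subprocess.run with capture_output=True); it is hard-coded here so the example stays self-contained.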

REPO: deepseek-ai/DeepEP

  • Key Activity:
    • [2025-03-27] Strategic Documentation update.
  • Details:
    • Removed NVLink low-latency plan from roadmap/docs.
    • Added BF16 support for low-latency kernels.
  • Metrics: 0 New PRs 0 New Issues