📅 Engineering Report (2025-03-01 - 2025-03-31)

🚀 Executive Summary

March 2025 was a pivotal month for ecosystem expansion, particularly third-party adoption of AMD hardware. Internal ROCm teams focused on CI/CD refactoring and on updated Docker environments for vLLM and Megatron-LM, while two external projects shipped AMD support: volcengine/verl (RLHF framework) and xdit-project/xDiT (DiT inference) both published releases that explicitly add AMD GPU support.

In the kernel/compiler space, TileLang shipped a feature-heavy release (v0.1.3) that adds FlashAttention3 Backward and FP8 GEMM kernels, positioning it as a rapidly maturing TDSL.

- Third-Party Adoption: volcengine/verl released v0.3.0 (tagged v0.3.0.post0) with official AMD support for the vLLM and FSDP backends. Earlier in the month, xdit-project/xDiT v0.4.3 added AMD GPU support.
- Infrastructure & Tooling: ROCm/MAD unified its vLLM Docker images and added a Megatron-LM training Docker image (v25.4).
- Stability: ROCm/ROCm is tracking a critical issue with runtime crashes after system suspension. The repository also landed significant CI improvements (Ninja build generation).

Competitive Analysis

- Kernel Compilation: TileLang (v0.1.3) is moving rapidly, adding FlashAttention3 Backward, DeepGEMM, and FP8 GEMM kernels. This indicates growing competition in the custom kernel generation space that Triton currently dominates.
- Communication Libraries: DeepSeek-AI/DeepEP removed its "NVLink low-latency plan" from the documentation while adding BF16 support for low-latency kernels. This suggests that the project may be shifting its optimization focus away from features tied to a proprietary interconnect.
- PyTorch Optimization: PyTorch/AO promoted "Low Bit Optim" out of prototype status, signaling maturity in quantization workflows.

📂 Category Updates

AMD Ecosystem (ROCm & Tools)

🚨 REPO: xdit-project/xDiT

Key Activity:
- [2025-03-20] RELEASE: v0.4.3 - Added AMD GPU support.
- [2025-03-03] RELEASE: v0.4.2 - Added Tensor Parallelism for Step-Video-T2V.
Details:
- The v0.4.3 release includes a specific PR (#477) adding AMD GPU support, broadening the hardware compatibility for this Diffusion Transformer inference engine.
- Added support for SDXL (CFG parallel only) and Ray-based parallel inference.
Metrics: 0 New PRs 0 New Issues (Data reflects release notes)

REPO: ROCm/ROCm

Key Activity:
- [High Volume] Significant CI/CD overhaul and issue tracking.
Details:
- New Issue: Investigating ROCm runtime crashes after system suspension.
- CI Updates: Added Ninja build generation for 12 components and added dependency on rocprof-sdk.
Metrics: 61 New PRs 46 New Issues

REPO: ROCm/MAD

Key Activity:
- [2025-03-12] Docker container unification.
Details:
- Unified the vLLM Docker image on v0.7.3.
- Added a README for JAX training (maxtext-v25.4) and a new Megatron-LM training Docker image (v25.4).
Metrics: 6 New PRs 0 New Issues

REPO: AMD-AGI/Primus

Key Activity:
- [2025-03-31] Documentation and Script refactoring.
Details:
- Refactored shell scripts and added data preprocessing support.
Metrics: 12 New PRs 0 New Issues

REPO: AMD-AGI/TraceLens

Key Activity:
- [2025-03-25] Documentation updates for v0.3.
Details:
- Optimization of the subtract_intervals function for scalability.
Metrics: 25 New PRs 0 New Issues
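
For readers unfamiliar with the operation, interval subtraction (removing one set of time ranges from another, for example to find uncovered time in a trace) can be sketched as below. This is a generic illustration only: the report does not show TraceLens's actual implementation, and the signature, the half-open `(start, end)` convention, and the `merge_intervals` helper are assumptions made for this example. Sorting and merging both inputs first lets a single forward pass do the work, which is one common way to keep such a routine scalable.

```python
def merge_intervals(intervals):
    """Sort intervals and merge any that overlap or touch (hypothetical helper)."""
    merged = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty or inverted intervals
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def subtract_intervals(base, remove):
    """Return the parts of `base` not covered by any interval in `remove`.

    Intervals are half-open (start, end) pairs. This is an illustrative
    sketch, not the TraceLens implementation.
    """
    base = merge_intervals(base)
    remove = merge_intervals(remove)
    result = []
    j = 0  # first removal interval that can still affect the current base interval
    for start, end in base:
        cursor = start
        # Skip removal intervals that end before this base interval begins.
        while j < len(remove) and remove[j][1] <= cursor:
            j += 1
        k = j
        while k < len(remove) and remove[k][0] < end:
            r_start, r_end = remove[k]
            if r_start > cursor:
                result.append((cursor, r_start))  # gap before this removal
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
            k += 1
        if cursor < end:
            result.append((cursor, end))  # tail after the last removal
    return result


print(subtract_intervals([(0, 10)], [(2, 3), (5, 7)]))
```

With both inputs merged up front, the cost is dominated by the two sorts; the subtraction pass itself only moves forward through each list.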

PyTorch Ecosystem

REPO: pytorch/pytorch

Key Activity:
- [Massive Scale] 1400+ new PRs indicate high-velocity maintenance.
Details:
- CI: Build macOS CI with MKLDNN.
- New Features: Discussion on supporting a copy kwarg in torch.reshape to align with the Python array API standard.
- Issues: ONNX decomposition failing to preserve custom CompositeImplicitAutograd ops.
Metrics: 1432 New PRs 708 New Issues

REPO: pytorch/torchtitan

Key Activity:
- [2025-03-24] Script module execution updates.
Details:
- Work in progress on Contiguous Group GEMM kernels.
- Discussion/Issue regarding Context Parallel support on Turing GPUs.
Metrics: 100 New PRs 27 New Issues

REPO: pytorch/ao (Architecture Optimization)

Key Activity:
- [2025-03-10] Feature Promotion.
Details:
- Promoted Low Bit Optim out of prototype.
- Moved torchao/_models to benchmarks/_models, cleaning up the core library.
Metrics: 0 New PRs 25 New Issues

RLHF & Training Frameworks

🚨 REPO: volcengine/verl

Key Activity:
- [2025-03-30] RELEASE: v0.3.0.post0 - AMD Support Added.
Details:
- AMD Support: Now available for vLLM and FSDP backends.
- New Algorithms: Added PRIME, RLOO, ReMax, and FIRE sampling.
- Engine: SGLang integration available for preview. Upgraded Megatron (v0.11) and vLLM (v0.8.2).
Metrics: 0 New PRs 0 New Issues (Data reflects release notes)

REPO: AI-Hypercomputer/maxtext

Key Activity:
- [2025-03-22] Model Support updates.
Details:
- Added DeepSeek instruction support.
- Updated README to reflect Gemma3 announcements.
Metrics: 0 New PRs 0 New Issues

Compilers & Kernels

🚨 REPO: tile-ai/tilelang

Key Activity:
- [2025-03-23] RELEASE: v0.1.3 - Major feature expansion.
Details:
- New Kernels: FlashAttention3 Backward, DeepGEMM, FP8 GEMM, GQA Backward.
- Features: Async Pipeline inference, Auto-tuning config performance trace, TMA Store Synchronization.
- Refactor: Phased out mandatory LLVM dependency (now optional).
Metrics: 0 New PRs 0 New Issues (Data reflects release notes)

REPO: triton-lang/triton

Key Activity:
- [2025-03-28] Build system constraints.
Details:
- Pinned cmake < 4 to prevent build issues.
Metrics: 0 New PRs 0 New Issues
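
To make the effect of an upper-bound pin such as "cmake < 4" concrete, the sketch below checks a version string against an exclusive major-version ceiling. It is a simplified illustration using only the standard library: the function names are hypothetical, the parsing ignores pre-release suffixes, and it is not how Triton's build or pip actually evaluates the constraint.

```python
def parse_version(text):
    """Turn a string like '3.31.6' into (3, 31, 6).

    Parsing stops at the first non-numeric character in a component,
    so '4.0.0-rc1' becomes (4, 0, 0). Hypothetical helper for illustration.
    """
    parts = []
    for piece in text.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies_upper_bound(version, upper="4"):
    """Return True if `version` is strictly below the exclusive bound `upper`."""
    return parse_version(version) < parse_version(upper)


for candidate in ["3.20", "3.31.6", "4.0.0"]:
    print(candidate, satisfies_upper_bound(candidate))
```

Any 3.x release passes the check, while 4.0.0 and later are rejected, which is the behavior a "< 4" pin is meant to enforce.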

REPO: deepseek-ai/DeepEP

Key Activity:
- [2025-03-27] Strategic Documentation update.
Details:
- Removed NVLink low-latency plan from roadmap/docs.
- Added BF16 support for low-latency kernels.
Metrics: 0 New PRs 0 New Issues