📅 Engineering Report (2025-03-01 - 2025-03-31)

🚀 Executive Summary

March 2025 was a pivotal month for cross-ecosystem compatibility. The most significant trend was the expansion of AMD GPU support into high-profile third-party libraries: xDiT (DiT inference) and Verl (RLHF training) both released major versions explicitly adding AMD/ROCm backends.

Within the ROCm ecosystem, the focus remains on enabling large-scale training workloads, with ROCm/MAD releasing containers for JAX (MaxText) and Megatron-LM. Meanwhile, PyTorch and XLA continued heavy infrastructure refactoring: PyTorch modernized its build scripts and XLA refined GPU build paths for TensorFlow.

Third-Party Adoption (Critical):
- 🚨 xDiT v0.4.3 officially added AMD GPU support, enabling high-performance Diffusion Transformer inference on ROCm.
- 🚨 Verl v0.3.0 added AMD support for vLLM and FSDP backends, opening up PPO/GRPO RLHF training pipelines for AMD hardware.
Model Enablement: ROCm/MAD introduced new docker containers for JAX training (MaxText) and Megatron-LM v25.4, targeting enterprise-grade LLM training.
Tooling Maturity: TraceLens v0.3 and Primus saw significant refactoring for scalability and core stability, indicating a maturing profiling ecosystem.

Competitive Analysis

JAX/Google Agility: MaxText added DeepSeek instruction support and was updated for the Gemma3 announcement within the month, suggesting that the JAX ecosystem is tracking new model releases about as quickly as the PyTorch ecosystem.
DeepSeek Infrastructure: DeepEP (DeepSeek’s Expert Parallelism library) removed its NVLink low-latency plan and added BF16 support for low-latency kernels, which may signal a change in how it handles low-latency communication or a move to standardize precision.
Compiler Wars: Triton and TileLang remain highly active. TileLang is solving complex layout issues for GQA decoding, while Triton is grappling with TMA descriptor hangs, highlighting the complexity of stabilizing H100-era features.

📂 Category Updates

🟢 AMD Ecosystem (Internal & Official)

AMD-AGI/Primus

Key Activity:
- [2025-03-31] Refactored shell scripts for better maintainability.
- [2025-03-25] Added support for MTP (Multi-Token Prediction) and updated documentation.
Details:
- Focus on data preprocessing pipelines and core architectural refactors.
Metrics: 12 New PRs, 11 Closed PRs, 0 New Issues

AMD-AGI/TraceLens

Key Activity:
- [2025-03-25] Documentation updates for v0.3.
- [2025-03-06] Documentation updates for v0.2.
Details:
- Implementation of an alternative subtract_intervals function optimized specifically for scalability, crucial for profiling large-scale distributed runs.
Metrics: 25 New PRs, 25 Closed PRs, 0 New Issues
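
To make the subtract_intervals item concrete, here is a minimal sketch of interval subtraction using a sort-and-sweep approach. It is not TraceLens's implementation: only the function name comes from the report, while the signature, the half-open (start, end) tuple representation, and the merge helper are assumptions made for illustration.

```python
def subtract_intervals(base, cuts):
    """Return the parts of `base` not covered by any interval in `cuts`.

    Assumption: intervals are half-open (start, end) tuples. This is an
    illustrative sketch, not the TraceLens code.
    """
    def merge(intervals):
        # Sort and coalesce overlapping or touching intervals; drop empty ones.
        merged = []
        for start, end in sorted(intervals):
            if start >= end:
                continue
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    base = merge(base)
    cuts = merge(cuts)
    result = []
    j = 0  # index of the first cut that can still overlap the current base interval
    for start, end in base:
        # Skip cuts that end before this base interval begins.
        while j < len(cuts) and cuts[j][1] <= start:
            j += 1
        cursor = start
        k = j
        # Walk the cuts that overlap [start, end) and emit the gaps between them.
        while k < len(cuts) and cuts[k][0] < end:
            cut_start, cut_end = cuts[k]
            if cut_start > cursor:
                result.append((cursor, cut_start))
            cursor = max(cursor, cut_end)
            k += 1
        if cursor < end:
            result.append((cursor, end))
    return result
```

For example, subtract_intervals([(0, 10)], [(2, 4), (6, 12)]) returns [(0, 2), (4, 6)]. After the initial sorts, the sweep visits each interval a bounded number of times, which is the property that matters when a trace contains a very large number of events.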

ROCm/ROCm (Platform)

Key Activity:
- Addressed runtime instability (system suspension crashes).
- Investigated consumer card support (RX 6750 GRE) for Stable Diffusion.
Details:
- CI Updates: Added Ninja build generation for 12 components and new dependencies for rocprof-compute.
Metrics: 61 New PRs, 46 New Issues, 64 Closed PRs (high maintainer responsiveness)

ROCm/MAD (Model Automation and Dashboarding)

Key Activity:
- [2025-03-12] Unified vLLM docker with v0.7.3.
Details:
- Added README and support for JAX-training (MaxText v25.4).
- Added Megatron-LM training docker v25.4.
Metrics: 6 New PRs, 5 Closed PRs

🔥 PyTorch Ecosystem

pytorch/pytorch

Key Activity:
- [2025-03-24] Major build system cleanup: Removed pre-CXX11 ABI logic.
- Continuous integration updates for macOS (MKLDNN support).
Details:
- Standardization efforts: torch.reshape() now supports the copy kwarg (Python Array API standard).
- ONNX issues tracked regarding CompositeImplicitAutograd ops preservation.
Metrics: 1432 New PRs, 708 New Issues

pytorch/torchtitan

Key Activity:
- [2025-03-24] Script execution standardization (running as modules).
- [2025-03-06] Legal clearance for additional datasets.
Details:
- Work in progress on Contiguous Group GeMM kernels.
- Refactoring loss functions to support chunked loss.
- Discussions on supporting Context Parallelism on Turing-generation GPUs.
Metrics: 100 New PRs, 27 New Issues
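
As background on the chunked loss item above, the following is a minimal pure-Python sketch of the general idea: compute logits and cross-entropy one chunk of tokens at a time, so that the full logits matrix never has to exist at once. It is not torchtitan's code. The function names, the `project` callable (standing in for the output projection), and the list-of-lists data layout are all assumptions for illustration.

```python
import math


def cross_entropy_sum(logits, targets):
    """Sum of per-token cross-entropy for rows of logits and integer targets."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total


def chunked_cross_entropy(hidden, targets, project, chunk_size):
    """Mean cross-entropy computed chunk by chunk.

    `project` is a hypothetical callable mapping a chunk of hidden rows to
    logits rows. Only one chunk of logits is alive at any time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not targets:
        raise ValueError("targets must be non-empty")
    total = 0.0
    for i in range(0, len(targets), chunk_size):
        chunk_logits = project(hidden[i:i + chunk_size])
        total += cross_entropy_sum(chunk_logits, targets[i:i + chunk_size])
    # Divide by the global token count so the result matches the unchunked mean.
    return total / len(targets)
```

The design point is the final division: per-chunk sums are accumulated and normalized once by the total token count, so the result equals the unchunked mean up to floating-point rounding regardless of chunk size.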

pytorch/ao (Architecture Optimization)

Key Activity:
- [2025-03-10] Promoted Low Bit Optimizers out of prototype status.
Details:
- Issues tracked regarding custom CUDA op dispatch failures and static quantized model saving.
Metrics: 0 New PRs (likely a data discrepancy), 25 New Issues

🌐 Google / JAX / XLA

AI-Hypercomputer/maxtext

Key Activity:
- [2025-03-22] Updated for Gemma3 announcement.
- [2025-03-18] Added DeepSeek instruction support.
Details:
- Refactored prefill packing into a Python module.
- Implemented un-sharding of QKV on the head dimension.
- Addressed bugs in Mixture-of-Experts (MoE) load balancing loss reporting.
Metrics: 171 New PRs, 6 New Issues
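
For readers unfamiliar with the MoE load balancing loss mentioned above, one common formulation is the Switch Transformer-style auxiliary loss: the number of experts times the sum, over experts, of the fraction of tokens routed to that expert multiplied by the mean router probability for that expert. The sketch below assumes top-1 routing and plain Python lists. Whether MaxText uses exactly this form is not stated in this report, and the function name and signature are hypothetical.

```python
def load_balancing_loss(router_probs, num_experts):
    """Switch Transformer-style auxiliary loss under top-1 routing.

    `router_probs` is a list of per-token probability lists, one entry per
    expert. This is an illustrative sketch, not MaxText's implementation.
    """
    num_tokens = len(router_probs)
    if num_tokens == 0:
        raise ValueError("router_probs must be non-empty")
    counts = [0] * num_experts       # tokens dispatched to each expert
    prob_sums = [0.0] * num_experts  # summed router probability per expert
    for probs in router_probs:
        top = max(range(num_experts), key=lambda e: probs[e])
        counts[top] += 1
        for e in range(num_experts):
            prob_sums[e] += probs[e]
    return num_experts * sum(
        (counts[e] / num_tokens) * (prob_sums[e] / num_tokens)
        for e in range(num_experts)
    )
```

With two experts, evenly split routing such as [[0.9, 0.1], [0.1, 0.9]] gives a loss of about 1.0, while sending both tokens to the same expert with [[0.9, 0.1], [0.9, 0.1]] gives about 1.8, so a higher reported value indicates less balanced routing.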

openxla/xla

Key Activity:
- [2025-03-10] Documentation cleanup.
Details:
- Fixes for TensorFlow GPU builds.
- Autotuning improvements to centralize entry version updates.
- Issues regarding tanhf optimization on aarch64 (SVE implementation).
Metrics: 1132 New PRs, 14 New Issues

⚙️ GenAI Infrastructure & Compilers

volcengine/verl (RLHF)

Key Activity:
- 🚨 [2025-03-30] RELEASE v0.3.0.post0:
  - AMD Support: Added support for AMD GPUs via vLLM and FSDP backends.
  - New Algorithms: PRIME, RLOO, ReMax, and FIRE sampling.
  - SGLang: Integration available for preview.
  - Models: Support for Qwen2.5-VL and GRPO.
Metrics: 0 New PRs (release-focused), 0 New Issues

xdit-project/xDiT (Diffusion Transformers)

Key Activity:
- 🚨 [2025-03-20] RELEASE v0.4.3:
  - AMD Support: PR #477 explicitly added AMD GPU support.
  - Features: Added SDXL CFG parallel support, TeaCache, and FBCache.
- [2025-03-26] v0.4.3.post1 released with minor bugfixes.
Details:
- Resolved issues regarding dit_parallel_size and parallel_world_size mismatches.
Metrics: 14 New PRs, 16 New Issues

deepseek-ai/DeepEP

Key Activity:
- [2025-03-27] Strategic Shift: Removed NVLink low-latency plan.
- [2025-03-10] Added BF16 support for low-latency kernels.
Details:
- Investigating deadlocks on H20 GPUs and multi-node A100 setups.
Metrics: 7 New PRs, 65 New Issues

triton-lang/triton

Key Activity:
- [2025-03-28] Build system updates (Pin cmake < 4).
Details:
- Backend updates to newer LLVM project commits.
- Critical issue: Loading from TMA descriptor hangs.
Metrics: 215 New PRs, 55 New Issues

huggingface/transformers

Key Activity:
- [2025-03-21] Installation documentation updates.
Details:
- Added fast image processor for ZoeDepth.
- Fixes for Trainer data parallelism crashing on 1-dimensional tensors.
- Addressed warnings when loading DeepSeek-V3.
Metrics: 461 New PRs, 205 New Issues