📅 Engineering Report (2025-03-01 - 2025-03-31)

🚀 Executive Summary

March 2025 was characterized by significant downstream adoption of AMD hardware in major third-party training and inference frameworks, alongside rapid innovation in custom kernel languages. TileLang emerged as a fast-maturing alternative to Triton with its v0.1.3 release, which added kernels for recent techniques such as NSA (Native Sparse Attention) decode and FlashAttention3 backward.

For the AMD ecosystem, the highlight is explicit ROCm support landing in Volcengine’s verl (RLHF framework) and xDiT (Diffusion Transformer inference), a sign that the “software gap” is narrowing at the higher-level application layers.

Adoption in RLHF: 🚨 Volcengine/verl v0.3.0 officially added AMD support for vLLM and FSDP backends, providing a “getting started” guide for ROCm users.
Adoption in Diffusion: 🚨 xDiT v0.4.3 merged a PR specifically adding AMD GPU support, enabling parallel inference for Diffusion Transformers on ROCm.
Tooling Maturity: ROCm/MAD released updated Docker containers for Megatron-LM training (v25.4) and JAX training (MaxText), streamlining the onboarding process for LLM training on AMD.
CI/CD Visibility: pytorch/ao (Architecture Optimization) is tuning its test suite for ROCm: a merged PR skips test_galore_quant.py on ROCm, which gives cleaner CI signals.

Competitive Analysis

Kernel Language Wars: 🚨 TileLang v0.1.3 was released with significant features: FlashAttention3 Backward, FP8 GEMM, and NSA (Native Sparse Attention) Decode. This indicates a rapid maturation of non-Triton kernel languages aiming for high-performance custom ops.
DeepSeek Infrastructure: DeepEP (DeepSeek’s expert-parallelism library) removed its “NVLink low-latency plan” while adding BF16 support to its low-latency kernels, which may reflect a strategic pivot or tuning for their specific cluster topology.
NVIDIA Ecosystem: DeepSpeed landed fixes for compiling with CUDA 12.6, keeping it buildable against recent NVIDIA toolkits.
Alternative Hardware: facebookresearch/xformers saw issue reports about building on Huawei Ascend NPU, highlighting community interest in hardware backends beyond NVIDIA and AMD.

📂 Category Updates

🟥 AMD Ecosystem

AMD-AGI/Primus

Key Activity:
- [2025-03-31] Refactored shell scripts.
- [2025-03-25] Added support for MTP (Multi-Token Prediction) in docs.
Details:
- [2025-03-16] Core documentation updates regarding Primus architecture.
Metrics: 12 PRs 0 Issues

AMD-AGI/TraceLens

Key Activity:
- [2025-03-25] Documentation updates for v0.3 release.
Details:
- [Highlight] Added an alternative subtract_intervals function optimized for scalability, improving trace analysis performance on large datasets.
Metrics: 25 PRs 0 Issues
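
To make the interval-subtraction idea concrete, here is a minimal sketch of one way such a function could work: a sweep over two sorted interval lists. This is an illustrative assumption, not TraceLens's actual implementation; the signature, the half-open (start, end) convention, and the requirement that the first list be non-overlapping are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # half-open [start, end); hypothetical convention


def subtract_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of the intervals in `a` not covered by any interval in `b`.

    Assumes the intervals in `a` do not overlap each other. Both lists are
    sorted first, then swept together instead of comparing every pair.
    """
    a = sorted(a)
    b = sorted(b)
    result: List[Interval] = []
    j = 0  # first interval in `b` that can still affect the current one in `a`
    for start, end in a:
        cursor = start
        # Drop subtrahend intervals that end before this interval begins.
        while j < len(b) and b[j][1] <= cursor:
            j += 1
        k = j
        while k < len(b) and b[k][0] < end:
            if b[k][0] > cursor:
                # Gap between the covered region so far and the next cut.
                result.append((cursor, b[k][0]))
            cursor = max(cursor, b[k][1])
            k += 1
        if cursor < end:
            result.append((cursor, end))
    return result
```

For example, subtracting [(2, 3), (5, 7)] from [(0, 10)] leaves [(0, 2), (3, 5), (7, 10)]. On a trace, `a` could be GPU busy intervals and `b` the intervals already attributed to some category.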

ROCm/ROCm

Key Activity:
- High volume of maintenance and issue tracking.
Details:
- [Highlight] ADDED: Ninja build generation for 12 components to speed up CI/Builds.
- [Issue] Investigating runtime crashes after system suspension.
Metrics: 61 PRs 46 Issues (High activity indicates active stabilization)

ROCm/MAD

Key Activity:
- [2025-03-12] Unified vLLM docker with v0.7.3.
Details:
- [Highlight] Added README for jax-training:maxtext-v25.4.
- [Highlight] Added megatron-lm training docker v25.4.
Metrics: 6 PRs 0 Issues

🔥 PyTorch Ecosystem

pytorch/pytorch

Key Activity:
- [2025-03-24] Build System Change: Removed pre-CXX11 ABI logic from build scripts.
Details:
- [Highlight] New Issue regarding torch.reshape() supporting the copy kwarg for Array API standard compliance.
- [Highlight] Build macOS CI with MKLDNN.
Metrics: 1432 PRs 708 Issues (Massive maintenance volume)

pytorch/torchtitan

Key Activity:
- [2025-03-24] Script execution improvements (running as modules).
Details:
- [Highlight] WIP PR for Contiguous Group GeMM kernels.
- [Highlight] Refactoring loss functions to support chunked loss.
Metrics: 100 PRs 27 Issues
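
As a rough illustration of what "chunked loss" generally means, the sketch below accumulates a cross-entropy sum chunk by chunk and checks that the result matches the unchunked mean. It is a generic pure-Python example under stated assumptions, not torchtitan's design: the function names are hypothetical, and a real implementation would operate on tensors, typically so that logits for only one chunk are held in memory at a time.

```python
import math
from typing import Sequence


def cross_entropy_sum(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Summed negative log-likelihood over a batch of per-token logits."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total


def chunked_mean_loss(
    logits: Sequence[Sequence[float]], targets: Sequence[int], chunk_size: int
) -> float:
    """Mean loss accumulated one chunk at a time (hypothetical helper)."""
    total, count = 0.0, 0
    for start in range(0, len(targets), chunk_size):
        stop = start + chunk_size
        total += cross_entropy_sum(logits[start:stop], targets[start:stop])
        count += len(targets[start:stop])
    return total / count


logits = [[1.0, 2.0, 3.0], [0.5, 0.1, 0.2], [2.0, 2.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 0.5]]
targets = [2, 0, 1, 1, 0]
full_mean = cross_entropy_sum(logits, targets) / len(targets)
chunked_mean = chunked_mean_loss(logits, targets, chunk_size=2)
```

The key point is that the sum and the token count are accumulated separately and divided once at the end, so uneven final chunks do not skew the mean.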

pytorch/ao (Architecture Optimization)

Key Activity:
- [2025-03-10] Promoted “Low Bit Optim” out of prototype status.
Details:
- [Highlight] ROCm: PR merged to skip test_galore_quant.py specifically for ROCm (CI stabilization).
- [Highlight] Added PartialLinear module for structured sparsity.
Metrics: 154 PRs 25 Issues
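
For readers unfamiliar with platform-conditional test skips, the pattern can be sketched as follows. This is a minimal sketch, not the actual torchao change: the helper `is_rocm_build` and the test class are hypothetical, and the only PyTorch detail relied on is that `torch.version.hip` is set on ROCm builds and is `None` otherwise.

```python
import unittest


def is_rocm_build() -> bool:
    """Return True if the installed PyTorch was built for ROCm (HIP)."""
    try:
        import torch
    except ImportError:
        return False  # no PyTorch installed, so this is not a ROCm build
    return getattr(torch.version, "hip", None) is not None


@unittest.skipIf(is_rocm_build(), "skipped on ROCm pending support")
class GaloreQuantSmokeTest(unittest.TestCase):  # hypothetical test case
    def test_placeholder(self):
        # Stand-in for a real quantization check.
        self.assertEqual(1 + 1, 2)
```

Skipping at the class level keeps the test visible in reports as "skipped" rather than silently removing it, which makes it easier to re-enable once ROCm support lands.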

⚡ JAX & XLA Ecosystem

jax-ml/jax

Key Activity:
- Steady maintenance flow.
Details:
- [Highlight] __jax_array__ support added to jnp.reshape, jnp.transpose.
Metrics: 617 PRs 0 Issues

AI-Hypercomputer/maxtext

Key Activity:
- [2025-03-22] Updated Readme with Gemma3 announcement.
- [2025-03-18] Added DeepSeek instruction support.
Details:
- [Highlight] PR to unshard QKV on the head dimension.
- [Highlight] Refactoring Prefill Packing into a Python module.
Metrics: 171 PRs 6 Issues

openxla/xla

Key Activity:
- [2025-03-10] Documentation cleanup.
Details:
- [Highlight] Fixes for TensorFlow GPU builds.
- [Highlight] Issue regarding tanhf vectorization on AArch64 (ARM).
Metrics: 1132 PRs 14 Issues

🛠️ Kernel Languages & Compilers

tile-ai/tilelang

Key Activity:
- 🚨 RELEASE: v0.1.3 (2025-03-23)
Details:
- New Kernels: NSA (Native Sparse Attention) Decode, FlashAttention3 Backward, FP8 GEMM, DeepGEMM.
- Features: Multi-threaded compilation for faster auto-tuning, CPU JIT via a ctypes backend, TMA store synchronization.
- Fixes: Layout conflict handling for GQA decoding.
Metrics: 132 PRs 48 Issues (Explosive growth in feature set)

triton-lang/triton

Key Activity:
- [2025-03-28] Build system updates (Pin cmake < 4).
Details:
- [Highlight] Update backend to newer LLVM project commit.
- [Issue] Reports of loading from TMA descriptor hanging.
Metrics: 215 PRs 55 Issues

🤖 Inference & Training Frameworks

volcengine/verl (RLHF)

Key Activity:
- 🚨 RELEASE: v0.3.0.post0 (2025-03-30)
Details:
- AMD Support: Added support for AMD GPUs via vLLM and FSDP backend.
- New Algos: PRIME, RLOO, ReMax, and Vision Language Reasoning with Qwen2.5-VL.
- Engine: SGLang integration available for preview.
Metrics: 0 PRs (Post-release) 0 Issues tracked in snapshot

xdit-project/xDiT

Key Activity:
- 🚨 RELEASE: v0.4.3 (2025-03-20) & v0.4.2 (2025-03-03)
Details:
- [Highlight] AMD GPU Support added by contributor @jammm.
- [Highlight] Added TeaCache and FBCache support.
- [Highlight] SDXL support for CFG parallel only.
Metrics: 14 PRs 16 Issues

deepspeedai/DeepSpeed

Key Activity:
- [2025-03-25] Documentation updates regarding AutoTP.
Details:
- [Highlight] Fixes for compiling DeepSpeed with CUDA 12.6.
- [Highlight] Updates to BF16Optimizer and Stage2 for new Torch grad hook API.
Metrics: 47 PRs 47 Issues

deepseek-ai/DeepEP

Key Activity:
- Focus on low-latency kernel optimization.
Details:
- [2025-03-27] Removed NVLink low-latency plan.
- [2025-03-10] Added support for BF16 for low-latency kernels.
Metrics: 7 PRs 65 Issues

huggingface/transformers

Key Activity:
- Routine maintenance and model updates.
Details:
- [Highlight] Issue: Warnings when loading Deepseek-V3 without custom code.
- [Highlight] Added fast image processor for ZoeDepth.
Metrics: 461 PRs 205 Issues