GitHub Monthly Report: 2025-03-01 to 2025-03-31
📅 Engineering Report (2025-03-01 - 2025-03-31)
🚀 Executive Summary
March 2025 was characterized by significant downstream adoption of AMD hardware in major third-party training and inference frameworks, alongside aggressive innovation in the custom kernel language space. TileLang emerged as a rapidly maturing competitor to Triton with its v0.1.3 release, adding kernels for recent attention techniques such as NSA (Native Sparse Attention) decode and FlashAttention3 backward.
For the AMD ecosystem, the highlight is the explicit integration of ROCm support into Volcengine’s Verl (RLHF framework) and xDiT (Diffusion Transformer inference), signalling that the “software gap” is closing in higher-level application layers.
AMD Related Updates
- Adoption in RLHF: 🚨 Volcengine/verl v0.3.0 officially added AMD support for vLLM and FSDP backends, providing a “getting started” guide for ROCm users.
- Adoption in Diffusion: 🚨 xDiT v0.4.3 merged a PR specifically adding AMD GPU support, enabling parallel inference for Diffusion Transformers on ROCm.
- Tooling Maturity: ROCm/MAD released updated Docker containers for Megatron-LM training (v25.4) and JAX training (MaxText), streamlining the onboarding process for LLM training on AMD.
- CI/CD Visibility: PyTorch/ao (Architecture Optimization) is actively refining its test suite for ROCm, evidenced by PRs explicitly managing test skips for quantization, ensuring cleaner CI signals.
Competitive Analysis
- Kernel Language Wars: 🚨 TileLang v0.1.3 was released with significant features: FlashAttention3 Backward, FP8 GEMM, and NSA (Native Sparse Attention) Decode. This indicates a rapid maturation of non-Triton kernel languages aiming for high-performance custom ops.
- DeepSeek Infrastructure: DeepEP (DeepSeek’s expert parallelism lib) is shifting focus, notably removing the “NVLink low-latency plan” while optimizing BF16 kernels, suggesting a strategic pivot or optimization of their specific cluster topology.
- NVIDIA Ecosystem: DeepSpeed is addressing build compatibility with CUDA 12.6, keeping it current with recent NVIDIA toolkit releases.
- Alternative Hardware: Facebookresearch/xformers saw issue reports regarding building on Huawei Ascend NPU, highlighting community interest in diversifying hardware backends beyond NVIDIA and AMD.
📂 Category Updates
🟥 AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-03-31] Refactored shell scripts.
- [2025-03-25] Added support for MTP (Multi-Token Prediction) in docs.
- Details:
- [2025-03-16] Core documentation updates regarding Primus architecture.
- Metrics: 12 PRs, 0 Issues
AMD-AGI/TraceLens
- Key Activity:
- [2025-03-25] Documentation updates for v0.3 release.
- Details:
- [Highlight] Added an alternative `subtract_intervals` function optimized for scalability, improving trace analysis performance on large datasets.
- Metrics: 25 PRs, 0 Issues
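To make the `subtract_intervals` highlight concrete, here is a minimal sketch of interval subtraction using a sorted sweep. It is not TraceLens's implementation: the signature, the half-open `(start, end)` tuple representation, and the `merge_intervals` helper are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # assumed half-open [start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(base: List[Interval], cut: List[Interval]) -> List[Interval]:
    """Return the parts of `base` not covered by any interval in `cut`."""
    base_m = merge_intervals(base)
    cut_m = merge_intervals(cut)
    result: List[Interval] = []
    j = 0  # first cut interval that can still affect the current base interval
    for start, end in base_m:
        cursor = start
        # Cut intervals ending at or before `start` cannot affect this or later bases.
        while j < len(cut_m) and cut_m[j][1] <= cursor:
            j += 1
        k = j
        while k < len(cut_m) and cut_m[k][0] < end:
            c_start, c_end = cut_m[k]
            if c_start > cursor:
                result.append((cursor, c_start))  # uncovered gap before this cut
            cursor = max(cursor, c_end)
            if cursor >= end:
                break
            k += 1
        if cursor < end:
            result.append((cursor, end))  # uncovered tail of the base interval
    return result
```

For example, subtracting `[(2, 4), (8, 22), (25, 26)]` from `[(0, 10), (20, 30)]` leaves `[(0, 2), (4, 8), (22, 25), (26, 30)]`. Sorting both lists once and sweeping with two pointers avoids comparing every base interval against every cut interval, which is the kind of change that usually matters on large traces.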
ROCm/ROCm
- Key Activity:
- High volume of maintenance and issue tracking.
- Details:
- [Highlight] ADDED: Ninja build generation for 12 components to speed up CI/Builds.
- [Issue] Investigating runtime crashes after system suspension.
- Metrics: 61 PRs, 46 Issues (High activity indicates active stabilization)
ROCm/MAD
- Key Activity:
- [2025-03-12] Unified vLLM docker with v0.7.3.
- Details:
- [Highlight] Added README for `jax-training:maxtext-v25.4`.
- [Highlight] Added `megatron-lm` training docker v25.4.
- Metrics: 6 PRs, 0 Issues
🔥 PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- [2025-03-24] Build System Change: Removed pre-CXX11 ABI logic from build scripts.
- Details:
- [Highlight] New Issue regarding `torch.reshape()` supporting the `copy` kwarg for Array API standard compliance.
- [Highlight] Build macOS CI with MKLDNN.
- Metrics: 1432 PRs, 708 Issues (Massive maintenance volume)
pytorch/torchtitan
- Key Activity:
- [2025-03-24] Script execution improvements (running as modules).
- Details:
- [Highlight] WIP PR for Contiguous Group GeMM kernels.
- [Highlight] Refactoring loss functions to support chunked loss.
- Metrics: 100 PRs, 27 Issues
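The chunked-loss refactor above is not described in detail here, so the following is only a generic sketch of the idea, not torchtitan's code. It computes the mean cross-entropy one slice of tokens at a time, so only one chunk's intermediate values are live at once, and the result matches the unchunked mean. The function names and the plain-list data layout are assumptions made for illustration.

```python
import math
from typing import Sequence


def cross_entropy_sum(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Sum of per-token cross-entropy losses, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
    return total


def chunked_cross_entropy(
    logits: Sequence[Sequence[float]], targets: Sequence[int], chunk_size: int
) -> float:
    """Mean cross-entropy, accumulated one chunk of tokens at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        stop = start + chunk_size
        # Only this slice is processed per step; sums are combined afterwards.
        total += cross_entropy_sum(logits[start:stop], targets[start:stop])
    return total / len(targets)
```

Accumulating a sum and dividing once by the total token count keeps the result independent of `chunk_size`, including when the last chunk is shorter than the others.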
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-03-10] Promoted “Low Bit Optim” out of prototype status.
- Details:
- [Highlight] ROCm: PR merged to skip `test_galore_quant.py` specifically for ROCm (CI stabilization).
- [Highlight] Added `PartialLinear` module for structured sparsity.
- Metrics: 154 PRs, 25 Issues
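As an illustration of what a platform-conditional test skip can look like, here is a minimal sketch using the standard `unittest.skipIf` decorator. It is not the change made in the pytorch/ao PR: the test class, the `quantize_int8` helper, and the `IS_ROCM` flag are hypothetical. The sketch assumes that a ROCm build of PyTorch can be detected by `torch.version.hip` being set, and it falls back to "not ROCm" when PyTorch is not installed.

```python
import unittest

try:
    import torch

    # Assumption: ROCm builds of PyTorch set torch.version.hip to a version string.
    IS_ROCM = getattr(torch.version, "hip", None) is not None
except ImportError:
    IS_ROCM = False


def quantize_int8(values):
    """Hypothetical helper: symmetric int8 quantization of a list of floats."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


class QuantRoundTripTest(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "skipped on ROCm until the kernel path is supported")
    def test_roundtrip_error_is_bounded(self):
        values = [0.5, -1.25, 3.0]
        quantized, scale = quantize_int8(values)
        for original, q in zip(values, quantized):
            # Rounding to the nearest step loses at most half a step.
            self.assertLessEqual(abs(original - q * scale), scale / 2 + 1e-12)
```

A skipped test is reported as skipped rather than failed, which is what keeps the CI signal clean on platforms where the test is known not to apply.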
⚡ JAX & XLA Ecosystem
jax-ml/jax
- Key Activity:
- Steady maintenance flow.
- Details:
- [Highlight] `__jax_array__` support added to `jnp.reshape`, `jnp.transpose`.
- Metrics: 617 PRs, 0 Issues
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-03-22] Updated Readme with Gemma3 announcement.
- [2025-03-18] Added DeepSeek instruction support.
- Details:
- [Highlight] PR to unshard QKV on the head dimension.
- [Highlight] Refactoring Prefill Packing into a Python module.
- Metrics: 171 PRs, 6 Issues
openxla/xla
- Key Activity:
- [2025-03-10] Documentation cleanup.
- Details:
- [Highlight] Fixes for TensorFlow GPU builds.
- [Highlight] Issue regarding `tanhf` vectorization on AArch64 (ARM).
- Metrics: 1132 PRs, 14 Issues
🛠️ Kernel Languages & Compilers
tile-ai/tilelang
- Key Activity:
- 🚨 RELEASE: v0.1.3 (2025-03-23)
- Details:
- New Kernels: NSA (Native Sparse Attention) Decode, FlashAttention3 Backward, FP8 GEMM, DeepGEMM.
- Features: Multi-Threads Compilation for Fast Auto Tuning, CPU JIT with backend ctypes, TMA Store Synchronization.
- Fixes: Layout conflict handling for GQA decoding.
- Metrics: 132 PRs, 48 Issues (Explosive growth in feature set)
triton-lang/triton
- Key Activity:
- [2025-03-28] Build system updates (Pin cmake < 4).
- Details:
- [Highlight] Update backend to newer LLVM project commit.
- [Issue] Reports of loading from TMA descriptor hanging.
- Metrics: 215 PRs, 55 Issues
🤖 Inference & Training Frameworks
volcengine/verl (RLHF)
- Key Activity:
- 🚨 RELEASE: v0.3.0.post0 (2025-03-30)
- Details:
- AMD Support: Added support for AMD GPUs via vLLM and FSDP backend.
- New Algos: PRIME, RLOO, Remax, and Vision Language Reasoning with Qwen2.5-VL.
- Engine: SGLang integration available for preview.
- Metrics: 0 PRs (post-release), 0 Issues tracked in snapshot
xdit-project/xDiT
- Key Activity:
- 🚨 RELEASE: v0.4.3 (2025-03-20) & v0.4.2 (2025-03-03)
- Details:
- [Highlight] AMD GPU Support added by contributor @jammm.
- [Highlight] Added TeaCache and FBCache support.
- [Highlight] SDXL support for CFG parallel only.
- Metrics: 14 PRs, 16 Issues
deepspeedai/DeepSpeed
- Key Activity:
- [2025-03-25] Documentation updates regarding AutoTP.
- Details:
- [Highlight] Fixes for compiling DeepSpeed with CUDA 12.6.
- [Highlight] Updates to BF16Optimizer and Stage2 for new Torch grad hook API.
- Metrics: 47 PRs, 47 Issues
deepseek-ai/DeepEP
- Key Activity:
- Focus on low-latency kernel optimization.
- Details:
- [2025-03-27] Removed NVLink low-latency plan.
- [2025-03-10] Added support for BF16 for low-latency kernels.
- Metrics: 7 PRs, 65 Issues
huggingface/transformers
- Key Activity:
- Routine maintenance and model updates.
- Details:
- [Highlight] Issue: Warnings when loading Deepseek-V3 without custom code.
- [Highlight] Added fast image processor for ZoeDepth.
- Metrics: 461 PRs, 205 Issues