GitHub Monthly Report: 2025-03-01 to 2025-03-31
📅 Engineering Report (2025-03-01 - 2025-03-31)
🚀 Executive Summary
March 2025 was a pivotal month for cross-ecosystem compatibility. The most significant trend was the expansion of AMD GPU support into high-profile third-party libraries: xDiT (DiT inference) and Verl (RLHF training) both shipped releases that explicitly add AMD/ROCm support.
Within the ROCm ecosystem, the focus remains on enabling heavy workloads, with ROCm/MAD releasing containers for JAX (MaxText) and Megatron-LM. Meanwhile, PyTorch and XLA continued heavy infrastructure refactoring, with PyTorch modernizing build scripts and XLA refining GPU paths for TensorFlow.
AMD-Related Updates
- Third-Party Adoption (Critical):
- 🚨 xDiT v0.4.3 officially added AMD GPU support, enabling high-performance Diffusion Transformer inference on ROCm.
- 🚨 Verl v0.3.0 added AMD support for vLLM and FSDP backends, opening up PPO/GRPO RLHF training pipelines for AMD hardware.
- Model Enablement: ROCm/MAD introduced new docker containers for JAX training (MaxText) and Megatron-LM v25.4, targeting enterprise-grade LLM training.
- Tooling Maturity: TraceLens v0.3 and Primus saw significant refactoring for scalability and core stability, indicating a maturing profiling ecosystem.
Competitive Analysis
- JAX/Google Agility: MaxText rapidly integrated support for DeepSeek instructions and Gemma3, showing that the JAX ecosystem picks up cutting-edge model releases about as quickly as PyTorch does.
- DeepSeek Infrastructure: DeepEP (DeepSeek’s Expert Parallelism library) removed its NVLink low-latency plan while adding BF16 support for low-latency kernels, potentially signaling a shift in how it handles inter-node communication or a focus on standardizing precision.
- Compiler Wars: Triton and TileLang remain highly active. TileLang is solving complex layout issues for GQA decoding, while Triton is grappling with TMA descriptor hangs, highlighting the complexity of stabilizing H100-era features.
📂 Category Updates
🟢 AMD Ecosystem (Internal & Official)
AMD-AGI/Primus
- Key Activity:
- [2025-03-31] Refactored shell scripts for better maintainability.
- [2025-03-25] Added support for MTP (Multi-Token Prediction) and updated documentation.
- Details:
- Focus on data preprocessing pipelines and core architectural refactors.
- Metrics: 12 New PRs, 11 Closed PRs, 0 New Issues
AMD-AGI/TraceLens
- Key Activity:
- [2025-03-25] Documentation updates for v0.3.
- [2025-03-06] Documentation updates for v0.2.
- Details:
- Implementation of an alternative subtract_intervals function optimized specifically for scalability, crucial for profiling large-scale distributed runs.
- Metrics: 25 New PRs, 25 Closed PRs, 0 New Issues
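The TraceLens implementation itself is not reproduced in this report. As a rough illustration of why interval subtraction matters for scalability, the sketch below shows one common approach: merge both interval lists, then walk them once with two pointers instead of comparing every pair. The function bodies, the merge_intervals helper, and the tuple-based Interval type are assumptions made for this example, not the TraceLens API.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (start, end) representation


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty or inverted intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of `a` not covered by `b` (illustrative sketch only)."""
    a, b = merge_intervals(a), merge_intervals(b)
    result: List[Interval] = []
    j = 0  # index of the first interval in b that can still affect a
    for start, end in a:
        cursor = start
        # Drop intervals in b that end before the current interval begins.
        while j < len(b) and b[j][1] <= cursor:
            j += 1
        k = j
        # Carve out every interval in b that overlaps [start, end).
        while k < len(b) and b[k][0] < end:
            if b[k][0] > cursor:
                result.append((cursor, b[k][0]))
            cursor = max(cursor, b[k][1])
            k += 1
        if cursor < end:
            result.append((cursor, end))
    return result


if __name__ == "__main__":
    busy = [(0, 10), (20, 30)]
    comm = [(2, 4), (8, 22), (25, 26)]
    print(subtract_intervals(busy, comm))
```

In a profiling context, a call like this could answer questions such as "how much GPU-busy time was not overlapped by communication", where both lists may hold a very large number of events per rank; sorting once and sweeping keeps the cost close to linear in the number of intervals.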
ROCm/ROCm (Platform)
- Key Activity:
- Addressed runtime instability (system suspension crashes).
- Investigated consumer card support (RX 6750 GRE) for Stable Diffusion.
- Details:
- CI Updates: Added Ninja build generation for 12 components and new dependencies for rocprof-compute.
- Metrics: 61 New PRs, 46 New Issues, 64 Closed PRs (high maintainer responsiveness)
ROCm/MAD (Model Automation & Deployment)
- Key Activity:
- [2025-03-12] Unified vLLM docker with v0.7.3.
- Details:
- Added README and support for JAX-training (MaxText v25.4).
- Added Megatron-LM training docker v25.4.
- Metrics: 6 New PRs, 5 Closed PRs
🔥 PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- [2025-03-24] Major build system cleanup: Removed pre-CXX11 ABI logic.
- Continuous integration updates for macOS (MKLDNN support).
- Details:
- Standardization efforts: torch.reshape() now supports the copy kwarg (Python Array API standard).
- ONNX issues tracked regarding CompositeImplicitAutograd ops preservation.
- Metrics: 1432 New PRs, 708 New Issues
pytorch/torchtitan
- Key Activity:
- [2025-03-24] Script execution standardization (running as modules).
- [2025-03-06] Legal clearance for additional datasets.
- Details:
- Work in progress on Contiguous Group GeMM kernels.
- Refactoring loss functions to support chunked loss.
- Discussions on supporting Context Parallelism on Turing-generation GPUs.
- Metrics: 100 New PRs, 27 New Issues
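For readers unfamiliar with the chunked-loss idea mentioned above, the following is a minimal pure-Python sketch, not torchtitan's code. It assumes a token-level cross-entropy loss and shows that summing the loss chunk by chunk and dividing once at the end gives the same mean as a single pass. The function names and the chunk_size parameter are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy_sum(logits: Sequence[Sequence[float]],
                      targets: Sequence[int]) -> float:
    """Sum of per-token cross-entropy values, using a stable log-sum-exp."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total


def chunked_cross_entropy(logits: Sequence[Sequence[float]],
                          targets: Sequence[int],
                          chunk_size: int) -> float:
    """Mean cross-entropy accumulated one chunk of tokens at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0.0
    for start in range(0, len(targets), chunk_size):
        stop = start + chunk_size
        # Only this slice needs to be "live" while its loss is computed.
        total += cross_entropy_sum(logits[start:stop], targets[start:stop])
    return total / len(targets)


if __name__ == "__main__":
    logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.5, 1.5, 0.0],
              [-2.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
    targets = [0, 2, 1, 2, 1]
    full = cross_entropy_sum(logits, targets) / len(targets)
    chunked = chunked_cross_entropy(logits, targets, chunk_size=2)
    print(full, chunked)
```

In this toy version the logits already exist in full, so nothing is saved. In a real training loop the memory benefit would come from projecting hidden states to vocabulary logits inside the loop, so that only one chunk of logits is materialized at a time.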
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-03-10] Promoted Low Bit Optimizers out of prototype status.
- Details:
- Issues tracked regarding custom CUDA op dispatch failures and static quantized model saving.
- Metrics: 0 New PRs (data discrepancy likely), 25 New Issues
🌐 Google / JAX / XLA
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-03-22] Updated for Gemma3 announcement.
- [2025-03-18] Added DeepSeek instruction support.
- Details:
- Refactored prefill packing into a Python module.
- Implemented un-sharding of QKV on the head dimension.
- Addressed bugs in Mixture-of-Experts (MoE) load balancing loss reporting.
- Metrics: 171 New PRs, 6 New Issues
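As background for the MoE load balancing loss mentioned above, here is a small sketch of one widely used formulation: the auxiliary loss from the Switch Transformer paper, computed as the number of experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). This is an assumption chosen for illustration; the report does not say which formulation MaxText uses, and the function name and top-1 routing are hypothetical.

```python
from typing import Sequence


def load_balancing_loss(router_probs: Sequence[Sequence[float]]) -> float:
    """Switch-style auxiliary loss for top-1 routing (illustrative sketch).

    router_probs[t][e] is the router probability of expert e for token t.
    The value is 1.0 when routing is perfectly uniform and grows as tokens
    concentrate on fewer experts.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    token_counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for probs in router_probs:
        # Top-1 routing: each token goes to its highest-probability expert.
        chosen = max(range(num_experts), key=lambda e: probs[e])
        token_counts[chosen] += 1
        for expert, p in enumerate(probs):
            prob_sums[expert] += p
    return num_experts * sum(
        (token_counts[e] / num_tokens) * (prob_sums[e] / num_tokens)
        for e in range(num_experts)
    )


if __name__ == "__main__":
    balanced = [[0.6, 0.4], [0.4, 0.6]]
    skewed = [[0.9, 0.1], [0.8, 0.2]]
    print(load_balancing_loss(balanced), load_balancing_loss(skewed))
```

Because the balanced case sits at a known baseline of 1.0 under this formulation, a reported value that drifts away from what the routing statistics imply is a quick signal that the reporting path, rather than the router, may be at fault.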
openxla/xla
- Key Activity:
- [2025-03-10] Documentation cleanup.
- Details:
- Fixes for TensorFlow GPU builds.
- Autotuning improvements to centralize entry version updates.
- Issues regarding tanhf optimization on aarch64 (SVE implementation).
- Metrics: 1132 New PRs, 14 New Issues
⚙️ GenAI Infrastructure & Compilers
volcengine/verl (RLHF)
- Key Activity:
- 🚨 [2025-03-30] RELEASE v0.3.0.post0:
- AMD Support: Added support for AMD GPUs via vLLM and FSDP backends.
- New Algorithms: PRIME, RLOO, ReMax, and FIRE sampling.
- SGLang: Integration available for preview.
- Models: Support for Qwen2.5-VL and GRPO.
- Metrics: 0 New PRs (release focused), 0 New Issues
xdit-project/xDiT (Diffusion Transformers)
- Key Activity:
- 🚨 [2025-03-20] RELEASE v0.4.3:
- AMD Support: PR #477 explicitly added AMD GPU support.
- Features: Added SDXL CFG parallel support, TeaCache, and FBCache.
- [2025-03-26] v0.4.3.post1 released with minor bugfixes.
- Details:
- Resolved issues regarding dit_parallel_size and parallel_world_size mismatches.
- Metrics: 14 New PRs, 16 New Issues
deepseek-ai/DeepEP
- Key Activity:
- [2025-03-27] Strategic Shift: Removed NVLink low-latency plan.
- [2025-03-10] Added BF16 support for low-latency kernels.
- Details:
- Investigating deadlocks on H20 GPUs and multi-node A100 setups.
- Metrics: 7 New PRs, 65 New Issues
triton-lang/triton
- Key Activity:
- [2025-03-28] Build system updates (Pin cmake < 4).
- Details:
- Backend updates to newer LLVM project commits.
- Critical issue: Loading from TMA descriptor hangs.
- Metrics: 215 New PRs, 55 New Issues
huggingface/transformers
- Key Activity:
- [2025-03-21] Installation documentation updates.
- Details:
- Added fast image processor for ZoeDepth.
- Fixes for Trainer data parallelism crashing on 1-dimensional tensors.
- Addressed warnings when loading DeepSeek-V3.
- Metrics: 461 New PRs, 205 New Issues