GitHub Monthly Report: 2025-03-01 to 2025-03-31
📅 Engineering Report (2025-03-01 - 2025-03-31)
🚀 Executive Summary
March 2025 was a pivotal month for ecosystem expansion, particularly regarding third-party adoption of AMD hardware. While the internal ROCm teams focused heavily on CI/CD refactoring and providing updated Docker environments for vLLM and Megatron-LM, the external ecosystem saw significant integration. Both volcengine/verl (RLHF framework) and xdit-project/xDiT (DiT inference) released major versions explicitly adding AMD GPU support.
In the kernel/compiler space, TileLang shipped v0.1.3, adding FlashAttention3 Backward and FP8 GEMM kernels and positioning itself as a rapidly maturing TDSL.
AMD Related Updates
- Third-Party Adoption: volcengine/verl released v0.3.0 (tagged v0.3.0.post0) with official AMD support for vLLM and FSDP backends. In the same month, xdit-project/xDiT v0.4.3 added support for AMD GPUs.
- Infrastructure & Tooling: ROCm/MAD unified their vLLM Docker images and added support for Megatron-LM training (v25.4).
- Stability: The ROCm/ROCm core repository is tracking a critical issue regarding runtime crashes after system suspension, alongside significant CI improvements (Ninja build generation).
Competitive Analysis
- Kernel Compilation: TileLang (v0.1.3) is moving rapidly, introducing FlashAttention3 Backward, DeepGEMM, and FP8 GEMM kernels. This indicates growing competition in the custom kernel generation space that Triton currently dominates.
- Communication Libraries: DeepSeek-AI/DeepEP removed their “NVLink low-latency plan” from documentation while adding BF16 support for low-latency kernels. This suggests the project may be shifting its optimization focus away from features tied to a proprietary interconnect.
- PyTorch Optimization: PyTorch/AO promoted “Low Bit Optim” out of prototype status, signaling maturity in quantization workflows.
📂 Category Updates
AMD Ecosystem (ROCm & Tools)
🚨 REPO: xdit-project/xDiT
- Key Activity:
- [2025-03-20] RELEASE: v0.4.3 - Added AMD GPU support.
- [2025-03-03] RELEASE: v0.4.2 - Added Tensor Parallelism for Step-Video-T2V.
- Details:
- The v0.4.3 release includes a specific PR (#477) adding AMD GPU support, broadening the hardware compatibility for this Diffusion Transformer inference engine.
- Added support for SDXL (CFG parallel only) and Ray-based parallel inference.
- Metrics: 0 New Issues, 0 New PRs (Data reflects release notes)
REPO: ROCm/ROCm
- Key Activity:
- [High Volume] Significant CI/CD overhaul and issue tracking.
- Details:
- New Issue: Investigating ROCm runtime crashes after system suspension.
- CI Updates: Added Ninja build generation for 12 components and added a dependency on rocprof-sdk.
- Metrics: 61 New PRs, 46 New Issues
REPO: ROCm/MAD
- Key Activity:
- [2025-03-12] Docker container unification.
- Details:
- Unified vLLM docker with v0.7.3.
- Added README for JAX training (maxtext-v25.4) and a new Megatron-LM training docker (v25.4).
- Metrics: 6 New PRs, 0 New Issues
REPO: AMD-AGI/Primus
- Key Activity:
- [2025-03-31] Documentation and Script refactoring.
- Details:
- Refactored shell scripts and added data preprocessing support.
- Metrics: 12 New PRs, 0 New Issues
REPO: AMD-AGI/TraceLens
- Key Activity:
- [2025-03-25] Documentation updates for v0.3.
- Details:
- Optimized the subtract_intervals function for scalability.
- Metrics: 25 New PRs, 0 New Issues
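For readers unfamiliar with interval subtraction on trace timelines, the following is a minimal sketch of one scalable approach: sort and merge both interval lists, then sweep them once. It is not TraceLens's implementation; the report does not describe the actual signature or algorithm, so the names subtract_intervals_sketch and merge_intervals, and the (start, end) tuple representation, are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (start, end) representation


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # skip empty or inverted intervals
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals_sketch(
    base: List[Interval], cut: List[Interval]
) -> List[Interval]:
    """Return the parts of `base` that are not covered by `cut`.

    Illustrative only: after sorting, a single sweep visits each
    interval a small constant number of times instead of comparing
    every base interval against every cut interval.
    """
    base = merge_intervals(base)
    cut = merge_intervals(cut)
    result: List[Interval] = []
    j = 0  # first cut interval that can still affect the current base
    for start, end in base:
        cursor = start
        # Drop cut intervals that end before this base interval begins.
        while j < len(cut) and cut[j][1] <= cursor:
            j += 1
        k = j
        # Walk the cut intervals that start inside this base interval.
        while k < len(cut) and cut[k][0] < end:
            if cut[k][0] > cursor:
                result.append((cursor, cut[k][0]))  # uncovered gap
            cursor = max(cursor, cut[k][1])
            k += 1
        if cursor < end:
            result.append((cursor, end))  # uncovered tail
    return result
```

With base intervals [(0, 10), (20, 30)] and cut intervals [(2, 4), (8, 22), (25, 26)], the sketch returns [(0, 2), (4, 8), (22, 25), (26, 30)].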
PyTorch Ecosystem
REPO: pytorch/pytorch
- Key Activity:
- [Massive Scale] 1400+ new PRs indicate high-velocity maintenance.
- Details:
- CI: Build MacOS CI with MKLDNN.
- New Features: Discussion on supporting a copy kwarg in torch.reshape to align with the Python array API.
- Issues: ONNX decomposition failing to preserve custom CompositeImplicitAutograd ops.
- Metrics: 1432 New PRs, 708 New Issues
REPO: pytorch/torchtitan
- Key Activity:
- [2025-03-24] Script module execution updates.
- Details:
- Work in progress on Contiguous Group GeMM kernels.
- Discussion/Issue regarding Context Parallel support on Turing GPUs.
- Metrics: 100 New PRs, 27 New Issues
REPO: pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-03-10] Feature Promotion.
- Details:
- Promoted Low Bit Optim out of prototype.
- Moved torchao/_models to benchmarks/_models, cleaning up the core library.
- Metrics: 0 New PRs, 25 New Issues
RLHF & Training Frameworks
🚨 REPO: volcengine/verl
- Key Activity:
- [2025-03-30] RELEASE: v0.3.0.post0 - AMD Support Added.
- Details:
- AMD Support: Now available for vLLM and FSDP backends.
- New Algorithms: Added PRIME, RLOO, remax, and FIRE sampling.
- Engine: SGLang integration available for preview. Upgraded Megatron (v0.11) and vLLM (v0.8.2).
- Metrics: 0 New PRs, 0 New Issues (Data reflects release notes)
REPO: AI-Hypercomputer/maxtext
- Key Activity:
- [2025-03-22] Model Support updates.
- Details:
- Added DeepSeek instruction support.
- Updated README to reflect Gemma3 announcements.
- Metrics: 0 New PRs, 0 New Issues
Compilers & Kernels
🚨 REPO: tile-ai/tilelang
- Key Activity:
- [2025-03-23] RELEASE: v0.1.3 - Major feature expansion.
- Details:
- New Kernels: FlashAttention3 Backward, DeepGEMM, FP8 GEMM, GQA Backward.
- Features: Async Pipeline inference, Auto-tuning config performance trace, TMA Store Synchronization.
- Refactor: Phased out mandatory LLVM dependency (now optional).
- Metrics: 0 New PRs, 0 New Issues (Data reflects release notes)
REPO: triton-lang/triton
- Key Activity:
- [2025-03-28] Build system constraints.
- Details:
- Pinned cmake < 4 to prevent build issues.
- Metrics: 0 New PRs, 0 New Issues
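To make the meaning of an exclusive upper bound such as cmake < 4 concrete, here is a small sketch that checks a version string against that bound. How Triton actually expresses the pin is not stated in this report; the helper names parse_version and satisfies_pin are hypothetical, and the sketch assumes plain dotted numeric versions (pre-release suffixes are not handled).

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.31.6' into (3, 31, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies_pin(installed: str, upper_bound: str = "4") -> bool:
    """Return True when `installed` is strictly below `upper_bound`.

    Hypothetical helper: comparing integer tuples avoids the string
    comparison pitfall where '3.4' would sort after '3.31'.
    """
    return parse_version(installed) < parse_version(upper_bound)
```

Under this check, any 3.x release passes, while 4.0.0 and later are rejected because the bound is exclusive.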
REPO: deepseek-ai/DeepEP
- Key Activity:
- [2025-03-27] Strategic Documentation update.
- Details:
- Removed NVLink low-latency plan from roadmap/docs.
- Added BF16 support for low-latency kernels.
- Metrics: 0 New PRs, 0 New Issues