GitHub Monthly Report: 2025-03-01 to 2025-03-31
March 31, 2026
Engineering Report (2025-03-01 - 2025-03-31)
Executive Summary
March 2025 saw sustained engineering activity across the AI and ML ecosystems, particularly in framework support for newer models (DeepSeek-V3, Gemma 3, Qwen2.5-VL) and in optimizations for parallelism and quantization. Core libraries kept pace with large PR queues: PyTorch and JAX each closed more than 90% as many PRs as were opened (1368 of 1432 and 566 of 617), while OpenXLA closed about 76% (861 of 1132). The most significant architectural shifts this month occurred in distributed training and inference engines: volcengine/verl and xdit-project/xDiT both shipped major feature releases that broaden cross-hardware support and add distributed caching mechanisms.
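The close rates for the core libraries can be recomputed from the per-repository Metrics lines in this report. The following is a minimal Python sketch; the helper name `close_ratio` is illustrative, and the ratio compares PRs closed in March against PRs opened in March, which are not necessarily the same PRs.

```python
# (new PRs, closed PRs) for March 2025, copied from the Metrics lines in this report.
pr_metrics = {
    "pytorch/pytorch": (1432, 1368),
    "jax-ml/jax": (617, 566),
    "openxla/xla": (1132, 861),
}


def close_ratio(new_prs: int, closed_prs: int) -> float:
    """Closed PRs as a fraction of PRs opened in the same reporting window."""
    if new_prs == 0:
        raise ValueError("no new PRs in the window; the ratio is undefined")
    return closed_prs / new_prs


ratios = {repo: close_ratio(new, closed) for repo, (new, closed) in pr_metrics.items()}

for repo, ratio in ratios.items():
    print(f"{repo}: {ratio:.1%} of incoming PR volume closed")
```

A ratio below 1.0 means the open PR backlog grew during the month, so it is a rough health signal rather than a measure of review quality.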
AMD Related Updates
- Expanded Upstream Hardware Support: Third-party frameworks added native AMD GPU support this month. The volcengine/verl v0.3.0 release officially brought AMD support to its vLLM and FSDP backends, and xdit-project/xDiT v0.4.3 merged dedicated AMD GPU support.
- ROCm & MAD Container Updates: ROCm/MAD kept its container recipes in sync with the broader ecosystem, including updates to the unified vLLM docker (v0.7.3), maxtext-v25.4 for JAX training, and megatron-lm v25.4.
- Tooling Stability: AMD's internal tools saw steady improvements. TraceLens closed 25 PRs, including scalability work such as an alternative subtract_intervals, while Primus refined data preprocessing and shell scripts. ROCm runtime engineers are tracking edge cases, including crashes after system suspension and Stable Diffusion compatibility on a specific GPU (RX 6750 GRE).
Competitive Analysis
- NVIDIA's FP8 & Optimization Push: NVIDIA/TransformerEngine is optimizing MXFP8 cast kernels and bringing MXFP8 and GroupedGEMM into its JAX path. However, users report backward-compatibility friction: PyTorch FP8 extra state saved with version 1.x fails to load in 2.x.
- DeepSeek Hardware Specifics: deepseek-ai/DeepEP removed its NVLink low-latency plan from the documentation and added BF16 support for low-latency kernels. Maintainers are triaging Hopper-specific bugs, including processes blocking on H20 nodes.
- Ecosystem Hardware Friction: Alternative hardware architectures and newer toolchains are surfacing build issues. xformers is seeing requests for Huawei Ascend/CANN build support, while DeepSpeed engineers are addressing compilation errors specific to CUDA 12.6.
Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-03-31] Refactored shell scripts.
- [2025-03-25] Added MTP support and updated the README.
- Details:
- [2025-03-31] HIGHLIGHT: New PR for data preprocessing enhancements.
- [2025-03-31] HIGHLIGHT: New PR to refactor shell scripts.
- Metrics: 12 New PRs, 11 Closed PRs; 0 New Issues, 0 Closed Issues
AMD-AGI/TraceLens
- Key Activity:
- [2025-03-25] Updated README for the v0.3 release.
- [2025-03-06] Documentation updates for v0.2.
- Details:
- [2025-03-25] HIGHLIGHT: Merged an alternative subtract_intervals function optimized for large-scale profiling scalability.
- Metrics: 25 New PRs, 25 Closed PRs; 0 New Issues, 2 Closed Issues
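To illustrate the kind of operation a subtract_intervals helper performs, here is a minimal Python sketch of interval subtraction using a sort followed by a single linear sweep. This is not the TraceLens implementation: the signature, the half-open interval convention, and the `merge_intervals` helper are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # half-open [start, end); an assumed convention


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of `a` that are not covered by any interval in `b`.

    Both inputs are merged first, then walked with two pointers, so the cost
    is dominated by sorting rather than by pairwise comparisons.
    """
    a_merged, b_merged = merge_intervals(a), merge_intervals(b)
    result: List[Interval] = []
    j = 0
    for start, end in a_merged:
        cursor = start
        # Permanently skip b-intervals that end before this a-interval begins.
        while j < len(b_merged) and b_merged[j][1] <= cursor:
            j += 1
        k = j
        while k < len(b_merged) and b_merged[k][0] < end:
            b_start, b_end = b_merged[k]
            if b_start > cursor:
                result.append((cursor, b_start))  # uncovered gap before b
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
            k += 1
        if cursor < end:
            result.append((cursor, end))  # uncovered tail of this a-interval
    return result


print(subtract_intervals([(0, 10)], [(2, 4), (6, 8)]))
```

Merging both inputs up front keeps the sweep linear in the number of intervals, which is one plausible way to make such a routine scale to large traces.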
ROCm/ROCm
- Key Activity:
- [2025-03-31] Ongoing triage of core runtime and specific GPU availability issues.
- Details:
- [2025-03-31] HIGHLIGHT: Ex CI added dependency on rocprof-sdk for rocprof-compute.
- [2025-03-31] HIGHLIGHT: Ex CI added Ninja build generator for 12 components.
- [2025-03-31] ISSUE: Tracking a ROCm runtime crash after system suspension.
- [2025-03-31] ISSUE: Investigating Stable Diffusion compatibility on the RX 6750 GRE.
- Metrics: 61 New PRs, 64 Closed PRs; 46 New Issues, 38 Closed Issues
ROCm/MAD
- Key Activity:
- [2025-03-12] Unified the vLLM docker definition with upstream v0.7.3.
- Details:
- [2025-03-31] HIGHLIGHT: Added README for jax-training:maxtext-v25.4.
- [2025-03-31] HIGHLIGHT: Added megatron-lm training docker v25.4.
- Metrics: 6 New PRs, 5 Closed PRs; 0 New Issues, 0 Closed Issues
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- [2025-03-24] Removed pre-CXX11 ABI logic from build scripts.
- Details:
- [2025-03-31] HIGHLIGHT: Built MacOS CI with MKLDNN.
- [2025-03-31] HIGHLIGHT: Added gen_schema support for invoke_subgraph in hop schema.
- [2025-03-31] ISSUE: Adding a copy kwarg in torch.reshape() to follow the Python array API standard.
- [2025-03-31] ISSUE: Addressed ONNX decomp not preserving custom CompositeImplicitAutograd ops.
- Metrics: 1432 New PRs, 1368 Closed PRs; 708 New Issues, 512 Closed Issues
pytorch/torchtitan
- Key Activity:
- [2025-03-24] Updated modules to use the -m option for script execution.
- [2025-03-06] Legal modifications applied to additional datasets.
- Details:
- [2025-03-31] HIGHLIGHT: Refactored train_spec.loss_fn to build_loss_fn and implemented chunked loss.
- [2025-03-31] HIGHLIGHT: Work-in-progress kernel for Contiguous Group GeMM.
- [2025-03-31] ISSUE: Investigating Context parallel support on legacy Turing GPUs.
- Metrics: 100 New PRs, 91 Closed PRs; 27 New Issues, 15 Closed Issues
pytorch/ao
- Key Activity:
- [2025-03-10] Promoted Low Bit Optim out of prototype status.
- [2025-03-05] Reverted the move of torchao/_models to benchmarks/_models.
- Details:
- [2025-03-31] ISSUE: Diagnosing failed dispatch of Custom CUDA OPs in TorchAO.
- [2025-03-31] ISSUE: Addressed failures when saving static quantized models.
- Metrics: 0 New PRs, 0 Closed PRs; 25 New Issues, 38 Closed Issues
pytorch/FBGEMM
- Key Activity:
- [2025-03-29] General documentation updates.
- Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues
JAX & OpenXLA Ecosystem
jax-ml/jax
- Key Activity:
- [2025-03-31] Heavy codebase maintenance with >500 PRs merged.
- Details:
- [2025-03-31] HIGHLIGHT: Added __jax_array__ support in jnp.reshape, transpose, and matrix_transpose.
- [2025-03-31] HIGHLIGHT: Included the ProfilerData class in Jaxlib's profiler submodule.
- [2025-03-31] ISSUE: Proposed adding the "largest" argument to top_k.
- Metrics: 617 New PRs, 566 Closed PRs; 101 New Issues, 60 Closed Issues
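The proposed "largest" argument for top_k can be illustrated with a small pure-Python sketch of the intended semantics. This is not JAX code and not the proposed API: the function below is a hypothetical stand-in, and the assumption that `largest=False` would return the k smallest values (as the similarly named flag does in other array libraries) comes from the argument name rather than from the issue text.

```python
from typing import List, Sequence, Tuple


def top_k(values: Sequence[float], k: int, largest: bool = True) -> Tuple[List[float], List[int]]:
    """Return the k largest (or k smallest) values together with their indices.

    Hypothetical reference semantics only; ties are resolved in favour of the
    lower index because Python's sort is stable.
    """
    if not 0 <= k <= len(values):
        raise ValueError("k must be between 0 and len(values)")
    # Sorting by the negated value turns an ascending sort into a descending one.
    order = sorted(range(len(values)), key=lambda i: -values[i] if largest else values[i])
    indices = order[:k]
    return [values[i] for i in indices], indices


scores = [3.0, 1.0, 4.0, 1.5]
print(top_k(scores, 2))                 # two largest values and their positions
print(top_k(scores, 2, largest=False))  # two smallest values and their positions
```

Without such a flag, the usual workaround is to negate the input, take the top k, and negate the returned values again, which is the same trick the sort key uses here.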
openxla/xla
- Key Activity:
- [2025-03-10] Homepage link updates and backend documentation.
- Details:
- [2025-03-31] HIGHLIGHT: Autotuning updates to ensure entry versions only need updating in one place.
- [2025-03-31] HIGHLIGHT: Fixes submitted for TensorFlow GPU builds.
- [2025-03-31] ISSUE: Multidevice sharding raising UnspecifiedValue with custom PJRT backends.
- [2025-03-31] ISSUE: Requests for replacing tanhf on aarch64 with vectorized SVE implementations.
- Metrics: 1132 New PRs, 861 Closed PRs; 14 New Issues, 17 Closed Issues
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-03-22] Added Gemma 3 announcements and DeepSeek instructions to the README.
- Details:
- [2025-03-31] HIGHLIGHT: Refactored Prefill Packing into a dedicated python module.
- [2025-03-31] HIGHLIGHT: Unsharded QKV on the head dimension.
- [2025-03-31] ISSUE: Identified bug where moe_lb_loss was not divided by gradient_accumulation_steps for reporting.
- [2025-03-31] ISSUE: Addressed errors when using Megablox with expert_parallelism.
- Metrics: 171 New PRs, 114 Closed PRs; 6 New Issues, 6 Closed Issues
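The moe_lb_loss reporting bug is easiest to see with numbers. The sketch below is a simplified Python illustration, not MaxText code: it assumes the per-microbatch load-balancing losses are summed across accumulation steps, and the function name and the sample values are made up for the example.

```python
from typing import Sequence


def reported_loss(microbatch_losses: Sequence[float], gradient_accumulation_steps: int, divide: bool) -> float:
    """Aggregate per-microbatch losses into the single value that gets logged."""
    total = sum(microbatch_losses)
    # Without the division, the logged value grows with the number of steps.
    return total / gradient_accumulation_steps if divide else total


# Illustrative load-balancing loss values for four accumulation steps.
moe_lb_losses = [0.02, 0.04, 0.03, 0.03]
steps = len(moe_lb_losses)

unscaled = reported_loss(moe_lb_losses, steps, divide=False)
scaled = reported_loss(moe_lb_losses, steps, divide=True)

print(f"logged without division: {unscaled:.3f}")
print(f"logged with division:    {scaled:.3f}")
```

Under this assumption the undivided figure is inflated by a factor equal to gradient_accumulation_steps, so runs with different accumulation settings would not be comparable in the logs even when the underlying loss is the same.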
AI-Hypercomputer/JetStream
- Key Activity:
- [2025-03-31] Focused optimizations on context generation.
- Details:
- [2025-03-31] HIGHLIGHT: Fixed chunked prefill regression.
- [2025-03-31] HIGHLIGHT: Added support for long context dataset accuracy measurement.
- Metrics: 16 New PRs, 15 Closed PRs; 0 New Issues, 0 Closed Issues
NVIDIA & Low-Level Operations
NVIDIA/TransformerEngine
- Key Activity:
- [2025-03-31] Deep focus on FP8 performance and JAX integrations.
- Details:
- [2025-03-31] HIGHLIGHT: Improved performance of mxfp8 cast kernels.
- [2025-03-31] HIGHLIGHT: JAX Refactor incorporating MXFP8 and GroupedGEMM.
- [2025-03-31] ISSUE: Backward compatibility bug where PyTorch FP8 extra state from version 1.x cannot load in 2.x.
- Metrics: 71 New PRs, 76 Closed PRs; 28 New Issues, 33 Closed Issues
triton-lang/triton
- Key Activity:
- [2025-03-28] Pinned cmake to < 4 and added MAX_JOBS install instructions.
- Details:
- [2025-03-31] HIGHLIGHT: Updated backend to target llvm/llvm-project@1d4801f22ab.
- [2025-03-31] ISSUE: Investigating hangs when loading from a TMA descriptor.
- [2025-03-31] ISSUE: User requests for tl.dot auto-broadcast support.
- Metrics: 215 New PRs, 212 Closed PRs; 55 New Issues, 37 Closed Issues
facebookresearch/xformers
- Key Activity:
- [2025-03-31] Maintenance on core attention kernels.
- Details:
- [2025-03-31] HIGHLIGHT: Corrected the condition for using merge_nhead_groups_seqlen_q.
- [2025-03-31] ISSUE: Ecosystem requests for NPU builds targeting Huawei Ascend / CANN.
- Metrics: 5 New PRs, 7 Closed PRs; 7 New Issues, 4 Closed Issues
deepspeedai/DeepSpeed
- Key Activity:
- [2025-03-25] Linked AutoTP blogs and updated documentation handlers.
- Details:
- [2025-03-31] HIGHLIGHT: Updated to new PyTorch grad hook APIs for BF16Optimizer and Stage2.
- [2025-03-31] HIGHLIGHT: Removed legacy definitions causing compilation errors in CUDA 12.6.
- [2025-03-31] ISSUE: Handled grad_norm NaN issues when using overlap_comm: True with contiguous_gradients: True.
- Metrics: 47 New PRs, 44 Closed PRs; 47 New Issues, 29 Closed Issues
deepseek-ai/DeepEP
- Key Activity:
- [2025-03-27] Removed NVLink low-latency plan from documentation.
- [2025-03-10] Added BF16 support for low-latency kernels.
- Details:
- [2025-03-31] HIGHLIGHT: Adjusted kNumThreads of notify_dispatch.
- [2025-03-31] ISSUE: Triaging processes that block on H20 nodes during test_low_latency.py.
- Metrics: 7 New PRs, 8 Closed PRs; 65 New Issues, 40 Closed Issues
HuggingFace & LLM Frameworks
volcengine/verl
- Key Activity:
- [2025-03-30] RELEASE: v0.3.0.post0 - A major feature release including AMD support for the vLLM and FSDP backends, Qwen2.5-VL support, PRIME/RLOO/remax algorithms, and FIRE sampling. An SGLang integration preview is now available.
- Details:
- [2025-03-30] HIGHLIGHT: Megatron upgraded to v0.11; vLLM upgraded to v0.8.2.
- [2025-03-30] HIGHLIGHT: Added Ulysses sequence parallel support (transformers >= 4.48) and offloading parameters during rollout.
- Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues
xdit-project/xDiT
- Key Activity:
- [2025-03-20] RELEASE: 0.4.3 - Brought official AMD GPU support, SDXL CFG parallel support, and Sage attention in long_ctx_attn.
- [2025-03-03] RELEASE: 0.4.2 - Added Ray disaggregating VAE/DiT, USP implementations, TeaCache, FBCache, and Tensor Parallelism for the Step-Video-T2V model.
- [2025-03-26] RELEASE: 0.4.3.post1 - Hotfix patch.
- Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues
huggingface/transformers
- Key Activity:
- [2025-03-21] README and installation documentation updates.
- Details:
- [2025-03-31] HIGHLIGHT: Added fast image processor for ZoeDepth.
- [2025-03-31] HIGHLIGHT: Updated model card for distilbert.
- [2025-03-31] ISSUE: Diagnosing chunk dimension errors with default data parallelism in the Trainer.
- [2025-03-31] ISSUE: Resolved Deepseek-V3 loading warnings.
- Metrics: 461 New PRs, 369 Closed PRs; 205 New Issues, 165 Closed Issues
tile-ai/tilelang
- Key Activity:
- [2025-03-26] Deprecated T.Buffer arguments, transitioning the syntax over to T.Tensor.
- Details:
- [2025-03-31] ISSUE: Users migrating from older versions are hitting the error: module 'tilelang.language' has no attribute 'Tensor'.
- [2025-03-31] ISSUE: Compiler caching bug in example_mha_bwd.
- Metrics: 0 New PRs, 0 Closed PRs; 48 New Issues, 34 Closed Issues
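The missing-attribute error during the T.Buffer to T.Tensor transition suggests a version mismatch between example code and the installed package. One defensive pattern is to look the annotation up at runtime and fall back to the older name. The Python sketch below is a generic illustration, not tilelang code: `resolve_tensor_annotation` is a hypothetical helper, the two `SimpleNamespace` objects are stand-ins for an older and a newer tilelang.language module, and treating the two annotations as interchangeable is an assumption.

```python
from types import SimpleNamespace


def resolve_tensor_annotation(lang_module):
    """Return lang_module.Tensor when it exists, otherwise fall back to lang_module.Buffer."""
    tensor = getattr(lang_module, "Tensor", None)
    if tensor is not None:
        return tensor
    buffer = getattr(lang_module, "Buffer", None)
    if buffer is None:
        raise AttributeError("module exposes neither Tensor nor Buffer")
    return buffer


# Stand-ins for the language module before and after the rename (hypothetical objects).
old_lang = SimpleNamespace(Buffer="BufferAnnotation")
new_lang = SimpleNamespace(Buffer="BufferAnnotation", Tensor="TensorAnnotation")

print(resolve_tensor_annotation(old_lang))
print(resolve_tensor_annotation(new_lang))
```

A shim like this only papers over the rename; aligning the installed package version with the examples being run is likely the more direct fix.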
vllm-project/vllm
- Key Activity:
- [2025-03-20 to 2025-03-23] Trimmed and refreshed README front-page news and added a user forum.
- Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues
sgl-project/sglang
- Key Activity:
- [2025-03-04 to 2025-03-22] Routine documentation and README.md alignment.
- Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues
alibaba/rtp-llm
- Key Activity:
- [2025-03-04] Documented multi-process capabilities for the frontend server.
- Metrics: 0 New PRs, 0 Closed PRs; 0 New Issues, 0 Closed Issues