GitHub Monthly Report: 2025-06-01 to 2025-06-30
April 18, 2026
đź“… Engineering Report (2025-06-01 - 2025-06-30)
🚀 Executive Summary
June 2025 was marked by significant advancements in multi-modal and Massive Mixture-of-Experts (MoE) model scaling frameworks, as well as critical stabilization patches across major libraries. The broader ecosystem is aggressively standardizing FP8 workflows and advanced attention mechanisms. Maintenance health across the industry is exceptionally strong, exemplified by main pillars like pytorch/pytorch (closing 1,730 of 1,771 PRs) and openxla/xla (closing 1,079 of 1,109 PRs) resolving over 95% of their incoming pull requests within the month.
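The "over 95%" closure claim can be checked directly from the PR counts reported in the Metrics sections below; a quick sketch:

```python
def closure_rate(closed: int, opened: int) -> float:
    """Fraction of the month's incoming PRs that were closed."""
    return closed / opened

# Figures from this report's Metrics sections.
pytorch_rate = closure_rate(1730, 1771)  # ~0.977
xla_rate = closure_rate(1079, 1109)      # ~0.973

assert pytorch_rate > 0.95 and xla_rate > 0.95
```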
AMD Related Updates
- ROCm Stability & Upstreaming: The `pytorch/pytorch` 2.7.1 release included critical fixes for Distributed Fused Adam in ROCm/APEX when utilizing the `nccl_ub` feature.
- FP8 and Emulation: Native ecosystem support for AMD FP8 capabilities is accelerating. `triton-lang/triton` enabled multiple passing FP8 downcast clamping tests for AMD, while `tile-ai/tilelang` added support for vectorized FP8 data packing on AMD architectures.
- Performance Tracking: AMD’s internal profiling tool `TraceLens` is expanding its capabilities to capture fundamental optimizations, specifically adding PyTorch’s `_scaled_dot_product_flash_attention` to its performance model.
- Framework Adjustments: `pytorch/ao` relaxed precision requirements so the `test_int8_wo_quant_save_load` tests pass on ROCm.
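The "downcast clamping" behavior being tested can be illustrated with a minimal sketch: before a value is converted to FP8 E4M3, out-of-range magnitudes are saturated to the format's maximum finite value (448 in the OCP E4M3 encoding). This is a pure-Python illustration of the concept, not Triton's actual GPU implementation; mantissa rounding is omitted.

```python
# Minimal sketch of a saturating ("clamping") downcast to the FP8 E4M3 range.
# 448.0 is the largest finite value representable in OCP FP8 E4M3; real
# kernels also round the mantissa, which this sketch omits.
E4M3_MAX = 448.0

def clamp_to_e4m3_range(x: float) -> float:
    """Saturate a value into [-E4M3_MAX, E4M3_MAX] before FP8 conversion."""
    return max(-E4M3_MAX, min(E4M3_MAX, x))
```

With clamping, an overflowing input like 1e4 lands at 448.0 instead of producing a non-finite or wrapped bit pattern.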
Competitive Analysis
- NVIDIA’s Heavy Optimization: NVIDIA’s `TransformerEngine` 🚨 v2.4 brought aggressive optimizations, including a staggering framework-extension binary reduction from 108MB to just 2MB, and added support for MXFP8 recipes that overlap Tensor Parallelism communication with GEMMs.
- Hardware Feature Backporting: `facebookresearch/xformers` released v0.0.31, backporting Flash-Attention 3 capabilities to Ampere GPUs, significantly boosting the value proposition of NVIDIA’s older-generation hardware fleets, while officially dropping V100 support. Similarly, `deepseek-ai/DeepEP` explicitly added Ampere architecture support.
- DeepSeek & Massive MoE Scaling: Volcengine’s `verl` shipped massive updates (v0.4.0 & v0.4.1) natively supporting DeepSeek 671b and Qwen3 235b scaling via Megatron backends, demonstrating NVIDIA’s tight grip on ultra-large-scale MoE training loops.
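For context on the MXFP8 recipes mentioned above: microscaling FP8 stores one shared power-of-two scale per small block of elements, letting each block use the full FP8 range. A hedged pure-Python sketch of choosing such a block scale (the E4M3 max of 448 follows the OCP microscaling formats; this is an illustration, not TransformerEngine's implementation):

```python
import math

E4M3_MAX = 448.0  # largest finite OCP FP8 E4M3 value

def mx_block_scale(block: list) -> float:
    """Pick a power-of-two scale so the scaled block fits in E4M3 range."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    # Largest power of two p such that amax * 2**p <= E4M3_MAX.
    return 2.0 ** math.floor(math.log2(E4M3_MAX / amax))
```

Because the scale is an exact power of two, applying it is a cheap exponent adjustment rather than a full multiply on hardware.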
đź“‚ Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-06-18] Added TorchTitan backend support documentation to the README.
- [2025-06-06] Refactored Megatron run scripts to support multi-backend configurations.
- Details:
- [Highlight PR] Created comprehensive Primus Config/Patch documentation.
- [Highlight PR] Enabled dumping of pipeline parallelism (PP) schedule data and added a visualization tool for Megatron.
- Metrics: 45 PRs (43 Closed), 0 Issues
AMD-AGI/TraceLens
- Key Activity:
- Continued mapping out Aten-level Flash Attention operations for accurate timeline tracing and profiling.
- Details:
- [Highlight Issue] Identified a need to throw warnings for host-device misaligned timelines.
- [Highlight PR] Added `aten::_scaled_dot_product_flash_attention` to the performance modeling system.
- Metrics: 15 PRs (16 Closed), 13 Issues (4 Closed)
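A performance model like the one `TraceLens` is extending typically estimates an op's ideal cost from an analytic FLOP count. A hedged sketch for scaled dot-product attention, using the standard 4·B·H·S²·D matmul estimate (softmax and masking costs ignored; this is not TraceLens's actual model):

```python
def sdpa_matmul_flops(batch: int, heads: int, seq_len: int, head_dim: int) -> int:
    """Matmul FLOPs for scaled dot-product attention:
    2*B*H*S*S*D for Q @ K^T plus 2*B*H*S*S*D for attn @ V."""
    return 4 * batch * heads * seq_len * seq_len * head_dim
```

Dividing this estimate by a kernel's measured wall time yields achieved FLOP/s, which a profiler can compare against the device roofline.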
ROCm/ROCm
- Key Activity:
- [2025-06-12] Updated README.md to prominently feature and point to the `TheRock` installer near the top of the repo.
- Details:
- [Highlight Issue] Raised a driver crash issue for kernels > 6.13 due to scheduler `comp_1.1.1` not being ready.
- [Highlight Issue] Flagged missing firmware causing Ubuntu 24.04 to not detect AMD Ryzen AI MAX+ 395 iGPUs.
- [Highlight PR] Fixed hardcoded gfx configurations in the MIOpen CK script.
- Metrics: 101 PRs (98 Closed), 22 Issues (30 Closed)
ROCm/MAD
- Key Activity:
- Minor pipeline and build improvements.
- Details:
- [Highlight PR] Merged functionality to parse and load Docker build args.
- [Highlight PR] Fixed bugs related to multiple results checking logic.
- Metrics: 6 PRs (7 Closed), 0 Issues
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1: A critical patch release resolving excessive cudagraph re-recording for Hugging Face LLM models, Flex Attention performance regressions, and ROCm/APEX Distributed Fused Adam bugs.
- [2025-06-10] Updated CUDA installation documentation regarding conda requirements.
- Details:
- [Highlight Issue] Tracked inductor dynamic shape failures when Hugging Face models create error guards.
- [Highlight PR] Added inductor lowerings for `adaptive_avg_pool3d` and `adaptive_max_pool3d`.
- [Highlight PR] Added support for complex numbers in `DTensor` redistributions.
- Metrics: 1771 PRs (1730 Closed), 751 Issues (661 Closed)
pytorch/torchtitan
- Key Activity:
- 🚨 [2025-06-18] RELEASE: v0.1.0: The first official pre-release of TorchTitan, pinned to `torch-2.8.0.dev20250617+cu126`.
- [2025-06-13] Added a dedicated folder for benchmarks and submission guidelines.
- Details:
- [Highlight PR] Added support for doing MoE conversion before Float8Linear conversion for seamless FP8 MoE training.
- [Highlight Issue] Debugged DTensor local tensor dtype mismatches in Llama 4 Tensor Parallelism.
- Metrics: 86 PRs (63 Closed), 23 Issues (17 Closed)
pytorch/ao
- Key Activity:
- [2025-06-28] Revamped README to include references to Flux-Fast and axolotl + QAT integrations.
- Details:
- [Highlight PR] Introduced the new `Float8Tensor` type to the quantization API.
- [Highlight PR] Relaxed precision requirements to fix the failing `test_int8_wo_quant_save_load` test suite on ROCm hardware.
- Metrics: 149 PRs (139 Closed), 27 Issues (7 Closed)
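"Relaxing precision requirements" in a test like `test_int8_wo_quant_save_load` usually means widening comparison tolerances so small backend-specific rounding differences (e.g. on ROCm) no longer fail the assertion. A minimal pure-Python sketch of tolerance-based comparison in the `torch.allclose` style (the tolerance values below are illustrative, not the ones used in `pytorch/ao`):

```python
def close_enough(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise check: |a - b| <= atol + rtol * |b|, allclose-style."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

ref = [1.000, 2.000, -3.000]
got = [1.0002, 1.9998, -3.0006]  # small backend-specific numeric drift

strict = close_enough(got, ref, rtol=1e-5)   # fails under the tight tolerance
relaxed = close_enough(got, ref, rtol=1e-3)  # passes once relaxed
```

Relaxing `rtol` is the usual fix when the quantized results are numerically valid but differ in the last bits between backends.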
pytorch/vision & pytorch/audio
- Key Activity:
- 🚨 [2025-06-04] RELEASE: torchvision v0.22.1 & torchaudio v2.7.1: Both libraries are moving older components to a maintenance phase. `torchvision` announced deprecation of video decoding in favor of the new `TorchCodec` project, and `torchaudio` is being scoped strictly to ML processing, stripping redundant ecosystem features.
- Metrics: Vision: 31 PRs (24 Closed); Audio: 0 PRs, 0 Issues
NVIDIA & Competitive Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v2.4: Massive framework update featuring MXFP8 recipes for userbuffers (overlapping TP/GEMM communication), Float8CurrentScaling in JAX, and binary size optimizations (reducing from 108MB to 2MB).
- Metrics: 0 PRs, 0 Issues (tracked internally, dropped directly as a release)
facebookresearch/xformers
- Key Activity:
- 🚨 [2025-06-25] RELEASE: v0.0.31: Introduced python-version agnostic wheels.
- [2025-06-25] Shipped support for Flash-Attention 3 on Ampere architecture GPUs, extending the lifespan of older enterprise hardware while formally removing support for V100/older GPUs.
- Metrics: 0 PRs, 0 Issues
deepseek-ai/DeepEP
- Key Activity:
- [2025-06-11] Upstreamed support for Ampere architecture GPUs for DeepSeek’s internal EP mechanisms.
- [2025-06-27] Enforced stricter conditions for aggressive PTX instructions.
- Metrics: 0 PRs, 0 Issues
JAX & Compilers
openxla/xla
- Key Activity:
- Continued heavy compiler integrations, specifically syncing LLVM and handling Shardy automated partitioning integrations.
- Details:
- [Highlight PR] Integrated Shardy auto-SPMD partitioning field propagation from the config.
- [Highlight Issue] Discussing ways cross-architecture operator libraries (like cuBLAS) can be seamlessly applied in JAX.
- Metrics: 1109 PRs (1079 Closed), 16 Issues (10 Closed)
triton-lang/triton
- Key Activity:
- [2025-06-27] Documentation improvements around custom LLVM builds.
- Details:
- [Highlight PR] AMD specific update to enable more passing FP8 downcast clamping tests in Triton.
- [Highlight PR] Added extensive scheduling infrastructure to the refine-ops pass for pipeline tuning.
- Metrics: 323 PRs (314 Closed), 30 Issues (21 Closed)
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v0.1.5: Extremely feature-dense release introducing implicit Warp Specialize programming models, Auto Layout Inference enhancements, and broad Docker enhancements.
- Details:
- [Highlight PR] Implemented vectorized FP8 data packing support for AMD.
- [Highlight PR] Fixed AMD Docker issues related to conda environment instantiation.
- [Highlight Issue] Discussed optimizations for Variable Length (varlen) NSA to match FA3 speeds.
- Metrics: 56 PRs (59 Closed), 9 Issues (10 Closed)
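The "vectorized FP8 data packing" above refers to bundling multiple 8-bit FP8 payloads into one wider machine word so they can move through memory in a single vectorized load or store. A pure-Python sketch of the packing arithmetic (little-endian byte order assumed; the real kernels operate on GPU registers, not Python integers):

```python
def pack_fp8x4(payloads: list) -> int:
    """Pack four 8-bit FP8 bit-patterns into one 32-bit word (little-endian)."""
    assert len(payloads) == 4 and all(0 <= b <= 0xFF for b in payloads)
    word = 0
    for i, b in enumerate(payloads):
        word |= b << (8 * i)
    return word

def unpack_fp8x4(word: int) -> list:
    """Inverse of pack_fp8x4: recover the four 8-bit payloads."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```

Packing four elements per 32-bit word quarters the number of memory transactions relative to byte-at-a-time access, which is where the vectorization win comes from.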
Large Language Model Tooling (vLLM, Verl, SGLang, Transformers)
volcengine/verl
- Key Activity:
- 🚨 [2025-06-06] RELEASE: v0.4.0 & 🚨 [2025-06-27] RELEASE: v0.4.1: Verl has become a dominant RL framework. The v0.4.x releases added support for ultra-large MoE models (DeepSeek 671b, Qwen3 235b) over Megatron backends, integrated tool-calling, and sample-level multi-turn RL workflows leveraging `sglang` rollouts.
- Metrics: Active release tracking, extremely high momentum in the RLHF pipeline space.
huggingface/transformers
- Key Activity:
- Minor model additions including Gemma3n and MARIAN updates.
- Details:
- [Highlight PR] Shipped several framework fixes specifically mapping Gemma3n functionality.
- Metrics: 462 PRs (423 Closed), 162 Issues (156 Closed)
xdit-project/xDiT
- Key Activity:
- [2025-06-26] Expanded multi-model support by bringing in SanaSprint and native SANA support.
- Metrics: 4 PRs (5 Closed), 7 Issues (2 Closed)