GitHub Monthly Report: 2025-06-01 to 2025-06-30
🚀 Executive Summary
June 2025 was a pivotal month for the PyTorch ecosystem with PyTorch v2.7.1 focusing on stabilization and TorchTitan graduating to its first official pre-release (v0.1.0), signaling a serious push into large-scale training standardization. In the RLHF/Training space, Verl saw massive updates (v0.4.0/v0.4.1) adding support for massive MoE models (DeepSeek 671B, Qwen3) and FSDP2.
On the hardware/compiler front, TileLang demonstrated significant velocity with distinct optimizations for AMD GPUs (Float8 Matrix Cores), while NVIDIA’s TransformerEngine v2.4 introduced new FP8 recipes and drastically reduced binary sizes. ROCm documentation has begun referencing “TheRock,” hinting at a new kernel or stack distribution.
AMD Related Updates
- Primus & TorchTitan Integration: `AMD-AGI/Primus` has added specific backend support for TorchTitan, ensuring AMD hardware is ready for PyTorch’s new flagship training framework.
- TileLang Optimizations: The `tile-ai/tilelang` compiler (v0.1.5) merged specific AMD backend support, including Float8 Matrix Core support and vectorized FP8 data packing, improving high-performance kernel generation for AMD accelerators.
- TraceLens Coverage: `TraceLens` added performance modeling for `flash_attention` operations, critical for accurate profiling of modern LLM workloads on ROCm.
- ROCm Rebranding/Stack: `ROCm/ROCm` README updates now reference “TheRock,” suggesting a potential shift in distribution or kernel packaging naming conventions.
- PyTorch AO Support: `pytorch/ao` relaxed precision requirements for ROCm in quantization tests, unblocking INT8 save/load workflows.
Competitive Analysis
- NVIDIA TransformerEngine v2.4: NVIDIA released v2.4, significantly reducing the binary size (from 108 MB to 2 MB) and adding support for MXFP8 recipes and logical partitioning in JAX.
- Verl Dominance in RLHF: `volcengine/verl` is rapidly becoming the standard for post-training. The v0.4.x releases added support for DeepSeek-V3 (671B) and Qwen3-235B, explicitly supporting Megatron backends and FSDP2. This cements NVIDIA/Megatron compatibility in the most active RLHF repo.
- xFormers Drops V100: Meta’s `xformers` has dropped support for V100 GPUs (Volta), signaling that optimization efforts are now exclusively focused on Ampere and newer architectures (Hopper/Blackwell).
📂 Category Updates
🟢 AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-06-18] Added TorchTitan backend support to README.
- [2025-06-06] Refactored Megatron run scripts to support multi-backend via parameters.
- Details:
- The team is actively making Primus backend-agnostic, moving away from hardcoded Megatron dependencies to support TorchTitan.
- Metrics: 45 PRs (Open), 43 PRs (Closed)
AMD-AGI/TraceLens
- Key Activity:
- Focus on Flash Attention performance modeling.
- Details:
- Added `aten::_scaled_dot_product_flash_attention` to the performance model.
- New warning system implemented for host/device timeline misalignment.
- Metrics: 15 PRs (Open), 16 PRs (Closed)
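For context, TraceLens-style performance models consume profiler traces in which attention appears as a discrete `aten::` event. As a rough, stand-alone illustration of where those events come from (plain `torch.profiler`, not TraceLens itself; shapes are arbitrary), a trace can be captured like this. On CUDA builds with the flash backend, the event is named `aten::_scaled_dot_product_flash_attention`; CPU dispatches a different backend under a different name.

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Toy attention workload; shapes are illustrative, not from TraceLens.
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# Capture a trace of dispatched aten ops. On a CUDA build with flash
# attention available, the call below is recorded as
# aten::_scaled_dot_product_flash_attention.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    out = F.scaled_dot_product_attention(q, k, v)

# Summarize recorded ops: this table is the raw material a performance
# model like TraceLens builds on.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```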
ROCm/ROCm
- Key Activity:
- [2025-06-12] Documentation updates pointing to “TheRock”.
- Details:
- Issues reported regarding Ubuntu 24.04 detection of Ryzen AI MAX+ iGPUs (missing firmware).
- Driver crashes reported on kernels newer than 6.13.
- Metrics: 101 PRs (Open), 98 PRs (Closed)
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v0.1.5
- Details:
- Significant AMD support added: Float8 Matrix Core support and vectorized FP8 data packing.
- Added NVRTC execution backend.
- Fixed deepgemm examples and improved autotuning caching.
- Metrics: 56 PRs (Open), 59 PRs (Closed)
🔥 PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1
- Details:
- Torch.compile: Fixed excessive cudagraph re-recording for HF LLMs and graph breaks in `einops`.
- Distributed: Fixed issues with Distributed Fused Adam in ROCm/APEX.
- General: Addressed regressions in Llama2 model export.
- Metrics: 1771 PRs (Open), 1730 PRs (Closed) (Very High Velocity)
pytorch/torchtitan
- Key Activity:
- 🚨 [2025-06-18] RELEASE: v0.1.0 (First Pre-release)
- Details:
- Built on `torch-2.8.0.dev` and `torchao-0.12.0.dev`.
- Standardized model folder structure.
- Work on Llama4 Tensor Parallelism (TP) bugs involving DTensor metadata mismatches.
- Metrics: 86 PRs (Open), 63 PRs (Closed)
pytorch/ao (Architecture Optimization)
- Key Activity:
- Integration of Flux-Fast and Axolotl + QAT support.
- Details:
- Added `Float8Tensor`.
- AMD Specific: Relaxed precision tolerances for `test_int8_wo_quant_save_load` on ROCm to fix CI failures.
- Metrics: 149 PRs (Open), 139 PRs (Closed)
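The ROCm fix is an instance of a common quantization-test pattern: an int8 weight-only round-trip is compared against the float baseline under a tolerance, and backends with slightly different numerics need that tolerance loosened. A hand-rolled sketch of the idea (symmetric per-tensor quantization, not the actual `torchao` code path; the tolerance value is illustrative):

```python
import torch

def int8_weight_only_roundtrip(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 quantize -> dequantize of a weight."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(64, 64)
w_dq = int8_weight_only_roundtrip(w)

# Tests compare the round-trip against the original weight. The tolerance
# must absorb the quantization step (about scale / 2); CI fixes like the
# ROCm one simply loosen atol/rtol on backends whose numerics differ.
torch.testing.assert_close(w_dq, w, atol=3e-2, rtol=0.0)
print("round-trip max abs error:", (w_dq - w).abs().max().item())
```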
pytorch/audio
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1
- Details:
- Strategic Shift: Explicitly announced the library is entering a “maintenance phase” to reduce redundancies with the wider ecosystem.
- Metrics: 30 PRs (Open), 21 PRs (Closed)
🟩 NVIDIA / Competitors
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v2.4
- Details:
- Binary size reduced from 108 MB to 2 MB.
- Added support for MXFP8 recipe.
- Fixed numerical issues with activation recompute in FP8.
- Metrics: 0 Issues/PRs tracked (Release only).
facebookresearch/xformers
- Key Activity:
- 🚨 [2025-06-25] RELEASE: v0.0.31
- Details:
- Dropped Support: V100 and older GPUs are no longer supported.
- Added support for Flash-Attention 3 on Ampere GPUs.
- Metrics: Low activity outside release.
🤖 LLM Training & Inference
volcengine/verl
- Key Activity:
- 🚨 [2025-06-27] RELEASE: v0.4.1
- 🚨 [2025-06-06] RELEASE: v0.4.0
- Details:
- MoE Support: DeepSeek 671B and Qwen3 235B support added.
- FSDP2: Now the recommended strategy over FSDP1.
- Integration: SGLang rollout with MCP client and tool calling.
- Switched the Megatron checkpointer to `mcore`’s distributed checkpointing.
- Metrics: Healthy maintenance (Issues and PRs handled alongside major releases).
openxla/xla
- Key Activity:
- Heavy development on LLVM integration and Shardy partitioning.
- Details:
- Integration of LLVM at newer commits.
- Discussions on generating HLO text IR via fuzzers.
- Metrics: 1109 PRs (Open), 1079 PRs (Closed) (Extremely High Velocity)
xdit-project/xDiT
- Key Activity:
- Focus on Image Generation parallelism.
- Details:
- Added support for SanaSprint and SANA models.
- Metrics: 4 PRs (Open), 5 PRs (Closed)