GitHub Monthly Report: 2025-06-01 to 2025-06-30
📅 Engineering Report (2025-06-01 - 2025-06-30)
🚀 Executive Summary
June 2025 was a pivotal month for Reinforcement Learning from Human Feedback (RLHF) and Distributed Training infrastructure. Volcengine’s verl saw massive activity with two major releases (v0.4.0/v0.4.1), adding support for DeepSeek-V3/671B and Qwen3 MoE models, indicating rapid maturation of open-source RLHF stacks. In the PyTorch ecosystem, the release of v2.7.1 focused on stability (torch.compile fixes), while TorchTitan officially hit its pre-release milestone (v0.1.0), signaling that PyTorch’s native answer to Megatron-LM is ready for broader testing. AMD’s ecosystem showed strong momentum in compiler optimization (TileLang/Triton) and backend interoperability (Primus adding TorchTitan support).
AMD Related Updates
- Infrastructure Interoperability: AMD-AGI/Primus added backend support for TorchTitan, aligning AMD’s training stack with PyTorch’s emerging native distributed framework and moving beyond Megatron-LM-only dependencies.
- Compiler Optimization (TileLang & Triton): tile-ai/tilelang (v0.1.5+) saw significant AMD-specific commits, including vectorized FP8 data packing and Float8 Matrix Core support. Similarly, triton-lang/triton enabled more passing FP8 downcast clamping tests for AMD.
- ROCm Stability: Issues were flagged regarding Ubuntu 24.04 on Ryzen AI Max+ iGPUs and kernel crashes on versions >6.13, highlighting integration challenges with newer OS/kernel combinations. Documentation updates point to “TheRock,” suggesting a new centralization of resources.
- PyTorch Support: PyTorch v2.7.1 included a specific fix for Distributed Fused Adam in ROCm/APEX when using nccl_ub, addressing a distributed training regression.
Competitive Analysis
- NVIDIA FP8 Maturity: TransformerEngine v2.4 was released with extensive FP8 recipe support (Float8CurrentScaling, MXFP8) and caching optimizations, reinforcing NVIDIA’s lead in low-precision training efficiency.
- Hardware Support Deprecation: facebookresearch/xformers released v0.0.31, which dropped support for V100 GPUs, signaling that major libraries are aggressively deprecating pre-Ampere architectures to focus on Flash-Attention 3 optimization.
- RLHF Scaling: The verl updates (v0.4.0) introduced support for massive MoE models (DeepSeek 671B) on Megatron backends, effectively creating a turnkey solution for training state-of-the-art models that competes with proprietary stacks.
- DeepEP Expansion: DeepSeek’s communication library (DeepEP) added support for the Ampere architecture, expanding its utility beyond Hopper GPUs.
📂 Category Updates
🟥 AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-06-18] Added TorchTitan backend support to README, signaling a strategic shift to support PyTorch’s native distributed training tools.
- [2025-06-06] Refactored Megatron run scripts to support a multi-backend parameter structure.
- Details:
- New PRs focused on feat(megatron) for dumping pipeline schedule data and visualization tools.
- Documentation updates for Docker image v25.5_py310.
- Metrics: 45 PRs, 0 Issues (high development velocity, focused on PR merging)
AMD-AGI/TraceLens
- Key Activity:
- Focus on performance modeling for Flash Attention (aten::_flash_attention_forward).
- Added warnings for host-device misaligned timelines to improve profiling accuracy.
- Metrics: 15 PRs, 13 Issues
ROCm/ROCm
- Key Activity:
- [2025-06-12] Documentation updates pointing to “TheRock”, likely a new central hub or branding for ROCm resources.
- Crucial Bugs: Identified firmware detection issues on Ubuntu 24.04 (Ryzen AI Max+) and driver crashes on Kernels > 6.13.
- Details:
- Fixes merged for hardcoded gfx versions in MIOpen CK scripts.
- Metrics: 101 PRs, 22 Issues
ROCm/MAD
- Key Activity:
- Maintenance updates focused on Docker build args and fixes to multiple-results checking logic.
- Metrics: 6 PRs, 0 Issues
🔥 PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1
- Addressed regressions in torch.compile (HuggingFace model crashes, excessive CUDA graph re-recording).
- AMD Specific: Fixed Distributed Fused Adam in ROCm/APEX when using nccl_ub.
- Details:
- Fixed a flex_attention performance regression on the nanoGPT speedrun.
- Maintenance warning: pytorch/audio is transitioning into a maintenance phase.
- Metrics: 1771 PRs, 751 Issues (extremely high volume)
pytorch/torchtitan
- Key Activity:
- 🚨 [2025-06-18] RELEASE: v0.1.0 (First Pre-release).
- Built on torch-2.8.0.dev and torchao.
- Details:
- Added specific logic for float8 MoE training to prevent Float8Linear modules from being un-converted.
- Identified Llama4 Tensor Parallelism (TP) bugs regarding DTensor/Meta registration.
- Metrics: 86 PRs, 23 Issues
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-06-28] Added Flux-Fast support to README.
- Added a Float8Tensor implementation.
- Details:
- Relaxed precision requirements for test_int8_wo_quant_save_load specifically for ROCm to fix CI failures.
- Metrics: 149 PRs, 27 Issues
pytorch/vision
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v0.22.1
- Deprecation Notice: Video decoding/encoding capabilities are being deprecated in favor of TorchCodec (targeted removal in v0.25).
- Metrics: 31 PRs, 13 Issues
🟩 NVIDIA Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v2.4
- Major Features: Support for Float8CurrentScaling recipe (JAX), MXFP8 recipe (PyTorch), and switching among FP8 recipes during training.
- Details:
- Reduced framework extension binary size from 108MB to 2MB.
- Fixed numerical bugs in LayerNorm/RMSNorm when using Sequence Parallelism.
- Metrics: 0 PRs, 0 Issues (code pushed primarily via release)
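For context on the Float8CurrentScaling recipe: current scaling derives the FP8 scale factor from the amax of the tensor being quantized at that moment, rather than from a delayed-scaling history of past amaxes. A minimal pure-Python sketch of the idea (function names here are illustrative, not TransformerEngine’s API):

```python
# Illustrative sketch of FP8 "current scaling" -- names are invented for
# this example, not TransformerEngine's actual API.

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def current_scale(values, fp8_max=FP8_E4M3_MAX):
    """Scale factor mapping the tensor's current amax onto the FP8 max."""
    amax = max(abs(v) for v in values)
    return fp8_max / amax if amax > 0.0 else 1.0

def quantize(values, scale, fp8_max=FP8_E4M3_MAX):
    """Scale and saturate into the FP8 dynamic range (bit-level rounding omitted)."""
    return [max(-fp8_max, min(fp8_max, v * scale)) for v in values]

x = [0.5, -2.0, 1000.0]   # 1000.0 would overflow FP8 unscaled
s = current_scale(x)      # 448 / 1000 = 0.448
q = quantize(x, s)        # every element now fits in [-448, 448]
```

Delayed scaling instead reuses amax statistics from previous iterations, avoiding the extra reduction at the cost of a potentially stale scale; v2.4’s ability to switch recipes mid-training lets users trade between the two.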
facebookresearch/xformers
- Key Activity:
- 🚨 [2025-06-25] RELEASE: v0.0.31
- Breaking: Dropped support for V100 and older GPUs.
- Added Flash-Attention 3 support for Ampere GPUs.
- Metrics: 0 PRs, 0 Issues
🤖 Compilers & Languages
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v0.1.5
- AMD Optimization: PRs merged later in the month added support for Vectorized FP8 DataPacking and Float8 matrix cores on AMD.
- Details:
- Extensive work on Warp Specialization and TMA integration.
- Fixed AMD Docker issues related to Conda environments.
- Metrics: 56 PRs, 9 Issues
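On the vectorized FP8 data-packing theme: the core trick is placing four 8-bit FP8 encodings into one 32-bit word so they can be moved and processed as a single register. A stdlib-Python sketch of the byte layout (the helper names are invented here; TileLang’s actual kernels do this with vectorized device intrinsics):

```python
# Illustrative sketch: pack four 8-bit FP8 encodings into one 32-bit word,
# with byte i landing in bits [8*i, 8*i + 8). Helper names are hypothetical.

def pack_fp8x4(b0, b1, b2, b3):
    """Pack four FP8 bytes (each 0..255) into a single uint32-sized int."""
    for b in (b0, b1, b2, b3):
        assert 0 <= b <= 0xFF, "each FP8 encoding is exactly one byte"
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)

def unpack_fp8x4(word):
    """Split a packed 32-bit word back into its four FP8 bytes."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]

packed = pack_fp8x4(0x3C, 0xBC, 0x00, 0x7E)
assert unpack_fp8x4(packed) == [0x3C, 0xBC, 0x00, 0x7E]
```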
triton-lang/triton
- Key Activity:
- AMD Specific: PR merged to “Enable more passing fp8 downcast clamping tests.”
- Documentation updates regarding custom LLVM builds.
- Metrics: 323 PRs, 30 Issues
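For context on the FP8 downcast clamping tests: a saturating downcast clamps values outside the target FP8 format’s finite range to the maximum finite value instead of letting them overflow to infinity. A pure-Python sketch of the semantics (rounding to actual FP8 bit patterns is omitted; the max-value constants are the standard ones for the two FP8 formats):

```python
# Illustrative sketch of saturating ("clamped") FP8 downcast semantics.

E4M3_MAX = 448.0     # largest finite FP8 E4M3 value
E5M2_MAX = 57344.0   # largest finite FP8 E5M2 value

def clamped_downcast(x, fp8_max):
    """Saturate x into [-fp8_max, fp8_max]; NaN propagates unchanged."""
    if x != x:       # NaN is the only value unequal to itself
        return x
    return max(-fp8_max, min(fp8_max, x))

assert clamped_downcast(1e6, E4M3_MAX) == 448.0      # saturates, no inf
assert clamped_downcast(-1e6, E5M2_MAX) == -57344.0
assert clamped_downcast(3.5, E4M3_MAX) == 3.5        # in-range passes through
```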
openxla/xla
- Key Activity:
- Heavy integration work with LLVM upstream.
- Shardy propagation fixes for auto SPMD partitioning.
- Metrics: 1109 PRs, 16 Issues
📈 RLHF, Inference & Serving
volcengine/verl
- Key Activity:
- 🚨 [2025-06-27] RELEASE: v0.4.1 & [2025-06-06] RELEASE: v0.4.0
- Massive Update: Added support for DeepSeek 671B & Qwen3 235B (MoE models) using the Megatron backend.
- Details:
- Switched the checkpointer to mcore’s dist_checkpoint.
- Integrated SGLang for multi-turn rollout and tool calling.
- Added recipes for Search-R1 and the Entropy Mechanism (Clip-Cov/KL-Cov).
- Metrics: 0 PRs reported (Release focused)
vllm-project/vllm
- Key Activity:
- Documentation updates focusing on community contact and Slack integration.
- Metrics: 0 PRs, 0 Issues (low visible activity in tracked logs)
sgl-project/sglang
- Key Activity:
- Added documentation for GB200 NVL72 support.
- Metrics: 0 PRs, 0 Issues
🔵 JAX & Google
jax-ml/jax
- Key Activity:
- 🚨 [2025-06-17] RELEASE: v0.6.2
- Added jax.tree.broadcast. Raised the minimum NumPy version to 1.26.
- Metrics: 0 PRs, 0 Issues
AI-Hypercomputer/maxtext
- Key Activity:
- 🚨 [2025-06-16] RELEASE: pre-nnx-v0.1.0
- Snapshot of the latest version dependent on Flax Linen before the transition to NNX.
- Metrics: 0 PRs, 0 Issues