📅 Engineering Report (2025-06-01 - 2025-06-30)

🚀 Executive Summary

June 2025 was a pivotal month for Reinforcement Learning from Human Feedback (RLHF) and Distributed Training infrastructure. Volcengine’s verl saw massive activity with two major releases (v0.4.0/v0.4.1), adding support for DeepSeek-V3/671B and Qwen3 MoE models, indicating a rapid maturation of open-source RLHF stacks. In the PyTorch ecosystem, the v2.7.1 release focused on stability (torch.compile fixes), while TorchTitan officially hit its pre-release milestone (v0.1.0), signaling that PyTorch’s native answer to Megatron-LM is ready for broader testing. AMD’s ecosystem showed strong momentum in compiler optimization (TileLang/Triton) and backend interoperability (Primus adding TorchTitan support).

  • Infrastructure Interoperability: AMD-AGI/Primus has added backend support for TorchTitan, aligning AMD’s training stack with PyTorch’s emerging native distributed framework, moving beyond just Megatron-LM dependencies.
  • Compiler Optimization (TileLang & Triton): tile-ai/tilelang (v0.1.5+) saw significant AMD-specific commits, including Vectorized FP8 DataPacking and Float8 Matrix Core support. Similarly, triton-lang/triton enabled more passing FP8 downcast clamping tests for AMD.
  • ROCm Stability: Issues were flagged regarding Ubuntu 24.04 on Ryzen AI Max+ iGPUs and crashes on Linux kernels newer than 6.13, highlighting integration challenges with newer OS/kernel combinations. Documentation updates point to “TheRock,” suggesting a new centralization of resources.
  • PyTorch Support: PyTorch v2.7.1 included a specific fix for Distributed Fused Adam in ROCm/APEX when using nccl_ub, addressing a distributed training regression.

Competitive Analysis

  • NVIDIA FP8 Maturity: TransformerEngine v2.4 was released with extensive FP8 recipe support (Float8CurrentScaling, MXFP8) and caching optimizations, reinforcing NVIDIA’s lead in low-precision training efficiency.
  • Hardware Support Deprecation: facebookresearch/xformers released v0.0.31, which dropped support for V100 GPUs, signaling that major libraries are aggressively deprecating pre-Ampere architectures to focus on Flash-Attention 3 optimization.
  • RLHF Scaling: The verl updates (v0.4.0) introduced support for massive MoE models (DeepSeek 671B) on Megatron backends, effectively creating a turnkey solution for training state-of-the-art models that competes with proprietary stacks.
  • DeepEP Expansion: DeepSeek’s communication library (DeepEP) added support for Ampere architecture, expanding its utility beyond Hopper GPUs.

📂 Category Updates

🟥 AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2025-06-18] Added TorchTitan backend support to README, signaling a strategic shift to support PyTorch’s native distributed training tools.
    • [2025-06-06] Refactored Megatron run scripts to support a multi-backend parameter structure.
  • Details:
    • New PRs focused on feat(megatron) for dumping pipeline schedule data and visualization tools.
    • Documentation updates for Docker image v25.5_py310.
  • Metrics: 45 PRs 0 Issues (High development velocity, focused on PR merging)
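The pipeline-schedule dumping and visualization work mentioned above can be illustrated with a toy generator. The sketch below emits the textbook 1F1B (one-forward-one-backward) schedule as text rows; it is a conceptual illustration, not Primus code, and every name in it is hypothetical.

```python
# Toy sketch: dump a 1F1B pipeline schedule as per-stage op lists.
# Conceptual only -- not Primus/Megatron code; names are hypothetical.

def one_f_one_b(num_stages: int, num_microbatches: int):
    """Return per-stage lists of ('F', mb) / ('B', mb) ops in 1F1B order."""
    schedule = []
    for stage in range(num_stages):
        # Earlier stages run more warm-up forwards before the first backward.
        warmup = min(num_stages - stage - 1, num_microbatches)
        ops, fwd, bwd = [], 0, 0
        for _ in range(warmup):            # warm-up phase: forwards only
            ops.append(("F", fwd)); fwd += 1
        while fwd < num_microbatches:      # steady state: 1 forward, 1 backward
            ops.append(("F", fwd)); fwd += 1
            ops.append(("B", bwd)); bwd += 1
        while bwd < num_microbatches:      # cooldown: drain remaining backwards
            ops.append(("B", bwd)); bwd += 1
        schedule.append(ops)
    return schedule

for stage, ops in enumerate(one_f_one_b(4, 8)):
    print(f"stage {stage}: " + " ".join(f"{kind}{mb}" for kind, mb in ops))
```

Dumping this kind of table is what makes schedule visualization tools possible: each row can be rendered as a timeline lane per pipeline stage.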

AMD-AGI/TraceLens

  • Key Activity:
    • Focus on performance modeling for Flash Attention (aten::_flash_attention_forward).
    • Added warnings for host/device timeline misalignment to improve profiling accuracy.
  • Metrics: 15 PRs 13 Issues
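The misalignment warning above guards against skewed host and device clocks in a trace. A minimal sketch of such a check, assuming hypothetical per-event timestamp lists (this is not TraceLens code):

```python
# Conceptual sketch of a host/device timeline misalignment check.
# All names are hypothetical; not TraceLens code.

def check_alignment(host_launch_us, device_start_us, tolerance_us=5.0):
    """Warn when a device kernel appears to start before its host launch,
    which usually indicates skewed host/device clocks in the trace."""
    warnings = []
    for i, (h, d) in enumerate(zip(host_launch_us, device_start_us)):
        if d + tolerance_us < h:
            warnings.append(
                f"event {i}: device start {d}us precedes host launch {h}us"
            )
    return warnings

print(check_alignment([10.0, 20.0], [12.0, 8.0]))
```

A kernel that "starts" before its own launch call is physically impossible, so any such event is a reliable signal that the two timelines need re-alignment before performance modeling.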

ROCm/ROCm

  • Key Activity:
    • [2025-06-12] Documentation updates pointing to “TheRock”, likely a new central hub or branding for ROCm resources.
    • Crucial Bugs: Identified firmware detection issues on Ubuntu 24.04 (Ryzen AI Max+) and driver crashes on Kernels > 6.13.
  • Details:
    • Fixes merged for hardcoded gfx versions in MIOpen CK scripts.
  • Metrics: 101 PRs 22 Issues

ROCm/MAD

  • Key Activity:
    • Maintenance updates focused on Docker build args and fixing multiple results checking logic.
  • Metrics: 6 PRs 0 Issues

🔥 PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • 🚨 [2025-06-04] RELEASE: v2.7.1
    • Addressed regressions in torch.compile (HuggingFace model crashes, excessive cudagraph re-recording).
    • AMD Specific: Fixed Distributed Fused Adam in ROCm/APEX when using nccl_ub.
  • Details:
    • Fixed flex_attention performance regression on nanogpt speedrun.
    • Maintenance warning: Transitioning pytorch/audio into a maintenance phase.
  • Metrics: 1771 PRs 751 Issues (Extremely high volume)
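For context on the Distributed Fused Adam fix: "fused" optimizers compute the standard Adam update in a single kernel over flat parameter buffers. The math being fused is just the usual update; a minimal pure-Python sketch (illustrative only, not the APEX implementation):

```python
# Minimal sketch of the per-parameter Adam update that fused optimizers
# (e.g. APEX's Distributed Fused Adam) compute in one kernel over flat
# GPU buffers. Illustrative only; scalar math, default hyperparameters.
import math

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                    # three steps with a constant gradient
    p, m, v = adam_step(p, g=0.5, m=m, v=v, t=t)
print(p)                                 # drifts by ~lr per step: ~0.997
```

The distributed variant additionally shards optimizer state across ranks, which is where communication buffers like nccl_ub enter the picture.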

pytorch/torchtitan

  • Key Activity:
    • 🚨 [2025-06-18] RELEASE: v0.1.0 (First Pre-release).
    • Built on torch-2.8.0.dev and torchao.
  • Details:
    • Added logic for float8 MoE training to prevent Float8Linear modules from being converted back to regular linears.
    • Identified Llama4 Tensor Parallelism (TP) bugs regarding DTensor/Meta registration.
  • Metrics: 86 PRs 23 Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-06-28] Added Flux-Fast support to README.
    • Added Float8Tensor implementation.
  • Details:
    • Relaxed precision requirements for test_int8_wo_quant_save_load specifically for ROCm to fix CI failures.
  • Metrics: 149 PRs 27 Issues

pytorch/vision

  • Key Activity:
    • 🚨 [2025-06-04] RELEASE: v0.22.1
    • Deprecation Notice: Video decoding/encoding capabilities are being deprecated in favor of TorchCodec (targeted removal in v0.25).
  • Metrics: 31 PRs 13 Issues

🟩 NVIDIA Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • 🚨 [2025-06-05] RELEASE: v2.4
    • Major Features: Support for Float8CurrentScaling recipe (JAX), MXFP8 recipe (PyTorch), and switching among FP8 recipes during training.
  • Details:
    • Reduced framework extension binary size from 108MB to 2MB.
    • Fixed numerical bugs in LayerNorm/RMSNorm when using Sequence Parallelism.
  • Metrics: 0 PRs 0 Issues (Code pushed primarily via release)
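The Float8CurrentScaling recipe derives the quantization scale from the current tensor's absolute max (amax), rather than from a history window as in delayed scaling. A pure-Python sketch of that idea, assuming the FP8 E4M3 format (max normal value 448); this is not TransformerEngine code, and real kernels round to the actual E4M3 grid:

```python
# Sketch of FP8 "current scaling": scale = FP8_MAX / amax(current tensor).
# Illustrative only; rounding to the real E4M3 value grid is omitted.
E4M3_MAX = 448.0  # largest normal value representable in FP8 E4M3

def current_scale(values):
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0

def quantize(values, scale):
    # Scale into FP8 range and saturate at the format's limits.
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]

xs = [0.5, -2.0, 0.125]
s = current_scale(xs)          # 448 / 2.0 = 224.0
q = quantize(xs, s)
deq = [v / s for v in q]       # dequantize: exact here (powers of two)
print(s, deq)
```

MXFP8, by contrast, keeps one shared power-of-two scale per small block of elements instead of a single per-tensor scale, trading metadata for finer-grained dynamic range.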

facebookresearch/xformers

  • Key Activity:
    • 🚨 [2025-06-25] RELEASE: v0.0.31
    • Breaking: Dropped support for V100 and older GPUs.
    • Added Flash-Attention 3 support for Ampere GPUs.
  • Metrics: 0 PRs 0 Issues

🤖 Compilers & Languages

tile-ai/tilelang

  • Key Activity:
    • 🚨 [2025-06-05] RELEASE: v0.1.5
    • AMD Optimization: PRs merged later in the month added support for Vectorized FP8 DataPacking and Float8 matrix cores on AMD.
  • Details:
    • Extensive work on Warp Specialization and TMA integration.
    • Fixed AMD Docker issues related to Conda environments.
  • Metrics: 56 PRs 9 Issues
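The vectorized FP8 data-packing work is about moving several 8-bit values with one wide load/store. The host-side sketch below packs four FP8 bytes into a single 32-bit word; it is conceptual only (tilelang does this inside generated GPU kernels), and the helper names are hypothetical:

```python
# Conceptual sketch of FP8 data packing: four 8-bit values in one uint32,
# so a single vectorized instruction can move all four at once.
# Hypothetical helpers; not tilelang code.
import struct

def pack_fp8x4(byte_vals):
    """Pack four FP8 bytes (0..255) into one little-endian uint32."""
    assert len(byte_vals) == 4
    return struct.unpack("<I", bytes(byte_vals))[0]

def unpack_fp8x4(word):
    """Inverse: split a uint32 back into its four FP8 bytes."""
    return list(struct.pack("<I", word))

w = pack_fp8x4([0x3C, 0x00, 0xBC, 0x7F])
print(hex(w), unpack_fp8x4(w))
```

On GPUs the same trick lets FP8 operands feed matrix-core instructions at full memory bandwidth instead of issuing four byte-wide accesses.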

triton-lang/triton

  • Key Activity:
    • AMD Specific: PR merged to “Enable more passing fp8 downcast clamping tests.”
    • Documentation updates regarding custom LLVM builds.
  • Metrics: 323 PRs 30 Issues
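"FP8 downcast clamping" refers to saturating conversion semantics: values beyond the FP8 type's representable range clamp to ±max rather than overflowing. A minimal sketch of those semantics, assuming the standard OCP FP8 limits (E4M3 max 448, E5M2 max 57344); this is not Triton's actual lowering:

```python
# Sketch of saturating ("clamping") FP8 downcast semantics: out-of-range
# inputs clamp to +/-max instead of overflowing. Not Triton's lowering.
E4M3_MAX = 448.0    # FP8 E4M3: no infinity; max normal value 448
E5M2_MAX = 57344.0  # FP8 E5M2: max normal value 57344

def downcast_clamp(x, fp8_max):
    return max(-fp8_max, min(fp8_max, x))

print([downcast_clamp(x, E4M3_MAX) for x in (1.0, 1000.0, -1e6)])
```

Since E4M3 has no infinity encoding at all, saturating behavior is the only sensible downcast, which is why these conformance tests matter for backend parity.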

openxla/xla

  • Key Activity:
    • Heavy integration work with LLVM upstream.
    • Shardy propagation fixes for auto SPMD partitioning.
  • Metrics: 1109 PRs 16 Issues

📈 RLHF, Inference & Serving

volcengine/verl

  • Key Activity:
    • 🚨 [2025-06-27] RELEASE: v0.4.1 & [2025-06-06] RELEASE: v0.4.0
    • Massive Update: Added support for DeepSeek 671B & Qwen3 235B (MoE models) using the Megatron backend.
  • Details:
    • Switched checkpointer to mcore’s dist_checkpoint.
    • Integrated SGLang for multi-turn rollout and tool calling.
    • Added recipes for Search-R1 and Entropy Mechanism (Clip-Cov/KL-Cov).
  • Metrics: 0 PRs reported (Release focused)

vllm-project/vllm

  • Key Activity:
    • Documentation updates focusing on community contact and Slack integration.
  • Metrics: 0 PRs 0 Issues (Low visible activity in tracked logs)

sgl-project/sglang

  • Key Activity:
    • Added documentation for GB200 NVL72 support.
  • Metrics: 0 PRs 0 Issues

🔵 JAX & Google

jax-ml/jax

  • Key Activity:
    • 🚨 [2025-06-17] RELEASE: v0.6.2
    • Added jax.tree.broadcast. Raised the minimum supported NumPy version to 1.26.
  • Metrics: 0 PRs 0 Issues
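The new jax.tree.broadcast is about pytree prefix broadcasting: each leaf of a shallow "prefix" tree is replicated across the matching subtree of a full tree. A rough, pure-Python illustration of that idea over nested dicts; the helper below is hypothetical and is not JAX's implementation:

```python
# Rough illustration of pytree prefix broadcasting, the idea behind
# jax.tree.broadcast: a prefix-tree leaf is replicated across the matching
# subtree of the full tree. Hypothetical helper; dicts only, not JAX code.

def broadcast_prefix(prefix, full):
    if isinstance(full, dict) and not isinstance(prefix, dict):
        # prefix bottomed out early: replicate its leaf over the subtree
        return {k: broadcast_prefix(prefix, v) for k, v in full.items()}
    if isinstance(full, dict):
        # both are dicts: recurse key by key
        return {k: broadcast_prefix(prefix[k], full[k]) for k in full}
    return prefix  # full is a leaf: take the matched prefix value

full = {"w": {"a": 1, "b": 2}, "lr": 3}
prefix = {"w": 0.1, "lr": 0.5}
print(broadcast_prefix(prefix, full))
```

This pattern is common when pairing a per-parameter-group setting (like a learning rate) with a deeply nested parameter tree.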

AI-Hypercomputer/maxtext

  • Key Activity:
    • 🚨 [2025-06-16] RELEASE: pre-nnx-v0.1.0
    • Snapshot of the latest version dependent on Flax Linen before the transition to NNX.
  • Metrics: 0 PRs 0 Issues