📅 Engineering Report (2025-06-01 - 2025-06-30)

🚀 Executive Summary

June 2025 was a pivotal month for Reinforcement Learning from Human Feedback (RLHF) and Distributed Training infrastructure. Volcengine’s verl saw massive activity with two major releases (v0.4.0/v0.4.1), adding support for DeepSeek-V3/671B and Qwen3 MoE models, indicating a rapid maturation of open-source RLHF stacks. In the PyTorch ecosystem, the v2.7.1 release focused on stability (torch.compile fixes), while TorchTitan officially hit its pre-release milestone (v0.1.0), signaling that PyTorch’s native answer to Megatron-LM is ready for broader testing. AMD’s ecosystem showed strong momentum in compiler optimization (TileLang/Triton) and backend interoperability (Primus adding TorchTitan support).

  • Infrastructure Interoperability: AMD-AGI/Primus has added backend support for TorchTitan, aligning AMD’s training stack with PyTorch’s emerging native distributed framework, moving beyond just Megatron-LM dependencies.
  • Compiler Optimization (TileLang & Triton): tile-ai/tilelang (v0.1.5+) saw significant AMD-specific commits, including Vectorized FP8 DataPacking and Float8 Matrix Core support. Similarly, triton-lang/triton enabled more passing FP8 downcast clamping tests for AMD.
  • ROCm Stability: Issues were flagged regarding Ubuntu 24.04 on Ryzen AI Max+ iGPUs and crashes on Linux kernels newer than 6.13, highlighting integration challenges with newer OS/kernel combinations. Documentation updates point to “TheRock,” suggesting a new centralization of resources.
  • PyTorch Support: PyTorch v2.7.1 included a specific fix for Distributed Fused Adam in ROCm/APEX when using nccl_ub, addressing a distributed training regression.

Competitive Analysis

  • NVIDIA FP8 Maturity: TransformerEngine v2.4 was released with extensive FP8 recipe support (Float8CurrentScaling, MXFP8) and caching optimizations, reinforcing NVIDIA’s lead in low-precision training efficiency.
  • Hardware Support Deprecation: facebookresearch/xformers released v0.0.31, which dropped support for V100 GPUs, signaling that major libraries are aggressively deprecating pre-Ampere architectures to focus on Flash-Attention 3 optimization.
  • RLHF Scaling: The verl updates (v0.4.0) introduced support for massive MoE models (DeepSeek 671B) on Megatron backends, effectively creating a turnkey solution for training state-of-the-art models that competes with proprietary stacks.
  • DeepEP Expansion: DeepSeek’s communication library (DeepEP) added support for Ampere architecture, expanding its utility beyond Hopper GPUs.

📂 Category Updates

🟥 AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2025-06-18] Added TorchTitan backend support to README, signaling a strategic shift to support PyTorch’s native distributed training tools.
    • [2025-06-06] Refactored Megatron run scripts to support a multi-backend parameter structure.
  • Details:
    • New PRs focused on feat(megatron) for dumping pipeline schedule data and visualization tools.
    • Documentation updates for Docker image v25.5_py310.
  • Metrics: 45 PRs 0 Issues (High development velocity, focused on PR merging)
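The pipeline-schedule dumping and visualization work mentioned above can be illustrated with a toy generator. The sketch below emits the textbook 1F1B (one-forward-one-backward) schedule as text rows; it is a conceptual illustration, not Primus code, and every name in it is hypothetical.

```python
# Toy sketch: dump a 1F1B pipeline schedule as per-stage op lists.
# Conceptual only -- not Primus/Megatron code; names are hypothetical.

def one_f_one_b(num_stages: int, num_microbatches: int):
    """Return per-stage lists of ('F', mb) / ('B', mb) ops in 1F1B order."""
    schedule = []
    for stage in range(num_stages):
        # Earlier stages run more warm-up forwards before the first backward.
        warmup = min(num_stages - stage - 1, num_microbatches)
        ops, fwd, bwd = [], 0, 0
        for _ in range(warmup):            # warm-up phase: forwards only
            ops.append(("F", fwd)); fwd += 1
        while fwd < num_microbatches:      # steady state: 1 forward, 1 backward
            ops.append(("F", fwd)); fwd += 1
            ops.append(("B", bwd)); bwd += 1
        while bwd < num_microbatches:      # cooldown: drain remaining backwards
            ops.append(("B", bwd)); bwd += 1
        schedule.append(ops)
    return schedule

for stage, ops in enumerate(one_f_one_b(4, 8)):
    print(f"stage {stage}: " + " ".join(f"{kind}{mb}" for kind, mb in ops))
```

Dumping this kind of table is what makes schedule visualization tools possible: each row can be rendered as a timeline lane per pipeline stage.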

AMD-AGI/TraceLens

  • Key Activity:
    • Focus on performance modeling for Flash Attention (aten::_flash_attention_forward).
    • Added warnings for host/device timeline misalignment to improve profiling accuracy.
  • Metrics: 15 PRs 13 Issues
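The misalignment warning above guards against skewed host and device clocks in a trace. A minimal sketch of such a check, assuming hypothetical per-event timestamp lists (this is not TraceLens code):

```python
# Conceptual sketch of a host/device timeline misalignment check.
# All names are hypothetical; not TraceLens code.

def check_alignment(host_launch_us, device_start_us, tolerance_us=5.0):
    """Warn when a device kernel appears to start before its host launch,
    which usually indicates skewed host/device clocks in the trace."""
    warnings = []
    for i, (h, d) in enumerate(zip(host_launch_us, device_start_us)):
        if d + tolerance_us < h:
            warnings.append(
                f"event {i}: device start {d}us precedes host launch {h}us"
            )
    return warnings

print(check_alignment([10.0, 20.0], [12.0, 8.0]))
```

A kernel that "starts" before its own launch call is physically impossible, so any such event is a reliable signal that the two timelines need re-alignment before performance modeling.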

ROCm/ROCm

  • Key Activity:
    • [2025-06-12] Documentation updates pointing to “TheRock”, likely a new central hub or branding for ROCm resources.
    • Crucial Bugs: Identified firmware detection issues on Ubuntu 24.04 (Ryzen AI Max+) and driver crashes on Kernels > 6.13.
  • Details:
    • Fixes merged for hardcoded gfx versions in MIOpen CK scripts.
  • Metrics: 101 PRs 22 Issues

ROCm/MAD

  • Key Activity:
    • Maintenance updates focused on Docker build args and fixing multiple results checking logic.
  • Metrics: 6 PRs 0 Issues

🔥 PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • 🚨 [2025-06-04] RELEASE: v2.7.1
    • Addressed regressions in torch.compile (HuggingFace model crashes, excessive cudagraph re-recording).
    • AMD Specific: Fixed Distributed Fused Adam in ROCm/APEX when using nccl_ub.
  • Details:
    • Fixed flex_attention performance regression on nanogpt speedrun.
    • Maintenance warning: Transitioning pytorch/audio into a maintenance phase.
  • Metrics: 1771 PRs 751 Issues (Extremely high volume)
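For context on the Distributed Fused Adam fix: "fused" optimizers compute the standard Adam update in a single kernel over flat parameter buffers. The math being fused is just the usual update; a minimal pure-Python sketch (illustrative only, not the APEX implementation):

```python
# Minimal sketch of the per-parameter Adam update that fused optimizers
# (e.g. APEX's Distributed Fused Adam) compute in one kernel over flat
# GPU buffers. Illustrative only; scalar math, default hyperparameters.
import math

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first-moment EMA
    v = b2 * v + (1 - b2) * g * g        # second-moment EMA
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):                    # three steps with a constant gradient
    p, m, v = adam_step(p, g=0.5, m=m, v=v, t=t)
print(p)                                 # drifts by ~lr per step: ~0.997
```

The distributed variant additionally shards optimizer state across ranks, which is where communication buffers like nccl_ub enter the picture.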

pytorch/torchtitan

  • Key Activity:
    • 🚨 [2025-06-18] RELEASE: v0.1.0 (First Pre-release).
    • Built on torch-2.8.0.dev and torchao.
  • Details:
    • Added logic for float8 MoE training to prevent Float8Linear modules from being converted back to regular linears.
    • Identified Llama4 Tensor Parallelism (TP) bugs regarding DTensor/Meta registration.
  • Metrics: 86 PRs 23 Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2025-06-28] Added Flux-Fast support to README.
    • Added Float8Tensor implementation.
  • Details:
    • Relaxed precision requirements for test_int8_wo_quant_save_load specifically for ROCm to fix CI failures.
  • Metrics: 149 PRs 27 Issues

pytorch/vision

  • Key Activity:
    • 🚨 [2025-06-04] RELEASE: v0.22.1
    • Deprecation Notice: Video decoding/encoding capabilities are being deprecated in favor of TorchCodec (targeted removal in v0.25).
  • Metrics: 31 PRs 13 Issues

🟩 NVIDIA Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • 🚨 [2025-06-05] RELEASE: v2.4
    • Major Features: Support for Float8CurrentScaling recipe (JAX), MXFP8 recipe (PyTorch), and switching among FP8 recipes during training.
  • Details:
    • Reduced framework extension binary size from 108MB to 2MB.
    • Fixed numerical bugs in LayerNorm/RMSNorm when using Sequence Parallelism.
  • Metrics: 0 PRs 0 Issues (Code pushed primarily via release)
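The Float8CurrentScaling recipe derives the quantization scale from the current tensor's absolute max (amax), rather than from a history window as in delayed scaling. A pure-Python sketch of that idea, assuming the FP8 E4M3 format (max normal value 448); this is not TransformerEngine code, and real kernels round to the actual E4M3 grid:

```python
# Sketch of FP8 "current scaling": scale = FP8_MAX / amax(current tensor).
# Illustrative only; rounding to the real E4M3 value grid is omitted.
E4M3_MAX = 448.0  # largest normal value representable in FP8 E4M3

def current_scale(values):
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0

def quantize(values, scale):
    # Scale into FP8 range and saturate at the format's limits.
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]

xs = [0.5, -2.0, 0.125]
s = current_scale(xs)          # 448 / 2.0 = 224.0
q = quantize(xs, s)
deq = [v / s for v in q]       # dequantize: exact here (powers of two)
print(s, deq)
```

MXFP8, by contrast, keeps one shared power-of-two scale per small block of elements instead of a single per-tensor scale, trading metadata for finer-grained dynamic range.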

facebookresearch/xformers

  • Key Activity:
    • 🚨 [2025-06-25] RELEASE: v0.0.31
    • Breaking: Dropped support for V100 and older GPUs.
    • Added Flash-Attention 3 support for Ampere GPUs.
  • Metrics: 0 PRs 0 Issues

🤖 Compilers & Languages

tile-ai/tilelang

  • Key Activity:
    • 🚨 [2025-06-05] RELEASE: v0.1.5
    • AMD Optimization: PRs merged later in the month added support for Vectorized FP8 DataPacking and Float8 matrix cores on AMD.
  • Details:
    • Extensive work on Warp Specialization and TMA integration.
    • Fixed AMD Docker issues related to Conda environments.
  • Metrics: 56 PRs 9 Issues
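The vectorized FP8 data-packing work is about moving several 8-bit values with one wide load/store. The host-side sketch below packs four FP8 bytes into a single 32-bit word; it is conceptual only (tilelang does this inside generated GPU kernels), and the helper names are hypothetical:

```python
# Conceptual sketch of FP8 data packing: four 8-bit values in one uint32,
# so a single vectorized instruction can move all four at once.
# Hypothetical helpers; not tilelang code.
import struct

def pack_fp8x4(byte_vals):
    """Pack four FP8 bytes (0..255) into one little-endian uint32."""
    assert len(byte_vals) == 4
    return struct.unpack("<I", bytes(byte_vals))[0]

def unpack_fp8x4(word):
    """Inverse: split a uint32 back into its four FP8 bytes."""
    return list(struct.pack("<I", word))

w = pack_fp8x4([0x3C, 0x00, 0xBC, 0x7F])
print(hex(w), unpack_fp8x4(w))
```

On GPUs the same trick lets FP8 operands feed matrix-core instructions at full memory bandwidth instead of issuing four byte-wide accesses.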

triton-lang/triton

  • Key Activity:
    • AMD Specific: PR merged to “Enable more passing fp8 downcast clamping tests.”
    • Documentation updates regarding custom LLVM builds.
  • Metrics: 323 PRs 30 Issues
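"FP8 downcast clamping" refers to saturating conversion semantics: values beyond the FP8 type's representable range clamp to ±max rather than overflowing. A minimal sketch of those semantics, assuming the standard OCP FP8 limits (E4M3 max 448, E5M2 max 57344); this is not Triton's actual lowering:

```python
# Sketch of saturating ("clamping") FP8 downcast semantics: out-of-range
# inputs clamp to +/-max instead of overflowing. Not Triton's lowering.
E4M3_MAX = 448.0    # FP8 E4M3: no infinity; max normal value 448
E5M2_MAX = 57344.0  # FP8 E5M2: max normal value 57344

def downcast_clamp(x, fp8_max):
    return max(-fp8_max, min(fp8_max, x))

print([downcast_clamp(x, E4M3_MAX) for x in (1.0, 1000.0, -1e6)])
```

Since E4M3 has no infinity encoding at all, saturating behavior is the only sensible downcast, which is why these conformance tests matter for backend parity.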

openxla/xla

  • Key Activity:
    • Heavy integration work with LLVM upstream.
    • Shardy propagation fixes for auto SPMD partitioning.
  • Metrics: 1109 PRs 16 Issues

📈 RLHF, Inference & Serving

volcengine/verl

  • Key Activity:
    • 🚨 [2025-06-27] RELEASE: v0.4.1 & [2025-06-06] RELEASE: v0.4.0
    • Massive Update: Added support for DeepSeek 671B & Qwen3 235B (MoE models) using the Megatron backend.
  • Details:
    • Switched checkpointer to mcore’s dist_checkpoint.
    • Integrated SGLang for multi-turn rollout and tool calling.
    • Added recipes for Search-R1 and Entropy Mechanism (Clip-Cov/KL-Cov).
  • Metrics: 0 PRs reported (Release focused)

vllm-project/vllm

  • Key Activity:
    • Documentation updates focusing on community contact and Slack integration.
  • Metrics: 0 PRs 0 Issues (Low visible activity in tracked logs)

sgl-project/sglang

  • Key Activity:
    • Added documentation for GB200 NVL72 support.
  • Metrics: 0 PRs 0 Issues

🔵 JAX & Google

jax-ml/jax

  • Key Activity:
    • 🚨 [2025-06-17] RELEASE: v0.6.2
    • Added jax.tree.broadcast. Raised the minimum supported NumPy version to 1.26.
  • Metrics: 0 PRs 0 Issues
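The new jax.tree.broadcast is about pytree prefix broadcasting: each leaf of a shallow "prefix" tree is replicated across the matching subtree of a full tree. A rough, pure-Python illustration of that idea over nested dicts; the helper below is hypothetical and is not JAX's implementation:

```python
# Rough illustration of pytree prefix broadcasting, the idea behind
# jax.tree.broadcast: a prefix-tree leaf is replicated across the matching
# subtree of the full tree. Hypothetical helper; dicts only, not JAX code.

def broadcast_prefix(prefix, full):
    if isinstance(full, dict) and not isinstance(prefix, dict):
        # prefix bottomed out early: replicate its leaf over the subtree
        return {k: broadcast_prefix(prefix, v) for k, v in full.items()}
    if isinstance(full, dict):
        # both are dicts: recurse key by key
        return {k: broadcast_prefix(prefix[k], full[k]) for k in full}
    return prefix  # full is a leaf: take the matched prefix value

full = {"w": {"a": 1, "b": 2}, "lr": 3}
prefix = {"w": 0.1, "lr": 0.5}
print(broadcast_prefix(prefix, full))
```

This pattern is common when pairing a per-parameter-group setting (like a learning rate) with a deeply nested parameter tree.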

AI-Hypercomputer/maxtext

  • Key Activity:
    • 🚨 [2025-06-16] RELEASE: pre-nnx-v0.1.0
    • Snapshot of the latest version dependent on Flax Linen before the transition to NNX.
  • Metrics: 0 PRs 0 Issues