📅 Engineering Report (2025-06-01 - 2025-06-30)

🚀 Executive Summary

June 2025 was defined by significant maturation in training frameworks and low-level kernel programming languages. volcengine/verl released major updates (v0.4.0/v0.4.1) enabling large MoE training (DeepSeek-V3 671B) and advanced RLHF workflows, positioning it as a critical tool for reasoning model development. PyTorch released patch v2.7.1 while simultaneously pushing the first pre-release of TorchTitan (v0.1.0), aiming to standardize large-scale model training.

  • Primus & TorchTitan Integration: AMD’s Primus repository added backend support for TorchTitan, ensuring AMD hardware is ready for PyTorch’s next-generation distributed training reference implementation.
  • TraceLens Flash Attention Support: TraceLens added profiling support for aten::_flash_attention_forward and _scaled_dot_product_flash_attention, critical for optimizing LLM inference and training on ROCm.
  • TileLang AMD Support: The tile-ai/tilelang compiler project (v0.1.5) added explicit support for AMD Float8 Matrix Cores and vectorized FP8 DataPacking, significantly lowering the barrier for writing high-performance custom kernels on AMD GPUs.
  • ROCm “TheRock”: Documentation updates in ROCm/ROCm point to a new component or rebrand dubbed “TheRock,” alongside fixes for Ubuntu 24.04 and Kernel 6.13 scheduler crashes.
  • PyTorch Fixes: PyTorch v2.7.1 included specific fixes for ROCm/APEX Distributed Fused Adam when using NCCL user buffers.

Competitive Analysis

  • NVIDIA TransformerEngine v2.4: NVIDIA continues to optimize the low-precision stack with the release of TransformerEngine v2.4, adding support for MXFP8 recipes and reducing framework extension binary sizes by 98%.
  • Legacy Hardware Deprecation: facebookresearch/xformers (v0.0.31) officially dropped support for V100 GPUs (Volta), signaling a unified ecosystem shift toward Ampere and newer architectures.
  • RLHF Explosion: The release of verl v0.4.0 with support for DeepSeek-V3, Qwen3-235B, and FSDP2 demonstrates that the open-source community is rapidly catching up on infrastructure for training massive Mixture-of-Experts (MoE) models, a domain previously dominated by proprietary stacks.

📂 Category Updates

AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • Refactoring for multi-backend support.
    • Integration with PyTorch’s new training system.
  • Details:
    • [2025-06-18] docs: add TorchTitan backend support entry to README
    • [2025-06-06] refactor: move Megatron run scripts to examples root and add --backend parameter
  • Metrics: 45 New PRs, 43 Closed PRs

AMD-AGI/TraceLens

  • Key Activity:
    • Enhanced profiling coverage for Flash Attention operations.
    • Improved error handling for timeline alignment.
  • Details:
    • [Highlight] New PR: add aten::_scaled_dot_product_flash_attention to perf model
    • [Highlight] New Issue: Throw warning for host device misaligned timelines
  • Metrics: 15 New PRs, 13 New Issues

ROCm/ROCm

  • Key Activity:
    • Documentation updates referencing “TheRock.”
    • Kernel compatibility issues reported for Linux 6.13.
  • Details:
    • [2025-06-12] DOC UPDATE: Update README.md to point to TheRock
    • [Highlight] New Issue: kernels > 6.13 crash driver due to scheduler comp_1.1.1 is not ready
  • Metrics: 101 New PRs, 22 New Issues

tile-ai/tilelang

  • Key Activity:
    • 🚨 RELEASE: v0.1.5 - Major update for kernel generation.
    • Significant AMD-specific optimizations (Float8 Matrix Core).
  • Details:
    • [Feature] Support AMD float8 matrix core
    • [AMD] Support Vectorized FP8 DataPacking
    • [Fix] Fix AMD Docker issues related to conda environment setup
  • Metrics: Active Development (Release Notes)
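
To illustrate what vectorized FP8 DataPacking means at the bit level: GPU kernels move and multiply FP8 values four at a time by packing four 8-bit lanes into one 32-bit word. The pure-Python sketch below shows that layout; the function names and the little-endian lane order are illustrative assumptions, not tilelang's API.

```python
def pack_fp8x4(lanes):
    """Pack four 8-bit values (e.g. raw FP8 bit patterns) into one
    32-bit word, lane 0 in the lowest byte (little-endian order)."""
    assert len(lanes) == 4 and all(0 <= b <= 0xFF for b in lanes)
    word = 0
    for i, b in enumerate(lanes):
        word |= b << (8 * i)
    return word

def unpack_fp8x4(word):
    """Inverse: split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

# Round trip: four FP8 bit patterns in and out of one 32-bit register.
packed = pack_fp8x4([0x3C, 0x40, 0xBC, 0x00])
assert packed == 0x00BC403C
assert unpack_fp8x4(packed) == [0x3C, 0x40, 0xBC, 0x00]
```

On hardware, the packed word is what a single vectorized load/store or matrix-core instruction consumes, which is why the packing path matters for FP8 throughput.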

PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • 🚨 RELEASE: v2.7.1 - Patch release focusing on regressions.
    • Fixes for torch.compile, FlexAttention, and distributed contexts.
  • Details:
    • [Fix] Fix an issue with Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature
    • [Fix] Fix torch.compile failures on some HuggingFace models
    • [Fix] Fix performance regression on nanogpt speedrun
  • Metrics: 1771 New PRs, 751 New Issues

pytorch/torchtitan

  • Key Activity:
    • 🚨 RELEASE: v0.1.0 - First pre-release of the clean-slate large model training library.
    • Benchmarks and architectural refactoring.
  • Details:
    • [2025-06-18] Release v0.1.0: Compatible with torch-2.8.0.dev.
    • [Bug] Llama4 TP bug: DTensor local tensor dtype does not match DTensorSpec
  • Metrics: 86 New PRs, 23 New Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • Integration with Axolotl and Flux-Fast.
  • Details:
    • [2025-06-25] DOC UPDATE: Call out axolotl + QAT integration on README
    • [Issue] Fail for ROCm: test_int8_wo_quant_save_load
  • Metrics: 0 New PRs, 27 New Issues

pytorch/audio

  • Key Activity:
    • 🚨 RELEASE: v2.7.1 - Maintenance mode announcement.
  • Details:
    • NOTE: The project is transitioning to a maintenance phase, removing user-facing redundancies to focus solely on audio data processing.
  • Metrics: 0 New PRs, 0 New Issues

facebookresearch/xformers

  • Key Activity:
    • 🚨 RELEASE: v0.0.31
    • Hardware support lifecycle changes.
  • Details:
    • Removed: Support for V100 or older GPUs.
    • Added: Support for Flash-Attention 3 on Ampere GPUs.
    • Feature: Wheels are now Python-version agnostic.
  • Metrics: 0 New PRs (External)

NVIDIA & Competitive Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • 🚨 RELEASE: v2.4
    • Performance optimizations and FP8 recipe expansions.
  • Details:
    • [PyTorch] Added support for MXFP8 recipe
    • [PyTorch] Reduced binary size of framework extension library from 108 MB to 2 MB
    • [JAX] Added support for Float8CurrentScaling recipe
  • Metrics: 0 New PRs (External)
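
For context on what an MXFP8 recipe involves: microscaling (MX) formats group tensor elements into 32-wide blocks that share a single power-of-two (E8M0) scale, with each element stored as FP8. The toy sketch below mimics that blockwise scaling in pure Python; it rounds to integers as a stand-in for real FP8 (E4M3) rounding and assumes nothing about TransformerEngine's actual recipe API.

```python
import math

BLOCK = 32  # MX formats share one scale per 32-element block

def mx_quantize_block(xs, max_fp8=448.0):
    """Toy MX-style quantization: pick one power-of-two scale that maps
    the block's absolute max near the FP8 (E4M3) max of 448, then round
    each scaled element (integer rounding stands in for FP8 rounding)."""
    assert len(xs) == BLOCK
    amax = max(abs(x) for x in xs) or 1.0
    scale = 2.0 ** math.floor(math.log2(max_fp8 / amax))  # E8M0-style scale
    return [round(x * scale) for x in xs], scale

def mx_dequantize_block(quantized, scale):
    """Recover approximate values by dividing out the shared scale."""
    return [q / scale for q in quantized]

# A uniform block quantizes and dequantizes exactly in this toy model.
q, s = mx_quantize_block([0.5] * BLOCK)
assert s == 512.0
assert mx_dequantize_block(q, s) == [0.5] * BLOCK
```

The shared per-block scale is what distinguishes MXFP8 from earlier per-tensor FP8 recipes: outliers only distort the 32 elements in their own block rather than the whole tensor.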

volcengine/verl (RLHF/RLAIF)

  • Key Activity:
    • 🚨 RELEASE: v0.4.0 & v0.4.1
    • Massive update for Large Scale Reinforcement Learning.
  • Details:
    • Large MoE Support: DeepSeek-V3 671B & Qwen3 235B support via the Megatron backend.
    • SGLang Integration: Sample-level rollout with tool calling and multi-turn RL.
    • FSDP2: Recommended replacement for FSDP1, with CPU offloading.
    • LoRA: Enables training 70B+ models on a single node (8× A100).
  • Metrics: High Release Activity
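
A back-of-envelope sketch of why LoRA makes 70B+ training fit on one node: the frozen weight W is augmented with low-rank factors B (d×r) and A (r×d), so only 2·d·r of the d² parameters per layer carry gradients and optimizer state. The helper below is purely illustrative; the dimensions are hypothetical, not verl's defaults.

```python
def lora_trainable_fraction(d_model, rank):
    """Fraction of a (d_model x d_model) weight that LoRA actually trains.

    The base weight W stays frozen; only the low-rank adapters
    A (rank x d_model) and B (d_model x rank) receive gradients,
    so optimizer state shrinks by the same factor."""
    full_params = d_model * d_model
    lora_params = 2 * d_model * rank
    return lora_params / full_params

# Hypothetical 8192-wide projection with rank-16 adapters:
# roughly 0.4% of the parameters are trainable.
frac = lora_trainable_fraction(8192, 16)
assert abs(frac - 0.00390625) < 1e-15
```

With Adam-style optimizers keeping two extra states per trainable parameter, cutting trainables by ~250× is the difference between needing a cluster and fitting in 8 GPUs' memory.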

deepseek-ai/DeepEP

  • Key Activity:
    • Architecture expansion.
  • Details:
    • [2025-06-11] DOC UPDATE: Support Ampere architecture
  • Metrics: 0 New PRs

JAX Ecosystem

jax-ml/jax

  • Key Activity:
    • 🚨 RELEASE: v0.6.2
  • Details:
    • New Feature: jax.tree.broadcast.
    • Raised minimum requirements: NumPy 1.26, SciPy 1.12.
  • Metrics: 0 New PRs (External)
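
As a rough illustration of the new jax.tree.broadcast: a prefix pytree is broadcast against a full pytree by replicating each prefix leaf across the matching subtree. The pure-Python stand-in below sketches that semantics for lists and tuples only; it is an assumption-laden sketch, not the JAX implementation.

```python
def tree_broadcast(prefix, full):
    """Broadcast a prefix tree against a full tree: wherever `prefix`
    already bottoms out in a leaf but `full` still has structure,
    replicate the leaf across that subtree (lists/tuples only)."""
    is_node = lambda t: isinstance(t, (list, tuple))
    if is_node(full) and not is_node(prefix):
        # prefix leaf vs. full subtree: replicate the leaf downward
        return type(full)(tree_broadcast(prefix, f) for f in full)
    if is_node(full):
        # matching structure: recurse pairwise
        assert len(prefix) == len(full)
        return type(full)(tree_broadcast(p, f) for p, f in zip(prefix, full))
    return prefix  # both are leaves

# A scalar prefix fills the whole tree; a one-level prefix fills subtrees.
assert tree_broadcast(2, [1, [2, 3]]) == [2, [2, 2]]
assert tree_broadcast([10, 20], [1, [2, 3]]) == [10, [20, 20]]
```

This prefix-broadcast pattern is handy for pairing per-subtree settings (e.g. one learning-rate per parameter group) with a full parameter tree.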

AI-Hypercomputer/maxtext

  • Key Activity:
    • 🚨 RELEASE: pre-nnx-v0.1.0
  • Details:
    • Final release to depend fully on Flax Linen before the migration to NNX.
  • Metrics: 0 New PRs