📅 Engineering Report (2025-06-01 - 2025-06-30)

🚀 Executive Summary

June 2025 was defined by significant maturation in training frameworks and low-level kernel programming languages. volcengine/verl released major updates (v0.4.0/v0.4.1) enabling large MoE training (DeepSeek-V3 671B) and advanced RLHF workflows, positioning it as a critical tool for reasoning model development. PyTorch released patch v2.7.1 while simultaneously pushing the first pre-release of TorchTitan (v0.1.0), aiming to standardize large-scale model training.

  • Primus & TorchTitan Integration: AMD’s Primus repository added backend support for TorchTitan, ensuring AMD hardware is ready for PyTorch’s next-generation distributed training reference implementation.
  • TraceLens Flash Attention Support: TraceLens added profiling support for aten::_flash_attention_forward and _scaled_dot_product_flash_attention, critical for optimizing LLM inference and training on ROCm.
  • TileLang AMD Support: The tile-ai/tilelang compiler project (v0.1.5) added explicit support for AMD Float8 Matrix Cores and vectorized FP8 DataPacking, significantly lowering the barrier for writing high-performance custom kernels on AMD GPUs.
  • ROCm “TheRock”: Documentation updates in ROCm/ROCm point to a new component or rebrand dubbed “TheRock,” alongside fixes for Ubuntu 24.04 and Kernel 6.13 scheduler crashes.
  • PyTorch Fixes: PyTorch v2.7.1 included specific fixes for ROCm/APEX Distributed Fused Adam when using NCCL user buffers.

Competitive Analysis

  • NVIDIA TransformerEngine v2.4: NVIDIA continues to optimize the low-precision stack with the release of TransformerEngine v2.4, adding support for MXFP8 recipes and reducing framework extension binary sizes by 98%.
  • Legacy Hardware Deprecation: facebookresearch/xformers (v0.0.31) officially dropped support for V100 GPUs (Volta), signaling a unified ecosystem shift toward Ampere and newer architectures.
  • RLHF Explosion: The release of verl v0.4.0 with support for DeepSeek-V3, Qwen3-235B, and FSDP2 demonstrates that the open-source community is rapidly catching up on infrastructure for training massive Mixture-of-Experts (MoE) models, a domain previously dominated by proprietary stacks.

📂 Category Updates

AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • Refactoring for multi-backend support.
    • Integration with PyTorch’s new training system.
  • Details:
    • [2025-06-18] docs: add TorchTitan backend support entry to README
    • [2025-06-06] refactor: move Megatron run scripts to examples root and add --backend parameter
  • Metrics: 45 New PRs, 43 Closed PRs

AMD-AGI/TraceLens

  • Key Activity:
    • Enhanced profiling coverage for Flash Attention operations.
    • Improved error handling for timeline alignment.
  • Details:
    • [Highlight] New PR: add aten::_scaled_dot_product_flash_attention to perf model
    • [Highlight] New Issue: Throw warning for host device misaligned timelines
  • Metrics: 15 New PRs, 13 New Issues

ROCm/ROCm

  • Key Activity:
    • Documentation updates referencing “TheRock.”
    • Kernel compatibility issues reported for Linux 6.13.
  • Details:
    • [2025-06-12] DOC UPDATE: Update README.md to point to TheRock
    • [Highlight] New Issue: kernels > 6.13 crash driver due to scheduler comp_1.1.1 is not ready
  • Metrics: 101 New PRs, 22 New Issues

tile-ai/tilelang

  • Key Activity:
    • 🚨 RELEASE: v0.1.5 - Major update for kernel generation.
    • Significant AMD-specific optimizations (Float8 Matrix Core).
  • Details:
    • [Feature] Support AMD float8 matrix core
    • [AMD] Support Vectorized FP8 DataPacking
    • [Fix] Fix AMD Docker issues related to conda environment setup
  • Metrics: Active Development (Release Notes)
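
To illustrate what vectorized FP8 DataPacking means at the bit level: GPU kernels move and multiply FP8 values four at a time by packing four 8-bit lanes into one 32-bit word. The pure-Python sketch below shows that layout; the function names and the little-endian lane order are illustrative assumptions, not tilelang's API.

```python
def pack_fp8x4(lanes):
    """Pack four 8-bit values (e.g. raw FP8 bit patterns) into one
    32-bit word, lane 0 in the lowest byte (little-endian order)."""
    assert len(lanes) == 4 and all(0 <= b <= 0xFF for b in lanes)
    word = 0
    for i, b in enumerate(lanes):
        word |= b << (8 * i)
    return word

def unpack_fp8x4(word):
    """Inverse: split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

# Round trip: four FP8 bit patterns in and out of one 32-bit register.
packed = pack_fp8x4([0x3C, 0x40, 0xBC, 0x00])
assert packed == 0x00BC403C
assert unpack_fp8x4(packed) == [0x3C, 0x40, 0xBC, 0x00]
```

On hardware, the packed word is what a single vectorized load/store or matrix-core instruction consumes, which is why the packing path matters for FP8 throughput.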

PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • 🚨 RELEASE: v2.7.1 - Patch release focusing on regressions.
    • Fixes for torch.compile, FlexAttention, and distributed contexts.
  • Details:
    • [Fix] Fix an issue with Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature
    • [Fix] Fix torch.compile failures on some HuggingFace models
    • [Fix] Fix performance regression on nanogpt speedrun
  • Metrics: 1771 New PRs, 751 New Issues

pytorch/torchtitan

  • Key Activity:
    • 🚨 RELEASE: v0.1.0 - First pre-release of the clean-slate large model training library.
    • Benchmarks and architectural refactoring.
  • Details:
    • [2025-06-18] Release v0.1.0: Compatible with torch-2.8.0.dev.
    • [Bug] Llama4 TP bug: DTensor local tensor dtype does not match DTensorSpec
  • Metrics: 86 New PRs, 23 New Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • Integration with Axolotl and Flux-Fast.
  • Details:
    • [2025-06-25] DOC UPDATE: Call out axolotl + QAT integration on README
    • [Issue] Fail for ROCm: test_int8_wo_quant_save_load
  • Metrics: 0 New PRs, 27 New Issues

pytorch/audio

  • Key Activity:
    • 🚨 RELEASE: v2.7.1 - Maintenance mode announcement.
  • Details:
    • NOTE: The project is transitioning to a maintenance phase, removing user-facing redundancies to focus solely on audio data processing.
  • Metrics: 0 New PRs, 0 New Issues

facebookresearch/xformers

  • Key Activity:
    • 🚨 RELEASE: v0.0.31
    • Hardware support lifecycle changes.
  • Details:
    • Removed: Support for V100 or older GPUs.
    • Added: Support for Flash-Attention 3 on Ampere GPUs.
    • Feature: Wheels are now Python-version agnostic.
  • Metrics: 0 New PRs (External)

NVIDIA & Competitive Ecosystem

NVIDIA/TransformerEngine

  • Key Activity:
    • 🚨 RELEASE: v2.4
    • Performance optimizations and FP8 recipe expansions.
  • Details:
    • [PyTorch] Added support for MXFP8 recipe
    • [PyTorch] Reduced binary size of framework extension library from 108 MB to 2 MB
    • [JAX] Added support for Float8CurrentScaling recipe
  • Metrics: 0 New PRs (External)
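
For context on what an MXFP8 recipe involves: microscaling (MX) formats group tensor elements into 32-wide blocks that share a single power-of-two (E8M0) scale, with each element stored as FP8. The toy sketch below mimics that blockwise scaling in pure Python; it rounds to integers as a stand-in for real FP8 (E4M3) rounding and assumes nothing about TransformerEngine's actual recipe API.

```python
import math

BLOCK = 32  # MX formats share one scale per 32-element block

def mx_quantize_block(xs, max_fp8=448.0):
    """Toy MX-style quantization: pick one power-of-two scale that maps
    the block's absolute max near the FP8 (E4M3) max of 448, then round
    each scaled element (integer rounding stands in for FP8 rounding)."""
    assert len(xs) == BLOCK
    amax = max(abs(x) for x in xs) or 1.0
    scale = 2.0 ** math.floor(math.log2(max_fp8 / amax))  # E8M0-style scale
    return [round(x * scale) for x in xs], scale

def mx_dequantize_block(quantized, scale):
    """Recover approximate values by dividing out the shared scale."""
    return [q / scale for q in quantized]

# A uniform block quantizes and dequantizes exactly in this toy model.
q, s = mx_quantize_block([0.5] * BLOCK)
assert s == 512.0
assert mx_dequantize_block(q, s) == [0.5] * BLOCK
```

The shared per-block scale is what distinguishes MXFP8 from earlier per-tensor FP8 recipes: outliers only distort the 32 elements in their own block rather than the whole tensor.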

volcengine/verl (RLHF/RLAIF)

  • Key Activity:
    • 🚨 RELEASE: v0.4.0 & v0.4.1
    • Massive update for Large Scale Reinforcement Learning.
  • Details:
    • Large MoE Support: DeepSeek-V3 671B & Qwen3 235B support via the Megatron backend.
    • SGLang Integration: Sample-level rollout with tool calling and multi-turn RL.
    • FSDP2: Recommended replacement for FSDP1, with CPU offloading.
    • LoRA: Enables training 70B+ models on a single node (8× A100).
  • Metrics: High Release Activity
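
A back-of-envelope sketch of why LoRA makes 70B+ training fit on one node: the frozen weight W is augmented with low-rank factors B (d×r) and A (r×d), so only 2·d·r of the d² parameters per layer carry gradients and optimizer state. The helper below is purely illustrative; the dimensions are hypothetical, not verl's defaults.

```python
def lora_trainable_fraction(d_model, rank):
    """Fraction of a (d_model x d_model) weight that LoRA actually trains.

    The base weight W stays frozen; only the low-rank adapters
    A (rank x d_model) and B (d_model x rank) receive gradients,
    so optimizer state shrinks by the same factor."""
    full_params = d_model * d_model
    lora_params = 2 * d_model * rank
    return lora_params / full_params

# Hypothetical 8192-wide projection with rank-16 adapters:
# roughly 0.4% of the parameters are trainable.
frac = lora_trainable_fraction(8192, 16)
assert abs(frac - 0.00390625) < 1e-15
```

With Adam-style optimizers keeping two extra states per trainable parameter, cutting trainables by ~250× is the difference between needing a cluster and fitting in 8 GPUs' memory.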

deepseek-ai/DeepEP

  • Key Activity:
    • Architecture expansion.
  • Details:
    • [2025-06-11] DOC UPDATE: Support Ampere architecture
  • Metrics: 0 New PRs

JAX Ecosystem

jax-ml/jax

  • Key Activity:
    • 🚨 RELEASE: v0.6.2
  • Details:
    • New Feature: jax.tree.broadcast.
    • Raised minimum requirements: NumPy 1.26, SciPy 1.12.
  • Metrics: 0 New PRs (External)
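
As a rough illustration of the new jax.tree.broadcast: a prefix pytree is broadcast against a full pytree by replicating each prefix leaf across the matching subtree. The pure-Python stand-in below sketches that semantics for lists and tuples only; it is an assumption-laden sketch, not the JAX implementation.

```python
def tree_broadcast(prefix, full):
    """Broadcast a prefix tree against a full tree: wherever `prefix`
    already bottoms out in a leaf but `full` still has structure,
    replicate the leaf across that subtree (lists/tuples only)."""
    is_node = lambda t: isinstance(t, (list, tuple))
    if is_node(full) and not is_node(prefix):
        # prefix leaf vs. full subtree: replicate the leaf downward
        return type(full)(tree_broadcast(prefix, f) for f in full)
    if is_node(full):
        # matching structure: recurse pairwise
        assert len(prefix) == len(full)
        return type(full)(tree_broadcast(p, f) for p, f in zip(prefix, full))
    return prefix  # both are leaves

# A scalar prefix fills the whole tree; a one-level prefix fills subtrees.
assert tree_broadcast(2, [1, [2, 3]]) == [2, [2, 2]]
assert tree_broadcast([10, 20], [1, [2, 3]]) == [10, [20, 20]]
```

This prefix-broadcast pattern is handy for pairing per-subtree settings (e.g. one learning-rate per parameter group) with a full parameter tree.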

AI-Hypercomputer/maxtext

  • Key Activity:
    • 🚨 RELEASE: pre-nnx-v0.1.0
  • Details:
    • Final release to depend fully on Flax Linen before the migration to NNX.
  • Metrics: 0 New PRs