GitHub Monthly Report: 2025-06-01 to 2025-06-30
🚀 Executive Summary
June 2025 was a pivotal month for the PyTorch ecosystem with PyTorch v2.7.1 focusing on stabilization and TorchTitan graduating to its first official pre-release (v0.1.0), signaling a serious push into large-scale training standardization. In the RLHF/Training space, Verl saw massive updates (v0.4.0/v0.4.1) adding support for massive MoE models (DeepSeek 671B, Qwen3) and FSDP2.
On the hardware/compiler front, TileLang demonstrated significant velocity with distinct optimizations for AMD GPUs (Float8 Matrix Cores), while NVIDIA’s TransformerEngine v2.4 introduced new FP8 recipes and drastically reduced binary sizes. ROCm documentation has begun referencing “TheRock,” hinting at a new kernel or stack distribution.
AMD Related Updates
- Primus & TorchTitan Integration: `AMD-AGI/Primus` has added specific backend support for TorchTitan, ensuring AMD hardware is ready for PyTorch’s new flagship training framework.
- TileLang Optimizations: The `tile-ai/tilelang` compiler (v0.1.5) merged specific AMD backend support, including Float8 Matrix Core support and vectorized FP8 data packing, improving high-performance kernel generation for AMD accelerators.
- TraceLens Coverage: `TraceLens` added performance modeling for `flash_attention` operations, critical for accurate profiling of modern LLM workloads on ROCm.
- ROCm Rebranding/Stack: `ROCm/ROCm` README updates now reference “TheRock,” suggesting a potential shift in distribution or kernel packaging naming conventions.
- PyTorch AO Support: `pytorch/ao` relaxed precision requirements for ROCm in quantization tests, unblocking INT8 save/load workflows.
Competitive Analysis
- NVIDIA TransformerEngine v2.4: NVIDIA released v2.4, significantly reducing the binary size (from 108 MB to 2 MB) and adding support for MXFP8 recipes and logical partitioning in JAX.
- Verl Dominance in RLHF: `volcengine/verl` is rapidly becoming the standard for post-training. The v0.4.x releases added support for DeepSeek-V3 (671B) and Qwen3-235B, explicitly supporting Megatron backends and FSDP2. This cements NVIDIA/Megatron compatibility in the most active RLHF repo.
- xFormers Drops V100: Meta’s `xformers` has dropped support for V100 GPUs (Volta), signaling that optimization efforts are now exclusively focused on Ampere and newer architectures (Hopper/Blackwell).
📂 Category Updates
🟢 AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-06-18] Added TorchTitan backend support to README.
- [2025-06-06] Refactored Megatron run scripts to support multi-backend via parameters.
- Details:
- The team is actively making Primus backend-agnostic, moving away from hardcoded Megatron dependencies to support TorchTitan.
- Metrics: 45 PRs (Open), 43 PRs (Closed)
AMD-AGI/TraceLens
- Key Activity:
- Focus on Flash Attention performance modeling.
- Details:
- Added `aten::_scaled_dot_product_flash_attention` to the performance model.
- New warning system implemented for host/device timeline misalignment.
- Metrics: 15 PRs (Open), 16 PRs (Closed)
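For context, TraceLens-style performance models consume profiler traces in which attention appears as a discrete `aten::` event. As a rough, stand-alone illustration of where those events come from (plain `torch.profiler`, not TraceLens itself; shapes are arbitrary), a trace can be captured like this. On CUDA builds with the flash backend, the event is named `aten::_scaled_dot_product_flash_attention`; CPU dispatches a different backend under a different name.

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

# Toy attention workload; shapes are illustrative, not from TraceLens.
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)

# Capture a trace of dispatched aten ops. On a CUDA build with flash
# attention available, the call below is recorded as
# aten::_scaled_dot_product_flash_attention.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    out = F.scaled_dot_product_attention(q, k, v)

# Summarize recorded ops: this table is the raw material a performance
# model like TraceLens builds on.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```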
ROCm/ROCm
- Key Activity:
- [2025-06-12] Documentation updates pointing to “TheRock”.
- Details:
- Issues reported regarding Ubuntu 24.04 detection of Ryzen AI MAX+ iGPUs (missing firmware).
- Driver crashes reported on kernels newer than 6.13.
- Metrics: 101 PRs (Open), 98 PRs (Closed)
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v0.1.5
- Details:
- Significant AMD support added: Float8 Matrix Core support and vectorized FP8 data packing.
- Added NVRTC execution backend.
- Fixed deepgemm examples and improved autotuning caching.
- Metrics: 56 PRs (Open), 59 PRs (Closed)
🔥 PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1
- Details:
- Torch.compile: Fixed excessive cudagraph re-recording for HF LLMs and graph breaks in `einops`.
- Distributed: Fixed issues with Distributed Fused Adam in ROCm/APEX.
- General: Addressed regressions in Llama2 model export.
- Metrics: 1771 PRs (Open), 1730 PRs (Closed) (Very High Velocity)
pytorch/torchtitan
- Key Activity:
- 🚨 [2025-06-18] RELEASE: v0.1.0 (First Pre-release)
- Details:
- Built on `torch-2.8.0.dev` and `torchao-0.12.0.dev`.
- Standardized model folder structure.
- Work on Llama4 Tensor Parallelism (TP) bugs involving DTensor metadata mismatches.
- Metrics: 86 PRs (Open), 63 PRs (Closed)
pytorch/ao (Architecture Optimization)
- Key Activity:
- Integration of Flux-Fast and Axolotl + QAT support.
- Details:
- Added `Float8Tensor`.
- AMD Specific: Relaxed precision tolerances for `test_int8_wo_quant_save_load` on ROCm to fix CI failures.
- Metrics: 149 PRs (Open), 139 PRs (Closed)
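The ROCm fix is an instance of a common quantization-test pattern: an int8 weight-only round-trip is compared against the float baseline under a tolerance, and backends with slightly different numerics need that tolerance loosened. A hand-rolled sketch of the idea (symmetric per-tensor quantization, not the actual `torchao` code path; the tolerance value is illustrative):

```python
import torch

def int8_weight_only_roundtrip(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor int8 quantize -> dequantize of a weight."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(64, 64)
w_dq = int8_weight_only_roundtrip(w)

# Tests compare the round-trip against the original weight. The tolerance
# must absorb the quantization step (about scale / 2); CI fixes like the
# ROCm one simply loosen atol/rtol on backends whose numerics differ.
torch.testing.assert_close(w_dq, w, atol=3e-2, rtol=0.0)
print("round-trip max abs error:", (w_dq - w).abs().max().item())
```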
pytorch/audio
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1
- Details:
- Strategic Shift: Explicitly announced the library is entering a “maintenance phase” to reduce redundancies with the wider ecosystem.
- Metrics: 30 PRs (Open), 21 PRs (Closed)
🟩 NVIDIA / Competitors
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v2.4
- Details:
- Binary size reduced from 108 MB to 2 MB.
- Added support for MXFP8 recipe.
- Fixed numerical issues with activation recompute in FP8.
- Metrics: 0 Issues/PRs tracked (Release only).
facebookresearch/xformers
- Key Activity:
- 🚨 [2025-06-25] RELEASE: v0.0.31
- Details:
- Dropped Support: V100 and older GPUs are no longer supported.
- Added support for Flash-Attention 3 on Ampere GPUs.
- Metrics: Low activity outside release.
🤖 LLM Training & Inference
volcengine/verl
- Key Activity:
- 🚨 [2025-06-27] RELEASE: v0.4.1
- 🚨 [2025-06-06] RELEASE: v0.4.0
- Details:
- MoE Support: DeepSeek 671B and Qwen3 235B support added.
- FSDP2: Now the recommended strategy over FSDP1.
- Integration: SGLang rollout with MCP client and tool calling.
- Switched the Megatron checkpointer to `mcore`’s distributed checkpointing.
- Metrics: Healthy maintenance (Issues and PRs handled alongside major releases).
openxla/xla
- Key Activity:
- Heavy development on LLVM integration and Shardy partitioning.
- Details:
- Integration of LLVM at newer commits.
- Discussions on generating HLO text IR via fuzzers.
- Metrics: 1109 PRs (Open), 1079 PRs (Closed) (Extremely High Velocity)
xdit-project/xDiT
- Key Activity:
- Focus on Image Generation parallelism.
- Details:
- Added support for SanaSprint and SANA models.
- Metrics: 4 PRs (Open), 5 PRs (Closed)