GitHub Monthly Report: 2025-06-01 to 2025-06-30
📅 Engineering Report (2025-06-01 - 2025-06-30)
🚀 Executive Summary
June 2025 was defined by significant maturation in training frameworks and low-level kernel optimization languages. volcengine/verl released major updates (v0.4.0/v0.4.1) enabling large MoE training (DeepSeek 671B) and advanced RLHF workflows, positioning it as a critical tool for reasoning model development. PyTorch released the v2.7.1 patch while simultaneously pushing the first pre-release of TorchTitan (v0.1.0), aiming to standardize large-scale model training.
AMD Related Updates
- Primus & TorchTitan Integration: AMD’s Primus repository added backend support for TorchTitan, ensuring AMD hardware is ready for PyTorch’s next-generation distributed training reference implementation.
- TraceLens Flash Attention Support: TraceLens added profiling support for aten::_flash_attention_forward and _scaled_dot_product_flash_attention, critical for optimizing LLM inference and training on ROCm.
- TileLang AMD Support: The tile-ai/tilelang compiler project (v0.1.5) added explicit support for AMD Float8 Matrix Cores and vectorized FP8 DataPacking, significantly lowering the barrier to writing high-performance custom kernels on AMD GPUs.
- ROCm “TheRock”: Documentation updates in ROCm/ROCm point to a new component or rebrand dubbed “TheRock,” alongside fixes for Ubuntu 24.04 and Kernel 6.13 scheduler crashes.
- PyTorch Fixes: PyTorch v2.7.1 included specific fixes for ROCm/APEX Distributed Fused Adam when using NCCL user buffers.
Competitive Analysis
- NVIDIA TransformerEngine v2.4: NVIDIA continues to optimize the low-precision stack with the release of TransformerEngine v2.4, adding support for MXFP8 recipes and reducing framework extension binary sizes by 98%.
- Legacy Hardware Deprecation: facebookresearch/xformers (v0.0.31) officially dropped support for V100 (Volta) GPUs, signaling an ecosystem-wide shift toward Ampere and newer architectures.
- RLHF Explosion: The release of verl v0.4.0 with support for DeepSeek-V3, Qwen3-235B, and FSDP2 demonstrates that the open-source community is rapidly catching up on infrastructure for training massive Mixture-of-Experts (MoE) models, a domain previously dominated by proprietary stacks.
📂 Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- Refactoring for multi-backend support.
- Integration with TorchTitan, PyTorch’s new reference training stack.
- Details:
- [2025-06-18] docs: add TorchTitan backend support entry to README
- [2025-06-06] refactor: move Megatron run scripts to examples root and add --backend parameter
- Metrics: 45 New PRs, 43 Closed PRs
AMD-AGI/TraceLens
- Key Activity:
- Enhanced profiling coverage for Flash Attention operations (see the profiler sketch below).
- Improved error handling for timeline alignment.
- Details:
- [Highlight] New PR: add aten::_scaled_dot_product_flash_attention to perf model
- [Highlight] New Issue: Throw warning for host device misaligned timelines
- Metrics: 15 New PRs, 13 New Issues
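For context, the ops named above are what PyTorch's profiler records for fused attention. A minimal torch.profiler sketch (illustrative only, not TraceLens itself) that surfaces the same aten::_scaled_dot_product_flash_attention events the new perf model covers:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from torch.profiler import ProfilerActivity, profile

# Illustrative sketch (not TraceLens itself): capture the fused attention
# op that the new perf model covers. Requires a CUDA/ROCm device with
# flash attention support; shapes are (batch, heads, seq, head_dim).
q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        F.scaled_dot_product_attention(q, k, v)

# aten::_scaled_dot_product_flash_attention shows up in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```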
ROCm/ROCm
- Key Activity:
- Documentation updates referencing “TheRock.”
- Kernel compatibility issues reported for Linux 6.13.
- Details:
- [2025-06-12] DOC UPDATE: Update README.md to point to TheRock
- [Highlight] New Issue: kernels > 6.13 crash driver due to scheduler comp_1.1.1 is not ready
- Metrics: 101 New PRs, 22 New Issues
tile-ai/tilelang
- Key Activity:
- 🚨 RELEASE: v0.1.5 - Major update for kernel generation.
- Significant AMD-specific optimizations (Float8 Matrix Core).
- Details:
- [Feature] Support AMD float8 matrix core
- [AMD] Support Vectorized FP8 DataPacking
- [Fix] Fix AMD Docker issues related to conda environment setup
- Metrics: Active Development (Release Notes)
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- 🚨 RELEASE: v2.7.1 - Patch release focusing on regressions.
- Fixes for torch.compile, Flex Attention, and distributed contexts (see the compile sketch below).
- Details:
- [Fix] Fix an issue related to Distributed Fused Adam in ROCm/APEX when using the nccl_ub feature
- [Fix] Fix torch.compile on some HuggingFace models
- [Fix] Fix performance regression on nanogpt speedrun
- Metrics: 1771 New PRs, 751 New Issues
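For context on the torch.compile fix above, the affected entry point is the standard one. A minimal sketch (the toy module is a stand-in; the fix concerns certain HuggingFace models compiled the same way):

```python
import torch

# Minimal torch.compile sketch; the toy module stands in for a
# HuggingFace model, which is compiled the same way:
# compiled = torch.compile(hf_model).
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj(x))

model = TinyBlock()
compiled = torch.compile(model)        # default Inductor backend
out = compiled(torch.randn(4, 64))     # first call triggers compilation
```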
pytorch/torchtitan
- Key Activity:
- 🚨 RELEASE: v0.1.0 - First pre-release of the clean-slate large model training library.
- Benchmarks and architectural refactoring.
- Details:
- [2025-06-18] Release v0.1.0: Compatible with torch-2.8.0.dev.
- [Bug] Llama4 TP bug: DTensor local tensor dtype does not match DTensorSpec
- Metrics: 86 New PRs, 23 New Issues
pytorch/ao (Architecture Optimization)
- Key Activity:
- Integration with Axolotl and Flux-Fast.
- Details:
- [2025-06-25] DOC UPDATE: Call out axolotl + QAT integration on README
- [Issue] Fail for ROCm: test_int8_wo_quant_save_load (see the sketch below)
- Metrics: 0 New PRs, 27 New Issues
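The failing ROCm test exercises the int8 weight-only quantize/save/load round trip. A minimal sketch of that flow (toy model, not the actual test), assuming a recent torchao with the quantize_ API:

```python
import torch
from torchao.quantization import int8_weight_only, quantize_

# Sketch of the round trip that test_int8_wo_quant_save_load exercises
# (toy model, not the real test). quantize_ swaps weights in place for
# int8 weight-only tensor subclasses.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
quantize_(model, int8_weight_only())

torch.save(model.state_dict(), "int8_wo.pt")
# weights_only=False because torchao tensor subclasses may not be
# allow-listed for the default weights_only loader.
state = torch.load("int8_wo.pt", weights_only=False)
model.load_state_dict(state)  # the save/load step is where the ROCm issue appears
```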
pytorch/audio
- Key Activity:
- 🚨 RELEASE: v2.7.1 - Maintenance mode announcement.
- Details:
- NOTE: Project transitioning to maintenance phase. Removing user-facing redundancies to focus solely on audio data processing.
- Metrics: 0 New PRs, 0 New Issues
facebookresearch/xformers
- Key Activity:
- 🚨 RELEASE: v0.0.31
- Hardware support lifecycle changes.
- Details:
- Removed: Support for V100 or older GPUs.
- Added: Support for Flash-Attention 3 on Ampere GPUs (see the sketch below).
- Feature: Wheels are now Python-version agnostic.
- Metrics: 0 New PRs (External)
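The usual entry point is memory_efficient_attention, which dispatches to the fastest backend available on the device (per v0.0.31, Ampere or newer). A minimal sketch:

```python
import torch
import xformers.ops as xops

# Minimal sketch of the xformers entry point. The library auto-selects
# a backend (which can be a Flash-Attention kernel on supported GPUs);
# tensors use the (batch, seq, heads, head_dim) layout.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)
```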
NVIDIA & Competitive Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 RELEASE: v2.4
- Performance optimizations and FP8 recipe expansions (see the sketch below).
- Details:
- [PyTorch] Added support for MXFP8 recipe
- [PyTorch] Reduced binary size of framework extension library from 108 MB to 2 MB
- [JAX] Added support for Float8CurrentScaling recipe
- Metrics: 0 New PRs (External)
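FP8 execution in TransformerEngine is driven by a recipe object passed to fp8_autocast. A hedged sketch of what the new MXFP8 path might look like; the class name MXFP8BlockScaling is an assumption to verify against the v2.4 docs, and MXFP8 also requires supporting hardware:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hedged sketch: pairing fp8_autocast with a recipe object is the
# standard TE pattern. MXFP8BlockScaling is assumed to be the class
# behind the new MXFP8 recipe -- verify against the v2.4 docs.
fp8_recipe = recipe.MXFP8BlockScaling()

layer = te.Linear(1024, 1024).cuda()
x = torch.randn(16, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```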
volcengine/verl (RLHF/RLAIF)
- Key Activity:
- 🚨 RELEASE: v0.4.0 & v0.4.1
- Massive update for large-scale reinforcement learning.
- Details:
- Large MoE Support: DeepSeek 671b & Qwen3 235b support via Megatron backend.
- SGLang Integration: Sample-level rollout with tool calling and multi-turn RL.
- FSDP2: Recommended as the replacement for FSDP1, with CPU offloading support (see the sketch below).
- LoRA: Enables training of 70B+ models on a single A100x8 node.
- Metrics: High Release Activity
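FSDP2 replaces the FSDP1 wrapper class with the composable fully_shard API. A minimal sketch of the pattern, assuming torch >= 2.6 (where fully_shard and CPUOffloadPolicy are public) and a process group initialized via torchrun:

```python
import torch
from torch.distributed.fsdp import CPUOffloadPolicy, fully_shard

# Minimal FSDP2 sketch (assumes torch >= 2.6 and an initialized process
# group, e.g. launched with torchrun). Unlike FSDP1, fully_shard mutates
# modules in place rather than wrapping them.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

for layer in model:
    if isinstance(layer, torch.nn.Linear):   # shard parameterized submodules
        fully_shard(layer, offload_policy=CPUOffloadPolicy())
fully_shard(model, offload_policy=CPUOffloadPolicy())  # then the root

out = model(torch.randn(8, 1024, device="cuda"))
```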
deepseek-ai/DeepEP
- Key Activity:
- Architecture expansion.
- Details:
- [2025-06-11] DOC UPDATE: Support Ampere architecture
- Metrics: 0 New PRs
JAX Ecosystem
jax-ml/jax
- Key Activity:
- 🚨 RELEASE: v0.6.2
- Details:
- New Feature: jax.tree.broadcast (see the sketch below).
- Raised minimum requirements: NumPy 1.26, SciPy 1.12.
- Metrics: 0 New PRs (External)
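The changelog describes jax.tree.broadcast as a pytree prefix broadcast. A hedged sketch of the expected behavior; the argument order (prefix first) is an assumption to check against the v0.6.2 docs:

```python
import jax

# Hedged sketch of jax.tree.broadcast (new in v0.6.2): broadcast a
# prefix pytree against a fuller structure. Argument order assumed to
# be (prefix, full) -- verify against the official docs.
prefix = {"a": 1, "b": 2}
full = {"a": (0, 0), "b": [0, 0, 0]}

out = jax.tree.broadcast(prefix, full)
# Expected under the assumed semantics: {"a": (1, 1), "b": [2, 2, 2]}
print(out)
```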
AI-Hypercomputer/maxtext
- Key Activity:
- 🚨 RELEASE: pre-nnx-v0.1.0
- Details:
- Final release fully depending on Flax Linen before the migration to NNX (see the sketch below).
- Metrics: 0 New PRs
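For context on what the migration means: Flax NNX replaces Linen's functional init/apply pattern with eagerly initialized Python objects. A minimal NNX-style sketch (illustrative only, not maxtext code):

```python
import jax.numpy as jnp
from flax import nnx

# Minimal NNX-style module (illustrative, not maxtext code). Unlike
# Linen, parameters are created eagerly in __init__ instead of through
# a separate init/apply step.
class MLP(nnx.Module):
    def __init__(self, din: int, dhid: int, dout: int, *, rngs: nnx.Rngs):
        self.fc1 = nnx.Linear(din, dhid, rngs=rngs)
        self.fc2 = nnx.Linear(dhid, dout, rngs=rngs)

    def __call__(self, x):
        return self.fc2(nnx.relu(self.fc1(x)))

model = MLP(4, 8, 2, rngs=nnx.Rngs(0))  # parameters exist immediately
y = model(jnp.ones((1, 4)))
```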