GitHub Monthly Report: 2025-06-01 to 2025-06-30
April 18, 2026
đź“… Engineering Report (2025-06-01 - 2025-06-30)
🚀 Executive Summary
June 2025 was marked by significant advancements in multi-modal and Massive Mixture-of-Experts (MoE) model scaling frameworks, as well as critical stabilization patches across major libraries. The broader ecosystem is aggressively standardizing FP8 workflows and advanced attention mechanisms. Maintenance health across the industry is exceptionally strong, exemplified by main pillars like pytorch/pytorch (closing 1,730 of 1,771 PRs) and openxla/xla (closing 1,079 of 1,109 PRs) resolving over 95% of their incoming pull requests within the month.
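The "over 95%" closure claim can be checked directly from the PR counts reported in the Metrics sections below; a quick sketch:

```python
def closure_rate(closed: int, opened: int) -> float:
    """Fraction of the month's incoming PRs that were closed."""
    return closed / opened

# Figures from this report's Metrics sections.
pytorch_rate = closure_rate(1730, 1771)  # ~0.977
xla_rate = closure_rate(1079, 1109)      # ~0.973

assert pytorch_rate > 0.95 and xla_rate > 0.95
```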
AMD Related Updates
- ROCm Stability & Upstreaming: The `pytorch/pytorch` 2.7.1 release included critical fixes for Distributed Fused Adam in ROCm/APEX when utilizing the `nccl_ub` feature.
- FP8 and Emulation: Native ecosystem support for AMD FP8 capabilities is accelerating. `triton-lang/triton` enabled multiple passing FP8 downcast clamping tests for AMD, while `tile-ai/tilelang` added support for vectorized FP8 data packing on AMD architectures.
- Performance Tracking: AMD’s internal profiling tool `TraceLens` is expanding its capabilities to capture fundamental optimizations, specifically adding PyTorch’s `_scaled_dot_product_flash_attention` to its performance model.
- Framework Adjustments: `pytorch/ao` relaxed precision requirements so the `test_int8_wo_quant_save_load` tests pass on ROCm.
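The "downcast clamping" behavior being tested can be illustrated with a minimal sketch: before a value is converted to FP8 E4M3, out-of-range magnitudes are saturated to the format's maximum finite value (448 in the OCP E4M3 encoding). This is a pure-Python illustration of the concept, not Triton's actual GPU implementation; mantissa rounding is omitted.

```python
# Minimal sketch of a saturating ("clamping") downcast to the FP8 E4M3 range.
# 448.0 is the largest finite value representable in OCP FP8 E4M3; real
# kernels also round the mantissa, which this sketch omits.
E4M3_MAX = 448.0

def clamp_to_e4m3_range(x: float) -> float:
    """Saturate a value into [-E4M3_MAX, E4M3_MAX] before FP8 conversion."""
    return max(-E4M3_MAX, min(E4M3_MAX, x))
```

With clamping, an overflowing input like 1e4 lands at 448.0 instead of producing a non-finite or wrapped bit pattern.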
Competitive Analysis
- NVIDIA’s Heavy Optimization: NVIDIA’s `TransformerEngine` 🚨 v2.4 brought aggressive optimizations, including a staggering framework-extension binary reduction from 108MB to just 2MB, and added support for MXFP8 recipes that overlap Tensor Parallelism communication with GEMMs.
- Hardware Feature Backporting: `facebookresearch/xformers` released v0.0.31, backporting Flash-Attention 3 capabilities to Ampere GPUs, significantly boosting the value proposition of NVIDIA’s older-generation hardware fleets, while officially dropping V100 support. Similarly, `deepseek-ai/DeepEP` explicitly added Ampere architecture support.
- DeepSeek & Massive MoE Scaling: Volcengine’s `verl` shipped massive updates (v0.4.0 & v0.4.1) natively supporting DeepSeek 671b and Qwen3 235b scaling via Megatron backends, demonstrating NVIDIA’s tight grip on ultra-large-scale MoE training loops.
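For context on the MXFP8 recipes mentioned above: microscaling FP8 stores one shared power-of-two scale per small block of elements, letting each block use the full FP8 range. A hedged pure-Python sketch of choosing such a block scale (the E4M3 max of 448 follows the OCP microscaling formats; this is an illustration, not TransformerEngine's implementation):

```python
import math

E4M3_MAX = 448.0  # largest finite OCP FP8 E4M3 value

def mx_block_scale(block: list) -> float:
    """Pick a power-of-two scale so the scaled block fits in E4M3 range."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    # Largest power of two p such that amax * 2**p <= E4M3_MAX.
    return 2.0 ** math.floor(math.log2(E4M3_MAX / amax))
```

Because the scale is an exact power of two, applying it is a cheap exponent adjustment rather than a full multiply on hardware.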
đź“‚ Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2025-06-18] Added TorchTitan backend support documentation to the README.
- [2025-06-06] Refactored Megatron run scripts to support multi-backend configurations.
- Details:
- [Highlight PR] Created comprehensive Primus Config/Patch documentation.
- [Highlight PR] Enabled dumping of pipeline parallelism (PP) schedule data and added a visualization tool for Megatron.
- Metrics: 45 PRs (43 Closed), 0 Issues
AMD-AGI/TraceLens
- Key Activity:
- Continued mapping out Aten-level Flash Attention operations for accurate timeline tracing and profiling.
- Details:
- [Highlight Issue] Identified a need to throw warnings for host-device misaligned timelines.
- [Highlight PR] Added `aten::_scaled_dot_product_flash_attention` to the performance modeling system.
- Metrics: 15 PRs (16 Closed), 13 Issues (4 Closed)
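A performance model like the one `TraceLens` is extending typically estimates an op's ideal cost from an analytic FLOP count. A hedged sketch for scaled dot-product attention, using the standard 4·B·H·S²·D matmul estimate (softmax and masking costs ignored; this is not TraceLens's actual model):

```python
def sdpa_matmul_flops(batch: int, heads: int, seq_len: int, head_dim: int) -> int:
    """Matmul FLOPs for scaled dot-product attention:
    2*B*H*S*S*D for Q @ K^T plus 2*B*H*S*S*D for attn @ V."""
    return 4 * batch * heads * seq_len * seq_len * head_dim
```

Dividing this estimate by a kernel's measured wall time yields achieved FLOP/s, which a profiler can compare against the device roofline.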
ROCm/ROCm
- Key Activity:
- [2025-06-12] Updated README.md to prominently feature and point to the `TheRock` installer near the top of the repo.
- Details:
- [Highlight Issue] Raised a driver crash issue for kernels > 6.13 due to scheduler `comp_1.1.1` not being ready.
- [Highlight Issue] Flagged missing firmware causing Ubuntu 24.04 to not detect AMD Ryzen AI MAX+ 395 iGPUs.
- [Highlight PR] Fixed hardcoded gfx configurations in the MIOpen CK script.
- Metrics: 101 PRs (98 Closed), 22 Issues (30 Closed)
ROCm/MAD
- Key Activity:
- Minor pipeline and build improvements.
- Details:
- [Highlight PR] Merged functionality to parse and load Docker build args.
- [Highlight PR] Fixed bugs related to multiple results checking logic.
- Metrics: 6 PRs (7 Closed), 0 Issues
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- 🚨 [2025-06-04] RELEASE: v2.7.1: A critical patch release resolving excessive cudagraph re-recording for Hugging Face LLM models, Flex Attention performance regressions, and ROCm/APEX Distributed Fused Adam bugs.
- [2025-06-10] Updated CUDA installation documentation regarding conda requirements.
- Details:
- [Highlight Issue] Tracked inductor dynamic shape failures when Hugging Face models create error guards.
- [Highlight PR] Added inductor lowerings for `adaptive_avg_pool3d` and `adaptive_max_pool3d`.
- [Highlight PR] Added support for complex numbers in `DTensor` redistributions.
- Metrics: 1771 PRs (1730 Closed), 751 Issues (661 Closed)
pytorch/torchtitan
- Key Activity:
- 🚨 [2025-06-18] RELEASE: v0.1.0: The first official pre-release of TorchTitan, pinned to `torch-2.8.0.dev20250617+cu126`.
- [2025-06-13] Added a dedicated folder for benchmarks and submission guidelines.
- Details:
- [Highlight PR] Added support for doing MoE conversion before Float8Linear conversion for seamless FP8 MoE training.
- [Highlight Issue] Debugged DTensor local tensor dtype mismatches in Llama 4 Tensor Parallelism.
- Metrics: 86 PRs (63 Closed), 23 Issues (17 Closed)
pytorch/ao
- Key Activity:
- [2025-06-28] Revamped README to include references to Flux-Fast and axolotl + QAT integrations.
- Details:
- [Highlight PR] Introduced the new `Float8Tensor` type to the quantization API.
- [Highlight PR] Relaxed precision requirements to fix the failing `test_int8_wo_quant_save_load` test suite on ROCm hardware.
- Metrics: 149 PRs (139 Closed), 27 Issues (7 Closed)
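"Relaxing precision requirements" in a test like `test_int8_wo_quant_save_load` usually means widening comparison tolerances so small backend-specific rounding differences (e.g. on ROCm) no longer fail the assertion. A minimal pure-Python sketch of tolerance-based comparison in the `torch.allclose` style (the tolerance values below are illustrative, not the ones used in `pytorch/ao`):

```python
def close_enough(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise check: |a - b| <= atol + rtol * |b|, allclose-style."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

ref = [1.000, 2.000, -3.000]
got = [1.0002, 1.9998, -3.0006]  # small backend-specific numeric drift

strict = close_enough(got, ref, rtol=1e-5)   # fails under the tight tolerance
relaxed = close_enough(got, ref, rtol=1e-3)  # passes once relaxed
```

Relaxing `rtol` is the usual fix when the quantized results are numerically valid but differ in the last bits between backends.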
pytorch/vision & pytorch/audio
- Key Activity:
- 🚨 [2025-06-04] RELEASE: torchvision v0.22.1 & torchaudio v2.7.1: Both libraries are moving older components to a maintenance phase. `torchvision` announced deprecation of video decoding in favor of the new `TorchCodec` project, and `torchaudio` is being scoped strictly to ML processing, stripping redundant ecosystem features.
- Metrics: Vision: 31 PRs (24 Closed); Audio: 0 PRs, 0 Issues
NVIDIA & Competitive Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v2.4: Massive framework update featuring MXFP8 recipes for userbuffers (overlapping TP/GEMM communication), Float8CurrentScaling in JAX, and binary size optimizations (reducing from 108MB to 2MB).
- Metrics: 0 PRs, 0 Issues (tracked internally, dropped directly as a release)
facebookresearch/xformers
- Key Activity:
- 🚨 [2025-06-25] RELEASE: v0.0.31: Introduced python-version agnostic wheels.
- [2025-06-25] Shipped support for Flash-Attention 3 on Ampere architecture GPUs, extending the lifespan of older enterprise hardware while formally removing support for V100/older GPUs.
- Metrics: 0 PRs, 0 Issues
deepseek-ai/DeepEP
- Key Activity:
- [2025-06-11] Upstreamed support for Ampere architecture GPUs for DeepSeek’s internal EP mechanisms.
- [2025-06-27] Enforced stricter conditions for aggressive PTX instructions.
- Metrics: 0 PRs, 0 Issues
JAX & Compilers
openxla/xla
- Key Activity:
- Continued heavy compiler integrations, specifically syncing LLVM and handling Shardy automated partitioning integrations.
- Details:
- [Highlight PR] Integrated Shardy auto-SPMD partitioning field propagation from the config.
- [Highlight Issue] Discussing ways cross-architecture operator libraries (like cuBLAS) can be seamlessly applied in JAX.
- Metrics: 1109 PRs (1079 Closed), 16 Issues (10 Closed)
triton-lang/triton
- Key Activity:
- [2025-06-27] Documentation improvements around custom LLVM builds.
- Details:
- [Highlight PR] AMD specific update to enable more passing FP8 downcast clamping tests in Triton.
- [Highlight PR] Added extensive scheduling infrastructure to the refine-ops pass for pipeline tuning.
- Metrics: 323 PRs (314 Closed), 30 Issues (21 Closed)
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-06-05] RELEASE: v0.1.5: Extremely feature-dense release introducing implicit Warp Specialize programming models, Auto Layout Inference enhancements, and broad Docker enhancements.
- Details:
- [Highlight PR] Implemented vectorized FP8 data packing support for AMD.
- [Highlight PR] Fixed AMD Docker issues related to conda environment instantiation.
- [Highlight Issue] Discussed optimizations for Variable Length (varlen) NSA to match FA3 speeds.
- Metrics: 56 PRs (59 Closed), 9 Issues (10 Closed)
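The "vectorized FP8 data packing" above refers to bundling multiple 8-bit FP8 payloads into one wider machine word so they can move through memory in a single vectorized load or store. A pure-Python sketch of the packing arithmetic (little-endian byte order assumed; the real kernels operate on GPU registers, not Python integers):

```python
def pack_fp8x4(payloads: list) -> int:
    """Pack four 8-bit FP8 bit-patterns into one 32-bit word (little-endian)."""
    assert len(payloads) == 4 and all(0 <= b <= 0xFF for b in payloads)
    word = 0
    for i, b in enumerate(payloads):
        word |= b << (8 * i)
    return word

def unpack_fp8x4(word: int) -> list:
    """Inverse of pack_fp8x4: recover the four 8-bit payloads."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```

Packing four elements per 32-bit word quarters the number of memory transactions relative to byte-at-a-time access, which is where the vectorization win comes from.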
Large Language Model Tooling (vLLM, Verl, SGLang, Transformers)
volcengine/verl
- Key Activity:
- 🚨 [2025-06-06] RELEASE: v0.4.0 & 🚨 [2025-06-27] RELEASE: v0.4.1: Verl has become a dominant RL framework. The v0.4.x releases added support for ultra-large MoE models (DeepSeek 671b, Qwen3 235b) over Megatron backends, integrated tool-calling, and sample-level multi-turn RL workflows leveraging `sglang` rollouts.
- Metrics: Active release tracking, extremely high momentum in the RLHF pipeline space.
huggingface/transformers
- Key Activity:
- Minor model additions including Gemma3n and MARIAN updates.
- Details:
- [Highlight PR] Shipped several framework fixes specifically mapping Gemma3n functionality.
- Metrics: 462 PRs (423 Closed), 162 Issues (156 Closed)
xdit-project/xDiT
- Key Activity:
- [2025-06-26] Expanded multi-model support by bringing in SanaSprint and native SANA support.
- Metrics: 4 PRs (5 Closed), 7 Issues (2 Closed)