GitHub Monthly Report: 2025-07-01 to 2025-07-31
🚀 Executive Summary
July 2025 was a pivotal month for AI infrastructure, defined by major version releases across the stack (ROCm 6.4.2, JAX 0.7.0, Triton 3.4.0, and TransformerEngine 2.5). There is a distinct industry-wide push to support next-generation hardware (AMD GFX950 and NVIDIA Blackwell) and emerging model architectures (DeepSeek-V3/R1 and Llama 4).
AMD Related Updates
- 🚨 ROCm 6.4.2 Release: A significant update introducing FP8 metrics support for MI300 series in the Compute Profiler and expanded OS support (SLES 15 SP7, Oracle Linux).
- Next-Gen Hardware Prep: Triton 3.4.0 officially added support for AMD GFX950 architecture, including WMMA operations and specific optimizations.
- Ecosystem Expansion: ROCm documentation now officially lists support for Deep Graph Library (DGL), Stanford Megatron-LM, and Volcano Engine Reinforcement Learning (verl).
- Intel/Leaks: Both ROCm documentation and `pytorch/torchtitan` PRs contain explicit references to Llama 4, indicating that pre-release optimization work for Meta’s next model generation is already underway.
Competitive Analysis
- NVIDIA Blackwell Readiness: NVIDIA’s ecosystem is heavily preparing for Blackwell. TorchAO v0.12.0 added prototype MXFP/NVFP support for Blackwell, and Triton 3.4.0 added Blackwell Enhanced TMEM support.
- JAX Architecture Shift: JAX 0.7.0 marks a major breaking change, migrating from GSPMD to Shardy by default, signaling a shift in how Google handles distributed partitioning.
- DeepSeek Influence: The “DeepSeek” effect is visible in the stack. TorchAO implemented DeepSeek blockwise quantization, DeepEP (DeepSeek’s communication library) saw heavy activity, and Volcengine/verl added Qwen/DeepSeek specific recipes.
📂 Category Updates
🟢 AMD Ecosystem
ROCm/ROCm
- Key Activity:
- [2025-07-21] 🚨 RELEASE: rocm-6.4.2
- Details:
- New Features: FP8/BF16/Int8 support in Roofline profiling for MI300; transition from `rocm-smi` to `amd-smi`; new Offline Installer Creator options.
- Documentation: Added tutorials for “Profiling Llama-4 inference with vLLM” and FP8 quantization with AMD Quark.
- Deprecations: `hipcc.pl` Perl scripts and the `__AMDGCN_WAVEFRONT_SIZE` macro are slated for removal.
- Metrics: 109 New PRs / 107 Closed PRs (Healthy maintenance velocity)
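The new FP8/BF16/Int8 roofline metrics all feed the same simple model: attainable throughput is the lesser of the compute roof and the bandwidth roof at a kernel’s arithmetic intensity. A minimal sketch, with placeholder peak numbers (not actual MI300 specs):

```python
# Roofline sketch: attainable throughput for a kernel at a given
# arithmetic intensity (FLOPs per byte moved to/from HBM).
# Peak numbers below are illustrative placeholders, not MI300 specs.

def roofline(arith_intensity, peak_tflops, mem_bw_tb_s):
    """Attainable TFLOP/s = min(compute roof, bandwidth roof)."""
    return min(peak_tflops, arith_intensity * mem_bw_tb_s)

PEAK_FP8_TFLOPS = 2000.0   # placeholder peak FP8 throughput
HBM_TB_S = 5.0             # placeholder HBM bandwidth (TB/s)

low  = roofline(10.0,  PEAK_FP8_TFLOPS, HBM_TB_S)   # bandwidth-limited
high = roofline(500.0, PEAK_FP8_TFLOPS, HBM_TB_S)   # compute-limited
print(low, high)  # 50.0 2000.0
```

Per-precision roofs (FP8 vs. BF16 vs. Int8) differ only in the `peak_tflops` value plugged in, which is why the profiler reports them as separate ceilings on one chart.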
AMD-AGI/Primus
- Key Activity:
- Focus on Llama 3 70B sequence length issues and numerical bug fixes in TE grouped GEMM.
- Metrics: 38 New PRs / 39 Closed PRs
AMD-AGI/TraceLens
- Key Activity:
- Work on `all2allv` bandwidth calculations and adding Flash Attention to the performance model.
- Metrics: 23 New PRs / 23 Closed PRs
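A bandwidth estimate for a variable-count all-to-all typically divides the busiest rank’s traffic by the collective’s wall time. This is a minimal sketch of that idea, not TraceLens’s actual model:

```python
# Sketch: effective bandwidth for an all2allv collective, assuming we
# know each rank's per-destination send byte counts and the wall time.
# The "busiest rank bounds the collective" formula is an assumption,
# not TraceLens's exact performance model.

def all2allv_bandwidth_gb_s(send_bytes_per_rank, elapsed_s):
    """Bytes moved by the most-loaded rank / elapsed time, in GB/s."""
    busiest = max(sum(row) for row in send_bytes_per_rank)
    return busiest / elapsed_s / 1e9

# 2 ranks; rank 0 sends 1 GB total, rank 1 sends 2 GB total.
sends = [[0, 1_000_000_000], [2_000_000_000, 0]]
print(all2allv_bandwidth_gb_s(sends, 0.5))  # 4.0
```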
🔥 PyTorch & Meta Ecosystem
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2025-07-17] 🚨 RELEASE: v0.12.0
- Details:
- Highlights: Integration of QAT with Axolotl; Prototype MXFP and NVFP4 support for NVIDIA Blackwell; Implementation of DeepSeek blockwise quantization for FP8.
- Breaking Changes: Removal of Galore and several quantization primitive adjustments.
- Metrics: 0 New PRs / 0 Closed PRs (Post-release lull)
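The core of blockwise FP8 quantization is one scale per weight tile rather than per tensor, so outliers only distort their own block. A minimal numpy sketch (tile size, rounding, and the clipping stand-in for an FP8 cast are illustrative assumptions, not TorchAO’s implementation):

```python
import numpy as np

# Blockwise quantization sketch: one scale per (B x B) tile, chosen so
# each tile's absolute max maps to the FP8 E4M3 max magnitude (448).
# Tile size and rounding details are illustrative assumptions.

FP8_MAX = 448.0  # e4m3 max representable magnitude
B = 128

def quantize_blockwise(w):
    h, wd = w.shape  # assumed divisible by B for brevity
    q = np.empty_like(w)
    scales = np.empty((h // B, wd // B))
    for i in range(0, h, B):
        for j in range(0, wd, B):
            tile = w[i:i+B, j:j+B]
            s = max(float(np.abs(tile).max()) / FP8_MAX, 1e-12)
            scales[i // B, j // B] = s
            # Simulate the FP8 range with clipping; real kernels cast to fp8.
            q[i:i+B, j:j+B] = np.clip(tile / s, -FP8_MAX, FP8_MAX)
    return q, scales

def dequantize_blockwise(q, scales):
    # Broadcast each tile's scale back over its (B x B) region.
    return q * np.kron(scales, np.ones((B, B)))
```

The payoff is that a single extreme value no longer forces a coarse scale on the whole tensor, which is the property that makes FP8 viable for large MoE weights.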
pytorch/torchtitan
- Key Activity:
- Preparation for Llama 4.
- Details:
- [2025-07-xx] PRs explicitly mention `[llama4]` optimizations, including combining w1/w3 for efficient grouped GEMM and storing expert weights non-transposed.
- Significant work on MoE auxiliary-loss-free load balancing.
- Metrics: 118 New PRs / 99 Closed PRs
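The “combine w1/w3” change exploits the structure of a SwiGLU FFN: the gate and up projections share the same input, so stacking their weights turns two GEMMs into one larger one, which is what grouped-GEMM expert kernels want. A minimal numpy sketch with illustrative shapes:

```python
import numpy as np

# SwiGLU FFN sketch: silu(x @ w1) * (x @ w3). Concatenating w1 and w3
# column-wise fuses the two projections into a single GEMM.
# Shapes and names here are illustrative, not torchtitan's code.

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, h = 16, 32
x  = rng.standard_normal((4, d))
w1 = rng.standard_normal((d, h))  # gate projection
w3 = rng.standard_normal((d, h))  # up projection

# Two separate GEMMs:
ref = silu(x @ w1) * (x @ w3)

# One fused GEMM over the concatenated weight, then a split:
w13 = np.concatenate([w1, w3], axis=1)   # (d, 2h)
g, u = np.split(x @ w13, 2, axis=1)
fused = silu(g) * u

assert np.allclose(ref, fused)
```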
pytorch/pytorch
- Key Activity:
- Standard high-volume maintenance.
- Details:
- Issues reported regarding SimpleFSDP + TP embedding sharding errors.
- Documentation updates regarding build system requirements splitting.
- Metrics: 1652 New PRs / 1656 Closed PRs (Extremely high velocity)
⚡ Compiler & Kernels
triton-lang/triton
- Key Activity:
- [2025-07-30] 🚨 RELEASE: v3.4.0
- Details:
- AMD: Full GFX950 architecture support (WMMA, perf optimizations). Improved AMD buffer operations and ping-pong scheduling.
- NVIDIA: Blackwell TMEM support, automatic Warp Specialization, and MMAv5 pipelining.
- Language: Added `@tl.aggregate` for Python class autogeneration and enhanced JIT `constexpr` support.
- Metrics: 330 New PRs / 323 Closed PRs
tile-ai/tilelang
- Key Activity:
- Phaseout of legacy documentation and H100 shared memory bug fixes.
- Metrics: 67 New PRs / 67 Closed PRs
🤖 NVIDIA Ecosystem
NVIDIA/TransformerEngine
- Key Activity:
- [2025-07-28] 🚨 RELEASE: v2.5
- Details:
- Added support for head dimensions > 128.
- PyTorch: CPU offloading support for FP8 params; Context Parallel for Multi Latent Attention (MLA).
- JAX: Support for Sliding Window Attention (SWA) in context parallel.
- Metrics: 0 New PRs logged in timeframe (likely internal dev cycle).
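Sliding Window Attention restricts each query to a fixed-size causal window of recent keys, which is what makes it attractive to shard under context parallelism. A minimal mask sketch (window size and layout are illustrative):

```python
import numpy as np

# Sliding-window attention mask sketch: position i may attend to
# positions j with i - window < j <= i (causal, bounded lookback).
# Window size and boolean-mask layout are illustrative.

def swa_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

m = swa_mask(6, 3)
# Row 5 attends only to positions 3, 4, 5; row 0 only to position 0.
```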
NVIDIA/Megatron-LM
- Key Activity:
- [2025-07-28] Release of core_v0.14.0rc3.
- Metrics: Low public GitHub activity (Development likely internal).
🌐 JAX & Google Ecosystem
jax-ml/jax
- Key Activity:
- [2025-07-22] 🚨 RELEASE: jax-v0.7.0
- Details:
- Breaking: Migration from GSPMD to Shardy by default.
- Breaking: Autodiff now uses direct linearization by default.
- Deprecation: Removal of `jax.experimental.shard` (moved to `jax.sharding`).
- Issues reported regarding cuDNN 9.11+ causing convolution failures.
- Metrics: 681 New PRs / 602 Closed PRs
AI-Hypercomputer/maxtext
- Key Activity:
- [2025-07-15] Release tpu-recipes-v0.1.4.
- Integration of Multi-Token Prediction (MTP) training objective.
- Metrics: 144 New PRs / 113 Closed PRs
📡 Inference, Serving & RL
volcengine/verl
- Key Activity:
- [2025-07-23] 🚨 RELEASE: v0.5.0
- Details:
- Introduced “Agentic RL” rollout interface.
- Disaggregated placement: Async training with trainer/rollout on separate resources (one-step-off policy).
- Significant performance improvements for SGLang + Megatron integration (10x faster weight resharding).
- Metrics: 0 New PRs logged (Release focus).
llm-d/llm-d
- Key Activity:
- [2025-07-29] 🚨 RELEASE: v0.2.0
- Details:
- Migrated from monolithic to composable installs.
- Integration with upstream `vllm-project/vllm` v0.10.0.
- Metrics: Low PR volume (Release focus).
deepseek-ai/DeepEP
- Key Activity:
- High engagement on kernel optimizations (RDMA receiver, SM free normal kernels).
- Community questions regarding missing `combine_ll` and `dispatch_ll` kernels.
- Metrics: 35 New PRs / 30 Closed PRs
deepspeedai/DeepSpeed
- Key Activity:
- [2025-07-31] 🚨 RELEASE: v0.17.4
- [2025-07-28] 🚨 RELEASE: v0.17.3
- Details:
- Added `TiledFusedLogitsLoss`.
- Fixes for ZeRO-3 bucket resets and Ulysses-ALST FA3 support.
- Metrics: 29 New PRs / 26 Closed PRs
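The motivation behind a tiled logits loss is memory: materializing the full (tokens × vocab) logits matrix before cross-entropy dominates peak activation memory for large vocabularies, so the token dimension is processed in tiles instead. The sketch below shows the tiling idea only; DeepSpeed’s actual `TiledFusedLogitsLoss` fuses the projection and loss kernels and may tile differently:

```python
import numpy as np

# Tiled cross-entropy sketch: compute logits and loss tile-by-tile over
# the token dimension, never holding the full (tokens x vocab) matrix.
# This illustrates the tiling idea, not DeepSpeed's fused kernel.

def tiled_ce_loss(hidden, w_vocab, labels, tile=4):
    total, n = 0.0, hidden.shape[0]
    for s in range(0, n, tile):
        logits = hidden[s:s+tile] @ w_vocab          # (tile, vocab) only
        logits -= logits.max(axis=1, keepdims=True)  # stable softmax
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        total += -logp[np.arange(len(logits)), labels[s:s+tile]].sum()
    return total / n
```

Because each tile’s loss contribution is independent, the result is identical to the untiled computation while peak memory scales with the tile size rather than the sequence length.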