GitHub Monthly Report: 2025-09-01 to 2025-09-30
📅 Engineering Report (2025-09-01 - 2025-09-30)
🚀 Executive Summary
September 2025 marked a pivotal month for AI infrastructure with the release of ROCm 7.0, a major version update introducing support for the MI350/MI355X series and next-generation OCP data types (FP4/FP6/FP8). In parallel, the inference landscape saw a massive update with vLLM v0.10.2, which upgraded to PyTorch 2.8 and introduced significant architectural changes for both NVIDIA Blackwell and AMD ROCm support. The quantization ecosystem is moving aggressively toward low-precision formats (FP4/FP6), with both PyTorch AO and the ROCm libraries adding native support to handle next-gen LLM sizes. Notably, the appearance of "Llama 4" configurations in both AMD Primus and Google MaxText repositories suggests ecosystem preparation for an imminent model-generation shift.
AMD Related Updates
- 🚨 Major Release ROCm 7.0.0 & 7.0.1: This is the most significant update of the year, formally introducing support for AMD Instinct MI355X and MI350X GPUs. It decouples the AMDGPU driver from the ROCm stack and enables KVM Passthrough for the new accelerators.
- Primus v0.2.0 Release: AMD’s training framework (Primus) released v0.2.0, adding specific configurations for Llama 4 (17B/128E) and integrating Megatron-LM backends, signaling readiness for next-gen model training on AMD hardware.
- OCP Data Type Standardization: ROCm 7.0 libraries (hipBLASLt, MIOpen, etc.) now natively support Open Compute Project (OCP) Microscaling formats (MX) for FP4, FP6, and FP8, aligning with industry standards for low-precision training/inference.
- TileLang Adoption: The compiler project TileLang added specific support for AMD MI300 series (Flash Attention examples) and ROCm architecture detection, expanding the third-party compiler ecosystem for AMD.
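The OCP Microscaling (MX) formats mentioned above pair a small block of elements with one shared power-of-two scale, so each element needs only a few bits. A minimal pure-Python sketch of MXFP4-style block quantization (the E2M1 value grid and scale rule follow the OCP MX spec; this is an illustration, not the ROCm implementation):

```python
import math

# Representable non-negative values of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bits)
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mx_block(block):
    """MX-style quantization of one block: a shared power-of-two scale
    plus one FP4 E2M1 code per element. Returns (scale, codes), where
    codes are the per-element E2M1 values before rescaling."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: 2^(floor(log2(amax)) - emax_elem), with emax_elem = 2 for E2M1
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to E2M1's max magnitude
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, x))
    return scale, codes

def dequantize_mx_block(scale, codes):
    return [scale * c for c in codes]

block = [0.07, -1.3, 2.9, 0.01] * 8  # one 32-element block
scale, codes = quantize_mx_block(block)
recon = dequantize_mx_block(scale, codes)
```

Small values (0.07 here) collapse to zero while the block's largest magnitudes keep roughly one bit of mantissa precision, which is why MX quantization is applied per small block rather than per tensor.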
Competitive Analysis
- NVIDIA Blackwell Readiness: NVIDIA’s ecosystem is hardening for the Blackwell (SM100) architecture. vLLM v0.10.2 and TransformerEngine v2.6 added specific kernels for Blackwell, including MXFP4 MoE support, although FlashMLA was temporarily disabled on Blackwell due to compatibility issues.
- Proprietary vs. Open Quantization: While ROCm adopted OCP standards for FP4/FP6, PyTorch AO’s release (v0.13.0) introduced prototype support for NVFP4 (NVIDIA specific FP4) alongside standard formats, highlighting a divergence in low-precision data type strategies.
- Hugging Face Framework Consolidation: Hugging Face `transformers` removed library-wide support for TensorFlow and JAX, consolidating heavily around PyTorch. This solidifies PyTorch (and by extension, the ROCm PyTorch stack) as the critical path for model deployment.
📂 Category Updates
AMD Ecosystem
ROCm/ROCm
- Key Activity:
- 🚨 [2025-09-16] RELEASE: rocm-7.0.0 - Major release introducing MI355X/MI350X support.
- 🚨 [2025-09-17] RELEASE: rocm-7.0.1 - Patch release fixing out-of-bound CPERs for bad memory pages.
- Details:
- Hardware Support: Added MI355X and MI350X support across the stack (libraries, compiler, profilers).
- Architecture: The AMD GPU driver (`amdgpu`) is now distributed separately from the ROCm software stack.
- Precision: Full support for OCP `FP4`, `FP6`, and `FP8` data types in the HIP runtime and math libraries.
- Ecosystem: Added official support for Ray and llama.cpp.
- Metrics: 162 PRs (High Activity), 45 New Issues
AMD-AGI/Primus
- Key Activity:
- 🚨 [2025-09-11] RELEASE: v0.2.0 - Unified config/backend CLI and Turbo backend updates.
- Details:
- Llama 4 Readiness: Multiple PRs merged adding initial configs for Llama 4 17B variants (Maverick and Scout).
- Features: Added `LightMegatronPretrainTrainer`, fused MoE router scatter logic, and Async TP adaptation for TE 2.x.
- Performance: Updated Turbo Grouped GEMM for BF16/FP16.
- Metrics: 40 PRs (Perfect Close Rate), 1 New Issue
AMD-AGI/TraceLens
- Key Activity:
- High volume of new issues identifying API gaps.
- Details:
- Added support for `trtllm::cublas_scaled_mm` and initial graph-mode tests.
- Discussions opened regarding renaming kernel launchers to "leaf ops" and clarifying UID vs. Event APIs.
- Metrics: 17 PRs, 31 New Issues
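For context on the `trtllm::cublas_scaled_mm` op referenced above: a scaled matmul multiplies low-precision (e.g., FP8 or integer) operands, accumulates in higher precision, and applies per-tensor dequantization scales to the output. A pure-Python sketch of the arithmetic (the function and argument names here are illustrative, not TraceLens or TensorRT-LLM APIs):

```python
def scaled_mm(a_q, b_q, scale_a, scale_b):
    """Scaled matrix multiply: low-precision operand codes a_q, b_q with
    per-tensor scales. Accumulate in full precision, then apply the
    combined scale once per output element."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = sum(a_q[i][k] * b_q[k][j] for k in range(inner))  # high-precision accumulate
            out[i][j] = acc * scale_a * scale_b                     # dequantize once per output
    return out
```

Applying the scales after accumulation (rather than to each operand element) is what lets the inner loop run entirely in the cheap low-precision format.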
PyTorch Ecosystem
pytorch/pytorch
- Key Activity:
- Preparing for upcoming release cycles.
- Details:
- Heavy focus on `dynamo` and export functionality.
- Windows: Updated the libuv version.
- macOS: Updated the deployment target to 11.0.
- Metrics: 1799 PRs, 649 New Issues
pytorch/ao (Architecture Optimization)
- Key Activity:
- 🚨 [2025-09-02] RELEASE: v0.13.0-rc8
- Details:
- Quantization: Added prototype NVFP4 (NVIDIA FP4) and FP8 QAT support.
- Performance: Achieved 1.2x speedup on NVIDIA B200 (Blackwell) using MXFP8 with torchtitan.
- Deprecation: Dropped support for PyTorch 2.5 and older.
- Metrics: 0 PRs (Release Month), 31 New Issues
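Context for the FP8 QAT item: quantization-aware training runs the forward pass through a quantize-dequantize round trip so the network learns with quantization error, while the master weights stay in full precision. A minimal sketch of E4M3 rounding and fake quantization (illustrative only; torchao's actual QAT APIs differ, and subnormals are ignored here):

```python
import math

def round_to_e4m3(x):
    """Round a float to the nearest FP8 E4M3 value (normal range only)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), 448.0)      # 448 is E4M3's largest finite value
    e = max(math.floor(math.log2(mag)), -6)  # -6: smallest normal exponent
    ulp = 2.0 ** (e - 3)          # 3 mantissa bits => spacing 2^(e-3)
    return sign * round(mag / ulp) * ulp

def fake_quantize(w, scale):
    """Quantize-dequantize: the numeric core of QAT. The forward pass
    sees FP8-rounded weights; gradients pass straight through to the
    full-precision master copy."""
    return round_to_e4m3(w / scale) * scale
```

Note how the spacing between representable values grows with magnitude: near 0.25 the step is 1/64, while near 448 it is 32.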
pytorch/torchtitan
- Key Activity:
- Focus on FSDP2 maturity.
- Details:
- Added support for `sync_module_states` in FSDP2.
- Ported true BF16 training into forge experiments.
- Refactored attention interfaces.
- Metrics: 78 PRs, 30 New Issues
meta-pytorch/monarch
- Key Activity:
- 🚨 [2025-09-03] RELEASE: v0.0.0 - Initial public release on PyPI.
- Metrics: 0 PRs, 0 New Issues
Serving & Inference
vllm-project/vllm
- Key Activity:
- 🚨 [2025-09-13] RELEASE: v0.10.2 - Massive update with breaking changes.
- Details:
- Core: Upgraded to PyTorch 2.8.0.
- Hardware: Added native `aarch64` support for GB200.
- AMD ROCm: Added pipeline parallelism with Ray on ROCm (#24275) and TorchAO quantization support (#24400).
- Blackwell: Enabled DeepGEMM Linear and MXFP4 MoE for Blackwell, but disabled FlashMLA due to compatibility issues.
- Model Support: Added support for Qwen2.5-Omni, LFM2, Ovis2.5, and InternVL3.5.
- Metrics: 0 PRs (Post-release window), 0 New Issues (Data artifact; actual activity likely higher)
deepspeedai/DeepSpeed
- Key Activity:
- 🚨 [2025-09-19] RELEASE: v0.17.6
- Details:
- Optimization: Enabled Muon Optimizer.
- Features: Added “SuperOffload” and support for non-ZeRO modes.
- DeepCompile: Fixes for IPG bucket clearing and gradient buffer access.
- Metrics: 49 PRs, 32 New Issues
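The Muon optimizer mentioned above replaces the elementwise Adam-style update with an approximately orthogonalized momentum matrix, typically computed by a few Newton-Schulz iterations. A pure-Python sketch of the classic cubic Newton-Schulz orthogonalization (Muon itself uses a tuned quintic variant; all names here are illustrative, not DeepSpeed APIs):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the orthogonal polar factor of G: normalize by the
    Frobenius norm so every singular value lies in (0, 1], then iterate
    X <- 1.5*X - 0.5*(X X^T X), which pushes each singular value toward 1
    without ever computing an SVD."""
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

# Orthogonalizing a positive diagonal matrix drives its entries toward 1
X = newton_schulz_orthogonalize([[3.0, 0.0], [0.0, 1.0]])
```

The iteration uses only matrix multiplies, which is why it maps well onto GPU GEMM hardware compared with an explicit SVD.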
deepseek-ai/DeepEP
- Key Activity:
- 🚨 [2025-09-16] RELEASE: v1.2.1
- Details:
- Added support for CUDA Graph in internode dispatch (normal kernel).
- Added permute extension to hybrid-ep.
- Metrics: 26 PRs, 28 New Issues
Compilers & Languages
tile-ai/tilelang
- Key Activity:
- 🚨 [2025-09-19] RELEASE: v0.1.6
- Details:
- AMD Support: Added ROCm architecture detection and Flash Attention examples for MI300 series.
- Huawei Support: Added support for Ascend NPU IR.
- Features: Added support for 1D TMA (Tensor Memory Accelerator) operations.
- Metrics: 99 PRs, 46 New Issues
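The Flash Attention examples added for MI300 rely on the streaming (online) softmax trick: keys and values are visited one tile at a time with a running max and normalizer, so the full row of attention scores is never materialized. A pure-Python sketch for a single query row (illustrative of the algorithm, not TileLang code):

```python
import math

def flash_attention_row(q, keys, values, scale):
    """One query row of flash-attention-style streaming softmax.
    Maintains a running score max (m), running normalizer (l), and an
    unnormalized output accumulator; each new key rescales the old state."""
    m = float("-inf")            # running max of scores seen so far
    l = 0.0                      # running softmax denominator
    acc = [0.0] * len(values[0]) # unnormalized weighted sum of values
    for k, v in zip(keys, values):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        correction = math.exp(m - m_new) if m != float("-inf") else 0.0
        p = math.exp(s - m_new)
        l = l * correction + p
        acc = [a * correction + p * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

out = flash_attention_row(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # keys
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # values
    1.0,
)
```

Because the state is only (m, l, acc), a kernel can stream arbitrarily long key/value sequences through fast on-chip memory, which is the property the MI300 examples exploit.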
triton-lang/triton
- Key Activity:
- Stabilization and bug fixes.
- Details:
- Addressed a B200 (Blackwell) `flex_attention_fwd` performance regression.
- Investigated issues with casting `float16` to `float8e5`.
- Metrics: 270 PRs, 40 New Issues
openxla/xla
- Key Activity:
- Ongoing maintenance.
- Details:
- Issues reported with profiling HLO Ops on GPU.
- XLA:TPU translation issues reported for `mhlo.acosh`.
- Metrics: 1282 PRs, 6 New Issues
Other Frameworks
huggingface/transformers
- Key Activity:
- Strategic framework shift.
- Details:
- [2025-09-18] Fully removed TensorFlow and JAX support library-wide.
- Added Bengali language README and updated formatting.
- Metrics: 514 PRs, 145 New Issues
NVIDIA/TransformerEngine
- Key Activity:
- 🚨 [2025-09-15] RELEASE: v2.6
- Details:
- Fusion: Added gradient accumulation fusion for FSDP (Megatron-Core).
- MoE: Optimized permute fusion kernels.
- Export: Added support for ONNX export.
- Metrics: 0 PRs (Repo snapshot), 0 New Issues
jax-ml/jax
- Key Activity:
- 🚨 [2025-09-16] RELEASE: jax-v0.7.2
- Details:
- Minimum supported NumPy is now 2.0.
- Updated documentation for experimental ROCm support on WSL2.
- Removed `jax2tf` non-native serialization support.
- Metrics: 668 PRs, 77 New Issues