GitHub Monthly Report: 2025-08-01 to 2025-08-31
Engineering Report (2025-08-01 - 2025-08-31)
Executive Summary
August 2025 was a pivotal month characterized by the release of PyTorch 2.8.0, which triggered a cascade of updates across the entire AI ecosystem (Vision, Audio, DeepSpeed, xFormers).
For AMD, the month was defined by the maturation of the training stack with the Primus v0.1.0-rc1 release, introducing support for major models like LLaMA 3.1 405B and DeepSeek V2. Concurrently, ROCm 6.4.3 was released as a quality update, improving driver stability and expanding framework support to Taichi and Megablocks.
On the inference side, vLLM v0.10.1 brought significant architectural changes (deprecating V0 FA3) and hardware readiness for both AMD (ROCm Qwen-VL support) and NVIDIA (Blackwell/RTX 50-series).
AMD Related Updates
- Primus Training Stack: Released v0.1.0-rc1, adding critical support for `torch_fsdp2` patching, `Primus-Turbo` backend integration, and configurations for LLaMA 3.1 405B and Mixtral 8x22B.
- ROCm 6.4.3: Focused on stability, fixing performance degradation in RCCL applications and expanding the ecosystem to include the Taichi language and Megablocks (MoE training).
- PyTorch 2.8 Optimization: The upstream PyTorch release included specific ROCm performance improvements for `softmax`, `NLLLoss`, and scatter-add operations on MI250X, alongside `hipSPARSELt` integration for semi-structured sparsity.
- Inference Expansion: vLLM v0.10.1 enabled Flash Attention backends for Qwen-VL models specifically on the ROCm platform and added AITER HIP block quantization kernels.
- Future Hardware Hints: FBGEMM v1.3.0 code references explicitly added support for `float8_e4m3fn` for MI350+ architectures.
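To make the `float8_e4m3fn` reference above concrete, here is a minimal pure-Python sketch of the E4M3FN encoding (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7; the "fn" variant has no infinities, only a NaN code). The decoder below is an illustration of the format, not any FBGEMM or PyTorch API.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 E4M3FN byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:       # all-ones pattern is NaN; no +/-inf exists
        return float("nan")
    if exp == 0:                        # subnormal range: 2^-6 * (man/8)
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Largest finite E4M3FN value: 0b0_1111_110 -> 1.75 * 2^8
print(decode_e4m3fn(0b0_1111_110))  # prints 448.0
```

The 448 maximum (versus 57344 for E5M2) is why E4M3FN is the usual choice for weights and activations, where range matters less than precision.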
Competitive Analysis
- NVIDIA Blackwell & RTX 50-Series Readiness: vLLM v0.10.1 and FBGEMM v1.3.0 introduced extensive support for NVIDIA's next-gen architecture (SM100/Blackwell) and the upcoming RTX 5090/Pro 6000 (SM120), specifically targeting FP8 GEMM tuning and block quantization.
- PyTorch Architecture Culling: PyTorch 2.8 dropped support for older NVIDIA architectures (Maxwell/Pascal) in default builds to manage binary size, signalling a shift toward newer hardware requirements.
- JAX & CUDA: JAX v0.7.1 is now built with CUDA 12.9, keeping an aggressive pace with NVIDIA's latest driver stack.
- Intel Momentum: PyTorch 2.8 introduced significant XPU (Intel GPU) updates, including a distributed backend (XCCL) and high-performance quantized LLM inference on CPUs.
Category Updates
AMD Ecosystem
[AMD-AGI/Primus]
- Key Activity:
- [2025-08-13] RELEASE: v0.1.0-rc1 - Major milestone for the training framework.
- Details:
- New Backend: Added `Primus-Turbo` backend support for Megatron and Torchtitan.
- Model Support: Added configs for LLaMA 3.1 (70B/405B), DeepSeek V2, and Mixtral.
- Optimization: Implemented `hipblaslt` auto-tuning and specific optimizations for FP8 training memory usage.
- Features: Patched Megatron `torch_FSDP2` with the Primus implementation.
- Metrics: 34 PRs, 1 Issue
[ROCm/ROCm]
- Key Activity:
- [2025-08-11] RELEASE: rocm-6.4.3 - Maintenance and ecosystem expansion.
- Details:
- Drivers: Fixed latency issues in RCCL applications caused by queue eviction.
- Ecosystem: Added support for Taichi (parallel programming language) and Megablocks (efficient MoE training).
- Deprecations: Announced roadmap to replace ROCm SMI with AMD SMI and move away from ROCTracer/ROCProfiler in favor of ROCprofiler-SDK.
- Metrics: 63 PRs, 29 Issues
[AMD-AGI/TraceLens]
- Key Activity:
- [2025-08-XX] Focus on JAX and NCCL analysis.
- Details:
- Added compute communication tags to kernels.
- Implemented NCCL/RCCL analyzer specifically for JAX workloads.
- Metrics: 13 PRs, 7 Issues
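The compute/communication tagging described for TraceLens can be illustrated with a tiny sketch: collective-communication kernels are typically recognizable by their library prefix (`nccl` on NVIDIA, `rccl` on AMD). The prefixes and the `tag_kernels` helper below are illustrative assumptions, not TraceLens's actual API.

```python
# Hypothetical sketch: label trace kernels as communication vs. compute by name.
COMM_PREFIXES = ("nccl", "rccl")  # NCCL (NVIDIA) and RCCL (AMD) collectives

def tag_kernels(kernel_names):
    """Return a {kernel_name: tag} map, tagging collectives as 'communication'."""
    tags = {}
    for name in kernel_names:
        if name.lower().startswith(COMM_PREFIXES):
            tags[name] = "communication"
        else:
            tags[name] = "compute"
    return tags

trace = ["ncclAllReduce_Ring", "gemm_fp16_tn", "rcclBroadcast", "softmax_fwd"]
tags = tag_kernels(trace)
```

A real analyzer would also attribute time per tag to estimate compute/communication overlap, but the classification step is essentially this.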
PyTorch Ecosystem
[pytorch/pytorch]
- Key Activity:
- [2025-08-06] RELEASE: v2.8.0 - Massive framework update.
- Details:
- Core: Introduced `torch.compile` hierarchical compilation and the Control Flow Operator Library.
- ROCm: Deprecated legacy profiling configs in favor of composable kernel tile configs. Improved performance for `softmax` and `NLLLoss`.
- Intel: Added XCCL (Intel GPU distributed backend) support.
- Metrics: 1640 PRs, 616 Issues
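For context on the `softmax`/`NLLLoss` items, the math those kernels implement is compact: the negative log-likelihood of a target class under softmax reduces to a max-shifted log-sum-exp. The pure-Python reference below shows the computation being optimized; it is a numerical illustration, not PyTorch code.

```python
import math

def nll_from_logits(logits, target):
    """NLL of `target` under softmax(logits), via stable log-softmax:
    log_softmax(x)_t = x_t - max(x) - log(sum_j exp(x_j - max(x))).
    """
    m = max(logits)                                        # max-shift for stability
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[target] - m - log_sum)

print(round(nll_from_logits([1.0, 2.0, 3.0], 2), 4))  # prints 0.4076
```

Fused softmax/NLL kernels avoid materializing the full probability vector, which is where most of the MI250X-specific gains would come from.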
[pytorch/torchtitan]
- Key Activity:
- [2025-08-31] Integration testing updates.
- Details:
- Refactored integration test framework to support DeepSeek-v3.
- Improved Hugging Face asset integration.
- Metrics: 118 PRs, 40 Issues
[pytorch/FBGEMM]
- Key Activity:
- [2025-08-24] RELEASE: v1.3.0 - GenAI and quantization focus.
- Details:
- Hardware: Added build support for CUDA 12.9 and MI350+ (`float8_e4m3fn`).
- Kernels: Added HSTU ops and improved CUTLASS BF16 grouped GEMM.
- Quantization: Added support for fused SiLU/RMS with quantization.
- Metrics: 0 PRs (Release Note aggregation)
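The "fused SiLU/RMS with quantization" item composes three stages a fused kernel would execute in one pass. The sketch below shows only the math of that composition, written sequentially in plain Python; the staging and the `fused_silu_rms_quant` name are assumptions for illustration, not FBGEMM's actual kernel or API.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def fused_silu_rms_quant(values, eps=1e-6):
    """Illustrative composition: SiLU -> RMS normalization -> int8 quantization.

    A fused kernel does this in one trip through memory; here the stages are
    simply chained for clarity.
    """
    activated = [silu(v) for v in values]
    rms = math.sqrt(sum(v * v for v in activated) / len(activated) + eps)
    normed = [v / rms for v in activated]
    scale = max(abs(v) for v in normed) / 127.0 or 1.0   # per-tensor scale
    quantized = [round(v / scale) for v in normed]
    return quantized, scale

q, s = fused_silu_rms_quant([0.5, -1.0, 2.0, 3.0])
```

The payoff of fusion is avoiding two extra reads/writes of the activation tensor between the three stages.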
[facebookresearch/xformers]
- Key Activity:
- [2025-08-13] RELEASE: v0.0.32 - PyTorch 2.8 alignment.
- Details:
- Added ROCm 6.4 build support.
- Updated Flash-Attention package support to v2.8.2.
- Metrics: 0 PRs tracked
Inference & Serving
[vllm-project/vllm]
- Key Activity:
- [2025-08-18] RELEASE: v0.10.1 - Major feature and hardware update.
- Details:
- Architecture: Deprecated V0 FA3 support. Introduced full CUDA graph support with separate attention routines.
- AMD: Enabled Flash Attention backend for Qwen-VL on ROCm. Added AITER HIP block quantization kernels.
- NVIDIA: Added Block FP8 quantization for RTX 5090/SM120 and CutlassMLA backend for Blackwell/SM100.
- Models: Added support for GPT-OSS, Eagle multimodal, and MiniCPM-V 4.0.
- Metrics: 727 commits included in release notes.
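Block quantization, as in the AITER HIP kernels and the SM120 Block FP8 path above, scales each fixed-size block of values independently instead of using one tensor-wide scale, which keeps outliers from destroying precision elsewhere. The sketch below shows the general blockwise-scaling idea; it uses int8-style rounding as a stand-in, since plain Python has no FP8 type, and is not vLLM's kernel code.

```python
def block_quantize(values, block_size=4, qmax=127):
    """Blockwise quantization sketch: one scale per block of `block_size` values."""
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / qmax or 1.0  # per-block scale
        scales.append(scale)
        quantized.extend(round(v / scale) for v in block)
    return quantized, scales

def block_dequantize(quantized, scales, block_size=4):
    """Invert block_quantize up to rounding error."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]

vals = [0.1, -0.2, 0.3, 0.05, 10.0, -20.0, 5.0, 1.0]
q, s = block_quantize(vals)
approx = block_dequantize(q, s)
```

With one global scale, the -20.0 outlier would force the 0.05-scale values into only a code or two; per-block scales give each block its full quantization range.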
[sgl-project/sglang]
- Key Activity:
- [2025-08-26] Community engagement.
- Details:
- Documentation updates regarding the "SGLang x AMD SF Meetup," indicating closer collaboration between the project and AMD.
- Metrics: 0 PRs tracked
Training & Fine-tuning
[deepspeedai/DeepSpeed]
- Key Activity:
- [2025-08-20] RELEASE: v0.17.5
- Details:
- Added ZenFlow code for Stage 1 & 2.
- Fixed `DeepCompile` compatibility with PyTorch v2.8.
- Metrics: 0 PRs tracked
[THUDM/slime]
- Key Activity:
- [2025-08-31] RELEASE: v0.1.0
- Details:
- Optimized SGLang with FP8 + DeepEP.
- Added CI for E2E GLM4 9B and Qwen3 30B-A3B training.
- Metrics: 0 PRs tracked
[NVIDIA/Megatron-LM]
- Key Activity:
- [2025-08-11] RELEASE: core_v0.14.0rc5
- Details:
- Continued rapid release cadence for Megatron Core.
- Metrics: 0 PRs tracked
JAX Ecosystem
[jax-ml/jax]
- Key Activity:
- [2025-08-20] RELEASE: v0.7.1
- Details:
- Build system updated to use CUDA 12.9.
- Shipped Python 3.14 wheels.
- Exposed the `jax.set_mesh` global setter.
- Metrics: 0 PRs tracked