📅 Engineering Report (2025-08-01 - 2025-08-31)

🚀 Executive Summary

August 2025 was a pivotal month characterized by the release of PyTorch 2.8.0, which triggered a cascade of updates across the entire AI ecosystem (Vision, Audio, DeepSpeed, xFormers).

For AMD, the month was defined by the maturation of the training stack with the Primus v0.1.0-rc1 release, introducing support for major models like LLaMA 3.1 405B and DeepSeek V2. Concurrently, ROCm 6.4.3 was released as a quality update, improving driver stability and expanding framework support to Taichi and Megablocks.

On the inference side, vLLM v0.10.1 brought significant architectural changes (deprecating V0 FA3) and hardware readiness for both AMD (ROCm Qwen-VL support) and NVIDIA (Blackwell/RTX 50-series).

  • Primus Training Stack: Released v0.1.0-rc1, adding critical support for torch_fsdp2 patching, Primus-Turbo backend integration, and configurations for LLaMA 3.1 405B and Mixtral 8x22B.
  • ROCm 6.4.3: Focused on stability, fixing performance degradation in RCCL applications and expanding the ecosystem to include the Taichi language and Megablocks (MoE training).
  • PyTorch 2.8 Optimization: The upstream PyTorch release included specific ROCm performance improvements for softmax, NLLLoss, and scatter-add operations on MI250X, alongside hipSPARSELt integration for semi-structured sparsity.
  • Inference Expansion: vLLM v0.10.1 enabled Flash Attention backends for Qwen-VL models specifically on the ROCm platform and added AITER HIP block quantization kernels.
  • Future Hardware Hints: FBGEMM v1.3.0 code references explicitly added support for float8e4m3fn for MI350+ architectures.
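The float8e4m3fn type called out above follows the standard FP8 E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, and no infinities). As a quick sanity check of its range, independent of any FBGEMM code, the largest finite value can be derived in a few lines:

```python
def e4m3fn_max_finite():
    """Largest finite value representable in float8 e4m3fn.

    Layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
    The "fn" (finite) variant has no infinities: only the pattern
    exponent=1111 with mantissa=111 encodes NaN, so the largest
    finite value is exponent=1111 with mantissa=110.
    """
    bias = 7
    exponent = 0b1111 - bias      # unbiased exponent: 8
    significand = 1 + 0b110 / 8   # implicit leading 1 -> 1.75
    return significand * 2 ** exponent

print(e4m3fn_max_finite())  # 448.0
```

That 448 ceiling is why FP8 training recipes typically pair e4m3 tensors with per-tensor or per-block scaling factors.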

Competitive Analysis

  • NVIDIA Blackwell & RTX 50-Series Readiness: vLLM v0.10.1 and FBGEMM v1.3.0 introduced extensive support for NVIDIA's next-gen architecture (SM100/Blackwell) and the RTX 5090/Pro 6000 (SM120), specifically targeting FP8 GEMM tuning and block quantization.
  • PyTorch Architecture Culling: PyTorch 2.8 dropped support for older NVIDIA architectures (Maxwell/Pascal) in default builds to manage binary size, signalling a shift toward newer hardware requirements.
  • JAX & CUDA: JAX v0.7.1 is now built using CUDA 12.9, maintaining an aggressive pace with NVIDIA's latest driver stack.
  • Intel Momentum: PyTorch 2.8 introduced significant XPU (Intel GPU) updates, including a distributed backend (XCCL) and high-performance quantized LLM inference on CPUs.
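On the PyTorch architecture culling noted above: Maxwell/Pascal users are not fully stranded, since the long-standing TORCH_CUDA_ARCH_LIST build variable still controls which compute capabilities a source build targets. A minimal sketch of that workaround (the exact capability list depends on the card, and is an illustrative assumption here):

```shell
# Prebuilt PyTorch 2.8 wheels no longer target Maxwell/Pascal; a source
# build can re-enable them via the standard TORCH_CUDA_ARCH_LIST knob.
# (Illustrative -- pick the compute capabilities your cards actually need.)
export TORCH_CUDA_ARCH_LIST="5.2;6.1"   # Maxwell (5.2) and Pascal (6.1)
python setup.py develop                  # build from a PyTorch source checkout
```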

📂 Category Updates

🔴 AMD Ecosystem

[AMD-AGI/Primus]

  • Key Activity:
    • [2025-08-13] ๐Ÿšจ RELEASE: v0.1.0-rc1 - Major milestone for the training framework.
  • Details:
    • New Backend: Added Primus-Turbo backend support for Megatron and Torchtitan.
    • Model Support: Added configs for LLaMA 3.1 (70B/405B), DeepSeek V2, and Mixtral.
    • Optimization: Implemented hipBLASLt auto-tuning and specific optimizations for FP8 training memory usage.
    • Features: Patched Megatron torch_FSDP2 with Primus implementation.
  • Metrics: 34 PRs, 1 Issue

[ROCm/ROCm]

  • Key Activity:
    • [2025-08-11] ๐Ÿšจ RELEASE: rocm-6.4.3 - Maintenance and ecosystem expansion.
  • Details:
    • Drivers: Fixed latency issues in RCCL applications caused by queue eviction.
    • Ecosystem: Added support for Taichi (parallel programming language) and Megablocks (efficient MoE training).
    • Deprecations: Announced roadmap to replace ROCm SMI with AMD SMI and move away from ROCTracer/ROCProfiler in favor of ROCprofiler-SDK.
  • Metrics: 63 PRs, 29 Issues
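For teams tracking the ROCm SMI deprecation above, the replacement CLI covers similar ground; a rough command mapping (illustrative only — subcommands and flags vary by AMD SMI version, so verify against `amd-smi --help` on your install):

```shell
# Legacy tool slated for replacement:
rocm-smi          # summary view of all GPUs

# AMD SMI equivalents (names are the documented subcommands; exact
# output format differs from rocm-smi):
amd-smi list      # enumerate GPUs and their identifiers
amd-smi metric    # utilization / thermal / clock metrics
```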

[AMD-AGI/TraceLens]

  • Key Activity:
    • [2025-08-XX] Focus on JAX and NCCL analysis.
  • Details:
    • Added compute communication tags to kernels.
    • Implemented NCCL/RCCL analyzer specifically for JAX workloads.
  • Metrics: 13 PRs, 7 Issues

🔥 PyTorch Ecosystem

[pytorch/pytorch]

  • Key Activity:
    • [2025-08-06] ๐Ÿšจ RELEASE: v2.8.0 - Massive framework update.
  • Details:
    • Core: Introduced torch.compile hierarchical compilation and Control Flow Operator Library.
    • ROCm: Deprecated legacy profiling configs in favor of composable kernel tile configs. Improved performance for softmax and NLLLoss.
    • Intel: Added XCCL (Intel GPU distributed backend) support.
  • Metrics: 1,640 PRs, 616 Issues

[pytorch/torchtitan]

  • Key Activity:
    • [2025-08-31] Integration testing updates.
  • Details:
    • Refactored integration test framework to support DeepSeek-v3.
    • Improved Hugging Face asset integration.
  • Metrics: 118 PRs, 40 Issues

[pytorch/FBGEMM]

  • Key Activity:
    • [2025-08-24] ๐Ÿšจ RELEASE: v1.3.0 - GenAI and quantization focus.
  • Details:
    • Hardware: Added build support for CUDA 12.9 and MI350+ (float8e4m3fn).
    • Kernels: Added HSTU ops and improved CUTLASS BF16 grouped GEMM.
    • Quantization: Added support for fused SILU/RMS with quantization.
  • Metrics: 0 PRs (Release Note aggregation)

[facebookresearch/xformers]

  • Key Activity:
    • [2025-08-13] ๐Ÿšจ RELEASE: v0.0.32 - PyTorch 2.8 alignment.
  • Details:
    • Added ROCm 6.4 build support.
    • Updated Flash-Attention package support to v2.8.2.
  • Metrics: 0 PRs tracked

🚀 Inference & Serving

[vllm-project/vllm]

  • Key Activity:
    • [2025-08-18] ๐Ÿšจ RELEASE: v0.10.1 - Major feature and hardware update.
  • Details:
    • Architecture: Deprecated V0 FA3 support. Introduced full CUDA graph support with separate attention routines.
    • AMD: Enabled Flash Attention backend for Qwen-VL on ROCm. Added AITER HIP block quantization kernels.
    • NVIDIA: Added Block FP8 quantization for RTX 5090/SM120 and CutlassMLA backend for Blackwell/SM100.
    • Models: Added support for GPT-OSS, Eagle multimodal, and MiniCPM-V 4.0.
  • Metrics: 727 commits included in release notes.
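As a concrete illustration of the AMD-side change, a Qwen-VL deployment on a ROCm host could opt into the Flash Attention path roughly like this (the model ID and backend value are illustrative assumptions, not taken from the release notes):

```shell
# Hypothetical ROCm serving setup for a Qwen-VL model on vLLM v0.10.1.
# VLLM_ATTENTION_BACKEND selects the attention implementation; the
# backend name and model ID below are illustrative.
export VLLM_ATTENTION_BACKEND=ROCM_FLASH
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --dtype bfloat16
```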

[sgl-project/sglang]

  • Key Activity:
    • [2025-08-26] Community engagement.
  • Details:
    • Documentation updates for the "SGLang x AMD SF Meetup," indicating a tightening relationship between the project and AMD.
  • Metrics: 0 PRs tracked

๐Ÿ‹๏ธ Training & Fine-tuning

[deepspeedai/DeepSpeed]

  • Key Activity:
    • [2025-08-20] ๐Ÿšจ RELEASE: v0.17.5
  • Details:
    • Added ZenFlow code for Stage 1 & 2.
    • Fixed DeepCompile compatibility for PyTorch v2.8.
  • Metrics: 0 PRs tracked

[THUDM/slime]

  • Key Activity:
    • [2025-08-31] ๐Ÿšจ RELEASE: v0.1.0
  • Details:
    • Optimized SGLang with FP8 + DeepEP.
    • Added CI for E2E GLM4 9B and Qwen3 30B-A3B training.
  • Metrics: 0 PRs tracked

[NVIDIA/Megatron-LM]

  • Key Activity:
    • [2025-08-11] ๐Ÿšจ RELEASE: core_v0.14.0rc5
  • Details:
    • Continued rapid release cadence for Megatron Core.
  • Metrics: 0 PRs tracked

🧠 JAX Ecosystem

[jax-ml/jax]

  • Key Activity:
    • [2025-08-20] ๐Ÿšจ RELEASE: v0.7.1
  • Details:
    • Build system updated to use CUDA 12.9.
    • Shipped Python 3.14 wheels.
    • Exposed jax.set_mesh global setter.
  • Metrics: 0 PRs tracked