📅 Engineering Report (2025-10-01 - 2025-10-31)

🚀 Executive Summary

October 2025 marked a significant pivot point for infrastructure across the AI stack. AMD launched a technology preview of its new build system (“TheRock”) via ROCm 7.9.0, preparing the software stack for the MI350 series. Simultaneously, AMD-AGI took TraceLens public and aggressively updated Primus with support for massive models (Llama 3.1 405B, Grok).

In the broader ecosystem, PyTorch 2.9.0 was released with support for CUDA 13 and Python 3.14. NVIDIA continued to push Blackwell-specific optimizations (NVFP4, FP8 attention) across Megatron-LM and TransformerEngine. vLLM executed a major refactor in v0.11.0, deleting its legacy V0 engine entirely to focus on the higher-performance V1 engine while also adding support for ROCm 7.0.

  • ROCm 7.9.0 “TheRock” Preview: AMD released a technology preview of a new build/release infrastructure. It introduces a versioning discontinuity (7.9 vs 7.0 stream) and is specifically built to support MI355X and MI350X. This is critical as it signals the software readiness phase for AMD’s next-gen hardware.
  • Primus & TraceLens Maturity: Primus v0.4.0 added support for Grok1/2 and Qwen2.5, alongside zero-bubble pipeline parallelism. TraceLens v0.4.0 was open-sourced (repo made public) and received a UI, signaling AMD’s commitment to better profiling tooling for developers.
  • PyTorch CI for MI355X: PyTorch v2.9.0 release notes confirm that ROCm CI for MI355X testing has been enabled, ensuring day-one framework support for the new accelerator.

Competitive Analysis

  • NVIDIA Blackwell Readiness: Competitor repositories (Megatron-LM, TransformerEngine, vLLM) are flooded with Blackwell-, B200-, and GB200-specific optimizations. In particular, NVFP4 (4-bit floating point) quantization support is now mainline in TransformerEngine and torchao, aimed at doubling throughput on Blackwell hardware.
  • vLLM Architecture Shift: vLLM v0.11.0 removed its legacy engine. This reduces technical debt and focuses optimization efforts. While they added ROCm 7.0 support, they also added specific optimizations for NVIDIA’s FP8 FlashInfer MLA decode, widening the feature gap in advanced quantization formats.
  • DeepSeek & Qwen Dominance: The ecosystem (RTP-LLM, SGLang, vLLM) is heavily optimizing for Qwen3-Next and DeepSeek-V3.2 architectures. AMD’s Primus is keeping pace here, but competitor libraries like SGLang are introducing specific kernel optimizations for these models rapidly.

📂 Category Updates

🟥 AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • Heavy feature addition for large-scale training (Llama 405B, Grok) and pipeline parallelism improvements.
  • Details:
    • 🚨 [2025-10-18] RELEASE: v0.4.0: Added support for Grok1/Grok2, Qwen2.5-7B/72B, and Llama 3.1 405B. Introduced zero-bubble pipeline parallelism and a Python-based CLI entry point.
    • [2025-10-15] RELEASE: v0.3.0: Added Primus-Turbo support for TorchTitan and async tensor parallelism.
  • Metrics: 46 PRs, 1 New Issue (Health: High activity, fast releases)

AMD-AGI/TraceLens

  • Key Activity:
    • Repository made public. Added a UI and JAX support.
  • Details:
    • 🚨 [2025-10-17] RELEASE: v0.4.0: First public release. Added the TraceLens UI, a JAX analysis reporting tool, and performance models for aten::bmm (a trace-capture sketch follows this entry's metrics).
    • [2025-10-17] Added support for Transformer Engine v1 GEMM ops.
  • Metrics: 48 PRs, 45 New Issues (Health: Rapid growth following public release)
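
TraceLens is a trace-analysis tool, so its natural input is a profiler trace. As a hedged illustration (standard torch.profiler API only; TraceLens's own CLI and ingestion path are not shown here), the snippet below captures a trace containing the aten::bmm events that the new performance models target:

```python
# Hedged sketch: capture a Chrome/Kineto trace containing aten::bmm events using
# only the standard torch.profiler API. How TraceLens ingests such a trace (CLI,
# UI upload, etc.) is not shown here.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(8, 128, 64, device=device)
b = torch.randn(8, 64, 256, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)  # also covers ROCm/HIP builds of PyTorch

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        torch.bmm(a, b)  # shows up as aten::bmm in the trace

prof.export_chrome_trace("bmm_trace.json")  # Perfetto/Chrome-compatible trace file
```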

ROCm/ROCm

  • Key Activity:
    • Split release streams: Production (7.0.2) vs. Technology Preview (7.9.0).
  • Details:
    • 🚨 [2025-10-20] RELEASE: therock-7.9.0: Technology preview of the “TheRock” build system. Supports MI355X and MI350X. No upgrade path from 7.0; intended as a developer preview only (a runtime sanity check is sketched after this entry's metrics).
    • [2025-10-10] RELEASE: rocm-7.0.2: Production release. Added RHEL 10 / Debian 13 support. Added RX 9060 support.
    • Note: ROCm-EP (Execution Provider) for ONNX Runtime is deprecated.
  • Metrics: 105 PRs, 54 New Issues (Health: Very High activity due to dual release streams)
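
Which stack a given PyTorch build actually sees can be verified with standard torch attributes; the sketch below is a generic sanity check and says nothing about TheRock packaging itself:

```python
# Minimal sanity check of which HIP/ROCm stack a PyTorch build sees; these are
# standard torch attributes, not part of TheRock or the ROCm packaging itself.
import torch

print("HIP/ROCm version:", torch.version.hip)       # a version string on ROCm builds, None on CUDA builds
print("GPU available:", torch.cuda.is_available())  # torch.cuda.* maps onto HIP on ROCm builds
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```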

🔥 PyTorch Ecosystem

pytorch/pytorch

  • Key Activity:
    • Major version release with significant hardware backend updates.
  • Details:
    • 🚨 [2025-10-15] RELEASE: v2.9.0:
      • CUDA 13 & Python 3.14 support.
      • Symmetric Memory: Enables easier multi-GPU kernel programming.
      • Intel/XPU: FlexAttention enabled on Intel GPUs (a device-agnostic usage sketch follows this entry's metrics).
      • ROCm: Enabled CI for MI355X.
  • Metrics: 1,900 PRs, 529 New Issues (Health: Extremely High velocity)
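
As a hedged illustration of the FlexAttention API that 2.9 extends to Intel GPUs, the sketch below applies a causal score_mod; the device selection is illustrative, and an Intel-GPU build would use the "xpu" device:

```python
# Hedged sketch of torch.nn.attention.flex_attention with a causal score_mod.
# Runs in eager mode as a slow fallback; wrap the call in torch.compile for
# real performance. "cuda" covers CUDA/ROCm builds; an Intel-GPU build of
# PyTorch would use device="xpu" instead.
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, S, D = 2, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))

def causal(score, b, h, q_idx, kv_idx):
    # Keep scores on or below the diagonal; mask out future positions.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```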

pytorch/torchtitan

  • Key Activity:
    • Stabilization and experiments.
  • Details:
    • [2025-10-18] RELEASE: v0.2.0: Dependencies updated to PyTorch 2.10 (dev) and torchao 0.15 (dev).
    • [2025-10-14] Consolidated DeepSeek-V3 experiments.
  • Metrics: 174 PRs, 24 New Issues

pytorch/ao

  • Key Activity:
    • Optimization for next-gen NVIDIA hardware.
  • Details:
    • 🚨 [2025-10-13] RELEASE: v0.14.1: Added prototype NVFP4 QAT (quantization-aware training) and MoE training support specifically for Blackwell GPUs (the general quantize_ workflow is sketched after this entry's metrics).
  • Metrics: 0 PRs recorded (release note data only), 26 New Issues
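
The NVFP4 QAT prototype APIs are not reproduced here; the sketch below shows the general torchao quantize_ workflow those recipes plug into, using the long-standing int8 weight-only recipe as a stand-in:

```python
# Hedged sketch of torchao's quantize_ entry point, shown with the long-standing
# int8 weight-only recipe. The NVFP4 QAT prototype mentioned above plugs into
# the same workflow but uses prototype configs that are not reproduced here.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

device = "cuda" if torch.cuda.is_available() else "cpu"  # GPU is the usual target; CPU behavior may vary by release
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).to(device)

quantize_(model, int8_weight_only())  # swaps Linear weights for int8 quantized tensors in place

x = torch.randn(2, 1024, device=device)
with torch.no_grad():
    print(model(x).shape)  # torch.Size([2, 1024])
```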

meta-pytorch/monarch

  • Key Activity:
    • New project launch.
  • Details:
    • 🚨 [2025-10-22] RELEASE: v0.1.0: Initial release of a distributed programming framework for PyTorch using actor-based concurrency and RDMA.
  • Metrics: 0 PRs recorded

🟩 NVIDIA Ecosystem

NVIDIA/Megatron-LM

  • Key Activity:
    • Blackwell preparation and architectural improvements.
  • Details:
    • [2025-10-08] RELEASE: core_v0.14.0:
      • MoE: Optimizations for large-scale fine-grained MoE on Blackwell.
      • HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism.
      • Inference: Multi-batch size CUDA Graphs.
  • Metrics: 0 PRs recorded

NVIDIA/TransformerEngine

  • Key Activity:
    • Quantization format expansion.
  • Details:
    • [2025-10-07] RELEASE: v2.8: Added support for an NVFP4 training recipe and FP8 attention with current scaling (the general fp8_autocast pattern is sketched after this entry's metrics).
    • [2025-10-01] RELEASE: v2.7: Support for applying LayerNorm/RMSNorm to key/query tensors.
  • Metrics: 0 PRs recorded
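
The v2.8 recipes themselves are not reproduced here; the sketch below shows the fp8_autocast pattern they plug into, using the established DelayedScaling recipe on FP8-capable hardware:

```python
# Hedged sketch of TransformerEngine's fp8_autocast pattern with the established
# DelayedScaling recipe; the NVFP4 and current-scaling recipes added in v2.8 use
# the same entry point with different recipe classes (not reproduced here).
# Requires FP8-capable hardware.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda")  # dims chosen to satisfy FP8 GEMM alignment

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)  # torch.Size([8, 4096])
```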

🚀 Serving & Inference

vllm-project/vllm

  • Key Activity:
    • Massive architectural cleanup and hardware expansion.
  • Details:
    • 🚨 [2025-10-02] RELEASE: v0.11.0:
      • Removed V0 Engine: Complete removal of legacy engine code; V1 is now the standard (an offline-inference sketch on V1 follows this entry's metrics).
      • ROCm 7.0 Support: Official support added.
      • Models: Support for Qwen3-Next, DeepSeek-V3.2-Exp, OLMo3.
      • Quantization: FP8 per-token-group quantization and NVFP4 for dense models.
  • Metrics: 0 PRs recorded (Release note data only)
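
A minimal offline-inference sketch on the V1 engine, with an illustrative model name and the FP8 quantization flag left as a commented-out opt-in:

```python
# Minimal offline-inference sketch on vLLM's V1 engine (the only engine as of
# v0.11.0). The model name is illustrative; Qwen3-Next or DeepSeek-V3.2 models
# follow the same pattern on suitable hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative choice, not tied to this release
    tensor_parallel_size=1,
    # quantization="fp8",              # opt-in where the hardware/model supports it
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Summarize the vLLM v0.11.0 release in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```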

sgl-project/sglang

  • Key Activity:
    • Deep optimizations for DeepSeek and Qwen.
  • Details:
    • [2025-10-26] RELEASE: v0.5.4:
      • Native ModelOpt quantization support.
      • Full optimizations for DeepSeek-V3.2, including MTP (multi-token prediction) and PD-Disagg (prefill/decode disaggregation).
      • Overlap scheduler for speculative decoding.
  • Metrics: 0 PRs recorded

volcengine/verl

  • Key Activity:
    • Focus on RLHF/RL training infrastructure.
  • Details:
    • [2025-10-15] RELEASE: v0.6.0: Introduced a “Model Engine” service prototype. Migrated SGLang and vLLM to native server mode for rollouts. Added support for Qwen3-VL and GPT-OSS.
  • Metrics: 0 PRs recorded

🧠 JAX & Compilers

jax-ml/jax

  • Key Activity:
    • API cleanup and distributed map changes.
  • Details:
    • [2025-10-15] RELEASE: jax-v0.8.0: Breaking change: jax.pmap is now implemented in terms of jax.shard_map and is in maintenance mode (a migration sketch follows this entry's metrics). NVIDIA GPU eigendecomposition now uses cusolver by default.
  • Metrics: 0 PRs recorded
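
As a hedged migration sketch for the pmap change, the example below expresses a per-device reduction directly with jax.shard_map:

```python
# Hedged migration sketch: a per-device reduction written with jax.shard_map
# instead of jax.pmap. On older JAX releases the same function is available as
# jax.experimental.shard_map.shard_map.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("i",))

def local_sum(x):
    # x is this device's shard; psum reduces the partial sums across mesh axis "i".
    return jax.lax.psum(jnp.sum(x), axis_name="i")

global_sum = jax.shard_map(local_sum, mesh=mesh, in_specs=P("i"), out_specs=P())

x = jnp.arange(8.0 * jax.device_count())  # length divisible by the device count
print(global_sum(x))  # same value replicated on every device
```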

triton-lang/triton

  • Key Activity:
    • Backend maturity for both NVIDIA (Blackwell) and AMD (MI350); a portable kernel sketch follows this entry's metrics.
  • Details:
    • [2025-10-21] RELEASE: v3.5.0:
      • AMD: GFX950 (MI350) support added (MFMA scale, buffer load/store).
      • NVIDIA: Warp specialization enhancements and Blackwell TMEM support.
  • Metrics: 0 PRs recorded
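
As a minimal illustration of the single-source portability this release extends, the canonical vector-add kernel below compiles for both backends; nothing in it is specific to v3.5.0:

```python
# Canonical Triton vector-add kernel: the same source JITs to PTX on NVIDIA GPUs
# and to AMDGCN (e.g. gfx950 / MI350-class targets) on ROCm builds of Triton.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements              # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

if torch.cuda.is_available():  # torch.cuda covers both CUDA and ROCm builds
    a = torch.randn(10_000, device="cuda")
    b = torch.randn(10_000, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```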

tile-ai/tilelang

  • Key Activity:
    • Legacy cleanup and new hardware.
  • Details:
    • [2025-10-31] RELEASE: v0.1.6.post2: Last release supporting Python 3.8. Added support for Huawei Ascend chips and a Metal backend.
  • Metrics: 0 PRs recorded