GitHub Monthly Report: 2025-11-01 to 2025-11-30
🚀 Executive Summary
November 2025 was a pivotal month for the AMD software ecosystem, anchored by the back-to-back releases of ROCm 7.1.0 and 7.1.1. These releases introduce critical support for the next-generation Instinct MI325X, MI350X, and MI355X accelerators, while simultaneously streamlining the software stack by deprecating legacy tools (ROCm SMI, ROCTracer) in favor of newer SDKs.
In the broader AI landscape, the “Serving Wars” intensified. vLLM (v0.11.x) and SGLang (v0.5.5) both shipped massive updates focusing on batch-invariant compilation, asynchronous scheduling stability, and support for DeepSeek architectures. The Reinforcement Learning (RLHF/PPO) training frameworks Slime (v0.2.0) and Verl (v0.6.1) made significant strides, introducing FSDP backends and fully asynchronous training recipes. NVIDIA continues to aggressively optimize for Blackwell (SM100/SM120) across Triton, TransformerEngine, and xFormers, setting a high bar for day-zero hardware support in open-source libraries.
AMD Related Updates
- ROCm 7.1 Series Launch: ROCm 7.1.0 and 7.1.1 were released, expanding OS support (RHEL 10, Debian 13) and officially supporting the MI325X, MI350X, and MI355X.
- Tooling Consolidation: AMD explicitly deprecated ROCm SMI (replaced by AMD SMI) and ROCTracer/ROCProfiler (replaced by ROCprofiler-SDK), signaling a push for a unified, modern telemetry and profiling stack.
- Virtualization & Resiliency: New support for SR-IOV on MI300X/MI325X and “Multimedia Engine Reset” on MI355X improves fault tolerance in data center environments.
- Upstream Integration: vLLM merged multiple fixes for ROCm, including `custom_ops` registration, ViT attention fixes, and specific tuning configs for MI300X/MI350X. TraceLens (AMD-AGI) is actively improving JAX profiling capabilities.
Competitive Analysis
- NVIDIA Blackwell Readiness: NVIDIA libraries (TransformerEngine v2.9, xFormers v0.0.33) and downstream consumers (vLLM) aggressively added support for Blackwell (SM100/SM120) architectures this month, including specific FP8 block scaling recipes used by DeepSeek.
- Impact for AMD: NVIDIA is ensuring their software ecosystem is ready before widespread hardware availability. AMD must ensure MI300/MI325 optimized kernels (via Composable Kernel/Triton) are equally prevalent in these high-velocity repos to prevent “CUDA-first” features from becoming entrenched defaults.
- RLHF Framework Maturation: The release of Slime v0.2.0 and Verl v0.6.1 highlights a shift toward fully async PPO training and FSDP integration.
- Impact for AMD: These frameworks rely heavily on efficient communication libraries (NCCL/RCCL). AMD must ensure RCCL performance parity, particularly for the scatter/gather operations used heavily in these new RL recipes.
- PyTorch 2.9: PyTorch v2.9.1 was released. vLLM immediately updated its default build to support it.
- Impact for AMD: Rapid adoption of new PyTorch versions by consumers requires AMD to maintain tight synchronization with upstream PyTorch releases to avoid compatibility lag.
📂 Category Updates
🟢 AMD Ecosystem
[ROCm/ROCm]
- Key Activity: 🚨 Major Releases 7.1.0 & 7.1.1
- [2025-11-26] Release 7.1.1: Adds support for RHEL 10.1, MI355X/MI350X telemetry via AMD SMI, and “Multimedia Engine Reset” for resiliency. Fixes MI325X SR-IOV resets.
- [2025-11-03] Release 7.1.0: Adds support for MI325X, RHEL 10.0, and KVM SR-IOV improvements. Deprecates ROCm SMI and older profiling tools.
- Details:
- New Hardware: Full support for MI325X, MI350X, and MI355X.
- Deprecation: ROCm SMI is entering maintenance mode; users must migrate to AMD SMI. ROCTracer/ROCProfiler deprecated in favor of ROCprofiler-SDK.
- Libraries: `hipBLASLt` adds FP8 support for MI350X; `rocAL` adds Vision Transformer training support.
- Metrics: 47 New Issues, 63 New PRs, 70 Closed PRs
[AMD-AGI/Primus]
- Key Activity: Documentation restructuring and CLI improvements.
- Details: Added CLI auto-discovery for subcommands and reorganized backend patch notes.
- Metrics: 52 New PRs, 1 New Issue
[AMD-AGI/TraceLens]
- Key Activity: Improving JAX profiling accuracy.
- Details: Work on calculating kernel busy time based on HLO ops for JAX and classifying memsets/memcpy events.
- Metrics: 17 New PRs, 14 New Issues
🔥 Inference & Serving
[vllm-project/vllm]
- Key Activity: 🚨 v0.11.1 & v0.11.2 Releases - High Velocity.
- Details:
- Architecture: Updated default build to PyTorch 2.9.0 + CUDA 12.9.1.
- Features: Batch-invariant `torch.compile` support, robust async scheduling (likely default in the next release), and Anthropic API support.
- Hardware: Specific tuning configs for AMD MI300X/MI350/MI355 (via PRs) and extensive Blackwell (SM100) enablement.
- Models: DeepSeek-V3.2 and Eagle3 speculative decoding support.
- Metrics: 1481 New PRs, 1329 Closed PRs (extremely active)
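Batch invariance, as targeted by vLLM's `torch.compile` work above, means a request's results do not depend on which batch it happens to be grouped into. A minimal pure-Python sketch (conceptual illustration only, not vLLM code) of why naively batched floating-point reductions break this property:

```python
# Conceptual illustration: why batch grouping can change numerical results.
# Floating-point addition is not associative, so summing per-batch partials
# can round differently than summing the same values in one pass.

def sum_in_batches(xs, batch_size):
    # Sum each batch first, then sum the partial results.
    partials = [sum(xs[i:i + batch_size]) for i in range(0, len(xs), batch_size)]
    return sum(partials)

# Values chosen so that grouping visibly changes the rounding.
xs = [0.1] * 7 + [1e16, -1e16]
a = sum_in_batches(xs, 3)  # small values accumulate before the big cancellation
b = sum_in_batches(xs, 9)  # small values are absorbed by 1e16 and lost
# a != b: the same inputs give different answers under different batching.
```

A batch-invariant compiled kernel must fix the reduction order (or otherwise guarantee identical rounding) regardless of batch composition, which is what makes the feature nontrivial.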
[sgl-project/sglang]
- Key Activity: 🚨 v0.5.5 and Gateway v0.2.3
- Details:
- Gateway: Introduced “Harmony Pipeline” (OpenAI native architecture), bucket mode routing (20-30% perf boost), and PostgreSQL support.
- Core: Day-0 support for MiniMax-M2 and Kimi-K2-Thinking. Added video/image generation support.
- Metrics: High activity (48 commits for Gateway release alone).
[THUDM/slime]
- Key Activity: 🚨 v0.2.0 Release
- Details:
- RLHF: Added native PPO support and MTP (Multi-Token Prediction) training.
- Backend: Introduced FSDP backend and FP8 full-stack training/inference.
- Algorithms: Added Train-Inference mismatch mitigation (R2/R3 routing replay).
- Metrics: 0 Open Issues, 0 New PRs (data suggests a release dump)
[volcengine/verl]
- Key Activity: 🚨 v0.6.1 Release
- Details:
- Training: Fully Async Policy Trainer (decoupling Trainer and Rollouter).
- Megatron: Support for Qwen2.5/3VL with context parallel and 1F1B overlap.
- Metrics: 0 Open Issues (Release tracking).
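Decoupling the Trainer from the Rollouter, as in the fully async policy trainer above, can be pictured as a producer/consumer pipeline: rollout generation and gradient steps overlap instead of alternating. A conceptual sketch (not verl code; all names are illustrative):

```python
# Conceptual sketch of an async Trainer/Rollouter split (illustrative only).
# The Rollouter produces trajectories into a bounded queue; the Trainer
# consumes them as they arrive, so neither side blocks on the other's full step.
import queue
import threading

def rollouter(q, n_steps):
    for step in range(n_steps):
        # Stand-in for policy rollout / generation.
        q.put({"step": step, "trajectory": [step] * 3})
    q.put(None)  # sentinel: no more rollouts

def trainer(q, results):
    while (batch := q.get()) is not None:
        # Stand-in for a gradient update on the received batch.
        results.append(sum(batch["trajectory"]))

q = queue.Queue(maxsize=4)  # bounded: limits trainer/rollouter staleness gap
results = []
t_roll = threading.Thread(target=rollouter, args=(q, 5))
t_train = threading.Thread(target=trainer, args=(q, results))
t_roll.start(); t_train.start()
t_roll.join(); t_train.join()
```

The bounded queue is the key design choice: it caps how far the rollout policy can drift ahead of the trained policy, which is the off-policy staleness these recipes must manage.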
[llm-d/llm-d]
- Key Activity: 🚨 v0.4.0 Release
- Details: Ships a midstream vLLM image with prefill/decode (PD) disaggregation and KV-cache awareness. Added CPU offloading support.
- Metrics: 0 New Issues (Release tracking).
⚛️ PyTorch Ecosystem
[pytorch/pytorch]
- Key Activity: 🚨 v2.9.1 Release
- Details: Patch release fixing a significant memory regression in `F.conv3d` with bfloat16, plus Inductor bugs affecting Gemma compilation.
- Metrics: 1481 New PRs, 1329 Closed PRs (high-volume maintenance)
[pytorch/torchtitan]
- Key Activity: AMD fork acknowledgment and pre-compilation features.
- Details: Added documentation for AMD’s fork and ability to precompile models.
- Metrics: 88 New PRs, 28 New Issues
🟢 NVIDIA Ecosystem
[NVIDIA/TransformerEngine]
- Key Activity: 🚨 v2.9 Release
- Details: Added support for FP8 block scaling (DeepSeek recipe) on NVIDIA Blackwell. Added CPU offload support for all attention layouts.
- Metrics: 0 New PRs (Release tracking).
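Block scaling, as in the DeepSeek FP8 recipe above, gives each block of a tensor its own scale factor so an outlier in one block does not destroy precision everywhere else. A conceptual sketch of the scaling step (not TransformerEngine code; only the scaling is modeled, not the 8-bit rounding; 448 is the e4m3 format's largest finite value):

```python
# Conceptual sketch of block-wise FP8-style scaling (illustrative only).
# Each block gets a scale that maps its max-magnitude element to the
# largest representable value of the low-precision format.

FP8_E4M3_MAX = 448.0  # largest finite value in the e4m3 format

def quantize_block(block):
    amax = max(abs(v) for v in block) or 1.0  # guard against all-zero blocks
    scale = FP8_E4M3_MAX / amax
    # A real kernel would also round each scaled value to 8 bits here.
    return [v * scale for v in block], scale

def dequantize_block(qblock, scale):
    # Recover approximate original values from the scaled representation.
    return [q / scale for q in qblock]
```

Per-tensor scaling uses one scale for everything; block scaling trades a little metadata (one scale per block) for much better dynamic range, which is why the recipe matters for FP8 training.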
[facebookresearch/xformers]
- Key Activity: 🚨 v0.0.33 Release
- Details: Added Cutlass FMHA op for Blackwell GPUs and FA3 deterministic mode.
- Metrics: 0 New PRs (Release tracking).
✖️ JAX & Compilers
[jax-ml/jax]
- Key Activity: 🚨 v0.8.1 Release
- Details: `jax.jit` now supports the decorator-factory pattern. Breaking changes/deprecations in sharding APIs (PmapSharding).
- Metrics: 490 New PRs, 435 Closed PRs
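The decorator-factory pattern lets one decorator work both bare (`@jit`) and with configuration arguments (`@jit(static_argnums=0)`). A pure-Python sketch of the pattern itself (illustrative only, not JAX source; the real `jax.jit` traces and compiles rather than calling through):

```python
# Pure-Python sketch of a decorator that also works as a decorator factory.
import functools

def jit(fn=None, *, static_argnums=()):
    if fn is None:
        # Called with only keyword arguments: return a configured decorator.
        return functools.partial(jit, static_argnums=static_argnums)
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        # A real jit would trace/compile here; we just call through.
        return fn(*args, **kwargs)
    wrapped.static_argnums = static_argnums
    return wrapped

@jit  # bare form: fn is passed directly
def f(x):
    return x + 1

@jit(static_argnums=(0,))  # factory form: returns a decorator first
def g(n, x):
    return x * n
```

The `fn=None` default plus keyword-only options is what makes both call shapes dispatch correctly, avoiding the older `functools.partial(jax.jit, ...)` workaround at call sites.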
[AI-Hypercomputer/maxtext]
- Key Activity: 🚨 v0.1.0 PyPI Release
- Details: First official PyPI package release. Significant documentation updates for tutorials.
- Metrics: 177 New PRs, 188 Closed PRs
[triton-lang/triton]
- Key Activity: 🚨 v3.5.1 Release
- Details: Bug fix release specifically for SM103 (GB300) support broken in 3.5.0.
- Metrics: 216 New PRs, 26 New Issues