📅 Engineering Report (2025-11-01 - 2025-11-30)

🚀 Executive Summary

November 2025 was a pivotal month characterized by rapid adoption of next-generation hardware and the maturation of RLHF (Reinforcement Learning from Human Feedback) pipelines. ROCm saw two significant releases (7.1.0 and 7.1.1), aggressively expanding support for RHEL 10 and the AMD Instinct MI355X/MI350X series, while narrowing the compatibility gap with CUDA via HIP API updates.

In the broader ecosystem, PyTorch released v2.9.1, prompting downstream consumers like vLLM (v0.11.x) to upgrade immediately. The inference landscape remains hyper-competitive; vLLM and SGLang both released major updates focusing on multimodal support and DeepSeek model optimizations. On the training side, Slime and Verl both shipped major versions (v0.2.0 and v0.6.1 respectively), introducing advanced PPO support, FSDP backends, and MTP (Multi-Token Prediction) training, signaling a shift toward more complex post-training workflows in open source.

  • ROCm 7.1 Series Launch: AMD released both ROCm 7.1.0 and 7.1.1. Key deliverables include official support for RHEL 10.1 and Debian 13, and extended telemetry/monitoring for the MI350X and MI355X series.
  • HIP API Parity: ROCm 7.1.0 introduced several new HIP APIs (e.g., hipMemsetD2D8, hipMemcpyBatchAsync) specifically designed to match CUDA functionality, reducing friction for porting kernels.
  • Software Ecosystem Maturity: The ROCm Data Science Toolkit (ROCm-DS) transitioned from Early Access to General Availability.
  • vLLM Integration: The vLLM project merged specific fixes for the ROCm AITER backend and added a ROCm quickstart guide, indicating improving upstream support for AMD accelerators in top-tier inference engines.
  • Primus & TraceLens: Continued active development in AMD-AGI internal tools, focusing on JAX support and CLI improvements for training workflows.

📊 Competitive Analysis

  • NVIDIA Blackwell Readiness: Support for the Blackwell architecture (SM100/SM120) is landing rapidly across the stack. TransformerEngine v2.9, xFormers v0.0.33, and vLLM all merged specific optimizations (e.g., FP8 block scaling, CUTLASS kernels) for GB200/B200 hardware this month.
  • DeepSeek Optimizations: There is a distinct industry-wide sprint to optimize for DeepSeek v3/v3.2 models. Competitor libraries (NVIDIA-backed projects and community inference engines) are prioritizing kernels (FlashMLA) specifically for these architectures.
  • Triton 3.5.1: A patch release was issued specifically to fix support for SM103 (GB300), reinforcing NVIDIA’s tight feedback loop between hardware release and software stack readiness.

📂 Category Updates

AMD Ecosystem

[ROCm/ROCm]

  • Key Activity:
    • [2025-11-03] 🚨 Release: rocm-7.1.0: Major update extending OS support (RHEL 10, Debian 13) and hardware support (MI325X/MI355X). Added new HIP APIs to match CUDA memory management and stream behaviors.
    • [2025-11-26] 🚨 Release: rocm-7.1.1: Patch release adding telemetry enhancements for MI355X/MI350X, GEMM kernel selection improvements (Origami), and support for newer models like Phi-4 and QwQ-32B.
  • Details:
    • Deprecation warning: ROCm-EP (the ROCm Execution Provider for ONNX Runtime) has reached end of support (EOS); users must migrate to the MIGraphX EP.
    • Significant updates to hipBLASLt (TF32/FP32 optimizations) and RCCL (performance fixes for MI300X with Pollara AI NIC).
  • Metrics: 63 PRs, 47 Issues (High Activity)

[AMD-AGI/Primus]

  • Key Activity:
    • [2025-11-17] CLI improvements for auto-discovery of subcommands.
    • [2025-11-14] Documentation reorganization for backend patch notes.
  • Metrics: 52 PRs, 1 Issue (Good maintenance health)

[AMD-AGI/TraceLens]

  • Key Activity:
    • [2025-11-XX] Focus on calculating kernel busy time based on HLO ops for JAX.
    • [2025-11-XX] Improvements to GPUEventAnalyzer to classify memsets in memcpy.
  • Metrics: 17 PRs, 14 Issues

PyTorch Ecosystem

[pytorch/pytorch]

  • Key Activity:
    • [2025-11-12] 🚨 Release: v2.9.1: Patch release fixing a significant memory regression in F.conv3d with bfloat16 inputs, along with various torch.compile/Inductor bugs.
  • Details:
    • High volume of “Silent Correctness” fixes for Inductor graph breaks.
    • Updates to cpp-httplib and RNG bindings.
  • Metrics: 1481 PRs, 1007 Issues (Massive scale)

[pytorch/torchtitan]

  • Key Activity:
    • [2025-11-06] Documentation update explicitly mentioning AMD’s fork.
    • [2025-11-XX] Added ability to precompile torchtitan models.
  • Metrics: 88 PRs, 28 Issues

Inference & Serving

[vllm-project/vllm]

  • Key Activity:
    • [2025-11-18] 🚨 Release: v0.11.1: Major update moving to PyTorch 2.9.0 + CUDA 12.9.1.
    • [2025-11-20] Release: v0.11.2: Patch fixes for Ray multi-node and async-scheduling.
  • Details:
    • Added Anthropic API support (/v1/messages).
    • Implemented batch-invariant torch.compile support.
    • Extensive work on DeepSeek v3.2 and Qwen2.5/3-VL support.
    • AMD Specific: ROCm quickstart guide added; fixes for AITER unified attention.
  • Metrics: 1456 Commits (tracked via release notes)
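The new Anthropic-compatible endpoint can be exercised with an ordinary JSON POST. The sketch below builds a request body in the Anthropic messages schema; the model name is a placeholder, and the server URL (and whether your deployment exposes /v1/messages) depends on your vLLM launch configuration.

```python
import json

# Hedged sketch: construct the JSON body for a POST to an Anthropic-style
# /v1/messages endpoint such as the one vLLM v0.11.1 adds. The model name
# and max_tokens value here are illustrative placeholders.
def build_messages_request(model: str, user_text: str, max_tokens: int = 256) -> str:
    """Return the JSON body for a POST to /v1/messages."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,  # required field in the Anthropic messages schema
        "messages": [
            {"role": "user", "content": user_text},
        ],
    }
    return json.dumps(payload)

# Example body you would POST to e.g. http://localhost:8000/v1/messages
body = build_messages_request("my-served-model", "Summarize ROCm 7.1 changes.")
```

In practice the body is sent with a `Content-Type: application/json` header; the serving endpoint maps `model` to whatever model name the engine was launched with.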

[sgl-project/sglang]

  • Key Activity:
    • [2025-11-06] 🚨 Release: v0.5.5: Day-0 support for Kimi-K2-Thinking and Minimax-M2.
    • [2025-11-17] Release: gateway-v0.2.3: Introduced bucket-mode routing (20-30% perf boost) and PostgreSQL support.
  • Details:
    • Added support for direct video/image generation inference.
    • Optimizations for Blackwell kernels.
  • Metrics: High PR velocity related to router and kernel optimizations.
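The gateway's bucket-mode routing internals aren't detailed here; the sketch below shows the general idea generically: requests sharing a session/prefix key hash into a fixed number of buckets, and each bucket maps to one worker, so cache-affine requests reach the same backend. The bucket count and worker names are illustrative, not gateway-v0.2.3's implementation.

```python
import hashlib

# Hypothetical sketch of bucket-mode routing: hash a stable session key
# into one of num_buckets buckets, then map the bucket to a worker, so
# requests that share a key (and likely a KV-cache prefix) are co-located.
def route(session_key: str, workers: list[str], num_buckets: int = 64) -> str:
    digest = hashlib.sha256(session_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_buckets
    return workers[bucket % len(workers)]

workers = ["worker-0", "worker-1", "worker-2"]
# The same key always routes to the same worker.
assert route("session-abc", workers) == route("session-abc", workers)
```

Deterministic bucketing is what yields the cache-hit gains: unlike round-robin, repeat requests for a session never bounce between backends.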

[llm-d/llm-d]

  • Key Activity:
    • [2025-11-26] 🚨 Release: v0.4.0: Major release introducing CPU image variants and workload-variant-autoscaler.
    • [2025-11-06] Release: v0.3.1: Added ARM support and AKS cloud provider support.
  • Details:
    • Updates to modelservice helm charts.
    • Integration of gateway-api-inference-extension v1.2.0.
  • Metrics: Steady development on infrastructure/deployment tools.

[triton-lang/triton]

  • Key Activity:
    • [2025-11-12] 🚨 Release: v3.5.1: Hotfix specifically for SM103 (NVIDIA GB300) support, which was broken in 3.5.0.
  • Metrics: 216 PRs, 26 Issues

Training & RLHF

[volcengine/verl]

  • Key Activity:
    • [2025-11-14] 🚨 Release: v0.6.1: Major feature drop for RLHF training.
  • Details:
    • Megatron: Added support for 1F1B overlap, Qwen3VL MoE/Dense models, and Context Parallel.
    • Trainer: Added FP16 training support (FSDP/Megatron).
    • Algorithm: Overhaul of Rollout Correction system.
  • Metrics: Active development in advanced RLHF techniques.
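Rollout correction addresses the gap between the policy that generated the rollouts (the inference engine) and the policy being trained. A common correction, sketched here in hedged form, is truncated importance sampling: reweight each token's loss by exp(logp_train - logp_rollout), capped to bound variance. The cap value is illustrative, not verl's setting.

```python
import math

# Hedged sketch of rollout correction via truncated importance sampling:
# tokens whose trainer-policy log-prob drifts from the rollout engine's
# log-prob are reweighted by the likelihood ratio, capped at `cap`.
def correction_weights(logp_train, logp_rollout, cap=2.0):
    return [min(math.exp(t - r), cap) for t, r in zip(logp_train, logp_rollout)]

w = correction_weights([-1.0, -2.0], [-1.0, -2.5])
# Matching log-probs give weight 1.0; a token the trainer finds more
# likely than the rollout engine did gets up-weighted (capped at 2.0).
```

Without such a correction, numerics differences between the training and inference stacks silently bias the gradient, which is exactly the train-inference mismatch these frameworks are now targeting.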

[THUDM/slime]

  • Key Activity:
    • [2025-11-28] 🚨 Release: v0.2.0: Massive update adding FSDP backend and PPO support.
  • Details:
    • Added MTP (Multi-Token Prediction) training during RL.
    • Full stack FP8 support (training and inference).
    • Implemented “True On-Policy” training to eliminate train-inference mismatch.
  • Metrics: 24 New PRs, 25 Issues
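MTP changes what the model is supervised on: at each position i it predicts tokens i+1 through i+k with k heads, so training needs k shifted target streams. The sketch below shows only that target construction, in hedged form; the head count is illustrative and this is not slime's implementation.

```python
# Hedged sketch of Multi-Token Prediction (MTP) target construction:
# stream d holds, for each position, the token (d+1) steps ahead, giving
# one supervision stream per prediction head.
def mtp_targets(tokens: list[int], k: int) -> list[list[int]]:
    """Return k target streams aligned to positions 0 .. len(tokens)-k-1."""
    n = len(tokens)
    return [[tokens[i + d + 1] for i in range(n - k)] for d in range(k)]

streams = mtp_targets([10, 11, 12, 13, 14], k=2)
# stream 0 predicts the next token, stream 1 the token after that:
# streams == [[11, 12, 13], [12, 13, 14]]
```

The per-head losses are summed into the training objective; keeping MTP heads during RL is what lets the same model later serve speculative decoding.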

[NVIDIA/TransformerEngine]

  • Key Activity:
    • [2025-11-11] 🚨 Release: v2.9: Added FP8 block scaling recipe (required for DeepSeek v3) on Blackwell.
  • Details:
    • Added CPU offload support for all attention layouts.
    • API updates to generalize non-FP8 recipes.
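The idea behind block scaling is that each small block of a tensor gets its own scale factor computed from its absolute max, instead of one scale per tensor, preserving dynamic range for low-magnitude blocks. The toy sketch below uses 1-D blocks and an integer grid as a stand-in for the real E4M3 format; it is not TransformerEngine's implementation.

```python
# Hedged sketch of block-scaled quantization, the idea behind FP8 "block
# scaling" recipes: per-block scales derived from each block's absolute
# max. Integers stand in for FP8 values here; block shape is illustrative.
FP8_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_block(block: list[float]) -> tuple[list[int], float]:
    amax = max(abs(x) for x in block) or 1.0  # guard against all-zero blocks
    scale = amax / FP8_MAX
    return [round(x / scale) for x in block], scale

def dequantize_block(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

q, s = quantize_block([0.001, -0.002, 0.0005, 0.0015])
restored = dequantize_block(q, s)
# Per-block scaling keeps these tiny values accurate; a single per-tensor
# scale shared with large-magnitude blocks would flush them toward zero.
```

This is why block scaling is described as required for DeepSeek v3-style training: those models were trained with fine-grained (per-block) FP8 scales rather than per-tensor ones.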

JAX & XLA

[jax-ml/jax]

  • Key Activity:
    • [2025-11-18] 🚨 Release: v0.8.1: Usability improvements for jax.jit (decorator factory pattern).
  • Details:
    • jax.lax.linalg.eigh now supports implementation selection (QR/Jacobi/QDWH).
    • Deprecated jax.sharding.PmapSharding.
  • Metrics: 490 PRs, 0 Issues (Very high PR closure rate)
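The decorator-factory pattern mentioned in the v0.8.1 notes means one callable works both as a bare decorator and as a factory taking options. The sketch below illustrates the pattern generically in plain Python; it is not JAX's implementation, and the `static_argnums` handling is a stub.

```python
import functools

# Generic illustration of the decorator-factory pattern: `jit` can be used
# either as @jit or as @jit(static_argnums=...). This is a plain-Python
# sketch; a real JIT would trace and compile inside `wrapped`.
def jit(fn=None, *, static_argnums=()):
    if fn is None:  # called with options only: return a decorator
        return functools.partial(jit, static_argnums=static_argnums)

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)  # compilation would happen here

    wrapped.static_argnums = static_argnums
    return wrapped

@jit
def f(x):
    return x + 1

@jit(static_argnums=(0,))
def g(n, x):
    return x * n
```

The usability win is that users no longer need to remember two different call forms (e.g. wrapping in `functools.partial`) depending on whether they pass options.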

[AI-Hypercomputer/maxtext]

  • Key Activity:
    • [2025-11-06] 🚨 Release: v0.1.0: First official PyPI package release.
  • Details:
    • Shift to standard installation via pip.
    • Tutorial updates for v1.3.0.
  • Metrics: 177 PRs, 10 Issues

[openxla/xla]

  • Key Activity:
    • Merge of ir_emitter and ir_emitter_nested for GPU.
    • Ongoing integration with LLVM upstream.
  • Metrics: 1149 PRs, 9 Issues (Extremely high velocity)