πŸ“… Engineering Report (2026-02-01 to 2026-02-28)

πŸš€ Executive Summary

February 2026 was an exceptionally active month across the AI ecosystem, marked by an industry-wide push to optimize inference and training for the newest frontier models (DeepSeek V3/R1, Qwen3.5, and GLM-5) using extreme low-precision formats (FP8, FP4, NVFP4, and MXFP4/8). Ecosystem maintenance health is exceptionally high, with high-velocity repositories such as Hugging Face transformers, vLLM, and Megatron-LM merging hundreds of PRs to support next-generation Mixture-of-Experts (MoE) scaling and new hardware platforms.

  • Primus Gets Major Upgrades: AMD’s Primus framework saw a substantial v0.7.0 release integrating heavily with Megatron (MoE, FP8, FSDP2, Zero-bubble pipeline parallelism) and explicitly optimizing batch sizes for DeepSeek V3 on the MI355X.
  • Ecosystem Native Support Expanding: Upstream projects are rapidly merging AMD-specific optimizations. TorchAO added support for AMD gfx942 FP8 types in scaled grouped GEMMs. TorchTitan added extensive ROCm CI support (running formerly H100-only tests on ROCm, plus mxfp8 on gfx950). TileLang shipped native support for the FlashAttention-2 (FA2) forward pass on AMD MI300X and FP8 on MI350/MI355.
  • ROCm 7 Standardization in SGLang: SGLang officially deprecated ROCm 6.3 in favor of standardizing on ROCm 7, pushing critical optimizations like FP8 prefill attention kernels and Day-0 support for new models like Kimi K2.5 on AMD hardware.
  • Profiling Tooling Matures: TraceLens updated its support for rocprofv3 and added ridge points to roofline metrics, improving AMD hardware performance debugging.

Competitive Analysis

  • NVIDIA’s Blackwell Push is Aggressive: NVIDIA is rapidly maturing its software stack for the Blackwell architecture (SM120/SM121). TransformerEngine v2.12 and Megatron-LM v0.16.0 heavily prioritized NVFP4 MoE kernels, TRT-LLM integration, and UVM (Unified Virtual Memory) allocator compilation.
  • TRT-LLM Driving Massive Inference Gains: SGLang integrated TRT-LLM Native Sparse Attention (NSA) kernels, specifically citing a 3x-5x speedup for DeepSeek V3.2 on Blackwell. AMD must ensure its vLLM and SGLang backends (like AITER) can match this sparse-attention kernel efficiency.
  • XPU/Intel Platform Overhaul: vLLM completely deprecated IPEX in favor of vllm-xpu-kernels, adding substantial XPU support including unquantized MoE, MXFP4 MoE, and scaled_mm kernels. Intel is moving aggressively to unify its inference backend.
  • The Low-Precision War: The battleground for LLM performance has definitively shifted to FP4 and MXFP8/4. With TorchAO showing 10-25% speedups for DeepSeek V3 using MXFP8 MoE building blocks, hardware parity in handling these formats without accuracy degradation is the primary competitive moat for Q1/Q2 2026 (a minimal sketch of the MX block-scaling idea follows this list).
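
A minimal, purely illustrative sketch of the MX block-scaling idea behind these formats (per the OCP Microscaling spec: 32-element blocks sharing one power-of-two E8M0 scale). The helper names and the choice of float8_e4m3fn are assumptions for illustration, not any library's API:

```python
import torch

MX_BLOCK = 32            # OCP MX spec: 32 elements share one scale
FP8_E4M3_MAX = 448.0     # max magnitude representable in float8_e4m3fn

def mx_quantize(x: torch.Tensor):
    """Quantize a 1-D tensor (length a multiple of 32) to MXFP8-style blocks."""
    blocks = x.reshape(-1, MX_BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # Shared per-block power-of-two scale (E8M0 stores only an exponent).
    scale = torch.exp2(torch.ceil(torch.log2(amax / FP8_E4M3_MAX)))
    return (blocks / scale).to(torch.float8_e4m3fn), scale

def mx_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(128)
q, s = mx_quantize(x)
print((mx_dequantize(q, s) - x).abs().max())  # small per-block error
```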

πŸ“‚ Category Updates

AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2026-02-02] 🚨 RELEASE: v0.7.0 - Massive framework update focusing on performance projection and Megatron integration.
    • [2026-02-06] DOC UPDATE: Updated Docker base image from v25.10 to v26.1.
  • Details:
    • [2026-02-02] Integrated Megatron MoE, Transformer Engine, Turbo, and Zero-bubble pipeline (Zbpp) patches.
    • [2026-02-02] Added Torch FSDP2 and FP8 context support (hedged FSDP2 usage sketch after this subsection).
    • [2026-02-02] MI355X Optimization: Maximized batch sizes for DeepSeek V3 on MI355X (adjusted 16B batch size to 8).
    • [2026-02-02] Preflight tool completely refactored for richer host+GPU+network reports.
  • Metrics: 44 PRs, 2 Issues (Highlight Issues: Segfault on 8x MI250X, Slurm hang when MASTER_ADDR != SLURM_NODELIST[0])
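
For readers unfamiliar with FSDP2, the sketch below shows the upstream PyTorch per-parameter-sharding API that Primus integrates. This is generic PyTorch usage (assuming a recent release where fully_shard is exported from torch.distributed.fsdp), not Primus's actual wiring:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 per-parameter sharding

def main():
    # Launch with: torchrun --nproc-per-node=8 this_script.py
    dist.init_process_group("nccl")  # RCCL on ROCm
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = nn.Transformer(d_model=512, nhead=8, device="cuda")
    # Shard each block individually so communication overlaps with compute,
    # then wrap the root module last.
    for layer in list(model.encoder.layers) + list(model.decoder.layers):
        fully_shard(layer)
    fully_shard(model)
    src = torch.randn(4, 8, 512, device="cuda")
    tgt = torch.randn(4, 8, 512, device="cuda")
    model(src, tgt).sum().backward()

if __name__ == "__main__":
    main()
```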

AMD-AGI/TraceLens

  • Key Activity:
    • [2026-02-13] DOC UPDATE: Added support for analyzing pftrace files produced by the rocprofv3 CLI.
    • [2026-02-25] DOC UPDATE: Restructured README.
  • Details:
    • [2026-02-25] Merged a PR adding a ridge-point column to the TraceLens perf-report roofline metrics (see the roofline sketch after this subsection).
  • Metrics: 27 PRs, 11 Issues
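
The ridge point itself is standard roofline arithmetic: the arithmetic intensity (FLOP/byte) at which a kernel transitions from memory-bound to compute-bound. A tiny sketch with placeholder peak numbers (illustrative, not published MI300X specs):

```python
PEAK_FLOPS = 1.3e15  # hypothetical peak FP16 FLOP/s
PEAK_BW = 5.3e12     # hypothetical peak HBM bytes/s

# Ridge point: intensity where the memory roof meets the compute roof.
ridge_point = PEAK_FLOPS / PEAK_BW  # FLOP/byte

def attainable_tflops(intensity: float) -> float:
    """Roofline model: attainable throughput at a given arithmetic intensity."""
    return min(PEAK_FLOPS, intensity * PEAK_BW) / 1e12

print(f"ridge point: {ridge_point:.1f} FLOP/byte")
print(f"at 10 FLOP/byte: {attainable_tflops(10):.0f} TFLOP/s (memory-bound)")
```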

ROCm/ROCm & ROCm/MAD

  • Key Activity:
    • Ongoing tracking of ecosystem bug reports and vLLM documentation alignment.
  • Details:
    • [ROCm] Highlighted Windows RDNA3 HIP memory-pool fragmentation causing severe slowdowns in VRAM-heavy workloads.
    • [MAD] Added vLLM router support for PD Disaggregation and released PyTorch xdit:v26.2.
  • Metrics: 45 PRs, 74 Issues

AMD-AGI/GEAK-agent

  • Key Activity:
    • [2026-02-25] Refactored CLI: translated Chinese references to English and renamed --mcp to --rag.
  • Metrics: 0 PRs, 0 Issues

PyTorch Ecosystem

pytorch/ao

  • Key Activity:
    • [2026-02-10] 🚨 RELEASE: v0.16.0 - Major release introducing MXFP8 MoE building blocks.
  • Details:
    • [2026-02-10] Added MXFP8 MoE training with Expert Parallelism (EP), yielding a 10-25% tokens/sec speedup for DeepSeek V3.
    • [2026-02-10] AMD Support: scaled_grouped_mm now supports gfx942 FP8 data types (hedged FP8 GEMM sketch after this subsection).
    • [2026-02-10] Cleaned up the API by deprecating the v1 quantization configs and the legacy GPTQ and SmoothQuant implementations.
  • Metrics: 0 PRs, 20 Issues
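
As background for the gfx942 item above, the sketch below shows the per-tensor-scaled FP8 GEMM pattern that scaled_grouped_mm generalizes to groups of experts, using PyTorch's private torch._scaled_mm (a single GEMM, not the grouped op; requires an FP8-capable GPU). Note that gfx942 natively uses the fnuz FP8 variants; the quantization helper is an assumption for illustration:

```python
import torch

# gfx942 (MI300X) uses the *fnuz* FP8 types; NVIDIA GPUs use e4m3fn.
FP8 = torch.float8_e4m3fnuz if torch.version.hip else torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max

def to_fp8(t: torch.Tensor):
    """Per-tensor quantization: returns FP8 data plus its dequant scale."""
    scale = t.abs().amax().clamp(min=1e-12) / FP8_MAX
    return (t / scale).to(FP8), scale.to(torch.float32)

a = torch.randn(128, 64, device="cuda")
b = torch.randn(64, 256, device="cuda")
qa, sa = to_fp8(a)
qb, sb = to_fp8(b.t())  # _scaled_mm expects the second operand column-major
out = torch._scaled_mm(qa, qb.t(), scale_a=sa, scale_b=sb,
                       out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([128, 256])
```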

pytorch/torchtitan

  • Key Activity:
    • [2026-02-20] 🚨 RELEASE: v0.2.2
  • Details:
    • [2026-02-20] ROCm Native CI: Extended CI to run formerly H100-only tests on ROCm, covering simple FSDP experiments, Auto Parallel, and Transformer Modeling Backend tests. Added mxfp8 support on gfx950.
    • [2026-02-20] Added DeepEP shared_experts overlap, Qwen3 attention scaling, and updated DeepSeek V3 device mesh usage.
  • Metrics: 120 PRs, 52 Issues

pytorch/pytorch

  • Key Activity:
    • [2026-02-24] Bumped the C++ standard to C++20 in the CMake build.
  • Details:
    • [2026-02-28] Addressed posix_fallocate error handling and tracked active issues regarding uneven FSDP2 sharding and torch.vmap failures with functorch (illustrative torch.func sketch after this subsection).
  • Metrics: 1679 PRs, 806 Issues (massive throughput, stable health)
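
For context, the torch.vmap/functorch issues concern compositions like the per-sample-gradient pattern below (generic torch.func usage, not the specific failing reproduction):

```python
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(4, 1)
params = dict(model.named_parameters())

def loss(p, x, y):
    # Stateless call so grad() can differentiate w.r.t. the params pytree.
    return ((functional_call(model, p, (x,)) - y) ** 2).mean()

x, y = torch.randn(8, 4), torch.randn(8, 1)
# vmap over the batch dim of x/y while broadcasting the shared params.
per_sample_grads = vmap(grad(loss), in_dims=(None, 0, 0))(params, x, y)
print(per_sample_grads["weight"].shape)  # torch.Size([8, 1, 4])
```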

NVIDIA Ecosystem

NVIDIA/Megatron-LM

  • Key Activity:
    • [2026-02-26] 🚨 RELEASE: core_v0.16.0 - Massive framework evolution.
    • [2026-02-06] 🚨 RELEASE: core_v0.15.3 - Security patch release.
  • Details:
    • [2026-02-26] DeepSeek/Qwen Support: Added DeepSeek V3.2 support, Qwen3-Next MoE gates, and MTP packed-seq support.
    • [2026-02-26] Blackwell & Formats: Added NVFP4 MoE with proper padding and FP8 params support for Megatron-FSDP.
    • [2026-02-26] Architecture: Merged Megatron-RL into LM, enabling native RLHF/GRPO functional tests. Integrated Hybrid Context Parallelism and hybrid tensor+expert+data parallelism for inference (hedged initialization sketch after this subsection).
  • Metrics: Dozens of active PRs mapped to the release.
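
A heavily hedged sketch of how such a hybrid layout is configured through Megatron-Core's parallel_state; kwarg availability varies across core versions, and the sizes shown are illustrative, not the release's defaults:

```python
import torch.distributed as dist
from megatron.core import parallel_state

def init_hybrid_parallel():
    # Launched via torchrun; the world size must be divisible by the
    # product of the model-parallel sizes below.
    dist.init_process_group("nccl")
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=2,    # shard attention/MLP GEMMs
        pipeline_model_parallel_size=1,
        context_parallel_size=2,         # hybrid context parallelism
        expert_model_parallel_size=4,    # shard MoE experts across ranks
    )
    # Data parallelism occupies the remaining ranks automatically.
```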

NVIDIA/TransformerEngine

  • Key Activity:
    • [2026-02-24] 🚨 RELEASE: v2.12
  • Details:
    • [2026-02-24] Improved NVFP4 quantization kernels and fused permute+pad for FP8 optimization.
    • [2026-02-24] Fixed SM120 compilation with CUDA 12 and added cudagraph support for activation recomputation.
  • Metrics: 0 PRs, 0 Issues (in the current window)

Serving & Inference Ecosystem

vllm-project/vllm

  • Key Activity:
    • [2026-02-25] 🚨 RELEASE: v0.16.0
    • [2026-02-04] 🚨 RELEASE: v0.15.1 (Security & RTX Blackwell fixes)
  • Details:
    • [2026-02-25] Core Engine: Async scheduling + Pipeline Parallelism now fully supported together (30.8% throughput improvement; hedged usage sketch after this subsection).
    • [2026-02-25] Hardware: AMD ROCm received Qwen3-Next FP8 tunings and an AITER attention backend. Intel XPU was completely overhauled (IPEX deprecated in favor of vllm-xpu-kernels). NVIDIA gained FlashInfer TRT-LLM BF16 MoE and SM100 INT4 W4A16 support.
    • [2026-02-25] Features: Added WebSocket-based Realtime API (audio) and Unified Parallel Drafting for speculative decoding.
  • Metrics: Hundreds of PRs merged to support v0.16.0.
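
A hedged sketch of the v0.16.0 combination called out above. Passing async_scheduling as a constructor kwarg is an assumption based on the release notes (the LLM constructor forwards engine arguments); check your version's engine args for the exact flag, and the model choice is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # illustrative model choice
    tensor_parallel_size=8,
    pipeline_parallel_size=2,         # now composes with async scheduling
    async_scheduling=True,            # assumed engine-arg name
)
outputs = llm.generate(["Explain MXFP8 in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```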

sgl-project/sglang

  • Key Activity:
    • [2026-02-24] 🚨 RELEASE: v0.5.9
  • Details:
    • [2026-02-24] Blackwell NSA: Integrated TRT-LLM DSA kernels for Native Sparse Attention (NSA), yielding 3x-5x performance boosts for DeepSeek V3.2 on Blackwell.
    • [2026-02-24] AMD ROCm: Deprecated ROCm 6.3 and standardized on ROCm 7. Added FP8 prefill attention kernel integration and bumped AITER to v0.1.10.post3 (FP8 Prefill/Decode/KV Cache support; hedged launch sketch after this subsection).
    • [2026-02-24] Core: LoRA weight-loading overlap reduces TTFT by ~78%. Added a FlashInfer all-to-all MoE dispatcher.
  • Metrics: High velocity; directly competing with vLLM on DeepSeek V3 efficiency.
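
A hedged sketch of the offline Engine API on ROCm; attention_backend="aiter" mirrors the AITER integration noted above, but the exact argument name and accepted values should be verified against your SGLang build:

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",  # illustrative model choice
    tp_size=8,
    attention_backend="aiter",             # assumed AMD AITER backend name
)
print(llm.generate("Hello from MI300X!", {"max_new_tokens": 32}))
llm.shutdown()
```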

llm-d/llm-d

  • Key Activity:
    • [2026-02-04] 🚨 RELEASE: v0.5.0
  • Details:
    • [2026-02-04] Promoted the Workload-Variant-Autoscaler to a core component.
    • [2026-02-04] Upgraded infrastructure to Gateway API v1.4.0 and Istio 1.28.1. Temporarily deprecated EFA (Elastic Fabric Adapter) support due to libibverbs conflicts.
  • Metrics: 0 PRs, 0 Issues (post-release stabilization)

tile-ai/tilelang

  • Key Activity:
    • [2026-02-16] 🚨 RELEASE: v0.1.8
  • Details:
    • [2026-02-16] Enabled the FlashAttention-2 (FA2) forward pass on AMD MI300X.
    • [2026-02-16] Added AMD MI350/MI355 FP8 support.
    • [2026-02-16] Integrated the Z3 solver into the TVM arithmetic analyzer and added a CuTeDSL backend.
  • Metrics: 95 PRs, 57 Issues

xdit-project/xDiT & deepseek-ai/DeepEP

  • Key Activity:
    • [xDiT] Added Qwen Image, Qwen Image Edit, and FLUX.2-klein-9/4B support. A PR was opened for AITER Sage V2.
    • [DeepEP] Updated the README for the mori-EP branch (AMD community fork). Tracked an issue where initialization fails when CUDA peer access is only partially available.

Hugging Face Ecosystem

huggingface/transformers

  • Key Activity:
    • [2026-02-16] 🚨 RELEASE: v5.2.0
    • [2026-02-05] 🚨 RELEASE: v5.1.0
  • Details:
    • [2026-02-16] Model Additions: Qwen3.5 (397B), GLM-5, VoxtralRealtime, VibeVoice Acoustic Tokenizer.
    • [2026-02-05] Model Additions: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM.
    • [2026-02-16] Breaking Changes: Rolled out a new attention-mask interface across all models and changed ModernBERT's default attention to no longer use FlashAttention (explicit selection sketch after this subsection).
    • [2026-02-16] Deprecations: Removed torchao.autoquant from transformers.
  • Metrics: 561 PRs, 261 Issues (highly active, maintaining excellent closure rates).
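
Since the ModernBERT default moved away from FlashAttention, the implementation can still be selected explicitly via the long-standing attn_implementation argument. A minimal sketch (the model id is the public ModernBERT base; "flash_attention_2" requires the flash-attn package):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",
    attn_implementation="flash_attention_2",  # or "sdpa" / "eager"
)
```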

Compilers & Other Frameworks

  • openxla/xla: Tracked CPU cross-compile build failures and GPU hangs with Triton Warp Specialization. (1348 PRs)
  • triton-lang/triton: Added an LLVM diagnostic handler for early AMD LDS resource checks. (241 PRs)
  • AI-Hypercomputer/maxtext: Addressed Qwen3 MoE conversion errors and enabled masking for the AdamW optimizer. (216 PRs)
  • facebookresearch/xformers: Released v0.0.35 (2026-02-20), supporting free-threaded Python and removing the bundled pre-built Flash-Attention 3.
  • deepspeedai/DeepSpeed: Released v0.18.6 (2026-02-12) adding custom partitioning patterns for AutoTP.