📅 Engineering Report (2026-02-01 - 2026-02-28)

🚀 Executive Summary

February 2026 was defined by the rapid productization of reasoning models (DeepSeek R1) and Mixture-of-Experts (MoE) architectures (DeepSeek V3) across the stack. Frameworks are shifting from experimental support to production hardening.

AMD made significant strides in training infrastructure with Primus v0.7.0, establishing a robust patch system for Megatron-LM and TorchTitan on ROCm. NVIDIA consolidated their lead in Post-Training (RLHF) by merging Megatron-RL into Megatron-LM Core and introducing GRPO support, directly addressing the industry’s shift toward reasoning-heavy workloads. In inference, vLLM and SGLang both released major versions focusing on Pipeline Parallelism, low-bit quantization (FP8/NVFP4), and disaggregated serving.

  • Primus v0.7.0 Release: A critical update for AMD’s training stack. It introduces a comprehensive patch framework for Megatron-LM and TorchTitan, enabling deterministic training, MoE support, and specific optimizations for DeepSeek V3 on MI355X hardware.
  • ROCm 7 & MI300X Optimization: Tilelang v0.1.8 enabled the FlashAttention-2 forward pass on MI300X and updated CI to ROCm 7.1. SGLang v0.5.9 deprecated ROCm 6.3 artifacts in favor of ROCm 7 standardization.
  • Day-0 Model Support: Both vLLM and SGLang integrated support for new architectures (e.g., Kimi K2.5, Qwen3-Next) on ROCm immediately upon release, utilizing updated AITER kernels.

Competitive Analysis

  • NVIDIA Consolidates RL Training: With Megatron-LM Core v0.16.0, NVIDIA merged Megatron-RL into the main codebase. The addition of GRPO (Group Relative Policy Optimization) support indicates a strategic focus on the “reasoning” training loop popularized by DeepSeek R1, moving beyond standard SFT/DPO.
  • Quantization Warfare: TransformerEngine v2.12 and vLLM v0.16.0 significantly improved NVFP4 (4-bit floating point) support. This puts pressure on non-NVIDIA hardware to accelerate sub-8-bit precision support to remain competitive in inference throughput.
  • Serving Maturity: vLLM’s introduction of Pipeline Parallelism and a Realtime API (WebSocket-based) in v0.16.0 closes the gap with enterprise-grade serving solutions, reducing the need for third-party wrappers.
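The group-relative baseline at the heart of GRPO is simple enough to sketch: instead of a learned critic, each sampled completion's reward is normalized against the mean and spread of its own group. A minimal illustration of that normalization (function name and sample-stdev choice are ours, not Megatron-LM's implementation):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward against the
    mean/std of its own sampled group (no learned value model)."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# One prompt, four sampled completions scored by a reward model:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from sibling samples, the advantages of a group always sum to zero: correct completions are pushed up exactly as hard as incorrect ones are pushed down.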

📂 Category Updates

🟢 AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2026-02-02] 🚨 RELEASE: v0.7.0 - Major feature release.
  • Details:
    • Megatron-LM Backend: Added deterministic training, MoE patches, Zero-Bubble Pipeline (ZBPP) patches, and FP8 context support.
    • TorchTitan Integration: Moved TorchTitan trainer patches into the backend system and adjusted batch sizes for DeepSeek V3 16B.
    • Infrastructure: Integrated preflight tools into CLI and aligned TFLOPS computation with Megatron defaults.
  • Metrics: 44 PRs 2 Issues
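"Aligning TFLOPS computation with Megatron defaults" typically means estimating model FLOPs analytically from parameter count rather than from measured kernels. A sketch using the common 6 * params * tokens approximation for a training step (illustrative only; Megatron's full formula adds attention and vocabulary terms, and the function name is ours):

```python
def achieved_tflops_per_gpu(n_params, tokens_per_step, step_time_s, n_gpus):
    """Rough Megatron-style throughput estimate using the standard
    6 * params * tokens approximation for training FLOPs (fwd + bwd).
    Ignores attention quadratic terms and activation recomputation."""
    total_flops = 6 * n_params * tokens_per_step
    return total_flops / (step_time_s * n_gpus * 1e12)

# e.g. a 7B-parameter model, 1M tokens per step, 16 GPUs, 10 s/step:
tf = achieved_tflops_per_gpu(7e9, 1_048_576, 10.0, 16)
```

Pinning every framework to the same formula matters because a looser FLOPs count (e.g. including recomputation) inflates reported MFU and makes cross-stack comparisons meaningless.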

tile-ai/tilelang

  • Key Activity:
    • [2026-02-16] 🚨 RELEASE: v0.1.8
  • Details:
    • AMD Specific: Enabled FlashAttention-2 (FA2) forward pass on MI300X; fixed Docker build bugs for MI3x GPUs; updated CI to ROCm 7.1.
    • Features: Introduced T.access_of, packed FP32x2 math intrinsics, and improved atomic reduction operations.
  • Metrics: 95 PRs 29 Issues

ROCm/ROCm

  • Key Activity:
    • General maintenance and documentation updates.
  • Details:
    • Identified issues with RDNA3 HIP Memory Pool Fragmentation on Windows causing slowdowns.
    • Updated vLLM inference documentation to point upstream.
  • Metrics: 38 PRs 44 Issues

🔵 PyTorch Ecosystem

pytorch/torchtitan

  • Key Activity:
    • [2026-02-20] 🚨 RELEASE: v0.2.2
  • Details:
    • Features: Refactored Context Parallel (CP) to use new PyTorch APIs; enabled FlexCP for Llama3; added support for DeepEP shared experts overlap.
    • Hardware: Added peak FLOPS support for NVIDIA H20 and mxfp8 support on ROCm gfx950.
    • Refactor: Moved from TOML config system to a Python Dataclass Registry.
  • Metrics: 120 PRs 33 Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2026-02-10] 🚨 RELEASE: v0.16.0
  • Details:
    • Training: Introduced differentiable building blocks for MXFP8 MoE Training with Expert Parallelism. Reported 10-25% speedup on DeepSeek V3 training.
    • Deprecation: Removed v1 configs for Float8/Int8 quantization in favor of v2.
  • Metrics: 0 PRs 20 Issues
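The MX formats behind the MXFP8 path pair low-precision elements with one shared power-of-two scale per small block, so a single outlier only costs precision within its own block. A pure-Python sketch of the idea (illustrative only: integer rounding stands in for FP8 element rounding, and real kernels operate on packed tensors):

```python
import math

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def mx_block_quantize(values, block=32):
    """MX-style quantization sketch: each block of 32 values shares one
    power-of-two scale chosen so the block's absmax fits the FP8 E4M3
    range. Elements stay Python floats here; real kernels store FP8 bits."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = 2.0 ** math.floor(math.log2(FP8_E4M3_MAX / amax))
        # scale up, round (stand-in for FP8 cast), scale back down
        out.extend(round(v * scale) / scale for v in chunk)
    return out

vals = [0.1 * i for i in range(32)]
q = mx_block_quantize(vals)
```

Because the scale is a power of two, rescaling is exact in floating point; only the element rounding loses information.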

pytorch/pytorch

  • Key Activity:
    • High-volume maintenance and documentation updates.
  • Details:
    • Bumped the C++ standard to C++20.
    • Addressed issues regarding torch.vmap failures with functorch tracing and uneven sharding in FSDP2.
  • Metrics: 1679 PRs 389 Issues

🟢 NVIDIA Ecosystem

NVIDIA/Megatron-LM

  • Key Activity:
    • [2026-02-26] 🚨 RELEASE: core_v0.16.0
  • Details:
    • RL Merger: Merged Megatron-RL into the main Megatron-LM repo, enabling tighter integration for Post-Training.
    • GRPO: Added functional tests and support for Group Relative Policy Optimization (GRPO), critical for reasoning models.
    • Inference: Enabled hybrid model support (Tensor + Expert + Data Parallelism) in mcore inference.
    • Fixes: Resolved DeepSeek V3 FSDP hangs related to precision-aware optimizers.
  • Metrics: 0 PRs 0 Issues (Repo stats likely aggregated in release notes)
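Hybrid-parallel inference composes several process-group axes, and each GPU's global rank conceptually decomposes into one coordinate per axis. A sketch with an assumed ordering (TP fastest-varying; Megatron's actual group-initialization order depends on configuration):

```python
def rank_coords(rank, tp, ep, dp):
    """Decompose a global rank into (tp, ep, dp) coordinates for a
    hybrid Tensor x Expert x Data parallel layout. The TP-fastest
    ordering here is an illustrative assumption, not mcore's convention."""
    assert 0 <= rank < tp * ep * dp
    return (rank % tp, (rank // tp) % ep, rank // (tp * ep))

# 16 GPUs laid out as TP=2, EP=4, DP=2:
coords = [rank_coords(r, 2, 4, 2) for r in range(16)]
```

The ordering matters for performance: the fastest-varying axis lands on the same node, so the most communication-heavy group (usually TP) should occupy it.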

NVIDIA/TransformerEngine

  • Key Activity:
    • [2026-02-24] RELEASE: v2.12
  • Details:
    • Quantization: Improved performance of NVFP4 quantization kernels.
    • PyTorch: Added fused permute+pad operations for FP8 optimization and Sliding Window Attention support.
  • Metrics: 0 PRs 0 Issues
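NVFP4 stores elements as 4-bit E2M1 values, whose representable magnitudes are just {0, 0.5, 1, 1.5, 2, 3, 4, 6}, with a higher-precision scale per 16-element block. A toy round-to-nearest sketch of the format (our simplification; TransformerEngine's kernels store FP8 scales and packed 4-bit codes):

```python
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes

def nvfp4_quantize(values, block=16):
    """NVFP4-style sketch: each 16-element block gets a scale mapping its
    absmax onto the FP4 E2M1 max (6.0); each element then rounds to the
    nearest representable FP4 magnitude, keeping its sign."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = (max(abs(v) for v in chunk) or 1.0) / 6.0
        for v in chunk:
            mag = min(E2M1_VALUES, key=lambda m: abs(abs(v) / scale - m))
            out.append(math.copysign(mag * scale, v))
    return out

q = nvfp4_quantize([6.0, 3.0, -1.5, 0.2])
```

With only eight magnitudes per sign, small values near the block absmax collapse to zero, which is why per-block (rather than per-tensor) scaling is essential at 4 bits.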

🚀 Serving & Inference

vllm-project/vllm

  • Key Activity:
    • [2026-02-25] 🚨 RELEASE: v0.16.0
    • [2026-02-04] RELEASE: v0.15.1
  • Details:
    • Architecture: Fully supported Async Scheduling + Pipeline Parallelism (~30% throughput improvement).
    • API: Launched WebSocket-based Realtime API for streaming audio.
    • XPU Overhaul: Deprecated IPEX in favor of vllm-xpu-kernels for Intel GPUs.
    • AMD: Tuned QWEN3-NEXT FP8 and integrated AITER attention backend.
  • Metrics: 0 PRs 0 Issues (stats snapshot incomplete; actual volume is high)
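The gain from pairing async scheduling with Pipeline Parallelism is easiest to see through the standard 1F1B bubble formula, where pipeline idle time shrinks as the microbatch count grows (textbook pipeline analysis, not vLLM internals):

```python
def pipeline_bubble_fraction(stages, microbatches):
    """Idle ('bubble') fraction of a synchronous 1F1B pipeline schedule:
    (p - 1) / (m + p - 1). Async scheduling additionally hides CPU-side
    scheduling gaps; this formula covers only pipeline fill/drain cost."""
    return (stages - 1) / (microbatches + stages - 1)

# 4 stages: raising microbatches from 4 to 16 shrinks the bubble
b4, b16 = pipeline_bubble_fraction(4, 4), pipeline_bubble_fraction(4, 16)
```

Even with a small bubble, a synchronous scheduler can stall the pipeline between batches, which is why combining the two techniques yields more than either alone.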

sgl-project/sglang

  • Key Activity:
    • [2026-02-24] 🚨 RELEASE: v0.5.9
  • Details:
    • Performance: Implemented LoRA weight-loading overlap with computation (a 78% reduction in time-to-first-token, TTFT). Integrated TRT-LLM NSA kernels for DeepSeek V3.2.
    • AMD/ROCm: Standardized on ROCm 7; added Day-0 support for Kimi K2.5 and FP8 prefill attention kernels.
    • Diffusion: Added support for LTX-2 and MOVA models; enabled token-level sequence sharding.
  • Metrics: 0 PRs 0 Issues
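The LoRA weight-loading overlap is essentially double-buffering: prefetch the next request's adapter weights while the current forward pass runs, instead of serializing load then compute. A threading sketch of the pattern (illustrative only; SGLang overlaps host-to-device transfers with GPU compute, the sleeps here are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def load_adapter(name):
    time.sleep(0.01)  # stand-in for a host-to-device weight copy
    return f"weights:{name}"

def run_requests(adapter_names):
    """Double-buffering sketch: a worker thread prefetches the next LoRA
    adapter's weights while the main thread 'computes' with the current
    ones, so load latency is hidden behind compute."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_adapter, adapter_names[0])
        for i, name in enumerate(adapter_names):
            weights = pending.result()  # wait only if prefetch isn't done
            if i + 1 < len(adapter_names):
                pending = pool.submit(load_adapter, adapter_names[i + 1])
            time.sleep(0.01)  # stand-in for the forward pass using `weights`
            results.append(weights)
    return results

r = run_requests(["a", "b", "c"])
```

When load time and compute time are comparable, this hides nearly all of the loading cost, which is consistent with the large TTFT reduction reported.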

llm-d/llm-d

  • Key Activity:
    • [2026-02-04] 🚨 RELEASE: v0.5.0
  • Details:
    • Infrastructure: Upgraded to Gateway API v1.4.0 and Istio 1.28.1.
    • Components: Graduated Workload-Variant-Autoscaler (WVA) from experimental to core.
    • Refactor: Moved routing sidecar logic into the inference scheduler repo.
  • Metrics: 0 PRs 0 Issues

📚 Models & Libraries

huggingface/transformers

  • Key Activity:
    • [2026-02-16] RELEASE: v5.2.0
    • [2026-02-05] RELEASE: v5.1.0
  • Details:
    • New Models: Qwen 3.5, GLM-5, VoxtralRealtime, EXAONE-MoE, PP-DocLayoutV3.
    • Breaking Changes: Refactored attention mask interface; removed SDPA workarounds for Torch 2.4+.
    • Hardware: Added XPU support for MoE kernels (MegaBlocks).
  • Metrics: 561 PRs 133 Issues

deepspeedai/DeepSpeed

  • Key Activity:
    • [2026-02-12] RELEASE: v0.18.6
  • Details:
    • Fixes: Addressed BF16 gradient norm divergence with ZeRO stage 0 and leaf module race conditions.
    • Features: Supported custom partitioning patterns for AutoTP.
  • Metrics: 0 PRs 0 Issues