📅 Engineering Report (2026-02-01 - 2026-02-28)

🚀 Executive Summary

February 2026 was defined by the rapid productization of reasoning models (DeepSeek R1) and Mixture-of-Experts (MoE) architectures (DeepSeek V3) across the stack. Frameworks are shifting from experimental support to production hardening.

AMD made significant strides in training infrastructure with Primus v0.7.0, establishing a robust patch system for Megatron-LM and TorchTitan on ROCm. NVIDIA consolidated their lead in Post-Training (RLHF) by merging Megatron-RL into Megatron-LM Core and introducing GRPO support, directly addressing the industry’s shift toward reasoning-heavy workloads. In inference, vLLM and SGLang both released major versions focusing on Pipeline Parallelism, low-bit quantization (FP8/NVFP4), and disaggregated serving.

  • Primus v0.7.0 Release: A critical update for AMD’s training stack. It introduces a comprehensive patch framework for Megatron-LM and TorchTitan, enabling deterministic training, MoE support, and specific optimizations for DeepSeek V3 on MI355X hardware.
  • ROCm 7 & MI300X Optimization: Tilelang v0.1.8 enabled the FlashAttention-2 forward pass on MI300X and updated CI to ROCm 7.1. SGLang v0.5.9 deprecated ROCm 6.3 artifacts in favor of ROCm 7 standardization.
  • Day-0 Model Support: Both vLLM and SGLang integrated support for new architectures (e.g., Kimi K2.5, Qwen3-Next) on ROCm immediately upon release, utilizing updated AITER kernels.

Competitive Analysis

  • NVIDIA Consolidates RL Training: With Megatron-LM Core v0.16.0, NVIDIA merged Megatron-RL into the main codebase. The addition of GRPO (Group Relative Policy Optimization) support indicates a strategic focus on the “reasoning” training loop popularized by DeepSeek R1, moving beyond standard SFT/DPO.
  • Quantization Warfare: TransformerEngine v2.12 and vLLM v0.16.0 significantly improved NVFP4 (4-bit floating point) support. This puts pressure on non-NVIDIA hardware to accelerate sub-8-bit precision support to remain competitive in inference throughput.
  • Serving Maturity: vLLM’s introduction of Pipeline Parallelism and a Realtime API (WebSocket-based) in v0.16.0 closes the gap with enterprise-grade serving solutions, reducing the need for third-party wrappers.
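The group-relative baseline at the heart of GRPO is simple enough to sketch: instead of a learned critic, each sampled completion's reward is normalized against the mean and spread of its own group. A minimal illustration of that normalization (function name and sample-stdev choice are ours, not Megatron-LM's implementation):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward against the
    mean/std of its own sampled group (no learned value model)."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# One prompt, four sampled completions scored by a reward model:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from sibling samples, the advantages of a group always sum to zero: correct completions are pushed up exactly as hard as incorrect ones are pushed down.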

📂 Category Updates

🟢 AMD Ecosystem

AMD-AGI/Primus

  • Key Activity:
    • [2026-02-02] 🚨 RELEASE: v0.7.0 - Major feature release.
  • Details:
    • Megatron-LM Backend: Added deterministic training, MoE patches, Zero-Bubble Pipeline (ZBPP) patches, and FP8 context support.
    • TorchTitan Integration: Moved TorchTitan trainer patches into the backend system and adjusted batch sizes for DeepSeek V3 16B.
    • Infrastructure: Integrated preflight tools into CLI and aligned TFLOPS computation with Megatron defaults.
  • Metrics: 44 PRs 2 Issues
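"Aligning TFLOPS computation with Megatron defaults" typically means estimating model FLOPs analytically from parameter count rather than from measured kernels. A sketch using the common 6 * params * tokens approximation for a training step (illustrative only; Megatron's full formula adds attention and vocabulary terms, and the function name is ours):

```python
def achieved_tflops_per_gpu(n_params, tokens_per_step, step_time_s, n_gpus):
    """Rough Megatron-style throughput estimate using the standard
    6 * params * tokens approximation for training FLOPs (fwd + bwd).
    Ignores attention quadratic terms and activation recomputation."""
    total_flops = 6 * n_params * tokens_per_step
    return total_flops / (step_time_s * n_gpus * 1e12)

# e.g. a 7B-parameter model, 1M tokens per step, 16 GPUs, 10 s/step:
tf = achieved_tflops_per_gpu(7e9, 1_048_576, 10.0, 16)
```

Pinning every framework to the same formula matters because a looser FLOPs count (e.g. including recomputation) inflates reported MFU and makes cross-stack comparisons meaningless.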

tile-ai/tilelang

  • Key Activity:
    • [2026-02-16] 🚨 RELEASE: v0.1.8
  • Details:
    • AMD Specific: Enabled FlashAttention-2 (FA2) forward pass on MI300X; fixed Docker build bugs for MI3x GPUs; updated CI to ROCm 7.1.
    • Features: Introduced T.access_of, packed FP32x2 math intrinsics, and improved atomic reduction operations.
  • Metrics: 95 PRs 29 Issues

ROCm/ROCm

  • Key Activity:
    • General maintenance and documentation updates.
  • Details:
    • Identified issues with RDNA3 HIP Memory Pool Fragmentation on Windows causing slowdowns.
    • Updated vLLM inference documentation to point upstream.
  • Metrics: 38 PRs 44 Issues

🔵 PyTorch Ecosystem

pytorch/torchtitan

  • Key Activity:
    • [2026-02-20] 🚨 RELEASE: v0.2.2
  • Details:
    • Features: Refactored Context Parallel (CP) to use new PyTorch APIs; enabled FlexCP for Llama3; added support for DeepEP shared experts overlap.
    • Hardware: Added peak FLOPS support for NVIDIA H20 and mxfp8 support on ROCm gfx950.
    • Refactor: Moved from TOML config system to a Python Dataclass Registry.
  • Metrics: 120 PRs 33 Issues

pytorch/ao (Architecture Optimization)

  • Key Activity:
    • [2026-02-10] 🚨 RELEASE: v0.16.0
  • Details:
    • Training: Introduced differentiable building blocks for MXFP8 MoE Training with Expert Parallelism. Reported 10-25% speedup on DeepSeek V3 training.
    • Deprecation: Removed v1 configs for Float8/Int8 quantization in favor of v2.
  • Metrics: 0 PRs 20 Issues
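The MX formats behind the MXFP8 path pair low-precision elements with one shared power-of-two scale per small block, so a single outlier only costs precision within its own block. A pure-Python sketch of the idea (illustrative only: integer rounding stands in for FP8 element rounding, and real kernels operate on packed tensors):

```python
import math

FP8_E4M3_MAX = 448.0  # largest representable E4M3 magnitude

def mx_block_quantize(values, block=32):
    """MX-style quantization sketch: each block of 32 values shares one
    power-of-two scale chosen so the block's absmax fits the FP8 E4M3
    range. Elements stay Python floats here; real kernels store FP8 bits."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = 2.0 ** math.floor(math.log2(FP8_E4M3_MAX / amax))
        # scale up, round (stand-in for FP8 cast), scale back down
        out.extend(round(v * scale) / scale for v in chunk)
    return out

vals = [0.1 * i for i in range(32)]
q = mx_block_quantize(vals)
```

Because the scale is a power of two, rescaling is exact in floating point; only the element rounding loses information.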

pytorch/pytorch

  • Key Activity:
    • High-volume maintenance and documentation updates.
  • Details:
    • Bumped the C++ standard to C++20.
    • Addressed issues regarding torch.vmap failures with functorch tracing and uneven sharding in FSDP2.
  • Metrics: 1679 PRs 389 Issues

🟢 NVIDIA Ecosystem

NVIDIA/Megatron-LM

  • Key Activity:
    • [2026-02-26] 🚨 RELEASE: core_v0.16.0
  • Details:
    • RL Merger: Merged Megatron-RL into the main Megatron-LM repo, enabling tighter integration for Post-Training.
    • GRPO: Added functional tests and support for Group Relative Policy Optimization (GRPO), critical for reasoning models.
    • Inference: Enabled hybrid model support (Tensor + Expert + Data Parallelism) in mcore inference.
    • Fixes: Resolved DeepSeek V3 FSDP hangs related to precision-aware optimizers.
  • Metrics: 0 PRs 0 Issues (Repo stats likely aggregated in release notes)
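Hybrid-parallel inference composes several process-group axes, and each GPU's global rank conceptually decomposes into one coordinate per axis. A sketch with an assumed ordering (TP fastest-varying; Megatron's actual group-initialization order depends on configuration):

```python
def rank_coords(rank, tp, ep, dp):
    """Decompose a global rank into (tp, ep, dp) coordinates for a
    hybrid Tensor x Expert x Data parallel layout. The TP-fastest
    ordering here is an illustrative assumption, not mcore's convention."""
    assert 0 <= rank < tp * ep * dp
    return (rank % tp, (rank // tp) % ep, rank // (tp * ep))

# 16 GPUs laid out as TP=2, EP=4, DP=2:
coords = [rank_coords(r, 2, 4, 2) for r in range(16)]
```

The ordering matters for performance: the fastest-varying axis lands on the same node, so the most communication-heavy group (usually TP) should occupy it.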

NVIDIA/TransformerEngine

  • Key Activity:
    • [2026-02-24] RELEASE: v2.12
  • Details:
    • Quantization: Improved performance of NVFP4 quantization kernels.
    • PyTorch: Added fused permute+pad operations for FP8 optimization and Sliding Window Attention support.
  • Metrics: 0 PRs 0 Issues
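NVFP4 stores elements as 4-bit E2M1 values, whose representable magnitudes are just {0, 0.5, 1, 1.5, 2, 3, 4, 6}, with a higher-precision scale per 16-element block. A toy round-to-nearest sketch of the format (our simplification; TransformerEngine's kernels store FP8 scales and packed 4-bit codes):

```python
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes

def nvfp4_quantize(values, block=16):
    """NVFP4-style sketch: each 16-element block gets a scale mapping its
    absmax onto the FP4 E2M1 max (6.0); each element then rounds to the
    nearest representable FP4 magnitude, keeping its sign."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        scale = (max(abs(v) for v in chunk) or 1.0) / 6.0
        for v in chunk:
            mag = min(E2M1_VALUES, key=lambda m: abs(abs(v) / scale - m))
            out.append(math.copysign(mag * scale, v))
    return out

q = nvfp4_quantize([6.0, 3.0, -1.5, 0.2])
```

With only eight magnitudes per sign, small values near the block absmax collapse to zero, which is why per-block (rather than per-tensor) scaling is essential at 4 bits.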

🚀 Serving & Inference

vllm-project/vllm

  • Key Activity:
    • [2026-02-25] 🚨 RELEASE: v0.16.0
    • [2026-02-04] RELEASE: v0.15.1
  • Details:
    • Architecture: Fully supported Async Scheduling + Pipeline Parallelism (~30% throughput improvement).
    • API: Launched WebSocket-based Realtime API for streaming audio.
    • XPU Overhaul: Deprecated IPEX in favor of vllm-xpu-kernels for Intel GPUs.
    • AMD: Tuned QWEN3-NEXT FP8 and integrated AITER attention backend.
  • Metrics: 0 PRs 0 Issues (stats snapshot incomplete; actual volume is high)
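The gain from pairing async scheduling with Pipeline Parallelism is easiest to see through the standard 1F1B bubble formula, where pipeline idle time shrinks as the microbatch count grows (textbook pipeline analysis, not vLLM internals):

```python
def pipeline_bubble_fraction(stages, microbatches):
    """Idle ('bubble') fraction of a synchronous 1F1B pipeline schedule:
    (p - 1) / (m + p - 1). Async scheduling additionally hides CPU-side
    scheduling gaps; this formula covers only pipeline fill/drain cost."""
    return (stages - 1) / (microbatches + stages - 1)

# 4 stages: raising microbatches from 4 to 16 shrinks the bubble
b4, b16 = pipeline_bubble_fraction(4, 4), pipeline_bubble_fraction(4, 16)
```

Even with a small bubble, a synchronous scheduler can stall the pipeline between batches, which is why combining the two techniques yields more than either alone.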

sgl-project/sglang

  • Key Activity:
    • [2026-02-24] 🚨 RELEASE: v0.5.9
  • Details:
    • Performance: Implemented LoRA weight-loading overlap with computation (a 78% reduction in time-to-first-token, TTFT). Integrated TRT-LLM NSA kernels for DeepSeek V3.2.
    • AMD/ROCm: Standardized on ROCm 7; added Day-0 support for Kimi K2.5 and FP8 prefill attention kernels.
    • Diffusion: Added support for LTX-2 and MOVA models; enabled token-level sequence sharding.
  • Metrics: 0 PRs 0 Issues
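The LoRA weight-loading overlap is essentially double-buffering: prefetch the next request's adapter weights while the current forward pass runs, instead of serializing load then compute. A threading sketch of the pattern (illustrative only; SGLang overlaps host-to-device transfers with GPU compute, the sleeps here are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def load_adapter(name):
    time.sleep(0.01)  # stand-in for a host-to-device weight copy
    return f"weights:{name}"

def run_requests(adapter_names):
    """Double-buffering sketch: a worker thread prefetches the next LoRA
    adapter's weights while the main thread 'computes' with the current
    ones, so load latency is hidden behind compute."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_adapter, adapter_names[0])
        for i, name in enumerate(adapter_names):
            weights = pending.result()  # wait only if prefetch isn't done
            if i + 1 < len(adapter_names):
                pending = pool.submit(load_adapter, adapter_names[i + 1])
            time.sleep(0.01)  # stand-in for the forward pass using `weights`
            results.append(weights)
    return results

r = run_requests(["a", "b", "c"])
```

When load time and compute time are comparable, this hides nearly all of the loading cost, which is consistent with the large TTFT reduction reported.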

llm-d/llm-d

  • Key Activity:
    • [2026-02-04] 🚨 RELEASE: v0.5.0
  • Details:
    • Infrastructure: Upgraded to Gateway API v1.4.0 and Istio 1.28.1.
    • Components: Graduated Workload-Variant-Autoscaler (WVA) from experimental to core.
    • Refactor: Moved routing sidecar logic into the inference scheduler repo.
  • Metrics: 0 PRs 0 Issues

📚 Models & Libraries

huggingface/transformers

  • Key Activity:
    • [2026-02-16] RELEASE: v5.2.0
    • [2026-02-05] RELEASE: v5.1.0
  • Details:
    • New Models: Qwen 3.5, GLM-5, VoxtralRealtime, EXAONE-MoE, PP-DocLayoutV3.
    • Breaking Changes: Refactored attention mask interface; removed SDPA workarounds for Torch 2.4+.
    • Hardware: Added XPU support for MoE kernels (MegaBlocks).
  • Metrics: 561 PRs 133 Issues

deepspeedai/DeepSpeed

  • Key Activity:
    • [2026-02-12] RELEASE: v0.18.6
  • Details:
    • Fixes: Addressed BF16 gradient norm divergence with ZeRO stage 0 and leaf module race conditions.
    • Features: Supported custom partitioning patterns for AutoTP.
  • Metrics: 0 PRs 0 Issues