📅 GitHub Monthly Engineering Report (2026-02-01 to 2026-02-28)
🚀 Executive Summary
February 2026 was defined by the rapid productization of Reasoning Models (DeepSeek R1/V3) and Mixture-of-Experts (MoE) across the stack. Frameworks are shifting from experimental support to production hardening.
AMD made significant strides in training infrastructure with Primus v0.7.0, establishing a robust patch system for Megatron-LM and TorchTitan on ROCm. NVIDIA consolidated its lead in Post-Training (RLHF) by merging Megatron-RL into Megatron-LM Core and introducing GRPO support, directly addressing the industry's shift toward reasoning-heavy workloads. In inference, vLLM and SGLang both released major versions focusing on Pipeline Parallelism, low-bit quantization (FP8/NVFP4), and disaggregated serving.
AMD Related Updates
- Primus v0.7.0 Release: A critical update for AMD’s training stack. It introduces a comprehensive patch framework for Megatron-LM and TorchTitan, enabling deterministic training, MoE support, and specific optimizations for DeepSeek V3 on MI355X hardware.
- ROCm 7.0 & MI300X Optimization: Tilelang v0.1.8 enabled FlashAttention-2 forward pass on MI300X and updated CI to ROCm 7.1. SGLang v0.5.9 deprecated ROCm 6.3 artifacts in favor of ROCm 7 standardization.
- Day-0 Model Support: Both vLLM and SGLang integrated support for new architectures (e.g., Kimi K2.5, Qwen3-Next) on ROCm immediately upon release, utilizing updated AITER kernels.
Competitive Analysis
- NVIDIA Consolidates RL Training: With Megatron-LM Core v0.16.0, NVIDIA merged Megatron-RL into the main codebase. The addition of GRPO (Group Relative Policy Optimization) support indicates a strategic focus on the “reasoning” training loop popularized by DeepSeek R1, moving beyond standard SFT/DPO.
- Quantization Warfare: TransformerEngine v2.12 and vLLM v0.16.0 significantly improved NVFP4 (4-bit floating point) support. This puts pressure on non-NVIDIA hardware to accelerate sub-8-bit precision support to remain competitive in inference throughput.
- Serving Maturity: vLLM’s introduction of Pipeline Parallelism and a Realtime API (WebSocket-based) in v0.16.0 closes the gap with enterprise-grade serving solutions, reducing the need for third-party wrappers.
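The group-relative advantage at the heart of GRPO, noted above, can be sketched in a few lines. This is an illustrative computation only (not Megatron-LM's implementation): each sampled response's reward is normalized against its group's statistics, which is what lets GRPO drop the learned value-function critic used by PPO.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Illustrative only; the actual Megatron-LM internals differ.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std.

    In GRPO, several responses are sampled per prompt; each response's
    advantage is its reward relative to the group, removing the need
    for a learned value-function critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one prompt, scored by a reward model.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses above the group mean get positive advantages, below-mean responses get negative ones, so the policy gradient pushes toward the relatively better completions within each prompt's group.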
📂 Category Updates
🟢 AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2026-02-02] 🚨 RELEASE: v0.7.0 - Major feature release.
- Details:
- Megatron-LM Backend: Added deterministic training, MoE patches, Zero-Bubble Pipeline (ZBPP) patches, and FP8 context support.
- TorchTitan Integration: Moved TorchTitan trainer patches into the backend system and adjusted batch sizes for DeepSeek V3 16B.
- Infrastructure: Integrated preflight tools into CLI and aligned TFLOPS computation with Megatron defaults.
- Metrics: 44 PRs, 2 Issues
tile-ai/tilelang
- Key Activity:
- [2026-02-16] 🚨 RELEASE: v0.1.8
- Details:
- AMD Specific: Enabled FlashAttention-2 (FA2) forward pass on MI300X; fixed Docker build bugs for MI3x GPUs; updated CI to ROCm 7.1.
- Features: Introduced `T.access_of`, packed FP32x2 math intrinsics, and improved atomic reduction operations.
- Metrics: 95 PRs, 29 Issues
ROCm/ROCm
- Key Activity:
- General maintenance and documentation updates.
- Details:
- Identified issues with RDNA3 HIP Memory Pool Fragmentation on Windows causing slowdowns.
- Updated vLLM inference documentation to point upstream.
- Metrics: 38 PRs, 44 Issues
🔵 PyTorch Ecosystem
pytorch/torchtitan
- Key Activity:
- [2026-02-20] 🚨 RELEASE: v0.2.2
- Details:
- Features: Refactored Context Parallel (CP) to use new PyTorch APIs; enabled FlexCP for Llama3; added support for DeepEP shared experts overlap.
- Hardware: Added peak FLOPS support for NVIDIA H20 and mxfp8 support on ROCm gfx950.
- Refactor: Moved from TOML config system to a Python Dataclass Registry.
- Metrics: 120 PRs, 33 Issues
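The move from TOML files to a Python dataclass registry can be illustrated with a small sketch. The names below (`REGISTRY`, `register`, the model config fields) are hypothetical, not torchtitan's actual API; the point is the pattern, where configs become typed, discoverable Python objects instead of parsed key-value files.

```python
# Hypothetical sketch of a dataclass-based config registry, the pattern
# torchtitan v0.2.2 adopted in place of TOML configs. Names here are
# illustrative, not torchtitan's real API.
from dataclasses import dataclass

REGISTRY: dict[str, type] = {}

def register(name: str):
    """Decorator that records a config class under a lookup name."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap

@register("llama3_8b")
@dataclass
class Llama3_8B:
    dim: int = 4096
    n_layers: int = 32
    context_parallel_degree: int = 1

# Defaults come from the dataclass; overrides are plain keyword args,
# type-checked by tooling rather than parsed from a TOML string.
cfg = REGISTRY["llama3_8b"](context_parallel_degree=2)
```

Compared to TOML, this gives IDE completion, default inheritance, and validation at class-definition time.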
pytorch/ao (Architecture Optimization)
- Key Activity:
- [2026-02-10] 🚨 RELEASE: v0.16.0
- Details:
- Training: Introduced differentiable building blocks for MXFP8 MoE Training with Expert Parallelism. Reported 10-25% speedup on DeepSeek V3 training.
- Deprecation: Removed v1 configs for Float8/Int8 quantization in favor of v2.
- Metrics: 0 PRs, 20 Issues
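For context on what the MoE building blocks accelerate, the dispatch step of Mixture-of-Experts routing can be sketched as below. This is a pure-Python illustration of standard top-k routing, not torchao's implementation, which operates on batched GPU tensors in MXFP8.

```python
# Minimal top-k expert routing sketch: the dispatch step that MoE
# training kernels (such as torchao's MXFP8 building blocks) accelerate.
# Pure-Python illustration; real implementations are batched GPU kernels.
import math

def route(logits, k=2):
    """Pick the top-k experts per token and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# One token's router logits over 4 experts; the token is sent only to
# experts 1 and 3, each weighted by the normalized router score.
choice = route([0.1, 2.0, -1.0, 2.0], k=2)
```

Under Expert Parallelism, the selected expert indices then determine which rank each token is shipped to, which is why overlapping that communication with compute matters.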
pytorch/pytorch
- Key Activity:
- High-volume maintenance and documentation updates.
- Details:
- Bumped the C++ standard to C++20.
- Addressed issues regarding `torch.vmap` failures with functorch tracing and uneven sharding in FSDP2.
- Metrics: 1679 PRs, 389 Issues
🟢 NVIDIA Ecosystem
NVIDIA/Megatron-LM
- Key Activity:
- [2026-02-26] 🚨 RELEASE: core_v0.16.0
- Details:
- RL Merger: Merged `Megatron-RL` into the main `Megatron-LM` repo, enabling tighter integration for Post-Training.
- GRPO: Added functional tests and support for Group Relative Policy Optimization (GRPO), critical for reasoning models.
- Inference: Enabled hybrid model support (Tensor + Expert + Data Parallelism) in mcore inference.
- Fixes: Resolved DeepSeek V3 FSDP hangs related to precision-aware optimizers.
- Metrics: 0 PRs, 0 Issues (repo stats likely aggregated in release notes)
NVIDIA/TransformerEngine
- Key Activity:
- [2026-02-24] RELEASE: v2.12
- Details:
- Quantization: Improved performance of NVFP4 quantization kernels.
- PyTorch: Added fused permute+pad operations for FP8 optimization and Sliding Window Attention support.
- Metrics: 0 PRs, 0 Issues
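To make the NVFP4 discussion concrete, 4-bit floating point (E2M1) can only represent sixteen values, so quantization works by scaling a block of tensor values onto that tiny grid. The sketch below shows the round-trip with a simplified per-block scale (real NVFP4 uses FP8-encoded block scale factors, and this is not TransformerEngine's kernel):

```python
# Sketch of block-scaled 4-bit float (E2M1) quantization, the format
# behind NVFP4. The magnitude grid is standard E2M1; the scaling scheme
# is simplified (real NVFP4 stores per-block scales in FP8).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Scale a block so its absolute max maps to 6.0 (E2M1's largest
    magnitude), round each value to the nearest grid point, and
    dequantize back, showing the information actually retained."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

vals = [0.1, -0.7, 2.4, 6.0]
deq = quantize_block(vals)
```

The coarse grid is why kernel quality (fused scaling, rounding, and layout, as in the v2.12 improvements) matters so much: the format itself leaves no precision to waste.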
🚀 Serving & Inference
vllm-project/vllm
- Key Activity:
- [2026-02-25] 🚨 RELEASE: v0.16.0
- [2026-02-04] RELEASE: v0.15.1
- Details:
- Architecture: Fully supported Async Scheduling + Pipeline Parallelism (~30% throughput improvement).
- API: Launched WebSocket-based Realtime API for streaming audio.
- XPU Overhaul: Deprecated IPEX in favor of `vllm-xpu-kernels` for Intel GPUs.
- AMD: Tuned Qwen3-Next FP8 and integrated the AITER attention backend.
- Metrics: 0 PRs, 0 Issues (high volume in reality; stats snapshot limited)
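The throughput gain from combining async scheduling with Pipeline Parallelism comes largely from hiding the pipeline "bubble". A back-of-envelope model for a GPipe-style schedule makes the trade-off visible (illustrative arithmetic only, not vLLM's scheduler):

```python
# Back-of-envelope pipeline-parallel bubble fraction for a GPipe-style
# schedule: with p stages and m microbatches, the fraction of time each
# stage sits idle is (p - 1) / (m + p - 1). Illustrative only; vLLM's
# async scheduler further hides this idle time by overlapping batches.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

# More in-flight microbatches shrink the bubble:
small = bubble_fraction(4, 4)    # few microbatches -> large bubble
large = bubble_fraction(4, 16)   # many microbatches -> small bubble
```

Async scheduling effectively keeps the microbatch count high without blocking the request loop, which is consistent with the ~30% throughput figure reported.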
sgl-project/sglang
- Key Activity:
- [2026-02-24] 🚨 RELEASE: v0.5.9
- Details:
- Performance: Implemented LoRA weight loading overlap with computation (78% reduction in TTFT). Integrated TRT-LLM NSA kernels for DeepSeek V3.2.
- AMD/ROCm: Standardized on ROCm 7; added Day-0 support for Kimi K2.5 and FP8 prefill attention kernels.
- Diffusion: Added support for LTX-2 and MOVA models; enabled token-level sequence sharding.
- Metrics: 0 PRs, 0 Issues
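The idea behind overlapping LoRA weight loading with computation can be shown with a simple prefetch pattern: while the model runs the current batch, a background worker loads the adapter the next batch needs. This is a conceptual sketch with stand-in functions, not SGLang's implementation:

```python
# Conceptual sketch of overlapping adapter loading with compute, the
# idea behind SGLang's LoRA TTFT improvement. The functions are
# stand-ins; this is not SGLang's actual code.
from concurrent.futures import ThreadPoolExecutor
import time

def load_lora(name):           # stand-in for reading adapter weights from disk
    time.sleep(0.05)
    return f"{name}-weights"

def forward(batch, weights):   # stand-in for a model forward pass
    time.sleep(0.05)
    return f"ran {batch} with {weights}"

batches = [("b0", "loraA"), ("b1", "loraB"), ("b2", "loraA")]
results = []
with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(load_lora, batches[0][1])   # warm up the first adapter
    for i, (batch, _) in enumerate(batches):
        weights = fut.result()                    # wait only if load is slow
        if i + 1 < len(batches):                  # prefetch the next adapter...
            fut = pool.submit(load_lora, batches[i + 1][1])
        results.append(forward(batch, weights))   # ...while this batch computes
```

Serialized, load and compute would cost their sum per batch; overlapped, each batch pays only the larger of the two, which is where the reported TTFT reduction comes from.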
llm-d/llm-d
- Key Activity:
- [2026-02-04] 🚨 RELEASE: v0.5.0
- Details:
- Infrastructure: Upgraded to Gateway API v1.4.0 and Istio 1.28.1.
- Components: Graduated Workload-Variant-Autoscaler (WVA) from experimental to core.
- Refactor: Moved routing sidecar logic into the inference scheduler repo.
- Metrics: 0 PRs, 0 Issues
📚 Models & Libraries
huggingface/transformers
- Key Activity:
- [2026-02-16] RELEASE: v5.2.0
- [2026-02-05] RELEASE: v5.1.0
- Details:
- New Models: Qwen 3.5, GLM-5, VoxtralRealtime, EXAONE-MoE, PP-DocLayoutV3.
- Breaking Changes: Refactored attention mask interface; removed SDPA workarounds for Torch 2.4+.
- Hardware: Added XPU support for MoE kernels (MegaBlocks).
- Metrics: 561 PRs, 133 Issues
deepspeedai/DeepSpeed
- Key Activity:
- [2026-02-12] RELEASE: v0.18.6
- Details:
- Fixes: Addressed BF16 gradient norm divergence with ZeRO stage 0 and leaf module race conditions.
- Features: Supported custom partitioning patterns for AutoTP.
- Metrics: 0 PRs, 0 Issues