GitHub Monthly Report: 2026-02-01 to 2026-02-28
April 18, 2026
Engineering Report (2026-02-01 - 2026-02-28)
Executive Summary
February 2026 was a massive month across the AI ecosystem, marked by an industry-wide rush to optimize inference and training for the newest frontier models (DeepSeek V3/R1, Qwen3.5, and GLM-5) using extreme low-precision formats (FP8, FP4, NVFP4, MXFP4/8). Ecosystem maintenance health is exceptionally high, with high-velocity repositories like HuggingFace transformers, vllm, and Megatron-LM merging hundreds of PRs to support next-generation Mixture-of-Experts (MoE) scaling and new hardware platforms.
AMD Related Updates
- Primus Gets Major Upgrades: AMD's Primus framework saw a substantial v0.7.0 release integrating heavily with Megatron (MoE, FP8, FSDP2, Zero-bubble pipeline parallelism) and explicitly optimizing batch sizes for DeepSeek V3 on the MI355X.
- Ecosystem Native Support Expanding: Upstream projects are rapidly merging AMD-specific optimizations. TorchAO added support for AMD gfx942 FP8 types in scaled grouped GEMMs. TorchTitan added extensive ROCm CI support (including porting H100 tests to ROCm and mxfp8 on gfx950). TileLang shipped native support for AMD MI300X FA2 forward passes and MI350/MI355 FP8.
- ROCm 7 Standardization in SGLang: SGLang officially deprecated ROCm 6.3 in favor of standardizing on ROCm 7, pushing critical optimizations like FP8 prefill attention kernels and Day-0 support for new models like Kimi K2.5 on AMD hardware.
- Profiling Tooling Matures: TraceLens updated its support for rocprofv3 and added ridge points to its roofline metrics, improving AMD hardware performance debugging.
Competitive Analysis
- NVIDIA's Blackwell Push is Aggressive: NVIDIA is rapidly maturing its software stack for the Blackwell architecture (SM120/SM121). TransformerEngine v2.12 and Megatron-LM v0.16.0 heavily prioritized NVFP4 MoE kernels, TRT-LLM integration, and UVM (Unified Virtual Memory) allocator compilation.
- TRT-LLM Driving Massive Inference Gains: SGLang integrated TRT-LLM Native Sparse Attention (NSA) kernels, specifically citing a 3x-5x speedup for DeepSeek V3.2 on Blackwell. AMD must ensure its vLLM and SGLang backends (like AITER) can match this MoE dispatch efficiency.
- XPU/Intel Platform Overhaul: vLLM completely deprecated IPEX in favor of vllm-xpu-kernels, adding substantial XPU support including unquantized MoE, MXFP4 MoE, and scaled_mm kernels. Intel is moving aggressively to unify its inference backend.
- The Low-Precision War: The battleground for LLM performance has definitively shifted to FP4 and MXFP8/4. With TorchAO showing 10-25% speedups for DeepSeek V3 using MXFP8 MoE building blocks, hardware parity in handling these formats without accuracy degradation is the primary competitive moat for Q1/Q2 2026.
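The block-scaled formats named above (MXFP8/MXFP4, NVFP4) share one core mechanism: a small group of values shares a single scale factor chosen so the group fits the narrow format's representable range. A minimal pure-Python sketch of that mechanism follows; the block size of 32 and the 448.0 max magnitude (FP8 E4M3-like) are illustrative assumptions, not any library's exact spec.

```python
# Sketch of block-scaled quantization, the mechanism behind MX-style
# formats: each block of values shares one scale so the block fits the
# narrow format's range. BLOCK_SIZE and FMT_MAX are illustrative.
BLOCK_SIZE = 32
FMT_MAX = 448.0  # max magnitude representable in the target format

def quantize_block(block):
    """Return (scale, quantized values) for one block."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / FMT_MAX
    # Round to a coarse grid to mimic the low-precision element type.
    return scale, [round(x / scale) for x in block]

def dequantize_block(scale, qblock):
    return [q * scale for q in qblock]

def quantize(values):
    blocks = [values[i:i + BLOCK_SIZE]
              for i in range(0, len(values), BLOCK_SIZE)]
    return [quantize_block(b) for b in blocks]

def dequantize(qblocks):
    out = []
    for scale, qb in qblocks:
        out.extend(dequantize_block(scale, qb))
    return out
```

The competitive question for hardware vendors is exactly how much of this scale bookkeeping their GEMM kernels can absorb without losing throughput or accuracy.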
Category Updates
AMD Ecosystem
AMD-AGI/Primus
- Key Activity:
- [2026-02-02] RELEASE: v0.7.0 - Massive framework update focusing on performance projection and Megatron integration.
- [2026-02-06] DOC UPDATE: Updated Docker base image from v25.10 to v26.1.
- Details:
- [2026-02-02] Integrated Megatron MoE, Transformer Engine, Turbo, and Zero-bubble pipeline (Zbpp) patches.
- [2026-02-02] Added Torch FSDP2 and FP8 context support.
- [2026-02-02] MI355X Optimization: Maximized batch sizes for DeepSeek V3 on MI355X (adjusted 16B batch size to 8).
- [2026-02-02] Preflight tool completely refactored for richer host+GPU+network reports.
- Metrics: 44 PRs, 2 Issues (Highlight Issues: Segfault on 8x MI250X; Slurm hang when MASTER_ADDR != SLURM_NODELIST[0])
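For context on why Primus's zero-bubble pipeline work matters: in a classic synchronous pipeline schedule (GPipe-style), with p stages and m microbatches, warm-up and drain "bubbles" occupy (p - 1) / (m + p - 1) of each training step, and zero-bubble schedules reorder backward-pass work to reclaim that idle time. A quick sketch of the standard formula:

```python
# Bubble (idle) fraction of a GPipe-style synchronous pipeline:
# a step occupies m + p - 1 time slots, of which p - 1 are
# warm-up/drain bubbles on every stage.
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)
```

With 8 stages and 64 microbatches the bubble is 7/71, roughly 10% of every step, which is the overhead zero-bubble scheduling targets.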
AMD-AGI/TraceLens
- Key Activity:
- [2026-02-13] DOC UPDATE: Added support for reporting analysis of pftrace files via the rocprofv3 CLI.
- [2026-02-25] DOC UPDATE: Restructured README.
- Details:
- [2026-02-25] Merged a PR adding a ridge point column to the TraceLens perf report roofline metrics.
- Metrics: 27 PRs, 11 Issues
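The ridge point TraceLens now reports is the standard roofline quantity: the arithmetic intensity (FLOPs per byte) at which a kernel stops being bandwidth-bound and becomes compute-bound, i.e. peak FLOP/s divided by peak memory bandwidth. A sketch, with illustrative hardware numbers rather than any specific GPU's spec:

```python
# Roofline model: attainable FLOP/s = min(peak_flops, intensity * peak_bw).
# The ridge point is the arithmetic intensity where the two limits meet.
def ridge_point(peak_flops: float, peak_bw_bytes: float) -> float:
    """Arithmetic intensity (FLOP/byte) at the compute/bandwidth crossover."""
    return peak_flops / peak_bw_bytes

def attainable_flops(intensity: float, peak_flops: float,
                     peak_bw_bytes: float) -> float:
    return min(peak_flops, intensity * peak_bw_bytes)
```

For a hypothetical 100 TFLOP/s, 3 TB/s part the ridge sits near 33 FLOP/byte; kernels below that intensity are bandwidth-bound, which is exactly what the new report column makes visible per kernel.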
ROCm/ROCm & ROCm/MAD
- Key Activity:
- Ongoing tracking of ecosystem bug reports and vLLM documentation alignment.
- Details:
- [ROCm] Highlighted Windows RDNA3 HIP memory pool fragmentation causing massive slowdowns in VRAM-heavy workloads.
- [MAD] Added vLLM router support for PD Disaggregation and released PyTorch xdit:v26.2.
- Metrics: 45 PRs, 74 Issues
AMD-AGI/GEAK-agent
- Key Activity:
- [2026-02-25] Refactored CLI: Renamed Chinese references to English and --mcp to --rag.
- Metrics: 0 PRs, 0 Issues
PyTorch Ecosystem
pytorch/ao
- Key Activity:
- [2026-02-10] RELEASE: v0.16.0 - Major release introducing MXFP8 MoE building blocks.
- Details:
- [2026-02-10] Added MXFP8 MoE training with Expert Parallelism (EP), yielding 10-25% tokens/sec speedup for DeepSeek V3.
- [2026-02-10] AMD Support: scaled_grouped_mm now supports gfx942 FP8 data types.
- [2026-02-10] Heavily cleaned up the API by deprecating v1 quantization configs and the old GPTQ and SmoothQuant implementations.
- Metrics: 0 PRs, 20 Issues
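The "MoE building blocks" above center on grouped GEMMs: tokens are routed to experts, gathered per expert, and each group is multiplied by its expert's weights in one batched call. A stdlib sketch of the dispatch/combine pattern (toy routing over plain lists; real kernels fuse this with MXFP8 quantization and run the per-expert matmuls as a single grouped GEMM):

```python
# Sketch of MoE dispatch -> per-expert (grouped) matmul -> combine.
def moe_forward(tokens, expert_ids, expert_weights):
    """tokens: list of vectors; expert_ids[i]: expert for token i;
    expert_weights[e]: weight matrix (list of rows) for expert e."""
    # Dispatch: group token indices by their assigned expert.
    groups = {}
    for i, e in enumerate(expert_ids):
        groups.setdefault(e, []).append(i)
    out = [None] * len(tokens)
    # Grouped matmul: one pass per expert over just its tokens.
    for e, idxs in groups.items():
        w = expert_weights[e]
        for i in idxs:
            x = tokens[i]
            out[i] = [sum(x[k] * w[k][j] for k in range(len(x)))
                      for j in range(len(w[0]))]
    return out
```

Expert Parallelism places the `groups` on different ranks, which is why dispatch/combine communication efficiency dominates MoE scaling.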
pytorch/torchtitan
- Key Activity:
- [2026-02-20] RELEASE: v0.2.2
- Details:
- [2026-02-20] ROCm Native CI: Added ROCm support for H100 tests, simple FSDP experiments, Auto Parallel, and Transformer Modeling Backend tests. Added mxfp8 support on gfx950.
- [2026-02-20] Added DeepEP shared_experts overlap, Qwen3 attention scaling, and updated DeepSeek V3 device mesh usage.
- Metrics: 120 PRs, 52 Issues
pytorch/pytorch
- Key Activity:
- [2026-02-24] CMake bumped to C++ 20.
- Details:
- [2026-02-28] Addressed posix_fallocate error handling and tracked active issues regarding uneven FSDP2 sharding and torch.vmap failures with functorch.
- Metrics: 1679 PRs, 806 Issues (massive throughput, stable health)
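The uneven-sharding issues tracked above stem from a simple constraint: fully sharded data parallelism splits each flattened parameter evenly across ranks, so a tensor whose length is not divisible by the world size must be padded before splitting. A stdlib sketch of that padded even split (a conceptual illustration, not FSDP2's actual code path):

```python
# Even sharding with padding, as fully-sharded schemes require:
# pad the flat parameter to a multiple of world_size, then split.
def shard_evenly(flat_param, world_size, pad_value=0.0):
    n = len(flat_param)
    padded_len = -(-n // world_size) * world_size  # ceil to a multiple
    padded = list(flat_param) + [pad_value] * (padded_len - n)
    shard = padded_len // world_size
    return [padded[r * shard:(r + 1) * shard] for r in range(world_size)]
```

A 10-element parameter on 4 ranks becomes four shards of 3, the last carrying two pad elements that must be stripped when the parameter is gathered back; bookkeeping bugs around exactly that stripping are the typical source of uneven-sharding reports.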
NVIDIA Ecosystem
NVIDIA/Megatron-LM
- Key Activity:
- [2026-02-26] RELEASE: core_v0.16.0 - Massive framework evolution.
- [2026-02-06] RELEASE: core_v0.15.3 - Security patch release.
- Details:
- [2026-02-26] DeepSeek/Qwen Support: Added DeepSeek V3.2 support, Qwen3-Next MoE gates, and MTP packed-seq support.
- [2026-02-26] Blackwell & Formats: Added NVFP4 MoE with proper padding and FP8 params support for Megatron-FSDP.
- [2026-02-26] Architecture: Merged Megatron-RL into LM, enabling native RLHF/GRPO functional tests. Integrated Hybrid Context Parallelism and hybrid tensor+expert+data parallelism for inference.
- Metrics: Dozens of active PRs mapped to the release.
NVIDIA/TransformerEngine
- Key Activity:
- [2026-02-24] RELEASE: v2.12
- Details:
- [2026-02-24] Improved NVFP4 quantization kernels and fused permute+pad for FP8 optimization.
- [2026-02-24] Fixed SM120 compilation with CUDA 12 and added cudagraph support for activation recomputation.
- Metrics: 0 PRs, 0 Issues (in current window)
Serving & Inference Ecosystem
vllm-project/vllm
- Key Activity:
- [2026-02-25] RELEASE: v0.16.0
- [2026-02-04] RELEASE: v0.15.1 (Security & RTX Blackwell fixes)
- Details:
- [2026-02-25] Core Engine: Async scheduling + Pipeline Parallelism now fully supported (30.8% throughput improvement).
- [2026-02-25] Hardware: AMD ROCm received Qwen3-Next FP8 tunings and the AITER attention backend. Intel XPU was completely overhauled (IPEX deprecated in favor of vllm-xpu-kernels). NVIDIA got FlashInfer TRT-LLM BF16 MoE and SM100 INT4 W4A16.
- [2026-02-25] Features: Added a WebSocket-based Realtime API (audio) and Unified Parallel Drafting for speculative decoding.
- Metrics: Hundreds of PRs merged to support v0.16.0.
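The "Parallel Drafting" feature above builds on speculative decoding: a cheap draft model proposes several tokens, the target model verifies them in one batched pass, and the longest agreeing prefix is accepted. A greedy-verification toy sketch (the models here are plain functions; vLLM's actual implementation uses probabilistic acceptance, not this exact rule):

```python
# Toy speculative decoding with greedy verification: accept the longest
# prefix of draft tokens the target model would also have produced,
# then append one corrected token from the target.
def speculative_step(prefix, draft_tokens, target_next_token):
    """target_next_token(seq) -> the target model's next token for seq."""
    accepted = []
    seq = list(prefix)
    for t in draft_tokens:
        if target_next_token(seq) == t:  # target agrees with the draft
            accepted.append(t)
            seq.append(t)
        else:
            break
    # Always emit one target token past the accepted prefix.
    accepted.append(target_next_token(seq))
    return accepted
```

The win comes from verifying all draft tokens in one target-model forward pass instead of one pass per token; drafting in parallel amortizes the draft cost further.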
sgl-project/sglang
- Key Activity:
- [2026-02-24] RELEASE: v0.5.9
- Details:
- [2026-02-24] Blackwell NSA: Integrated TRT-LLM DSA kernels for Native Sparse Attention (NSA), yielding 3x-5x performance boosts for DeepSeek V3.2 on Blackwell.
- [2026-02-24] AMD ROCm: Deprecated ROCm 6.3, standardized on ROCm 7. Added FP8 prefill attention kernel integration and bumped AITER to v0.1.10.post3 (FP8 Prefill/Decode/KV Cache support).
- [2026-02-24] Core: LoRA weight loading overlap reduces TTFT by ~78%. Added FlashInfer all-to-all MoE dispatcher.
- Metrics: High velocity; directly competing with vLLM on DeepSeek V3 efficiency.
llm-d/llm-d
- Key Activity:
- [2026-02-04] RELEASE: v0.5.0
- Details:
- [2026-02-04] Promoted the Workload-Variant-Autoscaler to a core component.
- [2026-02-04] Upgraded infrastructure to Gateway API v1.4.0 and Istio 1.28.1. Temporarily deprecated EFA (Elastic Fabric Adapter) due to libibverbs conflicts.
- Metrics: 0 PRs, 0 Issues (post-release stabilization)
tile-ai/tilelang
- Key Activity:
- [2026-02-16] RELEASE: v0.1.8
- Details:
- [2026-02-16] Enabled FA2 fwd on AMD MI300X.
- [2026-02-16] Added AMD MI350/MI355 FP8 support.
- [2026-02-16] Integrated Z3 into the TVM Arith Analyzer and added a CuTeDSL backend.
- Metrics: 95 PRs, 57 Issues
xdit-project/xDiT & deepseek-ai/DeepEP
- Key Activity:
- [xDiT] Added Qwen Image, Qwen Image Edit, and FLUX.2-klein-9/4B support. PR added for AITER Sage V2.
- [DeepEP] Updated README for the mori-EP branch (AMD community fork). Tracked an issue where init fails on partial CUDA peer access.
Hugging Face Ecosystem
huggingface/transformers
- Key Activity:
- [2026-02-16] RELEASE: v5.2.0
- [2026-02-05] RELEASE: v5.1.0
- Details:
- [2026-02-16] Model Additions: Qwen3.5 (397B), GLM-5, VoxtralRealtime, VibeVoice Acoustic Tokenizer.
- [2026-02-05] Model Additions: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM.
- [2026-02-16] Breaking Changes: Implemented a new Attention mask interface everywhere. Modified ModernBERT default attention to stop using FA.
- [2026-02-16] Deprecations: Removed torchao.autoquant from transformers.
- Metrics: 561 PRs, 261 Issues (highly active, maintaining excellent closure rates)
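For readers tracking the attention-mask breaking change: an attention mask interface decides, per (query, key) position, whether attention is allowed, typically composing causal ordering with padding validity. A minimal sketch of those two ingredients (conceptual only, not the transformers v5.2 API):

```python
# Conceptual attention mask: query position q may attend to key
# position k iff k <= q (causal) and k is a real, non-padding token.
def build_mask(seq_len, valid_len=None):
    """valid_len: number of real tokens (the rest is right-padding)."""
    valid = seq_len if valid_len is None else valid_len
    return [[k <= q and k < valid for k in range(seq_len)]
            for q in range(seq_len)]
```

Centralizing this logic behind one interface is what lets a library swap mask implementations per backend (e.g. dense masks vs. kernel-native causal flags) without touching each model.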
Compilers & Other Frameworks
- openxla/xla: Tracked cross-compile build failures for CPU and GPU hangs with Triton Warp Specialization. (1348 PRs)
- triton-lang/triton: Added LLVM diagnostic handler for early AMD LDS resource checks. (241 PRs)
- AI-Hypercomputer/maxtext: Addressed Qwen3 MoE Conversion errors and enabled mask for AdamW optimizer. (216 PRs)
- facebookresearch/xformers: Released v0.0.35 (2026-02-20), supporting free-threading Python and removing bundled pre-built Flash-Attention 3.
- deepspeedai/DeepSpeed: Released v0.18.6 (2026-02-12), adding custom partitioning patterns for AutoTP.