AI & GPU Industry Weekly Recap: March 23–29, 2026
🔑 Key Highlights
- AMD’s FlyDSL kernel delivers up to 162% throughput gains running Kimi-K2.5 on AMD Instinct MI300X, with 65% lower TTFT (time to first token) and 69% lower TPOT (time per output token), a landmark result for ROCm-based AI inference competitiveness
- Huawei unveils the Atlas 350 AI accelerator based on the Ascend 950PR chip, claiming 1.56 PFLOPS of FP4 compute and 2.87× the performance of NVIDIA’s H20, marking a major milestone in China’s AI self-sufficiency drive
- AMD releases FSR SDK 2.2, delivering updated ML-powered upscaling (FSR Upscaling 4.1) and ray denoising (FSR Ray Regeneration 1.1) for RDNA 4 GPUs, furthering AMD’s neural rendering ambitions
- AMD and CIQ announce AMD-optimized Rocky Linux, a collaborative RHEL-derived OS distribution targeting AI and HPC workloads with day-zero ROCm and Instinct GPU support
- NVIDIA demonstrates power-flexible AI factories in partnership with Emerald AI, National Grid, and Nebius — successfully modulating Blackwell Ultra GPU cluster power to act as a real-time grid stabilizer in London
🤖 AI & Machine Learning
AMD ROCm: Kimi-K2.5 Inference Optimization on MI300X
AMD’s ROCm engineering team published a detailed optimization blog showcasing dramatic inference performance improvements for Kimi-K2.5 — the MoE (Mixture-of-Experts) LLM recommended by the open-source OpenClaw framework — running on AMD Instinct MI300X GPUs. Profiling revealed the fused_moe_kernel_gptq_awq kernel consumed 87–90% of GPU execution time in both prefill- and decode-dominated scenarios, making it the sole optimization target.
AMD engineers leveraged FlyDSL (Flexible Layout Python DSL), a Python-native GPU kernel authoring tool backed by a custom MLIR compiler stack, to rapidly build an optimized mixed-precision (W4A16 + BF16) fused MoE kernel. FlyDSL outperformed PyTorch, Triton, and Composable Kernel (CK) on all tested MoE shapes — notably, CK lacked W4A16 support entirely and encountered GPU faults on Kimi-K2.5’s E=384 expert configurations.
End-to-end results (8× MI300X, ROCm 7.2.0, SGLang with AITER):
| Metric | Change vs. baseline |
|---|---|
| TTFT (mean, prefill-dominated) | -65.3% |
| TPOT (mean, decode-dominated) | -69.2% |
| Output Throughput (decode-dominated) | +162.4% |
GSM8K accuracy was fully preserved at 96% across both baseline and optimized configurations. The FlyDSL kernel has been merged upstream into the ROCm/FlyDSL repository, and the full Docker image (clementlincf/amdafde:v0.5.8-rocm720-mi30x-kimi-k2.5-opt) is publicly available.
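The percentage deltas above are ordinary relative changes against the unoptimized baseline. A minimal sketch of the arithmetic, using made-up baseline and optimized values chosen only so the formulas land on the blog's reported figures (the actual measured latencies are not in this summary):

```python
def pct_change(before: float, after: float) -> float:
    """Signed percent change from `before` to `after`."""
    return (after - before) / before * 100.0

# Illustrative numbers only, not AMD's measurements:
ttft_base, ttft_opt = 10.0, 3.47    # seconds, prefill-dominated run
tpot_base, tpot_opt = 65.0, 20.0    # ms/token, decode-dominated run

print(f"TTFT change: {pct_change(ttft_base, ttft_opt):+.1f}%")  # -65.3%
print(f"TPOT change: {pct_change(tpot_base, tpot_opt):+.1f}%")  # -69.2%
```

Note that the +162.4% output-throughput figure is a separate server-level measurement under concurrent load; it does not have to equal the single-request speedup implied by the TPOT reduction.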
NVIDIA: Power-Flexible AI Factories for Grid Stabilization
NVIDIA, in collaboration with Emerald AI, EPRI, National Grid, and Nebius, demonstrated that AI datacenters running NVIDIA Blackwell Ultra GPUs (96-GPU cluster, NVIDIA Quantum-X800 InfiniBand) can act as real-time shock absorbers for the electricity grid. At Nebius’ London AI factory, the Emerald AI Conductor Platform dynamically throttled GPU power during simulated grid stress events — including the infamous “TV pickup” demand spike from UEFA EURO 2020 — achieving 100% alignment with over 200 power targets issued by National Grid, all without disrupting high-priority AI workloads. A real-world deployment at the Aurora AI Factory in Virginia is expected later in 2026.
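Emerald AI has not published the internals of the Conductor Platform, but the core idea, tracking a stream of grid-issued power targets by adjusting a cluster-wide power cap while protecting a floor reserved for high-priority workloads, can be illustrated with a toy proportional controller. Everything here (function name, step fraction, numbers) is a hypothetical sketch, not the demo's actual control law:

```python
def track_targets(targets, cap0, floor, ceil, step_frac=0.5):
    """Toy controller: move the cluster power cap a fraction of the
    remaining error toward each grid target per control tick, clamped
    to [floor, ceil] so priority jobs are never starved."""
    cap = cap0
    history = []
    for target in targets:
        desired = min(max(target, floor), ceil)  # respect the floor/ceiling
        cap += step_frac * (desired - cap)       # proportional step
        history.append(round(cap, 1))
    return history

# Grid asks for 800 kW, then 600 kW, then 900 kW; cluster starts at 1000 kW:
print(track_targets([800, 600, 900], cap0=1000, floor=500, ceil=1000))
# -> [900.0, 750.0, 825.0]
```

A real deployment would also need per-job throttling (e.g. frequency capping on low-priority training jobs) rather than a single scalar cap, which is presumably where the "without disrupting high-priority AI workloads" result comes from.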
AMD Ryzen AI NPUs: LLM Inference on Linux Reaches Maturity
Lemonade SDK 10.0.1 was released, building on the breakthrough 10.0 release that made LLM inference on AMD Ryzen AI XDNA 2 NPUs viable on Linux. New additions include a PPA with Debian packages for Ubuntu, setup guides for Arch Linux and Fedora, AppIndicator3 system tray support, Qwen3.5-4B support via FastFlowLM, and GGUF integration with Hugging Face model search. This positions AMD’s NPU as an increasingly credible edge-inference platform across Linux distributions.
⚡ GPU & Hardware
AMD FSR SDK 2.2: Neural Rendering Gets Sharper
AMD shipped FSR SDK 2.2 on GPUOpen, the first major update since the “Redstone” launch in December 2025. Key updates:
- FSR Upscaling 4.1: Improved sharpness in motion, enhanced ultra-performance and Dynamic Resolution Scaling (DRS) modes via updated ML inference
- FSR Ray Regeneration 1.1: Quality and memory improvements, new debug view modes — first integrated in Crimson Desert by Pearl Abyss (note: ABI-breaking change from v1.0)
All ML-powered features remain exclusive to RDNA 4 architecture (Radeon RX 9000 series), with analytical FSR 3 fallbacks covering RDNA 3.5 and older. The Unreal Engine 5 plugin remains on FSR Upscaling 4.0.3 pending a future update. Full SDK available on GitHub.
Huawei Atlas 350: China’s FP4 AI Accelerator
Announced at the Huawei China Partner Conference 2026 in Shenzhen, the Atlas 350 is built on the in-house Ascend 950PR die and targets AI prefill/inference workloads:
| Spec | Value |
|---|---|
| FP4 Throughput | 1.56 PFLOPS |
| HBM Capacity | 112GB (HiBL 1.0) |
| Memory Bandwidth | 1.4 TB/s |
| Interconnect (LingQu) | 2 TB/s |
| TDP | 600W |
| Price (reported) | ~111,000 Yuan (~$16,000 USD) |
Huawei claims 2.87× the FP4 performance of NVIDIA’s H20, though this is a difficult comparison since Hopper-generation GPUs lack native FP4 support. The Atlas 350 is notable as the first Chinese-domestic accelerator optimized for FP4 precision — a format NVIDIA only introduced with Blackwell. Despite U.S. sanctions blocking TSMC CoWoS packaging access, Huawei is employing alternative advanced packaging, with proprietary HBM sourced from an undisclosed supplier.
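Because the H20 has no native FP4 path, Huawei's 2.87× ratio must be measured against some other-precision baseline on the H20. A quick arithmetic check of what the claim implies (this back-calculation is ours, not Huawei's):

```python
atlas_fp4_pflops = 1.56   # Huawei's stated Atlas 350 FP4 throughput
claimed_ratio = 2.87      # Huawei's stated advantage over the H20

# The baseline the claim implicitly assigns to the H20:
implied_h20_pflops = atlas_fp4_pflops / claimed_ratio
print(f"Implied H20 baseline: {implied_h20_pflops:.2f} PFLOPS")  # ~0.54
```

Whether that ~0.54 PFLOPS figure corresponds to the H20's FP8 rate, an INT8 rate, or a measured workload is not stated, which is exactly why the comparison is hard to evaluate.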
Open-Source NVIDIA Driver: Nouveau + NVK Progress
Phoronix benchmarked the Nouveau + NVK (Mesa 26.1-dev) + Linux 7.0 stack against the proprietary NVIDIA R590 driver on an RTX 5080, showing measurable progress over the Ubuntu 25.10 baseline (Mesa 25.2 + Linux 6.17). While the open-source stack still trails the official driver in raw performance, the gap has narrowed meaningfully over six months, with NVK (Vulkan), Zink (OpenGL), and Rusticl (OpenCL) all improving.
Intel Arc and Crimson Desert: A GPU Ecosystem Win
Pearl Abyss reversed its controversial decision to exclude Intel Arc GPU users from Crimson Desert, publicly apologizing for “confusion” caused by dismissive FAQ language that originally suggested Arc users seek a refund. Intel confirmed it had reached out to Pearl Abyss “many times” to offer optimization assistance. Compatibility and optimization work is now underway, though no timeline was given. The episode highlights Intel’s continued challenge in gaining traction with game developers, even as Arc B-series GPUs reach GTX 1060-class performance.
🏭 Industry & Market
AMD + CIQ: Enterprise-Grade Rocky Linux for AI/HPC
AMD and CIQ announced a multi-phase collaboration to produce AMD-optimized Rocky Linux builds, targeting enterprise AI and HPC deployments on AMD Instinct GPUs with ROCm integration. The initiative includes day-zero validated AMD driver support, and a roadmap to integrate AMD optimizations across CIQ’s full infrastructure stack — including Warewulf Pro (cluster management), Ascender Pro (IT automation), Apptainer (containerization), and Fuzzball (workload orchestration). This mirrors Intel’s historical Clear Linux initiative and signals AMD’s increasing commitment to enterprise Linux ecosystems. GA timing has not been announced.
AMD EPYC + CXL: Kernel-Level Memory Optimization
AMD engineers posted the latest RFC patches for “pghot” — a hot-page tracking and promotion subsystem proposed for the Linux kernel. Tested on an AMD EPYC Zen 5 server with dual NUMA nodes and a CXL memory node, pghot unifies hot-page detection across kernel subsystems, centralizes promotion logic, and delivers measurable benchmark speedups in tiered-memory scenarios (both page promotion and demotion). This infrastructure improvement could prove critical for next-generation EPYC + CXL deployments at scale.
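The policy pghot centralizes can be pictured with a toy model: count accesses per page, and promote pages on the slow (CXL) tier whose access count crosses a hotness threshold to the fast (DRAM) tier. This is an illustration of the idea only; the names, threshold, and data structures below are not the patch series' actual kernel interfaces:

```python
from collections import Counter

def promote_hot_pages(accesses, placement, threshold=3):
    """Toy hot-page promotion: pages on the slow tier touched at least
    `threshold` times move to the fast tier. `placement` maps
    page -> "dram" | "cxl" and is mutated in place; returns the list
    of promoted pages."""
    heat = Counter(accesses)            # unified access counting
    promoted = []
    for page, count in heat.items():
        if placement.get(page) == "cxl" and count >= threshold:
            placement[page] = "dram"    # centralized promotion decision
            promoted.append(page)
    return promoted

placement = {"a": "cxl", "b": "cxl", "c": "dram"}
print(promote_hot_pages(["a", "a", "a", "b"], placement))  # -> ['a']
```

The real subsystem additionally has to aggregate hotness signals from multiple sources (NUMA hint faults, hardware counters, etc.) and handle demotion of cold DRAM pages, which is where the reported tiered-memory speedups come from.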
China AI Semiconductor Autonomy Accelerates
The Huawei Atlas 350 launch underscores the rapid pace of China’s domestic AI chip development despite U.S. export restrictions. While Chinese enterprises still procure NVIDIA GPUs where possible (CUDA ecosystem maturity remains a key advantage), the Atlas 350’s FP4 support, competitive pricing, and 2 TB/s LingQu interconnect suggest Huawei is narrowing the gap with Western AI accelerators faster than many anticipated.
🛠️ Developer Ecosystem
FlyDSL: Python-Native GPU Kernel Development for AMD
FlyDSL emerged this week as a first-class tool in AMD’s ROCm kernel optimization story. Its Python-native workflow — backed by a custom MLIR stack compiling through FLIR (Flexible Layout Intermediate Representation), targeting gfx942 (MI300X) and gfx950 (MI350) — enabled the Kimi-K2.5 fused MoE kernel to be developed and tuned in a fraction of the time required by Triton (manual tuning) or Composable Kernel (hand-written assembly). The kernel is now merged into the ROCm/FlyDSL repository. Software stack used: ROCm 7.2.0, PyTorch 2.9.1, Triton 3.5.1, AITER 0.1.5, SGLang.
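The "W4A16" half of the kernel's mixed-precision scheme refers to weights stored as 4-bit integers (two per byte, with a scale and zero-point) while activations stay in 16-bit floats. A minimal sketch of that storage format in plain Python, purely to illustrate the data layout; FlyDSL's real kernel fuses this dequantization into the MoE GEMM on-GPU, and the packing order and zero-point here are assumptions:

```python
def dequant_w4(packed, scale, zero=8):
    """Unpack two 4-bit weights per byte (low nibble first) and apply a
    per-group scale around a zero-point. Illustrative only."""
    out = []
    for byte in packed:
        lo, hi = byte & 0xF, (byte >> 4) & 0xF
        out.extend([(lo - zero) * scale, (hi - zero) * scale])
    return out

# One packed byte 0x98 holds nibbles 8 and 9; with scale 0.5 and the
# default zero-point these dequantize to 0.0 and 0.5:
print(dequant_w4([0x98], scale=0.5))  # -> [0.0, 0.5]
```

The payoff of doing this inside the fused kernel is bandwidth: weights move from HBM at 4 bits each and are widened to 16-bit precision only in registers, right before the multiply.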
GROMACS on AMD Instinct: Complete HPC Build Guide
AMD published a comprehensive build guide for GROMACS (molecular dynamics simulation) on MI300X and MI355X, covering the full dependency stack: UCX 1.19.1 (GPU-aware communication), OpenMPI 5.0.8, CMake 3.28, and GROMACS compiled with HIP for GPU acceleration. Both bare metal and containerized (Docker and Apptainer/Singularity) deployment paths are documented. Multi-GPU FFT decomposition (GMX_MULTI_GPU_FFT) and GPU-resident update/constraints (GMX_GPU_UPDATE) are highlighted as key performance flags. The guide was validated on Vultr and TensorWave MI300X cloud instances.
AMD FSR “Redstone” SDK: Unified Branding and Developer Tools
The FSR SDK 2.2 release formalized AMD’s brand consolidation, replacing the legacy AMD FidelityFX naming convention across all technologies with the unified AMD FSR brand. New FSR Naming Guidelines were published on GPUOpen to help developers correctly label FSR features in game UI. The SDK supports a single integration path targeting AMD RDNA 4 ML features while falling back gracefully to FSR 3 analytical modes on RDNA 3.5 and earlier — covering handheld, console, and PC platforms from one codebase.
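The "single integration path with graceful fallback" amounts to a capability check at feature-selection time. A toy model of that dispatch decision, using the behavior documented above; the architecture strings and dictionary keys are illustrative and are not SDK identifiers:

```python
def select_fsr_path(gpu_arch: str) -> dict:
    """Toy dispatch mirroring the documented SDK behavior: ML-powered
    FSR features on RDNA 4, analytical FSR 3 fallback elsewhere."""
    ml_capable = gpu_arch == "rdna4"  # RX 9000 series
    return {
        "upscaling": "FSR Upscaling 4.1" if ml_capable else "FSR 3 (analytical)",
        # Ray Regeneration has no analytical fallback in this sketch:
        "ray_regeneration": "FSR Ray Regeneration 1.1" if ml_capable else None,
    }

print(select_fsr_path("rdna4"))
print(select_fsr_path("rdna35"))
```

In the real SDK the equivalent check happens through the SDK's own device-capability queries at integration time, which is what lets one codebase span handheld, console, and PC.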
📊 Key Takeaways
AMD had a dominant week across both AI infrastructure and gaming graphics: the FlyDSL-powered Kimi-K2.5 optimization on MI300X — delivering 162% throughput gains via a Python-native kernel — demonstrates ROCm’s growing maturity as a serious CUDA alternative for production MoE inference, while FSR SDK 2.2 and the AMD-optimized Rocky Linux initiative show the company systematically hardening its ecosystem from the data center to the GPU. Meanwhile, Huawei’s Atlas 350 — the first Chinese AI accelerator with native FP4 support — signals that geopolitical hardware fragmentation in AI is accelerating, with domestic Chinese silicon increasingly capable of challenging export-restricted NVIDIA products in key inference workloads.
*Sources: AMD GPUOpen, ROCm Tech Blog, Phoronix, Tom’s Hardware, NVIDIA Blog | Coverage period: March 23–29, 2026*