Update: 2026-03-24 (07:17 AM)
Executive Summary
- AMD Instinct Software Optimization: A new FlyDSL-based optimized kernel has been introduced for Kimi-K2.5 inference on MI300X, resolving `fused_moe` bottlenecks and yielding up to 162% throughput increases with mixed-precision (W4A16 + BF16) computation.
- Streamlined HPC Deployments: AMD published comprehensive, end-to-end build guides for compiling GROMACS with GPU-aware MPI (UCX/OpenMPI) targeting MI300X and MI355X architectures across bare metal, Docker, and Apptainer environments.
- Ecosystem Expansion: AMD has partnered with CIQ to develop an AMD-optimized Rocky Linux distribution tailored specifically for AI and HPC workloads, featuring day-zero ROCm support and validated Instinct drivers.
- Competitor Hardware Landscape: Huawei launched the Atlas 350 AI accelerator in China, built on Ascend 950PR silicon. The 600W chip leverages native FP4 precision and proprietary “HiBL” HBM to directly compete with NVIDIA’s H20 in the inference market.
🤖 ROCm Updates & Software
[2026-03-24] Accelerating Kimi-K2.5 on AMD Instinct™ MI300X: Optimizing Fused MoE …
Source: ROCm Tech Blog
Key takeaway relevant to AMD:
- AMD’s FlyDSL toolset provides a massive advantage for developers by allowing rapid, Python-native development of highly optimized MLIR-backed kernels. This enables AMD hardware to rapidly adapt to and accelerate trending models like Kimi-K2.5 without relying on hand-written assembly.
Summary:
- Profiling Kimi-K2.5 on MI300X revealed that `fused_moe` consumed ~88-90% of GPU execution time.
- AMD engineers used FlyDSL to quickly build a mixed-precision (W4A16 + BF16) fused MoE kernel, outperforming PyTorch, Triton, and Composable Kernel (CK) out of the box.
- End-to-end framework integration within SGLang and AITER yielded drastic latency reductions and throughput improvements without compromising model accuracy.
Details:
- FlyDSL & Architecture: FlyDSL utilizes FLIR (Flexible Layout Intermediate Representation), a layout algebra system that compiles through custom MLIR passes to optimized binaries targeting `gfx942` (MI300X) and `gfx950` (MI350).
- Software Stack Evaluated: Built on ROCm 7.2.0, PyTorch 2.9.1, Triton 3.5.1, and custom branches for SGLang (`kimi-K2.5-dev`) and AITER (`dev/kimi-K2.5`).
- Kernel Benchmarks (Large Shape: tokens=16384, model_dim=7168, E=384): FlyDSL achieved 8.68 ms (BF16) and 9.77 ms (W4A16), significantly beating Torch (119.82 ms) and Triton (12.09 ms / 31.43 ms), while CK failed outright (gpu_fault/unsupported).
- End-to-End Performance (Concurrency = 2): Time To First Token (TTFT) reduced by 65.3% (from 2918.16ms to 1014.13ms). Output throughput increased by 47.1%.
- End-to-End Performance (Concurrency = 40): Time Per Output Token (TPOT) reduced by 69.2% (from 230.37ms to 70.86ms). Output throughput increased by 162.4% (from 135.39 tok/s to 355.35 tok/s).
- Accuracy: Verified using `lm-eval-harness` on GSM8K (10-shot), maintaining an identical 0.96 exact_match with no degradation.
- Hybrid Precision: Employs `FLYDSL_W4A16_HYBRID=w2_bf16` to run MoE Stage 1 in W4A16 (4-bit weights) and Stage 2 in BF16 (full precision), balancing memory savings against compute stability.
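The hybrid-precision switch above is a single environment setting. A minimal sketch of how it might be applied at serving time; only the `FLYDSL_W4A16_HYBRID=w2_bf16` variable is taken from the post, while the launch command, its flags, and the model path are illustrative assumptions:

```shell
# Select the hybrid MoE path from the post: Stage 1 in W4A16, Stage 2 in BF16.
export FLYDSL_W4A16_HYBRID=w2_bf16

# Hypothetical serving launch -- entry point, flags, and model path are
# assumptions for illustration, not taken from the post:
# python -m sglang.launch_server --model-path <kimi-k2.5-checkpoint> --tp 8

echo "MoE hybrid mode: $FLYDSL_W4A16_HYBRID"
```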
[2026-03-24] GROMACS on AMD Instinct GPUs: A Complete Build Guide Blog Post
Source: ROCm Tech Blog
Key takeaway relevant to AMD:
- Providing an official, validated end-to-end build guide for GROMACS with GPU-aware MPI substantially reduces friction for HPC system administrators adopting MI300X/MI355X instances on-premises or in the cloud.
Summary:
- Outlines a definitive installation pipeline for deploying GROMACS molecular dynamics workloads on AMD MI300X and MI355X hardware.
- Covers compiling the necessary communication stack (UCX and OpenMPI) alongside GROMACS to enable multi-GPU scaling via HIP.
- Provides deployment tracks for bare metal, Apptainer (Singularity), and Docker environments.
Details:
- Hardware/Target Flags: Specifically targets `gfx942` for MI300X and `gfx950` for MI355X processors.
- ROCm Requirements: Requires ROCm ≥ 6.0.0 for MI300X and ROCm ≥ 7.0.0 for MI355X.
- Dependency Stack Built: UCX 1.19.1 (configured with `--with-rocm` and `--without-cuda`), OpenMPI 5.0.8, and CMake 3.28.0.
- GROMACS Build Specs: Uses branch `4947-hip-feature-enablement`. Compiles with critical CMake flags: `-DGMX_GPU=HIP`, `-DGMX_GPU_UPDATE=ON`, `-DGMX_MPI=ON`, `-DGMX_OPENMP=ON`, `-DGMX_GPU_FFT=ON`, and `-DGMX_MULTI_GPU_FFT=ON`.
- Containerization: Container workflows are built on the `rocm/dev-ubuntu-24.04:latest` base image. The Apptainer guide configures runtime variables to force GPU-aware MPI (`GMX_FORCE_GPU_AWARE_MPI=1`) and bypass ROCm active-wait timeouts to reduce latency.
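The pipeline above can be sketched as a dry run that assembles the documented configure/CMake invocations. Versions and flags come from the guide; the `/opt/rocm` location and the UCX install prefix are assumptions for illustration:

```shell
set -eu

# ROCm install location -- an assumption; adjust for your system.
ROCM_PATH=/opt/rocm

# UCX 1.19.1 configure flags from the guide.
UCX_FLAGS="--with-rocm=$ROCM_PATH --without-cuda"

# GROMACS (branch 4947-hip-feature-enablement) CMake flags from the guide.
GMX_CMAKE_FLAGS="-DGMX_GPU=HIP -DGMX_GPU_UPDATE=ON -DGMX_MPI=ON \
-DGMX_OPENMP=ON -DGMX_GPU_FFT=ON -DGMX_MULTI_GPU_FFT=ON"

# Dry run: print the steps instead of executing them, since a real build
# needs ROCm, the UCX/OpenMPI sources, and the GROMACS branch checked out.
echo "UCX:     ./contrib/configure-release $UCX_FLAGS"
echo "OpenMPI: ./configure --with-ucx=<ucx-install-prefix>   # 5.0.8"
echo "GROMACS: cmake .. $GMX_CMAKE_FLAGS"
```

Running the echoed commands in order (UCX, then OpenMPI against that UCX, then GROMACS) reproduces the dependency chain the guide describes.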
[2026-03-24] AMD-Optimized Rocky Linux Distribution To Focus On AI & HPC Workloads
Source: Phoronix (AMD Linux)
Key takeaway relevant to AMD:
- A dedicated Linux distribution optimally tuned for ROCm out-of-the-box eliminates complex OS-level configuration hurdles for enterprise customers, directly accelerating the time-to-deployment for AMD Instinct clusters.
Summary:
- AMD and CIQ are collaborating on a customized, AMD-optimized version of Rocky Linux (a RHEL derivative).
- Designed specifically for AI and HPC environments, serving as a robust foundation for Instinct hardware and the ROCm software stack.
- The OS will provide “day-zero” deployment capabilities for large language model training and scientific simulations.
Details:
- Core Offerings: The build will ship with natively validated AMD drivers and pre-configured ROCm support, significantly lowering enterprise procurement and technical barriers.
- Future Integrations: Beyond the base OS, CIQ intends to build ROCm support across its wider infrastructure portfolio. This includes Warewulf Pro (cluster management), Ascender Pro (IT automation), Apptainer (containerization), and Fuzzball (workload orchestration).
- Strategic Ecosystem Move: Unlike Intel’s Clear Linux which focused heavily on CPU micro-optimizations, this distribution appears hyper-focused on accelerator (GPU/NPU) and ROCm enablement, though underlying EPYC tuning is also anticipated.
[2026-03-24] Merge branch ‘release’ of https://github.com/ROCm/rocm-blogs-internal…
Source: ROCm Tech Blog
Key takeaway relevant to AMD:
- Confirms active maintenance and continuous integration of the AMD ROCm public documentation and enablement ecosystem.
Summary:
- A repository maintenance commit logging the merge of internal staging branches to the public-facing release branch.
Details:
- Commit Actions: Merged parent commits 0f6884e and 363b7bd.
- Developer Engagement: The commit page includes the line “We read every piece of feedback, and take your input very seriously,” which appears to be GitHub’s standard feedback-widget text rather than a commit-specific statement from AMD.
🤼‍♂️ Market & Competitors
[2026-03-24] Huawei unveils new Atlas 350 AI accelerator with 1.56 PFLOPS of FP4 compute and up to 112GB of HBM — claims 2.8x more performance than Nvidia’s H20
Source: Tom’s Hardware (GPUs)
Key takeaway relevant to AMD:
- Huawei’s aggressive development of Chinese domestic silicon, using advanced formats like FP4 and proprietary memory tech, shows it continuing to advance despite US sanctions. While the direct target is NVIDIA’s H20, this creates an increasingly contested landscape for any export-compliant Instinct parts aimed at the Chinese market.
Summary:
- Huawei announced the Atlas 350, an AI accelerator built on Ascend 950PR silicon explicitly targeting prefill (inference) AI workloads.
- Promotes native FP4 precision to achieve 1.56 PFLOPS throughput, heavily undercutting NVIDIA’s China-restricted H20 GPU.
- Implements proprietary components to bypass TSMC/Western constraints, including “HiBL 1.0” memory and advanced domestic packaging.
Details:
- Compute Metrics: Claims 1.56 PFLOPS of FP4 compute. By leveraging FP4 (which Hopper/H20 natively lacks but Blackwell introduced), Huawei can execute larger models with reduced memory footprint.
- Memory & Bandwidth: Features 112GB of proprietary “HiBL 1.0” HBM, maxing out at 1.4 TB/s. Memory access granularity has been heavily optimized, reduced from 512 bytes down to 128 bytes.
- Interconnect: Uses the new LingQu protocol, providing 2 TB/s of interconnect bandwidth (a 2.5x increase over the older Ascend 910 series).
- Power & Pricing: Operates at a 600W TDP (200W higher than the H20). Rumored pricing is approximately 111,000 Yuan (~$16,000), positioning it competitively against H20’s $15k-$25k regional pricing.
- Availability: Expected Q1 2026 release. Hardware gaps aside, the robust CUDA software stack remains a primary hurdle Huawei is trying to overcome with localized equivalents.
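The FP4 memory argument in the compute-metrics bullet above is easy to sanity-check with back-of-envelope arithmetic. A sketch, assuming a hypothetical 200B-parameter model and counting weight storage only (ignoring KV cache and activations):

```shell
# Bits per weight: FP4 = 4, BF16 = 16; divide by 8 to get bytes,
# so GB of weights = billions of params * bits / 8.
PARAMS_B=200  # hypothetical 200B-parameter model

FP4_GB=$((PARAMS_B * 4 / 8))    # 100 GB -> fits within the 112GB HiBL pool
BF16_GB=$((PARAMS_B * 16 / 8))  # 400 GB -> would need multi-chip sharding

echo "FP4:  ${FP4_GB} GB"
echo "BF16: ${BF16_GB} GB"
```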