January 26, 2026 · Generated 05:42 AM PT
Technical Intelligence Report: 2026-01-26
Executive Summary
- AMD Software (ROCm/Quark): Technical documentation revealed performance data for IBM Granite 4.0 on AMD Instinct MI300/MI355 GPUs. Using the AMD Quark quantization library (FP8), token throughput nearly doubled (~1.96x) compared to baseline.
- AMD LLVM Compiler: AMD engineers proposed a new Device-Side Profile Guided Optimization (PGO) for the AMDGPU LLVM backend. The proposal introduces “uniformity-aware” profiling to prevent performance regressions caused by thread divergence, showing 12-14% speedups on uniform branches.
- Competitor Analysis (NVIDIA): NVIDIA launched “Earth-2,” a suite of fully open-source AI weather models (Atlas, StormScope, HealDA). This move aggressively targets the scientific computing market, challenging traditional physics-based forecasting and setting a new bar for AI-driven meteorology performance.
🤖 ROCm Updates & Software
[2026-01-26] Accelerating IBM Granite 4.0 with FP8 using AMD Quark (MI300/MI355)
Source: ROCm Tech Blog (via GitHub Commit)
Key takeaway relevant to AMD:
- New Hardware Mention: The documentation explicitly references the AMD Instinct MI355 alongside the MI300, confirming active software stack readiness for this hardware iteration.
- Throughput Doubling: FP8 quantization via AMD Quark demonstrates a massive throughput increase (~96%) for Granite 4.0 models on AMD hardware, critical for competitive inference serving.
- Tooling Maturity: Demonstrates a mature workflow using Quark, vLLM, and Safetensors on ROCm, lowering the barrier to enterprise adoption of IBM's Granite LLMs.
Summary:
- A blog post draft (observed via Git history) details the quantization of IBM Granite 4.0 (8B) to FP8 using the AMD Quark library.
- The workflow integrates with ROCm-optimized vLLM to serve the quantized model on MI300/MI355 GPUs.
- Benchmarks show near-lossless accuracy recovery with significant performance gains.
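FP8 here refers to the OCP 8-bit floating-point formats; E4M3 (1 sign, 4 exponent, 3 mantissa bits, max normal 448) is the usual choice for inference weights. As a rough illustration of the precision involved (not Quark's actual implementation, which applies per-tensor scaling factors before casting), a minimal round-to-nearest E4M3 cast might look like:

```python
import math

def cast_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (1 sign / 4 exp / 3 mantissa, bias 7).

    Illustrative sketch only: real FP8 pipelines (e.g. AMD Quark) apply a
    per-tensor scale before casting, and NaN handling is omitted here.
    """
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = abs(x)
    # Max normal in E4M3 is 448; values beyond it saturate.
    if mag >= 448.0:
        return sign * 448.0
    # Exponent of the binade, clamped at the subnormal floor (2^-6).
    e = max(math.floor(math.log2(mag)), -6)
    step = 2.0 ** (e - 3)  # 3 mantissa bits -> 8 steps per binade
    return sign * round(mag / step) * step
```

With per-tensor scaling (dividing by amax/448 before the cast), most weight distributions land well inside the representable range, which is why accuracy recovery stays near-lossless.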
Details:
- Hardware Tested: AMD Instinct MI300 and MI355.
- Software Stack:
  - Docker: rocm/vllm-dev
  - Transformers: 4.56.0
  - Quark: 0.11 (AMD's quantization library)
  - vLLM: 0.11.0
- Quantization Process: Used AMD Quark to convert the ibm-granite/granite-4.0-h-small model to FP8 (E4M3/E5M2). This leverages native matrix-core support on MI300+.
- Performance Benchmarks (MI300):
- Throughput (Original/BF16): 13,018.16 tokens/second.
- Throughput (FP8): 25,541.64 tokens/second.
- Uplift: ~1.96x speedup.
- Accuracy Recovery:
- GSM8K: 98.75% recovery (85.60 vs 84.53).
- IFEVAL (Instruct, Strict): 100% recovery.
- Implications: The documentation provides a reproducible script using Quark.torch and LLMTemplate to export safetensors, showing that AMD's quantization pipeline is maturing for third-party models.
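The reported uplift and recovery figures are internally consistent; a quick sanity check of the arithmetic using the numbers above:

```python
# Throughput figures reported for Granite 4.0 on MI300.
bf16_tps = 13018.16  # baseline (BF16), tokens/second
fp8_tps = 25541.64   # FP8 via Quark, tokens/second

speedup = fp8_tps / bf16_tps
# "Nearly doubled": ~1.96x, i.e. a ~96% increase over baseline.
print(f"speedup: {speedup:.2f}x")

# GSM8K accuracy recovery: FP8 score as a fraction of the BF16 score.
bf16_gsm8k = 85.60
fp8_gsm8k = 84.53
recovery = 100 * fp8_gsm8k / bf16_gsm8k
print(f"GSM8K recovery: {recovery:.2f}%")
```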
[2026-01-26] AMD LLVM Device-Side PGO with Uniformity Detection
Source: Phoronix
Key takeaway relevant to AMD:
- Compiler Optimization: AMD is upstreaming sophisticated compiler features to LLVM that address specific GPU architecture bottlenecks (SIMT divergence).
- Performance Gains: Initial tests show double-digit percentage gains (12-14%) for uniform code paths, improving HIP/ROCm workload efficiency without hardware changes.
- Safety Mechanism: The “Uniformity-aware” feature prevents the common pitfall where PGO hurts GPU performance by optimizing for paths that cause memory coalescing issues.
Summary:
- AMD engineer Sam Liu opened an LLVM merge request for Device-Side Profile Guided Optimization (PGO) for the AMDGPU backend.
- The system uses standard HIP APIs (no CLR patches needed) for instrumentation and profile collection.
- It specifically addresses the risks of register spilling in divergent branches on GPUs.
Details:
- Problem: Standard CPU PGO often moves register spills to “cold” paths. On a GPU, if a “cold” path is taken by only some threads in a wave (divergence), it causes partial-wave memory accesses, leading to poor coalescing and up to 3.7x slowdowns.
- Solution: Uniformity-aware PGO. The compiler instruments code to detect if branches are uniform (all threads take the same path) or divergent at runtime.
- Optimization Strategy:
- If a branch is uniform, the compiler applies aggressive optimizations (like spill placement), resulting in 12-14% speedup.
- If a branch is divergent, the compiler gates these optimizations to prevent regression.
- Technical Implementation:
- Uses wave-aggregated counter increments to reduce atomic contention.
- Uses per-TU (Translation Unit) contiguous counter allocation to avoid linker reordering issues.
- Relevance: This is critical for maximizing performance in complex HIP applications where static analysis cannot predict runtime divergence.
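The profiling idea can be sketched abstractly: for each branch, record per wave whether every active lane took the same direction, and only mark a branch eligible for aggressive spill placement if no divergence was ever observed. A toy Python simulation of that bookkeeping (not the actual LLVM instrumentation; class and method names here are illustrative):

```python
from collections import defaultdict

WAVE_SIZE = 64  # wavefront size on CDNA-class AMD GPUs

class BranchProfile:
    """Toy model of device-side branch profiling with wave aggregation.

    One counter update per wave (rather than per lane) mimics the
    wave-aggregated increments that reduce atomic contention.
    """
    def __init__(self):
        self.uniform_waves = 0
        self.divergent_waves = 0

    def record_wave(self, taken_by_lane):
        # taken_by_lane: one bool per active lane in the wave.
        if all(taken_by_lane) or not any(taken_by_lane):
            self.uniform_waves += 1    # all lanes agreed
        else:
            self.divergent_waves += 1  # partial wave: divergence

    def is_uniform(self):
        # Aggressive optimizations (e.g. moving spills into the cold
        # path) are gated on never having observed divergence.
        return self.divergent_waves == 0

profiles = defaultdict(BranchProfile)

# Branch 0: every lane in every wave takes the same direction.
for _ in range(100):
    profiles[0].record_wave([True] * WAVE_SIZE)

# Branch 1: half the lanes diverge in one wave.
profiles[1].record_wave([True] * 32 + [False] * 32)
profiles[1].record_wave([True] * WAVE_SIZE)
```

In this sketch, branch 0 would receive the aggressive treatment (the source of the 12-14% win), while branch 1 would be gated to avoid the partial-wave coalescing penalty.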
🤼‍♂️ Market & Competitors
[2026-01-26] NVIDIA Launches Earth-2 Family of Open Models
Source: NVIDIA Blog & The Next Platform
Key takeaway relevant to AMD:
- Competitive Moat: NVIDIA is entrenching itself in the scientific computing/HPC vertical by releasing open “foundation models” for weather. To compete, AMD must ensure these open models (available on Hugging Face) run competitively on MI300, or risk losing the government and meteorological sector.
- Architecture Shift: The move confirms the industry shift from physics-based numerical weather prediction (NWP) to AI-based prediction, which runs significantly faster on GPUs.
Summary:
- NVIDIA unveiled the “Earth-2” family of open models, libraries, and frameworks for AI weather forecasting.
- The release includes three distinct architectures targeting different forecast horizons (Nowcasting to Medium Range).
- Major entities (Israel Meteorological Service, The Weather Company, NOAA/NWS) are already evaluating or deploying these models.
Details:
- New Models Released:
- Earth-2 Medium Range: Based on the “Atlas” architecture. Predicts up to 15 days out. Outperforms Google’s GenCast.
- Earth-2 Nowcasting: Based on the “StormScope” architecture. Focuses on 0-6 hour prediction at kilometer-scale resolution. Uses generative AI/transformers.
- Earth-2 Global Data Assimilation: Based on “HealDA” architecture. Used to create initial atmospheric snapshots.
- Performance:
- Israel Meteorological Service reports a 90% reduction in compute time at 2.5km resolution compared to running classic numerical models on a CPU cluster.
- CorrDiff model cited as superior for precipitation verification.
- Availability: Models are available via NVIDIA Earth2Studio, Hugging Face, and GitHub.
- Ecosystem Integration: Supports integration with NVIDIA Modulus (PhysicsNeMo) for fine-tuning.
- Strategy: By making these open source, NVIDIA encourages standardization on their CUDA-optimized architectures (“Atlas”, “StormScope”), potentially sidelining competitors if ROCm support for these specific architectures lags.