Here is the Technical Intelligence Report for 2026-03-13.

Executive Summary

  • Major AMD & Microsoft GDC 2026 Collaboration: AMD and Microsoft announced deep software integrations, bringing DirectStorage 1.4 (Zstandard), DirectX Linear Algebra (with direct WMMA core access), and native GPU-accelerated model-level ML to Windows. AMD also released a Developer Preview driver supporting Agility SDK 1.619 & 1.719-preview for RX 7000 and RX 9000 series GPUs.
  • RDNA4 Linux Compute Fix: The upcoming Linux 7.0 kernel includes a critical fix for RDNA4/GFX12 GPUs, resolving a bug that left cards reporting 100% utilization and drawing high power at idle after AI inference workloads such as Llama.cpp.
  • ByteDance Taps B200s via Malaysia: ByteDance is legally accessing 36,000 next-generation Nvidia B200 GPUs (a $2.5B cluster) through a Malaysian cloud provider, highlighting a gap in current US export restrictions, which govern where hardware ships but not who rents it remotely.
  • Intel vLLM Scaling: Intel released llm-scaler-vllm v0.14.0-b8.1, expanding Arc GPU support to handle massive Qwen3.5 LLMs (up to 122B parameters in FP8/INT4), reinforcing Intel’s push into accessible local AI compute.

🤖 ROCm Updates & Software

[2026-03-13] AMD and Microsoft partner on DirectX ML, DirectStorage, and developer tools at GDC 2026

Source: AMD GPUOpen

Key takeaway relevant to AMD:

  • AMD is ensuring day-one support for major shifts in the Windows ecosystem: developers gain direct shader access to AMD's WMMA matrix cores for matrix math, and machine learning models are optimized directly at the driver level via the new DirectX Compute Graph Compiler.

Summary:

  • At GDC 2026, Microsoft and AMD revealed sweeping updates to the DirectX ecosystem, including DirectStorage 1.4, advanced PIX developer tools interoperability, and the integration of machine learning into real-time graphics via DirectX Linear Algebra and Compute Graph Compiler.

Details:

  • DirectStorage 1.4: Introduces Zstandard compression for game assets, with decompression able to run on the CPU or GPU; AMD will release optimizations in a public driver in H2 2026 (format sketch after this list).
  • Developer Tools (PIX Integrations): AMD’s Radeon Raytracing Analyzer (RRA) now interoperates with PIX, allowing developers to export Acceleration Structures for deep BVH inspection. Radeon GPU Profiler (RGP) now natively displays PIX markers via new APIs. Initial support via Radeon Developer Tool Suite updates drops in Q2 2026.
  • DirectX Linear Algebra: Expands HLSL math to include matrix-matrix operations. Developers get direct access to AMD WMMA cores on Radeon graphics cards from shaders (Public preview: April 2026).
  • DirectX Compute Graph Compiler (CGC): Designed closely with AMD, CGC utilizes Windows MLIR Dialect to natively accelerate model-level ML. AMD’s drivers optimize kernels, scheduling, and memory for the underlying GPU architecture. (Private preview: Summer 2026).
  • Advanced Shader Delivery: Prevents compilation stutter by delivering pre-compiled shaders; first debuted on AMD Ryzen Z Series handhelds (ROG Xbox Ally / Ally X).
  • Driver & Hardware Support: AMD Software Developer Preview Edition 25.30.21.01 launched for Agility SDK 1.619 and 1.719-preview.
    • RX 9000 Series Exclusive Features: Long Vector, 16-bit float Specials, Shader Execution Reordering (SER) (Note: “MaybeReorderThreads” limitation applies), and VPblit 3DLUT.
    • RX 7000 & 9000 Series Supported Features: Revised Resource View Creation APIs, Increased Dispatch Grid Limit, CPU Timeline Query Resolves, and Fence Barriers.

[2026-03-13] Linux 7.0 AMDGPU Fixing Idle Power Issue For RDNA4 GPUs After Compute Workloads

Source: Phoronix (AMD Linux)

Key takeaway relevant to AMD:

  • Crucial fix for developers and users running local AI inference (like Llama.cpp) on RDNA4 hardware, resolving a severe bug that left GPUs pinned at maximum power draw even after tasks finished.

Summary:

  • A fix has been merged into the Linux 7.0 kernel for the AMDGPU driver to correct an issue where RDNA4 (GFX12) GPUs remained pinned at 100% utilization with high power consumption after running compute workloads.

Details:

  • Root Cause: Current and older MES firmware causes abnormal power consumption and incorrect load reporting after inference tasks run through the HIP back-end (e.g., Llama.cpp, Ollama).
  • Driver Workaround: The AMDGPU kernel driver patch adds a check to adjust the MES over-subscription timer, mitigating the power-draw bug without requiring users to flash new firmware (a quick sysfs check is sketched after this list).
  • Upcoming Firmware: AMD is simultaneously preparing a public release of new GPU MES firmware to permanently address the underlying hardware-level state issue.
  • Additional Fixes: The Linux 7.0 pull also incorporates updates for SMU13 and SMU14 power management blocks.
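
To check whether a given kernel and firmware combination still exhibits the bug, the driver's standard sysfs telemetry can be polled after an inference run. A minimal sketch, assuming the GPU is card0 (device and hwmon indices vary per system):

```python
# Poll amdgpu sysfs telemetry to spot the RDNA4 stuck-utilization bug.
# Assumptions: the GPU is card0 and exposes a single hwmon directory; both
# files are standard amdgpu attributes, but indices differ across systems.
import glob
import time

BUSY = "/sys/class/drm/card0/device/gpu_busy_percent"
POWER = glob.glob("/sys/class/drm/card0/device/hwmon/hwmon*/power1_average")

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

for _ in range(10):  # sample for ~10 s after the compute workload exits
    busy = read_int(BUSY)
    watts = read_int(POWER[0]) / 1_000_000 if POWER else float("nan")  # microwatts -> W
    print(f"busy={busy:3d}%  power={watts:.1f} W")
    time.sleep(1)
# An affected system stays pinned near busy=100% with elevated power here.
```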

🔲 AMD Hardware & Products

(No standalone hardware announcements today; hardware architectural support details for RX 9000/7000 are covered under Software/GDC updates).


🤼‍♂️ Market & Competitors

[2026-03-13] Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models

Source: Phoronix (AMD Linux)

Key takeaway relevant to AMD:

  • Intel continues to aggressively build out its AI software stack for Arc Graphics via vLLM, keeping Arc a highly competitive alternative in the entry-to-mid-tier local LLM hardware market.

Summary:

  • Intel launched version 0.14.0-b8.1 of its Docker-based llm-scaler-vllm, substantially expanding support for the newest generation of Qwen large language models on Intel Arc graphics.

Details:

  • Software Version: llm-scaler-vllm 0.14.0-b8.1.
    • Infrastructure: Built on vLLM and leverages Intel’s “Project Battlematrix” driver optimizations (client sketch after this list).
  • New LLM Support additions:
    • Qwen3.5-27B
    • Qwen3.5-35B-A3B
    • Qwen3.5-122B-A10B (Supporting both FP8 and INT4 quantization).
    • Qwen3-ASR-1.7B (Audio-speech recognition model).
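
Because llm-scaler-vllm builds on vLLM, a running container should expose vLLM's standard OpenAI-compatible HTTP endpoint. A minimal client sketch; the host, port, and served model name below are assumptions to match to the actual deployment:

```python
# Minimal client for a vLLM OpenAI-compatible endpoint, as a deployed
# llm-scaler-vllm container would expose. Host, port, and model name are
# assumptions; adjust them to your deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen3.5-27B",  # one of the newly supported models
        "messages": [{"role": "user", "content": "Summarize vLLM in one line."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```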

[2026-03-13] China’s ByteDance to access 36,000 Blackwell GPU cluster through Malaysia cloud operator — Nvidia confirms no objections, deal is in line with US export controls

Source: Tom’s Hardware (GPUs)

Key takeaway relevant to AMD:

  • Chinese hyperscaler AI demand remains largely unhampered by geographic restrictions. By renting offshore cloud capacity, Chinese mega-corporations can keep absorbing massive volumes of high-end accelerators (Nvidia B200 today, potentially AMD MI300X/MI350 series as well), sustaining extreme data center revenue growth for hardware vendors.

Summary:

  • Chinese tech giant ByteDance is legally accessing a $2.5 billion cluster of 36,000 Nvidia Blackwell (B200) GPUs hosted by a Malaysian cloud provider, in compliance with current US export regulations.

Details:

  • Hardware Scale: The Malaysian cluster consists of 500 GB200 NVL72 rack-scale systems (72 GPUs each, 36,000 B200 GPUs total), valued at approximately $2.5 billion. Hardware is supplied by AI server builder Aivres.
  • Operator: Aolani Cloud (Cayman Islands holding structure), designated as an Nvidia Tier-1 cloud partner.
  • Export Control Loophole: US Department of Commerce (BIS) export rules restrict where advanced hardware (such as B200s) may be physically shipped, but do not restrict remote end-user access via cloud computing.
  • Nvidia Compliance: Nvidia verified that Aolani passes all operational, financial, and compliance checks.
  • Future Expansions: ByteDance is already planning a second cluster of 7,000 B200 GPUs to be deployed in a data center in Indonesia.

💬 Reddit & Community

(No notable updates today).


🔬 Research & Papers

(No notable updates today).

📈 GitHub Stats

Category           | Repository                 | Total Stars | 1-Day | 7-Day | 30-Day
AMD Ecosystem      | AMD-AGI/GEAK-agent         |          73 |    +2 |    +4 |    +11
AMD Ecosystem      | AMD-AGI/Primus             |          82 |    +2 |    +6 |     +8
AMD Ecosystem      | AMD-AGI/TraceLens          |          63 |     0 |    +1 |     +5
AMD Ecosystem      | ROCm/MAD                   |          31 |     0 |     0 |      0
AMD Ecosystem      | ROCm/ROCm                  |       6,247 |    +2 |   +22 |    +87
Compilers          | openxla/xla                |       4,066 |    +4 |   +19 |    +88
Compilers          | tile-ai/tilelang           |       5,364 |    +3 |   +34 |   +206
Compilers          | triton-lang/triton         |      18,642 |    +8 |   +73 |   +240
Google / JAX       | AI-Hypercomputer/JetStream |         415 |     0 |     0 |     +9
Google / JAX       | AI-Hypercomputer/maxtext   |       2,169 |    +1 |    +7 |    +31
Google / JAX       | jax-ml/jax                 |      35,074 |   +10 |   +64 |   +234
HuggingFace        | huggingface/transformers   |     157,804 |   +15 |  +313 |  +1444
Inference Serving  | alibaba/rtp-llm            |       1,066 |    +3 |    +7 |    +19
Inference Serving  | efeslab/Atom               |         335 |     0 |    -1 |     -1
Inference Serving  | llm-d/llm-d                |       2,609 |    +6 |   +28 |   +132
Inference Serving  | sgl-project/sglang         |      24,418 |   +51 |  +251 |   +910
Inference Serving  | vllm-project/vllm          |      73,014 |   +84 |  +776 |  +2944
Inference Serving  | xdit-project/xDiT          |       2,566 |    +1 |    +6 |    +31
NVIDIA             | NVIDIA/Megatron-LM         |      15,637 |   +24 |  +106 |   +449
NVIDIA             | NVIDIA/TransformerEngine   |       3,206 |    +5 |   +19 |    +47
NVIDIA             | NVIDIA/apex                |       8,930 |    +1 |    +2 |    +15
Optimization       | deepseek-ai/DeepEP         |       9,044 |    +1 |   +23 |    +69
Optimization       | deepspeedai/DeepSpeed      |      41,802 |    +1 |   +46 |   +205
Optimization       | facebookresearch/xformers  |      10,368 |    +2 |    +8 |    +34
PyTorch & Meta     | meta-pytorch/monarch       |         989 |     0 |    +4 |    +21
PyTorch & Meta     | meta-pytorch/torchcomms    |         347 |     0 |    +2 |    +17
PyTorch & Meta     | meta-pytorch/torchforge    |         641 |    +1 |    +9 |    +25
PyTorch & Meta     | pytorch/FBGEMM             |       1,540 |    +1 |    +2 |    +10
PyTorch & Meta     | pytorch/ao                 |       2,729 |     0 |    +9 |    +57
PyTorch & Meta     | pytorch/audio              |       2,838 |    +1 |    +4 |    +11
PyTorch & Meta     | pytorch/pytorch            |      98,236 |    +9 |  +237 |   +899
PyTorch & Meta     | pytorch/torchtitan         |       5,139 |    +6 |   +29 |    +77
PyTorch & Meta     | pytorch/vision             |      17,563 |    +2 |   +19 |    +56
RL & Post-Training | THUDM/slime                |       4,740 |   +27 |  +145 |   +997
RL & Post-Training | radixark/miles             |         974 |    +2 |   +24 |   +112
RL & Post-Training | volcengine/verl            |      19,876 |   +21 |  +203 |   +725