Update: 2026-03-13 (09:31 AM)
Here is the Technical Intelligence Report for 2026-03-13.
Executive Summary
- AMD & Microsoft Deepen DirectX/ML Integration: At GDC 2026, AMD and Microsoft announced a sweeping set of updates to the Windows graphics platform, focused heavily on machine learning integration (DirectX Linear Algebra, Compute Graph Compiler) and developer tools. AMD is offering day-one driver support for the new Agility SDK 1.619 and 1.719-preview, unlocking architecture-specific features for the RX 7000 and the newer RX 9000 series.
- Critical RDNA4 Linux Fixes: The Linux 7.0 kernel will introduce an essential fix for RDNA4 (GFX12) GPUs, specifically targeting the Radeon R9700. The patch resolves a severe idle power draw issue triggered by compute workloads (like Llama.cpp via HIP) by implementing an AMDGPU kernel workaround for the MES firmware’s over-subscription timer.
- Intel Expands Arc LLM Support: Intel has released a new update to its LLM-Scaler-vLLM deployment framework (v0.14.0-b8.1), introducing support for massive Qwen3.5 models (up to 122B parameters in FP8/INT4), leveraging recent Project Battlematrix driver enhancements.
- Nvidia Blackwell Loophole Exploited by ByteDance: ByteDance is navigating US export controls to access next-gen AI compute by leasing a massive $2.5 billion cluster of 36,000 Nvidia B200 GPUs physically located in Malaysia, setting a precedent for Chinese AI firms leveraging offshore cloud infrastructure.
🤖 ROCm Updates & Software
[2026-03-13] AMD and Microsoft partner on DirectX ML, DirectStorage, and developer tools at GDC 2026
Source: AMD GPUOpen
Key takeaway relevant to AMD:
- AMD is providing day-one developer driver support (v25.30.21.01) for major machine learning and toolchain advancements in DirectX, ensuring native hardware-accelerated ML capabilities (such as direct WMMA core access) are highly optimized for RDNA3 (RX 7000) and RDNA4 (RX 9000) architectures.
- Interoperability between Microsoft PIX and AMD’s developer suites (RRA and RGP) has been significantly improved, drastically reducing the friction for profiling and debugging raytracing and performance regressions on Radeon hardware.
Summary:
- Microsoft and AMD co-presented at GDC 2026, unveiling DirectX Linear Algebra, DirectX Compute Graph Compiler (CGC), DirectStorage 1.4, and the largest wave of new DirectX tooling features to date.
- AMD released the AMD Software: Agility SDK Developer Preview Edition 25.30.21.01 driver to support the newly released Agility SDK 1.619 and 1.719-preview, detailing hardware-specific feature support for RX 7000 and RX 9000 series GPUs.
Details:
- Machine Learning Integration:
- DirectX Linear Algebra: Expands Shader Model 6.9’s Cooperative Vector feature to include matrix-matrix operations, allowing shaders to access AMD WMMA (Wave Matrix Multiply Accumulate) cores directly. (Public preview: April 2026).
- DirectX Compute Graph Compiler (CGC) & Windows MLIR: A new API enabling native GPU acceleration for full model execution. It processes models through the MLIR Dialect, allowing AMD drivers to automatically optimize scheduling, operator fusion, and memory planning for underlying hardware. (Private preview: Summer 2026).
- Storage & Compilation:
- DirectStorage 1.4: Introduces CPU/GPU Zstandard compression via the Game Asset Conditioning Library. AMD is currently optimizing Zstandard decompression for both Radeon GPUs and Ryzen CPUs, targeting a public driver release in 2H 2026.
- Advanced Shader Delivery: First seen on Ryzen Z Series handhelds, new APIs in Agility SDK 1.619 allow gamers to download fully compiled, hardware-tailored shaders directly, circumventing local compilation stutter.
- Developer Tooling Capabilities (Available Q2 2026):
- DirectX Dump Files & PIX Updates: Full interoperability between PIX and AMD Radeon Raytracing Analyzer (RRA). Acceleration Structures can now be exported directly from PIX GPU Captures to diagnose BVH structures and scene hotspots.
- AMD Radeon GPU Profiler (RGP): Now natively displays PIX markers using new Event Configurability APIs, requiring no special headers or recompilation.
- Agility SDK 1.619 & 1.719-preview Support Matrix:
- RX 9000 Series Exclusive: Long Vector, 16-bit float Specials, Shader Execution Reordering (SER) (note: as an AMD limitation, “MaybeReorderThreads” does not move threads), VPblit 3DLUT (1.719-preview).
- RX 7000 and RX 9000 Series: Revised Resource View Creation APIs, Increased Dispatch Grid Limit, CPU Timeline Query Resolves, Fence Barriers (1.719-preview).
[2026-03-13] Linux 7.0 AMDGPU Fixing Idle Power Issue For RDNA4 GPUs After Compute Workloads
Source: Phoronix (AMD Linux)
Key takeaway relevant to AMD:
- A major power management flaw affecting RDNA4 hardware in compute scenarios (especially LLM inference via HIP) has been patched in the Linux 7.0 kernel, ensuring developers and enterprise users do not experience severe thermal/power drain when the GPU should be idling.
Summary:
- A fix is being merged into the Linux 7.0 kernel via the AMDGPU driver to resolve high idle power consumption and false 100% utilization reporting on RDNA4/GFX12 GPUs.
- The underlying cause is tied to the GPU MES (Micro-Engine Scheduler) firmware, and while a new firmware update is pending, the kernel patch provides an immediate software-level workaround.
Details:
- Trigger Condition: The bug is consistently reproduced by running inference through the HIP back-end in tools like Llama.cpp or Ollama. After the workload finishes, the GPU fails to downclock, remaining at a reported 100% usage and drawing maximum power.
- Affected Hardware: Specifically confirmed to impact RDNA4 / GFX12 hardware (the Radeon R9700 GPU is explicitly cited in the November bug report).
- Technical Fix: The AMDGPU kernel driver has introduced a check that actively adjusts the MES over-subscription timer. This mitigates the issue for users running older firmware until AMD’s official MES firmware update is rolled out publicly.
- Additional DRM Updates: The broader pull request for Linux 7.0 also includes fixes for SMU13 and SMU14 (System Management Unit) alongside minor AMDGPU code refinements.
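Until the kernel fix and MES firmware update land, the symptom above can be spotted from userspace via the amdgpu sysfs interface. The sketch below is a minimal, hedged illustration: `gpu_busy_percent` and the hwmon `power1_average` attribute are standard amdgpu sysfs files, but the card index and hwmon node number vary per system, and the 90% threshold is an arbitrary heuristic, not anything from the bug report.

```python
from pathlib import Path


def read_metric(path: Path) -> int:
    """Read a single integer metric from an amdgpu sysfs file."""
    return int(path.read_text().strip())


def is_stuck_busy(busy_percent: int, threshold: int = 90) -> bool:
    """Heuristic for the bug described above: the GPU keeps reporting
    near-100% utilization even though no workload is running."""
    return busy_percent >= threshold


# Example usage (paths are illustrative; card index and hwmon node differ
# per system, and these files exist only on amdgpu-driven hardware):
#   busy = read_metric(Path("/sys/class/drm/card0/device/gpu_busy_percent"))
#   uw = read_metric(Path("/sys/class/drm/card0/device/hwmon/hwmon1/power1_average"))
#   print(busy, uw / 1_000_000, is_stuck_busy(busy))  # power1_average is in microwatts
```

If this reports a high busy percentage minutes after an Llama.cpp run has exited, the machine is likely hitting the MES over-subscription issue and needs the patched kernel or updated firmware.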
🤼‍♂️ Market & Competitors
[2026-03-13] Intel Updates LLM-Scaler-vLLM With Support For More Qwen3/3.5 Models
Source: Phoronix (AMD Linux)
Key takeaway relevant to AMD:
- Intel is aggressively expanding out-of-the-box, optimized open-source LLM support for its Arc Graphics hardware. AMD must ensure ROCm/vLLM parity and ease-of-deployment (via Docker) to maintain its competitive edge against Intel’s growing “Project Battlematrix” software stack.
Summary:
- Intel released version 0.14.0-b8.1 of its LLM-Scaler-vLLM project.
- The update brings targeted deployment support for massive, newly released Qwen3 and Qwen3.5 LLMs to Intel Arc hardware.
Details:
- Framework Update: The release (llm-scaler-vllm 0.14.0-b8.1) acts as a Docker-based deployment environment built on top of the standard vLLM architecture, specifically tailored for Intel GPUs.
- Driver Foundation: Relies heavily on driver enhancements developed over the past year under Intel’s “Project Battlematrix”.
- New Model Support: Adds seamless deployment capabilities for:
- Qwen3.5-27B
- Qwen3.5-35B-A3B
- Qwen3.5-122B-A10B (Supporting both FP8 and INT4 quantizations)
- Qwen3-ASR-1.7B
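Since llm-scaler-vllm sits on top of vLLM, a deployed container exposes vLLM’s standard OpenAI-compatible HTTP API. The sketch below shows what querying one of the newly supported models could look like; the base URL, port, and the `Qwen3.5-27B` model identifier are assumptions for illustration, so check the container’s documentation for the actual values your deployment registers.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Assemble an OpenAI-style chat-completions payload for a vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to the server's /v1/chat/completions route."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (endpoint and model id are assumptions, not from the release notes):
#   reply = post_chat("http://localhost:8000",
#                     build_chat_request("Qwen3.5-27B", "Summarize RDNA4 in one line."))
```

The same client code works unchanged against a ROCm/vLLM deployment, which is precisely the parity point raised in the takeaway above.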
[2026-03-13] China’s ByteDance to access 36,000 Blackwell GPU cluster through Malaysia cloud operator — Nvidia confirms no objections, deal is in line with US export controls
Source: Tom’s Hardware (GPUs)
Key takeaway relevant to AMD:
- US export controls restrict hardware shipment destinations, not cloud access. This geopolitical loophole means hyperscalers and Chinese mega-corps like ByteDance can still pump billions into next-generation AI accelerators (like Nvidia B200 or AMD MI300/MI350 series) via Southeast Asian proxy cloud operators. AMD could potentially capture similar massive offshore cloud deals without violating BIS regulations.
Summary:
- ByteDance is circumventing hardware import bans by leasing a massive 36,000 Nvidia B200 GPU cluster located in Malaysia to power its AI R&D.
- Nvidia and the US Department of Commerce’s BIS confirm the arrangement is fully legal because the physical hardware resides outside of China.
Details:
- Cluster Specifications: The cluster consists of 500 NVL72 GB200 rack-scale systems, totaling 36,000 B200 GPUs.
- Financial Scope: The hardware valuation is estimated at $2.5 billion.
- Operational Chain: Aivres is supplying the servers to Aolani Cloud (established late 2023, Cayman Islands holding structure), a Tier-1 Nvidia cloud partner in Malaysia, which in turn leases the compute to ByteDance.
- Existing / Future Deployments: ByteDance has already been testing this pipeline by leasing H100 GPU servers from Aolani since February 2025. ByteDance is also planning an additional cluster of over 7,000 B200 GPUs in an Indonesian data center.
- Regulatory Mechanism: Current 2023 US export controls dictate where the chips are physically shipped, intentionally allowing global cloud infrastructure to be built on American hardware. Because ByteDance is not on the Entity List or Military End Use (MEU) list, Nvidia field operations and compliance teams have legally cleared the hardware delivery to the Malaysian operator.
📈 GitHub Stats
| Category | Repository | Total Stars | 1-Day | 7-Day | 30-Day |
|---|---|---|---|---|---|
| AMD Ecosystem | AMD-AGI/GEAK-agent | 73 | +2 | +4 | +11 |
| AMD Ecosystem | AMD-AGI/Primus | 82 | +2 | +6 | +8 |
| AMD Ecosystem | AMD-AGI/TraceLens | 63 | 0 | +1 | +5 |
| AMD Ecosystem | ROCm/MAD | 31 | 0 | 0 | 0 |
| AMD Ecosystem | ROCm/ROCm | 6,247 | +2 | +22 | +87 |
| Compilers | openxla/xla | 4,066 | +4 | +19 | +88 |
| Compilers | tile-ai/tilelang | 5,364 | +3 | +34 | +206 |
| Compilers | triton-lang/triton | 18,643 | +9 | +74 | +241 |
| Google / JAX | AI-Hypercomputer/JetStream | 415 | 0 | 0 | +9 |
| Google / JAX | AI-Hypercomputer/maxtext | 2,169 | +1 | +7 | +31 |
| Google / JAX | jax-ml/jax | 35,076 | +12 | +66 | +236 |
| HuggingFace | huggingface/transformers | 157,768 | -21 | +277 | +1408 |
| Inference Serving | alibaba/rtp-llm | 1,066 | +3 | +7 | +19 |
| Inference Serving | efeslab/Atom | 335 | 0 | -1 | -1 |
| Inference Serving | llm-d/llm-d | 2,609 | +6 | +28 | +132 |
| Inference Serving | sgl-project/sglang | 24,419 | +52 | +252 | +911 |
| Inference Serving | vllm-project/vllm | 72,996 | +66 | +758 | +2926 |
| Inference Serving | xdit-project/xDiT | 2,566 | +1 | +6 | +31 |
| NVIDIA | NVIDIA/Megatron-LM | 15,640 | +27 | +109 | +452 |
| NVIDIA | NVIDIA/TransformerEngine | 3,207 | +6 | +20 | +48 |
| NVIDIA | NVIDIA/apex | 8,930 | +1 | +2 | +15 |
| Optimization | deepseek-ai/DeepEP | 9,044 | +1 | +23 | +69 |
| Optimization | deepspeedai/DeepSpeed | 41,803 | +2 | +47 | +206 |
| Optimization | facebookresearch/xformers | 10,367 | +1 | +7 | +33 |
| PyTorch & Meta | meta-pytorch/monarch | 989 | 0 | +4 | +21 |
| PyTorch & Meta | meta-pytorch/torchcomms | 347 | 0 | +2 | +17 |
| PyTorch & Meta | meta-pytorch/torchforge | 641 | +1 | +9 | +25 |
| PyTorch & Meta | pytorch/FBGEMM | 1,540 | +1 | +2 | +10 |
| PyTorch & Meta | pytorch/ao | 2,730 | +1 | +10 | +58 |
| PyTorch & Meta | pytorch/audio | 2,839 | +2 | +5 | +12 |
| PyTorch & Meta | pytorch/pytorch | 98,206 | -21 | +207 | +869 |
| PyTorch & Meta | pytorch/torchtitan | 5,139 | +6 | +29 | +77 |
| PyTorch & Meta | pytorch/vision | 17,564 | +3 | +20 | +57 |
| RL & Post-Training | THUDM/slime | 4,740 | +27 | +145 | +997 |
| RL & Post-Training | radixark/miles | 974 | +2 | +24 | +112 |
| RL & Post-Training | volcengine/verl | 19,878 | +23 | +205 | +727 |