Executive Summary

  • AI Workbench & LLM Autoscaling: AMD has released detailed guidance on implementing Kubernetes-driven autoscaling for LLM inference (like Mixtral 8x7B) on Instinct MI300X using the AI Workbench and AIM Engine.
  • ROCm Deep Dive Tutorials: A new developer guide thoroughly explains the TensorDescriptor abstraction in AMD’s Composable Kernel (CK), demystifying physical memory mappings to help developers write highly optimized custom GPU kernels.
  • Robotics & Reinforcement Learning: AMD demonstrated end-to-end reinforcement-learning training with JAX and MuJoCo on ROCm 7.2 using consumer Radeon RX 7900 XTX hardware, a sign of strengthening open-source ecosystem support.
  • Nvidia DLSS 4.5 Expands: Nvidia has rolled out its DLSS 4.5 Dynamic Multi Frame Generation in beta. The 5X and 6X multiplier modes deliver massive FPS scaling for RTX 50-series cards with no discernible penalty to input latency, increasing pressure on AMD’s FSR.
  • Strategic Interconnect Partnerships: Nvidia’s $2 billion investment in Marvell targets the proliferation of its proprietary NVLink Fusion protocol into custom XPUs and third-party scale-up networking hardware, representing a direct counteroffensive to the UALink consortium.

🤖 ROCm Updates & Software

[2026-03-31] Leveraging AMD AI Workbench to Scale LLM Inference for Optimal Resource Utilization

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

  • AMD is significantly lowering the barrier to entry for enterprise scale-out inference deployments by abstracting complex Kubernetes scaling mechanisms into a low-code UI for AMD Instinct hardware.

Summary:

  • An instructional blog detailing how to configure Kubernetes-based autoscaling for AMD Inference Microservices (AIMs) using the AMD AI Workbench and AIM Engine.
  • Demonstrates scaling behaviors using the Mixtral 8x7B Instruct v0.1 model deployed on AMD Instinct MI300X GPUs.

Details:

  • Backend Stack: Utilizes Kubernetes Event-Driven Autoscaling (KEDA) and OpenTelemetry to automatically adjust computing resources based on real-time traffic.
  • Scaling Metrics: Developers can scale based on “Running requests” (currently processing) or “Waiting requests” (queued requests, an early indicator of saturation).
  • Aggregation Methods: Supports “Average” (stable scaling), “Maximum” (reacts to peak load on any single pod), and “Minimum” (conservative, scales only when all pods are busy).
  • Replica Config: Allows sliding ranges from 1 up to 30 replicas, scaling dynamically when metric thresholds (default target value of 10) are exceeded.
  • Under the Hood: The open-source AIM Engine Kubernetes operator manages CRDs (AIMService, AIMModel) to handle hardware-aware scheduling, model caching, and the underlying KServe InferenceService.
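
The scaling rules above can be modeled in a few lines of Python. This is an illustrative sketch only (the function name and structure are hypothetical, not AIM Engine or KEDA source); it mirrors the standard Kubernetes HPA proportional rule with the aggregation methods and 1–30 replica bounds described above:

```python
import math

def desired_replicas(metric_per_pod, current_replicas, target=10.0,
                     method="Average", min_replicas=1, max_replicas=30):
    """Sketch of a KEDA-style scaling decision for an AIM deployment.

    metric_per_pod: per-pod values of the chosen metric
    (e.g. running or waiting requests)."""
    if method == "Average":          # stable scaling on the mean load
        value = sum(metric_per_pod) / len(metric_per_pod)
    elif method == "Maximum":        # react to the single busiest pod
        value = max(metric_per_pod)
    else:                            # "Minimum": scale only when all pods are busy
        value = min(metric_per_pod)
    # HPA-style proportional rule: replicas grow with metric / target.
    desired = math.ceil(current_replicas * value / target)
    return max(min_replicas, min(max_replicas, desired))
```

With the default target of 10, three pods averaging 10 running requests stay at three replicas, while a single pod spiking to 30 under the "Maximum" method triples the fleet.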

[2026-03-31] AMD GPU Programming From Beginner to Expert, Part 1

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

  • Providing low-level programming documentation like this is critical for attracting HPC/AI developers to the AMD ecosystem, enabling them to build highly optimized custom GPU kernels using Composable Kernel (CK).

Summary:

  • Part 1 of a new tutorial series demystifying low-level AMD GPU kernel programming, focusing specifically on the TensorDescriptor in the Composable Kernel (CK) library.
  • Explains the mathematical and logical mechanisms used to map multi-dimensional array coordinates to physical linear memory offsets.

Details:

  • TensorDescriptor Abstraction: Uses a tree structure composed of multi-level coordinates and Transforms to manage multi-dimensional data layouts.
  • Transform Structs: Introduces specific operations (e.g., Embed, Unmerge, PassThrough) and the CalculateLowerIndex method, which recursively maps logical coordinates layer by layer down to physical 1D storage.
  • Indexing Hierarchy: Defines the system using “Upper dimension id” (above the transform), “Lower dimension id” (below the transform), and “Visible dimension id” (user-facing).
  • Code Implementation: Provides complete C++ code demonstrating how to instantiate Transforms (make_unmerge_transform, PassThrough), calculate physical offsets, and generate tensor coordinate shapes natively on ROCm.
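
The coordinate math behind the Unmerge transform can be sketched in Python. This is an illustrative model of what a CalculateLowerIndex step does for an Unmerge (folding a multi-dimensional upper coordinate into a single linear lower index), assuming packed row-major layout; it is not CK's actual C++ implementation:

```python
def unmerge_lower_index(upper_coord, lengths):
    """Map a multi-dimensional upper coordinate to its 1D lower index,
    assuming a packed row-major layout (illustrative, not CK source)."""
    idx = 0
    for coord, length in zip(upper_coord, lengths):
        assert 0 <= coord < length, "coordinate out of bounds"
        idx = idx * length + coord   # Horner-style fold over the dimensions
    return idx
```

For a 4x5x6 tensor, coordinate (1, 2, 3) lands at offset ((1*5)+2)*6+3 = 45; chaining such transforms layer by layer is exactly how TensorDescriptor walks logical coordinates down to physical storage.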

[2026-03-31] Training a Robotic Arm using MuJoCo and JAX on AMD Hardware with ROCm

Source: ROCm Tech Blog

Key takeaway relevant to AMD:

  • Demonstrates strong, production-ready compatibility between AMD’s consumer/workstation GPUs (Radeon RX 7900 XTX) and premier AI/physics frameworks (JAX and DeepMind’s MuJoCo), expanding AMD’s footprint in robotics research.

Summary:

  • A comprehensive guide to training an RL-based pick-and-place robotic arm (UFactory X-Arm 7) in the MuJoCo simulator using JAX on AMD hardware via ROCm 7.2.
  • Covers the entire pipeline from environment setup and mesh decimation to reward shaping and simulated domain randomization.

Details:

  • Hardware / Software: Validated on Ubuntu 24.04 using an AMD Radeon RX 7900 XTX and ROCm 7.2.
  • ROCm Configuration: Requires critical environment variables to function correctly inside the JAX Docker container: HIP_DEVICE_LIB_PATH (pointing to bitcode libraries), MUJOCO_GL=osmesa (to avoid EGL/GPU display errors on headless setups), and XLA_FLAGS="--xla_gpu_enable_command_buffer=" (to disable XLA command buffers).
  • Optimization Techniques: Includes a Python script utilizing Open3D to decimate high-polygon STL meshes (targeting 500 faces) to prevent coplanar face warnings in MJX and drastically accelerate collision detection during training.
  • MuJoCo XML Configurations: Modifies the X-Arm 7 XML to cap maxhullvert="20" to reduce convex hull complexity, ensuring higher simulation throughput during massive parallel RL batch processing.
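
The environment variables above must be in place before JAX initializes, since XLA reads them at startup. A minimal sketch of that setup (the bitcode path shown is a common default inside ROCm containers, not a verified constant; adjust it to your installation):

```python
import os

# Headless rendering backend for MuJoCo, avoiding EGL/GPU display errors.
os.environ["MUJOCO_GL"] = "osmesa"
# Disable XLA command buffers on ROCm (empty value switches the feature off).
os.environ["XLA_FLAGS"] = "--xla_gpu_enable_command_buffer="
# Point HIP at the device bitcode libraries; path is an example default.
os.environ.setdefault("HIP_DEVICE_LIB_PATH", "/opt/rocm/amdgcn/bitcode")

# import jax  # import JAX only after the environment is configured
```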

🤼‍♂️ Market & Competitors

[2026-03-31] We go hands-on with Nvidia’s DLSS 4.5 Dynamic Multi Frame Generation and its 5X and 6X multipliers…

Source: Tom’s Hardware (GPUs)

Key takeaway relevant to AMD:

  • Nvidia is successfully scaling frame generation to 6X without latency penalties on new Blackwell hardware, setting an incredibly high target for AMD’s FSR frame generation efforts in high-refresh-rate gaming.

Summary:

  • A hands-on review of Nvidia’s new DLSS 4.5 Dynamic Multi Frame Generation (MFG), available in beta exclusively for RTX 50-series graphics cards.
  • The technology introduces dynamic 5X and 6X multipliers that automatically shift based on the target monitor refresh rate to maximize visual fluidity.

Details:

  • Latency Testing: Remarkable latency retention. At a 6X multiplier, input latency was measured at 52.6ms, slightly better than the 53.2ms recorded at 4X. Baseline native was 35.0ms.
  • Performance Scaling (Cyberpunk 2077 / RTX 5080): 4K Native: 60.0 FPS; 2X MFG: 103.1 FPS; 4X MFG: 178.8 FPS; 6X MFG: 247.7 FPS. Nvidia achieves 240+ FPS path-traced gaming by stretching a single native frame into multiple generated frames.
  • Image Quality: Artifacts are generally minimal. Reviewers noted slightly noisy player shadows and ghosting on UI elements and coat edges in Clair Obscur: Expedition 33, plus slight screen-edge artifacting in Hogwarts Legacy, though these were largely unnoticeable in motion.
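
A quick derived calculation on the Cyberpunk 2077 / RTX 5080 figures quoted above (illustrative arithmetic, not from the article) shows how much of each nominal multiplier is actually realized versus native rendering:

```python
# MFG multiplier -> measured FPS, per the Cyberpunk 2077 / RTX 5080 figures.
native_fps = 60.0
measured = {2: 103.1, 4: 178.8, 6: 247.7}

for mult, fps in measured.items():
    effective = fps / native_fps    # real speedup over native rendering
    efficiency = effective / mult   # fraction of the nominal multiplier realized
    print(f"{mult}X MFG: {effective:.2f}x effective speedup "
          f"({efficiency:.0%} of nominal)")
```

The effective speedup at 6X works out to roughly 4.1x over native, a reminder that the per-frame generation overhead eats into the headline multiplier as it scales.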

[2026-03-31] Nvidia DLSS 4.5 Dynamic MFG plus 5X and 6X framegen modes enter beta for RTX 50-series users

Source: Tom’s Hardware (GPUs)

Key takeaway relevant to AMD:

  • Nvidia’s software stack is evolving rapidly to eliminate previous AI-upscaling weaknesses (like UI flicker). Coupled with Intel’s recent XeSS 3 update, AMD faces intense competitive pressure to iterate rapidly on FSR.

Summary:

  • Nvidia has launched the second half of its DLSS 4.5 feature set via the Nvidia App beta, featuring the dynamic frame-generation tech and a new “Preset B” UI-aware model.
  • Marks the final update to the DLSS 4 stack before Nvidia transitions development exclusively to the heavily anticipated DLSS 5.

Details:

  • Preset B Model: A new enhanced frame generation model that utilizes “additional UI buffers” from game engines to significantly reduce artifacts, such as crosshair flickering. This specific feature is available to both RTX 40 and 50-series users.
  • Dynamic Matching: Automatically downshifts the multi-frame generation multiplier (e.g., drops from 6X to 4X) if the generated output exceeds the display’s maximum refresh rate (e.g., 240Hz/360Hz), preventing tearing.
  • Competitive Landscape: Article notes Intel is also aggressively pushing frame generation, having recently launched XeSS 3 MFG with 3X and 4X capabilities for Arc and Core Ultra platforms.
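
The dynamic-matching behavior described above amounts to choosing the largest multiplier whose output stays within the display's refresh rate. A hypothetical sketch of that selection logic (illustrative only, not Nvidia's implementation):

```python
def pick_multiplier(base_fps, refresh_hz, multipliers=(2, 3, 4, 5, 6)):
    """Pick the largest frame-gen multiplier whose generated output
    would not exceed the display refresh rate (sketch of the idea)."""
    for m in sorted(multipliers, reverse=True):
        if base_fps * m <= refresh_hz:
            return m
    return 1  # fall back to no frame generation
```

This reproduces the article's example: at 60 FPS base on a 240Hz panel, 6X would produce 360 FPS, so the mode downshifts to 4X to prevent tearing.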

[2026-03-31] Nvidia invests $2 billion in Marvell to push NVLink Fusion into custom XPUs

Source: The Next Platform

Key takeaway relevant to AMD:

  • Nvidia’s strategic $2 billion injection into Marvell is a direct attack on AMD’s championing of the UALink open-standard interconnect, aiming to lock custom hyperscaler silicon into the proprietary NVLink ecosystem.

Summary:

  • Nvidia is investing $2 billion into chip maker Marvell to secure mass production and compatibility for NVLink Fusion ports on custom third-party XPUs and scale-up networking hardware.
  • This deal echoes recent $2 billion Nvidia investments into Lumentum and Coherent to secure co-packaged optics (CPO) supply chains.

Details:

  • Hardware Synergy: Marvell will offer “custom XPUs and NVLink Fusion-compatible scale-up networking”. This is largely driven by Marvell’s acquisition of XConn, which produces the PCIe 6.0 Structera S 60260 switch (260 lanes, 2.1 TB/sec bandwidth).
  • Cloud Implications: Marvell is the primary silicon packager for AWS custom silicon. AWS’s upcoming Trainium 4 XPU is confirmed to support both UALink and NVLink protocols.
  • Celestial AI Integration: Marvell recently acquired Celestial AI for $3.25B, opening the door for Nvidia’s NVLink protocol to run natively on top of Celestial’s “Photonic Fabric” for row-scale optical coherent memory.
  • Future Speculation: Analysts suggest this pressures Broadcom (the primary manufacturer for Google TPUs and Meta MTIA) to eventually cut a similar cross-licensing deal with Nvidia for NVLink compatibility.