Technical Intelligence Analyst Report Date: 2026-02-19

Executive Summary

  • Ex-AMD Talent Launches “Hard Coded” AI Chip: Startup Taalas, led by former AMD and Tenstorrent architects, has exited stealth with a ROM-based inference chip that embeds model weights directly into transistors. The HC1 chip claims significantly lower latency and power compared to NVIDIA B200 GPUs.
  • Major Linux IO_uring Performance Fix: Linux block maintainer Jens Axboe, using AI assistance, identified an excessive ppoll() sleep that yields a 50-80x performance improvement for io_uring on idle systems. The one-line fix is staged for QEMU and benefits the broader Linux ecosystem, including AMD EPYC/Ryzen deployments.
  • Linux 7.0 Tooling Updates: The turbostat utility has added new L2 cache statistics (L2MRPS and L2%hit), though initial support is limited to recent Intel architectures, highlighting a temporary gap in open-source telemetry for AMD parts in this specific utility.

🤖 ROCm Updates & Software

[2026-02-19] AI Helped Uncover A “50-80x Improvement” For Linux’s IO_uring

Source: Phoronix

Key takeaway relevant to AMD:

  • General Linux Optimization: This massive performance uplift in io_uring benefits all x86_64 systems running modern Linux kernels, including AMD EPYC server clusters handling high-throughput I/O (databases, virtualization).
  • QEMU Impact: As the patches are staged for QEMU, AMD-based virtualization environments will see reduced latency on block devices.

Summary:

  • Linux block maintainer Jens Axboe used Claude AI to debug performance regressions in AHCI/SCSI code.
  • A one-line patch resulted in a 50-80x improvement for io_uring on idle systems.
  • The fix addresses an issue where ppoll() was sleeping excessively before submitting I/O.

Details:

  • The Bug: On otherwise-idle systems with I/O queued for submission, ppoll() was found to sleep for up to 499 ms (~500 ms) unnecessarily before submitting.
  • The Fix: A single-line code change prevents this sleep behavior, ensuring immediate submission.
  • Performance Metric:
    • Previous behavior: AHCI devices would randomly time out, and specific regression tests ran far longer than expected.
    • New behavior: a 50-80x improvement in throughput/latency for the affected idle scenarios.
    • Comparison: virtio-blk and NVMe devices previously finished the same tests in ~1 second; AHCI was the bottleneck.
  • AI Involvement: Claude AI was used to analyze the reproducer code and identify the event loop flaws, leading to the patch.
  • Status: Patches are currently staged for inclusion in QEMU.
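The logic of the fix can be illustrated with a minimal sketch. All names here are hypothetical and this is not the actual QEMU code; the real one-line change lives in QEMU's event loop. The idea: an event loop should poll with a zero timeout whenever I/O is already queued for submission, instead of falling back to its default sleep.

```python
# Illustrative sketch of the timeout bug, not the actual QEMU patch.
# Names (poll_timeout_ms, pending_submissions, DEFAULT_TIMEOUT_MS)
# are hypothetical.

DEFAULT_TIMEOUT_MS = 500  # worst-case observed sleep was ~499 ms


def poll_timeout_ms(pending_submissions: int) -> int:
    """Pick the ppoll()-style timeout for one event-loop iteration.

    Buggy behavior: always sleep the default timeout, even with I/O
    queued, so each submission could stall for up to ~500 ms.
    Fixed behavior (the one-line change): poll with a zero timeout
    when work is pending, so queued I/O is submitted immediately.
    """
    if pending_submissions > 0:
        return 0  # don't sleep; submit the queued I/O now
    return DEFAULT_TIMEOUT_MS


# Rough intuition for the reported 50-80x figure (an assumption, not
# from the source): replacing a ~500 ms stall per submission with an
# iteration on the order of 6-10 ms gives roughly a 50-80x speedup.
```

The asymmetry is the whole fix: sleeping is only correct when the loop genuinely has nothing to do.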

[2026-02-19] Turbostat With Linux 7.0 Can Report New L2 Cache Statistics

Source: Phoronix

Key takeaway relevant to AMD:

  • Telemetry Gap: While turbostat is a standard tool for both AMD and Intel, these specific new L2 metrics rely on Intel-specific performance counters. AMD engineers may need to contribute equivalent patches to ensure parity in cache profiling on Linux 7.0.

Summary:

  • The turbostat utility in the Linux 7.0 kernel source tree has been updated.
  • New functionality allows reporting of L2 cache references and hit rates.
  • Support is currently limited to recent Intel hybrid and server processors.

Details:

  • New Metrics:
    • L2MRPS: L2 cache references, in millions per second.
    • L2%hit: L2 cache hit rate, as a percentage.
  • Hardware Support: The update relies on specific L2 perf counters found in:
    • Intel Xeon Sapphire Rapids and newer.
    • Intel Atom Gracemont and newer.
    • Intel Alder Lake and newer hybrid CPUs.
  • Implementation: These changes were merged during the Linux 7.0 merge window.
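As a sketch of what the new columns report (the formulas below are inferred from the metric names, not taken from the turbostat source): L2MRPS scales the L2 reference count to millions per second over the sampling interval, and L2%hit is the hit fraction of those references.

```python
def l2_stats(refs: int, hits: int, interval_s: float) -> tuple[float, float]:
    """Compute turbostat-style L2 columns from raw counter deltas.

    refs/hits are assumed perf-counter deltas over the sampling
    interval; the counter plumbing itself is Intel-specific in the
    current turbostat code, so these inputs are illustrative.
    """
    l2_mrps = refs / interval_s / 1e6              # million refs/sec
    l2_pct_hit = 100.0 * hits / refs if refs else 0.0
    return l2_mrps, l2_pct_hit


# Example: 250M references with 200M hits over a 5-second interval
# -> 50.0 MRPS at an 80.0% hit rate.
mrps, pct_hit = l2_stats(refs=250_000_000, hits=200_000_000, interval_s=5.0)
```

Equivalent counters exist on AMD parts; wiring them up in turbostat is the parity work the takeaway above refers to.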

🤼‍♂️ Market & Competitors

[2026-02-19] Taalas Etches AI Models Onto Transistors To Rocket Boost Inference

Source: The Next Platform

Key takeaway relevant to AMD:

  • Founding Team Heritage: Taalas was founded by Ljubisa Bajic (former AMD Director of IC Design), Lejla Bajic (former AMD Sr. Manager Systems Engineering), and Drago Ignjatovic (former AMD Director of ASIC Design).
  • Competitive Threat: This approach removes the memory-bandwidth bottleneck common in GPUs (including AMD Instinct MI series) by hard-coding weights, potentially undercutting AMD/Nvidia on cost and latency for fixed-model inference.

Summary:

  • Toronto-based startup Taalas has exited stealth with $200M in funding.
  • They unveiled “Hard Coded Inference” (HC) chips where model weights are encoded into the transistor structure (Mask ROM).
  • The approach eliminates the memory wall and external DRAM/HBM requirements.
  • Taalas claims that deploying a model as a custom chip costs roughly 100x less than training the model itself.

Details:

  • Architecture (HC1):
    • Process: TSMC 6nm (N6).
    • Die Size: 815 mm² (near reticle limit).
    • Transistor Count: 53 Billion.
    • Power: ~200 Watts per card.
    • Density: Currently 8 billion parameters per chip (hard-wired).
  • Technology:
    • Uses a “mask ROM recall fabric” paired with SRAM.
    • One-transistor efficiency: a single transistor stores 4 bits and performs the multiplication associated with those bits.
    • Re-spin Time: Changing the model requires changing two metal layers; turnaround time is 2 months via TSMC.
  • Performance Claims:
    • Throughput/Latency: Charts indicate significantly lower latency and higher tokens/sec compared to NVIDIA Blackwell B200 and AMD/Intel counterparts.
    • Use Case: Ideal for stabilized “frontier” models (e.g., Llama 3.1) where model weights do not change frequently.
  • Roadmap (HC2):
    • Targeting a summer release.
    • Support for 20 billion parameters per chip.
    • Pipeline-parallelism support to span larger models (e.g., DeepSeek) across multiple cards over PCIe; a high-bandwidth interconnect is unnecessary because per-chip latency is low.
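The stated density figures can be cross-checked with back-of-the-envelope arithmetic. This assumes one 4-bit weight per parameter, consistent with the 4-bits-per-transistor claim above; the split between weight-storage transistors and everything else on the die is our assumption, not a Taalas figure.

```python
# Back-of-the-envelope check of the HC1 density claims.
# Assumption: each parameter is one 4-bit weight, i.e. one mask-ROM
# transistor per parameter (per the 4-bits-per-transistor claim).

HC1_PARAMS = 8_000_000_000          # hard-wired parameters per HC1 chip
BITS_PER_TRANSISTOR = 4
HC1_TRANSISTORS = 53_000_000_000    # total die transistor count

weight_bits = HC1_PARAMS * BITS_PER_TRANSISTOR           # 32 Gbit of weights
weight_transistors = weight_bits // BITS_PER_TRANSISTOR  # 8B transistors

# Under this simple model, weight storage would occupy only ~15% of the
# transistor budget; the rest covers compute, SRAM, and I/O.
weight_fraction = weight_transistors / HC1_TRANSISTORS
```

The same arithmetic suggests why HC2's jump to 20 billion parameters per chip is plausible on a denser process or a larger share of the die devoted to the recall fabric.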