Technical Intelligence Analyst Report Date: 2026-02-19

Executive Summary

  • Ex-AMD Talent Launches “Hard Coded” AI Chip: Startup Taalas, led by former AMD and Tenstorrent architects, has exited stealth with a ROM-based inference chip that embeds model weights directly into transistors. The HC1 chip claims significantly lower latency and power compared to NVIDIA B200 GPUs.
  • Major Linux IO_uring Performance Fix: Linux block maintainer Jens Axboe, using AI assistance, identified an excessive ppoll() sleep that yields a 50-80x performance improvement for io_uring on idle systems. The one-line fix is staged for QEMU and benefits the broader Linux ecosystem, including AMD EPYC/Ryzen deployments.
  • Linux 7.0 Tooling Updates: The turbostat utility has added new L2 cache statistics (L2MRPS and L2%hit), though initial support is limited to recent Intel architectures, highlighting a temporary gap in open-source telemetry for AMD parts in this specific utility.

🤖 ROCm Updates & Software

[2026-02-19] AI Helped Uncover A “50-80x Improvement” For Linux’s IO_uring

Source: Phoronix

Key takeaway relevant to AMD:

  • General Linux Optimization: This massive performance uplift in io_uring benefits all x86_64 systems running modern Linux kernels, including AMD EPYC server clusters handling high-throughput I/O (databases, virtualization).
  • QEMU Impact: As the patches are staged for QEMU, AMD-based virtualization environments will see reduced latency on block devices.

Summary:

  • Linux block maintainer Jens Axboe used Claude AI to debug performance regressions in AHCI/SCSI code.
  • A one-line patch resulted in a 50-80x improvement for io_uring on idle systems.
  • The fix addresses an issue where ppoll() was sleeping excessively before submitting I/O.

Details:

  • The Bug: On otherwise-idle systems with I/O queued for submission, ppoll() was found to sleep for up to 499 ms (~500 ms) unnecessarily before submitting.
  • The Fix: A single-line code change prevents this sleep behavior, ensuring immediate submission.
  • Performance Metric:
    • Previous behavior: AHCI devices would randomly time out, and specific regression tests ran far longer than expected.
    • New behavior: a 50-80x improvement in throughput/latency for the affected idle scenarios.
    • Comparison: virtio-blk and NVMe devices previously finished the same tests in ~1 second; AHCI was the bottleneck.
  • AI Involvement: Claude AI was used to analyze the reproducer code and identify the event loop flaws, leading to the patch.
  • Status: Patches are currently staged for inclusion in QEMU.
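The logic of the fix can be illustrated with a minimal sketch. All names here are hypothetical and this is not the actual QEMU code; the real one-line change lives in QEMU's event loop. The idea: an event loop should poll with a zero timeout whenever I/O is already queued for submission, instead of falling back to its default sleep.

```python
# Illustrative sketch of the timeout bug, not the actual QEMU patch.
# Names (poll_timeout_ms, pending_submissions, DEFAULT_TIMEOUT_MS)
# are hypothetical.

DEFAULT_TIMEOUT_MS = 500  # worst-case observed sleep was ~499 ms


def poll_timeout_ms(pending_submissions: int) -> int:
    """Pick the ppoll()-style timeout for one event-loop iteration.

    Buggy behavior: always sleep the default timeout, even with I/O
    queued, so each submission could stall for up to ~500 ms.
    Fixed behavior (the one-line change): poll with a zero timeout
    when work is pending, so queued I/O is submitted immediately.
    """
    if pending_submissions > 0:
        return 0  # don't sleep; submit the queued I/O now
    return DEFAULT_TIMEOUT_MS


# Rough intuition for the reported 50-80x figure (an assumption, not
# from the source): replacing a ~500 ms stall per submission with an
# iteration on the order of 6-10 ms gives roughly a 50-80x speedup.
```

The asymmetry is the whole fix: sleeping is only correct when the loop genuinely has nothing to do.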

[2026-02-19] Turbostat With Linux 7.0 Can Report New L2 Cache Statistics

Source: Phoronix

Key takeaway relevant to AMD:

  • Telemetry Gap: While turbostat is a standard tool for both AMD and Intel, these specific new L2 metrics rely on Intel-specific performance counters. AMD engineers may need to contribute equivalent patches to ensure parity in cache profiling on Linux 7.0.

Summary:

  • The turbostat utility in the Linux 7.0 kernel source tree has been updated.
  • New functionality allows reporting of L2 cache references and hit rates.
  • Support is currently limited to recent Intel hybrid and server processors.

Details:

  • New Metrics:
    • L2MRPS: L2 cache references, in millions per second.
    • L2%hit: L2 cache hit rate, as a percentage.
  • Hardware Support: The update relies on specific L2 perf counters found in:
    • Intel Xeon Sapphire Rapids and newer.
    • Intel Atom Gracemont and newer.
    • Intel Alder Lake and newer hybrid CPUs.
  • Implementation: These changes were merged during the Linux 7.0 merge window.
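As a sketch of what the new columns report (the formulas below are inferred from the metric names, not taken from the turbostat source): L2MRPS scales the L2 reference count to millions per second over the sampling interval, and L2%hit is the hit fraction of those references.

```python
def l2_stats(refs: int, hits: int, interval_s: float) -> tuple[float, float]:
    """Compute turbostat-style L2 columns from raw counter deltas.

    refs/hits are assumed perf-counter deltas over the sampling
    interval; the counter plumbing itself is Intel-specific in the
    current turbostat code, so these inputs are illustrative.
    """
    l2_mrps = refs / interval_s / 1e6              # million refs/sec
    l2_pct_hit = 100.0 * hits / refs if refs else 0.0
    return l2_mrps, l2_pct_hit


# Example: 250M references with 200M hits over a 5-second interval
# -> 50.0 MRPS at an 80.0% hit rate.
mrps, pct_hit = l2_stats(refs=250_000_000, hits=200_000_000, interval_s=5.0)
```

Equivalent counters exist on AMD parts; wiring them up in turbostat is the parity work the takeaway above refers to.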

🤼‍♂️ Market & Competitors

[2026-02-19] Taalas Etches AI Models Onto Transistors To Rocket Boost Inference

Source: The Next Platform

Key takeaway relevant to AMD:

  • Founding Team Heritage: Taalas was founded by Ljubisa Bajic (former AMD Director of IC Design), Lejla Bajic (former AMD Sr. Manager Systems Engineering), and Drago Ignjatovic (former AMD Director of ASIC Design).
  • Competitive Threat: This approach removes the memory-bandwidth bottleneck common in GPUs (including AMD Instinct MI series) by hard-coding weights, potentially undercutting AMD/Nvidia on cost and latency for fixed-model inference.

Summary:

  • Toronto-based startup Taalas has exited stealth with $200M in funding.
  • They unveiled “Hard Coded Inference” (HC) chips where model weights are encoded into the transistor structure (Mask ROM).
  • The approach eliminates the memory wall and external DRAM/HBM requirements.
  • Taalas claims that deploying a model as a custom chip costs roughly 100x less than training the model itself.

Details:

  • Architecture (HC1):
    • Process: TSMC 6nm (N6).
    • Die Size: 815 mm² (near reticle limit).
    • Transistor Count: 53 Billion.
    • Power: ~200 Watts per card.
    • Density: Currently 8 billion parameters per chip (hard-wired).
  • Technology:
    • Uses a “mask ROM recall fabric” paired with SRAM.
    • One-transistor efficiency: a single transistor stores 4 bits and performs the multiplication associated with those bits.
    • Re-spin Time: Changing the model requires changing two metal layers; turnaround time is 2 months via TSMC.
  • Performance Claims:
    • Throughput/Latency: Charts indicate significantly lower latency and higher tokens/sec compared to NVIDIA Blackwell B200 and AMD/Intel counterparts.
    • Use Case: Ideal for stabilized “frontier” models (e.g., Llama 3.1) where model weights do not change frequently.
  • Roadmap (HC2):
    • Targeting a summer release.
    • Support for 20 billion parameters per chip.
    • Pipeline-parallelism support to span larger models (e.g., DeepSeek) across multiple cards over PCIe; a high-bandwidth interconnect is unnecessary because per-chip latency is low.
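The stated density figures can be cross-checked with back-of-the-envelope arithmetic. This assumes one 4-bit weight per parameter, consistent with the 4-bits-per-transistor claim above; the split between weight-storage transistors and everything else on the die is our assumption, not a Taalas figure.

```python
# Back-of-the-envelope check of the HC1 density claims.
# Assumption: each parameter is one 4-bit weight, i.e. one mask-ROM
# transistor per parameter (per the 4-bits-per-transistor claim).

HC1_PARAMS = 8_000_000_000          # hard-wired parameters per HC1 chip
BITS_PER_TRANSISTOR = 4
HC1_TRANSISTORS = 53_000_000_000    # total die transistor count

weight_bits = HC1_PARAMS * BITS_PER_TRANSISTOR           # 32 Gbit of weights
weight_transistors = weight_bits // BITS_PER_TRANSISTOR  # 8B transistors

# Under this simple model, weight storage would occupy only ~15% of the
# transistor budget; the rest covers compute, SRAM, and I/O.
weight_fraction = weight_transistors / HC1_TRANSISTORS
```

The same arithmetic suggests why HC2's jump to 20 billion parameters per chip is plausible on a denser process or a larger share of the die devoted to the recall fabric.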