News: 2026-02-19
February 19, 2026 · Generated 08:29 AM PT
Technical Intelligence Analyst Report
Executive Summary
- Ex-AMD Talent Launches “Hard Coded” AI Chip: Startup Taalas, led by former AMD and Tenstorrent architects, has exited stealth with a ROM-based inference chip that embeds model weights directly into transistors. The HC1 chip claims significantly lower latency and power compared to NVIDIA B200 GPUs.
- Major Linux IO_uring Performance Fix: Linux maintainer Jens Axboe, utilizing AI assistance, identified a sleeping issue in `io_uring` that yields a 50-80x performance improvement for idle systems. This patch is staged for QEMU and affects the general Linux ecosystem, including AMD EPYC/Ryzen deployments.
- Linux 7.0 Tooling Updates: The `turbostat` utility has added new L2 cache statistics (`L2MRPS` and `L2%hit`), though initial support is limited to recent Intel architectures, highlighting a temporary gap in open-source telemetry for AMD parts in this specific utility.
🤖 ROCm Updates & Software
[2026-02-19] AI Helped Uncover A “50-80x Improvement” For Linux’s IO_uring
Source: Phoronix
Key takeaway relevant to AMD:
- General Linux Optimization: This massive performance uplift in `io_uring` benefits all x86_64 systems running modern Linux kernels, including AMD EPYC server clusters handling high-throughput I/O (databases, virtualization).
- QEMU Impact: As the patches are staged for QEMU, AMD-based virtualization environments will see reduced latency on block devices.
Summary:
- Linux block maintainer Jens Axboe used Claude AI to debug performance regressions in AHCI/SCSI code.
- A one-line patch resulted in a 50-80x improvement for `io_uring` on idle systems.
- The fix addresses an issue where `ppoll()` was sleeping excessively before submitting I/O.
Details:
- The Bug: In idle systems with I/O to submit, `ppoll()` was found to sleep for up to 499ms (approximately 500ms) unnecessarily.
- The Fix: A single-line code change prevents this sleep behavior, ensuring immediate submission.
- Performance Metric:
- Previous behavior: AHCI devices would randomly timeout; specific regression tests took significant time.
- New behavior: 50-80x improvement in throughput/latency for affected idle scenarios.
- Comparison: Virtio-blk or NVMe devices previously finished in ~1 second; AHCI was the bottleneck.
- AI Involvement: Claude AI was used to analyze the reproducer code and identify the event loop flaws, leading to the patch.
- Status: Patches are currently staged for inclusion in QEMU.
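The submit-before-sleep logic above can be sketched in miniature. The following is an illustrative Python model of the event-loop flaw, not the actual QEMU/kernel patch; `run_loop_buggy`, `run_loop_fixed`, and the timeout constant are hypothetical stand-ins for the real `ppoll()` path.

```python
import time

TIMEOUT_MS = 499  # the sleep duration the real ppoll() path was hitting

def run_loop_buggy(pending):
    """Sleeps before checking the submission queue, as the flawed
    event loop did, so queued I/O waits ~500 ms for no reason."""
    time.sleep(TIMEOUT_MS / 1000)
    return [f"submitted:{req}" for req in pending]

def run_loop_fixed(pending):
    """The spirit of the one-line fix: only block when there is
    truly nothing to submit; otherwise submit immediately."""
    if not pending:
        time.sleep(TIMEOUT_MS / 1000)
    return [f"submitted:{req}" for req in pending]
```

With one request queued, the buggy loop pays the full timeout before submitting, while the fixed loop returns almost immediately; only a truly idle loop (empty queue) still blocks.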
[2026-02-19] Turbostat With Linux 7.0 Can Report New L2 Cache Statistics
Source: Phoronix
Key takeaway relevant to AMD:
- Telemetry Gap: While `turbostat` is a standard tool for both AMD and Intel, these specific new L2 metrics rely on Intel-specific performance counters. AMD engineers may need to contribute equivalent patches to ensure parity in cache profiling on Linux 7.0.
Summary:
- The `turbostat` utility in the Linux 7.0 kernel source tree has been updated.
- New functionality allows reporting of L2 cache references and hit rates.
- Support is currently limited to recent Intel hybrid and server processors.
Details:
- New Metrics:
  - `L2MRPS`: L2 cache M-references per second.
  - `L2%hit`: L2 cache hit rate percentage.
- Hardware Support: The update relies on specific L2 perf counters found in:
- Intel Xeon Sapphire Rapids and newer.
- Intel Atom Gracemont and newer.
- Intel Alder Lake and newer hybrid CPUs.
- Implementation: These changes were merged during the Linux 7.0 merge window.
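Reading `L2MRPS` as millions of L2 references per second (an assumption; the source only expands it to "M-References Per Second"), both derived columns reduce to simple arithmetic over raw counter deltas. The `l2_stats` helper below is a hypothetical illustration of that math, not turbostat code.

```python
def l2_stats(references, hits, interval_s):
    """Derive turbostat-style L2 columns from raw event counts taken
    over an interval of `interval_s` seconds. The underlying perf
    counters are Intel-specific on Linux 7.0; this only shows the math.
    """
    mrps = references / interval_s / 1e6           # millions of refs/sec
    hit_pct = 100.0 * hits / references if references else 0.0
    return {"L2MRPS": round(mrps, 2), "L2%hit": round(hit_pct, 2)}
```

For example, 250M references with 225M hits over a one-second interval yields `L2MRPS` = 250.0 and `L2%hit` = 90.0.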
🤼‍♂️ Market & Competitors
[2026-02-19] Taalas Etches AI Models Onto Transistors To Rocket Boost Inference
Source: The Next Platform
Key takeaway relevant to AMD:
- Founding Team Heritage: Taalas was founded by Ljubisa Bajic (former AMD Director of IC Design), Lejla Bajic (former AMD Sr. Manager Systems Engineering), and Drago Ignjatovic (former AMD Director of ASIC Design).
- Competitive Threat: This approach removes the memory-bandwidth bottleneck common in GPUs (including AMD Instinct MI series) by hard-coding weights, potentially undercutting AMD/Nvidia on cost and latency for fixed-model inference.
Summary:
- Startup Taalas (Toronto-based) exited stealth with $200M in funding.
- They unveiled “Hard Coded Inference” (HC) chips where model weights are encoded into the transistor structure (Mask ROM).
- The approach eliminates the memory wall and external DRAM/HBM requirements.
- Taalas claims that deploying a model as a custom chip costs 100x less than training the model itself.
Details:
- Architecture (HC1):
- Process: TSMC 6nm (N6).
- Die Size: 815 mm² (near reticle limit).
- Transistor Count: 53 Billion.
- Power: ~200 Watts per card.
- Density: Currently 8 billion parameters per chip (hard-wired).
- Technology:
- Uses a “mask ROM recall fabric” paired with SRAM.
- Single-Transistor Efficiency: A single transistor stores 4 bits and performs the multiplication associated with them.
- Re-spin Time: Changing the model requires changing two metal layers; turnaround time is 2 months via TSMC.
- Performance Claims:
- Throughput/Latency: Charts indicate significantly lower latency and higher tokens/sec compared to NVIDIA Blackwell B200 and AMD/Intel counterparts.
- Use Case: Ideal for stabilized “frontier” models (e.g., Llama 3.1) where model weights do not change frequently.
- Roadmap (HC2):
- Targeting a summer release.
- Support for 20 billion parameters per chip.
- Pipeline parallelism support to span larger models (e.g., DeepSeek) across multiple cards via PCIe (high bandwidth interconnect not required due to low latency).
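The one-transistor-per-4-bits claim lends itself to a back-of-envelope sanity check. Assuming parameters are stored at 4-bit precision (an assumption; the source does not state the quantization level), weight storage alone accounts for only a fraction of the HC1's 53 billion transistors:

```python
def storage_transistors(params, bits_per_param=4, bits_per_transistor=4):
    """Transistors needed just to hold the weights, given Taalas's
    claim that one transistor stores 4 bits. The 4-bit-per-parameter
    quantization is an assumption, not stated in the source."""
    return params * bits_per_param // bits_per_transistor

hc1_weights = storage_transistors(8_000_000_000)    # HC1: 8B parameters
hc2_weights = storage_transistors(20_000_000_000)   # HC2: 20B parameters
hc1_remainder = 53_000_000_000 - hc1_weights        # left for SRAM, compute, fabric
```

Under these assumptions, roughly 8B of the HC1's 53B transistors would hold weights, leaving about 45B for the SRAM, recall fabric, and compute, consistent with a design that is far more than a plain ROM array.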