Update: 2025-12-23 (05:42 AM)
Technical Intelligence Report: 2025-12-23
🤼‍♂️ Market & Competitors
[2025-12-23] Rebellions AI Puts Together An HBM And Arm Alliance To Take On Nvidia
Source: The Next Platform
Key takeaway relevant to AMD:
> Rebellions AI is emerging as a formidable competitor in the AI inference market, backed by both Samsung and SK Hynix. Its “Rebel Quad” accelerator matches the AMD MI325X in TFLOPS-per-watt efficiency and uses an open-source software stack (PyTorch/Triton) that competes with AMD’s ROCm ecosystem for the same developer mindshare.
Summary:
> South Korean startup Rebellions AI, recently merged with Sapeon Korea to become a $1.5B+ unicorn, has unveiled technical details for its third-generation “Rebel” AI inference chips. Built on Samsung’s 4nm process, the architecture uses a Coarse Grained Reconfigurable Array (CGRA) and a software-defined network-on-chip to optimize for the distinct phases of LLM inference (prefill vs. decode).
Details:
- Architectural Innovation: The Rebel chip uses a CGRA approach where neural cores can be dynamically programmed. During the prefill stage (compute-intensive), the cores act as a systolic array. During the decode stage (memory-intensive), they transition to a memory-bandwidth-optimized configuration.
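The payoff of that reconfiguration follows from arithmetic intensity: prefill reuses each weight across every prompt token, while decode reads the entire weight set to produce a single token. A small illustrative calculation (the layer dimensions and token counts below are hypothetical, not Rebel specifications):

```python
def arithmetic_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte of weight traffic for a [tokens, d_in] x [d_in, d_out]
    matmul with FP16 weights. Weight traffic dominates when the token
    batch is small, which is exactly the decode case."""
    flops = 2 * tokens * d_in * d_out            # one multiply-accumulate = 2 FLOPs
    weight_bytes = d_in * d_out * bytes_per_elem
    return flops / weight_bytes

# Prefill: a 2048-token prompt reuses each weight 2048 times -> compute-bound.
print(arithmetic_intensity(2048, 4096, 4096))   # 2048.0 FLOPs/byte
# Decode: one new token per step -> each weight read yields only 2 FLOPs.
print(arithmetic_intensity(1, 4096, 4096))      # 1.0 FLOPs/byte
```

The three-orders-of-magnitude gap is why a systolic-array configuration suits prefill while a bandwidth-optimized configuration suits decode.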
- Rebel Quad Specifications:
- Performance: 1 petaflops (FP16) and 2 petaflops (FP8).
- Memory: 4.8 TB/sec aggregate bandwidth using four 12-high stacks of Samsung HBM3E.
- Interconnect: UCIe-A (Universal Chiplet Interconnect Express) links, with IP licensed from Alphawave Semi, providing 1 TB/s per port.
- Power Consumption: 600W TDP in a PCIe form factor.
- Comparative Metrics:
- Efficiency: Delivers ~20.7% more teraflops per watt than the Nvidia H200.
- AMD Comparison: Matches the AMD MI325X in teraflops per watt, though the MI325X retains a ~28% advantage in raw floating-point throughput.
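The quoted figures can be cross-checked directly; everything below is derived only from the specifications stated above, with no competitor TDPs assumed:

```python
# Rebel Quad figures as quoted in the spec list above.
fp16_tflops = 1000    # 1 petaflops FP16
fp8_tflops = 2000     # 2 petaflops FP8
tdp_w = 600           # 600 W TDP, PCIe form factor
hbm_stacks = 4        # four 12-high HBM3E stacks
agg_bw_tb_s = 4.8     # aggregate memory bandwidth

print(fp16_tflops / tdp_w)        # ≈1.67 TFLOPS/W at FP16
print(fp8_tflops / tdp_w)         # ≈3.33 TFLOPS/W at FP8
print(agg_bw_tb_s / hbm_stacks)   # 1.2 TB/s per HBM3E stack
```

The per-stack result (1.2 TB/s) is consistent with current-generation HBM3E stack bandwidth.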
- Core Hardware Components:
- Neural Cores: Each core features 4MB of L1 SRAM, dedicated Load/Store units, and Tensor/Vector units supporting FP16, FP8, FP4, NF4, and MXFP4.
- Command Processor (CP): Contains two 4-core Arm Neoverse CPU blocks for data movement orchestration and synchronization.
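To put the supported data types in perspective, here is a rough estimate of weight storage for a hypothetical 8-billion-parameter model at each precision. NF4 block-scale overhead is ignored for simplicity; MXFP4 includes the 8-bit shared scale per 32-element block defined by the OCP MX spec:

```python
# Approximate weight memory for a hypothetical 8B-parameter model.
params = 8e9
bits_per_elem = {
    "FP16": 16,
    "FP8": 8,
    "FP4": 4,
    "NF4": 4,            # block scales omitted for simplicity
    "MXFP4": 4 + 8 / 32,  # 4-bit elements + 8-bit shared scale per 32 elements
}
for fmt, bits in bits_per_elem.items():
    print(f"{fmt}: {params * bits / 8 / 1e9:.2f} GB")
```

Dropping from FP16 to a 4-bit format cuts weight traffic roughly 4x, which matters most in the bandwidth-bound decode phase.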
- Software Ecosystem:
- The stack is built on native PyTorch and the Triton inference engine.
- RBLN CCL: A proprietary collective communications library (analogous to AMD’s RCCL or Nvidia’s NCCL).
- Integrated with the vLLM library and the Ray distributed inference framework via Red Hat OpenShift.
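RBLN CCL's internals are not public. As a generic illustration of what this class of library (NCCL, RCCL) provides, here is a single-process Python simulation of the bandwidth-optimal ring all-reduce pattern such libraries implement over real interconnects; all names are illustrative, not Rebellions APIs:

```python
def ring_all_reduce(rank_vectors):
    """Simulate ring all-reduce across n 'ranks' in one process.

    Each rank's vector is split into n chunks. A reduce-scatter pass
    leaves each rank holding one fully summed chunk, then an all-gather
    pass circulates the summed chunks until every rank has the full sum.
    """
    n = len(rank_vectors)
    c = len(rank_vectors[0]) // n  # chunk length (length assumed divisible by n)
    chunks = [[list(v[i * c:(i + 1) * c]) for i in range(n)]
              for v in rank_vectors]

    # Reduce-scatter: at each step, rank r sends chunk (r - step) mod n
    # to its ring neighbour, which accumulates it elementwise.
    for step in range(n - 1):
        for r in range(n):
            ci = (r - step) % n
            dst = (r + 1) % n
            for k in range(c):
                chunks[dst][ci][k] += chunks[r][ci][k]

    # All-gather: circulate each fully reduced chunk around the ring,
    # overwriting stale copies.
    for step in range(n - 1):
        for r in range(n):
            ci = (r + 1 - step) % n
            chunks[(r + 1) % n][ci] = list(chunks[r][ci])

    return [[x for chunk in ch for x in chunk] for ch in chunks]

# Three ranks, each holding a 6-element gradient shard.
result = ring_all_reduce([[1.0] * 6, [2.0] * 6, [3.0] * 6])
print(result[0])  # [6.0, 6.0, 6.0, 6.0, 6.0, 6.0] -- every rank has the sum
```

Each rank sends and receives only 2·(n−1)/n of the data volume regardless of ring size, which is why this pattern (rather than a naive gather-then-broadcast) is the standard for gradient and KV-cache synchronization.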
- Strategic Alliance: By partnering with the “Arm Total Design” ecosystem, Rebellions plans to integrate its accelerators with Neoverse-based CPUs on Samsung’s upcoming 2nm process.