文学城论坛

on-chip SRAM AI ASIC: why an AI ASIC?

胡雪盐8 2025-12-14 10:22:46 ( reads)

An on-chip SRAM AI ASIC is an accelerator in which most of the working set (activations, partial sums, sometimes the weights themselves) stays in SRAM on the compute die itself, instead of being fetched from off-chip DRAM/HBM.


1. Latency dominance (especially LLM inference)

Each generated token must stream the model's weights (and a growing KV cache) through the compute units. On-chip SRAM serves those reads in nanoseconds at enormous aggregate bandwidth, while off-chip DRAM/HBM adds hundreds of nanoseconds of access latency and a hard bandwidth ceiling. For token-by-token inference, this difference dominates user-visible latency.
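A rough bound makes the point concrete: each decode step streams every weight byte through the compute units once, so tokens/sec is capped by memory bandwidth. The sketch below uses illustrative, assumed bandwidth numbers, not vendor specifications.

```python
# Per-token decode latency bound: every weight byte is streamed once
# per generated token, so time_per_token >= model_bytes / bandwidth.
# All numbers below are illustrative assumptions, not vendor specs.

def tokens_per_second(params_billion: float, bytes_per_param: float,
                      bandwidth_gb_s: float) -> float:
    model_bytes = params_billion * 1e9 * bytes_per_param
    seconds_per_token = model_bytes / (bandwidth_gb_s * 1e9)
    return 1.0 / seconds_per_token

# 7B-parameter model with INT8 (1-byte) weights
hbm = tokens_per_second(7, 1, 3_000)    # ~3 TB/s HBM stack (assumed)
sram = tokens_per_second(7, 1, 80_000)  # ~80 TB/s aggregate on-chip SRAM (assumed)
print(f"HBM-bound:  {hbm:.0f} tok/s")
print(f"SRAM-bound: {sram:.0f} tok/s")
```

Under these assumptions the SRAM-fed design is bandwidth-limited to roughly 25x more tokens per second, before any latency effects are counted.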

2. Energy efficiency

Approximate energy per access (orders of magnitude, in the spirit of Horowitz's widely cited ISSCC 2014 figures):

On-chip SRAM: roughly 1–10 pJ per 32-bit access
Off-chip DRAM: roughly 1–2 nJ per access, i.e. 100x or more

Moving a byte off-chip therefore costs about two orders of magnitude more energy than reading it from local SRAM. LLMs are often memory-energy limited, not compute-limited.

3. Deterministic performance

With no off-chip DRAM on the critical path, there are no cache misses, refresh stalls, or row-buffer conflicts; the compiler can schedule every access statically, so latency is repeatable cycle for cycle.

Typical on-chip SRAM capacities:

Chip class                  On-chip SRAM
Mobile NPU                  4–32 MB
Edge inference ASIC         32–128 MB
Datacenter inference ASIC   100–300 MB
Wafer-scale (Cerebras)      10s of GB
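Given the ranges above, a quick check tells you which chip class can hold a model's weights entirely on-die. The capacities below take the upper end of each row in the table; model sizes are illustrative.

```python
# Which chip classes can hold a model's weights entirely on-die?
# Capacities are the upper ends of the table above (wafer-scale
# approximated as 40 GB).

CAPACITY_MB = {
    "Mobile NPU": 32,
    "Edge inference ASIC": 128,
    "Datacenter inference ASIC": 300,
    "Wafer-scale (Cerebras)": 40_000,
}

def fits_on_die(params_million: float, bytes_per_param: float) -> list[str]:
    model_mb = params_million * 1e6 * bytes_per_param / 2**20
    return [c for c, cap in CAPACITY_MB.items() if cap >= model_mb]

print(fits_on_die(25, 1))     # small 25M INT8 model: fits everywhere
print(fits_on_die(7000, 1))   # 7B INT8 model: wafer-scale only
```

This is why datacenter-class SRAM ASICs often shard one LLM across many chips: a 7B INT8 model (~6.7 GB) dwarfs a single die's 100–300 MB.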


Famous examples (and what they optimized for)

Groq: LPU built around roughly 230 MB of on-chip SRAM per chip and a fully compiler-scheduled, deterministic dataflow; optimized for ultra-low-latency inference.

Google TPU v1–v3: systolic matrix units fed from large on-chip buffers (v1 had a 24 MB unified buffer); v1 optimized for inference throughput, v2/v3 added HBM to support training.

Cerebras: wafer-scale engine with tens of GB of on-chip SRAM (WSE-2: 40 GB), keeping weights and activations resident on the wafer; optimized for eliminating off-chip traffic entirely.


When on-chip SRAM AI ASICs are the right answer

Ultra-low latency LLM inference
Real-time systems (finance, robotics, telecom)
Edge or power-constrained environments
Predictable workloads with known model shapes
