How is Cerebras so Fast?
At Cerebras, we're pushing the boundaries of what's possible with our Llama 3.1 models, achieving unprecedented inference speeds of 2148 tokens per second (t/s).
To put this into perspective, this is ~16x faster than the leading GPU-based provider, Fireworks. Here's a behind-the-scenes look at how we do it.
LLM Inference Is Bottlenecked by Memory Bandwidth
Large language models (LLMs), like Llama 3.1, generate text one word (or token) at a time, following a sequential process.
For instance, to generate the response "The Quick Brown Fox Jumps", the model produces one token at a time, passing the input through multiple layers of computation, each building on the output of the previous.
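To make that loop concrete, here is a minimal toy sketch of autoregressive generation. The random matrices and tiny vocabulary are stand-ins, not real Llama weights; the point is the structure of the loop, not the output quality.

```python
import numpy as np

# Toy autoregressive loop: random matrices stand in for real Llama weights.
rng = np.random.default_rng(0)
VOCAB = ["The", "Quick", "Brown", "Fox", "Jumps", "<eos>"]
LAYERS = [rng.standard_normal((len(VOCAB), len(VOCAB))) for _ in range(4)]

def forward(token_ids):
    """One full pass through every layer; all weights are touched per token."""
    x = np.eye(len(VOCAB))[token_ids].sum(axis=0)   # crude bag-of-tokens input
    for W in LAYERS:                                # every layer's weights are read
        x = np.tanh(W @ x)
    return x                                        # "logits" over the tiny vocabulary

def generate(token_ids, max_new_tokens=2):
    tokens = list(token_ids)
    for _ in range(max_new_tokens):                 # strictly sequential: one token at a time
        logits = forward(tokens)                    # depends on every token so far
        tokens.append(int(np.argmax(logits)))       # the new token feeds the next step
    return [VOCAB[t] for t in tokens]

print(generate([0, 1, 2]))   # start from "The Quick Brown"
```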
Each Token Requires Moving All Model Weights from Memory to Compute Cores
To generate each token, all model weights must be transferred from memory to the compute cores for each layer of the model. This movement occurs layer by layer and is repeated for every token, making the process inherently sequential and highly reliant on memory bandwidth.
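For a sense of scale, here is a rough back-of-the-envelope calculation of the weight traffic this implies. The 16-bit weight size and single-request framing are assumptions, and speculative decoding (discussed later) changes the picture, but it shows why bandwidth dominates.

```python
# Back-of-the-envelope weight traffic at the speeds quoted above, assuming
# 16-bit weights, a single request, and no speculative decoding or batching.
params = 70e9                                  # Llama 3.1-70B
bytes_per_param = 2                            # fp16/bf16 (assumption)
weight_bytes = params * bytes_per_param        # read once per generated token
tokens_per_second = 2148

print(f"{weight_bytes / 1e9:.0f} GB of weights read per token")
print(f"~{weight_bytes * tokens_per_second / 1e12:.0f} TB/s of weight traffic alone")
```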
To Enhance Efficiency, LLMs Use Caching to Store Intermediate Outputs
Caching reduces the need to recompute values for each layer during subsequent token creation, speeding up the overall process. However, all of the computed values that are cached must be stored and accessed from memory, which adds to the memory bandwidth required for inference.
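This cache is commonly called the KV cache: each layer stores the key/value projections of past tokens so they aren't recomputed. Below is a minimal single-head sketch with toy dimensions, not real Llama shapes, to show what gets stored and re-read on every new token.

```python
import numpy as np

# Minimal single-head KV cache with toy dimensions (not real Llama shapes).
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []            # grows by one entry per generated token

def attend(x_new):
    """Attention for the newest token, reusing cached K/V from earlier tokens."""
    q = Wq @ x_new
    k_cache.append(Wk @ x_new)       # stored once, never recomputed...
    v_cache.append(Wv @ x_new)       # ...but read back on every later token
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V               # attention output for the new token

for _ in range(5):                   # each new token reads the whole cache
    out = attend(rng.standard_normal(d))
print(f"cache holds {len(k_cache)} K/V pairs after 5 tokens")
```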
Each new token still depends on its predecessor, so token generation cannot be parallelized across tokens. This makes memory bandwidth a critical factor: every time data moves between memory and compute, latency is introduced, slowing inference.
Despite caching, the need to move model weights and access cached values for each token makes memory bandwidth the primary bottleneck. The efficiency of generating text is constrained by how quickly data can move between memory and compute cores.
Inference Without Memory Bandwidth Limits
Cerebras has made multiple innovations to achieve these speeds, spanning chip architecture, execution mode, and software optimizations.
Cerebras WSE Chip Architecture
GPUs use a hierarchical memory system, involving multiple layers with varying capacities and access speeds.
This setup includes a small, fast on-chip cache (L1), a slightly larger but slower L2 cache, and high-bandwidth memory (HBM) for off-chip storage. While effective in some scenarios, scaling to larger AI models introduces complexity and latency as the frequent data transfers between these memory levels—particularly between the L2 cache and HBM—become bottlenecks. This limits the available memory bandwidth, making it harder to sustain performance as models grow in size and demand more memory resources.
In contrast, the Cerebras Wafer Scale Engine (WSE) is designed to eliminate these bottlenecks through a unique architecture that tightly couples compute and memory at the core level.
Each core in the WSE has its own high-performance compute capabilities, tightly integrated with 48 kB of SRAM and a 512 B local cache. This direct integration provides enough bandwidth to sustain full SIMD performance and reduces the need for off-chip data transfers.
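As a quick sanity check on how that per-core SRAM adds up across the wafer (the roughly 900,000-core count is an assumption taken from public WSE-3 specifications, not a figure stated above):

```python
# Rough aggregate of per-core SRAM across the wafer. The ~900,000-core count is
# an assumption from public WSE-3 specs; 48 kB per core is the figure above.
cores = 900_000
sram_per_core = 48 * 1024          # bytes
print(f"~{cores * sram_per_core / 1e9:.0f} GB of on-wafer SRAM")   # ~44 GB
```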
The WSE's on-chip mesh network lets each core compute directly against its own local SRAM, which drastically reduces latency. The reason for this massive efficiency advantage is simple: it's just physics.
In traditional GPU setups, data must move through SerDes serial links, connectors, circuit boards, and switch chips, introducing significant latency. In contrast, moving bits less than a millimeter on silicon within the WSE is far more efficient. As a result, the WSE architecture increases memory bandwidth and avoids the bottlenecks GPUs face.
Pipeline Execution Mode
GPUs face memory bandwidth bottlenecks due to shared memory across multiple chips and frequent off-chip communication, resulting in higher latency.
In Cerebras' pipelined execution for AI inference, model layers are mapped onto different regions of the wafer. Each region on the wafer is responsible for executing one or more model layers, with the region size dynamically determined based on memory and compute requirements. The on-chip fabric interconnect allows ultra-low latency communication between regions, with logically adjacent pipeline stages placed next to each other to further reduce latency.
Since a single request utilizes only a fraction of the total memory bandwidth, Cerebras can support high multi-request throughput, running additional requests in parallel. Each request gets full access to the model without any performance compromise, as all pipeline stages run concurrently.
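One way to picture pipelined execution is as an assembly line: each wafer region owns a block of layers, and multiple requests occupy different stages at the same time. The toy schedule below is purely illustrative (the four-region split and layer ranges are made up), not Cerebras' actual scheduler.

```python
# Illustrative pipeline schedule: four wafer "regions", each owning a block of
# layers. Requests advance one stage per step, so stages stay busy concurrently.
REGIONS = ["layers 1-20", "layers 21-40", "layers 41-60", "layers 61-80"]

def pipeline_schedule(num_requests, num_stages=len(REGIONS)):
    """For each time step, report which request each region is working on."""
    steps = []
    for t in range(num_requests + num_stages - 1):
        busy = {}
        for stage in range(num_stages):
            req = t - stage                    # the request currently in this stage
            if 0 <= req < num_requests:
                busy[REGIONS[stage]] = f"request {req}"
        steps.append(busy)
    return steps

for t, busy in enumerate(pipeline_schedule(num_requests=3)):
    print(f"step {t}: {busy}")
```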
Thanks to the large amount of SRAM on the WSE-3, a model whose parameters fit entirely on a wafer can be served from a single chip. For example, Llama 3.1-8B (16 GB of weights) fits comfortably within the WSE-3's 44 GB of on-chip memory. The Cerebras WSE-3 scales effectively to larger models as well.
Models that need more memory are mapped across multiple wafers. Llama 3.1-70B, for example, fits on 4 wafers, providing 176 GB of memory. The high-bandwidth fabric on each wafer minimizes communication to just transferring activations, keeping latency low.
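The arithmetic behind those wafer counts looks roughly like this, assuming 16-bit weights and ignoring KV-cache and activation overhead:

```python
import math

# Wafer-count arithmetic, assuming 16-bit weights and ignoring KV-cache and
# activation memory. 44 GB per wafer is the WSE-3 on-chip SRAM figure above.
WAFER_SRAM_GB = 44

def wafers_needed(params_billion, bytes_per_param=2):
    model_gb = params_billion * bytes_per_param
    return model_gb, math.ceil(model_gb / WAFER_SRAM_GB)

for name, params_b in [("Llama 3.1-8B", 8), ("Llama 3.1-70B", 70)]:
    gb, wafers = wafers_needed(params_b)
    print(f"{name}: {gb:.0f} GB of weights -> {wafers} wafer(s) "
          f"({wafers * WAFER_SRAM_GB} GB of SRAM)")
```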
Unlike multi-GPU setups, where latency climbs as more devices frequently exchange activations and partial results, Cerebras' architecture scales linearly without performance degradation because only activations are communicated between wafers, which requires relatively little bandwidth.
Speculative Decoding
On October 24th, Cerebras introduced speculative decoding, boosting inference speeds from 580 t/s to an impressive 2148 t/s.
Speculative decoding accelerates inference by using a smaller, faster model to draft the next several tokens, which are then verified or corrected by a larger, more accurate model like Llama 70B. Cerebras uses Llama 3.2-3B to draft tokens rapidly, with Llama 70B verifying them. This hybrid approach more than triples the speed of inference without sacrificing accuracy.
The key advantage is leveraging the computational efficiency of the smaller model while preserving the accuracy of the larger one. Inference speeds up because the larger model only verifies batches of drafted tokens rather than generating every token itself.
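Schematically, the draft model proposes a few tokens ahead and the target model keeps the longest prefix it agrees with; in a real system that verification happens in one batched forward pass. The toy loop below uses lookup-table stand-ins, not Llama 3.2-3B or Llama 70B.

```python
import numpy as np

# Toy speculative decoding: a cheap draft model proposes k tokens, a larger
# target model verifies them and keeps the longest prefix it agrees with.
VOCAB = 16

def toy_model(seed):
    table = np.random.default_rng(seed).integers(0, VOCAB, size=VOCAB)
    return lambda tokens: int(table[tokens[-1] % VOCAB])  # next token from the last one

draft = toy_model(seed=1)    # stands in for the small, fast draft model
target = toy_model(seed=1)   # identical here, so drafts are usually accepted;
                             # a real draft model only approximates the target

def speculative_step(tokens, k=3):
    # 1) Draft model proposes k tokens cheaply, one after another.
    ctx, proposed = list(tokens), []
    for _ in range(k):
        nxt = draft(ctx)
        proposed.append(nxt)
        ctx.append(nxt)
    # 2) Target model checks the proposals (a real system does this in a single
    #    batched forward pass); keep the longest correct prefix.
    ctx, accepted = list(tokens), []
    for p in proposed:
        want = target(ctx)
        accepted.append(want)        # the target's token is always correct output
        ctx.append(want)
        if want != p:                # first mismatch ends this round
            break
    return tokens + accepted

tokens = [3, 7]
for _ in range(3):
    tokens = speculative_step(tokens)
print(tokens)
```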
What's Next?
Cerebras' cutting-edge hardware overcomes the memory bandwidth bottlenecks and latency issues that constrain GPUs. By combining tightly integrated compute and memory, an on-chip mesh network, and techniques like speculative decoding, Cerebras achieves 20x the performance of GPUs. Reasoning models like o1 require fast inference to provide timely, contextually accurate answers that keep interactions smooth and effective. Agentic workflows depend on fast inference for real-time decision-making and adaptiveness, essential for dynamic, autonomous agents. Real-time voice and video use cases need rapid inference to keep latency low, maintaining natural conversation flow and seamless user experiences.
Building something groundbreaking with fast inference? Let's chat—DM me on Twitter!
To get an API key, sign up at cloud.cerebras.ai.