Building High-Performance LLM Infrastructure: Cloudflare’s Approach to Separating Input and Output Processing
Learn to build cost-effective LLM infrastructure by separating input and output processing, inspired by Cloudflare's architecture. Step-by-step guide with code examples.
Overview
Large language models (LLMs) have become a cornerstone of modern AI applications, but serving them efficiently at scale presents unique challenges. Cloudflare recently announced a novel infrastructure design that tackles two critical bottlenecks: the high cost of hardware and the massive throughput required for both user prompts and model responses. Instead of treating inference as a monolithic process, Cloudflare splits the pipeline into two distinct stages—input processing and output generation—each optimized for its specific workload. This tutorial walks you through the architecture, step by step, explaining why this separation matters and how you can apply similar principles to your own LLM deployments.

Whether you’re a machine learning engineer, infrastructure architect, or developer building AI-powered APIs, you’ll gain insights into cost-effective scaling, latency reduction, and resource allocation.
Prerequisites
Knowledge Requirements
- Basic understanding of how LLMs work (transformer architecture, tokenization, autoregressive generation).
- Familiarity with cloud computing concepts (distributed systems, load balancing, edge computing).
- Experience with deploying AI models in production (optional but helpful).
Technical Requirements
- Access to a cloud environment (AWS, GCP, Azure, or Cloudflare Workers) for hands-on experimentation.
- A sample LLM (e.g., a smaller variant like Llama 2 7B) to test the architecture.
- Tools: Python, PyTorch, ONNX Runtime, or a framework like vLLM for serving.
Step-by-Step Instructions
1. Understanding the Challenge of LLM Inference
LLM inference involves two phases: prefill (processing the input prompt) and decoding (generating tokens one by one). The prefill phase is compute-heavy and parallelizable, while the decoding phase is memory-bandwidth-bound and sequential. Traditional serving systems handle both on the same hardware, leading to inefficiencies—expensive GPUs idle during decoding stalls, or memory capacity is wasted on storing intermediate results. Cloudflare’s insight: separate the phases onto specialized systems to optimize cost and performance.
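To make this asymmetry concrete, the sketch below uses plain NumPy matrix multiplications as stand-ins for transformer layers (all shapes and sizes are illustrative assumptions, not measurements): prefill touches every prompt token in one large parallel operation, while decoding repeats a small operation once per generated token and mostly just re-reads the same weights from memory.
# Illustrative only: matmuls as stand-ins for transformer compute
import numpy as np
d_model, prompt_len, gen_len = 4096, 1024, 128
weights = np.random.randn(d_model, d_model).astype(np.float32)
# Prefill: one large matmul over every prompt token -> compute-bound, parallel
prompt_states = np.random.randn(prompt_len, d_model).astype(np.float32)
prefill_out = prompt_states @ weights
# Decode: one tiny matmul per generated token -> sequential, dominated by
# streaming the same weights from memory over and over (bandwidth-bound)
token_state = np.random.randn(1, d_model).astype(np.float32)
for _ in range(gen_len):
    token_state = token_state @ weights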
2. The Core Innovation: Separating Input and Output Processing
Cloudflare’s infrastructure divides the workload into two distinct pipelines. The input processing system handles prefill—tokenizing the prompt, running the feed‑forward and attention layers for all input tokens, and producing a key-value (KV) cache. The output generation system then uses that cache to generate the response tokens autoregressively. Each system is optimized for its specific bottleneck: the input system for compute throughput (e.g., high‑end GPUs with many cores), the output system for memory bandwidth and low latency (e.g., CPUs with large caches or specialized ASICs).
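One way to make the split concrete is to define an explicit handoff object that the input service produces and the output service consumes. The sketch below is an assumed interface for illustration only; the PrefillResult name and its fields are hypothetical, not Cloudflare's actual API.
# Sketch: a hypothetical handoff contract between the two services
from dataclasses import dataclass
from typing import List

@dataclass
class PrefillResult:
    """Everything the output service needs to start decoding immediately."""
    request_id: str
    prompt_tokens: List[int]   # tokenized prompt, kept for logging and safety checks
    kv_cache_ref: str          # key into the shared KV-cache store
    num_layers: int            # model layout, so the decoder can validate shapes
    max_new_tokens: int        # per-request generation budget

def handoff(result: PrefillResult, output_queue) -> None:
    # The input service enqueues the result; an output-generation node picks it up
    output_queue.put(result)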
3. Designing the Input Processing Pipeline
To implement this separation, start by building a dedicated service for prefill. Use a pool of GPU instances with high FLOPS (e.g., NVIDIA A100 or H100). Batch multiple prompts together to maximize GPU utilization. Prefill can be parallelized across prompt tokens, so you can process several inputs simultaneously. Store the resulting KV cache in a high‑speed shared memory layer (e.g., Redis or RDMA‑connected memory) that the output service can access quickly.
# Example: Prefill logic for a single prompt (pseudocode)
def process_prompt(prompt, model):
    # Tokenize the prompt, then run one forward pass over all input tokens at once
    tokens = tokenize(prompt)
    # The prefill pass returns the key-value cache for every transformer layer
    kv_cache = model.forward(tokens)
    return kv_cache
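To hand the cache off to the output service, the prefill node can serialize it into the shared store mentioned above. Here is a minimal sketch using redis-py and pickle; the key scheme, host name, and one-hour TTL are assumptions for illustration, and a production system would use a faster path, as discussed later.
# Sketch: publishing a KV cache to a shared store (assumes redis-py is installed)
import pickle
import redis

store = redis.Redis(host="kv-cache-store.internal", port=6379)

def publish_kv_cache(request_id, kv_cache, ttl_seconds=3600):
    # Serialize the cache and keep it only for the lifetime of the request
    store.set(f"kvcache:{request_id}", pickle.dumps(kv_cache), ex=ttl_seconds)
    return f"kvcache:{request_id}"

def fetch_kv_cache(cache_key):
    # Called by the output-generation service right before decoding starts
    return pickle.loads(store.get(cache_key))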
Cloudflare distributes these input processing nodes across its global network, so a user’s prompt is handled by the nearest edge location, reducing initial latency.
4. Designing the Output Generation Pipeline
For generation, you need a system that fetches the KV cache and produces tokens one at a time. Because generation is sequential, you can use cheaper, memory‑optimized hardware. Consider leveraging CPUs or low‑cost TPUs with large, fast caches. The generation service must maintain low latency per token (typically under 50 ms). Use techniques like speculative decoding or caching repeated response prefixes to speed up the process.
# Example: Autoregressive decoding (pseudocode)
def generate_response(kv_cache, model, max_tokens, eos_token):
    output_tokens = []
    for _ in range(max_tokens):
        # Forward pass over only the most recent token, reusing the cached state
        logits = model.decode(kv_cache)
        next_token = sample(logits)
        if next_token == eos_token:
            break
        output_tokens.append(next_token)
        # Extend the cache with the new token so the next step can attend to it
        kv_cache.update(next_token)
    return detokenize(output_tokens)
Because the generation phase doesn’t need to re‑process the entire prompt, it can start instantly from the cached state, reducing time‑to‑first‑token significantly.
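Speculative decoding, mentioned above, can be sketched roughly as follows: a small draft model proposes a short run of tokens cheaply, and the large target model verifies them all in a single parallel pass, keeping the longest agreeing prefix. The draft_model and target_model interfaces below are hypothetical stand-ins, and the acceptance rule is the simplified greedy-match variant rather than the full rejection-sampling scheme.
# Sketch: simplified speculative decoding (greedy-match acceptance)
def speculative_step(target_model, draft_model, kv_cache, draft_len=4):
    # 1. The cheap draft model proposes a short run of candidate tokens
    draft_tokens = draft_model.generate(kv_cache, num_tokens=draft_len)
    # 2. The target model scores all drafted positions in ONE parallel forward pass
    target_tokens = target_model.greedy_verify(kv_cache, draft_tokens)
    # 3. Keep the longest prefix where draft and target agree, plus the target's
    #    own token at the first disagreement, so progress is always >= 1 token
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        accepted.append(verified)
        if drafted != verified:
            break
    kv_cache.update(accepted)
    return accepted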
5. Deploying Across Cloudflare’s Global Network
Cloudflare’s key advantage is its vast edge network. Input processing nodes can be placed close to users, while output generation nodes are optimally distributed for low‑latency response delivery. You can mimic this by using a global load balancer (e.g., Cloudflare’s own DNS‑based routing) to direct requests to the nearest prefill server, then have that server pass the KV cache to a generation server that is also geographically close to the user. Use connection pooling and caching (e.g., KV cache stored in Cloudflare’s global memory store) to minimize data movement.
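As a toy illustration of the routing idea, the sketch below picks the region closest to the caller and sends both stages there so the KV cache never has to cross a long-haul link; the REGIONS table, URLs, and straight-line distance heuristic are illustrative assumptions rather than Cloudflare's actual routing logic.
# Sketch: routing both stages to the region nearest the user (illustrative only)
import math

REGIONS = {
    "us-east": (38.9, -77.0),
    "eu-west": (53.3, -6.2),
    "ap-southeast": (1.35, 103.8),
}

def nearest_region(user_lat, user_lon):
    # Straight-line distance is a crude stand-in for real latency measurements
    return min(REGIONS, key=lambda name: math.hypot(REGIONS[name][0] - user_lat,
                                                    REGIONS[name][1] - user_lon))

region = nearest_region(48.85, 2.35)   # a user near Paris -> "eu-west"
prefill_url = f"https://prefill.{region}.example.internal/v1/prefill"
generate_url = f"https://generate.{region}.example.internal/v1/generate"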
Common Mistakes and Pitfalls
Ignoring Latency Asymmetry
Many developers assume that the input and output phases can share the same latency budget. In practice, prefill is a one-time cost that sets the time to first token, which users will tolerate within reason, while generation must sustain consistent low per-token latency for the whole response. Failing to provision generation hardware with sufficient memory bandwidth causes tokens-per-second to drop and users to notice.
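A quick back-of-the-envelope check shows why bandwidth dominates the generation phase: each decoded token has to stream roughly the full set of model weights through the memory system, so achievable tokens per second is bounded by bandwidth divided by model size. The bandwidth figures below are rough, illustrative numbers.
# Back-of-the-envelope: per-stream decode throughput is bandwidth-limited
model_bytes = 7e9 * 2            # ~7B parameters at 2 bytes each (fp16)
hbm_bandwidth = 2.0e12           # ~2 TB/s, in the range of a modern GPU's HBM
cpu_bandwidth = 0.2e12           # ~200 GB/s, in the range of a CPU socket
print(hbm_bandwidth / model_bytes)   # ~143 tokens/s upper bound per stream
print(cpu_bandwidth / model_bytes)   # ~14 tokens/s upper bound per stream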
Underestimating Memory Bandwidth
Even with separated pipelines, the KV cache transfer between input and output systems can become a bottleneck. Use shared memory or fast interconnects (e.g., InfiniBand, NVLink) rather than standard network calls. Cloudflare uses custom low‑latency networking to avoid this pitfall.
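The size of the cache being moved is easy to underestimate. Using the Llama 2 7B dimensions from the prerequisites (32 layers, hidden size 4096, fp16 values) as a worked example:
# Back-of-the-envelope: KV cache size and transfer time for a 1024-token prompt
num_layers, hidden_size, bytes_per_value = 32, 4096, 2   # Llama 2 7B, fp16
prompt_tokens = 1024
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value   # keys + values
total_bytes = kv_bytes_per_token * prompt_tokens                      # ~512 MiB
print(total_bytes / 1.25e9)   # ~0.43 s over a 10 Gb/s network link
print(total_bytes / 50e9)     # ~0.01 s over a fast interconnect (illustrative figure)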
Overlooking Edge Caching for Prompts
If the same prompt appears frequently (e.g., in a chatbot), you can cache its KV cache to avoid redundant prefill. This reduces cost and latency. Cloudflare likely implements such a cache at the edge. Without it, you may waste GPU cycles on repetitive inputs.
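A minimal version of such a prompt cache can be keyed on a hash of the tokenized prompt. The sketch below reuses the tokenize and model.forward pseudocode from step 3 and keeps the cache in a local dict for simplicity; in practice it would live in the shared edge store.
# Sketch: reusing the KV cache for repeated prompts (e.g., a fixed system prompt)
import hashlib

prompt_cache = {}   # in production this would be a shared, size-bounded edge cache

def prefill_with_cache(prompt, model):
    tokens = tokenize(prompt)
    key = hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()
    if key in prompt_cache:
        return prompt_cache[key]          # cache hit: skip the GPU prefill entirely
    kv_cache = model.forward(tokens)      # cache miss: same prefill path as step 3
    prompt_cache[key] = kv_cache
    return kv_cache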
Summary
Cloudflare’s high‑performance LLM infrastructure demonstrates a powerful principle: decoupling input and output processing allows each phase to run on optimized hardware, reducing cost and latency simultaneously. By following this guide, you can design your own split‑pipeline serving system, leveraging batching, KV cache sharing, and global edge distribution. Remember to watch out for common pitfalls like mismatched latency budgets and insufficient memory bandwidth. With the right architecture, you can serve LLMs at scale without breaking the bank.