# Designing an LLM Inference Platform

> 🗓️ **Last updated: July 2026**

The interviewer drops a familiar-sounding line: *"design a multi-tenant API platform that serves machine-learning models at scale."*

Your reflexes fire — load balancer, autoscaling pool, request queue, observability, rate limits. Classical web-service playbook.

Then they qualify it: *"large language models, hundreds of tenants, mixed workloads, latency-sensitive interactive traffic alongside batch jobs, GPUs are the bottleneck."*

The playbook breaks. Almost every assumption that makes a stateless HTTP service easy to scale is inverted here. Requests are large. Latency is variable in a way that depends on the request's *output length*, not just its input. The bottleneck is GPU memory and bandwidth, not CPU or network. Batching is no longer an optimisation — it is the architecture. And the unit cost is not measured in milliseconds of compute but in dollars per million tokens.

This article walks the system-design version of *"design an LLM inference platform"* the way I'd want it walked in a senior interview — or in a real architecture review for a team about to commit a quarter of capex to GPU clusters.

The thesis: **multi-tenant LLM serving inverts the web-infra playbook. The request is large, the latency is variable-by-output, the resource is scarce and expensive, and the only way to make the unit economics work is to treat batching as a first-class architectural concept rather than a clever optimisation.**

* * *

## Requirements

Five non-functional requirements drive every later decision.

**1\. Mixed workloads, mixed SLOs.** The platform serves at least three traffic classes: interactive chat (humans waiting on a stream), batch jobs (offline summarisation, embedding regeneration, evaluation runs), and code completions or other near-real-time streams. Each class has a different latency SLO and a different throughput appetite. A single fleet with one SLO will either waste GPUs on batch or starve interactive traffic.

**2\. Latency expressed as a distribution, not an average.** P50 hides everything that matters. The numbers that move customer experience are time-to-first-token (TTFT) and inter-token latency (ITL) at P95 and P99. Tail behaviour is the product. Dean and Barroso's *The Tail at Scale* (CACM 2013) framed this for classical distributed systems; LLM serving has the same problem with sharper teeth because long generations and shared GPUs amplify tail effects.

**3\. Throughput in tokens per second per GPU.** Cost lives here. A serving stack that sustains 1,800 tokens/sec/GPU on a workload needs roughly half the GPUs, and so roughly half the spend, of one that sustains 900 tokens/sec/GPU at the same aggregate throughput. Every architectural choice (batching strategy, attention implementation, quantisation, caching) is ultimately measured against this number.

**4\. Isolation between tenants.** Noisy neighbours in LLM serving are catastrophic. One tenant submitting a request with a 64K-token context and a 4K-token generation can stall a shared replica for tens of seconds. Isolation is not a courtesy; it is an SLO commitment.

**5\. Operability under continuous model churn.** New model versions ship constantly. The platform has to handle rolling upgrades, A/B routing, canarying, and rollback without breaking active streams. This sounds routine — until you remember that loading a 70B-parameter model into GPU memory takes minutes, not seconds.
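Requirement 2 can be made concrete with a small amount of code. The sketch below derives TTFT and the inter-token gaps from per-token arrival timestamps and reads off a nearest-rank percentile. The function names and the timestamp format are assumptions for illustration, not part of any particular serving stack.

```python
import math
from typing import List, Sequence, Tuple


def ttft_and_itl(request_start: float, token_times: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (time-to-first-token, inter-token gaps) for one streamed response."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    # Gap between each consecutive pair of token arrivals.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting the `ttft` values and the flattened `itl` lists per traffic class, then reporting `percentile(..., 95)` and `percentile(..., 99)`, gives the distribution view this requirement asks for rather than a single mean.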

The product manager wants two more things that look like requirements but are actually traps: *"infinite context"* and *"no rate limits for paying customers."* Neither is free. The architecture has to bound both at the platform level even when the contract language doesn't.

* * *

## What the classical version of this looked like

The classical analogue is a multi-tenant ML inference service from the pre-LLM era — image classification, recommendation scoring, fraud detection. The shape was:

*   A model server (TensorFlow Serving, TorchServe, Triton) loads a model into GPU memory once and serves it for many requests.
    
*   Requests are small and uniform — a feature vector or an image tensor. Latency is dominated by network and a single forward pass measured in tens of milliseconds.
    
*   Batching is straightforward: collect requests for a few milliseconds, run them as a single tensor, return results. Padding overhead is negligible because requests are similar sizes.
    
*   Autoscaling tracks GPU utilisation or queue depth. Cold-start is annoying but bounded.
    
*   Cost is per-inference, easy to model, easy to bill.
    

LLM serving breaks four of those five points simultaneously. Requests are huge and variable. Latency is no longer a single forward pass — it's hundreds or thousands of forward passes for the decode loop. Batching across requests with different generation lengths wastes most of the compute unless you batch *dynamically*. And cold-start is now measured in minutes because a single model weight file can be hundreds of gigabytes.

Variable generation length, the third of those breaks, is the wedge that drives almost every interesting architectural decision in this system.

* * *

## Four cascading decisions

**Decision 1: Workload tiering and SLO classes.** Pick how many traffic classes the platform exposes and what their SLOs are. A reasonable starting point is three: *interactive* (TTFT under a few hundred milliseconds, ITL bounded for streaming), *near-real-time* (completions, embeddings, classification — sub-second end-to-end), and *batch* (no per-request SLO; throughput-optimised). The number is less important than the act of *committing* to tiers. Without tiers, every customer's traffic competes against every other customer's traffic, and the tail is whatever the worst case in the system happens to be at that moment.

**Decision 2: Hardware tier and tenancy model.** Pick GPU SKUs, reserved-versus-on-demand mix, and whether replicas are single-tenant or shared. The real-world choices are H100 / H200 / A100 / L4 on the NVIDIA side, MI300X on the AMD side, and the various cloud accelerators (TPU, Trainium, Inferentia). The decision interacts with the workload tier: batch jobs are happy on older, cheaper GPUs with high memory; interactive workloads benefit from the latest-generation chips with faster interconnect. A single tenancy model rarely wins — most mature platforms run a mix of dedicated capacity for enterprise tiers and pooled capacity for everyone else.

**Decision 3: Multi-model routing strategy.** Pick how many models the platform serves and how requests are routed across them. Three patterns dominate:

*   *Single-model fleet.* One model, many replicas. Simple, but you pay frontier-model prices for every request including the ones that didn't need it.
    
*   *Tiered routing.* A small/fast model handles easy requests; harder requests escalate to a large model. Saves cost dramatically when traffic is bimodal, but adds a routing decision that can itself be wrong.
    
*   *Multi-LoRA serving.* One base model in GPU memory, many fine-tuned adapters swapped in per request. Research like S-LoRA (Sheng et al. 2023) and Punica showed this is viable at scale. Powerful when you serve many tenants who each want their own fine-tune, but operationally complex.
    

The right answer depends on the workload. A consumer chat product is often a single-model fleet. A platform serving hundreds of enterprise tenants with their own fine-tunes leans toward multi-LoRA. A general-purpose API often runs tiered routing under the hood.
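To make the tiered-routing pattern concrete, here is a minimal sketch. It assumes the small model returns a confidence score alongside its answer, which is a simplification; `TieredRouter`, `threshold`, and the callables are hypothetical names. It also counts escalations, since a router that escalates most traffic pays for both calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TieredRouter:
    small: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])
    large: Callable[[str], str]
    threshold: float = 0.7                      # assumed decision boundary
    total: int = 0
    escalated: int = 0

    def route(self, prompt: str) -> str:
        """Try the small model first; escalate when its confidence is too low."""
        self.total += 1
        answer, confidence = self.small(prompt)
        if confidence >= self.threshold:
            return answer
        self.escalated += 1
        return self.large(prompt)

    @property
    def escalation_rate(self) -> float:
        """Fraction of requests that paid for both model calls."""
        return self.escalated / self.total if self.total else 0.0
```

How the confidence signal is produced (a classifier, a verifier, log-probabilities) is the hard part and is left open here; the sketch only shows where the threshold and the escalation metric sit.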

**Decision 4: Isolation and quota enforcement.** Pick what guarantees the platform makes per tenant and how they are enforced. Token-per-second quotas, request-per-second quotas, concurrent-request caps, and maximum-context-length caps all live here. The naive "rate limit at the gateway" pattern is insufficient — a single in-flight request with a 100K-token context can monopolise a replica regardless of how many requests-per-second the gateway lets through. Quotas have to be expressed in the resource the system is actually bottlenecked on: tokens, KV-cache pages, GPU-seconds.
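A quota expressed in tokens rather than requests might look like the token bucket below. This is a sketch under stated assumptions: the class and field names are hypothetical, and charging the full `max_tokens` up front is one possible policy (a real platform might refund unused output tokens when the request finishes).

```python
from dataclasses import dataclass


@dataclass
class TokenQuota:
    rate: float        # tokens replenished per second
    burst: float       # bucket capacity, in tokens
    max_context: int   # hard cap on prompt + max_tokens for one request
    level: float = 0.0
    last: float = 0.0

    def __post_init__(self) -> None:
        self.level = self.burst  # start with a full bucket

    def try_admit(self, prompt_tokens: int, max_tokens: int, now: float) -> bool:
        """Charge the request's worst-case token cost against the tenant's bucket."""
        cost = prompt_tokens + max_tokens
        if cost > self.max_context:
            return False  # too large regardless of available quota
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.level = min(self.burst, self.level + (now - self.last) * self.rate)
        self.last = now
        if cost > self.level:
            return False
        self.level -= cost
        return True
```

The point of the design is that one 100K-token request and a thousand 100-token requests draw on the same budget, which a requests-per-second limit cannot express.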

These four decisions cascade. The workload tiering decision constrains the hardware tier (interactive needs the fast chips). The hardware tier constrains the routing strategy (multi-LoRA needs enough GPU memory for the base model plus adapters). The routing strategy constrains the isolation model (tiered routing means a tenant's traffic crosses model boundaries, complicating per-tenant accounting). Choose them in order — anything else gets re-litigated three times.

* * *

## Continuous batching: the architecture, not the optimisation

Static batching (the classical approach where you collect N requests, run them as a single forward pass, return results, then collect the next batch) is unusable for LLM serving. The reason is asymmetric generation length. If one request in a batch of 32 generates 500 tokens and the others generate 50, the other 31 slots sit idle for the remaining 450 decode steps while the batch waits on the long generation. Throughput collapses.

Continuous batching, introduced by Orca (Yu et al., OSDI 2022) and now standard in serving stacks like vLLM, TGI, TensorRT-LLM, and DeepSpeed-MII, fixes this by batching at the *iteration* level instead of the request level. Every decoding step is its own batch. A request that finishes drops out; a new request that arrives joins in. The GPU is never waiting for the longest request in the batch.
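The difference is easy to see in a toy simulation. The sketch below counts decode iterations only; it ignores prefill and assumes every iteration costs the same regardless of how many slots are filled, so it illustrates the scheduling effect rather than modelling a real engine.

```python
from collections import deque
from typing import List


def static_steps(lengths: List[int], slots: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths: List[int], slots: int) -> int:
    """Decode steps when a freed slot is refilled at the very next iteration."""
    queue = deque(lengths)
    active: List[int] = []  # remaining output tokens per in-flight request
    steps = 0
    while queue or active:
        # Iteration-level scheduling: admit waiting requests into free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests drop out.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

With one 500-token generation followed by 31 generations of 50 tokens and four slots, `static_steps` gives 850 iterations and `continuous_steps` gives 550, because the three slots next to the long request keep turning over short requests instead of idling.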

The architectural consequences are sharper than they look:

*   The "batch" is no longer a unit anyone outside the inference engine sees. Requests don't queue waiting for a batch — they slot into an active iteration.
    
*   The KV cache becomes the dominant memory pressure. Every active request holds tens to thousands of tokens of cached attention state. PagedAttention (Kwon et al., SOSP 2023) was the breakthrough here — treating KV cache like virtual memory pages, allowing tighter packing and lower fragmentation.
    
*   Admission control happens at the page level, not the request level. The scheduler decides whether to admit a new request based on whether there's enough KV cache room to hold its prefix and its expected generation, not on whether the request queue is full.
    

If a candidate or a colleague says *"we'll batch requests every 10 milliseconds"* in 2026, that's an architectural smell. Continuous batching has been in the literature since 2022, and the serving frameworks that implement it are open source. The interesting questions are *how the scheduler decides which requests to admit*, *how prefill is interleaved with decode*, and *how prefix caches are shared across requests*, not whether to batch at all.
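Page-level admission control, as described above, can be sketched in a few lines. This version reserves pages for the prompt plus the full `max_tokens` at admission time, which is deliberately conservative; production engines typically allocate pages on demand and preempt under pressure. `PagePool` and the 16-token page size are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PagePool:
    total_pages: int
    page_tokens: int = 16  # tokens per KV-cache page (assumed)
    reserved: Dict[str, int] = field(default_factory=dict)

    @property
    def free_pages(self) -> int:
        return self.total_pages - sum(self.reserved.values())

    def pages_for(self, prompt_tokens: int, max_tokens: int) -> int:
        """Pages needed to hold the prompt plus the worst-case generation."""
        return math.ceil((prompt_tokens + max_tokens) / self.page_tokens)

    def try_admit(self, request_id: str, prompt_tokens: int, max_tokens: int) -> bool:
        """Admit only if the KV cache has room; queue length plays no part."""
        need = self.pages_for(prompt_tokens, max_tokens)
        if need > self.free_pages:
            return False
        self.reserved[request_id] = need
        return True

    def release(self, request_id: str) -> None:
        """Return a finished request's pages to the pool."""
        self.reserved.pop(request_id, None)
```

The trade-off is visible in the reservation policy: reserving the worst case avoids mid-generation eviction but leaves pages idle whenever requests stop early, which is why real schedulers tend to be more optimistic.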

* * *

## Prefill and decode are different workloads

A request to an LLM has two phases. *Prefill* processes the input prompt in parallel — a compute-bound operation, dominated by matrix multiplies, that scales roughly with the square of context length for attention. *Decode* generates output tokens one at a time, autoregressively — a memory-bandwidth-bound operation, dominated by KV-cache reads, that scales linearly with output length.

These two phases have wildly different performance characteristics. Prefill on a long context can take seconds on a single GPU. Decode is fast per token but throws away most of the GPU's compute capacity because the operation can't saturate matrix multiplies the way prefill does.

Serving stacks have tried two approaches:

*   *Interleaved scheduling.* Prefill and decode share the same GPU, with the scheduler picking which to advance each step. The cost: a long prefill can stall decode steps, hurting inter-token latency for active streams.
    
*   *Disaggregated serving.* Prefill and decode run on separate GPU pools, with the KV cache shipped between them. Research systems like DistServe (Zhong et al. 2024) and Splitwise (Patel et al. 2024) showed this can substantially improve tail latency for interactive workloads at the cost of extra system complexity.
    

Which one to pick is workload-dependent. Heavy long-context workloads benefit from disaggregation. Workloads dominated by short prompts and long generations often run fine on interleaved scheduling. Whichever you pick, the *non-decision* is the dangerous one — assuming all GPU time is fungible and ignoring the prefill/decode split is how you end up with a platform that passes throughput benchmarks and fails latency SLOs.
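For the interleaved case, one common way to keep a long prefill from stalling active streams is to give each iteration a token budget and split the prefill into chunks that fit the leftover. The sketch below shows only the planning step; `Request`, `plan_step`, and `token_budget` are hypothetical names, and a real scheduler would also account for KV-cache space and priorities.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    prompt_left: int  # prompt tokens not yet prefilled
    output_left: int  # output tokens not yet decoded


def plan_step(requests: List[Request], token_budget: int) -> List[Tuple[str, str, int]]:
    """Plan one iteration: decode tokens first, then prefill chunks with what is left."""
    plan: List[Tuple[str, str, int]] = []
    budget = token_budget
    # Each decoding request gets one token so active streams keep moving.
    for r in requests:
        if r.prompt_left == 0 and r.output_left > 0 and budget > 0:
            plan.append((r.name, "decode", 1))
            budget -= 1
    # Leftover budget goes to prefill in arrival order, possibly as a partial chunk.
    for r in requests:
        if r.prompt_left > 0 and budget > 0:
            chunk = min(r.prompt_left, budget)
            plan.append((r.name, "prefill", chunk))
            budget -= chunk
    return plan


def apply_plan(requests: List[Request], plan: List[Tuple[str, str, int]]) -> None:
    """Advance request state by the work planned for this iteration."""
    by_name = {r.name: r for r in requests}
    for name, phase, tokens in plan:
        if phase == "decode":
            by_name[name].output_left -= tokens
        else:
            by_name[name].prompt_left -= tokens
```

With a budget of 512, a 5,000-token prompt is spread over ten iterations and a concurrent chat stream still emits a token in each of them, instead of waiting for the whole prefill to finish.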

* * *

## Prefix caching, speculative decoding, and the latency budget

Two more techniques deserve first-class billing because they materially change the cost-versus-latency Pareto frontier.

**Prefix caching** keeps the KV-cache state for repeated prompt prefixes across requests, so the prefill work doesn't have to be redone. It applies to system prompts, chat history, and RAG-injected context: anything that appears at the start of many requests. The major API providers now expose it as a product feature: Anthropic shipped prompt caching publicly in August 2024 and OpenAI in October 2024, and Google's Gemini API offers context caching as well. For workloads with high prefix overlap (chat applications, RAG over a stable knowledge base), the cache hit rate often dominates platform economics, and the difference between a platform that exposes it and one that doesn't can be a factor of two on cost.
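One way to implement the lookup is to hash the prompt in fixed-size blocks, chaining each block's hash with the previous one so that a key identifies the whole prefix up to that point. The sketch below is illustrative only: the block size is tiny, the stored value is a placeholder rather than real KV tensors, and there is no eviction.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK = 4  # tokens per cached block (tiny, for illustration)


def block_keys(tokens: Sequence[int], block: int = BLOCK) -> List[str]:
    """One key per full block; each key covers the entire prefix up to that block."""
    keys: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        payload = prev + ":" + ",".join(map(str, tokens[i:i + block]))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}  # key -> placeholder for a stored KV block

    def cached_prefix_tokens(self, tokens: Sequence[int]) -> int:
        """Number of leading prompt tokens whose KV state is already cached."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            hit += BLOCK
        return hit

    def insert(self, tokens: Sequence[int]) -> None:
        for key in block_keys(tokens):
            self.blocks.setdefault(key, "kv-block")
```

Two requests that share an eight-token system prompt and then diverge will hit on the first two blocks and miss on the third, so only the divergent suffix needs prefill.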

**Speculative decoding** (Leviathan et al., ICML 2023; Chen et al. 2023) uses a smaller "draft" model to propose multiple tokens at once, which the main model then verifies in a single forward pass. When draft acceptance is high, throughput per request improves substantially. When acceptance is low, you've spent compute on the draft and gotten nothing — a small but real tail-latency penalty.
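The sensitivity to acceptance rate can be quantified with the closed-form estimate from Leviathan et al., which assumes each draft token is accepted independently with probability `alpha`. In the sketch below, `gamma` is the number of draft tokens per round and `draft_cost` is the draft model's per-token cost relative to one target forward pass; the function names are my own.

```python
def expected_tokens_per_verify(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass, given acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one from the target
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated wall-clock speedup over plain decoding.

    Each round costs gamma draft passes plus one target pass, measured in
    units of one target pass.
    """
    return expected_tokens_per_verify(alpha, gamma) / (gamma * draft_cost + 1)
```

With `alpha=0.8`, `gamma=4`, and `draft_cost=0.05` the estimate is about 2.8x. With `alpha=0.2` and `draft_cost=0.2` it falls below 1, which is the penalty case described above: the draft work is paid for and mostly discarded.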

The architectural lesson is that the latency budget is not just *prefill plus decode time*. It's the joint distribution over prefix-cache hits, draft acceptance rates, scheduler admission delays, queue wait times, and network. Optimise any one of those in isolation and you can make the others worse. The platforms that consistently hit their SLOs are the ones that instrument all of those terms separately and tune them as a system.

* * *

## Autoscaling under variable-latency workloads

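Classical autoscaling tracks utilisation (CPU, or GPU for classical model servers) or request rate. Both signals lie for LLM serving.

The GPU utilisation figure reported by `nvidia-smi` is the fraction of the sampling window in which at least one kernel was running. It says nothing about how much of the GPU's compute or memory bandwidth those kernels used, so it can sit near 100% on a replica with plenty of headroom. Compute-occupancy metrics mislead in the other direction: they read high during prefill and *low* during memory-bound decode, so a replica that looks 30% utilised on such a metric might be at 95% of its actual throughput ceiling because the bottleneck is HBM bandwidth, not compute.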

Request rate is similarly misleading. A replica handling 10 requests per second with average generation lengths of 50 tokens is at a completely different load than the same replica handling 10 requests per second with 500-token generations.

The signals that actually drive sound autoscaling decisions:

*   *Active KV-cache pages as a fraction of total.* This is the closest thing to "memory pressure" for an LLM server.
    
*   *Time-to-first-token P95.* When TTFT starts climbing, the prefill side is saturating.
    
*   *Inter-token latency P95.* When ITL climbs, the decode side is saturating.
    
*   *Queue depth measured in tokens, not requests.* A queue with 100 requests of 100 input tokens each is a different load from a queue with 5 requests of 100K input tokens each.
    

Autoscale on those, not on the metrics your existing infrastructure platform makes easy.
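As a rough sketch, the four signals can be combined by scaling on whichever one is furthest over its target. The target values below are placeholders I have assumed for illustration, not recommendations, and the dataclass and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaSignals:
    kv_page_occupancy: float  # active KV pages / total pages, in [0, 1]
    ttft_p95_s: float
    itl_p95_s: float
    queued_tokens: int


@dataclass
class Targets:
    kv_page_occupancy: float = 0.85  # placeholder targets; tune per workload tier
    ttft_p95_s: float = 0.5
    itl_p95_s: float = 0.05
    queued_tokens: int = 200_000


def desired_replicas(current: int, s: ReplicaSignals, t: Optional[Targets] = None) -> int:
    """Scale proportionally to the most-stressed signal relative to its target."""
    t = t or Targets()
    pressure = max(
        s.kv_page_occupancy / t.kv_page_occupancy,
        s.ttft_p95_s / t.ttft_p95_s,
        s.itl_p95_s / t.itl_p95_s,
        s.queued_tokens / t.queued_tokens,
    )
    return max(1, math.ceil(current * pressure))
```

Taking the maximum means a saturated decode side triggers scale-out even when the queue looks empty. A production controller would add smoothing and a scale-in cooldown, which this sketch omits.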

The other autoscaling trap is cold-start. Pulling a 70B model from object storage and loading it into GPU memory can take minutes per replica, not seconds. The platform has to scale ahead of demand, not in reaction to it — which means forecasting, pre-warming, and accepting some over-provisioning as the cost of meeting latency SLOs during traffic ramps.

* * *

## Failure modes

Seven failure modes recur. I've ordered them roughly by severity.

**1\. Head-of-line blocking from long generations.** One request asks for 4,000 output tokens and holds its batch slot and KV-cache pages for the whole generation; requests queued behind it on that replica wait for admission. The fix is admission control by expected generation length, separate replicas (or separate scheduling tiers) for long-form workloads, and hard caps on `max_tokens` enforced at the platform level.

**2\. KV cache OOM under load spike.** A burst of long-context requests fills the KV cache faster than active requests can release it. New admissions stall or fail. The fix is page-level admission control, a real eviction policy for the prefix cache, and clear backpressure to the gateway.

**3\. Prefill stall starving decode (or the reverse).** A long-context prefill monopolises the GPU for several seconds; active streams see their inter-token latency spike. The fix is scheduler tuning, chunked prefill (process long prefixes in pieces), and in the most demanding workloads, disaggregated prefill-decode.

**4\. Queue depth runaway and cascading timeouts.** The gateway accepts requests faster than replicas can drain them; clients time out and retry; retries deepen the queue. The fix is bounded queues with explicit reject behaviour (load shedding is a feature, not a failure), backpressure signalled to the gateway, and client SDKs that respect `Retry-After`.

**5\. Cold start on scale-out or model rotation.** A new replica takes minutes to come online; demand peaks before capacity arrives. The fix is forecast-based pre-scaling, model weight caching on local NVMe, and warm-pool replicas held in reserve for known-bursty workloads.

**6\. Routing thrash in tiered systems.** The small model is called, fails confidence, escalates to the large model — and does this for most requests, paying the cost of both calls without the savings of either. The fix is to make the router's threshold a real decision boundary, evaluate it offline regularly, and instrument the escalation rate as a first-class metric.

**7\. Model-version drift across replicas.** A rolling upgrade leaves some replicas on version A and others on version B; users see non-deterministic behaviour across requests. The fix is version-aware routing (a request's session sticks to one version for its lifetime), and a clear contract about whether version pinning is a tenant-facing guarantee.
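The bounded-queue fix from failure mode 4 can be sketched as a queue whose limit is expressed in tokens and which rejects explicitly with a retry hint. The class name, the use of HTTP 429, and the way the retry delay is estimated from a measured drain rate are all assumptions made for illustration.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class TokenBoundedQueue:
    max_queued_tokens: int
    drain_tokens_per_s: float  # measured drain rate of the replicas (assumed known)
    queued_tokens: int = 0
    items: Deque[Tuple[str, int]] = field(default_factory=deque)

    def offer(self, request_id: str, tokens: int) -> Tuple[int, Optional[int]]:
        """Return (HTTP status, Retry-After seconds or None)."""
        overflow = self.queued_tokens + tokens - self.max_queued_tokens
        if overflow > 0:
            # Shed load explicitly and say roughly when room should open up.
            retry_after = max(1, math.ceil(overflow / self.drain_tokens_per_s))
            return 429, retry_after
        self.items.append((request_id, tokens))
        self.queued_tokens += tokens
        return 202, None

    def take(self) -> Optional[str]:
        """Hand the oldest queued request to a replica."""
        if not self.items:
            return None
        request_id, tokens = self.items.popleft()
        self.queued_tokens -= tokens
        return request_id
```

Rejecting at the door with a concrete delay keeps retries from piling onto a queue that is already too deep, provided client SDKs honour the hint.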

The interview-favourite failure mode — *"what happens when the model itself produces unsafe output?"* — belongs to the application layer above this platform, not the platform itself. The inference platform's responsibility is to deliver tokens reliably and economically. Content safety, citation grounding, hallucination control — those are upstream concerns. Conflating them is how teams end up with a platform that is bad at both.

* * *

## Cost economics

LLM inference is priced in dollars per million tokens, but its production cost is incurred in GPU-seconds, and the conversion between the two is where the platform earns or loses its margin.

A useful mental model: take a replica's measured tokens-per-second throughput and multiply by 3,600 to get tokens per GPU-hour, then divide the GPU's hourly cost by that figure. The result is the platform's break-even cost per token; multiply it by one million to compare against the published per-million-token prices of the major API providers, and you have a reasonable read on how close to the frontier the platform's engineering is.
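The conversion from throughput to break-even cost is a one-liner, sketched here with an optional utilisation factor for the share of each hour a GPU spends serving billable tokens. The function name and the dollar figure in the usage note are hypothetical.

```python
def cost_per_million_tokens(tokens_per_s: float, gpu_hourly_cost: float,
                            utilisation: float = 1.0) -> float:
    """Break-even cost in dollars per million tokens for one GPU.

    tokens_per_s:    measured sustained throughput of the replica, per GPU
    gpu_hourly_cost: what one GPU costs per hour, in dollars
    utilisation:     fraction of each hour spent serving billable tokens
    """
    if tokens_per_s <= 0 or not 0.0 < utilisation <= 1.0:
        raise ValueError("throughput must be positive and utilisation in (0, 1]")
    tokens_per_hour = tokens_per_s * 3600 * utilisation
    return gpu_hourly_cost / tokens_per_hour * 1_000_000
```

At an assumed $3.60 per GPU-hour, 1,800 tokens/sec/GPU works out to roughly $0.56 per million tokens and 900 tokens/sec/GPU to roughly $1.11, which is the factor of two that the throughput requirement describes.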

The four levers that move that number, in order of typical impact:

*   *Batching efficiency.* The single biggest lever. Continuous batching with good admission control delivers two to four times the throughput of static batching on realistic workloads.
    
*   *Quantisation.* Serving a 70B model in FP8 or INT8 instead of FP16 roughly doubles the throughput per GPU at small quality cost. The major open-source toolchains (GPTQ, AWQ, FP8 in TensorRT-LLM) have made this routine.
    
*   *Prefix caching hit rate.* For workloads with stable system prompts or repeated RAG contexts, this can cut prefill cost by an order of magnitude.
    
*   *Hardware tier.* The newest accelerators have higher memory bandwidth and faster interconnect; for many workloads this translates into proportionally lower cost per token, but only if the rest of the stack is good enough to use it.
    

The trap is buying the latest GPUs and running an inefficient serving stack on them. The first lever is worth more than the fourth.

* * *

## Architect's Checklist

A reusable list for designing or reviewing an LLM inference platform:

1.  Express latency targets as TTFT and ITL at P50/P95/P99 — never as averages.
    
2.  Commit to workload tiers with explicit SLOs; do not let interactive and batch share a queue.
    
3.  Pick a serving stack that does continuous batching with PagedAttention or equivalent — and verify, don't trust the marketing.
    
4.  Decide prefill/decode strategy: interleaved vs disaggregated, based on workload context-length profile.
    
5.  Expose prefix caching to the workload, and instrument hit rate as a top-line metric.
    
6.  Decide multi-model strategy: single fleet, tiered, multi-LoRA. Match it to the tenant profile.
    
7.  Quantise where quality allows; measure quality change with an offline evaluation suite, not vibes.
    
8.  Bound `max_tokens` at the platform level. The gateway must enforce it; do not rely on clients.
    
9.  Express quotas in the bottleneck resource: tokens, KV-cache pages, GPU-seconds. Not just requests-per-second.
    
10.  Autoscale on KV-cache occupancy and TTFT/ITL P95 — not GPU utilisation or request rate.
     
11.  Pre-warm replicas based on traffic forecasts; do not scale reactively for models with multi-minute load times.
     
12.  Instrument every failure mode listed above with a dashboard. The list is the runbook.
     

* * *

## Architect's Mental Model

The shift from classical web infrastructure to LLM serving is the shift from *uniform small requests on cheap commodity hardware* to *variable large requests on scarce expensive hardware*. Almost every assumption has to be re-examined.

Stateless replicas are out: every active request carries megabytes to gigabytes of KV-cache state that the replica must hold for the duration of the generation. Static batching is out: the request-length variance destroys efficiency. Average-latency SLOs are out: the workload is tail-dominated and the customer experience is tail-determined. Reactive autoscaling on CPU is out: the relevant signals are KV-cache pressure and token-level latencies, and cold-start is measured in minutes.

What replaces them: continuous batching as the architectural unit, prefill and decode treated as distinct workloads, prefix caching and speculative decoding as first-class economic levers, workload tiers as the way to keep tail latency contained, and quotas expressed in the bottleneck resource rather than at the gateway.

This article connects to the previous Architecting Agents series in two specific places. First, the observability discipline from article 4 of that series is the *only* way to operate a platform of this complexity — TTFT and ITL histograms, KV-cache occupancy gauges, and per-tenant token accounting are the dials that keep the system honest. Second, the cost-bounded agent pattern from article 5 has its dual here: token budgets in the agent translate directly into GPU-seconds on the platform, and the two have to be designed together if the unit economics are going to work.

> *Throughput pays the bills. Tail latency loses the customer. Build for both — then prove both with a histogram, not an average.*

* * *

## Series Progress

| # | Article | Status |
| --- | --- | --- |
| 1 | Designing an AI-Augmented Search Engine | Published |
| 2 | Designing a Real-Time Coding Assistant | Published |
| 3 | Designing a Multi-Tenant RAG Knowledge Base | Published |
| 4 | Designing a Customer Support Agent at Scale | Published |
| 5 | Designing an LLM Inference Platform | You are here |
| 6 | Designing an Agentic Workflow Engine | Next |

* * *
