Designing a Real-Time AI Coding Assistant

🗓️ Last updated: June 2026

Ask any senior engineer to design a code completion product in a system design interview, and the first thirty minutes sound exactly like the autocomplete designs we have been drawing for a decade. Lexer, parser, scope resolution, ranked suggestions, debounce on keystroke, language server protocol over a local socket. Most of the diagram still works, which is the comforting part.

Then the interviewer adds one constraint: the suggestions are generated by a large language model. Every line of that tidy diagram quietly turns into a latency, cost, or correctness problem wearing a disguise. A user typing at 60 words per minute hits a keystroke roughly every 200 milliseconds. The model call, end to end, is usually slower than that. So here is the whole game in one sentence: everything else in the architecture exists to make a model that is slower than the user feel faster than the user.

This is the second article in System Design, Reimagined. In article 1 we looked at search, where the LLM bolted a synthesis layer on top of a classical system that kept doing its job underneath. Coding assistants are a meaner case. The LLM does not sit politely on top. It replaces the exact part of the system the user touches all day. Classical autocomplete gets demoted to understudy.

Requirements: The User Is Typing, and the User Is Impatient

The hard requirement is timing. Direct manipulation interfaces have a perception threshold somewhere around 100 milliseconds, beyond which the user starts to notice the gap and resent it. This is one of the oldest results in human computer interaction, popularized by Nielsen's response time guidelines and grounded in earlier human factors research. For an inline completion that has to appear during the pause between keystrokes, the budget is brutal, and the model call is the single fattest line item in it.

The functional surface is wider than classical autocomplete:

Inline completion: ghost text after the cursor, one keystroke to accept
Multi line edits: the assistant proposes deleting lines, inserting lines, restructuring blocks
Chat in the IDE: a back and forth conversation about the current file or selection
Command palette actions: "explain this", "write a test for this", "fix the lint error"

These four modes have different latency budgets and different definitions of "good." Treating them as one product is the most popular architectural mistake in the genre. Inline completion has to be felt inside a single keystroke. Multi line edits can take a beat. Chat can take a few seconds and nobody dies. A "write a test" action can take longer still, if you let it go fetch context first.

Three nonfunctional requirements run the whole show:

Perceived latency, not actual latency, is the budget. The assistant is allowed to be slower than 100 milliseconds as long as the user never feels the gap. This is the entire architecture, hiding in plain sight.
Wrong is worse than slow. A bad suggestion costs the developer trust and time. A slow but correct suggestion costs almost nothing. Coding assistants get fired for hallucinated APIs, not for being a little behind.
Cost is bounded per developer hour, not per query. A developer fires thousands of keystrokes a day. The cost model has to amortize across the whole hour, not sweat the single completion.

Hold onto those three. Every interesting decision ladders back to one of them, and the boring ones do too.

The Classical Baseline: What Autocomplete Used to Be

Before LLMs, code completion looked like this:

A language server (one per language, speaking LSP) kept a parsed, type resolved model of the project in memory.
On every keystroke, the editor sent a textDocument/completion request to that server.
The server returned ranked candidates based on the syntax tree at the cursor: identifiers in scope, methods on the receiver's type, imported symbols, snippets.
The editor deduplicated and rendered the list.

The architecture had three genuinely nice properties. It was deterministic (same cursor, same suggestions). It was fast (a few milliseconds on a warm cache). And it was correct in a narrow but honest sense: every suggestion was a real symbol that actually existed in the project.

It also had a hard ceiling. The language server could only ever suggest things that already existed. It could not write a regex, generate a test, complete a function body, or refactor a block. Those were exactly the use cases the LLM assistants walked in and ate first.

The classical baseline did not go away. It still runs underneath the LLM completions in every major IDE, and it is still the right answer when the user just needs to disambiguate a method name. The actual design problem is knowing when to defer to the model and when to defer to the LSP, and being honest that the model is not always the better answer.

The Three Decisions That Cascade Through Everything Else

Once you accept that the model is in the loop of a typing user, three decisions ripple through the rest of the architecture whether you planned for them or not.

Decision 1: Where does the model run?

On GPUs you host yourself, behind a managed inference API, or as a small local model on the developer's laptop? Each has a different latency floor and a different cost curve. Local models can hit sub 100 millisecond completions but cap out on quality. Managed frontier models give better completions and charge you a round trip on every call. Most production assistants today run a blended strategy: a small local or self hosted model for inline completions, a larger frontier model for chat and multi line edits. Yes, that means maintaining two model paths. Welcome to the job.

Decision 2: When does the model fire?

On every keystroke is too aggressive (cost, rate limits, jitter). Only on demand is too passive (the user has to remember the feature exists). The middle path is a debouncer that fires after a short idle pause, plus explicit triggers like Tab or a command palette shortcut. This sounds trivial. Tuning it is quietly the highest leverage UX knob in the entire product, and the one most teams ship at its default value and never touch again.
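To make the knob concrete, here is a minimal sketch of a trailing edge debouncer, written as a pure function over keystroke timestamps so it is easy to reason about. The function name, the sample timestamps, and both idle windows are illustrative assumptions, not values from any shipping product.

```python
from typing import List


def debounce_fire_times(keystrokes_ms: List[int], idle_ms: int) -> List[int]:
    """Return the times at which a trailing edge debouncer would fire.

    A request fires idle_ms after a keystroke, but only if no other
    keystroke arrives before that idle window has fully elapsed.
    """
    fires = []
    for i, t in enumerate(keystrokes_ms):
        is_last = i == len(keystrokes_ms) - 1
        if is_last or keystrokes_ms[i + 1] - t > idle_ms:
            fires.append(t + idle_ms)
    return fires


# Hypothetical typing burst: steady 200 ms keystrokes, a pause, then two fast keys.
keys = [0, 200, 400, 1000, 1150]

short_window = debounce_fire_times(keys, idle_ms=150)  # fires on almost every key
long_window = debounce_fire_times(keys, idle_ms=250)   # fires only on real pauses
print(short_window, long_window)
```

With the shorter window the assistant fires four times for five keystrokes. With the longer one it fires twice. Same user, same typing, very different bill.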

Decision 3: What context does the model see?

A modern frontier model has a context window big enough to swallow a mid sized repository. That is not an invitation to send it the entire repository on every keystroke. Cost scales with input tokens, and so does prefill latency, the time the model spends reading your prompt before it produces a single output token. The art is selecting the smallest context that still makes the suggestion right. The model will happily read everything you give it and bill you for the privilege.

Each of these has a naive answer and a "what production teams actually do" answer. We will work through them.

Context Selection: What Goes Into the Prompt

The prompt sent to the model for an inline completion is the most over engineered single string in the system. It usually contains, in order:

System prompt: capabilities, style, "do not invent APIs", and so on. Largely static. An excellent prefix cache candidate.
Project metadata: language, framework, test framework, lint config. Static per project, so cache it.
Selected file context: the current file, often trimmed to a window around the cursor.
Cross file context: a few related files, such as the test for the current file, the file defining the type at the cursor, recently edited files.
Symbols from the language server: type signatures, imports, classes in scope. The LSP is still earning its salary here.
Cursor marker: a sentinel token meaning "complete here".

The two interesting moves are cross file selection and symbol extraction. Cross file selection turns the assistant from a per file autocomplete into a project aware one, and it is also the single biggest contributor to prompt length, because of course it is. Cheap heuristics (most recently edited files, files imported by the current file, the matching test file) routinely beat learned retrieval systems for this use case. The recency signal is strong enough that fancier retrieval rarely earns back its complexity for inline completion.
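The cheap heuristics above can be sketched as a small scoring function. Everything here is an assumption made for illustration: the `Candidate` type, the weights, and the `test_<name>` naming convention. A real assistant would tune these against its own acceptance data.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Set


@dataclass(frozen=True)
class Candidate:
    path: str
    seconds_since_edit: float


def score(current_path: str, cand: Candidate, imported: Set[str]) -> float:
    """Combine three cheap signals: matching test file, import edge, recency."""
    stem = PurePosixPath(current_path).stem
    cand_stem = PurePosixPath(cand.path).stem
    total = 0.0
    if cand_stem in (f"test_{stem}", f"{stem}_test"):
        total += 3.0  # hypothetical weight for the matching test file
    if cand.path in imported:
        total += 2.0  # hypothetical weight for files the current file imports
    # Recency decays smoothly: edited just now is about 1.0, an hour ago about 0.016.
    total += 1.0 / (1.0 + cand.seconds_since_edit / 60.0)
    return total


def rank_context_files(current_path: str, candidates: List[Candidate],
                       imported: Set[str], limit: int = 3) -> List[str]:
    """Return the top paths to include as cross file context."""
    others = [c for c in candidates if c.path != current_path]
    ranked = sorted(others, key=lambda c: score(current_path, c, imported), reverse=True)
    return [c.path for c in ranked[:limit]]


picked = rank_context_files(
    "src/billing.py",
    [
        Candidate("src/billing.py", 5),
        Candidate("src/old.py", 86400),
        Candidate("src/report.py", 30),
        Candidate("src/money.py", 600),
        Candidate("tests/test_billing.py", 3600),
    ],
    imported={"src/money.py"},
)
print(picked)
```

The `limit` parameter is the part that protects the prompt budget: ranking is cheap, but every file that makes the cut is paid for in prefill.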

Symbol extraction is the LSP's revenge. The model does not need the full definition of every class in scope. It needs the type signatures. A few hundred tokens of curated symbol context routinely outperforms several thousand tokens of raw file content. Treat the language server as a context compressor, not a sad fallback you keep around for old times' sake.
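As a rough illustration of the compression idea, the sketch below uses Python's standard `ast` module as a stand-in for a language server and reduces a source file to its signatures. A real assistant would ask the LSP for this information instead of parsing files itself; the helper names are hypothetical, and `ast.unparse` needs Python 3.9 or newer.

```python
import ast
from typing import List


def _signature(fn) -> str:
    """Render one function node as a single signature line, without its body."""
    keyword = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(fn.returns)}" if fn.returns else ""
    return f"{keyword} {fn.name}({ast.unparse(fn.args)}){returns}"


def extract_signatures(source: str) -> List[str]:
    """Keep class names and function signatures, drop every function body."""
    functions = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, functions):
            lines.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            lines.extend("    " + _signature(item)
                         for item in node.body if isinstance(item, functions))
    return lines


SAMPLE = '''
class Invoice:
    def charge(self, amount: int) -> bool:
        for attempt in range(3):
            pass
        return True

def parse_amount(raw: str) -> int:
    return int(raw.strip())
'''

signatures = extract_signatures(SAMPLE)
print("\n".join(signatures))
```

The bodies are gone and the types survive, which is the trade the paragraph above is arguing for.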

The Latency Budget: Where the Milliseconds Go to Die

Inline completion gives you, generously, a few hundred milliseconds before the user feels the lag. That budget has to cover all of this:

Debounce. A short idle window after the last keystroke before firing. Often tens to low hundreds of milliseconds.
Context assembly. LSP queries, file reads, prompt construction. Targetable: tens of milliseconds.
Network. Round trip to the inference endpoint. Tens to low hundreds of milliseconds.
Prefill. The model reading the prompt. Scales with prompt length. The first big lever for caching.
Generation. First tokens produced. Streamed, so the user sees output as it arrives.
Render. Editor draws the ghost text. Negligible if you implemented it with any care.

Two facts dominate this budget.

First, prefill is often a bigger share of latency than generation for short completions, because the prompt is long and the output is short. This is the exact inverse of chat, where the output dominates. It is also why prefix caching matters so much here: the same multi thousand token prefix gets re sent on every single keystroke, and paying full price for it every time is how you light money on fire one completion at a time.

Second, the debounce is part of the budget, not a freebie sitting outside it. A long debounce on a tight total budget leaves almost nothing for everything else. Many teams shorten the debounce, fire speculatively, and cancel in flight requests when the user keeps typing. That trades money up for perceived latency down, which is the trade you make over and over in this system until you run out of either patience or budget.
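A quick piece of arithmetic shows how fast the debounce eats the budget. Every number below is a made up assumption chosen to sit inside the ranges listed above; none of them is a measurement.

```python
from typing import Dict


def headroom_ms(budget_ms: int, stages_ms: Dict[str, int]) -> int:
    """Milliseconds left over (negative means the user feels the lag)."""
    return budget_ms - sum(stages_ms.values())


# Hypothetical per stage costs for one inline completion.
stages = {
    "debounce": 75,
    "context_assembly": 20,
    "network": 60,
    "prefill": 90,
    "first_tokens": 40,
    "render": 5,
}

tight = headroom_ms(300, stages)                             # short debounce
blown = headroom_ms(300, {**stages, "debounce": 150})        # long debounce
print(tight, blown)
```

Doubling the debounce turns a budget that barely fits into one that is 65 milliseconds over, without touching the model at all.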

Streaming and Speculative Completion

Two architectural moves do most of the work of hiding the model's latency.

Streaming is table stakes. The assistant does not wait for the full completion before rendering. It streams tokens as they arrive over SSE or a WebSocket. The user sees the first few tokens within a fraction of the total time, and if those tokens look wrong, they keep typing and the request gets cancelled. This is the same insight that made search results render incrementally: time to first byte beats time to last byte, every time a human is watching.
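Here is a small sketch of the render loop on the editor side, assuming tokens arrive through an iterator. The token stream is faked with a generator, and `consistent` is a hypothetical rule for deciding whether ghost text still agrees with what the user has typed since the request was sent.

```python
from typing import Iterable, Iterator


def fake_token_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Stand-in for an SSE or WebSocket stream of model tokens."""
    for tok in tokens:
        yield tok


def consistent(ghost: str, typed: str) -> bool:
    """Ghost text is still useful if one string is a prefix of the other."""
    return typed.startswith(ghost) or ghost.startswith(typed)


def render_stream(stream: Iterator[str], typed_since_request: str) -> str:
    """Show tokens as they arrive; stop as soon as they contradict the user."""
    shown = ""
    for tok in stream:
        candidate = shown + tok
        if not consistent(candidate, typed_since_request):
            break  # stop consuming; a real client would also cancel the request
        shown = candidate
    return shown


tokens = ["return ", "a ", "+ ", "b"]
idle_user = render_stream(fake_token_stream(tokens), "")
diverging_user = render_stream(fake_token_stream(tokens), "return a * ")
print(repr(idle_user), repr(diverging_user))
```

The second call stops after two tokens because the user typed `*` where the model wanted `+`. The remaining tokens are never rendered, and in a real client never paid for.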

Speculative completion is the more interesting move. The assistant fires a request during the debounce, before the user has actually paused, betting that they are about to. If the user keeps typing, the request is cancelled and the partial cost is eaten with a shrug. If the user pauses, the completion is already in flight and lands sooner. Cancellation discipline is the whole ballgame here. Uncancelled speculative requests are the single largest source of cost inflation in a poorly tuned assistant, and "poorly tuned" is the factory default.
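Cancellation discipline is easier to see in code than in prose. The sketch below uses `asyncio` tasks: every keystroke cancels the request in flight and starts a new one, and the class counts both so the cancellation ratio can be tracked. `SpeculativeCompleter` and `fake_model` are hypothetical names, and the sleep durations are arbitrary stand-ins for model and typing latency.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SpeculativeCompleter:
    """Fire on every keystroke, cancel whatever was already in flight."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._inflight: Optional[asyncio.Task] = None
        self.started = 0
        self.cancelled = 0

    def on_keystroke(self, buffer: str) -> None:
        if self._inflight is not None and not self._inflight.done():
            self._inflight.cancel()  # the stale request must not keep running
            self.cancelled += 1
        self._inflight = asyncio.create_task(self._fetch(buffer))
        self.started += 1

    async def result(self) -> str:
        assert self._inflight is not None, "no request has been started"
        return await self._inflight


async def fake_model(buffer: str) -> str:
    await asyncio.sleep(0.05)  # pretend the model takes 50 ms
    return buffer + "<completion>"


async def demo():
    completer = SpeculativeCompleter(fake_model)
    for text in ["f", "fo", "foo"]:
        completer.on_keystroke(text)
        await asyncio.sleep(0.01)  # the user types faster than the model answers
    return await completer.result(), completer.started, completer.cancelled


output, started, cancelled = asyncio.run(demo())
print(output, started, cancelled)
```

Only the last request survives. Note that cancelling the local task is just half the job: the cancellation also has to reach the inference endpoint, or the tokens get generated and billed anyway.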

A related, more research flavored technique is speculative decoding at the model level: a small "draft" model proposes several tokens ahead and the larger "target" model verifies them in a single parallel pass, keeping the ones it accepts. Published work (Leviathan et al., 2023) shows real decoding speedups without changing the output distribution. This is an inference platform optimization (see article 5), not something most product teams build themselves, but it is worth knowing about, because it quietly shifts the latency versus cost frontier underneath you while you are not looking.
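For intuition only, here is a toy version of the draft and verify loop. It is a simplified greedy variant, not the sampling based acceptance rule from the paper, and both "models" are lookup tables invented for the example. In a real system the verification step is one batched forward pass of the target model; here it is simulated with a Python loop and counted as a single pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]


def speculative_decode(draft: NextToken, target: NextToken,
                       k: int, max_new: int) -> Tuple[List[str], int]:
    """Greedy toy: accept draft tokens while the target agrees with them."""
    out: List[str] = []
    target_passes = 0
    while len(out) < max_new:
        # 1. The draft model proposes up to k tokens, one after another.
        proposal: List[str] = []
        for _ in range(min(k, max_new - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (one parallel pass in practice).
        target_passes += 1
        base = list(out)
        for i, tok in enumerate(proposal):
            expected = target(base + proposal[:i])
            out.append(expected)
            if expected != tok:
                break  # first disagreement: keep the target's token, drop the rest
    return out, target_passes


TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]
DRAFT = ["def", "add", "(", "a", ",", "c", ")", ":"]  # wrong at one position

decoded, passes = speculative_decode(
    draft=lambda ctx: DRAFT[len(ctx)],
    target=lambda ctx: TARGET[len(ctx)],
    k=4,
    max_new=len(TARGET),
)
print(decoded, passes)
```

The output is exactly what the target alone would have produced, but it took three target passes instead of eight. The one wrong draft token cost a single round, not the whole completion.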

Prefix Caching: The Single Biggest Lever

The prompt for an inline completion is dominated by a stable prefix (system prompt, project metadata, the file before the cursor) followed by a small unstable suffix near the cursor. If your inference provider supports prefix caching (OpenAI, Anthropic, and Google all do as of late 2024), the prefill cost for that stable prefix drops sharply on later calls. Cached input pricing on the published rate cards is a fraction of the uncached rate, and the latency win is even bigger than the billing win.

The architectural implication is blunt: prompt structure should be designed around the cache, not around the convenience of whoever wrote the prompt. Put the stable parts first. Keep them byte identical across calls. Push volatile content (the cursor neighborhood, recent edits, retrieved snippets) to the end. Avoid timestamps, request IDs, or any per request sparkle in the prefix, because each one invalidates the cache for absolutely no benefit.
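A minimal sketch of that layout, assuming a prompt built from plain strings. The `PromptParts` fields, the `<CURSOR>` sentinel, and the fingerprint helper are all hypothetical. The fingerprint is a cheap local check that the prefix really is byte identical across calls; it is not any provider's cache key.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    system_prompt: str      # stable across all requests
    project_metadata: str   # stable per project
    cursor_window: str      # volatile: changes on every keystroke
    request_id: str         # volatile: must never appear in the prefix


def stable_prefix(parts: PromptParts) -> str:
    """Only content that is identical from one keystroke to the next."""
    return parts.system_prompt + "\n" + parts.project_metadata + "\n"


def build_prompt(parts: PromptParts) -> str:
    """Stable prefix first, volatile suffix last, cursor sentinel at the end."""
    return stable_prefix(parts) + parts.cursor_window + "<CURSOR>"


def prefix_fingerprint(parts: PromptParts) -> str:
    return hashlib.sha256(stable_prefix(parts).encode("utf-8")).hexdigest()


first = PromptParts("You complete code.", "lang=python tests=pytest",
                    "def total(items):\n    return ", "req-001")
second = PromptParts("You complete code.", "lang=python tests=pytest",
                     "def total(items):\n    return sum(", "req-002")

same_prefix = prefix_fingerprint(first) == prefix_fingerprint(second)
print(same_prefix)
```

Logging the fingerprint per request is one inexpensive way to catch the day someone adds a timestamp to the top of the prompt: the fingerprint starts changing on every call.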

This one discipline bends the cost and latency curves more than any model upgrade will. It is the dullest, highest leverage move in the architecture, and it is also the easiest one to break by accident the next time someone "just adds one little field" to the top of the prompt.

Evaluation: Offline vs Online

Coding assistants have two evaluation regimes, and most teams only do the first one well, then act surprised.

Offline evaluation runs the assistant against a held out set of (context, expected completion) pairs and measures token level or function level match. Public benchmarks like HumanEval and MBPP are the genre's reference points. Offline eval is fast, deterministic, and useful for ranking model variants. It is also a weak proxy for what users actually feel, which it will never admit on its own.

Online evaluation is the one that matters. It measures things no static dataset can give you:

Acceptance rate: what fraction of shown completions did the user actually accept?
Retention: how much of the accepted code survived an hour, a day, a week?
Time to edit: did the user have to fix the accepted code, and how much?
Cancellation rate: how often does the user type straight through a suggestion as if it were not there?

Online metrics expose the failure modes offline eval cheerfully misses: completions that look right but are subtly wrong, suggestions that are technically correct but stylistically off, code that is right for the model and wrong for this codebase. The architectural cost is real. You need an event pipeline, telemetry consent, and a way to tie edits back to the completions that caused them. Build it from day one, because bolting it on later is the kind of project nobody volunteers for twice.
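Once the events exist, the metrics themselves are short. The sketch below assumes a hypothetical `CompletionEvent` record per shown completion and computes three of the four numbers; the field names and the one hour survival window are illustrative choices, and the sample events are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CompletionEvent:
    accepted: bool
    typed_through: bool          # user kept typing as if nothing was shown
    chars_accepted: int = 0
    chars_surviving_1h: int = 0  # accepted characters still present an hour later


def acceptance_rate(events: List[CompletionEvent]) -> float:
    return sum(e.accepted for e in events) / len(events)


def cancellation_rate(events: List[CompletionEvent]) -> float:
    return sum(e.typed_through for e in events) / len(events)


def retention(events: List[CompletionEvent]) -> float:
    """Share of accepted characters that survived the window."""
    accepted = sum(e.chars_accepted for e in events)
    surviving = sum(e.chars_surviving_1h for e in events)
    return surviving / accepted if accepted else 0.0


events = [
    CompletionEvent(accepted=True, typed_through=False, chars_accepted=40, chars_surviving_1h=40),
    CompletionEvent(accepted=True, typed_through=False, chars_accepted=60, chars_surviving_1h=30),
    CompletionEvent(accepted=False, typed_through=True),
    CompletionEvent(accepted=False, typed_through=False),
]

print(acceptance_rate(events), retention(events), cancellation_rate(events))
```

Half the completions were accepted, but only 70 percent of the accepted characters were still there an hour later. The first number is the flattering one.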

Failure Modes

These are the failures real coding assistants actually exhibit, in roughly the order of how fast they burn trust:

1. Hallucinated APIs. The model invents a function, method, or import that does not exist. The user accepts it, the code refuses to compile, and the assistant's credibility takes the hit. Mitigations: ground the prompt in real symbols from the LSP, post filter completions against a symbol table, and keep the model focused on local edits rather than cross file fan fiction.
2. Broken multi line edits. The assistant proposes an edit that deletes a closing brace, drops an import, or duplicates a block. Mitigations: apply edits in a staging buffer, run a syntax check before showing the diff, and use structured edit formats (line anchored diffs) instead of free form text replacement.
3. Style drift. The assistant produces working code that ignores the codebase's conventions: wrong naming, wrong error handling, wrong test framework. Mitigations: feed style examples into the context and pin the model to a project specific system prompt.
4. Stale context. The assistant uses a file version that does not reflect the user's most recent unsaved edits, then confidently completes against a reality that no longer exists. Mitigations: source context from the editor's buffer, not from disk, and version every snippet you send.
5. Cost runaways. A misconfigured debounce or uncancelled speculative requests balloon the per developer hour cost. Mitigations: hard per user, per minute budgets at the gateway, alerting on token consumption per user, and the cancellation discipline you were supposed to have anyway.

Failures 1 and 2 erode trust. Failures 3 and 4 erode utility. Failure 5 erodes margin. In that order, which is also the order your engineering investment should go.
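The two trust mitigations, a syntax check on staged edits and a post filter against known symbols, can be sketched with the standard `ast` module. Python stands in for whatever language the assistant targets, and `known_symbols` is a hypothetical stand-in for a symbol table that would really come from the language server.

```python
import ast
from typing import Set


def passes_syntax_check(staged_source: str) -> bool:
    """Parse the staged buffer; a failed parse means the edit is never shown."""
    try:
        ast.parse(staged_source)
        return True
    except SyntaxError:
        return False


def unknown_calls(completion: str, known_symbols: Set[str]) -> Set[str]:
    """Names the completion calls that the symbol table has never heard of."""
    called = set()
    for node in ast.walk(ast.parse(completion)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                called.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                called.add(node.func.attr)
    return called - known_symbols


good_edit = "def f():\n    return 1\n"
broken_edit = "def f(:\n    return 1\n"  # the model dropped a parenthesis

suspects = unknown_calls(
    "total = parse_amount(raw) + client.fetch_rates()",
    known_symbols={"parse_amount"},
)
print(passes_syntax_check(good_edit), passes_syntax_check(broken_edit), suspects)
```

A non empty `suspects` set does not prove a hallucination, since the symbol table may be incomplete, but it is a cheap signal for suppressing or down ranking a completion before the user ever sees it.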

Cost Economics: Per Token vs Per Developer Hour

The cost question every coding assistant has to answer is simple to state and annoying to satisfy: what is the per developer hour cost ceiling, and how do you stay under it?

A developer working with an active assistant generates a lot of completion requests per hour. Most are short. The input prompt is the heavy part; the output is a handful of tokens. So cost is dominated by input tokens, which is the whole reason prefix caching is not optional. It is load bearing.
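The arithmetic is worth doing once. Every figure in this sketch is an assumption picked for round numbers: the request rate, the token counts, and especially the prices, which are placeholders and not any provider's rate card.

```python
def cost_per_dev_hour(requests_per_hour: int, prefix_tokens: int, suffix_tokens: int,
                      output_tokens: int, usd_per_m_input: float,
                      usd_per_m_cached_input: float, usd_per_m_output: float,
                      cache_hit_rate: float) -> float:
    """Hourly cost when a share of the stable prefix is billed at the cached rate."""
    cached = prefix_tokens * cache_hit_rate
    uncached = prefix_tokens * (1.0 - cache_hit_rate) + suffix_tokens
    per_request = (uncached * usd_per_m_input
                   + cached * usd_per_m_cached_input
                   + output_tokens * usd_per_m_output) / 1_000_000
    return requests_per_hour * per_request


# Hypothetical workload and placeholder prices (USD per million tokens).
workload = dict(requests_per_hour=600, prefix_tokens=4000, suffix_tokens=200,
                output_tokens=30, usd_per_m_input=1.0,
                usd_per_m_cached_input=0.1, usd_per_m_output=4.0)

no_cache = cost_per_dev_hour(cache_hit_rate=0.0, **workload)
with_cache = cost_per_dev_hour(cache_hit_rate=0.9, **workload)
print(round(no_cache, 3), round(with_cache, 3))
```

Under these assumptions the hourly cost drops from about 2.59 to about 0.65 dollars, and output tokens are a rounding error in both cases.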

The interesting observation is that per developer hour cost is far more predictable than per query cost. You can budget for a developer day. You cannot really budget for a single query. This is the inverse of search (article 1), where per query economics run everything. A coding assistant is a subscription product wearing a token billed costume, and the architecture should be optimized around the subscription unit, not the token.

Practical implication: set token budgets and rate limits at the user session level, not the request level. A user with a runaway session should be throttled, not have individual queries denied. Denials feel like a broken product. Throttling feels like back pressure, which users grumble about and then tolerate.
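One way to get throttling without denials is a token bucket that answers "how long should this request wait" instead of "yes or no". This is a sketch under assumptions: `SessionBudget` is a hypothetical class, the rates are invented, and the clock is passed in explicitly so the behavior is deterministic.

```python
class SessionBudget:
    """Token bucket per user session that delays requests and never refuses them."""

    def __init__(self, tokens_per_minute: float, burst: float) -> None:
        self.rate = tokens_per_minute / 60.0  # refill rate in tokens per second
        self.capacity = burst
        self.level = float(burst)
        self.last = 0.0

    def delay_for(self, tokens: int, now: float) -> float:
        """Seconds the gateway should hold this request before forwarding it."""
        # Refill for the time elapsed since the previous request, capped at burst.
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        # Spend the tokens; a negative level is debt that is repaid by waiting.
        self.level -= tokens
        if self.level >= 0:
            return 0.0
        return -self.level / self.rate


budget = SessionBudget(tokens_per_minute=6000, burst=5000)
first_delay = budget.delay_for(4000, now=0.0)   # fits inside the burst allowance
second_delay = budget.delay_for(4000, now=1.0)  # runaway session gets back pressure
print(first_delay, second_delay)
```

The second request is held for 29 seconds, not rejected. For inline completion a delay that long effectively means the suggestion is skipped, which is the intended outcome: the session slows down and nothing looks broken.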

The Architect's Checklist

A 12 item action list to carry into a real coding assistant design:

Decide which assistant modes you ship (inline, multi line, chat, commands) and give each its own latency budget. Do not treat them as one product.
Choose a model placement strategy: local only, hosted only, or blended. Default to blended: a small model for inline completion, a larger hosted model for chat.
Build the prompt template with the stable prefix at the top and the volatile suffix at the bottom. Make this an inviolable rule, not a suggestion.
Wire up prefix caching against your inference provider. Verify it is actually hitting by reading billing telemetry, not by trusting the provider's marketing.
Use the language server as a context compressor. Send curated symbols, not raw files, whenever the cursor's type information is available.
Implement streaming end to end. The editor renders tokens as they arrive. No spinner for inline completion.
Add speculative completion with strict cancellation. Track the cancellation ratio as a first class metric, not an afterthought.
Validate every multi line edit with a syntax check before showing it. Reject silently if it fails.
Build the online evaluation pipeline before you ship publicly. Acceptance rate, retention, time to edit, cancellation, all four, not just the flattering first one.
Set per user, per session token budgets at the gateway. Throttle, do not deny.
Source context from the editor's buffer, not from disk. Version every snippet you send.
Have an explicit fallback to the classical LSP completion when the model is slow, errors out, or returns nothing. The product should never feel broken just because the model is having a moment.
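The last item on the checklist fits in a dozen lines. This sketch assumes both completion sources are async callables; `complete_with_fallback`, the fake model functions, and the 50 millisecond timeout are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Source = Callable[[], Awaitable[List[str]]]


async def complete_with_fallback(model_call: Source, lsp_call: Source,
                                 timeout_s: float) -> Tuple[str, List[str]]:
    """Prefer the model, but fall back to LSP on timeout, error, or empty result."""
    try:
        suggestions = await asyncio.wait_for(model_call(), timeout=timeout_s)
        if suggestions:
            return "model", suggestions
    except Exception:  # timeouts and transport errors both land here
        pass
    return "lsp", await lsp_call()


async def slow_model() -> List[str]:
    await asyncio.sleep(0.2)
    return ["too_late()"]


async def fast_model() -> List[str]:
    return ["total = sum(items)"]


async def empty_model() -> List[str]:
    return []


async def lsp() -> List[str]:
    return ["sum", "sorted"]


async def demo():
    return [
        await complete_with_fallback(slow_model, lsp, timeout_s=0.05),
        await complete_with_fallback(fast_model, lsp, timeout_s=0.05),
        await complete_with_fallback(empty_model, lsp, timeout_s=0.05),
    ]


results = asyncio.run(demo())
print(results)
```

`asyncio.wait_for` cancels the slow model call when the timeout expires, so the fallback path does not leave a stray request running behind it.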

The Architect's Mental Model

A coding assistant is not "an LLM in an IDE." It is a latency hiding system whose entire job is to make a model that is slower than the user feel faster than the user. Every move in the design (prefix caching, streaming, speculative completion, debounce tuning, the LSP as context compressor, syntax validated edits, online evaluation, per session budgets) exists to serve that one goal.

The classical pieces did not vanish. The language server still runs underneath. The syntax tree still resolves at the cursor. The lexer still tokenizes. The model sits on top of all of it, and the architecture is good precisely when the model is invisible. The moment the user can feel the model, the architecture is losing.

If you are building this, the rest of Architecting Agents applies more than it looks like it does. The conversation memory model from article 2 is your chat mode backbone. The safe tool call discipline from article 3 is your multi line edit backbone. The observability discipline from article 4 is your online evaluation pipeline. The cost bounding discipline from article 5 is your per session budget. Reach for them instead of reinventing them.

Model quality wins the demo. Latency hiding wins the user.
