Preamble
Uncapped generation is how you burn VRAM, money, and user patience in a single request. Limits belong in middleware—before the GPU does heavy work—not as a vague suggestion in the front-end. Token budgets are policy: product, security, and finance get a say; engineering implements them deterministically.
This pairs directly with Serving Distilled Models Behind an HTTP API’s serving layer and with RAG Foundations: Embeddings, Chunking, and the Retrieval Loop’s RAG context stuffing: the same window is contested by system prompts, retrieved chunks, tool outputs, and the user’s message history.
Hard caps: max_tokens and friends
Server-side max_new_tokens (or API max_tokens, depending on the stack) is the simplest brake: stop after N new tokens regardless of whether the model “feels done.” Defaults should be sane per tier, with higher quotas for authenticated power users where appropriate.
The important detail is consistent semantics across streaming and non-streaming paths. Partial streams must surface a stop reason so clients know the answer was truncated by length, not completed naturally.
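Those semantics can be pinned down in a few lines. The sketch below is illustrative, not any particular framework's API: it wraps a token iterator (whatever the serving stack yields) and returns an explicit finish reason, so the same contract holds whether tokens are streamed out or buffered.

```python
def generate_capped(token_iter, max_new_tokens):
    """Collect tokens until the model stops or the cap is hit.

    token_iter: any iterable of generated tokens (assumption; stand-in
    for the real decode loop). Returns (tokens, finish_reason) where
    finish_reason is "stop" (model finished on its own) or "length"
    (output truncated by the server-side cap).
    """
    out = []
    for tok in token_iter:
        out.append(tok)
        if len(out) >= max_new_tokens:
            return out, "length"   # surface truncation explicitly
    return out, "stop"
```

Note the cap case reports `"length"` even when the model happened to finish on the last allowed token; the server cannot distinguish the two, and clients should treat `"length"` as "possibly incomplete."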
Pre-flight token counting
Rejecting oversize prompts before allocation saves money. Token counts should use the same tokenizer family the model uses—approximations from rough character heuristics drift and cause surprise denials.
When a prompt is too large, responses should be actionable: which limit tripped, whether retry is possible after truncation, and never a generic 500 when the fix is user-side editing.
Truncation strategies for long threads
For chat history, naive “keep the last K messages” drops important early instructions. Alternatives to mix and match:
- Drop middle: keep system/developer instructions and recent turns; summarize or elide the middle.
- Summarize: periodically roll older turns into a running summary (watch for summary drift and stale facts).
- Retrieve instead of paste: RAG pulls fresh chunks rather than embedding entire old threads in the prompt.
Each strategy has failure modes; the right choice depends on whether the product is support (needs ticket context) or creative (needs recent style).
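As a concrete example of the first strategy, here is a minimal drop-middle sketch. The message shape (`role`/`content` dicts) and the placeholder turn are assumptions for illustration, not a fixed format from this serving stack.

```python
def drop_middle(messages, keep_recent=4):
    """Keep system/developer instructions and the last `keep_recent`
    non-system turns; replace the elided middle with a marker turn so
    the model knows context was removed."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return system + rest
    marker = {"role": "system",
              "content": f"[{len(rest) - keep_recent} earlier turns elided]"}
    return system + [marker] + rest[-keep_recent:]
```

A summarize variant would replace the marker turn with a rolling summary instead of a static placeholder.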
Streaming and client contracts
With SSE or chunked responses, clients must handle incremental tokens, finish reasons, and errors mid-stream. Whether partial JSON tool calls can arrive and how to recover should be documented—same care as any long-lived HTTP stream.
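A client-side consumer for such a stream might look like the following. The wire format here (`data: {json}` lines carrying `token`, optional `finish_reason` and `error` fields, terminated by `data: [DONE]`) is an assumed SSE-style convention, not a documented contract from this API.

```python
import json

def consume_stream(lines):
    """Parse an assumed SSE-style token stream.

    Returns (text, finish_reason); raises RuntimeError on a mid-stream
    error event so callers cannot mistake a failure for completion.
    """
    pieces, finish = [], None
    for line in lines:
        if not line.startswith("data: "):
            continue                      # ignore comments/heartbeats
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            raise RuntimeError(event["error"])
        pieces.append(event.get("token", ""))
        finish = event.get("finish_reason", finish)
    return "".join(pieces), finish
```

Whatever the real format is, the contract matters more than the syntax: every stream ends with either a finish reason or an error, never silence.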
Observability and fairness
Histograms for prompt length, output length, and time-to-first-token expose abusive patterns and misconfigured clients before finance does. Per-tenant quotas pair with these metrics so one noisy integration does not starve the rest.
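The per-tenant side of this can be sketched as a simple token quota ledger. This is a deliberately minimal in-memory version (no time windows, no persistence) to show the shape of the check, not a production rate limiter.

```python
from collections import defaultdict

class TenantQuota:
    """Illustrative per-tenant token quota: charge before generation,
    deny when the budget would be exceeded."""

    def __init__(self, limit_tokens):
        self.limit = limit_tokens
        self.used = defaultdict(int)

    def charge(self, tenant, tokens):
        """Return True and record usage if within budget, else False."""
        if self.used[tenant] + tokens > self.limit:
            return False            # one noisy tenant is stopped here
        self.used[tenant] += tokens
        return True
```

In practice the ledger would reset on a sliding or fixed window and the denial would emit a metric, so the histograms and the quota tell the same story.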
Implementation sketch: pre-flight with tiktoken or the Hub tokenizer
Use the same tokenizer class the model uses—not a rough len(text)//4 heuristic.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(MODEL_ID, revision=MODEL_REV)

def count_messages(messages) -> int:
    # Prefer the model's chat template when available
    return len(tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))

def assert_within_budget(messages, max_prompt_tokens: int) -> None:
    n = count_messages(messages)
    if n > max_prompt_tokens:
        raise ValueError(f"prompt_tokens={n} exceeds max_prompt_tokens={max_prompt_tokens}")
In middleware, return HTTP 413 or 400 with a JSON body { "error": "context_length", "prompt_tokens": n, "limit": max_prompt_tokens } so clients know to truncate or summarize—mirroring how public APIs surface limits.
For RAG, count tokens after you assemble system + retrieved passages + user message; if over budget, drop lowest-scoring chunks first (RAG in Production: Re-ranking, HyDE, and Simple Evals’s reranker makes that ordering meaningful).
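The drop-lowest-first step can be sketched like this. The `(score, text)` chunk shape and the `count_tokens` callable are assumptions for illustration; the point is that chunks are selected by score but re-emitted in their original order, so the prompt layout stays stable.

```python
def fit_chunks(chunks, budget_tokens, count_tokens):
    """Greedily keep highest-scoring retrieved chunks within budget.

    chunks: list of (score, text) pairs (assumed shape).
    count_tokens: callable text -> int, the real tokenizer's count.
    Returns surviving chunks in their original order.
    """
    by_score = sorted(enumerate(chunks), key=lambda p: p[1][0], reverse=True)
    total, keep = 0, set()
    for idx, (score, text) in by_score:
        t = count_tokens(text)
        if total + t <= budget_tokens:
            total += t
            keep.add(idx)           # drop lowest scorers implicitly
    return [c for i, c in enumerate(chunks) if i in keep]
```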
Reproducible truncation regression test
Keep a fixture of long threads in tests/fixtures/long_chat.json. After any change to truncation or summarization:
- Assert token counts before and after truncation stay within budget.
- Golden hash of the exact byte sequence sent to the model (or the list of chunk ids injected for RAG).
CI fails when the fixture’s effective context changes without a version bump—same instinct as index versioning in RAG Foundations: Embeddings, Chunking, and the Retrieval Loop.
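A fingerprint helper for that golden hash might look like the sketch below. Hashing canonical JSON of the message list is one assumed canonicalization; hashing the exact rendered byte sequence sent to the model is stricter and equally valid.

```python
import hashlib
import json

def context_fingerprint(messages):
    """Stable hash of the effective context (assumption: canonical
    JSON of the message list, sorted keys for determinism)."""
    blob = json.dumps(messages, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

The regression test then asserts that the fixture's fingerprint matches a pinned value, and the pin is only updated together with an explicit version bump.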