Preamble
Training a LoRA is satisfying, especially once the outputs start leaning toward what you're aiming for; serving it is where software engineering returns. Clients should see stable JSON, explicit errors, and predictable latency—not a Jupyter cell calling model.generate. The HTTP layer is a contract: versioning, observability, and back-pressure belong there, not buried inside CUDA kernels.
This post frames OpenAI-compatible surfaces and the runtimes behind them, in the same spirit as the interop and API-design posts from 2024.
Why OpenAI-compatible routes help
Exposing /v1/chat/completions (or a close cousin) means existing SDKs, proxies, and internal tools often drop in without a rewrite. You still document your own limits—max context, supported modalities, tool-calling quirks—but you avoid inventing yet another message schema for every new service.
Compatibility is not perfection. Deltas should be documented explicitly (streaming format, stop sequences, tool JSON) so client teams do not rely on undocumented behavior.
Runtime choices: vLLM, TGI, llama.cpp, and friends
The right engine depends on constraints more than on hype:
- Multi-GPU, high throughput, Python ecosystem: vLLM and similar servers emphasize batching, PagedAttention-style memory use, and throughput. Good when you have NVIDIA hardware and want many concurrent sessions.
- Hugging Face TGI: another solid path when you want tight integration with Hub workflows and well-traveled Docker images.
- CPU or edge, smaller models: llama.cpp-family servers shine when GPUs are scarce or you need a portable binary. Latency per token differs; total cost of ownership may still win.
The choice rests on SLO, hardware, and team familiarity—not benchmark leaderboard bragging rights alone.
Operations: health, versions, request IDs
Health checks should validate more than “process is up”: load the tokenizer, warm the model, and optionally run a tiny forward pass in staging. Model version belongs in response headers or metadata so clients can log which revision answered.
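As a sketch of that discipline, a deep-readiness probe might look like the following; load_tokenizer and tiny_forward are hypothetical stand-ins for whatever your runtime exposes, and the revision string is an assumed pinned tag:

```python
def readiness(load_tokenizer, tiny_forward=None, model_revision="lora-v3"):
    """Deep health: verify this replica can actually serve, not just that
    the process is up. Pass tiny_forward only in staging (it costs a real
    forward pass); model_revision is whatever tag you pin in deployment."""
    checks = {"tokenizer_loaded": False, "forward_pass": tiny_forward is None}
    try:
        load_tokenizer()                 # fails fast on missing/corrupt files
        checks["tokenizer_loaded"] = True
        if tiny_forward is not None:
            tiny_forward("ping")         # one-token generation in staging
            checks["forward_pass"] = True
    except Exception:
        pass                             # any failure leaves a check False
    status = "ok" if all(checks.values()) else "degraded"
    # Surface the model revision so clients can log which weights answered.
    return {"status": status, "model_revision": model_revision, "checks": checks}
```

The same revision string should also land in response headers or metadata, so the health endpoint and the inference path agree on what is deployed.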
Request IDs propagate through auth, rate limiting, inference, and logging—the same discipline as wiring gRPC and HTTP across Python and Java services. When something regresses, one string traces the path.
Cost control: restricting token usage
Cost tracks tokens processed (prompt + completion) and how often you run inference. To manage spend you need hard limits clients cannot bypass, visibility into who burned what, and fairness so one integration cannot exhaust the pool.
1. Cap generation length server-side. OpenAI-shaped APIs accept max_tokens (or max_completion_tokens) on the request; that only helps if you enforce a ceiling: ignore values above your tier’s cap, or clamp them downward before the request reaches the engine. Otherwise a buggy or malicious client can ask for tens of thousands of new tokens per call. The engine should still return a clear finish reason when generation stops on length.
2. Bound the context window at the runtime. Engines expose a maximum sequence length (e.g. vLLM’s --max-model-len, llama.cpp’s -c). That limits worst-case memory and per-request work. It does not replace policy: you still want pre-flight checks (same tokenizer as the model) so oversize prompts fail fast with a 4xx instead of tying up a slot—see Token Budgets: Enforcing Limits at the API Layer for counting, truncation, and streaming semantics.
3. Rate-limit and quota by tenant or key. Tokens per minute (TPM) and requests per minute (RPM), optionally combined with daily or monthly budgets, belong in the gateway or auth layer (API gateway, Envoy, Kong, or a thin BFF) keyed off API keys or JWT claims. That caps aggregate burn even when each individual request is small.
4. Meter and alert. Log or export prompt_tokens, completion_tokens, and total per request ID and tenant; chart burn rate and set alerts before the bill surprises anyone. Chargeback by team or product line turns “model cost” from a black box into an engineering lever.
5. Operational knobs that are not a substitute for policy. Continuous batching and high concurrency improve throughput; they do not stop runaway prompts. Default low max_tokens for anonymous or trial tiers; require review or higher entitlements for large completions.
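Step 1 above can be sketched as a small clamp applied before the request reaches the engine; the tier names and caps here are illustrative assumptions, not recommendations:

```python
# Illustrative per-tier ceilings on completion length (assumed values).
TIER_CAPS = {"trial": 256, "standard": 1024, "internal": 4096}

def clamp_max_tokens(requested, tier, default=128):
    """Clamp a client-supplied max_tokens before forwarding to the engine.
    Unknown tiers fall back to the most restrictive cap; absent values get
    a conservative default rather than the engine's maximum."""
    cap = TIER_CAPS.get(tier, TIER_CAPS["trial"])
    if requested is None:
        return min(default, cap)
    return max(1, min(int(requested), cap))  # clamp downward, never below 1
```

A buggy client asking for 50,000 tokens on a trial key gets the tier cap instead; the engine then reports a length finish reason as usual.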
In short: runtime caps define physics (max length, VRAM); API/gateway caps define product and finance (per-call generation limit, per-tenant TPM, monthly budget). You want both.
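The gateway-side half (per-tenant TPM from step 3) can be sketched as a sliding-window counter; the window length and limit here are illustrative assumptions, and a production gateway would keep this state in something shared like Redis:

```python
import time
from collections import defaultdict, deque

class TokenRateLimiter:
    """Per-tenant tokens-per-minute limit over a sliding window.
    In-memory sketch only; a real deployment needs shared state
    across gateway replicas."""
    def __init__(self, tpm_limit=10_000, window=60.0):
        self.tpm_limit = tpm_limit
        self.window = window
        self.events = defaultdict(deque)  # tenant -> deque of (timestamp, tokens)

    def allow(self, tenant, tokens, now=None):
        now = time.monotonic() if now is None else now
        q = self.events[tenant]
        while q and now - q[0][0] > self.window:  # expire spend outside window
            q.popleft()
        used = sum(t for _, t in q)
        if used + tokens > self.tpm_limit:
            return False                          # caller maps this to HTTP 429
        q.append((now, tokens))
        return True
```

Each request's estimated prompt plus requested completion tokens goes into allow(); a rejection becomes a 429 with a Retry-After hint, long before the engine sees the request.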
Concrete runtimes: commands and trade-offs
These are illustrative; pin container tags and model revisions in your own manifests.
vLLM (multi-session GPU throughput) — OpenAI-compatible server, good when you already have CUDA and want continuous batching:
# Example only: replace model path and image tag with pinned versions
# --max-model-len caps context (and thus per-request upper bound on work)
docker run --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--dtype bfloat16 \
--max-model-len 8192
TGI (Text Generation Inference) — Strong Hub/Docker story; similar niche to vLLM for transformer serving behind HTTP.
llama.cpp server — CPU or smaller GPUs; compile with the same quant you evaluated (Q4_K_M, etc.):
./server -m ./models/model.gguf --port 8080 -c 4096
Runtime choice: vLLM/TGI when concurrent GPU sessions and batching dominate SLOs; llama.cpp when footprint, CPU-only, or edge packaging matters more than peak tokens/sec. Always re-benchmark your adapter + prompt template on the chosen runtime—kernel and quant interact with LoRA paths.
Minimal OpenAI-shaped curl contract test
After deploy, hit the non-streaming chat endpoint with a frozen prompt and compare the SHA of the normalized JSON (strip volatile fields such as the id and created timestamps if needed):
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistralai/Mistral-7B-Instruct-v0.3","messages":[{"role":"user","content":"Say OK."}],"max_tokens":8,"temperature":0}' \
| jq .
Check choices[0].finish_reason, token usage fields, and error schema for oversize prompts—those are the surfaces Token Budgets: Enforcing Limits at the API Layer tightens.
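One way to do that normalization without shell tooling is a small Python helper; the set of volatile fields is an assumption to adjust for your runtime:

```python
import hashlib
import json

# Top-level fields that change per call in OpenAI-shaped responses
# (assumed set; extend for your runtime).
VOLATILE = {"id", "created", "system_fingerprint"}

def response_fingerprint(payload: dict) -> str:
    """Hash a chat-completion response after stripping volatile fields,
    so the same weights + frozen prompt compare equal across deploys."""
    stable = {k: v for k, v in payload.items() if k not in VOLATILE}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Run it against the saved response after each deploy; a changed fingerprint with unchanged weights is the signal to diff the full JSON.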
Conclusion
This post closes the loop from “weights on disk” to “service behind a load balancer,” including where to enforce token and rate limits so costs stay predictable: ceilings on completion length and context at the engine, plus TPM/RPM and budgets at the gateway. Token Budgets: Enforcing Limits at the API Layer goes deeper on pre-flight counting, truncation strategies, and client contracts—stack it with the serving layer you run here.