Token Optimization Guide
Reduce costs and improve performance by optimizing token usage in your LLM applications.
The Cost of LLM Applications
LLM costs can spiral quickly at scale. A seemingly efficient application using GPT-4 at $0.03 per 1K input tokens can cost thousands per month with just moderate traffic. Latency compounds the problem—users abandon slow applications.
Measuring Current Performance
Start by understanding your baseline. Navigate to the Traces dashboard to see:
- Average tokens per request (input + output)
- P50, P95, P99 latency
- Cost per request and total monthly spend
- Slowest operations and bottlenecks
Strategy 1: Reduce Input Tokens
1. Trim Unnecessary Context
Are you sending more context than the model needs?
- Before: 3,500 tokens (full conversation history)
- After: 1,200 tokens (last 3 turns + a rolling summary)
- Savings: ~66% of input tokens
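A minimal sketch of this trimming, assuming the OpenAI-style message format (`role`/`content` dicts). The function name `trim_history` and its parameters are illustrative, not a library API:

```python
def trim_history(messages, keep_turns=3, summary=None):
    """Keep the system prompt, an optional running summary, and the last N turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # One "turn" = a user message plus the assistant reply (2 messages).
    recent = rest[-keep_turns * 2:]
    trimmed = list(system)
    if summary:
        trimmed.append({"role": "system", "content": f"Conversation so far: {summary}"})
    return trimmed + recent
```

A 10-turn history (21 messages with the system prompt) collapses to 8 messages: system prompt, summary, and the last 3 turns.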
2. Semantic Search Over Full Documents
For RAG systems, don't dump entire documents into the prompt. Use embeddings to retrieve only the most relevant chunks.
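The retrieval step can be sketched with plain cosine similarity over pre-computed chunk embeddings (toy 2-D vectors here; in practice you would use a real embedding model and a vector database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_chunks(query_vec, chunks, k=2):
    """chunks: list of (text, embedding) pairs, embeddings computed at write time.
    Return only the k most relevant chunk texts to put in the prompt."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

Only the top-k chunk texts go into the prompt, instead of every document.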
3. Prompt Compression
Remove verbose instructions and redundant examples:
- Use terse, imperative language
- Cap few-shot examples at 2-3 well-chosen ones; long example lists rarely improve quality
- Remove filler words and formatting
Strategy 2: Reduce Output Tokens
1. Set max_tokens Limits
Prevent runaway generation by capping output length:
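One way to enforce caps is a per-task lookup built into your request parameters. The parameter names below follow the OpenAI chat completions API (`model`, `messages`, `max_tokens`); the task names and limits are illustrative:

```python
# Illustrative per-task output caps; tune these for your workloads.
MAX_OUTPUT_TOKENS = {
    "classification": 10,   # a label needs only a few tokens
    "summary": 150,
    "draft_email": 400,
}

def completion_params(task, prompt):
    """Build request kwargs with a hard output cap for the given task."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        # Default cap prevents runaway generation on unrecognized tasks.
        "max_tokens": MAX_OUTPUT_TOKENS.get(task, 256),
    }
```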
2. Use Structured Outputs
JSON outputs are more token-efficient than prose:
Prose (~25 tokens):
"The sentiment of this review is positive. The user seems happy with the product quality and delivery speed."
JSON (~15 tokens):
{"sentiment": "positive", "aspects": ["quality", "delivery"]}
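In practice you instruct the model to reply with JSON only, then validate the shape before using it. This is a sketch; the schema and `parse_sentiment` helper are illustrative:

```python
import json

# Example system instruction asking for compact JSON instead of prose.
SYSTEM = 'Reply with JSON only: {"sentiment": "positive"|"negative", "aspects": [string, ...]}'

def parse_sentiment(raw: str) -> dict:
    """Parse and sanity-check the model's JSON reply before downstream use."""
    data = json.loads(raw)
    if data.get("sentiment") not in ("positive", "negative"):
        raise ValueError("unexpected sentiment value")
    if not isinstance(data.get("aspects"), list):
        raise ValueError("aspects must be a list")
    return data
```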
3. Stop Sequences
Use stop sequences to halt generation early when the task is complete:
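For example, with the OpenAI chat API's `stop` parameter (most providers offer an equivalent), generation halts before any listed sequence is emitted. The prompt and sequences here are illustrative:

```python
# Generation stops as soon as the model is about to emit "END" or a blank line,
# so you never pay for trailing commentary after the answer.
params = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "List 3 keywords, then write END."}],
    "stop": ["END", "\n\n"],
    "max_tokens": 60,  # belt-and-suspenders cap alongside the stop sequences
}
```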
Strategy 3: Choose the Right Model
Model Selection Matrix
| Task | Model | Cost/1K tokens |
|---|---|---|
| Simple classification | GPT-3.5-turbo | $0.0005 |
| Creative writing | GPT-4 | $0.03 |
| Code generation | GPT-4 | $0.03 |
| Embeddings | text-embedding-3-small | $0.00002 |
Don't use GPT-4 for tasks GPT-3.5 can handle. Test with smaller models first.
Model Cascading
Route requests to cheaper models when possible:
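A minimal cascading router: try the cheap model first and escalate only when it is unsure. The `classify` callable and the 0.8 confidence threshold are assumptions for illustration, not a specific library:

```python
def route_model(query: str, classify) -> str:
    """Cheap-first routing: escalate to the expensive model only on low confidence.

    `classify(query, model=...)` is a hypothetical callable returning
    (label, confidence); wire in your own client.
    """
    label, confidence = classify(query, model="gpt-3.5-turbo")
    if confidence >= 0.8:
        return label  # most traffic stops here at ~1/60th the cost
    # Low confidence: retry with the stronger, pricier model.
    label, _ = classify(query, model="gpt-4")
    return label
```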
Strategy 4: Reduce Latency
1. Streaming Responses
Show tokens as they're generated instead of waiting for completion:
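The consumer side can be sketched as a loop that renders each token delta as it arrives (the chunk shape varies by SDK; substitute your client's streaming interface):

```python
def render_stream(chunks):
    """Print token deltas as they arrive instead of waiting for the full completion."""
    text = []
    for delta in chunks:  # each chunk carries a small piece of the response
        print(delta, end="", flush=True)  # user sees output immediately
        text.append(delta)
    return "".join(text)
```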
Perceived latency (time to first token) drops from ~3s to ~0.5s, even though total generation time is unchanged.
2. Parallel Requests
For independent operations, call LLMs in parallel:
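With an async client, independent calls can run concurrently via `asyncio.gather`, so total wall time is roughly the slowest call rather than the sum. `call_llm` below is a stub standing in for a real async completion call:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Placeholder for a real async client call (e.g. an awaitable completion).
    await asyncio.sleep(0.1)  # stands in for network latency
    return f"answer:{prompt}"

async def main():
    # Three independent operations fired concurrently; gather preserves order.
    return await asyncio.gather(
        call_llm("summarize"),
        call_llm("classify"),
        call_llm("extract entities"),
    )

results = asyncio.run(main())
```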
3. Caching
Cache responses for repeated queries:
- Exact match caching: Identical inputs return cached responses
- Semantic caching: Similar queries reuse cached responses
- TTL: Expire cache after 24 hours for time-sensitive content
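Exact-match caching with a 24-hour TTL can be sketched in a few lines (in production you would back this with Redis or similar rather than a process-local dict):

```python
import hashlib
import time

_cache = {}
TTL_SECONDS = 24 * 60 * 60  # expire entries after 24 hours

def cached_completion(prompt: str, call_llm):
    """Exact-match cache: identical prompts reuse the stored response within the TTL."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[1] < TTL_SECONDS:
        return hit[0]                    # cache hit: zero tokens spent
    response = call_llm(prompt)          # cache miss: pay for the call once
    _cache[key] = (response, time.time())
    return response
```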
4. Reduce Retrieval Overhead
For RAG systems, optimize the retrieval step:
- Use faster vector databases (Pinecone, Weaviate)
- Pre-compute embeddings at write-time, not read-time
- Index frequently accessed documents
Monitoring and Alerting
Set up alerts in the platform to catch regressions:
- Cost spike: Alert if daily spend exceeds $X
- Latency degradation: Alert if P95 latency > 3s
- Token anomalies: Alert if average tokens per request jumps 2x
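The three checks above amount to simple threshold comparisons; a sketch with hypothetical thresholds (wire the result into whatever alerting system you use):

```python
# Hypothetical thresholds; replace with your own budgets and SLOs.
THRESHOLDS = {"daily_cost_usd": 50.0, "p95_latency_s": 3.0, "avg_tokens_ratio": 2.0}

def check_regressions(metrics, baseline_avg_tokens):
    """Return the list of alert names triggered by the current metrics."""
    alerts = []
    if metrics["daily_cost_usd"] > THRESHOLDS["daily_cost_usd"]:
        alerts.append("cost spike")
    if metrics["p95_latency_s"] > THRESHOLDS["p95_latency_s"]:
        alerts.append("latency degradation")
    if metrics["avg_tokens"] > THRESHOLDS["avg_tokens_ratio"] * baseline_avg_tokens:
        alerts.append("token anomaly")
    return alerts
```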
Real-World Optimization Case Study
Customer Support Chatbot
Initial State:
- Model: GPT-4
- Avg tokens: 4,500 per request
- Latency: 3.2s
- Cost: $8,400/month
Optimizations Applied:
- Switched to GPT-3.5-turbo for 70% of simple queries (model cascading)
- Reduced context window from full history to last 3 turns + summary
- Added semantic caching with 40% hit rate
- Enabled streaming for perceived latency
Final State:
- Avg tokens: 1,800 per request (-60%)
- Latency: 1.1s (-66%)
- Cost: $2,100/month (-75%)
Quick Wins Checklist
- ✅ Set max_tokens limits on all completions
- ✅ Use GPT-3.5-turbo for simple tasks
- ✅ Enable streaming for all user-facing responses
- ✅ Implement exact-match caching
- ✅ Trim conversation history to last 3-5 turns
- ✅ Use structured outputs (JSON) instead of prose
- ✅ Monitor token usage and set cost alerts