Token Optimization Guide
Reduce costs and improve performance by optimizing token usage in your LLM applications.
The Cost of LLM Applications
LLM costs can spiral quickly at scale. A seemingly efficient application using GPT-4 at $0.03 per 1K input tokens can cost thousands per month with just moderate traffic. Latency compounds the problem—users abandon slow applications.
Measuring Current Performance
Start by understanding your baseline. Navigate to the Traces dashboard to see:
- Average tokens per request (input + output)
- P50, P95, P99 latency
- Cost per request and total monthly spend
- Slowest operations and bottlenecks
Strategy 1: Reduce Input Tokens
1. Trim Unnecessary Context
Are you sending more context than the model needs?
- Before: 3,500 tokens (full conversation history)
- After: 1,200 tokens (last 3 turns + a rolling summary)
- Savings: ~66% of input tokens
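A minimal sketch of this trimming, assuming the OpenAI-style message format (`role`/`content` dicts). The function name `trim_history` and its parameters are illustrative, not a library API:

```python
def trim_history(messages, keep_turns=3, summary=None):
    """Keep the system prompt, an optional running summary, and the last N turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # One "turn" = a user message plus the assistant reply (2 messages).
    recent = rest[-keep_turns * 2:]
    trimmed = list(system)
    if summary:
        trimmed.append({"role": "system", "content": f"Conversation so far: {summary}"})
    return trimmed + recent
```

A 10-turn history (21 messages with the system prompt) collapses to 8 messages: system prompt, summary, and the last 3 turns.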
2. Semantic Search Over Full Documents
For RAG systems, don't dump entire documents into the prompt. Use embeddings to retrieve only the most relevant chunks.
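The retrieval step can be sketched with plain cosine similarity over pre-computed chunk embeddings (toy 2-D vectors here; in practice you would use a real embedding model and a vector database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_chunks(query_vec, chunks, k=2):
    """chunks: list of (text, embedding) pairs, embeddings computed at write time.
    Return only the k most relevant chunk texts to put in the prompt."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

Only the top-k chunk texts go into the prompt, instead of every document.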
3. Prompt Compression
Remove verbose instructions and redundant examples:
- Use terse, imperative language
- Cap few-shot examples at 2-3 well-chosen ones; long example lists rarely improve quality
- Remove filler words and formatting
Strategy 2: Reduce Output Tokens
1. Set max_tokens Limits
Prevent runaway generation by capping output length:
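One way to enforce caps is a per-task lookup built into your request parameters. The parameter names below follow the OpenAI chat completions API (`model`, `messages`, `max_tokens`); the task names and limits are illustrative:

```python
# Illustrative per-task output caps; tune these for your workloads.
MAX_OUTPUT_TOKENS = {
    "classification": 10,   # a label needs only a few tokens
    "summary": 150,
    "draft_email": 400,
}

def completion_params(task, prompt):
    """Build request kwargs with a hard output cap for the given task."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        # Default cap prevents runaway generation on unrecognized tasks.
        "max_tokens": MAX_OUTPUT_TOKENS.get(task, 256),
    }
```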
2. Use Structured Outputs
JSON outputs are more token-efficient than prose:
Prose (~25 tokens):
"The sentiment of this review is positive. The user seems happy with the product quality and delivery speed."
JSON (~15 tokens):
{"sentiment": "positive", "aspects": ["quality", "delivery"]}
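In practice you instruct the model to reply with JSON only, then validate the shape before using it. This is a sketch; the schema and `parse_sentiment` helper are illustrative:

```python
import json

# Example system instruction asking for compact JSON instead of prose.
SYSTEM = 'Reply with JSON only: {"sentiment": "positive"|"negative", "aspects": [string, ...]}'

def parse_sentiment(raw: str) -> dict:
    """Parse and sanity-check the model's JSON reply before downstream use."""
    data = json.loads(raw)
    if data.get("sentiment") not in ("positive", "negative"):
        raise ValueError("unexpected sentiment value")
    if not isinstance(data.get("aspects"), list):
        raise ValueError("aspects must be a list")
    return data
```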
3. Stop Sequences
Use stop sequences to halt generation early when the task is complete:
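For example, with the OpenAI chat API's `stop` parameter (most providers offer an equivalent), generation halts before any listed sequence is emitted. The prompt and sequences here are illustrative:

```python
# Generation stops as soon as the model is about to emit "END" or a blank line,
# so you never pay for trailing commentary after the answer.
params = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "List 3 keywords, then write END."}],
    "stop": ["END", "\n\n"],
    "max_tokens": 60,  # belt-and-suspenders cap alongside the stop sequences
}
```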
Strategy 3: Choose the Right Model
Model Selection Matrix
| Task | Model | Cost/1K tokens |
|---|---|---|
| Simple classification | GPT-3.5-turbo | $0.0005 |
| Creative writing | GPT-4 | $0.03 |
| Code generation | GPT-4 | $0.03 |
| Embeddings | text-embedding-3-small | $0.00002 |
Don't use GPT-4 for tasks GPT-3.5 can handle. Test with smaller models first.
Model Cascading
Route requests to cheaper models when possible:
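A minimal cascading router: try the cheap model first and escalate only when it is unsure. The `classify` callable and the 0.8 confidence threshold are assumptions for illustration, not a specific library:

```python
def route_model(query: str, classify) -> str:
    """Cheap-first routing: escalate to the expensive model only on low confidence.

    `classify(query, model=...)` is a hypothetical callable returning
    (label, confidence); wire in your own client.
    """
    label, confidence = classify(query, model="gpt-3.5-turbo")
    if confidence >= 0.8:
        return label  # most traffic stops here at ~1/60th the cost
    # Low confidence: retry with the stronger, pricier model.
    label, _ = classify(query, model="gpt-4")
    return label
```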
Strategy 4: Reduce Latency
1. Streaming Responses
Show tokens as they're generated instead of waiting for completion:
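The consumer side can be sketched as a loop that renders each token delta as it arrives (the chunk shape varies by SDK; substitute your client's streaming interface):

```python
def render_stream(chunks):
    """Print token deltas as they arrive instead of waiting for the full completion."""
    text = []
    for delta in chunks:  # each chunk carries a small piece of the response
        print(delta, end="", flush=True)  # user sees output immediately
        text.append(delta)
    return "".join(text)
```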
Perceived latency (time to first token) drops from ~3s to ~0.5s, even though total generation time is unchanged.
2. Parallel Requests
For independent operations, call LLMs in parallel:
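With an async client, independent calls can run concurrently via `asyncio.gather`, so total wall time is roughly the slowest call rather than the sum. `call_llm` below is a stub standing in for a real async completion call:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Placeholder for a real async client call (e.g. an awaitable completion).
    await asyncio.sleep(0.1)  # stands in for network latency
    return f"answer:{prompt}"

async def main():
    # Three independent operations fired concurrently; gather preserves order.
    return await asyncio.gather(
        call_llm("summarize"),
        call_llm("classify"),
        call_llm("extract entities"),
    )

results = asyncio.run(main())
```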
3. Caching
Cache responses for repeated queries:
- Exact match caching: Identical inputs return cached responses
- Semantic caching: Similar queries reuse cached responses
- TTL: Expire cache after 24 hours for time-sensitive content
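Exact-match caching with a 24-hour TTL can be sketched in a few lines (in production you would back this with Redis or similar rather than a process-local dict):

```python
import hashlib
import time

_cache = {}
TTL_SECONDS = 24 * 60 * 60  # expire entries after 24 hours

def cached_completion(prompt: str, call_llm):
    """Exact-match cache: identical prompts reuse the stored response within the TTL."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    hit = _cache.get(key)
    if hit and time.time() - hit[1] < TTL_SECONDS:
        return hit[0]                    # cache hit: zero tokens spent
    response = call_llm(prompt)          # cache miss: pay for the call once
    _cache[key] = (response, time.time())
    return response
```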
4. Reduce Retrieval Overhead
For RAG systems, optimize the retrieval step:
- Use faster vector databases (Pinecone, Weaviate)
- Pre-compute embeddings at write-time, not read-time
- Index frequently accessed documents
Monitoring and Alerting
Set up alerts in the platform to catch regressions:
- Cost spike: Alert if daily spend exceeds $X
- Latency degradation: Alert if P95 latency > 3s
- Token anomalies: Alert if average tokens per request jumps 2x
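The three checks above amount to simple threshold comparisons; a sketch with hypothetical thresholds (wire the result into whatever alerting system you use):

```python
# Hypothetical thresholds; replace with your own budgets and SLOs.
THRESHOLDS = {"daily_cost_usd": 50.0, "p95_latency_s": 3.0, "avg_tokens_ratio": 2.0}

def check_regressions(metrics, baseline_avg_tokens):
    """Return the list of alert names triggered by the current metrics."""
    alerts = []
    if metrics["daily_cost_usd"] > THRESHOLDS["daily_cost_usd"]:
        alerts.append("cost spike")
    if metrics["p95_latency_s"] > THRESHOLDS["p95_latency_s"]:
        alerts.append("latency degradation")
    if metrics["avg_tokens"] > THRESHOLDS["avg_tokens_ratio"] * baseline_avg_tokens:
        alerts.append("token anomaly")
    return alerts
```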
Real-World Optimization Case Study
Customer Support Chatbot
Initial State:
- Model: GPT-4
- Avg tokens: 4,500 per request
- Latency: 3.2s
- Cost: $8,400/month
Optimizations Applied:
- Switched to GPT-3.5-turbo for 70% of simple queries (model cascading)
- Reduced context window from full history to last 3 turns + summary
- Added semantic caching with 40% hit rate
- Enabled streaming for perceived latency
Final State:
- Avg tokens: 1,800 per request (-60%)
- Latency: 1.1s (-66%)
- Cost: $2,100/month (-75%)
Quick Wins Checklist
- ✅ Set max_tokens limits on all completions
- ✅ Use GPT-3.5-turbo for simple tasks
- ✅ Enable streaming for all user-facing responses
- ✅ Implement exact-match caching
- ✅ Trim conversation history to last 3-5 turns
- ✅ Use structured outputs (JSON) instead of prose
- ✅ Monitor token usage and set cost alerts