AI costs can escalate rapidly in production security systems—high query volumes, large context windows, and premium models create significant operational expenses. A single security operations center processing 10,000 alerts daily through GPT-4 could incur monthly costs exceeding $50,000 without optimization. Effective cost management balances performance requirements against budget constraints while maintaining the security effectiveness that justifies AI investment.
The economics of AI in security differ from traditional software costs. Token-based pricing creates variable expenses that scale with usage, making cost prediction challenging. Security workloads exhibit high variance—quiet periods may cost little while active incidents can consume substantial resources. Additionally, security applications often require premium models for accuracy, creating tension between cost efficiency and detection effectiveness.
According to a16z research on AI infrastructure costs, compute and inference costs represent the largest expense category for AI applications, often exceeding 50% of total operational costs. For security teams, understanding and optimizing these costs is essential for sustainable AI adoption and demonstrating ROI to leadership.
Understanding AI Costs
Before optimizing costs, security teams must understand the components that drive AI expenses. Unlike traditional software licensing with predictable costs, AI systems incur variable expenses based on usage patterns, model selection, and implementation architecture.
Cost Components
| Component | Description | Cost Driver | Security Consideration |
|---|---|---|---|
| Input tokens | Prompt + context length | Context size, RAG retrieval | Security context often requires extensive background |
| Output tokens | Response length | Output verbosity | Detailed analysis requires longer outputs |
| Model tier | Capability level | GPT-4 vs GPT-3.5 pricing | Accuracy requirements may mandate premium models |
| API calls | Request volume | Query frequency | High alert volumes drive request counts |
| Infrastructure | Compute, storage | Self-hosted deployments | Data sovereignty may require on-premises hosting |
| Embedding generation | Vector creation | Document ingestion | Threat intel feeds require continuous embedding |
| Vector storage | Database costs | Index size, query volume | Growing knowledge bases increase storage costs |
Input tokens typically represent the largest cost component for security applications. RAG-based systems that retrieve context from threat intelligence, previous incidents, and organizational knowledge can easily consume 10,000+ tokens per query. A system prompt of 2,000 tokens plus 8,000 tokens of retrieved context means 10,000 input tokens before the user query even begins.
Output tokens cost more per token than inputs (typically 2-4x) but represent smaller volumes. However, security applications requiring detailed analysis, step-by-step reasoning, or comprehensive reports can generate substantial output costs.
Model selection creates the most significant cost variance. GPT-4 costs approximately 20-30x more than GPT-3.5-turbo per token. Claude 3 Opus costs roughly 15x more than Claude 3 Haiku. Choosing appropriate models for each task type is the highest-leverage optimization available.
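The way these components combine can be sketched with a few lines of arithmetic. This is a minimal illustration, not a pricing reference: the `ModelPrice` type, the function names, and the $30 / $60 per-million-token prices are placeholder assumptions and should be replaced with current vendor pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per one million tokens (placeholder values, not vendor quotes)."""
    input_per_million: float
    output_per_million: float


def query_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request; input and output tokens are billed at different rates."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def monthly_cost(price: ModelPrice, input_tokens: int, output_tokens: int,
                 queries_per_day: int, days: int = 30) -> float:
    """Scale the per-request cost by daily volume and days in the period."""
    return query_cost(price, input_tokens, output_tokens) * queries_per_day * days


# Hypothetical premium-tier price: $30 input / $60 output per million tokens.
premium = ModelPrice(input_per_million=30.0, output_per_million=60.0)

# 2,000-token system prompt + 8,000 tokens of retrieved context, 500-token answer.
per_query = query_cost(premium, input_tokens=10_000, output_tokens=500)
per_month = monthly_cost(premium, 10_000, 500, queries_per_day=10_000)
print(f"per query: ${per_query:.2f}, per month: ${per_month:,.0f}")
```

Under these assumed prices, input tokens account for $0.30 of the $0.33 per-query cost, which is why the sections that follow concentrate on reducing input volume first.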
Cost Comparison by Model
Understanding relative model costs helps security teams make informed decisions about which models to use for different tasks. Prices change frequently, so teams should consult current pricing from OpenAI, Anthropic, and Google.
| Model Category | Relative Cost | Typical Use Case | Security Application |
|---|---|---|---|
| Frontier models (GPT-4o, Claude 3.5 Sonnet) | $$$$$ | Complex reasoning, high stakes | Incident investigation, threat analysis |
| Mid-tier models (GPT-4o-mini, Claude 3 Haiku) | $$ | Balanced cost/performance | Alert triage, log summarization |
| Efficient models (GPT-3.5-turbo) | $ | High volume, simpler tasks | Classification, extraction |
| Open source (Llama 3.1, Mistral) | $ (compute) | Self-hosted, privacy requirements | Sensitive data processing |
| Embedding models (text-embedding-3-small) | ¢ | Vector generation | Threat intel indexing |
The cost differential between model tiers is substantial. Processing 1 million tokens through GPT-4o costs approximately $5-15 (input/output combined), while the same volume through GPT-3.5-turbo costs under $1. This 10-20x difference makes model selection the most impactful optimization lever.
Token Optimization
Token optimization represents the most direct path to cost reduction. Since costs scale linearly with token consumption, reducing tokens directly reduces expenses. Security teams should focus on both input and output optimization, with input optimization typically offering larger savings due to the volume of context required for security analysis.
Input tokens often represent 80-90% of total token consumption in RAG-based security systems. Reducing input tokens without sacrificing quality requires careful attention to context selection, prompt engineering, and compression techniques.
| Strategy | Description | Savings Potential | Implementation Complexity |
|---|---|---|---|
| Context compression | Reduce retrieved context size | 30-70% | Medium - requires summarization |
| Prompt optimization | Concise, efficient instructions | 10-30% | Low - prompt engineering |
| Selective retrieval | Retrieve only relevant content | 40-60% | Medium - improved retrieval |
| Schema constraints | Structured, minimal inputs | 20-40% | Low - format standardization |
| Dynamic context | Adjust context by query type | 30-50% | High - query classification |
| Deduplication | Remove redundant information | 10-25% | Low - preprocessing |
Context compression uses summarization or extraction to reduce the size of retrieved documents while preserving essential information. Tools like LLMLingua from Microsoft Research can compress prompts by 2-10x with minimal quality loss. For security applications, compression should preserve IOCs, timestamps, and technical details while removing boilerplate.
Selective retrieval improves the precision of RAG systems to retrieve only highly relevant content. Rather than retrieving 10 documents and hoping some are relevant, improved retrieval returns 3-5 highly relevant documents. Techniques include hybrid search (combining semantic and keyword matching), reranking retrieved results, and query expansion.
Dynamic context adjusts the amount of context based on query complexity. Simple classification queries may need minimal context, while complex investigation queries require extensive background. Implementing query classification to route requests to appropriate context levels can significantly reduce average token consumption.
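Dynamic context and deduplication can be combined in one selection step. The sketch below is an illustration under stated assumptions: the budget values in `CONTEXT_BUDGETS`, the query type names, and the four-characters-per-token estimate in `approx_tokens` are all hypothetical, and a real system would use the model's own tokenizer.

```python
# Hypothetical per-query-type context budgets, in tokens.
CONTEXT_BUDGETS = {"classification": 500, "triage": 2_000, "investigation": 8_000}


def approx_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4)


def select_context(query_type: str, ranked_chunks: list[str]) -> list[str]:
    """Keep the highest-ranked, non-duplicate chunks that fit the query type's budget."""
    budget = CONTEXT_BUDGETS.get(query_type, CONTEXT_BUDGETS["triage"])
    selected: list[str] = []
    seen: set[str] = set()
    used = 0
    for chunk in ranked_chunks:
        key = " ".join(chunk.lower().split())  # normalise case and whitespace
        cost = approx_tokens(chunk)
        if key in seen or used + cost > budget:
            continue  # skip duplicates and chunks that would exceed the budget
        seen.add(key)
        selected.append(chunk)
        used += cost
    return selected
```

Chunks are assumed to arrive already ranked by relevance, so a chunk that does not fit is skipped and smaller, lower-ranked chunks may still be included.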
Output Token Control
Output volumes are smaller than input volumes, but security applications often require detailed outputs such as investigation summaries, threat analysis reports, and remediation recommendations. Balancing detail with cost requires careful output management.
| Strategy | Description | Implementation | Trade-off |
|---|---|---|---|
| Response length limits | Cap output tokens | max_tokens parameter | May truncate important content |
| Structured outputs | Constrained formats | JSON mode, function calling | Reduces flexibility |
| Concise prompting | Request brevity | Prompt engineering | May lose detail |
| Streaming cutoffs | Early termination | Stop sequences | Requires careful design |
| Tiered responses | Brief first, detail on request | Multi-turn design | Adds latency for detail |
Structured outputs using JSON mode or function calling constrain model outputs to specific schemas, eliminating verbose prose in favor of structured data. For alert triage, a structured output with severity, category, and recommended action fields is more cost-effective than a prose explanation—and often more useful for downstream automation.
Tiered responses provide brief initial answers with the option to request more detail. An alert triage system might first return a one-line classification, then provide detailed analysis only when requested. This approach reduces costs for routine queries while preserving capability for complex cases.
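A structured triage output is only useful downstream if it is validated before use. The following sketch assumes the model has been asked to reply with a JSON object containing `severity`, `category`, and `recommended_action`; the `TriageResult` type, the allowed severity values, and the `parse_triage` helper are hypothetical names for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed vocabulary
REQUIRED_FIELDS = {"severity", "category", "recommended_action"}


@dataclass(frozen=True)
class TriageResult:
    severity: str
    category: str
    recommended_action: str


def parse_triage(raw: str) -> TriageResult:
    """Parse and validate a model's JSON reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    severity = data["severity"]
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return TriageResult(severity, str(data["category"]), str(data["recommended_action"]))
```

A reply that fails validation is a natural trigger for a retry or for escalation to a more capable model, rather than being passed to automation as-is.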
Context Window Management
Modern models offer context windows from 8K to 200K+ tokens, but larger contexts increase costs and may reduce quality. Security teams should right-size context windows for their use cases rather than defaulting to maximum capacity.
| Context Size | Appropriate Use Case | Cost Implication |
|---|---|---|
| 4K-8K tokens | Simple classification, extraction | Lowest cost, fastest |
| 16K-32K tokens | Standard analysis with context | Moderate cost |
| 64K-128K tokens | Complex investigation, multi-document | Higher cost |
| 128K+ tokens | Full incident timeline, extensive history | Highest cost |
For most security operations tasks, 16K-32K tokens provides sufficient context. Reserve larger context windows for complex investigations where extensive history is genuinely required. Implement context budgets that allocate tokens across system prompt, retrieved context, conversation history, and user query.
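A context budget of this kind can be expressed as a small allocation function. This is a sketch under assumptions: the function name, the 25% share given to conversation history, and the choice to reserve output tokens inside the same window are illustrative, not prescribed by any provider.

```python
def allocate_context_budget(
    total_tokens: int,
    system_prompt_tokens: int,
    user_query_tokens: int,
    reserved_output_tokens: int,
    history_share: float = 0.25,
) -> dict[str, int]:
    """Split a context window into fixed parts plus history and retrieved context."""
    fixed = system_prompt_tokens + user_query_tokens + reserved_output_tokens
    remaining = total_tokens - fixed
    if remaining < 0:
        raise ValueError("fixed components already exceed the context budget")
    history = int(remaining * history_share)
    return {
        "system_prompt": system_prompt_tokens,
        "user_query": user_query_tokens,
        "reserved_output": reserved_output_tokens,
        "conversation_history": history,
        "retrieved_context": remaining - history,  # retrieval gets whatever is left
    }


budget = allocate_context_budget(16_000, 2_000, 500, 1_500)
```

With a 16K budget, a 2,000-token system prompt, a 500-token query, and 1,500 tokens reserved for the answer, this leaves 3,000 tokens for history and 9,000 for retrieved context.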
Caching Strategies
Caching is one of the most effective cost optimization techniques, potentially reducing LLM API costs by 30-70% for workloads with repetitive queries. Security operations often exhibit high query similarity—analysts ask similar questions about similar alert types, and automated systems process similar events repeatedly. Effective caching captures this repetition to avoid redundant API calls.
Caching Layers
Different caching layers address different types of repetition. A comprehensive caching strategy implements multiple layers, each capturing different optimization opportunities.
| Layer | Cached Content | Hit Rate Potential | Best For |
|---|---|---|---|
| Exact query cache | Identical requests | Low (5-15%) | Automated systems, repeated queries |
| Semantic cache | Similar queries | Medium (15-40%) | Analyst queries, natural language |
| Embedding cache | Vector computations | High (70-90%) | RAG systems, document processing |
| RAG cache | Retrieved contexts | Medium (30-50%) | Knowledge base queries |
| Response fragment cache | Partial responses | Medium (20-40%) | Structured outputs, templates |
Exact query caching stores responses for identical queries. While hit rates are typically low for human queries (which vary in phrasing), automated systems that generate consistent queries can achieve high hit rates. Alert enrichment systems that query the same IOC multiple times benefit significantly from exact caching.
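A minimal in-memory version of exact query caching might look like the following. The `ExactQueryCache` class and its `compute` callback (standing in for the actual model call) are hypothetical; a production deployment would more likely use a shared store such as Redis.

```python
import hashlib
import time
from typing import Callable


class ExactQueryCache:
    """Cache responses keyed by a hash of (model, prompt), with a time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       compute: Callable[[str], str]) -> str:
        """Return a fresh cached response, or call `compute` and store its result."""
        key = self._key(model, prompt)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = compute(prompt)
        self._entries[key] = (now, response)
        return response
```

Including the model name in the key keeps responses from different models apart, and the hit and miss counters give a direct reading of the cache hit rate.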
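Semantic caching uses embedding similarity to identify queries that are semantically equivalent even if phrased differently. “What do we know about 192.168.1.1?” and “Tell me about IP 192.168.1.1” can return the same cached response because both ask about the same IP. Queries that differ only in the indicator (a different IP or hash) can also score as highly similar, so the indicator itself should be part of the match. This approach dramatically increases hit rates for human-generated queries.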
Embedding caching stores computed embeddings to avoid regenerating vectors for previously processed content. Since embedding generation has its own costs and latency, caching embeddings for frequently accessed documents provides significant savings.
Semantic Caching Implementation
Semantic caching requires careful tuning to balance hit rates against response quality. Too aggressive caching returns irrelevant responses; too conservative caching misses optimization opportunities.
| Configuration | Description | Recommendation |
|---|---|---|
| Similarity threshold | Minimum similarity for cache hit | Start at 0.95, tune based on quality |
| Cache TTL | Time before expiration | Hours for dynamic data, days for static |
| Cache scope | Per-user vs. shared | Shared for general queries, per-user for personalized |
| Invalidation triggers | When to clear cache | Data updates, model changes, quality issues |
Similarity threshold determines how similar a query must be to return a cached response. Higher thresholds (0.95+) ensure high relevance but lower hit rates. Lower thresholds (0.85-0.90) increase hit rates but risk returning less relevant responses. Security applications should err toward higher thresholds given the importance of accuracy.
Cache invalidation is critical for security applications where stale data can be dangerous. Threat intelligence caches should invalidate when feeds update. Incident-related caches should invalidate when incident status changes. Implement event-driven invalidation rather than relying solely on TTL.
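The threshold and event-driven invalidation described above can be sketched together. This is an illustration only: the `SemanticCache` class, the tag strings, and the linear scan are assumptions, and embeddings are supplied by the caller from whatever embedding model is in use. A real deployment would use a vector index instead of scanning every entry.

```python
import math
from typing import Iterable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries: list[tuple[list[float], str, frozenset[str]]] = []

    def put(self, embedding: Sequence[float], response: str,
            tags: Iterable[str] = ()) -> None:
        """Store a response with tags naming the data sources it depends on."""
        self._entries.append((list(embedding), response, frozenset(tags)))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar cached response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for cached, response, _tags in self._entries:
            score = cosine(embedding, cached)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def invalidate(self, tag: str) -> int:
        """Event-driven invalidation: drop entries that depend on `tag`."""
        before = len(self._entries)
        self._entries = [e for e in self._entries if tag not in e[2]]
        return before - len(self._entries)
```

When a threat intelligence feed updates, the update handler would call something like `invalidate("feed:example")` so that dependent answers are dropped immediately instead of waiting for a TTL to expire.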
| Tool | Description | Key Features | Documentation |
|---|---|---|---|
| GPTCache | Semantic caching for LLMs | Multiple similarity backends, LangChain integration | GPTCache |
| Redis Semantic Cache | Redis-based semantic caching | Vector similarity search, high performance | Redis Vector Search |
| Momento | Serverless caching | Managed service, low latency | Momento |
| LangChain Caching | Built-in LangChain caching | Easy integration, multiple backends | LangChain Caching |
Caching Considerations for Security
Security applications have unique caching requirements that differ from general-purpose AI systems:
- Sensitivity classification - Don’t cache responses containing sensitive data that shouldn’t persist
- Access control - Ensure cached responses respect user permissions
- Audit requirements - Log cache hits for compliance, not just cache misses
- Freshness requirements - Security data often has strict freshness requirements
- Multi-tenancy - Prevent cache leakage between tenants or security boundaries
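Two of these requirements, multi-tenancy and sensitivity classification, can be enforced at the cache-key level. The sketch below is one possible approach; the function names, the permission scope string, and the sensitivity labels are hypothetical.

```python
import hashlib

CACHEABLE_SENSITIVITIES = {"public", "internal"}  # assumed classification labels


def scoped_cache_key(tenant_id: str, permission_scope: str, query: str) -> str:
    """Derive a cache key that differs across tenants and permission scopes."""
    parts = [tenant_id, permission_scope, query]
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    material = "".join(f"{len(part)}:{part}" for part in parts)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def is_cacheable(sensitivity: str) -> bool:
    """Only persist responses whose classification allows it."""
    return sensitivity in CACHEABLE_SENSITIVITIES
```

Because the tenant and scope are part of the key, a lookup from one tenant can never match an entry written by another, even for an identical query.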
Model Selection Optimization
Model selection is the highest-leverage cost optimization available—the difference between frontier and efficient models can exceed 20x per token. However, cheaper models sacrifice capability, potentially reducing detection accuracy or analysis quality. The key is matching model capability to task requirements, using expensive models only where their capabilities are genuinely needed.
Model Routing Strategies
Model routing directs queries to appropriate models based on characteristics that predict required capability. Effective routing achieves most of the quality of always using premium models at a fraction of the cost.
| Strategy | Description | Implementation | Cost Savings |
|---|---|---|---|
| Complexity-based | Route by query difficulty | Classify query complexity before routing | 40-70% |
| Confidence-based | Escalate low confidence | Start cheap, retry with premium if uncertain | 30-50% |
| Task-based | Model per task type | Static mapping of tasks to models | 50-80% |
| Cost-aware | Dynamic optimization | Real-time cost/quality optimization | 40-60% |
| Hybrid | Combine multiple strategies | Layered routing logic | 50-75% |
Complexity-based routing uses a classifier (often a small, fast model) to assess query difficulty before routing. Simple queries like “classify this alert” route to efficient models, while complex queries like “analyze this incident timeline and identify the attack chain” route to premium models. The classification cost is minimal compared to the savings from appropriate routing.
Confidence-based routing starts with efficient models and escalates only when needed. If the initial model returns low confidence or uncertain results, the query is automatically retried with a more capable model. This approach works well for tasks where most queries are simple, with occasional complex cases.
Task-based routing assigns specific models to specific task types based on known requirements. Classification tasks might always use GPT-3.5-turbo, while threat analysis always uses GPT-4. This approach is simple to implement and highly predictable but less adaptive than dynamic routing.
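Task-based routing with a complexity fallback can be sketched as a lookup plus a heuristic. Note the substitution: the text describes a classifier model for complexity, while this example uses a simple keyword and length heuristic as a stand-in. The task names, tier labels, and markers are hypothetical.

```python
from typing import Optional

# Static mapping of known task types to model tiers (illustrative names).
TASK_ROUTES = {
    "alert_classification": "efficient",
    "ioc_extraction": "efficient",
    "log_summarization": "efficient",
    "threat_analysis": "premium",
    "incident_investigation": "premium",
}

# Phrases treated as signs of a complex query (a stand-in for a real classifier).
COMPLEX_MARKERS = ("attack chain", "timeline", "root cause", "correlate")


def route(task_type: Optional[str], query: str, long_query_chars: int = 2_000) -> str:
    """Return the tier for a request: task mapping first, heuristic otherwise."""
    tier = TASK_ROUTES.get(task_type) if task_type else None
    if tier is not None:
        return tier
    text = query.lower()
    if len(query) > long_query_chars or any(marker in text for marker in COMPLEX_MARKERS):
        return "premium"
    return "efficient"
```

Keeping the static mapping first preserves the predictability of task-based routing, while the fallback handles requests that arrive without a known task type.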
Model Cascade Pattern
The cascade pattern routes queries through progressively more capable (and expensive) models, stopping when quality criteria are met. This approach maximizes cost efficiency by using expensive models only for genuinely difficult cases.
| Tier | Model Examples | Role | Typical Usage |
|---|---|---|---|
| Tier 1 | GPT-3.5-turbo, Claude 3 Haiku | Handle simple, clear queries | 60-80% of queries |
| Tier 2 | GPT-4o-mini, Claude 3.5 Sonnet | Moderate complexity, ambiguous cases | 15-30% of queries |
| Tier 3 | GPT-4o, Claude 3 Opus | Complex reasoning, high-stakes decisions | 5-15% of queries |
The cascade pattern requires clear criteria for escalation. Common approaches include:
- Confidence thresholds - Escalate when model-reported confidence falls below threshold
- Consistency checks - Escalate when multiple attempts produce inconsistent results
- Complexity indicators - Escalate based on query characteristics (length, technical terms, ambiguity)
- Output validation - Escalate when outputs fail validation checks
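The cascade with confidence and validation criteria can be sketched as a loop over tiers. This is an illustration under assumptions: `ModelReply`, `run_cascade`, and `passes_validation` are hypothetical names, each tier is represented by a callable standing in for a model call, and how a confidence score is obtained is left open.

```python
from typing import Callable, NamedTuple, Sequence


class ModelReply(NamedTuple):
    text: str
    confidence: float  # assumed to lie in [0, 1]; how it is produced is left open


def passes_validation(text: str) -> bool:
    """Placeholder output check (non-empty text); replace with schema or rule checks."""
    return bool(text.strip())


def run_cascade(
    query: str,
    tiers: Sequence[tuple[str, Callable[[str], ModelReply]]],
    min_confidence: float = 0.8,
) -> tuple[str, ModelReply]:
    """Try tiers cheapest-first; stop at the first reply that meets the criteria."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for position, (name, call_model) in enumerate(tiers):
        reply = call_model(query)
        is_last = position == len(tiers) - 1
        accepted = reply.confidence >= min_confidence and passes_validation(reply.text)
        if accepted or is_last:
            return name, reply  # the last tier's answer is returned regardless
    raise AssertionError("unreachable: the loop always returns")
```

Every escalation pays for the cheaper attempts as well, so the pattern saves money only when most queries are resolved in the first tier.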
Security-Specific Model Selection
Security applications have unique model selection considerations beyond general cost optimization:
| Security Task | Recommended Tier | Rationale |
|---|---|---|
| Alert classification | Tier 1-2 | High volume, structured task |
| IOC extraction | Tier 1 | Pattern matching, structured output |
| Threat analysis | Tier 2-3 | Complex reasoning required |
| Incident investigation | Tier 3 | High stakes, complex context |
| Log summarization | Tier 1-2 | Straightforward extraction |
| Vulnerability assessment | Tier 2-3 | Technical accuracy critical |
For security applications, err toward higher-capability models when:
- Decisions trigger automated responses
- False negatives have severe consequences
- Analysis informs high-stakes decisions
- Complex reasoning or multi-step analysis is required
| Tool | Description | Documentation |
|---|---|---|
| Martian | Automatic model routing | Martian |
| OpenRouter | Multi-model routing | OpenRouter |
| LiteLLM | Unified API with routing | LiteLLM |
| Portkey | AI gateway with routing | Portkey |
Batching and Async Processing
Batching combines multiple operations into single requests, reducing overhead and often qualifying for volume discounts. Async processing defers non-urgent work to optimize resource utilization. Both techniques trade latency for cost efficiency, making them appropriate for workloads that don’t require real-time responses.
Batching Strategies
| Strategy | Description | Latency Trade-off | Cost Benefit |
|---|---|---|---|
| Request batching | Combine multiple LLM queries | Higher latency (wait for batch) | Reduced API overhead |
| Embedding batching | Batch vector operations | Minimal latency impact | Significant savings at scale |
| Background processing | Defer non-urgent work | Async completion | Off-peak pricing, better utilization |
| Bulk ingestion | Batch document processing | Hours to complete | Optimized throughput |
| Scheduled processing | Time-based batching | Predictable delays | Consistent cost patterns |
Request batching accumulates queries over a time window or until reaching a batch size threshold, then processes them together. This approach works well for alert enrichment, where enriching 100 alerts in a batch is more efficient than 100 individual requests. The trade-off is increased latency for individual items.
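The accumulation logic can be sketched as a small class that flushes on either a size or a time threshold. `BatchAccumulator` and its parameters are hypothetical, and `process_batch` stands in for whatever sends the combined request.

```python
import time
from typing import Any, Callable


class BatchAccumulator:
    """Collect items and flush when the batch is full or the oldest item has waited too long."""

    def __init__(self, process_batch: Callable[[list], Any], max_size: int = 100,
                 max_wait_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._process = process_batch
        self._max_size = max_size
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._items: list = []
        self._oldest = 0.0

    def add(self, item: Any) -> None:
        if not self._items:
            self._oldest = self._clock()  # start the wait window at the first item
        self._items.append(item)
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Flush if a threshold is met; call this periodically from a timer as well."""
        if not self._items:
            return False
        waited = self._clock() - self._oldest
        if len(self._items) >= self._max_size or waited >= self._max_wait:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        batch, self._items = self._items, []
        if batch:
            self._process(batch)
```

The time threshold bounds the extra latency any single item can incur, which is the trade-off the batch window controls.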
Embedding batching processes multiple documents or chunks in single API calls. Most embedding APIs support batch operations that are significantly more efficient than individual calls. Processing 100 documents in a single batch call may cost 20-30% less than 100 individual calls.
Background processing moves non-urgent work to asynchronous queues, processing during low-demand periods. This approach enables better resource utilization and may qualify for lower pricing tiers with some providers.
When to Batch vs. Real-time
Security operations have varying latency requirements. Some tasks require immediate responses, while others can tolerate delays of seconds, minutes, or even hours.
| Use Case | Latency Tolerance | Batch Appropriate | Processing Approach |
|---|---|---|---|
| Alert enrichment | Minutes | ✓ Yes | Batch every 30-60 seconds |
| Threat intelligence processing | Hours | ✓ Yes | Daily batch jobs |
| Report generation | Hours | ✓ Yes | Scheduled processing |
| Historical analysis | Days | ✓ Yes | Background jobs |
| Real-time detection | Milliseconds | ✗ No | Stream processing |
| Interactive investigation | Seconds | ✗ No | On-demand |
| Automated response | Milliseconds | ✗ No | Real-time |
Implementing Batch Processing
Effective batch processing requires infrastructure for queue management, batch accumulation, and result distribution:
- Queue systems - Use message queues (Redis, RabbitMQ, SQS) to accumulate work
- Batch windows - Configure time-based or count-based batching thresholds
- Priority queues - Allow urgent items to bypass batching when needed
- Result distribution - Ensure results route back to appropriate consumers
- Error handling - Handle partial batch failures gracefully
| Tool | Description | Documentation |
|---|---|---|
| Celery | Python distributed task queue | Celery |
| Bull | Node.js queue system | Bull |
| AWS SQS | Managed message queue | SQS |
| Modal | Serverless batch processing | Modal |
Cost Governance
Effective cost governance establishes organizational controls, accountability, and processes for managing AI expenses. Without governance, costs can escalate rapidly as teams adopt AI without visibility into spending or incentives to optimize.
Budget Controls
| Control | Description | Implementation | Enforcement Level |
|---|---|---|---|
| Spending limits | Hard caps on usage | API key limits, gateway controls | Hard stop |
| Rate limiting | Query throttling | Requests per minute/hour | Graceful degradation |
| Cost alerts | Threshold notifications | Monitoring alerts | Warning only |
| Usage reporting | Visibility into spend | Dashboards, reports | Transparency |
| Approval workflows | Authorization for high-cost operations | Request/approve process | Pre-authorization |
| Quota allocation | Per-team or per-project budgets | Budget pools | Soft caps with escalation |
Spending limits provide hard caps that prevent runaway costs. Most API providers support spending limits at the account or API key level. AI gateways like Portkey and Helicone provide more granular controls including per-user and per-application limits.
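Where a provider or gateway does not offer the needed granularity, a spending limit can also be enforced in application code. The sketch below is a hypothetical `BudgetGuard`, not the API of any gateway; the 80% alert fraction is an assumption chosen to match the budget alert threshold used in this article's monitoring tables.

```python
class BudgetGuard:
    """Track spend against a hard limit, with a warning band below it."""

    def __init__(self, limit_usd: float, alert_fraction: float = 0.8):
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0

    def check(self, estimated_cost_usd: float) -> str:
        """Return 'ok', 'alert', or 'blocked' for a request of the given estimated cost."""
        projected = self.spent_usd + estimated_cost_usd
        if projected > self.limit_usd:
            return "blocked"  # hard stop: the request would exceed the cap
        if projected >= self.alert_fraction * self.limit_usd:
            return "alert"    # allowed, but the owner should be notified
        return "ok"

    def record(self, actual_cost_usd: float) -> None:
        """Add the actual cost of a completed request to the running total."""
        self.spent_usd += actual_cost_usd
```

Checking against an estimate before the call and recording the actual cost afterwards keeps the cap meaningful even when output length varies.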
Rate limiting throttles request volume rather than spending directly. Rate limits provide graceful degradation: excess queries are queued or asked to retry later rather than being dropped silently. This approach prevents cost spikes from traffic bursts while maintaining availability.
Approval workflows require authorization before high-cost operations. An investigation requiring extended context or premium models might require manager approval, ensuring expensive operations are justified.
Cost Attribution
Cost attribution tracks spending across organizational dimensions, enabling accountability and ROI analysis. Without attribution, costs become shared overhead with no incentive for individual teams to optimize.
| Attribution Dimension | Purpose | Implementation |
|---|---|---|
| By team/department | Departmental chargeback | Team identifiers in requests |
| By use case | ROI analysis per application | Application tagging |
| By model | Model economics analysis | Model tracking in logs |
| By user | Individual accountability | User identifiers |
| By time period | Trend analysis, budgeting | Timestamp aggregation |
| By query type | Optimization targeting | Query classification |
Implement attribution through consistent tagging of all AI requests with relevant metadata. AI gateways and observability platforms typically provide attribution capabilities out of the box.
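Once requests carry tags, attribution reduces to grouping cost by a tag. The record shape below (a `cost_usd` value and a `tags` mapping) is an assumption for illustration; real logs from a gateway or observability platform will have their own schema.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def attribute_costs(requests: Iterable[Mapping], dimension: str) -> dict[str, float]:
    """Sum request cost by one tag dimension; untagged requests are grouped together."""
    totals: dict[str, float] = defaultdict(float)
    for request in requests:
        label = request["tags"].get(dimension, "untagged")
        totals[label] += request["cost_usd"]
    return dict(totals)


# Hypothetical request log entries.
request_log = [
    {"cost_usd": 0.5, "tags": {"team": "soc", "model": "premium"}},
    {"cost_usd": 0.25, "tags": {"team": "soc", "model": "efficient"}},
    {"cost_usd": 1.0, "tags": {"team": "ir", "model": "premium"}},
    {"cost_usd": 0.125, "tags": {}},
]
by_team = attribute_costs(request_log, "team")
```

Reporting the "untagged" bucket explicitly makes gaps in tagging visible, since untagged spend cannot be charged back to any team.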
ROI Measurement
Cost optimization efforts should demonstrate return on investment. Security AI systems should quantify value delivered relative to costs incurred.
| Metric | Calculation | Target |
|---|---|---|
| Cost per alert triaged | Total AI cost / alerts processed | < analyst cost per alert |
| Analyst time saved | Manual triage time per alert × alerts automated | Positive ROI |
| Detection improvement | (AI detections - baseline) / baseline | Measurable lift |
| False positive reduction | Baseline FP rate - AI FP rate | Measurable reduction |
| Cost per investigation hour saved | AI cost / analyst hours saved | < analyst hourly cost |
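Several of these metrics follow from the same few inputs. The sketch below assumes every automated alert would otherwise have needed manual triage, which overstates savings if some alerts would have been ignored. The function and field names are hypothetical.

```python
def roi_metrics(ai_cost_usd: float, alerts_processed: int,
                manual_minutes_per_alert: float, analyst_hourly_cost_usd: float) -> dict:
    """Compute cost-per-alert and time-saved metrics from the table's formulas."""
    if alerts_processed <= 0 or manual_minutes_per_alert <= 0:
        raise ValueError("alerts_processed and manual_minutes_per_alert must be positive")
    hours_saved = alerts_processed * manual_minutes_per_alert / 60
    labor_value = hours_saved * analyst_hourly_cost_usd
    return {
        "cost_per_alert": ai_cost_usd / alerts_processed,
        "analyst_hours_saved": hours_saved,
        "cost_per_hour_saved": ai_cost_usd / hours_saved,
        "net_value": labor_value - ai_cost_usd,  # positive means the AI spend paid off
    }


# Illustrative inputs: $3,000 AI spend, 30,000 alerts, 2 minutes each, $60/hour analyst.
metrics = roi_metrics(3_000.0, 30_000, 2.0, 60.0)
```

With these illustrative inputs, the cost per hour saved is $3 against an analyst cost of $60 per hour.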
Governance Framework
Establish clear policies and processes for AI cost management:
- Cost ownership - Assign budget owners responsible for AI spending
- Optimization reviews - Regular reviews of cost efficiency and optimization opportunities
- Model approval - Process for approving new model deployments
- Escalation procedures - Clear paths for budget overruns or unusual spending
- Optimization incentives - Recognize teams that achieve cost efficiency improvements
Cost Monitoring
Effective cost monitoring provides visibility into AI spending, enables optimization, and supports governance. Monitoring should capture both aggregate trends and granular details for troubleshooting.
Key Metrics
| Metric | Description | Target | Alert Threshold |
|---|---|---|---|
| Cost per query | Average query cost | Minimize | > 2x baseline |
| Token efficiency | Output value per token | Maximize | < 50% baseline |
| Cache hit rate | Cached vs. new queries | > 30% | < 20% |
| Model utilization | Cheap model usage | > 70% | < 50% |
| Cost per detection | AI cost per threat detected | Minimize | > 3x baseline |
| Daily/weekly spend | Aggregate spending | Within budget | > 80% of budget |
| Cost trend | Week-over-week change | Stable or decreasing | > 20% increase |
| Error rate | Failed queries wasting cost | < 1% | > 5% |
Monitoring Dashboards
Effective cost monitoring dashboards should include:
- Real-time spend - Current period spending vs. budget
- Cost breakdown - By model, team, use case, and time
- Efficiency metrics - Cache hit rates, model utilization, token efficiency
- Trend analysis - Week-over-week and month-over-month comparisons
- Anomaly detection - Unusual spending patterns flagged for investigation
- Optimization opportunities - Identified areas for improvement
| Tool | Description | Key Features | Documentation |
|---|---|---|---|
| Helicone | LLM cost tracking | Request logging, cost attribution, caching | Helicone |
| LangSmith | LangChain analytics | Tracing, evaluation, cost tracking | LangSmith |
| Portkey | AI gateway | Routing, caching, observability | Portkey |
| OpenAI Usage | Native dashboard | Account-level usage | OpenAI |
| Anthropic Console | Native dashboard | Claude usage tracking | Anthropic Console |
| Datadog LLM Observability | APM integration | End-to-end tracing | Datadog |
Alerting Strategy
Configure alerts to catch cost issues before they become problems:
| Alert Type | Trigger | Response |
|---|---|---|
| Budget threshold | 80% of period budget consumed | Review spending, identify drivers |
| Spend spike | > 2x normal daily spend | Investigate anomaly, potential issues |
| Efficiency drop | Cache hit rate < 20% | Check cache configuration |
| Model drift | Premium model usage > 40% | Review routing configuration |
| Error spike | Error rate > 5% | Investigate failures, potential waste |
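The alert triggers in the table can be evaluated with a straightforward rule check. The metric field names below are hypothetical; the thresholds are the ones listed in the table.

```python
from typing import Mapping


def evaluate_alerts(metrics: Mapping[str, float]) -> list[str]:
    """Return the names of the alert types whose trigger condition is met."""
    alerts = []
    if metrics["period_spend"] >= 0.8 * metrics["period_budget"]:
        alerts.append("budget_threshold")   # 80% of period budget consumed
    if metrics["daily_spend"] > 2 * metrics["normal_daily_spend"]:
        alerts.append("spend_spike")        # more than 2x normal daily spend
    if metrics["cache_hit_rate"] < 0.20:
        alerts.append("efficiency_drop")    # cache hit rate below 20%
    if metrics["premium_share"] > 0.40:
        alerts.append("model_drift")        # premium model usage above 40%
    if metrics["error_rate"] > 0.05:
        alerts.append("error_spike")        # error rate above 5%
    return alerts
```

In practice these checks would run on a schedule against aggregated metrics, with each alert name mapped to the response in the table.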
References
Model Routing and Gateways
- LiteLLM - Unified API for multiple LLM providers
- Portkey - AI gateway with routing and observability
- OpenRouter - Multi-provider LLM routing
- Martian - Automatic model routing optimization