AI costs can escalate rapidly in production security systems: high query volumes, large context windows, and premium models create significant operational expenses. A single security operations center processing 10,000 alerts daily through GPT-4 could incur monthly costs exceeding $50,000 without optimization. Effective cost management balances performance requirements against budget constraints while maintaining the security effectiveness that justifies AI investment.

The economics of AI in security differ from traditional software costs. Token-based pricing creates variable expenses that scale with usage, making cost prediction challenging. Security workloads exhibit high variance: quiet periods may cost little while active incidents can consume substantial resources. Additionally, security applications often require premium models for accuracy, creating tension between cost efficiency and detection effectiveness.

According to a16z research on AI infrastructure costs, compute and inference costs represent the largest expense category for AI applications, often exceeding 50% of total operational costs. For security teams, understanding and optimizing these costs is essential for sustainable AI adoption and demonstrating ROI to leadership.

Understanding AI Costs

Before optimizing costs, security teams must understand the components that drive AI expenses. Unlike traditional software licensing with predictable costs, AI systems incur variable expenses based on usage patterns, model selection, and implementation architecture.

Cost Components

| Component | Description | Cost Driver | Security Consideration |
| --- | --- | --- | --- |
| Input tokens | Prompt + context length | Context size, RAG retrieval | Security context often requires extensive background |
| Output tokens | Response length | Output verbosity | Detailed analysis requires longer outputs |
| Model tier | Capability level | GPT-4 vs GPT-3.5 pricing | Accuracy requirements may mandate premium models |
| API calls | Request volume | Query frequency | High alert volumes drive request counts |
| Infrastructure | Compute, storage | Self-hosted deployments | Data sovereignty may require on-premises hosting |
| Embedding generation | Vector creation | Document ingestion | Threat intel feeds require continuous embedding |
| Vector storage | Database costs | Index size, query volume | Growing knowledge bases increase storage costs |

Input tokens typically represent the largest cost component for security applications. RAG-based systems that retrieve context from threat intelligence, previous incidents, and organizational knowledge can easily consume 10,000+ tokens per query. A system prompt of 2,000 tokens plus 8,000 tokens of retrieved context means 10,000 input tokens before the user query even begins.

Output tokens cost more per token than inputs (typically 2-4x) but represent smaller volumes. However, security applications requiring detailed analysis, step-by-step reasoning, or comprehensive reports can generate substantial output costs.

Model selection creates the most significant cost variance. GPT-4 costs approximately 20-30x more than GPT-3.5-turbo per token. Claude 3 Opus costs roughly 15x more than Claude 3 Haiku. Choosing appropriate models for each task type is the highest-leverage optimization available.
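The arithmetic behind these components can be sketched in a few lines. This is a minimal illustration, not a pricing reference: the per-million-token prices, the 500-token response length, and the names `ModelPrice` and `query_cost` are assumptions made up for the example, so substitute current provider pricing before relying on the numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per one million tokens. Placeholder figures, not vendor quotes."""
    input_per_million: float
    output_per_million: float


def query_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under simple per-token pricing."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# Hypothetical price with output tokens at 3x the input rate.
EXAMPLE_PRICE = ModelPrice(input_per_million=10.0, output_per_million=30.0)

# 2,000-token system prompt plus 8,000 tokens of retrieved context,
# and an assumed 500-token response.
per_query = query_cost(EXAMPLE_PRICE, input_tokens=2_000 + 8_000, output_tokens=500)

# 10,000 alerts per day over a 30-day month.
monthly = per_query * 10_000 * 30

# Share of the per-query cost attributable to input tokens.
input_share = query_cost(EXAMPLE_PRICE, 10_000, 0) / per_query
```

Under these assumed prices a query costs about $0.115, a month of 10,000 daily alerts costs about $34,500, and input tokens account for roughly 87% of the spend even though output tokens are priced three times higher.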

Cost Comparison by Model

Understanding relative model costs helps security teams make informed decisions about which models to use for different tasks. Prices change frequently, so teams should consult current pricing from OpenAI, Anthropic, and Google.

| Model Category | Relative Cost | Typical Use Case | Security Application |
| --- | --- | --- | --- |
| Frontier models (GPT-4o, Claude 3.5 Sonnet) | $$$$$ | Complex reasoning, high stakes | Incident investigation, threat analysis |
| Mid-tier models (GPT-4o-mini, Claude 3 Haiku) | $$ | Balanced cost/performance | Alert triage, log summarization |
| Efficient models (GPT-3.5-turbo) | $ | High volume, simpler tasks | Classification, extraction |
| Open source (Llama 3.1, Mistral) | $ (compute) | Self-hosted, privacy requirements | Sensitive data processing |
| Embedding models (text-embedding-3-small) | ¢ | Vector generation | Threat intel indexing |

The cost differential between model tiers is substantial. Processing 1 million tokens through GPT-4o costs approximately $5-15 (input/output combined), while the same volume through GPT-3.5-turbo costs under $1. This 10-20x difference makes model selection the most impactful optimization lever.

Token Optimization

Token optimization represents the most direct path to cost reduction. Since costs scale linearly with token consumption, reducing tokens directly reduces expenses. Security teams should focus on both input and output optimization, with input optimization typically offering larger savings due to the volume of context required for security analysis.

Input Token Reduction

Input tokens often represent 80-90% of total token consumption in RAG-based security systems. Reducing input tokens without sacrificing quality requires careful attention to context selection, prompt engineering, and compression techniques.

| Strategy | Description | Savings Potential | Implementation Complexity |
| --- | --- | --- | --- |
| Context compression | Reduce retrieved context size | 30-70% | Medium - requires summarization |
| Prompt optimization | Concise, efficient instructions | 10-30% | Low - prompt engineering |
| Selective retrieval | Retrieve only relevant content | 40-60% | Medium - improved retrieval |
| Schema constraints | Structured, minimal inputs | 20-40% | Low - format standardization |
| Dynamic context | Adjust context by query type | 30-50% | High - query classification |
| Deduplication | Remove redundant information | 10-25% | Low - preprocessing |

Context compression uses summarization or extraction to reduce the size of retrieved documents while preserving essential information. Tools like LLMLingua from Microsoft Research can compress prompts by 2-10x with minimal quality loss. For security applications, compression should preserve IOCs, timestamps, and technical details while removing boilerplate.

Selective retrieval improves the precision of RAG systems so that only highly relevant content is retrieved. Rather than retrieving 10 documents and hoping some are relevant, improved retrieval returns 3-5 highly relevant documents. Techniques include hybrid search (combining semantic and keyword matching), reranking retrieved results, and query expansion.

Dynamic context adjusts the amount of context based on query complexity. Simple classification queries may need minimal context, while complex investigation queries require extensive background. Implementing query classification to route requests to appropriate context levels can significantly reduce average token consumption.

Output Token Control

Output tokens cost more per token than inputs but represent smaller volumes. However, security applications often require detailed outputs such as investigation summaries, threat analysis reports, and remediation recommendations. Balancing detail with cost requires careful output management.

| Strategy | Description | Implementation | Trade-off |
| --- | --- | --- | --- |
| Response length limits | Cap output tokens | max_tokens parameter | May truncate important content |
| Structured outputs | Constrained formats | JSON mode, function calling | Reduces flexibility |
| Concise prompting | Request brevity | Prompt engineering | May lose detail |
| Streaming cutoffs | Early termination | Stop sequences | Requires careful design |
| Tiered responses | Brief first, detail on request | Multi-turn design | Adds latency for detail |

Structured outputs using JSON mode or function calling constrain model outputs to specific schemas, eliminating verbose prose in favor of structured data. For alert triage, a structured output with severity, category, and recommended action fields is more cost-effective than a prose explanation, and often more useful for downstream automation.

Tiered responses provide brief initial answers with the option to request more detail. An alert triage system might first return a one-line classification, then provide detailed analysis only when requested. This approach reduces costs for routine queries while preserving capability for complex cases.
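As a rough illustration of the alert triage case, the sketch below validates a structured model response against the three fields mentioned above. The field names, the allowed severity labels, and the `parse_triage` helper are assumptions for this example; no provider API is called, and the raw JSON string stands in for whatever a model returns in JSON mode.

```python
import json
from dataclasses import dataclass

# Hypothetical severity vocabulary; adapt to the organization's own scale.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class TriageResult:
    severity: str
    category: str
    recommended_action: str


def parse_triage(raw: str) -> TriageResult:
    """Parse and validate a compact JSON triage response.

    Raises json.JSONDecodeError on malformed JSON, KeyError on a missing
    field, and ValueError on an unknown severity label.
    """
    data = json.loads(raw)
    result = TriageResult(
        severity=str(data["severity"]).lower(),
        category=str(data["category"]),
        recommended_action=str(data["recommended_action"]),
    )
    if result.severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {result.severity!r}")
    return result


# Stand-in for a model response; a real response would come from the LLM API.
example = parse_triage(
    '{"severity": "High", "category": "phishing", '
    '"recommended_action": "quarantine_email"}'
)
```

A response of this shape is a few dozen tokens, whereas a prose explanation of the same verdict could run to several hundred, and the validation step gives downstream automation a clear failure signal when the model drifts from the schema.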

Context Window Management

Modern models offer context windows from 8K to 200K+ tokens, but larger contexts increase costs and may reduce quality. Security teams should right-size context windows for their use cases rather than defaulting to maximum capacity.

| Context Size | Appropriate Use Case | Cost Implication |
| --- | --- | --- |
| 4K-8K tokens | Simple classification, extraction | Lowest cost, fastest |
| 16K-32K tokens | Standard analysis with context | Moderate cost |
| 64K-128K tokens | Complex investigation, multi-document | Higher cost |
| 128K+ tokens | Full incident timeline, extensive history | Highest cost |

For most security operations tasks, 16K-32K tokens provides sufficient context. Reserve larger context windows for complex investigations where extensive history is genuinely required. Implement context budgets that allocate tokens across system prompt, retrieved context, conversation history, and user query.
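One way a context budget might look in practice is sketched below. The allocation numbers, the `ContextBudget` and `fit_chunks` names, and the word-count token estimate are all assumptions for illustration; a real system would use the model's own tokenizer. The sketch also assumes the response shares the same window, so it reserves room for output.

```python
from dataclasses import dataclass
from typing import List


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


@dataclass(frozen=True)
class ContextBudget:
    total: int            # overall token budget for the request
    system_prompt: int
    history: int
    user_query: int
    reserved_output: int  # room left for the model's response

    @property
    def retrieved_context(self) -> int:
        """Tokens left for retrieved documents after the fixed allocations."""
        fixed = (self.system_prompt + self.history
                 + self.user_query + self.reserved_output)
        return max(0, self.total - fixed)


def fit_chunks(ranked_chunks: List[str], budget_tokens: int) -> List[str]:
    """Keep chunks in relevance order, skipping any that would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = approx_tokens(chunk)
        if used + cost <= budget_tokens:
            kept.append(chunk)
            used += cost
    return kept


# Hypothetical allocation inside a 16K budget.
budget = ContextBudget(total=16_000, system_prompt=2_000, history=2_000,
                       user_query=500, reserved_output=1_500)
```

With this allocation, `budget.retrieved_context` leaves 10,000 tokens for retrieved documents. Skipping oversized chunks rather than stopping at the first one that does not fit lets a short, lower-ranked chunk still use leftover space.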

Caching Strategies

Caching is one of the most effective cost optimization techniques, potentially reducing LLM API costs by 30-70% for workloads with repetitive queries. Security operations often exhibit high query similarity—analysts ask similar questions about similar alert types, and automated systems process similar events repeatedly. Effective caching captures this repetition to avoid redundant API calls.

Caching Layers

Different caching layers address different types of repetition. A comprehensive caching strategy implements multiple layers, each capturing different optimization opportunities.

| Layer | Cached Content | Hit Rate Potential | Best For |
| --- | --- | --- | --- |
| Exact query cache | Identical requests | Low (5-15%) | Automated systems, repeated queries |
| Semantic cache | Similar queries | Medium (15-40%) | Analyst queries, natural language |
| Embedding cache | Vector computations | High (70-90%) | RAG systems, document processing |
| RAG cache | Retrieved contexts | Medium (30-50%) | Knowledge base queries |
| Response fragment cache | Partial responses | Medium (20-40%) | Structured outputs, templates |

Exact query caching stores responses for identical queries. While hit rates are typically low for human queries (which vary in phrasing), automated systems that generate consistent queries can achieve high hit rates. Alert enrichment systems that query the same IOC multiple times benefit significantly from exact caching.

Semantic caching uses embedding similarity to identify queries that are semantically equivalent even if phrased differently. “What is this IP address?” and “Tell me about IP 192.168.1.1” might return the same cached response if the IP is identical. This approach dramatically increases hit rates for human-generated queries.

Embedding caching stores computed embeddings to avoid regenerating vectors for previously processed content. Since embedding generation has its own costs and latency, caching embeddings for frequently accessed documents provides significant savings.
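An exact query cache can be as simple as a dictionary keyed on a hash of the model name and prompt. The sketch below is an in-memory illustration only; the `ExactQueryCache` class and its whitespace normalization are assumptions, and a production deployment would more likely sit on a shared store with expiry.

```python
import hashlib
from typing import Callable, Dict


class ExactQueryCache:
    """In-memory exact-match cache keyed on model name plus prompt text."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Collapse whitespace only. Case is preserved because some
        # indicators (URL paths, for example) are case-sensitive.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       compute: Callable[[str], str]) -> str:
        """Return the cached response, or call compute(prompt) and store it."""
        key = self.make_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = compute(prompt)
        self._store[key] = response
        return response
```

Here `compute` stands in for the actual LLM call. Including the model name in the key keeps responses from different models apart, which matters once routing sends the same prompt to different tiers.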

Semantic Caching Implementation

Semantic caching requires careful tuning to balance hit rates against response quality. Overly aggressive caching returns irrelevant responses; overly conservative caching misses optimization opportunities.

| Configuration | Description | Recommendation |
| --- | --- | --- |
| Similarity threshold | Minimum similarity for cache hit | Start at 0.95, tune based on quality |
| Cache TTL | Time before expiration | Hours for dynamic data, days for static |
| Cache scope | Per-user vs. shared | Shared for general queries, per-user for personalized |
| Invalidation triggers | When to clear cache | Data updates, model changes, quality issues |

Similarity threshold determines how similar a query must be to return a cached response. Higher thresholds (0.95+) ensure high relevance but lower hit rates. Lower thresholds (0.85-0.90) increase hit rates but risk returning less relevant responses. Security applications should err toward higher thresholds given the importance of accuracy.

Cache invalidation is critical for security applications where stale data can be dangerous. Threat intelligence caches should invalidate when feeds update. Incident-related caches should invalidate when incident status changes. Implement event-driven invalidation rather than relying solely on TTL.
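The three settings discussed above (threshold, TTL, and event-driven invalidation) fit together roughly as in the sketch below. It is a toy in-memory version under stated assumptions: callers supply embedding vectors from whatever embedding model they use, lookup is a linear scan rather than a vector index, and the `SemanticCache` class and tag naming are invented for the example.

```python
import math
import time
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CacheEntry:
    vector: List[float]
    response: str
    expires_at: float
    tags: FrozenSet[str]


class SemanticCache:
    def __init__(self, threshold: float = 0.95, ttl_seconds: float = 3600.0) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self._entries: List[CacheEntry] = []

    def put(self, vector: List[float], response: str,
            tags: Iterable[str] = (), now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries.append(
            CacheEntry(list(vector), response, now + self.ttl_seconds, frozenset(tags))
        )

    def get(self, vector: List[float], now: Optional[float] = None) -> Optional[str]:
        """Return the most similar unexpired response at or above the threshold."""
        now = time.time() if now is None else now
        best_score, best = self.threshold, None
        for entry in self._entries:
            if entry.expires_at <= now:
                continue  # TTL elapsed; treat as a miss
            score = cosine_similarity(vector, entry.vector)
            if score >= best_score:
                best_score, best = score, entry.response
        return best

    def invalidate(self, tag: str) -> int:
        """Event-driven invalidation: drop every entry carrying the tag."""
        before = len(self._entries)
        self._entries = [e for e in self._entries if tag not in e.tags]
        return before - len(self._entries)
```

Tagging entries with their data source (a hypothetical `"feed:example"` tag, say) lets a feed-update event call `invalidate` immediately instead of waiting for the TTL to lapse.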

Caching Tools and Platforms

| Tool | Description | Key Features | Documentation |
| --- | --- | --- | --- |
| GPTCache | Semantic caching for LLMs | Multiple similarity backends, LangChain integration | GPTCache |
| Redis Semantic Cache | Redis-based semantic caching | Vector similarity search, high performance | Redis Vector Search |
| Momento | Serverless caching | Managed service, low latency | Momento |
| LangChain Caching | Built-in LangChain caching | Easy integration, multiple backends | LangChain Caching |

Caching Considerations for Security

Security applications have unique caching requirements that differ from general-purpose AI systems:
  • Sensitivity classification - Don’t cache responses containing sensitive data that shouldn’t persist
  • Access control - Ensure cached responses respect user permissions
  • Audit requirements - Log cache hits for compliance, not just cache misses
  • Freshness requirements - Security data often has strict freshness requirements
  • Multi-tenancy - Prevent cache leakage between tenants or security boundaries
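Several of these requirements can be enforced at the point where the cache key is built. The sketch below is one possible approach, not a prescribed design: the classification labels, the `scoped_cache_key` function, and the idea of representing permissions as a single scope string are assumptions for illustration.

```python
import hashlib
from typing import Optional

# Hypothetical sensitivity labels whose responses should never be cached.
NON_CACHEABLE = {"restricted", "secret"}


def scoped_cache_key(tenant_id: str, permission_scope: str,
                     query: str, classification: str) -> Optional[str]:
    """Build a cache key bound to tenant and permission scope.

    Returns None when the data classification forbids caching, so the
    caller skips the cache entirely for that request.
    """
    if classification in NON_CACHEABLE:
        return None
    # The unit separator keeps field boundaries unambiguous, so
    # ("ab", "c") and ("a", "bc") cannot collide.
    material = "\x1f".join([tenant_id, permission_scope, query])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because the tenant and scope are part of the key, a response cached for one tenant or role cannot be served to another even when the query text is identical.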

Model Selection Optimization

Model selection is the highest-leverage cost optimization available—the difference between frontier and efficient models can exceed 20x per token. However, cheaper models sacrifice capability, potentially reducing detection accuracy or analysis quality. The key is matching model capability to task requirements, using expensive models only where their capabilities are genuinely needed.

Model Routing Strategies

Model routing directs queries to appropriate models based on characteristics that predict required capability. Effective routing achieves most of the quality of always using premium models at a fraction of the cost.

| Strategy | Description | Implementation | Cost Savings |
| --- | --- | --- | --- |
| Complexity-based | Route by query difficulty | Classify query complexity before routing | 40-70% |
| Confidence-based | Escalate low confidence | Start cheap, retry with premium if uncertain | 30-50% |
| Task-based | Model per task type | Static mapping of tasks to models | 50-80% |
| Cost-aware | Dynamic optimization | Real-time cost/quality optimization | 40-60% |
| Hybrid | Combine multiple strategies | Layered routing logic | 50-75% |

Complexity-based routing uses a classifier (often a small, fast model) to assess query difficulty before routing. Simple queries like “classify this alert” route to efficient models, while complex queries like “analyze this incident timeline and identify the attack chain” route to premium models. The classification cost is minimal compared to the savings from appropriate routing.

Confidence-based routing starts with efficient models and escalates only when needed. If the initial model returns low confidence or uncertain results, the query is automatically retried with a more capable model. This approach works well for tasks where most queries are simple, with occasional complex cases.

Task-based routing assigns specific models to specific task types based on known requirements. Classification tasks might always use GPT-3.5-turbo, while threat analysis always uses GPT-4. This approach is simple to implement and highly predictable but less adaptive than dynamic routing.
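Task-based routing is the simplest of these to sketch, since it reduces to a lookup table. The task labels and the `route_by_task` function below are assumptions for illustration, and the model identifiers simply mirror the examples in the text. Falling back to the premium model for unknown task types is a design choice made here, not something the strategy requires.

```python
# Illustrative static mapping of task types to model identifiers.
TASK_MODEL_MAP = {
    "alert_classification": "gpt-3.5-turbo",
    "ioc_extraction": "gpt-3.5-turbo",
    "log_summarization": "gpt-3.5-turbo",
    "threat_analysis": "gpt-4",
    "incident_investigation": "gpt-4",
}

PREMIUM_MODEL = "gpt-4"


def route_by_task(task_type: str) -> str:
    """Return the model for a task type.

    Unrecognized task types fall back to the premium model, trading
    cost for safety when the requirements are unknown.
    """
    return TASK_MODEL_MAP.get(task_type, PREMIUM_MODEL)
```

For example, `route_by_task("ioc_extraction")` selects the efficient model, while an unmapped task such as `"malware_reverse_engineering"` goes to the premium one. A cost-first deployment could invert the fallback and log unmapped tasks for review instead.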

Model Cascade Pattern

The cascade pattern routes queries through progressively more capable (and expensive) models, stopping when quality criteria are met. This approach maximizes cost efficiency by using expensive models only for genuinely difficult cases.

| Tier | Model Examples | Role | Typical Usage |
| --- | --- | --- | --- |
| Tier 1 | GPT-3.5-turbo, Claude 3 Haiku | Handle simple, clear queries | 60-80% of queries |
| Tier 2 | GPT-4o-mini, Claude 3.5 Sonnet | Moderate complexity, ambiguous cases | 15-30% of queries |
| Tier 3 | GPT-4o, Claude 3 Opus | Complex reasoning, high-stakes decisions | 5-15% of queries |

The cascade pattern requires clear criteria for escalation. Common approaches include:
  • Confidence thresholds - Escalate when model-reported confidence falls below threshold
  • Consistency checks - Escalate when multiple attempts produce inconsistent results
  • Complexity indicators - Escalate based on query characteristics (length, technical terms, ambiguity)
  • Output validation - Escalate when outputs fail validation checks
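A cascade driven by a confidence threshold might be sketched as follows. Each tier is represented by a plain callable returning an answer and a confidence score; that interface, the `run_cascade` name, and the 0.8 default are assumptions for the example. Self-reported confidence may be poorly calibrated, so any threshold would need checking against labeled data.

```python
from typing import Callable, List, Sequence, Tuple

# A tier is a callable taking the query and returning (answer, confidence).
ModelFn = Callable[[str], Tuple[str, float]]


def run_cascade(query: str,
                tiers: Sequence[Tuple[str, ModelFn]],
                min_confidence: float = 0.8) -> Tuple[str, str, List[str]]:
    """Try tiers from cheapest to most capable.

    Returns (name of the tier that answered, its answer, all tiers tried).
    If no tier reaches min_confidence, the last tier's answer is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    tried: List[str] = []
    answer = ""
    for name, model in tiers:
        tried.append(name)
        answer, confidence = model(query)
        if confidence >= min_confidence:
            break  # quality criterion met; stop escalating
    return tried[-1], answer, tried
```

Recording which tiers were tried makes it possible to measure the escalation rate, which is the number that determines whether the cascade actually saves money: every escalated query pays for the cheaper attempts as well.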

Security-Specific Model Selection

Security applications have unique model selection considerations beyond general cost optimization:

| Security Task | Recommended Tier | Rationale |
| --- | --- | --- |
| Alert classification | Tier 1-2 | High volume, structured task |
| IOC extraction | Tier 1 | Pattern matching, structured output |
| Threat analysis | Tier 2-3 | Complex reasoning required |
| Incident investigation | Tier 3 | High stakes, complex context |
| Log summarization | Tier 1-2 | Straightforward extraction |
| Vulnerability assessment | Tier 2-3 | Technical accuracy critical |

For security applications, err toward higher-capability models when:
  • Decisions trigger automated responses
  • False negatives have severe consequences
  • Analysis informs high-stakes decisions
  • Complex reasoning or multi-step analysis is required

Model Selection Tools

| Tool | Description | Documentation |
| --- | --- | --- |
| Martian | Automatic model routing | Martian |
| OpenRouter | Multi-model routing | OpenRouter |
| LiteLLM | Unified API with routing | LiteLLM |
| Portkey | AI gateway with routing | Portkey |

Batching and Async Processing

Batching combines multiple operations into single requests, reducing overhead and often qualifying for volume discounts. Async processing defers non-urgent work to optimize resource utilization. Both techniques trade latency for cost efficiency, making them appropriate for workloads that don’t require real-time responses.

Batching Strategies

| Strategy | Description | Latency Trade-off | Cost Benefit |
| --- | --- | --- | --- |
| Request batching | Combine multiple LLM queries | Higher latency (wait for batch) | Reduced API overhead |
| Embedding batching | Batch vector operations | Minimal latency impact | Significant savings at scale |
| Background processing | Defer non-urgent work | Async completion | Off-peak pricing, better utilization |
| Bulk ingestion | Batch document processing | Hours to complete | Optimized throughput |
| Scheduled processing | Time-based batching | Predictable delays | Consistent cost patterns |

Request batching accumulates queries over a time window or until reaching a batch size threshold, then processes them together. This approach works well for alert enrichment, where enriching 100 alerts in a batch is more efficient than 100 individual requests. The trade-off is increased latency for individual items.

Embedding batching processes multiple documents or chunks in single API calls. Most embedding APIs support batch operations that are significantly more efficient than individual calls. Processing 100 documents in a single batch call may cost 20-30% less than 100 individual calls.

Background processing moves non-urgent work to asynchronous queues, processing during low-demand periods. This approach enables better resource utilization and may qualify for lower pricing tiers with some providers.

When to Batch vs. Real-time

Security operations have varying latency requirements. Some tasks require immediate responses, while others can tolerate delays of seconds, minutes, or even hours.

| Use Case | Latency Tolerance | Batch Appropriate | Processing Approach |
| --- | --- | --- | --- |
| Alert enrichment | Minutes | ✓ Yes | Batch every 30-60 seconds |
| Threat intelligence processing | Hours | ✓ Yes | Daily batch jobs |
| Report generation | Hours | ✓ Yes | Scheduled processing |
| Historical analysis | Days | ✓ Yes | Background jobs |
| Real-time detection | Milliseconds | ✗ No | Stream processing |
| Interactive investigation | Seconds | ✗ No | On-demand |
| Automated response | Milliseconds | ✗ No | Real-time |

Implementing Batch Processing

Effective batch processing requires infrastructure for queue management, batch accumulation, and result distribution:
  • Queue systems - Use message queues (Redis, RabbitMQ, SQS) to accumulate work
  • Batch windows - Configure time-based or count-based batching thresholds
  • Priority queues - Allow urgent items to bypass batching when needed
  • Result distribution - Ensure results route back to appropriate consumers
  • Error handling - Handle partial batch failures gracefully
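The batch window and priority bypass from the list above can be illustrated with a small accumulator. This is a single-process sketch with invented names (`BatchAccumulator`, `add`, `flush`); a real deployment would sit behind one of the queue systems mentioned and would also need a timer, since here the time window is only checked when a new item arrives.

```python
import time
from typing import Any, List, Optional


class BatchAccumulator:
    """Collect items until a count or age threshold is reached."""

    def __init__(self, max_items: int = 100, max_wait_seconds: float = 30.0) -> None:
        self.max_items = max_items
        self.max_wait_seconds = max_wait_seconds
        self._items: List[Any] = []
        self._opened_at = 0.0

    def add(self, item: Any, now: Optional[float] = None,
            urgent: bool = False) -> Optional[List[Any]]:
        """Add an item; return a batch to process when one is ready, else None."""
        now = time.time() if now is None else now
        if urgent:
            return [item]  # priority bypass: process alone, immediately
        if not self._items:
            self._opened_at = now  # first item opens the batch window
        self._items.append(item)
        window_elapsed = now - self._opened_at >= self.max_wait_seconds
        if len(self._items) >= self.max_items or window_elapsed:
            return self.flush()
        return None

    def flush(self) -> List[Any]:
        """Return and clear the pending batch (also useful at shutdown)."""
        items, self._items = self._items, []
        return items
```

Urgent items skip the pending batch without disturbing it, so a critical alert is not delayed by the window and routine alerts still accumulate as before.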

Async Processing Tools

| Tool | Description | Documentation |
| --- | --- | --- |
| Celery | Python distributed task queue | Celery |
| Bull | Node.js queue system | Bull |
| AWS SQS | Managed message queue | SQS |
| Modal | Serverless batch processing | Modal |

Cost Governance

Effective cost governance establishes organizational controls, accountability, and processes for managing AI expenses. Without governance, costs can escalate rapidly as teams adopt AI without visibility into spending or incentives to optimize.

Budget Controls

| Control | Description | Implementation | Enforcement Level |
| --- | --- | --- | --- |
| Spending limits | Hard caps on usage | API key limits, gateway controls | Hard stop |
| Rate limiting | Query throttling | Requests per minute/hour | Graceful degradation |
| Cost alerts | Threshold notifications | Monitoring alerts | Warning only |
| Usage reporting | Visibility into spend | Dashboards, reports | Transparency |
| Approval workflows | Authorization for high-cost operations | Request/approve process | Pre-authorization |
| Quota allocation | Per-team or per-project budgets | Budget pools | Soft caps with escalation |

Spending limits provide hard caps that prevent runaway costs. Most API providers support spending limits at the account or API key level. AI gateways like Portkey and Helicone provide more granular controls including per-user and per-application limits.

Rate limiting throttles request volume rather than spending directly. Rate limits provide graceful degradation: excess queries are queued or rejected in a controlled way rather than causing an outright failure. This approach prevents cost spikes from traffic bursts while maintaining availability.

Approval workflows require authorization before high-cost operations. An investigation requiring extended context or premium models might require manager approval, ensuring expensive operations are justified.
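A hard spending cap combined with a warning threshold could look roughly like the sketch below. The `BudgetGuard` class, the `BudgetExceeded` exception, and the 80% warning fraction are assumptions for illustration; in practice this logic usually lives in the provider's account settings or an AI gateway rather than in application code.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the hard cap."""


class BudgetGuard:
    def __init__(self, limit_usd: float, alert_fraction: float = 0.8) -> None:
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Record a cost before the request is sent.

        Raises BudgetExceeded (hard stop) if the cap would be exceeded.
        Returns True when spending has reached the alert threshold, so
        the caller can send a warning notification.
        """
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"charge of ${cost_usd:.2f} exceeds remaining budget "
                f"${self.limit_usd - self.spent_usd:.2f}"
            )
        self.spent_usd += cost_usd
        return self.spent_usd >= self.alert_fraction * self.limit_usd
```

Checking the cap before the request is sent, using an estimated cost, is what makes this a hard stop; reconciling against actual billed usage afterwards would be a separate step.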

Cost Attribution

Cost attribution tracks spending across organizational dimensions, enabling accountability and ROI analysis. Without attribution, costs become shared overhead with no incentive for individual teams to optimize.

| Attribution Dimension | Purpose | Implementation |
| --- | --- | --- |
| By team/department | Departmental chargeback | Team identifiers in requests |
| By use case | ROI analysis per application | Application tagging |
| By model | Model economics analysis | Model tracking in logs |
| By user | Individual accountability | User identifiers |
| By time period | Trend analysis, budgeting | Timestamp aggregation |
| By query type | Optimization targeting | Query classification |

Implement attribution through consistent tagging of all AI requests with relevant metadata. AI gateways and observability platforms typically provide attribution capabilities out of the box.
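Once requests carry metadata tags, attribution is an aggregation over the request log. The sketch below assumes each log record is a dictionary with a `cost_usd` field plus arbitrary tag fields; the record shape and the `attribute_costs` function are illustrative, not the format of any particular gateway.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def attribute_costs(records: Iterable[Mapping[str, object]],
                    dimension: str) -> Dict[str, float]:
    """Sum cost_usd per value of the given tag (team, model, use_case, ...).

    Records missing the tag are grouped under "untagged" so that
    attribution gaps stay visible instead of silently disappearing.
    """
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        key = str(record.get(dimension, "untagged"))
        totals[key] += float(record["cost_usd"])
    return dict(totals)


# Hypothetical request log entries.
request_log = [
    {"team": "soc", "model": "gpt-4", "cost_usd": 0.50},
    {"team": "soc", "model": "gpt-3.5-turbo", "cost_usd": 0.25},
    {"team": "threat-intel", "model": "gpt-4", "cost_usd": 1.00},
    {"model": "gpt-4", "cost_usd": 0.25},  # missing team tag
]
```

Calling `attribute_costs(request_log, "team")` groups the same log by team, and `attribute_costs(request_log, "model")` by model. A growing "untagged" bucket is itself a useful signal that some integration is not tagging its requests.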

ROI Measurement

Cost optimization efforts should demonstrate return on investment. Security AI systems should quantify value delivered relative to costs incurred.

| Metric | Calculation | Target |
| --- | --- | --- |
| Cost per alert triaged | Total AI cost / alerts processed | < analyst hourly cost |
| Analyst time saved | Manual triage time × alerts automated | Positive ROI |
| Detection improvement | (AI detections - baseline) / baseline | Measurable lift |
| False positive reduction | Baseline FP rate - AI FP rate | Measurable reduction |
| Cost per investigation hour saved | AI cost / analyst hours saved | < analyst cost |
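The calculations in the table translate directly into code. The function names and all input figures below are invented for illustration; real values would come from billing data and time studies of manual triage.

```python
def cost_per_alert(total_ai_cost_usd: float, alerts_processed: int) -> float:
    """Total AI cost / alerts processed."""
    if alerts_processed <= 0:
        raise ValueError("alerts_processed must be positive")
    return total_ai_cost_usd / alerts_processed


def analyst_hours_saved(manual_minutes_per_alert: float, alerts_automated: int) -> float:
    """Manual triage time x alerts automated, expressed in hours."""
    return manual_minutes_per_alert * alerts_automated / 60


def cost_per_hour_saved(total_ai_cost_usd: float, hours_saved: float) -> float:
    """AI cost / analyst hours saved."""
    if hours_saved <= 0:
        raise ValueError("hours_saved must be positive")
    return total_ai_cost_usd / hours_saved


def detection_lift(ai_detections: int, baseline_detections: int) -> float:
    """(AI detections - baseline) / baseline."""
    if baseline_detections <= 0:
        raise ValueError("baseline_detections must be positive")
    return (ai_detections - baseline_detections) / baseline_detections


# Hypothetical month: $3,000 of AI spend, 300,000 alerts processed,
# 60,000 of them fully automated at an assumed 3 minutes of manual triage each.
monthly_cost = 3_000.0
per_alert = cost_per_alert(monthly_cost, 300_000)
hours = analyst_hours_saved(3, 60_000)
per_hour = cost_per_hour_saved(monthly_cost, hours)
```

With these made-up inputs the system costs one cent per alert and saves 3,000 analyst hours at one dollar per hour saved, which would then be compared against the loaded hourly cost of an analyst.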

Governance Framework

Establish clear policies and processes for AI cost management:
  • Cost ownership - Assign budget owners responsible for AI spending
  • Optimization reviews - Regular reviews of cost efficiency and optimization opportunities
  • Model approval - Process for approving new model deployments
  • Escalation procedures - Clear paths for budget overruns or unusual spending
  • Optimization incentives - Recognize teams that achieve cost efficiency improvements

Cost Monitoring

Effective cost monitoring provides visibility into AI spending, enables optimization, and supports governance. Monitoring should capture both aggregate trends and granular details for troubleshooting.

Key Metrics

| Metric | Description | Target | Alert Threshold |
| --- | --- | --- | --- |
| Cost per query | Average query cost | Minimize | > 2x baseline |
| Token efficiency | Output value per token | Maximize | < 50% baseline |
| Cache hit rate | Cached vs. new queries | > 30% | < 20% |
| Model utilization | Cheap model usage | > 70% | < 50% |
| Cost per detection | AI cost per threat detected | Minimize | > 3x baseline |
| Daily/weekly spend | Aggregate spending | Within budget | > 80% of budget |
| Cost trend | Week-over-week change | Stable or decreasing | > 20% increase |
| Error rate | Failed queries wasting cost | < 1% | > 5% |

Monitoring Dashboards

Effective cost monitoring dashboards should include:
  • Real-time spend - Current period spending vs. budget
  • Cost breakdown - By model, team, use case, and time
  • Efficiency metrics - Cache hit rates, model utilization, token efficiency
  • Trend analysis - Week-over-week and month-over-month comparisons
  • Anomaly detection - Unusual spending patterns flagged for investigation
  • Optimization opportunities - Identified areas for improvement

Monitoring Tools and Platforms

| Tool | Description | Key Features | Documentation |
| --- | --- | --- | --- |
| Helicone | LLM cost tracking | Request logging, cost attribution, caching | Helicone |
| LangSmith | LangChain analytics | Tracing, evaluation, cost tracking | LangSmith |
| Portkey | AI gateway | Routing, caching, observability | Portkey |
| OpenAI Usage | Native dashboard | Account-level usage | OpenAI |
| Anthropic Console | Native dashboard | Claude usage tracking | Anthropic Console |
| Datadog LLM Observability | APM integration | End-to-end tracing | Datadog |

Alerting Strategy

Configure alerts to catch cost issues before they become problems:

| Alert Type | Trigger | Response |
| --- | --- | --- |
| Budget threshold | 80% of period budget consumed | Review spending, identify drivers |
| Spend spike | > 2x normal daily spend | Investigate anomaly, potential issues |
| Efficiency drop | Cache hit rate < 20% | Check cache configuration |
| Model drift | Premium model usage > 40% | Review routing configuration |
| Error spike | Error rate > 5% | Investigate failures, potential waste |
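The triggers in this table can be expressed as a small rule check over a metrics snapshot. The metric field names and the `evaluate_alerts` function are assumptions for the sketch; the thresholds are the ones listed above, with rates expressed as fractions.

```python
from typing import List, Mapping


def evaluate_alerts(metrics: Mapping[str, float]) -> List[str]:
    """Return the names of alert rules triggered by a metrics snapshot."""
    alerts: List[str] = []
    if metrics["period_spend"] >= 0.8 * metrics["period_budget"]:
        alerts.append("budget_threshold")   # 80% of period budget consumed
    if metrics["daily_spend"] > 2 * metrics["baseline_daily_spend"]:
        alerts.append("spend_spike")        # more than 2x normal daily spend
    if metrics["cache_hit_rate"] < 0.20:
        alerts.append("efficiency_drop")    # cache hit rate below 20%
    if metrics["premium_model_share"] > 0.40:
        alerts.append("model_drift")        # premium model usage above 40%
    if metrics["error_rate"] > 0.05:
        alerts.append("error_spike")        # error rate above 5%
    return alerts


# Hypothetical snapshot: 85% of budget used and a weak cache hit rate.
snapshot = {
    "period_spend": 850.0,
    "period_budget": 1000.0,
    "daily_spend": 50.0,
    "baseline_daily_spend": 40.0,
    "cache_hit_rate": 0.15,
    "premium_model_share": 0.30,
    "error_rate": 0.01,
}
triggered = evaluate_alerts(snapshot)
```

For this snapshot the budget threshold and efficiency drop rules fire. In a monitoring platform the same rules would typically be configured as monitors rather than coded by hand, but keeping them in one place makes the thresholds easy to review.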
