AI evaluation presents unique challenges for security applications: outputs are often non-deterministic, correctness depends on context, and security-critical decisions require high confidence. Unlike traditional software, where unit tests provide deterministic pass/fail results, AI systems produce probabilistic outputs that require statistical evaluation, semantic comparison, and domain-expert judgment. Robust evaluation frameworks ensure AI systems perform reliably across diverse scenarios while catching regressions before they impact security operations.

The consequences of inadequate AI evaluation in security contexts are severe. An undertested alert classification model may miss critical threats or generate excessive false positives that overwhelm analysts. A poorly evaluated investigation assistant may provide incorrect remediation guidance that worsens incidents. An unvalidated threat intelligence summarizer may hallucinate indicators of compromise that trigger unnecessary containment actions. Security teams must implement comprehensive evaluation strategies that address accuracy, reliability, safety, and robustness before deploying AI systems to production.

According to research from Stanford HELM and Anthropic, model performance varies significantly across tasks and domains, with security-specific tasks often underrepresented in general benchmarks. Organizations must develop custom evaluation frameworks that reflect their specific use cases, threat landscape, and operational requirements rather than relying solely on generic LLM benchmarks.

Why AI Evaluation Differs from Traditional Testing

Traditional software testing relies on deterministic assertions—given input X, the system must produce output Y. AI systems fundamentally break this paradigm. The same prompt may produce different responses across invocations, multiple valid answers may exist for a single question, and correctness often requires human judgment rather than exact matching. Security engineers must adopt evaluation methodologies that account for this inherent variability while still maintaining rigorous quality standards.

| Challenge | Traditional Testing | AI Evaluation Approach |
| --- | --- | --- |
| Non-determinism | Exact output matching | Semantic similarity, statistical bounds |
| Multiple valid answers | Single expected result | Rubric-based scoring, acceptability criteria |
| Subjective quality | Binary pass/fail | Continuous quality scores, human ratings |
| Context dependence | Isolated unit tests | Scenario-based evaluation, conversation testing |
| Emergent behaviors | Predefined test cases | Adversarial probing, red teaming |
| Distribution shift | Static test suites | Continuous monitoring, drift detection |

The NIST AI Risk Management Framework emphasizes that AI systems require ongoing evaluation throughout their lifecycle, not just pre-deployment testing. Security AI systems must be continuously monitored for performance degradation, emerging failure modes, and adversarial manipulation attempts.

Evaluation Fundamentals

Evaluation Dimensions

Security AI systems must be evaluated across multiple dimensions, each addressing different aspects of system quality and operational readiness. A system that scores well on accuracy but poorly on safety may be more dangerous than one with moderate accuracy and strong safety guarantees.

| Dimension | Focus | Security Relevance | Measurement Approach |
| --- | --- | --- | --- |
| Accuracy | Correctness of outputs | Alert classification, threat detection | Precision, recall, F1 against labeled data |
| Reliability | Consistency across runs | Operational stability | Variance analysis, reproducibility testing |
| Safety | Harmful output prevention | Action gating, content filtering | Red team testing, guardrail evaluation |
| Robustness | Adversarial resistance | Prompt injection defense | Adversarial benchmark suites |
| Latency | Response time | Real-time detection requirements | Percentile latency measurements |
| Calibration | Confidence accuracy | Risk-based decision making | Expected calibration error |

Accuracy measures whether the AI system produces correct outputs for its intended task. For security applications, this includes correctly classifying alerts, accurately extracting indicators of compromise, and providing factually correct threat intelligence summaries.

Reliability assesses whether the system produces consistent outputs across repeated invocations and varying conditions. Security operations require predictable AI behavior: analysts cannot effectively use tools that produce wildly different outputs for similar inputs.

Safety evaluates whether the system avoids harmful outputs, including dangerous recommendations, policy violations, and content that could enable attacks. Safety evaluation requires adversarial testing by security experts who attempt to elicit harmful behaviors.

Robustness measures resistance to adversarial manipulation, including prompt injection, jailbreaking, and input perturbations. Security AI systems are high-value targets for attackers who may attempt to manipulate their outputs.

Calibration assesses whether the system’s confidence scores accurately reflect actual correctness probability. Well-calibrated systems enable risk-based decision making: high-confidence outputs can be automated while low-confidence outputs require human review.
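
Calibration can be made concrete with expected calibration error (ECE), the metric named in the table above. The following is a minimal sketch, assuming equal-width confidence bins and a binary correct/incorrect label per prediction; the function name and the default of 10 bins are illustrative choices, not part of any particular library.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four predictions at 95% confidence, three of them correct.
    print(expected_calibration_error([0.95] * 4, [True, True, True, False]))
```

A lower value means stated confidence tracks observed accuracy more closely, which is what makes confidence-based automation thresholds defensible.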

Evaluation vs. Testing

While often used interchangeably, evaluation and testing serve distinct purposes in AI quality assurance. Understanding this distinction helps security teams implement appropriate processes for each.

| Aspect | Evaluation | Testing |
| --- | --- | --- |
| Purpose | Measure quality holistically | Verify specific behaviors |
| Approach | Scoring, benchmarking, comparison | Pass/fail assertions |
| Frequency | Continuous monitoring | CI/CD integration, pre-deployment |
| Output | Quality metrics, rankings | Test results, coverage reports |
| Scope | System-level performance | Component-level correctness |
| Methodology | Statistical analysis | Deterministic verification |

Evaluation provides a comprehensive view of system quality through benchmarks, human assessment, and statistical analysis. Evaluation answers questions like “How well does this system perform on alert classification?” or “Is this model better than our current production model?”

Testing verifies specific behaviors and catches regressions through automated assertions. Testing answers questions like “Does this prompt still produce valid JSON?” or “Does the system correctly reject this known prompt injection?”

Effective AI quality assurance requires both approaches: evaluation to understand overall system quality and testing to catch specific regressions and verify critical behaviors.

Evaluation Methodologies

Effective AI evaluation combines multiple methodologies, each with distinct strengths and limitations. Human evaluation provides ground truth but doesn’t scale. Automated metrics scale but may miss nuanced quality issues. LLM-as-judge approaches offer a middle ground but introduce their own biases. Security teams should implement a portfolio of evaluation methods appropriate to their use cases and risk tolerance.

Human Evaluation

Human evaluation remains the gold standard for assessing AI output quality, particularly for subjective dimensions like helpfulness, clarity, and appropriateness. For security applications, domain experts can identify subtle errors that automated metrics miss—incorrect threat actor attribution, implausible attack chains, or recommendations that violate operational constraints.

| Method | Description | Use Cases | Considerations |
| --- | --- | --- | --- |
| Expert review | Security analysts assess outputs | High-stakes decisions, novel scenarios | Expensive, limited scale |
| Preference ranking | Compare output quality between models | Model selection, A/B testing | Requires paired comparisons |
| Error analysis | Categorize and analyze failure modes | Improvement prioritization, debugging | Time-intensive, requires expertise |
| Blind evaluation | Remove model identification bias | Fair comparison, vendor evaluation | Requires careful study design |
| Likert scoring | Rate outputs on numeric scales | Quality tracking, regression detection | Requires calibrated raters |

Expert review involves security analysts examining AI outputs for correctness, completeness, and appropriateness. This approach catches domain-specific errors that automated metrics miss but requires significant analyst time. Reserve expert review for high-stakes evaluations and sample-based quality monitoring.

Preference ranking asks evaluators to compare outputs from different models or configurations, identifying which produces better results. This approach is more reliable than absolute scoring because humans are better at relative comparisons. Use preference ranking for model selection and prompt optimization.

Error analysis systematically categorizes failure modes to identify improvement priorities. Security teams should maintain taxonomies of error types (hallucinated IOCs, incorrect severity assessments, missing context, unsafe recommendations) and track their frequency over time.

Automated Evaluation

Automated evaluation enables continuous quality monitoring at scale, catching regressions quickly and providing consistent measurement across deployments. However, automated metrics often fail to capture nuanced quality dimensions that matter for security applications.

| Method | Description | Strengths | Limitations |
| --- | --- | --- | --- |
| Reference matching | Compare to ground truth | Objective, reproducible | Requires labeled data, misses valid alternatives |
| Semantic similarity | Embedding-based comparison | Captures meaning, not just words | May miss factual errors |
| LLM-as-judge | Use LLM to evaluate outputs | Scalable, nuanced | Introduces evaluator bias |
| Rule-based checks | Structured output validation | Fast, deterministic | Limited to structural properties |
| Factual verification | Check claims against sources | Catches hallucinations | Requires authoritative sources |

Reference matching compares AI outputs to known-correct answers using metrics like exact match, BLEU, or ROUGE. While objective and reproducible, these metrics penalize valid alternative phrasings and require expensive labeled datasets. Use reference matching for tasks with clear correct answers, such as entity extraction or classification.

Semantic similarity uses embedding models to compare the meaning of outputs rather than exact wording. This approach better handles paraphrasing but may miss factual errors if the incorrect output is semantically similar to the correct one. Combine semantic similarity with factual verification for security applications.

Rule-based checks verify structural properties like JSON schema compliance, required field presence, and value range constraints. These checks are fast and deterministic but only catch format errors, not semantic issues. Implement rule-based checks as a first validation layer before more expensive evaluation.
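
A rule-based first layer of the kind described above might look like the following sketch. The field names (`alert_id`, `severity`, `confidence`) and the allowed severity labels are assumptions made for illustration; a real deployment would substitute its own output schema.

```python
import json

# Hypothetical schema for an alert-triage response.
REQUIRED_FIELDS = ("alert_id", "severity", "confidence")
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def validate_triage_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    # Required field presence.
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    # Categorical value constraint.
    severity = data.get("severity")
    if "severity" in data and not (
        isinstance(severity, str) and severity in ALLOWED_SEVERITIES
    ):
        errors.append(f"unknown severity: {severity!r}")

    # Numeric range constraint (booleans are not accepted as numbers).
    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if "confidence" in data and not (is_number and 0.0 <= confidence <= 1.0):
        errors.append("confidence must be a number between 0 and 1")
    return errors
```

Returning every violation at once, rather than failing on the first, tends to make the results more useful for error analysis. Note that a response can pass all of these checks and still be factually wrong, which is why this layer only precedes the more expensive evaluation methods.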

LLM-as-Judge Patterns

LLM-as-judge approaches use language models to evaluate other language model outputs, offering a scalable alternative to human evaluation. Research from Anthropic and OpenAI demonstrates that well-designed LLM judges can achieve high correlation with human preferences, though they exhibit systematic biases that evaluators must understand and mitigate.

| Pattern | Description | Best For | Bias Considerations |
| --- | --- | --- | --- |
| Pointwise scoring | Rate individual outputs on criteria | Absolute quality measurement | Position bias, verbosity bias |
| Pairwise comparison | Compare two outputs directly | Model comparison, ranking | Order effects, tie handling |
| Reference-guided | Evaluate against ground truth | Factual accuracy assessment | Reference quality dependence |
| Criteria-based | Score on specific rubrics | Multi-dimensional evaluation | Rubric interpretation variance |

Pointwise scoring asks the judge model to rate individual outputs on a numeric scale or categorical rubric. This approach enables absolute quality measurement but suffers from calibration drift and position bias. Mitigate by randomizing example order and using consistent prompting.

Pairwise comparison presents two outputs and asks which is better, often with justification. This approach is more reliable than pointwise scoring because relative comparisons are easier than absolute judgments. Use pairwise comparison for model selection and prompt optimization.

Reference-guided evaluation provides the judge with a reference answer and asks it to assess how well the candidate output matches. This approach works well for factual tasks but depends heavily on reference quality. Ensure references are verified by domain experts.

Criteria-based evaluation defines specific rubrics (accuracy, completeness, safety, clarity) and asks the judge to score each dimension. This approach provides actionable feedback for improvement but requires careful rubric design to ensure consistent interpretation.

When implementing LLM-as-judge, use the most capable available model as the judge, provide clear evaluation criteria with examples, and validate judge outputs against human evaluation on a sample basis. Be aware that LLM judges tend to prefer longer, more verbose outputs and may exhibit self-preference bias when evaluating outputs from the same model family.
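
One common way to handle the order effects noted for pairwise comparison is to query the judge twice with the answers swapped and accept only a verdict that survives the swap. The sketch below assumes a hypothetical `judge` callable that returns `"first"`, `"second"`, or `"tie"`; wiring it to an actual model is left out, and the signature is an illustration rather than any provider's API.

```python
from typing import Callable

# Hypothetical judge signature:
# (question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_with_order_swap(
    judge: Judge, question: str, answer_a: str, answer_b: str
) -> str:
    """Return "a", "b", or "tie"; a verdict counts only if both orderings agree."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)

    # Translate positional verdicts back to candidate labels.
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")

    if forward_pick == backward_pick:
        return forward_pick
    # Disagreement across orderings suggests the verdict depended on position.
    return "tie"
```

Tracking how often the two orderings disagree gives a rough measure of how strongly position influences a given judge. This does not address verbosity or self-preference bias, which still call for periodic comparison against human ratings.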

Benchmark Design

Benchmarks provide standardized measurement of AI system capabilities, enabling comparison across models, tracking performance over time, and identifying specific weaknesses. For security applications, general-purpose LLM benchmarks often fail to capture domain-specific requirements—security teams must develop custom benchmarks that reflect their operational reality.

Security AI Benchmarks

Security AI benchmarks should assess both general AI capabilities and domain-specific competencies. A comprehensive benchmark suite includes tests for threat detection accuracy, security knowledge, reasoning about attack chains, and safe handling of sensitive information.

| Benchmark Type | Purpose | Example Metrics | Security Considerations |
| --- | --- | --- | --- |
| Detection accuracy | Threat identification | Precision, recall, F1 | Balance false positive/negative costs |
| Classification | Alert categorization | Accuracy, confusion matrix | Class imbalance handling |
| Explanation quality | Investigation support | Completeness, accuracy, clarity | Actionable guidance |
| Action appropriateness | Automation decisions | Safety, effectiveness | Risk-weighted evaluation |
| Knowledge recall | Security expertise | Factual accuracy | Currency of threat intelligence |
| Reasoning | Attack chain analysis | Logical consistency | Multi-step inference validation |

Detection benchmarks measure the system’s ability to identify threats, malicious activity, or security-relevant events. Design these benchmarks to reflect realistic class distributions: most production traffic is benign, so systems must maintain high precision while catching true threats. Include metrics at multiple confidence thresholds to understand the precision-recall tradeoff.

Classification benchmarks assess categorization accuracy across alert types, severity levels, or MITRE ATT&CK mappings. Use stratified sampling to ensure adequate coverage of rare but critical categories. Track per-class performance to identify systematic weaknesses.

Explanation benchmarks evaluate the quality of AI-generated analysis and recommendations. These are inherently more subjective and often require human evaluation or LLM-as-judge approaches. Assess completeness (does the explanation cover all relevant factors?), accuracy (are the claims correct?), and actionability (can an analyst act on this guidance?).
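
Reporting detection metrics at several confidence thresholds can be sketched as follows. The threshold values are arbitrary examples, and the convention of reporting precision as 0.0 when nothing is flagged is an assumption made here for simplicity.

```python
from typing import Dict, Sequence, Tuple


def precision_recall_at(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when every score >= threshold is flagged as a threat."""
    tp = fp = fn = 0
    for score, is_threat in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_threat:
            tp += 1
        elif flagged:
            fp += 1
        elif is_threat:
            fn += 1
    # Assumed convention: report 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def threshold_sweep(
    scores: Sequence[float],
    labels: Sequence[bool],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7, 0.9),
) -> Dict[float, Tuple[float, float]]:
    """Map each threshold to its (precision, recall) pair."""
    return {t: precision_recall_at(scores, labels, t) for t in thresholds}


if __name__ == "__main__":
    scores = [0.95, 0.8, 0.6, 0.4, 0.2]
    labels = [True, False, True, False, False]
    for threshold, (p, r) in threshold_sweep(scores, labels).items():
        print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Reading the sweep from low to high threshold shows how much recall is given up for each gain in precision, which is the tradeoff an operations team has to set against analyst capacity.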

Dataset Construction

Building high-quality evaluation datasets is often the most challenging aspect of AI evaluation. Security datasets must balance representativeness, diversity, and sensitivity concerns while avoiding data leakage that would inflate apparent performance.

| Dataset Type | Source | Advantages | Challenges |
| --- | --- | --- | --- |
| Real incident data | Historical investigations | Authentic scenarios, proven relevance | Privacy concerns, limited volume |
| Synthetic scenarios | Expert-generated or AI-augmented | Controlled coverage, edge cases | May not reflect reality |
| Adversarial examples | Red team exercises | Tests robustness | Resource-intensive to create |
| Regression cases | Previous failures | Prevents known issues | May overfit to past problems |
| Public benchmarks | Research datasets | Reproducible, comparable | May not match your use cases |

Real incident data provides the most authentic evaluation material but requires careful anonymization to protect sensitive information. Work with legal and privacy teams to establish data handling procedures. Consider using synthetic variants of real incidents that preserve structure while changing identifying details.

Synthetic scenarios enable controlled coverage of edge cases, rare events, and novel threat types. Use domain experts to design scenarios that stress-test specific capabilities. Validate synthetic data against real-world patterns to ensure relevance.

Adversarial examples test system robustness against manipulation attempts. Include prompt injection attacks, jailbreaking attempts, and input perturbations. Partner with red teams to develop realistic adversarial scenarios based on actual attack techniques.

Dataset versioning is critical for reproducible evaluation. Track dataset composition, creation methodology, and any modifications. Maintain separate training and evaluation splits to prevent data contamination. Update datasets regularly to reflect evolving threats while preserving historical versions for regression tracking.
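
One lightweight way to track dataset composition is to record a content hash alongside each version, so that any change to the records produces a new identifier. This is a minimal sketch under the assumption that records are JSON-serializable dictionaries; the manifest fields are hypothetical and would normally live next to the dataset in version control.

```python
import hashlib
import json


def dataset_fingerprint(records: list[dict]) -> str:
    """SHA-256 over canonicalized records; independent of record order."""
    canonical = sorted(json.dumps(r, sort_keys=True, ensure_ascii=True) for r in records)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def build_manifest(name: str, version: str, records: list[dict], method: str) -> dict:
    """Hypothetical manifest capturing what was evaluated and how it was built."""
    return {
        "name": name,
        "version": version,
        "size": len(records),
        "sha256": dataset_fingerprint(records),
        "creation_method": method,
    }
```

Storing the fingerprint with every evaluation result makes it possible to tell later whether two scores were measured on the same data.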

Testing Strategies

While evaluation measures overall system quality through benchmarks and metrics, testing verifies specific behaviors through deterministic assertions. AI testing requires adapting traditional testing practices to handle non-deterministic outputs while maintaining rigorous quality gates.

Unit Testing for AI

Unit testing for AI systems focuses on verifiable components—prompt templates, output parsers, tool integrations, and error handling logic. While LLM outputs are non-deterministic, many surrounding components can be tested deterministically.

| Test Type | Focus | Implementation | Verification Approach |
| --- | --- | --- | --- |
| Prompt testing | Prompt behavior | Test prompt variations | Output constraint checking |
| Output format | Structural compliance | Schema validation | JSON schema, Pydantic models |
| Tool integration | External calls | Mock responses | Function call verification |
| Error handling | Failure modes | Edge case inputs | Exception handling validation |
| Guardrail testing | Safety mechanisms | Adversarial inputs | Block rate verification |

Prompt testing verifies that prompts produce outputs meeting specified constraints. Rather than checking for exact outputs, test that outputs satisfy requirements: correct format, required fields present, values within expected ranges. Use property-based testing to explore prompt behavior across input variations.

Output format testing ensures AI outputs can be parsed by downstream systems. Validate against JSON schemas, Pydantic models, or custom parsers. Test edge cases like empty responses, malformed JSON, and unexpectedly long outputs.

Tool integration testing verifies that the AI system correctly invokes external tools and handles their responses. Mock external services to test error handling, timeout behavior, and response parsing. Verify that tool calls include required parameters and follow expected patterns.
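
Tool integration tests of this kind can be written with the standard library's `unittest.mock`. In the sketch below, `enrich_indicator` and its `lookup` parameter are hypothetical stand-ins for a real tool wrapper; the tests check that required parameters are passed and that a timeout degrades gracefully instead of propagating.

```python
from unittest.mock import Mock


def enrich_indicator(indicator: str, lookup) -> dict:
    """Hypothetical tool wrapper: query a reputation service for one indicator."""
    if not indicator:
        raise ValueError("indicator is required")
    try:
        result = lookup(indicator=indicator, timeout_seconds=5)
    except TimeoutError:
        # Degrade gracefully so the surrounding workflow can continue.
        return {"indicator": indicator, "status": "unavailable"}
    return {"indicator": indicator, "status": "ok", "reputation": result["reputation"]}


def test_passes_required_parameters():
    lookup = Mock(return_value={"reputation": "malicious"})
    output = enrich_indicator("198.51.100.7", lookup)
    # The mock records the call, so the exact arguments can be asserted.
    lookup.assert_called_once_with(indicator="198.51.100.7", timeout_seconds=5)
    assert output == {
        "indicator": "198.51.100.7",
        "status": "ok",
        "reputation": "malicious",
    }


def test_timeout_is_handled():
    lookup = Mock(side_effect=TimeoutError)
    output = enrich_indicator("198.51.100.7", lookup)
    assert output["status"] == "unavailable"
```

Both functions follow pytest's `test_` naming convention, so they can be collected by pytest without further setup.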

Integration Testing

Integration testing validates that AI system components work together correctly, including multi-step workflows, retrieval pipelines, and agent behaviors. These tests are more expensive to run but catch issues that unit tests miss.

| Test Level | Scope | Considerations | Key Assertions |
| --- | --- | --- | --- |
| Chain testing | Multi-step workflows | State management, error propagation | Step completion, state consistency |
| RAG testing | Retrieval + generation | Context relevance, grounding | Retrieved context quality, citation accuracy |
| Agent testing | Autonomous actions | Safety bounds, resource limits | Action constraints, termination conditions |
| End-to-end | Complete workflows | Real dependencies, timing | Workflow completion, output correctness |

Chain testing validates multi-step workflows where each step’s output feeds into the next. Test that state propagates correctly, errors are handled gracefully, and the chain produces expected final outputs. Include tests for partial failures where intermediate steps fail but recovery is possible.

RAG testing assesses the entire retrieval-augmented generation pipeline, from query understanding through retrieval to response generation. Verify that retrieved context is relevant, responses are grounded in retrieved information, and the system handles cases where retrieval returns no relevant results.

Agent testing validates autonomous AI systems that take actions in external environments. Test safety constraints to ensure agents stay within permitted action bounds. Verify termination conditions prevent runaway execution. Test resource limits to prevent excessive API calls or compute usage.
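
The agent constraints described above (permitted actions, termination, resource limits) can be exercised against a small harness. This is a sketch under stated assumptions: the allowlist contents, the `next_action` policy callable, and the step budget are all hypothetical, and a real agent framework would supply its own equivalents.

```python
from typing import Callable, List, Optional

# Assumed allowlist of read-only actions for this illustration.
ALLOWED_ACTIONS = {"query_logs", "lookup_ioc", "summarize"}


class ActionNotPermitted(Exception):
    """Raised when the agent proposes an action outside the allowlist."""


def run_agent(
    next_action: Callable[[List[str]], Optional[str]], max_steps: int = 5
) -> List[str]:
    """Run a policy until it returns None, enforcing an allowlist and a step budget."""
    history: List[str] = []
    while True:
        action = next_action(history)
        if action is None:
            return history  # normal termination
        if action not in ALLOWED_ACTIONS:
            raise ActionNotPermitted(action)
        if len(history) >= max_steps:
            raise RuntimeError(f"agent exceeded {max_steps} steps without terminating")
        history.append(action)


def test_runaway_agent_is_stopped():
    try:
        run_agent(lambda history: "query_logs", max_steps=3)
    except RuntimeError:
        return
    raise AssertionError("runaway agent was not stopped")


def test_forbidden_action_is_blocked():
    try:
        run_agent(lambda history: "isolate_host")
    except ActionNotPermitted:
        return
    raise AssertionError("forbidden action was not blocked")
```

Because the policy is just a callable, the same harness can be driven by scripted policies in tests and by the real model-backed policy in staging.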

Regression Testing

Regression testing prevents quality degradation as systems evolve. AI systems are particularly susceptible to subtle regressions—prompt changes, model updates, or dependency upgrades may cause performance degradation that’s invisible without systematic monitoring.

| Regression Type | Detection Method | Trigger | Response |
| --- | --- | --- | --- |
| Behavioral regression | Golden dataset comparison | Code changes, prompt updates | Block deployment, investigate |
| Semantic regression | Embedding similarity | Model updates | Review changes, update baselines |
| Performance regression | Latency percentiles | Infrastructure changes | Profile, optimize |
| Safety regression | Adversarial test suites | Any change | Block deployment, escalate |

Golden dataset testing maintains a fixed set of inputs with verified correct outputs. Run the golden dataset before every deployment and compare outputs against baselines. Use semantic similarity rather than exact matching to allow acceptable variation while catching meaningful changes.

Semantic regression detection identifies when outputs change meaning even if they remain structurally valid. Track embedding similarity of outputs over time, alerting when outputs drift beyond acceptable thresholds. This catches subtle degradation that format-based tests miss.

Safety regression testing specifically verifies that safety mechanisms remain effective. Maintain a suite of adversarial inputs that the system must correctly reject. Any safety regression should block deployment regardless of other metrics.
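
A golden dataset check can be sketched as a comparison loop with a pluggable similarity function. Note the substitution: to stay self-contained, the default below uses `difflib.SequenceMatcher`, which measures character-level overlap and is only a stand-in for the embedding similarity the text recommends. The `generate` callable and the 0.8 threshold are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable


def lexical_similarity(a: str, b: str) -> float:
    """Stand-in for embedding similarity: character-level overlap in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_regressions(
    golden: dict[str, str],
    generate: Callable[[str], str],
    similarity: Callable[[str, str], float] = lexical_similarity,
    threshold: float = 0.8,
) -> list[str]:
    """Return the golden inputs whose current output fell below the threshold."""
    failures = []
    for prompt, baseline in golden.items():
        current = generate(prompt)
        if similarity(current, baseline) < threshold:
            failures.append(prompt)
    return failures
```

In a deployment pipeline, a non-empty result would block the release pending investigation. Swapping `similarity` for an embedding-based function changes what counts as drift without changing the surrounding logic.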

Quality Assurance

Quality assurance for AI systems requires continuous monitoring rather than one-time validation. Unlike traditional software where passing tests provides confidence in correctness, AI systems can degrade due to distribution shift, model updates, or subtle changes in upstream data. Implement layered quality gates that catch issues at every stage of the deployment lifecycle.

Continuous Evaluation

Production AI systems require ongoing evaluation to detect quality degradation before it impacts operations. Implement automated monitoring that samples production traffic, evaluates output quality, and alerts on anomalies.

| Practice | Frequency | Purpose | Implementation |
| --- | --- | --- | --- |
| Automated benchmarks | Per deployment | Catch regressions | CI/CD integration, blocking gates |
| Sample evaluation | Daily/weekly | Monitor production quality | Random sampling, LLM-as-judge |
| Human review | Periodic | Deep quality assessment | Expert queue, structured rubrics |
| A/B testing | Per change | Compare variations | Traffic splitting, statistical significance |
| Drift detection | Continuous | Identify distribution shift | Input/output embedding monitoring |

Automated benchmarks run on every deployment, comparing performance against established baselines. Configure CI/CD pipelines to block deployments that fail benchmark thresholds. Include both accuracy metrics and safety tests in the benchmark suite.

Sample evaluation continuously monitors production quality by evaluating a random sample of inputs and outputs. Use LLM-as-judge approaches for scalable evaluation, with periodic human validation of judge accuracy. Alert when quality metrics fall below thresholds.

A/B testing enables safe comparison of model changes by routing a portion of traffic to the new version while maintaining a control group. Use appropriate statistical methods to determine significance and ensure adequate sample sizes before drawing conclusions.
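
For A/B comparisons where each response is judged simply acceptable or not, one standard option is a two-proportion z-test. The sketch below is one possible method, not the only appropriate one; it assumes independent samples and counts large enough for the normal approximation to hold.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    success_a: int, total_a: int, success_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both groups need at least one observation")
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Same observed rates (80% vs 90%) at two different sample sizes.
    print(two_proportion_z_test(8, 10, 9, 10))
    print(two_proportion_z_test(800, 1000, 900, 1000))
```

The two calls in the usage example have identical observed rates but very different sample sizes, which illustrates why an apparent improvement on a handful of samples should not drive a rollout decision.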

Quality Gates

Quality gates enforce minimum standards at each stage of deployment, preventing degraded models from reaching production and enabling rapid rollback when issues occur.

| Gate | Stage | Criteria | Action on Failure |
| --- | --- | --- | --- |
| Pre-deployment | CI/CD pipeline | Benchmark thresholds, safety tests | Block deployment |
| Canary | Initial rollout (5-10%) | Live quality metrics, error rates | Automatic rollback |
| Progressive | Staged rollout | Consistent metrics across stages | Pause expansion |
| Production | Full deployment | Ongoing monitoring | Alert, investigate, rollback |

Pre-deployment gates verify that new versions meet minimum quality standards before any production exposure. Include accuracy benchmarks, latency requirements, and safety test suites. Make these gates blocking, with no exceptions for “urgent” deployments.

Canary deployments expose new versions to a small percentage of production traffic while monitoring for quality degradation. Configure automatic rollback triggers based on error rates, latency degradation, or quality metric drops. The Google SRE Workbook provides detailed guidance on canary analysis.

Production monitoring continues after full deployment, detecting issues that may emerge over time due to distribution shift, upstream changes, or adversarial activity. Implement alerting thresholds with appropriate sensitivity: thresholds that are too sensitive create alert fatigue, while thresholds that are too lax miss real issues.
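
The rollback triggers mentioned for canary deployments can be expressed as a pure function over baseline and canary metrics, which makes the decision logic easy to unit test. The metric names and default tolerances below are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of failed requests
    p95_latency_ms: float   # 95th percentile latency
    quality_score: float    # e.g. mean judge score in [0, 1]


def rollback_reasons(
    baseline: Metrics,
    canary: Metrics,
    max_error_increase: float = 0.01,
    max_latency_ratio: float = 1.2,
    max_quality_drop: float = 0.05,
) -> List[str]:
    """Return the triggers the canary tripped; an empty list means keep rolling out."""
    reasons = []
    if canary.error_rate - baseline.error_rate > max_error_increase:
        reasons.append("error rate increased beyond tolerance")
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("p95 latency degraded beyond tolerance")
    if baseline.quality_score - canary.quality_score > max_quality_drop:
        reasons.append("quality score dropped beyond tolerance")
    return reasons
```

Returning the reasons rather than a bare boolean gives the rollback automation something to log and the on-call engineer something to investigate.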

Evaluation Tools and Frameworks

The AI evaluation ecosystem includes specialized frameworks for different evaluation needs. Select tools based on your specific requirements—RAG evaluation, prompt testing, LLM-as-judge, or general ML metrics. Most organizations benefit from combining multiple tools into a comprehensive evaluation pipeline.

Evaluation Frameworks

| Tool | Description | Best For | Documentation |
| --- | --- | --- | --- |
| RAGAS | RAG evaluation framework | Retrieval quality, answer correctness | RAGAS |
| DeepEval | LLM evaluation framework | Comprehensive LLM testing | DeepEval |
| Promptfoo | Prompt testing and evaluation | Prompt engineering, red teaming | Promptfoo |
| LangSmith | LangChain observability and evaluation | Chain debugging, production monitoring | LangSmith |
| Hugging Face Evaluate | ML evaluation library | Standard NLP metrics | Evaluate |
| OpenAI Evals | OpenAI’s evaluation framework | Custom evaluation development | OpenAI Evals |
| Braintrust | LLM evaluation platform | Continuous evaluation, experimentation | Braintrust |

RAGAS specializes in evaluating retrieval-augmented generation systems, providing metrics for context relevance, faithfulness, and answer correctness. Use RAGAS when RAG quality is a primary concern.

Promptfoo excels at prompt testing and red teaming, enabling systematic evaluation of prompt variations and adversarial robustness. Its configuration-based approach makes it easy to define test suites and run evaluations at scale.

LangSmith provides end-to-end observability for LangChain applications, including tracing, evaluation, and production monitoring. Use LangSmith for comprehensive visibility into complex chain behaviors.

Testing Frameworks

| Tool | Description | Best For | Documentation |
| --- | --- | --- | --- |
| pytest | Python testing framework | Unit tests, integration tests | pytest |
| Hypothesis | Property-based testing | Edge case discovery, fuzzing | Hypothesis |
| LangChain Testing | Chain testing utilities | LangChain-specific testing | LangChain |
| Testcontainers | Container-based testing | Integration tests with dependencies | Testcontainers |
| Great Expectations | Data validation | Input/output data quality | Great Expectations |

Hypothesis enables property-based testing where you define properties that outputs should satisfy rather than specific expected outputs. This approach is particularly valuable for AI testing where multiple valid outputs may exist.
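
To illustrate the idea without depending on any third-party package, the sketch below hand-rolls a property check with the standard library's `random` module rather than using Hypothesis itself, which additionally automates input generation strategies and shrinks failing inputs to minimal cases. The extractor under test, `extract_ipv4`, is a hypothetical example component.

```python
import random
import re
import string

IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ipv4(text: str) -> list[str]:
    """Hypothetical component under test: pull well-formed IPv4 addresses from text."""
    candidates = IPV4_PATTERN.findall(text)
    return [c for c in candidates if all(int(part) <= 255 for part in c.split("."))]


def check_extractor_properties(trials: int = 200, seed: int = 0) -> None:
    """Assert properties that must hold for any input, not specific expected outputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    alphabet = string.ascii_letters + string.digits + " .:/-"
    for _ in range(trials):
        length = rng.randint(0, 80)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        found = extract_ipv4(text)
        for address in found:
            # Property 1: nothing is invented; every result occurs in the input.
            assert address in text
            # Property 2: every result has four octets in the valid range.
            octets = address.split(".")
            assert len(octets) == 4
            assert all(0 <= int(octet) <= 255 for octet in octets)
        # Property 3: extraction is deterministic.
        assert extract_ipv4(text) == found
```

The first property is the extraction analogue of a hallucination check: whatever the component returns must be traceable to its input.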

Common Pitfalls and Anti-Patterns

Evaluation failures often stem from methodological issues rather than tooling problems. Understanding common pitfalls helps teams avoid wasted effort and misleading results.

| Anti-Pattern | Problem | Solution |
| --- | --- | --- |
| Overfitting to benchmarks | High benchmark scores don’t translate to production | Use diverse evaluation sets, monitor production metrics |
| Data leakage | Evaluation data appears in training | Strict dataset separation, temporal splits |
| Single metric focus | Missing important quality dimensions | Multi-dimensional evaluation rubrics |
| Ignoring edge cases | Failures on rare but critical scenarios | Adversarial testing, red teaming |
| Static evaluation | Datasets don’t reflect evolving threats | Regular dataset updates, drift detection |
| Automation without validation | LLM judges introduce systematic biases | Periodic human validation of automated metrics |

Overfitting to benchmarks occurs when teams optimize specifically for benchmark performance without ensuring real-world quality. Include held-out evaluation sets that are never used for optimization, and validate benchmark improvements against production metrics.

Data leakage happens when evaluation data inadvertently appears in training or prompt examples. For security AI, this can occur when incident data used for evaluation was previously used to train detection models. Implement strict data governance and temporal splits.

Single metric focus leads to optimizing one dimension at the expense of others. A model that achieves high accuracy but produces unsafe outputs is worse than one with moderate accuracy and strong safety. Define multi-dimensional evaluation criteria with minimum thresholds for each dimension.
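
A temporal split, as suggested for avoiding data leakage, can be sketched in a few lines. The record shape (an `id` and an `observed` date) and the helper names are assumptions for illustration.

```python
from datetime import date
from typing import Dict, List, Set, Tuple

Record = Dict[str, object]  # assumed shape: {"id": str, "observed": date, ...}


def temporal_split(
    records: List[Record], cutoff: date
) -> Tuple[List[Record], List[Record]]:
    """Records observed before the cutoff go to development, the rest to evaluation."""
    development = [r for r in records if r["observed"] < cutoff]
    evaluation = [r for r in records if r["observed"] >= cutoff]
    return development, evaluation


def leaked_ids(development: List[Record], evaluation: List[Record]) -> Set[object]:
    """IDs present in both splits; anything returned here indicates leakage."""
    return {r["id"] for r in development} & {r["id"] for r in evaluation}


if __name__ == "__main__":
    incidents = [
        {"id": "inc-1", "observed": date(2023, 1, 10)},
        {"id": "inc-2", "observed": date(2023, 6, 1)},
        {"id": "inc-3", "observed": date(2024, 2, 5)},
    ]
    dev, held_out = temporal_split(incidents, cutoff=date(2024, 1, 1))
    print(len(dev), len(held_out), leaked_ids(dev, held_out))
```

Splitting by time rather than at random means the evaluation set only contains incidents the system could not have seen during development, which more closely mirrors production conditions. The overlap check is a cheap guard to run whenever either split is regenerated.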

Implementation Checklist

Use this checklist when establishing AI evaluation and testing practices for security systems:

Evaluation Setup

  • Define evaluation dimensions relevant to your use cases (accuracy, safety, robustness)
  • Establish baseline metrics from current system or human performance
  • Create labeled evaluation datasets covering normal and edge cases
  • Implement LLM-as-judge with validated correlation to human judgment
  • Set up human evaluation workflows for periodic deep assessment

Testing Infrastructure

  • Implement unit tests for prompts, parsers, and tool integrations
  • Create integration tests for multi-step workflows and RAG pipelines
  • Build golden datasets for regression detection
  • Configure CI/CD integration with blocking quality gates
  • Establish safety test suites that must pass for any deployment

Quality Assurance

  • Deploy canary infrastructure with automatic rollback triggers
  • Implement production sampling and continuous evaluation
  • Set up alerting for quality metric degradation
  • Create incident response procedures for AI quality issues
  • Schedule periodic human review of production outputs

Governance

  • Document evaluation methodology and metrics definitions
  • Establish dataset versioning and change tracking
  • Define minimum quality thresholds for production deployment
  • Create review processes for benchmark and threshold updates
  • Implement audit logging for evaluation results
