AI evaluation presents unique challenges for security applications—outputs are often non-deterministic, correctness depends on context, and security-critical decisions require high confidence. Unlike traditional software where unit tests provide deterministic pass/fail results, AI systems produce probabilistic outputs that require statistical evaluation, semantic comparison, and domain-expert judgment. Robust evaluation frameworks ensure AI systems perform reliably across diverse scenarios while catching regressions before they impact security operations.
The consequences of inadequate AI evaluation in security contexts are severe. An undertested alert classification model may miss critical threats or generate excessive false positives that overwhelm analysts. A poorly evaluated investigation assistant may provide incorrect remediation guidance that worsens incidents. An unvalidated threat intelligence summarizer may hallucinate indicators of compromise that trigger unnecessary containment actions. Security teams must implement comprehensive evaluation strategies that address accuracy, reliability, safety, and robustness before deploying AI systems to production.
According to research from Stanford's HELM project (Holistic Evaluation of Language Models) and Anthropic, model performance varies significantly across tasks and domains, and security-specific tasks are often underrepresented in general benchmarks. Organizations must therefore develop custom evaluation frameworks that reflect their own use cases, threat landscape, and operational requirements rather than relying solely on generic LLM benchmarks.
Why AI Evaluation Differs from Traditional Testing
Traditional software testing relies on deterministic assertions—given input X, the system must produce output Y. AI systems fundamentally break this paradigm. The same prompt may produce different responses across invocations, multiple valid answers may exist for a single question, and correctness often requires human judgment rather than exact matching. Security engineers must adopt evaluation methodologies that account for this inherent variability while still maintaining rigorous quality standards.
| Challenge | Traditional Testing | AI Evaluation Approach |
|---|---|---|
| Non-determinism | Exact output matching | Semantic similarity, statistical bounds |
| Multiple valid answers | Single expected result | Rubric-based scoring, acceptability criteria |
| Subjective quality | Binary pass/fail | Continuous quality scores, human ratings |
| Context dependence | Isolated unit tests | Scenario-based evaluation, conversation testing |
| Emergent behaviors | Predefined test cases | Adversarial probing, red teaming |
| Distribution shift | Static test suites | Continuous monitoring, drift detection |
The NIST AI Risk Management Framework emphasizes that AI systems require ongoing evaluation throughout their lifecycle, not just pre-deployment testing. Security AI systems must be continuously monitored for performance degradation, emerging failure modes, and adversarial manipulation attempts.
Evaluation Fundamentals
Evaluation Dimensions
Security AI systems must be evaluated across multiple dimensions, each addressing different aspects of system quality and operational readiness. A system that scores well on accuracy but poorly on safety may be more dangerous than one with moderate accuracy and strong safety guarantees.
| Dimension | Focus | Security Relevance | Measurement Approach |
|---|---|---|---|
| Accuracy | Correctness of outputs | Alert classification, threat detection | Precision, recall, F1 against labeled data |
| Reliability | Consistency across runs | Operational stability | Variance analysis, reproducibility testing |
| Safety | Harmful output prevention | Action gating, content filtering | Red team testing, guardrail evaluation |
| Robustness | Adversarial resistance | Prompt injection defense | Adversarial benchmark suites |
| Latency | Response time | Real-time detection requirements | Percentile latency measurements |
| Calibration | Confidence accuracy | Risk-based decision making | Expected calibration error |
Accuracy measures whether the AI system produces correct outputs for its intended task. For security applications, this includes correctly classifying alerts, accurately extracting indicators of compromise, and providing factually correct threat intelligence summaries.
Reliability assesses whether the system produces consistent outputs across repeated invocations and varying conditions. Security operations require predictable AI behavior—analysts cannot effectively use tools that produce wildly different outputs for similar inputs.
Safety evaluates whether the system avoids harmful outputs, including dangerous recommendations, policy violations, and content that could enable attacks. Safety evaluation requires adversarial testing by security experts who attempt to elicit harmful behaviors.
Robustness measures resistance to adversarial manipulation, including prompt injection, jailbreaking, and input perturbations. Security AI systems are high-value targets for attackers who may attempt to manipulate their outputs.
Calibration assesses whether the system’s confidence scores accurately reflect actual correctness probability. Well-calibrated systems enable risk-based decision making—high-confidence outputs can be automated while low-confidence outputs require human review.
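Expected calibration error, the measurement listed for calibration in the table above, can be computed with a few lines of standard-library Python. The sketch below assumes equal-width confidence bins and a default of ten bins; the function name and the bin count are illustrative choices, not something this guide prescribes.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Each bin accumulates [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(conf_sum / count - n_correct / count)
        ece += (count / total) * gap
    return ece
```

A model that reports 0.9 confidence on four alerts and gets three right has a gap of 0.15 in that bin; a value near zero suggests confidence scores can be used for routing decisions such as automating high-confidence outputs.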
Evaluation vs. Testing
While often used interchangeably, evaluation and testing serve distinct purposes in AI quality assurance. Understanding this distinction helps security teams implement appropriate processes for each.
| Aspect | Evaluation | Testing |
|---|---|---|
| Purpose | Measure quality holistically | Verify specific behaviors |
| Approach | Scoring, benchmarking, comparison | Pass/fail assertions |
| Frequency | Continuous monitoring | CI/CD integration, pre-deployment |
| Output | Quality metrics, rankings | Test results, coverage reports |
| Scope | System-level performance | Component-level correctness |
| Methodology | Statistical analysis | Deterministic verification |
Evaluation provides a comprehensive view of system quality through benchmarks, human assessment, and statistical analysis. Evaluation answers questions like “How well does this system perform on alert classification?” or “Is this model better than our current production model?”
Testing verifies specific behaviors and catches regressions through automated assertions. Testing answers questions like “Does this prompt still produce valid JSON?” or “Does the system correctly reject this known prompt injection?”
Effective AI quality assurance requires both approaches—evaluation to understand overall system quality and testing to catch specific regressions and verify critical behaviors.
Evaluation Methodologies
Effective AI evaluation combines multiple methodologies, each with distinct strengths and limitations. Human evaluation provides ground truth but doesn’t scale. Automated metrics scale but may miss nuanced quality issues. LLM-as-judge approaches offer a middle ground but introduce their own biases. Security teams should implement a portfolio of evaluation methods appropriate to their use cases and risk tolerance.
Human Evaluation
Human evaluation remains the gold standard for assessing AI output quality, particularly for subjective dimensions like helpfulness, clarity, and appropriateness. For security applications, domain experts can identify subtle errors that automated metrics miss—incorrect threat actor attribution, implausible attack chains, or recommendations that violate operational constraints.
| Method | Description | Use Cases | Considerations |
|---|---|---|---|
| Expert review | Security analysts assess outputs | High-stakes decisions, novel scenarios | Expensive, limited scale |
| Preference ranking | Compare output quality between models | Model selection, A/B testing | Requires paired comparisons |
| Error analysis | Categorize and analyze failure modes | Improvement prioritization, debugging | Time-intensive, requires expertise |
| Blind evaluation | Remove model identification bias | Fair comparison, vendor evaluation | Requires careful study design |
| Likert scoring | Rate outputs on numeric scales | Quality tracking, regression detection | Requires calibrated raters |
Expert review involves security analysts examining AI outputs for correctness, completeness, and appropriateness. This approach catches domain-specific errors that automated metrics miss but requires significant analyst time. Reserve expert review for high-stakes evaluations and sample-based quality monitoring.
Preference ranking asks evaluators to compare outputs from different models or configurations, identifying which produces better results. This approach is more reliable than absolute scoring because humans are better at relative comparisons. Use preference ranking for model selection and prompt optimization.
Error analysis systematically categorizes failure modes to identify improvement priorities. Security teams should maintain taxonomies of error types—hallucinated IOCs, incorrect severity assessments, missing context, unsafe recommendations—and track their frequency over time.
Automated Evaluation
Automated evaluation enables continuous quality monitoring at scale, catching regressions quickly and providing consistent measurement across deployments. However, automated metrics often fail to capture nuanced quality dimensions that matter for security applications.
| Method | Description | Strengths | Limitations |
|---|---|---|---|
| Reference matching | Compare to ground truth | Objective, reproducible | Requires labeled data, misses valid alternatives |
| Semantic similarity | Embedding-based comparison | Captures meaning, not just words | May miss factual errors |
| LLM-as-judge | Use LLM to evaluate outputs | Scalable, nuanced | Introduces evaluator bias |
| Rule-based checks | Structured output validation | Fast, deterministic | Limited to structural properties |
| Factual verification | Check claims against sources | Catches hallucinations | Requires authoritative sources |
Reference matching compares AI outputs to known-correct answers using metrics like exact match, BLEU, or ROUGE. While objective and reproducible, these metrics penalize valid alternative phrasings and require expensive labeled datasets. Use reference matching for tasks with clear correct answers, such as entity extraction or classification.
Semantic similarity uses embedding models to compare the meaning of outputs rather than exact wording. This approach better handles paraphrasing but may miss factual errors if the incorrect output is semantically similar to the correct one. Combine semantic similarity with factual verification for security applications.
Rule-based checks verify structural properties like JSON schema compliance, required field presence, and value range constraints. These checks are fast and deterministic but only catch format errors, not semantic issues. Implement rule-based checks as a first validation layer before more expensive evaluation.
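As a rough illustration of a first validation layer, the sketch below checks a triage response for valid JSON, required fields, an enumerated severity, and a confidence range. The field names (`severity`, `confidence`, `iocs`) and the allowed severity labels are assumptions made for this example; a real schema would come from your own output contract.

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed label set


def validate_triage_output(raw: str) -> list[str]:
    """Return rule violations; an empty list means every check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    # Required field presence.
    for field in ("severity", "confidence", "iocs"):
        if field not in data:
            errors.append(f"missing required field: {field}")

    # Enumerated value check.
    severity = data.get("severity")
    if "severity" in data and (
        not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES
    ):
        errors.append("severity must be one of: " + ", ".join(sorted(ALLOWED_SEVERITIES)))

    # Value range check (bool is excluded because it is a subclass of int).
    confidence = data.get("confidence")
    if "confidence" in data:
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if not is_number or not 0.0 <= confidence <= 1.0:
            errors.append("confidence must be a number between 0 and 1")

    # Type check on the indicator list.
    iocs = data.get("iocs")
    if "iocs" in data and not (
        isinstance(iocs, list) and all(isinstance(item, str) for item in iocs)
    ):
        errors.append("iocs must be a list of strings")
    return errors
```

Returning a list of violations rather than raising on the first one makes it easier to log every structural problem in a single pass before handing the output to more expensive evaluators.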
LLM-as-Judge Patterns
LLM-as-judge approaches use language models to evaluate other language model outputs, offering a scalable alternative to human evaluation. Research from Anthropic and OpenAI demonstrates that well-designed LLM judges can achieve high correlation with human preferences, though they exhibit systematic biases that evaluators must understand and mitigate.
| Pattern | Description | Best For | Bias Considerations |
|---|---|---|---|
| Pointwise scoring | Rate individual outputs on criteria | Absolute quality measurement | Position bias, verbosity bias |
| Pairwise comparison | Compare two outputs directly | Model comparison, ranking | Order effects, tie handling |
| Reference-guided | Evaluate against ground truth | Factual accuracy assessment | Reference quality dependence |
| Criteria-based | Score on specific rubrics | Multi-dimensional evaluation | Rubric interpretation variance |
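Pointwise scoring asks the judge model to rate individual outputs on a numeric scale or categorical rubric. This approach enables absolute quality measurement but suffers from calibration drift (the score assigned to a given quality level shifts between runs or judge versions) and from sensitivity to the order of any examples included in the judge prompt. Mitigate by randomizing example order and using consistent prompting.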
Pairwise comparison presents two outputs and asks which is better, often with justification. This approach is more reliable than pointwise scoring because relative comparisons are easier than absolute judgments. Use pairwise comparison for model selection and prompt optimization.
Reference-guided evaluation provides the judge with a reference answer and asks it to assess how well the candidate output matches. This approach works well for factual tasks but depends heavily on reference quality. Ensure references are verified by domain experts.
Criteria-based evaluation defines specific rubrics (accuracy, completeness, safety, clarity) and asks the judge to score each dimension. This approach provides actionable feedback for improvement but requires careful rubric design to ensure consistent interpretation.
When implementing LLM-as-judge, use the most capable available model as the judge, provide clear evaluation criteria with examples, and validate judge outputs against human evaluation on a sample basis. Be aware that LLM judges tend to prefer longer, more verbose outputs and may exhibit self-preference bias when evaluating outputs from the same model family.
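One common way to reduce order effects in pairwise judging is to ask the judge twice with the answers swapped and accept a winner only when both verdicts agree. The sketch below assumes a hypothetical `Judge` callable that wraps whatever model call you use and returns the string `"first"` or `"second"`; that interface is invented for this example.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" | "second".
# In practice this would wrap an LLM call; here it is just a callable.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings; return "a", "b", or "tie" when the verdicts disagree."""
    forward = judge(question, answer_a, answer_b)   # answer_a shown first
    backward = judge(question, answer_b, answer_a)  # answer_b shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")

    # Count how many of the two orderings favoured answer_a.
    a_votes = int(forward == "first") + int(backward == "second")
    if a_votes == 2:
        return "a"
    if a_votes == 0:
        return "b"
    return "tie"  # the judge flipped with position, so treat the pair as undecided


if __name__ == "__main__":
    # A stub judge that always prefers whichever answer is shown first.
    position_biased = lambda question, first, second: "first"
    print(debiased_pairwise(position_biased, "Is this alert malicious?", "yes", "no"))
```

A judge that always picks the first position yields a tie here instead of a spurious win. The share of ties across a dataset is also a useful signal of how position-sensitive the judge is.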
Benchmark Design
Benchmarks provide standardized measurement of AI system capabilities, enabling comparison across models, tracking performance over time, and identifying specific weaknesses. For security applications, general-purpose LLM benchmarks often fail to capture domain-specific requirements—security teams must develop custom benchmarks that reflect their operational reality.
Security AI Benchmarks
Security AI benchmarks should assess both general AI capabilities and domain-specific competencies. A comprehensive benchmark suite includes tests for threat detection accuracy, security knowledge, reasoning about attack chains, and safe handling of sensitive information.
| Benchmark Type | Purpose | Example Metrics | Security Considerations |
|---|---|---|---|
| Detection accuracy | Threat identification | Precision, recall, F1 | Balance false positive/negative costs |
| Classification | Alert categorization | Accuracy, confusion matrix | Class imbalance handling |
| Explanation quality | Investigation support | Completeness, accuracy, clarity | Actionable guidance |
| Action appropriateness | Automation decisions | Safety, effectiveness | Risk-weighted evaluation |
| Knowledge recall | Security expertise | Factual accuracy | Currency of threat intelligence |
| Reasoning | Attack chain analysis | Logical consistency | Multi-step inference validation |
Detection benchmarks measure the system’s ability to identify threats, malicious activity, or security-relevant events. Design these benchmarks to reflect realistic class distributions—most production traffic is benign, so systems must maintain high precision while catching true threats. Include metrics at multiple confidence thresholds to understand the precision-recall tradeoff.
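To make the precision-recall tradeoff concrete, the following sketch computes precision, recall, and F1 at a given confidence threshold. The scores and labels in the usage block are made-up illustration data, and reporting 0.0 when a ratio is undefined is a convention chosen for this example.

```python
from typing import Sequence


def precision_recall_f1(
    scores: Sequence[float],
    labels: Sequence[bool],
    threshold: float,
) -> tuple[float, float, float]:
    """Treat score >= threshold as a 'threat' prediction and score it against labels."""
    tp = fp = fn = 0
    for score, is_threat in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_threat:
            tp += 1
        elif predicted and not is_threat:
            fp += 1
        elif not predicted and is_threat:
            fn += 1

    # Undefined ratios (no predictions, or no positives) are reported as 0.0 here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]     # illustrative model scores
    labels = [True, True, False, True, False, False]  # illustrative ground truth
    for threshold in (0.3, 0.5, 0.9):
        p, r, f = precision_recall_f1(scores, labels, threshold)
        print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 moves precision from 0.75 to 1.0 while recall falls from 1.0 to about 0.33, which is the kind of sweep a detection benchmark should report.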
Classification benchmarks assess categorization accuracy across alert types, severity levels, or MITRE ATT&CK mappings. Use stratified sampling to ensure adequate coverage of rare but critical categories. Track per-class performance to identify systematic weaknesses.
Explanation benchmarks evaluate the quality of AI-generated analysis and recommendations. These are inherently more subjective and often require human evaluation or LLM-as-judge approaches. Assess completeness (does the explanation cover all relevant factors?), accuracy (are the claims correct?), and actionability (can an analyst act on this guidance?).
Dataset Construction
Building high-quality evaluation datasets is often the most challenging aspect of AI evaluation. Security datasets must balance representativeness, diversity, and sensitivity concerns while avoiding data leakage that would inflate apparent performance.
| Dataset Type | Source | Advantages | Challenges |
|---|---|---|---|
| Real incident data | Historical investigations | Authentic scenarios, proven relevance | Privacy concerns, limited volume |
| Synthetic scenarios | Expert-generated or AI-augmented | Controlled coverage, edge cases | May not reflect reality |
| Adversarial examples | Red team exercises | Tests robustness | Resource-intensive to create |
| Regression cases | Previous failures | Prevents known issues | May overfit to past problems |
| Public benchmarks | Research datasets | Reproducible, comparable | May not match your use cases |
Real incident data provides the most authentic evaluation material but requires careful anonymization to protect sensitive information. Work with legal and privacy teams to establish data handling procedures. Consider using synthetic variants of real incidents that preserve structure while changing identifying details.
Synthetic scenarios enable controlled coverage of edge cases, rare events, and novel threat types. Use domain experts to design scenarios that stress-test specific capabilities. Validate synthetic data against real-world patterns to ensure relevance.
Adversarial examples test system robustness against manipulation attempts. Include prompt injection attacks, jailbreaking attempts, and input perturbations. Partner with red teams to develop realistic adversarial scenarios based on actual attack techniques.
Dataset versioning is critical for reproducible evaluation. Track dataset composition, creation methodology, and any modifications. Maintain separate training and evaluation splits to prevent data contamination. Update datasets regularly to reflect evolving threats while preserving historical versions for regression tracking.
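One lightweight way to track dataset composition is to store a content fingerprint alongside each version, so that any change to a record is detectable. The sketch below is a minimal example under stated assumptions: records are JSON-serializable dictionaries, and the manifest fields (`name`, `version`, `method`) are hypothetical names chosen for illustration.

```python
import hashlib
import json


def dataset_fingerprint(records: list[dict]) -> str:
    """Order-independent SHA-256 over the canonical JSON form of each record."""
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for record in records
    )
    return hashlib.sha256("\n".join(record_hashes).encode("utf-8")).hexdigest()


def build_manifest(name: str, version: str, records: list[dict], method: str) -> dict:
    """Bundle the metadata worth recording for each dataset version."""
    return {
        "name": name,
        "version": version,
        "record_count": len(records),
        "creation_method": method,  # e.g. "anonymized incidents" or "expert-written"
        "fingerprint": dataset_fingerprint(records),
    }


if __name__ == "__main__":
    records = [
        {"id": "case-1", "input": "powershell -enc ...", "label": "malicious"},
        {"id": "case-2", "input": "scheduled backup job", "label": "benign"},
    ]
    print(build_manifest("alert-triage-eval", "2024.1", records, "expert-written"))
```

Because records are hashed individually and then sorted, reordering the file does not change the fingerprint, while editing a single label does. Evaluation reports can then cite the fingerprint they were run against.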
Testing Strategies
While evaluation measures overall system quality through benchmarks and metrics, testing verifies specific behaviors through deterministic assertions. AI testing requires adapting traditional testing practices to handle non-deterministic outputs while maintaining rigorous quality gates.
Unit Testing for AI
Unit testing for AI systems focuses on verifiable components—prompt templates, output parsers, tool integrations, and error handling logic. While LLM outputs are non-deterministic, many surrounding components can be tested deterministically.
| Test Type | Focus | Implementation | Verification Approach |
|---|---|---|---|
| Prompt testing | Prompt behavior | Test prompt variations | Output constraint checking |
| Output format | Structural compliance | Schema validation | JSON schema, Pydantic models |
| Tool integration | External calls | Mock responses | Function call verification |
| Error handling | Failure modes | Edge case inputs | Exception handling validation |
| Guardrail testing | Safety mechanisms | Adversarial inputs | Block rate verification |
Prompt testing verifies that prompts produce outputs meeting specified constraints. Rather than checking for exact outputs, test that outputs satisfy requirements: correct format, required fields present, values within expected ranges. Use property-based testing to explore prompt behavior across input variations.
Output format testing ensures AI outputs can be parsed by downstream systems. Validate against JSON schemas, Pydantic models, or custom parsers. Test edge cases like empty responses, malformed JSON, and unexpectedly long outputs.
Tool integration testing verifies that the AI system correctly invokes external tools and handles their responses. Mock external services to test error handling, timeout behavior, and response parsing. Verify that tool calls include required parameters and follow expected patterns.
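The sketch below shows how a mocked external service can be used to check both the parameters of a tool call and the timeout path, using `unittest.mock` from the standard library. The `enrich_ip` wrapper, its `lookup` dependency, and the five-second timeout are hypothetical names and values invented for this example. The tests are written as plain functions so they can be collected by pytest or called directly.

```python
from unittest.mock import Mock


def enrich_ip(ip: str, lookup) -> dict:
    """Hypothetical tool wrapper: query a reputation service, degrade on timeout."""
    try:
        result = lookup(ip=ip, timeout_seconds=5)
    except TimeoutError:
        # Surface the failure explicitly instead of guessing a reputation.
        return {"ip": ip, "status": "unavailable", "reputation": None}
    return {"ip": ip, "status": "ok", "reputation": result["reputation"]}


def test_enrich_ip_passes_required_parameters():
    lookup = Mock(return_value={"reputation": "malicious"})
    enriched = enrich_ip("203.0.113.9", lookup)
    assert enriched == {"ip": "203.0.113.9", "status": "ok", "reputation": "malicious"}
    # The call must carry both the target and an explicit timeout.
    lookup.assert_called_once_with(ip="203.0.113.9", timeout_seconds=5)


def test_enrich_ip_handles_timeout():
    lookup = Mock(side_effect=TimeoutError("upstream timed out"))
    enriched = enrich_ip("203.0.113.9", lookup)
    assert enriched["status"] == "unavailable"
    assert enriched["reputation"] is None


if __name__ == "__main__":
    test_enrich_ip_passes_required_parameters()
    test_enrich_ip_handles_timeout()
    print("tool integration checks passed")
```

Passing the external dependency in as an argument keeps the wrapper testable without network access, which is what allows these checks to run deterministically in CI.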
Integration Testing
Integration testing validates that AI system components work together correctly, including multi-step workflows, retrieval pipelines, and agent behaviors. These tests are more expensive to run but catch issues that unit tests miss.
| Test Level | Scope | Considerations | Key Assertions |
|---|---|---|---|
| Chain testing | Multi-step workflows | State management, error propagation | Step completion, state consistency |
| RAG testing | Retrieval + generation | Context relevance, grounding | Retrieved context quality, citation accuracy |
| Agent testing | Autonomous actions | Safety bounds, resource limits | Action constraints, termination conditions |
| End-to-end | Complete workflows | Real dependencies, timing | Workflow completion, output correctness |
Chain testing validates multi-step workflows where each step’s output feeds into the next. Test that state propagates correctly, errors are handled gracefully, and the chain produces expected final outputs. Include tests for partial failures where intermediate steps fail but recovery is possible.
RAG testing assesses the entire retrieval-augmented generation pipeline, from query understanding through retrieval to response generation. Verify that retrieved context is relevant, responses are grounded in retrieved information, and the system handles cases where retrieval returns no relevant results.
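A simple structural check on grounding is to confirm that every source an answer cites was actually retrieved. The sketch below assumes, purely for illustration, that answers cite sources with bracketed identifiers such as `[doc-3]`; that convention and the function name are hypothetical. It only verifies that citations point at retrieved documents, not that the cited text supports the claim.

```python
import re

# Assumed citation convention for this example: [doc-<number>].
CITATION_PATTERN = re.compile(r"\[(doc-\d+)\]")


def grounding_issues(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Flag citations to unretrieved documents and uncited use of retrieved context."""
    cited = set(CITATION_PATTERN.findall(answer))
    issues = []

    unsupported = sorted(cited - retrieved_ids)
    if unsupported:
        issues.append(f"cites documents that were not retrieved: {unsupported}")

    # When context was available, an answer with no citations is suspicious.
    if retrieved_ids and not cited:
        issues.append("context was retrieved but the answer cites none of it")
    return issues


if __name__ == "__main__":
    print(grounding_issues("Beaconing matches a known C2 profile [doc-2].", {"doc-1", "doc-2"}))
    print(grounding_issues("Observed in a prior campaign [doc-7].", {"doc-1"}))
    print(grounding_issues("No relevant documents were found.", set()))
```

The third call covers the empty-retrieval case: an answer that declines to cite anything when nothing was retrieved passes, while an answer that invents a citation would be flagged.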
Agent testing validates autonomous AI systems that take actions in external environments. Test safety constraints to ensure agents stay within permitted action bounds. Verify termination conditions prevent runaway execution. Test resource limits to prevent excessive API calls or compute usage.
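The constraints described above can be exercised with a small harness that enforces an action allowlist and a step limit, then tested against a runaway plan and a forbidden action. This is a minimal sketch: the action names, the exception classes, and `run_bounded` are hypothetical, and a real agent loop would interleave model calls with execution.

```python
import itertools
from typing import Callable, Iterable

ALLOWED_ACTIONS = {"lookup_ip", "fetch_alert", "add_note"}  # hypothetical allowlist


class ActionNotPermitted(Exception):
    """Raised when the agent proposes an action outside the allowlist."""


class StepLimitExceeded(Exception):
    """Raised when the agent tries to continue past its step budget."""


def run_bounded(
    proposed_actions: Iterable[str],
    execute: Callable[[str], None],
    max_steps: int = 5,
) -> int:
    """Execute proposed actions while enforcing the allowlist and step limit."""
    steps = 0
    for action in proposed_actions:
        if steps >= max_steps:
            raise StepLimitExceeded(f"stopped after {max_steps} steps")
        if action not in ALLOWED_ACTIONS:
            raise ActionNotPermitted(action)  # checked before anything runs
        execute(action)
        steps += 1
    return steps


def test_runaway_agent_is_stopped():
    executed = []
    try:
        run_bounded(itertools.cycle(["lookup_ip"]), executed.append, max_steps=3)
    except StepLimitExceeded:
        pass
    else:
        raise AssertionError("expected StepLimitExceeded")
    assert len(executed) == 3


def test_forbidden_action_never_executes():
    executed = []
    try:
        run_bounded(["fetch_alert", "isolate_host"], executed.append)
    except ActionNotPermitted:
        pass
    else:
        raise AssertionError("expected ActionNotPermitted")
    assert executed == ["fetch_alert"]


if __name__ == "__main__":
    test_runaway_agent_is_stopped()
    test_forbidden_action_never_executes()
    print("agent bound checks passed")
```

Feeding the harness an infinite iterator is a cheap way to simulate an agent that never decides to stop, which is the failure a termination test needs to provoke.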
Regression Testing
Regression testing prevents quality degradation as systems evolve. AI systems are particularly susceptible to subtle regressions—prompt changes, model updates, or dependency upgrades may cause performance degradation that’s invisible without systematic monitoring.
| Regression Type | Detection Method | Trigger | Response |
|---|---|---|---|
| Behavioral regression | Golden dataset comparison | Code changes, prompt updates | Block deployment, investigate |
| Semantic regression | Embedding similarity | Model updates | Review changes, update baselines |
| Performance regression | Latency percentiles | Infrastructure changes | Profile, optimize |
| Safety regression | Adversarial test suites | Any change | Block deployment, escalate |
Golden dataset testing maintains a fixed set of inputs with verified correct outputs. Run the system against the golden dataset before every deployment and compare its outputs to the stored baselines. Use semantic similarity rather than exact matching to allow acceptable variation while catching meaningful changes.
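A golden dataset check can be sketched as a loop that scores each fresh output against its verified baseline and reports the prompts that fall below a threshold. To stay dependency-free, the example uses `difflib.SequenceMatcher` as a stand-in similarity function; that is a lexical measure, not a semantic one, so an embedding-based scorer would normally be swapped in through the `similarity` parameter. The threshold of 0.8 and the `generate` callable are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable


def lexical_similarity(a: str, b: str) -> float:
    """Stand-in scorer in [0, 1]; replace with an embedding-based similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def golden_regressions(
    golden: dict[str, str],
    generate: Callable[[str], str],
    similarity: Callable[[str, str], float] = lexical_similarity,
    threshold: float = 0.8,
) -> list[str]:
    """Return the golden prompts whose current output drifted below the threshold."""
    failures = []
    for prompt, expected in golden.items():
        score = similarity(generate(prompt), expected)
        if score < threshold:
            failures.append(prompt)
    return failures


if __name__ == "__main__":
    golden = {"Classify: repeated failed logins from one IP": "Escalate to tier 2"}
    stable_model = lambda prompt: "Escalate to tier 2"
    print(golden_regressions(golden, stable_model))
```

A deployment gate would treat any non-empty result as a reason to block and investigate, as the regression table above suggests for behavioral regressions.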
Semantic regression detection identifies when outputs change meaning even if they remain structurally valid. Track embedding similarity of outputs over time, alerting when outputs drift beyond acceptable thresholds. This catches subtle degradation that format-based tests miss.
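One simple way to track drift is to compare the centroid of a baseline window of output embeddings with the centroid of a recent window. The sketch below assumes the embeddings are already available as equal-length lists of floats (how they are produced is out of scope here), and the 0.9 similarity floor is an arbitrary illustrative threshold.

```python
import math
from typing import Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> list[float]:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def drift_alert(
    baseline: Sequence[Sequence[float]],
    current: Sequence[Sequence[float]],
    min_similarity: float = 0.9,
) -> bool:
    """True when the recent window's centroid has moved away from the baseline."""
    return cosine_similarity(centroid(baseline), centroid(current)) < min_similarity


if __name__ == "__main__":
    baseline = [[1.0, 0.0], [0.9, 0.1]]   # toy 2-D "embeddings"
    similar = [[1.0, 0.05], [0.95, 0.1]]
    shifted = [[0.0, 1.0], [0.1, 0.9]]
    print(drift_alert(baseline, similar), drift_alert(baseline, shifted))
```

Centroid comparison is coarse: it can miss shifts that change the spread of outputs without moving the mean, so it works best alongside the golden dataset and safety suites rather than in place of them.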
Safety regression testing specifically verifies that safety mechanisms remain effective. Maintain a suite of adversarial inputs that the system must correctly reject. Any safety regression should block deployment regardless of other metrics.
Quality Assurance
Quality assurance for AI systems requires continuous monitoring rather than one-time validation. Unlike traditional software where passing tests provides confidence in correctness, AI systems can degrade due to distribution shift, model updates, or subtle changes in upstream data. Implement layered quality gates that catch issues at every stage of the deployment lifecycle.
Continuous Evaluation
Production AI systems require ongoing evaluation to detect quality degradation before it impacts operations. Implement automated monitoring that samples production traffic, evaluates output quality, and alerts on anomalies.
| Practice | Frequency | Purpose | Implementation |
|---|---|---|---|
| Automated benchmarks | Per deployment | Catch regressions | CI/CD integration, blocking gates |
| Sample evaluation | Daily/weekly | Monitor production quality | Random sampling, LLM-as-judge |
| Human review | Periodic | Deep quality assessment | Expert queue, structured rubrics |
| A/B testing | Per change | Compare variations | Traffic splitting, statistical significance |
| Drift detection | Continuous | Identify distribution shift | Input/output embedding monitoring |
Automated benchmarks run on every deployment, comparing performance against established baselines. Configure CI/CD pipelines to block deployments that fail benchmark thresholds. Include both accuracy metrics and safety tests in the benchmark suite.
Sample evaluation continuously monitors production quality by evaluating a random sample of inputs and outputs. Use LLM-as-judge approaches for scalable evaluation, with periodic human validation of judge accuracy. Alert when quality metrics fall below thresholds.
A/B testing enables safe comparison of model changes by routing a portion of traffic to the new version while maintaining a control group. Use appropriate statistical methods to determine significance and ensure adequate sample sizes before drawing conclusions.
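When the quality metric is a pass rate (for example, the fraction of sampled outputs rated acceptable), one standard significance check is a two-proportion z-test. The sketch below implements it with the standard library only; the counts in the usage block are invented for illustration, and the test relies on the normal approximation, so it is not suitable for very small samples.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates between B and A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one observation")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts: control rated acceptable 800/1000, candidate 860/1000.
    z, p = two_proportion_z_test(800, 1000, 860, 1000)
    print(f"z={z:.2f} p={p:.4f}")
```

With only ten samples per arm, a difference of 80% versus 90% is far from significant under this test, which is the sample-size caution the paragraph above raises.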
Quality Gates
Quality gates enforce minimum standards at each stage of deployment, preventing degraded models from reaching production and enabling rapid rollback when issues occur.
| Gate | Stage | Criteria | Action on Failure |
|---|---|---|---|
| Pre-deployment | CI/CD pipeline | Benchmark thresholds, safety tests | Block deployment |
| Canary | Initial rollout (5-10%) | Live quality metrics, error rates | Automatic rollback |
| Progressive | Staged rollout | Consistent metrics across stages | Pause expansion |
| Production | Full deployment | Ongoing monitoring | Alert, investigate, rollback |
Pre-deployment gates verify that new versions meet minimum quality standards before any production exposure. Include accuracy benchmarks, latency requirements, and safety test suites. Make these gates blocking—no exceptions for “urgent” deployments.
Canary deployments expose new versions to a small percentage of production traffic while monitoring for quality degradation. Configure automatic rollback triggers based on error rates, latency degradation, or quality metric drops. The Google SRE Workbook provides detailed guidance on canary analysis.
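A rollback trigger of the kind described above can be expressed as a pure function over two metric windows, which keeps the decision logic easy to unit test. In this sketch the metric names and every threshold (one percentage point of extra errors, a 20% latency increase, a 0.05 quality drop) are hypothetical defaults, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    error_rate: float      # fraction of failed requests in the window
    p95_latency_ms: float  # 95th percentile latency
    quality_score: float   # e.g. mean judge score in [0, 1]


def rollback_reasons(
    baseline: WindowMetrics,
    canary: WindowMetrics,
    max_error_increase: float = 0.01,
    max_latency_ratio: float = 1.2,
    max_quality_drop: float = 0.05,
) -> list[str]:
    """Return the triggers the canary violates; an empty list means keep rolling out."""
    reasons = []
    if canary.error_rate - baseline.error_rate > max_error_increase:
        reasons.append("error rate")
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("latency")
    if baseline.quality_score - canary.quality_score > max_quality_drop:
        reasons.append("quality")
    return reasons


if __name__ == "__main__":
    baseline = WindowMetrics(error_rate=0.010, p95_latency_ms=800, quality_score=0.90)
    degraded = WindowMetrics(error_rate=0.050, p95_latency_ms=1200, quality_score=0.80)
    print(rollback_reasons(baseline, degraded))
```

Returning the list of violated triggers, rather than a bare boolean, gives the rollback automation something to record in the incident log.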
Production monitoring continues after full deployment, detecting issues that may emerge over time due to distribution shift, upstream changes, or adversarial activity. Implement alerting thresholds with appropriate sensitivity—too sensitive creates alert fatigue, too lax misses real issues.
The AI evaluation ecosystem includes specialized frameworks for different evaluation needs. Select tools based on your specific requirements—RAG evaluation, prompt testing, LLM-as-judge, or general ML metrics. Most organizations benefit from combining multiple tools into a comprehensive evaluation pipeline.
Evaluation Frameworks
| Tool | Description | Best For | Documentation |
|---|---|---|---|
| RAGAS | RAG evaluation framework | Retrieval quality, answer correctness | RAGAS |
| DeepEval | LLM evaluation framework | Comprehensive LLM testing | DeepEval |
| Promptfoo | Prompt testing and evaluation | Prompt engineering, red teaming | Promptfoo |
| LangSmith | LangChain observability and evaluation | Chain debugging, production monitoring | LangSmith |
| Hugging Face Evaluate | ML evaluation library | Standard NLP metrics | Evaluate |
| OpenAI Evals | OpenAI’s evaluation framework | Custom evaluation development | OpenAI Evals |
| Braintrust | LLM evaluation platform | Continuous evaluation, experimentation | Braintrust |
RAGAS specializes in evaluating retrieval-augmented generation systems, providing metrics for context relevance, faithfulness, and answer correctness. Use RAGAS when RAG quality is a primary concern.
Promptfoo excels at prompt testing and red teaming, enabling systematic evaluation of prompt variations and adversarial robustness. Its configuration-based approach makes it easy to define test suites and run evaluations at scale.
LangSmith provides end-to-end observability for LangChain applications, including tracing, evaluation, and production monitoring. Use LangSmith for comprehensive visibility into complex chain behaviors.
Testing Frameworks
| Tool | Description | Best For | Documentation |
|---|---|---|---|
| pytest | Python testing framework | Unit tests, integration tests | pytest |
| Hypothesis | Property-based testing | Edge case discovery, fuzzing | Hypothesis |
| LangChain Testing | Chain testing utilities | LangChain-specific testing | LangChain |
| Testcontainers | Container-based testing | Integration tests with dependencies | Testcontainers |
| Great Expectations | Data validation | Input/output data quality | Great Expectations |
Hypothesis enables property-based testing where you define properties that outputs should satisfy rather than specific expected outputs. This approach is particularly valuable for AI testing where multiple valid outputs may exist.
Common Pitfalls and Anti-Patterns
Evaluation failures often stem from methodological issues rather than tooling problems. Understanding common pitfalls helps teams avoid wasted effort and misleading results.
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Overfitting to benchmarks | High benchmark scores don’t translate to production | Use diverse evaluation sets, monitor production metrics |
| Data leakage | Evaluation data appears in training | Strict dataset separation, temporal splits |
| Single metric focus | Missing important quality dimensions | Multi-dimensional evaluation rubrics |
| Ignoring edge cases | Failures on rare but critical scenarios | Adversarial testing, red teaming |
| Static evaluation | Datasets don’t reflect evolving threats | Regular dataset updates, drift detection |
| Automation without validation | LLM judges introduce systematic biases | Periodic human validation of automated metrics |
Overfitting to benchmarks occurs when teams optimize specifically for benchmark performance without ensuring real-world quality. Include held-out evaluation sets that are never used for optimization, and validate benchmark improvements against production metrics.
Data leakage happens when evaluation data inadvertently appears in training or prompt examples. For security AI, this can occur when incident data used for evaluation was previously used to train detection models. Implement strict data governance and temporal splits.
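A temporal split can be sketched as partitioning records around a cutoff date and then checking that no incident identifier appears on both sides. The record fields (`id`, `observed`) are assumptions made for this example.

```python
from datetime import date


def temporal_split(records: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Records observed before the cutoff go to training, the rest to evaluation."""
    train = [r for r in records if r["observed"] < cutoff]
    evaluation = [r for r in records if r["observed"] >= cutoff]
    return train, evaluation


def leaked_ids(train: list[dict], evaluation: list[dict]) -> set[str]:
    """Identifiers present in both splits, which would indicate leakage."""
    return {r["id"] for r in train} & {r["id"] for r in evaluation}


if __name__ == "__main__":
    records = [
        {"id": "inc-101", "observed": date(2023, 11, 2)},
        {"id": "inc-102", "observed": date(2024, 1, 15)},
        {"id": "inc-103", "observed": date(2024, 3, 9)},
    ]
    train, evaluation = temporal_split(records, cutoff=date(2024, 1, 1))
    print(len(train), len(evaluation), leaked_ids(train, evaluation))
```

The identifier check matters when the same incident is recorded more than once with different timestamps (for instance, a reopened case), since a date cutoff alone would then place copies on both sides.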
Single metric focus leads to optimizing one dimension at the expense of others. A model that achieves high accuracy but produces unsafe outputs is worse than one with moderate accuracy and strong safety. Define multi-dimensional evaluation criteria with minimum thresholds for each dimension.
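Per-dimension minimums can be enforced with a check that fails whenever any single dimension is below its floor, regardless of how strong the others are. The dimension names and floor values below are hypothetical placeholders; actual thresholds depend on the use case and risk tolerance.

```python
# Hypothetical floors; every dimension must clear its own minimum.
MINIMUMS = {"accuracy": 0.85, "safety": 0.99, "robustness": 0.90}


def failing_dimensions(
    scores: dict[str, float],
    minimums: dict[str, float] = MINIMUMS,
) -> list[str]:
    """Dimensions below their floor; a missing score counts as a failure."""
    return sorted(
        dimension
        for dimension, floor in minimums.items()
        if scores.get(dimension, 0.0) < floor
    )


if __name__ == "__main__":
    # Strong accuracy does not compensate for a safety score under its floor.
    candidate = {"accuracy": 0.97, "safety": 0.95, "robustness": 0.93}
    print(failing_dimensions(candidate))
```

Treating a missing score as a failure is deliberate in this sketch: a dimension that was never measured should not pass a gate by omission.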
Implementation Checklist
Use this checklist when establishing AI evaluation and testing practices for security systems:
Evaluation Setup
Testing Infrastructure
Quality Assurance
Governance