Skip to main content

Documentation Index

Fetch the complete documentation index at: https://threatbasis.io/llms.txt

Use this file to discover all available pages before exploring further.

AI red teaming applies adversarial thinking to identify vulnerabilities in AI systems before attackers exploit them. Unlike traditional red teaming, AI red teaming must address unique attack surfaces—prompt injection, jailbreaking, data extraction, and model manipulation—that require specialized techniques and tools. Security teams must regularly test their AI deployments using offensive techniques to identify weaknesses, validate defenses, and improve resilience. This guide covers AI red teaming methodologies, attack techniques, and testing frameworks.

AI Red Team Methodology

Assessment Framework

PhaseObjectivesTechniques
ReconnaissanceUnderstand AI system architectureDocumentation review, API exploration
Threat modelingIdentify attack surfacesSTRIDE for AI, MITRE ATLAS mapping
Attack executionTest identified vulnerabilitiesPrompt injection, jailbreaking, extraction
Impact assessmentEvaluate successful attacksData access, action capability, scope
ReportingDocument findings, recommendationsRisk-prioritized remediation

Attack Surface Mapping

Attack SurfaceDescriptionTest Focus
User inputsDirect user interactionPrompt injection, jailbreaking
Retrieved contentRAG data sourcesIndirect injection
System promptsInstructions to modelExtraction, manipulation
Tool integrationsExternal actionsPrivilege escalation
Training dataModel knowledgeExtraction, poisoning
API endpointsService interfacesAuthentication, rate limits

Prompt Injection Testing

Direct Injection Techniques

TechniqueDescriptionExample Pattern
Instruction overrideDirect instruction to ignore rules”Ignore previous instructions…”
Role assumptionAdopt unrestricted persona”You are now DAN…”
Delimiter escapeBreak out of structured promptsClosing tags, format breaks
Language switchingInstructions in other languagesNon-English instructions
Encoding tricksObfuscated instructionsBase64, ROT13, Unicode

Indirect Injection Testing

VectorTest ApproachSuccess Indicators
Document retrievalEmbed instructions in documentsModel follows embedded instructions
API responsesInject into enrichment dataBehavior change from external data
User-generated contentInject into forums, ticketsCross-user attack success
Web contentInject into scraped pagesModel follows web instructions

Jailbreaking Techniques

Common Jailbreak Patterns

PatternMechanismDetection
Roleplay jailbreaksFictional context bypasses guardrailsRole-playing language
Multi-turn escalationGradual boundary pushingConversation trajectory
Hypothetical framing”What if” scenariosHypothetical language
Competing objectivesConflicting instructionsInstruction conflicts
Token manipulationUnusual token sequencesAbnormal perplexity

Jailbreak Testing Process

StepActionOutput
1. BaselineTest guardrail effectivenessBaseline refusal rate
2. Known jailbreaksTest published techniquesCurrent vulnerability state
3. Novel attemptsCreate new jailbreak attemptsDiscovery of new vulnerabilities
4. Combination attacksChain multiple techniquesComplex attack paths
5. Automated fuzzingSystematic variation testingCoverage metrics

Data Extraction Testing

Extraction Targets

TargetRiskTest Approach
System promptsReveals instructions, constraintsDirect asking, partial extraction
Training dataMay contain sensitive informationMembership inference, extraction
RAG contentAccess to knowledge baseProbing for document content
User dataCross-user leakageSession isolation tests
API keys/credentialsPrivileged accessConfiguration extraction

Extraction Techniques

TechniqueDescriptionMitigation Test
Direct extractionAsk for system promptPrompt protection validation
Indirect extractionInfer from behaviorBehavioral consistency
Completion attacksPartial prompt completionPrompt structure security
Memorization probingTraining data extractionDifferential privacy effectiveness

Tool & Action Abuse

Privilege Escalation Testing

Test CategoryObjectiveApproach
Tool boundary testingExceed permitted tool useRequest unauthorized actions
Parameter manipulationModify tool parametersInject malicious parameters
Action chainingCombine tools maliciouslyMulti-step attack sequences
Approval bypassCircumvent human oversightTest approval gate robustness

Red Team Program Development

Program Structure

ComponentDescriptionCadence
Continuous testingAutomated vulnerability scanningContinuous
Periodic assessmentsComprehensive red team exercisesQuarterly
Pre-deployment testingNew AI feature validationPer release
Incident-triggeredTest after security eventsAs needed

Metrics and Reporting

MetricDescriptionTarget
Attack success rate% of attack attempts successfulTrack trends
Time to detectionHow quickly attacks detectedMinimize
Coverage% of attack surface testedMaximize
Remediation timeTime to fix identified issues< 30 days critical

Anti-Patterns to Avoid

  • Testing only known attacks — Novel attacks emerge constantly. Include creative, exploratory testing.
  • Ignoring indirect injection — Direct injection is obvious. Test data-driven injection paths.
  • One-time testing — AI systems evolve. Establish continuous testing programs.
  • Insufficient documentation — Detailed findings enable effective remediation. Document thoroughly.

References