Documentation Index
Fetch the complete documentation index at: https://threatbasis.io/llms.txt
Use this file to discover all available pages before exploring further.
AI red teaming applies adversarial thinking to identify vulnerabilities in AI systems before attackers exploit them. Unlike traditional red teaming, AI red teaming must address unique attack surfaces—prompt injection, jailbreaking, data extraction, and model manipulation—that require specialized techniques and tools.
Security teams must regularly test their AI deployments using offensive techniques to identify weaknesses, validate defenses, and improve resilience. This guide covers AI red teaming methodologies, attack techniques, and testing frameworks.
AI Red Team Methodology
Assessment Framework
| Phase | Objectives | Techniques |
|---|
| Reconnaissance | Understand AI system architecture | Documentation review, API exploration |
| Threat modeling | Identify attack surfaces | STRIDE for AI, MITRE ATLAS mapping |
| Attack execution | Test identified vulnerabilities | Prompt injection, jailbreaking, extraction |
| Impact assessment | Evaluate successful attacks | Data access, action capability, scope |
| Reporting | Document findings, recommendations | Risk-prioritized remediation |
Attack Surface Mapping
| Attack Surface | Description | Test Focus |
|---|
| User inputs | Direct user interaction | Prompt injection, jailbreaking |
| Retrieved content | RAG data sources | Indirect injection |
| System prompts | Instructions to model | Extraction, manipulation |
| Tool integrations | External actions | Privilege escalation |
| Training data | Model knowledge | Extraction, poisoning |
| API endpoints | Service interfaces | Authentication, rate limits |
Prompt Injection Testing
Direct Injection Techniques
| Technique | Description | Example Pattern |
|---|
| Instruction override | Direct instruction to ignore rules | ”Ignore previous instructions…” |
| Role assumption | Adopt unrestricted persona | ”You are now DAN…” |
| Delimiter escape | Break out of structured prompts | Closing tags, format breaks |
| Language switching | Instructions in other languages | Non-English instructions |
| Encoding tricks | Obfuscated instructions | Base64, ROT13, Unicode |
Indirect Injection Testing
| Vector | Test Approach | Success Indicators |
|---|
| Document retrieval | Embed instructions in documents | Model follows embedded instructions |
| API responses | Inject into enrichment data | Behavior change from external data |
| User-generated content | Inject into forums, tickets | Cross-user attack success |
| Web content | Inject into scraped pages | Model follows web instructions |
Jailbreaking Techniques
Common Jailbreak Patterns
| Pattern | Mechanism | Detection |
|---|
| Roleplay jailbreaks | Fictional context bypasses guardrails | Role-playing language |
| Multi-turn escalation | Gradual boundary pushing | Conversation trajectory |
| Hypothetical framing | ”What if” scenarios | Hypothetical language |
| Competing objectives | Conflicting instructions | Instruction conflicts |
| Token manipulation | Unusual token sequences | Abnormal perplexity |
Jailbreak Testing Process
| Step | Action | Output |
|---|
| 1. Baseline | Test guardrail effectiveness | Baseline refusal rate |
| 2. Known jailbreaks | Test published techniques | Current vulnerability state |
| 3. Novel attempts | Create new jailbreak attempts | Discovery of new vulnerabilities |
| 4. Combination attacks | Chain multiple techniques | Complex attack paths |
| 5. Automated fuzzing | Systematic variation testing | Coverage metrics |
| Target | Risk | Test Approach |
|---|
| System prompts | Reveals instructions, constraints | Direct asking, partial extraction |
| Training data | May contain sensitive information | Membership inference, extraction |
| RAG content | Access to knowledge base | Probing for document content |
| User data | Cross-user leakage | Session isolation tests |
| API keys/credentials | Privileged access | Configuration extraction |
| Technique | Description | Mitigation Test |
|---|
| Direct extraction | Ask for system prompt | Prompt protection validation |
| Indirect extraction | Infer from behavior | Behavioral consistency |
| Completion attacks | Partial prompt completion | Prompt structure security |
| Memorization probing | Training data extraction | Differential privacy effectiveness |
Privilege Escalation Testing
| Test Category | Objective | Approach |
|---|
| Tool boundary testing | Exceed permitted tool use | Request unauthorized actions |
| Parameter manipulation | Modify tool parameters | Inject malicious parameters |
| Action chaining | Combine tools maliciously | Multi-step attack sequences |
| Approval bypass | Circumvent human oversight | Test approval gate robustness |
Red Team Program Development
Program Structure
| Component | Description | Cadence |
|---|
| Continuous testing | Automated vulnerability scanning | Continuous |
| Periodic assessments | Comprehensive red team exercises | Quarterly |
| Pre-deployment testing | New AI feature validation | Per release |
| Incident-triggered | Test after security events | As needed |
Metrics and Reporting
| Metric | Description | Target |
|---|
| Attack success rate | % of attack attempts successful | Track trends |
| Time to detection | How quickly attacks detected | Minimize |
| Coverage | % of attack surface tested | Maximize |
| Remediation time | Time to fix identified issues | < 30 days critical |
Anti-Patterns to Avoid
-
Testing only known attacks — Novel attacks emerge constantly. Include creative, exploratory testing.
-
Ignoring indirect injection — Direct injection is obvious. Test data-driven injection paths.
-
One-time testing — AI systems evolve. Establish continuous testing programs.
-
Insufficient documentation — Detailed findings enable effective remediation. Document thoroughly.
References