Embedding models and vector search form the foundation of modern AI-powered security systems. Embeddings convert text into dense numerical vectors that capture semantic meaning, enabling similarity search across security knowledge bases, threat intelligence repositories, and incident databases. This semantic understanding powers the retrieval component of RAG systems and enables intelligent matching that traditional keyword search cannot achieve.
Security engineers leverage embeddings to find similar incidents from historical data, match indicators of compromise (IOCs) to threat intelligence, retrieve relevant runbooks during incident response, and build intelligent search over security documentation. Unlike keyword search, which requires exact term matches, vector search understands that “lateral movement” relates to “network pivoting” and “credential hopping” even when those exact terms don’t appear together.
The effectiveness of AI security tools depends heavily on retrieval quality. When a security analyst asks about a specific attack technique, the system must retrieve the most relevant threat intelligence, prior incidents, and response procedures. Poor embeddings or an inadequate vector search configuration lead to irrelevant retrievals, which cascade into poor AI responses and reduced analyst trust. Understanding embedding fundamentals enables security teams to build retrieval systems that surface the right information at the right time.
How Embeddings Work
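The three distance metrics discussed in this section can be made concrete with a few lines of standard-library Python. This is a minimal sketch: the three-dimensional vectors are made-up values for illustration, not output from any real embedding model.

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "embeddings" (illustrative values only).
ransomware = [0.9, 0.1, 0.0]
cryptolocker = [1.8, 0.2, 0.0]  # same direction as ransomware, twice the length
phishing = [0.0, 0.0, 1.0]      # orthogonal to the other two

same_direction = cosine_similarity(ransomware, cryptolocker)
unrelated = cosine_similarity(ransomware, phishing)
gap = euclidean_distance(ransomware, cryptolocker)
```

Here `same_direction` is 1.0 up to floating-point rounding even though `gap` is clearly non-zero, which shows how cosine similarity disregards vector length while Euclidean distance does not.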
Embeddings map text into high-dimensional vector spaces where semantically similar content clusters together. When you embed the phrase “ransomware encryption detected,” the resulting vector sits near vectors for related concepts like “file encryption malware,” “crypto-locker activity,” and “data held hostage.” This geometric relationship enables mathematical similarity comparison: finding related content becomes a matter of measuring distances between vectors.
Modern embedding models use transformer architectures trained on massive text corpora to learn these semantic relationships. The training process teaches models to place similar concepts nearby in vector space while pushing unrelated concepts apart. The result is a numerical representation that captures meaning rather than just surface-level word matching.
Vector dimensions determine the expressiveness of embeddings. Smaller dimensions (384-512) produce compact, efficient representations suitable for large-scale applications where storage and computation costs matter. Larger dimensions (1536-3072) capture more nuanced semantic relationships but require more storage and increase search latency. For most security applications, models in the 768-1536 dimension range provide an effective balance.
Distance metrics measure how similar two vectors are. Cosine similarity, the most common metric, measures the angle between vectors: identical directions score 1.0, orthogonal vectors score 0.0, and opposite directions score -1.0. Euclidean distance measures the straight-line distance between points, while dot product combines magnitude and direction. Most vector databases default to cosine similarity, which ignores vector magnitude (and so is insensitive to differences in document length) and is equivalent to dot product when embeddings are normalized to unit length.
Embedding Model Selection
Choosing the right embedding model significantly impacts retrieval quality. Different models excel at different tasks, and the MTEB Leaderboard provides benchmark comparisons across retrieval, classification, and clustering tasks.
Commercial Embedding Models
OpenAI’s text-embedding-3 family offers the easiest path to production embeddings. The text-embedding-3-large model (3072 dimensions) provides state-of-the-art quality for most use cases, while text-embedding-3-small (1536 dimensions) reduces costs by approximately 5x with modest quality reduction. OpenAI embeddings work well for general security content and require no infrastructure management. See the OpenAI Embeddings Guide for implementation details.
Cohere’s Embed models provide strong multilingual support, critical for organizations processing threat intelligence in multiple languages. The Cohere Embed v3 models include variants optimized for different use cases—English-only, multilingual, and compressed representations for cost-sensitive deployments.
Voyage AI specializes in domain-specific embeddings, including models fine-tuned for code and technical documentation. For security teams building search over codebases or technical runbooks, Voyage’s specialized models may outperform general-purpose alternatives.
Open Source Embedding Models
Open source models enable on-premises deployment where security data cannot leave organizational boundaries. Sentence Transformers provides the most mature ecosystem for deploying open source embeddings.
BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence consistently rank among the best open source options. The bge-large-en-v1.5 model provides near-commercial quality for English content, while BGE-M3 handles multilingual requirements.
E5 models from Microsoft Research offer instruction-tuned variants that handle query-document asymmetry well. When queries are short (“What is credential dumping?”) and documents are long (full technique descriptions), instruction-tuned models like e5-large-v2 outperform symmetric models.
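In practice, the query-document asymmetry is signalled to E5 models through text prefixes: the model cards for the E5 family describe prepending `query: ` to search queries and `passage: ` to documents before encoding. A minimal sketch of that preprocessing step follows; the helper name `format_for_e5` and the sample passage are hypothetical, and the prefixed strings would then be passed to whatever encoder you use.

```python
def format_for_e5(texts, kind):
    """Prefix inputs with 'query: ' or 'passage: ' before encoding."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text.strip()}" for text in texts]

# Short analyst question versus a longer document chunk (sample text is illustrative).
queries = format_for_e5(["What is credential dumping?"], "query")
passages = format_for_e5(
    ["  Credential dumping extracts account secrets from operating system memory or files.  "],
    "passage",
)
```

Forgetting the prefix on one side is an easy mistake to make, so it may be worth keeping this step in one place rather than scattering string concatenation across the indexing and query paths.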
all-MiniLM-L6-v2 provides a lightweight option (384 dimensions) suitable for resource-constrained environments or applications where embedding millions of documents makes larger models cost-prohibitive.
For security teams requiring air-gapped deployment or processing classified data, open source models running on local infrastructure represent the only viable option. The quality gap between commercial and open source models has narrowed significantly, making self-hosted deployments increasingly practical.
Vector Search Architecture
Vector search enables fast similarity retrieval across millions or billions of vectors. The architecture involves indexing strategies that trade off between search accuracy, speed, and memory consumption.
Indexing Strategies
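To make the accuracy-versus-speed trade-off in this section tangible, the sketch below contrasts an exact flat search with a toy IVF-style search in plain Python. It is an illustration under simplifying assumptions: the two-dimensional vectors and the two centroids are hand-picked, whereas a real IVF index learns its centroids (typically with k-means) and runs on optimized native code. HNSW is omitted for brevity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def flat_search(query, vectors, k):
    """Exact search: score every stored vector and return the top-k ids."""
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

def build_ivf(vectors, centroids):
    """Assign each vector to its most similar centroid (one list per cluster)."""
    lists = {i: {} for i in range(len(centroids))}
    for doc_id, vec in vectors.items():
        best = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        lists[best][doc_id] = vec
    return lists

def ivf_search(query, lists, centroids, k, nprobe=1):
    """Approximate search: scan only the nprobe clusters closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: cosine(query, centroids[i]), reverse=True)
    candidates = {}
    for i in order[:nprobe]:
        candidates.update(lists[i])
    return flat_search(query, candidates, k)

# Toy data (illustrative values only).
vectors = {
    "inc-1": [1.0, 0.1],
    "inc-2": [0.9, 0.3],
    "inc-3": [0.6, 0.8],  # sits close to the cluster boundary
    "inc-4": [0.1, 1.0],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]
query = [0.8, 0.6]

lists = build_ivf(vectors, centroids)
exact = flat_search(query, vectors, k=2)
approx_1 = ivf_search(query, lists, centroids, k=2, nprobe=1)
approx_2 = ivf_search(query, lists, centroids, k=2, nprobe=2)
```

With `nprobe=1` the search scans only one cluster and misses `inc-3`, the true nearest neighbor, because that vector was assigned to the other cluster. Raising `nprobe` to 2 scans everything and matches the flat result, at the cost of doing the full computation.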
Flat indexes perform exact brute-force search by comparing the query vector against every vector in the database. While slow for large datasets, flat indexes provide 100% recall: every true nearest neighbor is returned. Use flat indexes for small datasets (under 100,000 vectors), ground truth evaluation, and quality benchmarking.
HNSW (Hierarchical Navigable Small World) indexes build a graph structure that enables fast approximate search. Vectors connect to their nearest neighbors at multiple hierarchical layers, allowing queries to quickly navigate to relevant regions. HNSW typically achieves 95-99% recall while reducing search time by orders of magnitude. Most production deployments use HNSW for its balance of speed and accuracy.
IVF (Inverted File) indexes partition vectors into clusters and search only the most promising clusters at query time. By limiting search to a subset of vectors, IVF dramatically reduces computation at the cost of potentially missing relevant results in other clusters. IVF works well for medium-scale deployments and combines effectively with other techniques.
Product Quantization (PQ) compresses vectors into compact codes, reducing memory requirements by 10-100x. The compression introduces retrieval errors, but for extremely large datasets where memory constraints dominate, PQ enables deployments that would otherwise be infeasible.
Production systems often combine these techniques: IVF-PQ for large-scale deployments with memory constraints, or HNSW with scalar quantization for high-performance requirements.
Vector Database Selection
Vector databases provide the infrastructure for storing, indexing, and querying embeddings at scale. Selection depends on deployment requirements, scale, and operational capabilities.
Managed services like Pinecone eliminate operational overhead and provide automatic scaling. Pinecone’s serverless offering handles capacity management automatically, making it attractive for teams without dedicated infrastructure expertise. The trade-off is vendor lock-in and data leaving organizational boundaries.
Self-hosted options provide full control over data and infrastructure. Weaviate offers a complete vector search platform with hybrid search, GraphQL APIs, and multi-tenancy. Qdrant provides excellent performance and rich filtering capabilities. Milvus handles massive scale with GPU acceleration support.
Embedded databases like Chroma run within your application process, simplifying deployment for smaller use cases. ChromaDB provides a Python-native experience ideal for prototyping and applications with modest scale requirements.
PostgreSQL with pgvector adds vector capabilities to existing PostgreSQL deployments. For organizations already operating PostgreSQL, pgvector enables vector search without new infrastructure. Performance limits emerge at scale, but for datasets under a few million vectors, pgvector provides a familiar operational model.
Embedding Security Content
Security data presents unique challenges for embedding and retrieval. Raw IOCs, structured logs, and technical documentation require different strategies than general text.
The IOC Problem
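One way to apply the approach this section recommends, embedding the surrounding context while routing raw identifiers to exact-match search, is sketched here. The regular expressions are deliberately simple illustrations rather than validated IOC parsers, the function name `split_for_indexing` and the record layout are hypothetical, and the ATT&CK technique ID appended to the sample sentence is an addition for demonstration.

```python
import re

# Illustrative patterns only; production IOC parsing needs stricter validation.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "attack_technique": re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),
}

def split_for_indexing(text):
    """Return the text to embed plus exact-match identifiers for a keyword index."""
    identifiers = {}
    for name, pattern in IOC_PATTERNS.items():
        found = sorted(set(pattern.findall(text)))
        if found:
            identifiers[name] = found
    # The full sentence is embedded; identifiers are stored separately for keyword lookup.
    return {"embed_text": text, "keywords": identifiers}

record = split_for_indexing(
    "Internal workstation 192.168.1.100 in the finance department exhibited "
    "beaconing behavior to external C2 infrastructure (T1071.001)."
)
```

The `embed_text` field would go to the embedding model, while the `keywords` field would populate whatever exact-match index sits alongside the vector store.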
IP addresses, file hashes, domain names, and other indicators of compromise don’t embed meaningfully. The string “192.168.1.100” has no semantic content that embedding models can capture; it’s an arbitrary sequence of numbers. Similarly, SHA-256 hashes are random-looking strings with no inherent meaning.
The solution is to embed the context around IOCs rather than the IOCs themselves. Instead of embedding “192.168.1.100,” embed “Internal workstation 192.168.1.100 in the finance department exhibited beaconing behavior to external C2 infrastructure.” The embedding captures the meaningful context while the raw IOC supports exact-match keyword search.
This principle extends to CVE identifiers, MITRE ATT&CK technique IDs, and other security-specific identifiers. Embed the description and context; use keyword search for the identifiers themselves.
Chunking Strategies
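The heading-first, size-second chunking that this section describes might look like the following sketch. It assumes the source documents use Markdown-style `#` headings and that every section starts with one, counts whitespace-separated words as a rough stand-in for tokens, and leaves a single oversized paragraph as its own chunk. The function name and the sample report are hypothetical.

```python
import re

def chunk_by_headings(text, max_words=200):
    """Split on Markdown-style headings first; split oversized sections by paragraph."""
    # Zero-width split at the start of each heading line (requires Python 3.7+).
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section.split()) <= max_words:
            chunks.append(section)
            continue
        # Fallback: pack paragraphs up to the limit, repeating the header for context.
        header, _, body = section.partition("\n")
        current = []
        for para in body.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if current and len(" ".join(current + [para]).split()) > max_words:
                chunks.append(header + "\n" + "\n\n".join(current))
                current = []
            current.append(para)
        if current:
            chunks.append(header + "\n" + "\n\n".join(current))
    return chunks

report = (
    "## Detection\nEDR alerted on beaconing.\n\n"
    "## Containment\nHost isolated from the network.\n\nCredentials rotated."
)
# A tiny limit is used here only to force the paragraph-level fallback.
chunks = chunk_by_headings(report, max_words=6)
```

With the artificially small limit, the Detection phase stays whole while the Containment phase splits into two chunks, each of which still carries the `## Containment` header.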
How you split documents into chunks fundamentally impacts retrieval quality. Naive approaches that split on arbitrary character counts destroy semantic coherence and retrieve irrelevant fragments.
For incident reports, chunk along phase boundaries: detection, analysis, containment, eradication, recovery, lessons learned. Each phase represents a coherent unit that analysts might query independently. Preserve section headers in chunks to maintain context.
For threat intelligence, embed each indicator with its surrounding context as a unit. A threat report describing an APT campaign might yield chunks for each technique described, each infrastructure component, and the overall campaign summary. Keep related information together rather than splitting mid-description.
For runbooks and procedures, chunk at the step or section level. Analysts searching for “how to isolate a compromised EC2 instance” should retrieve the relevant procedure section, not a fragment split across multiple steps.
For security documentation, respect heading structure. Content under a single heading typically addresses a coherent topic and should remain together. Use recursive chunking that splits on headers before falling back to size-based splits.
Chunk Size Considerations
Smaller chunks (200-500 tokens) enable precise retrieval but lose context. Larger chunks (1000-2000 tokens) maintain context but may include irrelevant information and increase costs when passed to language models.
Security content often benefits from larger chunks than general text because security concepts require context to understand correctly. A 300-token chunk describing “credential harvesting” might omit critical context about the specific technique variant, detection opportunities, or related techniques. A 1000-token chunk captures the full picture.
Experiment with chunk sizes on your specific data. Evaluate retrieval quality using queries your analysts actually ask, not generic benchmarks.
Hybrid Search for Security
Pure semantic search misses exact matches critical for security operations. When an analyst searches for “CVE-2024-3400,” they need exact matches, not semantically similar CVEs. Hybrid search combines semantic and keyword approaches to handle both conceptual queries and exact-match requirements.
Why Hybrid Matters for Security
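The three query categories described in this section can be approximated with simple heuristics, which is sometimes enough to decide how heavily to weight keyword matching for a given query. The sketch below is one hypothetical approach: the patterns, the `KNOWN_TERMS` watchlist, and the function name are assumptions for illustration, and a real system would likely need a richer term list or a learned classifier.

```python
import re

# Identifier shapes that call for exact matching (illustrative, not exhaustive).
EXACT_PATTERNS = [
    re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b"),  # IPv4 address or CIDR
    re.compile(r"\b[a-fA-F0-9]{32,64}\b"),                    # hash-like hex strings
]
# Hypothetical watchlist of tool and vulnerability names that should match exactly.
KNOWN_TERMS = {"mimikatz", "log4shell"}

def classify_query(query):
    """Label a query 'exact', 'conceptual', or 'mixed' using simple heuristics."""
    remainder = query
    hits = 0
    for pattern in EXACT_PATTERNS:
        remainder, count = pattern.subn(" ", remainder)
        hits += count
    words = re.findall(r"[A-Za-z0-9']+", remainder)
    hits += sum(1 for word in words if word.lower() in KNOWN_TERMS)
    other = [word for word in words if word.lower() not in KNOWN_TERMS]
    if hits == 0:
        return "conceptual"
    return "exact" if not other else "mixed"

labels = {
    q: classify_query(q)
    for q in [
        "techniques for evading EDR",
        "CVE-2024-3400",
        "192.168.10.0/24",
        "remediation for Log4Shell in containerized environments",
    ]
}
```

A router built on labels like these could send exact queries to keyword search only, conceptual queries to semantic search only, and mixed queries to both.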
Security queries fall into distinct categories that require different search approaches:
- Conceptual queries like “techniques for evading EDR” or “cloud lateral movement methods” benefit from semantic understanding. The analyst wants relevant content even if exact terms don’t appear.
- Exact match queries like “CVE-2024-3400,” “mimikatz,” or “192.168.10.0/24” require keyword precision. Semantic similarity to other CVEs or tools isn’t helpful.
- Mixed queries combine both: “remediation for Log4Shell in containerized environments” needs semantic understanding of “remediation” and “containerized” while exactly matching “Log4Shell.”
Production security search systems must handle all three query types effectively. Hybrid search architectures run both semantic and keyword searches, then combine results using reciprocal rank fusion or learned scoring models.
Implementing Hybrid Search
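Reciprocal rank fusion, the merging method this section describes, fits in a few lines. The sketch below uses the commonly cited form in which each list contributes 1/(k + rank) with a smoothing constant k (60 is a frequent default); setting k = 0 reproduces the plain 1/rank arithmetic used in this section's worked example. The per-list `weights` parameter is an illustrative extension for biasing toward keyword results, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse ranked lists of document ids; score = sum of weight / (k + rank)."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

semantic = ["doc-a", "doc-b", "doc-c"]  # ranked by vector similarity
keyword = ["doc-b", "doc-d", "doc-a"]   # ranked by keyword score

# k=0: doc-a is #1 semantically and #3 by keyword, so it scores 1/1 + 1/3.
fused_simple = reciprocal_rank_fusion([semantic, keyword], k=0)

# k=60 with keyword results weighted twice as heavily as semantic results.
fused_weighted = reciprocal_rank_fusion([semantic, keyword], k=60, weights=[1.0, 2.0])
```

Because only ranks are used, the raw scores from the two search methods never need to be normalized against each other, which is much of the method's appeal.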
Most vector databases now support hybrid search natively. Pinecone, Weaviate, and Qdrant provide hybrid query APIs that handle the complexity of combining search modalities.
The standard approach uses reciprocal rank fusion (RRF) to merge results. Each search method produces a ranked list, and RRF scores every document by summing 1/(k + rank) across the lists, where k is a smoothing constant (60 is a common choice). With k = 0 for simplicity, a document ranked #1 by semantic search and #3 by keyword search receives a combined score of 1/1 + 1/3 ≈ 1.33; with k = 60 the same document scores 1/61 + 1/63 ≈ 0.032. Because RRF uses only ranks, it requires no score normalization and no training, and it works remarkably well in practice.
Tune the balance between semantic and keyword components based on your query distribution. Security applications typically weight keyword matching higher than general search applications because exact identifiers matter more.
Quality and Evaluation
Retrieval quality directly determines the effectiveness of downstream AI systems. Poor retrieval means the AI lacks relevant context, leading to incorrect or unhelpful responses. Systematic evaluation identifies problems before they impact analyst workflows.
Evaluation Metrics
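The three metrics this section defines are straightforward to compute once you have, for each query, the ranked list the system returned and the set of documents judged relevant. The following is a minimal sketch; the function names and the sample document ids are illustrative, and precision here divides by the number of results actually returned in the top k.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k returned results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

retrieved = ["d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d3"}
recall = recall_at_k(retrieved, relevant, k=3)
precision = precision_at_k(retrieved, relevant, k=3)

# First relevant hit at rank 1 for one query and rank 4 for the other two.
runs = [
    (["a", "b", "c", "d"], {"a"}),
    (["a", "b", "c", "d"], {"d"}),
    (["a", "b", "c", "d"], {"d"}),
]
mrr = mean_reciprocal_rank(runs)
```

The `runs` example yields an MRR of 0.5 even though no query has its first relevant result at position 2, which is worth keeping in mind when interpreting an aggregate MRR figure.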
Recall@k measures what fraction of relevant documents appear in the top k results. For security applications where missing a critical piece of threat intelligence could mean missing an attack, high recall matters more than perfect precision. Target recall@10 above 90% for production systems.
Mean Reciprocal Rank (MRR) averages, across queries, the reciprocal of the rank at which the first relevant result appears. An MRR of 0.5 corresponds to the first relevant result appearing at position 2 for a typical query. Higher MRR reduces analyst effort because the most relevant content appears first.
Precision@k measures what fraction of returned results are actually relevant. Low precision means analysts waste time filtering irrelevant results.
Building Evaluation Datasets
Create evaluation datasets from real analyst queries and relevant documents. Ask analysts to identify queries they commonly run and the documents they expect to find. This ground truth enables systematic quality measurement.
Include diverse query types: conceptual questions, exact identifiers, mixed queries, and edge cases. Test queries in languages your organization processes. Include queries that should return nothing, because measuring false positive rates matters for security applications.
Re-evaluate after any significant changes: new embedding models, chunking strategy changes, index configuration updates, or significant corpus additions.
Anti-Patterns to Avoid
Embedding raw IOCs without context wastes embedding dimensions on semantically meaningless content. IP addresses, hashes, and identifiers should be handled through keyword search while their surrounding context gets embedded.
Ignoring chunk boundaries destroys retrieval quality. Content split mid-sentence or mid-concept yields fragments that match queries poorly and confuse downstream language models. Invest time in chunking strategies appropriate for your content types.
Using a single embedding model for everything may not optimize for your specific retrieval needs. Security documentation, code, and threat intelligence have different characteristics. Evaluate multiple models on representative queries from each content type.
Skipping hybrid search fails analysts who need exact-match capabilities. Security data contains identifiers, hashes, and technical terms that require precise matching alongside semantic understanding.
Ignoring evaluation means problems accumulate undetected. Without systematic quality measurement, retrieval degrades as the corpus grows and query patterns shift.
Integration with Security AI Systems
Embeddings and vector search integrate with broader AI architectures covered elsewhere in this knowledge base:
- Advanced RAG patterns for sophisticated retrieval strategies
- Context window management for handling retrieved content
- AI memory and state management for persisting retrieval context
- SIEM-LLM integration for searching security event data
- Threat intelligence AI for intelligence retrieval use cases
References
Embedding Model Documentation
- OpenAI Embeddings Guide — API documentation and best practices
- Cohere Embed Models — Multilingual and specialized embeddings
- Sentence Transformers — Open source embedding framework
- Voyage AI Documentation — Domain-specific embeddings
Vector Database Resources
- Pinecone Learning Center — Comprehensive vector search tutorials
- Weaviate Documentation — Self-hosted vector search
- Qdrant Documentation — High-performance vector database
- pgvector GitHub — PostgreSQL vector extension
Benchmarks and Research
- MTEB Leaderboard — Embedding model benchmarks
- BEIR Benchmark — Information retrieval evaluation
- Matryoshka Representation Learning — Flexible dimension embeddings

