# Similarity search
> [!ai]+ phind.com [Phind](https://www.phind.com/agent?cache=cluyqn3z6002qjf08mmyj0rt7)
>
> # Similarity Search in [[Vector Databases]]
>
> ## What is Similarity Search?
> - Similarity search, also known as nearest neighbor search, is the task of finding the most similar vectors to a query vector in a database
> - Used for applications like recommendation systems, semantic search, image search, etc. where you want to find "similar" items to a query
>
> ## [[Vector Database]]
> - Vector databases are purpose-built for storing and searching high-dimensional vector embeddings
> - Vectors are dense numerical representations of items like text, images, audio, etc. ([[Word Embeddings]])
> - [[Deep Learning]] models map data to a vector space such that similar items have vectors close to each other
>
> ## How Vector Similarity Search Works
> 1. Items are embedded into a high-dimensional vector space using a deep learning model
> 2. Vectors are indexed in a vector database to allow for efficient similarity search
> 3. To perform a search, the query is embedded into a vector using the same model
> 4. The vector database finds the nearest neighbor vectors to the query vector based on a distance metric like Euclidean distance or cosine similarity
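
The four steps above reduce to a brute-force exact scan when the database is small. A minimal NumPy sketch, with the embedding model stubbed out by random vectors (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for embedded items and an embedded query (a real system would
# produce both with the same deep-learning embedding model).
db_vectors = rng.normal(size=(1000, 64))   # 1000 items, 64-dim embeddings
query = rng.normal(size=64)

# Cosine similarity is the dot product of L2-normalized vectors.
db_norm = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)

scores = db_norm @ q_norm                  # one cosine score per item
top_k = np.argsort(-scores)[:5]            # ids of the 5 most similar items
print(top_k, scores[top_k])
```

An exact scan like this costs O(n·d) per query, which is exactly what motivates the ANN indexes discussed next.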
>
> ## Indexing Algorithms for Efficient Search
> - Exact search is too slow for large databases, so [[approximate nearest neighbor]] (ANN) algorithms are used
> - [[Locality-sensitive hashing]] (LSH) - hashes similar vectors to the same bucket
> - [[Hierarchical navigable small world graphs]] (HNSW) - builds a graph to allow quickly navigating to nearest neighbors
> - [[Product quantization]] (PQ) - compresses vectors and builds lookup tables for fast asymmetric distance computation
>
> ## Popular Vector Database Libraries
> - [[FAISS]] (Facebook AI Similarity Search) - library for efficient similarity search and clustering of dense vectors
> - Annoy (Approximate Nearest Neighbors Oh Yeah) - C++ library with Python bindings for fast ANN search
> - NMSLIB (Non-Metric Space Library) - library for efficient similarity search on generic non-metric spaces
>
> The key aspects of similarity search in vector databases are:
> 1. Embedding data into vector representations
> 2. Indexing vectors in a purpose-built database
> 3. Using ANN algorithms for efficient similarity search
> 4. Applying it to search, recommendations, ranking and more

> [!ai]+ Claude
> ### Index Data Structures for Cosine Similarity Search in Vector Databases
>
> #### k-d Trees
> - Partition the high-dimensional space using binary trees
> - Each node represents a splitting hyperplane along a specific dimension
> - Effective for low-dimensional data (e.g., up to 20 dimensions)
> - Cosine similarity is served indirectly: L2-normalize the vectors, and Euclidean nearest neighbors coincide with the most cosine-similar ones
> - Complexity: O(log n) average case, O(n) worst case for search
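
A minimal sketch of that normalization trick using `scipy.spatial.KDTree` (data sizes are illustrative): on unit vectors, ||a − b||² = 2 − 2·cos(a, b), so Euclidean nearest neighbors are exactly the most cosine-similar ones.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 8))           # low-dimensional: k-d tree territory

# k-d trees natively answer Euclidean queries; L2-normalize everything
# first so Euclidean order matches cosine-similarity order.
unit = data / np.linalg.norm(data, axis=1, keepdims=True)
tree = KDTree(unit)

query = rng.normal(size=8)
q_unit = query / np.linalg.norm(query)
dists, idx = tree.query(q_unit, k=3)       # 3 nearest neighbors
print(idx, unit[idx] @ q_unit)             # ids and their cosine scores
```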
>
> #### Ball Trees
> - Organize data into a hierarchy of nested hyper-spheres
> - Each node represents a ball containing a subset of data points
> - Scale to higher dimensions better than k-d trees, though they still degrade as dimensionality grows
> - Support non-Euclidean metrics; cosine similarity is typically handled via angular distance or normalized vectors
> - Complexity: O(log n) average case for search
>
> #### VP Trees (Vantage Point Trees)
> - Partition data based on distance from vantage points
> - Each node represents a vantage point and a radius defining a region
> - Effective for high-dimensional data
> - Work with various distance metrics, including cosine similarity
> - Complexity: O(log n) average case for search
>
> #### Random Projection Trees
> - Use random projections to reduce data dimensionality
> - Partition data based on projected coordinates
> - Handle high-dimensional data efficiently
> - Computationally efficient construction and search
> - Complexity: O(log n) average case for search
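
A minimal sketch of a random projection tree (the names `build_rp_tree` and `query_leaf`, the leaf size, and the data are all illustrative): each internal node splits points at the median of a random 1-D projection, and a query descends to one small leaf to obtain its candidate set.

```python
import numpy as np

rng = np.random.default_rng(2)

def build_rp_tree(indices, data, leaf_size=10):
    """Recursively split points at the median of a random projection."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = rng.normal(size=data.shape[1])     # random hyperplane normal
    proj = data[indices] @ direction
    median = np.median(proj)
    left, right = indices[proj <= median], indices[proj > median]
    if len(left) == 0 or len(right) == 0:          # degenerate split: stop here
        return {"leaf": indices}
    return {"dir": direction, "median": median,
            "left": build_rp_tree(left, data, leaf_size),
            "right": build_rp_tree(right, data, leaf_size)}

def query_leaf(tree, q):
    """Descend to the leaf whose region contains the query."""
    while "leaf" not in tree:
        tree = tree["left"] if q @ tree["dir"] <= tree["median"] else tree["right"]
    return tree["leaf"]

data = rng.normal(size=(200, 32))
tree = build_rp_tree(np.arange(200), data)
candidates = query_leaf(tree, rng.normal(size=32))  # small set to scan exactly
print(len(candidates))
```

Libraries like Annoy build a forest of such trees and merge the candidate sets to improve recall.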
>
> #### Locality Sensitive Hashing (LSH)
> - Hash similar vectors into the same buckets with high probability
> - Enables approximate nearest neighbor search
> - Suitable for high-dimensional data
> - Supports cosine similarity and other distance metrics
> - Complexity: O(1) for hash computation, O(n) for candidate generation
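
For cosine similarity the classic LSH family is random hyperplanes (SimHash): each hash bit records which side of a random hyperplane a vector falls on, so nearby vectors agree on most bits. A minimal sketch (bit count and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_bits = 64, 16

# One random hyperplane per hash bit; two vectors agree on a bit with
# probability 1 - angle/pi, so similar vectors share buckets often.
hyperplanes = rng.normal(size=(n_bits, dim))

def simhash(v):
    bits = (hyperplanes @ v) > 0
    return bits.astype(np.uint8).tobytes()  # hashable bucket key

# Index: map each hash bucket to the vector ids it contains.
data = rng.normal(size=(1000, dim))
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(simhash(v), []).append(i)

# Query: only vectors in the query's bucket are candidates -- a constant-time
# hash plus a short exact scan, instead of scanning all 1000 vectors.
query = data[42] + 0.01 * rng.normal(size=dim)  # near-duplicate of item 42
candidates = buckets.get(simhash(query), [])
print(len(candidates))
```

In practice several independent hash tables are used, since any single table can miss a true neighbor that straddles a hyperplane.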
>
> #### Hierarchical Navigable Small World (HNSW) Graphs
> - Construct a multi-layer graph structure
> - Each layer represents a different level of abstraction
> - Navigate the graph greedily, layer by layer, to find nearest neighbors
> - Efficient for high-dimensional data
> - Supports various distance metrics, including cosine similarity
> - Complexity: O(log n) average case for search
>
> #### Product Quantization (PQ)
> - Divide the high-dimensional space into subspaces
> - Quantize each subspace independently
> - Represent vectors as a combination of quantized subspace centroids
> - Enable efficient storage and fast approximate similarity search
> - Support Euclidean distance; can be adapted for cosine similarity
> - Complexity: O(1) for distance computation, O(n) for candidate generation
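
The PQ pipeline above (split into subspaces, quantize each, answer queries via lookup tables) in a minimal NumPy sketch. The subspace and codebook sizes are illustrative, and the codebooks are trained with only a few ad-hoc k-means iterations; a real system would train them properly on held-out data.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_sub, n_centroids = 32, 4, 16       # 4 subspaces of 8 dims, 16 centroids each
sub_dim = dim // n_sub

data = rng.normal(size=(2000, dim))

# Train one tiny codebook per subspace with a few k-means iterations.
codebooks = []
for s in range(n_sub):
    sub = data[:, s * sub_dim:(s + 1) * sub_dim]
    centroids = sub[rng.choice(len(sub), n_centroids, replace=False)]
    for _ in range(10):
        assign = np.argmin(((sub[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_centroids):
            if np.any(assign == c):
                centroids[c] = sub[assign == c].mean(axis=0)
    codebooks.append(centroids)

# Encode: each vector becomes n_sub small centroid ids (32 floats -> 4 bytes).
def encode(x):
    return np.array([np.argmin(((x[s*sub_dim:(s+1)*sub_dim] - codebooks[s])**2).sum(-1))
                     for s in range(n_sub)], dtype=np.uint8)

codes = np.stack([encode(v) for v in data])

# Asymmetric distance: precompute query-to-centroid tables per subspace,
# then each database distance is just n_sub table lookups.
query = rng.normal(size=dim)
tables = np.stack([((query[s*sub_dim:(s+1)*sub_dim] - codebooks[s])**2).sum(-1)
                   for s in range(n_sub)])        # shape (n_sub, n_centroids)
approx_dists = tables[np.arange(n_sub), codes].sum(axis=1)
nearest = np.argmin(approx_dists)
print(nearest, approx_dists[nearest])
```

For cosine similarity the same machinery applies after L2-normalizing the data, or with inner-product lookup tables instead of squared-distance ones.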