# Similarity search
> [!ai]+ phind.com [Phind](https://www.phind.com/agent?cache=cluyqn3z6002qjf08mmyj0rt7)
>
> # Similarity Search in [[Vector Databases]]
>
> ## What is Similarity Search?
> - Similarity search, also known as nearest neighbor search, is the task of finding the most similar vectors to a query vector in a database
> - Used for applications like recommendation systems, semantic search, image search, etc. where you want to find "similar" items to a query
>
> ## [[Vector Database]]
> - Vector databases are purpose-built for storing and searching high-dimensional vector embeddings
> - Vectors are dense numerical representations of items like text, images, audio, etc. ([[Word Embeddings]])
> - [[Deep Learning]] models map data to a vector space such that similar items have vectors close to each other
>
> ## How Vector Similarity Search Works
> 1. Items are embedded into a high-dimensional vector space using a deep learning model
> 2. Vectors are indexed in a vector database to allow for efficient similarity search
> 3. To perform a search, the query is embedded into a vector using the same model
> 4. The vector database finds the nearest neighbor vectors to the query vector based on a distance metric like Euclidean distance or cosine similarity
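
The four steps above reduce to a brute-force exact scan when the database is small. A minimal NumPy sketch, with the embedding model stubbed out by random vectors (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for embedded items and an embedded query (a real system would
# produce both with the same deep-learning embedding model).
db_vectors = rng.normal(size=(1000, 64))   # 1000 items, 64-dim embeddings
query = rng.normal(size=64)

# Cosine similarity is the dot product of L2-normalized vectors.
db_norm = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)

scores = db_norm @ q_norm                  # one cosine score per item
top_k = np.argsort(-scores)[:5]            # ids of the 5 most similar items
print(top_k, scores[top_k])
```

An exact scan like this costs O(n·d) per query, which is exactly what motivates the ANN indexes discussed next.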
>
> ## Indexing Algorithms for Efficient Search
> - Exact search is too slow for large databases, so [[approximate nearest neighbor]] (ANN) algorithms are used
> - [[Locality-sensitive hashing]] (LSH) - hashes similar vectors to the same bucket
> - [[Hierarchical navigable small world graphs]] (HNSW) - builds a graph to allow quickly navigating to nearest neighbors
> - [[Product quantization]] (PQ) - compresses vectors and builds lookup tables for fast asymmetric distance computation
>
> ## Popular Vector Database Libraries
> - [[FAISS]] (Facebook AI Similarity Search) - library for efficient similarity search and clustering of dense vectors
> - Annoy (Approximate Nearest Neighbors Oh Yeah) - C++ library with Python bindings for fast ANN search
> - NMSLIB (Non-Metric Space Library) - library for efficient similarity search on generic non-metric spaces
>
> The key aspects of similarity search in vector databases are:
> 1. Embedding data into vector representations
> 2. Indexing vectors in a purpose-built database
> 3. Using ANN algorithms for efficient similarity search
> 4. Applying it to search, recommendations, ranking and more

> [!ai]+ Claude
> ### Index Data Structures for Cosine Similarity Search in Vector Databases
>
> #### k-d Trees
> - Partition the high-dimensional space using binary trees
> - Each node represents a splitting hyperplane along a specific dimension
> - Effective for low-dimensional data (e.g., up to 20 dimensions)
> - Cosine similarity is served indirectly: L2-normalize the vectors, and Euclidean nearest neighbors coincide with the most cosine-similar ones
> - Complexity: O(log n) average case, O(n) worst case for search
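
A minimal sketch of that normalization trick using `scipy.spatial.KDTree` (data sizes are illustrative): on unit vectors, ||a − b||² = 2 − 2·cos(a, b), so Euclidean nearest neighbors are exactly the most cosine-similar ones.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 8))           # low-dimensional: k-d tree territory

# k-d trees natively answer Euclidean queries; L2-normalize everything
# first so Euclidean order matches cosine-similarity order.
unit = data / np.linalg.norm(data, axis=1, keepdims=True)
tree = KDTree(unit)

query = rng.normal(size=8)
q_unit = query / np.linalg.norm(query)
dists, idx = tree.query(q_unit, k=3)       # 3 nearest neighbors
print(idx, unit[idx] @ q_unit)             # ids and their cosine scores
```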
>
> #### Ball Trees
> - Organize data into a hierarchy of nested hyper-spheres
> - Each node represents a ball containing a subset of data points
> - Scale to higher dimensions better than k-d trees, though they still degrade as dimensionality grows
> - Support non-Euclidean metrics; cosine similarity is typically handled via angular distance or normalized vectors
> - Complexity: O(log n) average case for search
>
> #### VP Trees (Vantage Point Trees)
> - Partition data based on distance from vantage points
> - Each node represents a vantage point and a radius defining a region
> - Effective for high-dimensional data
> - Work with various distance metrics, including cosine similarity
> - Complexity: O(log n) average case for search
>
> #### Random Projection Trees
> - Use random projections to reduce data dimensionality
> - Partition data based on projected coordinates
> - Handle high-dimensional data efficiently
> - Computationally efficient construction and search
> - Complexity: O(log n) average case for search
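
A minimal sketch of a random projection tree (the names `build_rp_tree` and `query_leaf`, the leaf size, and the data are all illustrative): each internal node splits points at the median of a random 1-D projection, and a query descends to one small leaf to obtain its candidate set.

```python
import numpy as np

rng = np.random.default_rng(2)

def build_rp_tree(indices, data, leaf_size=10):
    """Recursively split points at the median of a random projection."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = rng.normal(size=data.shape[1])     # random hyperplane normal
    proj = data[indices] @ direction
    median = np.median(proj)
    left, right = indices[proj <= median], indices[proj > median]
    if len(left) == 0 or len(right) == 0:          # degenerate split: stop here
        return {"leaf": indices}
    return {"dir": direction, "median": median,
            "left": build_rp_tree(left, data, leaf_size),
            "right": build_rp_tree(right, data, leaf_size)}

def query_leaf(tree, q):
    """Descend to the leaf whose region contains the query."""
    while "leaf" not in tree:
        tree = tree["left"] if q @ tree["dir"] <= tree["median"] else tree["right"]
    return tree["leaf"]

data = rng.normal(size=(200, 32))
tree = build_rp_tree(np.arange(200), data)
candidates = query_leaf(tree, rng.normal(size=32))  # small set to scan exactly
print(len(candidates))
```

Libraries like Annoy build a forest of such trees and merge the candidate sets to improve recall.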
>
> #### Locality Sensitive Hashing (LSH)
> - Hash similar vectors into the same buckets with high probability
> - Enables approximate nearest neighbor search
> - Suitable for high-dimensional data
> - Supports cosine similarity and other distance metrics
> - Complexity: O(1) for hash computation, O(n) for candidate generation
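
For cosine similarity the classic LSH family is random hyperplanes (SimHash): each hash bit records which side of a random hyperplane a vector falls on, so nearby vectors agree on most bits. A minimal sketch (bit count and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_bits = 64, 16

# One random hyperplane per hash bit; two vectors agree on a bit with
# probability 1 - angle/pi, so similar vectors share buckets often.
hyperplanes = rng.normal(size=(n_bits, dim))

def simhash(v):
    bits = (hyperplanes @ v) > 0
    return bits.astype(np.uint8).tobytes()  # hashable bucket key

# Index: map each hash bucket to the vector ids it contains.
data = rng.normal(size=(1000, dim))
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(simhash(v), []).append(i)

# Query: only vectors in the query's bucket are candidates -- a constant-time
# hash plus a short exact scan, instead of scanning all 1000 vectors.
query = data[42] + 0.01 * rng.normal(size=dim)  # near-duplicate of item 42
candidates = buckets.get(simhash(query), [])
print(len(candidates))
```

In practice several independent hash tables are used, since any single table can miss a true neighbor that straddles a hyperplane.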
>
> #### Hierarchical Navigable Small World (HNSW) Graphs
> - Construct a multi-layer graph structure
> - Each layer represents a different level of abstraction
> - Navigate the graph greedily, layer by layer, to find nearest neighbors
> - Efficient for high-dimensional data
> - Supports various distance metrics, including cosine similarity
> - Complexity: O(log n) average case for search
>
> #### Product Quantization (PQ)
> - Divide the high-dimensional space into subspaces
> - Quantize each subspace independently
> - Represent vectors as a combination of quantized subspace centroids
> - Enable efficient storage and fast approximate similarity search
> - Support Euclidean distance; can be adapted for cosine similarity
> - Complexity: O(1) for distance computation, O(n) for candidate generation
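
The PQ pipeline above (split into subspaces, quantize each, answer queries via lookup tables) in a minimal NumPy sketch. The subspace and codebook sizes are illustrative, and the codebooks are trained with only a few ad-hoc k-means iterations; a real system would train them properly on held-out data.

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_sub, n_centroids = 32, 4, 16       # 4 subspaces of 8 dims, 16 centroids each
sub_dim = dim // n_sub

data = rng.normal(size=(2000, dim))

# Train one tiny codebook per subspace with a few k-means iterations.
codebooks = []
for s in range(n_sub):
    sub = data[:, s * sub_dim:(s + 1) * sub_dim]
    centroids = sub[rng.choice(len(sub), n_centroids, replace=False)]
    for _ in range(10):
        assign = np.argmin(((sub[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_centroids):
            if np.any(assign == c):
                centroids[c] = sub[assign == c].mean(axis=0)
    codebooks.append(centroids)

# Encode: each vector becomes n_sub small centroid ids (32 floats -> 4 bytes).
def encode(x):
    return np.array([np.argmin(((x[s*sub_dim:(s+1)*sub_dim] - codebooks[s])**2).sum(-1))
                     for s in range(n_sub)], dtype=np.uint8)

codes = np.stack([encode(v) for v in data])

# Asymmetric distance: precompute query-to-centroid tables per subspace,
# then each database distance is just n_sub table lookups.
query = rng.normal(size=dim)
tables = np.stack([((query[s*sub_dim:(s+1)*sub_dim] - codebooks[s])**2).sum(-1)
                   for s in range(n_sub)])        # shape (n_sub, n_centroids)
approx_dists = tables[np.arange(n_sub), codes].sum(axis=1)
nearest = np.argmin(approx_dists)
print(nearest, approx_dists[nearest])
```

For cosine similarity the same machinery applies after L2-normalizing the data, or with inner-product lookup tables instead of squared-distance ones.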