### BM25 Explained for a Developer

BM25 (Best Matching 25) is a ranking function used in search engines and information retrieval systems to rank documents based on their relevance to a query. It is an improvement over traditional TF-IDF and is part of the probabilistic ranking family.

---

### 1. **Intuition**

BM25 scores documents based on:

- **Term frequency (TF)**: How often the query term appears in the document.
- **Inverse document frequency (IDF)**: How rare the query term is across all documents.
- **Document length normalization**: Shorter documents get slightly boosted to prevent long documents from always dominating.

It's called BM25 because it's one of many variations of the "Best Matching" ranking functions.

---

### 2. **Formula**

The BM25 score for a document $D$ given a query $Q$ is:

$$\text{score}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{\text{TF}(t, D) \cdot (k_1 + 1)}{\text{TF}(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgD}}\right)}$$

where:

- $\text{TF}(t, D)$ = term frequency of $t$ in $D$
- $\text{IDF}(t)$ = inverse document frequency of term $t$, usually computed as:

$$\text{IDF}(t) = \log \left(\frac{N - n_t + 0.5}{n_t + 0.5} + 1 \right)$$

- $N$ = total number of documents
- $n_t$ = number of documents containing term $t$
- $k_1$ = controls term frequency saturation (commonly 1.2 to 2.0)
- $b$ = controls document length normalization (typically 0.75)
- $|D|$ = length of document $D$ (word count)
- $\text{avgD}$ = average document length in the corpus

---
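To make the formula concrete, here is a minimal from-scratch sketch of it (function and variable names are my own, not from any library):

```python
import math

def bm25_scores(query_terms, tokenized_docs, k1=1.5, b=0.75):
    """Score each document against the query using the BM25 formula above."""
    N = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / N
    # n_t: number of documents containing each query term
    df = {t: sum(1 for d in tokenized_docs if t in d) for t in query_terms}
    scores = []
    for doc in tokenized_docs:
        score = 0.0
        for t in query_terms:
            tf = doc.count(t)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # Denominator applies both saturation (k1) and length normalization (b)
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "the quick brown fox".split(),
    "never jump over the lazy dog".split(),
]
print(bm25_scores(["quick", "fox"], docs))
```

A document containing none of the query terms contributes zero for every term, so its score is exactly 0.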
### 3. **Why BM25 Over TF-IDF?**

|Feature|TF-IDF|BM25|
|---|---|---|
|Frequency handling|Linear increase|Saturates (diminishing returns)|
|Length normalization|None|Adjusted via $b$|
|Relevance weighting|Simple|Tunable via $k_1$ and $b$|
|Performance|Fast|Slightly slower but better relevance|

BM25 prevents long documents from getting an unfair advantage and avoids overweighting repeated words.

---

### 4. **Implementation in Python**

A basic BM25 implementation using `rank_bm25`:

```python
from rank_bm25 import BM25Okapi

documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "Brown foxes are fast and clever"
]
tokenized_docs = [doc.lower().split() for doc in documents]

bm25 = BM25Okapi(tokenized_docs)

query = "quick fox"
tokenized_query = query.lower().split()

scores = bm25.get_scores(tokenized_query)
print(scores)  # BM25 relevance scores for each document
```

---

### 5. **Use Cases**

- Search engines (e.g., Elasticsearch, Solr)
- Recommendation systems
- Question answering
- Code search tools
- Document retrieval in legal/medical texts

BM25 is widely used in modern search applications because it provides a strong balance of performance and ranking quality.
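The saturation behavior noted in the comparison table in section 3 can be seen by isolating the term-frequency component of the formula (an illustrative sketch of my own, with length normalization disabled via $b = 0$):

```python
def bm25_tf_component(tf, k1=1.5):
    # TF component of BM25 with b = 0, isolating the saturation effect:
    # grows with tf but is bounded above by k1 + 1.
    return tf * (k1 + 1) / (tf + k1)

for tf in [1, 2, 5, 10, 100]:
    print(tf, round(bm25_tf_component(tf), 3))
```

Unlike a raw term count, which grows linearly, this component approaches $k_1 + 1$ as the term repeats, so the 100th occurrence of a word adds almost nothing to the score.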