Seek Evaluation & Development - Ryan Manor

Part blog post, part academic report, this page outlines how I evaluated relevance on Seek.

# Building Seek: Eval Data

Seek was evaluated on two kinds of data:

1. Public Information Retrieval (IR) benchmark data sets
2. QRels from my own vault, logged and annotated from search history in Obsidian over several months

## Public IR Benchmarks

Early on in the design of Seek I was tempted to structure the problem as basically an agentic auto-research loop against MTEB. Sadly, I quickly discovered that the many public IR data sets, while robust, do not transfer very well to a PKM/journal/etc. vault search use case.

There are several reasons for this, but to make a long story short, MTEB is actually a larger collection of datasets under the hood, covering tasks like clustering, question answering, retrieval, and more. Even when zooming into the retrieval sets specifically, there's notable divergence between the goals and design of these sets and what someone would want in a vault search use case. To call out two:

**Data Structure**

In these sets, documents often lack the basic metadata you would usually see in Obsidian. Each public set ships as a flat pile of documents with separate query and relevance files. Often there is no title, and there are no tags, aliases, or text formatting within the content.

**Query & Document shape**

Many of the data sets and queries are captured from web users, and were made by people searching for information they did not have and had never seen. This is a fundamentally different scenario from Obsidian search. In a personal vault, even if you have agentically written documents, the topics and terms are more likely to be known concepts. This results in very different behavior at query time.

To really use these sets to evaluate Seek, every document is reshaped into a note: its title becomes the title field, its body becomes note content.
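
To make the reshaping step concrete, here is a minimal sketch of turning one public-benchmark document into a note. It assumes the BEIR-style record layout (`_id`, `title`, `text`) and assumes the title field lives in YAML frontmatter; the function name `beir_doc_to_note` and the fallback to the document id are my own illustration, not necessarily how the actual conversion was done.

```python
import json
from typing import Dict


def beir_doc_to_note(doc: Dict[str, str]) -> str:
    """Render one BEIR-style corpus record as a Markdown note with frontmatter.

    Assumption: the note title is stored as a `title` frontmatter property.
    """
    # Fall back to the document id when the record has no usable title.
    title = doc.get("title", "").strip() or doc["_id"]
    body = doc.get("text", "").strip()
    # A JSON string literal is also a valid YAML double-quoted scalar,
    # so json.dumps safely escapes quotes and colons in the title.
    quoted_title = json.dumps(title, ensure_ascii=False)
    return f"---\ntitle: {quoted_title}\n---\n\n{body}\n"


# Made-up record for illustration only.
example = {"_id": "d1", "title": "Reset a Pixel", "text": "Hold the power button."}
print(beir_doc_to_note(example))
```

The relevance file then only needs the document id mapped to the generated note, so the public judgments carry over unchanged.
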
Additionally, I leaned on Gemma 4 to enrich some sets (namely the Android & Gaming sets from CQADupStack) with things like tags and body content formatting for more realistic testing.

### Code & Non-English

Because I wanted Seek to work for as many Obsidian users as possible, I also evaluated a range of different languages, as well as code retrieval:

| Dataset (n)                 | Source | What it is                                                  |
| --------------------------- | ------ | ----------------------------------------------------------- |
| Code, CoIR (250)            | CoIR   | Natural-language descriptions for retrieving code snippets. |
| MLQA en/es/de/zh (600 ea.)  | MLQA   | Long natural-language questions on Wikipedia paragraphs.    |
| MIRACL en/es/de/zh/ja/hi/te | MMTEB  | Human-judged multilingual retrieval (+ hard negatives).     |

## Qrels From My Own Vault

Seek is not the first iteration of search I've built for Obsidian. For the last 8 months or so I'd run a local EmbeddingGemma server and run my search out of that. As part of this early exploration into Obsidian search, I'd built a very detailed query and click logger. Using this history, I was able to clean up and annotate a set of 482 qrels (query + result annotations for evaluation).

The major callout (and weakness) in this set is that it was collected on earlier, inferior search implementations, and my logged queries tended to be short, keyword-dense, and entity-oriented. But while this corpus was less ambitious in terms of conceptual matching than the public IR sets, it directly mirrored live query patterns for Obsidian search and is nonetheless valuable.

Even with all of this data in hand, I will make the callout that ultimately nothing can replace testing something live, and Seek was tested extensively as a daily-driver search plugin, by me and by the friends and family I was able to rope into this, before release.
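
For reference, the scores reported on the rest of this page are nDCG@10 computed against qrels like the ones described above. Below is a minimal sketch of that calculation. The qrels layout (query mapped to judged notes with graded relevance), the linear gain, and the example paths are assumptions for illustration; the source does not say which gain variant the evaluation harness used.

```python
import math
from typing import Dict, List

# Hypothetical qrels shape: query -> {note path: graded relevance}.
Qrels = Dict[str, Dict[str, int]]


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked: List[str], judged: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; unjudged notes count as gain 0 (linear gain assumed)."""
    gains = [float(judged.get(note, 0)) for note in ranked[:k]]
    ideal = sorted((float(g) for g in judged.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(runs: Dict[str, List[str]], qrels: Qrels, k: int = 10) -> float:
    """Average nDCG@k over every query that has judgments."""
    scores = [ndcg_at_k(runs.get(query, []), judged, k) for query, judged in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up example: one query, two judged notes, one unjudged result.
qrels = {"granite model": {"Models/Granite R2.md": 2, "Logs/Model study.md": 1}}
runs = {"granite model": ["Logs/Model study.md", "Models/Granite R2.md", "Inbox.md"]}
print(round(mean_ndcg(runs, qrels), 4))
```

Because the ideal ranking is built from the judged notes only, a query whose best note lands at rank 1 scores close to 1.0 even if the rest of the page is noise, which suits short, navigational vault queries.
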
# Building Seek: Finding a Model
---

With my eval data now in hand, I ran a study of the current landscape of small embedding models against my personal vault set:

| System         | Params | Fused NDCG@10 | Dense-Only NDCG@10 |
| -------------- | -----: | ------------: | -----------------: |
| `gemma_q8_512` |   300M |        0.6127 |             0.5696 |
| `gemma_q4_512` |   300M |        0.6122 |             0.5527 |
| `gemma_q4_256` |   300M |        0.5905 |             0.4846 |
| `f2llm_160m`   |   160M |        0.5931 |             0.5085 |
| `harrier_270m` |   270M |        0.5910 |             0.5072 |
| `f2llm_80m`    |    80M |        0.5877 |             0.5236 |
| `snowflake_m`  |   109M |        0.5871 |             0.4924 |
| `jina_v5_nano` |   239M |        0.5789 |             0.4164 |
| `gte_base`     |   110M |        0.5781 |             0.5058 |
| `bge_small`    |    33M |        0.5703 |             0.4864 |
| `minilm`       |    22M |        0.5568 |             0.4301 |

I have a lot of thoughts on some of these models, and on how well or poorly they did, but I'll leave that for another time...

### LLMs as embedding models

As you can see, Gemma is quite a strong performer across quantization and vector truncation levels. I'd originally used Embedding Gemma 300M for local vault search in prior prototypes for Seek. Sadly (despite much effort) I just couldn't get it to fit the compute constraints I had for my usability and stability goals with Seek.

At this point I noticed a pattern: most of the top-performing embedding models these days are repurposed from decoder LLMs, carrying a full decoder backbone. That LLM heritage is a huge problem for my use case because I did not want to fall back on a local Ollama server. Being and feeling "Obsidian native" was a core goal for Seek.

That heritage makes these models genuinely strong, with huge vocabularies and the ability to handle promptable tasks like clustering or Q&A at query time (which is honestly a very cool feature), but it also makes them heavy, and the resource cost does not scale down gracefully. I'd experimented with cutting vocabulary on Embedding Gemma, but found it sheds far more relevance than it sheds compute.
This is what eventually led me to the IBM [Granite R2 family](https://arxiv.org/abs/2605.13521).

### Granite R2

I was searching for a modern, well-trained, multilingual, *encoder-only* model. IBM was far from the top of my list when I started building Seek, but you can't argue with results:

| Model + BM25                                               | Fused NDCG@10 |  Δ vs Gemma |
| ---------------------------------------------------------- | ------------: | ----------: |
| Gemma-300M + era-1 fusion (anchor)                         |        0.6669 |         n/a |
| `granite_ml97_r2` + current Seek ranking engine            |    **0.6637** | **−0.0032** |
| `granite_small_r2` (en only) + current Seek ranking engine |        0.6587 |     −0.0082 |
| `granite_ml97_r2` + era-1 fusion (model swap only)         |        0.6170 |     −0.0499 |

Obviously this meant losing the ability to prompt for those other use cases, but ultimately the compute efficiency of Granite won me over. A big thanks to the IBM Granite Team for their *excellent* little model.

# Fusion
---

With a model in hand, the rest was blending its output with the lexical search path into one ranking. Much ink has been spilled on how to do this, so to make a very long story short, I've ended up following the research from [Bruch et al. (TOIS 2023)](https://arxiv.org/abs/2210.11934), who show that convex combination beats Reciprocal Rank Fusion and MinMax, and is largely agnostic to the score normalization.

Seek follows the method they put forth and normalizes each channel against its theoretical maximum for the query (the fixed cosine endpoints for dense; the largest BM25 score the query's terms could possibly earn for lexical), so a weak query stays small in both channels and the blend reflects real confidence. The two are then combined by convex combination at one global weight.

I did test whether that weight might need to vary per query. A perfect per-query weight was potentially worth about +0.05 nDCG@10 in my evaluations, but it proved unlearnable from cheap features.
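
The normalization and blend described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Seek's actual ranking code: the function names, the default `alpha`, and the BM25 bound of `idf * (k1 + 1)` per query term (the saturation limit of the classic BM25 term-frequency component) are all mine, and a note missing from a channel is simply scored 0 there.

```python
from typing import Dict, List, Tuple


def normalize_dense(cosine: float) -> float:
    """Map cosine similarity from its fixed range [-1, 1] onto [0, 1]."""
    return (cosine + 1.0) / 2.0


def bm25_upper_bound(query_idfs: List[float], k1: float = 1.2) -> float:
    """Assumed ceiling for a query: every term saturates at idf * (k1 + 1)."""
    return sum(idf * (k1 + 1.0) for idf in query_idfs)


def normalize_lexical(score: float, upper_bound: float) -> float:
    """Scale a raw BM25 score by the query's theoretical maximum."""
    if upper_bound <= 0.0:
        return 0.0
    return min(score / upper_bound, 1.0)


def fuse(
    dense: Dict[str, float],
    lexical: Dict[str, float],
    query_idfs: List[float],
    alpha: float = 0.5,  # hypothetical global weight on the dense channel
    k1: float = 1.2,
) -> List[Tuple[str, float]]:
    """Convex combination of the two normalized channels, best note first."""
    bound = bm25_upper_bound(query_idfs, k1)
    fused = {}
    for note in set(dense) | set(lexical):
        d = normalize_dense(dense[note]) if note in dense else 0.0
        l = normalize_lexical(lexical.get(note, 0.0), bound)
        fused[note] = alpha * d + (1.0 - alpha) * l
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: "a" matches in both channels, "b" and "c" in one each.
print(fuse({"a": 0.8, "b": 0.2}, {"a": 1.1, "c": 2.2}, query_idfs=[1.0, 1.0]))
```

Since neither channel is rescaled by the best score actually observed for the query, a query with only weak matches keeps low fused scores instead of having its top hit stretched to 1.0.
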
Additionally, the findings in [DAT, 2025](https://arxiv.org/abs/2503.23013) suggested that LLM judgement would be required to *really* move the needle here. This was well outside my compute constraints, to say nothing of the query-time latency impact, so I set query-adaptive alpha aside.

Two final signals, a navigational boost when a title or alias fully contains the query and an optional recency term, are added on top of the blend rather than mixed in. Seek treats relevance as a kind of protected or immutable calculation, so these bonuses can only nudge the order, never overwrite a strong relevance signal.

# Seek vs Prior Art
---

For a baseline comparison I've evaluated the current leaders in lexical and semantic search for the Obsidian ecosystem: Omnisearch and Smart Lookup (Smart Connections). Each cell is nDCG@10 and bold marks the best arm in each row; the full MRR@10 and P@1 breakdown for every row sits in the [[#Full Metrics]] table below.

## English

| Dataset (n)        | Seek      | Omnisearch* | Smart Lookup |
| ------------------ | --------- | ----------- | ------------ |
| My vault (483)     | **0.882** | 0.811       | 0.670        |
| DBPedia (229)      | **0.732** | 0.247       | 0.696        |
| BEIR gaming (1595) | 0.420     | 0.218       | **0.422**    |
| BEIR android (699) | 0.334     | 0.198       | **0.337**    |
| Code, CoIR (250)   | **0.948** | 0.008       | 0.947        |

*nDCG@10*

I suspect Omnisearch's poor performance on the QA, retrieval, and code sets is largely due to its treating longer queries as an AND of all terms, and the queries in these evaluations tend to be longer. While punishing on benchmarks, this is the right behavior for more realistic user vault search, as seen in its much higher score on the My vault set.

Smart Lookup also beats Seek on the two BEIR sets here. This is predominantly due to Smart Lookup being a pure semantic search plugin, which is just flat out the higher-performing methodology on these sets.
Despite this, I've kept Seek a hybrid solution, which returns superior results on more realistic Obsidian search patterns.

## Multi-Lingual

I don't have much in the way of multilingual content in my vault, so I again turned to public datasets.

### Question Answering (MLQA)

[MLQA](https://arxiv.org/abs/1910.07475) is a multilingual question-answering set: long natural-language questions over paragraphs. While not Seek's target use case, it is a convenient cross-lingual probe, and the `en` row doubles as a control against Smart Lookup, which bills itself as a question answering system first and foremost.

| MLQA (600 ea.) | Seek      | Omnisearch* | Smart Lookup |
| -------------- | --------- | ----------- | ------------ |
| English (en)   | **0.662** | 0.016       | 0.603        |
| Spanish (es)   | **0.685** | 0.029       | 0.369        |
| German (de)    | **0.659** | 0.005       | 0.320        |
| Chinese (zh)   | **0.655** | 0.002       | 0.094        |

*nDCG@10*

### Retrieval (MIRACL)

[MIRACL](https://arxiv.org/abs/2210.09984) is a multilingual retrieval benchmark with human-annotated passage-level relevance judgments (*both positive and negative*) on Wikipedia. Since the full corpora are too large to embed locally (and runpod was out of cheap nodes at the time), I've scored on the MMTEB top-250-plus-judged subset per query. This means that in absolute terms these scores are artificially inflated compared to the full corpora, since there are fewer irrelevant docs in the corpus. But crucially, the evaluation is identical across all three systems, making the comparison fair, if slightly distorted.

| MIRACL pool   | Queries (n) | Seek      | Omnisearch* | Smart Lookup |
| ------------- | ----------- | --------- | ----------- | ------------ |
| English (en)  | 799         | **0.466** | 0.049       | 0.344        |
| Spanish (es)  | 648         | **0.490** | 0.053       | 0.199        |
| German (de)   | 305         | **0.477** | 0.011       | 0.154        |
| Chinese (zh)  | 393         | **0.522** | 0.002       | 0.014        |
| Japanese (ja) | 860         | **0.585** | 0.048       | 0.025        |
| Hindi (hi)    | 350         | **0.506** | 0.010       | 0.002        |
| Telugu (te)   | 828         | **0.753** | 0.012       | 0.001        |
| avg (non-en)  |             | **0.555** | 0.023       | 0.066        |

*nDCG@10*

### A note on Multi-Lingual benchmarks

You may come away from the benchmarks above wondering:

> Why is English one of the lowest rows? Is Seek actually worse on English?

This speaks less to the actual performance of the model than to the data within the benchmark. Because much more information is often available in English, the QRels have more conceptual overlap, and the annotated hard negatives are often much more difficult to suppress. This shows up more prominently in a measure like nDCG.

| Language      | Positives / query | Pool size | Seek nDCG@10 |
| ------------- | ----------------: | --------: | -----------: |
| English (en)  |              2.91 |   178,768 |        0.466 |
| Spanish (es)  |              4.61 |   146,750 |        0.490 |
| German (de)   |              2.66 |    71,277 |        0.477 |
| Chinese (zh)  |              2.53 |    81,309 |        0.522 |
| Japanese (ja) |              2.08 |   185,319 |        0.585 |
| Hindi (hi)    |              2.15 |    63,066 |        0.506 |
| Telugu (te)   |              1.03 |   101,961 |        0.753 |

# Conclusion
---

Hopefully this paints a bit of a picture of how Seek arrived where it is now. There are still some things I want to push (see [[#The Boneyard]]), and a few more ideas for features in the near future. But for now I feel like I've arrived at a good place. It's made a big impact on how I use Obsidian, and I hope others find success with Seek as well.
That said, if you ever have a poorly performing query with Seek, please raise a [GitHub issue](https://github.com/ryan-manor/Obsidian-Seek/issues) and I would be happy to chat.

## Some lessons learned...

- The IR community has done an incredible job building public evaluation data sets on a wide range of tasks, and I'm forever in their debt. But...
- **Beware benchmaxxing.** Don't get too wrapped up in benchmarks unless you've ***really*** scrutinized how they're constructed, and have decided they're a good proxy for your use case.
- Writing and grooming QRels is one of the most tedious and boring activities you can do.
- Writing and grooming QRels is one of the highest-ROI activities you can do.
- Time for a Beir :)

## Citations

- Awasthy, Trivedi, Yang, et al. (2026). "Granite Embedding Multilingual R2 Models." arXiv preprint. [arXiv:2605.13521](https://arxiv.org/abs/2605.13521). IBM Granite Embedding R2; Seek ships the 4-bit-quantized `97m-multilingual-r2`, with the English-only `small-r2` evaluated above only as a comparison point.
- Bruch, Gai & Ingber (2023). "An Analysis of Fusion Functions for Hybrid Retrieval." ACM Transactions on Information Systems. [arXiv:2210.11934](https://arxiv.org/abs/2210.11934). Basis for Seek's convex-combination fusion.
- Robertson & Zaragoza (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval 3(4):333–389. [doi:10.1561/1500000019](https://doi.org/10.1561/1500000019)
- Lv & Zhai (2011). "Lower-bounding Term Frequency Normalization." CIKM 2011. [doi:10.1145/2063576.2063759](https://doi.org/10.1145/2063576.2063759). The BM25+ lower bound (`d=0.5`) Seek's lexical channel uses.
- Hsu & Tzeng (2025). "DAT: Dynamic Alpha Tuning for Hybrid Retrieval in Retrieval-Augmented Generation." arXiv preprint. [arXiv:2503.23013](https://arxiv.org/abs/2503.23013). LLM-judged per-query weighting (see Fusion).
- Cormack, Clarke & Büttcher (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009. [doi:10.1145/1571941.1572114](https://doi.org/10.1145/1571941.1572114). Rejected fusion baseline (Boneyard).
- Formal, Piwowarski & Clinchant (2021). "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking." SIGIR 2021. [arXiv:2107.05720](https://arxiv.org/abs/2107.05720). Rejected sparse-expansion approach (Boneyard).
- Muennighoff, Tazi, Magné & Reimers (2023). "MTEB: Massive Text Embedding Benchmark." EACL 2023. [arXiv:2210.07316](https://arxiv.org/abs/2210.07316)
- Enevoldsen, Chung, et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark." ICLR 2025. [arXiv:2502.13595](https://arxiv.org/abs/2502.13595). Source of the MIRACL top-250 judged subset.
- Thakur, Reimers, Rücklé, Srivastava & Gurevych (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021 Datasets & Benchmarks Track. [arXiv:2104.08663](https://arxiv.org/abs/2104.08663). Source of the gaming, android, and DBPedia sets.
- Hoogeveen, Verspoor & Baldwin (2015). "CQADupStack: A Benchmark Data Set for Community Question-Answering Research." ADCS 2015. [doi:10.1145/2838931.2838934](https://doi.org/10.1145/2838931.2838934). The Android & Gaming subforums.
- Hasibi, Nikolaev, Xiong, et al. (2017). "DBpedia-Entity v2: A Test Collection for Entity Search." SIGIR 2017. [doi:10.1145/3077136.3080751](https://doi.org/10.1145/3077136.3080751)
- Li, Dong, Lee, et al. (2025). "CoIR: A Comprehensive Benchmark for Code Information Retrieval Models." ACL 2025. [arXiv:2407.02883](https://arxiv.org/abs/2407.02883)
- Lewis, Oğuz, Rinott, Riedel & Schwenk (2020). "MLQA: Evaluating Cross-lingual Extractive Question Answering." ACL 2020. [arXiv:1910.07475](https://arxiv.org/abs/1910.07475)
- Zhang, Thakur, Ogundepo, et al. (2023). "Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages." TACL 2023.
[arXiv:2210.09984](https://arxiv.org/abs/2210.09984)

> And a special thanks to my wife for the architecture reviews, and testing during Seek's development.
> A less special thanks to my brother who looked at my repo and said "lgtm".

# Appendix
---

## The Boneyard

Rejected ideas, and little explorations.

| Approach                                          | Verdict  | Notes |
| ------------------------------------------------- | -------- | ----- |
| SPLADE-doc                                        | Rejected | Minimal relevance gain, massive increase in index size and compute resource needs. |
| Per-query / stratum alpha                         | Parked   | Not a lot of corpus/query-agnostic gains. I'd be interested to give this another go with some kind of simple intent/query type detection. |
| Adv. Stemming (Porter2)                           | Rejected | No gains for En, minimal on many others. I will reopen this if there's a need. |
| Cross-encoder rerankers (ettin-17M, MiniLM)       | Parked   | Did not improve relevance and added 300-400ms of latency at query time. Something I'll revisit for sure. |
| PageRank / inbound-degree prior as ranking signal | Rejected | No real graph index; in-degree dominated by hubs/MOCs. |
| RRF / z-score fusion                              | Rejected | Worse than TM2C2. |
| Global-maxfloor fusion                            | Rejected | Wins BEIR but -0.14 nDCG on personal, mixed on other sets. |
| Coverage-aware lexical (cov^gamma / hard-gate)    | Rejected | Net wash; helps some tests, hurts others, and didn't work well in hands-on testing. |
| LoRA for stage-1 recall set building              | Parked   | I'd love to spend more time here, but the few prototypes I ran either overfit to the vault based on corpus domain, or provided no material lift. |

## Full Metrics

Each cell is nDCG@10 / MRR@10 / P@1. Bold still marks the best nDCG@10 per row.
## **English & Code**

| Dataset (n)        | Seek (hybrid)             | Omnisearch* (lexical) | Smart Lookup (semantic)   |
| ------------------ | ------------------------- | --------------------- | ------------------------- |
| My vault (483)     | **0.882** / 0.862 / 0.815 | 0.811 / 0.792 / 0.746 | 0.670 / 0.623 / 0.515     |
| DBPedia (229)      | **0.732** / 0.958 / 0.939 | 0.247 / 0.390 / 0.389 | 0.696 / 0.947 / 0.930     |
| BEIR gaming (1595) | 0.420 / 0.394 / 0.303     | 0.218 / 0.221 / 0.193 | **0.422** / 0.396 / 0.303 |
| BEIR android (699) | 0.334 / 0.327 / 0.249     | 0.198 / 0.205 / 0.165 | **0.337** / 0.333 / 0.255 |
| Code, CoIR (250)   | **0.948** / 0.936 / 0.908 | 0.008 / 0.008 / 0.008 | 0.947 / 0.935 / 0.908     |

*nDCG@10 / MRR@10 / P@1*

## **MLQA (600 ea.)**

| Language | Seek (hybrid)             | Omnisearch* (lexical) | Smart Lookup (semantic) |
| -------- | ------------------------- | --------------------- | ----------------------- |
| en       | **0.662** / 0.625 / 0.553 | 0.016 / 0.015 / 0.015 | 0.603 / 0.561 / 0.473   |
| es       | **0.685** / 0.650 / 0.577 | 0.029 / 0.029 / 0.028 | 0.369 / 0.339 / 0.287   |
| de       | **0.659** / 0.621 / 0.540 | 0.005 / 0.005 / 0.005 | 0.320 / 0.296 / 0.253   |
| zh       | **0.655** / 0.618 / 0.545 | 0.002 / 0.002 / 0.002 | 0.094 / 0.081 / 0.060   |

*nDCG@10 / MRR@10 / P@1*

## **MIRACL pool**

| Language     | Seek (hybrid)             | Omnisearch* (lexical) | Smart Lookup (semantic) |
| ------------ | ------------------------- | --------------------- | ----------------------- |
| en (799)     | **0.466** / 0.532 / 0.384 | 0.049 / 0.083 / 0.068 | 0.344 / 0.406 / 0.269   |
| es (648)     | **0.490** / 0.633 / 0.506 | 0.053 / 0.110 / 0.088 | 0.199 / 0.275 / 0.164   |
| de (305)     | **0.477** / 0.534 / 0.397 | 0.011 / 0.023 / 0.023 | 0.154 / 0.173 / 0.079   |
| zh (393)     | **0.522** / 0.555 / 0.415 | 0.002 / 0.003 / 0.003 | 0.014 / 0.014 / 0.005   |
| ja (860)     | **0.585** / 0.617 / 0.488 | 0.048 / 0.056 / 0.041 | 0.025 / 0.025 / 0.012   |
| hi (350)     | **0.506** / 0.541 / 0.426 | 0.010 / 0.019 / 0.017 | 0.002 / 0.005 / 0.003   |
| te (828)     | **0.753** / 0.706 / 0.586 | 0.012 / 0.011 / 0.010 | 0.001 / 0.001 / 0.000   |
| avg (non-en) | **0.555** / 0.598 / 0.470 | 0.023 / 0.037 / 0.030 | 0.066 / 0.082 / 0.044   |

*nDCG@10 / MRR@10 / P@1*
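
For anyone reading the second and third numbers in each cell, here is a minimal sketch of how MRR@10 and P@1 are conventionally computed. The function names and the binary "any positive judgment counts as relevant" rule are assumptions for illustration; the exact harness behind the tables above is not shown in this post.

```python
from typing import Dict, List


def mrr_at_k(ranked: List[str], judged: Dict[str, int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result within the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if judged.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


def precision_at_1(ranked: List[str], judged: Dict[str, int]) -> float:
    """1.0 when the top result is judged relevant, otherwise 0.0."""
    return 1.0 if ranked and judged.get(ranked[0], 0) > 0 else 0.0


def mean(values: List[float]) -> float:
    """Average per-query scores into the single number reported per cell."""
    return sum(values) / len(values) if values else 0.0


# Made-up judgments and rankings for two queries.
judged = {"q1": {"d1": 1}, "q2": {"d7": 1, "d9": 1}}
ranked = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d4", "d7"]}

print(mean([mrr_at_k(ranked[q], judged[q]) for q in judged]))
print(mean([precision_at_1(ranked[q], judged[q]) for q in judged]))
```

MRR@10 rewards getting any relevant note near the top, while P@1 only cares about the first slot, which is why the two can diverge noticeably from nDCG@10 on sets with several positives per query.
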