
From Inverted Indexes to BM25: How Search Ranking Actually Works

A first-principles walkthrough of search — from scanning documents to inverted indexes, TF-IDF to BM25, and why modern systems combine lexical and semantic search.

Saurabh Prakash · Mar 12, 2026 · 22 min read

Most developers use search every day — Elasticsearch queries, RAG pipelines, 'Ctrl+F' in a codebase — without ever understanding what happens between the query and the ranked results.

This post builds that understanding from scratch, one failure at a time.

The journey:

Boolean matching → TF-IDF → BM25 → Hybrid search. Each step is a precise fix for the failure of the previous one.


Imagine you have a library with 1 million books. A user walks in and says:

"I want books about photosynthesis."

The naive solution: look inside every book, check if the word "photosynthesis" appears, return all matches.

This works for 10 books. At 1 million, it's too slow. At 1 billion web pages, it's impossible.

So the first big idea in search is: don't search the documents — search an index.


Step 1: The Inverted Index

Instead of storing documents and scanning them, you build a map from words → documents:

$$
\begin{aligned}
\text{"photosynthesis"} &\rightarrow [\text{doc\_4, doc\_17, doc\_203, doc\_891}] \\
\text{"chlorophyll"} &\rightarrow [\text{doc\_4, doc\_203}] \\
\text{"database"} &\rightarrow [\text{doc\_12, doc\_55, doc\_300}]
\end{aligned}
$$

Now when someone searches "photosynthesis", you look it up directly in this map — O(1) instead of scanning everything. This structure is called an inverted index, and it's the foundation of every search engine (Elasticsearch, Lucene, Google).[1]

Step 1, distilled:

  1. Scanning everything doesn't scale → you need an index
  2. An inverted index maps words → document IDs
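The idea above fits in a few lines of Python. A minimal sketch, assuming a toy whitespace tokenizer and illustrative document IDs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical mini-corpus; IDs are illustrative
docs = {
    "doc_4": "photosynthesis uses chlorophyll",
    "doc_12": "a relational database",
    "doc_17": "photosynthesis in plants",
}
index = build_inverted_index(docs)
print(sorted(index["photosynthesis"]))  # ['doc_17', 'doc_4']
```

Real engines add positions, compression, and on-disk layouts, but the core structure is exactly this map.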

Step 2: Ranking — Which Document Is Most Relevant?

You searched "photosynthesis" and your inverted index returned 10,000 matching documents. Great. But now what? You can't show all 10,000. You need to sort them — most relevant first.

The question becomes: how do you measure relevance?


First Instinct: Count the Word

The simplest idea — the document that mentions "photosynthesis" the most is probably the most about photosynthesis.

So you count. For each document, count how many times the query word appears. This count is called Term Frequency, or TF.

$$
\begin{aligned}
\text{Doc A:} &\quad \text{"Photosynthesis is the process by which plants..." (5 mentions)} \rightarrow \text{TF} = 5 \\
\text{Doc B:} &\quad \text{"Photosynthesis." (a one-line dictionary entry)} \rightarrow \text{TF} = 1 \\
\text{Doc C:} &\quad \text{"...photosynthesis... photosynthesis... photosynthesis..." (spam)} \rightarrow \text{TF} = 30
\end{aligned}
$$

By raw TF, Doc C wins. But that's clearly wrong — it's just keyword stuffing.

Still, the direction is right. More occurrences = more relevant, up to a point. We'll fix the "up to a point" problem later.


Second Instinct: Not All Words Are Equal

Now imagine the query is "the photosynthesis process".

The word "the" appears in literally every document ever written. Matching on "the" tells you nothing. But "photosynthesis" is rare — if a document contains it, that's genuinely meaningful.

So you need a way to reward rare words and discount common ones.

This measure is called Inverse Document Frequency, or IDF.[2]

The idea:

  • Count how many documents contain this word → 'n'
  • Total documents in your corpus → 'N'
  • If 'n' is small (rare word), IDF is high → strong signal
  • If 'n' is large (common word), IDF is low → weak signal
$$
\begin{aligned}
\text{"the"} &\rightarrow \text{appears in } 1{,}000{,}000 / 1{,}000{,}000 \text{ docs} \rightarrow \text{IDF} \approx 0 \quad \text{(useless)} \\
\text{"process"} &\rightarrow \text{appears in } 400{,}000 / 1{,}000{,}000 \text{ docs} \rightarrow \text{IDF low} \\
\text{"photosynthesis"} &\rightarrow \text{appears in } 1{,}200 / 1{,}000{,}000 \text{ docs} \rightarrow \text{IDF high (valuable!)}
\end{aligned}
$$

Combining Both: TF-IDF

Multiply them together:

$$\text{Score}(\text{document}, \text{term}) = \text{TF} \times \text{IDF}$$

A document scores high if it:

  • Mentions the term often (high TF) AND
  • The term is rare across the corpus (high IDF)

For a multi-word query, you sum the score across all terms:

$$
\begin{aligned}
\text{Score}(\text{doc}, \text{"the photosynthesis process"}) &= \text{Score}(\text{doc}, \text{"the"}) + \text{Score}(\text{doc}, \text{"photosynthesis"}) + \text{Score}(\text{doc}, \text{"process"}) \\
&\approx 0 + \text{big number} + \text{small number}
\end{aligned}
$$

"the" contributes almost nothing. "Photosynthesis" dominates. Which is exactly what you want.
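The whole scheme is short enough to sketch directly. A minimal version, assuming the classic log(N/n) IDF and pre-tokenized input (the counts below are illustrative):

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, doc_freq, n_docs):
    """Sum TF x IDF over the query terms.

    doc_freq maps term -> number of documents containing it
    (obtainable from the inverted index's posting-list lengths).
    """
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        n = doc_freq.get(term, 0)
        if n == 0:
            continue  # term absent from corpus: contributes nothing
        idf = math.log(n_docs / n)  # classic IDF: log(N / n)
        score += tf[term] * idf
    return score

# Illustrative counts: "the" in every doc, "photosynthesis" rare
doc_freq = {"the": 1_000_000, "process": 400_000, "photosynthesis": 1_200}
doc = ["the", "photosynthesis", "photosynthesis", "process"]
score = tf_idf_score(["the", "photosynthesis", "process"], doc, doc_freq, 1_000_000)
```

Note how "the" contributes exactly zero here: log(1,000,000 / 1,000,000) = log(1) = 0.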

TF-IDF, in one line: invented in the 1970s, surprisingly powerful for such a simple formula, and used in production search engines for decades.[2][3]

But it has two specific, concrete failure modes. And fixing those two failures — precisely and elegantly — is exactly what BM25 does.[3]


Step 3: Where TF-IDF Breaks

TF-IDF works well enough that it powered real search engines for decades. But it has two specific cracks that get worse as your corpus grows.[2][3]


Failure 1: TF is Linear — But Relevance Isn't

With TF-IDF, if a document mentions "photosynthesis" 100 times, it scores exactly 10× a document that mentions it 10 times.

But think about what that actually means in practice:

Doc A: A focused 3-page biology paper. Mentions "photosynthesis" 10 times.
Doc B: A 200-page textbook. Mentions "photosynthesis" 100 times.

Is Doc B really 10× more relevant? Probably not. Doc A might be the better answer — it's entirely about photosynthesis. Doc B just mentions it more because it's enormous and covers many topics.

The deeper insight:

The first few occurrences of a word tell you a lot — the document is probably about this topic. After that, each additional occurrence tells you less and less. Going from 0 → 1 mention is huge. Going from 100 → 101 mentions is meaningless. But TF-IDF treats both jumps as equal.

Visually:

[Chart: what TF-IDF assumes (score grows linearly with mentions)]

[Chart: what relevance actually looks like (score saturates)]

TF-IDF assumes a linear relationship — more mentions always means more relevant. But real relevance saturates. The first few mentions matter a lot; after that, each additional mention tells you almost nothing.


Failure 2: Document Length is Ignored

Consider these two documents, both matching the query "black holes":

Doc A (50 words):     "Black holes are regions where gravity is so strong 
                       that nothing, not even light, can escape."
 
Doc B (10,000 words): A sprawling astrophysics survey covering stars, 
                       nebulae, black holes, quasars, dark matter...
                       mentions "black holes" 15 times.

Raw TF: Doc B wins with 15 mentions vs Doc A's 2.

But Doc A is entirely about black holes. Doc B just mentions them in passing as one of twenty topics — it's a longer document so it naturally accumulates more mentions.

Long documents will almost always beat short ones on raw TF, simply because they have more words. Not because they're more relevant.


The Two Problems, Summarized

| Problem | What happens | Why it's bad |
| --- | --- | --- |
| Linear TF | Score grows forever with repetition | Keyword stuffing wins; a 1000-mention spam doc beats a focused 10-mention paper |
| No length awareness | Longer docs always accumulate more TF | A brief focused article loses to a long tangential chapter |

TF-IDF has no defense against either.

BM25 fixes both problems, with one elegant formula that adds exactly two new parameters.


Step 4: BM25 — Fixing TF-IDF, One Problem at a Time


Fix 1: Making TF Saturate

Instead of using raw term frequency 'f', BM25 transforms it with this expression:

$$\frac{f \cdot (k_1 + 1)}{f + k_1}$$

Don't let it intimidate you. Let's just plug in numbers and watch what happens. Using k1 = 1.2 (the default):[3][4]

$$
\begin{aligned}
f = 0 &\rightarrow \frac{0 \times 2.2}{0 + 1.2} = 0.0 \\
f = 1 &\rightarrow \frac{1 \times 2.2}{1 + 1.2} = 1.0 \\
f = 2 &\rightarrow \frac{2 \times 2.2}{2 + 1.2} = 1.375 \\
f = 5 &\rightarrow \frac{5 \times 2.2}{5 + 1.2} = 1.77 \\
f = 10 &\rightarrow \frac{10 \times 2.2}{10 + 1.2} = 1.96 \\
f = 50 &\rightarrow \frac{50 \times 2.2}{50 + 1.2} = 2.15 \\
f = 100 &\rightarrow \frac{100 \times 2.2}{100 + 1.2} = 2.17
\end{aligned}
$$

Notice what's happening:

f goes:     1  →  2  →  5  →  10  →  50  →  100
Score goes: 1  → 1.4 → 1.8 →  2.0 →  2.1 →  2.2

The score grows fast at first, then almost stops growing. It can never exceed k1 + 1 = 2.2. That's the ceiling — the asymptote.

Compare that to raw TF:

f goes:     1  →  2  →  5  →  10  →  50  →  100
Raw TF:     1  →  2  →  5  →  10  →  50  →  100

Raw TF grows forever. BM25's version flattens out. Keyword stuffing now gives almost no benefit.
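The saturation table above is easy to reproduce; a one-function sketch:

```python
def saturated_tf(f, k1=1.2):
    """BM25's TF transform: grows fast at first, asymptotes at k1 + 1."""
    return f * (k1 + 1) / (f + k1)

for f in [1, 2, 5, 10, 50, 100]:
    print(f, round(saturated_tf(f), 2))  # never exceeds 2.2 with the default k1
```

Going from f = 1 to f = 100 moves the score by barely one point, while raw TF would move it by 99.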


What does k1 actually control?

k1 controls how fast the curve flattens:

[Chart: effect of k1 on TF saturation]
  • cyan = k1 = 0.5 (flattens quickly, ceiling = 1.5)
  • purple = k1 = 1.2 (default, ceiling = 2.2)
  • amber = k1 = 2.0 (flattens slowly, ceiling = 3.0)

Low k1 — the very first mention of a word does almost all the work. Good for short documents like tweets or product titles where one mention is enough signal.

High k1 — each additional mention still meaningfully contributes. Good for long technical documents like research papers where repetition genuinely signals depth of coverage.


Fix 2: Normalizing for Document Length

Now we extend the denominator to account for length. The full expression becomes:

$$\frac{f \cdot (k_1 + 1)}{f + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Three new things appeared:

  • |D|: the length of the current document in tokens
  • avgdl: the average document length across your entire corpus
  • b: a parameter controlling how strongly you penalize length (default 0.75)

Understanding the length term

Focus on the new part inside the denominator:

$$1 - b + b \cdot \frac{|D|}{\text{avgdl}}$$

Let's call |D| / avgdl the length ratio — how long this document is compared to average.

  • If the document is exactly average length, ratio = 1.0, so the expression = 1 - b + b*1 = 1. No effect.
  • If the document is shorter than average, ratio < 1.0, expression < 1. Denominator shrinks → score goes up.
  • If the document is longer than average, ratio > 1.0, expression > 1. Denominator grows → score goes down.

The intuition:

Short focused documents get a boost. Long sprawling documents get penalized. Exactly what we wanted.
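The three cases above can be checked directly; a sketch of the length term in isolation:

```python
def length_term(doc_len, avgdl, b=0.75):
    """The BM25 length multiplier: 1 - b + b * |D| / avgdl."""
    return 1 - b + b * doc_len / avgdl

print(length_term(300, 300))  # average-length doc: 1.0, no effect
print(length_term(100, 300))  # short doc: < 1, denominator shrinks, score rises
print(length_term(900, 300))  # long doc: > 1, denominator grows, score falls
```

Because this term multiplies k1 in the denominator, a value below 1 inflates the score and a value above 1 deflates it.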


What does b control?

b controls how aggressively length affects the score:

When b = 0:

$$1 - 0 + 0 \cdot \frac{|D|}{\text{avgdl}} = 1$$

The whole length term collapses to 1. Length is completely ignored. You're back to just the saturation fix.

When b = 1:

$$1 - 1 + 1 \cdot \frac{|D|}{\text{avgdl}} = \frac{|D|}{\text{avgdl}}$$

Full normalization. A document twice the average length gets its score roughly halved.

Default b = 0.75 sits in between — partial normalization. Enough to stop long docs dominating, but not so aggressive that a long comprehensive article is always buried.[3][4]


Concrete example

Query: "black holes". k1=1.2, b=0.75, avgdl=300 tokens, term appears 3 times in both docs.

Doc A — 100 tokens (short, focused):

$$
\begin{aligned}
\text{length ratio} &= 100/300 = 0.33 \\
\text{length term} &= 1 - 0.75 + 0.75 \times 0.33 = 0.25 + 0.25 = 0.50 \\
\text{denominator} &= 3 + 1.2 \times 0.50 = 3.6 \\
\text{TF component} &= 3 \times 2.2 \,/\, 3.6 = 1.83
\end{aligned}
$$

Doc B — 900 tokens (long, tangential):

$$
\begin{aligned}
\text{length ratio} &= 900/300 = 3.0 \\
\text{length term} &= 1 - 0.75 + 0.75 \times 3.0 = 0.25 + 2.25 = 2.50 \\
\text{denominator} &= 3 + 1.2 \times 2.50 = 6.0 \\
\text{TF component} &= 3 \times 2.2 \,/\, 6.0 = 1.10
\end{aligned}
$$

[Chart: BM25 length normalization, short vs long doc]

Result:

Same term count. Same query. But Doc A scores 1.83 vs Doc B's 1.10, roughly 66% higher. The short focused document wins, as it should.
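These two computations can be verified with a few lines, using the same parameters as above:

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25's TF part with length normalization folded into the denominator."""
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm)

print(round(tf_component(3, 100, 300), 2))  # Doc A, short and focused: 1.83
print(round(tf_component(3, 900, 300), 2))  # Doc B, long and tangential: 1.1
```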


Step 5: IDF — Rare Words Matter More

We already know the intuition: rare words are more meaningful than common ones. BM25 keeps this idea but gives it a more rigorous formula.


Classic IDF (from TF-IDF)

$$\text{IDF}(q) = \log\!\left(\frac{N}{n(q)}\right)$$
  • N = total documents in corpus
  • n(q) = documents containing the term

Simple ratio. If a term appears in 10 out of 1,000,000 docs, IDF = log(100,000), which is very high. If it appears in 900,000 docs, IDF = log(1.11) ≈ 0.1, almost nothing.

Works fine in practice, but it's an empirical heuristic — someone just noticed it worked, not derived from theory.


BM25's IDF

BM25 derives IDF from probability theory instead:

$$\text{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5}\right)$$

Looks more complex, but the structure is the same — it's still a ratio involving N and n(q). The differences:

The 0.5 constants are smoothing — they prevent two nasty edge cases:

  • Term never seen before → n(q) = 0 → division by zero without smoothing
  • Term in every document → n(q) = N → the numerator becomes 0, and ln(0) is undefined without smoothing

The 0.5 acts like saying "assume we've seen half a relevant and half a non-relevant document" before any real data comes in. A small nudge to keep the math stable.


The Edge Case: Negative IDF

Watch what happens when a term appears in more than half the corpus:

$$
\begin{aligned}
N &= 1{,}000{,}000 \\
n(q) &= 600{,}000 \quad \text{(term appears in 60\% of documents)} \\
\text{IDF} &= \ln\!\left(\frac{1{,}000{,}000 - 600{,}000 + 0.5}{600{,}000 + 0.5}\right) = \ln\!\left(\frac{400{,}000.5}{600{,}000.5}\right) = \ln(0.667) = -0.405 \quad \leftarrow \text{negative!}
\end{aligned}
$$

A negative IDF would subtract from a document's score when it contains the term — backwards behavior. The term is common, so it shouldn't help much, but it definitely shouldn't hurt.

Lucene's fix:

Add 1 inside the log: IDF(q) = ln( 1 + (N - n(q) + 0.5) / (n(q) + 0.5) ). Now the minimum possible value is ln(1) = 0 — common terms contribute nothing, never negative. This is what Elasticsearch actually uses in production.[4]
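A sketch comparing the two variants on the 60% example above, showing the raw formula go negative while the Lucene variant stays positive:

```python
import math

def classic_bm25_idf(n_docs, df):
    """Raw BM25 IDF; goes negative when df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def lucene_idf(n_docs, df):
    """Lucene's variant: the +1 keeps the minimum at ln(1) = 0."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

N, df = 1_000_000, 600_000  # term appears in 60% of documents
print(round(classic_bm25_idf(N, df), 3))  # -0.405, backwards behavior
print(round(lucene_idf(N, df), 3))        # 0.511, small but positive
```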


The Complete BM25 Formula

Now we can assemble everything:

$$\text{BM25}(D, Q) = \sum_{i} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

For each query term qi, compute:

  • IDF — how rare is this term globally?
  • TF component — how focused is its presence in this specific document?

Multiply them. Sum across all query terms. That's the final score.
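The whole formula fits in one function. A minimal end-to-end sketch, assuming pre-tokenized documents and the Lucene-style non-negative IDF (this is an illustration, not how Lucene itself is implemented):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query.

    `query` is a list of tokens; `docs` maps doc_id -> list of tokens.
    """
    N = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / N
    # document frequency per query term, from the corpus
    df = {t: sum(1 for toks in docs.values() if t in toks) for t in set(query)}
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        norm = 1 - b + b * len(toks) / avgdl
        score = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))  # Lucene variant
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
        scores[doc_id] = score
    return scores

# Same shape as the worked example below: lengths 80/20/500,
# with "filler" standing in for the other tokens
docs = {
    "D1": ["inverted", "index"] * 2 + ["filler"] * 76,
    "D2": ["inverted", "index"] + ["filler"] * 18,
    "D3": ["inverted", "index"] + ["filler"] * 498,
}
print(bm25_scores(["inverted", "index"], docs))  # D1 highest, D3 lowest
```

The printed scores match the hand computation in the worked example up to rounding.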


Full Worked Example

Let's do this properly, from scratch.

Corpus:

D1 (80 tokens):    "Inverted index maps terms to documents. 
                    Inverted indexes power search engines."
 
D2 (20 tokens):    "Inverted index: fast retrieval structure."
 
D3 (500 tokens):   "Search engines use many structures. An inverted 
                    index is one. [480 more tokens on other topics]"

Query: "inverted index"

Parameters: k1 = 1.2, b = 0.75

avgdl = (80 + 20 + 500) / 3 = 200 tokens


Step A: Count Term Frequencies

| | "inverted" | "index" |
| --- | --- | --- |
| D1 | 2 | 2 |
| D2 | 1 | 1 |
| D3 | 1 | 1 |

Step B: Compute IDF for Each Term

All 3 documents contain both terms, so n(q) = 3, N = 3:

$$
\begin{aligned}
\text{IDF}(\text{"inverted"}) &= \ln\!\left(1 + \frac{3 - 3 + 0.5}{3 + 0.5}\right) = \ln\!\left(1 + \frac{0.5}{3.5}\right) = \ln(1.143) = 0.134 \\
\text{IDF}(\text{"index"}) &= \text{same} = 0.134
\end{aligned}
$$

Both terms appear in every document, so IDF is low (but positive due to Lucene's fix). In a real corpus of millions of docs, rare terms would have IDF values of 5–10+.


Step C: Compute TF Component for Each Doc

Formula:

$$\frac{f \cdot (k_1+1)}{f + k_1 \cdot (1 - b + b \cdot |D|/\text{avgdl})}$$

For D1 (|D| = 80):

$$
\begin{aligned}
\text{length ratio} &= 80/200 = 0.4 \\
\text{length term} &= 1 - 0.75 + 0.75 \times 0.4 = 0.25 + 0.30 = 0.55 \\
\text{denominator} &= f + 1.2 \times 0.55 = f + 0.66 \\
\text{"inverted" } (f{=}2) &: \frac{2 \times 2.2}{2 + 0.66} = \frac{4.4}{2.66} = 1.654 \\
\text{"index" } (f{=}2) &: \text{same} = 1.654
\end{aligned}
$$

For D2 (|D| = 20):

$$
\begin{aligned}
\text{length ratio} &= 20/200 = 0.1 \\
\text{length term} &= 1 - 0.75 + 0.75 \times 0.1 = 0.25 + 0.075 = 0.325 \\
\text{denominator} &= f + 1.2 \times 0.325 = f + 0.39 \\
\text{"inverted" } (f{=}1) &: \frac{1 \times 2.2}{1 + 0.39} = \frac{2.2}{1.39} = 1.583 \\
\text{"index" } (f{=}1) &: \text{same} = 1.583
\end{aligned}
$$

For D3 (|D| = 500):

$$
\begin{aligned}
\text{length ratio} &= 500/200 = 2.5 \\
\text{length term} &= 1 - 0.75 + 0.75 \times 2.5 = 0.25 + 1.875 = 2.125 \\
\text{denominator} &= f + 1.2 \times 2.125 = f + 2.55 \\
\text{"inverted" } (f{=}1) &: \frac{1 \times 2.2}{1 + 2.55} = \frac{2.2}{3.55} = 0.620 \\
\text{"index" } (f{=}1) &: \text{same} = 0.620
\end{aligned}
$$

Step D: Final BM25 Scores

$$\text{BM25}(D, Q) = \text{IDF}(\text{"inverted"}) \cdot \text{TF}(\text{"inverted"}) + \text{IDF}(\text{"index"}) \cdot \text{TF}(\text{"index"})$$

$$
\begin{aligned}
\text{D1} &: 0.134 \times 1.654 + 0.134 \times 1.654 = 0.222 + 0.222 = 0.443 \\
\text{D2} &: 0.134 \times 1.583 + 0.134 \times 1.583 = 0.212 + 0.212 = 0.424 \\
\text{D3} &: 0.134 \times 0.620 + 0.134 \times 0.620 = 0.083 + 0.083 = 0.166
\end{aligned}
$$

Final Ranking:

[Chart: BM25 final scores (D1 = 0.443, D2 = 0.424, D3 = 0.166)]

D1 wins because it mentions both terms twice in a short document — dense signal. D2 is close despite only one mention because it's tiny. D3 loses badly despite being longest — the terms are buried in noise.

This is precisely what TF-IDF gets wrong:

Raw TF would have scored D2 and D3 identically (each mentions both terms once), even though D3 buries those mentions in 500 tokens of mostly unrelated material. BM25 correctly pushes D3 to the bottom.


The Full Picture So Far

BM25 is now complete. But it has real limitations — and knowing them tells you exactly when to reach for something else.


Step 6: Where BM25 Breaks — And What Comes After

BM25 is excellent, but it has a fundamental ceiling. All its problems trace back to one root cause.


The Root Cause: Bag of Words

BM25 treats a document as an unordered bag of words — it only cares about which words appear and how often. It throws away everything else.[3][5]

Concretely, to BM25 these two sentences are identical:

"The dog bit the man"
"The man bit the dog"

Same words, same frequencies, same score. But completely different meanings.

That one design decision — summing independent per-term scores — causes every limitation BM25 has.


Limitation 1: No Synonyms

Search for "heart attack". BM25 looks for those exact words.

A document containing "myocardial infarction" — the clinical term for the exact same thing — scores zero for your query.

Query:    "heart attack"
Doc A:    "...symptoms of a heart attack include..."        → scores HIGH ✓
Doc B:    "...myocardial infarction is caused by..."        → scores ZERO ✗
Doc C:    "...treatment of cardiac arrest begins with..."   → scores ZERO ✗

Docs B and C might be far more medically rigorous and relevant, but BM25 can't find them.

The same problem appears everywhere: "car" vs "automobile", "buy" vs "purchase", "fix" vs "resolve" vs "debug".

BM25 sees these as completely unrelated words.


Limitation 2: No Word Order or Proximity

"New York" and "York New" produce identical BM25 scores.

More practically — a document where "python" and "install" appear in the same sentence vs. 5 pages apart gets the same score. BM25 has no concept of words being near each other.[6]


Limitation 3: No Intent or Context

A user searches "python". They could mean the programming language, the snake, or Monty Python.

BM25 has absolutely no way to distinguish. It returns documents containing "python" — regardless of which meaning the user had in mind.


Limitation 4: Exact Vocabulary Match Required

BM25 needs the exact word to appear in the document:

"run" vs "running" vs "ran"      ← different forms of same word
"USA" vs "United States"         ← abbreviations
"JS" vs "JavaScript"             ← shorthand
"GPT-4" vs "GPT4" vs "gpt4"     ← formatting variants

Partial workarounds exist — like stemming (reducing "running" → "run") — but they're fragile and language-specific.[7]

The core problem, stated clearly:

BM25 operates on surface form — the literal characters in the query and document. It has no model of meaning. What you actually want when searching is semantic similarity — documents that mean the same thing as your query, even if they use different words.


Here's the paradigm shift that happened in the last 10 years.

Instead of representing documents as bags of words, you represent them as vectors — lists of numbers that encode meaning.

"heart attack"          → [0.23, -0.87, 0.45, 0.12, ...]   (384 numbers)
"myocardial infarction" → [0.21, -0.85, 0.47, 0.11, ...]   (384 numbers)
"the cat sat"           → [0.91,  0.34, -0.22, 0.67, ...]   (384 numbers)

The first two vectors are almost identical — they're close in vector space — because a neural network (an embedding model) learned that they mean the same thing.

Searching then becomes: find documents whose vectors are closest to the query vector. This is called semantic search or dense retrieval.[8]
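Closeness is typically measured with cosine similarity. A sketch using the illustrative four-dimensional vectors above (real embeddings have hundreds of dimensions, and these numbers are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

heart_attack = [0.23, -0.87, 0.45, 0.12]
infarction   = [0.21, -0.85, 0.47, 0.11]
cat_sat      = [0.91, 0.34, -0.22, 0.67]

print(round(cosine(heart_attack, infarction), 3))  # close to 1.0: same meaning
print(round(cosine(heart_attack, cat_sat), 3))     # near zero: unrelated
```

Dense retrieval is then "return the documents with the highest cosine similarity to the query vector", usually via an approximate nearest-neighbor index rather than a linear scan.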

BM25:          "heart attack" → find docs with exact words "heart" and "attack"
Vector search: "heart attack" → find docs whose meaning is closest to this query

So does vector search simply replace BM25? Not always. It has its own failure modes:

Query: "Error code NX-4521-B"

An embedding model has likely never seen this specific error code. It has no idea what it means. It'll return vaguely "error-ish" documents. BM25 would find the exact match immediately.

Query: "function getUserById"  (searching a codebase)

Exact keyword match is what you want here. Semantic similarity is meaningless — you want that specific function name.


Modern production systems — including most RAG pipelines — use both:

Each leg covers the other's blind spots:

| Capability | BM25 | Vector search |
| --- | --- | --- |
| Exact keywords, codes, names | ✓ | ✗ |
| Synonyms, paraphrases | ✗ | ✓ |
| Domain jargon (new terms) | ✓ | ✗ |
| Natural language queries | ✗ | ✓ |
| Interpretable / debuggable | ✓ | ✗ |
| No GPU needed | ✓ | ✗ |

This combination is called hybrid search, and it's the standard architecture for serious search systems today — including most RAG pipelines you'd build with LLMs.[8][9]
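One common way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores (BM25 scores and cosine similarities live on incomparable scales). A sketch with hypothetical doc IDs; k = 60 is the conventional constant:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per doc."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_top   = ["doc_7", "doc_2", "doc_9"]  # lexical leg
vector_top = ["doc_2", "doc_5", "doc_7"]  # semantic leg
print(rrf_fuse([bm25_top, vector_top]))   # doc_2 first: ranked well by both legs
```

Documents that appear high in both lists float to the top, without ever having to normalize the underlying scores.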


The Full Journey

Each step was a precise fix for the failure of the previous one. That's the whole history of information retrieval in a straight line.

You now have a complete mental model:

From inverted indexes all the way to modern hybrid RAG pipelines — every step motivated by a concrete failure in the previous approach.


Where to Go Next

Some natural directions from here:

  • How embedding models work — what actually produces those vectors
  • How re-ranking works — how you merge BM25 and vector results intelligently (RRF, cross-encoders)
  • Building this in practice — Elasticsearch + a vector store, or tools like Weaviate and Qdrant that do both natively

References

[1]: Manning, Raghavan, and Schütze, Introduction to Information Retrieval, "A first take at building an inverted index"

[2]: Manning, Raghavan, and Schütze, Introduction to Information Retrieval, "Inverse document frequency"

[3]: Manning, Raghavan, and Schütze, Introduction to Information Retrieval, "Okapi BM25: a non-binary model"

[4]: Apache Lucene — BM25Similarity

[5]: Wikipedia — Bag-of-words model

[6]: Manning, Raghavan, and Schütze, Introduction to Information Retrieval, "Positional postings and phrase queries"

[7]: Manning, Raghavan, and Schütze, Introduction to Information Retrieval, "Stemming and lemmatization"

[8]: Microsoft Learn — Vector search in Azure AI Search

[9]: Microsoft Learn — Hybrid search using vectors and full text in Azure AI Search