Understanding Hybrid Search: BM25 and Vector Search for Legal Documents

Search isn't just about finding text — it's about finding the right answer.

In legal search, users often type things like:

"Điều 260 quy định chuyển tiếp" (Article 260, transitional provisions)
"Quy định tạm thời" (temporary provisions)
"Khoản 3 Điều 5" (Clause 3, Article 5)

These queries mix exact references with meaningful intent. A search system must handle both.

1. What is BM25?

BM25 (Best Matching 25) is a classic ranking function that scores each document by how well its terms match the keywords in the query.

How BM25 works

BM25 scores a document based on:

Term Frequency (TF) — how often a word appears in the document
Inverse Document Frequency (IDF) — how rare a word is across the collection
Document Length Normalization — adjusts for long vs short documents
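
To make the three signals above concrete, here is a minimal sketch of the standard Okapi BM25 formula. The parameter values (k1 = 1.2, b = 0.75) are commonly used defaults assumed for illustration, not values taken from our setup, and the smoothed IDF shown is only one of several variants in use. The toy corpus is made up, and a real search engine's implementation may differ in details.

```typescript
// Minimal BM25 sketch over pre-tokenized documents.
// BM25_K1 and BM25_B are commonly used defaults, assumed here for illustration.
const BM25_K1 = 1.2;
const BM25_B = 0.75;

function bm25Score(queryTerms: string[], docTokens: string[], corpus: string[][]): number {
  const totalDocs = corpus.length;
  const avgDocLength = corpus.reduce((sum, d) => sum + d.length, 0) / totalDocs;
  // Score each distinct query term once.
  const uniqueTerms = queryTerms.filter((t, i) => queryTerms.indexOf(t) === i);

  let score = 0;
  for (const term of uniqueTerms) {
    // Term frequency: how often the term occurs in this document.
    const tf = docTokens.filter((t) => t === term).length;
    if (tf === 0) continue;
    // Document frequency: how many documents contain the term at all.
    const df = corpus.filter((d) => d.indexOf(term) !== -1).length;
    // IDF: rarer terms receive a larger weight (smoothed so it stays positive).
    const idf = Math.log(1 + (totalDocs - df + 0.5) / (df + 0.5));
    // Length normalization: longer-than-average documents are penalized.
    const lengthNorm = 1 - BM25_B + BM25_B * (docTokens.length / avgDocLength);
    score += idf * ((tf * (BM25_K1 + 1)) / (tf + BM25_K1 * lengthNorm));
  }
  return score;
}

// Made-up toy corpus, already tokenized.
const bm25Corpus: string[][] = [
  ["điều", "260", "quy", "định", "chuyển", "tiếp"],
  ["điều", "5", "quy", "định", "chung"],
  ["điều", "12", "hành", "vi", "bị", "cấm"],
];
const bm25Query = ["quy", "định", "chuyển", "tiếp"];
const bm25Scores = bm25Corpus.map((d) => bm25Score(bm25Query, d, bm25Corpus));
console.log(bm25Scores);
```

The first document contains all four query terms, including the two that appear nowhere else, so it scores highest. The third contains none of them and scores 0.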

BM25 Tokenization Process

For a Vietnamese query like "Giúp tôi giải thích về quy định chuyển tiếp" ("Help me explain the transitional provisions"):

Tokenize: Split on spaces and punctuation → ["Giúp", "tôi", "giải", "thích", "về", "quy", "định", "chuyển", "tiếp"]
Filter: Remove stop words (optional) → ["giải", "thích", "quy", "định", "chuyển", "tiếp"]
Search: Find documents containing these tokens in the text field (not vectors)
Score: Calculate BM25 relevance score for each document
Rank: Sort by BM25 score
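
The tokenize and filter steps above can be sketched in a few lines. The stop-word list here is hypothetical and contains only the three words dropped in the example; real analyzers ship their own lists. Note that splitting on whitespace breaks Vietnamese multi-syllable words such as "quy định" into separate syllables, which is exactly what the token list above shows.

```typescript
// Hypothetical stop-word list for illustration only.
const STOP_WORDS = ["giúp", "tôi", "về"];

function tokenizeQuery(text: string): string[] {
  return text
    .toLowerCase()
    // Split on whitespace and common punctuation.
    .split(/[\s.,;:!?"'()\[\]]+/)
    // Drop empty strings left by leading or trailing separators.
    .filter((token) => token.length > 0)
    // Drop stop words.
    .filter((token) => STOP_WORDS.indexOf(token) === -1);
}

const queryTokens = tokenizeQuery("Giúp tôi giải thích về quy định chuyển tiếp");
console.log(queryTokens);
// → ["giải", "thích", "quy", "định", "chuyển", "tiếp"]
```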

BM25 is excellent at:

Exact matches
Rare keywords (like "Điều 260")
Keyword-based relevance

But BM25 does not understand meaning or paraphrases.

2. How MySQL `LIKE` differs from BM25

Feature	`LIKE`	BM25
Match type	Exact text pattern	Ranked relevance
Ranking	❌ No	✅ Yes
Multiple keywords	Poor	Good
Word importance	❌ No	✅ Yes

`LIKE` only checks whether the text contains a pattern; it does not rank results by relevance.

BM25 scores documents so the best matches rise to the top.
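
The difference can be illustrated with a small sketch. `likeMatch` mimics what a `LIKE '%pattern%'` substring test does, and `rankByTermOverlap` is a deliberately crude stand-in for relevance ranking: it only counts matched terms and is not BM25. Both function names and the sample texts are made up for this example.

```typescript
const sampleTexts = [
  "Quy định chuyển tiếp về đất đai",
  "Quy định về việc chuyển nhượng",
  "Quy định chung",
];

// Roughly what LIKE '%pattern%' does: a yes/no substring test with no ordering.
function likeMatch(texts: string[], pattern: string): string[] {
  const needle = pattern.toLowerCase();
  return texts.filter((t) => t.toLowerCase().indexOf(needle) !== -1);
}

// Crude relevance stand-in (NOT BM25): count how many query terms each text contains.
function rankByTermOverlap(texts: string[], queryText: string): { text: string; matched: number }[] {
  const terms = queryText.toLowerCase().split(/\s+/);
  return texts
    .map((text) => {
      const words = text.toLowerCase().split(/\s+/);
      return { text, matched: terms.filter((term) => words.indexOf(term) !== -1).length };
    })
    .filter((r) => r.matched > 0)
    .sort((a, b) => b.matched - a.matched);
}

const likeHits = likeMatch(sampleTexts, "quy định chuyển tiếp");
const rankedHits = rankByTermOverlap(sampleTexts, "quy định chuyển tiếp");
console.log(likeHits.length, rankedHits.map((r) => r.matched));
```

The substring test returns only the first text, because the phrase must appear contiguously. The term-based ranking returns all three texts, ordered by how many query terms each one contains.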

3. What Vector Search Does

Vector search uses embeddings to represent text as vectors in a high-dimensional space.

It handles meaning, synonyms, and paraphrases
It finds documents that mean something similar, even with different wording
It uses cosine similarity to compare the query vector with document vectors
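
As an illustration of the comparison step, here is a minimal cosine similarity function. The three-dimensional vectors are made up for the example; real embeddings typically have hundreds of dimensions or more and come from an embedding model.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Zero vectors have no direction, so treat their similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors, made up for illustration.
const queryVec = [0.9, 0.1, 0.3];
const similarDocVec = [0.8, 0.2, 0.4];
const unrelatedDocVec = [0.0, 1.0, 0.0];

console.log(cosineSimilarity(queryVec, similarDocVec));   // close to 1
console.log(cosineSimilarity(queryVec, unrelatedDocVec)); // close to 0
```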

But vector search struggles with:

Exact codes
Article numbers
Legal identifiers

Embeddings don't reliably capture exact numbers or reference strings.

4. What is Hybrid Search?

Hybrid search combines BM25 and vector search to get the best of both worlds:

BM25 — finds documents that contain the right keywords/article numbers
Vector search — finds documents that are semantically similar
Both run in parallel, then results are fused using score normalization and weighting

This works well for legal documents, where both numeric references and meaning matter.

5. Hybrid Search Flow (Complete Process)

User Query: "Giúp tôi giải thích về quy định chuyển tiếp"
    ↓
Apply Metadata Filters (if any): article_id = '260', chapter_id = 'XVI'
    ↓
┌─────────────────────────────────────────────────────────────┐
│  PARALLEL SEARCH (both run simultaneously)                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  BM25 Search:                                               │
│  • Tokenize: ["giải", "thích", "quy", "định",               │
│               "chuyển", "tiếp"]                             │
│  • Search TEXT field (not vectors)                          │
│  • Calculate BM25 scores (TF-IDF + doc length)              │
│  • Rank by keyword relevance                                │
│  • Return top-k documents with BM25 scores                  │
│                                                             │
│  Vector Search:                                             │
│  • Convert query to embedding vector                        │
│  • Compare with document vectors (cosine similarity)        │
│  • Calculate similarity scores                              │
│  • Rank by semantic similarity                              │
│  • Return top-k documents with vector scores                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
    ↓
Normalize Scores: Both BM25 & Vector scores → [0, 1] range
    ↓
Score Fusion with Alpha Parameter:
  • alpha = 0.0 → Pure BM25 (keyword only)
  • alpha = 0.5 → Equal weight to both (balanced)
  • alpha = 1.0 → Pure vector (semantic only)
    ↓
Final Score = combine(normalized_BM25, normalized_Vector)
    ↓
Rank & Return: Sort by final hybrid score (highest first)
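
The parallel step in the flow above can be sketched with `Promise.all`. The two searcher functions are hypothetical stubs that return canned, made-up scores; in a real system they would query the search engine. The sketch only shows how both searches are started together and how their results are merged by document ID before normalization and fusion.

```typescript
interface ScoredHit {
  id: string;
  score: number;
}

interface MergedHit {
  id: string;
  bm25?: number;   // raw BM25 score, if the keyword search returned this document
  vector?: number; // raw similarity score, if the vector search returned it
}

// Hypothetical stand-ins for the two searchers, returning made-up scores.
async function keywordSearchStub(_queryText: string): Promise<ScoredHit[]> {
  return [{ id: "article-3", score: 22.0 }, { id: "article-255", score: 18.0 }];
}
async function vectorSearchStub(_queryText: string): Promise<ScoredHit[]> {
  return [{ id: "article-255", score: 0.45 }, { id: "article-254", score: 0.46 }];
}

async function runBothSearches(queryText: string): Promise<MergedHit[]> {
  // Promise.all starts both searches before waiting for either result.
  const [keywordHits, vectorHits] = await Promise.all([
    keywordSearchStub(queryText),
    vectorSearchStub(queryText),
  ]);

  // Merge by document ID; a document may appear in one list or in both.
  const merged: Record<string, MergedHit> = {};
  for (const hit of keywordHits) {
    merged[hit.id] = { id: hit.id, bm25: hit.score };
  }
  for (const hit of vectorHits) {
    const existing = merged[hit.id];
    if (existing !== undefined) {
      existing.vector = hit.score;
    } else {
      merged[hit.id] = { id: hit.id, vector: hit.score };
    }
  }
  return Object.keys(merged).map((id) => merged[id] as MergedHit);
}

runBothSearches("quy định chuyển tiếp").then((hits) => console.log(hits));
```

A document found by only one method has no score from the other. How that gap is treated during fusion (for example, as a 0) is a design choice that depends on the engine.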

Real Example from Our Test

Query: "Giúp tôi giải thích về quy định chuyển tiếp"

Article 255 Results:

BM25 original score: 17.96101 → normalized: 0.40038913
Vector original score: 0.44675517 → normalized: 0.4777541
Final hybrid score: 0.8781431913375854 (sum of normalized scores)

Why Article 255 ranked #1:

Strong BM25 match for keywords "quy định", "chuyển tiếp"
Good semantic similarity via vector search
Balanced combination with alpha = 0.5

6. Key Differences: Filters vs BM25

Important: Metadata filters and BM25 serve different purposes:

Metadata Filters (Applied First)

Purpose: Filter documents by structured metadata before search
Example: article_id = '260', chapter_id = 'XVI'
When: Applied before any search (vector or BM25)
Scope: Database-level filtering

BM25 Keyword Search (During Search)

Purpose: Search and rank by keywords within document text content
Example: Query "quy định chuyển tiếp" matches keywords in text
When: During the search phase (parallel with vector search)
Scope: Text content matching and ranking

You still need both: filters to narrow the candidate set by structured metadata, and hybrid search (BM25 + vector) to match and rank the text itself.
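
A small in-memory sketch of the filter stage shows why it differs from scoring: a filter is a yes/no test on structured fields, and only the survivors are passed on to BM25 and vector search. The `LawChunk` shape, the function name, and the sample data (including the chapter assignments) are placeholders invented for this example.

```typescript
interface LawChunk {
  articleId: string;
  chapterId: string;
  text: string;
}

interface MetadataFilter {
  articleId?: string;
  chapterId?: string;
}

// Metadata filtering is a yes/no test on structured fields; no relevance score is involved.
function applyMetadataFilter(chunks: LawChunk[], wanted: MetadataFilter): LawChunk[] {
  return chunks.filter(
    (c) =>
      (wanted.articleId === undefined || c.articleId === wanted.articleId) &&
      (wanted.chapterId === undefined || c.chapterId === wanted.chapterId)
  );
}

// Made-up chunks; chapter assignments and text are placeholders, not real statute data.
const lawChunks: LawChunk[] = [
  { articleId: "260", chapterId: "XVI", text: "placeholder text A" },
  { articleId: "255", chapterId: "XVI", text: "placeholder text B" },
  { articleId: "3", chapterId: "I", text: "placeholder text C" },
];

// Only these candidates would then be scored by BM25 and vector search.
const chapterCandidates = applyMetadataFilter(lawChunks, { chapterId: "XVI" });
const articleCandidates = applyMetadataFilter(lawChunks, { articleId: "260", chapterId: "XVI" });
console.log(chapterCandidates.length, articleCandidates.length); // → 2 1
```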

7. Score Normalization and Alpha Parameter

Why Normalization Matters

Original scores have different scales:

BM25 scores: Unbounded; 15-30+ in our tests (higher = better)
Vector scores: Cosine similarity; usually 0.3-0.5 in our tests (higher = better)

Normalization converts both to [0, 1] for fair comparison.
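
One common way to do this is min-max normalization within each result list. The sketch below assumes that approach for illustration; the exact method depends on the engine and fusion type. The raw scores are made up.

```typescript
function minMaxNormalize(rawScores: number[]): number[] {
  if (rawScores.length === 0) return [];
  const lowest = Math.min(...rawScores);
  const highest = Math.max(...rawScores);
  // If every score is identical there is nothing to rescale; map them all to 1.
  if (highest === lowest) return rawScores.map(() => 1);
  // The best result maps to 1, the worst to 0, the rest fall in between.
  return rawScores.map((s) => (s - lowest) / (highest - lowest));
}

// Made-up raw scores on the two different scales.
const rawBm25 = [30, 18, 15];
const rawVector = [0.5, 0.4, 0.3];

const normBm25 = minMaxNormalize(rawBm25);
const normVector = minMaxNormalize(rawVector);
console.log(normBm25, normVector);
```

After normalization, the middle BM25 result lands at about 0.2 and the middle vector result at about 0.5, so the two lists can be compared even though the raw scales differ by a factor of roughly 60.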

Alpha Controls the Balance

// Configuration
alpha: 0.5,           // Equal weight to both methods
fusionType: 'RelativeScore'  // How scores are combined

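With alpha = 0.5, both methods contribute equally. In general, the fused score is a weighted sum:

Final Score = (1 - alpha) × normalized_BM25 + alpha × normalized_Vector

The normalized values reported in our results appear to already include the 0.5 weight (the highest BM25 value in the table below is exactly 0.500), so there the final score is simply their sum:

Final Score ≈ normalized_BM25 + normalized_Vector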
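
The alpha endpoints described in this article (0 = pure BM25, 1 = pure vector) are consistent with a weighted sum of the form (1 - alpha) × BM25 + alpha × vector. Here is a minimal sketch assuming that form; `fuseScores` is a hypothetical helper, not a library function, and the input values are made up.

```typescript
// Weighted-sum fusion of two already-normalized scores in [0, 1].
function fuseScores(normalizedBm25: number, normalizedVector: number, alpha: number): number {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError("alpha must be between 0 and 1");
  }
  // alpha weights the vector score; (1 - alpha) weights the BM25 score.
  return (1 - alpha) * normalizedBm25 + alpha * normalizedVector;
}

// Made-up normalized scores for one document.
const exampleBm25 = 0.8;
const exampleVector = 0.6;

console.log(fuseScores(exampleBm25, exampleVector, 0.0)); // keyword only: the BM25 score
console.log(fuseScores(exampleBm25, exampleVector, 0.5)); // balanced: halfway between the two
console.log(fuseScores(exampleBm25, exampleVector, 1.0)); // semantic only: the vector score
```

Lowering alpha favors exact keyword and article-number matches; raising it favors semantic similarity.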

Score Analysis from Our Results

Article	BM25 (norm)	Vector (norm)	Final Score	Why It Ranked
255	0.400	0.478	0.878	Best overall balance
253	0.391	0.465	0.856	Good keyword + semantic match
3	0.500	0.352	0.852	Highest BM25, lower vector
254	0.320	0.489	0.809	High vector, lower BM25
260	0.381	0.272	0.654	Moderate on both
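
As a quick sanity check, the ranking in the table can be reproduced by summing the two reported columns and sorting. The values below are copied from the table (rounded to three decimals); the variable names are just for this sketch.

```typescript
// Normalized scores copied from the table above.
const reportedRows = [
  { article: "255", bm25: 0.400, vector: 0.478 },
  { article: "253", bm25: 0.391, vector: 0.465 },
  { article: "3", bm25: 0.500, vector: 0.352 },
  { article: "254", bm25: 0.320, vector: 0.489 },
  { article: "260", bm25: 0.381, vector: 0.272 },
];

// Final score = sum of the two reported values, highest first.
const reRanked = reportedRows
  .map((r) => ({ article: r.article, finalScore: r.bm25 + r.vector }))
  .sort((a, b) => b.finalScore - a.finalScore);

console.log(reRanked.map((r) => r.article).join(", "));
// → 255, 253, 3, 254, 260
```

Summing the rounded inputs gives 0.653 for Article 260 rather than the 0.654 in the table, which is a rounding effect from the three-decimal values.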

8. Can Traditional DBs and Vector DBs Both Use BM25?

Yes.

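Full-text search engines and traditional databases rank text with BM25 or a similar function:
- Elasticsearch (BM25 is the default similarity)
- MySQL FULLTEXT (its own TF-IDF-style relevance ranking)
- PostgreSQL full-text search (built-in `ts_rank` functions rather than BM25)

Vector databases often support keyword scoring as well, for hybrid search:
- Weaviate (our example; built-in BM25)
- Pinecone (hybrid via sparse vectors, which can be BM25-weighted)
- Qdrant (hybrid via sparse vectors, which can be BM25-weighted)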

BM25 isn't tied to vectors — it's a ranking function that searches text fields, while vector search uses embedding fields. Both can coexist in the same database.

9. Implementation: asRetriever vs hybridSearch

asRetriever() (LangChain Standard)

store.asRetriever({
  k: 5,
  filter,
  searchType: 'similarity', // or 'mmr'
})

Purpose: Converts vector store to LangChain retriever
Search types: 'similarity' (vector) or 'mmr' (Maximal Marginal Relevance)
Limitation: Vector-based only, no BM25

hybridSearch() (Weaviate Native)

store.hybridSearch(query, {
  limit: 5,
  filter,
  alpha: 0.5, // 0 = pure BM25, 1 = pure vector
  returnMetadata: ['score', 'explainScore'],
})

Purpose: Combines BM25 + vector search natively
Advantage: True hybrid search with score fusion
Control: Alpha parameter for balancing methods

10. Key Takeaways

✅ BM25 = keyword relevance (searches text, not vectors)
✅ Vector search = semantic meaning (uses embeddings)
✅ Hybrid search = parallel execution + score fusion
✅ Filters are applied before search (metadata filtering)
✅ Score normalization ensures fair comparison between methods
✅ Alpha parameter controls the balance between keyword and semantic search
✅ Traditional DBs and vector DBs can both use BM25
✅ Hybrid search suits legal documents, where exact references + meaning both matter

11. Why This Matters for Legal Search

Legal queries often mix:

Exact article references ("Điều 260")
Explanatory intent ("giải thích", "quy định chuyển tiếp")

The Problem with Single-Method Search

Pure keyword search (BM25 only):

✅ Gets exact articles and legal references
❌ Lacks semantic understanding
❌ Misses paraphrased or conceptual queries

Pure vector search (semantic only):

✅ Captures meaning and intent
❌ Can miss exact references and article numbers
❌ Less reliable for precise legal citations

The Hybrid Solution

Hybrid search provides accuracy + understanding by:

Running both searches in parallel
Normalizing scores for fair comparison
Fusing results with configurable alpha weighting
Ranking by combined relevance

Result: Users get both precise legal references AND contextually relevant content — critical for legal systems where precision and understanding are both essential.

Real-World Impact

In our Vietnamese Land Law example:

Query: "Giúp tôi giải thích về quy định chuyển tiếp"
BM25 catches keywords: "quy định", "chuyển tiếp"
Vector search understands the explanatory intent
Hybrid fusion ranks the most relevant transitional provisions first
Users get exactly what they need: relevant law articles with proper context

This is why hybrid search is a strong default for legal document retrieval systems.
