Understanding Hybrid Search: BM25 and Vector Search for Legal Documents


Search isn't just about finding text — it's about finding the right answer.

In legal search, users often type things like:

  • "Điều 260 quy định chuyển tiếp" (Article 260, transitional provisions)
  • "Quy định tạm thời" (temporary provisions)
  • "Khoản 3 Điều 5" (Clause 3, Article 5)

These queries mix exact references with meaningful intent. A search system must handle both.


1. What is BM25?

BM25 is a classic keyword relevance algorithm used for ranking documents based on keyword occurrences.

How BM25 works

BM25 scores a document based on:

  • Term Frequency (TF) — how often a word appears in the document
  • Inverse Document Frequency (IDF) — how rare a word is across the collection
  • Document Length Normalization — adjusts for long vs short documents
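
To make the three ingredients concrete, here is a minimal sketch of a BM25 scorer. It assumes the commonly used defaults k1 = 1.2 and b = 0.75 and a Lucene-style IDF; the tiny corpus is made up for illustration, and real engines may differ in details.

```typescript
// Minimal BM25 sketch. k1 and b are assumed defaults, not values from our setup.
type Doc = { id: string; tokens: string[] };

function bm25Scores(query: string[], docs: Doc[], k1 = 1.2, b = 0.75): Map<string, number> {
  const n = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.tokens.length, 0) / n;
  const scores = new Map<string, number>();

  for (const doc of docs) {
    let score = 0;
    for (const term of query) {
      // TF: how often the term occurs in this document
      const tf = doc.tokens.filter((t) => t === term).length;
      if (tf === 0) continue;
      // IDF: terms found in fewer documents get a higher weight
      const df = docs.filter((d) => d.tokens.indexOf(term) !== -1).length;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      // Length normalization: longer-than-average documents are penalized
      const lengthNorm = 1 - b + b * (doc.tokens.length / avgLen);
      score += idf * ((tf * (k1 + 1)) / (tf + k1 * lengthNorm));
    }
    scores.set(doc.id, score);
  }
  return scores;
}

// Made-up, pre-tokenized corpus for illustration only.
const corpus: Doc[] = [
  { id: "a", tokens: ["điều", "260", "quy", "định", "chuyển", "tiếp"] },
  { id: "b", tokens: ["quy", "định", "chung", "về", "đất", "đai"] },
  { id: "c", tokens: ["điều", "5", "quyền", "sử", "dụng", "đất"] },
];
const result = bm25Scores(["260", "quy", "định"], corpus);
```

Document "a" scores highest here because it contains the rare token "260" in addition to the more common "quy" and "định".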

BM25 Tokenization Process

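For a Vietnamese query like "Giúp tôi giải thích về quy định chuyển tiếp" ("Help me explain the transitional provisions"):

  1. Tokenize: Split on spaces and punctuation → ["Giúp", "tôi", "giải", "thích", "về", "quy", "định", "chuyển", "tiếp"]. Note that whitespace splitting breaks multi-syllable Vietnamese words such as "giải thích" (explain) into separate syllables.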
  2. Filter: Remove stop words (optional) → ["giải", "thích", "quy", "định", "chuyển", "tiếp"]
  3. Search: Find documents containing these tokens in the text field (not vectors)
  4. Score: Calculate BM25 relevance score for each document
  5. Rank: Sort by BM25 score
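
Steps 1 and 2 can be sketched as follows. This is only an illustration: the stop-word list is hypothetical, and unlike step 1 above the sketch lowercases before splitting, so real tokenizers may behave differently.

```typescript
// Hypothetical stop-word list; a real system would use a curated one.
const STOP_WORDS = new Set(["giúp", "tôi", "về"].map((w) => w.normalize("NFC")));

function tokenize(text: string): string[] {
  return text
    .normalize("NFC") // make sure accented characters compare consistently
    .toLowerCase()
    .split(/[\s.,;:!?"'()]+/) // split on whitespace and common punctuation
    .filter((token) => token.length > 0);
}

function removeStopWords(tokens: string[]): string[] {
  return tokens.filter((token) => !STOP_WORDS.has(token));
}

const tokens = tokenize("Giúp tôi giải thích về quy định chuyển tiếp");
const filtered = removeStopWords(tokens);
// filtered → ["giải", "thích", "quy", "định", "chuyển", "tiếp"]
```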

BM25 is excellent at:

  • Exact matches
  • Rare keywords (like "Điều 260")
  • Keyword-based relevance

But BM25 does not understand meaning or paraphrases.


2. How MySQL LIKE differs from BM25

Feature            LIKE                 BM25
Match type         Exact text pattern   Ranked relevance
Ranking            ❌ No                ✅ Yes
Multiple keywords  Poor                 Good
Word importance    ❌ No                ✅ Yes

LIKE just checks if text contains a pattern — it does not rank results by relevance.

BM25 scores documents so the best matches rise to the top.
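
The difference can be sketched in a few lines. The documents below are made up, and the overlap count is only a crude stand-in for a real ranking function such as BM25; the point is that a LIKE-style check gives yes or no, while a ranking function gives graded scores.

```typescript
// Made-up document titles for illustration.
const docs = [
  "Điều 260. Quy định chuyển tiếp",
  "Quy định chung về quản lý đất đai",
  "Điều 5. Quyền của người sử dụng đất",
];

// Rough equivalent of: WHERE text LIKE '%pattern%' (case-insensitive collation assumed)
function likeMatch(text: string, pattern: string): boolean {
  return text.toLowerCase().indexOf(pattern.toLowerCase()) !== -1;
}

// Crude stand-in for keyword ranking: number of query terms found in the text
function overlapScore(text: string, query: string): number {
  const words = text.toLowerCase().split(/[\s.,]+/);
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => words.indexOf(term) !== -1).length;
}

const query = "quy định chuyển tiếp";
const likeResults = docs.map((d) => likeMatch(d, query)); // [true, false, false]
const rankedScores = docs.map((d) => overlapScore(d, query)); // [4, 2, 0]
```

The second document is partially relevant: LIKE drops it entirely, while the ranked approach keeps it with a lower score.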


3. What Vector Search Does

Vector search uses embeddings to represent text as vectors in a high-dimensional space.

  • It handles meaning, synonyms, paraphrases
  • It finds documents that mean something similar, even with different wording
  • Uses cosine similarity to compare query vectors with document vectors
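
As a small illustration of the comparison step, cosine similarity between two vectors can be computed like this. The three-dimensional vectors are toy values; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors, which have no direction
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors for illustration only.
const queryVector = [1, 0, 1];
const sameDirection = cosineSimilarity(queryVector, [2, 0, 2]); // same direction, close to 1
const orthogonal = cosineSimilarity(queryVector, [0, 1, 0]); // nothing in common, 0
```

Because only the direction matters, a vector scaled by 2 is still a perfect match.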

But vector search struggles with:

  • Exact codes
  • Article numbers
  • Legal identifiers

Embeddings don't reliably capture exact numbers or reference strings.


4. What Is Hybrid Search?

Hybrid search combines BM25 and vector search to get the best of both worlds:

  1. BM25 — finds documents that contain the right keywords/article numbers
  2. Vector search — finds documents that are semantically similar
  3. Both run in parallel, then results are fused using score normalization and weighting

This works exceptionally well for law documents where numeric references and meaning both matter.


5. Hybrid Search Flow (Complete Process)

User Query: "Giúp tôi giải thích về quy định chuyển tiếp"
    ↓
Apply Metadata Filters (if any): article_id = '260', chapter_id = 'XVI'
    ↓
┌─────────────────────────────────────────────────────────────┐
│  PARALLEL SEARCH (both run simultaneously)                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  BM25 Search:                                               │
│  • Tokenize: ["giải", "thích", "quy", "định",             │
│              "chuyển", "tiếp"]                             │
│  • Search TEXT field (not vectors)                         │
│  • Calculate BM25 scores (TF-IDF + doc length)             │
│  • Rank by keyword relevance                               │
│  • Return top-k documents with BM25 scores                 │
│                                                             │
│  Vector Search:                                             │
│  • Convert query to embedding vector                       │
│  • Compare with document vectors (cosine similarity)       │
│  • Calculate similarity scores                             │
│  • Rank by semantic similarity                             │
│  • Return top-k documents with vector scores               │
│                                                             │
└─────────────────────────────────────────────────────────────┘
    ↓
Normalize Scores: Both BM25 & Vector scores → [0, 1] range
    ↓
Score Fusion with Alpha Parameter:
  • alpha = 0.0 → Pure BM25 (keyword only)
  • alpha = 0.5 → Equal weight to both (balanced)
  • alpha = 1.0 → Pure vector (semantic only)
    ↓
Final Score = (1 - alpha) × normalized_BM25 + alpha × normalized_Vector
    ↓
Rank & Return: Sort by final hybrid score (highest first)
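
The fusion part of this flow can be sketched as below. This is a simplified illustration, not Weaviate's actual implementation: it assumes min-max normalization, treats a document missing from one result list as scoring 0 on that side, and uses made-up scores.

```typescript
type Hit = { id: string; score: number };

// Min-max normalization onto [0, 1]; if all scores are equal, every hit gets 1.
function normalizeHits(hits: Hit[]): Map<string, number> {
  const values = hits.map((h) => h.score);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const out = new Map<string, number>();
  for (const h of hits) {
    out.set(h.id, max === min ? 1 : (h.score - min) / (max - min));
  }
  return out;
}

function hybridFuse(bm25Hits: Hit[], vectorHits: Hit[], alpha: number): Hit[] {
  const bm25 = normalizeHits(bm25Hits);
  const vector = normalizeHits(vectorHits);

  // Union of document ids from both result lists
  const ids = new Set<string>();
  bm25Hits.forEach((h) => ids.add(h.id));
  vectorHits.forEach((h) => ids.add(h.id));

  return Array.from(ids)
    .map((id) => ({
      id,
      // alpha = 0 → BM25 only, alpha = 1 → vector only
      score: (1 - alpha) * (bm25.get(id) ?? 0) + alpha * (vector.get(id) ?? 0),
    }))
    .sort((x, y) => y.score - x.score);
}

// Made-up scores for illustration only.
const bm25Hits: Hit[] = [{ id: "A", score: 20 }, { id: "B", score: 15 }, { id: "C", score: 10 }];
const vectorHits: Hit[] = [{ id: "B", score: 0.5 }, { id: "C", score: 0.4 }, { id: "D", score: 0.3 }];
const fused = hybridFuse(bm25Hits, vectorHits, 0.5);
// fused order: B (0.75), A (0.5), C (0.25), D (0)
```

Document B wins because it scores reasonably on both sides, even though A has the best BM25 score.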

Real Example from Our Test

Query: "Giúp tôi giải thích về quy định chuyển tiếp"

Article 255 Results:

  • BM25 original score: 17.96101 → normalized: 0.40038913
  • Vector original score: 0.44675517 → normalized: 0.4777541
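  • Final hybrid score: 0.8781431913375854 (sum of the two normalized scores)

Note that the reported normalized values appear to already include the alpha weight: with alpha = 0.5, each side can contribute at most 0.5, which is why the final score is their plain sum.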

Why Article 255 ranked #1:

  • Strong BM25 match for keywords "quy định", "chuyển tiếp"
  • Good semantic similarity via vector search
  • Balanced combination with alpha = 0.5

6. Key Differences: Filters vs BM25

Important: Metadata filters and BM25 serve different purposes:

Metadata Filters (Applied First)

  • Purpose: Filter documents by structured metadata before search
  • Example: article_id = '260', chapter_id = 'XVI'
  • When: Applied before any search (vector or BM25)
  • Scope: Database-level filtering

BM25 (Applied During Search)

  • Purpose: Search and rank by keywords within document text content
  • Example: Query "quy định chuyển tiếp" matches keywords in text
  • When: During the search phase (parallel with vector search)
  • Scope: Text content matching and ranking

You still need both: Filters for structured metadata filtering, and hybrid search (BM25 + vector) for better text matching and ranking.
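
The two stages can be sketched as filter first, then rank. The chunk ids, chapter assignments, and texts below are invented for illustration, and the overlap count is only a stand-in for the real BM25 and vector scoring.

```typescript
type LawChunk = { articleId: string; chapterId: string; text: string };
type MetadataFilter = { articleId?: string; chapterId?: string };

// Stage 1: structured filtering on metadata, no relevance involved
function applyFilters(chunks: LawChunk[], filter: MetadataFilter): LawChunk[] {
  return chunks.filter(
    (c) =>
      (filter.articleId === undefined || c.articleId === filter.articleId) &&
      (filter.chapterId === undefined || c.chapterId === filter.chapterId),
  );
}

// Stage 2 stand-in: count query terms found in the text
function keywordScore(text: string, query: string): number {
  const words = text.toLowerCase().split(/\s+/);
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => words.indexOf(term) !== -1).length;
}

// Invented chunks for illustration only.
const chunks: LawChunk[] = [
  { articleId: "art-1", chapterId: "XVI", text: "quy định về hiệu lực thi hành" },
  { articleId: "art-2", chapterId: "XVI", text: "quy định chuyển tiếp về đất đai" },
  { articleId: "art-3", chapterId: "I", text: "quy định chuyển tiếp" },
];

const candidates = applyFilters(chunks, { chapterId: "XVI" });
const ranked = candidates
  .map((c) => ({ articleId: c.articleId, score: keywordScore(c.text, "quy định chuyển tiếp") }))
  .sort((a, b) => b.score - a.score);
// ranked → art-2 (4), art-1 (2); art-3 never reaches the ranking stage
```

The third chunk matches the query text perfectly but is excluded by the filter, which shows why filters and ranking answer different questions.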


7. Score Normalization and Alpha Parameter

Why Normalization Matters

Original scores have different scales:

  • BM25 scores: Unbounded; in our tests roughly 15 to 30 or more (higher = better)
  • Vector scores: In our tests usually 0.3 to 0.5 (higher = better)

Normalization converts both to [0, 1] for fair comparison.
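
A small sketch shows what goes wrong without normalization. The scores are hypothetical values chosen within the ranges above, and min-max scaling is assumed as the normalization method.

```typescript
type Scored = { id: string; bm25: number; vector: number };

// Hypothetical raw scores within the ranges mentioned above.
const hits: Scored[] = [
  { id: "X", bm25: 30, vector: 0.3 },
  { id: "Y", bm25: 15, vector: 0.5 },
  { id: "Z", bm25: 28, vector: 0.48 },
];

// Min-max scaling onto [0, 1]; if all values are equal, each maps to 1.
function minMax(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 1 : (v - min) / (max - min)));
}

// Adding raw scores: BM25 dominates because its scale is about 60x larger.
const rawRanking = hits
  .slice()
  .sort((a, b) => b.bm25 + b.vector - (a.bm25 + a.vector))
  .map((h) => h.id);

// Adding normalized scores: both methods get a comparable say.
const bm25Norm = minMax(hits.map((h) => h.bm25));
const vectorNorm = minMax(hits.map((h) => h.vector));
const normalizedRanking = hits
  .map((h, i) => ({ id: h.id, score: bm25Norm[i] + vectorNorm[i] }))
  .sort((a, b) => b.score - a.score)
  .map((h) => h.id);
```

With raw sums, X ranks first purely on its BM25 score. After normalization, Z ranks first because it is strong on both sides.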

Alpha Controls the Balance

// Configuration
alpha: 0.5,           // Equal weight to both methods
fusionType: 'RelativeScore'  // How scores are combined

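With alpha = 0.5, both normalized scores contribute equally. In general, the fused score is a weighted sum:

Final Score = (1 - alpha) × normalized_BM25 + alpha × normalized_Vector

The normalized values in our results appear to already include these weights, so there the final score is simply their sum:

Final Score ≈ normalized_BM25 + normalized_Vector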
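
To see how alpha shifts the balance, here is a minimal sketch that scores two hypothetical documents at different alpha values. The linear weighting is an assumption consistent with the alpha endpoints described earlier (0 = pure BM25, 1 = pure vector).

```typescript
function hybridScore(bm25Norm: number, vectorNorm: number, alpha: number): number {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError("alpha must be between 0 and 1");
  }
  return (1 - alpha) * bm25Norm + alpha * vectorNorm;
}

// Hypothetical normalized scores: one keyword-heavy and one semantic-heavy document.
const keywordDoc = { bm25: 1.0, vector: 0.2 };
const semanticDoc = { bm25: 0.3, vector: 1.0 };

const sweep = [0.2, 0.5, 0.8].map((alpha) => ({
  alpha,
  keyword: hybridScore(keywordDoc.bm25, keywordDoc.vector, alpha),
  semantic: hybridScore(semanticDoc.bm25, semanticDoc.vector, alpha),
}));
// At alpha = 0.2 the keyword-heavy document wins; at 0.5 and 0.8 the semantic one does.
```

For legal search, a lower alpha may suit queries dominated by article numbers, while a higher alpha may suit explanatory questions.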

Score Analysis from Our Results

Article  BM25 (norm)  Vector (norm)  Final Score  Why It Ranked
255      0.400        0.478          0.878        Best overall balance
253      0.391        0.465          0.856        Good keyword + semantic match
3        0.500        0.352          0.852        Highest BM25, lower vector
254      0.320        0.489          0.809        High vector, lower BM25
260      0.381        0.272          0.654        Moderate on both
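
As a quick sanity check, the final scores above can be recomputed from the two normalized columns. The sketch below uses the rounded values from the table, so small differences in the last digit are expected.

```typescript
// Values copied from the table above (rounded to three decimals).
const rows = [
  { article: "255", bm25: 0.4, vector: 0.478, final: 0.878 },
  { article: "253", bm25: 0.391, vector: 0.465, final: 0.856 },
  { article: "3", bm25: 0.5, vector: 0.352, final: 0.852 },
  { article: "254", bm25: 0.32, vector: 0.489, final: 0.809 },
  { article: "260", bm25: 0.381, vector: 0.272, final: 0.654 },
];

// Largest gap between the recomputed sum and the reported final score
const maxDiff = Math.max(...rows.map((r) => Math.abs(r.bm25 + r.vector - r.final)));

// Ranking obtained by sorting on the recomputed sum
const ranking = rows
  .slice()
  .sort((a, b) => b.bm25 + b.vector - (a.bm25 + a.vector))
  .map((r) => r.article);
// ranking → ["255", "253", "3", "254", "260"]
```

The recomputed sums match the reported scores to within rounding, and the resulting order is the same as in the table.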

8. Can Traditional DBs and Vector DBs Both Use BM25?

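Yes, or something close to it.

  • Full-text engines and relational databases offer keyword ranking:

    • Elasticsearch (BM25 is the default similarity)
    • MySQL FULLTEXT (its own TF-IDF-style ranking rather than BM25 proper)
    • PostgreSQL full-text search (ts_rank, frequency-based rather than BM25 proper)

  • Vector databases often support keyword or sparse retrieval for hybrid search:

    • Weaviate (our example, with native BM25)
    • Pinecone (hybrid through sparse-dense vectors)
    • Qdrant (hybrid through sparse vectors)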

BM25 isn't tied to vectors — it's a ranking function that searches text fields, while vector search uses embedding fields. Both can coexist in the same database.


9. Implementation: asRetriever vs hybridSearch

asRetriever() (LangChain Standard)

store.asRetriever({
  k: 5,
  filter,
  searchType: 'similarity', // or 'mmr'
})

  • Purpose: Converts vector store to LangChain retriever
  • Search types: 'similarity' (vector) or 'mmr' (Max Marginal Relevance)
  • Limitation: Vector-based only, no BM25

hybridSearch() (Weaviate Native)

store.hybridSearch(query, {
  limit: 5,
  filter,
  alpha: 0.5,
  returnMetadata: ['score', 'explainScore'],
})

  • Purpose: Combines BM25 + vector search natively
  • Advantage: True hybrid search with score fusion
  • Control: Alpha parameter for balancing methods

10. Key Takeaways

  • BM25 = keyword relevance (searches text, not vectors)
  • Vector search = semantic meaning (uses embeddings)
  • Hybrid search = parallel execution + score fusion
  • Filters are applied before search (metadata filtering)
  • Score normalization ensures fair comparison between methods
  • Alpha parameter controls the balance between keyword and semantic search
  • Traditional DBs and vector DBs can both offer BM25-style keyword ranking
  • Hybrid search is ideal for law documents where exact references + meaning matter


Legal queries typically mix:

  • Exact article references ("Điều 260")
  • Explanatory intent ("giải thích", "quy định chuyển tiếp")

Pure keyword search (BM25 only):

  • ✅ Gets exact articles and legal references
  • ❌ Lacks semantic understanding
  • ❌ Misses paraphrased or conceptual queries

Pure vector search (semantic only):

  • ✅ Captures meaning and intent
  • ❌ Can miss exact references and article numbers
  • ❌ Less reliable for precise legal citations

The Hybrid Solution

Hybrid search provides accuracy + understanding by:

  • Running both searches in parallel
  • Normalizing scores for fair comparison
  • Fusing results with configurable alpha weighting
  • Ranking by combined relevance

Result: Users get both precise legal references AND contextually relevant content — critical for legal systems where precision and understanding are both essential.

Real-World Impact

In our Vietnamese Land Law example:

  • Query: "Giúp tôi giải thích về quy định chuyển tiếp"
  • BM25 catches keywords: "quy định", "chuyển tiếp"
  • Vector search understands the explanatory intent
  • Hybrid fusion ranks the most relevant transitional provisions first
  • Users get exactly what they need: relevant law articles with proper context

This is why hybrid search is a strong default for legal document retrieval systems.

WRITTEN BY

thongvmdev
