Search isn't just about finding text — it's about finding the right answer.
In legal search, users often type things like:
- "Điều 260 quy định chuyển tiếp" (Article 260, transitional provisions)
- "Quy định tạm thời" (temporary provisions)
- "Khoản 3 Điều 5" (Clause 3, Article 5)
These queries mix exact references with meaningful intent. A search system must handle both.
1. What is BM25?
BM25 (Best Matching 25) is a classic ranking function that scores documents by how well their terms match the keywords in a query.
How BM25 works
BM25 scores a document based on:
- Term Frequency (TF) — how often a word appears in the document
- Inverse Document Frequency (IDF) — how rare a word is across the collection
- Document Length Normalization — adjusts for long vs short documents
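The three ingredients above can be put together in a few lines. This is a minimal sketch of the commonly cited BM25 formula, not the exact implementation of any particular engine. The parameter values `k1 = 1.2` and `b = 0.75` are widely used defaults assumed here, and the toy documents are made up for illustration.

```typescript
// A document is represented as a list of lowercase tokens.
type TokenizedDoc = string[];

// Returns one BM25 score per document (same order as `docs`).
// k1 and b are assumed defaults; real engines let you tune them.
function bm25Scores(
  query: string[],
  docs: TokenizedDoc[],
  k1 = 1.2,
  b = 0.75,
): number[] {
  const n = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / n;
  const terms = Array.from(new Set(query));

  // Document frequency: how many documents contain each query term.
  const df = new Map<string, number>();
  terms.forEach((term) => {
    df.set(term, docs.filter((d) => d.indexOf(term) !== -1).length);
  });

  return docs.map((doc) => {
    // Length normalization: longer-than-average documents are penalized.
    const lengthNorm = 1 - b + b * (doc.length / avgLen);
    let score = 0;
    terms.forEach((term) => {
      const tf = doc.filter((t) => t === term).length; // term frequency
      if (tf === 0) return;
      const dfTerm = df.get(term) ?? 0;
      // IDF: rare terms weigh more; the "1 +" keeps the value non-negative.
      const idf = Math.log(1 + (n - dfTerm + 0.5) / (dfTerm + 0.5));
      score += idf * ((tf * (k1 + 1)) / (tf + k1 * lengthNorm));
    });
    return score;
  });
}

// Toy corpus (hypothetical token lists).
const docs: TokenizedDoc[] = [
  ["điều", "260", "quy", "định", "chuyển", "tiếp"],
  ["quy", "định", "tạm", "thời"],
  ["khoản", "3", "điều", "5"],
];
const scores = bm25Scores(["quy", "định", "chuyển", "tiếp"], docs);
```

The first document contains all four query tokens, including the rarer "chuyển" and "tiếp", so it scores highest. The third document contains none of them and scores 0.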
BM25 Tokenization Process
For a Vietnamese query like "Giúp tôi giải thích về quy định chuyển tiếp" ("Help me explain the transitional provisions"):
- Tokenize: Split on spaces and punctuation → ["Giúp", "tôi", "giải", "thích", "về", "quy", "định", "chuyển", "tiếp"]
- Filter: Remove stop words (optional) → ["giải", "thích", "quy", "định", "chuyển", "tiếp"]
- Search: Find documents containing these tokens in the text field (not vectors)
- Score: Calculate BM25 relevance score for each document
- Rank: Sort by BM25 score
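The tokenize and filter steps above can be sketched as follows. The stop-word list is a hypothetical three-word list chosen to reproduce the example; a real Vietnamese list would be much longer. Lowercasing is an extra step assumed here so that matching is case-insensitive.

```typescript
// Hypothetical stop-word list, just large enough for the example query.
const STOP_WORDS = new Set<string>(["giúp", "tôi", "về"]);

// Split on whitespace and common punctuation, then lowercase.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[\s.,;:!?"'()]+/)
    .filter((token) => token.length > 0);
}

// Optional filtering step: drop tokens that carry little meaning.
function removeStopWords(tokens: string[]): string[] {
  return tokens.filter((token) => !STOP_WORDS.has(token));
}

const tokens = tokenize("Giúp tôi giải thích về quy định chuyển tiếp");
const keywords = removeStopWords(tokens);
// tokens has 9 entries; keywords keeps the 6 that are not stop words.
```

Splitting on spaces breaks the query into single syllables, which is what the example above shows. The remaining tokens are what the search and score steps operate on.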
BM25 is excellent at:
- Exact matches
- Rare keywords (like "Điều 260")
- Keyword-based relevance
But BM25 does not understand meaning or paraphrases.
2. How MySQL LIKE differs from BM25
| Feature | LIKE | BM25 |
|---|---|---|
| Match type | Exact text pattern | Ranked relevance |
| Ranking | ❌ No | ✅ Yes |
| Multiple keywords | Poor | Good |
| Word importance | ❌ No | ✅ Yes |
LIKE just checks if text contains a pattern — it does not rank results by relevance.
BM25 scores documents so the best matches rise to the top.
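To make the contrast concrete, here is a small sketch. `likeMatch` imitates a `LIKE '%pattern%'` check, and `rankByKeywordHits` uses a simple count of matched keywords as a crude stand-in for a BM25 score. Both function names and the toy corpus are assumptions made for illustration.

```typescript
// Toy corpus (hypothetical snippets).
const corpus: string[] = [
  "Khoản 3 Điều 5",
  "Điều 260 quy định chuyển tiếp",
  "Quy định tạm thời",
];

// LIKE '%pattern%': a yes/no substring test; matches come back unranked.
function likeMatch(docs: string[], pattern: string): string[] {
  const needle = pattern.toLowerCase();
  return docs.filter((doc) => doc.toLowerCase().includes(needle));
}

// Ranked search: count matched keywords per document and sort by that count.
// This count is only a stand-in for a real relevance score such as BM25.
function rankByKeywordHits(docs: string[], keywords: string[]): string[] {
  return docs
    .map((doc) => {
      const text = doc.toLowerCase();
      const hits = keywords.filter((k) => text.includes(k.toLowerCase())).length;
      return { doc, hits };
    })
    .filter((entry) => entry.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .map((entry) => entry.doc);
}

// Same words, different order: the substring test finds nothing.
const likeResults = likeMatch(corpus, "chuyển tiếp quy định");
// Keyword ranking still finds both relevant documents, best match first.
const ranked = rankByKeywordHits(corpus, ["quy định", "chuyển tiếp"]);
```

The pattern match fails as soon as the word order changes, while the ranked version returns the document matching both keywords ahead of the one matching only "quy định".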
3. What Vector Search Does
Vector search uses embeddings to represent text as vectors in a high-dimensional space.
- It handles meaning, synonyms, paraphrases
- It finds documents that mean something similar, even with different wording
- Uses cosine similarity to compare query vectors with document vectors
But vector search struggles with:
- Exact codes
- Article numbers
- Legal identifiers
Embeddings don't reliably capture exact numbers or reference strings.
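The cosine similarity comparison mentioned above can be sketched in a few lines. The three-dimensional vectors are toy values chosen for illustration; real embeddings typically have hundreds or thousands of dimensions and come from an embedding model.

```typescript
// Cosine similarity: the dot product divided by the product of the lengths.
// 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (hypothetical, not real embeddings).
const queryVector = [1, 0, 1];
const similarDoc = [2, 0, 2]; // same direction as the query
const unrelatedDoc = [0, 1, 0]; // orthogonal to the query

const simClose = cosineSimilarity(queryVector, similarDoc);
const simFar = cosineSimilarity(queryVector, unrelatedDoc);
```

Because only direction matters, `similarDoc` scores about 1 even though its values are twice as large, and `unrelatedDoc` scores 0.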
4. What is Hybrid Search?
Hybrid search combines BM25 and vector search to get the best of both worlds:
- BM25 — finds documents that contain the right keywords/article numbers
- Vector search — finds documents that are semantically similar
- Both run in parallel, then results are fused using score normalization and weighting
This works exceptionally well for law documents where numeric references and meaning both matter.
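A rough sketch of the "run both, then combine" step is shown below. The `Hit`, `Searcher`, and `MergedHit` types and both function names are hypothetical, and treating a missing score as 0 is an assumption of this sketch rather than a documented behavior of any engine.

```typescript
// Hypothetical shapes; a real client returns richer result objects.
type Hit = { id: string; score: number };
type Searcher = (query: string) => Promise<Hit[]>;
type MergedHit = { id: string; bm25: number; vector: number };

// Union the two result lists by document id. A document missing from one
// list gets 0 for that method (an assumption made for this sketch).
function mergeHits(bm25Hits: Hit[], vectorHits: Hit[]): MergedHit[] {
  const byId = new Map<string, MergedHit>();
  const entry = (id: string): MergedHit => {
    let merged = byId.get(id);
    if (!merged) {
      merged = { id, bm25: 0, vector: 0 };
      byId.set(id, merged);
    }
    return merged;
  };
  bm25Hits.forEach((hit) => {
    entry(hit.id).bm25 = hit.score;
  });
  vectorHits.forEach((hit) => {
    entry(hit.id).vector = hit.score;
  });
  return Array.from(byId.values());
}

// Start both searches at once and wait for both before merging.
async function searchBoth(
  query: string,
  bm25Search: Searcher,
  vectorSearch: Searcher,
): Promise<MergedHit[]> {
  const [bm25Hits, vectorHits] = await Promise.all([
    bm25Search(query),
    vectorSearch(query),
  ]);
  return mergeHits(bm25Hits, vectorHits);
}

// Made-up scores for three article ids.
const merged = mergeHits(
  [{ id: "260", score: 0.9 }, { id: "255", score: 0.7 }],
  [{ id: "255", score: 0.8 }, { id: "254", score: 0.6 }],
);
```

Only article 255 appears in both lists here, so it is the only entry with two non-zero scores. Documents found by just one method are kept rather than dropped, which is how an exact-reference match can still surface when the embedding misses it.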
5. Hybrid Search Flow (Complete Process)
    User Query: "Giúp tôi giải thích về quy định chuyển tiếp"
        ↓
    Apply Metadata Filters (if any): article_id = '260', chapter_id = 'XVI'
        ↓
    ┌─────────────────────────────────────────────────────────────┐
    │ PARALLEL SEARCH (both run simultaneously)                   │
    ├─────────────────────────────────────────────────────────────┤
    │                                                             │
    │ BM25 Search:                                                │
    │   • Tokenize: ["giải", "thích", "quy", "định",              │
    │                "chuyển", "tiếp"]                            │
    │   • Search TEXT field (not vectors)                         │
    │   • Calculate BM25 scores (TF-IDF + doc length)             │
    │   • Rank by keyword relevance                               │
    │   • Return top-k documents with BM25 scores                 │
    │                                                             │
    │ Vector Search:                                              │
    │   • Convert query to embedding vector                       │
    │   • Compare with document vectors (cosine similarity)       │
    │   • Calculate similarity scores                             │
    │   • Rank by semantic similarity                             │
    │   • Return top-k documents with vector scores               │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘
        ↓
    Normalize Scores: Both BM25 & Vector scores → [0, 1] range
        ↓
    Score Fusion with Alpha Parameter:
      • alpha = 0.0 → Pure BM25 (keyword only)
      • alpha = 0.5 → Equal weight to both (balanced)
      • alpha = 1.0 → Pure vector (semantic only)
        ↓
    Final Score = combine(normalized_BM25, normalized_Vector)
        ↓
    Rank & Return: Sort by final hybrid score (highest first)
Real Example from Our Test
Query: "Giúp tôi giải thích về quy định chuyển tiếp"
Article 255 Results:
- BM25 original score: 17.96101 → normalized: 0.40038913
- Vector original score: 0.44675517 → normalized: 0.4777541
- Final hybrid score: 0.8781431913375854 (sum of normalized scores)
Why Article 255 ranked #1:
- Strong BM25 match for keywords "quy định", "chuyển tiếp"
- Good semantic similarity via vector search
- Balanced combination with alpha = 0.5
6. Key Differences: Filters vs BM25
Important: Metadata filters and BM25 serve different purposes:
Metadata Filters (Applied First)
- Purpose: Filter documents by structured metadata before search
- Example: article_id = '260', chapter_id = 'XVI'
- When: Applied before any search (vector or BM25)
- Scope: Database-level filtering
BM25 Keyword Search (During Search)
- Purpose: Search and rank by keywords within document text content
- Example: Query "quy định chuyển tiếp" matches keywords in text
- When: During the search phase (parallel with vector search)
- Scope: Text content matching and ranking
You still need both: Filters for structured metadata filtering, and hybrid search (BM25 + vector) for better text matching and ranking.
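The ordering described in this section (filter on metadata first, then search the text of what is left) can be sketched as below. The `LawChunk` shape, the function names, and the sample chunks are hypothetical. A plain substring test stands in for BM25 here, so the example shows only the order of operations, not the ranking.

```typescript
// Hypothetical chunk shape; field names mirror the filter example above.
type LawChunk = { article_id: string; chapter_id: string; text: string };
type MetadataFilter = { article_id?: string; chapter_id?: string };

// Step 1: structured filtering on metadata, before any text search.
function applyFilter(chunks: LawChunk[], filter: MetadataFilter): LawChunk[] {
  return chunks.filter(
    (chunk) =>
      (filter.article_id === undefined || chunk.article_id === filter.article_id) &&
      (filter.chapter_id === undefined || chunk.chapter_id === filter.chapter_id),
  );
}

// Step 2: keyword matching on the text of the chunks that survived step 1.
function keywordMatch(chunks: LawChunk[], keyword: string): LawChunk[] {
  const needle = keyword.toLowerCase();
  return chunks.filter((chunk) => chunk.text.toLowerCase().includes(needle));
}

// Made-up chunks for illustration.
const chunks: LawChunk[] = [
  { article_id: "260", chapter_id: "XVI", text: "Quy định chuyển tiếp" },
  { article_id: "255", chapter_id: "XVI", text: "Quy định tạm thời" },
  { article_id: "5", chapter_id: "I", text: "Quy định chuyển tiếp" },
];

const inChapter = applyFilter(chunks, { chapter_id: "XVI" });
const matches = keywordMatch(inChapter, "chuyển tiếp");
const unfilteredMatches = keywordMatch(chunks, "chuyển tiếp");
```

Without the filter, the keyword matches two chunks. With the chapter filter applied first, only the chunk for article 260 remains, which is the kind of narrowing that text search alone cannot guarantee.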
7. Score Normalization and Alpha Parameter
Why Normalization Matters
Original scores have different scales:
- BM25 scores: Unbounded, and can reach 15-30+ in our results (higher = better)
- Vector scores: Around 0.3-0.5 in our results (higher = better)
Normalization converts both to [0, 1] for fair comparison.
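One common way to do this is min-max scaling within each result list, sketched below. This is an assumption about the general technique, not a statement of how a specific engine normalizes internally. The raw scores are made-up values in the ranges mentioned above, and mapping an all-equal list to 1 is a choice made for this sketch.

```typescript
// Min-max scaling: the best score in the list becomes 1, the worst becomes 0.
function minMaxNormalize(scores: number[]): number[] {
  if (scores.length === 0) return [];
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  // If every score is identical there is no spread to scale by;
  // treating them all as 1 is a choice made for this sketch.
  if (max === min) return scores.map(() => 1);
  return scores.map((s) => (s - min) / (max - min));
}

// Hypothetical raw scores on very different scales.
const rawBm25 = [30, 22.5, 15];
const rawVector = [0.5, 0.3, 0.4];

const normBm25 = minMaxNormalize(rawBm25); // 1, 0.5, 0
const normVector = minMaxNormalize(rawVector); // approximately 1, 0, 0.5
```

After scaling, a BM25 score of 22.5 and a vector score of 0.4 both land near 0.5, so neither method dominates simply because its raw numbers are larger.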
Alpha Controls the Balance
    // Configuration
    alpha: 0.5,                  // Equal weight to both methods
    fusionType: 'RelativeScore'  // How scores are combined
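With alpha = 0.5, both normalized scores contribute equally. In general the weighting is (1 - alpha) for BM25 and alpha for vector search. The per-method values reported in our results appear to be already weighted this way (the highest BM25 value in the table below is exactly 0.500), so the final score is simply their sum:
Final Score ≈ normalized_BM25 + normalized_Vector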
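The role of alpha can be sketched as a weighted sum of the two normalized scores. The formula below is an assumption that matches the alpha endpoints listed in the flow (0.0 is pure BM25, 1.0 is pure vector); the engine's exact fusion may differ in detail. The function name `fuse` and the input values are hypothetical.

```typescript
// Weighted sum of two scores that are already normalized to [0, 1].
// alpha = 0 keeps only BM25, alpha = 1 keeps only the vector score.
function fuse(normBm25: number, normVector: number, alpha: number): number {
  if (alpha < 0 || alpha > 1) {
    throw new RangeError("alpha must be between 0 and 1");
  }
  return (1 - alpha) * normBm25 + alpha * normVector;
}

// Hypothetical normalized scores for one document.
const keywordOnly = fuse(0.8, 0.4, 0); // 0.8, the BM25 score
const semanticOnly = fuse(0.8, 0.4, 1); // 0.4, the vector score
const balanced = fuse(0.8, 0.4, 0.5); // about 0.6, halfway between
```

Lowering alpha favors documents that match exact keywords such as article numbers, while raising it favors documents that are close in meaning.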
Score Analysis from Our Results
| Article | BM25 (norm) | Vector (norm) | Final Score | Why It Ranked |
|---|---|---|---|---|
| 255 | 0.400 | 0.478 | 0.878 | Best overall balance |
| 253 | 0.391 | 0.465 | 0.856 | Good keyword + semantic match |
| 3 | 0.500 | 0.352 | 0.852 | Highest BM25, lower vector |
| 254 | 0.320 | 0.489 | 0.809 | High vector, lower BM25 |
| 260 | 0.381 | 0.272 | 0.654 | Moderate on both |
8. Can Traditional DBs and Vector DBs Both Use BM25?
Yes.
Traditional databases use BM25 (or similar full-text ranking):
- Elasticsearch
- MySQL FULLTEXT
- PostgreSQL full-text search
Vector databases often support BM25 as well for hybrid search:
- Weaviate (our example)
- Pinecone
- Qdrant
BM25 isn't tied to vectors — it's a ranking function that searches text fields, while vector search uses embedding fields. Both can coexist in the same database.
9. Implementation: asRetriever vs hybridSearch
asRetriever() (LangChain Standard)
    store.asRetriever({
      k: 5,
      filter,
      searchType: 'similarity', // or 'mmr'
    })
- Purpose: Converts vector store to LangChain retriever
- Search types: 'similarity' (vector) or 'mmr' (Maximal Marginal Relevance)
- Limitation: Vector-based only, no BM25
hybridSearch() (Weaviate Native)
    store.hybridSearch(query, {
      limit: 5,
      filter,
      alpha: 0.5,
      returnMetadata: ['score', 'explainScore'],
    })
- Purpose: Combines BM25 + vector search natively
- Advantage: True hybrid search with score fusion
- Control: Alpha parameter for balancing methods
10. Key Takeaways
✅ BM25 = keyword relevance (searches text, not vectors)
✅ Vector search = semantic meaning (uses embeddings)
✅ Hybrid search = parallel execution + score fusion
✅ Filters are applied before search (metadata filtering)
✅ Score normalization ensures fair comparison between methods
✅ Alpha parameter controls the balance between keyword and semantic search
✅ Traditional DBs and vector DBs can both use BM25
✅ Hybrid search is ideal for law documents where exact references + meaning matter
11. Why This Matters for Legal Search
Legal queries often mix:
- Exact article references ("Điều 260")
- Explanatory intent ("giải thích", "quy định chuyển tiếp")
The Problem with Single-Method Search
Pure keyword search (BM25 only):
- ✅ Gets exact articles and legal references
- ❌ Lacks semantic understanding
- ❌ Misses paraphrased or conceptual queries
Pure vector search (semantic only):
- ✅ Captures meaning and intent
- ❌ Can miss exact references and article numbers
- ❌ Less reliable for precise legal citations
The Hybrid Solution
Hybrid search provides accuracy + understanding by:
- Running both searches in parallel
- Normalizing scores for fair comparison
- Fusing results with configurable alpha weighting
- Ranking by combined relevance
Result: Users get both precise legal references AND contextually relevant content — critical for legal systems where precision and understanding are both essential.
Real-World Impact
In our Vietnamese Land Law example:
- Query: "Giúp tôi giải thích về quy định chuyển tiếp"
- BM25 catches keywords: "quy định", "chuyển tiếp"
- Vector search understands the explanatory intent
- Hybrid fusion ranks the most relevant transitional provisions first
- Users get exactly what they need: relevant law articles with proper context
This is why hybrid search is a strong default for legal document retrieval systems.
