A comprehensive guide to understanding how Chat LangChain ingests, processes, and stores documents for semantic search using local Weaviate with Docker.
Learning Journey: This is part of my exploration of the Chat LangChain open-source project.
Source Repository: langchain-ai/chat-langchain
Practice Repository: thongvmdev/chat-langchain-practice
Problem & Solution
The Challenge
While learning RAG systems from the Chat LangChain project, I encountered two major obstacles:
- Cloud Weaviate Limitations
  - Free tier cluster expires after 14 days
  - Not suitable for continuous learning and experimentation
  - Need to recreate clusters repeatedly
  - Limited storage and query capacity
- OpenAI Embedding Costs
  - OpenAI embeddings charge per token
  - For learning purposes, re-running ingestion multiple times adds up quickly
  - Example: Ingesting 500 documents × 3 practice runs = significant costs
  - Not cost-effective for experimentation and iteration
The Solution: Local Docker Stack
I solved both problems by switching to a completely local, zero-cost setup:
Architecture Changes:
- Cloud Weaviate Cluster (14-day limit) → Local Weaviate Docker Container (unlimited)
- OpenAI Embeddings ($$$) → Ollama + nomic-embed-text Model (free!)
- External API dependencies → 100% local Docker Compose setup
Benefits:
- ✅ Zero Cost: No API fees, no cloud limits
- ✅ Privacy: All data stays on your machine
- ✅ Unlimited Learning: Run ingestion as many times as needed
- ✅ Offline Capable: Work without internet connection
- ✅ Fast Iteration: No rate limits or network latency
Technical Stack:
- PostgreSQL (Record Manager): Tracks indexed documents for deduplication
- Weaviate (Vector Store): Stores embeddings and document metadata
- Ollama (Embedding Service): Generates vectors using nomic-embed-text
  - 768-dimensional vectors
  - 2K token context window
  - Runs locally on CPU/GPU
Setup: See docker-compose.yml for the complete configuration
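Before running anything, I like to confirm all three services are actually reachable. This little check is my own addition (not part of the repo) and uses only the Python standard library; the ports match the stack listed above, and the two HTTP routes are Weaviate's standard readiness endpoint and Ollama's model-listing endpoint.

# Sanity check: are PostgreSQL, Weaviate, and Ollama up on their default ports?
import socket
import urllib.request

def port_open(host: str, port: int) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

# PostgreSQL (record manager) just needs to accept connections
print("PostgreSQL :5432 ", "up" if port_open("localhost", 5432) else "down")

# Weaviate exposes a readiness endpoint
try:
    urllib.request.urlopen("http://localhost:8080/v1/.well-known/ready", timeout=2)
    print("Weaviate   :8080  up")
except OSError:
    print("Weaviate   :8080  down")

# Ollama lists pulled models at /api/tags (nomic-embed-text should appear here)
try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2) as resp:
        print("Ollama     :11434 up:", resp.read()[:80], "...")
except OSError:
    print("Ollama     :11434 down")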
What is Document Ingestion?
Document ingestion converts raw documents (web pages) into searchable vectors stored in a database. This enables AI assistants to search by meaning rather than keywords.
Docker Infrastructure
Three-Container Architecture
┌────────────────────────────────────────────────────────────┐
│                   Docker Compose Network                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   ┌──────────────────┐         ┌──────────────────┐        │
│   │    PostgreSQL    │         │     Weaviate     │        │
│   │      :5432       │         │  :8080  :50051   │        │
│   │                  │         │                  │        │
│   │  Record Manager  │         │   Vector Store   │        │
│   │  (Deduplication) │         │                  │        │
│   └──────────────────┘         └──────────────────┘        │
│                                                            │
│          ┌────────────────────────┐                        │
│          │         Ollama         │                        │
│          │         :11434         │                        │
│          │                        │                        │
│          │    nomic-embed-text    │                        │
│          │  (2K context, 768 dim) │                        │
│          └────────────────────────┘                        │
└────────────────────────────────────────────────────────────┘
Container Roles
1. PostgreSQL (:5432)
- Tracks indexed documents for deduplication
- Stores content hashes and timestamps
- Prevents duplicate processing
2. Weaviate (:8080)
- Vector database server
- Stores vectors + text + metadata
- No built-in vectorizer (uses external embeddings)
3. Ollama (:11434)
- Embedding model: nomic-embed-text
- Converts text → 768-dimensional vectors
- 2K token context window
- Called directly by Python via LangChain
High-Level System Flow
Ingestion Process
[Web Scraping]
  → Load 32 pages from js.langchain.com
[Text Splitting]
  → Split into ~120 chunks (4000 chars each)
[Python: ingest.py]
  → Send text to Ollama
[Ollama Server]
  → Generate 768-dim vectors with nomic-embed-text
[Python: ingest.py]
  → Send vectors + text to Weaviate
[Weaviate Server]
  → Store: {text, vector, source, title}
[Record Manager (PostgreSQL)]
  → Track: source URL + content hash
  └─► Deduplication on next run
Retrieval Process
[User Query]
  → "What is LangChain?"
[Python: retrieval.py]
  → retriever.invoke(query)
[Ollama Server]
  → Generate query vector with nomic-embed-text
[Python: retrieval.py]
  → Send query vector to Weaviate
[Weaviate Server]
  → Find similar vectors (cosine similarity)
  → Return top-k documents with distance scores
[Python: retrieval.py]
  └─► List[Document] with metadata
Complete Ingestion Flow
Step-by-Step Explanation
Step 1: Load Documents from Web
# Load from sitemaps
langchain_python_docs = load_langchain_python_docs()
langchain_js_docs = load_langchain_js_docs()
aggregated_site_docs = load_aggregated_docs_site()
What does load_langchain_js_docs() return?
A list of LangChain Document objects loaded from the sitemap:
[
    Document(
        page_content="LangChain.js\n\nIntroduction\n\nLangChain is a framework for developing applications powered by language models...",
        metadata={
            "source": "https://js.langchain.com/docs/get_started/introduction",
            "title": "Introduction | 🦜️🔗 LangChain.js",
            "description": "LangChain is a framework for developing applications powered by language models.",
            "language": "en",
            "loc": "https://js.langchain.com/docs/get_started/introduction",
            "lastmod": "2024-10-15T10:30:00+00:00",
            "changefreq": "weekly",
            "priority": 0.8
        }
    ),
    Document(
        page_content="Installation\n\nTo get started with LangChain.js, install the package...",
        metadata={
            "source": "https://js.langchain.com/docs/get_started/installation",
            "title": "Installation | 🦜️🔗 LangChain.js",
            ...
        }
    ),
    # ... hundreds more documents from the sitemap
]
Key Details:
- Return Type: List[Document] (typically 100-500 documents)
- Filtering: Only URLs matching https://js.langchain.com/docs/ are included
- Content: Clean text extracted via simple_extractor() (HTML tags removed, excessive newlines collapsed)
- Metadata: Source URL, title, description, language from HTML elements
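For context, here is roughly what a loader like load_langchain_js_docs() does under the hood. The real implementation lives in backend/ingest.py; this is a sketch assuming LangChain's SitemapLoader, and the sitemap URL shown is illustrative.

# Sketch of a sitemap-based loader (illustrative, not the repo's exact code)
from langchain_community.document_loaders import SitemapLoader

def load_langchain_js_docs():
    loader = SitemapLoader(
        web_path="https://js.langchain.com/sitemap.xml",     # assumed sitemap location
        filter_urls=["https://js.langchain.com/docs/"],       # keep only the docs section
        parsing_function=simple_extractor,                     # repo helper: strips HTML, collapses newlines
    )
    return loader.load()   # → List[Document] with source/title/loc metadata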
Step 2: Split Documents into Chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,    # Max 4000 characters per chunk
    chunk_overlap=400   # 400 characters overlap between chunks
)
docs_transformed = text_splitter.split_documents(general_guides_and_tutorials_docs)

# Filter out tiny chunks
docs_transformed = [doc for doc in docs_transformed if len(doc.page_content) > 10]
Transformation Process:
Before Splitting:
# 1 document (8000 characters)
Document(
    page_content="LangChain.js\n\nIntroduction\n\nLangChain is a framework...",  # 8000 chars
    metadata={
        "source": "https://js.langchain.com/docs/intro",
        "title": "Introduction | 🦜️🔗 LangChain.js"
    }
)
After Splitting:
# Becomes 3 smaller chunks
[
    Document(
        page_content="LangChain.js\n\nIntroduction\n\nLangChain is a framework...",  # 4000 chars
        metadata={
            "source": "https://js.langchain.com/docs/intro",   # Same!
            "title": "Introduction | 🦜️🔗 LangChain.js"        # Same!
        }
    ),
    Document(
        page_content="...powered by language models\n\nThe main value props...",  # 4000 chars, 400-char overlap
        metadata={
            "source": "https://js.langchain.com/docs/intro",   # Same!
            "title": "Introduction | 🦜️🔗 LangChain.js"        # Same!
        }
    ),
    Document(
        page_content="...for accomplishing specific tasks...",  # remaining ~800 chars, includes 400-char overlap
        metadata={
            "source": "https://js.langchain.com/docs/intro",   # Same!
            "title": "Introduction | 🦜️🔗 LangChain.js"        # Same!
        }
    )
]
Key Points:
- Input: ~100-500 large documents (potentially 5000+ chars each)
- Output: ~500-2000+ smaller chunks (max 4000 chars each)
- Metadata preserved: All chunks keep the same source URL (critical for record management!)
- 400-char overlap: Ensures context isn't lost at boundaries
Why split documents?
- Embedding models have size limits
- Smaller chunks = more precise search results
- Overlap ensures context isn't lost between chunks
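To see the overlap behaviour concretely, here is a tiny sketch with toy sizes (not the real 4000/400 configuration), assuming the langchain-text-splitters package is installed:

# Toy illustration of chunking with overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
text = "LangChain is a framework for developing applications powered by language models."

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {chunk!r}")
# Consecutive chunks share up to 10 characters, so a phrase cut at a
# boundary is still partially present at the start of the next chunk.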
Step 3: Generate Embeddings (Server-Side)
# Python code generates embeddings via Ollama, then stores in Weaviate
vectorstore.add_documents([
    Document(page_content="LangChain is a framework...", metadata={...})
])

# What happens internally:
# 1. Python calls Ollama at localhost:11434
# 2. Ollama/nomic-embed-text returns 768-dim vector [0.023, -0.015, ..., 0.008]
# 3. Python sends {text, vector, metadata} to Weaviate
# 4. Weaviate stores the complete document
What is a vector?
- Mathematical representation of text meaning
- 768 floating-point numbers (nomic-embed-text)
- Similar texts have similar vectors
- Enables "meaning-based" search
- 2K token context window for better understanding
- Generated locally via Ollama (no API costs!)
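You can generate one of these vectors yourself to get a feel for it. A minimal sketch, assuming the langchain-ollama package and a pulled nomic-embed-text model:

# Generate a single embedding directly via Ollama and inspect it
from langchain_ollama import OllamaEmbeddings

embedding = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")

vec = embedding.embed_query("LangChain is a framework for building LLM applications.")
print(len(vec))    # 768 dimensions
print(vec[:5])     # e.g. [0.023, -0.015, ...] — exact values vary

# Semantically similar sentences produce nearby vectors; unrelated ones do not.
vec_similar = embedding.embed_query("LangChain helps you build apps powered by language models.")
vec_unrelated = embedding.embed_query("Bananas are rich in potassium.")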
Step 4: Check Record Manager & Smart Deduplication
The Record Manager uses source_id_key="source" to group chunks by their source URL and track changes.
# For each chunk after splitting
for doc in docs_transformed:
    # Extract document key (unique identifier)
    doc_key = doc.metadata["source"]  # e.g., "https://js.langchain.com/docs/intro"

    # Generate content hash
    content_hash = hash(doc.page_content + doc_key)  # e.g., "a3f5c9b2e8d1..."

    # Check if already indexed
    record = record_manager.get(key=doc_key)
How Record Management Works:
Real Example - First Run:
# URL: https://js.langchain.com/docs/intro (8000 chars)
# After splitting → 3 chunks

# Record Manager stores:
{
    "group_id": "https://js.langchain.com/docs/intro",  # source_id_key
    "doc_hashes": [
        "hash_abc123",  # Chunk 1 hash
        "hash_def456",  # Chunk 2 hash
        "hash_ghi789"   # Chunk 3 hash
    ],
    "vector_ids": ["vec_1", "vec_2", "vec_3"],
    "updated_at": "2024-10-25T10:00:00"
}
Real Example - Second Run (Content Changed):
# Same URL, but website content updated
# After splitting → 3 NEW chunks (different content)

# Record Manager compares:
#   - Groups by: "https://js.langchain.com/docs/intro"
#   - Old hashes: [hash_abc123, hash_def456, hash_ghi789]
#   - New hashes: [hash_xyz111, hash_xyz222, hash_xyz333]
#   - Result: NO MATCHES!

# Actions:
#   1. DELETE old vectors: vec_1, vec_2, vec_3 from Weaviate
#   2. INSERT new vectors: vec_4, vec_5, vec_6 to Weaviate
#   3. UPDATE record:
{
    "group_id": "https://js.langchain.com/docs/intro",
    "doc_hashes": [
        "hash_xyz111",  # NEW Chunk 1 hash
        "hash_xyz222",  # NEW Chunk 2 hash
        "hash_xyz333"   # NEW Chunk 3 hash
    ],
    "vector_ids": ["vec_4", "vec_5", "vec_6"],  # NEW vector IDs
    "updated_at": "2024-10-26T14:30:00"
}
Key Insight: Grouping by source_id_key
The critical parameter is source_id_key="source" in the index function:
indexing_stats = index(
    docs_transformed,
    record_manager,
    vectorstore,
    cleanup="full",
    source_id_key="source"  # ← Groups all chunks by their 'source' URL!
)
This means:
- All chunks from the same URL are treated as a group
- When content changes, all old chunks are deleted and all new chunks are inserted
- Record Manager tracks at the source URL level, not individual chunk level
Three possible outcomes:
A. New Document (Not Found)
# Record doesn't exist
if record is None:
    # → Generate embedding
    # → Insert to vector DB
    # → Record in Record Manager
    status = "ADDED"
B. Unchanged Document (Hash Matches)
# Record exists, hash matches
if record.content_hash == content_hash:
    # → Skip! No processing needed
    # → Save API costs and time
    status = "SKIPPED"
C. Updated Document (Hash Different)
# Record exists, but content changed
if record.content_hash != content_hash:
    # → Generate new embedding
    # → Update vector DB
    # → Update Record Manager
    status = "UPDATED"
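The hash is just a stable fingerprint of a chunk. LangChain's indexing API computes its own hash internally (over page content plus metadata), so the sketch below is only meant to illustrate the idea using hashlib, not the exact algorithm:

# Illustrative content hash: same text + source → same fingerprint
import hashlib

def content_hash(page_content: str, source: str) -> str:
    """Stable fingerprint for a chunk (illustrative, not LangChain's internal hashing)."""
    return hashlib.sha256(f"{source}:{page_content}".encode("utf-8")).hexdigest()[:20]

old = content_hash("LangChain is a framework...", "https://js.langchain.com/docs/intro")
new = content_hash("LangChain is a framework for building...", "https://js.langchain.com/docs/intro")

print(old == new)   # False → content changed → chunk is re-embedded and re-stored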
Step 5: Store in Vector Database
# What gets stored in Weaviate
{
    "_id": "uuid-abc-123",
    "vector": [0.023, -0.015, 0.034, ..., 0.008],  # 768 numbers (nomic-embed-text)
    "text": "LangChain is a framework for developing...",
    "source": "https://python.langchain.com/docs/intro",
    "title": "Introduction to LangChain"
}
Step 6: Record in Record Manager
# What gets stored in PostgreSQL
{
    "key": "https://python.langchain.com/docs/intro",
    "content_hash": "a3f5c9b2e8d1a4f7c3e9",
    "updated_at": "2024-10-25 10:30:00",
    "namespace": "weaviate/LangChain_general_guides"
}
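If you want to peek at what the Record Manager is tracking, SQLRecordManager exposes a list_keys() method. A small sketch; the namespace string below is a placeholder, so use the exact one from backend/ingest.py, and the connection string matches the docker-compose Postgres instance:

# Inspect the Record Manager's tracked keys (one per indexed chunk)
from langchain.indexes import SQLRecordManager

record_manager = SQLRecordManager(
    "weaviate/LangChain_General_Guides_And_Tutorials",  # placeholder namespace
    db_url="postgresql://postgres:zkdtn1234@localhost:5432/chat_langchain",
)

keys = record_manager.list_keys()
print(f"{len(keys)} chunks tracked")
print(keys[:3])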
How Search Works After Ingestion
User Query: "How does memory work?"
  ↓
Python: retriever.invoke(query)
  → Embeds the query via Ollama (nomic-embed-text)
  ↓
Ollama Server
  → Returns query vector [0.019, -0.013, ...]
  ↓
Python: retrieval.py
  → Sends query vector to Weaviate
  ↓
Weaviate Server
  → Calculates cosine similarity with all stored vectors
  → Memory doc: distance = 0.12 (close!)
  → Chains doc: distance = 0.89 (far)
  ↓
Python Code
  → Receives top-k documents with distance scores
  └─► Returns: List[Document]
Example:
from langchain_core.runnables import RunnableConfig
from backend.retrieval import make_retriever

config = RunnableConfig(configurable={"search_kwargs": {"k": 3}})

with make_retriever(config) as retriever:
    # Query is embedded via Ollama, then matched against Weaviate's stored vectors
    docs = retriever.invoke("How does memory work in LangChain?")

    for doc in docs:
        print(f"Title: {doc.metadata['title']}")
        print(f"Distance: {doc.metadata['distance']:.4f}")
        print(f"Content: {doc.page_content[:200]}...")
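The distance numbers Weaviate returns are cosine distances (1 minus cosine similarity), computed server-side. A small plain-Python sketch with toy 3-dimensional vectors (real ones are 768-dimensional) to show what "close" and "far" mean:

# Cosine distance = 1 - cosine similarity
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

query_vec      = [0.10, 0.30, 0.50]
memory_doc_vec = [0.12, 0.28, 0.52]   # similar direction → small distance
chains_doc_vec = [0.90, -0.20, 0.10]  # different direction → large distance

print(cosine_distance(query_vec, memory_doc_vec))  # ≈ 0.00 (close)
print(cosine_distance(query_vec, chains_doc_vec))  # ≈ 0.85 (far)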
Record Management Summary
How Chunks Are Tracked Across Runs
Key Concepts
1. One Source URL ā Multiple Chunks ā One Record Group
URL: https://js.langchain.com/docs/intro
├── Chunk 1 (hash: abc123)
├── Chunk 2 (hash: def456)
└── Chunk 3 (hash: ghi789)

Record Manager Entry:
{
  group_id: "https://js.langchain.com/docs/intro",
  hashes: ["abc123", "def456", "ghi789"]
}
2. Comparison Happens at Group Level
- Record Manager groups all chunks by source_id_key="source"
- When the same URL is re-ingested:
  - Compare: Old hash list vs New hash list
  - If ANY hash is different → Update ALL chunks in that group
  - If all hashes match → Skip ALL chunks in that group
3. cleanup="full" Removes Deleted Sources
# Run 1: 3 URLs
#   - docs/intro   → 3 chunks
#   - docs/install → 2 chunks
#   - docs/guide   → 4 chunks

# Run 2: docs/install removed from sitemap
#   - docs/intro → 3 chunks (unchanged, SKIP)
#   - docs/guide → 4 chunks (unchanged, SKIP)
#   → docs/install automatically DELETED from vector DB
4. Time Savings Through Smart Tracking
First Run:
- 32 URLs → 465 chunks
- Generate 465 embeddings (via Ollama)
- Time: ~8 seconds

Second Run (no changes):
- 32 URLs → 465 chunks
- Generate 0 embeddings (all skipped!)
- Time: ~1 second
- Savings: 87.5% time! ⚡

Third Run (5 URLs changed):
- 27 URLs unchanged → 390 chunks SKIPPED
- 5 URLs changed → 75 chunks UPDATED
- Generate 75 embeddings (via Ollama)
- Time: ~2 seconds
- Savings: 75% time! ⚡
Real-World Example with Numbers
Scenario: Ingesting LangChain JS Documentation
First Run (Initial Load):
# Documents loaded: 32 pages (from sitemap)
# After splitting: ~120 chunks (4000 chars each)
# Embeddings generated: ~120 (via Ollama, no cost!)
# Time: ~10-15 seconds (local generation)

indexing_stats = {
    'num_added': 120,
    'num_updated': 0,
    'num_skipped': 0,
    'num_deleted': 0
}
Second Run (No Changes):
# Documents loaded: 32 pages (same)
# After splitting: ~120 chunks (same)
# Embeddings generated: 0 (all skipped!)
# Time: ~1 second

indexing_stats = {
    'num_added': 0,
    'num_updated': 0,
    'num_skipped': 120,   # All documents unchanged
    'num_deleted': 0
}
Third Run (5 Pages Updated):
# New documents: 0
# Updated documents: 5 pages → ~19 chunks
# Unchanged: 27 pages → ~101 chunks
# Embeddings generated: 19 (only updated chunks)
# Time: ~3 seconds

indexing_stats = {
    'num_added': 0,
    'num_updated': 19,    # Changed chunks
    'num_skipped': 101,   # Unchanged chunks
    'num_deleted': 0
}
Why This System is Smart
1. Local & Private
- All processing happens locally in Docker containers
- Ollama runs embeddings on your machine
- Your data never leaves your machine
- No API costs for embeddings!
- 768-dimensional vectors for better semantic understanding
2. No Duplicates
- Without Record Manager: Each run adds duplicates
- Run 1: 120 vectors
- Run 2: 240 vectors (50% duplicates!)
- Run 3: 360 vectors (67% duplicates!)
- With Record Manager: Always exactly the right number
- Run 1: 120 vectors
- Run 2: 120 vectors (same)
- Run 3: 120 vectors (same)
3. Speed & Efficiency
- Skip unchanged documents (no re-processing)
- Only generate embeddings when content changes
- 85% faster on subsequent runs
- Larger chunks (4000 chars) = fewer vectors to process
- First run: ~10-15 seconds for 32 pages ā ~120 vectors
- Second run (unchanged): ~1 second (all skipped)
4. Automatic Cleanup
Using cleanup="full" removes deleted documents:
# If a page is removed from the source website
# Record Manager detects: "This was indexed before but not in current batch"
# → Automatically deletes from Vector DB
# → Removes from Record Manager
# → Keeps database clean and up-to-date
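You can watch cleanup="full" do this with throwaway in-memory components instead of Ollama/Weaviate/Postgres. A minimal sketch, assuming a recent langchain-core where InMemoryVectorStore and DeterministicFakeEmbedding support the indexing API:

# Demonstrate cleanup="full" with fake embeddings and an in-memory store
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_core.embeddings import DeterministicFakeEmbedding
from langchain_core.vectorstores import InMemoryVectorStore

vectorstore = InMemoryVectorStore(DeterministicFakeEmbedding(size=768))
record_manager = SQLRecordManager("demo/cleanup", db_url="sqlite:///:memory:")
record_manager.create_schema()

run1 = [
    Document(page_content="intro page", metadata={"source": "docs/intro"}),
    Document(page_content="install page", metadata={"source": "docs/install"}),
]
print(index(run1, record_manager, vectorstore, cleanup="full", source_id_key="source"))
# expected: 2 added

run2 = [run1[0]]  # docs/install disappeared from the "sitemap"
print(index(run2, record_manager, vectorstore, cleanup="full", source_id_key="source"))
# expected: docs/intro skipped, docs/install's vectors deleted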
Implementation in Code
import weaviate
from langchain.indexes import SQLRecordManager, index
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_weaviate.vectorstores import WeaviateVectorStore


def ingest_docs():
    # 1. Load and prepare documents
    docs = ingest_general_guides_and_tutorials()  # 32 HTML pages (repo helper)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=4000,    # 4000 chars per chunk (larger for better context)
        chunk_overlap=400   # 400 chars overlap
    )
    docs_transformed = text_splitter.split_documents(docs)  # → ~120 chunks

    # 2. Get embedding model (OllamaEmbeddings for nomic-embed-text)
    embedding = get_embeddings_model()  # → OllamaEmbeddings (repo helper)

    # 3. Connect to local Weaviate
    with weaviate.connect_to_local() as weaviate_client:
        # 4. Create Vector Store (embedding=OllamaEmbeddings)
        vectorstore = WeaviateVectorStore(
            client=weaviate_client,
            index_name="LangChain_General_Guides_And_Tutorials_nomic_embed_text",
            text_key="text",
            embedding=embedding,  # ← Python handles vectorization via Ollama!
            attributes=["source", "title"]
        )

        # 5. Initialize Record Manager
        record_manager = SQLRecordManager(
            "weaviate/LangChain_General_Guides_...",
            db_url="postgresql://postgres:zkdtn1234@localhost:5432/chat_langchain"
        )
        record_manager.create_schema()

        # 6. Index with smart deduplication
        indexing_stats = index(
            docs_transformed,
            record_manager,
            vectorstore,
            cleanup="full",
            source_id_key="source",  # Group by URL
            force_update=False
        )

        print(f"Indexing complete: {indexing_stats}")
        # Output: {'num_added': 120, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Key Takeaways
- Docker-Based Local Setup:
  - PostgreSQL = Tracks indexed documents (deduplication)
  - Weaviate = Vector database server (storage only)
  - Ollama = Embedding service with nomic-embed-text (768-dim vectors, 2K context)
  - Python calls Ollama for embeddings, then stores in Weaviate
- Client-Side Vectorization via Ollama:
  - Python calls Ollama at localhost:11434
  - Ollama generates vectors using nomic-embed-text
  - Python sends vectors to Weaviate for storage
  - No external API calls or costs
  - Your data stays private and local
- Document Pipeline:
  - Load: 32 HTML pages from sitemap
  - Split: 32 pages → ~120 chunks (4000 chars, 400 overlap)
  - Vectorize: Via Ollama/nomic-embed-text (768 dimensions)
  - Store: Weaviate stores {text, vector, metadata}
- Smart Deduplication with Record Manager:
  - Groups chunks by source URL (source_id_key="source")
  - Tracks content hashes in PostgreSQL
  - NEW: First time seeing URL → insert all chunks
  - SKIP: URL exists, hashes match → skip (huge savings!)
  - UPDATE: URL exists, hash differs → delete old + insert new
  - DELETE: URL removed from source → auto-cleanup
- Production Benefits:
  - Private: All local, no data leaves your machine
  - Fast: 85% faster on re-runs (skip unchanged)
  - Reliable: No duplicates, automatic cleanup
  - Cost: $0 for embeddings!
Quick Reference
Vector Database Schema (Weaviate)
{ "vector": [768 floats], // nomic-embed-text embeddings "text": "document content", "source": "https://js.langchain.com/docs/...", "title": "Page Title | š¦ļøš LangChain.js" }
Record Manager Schema (PostgreSQL)
{ "key": "https://js.langchain.com/docs/...", "content_hash": "a3f5c9b2e8d1...", "updated_at": "2024-10-30 10:30:00", "namespace": "weaviate/LangChain_General_Guides_..." }
Index Function Parameters
index(
    docs,                    # Documents to process
    record_manager,          # Deduplication tracker
    vectorstore,             # Storage destination
    cleanup="full",          # Auto-delete removed documents
    source_id_key="source",  # Group by URL
    force_update=False       # Skip if unchanged
)
Running the System
Start Docker Containers
# Start all services (PostgreSQL, Weaviate, Ollama)
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f weaviate
Run Ingestion
# Set environment (already configured in .env)
# RECORD_MANAGER_DB_URL=postgresql://postgres:zkdtn1234@localhost:5432/chat_langchain

# Run ingestion (make sure Ollama is running and has pulled nomic-embed-text)
PYTHONPATH=. poetry run python backend/ingest.py

# Output:
# Fetching pages: 100%|████████| 32/32 [00:08<00:00, 3.60it/s]
# INFO: Indexing stats: {'num_added': 120, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
# INFO: General Guides and Tutorials now has this many vectors: 120
Query Documents
# Test retrieval
PYTHONPATH=. poetry run python -c "
from langchain_core.runnables import RunnableConfig
from backend.retrieval import make_retriever

config = RunnableConfig(configurable={'search_kwargs': {'k': 3}})

with make_retriever(config) as retriever:
    docs = retriever.invoke('What is LangChain?')
    print(f'Found {len(docs)} documents')
    print(f'Top result: {docs[0].metadata[\"title\"][:60]}')
"
