Introduction: The Problem with Traditional AI Responses
Imagine asking ChatGPT about recent startup funding in a specific domain. It might give you a confident answer—but is the information current? Are there citations? Does it know what it doesn't know?
Traditional LLM interactions follow a simple pattern:
User asks → LLM responds → Done
But what if we could teach AI to:
- Draft an initial answer
- Reflect on what's missing or unnecessary
- Research to fill gaps
- Revise with proper citations
- Repeat until the answer is comprehensive
This is the Reflexion Pattern—and in this tutorial, we'll build it from scratch using LangGraph.
What You'll Learn
By the end of this tutorial, you'll understand:
- ✅ What Reflexion agents are and why they're powerful
- ✅ How to build stateful AI workflows with LangGraph
- ✅ Implementing self-reflection in LLMs using structured outputs
- ✅ Creating iterative research loops with web search
- ✅ Managing conversation state across multiple iterations
Part 1: Understanding the Reflexion Pattern
Why Single-Shot LLM Responses Fall Short
```python
# Traditional approach
response = llm.invoke("Tell me about AI-powered SOC startups")
# Problem: No verification, no sources, potentially outdated
```
Limitations:
- ❌ No self-awareness of knowledge gaps
- ❌ No verification or citations
- ❌ Potentially outdated information (training cutoff)
- ❌ One-shot generation without refinement
The Reflexion Solution
```
User Query
    ↓
[Draft]    Generate initial answer + self-critique
    ↓
[Research] Execute targeted web searches
    ↓
[Revise]   Improve answer with new information + citations
    ↓
[Loop]     Repeat until quality threshold met
    ↓
Final Answer (cited, verified, comprehensive)
```
Benefits:
- ✅ Self-critiques and identifies gaps
- ✅ Conducts targeted research
- ✅ Provides citations and references
- ✅ Iteratively improves quality
- ✅ Transparent reasoning process
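Conceptually, the pattern is just a draft-research-revise loop. Here is a minimal runnable sketch in which stub functions stand in for the real LLM chains and web search we build in Part 3 (the stubs and their return values are purely illustrative):
```python
# Conceptual sketch only: the stubs below stand in for the real chains built later.
def draft(question):
    return "initial answer", "critique: missing specifics", ["search query 1"]

def research(queries):
    return [f"search result for: {q}" for q in queries]

def revise(question, answer, critique, evidence):
    return answer + " [1]", "critique: still missing details", ["follow-up query"]

def reflexion(question, max_iterations=2):
    answer, critique, queries = draft(question)          # draft + self-critique
    for _ in range(max_iterations):
        evidence = research(queries)                     # targeted research
        answer, critique, queries = revise(question, answer, critique, evidence)
    return answer                                        # iteratively improved answer

print(reflexion("Write about AI-Powered SOC startups that raised capital."))
```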
Part 2: System Architecture
The LangGraph Workflow
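The graph we assemble in Step 5 wires three nodes into a cycle, with a conditional edge deciding when to stop. Sketched out:
```
draft
  ↓
execute_tools  ◄──┐
  ↓               │ continue
revise ───────────┘
  ↓ (after MAX_ITERATIONS)
END
```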
Component Breakdown
| Component | Purpose | Input | Output |
|---|---|---|---|
| Draft Node | Initial response + reflection | User query | Answer + critique + search queries |
| Tools Node | Web search execution | Search queries | Search results (JSON) |
| Revise Node | Answer improvement | Previous answer + search results | Revised answer + references |
| Event Loop | Iteration control | Message count | Continue or END |
Part 3: Step-by-Step Implementation
Step 1: Project Setup
Create your project:
```bash
# Create project directory
mkdir reflexion-agent
cd reflexion-agent

# Install dependencies
pip install langchain-core langchain-openai langgraph \
    langchain-tavily python-dotenv pydantic
```
Create .env file:
```
OPENAI_API_KEY=sk-your-key-here
TAVILY_API_KEY=tvly-your-key-here
```
Get API Keys:
- OpenAI: https://platform.openai.com/api-keys
- Tavily: https://tavily.com/ (free tier available)
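Optional sanity check: before wiring up any chains, confirm both keys actually load from `.env` (a minimal sketch using `python-dotenv` and `os`, both already available with the dependencies above):
```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

for key in ("OPENAI_API_KEY", "TAVILY_API_KEY"):
    # Fail fast with a clear message instead of a confusing auth error later
    assert os.getenv(key), f"{key} is not set - check your .env file"
print("API keys loaded.")
```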
Step 2: Define Data Schemas (schemas.py)
Why this matters: Structured outputs ensure the LLM returns exactly what we need.
```python
from typing import List

from pydantic import BaseModel, Field


class Reflection(BaseModel):
    """Self-critique structure."""

    missing: str = Field(description="Critique of what is missing.")
    superfluous: str = Field(description="Critique of what is superfluous.")


class AnswerQuestion(BaseModel):
    """Initial answer with reflection and research plan."""

    answer: str = Field(
        description="~250 word detailed answer to the question."
    )
    reflection: Reflection = Field(
        description="Your reflection on the initial answer."
    )
    search_queries: List[str] = Field(
        description="1-3 search queries for researching improvements."
    )


class ReviseAnswer(AnswerQuestion):
    """Revised answer with citations."""

    references: List[str] = Field(
        description="Citations motivating your updated answer."
    )
```
Key Insight: By inheriting ReviseAnswer from AnswerQuestion, we ensure the revised version includes all original fields plus references.
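You can confirm this directly by inspecting the model (a quick sanity check; the field values below are invented, and `model_fields` assumes Pydantic v2):
```python
from schemas import AnswerQuestion, Reflection, ReviseAnswer

revised = ReviseAnswer(
    answer="AI-powered SOC platforms automate triage...",          # inherited field
    reflection=Reflection(missing="funding data", superfluous="background"),
    search_queries=["AI SOC startups funding"],                     # inherited field
    references=["https://example.com/article"],                     # added by ReviseAnswer
)

print(sorted(ReviseAnswer.model_fields))
# ['answer', 'reflection', 'references', 'search_queries']
```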
Step 3: Configure LLM Chains (chains.py)
The Draft Chain:
```python
import datetime

from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

from schemas import AnswerQuestion, ReviseAnswer

load_dotenv()

llm = ChatOpenAI(model="gpt-4")

# Base prompt template
actor_prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        """You are an expert researcher. Current time: {time}

1. {first_instruction}
2. Reflect and critique your answer. Be severe to maximize improvement.
3. Recommend search queries to research information and improve your answer.""",
    ),
    MessagesPlaceholder(variable_name="messages"),
    ("system", "Answer the user's question above using the required format."),
]).partial(
    time=lambda: datetime.datetime.now().isoformat(),
)

# First responder (draft)
first_responder = actor_prompt_template.partial(
    first_instruction="Provide a detailed ~250 word answer."
) | llm.bind_tools(
    tools=[AnswerQuestion],
    tool_choice="AnswerQuestion",  # Force structured output
)
```
The Revise Chain:
```python
revise_instructions = """Revise your previous answer using the new information.
- Use the previous critique to add important information.
- MUST include numerical citations [1], [2], etc.
- Add a "References" section at the bottom.
- Remove superfluous information.
- Keep under 250 words.
"""

revisor = actor_prompt_template.partial(
    first_instruction=revise_instructions
) | llm.bind_tools(
    tools=[ReviseAnswer],
    tool_choice="ReviseAnswer",
)
```
What's happening:
- `MessagesPlaceholder` injects the full conversation history
- `bind_tools()` forces the LLM to use our Pydantic schemas
- `tool_choice` ensures consistent structured outputs (demonstrated in the quick check below)
- `partial()` customizes the instructions per node
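As a quick check, you can invoke the draft chain directly and inspect the structured tool call it produces (a minimal sketch; this makes a live OpenAI call):
```python
from langchain_core.messages import HumanMessage
from chains import first_responder

result = first_responder.invoke({
    "messages": [HumanMessage(content="Write about AI-Powered SOC startups that raised capital.")]
})

# Because tool_choice forces AnswerQuestion, the payload lives in tool_calls, not content
args = result.tool_calls[0]["args"]
print(args["answer"][:200])
print(args["reflection"])
print(args["search_queries"])
```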
Step 4: Set Up Web Search (tool_executor.py)
```python
from dotenv import load_dotenv
from langchain_core.tools import StructuredTool
from langchain_tavily import TavilySearch
from langgraph.prebuilt import ToolNode

from schemas import AnswerQuestion, ReviseAnswer

load_dotenv()

# Initialize search tool
tavily_tool = TavilySearch(max_results=5)


def run_queries(search_queries: list[str], **kwargs):
    """Batch execute search queries."""
    return tavily_tool.batch([{"query": query} for query in search_queries])


# Create tool node
execute_tools = ToolNode([
    StructuredTool.from_function(run_queries, name=AnswerQuestion.__name__),
    StructuredTool.from_function(run_queries, name=ReviseAnswer.__name__),
])
```
Why two tool instances? The ToolNode routes tool calls by name: the draft node emits an AnswerQuestion tool call, while the revise node emits a ReviseAnswer call, so both names must map to the same run_queries function.
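To see the routing in action, you can hand the tool node a hand-built AIMessage carrying an AnswerQuestion tool call (a minimal sketch; the call id and args are invented, and this performs a live Tavily search):
```python
from langchain_core.messages import AIMessage
from tool_executor import execute_tools

fake_draft = AIMessage(
    content="",
    tool_calls=[{
        "name": "AnswerQuestion",                                  # matches a registered tool name
        "args": {"search_queries": ["AI SOC startups funding 2024"]},
        "id": "call_demo_1",                                       # hypothetical call id
    }],
)

tool_messages = execute_tools.invoke([fake_draft])  # returns ToolMessage(s) with search results
print(tool_messages[0].name)          # "AnswerQuestion"
print(tool_messages[0].content[:200])
```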
Step 5: Build the LangGraph Workflow (main.py)
Import and Setup:
```python
import logging
from typing import List

from langchain_core.messages import BaseMessage, ToolMessage
from langgraph.graph import END, MessageGraph

from chains import revisor, first_responder
from tool_executor import execute_tools

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

MAX_ITERATIONS = 2
```
Define Node Functions:
```python
def draft_node(state: List[BaseMessage]) -> List[BaseMessage]:
    """Generate initial answer with self-reflection."""
    logger.info("ENTERING NODE: draft")
    logger.info(f"Query: {state[-1].content}")

    result = first_responder.invoke({"messages": state})

    # Log the reflection
    if result.tool_calls:
        args = result.tool_calls[0]["args"]
        logger.info(f"Generated answer: {args.get('answer', '')[:100]}...")
        logger.info(f"Missing: {args.get('reflection', {}).get('missing', '')}")
        logger.info(f"Search queries: {args.get('search_queries', [])}")

    return state + [result]


def tools_node(state: List[BaseMessage]) -> List[BaseMessage]:
    """Execute web searches."""
    logger.info("ENTERING NODE: execute_tools")

    # Find search queries from last AI message
    for msg in reversed(state):
        if hasattr(msg, "tool_calls") and msg.tool_calls:
            queries = msg.tool_calls[0]["args"].get("search_queries", [])
            logger.info(f"Searching for: {queries}")
            break

    result = execute_tools.invoke(state)
    logger.info(f"Found {len(result)} search results")
    return result


def revise_node(state: List[BaseMessage]) -> List[BaseMessage]:
    """Revise answer with search results."""
    logger.info("ENTERING NODE: revise")

    tool_count = sum(1 for msg in state if isinstance(msg, ToolMessage))
    logger.info(f"Revision iteration: {tool_count}")

    result = revisor.invoke({"messages": state})

    if result.tool_calls:
        args = result.tool_calls[0]["args"]
        logger.info(f"Revised answer: {args.get('answer', '')[:100]}...")
        logger.info(f"References: {len(args.get('references', []))}")

    return state + [result]


def event_loop(state: List[BaseMessage]) -> str:
    """Decide whether to continue or end."""
    iterations = sum(isinstance(msg, ToolMessage) for msg in state)
    logger.info(f"Iteration {iterations}/{MAX_ITERATIONS}")

    if iterations > MAX_ITERATIONS:
        logger.info("Max iterations reached - ENDING")
        return END

    logger.info("Continuing to next iteration")
    return "execute_tools"
```
Build the Graph:
```python
# Initialize graph
builder = MessageGraph()

# Add nodes
builder.add_node("draft", draft_node)
builder.add_node("execute_tools", tools_node)
builder.add_node("revise", revise_node)

# Add edges
builder.add_edge("draft", "execute_tools")
builder.add_edge("execute_tools", "revise")
builder.add_conditional_edges(
    "revise",
    event_loop,
    {END: END, "execute_tools": "execute_tools"},
)

# Set entry point and compile
builder.set_entry_point("draft")
graph = builder.compile()
```
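Optionally, you can print the compiled topology to confirm the wiring matches the sketch from Part 2 (recent langgraph releases expose a Mermaid renderer; treat this as a version-dependent convenience, not part of the agent):
```python
# Print a Mermaid diagram of the compiled graph
print(graph.get_graph().draw_mermaid())
```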
Execute the Graph:
```python
if __name__ == "__main__":
    query = "Write about AI-Powered SOC startups that raised capital."
    logger.info(f"Starting query: {query}")

    res = graph.invoke(query)

    # Extract final answer
    final_message = res[-1]
    if hasattr(final_message, "tool_calls") and final_message.tool_calls:
        answer = final_message.tool_calls[0]["args"]["answer"]
        references = final_message.tool_calls[0]["args"].get("references", [])

        print("\n" + "=" * 80)
        print("FINAL ANSWER:")
        print("=" * 80)
        print(answer)

        if references:
            print("\nReferences:")
            for i, ref in enumerate(references, 1):
                print(f"[{i}] {ref}")
```
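If you prefer to watch the loop progress node by node instead of waiting for the final message list, `graph.stream()` yields each node's output as it completes (a minimal sketch):
```python
# Stream node outputs as the graph runs; each chunk maps node name -> that node's returned messages
for chunk in graph.stream("Write about AI-Powered SOC startups that raised capital."):
    for node_name, messages in chunk.items():
        count = len(messages) if isinstance(messages, list) else 1
        print(f"--- {node_name} returned {count} message(s)")
```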
Part 4: Understanding the Execution Flow
Message State Evolution
Let's trace how state grows through iterations:
Iteration 0 (Start):
```python
[HumanMessage(content="Write about AI-Powered SOC startups...")]
```
After Draft Node:
```python
[
    HumanMessage(content="..."),
    AIMessage(tool_calls=[{
        'args': {
            'answer': 'AI-Powered SOCs use ML for threat detection...',
            'reflection': {
                'missing': 'Specific company names and funding amounts',
                'superfluous': 'Too much background on SOC basics'
            },
            'search_queries': [
                'AI SOC startups funding 2024',
                'autonomous SOC companies capital raised'
            ]
        }
    }])
]
```
After Execute Tools:
```python
[
    HumanMessage(...),
    AIMessage(...),
    ToolMessage(
        content='[{"url":"...", "content":"Darktrace raised $230M..."}]',
        name='AnswerQuestion'
    )
]
```
After Revise (Iteration 1):
```python
[
    HumanMessage(...),
    AIMessage(...),     # draft
    ToolMessage(...),   # search results
    AIMessage(tool_calls=[{   # revised
        'args': {
            'answer': 'AI-Powered SOCs... Darktrace ($230M) [1]...',
            'references': ['https://techcrunch.com/...'],
            'search_queries': ['Recent AI SOC unicorns']
        }
    }])
]
```
The second iteration repeats Tools → Revise; the event loop then returns END and the run finishes.
Part 5: Running Your Agent
Basic Execution
```bash
python main.py
```
Sample Output
See a full interactive run here:
https://smith.langchain.com/public/722db09e-18e7-465a-9398-efd386410cda/r
Part 6: Best Practices & Performance
Cost Optimization
Typical costs per query (GPT-4):
- Draft: 500 tokens ($0.015)
- 2 Revisions: 1000 tokens each ($0.06)
- Total: ~$0.08-0.10 per query
Tips to reduce costs:
```python
# Use GPT-3.5 for draft
draft_llm = ChatOpenAI(model="gpt-3.5-turbo")

# Use GPT-4 only for revisions
revise_llm = ChatOpenAI(model="gpt-4")

# Limit iterations
MAX_ITERATIONS = 1  # Single revision cycle
```
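To actually route drafting to the cheaper model, rebuild the chains from Step 3 with the new LLM objects (same names as in chains.py; a sketch):
```python
# Draft with the cheaper model, revise with the stronger one
first_responder = actor_prompt_template.partial(
    first_instruction="Provide a detailed ~250 word answer."
) | draft_llm.bind_tools(tools=[AnswerQuestion], tool_choice="AnswerQuestion")

revisor = actor_prompt_template.partial(
    first_instruction=revise_instructions
) | revise_llm.bind_tools(tools=[ReviseAnswer], tool_choice="ReviseAnswer")
```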
Performance Tuning
Typical execution time:
- Draft: 5-10 seconds
- Search: 2-3 seconds per query
- Revise: 5-10 seconds
- Total: 30-60 seconds for 2 iterations
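To see where your own setup lands, wrap the invocation in a simple timer (a minimal sketch):
```python
import time

start = time.perf_counter()
res = graph.invoke("Write about AI-Powered SOC startups that raised capital.")
elapsed = time.perf_counter() - start

print(f"Completed in {elapsed:.1f}s with {len(res)} messages in the final state")
```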
Part 7: Real-World Applications
Use Cases
- Research Assistant: Academic literature reviews
- Market Intelligence: Competitive analysis with sources
- Content Creation: Blog posts with verified facts
- Due Diligence: Investment research
- Medical Information: Clinical guidelines with citations
- Legal Research: Case law analysis
- News Analysis: Multi-source fact-checked summaries
Conclusion
Congratulations! You've built a sophisticated Reflexion Agent that:
✅ Generates thoughtful initial responses
✅ Critiques its own work with self-reflection
✅ Researches missing information via web search
✅ Revises answers with proper citations
✅ Iterates until reaching quality thresholds
Key Insights
- Self-reflection transforms LLM capabilities - By making the model aware of its limitations, we dramatically improve output quality.
- Structured outputs are essential - Pydantic schemas ensure consistency and enable programmatic access to reflections and citations.
- Iteration beats single-shot - Multiple research cycles compound improvements.
- State management is crucial - LangGraph's message-based state enables complex workflows.