Building a Reflexion Agent with LangGraph: Teaching AI to Think, Reflect, and Improve Its Own Answers


Introduction: The Problem with Traditional AI Responses

Imagine asking ChatGPT about recent startup funding in a specific domain. It might give you a confident answer—but is the information current? Are there citations? Does it know what it doesn't know?

Traditional LLM interactions follow a simple pattern:

User asks → LLM responds → Done

But what if we could teach AI to:

  1. Draft an initial answer
  2. Reflect on what's missing or unnecessary
  3. Research to fill gaps
  4. Revise with proper citations
  5. Repeat until the answer is comprehensive

This is the Reflexion Pattern—and in this tutorial, we'll build it from scratch using LangGraph.


What You'll Learn

By the end of this tutorial, you'll understand:

  • ✅ What Reflexion agents are and why they're powerful
  • ✅ How to build stateful AI workflows with LangGraph
  • ✅ Implementing self-reflection in LLMs using structured outputs
  • ✅ Creating iterative research loops with web search
  • ✅ Managing conversation state across multiple iterations

Part 1: Understanding the Reflexion Pattern

Why Single-Shot LLM Responses Fall Short

# Traditional approach
response = llm.invoke("Tell me about AI-powered SOC startups")
# Problem: No verification, no sources, potentially outdated

Limitations:

  • ❌ No self-awareness of knowledge gaps
  • ❌ No verification or citations
  • ❌ Potentially outdated information (training cutoff)
  • ❌ One-shot generation without refinement

The Reflexion Solution

User Query
    ↓
[Draft] Generate initial answer + self-critique
    ↓
[Research] Execute targeted web searches
    ↓
[Revise] Improve answer with new information + citations
    ↓
[Loop] Repeat until quality threshold met
    ↓
Final Answer (cited, verified, comprehensive)

Benefits:

  • ✅ Self-critiques and identifies gaps
  • ✅ Conducts targeted research
  • ✅ Provides citations and references
  • ✅ Iteratively improves quality
  • ✅ Transparent reasoning process

Part 2: System Architecture

The LangGraph Workflow

Component Breakdown

| Component   | Purpose                        | Input                            | Output                              |
|-------------|--------------------------------|----------------------------------|-------------------------------------|
| Draft Node  | Initial response + reflection  | User query                       | Answer + critique + search queries  |
| Tools Node  | Web search execution           | Search queries                   | Search results (JSON)               |
| Revise Node | Answer improvement             | Previous answer + search results | Revised answer + references         |
| Event Loop  | Iteration control              | Message count                    | Continue or END                     |
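
Before writing any LangGraph code, it helps to see the control flow the graph will implement. The sketch below is plain Python with placeholder functions: draft, search, and revise are stand-ins for the chains and tools we build in Part 3, not real APIs.

# Conceptual sketch only: draft(), search(), and revise() are placeholders
# for the chains and tools built in Part 3.
MAX_ITERATIONS = 2

def reflexion_loop(question: str) -> str:
    # Draft an initial answer plus a self-critique and follow-up search queries
    answer, critique, queries = draft(question)

    for _ in range(MAX_ITERATIONS):
        results = search(queries)  # targeted web research
        # Fold new facts into the answer; refresh the critique and queries
        answer, critique, queries = revise(question, answer, critique, results)

    return answer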

Part 3: Step-by-Step Implementation

Step 1: Project Setup

Create your project:

# Create project directory
mkdir reflexion-agent
cd reflexion-agent

# Install dependencies
pip install langchain-core langchain-openai langgraph \
            langchain-tavily python-dotenv pydantic

Create .env file:

OPENAI_API_KEY=sk-your-key-here
TAVILY_API_KEY=tvly-your-key-here

Get API Keys:

  • OpenAI: https://platform.openai.com
  • Tavily: https://tavily.com

Step 2: Define Data Schemas (schemas.py)

Why this matters: Structured outputs ensure the LLM returns exactly what we need.

from typing import List
from pydantic import BaseModel, Field

class Reflection(BaseModel):
    """Self-critique structure."""
    missing: str = Field(description="Critique of what is missing.")
    superfluous: str = Field(description="Critique of what is superfluous")

class AnswerQuestion(BaseModel):
    """Initial answer with reflection and research plan."""
    
    answer: str = Field(
        description="~250 word detailed answer to the question."
    )
    reflection: Reflection = Field(
        description="Your reflection on the initial answer."
    )
    search_queries: List[str] = Field(
        description="1-3 search queries for researching improvements."
    )

class ReviseAnswer(AnswerQuestion):
    """Revised answer with citations."""
    
    references: List[str] = Field(
        description="Citations motivating your updated answer."
    )

Key Insight: By inheriting ReviseAnswer from AnswerQuestion, we ensure the revised version includes all original fields plus references.
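
A quick check outside the agent confirms the inheritance. Assuming Pydantic v2 (whose models expose model_fields), the revised schema carries every draft field plus references:

# Sanity check of the schema inheritance (assumes Pydantic v2's model_fields)
from schemas import AnswerQuestion, ReviseAnswer

print(sorted(AnswerQuestion.model_fields))  # ['answer', 'reflection', 'search_queries']
print(sorted(ReviseAnswer.model_fields))    # ['answer', 'references', 'reflection', 'search_queries']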


Step 3: Configure LLM Chains (chains.py)

The Draft Chain:

import datetime
from dotenv import load_dotenv
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI
from schemas import AnswerQuestion, ReviseAnswer

load_dotenv()

llm = ChatOpenAI(model="gpt-4")

# Base prompt template
actor_prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        """You are expert researcher.
Current time: {time}

1. {first_instruction}
2. Reflect and critique your answer. Be severe to maximize improvement.
3. Recommend search queries to research information and improve your answer.""",
    ),
    MessagesPlaceholder(variable_name="messages"),
    ("system", "Answer the user's question above using the required format."),
]).partial(
    time=lambda: datetime.datetime.now().isoformat(),
)

# First responder (draft)
first_responder = actor_prompt_template.partial(
    first_instruction="Provide a detailed ~250 word answer."
) | llm.bind_tools(
    tools=[AnswerQuestion], 
    tool_choice="AnswerQuestion"  # Force structured output
)

The Revise Chain:

revise_instructions = """Revise your previous answer using the new information.
- Use the previous critique to add important information.
- MUST include numerical citations [1], [2], etc.
- Add a "References" section at the bottom.
- Remove superfluous information.
- Keep under 250 words.
"""

revisor = actor_prompt_template.partial(
    first_instruction=revise_instructions
) | llm.bind_tools(
    tools=[ReviseAnswer], 
    tool_choice="ReviseAnswer"
)

What's happening:

  1. MessagesPlaceholder injects full conversation history
  2. bind_tools() forces LLM to use our Pydantic schemas
  3. tool_choice ensures consistent structured outputs
  4. partial() customizes instructions per node

Step 4: Set Up Web Search (tool_executor.py)

from dotenv import load_dotenv
from langchain_core.tools import StructuredTool
from langchain_tavily import TavilySearch
from langgraph.prebuilt import ToolNode
from schemas import AnswerQuestion, ReviseAnswer

load_dotenv()

# Initialize search tool
tavily_tool = TavilySearch(max_results=5)

def run_queries(search_queries: list[str], **kwargs):
    """Batch execute search queries."""
    return tavily_tool.batch([{"query": query} for query in search_queries])

# Create tool node
execute_tools = ToolNode([
    StructuredTool.from_function(run_queries, name=AnswerQuestion.__name__),
    StructuredTool.from_function(run_queries, name=ReviseAnswer.__name__),
])

Why two tool instances? LangGraph routes tool calls by name: the draft node emits AnswerQuestion tool calls, while the revise node emits ReviseAnswer tool calls, so both names must point at the same search function. You can verify this routing in isolation, as in the sketch below.
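
A minimal routing check, assuming TAVILY_API_KEY is set; the fake tool call and its argument values here are illustrative only:

# Standalone check that ToolNode routes by tool-call name (assumes TAVILY_API_KEY is set)
from langchain_core.messages import AIMessage
from tool_executor import execute_tools

# Fake AI message carrying an AnswerQuestion tool call (illustrative values)
fake_call = AIMessage(
    content="",
    tool_calls=[{
        "name": "AnswerQuestion",  # must match a registered StructuredTool name
        "args": {"search_queries": ["AI SOC startups funding 2024"]},
        "id": "call_1",
    }],
)

tool_messages = execute_tools.invoke([fake_call])
print(tool_messages[0].name)           # 'AnswerQuestion'
print(tool_messages[0].content[:200])  # JSON search results from Tavily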


Step 5: Build the LangGraph Workflow (main.py)

Import and Setup:

import logging
from typing import List
from langchain_core.messages import BaseMessage, ToolMessage
from langgraph.graph import END, MessageGraph
from chains import revisor, first_responder
from tool_executor import execute_tools

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

MAX_ITERATIONS = 2

Define Node Functions:

def draft_node(state: List[BaseMessage]) -> List[BaseMessage]:
    """Generate initial answer with self-reflection."""
    logger.info("ENTERING NODE: draft")
    logger.info(f"Query: {state[-1].content}")
    
    result = first_responder.invoke({"messages": state})
    
    # Log the reflection
    if result.tool_calls:
        args = result.tool_calls[0]["args"]
        logger.info(f"Generated answer: {args.get('answer', '')[:100]}...")
        logger.info(f"Missing: {args.get('reflection', {}).get('missing', '')}")
        logger.info(f"Search queries: {args.get('search_queries', [])}")
    
    return state + [result]

def tools_node(state: List[BaseMessage]) -> List[BaseMessage]:
    """Execute web searches."""
    logger.info("ENTERING NODE: execute_tools")
    
    # Find search queries from last AI message
    for msg in reversed(state):
        if hasattr(msg, "tool_calls") and msg.tool_calls:
            queries = msg.tool_calls[0]["args"].get("search_queries", [])
            logger.info(f"Searching for: {queries}")
            break
    
    result = execute_tools.invoke(state)
    logger.info(f"Found {len(result)} search results")
    
    return result

def revise_node(state: List[BaseMessage]) -> List[BaseMessage]:
    """Revise answer with search results."""
    logger.info("ENTERING NODE: revise")
    
    tool_count = sum(1 for msg in state if isinstance(msg, ToolMessage))
    logger.info(f"Revision iteration: {tool_count}")
    
    result = revisor.invoke({"messages": state})
    
    if result.tool_calls:
        args = result.tool_calls[0]["args"]
        logger.info(f"Revised answer: {args.get('answer', '')[:100]}...")
        logger.info(f"References: {len(args.get('references', []))}")
    
    return state + [result]

def event_loop(state: List[BaseMessage]) -> str:
    """Decide whether to continue or end."""
    iterations = sum(isinstance(msg, ToolMessage) for msg in state)
    
    logger.info(f"Iteration {iterations}/{MAX_ITERATIONS}")
    
    if iterations >= MAX_ITERATIONS:
        logger.info("Max iterations reached - ENDING")
        return END
    
    logger.info("Continuing to next iteration")
    return "execute_tools"

Build the Graph:

# Initialize graph
builder = MessageGraph()

# Add nodes
builder.add_node("draft", draft_node)
builder.add_node("execute_tools", tools_node)
builder.add_node("revise", revise_node)

# Add edges
builder.add_edge("draft", "execute_tools")
builder.add_edge("execute_tools", "revise")
builder.add_conditional_edges(
    "revise", 
    event_loop, 
    {END: END, "execute_tools": "execute_tools"}
)

# Set entry point and compile
builder.set_entry_point("draft")
graph = builder.compile()

Execute the Graph:

if __name__ == "__main__":
    query = "Write about AI-Powered SOC startups that raised capital."
    
    logger.info(f"Starting query: {query}")
    res = graph.invoke(query)
    
    # Extract final answer
    final_message = res[-1]
    if hasattr(final_message, "tool_calls") and final_message.tool_calls:
        answer = final_message.tool_calls[0]["args"]["answer"]
        references = final_message.tool_calls[0]["args"].get("references", [])
        
        print("\n" + "=" * 80)
        print("FINAL ANSWER:")
        print("=" * 80)
        print(answer)
        
        if references:
            print("\nReferences:")
            for i, ref in enumerate(references, 1):
                print(f"[{i}] {ref}")

Part 4: Understanding the Execution Flow

Message State Evolution

Let's trace how state grows through iterations:

Iteration 0 (Start):

[HumanMessage(content="Write about AI-Powered SOC startups...")]

After Draft Node:

[
    HumanMessage(content="..."),
    AIMessage(tool_calls=[{
        'args': {
            'answer': 'AI-Powered SOCs use ML for threat detection...',
            'reflection': {
                'missing': 'Specific company names and funding amounts',
                'superfluous': 'Too much background on SOC basics'
            },
            'search_queries': [
                'AI SOC startups funding 2024',
                'autonomous SOC companies capital raised'
            ]
        }
    }])
]

After Execute Tools:

[
    HumanMessage(...),
    AIMessage(...),
    ToolMessage(
        content='[{"url":"...", "content":"Darktrace raised $230M..."}]',
        name='AnswerQuestion'
    )
]

After Revise (Iteration 1):

[
    HumanMessage(...),
    AIMessage(...),  # draft
    ToolMessage(...),  # search results
    AIMessage(tool_calls=[{  # revised
        'args': {
            'answer': 'AI-Powered SOCs... Darktrace ($230M) [1]...',
            'references': ['https://techcrunch.com/...'],
            'search_queries': ['Recent AI SOC unicorns']
        }
    }])
]

Second Iteration repeats Tools → Revise, then ends.


Part 5: Running Your Agent

Basic Execution

python main.py

Sample Output

See a full interactive run here:
https://smith.langchain.com/public/722db09e-18e7-465a-9398-efd386410cda/r


Part 6: Best Practices & Performance

Cost Optimization

Typical costs per query (GPT-4):

  • Draft: 500 tokens ($0.015)
  • 2 Revisions: 1000 tokens each ($0.06)
  • Total: ~$0.08-0.10 per query

Tips to reduce costs:

# Use GPT-3.5 for draft
draft_llm = ChatOpenAI(model="gpt-3.5-turbo")

# Use GPT-4 only for revisions
revise_llm = ChatOpenAI(model="gpt-4")

# Limit iterations
MAX_ITERATIONS = 1  # Single revision cycle
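
To actually apply the two-model split, the chains in chains.py need to be rebuilt against the two models. A minimal sketch, assuming the same actor_prompt_template and revise_instructions defined in Step 3:

# Sketch of a two-model chains.py (reuses actor_prompt_template and revise_instructions from Step 3)
from langchain_openai import ChatOpenAI
from schemas import AnswerQuestion, ReviseAnswer

draft_llm = ChatOpenAI(model="gpt-3.5-turbo")  # cheaper model for the first draft
revise_llm = ChatOpenAI(model="gpt-4")         # stronger model for revisions

first_responder = actor_prompt_template.partial(
    first_instruction="Provide a detailed ~250 word answer."
) | draft_llm.bind_tools(tools=[AnswerQuestion], tool_choice="AnswerQuestion")

revisor = actor_prompt_template.partial(
    first_instruction=revise_instructions
) | revise_llm.bind_tools(tools=[ReviseAnswer], tool_choice="ReviseAnswer")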

Performance Tuning

Typical execution time:

  • Draft: 5-10 seconds
  • Search: 2-3 seconds per query
  • Revise: 5-10 seconds
  • Total: 30-60 seconds for 2 iterations

Part 7: Real-World Applications

Use Cases

  1. Research Assistant: Academic literature reviews
  2. Market Intelligence: Competitive analysis with sources
  3. Content Creation: Blog posts with verified facts
  4. Due Diligence: Investment research
  5. Medical Information: Clinical guidelines with citations
  6. Legal Research: Case law analysis
  7. News Analysis: Multi-source fact-checked summaries

Conclusion

Congratulations! You've built a sophisticated Reflexion Agent that:

  • ✅ Generates thoughtful initial responses
  • ✅ Critiques its own work with self-reflection
  • ✅ Researches missing information via web search
  • ✅ Revises answers with proper citations
  • ✅ Iterates until reaching quality thresholds

Key Insights

  1. Self-reflection transforms LLM capabilities - By making the model aware of its limitations, we dramatically improve output quality.

  2. Structured outputs are essential - Pydantic schemas ensure consistency and enable programmatic access to reflections and citations.

  3. Iteration beats single-shot - Multiple research cycles compound improvements.

  4. State management is crucial - LangGraph's message-based state enables complex workflows.
