What are Reflection Agents & Why They Matter
- A reflection agent is an AI (LLM-based) agent that doesn’t just act/react immediately, but “reflects” on its own prior actions (or outputs), critiques or evaluates them, and then uses that self-reflection to improve subsequent behavior or responses. (LangChain Blog)
- The idea is drawn from human thinking: System 1 (fast, instinctive) and System 2 (slow, deliberate, reflective). Reflection gives the LLM a System 2-style pass to catch mistakes, refine its output, and improve quality. (LangChain Blog)
- Reflection is important especially for complex, knowledge-intensive, or high‐stakes tasks, where you can’t rely on a single pass. For example: generating technical documentation, planning, reasoning, code, complex decision-making, or any situation where errors are costly. (LangChain Blog)
The Specific Techniques Presented
The blog presents three reflection-based methods / architectures. For each, the table below covers what it is, how it works, and its strengths and trade-offs.
| Technique | What it is / Overview | How it works (steps) | Strengths | Weaknesses / Trade-offs |
|---|---|---|---|---|
| Basic Reflection | The simplest form: two LLM calls, one to generate output and one to reflect/critique, possibly repeated in a loop a fixed number of times. (LangChain Blog) | 1. Receive a user request. 2. The generator LLM produces an output. 3. The reflector LLM plays a "teacher" role, critiquing the generator's output: what's good, what's missing, what could be better. 4. Optionally loop, feeding the reflection back to improve the next generation. The blog uses a MessageGraph to alternate generator ↔ reflector, looping until a fixed limit (implemented in full in the Code Implementation Example below). (LangChain Blog) | Easy to implement; relatively low cost (a few extra LLM calls). Good at catching obvious issues; helps improve correctness, clarity, and refinement. (LangChain Blog) | Pure self-reflection, not grounded in external data or tools, so it may miss deeper errors and may not scale to very complex tasks. Latency and cost both rise, and a poorly prompted reflector can give shallow feedback. (LangChain Blog) |
| Reflexion | A more structured form of reflection: an actor plus a revisor, with external feedback and more explicit criticism. The actor generates responses (with search / tool use); the revisor reflects, finding what's missing or superfluous and adding citations (a minimal sketch follows this table). (LangChain Blog) | Roughly: 1. The actor generates a draft response, possibly running external searches or tools to gather information. 2. Execute any tools needed (e.g. fact checking, searches). 3. The revisor critiques the draft, pointing out what's missing, what's extra, and citing sources. 4. Use that reflection to revise or produce a better final version. 5. Loop for a fixed N iterations. (LangChain Blog) | More rigorous; citations and external data/tools make it more likely to catch factual errors. Better for tasks needing precision; the structured critique makes the improvements more meaningful. (LangChain Blog) | More computational cost (each iteration runs searches, tools, and LLM calls) and more latency. Prompt design and orchestration are more complex. It can still suffer if tools or data are incomplete or the LLM's judgement is weak. The trajectory is also fixed: if the initial draft is very misguided, later corrections may not fully recover. (LangChain Blog) |
| Language Agent Tree Search (LATS) | The most advanced method in the article. It combines reflection with search (borrowing ideas from Monte Carlo Tree Search) to explore possible action trajectories, evaluate them, and pick the best path; useful when the agent must plan through many branches (a structural skeleton of the search loop appears under Implementation Details below). (LangChain Blog) | Key steps: 1. Select: choose which action (or trajectory) to expand next, based on rewards so far. 2. Expand & simulate: generate several possible next actions in parallel. 3. Reflect & evaluate: observe each outcome, possibly with external feedback, and score it. 4. Backpropagate: push the evaluations up the tree so the root gains information about which trajectories are promising. 5. Continue until a solution is found or the depth/search budget is exhausted. 6. Return the chosen trajectory / output. (LangChain Blog) | Very powerful for tasks with branching decision points: planning, code generation, multi-step reasoning. Tends to produce higher-quality, more robust outcomes, with the ability to look ahead and compare alternatives. (LangChain Blog) | Computationally expensive: many LLM/tool calls and simulations, high latency, and implementation complexity. Requires a good evaluation/reflection function to score trajectories well; risks combinatorial explosion with a large branching factor. May be overkill for simpler tasks. (LangChain Blog) |
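To make the Reflexion loop concrete, here is a minimal sketch using the same LangChain primitives as the code example below. It is illustrative only: the prompt wording, the fake_search stub, and the reflexion() helper are my assumptions, not the blog's code, and the blog's version is more elaborate (real search tools, citation handling).

```python
# Minimal Reflexion-style loop: actor drafts, revisor critiques,
# evidence grounds the revision, repeated for a fixed N iterations.
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Actor: drafts (and later revises) an answer.
actor = ChatPromptTemplate.from_messages([
    ("system", "Answer the question. If given a critique and evidence, "
               "revise your previous answer and cite the evidence."),
    MessagesPlaceholder(variable_name="messages"),
]) | llm

# Revisor: critiques the latest draft, flagging missing or superfluous content.
revisor = ChatPromptTemplate.from_messages([
    ("system", "Critique the latest answer: what is missing, what is "
               "superfluous, and which claims need sources?"),
    MessagesPlaceholder(variable_name="messages"),
]) | llm

def fake_search(query: str) -> str:
    # Stand-in for a real search/tool call that would ground the revision.
    return f"[no external evidence found for: {query}]"

def reflexion(question: str, n_iterations: int = 2) -> str:
    messages = [HumanMessage(content=question)]
    draft = actor.invoke({"messages": messages})           # 1. initial draft
    for _ in range(n_iterations):                          # 5. fixed N loops
        messages.append(draft)
        critique = revisor.invoke({"messages": messages})  # 3. critique
        evidence = fake_search(question)                   # 2. tool execution
        messages.append(HumanMessage(
            content=f"Critique: {critique.content}\nEvidence: {evidence}"))
        draft = actor.invoke({"messages": messages})       # 4. revise
    return draft.content
```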
Code Implementation Example
Here's a practical implementation of a Basic Reflection agent using LangGraph, demonstrated through a Twitter content improvement system:
System Architecture
The reflection agent follows this flow: the generate node produces a draft tweet, the reflect node critiques it as if the critique were user feedback, and control loops back to generate until a fixed message-count limit ends the run.
📝 Prompt Chains Implementation (chains.py)
```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

# Reflector: plays the critic persona, grading each draft tweet.
reflection_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a viral twitter influencer grading a tweet. Generate critique and recommendations for the user's tweet. "
            "Always provide detailed recommendations, including requests for length, virality, style, etc.",
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

# Generator: writes the tweet, revising whenever it receives a critique.
generation_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a twitter techie influencer assistant tasked with writing excellent twitter posts."
            " Generate the best twitter post possible for the user's request."
            " If the user provides critique, respond with a revised version of your previous attempts.",
        ),
        MessagesPlaceholder(variable_name="messages"),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
generate_chain = generation_prompt | llm
reflect_chain = reflection_prompt | llm
```
🔄 Graph Structure Implementation (main.py)
```python
from typing import TypedDict, Annotated

from dotenv import load_dotenv

load_dotenv()

from langchain_core.messages import BaseMessage, HumanMessage
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages

from chains import generate_chain, reflect_chain


# Conversation state: add_messages appends new messages rather than overwriting.
class MessageGraph(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


REFLECT = "reflect"
GENERATE = "generate"


def generation_node(state: MessageGraph):
    return {"messages": [generate_chain.invoke({"messages": state["messages"]})]}


def reflection_node(state: MessageGraph):
    res = reflect_chain.invoke({"messages": state["messages"]})
    # Re-wrap the critique as a HumanMessage so the generator treats it
    # as user feedback on its previous attempt.
    return {"messages": [HumanMessage(content=res.content)]}


builder = StateGraph(state_schema=MessageGraph)
builder.add_node(GENERATE, generation_node)
builder.add_node(REFLECT, reflection_node)
builder.set_entry_point(GENERATE)


def should_continue(state: MessageGraph):
    # Stop once the history exceeds six messages (roughly three
    # generate/reflect rounds); otherwise hand off to the reflector.
    if len(state["messages"]) > 6:
        return END
    return REFLECT


builder.add_conditional_edges(GENERATE, should_continue)
builder.add_edge(REFLECT, GENERATE)
graph = builder.compile()
```
🚀 Usage Example
```python
if __name__ == "__main__":
    inputs = {
        "messages": [
            HumanMessage(
                content="""Make this tweet better:
                @LangChainAI — newly Tool Calling feature is seriously underrated.
                After a long wait, it's here - making the implementation of agents
                across different models with function calling - super easy.
                Made a video covering their newest blog post
                """
            )
        ]
    }
    response = graph.invoke(inputs)
    print(response)
```
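Note that print(response) dumps the entire generate/reflect message history. If you only want the final revised tweet, a small extraction step (my assumption about how you would consume the state, not part of the original example) could look like:

```python
# The run always ends after a generate step, so the final revised tweet
# is the last message in the returned state.
final_tweet = response["messages"][-1].content
print(final_tweet)
```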
Key Implementation Features
- State Management: Uses the MessageGraph TypedDict with add_messages to maintain conversation history
- Node Functions: Separate functions encapsulate the generation and reflection logic
- Conditional Logic: The should_continue function limits iterations to prevent infinite loops
- Chain Integration: Leverages LangChain's prompt templates and OpenAI integration
- Graph Compilation: Creates an executable workflow with an entry point and edges
This implementation demonstrates the Basic Reflection technique from the table above, showing how the generator and reflector alternate until a fixed message-count limit is reached.
Implementation Details & Elements to Consider
Here are things the blog mentions (or implies) that matter when deploying reflection agents:
- Prompting Personas: In basic reflection, for example, the reflector may be asked to assume a teacher or critic persona, with style instructions; this helps guide constructive feedback. (LangChain Blog)
- External Tools / Observations: Reflection is more powerful when grounded: using searches, tools, and external data to verify claims or supply missing content. Without grounding, reflection might just be opinion; the Reflexion method includes this grounding. (LangChain Blog)
- Looping / Iterations: You must decide how many reflection cycles to run and when to stop: a fixed count, a dynamic score-based criterion, or some other rule. (LangChain Blog)
- Graph / Tree Structure: The blog implements these methods via LangGraph, a graph abstraction over states and messages (or, for LATS, a tree of trajectories), which helps organize and control the flow. (LangChain Blog)
- Evaluation / Reward / Scoring: Particularly for tree search, you need a reliable way to evaluate or reflect on outcomes (scores); without a good scoring metric, the search may choose poor trajectories (see the skeleton after this list). (LangChain Blog)
- Performance vs Quality Trade-off: Reflection adds cost: more compute time, more LLM calls, more tooling, and higher latency. For real-time or cost-sensitive tasks you may accept less reflection; the blog acknowledges this trade-off. (LangChain Blog)
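To ground the select and backpropagate steps that LATS depends on, here is a generic MCTS-style skeleton. The Node class, UCT formula, and exploration constant are standard search machinery and my assumption, not the blog's code; in the real agent, the score fed to backpropagate would come from the LLM reflection/evaluation step, and expansion would generate candidate actions with the LLM.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One candidate action/trajectory step in the search tree."""
    action: str
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # sum of reflection scores observed below this node

    def uct(self, c: float = 1.4) -> float:
        # Upper-confidence bound: balance exploiting high-scoring branches
        # against exploring rarely visited ones.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def select(root: Node) -> Node:
    # 1. Select: walk down the tree, always taking the highest-UCT child.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.uct())
    return node

def backpropagate(node: Node, score: float) -> None:
    # 4. Backpropagate: push the evaluation up the tree so the root
    # learns which trajectories look promising.
    while node is not None:
        node.visits += 1
        node.value += score
        node = node.parent
```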
Possible Applications & Relevance
Given your context (factories, managing operations, perhaps business decisions, planning, documentation, receivables, etc.), here is how reflection agents might help:
- Improving decision support: If you use LLMs to analyze financials, forecast, or plan improvements, a reflection agent could make the analysis more accurate (catching missing items, validating numbers).
- Automating reports or documentation: For example, safety procedures, maintenance manuals, specification sheets. Reflection can help produce higher quality drafts, catching omissions.
- Customer communications / contract drafting: Ensuring completeness (terms, risks), using reflection to critique draft before sending.
- Risk assessment / compliance: Reflecting on outputs to check if legal / accounting / labor steps are missing.
Risks & Things to Watch Out For
- Cost & Latency: More computational cost and slower responses; this matters if you need fast answers.
- Reliability of reflections / evaluations: If the LLM’s feedback or the tools it uses (search, external data) are weak, then reflections may be shallow or misguided. There’s a risk of “overconfidence” or reinforcing wrong paths.
- Complexity of implementation: It’s non-trivial to build the graphs, loop decisions, and manage evaluation metrics. You may need developer or AI engineering resources.
- Diminishing returns: For simple tasks, the benefit may not justify the cost. Also, too many loops or too deep tree search could give marginal gains.
Summary & How You Could Start
If you were to try applying reflection agents, here is a suggested procedure:
- Pick a use-case where the output quality is critical (e.g. financial forecasting, legal/compliance text, planning).
- Prototype the basic reflection method first: generator + reflector (teacher critique). This is relatively simple and lets you experiment; the code example above is a good starting point.
- Define quality criteria / evaluation metrics: what does “good output” mean for your case? Completeness, factual correctness, alignment with regulations etc.
- Add external grounding/tools if possible: for example, link to your internal data, relevant sources, search tools.
- Explore more advanced methods (like Reflexion, or eventually tree search) once you see value and have infrastructure.
- Monitor cost, latency, and how much human correction is still needed, to judge whether it's worth scaling.