
The 7 Research Papers That Explain How Every AI Agent Actually Works

A layered reading path through the seven papers that explain the reasoning, planning, tool use, memory, and grounding behind modern AI agents.

Saurabh Prakash

Mar 14, 2026 · 9 min read

Most people building AI agents are using frameworks without understanding the foundations beneath them. These 7 papers are those foundations.


There's a lot of noise right now about AI agents. Frameworks, demos, product launches, hot takes. But strip all of that away and you'll find that the cognitive architecture powering nearly every serious agent system today traces back to a handful of academic papers.

I recently went through each of them, in the order they logically build on each other, and the clarity it gave me on why modern agents work the way they do was significant.

Here's the full breakdown, structured as a progression: each paper adds a new capability layer, and by the end you can see how a complete agent mind gets assembled from scratch.


Layer 1: Teaching the Agent to Reason

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models[1]

Before an agent can do anything useful, it needs to think. Chain-of-Thought (CoT) is where that starts.

Before this paper, LLMs worked like a black box: input in, answer out. For simple queries, that was fine. For anything requiring multiple steps (math, logic, planning), the model would routinely fail, trying to compress complex reasoning into a single forward pass.

CoT's insight was deceptively simple: make the model show its work. Prompting it to generate intermediate reasoning steps before arriving at a final answer dramatically improved accuracy on hard tasks. The now-ubiquitous "let's think step by step" trigger grew out of this line of work (the exact phrase comes from Kojima et al.'s zero-shot follow-up, but this paper established chain-of-thought prompting itself).

An agent that can't reason step by step can't plan, can't debug its own logic, and can't handle any task with more than one moving part. CoT is the bedrock everything else builds on.
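A few-shot CoT prompt is just careful string assembly: one worked exemplar plus the step-by-step trigger. A minimal sketch, with an exemplar and question invented for illustration:

```python
# One worked example showing intermediate reasoning steps (illustrative).
COT_EXEMPLAR = (
    "Q: A pen costs $2 and a notebook costs $3. How much do 2 pens "
    "and 1 notebook cost?\n"
    "A: Let's think step by step. 2 pens cost 2 * $2 = $4. "
    "1 notebook costs $3. $4 + $3 = $7. The answer is 7.\n"
)

def cot_prompt(question: str) -> str:
    """Prepend a worked exemplar and end with the step-by-step trigger,
    so the model continues with its own reasoning chain."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA: Let's think step by step."

print(cot_prompt("If 3 apples cost $6, what do 5 apples cost?"))
```

In the zero-shot variant, the exemplar is dropped and the trigger phrase alone does the work.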


Layer 2: Giving Reasoning Structure

Tree of Thoughts: Deliberate Problem Solving with Large Language Models[2]

CoT is linear. Thought -> Thought -> Thought -> Answer. That works for straightforward problems, but what about problems where you might go down the wrong path? A linear chain has no way to backtrack.

Tree of Thoughts (ToT) upgrades the reasoning process from a chain to a tree. The model generates multiple possible next thoughts at each step, evaluates which branches look most promising, and explores the best ones using search strategies like BFS or DFS. Dead ends get pruned. Promising paths go deeper.

Think of how a chess player thinks: not just "my next move," but "if I do this, then they do that, then I can do this." ToT gives agents that same lookahead and recovery capability.

Any agent doing multi-step planning, writing complex code, or solving problems with many possible approaches needs this. CoT gives you a straight line. ToT gives you a map.
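The expand-score-prune loop can be sketched as a breadth-first search. Here `propose` and `score` are toy stand-ins (growing a string toward a target); in the paper, both would be LLM calls:

```python
def tot_bfs(root, propose, score, beam_width=2, depth=3):
    """Breadth-first Tree of Thoughts: expand each state, score all
    candidates, keep only the most promising branches (prune the rest)."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # dead ends fall off here
    return max(frontier, key=score)

# Toy stand-ins: a real system would ask the model to propose next
# thoughts and to rate each partial solution.
target = "abc"
propose = lambda s: [s + c for c in "abc"]
score = lambda s: sum(a == b for a, b in zip(s, target))

best = tot_bfs("", propose, score, beam_width=3, depth=3)
print(best)  # -> "abc"
```

Swapping the queue discipline turns this into DFS; the paper evaluates both.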


Layer 3: Connecting Reasoning to the Real World

ReAct: Synergizing Reasoning and Acting in Language Models[3]

The agent can now reason deeply. But reasoning alone is isolated. It's all happening inside the model's head, with no connection to external reality. That's where hallucination lives.

ReAct (Reasoning + Acting) proposes that agents should alternate between thinking and doing. The loop looks like this:

  • Thought: "I need to verify the current data."
  • Action: Search("Q1 2024 revenue figures")
  • Observation: "$4.2B, up 12% YoY"
  • Thought: "Now I can use this to answer accurately."
  • Answer: ...

By grounding reasoning steps in real observations from the environment, the model stops confabulating. It checks its assumptions rather than hallucinating answers.

This is the core action loop of almost every agent framework you'll encounter. LangChain, LangGraph, and AutoGen all implement variations of this Think -> Act -> Observe cycle. ReAct is the heartbeat.
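The Think -> Act -> Observe cycle fits in a few lines. `llm_step` below is a hand-written stand-in for the model's policy, and the revenue figure is the one from the example above; the loop itself is the point:

```python
def react_loop(question, llm_step, tools, max_steps=5):
    """Alternate reasoning and acting until the policy emits Finish."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = llm_step(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "Finish":
            transcript.append(f"Answer: {arg}")
            return arg, transcript
        observation = tools[action](arg)  # ground reasoning in reality
        transcript.append(f"Action: {action}({arg!r})")
        transcript.append(f"Observation: {observation}")
    return None, transcript

# Toy policy: look the figure up first, then answer from the observation.
facts = {"Q1 2024 revenue": "$4.2B, up 12% YoY"}

def llm_step(transcript):
    observations = [l for l in transcript if l.startswith("Observation:")]
    if not observations:
        return "I need to verify the current data.", "Search", "Q1 2024 revenue"
    return "Now I can answer accurately.", "Finish", observations[-1].split(": ", 1)[1]

answer, _ = react_loop("What was Q1 2024 revenue?", llm_step,
                       {"Search": facts.get})
print(answer)  # -> "$4.2B, up 12% YoY"
```

In a real framework, `llm_step` parses the model's next completion and `tools` wraps search, code execution, or APIs, but the control flow is exactly this.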


Layer 4: Expanding What "Acting" Means

Toolformer: Language Models Can Teach Themselves to Use Tools[4]

ReAct establishes that agents should call tools. Toolformer asks the deeper question: how does a model learn to use tools well in the first place?

The paper presents a self-supervised method where the model learns, without heavy human annotation, to:

  • Decide when a tool call is actually needed
  • Determine what arguments to pass
  • Incorporate the tool's result naturally back into generation

Tools in the paper include a calculator, search engine, calendar, translation API, and more. The model learns to invoke them mid-generation, exactly where they add informational value, not randomly.

Every modern agent has a tool belt: web search, code execution, database queries, API calls. Toolformer is the theoretical backbone for why models can do this coherently. It reframes tools not as external patches but as a natural extension of language generation itself.
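The mechanical half of this, executing a tool call the model emits mid-generation and splicing the result back into the text, can be sketched with a regex. The `[Tool(args)]` marker syntax loosely mirrors the paper's API-call format; the generated sentence is invented:

```python
import re

# Tool registry: name -> callable. The restricted eval is a sketch-level
# calculator, not a production-safe one.
TOOLS = {
    "Calculator": lambda expr: str(round(eval(expr, {"__builtins__": {}}), 2)),
}

CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_tool_calls(text: str) -> str:
    """Replace each inline [Tool(args)] marker with the tool's result."""
    def run(match):
        name, arg = match.group(1), match.group(2)
        return TOOLS[name](arg)
    return CALL.sub(run, text)

generated = "Out of 1400 participants, 400 (or [Calculator(400 / 1400)]) passed."
print(execute_tool_calls(generated))
# -> "Out of 1400 participants, 400 (or 0.29) passed."
```

Toolformer's contribution is the other half: a self-supervised recipe for teaching the model *where* to emit those markers in the first place.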


Layer 5: Learning from Mistakes Without Retraining

Reflexion: Language Agents with Verbal Reinforcement Learning[5]

The agent can now reason, plan, and use tools. But what happens when it fails? Traditional ML says: collect the failure, retrain. That's expensive, slow, and impossible at inference time.

Reflexion solves this with verbal reinforcement. After a failed attempt, the agent writes a natural language reflection on what went wrong:

"I called the wrong API endpoint because I assumed the parameter was a string, not an integer."

That reflection gets stored in memory. On the next attempt, the agent reads its own post-mortem and avoids repeating the mistake. No gradient updates. No retraining. Just the agent reading its own notes and doing better.

This is what separates an agent that loops forever making the same errors from one that genuinely improves. Any long-horizon task (writing a full codebase, executing a multi-step research workflow) requires learning within the session. Reflexion is how you get that.
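The try-critique-reflect-retry loop is small. `attempt` and `critique` below are deterministic stand-ins for the model and its evaluator, replaying the string-vs-integer mistake quoted above:

```python
def reflexion(task, attempt, critique, max_trials=3):
    """Retry a task, feeding the agent its own failure post-mortems.
    No gradient updates: the 'learning' is text in the context."""
    reflections = []  # the agent's verbal memory of what went wrong
    for trial in range(max_trials):
        result = attempt(task, reflections)
        ok, feedback = critique(result)
        if ok:
            return result, reflections
        reflections.append(f"Trial {trial}: {feedback}")
    return None, reflections

# Toy stand-ins: the first attempt passes a string where an int is needed;
# once the reflection mentions the mistake, the retry corrects it.
def attempt(task, reflections):
    if any("integer" in r for r in reflections):
        return {"limit": 10}   # corrected call
    return {"limit": "10"}     # the original mistake

def critique(result):
    if isinstance(result["limit"], int):
        return True, "ok"
    return False, "I passed the parameter as a string, not an integer."

result, notes = reflexion("call the API", attempt, critique)
print(result, notes)  # succeeds on the second trial
```

In the paper, both roles are played by LLMs (plus unit tests or environment reward as the success signal), but the loop structure is the same.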


Layer 6: Scaling to Full Behavioral Simulation

Generative Agents: Interactive Simulacra of Human Behavior[6]

Now we have an agent that can reason deeply, plan with structure, act in the world, use tools, and learn from failure. Generative Agents asks: can we sustain believable human behavior over long time horizons?

Doing so required three new components:

Memory stream: every experience the agent has is logged as a timestamped natural language memory.

Reflection: periodically, the agent synthesizes its own memories into higher-order insights. "Alice tends to start her mornings slowly. She values routine." This prevents the agent from drowning in raw events with no sense of larger pattern.

Planning: reflections inform daily plans, which in turn inform moment-to-moment actions. The agent isn't purely reactive; it has intentions that persist over time.

The paper demonstrated this with 25 agents in a Sims-like sandbox town, and they spontaneously organized a Valentine's Day party: one agent invited another, the plans spread organically, and nobody scripted that sequence.

Most production agents are short-horizon: do this one task, stop. Generative Agents shows what the architecture looks like when you need persistent identity, long-term goals, and behavior that remains coherent over days, not just seconds.
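The memory stream's retrieval rule, a weighted mix of recency, importance, and relevance, can be sketched directly. In the paper, importance comes from an LLM rating and relevance from embedding similarity; here keyword overlap stands in for relevance, and the memories are invented:

```python
class MemoryStream:
    def __init__(self, decay=0.995):
        self.memories = []   # (timestamp, importance 1-10, text)
        self.decay = decay   # exponential recency decay per time unit

    def add(self, text, importance, ts):
        self.memories.append((ts, importance, text))

    def retrieve(self, query, now, k=2):
        """Rank memories by recency + importance + relevance, return top k."""
        terms = set(query.lower().split())
        def score(mem):
            ts, importance, text = mem
            recency = self.decay ** (now - ts)
            relevance = len(terms & set(text.lower().split())) / len(terms)
            return recency + importance / 10 + relevance
        ranked = sorted(self.memories, key=score, reverse=True)
        return [text for _, _, text in ranked[:k]]

stream = MemoryStream()
stream.add("Alice watered the garden", importance=2, ts=0)
stream.add("Alice is planning a Valentine's party", importance=8, ts=5)
stream.add("Bob bought groceries", importance=3, ts=9)
print(stream.retrieve("party planning with Alice", now=10))
```

The reflection and planning layers sit on top of this: periodically summarize high-scoring memories into insights, then let those insights seed the day's plan.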


Layer 7: Keeping the Agent Grounded in Facts

Retrieval-Augmented Generation for Large Language Models: A Survey[7]

Every layer so far covers how the agent thinks and acts. RAG addresses a different problem: what does the agent actually know, and how do you keep that knowledge accurate?

LLMs have two fundamental knowledge problems. Their training data has a cutoff, so they don't know recent events. And they hallucinate, generating plausible but wrong information when queried outside their training distribution.

RAG fixes both by splitting the problem in two:

  • Retrieval: before generating, fetch relevant documents from an external knowledge source (your company docs, a vector database, or the live web)
  • Generation: the LLM answers conditioned on those retrieved documents, not just on its parametric memory

The survey covers the full evolution: naive RAG (basic fetch-then-generate), advanced RAG (smarter chunking, re-ranking, query rewriting), and modular RAG (mixing retrieval strategies based on task type).

Any agent operating in a real-world domain (legal, medical, financial, enterprise) cannot rely purely on training weights. RAG is how you plug an agent into your actual data and keep it honest.
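Naive RAG is just fetch-then-generate. A minimal sketch, with keyword overlap standing in for a vector index and prompt assembly standing in for the generation call (the documents are made up):

```python
DOCS = [
    "Refund policy: purchases can be refunded within 30 days.",
    "Shipping: standard delivery takes 5-7 business days.",
    "Support hours: the help desk is open 9am-5pm on weekdays.",
]

def retrieve(query, docs, k=1):
    """Rank documents by term overlap with the query; a real system
    would use embeddings and a vector index here."""
    terms = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Condition generation on retrieved context, not parametric memory."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}\nAnswer:")

print(build_prompt("how many days for a refund", DOCS))
```

The advanced and modular variants the survey describes slot in at the `retrieve` step: chunking, re-ranking, query rewriting, and routing between retrieval strategies.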


The Complete Picture

| Paper | Capability Added |
| --- | --- |
| Chain-of-Thought | Step-by-step reasoning |
| Tree of Thoughts | Exploration and backtracking |
| ReAct | Grounded action loops |
| Toolformer | Fluent tool use |
| Reflexion | In-context learning from failure |
| Generative Agents | Long-horizon persistent behavior |
| RAG | Factual grounding in external knowledge |

Read them in this order and you're not just collecting summaries. You're watching a cognitive architecture get assembled, one capability at a time.

The people building serious agent systems aren't doing so by following framework tutorials alone. They understand why the loop works the way it does. These papers are where that understanding lives.


Further Reading and Resources

  1. Read the Lilian Weng agent guide for a holistic systems view
  2. Work through the papers in the order above
  3. Take the DeepLearning.AI LangGraph course with free audit access
  4. Clone LangChain and start building

If this was useful, share it with someone building in AI. The foundations matter more than the frameworks.


References

[1]: Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv

[2]: Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. arXiv

[3]: Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv

[4]: Schick, T., Dwivedi-Yu, J., Dessi, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv

[5]: Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv

[6]: Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023. arXiv

[7]: Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., & Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv