This post is adapted from my talk at Node Congress 2026.
TL;DR

- Context windows are marketing numbers: LLMs struggle with information buried in the middle of long prompts, even when it technically "fits".
- Recursive Language Models (RLMs) treat prompts as programmatic environments where a root LM writes code to orchestrate recursive sub-LM calls. Our implementation adapts this concept into a practical three-role architecture (orchestrator, parallel workers, synthesizer) using LangGraph.
- We ran it on a 904K-character document (213K tokens): a naive single-pass call fails with a 400 error, while RLM found all 10 breaking change areas with chunk-level citations.
- LangGraph gives you the state management, iteration, and conditional routing needed to implement RLM cleanly in ~400 lines of TypeScript.
The 200K Token Lie
You've heard the pitch. Claude offers up to 200K tokens (with 1M in beta). GPT-4o handles 128K tokens (with newer models like GPT-4.1 Mini reaching 1M and GPT-5.4 exceeding 1.1M). Gemini supports up to 2 million tokens. So you stuff your entire API migration guide into the prompt, ask "what are all the breaking changes?" and get back... a partial list that confidently misses the webhook schema change buried on page 34.
What happened? Your document was well under the token limit. The API didn't complain. You paid for all those tokens.
Here's the uncomfortable truth: context windows are theoretical maximums, not practical working limits. Research by Liu et al. (2024) at Stanford demonstrated that LLMs experience significant performance degradation when relevant information is placed in the middle of long contexts. They're decent at using information near the beginning (primacy) and near the end (recency), but everything in between becomes progressively fuzzier. The phenomenon is so consistent there's a name for it: the "lost in the middle" problem.
The answer isn't to wait for bigger context windows. Even if we had 10 million token windows tomorrow, the attention mechanism that powers these models would still struggle with needle-in-haystack retrieval across massive contexts. We need a different approach.
The Problem with Long Context
Let's get specific about why context windows fail in practice.
The "needle in a haystack" benchmark is the sanitized version of this problem. Drop a specific fact into a sea of irrelevant text and ask the model to retrieve it. Leading frontier LLMs (Claude Opus, GPT-5, Gemini Pro) now score well on this synthetic test, achieving >95% accuracy. But real documents aren't haystacks with one needle — they're complex webs of interrelated information where understanding requires synthesis across multiple sections.
RAG (Retrieval Augmented Generation) helps by chunking documents, embedding them, and retrieving only relevant chunks. But traditional RAG has blindspots. It's optimized for finding specific facts, not for understanding how different sections of a document relate to each other. If you're analyzing a monorepo and need to understand how the authentication middleware in auth/middleware.ts interacts with the rate limiter in services/limiter.ts and the retry logic in lib/resilience.ts, basic RAG's independent chunk retrieval often misses these connections. Advanced techniques like GraphRAG and agentic RAG have begun to address these limitations, but they add significant complexity and are still maturing.
Map-reduce gets closer. Split the document, process each chunk, aggregate the results. But classic map-reduce is too rigid. It processes every chunk even when most are irrelevant. Single-pass map-reduce runs one iteration and stops. While iterative variants like refine chains exist in LangChain, they add complexity and still lack the dynamic chunk selection that makes RLM effective. The "map" phase is embarrassingly parallel but dumb — no guidance about what to focus on.
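To make that contrast concrete, classic single-pass map-reduce fits in a few lines. This is an illustrative sketch, with `summarize` standing in for whatever LLM call you'd use:

```typescript
// Classic single-pass map-reduce: every chunk is processed, relevant or not,
// and there is exactly one aggregation pass with no chance to revisit chunks.
async function mapReduce(
  chunks: string[],
  summarize: (text: string) => Promise<string>
): Promise<string> {
  // Map: process all chunks in parallel, with no guidance about relevance.
  const partials = await Promise.all(chunks.map((chunk) => summarize(chunk)));
  // Reduce: one pass over the partial results, then stop.
  return summarize(partials.join("\n"));
}
```

RLM keeps the parallel "map" but adds the two things this sketch lacks: LLM-driven chunk selection before the map, and the option to loop back for another pass afterward.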
What we actually need is something that combines the efficiency of selective processing with the thoroughness of multi-pass analysis. We need an approach that can decide which parts of a document deserve deep analysis, process those parts in parallel, accumulate structured findings, and iterate if the initial pass missed something important.
Enter the Recursive Language Model
The Recursive Language Model concept, introduced by Zhang, Kraska, and Khattab at MIT (2025), offers a principled solution to the long-context problem. Instead of fighting context limits by stuffing more tokens into a single prompt, RLM works deliberately within those limits through iteration and programmatic orchestration.
The MIT Paper's Architecture
The original RLM paper describes a single root language model that writes Python code in a persistent REPL environment. The root LM can invoke sub-LMs recursively through programmatic function calls, with each sub-LM call happening sequentially (blocking) and returning structured results that the root LM processes through code execution. This creates a feedback loop where the model's code output guides its own next steps — hence "recursive." The paper's implementation treats prompts as programmatic environments, not as a multi-agent system with distinct roles.
Our Practical Adaptation for TypeScript
For our practical TypeScript implementation with LangGraph, we've adapted the RLM concept into three distinct roles, each powered by LLM calls. This is our design choice for a production-ready system, not what the MIT paper describes:
The Orchestrator manages the workflow. On the first iteration, it chunks the document into manageable pieces and decides which chunks are relevant to the user's query. Unlike blind map-reduce, the orchestrator uses an LLM to make this selection — it reads chunk summaries and applies reasoning about relevance. On subsequent iterations, it reviews accumulated findings and decides whether to analyze additional chunks or proceed to synthesis.
The Workers are specialized agents that each analyze one chunk in depth. Because each worker only sees a single chunk plus the query, they operate well within the effective context window. They're prompted to extract specific findings, assign relevance scores, and cite their sources. In our implementation, workers run in parallel using Promise.all, making this phase fast despite multiple LLM calls. (Note: The MIT paper's implementation uses sequential, blocking sub-LM calls; parallel execution is our design choice for performance.)
The Synthesizer takes all accumulated findings and produces the final answer. It's not working with raw document text — it's working with structured, pre-analyzed findings. This makes synthesis tractable even for very long documents.
The key insight from the RLM concept is iteration. After the first round of analysis, the orchestrator can review what was found and decide if additional chunks need examination. Maybe the initial chunk selection missed a relevant section. Maybe a finding from one chunk suggests another chunk deserves analysis. This feedback loop enables comprehensive document analysis within practical context limits.
Why LangGraph?
LangGraph is a graph execution framework from the LangChain team, designed specifically for building stateful, multi-step LLM workflows. It's not another prompt wrapper library. It's a runtime for applications where LLM calls are nodes in a graph, and the graph structure determines how information flows between them.
The framework gives you three critical capabilities for implementing RLM:
State management with reducers. Each node in the graph can read and modify a shared state object. You define reducers that specify how state updates merge — do new findings append to the list, or replace it? This is essential for RLM's accumulation pattern.
Conditional edges. After a node executes, you can route to different next nodes based on the current state. The orchestrator can decide whether to fan out to workers or skip to synthesis. Workers can decide whether to iterate again or finish.
Native support for cycles. Unlike traditional DAGs, LangGraph graphs can have loops. You can send execution back to a previous node based on conditions. This enables the multi-pass iteration that makes RLM effective.
The framework has a well-maintained TypeScript SDK with proper type inference and async/await throughout. While the Python SDK often receives experimental features first, the TypeScript SDK has reached production-ready maturity. No callback hell, no stringly-typed state keys. For developers coming from the Node.js ecosystem, the async/await patterns and middleware-like node composition will feel familiar.
LangGraph is the perfect fit for RLM because the pattern maps directly to its primitives: stateful nodes, conditional routing, iteration.
Implementation Walkthrough
Let's build a working RLM system. I'll walk through the key pieces with actual code.
Defining the State
The state object is the spine of the application. Every node reads from it and writes to it. LangGraph uses an Annotation.Root structure to define both the shape of the state and how updates merge.
import { Annotation } from "@langchain/langgraph";
interface Finding {
content: string;
chunkId: string;
relevanceScore: number;
citations: string[];
}
const RLMState = Annotation.Root({
query: Annotation<string>,
document: Annotation<string>,
documentChunks: Annotation<string[]>({
reducer: (_, b) => b,
default: () => [],
}),
pendingChunks: Annotation<string[]>({
reducer: (_, b) => b,
default: () => [],
}),
findings: Annotation<Finding[]>({
reducer: (a, b) => [...a, ...b],
default: () => [],
}),
iteration: Annotation<number>({
reducer: (_, b) => b,
default: () => 0,
}),
maxIterations: Annotation<number>({
reducer: (_, b) => b,
default: () => 3,
}),
finalAnswer: Annotation<string>({
reducer: (_, b) => b,
default: () => "",
}),
});
type RLMStateType = typeof RLMState.State;
Pay attention to the findings field. Its reducer is (a, b) => [...a, ...b] — this appends new findings to the existing array rather than replacing it. This accumulation pattern is central to RLM. Each iteration adds to the findings, building up a comprehensive analysis across multiple passes.
The iteration counter tracks how many times we've looped. The maxIterations cap prevents infinite loops if the orchestrator keeps deciding more analysis is needed.
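Reducer semantics are easy to check in isolation. Here is a framework-free sketch of the two merge strategies the state above uses (the names `appendReducer` and `replaceReducer` are ours, not LangGraph APIs):

```typescript
// Append: how `findings` accumulates across iterations.
const appendReducer = <T>(current: T[], update: T[]): T[] => [...current, ...update];

// Replace: how `pendingChunks` and `iteration` behave. Each node's
// update overwrites the previous value entirely.
const replaceReducer = <T>(_current: T, update: T): T => update;

// Two iterations' worth of findings accumulate...
let findings: string[] = [];
findings = appendReducer(findings, ["auth header removed"]);
findings = appendReducer(findings, ["pagination now cursor-based"]);

// ...while pendingChunks is simply overwritten each round.
const pending: string[] = replaceReducer(["chunk-3"], []);

console.log(findings.length); // 2
console.log(pending.length);  // 0
```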
The Orchestrator
The orchestrator's job on the first iteration is to chunk the document and select relevant chunks. On subsequent iterations, it reviews findings and decides whether to continue.
async function orchestrator(state: RLMStateType): Promise<Partial<RLMStateType>> {
const { document, query, iteration, findings, maxIterations } = state;
// First iteration: chunk and select
if (iteration === 0) {
const chunks = chunkDocument(document, 2000); // ~2K tokens per chunk
const selectionPrompt = `
You are analyzing a document to answer this query: "${query}"
Here are summaries of each chunk:
${chunks.map((chunk, i) => `Chunk ${i}: ${chunk.slice(0, 200)}...`).join('\n\n')}
Which chunks are most relevant to answering the query?
Return a JSON array of chunk indices, e.g., [0, 3, 7]
`.trim();
const response = await llm.invoke(selectionPrompt);
const selectedIndices = JSON.parse(response.content as string);
const pendingChunks = selectedIndices.map((i: number) => chunks[i]);
return {
documentChunks: chunks,
pendingChunks,
iteration: 1,
};
}
// Later iterations: decide whether to continue
if (iteration >= maxIterations) {
return { pendingChunks: [] }; // Stop iterating
}
const reviewPrompt = `
You have gathered ${findings.length} findings so far for the query: "${query}"
Findings summary:
${findings.map(f => `- ${f.content} (relevance: ${f.relevanceScore})`).join('\n')}
Do you need to analyze additional chunks? If yes, which ones?
Return a JSON array of chunk indices or an empty array [] to proceed to synthesis.
`.trim();
const response = await llm.invoke(reviewPrompt);
const selectedIndices = JSON.parse(response.content as string);
const pendingChunks = selectedIndices.map((i: number) => state.documentChunks[i]);
return {
pendingChunks,
iteration: iteration + 1,
};
}
The first iteration splits the document into ~2K token chunks using a simple text splitter. Then it uses an LLM call to select relevant chunks. The LLM sees a short summary of each chunk and decides which ones are worth analyzing in depth. This is smarter than processing everything or using keyword matching — it's semantic relevance filtering.
On subsequent iterations, the orchestrator reviews the accumulated findings and decides if more chunks are needed. If the findings look comprehensive, it returns an empty pendingChunks array, which signals synthesis should begin.
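One helper the snippets above assume but never define is `chunkDocument`. A minimal sketch, assuming the second argument is an approximate token budget (at roughly 4 characters per token) and that we prefer paragraph boundaries; the demo repo's actual splitter may differ:

```typescript
// Minimal chunker: packs paragraphs into chunks of roughly `maxTokens`
// tokens, estimated at ~4 characters per token. Illustrative sketch only.
function chunkDocument(document: string, maxTokens: number): string[] {
  const maxChars = maxTokens * 4; // rough token-to-character estimate
  const paragraphs = document.split(/\n\s*\n/);
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    // Flush the current chunk if adding this paragraph would overflow it.
    if (current.length + para.length > maxChars && current.length > 0) {
      chunks.push(current.trim());
      current = "";
    }
    // A single oversized paragraph gets hard-split by character count.
    if (para.length > maxChars) {
      for (let i = 0; i < para.length; i += maxChars) {
        chunks.push(para.slice(i, i + maxChars));
      }
    } else {
      current += para + "\n\n";
    }
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

For production, a battle-tested splitter (e.g. LangChain's recursive character splitter) with overlap between chunks is usually a better choice than this naive version.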
Parallel Workers
The worker node is where deep analysis happens. Each worker receives one chunk and extracts structured findings.
async function subAgent(state: RLMStateType): Promise<Partial<RLMStateType>> {
const { query, pendingChunks } = state;
// Process all pending chunks in parallel
const findingBatches = await Promise.all(
pendingChunks.map(async (chunk, index) => {
const workerPrompt = `
You are analyzing one section of a larger document.
Query: "${query}"
Chunk:
${chunk}
Extract all findings relevant to the query. For each finding, provide:
1. The finding content
2. A relevance score (0-10)
3. Direct citations from the chunk
Return a JSON array of findings with shape:
[
{
"content": "...",
"relevanceScore": 8,
"citations": ["exact quote from chunk"]
}
]
`.trim();
const response = await llm.invoke(workerPrompt);
const findings: Omit<Finding, 'chunkId'>[] = JSON.parse(response.content as string);
// Caveat: `index` is the position within this batch's pendingChunks,
// not the original document chunk index. Track original indices if
// citations need to stay stable across iterations.
return findings.map(f => ({
...f,
chunkId: `chunk-${index}`,
}));
})
);
const allFindings = findingBatches.flat();
return {
findings: allFindings,
pendingChunks: [], // Clear pending chunks after processing
};
}
The critical line is Promise.all(pendingChunks.map(...)). Each chunk gets analyzed in parallel. If you selected 5 chunks, you make 5 concurrent LLM calls. Because each worker only sees one chunk (~2K tokens) plus the query and prompt, you're well within effective context limits. The LLM can focus.
Note on Production Patterns: For production use, LangGraph's Send API is the idiomatic approach for fan-out parallelism, providing better observability, state management, and graph visualization. We use Promise.all here for simplicity and pedagogical clarity.
Workers return structured Finding objects, not unstructured text. Each finding has a relevance score and citations. This structure makes synthesis much easier — the synthesizer doesn't need to parse natural language to understand what was found.
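One practical caveat with the worker (and orchestrator) code above: `JSON.parse(response.content as string)` assumes the model returns bare JSON. Models often wrap output in markdown fences or surround it with prose, so a small defensive extractor is worth having. This helper is our addition, not part of the demo as shown:

```typescript
// Tolerant JSON extraction: strips markdown code fences and surrounding
// prose before parsing. Throws if no JSON value can be found.
function extractJson<T>(raw: string): T {
  // Prefer the contents of a ```json ... ``` (or plain ```) fence if present.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = (fenced ? fenced[1] : raw).trim();
  // Fall back to the span between the first opening bracket and the
  // last matching closer of the same kind.
  const start = candidate.search(/[\[{]/);
  if (start === -1) throw new Error("No JSON found in model output");
  const closer = candidate[start] === "[" ? "]" : "}";
  const end = candidate.lastIndexOf(closer);
  if (end === -1) throw new Error("Unbalanced JSON in model output");
  return JSON.parse(candidate.slice(start, end + 1)) as T;
}
```

In the worker you'd then write `const findings = extractJson<Omit<Finding, 'chunkId'>[]>(response.content as string);` instead of the bare `JSON.parse`. Structured-output APIs (tool calling, JSON mode) are an even more robust alternative where your provider supports them.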
The Synthesizer
The synthesizer receives all accumulated findings and produces the final answer.
async function synthesizer(state: RLMStateType): Promise<Partial<RLMStateType>> {
const { query, findings } = state;
const synthesisPrompt = `
You are synthesizing findings from document analysis.
Query: "${query}"
Findings:
${findings
.sort((a, b) => b.relevanceScore - a.relevanceScore)
.map(f => `
[${f.chunkId}] Relevance: ${f.relevanceScore}/10
${f.content}
Citations: ${f.citations.join('; ')}
`).join('\n\n')}
Provide a comprehensive answer to the query based on these findings.
Cite specific chunks using [chunk-N] notation.
`.trim();
const response = await llm.invoke(synthesisPrompt);
return {
finalAnswer: response.content as string,
};
}
Notice the findings are sorted by relevance score before being presented to the LLM. High-signal information goes first, making the most of the context window's primacy bias.
The synthesizer's prompt is working with pre-analyzed, structured data. It's not rereading the entire document. It's not trying to scan 50 pages for termination clauses. The workers already did that work and handed back clean findings.
Wiring the Graph
Now we assemble the nodes into a graph with conditional routing.
import { StateGraph, START, END } from "@langchain/langgraph";
function orchestratorRoute(state: RLMStateType): string {
return state.pendingChunks.length > 0 ? "subAgent" : "synthesizer";
}
function subAgentRoute(state: RLMStateType): string {
return state.iteration >= state.maxIterations ? "synthesizer" : "orchestrator";
}
const graph = new StateGraph(RLMState)
.addNode("orchestrator", orchestrator)
.addNode("subAgent", subAgent)
.addNode("synthesizer", synthesizer)
.addEdge(START, "orchestrator")
.addConditionalEdges("orchestrator", orchestratorRoute)
.addConditionalEdges("subAgent", subAgentRoute)
.addEdge("synthesizer", END)
.compile();
The orchestratorRoute function decides: if there are pending chunks, route to the worker node. Otherwise, go straight to synthesis.
The subAgentRoute function decides: if we've hit max iterations, synthesize. Otherwise, loop back to the orchestrator for another round.
This creates a cycle: orchestrator → workers → orchestrator → workers → ... → synthesizer. The number of iterations is determined dynamically based on the orchestrator's decisions, up to maxIterations.
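Because the two route functions are pure, you can sanity-check the cycle without a single LLM call. The canned state updates below stand in for what the orchestrator and workers would really produce:

```typescript
type MockState = { pendingChunks: string[]; iteration: number; maxIterations: number };

// Same routing logic as the graph above, over a minimal mock state.
const orchestratorRoute = (s: MockState) =>
  s.pendingChunks.length > 0 ? "subAgent" : "synthesizer";
const subAgentRoute = (s: MockState) =>
  s.iteration >= s.maxIterations ? "synthesizer" : "orchestrator";

// Simulate a run where the first pass selects chunks and the second
// pass decides the findings are comprehensive.
const visited: string[] = [];
let state: MockState = { pendingChunks: [], iteration: 0, maxIterations: 3 };
let node = "orchestrator";
while (node !== "synthesizer") {
  visited.push(node);
  if (node === "orchestrator") {
    // Iteration 0 selects two chunks; later iterations select none.
    state = {
      ...state,
      pendingChunks: state.iteration === 0 ? ["c1", "c2"] : [],
      iteration: state.iteration + 1,
    };
    node = orchestratorRoute(state);
  } else {
    state = { ...state, pendingChunks: [] }; // workers consume the chunks
    node = subAgentRoute(state);
  }
}
visited.push("synthesizer");
console.log(visited.join(" -> "));
// orchestrator -> subAgent -> orchestrator -> synthesizer
```

This kind of LLM-free dry run is also a cheap unit test to keep around as you evolve the routing logic.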
Running the Demo
Clone the repo, install dependencies, and run against any document:
cd example
npm install
echo "ANTHROPIC_API_KEY=your-key-here" > .env
npx tsx src/index.ts ./api-v2-migration-guide.md \
"What are all the breaking changes in the v2 API migration?"
Here's what the real output looks like on a 904K-character API migration guide:
🚀 Recursive Language Model Demo
[Config] File: ./api-v2-migration-guide.md
[Config] Query: What are all the breaking changes in the v2 API migration?
✓ Loaded document (903,920 characters)
Starting graph execution...
[Orchestrator] Iteration 1
[Orchestrator] Chunking document...
✓ Created 127 chunks
[Orchestrator] Selecting most relevant chunks...
✓ Selected 6 relevant chunks
[SubAgent] Processing 6 chunks...
✓ Extracted 41 findings
[Synthesizer] Synthesizing answer from 41 findings...
✓ Answer synthesized
FINAL ANSWER
========================================
# Breaking Changes in v2 API Migration
1. **Authentication System Overhaul** [Chunk 1, 58]:
X-API-Key header removed, replaced by Authorization: Bearer <jwt>.
OAuth 2.1 with PKCE support added.
2. **Pagination → Cursor-Based** [Chunk 4, 57, 58]:
page/per_page params removed. Use cursor and limit instead.
3. **Response Structure** [Chunk 57, 58]:
items → data, total → pagination.total_count. Wrapped format.
4. **User Model Schema** [Chunk 57, 58]:
Single name field → first_name/last_name. IDs require usr_ prefix.
5. **Rate Limiting → IETF** [Chunk 4, 58, 59]:
X-RateLimit-* → RateLimit-*. Retry-After returns seconds, not date.
6. **Error Format → RFC 7807** [Chunk 57, 58, 59]:
Simple error messages replaced by Problem Details structure.
7. **Webhook Changes** [Chunk 57]:
New event/data envelope. Signature header renamed.
8. **Endpoint Restructuring** [Chunk 58]:
Base path /v1/ → /v2/. RESTful naming standardized.
9. **Security Enhancements** [Chunk 59]:
JWT validation, token expiration, refresh token rotation.
10. **Deprecation Timeline** [Chunk 59]:
v1 sunset: 2026-09-30. Legacy webhooks: 2026-06-30.
✓ Graph execution completed!
Before & After: Why RLM Matters
To understand the difference, compare what a naive single-pass LLM call returns for the same query on the same 904K-character document:
Naive approach (entire document in one prompt):
ERROR: 400
Type: invalid_request_error
Message: prompt is too long: 212958 tokens > 200000 maximum
The document doesn't even fit in the context window. At 212,958 tokens, it exceeds Claude's 200K limit. You can't even start the analysis. And even if a model had a window large enough, the lost-in-the-middle degradation described earlier would still erode results for a document this size.
RLM approach (chunked, parallel, synthesized): Found all 10 breaking change areas with 41 specific findings across 6 selected chunks (out of 127 total), with chunk-level citations. Each worker analyzed its chunk with full attention, so nothing was lost — auth changes, pagination, rate limits, webhooks, error format, response structure, the works.
This is the core value proposition: comprehensive analysis that doesn't miss information buried in the middle of long documents, and that works on documents far too large for any single LLM call.
Production Considerations
Before you ship this to production, let's talk about the practical challenges.
Cost is real. Let's break down a typical run using our real 904K-character document. The chunker produces 127 chunks at ~8K chars each. The orchestrator makes one LLM call to select relevant chunks (~30K tokens for chunk summaries). It selects 6 chunks, which get processed in parallel (6 LLM calls, ~2K tokens each = 12K tokens input, ~6K tokens output). The synthesizer makes one final call (~10K tokens input). Total: roughly 52K input tokens, 10K output tokens.
At Claude Sonnet pricing (as of March 2026: $3 per million input tokens, $15 per million output tokens), that's about $0.31 per run (52K × $0.000003 + 10K × $0.000015 ≈ $0.306). Not bad for one-off analysis of a 904K document that a naive approach can't even start on. But if you're processing hundreds of documents daily, costs add up fast. Budget accordingly.
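That arithmetic generalizes into a one-line estimator (prices are in dollars per million tokens; the figures below are the ones from this section):

```typescript
// Cost estimate for one RLM run, given token counts and per-million-token prices.
function estimateRunCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number
): number {
  return (
    (inputTokens / 1_000_000) * inputPricePerM +
    (outputTokens / 1_000_000) * outputPricePerM
  );
}

// The run from the text: ~52K input, ~10K output at $3 / $15 per million.
console.log(estimateRunCost(52_000, 10_000, 3, 15)); // ≈ 0.306
// Scale it: 500 documents per day for 30 days.
console.log(estimateRunCost(52_000, 10_000, 3, 15) * 500 * 30); // ≈ 4590
```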
Latency matters. Even with parallel workers, you're making multiple round trips to the LLM API. A typical run might be: 2 seconds for orchestrator, 4 seconds for parallel workers (assuming 4-second individual calls), 3 seconds for synthesizer. That's 9 seconds minimum, and you might iterate twice. Compare this to a single 6-second RAG query.
Mitigation strategies: adjust chunk size (bigger chunks = fewer LLM calls but higher per-call latency), batch worker calls if your API supports it, cache chunk embeddings if you reprocess similar documents, stream the final answer to the user while synthesis is still running.
Failure handling is non-negotiable. Any individual LLM call can fail or timeout. Wrap worker calls in try-catch blocks and decide whether to fail fast or continue with partial results. Rate limiting is common — implement exponential backoff. The maxIterations cap prevents runaway loops if the orchestrator misbehaves, but you should also add a wall-clock timeout.
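A minimal backoff wrapper along those lines might look like this (our sketch; tune the attempt count and base delay to your provider's rate limits):

```typescript
// Retries an async operation with exponential backoff: delays of
// baseMs, 2*baseMs, 4*baseMs, ... between attempts. Rethrows the last
// error once attempts are exhausted.
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        await new Promise((res) => setTimeout(res, baseMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```

Wrap each worker call as `await withRetry(() => llm.invoke(workerPrompt))`, and consider `Promise.allSettled` instead of `Promise.all` if partial results beat failing the whole batch. A production version would also retry only on retryable errors (429s, timeouts) and add jitter to the delays.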
Observability is essential. When a run produces bad results, you need to debug. Was chunk selection wrong? Did a worker hallucinate? Did synthesis miss key findings? LangSmith (LangChain's tracing platform) integrates directly with LangGraph and shows you the full execution trace — every node, every state transition, every LLM call with prompt and response. Invaluable for debugging.
Beyond tracing, log iteration counts, chunk selection rationale, and cost per run. If you notice the orchestrator always selects the maximum number of chunks, your selection prompt might be too permissive. If synthesis quality drops as document size increases, you might need better finding deduplication.
When NOT to use RLM. If your use case is well-served by RAG, stick with RAG. It's faster, cheaper, and simpler. RLM shines when you need comprehensive cross-document analysis, nuanced synthesis, or when simple chunk retrieval misses important connections. Don't overcomplicate if you don't need to.
RLM vs RAG: A Decision Framework
Here's a quick decision tree to help you choose:
- Need a specific fact from a known location? → Just use the LLM directly with the relevant section
- Need facts scattered across an unknown location in a large document? → RAG
- Need comprehensive analysis that synthesizes across the entire document? → RLM
- Processing multiple related documents that need cross-referencing? → RLM
- Extreme cost or latency sensitivity with acceptable quality trade-offs? → RAG
- Document comfortably fits in context window (< 20 pages)? → Just use the LLM directly
RAG is your workhorse for most retrieval tasks. RLM is your specialist for complex analysis.
One more consideration: RLM's structured finding accumulation gives you audit trails. You can see exactly which chunks contributed which findings. For legal, compliance, or research applications where provenance matters, this is valuable even if RAG would technically work.
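If it helps to see the decision tree as executable logic, here is one possible encoding; the flag names are ours and the priorities are judgment calls, not a canonical rule set:

```typescript
type AnalysisNeed = {
  fitsInContext: boolean;  // document comfortably fits (< ~20 pages)
  knownLocation: boolean;  // you already know where the answer lives
  needsSynthesis: boolean; // answer requires combining many sections
  crossDocument: boolean;  // multiple related documents to correlate
  costSensitive: boolean;  // extreme cost/latency constraints, quality trade-off OK
};

function chooseApproach(need: AnalysisNeed): "direct" | "rag" | "rlm" {
  // Small documents or known locations: skip the machinery entirely.
  if (need.fitsInContext || need.knownLocation) return "direct";
  if (need.needsSynthesis || need.crossDocument) {
    // Cost pressure with acceptable quality trade-offs pulls back to RAG.
    return need.costSensitive ? "rag" : "rlm";
  }
  // Scattered facts in unknown locations: classic retrieval territory.
  return "rag";
}
```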
Conclusion
Context windows will keep growing. We've already seen 10 million token windows (Llama 4 Scout launched with 10M context in April 2025), and this trend will continue. But the "lost in the middle" problem persists in current transformer architectures. While research into sparse attention and alternative positional encodings such as ALiBi and RoPE may eventually mitigate it, it remains a practical concern for production systems today. Bigger context windows just move the problem further out; they don't solve it.
Recursive Language Models give you a principled way to work with these limits rather than against them. Instead of hoping the LLM can find the needle in the haystack, you use the LLM itself to decide where to look, what to analyze, and how to synthesize.
LangGraph makes the implementation surprisingly clean. The entire system we walked through is ~400 lines of TypeScript. No black magic: just stateful graphs, conditional routing, and iteration. And while the implementation uses LangGraph APIs, those underlying patterns are conceptually portable to other frameworks.
If you're building anything that processes documents longer than 20 pages — legal contracts, research papers, technical specifications, policy documents — this pattern is worth having in your toolkit. You're not fighting the context ceiling anymore. You're working within it deliberately.
The code is modular enough that you can swap out the chunking strategy, change the worker prompts for domain-specific analysis, or add validation layers between steps. Start with the example repo, break it, extend it, make it yours.
Context windows are a constraint. But constraints breed creativity. RLM is proof of that.
Further Reading

Research & Theory
- Recursive Language Models (MIT CSAIL) — Zhang, Kraska & Khattab (2025). Original RLM paper.
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al. (2024). Transactions of the Association for Computational Linguistics, 12, 157–173. Research on context window degradation.
- LangGraph Documentation — Official LangGraph TypeScript documentation.

Code & Examples
- RLM LangGraph Demo Repository — Complete working example from this post.
- LangSmith — Observability platform for LLM applications.

Alternative Approaches
- RAG from Scratch — Understanding when RAG is the right choice.
- Anthropic Prompt Engineering Guide — Maximizing single-pass LLM performance.
- OpenAI Developer Cookbook — Techniques for working with long contexts.