The prerequisite before multi-turn RL: what an agent is, the ReAct loop, tool calling, the protocol layer (MCP/A2A), production engineering patterns, evaluation, and failure modes. Read this first, then head to agentic-and-long-horizon-rl to learn how to train it with RL.
Study notes, not the author's own research (see README disclaimer). Numbers/conclusions follow the original papers; benchmark numbers move fast and are contamination-prone, so this page only records the original-paper human baseline + what it tests, never model SOTA.
0. TL;DR
- Agent = LLM (policy) + tool I/O + memory + control loop; what it has over a chatbot is "actions that change external-world state + autonomous multi-step".
- The minimal skeleton = ReAct: Thought→Action→Observation loop until Final Answer; the Observation MUST be injected by the environment, never fabricated by the model (stop-token footgun).
- Two routes to tool use: text protocol (ReAct / Toolformer) and structured function calling (JSON schema); training back-propagates only on agent-generated tokens (see the react-tool-call-loop drill).
- Planning: Plan-and-Execute (full upfront plan) vs ReAct (per-step decision); production usually mixes "high-level plan + per-step ReAct" + plan repair.
- Protocol layer: MCP (Anthropic, connects tools/data, vertical) and A2A (Google, agent interop, horizontal); they standardize connection, they do not solve security (injection is the host's job).
- Three long-horizon system bottlenecks: O(T²) context / cost, lost-in-the-middle, error compounding.
- Production patterns: subagent orchestration (≠ multi-agent debate), tool retrieval (100+ tools), tiered memory, budget guard.
- Evaluation: SWE-bench / GAIA / OSWorld / WebArena / τ-bench each test different abilities; record only the human baseline and "what it tests".
- Reliability metric: pass@k (can it do it) vs pass^k (is it stable); agent deployment looks at the latter.
- The 6 failure modes have names (see §10); in interviews you should recite them + their mitigations.
1. Mental model
Treat the LLM as a policy: given history (dialogue + past observations), output the next action (a span of text, possibly containing a tool call). Agent = policy + tool I/O + memory + control loop:
obs ───▶ [LLM policy] ───▶ action ───▶ [environment / tools] ──┐
▲                                                              │
└──────────────── new observation ◀────────────────────────────┘
          (loop until Final Answer or budget exhausted)
Three-axis design framework (any agent can be located on three axes):
| Axis | Options (simple → complex) |
|---|---|
| Reasoning structure | direct answer → CoT (Wei 2022) → ReAct → ToT (Yao 2023) / search |
| Tool interface | none → text protocol (ReAct) → structured function calling → computer-use (screenshot+coordinates) |
| Learning signal | prompt-only → trajectory SFT → RL (see the sibling agentic-RL page) |
Classic RL vs LLM-agent:
| Dimension | Classic RL agent | LLM agent |
|---|---|---|
| Policy | small net, trained from scratch | pretrained LLM, light post-training |
| Action space | fixed low-dim | open text + tool calls |
| Prior | almost none | vast world knowledge |
| Sample efficiency | low (millions of steps) | high (zero-shot start from a prompt) |
Misconception: "a chatbot that can call tools is an agent." The key is not the tool but the closed loop + autonomous multi-step + change of external state: single-shot retrieval augmentation (RAG) is still one-shot Q&A; an agent must decide the next step from observations and keep acting until the goal is met.
2. ReAct — the minimal agent skeleton
ReAct (Yao 2022) interleaves reasoning (Thought) and acting (Action) so the model thinks while it acts: each action calls a tool, and the Observation is injected back into the context:
Thought: I should look up X first.
Action: search
Action Input: X
Observation: <tool return — injected by the environment, not generated by the model>
Thought: now I know.
Final Answer: …
Why it hallucinates less than pure CoT: pure chain-of-thought rolls forward on its own output and cannot correct intermediate facts; ReAct conditions each step on the real tool return, grounding the reasoning so a wrong fact can be corrected on the next turn.
Misconception: "ReAct always beats CoT." On pure-reasoning tasks ReAct does not necessarily beat CoT-self-consistency (in the original paper ReAct alone is weaker on HotpotQA and needs to be combined with CoT-SC); ReAct's value is on tasks that need external knowledge / actions.
stop-token footgun: at inference you MUST set Observation: as a stop sequence. Otherwise the model will continue generating an Observation: … itself (hallucinating the tool return) instead of stopping to wait for the environment to inject the real result — this is the most common ReAct production bug. Hands-on: react-tool-call-loop.
From-scratch implementation (the standard write-it-by-hand interview exercise: minimal ReAct loop + stop sequence + environment-injected observations):
import re


def react_loop(prompt, tools, llm_generate, max_steps=10):
    """Minimal ReAct loop: Thought → Action → Observation → loop.

    llm_generate(messages, stop) -> str: calls the LLM, stops immediately on stop sequences.
    tools: dict[str, callable], tool_name → executor (Obs from the environment, not the model).
    Returns (final_answer, trajectory); final_answer is None if the step budget runs out.
    """
    messages = [{"role": "user", "content": prompt}]
    trajectory = []
    for _ in range(max_steps):
        # 1. Reason + act; stop=["Observation:"] prevents hallucinated tool returns (model must stop)
        raw = llm_generate(messages, stop=["Observation:"])
        trajectory.append(raw)
        # 2. Check for final answer
        final = re.search(r"Final Answer:\s*(.*)", raw, re.S)
        if final:
            return final.group(1).strip(), trajectory
        # 3. Parse Action and Action Input
        action = re.search(r"Action:\s*(\S+)", raw)
        action_input = re.search(r"Action Input:\s*(.*)", raw, re.S)
        if not action:
            obs = "Error: no Action found. Please output 'Action: <tool_name>' then 'Action Input: <args>'."
        elif action.group(1) not in tools:
            obs = f"Error: unknown tool '{action.group(1)}'. Available: {list(tools.keys())}"
        else:
            # A missing "Action Input:" line becomes an empty argument instead of an AttributeError
            arg = action_input.group(1).strip() if action_input else ""
            try:
                result = tools[action.group(1)](arg)
                obs = str(result)  # Observation from environment/tool
            except Exception as e:
                obs = f"Tool error: {e}"
        # 4. Inject the real Observation back into context (new user/system message, NOT assistant continuation)
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content": f"Observation: {obs}"})
    return None, trajectory  # step budget exhausted

# Interview key points:
# ① stop=["Observation:"] prevents hallucination: the model stops, the environment injects the real Observation
# ② The Observation is a new message (role=user/system), not an assistant continuation
# ③ Action / Action Input are parsed with regex; production would use JSON or structured output
# ④ Tool execution is wrapped in try/except with an unknown-tool fallback, so one bad step does not kill the run
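Some backends lack stop sequences, or return text that ran past them. A minimal sketch of a client-side fallback, assuming a hypothetical helper named `truncate_at_stop` (not part of any library): cut the raw generation at the earliest stop string, so a self-written Observation is discarded before parsing.

```python
def truncate_at_stop(text: str, stops: list[str]) -> str:
    """Return text cut at the earliest occurrence of any stop string.

    Hypothetical helper: emulates a server-side stop sequence for
    backends that do not support one.
    """
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = (
    "Thought: I should look up X first.\n"
    "Action: search\n"
    "Action Input: X\n"
    "Observation: X is 42\n"  # written by the model itself, must be dropped
    "Final Answer: 42"
)
clean = truncate_at_stop(raw, ["Observation:"])
```

Everything from the fabricated `Observation:` onward is dropped, including the premature Final Answer, so the loop still waits for the real tool return.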
3. Planning: Plan-and-Execute vs ReAct
- ReAct: per-step decision (reactive) — decides the next action from the current observation each step; flexible but no global view.
- Plan-and-Execute / Plan-and-Solve (Wang 2023, a zero-shot prompt that has the model make a plan and decompose subtasks before executing, and beats Zero-shot-CoT): generate a full plan first, then execute step by step (the executor can even be a different model); globally consistent but the plan may go stale.
- Production usually mixes: a high-level plan cuts the big steps + ReAct reacts within each step; on failure, plan repair (Reflexion-style reflection (Shinn 2023: verbal reflections stored in episodic memory, no weight updates) / ToT-style search / step-wise replan).
When pure plan-execute fails: when the environment is uncertain and mid-run state changes a lot (tool returns are unexpected), a plan fixed at the start goes stale — here reactive ReAct or a mix with replan is more robust.
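The mixed pattern (high-level plan, per-step execution, replan on failure) can be sketched as follows. `make_plan` and `run_step` are hypothetical callables standing in for an LLM planner and a short per-step ReAct loop; the toy planner and executor exist only to exercise the control flow.

```python
from typing import Callable


def plan_and_execute(
    goal: str,
    make_plan: Callable[[str, list[str]], list[str]],
    run_step: Callable[[str], tuple[bool, str]],
    max_replans: int = 2,
) -> tuple[bool, list[str]]:
    """High-level plan + per-step execution + plan repair on failure."""
    log: list[str] = []
    plan = make_plan(goal, log)
    replans = 0
    while plan:
        step = plan.pop(0)
        ok, obs = run_step(step)
        log.append(f"{step} -> {obs}")
        if not ok:
            if replans >= max_replans:
                return False, log  # replan budget exhausted
            replans += 1
            plan = make_plan(goal, log)  # plan repair: replan from what was observed
    return True, log


def toy_planner(goal: str, log: list[str]) -> list[str]:
    # Falls back to a mirror once the log shows the primary download failed.
    if any("download primary" in line and "timeout" in line for line in log):
        return ["download mirror", "parse"]
    return ["download primary", "parse"]


def toy_executor(step: str) -> tuple[bool, str]:
    if step == "download primary":
        return False, "timeout"
    return True, "ok"


ok, log = plan_and_execute("fetch report", toy_planner, toy_executor)
```

The planner sees the execution log on every replan, which is what lets a stale plan be repaired instead of retried blindly.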
4. Tool use
Toolformer (Schick 2023) learns when and how to call APIs without human annotation. How: sample candidate API calls into the text (the model proposes them itself from a few in-context examples) → execute them to get returns → a utility filter keeps only samples where "inserting that call + its return significantly lowers the model's loss on subsequent tokens" for SFT. I.e. it uses "did the call actually help predict what follows" to auto-filter useless / mis-placed calls.
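A minimal sketch of the utility filter's decision rule, assuming the three losses have already been computed by the language model on the tokens that follow the candidate call. The function name and the threshold value `tau` are illustrative assumptions.

```python
def keep_api_call(
    loss_with_result: float,
    loss_no_call: float,
    loss_call_no_result: float,
    tau: float = 1.0,
) -> bool:
    """Keep a candidate call only if the call plus its return helps prediction.

    The baseline is the better (lower) of "no call at all" and "call with an
    empty return"; the call is kept when the executed call beats that baseline
    by at least tau.
    """
    baseline = min(loss_no_call, loss_call_no_result)
    return baseline - loss_with_result >= tau


useful = keep_api_call(loss_with_result=1.2, loss_no_call=2.5, loss_call_no_result=2.4)
useless = keep_api_call(loss_with_result=2.3, loss_no_call=2.5, loss_call_no_result=2.4)
```

Comparing against the empty-return baseline as well means a call is not kept merely because the call text itself happens to help.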
Structured function calling (OpenAI 2023, introduced 2023-06): describe the function signature with JSON Schema; the (fine-tuned) model directly emits a structured {name, arguments} call. Parallel tool calls (returning several calls at once) require those calls to be idempotent + mutually independent (no data dependency), otherwise they cannot run in parallel.
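To make the {name, arguments} shape concrete, here is a sketch of host-side validation before dispatch. The tool registry and `validate_call` are hypothetical, and the checker covers only a hand-rolled subset of JSON Schema (required keys and basic types); a production system would use a full validator.

```python
import json

# Hypothetical tool registry; the "parameters" entry is a JSON Schema object.
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
            "required": ["city"],
        },
    }
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_call(call: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    name = call.get("name")
    if name not in TOOLS:
        return [f"unknown tool: {name!r}"]
    try:
        args = json.loads(call.get("arguments") or "")
    except (json.JSONDecodeError, TypeError) as exc:
        return [f"arguments are not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    schema = TOOLS[name]["parameters"]
    problems = [f"missing required argument: {k}" for k in schema["required"] if k not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"argument {key} should be {spec['type']}")
    return problems


good = validate_call({"name": "get_weather", "arguments": '{"city": "Paris"}'})
bad = validate_call({"name": "get_weather", "arguments": '{"days": "3"}'})
```

Returning the problems as text lets the host feed them back to the model as an observation, so a hallucinated tool or argument becomes a recoverable step.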
Misconception: "function calling and ReAct are two opposing kinds of agent." Both are tool use, but at different levels: FC is a structured format for a tool call (the model is fine-tuned to emit JSON that conforms to the schema), ReAct is a reason-act loop pattern (a prompting paradigm); you can perfectly well "use a ReAct loop + issue tools via function calling each step." The SFT label-masking difference is in the react drill and agentic-page Q11.
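The label-masking idea (back-propagate only on agent-generated tokens) can be sketched independently of the tool-call format. The segment roles and token ids below are made up for illustration; a real pipeline would take them from the tokenizer and chat template.

```python
def build_loss_mask(segments: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Flatten (role, token_ids) segments into token ids plus a 0/1 loss mask.

    Only tokens the agent generated ("assistant") get mask 1; prompts and
    environment-injected observations get 0 and contribute no gradient.
    """
    ids: list[int] = []
    mask: list[int] = []
    for role, tokens in segments:
        ids.extend(tokens)
        mask.extend([1 if role == "assistant" else 0] * len(tokens))
    return ids, mask


trajectory = [
    ("user", [11, 12, 13]),         # task prompt
    ("assistant", [21, 22]),        # Thought + Action
    ("observation", [31, 32, 33]),  # tool return, injected by the environment
    ("assistant", [41]),            # Final Answer
]
ids, mask = build_loss_mask(trajectory)
```

The mask multiplies the per-token loss, so the model is never trained to predict tool returns it does not control.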
5. Protocols: MCP & A2A
| Protocol | Direction | What it standardizes |
|---|---|---|
| MCP (Model Context Protocol; Anthropic, 2024-11) | vertical (agent ↔ tools/data) | how a model connects to external tools and data: client-server, JSON-RPC 2.0, three primitives (tools / resources / prompts), transport (stdio / Streamable HTTP) |
| A2A (Agent2Agent; Google, 2025-04, later under the Linux Foundation) | horizontal (agent ↔ agent) | how agents from different vendors interoperate: agent card (capability declaration) + task state machine + JSON-RPC over HTTP |
Misconception: "using MCP makes it safe." The protocol does not defend against prompt injection: malicious instructions hidden in a tool return can hijack the agent (see §10 and Greshake 2023 on indirect prompt injection, where instructions inside external content such as web pages or tool returns hijack the LLM application); defense is the host/agent's responsibility (least privilege, treat tool output as untrusted).
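For a feel of what the protocol layer carries, here is a sketch of one MCP-style tool invocation as JSON-RPC 2.0 messages. The `tools/call` method and the `content` list in the result follow my reading of the MCP specification and should be checked against the current spec; the helper names and the `<tool_output>` wrapper are assumptions, a host-side convention rather than part of MCP.

```python
import json


def make_tools_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request asking the server to run one tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


def extract_untrusted_text(response_json: str) -> str:
    """Pull text content out of a response and fence it as untrusted data."""
    msg = json.loads(response_json)
    if "error" in msg:
        return f"[tool error] {msg['error'].get('message', '')}"
    parts = [c["text"] for c in msg["result"].get("content", []) if c.get("type") == "text"]
    # Host-side choice (not part of the protocol): mark tool output as data, not instructions.
    return '<tool_output untrusted="true">\n' + "\n".join(parts) + "\n</tool_output>"


request = make_tools_call(1, "search", {"q": "agent benchmarks"})
response = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "42 results"}], "isError": False},
})
observation = extract_untrusted_text(response)
```

The protocol only moves the bytes; whether the returned text is treated as data or obeyed as instructions is decided entirely in host code like `extract_untrusted_text`.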
6. Production patterns
- Subagent orchestration: the main agent dispatches subtasks to context-isolated subagents (each with a limited tool set), decomposing in parallel or in sequence → prevents main-context blow-up and an over-long tool table.
- Tool retrieval: with a 100+ tool pool, do not stuff all schemas into the prompt; instead retrieve the relevant tools by embedding top-k against the current subtask and inject only those.
- Tiered memory: working (in-context) / episodic (external trajectory history) / retrieve-back-in; essential for long horizons.
- Budget guard: a multi-dimensional budget (token + step count + tool-call count + wall-time); when any threshold is exceeded, force convergence / termination to stop looping from burning money.
Misconception: "subagent = multi-agent system." A subagent is hierarchical decomposition (one main goal delegated to subordinates, context-isolated); multi-agent debate / collaboration is several peer agents each holding a viewpoint then aggregating — different goals and communication structures (the latter is in the agentic page's multi-agent credit).
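A budget guard is small enough to sketch in full. The class below is a hypothetical design, not taken from any framework: it tracks the four dimensions listed above and reports the first one exceeded, leaving it to the control loop to force a final answer or terminate.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetGuard:
    max_tokens: int
    max_steps: int
    max_tool_calls: int
    max_seconds: float
    tokens: int = 0
    steps: int = 0
    tool_calls: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record one agent step and what it consumed."""
        self.steps += 1
        self.tokens += tokens
        self.tool_calls += tool_calls

    def exceeded(self) -> Optional[str]:
        """Return the name of the first exhausted dimension, or None."""
        if self.tokens > self.max_tokens:
            return "tokens"
        if self.steps > self.max_steps:
            return "steps"
        if self.tool_calls > self.max_tool_calls:
            return "tool_calls"
        if time.monotonic() - self.started > self.max_seconds:
            return "wall_time"
        return None


guard = BudgetGuard(max_tokens=1000, max_steps=3, max_tool_calls=5, max_seconds=60.0)
guard.charge(tokens=400, tool_calls=1)
guard.charge(tokens=400, tool_calls=1)
within_budget = guard.exceeded()
guard.charge(tokens=400)
tripped = guard.exceeded()
```

Checking the guard after every step, outside the model, keeps termination deterministic even when the agent itself is stuck in a loop.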
7. Computer-use / GUI agent
The action space changes from "text tools" to screenshot → coordinate click / keyboard input, operating a real GUI directly. Two bottlenecks:
- grounding: mapping a semantic intent ("click the login button") to precise pixel coordinates — imprecise visual localization is the main error source; in practice one often prefers the accessibility tree (a structured element tree) over raw screenshot pixels.
- long-horizon: GUI tasks have many steps (open app→navigate→fill form→submit), so error compounding (see §9) is severe.
The evaluation arenas are OSWorld (desktop) and WebArena (web) in §8.
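Grounding through an accessibility tree can be sketched as a tree search. The node layout below (`role`, `label`, `bounds` as x, y, width, height, `children`) is a made-up simplification; real accessibility APIs differ per platform.

```python
from typing import Optional


def find_click_target(node: dict, role: str, label: str) -> Optional[tuple[int, int]]:
    """Depth-first search for an element by role and label; return its center point."""
    if node.get("role") == role and label.lower() in node.get("label", "").lower():
        x, y, w, h = node["bounds"]
        return (x + w // 2, y + h // 2)
    for child in node.get("children", []):
        hit = find_click_target(child, role, label)
        if hit is not None:
            return hit
    return None


tree = {
    "role": "window", "label": "Sign in", "bounds": (0, 0, 800, 600),
    "children": [
        {"role": "textbox", "label": "Email", "bounds": (100, 100, 300, 40)},
        {"role": "button", "label": "Log in", "bounds": (100, 200, 120, 40)},
    ],
}
target = find_click_target(tree, "button", "log in")
missing = find_click_target(tree, "button", "cancel")
```

The click coordinate is derived from the element's declared bounds instead of being predicted from pixels, which is where the reliability gain comes from; a `None` result is the signal to fall back to screenshot grounding.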
8. Benchmarks
Model SOTA on these benchmarks moves fast and is prone to training contamination; this page only lists the original-paper human baseline + what it tests. For current SOTA check each official leaderboard, and mind contamination and scaffold-version differences.
| Benchmark | What it tests | human baseline (original paper) |
|---|---|---|
| SWE-bench (Jimenez 2023) | resolving real GitHub issues (edit codebase, pass unit tests), 2294 tasks | no human solve rate in the original; at release the best model (Claude 2) resolved only ~2%, which shows its difficulty |
| SWE-bench Verified (OpenAI 2024-08) | the same task on a 500-problem human-validated subset that excludes unsolvable items and over-strict tests (cleaner) | no human solve rate reported |
| GAIA (Mialon 2023) | general assistant (reasoning + multimodality + web + tools), three difficulty levels | 92% (L1 93.9 / L2 91.8 / L3 87.3), original annotators |
| OSWorld (Xie 2024) | open-ended computer-use on a real OS (multi-app / web), 369 tasks | 72.36% |
| WebArena (Zhou 2023) | long-horizon tasks on real websites (e-commerce / forum / code / CMS), 812 tasks | 78.24% |
| AgentBench (Liu 2023) | LLM-as-agent across 8 interactive environments (OS / DB / KG / games / web) | none (designed for model-vs-model) |
| τ-bench (Yao 2024) | multi-turn, policy-constrained tool-agent-user interaction (customer service) | none; the key contribution is the pass^k reliability metric |
| MLE-bench (Chan 2024) | Kaggle ML engineering, 75 competitions, scored by medal rate (bronze / silver / gold) | no single number; results are placed by Kaggle leaderboard percentile |
9. Cost & reliability
- O(T²) cost: every step re-reads the full context, and the context grows linearly with steps (length at step $t$ is $O(t)$, full history concatenated), so total tokens over $T$ steps are $\sum_{t=1}^{T} O(t) = O(T^2)$. This is the core cost driver for long-horizon agents. Mitigations: context compression / summarization, KV eviction, decomposing subtasks into isolated subagents, prompt caching.
- Serial-latency lower bound: steps have data dependencies, so parallel tools save the wait within a step but cannot break the cross-step serial latency lower bound — an 8-step task serializes at least 8 LLM decodes no matter how parallel.
- pass@k vs pass^k: pass@k = at least one success in $k$ tries (capability upper bound, optimistic); pass^k = all $k$ tries succeed (reliability). Agent deployment looks at pass^k: a customer-service / code agent that wrecks the database 1-in-10 runs is unusable, and τ-bench uses pass^k precisely to expose this instability.
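Both metrics can be estimated from the same data: n independent trials of one task, c of them successful. The sketch below uses the standard combinatorial estimators (the chance that a random size-k subset of the n trials contains at least one success, or only successes); the function names are my own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k tries succeeds) from c successes in n trials."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k tries succeed) from c successes in n trials."""
    return comb(c, k) / comb(n, k)


# An agent that succeeds on 9 of 10 trials of a task:
capability = pass_at_k(n=10, c=9, k=4)    # any size-4 subset contains a success
reliability = pass_hat_k(n=10, c=9, k=4)  # C(9,4) / C(10,4) = 126 / 210
```

With 9 successes in 10 trials, pass@4 is 1.0 while pass^4 is only 0.6: the same agent looks solved on one metric and unreliable on the other.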
10. Failure modes
| # | Failure mode | Mechanism | Mitigation |
|---|---|---|---|
| 1 | hallucinated tool call | calling a non-existent tool/arg, or fabricating an Observation when no stop is set | JSON schema validation + stop sequence + tool whitelist |
| 2 | loop / stalemate | repeating the same action without progress | step budget + loop detection + forced final |
| 3 | lost-in-the-middle (Liu 2024) | key info in the middle of a long context is ignored (U-shaped) | put key info at the ends + summarize + retrieve |
| 4 | tool over/under-use | calling when it shouldn't / not calling when it should | reward / SFT shaping + tool retrieval |
| 5 | tool-output injection | instructions hidden in a tool return hijack the agent | treat tool output as untrusted + least privilege + host defense |
| 6 | benchmark reward hacking | exploiting eval loopholes instead of truly solving | verifiable terminal + adversarial test set + anti-contamination |
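Failure mode 2 (loop / stalemate) has a simple deterministic detector. A minimal sketch, assuming the host records each step as an (action, input) pair; the window and repeat thresholds are arbitrary illustrative defaults.

```python
from collections import Counter


def is_looping(history: list[tuple[str, str]], window: int = 6, max_repeats: int = 3) -> bool:
    """True if any identical (action, input) pair occurs max_repeats times in the recent window."""
    recent = history[-window:]
    if not recent:
        return False
    _, count = Counter(recent).most_common(1)[0]
    return count >= max_repeats


stuck = is_looping([("search", "x"), ("search", "x"), ("search", "x")])
progressing = is_looping([("search", "a"), ("search", "b"), ("open", "a")])
```

When the detector fires, the host can inject a message telling the agent its last action was repeated, or force a final answer; exact-match comparison misses paraphrased repeats, so this is a floor rather than a complete defense.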
Stratified follow-ups
L1 Basics
1. What is the essential difference between an agent and a chatbot? Does giving an LLM a search API make it an agent?
Answer: The essential difference is closed loop + autonomous multi-step + change of external state — an agent decides the next step from observations and keeps acting until the goal is met, and its actions can change external-world state; a chatbot is single Q&A. Simply giving an LLM a search API for one retrieval-augmented call is not yet an agent (still single-turn); only when it can autonomously decide whether to keep querying, what to query, and when to stop based on tool returns does it enter the agent regime.
Follow-up: agent = policy + what? → policy (LLM) + tool I/O + memory + control loop; treat the LLM as a policy running in an "observe→act→new observation" loop.
2. What are ReAct's Thought/Action/Observation? Why does it hallucinate less than pure CoT?
Answer: Thought = reasoning, Action = call a tool, Observation = tool return (environment-injected). Pure CoT rolls forward on its own output and cannot externally correct intermediate facts; ReAct conditions each step on the real tool return, grounding the reasoning so a wrong fact can be corrected the next turn.
Follow-up: what is the most common ReAct production bug? → the stop-token footgun: not setting Observation: as a stop sequence, so the model writes its own Observation: … hallucinated return instead of stopping to wait for the environment's real result.
3. Are function calling and ReAct two opposing kinds of agent?
Answer: No, they are at different levels. Function calling is a structured format for a tool call (the model is fine-tuned to emit a {name, arguments} JSON object that conforms to the schema); ReAct is a reason-act loop pattern (a prompting paradigm). They compose: run a ReAct loop and issue tools via function calling each step.
Follow-up: how do the two formats differ in label masking during training? → the masked set is the same (both mask tool-return tokens); the difference is that the fixed template parts of the JSON ({"name":, punctuation) are schema, not decisions, so over-training wastes gradient memorizing the template (see the react drill / agentic Q11).
4. What problem does MCP solve? Does it guarantee agent safety?
Answer: MCP (Model Context Protocol, Anthropic 2024-11) standardizes the model ↔ external tools/data connection (vertical): client-server + JSON-RPC 2.0 + three primitives (tools/resources/prompts). It does not guarantee safety — the protocol itself does not defend against prompt injection; malicious instructions in a tool return must be defended by the host/agent.
Follow-up: how do MCP and A2A divide the work? → MCP is vertical (agent ↔ tools/data); A2A (Google 2025) is horizontal (agent ↔ agent, interop via agent card + task state machine).
5. What is the difference between pass@k and pass^k? Which should agent deployment look at?
Answer: pass@k = at least one success in $k$ tries (measures capability upper bound, optimistic); pass^k = all $k$ tries succeed (measures reliability). Agent deployment looks at pass^k: a customer-service/code agent that causes trouble 1-in-10 runs is unusable. τ-bench uses pass^k precisely to expose this instability.
Follow-up: why is a long-horizon agent's pass^k far below its pass@k? → with per-step (or per-run) success rate $p < 1$ and independent outcomes, the probability that all $k$ runs (or all $T$ steps) succeed is $p^k$ (or $p^T$), which decays exponentially with the number of steps/tries; long tasks have many steps, and any single step failing fails the whole trajectory, so reliability is far below "at least once can do it".
6. Why is an agent's inference cost typically O(T²)?
Answer: every step re-reads the entire context, and the context grows linearly with steps (length at step $t$ is $O(t)$, full history concatenated), so total tokens over $T$ steps are $\sum_{t=1}^{T} O(t) = O(T^2)$. This is the main cost source for long-horizon agents and the motivation for context management (compress/summarize/evict).
Follow-up: can parallel tool calls reduce this to O(T)? → No. Parallelism saves the within-step wait of multiple independent tools (lowering latency, not total tokens); the cross-step serial dependency and context accumulation remain, so the cost order is unchanged.
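The quadratic growth is easy to check numerically. A small sketch, assuming a fixed prompt of `base` tokens and `per_step` new tokens appended each step, with no caching or compression; both numbers are illustrative.

```python
def total_prompt_tokens(steps: int, base: int, per_step: int) -> int:
    """Total tokens read across all steps when each step re-reads the full history."""
    return sum(base + t * per_step for t in range(steps))


ten_steps = total_prompt_tokens(steps=10, base=1000, per_step=500)
twenty_steps = total_prompt_tokens(steps=20, base=1000, per_step=500)
```

Doubling the horizon from 10 to 20 steps raises the total from 32,500 to 115,000 tokens, about 3.5x rather than 2x, and the ratio approaches 4x as the `per_step` term dominates.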
L2 Intermediate
7. What is the trade-off between Plan-and-Execute and ReAct? When does pure plan-and-execute fail?
Answer: ReAct decides per step (reactive), flexible but no global view; Plan-and-Execute generates a full plan first then executes (executor can be a different model), globally consistent but the plan may go stale. Pure plan-and-execute fails when the environment is uncertain and mid-run state changes a lot (tool returns are unexpected) — the plan fixed at the start cannot keep up. Production usually mixes "high-level plan + per-step ReAct" + plan repair.
Follow-up: what are the ways to do plan repair? → Reflexion-style language reflection then replan, ToT-style search over alternative plans, or detect deviation and replan locally step-wise.
8. How does Toolformer learn to call APIs without human annotation?
Answer: self-supervision + a utility filter. Sample candidate API calls into the text (the model proposes them itself from a few in-context examples) → execute them to get returns → keep only samples where "inserting that call and its return significantly lowers the loss on subsequent tokens" for SFT. I.e. it uses "did the call actually help predict what follows" as the utility signal, auto-filtering useless or mis-placed calls without any human labeling of which/when to call.
Follow-up: what is the essential criterion of this utility filter? → compare the weighted loss on subsequent tokens under "with the API return" vs "without / empty return"; keep only when the former is clearly lower — essentially "how much did this tool call reduce the perplexity of what follows".
9. What is the prerequisite constraint for parallel tool calls in structured function calling?
Answer: returning multiple tool calls at once requires those calls to be idempotent + mutually independent (no data dependency): if call B needs call A's result, they cannot be parallel and must serialize until A returns. Parallelism only applies to mutually-unrelated calls like "get weather + get exchange rate"; dependent chained calls go in separate rounds.
Follow-up: why can't parallelism break a long-horizon agent's serial-latency lower bound? → parallelism saves the within-step wait of independent calls; the cross-step data dependency (the next step uses the previous step's result) is still serial, so a T-step task serializes at least T LLM decodes.
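Within a single step, independent calls can be fanned out with a thread pool. A minimal sketch, assuming the calls have no data dependency on each other and that the tool functions are safe to run concurrently; `run_parallel` and the toy tools are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_parallel(calls: list[tuple[str, dict]], tools: dict[str, Callable[..., Any]]) -> list[Any]:
    """Run independent tool calls concurrently; results keep the order of `calls`."""
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        futures = [pool.submit(tools[name], **args) for name, args in calls]
        return [f.result() for f in futures]  # re-raises any tool exception


tools = {
    "get_weather": lambda city: f"sunny in {city}",
    "get_rate": lambda base, quote: 1.1,
}
calls = [
    ("get_weather", {"city": "Paris"}),
    ("get_rate", {"base": "EUR", "quote": "USD"}),
]
results = run_parallel(calls, tools)
```

The step still ends only when the slowest call returns, and the next LLM decode cannot start before that, which is the serial lower bound in practice.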
10. What is the difference between subagent orchestration and multi-agent debate?
Answer: Subagent orchestration is hierarchical decomposition — the main agent dispatches subtasks to context-isolated, tool-limited subordinates; single goal, communication is "assign ↔ return result". Multi-agent debate/collaboration is several peer agents each holding a viewpoint, challenging each other then aggregating; the goal is to use diversity to improve correctness. The two differ in structure (hierarchical vs peer), communication, and purpose.
Follow-up: what does subagent context isolation mainly solve? → it prevents main-agent context blow-up (the subtask's intermediate tokens don't flow back to the main thread) + an over-long tool table (each subagent only mounts relevant tools); the cost is that cross-subagent information sharing must be passed explicitly.
11. With a 100+ tool pool, how do you manage it without blowing up the context?
Answer: do not stuff all tool schemas into the prompt (it both blows up the context and lowers selection accuracy); instead do tool retrieval: vectorize each tool's description, do an embedding top-k retrieval against the current subtask query, and inject only the most relevant schemas. Essentially turning "tool selection" from one-shot full exposure into retrieval recall.
Follow-up: what is the main failure mode of tool retrieval? → incomplete recall (the needed tool isn't retrieved → the agent can't finish) and description ambiguity (two similar tools get confused); mitigate with better tool descriptions + larger k + hierarchical retrieval if needed.
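Tool retrieval in miniature: the sketch below uses a bag-of-words vector and cosine similarity as a stand-in for a real embedding model, purely so the example runs with the standard library. The tool names and descriptions are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # a Counter returns 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_tools(query: str, descriptions: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(descriptions, key=lambda name: cosine(q, embed(descriptions[name])), reverse=True)
    return ranked[:k]


descriptions = {
    "get_weather": "current weather forecast for a city",
    "convert_currency": "convert an amount between two currency exchange rates",
    "send_email": "send an email message to a recipient",
}
selected = retrieve_tools("what is the weather forecast in Paris", descriptions, k=1)
```

Only the schemas of the selected tools would then be injected into the prompt; recall depends entirely on description quality, which is why ambiguous descriptions are the main failure mode.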
12. What are the two bottlenecks of a computer-use / GUI agent? Why prefer the accessibility tree?
Answer: ① grounding — mapping a semantic intent to precise pixel coordinates; imprecise visual localization is the main error source. ② long-horizon — GUI tasks have many steps and severe error compounding. One often prefers the accessibility tree (a structured element tree with role/label/coords) over raw screenshots, because structured elements locate "that button" more reliably than pixels, sidestepping part of the grounding error.
Follow-up: if the accessibility tree is more reliable, why still need screenshots? → many interfaces (canvas, custom rendering, games) have no usable a11y tree, or the tree is incomplete; the screenshot is the universal fallback, and practice often fuses both.
13. To evaluate a coding agent and a web agent, which benchmarks would you pick and why?
Answer: coding agent → SWE-bench / SWE-bench Verified (resolve real GitHub issues by editing code to pass unit tests; Verified is the human-validated clean subset); web agent → WebArena (real-website long-horizon tasks, with a 78.24% human baseline), or OSWorld if it's desktop computer-use. The basis is "does the action space and task distribution match the target scenario".
Follow-up: what caveat must accompany SOTA numbers on these benchmarks? → model SOTA moves fast + training contamination + scaffold-version differences (SWE-bench therefore released the human-validated Verified subset); you can only cite the current official-leaderboard value with a date, never treat a second-hand number as a stable fact.
L3 Advanced
14. Designing context management for a long-horizon agent: how do O(T²) cost, lost-in-the-middle, and error compounding mitigate together?
Answer: all three stem from "context inflating with steps", so a combination is needed: ① for O(T²) cost — summarize old turns + KV eviction (keep sink + recent window) + dispatch independent subtasks to context-isolated subagents, splitting one long context into several short ones; ② for lost-in-the-middle (mid-context info ignored, U-shaped) — put key info (goal, constraints) at the ends and use retrieval to recall relevant history into the recent window rather than relying on the model to find it in a long middle; ③ for error compounding — shorten the effective horizon (decompose + verifiable milestones per subtask), add loop detection and a budget guard for early stop. Synergy: summarization + subtask isolation lower both cost and horizon; retrieval + end-anchoring fix both lost-in-the-middle and grounding.
Follow-up: what new risk does summarization itself introduce, and how to balance it? → lossy compression may discard key early info that turns out useful later (and if training saw the full history but inference only compresses, it creates a train-inference state-distribution mismatch); the balance is to keep a pointer / retrievable copy of "potentially-revisited" content and only summarize low-value turns.
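One way to combine end-anchoring with summarization is to keep the head (goal, constraints) and the recent window verbatim and compress only the middle. A sketch under that assumption; `summarize` is a hypothetical callable (in practice an LLM call), and the summary is injected as a user message by choice.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[list[dict]], str],
    keep_head: int = 1,
    keep_recent: int = 4,
) -> list[dict]:
    """Keep the first and last messages verbatim; replace the middle with one summary."""
    if len(messages) <= keep_head + keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    head, middle, recent = messages[:keep_head], messages[keep_head:split], messages[split:]
    summary = {"role": "user", "content": "Summary of earlier steps: " + summarize(middle)}
    return head + [summary] + recent


history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact_history(history, summarize=lambda ms: f"{len(ms)} messages")
```

The goal stays at the start and the latest observations at the end, the two positions least affected by lost-in-the-middle; the discarded middle should remain retrievable elsewhere, for the reason given in the follow-up above.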
15. What is the threat model of tool-output prompt injection? Give a defense-in-depth.
Answer: threat model (indirect prompt injection, Greshake 2023): the attacker hides malicious instructions in external content the agent will read (web pages, search results, tool returns, files); the agent executes them as instructions — it can be induced to leak context, abuse high-privilege tools, or make outbound requests. Defense-in-depth: ① mark all tool/external returns as untrusted data (isolated from system instructions, not executed as instructions); ② least privilege (minimal scope per tool; dangerous operations require confirmation); ③ output-side constraints (host-level policy gates on high-risk actions like outbound send / delete); ④ monitor anomalous tool-call sequences. Key insight: the protocol (MCP) is not responsible for injection defense — the host/agent is.
Follow-up: why is "let the model judge whether an instruction is trustworthy" not a reliable defense? → it puts the security boundary back inside a model that the same injection can compromise; reliable defense should backstop outside the model with a deterministic privilege/policy layer (whitelist, scope, human confirmation) rather than relying on model self-discipline.
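The deterministic backstop outside the model can be as small as a default-deny policy table. A minimal sketch with a hypothetical policy and a `confirm` callback standing in for human confirmation.

```python
from typing import Callable

# Hypothetical policy table: anything not listed is denied.
POLICY = {
    "search": "allow",
    "read_file": "allow",
    "send_email": "confirm",
    "delete_file": "confirm",
}


def authorize(tool: str, args: dict, confirm: Callable[[str, dict], bool]) -> bool:
    """Decide outside the model whether a requested tool call may run."""
    rule = POLICY.get(tool, "deny")  # whitelist: unknown tools are denied
    if rule == "allow":
        return True
    if rule == "confirm":
        return confirm(tool, args)  # e.g. ask the human user
    return False


def never_confirm(tool: str, args: dict) -> bool:
    return False


read_ok = authorize("search", {"q": "x"}, never_confirm)
send_ok = authorize("send_email", {"to": "a@example.com"}, never_confirm)
shell_ok = authorize("shell", {"cmd": "ls"}, lambda t, a: True)
```

Because the decision never passes through the model, injected text can change what the agent asks for but not what the host allows.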
16. Given benchmark reward hacking and data contamination, how do you design a hack-resistant agent evaluation?
Answer: two problems — ① reward hacking: the agent exploits eval-implementation loopholes (editing test files, mocking outputs, empty functions passing CI) instead of truly solving; ② contamination: test samples leaked into training, inflating scores. Hack-resistant eval: use a verifiable, hard-to-forge success criterion (hidden extra unit tests, final environment-state checks, not assertions the agent can see), rotate adversarial test sets / living benchmarks (periodically swap items to prevent memorization), a human-validated subset (like SWE-bench Verified excluding cheatable/unsolvable items), report pass^k (to prevent gaming pass@k by many samples), and publish the eval harness for reproducibility.
Follow-up: why can't an "ever-rising public leaderboard" be taken directly as agent capability progress? → high scores may come from contamination, scaffold engineering, or overfitting to that benchmark; what matters is whether it improves in lockstep on newly released, contamination-resistant living benchmarks and the pass^k reliability metric, with the harness and date verified.
References
All are primary sources for load-bearing methods, each web-verified (title + arXiv ID / official URL).
- Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. Think→act→observe paradigm.
- Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903. CoT reasoning.
- Wang et al. Plan-and-Solve Prompting. ACL 2023. arXiv:2305.04091. Plan first, then execute.
- Yao et al. Tree of Thoughts: Deliberate Problem Solving with LLMs. NeurIPS 2023. arXiv:2305.10601. Tree-of-thoughts search.
- Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366. Verbal reflection / episodic memory.
- Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761. Self-supervised tool use + utility filter.
- Anthropic. Model Context Protocol (MCP). 2024-11. modelcontextprotocol.io. Model↔tools/data standard (vertical).
- Google. Agent2Agent Protocol (A2A). 2025-04 (later under the Linux Foundation). a2a-protocol.org. Agent↔agent interop (horizontal).
- OpenAI. Function calling and other API updates. 2023-06-13. openai.com. Structured JSON-Schema tool calling.
- Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172. Mid-context info is ignored.
- Greshake et al. Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. 2023. arXiv:2302.12173. Indirect prompt injection.
- Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770. Real code-fix benchmark.
- OpenAI. Introducing SWE-bench Verified. 2024-08. openai.com. 500-problem human-validated subset.
- Mialon et al. GAIA: a benchmark for General AI Assistants. ICLR 2024. arXiv:2311.12983. General-assistant eval (human 92%).
- Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972. Computer-use eval (human 72.36%).
- Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024. arXiv:2307.13854. Web-agent eval (human 78.24%).
- Liu et al. AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688. 8-environment eval.
- Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045. Multi-turn tool-user + pass^k.
- Chan et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. ICLR 2025. arXiv:2410.07095. Kaggle ML-engineering eval.