Cheatsheet

Agent Foundations

The prerequisite before multi-turn RL: what an agent is, the ReAct loop, tool calling, the protocol layer (MCP/A2A), production engineering patterns, evaluation, and failure modes. Read this first, then head to agentic-and-long-horizon-rl to learn how to train it with RL.

注意 / Caution

Study notes, not the author's own research (see README disclaimer). Numbers/conclusions follow the original papers; benchmark numbers move fast and are contamination-prone, so this page only records the original-paper human baseline + what it tests, never model SOTA.

0. TL;DR

1. Mental model

Treat the LLM as a policy πθ(ah)\pi_\theta(a \mid h): given history hh (dialogue + past observations), output the next action aa (a span of text, possibly containing a tool call). Agent = policy + tool I/O + memory + control loop:

obs ───▶ [LLM policy] ───▶ action ───▶ [environment / tools] ──┐
   ▲                                                            │
   └──────────────── new observation ◀─────────────────────────┘
                 (loop until Final Answer or budget exhausted)

Three-axis design framework (any agent can be located on three axes):

Axis Options (simple → complex)
Reasoning structure direct answer → CoT2Provides intermediate reasoning steps as few-shot exemplars, substantially improving reasoning tasks. Wei 2022 ↗ → ReAct → ToT4Expands reasoning into a tree, self-evaluating intermediate "thoughts" for deliberate search. Yao 2023 ↗ / search
Tool interface none → text protocol (ReAct) → structured function calling → computer-use (screenshot+coordinates)
Learning signal prompt-only → trajectory SFT → RL (see the sibling agentic-RL page)

Classic RL vs LLM-agent:

Dimension Classic RL agent LLM agent
Policy small net, trained from scratch pretrained LLM, light post-training
Action space fixed low-dim open text + tool calls
Prior almost none vast world knowledge
Sample efficiency low (millions of steps) high (zero-shot start from a prompt)
陷阱 / Pitfall

Misconception: "a chatbot that can call tools is an agent." The key is not the tool but the closed loop + autonomous multi-step + change of external state: single-shot retrieval augmentation (RAG) is still one-shot Q&A; an agent must decide the next step from observations and keep acting until the goal is met.

2. ReAct — the minimal agent skeleton

ReAct1Interleaves reasoning and acting (think→act→observe) so the model thinks while it acts. Yao 2022 ↗ interleaves reasoning (Thought) and acting (Action), each action calls a tool, and the Observation is injected back into the context:

Thought: I should look up X first.
Action: search
Action Input: X
Observation: <tool return — injected by the environment, not generated by the model>
Thought: now I know.
Final Answer: …

Why it hallucinates less than pure CoT: pure chain-of-thought rolls forward on its own output and cannot correct intermediate facts; ReAct conditions each step on the real tool return, grounding the reasoning so a wrong fact can be corrected on the next turn.

陷阱 / Pitfall

Misconception: "ReAct always beats CoT." On pure-reasoning tasks ReAct does not necessarily beat CoT-self-consistency (in the original paper ReAct alone is weaker on HotpotQA and needs to be combined with CoT-SC); ReAct's value is on tasks that need external knowledge / actions.

注意 / Caution

stop-token footgun: at inference you MUST set Observation: as a stop sequence. Otherwise the model will continue generating an Observation: … itself (hallucinating the tool return) instead of stopping to wait for the environment to inject the real result — this is the most common ReAct production bug. Hands-on: react-tool-call-loop.

From-scratch implementation (interview hand-tear standard: minimal ReAct loop + stop-sequence + environment-injected observations):

45 行 / lines
import re

def react_loop(prompt, tools, llm_generate, max_steps=10):
    """Minimal ReAct loop: Thought → Action → Observation → loop.
    llm_generate(messages, stop)->str: calls the LLM, stops immediately on stop sequences.
    tools: dict[str, callable], tool_name→executor (Obs from the environment, not the model).
    Returns (final_answer, trajectory)."""
    messages = [{"role": "user", "content": prompt}]
    trajectory = []

    for _ in range(max_steps):
        # 1. Reason + act; stop=["Observation:"] prevents hallucinated tool returns (model must stop)
        raw = llm_generate(messages, stop=["Observation:"])
        trajectory.append(raw)

        # 2. Check for final answer
        final = re.search(r"Final Answer:\s*(.*)", raw, re.S)
        if final:
            return final.group(1).strip(), trajectory

        # 3. Parse Action and Action Input
        action = re.search(r"Action:\s*(\S+)", raw)
        action_input = re.search(r"Action Input:\s*(.*)", raw, re.S)
        if not action:
            obs = "Error: no Action found. Please output 'Action: <tool_name>' then 'Action Input: <args>'."
        elif action.group(1) not in tools:
            obs = f"Error: unknown tool '{action.group(1)}'. Available: {list(tools.keys())}"
        else:
            try:
                result = tools[action.group(1)](action_input.group(1).strip())
                obs = str(result)                               # Observation from environment/tool
            except Exception as e:
                obs = f"Tool error: {e}"

        # 4. Inject the real Observation back into context (new user/system message, NOT assistant continuation)
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user",   "content": f"Observation: {obs}"})

    return None, trajectory                                     # step budget exhausted
# Interview key points:
# ① stop=["Observation:"] prevents hallucination — model stops, env injects real Obs
# ② Observation is a new message (role=user/system), not an assistant continuation
# ③ Action/Action Input parsed with regex; production would use JSON or structured output
# ④ Tool execution wrapped in try/except + unknown-tool fallback to prevent one-step kills

3. Planning: Plan-and-Execute vs ReAct

When pure plan-execute fails: when the environment is uncertain and mid-run state changes a lot (tool returns are unexpected), a plan fixed at the start goes stale — here reactive ReAct or a mix with replan is more robust.

4. Tool use

Toolformer6Self-supervised learning of "when/how" to call an API, keeping only useful self-labeled calls via a utility filter. Schick 2023 ↗: learns to call APIs without human annotation. How: randomly insert candidate API calls into text → execute to get returns → a utility filter keeps only samples where "inserting that call + its return significantly lowers the model's loss on subsequent tokens" for SFT. I.e. it uses "did the call actually help predict what follows" to auto-filter useless / mis-placed calls.

Structured function calling9Introduced 2023-06: describe functions with JSON Schema, the model emits structured calls. OpenAI 2023 ↗: describe the function signature with JSON Schema; the (fine-tuned) model directly emits a structured {name, arguments} call. parallel tool calls (returning several calls at once) require those calls to be idempotent + mutually independent (no data dependency), otherwise they cannot run in parallel.

陷阱 / Pitfall

Misconception: "function calling and ReAct are two opposing kinds of agent." Both are tool use, but at different levels: FC is a structured format for a tool call (the model is fine-tuned to emit JSON schema), ReAct is a reason-act loop pattern (a prompting paradigm); you can perfectly well "use a ReAct loop + issue tools via function calling each step." The SFT label-masking difference is in the react drill and agentic-page Q11.

5. Protocols: MCP & A2A

Protocol Direction What it standardizes
MCP (Model Context Protocol)7Anthropic 2024-11 open protocol: client-server + JSON-RPC 2.0, 3 primitives (tools/resources/prompts). Anthropic 2024 ↗ vertical (agent ↔ tools/data) how a model connects to external tools and data: client-server, JSON-RPC 2.0, three primitives (tools / resources / prompts), transport (stdio / Streamable HTTP)
A2A (Agent2Agent)8Proposed by Google 2025-04, later under the Linux Foundation: agent interop, JSON-RPC over HTTP + agent card. Google 2025 ↗ horizontal (agent ↔ agent) how agents from different vendors interoperate: agent card (capability declaration) + task state machine + JSON-RPC over HTTP
陷阱 / Pitfall

Misconception: "using MCP makes it safe." The protocol does not defend against prompt injection — malicious instructions hidden in a tool return can hijack the agent (see §10 + Greshake11Indirect prompt injection: instructions inside external content (web pages / tool returns) hijack the LLM application. Greshake 2023 ↗); defense is the host/agent's responsibility (least privilege, treat tool output as untrusted).

6. Production patterns

陷阱 / Pitfall

Misconception: "subagent = multi-agent system." A subagent is hierarchical decomposition (one main goal delegated to subordinates, context-isolated); multi-agent debate / collaboration is several peer agents each holding a viewpoint then aggregating — different goals and communication structures (the latter is in the agentic page's multi-agent credit).

7. Computer-use / GUI agent

The action space changes from "text tools" to screenshot → coordinate click / keyboard input, operating a real GUI directly. Two bottlenecks:

  1. grounding: mapping a semantic intent ("click the login button") to precise pixel coordinates — imprecise visual localization is the main error source; in practice one often prefers the accessibility tree (a structured element tree) over raw screenshot pixels.
  2. long-horizon: GUI tasks have many steps (open app→navigate→fill form→submit), so error compounding (see §9) is severe.

The evaluation arenas are OSWorld (desktop) and WebArena (web) in §8.

8. Benchmarks

注意 / Caution

Model SOTA on these benchmarks moves fast and is prone to training contamination; this page only lists the original-paper human baseline + what it tests. For current SOTA check each official leaderboard, and mind contamination and scaffold-version differences.

Benchmark What it tests human baseline (original paper)
SWE-bench122294 real GitHub issues; edit the codebase to make tests pass. Jimenez 2023 ↗ resolving real GitHub issues (edit codebase, pass unit tests), 2294 tasks no human solve-rate in the original; at release the best model was only ~2% (Claude 2) — showing its difficulty
SWE-bench Verified13OpenAI 2024-08 human-validated 500-problem subset, excluding unsolvable/over-strict-test items. OpenAI 2024 ↗ the same, as a 500-problem human-validated subset (cleaner) no human solve-rate reported
GAIA14General AI assistant: needs reasoning + multimodality + web + tools, in three difficulty levels. Mialon 2023 ↗ general assistant (reasoning + multimodality + web + tools), three levels 92% (L1 93.9 / L2 91.8 / L3 87.3), original annotators
OSWorld15369 open-ended computer-use tasks on a real OS (multi-app/web). Xie 2024 ↗ real-OS computer-use, 369 tasks 72.36%
WebArena16812 long-horizon tasks on real websites (e-commerce/forum/code/CMS). Zhou 2023 ↗ real-web long-horizon tasks, 812 tasks 78.24%
AgentBench178 interactive environments (OS/DB/KG/games/web) to evaluate LLM-as-agent. Liu 2023 ↗ LLM-as-agent across 8 environments none (designed for model-vs-model)
τ-bench18Multi-turn, policy-constrained customer-service tool-agent-user tasks; introduces the pass^k reliability metric. Yao 2024 ↗ multi-turn, policy-constrained tool-user interaction (customer service) none; the key contribution is the pass^k reliability metric
MLE-bench1975 Kaggle ML-engineering competitions, scored by medal rate (bronze/silver/gold). Chan 2024 ↗ Kaggle ML engineering, 75 competitions, by medal rate by Kaggle leaderboard percentile

9. Cost & reliability

10. Failure modes

# Failure mode Mechanism Mitigation
1 hallucinated tool call calling a non-existent tool/arg, or fabricating an Observation when no stop is set JSON schema validation + stop sequence + tool whitelist
2 loop / stalemate repeating the same action without progress step budget + loop detection + forced final
3 lost-in-the-middle10Key info in the middle of a long context is easily ignored, in a U-shape. Liu 2024 ↗ mid-context info ignored (U-shaped) put key info at the ends + summarize + retrieve
4 tool over/under-use calling when it shouldn't / not calling when it should reward / SFT shaping + tool retrieval
5 tool-output injection instructions hidden in a tool return hijack the agent treat tool output as untrusted + least privilege + host defense
6 benchmark reward hacking exploiting eval loopholes instead of truly solving verifiable terminal + adversarial test set + anti-contamination

Stratified follow-ups

L1 Basics

1. What is the essential difference between an agent and a chatbot? Does giving an LLM a search API make it an agent?

Answer: The essential difference is closed loop + autonomous multi-step + change of external state — an agent decides the next step from observations and keeps acting until the goal is met, and its actions can change external-world state; a chatbot is single Q&A. Simply giving an LLM a search API for one retrieval-augmented call is not yet an agent (still single-turn); only when it can autonomously decide whether to keep querying, what to query, and when to stop based on tool returns does it enter the agent regime.

Follow-up: agent = policy + what? → policy (LLM) + tool I/O + memory + control loop; treat the LLM as a policy πθ(ah)\pi_\theta(a\mid h) running in an "observe→act→new observation" loop.

2. What are ReAct's Thought/Action/Observation? Why does it hallucinate less than pure CoT?

Answer: Thought = reasoning, Action = call a tool, Observation = tool return (environment-injected). Pure CoT rolls forward on its own output and cannot externally correct intermediate facts; ReAct conditions each step on the real tool return, grounding the reasoning so a wrong fact can be corrected the next turn.

Follow-up: what is the most common ReAct production bug? → the stop-token footgun: not setting Observation: as a stop sequence, so the model writes its own Observation: … hallucinated return instead of stopping to wait for the environment's real result.

3. Are function calling and ReAct two opposing kinds of agent?

Answer: No, they are at different levels. Function calling is a structured format for a tool call (the model is fine-tuned to emit a {name, arguments} JSON schema); ReAct is a reason-act loop pattern (a prompting paradigm). They compose: run a ReAct loop and issue tools via function calling each step.

Follow-up: how do the two formats differ in label masking during training? → the masked set is the same (both mask tool-return tokens); the difference is that the fixed template parts of the JSON ({"name":, punctuation) are schema, not decisions, so over-training wastes gradient memorizing the template (see the react drill / agentic Q11).

4. What problem does MCP solve? Does it guarantee agent safety?

Answer: MCP (Model Context Protocol, Anthropic 2024-11) standardizes the model ↔ external tools/data connection (vertical): client-server + JSON-RPC 2.0 + three primitives (tools/resources/prompts). It does not guarantee safety — the protocol itself does not defend against prompt injection; malicious instructions in a tool return must be defended by the host/agent.

Follow-up: how do MCP and A2A divide the work? → MCP is vertical (agent ↔ tools/data); A2A (Google 2025) is horizontal (agent ↔ agent, interop via agent card + task state machine).

5. What is the difference between pass@k and pass^k? Which should agent deployment look at?

Answer: pass@k\text{pass@}k = at least one success in kk tries (measures capability upper bound, optimistic); passk\text{pass}^k = all kk succeed (measures reliability). Agent deployment looks at pass^k — a customer-service/code agent that causes trouble 1-in-10 runs is unusable. τ-bench uses pass^k precisely to expose this instability.

Follow-up: why is a long-horizon agent's pass^k far below its pass@k? → with per-step success rate p<1p<1, the probability that all kk runs (or all steps) succeed decays exponentially with the number of steps/tries; long tasks have many steps, and any single step failing fails the whole trajectory, so reliability is far below "at least once can do it".

6. Why is an agent's inference cost typically O(T²)?

Answer: every step re-reads the entire context, and the context grows linearly with steps (LttL_t \propto t, full history concatenated), so total tokens t=1Tt=O(T2)\propto \sum_{t=1}^T t = O(T^2). This is the main cost source for long-horizon agents and the motivation for context management (compress/summarize/evict).

Follow-up: can parallel tool calls reduce this to O(T)? → No. Parallelism saves the within-step wait of multiple independent tools (lowering latency, not total tokens); the cross-step serial dependency and context accumulation remain, so the cost order is unchanged.

L2 Intermediate

7. What is the trade-off between Plan-and-Execute and ReAct? When does pure plan-and-execute fail?

Answer: ReAct decides per step (reactive), flexible but no global view; Plan-and-Execute generates a full plan first then executes (executor can be a different model), globally consistent but the plan may go stale. Pure plan-and-execute fails when the environment is uncertain and mid-run state changes a lot (tool returns are unexpected) — the plan fixed at the start cannot keep up. Production usually mixes "high-level plan + per-step ReAct" + plan repair.

Follow-up: what are the ways to do plan repair? → Reflexion-style language reflection then replan, ToT-style search over alternative plans, or detect deviation and replan locally step-wise.

8. How does Toolformer learn to call APIs without human annotation?

Answer: self-supervision + a utility filter. Randomly insert candidate API calls into text → execute to get returns → keep only samples where "inserting that call and its return significantly lowers the loss on subsequent tokens" for SFT. I.e. it uses "did the call actually help predict what follows" as the utility signal, auto-filtering useless or mis-placed calls without any human labeling of which/when to call.

Follow-up: what is the essential criterion of this utility filter? → compare the weighted loss on subsequent tokens under "with the API return" vs "without / empty return"; keep only when the former is clearly lower — essentially "how much did this tool call reduce the perplexity of what follows".

9. What is the prerequisite constraint for parallel tool calls in structured function calling?

Answer: returning multiple tool calls at once requires those calls to be idempotent + mutually independent (no data dependency): if call B needs call A's result, they cannot be parallel and must serialize until A returns. Parallelism only applies to mutually-unrelated calls like "get weather + get exchange rate"; dependent chained calls go in separate rounds.

Follow-up: why can't parallelism break a long-horizon agent's serial-latency lower bound? → parallelism saves the within-step wait of independent calls; the cross-step data dependency (the next step uses the previous step's result) is still serial, so a T-step task serializes at least T LLM decodes.

10. What is the difference between subagent orchestration and multi-agent debate?

Answer: Subagent orchestration is hierarchical decomposition — the main agent dispatches subtasks to context-isolated, tool-limited subordinates; single goal, communication is "assign ↔ return result". Multi-agent debate/collaboration is several peer agents each holding a viewpoint, challenging each other then aggregating; the goal is to use diversity to improve correctness. The two differ in structure (hierarchical vs peer), communication, and purpose.

Follow-up: what does subagent context isolation mainly solve? → it prevents main-agent context blow-up (the subtask's intermediate tokens don't flow back to the main thread) + an over-long tool table (each subagent only mounts relevant tools); the cost is that cross-subagent information sharing must be passed explicitly.

11. With a 100+ tool pool, how do you manage it without blowing up the context?

Answer: do not stuff all tool schemas into the prompt (it both blows up the context and lowers selection accuracy); instead do tool retrieval: vectorize each tool's description, do an embedding top-k retrieval against the current subtask query, and inject only the most relevant schemas. Essentially turning "tool selection" from one-shot full exposure into retrieval recall.

Follow-up: what is the main failure mode of tool retrieval? → incomplete recall (the needed tool isn't retrieved → the agent can't finish) and description ambiguity (two similar tools get confused); mitigate with better tool descriptions + larger k + hierarchical retrieval if needed.

12. What are the two bottlenecks of a computer-use / GUI agent? Why prefer the accessibility tree?

Answer: ① grounding — mapping a semantic intent to precise pixel coordinates; imprecise visual localization is the main error source. ② long-horizon — GUI tasks have many steps and severe error compounding. One often prefers the accessibility tree (a structured element tree with role/label/coords) over raw screenshots, because structured elements locate "that button" more reliably than pixels, sidestepping part of the grounding error.

Follow-up: if the accessibility tree is more reliable, why still need screenshots? → many interfaces (canvas, custom rendering, games) have no usable a11y tree, or the tree is incomplete; the screenshot is the universal fallback, and practice often fuses both.

13. To evaluate a coding agent and a web agent, which benchmarks would you pick and why?

Answer: coding agent → SWE-bench / SWE-bench Verified (resolve real GitHub issues by editing code to pass unit tests; Verified is the human-validated clean subset); web agent → WebArena (real-website long-horizon tasks, with a 78.24% human baseline), or OSWorld if it's desktop computer-use. The basis is "does the action space and task distribution match the target scenario".

Follow-up: what caveat must accompany SOTA numbers on these benchmarks? → model SOTA moves fast + training contamination + scaffold-version differences (SWE-bench therefore released the human-validated Verified subset); you can only cite the current official-leaderboard value with a date, never treat a second-hand number as a stable fact.

L3 Advanced

14. Designing context management for a long-horizon agent: how do O(T²) cost, lost-in-the-middle, and error compounding mitigate together?

Answer: all three stem from "context inflating with steps", so a combination is needed: ① for O(T²) costsummarize old turns + KV eviction (keep sink + recent window) + dispatch independent subtasks to context-isolated subagents, splitting one long context into several short ones; ② for lost-in-the-middle (mid-context info ignored, U-shaped) — put key info (goal, constraints) at the ends and use retrieval to recall relevant history into the recent window rather than relying on the model to find it in a long middle; ③ for error compounding pTp^T — shorten the effective horizon (decompose + verifiable milestones per subtask), add loop detection and a budget guard for early stop. Synergy: summarization + subtask isolation lower both cost and horizon; retrieval + end-anchoring fix both lost-in-the-middle and grounding.

Follow-up: what new risk does summarization itself introduce, and how to balance it? → lossy compression may discard key early info that turns out useful later (and if training saw the full history but inference only compresses, it creates a train-inference state-distribution mismatch); the balance is to keep a pointer / retrievable copy of "potentially-revisited" content and only summarize low-value turns.

15. What is the threat model of tool-output prompt injection? Give a defense-in-depth.

Answer: threat model (indirect prompt injection, Greshake 2023): the attacker hides malicious instructions in external content the agent will read (web pages, search results, tool returns, files); the agent executes them as instructions — it can be induced to leak context, abuse high-privilege tools, or make outbound requests. Defense-in-depth: ① mark all tool/external returns as untrusted data (isolated from system instructions, not executed as instructions); ② least privilege (minimal scope per tool; dangerous operations require confirmation); ③ output-side constraints (host-level policy gates on high-risk actions like outbound send / delete); ④ monitor anomalous tool-call sequences. Key insight: the protocol (MCP) is not responsible for injection defense — the host/agent is.

Follow-up: why is "let the model judge whether an instruction is trustworthy" not a reliable defense? → it puts the security boundary back inside a model that the same injection can compromise; reliable defense should backstop outside the model with a deterministic privilege/policy layer (whitelist, scope, human confirmation) rather than relying on model self-discipline.

16. Given benchmark reward hacking and data contamination, how do you design a hack-resistant agent evaluation?

Answer: two problems — ① reward hacking: the agent exploits eval-implementation loopholes (editing test files, mocking outputs, empty functions passing CI) instead of truly solving; ② contamination: test samples leaked into training, inflating scores. Hack-resistant eval: use a verifiable, hard-to-forge success criterion (hidden extra unit tests, final environment-state checks, not assertions the agent can see), rotate adversarial test sets / living benchmarks (periodically swap items to prevent memorization), a human-validated subset (like SWE-bench Verified excluding cheatable/unsolvable items), report pass^k (to prevent gaming pass@k by many samples), and publish the eval harness for reproducibility.

Follow-up: why can't an "ever-rising public leaderboard" be taken directly as agent capability progress? → high scores may come from contamination, scaffold engineering, or overfitting to that benchmark; what matters is whether it improves in lockstep on newly released, contamination-resistant living benchmarks and the pass^k reliability metric, with the harness and date verified.


References

All are primary sources for load-bearing methods, each web-verified (title + arXiv ID / official URL). Click a superscript to jump, click ↩ to return.

  1. Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629 — think→act→observe paradigm.
  2. Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903 — CoT reasoning.
  3. Wang et al. Plan-and-Solve Prompting. ACL 2023. arXiv:2305.04091 — plan first, then execute.
  4. Yao et al. Tree of Thoughts: Deliberate Problem Solving with LLMs. NeurIPS 2023. arXiv:2305.10601 — tree-of-thoughts search.
  5. Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366 — verbal reflection / episodic memory.
  6. Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761 — self-supervised tool use + utility filter.
  7. Anthropic. Model Context Protocol (MCP). 2024-11. modelcontextprotocol.io — model↔tools/data standard (vertical).
  8. Google. Agent2Agent Protocol (A2A). 2025-04 (later under the Linux Foundation). a2a-protocol.org — agent↔agent interop (horizontal).
  9. OpenAI. Function calling and other API updates. 2023-06-13. openai.com — structured JSON-Schema tool calling.
  10. Liu et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172 — mid-context info is ignored.
  11. Greshake et al. Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. 2023. arXiv:2302.12173 — indirect prompt injection.
  12. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770 — real code-fix benchmark.
  13. OpenAI. Introducing SWE-bench Verified. 2024-08. openai.com — 500-problem human-validated subset.
  14. Mialon et al. GAIA: a benchmark for General AI Assistants. ICLR 2024. arXiv:2311.12983 — general-assistant eval (human 92%).
  15. Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024. arXiv:2404.07972 — computer-use eval (human 72.36%).
  16. Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024. arXiv:2307.13854 — web-agent eval (human 78.24%).
  17. Liu et al. AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688 — 8-environment eval.
  18. Yao et al. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. 2024. arXiv:2406.12045 — multi-turn tool-user + pass^k.
  19. Chan et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. ICLR 2025. arXiv:2410.07095 — Kaggle ML-engineering eval.