long-horizon = multi-step, long-running tasks requiring sustained autonomous execution; self-evolving = enabling an agent to continuously improve via self-generated data / self-feedback.
⚠️ This page uses strict column separation: 【Production】= shipped products / official engineering guides; 【Frontier】= papers / technical reports, not yet industry standards. In interviews, do not present frontier findings as production-standard answers.
Integrity notice: the "interview question clusters" on this page are high-frequency clusters inferred from public papers / JDs, not verified real exam questions; no unverified benchmark numbers are cited. Deep frontier topics (fully automated self-evolution, etc.) are out of scope for this playbook; only signals are given.
1. 【Production】What Long-Horizon Agentic Systems Look Like Today (Shipped Products)
Two organizations have shipped "long-horizon agentic" systems as products, which makes them the most credible examples to cite when discussing "agent deployment":
- Anthropic computer use (2024-10-22 public beta; Anthropic API / Amazon Bedrock / Google Vertex AI): Claude views screenshots → moves cursor / clicks / types, translating instructions into a sequence of computer operations; tasks "require tens, sometimes hundreds of steps." Officially described as experimental and error-prone; starting with low-risk tasks is recommended. [1]
- OpenAI Operator (2025-01 research preview, ChatGPT Pro) → ChatGPT agent (2025-07-17, merging Operator + deep research): the underlying CUA (Computer-Using Agent) = GPT-4o vision + RL reasoning; loop = view screenshot → CoT reasoning for next step → click/scroll/type, until the task completes or a human handoff is required. Safety: password entry requires user takeover; high-risk tasks (e.g., bank transfers) are declined. [4] ChatGPT agent operates on a virtual computer with visual browser + text browser + terminal + API. [5]
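The screenshot → reasoning → action loop described above can be sketched in a few lines. This is a minimal illustration, not either vendor's implementation: `Action`, `run_computer_use_loop`, `scripted_policy`, and the `max_steps` cap are hypothetical names introduced here, and the model is replaced by a scripted stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # "click", "type", "scroll", "done", or "handoff"
    detail: str = ""


def run_computer_use_loop(
    take_screenshot: Callable[[], str],
    choose_action: Callable[[str, str, List[Action]], Action],
    execute: Callable[[Action], None],
    instruction: str,
    max_steps: int = 50,
) -> str:
    """Run the loop; return "done", "handoff", or "step_limit"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                            # observe
        action = choose_action(instruction, screenshot, history)  # reason
        if action.kind in ("done", "handoff"):
            return action.kind       # finished, or a human must take over
        execute(action)                                           # act
        history.append(action)
    return "step_limit"              # stopping condition: step budget exhausted


def scripted_policy(instruction: str, screenshot: str, history: List[Action]) -> Action:
    """Hypothetical stand-in for the model, used only for illustration."""
    if "password" in screenshot:
        return Action("handoff", "user must enter the password")
    if len(history) >= 2:
        return Action("done")
    return Action("click", f"step {len(history)}")
```

In this sketch the handoff is just another action kind, so the caller sees one return value for "finished", "needs a human", and "ran out of steps".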
2. 【Production】Engineering Pillars of Long-Horizon Agents (Official Guides, High Interview Frequency)
Anthropic's "Building Effective Agents" (2024-12-19) [2] ≈ the engineering "bible" of this field:
- Workflow vs Agent (must-know): workflow = LLM / tools follow predefined code paths; agent = LLM dynamically decides the process and tool usage itself (who owns the control flow is the key distinction).
- Common patterns: prompt chaining, routing, parallelization (sectioning / voting), orchestrator-workers, evaluator-optimizer; and autonomous agents (loop driven by environment feedback, planning autonomously to completion or until a stopping condition is triggered).
- When to use an agent: task is open-ended, number of steps is unpredictable, path cannot be hardcoded, and the higher latency/cost tradeoff for better performance is acceptable — otherwise try a simpler solution first (single call + retrieval + few-shot).
- Engineering essentials: craft the ACI (agent-computer interface) as carefully as HCI; set stopping conditions (e.g., max iterations); evaluate progress at every step with environment ground truth (tool results / code execution); sandbox + guardrails to prevent error accumulation.
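The workflow / agent distinction and the stopping condition can be made concrete with a small sketch. It assumes a toy setup: `run_workflow`, `run_agent`, `demo_decide`, and the string tools are hypothetical, and `decide` stands in for the LLM call.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Observation = Tuple[str, str]   # (tool name, tool result)


def run_workflow(text: str, steps: List[Tool]) -> str:
    """Workflow: the code fixes the order of steps in advance."""
    for step in steps:
        text = step(text)
    return text


def run_agent(
    task: str,
    decide: Callable[[str, List[Observation]], Tuple[str, str]],
    tools: Dict[str, Tool],
    max_iterations: int = 10,
) -> Tuple[str, List[Observation]]:
    """Agent: `decide` (a stand-in for the LLM) picks the next tool each turn."""
    observations: List[Observation] = []
    for _ in range(max_iterations):
        tool_name, tool_input = decide(task, observations)
        if tool_name == "finish":
            return tool_input, observations
        result = tools[tool_name](tool_input)      # ground truth from the environment
        observations.append((tool_name, result))
    return "stopped: iteration limit reached", observations


def demo_decide(task: str, observations: List[Observation]) -> Tuple[str, str]:
    """Hypothetical policy: call one tool, then finish with its result."""
    if not observations:
        return "upper", task
    return "finish", observations[-1][1]
```

The only structural difference between the two functions is who owns the control flow: the `steps` list in the workflow, the `decide` callable in the agent. The `max_iterations` cap is what keeps a policy that never says "finish" from looping forever.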
Claude Agent SDK [3] core loop (worth memorizing):
gather context → take action → verify work → repeat
- Context management (avoid blowing context in long runs): compaction (auto-summarize old messages), file system as memory (grep / tail on demand), subagents (isolated context, parallel execution, return only summaries).
- Self-verification: rules-based (linter, precise), visual (screenshot, verifies UI), LLM-as-judge (fuzzy criteria, costs latency / stability).
- Reliability = appropriate tools + clear feedback + representative scenario testing + iterating on failure modes.
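Compaction, the first context-management technique listed above, can be sketched as follows. This is an illustration under stated assumptions, not the SDK's implementation: `compact`, `max_chars`, and `keep_recent` are hypothetical names, character count is used as a crude proxy for tokens, and `summarize` stands in for an LLM summarization call.

```python
from typing import Callable, List


def compact(
    messages: List[str],
    summarize: Callable[[List[str]], str],
    max_chars: int = 200,
    keep_recent: int = 2,
) -> List[str]:
    """Replace older messages with one summary once the history grows too long."""
    total = sum(len(m) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return messages                       # still within budget: nothing to do
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    return [summarize(old)] + recent          # summary first, recent turns verbatim
```

Keeping the most recent turns verbatim is a design choice of this sketch: the agent's next step usually depends most on what just happened, while older turns tolerate lossy summarization.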
3. 【Frontier】Training Paradigms for Long-Horizon / Agentic Systems (Research Context, Not Yet Industry Standards)
The following is research context, not a production deployment claim for any specific product. In interviews you may say "I've been tracking direction X" — do not say "this is the industry standard."
3.1 Sparse / Long-Horizon Rewards + Difficulty-Band Design (High-Frequency System Design Question)
Long-horizon task rewards are sparse (often only a final pass/fail signal). A recurring engineering principle: effective RL signal only exists in the intermediate difficulty band, so training data must be explicitly kept from collapsing to either extreme.
Self-Play SWE-RL [6]: the same LLM both injects bugs and fixes them, using a test suite as reward. The bug-injection reward is a piecewise function of the solve rate $s$ (the fraction of fixers who solved the bug).
Negative reward is given for both "too hard" (no one solves it, $s = 0$) and "too easy" (everyone solves it, $s = 1$); only intermediate difficulty is rewarded. (This page only uses the reward design; performance numbers are unverified and not cited.)
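The shape of such a difficulty-band reward can be sketched as below, writing the solve rate as a number in [0, 1]. Only the sign pattern comes from the description above (negative at solve rate 0 and 1, positive in between); the constants `penalty` and `in_band_reward`, the flat in-band value, and the function names are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def compute_solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of fixer attempts that solved the injected bug."""
    if not outcomes:
        raise ValueError("need at least one fixer attempt")
    return sum(outcomes) / len(outcomes)


def injection_reward(
    solve_rate: float,
    penalty: float = 1.0,          # illustrative value, not from the paper
    in_band_reward: float = 1.0,   # illustrative value, not from the paper
) -> float:
    """Piecewise reward: penalize both extremes, reward the middle band."""
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be in [0, 1]")
    if solve_rate == 0.0 or solve_rate == 1.0:
        return -penalty            # too hard (nobody solves) or too easy (everybody solves)
    return in_band_reward          # intermediate difficulty carries the signal
```

For example, a bug solved by 2 of 4 fixer attempts lands in the rewarded band, while one solved by 0 of 4 or 4 of 4 is penalized.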
MiMo [7] (RL recipe for Xiaomi's released model): dynamic sampling filters out prompts with pass-rate 0 or 1, and maintains a 10% easy-question pool to prevent instability in late-stage policy updates.
The motivation of both is the same = difficulty-adaptive curriculum: concentrate signal on problems the model "can almost solve but hasn't yet mastered."
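A minimal sketch of pass-rate filtering with an easy pool follows. It is not MiMo's code: `split_by_pass_rate`, `sample_prompt`, and `easy_prob` are hypothetical names, and reading the "10%" as a per-draw probability of sampling from the easy pool is an assumption of this sketch.

```python
import random
from typing import Dict, List, Tuple


def split_by_pass_rate(pass_rates: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (trainable, easy_pool): keep 0 < rate < 1, pool rate == 1, drop rate == 0."""
    trainable = [p for p, r in pass_rates.items() if 0.0 < r < 1.0]
    easy_pool = [p for p, r in pass_rates.items() if r == 1.0]
    return trainable, easy_pool


def sample_prompt(
    trainable: List[str],
    easy_pool: List[str],
    rng: random.Random,
    easy_prob: float = 0.10,   # assumed reading of the "10%" figure
) -> str:
    """Mostly draw from the informative band; occasionally replay an easy prompt."""
    if not trainable:
        raise ValueError("no prompts with intermediate pass rate")
    if easy_pool and rng.random() < easy_prob:
        return rng.choice(easy_pool)
    return rng.choice(trainable)
```

Prompts with pass rate 0 or 1 contribute no learning signal under group-relative advantages because every rollout gets the same reward, which is why the filter keeps only the middle band for the main batch.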
3.2 Three Core Challenges of Web / Agent RL (WebRL Framework)
Standard structure for answering "why is web/long-horizon agent training hard": ① scarcity of training tasks; ② sparse feedback signals; ③ policy distribution shift. [8] (WebRL's core is "failed trajectories → self-evolving curriculum"; this page only uses its three-challenge framework and does not expand on that mechanism.)
3.3 Connections to Other Pages
GRPO improvements (Clip-Higher, remove KL loss) originate from ByteDance DAPO [9] and were adopted by MiMo; see reasoning-rl-frontier (not repeated here).
4. 【Frontier · Least Mature】Self-Evolving Agents (Requires Most Caution)
The vast majority of this area is research; production evidence is weak. Do not claim this as an industry standard in interviews, and do not cite unverified numbers.
- Core idea: let the agent continuously improve via self-generated data / self-play / generating new tasks from failures / reflection-self-correction, bypassing human annotation.
- Current state of the field: papers exist exploring this direction (automated curriculum, self-play search, etc.), but strong claims of "unsupervised fully automatic scaling" often do not hold up to verification; treat as an open research question, not a mature solution. Deep treatment is left to the independent agent-post-training-playbook.
- Honest connection (your research): your Continual Agent (in progress) + Fed-TaLoRA anti-forgetting perspective → you can say "I focus on continual learning / catastrophic forgetting in agents, which gives me an understanding of why self-evolution is still immature in production and where the boundaries are"; do not say "I have built production self-evolving agent systems." See continual-post-training.
5. Interview Question Clusters / Stratified Follow-ups
Inferred from public papers / JDs as high-frequency clusters; not real exam questions.
L1 Fundamentals
- What is the difference between a workflow and an agent? When should you not use an agent (when a workflow / single call is sufficient)?
- How does computer use / Operator work (the screenshot → reasoning → action loop)?
L2 Intermediate
- How do you prevent context overflow in long-horizon agents (compaction / files as memory / subagents)?
- How do you handle sparse rewards in long-horizon tasks? Why use a difficulty band / dynamic sampling (filter out pass-rate 0 or 1)?
- How does an agent self-verify (rules / visual / LLM-as-judge), and what are the costs of each?
L3 Deep Dive
- Design the reward for a long-horizon coding agent: how do you prevent reward hacking and degeneration (no signal when too hard / too easy)?
- What are the three core challenges of web / long-horizon agent training (task scarcity / sparse feedback / policy shift) and how is each mitigated?
- What are the failure modes of self-evolving / self-play training? Why is full automation still not trusted in production?
References
- [1] Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (2024-10-22). anthropic.com
- [2] Anthropic. Building Effective Agents (2024-12-19). anthropic.com
- [3] Anthropic. Building agents with the Claude Agent SDK. claude.com
- [4] OpenAI. Introducing Operator / Computer-Using Agent (CUA) (2025-01). openai.com
- [5] OpenAI. Introducing ChatGPT agent (2025-07-17). openai.com
- [6] Wei et al. (Meta / FAIR). Toward Training Superintelligent Software Agents through Self-Play SWE-RL. arXiv:2512.18552. This page only uses the reward design; unverified performance numbers are not included.
- [7] Xiaomi LLM-Core. MiMo Technical Report. arXiv:2505.07608.
- [8] Qi et al. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum RL. arXiv:2411.02337. This page only uses the three-challenge framework; the self-evolving curriculum mechanism is not expanded.
- [9] ByteDance Seed. DAPO. arXiv:2503.14476.