Cheatsheet

Long-Horizon / Self-Evolving Agents: Production State vs. Frontier (Interview-Oriented)

Long-horizon = multi-step, long-running tasks requiring sustained autonomous execution; self-evolving = enabling an agent to continuously improve via self-generated data / self-feedback.

⚠️ This page uses strict column separation: 【Production】= shipped products / official engineering guides; 【Frontier】= papers / technical reports, not yet industry standards. In interviews, do not treat frontier findings as production-standard answers.

Integrity notice: the "interview question clusters" on this page are high-frequency question clusters inferred from public papers / JDs, not verified real exam questions; no unverified benchmark numbers are cited. Deep frontier topics (fully automated self-evolution, etc.) are out of scope for this playbook; only signals are given.

1. 【Production】What Long-Horizon Agentic Systems Look Like Today (Shipped Products)

Two organizations have shipped "long-horizon agentic" as a product — hard currency when discussing "agent deployment":

2. 【Production】Engineering Pillars of Long-Horizon Agents (Official Guides, High Interview Frequency)

Anthropic, "Building Effective Agents" (2024-12-19) [2] (workflow vs. agent, ACI, stopping conditions, per-step environment ground truth) ≈ the engineering "bible" of this field:

Claude Agent SDK [3] (long-horizon agent loop plus context management: compaction / files as memory / subagents). Core loop (worth memorizing):

gather context → take action → verify work → repeat
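
To make the loop concrete, here is a minimal Python sketch. It is not the Claude Agent SDK API: `LoopState`, `run_agent_loop`, and the three callbacks are hypothetical names chosen for illustration, and the step cap is just one simple stopping condition.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopState:
    """Hypothetical container for what the agent has seen and done so far."""
    context: List[str] = field(default_factory=list)
    done: bool = False


def run_agent_loop(
    gather: Callable[[LoopState], str],
    act: Callable[[LoopState], str],
    verify: Callable[[LoopState, str], bool],
    max_steps: int = 10,
) -> LoopState:
    """Run gather -> act -> verify -> repeat until verified or out of steps."""
    state = LoopState()
    for _ in range(max_steps):               # hard step cap as a stopping condition
        state.context.append(gather(state))  # gather context
        result = act(state)                  # take action
        state.context.append(result)
        if verify(state, result):            # verify work
            state.done = True
            break                            # otherwise: repeat
    return state


# Toy usage: "act" counts up, "verify" accepts once the count reaches 3.
counter = {"n": 0}


def toy_gather(state: LoopState) -> str:
    return f"observed n={counter['n']}"


def toy_act(state: LoopState) -> str:
    counter["n"] += 1
    return f"set n={counter['n']}"


def toy_verify(state: LoopState, result: str) -> bool:
    return counter["n"] >= 3


final_state = run_agent_loop(toy_gather, toy_act, toy_verify)
print(final_state.done, len(final_state.context))
```

In a real agent the callbacks would wrap tool calls and model calls; the point of the sketch is only that verification sits inside the loop and decides whether to repeat.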

3. 【Frontier】Training Paradigms for Long-Horizon / Agentic Systems (Research Context, Not Yet Industry Standards)

Caution

The following is research context, not a production deployment claim for any specific product. In interviews you may say "I've been tracking direction X" — do not say "this is the industry standard."

3.1 Sparse / Long-Horizon Rewards + Difficulty-Band Design (High-Frequency System Design Question)

Long-horizon task rewards are sparse (often only a final pass/fail signal). A recurring engineering principle: useful RL signal exists only in the intermediate difficulty band, so training data must be explicitly prevented from collapsing to either extreme.

Self-Play SWE-RL [6] (Wei 2025): the same LLM both injects bugs and fixes them, using a test suite as reward. The bug-injection reward is a piecewise function of s, the fraction of fixers who solved the bug (i.e., the solve rate):

$$r_{\text{inject}} = \begin{cases} -\alpha, & s \in \{0, 1\} \\ 1-(1+\alpha)\,s, & 0 < s < 1 \end{cases}, \quad \alpha = 0.8$$

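A fixed negative reward is given both when the bug is too hard (no one solves it, s = 0) and when it is too easy (everyone solves it, s = 1). Only the open interval 0 < s < 1 escapes that penalty, and within it the reward falls linearly as s rises, so hard-but-solvable bugs score highest. (This page only uses the reward design; performance numbers are unverified and not cited.)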
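
As a worked illustration of the piecewise reward, the sketch below evaluates it for a few solve rates. The function name `inject_reward` and the choice of 8 fixer attempts are assumptions made for the example, not details taken from the paper.

```python
def inject_reward(solve_rate: float, alpha: float = 0.8) -> float:
    """Piecewise bug-injection reward as a function of the solve rate s."""
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must lie in [0, 1]")
    if solve_rate == 0.0 or solve_rate == 1.0:
        # Unsolvable or trivially solvable bug: fixed penalty.
        return -alpha
    # Intermediate band: linear in s, highest for hard-but-solvable bugs.
    return 1.0 - (1.0 + alpha) * solve_rate


# Hypothetical setting: 8 fixer attempts per injected bug.
attempts = 8
for solved in (0, 1, 4, 7, 8):
    s = solved / attempts
    print(f"s={s:.3f}  r_inject={inject_reward(s):+.3f}")
```

With α = 0.8 the linear branch crosses zero at s = 1/(1+α) ≈ 0.556, so a bug solved by most fixers still receives a negative reward, though a smaller penalty than the fixed -α at s = 1.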

MiMo [7] (Xiaomi 2025; the RL recipe for Xiaomi's released model, which also removes the KL loss and uses Clip-Higher): dynamic sampling filters out prompts with pass rate 0 or 1, and an easy-question pool is maintained and sampled from with 10% probability to prevent instability in late-stage policy updates.

The motivation of both is the same = difficulty-adaptive curriculum: concentrate signal on problems the model "can almost solve but hasn't yet mastered."
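
The pass-rate filter can be sketched in a few lines. This is an illustrative toy, not MiMo's implementation: the `rollouts` layout (prompt id mapped to a list of pass/fail outcomes) and the function names are assumptions.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts for one prompt that passed."""
    return sum(outcomes) / len(outcomes)


def filter_informative(rollouts: Dict[str, List[bool]]) -> Dict[str, float]:
    """Keep only prompts whose pass rate is strictly between 0 and 1."""
    kept: Dict[str, float] = {}
    for prompt_id, outcomes in rollouts.items():
        if not outcomes:
            continue  # no rollouts yet, nothing to estimate
        rate = pass_rate(outcomes)
        if 0.0 < rate < 1.0:
            kept[prompt_id] = rate
    return kept


# Hypothetical batch: four rollouts per prompt.
rollouts = {
    "too_hard": [False, False, False, False],
    "too_easy": [True, True, True, True],
    "frontier": [True, False, False, True],
}
informative = filter_informative(rollouts)
print(informative)
```

One reason the extremes are dropped: under group-normalized advantages (as in GRPO-style methods), a prompt whose rollouts all receive the same reward yields zero advantage for every sample, so it contributes no gradient signal.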

3.2 Three Core Challenges of Web / Agent RL (WebRL Framework)

Standard structure for answering "why is web/long-horizon agent training hard": ① scarcity of training tasks; ② sparse feedback signals; ③ policy distribution shift [8]. (WebRL's core is "failed trajectories → self-evolving curriculum"; this page only uses its three-challenge framework and does not expand on that mechanism.)

3.3 Connections to Other Pages

GRPO improvements (Clip-Higher, remove KL loss) originate from ByteDance's DAPO [9] and were adopted by MiMo; see reasoning-rl-frontier. Not repeated here.

4. 【Frontier · Least Mature】Self-Evolving Agents (Requires Most Caution)

Caution

The vast majority of this area is research; production evidence is weak. Do not claim this as an industry standard in interviews, and do not cite unverified numbers.

5. Interview Question Clusters / Stratified Follow-ups

Inferred from public papers / JDs as high-frequency clusters; not real exam questions.

L1 Fundamentals

L2 Intermediate

L3 Deep Dive

References


  1. Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (2024-10-22). anthropic.com
  2. Anthropic. Building Effective Agents (2024-12-19). anthropic.com
  3. Anthropic. Building agents with the Claude Agent SDK. claude.com
  4. OpenAI. Introducing Operator / Computer-Using Agent (CUA) (2025-01). openai.com
  5. OpenAI. Introducing ChatGPT agent (2025-07-17). openai.com
  6. Wei et al. (Meta / FAIR). Toward Training Superintelligent Software Agents through Self-Play SWE-RL. arXiv:2512.18552. This page only uses the reward design; unverified performance numbers are not included.
  7. Xiaomi LLM-Core. MiMo Technical Report. arXiv:2505.07608.
  8. Qi et al. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum RL. arXiv:2411.02337. This page only uses the three-challenge framework; the self-evolving curriculum mechanism is not expanded.
  9. ByteDance Seed. DAPO. arXiv:2503.14476.