When trying a new repo, it is useful to ask an agent to do the setup as a troubleshooting exercise: read the docs, install the prerequisites, run the expected commands, and debug the errors that come up.
That also suggests a useful design constraint for projects: setup should be expressed as written instructions rather than hidden inside setup scripts. If an agent can follow the instructions from a clean checkout, the docs are probably good enough for a human too.
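One way to check that constraint is to pull the commands out of the docs and see whether they are complete enough to follow from a clean checkout. The sketch below is a minimal illustration, assuming the README is Markdown and that setup steps live in fenced blocks tagged `bash`, `sh`, `shell`, or `console`. The function name `extract_setup_commands` is hypothetical. Running the commands, for example in a throwaway container, is left out on purpose.

```python
import re

# Build the fence marker programmatically so this example can itself live
# inside a fenced block.
FENCE = "`" * 3
SHELL_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:bash|sh|shell|console)[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_setup_commands(readme_text: str) -> list[str]:
    """Return the commands found in fenced shell blocks, in document order."""
    commands: list[str] = []
    for block in SHELL_BLOCK.findall(readme_text):
        for raw_line in block.splitlines():
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and shell comments
            commands.append(line.removeprefix("$ "))  # drop a leading prompt
    return commands


if __name__ == "__main__":
    # Hypothetical README content, used only for illustration.
    sample = "\n".join([
        "Setup",
        FENCE + "bash",
        "# install dependencies",
        "$ npm install",
        "npm test",
        FENCE,
    ])
    for command in extract_setup_commands(sample):
        print(command)
```

If the extracted list is empty or skips a prerequisite, that is a hint the real setup knowledge lives somewhere other than the docs.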
You can push the same idea further and make the product itself generated from a spec. OpenAI’s symphony is an example of a repository shaped around that kind of spec-driven generation.
Matt Pocock has a small handoff skill for compacting an AI coding session so another agent can pick it up later.
The useful bit is that the handoff is not just “summarize the chat”. It asks the agent to write a temporary handoff document for a fresh agent, suggest relevant skills for the next session, and avoid duplicating details already captured in durable artifacts such as plans, PRDs, issues, commits, and diffs.
That framing makes the handoff more operational: point at durable state, capture only the missing context, and tailor the note to what the next session is supposed to do.
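As a rough illustration of that framing, a handoff note can be modeled as pointers to durable state plus only the context that exists nowhere else. This is a sketch under assumptions, not the format the skill itself uses: the `Handoff` class, its field names, and the section titles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical shape for a temporary handoff note (not the skill's own format)."""

    next_task: str  # what the next session is supposed to do
    durable_refs: list[str] = field(default_factory=list)  # plans, PRDs, issues, commits, diffs
    missing_context: list[str] = field(default_factory=list)  # only what is not written down elsewhere
    suggested_skills: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the note as plain text for a fresh agent."""

        def section(title: str, items: list[str]) -> list[str]:
            body = [f"- {item}" for item in items] or ["- (none)"]
            return [title, *body, ""]

        lines = [f"Next session: {self.next_task}", ""]
        lines += section("Durable state (read first, not repeated here):", self.durable_refs)
        lines += section("Context not captured elsewhere:", self.missing_context)
        lines += section("Suggested skills:", self.suggested_skills)
        return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # All values below are made up for the example.
    note = Handoff(
        next_task="Finish the retry logic and get the test suite green",
        durable_refs=["docs/plan.md", "the open issue describing the retry bug"],
        missing_context=["The flaky test appears to be timing related, not a logic error"],
        suggested_skills=["an evaluator skill for reviewing the final diff"],
    )
    print(note.render())
```

Keeping `durable_refs` as references rather than copied content is the point: the note stays short, and the durable artifacts remain the source of truth.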
In “Build Agents That Run for Hours (Without Losing the Plot)”, Ash Prabaker and Andrew Wilson described a practical pattern for long-running coding agents: split planning, building, and evaluation into separate roles, and make the evaluator adversarial.
Takeaways:
- Self-evaluation is a trap. Use an adversarial evaluator.
- Make subjective quality gradable with clear evaluation criteria the model can apply.
- Read the traces. They’re your primary debugging loop.
- Delete scaffolding when the model catches up. The frontier moves.
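To make the “gradable” takeaway concrete, here is a minimal sketch of a rubric: each criterion is a question the evaluator can answer with a score, and the scores are combined into a verdict. Everything in it is an assumption made for illustration, including the 1 to 5 scale, the weights, the per-criterion minimum, and the names `Criterion` and `grade`. It is not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the evaluator is asked to judge
    weight: float  # relative importance in the overall score
    minimum: int   # lowest acceptable score on an assumed 1-5 scale


def grade(criteria: list[Criterion], scores: dict[str, int]) -> dict:
    """Combine per-criterion scores into a weighted average and a verdict."""
    missing = [c.name for c in criteria if c.name not in scores]
    if missing:
        raise ValueError(f"evaluator did not score: {', '.join(missing)}")

    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * scores[c.name] for c in criteria) / total_weight
    # A single criterion below its minimum fails the whole output,
    # so a high average cannot hide one serious problem.
    failures = [c.name for c in criteria if scores[c.name] < c.minimum]
    return {"score": round(weighted, 2), "passed": not failures, "failures": failures}


if __name__ == "__main__":
    # Hypothetical rubric and scores.
    rubric = [
        Criterion("correctness", "Does the change do what the task asked?", 2.0, 4),
        Criterion("readability", "Could a new contributor follow the code?", 1.0, 3),
    ]
    print(grade(rubric, {"correctness": 3, "readability": 5}))
```

In practice the scores would come from the evaluator role, and the questions are where most of the refinement effort goes.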
Agents tend to take shortcuts when evaluating their own work, and they can be too agreeable. It helps to make evaluation a separate role: a skill or agent definition that specializes in QA, looks for missing cases, and refuses to accept success-shaped claims without evidence.
A practical way to make one is to ask an agent: “Create a skill to evaluate <your problem space>. Use <link to an existing evaluator skill> for tone and approach.” Then refine the evaluation criteria, tighten the failure modes, and apply it to real outputs.
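The “no success-shaped claims without evidence” rule can also be enforced mechanically before any judgment is applied. The sketch below assumes the builder reports its work as a list of claims, each with attached evidence such as test output or trace excerpts. The `Claim` type and `review` function are hypothetical names, and a real evaluator would also need to check that the evidence actually supports the claim.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    statement: str
    evidence: list[str] = field(default_factory=list)  # e.g. test output, trace excerpts


def review(claims: list[Claim]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into those backed by some evidence and bare assertions."""
    accepted: list[Claim] = []
    rejected: list[Claim] = []
    for claim in claims:
        # Whitespace-only entries do not count as evidence.
        has_evidence = any(item.strip() for item in claim.evidence)
        (accepted if has_evidence else rejected).append(claim)
    return accepted, rejected


if __name__ == "__main__":
    # Made-up report from a builder agent.
    report = [
        Claim("All tests pass", ["pytest summary pasted from the run"]),
        Claim("Edge cases are handled"),
    ]
    accepted, rejected = review(report)
    for claim in rejected:
        print(f"Rejected (no evidence): {claim.statement}")
```

A gate like this only filters out unsupported claims. Deciding whether the supplied evidence is convincing remains the evaluator's job.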