Source: Machina (@EXM7777) on X. Distilled and processed for Signal; not original content.

TL;DR

New research: a tiny model, small enough to run on a laptop, that answers nothing itself and only acts as a manager, deciding which big model handles each piece of a task. It beat ChatGPT, Gemini, and Claude individually and set a record on a hard coding benchmark. Swapping a top-tier model into the manager seat made results worse. Knowing who to ask beats being the smartest in the room.

How the manager runs one task

  1. Picks the best model to plan the work.
  2. Picks the best model to do the work.
  3. Picks the best model to check the work.
  4. Loops until the checker signs off.

The manager is a conductor: it cannot play an instrument, but it knows exactly who should play when.

Actionable items

  • Stop betting everything on one “best” model. A team of models beats any single one.
  • Never let a model check its own work. It has the same blind spots it had when it wrote the output, so it waves its own bugs through. Use a different model to check.
  • Build a “check” step before you ship anything. It quietly kills most bad outputs.
  • Routing beats raw intelligence: a small router outperformed frontier models as the manager, and putting a top-tier model in the manager seat did worse.

Key quotes

being the smartest in the room doesn’t matter, knowing who to ask does

the future isn’t one giant brain, it’s a small manager pointing the right models at the right jobs

claude-workflows-that-run-unattended