Lesson

The Scorecard

The run IS an eval. Build the trace: commands, tokens, service health, encounter response time.

Stub. Observability (o11y) as a game mechanic: every command is logged, tokens are tracked per tick, and service state is recorded as a timeline. The run trace is saved as JSON so runs can be compared across models: which brain gets the best tokens per fix? The scorecard IS the eval report.
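The trace described above could be sketched roughly as follows. This is a minimal illustration, not the lesson's actual implementation: the names RunTrace, Tick, log, fixes and tokens_per_fix are all hypothetical, as are the "up"/"down" service states and the rule that a "fix" is any service going from down to up between consecutive ticks. Encounter response time is left out to keep the sketch short.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Tick:
    tick: int        # position in the run, starting at 0
    command: str     # command the model issued on this tick
    tokens: int      # tokens spent on this tick
    services: dict   # service name -> "up" or "down" (assumed states)


@dataclass
class RunTrace:
    model: str
    ticks: list = field(default_factory=list)

    def log(self, command, tokens, services):
        # Record one tick; copy the service map so later changes don't alter history.
        self.ticks.append(Tick(len(self.ticks), command, tokens, dict(services)))

    def total_tokens(self):
        return sum(t.tokens for t in self.ticks)

    def fixes(self):
        # Assumption: a fix is a service that was "down" on one tick and "up" on the next.
        count = 0
        for prev, cur in zip(self.ticks, self.ticks[1:]):
            for name, state in cur.services.items():
                if state == "up" and prev.services.get(name) == "down":
                    count += 1
        return count

    def tokens_per_fix(self):
        # Lower is better; None when the run fixed nothing.
        fixes = self.fixes()
        return self.total_tokens() / fixes if fixes else None

    def to_json(self):
        # The saved run trace: model name plus the full tick timeline.
        return json.dumps(asdict(self), indent=2)


trace = RunTrace(model="model-a")
trace.log("systemctl status web", 120, {"web": "down", "db": "up"})
trace.log("systemctl restart web", 80, {"web": "up", "db": "up"})
print(trace.tokens_per_fix())  # 200.0
```

Comparing models then amounts to producing one such trace per model on the same scenario and ranking them by tokens_per_fix; the JSON from to_json is what would be kept as the scorecard.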