Benchmarks

The harshest scoreboard.

Four arenas, built on hidden out-of-sample data. Benchmarks can be gamed; P&L cannot. No frontier model has passed.

Methodology

Ungameable by design.

Every arena runs on hidden out-of-sample data, with cascading gates that prevent reward hacking: a model that games one stage cannot progress to the next. The training signal is the same one our desk has traded on since 2017.
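For illustration only, here is a minimal sketch of how a cascading gate could work. The actual gates, metrics, and thresholds are not described here, so every name below (`Gate`, `run_gates`, `mean_return`, `worst_drawdown`) and every threshold is a hypothetical stand-in. The point it shows is the ordering: a submission is scored stage by stage, and the first failed stage ends the run.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Gate:
    """One stage of the cascade (hypothetical structure)."""
    name: str
    score: Callable[[Sequence[float]], float]  # higher is better
    threshold: float                           # minimum score to pass


def mean_return(returns: Sequence[float]) -> float:
    """Average per-period return."""
    return sum(returns) / len(returns)


def worst_drawdown(returns: Sequence[float]) -> float:
    """Negative of the largest peak-to-trough drop in cumulative return."""
    cumulative = 0.0
    peak = 0.0
    max_drop = 0.0
    for r in returns:
        cumulative += r
        peak = max(peak, cumulative)
        max_drop = max(max_drop, peak - cumulative)
    return -max_drop


def run_gates(returns: Sequence[float], gates: Sequence[Gate]) -> dict:
    """Evaluate gates in order and stop at the first failure."""
    passed: list[str] = []
    failed_at: Optional[str] = None
    for gate in gates:
        if gate.score(returns) < gate.threshold:
            failed_at = gate.name
            break  # later gates are never evaluated
        passed.append(gate.name)
    return {"passed": passed, "failed_at": failed_at}


# Assumed thresholds, chosen only to make the example concrete.
GATES = [
    Gate("mean_return", mean_return, 0.0),
    Gate("drawdown", worst_drawdown, -0.05),
]

if __name__ == "__main__":
    # Out-of-sample per-period returns for a hypothetical submission.
    print(run_gates([0.05, -0.2, 0.3], GATES))
```

In this sketch a submission with a good average return but a deep drawdown clears the first gate and stops at the second, so a high score on one stage alone does not carry it forward.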

If your model scores well on our benchmarks, it has learned something the open internet could not teach it. The internet has no alpha.

All methodology, scoring, and results are published. We invite the frontier labs to beat us.