Alpha-Eval

AI can code. It cannot predict.

Six frontier AI models. Thirty independent trials, five per model (pass@5). A rigorous multi-stage trading benchmark with hidden out-of-sample data and an 80% net-exposure constraint. Zero passed.

Models: 6

Trials: 30

Ground-truth Sharpe: +2.784

Equity curves

Ground truth compounds. Frontier models bleed.

Models that ship pull requests for Anthropic cannot construct a hedged portfolio. Models that solve IMO problems cannot read an order book. The capability is there. The training signal is not.

[Figure: Equity curves, hidden out-of-sample period (2023–2026). Series: Ground Truth, Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, GPT-5.3 Codex, Grok-4, Qwen3 Coder.]

These equity curves are synthetic, generated from each model's Sharpe ratio. They will be replaced with actual P&L data.


Methodology

Five stages, one hidden OOS window.

Models must pass each gate in sequence. Early failures cap the score at −6.0 and block downstream stages. Stage 2 is scored on 2023–2026 data the model never sees.

An 80% net-exposure constraint (|net| / gross ≤ 0.80 at every bar) prevents pure directional bets. Models must actually build hedged portfolios.
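The constraint can be checked bar by bar. The sketch below is one reading of the formula above, assuming each bar is given as a list of signed notional positions (positive for long, negative for short). The function names and the treatment of a flat book are illustrative assumptions, not taken from the benchmark harness.

```python
from typing import Sequence

MAX_NET_TO_GROSS = 0.80  # limit stated in the methodology


def net_to_gross(positions: Sequence[float]) -> float:
    """Return |net| / gross for one bar of signed notional positions."""
    net = sum(positions)
    gross = sum(abs(p) for p in positions)
    # Assumption: a flat book (gross == 0) counts as zero exposure.
    return abs(net) / gross if gross > 0 else 0.0


def satisfies_constraint(bars: Sequence[Sequence[float]]) -> bool:
    """True only if every bar stays within the net-exposure limit."""
    return all(net_to_gross(bar) <= MAX_NET_TO_GROSS for bar in bars)
```

Under this reading a long-only book has a ratio of 1.0 and fails, while a dollar-neutral pair has a ratio of 0.0 and passes.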

Data Alignment

Merge NIFTY 50 + BANKNIFTY minute OHLC. 737K bars, 2015–2022.
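As a rough illustration of the alignment step, the sketch below inner-joins two minute-bar series on their timestamps. The input layout (a dict keyed by ISO-formatted timestamp strings) and the choice to drop minutes missing from either index are assumptions; the benchmark's actual file format and gap handling are not described here.

```python
from typing import Dict, List, Tuple

Bar = Tuple[float, float, float, float]  # (open, high, low, close)


def align_bars(nifty: Dict[str, Bar],
               banknifty: Dict[str, Bar]) -> List[Tuple[str, Bar, Bar]]:
    """Inner-join two minute-bar series on timestamp, sorted by time."""
    # ISO-formatted timestamps sort chronologically as plain strings.
    common = sorted(nifty.keys() & banknifty.keys())
    return [(ts, nifty[ts], banknifty[ts]) for ts in common]
```

Only minutes present in both series survive, so every downstream calculation sees one NIFTY bar and one BANKNIFTY bar per row.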

SMA Crossover

Implement specified moving average strategy with known ground truth.
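The exact strategy specification is not reproduced on this page, so the following is only a generic moving-average crossover sketch. The window lengths `fast` and `slow` and the +1 / -1 / 0 signal convention are hypothetical placeholders.

```python
from typing import List, Optional, Sequence


def sma(prices: Sequence[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until `window` prices are available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, p in enumerate(prices):
        running += p
        if i >= window:
            running -= prices[i - window]  # drop the price leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


def crossover_signal(prices: Sequence[float],
                     fast: int = 3, slow: int = 5) -> List[int]:
    """+1 when the fast SMA is above the slow SMA, -1 when below, else 0."""
    signal: List[int] = []
    for f, s in zip(sma(prices, fast), sma(prices, slow)):
        if f is None or s is None or f == s:
            signal.append(0)  # warm-up period or exact tie
        else:
            signal.append(1 if f > s else -1)
    return signal
```

A backtest would normally act on each signal at the following bar, so that a decision never uses the close it is traded on.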

Z-Score Spread

Pairs trading on the NIFTY–BANKNIFTY spread with cooldown logic.
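One way to picture the cooldown logic is the sketch below. It assumes a rolling z-score computed from past bars only, an entry threshold, an exit threshold, and a fixed number of flat bars after each exit. All parameter names and default values are hypothetical, and the spread itself is taken as a precomputed input series.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_positions(spread: Sequence[float], window: int = 20,
                     entry: float = 2.0, exit_: float = 0.5,
                     cooldown: int = 5) -> List[int]:
    """Position in the spread per bar: +1 long, -1 short, 0 flat."""
    positions: List[int] = []
    pos, wait = 0, 0
    for i, x in enumerate(spread):
        if i < window:
            positions.append(0)  # not enough history yet
            continue
        hist = spread[i - window:i]  # past bars only, excludes the current bar
        sd = pstdev(hist)
        z = (x - mean(hist)) / sd if sd > 0 else 0.0
        if pos != 0 and abs(z) < exit_:
            pos, wait = 0, cooldown  # close the trade and start the cooldown
        elif pos == 0 and wait > 0:
            wait -= 1  # still cooling down, stay flat
        elif pos == 0 and abs(z) > entry:
            pos = -1 if z > 0 else 1  # fade the deviation
        positions.append(pos)
    return positions
```

The cooldown branch is checked before the entry branch, so a large deviation immediately after an exit is ignored until the wait counter reaches zero.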

Portfolio Combination

Vol-normalised equal-weight combination of strategies.
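A minimal sketch of a vol-normalised equal-weight combination follows, assuming each strategy is supplied as a per-bar return series of equal length. Scaling by full-sample volatility and the `target_vol` parameter are simplifying assumptions; a production version would more likely use a trailing volatility estimate.

```python
from statistics import pstdev
from typing import List, Sequence


def combine_vol_normalised(strategy_returns: Sequence[Sequence[float]],
                           target_vol: float = 1.0) -> List[float]:
    """Scale each return stream to `target_vol`, then average them equally."""
    scaled = []
    for rets in strategy_returns:
        vol = pstdev(rets)
        # Assumption: a strategy with zero volatility gets zero weight.
        scale = target_vol / vol if vol > 0 else 0.0
        scaled.append([r * scale for r in rets])
    n = len(scaled)
    return [sum(bar) / n for bar in zip(*scaled)]
```

Without the scaling step, the more volatile strategy would dominate the combined P&L even though the nominal weights are equal.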

Alpha Research

Discover, backtest, and combine hedged strategies. Scored on hidden 2023–2026 OOS.
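Since the headline metric is a Sharpe ratio, a sketch of the usual annualised form may help. It assumes a zero risk-free rate, the sample standard deviation, and a caller-supplied `periods_per_year`; how the benchmark itself annualises minute-bar returns is not stated on this page.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def annualised_sharpe(returns: Sequence[float], periods_per_year: int) -> float:
    """Mean over sample std of per-period returns, scaled by sqrt(periods)."""
    sd = stdev(returns)
    if sd == 0:
        raise ValueError("Sharpe is undefined for constant returns")
    return mean(returns) / sd * math.sqrt(periods_per_year)
```

For daily returns a common choice is `periods_per_year=252`; other bar sizes need a different factor.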

Leaderboard

Results.

#	Model	Avg	Best	Worst	Std
★	Ground Truth	+2.784	+2.784	+2.784	0.000
1	Claude Opus 4.6	-2.098	-0.328	-3.000	1.218
2	Gemini 3.1 Pro	-2.724	+0.419	-4.041	1.671
3	GPT-5.4	-3.000	-3.000	-3.000	0.000
4	GPT-5.3 Codex	-3.300	-3.000	-4.500	0.671
5	Grok-4	-3.600	-3.000	-6.000	1.342
6	Qwen3 Coder Next	-5.000	-5.000	-5.000	0.000

Individual trials

Pass@5, independently seeded.

Claude Opus 4.6 (avg -2.098)

-0.328, -1.163, -3.000, -3.000, -3.000

Gemini 3.1 Pro (avg -2.724)

+0.419, -3.000, -3.000, -4.000, -4.041

GPT-5.4 (avg -3.000)

-3.000, -3.000, -3.000, -3.000, -3.000

GPT-5.3 Codex (avg -3.300)

-3.000, -3.000, -3.000, -3.000, -4.500

Grok-4 (avg -3.600)

-3.000, -3.000, -3.000, -6.000, -3.000

Qwen3 Coder Next (avg -5.000)

-5.000, -5.000, -5.000, -5.000, -5.000
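The leaderboard columns can be recomputed from the per-trial scores. The sketch below does so for two models, assuming Std is the sample standard deviation (n - 1 denominator), which reproduces the GPT-5.3 Codex and Grok-4 rows. The helper name `summarise` is illustrative.

```python
from statistics import mean, stdev

# Per-trial scores copied from the lists above.
trials = {
    "GPT-5.3 Codex": [-3.000, -3.000, -3.000, -3.000, -4.500],
    "Grok-4": [-3.000, -3.000, -3.000, -6.000, -3.000],
}


def summarise(scores):
    """Return (avg, best, worst, sample std), rounded to 3 decimals."""
    return (round(mean(scores), 3), max(scores), min(scores),
            round(stdev(scores), 3))


summary = {model: summarise(scores) for model, scores in trials.items()}
# summary["GPT-5.3 Codex"] -> (-3.3, -3.0, -4.5, 0.671)
# summary["Grok-4"]        -> (-3.6, -3.0, -6.0, 1.342)
```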

Gate pass rates

Gates passed, by model.

Model	S0	S1a	S1b	S1c	Look-ahead	Exposure	All
Claude Opus 4.6	5/5	5/5	5/5	5/5	5/5	4/5	4/5
Gemini 3.1 Pro	5/5	5/5	4/5	3/5	5/5	5/5	3/5
GPT-5.4	5/5	5/5	5/5	5/5	5/5	5/5	5/5
GPT-5.3 Codex	5/5	5/5	4/5	5/5	5/5	5/5	4/5
Grok-4	4/5	4/5	4/5	4/5	4/5	3/5	2/5
Qwen3 Coder Next	5/5	0/5	0/5	0/5	0/5	0/5	0/5

Get access

Running a frontier lab? Test your model against the harshest scoreboard.
