Alpha-Eval

AI can code. It cannot predict.

Six frontier AI models. Thirty independent trials, five per model (pass@5). A rigorous multi-stage trading benchmark with hidden out-of-sample data and an 80% net-exposure constraint. Zero passed.

Models: 6
Trials: 30
Ground-truth Sharpe: +2.784
Equity curves

Ground truth compounds. Frontier models bleed.

Models that ship pull requests for Anthropic cannot construct a hedged portfolio. Models that solve IMO problems cannot read an order book. The capability is there. The training signal is not.

[Figure: Equity Curves, Hidden Out-of-Sample Period (2023–2026). Series: Ground Truth, Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, GPT-5.3 Codex, Grok-4, Qwen3 Coder Next.]
The equity curves shown are synthetic, generated from each model's Sharpe ratio. They will be replaced with actual P&L data.
Methodology

Five stages, one hidden OOS window.

Models must pass each gate in sequence. Early failures cap the score at −6.0 and block downstream stages. Stage 2 is scored on 2023–2026 data the model never sees.

An 80% net-exposure constraint (|net| / gross ≤ 0.80 at every bar) prevents pure directional bets. Models must actually build hedged portfolios.
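The constraint above can be checked bar by bar. The following is a minimal sketch, not the benchmark's own checker: the function names, the per-instrument signed-weight representation, and the treatment of a flat book as compliant are all assumptions made for illustration.

```python
def net_exposure_ratio(weights):
    """Return |net| / gross for one bar.

    `weights` are signed position sizes per instrument (hypothetical
    representation): positive is long, negative is short.
    """
    gross = sum(abs(w) for w in weights)
    if gross == 0:
        return 0.0  # assumption: a flat book counts as compliant
    return abs(sum(weights)) / gross


def satisfies_constraint(bars, limit=0.80):
    """True only if every bar keeps |net| / gross at or below `limit`."""
    tolerance = 1e-12  # guards against floating-point noise at the boundary
    return all(net_exposure_ratio(w) <= limit + tolerance for w in bars)


# Long 1.0 NIFTY, short 0.5 BANKNIFTY: net 0.5, gross 1.5, ratio about 0.33.
hedged = [1.0, -0.5]
# Long 1.0, short 0.1: net 0.9, gross 1.1, ratio about 0.82, over the limit.
too_directional = [1.0, -0.1]

print(satisfies_constraint([hedged]))                   # True
print(satisfies_constraint([hedged, too_directional]))  # False
```

Because the check runs at every bar, a single over-exposed bar fails the whole run; a pure long or pure short book has a ratio of 1.0 and can never pass.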

Stage 0: Data Alignment. Merge NIFTY 50 + BANKNIFTY minute OHLC. 737K bars, 2015–2022.
Stage 1a: SMA Crossover. Implement specified moving average strategy with known ground truth.
Stage 1b: Z-Score Spread. Pairs trading on the NIFTY–BANKNIFTY spread with cooldown logic.
Stage 1c: Portfolio Combination. Vol-normalised equal-weight combination of strategies.
Stage 2: Alpha Research. Discover, backtest, and combine hedged strategies. Scored on hidden 2023–2026 OOS.
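To make the Stage 1 tasks concrete, here is a rough sketch of the three building blocks: an SMA crossover signal, a z-score spread signal with a cooldown, and a vol-normalised equal-weight combination. The page does not publish the benchmark's specification, so every window length, threshold, cooldown rule, and function name below is a hypothetical choice, and the volatility scaling uses the full sample for brevity.

```python
from statistics import mean, pstdev


def sma(values, window):
    """Trailing simple moving average; None until `window` bars exist."""
    return [
        None if i + 1 < window else mean(values[i + 1 - window : i + 1])
        for i in range(len(values))
    ]


def sma_crossover(closes, fast=20, slow=50):
    """+1 when the fast SMA is above the slow SMA, -1 when below, else 0.

    The signal at bar i uses the close of bar i, so a backtest should
    apply it from bar i + 1 onward to avoid look-ahead.
    """
    positions = []
    for f, s in zip(sma(closes, fast), sma(closes, slow)):
        if f is None or s is None or f == s:
            positions.append(0)
        else:
            positions.append(1 if f > s else -1)
    return positions


def zscore_spread(spread, window=60, entry=2.0, exit_=0.5, cooldown=10):
    """Mean-reversion position on a spread, with a cooldown after each exit.

    Short the spread when z > entry, long when z < -entry, go flat once
    |z| < exit_, then block new entries for `cooldown` bars (assumed rule).
    """
    positions, current, wait = [], 0, 0
    for i in range(len(spread)):
        if i + 1 < window:
            positions.append(0)  # warm-up: not enough history yet
            continue
        win = spread[i + 1 - window : i + 1]
        sd = pstdev(win)
        z = 0.0 if sd == 0 else (spread[i] - mean(win)) / sd
        if current != 0 and abs(z) < exit_:
            current, wait = 0, cooldown  # exit and start the cooldown
        elif current == 0:
            if wait > 0:
                wait -= 1  # still cooling down, ignore any entry signal
            elif z > entry:
                current = -1
            elif z < -entry:
                current = 1
        positions.append(current)
    return positions


def vol_normalised_combine(returns_by_strategy):
    """Scale each return series to unit volatility, then average per bar.

    Full-sample volatility is used here for brevity; a real backtest
    would need a trailing estimate to stay free of look-ahead.
    """
    scaled = []
    for series in returns_by_strategy:
        vol = pstdev(series)
        scaled.append([r / vol if vol > 0 else 0.0 for r in series])
    return [sum(bar) / len(scaled) for bar in zip(*scaled)]
```

For instance, `sma_crossover([1, 2, 3, 4, 5, 6], fast=2, slow=3)` stays flat for the two warm-up bars and is long afterwards. Scaling to unit volatility before averaging keeps the noisier strategy from dominating the combined return stream.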
Leaderboard

Results.

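| # | Model | Avg | Best | Worst | Std |
|---|---|---|---|---|---|
|   | Ground Truth | +2.784 | +2.784 | +2.784 | 0.000 |
| 1 | Claude Opus 4.6 | -2.098 | -0.328 | -3.000 | 1.218 |
| 2 | Gemini 3.1 Pro | -2.724 | +0.419 | -4.041 | 1.671 |
| 3 | GPT-5.4 | -3.000 | -3.000 | -3.000 | 0.000 |
| 4 | GPT-5.3 Codex | -3.300 | -3.000 | -4.500 | 0.671 |
| 5 | Grok-4 | -3.600 | -3.000 | -6.000 | 1.342 |
| 6 | Qwen3 Coder Next | -5.000 | -5.000 | -5.000 | 0.000 |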
Individual trials

Pass@5, independently seeded.

Claude Opus 4.6 (avg -2.098): -0.328, -1.163, -3.000, -3.000, -3.000
Gemini 3.1 Pro (avg -2.724): +0.419, -3.000, -3.000, -4.000, -4.041
GPT-5.4 (avg -3.000): -3.000, -3.000, -3.000, -3.000, -3.000
GPT-5.3 Codex (avg -3.300): -3.000, -3.000, -3.000, -3.000, -4.500
Grok-4 (avg -3.600): -3.000, -3.000, -3.000, -6.000, -3.000
Qwen3 Coder Next (avg -5.000): -5.000, -5.000, -5.000, -5.000, -5.000
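Each leaderboard row summarises one model's five trial scores. The sketch below shows that aggregation using Grok-4's trials. The `summarise` helper is hypothetical, and the choice of sample standard deviation (n − 1) is an assumption: the page does not state which estimator it uses, although this choice reproduces the Grok-4 row.

```python
from statistics import mean, stdev


def summarise(trials):
    """Aggregate per-trial scores into leaderboard-style columns."""
    return {
        "avg": round(mean(trials), 3),
        "best": max(trials),
        "worst": min(trials),
        "std": round(stdev(trials), 3),  # sample standard deviation (assumed)
    }


grok4_trials = [-3.0, -3.0, -3.0, -6.0, -3.0]
print(summarise(grok4_trials))
# {'avg': -3.6, 'best': -3.0, 'worst': -6.0, 'std': 1.342}
```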
Gate pass rates

Gates passed, by model.

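| Model | S0 | S1a | S1b | S1c | Look | Exp | All |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 | 4/5 |
| Gemini 3.1 Pro | 5/5 | 5/5 | 4/5 | 3/5 | 5/5 | 5/5 | 3/5 |
| GPT-5.4 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| GPT-5.3 Codex | 5/5 | 5/5 | 4/5 | 5/5 | 5/5 | 5/5 | 4/5 |
| Grok-4 | 4/5 | 4/5 | 4/5 | 4/5 | 4/5 | 3/5 | 2/5 |
| Qwen3 Coder Next | 5/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |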
Get access

Running a frontier lab? Test your model against the harshest scoreboard.