
Alpha-Eval
AI can code. It cannot predict.
Six frontier AI models. Thirty independent trials, five per model (pass@5). A rigorous multi-stage trading benchmark with hidden out-of-sample data and an 80% net-exposure constraint. Zero passed.
6 Models
30 Trials
+2.784 Ground-truth Sharpe
Equity curves
Ground truth compounds. Frontier models bleed.
Models that ship pull requests for Anthropic cannot construct a hedged portfolio. Models that solve IMO problems cannot read an order book. The capability is there. The training signal is not.
[Figure: Equity Curves, Hidden Out-of-Sample Period (2023-2026). Series: Ground Truth, Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4, GPT-5.3 Codex, Grok-4, Qwen3 Coder.]
Note: the equity curves are synthetic, based on model Sharpe ratios, and will be replaced with actual P&L data.
Methodology
Five stages, one hidden OOS window.
Models must pass each gate in sequence. Early failures cap the score at −6.0 and block downstream stages. Stage 2 is scored on 2023–2026 data the model never sees.
An 80% net-exposure constraint (|net| / gross ≤ 0.80 at every bar) prevents pure directional bets. Models must actually build hedged portfolios.
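A minimal sketch of how the exposure constraint could be checked, assuming positions are given as signed notionals per instrument at each bar. The function names and the choice to treat a completely flat book as compliant are illustrative assumptions, not the benchmark's actual grader.

```python
def net_exposure_ratio(positions):
    """Return |net| / gross for one bar; positions are signed notionals."""
    net = sum(positions)
    gross = sum(abs(p) for p in positions)
    # Assumption: a flat book (gross == 0) is treated as trivially compliant.
    return 0.0 if gross == 0 else abs(net) / gross


def passes_exposure_gate(bars, limit=0.80):
    """True only if every bar satisfies |net| / gross <= limit."""
    return all(net_exposure_ratio(p) <= limit for p in bars)


# Hypothetical signed notionals for (NIFTY, BANKNIFTY) at successive bars.
hedged = [(1.0, -0.5), (0.75, -0.25), (0.0, 0.0)]
directional = [(1.0, -0.5), (1.0, 0.25)]

print(passes_exposure_gate(hedged))       # every bar is at or below 0.80
print(passes_exposure_gate(directional))  # second bar is fully net long
```

Because the check applies at every bar, a single unhedged bar is enough to fail the gate under this reading, even if the rest of the run is balanced.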
| Stage | Task | Description |
|---|---|---|
| 0 | Data Alignment | Merge NIFTY 50 + BANKNIFTY minute OHLC. 737K bars, 2015–2022. |
| 1a | SMA Crossover | Implement specified moving average strategy with known ground truth. |
| 1b | Z-Score Spread | Pairs trading on the NIFTY–BANKNIFTY spread with cooldown logic. |
| 1c | Portfolio Combination | Vol-normalised equal-weight combination of strategies. |
| 2 | Alpha Research | Discover, backtest, and combine hedged strategies. Scored on hidden 2023–2026 OOS. |
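To make the Stage 1 tasks concrete, here is a rough sketch of the three building blocks in plain Python. The benchmark's real specification is not given here, so the window lengths, entry threshold, exit rule (close when the z-score crosses back through zero), and cooldown semantics are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def sma(values, window):
    """Trailing simple moving average; None until the window is full."""
    return [
        mean(values[i - window + 1 : i + 1]) if i + 1 >= window else None
        for i in range(len(values))
    ]


def sma_crossover(closes, fast=3, slow=5):
    """+1 while the fast SMA is above the slow SMA, -1 while below, else 0."""
    signal = []
    for a, b in zip(sma(closes, fast), sma(closes, slow)):
        if a is None or b is None or a == b:
            signal.append(0)
        else:
            signal.append(1 if a > b else -1)
    return signal


def zscore_positions(spread, window=4, entry=1.0, cooldown=2):
    """Fade the spread when |z| > entry; after an exit, sit out `cooldown` bars."""
    positions, pos, wait = [], 0, 0
    for i in range(len(spread)):
        if i + 1 < window:
            positions.append(0)
            continue
        hist = spread[i - window + 1 : i + 1]
        sd = pstdev(hist)
        z = 0.0 if sd == 0 else (spread[i] - mean(hist)) / sd
        if pos != 0 and z * pos >= 0:
            # Assumed exit rule: z has crossed back through zero.
            pos, wait = 0, cooldown
        elif pos == 0 and wait > 0:
            wait -= 1  # still cooling down, ignore any entry signal
        elif pos == 0 and abs(z) > entry:
            pos = -1 if z > 0 else 1  # short a rich spread, long a cheap one
        positions.append(pos)
    return positions


def vol_normalised_combo(strategy_returns):
    """Scale each return stream to unit volatility, then average equally."""
    scaled = []
    for rets in strategy_returns:
        vol = pstdev(rets)
        scaled.append([r / vol if vol > 0 else 0.0 for r in rets])
    return [sum(bar) / len(scaled) for bar in zip(*scaled)]


print(sma_crossover([1, 2, 3, 4, 5, 4, 3, 2, 1], fast=2, slow=3))
print(zscore_positions([0, 0, 0, 3, 0, 9, 0, 0, 0, 9]))
```

Each signal at bar t uses only data up to and including bar t. A backtest would normally apply it to the following bar's return so that no future information leaks into the position.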
Leaderboard
Results.
| # | Model | Avg | Best | Worst | Std |
|---|---|---|---|---|---|
| ★ | Ground Truth | +2.784 | +2.784 | +2.784 | 0.000 |
| 1 | Claude Opus 4.6 | -2.098 | -0.328 | -3.000 | 1.218 |
| 2 | Gemini 3.1 Pro | -2.724 | +0.419 | -4.041 | 1.671 |
| 3 | GPT-5.4 | -3.000 | -3.000 | -3.000 | 0.000 |
| 4 | GPT-5.3 Codex | -3.300 | -3.000 | -4.500 | 0.671 |
| 5 | Grok-4 | -3.600 | -3.000 | -6.000 | 1.342 |
| 6 | Qwen3 Coder Next | -5.000 | -5.000 | -5.000 | 0.000 |
Individual trials
Pass@5, independently seeded.
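| Model | Avg | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | -2.098 | -0.328 | -1.163 | -3.000 | -3.000 | -3.000 |
| Gemini 3.1 Pro | -2.724 | +0.419 | -3.000 | -3.000 | -4.000 | -4.041 |
| GPT-5.4 | -3.000 | -3.000 | -3.000 | -3.000 | -3.000 | -3.000 |
| GPT-5.3 Codex | -3.300 | -3.000 | -3.000 | -3.000 | -3.000 | -4.500 |
| Grok-4 | -3.600 | -3.000 | -3.000 | -3.000 | -6.000 | -3.000 |
| Qwen3 Coder Next | -5.000 | -5.000 | -5.000 | -5.000 | -5.000 | -5.000 |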
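The leaderboard columns can be recomputed from the per-trial scores. The sketch below does so with the standard library. Treating Std as the sample standard deviation (n − 1 denominator) is an assumption: it reproduces the GPT-5.3 Codex and Grok-4 rows, but the page does not state which estimator is used.

```python
from statistics import mean, stdev

# Per-trial scores as listed for each model.
trials = {
    "Claude Opus 4.6": [-0.328, -1.163, -3.000, -3.000, -3.000],
    "Gemini 3.1 Pro": [0.419, -3.000, -3.000, -4.000, -4.041],
    "GPT-5.4": [-3.000, -3.000, -3.000, -3.000, -3.000],
    "GPT-5.3 Codex": [-3.000, -3.000, -3.000, -3.000, -4.500],
    "Grok-4": [-3.000, -3.000, -3.000, -6.000, -3.000],
    "Qwen3 Coder Next": [-5.000, -5.000, -5.000, -5.000, -5.000],
}


def summarise(scores):
    """Average, best, worst, and sample standard deviation of one model's trials."""
    return {
        "avg": round(mean(scores), 3),
        "best": max(scores),
        "worst": min(scores),
        "std": round(stdev(scores), 3),  # assumption: sample (n - 1) estimator
    }


summary = {model: summarise(scores) for model, scores in trials.items()}

# Rank by average score, highest first, as the leaderboard does.
for model in sorted(summary, key=lambda m: summary[m]["avg"], reverse=True):
    print(model, summary[model])
```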
Gate pass rates
Gates passed, by model.
| Model | S0 | S1a | S1b | S1c | Look | Exp | All |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 4/5 | 4/5 |
| Gemini 3.1 Pro | 5/5 | 5/5 | 4/5 | 3/5 | 5/5 | 5/5 | 3/5 |
| GPT-5.4 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 | 5/5 |
| GPT-5.3 Codex | 5/5 | 5/5 | 4/5 | 5/5 | 5/5 | 5/5 | 4/5 |
| Grok-4 | 4/5 | 4/5 | 4/5 | 4/5 | 4/5 | 3/5 | 2/5 |
| Qwen3 Coder Next | 5/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 | 0/5 |