FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
- Date: May 15, 2026
- Category: Research
A benchmark that doesn't saturate.
Unsolved: No solution has achieved a perfect score.
Open-ended: Research and optimization challenges.
Verifiable: Continuous scoring, so there is always room to improve.
Diverse: Systems, ML, algorithms, security, and more.
→ Read our blog here.
→ Read our paper here.
frontier-cs · harbor trial · 2.0
# Run a Frontier-CS 2.0 task through Harbor
$ uv run frontier harbor trial 2.0 erdos_unit_distance \
    -a codex -m gpt-5.5 --json
generating frontier-cs-2-0-erdos-unit-distance
starting Harbor trial with iterative submissions...
submit #1 score: 55.80
submit #2 score: 78.50
trial_status=scored agent_status=completed
reward=0.7850 cost=$1.94 submissions=2
tokens: 1.51M input / 19.9K output
✓ Harbor trial scored: 78.50
✓ result.json and verifier artifacts saved
172 problems
| Rank | Model | Score@1 | Avg@5 | Score@5 | Elo |
|---|---|---|---|---|---|
| 1 | gemini-3.0-pro | 33.12 | 34.58 | 56.09 | 1265 |
| 2 | gpt-5.2-thinking | 32.40 | 33.11 | 47.19 | 1242 |
| 3 | gpt-5-thinking | 23.10 | 22.58 | 39.73 | 1196 |
| 4 | deepseek-3.2 | 24.83 | 23.89 | 41.44 | 1193 |
| 5 | grok-4 | 24.04 | 22.98 | 36.81 | 1174 |
| 6 | gemini-2.5-pro | 20.34 | 19.32 | 36.65 | 1167 |
| 7 | gpt-5.1-thinking | 20.64 | 21.49 | 34.76 | 1164 |
Human reference: 86.99 (Score@1)
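The page does not define the leaderboard columns. The sketch below shows one common reading, stated as an assumption: Score@k is the best score among the first k attempts on a problem, and Avg@k is the mean of those k attempts, with both then averaged over problems. The function names and the attempt scores are made up for illustration and are not taken from the benchmark.

```python
from statistics import mean

def score_at_k(scores, k):
    """Best score among the first k attempts (assumed meaning of Score@k)."""
    return max(scores[:k])

def avg_at_k(scores, k):
    """Mean score over the first k attempts (assumed meaning of Avg@k)."""
    return mean(scores[:k])

# Hypothetical per-attempt scores for a single problem, on a 0-100 scale.
attempts = [20.0, 35.0, 10.0, 50.0, 25.0]

print(score_at_k(attempts, 1))  # 20.0
print(avg_at_k(attempts, 5))    # 28.0
print(score_at_k(attempts, 5))  # 50.0
```

Under this reading, a large gap between Avg@5 and Score@5 would indicate high variance across attempts rather than a consistently strong solution.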
68 problems
| Rank | Model | Score@1 | Avg@5 | Score@5 | Elo |
|---|---|---|---|---|---|
| 1 | gemini-3.0-pro | 46.55 | 43.14 | 59.22 | 1283 |
| 2 | gpt-5-thinking | 30.91 | 34.94 | 55.25 | 1218 |
| 3 | gpt-5.1-thinking | 32.12 | 33.70 | 56.79 | 1214 |
| 4 | gpt-5.2-thinking | 30.29 | 34.09 | 58.90 | 1210 |
| 5 | gemini-2.5-pro | 21.66 | 25.74 | 51.57 | 1180 |
| 6 | grok-4 | 26.75 | 24.01 | 48.15 | 1149 |
| 7 | deepseek-3.2 | 21.51 | 21.76 | 44.41 | 1146 |
178 tasks
| Rank | Agent | Score | Avg Steps | Avg Tools | Avg Tokens |
|---|---|---|---|---|---|
| 1 | Kimi K2.6 | 46.9 | 67.2 | 70.6 | 155.6K |
| 2 | Claude Code Opus 4.7 | 43.0 | 77.2 | 42.2 | 251K |
These are preview Harbor runs with a 5-hour timeout per task. See the release blog for trace-level analysis.
Frontier-CS tasks are open-ended, verifiable optimization challenges: agents can inspect the task, iterate on submissions, and climb a continuous score instead of passing a single hidden test.
Place a fixed set of planar points so that as many pairs as possible sit exactly one unit apart. The task rewards geometric search, symmetry-aware construction, and steady agent iteration.
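The quantity being maximized in the unit-distance task can be sketched as a pair count. This is a minimal illustration, not the benchmark's verifier: the function name `count_unit_pairs` and the tolerance `tol` are assumptions, and the real task may check distances exactly or normalize the count differently.

```python
import math
from itertools import combinations

def count_unit_pairs(points, tol=1e-9):
    """Count unordered point pairs whose Euclidean distance is 1 within tol."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if abs(math.hypot(x1 - x2, y1 - y2) - 1.0) <= tol
    )

h = math.sqrt(3) / 2
square = [(0, 0), (1, 0), (1, 1), (0, 1)]         # four unit sides, two longer diagonals
rhombus = [(0, 0), (1, 0), (0.5, h), (1.5, h)]    # two equilateral triangles sharing an edge

print(count_unit_pairs(square))   # 4
print(count_unit_pairs(rhombus))  # 5
```

With the same four points, the rhombus beats the square by one pair, which hints at why symmetric, triangle-based constructions tend to do well on this kind of objective.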
Pack thousands of small geometric pieces into the tightest possible rectangle. The task rewards heuristic design, rotation/reflection handling, and long-horizon layout improvement.
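Two ingredients of the packing task, orientation handling and bounding-rectangle measurement, can be sketched as follows. The piece format is an assumption: pieces are modeled here as sets of unit grid cells, which the page does not specify, and `orientations` and `bounding_area` are hypothetical helpers rather than part of the benchmark.

```python
def normalize(cells):
    """Shift a piece so its minimum x and y are both zero."""
    mx = min(x for x, _ in cells)
    my = min(y for _, y in cells)
    return frozenset((x - mx, y - my) for x, y in cells)

def orientations(cells):
    """Return the distinct rotations and reflections of a piece."""
    seen = set()
    cur = list(cells)
    for _ in range(4):
        cur = [(-y, x) for x, y in cur]                  # rotate by 90 degrees
        seen.add(normalize(cur))
        seen.add(normalize([(-x, y) for x, y in cur]))   # mirrored copy
    return seen

def bounding_area(placed):
    """Area of the bounding rectangle of placed pieces; raises on overlap."""
    occupied = set()
    for piece in placed:
        if occupied & set(piece):
            raise ValueError("pieces overlap")
        occupied |= set(piece)
    xs = [x for x, _ in occupied]
    ys = [y for _, y in occupied]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)

l_piece = [(0, 0), (1, 0), (0, 1)]
print(len(orientations(l_piece)))  # 4

# Two L-shaped pieces interlock to fill a 3 x 2 rectangle exactly.
layout = [{(0, 0), (1, 0), (0, 1)}, {(2, 0), (2, 1), (1, 1)}]
area = bounding_area(layout)
print(area)                                 # 6
print(sum(len(p) for p in layout) / area)   # 1.0
```

The final ratio is the fraction of the bounding rectangle that is covered. A heuristic search would try different orientations and positions to push that ratio toward 1.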
Academic institutions
UC Berkeley
Princeton
Stanford
MIT
UCSD
UW
Georgia Tech
Michigan
NYU
UIUC
Toronto
NTU