Frontier-CS Blog Posts

Recent blog posts

Humans Still Beat AI in the Long Horizon: Revisiting Test-Time Scaling in the Agent Era

Date: June 16, 2026
Category: Research


Roadmap to FrontierCS 2.0

Date: June 11, 2026
Category: Research


FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Date: May 15, 2026
Category: Research


Leaderboard

Updated 2026-07-25

Algorithmic Track

172 problems

Rank	Model	Score@1	Avg@5	Score@5	Elo
1	gemini-3.0-pro	33.12	34.58	56.09	1265
2	gpt-5.2-thinking	32.40	33.11	47.19	1242
3	gpt-5-thinking	23.10	22.58	39.73	1196
4	deepseek-3.2	24.83	23.89	41.44	1193
5	grok-4	24.04	22.98	36.81	1174
6	gemini-2.5-pro	20.34	19.32	36.65	1167
7	gpt-5.1-thinking	20.64	21.49	34.76	1164

Human reference: 86.99 (Score@1)

Research Track

68 problems

Rank	Model	Score@1	Avg@5	Score@5	Elo
1	gemini-3.0-pro	46.55	43.14	59.22	1283
2	gpt-5-thinking	30.91	34.94	55.25	1218
3	gpt-5.1-thinking	32.12	33.70	56.79	1214
4	gpt-5.2-thinking	30.29	34.09	58.90	1210
5	gemini-2.5-pro	21.66	25.74	51.57	1180
6	grok-4	26.75	24.01	48.15	1149
7	deepseek-3.2	21.51	21.76	44.41	1146

Agent Track

188 tasks

Rank	Agent	Score	Avg Steps	Avg Tools	Avg Tokens
1	GPT-5.6-sol (Codex)	76.4	57.8	51.8	1.93M
2	Claude Code Opus 4.8	74.5	355.4	145.7	14.72M
3	GPT-5.5 (Codex)	72.1	48.3	55.8	2.02M
4	Qwen3.7 Max (Claude Code)	61.9	133.9	139.1	13.85M
5	Gemini 3.1 Pro (Gemini CLI)	60.2	74.3	41.6	2M
6	Kimi K2.6	46.9	67.2	70.6	6.79M
7	Claude Code Opus 4.7	43.0	77.2	42.2	10.96M

Preview results from Harbor runs with a 5-hour timeout per task. See the release blog for trace-level analysis.


Example tasks

Frontier-CS tasks are open-ended, verifiable optimization challenges: agents can inspect the task, iterate on submissions, and climb a continuous score instead of passing a single hidden test.


Erdos Unit Distance


Place a fixed set of planar points so that as many pairs as possible sit exactly one unit apart. The task rewards geometric search, symmetry-aware construction, and steady agent iteration.
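To make the objective concrete, here is a minimal sketch of how a candidate point set might be scored: count the unordered pairs whose Euclidean distance equals one, up to a numerical tolerance. The function name `count_unit_pairs` and the tolerance `tol` are assumptions for illustration; the benchmark's actual evaluator and precision rules are not described here.

```python
from itertools import combinations
from math import hypot


def count_unit_pairs(points, tol=1e-9):
    """Count unordered pairs of points at distance 1 (within tol).

    `tol` is a hypothetical tolerance; the real task may require exact
    or differently bounded arithmetic.
    """
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if abs(hypot(x1 - x2, y1 - y2) - 1.0) <= tol
    )


# A unit square has 4 unit-length sides; its diagonals are longer than 1.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Two equilateral triangles sharing an edge give 5 unit pairs on 4 points.
h = 3 ** 0.5 / 2
rhombus = [(0.0, 0.0), (1.0, 0.0), (0.5, h), (1.5, h)]

print(count_unit_pairs(square), count_unit_pairs(rhombus))
```

The rhombus beating the square on the same number of points is a small instance of why symmetry-aware constructions matter. This brute-force check is quadratic in the number of points, which is fine for verification but not for search.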

Polyomino Packing


Pack thousands of small geometric pieces into the tightest possible rectangle. The task rewards heuristic design, rotation/reflection handling, and long-horizon layout improvement.
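As an illustration of the two ingredients named above, the sketch below enumerates the rotations and reflections of a piece and computes the bounding-rectangle area of a layout while rejecting overlaps. It assumes pieces are given as sets of unit grid cells and that a smaller bounding area is better; the names `orientations` and `bounding_area` are hypothetical, and the task's real input format and scoring are not specified here.

```python
def normalize(cells):
    """Shift a piece so its minimum x and y are both zero."""
    min_x = min(x for x, _ in cells)
    min_y = min(y for _, y in cells)
    return frozenset((x - min_x, y - min_y) for x, y in cells)


def orientations(cells):
    """Return the distinct rotations and reflections of a piece."""
    result = set()
    current = list(cells)
    for _ in range(4):
        current = [(y, -x) for x, y in current]  # rotate by 90 degrees
        result.add(normalize(current))
        result.add(normalize([(-x, y) for x, y in current]))  # mirror image
    return result


def bounding_area(placements):
    """Area of the bounding rectangle of a non-empty layout.

    `placements` is a list of (cells, (dx, dy)) pairs. Raises ValueError
    if two pieces cover the same grid cell.
    """
    occupied = set()
    for cells, (dx, dy) in placements:
        for x, y in cells:
            cell = (x + dx, y + dy)
            if cell in occupied:
                raise ValueError(f"overlap at {cell}")
            occupied.add(cell)
    xs = [x for x, _ in occupied]
    ys = [y for _, y in occupied]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


# Two L-trominoes tile a 3x2 rectangle when one of them is rotated.
l_piece = frozenset({(0, 0), (1, 0), (0, 1)})
l_rotated = frozenset({(1, 0), (1, 1), (0, 1)})
layout = [(l_piece, (0, 0)), (l_rotated, (1, 0))]
area = bounding_area(layout)
```

With thousands of pieces, a set-based overlap check like this is suitable for validating a layout; a search heuristic would likely need a faster occupancy structure.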