MazeBench


Benchmark measuring how well AI models solve mazes. Each model is given a maze and must find a way out using tool calls.
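As a rough illustration of what an escape loop driven by tool calls can look like, here is a minimal sketch. The `move` tool, the toy grid, and the BFS driver are illustrative assumptions, not MazeBench's actual harness or API:

```python
# Hypothetical sketch of a maze-escape tool loop; `move`, the grid,
# and the BFS driver are assumptions, not MazeBench's real interface.
from collections import deque

MAZE = [
    "#####",
    "#S  #",
    "# # #",
    "#   E",
    "#####",
]
DIRS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def find(ch):
    """Locate a character in the grid."""
    for r, row in enumerate(MAZE):
        if (c := row.find(ch)) != -1:
            return (r, c)

def move(pos, direction):
    """Tool call: attempt one step; returns (new_position, escaped)."""
    dr, dc = DIRS[direction]
    r, c = pos[0] + dr, pos[1] + dc
    if MAZE[r][c] == "#":
        return pos, False              # blocked by a wall: stay put
    return (r, c), MAZE[r][c] == "E"   # escaped if we reached the exit

def solve():
    """Breadth-first search driven purely through `move` tool calls."""
    start = find("S")
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        pos, steps = queue.popleft()
        for d in DIRS:
            new, escaped = move(pos, d)
            if escaped:
                return steps + 1       # number of moves taken to escape
            if new != pos and new not in seen:
                seen.add(new)
                queue.append((new, steps + 1))

print(solve())  # shortest escape in this toy maze takes 5 moves
```

A real benchmarked model explores without seeing the whole grid, so its step counts (as in the table below) can be far above the shortest path.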

More models coming soon™ (when I have money)


Charts

[Chart: Success Rate by Model]
[Chart: Avg Time vs Success Rate (how average time relates to success rate)]

Model Comparison

Model                   Success  Escapes  Avg Steps  Avg Time  Cost      Efficiency
Gemini 3 Flash (low)    66.7%    12       352.7      595.79s   $6.3489   42.20%
Gemini 3 Flash (high)   44.4%    8        111.4      847.58s   $13.8257  37.15%
GPT-5 (default)         44.4%    8        68.6       482.54s   $8.2309   39.77%
Grok 4.1 Fast           38.9%    7        60.5       230.73s   $1.8620   37.86%
Gemini 3 Flash (none)   33.3%    6        115.9      586.41s   $12.0850  30.13%
GPT-5 Mini              27.8%    5        48.6       221.33s   $0.4753   27.78%
DeepSeek V3.1           16.7%    3        267.6      828.82s   $3.6561   7.49%
GPT-5.2 (none)          16.7%    3        10.2       14.92s    $0.2980   16.67%
GPT OSS 120B            16.7%    3        17.7       226.23s   $0.0809   15.05%
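The escape counts and success rates are mutually consistent with 18 runs per model (e.g. 12/18 ≈ 66.7%). A quick check of that arithmetic; the run count of 18 is an inference, not something the page states:

```python
# RUNS = 18 is inferred from escapes vs. success rate (12/18 = 66.7%),
# not stated explicitly anywhere on the page.
RUNS = 18
escapes = {
    "Gemini 3 Flash (low)": 12,
    "GPT-5 (default)": 8,
    "Grok 4.1 Fast": 7,
    "GPT OSS 120B": 3,
}
for model, n in escapes.items():
    print(f"{model}: {n}/{RUNS} = {n / RUNS:.1%}")
```

Every row in the table reproduces this way, which is why a shared run count of 18 is a safe reading.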

Individual Runs

Runs can be filtered by complexity and observation, and each run can be replayed. Maze sizes:

Tiny (5x5)
Small (11x11)
Medium (21x21)
31x31