Benchmark measuring how well AI models solve mazes. Models are given a maze and must find a way out using a tool call.
More models coming soon™ (when I have money)
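To make the setup above concrete, here is a minimal sketch of what a maze environment with a single movement tool could look like. Everything in it is an assumption for illustration only: the grid encoding, the `MazeEnv` class, the `move` tool name, and the shape of the result dictionary are hypothetical and are not taken from the benchmark's actual implementation.

```python
# Hypothetical maze encoding: '#' wall, '.' open cell, 'S' start, 'E' exit.
MAZE = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..E#",
    "#####",
]

# Direction name -> (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(maze, ch):
    """Return the (row, col) of the first cell containing `ch`."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"maze has no {ch!r} cell")


class MazeEnv:
    """Illustrative environment; not the benchmark's real harness."""

    def __init__(self, maze):
        self.maze = maze
        self.pos = find(maze, "S")
        self.exit = find(maze, "E")
        self.steps = 0  # number of valid-direction move attempts

    def move(self, direction):
        """Hypothetical tool call: attempt one step and report the outcome."""
        if direction not in MOVES:
            return {"ok": False, "position": self.pos, "solved": False}
        self.steps += 1
        dr, dc = MOVES[direction]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        in_bounds = 0 <= r < len(self.maze) and 0 <= c < len(self.maze[r])
        if not in_bounds or self.maze[r][c] == "#":
            # Blocked: the position does not change.
            return {"ok": False, "position": self.pos, "solved": False}
        self.pos = (r, c)
        return {"ok": True, "position": self.pos, "solved": self.pos == self.exit}


if __name__ == "__main__":
    env = MazeEnv(MAZE)
    # A model would pick these directions itself, one tool call at a time.
    for direction in ["right", "right", "down", "down"]:
        print(direction, env.move(direction))
```

In a sketch like this, a run would count as successful once a tool call returns `solved: True`, and the step counter gives one possible way to compare runs.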
Complexity
Size
Observation
Charts
Gemini 3 Flash (low)
Gemini 3 Flash (high)
GPT-5 (default)
Grok 4.1 Fast
Gemini 3 Flash (none)
GPT-5 Mini
DeepSeek V3.1
GPT-5.2 (none)
GPT OSS 120B
Model Comparison
Successful Only
Individual Runs
Runs
Tiny (5x5)
Small (11x11)
Medium (21x21)
31x31