Validated Benchmarking for Long-Horizon Spatial Biology

Introducing SpatialBench-Long, a long-horizon spatial biology benchmark. Agents must recover biological claims from raw data and realistic experimental context, with no prescribed methods.
Its 24 experiments span primary tumors, organoids, xenograft models, lineage tracing systems, and aging/interventional biology. The best agents reach a success rate of 11.1%.

Read the manuscript.
View the leaderboard.
The experiments reflect the kinds of studies that scientists run in practice.
A single task may depend on localization, histology, single-cell references, and array recording data. Solving these tasks requires inverse reasoning, awareness of experimental design, and command of local workflows such as tissue classification, niche analysis, and characterization of local differences.

The benchmark examines the transition from performing data analysis to doing science.
Establishing ground truth is very difficult for long-horizon biological benchmarks. The same data can support many valid conclusions, and some published claims do not reproduce cleanly under unbiased reanalysis.

Candidate tasks are hardened through independent reproductions, random expert reviews, and runs from multiple model families.
Grading uses decision tasks over structured final answers. We measure how well agents recover scientific conclusions expressed in controlled biological terms, rather than numbers from individual mathematical operations.
Across all 15 model/harness pairs and 1,080 trajectories, Gemini 3.5 Flash / Pi, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex each succeeded in 8/72 attempts (11.11%), and Claude Opus 4.6 / Claude Codex followed closely with 7/72 attempts (9.72%).
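The success rates above follow directly from the reported counts. The short sketch below reproduces that arithmetic. The three-attempts-per-experiment figure is inferred from 1,080 trajectories, 15 pairs, and 24 experiments; it is not stated in the text. The function name `success_rate` is chosen here for illustration.

```python
# Counts quoted in the text.
N_EXPERIMENTS = 24
N_PAIRS = 15
N_TRAJECTORIES = 1080

# Derived quantities (inferred, not stated in the source).
attempts_per_pair = N_TRAJECTORIES // N_PAIRS                    # 72
attempts_per_experiment = attempts_per_pair // N_EXPERIMENTS     # 3


def success_rate(successes: int, attempts: int) -> float:
    """Return the success rate as a percentage."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


if __name__ == "__main__":
    print(f"attempts per pair: {attempts_per_pair}")
    print(f"8/72 = {success_rate(8, attempts_per_pair):.2f}%")
    print(f"7/72 = {success_rate(7, attempts_per_pair):.2f}%")
```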

However, grading only the final response provides a sparse diagnostic signal for long-horizon tasks. Chokepoint rubrics, which capture analytical decisions expected to remain stable across all plausible solution paths, are developed by judges as a companion diagnostic system.

We evaluated rubric grades against mean reward and asked whether this denser signal correlated with endpoint quality. We conclude that rubric scores are promising adjunctive tools, not substitutes for endpoint measurement.
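One way to ask whether rubric scores track endpoint quality is a rank correlation between per-configuration rubric grades and mean reward. The sketch below implements Spearman correlation with the standard library only. The text does not say which statistic the authors used, so this is an assumed choice. The example values are placeholders, not benchmark results.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Placeholder values for illustration only (hypothetical).
    rubric_scores = [0.42, 0.35, 0.51, 0.28]
    mean_rewards = [0.10, 0.06, 0.11, 0.03]
    print(f"Spearman rho = {spearman(rubric_scores, mean_rewards):.3f}")
```

A rank-based measure is used here because it only asks whether the two scores order configurations the same way, which matches the question of whether rubric grades can serve as an adjunct to endpoint measurement.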

We also broke down the rubric scores by the source agent model whose trajectory was judged, to see whether rubric patterns were consistent across judges. All four judges maintained the same broad ordering, with some variation.
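The breakdown described above can be sketched as a group-by over (judge, source agent) pairs, followed by a per-judge ranking of agents. The record layout, judge names, and agent names below are hypothetical, and the scores are placeholders, not benchmark data.

```python
from collections import defaultdict


def mean_scores_by_judge(records):
    """Average rubric scores per (judge, agent).

    records: iterable of (judge, agent, score) tuples (assumed layout).
    Returns {judge: {agent: mean_score}}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for judge, agent, score in records:
        cell = sums[(judge, agent)]
        cell[0] += score
        cell[1] += 1
    table = defaultdict(dict)
    for (judge, agent), (total, count) in sums.items():
        table[judge][agent] = total / count
    return dict(table)


def ordering_by_judge(table):
    """For each judge, list agents from highest to lowest mean score."""
    return {
        judge: sorted(scores, key=scores.get, reverse=True)
        for judge, scores in table.items()
    }


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    records = [
        ("judge_a", "agent_x", 0.6),
        ("judge_a", "agent_x", 0.4),
        ("judge_a", "agent_y", 0.2),
        ("judge_b", "agent_x", 0.7),
        ("judge_b", "agent_y", 0.3),
    ]
    orderings = ordering_by_judge(mean_scores_by_judge(records))
    for judge, agents in orderings.items():
        print(judge, agents)
    # Judges agree on the broad ordering if all lists are identical.
    print("consistent:", len({tuple(v) for v in orderings.values()}) == 1)
```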

Pairing manual trajectory review with rubric scores and verifiable scores provides additional tools for interpreting model failures.

In practice, manual trajectory analysis is the first tool for understanding this data. Eval authors keep production notes as a record for future benchmark updates, especially since strong models may solve tasks in unexpected but valid analytical ways that challenge current grading assumptions.
The results suggest that compounding spatial analysis errors prevent agents from reliably reaching long-horizon scientific conclusions.
Before models can reliably reason about disease mechanisms, drug responses, or other higher-order biological effects, they must be procedurally competent at local measurements.
Still, the few successful completions we saw were very impressive. There appears to be a realistic path for agents to think and behave more like scientists.



