SpatialBench Human Validation – by Kenny Workman

SpatialBench measures agent performance on real-world biological analysis tasks. The benchmark has 159 problems covering 5 types of spatial technologies, and focuses on practical spatial analysis rather than long-horizon scientific tasks.

All SpatialBench grading is deterministic, and answers are evaluated quantitatively against ground truths developed through careful analysis. However, we later found problems with some evals.

Some tasks depend on analytical decisions that are not specified in the prompt. In these cases, missing or unclear task context makes it hard to tell whether an incorrect answer reflects a lack of skill or a poor instruction.

Some tasks have numerical answers with tolerance bounds meant to accommodate the range of valid analysis methods. Sometimes this tolerance was miscalibrated, e.g. set too narrow, so that it rejected valid analysis methods the domain expert did not consider when writing the problem.

To be confident that SpatialBench tasks are sufficiently specified, we need evidence that each expected answer can be reproduced by an independent expert.

We created a human-validated subset of SpatialBench, called SpatialBench Verified, that contains 115 problems. For each problem in SpatialBench Verified, at least one human expert independently reproduced an answer that passes the task's grader, working only from the task prompt and associated data.

We found the filtered subset maintained the relative order of model performance, but scores increased by 11.6 pp on average.

The task sa_02_microglia_oligo_inflammation_correlation asked agents to split a mixed gene list into microglial activation and oligodendrocyte inflammatory signatures, score cells, find the oligodendrocytes neighboring each corpus callosum microglia using an “ideal area”, and compute the Spearman correlation at 90wk and 4wk.

Given only the task and no solutions, the human reviewer returned:

The current grader expected:

The reviewer actually agreed with the intended qualitative conclusion (‘2’): an age-related spatial association between activated microglia and inflamed oligodendrocytes. But their notes showed that the expected numerical values could not be recovered from the task as specified.

Why?

Looking at the original problem statement, the choices left open include:

  • the method for splitting a mixed gene list into two signatures

  • whether inflammatory genes can appear in both signatures

  • how to handle a MERFISH panel designed for shallow depth

  • what radius counts as a “neighbor”

  • grouping components or control within a category

  • whether the given 50k subset has enough power to estimate the correlation at 4wk

A variety of defensible analysis methods were therefore possible given the context of the task, which makes it difficult to trust the eval result.
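To make the effect of one such open choice concrete, the sketch below varies only the neighbor radius on a tiny set of made-up cells. The function name `neighbor_correlation`, the coordinates, and the scores are all hypothetical and are not taken from SpatialBench or its reference solution; dropping microglia that have no neighbors is a further assumption.

```python
import math

def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def neighbor_correlation(microglia, oligos, radius):
    """Correlate each microglia score with the mean score of nearby oligos.

    microglia, oligos: lists of (x, y, score) tuples.
    Returns (rho, number of microglia that had at least one neighbor).
    """
    mg_scores, neighbor_means = [], []
    for mx, my, score in microglia:
        near = [s for ox, oy, s in oligos
                if math.hypot(ox - mx, oy - my) <= radius]
        if near:  # assumption: microglia without neighbors are dropped
            mg_scores.append(score)
            neighbor_means.append(sum(near) / len(near))
    return spearman(mg_scores, neighbor_means), len(mg_scores)

# Made-up cells for illustration only.
microglia = [(0, 0, 0.1), (10, 0, 0.5), (20, 0, 0.9), (30, 0, 0.3)]
oligos = [(1, 0, 0.2), (11, 0, 0.4), (21, 0, 0.8), (31, 0, 0.6),
          (5, 0, 0.9), (25, 0, 0.1)]

rho_small, n_small = neighbor_correlation(microglia, oligos, radius=2)
rho_large, n_large = neighbor_correlation(microglia, oligos, radius=6)
print(round(rho_small, 3), round(rho_large, 3))
```

On this toy data the same cells give a clearly positive correlation at the smaller radius and roughly zero at the larger one, so a grader that expects a single number implicitly fixes a radius the prompt never stated.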

The task visium_bone_norm_within_niche_coexpression asked agents to define Visium’s bone-enriched areas as the top quartile of COL1A1 (a collagen marker gene) counts, then compute the Pearson correlation between COL1A2 and SPARC within those areas. The intended biological conclusion was that COL1A2 and SPARC are strongly co-expressed at the bone surface, consistent with a coordinated osteoblast matrix-secretion program.
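A minimal sketch of the computation described above, using only the Python standard library. The counts are made up, the helper name `within_niche_correlation` is hypothetical, and the quantile method and the `>=` comparison are assumptions that the task description does not settle.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def within_niche_correlation(marker, gene_a, gene_b):
    """Correlate gene_a and gene_b over spots in the top quartile of marker."""
    # Assumption: 'inclusive' quantile interpolation and a >= cutoff.
    threshold = statistics.quantiles(marker, n=4, method="inclusive")[2]
    kept = [i for i, value in enumerate(marker) if value >= threshold]
    r = pearson([gene_a[i] for i in kept], [gene_b[i] for i in kept])
    return r, kept

# Made-up counts for 12 spots (not SpatialBench data).
col1a1 = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
col1a2 = [1, 0, 2, 2, 4, 6, 9, 15, 25, 40, 70, 100]
sparc = [3, 1, 0, 2, 2, 5, 4, 9, 12, 20, 30, 50]

r, kept = within_niche_correlation(col1a1, col1a2, sparc)
print(len(kept), round(r, 3))
```

Even in this small form, the result depends on choices such as raw versus normalized counts and how the quartile boundary is computed.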

Again, given only the task and no solutions, a human reviewer reproduced the intended biological conclusion and reported:

The grader expected:

The reviewer noted that the correlation values depend heavily on normalization, even though the biological interpretation stays stable. This is a clear example of a numerical tolerance that was too narrow for the goal of the eval.
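One way a correlation can shift while the qualitative reading stays the same is the transformation applied to the counts before correlating. The sketch below compares Pearson correlation on raw and log1p-transformed values for two made-up genes; the data and the choice of log1p are illustrative assumptions, not the reviewer's actual analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up counts with one high-expression spot (illustration only).
gene_a = [1, 2, 4, 8, 16, 32, 64, 500]
gene_b = [2, 3, 3, 10, 12, 40, 50, 300]

r_raw = pearson(gene_a, gene_b)
r_log = pearson([math.log1p(v) for v in gene_a],
                [math.log1p(v) for v in gene_b])

# Both values indicate strong co-expression, but they are not identical,
# so a tight tolerance around either one would reject the other.
print(round(r_raw, 3), round(r_log, 3))
```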

We conducted a first review round in which 6 domain experts solved all 159 problems in SpatialBench, an average of 26 problems per expert. The experts were given the task prompt and associated data, and were asked to submit their answer in the same format required of the agent.

For each problem, experts were asked to share the final answer they arrived at along with a Python Jupyter notebook or an R Markdown document using whatever bioinformatics libraries they thought appropriate. All answers were graded using the existing benchmark grader, with binary passing scores (all answer fields must be correct).
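The binary, all-fields-must-pass rule can be sketched as follows. This is not the SpatialBench grader: the function names, the field names, the absolute-tolerance comparison, and every value shown are assumptions chosen for illustration.

```python
import math

def field_passes(expected, answer, tolerance=0.0):
    """Check one field; numeric fields pass within an absolute tolerance."""
    is_number = lambda v: isinstance(v, (int, float)) and not isinstance(v, bool)
    if is_number(expected):
        if not is_number(answer) or not math.isfinite(answer):
            return False
        return abs(answer - expected) <= tolerance
    return answer == expected

def grade(expected, answer, tolerances=None):
    """Binary grade: every expected field must be present and must pass."""
    tolerances = tolerances or {}
    return all(
        key in answer
        and field_passes(value, answer[key], tolerances.get(key, 0.0))
        for key, value in expected.items()
    )

# Hypothetical ground truth and per-field tolerances.
expected = {"direction": "increase", "rho_old": 0.40}
tolerances = {"rho_old": 0.05}

print(grade(expected, {"direction": "increase", "rho_old": 0.43}, tolerances))  # True
print(grade(expected, {"direction": "increase", "rho_old": 0.50}, tolerances))  # False
```

Under a rule like this, a submission with the right qualitative answer still fails if a single numeric field falls outside its tolerance, which is why the width of each tolerance matters so much.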

94/159 (59.1%) of problems passed the first round of review. When we reviewed the failures, we found that they fell into several categories:

  1. A low-quality solution attempt

  2. A genuine attempt with a flawed analysis

  3. A clear mismatch between the expert’s solution and the ground truth

  4. An ambiguous or underspecified eval

The first category of failure was largely due to time constraints: over one week, experts were asked to solve an average of 5-6 problems per day, which appears to have been unrealistic. To mitigate this, we conducted a second round of review.

In the second round of review we included all 65 problems that failed in the first round, plus 10 passing problems to serve as controls. The control problems were drawn at random from all 5 types of spatial technologies used in SpatialBench (2 per technology). Problems were distributed across a larger pool of 28 experts (about 3 problems per expert). Experts were given 2 days to produce solution artifacts for their assigned tasks.
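Drawing 2 passing problems per technology amounts to a stratified random sample. A minimal sketch, assuming hypothetical technology labels and problem IDs (the real ones are not listed here) and a fixed seed for reproducibility:

```python
import random

def pick_controls(passing_by_tech, per_tech=2, seed=0):
    """Randomly pick per_tech passing problems from each technology."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    controls = []
    for tech in sorted(passing_by_tech):
        controls.extend(rng.sample(passing_by_tech[tech], per_tech))
    return controls

# Hypothetical pools of round-1 passing problems, 6 per technology.
passing_by_tech = {
    f"tech_{t}": [f"tech_{t}_problem_{i}" for i in range(6)]
    for t in range(1, 6)
}

controls = pick_controls(passing_by_tech)
print(len(controls))  # 10
```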

Second-round experts again worked only from the task prompt and associated data. Each problem was assigned to a different expert than in the first round.

Passing solutions were produced in Round 2 for 21/65 of the failing problems from Round 1. From the control set, passing solutions were produced for 8/10 problems in Round 2.

We took the union of the problems that passed the Round 1 review (94/159) and the additional 21 problems that passed the Round 2 review, yielding 94 + 21 = 115 problems in SpatialBench Verified.
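The counts reported above can be cross-checked with a few lines of arithmetic. The variable names are ours; the numbers come from the text, and the percentages are derived from them.

```python
total = 159
round1_pass = 94
round1_fail = total - round1_pass        # problems sent to round 2
round2_recovered = 21                    # round-1 failures that passed in round 2
controls_pass, controls_total = 8, 10

verified = round1_pass + round2_recovered

assert round1_fail == 65
assert verified == 115

print(f"round 1 pass rate:      {round1_pass / total:.1%}")                # 59.1%
print(f"round 2 recovery rate:  {round2_recovered / round1_fail:.1%}")     # 32.3%
print(f"control pass rate:      {controls_pass / controls_total:.1%}")     # 80.0%
print(f"verified share:         {verified / total:.1%}")                   # 72.3%
```

The control pass rate is worth reading alongside the recovery rate: 2 of the 10 problems that passed in round 1 did not pass again with a different expert.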

Anirudh Narsipur, Deborah Hayoun, Benjamin Kesler, Zachary Hemminger, Sahar Nasr, Aashka Bhowmick, Sahiti Marella, Zhen Yang, Shon George, Soo Hee Lee, Qian Xu, Lior Schachaf, Harihara Muralidharan, David Calcagno, Birendra U-Lannah Satiarka, Birendra U-Latiarka, Birendra Landhannah Le, Kenny Workman
