
New Frontier Models Are Faster, Not More Reliable, in Spatial Biology

New frontier models are faster on SpatialBench, but no more accurate.

GPT-5.5 runs in almost half the time of GPT-5.4, but its accuracy is essentially unchanged: 57.6% compared to 57.4%. Opus 4.7 is effectively tied with Opus 4.6: 52.4% versus 52.8%.

Scientist-reviewed trajectories reveal gaps in the biological judgment the benchmark tests: statistical design, spatial units, batch structure, and scientific interpretation.

Complete benchmark data and selected trajectories are available at benchmarks.bio.

Spatial biology is a powerful measurement technology and an important category of agent capability. Analytical workflows require a combination of coding and biological reasoning: agents must handle large datasets, understand domain-specific context, align with scientific objectives, and return quantitative results similar to what a diligent scientist would produce.

SpatialBench measures this capability: 159 tasks for spatial biology analysis across platforms such as Xenium, Visium FFPE, MERFISH, TakaraBio Seeker, and AtlasXomics DBiT-seq. Each task starts from a real analysis scenario and asks the agent to return a specific biological result. The grader evaluates the returned results against expert-derived references (a subset of the examples is publicly available).

Although the frontier models show improvements in speed and number of steps, they do not improve in overall accuracy on this benchmark.

GPT-5.5 is much faster than GPT-5.4, cutting the runtime almost in half and using far fewer steps. But its accuracy is essentially unchanged: 57.65% compared to 57.44%. Opus 4.7 is likewise effectively tied with Opus 4.6: 52.41% versus 52.83%.

A platform-level breakdown shows that GPT-5.5 improves on Visium, Xenium, and MERFISH, but trails GPT-5.4 on TakaraBio and AtlasXomics.

Similarly, Opus 4.7 leads Opus 4.6 on Xenium by 11.1 percentage points, ties on TakaraBio, and trails on Visium, MERFISH, and AtlasXomics.

Trajectory reviews identify recurring failure categories for all model families:

  • Treating cells, beads, spots, or barcodes as independent observations when the biological replicate is a donor, animal, tissue section, or time point

  • Automatically applying scRNA-seq normalization to spatial platforms where it is not appropriate

  • Combining data from multiple samples without integration, then misinterpreting the donor or time point structure

  • Confusing spatial measurement units with cells or anatomical structures

  • Failing to recover the correct de novo niches, tissues, and temporal compartments

We’ll dive into some examples to get a feel for the failure modes in the context of real-world tasks. Each trajectory was reviewed by a scientist with years of experience in the specific area being tested.

The AtlasXomics SPATIAL10_genome_wide_de_pct task asks the model to test 24,919 genes for sex differences in a human root DBiT-seq dataset. The dataset contains approximately 10,000 spatial barcodes from 8 donors: 3 female and 5 male.

Barcodes are nested within donors. A researcher would pseudobulk at the donor level, which yields about 1.2% of genes called differential. Agents often ignore this donor structure.
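The donor-level approach can be sketched as follows. This is a minimal illustration, not the benchmark's reference method: the helper names, the toy values, and the choice of an exact permutation test are all assumptions made for the example. Each donor contributes one aggregated value, and the test compares 3 donors against 5.

```python
from itertools import combinations
from statistics import mean


def pseudobulk_by_donor(values, donors):
    """Collapse per-barcode values to one mean per donor (hypothetical helper)."""
    per_donor = {}
    for value, donor in zip(values, donors):
        per_donor.setdefault(donor, []).append(value)
    return {donor: mean(vals) for donor, vals in per_donor.items()}


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    hits = total = 0
    # Enumerate every way to relabel the donors while keeping group sizes fixed.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [v for i, v in enumerate(pooled) if i not in chosen]
        # Small tolerance so the observed labelling always counts as a hit.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Made-up toy data: 8 donors (3 female F*, 5 male M*), 2 barcodes each.
donors = ["F1", "F1", "F2", "F2", "F3", "F3",
          "M1", "M1", "M2", "M2", "M3", "M3", "M4", "M4", "M5", "M5"]
values = [1.0, 1.2, 0.8, 1.0, 1.1, 0.9,
          1.0, 0.8, 1.2, 1.0, 0.9, 1.1, 1.0, 1.0, 0.9, 1.1]

donor_means = pseudobulk_by_donor(values, donors)
female = [m for d, m in donor_means.items() if d.startswith("F")]
male = [m for d, m in donor_means.items() if d.startswith("M")]
p_value = exact_permutation_pvalue(female, male)
print(len(donor_means), p_value)
```

With 3 versus 5 donors there are only 56 possible relabellings, so a donor-level p-value from this test cannot fall below 1/56 (about 0.018). That ceiling on significance reflects how little replication the design actually has.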

GPT-5.4 and GPT-5.5 both report 93.876% across six runs. Both Opus 4.7 and Opus 4.6 report about 92-94% of all genes as significantly different between sexes. Note that this interpretation is not biologically plausible: sex cannot plausibly change chromatin accessibility at 93% of all genes across 8 donors.

The SPATIAL07_sex_housekeeping_de task asks whether 10 housekeeping genes show sex differences in the same spatial ATAC-seq design. The expected answer is none.

Trajectory review shows models calling 9-10 of the housekeeping genes significant. Both Opus 4.7 and Opus 4.6 call all 10 housekeeping genes sex-different in every run. GPT-5.4 and GPT-5.5 call 9-10 housekeeping genes significant. Calling ACTB, GAPDH, and other housekeeping genes sex-differential is a clear sign of pseudoreplication: models treat thousands of barcodes as independent replicates, artificially inflating statistical power and significance.
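One way to see how much power pseudoreplication manufactures is the standard design-effect formula for clustered data, deff = 1 + (m - 1) * ICC, where m is the number of observations per cluster. The sketch below assumes equally sized donors and an intraclass correlation of 0.05; both are illustrative assumptions, not values reported by the benchmark, and the function name is hypothetical.

```python
def effective_sample_size(n_obs, n_clusters, icc):
    """Effective sample size for equally sized clusters.

    Uses deff = 1 + (m - 1) * icc with m = n_obs / n_clusters.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    cluster_size = n_obs / n_clusters
    design_effect = 1.0 + (cluster_size - 1.0) * icc
    return n_obs / design_effect


# Roughly 10,000 barcodes from 8 donors; icc=0.05 is an assumed value.
n_eff = effective_sample_size(10_000, 8, icc=0.05)
print(round(n_eff, 1))

# Limiting cases: fully independent barcodes vs. barcodes identical within a donor.
print(effective_sample_size(10_000, 8, icc=0.0))
print(effective_sample_size(10_000, 8, icc=1.0))
```

Under these assumptions, even a modest within-donor correlation shrinks 10,000 barcodes to the equivalent of a few hundred independent observations, and at the extreme the effective sample size collapses to the number of donors.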

The MERFISH norm_02_myelin_gene_coexpression_normalization task requests a Spearman correlation between Mbp and Plp1 in oligodendrocytes. These are myelin structural genes and should be positively co-expressed. The expected value is approximately 0.308.

GPT-5.5 fails all three runs, consistently producing -0.157. Human review shows that GPT-5.5 applies a normalization step that substantially distorts the targeted-panel values, turning a positive biological correlation into an apparent anti-correlation. GPT-5.4 passes 2 out of 3 runs with a correlation close to 0.326 by avoiding that particular step.

Both Claude models reported a Spearman correlation of approximately -0.16 between Mbp and Plp1 in all runs. In this case, the negative correlation is an artifact of library-size normalization on the 374-gene targeted panel, where a few myelin genes dominate the total counts. The models apply the standard scRNA-seq defaults:

sc.pp.normalize_total(adata, target_sum=1e4)

sc.pp.log1p(adata)

to the targeted MERFISH panel, instead of handling the platform appropriately.
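The sign flip can be reproduced on a toy panel. The sketch below is a pure-Python stand-in for total-count scaling followed by log1p, not a call into scanpy, and the counts are invented: two dominant genes rise together across cells while a third column holds a small constant background. After each cell is scaled to the same total, the two dominant genes become competing fractions of that total.

```python
import math
from statistics import mean


def average_ranks(xs):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def normalize_total_log1p(rows, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log1p."""
    out = []
    for row in rows:
        total = sum(row)
        out.append([math.log1p(v / total * target_sum) for v in row])
    return out


# Invented counts per cell: [gene_a, gene_b, background].
counts = [[10, 12, 2], [20, 18, 2], [30, 33, 2],
          [40, 37, 2], [50, 54, 2], [60, 56, 2]]

raw_rho = spearman([r[0] for r in counts], [r[1] for r in counts])
norm = normalize_total_log1p(counts)
norm_rho = spearman([r[0] for r in norm], [r[1] for r in norm])
print(round(raw_rho, 3), round(norm_rho, 3))
```

In this toy case the raw counts are perfectly rank-correlated, while the normalized values are negatively correlated. The magnitude is specific to the invented data; the point is only that when two genes make up most of a cell's total, dividing by that total can induce anti-correlation between them.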

The batch_driven_clustering (TakaraBio) and NORM01_batch_correction (AtlasXomics) tasks check whether models integrate across donors or time points before interpreting clusters.

Both GPT-5.4 and GPT-5.5 fail TakaraBio batch_driven_clustering. The expected maximum proportion for a single time point is approximately 0.375; GPT-5.4 reports 0.967, 0.995, and 0.995, while GPT-5.5 reports 0.990, 0.994, and 0.988. Each cluster is dominated by a single time point, meaning the clustering tracks time point or batch rather than cell type.

A similar issue arises on AtlasXomics. In NORM01_batch_correction, the expected mean of the maximum sample proportion is 0.375, but the GPT models consistently report values of 0.866-0.897. Opus 4.7 trajectories independently show the same AtlasXomics failure: without integration, PCA captures between-sample technical variation, Leiden partitions cells by sample of origin, and no model questions why clusters in a multi-donor dataset are dominated by single donors.
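A quick diagnostic for this failure is to ask, for each cluster, what fraction of its members come from the single most common sample. The sketch below is one plausible reading of that metric; the exact definition used by the benchmark grader is not given here, so the function name and the averaging step are assumptions, and the labels are made up.

```python
from collections import Counter, defaultdict


def max_sample_fraction_per_cluster(clusters, samples):
    """For each cluster, the share of members from its most common sample."""
    by_cluster = defaultdict(Counter)
    for cluster, sample in zip(clusters, samples):
        by_cluster[cluster][sample] += 1
    return {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in by_cluster.items()
    }


# Made-up labels: cluster A is mostly sample s1, cluster B is more mixed.
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]
samples = ["s1", "s1", "s1", "s2", "s1", "s2", "s3", "s3"]

fractions = max_sample_fraction_per_cluster(clusters, samples)
mean_fraction = sum(fractions.values()) / len(fractions)
print(fractions, mean_fraction)
```

Values near 1.0 mean that cluster membership is almost fully determined by sample of origin, which is a cue to check integration before assigning biological labels to the clusters.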

TakaraBio Seeker uses 10 µm beads. A single large oocyte can span many beads, and its RNA is spread across all of them. Counting marker-positive beads as cells or anatomical structures inflates biological counts.
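The difference between counting beads and counting structures can be illustrated by merging nearby marker-positive beads into connected groups. This is a simple single-linkage sketch with invented coordinates and a hypothetical link_radius parameter, not the method used by the benchmark or by any of the models.

```python
import math


def count_structures(bead_xy, link_radius):
    """Count connected groups of beads whose centers lie within link_radius."""
    n = len(bead_xy)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); fine for a toy, too slow for full arrays.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(bead_xy[i], bead_xy[j]) <= link_radius:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


# Invented marker-positive bead centers in micrometers: two separate objects.
beads = [(0, 0), (10, 0), (20, 0), (100, 100), (110, 100)]

n_beads = len(beads)
n_structures = count_structures(beads, link_radius=15)
print(n_beads, n_structures)

# The count depends strongly on the radius chosen.
print(count_structures(beads, link_radius=5), count_structures(beads, link_radius=200))
```

Here five positive beads correspond to two objects at a 15 µm radius, but to five objects at 5 µm and one at 200 µm. Any radius-based grouping needs to be justified against the known size of the structure being counted.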

The oocyte_count_per_timepoint task checks this. The expected number of immature oocytes is 850. GPT-5.4 reports 1237-2086, and GPT-5.5 reports 1510-3463. At the 0h time point, the expected value is 275, while the models report 424-821.

The same review highlights cumulus_gc_count_immature: the number of cumulus granulosa cells expected in immature samples is 0 because the cumulus cells have not yet differentiated. GPT-5.4 reports 435-1474, and GPT-5.5 reports 1424-2395. Both models assign cumulus identity from marker expression without applying the developmental constraint that cumulus differentiation requires hCG stimulation.

The Opus 4.7 review describes the same category of failure in terms of spatial segmentation. On follicle_count_immature, Opus 4.7 counts range from 50 to 456 across runs because small changes in the DBSCAN radius produce very differently connected segments, while Opus 4.6’s robust oocyte scores provide a clean input set for clustering.

The Xenium spatial_fibro_inflammatory_niche_emergence_2 task asks the agent to reconstruct the fibro-inflammatory niche across the time points of kidney injury. The expected pattern is low co-localization in sham, high co-localization at Day14, and a fold increase of 6.9.

GPT-5.4 reports a sham co-localization of 0.112-0.124 and a fold increase of 2.26-2.32; GPT-5.5 reports a sham co-localization of 0.121-0.373 and a fold increase of 1.99-4.23. The expected sham value is 0.033, the expected Day14 value is 0.23, and the expected fold increase is 6.9.
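The fold increase is simply the Day14 value divided by the sham value, so an inflated sham baseline compresses it even when the Day14 estimate is reasonable. The short sketch below uses the expected values quoted above; the second call is a hypothetical scenario that pairs the expected Day14 value with GPT-5.4's lowest reported sham value, and the function name is an assumption.

```python
def fold_increase(baseline, later):
    """Ratio of a later co-localization score to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return later / baseline


# Expected values from the task: sham 0.033, Day14 0.23.
expected_fold = fold_increase(0.033, 0.23)

# Hypothetical: same Day14 value, but an inflated sham baseline of 0.112.
inflated_baseline_fold = fold_increase(0.112, 0.23)

print(round(expected_fold, 2), round(inflated_baseline_fold, 2))
```

The first ratio comes out close to the expected 6.9 (the quoted inputs are rounded), while the second lands near 2, in the range the GPT models report. This suggests that much of the error may sit in the sham baseline, although the reported ranges alone do not establish that.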

Opus 4.6 and Opus 4.7 detect niches containing fibroblasts and immune cells, but do not enforce the specific compositional criteria that distinguish a pathological niche such as CN7 from the adjacent healthy stroma. They confuse generic fibroblast-immune proximity with the disease-organized fibro-inflammatory compartment.

Replicate accounting: Models often do not understand how to define replicates. When donor, animal, section, or time point is the appropriate unit for comparison, models incorrectly treat cells, beads, and barcodes as independent observations, inflating significance and creating false positives.

Platform-aware normalization: Models treat spatial assays as interchangeable and often reach for scRNA-seq defaults. The same normalization recipe can overcorrect MERFISH/Xenium targeted panels, confound Visium FFPE signal with sequencing depth, distort bead capture signal, or discard marker features during HVG selection.

Batch integration and donors: Models often cluster before asking what the axes of variation represent. For multi-donor or multi-time-point data, an uncorrected PCA/Leiden analysis typically separates samples, batches, or time points; the model then mistakes these for cell types, tissue states, or disease biology.

Spatial unit and denominator errors: Models confuse measurement units such as beads and spots with cells or other structures.

De novo spatial niche discovery: Models can run niche analysis tools but often miss the biological target. They confuse generic proximity or broad regional enrichment with the specific compartment the task asks for: follicle, lineage, pathological niche, or disease-organized tissue state.

Although GPT-5.5 and Opus 4.7 are faster or more robust on certain task families, they are not more reliable overall on SpatialBench.

Future progress in spatial biology will probably not come from general reasoning gains alone. It will require specific training in statistical design, platform-specific analytical steps, replicate structure, and other spatial biology knowledge. Focused, assay-specific benchmarks are needed to measure the complexity of biological data analysis accurately.
