Can AI Agents Analyze Real-World Cell Data?
Biology agents are still stuck in the gap between demo and deployment. With SpatialBench, we showed that frontier models struggle to extract biological insight from spatial transcriptomics data. But spatial is only part of the story. Single-cell RNA sequencing is the workhorse assay of modern biology and the most widely adopted, with more public data, more mature tools, and more literature for models to train on. If agents cannot handle scRNA-seq, reliable agentic data analysis is far away.
Here we introduce scBench (read the paper), a benchmark of 394 verifiable problems derived from real scRNA-seq workflows. It spans six sequencing platforms, seven task categories, and eight frontier models. The best model achieves an accuracy of 52.8%. That is better than on spatial data, but it means that the best agent in the world still fails about half the time on standard analysis tasks.
Existing biology benchmarks reward textbook memorization or literature-style reasoning. Actual analysis requires loading raw datasets, writing code, making judgment calls about parameters, and producing concrete quantitative results. scBench recreates these conditions.
The benchmark includes 394 problems across six sequencing platforms (Chromium, BD Rhapsody, CSGenetics, Illumina, MissionBio, ParseBio) and seven task categories (QC, normalization, dimensionality reduction, clustering, cell typing, differential expression, and trajectory analysis). Each problem provides a snapshot of the data (usually an AnnData `.h5ad` file captured just before the decision point), a natural language task prompt, and a deterministic grader that marks the agent's structured JSON output as pass or fail.
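A pass/fail grader of this kind could look roughly like the sketch below. It is an illustration only, not scBench's actual grader: the field names (`n_cells_after_qc`, `top_marker`) and the 5% relative tolerance are assumptions made up for the example.

```python
import json
import math


def grade(agent_output: str, expected: dict, rel_tol: float = 0.05) -> bool:
    """Return True (pass) or False (fail) for an agent's JSON answer.

    Hypothetical rule: numeric fields must fall within a relative tolerance
    of the reference value; every other field must match exactly.
    """
    try:
        answer = json.loads(agent_output)
    except json.JSONDecodeError:
        return False  # malformed JSON is an automatic fail
    if not isinstance(answer, dict):
        return False

    numeric = (int, float)
    for key, want in expected.items():
        if key not in answer:
            return False
        got = answer[key]
        if isinstance(want, numeric) and not isinstance(want, bool):
            if not isinstance(got, numeric) or isinstance(got, bool):
                return False
            if not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif got != want:
            return False
    return True


# Hypothetical reference answer for a single problem.
expected = {"n_cells_after_qc": 8000, "top_marker": "CD3E"}

print(grade('{"n_cells_after_qc": 8100, "top_marker": "CD3E"}', expected))  # True
print(grade('{"n_cells_after_qc": 6500, "top_marker": "CD3E"}', expected))  # False
print(grade("not json", expected))                                          # False
```

Because the check is a pure function of the agent's output and a fixed reference, the same answer always receives the same grade, which is what makes results comparable across models.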

We stress-tested each problem for shortcuts: removing precomputed embeddings, stripping cached labels, and ensuring that answers cannot be guessed from prior knowledge alone. Cell typing (118 problems, 30%) and differential expression (71, 18%) dominate the benchmark because these are the categories where dataset-specific judgment matters most and where agents struggle.
Across eight frontier models from four providers, accuracy ranged from 29% to 53%.

Claude Opus 4.6 leads at 52.8%, followed by Opus 4.5 (49.9%), GPT-5.2 (45.2%), and Sonnet 4.5 (44.2%). The bottom tier (GPT-5.1 at 37.9%, Grok-4.1 at 35.6%, Grok-4 at 33.9%, and Gemini 2.5 Pro at 29.2%) trails by a wide margin. The 23.6 percentage point spread between best and worst exceeds SpatialBench's 18.3 point spread, which means that scBench discriminates model capability even at higher absolute accuracy.
The higher accuracy relative to SpatialBench likely reflects training data. scRNA-seq has the most public datasets, and Scanpy dominates the ecosystem with extensive documentation and tutorials.
Not all analysis steps are created equal. Normalization is the easiest (cross-model mean 70.4%), followed by QC (55.3%). These tasks are procedural: apply a known transformation, check a metric against a threshold. Agents handle them reasonably well.
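To make "procedural" concrete, here is a toy sketch of both steps in plain Python. It is not taken from the benchmark: the count matrix, the 20% mitochondrial cutoff, and the 10,000-count target are illustrative assumptions. In practice this would typically be done with Scanpy's `sc.pp.normalize_total` and `sc.pp.log1p` on an AnnData object.

```python
import math

# Toy count matrix (cells x genes); all values are made up for illustration.
genes = ["MT-CO1", "CD3E", "MS4A1", "ACTB"]
counts = [
    [5, 40, 0, 55],   # 5% mitochondrial reads
    [60, 10, 5, 25],  # 60% mitochondrial reads, likely a damaged cell
    [2, 0, 30, 68],   # 2% mitochondrial reads
]


def mito_fraction(row, genes):
    """Fraction of a cell's counts that come from 'MT-' prefixed genes."""
    total = sum(row)
    mito = sum(c for c, g in zip(row, genes) if g.startswith("MT-"))
    return mito / total if total else 0.0


def qc_filter(counts, genes, max_mito=0.2):
    """QC: keep cells whose mitochondrial fraction is at or below the cutoff."""
    return [row for row in counts if mito_fraction(row, genes) <= max_mito]


def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell to the same total count, then apply log(1 + x)."""
    out = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total else 0.0
        out.append([math.log1p(c * scale) for c in row])
    return out


kept = qc_filter(counts, genes)
normalized = normalize_log1p(kept)
print(len(kept))  # 2 cells survive the mitochondrial filter
```

The recipe itself is mechanical; the dataset-specific part is choosing the cutoff, which is where judgment starts to enter even in these easier categories.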

The story changes for judgment-heavy tasks. Clustering drops to 38.3%, cell typing to 34.9%, and differential expression to 27.0%. Seven of the eight models follow this difficulty ordering. DE is also where the models diverge the most: 27.7 points separate the best from the worst.
The pattern is clear. Tasks that require scientific reasoning (selecting genes, interpreting cluster identity, choosing statistical tests, identifying tissue-specific signatures) are where agents break down. The ability to write general-purpose code is necessary but not sufficient.
This is perhaps the most striking finding. Cross-model mean accuracy ranges from 59.1% on CSGenetics to 26.4% on MissionBio, a gap of 32.7 points that exceeds the 23.6 point spread between the best and worst models. Platform matters more than which model you use.
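The comparison can be reproduced from the figures quoted in this post. The snippet below uses only those reported numbers; the variable names are illustrative, and only the two platform means stated here are included.

```python
# Overall accuracy per model, as reported in this post (percent).
model_accuracy = {
    "Claude Opus 4.6": 52.8,
    "Claude Opus 4.5": 49.9,
    "GPT-5.2": 45.2,
    "Claude Sonnet 4.5": 44.2,
    "GPT-5.1": 37.9,
    "Grok-4.1": 35.6,
    "Grok-4": 33.9,
    "Gemini 2.5 Pro": 29.2,
}

# Cross-model mean accuracy for the easiest and hardest platforms (percent).
platform_mean = {"CSGenetics": 59.1, "MissionBio": 26.4}

model_spread = round(max(model_accuracy.values()) - min(model_accuracy.values()), 1)
platform_spread = round(max(platform_mean.values()) - min(platform_mean.values()), 1)

print(model_spread, platform_spread, platform_spread > model_spread)
# 23.6 32.7 True
```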

CSGenetics is the easiest platform for six of the eight models. MissionBio is the hardest for all eight. The drop on MissionBio is striking: even the best model (Opus 4.6) only reaches 42%, and Gemini drops to 10.3%. Every model shows a large platform swing: Gemini drops 42 points between its best and worst platform. Even Opus 4.5, the most consistent model, loses 39 points.
MissionBio reshuffles the overall rankings in interesting ways. Grok-4 (sixth overall) beats GPT-5.2 (third overall) on MissionBio. Sonnet 4.5 outperforms GPT-5.2 by 11 points. Models that memorize Scanpy tutorials without learning transferable analysis skills fall apart on platforms with unfamiliar data structures and unfamiliar tools.
These results probably reflect the composition of the training data. Chromium and Illumina dominate public repositories and documentation. MissionBio and ParseBio appear rarely. Reliable agents will need platform-specific knowledge and assay-specific tools, not one-size-fits-all reasoning.
Together, scBench and SpatialBench cover two major data modalities. The top model reaches 52.8% on scBench compared to 38.4% on SpatialBench; scRNA-seq is the more tractable of the two. But the structural patterns are shared: normalization is easy in both, platform effects drive swings of 30–40 points in both, and model rankings are largely preserved (Claude Opus leads both, Gemini ranks last in both).

The two benchmarks play complementary roles. scBench tests whether models can handle the most common and best-documented type of assay. SpatialBench tests whether that ability transfers to newer, less standardized technologies. Together they reveal whether an agent has learned to reason about analysis in general or has simply memorized Scanpy workflows.
Deterministic grading enables reproducible evaluation but compresses scientific judgment into pieces that can be checked automatically. Each problem isolates one step of the workflow rather than capturing a long-horizon iteration where errors compound and parameters are revisited. Real analysis is messier, more iterative, and more open-ended than any single problem captures. But measuring step-level capability reliably is a prerequisite for long-horizon workflow automation.
scBench confirms the pattern SpatialBench established: biology agents are in a regime where they can speed up routine analysis but cannot reliably answer scientific questions without human supervision. The way forward is a long tail of hands-on engineering: platform-specific context, better harness design, and exposure to workflows that are representative of the full range of biological settings.
We are building a family of benchmarks that cover the major biological modalities, each a progressive formalization of the implicit knowledge and judgment that practicing scientists bring to data analysis. The goal is test-driven development of agent systems that improve with both model training and harness engineering.
The code, canonical tests, and complete trajectories are available at github.com/latchbio/scbench.



