Synthetic dataset templates

GeneVector ships with a small catalog of synthetic spatial transcriptomics datasets, each modeling a different biological pathology. All builders share a common output contract (an AnnData object plus a ground-truth dict), and the datasets are useful for benchmarking, testing, and feature development across spatial-omics tools.

from genevector.benchmarks.synthetic import (
    build_paracrine_dataset,
    build_niche_dataset,
    build_gradient_dataset,
    build_pathology,
    list_templates,
)

print(list_templates())  # name → description map

adata, ground_truth = build_paracrine_dataset(seed=42)

The full reference, including the ground-truth schema and per-template parameter tables, lives in docs/synthetic_templates.md.

Templates

  • build_paracrine_dataset — two intermixed cell types with N ligand-receptor pairs and configurable mixing.

  • build_niche_dataset — tumor blob with surrounding T cells; niche genes induced by local tumor density.

  • build_gradient_dataset — 1D axial pathology, cells along a line with smooth monotone and peaked gene expression.

  • build_pathology — full grafiti-derived FOV with paracrine, niche, T-rare, and housekeeping overlays composed.

Shared contract

Every builder returns (adata, ground_truth) where:

  • adata.X is a scipy.sparse.csr_matrix of float counts.

  • adata.obs["phenotype"] is categorical.

  • adata.obsm["spatial"] is a (n_cells, 2) float64 array.

  • adata.var_names are uppercase, unique strings.
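The adata guarantees listed above can be turned into a quick sanity check. The following is a minimal sketch rather than part of GeneVector: the helper name check_adata_contract is hypothetical, and it only inspects the four attributes named in the list.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def check_adata_contract(adata) -> list[str]:
    """Return a list of contract violations (empty if adata conforms).

    Hypothetical helper, not a GeneVector function.
    """
    problems = []

    # adata.X: CSR sparse matrix with a floating-point dtype.
    if not isinstance(adata.X, sp.csr_matrix):
        problems.append("X is not a scipy.sparse.csr_matrix")
    elif not np.issubdtype(adata.X.dtype, np.floating):
        problems.append("X does not have a float dtype")

    # adata.obs["phenotype"]: categorical column.
    if not isinstance(adata.obs["phenotype"].dtype, pd.CategoricalDtype):
        problems.append('obs["phenotype"] is not categorical')

    # adata.obsm["spatial"]: (n_cells, 2) float64 coordinates.
    spatial = np.asarray(adata.obsm["spatial"])
    n_cells = adata.X.shape[0]
    if spatial.shape != (n_cells, 2) or spatial.dtype != np.float64:
        problems.append('obsm["spatial"] is not an (n_cells, 2) float64 array')

    # adata.var_names: unique, uppercase strings.
    names = list(adata.var_names)
    if not all(isinstance(n, str) and n == n.upper() for n in names):
        problems.append("var_names are not all uppercase strings")
    if len(set(names)) != len(names):
        problems.append("var_names are not unique")

    return problems
```

In a test suite this could be called on the output of any builder, for example `assert not check_adata_contract(adata)` after `build_paracrine_dataset(seed=42)`.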

The ground truth is a JSON-serializable dict with the keys template, version, seed, params, phenotypes, genes, and pairs. The schema version is "2.0".
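The ground-truth contract can be checked with the standard library alone. This is a minimal sketch under stated assumptions: check_ground_truth is a hypothetical helper, and it assumes the schema version is stored under the version key, which the text above does not state explicitly.

```python
import json

# Keys named in the shared contract.
REQUIRED_KEYS = {
    "template", "version", "seed", "params", "phenotypes", "genes", "pairs",
}


def check_ground_truth(ground_truth: dict, expected_version: str = "2.0") -> list[str]:
    """Return a list of contract violations (empty if the dict conforms).

    Hypothetical helper, not a GeneVector function.
    """
    problems = []

    # All seven documented keys must be present.
    missing = REQUIRED_KEYS - ground_truth.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Assumption: the "version" key carries the schema version.
    if "version" in ground_truth and ground_truth["version"] != expected_version:
        problems.append(f"unexpected schema version: {ground_truth['version']!r}")

    # JSON-serializable means json.dumps succeeds without a custom encoder.
    try:
        json.dumps(ground_truth)
    except (TypeError, ValueError) as exc:
        problems.append(f"not JSON-serializable: {exc}")

    return problems
```

A check like this tends to catch NumPy arrays or sets that slip into the dict, since json.dumps rejects both.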