Co-expression Targets

GeneVector can train gene embeddings on different co-expression metrics. The target function defines what relationship between gene pairs the model learns to reproduce with dot products in the embedding space.
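As a rough illustration of "reproducing a target with dot products", the sketch below scores how well a set of embeddings matches a target matrix. The squared-error form, the single shared embedding matrix, and the name `reconstruction_error` are assumptions made for this example; the text above does not specify the loss GeneVector actually optimizes.

```python
import numpy as np

def reconstruction_error(W, T):
    """Mean squared gap between embedding dot products and target scores.

    W: (genes x dims) embedding matrix. T: (genes x genes) target scores.
    Self-pairs on the diagonal are ignored.
    Assumption: squared error is used purely for illustration.
    """
    pred = W @ W.T                               # dot product for every gene pair
    off_diag = ~np.eye(W.shape[0], dtype=bool)   # mask out self-pairs
    return float(np.mean((pred[off_diag] - T[off_diag]) ** 2))

# Toy embeddings for three genes in two dimensions
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T = W @ W.T   # targets that these embeddings reproduce exactly
error = reconstruction_error(W, T)
```

Whatever target is chosen, training can be read as pushing a quantity of this kind toward zero.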

Built-in Targets

Target	Speed	Description
`mi`	varies	Mutual information (default). Captures nonlinear statistical dependence. Multiple backends available.
`pearson`	instant	Pearson correlation coefficient. Linear co-expression.
`spearman`	instant	Spearman rank correlation. Monotonic co-expression, robust to outliers.
`jaccard`	instant	Jaccard index on binarized expression (detected / not detected).
`cosine`	instant	Cosine similarity between gene expression vectors across cells.

Usage

from genevector.data import GeneVectorDataset

# Default: signed mutual information
dataset = GeneVectorDataset(adata, target="mi", signed_mi=True)

# Pearson correlation
dataset = GeneVectorDataset(adata, target="pearson")

# Spearman rank correlation
dataset = GeneVectorDataset(adata, target="spearman")

# Jaccard index
dataset = GeneVectorDataset(adata, target="jaccard")

# Cosine similarity
dataset = GeneVectorDataset(adata, target="cosine")

The mi_backend parameter applies only when target="mi". The matrix-based targets (Pearson, Spearman, Jaccard, cosine) reduce to dense matrix operations backed by BLAS, so they typically finish in seconds even for large gene sets.
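To show why the matrix-based targets are cheap, here is a minimal NumPy sketch in which each metric comes down to a single matrix product over a dense cells x genes array. The function names are hypothetical and this is not GeneVector's own code. Spearman is omitted because it is Pearson applied to per-gene ranks.

```python
import numpy as np

def pearson_matrix(X):
    """Gene-by-gene Pearson correlation for a dense (cells x genes) array."""
    Xc = X - X.mean(axis=0)                # center each gene
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = 1.0                # constant genes: avoid division by zero
    Z = Xc / norms
    return Z.T @ Z                         # one matrix product gives all pairs

def cosine_matrix(X):
    """Cosine similarity between gene expression vectors across cells."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0                # all-zero genes: avoid division by zero
    Z = X / norms
    return Z.T @ Z

def jaccard_matrix(X):
    """Jaccard index on binarized expression (detected / not detected)."""
    B = (X > 0).astype(np.float64)
    inter = B.T @ B                        # cells where both genes are detected
    counts = B.sum(axis=0)                 # cells where each gene is detected
    union = counts[:, None] + counts[None, :] - inter
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)

# Toy data: 4 cells x 3 genes
X = np.array([[1.0, 0.0, 2.0],
              [2.0, 0.0, 4.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0]])
P, C, J = pearson_matrix(X), cosine_matrix(X), jaccard_matrix(X)
```

Each function performs one genes x genes product, which is the operation BLAS libraries are built to accelerate.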

Graph-Aware Targets

Graph-aware targets measure co-expression across graph neighbors rather than within individual cells. The graph entry in target_kwargs accepts any SciPy sparse adjacency matrix: spatial neighbors, TCR similarity, or a custom graph.

import squidpy as sq

# Build a spatial neighbor graph
sq.gr.spatial_neighbors(adata, n_neighs=10, coord_type="generic")
graph = adata.obsp["spatial_connectivities"]

# Cross-correlation between self-expression and neighbor-aggregated expression
dataset = GeneVectorDataset(adata, target="graph_xcorr",
    target_kwargs={"graph": graph, "aggr": "mean"})
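One way to read "cross-correlation between self-expression and neighbor-aggregated expression" is sketched below with plain NumPy. The function name `graph_xcorr_sketch` and the choice of Pearson correlation as the final score are assumptions for illustration. The library's actual `graph_xcorr` may normalize or symmetrize differently.

```python
import numpy as np

def graph_xcorr_sketch(X, A):
    """Correlate each gene with every gene's neighbor-mean expression.

    X: dense (cells x genes) array.
    A: (cells x cells) adjacency, dense or SciPy sparse.
    Entry [i, j] of the result is the Pearson correlation between gene i in a
    cell and the mean of gene j over that cell's graph neighbors.
    """
    deg = np.asarray(A.sum(axis=1), dtype=float).ravel()
    deg[deg == 0] = 1.0                       # isolated cells keep a zero neighbor mean
    N = np.asarray(A @ X) / deg[:, None]      # "mean" aggregation over neighbors

    def standardize(M):
        Mc = M - M.mean(axis=0)
        norms = np.linalg.norm(Mc, axis=0)
        norms[norms == 0] = 1.0               # constant columns: avoid division by zero
        return Mc / norms

    return standardize(X).T @ standardize(N)

# Toy data: 4 cells x 2 genes on a path graph 0-1-2-3
X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [4.0, 1.0]])
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = graph_xcorr_sketch(X, A)
```

If every cell is its own only neighbor (an identity adjacency), the neighbor mean equals the cell's own expression and the result reduces to the ordinary Pearson matrix.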

The target does not depend on where the graph comes from, so the same call works with a spatial graph, a TCR graph, or any other topology:

from genevector.graphs import build_clonotype_graph

# Same target, different graph
clone_graph = build_clonotype_graph(adata, clone_key="clone_id")
dataset = GeneVectorDataset(adata, target="graph_xcorr",
    target_kwargs={"graph": clone_graph})

Custom Targets

Register a custom target function:

from genevector.metrics import register_target

@register_target("my_metric")
def my_target(X, gene_names, **kwargs):
    # Compute pairwise scores
    # Must return dict[str, dict[str, float]]
    scores = {}
    # ... your computation ...
    return scores

dataset = GeneVectorDataset(adata, target="my_metric")

Or pass a callable directly without registration:

dataset = GeneVectorDataset(adata,
    target=lambda X, names, **kw: my_score_function(X, names))
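Both forms above leave the computation itself open. The sketch below fills in one possible body, using covariance as a placeholder metric, to show the nested-dictionary return shape. The decorator is left out so the snippet runs without GeneVector installed. It is an assumption that X arrives as a dense cells x genes array and that self-pairs can be skipped.

```python
import numpy as np

def covariance_target(X, gene_names, **kwargs):
    """Return pairwise covariance as dict[str, dict[str, float]]."""
    X = np.asarray(X, dtype=float)                   # assumption: dense cells x genes
    C = np.atleast_2d(np.cov(X, rowvar=False))       # genes x genes covariance
    return {
        g: {h: float(C[i, j])                        # plain floats, not NumPy scalars
            for j, h in enumerate(gene_names) if j != i}   # assumption: skip self-pairs
        for i, g in enumerate(gene_names)
    }

# Toy data: 3 cells x 2 genes, where gene B is exactly twice gene A
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
scores = covariance_target(X, ["A", "B"])
```

A function of this shape could then be registered with @register_target or passed directly as target.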

Caching

All computed target scores are cached automatically in ~/.genevector/cache/. Cache keys incorporate the expression matrix, gene list, target function name, and all parameters, so changing any of these produces a separate cache entry.

# Disable caching
dataset = GeneVectorDataset(adata, use_cache=False)

# Clear the cache
from genevector.cache import clear_cache
clear_cache()
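To illustrate how a cache key can cover the expression matrix, gene list, target name, and parameters, here is a minimal sketch using only the standard library. The function `cache_key` and its hashing scheme are hypothetical. GeneVector's real key derivation is not described here and may differ.

```python
import hashlib
import json

def cache_key(matrix_bytes, gene_names, target_name, params):
    """Derive a deterministic key from every input that affects the scores."""
    payload = {
        "matrix": hashlib.sha256(matrix_bytes).hexdigest(),  # digest of the raw matrix bytes
        "genes": list(gene_names),
        "target": target_name,
        "params": params,
    }
    # sort_keys makes the key independent of parameter ordering;
    # default=str is a fallback for values that JSON cannot encode directly.
    blob = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

k_pearson = cache_key(b"\x00\x01\x02", ["A", "B"], "pearson", {"x": 1, "y": 2})
k_reordered = cache_key(b"\x00\x01\x02", ["A", "B"], "pearson", {"y": 2, "x": 1})
k_spearman = cache_key(b"\x00\x01\x02", ["A", "B"], "spearman", {"x": 1, "y": 2})
```

With this scheme the same configuration always maps to the same key, while changing the target, the genes, the data, or any parameter yields a different one. For a NumPy array, the bytes could come from `np.ascontiguousarray(X).tobytes()`, with shape and dtype added to the payload.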