Co-expression Targets
GeneVector can train gene embeddings on different co-expression metrics. The target function defines what relationship between gene pairs the model learns to reproduce with dot products in the embedding space.
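To make "reproduce with dot products" concrete, the following is a minimal NumPy sketch of the kind of objective this implies. It is an illustration only, not GeneVector's training code: the function name `dot_product_loss`, the use of a single embedding matrix, and the mean-squared-error form are all assumptions.

```python
import numpy as np

def dot_product_loss(embeddings: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between pairwise dot products and target scores.

    embeddings: (n_genes, dim) array, one row per gene (hypothetical layout).
    target:     (n_genes, n_genes) array of co-expression scores.
    Only off-diagonal gene pairs (i != j) contribute to the loss.
    """
    predicted = embeddings @ embeddings.T
    off_diagonal = ~np.eye(target.shape[0], dtype=bool)
    diff = predicted[off_diagonal] - target[off_diagonal]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))

# A target built from the embeddings themselves is reproduced exactly,
# so the loss is zero by construction.
print(dot_product_loss(W, W @ W.T))

# An unrelated target gives a positive loss that training would reduce.
print(dot_product_loss(W, rng.normal(size=(5, 5))))
```

Whatever target is chosen only changes the matrix the dot products are compared against; the embedding side of the objective stays the same.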
Built-in Targets
| Target | Speed | Description |
|---|---|---|
| mi | varies | Mutual information (default). Captures nonlinear statistical dependence. Multiple backends available. |
| pearson | instant | Pearson correlation coefficient. Linear co-expression. |
| spearman | instant | Spearman rank correlation. Monotonic co-expression, robust to outliers. |
| jaccard | instant | Jaccard index on binarized expression (detected / not detected). |
| cosine | instant | Cosine similarity between gene expression vectors across cells. |
Usage
from genevector.data import GeneVectorDataset
# Default: signed mutual information
dataset = GeneVectorDataset(adata, target="mi", signed_mi=True)
# Pearson correlation
dataset = GeneVectorDataset(adata, target="pearson")
# Spearman rank correlation
dataset = GeneVectorDataset(adata, target="spearman")
# Jaccard index
dataset = GeneVectorDataset(adata, target="jaccard")
# Cosine similarity
dataset = GeneVectorDataset(adata, target="cosine")
The mi_backend parameter applies only when target="mi". The matrix-based
targets (Pearson, Spearman, Jaccard, cosine) reduce to dense matrix products,
so they compute in seconds via BLAS regardless of gene count.
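The reason these four targets are cheap is that each can be written as one gene-by-gene matrix product. The sketch below shows the standard formulations with NumPy and SciPy. It assumes a dense `(cells, genes)` array and is not taken from GeneVector's source; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def _unit_columns(M):
    """Scale each column to unit L2 norm; all-zero columns stay zero."""
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0
    return M / norms

def pearson_matrix(X):
    """Center each gene, normalize, then take one matrix product."""
    Z = _unit_columns(X - X.mean(axis=0))
    return Z.T @ Z

def spearman_matrix(X):
    """Spearman is Pearson applied to per-gene ranks across cells."""
    return pearson_matrix(rankdata(X, axis=0))

def cosine_matrix(X):
    """Cosine similarity: normalize without centering."""
    Z = _unit_columns(X)
    return Z.T @ Z

def jaccard_matrix(X):
    """Jaccard index on detected (> 0) versus not detected."""
    B = (X > 0).astype(np.float64)
    both = B.T @ B                      # cells where both genes are detected
    detected = B.sum(axis=0)            # cells where each gene is detected
    either = detected[:, None] + detected[None, :] - both
    # Pairs of genes that are never detected get a score of 0.
    return np.divide(both, either, out=np.zeros_like(both), where=either > 0)
```

Each function returns a `(genes, genes)` array, and the dominant cost in every case is a single `Z.T @ Z` style product.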
Graph-Aware Targets
Graph-aware targets measure co-expression across graph neighbors rather than
within individual cells. The graph parameter accepts any SciPy sparse
adjacency matrix, for example spatial neighbors, TCR similarity, or a custom graph.
import squidpy as sq
# Build a spatial neighbor graph
sq.gr.spatial_neighbors(adata, n_neighs=10, coord_type="generic")
graph = adata.obsp["spatial_connectivities"]
# Cross-correlation between self-expression and neighbor-aggregated expression
dataset = GeneVectorDataset(adata, target="graph_xcorr",
                            target_kwargs={"graph": graph, "aggr": "mean"})
The target is agnostic to where the graph comes from, so the same target works on spatial, TCR, or any other graph topology:
from genevector.graphs import build_clonotype_graph
# Same target, different graph
clone_graph = build_clonotype_graph(adata, clone_key="clone_id")
dataset = GeneVectorDataset(adata, target="graph_xcorr",
                            target_kwargs={"graph": clone_graph})
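As a rough illustration of what "cross-correlation between self-expression and neighbor-aggregated expression" can mean, here is a minimal sketch using NumPy and SciPy. It is one plausible reading, not GeneVector's implementation: the function name is hypothetical, a dense `(cells, genes)` matrix is assumed, and the `"sum"` option is an assumption added for contrast with `"mean"`.

```python
import numpy as np
import scipy.sparse as sp

def graph_cross_correlation(X, graph, aggr="mean"):
    """Correlate each gene's own expression with neighbor-aggregated expression.

    X:     (cells, genes) dense array.
    graph: (cells, cells) sparse adjacency matrix.
    Entry (i, j) of the result is the Pearson correlation, across cells,
    between gene i in a cell and gene j aggregated over that cell's neighbors.
    """
    A = sp.csr_matrix(graph, dtype=np.float64)
    if aggr == "mean":
        # Row-normalize so each cell averages over its neighbors.
        degree = np.asarray(A.sum(axis=1)).ravel()
        degree[degree == 0] = 1.0       # isolated cells aggregate to zero
        A = sp.diags(1.0 / degree) @ A
    elif aggr != "sum":
        raise ValueError(f"unsupported aggr: {aggr!r}")

    neighbor_expr = np.asarray(A @ X)   # (cells, genes)

    def standardize(M):
        centered = M - M.mean(axis=0)
        norms = np.linalg.norm(centered, axis=0)
        norms[norms == 0] = 1.0
        return centered / norms

    return standardize(X).T @ standardize(neighbor_expr)
```

Under this formulation the result is generally not symmetric, since gene i in a cell versus gene j in its neighbors differs from the reverse. With an identity graph (each cell is its own only neighbor) it reduces to the ordinary Pearson correlation matrix.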
Custom Targets
Register a custom target function:
from genevector.metrics import register_target
@register_target("my_metric")
def my_target(X, gene_names, **kwargs):
    # Compute pairwise scores
    # Must return dict[str, dict[str, float]]
    scores = {}
    # ... your computation ...
    return scores
dataset = GeneVectorDataset(adata, target="my_metric")
Or pass a callable directly without registration:
dataset = GeneVectorDataset(adata,
                            target=lambda X, names, **kw: my_score_function(X, names))
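For reference, a target body that satisfies the `dict[str, dict[str, float]]` return type might look like the sketch below. It uses covariance purely as an example metric. Two points are assumptions rather than documented behavior: that `X` arrives as a dense `(cells, genes)` array, and that self-pairs can be omitted from the result.

```python
import numpy as np

def covariance_target(X, gene_names, **kwargs):
    """Pairwise covariance between genes as a nested dict of floats.

    Assumes X is a dense (cells, genes) array whose columns follow
    the order of gene_names. Self-pairs (gene with itself) are skipped.
    """
    X = np.asarray(X, dtype=np.float64)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    scores = {}
    for i, gene_i in enumerate(gene_names):
        scores[gene_i] = {
            gene_j: float(cov[i, j])
            for j, gene_j in enumerate(gene_names)
            if i != j
        }
    return scores

# Toy check on three genes and four cells.
X = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0],
              [4.0, 8.0, 1.0]])
scores = covariance_target(X, ["A", "B", "C"])
print(scores["A"]["B"], scores["B"]["A"])
```

A function with this signature could then be decorated with `register_target` or passed directly as `target=`, as shown above.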
Caching
All computed target scores are cached automatically in ~/.genevector/cache/.
Cache keys incorporate the expression matrix, gene list, target function name,
and all parameters, so different configurations never collide.
# Disable caching
dataset = GeneVectorDataset(adata, use_cache=False)
# Clear the cache
from genevector.cache import clear_cache
clear_cache()
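To illustrate how a key can cover the expression matrix, gene list, target name, and parameters at once, here is a minimal content-hash sketch. It is not GeneVector's actual key scheme: the function name `cache_key`, the choice of SHA-256, and the JSON serialization of parameters are all assumptions.

```python
import hashlib
import json
import numpy as np

def cache_key(X, gene_names, target_name, params):
    """Hash every input that affects the computed scores (hypothetical scheme)."""
    X = np.ascontiguousarray(X)
    digest = hashlib.sha256()
    # Matrix content, plus dtype and shape so reshaped data hashes differently.
    digest.update(str(X.dtype).encode())
    digest.update(str(X.shape).encode())
    digest.update(X.tobytes())
    # Gene list, target name, and parameters in a canonical JSON form;
    # sort_keys makes the key independent of dict insertion order.
    meta = {"genes": list(gene_names), "target": target_name, "params": params}
    digest.update(json.dumps(meta, sort_keys=True, default=str).encode())
    return digest.hexdigest()

X = np.arange(6, dtype=np.float64).reshape(3, 2)
genes = ["A", "B"]
same = cache_key(X, genes, "pearson", {}) == cache_key(X.copy(), genes, "pearson", {})
other = cache_key(X, genes, "pearson", {}) == cache_key(X, genes, "cosine", {})
print(same, other)
```

Changing any one of the four inputs produces a different digest, which is the property the cache relies on to keep configurations apart.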