MI Computation Backends

GeneVector computes pairwise co-expression scores between all genes as training targets. For mutual information — the default and most principled target — this involves computing MI for every gene pair, which is O(G² × N) where G is the number of genes and N is the number of cells.

To make this tractable for real datasets (2,000+ genes, 50,000+ cells), GeneVector provides multiple computation backends with automatic dispatch.

How MI Computation Works

Step 1: Discretization

Continuous gene expression values are discretized into integer bins per gene. Zero expression maps to bin 0. Nonzero values are split into quantile-based bins (default 10). This is done once for all genes via discretize_genes().

Step 2: Joint Histogram

For each gene pair (A, B), a joint histogram is built by counting cells in each (bin_A, bin_B) combination. Only cells where at least one gene is expressed contribute.

Step 3: MI from Joint Distribution

\[MI(A, B) = \sum_{a,b} p(a,b) \log_2 \frac{p(a,b)}{p(a) \cdot p(b)}\]

where \(p(a,b)\) is the joint probability, and \(p(a)\), \(p(b)\) are marginals.

Step 4: Signing (optional)

When signed_mi=True (default), MI is multiplied by the sign of the Pearson correlation between the two genes. This produces directional MI: positive for co-expressed genes, negative for anti-correlated genes.

Backend Dispatch

When mi_backend="auto" (the default), GeneVector selects the fastest available backend in this order:

Priority

Backend

Speedup

How It Works

Availability

1

CUDA GPU

~200-1000×

PyTorch scatter-based histograms

Requires device="cuda"

2

Rust

~100-800×

Native compiled, rayon parallelism

Requires maturin develop --release

3

Numba

~100-500×

JIT compiled, parallel across CPU cores

Requires pip install numba

4

NumPy

~10-30×

Vectorized discretization, Python pair loop

Always available

All backends produce identical results (within floating-point precision) — the same discretization, the same histogram, the same MI formula. The only difference is speed.

Selecting a Backend

# Auto-select (recommended)
dataset = GeneVectorDataset(adata, mi_backend="auto")

# Force a specific backend
dataset = GeneVectorDataset(adata, mi_backend="rust")
dataset = GeneVectorDataset(adata, mi_backend="numba")
dataset = GeneVectorDataset(adata, mi_backend="numpy")
dataset = GeneVectorDataset(adata, mi_backend="gpu", device="cuda")

Checking Available Backends

from genevector.metrics import HAS_NUMBA

try:
    from genevector._rust import compute_mi_pairs
    print("Rust backend: available")
except ImportError:
    print("Rust backend: not installed")

print(f"Numba backend: {'available' if HAS_NUMBA else 'not installed'}")

import torch
print(f"CUDA backend: {'available' if torch.cuda.is_available() else 'not available'}")

Practical Runtimes

For 2,000 genes × 5,000 cells (~2M gene pairs):

Backend

Time

Legacy (v0.2)

~26 hours

NumPy

~3 hours

Numba

~2-3 minutes

Rust

~1-2 minutes

Note

Scores are cached to ~/.genevector/cache/ after the first run. Subsequent runs with the same data and parameters load instantly.

Architecture

The backend dispatch lives in genevector/metrics.py. Each backend implements the same interface:

def compute_mi_BACKEND(X, gene_names, n_bins=10, signed=False) -> dict[str, dict[str, float]]

All backends share the discretize_genes() function for the binning step. The Rust backend (src/lib.rs) receives the pre-discretized integer matrix from Python via zero-copy numpy arrays (PyO3 + numpy-rs), then parallelizes the histogram + MI computation across gene pairs using rayon.

Tip

For datasets larger than 2,000 genes, use the entropy-based gene filter GeneVectorDataset.quality_control(adata, entropy_threshold=1.0) to reduce gene count before MI computation. The matrix-based targets (Pearson, Spearman, Jaccard, cosine) compute in seconds regardless of gene count since they use BLAS-backed matrix operations.