MI Computation Backends
=======================

GeneVector computes pairwise co-expression scores between all genes as training
targets. For mutual information — the default and most principled target — this
involves computing MI for every gene pair, which is O(G² × N) where G is the
number of genes and N is the number of cells.

To make this tractable for real datasets (2,000+ genes, 50,000+ cells), GeneVector
provides multiple computation backends with automatic dispatch.

How MI Computation Works
------------------------

**Step 1: Discretization**

Continuous gene expression values are discretized into integer bins per gene.
Zero expression maps to bin 0. Nonzero values are split into quantile-based
bins (default 10). This is done once for all genes via ``discretize_genes()``.

**Step 2: Joint Histogram**

For each gene pair (A, B), a joint histogram is built by counting cells in each
(bin_A, bin_B) combination. Only cells where at least one gene is expressed
contribute.

**Step 3: MI from Joint Distribution**

.. math::

   MI(A, B) = \sum_{a,b} p(a,b) \log_2 \frac{p(a,b)}{p(a) \cdot p(b)}

where :math:`p(a,b)` is the joint probability, and :math:`p(a)`, :math:`p(b)` are
marginals.

**Step 4: Signing (optional)**

When ``signed_mi=True`` (default), MI is multiplied by the sign of the Pearson
correlation between the two genes. This produces directional MI: positive for
co-expressed genes, negative for anti-correlated genes.

Backend Dispatch
----------------

When ``mi_backend="auto"`` (the default), GeneVector selects the fastest available
backend in this order:

.. list-table::
   :header-rows: 1
   :widths: 10 15 20 25 30

   * - Priority
     - Backend
     - Speedup
     - How It Works
     - Availability
   * - 1
     - **CUDA GPU**
     - ~200-1000×
     - PyTorch scatter-based histograms
     - Requires ``device="cuda"``
   * - 2
     - **Rust**
     - ~100-800×
     - Native compiled, rayon parallelism
     - Requires ``maturin develop --release``
   * - 3
     - **Numba**
     - ~100-500×
     - JIT compiled, parallel across CPU cores
     - Requires ``pip install numba``
   * - 4
     - **NumPy**
     - ~10-30×
     - Vectorized discretization, Python pair loop
     - Always available

All backends produce identical results (within floating-point precision) — the
same discretization, the same histogram, the same MI formula. The only difference
is speed.

Selecting a Backend
-------------------

.. code-block:: python

   # Auto-select (recommended)
   dataset = GeneVectorDataset(adata, mi_backend="auto")

   # Force a specific backend
   dataset = GeneVectorDataset(adata, mi_backend="rust")
   dataset = GeneVectorDataset(adata, mi_backend="numba")
   dataset = GeneVectorDataset(adata, mi_backend="numpy")
   dataset = GeneVectorDataset(adata, mi_backend="gpu", device="cuda")

Checking Available Backends
---------------------------

.. code-block:: python

   from genevector.metrics import HAS_NUMBA

   try:
       from genevector._rust import compute_mi_pairs
       print("Rust backend: available")
   except ImportError:
       print("Rust backend: not installed")

   print(f"Numba backend: {'available' if HAS_NUMBA else 'not installed'}")

   import torch
   print(f"CUDA backend: {'available' if torch.cuda.is_available() else 'not available'}")

Practical Runtimes
------------------

For 2,000 genes × 5,000 cells (~2M gene pairs):

.. list-table::
   :header-rows: 1
   :widths: 30 30

   * - Backend
     - Time
   * - Legacy (v0.2)
     - ~26 hours
   * - NumPy
     - ~3 hours
   * - Numba
     - ~2-3 minutes
   * - Rust
     - ~1-2 minutes

.. note::

   Scores are cached to ``~/.genevector/cache/`` after the first run.
   Subsequent runs with the same data and parameters load instantly.

Architecture
------------

The backend dispatch lives in ``genevector/metrics.py``. Each backend implements
the same interface:

.. code-block:: python

   def compute_mi_BACKEND(X, gene_names, n_bins=10, signed=False) -> dict[str, dict[str, float]]

All backends share the ``discretize_genes()`` function for the binning step.
The Rust backend (``src/lib.rs``) receives the pre-discretized integer matrix
from Python via zero-copy numpy arrays (PyO3 + numpy-rs), then parallelizes
the histogram + MI computation across gene pairs using rayon.

.. tip::

   For datasets larger than 2,000 genes, use the entropy-based gene filter
   ``GeneVectorDataset.quality_control(adata, entropy_threshold=1.0)`` to
   reduce gene count before MI computation. The matrix-based targets
   (Pearson, Spearman, Jaccard, cosine) compute in seconds regardless of
   gene count since they use BLAS-backed matrix operations.