genevector.data

GeneVector dataset and expression context for single-cell data.

class GeneVectorDataset(adata, device='cpu', mi_scores=None, load_expression=True, signed_mi=True, target='mi', target_kwargs=None, mi_backend='auto', use_cache=True)[source]

Bases: Dataset

This class extends the torch Dataset class with functionality to compute mutual information between genes and to generate batches of input and output data for each gene pair for training.

Parameters:
  • adata (AnnData) – The AnnData Scanpy object that holds the dataset with expression data in .X.

  • device (str) – The device on which to load the torch dataset (“cpu”, “cuda:0”, or “mps” for torch Metal acceleration).

  • mi_scores (dict of dict) – Optionally side-load a two-level dictionary containing the training target for each gene pair.

  • processes (int) – Not currently functional; placeholder for planned multiprocessing support in MI computation.
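To make the mutual information target and the signed_mi option in the signature above more concrete, here is a minimal sketch of a histogram-based MI estimate between two expression vectors, signed by their Pearson correlation. The function name signed_mutual_information, the bins parameter, and the choice of estimator are assumptions for illustration; the estimator and binning used by GeneVectorDataset may differ.

```python
import numpy as np

def signed_mutual_information(x, y, bins=10):
    """Histogram MI estimate (in bits) between x and y, multiplied by the
    sign of their Pearson correlation. Illustrative only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A constant vector carries no information about anything else.
    if np.std(x) == 0 or np.std(y) == 0:
        return 0.0
    # Joint distribution from a 2D histogram, then the two marginals.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # shape (1, bins)
    nonzero = pxy > 0                     # skip empty cells to avoid log(0)
    mi = float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))
    # Attach a direction: positive for co-expression, negative for anti-correlation.
    sign = float(np.sign(np.corrcoef(x, y)[0, 1]))
    return mi * sign
```

With this sketch, two perfectly co-varying genes get a positive score and two perfectly anti-correlated genes get a score of the same magnitude with the opposite sign.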

__init__(adata, device='cpu', mi_scores=None, load_expression=True, signed_mi=True, target='mi', target_kwargs=None, mi_backend='auto', use_cache=True)[source]

Constructor method

Parameters:
  • adata (AnnData) – The AnnData Scanpy object with expression data in .X.

  • device (str) – Device for torch tensors (“cpu”, “cuda”, “mps”).

  • mi_scores (dict, optional) – Side-load precomputed target scores.

  • load_expression (bool) – Whether to load expression into Context dicts.

  • signed_mi (bool) – If True, multiply MI by correlation sign for directional MI.

  • target (str or callable) – Name of registered target function, or a callable with signature f(X, gene_names, **kwargs) -> dict[dict[float]]. Default: “mi” (mutual information).

  • target_kwargs (dict, optional) – Extra keyword arguments passed to the target function.

  • mi_backend (str) – Backend for MI computation: “auto”, “numpy”, “numba”, “gpu”. Only used when target=“mi”.

  • use_cache (bool) – If True, cache computed target scores to disk and reload on subsequent runs with the same data+parameters.
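The target parameter accepts a callable with signature f(X, gene_names, **kwargs) that returns a dictionary of dictionaries of floats. A minimal sketch of such a callable is shown below, using Pearson correlation as the score. The name correlation_target and the min_abs keyword are hypothetical, and the sketch assumes X arrives as a dense cells-by-genes array, which the documentation above does not state.

```python
import numpy as np

def correlation_target(X, gene_names, min_abs=0.0, **kwargs):
    """Hypothetical custom target: Pearson correlation for each gene pair.

    Returns a two-level dict: scores[gene_i][gene_j] -> float.
    """
    X = np.asarray(X, dtype=float)  # assumption: dense, cells x genes
    # Constant genes produce NaN correlations; silence the warnings
    # and map those entries to 0.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.atleast_2d(np.corrcoef(X, rowvar=False))
    corr = np.nan_to_num(corr, nan=0.0)

    scores = {}
    for i, gene_i in enumerate(gene_names):
        scores[gene_i] = {}
        for j, gene_j in enumerate(gene_names):
            # Skip self-pairs and pairs below the (hypothetical) cutoff.
            if i != j and abs(corr[i, j]) >= min_abs:
                scores[gene_i][gene_j] = float(corr[i, j])
    return scores
```

Going by the parameter descriptions above, a function like this would be passed as target=correlation_target, with target_kwargs={"min_abs": 0.1} supplying the extra keyword argument.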

create_inputs_outputs(c=100.0)[source]

Compute target scores, build training tensors, and prepare for model training.

Parameters:

c (float) – Scaling factor applied to target scores (score * c^2).
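The scaling described above (score * c^2) can be sketched as follows: a two-level score dictionary is flattened into aligned arrays of scaled values and gene indices. This is a simplified NumPy illustration, not the library's implementation; build_training_arrays is a hypothetical name, and the real method builds torch tensors.

```python
import numpy as np

def build_training_arrays(scores, gene_names, c=100.0):
    """Flatten scores[gene_i][gene_j] into (values, i_indices, j_indices),
    scaling each score by c squared."""
    index = {gene: k for k, gene in enumerate(gene_names)}
    values, i_idx, j_idx = [], [], []
    for gene_i, row in scores.items():
        for gene_j, score in row.items():
            values.append(score * c ** 2)  # with c=100.0, scores are multiplied by 10,000
            i_idx.append(index[gene_i])
            j_idx.append(index[gene_j])
    return (
        np.array(values, dtype=np.float32),
        np.array(i_idx, dtype=np.int64),
        np.array(j_idx, dtype=np.int64),
    )
```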

get_batches(batch_size)[source]

Yield randomized mini-batches of (target_values, i_indices, j_indices).

Parameters:

batch_size (int) – Number of gene pairs per batch.

Yields:

tuple of (torch.Tensor, torch.Tensor, torch.Tensor) – Target values, row gene indices, column gene indices.
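The batching behavior described above can be sketched as a generator that shuffles the gene pairs once and then yields aligned slices. This is an illustrative NumPy version under stated assumptions: iter_batches and its seed argument are hypothetical, and whether the library keeps or drops a final partial batch is not documented here (this sketch keeps it).

```python
import numpy as np

def iter_batches(values, i_idx, j_idx, batch_size, seed=None):
    """Yield randomized mini-batches of (values, i_indices, j_indices)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(values))  # one shuffle of all gene pairs
    for start in range(0, len(order), batch_size):
        sel = order[start:start + batch_size]
        # The same selection is applied to all three arrays so that each
        # target value stays aligned with its gene-pair indices.
        yield values[sel], i_idx[sel], j_idx[sel]
```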

static get_gene_entropy(adata)[source]

Compute individual gene entropy.

Parameters:

adata (AnnData) – The AnnData Scanpy object that holds the dataset with expression data in .X.

Returns:

Dictionary of gene to entropy.

Return type:

dict
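As a rough illustration of per-gene entropy, the sketch below computes the Shannon entropy (in bits) of the empirical distribution of each gene's observed values. The function name gene_entropy, the use of raw unique values rather than binned expression, and the log base are all assumptions; the library's method may discretize expression differently.

```python
import numpy as np

def gene_entropy(X, gene_names):
    """Return {gene: Shannon entropy in bits} for a cells x genes matrix."""
    X = np.asarray(X)
    entropies = {}
    for k, gene in enumerate(gene_names):
        # Empirical probability of each distinct expression value.
        _, counts = np.unique(X[:, k], return_counts=True)
        p = counts / counts.sum()
        h = -np.sum(p * np.log2(p))
        # abs() only normalizes -0.0 to 0.0 for constant genes.
        entropies[gene] = abs(float(h))
    return entropies
```

A gene with the same value in every cell has entropy 0, while a gene split evenly between two values has entropy of 1 bit.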

load_target_scores(filepath)[source]

Load target scores from the specified .npz file.

load_targets(targets)[source]

Load precomputed target values, such as mutual information scores.

Parameters:

targets (dict) – Dictionary of dictionaries mapping each gene pair to its target value.

static quality_control(adata, entropy_threshold=1.0)[source]

Select genes with an entropy above the given threshold. Used in place of highly variable gene selection.

Parameters:
  • adata (AnnData) – The AnnData Scanpy object that holds the dataset with expression data in .X.

  • entropy_threshold (float) – Minimum entropy for a gene to be included in training and downstream analyses.

Returns:

Filtered AnnData object.

Return type:

anndata.AnnData
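The entropy-based gene selection described above amounts to applying a boolean mask over genes. A minimal sketch on a plain matrix follows, taking a gene-to-entropy dictionary like the one returned by get_gene_entropy. The name filter_by_entropy is hypothetical, and the strict comparison (>) is an assumption, since the text says both "above" and "minimum".

```python
import numpy as np

def filter_by_entropy(X, gene_names, entropies, entropy_threshold=1.0):
    """Keep only the columns (genes) whose entropy exceeds the threshold."""
    keep = np.array(
        [entropies[gene] > entropy_threshold for gene in gene_names],
        dtype=bool,
    )
    kept_names = [gene for gene, flag in zip(gene_names, keep) if flag]
    return np.asarray(X)[:, keep], kept_names
```

With an AnnData object, the same boolean mask could be applied along the gene axis as adata[:, keep].copy().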

save_target_scores(filepath)[source]

Save computed target scores to the specified .npz file.
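For readers unfamiliar with the .npz format, the sketch below shows one way a two-level score dictionary could be round-tripped through a NumPy archive. The array names genes and scores, the NaN convention for missing pairs, and both function names are assumptions for illustration; the file layout written by save_target_scores is not documented here and may differ.

```python
import numpy as np

def save_scores_npz(scores, gene_names, filepath):
    """Store scores[gene_i][gene_j] as a dense matrix plus a gene list."""
    n = len(gene_names)
    index = {gene: k for k, gene in enumerate(gene_names)}
    matrix = np.full((n, n), np.nan)  # NaN marks pairs without a score
    for gene_i, row in scores.items():
        for gene_j, value in row.items():
            matrix[index[gene_i], index[gene_j]] = value
    np.savez_compressed(filepath, genes=np.array(gene_names), scores=matrix)

def load_scores_npz(filepath):
    """Rebuild the two-level dict from an archive written by save_scores_npz."""
    with np.load(filepath) as data:
        genes = [str(g) for g in data["genes"]]
        matrix = data["scores"]
    scores = {gene: {} for gene in genes}
    for i, gene_i in enumerate(genes):
        for j, gene_j in enumerate(genes):
            if not np.isnan(matrix[i, j]):
                scores[gene_i][gene_j] = float(matrix[i, j])
    return scores
```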