genevector.data
GeneVector dataset and expression context for single-cell data.
- class GeneVectorDataset(adata, device='cpu', mi_scores=None, load_expression=True, signed_mi=True, target='mi', target_kwargs=None, mi_backend='auto', use_cache=True)
Bases: Dataset
This class extends the torch Dataset class with functionality to compute mutual information between genes and to generate batches of input and output data for each gene pair for training.
- Parameters:
adata (AnnData) – The AnnData Scanpy object that holds the dataset with expression data in .X.
device (str) – The device on which to load the torch dataset (“cpu”, “cuda:0”, or “mps” for torch Metal acceleration).
mi_scores (dict of dict) – Optionally side-load a two-level dictionary containing the training target for each gene pair.
processes (int) – Not functional and not accepted by the constructor signature shown above; placeholder for multiprocessing support in MI computation.
- __init__(adata, device='cpu', mi_scores=None, load_expression=True, signed_mi=True, target='mi', target_kwargs=None, mi_backend='auto', use_cache=True)
Constructor method
- Parameters:
adata (AnnData) – The AnnData Scanpy object with expression data in .X.
device (str) – Device for torch tensors (“cpu”, “cuda”, “mps”).
mi_scores (dict, optional) – Side-load precomputed target scores.
load_expression (bool) – Whether to load expression into Context dicts.
signed_mi (bool) – If True, multiply MI by correlation sign for directional MI.
target (str or callable) – Name of a registered target function, or a callable with signature f(X, gene_names, **kwargs) -> dict[dict[float]]. Default: “mi” (mutual information).
target_kwargs (dict, optional) – Extra keyword arguments passed to the target function.
mi_backend (str) – Backend for MI computation: “auto”, “numpy”, “numba”, “gpu”. Only used when target=“mi”.
use_cache (bool) – If True, cache computed target scores to disk and reload on subsequent runs with the same data+parameters.
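A callable target only has to match the documented signature f(X, gene_names, **kwargs) and return a two-level dictionary. The sketch below is a hypothetical example and not part of GeneVector: the name pearson_target and its min_abs argument are illustrative, and it assumes X is a dense cells-by-genes table (rows are cells), which the parameter description does not state.

```python
import math


def pearson_target(X, gene_names, min_abs=0.0, **kwargs):
    """Hypothetical custom target: Pearson correlation for each gene pair.

    Assumes X is dense and laid out as cells (rows) by genes (columns).
    """
    n_cells = len(X)
    # One column of floats per gene.
    columns = [[float(row[j]) for row in X] for j in range(len(gene_names))]
    means = [sum(col) / n_cells for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]

    scores = {}
    for i, gene_i in enumerate(gene_names):
        scores[gene_i] = {}
        for j, gene_j in enumerate(gene_names):
            # Skip self-pairs and constant genes (correlation undefined).
            if i == j or norms[i] == 0.0 or norms[j] == 0.0:
                continue
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            r = dot / (norms[i] * norms[j])
            if abs(r) >= min_abs:
                scores[gene_i][gene_j] = r
    return scores


# Toy input: three cells, three genes (values are made up for illustration).
X = [[1, 2, 3], [2, 4, 1], [3, 6, 2]]
scores = pearson_target(X, ["A", "B", "C"])
```

Under those assumptions, such a function could be supplied as target=pearson_target, with target_kwargs={"min_abs": 0.1} forwarding the extra keyword argument.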
- create_inputs_outputs(c=100.0)
Compute target scores, build training tensors, and prepare for model training.
- Parameters:
c (float) – Scaling factor applied to target scores (score * c^2).
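To make the scaling concrete, the following sketch flattens a two-level score dictionary into three parallel lists and applies the documented factor (score * c^2) at face value. It is an illustration in plain Python, not the library's implementation: build_pair_arrays is a hypothetical helper, and the real method builds torch tensors.

```python
def build_pair_arrays(scores, gene_names, c=100.0):
    """Hypothetical helper: flatten scores[gene_i][gene_j] into parallel lists."""
    index = {gene: k for k, gene in enumerate(gene_names)}
    values, rows, cols = [], [], []
    for gene_i, partners in scores.items():
        for gene_j, score in partners.items():
            values.append(score * c ** 2)  # documented scaling: score * c^2
            rows.append(index[gene_i])     # integer index of the first gene
            cols.append(index[gene_j])     # integer index of the second gene
    return values, rows, cols


# Toy scores for two genes (made-up values).
toy_scores = {"A": {"B": 0.5}, "B": {"A": 0.5}}
values, rows, cols = build_pair_arrays(toy_scores, ["A", "B"], c=2.0)
```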
- get_batches(batch_size)
Yield randomized mini-batches of (target_values, i_indices, j_indices).
- Parameters:
batch_size (int) – Number of gene pairs per batch.
- Yields:
tuple of (torch.Tensor, torch.Tensor, torch.Tensor) – Target values, row gene indices, column gene indices.
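The batching pattern described here can be sketched as a generator that shuffles pair positions once and then yields aligned slices. This is a simplified stand-in using Python lists rather than torch tensors; iter_batches and its seed argument are hypothetical, and whether the library keeps or drops a final partial batch is not stated, so keeping it is an assumption.

```python
import random


def iter_batches(values, rows, cols, batch_size, seed=None):
    """Hypothetical sketch: yield shuffled (values, i_indices, j_indices) batches."""
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)  # randomize pair order once per pass
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        yield (
            [values[k] for k in chunk],
            [rows[k] for k in chunk],
            [cols[k] for k in chunk],
        )


# Five toy gene pairs (made-up values) split into batches of two.
pair_values = [0.1, 0.2, 0.3, 0.4, 0.5]
pair_rows = [0, 0, 1, 1, 2]
pair_cols = [1, 2, 0, 2, 0]
batches = list(iter_batches(pair_values, pair_rows, pair_cols, batch_size=2, seed=0))
```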
- static get_gene_entropy(adata)
Compute individual gene entropy.
- Parameters:
adata (AnnData) – The AnnData Scanpy object that holds the dataset with expression data in .X.
- Returns:
Dictionary of gene to entropy.
- Return type:
dict
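As a rough illustration of per-gene entropy, the sketch below computes Shannon entropy over the distribution of each gene's discretized expression values. How GeneVector discretizes expression and which logarithm base it uses are not documented here, so rounding to integers and base 2 are assumptions, as are the helper names.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def gene_entropies(X, gene_names):
    """Hypothetical sketch: map each gene to the entropy of its column.

    Assumes X is dense, cells (rows) by genes (columns), and that rounding
    to integers is an acceptable discretization.
    """
    return {
        gene: shannon_entropy([round(float(row[j])) for row in X])
        for j, gene in enumerate(gene_names)
    }


# Four toy cells: g1 splits evenly between two values, g2 is constant.
entropies = gene_entropies([[0, 5], [0, 5], [1, 5], [1, 5]], ["g1", "g2"])
```

A constant gene carries no information and gets entropy zero, which is why an entropy cutoff can stand in for variability-based gene selection.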
- load_targets(targets)
Load precomputed target values, such as mutual information scores.
- Parameters:
targets (dict) – Dictionary of dictionaries mapping each gene pair to its target value.
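One way to assemble such a two-level dictionary from a flat list of scored pairs is sketched below. The helper targets_from_triples is hypothetical, the gene names and values are made up for illustration, and storing each pair in both directions is an assumption about what the loader expects.

```python
def targets_from_triples(triples, symmetric=True):
    """Hypothetical helper: build targets[gene_a][gene_b] = value."""
    targets = {}
    for gene_a, gene_b, value in triples:
        targets.setdefault(gene_a, {})[gene_b] = float(value)
        if symmetric:
            # Mirror the entry so lookups work in either direction.
            targets.setdefault(gene_b, {})[gene_a] = float(value)
    return targets


# Illustrative pairs only; the numbers are not real scores.
targets = targets_from_triples([
    ("CD3E", "CD8A", 0.42),
    ("CD3E", "MS4A1", -0.1),
])
```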
- static quality_control(adata, entropy_threshold=1.0)
Select genes with an entropy above the given threshold. Used in place of highly variable gene selection.
- Parameters:
adata (AnnData) – The AnnData Scanpy object that holds the dataset with expression data in .X.
entropy_threshold (float) – Minimum entropy for a gene to be included in training and downstream analyses.
- Returns:
Filtered AnnData object.
- Return type:
anndata.AnnData
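The selection step can be pictured as a filter over a gene-to-entropy dictionary of the kind get_gene_entropy is documented to return. The sketch below is illustrative only: select_genes_by_entropy is a hypothetical helper, and treating a gene exactly at the threshold as excluded is an assumption, since the description says both "above" and "minimum".

```python
def select_genes_by_entropy(entropies, entropy_threshold=1.0):
    """Hypothetical sketch: keep genes whose entropy exceeds the threshold."""
    return [gene for gene, h in entropies.items() if h > entropy_threshold]


# Made-up entropies for three genes.
kept = select_genes_by_entropy({"g1": 1.5, "g2": 0.2, "g3": 1.0})
```

The actual method returns a filtered AnnData object rather than a list of names; the list here only shows which genes would survive the cutoff.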