genevector.embedding

Gene and cell embedding classes for downstream analysis and visualization.

class CellEmbedding(dataset, embed, log_normalize=True)[source]

Bases: object

This class provides an interface to the cell embedding, which can be used for tasks such as generating a UMAP visualization, assigning cell types, and identifying the similarity between cells and metagenes.

Parameters:

dataset (genevector.dataset.GeneVectorDataset) – The GeneVectorDataset object that was constructed from the original AnnData object.
embed (genevector.embedding.GeneEmbedding) – The GeneEmbedding object constructed from the dataset.
log_normalize (bool) – If True, weight each gene vector by the log-normalized expression of that gene when averaging gene vectors into a cell vector.

__init__(dataset, embed, log_normalize=True)[source]: Constructor method
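
The effect of log_normalize can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the function name cell_vector_sketch, the use of log1p as the log normalization, and the unweighted fallback over expressed genes are all assumptions made for illustration.

```python
import numpy as np

def cell_vector_sketch(gene_vectors, expression, log_normalize=True):
    """Average gene vectors into one cell vector (hypothetical helper)."""
    gene_vectors = np.asarray(gene_vectors, dtype=float)  # genes x dims
    expression = np.asarray(expression, dtype=float)      # one value per gene
    if log_normalize:
        # Assumption: log1p stands in for the log normalization.
        weights = np.log1p(expression)
    else:
        # Assumption: every expressed gene counts equally.
        weights = (expression > 0).astype(float)
    if weights.sum() == 0:
        return np.zeros(gene_vectors.shape[1])
    return weights @ gene_vectors / weights.sum()

gene_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
counts = np.array([9.0, 0.0, 1.0])
weighted = cell_vector_sketch(gene_vectors, counts)
unweighted = cell_vector_sketch(gene_vectors, counts, log_normalize=False)
```

In this toy example the highly expressed first gene pulls the weighted vector toward its own direction, while the unweighted average treats both expressed genes equally.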

batch_correct(column, reference)[source]

Corrects the matrix of cell vectors by computing vector representations for each category in a given variable in the dataset.

Parameters:

column (str) – Covariate whose signal is removed from the cell embedding.
reference (str) – Category of the covariate selected as the reference, which remains uncorrected.

Returns:

The correction vectors applied to each category to align it with the reference category.

Return type:

dict
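
One plausible reading of this correction is a per-category mean shift toward the reference category. The sketch below shows that idea only. The function name batch_correct_sketch and the mean-shift rule are assumptions, not the library's confirmed method.

```python
import numpy as np

def batch_correct_sketch(cell_vectors, categories, reference):
    """Shift each category so its mean vector matches the reference mean."""
    X = np.asarray(cell_vectors, dtype=float)
    categories = np.asarray(categories)
    ref_mean = X[categories == reference].mean(axis=0)
    corrected = X.copy()
    corrections = {}
    for category in np.unique(categories):
        if category == reference:
            continue  # the reference category stays uncorrected
        mask = categories == category
        shift = ref_mean - X[mask].mean(axis=0)
        corrections[str(category)] = shift
        corrected[mask] += shift
    return corrected, corrections

X = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
batches = np.array(["a", "a", "b", "b"])
corrected, corrections = batch_correct_sketch(X, batches, reference="a")
```

After the shift, the mean vector of category "b" coincides with the mean of the reference "a", and the returned dictionary holds one correction vector per non-reference category.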

cell_distance(target_vec, norm=False)[source]: Computes the cosine similarity of each cell in self.adata to a target_vec. Ensures self.adata is available (set by get_adata()). The ‘norm’ parameter controls if both target_vec and cell vectors are L2 normalized before similarity calculation.
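
As a rough illustration of the quantity involved, cosine similarity between every cell vector and a target vector can be computed with NumPy as follows. The function name cosine_to_target is hypothetical, and the zero-vector guard is an assumption.

```python
import numpy as np

def cosine_to_target(cell_vectors, target_vec):
    """Cosine similarity of each row of cell_vectors to target_vec."""
    X = np.asarray(cell_vectors, dtype=float)
    t = np.asarray(target_vec, dtype=float)
    denom = np.linalg.norm(X, axis=1) * np.linalg.norm(t)
    # Assumption: all-zero vectors get similarity 0 instead of NaN.
    denom = np.where(denom == 0, 1.0, denom)
    return X @ t / denom

cells = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, 0.0]])
similarities = cosine_to_target(cells, [1.0, 0.0])
```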

cluster(adata, up_markers, down_markers={})[source]

Run GaussianMixture over cosine similarities for up and down markers.

Parameters:

adata (anndata.AnnData) – AnnData object with X_genevector embedding.
up_markers (dict) – Dictionary of upregulated genes defining phenotypes.
down_markers (dict) – Dictionary of downregulated genes defining phenotypes (optional).

Returns:

Anndata with clusters stored in metadata (“gcluster”) and probabilities (“{} Probability”).

Return type:

anndata.AnnData

compare_classification(adata, column1, column2)[source]

Plot a heatmap comparing two categorical assignments.

Parameters:

adata (AnnData) – Annotated data matrix.
column1 (str) – First obs column (rows of heatmap).
column2 (str) – Second obs column (columns of heatmap).

Returns:

Heatmap axes object.

Return type:

matplotlib.axes.Axes

compare_expression_to_similarity(adata, gene)[source]

Plot gene expression vs embedding similarity for a single gene.

Parameters:

adata (AnnData) – Annotated data with UMAP coordinates.
gene (str) – Gene symbol to compare.

denoise_cell_vectors(graph, alpha=0.5, adaptive=True, counts=None, k0=None, beta=0.0)[source]

Graph-denoise the cell matrix in gene-vector space (opt-in; modifies self.matrix).

Message passing over a (usually spatial) graph:

(1 - a) * self + a * mean(neighbours)   [+ beta * 2-hop]

With adaptive=True the per-cell weight a_i = min(0.85, k0/(k0+counts_i)) so low-count (noisy) cells borrow heavily from neighbours while high-count cells barely move — robust across sparsity regimes. Strongly improves cell typing on sparse and/or spatially organised data (e.g. low-depth Xenium), but can contaminate identity in intermixed tissue, so it is opt-in and OFF by default. It also gives previously near-empty cells a usable vector. Call BEFORE get_adata() / phenotype_probability().

Parameters:

graph – scipy sparse adjacency (cells x cells) aligned to the cell-matrix order.
alpha – smoothing weight (adaptive=False) or the per-cell cap (adaptive=True).
adaptive – scale the weight per cell by inverse count.
counts – per-cell totals; if None, derived from the loaded expression context.
k0 – adaptive midpoint; defaults to the median of positive counts.
beta – optional 2-hop weight.

Returns:

the new self.matrix (list of vectors). Original kept in self.uncorrected_matrix.
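
The one-hop update and the adaptive weight described above can be sketched in NumPy. This is an illustration under stated assumptions, not the library's code: it uses a dense adjacency array instead of a scipy sparse matrix, omits the optional 2-hop term, leaves isolated cells unchanged, and the name denoise_sketch is hypothetical.

```python
import numpy as np

def denoise_sketch(X, adjacency, counts, alpha=0.5, adaptive=True, k0=None):
    """One round of (1 - a) * self + a * mean(neighbours)."""
    X = np.asarray(X, dtype=float)          # cells x dims
    A = np.asarray(adjacency, dtype=float)  # cells x cells, dense here
    counts = np.asarray(counts, dtype=float)
    degree = A.sum(axis=1, keepdims=True)
    safe_degree = np.where(degree > 0, degree, 1.0)
    # Assumption: cells without neighbours keep their own vector.
    neighbour_mean = np.where(degree > 0, A @ X / safe_degree, X)
    if adaptive:
        if k0 is None:
            k0 = np.median(counts[counts > 0])
        a = np.minimum(alpha, k0 / (k0 + counts))  # alpha acts as the cap
    else:
        a = np.full(X.shape[0], alpha)
    a = a[:, None]
    return (1.0 - a) * X + a * neighbour_mean

adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
counts = np.array([10.0, 1000.0, 10.0])
smoothed = denoise_sketch(X, adjacency, counts, alpha=0.85)
```

With k0 defaulting to the median count (10), the two low-count cells move halfway toward their neighbour, while the high-count cell barely changes.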

static entmax_15(values)[source]

Sparse probability mapping using 1.5-entmax (the member of the entmax family between softmax and sparsemax).

Applies a sparse transformation that maps real-valued scores to a probability distribution where low-scoring entries are driven to exactly zero.

Parameters:

values (array-like) – Input scores.

Returns:

Sparse probability distribution summing to 1.

Return type:

np.ndarray
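
For intuition, 1.5-entmax maps scores z to probabilities of the form max(z/2 - tau, 0)^2, with the threshold tau chosen so the result sums to 1. The sketch below finds tau by bisection. It is a generic illustration rather than the library's implementation, and the name entmax15_sketch is hypothetical.

```python
import numpy as np

def entmax15_sketch(values, n_iter=60):
    """1.5-entmax via bisection on the threshold tau."""
    z = np.asarray(values, dtype=float) / 2.0
    # At tau = max(z) - 1 the total mass is >= 1; at tau = max(z) it is 0.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        mass = np.sum(np.clip(z - tau, 0.0, None) ** 2)
        if mass > 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** 2
    return p / p.sum()  # remove the residual bisection error

probs = entmax15_sketch([2.0, 1.0, -1.0])
```

Here the lowest score falls below the threshold and receives exactly zero probability, which a softmax would never produce.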

get_adata(min_dist=0.3, n_neighbors=50)[source]

Return an AnnData object for downstream analyses that contains the cell embedding matrix (under “X_genevector” in obsm) alongside the neighbors graph and UMAP embedding computed from the cell vectors.

Parameters:

min_dist (float) – UMAP generation parameter.
n_neighbors (int) – Number of neighbors, defined by cosine similarity, to include in the neighborhood graph.

Returns:

AnnData with the cell embedding stored in obsm.

Return type:

anndata.AnnData

static get_expression(adata, gene)[source]

Extract expression values for a single gene from an AnnData object.

Parameters:

adata (AnnData) – Annotated data matrix.
gene (str) – Gene symbol.

Returns:

Expression values for all cells.

Return type:

list

get_inverse_predictive_genes(adata, label, n_genes=10)[source]

Compute the top n least similar genes to a given variable in the dataset.

Parameters:

adata – AnnData object generated by get_adata(), with “X_genevector” in obsm.
label – Label that defines the categories for which to find the least predictive genes.
n_genes – Number of least similar genes to return for each category.

Returns:

The least similar genes to each label stored in a dictionary.

Return type:

dict

get_predictive_genes(adata, label, n_genes=10)[source]

Compute the top n most similar genes to a given variable in the dataset.

Parameters:

adata – AnnData object generated by get_adata(), with “X_genevector” in obsm.
label – Label that defines the categories for which to find predictive genes.
n_genes – Number of most similar genes to return for each category.

Returns:

The most similar genes to each label stored in a dictionary.

Return type:

dict

module_score_r2(adata, markers)[source]

Plot R-squared between pseudo-probability and module score for each phenotype.

Parameters:

adata (AnnData) – Annotated data with pseudo-probabilities and module scores.
markers (dict) – Mapping of phenotype name to gene list.

static normalized_exponential_vector(values, temperature=1e-06)[source]

Temperature-scaled softmax normalization.

Parameters:

values (array-like) – Input scores.
temperature (float) – Temperature parameter. Lower values produce sharper distributions.

Returns:

Probability distribution summing to 1.

Return type:

np.ndarray
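
A temperature-scaled softmax can be sketched in a few lines of NumPy. This is a generic, numerically stabilized version offered for illustration. The name temperature_softmax is hypothetical and the library's exact formula may differ.

```python
import numpy as np

def temperature_softmax(values, temperature=1.0):
    """exp(v / T) normalized to sum to 1, shifted by the max for stability."""
    scaled = np.asarray(values, dtype=float) / temperature
    scaled = scaled - scaled.max()  # avoids overflow for very small T
    weights = np.exp(scaled)
    return weights / weights.sum()

scores = [0.9, 0.8, 0.1]
soft = temperature_softmax(scores, temperature=1.0)
sharp = temperature_softmax(scores, temperature=1e-6)
```

At a temperature as low as 1e-6, the distribution collapses onto the highest-scoring entry, whereas a temperature of 1.0 leaves noticeable mass on the other entries.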

normalized_marker_expression(normalized_matrix, genes, cells, markers)[source]

Extract normalized expression for marker genes from a sparse matrix.

Parameters:

normalized_matrix (scipy.sparse matrix) – Normalized expression matrix.
genes (array-like) – Gene names.
cells (array-like) – Cell barcodes.
markers (list of str) – Marker genes to extract.

Returns:

Mapping from cell barcode to {gene: expression_value}.

Return type:

dict of dict

phenotype_probability(adata, phenotype_markers, return_distances=False, method='normalized_exponential', target_col='genevector', temperature=0.001, debias=0.0, contrastive=False, score_norm='none', smooth_graph=None, smooth_alpha=0.5, smooth_adaptive=True, smooth_counts=None, lp_graph=None, lp_alpha=0.0, lp_iter=3)[source]

Probabilistically assign phenotypes based on a set of cell type labels and associated markers. Loads into the anndata the pseudo-probabilities for each cell type and the deterministic label taken from the maximum probability over cell types.

Parameters:

adata (anndata.AnnData) – AnnData object. It’s assumed this is self.adata or is consistent with it, especially regarding adata.obs.index.
phenotype_markers (dict) – Dictionary of cell type labels (key) to gene markers (list of strings, value).
return_distances (bool) – If True, return a tuple: (adata, dictionary_of_raw_similarities).
method (str) – Probability conversion method: “softmax”, “sparsemax”, or “normalized_exponential”.
target_col (str) – Column name in adata.obs to store the final deterministic cell assignments.
temperature (float) – Temperature parameter for the “normalized_exponential” method.
debias (float) – Fraction (0–1) of the dataset (background) vector subtracted from each cell vector before scoring. Opt-in (default 0.0 = current behaviour). Mild values (~0.5) de-saturate scores on balanced data; large values can over-correct a dominant population (e.g. epithelial-rich tumours). See scoring_report().
contrastive (bool) – If True, subtract the mean of competing phenotype vectors from each phenotype vector before scoring (the get_predictive_genes formulation). Opt-in (default False). Helps when phenotypes are not well separated.
score_norm (str) – Per-phenotype normalization of the cell×phenotype similarity columns before the probability function. One of “none” (default), “zscore” (subtract each phenotype’s mean / divide by std — stops a phenotype that is close to everyone from winning by default), or “rank” (per-phenotype percentile rank). Opt-in; helps on some datasets, default off.
smooth_graph (scipy.sparse matrix or None) – Optional scipy sparse adjacency (cells x cells, same order as the cell embedding) to spatially denoise the cell vectors for scoring only (does not mutate self.matrix / the UMAP). Opt-in. For persistent denoising that also affects the embedding use denoise_cell_vectors().
smooth_alpha (float) – Smoothing weight (or per-cell cap when smooth_adaptive).
smooth_adaptive (bool) – Scale the smoothing weight per cell by inverse count (low-count cells borrow more). Default True.
smooth_counts (array-like or None) – Per-cell totals for adaptive smoothing; if None, derived from the loaded expression context.
lp_graph (scipy.sparse matrix or None) – Optional scipy sparse adjacency (cells x cells, same order as the cell embedding) for spatial label propagation over the soft probabilities. Opt-in.
lp_alpha (float) – Label-propagation coupling in [0, 1) (0 disables). Smooths probabilities over lp_graph as a post-step. Improves spatial coherence; can blur identity in intermixed tissue — keep small (~0.3) and opt-in.
lp_iter (int) – Number of label-propagation iterations.

Returns:

AnnData with cell type labels and probabilities. If return_distances is True, returns a tuple (adata, raw_similarities_dict).

Return type:

anndata.AnnData or tuple
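
The score_norm options can be illustrated on a small cell × phenotype similarity matrix. The sketch below is an assumption-laden illustration, not the library's code: the function name normalize_scores is hypothetical, the percentile is taken as rank / number of cells, and ties are broken arbitrarily by the sort.

```python
import numpy as np

def normalize_scores(scores, score_norm="none"):
    """Normalize each phenotype column of a cells x phenotypes matrix."""
    S = np.asarray(scores, dtype=float)
    if score_norm == "zscore":
        std = S.std(axis=0)
        std = np.where(std == 0, 1.0, std)  # guard constant columns
        return (S - S.mean(axis=0)) / std
    if score_norm == "rank":
        ranks = S.argsort(axis=0).argsort(axis=0)  # 0 = lowest in column
        return (ranks + 1) / S.shape[0]
    return S

# Phenotype 0 is similar to every cell; phenotype 1 is more selective.
scores = np.array([[0.90, 0.30],
                   [0.85, 0.50],
                   [0.80, 0.25]])
raw_labels = normalize_scores(scores).argmax(axis=1)
z_labels = normalize_scores(scores, "zscore").argmax(axis=1)
rank_scores = normalize_scores(scores, "rank")
```

On the raw scores, phenotype 0 wins every cell simply because it is close to all of them. After the per-phenotype z-score, the second and third cells are assigned to phenotype 1.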

phenotype_qc(adata, phenotype, genes, norm=True)[source]

Quality control comparison of pseudo-probability, module score, and embedding similarity.

Parameters:

adata (AnnData) – Annotated data with UMAP and phenotype probabilities.
phenotype (str) – Phenotype label name.
genes (list of str) – Marker genes for the phenotype.
norm (bool) – If True, normalize cell vectors before distance computation.

Returns:

QC metrics per cell.

Return type:

pandas.DataFrame

static plot_confusion_matrix(adata, label1, label2)[source]

Plot accuracy of GeneVector cell type assignments with a set of known cell types or clusters.

Parameters:

adata (anndata.AnnData) – AnnData object with both GeneVector cell type labels and ground-truth cell type or cluster assignments.
label1 – Target column for GeneVector cell type assignments.
label2 – Column holding the known cell types or clusters to compare against.

plot_probabilities(adata, ncols=2, save='probs.pdf', palette='magma')[source]

Plot UMAP colored by pseudo-probability for all phenotypes.

Parameters:

adata (AnnData) – Annotated data with pseudo-probability columns.
ncols (int) – Number of columns in subplot grid.
save (str) – File path for saving the figure.
palette (str) – Matplotlib colormap name.

qc_marker_dict(adata, phenotype_markers, layer=None, specificity_threshold=0.5, min_markers=2, verbose=True)[source]

Quality-control a marker dictionary before phenotyping.

Computes a provisional mean-expression labelling and, for every (phenotype, marker), reports whether the marker is in the panel, its mean expression, fraction of cells expressing it, a fold enrichment over the global mean, and a proportion-independent specificity (tau index in [0, 1] over the per-type means; 1 = specific to one type, 0 = uniform). Flags low-specificity markers (tau below specificity_threshold), off-target markers (highest in a different type than declared) and phenotypes with too few usable markers. Tau is used (not raw fold-enrichment) because fold-enrichment is confounded by class proportions and by dense/normalised data.

Parameters:

adata – AnnData with expression (raw counts recommended via layer).
phenotype_markers – dict of phenotype -> marker gene list.
layer – layer to read expression from (e.g. “counts”); defaults to .X.
specificity_threshold – tau below this flags a marker as low-specificity.
min_markers – warn if a phenotype has fewer usable markers than this.
verbose – log warnings.

Returns:

pandas.DataFrame with one row per (phenotype, marker) and the QC columns.

Return type:

pandas.DataFrame
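
The tau specificity index has a compact closed form in its commonly used definition: the average of 1 - x_i / max(x) over the per-type means, with n - 1 in the denominator. The sketch below assumes that definition. The function name tau_index is hypothetical and the library may handle edge cases differently.

```python
import numpy as np

def tau_index(type_means):
    """Tau specificity over per-type mean expression (1 = one type only)."""
    x = np.asarray(type_means, dtype=float)
    if x.size < 2 or x.max() <= 0:
        return 0.0  # assumption: undefined cases are reported as 0
    return float(np.sum(1.0 - x / x.max()) / (x.size - 1))

specific = tau_index([5.0, 0.0, 0.0])  # expressed in one type only
uniform = tau_index([2.0, 2.0, 2.0])   # same level in every type
graded = tau_index([4.0, 2.0, 0.0])
```

Because tau depends only on the per-type means, it does not change when one cell type is much more abundant than the others, which is the proportion-independence the description refers to.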

class GeneEmbedding(embedding_file, dataset=None, vector='average')[source]

Bases: object

This class provides an interface to the gene embedding, which can be used for tasks such as similarity computation, visualization, etc.

Parameters:

embedding_file (str) – Specifies the path to a set of .vec files generated for model training.
dataset (genevector.dataset.GeneVectorDataset) – The GeneVectorDataset object that was constructed from the original AnnData object.
vector (str) – Specifies if using the first set of weights (“1”), the second set of weights (“2”), or the average (“average”).

__init__(embedding_file, dataset=None, vector='average')[source]: Constructor method

static average_vector_results(vec1, vec2, fname)[source]

Average two .vec embedding files and write the result to a new file.

Parameters:

vec1 (str) – Path to first .vec file.
vec2 (str) – Path to second .vec file.
fname (str) – Output path for averaged .vec file.

compute_similarities(gene, subset=None)[source]

Compute the cosine similarities between a target gene and all other vectors in the embedding.

Parameters:

gene (str) – Target gene to compute cosine similarities.
subset (list, optional) – If given, compute similarities only against this subset of gene vectors.

Returns:

A pandas dataframe holding a gene symbol column (“Gene”) and a cosine similarity column (“Similarity”).

Return type:

pandas.DataFrame

generate_network(threshold=0.5)[source]

Computes networkx graph representation of the gene embedding.

Parameters:

threshold – Minimum cosine similarity to include as an edge in the graph.

Returns:

A networkx graph with each gene as a node and the edges weighted by cosine similarity.

Return type:

networkx.Graph
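
The thresholding step can be sketched without networkx by building a weighted edge list from pairwise cosine similarities. The function name similarity_edges, the toy vectors, and the inclusive comparison (>=) are assumptions made for illustration.

```python
import numpy as np

def similarity_edges(genes, vectors, threshold=0.5):
    """Return (gene_a, gene_b, cosine) for pairs at or above threshold."""
    V = np.asarray(vectors, dtype=float)
    # Assumption: no gene vector is all zeros.
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    sim = unit @ unit.T
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if sim[i, j] >= threshold:
                edges.append((genes[i], genes[j], float(sim[i, j])))
    return edges

genes = ["CD3E", "CD3D", "MS4A1"]
vectors = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]]
edges = similarity_edges(genes, vectors, threshold=0.5)
```

Each tuple could then be added to a graph as a weighted edge. In this toy example only the CD3E/CD3D pair passes the threshold.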

generate_vector(genes)[source]

Compute an average vector representation for a set of genes in the learned gene embedding.

Parameters:

genes (list) – List of genes whose vectors are averaged.

Returns:

The average vector for a set of genes in the gene embedding.

Return type:

list

generate_weighted_vector(genes, weights)[source]

Compute a weighted average vector representation for a set of genes in the learned gene embedding.

Parameters:

genes (list) – List of genes whose vectors are averaged.
weights (list) – List of floats, in the same order as genes, used to weight each vector.

Returns:

The weighted average vector for a set of genes in the gene embedding.

Return type:

list

get_adata(resolution=20.0)[source]

This method returns the AnnData object that contains the gene embedding with leiden clusters for metagenes, the neighbors graph, and the UMAP embedding.

Parameters:

resolution (float) – The resolution to pass to the sc.tl.leiden function.

Returns:

An AnnData object with metagenes stored in ‘leiden’ for the provided resolution, the neighbors graph, and UMAP embedding.

Return type:

AnnData

get_metagenes(gdata)[source]

Collect the metagenes (leiden clusters of genes) from the gene embedding AnnData object.

Parameters:

gdata (AnnData) – The AnnData object holding the gene embedding (from embedding.GeneEmbedding.get_adata).

Returns:

A dictionary of metagenes (identifier, gene list).

Return type:

dict

get_similar_genes(vector)[source]

Computes the similarity of each gene in the embedding to a target vector representation.

Parameters:

vector – Vector representation used to rank genes by cosine similarity.

Returns:

A pandas dataframe holding the gene symbol column (“Gene”) and a cosine similarity column (“Similarity”).

Return type:

pandas.DataFrame

get_vector(gene)[source]

Return the embedding vector for a single gene.

Parameters:

gene (str) – Gene symbol.

Returns:

Embedding vector.

Return type:

np.ndarray

plot_metagene(gdata, mg=None, title='Gene Embedding')[source]

Plot a UMAP with the genes from a given metagene highlighted and annotated.

Parameters:

gdata (AnnData) – The AnnData object holding the gene embedding (from embedding.get_adata).
mg (str, optional) – The metagene identifier (leiden cluster number).
title (str, optional) – The title of the plot.

plot_metagenes_scores(adata, metagenes, column, plot=None)[source]

Plot a Seaborn clustermap with the gene module scores for a list of metagenes over a covariate (column). Requires running score_metagenes previously.

Parameters:

adata (AnnData) – The AnnData object holding the cell embedding (from embedding.CellEmbedding.get_adata).
metagenes (dict) – Dict of metagene identifiers to plot in the clustermap.
column (str) – Covariate in the obs dataframe of the AnnData object.
plot (str) – Filename for saving the figure.

plot_similarities(gene, n_genes=10, save=None)[source]

Plot a horizontal bar plot of cosine similarity of the most similar vectors to ‘gene’ argument.

Parameters:

gene (str) – The gene symbol of the gene of interest.
n_genes (int) – Number of most similar genes to plot.
save – The path to save the figure (optional).

Returns:

A matplotlib axes object representing the plot.

Return type:

matplotlib.axes.Axes

read_embedding(filename)[source]

Read a .vec embedding file into a dict of gene -> vector.

Parameters:

filename (str) – Path to .vec file (first line: dimensions, remaining: gene vectors).

Returns:

Mapping from gene symbol to numpy array.

Return type:

dict
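
The file layout described above (a dimensions header followed by one gene per line) can be parsed with a short sketch. It assumes whitespace-separated tokens with the gene symbol first. The function name parse_vec and the sample content are hypothetical, and the sketch reads from a file-like object rather than a path.

```python
import io
import numpy as np

def parse_vec(handle):
    """Parse .vec-style text into ({gene: vector}, header line)."""
    header = handle.readline().strip()  # e.g. "<n_genes> <n_dims>"
    embedding = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        gene, values = parts[0], [float(v) for v in parts[1:]]
        embedding[gene] = np.array(values)
    return embedding, header

# Hypothetical two-gene, three-dimensional embedding.
sample = io.StringIO("2 3\nCD3E 0.1 0.2 0.3\nMS4A1 -0.4 0.5 0.6\n")
embedding, header = parse_vec(sample)
```

The same function works on a real file by passing an open text handle, for example one obtained from open(path).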

static read_vector(vec)[source]

Read a .vec file into a dict of gene vectors and a dimension string.

Parameters:

vec (str) – Path to .vec file.

Returns:

Mapping from gene symbol to list of floats, and dimension header line.

Return type:

tuple of (dict, str)

score_metagenes(adata, metagenes)[source]

Score a set of metagenes (from get_metagenes) over all cells.

Parameters:

adata (AnnData) – The AnnData object holding the cell embedding (from embedding.CellEmbedding.get_adata).
metagenes (dict) – Dict of metagenes (identifier to gene list) to score.