genevector.embedding

Gene and cell embedding classes for downstream analysis and visualization.

class CellEmbedding(dataset, embed, log_normalize=True)[source]

Bases: object

This class provides an interface to the cell embedding, which can be used for tasks such as generating a UMAP visualization, assigning cell types, and identifying the similarity between cells and metagenes.

Parameters:
  • dataset (genevector.dataset.GeneVectorDataset) – The GeneVectorDataset object that was constructed from the original AnnData object.

  • embed (genevector.embedding.GeneEmbedding) – The GeneEmbedding object constructed from the dataset.

  • log_normalize (bool) – If True, weight each gene vector by the log-normalized expression of that gene when averaging gene vectors into a cell vector.

__init__(dataset, embed, log_normalize=True)[source]

Constructor method

batch_correct(column, reference)[source]

Corrects the matrix of cell vectors by computing vector representations for each category in a given variable in the dataset.

Parameters:
  • column (str) – Covariate whose signal should be removed from the cell embedding.

  • reference (str) – Category of the covariate selected as the reference, which remains uncorrected.

Returns:

The correction vectors applied to each category relative to the reference category.

Return type:

dict
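The documentation above does not spell out how the correction vectors are computed. As a rough sketch only, one common approach is a mean shift: move each non-reference category so that its mean vector matches the mean of the reference category. The function `batch_correct_sketch`, the helper `mean_vector`, and the toy data are all hypothetical and are not taken from the library.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def batch_correct_sketch(cell_vectors, categories, reference):
    """Shift each non-reference category so its mean matches the reference mean.

    ASSUMPTION: mean-shift correction; the library's method may differ.
    """
    grouped = {}
    for vec, cat in zip(cell_vectors, categories):
        grouped.setdefault(cat, []).append(vec)
    ref_mean = mean_vector(grouped[reference])
    # One correction vector per non-reference category.
    corrections = {
        cat: [r - m for r, m in zip(ref_mean, mean_vector(vecs))]
        for cat, vecs in grouped.items()
        if cat != reference
    }
    corrected = [
        vec if cat == reference
        else [v + c for v, c in zip(vec, corrections[cat])]
        for vec, cat in zip(cell_vectors, categories)
    ]
    return corrected, corrections


# Toy cell vectors from two hypothetical batches.
vectors = [[1.0, 1.0], [3.0, 1.0], [5.0, 4.0], [7.0, 4.0]]
batches = ["ref", "ref", "other", "other"]
corrected, corrections = batch_correct_sketch(vectors, batches, reference="ref")
```

The returned `corrections` dictionary mirrors the documented return type: one vector per corrected category, with the reference category left out.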

cell_distance(target_vec, norm=False)[source]

Computes the cosine similarity of each cell in self.adata to a target_vec. Ensures self.adata is available (set by get_adata()). The ‘norm’ parameter controls if both target_vec and cell vectors are L2 normalized before similarity calculation.
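The similarity computation can be illustrated with a small standard-library sketch. The names `l2_normalize` and `cosine_to_target` and the toy barcodes are hypothetical. The sketch always L2-normalizes both sides, so it does not model how `norm=False` behaves in the library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return list(vec) if length == 0 else [x / length for x in vec]


def cosine_to_target(cell_vectors, target_vec):
    """Cosine similarity of each cell vector to target_vec."""
    target = l2_normalize(target_vec)
    return {
        cell: sum(a * b for a, b in zip(l2_normalize(vec), target))
        for cell, vec in cell_vectors.items()
    }


# Toy cell embedding keyed by hypothetical barcodes.
cells = {"cell_a": [1.0, 0.0], "cell_b": [0.0, 2.0], "cell_c": [3.0, 3.0]}
similarities = cosine_to_target(cells, [2.0, 0.0])
```

Because cosine similarity ignores vector length, `cell_a` scores 1 against the target even though the two vectors differ in magnitude.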

cluster(adata, up_markers, down_markers={})[source]

Run GaussianMixture over cosine similarities for up and down markers.

Parameters:
  • adata (anndata.AnnData) – AnnData object with X_genevector embedding.

  • up_markers (dict) – Dictionary of up-regulated genes defining phenotypes.

  • down_markers (dict) – Dictionary of down-regulated genes defining phenotypes (optional).

Returns:

AnnData with clusters stored in metadata (“gcluster”) and probabilities (“{} Probability”).

Return type:

anndata.AnnData

compare_classification(adata, column1, column2)[source]

Plot a heatmap comparing two categorical assignments.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • column1 (str) – First obs column (rows of heatmap).

  • column2 (str) – Second obs column (columns of heatmap).

Returns:

Heatmap axes object.

Return type:

matplotlib.axes.Axes
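The table behind such a heatmap is a cross-tabulation of the two label columns. The sketch below builds it with the standard library. The function name `cross_tabulate` and the toy labels are hypothetical, and the text above does not state whether the library plots raw counts or normalized fractions; raw counts are assumed here.

```python
from collections import Counter


def cross_tabulate(labels1, labels2):
    """Count co-occurrences of two categorical assignments per cell.

    Returns (row_names, column_names, table), where table[i][j] is the
    number of cells with labels1 == row_names[i] and labels2 == column_names[j].
    """
    counts = Counter(zip(labels1, labels2))
    rows = sorted(set(labels1))
    cols = sorted(set(labels2))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Toy per-cell assignments (for example, a cell type column and a cluster column).
cell_types = ["T", "T", "B", "B", "T"]
clusters = ["0", "0", "1", "0", "0"]
rows, cols, table = cross_tabulate(cell_types, clusters)
```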

compare_expression_to_similarity(adata, gene)[source]

Plot gene expression vs embedding similarity for a single gene.

Parameters:
  • adata (AnnData) – Annotated data with UMAP coordinates.

  • gene (str) – Gene symbol to compare.

static entmax_15(values)[source]

Sparse probability mapping using 1.5-entmax, which lies between softmax and sparsemax.

Applies a sparse transformation that maps real-valued scores to a probability distribution where low-scoring entries are driven to exactly zero.

Parameters:

values (array-like) – Input scores.

Returns:

Sparse probability distribution summing to 1.

Return type:

np.ndarray
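For intuition, 1.5-entmax can be written as p_i = max(z_i / 2 - tau, 0)^2, with the threshold tau chosen so that the probabilities sum to 1. The sketch below finds tau by bisection using only the standard library. The function name `entmax15_sketch` and the iteration count are assumptions, and the library may use a different (for example, sort-based) algorithm.

```python
def entmax15_sketch(values, n_iter=60):
    """1.5-entmax via bisection on the threshold tau.

    Solves sum_i max(z_i / 2 - tau, 0) ** 2 == 1 for tau.
    """
    half = [v / 2.0 for v in values]
    hi = max(half)   # at tau = hi every term is 0, so the sum is 0
    lo = hi - 1.0    # at tau = lo the largest term alone contributes 1
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        total = sum(max(h - tau, 0.0) ** 2 for h in half)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    probs = [max(h - lo, 0.0) ** 2 for h in half]
    norm = sum(probs)  # remove the tiny residual left by bisection
    return [p / norm for p in probs]


probs = entmax15_sketch([2.0, 1.0, -3.0])
uniform = entmax15_sketch([1.0, 1.0])
```

Unlike softmax, the lowest score in the first call falls below the threshold and receives a probability of exactly zero.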

get_adata(min_dist=0.3, n_neighbors=50)[source]

Return an AnnData object for downstream analyses that contains the cell embedding matrix (under “X_genevector” in obsm) alongside the neighbors graph and UMAP embedding computed from the cell vectors.

Parameters:
  • min_dist (float) – UMAP generation parameter.

  • n_neighbors (int) – Number of neighbors, defined by cosine similarity, to include in the neighborhood graph.

Returns:

AnnData with the cell embedding stored in obsm.

Return type:

anndata.AnnData

static get_expression(adata, gene)[source]

Extract expression values for a single gene from an AnnData object.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • gene (str) – Gene symbol.

Returns:

Expression values for all cells.

Return type:

list

get_inverse_predictive_genes(adata, label, n_genes=10)[source]

Compute the n least similar genes for each category of a given variable in the dataset.

Parameters:
  • adata – AnnData object generated by get_adata, with “X_genevector” in obsm.

  • label – Label that defines the categories for which to find the least predictive genes.

  • n_genes – Number of least similar genes to return for each category.

Returns:

The least similar genes to each label stored in a dictionary.

Return type:

dict

get_predictive_genes(adata, label, n_genes=10)[source]

Compute the n most similar genes for each category of a given variable in the dataset.

Parameters:
  • adata – AnnData object generated by get_adata, with “X_genevector” in obsm.

  • label – Label that defines the categories for which to find predictive genes.

  • n_genes – Number of most similar genes to return for each category.

Returns:

The most similar genes to each label stored in a dictionary.

Return type:

dict

module_score_r2(adata, markers)[source]

Plot R-squared between pseudo-probability and module score for each phenotype.

Parameters:
  • adata (AnnData) – Annotated data with pseudo-probabilities and module scores.

  • markers (dict) – Mapping of phenotype name to gene list.

static normalized_exponential_vector(values, temperature=1e-06)[source]

Temperature-scaled softmax normalization.

Parameters:
  • values (array-like) – Input scores.

  • temperature (float) – Temperature parameter. Lower values produce sharper distributions.

Returns:

Probability distribution summing to 1.

Return type:

np.ndarray
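A temperature-scaled softmax divides the scores by the temperature before exponentiating, so small temperatures push the result toward a one-hot vector. The following standard-library sketch assumes that formulation. The function name `temperature_softmax` and the toy scores are hypothetical.

```python
import math


def temperature_softmax(values, temperature=1.0):
    """Softmax of values / temperature, computed stably."""
    scaled = [v / temperature for v in values]
    shift = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.9, 0.7, 0.1]
smooth = temperature_softmax(scores, temperature=1.0)
sharp = temperature_softmax(scores, temperature=0.01)
one_hot = temperature_softmax(scores, temperature=1e-06)
```

With a temperature as small as the documented default of 1e-06, even modest score differences underflow to zero, and the output is effectively a hard assignment to the top score.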

normalized_marker_expression(normalized_matrix, genes, cells, markers)[source]

Extract normalized expression for marker genes from a sparse matrix.

Parameters:
  • normalized_matrix (scipy.sparse matrix) – Normalized expression matrix.

  • genes (array-like) – Gene names.

  • cells (array-like) – Cell barcodes.

  • markers (list of str) – Marker genes to extract.

Returns:

Mapping from cell barcode to {gene: expression_value}.

Return type:

dict of dict

phenotype_probability(adata, phenotype_markers, return_distances=False, method='normalized_exponential', target_col='genevector', temperature=0.001)[source]

Probabilistically assign phenotypes based on a set of cell type labels and associated markers. Loads into the anndata the pseudo-probabilities for each cell type and the deterministic label taken from the maximum probability over cell types.

Parameters:
  • adata (anndata.AnnData) – AnnData object. It’s assumed this is self.adata or is consistent with it, especially regarding adata.obs.index if self.cell_distance is used.

  • phenotype_markers (dict) – Dictionary of cell type labels (key) to gene markers (list of strings, value).

  • return_distances (bool) – If True, return a tuple: (adata, dictionary_of_raw_similarities).

  • method (str) – Probability conversion method: “softmax”, “sparsemax”, or “normalized_exponential”.

  • target_col (str) – Column name in adata.obs to store the final deterministic cell assignments.

  • temperature (float) – Temperature parameter for the “normalized_exponential” method.

Returns:

AnnData with cell type labels and probabilities. If return_distances is True, returns a tuple (adata, raw_similarities_dict).

Return type:

anndata.AnnData or tuple
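The last two steps described above, converting per-phenotype similarities into pseudo-probabilities and taking the maximum as the deterministic label, can be sketched as follows. The function `assign_phenotypes`, the toy similarities, and the use of a temperature-scaled softmax across phenotypes are illustrative assumptions rather than the library's code.

```python
import math


def assign_phenotypes(similarities, temperature=0.001):
    """similarities: {phenotype: [similarity per cell]}.

    Returns ({phenotype: [probability per cell]}, [label per cell]).
    """
    phenotypes = list(similarities)
    n_cells = len(next(iter(similarities.values())))
    probabilities = {p: [] for p in phenotypes}
    labels = []
    for i in range(n_cells):
        # Normalize across phenotypes for this one cell.
        scaled = [similarities[p][i] / temperature for p in phenotypes]
        shift = max(scaled)
        exps = [math.exp(s - shift) for s in scaled]
        total = sum(exps)
        row = [e / total for e in exps]
        for p, prob in zip(phenotypes, row):
            probabilities[p].append(prob)
        # Deterministic label: phenotype with the highest probability.
        labels.append(phenotypes[row.index(max(row))])
    return probabilities, labels


# Toy raw similarities for three cells and two phenotypes.
raw = {"T cell": [0.62, 0.10, 0.300], "B cell": [0.15, 0.55, 0.301]}
probs, labels = assign_phenotypes(raw)
```

The third cell shows why the probabilities are worth keeping next to the labels: its two similarities are nearly tied, so the label is "B cell" but with a much lower probability than for the second cell.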

phenotype_qc(adata, phenotype, genes, norm=True)[source]

Quality control comparison of pseudo-probability, module score, and embedding similarity.

Parameters:
  • adata (AnnData) – Annotated data with UMAP and phenotype probabilities.

  • phenotype (str) – Phenotype label name.

  • genes (list of str) – Marker genes for the phenotype.

  • norm (bool) – If True, normalize cell vectors before distance computation.

Returns:

QC metrics per cell.

Return type:

pandas.DataFrame

static plot_confusion_matrix(adata, label1, label2)[source]

Plot accuracy of GeneVector cell type assignments with a set of known cell types or clusters.

Parameters:
  • adata (anndata.AnnData) – AnnData object with both GeneVector cell type labels and ground-truth cell type or cluster assignments.

  • label1 – Target column for GeneVector cell type assignments.

  • label2 – Target column for the known cell types or clusters.

plot_probabilities(adata, ncols=2, save='probs.pdf', palette='magma')[source]

Plot UMAP colored by pseudo-probability for all phenotypes.

Parameters:
  • adata (AnnData) – Annotated data with pseudo-probability columns.

  • ncols (int) – Number of columns in subplot grid.

  • save (str) – File path for saving the figure.

  • palette (str) – Matplotlib colormap name.

class GeneEmbedding(embedding_file, dataset=None, vector='average')[source]

Bases: object

This class provides an interface to the gene embedding, which can be used for tasks such as similarity computation, visualization, etc.

Parameters:
  • embedding_file (str) – Specifies the path to a set of .vec files generated by model training.

  • dataset (genevector.dataset.GeneVectorDataset) – The GeneVectorDataset object that was constructed from the original AnnData object.

  • vector (str) – Specifies if using the first set of weights (“1”), the second set of weights (“2”), or the average (“average”).

__init__(embedding_file, dataset=None, vector='average')[source]

Constructor method

static average_vector_results(vec1, vec2, fname)[source]

Average two .vec embedding files and write the result to a new file.

Parameters:
  • vec1 (str) – Path to first .vec file.

  • vec2 (str) – Path to second .vec file.

  • fname (str) – Output path for averaged .vec file.

compute_similarities(gene, subset=None)[source]

Compute the cosine similarities between a target gene and all other vectors in the embedding.

Parameters:
  • gene (str) – Target gene to compute cosine similarities.

  • subset (list, optional) – Only compute similarities against this subset of gene vectors.

Returns:

A pandas dataframe holding a gene symbol column (“Gene”) and a cosine similarity column (“Similarity”).

Return type:

pandas.DataFrame

generate_network(threshold=0.5)[source]

Computes networkx graph representation of the gene embedding.

Parameters:

threshold (float) – Minimum cosine similarity to include as an edge in the graph.

Returns:

A networkx graph with each gene as a node and the edges weighted by cosine similarity.

Return type:

networkx.Graph
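The thresholding step can be sketched without any graph library: compute the cosine similarity for every pair of genes and keep the pairs at or above the threshold as weighted edges. The helper names and the toy embedding below are hypothetical, and the sketch assumes no gene has an all-zero vector.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(embedding, threshold=0.5):
    """Return (gene_a, gene_b, similarity) for pairs at or above threshold."""
    edges = []
    for g1, g2 in combinations(sorted(embedding), 2):
        sim = cosine(embedding[g1], embedding[g2])
        if sim >= threshold:
            edges.append((g1, g2, sim))
    return edges


# Toy two-dimensional gene embedding.
toy_embedding = {
    "CD3E": [1.0, 0.1],
    "CD3D": [0.9, 0.2],
    "MS4A1": [-0.2, 1.0],
}
edges = similarity_edges(toy_embedding, threshold=0.5)
```

Each tuple can then be added to a networkx graph with `Graph.add_edge(u, v, weight=w)`, giving a graph of the kind described above.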

generate_vector(genes)[source]

Compute an average vector representation for a set of genes in the learned gene embedding.

Parameters:

genes (list) – List of genes to generate an average vector embedding.

Returns:

The average vector for a set of genes in the gene embedding.

Return type:

list

generate_weighted_vector(genes, weights)[source]

Compute a weighted average vector representation for a set of genes in the learned gene embedding.

Parameters:
  • genes (list) – List of genes to generate an average vector embedding.

  • weights (list of float) – Weights for the gene vectors, in the same order as genes.

Returns:

The weighted average vector for a set of genes in the gene embedding.

Return type:

list
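A weighted average of gene vectors can be sketched as below. The function name `weighted_average_vector` is hypothetical, and dividing by the sum of the weights is an assumption; the library may normalize differently.

```python
def weighted_average_vector(vectors, weights):
    """Weighted mean of equal-length vectors (weights in the same order)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Two toy gene vectors; the first gene is weighted three times as heavily.
gene_vectors = [[1.0, 0.0], [0.0, 1.0]]
weighted = weighted_average_vector(gene_vectors, [3.0, 1.0])
unweighted = weighted_average_vector(gene_vectors, [1.0, 1.0])
```

With equal weights the result reduces to the plain average that generate_vector describes.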

get_adata(resolution=20.0)[source]

This method returns the AnnData object that contains the gene embedding with leiden clusters for metagenes, the neighbors graph, and the UMAP embedding.

Parameters:

resolution (float) – The resolution to pass to the sc.tl.leiden function.

Returns:

An AnnData object with metagenes stored in ‘leiden’ for the provided resolution, the neighbors graph, and UMAP embedding.

Return type:

AnnData

get_metagenes(gdata)[source]

Retrieve the metagenes (groups of genes that share a leiden cluster) from the gene embedding.

Parameters:

gdata (AnnData) – The AnnData object holding the gene embedding (from embedding.GeneEmbedding.get_adata).

Returns:

A dictionary of metagenes (identifier, gene list).

Return type:

dict
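The returned structure, a mapping from metagene identifier to gene list, amounts to grouping genes by cluster label. A minimal sketch follows, assuming the labels come from the ‘leiden’ column of the gene embedding AnnData. The function name `group_metagenes` and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_metagenes(genes, cluster_labels):
    """Group gene symbols by their cluster label."""
    metagenes = defaultdict(list)
    for gene, label in zip(genes, cluster_labels):
        metagenes[label].append(gene)
    return dict(metagenes)


# Toy gene symbols with hypothetical leiden cluster assignments.
genes = ["CD3E", "CD3D", "MS4A1", "CD79A"]
leiden = ["0", "0", "1", "1"]
metagenes = group_metagenes(genes, leiden)
```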

get_similar_genes(vector)[source]

Computes the similarity of each gene in the embedding to a target vector representation.

Parameters:

vector – Vector representation used to rank genes by cosine similarity.

Returns:

A pandas dataframe holding the gene symbol column (“Gene”) and a cosine similarity column (“Similarity”).

Return type:

pandas.DataFrame

get_vector(gene)[source]

Return the embedding vector for a single gene.

Parameters:

gene (str) – Gene symbol.

Returns:

Embedding vector.

Return type:

np.ndarray

plot_metagene(gdata, mg=None, title='Gene Embedding')[source]

Plot a UMAP with the genes from a given metagene highlighted and annotated.

Parameters:
  • gdata (AnnData) – The AnnData object holding the gene embedding (from embedding.get_adata).

  • mg (str, optional) – The metagene identifier (leiden cluster number).

  • title (str, optional) – The title of the plot.

plot_metagenes_scores(adata, metagenes, column, plot=None)[source]

Plot a Seaborn clustermap with the gene module scores for a list of metagenes over a covariate (column). Requires running score_metagenes previously.

Parameters:
  • adata (AnnData) – The AnnData object holding the cell embedding (from embedding.CellEmbedding.get_adata).

  • metagenes (dict) – Dict of metagene identifiers to plot in the clustermap.

  • column (str) – Covariate in obs dataframe of AnnData.

  • plot (str) – Filename for saving a figure.

plot_similarities(gene, n_genes=10, save=None)[source]

Plot a horizontal bar plot of the cosine similarities of the genes most similar to the given gene.

Parameters:
  • gene (str) – The gene symbol of the gene of interest.

  • n_genes (int) – Number of most similar genes to plot.

  • save – The path to save the figure (optional).

Returns:

A matplotlib axes object representing the plot.

Return type:

matplotlib.axes.Axes

read_embedding(filename)[source]

Read a .vec embedding file into a dict of gene -> vector.

Parameters:

filename (str) – Path to .vec file (first line: dimensions, remaining: gene vectors).

Returns:

Mapping from gene symbol to numpy array.

Return type:

dict

static read_vector(vec)[source]

Read a .vec file into a dict of gene vectors and a dimension string.

Parameters:

vec (str) – Path to .vec file.

Returns:

Mapping from gene symbol to list of floats, and dimension header line.

Return type:

tuple of (dict, str)
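The .vec layout described above (a header line with the dimensions, then one gene per line) can be parsed as in the sketch below. It assumes whitespace-separated fields with the gene symbol first, as in word2vec-style text files. The function name `parse_vec` and the sample contents are hypothetical.

```python
import io


def parse_vec(handle):
    """Parse .vec-style text into ({gene: [floats]}, header_line).

    ASSUMPTION: fields are whitespace-separated and the gene symbol comes first.
    """
    header = handle.readline().strip()
    embedding = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        embedding[parts[0]] = [float(x) for x in parts[1:]]
    return embedding, header


# In-memory stand-in for a .vec file: 2 genes, 3 dimensions.
sample = io.StringIO("2 3\nCD3E 0.1 0.2 0.3\nMS4A1 -0.4 0.5 0.6\n")
embedding, header = parse_vec(sample)
```

The return shape follows read_vector (a dict plus the header string). Converting each list to a NumPy array would give the mapping that read_embedding describes.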

score_metagenes(adata, metagenes)[source]

Score a set of metagenes (from get_metagenes) over all cells.

Parameters:
  • adata (AnnData) – The AnnData object holding the cell embedding (from embedding.CellEmbedding.get_adata).

  • metagenes (dict) – Dict of metagene identifiers to score over all cells.