genevector.embedding
Gene and cell embedding classes for downstream analysis and visualization.
- class CellEmbedding(dataset, embed, log_normalize=True)[source]
Bases: object
This class provides an interface to the cell embedding, which can be used for tasks such as generating a UMAP visualization, assigning cell types, and identifying the similarity between cells and metagenes.
- Parameters:
dataset (genevector.dataset.GeneVectorDataset) – The GeneVectorDataset object that was constructed from the original AnnData object.
embed (genevector.embedding.GeneEmbedding) – The GeneEmbedding object constructed from the dataset.
log_normalize (bool) – If True, weight each gene vector by the log-normalized expression of that gene when averaging gene vectors into a cell vector.
- batch_correct(column, reference)[source]
Corrects the matrix of cell vectors by computing vector representations for each category in a given variable in the dataset.
- Parameters:
column (str) – Covariate whose signal is removed from the cell embedding.
reference – Covariate category selected as the reference; it remains uncorrected.
- Returns:
The correction vectors applied to each category to align it with the reference category.
- Return type:
dict
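The idea of per-category correction vectors can be sketched in plain Python. This is a minimal illustration, assuming the correction is the offset between the mean vector of the reference category and the mean vector of each other category; the function names (mean_vector, correction_vectors, apply_corrections) are hypothetical and the library's actual computation may differ.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def correction_vectors(cell_vectors, labels, reference):
    """Offset that moves each category mean onto the reference mean (assumed scheme)."""
    by_category = defaultdict(list)
    for vec, label in zip(cell_vectors, labels):
        by_category[label].append(vec)
    if reference not in by_category:
        raise ValueError(f"reference category {reference!r} not found")
    ref_mean = mean_vector(by_category[reference])
    corrections = {}
    for label, vecs in by_category.items():
        if label == reference:
            continue  # the reference category remains uncorrected
        cat_mean = mean_vector(vecs)
        corrections[label] = [r - c for r, c in zip(ref_mean, cat_mean)]
    return corrections


def apply_corrections(cell_vectors, labels, corrections):
    """Add the category offset to every cell vector outside the reference."""
    corrected = []
    for vec, label in zip(cell_vectors, labels):
        offset = corrections.get(label)
        if offset is None:
            corrected.append(list(vec))
        else:
            corrected.append([v + o for v, o in zip(vec, offset)])
    return corrected


# Toy data: two batches of two cells each, in a 2-dimensional embedding.
cells = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [4.0, 2.0]]
batches = ["a", "a", "b", "b"]
corrections = correction_vectors(cells, batches, reference="a")
corrected = apply_corrections(cells, batches, corrections)
```

The returned dictionary has one entry per non-reference category, which matches the dict return type documented above.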
- cell_distance(target_vec, norm=False)[source]
Computes the cosine similarity of each cell in self.adata to a target_vec. Ensures self.adata is available (set by get_adata()). The ‘norm’ parameter controls if both target_vec and cell vectors are L2 normalized before similarity calculation.
- cluster(adata, up_markers, down_markers={})[source]
Run GaussianMixture over cosine similarities for up and down markers.
- Parameters:
adata (anndata.AnnData) – AnnData object with X_genevector embedding.
up_markers (dict) – Dictionary of up-regulated genes defining phenotypes.
down_markers (dict) – Dictionary of down-regulated genes defining phenotypes (optional).
- Returns:
AnnData with clusters stored in metadata (“gcluster”) and probabilities (“{} Probability”).
- Return type:
anndata.AnnData
- compare_classification(adata, column1, column2)[source]
Plot a heatmap comparing two categorical assignments.
- Parameters:
adata (AnnData) – Annotated data matrix.
column1 (str) – First obs column (rows of heatmap).
column2 (str) – Second obs column (columns of heatmap).
- Returns:
Heatmap axes object.
- Return type:
matplotlib.axes.Axes
- compare_expression_to_similarity(adata, gene)[source]
Plot gene expression vs embedding similarity for a single gene.
- Parameters:
adata (AnnData) – Annotated data with UMAP coordinates.
gene (str) – Gene symbol to compare.
- static entmax_15(values)[source]
Sparse probability mapping using 1.5-entmax (a sparse alternative to softmax that sits between softmax and sparsemax).
Applies a sparse transformation that maps real-valued scores to a probability distribution where low-scoring entries are driven to exactly zero.
- Parameters:
values (array-like) – Input scores.
- Returns:
Sparse probability distribution summing to 1.
- Return type:
np.ndarray
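A minimal pure-Python sketch of 1.5-entmax, assuming the standard sort-based closed-form solution. The function name entmax15 is illustrative, the sketch returns a list instead of an np.ndarray, and it is not the library's implementation.

```python
import math


def entmax15(values):
    """1.5-entmax of a 1-D sequence of scores (sort-based exact solution)."""
    top = max(values)
    # Shift by the maximum for numerical stability, then halve the scores
    # as required by the closed form for alpha = 1.5.
    z = [(v - top) / 2.0 for v in values]
    z_sorted = sorted(z, reverse=True)

    taus = []
    support = 0
    cum, cum_sq = 0.0, 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cum += zk
        cum_sq += zk * zk
        mean = cum / k
        mean_sq = cum_sq / k
        spread = k * (mean_sq - mean * mean)
        delta = max((1.0 - spread) / k, 0.0)
        tau = mean - math.sqrt(delta)
        taus.append(tau)
        if tau <= zk:
            support += 1  # the k-th largest score stays in the support

    # Threshold for the chosen support size; scores at or below it map to zero.
    tau_star = taus[support - 1]
    return [max(zi - tau_star, 0.0) ** 2 for zi in z]


probs = entmax15([2.0, 1.0, -3.0])
```

In this toy input the lowest score receives a probability of exactly zero, while the remaining two entries share the full probability mass.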
- get_adata(min_dist=0.3, n_neighbors=50)[source]
Return an AnnData object for downstream analyses that contains the cell embedding matrix (under “X_genevector” in obsm) alongside the neighbors graph and the UMAP embedding computed from the cell vectors.
- Parameters:
min_dist (float) – UMAP generation parameter.
n_neighbors (int) – Number of neighbors, defined by cosine similarity, to include in the neighborhood graph.
- Returns:
AnnData with the cell embedding stored in obsm (“X_genevector”).
- Return type:
anndata.AnnData
- static get_expression(adata, gene)[source]
Extract expression values for a single gene from an AnnData object.
- Parameters:
adata (AnnData) – Annotated data matrix.
gene (str) – Gene symbol.
- Returns:
Expression values for all cells.
- Return type:
list
- get_inverse_predictive_genes(adata, label, n_genes=10)[source]
Compute the top n least similar genes to a given variable in the dataset.
- Parameters:
adata – AnnData object generated from “get_adata”, with “X_genevector” in obsm.
label – Label that defines the categories for which to find the least predictive genes.
n_genes – Number of least similar genes to return for each category.
- Returns:
The least similar genes to each label stored in a dictionary.
- Return type:
dict
- get_predictive_genes(adata, label, n_genes=10)[source]
Compute the top n most similar genes to a given variable in the dataset.
- Parameters:
adata – AnnData object generated from “get_adata”, with “X_genevector” in obsm.
label – Label that defines the categories for which to find predictive genes.
n_genes – Number of most similar genes to return for each category.
- Returns:
The most similar genes to each label stored in a dictionary.
- Return type:
dict
- module_score_r2(adata, markers)[source]
Plot R-squared between pseudo-probability and module score for each phenotype.
- Parameters:
adata (AnnData) – Annotated data with pseudo-probabilities and module scores.
markers (dict) – Mapping of phenotype name to gene list.
- static normalized_exponential_vector(values, temperature=1e-06)[source]
Temperature-scaled softmax normalization.
- Parameters:
values (array-like) – Input scores.
temperature (float) – Temperature parameter. Lower values produce sharper distributions.
- Returns:
Probability distribution summing to 1.
- Return type:
np.ndarray
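The effect of the temperature parameter can be illustrated with a standard temperature-scaled softmax. This is a sketch under that assumption: the name temperature_softmax is hypothetical, it returns a list instead of an np.ndarray, and the library's exact formula may differ.

```python
import math


def temperature_softmax(values, temperature=1.0):
    """Softmax of values / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in values]
    top = max(scaled)
    # Subtracting the maximum keeps every exponent at or below zero,
    # which avoids overflow at very small temperatures.
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


similarities = [0.9, 0.5, 0.1]
soft = temperature_softmax(similarities, temperature=1.0)
sharp = temperature_softmax(similarities, temperature=0.001)
```

With temperature=1.0 the three scores yield a fairly flat distribution; with temperature=0.001 nearly all of the mass moves to the highest score, which is the sharpening behavior described for lower values.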
- normalized_marker_expression(normalized_matrix, genes, cells, markers)[source]
Extract normalized expression for marker genes from a sparse matrix.
- Parameters:
normalized_matrix (scipy.sparse matrix) – Normalized expression matrix.
genes (array-like) – Gene names.
cells (array-like) – Cell barcodes.
markers (list of str) – Marker genes to extract.
- Returns:
Mapping from cell barcode to {gene: expression_value}.
- Return type:
dict of dict
- phenotype_probability(adata, phenotype_markers, return_distances=False, method='normalized_exponential', target_col='genevector', temperature=0.001)[source]
Probabilistically assign phenotypes based on a set of cell type labels and associated markers. Loads into the anndata the pseudo-probabilities for each cell type and the deterministic label taken from the maximum probability over cell types.
- Parameters:
adata (anndata.AnnData) – AnnData object. It’s assumed this is self.adata or is consistent with it, especially regarding adata.obs.index if self.cell_distance is used.
phenotype_markers (dict) – Dictionary of cell type labels (key) to gene markers (list of strings, value).
return_distances (bool) – If True, return a tuple: (adata, dictionary_of_raw_similarities).
method (str) – Probability conversion method: “softmax”, “sparsemax”, or “normalized_exponential”.
target_col (str) – Column name in adata.obs to store the final deterministic cell assignments.
temperature (float) – Temperature parameter for the “normalized_exponential” method.
- Returns:
AnnData with cell type labels and probabilities. If return_distances is True, returns a tuple (adata, raw_similarities_dict).
- Return type:
anndata.AnnData or tuple
- phenotype_qc(adata, phenotype, genes, norm=True)[source]
Quality control comparison of pseudo-probability, module score, and embedding similarity.
- Parameters:
adata (AnnData) – Annotated data with UMAP and phenotype probabilities.
phenotype (str) – Phenotype label name.
genes (list of str) – Marker genes for the phenotype.
norm (bool) – If True, normalize cell vectors before distance computation.
- Returns:
QC metrics per cell.
- Return type:
pandas.DataFrame
- static plot_confusion_matrix(adata, label1, label2)[source]
Plot accuracy of GeneVector cell type assignments with a set of known cell types or clusters.
- Parameters:
adata (anndata.AnnData) – AnnData object with both GeneVector cell type labels and ground-truth cell type or cluster assignments.
label1 – Target column for GeneVector cell type assignments.
label2 – Column holding the known cell types or clusters to compare against.
- plot_probabilities(adata, ncols=2, save='probs.pdf', palette='magma')[source]
Plot UMAP colored by pseudo-probability for all phenotypes.
- Parameters:
adata (AnnData) – Annotated data with pseudo-probability columns.
ncols (int) – Number of columns in subplot grid.
save (str) – File path for saving the figure.
palette (str) – Matplotlib colormap name.
- class GeneEmbedding(embedding_file, dataset=None, vector='average')[source]
Bases: object
This class provides an interface to the gene embedding, which can be used for tasks such as similarity computation and visualization.
- Parameters:
embedding_file (str) – Path to the set of .vec files generated during model training.
dataset (genevector.dataset.GeneVectorDataset) – The GeneVectorDataset object that was constructed from the original AnnData object.
vector (str) – Specifies if using the first set of weights (“1”), the second set of weights (“2”), or the average (“average”).
- static average_vector_results(vec1, vec2, fname)[source]
Average two .vec embedding files and write the result to a new file.
- Parameters:
vec1 (str) – Path to first .vec file.
vec2 (str) – Path to second .vec file.
fname (str) – Output path for averaged .vec file.
- compute_similarities(gene, subset=None)[source]
Compute the cosine similarities between a target gene and all other vectors in the embedding.
- Parameters:
gene (str) – Target gene to compute cosine similarities.
subset (list, optional) – Only compute similarities against this subset of gene vectors.
- Returns:
A pandas dataframe holding a gene symbol column (“Gene”) and a cosine similarity column (“Similarity”).
- Return type:
pandas.DataFrame
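The similarity ranking described here can be sketched in plain Python over a dictionary of gene vectors. The names cosine_similarity and rank_similar_genes are hypothetical, the vectors are toy values chosen for illustration, and excluding the target gene from its own ranking is an assumption. The sketch returns a sorted list of (gene, similarity) pairs in place of the documented DataFrame.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar_genes(embedding, gene, subset=None):
    """Rank genes by cosine similarity to `gene`, most similar first."""
    target = embedding[gene]
    candidates = subset if subset is not None else list(embedding)
    rows = [
        (other, cosine_similarity(target, embedding[other]))
        for other in candidates
        if other != gene  # assumption: the target itself is left out
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Toy 2-dimensional embedding; real embeddings have many more dimensions.
toy_embedding = {
    "CD3E": [1.0, 0.0],
    "CD3D": [0.9, 0.1],
    "MS4A1": [0.0, 1.0],
}
ranked = rank_similar_genes(toy_embedding, "CD3E")
only_b = rank_similar_genes(toy_embedding, "CD3E", subset=["MS4A1"])
```

Passing subset restricts the comparison to the listed genes, mirroring the optional subset parameter above.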
- generate_network(threshold=0.5)[source]
Computes networkx graph representation of the gene embedding.
- Parameters:
threshold (float) – Minimum cosine similarity required to include an edge in the graph.
- Returns:
A networkx graph with each gene as a node and the edges weighted by cosine similarity.
- Return type:
networkx.Graph
- generate_vector(genes)[source]
Compute an average vector representation for a set of genes in the learned gene embedding.
- Parameters:
genes (list) – List of genes to generate an average vector embedding.
- Returns:
The average vector for a set of genes in the gene embedding.
- Return type:
list
- generate_weighted_vector(genes, weights)[source]
Compute a weighted average vector representation for a set of genes in the learned gene embedding.
- Parameters:
genes (list) – List of genes to generate an average vector embedding.
weights (list) – List of floats, in the same order as genes, used to weight each gene vector.
- Returns:
The weighted average vector for a set of genes in the gene embedding.
- Return type:
list
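A weighted average of gene vectors can be sketched as follows. The function name weighted_average_vector is hypothetical, and normalizing by the sum of the weights is an assumption; the library may normalize differently.

```python
def weighted_average_vector(vectors, weights):
    """Weighted mean of equal-length vectors, normalized by the total weight."""
    if len(vectors) != len(weights):
        raise ValueError("vectors and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Two toy gene vectors; the first gene carries three times the weight.
gene_vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [3.0, 1.0]
combined = weighted_average_vector(gene_vectors, weights)
unweighted = weighted_average_vector(gene_vectors, [1.0, 1.0])
```

With equal weights the result reduces to the plain average that generate_vector describes.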
- get_adata(resolution=20.0)[source]
This method returns the AnnData object that contains the gene embedding with leiden clusters for metagenes, the neighbors graph, and the UMAP embedding.
- Parameters:
resolution (float) – The resolution to pass to the sc.tl.leiden function.
- Returns:
An AnnData object with metagenes stored in ‘leiden’ for the provided resolution, the neighbors graph, and UMAP embedding.
- Return type:
AnnData
- get_metagenes(gdata)[source]
Collect the metagenes (leiden clusters of genes) from the gene embedding AnnData into a dictionary.
- Parameters:
gdata (AnnData) – The AnnData object holding the gene embedding (from embedding.GeneEmbedding.get_adata).
- Returns:
A dictionary of metagenes (identifier, gene list).
- Return type:
dict
- get_similar_genes(vector)[source]
Computes the similarity of each gene in the embedding to a target vector representation.
- Parameters:
vector – Vector representation used to rank genes by cosine similarity.
- Returns:
A pandas dataframe holding the gene symbol column (“Gene”) and a cosine similarity column (“Similarity”).
- Return type:
pandas.DataFrame
- get_vector(gene)[source]
Return the embedding vector for a single gene.
- Parameters:
gene (str) – Gene symbol.
- Returns:
Embedding vector.
- Return type:
np.ndarray
- plot_metagene(gdata, mg=None, title='Gene Embedding')[source]
Plot a UMAP with the genes from a given metagene highlighted and annotated.
- Parameters:
gdata (AnnData) – The AnnData object holding the gene embedding (from embedding.GeneEmbedding.get_adata).
mg (str, optional) – The metagene identifier (leiden cluster number).
title (str, optional) – The title of the plot.
- plot_metagenes_scores(adata, metagenes, column, plot=None)[source]
Plot a Seaborn clustermap with the gene module scores for a list of metagenes over a covariate (column). Requires running score_metagenes previously.
- Parameters:
adata (AnnData) – The AnnData object holding the cell embedding (from embedding.CellEmbedding.get_adata).
metagenes (dict) – Dict of metagenes identifiers to plot in clustermap.
column (str) – Covariate in obs dataframe of AnnData.
plot (str) – Filename for saving a figure.
- plot_similarities(gene, n_genes=10, save=None)[source]
Plot a horizontal bar plot of cosine similarity of the most similar vectors to ‘gene’ argument.
- Parameters:
gene (str) – The gene symbol of the gene of interest.
n_genes (int) – Number of most similar genes to plot.
save (str, optional) – The path to save the figure.
- Returns:
A matplotlib axes object representing the plot.
- Return type:
matplotlib.axes.Axes
- read_embedding(filename)[source]
Read a .vec embedding file into a dict of gene -> vector.
- Parameters:
filename (str) – Path to .vec file (first line: dimensions, remaining: gene vectors).
- Returns:
Mapping from gene symbol to numpy array.
- Return type:
dict
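A minimal sketch of parsing such a file, assuming a word2vec-style text layout: a header line whose last field is the vector dimension, followed by one whitespace-separated line per gene (symbol first, then the vector values). The function name read_vec_file, the file name toy.vec, and the gene names are hypothetical, and the sketch returns plain lists in place of numpy arrays.

```python
import os
import tempfile


def read_vec_file(path):
    """Parse a text embedding file into a dict of gene -> list of floats."""
    embedding = {}
    with open(path) as handle:
        header = handle.readline().split()
        dim = int(header[-1])  # assumption: last header field is the dimension
        for line in handle:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            gene = parts[0]
            values = [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError(f"{gene}: expected {dim} values, got {len(values)}")
            embedding[gene] = values
    return embedding


# Write a tiny file in the assumed layout and read it back.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.vec")
    with open(toy_path, "w") as out:
        out.write("2 3\n")
        out.write("GENE_A 0.1 0.2 0.3\n")
        out.write("GENE_B 1.0 0.0 -1.0\n")
    toy = read_vec_file(toy_path)
```

Checking each row against the header dimension catches truncated or malformed lines early, before any similarity computation is attempted.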