scapGNN: A graph neural network–based framework for active pathway and gene module inference from single-cell multi-omics data

Citation: Han X, Wang B, Situ C, Qi Y, Zhu H, Li Y, et al. (2023) scapGNN: A graph neural network–based framework for active pathway and gene module inference from single-cell multi-omics data. PLoS Biol 21(11): e3002369.

https://doi.org/10.1371/journal.pbio.3002369

Academic Editor: Sui Huang, Institute for Systems Biology, UNITED STATES

Received: February 14, 2023; Accepted: October 7, 2023; Published: November 13, 2023

Copyright: © 2023 Han et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files. scapGNN has been implemented as an R package that is freely available from CRAN (https://cran.r-project.org/web/packages/scapGNN/index.html), GitHub (https://github.com/XuejiangGuo/scapGNN), and FigShare (https://figshare.com/articles/software/scapGNN/23734017). The R package for scapGNN and related scripts are also available from Zenodo (https://doi.org/10.5281/zenodo.8322402).

Funding: This work was supported by the National Key R&D Program of China (2021YFC2700200 to XG), the Chinese National Natural Science Foundation (Grants No. 82221005 to XG, 81971439 to XG, 82001611 to YL, 31871164 to HZ, 82071702 to HZ) and the fund from Health Commission of Jiangsu Province (M2020071 to YL). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: ARI, adjusted rand index; AUC, area under the recovery curve; BCMI, bias-corrected mutual information; COVID-19, coronavirus disease 2019; DEC, definitive endoderm cell; DEG, differentially expressed gene; DNNAE, deep neural network autoencoder; EC, endothelial cell; eP, early pachytene; ESC, embryonic stem cell; FDR, false discovery rate; GAE, GNN autoencoder; GLUE, graph-linked unified embedding; GNN, graph neural network; GO, Gene Ontology; GSEA, gene set enrichment analysis; GSVA, gene set variation analysis; hPSC, human pluripotent stem cell; IAV, influenza A virus; IRS, inner root sheath; KEGG, Kyoto Encyclopedia of Genes and Genomes; LTMG, left truncated mixture Gaussian; NMI, normalized mutual information; PBMC, peripheral blood mononuclear cell; PPAR, peroxisome proliferators–activated receptor; PPI, protein–protein interaction; ROC, receiver-operating characteristic; RWR, random walk with restart; scATAC-seq, single-cell ATAC sequencing; scDART, single-cell Deep learning model for ATAC-Seq and RNA-seq Trajectory integration; scRNA-seq, single-cell RNA sequencing; ssGSEA, single-sample gene set enrichment analysis; SW, silhouette width; TAC, transit-amplifying cell; TRS, transcriptional regulation state; tSNE, t-distributed stochastic neighbor embedding; UMAP, Uniform Manifold Approximation and Projection; VGAE, variational graph autoencoder

Introduction

A biological pathway is a series of relationships between genes that lead to a certain product or a change in the biological process in the cell [1]. Some databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases, have manually grouped interacting or similarly characterized molecules into pathways or gene sets by evidence-supported annotations [2,3]. Biological pathways in distinct cell types have different activation patterns, which facilitates the understanding of cell functions. In single-cell studies, pathway activation analysis has become a powerful approach for the extraction of biologically relevant signatures to uncover the potential mechanisms of cell heterogeneity and dysfunction in human diseases [4,5]. However, the existing pathway enrichment analysis methods (e.g., gene set enrichment analysis (GSEA) [6], single-sample gene set enrichment analysis (ssGSEA) [7], and gene set variation analysis (GSVA) [8]) developed for bulk RNA-seq data have been reported to be inappropriate for single-cell sequencing data [9,10]. Compared with RNA-seq data obtained from bulk cell populations, single-cell sequencing data are much sparser, noisier, and lower in library size due to the particular sequencing techniques and experiment protocols [11]. These characteristics severely compromise the accuracy and integrity of gene-level analyses in single-cell data [1]. Hence, an efficient method is urgently needed to parcel out the pathway activity of individual cells.

Recently, some pathway enrichment methods using single-cell RNA sequencing (scRNA-seq) data, such as AUCell [12], Pagoda2 [13], and UniPath [14], have been proposed to study cellular heterogeneity. For example, AUCell calculates the area under the recovery curve (AUC) score for the pathway in the ranked list of genes for each cell as the pathway activity score. Pagoda2 fits a model to renormalize gene expression profiles and uses the first weighted principal component to quantify pathway activity scores. UniPath models the distribution of gene expression as bimodal and converts nonzero expressions into p-values. It combines the p-values of genes in the pathway and adjusts them as pathway enrichment scores using a common null background model.

These methods still have limitations that make it difficult to mine information from single-cell data. AUCell depends on the ranked list of genes, which allows it to identify only a few pathways associated with top genes at a time. Pagoda2 only focuses on the first principal component, leading to information loss. UniPath needs to construct the null background model for different species, affecting the scalability of the method. Meanwhile, the completeness of the null background model directly affects its performance. Furthermore, these methods do not make inferences about genes with dropout events for the scRNA-seq data with many zero values [9]. Besides, AUCell and Pagoda2 are designed to perform pathway analysis only for single-cell transcriptome data. UniPath proposes a corresponding pathway enrichment method that uses the hypergeometric or binomial test for single-cell ATAC sequencing (scATAC-seq) data, although it still relies on the background distribution.

Moreover, these methods can only be based on predefined pathways or gene sets. Gene modules, serving as building blocks of complex biological networks, are structural subnetworks that exhibit the same organizational patterns or functions [15,16]. Module-based analyses can achieve a higher-level understanding of the design and organization of biological systems. Genomap is an entropy-based cartography strategy that converts high-dimensional single-cell gene expression data into a configured image format and discovers cell-specific gene sets [17]. It can compute cell type–specific gene importance scores by constructing the class activation map. However, it does not evaluate the significance level of the cell type–specific gene importance. Identifying gene modules autonomously and efficiently based on cell phenotype information is conducive to understanding the mechanism of cell state transitions and the regulation of different cell phenotypes [18,19].

With the development of deep learning techniques, many methods have been developed to extract low-dimensional features from high-dimensional single-cell data and integrate single-cell multi-omics data in a low-dimensional space. Cobolt constructs a multimodal variational autoencoder based on a hierarchical Bayesian generative model that projects the single-cell multi-omics data into a shared latent space to perform visualization and clustering [20]. Single-cell Deep learning model for ATAC-Seq and RNA-seq Trajectory integration (scDART) is a deep learning framework that compresses scRNA-seq and scATAC-seq data into a shared space and aligns cells according to trajectories [21]. Graph-linked unified embedding (GLUE) also enables single-cell multi-omics data integration by encoding cells into the latent space [22]. Compared with the previous 2 methods, GLUE introduces the knowledge-based guidance graph via a graph autoencoder and extracts gene features to correct the alignment of cells in latent space. However, these methods share a common limitation: All of them align cells from different omics data within a latent space. While this facilitates cell clustering and annotation, the resulting embeddings are hard to interpret biologically, which makes it difficult to extract deep mechanisms from the data. Also, Cobolt and scDART rely on shared information, leading to information loss. They only extract low-dimensional features of cells and do not process genes, ignoring potential relationships between genes. scDART aligns cells in low-dimensional space according to cell trajectories, which may not be applicable to single-cell data without differentiation trajectories. GLUE introduces predefined knowledge-based guidance graphs, such as protein interaction networks, introducing noise beyond single-cell data.
Some statistical frameworks, such as multi-omics factor analysis (MOFA2) and a nonnegative matrix factorization algorithm (UINMF), are also designed to integrate single-cell multi-omics data [23,24]. MOFA2 builds on the Bayesian group factor analysis framework to infer a low-dimensional representation of the data in terms of a small number of (latent) factors that capture the global sources of variability. UINMF derives a nonnegative matrix factorization algorithm for integrating single-cell datasets containing both shared and unshared features. These methods still compress data into low-dimensional features for data integration and therefore still fail to explain the biological mechanisms in single-cell multi-omics data. Integrating multi-omics data at the pathway and gene module levels enables a comprehensive study of complex biological processes, highlights the interrelationship of relevant biomolecules and their functions, and can mine potential biological mechanisms that cannot be discovered by single-omics data [25,26]. Nevertheless, a gap remains in inferring active pathways and cell phenotype–associated gene modules supported by single-cell multi-omics data.

Hence, we proposed a unified framework called scapGNN, a graph neural network (GNN)-based framework that inferred and reconstructed gene–cell, gene–gene, and cell–cell association relationships to transform sparse single-cell profile data into a stable gene–cell association network. Furthermore, scapGNN integrated single-cell multi-omics data, calculated single-cell pathway activity scores, and identified cell phenotype–associated gene modules by quantifying network information. Real and simulated single-cell datasets were used to benchmark the performance of scapGNN, demonstrating that it outperformed state-of-the-art methods in multiple single-cell data analysis tasks.

Results

Overview of the scapGNN framework

scapGNN leveraged the GNN model with a multimodal autoencoder to convert the sparse, unstable single-cell profiling data into a stable gene–cell association network to identify active pathways and cell phenotype–associated gene modules from single-cell multi-omics data of scRNA-seq and scATAC-seq. The random walk with restart (RWR) algorithm based on graph theory further inferred the pathway activity score matrix and identified cell phenotype–associated gene modules (Fig 1A and Materials and methods).

Fig 1. An overview of the scapGNN framework.

(A) The input was the gene–cell matrix of scRNA-seq or the gene activity matrix generated from scATAC-seq. A graph-based autoencoder, which contained a DNNAE and a VGAE, learned the latent associations between genes and cells. The RWR algorithm quantified pathway activity and identified cell phenotype–associated gene modules. (B) Main functions of scapGNN included inferring single-cell pathway activity profiles, constructing cell cluster association networks, identifying cell phenotype–associated gene modules under multiple cell phenotypes, and quantifying the importance of genes in the pathway. DNNAE, deep neural network autoencoder; LTMG, left truncated mixture Gaussian; RWR, random walk with restart; scATAC-seq, single-cell ATAC sequencing; scRNA-seq, single-cell RNA sequencing; VGAE, variational graph autoencoder.


https://doi.org/10.1371/journal.pbio.3002369.g001

The GNN model took as input a preprocessed gene–cell matrix after the removal of low-quality cells and genes, normalization, and selection of highly variable features [27] or informative differentially expressed genes (DEGs) [28] for the scRNA-seq gene expression profile or gene activity matrix of scATAC-seq (Materials and methods). First, the encoder of the deep neural network autoencoder (DNNAE) learned the low-dimensional embeddings of cell and gene features from the gene–cell matrix. A matrix factorization-based decoder was used to infer the potential gene–cell association matrix [29] (Fig 1A and Materials and methods). The left truncated mixture Gaussian (LTMG) model [30] was used to extract the transcriptional regulation states (TRSs) from the gene–cell matrix through the kinetic relationships of the transcriptional regulatory inputs, mRNA metabolism, and abundance in single cells (Materials and methods). The TRSs elucidated the multiple expression states of genes across single cells, and a high signal value indicated that the gene was in a true active expression state in the cell [31]. The TRS signal enhanced the signal-to-noise ratio of scapGNN, controlled the direction of neural network learning, and ensured that the inferences of gene–cell associations were based on the true state of gene expression. Therefore, we used TRSs and gene expression to construct the loss function and learn new potential associations. In brief, the gene–cell association matrix was inferred by the DNNAE integrating the gene expression features and the multimodal kinetic relationships of the gene across single cells. A gene with high association strength in a cell indicated that it was more likely to have a high level of activity in the cell compared with other genes.

Second, we used a GNN autoencoder (GAE), which is a variational graph autoencoder (VGAE) [32] containing a 2-layer graph convolution network, to perform edge task inference in gene–gene and cell–cell correlation networks. The encoder of GAE took the embeddings of genes or cells as the features of nodes in the gene or cell correlation network that was constructed from the gene–cell matrix using the Pearson correlation coefficient, similar to the study by Dong and colleagues [33] (Materials and methods). The decoder inferred new association relations to regenerate the gene–gene or cell–cell association network (Fig 1A and Materials and methods). The graph autoencoder reduced the amount of spurious information encoded by reconstructing the input networks through the loss function. GAE was a framework for unsupervised learning on graph-structured data that could automatically retain high-quality node relationships and remove spurious ones. It inferred new relationships based on the shared topological features of correlation networks in aggregate and output comprehensive results that covered the entire space of information associated with the input data, while remaining agnostic to any particular view of biological function. With the GAE, scapGNN could gradually embed related genes or cells of correlation networks closer during the training process, while ensuring that unrelated genes or cells remained far apart [34].
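
As a rough illustration of the correlation network that serves as input to the GAE, the following Python sketch computes pairwise Pearson correlation coefficients between expression profiles and keeps the pairs above a cutoff as weighted edges. The cutoff value, the function names, and the toy profiles are assumptions made for this example only; the construction actually used by scapGNN is described in Materials and methods.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def correlation_network(profiles, threshold=0.5):
    """Return {(a, b): r} for every pair whose correlation exceeds the cutoff.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    names = list(profiles)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if r > threshold:
                edges[(a, b)] = r
    return edges

# Toy expression profiles (rows: genes, columns: 4 cells); values are made up.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
gene_edges = correlation_network(toy_profiles)
print(gene_edges)
```

The same function can be applied to cell profiles (columns of the gene–cell matrix) to obtain a cell–cell correlation network.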

Finally, we integrated the results of the GNN model to construct a weighted gene–cell association network (Materials and methods). The RWR algorithm used genes in the pathway as seeds to calculate adjusted probability scores that represented a proximity measure between the pathway and each cell and were used as the pathway activity scores across individual cells (Fig 1B and Materials and methods). Compared with other single-cell pathway scoring methods that only considered gene expression, scapGNN captured more biological information, including gene–cell associations, cell–cell associations, and gene–gene associations, when calculating pathway activity scores. For gene–cell associations, highly active pathways placed genes closer to their corresponding cells. Cell–cell associations were also considered to further distinguish cells with highly active pathways from those with less active pathways. This allowed the pathway activity score to accurately represent the heterogeneity between cells. Gene–gene associations as a background could enhance the signal-to-noise ratio of pathway activity scores. In the same way, we set cells belonging to the same phenotype (cell type, time, and disease state) as seeds to automatically identify the cell phenotype–associated gene modules (Fig 1B and Materials and methods). The gene module is the set of genes that are significantly most important for characterizing the identity of that cell phenotype. For multiple cell phenotypes, we also quantified the propensity of genes to be expressed between cell phenotypes and provided a visualization program for the network of cell phenotype–associated gene modules. We could gain insight into cell phenotype–specific genes, or genes expressed in multiple cell phenotypes, and the strength of gene–gene associations.
The foundation of the scapGNN framework was to infer the gene–cell association network based on gene expression features in cells. Hence, the biological significance conveyed by the associations between genes was the coexpression regulatory relationships.
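
To make the seeding idea concrete, here is a minimal Python sketch of a generic random walk with restart on a small weighted gene–cell network, with the pathway genes as seeds and the stationary probabilities of the cell nodes read off as raw proximity scores. It is only an illustration under stated assumptions: the restart probability of 0.7, the toy network, and all names are hypothetical, and the adjustment step that scapGNN applies to obtain its final pathway activity scores (Materials and methods) is not reproduced.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Generic RWR on a weighted graph given as {node: {neighbor: weight}}.

    `restart` (0.7 here) is an assumed value, not one taken from the paper.
    Returns the stationary visiting probability of every node.
    """
    nodes = list(adj)
    strength = {u: sum(adj[u].values()) for u in nodes}
    # The walker starts from, and restarts at, the seed nodes with equal probability.
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            if strength[u] == 0.0:
                continue  # isolated node: nothing to propagate
            share = (1.0 - restart) * p[u] / strength[u]
            for v, w in adj[u].items():
                nxt[v] += share * w  # move along edges in proportion to weight
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy symmetric network: genes g1-g3, cells c1-c2; weights are made up.
toy_network = {
    "g1": {"c1": 0.9, "g2": 0.5},
    "g2": {"c1": 0.8, "g1": 0.5},
    "g3": {"c2": 0.9},
    "c1": {"g1": 0.9, "g2": 0.8, "c2": 0.1},
    "c2": {"g3": 0.9, "c1": 0.1},
}
# A hypothetical pathway made of g1 and g2 is used as the seed set.
scores = random_walk_with_restart(toy_network, seeds={"g1", "g2"})
cell_scores = {n: s for n, s in scores.items() if n.startswith("c")}
print(cell_scores)
```

In this toy network, cell c1 is directly connected to both seed genes and therefore receives a higher proximity score than c2. Swapping the roles, with cells of one phenotype as seeds and gene nodes read off, mirrors how gene modules are identified in the text.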

We also clustered the cells into communities using 3 community detection algorithms in the cell–cell association network (Fig 1B and Materials and methods). We merged the same cell types or identified cell communities to construct cell community networks from the cell–cell association network. The edges between nodes in this network indicated the strength of similar associations between cell types or cell communities.

We developed a network fusion method that combines the gene–cell association networks from different types of omics data to integrate single-cell multi-omics data (S1 Fig and Materials and methods). For instance, scRNA-seq and scATAC-seq data were processed using scapGNN to generate gene–cell association networks. The two gene–cell association networks were combined using Brown’s method to generate a multi-omics supporting gene–cell association network, which was processed by RWR to calculate pathway activity scores and identify cell phenotype–associated gene modules.

In summary, scapGNN enabled the construction of gene–cell association networks from single-cell multi-omics data, the calculation of single-cell pathway activity scores, the discovery of key active genes in the pathway, the identification of cell phenotype–associated gene modules and cell communities, and the integration of single-cell multi-omics data (Fig 1B and S1 Fig).

Pathway activity score of scapGNN represented cell heterogeneity

Three scRNA-seq datasets for different biological purposes, including cell type, cell subtype, and time series, were used to test the performance of scapGNN in representing cell heterogeneity at the pathway level (S1 Table and S1 Text). The Uniform Manifold Approximation and Projection (UMAP) for dimension reduction [35] and t-distributed stochastic neighbor embedding (tSNE) [36] were used to visualize the cell clustering results of scapGNN and the existing state-of-the-art single-cell pathway enrichment methods (AUCell, Pagoda2, and UniPath) (S2 Table). Compared with these pathway enrichment methods, scapGNN clustered cells within the same type more densely and separated cells of different types more distinctly for the mouse pancreas (Fig 2A and S2A Fig). For the cell subtype dataset of the human embryonic stem cell (ESC)-derived dopaminergic neurons, only scapGNN aggregated different subtypes of cells, such as ESC-derived progenitor subtype 1a (eProg1a) and 1b (eProg1b) cells, into individual clusters (Fig 2B and S2B Fig). This implied that scapGNN was expected to help discover new cell subtypes. For the time series dataset of the human pluripotent stem cells (hPSCs), scapGNN could better distinguish the state of cells at different times (Fig 2C and S2C Fig). For these 3 scRNA-seq datasets, we further computed 3 cell clustering accuracy indicators [adjusted rand index (ARI), normalized mutual information (NMI), and silhouette width (SW)]. scapGNN showed higher accuracy compared with AUCell, Pagoda2, and UniPath in all of 10 state-of-the-art single-cell clustering methods (S3 Fig). We further calculated the bias-corrected mutual information (BCMI) between pseudotime inferred based on pathway activity scores and true cellular timestamps (S1 Text) [37].
As shown in S4 Fig, scapGNN could more accurately represent the true cellular temporal state compared with other methods. In addition, UniPath proposed a temporal-ordering method for single-cell pathway activity scores [14]. Therefore, we adopted the UniPath method to remove cell cycle–related genes and infer temporal ordering for different single-cell pathway activity scoring methods. As shown in S5 Fig, the pathway activity scores calculated using scapGNN and UniPath performed better in characterizing the temporal ordering of cells compared with AUCell and Pagoda2. Our framework merged the same cell types or identified cell communities in the cell–cell association network to observe the strength of association between them. For the time series dataset, the association between cells at the differentiation end (36, 72, and 96 h) of the definitive endoderm cells (DECs) was stronger (S6A Fig). This indicated that the cells at these time points had reduced differentiation drive and more similar cellular functions. Early-differentiating DEC and late-differentiating DEC tended to be in different clusters for the cell communities we identified in the cell association network (S6B Fig).
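
For readers unfamiliar with the clustering indicators mentioned above, the following Python sketch computes the adjusted rand index from its standard pair-counting definition. It is a generic textbook implementation written for illustration; the function name and the toy labels are assumptions, and it is not the code used for the benchmark.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings (pair-counting form)."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (>= 2 cells)")
    # Pairs of cells that are grouped together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)  # chance agreement
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (together_both - expected) / (maximum - expected)

# Toy example: 4 cells of two known types and two candidate clusterings.
cell_types = ["alpha", "alpha", "beta", "beta"]
ari_perfect = adjusted_rand_index(cell_types, [1, 1, 0, 0])   # same partition, renamed
ari_crossed = adjusted_rand_index(cell_types, [0, 1, 0, 1])   # splits every cell type
print(ari_perfect, ari_crossed)
```

An ARI of 1 indicates that the clustering reproduces the reference labels exactly, values near 0 indicate chance-level agreement, and negative values indicate agreement worse than chance.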

Fig 2. Evaluation of cell clustering based on pathway activity scores.

UMAP visualizations of cell type data (A), time series (B), and cell subtype data (C) based on pathway activity scores using the 4 pathway enrichment methods (AUCell, Pagoda2, UniPath, and scapGNN). (D) Box plot of average ARI, average NMI, and average SW at the gene and pathway levels from AUCell, Pagoda2, UniPath, and scapGNN using 10 state-of-the-art single-cell clustering methods on 16 scRNA-seq data sets. The paired-sample Wilcoxon signed-rank test was used to calculate the significance p-values (* p < 0.05, ** p < 0.01, and *** p < 0.001) for the differences in each cell clustering indicator between scapGNN and other methods. The data underlying this figure can be found in S1 Data. ARI, adjusted rand index; NMI, normalized mutual information; scRNA-seq, single-cell RNA sequencing; SW, silhouette width; UMAP, Uniform Manifold Approximation and Projection.


https://doi.org/10.1371/journal.pbio.3002369.g002

We further benchmarked scapGNN using 16 scRNA-seq datasets, including the aforementioned 3 datasets (S3 Table), 10 single-cell clustering methods (S4 Table), and 3 accuracy quantification indicators to systematically evaluate the cell clustering accuracy (Materials and methods). The results revealed that scapGNN had significantly higher accuracy in cell clustering and better represented the heterogeneity of cells compared with the other methods (Fig 2D). Although gene and pathway levels are different aspects, scapGNN still had better cell clustering performance (Fig 2D). Consistent with Su’s study [10], traditional bulk RNA-seq pathway enrichment analysis methods performed poorly on single-cell datasets. They were, therefore, not compared (S7 Fig). The average detected gene number in cells of each scRNA-seq dataset was determined, and the 16 benchmark datasets were classified into 2 groups (8 high-detection gene number datasets and 8 low-detection gene number datasets) based on their median values (S8A Fig). We found that all methods in the low-detection gene number dataset had reduced cell clustering performance, but scapGNN still outperformed the other methods (S8B Fig). We also determined the values of ARI and NMI for AUCell, Pagoda2, UniPath, and scapGNN by 10 cell clustering methods. Overall, scapGNN performed better than other single-cell pathway activity enrichment methods (S9 Fig). The use of 16 datasets and 10 clustering methods demonstrated the scalability of scapGNN. Furthermore, in scapGNN, the calculation of single-cell pathway activity scores was based on a random sampling process.
The distribution of pathway activity scores was unimodal like a normal distribution, which provided a meaningful representation of cell states and could be applied to downstream analysis such as differential analysis or marker pathway identification [38].

We evaluated scapGNN using the gold-standard datasets with batch effects, which contained 5 cell lines and were analyzed by 3 sequencing protocols, to integrate and analyze the scRNA-seq data provided by different research groups or experimental platforms [39] (S1 Table). The datasets were integrated using Seurat v4 [27] and transformed by scapGNN into a pathway activity score matrix. UMAP visualization showed that scapGNN could further reduce batch effects and improve the cell clustering accuracy relative to AUCell, Pagoda2, and UniPath (S10A and S10B Fig). Meanwhile, scapGNN had higher accuracy for identifying the correct marker gene set for A549 cells in the integrated data compared with other methods (S10C Fig).

We then tested the cell clustering performance of scapGNN on atlas-scale scRNA-seq datasets. A mouse cell atlas dataset [40] (S1 Table) was used, and the cells containing fewer than 800 nonzero expressed genes were filtered (set parameter min.features = 800 in Seurat). As shown in S11 Fig, scapGNN also exhibited excellent cell clustering performance in large atlas-level data.

We further assessed the uniformity of scapGNN by ablation experiments (S1 Text). For the cell type dataset, we computed cell clustering accuracy indicators for the pathway activity scores from DNNAE, GAE+LTMG, and scapGNN-LTMG. By comparison, the unified scapGNN framework performed better (S12 Fig).

scapGNN accurately identified active pathways and cell phenotype–associated gene modules

We used the cell marker gene sets of 460 cell types collected by Chawla and colleagues [14] from CellMarker [41], BioGPS [42], and Harmonizome [43] databases to evaluate the performance of pathway identification due to the lack of gold-standard pathway data in single-cell studies (Materials and methods). For homogeneous data (K562 and A549) and heterogeneous datasets (GM12878 and ESC), we counted the proportion of cells with the correct marker gene sets of the 4 cell types detected in the top 5 of the pathway activity scores (S1 Table). The results showed that scapGNN detected substantially more cells with the correct marker gene sets in both homogeneous (containing only one cell type) and heterogeneous (containing multiple cell types) data compared with AUCell, Pagoda2, and UniPath (Fig 3A). Although the homogeneous data contained only one cell type, scapGNN could still identify the truly active gene set. This was because the gene–cell association network contained gene–cell associations in which genes with high activity were more likely to be in proximity to the corresponding cells compared with other genes. We collected the scRNA-seq datasets that contained T and B cells to further verify the accuracy of the pathway enrichment (S1 Table). scapGNN could more accurately identify known T- and B cell activated pathways (T- and B cell receptor signaling pathways) than AUCell, Pagoda2, and UniPath (Fig 3B). The true active pathways (T- and B cell receptor signaling pathways) of T and B cells tended to have high-activity scores (S13A and S13B Fig). We used Seurat to identify marker pathways in T and B cells and found that the T- and B cell receptor signaling pathways could be identified (adjusted p-value < 0.0001) and ranked first in ascending order of adjusted p-values (S13C and S13D Fig).
We assessed the stability of scapGNN pathway scoring by grouping T or B cells with different cells. The results showed that scapGNN could consistently identify the T- and B cell receptor signaling pathways, even in different combinations of datasets (S14 Fig). This suggested that although scapGNN used cell–cell association to enhance the ability of the pathway activity score to discriminate between cells, it did not affect the proximity of genes contained in the true active pathway to the corresponding cells. This also highlighted the advantage of scapGNN integrating more biological information to calculate pathway activity scores.
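
The "proportion of cells with the correct gene set in the top k" criterion used in this evaluation can be sketched in a few lines of Python. The function name, the gene set labels, and the scores below are hypothetical and serve only to show how such a proportion could be computed from a per-cell score table.

```python
def top_k_hit_rate(score_rows, target, k=5):
    """Fraction of cells whose k highest-scoring gene sets include `target`.

    `score_rows` holds one {gene_set_name: activity_score} dict per cell.
    """
    if not score_rows:
        raise ValueError("no cells given")
    hits = 0
    for scores in score_rows:
        ranked = sorted(scores, key=scores.get, reverse=True)  # best first
        if target in ranked[:k]:
            hits += 1
    return hits / len(score_rows)

# Three toy cells scored against three marker gene sets (made-up numbers).
toy_cells = [
    {"K562 markers": 0.9, "A549 markers": 0.2, "ESC markers": 0.1},
    {"K562 markers": 0.4, "A549 markers": 0.6, "ESC markers": 0.1},
    {"K562 markers": 0.1, "A549 markers": 0.5, "ESC markers": 0.7},
]
rate_top1 = top_k_hit_rate(toy_cells, "K562 markers", k=1)
rate_top2 = top_k_hit_rate(toy_cells, "K562 markers", k=2)
print(rate_top1, rate_top2)
```

Sweeping k from 1 to 5 over a real score matrix would give one curve per method, comparable in spirit to the per-rank proportions reported for the 4 methods.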

Fig 3. Accuracy evaluation of scapGNN in identifying pathway and cell phenotype–associated gene modules.

(A) Proportion of cells that detected the correct marker gene sets using scapGNN and the other 3 methods in the top 1–5 of the pathway activity scores on both homogeneous and heterogeneous scRNA-seq datasets. (B) Accuracy of 4 pathway enrichment methods for identifying gold-standard pathways. The proportion of cells with T- or B cell receptor signaling pathway in the top 5 pathways. (C) ROC curves of T cell–associated gene modules in T cell (homogeneous) or T- and B cell (heterogeneous) scRNA-seq datasets using genes of the T cell receptor signaling pathway as the gold standard. (D) ROC curves of B cell–associated gene modules in B cell (homogeneous) or T cell and B cell (heterogeneous) scRNA-seq datasets using genes of the B cell receptor signaling pathway as the gold standard. (E) ROC curves of ESC-associated gene modules and A549-associated gene modules in the ESC (heterogeneous) and A549 (homogeneous) datasets using marker genes of the corresponding cell type as the gold standard. The data underlying this figure can be found in S2 Data. ESC, embryonic stem cell; ROC, receiver-operating characteristic; scRNA-seq, single-cell RNA sequencing.


https://doi.org/10.1371/journal.pbio.3002369.g003

Next, we evaluated the accuracy of scapGNN in identifying cell phenotype–associated gene modules. For homogeneous and heterogeneous data, we calculated the association scores between cell types and genes, using cells of the same type as seeds, and sorted the scores in descending order. Using the marker genes of the corresponding cell type or the genes of the T- and B cell receptor signaling pathways as the gold standard, we applied receiver-operating characteristic (ROC) curve analysis to assess the accuracy of scapGNN in identifying phenotype-associated gene modules. In addition, we used genomap, an entropy-based cartography method for discovering cell- and class-specific gene sets from scRNA-seq data, as a benchmark method (S2 Table) [17]. genomap provides activation values for each gene in a specified cell type. However, training the genomap model relies on known ground-truth cell labels for the dataset. When all cell type labels were the same (i.e., homogeneous data), genomap returned zero activity values for all genes. Therefore, we could compare scapGNN and genomap only on heterogeneous data. As shown in Fig 3C–3E, scapGNN identified cell type–associated active genes whether a marker pathway or marker genes served as the gold standard. genomap was also effective in identifying cell type–associated active genes, particularly cell marker genes. Both scapGNN and genomap were robust to dropout noise (S15 Fig). However, scapGNN was more accurate than genomap (Fig 3C–3E and S15 Fig). An additional advantage of scapGNN over genomap was its ability to provide significance-level p-values for the association scores of each gene with cell phenotypes. We further used GO terms to evaluate the functional modularity of cell phenotype–associated gene modules (S1 Text). As shown in S16 Fig, the gene modules identified by scapGNN tended to be more functionally modular.
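As a rough illustration of the ROC analysis described above, the area under the ROC curve can be computed directly from gene association scores and a gold-standard gene set. The sketch below uses the rank-based (Mann–Whitney) formulation, which equals the area under the ROC curve; the function name `roc_auc`, the gene labels, and the scores are illustrative assumptions, not values from the study.

```python
def roc_auc(scores, gold):
    """Area under the ROC curve for ranking gold-standard genes above the rest.

    scores: dict mapping gene -> association score (higher = stronger association).
    gold:   set of genes treated as true positives.
    Computed as the probability that a random gold gene outscores a random
    non-gold gene, counting ties as one half.
    """
    pos = [s for g, s in scores.items() if g in gold]
    neg = [s for g, s in scores.items() if g not in gold]
    if not pos or not neg:
        raise ValueError("need at least one gold and one non-gold gene")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up association scores; gene names serve only as labels here.
toy_assoc = {"Cd3e": 0.9, "Lck": 0.8, "Zap70": 0.4,
             "Actb": 0.6, "Gapdh": 0.2, "Alb": 0.1}
toy_gold = {"Cd3e", "Lck", "Zap70"}
auc = roc_auc(toy_assoc, toy_gold)
```

In this toy case 8 of the 9 gold/non-gold pairs are ordered correctly, so the AUC is 8/9.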

Robustness evaluation of scapGNN

In this section, we evaluated the robustness of scapGNN under noise from different sources using sciPath [1], a performance evaluation framework for integrating pathways with single-cell data. First, we randomly converted nonzero expression values to zero to simulate dropout noise. This noise was added to the cell subtype dataset at different strengths (5%, 10%, 15%, and 20%). We then quantified robustness as the AUC of the curve relating noise proportion to each cell clustering accuracy indicator. UMAP visualizations and AUC scores showed that scapGNN effectively preserved the characteristics of cells and maintained better cell clustering accuracy than AUCell, Pagoda2, and UniPath (Fig 4A and 4B and S17A–S17C Fig).
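The dropout simulation described above can be sketched in a few lines. This is a minimal illustration under one assumption: the noise strength is interpreted as the fraction of nonzero entries set to zero. The text does not state whether the percentage refers to nonzero entries or to all entries, and the function names `add_dropout` and `zero_rate` are hypothetical.

```python
import random


def add_dropout(matrix, rate, seed=0):
    """Return a copy of `matrix` with a fraction `rate` of its nonzero entries zeroed.

    matrix: list of rows (genes x cells) of expression values.
    rate:   assumed to mean the fraction of *nonzero* entries to drop.
    """
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    n_drop = round(rate * len(nonzero))
    noisy = [list(row) for row in matrix]  # leave the input untouched
    for i, j in rng.sample(nonzero, n_drop):
        noisy[i][j] = 0
    return noisy


def zero_rate(matrix):
    """Fraction of entries equal to zero."""
    total = sum(len(row) for row in matrix)
    return sum(v == 0 for row in matrix for v in row) / total


# Toy matrix with 20 nonzero entries out of 24.
toy = [[1, 2, 3, 4, 5, 0],
       [1, 2, 3, 4, 0, 5],
       [1, 2, 3, 0, 4, 5],
       [1, 2, 0, 3, 4, 5]]
noisy_10 = add_dropout(toy, 0.10)
```

Repeating the call at 0.05, 0.10, 0.15, and 0.20 and recomputing a clustering accuracy indicator on each noisy matrix would yield the noise-versus-accuracy curve whose AUC is reported.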


Fig 4. Robustness evaluation.

(A) UMAP visualizations of scapGNN on the cell subtype dataset with different strengths of dropout noise. The AUC of 3 cell clustering accuracy indicators for AUCell, Pagoda2, UniPath, and scapGNN under dropout noise (B) and Gaussian noise (C). (D) Proportion of ES cells with the corresponding marker gene set within the top 1–5 pathway scores for the 4 pathway enrichment methods under different strengths of dropout noise on the EC dataset. The data underlying this figure can be found in S3 Data. ARI, adjusted rand index; EC, endothelial cell; ES, XXXX; NMI, normalized mutual information; SW, silhouette width; UMAP, Uniform Manifold Approximation and Projection.


https://doi.org/10.1371/journal.pbio.3002369.g004

Further, we added Gaussian noise to the cell subtype dataset and found that scapGNN remained more robust than the other 3 methods under this different noise type (Fig 4C). For real data, we computed the zero-value rate of 16 scRNA-seq datasets (S3 Table) and calculated the AUC of the curve relating zero-value rate to each cell clustering accuracy indicator (S17D Fig). This result indicated that scapGNN was suitable for real scRNA-seq data with different zero-value rates and showed more robust performance than AUCell, Pagoda2, and UniPath.

Next, we examined the influence of zero values and Gaussian noise on the pathway identification accuracy of scapGNN. For the endothelial cell (EC) and ESC datasets (S1 Table), we determined the proportion of cells in which the correct marker gene sets of the corresponding cell types were detected within the top 1 to 5 pathway scores under different noise strengths. The pathway identification performance of scapGNN was robust, and its accuracy was higher than that of AUCell, Pagoda2, and UniPath across noise strengths overall (Fig 4D and S17E Fig).

Application of scapGNN to scATAC-seq data

In addition to scRNA-seq, scapGNN could also process single-cell epigenome data. We used a mouse cortical brain dataset and a peripheral blood mononuclear cell (PBMC) dataset, from 2 different species (mouse and human), to evaluate the performance of scapGNN on scATAC-seq data (S5 Table). scapGNN maintained high-accuracy cell clustering and pathway identification performance, which was robust to different strengths of dropout noise (Fig 5A–5C and S18A Fig). For the scATAC-seq data, scapGNN could also stably identify cell-intrinsic active pathways when cells were combined with different cell types (S18B Fig). We next used scapGNN to identify active pathways in the scATAC-seq data of the PBMC dataset. The results of cell type–specific marker gene set identification showed that UniPath, using binomial and hypergeometric tests for pathway enrichment, performed comparably to scapGNN only on monocytes and failed on natural killer cells and naive CD8+ T cells (Fig 5D). For the known active T-cell receptor signaling pathway in T cells, scapGNN identified the active pathways of T cells more accurately (Fig 5E). The T-cell receptor signaling pathway had a higher pathway activity score and could be successfully identified using Seurat as a marker pathway for T cells (S19A and S19B Fig). We also evaluated the stability of scapGNN pathway scoring in scATAC-seq data. scapGNN consistently identified the T-cell receptor signaling pathway within the top 5 pathways of T cells (S20 Fig). We tested the robustness of scapGNN on scATAC-seq data by adding different strengths of dropout noise to the PBMC dataset. Both scapGNN and the hypergeometric test method of UniPath were highly robust to dropout noise in the scATAC-seq data (Fig 5F). Thus, scapGNN performed well in the analysis of pathway activities for scATAC-seq datasets.


Fig 5. Performance of scapGNN on scATAC-seq data.

(A) UMAP visualization of the mouse cortical brain dataset using the pathway activity score matrix of scapGNN. (B) Bar graph of the 4 cell clustering accuracy indicators for the pathway activity score matrix of scapGNN on the mouse cortical brain dataset. (C) AUC of the 4 cell clustering accuracy indicators for the pathway activity score matrix of scapGNN on the mouse cortical brain dataset with dropout noise of different strengths. The proportion of cells in which the corresponding correct cell type marker gene sets were detected (D) and the proportion of T cells in which the T-cell receptor signaling pathway was detected (E) within the top 1 to 5 pathway scores on the PBMC dataset. (F) Robustness evaluation of scapGNN in correctly detecting the marker gene set of monocytes at different dropout rates on the PBMC dataset. The data underlying this figure can be found in S4 Data. ARI, adjusted rand index; AUC, area under the recovery curve; NMI, normalized mutual information; PBMC, peripheral blood mononuclear cell; scATAC-seq, single-cell ATAC sequencing; SW, silhouette width; UMAP, Uniform Manifold Approximation and Projection.


https://doi.org/10.1371/journal.pbio.3002369.g005

Inferring pathway exercise scores by integrating scRNA-seq and scATAC-seq

Next, we tested the performance of scapGNN in integrating single-cell multi-omics data. We applied scapGNN to 3 single-cell multi-omics datasets from different sequencing platforms, tissues, and species to evaluate whether the single-cell multi-omics–supported pathway activity scores inferred by scapGNN represent cellular heterogeneity (S5 Table). UMAP plots of the multi-omics–supported pathway activity scores showed that scapGNN could clearly distinguish between different cell types after integrating single-cell multi-omics information (S21A–S21C Fig). Compared with 5 state-of-the-art single-cell multi-omics integration methods (UINMF, MOFA2, Seurat, Cobolt, and GLUE), scapGNN showed better performance on multiple cell clustering indices (Fig 6A and 6B, and S21D Fig and S2 Table).


Fig 6. Performance of scapGNN on single-cell multi-omics data integration.

Bar graph of the 3 cell clustering accuracy indicators for scapGNN and state-of-the-art single-cell multi-omics integration methods on the mouse brain cortex dataset (A) and mouse skin dataset (B) (the resolution parameter of Seurat was 0.5). (C) UMAP plot colored by the pseudotime of TACs, medulla, IRS, and hair shaft cuticle/cortex cells. Purple arrows point from TAC populations to the medulla, IRS, and hair shaft cuticle/cortex cells. (D) Bar graph of the Pearson correlation coefficient between the pseudotime of scapGNN or scDART and that of the reference SHARE-seq study [44]. The data underlying this figure can be found in S5 Data. ARI, adjusted rand index; GLUE, graph-linked unified embedding; IRS, inner root sheath; NMI, normalized mutual information; scDART, single-cell Deep learning model for ATAC-Seq and RNA-seq Trajectory integration; TAC, transit-amplifying cell; UMAP, Uniform Manifold Approximation and Projection.


https://doi.org/10.1371/journal.pbio.3002369.g006

In hair follicle tissue, transit-amplifying cells (TACs) proliferate rapidly and produce 3 different types of mature cells: the inner root sheath (IRS), hair shaft cuticle/cortex, and medulla [44]. Recently, Jones and colleagues applied SHARE-seq, an approach that enables the joint measurement of chromatin accessibility and gene expression from the same single cells, to adult mouse skin tissue and resolved the differentiation process of TACs [44]. We applied scapGNN to integrate scRNA-seq and scATAC-seq data from the mouse skin dataset and inferred the pseudotime of TAC differentiation using Monocle 3 [45]. Like scDART (S2 Table), scapGNN integrates scRNA-seq and scATAC-seq data and can be used for trajectory inference on the integrated data; compared with scDART, it showed the differentiation trajectory of TACs more accurately (Fig 6C and S22 Fig). The pseudotime based on the single-cell multi-omics–supported pathway activity scores calculated using scapGNN was more similar to the TAC differentiation pseudotime provided by Jones's study [44] than was the pseudotime from scDART (Fig 6D).

We found that some significantly different pathways essential for key biological functions of astrocytes and oligodendrocytes in the mouse brain cortex dataset were identified only through multi-omics data integration (S6 Table). For T and B cells in the PBMC multi-omics dataset, the proportion of cells with the T- or B cell receptor signaling pathway within the top 5 of the single-cell multi-omics–supported pathway activity score list was greater than 0.95 (S23A Fig). For single-cell multi-omics data integration, scapGNN maintained the stability of pathway scoring (S24 Fig). Using genes in the T- or B cell receptor signaling pathway as the gold standard, the gene modules identified using scapGNN for T or B cells accurately recovered these genes, with AUCs above 0.8 (S23B Fig). A consistent trend was observed for marker gene sets, with scapGNN accurately identifying the marker gene set corresponding to the cell type in both heterogeneous and homogeneous data (S23C Fig). ROC curves showed that scapGNN could accurately identify highly active genes in the gene module using the marker gene set as the gold standard (S23D Fig). Numerous studies have shown that the hedgehog signaling pathway is required for TAC developmental processes [44,46,47]. In particular, the transcriptome and epigenome activities of Gli2 and Gli3 in the hedgehog signaling pathway play a critical role [46,47]. This was confirmed by the violin plots of Gli2 and Gli3 activities in RNA and DNA in the mouse brain cortex dataset (S25A and S25B Fig). Consistent with this, scapGNN also detected high single-cell multi-omics–supported pathway activity scores of the hedgehog signaling pathway during TAC development (S25C Fig). Gli2 and Gli3 were coexpressed during TAC development in the gene modules identified based on the combined gene–cell association network (S25D Fig).

These results demonstrated that scapGNN effectively integrated single-cell multi-omics information at the pathway level to cluster cells, infer cell differentiation processes, and accurately identify highly active pathways and cell phenotype–associated gene modules.

Applications of scapGNN in spermatogenesis and early embryo development

Development-related biological events involve cell differentiation and intercellular regulation [48,49]. Unraveling the biological mechanisms involved is still challenging. Therefore, we applied scapGNN to scRNA-seq datasets of mouse spermatogenesis, early embryo development, and human testis (S1 Table).

First, Monocle 3 [45] was used to construct single-cell trajectories based on the pathway activity score matrix. The results showed that scapGNN could clearly distinguish different cell types and better reconstruct the processes of spermatogenesis and early embryonic development (Fig 7A and S26 Fig). We then identified the differentially active pathways between cell types in the mouse spermatogenesis dataset using Seurat and filtered the pathways by p.adj < 0.001 (S6 Data). We found that the oxidative phosphorylation pathway had higher activity from early pachytene (eP) spermatocytes to early round spermatids (RS2 to RS4) (Fig 7B). After spermatogonia differentiate into spermatocytes, meiotic spermatocytes cross the blood–testis barrier and become dependent on lactate secreted by Sertoli cells for energy production. Lactate is oxidized to pyruvate and transported into mitochondria to fuel the oxidative phosphorylation pathway [50]. We also found that cytochrome-c oxidase–related genes of the oxidative phosphorylation pathway were of higher importance (S27A Fig), and these genes have been shown to be highly expressed in spermatocytes and spermatids [51,52]. In addition, we identified the pathways that varied over the trajectory of early embryo development based on a Monocle-fitted generalized linear model and q-values < 0.001 (S6 Data). We found that the peroxisome proliferator–activated receptor (PPAR) signaling pathway was significantly dynamically altered in early embryo development in mice (Fig 7C). Defects in the PPAR signaling pathway lead to significantly delayed early embryo development [53]. The observed trends of the PPAR signaling pathway were consistent with the data of EmExplorer, an experimentally supported database for mammalian embryos [54] (S27B Fig).


Fig 7. Analysis of cell differentiation trajectory, dynamic pathways, and gene modules in mouse spermatogenesis and early embryonic development using scapGNN.

(A) Cell differentiation trajectory analysis of 20 spermatogenesis stages using the gene expression data and the scapGNN-, AUCell-, Pagoda2-, or UniPath-based pathway activity score matrix on the mouse spermatogenesis dataset via Monocle 3. At the top, the solid pink arrows indicate the order of the 20 stages of spermatogenesis. The 20 stages of spermatogenic cells include A1, type A1 spermatogonia; In, intermediate spermatogonia; BS, S-phase type B spermatogonia; BG2, G2/M-phase type B spermatogonia; G1, G1-phase preleptotene; ePL, early S-phase preleptotene; mPL, middle S-phase preleptotene; lPL, late S-phase preleptotene; L, leptotene; Z, zygotene; eP, early pachytene; mP, middle pachytene; lP, late pachytene; D, diplotene; MI, metaphase I; MII, metaphase II; RS2, steps 1–2 spermatids; RS4, steps 3–4 spermatids; RS6, steps 5–6 spermatids; and RS8, steps 7–8 spermatids. In the scapGNN panel, pink arrows indicate the order of differentiation of the spermatogenic cells. (B) Box plots of activity scores of the oxidative phosphorylation pathway, which was significantly expressed in the early round spermatids (RS2 to RS4) of spermatogenesis, and (C) the PPAR signaling pathway, which was significantly dynamic in the developing mouse embryos. Network of cell phenotype–associated gene modules during mouse spermatogenesis (D) and embryo development (E). In the network, the sector area of a node indicates the strength of association between a gene and a cell phenotype, and the width of the edges between nodes indicates the strength of association between genes. The data underlying this figure can be found in S6 Data.


https://doi.org/10.1371/journal.pbio.3002369.g007

Next, we characterized the developmental stage–associated gene modules. Several genes in our identified RS8-specific gene modules, such as protamine 1 (Prm1), protamine 2 (Prm2), outer dense fiber of sperm tails 2 (Odf2), histone linker H1 domain (Hils1), and transition protein 1 (Tnp1), have been previously reported to be specifically expressed in round spermatids [55–57] (Fig 7D). For early embryo development, B cell translocation gene-4 (Btg4), which is essential and whose deletion leads to 1- or 2-cell arrest [58], was in the gene module specifically associated with the 2-cell embryos (Fig 7E). Furthermore, we found that some genes associated with multiple cell types were involved in the transition of germ cells and the maintenance of normal spermatogenesis. For example, Fabp9 was highly coexpressed from the metaphase I (MI) spermatocytes to the round spermatid stage and is essential for the formation of the normal shape of the sperm head [59]. In the network of spermatogenesis stage–associated gene modules, RBAK downstream neighbor (Rbakdn) and spermatogenesis-associated 33 (Spata33) were the 2 genes with the highest association scores with Fabp9, and their expression patterns showed a clear co-occurrence with Fabp9 (S28A and S28B Fig). The deletion of Spata33 has also been reported to cause abnormalities in sperm formation, with sperm midpiece defects and infertility [60]. In comparison, cysteine-rich perinuclear theca 12 (Cypt12) and high-mobility group box 4 (Hmgb4) were 2 genes with a lower strength of association with Fabp9; their expression trends were not consistent with that of Fabp9 (S28C and S28D Fig).

Finally, we constructed the cell community network for Sertoli cells and spermatogenic cells from the human testis dataset (S27C Fig and Materials and methods). We found that the Sertoli cells were tightly connected to spermatogenic cells. The cell phenotype–associated gene networks revealed regulatory mechanisms between cell types. Clusterin (CLU) and prosaposin (PSAP) are produced by Sertoli cells [61,62] and are the connecting nodes between the gene modules of Sertoli cells and spermatogenic cells (S27D Fig). CLU regulates the meiosis of germ cells, protects the testes from heat stress–induced injury, and ensures the normal progress of spermatogenesis [63,64]. PSAP functions in glycolipid transfer between Sertoli cells and the developing spermatids, and its abnormalities can delay sperm cell development [62,65,66].

Investigation of the biological mechanisms in coronavirus disease 2019 based on scapGNN

Next, scapGNN was applied to the coronavirus disease 2019 (COVID-19) dataset, which included healthy controls and patients with COVID-19 (S29A Fig and S1 Table). We screened B cell–associated differential pathways between healthy controls and patients with COVID-19 by taking the top 10 pathways by fold change, in descending order, for each B cell subtype, with a maximum p-value < 5 × 10^-5 (S30 Fig). Coronavirus, as an exogenous virus, activates the host immune system. As expected, many differential pathways were associated with infections and immune diseases, such as human cytomegalovirus infection and measles, suggesting that B cells were activated to produce an immune response to exogenous viruses. Consistent with previous studies, protein complex assembly and protein transport–related pathways were activated in the B cells of patients with COVID-19, as B cells require high levels of protein synthesis to perform their functions [67]. Thyroid hormone acts upstream of plasma cell activation, and the activity of the thyroid hormone synthesis pathway increased significantly in patients with COVID-19 (S30 Fig). Several reports showed that thyroid hormone had a protective effect in patients with COVID-19, and low serum levels might increase mortality [68,69]. Notably, the gene association network of the thyroid hormone synthesis pathway showed that ATP1B3 played a key role in the pathway activity in patients with COVID-19 (S29B Fig). Liu's study also found ATP1B3 to be one of the targets of COVID-19 infection [70].

We further integrated the scRNA-seq data of influenza A virus (IAV)-infected patients and constructed networks of gene modules for B cells of 3 cell phenotypes (healthy control, COVID-19, and IAV) using a false discovery rate (FDR) < 0.05 (S29C Fig). Genes coexpressed in all 3 cellular phenotypes might be disease independent. COVID-19-specific genes might be potential high-activity gene signatures of COVID-19. For example, NPM1 was present in the COVID-19-associated protein–protein interaction (PPI) network module and was associated with enhanced viral replication in COVID-19 [71,72]. Genes coexpressed in COVID-19 and IAV might be associated with essential biological functions of B cells; for example, IGHG3 and IGLC2 are B cell marker genes [73]. Interestingly, the absence of ARPC1B, a healthy-specific gene that was reduced in COVID-19, leads to immunodeficiency [74,75]. This might provide new insights into the pathogenic mechanisms and treatment of COVID-19.

Discussion

Pathway enrichment analysis is a broad framework for condensing information from gene expression profiles into a pathway or signature summary [8]. Single-cell data contain a large number of dropout events, which makes data analysis at the gene level difficult [76]. Most single-cell clustering methods use only genes as features of cells and ignore the relationships between genes, which can make clustering more susceptible to noise and result in low accuracy and robustness. Pathway-level analysis is less affected by noise on a single gene [1] and can provide better biological interpretation and insights [6]. In this study, we introduced scapGNN, a GNN-based framework, to transform single-cell data into a pathway activity score matrix and to identify cell phenotype–associated gene modules.

scapGNN effectively addressed the 4 major questions of pathway enrichment in single cells. The first question was whether the pathway activity score could characterize single-cell heterogeneity and accurately identify cell clusters. We benchmarked the performance of scapGNN using 16 scRNA-seq datasets, 10 state-of-the-art single-cell clustering methods, and 3 cell clustering accuracy indicators (ARI, NMI, and SW). scapGNN captured subtle differences between cells and clustered cells in low-dimensional space more accurately than AUCell [12], Pagoda2 [13], and UniPath [14] (Fig 2). scapGNN also performed better in handling scRNA-seq data with batch effects (S10 Fig). The second question was whether the pathway activity scores represented the true pathway states. We applied scapGNN to homogeneous and heterogeneous datasets to test whether cell marker gene sets and known activated pathways of T and B cells had high scores in the corresponding cells. The results indicated that scapGNN identified pathways with higher precision than the existing single-cell pathway enrichment methods (Fig 3A and 3B). The third question was whether scapGNN was robust. We simulated noise of different strengths and types, and scapGNN maintained accurate cell clustering and pathway identification capabilities (Fig 4 and S17 Fig). The final question was how to integrate single-cell multi-omics information at the pathway level. We proposed a network fusion method to infer a pathway activity score matrix supported by multi-omics information. With single-cell multi-omics data integration, scapGNN outperformed the state-of-the-art single-cell multi-omics integration methods in cell clustering and cellular temporal inference and could still accurately identify pathways (Fig 6, S22 and S23 Figs). After integrating single-cell multi-omics data, scapGNN could find pathways with differential activity that could not be identified based on single-omics data alone (S6 Table). Finally, scapGNN could also identify cell phenotype–associated gene modules, which filled the gap in gene module analysis methods for single-cell data.

We further applied scapGNN to real biological questions, analyzing the pathways and gene modules associated with spermatogenesis and early embryo development. Pseudotime analysis showed that scapGNN could better represent the real development process (Fig 7A and S26 Fig). Analyzing pathways and gene modules with scapGNN, and consistent with numerous studies, we found that dynamically changing pathways were important for normal spermatogenesis. Furthermore, activated gene modules revealed close regulation between Sertoli and spermatogenic cells. scapGNN was also applied to the COVID-19 dataset, which included healthy controls and patients with COVID-19. We elucidated biological pathways closely associated with the onset and development of COVID-19 and identified specific features of COVID-19 compared with IAV as well as potential therapeutic targets (S29 and S30 Figs).

However, scapGNN still had some limitations. First, we tested the running time of the GNN module of scapGNN and the calculation time of the pathway activity scores for AUCell, Pagoda2, UniPath, and scapGNN on scRNA-seq datasets of different scales (S1 Text). The results showed that scapGNN's runtime was longer than that of the other methods, and UniPath's runtime was the shortest (S31 Fig). We believe that 3 main factors accounted for the time consumption of scapGNN: (1) scapGNN is an end-to-end framework that requires inferring and learning the gene–cell association network for each input dataset; (2) the gene–cell association network contains more information than the cell-by-gene expression matrix; and (3) we corrected the pathway activity scores to remove the effect of noise by perturbation analysis, which randomly selects gene nodes as seeds for the RWR. Therefore, for each dataset, scapGNN had a base runtime, beyond which an increase in the number of cells led to a smaller increase in runtime. Nevertheless, we implemented graphics processing unit computing and multi-core parallelism in the scapGNN package, giving the user options to effectively reduce the runtime. Second, scapGNN could integrate multi-omics data only from the same cells. On the other hand, multi-omics information from the same cells can more fully recapitulate the true biological state [77,78]. Single-cell sequencing technologies capable of detecting multiple omics in the same cell are still an emerging field, with the development of platforms such as SNARE-seq [77] and CITE-seq [79]. These technologies provided the basis for the application of scapGNN (Fig 6A and 6B, S21D and S21E Fig). It should be noted that, based on the 3 cell clustering accuracy indicators (ARI, NMI, and SW), the quality of scATAC-seq data is inferior to that of the corresponding scRNA-seq data from multi-omics of the same cell, posing a challenge for methods that integrate single-cell multi-omics data (S21E Fig). Theoretically, the number of omics data types integrated by scapGNN has no limit. Therefore, scapGNN is expected to be able to process data from scNOMeRe-seq [80] and NEAT-seq [81], which allow the simultaneous profiling of more omics information in the same individual cell. Finally, scapGNN can convert gene expression profile data into a gene–cell association network that stably represents gene–cell relationships. Although the gene–cell association network is a type of correlation network that cannot describe the communication and causality between cells, it provides a new view that uses graph theory–based methods to analyze single-cell profiling data.

Materials and methods

Data preprocessing

scapGNN can take single-cell epigenome or transcriptome data as input. The data require quality control and normalization to ensure their quality and usability. For scRNA-seq datasets, cells with nonzero expression in more than 1% of genes and genes with nonzero expression in more than 1% of cells are retained. The global-scaling normalization method (LogNormalize) was used to normalize the gene expression measurements and log-transform the result [27].
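The filtering and normalization steps above can be sketched as follows. This is a simplified illustration, not the package's code: it assumes both 1% filters are evaluated on the raw matrix at once (the text does not specify the order) and uses the usual LogNormalize definition, in which each cell's counts are divided by the cell total, multiplied by a scale factor of 10,000, and transformed with log(1 + x). The function name `filter_and_lognormalize` is hypothetical.

```python
import math


def filter_and_lognormalize(counts, min_frac=0.01, scale=10_000):
    """Filter a genes x cells count matrix, then log-normalize per cell.

    Keeps genes detected in more than `min_frac` of cells and cells with
    more than `min_frac` of genes detected (both judged on the raw matrix,
    an assumption made for this sketch).
    Returns (normalized matrix, kept gene indices, kept cell indices).
    """
    n_genes, n_cells = len(counts), len(counts[0])
    keep_genes = [g for g in range(n_genes)
                  if sum(v > 0 for v in counts[g]) > min_frac * n_cells]
    keep_cells = [c for c in range(n_cells)
                  if sum(counts[g][c] > 0 for g in range(n_genes)) > min_frac * n_genes]
    # Library size of each kept cell over the kept genes.
    totals = {c: sum(counts[g][c] for g in keep_genes) for c in keep_cells}
    norm = [[math.log1p(counts[g][c] / totals[c] * scale) if totals[c] else 0.0
             for c in keep_cells]
            for g in keep_genes]
    return norm, keep_genes, keep_cells


# Toy matrix: gene 1 and cell 1 are all zeros and are filtered out.
toy_counts = [[1, 0, 3],
              [0, 0, 0],
              [1, 0, 1]]
norm, genes_kept, cells_kept = filter_and_lognormalize(toy_counts)
```

Highly variable gene selection, discussed next in the text, would operate on the normalized matrix and is not shown here.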

To assess the impact of highly variable gene selection on the performance of scapGNN, we performed benchmark experiments with different numbers of highly variable genes. For heterogeneous single-cell data, cell clustering accuracy and marker gene set scoring decrease as the number of genes increases (S32 Fig). Because low amounts of mRNA in individual cells, inefficient mRNA capture, and the stochasticity of mRNA expression can lead to dropout events in single-cell data [82], and because not all genes contribute to cell-to-cell differences [83], including too many genes can introduce technical noise. We therefore selected 2,000 highly variable genes when processing heterogeneous single-cell data. For homogeneous data, there is little variation in expression levels between cells for inherently expressed genes such as marker genes, so a small number of highly variable genes can cause a loss of information (S33 Fig). Our results show that 8,000 highly variable genes can represent information about intrinsic biological pathways or marker gene sets in cells while minimizing the introduction of noise (S33 Fig). We therefore selected 8,000 highly variable genes when processing homogeneous single-cell data.

For scATAC-seq datasets, we estimated the gene exercise by measuring ATAC-seq counts within the 2-kb upstream areas and gene physique [84]. Subsequently, filtering, high quality management, and normalization had been the identical as these for scRNA-seq. For different single-cell omics information, we would have liked to affiliate omics with genes and convert them into gene–cell matrices primarily based on the corresponding proof [25,85]. This research used 3,000 extremely variable genes for scATAC-seq information and single-cell multi-omics integration research.

Deep neural network autoencoder

We first used the LTMG [30] model to parse the regulatory signals from gene expression. The LTMG model set a latent experimental resolution threshold $Z_{cut}$ to divide the gene expression of N cells into 2 parts. The left-truncated gene expression was $X = \{x_1, \dots, x_M\}$, where $X < Z_{cut}$ had zero- or low-expression values. The other part was the active gene expression $X = \{x_{M+1}, \dots, x_N\}$, where $X \ge Z_{cut}$. The probability density function of the normalized gene expression values was further modeled as a mixture of K Gaussian distributions, corresponding to K transcriptional regulatory states:
$$p(X;\Theta) = \prod_{j=1}^{N} p(x_j;\Theta) = \prod_{j=1}^{N} \sum_{i=1}^{K} a_i \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x_j - \mu_i)^2}{2\sigma_i^2}\right) \tag{1}$$
where Θ denotes the parameters of the K Gaussian distributions, and $a_i$, $\mu_i$, and $\sigma_i$ are the mixing probability weight, mean, and standard deviation, respectively. The expectation-maximization algorithm can estimate Θ and calculate $Z_{cut}$. The number of Gaussian distributions K is determined by the Bayesian information criterion. Ultimately, discrete TRS values {0, 1, 2, …, K} are generated for each gene over cells. A signal value k indicates that the gene expression value belongs to the kth Gaussian peak, corresponding to the kth expression state. A high value indicates that the gene is more likely to be in a highly active expression state in the cell.

Next, we constructed a DNNAE regularized by TRSs to learn the latent associations between genes and cells. Taking a gene–cell matrix X with m genes and n cells as input, the neural network encoder performed column compression and row compression to generate low-dimensional representations of genes and cells:
$$E_G^{(l+1)} = \sigma\left(E_G^{(l)} W_G^{(l)}\right), \quad E_G^{(0)} = X \tag{2}$$
$$E_C^{(l+1)} = \sigma\left(E_C^{(l)} W_C^{(l)}\right), \quad E_C^{(0)} = X^{\top} \tag{3}$$
where $W_G^{(l)}$ and $W_C^{(l)}$ represent the learnable weights of the lth hidden layer for gene embedding and cell embedding. $E_G \in \mathbb{R}^{m \times d}$ and $E_C \in \mathbb{R}^{n \times d}$ denote the encoded d-dimensional feature matrices of genes and cells, respectively, and σ is the nonlinear activation function.

We further used a matrix factorization decoder to obtain the potential gene–cell association matrix:
$$\hat{X} = E_G E_C^{\top} \tag{4}$$
and took a mean squared error regularized by the transcriptional regulatory signal as the loss function. Regularization aimed to improve the signal-to-noise ratio by adding different constraints to each gene during the learning process. The loss function was as follows:
$$\mathrm{Loss} = (1 - a) \sum \left(X - \hat{X}\right)^2 + a \sum \left(X - \hat{X}\right)^2 \circ S_{TRS} \tag{5}$$
where a is the regularization weight and a ∈ [0,1]. ∘ denotes element-wise multiplication. $S_{TRS} \in \mathbb{R}^{m \times n}$ is the transcriptional regulatory signal matrix. According to our benchmark experiment, the learning rate was set to 0.001 and the number of iterations to 1,000 (S34A and S34B Fig).
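To make the decoder and the regularized loss concrete, the sketch below evaluates them on tiny hand-written matrices. It assumes that the decoder is the product of the gene embedding and the transposed cell embedding, and that the loss is a weighted sum of a plain squared error and a squared error scaled element-wise by the TRS matrix. The helper names and the default weight of 0.5 are placeholders for illustration, not values taken from the study.

```python
def decode(e_g, e_c):
    """Matrix factorization decoder: E_G (m x d) times E_C (n x d) transposed."""
    return [[sum(x * y for x, y in zip(row_g, row_c)) for row_c in e_c]
            for row_g in e_g]

def trs_regularized_mse(x, x_hat, s_trs, a=0.5):
    """(1 - a) * sum((X - X_hat)^2) + a * sum((X - X_hat)^2 * S_TRS)."""
    plain, weighted = 0.0, 0.0
    for row_x, row_h, row_s in zip(x, x_hat, s_trs):
        for v, h, s in zip(row_x, row_h, row_s):
            sq = (v - h) ** 2
            plain += sq          # unweighted reconstruction error
            weighted += sq * s   # error scaled by the discrete TRS state
    return (1 - a) * plain + a * weighted

e_g = [[1.0, 0.0], [0.0, 1.0]]              # 2 genes, d = 2
e_c = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]  # 3 cells, d = 2
x = [[1.0, 3.0, 1.0], [2.0, 2.0, 1.0]]      # observed gene-cell matrix
s_trs = [[0, 1, 2], [1, 2, 0]]              # hypothetical TRS states
x_hat = decode(e_g, e_c)
loss = trs_regularized_mse(x, x_hat, s_trs, a=0.5)
```

Entries in a high TRS state contribute more to the loss, so reconstruction errors for actively expressed genes are penalized more heavily than errors in low-expression entries.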

Graph autoencoder

We first calculated the gene–gene Pearson correlation matrix from the gene–cell matrix $X_{m \times n}$. For gene i, we designed empirical p-values to evaluate the relative strength of the correlation:
$$p\text{-value}_{ij} = \frac{\left|\{c \in C_i : c \ge c_{ij}\}\right|}{\left|C_i\right|} \tag{6}$$
where $C_i$ is a vector of correlation values between gene i and other genes, and $c_{ij}$ is the Pearson correlation value between the ith gene and the jth gene. We set p-value < 0.05 to extract strongly correlated gene–gene pairs of gene i. The final correlation matrix was used as the adjacency matrix A of the gene correlation network.
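The thresholding step can be sketched as follows. The sketch assumes that the empirical p-value of a gene pair is the fraction of gene i's correlations that are at least as large as the pair's correlation; this is one plausible reading of the description, and the function names are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

def correlation_adjacency(x, alpha=0.05):
    """x: genes-by-cells matrix. Keeps c_ij where its empirical p-value < alpha."""
    m = len(x)
    corr = [[pearson(x[i], x[j]) if i != j else 0.0 for j in range(m)]
            for i in range(m)]
    adj = [[0.0] * m for _ in range(m)]
    for i in range(m):
        others = [corr[i][j] for j in range(m) if j != i]
        for j in range(m):
            if j == i:
                continue
            # Assumed definition: share of gene i's correlations >= c_ij.
            p = sum(c >= corr[i][j] for c in others) / len(others)
            if p < alpha:
                adj[i][j] = corr[i][j]
    return adj

x = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [1, 3, 2, 4]]
adj = correlation_adjacency(x, alpha=0.5)  # loose cutoff for a 4-gene toy
```

Because the p-values are computed per gene, the resulting matrix is not guaranteed to be symmetric; an implementation would need to decide how to handle pairs that pass for gene i but not for gene j. With only a few genes, the smallest attainable p-value is 1/(m − 1), so the 0.05 cutoff is only meaningful for realistically sized gene sets.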

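Next, we took the low-dimensional representations encoded by the deep neural networks as the feature matrix E of nodes in the gene correlation network. D was the degree matrix of the gene correlation network. For the VGAE, a 2-layer graph convolution network was defined as $\mathrm{GCN}(E, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} E W_0) W_1$, where $\tilde{A} = D^{-1/2} A D^{-1/2}$, and $W_0$ and $W_1$ are the learned weight matrices. The encoder of the VGAE was defined as:
$$q(Z \mid E, A) = \prod_{i=1}^{m} q(z_i \mid E, A), \quad q(z_i \mid E, A) = \mathcal{N}\left(z_i \mid \mu_i, \mathrm{diag}(\sigma_i^2)\right) \tag{7}$$
where $\mu_i$ is a mean vector from the matrix $\mu = \mathrm{GCN}_{\mu}(E, A)$. Similarly, $\sigma_i$ is the variance and $\log \sigma = \mathrm{GCN}_{\sigma}(E, A)$; Z is the representation matrix of the graph in low-dimensional space. $\mathrm{GCN}_{\mu}(E, A)$ and $\mathrm{GCN}_{\sigma}(E, A)$ share the first-layer weight $W_0$.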

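The decoder reconstructed the network to generate a gene–gene association network by an inner product between latent variables:
$$p(A \mid Z) = \prod_{i=1}^{m} \prod_{j=1}^{m} p(A_{ij} \mid z_i, z_j), \quad p(A_{ij} = 1 \mid z_i, z_j) = \mathrm{sigmoid}\left(z_i^{\top} z_j\right) \tag{8}$$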

The goal of training the VGAE was to optimize the variational lower bound L:
$$L = \mathbb{E}_{q(Z \mid E, A)}\left[\log p(A \mid Z)\right] - \mathrm{KL}\left[q(Z \mid E, A) \,\|\, p(Z)\right] \tag{9}$$
where KL is the Kullback–Leibler divergence. According to our benchmark experiment, the learning rate was set to 0.01 and the number of iterations to 300 (S34A, S34C and S34D Fig).
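For reference, the KL term has a closed form when the prior is a standard normal, which is the usual choice for variational graph autoencoders. The text does not state the prior, so $p(Z) = \prod_i \mathcal{N}(z_i \mid 0, I)$ is an assumption here, as are the index k over the latent dimensions and the symbol d' for the latent dimension:

```latex
\mathrm{KL}\left[q(Z \mid E, A)\,\|\,p(Z)\right]
  = -\frac{1}{2} \sum_{i=1}^{m} \sum_{k=1}^{d'}
    \left(1 + 2\log \sigma_{ik} - \mu_{ik}^{2} - \sigma_{ik}^{2}\right)
```

Under this assumption the term vanishes exactly when every $\mu_{ik} = 0$ and $\sigma_{ik} = 1$, so it acts as a penalty that keeps the latent representation close to the prior while the expectation term rewards accurate reconstruction of A.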

For the cell–cell Pearson correlation matrix, we constructed a cell–cell association network using the GAE, in the same way as the gene–gene association network.

Calculating pathway activity scores and identifying gene modules

We normalized the adjacency matrices of the gene–gene association network, the cell–cell association network, and the gene–cell association matrix by min-max normalization. These values represented the relative strength of gene–gene, cell–cell, and gene–cell associations. We spliced the adjacency matrix of the cell–cell association network with the gene–cell association matrix, and the adjacency matrix of the gene–gene association network with the gene–cell association matrix, by column (S35 Fig). The 2 spliced matrices were then merged by row to form a gene–cell association network. W is the column-normalized adjacency matrix of the gene–cell association network. For a given pathway, we used the genes included in the pathway as restart nodes (called seeds) in the gene–cell association network. Subsequently, the RWR algorithm performed diffusion and iteration over the gene–cell association network using the seed nodes as starting nodes. The formula was as follows:
$$p^{t+1} = (1 - r)\,W p^{t} + r\,p^{0} \tag{10}$$
where $p^{0}$ is the initial probability vector, in which only the seeds have nonzero values; t is the number of iterations; and r is the restart probability. Kohler and colleagues showed that r had only a slight effect on the results of the RWR algorithm when it varied between 0.1 and 0.9 [86]. This was also confirmed by our benchmark experiments for the r values (S34E Fig). In this study, we set r = 0.7, which had comparatively better performance (S34E Fig). We obtained the stationary probability vector by iterating until the difference between $p^{t+1}$ and $p^{t}$ fell below $1 \times 10^{-6}$, i.e., $\sum |p^{t+1} - p^{t}| < 1 \times 10^{-6}$. When the iteration ended, the stationary probability value of each cell node obtained from the seed diffusion represented the proximity measure between each cell and the pathway [87]. We further applied permutation analysis, which randomly sampled the same number of genes as seeds, to adjust the probability values and take them as pathway activity scores:
$$\mathrm{PAS}_{ij} = 1 - \frac{\left|\{p' \ge p_{ij}\}\right|}{N} \tag{11}$$
where $\mathrm{PAS}_{ij}$ is the pathway activity score of the ith pathway in the jth cell, $p_{ij}$ is the corresponding stationary probability value, $p'$ is a vector of the perturbed stationary probability values, and N is the number of perturbations. We set the number of perturbations to 100 (S34F Fig). |·| indicates the number of elements in the set.
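The diffusion and permutation steps can be sketched on a toy network. The random walk follows the standard restart update with r = 0.7 and an L1 convergence tolerance of 1e-6, as described above. The permutation adjustment assumes that the score is one minus the fraction of random seed sets whose stationary value reaches the observed one, which is an interpretation rather than a confirmed formula. All function names, the equal initial weights on the seeds, and the node layout are illustrative.

```python
import random

def column_normalize(a):
    """Divide each column of a square matrix by its sum."""
    n = len(a)
    col = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[a[i][j] / col[j] if col[j] else 0.0 for j in range(n)]
            for i in range(n)]

def rwr(w, seeds, r=0.7, tol=1e-6, max_iter=1000):
    """Random walk with restart; w must be column-normalized, seeds a set."""
    n = len(w)
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - r) * sum(w[i][j] * p[j] for j in range(n)) + r * p0[i]
               for i in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

def pathway_activity(w, gene_nodes, cell_nodes, pathway, n_perm=100, seed=0):
    """Assumed adjustment: 1 - share of random seed sets reaching the observed value."""
    rng = random.Random(seed)
    observed = rwr(w, set(pathway))
    reached = {c: 0 for c in cell_nodes}
    for _ in range(n_perm):
        null = rwr(w, set(rng.sample(gene_nodes, len(pathway))))
        for c in cell_nodes:
            reached[c] += null[c] >= observed[c]
    return {c: 1 - reached[c] / n_perm for c in cell_nodes}

# Nodes 0-1 are genes, nodes 2-3 are cells; gene 0 is tied to cell 2.
a = [[0, 0.1, 1, 0], [0.1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
w = column_normalize(a)
p = rwr(w, {0})
scores = pathway_activity(w, [0, 1], [2, 3], [0], n_perm=20)
```

With gene 0 as the only seed, cell 2 receives more probability mass than cell 3, and the permutation step leaves cell 3 with a score of zero because no random seed set gives it a lower value than the observed one.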

We next used cells of the same phenotype as seeds to automatically identify cell phenotype–associated gene modules. For each gene, the RWR algorithm quantified the association strength with the seeds (cell phenotype), and the stationary probability value was used as the association strength score. Subsequently, a permutation test, which randomly selected cells as seeds, estimated the statistical significance of each gene. Genes with p-values below the significance threshold (a configurable parameter; the default value in scapGNN is 0.01) comprised the cell phenotype–associated gene module. For genes significantly associated with multiple cell phenotypes, we normalized the association strengths so that they sum to 1, representing the propensity of genes to be expressed between cell phenotypes, and provided a visualization program for the network of cell phenotype–associated gene modules.

Integrating single-cell multi-omics data into pathway activity score matrix

For the single-cell transcriptome and epigenome from the same cells, we constructed gene–cell association networks from the scRNA-seq and scATAC-seq data. We then merged the 2 gene–cell association networks into a new multi-omics gene–cell association network (S1 Fig). For edges shared by the 2 networks, we merged the weights into 1 combined weight using Brown's method [88]. Brown's method was originally developed to combine multiple dependent statistical tests [88]. Many studies have used it for tasks such as multi-omics integration or calculating pathway scores by combining the expression values of genes in the pathways [25,89]. Brown's method accounts for dependencies between datasets and thus provides more conservative estimates of significance for genes supported by multiple related omics datasets [25]. The edge weights in the gene–cell association network served as indicators of significance between nodes, such as the strength of coexpression between genes, the level of gene expression in cells, and the similarity between cells. After integrating gene–cell association networks from different omics, the weights of the edges accounted for the overall covariation of the weights from different sources of evidence. Furthermore, we used the RWR algorithm to calculate the proximity between genes in the pathway and each cell as the multi-omics information-supported pathway activity scores, and the proximity between cells with the same phenotype and each gene to identify cell phenotype–associated gene modules.
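As background on the combination step, Brown's method extends Fisher's combined statistic to dependent tests. The general form for k significance values $p_1, \dots, p_k$ is shown below; how edge weights are mapped onto such values in scapGNN is not spelled out in the text, so that mapping is left open, and the symbols Ψ, c, and f are notation introduced here.

```latex
\Psi = -2 \sum_{i=1}^{k} \ln p_i, \qquad
\mathrm{E}[\Psi] = 2k, \qquad
\mathrm{Var}[\Psi] = 4k + 2 \sum_{i<j} \mathrm{cov}\left(-2\ln p_i,\; -2\ln p_j\right)

\Psi \sim c\,\chi^{2}_{2f}, \qquad
f = \frac{\mathrm{E}[\Psi]^{2}}{\mathrm{Var}[\Psi]}, \qquad
c = \frac{\mathrm{Var}[\Psi]}{2\,\mathrm{E}[\Psi]}
```

When the covariance terms are zero, f = k and c = 1, which recovers Fisher's method. Positive covariance between omics layers inflates the variance, reduces the effective degrees of freedom, and therefore yields a more conservative combined estimate.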

Cell clustering performance evaluation

We benchmarked the cell clustering performance of scapGNN using 16 scRNA-seq datasets (S3 Table), which were preprocessed according to uniform standards. We downloaded the extensible markup language (XML) files of 396 pathways from the KEGG database and used the gene set contained in each pathway as a pathway term. We used the pathway activity score matrix obtained from the different single-cell pathway activity scoring methods as input to the sciPath framework. The 10 state-of-the-art single-cell clustering methods (S4 Table) in the sciPath framework were used to infer cell clusters from the pathway activity score matrix. The procedures and parameters arranged by the sciPath framework were used [1]. Among the 10 cell clustering methods, the Leiden algorithm of the Seurat pipeline was used at 3 levels of resolution (0.5, 1, and 1.5). The top 2,000 highly variable genes were used for gene-level cell clustering. AUCell, Pagoda2, and UniPath were implemented using the scTPA [90] and UniPath packages [14] with default parameters. S2 Table provides guidance on the use of these methods. All pathways in the pathway activity score matrix were used in cell clustering. The cell clustering performance was quantified using 3 accuracy indicators: ARI [91], NMI [92], and SW [93]. ARI and NMI were implemented using the sciPath framework. SW was implemented using the R package cluster v2.1.3. For each of the 3 accuracy indicators, the mean value over the 10 cell clustering methods was used (average ARI, average NMI, and average SW).
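For readers unfamiliar with ARI, the following sketch computes it from two label vectors using the standard pair-counting formula. It is a plain illustration of the metric, not the sciPath implementation, and it assumes at least 2 cells.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between reference labels and predicted clusters."""
    n = len(labels_true)
    # Contingency counts and the marginal sizes of classes and clusters.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A clustering that matches the reference up to relabeling scores 1, a clustering no better than chance scores around 0, and the fully crossed assignment in the second call is worse than chance and scores below 0.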

For cell clustering analysis of single-cell multi-omics data, Cobolt v1.0.1 and GLUE v0.3.2 were used to integrate the single-cell multi-omics data with default parameters. The Leiden algorithm was used to cluster the cells, and the cell clustering indicators were calculated for each method at different levels of resolution (0.5, 1, and 1.5). The purity function of the R package funtimes v9.0 was used to calculate the purity of the clustering results.
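Purity can be illustrated with its textbook definition, in which each cluster is credited with its most frequent reference label. This sketch is not a port of the funtimes function, which may differ in how clusters and classes are matched.

```python
from collections import Counter, defaultdict

def purity(labels_true, labels_pred):
    """Share of cells that carry the majority label of their cluster."""
    by_cluster = defaultdict(Counter)
    for truth, cluster in zip(labels_true, labels_pred):
        by_cluster[cluster][truth] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels_true)

value = purity(["a", "a", "b", "b", "b"], [0, 0, 0, 1, 1])
```

Here cluster 0 holds two "a" cells and one "b" cell, cluster 1 holds two "b" cells, so four of the five cells agree with their cluster's majority label.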

Evaluation of the ability to identify pathways and gene modules

We used known cell marker gene sets to test the accuracy of scapGNN in identifying pathways, following the approach used for UniPath [14]. Marker genes for each cell type were collected from the CellMarker database, BioGPS, and Harmonizome and used as a gene set. In total, 460 marker gene sets corresponding to 460 cell types were collected by Smriti and colleagues [14] and used as the gold standard (S36 Fig). Both homogeneous and heterogeneous scRNA-seq datasets were used to calculate pathway activity scores (S1 Table). We calculated the proportion of cells for which the correct cell type marker gene set ranked within the top 1 to 5 pathway activity scores. We also used the KEGG pathway data from the C2 gene set of the Molecular Signatures Database (MSigDB), which includes 186 biological pathways. Using these data, we determined the proportion of T and B cells that ranked the T- and B-cell receptor signaling pathways within the top 1 to 5.
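The top-k evaluation can be sketched as a simple hit rate: for each cell, the gene sets are ranked by score and the cell counts as a hit when its own marker gene set appears among the first k. The data layout and names below are hypothetical.

```python
def top_k_hit_rate(scores, cell_types, k=5):
    """scores: {cell: {gene_set: score}}; cell_types: {cell: expected gene set}."""
    hits = 0
    for cell, per_set in scores.items():
        # Gene sets ordered from highest to lowest activity score.
        ranked = sorted(per_set, key=per_set.get, reverse=True)[:k]
        hits += cell_types[cell] in ranked
    return hits / len(scores)

scores = {
    "cell_1": {"T cell": 0.9, "B cell": 0.2, "NK cell": 0.5},
    "cell_2": {"T cell": 0.8, "B cell": 0.1, "NK cell": 0.6},
}
cell_types = {"cell_1": "T cell", "cell_2": "B cell"}
rate_top1 = top_k_hit_rate(scores, cell_types, k=1)
rate_top3 = top_k_hit_rate(scores, cell_types, k=3)
```

In this toy input only cell_1 ranks its own marker set first, so the top-1 rate is 0.5, while with k = 3 every gene set is included and the rate is 1.0.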

We further used marker genes of cell types as the gold standard. ROC curve analysis was used to verify that scapGNN could identify cell phenotype–associated gene modules.

Supporting information

S3 Fig. Bar graphs of 3 cell clustering accuracy indicators (ARI, NMI, and SW) that evaluated the cell clustering results of AUCell, Pagoda2, UniPath, and scapGNN in the 3 scRNA-seq datasets using the 10 cell clustering methods.

The data underlying this figure can be found in S7 Data. ARI, adjusted Rand index; NMI, normalized mutual information; scRNA-seq, single-cell RNA sequencing; SW, silhouette width.

https://doi.org/10.1371/journal.pbio.3002369.s004

(PDF)

S7 Fig. Application of traditional bulk RNA-seq pathway enrichment analysis methods to scRNA-seq data.

Box plot of average ARI, average NMI, and average SW at the pathway level from ssGSEA (A) and GSVA (B) using 10 state-of-the-art single-cell clustering methods on 16 scRNA-seq datasets. The data underlying this figure can be found in S1 Data. ARI, adjusted Rand index; GSVA, gene set variation analysis; NMI, normalized mutual information; scRNA-seq, single-cell RNA sequencing; ssGSEA, single-sample gene set enrichment analysis; SW, silhouette width.

https://doi.org/10.1371/journal.pbio.3002369.s008

(PDF)

S16 Fig. Functional modularity evaluation for cell phenotype–associated gene modules of activated stellate cells in the cell type dataset, eProg1b cells in the cell subtype dataset, and 36-h cells in the time series dataset.

The data underlying this figure can be found in S8 Data.

https://doi.org/10.1371/journal.pbio.3002369.s017

(PDF)
