Abstract-1
Protein complexes are key units for studying a cell system. During the past decades, genome-scale protein–protein interaction (PPI) data have been determined by high-throughput approaches, which enables the identification of protein complexes from PPI networks. However, high-throughput approaches often produce a considerable fraction of false positive and false negative interactions. In this study, we propose the mutual important interacting partner (MIIP) relation to reflect the co-complex relationship of two proteins based on their interaction neighborhoods. In addition, a new algorithm called idenPC-MIIP is developed to identify protein complexes from weighted PPI networks. Experimental results on two widely used datasets show that idenPC-MIIP outperforms 17 state-of-the-art methods, especially for the identification of small protein complexes with only two or three proteins.
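Brief: Protein complex prediction currently relies mainly on PPI networks, but because PPI networks contain many false positives and false negatives, the results are poor. Overall workflow: (1) build an unbiased PPI network from a yeast protein database, then score the semantic similarity between proteins with GOSemSim to weight the matrix; a per-node interaction threshold decides whether two nodes are mutual important interacting partners (MIIP). (2) Cluster the MIIPs, take the node with the highest degree as the seed, and extend the cluster using a set threshold. (3) Evaluate the similarity between protein clusters; those with a similarity above 0.8 are considered likely to form a protein complex.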
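A minimal sketch of the MIIP step, in Python. The notes do not give the exact threshold rule, so the rule used here is an assumption: a partner counts as "important" for a node when the edge weight is at least `alpha` times that node's largest edge weight. The names `miip_pairs`, `pick_seed` and `alpha` are hypothetical, and the seed degree is counted inside the MIIP graph, which is also an assumption.

```python
from collections import defaultdict


def miip_pairs(weighted_edges, alpha=0.5):
    """Return pairs (u, v) that are important partners of each other.

    Assumption (not from the paper): v is an important partner of u when
    weight(u, v) >= alpha * (largest weight on any edge touching u).
    """
    neighbors = defaultdict(dict)
    for u, v, w in weighted_edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    # Each node gets its own threshold, derived from its own neighborhood.
    threshold = {n: alpha * max(nbrs.values()) for n, nbrs in neighbors.items()}

    pairs = set()
    for u, v, w in weighted_edges:
        # The relation must hold from both sides to be "mutual".
        if w >= threshold[u] and w >= threshold[v]:
            pairs.add(tuple(sorted((u, v))))
    return pairs


def pick_seed(nodes, pairs):
    """Pick the node with the highest degree in the MIIP graph (ties: name order)."""
    degree = {n: 0 for n in nodes}
    for u, v in pairs:
        if u in degree:
            degree[u] += 1
        if v in degree:
            degree[v] += 1
    return max(sorted(degree), key=degree.get)


# Toy weighted PPI network with made-up protein names and weights.
edges = [("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.85), ("C", "D", 0.2)]
pairs = miip_pairs(edges, alpha=0.5)
seed = pick_seed(["A", "B", "C", "D"], pairs)
```

In this toy network the weak C-D edge passes D's threshold but not C's, so it is not mutual and is left out.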
Abstract-2
Breast cancer is one of the most malignant human diseases and the leading cause of cancer-related death in the world. However, the prognosis and therapeutic benefit of breast cancer patients cannot be predicted accurately by the current stratifying system. In this study, an immune-related prognostic score was established in 22 breast cancer cohorts with a total of 6415 samples. An extensive immunogenomic analysis was conducted to explore the relationships between immune score, prognostic significance, infiltrating immune cells, cancer genotypes and potential immune escape mechanisms. Our analysis revealed that this immune score was a promising biomarker for estimating overall survival in breast cancer. The immune score was associated with important immunophenotypic factors, such as immune escape and mutation load. Further analysis revealed that patients with high immune scores benefited from chemotherapy and immunotherapy. Based on these results, we conclude that this immune score may be a useful tool for overall survival prediction and treatment guidance for patients with breast cancer.
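Brief: Defines breast cancer subtypes using ssGSEA and collected immune-related gene sets, and reveals how the immune subtypes relate to mutations and to immunotherapy efficacy.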
Abstract-3
Identification of new drug–target interactions (DTIs) is an important but time-consuming and costly step in drug discovery. In recent years, to mitigate these drawbacks, researchers have sought to identify DTIs using computational approaches. However, most existing methods construct drug networks and target networks separately, and then predict novel DTIs based on known associations between the drugs and targets without accounting for associations between drug–protein pairs (DPPs). To incorporate the associations between DPPs into DTI modeling, we built a DPP network based on multiple drugs and proteins in which DPPs are the nodes and the associations between DPPs are the edges of the network. We then propose a novel learning-based framework, ‘graph convolutional network (GCN)-DTI’, for DTI identification. The model first uses a graph convolutional network to learn the features for each DPP. Second, using the feature representation as an input, it uses a deep neural network to predict the final label. The results of our analysis show that the proposed framework outperforms some state-of-the-art approaches by a large margin.
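Brief: (1) Using existing databases, each drug–gene pair is treated as a node to build a drug–protein network; edge weights are assigned as strong connection, weak connection or unrelated according to defined relations. (2) Protein sequence and drug structure information are used as features. (3) Graph convolution computes the features of each node in the network, and a neural network then predicts the protein–drug interaction.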
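The graph-convolution step can be sketched in a few lines of plain Python. This is the standard GCN propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), and not necessarily the exact layer used in GCN-DTI; the adjacency matrix, features and weights below are made-up toy values.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so that each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbor features, apply the linear map, then ReLU.
    aggregated = matmul(norm, features)
    return [[max(0.0, x) for x in row] for row in matmul(aggregated, weights)]


# Two connected DPP nodes with one feature each and an identity weight.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weights = [[1.0]]
hidden = gcn_layer(adj, features, weights)
```

With these toy values each node ends up with the average of its own and its neighbor's feature, which is the smoothing effect the layer is meant to produce before the downstream classifier.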
Abstract-4
The progression of cancer is accompanied by the acquisition of stemness features. Many stemness evaluation methods based on transcriptional profiles have been presented to reveal the relationship between stemness and cancer. However, instead of absolute stemness index values (values within a fixed range), these methods give unbounded values, which makes it hard to evaluate stemness intuitively. Besides, these indices are based on the absolute expression values of genes, which were found to be seriously influenced by batch effects and by the composition of samples in the dataset. Recently, we showed that signatures based on the relative expression orderings (REOs) of gene pairs within a sample are highly robust against these factors, so REO-based signatures can be applied stably to produce continuous scores within a fixed range. Here, we provide an absolute REO-based stemness index to evaluate stemness. We found that this stemness index had a higher correlation with the culture time of differentiated stem cells than the previous stemness index. When applied to cancer and normal tissue samples, the stemness index differed significantly between cancers and normal tissues and revealed intratumor heterogeneity at the stemness level. Importantly, a higher stemness index was associated with poorer prognosis and greater oncogenic dedifferentiation as reflected by histological grade. All results showed the capability of the REO-based stemness index to assist the assignment of tumor grade, as well as its potential therapeutic and diagnostic implications.
Brief: Earlier work falls short: stemness evaluation is poorly stable across datasets.
LIN28B Functioning effectively in reprogramming to pluripotency
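A rough sketch of why an REO-based score is bounded and robust, assuming the index is simply the fraction of signature gene pairs whose within-sample ordering matches the expected direction. The paper's actual index may be defined differently; the function name `reo_score` and the gene names are hypothetical.

```python
def reo_score(sample, gene_pairs):
    """Fraction of signature pairs (a, b) with expression[a] > expression[b].

    The result always lies in [0, 1], whatever the scale of the input data.
    """
    usable = [(a, b) for a, b in gene_pairs if a in sample and b in sample]
    if not usable:
        raise ValueError("no signature pair is measured in this sample")
    hits = sum(1 for a, b in usable if sample[a] > sample[b])
    return hits / len(usable)


# Hypothetical signature: each pair says "first gene above second gene".
signature = [("G1", "G2"), ("G3", "G4")]
sample = {"G1": 8.0, "G2": 3.0, "G3": 1.0, "G4": 5.0}
score = reo_score(sample, signature)

# A batch effect that rescales every value leaves the orderings unchanged.
shifted = {gene: 2.5 * value + 1.0 for gene, value in sample.items()}
score_shifted = reo_score(shifted, signature)
```

Because only the orderings inside one sample are used, any monotone increasing distortion of the expression values gives the same score.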
Abstract-5
Messenger RNAs (mRNAs) carry the special responsibility of transmitting the genetic code from DNA to discrete locations in the cytoplasm. The localization of mRNA may provide spatial and temporal regulation of mRNA and protein functions. In situ hybridization and quantitative transcriptomics analysis can provide detailed information about mRNA subcellular localization; however, they are time-consuming and expensive. Computational tools that predict mRNA subcellular location quickly and effectively are therefore highly desirable. In this work, by using binomial distribution and one-way analysis of variance, the optimal nonamer composition was obtained to represent mRNA sequences. Subsequently, a predictor based on support vector machine was developed to identify the mRNA subcellular localization. In 5-fold cross-validation, the accuracy was 90.12% for Homo sapiens (H. sapiens). The predictor may provide a reference for the study of mRNA localization mechanisms and mRNA translocation strategies. An online web server was established based on our models, which is available at http://lin-group.cn/server/iLoc-mRNA/.
Brief: Identifies mRNA subcellular localization from the mRNA sequence using an SVM.
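The nonamer composition mentioned in the abstract is a k-mer frequency vector with k = 9. A minimal sketch of that featurization follows; the feature selection (binomial distribution, ANOVA) and the SVM itself are not shown, and the function name `kmer_composition` is hypothetical.

```python
from collections import Counter


def kmer_composition(sequence, k=9):
    """Return the relative frequency of every overlapping k-mer in a sequence."""
    sequence = sequence.upper()
    total = len(sequence) - k + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


# Small k so the result is easy to check by eye; the paper uses k = 9.
composition = kmer_composition("AUGAUG", k=2)
```

For "AUGAUG" with k = 2 the five overlapping 2-mers are AU, UG, GA, AU, UG, so AU and UG each get 0.4 and GA gets 0.2.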
Abstract-6
Numerous studies have shown that copy number variation (CNV) in lncRNA regions plays critical roles in the initiation and progression of cancer. However, our knowledge about their functionalities is still limited. Here, we first provided a computational method to identify lncRNAs with copy number variation (lncRNAs-CNV) and their driving transcriptional perturbed subpathways by integrating multidimensional omics data of cancer. The high reliability and accuracy of our method have been demonstrated. Then, the method was applied to 14 cancer types, and a comprehensive characterization and analysis was performed. LncRNAs-CNV had high specificity in cancers, and those with high CNV level may perturb broad biological functions. Some core subpathways and cancer hallmarks widely perturbed by lncRNAs-CNV were revealed. Moreover, subpathways highlighted the functional diversity of lncRNAs-CNV in various cancers. Survival analysis indicated that functional lncRNAs-CNV, such as ST7-AS1, CDKN2B-AS1 and EGFR-AS1, could be candidate prognostic biomarkers for clinical applications. In addition, cascade responses and a functional crosstalk model among lncRNAs-CNV, impacted genes, driving subpathways and cancer hallmarks were proposed for understanding the driving mechanism of lncRNAs-CNV. Finally, we developed a user-friendly web interface, LncCASE (http://bio-bigdata.hrbmu.edu.cn/LncCASE/), for exploring lncRNAs-CNV and their driving subpathways in various cancer types. Our study identified and systematically characterized lncRNAs-CNV and their driving subpathways and presented valuable resources for investigating the functionalities of non-coding variations and the mechanisms of tumorigenesis.
Brief: Characterizes lncRNAs-CNV at the pan-cancer level; lncRNAs-CNV affect key pathways in cancer.
Abstract-7
Single-cell RNA sequencing allows us to study cell heterogeneity at an unprecedented cell-level resolution and identify known and new cell populations. The current cell labeling pipeline uses unsupervised clustering and assigns labels to clusters by manual inspection. However, this pipeline does not utilize available gold-standard labels because there are usually too few of them to be useful to most computational methods. This article aims to facilitate cell labeling with a semi-supervised method in an alternative pipeline, in which a few gold-standard labels are first identified and then extended to the rest of the cells computationally. We built a semi-supervised dimensionality reduction method, a network-enhanced autoencoder (netAE). Tested on three public datasets, netAE outperforms various dimensionality reduction baselines and achieves satisfactory classification accuracy even when the labeled set is very small, without disrupting the similarity structure of the original space.
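Brief: Two annotation approaches exist: manual labeling by markers after clustering, and annotation after training on a gold-standard dataset. The paper develops a semi-supervised dimensionality reduction method based on an autoencoder. Its advantage is that the reduced space keeps as much information as possible when samples are compressed to low dimensions, while remaining effective: it shows a strong cluster structure and is easy to classify.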
Abstract-8
Evidence has shown that microRNAs, one type of small biomolecule, regulate the expression level of genes and play an important role in the development or treatment of diseases. Drugs, as important chemical compounds, can interact with microRNAs and change their functions. The experimental identification of microRNA–drug interactions is time-consuming and expensive. Therefore, it is appealing to develop effective computational approaches for predicting microRNA–drug interactions. In this study, a matrix factorization-based method, called the microRNA–drug interaction prediction approach (MDIPA), is proposed for predicting unknown interactions among microRNAs and drugs. Specifically, MDIPA utilizes experimentally validated interactions between drugs and microRNAs, drug similarity and microRNA similarity to predict undiscovered interactions. A path-based microRNA similarity matrix is constructed, while the structural information of drugs is used to establish a drug similarity matrix. To evaluate its performance, our MDIPA is compared with four state-of-the-art prediction methods with an independent dataset and cross-validation. The results of both evaluation methods confirm the superior performance of MDIPA over other methods. Finally, the results of molecular docking in a case study with breast cancer confirm the efficacy of our approach. In conclusion, MDIPA can be effective in predicting potential microRNA–drug interactions.
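Brief: First build a matrix from the known miRNAs and drugs, with known interactions set to 1 and unknown ones to 0. Following the formula defined in the paper, fill in the matrix using the miRNA and drug similarity matrices; then factorize it with NMF and multiply the factors back together to obtain the converged miRNA–drug association matrix.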
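The factorize-then-multiply step can be sketched with the classic multiplicative-update NMF (Lee and Seung) in plain Python. This is a generic NMF and not MDIPA's exact objective; the similarity-based filling step is omitted because its formula is not in these notes, and the toy interaction matrix is made up.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # Update h: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(rank)]
        # Update w: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(n)]
    return w, h


def squared_error(v, approx):
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


# Rows: miRNAs, columns: drugs; 1 = known interaction, 0 = unknown.
known = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]]
w, h = nmf(known, rank=2)
scores = matmul(w, h)  # dense score matrix; high values at 0-entries are candidates
```

The entries of `scores` that sit where `known` has a 0 are the ones to rank as candidate interactions.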
Abstract-9
Gene Set Enrichment Analysis (GSEA) is an algorithm widely used to identify statistically enriched gene sets in transcriptomic data. However, GSEA cannot examine the enrichment of two gene sets or pathways relative to one another. Here we present Differential Gene Set Enrichment Analysis (DGSEA), an adaptation of GSEA that quantifies the relative enrichment of two gene sets. After validating the method using synthetic data, we demonstrate that DGSEA accurately captures the hypoxia-induced coordinated upregulation of glycolysis and downregulation of oxidative phosphorylation. We also show that DGSEA is more predictive than GSEA of the metabolic state of cancer cell lines, including lactate secretion and intracellular concentrations of lactate and AMP. Finally, we demonstrate the application of DGSEA to generate hypotheses about differential metabolic pathway activity in cellular senescence. Together, these data demonstrate that DGSEA is a novel tool to examine the relative enrichment of gene sets in transcriptomic data.
Brief: DGSEA adapts the GSEA algorithm into one that detects the difference between two gene sets.
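A simplified sketch of the idea, assuming the differential statistic is the difference between the two gene sets' enrichment scores. The enrichment score here is the unweighted running-sum version of GSEA; the permutation-based significance testing and normalization are left out, and the gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA running-sum statistic: the signed maximum deviation from 0."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some but not all ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def differential_enrichment(ranked_genes, set_a, set_b):
    """Relative enrichment of set_a versus set_b (assumed form: ES_A - ES_B)."""
    return enrichment_score(ranked_genes, set_a) - enrichment_score(ranked_genes, set_b)


# Genes ranked from most upregulated to most downregulated (placeholder names).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
set_a = {"g1", "g2"}  # e.g. a pathway that goes up
set_b = {"g5", "g6"}  # e.g. a pathway that goes down
delta = differential_enrichment(ranked, set_a, set_b)
```

When one set sits at the top of the ranking and the other at the bottom, the two scores have opposite signs and the difference is larger than either score alone, which is the coordinated shift the abstract describes for glycolysis and oxidative phosphorylation.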
Abstract-10
Accurately predicting the risk of cancer patients is a central challenge for clinical cancer research. For high-dimensional gene expression data, the Cox proportional hazard model with the least absolute shrinkage and selection operator for variable selection (Lasso-Cox) is one of the most popular feature selection and risk prediction algorithms. However, the Lasso-Cox model treats all genes equally, ignoring the biological characteristics of the genes themselves. This often leads to poor prognostic performance on independent datasets. Here, we propose a Reweighted Lasso-Cox (RLasso-Cox) model to ameliorate this problem by integrating gene interaction information. It is based on the hypothesis that topologically important genes in the gene interaction network tend to have stable expression changes. We used random walk to evaluate the topological weight of genes, and then highlighted topologically important genes to improve the generalization ability of the RLasso-Cox model. Experiments on datasets of three cancer types showed that the RLasso-Cox model improves the prognostic accuracy and robustness compared with the Lasso-Cox model and several existing network-based methods. More importantly, the RLasso-Cox model has the advantage of identifying small gene sets with high prognostic performance on independent datasets, which may play an important role in identifying robust survival biomarkers for various cancer types.
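Brief: Rests on the idea that genes with higher topological importance in the gene interaction network have stable expression changes. A random walk on a known gene interaction network evaluates each gene's topological importance, and these topological weights are introduced into Cox regression to improve its generalization.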
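One way to picture the reweighting, sketched in Python. The random walk here is a PageRank-style walk with uniform restart, and the penalty factor is taken as the inverse of the normalized score; both choices are assumptions, since the notes do not give the paper's exact formulas. The Cox fit itself is not shown, and the node names are placeholders.

```python
def random_walk_weights(edges, damping=0.85, iters=100):
    """PageRank-style stationary scores on an undirected gene network."""
    nodes = sorted({gene for edge in edges for gene in edge})
    neighbors = {gene: set() for gene in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    score = {gene: 1.0 / len(nodes) for gene in nodes}
    for _ in range(iters):
        # Each gene receives an equal share of every neighbor's score,
        # plus a uniform restart term.
        score = {
            gene: (1.0 - damping) / len(nodes)
            + damping * sum(score[nb] / len(neighbors[nb]) for nb in neighbors[gene])
            for gene in nodes
        }
    return score


def lasso_penalty_factors(score):
    """Assumed mapping: important genes (high score) get a smaller L1 penalty."""
    mean = sum(score.values()) / len(score)
    return {gene: mean / value for gene, value in score.items()}


# Star-shaped toy network: one hub gene connected to three leaf genes.
edges = [("HUB", "L1"), ("HUB", "L2"), ("HUB", "L3")]
weights = random_walk_weights(edges)
penalties = lasso_penalty_factors(weights)
```

Per-gene penalty factors of this kind are what a weighted Lasso solver would consume, so the hub gene is less likely to be shrunk to zero than the leaves.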
Abstract-11
Population studies such as genome-wide association study have identified a variety of genomic variants associated with human diseases. To further understand potential mechanisms of disease variants, recent statistical methods associate functional omic data (e.g. gene expression) with genotype and phenotype and link variants to individual genes. However, how to interpret molecular mechanisms from such associations, especially across omics, is still challenging. To address this problem, we developed an interpretable deep learning method, Varmole, to simultaneously reveal genomic functions and mechanisms while predicting phenotype from genotype. In particular, Varmole embeds multi-omic networks into a deep neural network architecture and prioritizes variants, genes and regulatory linkages via biological drop-connect without needing prior feature selections.
Brief: Proposes a learning algorithm that predicts disease phenotype from genotype and gene expression data at the population level.
Abstract-12
We developed an interpretable and scalable machine learning model, ECMarker, to predict gene expression biomarkers for disease phenotypes and simultaneously reveal underlying regulatory mechanisms. Particularly, ECMarker is built on the integration of semi- and discriminative-restricted Boltzmann machines, a neural network model for classification allowing lateral connections at the input gene layer. This interpretable model is scalable without needing any prior feature selection and enables directly modeling and prioritizing genes and revealing potential gene networks (from lateral connections) for the phenotypes. With application to the gene expression data of non-small-cell lung cancer patients, we found that ECMarker not only achieved a relatively high accuracy for predicting cancer stages but also identified the biomarker genes and gene networks implying the regulatory mechanisms in lung cancer development. In addition, ECMarker demonstrates clinical interpretability as its prioritized biomarker genes can predict survival rates of early lung cancer patients (P-value < 0.005). Finally, we identified a number of drugs currently in clinical use for late stages or other cancers with effects on these early lung cancer biomarkers, suggesting potential novel candidates for early cancer medicine.
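Brief: Three main parts: (1) identify disease phenotypes at the population level using semi-restricted and discriminative Boltzmann machines; (2) rank the genes contributing to each phenotype by their connectivity in the neural network and identify the related gene regulatory networks; (3) survival and functional analysis of the related genes.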
Abstract-13
CircRNAs are an abundant class of non-coding RNAs with widespread, cell-/tissue-specific patterns. Previous work suggested that epigenetic features might be related to circRNA expression. However, the contribution of epigenetic changes to circRNA expression has not been investigated systematically. Here, we built a machine learning framework named CIRCScan to predict circRNA expression in various cell lines based on sequence and epigenetic features. The prediction accuracy of the expression status models was high, with area under the receiver operating characteristic (ROC) curve values of 0.89–0.92 and false-positive rates of 0.17–0.25. Predicted expressed circRNAs were further validated by RNA-seq data. The performance of expression-level prediction models was also good, with normalized root-mean-square errors of 0.28–0.30 and Pearson’s correlation coefficient r over 0.4 in all cell lines, along with Spearman's correlation coefficient ρ of 0.33–0.46. Notably, H3K79me2 was highly ranked in modeling both circRNA expression status and levels across different cells. Further analysis in nine additional cell lines demonstrated a significant enrichment of H3K79me2 in circRNA flanking intron regions, supporting the potential involvement of H3K79me2 in circRNA expression regulation.
Brief: Predicts circRNA expression in different cell lines from sequence and epigenetic features.
Abstract-14
With the reduction in price of next-generation sequencing technologies, gene expression profiling using RNA-seq has increased the scope of sequencing experiments to include more complex designs, such as designs involving repeated measures. In such designs, RNA samples are extracted from each experimental unit at multiple time points. The read counts that result from RNA sequencing of the samples extracted from the same experimental unit tend to be temporally correlated. Although there are many methods for RNA-seq differential expression analysis, existing methods do not properly account for within-unit correlations that arise in repeated-measures designs. We address this shortcoming by using normalized log-transformed counts and associated precision weights in a general linear model pipeline with continuous autoregressive structure to account for the correlation among observations within each experimental unit. We then utilize parametric bootstrap to conduct differential expression inference. Simulation studies show the advantages of our method over alternatives that do not account for the correlation among observations within experimental units.
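Brief: In time-series RNA sequencing, the RNA-seq measurements tend to be correlated over time, but existing differential expression methods do not account for the within-unit correlation in repeated-measures designs.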
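The continuous autoregressive structure named in the abstract can be written down directly: under the usual CAR(1) definition, two observations from the same unit taken at times t_i and t_j have correlation rho^|t_i - t_j|. A small sketch follows; the function name and the example time points are illustrative.

```python
def car1_correlation(times, rho):
    """Within-unit correlation matrix under a continuous AR(1) structure.

    Entry (i, j) is rho ** |t_i - t_j|, so unequally spaced time points are fine.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return [[rho ** abs(ti - tj) for tj in times] for ti in times]


# Three samples from one experimental unit at unequally spaced time points.
corr = car1_correlation([0, 1, 3], rho=0.5)
```

Here the samples at times 0 and 1 have correlation 0.5, while those at times 0 and 3 have only 0.125: the further apart two measurements are, the weaker their assumed correlation.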
Abstract-15
We describe a new iteration of ICGS that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse non-negative matrix factorization, cluster ‘fitness’, support vector machine) to resolve rare and common cell-states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, we show that the PageRank algorithm effectively downsamples ultra-large scRNA-Seq datasets, without losing extremely rare or transcriptionally similar yet distinct cell types and while recovering novel transcriptionally distinct cell populations. We believe this new approach holds tremendous promise in reproducibly resolving hidden cell populations in complex datasets.
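Brief: Workflow: (1) if the dataset has more than 2500 cells, downsample it with PageRank/Louvain-based downsampling according to the dataset size; (2) identify gene modules by computing correlation coefficients among highly variable genes, excluding cell-cycle-related genes beforehand; (3) reduce the dimensionality with NMF, where K is set by the number of significantly differential genes; (4) for each NMF-defined cluster, correlate every gene with the cluster metadata; the top 60 genes per group with a correlation coefficient above 0.3 are taken as that group's marker genes; (5) fit an SVM on the input cells and the selected marker genes, then classify all cells with this model.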
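Step (4) of the workflow above can be sketched as follows, reading "top 60 and correlation above 0.3" as both conditions applied together (an interpretation of the notes, not confirmed against the ICGS source). Cluster membership is encoded as a 0/1 vector over cells; gene names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def marker_genes(expression, membership, top_n=60, min_r=0.3):
    """Genes most correlated with cluster membership, filtered by min_r."""
    scored = [(pearson(values, membership), gene) for gene, values in expression.items()]
    kept = sorted((pair for pair in scored if pair[0] > min_r), reverse=True)
    return [gene for _, gene in kept[:top_n]]


# Four cells; the first two belong to the cluster of interest.
membership = [1, 1, 0, 0]
expression = {
    "geneA": [5.0, 6.0, 1.0, 0.0],  # high inside the cluster
    "geneB": [1.0, 0.0, 5.0, 6.0],  # high outside the cluster
    "geneC": [2.0, 2.0, 2.0, 2.0],  # flat
}
markers = marker_genes(expression, membership)
```

The selected marker genes would then be the features passed to the SVM in step (5).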
Brief: A detection pipeline for ncRNA.
Abstract-18
Although there has been great progress in cancer treatment, cancer remains a serious health threat to humans because of the lack of biomarkers for diagnosis, especially for early-stage diagnosis. In this study, we comprehensively surveyed the specifically expressed genes (SEGs) using SEGtool, based on the large-scale gene expression data from The Cancer Genome Atlas (TCGA) and the Genotype–Tissue Expression (GTEx) projects. In 15 solid tumors, we identified 233 cancer-specific SEGs (cSEGs), which were specifically expressed in only one cancer and showed great potential to be diagnostic biomarkers. Among them, three cSEGs (OGDH, MUDENG and ACO2) had a sample frequency >80% in kidney cancer, suggesting their high sensitivity. Furthermore, we identified 254 cSEGs as early-stage diagnostic biomarkers across 17 cancers. A two-gene combination strategy was applied to improve the sensitivity of diagnostic biomarkers, and hundreds of two-gene combinations were identified with high frequency. We also observed that 13 SEGs were targets of various drugs and nearly half of these drugs may be repurposed to treat cancers with SEGs as their targets. Several SEGs were regulated by specific transcription factors in the corresponding cancer, and 39 cSEGs were prognosis-related genes in 7 cancers. This work provides a survey of cancer biomarkers for diagnosis and early diagnosis and new insights into drug repurposing. These biomarkers may have great potential in cancer research and application.
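Brief: Specifically expressed genes (SEGs) are genes expressed in only a few tissues; being tissue-specific, they may serve as markers for cancer diagnosis. The study surveys cancer-related specifically expressed genes across cancers and finds a number of early-stage diagnostic genes (that is, genes highly expressed at stage T1). Abnormal expression of pairwise combinations of cancer-related specific genes has stronger diagnostic power.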