2020-09-22-JC

Briefings in Bioinformatics

Volume 21, Issue 4, July 2020

Cleft palate (CP) is the second most common congenital birth defect. The etiology of CP is complicated, with involvement of various genetic and environmental factors. To investigate the gene regulatory mechanisms, we designed a powerful regulatory analytical approach to identify the conserved regulatory networks in humans and mice, from which we identified critical microRNAs (miRNAs), target genes and regulatory motifs (miRNA–TF–gene) related to CP. Using our manually curated genes and miRNAs with evidence in CP in humans and mice, we constructed miRNA and transcription factor (TF) co-regulation networks for both humans and mice. A consensus regulatory loop (miR17/miR20a–FOXE1–PDGFRA) and eight miRNAs (miR-140, miR-17, miR-18a, miR-19a, miR-19b, miR-20a, miR-451a and miR-92a) were discovered in both humans and mice. The role of miR-140, which had the strongest association with CP, was investigated in both human and mouse palate cells. The overexpression of miR-140-5p, but not miR-140-3p, significantly inhibited cell proliferation. We further examined whether miR-140 overexpression could suppress the expression of its predicted target genes (BMP2, FGF9, PAX9 and PDGFRA). Our results indicated that miR-140-5p overexpression suppressed the expression of BMP2 and FGF9 in cultured human palate cells and Fgf9 and Pdgfra in cultured mouse palate cells. In summary, our conserved miRNA–TF–gene regulatory network approach is effective in detecting consensus miRNAs, motifs, and regulatory mechanisms in human and mouse CP

Brief: Identifies cleft-palate-related transcription factors, miRNAs and mRNAs, infers their regulatory relationships, and validates some of them experimentally.

Somatic mutation and gene expression dysregulation are considered two major tumorigenesis factors. While independent investigations of either factor pervade, studies of associations between somatic mutations and gene expression changes have been sporadic and nonsystematic. Utilizing genomic data collected from 11 315 subjects of 33 distinct cancer types, we constructed MutEx, a pan-cancer integrative genomic database. This database records the relationships among gene expression, somatic mutation and survival data for cancer patients. MutEx can be used to swiftly explore the relationship between these genomic/clinic features within and across cancer types and, more importantly, search for corroborating evidence for hypothesis inception. Our database also incorporated Gene Ontology and several pathway databases to enhance functional annotation, and elastic net and a gene expression composite score to aid in survival analysis. To demonstrate the usability of MutEx, we provide several application examples, including top somatic mutations associated with the most extensive expression dysregulation in breast cancer, differential mutational burden downstream of DNA mismatch repair gene mutations and composite gene expression score-based survival difference in breast cancer. MutEx can be accessed at http://www.innovebioinfo.com/Databases/Mutationdb_About.php.

Brief: Pan-cancer database linking somatic mutations, gene expression and survival.

With the increasing awareness of heterogeneity in cancers, better prediction of cancer prognosis is much needed for more personalized treatment. Recently, extensive efforts have been made to explore the variations in gene expression for better prognosis. However, the prognostic gene signatures predicted by most existing methods have little robustness among different datasets of the same cancer. To improve the robustness of the gene signatures, we propose a novel high-frequency sub-pathways mining approach (HiFreSP), integrating a randomization strategy with gene interaction pathways. We identified a six-gene signature (CCND1, CSF3R, E2F2, JUP, RARA and TCF7) in esophageal squamous cell carcinoma (ESCC) by HiFreSP. This signature displayed a strong ability to predict the clinical outcome of ESCC patients in two independent datasets (log-rank test, P = 0.0045 and 0.0087). To further show the predictive performance of HiFreSP, we applied it to two other cancers: pancreatic adenocarcinoma and breast cancer. The identified signatures show high predictive power in all testing datasets of the two cancers. Furthermore, compared with the two popular prognosis signature predicting methods, the least absolute shrinkage and selection operator penalized Cox proportional hazards model and the random survival forest, HiFreSP showed better predictive accuracy and generalization across all testing datasets of the above three cancers. Lastly, we applied HiFreSP to 8137 patients involving 20 cancer types in the TCGA database and found high-frequency prognosis-associated pathways in many cancers. Taken together, HiFreSP shows higher prognostic capability and greater robustness, and the identified signatures provide clinical guidance for cancer prognosis. HiFreSP is freely available via GitHub: https://github.com/chunquanlipathway/HiFreSP.

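Brief: Tool for screening cancer prognostic genes that combines a randomization strategy with gene interaction pathways; outperforms LASSO-penalized Cox regression and random survival forest.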

Depression is a seriously disabling psychiatric disorder with a significant burden of disease. Metabolic abnormalities have been widely reported in depressed patients and animal models. However, there are few systematic efforts that integrate meaningful biological insights from these studies. Herein, available metabolic knowledge in the context of depression was integrated to provide a systematic and panoramic view of metabolic characterization. After screening more than 10 000 citations from five electronic literature databases and five metabolomics databases, we manually curated 5675 metabolite entries from 464 studies, including human, rat, mouse and non-human primate, to develop a new metabolite-disease association database, called MENDA (http://menda.cqmu.edu.cn:8080/index.php). The standardized data extraction process was used for data collection, a multi-faceted annotation scheme was developed, and a user-friendly search engine and web interface were integrated for database access. To facilitate data analysis and interpretation based on MENDA, we also proposed a systematic analytical framework, including data integration and biological function analysis. Case studies were provided that identified the consistently altered metabolites using the vote-counting method, and that captured the underlying molecular mechanism using pathway and network analyses. Collectively, we provided a comprehensive curation of metabolic characterization in depression. Our model of a specific psychiatry disorder may be replicated to study other complex diseases.

Brief: Metabolic atlas of depression, built by integrating existing databases and literature, with functional annotation.
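
The vote-counting step mentioned in the MENDA abstract can be illustrated with a minimal sketch. It assumes each curated entry reduces to a metabolite name plus a direction of change; the toy records, the `vote_count` helper and the `min_studies` threshold are hypothetical and are not taken from MENDA, whose actual criteria may differ.

```python
from collections import defaultdict


def vote_count(entries, min_studies=3):
    """Tally up/down reports per metabolite.

    entries: iterable of (metabolite, direction), direction being "up" or "down".
    Returns {metabolite: n_up - n_down} for metabolites reported in at least
    `min_studies` entries (the threshold is an assumption of this sketch).
    """
    votes = defaultdict(lambda: [0, 0])  # metabolite -> [n_up, n_down]
    for metabolite, direction in entries:
        if direction == "up":
            votes[metabolite][0] += 1
        elif direction == "down":
            votes[metabolite][1] += 1
        else:
            raise ValueError(f"unknown direction: {direction!r}")
    return {
        name: up - down
        for name, (up, down) in votes.items()
        if up + down >= min_studies
    }


# Hypothetical toy records, not real MENDA data.
toy_entries = [
    ("metabolite_A", "down"), ("metabolite_A", "down"), ("metabolite_A", "down"),
    ("metabolite_B", "up"), ("metabolite_B", "down"), ("metabolite_B", "up"),
    ("metabolite_C", "up"),
]
scores = vote_count(toy_entries)
```

A strongly negative or positive score marks a metabolite that most studies report as changed in the same direction; a score near zero marks conflicting evidence.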

Circular RNAs (circRNAs) are a group of novel discovered non-coding RNAs with closed-loop structure, which play critical roles in various biological processes. Identifying associations between circRNAs and diseases is critical for exploring the complex disease mechanism and facilitating disease-targeted therapy. Although several computational predictors have been proposed, their performance is still limited. In this study, a novel computational method called iCircDA-MF is proposed. Because the circRNA-disease associations with experimental validation are very limited, the potential circRNA-disease associations are calculated based on the circRNA similarity and disease similarity extracted from the disease semantic information and the known associations of circRNA-gene, gene-disease and circRNA-disease. The circRNA-disease interaction profiles are then updated by the neighbour interaction profiles so as to correct the false negative associations. Finally, the matrix factorization is performed on the updated circRNA-disease interaction profiles to predict the circRNA-disease associations. The experimental results on a widely used benchmark dataset showed that iCircDA-MF outperforms other state-of-the-art predictors and can identify new circRNA-disease associations effectively.

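Brief: Tool for predicting circRNA-disease associations; the inputs are textual (disease semantic) information and known relationships among circRNAs, genes and diseases; based on matrix factorization.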
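
The two computational steps named in the iCircDA-MF abstract (updating interaction profiles from neighbours, then matrix factorization) can be sketched roughly as follows. This is not the authors' implementation: the neighbour rule (average of similarity-weighted neighbour profiles), the plain SGD factorization, all function names and the toy matrices are assumptions made for illustration only.

```python
import random


def neighbour_update(Y, sim, k=2):
    """Soften possible false negatives using the k most similar rows.

    Y: circRNA x disease 0/1 matrix; sim: circRNA x circRNA similarity matrix.
    Each entry becomes max(original, mean of similarity-weighted neighbour values).
    """
    n, m = len(Y), len(Y[0])
    updated = [row[:] for row in Y]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: sim[i][j], reverse=True)[:k]
        for c in range(m):
            est = sum(sim[i][j] * Y[j][c] for j in nbrs) / len(nbrs)
            updated[i][c] = max(Y[i][c], est)
    return updated


def factorize(Y, rank=2, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Approximate Y by U * V^T with L2-regularized stochastic gradient descent."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    U = [[rng.uniform(0, 0.1) for _ in range(rank)] for _ in range(n)]
    V = [[rng.uniform(0, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                err = Y[i][j] - sum(U[i][f] * V[j][f] for f in range(rank))
                for f in range(rank):
                    u, v = U[i][f], V[j][f]
                    U[i][f] += lr * (err * v - reg * u)
                    V[j][f] += lr * (err * u - reg * v)
    return U, V


def scores(U, V):
    """Predicted association score for every circRNA-disease pair."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]


def squared_error(A, B):
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical toy data: 3 circRNAs x 3 diseases.
Y = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
sim = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
Y_updated = neighbour_update(Y, sim)
U, V = factorize(Y_updated)
predicted = scores(U, V)
```

In this toy case the first circRNA has no recorded link to the second disease, but its most similar neighbour does, so the update raises that entry above zero before factorization.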

Microbial community (MC) has a great impact on mediating complex disease indications, biogeochemical cycling and agricultural productivities, which makes metaproteomics a powerful technique for quantifying the diverse and dynamic composition of proteins or peptides. The key role of biostatistical strategies in MC study is reported to be underestimated; in particular, the appropriate application of feature selection methods (FSMs) is largely ignored. Although extensive efforts have been devoted to assessing the performance of FSMs, previous studies focused only on their classification accuracy without considering their ability to correctly and comprehensively identify the spiked proteins. In this study, the performances of 14 FSMs were comprehensively assessed based on two key criteria (both sample classification and spiked protein discovery) using a variety of metaproteomics benchmarks. First, the classification accuracies of those 14 FSMs were evaluated. Then, their abilities in identifying the proteins of different spiked concentrations were assessed. Finally, seven FSMs (FC, LMEB, OPLS-DA, PLS-DA, SAM, SVM-RFE and T-Test) were identified as performing consistently superior or good under both criteria, with PLS-DA performing consistently superior. In summary, this study served as a comprehensive analysis of the performances of current FSMs and could provide a valuable guideline for researchers in metaproteomics.

Brief: Metaproteomics.

Streptococcus pneumoniae is the most common human respiratory pathogen, and β-lactam antibiotics have been employed to treat infections caused by S. pneumoniae for decades. β-lactam resistance is steadily increasing in pneumococci and is mainly associated with the alteration in penicillin-binding proteins (PBPs) that reduce binding affinity of antibiotics to PBPs. However, the high variability of PBPs in clinical isolates and their mosaic gene structure hamper the prediction of resistance level according to the PBP gene sequences. In this study, we developed a systematic strategy for applying supervised machine learning to predict S. pneumoniae antimicrobial susceptibility to β-lactam antibiotics. We combined published PBP sequences with minimum inhibitory concentration (MIC) values as labelled data and the sequences from NCBI database without MIC values as unlabelled data to develop an approach, using only a fragment from pbp2x (750 bp) and a fragment from pbp2b (750 bp) to predict the cefuroxime and amoxicillin resistance. We further validated the performance of the supervised learning model by constructing mutants containing the randomly selected pbps and testing more clinical strains isolated from Chinese hospital. In addition, we established the association between resistance phenotypes and serotypes and sequence type of S. pneumoniae using our approach, which facilitates the understanding of the worldwide epidemiology of S. pneumoniae.

Brief: Supervised learning to predict the antibiotic susceptibility of Streptococcus pneumoniae.

Functional annotation of protein sequence with high accuracy has become one of the most important issues in modern biomedical studies, and computational approaches of significantly accelerated analysis process and enhanced accuracy are greatly desired. Although a variety of methods have been developed to elevate protein annotation accuracy, their ability in controlling false annotation rates remains either limited or not systematically evaluated. In this study, a protein encoding strategy, together with a deep learning algorithm, was proposed to control the false discovery rate in protein function annotation, and its performances were systematically compared with that of the traditional similarity-based and de novo approaches. Based on a comprehensive assessment from multiple perspectives, the proposed strategy and algorithm were found to perform better in both prediction stability and annotation accuracy compared with other de novo methods. Moreover, an in-depth assessment revealed that it possessed an improved capacity of controlling the false discovery rate compared with traditional methods. All in all, this study not only provided a comprehensive analysis on the performances of the newly proposed strategy but also provided a tool for the researcher in the fields of protein function annotation.

Brief: Improves the accuracy of protein function annotation using a protein encoding strategy and deep learning.

Essential genes are those whose loss of function compromises organism viability or results in profound loss of fitness. Recent gene-editing technologies have provided new opportunities to characterize essential genes. Here, we present an integrated analysis that comprehensively and systematically elucidates the genetic and regulatory characteristics of human essential genes. First, we found that essential genes act as ‘hubs’ in protein–protein interaction networks, chromatin structure and epigenetic modification. Second, essential genes represent conserved biological processes across species, although gene essentiality changes differently among species. Third, essential genes are important for cell development due to their discriminate transcription activity in embryo development and oncogenesis. In addition, we developed an interactive web server, the Human Essential Genes Interactive Analysis Platform (http://sysomics.com/HEGIAP/), which integrates abundant analytical tools to enable global, multidimensional interpretation of gene essentiality. Our study provides new insights that improve the understanding of human essential genes.

Brief: Database of the genetic and regulatory characteristics of human essential genes, built on gene-editing data.

Long non-coding RNAs (lncRNAs) are of fundamental biological importance; however, their functional role is often unclear or loosely defined as experimental characterization is challenging and bioinformatic methods are limited. We developed a novel integrated method protocol for the annotation and detailed functional characterization of lncRNAs within the genome. It combines annotation, normalization and gene expression with sequence-structure conservation, functional interactome and promoter analysis. Our protocol allows an analysis based on the tissue and biological context, and is powerful in functional characterization of experimental and clinical RNA-Seq datasets including existing lncRNAs. This is demonstrated on the uncharacterized lncRNA GATA6-AS1 in dilated cardiomyopathy.

Brief: lncRNA functional annotation.

Nucleic Acids Res

Volume 48, Issue 16, 18 September 2020

Microbial and viral communities transform the chemistry of Earth's ecosystems, yet the specific reactions catalyzed by these biological engines are hard to decode due to the absence of a scalable, metabolically resolved, annotation software. Here, we present DRAM (Distilled and Refined Annotation of Metabolism), a framework to translate the deluge of microbiome-based genomic information into a catalog of microbial traits. To demonstrate the applicability of DRAM across metabolically diverse genomes, we evaluated DRAM performance on a defined, in silico soil community and previously published human gut metagenomes. We show that DRAM accurately assigned microbial contributions to geochemical cycles and automated the partitioning of gut microbial carbohydrate metabolism at substrate levels. DRAM-v, the viral mode of DRAM, established rules to identify virally-encoded auxiliary metabolic genes (AMGs), resulting in the metabolic categorization of thousands of putative AMGs from soils and guts. Together DRAM and DRAM-v provide critical metabolic profiling capabilities that decipher mechanisms underpinning microbiome function.

Brief: Extraction of information from microbial genomes.

The most popular RNA secondary structure prediction programs utilize free energy (ΔG°37) minimization and rely upon thermodynamic parameters from the nearest neighbor (NN) model. Experimental parameters are derived from a series of optical melting experiments; however, acquiring enough melt data to derive accurate NN parameters with modified base pairs is expensive and time consuming. Given the multitude of known natural modifications and the continuing use and development of unnatural nucleotides, experimentally characterizing all modified NNs is impractical. This dilemma necessitates a computational model that can predict NN thermodynamics where experimental data is scarce or absent. Here, we present a combined molecular dynamics/quantum mechanics protocol that accurately predicts experimental NN ΔG°37 parameters for modified nucleotides with neighboring Watson–Crick base pairs. NN predictions for Watson-Crick and modified base pairs yielded an overall RMSD of 0.32 kcal/mol when compared with experimentally derived parameters. NN predictions involving modified bases without experimental parameters (N6-methyladenosine, 2-aminopurineriboside, and 5-methylcytidine) demonstrated promising agreement with available experimental melt data. This procedure not only yields accurate NN ΔG°37 predictions but also quantifies stacking and hydrogen bonding differences between modified NNs and their canonical counterparts, allowing investigators to identify energetic differences and providing insight into sources of (de)stabilization from nucleotide modifications.

Brief: RNA secondary structure prediction.

The differentiation and regeneration of skeletal muscle from myoblasts to myotubes involves myogenic transcription factors, such as myocardin-related transcription factor A (MRTF-A) and serum response factor (SRF). In addition, post-transcriptional regulation by miRNAs is required during myogenesis. Here, we provide evidence for novel mechanisms regulating MRTF-A during myogenic differentiation. Endogenous MRTF-A protein abundance and activity decreased during C2C12 differentiation, which was attributable to miRNA-directed inhibition. Conversely, overexpression of MRTF-A impaired differentiation and myosin expression. Applying miRNA trapping by RNA affinity purification (miTRAP), we identified miRNAs which directly regulate MRTF-A via its 3′UTR, including miR-1a-3p, miR-206-3p, miR-24-3p and miR-486-5p. These miRNAs were upregulated during differentiation and specifically recruited to the 3′UTR of MRTF-A. Concomitantly, Ago2 recruitment to the MRTF-A 3′UTR was considerably increased, whereas Dicer1 depletion or 3′UTR deletion elevated MRTF-A and inhibited differentiation. MRTF-A protein expression was inhibited by ectopic miRNA expression in murine C2C12 and primary human myoblasts. 3′UTR reporter activity diminished upon differentiation or miRNA expression, whereas deletion of the predicted binding sites reversed these effects. Furthermore, TGF-β abolished MRTF-A reduction and decreased miR-486-5p expression. Our findings implicate miR-24-3p and miR-486-5p in the repression of MRTF-A and suggest a complex network of transcriptional and post-transcriptional mechanisms regulating myogenesis.

Brief: Role of miRNAs in regulating myogenic differentiation (experimental paper).

Infertility is a complex multifactorial disease that affects up to 10% of couples across the world. However, many mechanisms of infertility remain unclear due to the lack of studies based on systematic knowledge, leading to ineffective treatment and/or transmission of genetic defects to offspring. Here, we developed an infertility disease database to provide a comprehensive resource featuring various factors involved in infertility. Features in the current IDDB version were manually curated as follows: (i) a total of 307 infertility-associated genes in human and 1348 genes associated with reproductive disorder in 9 model organisms; (ii) a total of 202 chromosomal abnormalities leading to human infertility, including aneuploidies and structural variants; and (iii) a total of 2078 pathogenic variants from infertility patients’ samples across 60 different diseases causing infertility. Additionally, the characteristics of clinically diagnosed infertility patients (i.e. causative variants, laboratory indexes and clinical manifestations) were collected. To the best of our knowledge, the IDDB is the first infertility database serving as a systematic resource for biologists to decipher infertility mechanisms and for clinicians to achieve better diagnosis/treatment of patients from disease phenotype to genetic factors. The IDDB is freely available at http://mdl.shsmu.edu.cn/IDDB/.

Brief: Infertility disease database, including reproduction-related gene variants.

PULs (polysaccharide utilization loci) are discrete gene clusters of CAZymes (Carbohydrate Active EnZymes) and other genes that work together to digest and utilize carbohydrate substrates. While PULs have been extensively characterized in Bacteroidetes, there exist PULs from other bacterial phyla, as well as archaea and metagenomes, that remain to be catalogued in a database for efficient retrieval. We have developed an online database dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) to display experimentally verified CAZyme-containing PULs from literature with pertinent metadata, sequences, and annotation. Compared to other online CAZyme and PUL resources, dbCAN-PUL has the following new features: (i) Batch download of PUL data by target substrate, species/genome, genus, or experimental characterization method; (ii) Annotation for each PUL that displays associated metadata such as substrate(s), experimental characterization method(s) and protein sequence information, (iii) Links to external annotation pages for CAZymes (CAZy), transporters (UniProt) and other genes, (iv) Display of homologous gene clusters in GenBank sequences via integrated MultiGeneBlast tool and (v) An integrated BLASTX service available for users to query their sequences against PUL proteins in dbCAN-PUL. With these features, dbCAN-PUL will be an important repository for CAZyme and PUL research, complementing our other web servers and databases (dbCAN2, dbCAN-seq).

Brief: Carbohydrate-active enzyme sequences and annotation.

Many studies have indicated that non-coding RNA (ncRNA) dysfunction is closely related to numerous diseases. Recently, accumulated ncRNA–disease associations have made related databases insufficient to meet the demands of biomedical research. The constant updating of ncRNA–disease resources has become essential. Here, we have updated the mammal ncRNA–disease repository (MNDR, http://www.rna-society.org/mndr/) to version 3.0, containing more than one million entries, four-fold increment in data compared to the previous version. Experimental and predicted circRNA–disease associations have been integrated, increasing the number of categories of ncRNAs to five, and the number of mammalian species to 11. Moreover, ncRNA–disease related drug annotations and associations, as well as ncRNA subcellular localizations and interactions, were added. In addition, three ncRNA–disease (miRNA/lncRNA/circRNA) prediction tools were provided, and the website was also optimized, making it more practical and user-friendly. In summary, MNDR v3.0 will be a valuable resource for the investigation of disease mechanisms and clinical treatment strategies.

Brief: ncRNA database providing disease and drug association data, as well as ncRNA subcellular localization and interactions.

Although cancer is the leading cause of disease-related mortality in children, the relative rarity of pediatric cancers poses a significant challenge for developing novel therapeutics to further improve prognosis. Patient-derived xenograft (PDX) models, which are usually developed from high-risk tumors, are a useful platform to study molecular driver events, identify biomarkers and prioritize therapeutic agents. Here, we develop PDX for Childhood Cancer Therapeutics (PCAT), a new integrated portal for pediatric cancer PDX models. Distinct from previously reported PDX portals, PCAT is focused on pediatric cancer models and provides intuitive interfaces for querying and data mining. The current release comprises 324 models and their associated clinical and genomic data, including gene expression, mutation and copy number alteration. Importantly, PCAT curates preclinical testing results for 68 models and 79 therapeutic agents manually collected from individual agent testing studies published since 2008. To facilitate comparisons of patterns between patient tumors and PDX models, PCAT curates clinical and molecular data of patient tumors from the TARGET project. In addition, PCAT provides access to gene fusions identified in nearly 1000 TARGET samples. PCAT was built using R-shiny and MySQL. The portal can be accessed at http://pcat.zhenglab.info or http://www.pedtranscriptome.org.

Brief: Pediatric cancer PDX database with 324 models and their clinical, gene expression and mutation data, plus drug treatment data.


Brief: Database of marine microbial sequencing data and the physicochemical properties of the corresponding water samples.

Housekeeping (HK) genes are constitutively expressed genes that are required for the maintenance of basic cellular functions. Despite their importance in the calibration of gene expression, as well as the understanding of many genomic and evolutionary features, important discrepancies have been observed in studies that previously identified these genes. Here, we present Housekeeping and Reference Transcript Atlas (HRT Atlas v1.0, www.housekeeping.unicamp.br) a web-based database which addresses some of the previously observed limitations in the identification of these genes, and offers a more accurate database of human and mouse HK genes and transcripts. The database was generated by mining massive human and mouse RNA-seq data sets, including 11 281 and 507 high-quality RNA-seq samples from 52 human non-disease tissues/cells and 14 healthy tissues/cells of C57BL/6 wild type mouse, respectively. User can visualize the expression and download lists of 2158 human HK transcripts from 2176 HK genes and 3024 mouse HK transcripts from 3277 mouse HK genes. HRT Atlas also offers the most stable and suitable tissue selective candidate reference transcripts for normalization of qPCR experiments. Specific primers and predicted modifiers of gene expression for some of these HK transcripts are also proposed. HRT Atlas has also been integrated with a regulatory elements resource from Epiregio server.

Brief: Mouse and human housekeeping gene database built by mining RNA-seq data. It can be used to look up reference genes suited to specific tissues, and it includes primer sequences.

PathDIP was introduced to increase proteome coverage of literature-curated human pathway databases. PathDIP 4 now integrates 24 major databases. To further reduce the number of proteins with no curated pathway annotation, pathDIP integrates pathways with physical protein–protein interactions (PPIs) to predict significant physical associations between proteins and curated pathways. For human, it provides pathway annotations for 5366 pathway orphans. Integrated pathway annotation now includes six model organisms and ten domesticated animals. A total of 6401 core and ortholog pathways have been curated from the literature or by annotating orthologs of human proteins in the literature-curated pathways. Extended pathways are the result of combining these pathways with protein-pathway associations that are predicted using organism-specific PPIs. Extended pathways expand proteome coverage from 81 088 to 120 621 proteins, making pathDIP 4 the largest publicly available pathway database for these organisms and providing a necessary platform for comprehensive pathway-enrichment analysis. PathDIP 4 users can customize their search and analysis by selecting organism, identifier and subset of pathways. Enrichment results and detailed annotations for input list can be obtained in different formats and views. To support automated bioinformatics workflows, Java, R and Python APIs are available for batch pathway annotation and enrichment analysis. PathDIP 4 is publicly available at http://ophid.utoronto.ca/pathDIP.

Brief: Pathway annotation dataset covering six model organisms and ten domesticated animals.
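
The pathway-enrichment analysis that pathDIP supports can be illustrated with a generic over-representation test. The abstract does not say which statistic pathDIP uses, so the hypergeometric test below is an assumption, and the function name and toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_query, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: number of annotated proteins in the background
    n_pathway:    background proteins that belong to the pathway
    n_query:      size of the input protein list
    n_overlap:    input proteins that belong to the pathway
    """
    max_overlap = min(n_pathway, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy counts: 3 of 4 query proteins fall in a 5-protein pathway
# drawn from a background of 20 proteins.
p_value = enrichment_p_value(n_background=20, n_pathway=5, n_query=4, n_overlap=3)
```

In practice the test is repeated for every pathway and the resulting p-values are corrected for multiple testing.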

Genomics, Proteomics & Bioinformatics

Volume 17, Issue 5, Pages 473-550 (October 2019)


Brief: Deep learning prediction of protein-RNA binding.

Accurate identification of compound–protein interactions (CPIs) in silico may deepen our understanding of the underlying mechanisms of drug action and thus remarkably facilitate drug discovery and development. Conventional similarity- or docking-based computational methods for predicting CPIs rarely exploit latent features from currently available large-scale unlabeled compound and protein data and often limit their usage to relatively small-scale datasets. In the present study, we propose DeepCPI, a novel general and scalable computational framework that combines effective feature embedding (a technique of representation learning) with powerful deep learning methods to accurately predict CPIs at a large scale. DeepCPI automatically learns the implicit yet expressive low-dimensional features of compounds and proteins from a massive amount of unlabeled data. Evaluations of the measured CPIs in large-scale databases, such as ChEMBL and BindingDB, as well as of the known drug–target interactions from DrugBank, demonstrated the superior predictive performance of DeepCPI. Furthermore, several interactions among small-molecule compounds and three G protein-coupled receptor targets (glucagon-like peptide-1 receptor, glucagon receptor, and vasoactive intestinal peptide receptor) predicted using DeepCPI were experimentally validated. The present study suggests that DeepCPI is a useful and powerful tool for drug discovery and repositioning. The source code of DeepCPI can be downloaded from https://github.com/FangpingWan/DeepCPI.

Brief: Predicts compound-protein interactions using feature embedding and deep learning, to guide drug screening.

Bioinformatics

Volume 36, Issue 12, 15 June 2020

Next-generation sequencing technologies have accelerated the discovery of single nucleotide variants in the human genome, stimulating the development of predictors for classifying which of these variants are likely functional in disease, and which neutral. Recently, we proposed CScape, a method for discriminating between cancer driver mutations and presumed benign variants. For the neutral class, this method relied on benign germline variants found in the 1000 Genomes Project database. Discrimination could, therefore, be influenced by the distinction of germline versus somatic, rather than neutral versus disease driver. This motivates this article in which we consider predictive discrimination between recurrent and rare somatic single point mutations based solely on using cancer data, and the distinction between these two somatic classes and germline single point mutations.

Brief: Predicting driver mutations.

We studied the problem of discriminating early- and late-stage tumors of several cancers using genomic information while enforcing interpretability on the solutions. To this end, we developed a multitask multiple kernel learning (MTMKL) method with a co-clustering step based on a cutting-plane algorithm to identify the relationships between the input tasks and kernels. We tested our algorithm on 15 cancer cohorts and observed that, in most cases, MTMKL outperforms other algorithms (including random forests, support vector machine and single-task multiple kernel learning) in terms of predictive power. Using the aggregate results from multiple replications, we also derived similarity matrices between cancer cohorts, which are, in many cases, in agreement with available relationships reported in the relevant literature.

Brief: Uses a multiple kernel learning method to distinguish early-stage from late-stage cancer based on gene expression.

We describe a new iteration of ICGS that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse non-negative matrix factorization, cluster ‘fitness’, support vector machine) to resolve rare and common cell-states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, we show that the PageRank algorithm effectively downsamples ultra-large scRNA-Seq datasets, without losing extremely rare or transcriptionally similar yet distinct cell types and while recovering novel transcriptionally distinct cell populations. We believe this new approach holds tremendous promise in reproducibly resolving hidden cell populations in complex datasets.

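Brief: Identifying single-cell subpopulation clusters with a multi-step algorithm that combines non-negative matrix factorization, clustering and support vector machines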
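The abstract says PageRank is used to downsample very large scRNA-Seq datasets but gives no details. The following is a minimal sketch of the general idea, assuming cells are nodes in a neighbour graph and the highest-ranked cells are kept. The graph, the `downsample` helper and the keep-top-k rule are assumptions for illustration, not the ICGS implementation.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {node: [neighbours]}."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += damping * rank[v] / len(adjacency[v])
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def downsample(adjacency, keep):
    """Hypothetical helper: keep the `keep` cells with the highest PageRank."""
    scores = pagerank(adjacency)
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Toy undirected cell-cell graph: c0 is a hub, c4 bridges to c5.
toy_graph = {
    "c0": ["c1", "c2", "c3", "c4"],
    "c1": ["c0"],
    "c2": ["c0"],
    "c3": ["c0"],
    "c4": ["c0", "c5"],
    "c5": ["c4"],
}
toy_scores = pagerank(toy_graph)
toy_kept = downsample(toy_graph, keep=2)
```

In this toy graph the hub cell and the bridging cell are retained, which hints at how a rank-based selection can preserve cells that connect otherwise separate neighbourhoods.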

Here, we present a Bayesian ridge regression-based method (B-GEX) to infer gene expression profiles of multiple tissues from blood gene expression profile. For each gene in a tissue, a low-dimensional feature vector was extracted from whole blood gene expression profile by feature selection. We used GTEx RNAseq data of 16 tissues to train inference models to capture the cross-tissue expression correlations between each target gene in a tissue and its preselected feature genes in peripheral blood. We compared B-GEX with least square regression, LASSO regression and ridge regression. B-GEX outperforms the other three models in most tissues in terms of mean absolute error, Pearson correlation coefficient and root-mean-squared error. Moreover, B-GEX infers expression level of tissue-specific genes as well as those of non-tissue-specific genes in all tissues. Unlike previous methods, which require genomic features or gene expression profiles of multiple tissues, our model only requires whole blood expression profile as input. B-GEX helps gain insights into gene expressions of uncollected tissues from more accessible data of blood.

Brief: Inferring the expression profiles of multiple tissues from the blood gene expression profile using Bayesian ridge regression
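The three metrics named in the abstract (mean absolute error, Pearson correlation coefficient and root-mean-squared error) can be computed as in the sketch below. The function names and the toy expression values are assumptions for illustration; they are not taken from the B-GEX code or data.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared difference."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Made-up expression values for one gene across four samples.
observed_expr = [1.0, 2.0, 3.0, 4.0]
predicted_expr = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(observed_expr, predicted_expr)
rmse = root_mean_squared_error(observed_expr, predicted_expr)
pcc = pearson_correlation(observed_expr, predicted_expr)
```

Lower MAE and RMSE and a higher correlation indicate better agreement between inferred and measured tissue expression.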

Gene network inference and master regulator analysis (MRA) have been widely adopted to define specific transcriptional perturbations from gene expression signatures. Several tools exist to perform such analyses but most require a computer cluster or large amounts of RAM to be executed.
We developed corto, a fast and lightweight R package to infer gene networks and perform MRA from gene expression data, with optional corrections for copy-number variations and able to run on signatures generated from RNA-Seq or ATAC-Seq data. We extensively benchmarked it to infer context-specific gene networks in 39 human tumor and 27 normal tissue datasets.

Brief: Fast inference of regulatory networks from gene expression data

Complex diseases are due to the dense interactions of many disease-associated factors that dysregulate genes that in turn form the so-called disease modules, which have been shown to be a powerful concept for understanding pathological mechanisms. There exist many disease module inference methods that rely on somewhat different assumptions, but there is still no gold standard or best-performing method. Hence, there is a need for combining these methods to generate robust disease modules.
We developed MODule IdentiFIER (MODifieR), an ensemble R package of nine disease module inference methods from transcriptomics networks. MODifieR uses standardized input and output allowing the possibility to combine individual modules generated from these methods into more robust disease-specific modules, contributing to a better understanding of complex diseases.

Brief: Inferring disease-associated networks from an expression matrix
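One simple way to combine modules from several inference methods is a vote count: keep the genes reported by at least a given number of methods. The sketch below assumes that rule; the function name, the `min_support` parameter and the gene sets are illustrative placeholders, and the abstract does not say whether MODifieR combines modules this way.

```python
from collections import Counter


def consensus_module(modules, min_support):
    """Return the genes reported by at least `min_support` methods.

    modules: dict mapping a method name to its collection of module genes.
    """
    # set(genes) ensures each method votes at most once per gene.
    votes = Counter(gene for genes in modules.values() for gene in set(genes))
    return {gene for gene, count in votes.items() if count >= min_support}


# Placeholder modules from three hypothetical methods.
toy_modules = {
    "method_a": ["TP53", "BRCA1", "EGFR"],
    "method_b": ["TP53", "EGFR", "MYC"],
    "method_c": ["TP53", "KRAS"],
}
majority_module = consensus_module(toy_modules, min_support=2)
strict_module = consensus_module(toy_modules, min_support=3)
```

Raising `min_support` trades module size for agreement across methods, which is the kind of robustness the abstract attributes to combining individual modules.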
