2020-10-11

z

Bioinformatics

Volume 36, Issue 10, 15 May 2020


Motivation

Hi-C is currently the method of choice to investigate the global 3D organization of the genome. A major limitation of Hi-C is the sequencing depth required to robustly detect loops in the data. A popular approach used to mitigate this issue, even in single-cell Hi-C data, is genome-wide averaging (piling-up) of peaks, or other features, annotated in high-resolution datasets, to measure their prominence in less deeply sequenced data. However, current tools do not provide a computationally efficient and versatile implementation of this approach.

Results

Here, we describe coolpup.py—a versatile tool to perform pile-up analysis on Hi-C data. We demonstrate its utility by replicating previously published findings regarding the role of cohesin and CTCF in 3D genome organization, as well as discovering novel details of Polycomb-driven interactions. We also present a novel variation of the pile-up approach that can aid the statistical analysis of looping interactions. We anticipate thatcoolpup.pywill aid in Hi-C data analysis by allowing easy to use, versatile and efficient generation of pile-ups

Key: Hi-C數(shù)據的堆積分析


Motivation

The studies have indicated that not only microRNAs (miRNAs) or long non-coding RNAs (lncRNAs) play important roles in biological activities, but also their interactions affect the biological process. A growing number of studies focus on the miRNA–lncRNA interactions, while few of them are proposed for plant. The prediction of interactions is significant for understanding the mechanism of interaction between miRNA and lncRNA in plant.

Results

This article proposes a new method for fulfilling plant miRNA–lncRNA interaction prediction (PmliPred). The deep learning model and shallow machine learning model are trained using raw sequence and manually extracted features, respectively. Then they are hybridized based on fuzzy decision for prediction. PmliPred shows better performance and generalization ability compared with the existing methods. Several new miRNA–lncRNA interactions inSolanum lycopersicumare successfully identified using quantitative real time–polymerase chain reaction from the candidates predicted by PmliPred, which further verifies its effectiveness.

Key: miRNA和lncRNA互作 深度學習


Motivation

One of the key computational problems in comparative genomics is the reconstruction of genomes of ancestral species based on genomes of extant species. Since most dramatic changes in genomic architectures are caused by genome rearrangements, this problem is often posed as minimization of the number of genome rearrangements between extant and ancestral genomes. The basic case of three given genomes is known as thegenome median problem. Whole-genome duplications (WGDs) represent yet another type of dramatic evolutionary events and inspire the reconstruction of preduplicated ancestral genomes, referred to as thegenome halving problem. Generalization of WGDs to whole-genome multiplication events leads to thegenome aliquoting problem.

Results

In this study, we propose polynomial-size integer linear programming (ILP) formulations for the aforementioned problems. We further obtain such formulations for the restricted and conserved versions of the median and halving problems, which have been recently introduced to improve biological relevance of the solutions. Extensive evaluation of solutions to the different ILP problems demonstrates their good accuracy. Furthermore, since the ILP formulations for the conserved versions have linear size, they provide a novel practical approach to ancestral genome reconstruction, which combines the advantages of homology- and rearrangements-based methods.


Motivation

With the emerging of high-dimensional genomic data, genetic analysis such as genome-wide association studies (GWAS) have played an important role in identifying disease-related genetic variants and novel treatments. Complex longitudinal phenotypes are commonly collected in medical studies. However, since limited analytical approaches are available for longitudinal traits, these data are often underutilized. In this article, we develop a high-throughput machine learning approach for multilocus GWAS using longitudinal traits by coupling Empirical Bayesian Estimates from mixed-effects modeling with a novel ?0-norm algorithm.

Results

Extensive simulations demonstrated that the proposed approach not only provided accurate selection of single nucleotide polymorphisms (SNPs) with comparable or higher power but also robust control of false positives. More importantly, this novel approach is highly scalable and could be approximately >1000 times faster than recently published approaches, making genome-wide multilocus analysis of longitudinal traits possible. In addition, our proposed approach can simultaneously analyze millions of SNPs if the computer memory allows, thereby potentially allowing a true multilocus analysis for high-dimensional genomic data. With application to the data from Alzheimer's Disease Neuroimaging Initiative, we confirmed that our approach can identify well-known SNPs associated with AD and were much faster than recently published approaches (≥6000 times).

Key: 高吞吐量機器學習GWAS



Motivation

Methodological advances in metagenome assembly are rapidly increasing in the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.

Results

We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide anin silicopipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.

Conclusions

DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straight-forward, as well as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.


Motivation

Knowledge of protein–ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. From experimentally solved protein structures, how to accurately identify its potential binding sites of a specific ligand on the protein is still a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide an alternative flexible solution which is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein–ligand prediction model, e.g. discriminative feature representation and effective learning architecture to deal with both the large-scale and severely imbalanced data.

Results

In this study, we propose a novel deep-learning-based method called DELIA for protein–ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the problem of severe data imbalance between the binding and nonbinding residues, strategies of oversampling in mini-batch, random undersampling and stacking ensemble are designed to enhance the model. Experimental results on five benchmark datasets demonstrate the effectiveness of proposed DELIA pipeline.

Key: 蛋白配體結合殘基預測 深度學習



Motivation

Cell-penetrating peptides (CPPs) are a vehicle for transporting into? living cells pharmacologically active molecules, such as short interfering RNAs, nanoparticles, plasmid DNAs and small peptides, thus offering great potential as future therapeutics. Existing experimental techniques for identifying CPPs are time-consuming and expensive. Thus, the prediction of CPPs from peptide sequences by using computational methods can be useful to annotate and guide the experimental process quickly. Many machine learning-based methods have recently emerged for identifying CPPs. Although considerable progress has been made, existing methods still have low feature representation capabilities, thereby limiting further performance improvements.

Results

We propose a method called StackCPPred, which proposes three feature methods on the basis of the pairwise energy content of the residue as follows: RECM-composition, PseRECM and RECM–DWT. These features are used to train stacking-based machine learning methods to effectively predict CPPs. On the basis of the CPP924 and CPPsite3 datasets with jackknife validation, StackDPPred achieved 94.5% and 78.3% accuracy, which was 2.9% and 5.8% higher than the state-of-the-art CPP predictors, respectively. StackCPPred can be a powerful tool for predicting CPPs and their uptake efficiency, facilitating hypothesis-driven experimental design and accelerating their applications in clinical therapy.

Key: 從肽序列中預測細胞膜穿透肽(CPP)

Motivation

Searching the Longest Common Subsequences of many sequences is called a Multiple Longest Common Subsequence (MLCS) problem which is a very fundamental and challenging problem in many fields of data mining.?The existing algorithms cannot be applicable to problems with long and large-scale sequences due to their huge time and space consumption.?To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed.

Results

Searching the Longest Common Subsequences of many sequences is called a Multiple Longest Common Subsequence (MLCS) problem which is a very fundamental and challenging problem in many fields of data mining.?The existing algorithms cannot be applicable to problems with long and large-scale sequences due to their huge time and space consumption.?To efficiently handle large-scale MLCS problems, a Path Recorder Directed Acyclic Graph (PRDAG) model and a novel Path Recorder Algorithm (PRA) are proposed.

Key: MLCS 算法優(yōu)化

Motivation

Single-cell RNA-sequencing (scRNA-seq) allows us to dissect transcriptional heterogeneity arising from cellular types, spatio-temporal contexts and environmental stimuli.?Transcriptional heterogeneity may reflect phenotypes and molecular signatures that are often unmeasured or unknown a priori.?Cell identities of samples derived from heterogeneous subpopulations are then determined by clustering of scRNA-seq data.?These cell identities are used in downstream analyses.?How can we examine if cell identities are accurately inferred??Unlike external measurements or labels for single cells, using clustering-based cell identities result in spurious signals and false discoveries.

Results

We introduce non-parametric methods to evaluate cell identities by testing cluster memberships in an unsupervised manner.?Diverse simulation studies demonstrate accuracy of the jackstraw test for cluster membership.?We propose a posterior probability that a cell should be included in that clustering-based subpopulation.?Posterior inclusion probabilities (PIPs) for cluster memberships can be used to select and visualize samples relevant to subpopulations.?The proposed methods are applied on three scRNA-seq datasets.?First, a mixture of Jurkat and 293T cell lines provides two distinct cellular populations.?Second, Cell Hashing yields cell identities corresponding to eight donors which are independently analyzed by the jackstraw.?Third, peripheral blood mononuclear cells are used to explore heterogeneous immune populations.?The proposed P-values and PIPs lead to probabilistic feature selection of single cells that can be visualized using principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and others.?By learning uncertainty in clustering high-dimensional data, the proposed methods enable unsupervised evaluation of cluster membership.

Key: scRNA-seq 成員細胞身份聚類評估


Motivation

Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions.?Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data.

Results

We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis.?scBatch is not restricted by assumptions on the mechanism of batch-effect generation.?As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods.

Key: RNA-seq批量效應校正


Motivation

Single-cell RNA sequencing (scRNA-seq) has enabled the simultaneous transcriptomic profiling of individual cells under different biological conditions.?scRNA-seq data have two unique challenges that can affect the sensitivity and specificity of single-cell differential expression analysis: a large proportion of expressed genes with zero or low read counts ('dropout' events) and multimodal data distributions.

Results

We have developed a zero-inflation-adjusted quantile (ZIAQ) algorithm, which is the first method to account for both dropout rates and complex scRNA-seq data distributions in the same model.?ZIAQ demonstrates superior performance over several existing methods on simulated scRNA-seq datasets by finding more differentially expressed genes.?When ZIAQ was applied to the comparison of neoplastic and non-neoplastic cells from a human glioblastoma dataset, the ranking of biologically relevant genes and pathways showed clear improvement over existing methods.

Key: scRNA-seq下游差異表達分析(DEA)的分位數(shù)回歸方法

Motivation

Single-cell RNA sequencing (scRNA-seq) methods make it possible to reveal gene expression patterns at single-cell resolution.?Due to technical defects, dropout events in scRNA-seq will add noise to the gene-cell expression matrix and hinder downstream analysis.?Therefore, it is important for recovering the true gene expression levels before carrying out downstream analysis.

Results

In this article, we develop an imputation method, called scTSSR, to recover gene expression for scRNA-seq.?Unlike most existing methods that impute dropout events by borrowing information across only genes or cells, scTSSR simultaneously leverages information from both similar genes and similar cells using a two-side sparse self-representation model.?We demonstrate that scTSSR can effectively capture the Gini coefficients of genes and gene-to-gene correlations observed in single-molecule RNA fluorescence in situ hybridization (smRNA FISH).?Down-sampling experiments indicate that scTSSR performs better than existing methods in recovering the true gene expression levels.?We also show that scTSSR has a competitive performance in differential expression analysis, cell clustering and cell trajectory inference.

Key: 恢復scRNA-seq真實的基因表達水平 雙向稀疏自表達模型


Motivation

Single-cell RNA-sequencing (scRNA-seq) technology provides a powerful tool for investigating cell heterogeneity and cell subpopulations by allowing the quantification of gene expression at single-cell level.?However, scRNA-seq data analysis remains challenging because of various technical noises such as dropout events (i.e. excessive zero counts in the expression matrix).

Results

By taking consideration of the association among cells and genes, we propose a novel collaborative matrix factorization-based method called CMF-Impute to impute the dropout entries of a given scRNA-seq expression matrix.?We test CMF-Impute and compare it with the other five state-of-the-art methods on six popular real scRNA-seq datasets of various sizes and three simulated datasets.?For simulated datasets, CMF-Impute outperforms other methods in imputing the closest dropouts to the original expression values as evaluated by both the sum of squared error and Pearson correlation coefficient.?For real datasets, CMF-Impute achieves the most accurate cell classification results in spite of the choice of different clustering methods like SC3 or T-SNE followed by K-means as evaluated by both adjusted rand index and normalized mutual information.?Finally, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation, and in inferring cell lineage trajectories.

Key: 協(xié)作矩陣因子 scRNA-seq的dropout恢復

Motivation

Single cell RNA-sequencing (scRNA-seq) technology enables whole transcriptome profiling at single cell resolution and holds great promises in many biological and medical applications. Nevertheless, scRNA-seq often fails to capture expressed genes, leading to the prominent dropout problem. These dropouts cause many problems in down-stream analysis, such as significant increase of noises, power loss in differential expression analysis and obscuring of gene-to-gene or cell-to-cell relationship. Imputation of these dropout values can be beneficial in scRNA-seq data analysis.

Results

In this article, we model the dropout imputation problem as robust matrix decomposition. This model has minimal assumptions and allows us to develop a computational efficient imputation method called scRMD. Extensive data analysis shows that scRMD can accurately recover the dropout values and help to improve downstream analysis such as differential expression analysis and clustering analysis.

Key: dropout估算

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子见芹,更是在濱河造成了極大的恐慌耕陷,老刑警劉巖,帶你破解...
    沈念sama閱讀 218,682評論 6 507
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異,居然都是意外死亡,警方通過查閱死者的電腦和手機吁津,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,277評論 3 395
  • 文/潘曉璐 我一進店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來堕扶,“玉大人碍脏,你說我怎么就攤上這事∩运悖” “怎么了典尾?”我有些...
    開封第一講書人閱讀 165,083評論 0 355
  • 文/不壞的土叔 我叫張陵,是天一觀的道長糊探。 經常有香客問我钾埂,道長,這世上最難降的妖魔是什么科平? 我笑而不...
    開封第一講書人閱讀 58,763評論 1 295
  • 正文 為了忘掉前任褥紫,我火速辦了婚禮,結果婚禮上瞪慧,老公的妹妹穿的比我還像新娘髓考。我一直安慰自己,他們只是感情好弃酌,可當我...
    茶點故事閱讀 67,785評論 6 392
  • 文/花漫 我一把揭開白布氨菇。 她就那樣靜靜地躺著儡炼,像睡著了一般。 火紅的嫁衣襯著肌膚如雪查蓉。 梳的紋絲不亂的頭發(fā)上乌询,一...
    開封第一講書人閱讀 51,624評論 1 305
  • 那天,我揣著相機與錄音豌研,去河邊找鬼妹田。 笑死,一個胖子當著我的面吹牛聂沙,可吹牛的內容都是我干的。 我是一名探鬼主播初嘹,決...
    沈念sama閱讀 40,358評論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼及汉,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了屯烦?” 一聲冷哼從身側響起坷随,我...
    開封第一講書人閱讀 39,261評論 0 276
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎驻龟,沒想到半個月后温眉,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經...
    沈念sama閱讀 45,722評論 1 315
  • 正文 獨居荒郊野嶺守林人離奇死亡翁狐,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內容為張勛視角 年9月15日...
    茶點故事閱讀 37,900評論 3 336
  • 正文 我和宋清朗相戀三年类溢,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片露懒。...
    茶點故事閱讀 40,030評論 1 350
  • 序言:一個原本活蹦亂跳的男人離奇死亡闯冷,死狀恐怖,靈堂內的尸體忽然破棺而出懈词,到底是詐尸還是另有隱情蛇耀,我是刑警寧澤,帶...
    沈念sama閱讀 35,737評論 5 346
  • 正文 年R本政府宣布坎弯,位于F島的核電站纺涤,受9級特大地震影響,放射性物質發(fā)生泄漏抠忘。R本人自食惡果不足惜撩炊,卻給世界環(huán)境...
    茶點故事閱讀 41,360評論 3 330
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望崎脉。 院中可真熱鬧衰抑,春花似錦、人聲如沸荧嵌。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,941評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至谭网,卻和暖如春汪厨,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背愉择。 一陣腳步聲響...
    開封第一講書人閱讀 33,057評論 1 270
  • 我被黑心中介騙來泰國打工劫乱, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人锥涕。 一個月前我還...
    沈念sama閱讀 48,237評論 3 371
  • 正文 我出身青樓衷戈,卻偏偏與公主長得像,于是被迫代替她去往敵國和親层坠。 傳聞我的和親對象是個殘疾皇子殖妇,可洞房花燭夜當晚...
    茶點故事閱讀 44,976評論 2 355