Similar articles
Found 20 similar articles (search time: 31 ms)
1.
Multimodal single-cell sequencing technologies provide unprecedented information on cellular heterogeneity from multiple layers of genomic readouts. However, joint analysis of two modalities without properly handling the noise often leads to overfitting of one modality by the other and worse clustering results than vanilla single-modality analysis. How to efficiently use the extra information from single-cell multi-omics to delineate cell states and identify meaningful signal remains a significant computational challenge. In this work, we propose a deep learning framework, named SAILERX, for efficient, robust, and flexible analysis of multimodal single-cell data. SAILERX consists of a variational autoencoder with invariant representation learning to correct technical noise from the sequencing process, and a multimodal data alignment mechanism to integrate information from different modalities. Instead of performing hard alignment by projecting both modalities to a shared latent space, SAILERX encourages the local structures of the two modalities, measured by pairwise similarities, to be similar. This strategy is more robust against overfitting to noise, which facilitates various downstream analyses such as clustering, imputation, and marker gene detection. Furthermore, the invariant representation learning part enables SAILERX to perform integrative analysis on both multi- and single-modal datasets, making it an applicable and scalable tool for more general scenarios.
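The idea of matching local structure through pairwise similarities, rather than forcing both modalities into one latent space, can be illustrated with a small sketch. This is not SAILERX's actual loss: the function names, the choice of cosine similarity, and the mean-squared penalty are all assumptions made for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarity(embeddings):
    # Cell-by-cell similarity matrix within a single modality.
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def local_structure_penalty(emb_a, emb_b):
    # Hypothetical penalty: mean squared difference between the two
    # modalities' similarity matrices. Rows of emb_a and emb_b must refer
    # to the same cells, but the embedding dimensions may differ.
    sim_a = pairwise_similarity(emb_a)
    sim_b = pairwise_similarity(emb_b)
    n = len(sim_a)
    total = sum((sim_a[i][j] - sim_b[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)
```

Because only the similarity matrices are compared, the two embeddings never need to share coordinates or even a dimension, which is the property the abstract contrasts with hard alignment.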

2.
3.
With the rapid accumulation of biological omics datasets, decoding the underlying relationships of cross-dataset genes becomes an important issue. Previous studies have attempted to identify differentially expressed genes across datasets. However, it is hard for them to detect interrelated ones. Moreover, existing correlation-based algorithms can only measure the relationship between genes within a single dataset or two multi-modal datasets from the same samples. It is still unclear how to quantify the strength of association of the same gene across two biological datasets with different samples. To this end, we propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets. ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. We demonstrate the effectiveness of ADC with simulation data and four real applications to select highly interrelated genes across two datasets. The four applications comprise 21 cancer RNA-seq datasets of different tissues; six single-cell RNA-seq (scRNA-seq) datasets of mouse hematopoietic cells across six different cell types along the hematopoietic cell lineage; five scRNA-seq datasets of pancreatic islet cells across five different technologies; and coupled single-cell ATAC-seq (scATAC-seq) and scRNA-seq data of peripheral blood mononuclear cells (PBMC). Extensive results demonstrate that ADC is a powerful tool to uncover interrelated genes with strong biological implications and is scalable to large-scale datasets. Moreover, the number of such genes can serve as a metric to measure the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
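The final step named in the abstract, the Benjamini-Hochberg adjustment, is a standard procedure and can be sketched on its own. The function below is a generic illustration and not ADC's code; the function name and the input p-values in the usage note are made up, and the distance-correlation step that would produce real p-values is not shown.

```python
def benjamini_hochberg(p_values):
    # Return Benjamini-Hochberg adjusted p-values in the input order.
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, with one hypothetical p-value per gene, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` yields adjusted values of approximately 0.02, 0.04, 0.04, and 0.02; genes whose adjusted value falls below the chosen false discovery rate would be reported.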

4.
This paper presents a novel spatio-temporal LSTM (SPATIAL) architecture for time series forecasting applied to environmental datasets. The framework was applied to three different ocean datasets: current speed, temperature, and dissolved oxygen. Network implementation proceeded in two directions that are nominally separated but connected as part of a natural environmental system: across the spatial (between individual sensors) and temporal dimensions of the sensor data. Data from twenty ocean sensors were used to train the model. Results were compared against four baseline models: two machine learning algorithms generated by robust autoML frameworks, and two deep neural networks based on CNN and LSTM, respectively. Results demonstrated the ability to accurately replicate complex signals and provide performance comparable to state-of-the-art benchmarks. Learning from multiple sensors simultaneously increased robustness to missing data. This paper addresses two fundamental challenges related to environmental applications of machine learning: 1) data sparsity, particularly in a challenging ocean environment, and 2) the fact that environmental datasets are inherently connected in the spatial and temporal directions, while classical ML approaches only consider one of these at a time. Furthermore, sharing of parameters across all input steps makes SPATIAL a fast, scalable, and easily parameterized forecasting framework.

5.
6.
7.
CellDepot, containing over 270 datasets from 8 species and many tissues, serves as an integrated web application that empowers scientists to explore single-cell RNA-seq (scRNA-seq) datasets and compare datasets among various studies through a user-friendly interface with advanced visualization and analytical capabilities. To begin with, it provides an efficient data management system through which users can upload single-cell datasets and query the database by multiple attributes such as species and cell types. In addition, the graphical multi-logic, multi-condition query builder and convenient filtering tool, backed by a MySQL database system, allow users to quickly find the datasets of interest and compare the expression of gene(s) across them. Moreover, by embedding the cellxgene VIP tool, CellDepot enables fast, interactive, and scalable exploration of individual datasets to gain more refined insights such as cell composition, gene expression profiles, and differentially expressed genes among cell types, leveraging more than 20 frequently applied plotting functions and high-level analysis methods in single-cell research. In summary, the web portal, available at http://celldepot.bxgenomics.com, promotes large-scale single-cell data sharing, facilitates meta-analysis and visualization, and encourages scientists to contribute to the single-cell community in a tractable and collaborative way. Finally, CellDepot is released as open-source software under the MIT license to motivate crowd contribution, broad adoption, and local deployment for private datasets.

8.
9.
The most popular way of comparing the performance of multiple sequence alignment programs is to use empirical testing on sets of test sequences. Several such test sets now exist, each with potential strengths and weaknesses. We apply several different alignment packages to six benchmark datasets and compare their relative performances. HOMSTRAD, a collection of alignments of homologous proteins, is regularly used as a benchmark for sequence alignment, though it is not designed as such and lacks annotation of reliable regions within the alignment. We introduce this annotation into HOMSTRAD using protein structural superposition. Results on each database show that method performance is dependent on the input sequences. Alignment benchmarks are regularly used in combination to measure performance across a spectrum of alignment problems. Through combining benchmarks, it is possible to detect whether a program has been over-optimised for a single dataset or alignment problem type.

10.
11.
MOTIVATION: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size and construction/updating of the interval database. RESULTS: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested, this query algorithm is 5- to 500-fold faster than the other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are approximately 100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications. AVAILABILITY: The NCList data structure is part of Pygr, a bioinformatics graph database library, available at http://sourceforge.net/projects/pygr
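The nesting idea behind the structure can be sketched in a few lines. This is a simplified in-memory illustration, not the Pygr implementation and not the in-place construction algorithm mentioned above; the class and function names are hypothetical, and intervals are assumed to be half-open `(start, end)` pairs with `start < end`. Each interval contained in another is stored in that interval's sublist, so within any one list neither starts nor ends are out of order and a binary search finds the first candidate.

```python
class Node:
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.children = []  # intervals fully contained in this one

def build_nclist(intervals):
    # Sort by start ascending, then end descending, so containers come first.
    top, stack = [], []
    for start, end in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
        node = Node(start, end)
        # Pop until the interval on top of the stack contains the new one.
        while stack and end > stack[-1].end:
            stack.pop()
        (stack[-1].children if stack else top).append(node)
        stack.append(node)
    return top

def _first_ending_after(nodes, qs):
    # Binary search: ends are strictly increasing within one sibling list.
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid].end > qs:
            hi = mid
        else:
            lo = mid + 1
    return lo

def query(nodes, qs, qe, out=None):
    # Collect all stored intervals overlapping the half-open range [qs, qe).
    if out is None:
        out = []
    i = _first_ending_after(nodes, qs)
    while i < len(nodes) and nodes[i].start < qe:
        out.append((nodes[i].start, nodes[i].end))
        query(nodes[i].children, qs, qe, out)  # only hits can have hit children
        i += 1
    return out
```

A call such as `query(build_nclist(intervals), 4, 7)` skips every sublist whose parent does not overlap the query, which is where the savings over a flat scan come from.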

12.
Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over the last decade suggest a loss of accuracy when scaling up beyond a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets, we conclude that although significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.

13.
Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution, and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying unknown cell types through clustering. Methods that integrate multiple datasets can therefore potentially lead to better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we use the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, mouse cortex sc-methylation and scRNA-seq data, and human blood dendritic cell scRNA-seq data from two batches, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ converges quickly and is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.

14.
The advent of next generation sequencing (NGS) technologies has revolutionised the way biologists produce, analyse and interpret data. Although NGS platforms provide a cost-effective way to discover genome-wide variants from a single experiment, variants discovered by NGS need follow-up validation due to the high error rates associated with various sequencing chemistries. Recently, whole exome sequencing has been proposed as an affordable option compared to whole genome runs, but it still requires follow-up validation of all the novel exomic variants. Customarily, a consensus approach is used to overcome the systematic errors inherent to the sequencing technology, alignment and post-alignment variant detection algorithms. However, this approach requires multiple sequencing chemistries, multiple alignment tools and multiple variant callers, which may not be viable in terms of time and money for individual investigators with limited informatics know-how. Biologists often lack the requisite training to deal with the huge amount of data produced by NGS runs and face difficulty in choosing from the list of freely available analytical tools for NGS data analysis. Hence, there is a need to customise the NGS data analysis pipeline to preferentially retain true variants by minimising the incidence of false positives, and to make the choice of the right analytical tools easier. To this end, we have sampled different freely available tools used at the alignment and post-alignment stages and suggest the use of the most suitable combination, determined by a simple framework of pre-existing metrics, to create significant datasets.
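The consensus approach mentioned above is commonly realised as a vote across variant callers. The sketch below shows only that general idea and is not the pipeline proposed in the abstract; the function name, the `(chrom, pos, ref, alt)` tuple representation, and the `min_support` threshold are assumptions chosen for illustration.

```python
from collections import Counter

def consensus_variants(call_sets, min_support=2):
    # call_sets: one collection of (chrom, pos, ref, alt) tuples per caller.
    # Each caller votes at most once per variant, hence the set() per caller.
    counts = Counter(v for calls in call_sets for v in set(calls))
    # Keep variants reported by at least `min_support` callers.
    return {v for v, c in counts.items() if c >= min_support}
```

Raising `min_support` trades sensitivity for fewer false positives, which is the trade-off the abstract describes when it notes the cost of running several chemistries, aligners and callers.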

15.
16.
Single-cell RNA sequencing (scRNA-seq) technologies measure gene expression at single-cell resolution and provide a tool for exploring cell heterogeneity and cell types. Because of the low number of mRNA copies extracted per cell, scRNA-seq data exhibit a large number of dropouts, which hinders downstream analysis. We propose a statistical method, SDImpute (Single-cell RNA-seq Dropout Imputation), to implement block imputation for dropout events in scRNA-seq data. SDImpute automatically identifies dropout events based on gene expression levels and the variation of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by using gene expression unaffected by dropouts from similar cells. In experiments on simulated and real datasets, the results suggest that SDImpute is an effective tool to recover the data and preserve the heterogeneity of gene expression across cells. Compared with state-of-the-art imputation methods, SDImpute improves the accuracy of downstream analyses, including clustering, visualization, and differential expression analysis.

17.
Although there has been a recent proliferation in maximum‐likelihood (ML)‐based tree estimation methods based on a fixed sequence alignment (MSA), little research has been done on incorporating indel information in this traditional framework. We show, using a simple model on a single character example, that a trivial alignment of a different form than that previously identified for parsimony is optimal in ML under standard assumptions treating indels as “missing” data, but that it is not optimal when indels are incorporated into the character alphabet. We show that the optimality of the trivial alignment is not an artefact of simplified theory assumptions by demonstrating that trivial alignment likelihoods of five different multiple sequence alignment datasets exhibit this phenomenon. These results demonstrate the need for use of indel information in likelihood analysis on fixed MSAs, and suggest that caution must be exercised when drawing conclusions from software implementations claiming improvements in likelihood scores under an indels‐as‐missing assumption. © The Willi Hennig Society 2012.

18.
Comprehensive understanding of biological systems requires efficient and systematic assimilation of high-throughput datasets in the context of the existing knowledge base. A major limitation in the field of proteomics is the lack of an appropriate software platform that can synthesize a large number of experimental datasets in the context of the existing knowledge base. Here, we describe a software platform, termed PROTEOME-3D, that utilizes three essential features for systematic analysis of proteomics data: creation of a scalable, queryable, customized database for identified proteins from published literature; graphical tools for displaying proteome landscapes and trends from multiple large-scale experiments; and interactive data analysis that facilitates identification of crucial networks and pathways. Thus, PROTEOME-3D offers a standardized platform to analyze high-throughput experimental datasets for the identification of crucial players in co-regulated pathways and cellular processes.

19.
SUMMARY: Multi-dimensional Automated Clustering Genotyping Tool (MACGT) is a Java application that clusters complex multi-dimensional vector data derived from single nucleotide polymorphism (SNP) genotyping experiments using mini-sequencing based microarray chemistries such as arrayed primer extension (APEX). Spot intensity output files from microarray experiments across multiple samples are imported into MACGT. The datasets can include four channels of intensity data for each spot, replica spots for each SNP probe and multiple probe types (APEX and allele-specific APEX probes) on both DNA strands for each SNP. MACGT automatically clusters these multi-dimensional datasets for each SNP across multiple samples. Incorporating additional array datasets from known samples with previously validated SNP genotype calls allows unknown samples to be automatically assigned a genotype based on the clustering, along with numerical measures of confidence for each genotype call. Calling accuracy by MACGT exceeds 98% when applied to genotyping data from APEX microarrays, and can be increased to >99.5% by applying thresholds to the confidence measures.

20.
MOTIVATION: The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient, these visualization algorithms must consistently accommodate a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. RESULTS: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. Multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produce a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. AVAILABILITY: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plug-in 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu

Copyright©北京勤云科技发展有限公司  京ICP备09084417号