Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Intensively sampled species abundance distributions (SADs) show left-skew on a log scale. That is, there are too many rare species to fit a lognormal distribution. I propose that this log-left-skew might be a sampling artefact. Monte Carlo simulations show that taking progressively larger samples from a log-unskewed distribution (such as the lognormal) causes log-skew to decrease asymptotically (move towards −∞) until it reaches the level of the underlying distribution (zero in this case). In contrast, accumulating certain types of repeated small samples results in a log-skew that becomes progressively more log-left-skewed to a level well beyond the underlying distribution. These repeated samples correspond to samples from the same site over many years or from many sites in 1 year. Data from empirical datasets show that log-skew generally goes from positive (right-skewed) to negative (left-skewed) as the number of temporally or spatially replicated samples increases. This suggests caution when interpreting log-left-skew as a pattern that needs biological interpretation.
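The first sampling experiment in the abstract above (progressively larger samples from a lognormal community) might be sketched roughly as follows. This is only an illustration, not the author's simulation: the community size, the lognormal parameters, the sample sizes, and the function names `skewness` and `sample_log_skew` are all assumptions, and the repeated-small-samples case is not covered.

```python
import math
import random


def skewness(values):
    """Population skewness (third standardized moment) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical: no skew to measure
    return m3 / m2 ** 1.5


def sample_log_skew(abundances, n_individuals, rng):
    """Draw individuals in proportion to true abundance; return skew of log counts."""
    draws = rng.choices(range(len(abundances)), weights=abundances, k=n_individuals)
    counts = {}
    for species in draws:
        counts[species] = counts.get(species, 0) + 1
    # Only species that were actually observed enter the sampled SAD.
    return skewness([math.log(c) for c in counts.values()])


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical community: 300 species with lognormal true abundances.
    community = [rng.lognormvariate(0.0, 2.0) for _ in range(300)]
    for size in (100, 1_000, 10_000, 100_000):
        print(size, round(sample_log_skew(community, size, rng), 3))
```

Running the script prints the log-skew of the observed counts at each sample size, which is the quantity the abstract tracks as sampling intensity grows.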

2.
With the tremendous increase in publicly available single-cell RNA-sequencing (scRNA-seq) datasets, bioinformatics methods based on gene co-expression networks are becoming efficient tools for analyzing scRNA-seq data, improving cell type prediction accuracy and in turn facilitating biological discovery. However, current methods are mainly based on overall co-expression correlation and overlook co-expression that exists in only a subset of cells; they therefore fail to discover certain rare cell types and are sensitive to batch effects. Here, we developed independent component analysis-based gene co-expression network inference (ICAnet), which decomposes scRNA-seq data into a series of independent gene expression components and infers co-expression modules, improving cell clustering and rare cell-type discovery. ICAnet showed efficient performance for cell clustering and batch integration using scRNA-seq datasets spanning multiple cells/tissues/donors/library types. It works stably on datasets produced by different library construction strategies and with different sequencing depths and cell numbers. We demonstrated the capability of ICAnet to discover rare cell types in multiple independent scRNA-seq datasets from different sources. Importantly, the identified modules activated in acute myeloid leukemia scRNA-seq datasets have the potential to serve as new diagnostic markers. Thus, ICAnet is a competitive tool for cell clustering and biological interpretation in single-cell RNA-seq data analysis.

3.
The standard analysis pipeline for single-cell RNA-seq data consists of sequential steps initiated by clustering the cells. An innate limitation of this pipeline is that an imperfect clustering result can irreversibly affect the succeeding steps. For example, there can be cell types not well distinguished by clustering because they largely share the global structure, such as the anterior primitive streak and mid primitive streak cells. If one searches for differentially expressed genes (DEGs) solely based on clustering, marker genes for distinguishing these types will be missed. Moreover, clustering depends on many parameters and is often subject to manual decisions. To overcome these limitations, we propose MarcoPolo, a method that identifies informative DEGs independently of prior clustering. MarcoPolo sorts out genes by evaluating whether the distributions are bimodal, whether similar expression patterns are observed in other genes, and whether the expressing cells are proximal in a low-dimensional space. Using real datasets with FACS-purified cell labels, we demonstrate that MarcoPolo recovers marker genes better than competing methods. Notably, MarcoPolo finds key genes that can distinguish cell types that are not distinguishable by standard clustering. MarcoPolo is built into a convenient software package that provides analysis results in an HTML file.

4.
5.
Our current biological knowledge is spread over many independent bioinformatics databases where many different types of gene and protein identifiers are used. The heterogeneous and redundant nature of these identifiers limits data analysis across different bioinformatics resources. This is an even more serious bottleneck for the analysis of larger datasets, such as gene lists derived from microarray and proteomic experiments. The DAVID Gene ID Conversion Tool (DICT), a web-based application, is able to convert a user's input gene or gene product identifiers from one type to another in a more comprehensive and high-throughput manner with a uniquely enhanced ID-ID mapping database.

6.
CellDepot, which contains over 270 datasets from 8 species and many tissues, is an integrated web application that helps scientists explore single-cell RNA-seq (scRNA-seq) datasets and compare datasets across studies through a user-friendly interface with advanced visualization and analytical capabilities. First, it provides an efficient data management system through which users can upload single-cell datasets and query the database by multiple attributes such as species and cell type. In addition, a graphical multi-logic, multi-condition query builder and a convenient filtering tool, both backed by a MySQL database, allow users to quickly find datasets of interest and compare the expression of gene(s) across them. Moreover, by embedding the cellxgene VIP tool, CellDepot enables fast, interactive, and scalable exploration of individual datasets to gain more refined insights, such as cell composition, gene expression profiles, and differentially expressed genes among cell types, by leveraging more than 20 frequently applied plotting functions and high-level analysis methods in single-cell research. In summary, the web portal, available at http://celldepot.bxgenomics.com, promotes large-scale single-cell data sharing, facilitates meta-analysis and visualization, and encourages scientists to contribute to the single-cell community in a tractable and collaborative way. Finally, CellDepot is released as open-source software under the MIT license to motivate crowd contribution, broad adoption, and local deployment for private datasets.

7.
The production of high-throughput gene expression data has generated a crucial need for bioinformatics tools to generate biologically interesting hypotheses. Whereas many tools are available for extracting global patterns, less attention has been focused on local pattern discovery. We propose here an original way to discover knowledge from gene expression data by means of the so-called formal concepts which hold in derived Boolean gene expression datasets. We first encoded the over-expression properties of genes in human cells using human SAGE data. This gave rise to a Boolean matrix from which we extracted the complete collection of formal concepts, i.e., all the largest sets of over-expressed genes associated with a largest set of biological situations in which their over-expression is observed. Complete collections of such patterns tend to be huge. Since their interpretation is a time-consuming task, we propose a new method to rapidly visualize clusters of formal concepts. This designates a reasonable number of Quasi-Synexpression-Groups (QSGs) for further analysis. The interest of our approach is illustrated using human SAGE data and interpreting one of the extracted QSGs. The assessment of its biological relevance leads to the formulation of both previously proposed and new biological hypotheses.
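To make the notion of a formal concept in the abstract above concrete, here is a minimal sketch that enumerates every formal concept of a tiny Boolean context. The abstract does not say which extraction algorithm the authors used; this naive closure-by-intersection approach, the function name `formal_concepts`, and the toy situation and gene names are assumptions for illustration only, and it would not scale to real SAGE data.

```python
def formal_concepts(context):
    """Enumerate all formal concepts of a small Boolean context.

    context maps each biological situation to the set of genes
    over-expressed in it. Returns (situations, genes) pairs in which
    each side is maximal with respect to the other.
    """
    all_genes = set().union(*context.values())
    # Every closed gene set is an intersection of rows; start from the
    # full gene set (the intersection over no rows at all).
    intents = {frozenset(all_genes)}
    for genes in context.values():
        intents |= {frozenset(genes & intent) for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is every situation whose row contains the gene set.
        extent = frozenset(s for s, g in context.items() if intent <= g)
        concepts.append((extent, intent))
    return concepts


if __name__ == "__main__":
    # Hypothetical toy context: three situations, three genes.
    toy = {
        "s1": {"g1", "g2"},
        "s2": {"g2", "g3"},
        "s3": {"g1", "g2", "g3"},
    }
    for extent, intent in formal_concepts(toy):
        print(sorted(extent), sorted(intent))
```

Even on this toy input, the number of concepts can exceed the number of situations, which hints at why complete collections become huge on real data.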

8.
9.
Translational control of gene expression has emerged as a major mechanism that regulates many biological processes and shows dysregulation in human diseases including cancer. When studying differential translation, levels of both actively translating mRNAs and total cytosolic mRNAs are obtained, where the latter is used to correct for a possible contribution of differential cytosolic mRNA levels to the observed differential levels of actively translated mRNAs. We have recently shown that analysis of partial variance (APV) corrects for cytosolic mRNA levels more effectively than the commonly applied log ratio approach. APV provides a high degree of specificity and sensitivity for detecting biologically meaningful translation changes, especially when combined with a variance shrinkage method for estimating random error. Here we describe the anota (analysis of translational activity) R-package, which implements APV, allows scrutiny of associated statistical assumptions and provides biologically motivated filters for analysis of genome-wide datasets. Although the package was developed for analysis of differential translation in polysome microarray or ribosome-profiling datasets, any high-dimensional data that result in paired controls, such as RNP immunoprecipitation-microarray (RIP-CHIP) datasets, can be successfully analyzed with anota. AVAILABILITY: The anota Bioconductor package is available at www.bioconductor.org.

10.
A hidden-state Markov model for cell population deconvolution. (Total citations: 1; self-citations: 0; citations by others: 1)
Microarrays measure gene expression typically from a mixture of cell populations during different stages of a biological process. However, the specific effects of the distinct or pure populations on measured gene expression are difficult or impossible to determine. The ability to deconvolve measured gene expression into the contributions from pure populations is critical to maximizing the potential of microarray analysis for investigating complex biological processes. In this paper, we describe a novel approach called the multinomial hidden Markov model (MHMM) that produces: (i) a maximum a posteriori estimate of the fraction represented by each pure population and (ii) gene expression values for each pure population. Our method uses an unsupervised, probabilistic approach for handling missing data points and clusters genes based on expression in pure populations. MHMM, used with several yeast datasets, identified statistically significant temporal dynamics. This method, unlike the linear decomposition models used previously for deconvolution, can extract information from different types of data, does not require a priori identification of pure gene expression, exploits the temporal nature of time series data, and is less affected by missing data.

11.
12.

Background  

Microarray technology is generating huge amounts of data about the expression levels of thousands of genes, or even whole genomes, across different experimental conditions. To extract biological knowledge and to fully understand such datasets, it is essential to include external biological information about genes and gene products in the analysis of expression data. However, most current approaches to analyzing microarray datasets focus mainly on the experimental data, and external biological information is incorporated only as a posterior process.

13.
Contemporary high-dimensional biological assays, such as mRNA expression microarrays, regularly involve multiple data processing steps, such as experimental processing, computational processing, sample selection, or feature selection (i.e. gene selection), prior to deriving any biological conclusions. These steps can dramatically change the interpretation of an experiment. Evaluation of processing steps has received limited attention in the literature. It is not straightforward to evaluate different processing methods and investigators are often unsure of the best method. We present a simple statistical tool, Standardized WithIn class Sum of Squares (SWISS), that allows investigators to compare alternate data processing methods, such as different experimental methods, normalizations, or technologies, on a dataset in terms of how well they cluster a priori biological classes. SWISS uses Euclidean distance to determine which method does a better job of clustering the data elements based on a priori classifications. We apply SWISS to three different gene expression applications. The first application uses four different datasets to compare different experimental methods, normalizations, and gene sets. The second application, using data from the MicroArray Quality Control (MAQC) project, compares different microarray platforms. The third application compares different technologies: a single Agilent two-color microarray versus one lane of RNA-Seq. These applications give an indication of the variety of problems that SWISS can be helpful in solving. The SWISS analysis of one-color versus two-color microarrays provides investigators who use two-color arrays the opportunity to review their results in light of a single-channel analysis, with all of the associated benefits offered by this design. Analysis of the MAQC data shows differential intersite reproducibility by array platform. SWISS also shows that one lane of RNA-Seq clusters data by biological phenotypes as well as a single Agilent two-color microarray.
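As a rough illustration of the kind of score the SWISS abstract describes, the sketch below computes a within-class sum of squares using squared Euclidean distances. The abstract does not spell out the standardization; dividing by the total sum of squares about the overall mean is an assumption here, as are the function name `swiss_score` and the toy data. Under that assumption, a lower score means the a priori classes are clustered more tightly.

```python
def swiss_score(points, labels):
    """Within-class sum of squares divided by total sum of squares.

    points: list of equal-length numeric lists (one per sample).
    labels: a priori class label for each sample.
    The normalization by total sum of squares is an assumed reading of
    "standardized"; it is not taken from the abstract itself.
    """
    dims = len(points[0])

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sum_of_squares(rows, center):
        # Sum of squared Euclidean distances from each row to the center.
        return sum((r[d] - center[d]) ** 2 for r in rows for d in range(dims))

    total = sum_of_squares(points, centroid(points))
    if total == 0:
        raise ValueError("all points are identical; score is undefined")

    within = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        within += sum_of_squares(members, centroid(members))
    return within / total


if __name__ == "__main__":
    # Hypothetical data: the same four samples under two labelings.
    data = [[0, 0], [0, 2], [10, 0], [10, 2]]
    print(swiss_score(data, ["a", "a", "b", "b"]))  # classes match the clusters
    print(swiss_score(data, ["a", "b", "a", "b"]))  # classes cut across clusters
```

Comparing two processing methods would then amount to computing this score on each processed version of the same samples, with the same class labels, and preferring the lower value.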

14.
15.
16.
Non-negative matrix factorization is a useful tool for reducing the dimension of large datasets. This work considers simultaneous non-negative matrix factorization of multiple sources of data. In particular, we perform the first study that involves more than two datasets. We discuss the algorithmic issues required to convert the approach into a practical computational tool and apply the technique to new gene expression data quantifying the molecular changes in four tissue types due to different dosages of an experimental panPPAR agonist in mouse. This study is of interest in toxicology because, whilst PPARs form potential therapeutic targets for diabetes, it is known that they can induce serious side-effects. Our results show that the practical simultaneous non-negative matrix factorization developed here can add value to the data analysis. In particular, we find that factorizing the data as a single object allows us to distinguish between the four tissue types, but does not correctly reproduce the known dosage level groups. Applying our new approach, which treats the four tissue types as providing distinct, but related, datasets, we find that the dosage level groups are respected. The new algorithm then provides separate gene list orderings that can be studied for each tissue type, and compared with the ordering arising from the single factorization. We find that many of our conclusions can be corroborated with known biological behaviour, and others offer new insights into the toxicological effects. Overall, the algorithm shows promise for early detection of toxicity in the drug discovery process.

17.
A critical and difficult part of studying cancer with DNA microarrays is data interpretation. Besides the need for data analysis algorithms, integration of additional information about genes might be useful. We performed genome-wide expression profiling of 36 types of normal human tissues and identified 2503 tissue-specific genes. We then systematically studied the expression of these genes in cancers by reanalyzing a large collection of published DNA microarray datasets. We observed that the expression level of liver-specific genes in hepatocellular carcinoma (HCC) correlates with the clinically defined degree of tumor differentiation. Through unsupervised clustering of tissue-specific genes differentially expressed in tumors, we extracted expression patterns that are characteristic of individual cell types, uncovering differences in cell lineage among tumor subtypes. We were able to detect the expression signature of hepatocytes in HCC, neurons in medulloblastoma, glial cells in glioma, basal and luminal epithelial cells in breast tumors, and various cell types in lung cancer samples. We also demonstrated that tissue-specific expression signatures are useful in locating the origin of metastatic tumors. Our study shows that integration of each gene's breadth of expression (BOE) in normal tissues is important for biological interpretation of the expression profiles of cancers in terms of tumor differentiation, cell lineage, and metastasis.

18.
MOTIVATION: Biological assays are often carried out on tissues that contain many cell lineages and active pathways. Microarray data produced using such material therefore reflect superimpositions of biological processes. Analysing such data for shared gene function by means of well-matched assays may help to provide a better focus on specific cell types and processes. The identification of genes that behave similarly in different biological systems also has the potential to reveal new insights into preserved biological mechanisms. RESULTS: In this article, we propose a hierarchical Bayesian model allowing integrated analysis of several microarray data sets for shared gene function. Each gene is associated with an indicator variable that selects whether binary class labels are predicted from expression values or by a classifier which is common to all genes. Each indicator selects the component models for all involved data sets simultaneously. A quantitative measure of shared gene function is obtained by inferring a probability measure over these indicators. Through experiments on synthetic data, we illustrate potential advantages of this Bayesian approach over a standard method. A shared analysis of matched microarray experiments covering (a) a cycle of mouse mammary gland development and (b) the process of in vitro endothelial cell apoptosis is proposed as a biological gold standard. Several useful sanity checks are introduced during data analysis, and we confirm the prior biological belief that shared apoptosis events occur in both systems. We conclude that a Bayesian analysis for shared gene function has the potential to reveal new biological insights, unobtainable by other means. AVAILABILITY: An online supplement and MATLAB code are available at http://www.sykacek.net/research.html#mcabf

19.
20.
Kivioja T, Tiirikka T, Siermala M, Vihinen M. Gene, 2008, 410(1): 53-66
Gene and protein expression is controlled so that cells can react to changing intra- and extracellular signals by modulating biochemical networks and pathways. We have previously shown that gene expression and the properties of expressed proteins are dynamically correlated. Here we investigated correlations between gene related parameters and gene expression patterns, and found statistically significant correlations in microarray datasets for different cell types, organisms and processes, including human B and T cell stimulation, cell cycle in HeLa cells, infection in intestinal epithelial cells, Drosophila melanogaster life span, and Saccharomyces cerevisiae cell cycle. Our method was applied to time course datasets individually for each time point. We derived from sequence information numerous parameters for nucleotide composition, two-base composition, codon usage, skew parameters, and codon bias. In addition to coding regions, we also investigated correlations for complete genes and introns. Significant dynamic correlations were identified for each of the analyses. Our method also proved useful for detecting dynamic shifts in gene expression profiles, such as in the D. melanogaster dataset. Detection of changes in the properties of expressed genes and proteins might be useful for predicting or following biological processes, responses, growth, differentiation and possibly in related disorders.
