Found 20 similar articles (search time: 10 ms)
1.
Oba S, Sato MA, Takemasa I, Monden M, Matsubara K, Ishii S. Bioinformatics (Oxford, England), 2003, 19(16):2088-2096
MOTIVATION: Gene expression profile analyses have been used in numerous studies covering a broad range of areas in biology. When unreliable measurements are excluded, missing values are introduced in gene expression profiles. Although existing multivariate analysis methods have difficulty with the treatment of missing values, this problem has received little attention. There are many options for dealing with missing values, each of which reaches drastically different results. Ignoring missing values is the simplest method and is frequently applied. This approach, however, has its flaws. In this article, we propose an estimation method for missing values, which is based on Bayesian principal component analysis (BPCA). Although the methodology that a probabilistic model and latent variables are estimated simultaneously within the framework of Bayes inference is not new in principle, actual BPCA implementation that makes it possible to estimate arbitrary missing variables is new in terms of statistical methodology. RESULTS: When applied to DNA microarray data from various experimental conditions, the BPCA method exhibited markedly better estimation ability than other recently proposed methods, such as singular value decomposition and K-nearest neighbors. While the estimation performance of existing methods depends on model parameters whose determination is difficult, our BPCA method is free from this difficulty. Accordingly, the BPCA method provides accurate and convenient estimation for missing values. AVAILABILITY: The software is available at http://hawaii.aist-nara.ac.jp/~shige-o/tools/.
2.
Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on the approach of assessing the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on the estimation of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme or consensus clustering. According to the proposed selection procedure, the stable clustering result is determined with reference to the validity measure for the null hypothesis encoding the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, which is in agreement with biological knowledge and gold standards of cluster quality.
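The consensus matrix mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common definition in which entry (i, j) is the fraction of resampling runs containing both items in which they fell into the same cluster. The function name and the input layout are hypothetical, and the abstract's "clearness" measure is not reproduced because the abstract does not define it.

```python
from itertools import combinations


def consensus_matrix(n_items, runs):
    """Return an n_items x n_items consensus matrix as nested lists.

    `runs` is a list of dicts, one per resampling run, mapping an item
    index to the cluster label it received in that run (hypothetical
    input layout). Entry (i, j) is the fraction of runs that sampled
    both i and j in which the two items shared a cluster.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    matrix = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        matrix[i][i] = 1.0  # an item always co-clusters with itself
        for j in range(n_items):
            if i != j and sampled[i][j]:
                matrix[i][j] = together[i][j] / sampled[i][j]
    return matrix
```

Entries near 0 or 1 indicate pairs that are consistently separated or consistently grouped across runs; intermediate values indicate unstable assignments. Pairs that were never sampled together are left at 0.0 in this sketch.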
3.
Filtering is a common practice used to simplify the analysis of microarray data by removing from subsequent consideration probe sets believed to be unexpressed. The m/n filter, which is widely used in the analysis of Affymetrix data, removes all probe sets having fewer than m present calls among a set of n chips. The m/n filter has been widely used without considering its statistical properties. The level and power of the m/n filter are derived. Two alternative filters, the pooled p-value filter and the error-minimizing pooled p-value filter, are proposed. The pooled p-value filter combines information from the present-absent p-values into a single summary p-value, which is subsequently compared to a selected significance threshold. We show that the pooled p-value filter is the uniformly most powerful statistical test under a reasonable beta model and that it exhibits greater power than the m/n filter in all scenarios considered in a simulation study. The error-minimizing pooled p-value filter compares the summary p-value with a threshold determined to minimize a total-error criterion based on a partition of the distribution of all probes' summary p-values. The pooled p-value and error-minimizing pooled p-value filters clearly perform better than the m/n filter in a case-study analysis. The case-study analysis also demonstrates a proposed method for estimating the number of differentially expressed probe sets excluded by filtering and the subsequent impact on the final analysis. The filter impact analysis shows that the use of even the best filter may hinder, rather than enhance, the ability to discover interesting probe sets or genes. S-plus and R routines to implement the pooled p-value and error-minimizing pooled p-value filters have been developed and are available from www.stjuderesearch.org/depts/biostats/index.html.
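The m/n filter described in this abstract is simple enough to sketch directly. The example below assumes detection calls are encoded as "P" (present), "M" (marginal), and "A" (absent), and that only "P" counts as a present call; the function name and probe-set IDs are hypothetical. It does not implement the pooled p-value filters, whose details the abstract does not give.

```python
def mn_filter(calls, m):
    """Keep probe sets with at least m present calls among their n chips.

    `calls` maps a probe-set ID to its list of detection calls, one per
    chip. The "P"/"M"/"A" encoding is an assumption of this sketch.
    """
    return {
        probe: chip_calls
        for probe, chip_calls in calls.items()
        if sum(call == "P" for call in chip_calls) >= m
    }


# Hypothetical calls for three probe sets on n = 4 chips.
example_calls = {
    "probe_a": ["P", "P", "A", "P"],
    "probe_b": ["A", "M", "P", "A"],
    "probe_c": ["A", "A", "A", "A"],
}
kept = mn_filter(example_calls, m=2)
```

With m = 2, only `probe_a` survives, since the other two probe sets have fewer than two present calls.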
4.
MOTIVATION: Time-course microarray experiments are designed to study biological processes in a temporal fashion. Longitudinal gene expression data arise when biological samples taken from the same subject at different time points are used to measure the gene expression levels. It has been observed that the gene expression patterns of samples of a given tumor measured at different time points are likely to be much more similar to each other than are the expression patterns of tumor samples of the same type taken from different subjects. In statistics, this phenomenon is called the within-subject correlation of repeated measurements on the same subject, and the resulting data are called longitudinal data. It is well known in other applications that valid statistical analyses have to appropriately take account of the possible within-subject correlation in longitudinal data. RESULTS: We apply estimating equation techniques to construct a robust statistic, which is a variant of the robust Wald statistic and accounts for the potential within-subject correlation of longitudinal gene expression data, to detect genes with temporal changes in expression. We associate significance levels to the proposed statistic by either incorporating the idea of the significance analysis of microarrays method or using the mixture model method to identify significant genes. The utility of the statistic is demonstrated by applying it to an important study of osteoblast lineage-specific differentiation. Using simulated data, we also show pitfalls in drawing statistical inference when the within-subject correlation in longitudinal gene expression data is ignored.
5.
Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function theta=Phi(P) of the true data generating distribution P, and an estimate is obtained by applying this function to the empirical distribution P(n). We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as parameters which are compositions of individual mappings for clustering patients and genes. This framework allows one to assess classical properties of clustering methods, such as consistency, and to formally study statistical inference regarding the clustering parameter. We present results of simulations designed to assess the asymptotic validity of different bootstrap methods for estimating the distribution of Phi(P(n)). The method is illustrated on a publicly available data set.
6.
Machaon CVE: cluster validation for gene expression data (Cited by: 2 total; 0 self-citations; 2 by others)
SUMMARY: This paper presents a cluster validation tool for gene expression data. The Machaon CVE (Clustering and Validation Environment) system aims to partition samples or genes into groups characterized by similar expression patterns, and to evaluate the quality of the clusters obtained. AVAILABILITY: The program is freely available for non-profit use on request at http://www.cs.tcd.ie/Nadia.Bolshakova/Machaon.html SUPPLEMENTARY INFORMATION: http://www.cs.tcd.ie/Nadia.Bolshakova/Machaon.html
7.
We propose a statistical model for estimating gene expression using data from multiple laser scans at different settings of hybridized microarrays. A functional regression model is used, based on a non-linear relationship with both additive and multiplicative error terms. The function is derived as the expected value of a pixel, given that values are censored at 65 535, the maximum detectable intensity for double precision scanning software. Maximum likelihood estimation based on a Cauchy distribution is used to fit the model, which is able to estimate gene expressions taking account of outliers and the systematic bias caused by signal censoring of highly expressed genes. We have applied the method to experimental data. Simulation studies suggest that the model can estimate the true gene expression with negligible bias. AVAILABILITY: FORTRAN 90 code for implementing the method can be obtained from the authors.
8.
MOTIVATION: The study of the dynamics of regulatory processes has led to increased interest in the analysis of temporal gene expression level data. To address the dynamics of regulation, expression data are collected repeatedly over time. It is difficult to statistically represent the resulting high-dimensional data. When regulatory processes determine gene expression, time-warping is likely to be present, i.e. the sample of gene expression trajectories reflects variation not only in terms of the expression amplitudes, but also in terms of the temporal structure of gene expression. RESULTS: A non-parametric time-synchronized iterative mean updating technique is proposed to find an overall representation that corresponds to a mode of a sample of expression profiles, viewed as a random sample in function space. The proposed algorithm explores the application of previous work of Hall and Heckman to genome-wide expression data and provides an extension that includes random time-warping with the aim of synchronizing timescales across genes. The proposed algorithm is universally applicable for the construction of modes for functional data with time-warping. We demonstrate the construction of mode functions for a sample of Drosophila gene expression data. The algorithm can be applied to define clusters among the observed trajectories of gene expression, without any kind of prior non-time-warped clustering, as illustrated in the numerical example.
9.
Statistical design and the analysis of gene expression microarray data (Cited by: 18 total; 0 self-citations; 18 by others)
Gene expression microarrays are an innovative technology with enormous promise to help geneticists explore and understand the genome. Although the potential of this technology has been clearly demonstrated, many important and interesting statistical questions persist. We relate certain features of microarrays to other kinds of experimental data and argue that classical statistical techniques are appropriate and useful. We advocate greater attention to experimental design issues and a more prominent role for the ideas of statistical inference in microarray studies.
10.
11.
Background
The availability of high throughput methods for measurement of mRNA concentrations makes the reliability of conclusions drawn from the data and global quality control of samples and hybridization important issues. We address these issues by an information theoretic approach, applied to discretized expression values in replicated gene expression data.
12.
Applying appropriate error models and conservative estimates to microarray data helps to reduce the number of false predictions and allows one to focus on biologically relevant observations. Several key conclusions have been drawn from the statistical analysis of global gene expression data: it is worth keeping core information for each experiment, including raw and processed data; biological and technical replicates are needed; careful experimental design makes the analysis simpler and more powerful; the choice of the similarity measure is nontrivial and depends on the goal of an experiment; array information must be complemented with other data; and gene expression studies are 'hypothesis generators'.
13.
Kernel density smoothing techniques have been used in classification or supervised learning of gene expression profile (GEP) data, but their applications to clustering or unsupervised learning of those data have not been explored and assessed. Here we report a kernel density clustering method for analysing GEP data and compare its performance with the three most widely used clustering methods: hierarchical clustering, K-means clustering, and multivariate mixture model-based clustering. Using several methods to measure agreement, between-cluster isolation, and within-cluster coherence, such as the Adjusted Rand Index, the Pseudo F test, the r(2) test, and the profile plot, we have assessed the effectiveness of kernel density clustering for recovering clusters, and its robustness against noise on clustering both simulated and real GEP data. Our results show that the kernel density clustering method has excellent performance in recovering clusters from simulated data and in grouping large real expression profile data sets into compact and well-isolated clusters, and that it is the most robust clustering method for analysing noisy expression profile data compared to the other three methods assessed.
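Of the agreement measures named in this abstract, the Adjusted Rand Index is the easiest to show in a few lines. The sketch below uses the standard Hubert and Arabie pair-counting formula; it is not taken from the paper, and the function name is hypothetical. Returning 1.0 for degenerate inputs (for example, both partitions placing every item in a single cluster) is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    # Pairs of items grouped together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs  # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, convention of this sketch
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 when two partitions agree up to a relabeling of the clusters, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.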
14.
A gene expression profile of Alzheimer's disease. (Cited by: 12 total; 0 self-citations; 12 by others)
Postmortem analysis of brains of patients with Alzheimer's disease (AD) has led to diverse theories about the causes of the pathology, suggesting that this complex disease involves multiple physiological changes. In an effort to better understand the variety and integration of these changes, we generated a gene expression profile for AD brain. Comparing affected and unaffected brain regions in nine controls and six AD cases, we showed that 118 of the 7050 sequences on a broadly representative cDNA microarray were differentially expressed in the amygdala and cingulate cortex, two regions affected early in the disease. The identity of these genes suggests the most prominent upregulated physiological correlates of pathology involve chronic inflammation, cell adhesion, cell proliferation, and protein synthesis (31 upregulated genes). Conversely, downregulated correlates of pathology involve signal transduction, energy metabolism, stress response, synaptic vesicle synthesis and function, calcium binding, and cytoskeleton (87 downregulated genes). The results support several separate theories of the causes of AD pathology, as well as add to the list of genes associated with AD. In addition, approximately 10 genes of unknown function were found to correlate with the pathology.
15.
16.
17.
Analysis of large-scale gene expression data. (Cited by: 10 total; 0 self-citations; 10 by others)
G Sherlock. Briefings in bioinformatics, 2001, 2(4):350-362
DNA microarray technology has resulted in the generation of large complex data sets, such that the bottleneck in biological investigation has shifted from data generation to data analysis. This review discusses some of the algorithms and tools for the analysis and organisation of microarray expression data, including clustering methods, partitioning methods, and methods for correlating expression data to other biological data.
18.
19.
A global profile of germline gene expression in C. elegans (Cited by: 7 total; 0 self-citations; 7 by others)
Reinke V, Smith HE, Nance J, Wang J, Van Doren C, Begley R, Jones SJ, Davis EB, Scherer S, Ward S, Kim SK. Molecular cell, 2000, 6(3):605-616
20.