20 similar articles found; search took 0 ms
1.
Background
The availability of high throughput methods for measurement of mRNA concentrations makes the reliability of conclusions drawn from the data and global quality control of samples and hybridization important issues. We address these issues by an information theoretic approach, applied to discretized expression values in replicated gene expression data.
2.
MOTIVATION: A serious limitation in microarray analysis is the unreliability of the data generated from low signal intensities. Such data may produce erroneous gene expression ratios and cause unnecessary validation or post-analysis follow-up tasks. Therefore, the elimination of unreliable signal intensities will enhance reproducibility and reliability of gene expression ratios produced from microarray data. In this study, we applied fuzzy c-means (FCM) and normal mixture modeling (NMM) based classification methods to separate microarray data into reliable and unreliable signal intensity populations. RESULTS: We compared the results of FCM classification with those of classification based on NMM. Both approaches were validated against reference sets of biological data consisting of only true positives and true negatives. We observed that both methods performed equally well in terms of sensitivity and specificity. Although a comparison of the computation times indicated that the fuzzy approach is computationally more efficient, other considerations support the use of NMM for the reliability analysis of microarray data. AVAILABILITY: The classification approaches described in this paper and sample microarray data are available as Matlab(TM) (The MathWorks Inc., Natick, MA) programs (mfiles) and text files, respectively, at http://rc.kfshrc.edu.sa/bssc/staff/MusaAsyali/Downloads.asp. The programs can be run/tested on many different computer platforms where Matlab is available. CONTACT: asyali@kfshrc.edu.sa.
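The normal mixture modeling idea in the abstract above can be illustrated with a minimal sketch. It assumes a one-dimensional, two-component normal mixture fitted by EM to (log) signal intensities, with the higher-mean component read as the "reliable" population. The function names, the initialisation and the simulated data are assumptions made here for illustration; they are not taken from the authors' Matlab programs.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(values, n_iter=200):
    """Fit a two-component 1-D normal mixture by EM.

    Returns (weights, means, sds, posteriors); posteriors[i] is the estimated
    probability that values[i] belongs to component 1, which is initialised
    as the higher-mean component.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    # Crude initialisation: lower half versus upper half of the sorted data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    spread = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    post = [0.5] * len(values)

    for _ in range(n_iter):
        # E-step: posterior probability of component 1 for every value.
        post = []
        for x in values:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            post.append(p1 / total if total > 0 else 0.5)

        # M-step: re-estimate weights, means and standard deviations.
        n1 = sum(post)
        n0 = len(values) - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break  # one component has collapsed; keep the last estimates
        means = [sum((1 - r) * x for r, x in zip(post, values)) / n0,
                 sum(r * x for r, x in zip(post, values)) / n1]
        var0 = sum((1 - r) * (x - means[0]) ** 2 for r, x in zip(post, values)) / n0
        var1 = sum(r * (x - means[1]) ** 2 for r, x in zip(post, values)) / n1
        sds = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]
        weights = [n0 / len(values), n1 / len(values)]

    return weights, means, sds, post


if __name__ == "__main__":
    # Simulated log intensities: a low (unreliable) and a high (reliable) group.
    random.seed(0)
    data = ([random.gauss(6.0, 0.5) for _ in range(200)]
            + [random.gauss(10.0, 0.5) for _ in range(200)])
    weights, means, sds, post = fit_two_component_mixture(data)
    reliable = [x for x, r in zip(data, post) if r > 0.5]
    print(len(reliable), "of", len(data), "values classified as reliable")
```

A spot would then be kept when its posterior probability for the higher-mean component exceeds a chosen cut-off; 0.5 is used here only as an example.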
3.
Background
Using genomic DNA as common reference in microarray experiments has recently been tested by different laboratories. Conflicting results have been reported with regard to the reliability of microarray results using this method. To explain it, we hypothesize that data processing is a critical element that impacts the data quality.
Results
Microarray experiments were performed in a γ-proteobacterium Shewanella oneidensis. Pair-wise comparison of three experimental conditions was obtained either with two labeled cDNA samples co-hybridized to the same array, or by employing Shewanella genomic DNA as a standard reference. Various data processing techniques were exploited to reduce the amount of inconsistency between both methods and the results were assessed. We discovered that data quality was significantly improved by imposing the constraint of minimal number of replicates, logarithmic transformation and random error analyses.
Conclusion
These findings demonstrate that data processing significantly influences data quality, which provides an explanation for the conflicting evaluation in the literature. This work could serve as a guideline for microarray data analysis using genomic DNA as a standard reference.
4.
An important goal of microarray studies is the detection of genes that show significant changes in expression when two classes of biological samples are being compared. We present an ANOVA-style mixed model with parameters for array normalization, overall level of gene expression, and change of expression between the classes. For the latter we assume a mixing distribution with a probability mass concentrated at zero, representing genes with no changes, and a normal distribution representing the level of change for the other genes. We estimate the parameters by optimizing the marginal likelihood. To make this practical, Laplace approximations and a backfitting algorithm are used. The performance of the model is studied by simulation and by application to publicly available data sets.
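The mixing distribution described in the abstract above can be written compactly. The notation is an assumption made here for illustration: $\Delta_g$ is the change in expression of gene $g$ between the two classes, $\pi_0$ the proportion of genes with no change, $\delta_0$ a point mass at zero, and $\mu$, $\tau^2$ the mean and variance of the normal component. The paper itself may use different symbols or fix $\mu = 0$.

```latex
\Delta_g \sim \pi_0 \, \delta_0 + (1 - \pi_0) \, \mathcal{N}(\mu, \tau^2)
```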
5.
MOTIVATION: We present a new approach to the analysis of images for complementary DNA microarray experiments. The image segmentation and intensity estimation are performed simultaneously by adopting a two-component mixture model. One component of this mixture corresponds to the distribution of the background intensity, while the other corresponds to the distribution of the foreground intensity. The intensity measurement is a bivariate vector consisting of red and green intensities. The background intensity component is modeled by the bivariate gamma distribution, whose marginal densities for the red and green intensities are independent three-parameter gamma distributions with different parameters. The foreground intensity component is taken to be the bivariate t distribution, with the constraint that the mean of the foreground is greater than that of the background for each of the two colors. The degrees of freedom of this t distribution are inferred from the data but they could be specified in advance to reduce the computation time. Also, the covariance matrix is not restricted to being diagonal and so it allows for nonzero correlation between R and G foreground intensities. This gamma-t mixture model is fitted by maximum likelihood via the EM algorithm. A final step is executed whereby nonparametric (kernel) smoothing is undertaken of the posterior probabilities of component membership. 
The main advantages of this approach are: (1) it enjoys the well-known strengths of a mixture model, namely flexibility and adaptability to the data; (2) it considers the segmentation and intensity simultaneously and not separately as in commonly used existing software, and it also works with the red and green intensities in a bivariate framework as opposed to their separate estimation via univariate methods; (3) the use of the three-parameter gamma distribution for the background red and green intensities provides a much better fit than the normal (log normal) or t distributions; (4) the use of the bivariate t distribution for the foreground intensity provides a model that is less sensitive to extreme observations; (5) as a consequence of the aforementioned properties, it allows segmentation to be undertaken for a wide range of spot shapes, including doughnut, sickle shape and artifacts. RESULTS: We apply our method for gridding, segmentation and estimation to real cDNA microarray images and artificial data. Our method provides better segmentation results in spot shapes as well as intensity estimation than the Spot and spotSegmentation R software packages. It detected blank spots as well as bright artifacts for the real data, and estimated spot intensities with high accuracy for the synthetic data. AVAILABILITY: The algorithms were implemented in Matlab. The Matlab codes implementing both the gridding and segmentation/estimation are available upon request. SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.
6.
7.
Background
Quality assessment of microarray data is an important and often challenging aspect of gene expression analysis. This task frequently involves the examination of a variety of summary statistics and diagnostic plots. The interpretation of these diagnostics is often subjective, and generally requires careful expert scrutiny.
8.
Gaussian mixture clustering and imputation of microarray data (total citations: 3; self-citations: 0; citations by others: 3)
MOTIVATION: In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. RESULTS: Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
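The first evaluation metric in the abstract above, the root mean squared error between true and imputed values, can be sketched as follows. Row-mean imputation is used only as a simple stand-in baseline; it is not the Gaussian mixture method of the paper, and all function names are hypothetical.

```python
import math


def row_mean_impute(matrix):
    """Fill None entries with the mean of the observed entries in the same row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Fall back to 0.0 for a row with no observed values (arbitrary choice).
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def imputation_rmse(true_matrix, masked_matrix, imputed_matrix):
    """RMSE computed only over the entries hidden (None) in masked_matrix."""
    squared_errors = []
    for true_row, masked_row, imputed_row in zip(true_matrix, masked_matrix,
                                                 imputed_matrix):
        for t, m, i in zip(true_row, masked_row, imputed_row):
            if m is None:
                squared_errors.append((t - i) ** 2)
    if not squared_errors:
        raise ValueError("no masked entries to evaluate")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Rows are genes, columns are arrays; two known entries are hidden.
    true_matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 9.0]]
    masked_matrix = [[1.0, None, 3.0], [4.0, 5.0, None]]
    imputed_matrix = row_mean_impute(masked_matrix)
    print(imputation_rmse(true_matrix, masked_matrix, imputed_matrix))
```

Hiding entries whose true values are known, imputing them, and scoring only those entries is a common way to compare imputation methods on the same data set.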
9.
10.
11.
Large-scale gene expression profiling with DNA microarrays opens new dimensions to molecular biology but still lacks the overall precision of traditional low-scale techniques. We developed a novel strategy of data processing linking search stringency to quality indicators for efficient detection of low-level, regulated genes. Using retinoid-induced differentiation of NB-4 promyelocytic cells, the variation of expression profiles between biological duplicates was studied and compared with the changes induced by all-trans retinoic acid (atRA) treatment. An analysis of 4320 genes showed that retinoic acid has a mainly gene-activating function in NB-4 cells. Treatment with atRA for 18 hours induced metabolic genes that may be associated with cell differentiation and signaling factors triggering later events leading to apoptosis; cytokine genes were among those most strongly stimulated by atRA. Notably, we identified a regulatory loop inhibiting MYC action: as MYC was downregulated, a cognate repressor of MYC was upregulated.
12.
Jianjun Hu Haifeng Li Michael S Waterman Xianghong Jasmine Zhou 《BMC bioinformatics》2006,7(1):449-14
Background
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples.
13.
MOTIVATION: Identifying patterns of co-expression in microarray data by cluster analysis has been a productive approach to uncovering molecular mechanisms underlying biological processes under investigation. Using experimental replicates can generally improve the precision of the cluster analysis by reducing the experimental variability of measurements. In such situations, Bayesian mixtures allow for an efficient use of information by precisely modeling between-replicates variability. RESULTS: We developed different variants of Bayesian mixture based clustering procedures for clustering gene expression data with experimental replicates. In this approach, the statistical distribution of microarray data is described by a Bayesian mixture model. Clusters of co-expressed genes are created from the posterior distribution of clusterings, which is estimated by a Gibbs sampler. We define infinite and finite Bayesian mixture models with different between-replicates variance structures and investigate their utility by analyzing synthetic and the real-world datasets. Results of our analyses demonstrate that (1) improvements in precision achieved by performing only two experimental replicates can be dramatic when the between-replicates variability is high, (2) precise modeling of intra-gene variability is important for accurate identification of co-expressed genes and (3) the infinite mixture model with the 'elliptical' between-replicates variance structure performed overall better than any other method tested. We also introduce a heuristic modification to the Gibbs sampler based on the 'reverse annealing' principle. This modification effectively overcomes the tendency of the Gibbs sampler to converge to different modes of the posterior distribution when started from different initial positions. 
Finally, we demonstrate that the Bayesian infinite mixture model with 'elliptical' variance structure is capable of identifying the underlying structure of the data without knowing the 'correct' number of clusters. AVAILABILITY: The MS Windows based program named Gaussian Infinite Mixture Modeling (GIMM) implementing the Gibbs sampler and corresponding C++ code are available at http://homepages.uc.edu/~medvedm/GIMM.htm SUPPLEMENTAL INFORMATION: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html
14.
To detect changes in gene expression data from microarrays, a fixed threshold for fold difference is used widely. However, it is not always guaranteed that a threshold value which is appropriate for highly expressed genes is suitable for lowly expressed genes. In this study, aiming at detecting truly differentially expressed genes from a wide expression range, we proposed an adaptive threshold method (AT). The adaptive thresholds, which have different values for different expression levels, are calculated based on two measurements under the same condition. The sensitivity, specificity and false discovery rate (FDR) of AT were investigated by simulations. The sensitivity and specificity under various noise conditions were greater than 89.7% and 99.32%, respectively. The FDR was smaller than 0.27. These results demonstrated the reliability of the method.
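The three performance measures quoted in the abstract above follow from the counts of true and false calls. The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), FDR = FP / (FP + TP)); the function name and the toy labels are assumptions for illustration, and the adaptive threshold method itself is not reproduced.

```python
def classification_rates(truth, called):
    """Return (sensitivity, specificity, fdr) for paired boolean labels.

    truth[i]  -- True if gene i is truly differentially expressed
    called[i] -- True if the method calls gene i differentially expressed
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    # By convention the FDR is taken as 0 when nothing is called.
    fdr = fp / (fp + tp) if (fp + tp) else 0.0
    return sensitivity, specificity, fdr


if __name__ == "__main__":
    truth = [True, True, True, False, False, False, False, False]
    called = [True, True, False, True, False, False, False, False]
    print(classification_rates(truth, called))
```

In a simulation study the `truth` labels are known by construction, which is what makes all three rates computable.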
15.
Analysis of microarray experiments is complicated by the huge amount of data involved. Searching for groups of co-expressed genes is akin to searching for protein families in a database as, in both cases, small subsets of genes with similar features are to be found within vast quantities of data. CLANS was originally developed to find protein families in large sets of amino acid sequences where the amount of data involved made phylogenetic approaches overly cumbersome. We present a number of improvements that greatly extend the previous version of CLANS and show its application to microarray data as well as its ability to incorporate additional information to facilitate interactive analysis. AVAILABILITY: The program is available for download from: http://bioinfoserver.rsbs.anu.edu.au/downloads/clans/
16.
Kelemen JZ Kertész-Farkas A Kocsor A Puskás LG 《Bioinformatics (Oxford, England)》2006,22(24):3047-3053
MOTIVATION: In this paper, we propose using the Kalman filter (KF) as a pre-processing step in microarray-based molecular diagnosis. Incorporating the expression covariance between genes is important in such classification problems, since this represents the functional relationships that govern tissue state. Failing to fulfil such requirements may result in biologically implausible class prediction models. Here, we show that employing the KF to remove noise (while retaining meaningful covariance and thus being able to estimate the underlying biological state from microarray measurements) yields linearly separable data suitable for most classification algorithms. RESULTS: We demonstrate the utility and performance of the KF as a robust disease-state estimator on publicly available binary and multi-class microarray datasets in combination with the most widely used classification methods to date. Moreover, using popular graphical representation schemes we show that our filtered datasets also have an improved visualization capability.
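For readers unfamiliar with the Kalman filter, its predict/update recursion can be sketched in the scalar case. This is a deliberately simplified illustration that assumes a random-walk state model and a single measured quantity; it ignores the between-gene covariance that the abstract above emphasises, and the function name and parameter values are hypothetical rather than taken from the paper.

```python
def kalman_filter_1d(measurements, process_var, measurement_var,
                     initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter with a random-walk state model.

    Returns the list of filtered state estimates, one per measurement.
    """
    estimate = initial_estimate
    var = initial_var
    filtered = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but its uncertainty grows.
        var = var + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = var / (var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        var = (1.0 - gain) * var
        filtered.append(estimate)
    return filtered


if __name__ == "__main__":
    noisy = [5.2, 4.7, 5.4, 4.9, 5.1, 4.8, 5.3]
    print(kalman_filter_1d(noisy, process_var=0.01, measurement_var=1.0))
```

A larger `measurement_var` relative to `process_var` makes the filter trust its running estimate more and smooth more aggressively; with `measurement_var` set to zero the output simply reproduces the measurements.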
17.
Background
Gene microarray technology provides the ability to study the regulation of thousands of genes simultaneously, but its potential is limited without an estimate of the statistical significance of the observed changes in gene expression. Due to the large number of genes being tested and the comparatively small number of array replicates (e.g., N = 3), standard statistical methods such as the Student's t-test fail to produce reliable results. Two other statistical approaches commonly used to improve significance estimates are a penalized t-test and a Z-test using intensity-dependent variance estimates.
18.
19.
20.
MOTIVATION: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. RESULTS: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. AVAILABILITY: EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/