Found 20 similar articles.
1.
Background
Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype, and we propose an alternative method that takes into account the individuality of each patient sample.
2.
MOTIVATION: In the process of developing risk prediction models, various steps of model building and model selection are involved. If this process is not adequately controlled, overfitting may result in serious overoptimism leading to potentially erroneous conclusions. METHODS: For right censored time-to-event data, we estimate the prediction error for assessing the performance of a risk prediction model (Gerds and Schumacher, 2006; Graf et al., 1999). Furthermore, resampling methods are used to detect overfitting and resulting overoptimism and to adjust the estimates of prediction error (Gerds and Schumacher, 2007). RESULTS: We show how and to what extent the methodology can be used in situations characterized by a large number of potential predictor variables where overfitting may be expected to be overwhelming. This is illustrated by estimating the prediction error of some recently proposed techniques for fitting a multivariate Cox regression model applied to the data of a prognostic study in patients with diffuse large-B-cell lymphoma (DLBCL). AVAILABILITY: Resampling-based estimation of prediction error curves is implemented in an R package called pec available from the authors.
3.
4.
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because of the large number of genes measured, many of the genes in the original data set are irrelevant to the analysis. Selection of discriminatory genes is critical to the accuracy of clustering and prediction. This paper considers a statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV, with a total of 55 samples) is used to illustrate a general framework of the approach. Among four selected gene sets, a gene set omega(I) formed by the intersection of the F-test set and the union of the one-versus-all t-test sets performs the best in terms of clustering as well as prediction. Hierarchical and two modified partition (k-means) methods all show that the set omega(I) is able to group the 55 samples into seven clusters reasonably well, in which the As and AsV samples are considered as one cluster (the same group), as are the Cd and Cu samples. With respect to prediction, the overall accuracy for the gene set omega(I) using the nearest neighbors algorithm to assign the 55 samples to one of the nine treatments is 85%.
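The construction of omega(I) described in this abstract is a set operation, which can be sketched as follows. This is only an illustration: the gene identifiers, the choice of treatments shown, and the function name `build_gene_set` are hypothetical, and the significance tests that would produce the input sets are not reproduced here.

```python
def build_gene_set(f_test_genes, one_vs_all_genes):
    """Intersect the F-test gene set with the union of the
    one-versus-all t-test gene sets (one set per treatment)."""
    union_of_t_tests = set().union(*one_vs_all_genes.values())
    return set(f_test_genes) & union_of_t_tests


# Hypothetical inputs: genes already declared significant by each test.
f_test_genes = {"g1", "g2", "g3", "g5"}
one_vs_all_genes = {
    "As": {"g1", "g4"},
    "Cd": {"g2", "g6"},
    "Ni": {"g7"},
}

omega_i = build_gene_set(f_test_genes, one_vs_all_genes)
print(sorted(omega_i))
```

A gene enters the result only if it passes the overall F-test and also distinguishes at least one treatment from all the others.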
5.
6.
7.
8.
Finding edging genes from microarray data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: A set of genes and their gene expression levels are used to classify disease and normal tissues. Due to the massive number of genes in a microarray, there are a large number of edges that divide different classes of genes in microarray space. The edging genes (EGs) can be co-regulated genes; they can also be on the same pathway or deregulated by the same non-coding genes, such as siRNA or miRNA. Every gene in EGs is vital for identifying a tissue's class. A change in one EG's expression may cause a tissue to alter from normal to disease and vice versa. Finding EGs is of biological importance. In this work, we propose an algorithm to effectively find these EGs. RESULT: We tested our algorithm with five microarray datasets. The results are compared with the border-based algorithm, which was used to find gene groups and subsequently divide different classes of tissues. Our algorithm finds a significantly larger number of EGs than does the border-based algorithm. As our algorithm prunes irrelevant patterns at earlier stages, its time and space costs are much lower than those of the border-based algorithm. AVAILABILITY: The proposed algorithm is implemented in C++ on the Linux platform. The EGs in five microarray datasets are calculated. The preprocessed datasets and the discovered EGs are available at http://www3.it.deakin.edu.au/~phoebe/microarray.html.
9.
Background
Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used.
10.
Sample classification and class prediction are the aims of many gene expression studies. We present a web-based application, Prophet, which builds prediction rules and allows using them for further sample classification. Prophet automatically chooses the best classifier, along with the optimal selection of genes, using a strategy that renders unbiased cross-validated errors. Prophet is linked to different microarray data analysis modules, and includes a unique feature: the possibility of performing the functional interpretation of the molecular signature found. Availability: Prophet can be found at the URL http://prophet.bioinfo.cipf.es/ or within the GEPAS package at http://www.gepas.org/ Supplementary information: http://gepas.bioinfo.cipf.es/tutorial/prophet.html.
11.
Chaotic mixer improves microarray hybridization (total citations: 3; self-citations: 0; citations by others: 3)
McQuain MK, Seale K, Peek J, Fisher TS, Levy S, Stremler MA, Haselton FR. Analytical Biochemistry, 2004, 325(2):215-226
Hybridization is an important aspect of microarray experimental design which influences array signal levels and the repeatability of data within an array and across different arrays. Current methods typically require 24 h and use target inefficiently. In these studies, we compare hybridization signals obtained in conventional static hybridization, which depends on diffusional target delivery, with signals obtained in a dynamic hybridization chamber, which employs a fluid mixer based on chaotic advection theory to deliver targets across a conventional glass slide array. Microarrays were printed with a pattern of 102 identical probe spots containing a 65-mer oligonucleotide capture probe. Hybridization of a 725-bp fluorescently labeled target was used to measure average target hybridization levels, local signal-to-noise ratios, and array hybridization uniformity. Dynamic hybridization for 1 h with 1 or 10 ng of target DNA increased hybridization signal intensities approximately threefold over a 24-h static hybridization. Similarly, a 10- or 60-min dynamic hybridization of 10 ng of target DNA increased hybridization signal intensities fourfold over a 24-h static hybridization. In time course studies, static hybridization reached a maximum within 8 to 12 h using either 1 or 10 ng of target. In time course studies using the dynamic hybridization chamber, hybridization using 1 ng of target increased to a maximum at 4 h and that using 10 ng of target did not vary over the time points tested. In comparison to static hybridization, dynamic hybridization increased the signal-to-noise ratios threefold and reduced spot-to-spot variation twofold. Therefore, we conclude that dynamic hybridization based on a chaotic mixer design improves both the speed of hybridization and the maximum level of hybridization while increasing signal-to-noise ratios and reducing spot-to-spot variation.
12.
13.
14.
MOTIVATION: The last few years have seen the development of DNA microarray technology that allows simultaneous measurement of the expression levels of thousands of genes. While many methods have been developed to analyze such data, most have been visualization-based. Methods that yield quantitative conclusions have been diverse and complex. RESULTS: We present two straightforward methods for identifying specific genes whose expression is linked with a phenotype or outcome variable as well as for systematically predicting sample class membership: (1) a conservative, permutation-based approach to identifying differentially expressed genes; (2) an augmentation of K-nearest-neighbor pattern classification. Our analyses replicate the quantitative conclusions of Golub et al. (1999; Science, 286, 531-537) on leukemia data, with better classification results, using far simpler methods. With the breast tumor data of Perou et al. (2000; Nature, 406, 747-752), the methods lend rigorous quantitative support to the conclusions of the original paper. In the case of the lymphoma data in Alizadeh et al. (2000; Nature, 403, 503-511), our analyses only partially support the conclusions of the original authors. AVAILABILITY: The software and supplementary information are available freely to researchers at academic and non-profit institutions at http://cc.ucsf.edu/jain/public
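For readers unfamiliar with permutation testing, the general idea behind a permutation-based test of differential expression for a single gene might look like the sketch below. It is a generic two-group permutation test on the difference in means, not the specific conservative procedure of the paper; the function name, the test statistic, and the example values are all assumptions made for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the absolute difference in means.

    Class labels are shuffled n_permutations times; the p-value is the
    share of shuffles whose statistic is at least as large as the
    observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point summation order.
        if diff >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical expression values of one gene in two sample classes.
class_1 = [5.1, 4.9, 5.3, 5.0, 5.2]
class_2 = [1.0, 1.2, 0.9, 1.1, 1.3]
p_value = permutation_p_value(class_1, class_2)
print(p_value)
```

In a genome-wide analysis this would be repeated per gene and followed by a multiple-testing adjustment, which the sketch omits.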
15.
Predicting the functional roles of proteins based on various genome-wide data, such as protein-protein association networks, has become a canonical problem in computational biology. Approaching this task as a binary classification problem, we develop a network-based extension of the spatial auto-probit model. In particular, we develop a hierarchical Bayesian probit-based framework for modeling binary network-indexed processes, with a latent multivariate conditional autoregressive Gaussian process. The latter allows for the easy incorporation of protein-protein association network topologies, either binary or weighted, in modeling protein functional similarity. We use this framework to predict protein functions, for functions defined as terms in the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functionality. Furthermore, we show how a natural extension of this framework can be used to model and correct for the high percentage of false negative labels in training data derived from GO, a serious shortcoming endemic to biological databases of this type. The performance of our method is evaluated and compared with standard algorithms on weighted yeast protein-protein association networks, extracted from a recently developed integrative database called Search Tool for the Retrieval of INteracting Genes/proteins (STRING). Results show that our basic method is competitive with these other methods, and that the extended method, which incorporates the uncertainty in negative labels among the training data, can yield nontrivial improvements in predictive accuracy.
16.
Extracting binary signals from microarray time-course data (total citations: 1; self-citations: 0; citations by others: 1)
This article presents a new method for analyzing microarray time courses by identifying genes that undergo abrupt transitions in expression level, and the time at which the transitions occur. The algorithm matches the sequence of expression levels for each gene against temporal patterns having one or two transitions between two expression levels. The algorithm reports a P-value for the matching pattern of each gene, and a global false discovery rate can also be computed. After matching, genes can be sorted by the direction and time of transitions. Genes can be partitioned into sets based on the direction and time of change for further analysis, such as comparison with Gene Ontology annotations or binding site motifs. The method is evaluated on simulated and actual time-course data. On microarray data for budding yeast, it is shown that the groups of genes that change in similar ways and at similar times have significant and relevant Gene Ontology annotations.
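Matching a time course against a one-transition pattern can be illustrated with a simple step fit. The sketch below assumes a least-squares criterion and handles only a single transition; the abstract does not state the matching criterion, and the P-value and false discovery rate calculations are not reproduced. The function name and data are hypothetical.

```python
def best_single_transition(levels):
    """Fit a two-level step to a time course.

    Returns (split_index, direction, rss), where the transition occurs
    just before levels[split_index] and rss is the residual sum of
    squares of the best step fit.
    """
    if len(levels) < 2:
        raise ValueError("need at least two time points")
    best = None
    for split in range(1, len(levels)):
        before, after = levels[:split], levels[split:]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        rss = sum((x - mean_before) ** 2 for x in before)
        rss += sum((x - mean_after) ** 2 for x in after)
        if best is None or rss < best[2]:
            if mean_after > mean_before:
                direction = "up"
            elif mean_after < mean_before:
                direction = "down"
            else:
                direction = "flat"
            best = (split, direction, rss)
    return best


# Hypothetical expression levels of one gene over six time points.
time_course = [0.1, 0.0, 0.2, 2.0, 2.1, 1.9]
split, direction, rss = best_single_transition(time_course)
print(split, direction)
```

Sorting genes by the returned split index and direction gives the kind of grouping by time and direction of change that the abstract describes.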
17.
Extracting three-way gene interactions from microarray data (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: It is an important and difficult task to extract gene network information from high-throughput genomic data. A common approach is to cluster genes using pairwise correlation as a distance metric. However, pairwise correlation is clearly too simplistic to describe the complex relationships among real genes since co-expression relationships are often restricted to a specific set of biological conditions/processes. In this study, we describe a three-way gene interaction model that captures the dynamic nature of the co-expression relationship between a gene pair through the introduction of a controller gene. RESULTS: We surveyed 0.4 billion possible three-way interactions among 1000 genes in a microarray dataset containing 678 human cancer samples. To test the reproducibility and statistical significance of our results, we randomly split the samples into a training set and a testing set. We found that the gene triplets with the strongest interactions (i.e. with the smallest P-values from appropriate statistical tests) in the training set also had the strongest interactions in the testing set. A distinctive pattern of three-way interaction emerged from these gene triplets: depending on whether the third gene is expressed or not, the remaining two genes can be either co-expressed or mutually exclusive (i.e. expression of either one of them would repress the other). Such three-way interactions can exist without apparent pairwise correlations. The identified three-way interactions may constitute candidates for further experimentation using techniques such as RNA interference, so that novel gene networks or pathways could be identified.
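The claim that a three-way interaction can exist without an apparent pairwise correlation can be illustrated with a toy example. The sketch below splits samples by a controller gene and computes the Pearson correlation of a gene pair in each stratum; the thresholding rule, the function names, and the data are assumptions for illustration, not the statistical model used in the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def conditional_correlations(gene_a, gene_b, controller, threshold):
    """Correlation of gene_a and gene_b when the controller gene is
    above the threshold ("on") and when it is not ("off")."""
    on = [(a, b) for a, b, c in zip(gene_a, gene_b, controller) if c > threshold]
    off = [(a, b) for a, b, c in zip(gene_a, gene_b, controller) if c <= threshold]
    r_on = pearson([a for a, _ in on], [b for _, b in on])
    r_off = pearson([a for a, _ in off], [b for _, b in off])
    return r_on, r_off


# Hypothetical expression values across eight samples.
controller = [1, 1, 1, 1, 0, 0, 0, 0]
gene_a = [1, 2, 3, 4, 1, 2, 3, 4]
gene_b = [1, 2, 3, 4, 4, 3, 2, 1]

r_on, r_off = conditional_correlations(gene_a, gene_b, controller, threshold=0.5)
r_pooled = pearson(gene_a, gene_b)
print(r_on, r_off, r_pooled)
```

In this constructed data the pair is perfectly co-expressed when the controller is on and perfectly anti-correlated when it is off, so the two effects cancel in the pooled correlation.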
18.
19.
Correction of technical bias in clinical microarray data improves concordance with known biological information (total citations: 1; self-citations: 0; citations by others: 1)
The performance of gene expression microarrays has been well characterized using controlled reference samples, but the performance on clinical samples remains less clear. We identified sources of technical bias affecting many genes in concert, thus causing spurious correlations in clinical data sets and false associations between genes and clinical variables. We developed a method to correct for technical bias in clinical microarray data, which increased concordance with known biological relationships in multiple data sets.