Found 20 similar articles.
1.
Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines (cited by 11: 0 self-citations, 11 by others)
Simultaneous multiclass classification of tumor types is essential for future clinical implementations of microarray-based cancer diagnosis. In this study, we have combined genetic algorithms (GAs) and all paired support vector machines (SVMs) for multiclass cancer identification. The predictive features have been selected through iterative SVMs/GAs, and recursive feature elimination post-processing steps, leading to a very compact cancer-related predictive gene set. Leave-one-out cross-validations yielded accuracies of 87.93% for the eight-class and 85.19% for the fourteen-class cancer classifications, outperforming the results derived from previously published methods.
2.
Cluster-Rasch models for microarray gene expression data (cited by 1: 0 self-citations, 1 by others)
Background
We propose two different formulations of the Rasch statistical models to the problem of relating gene expression profiles to the phenotypes. One formulation allows us to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes; this model can also be used for future prediction. The other formulation provides an alternative way of identifying genes that are over- or underexpressed from their expression levels in tissue or cell samples of a given tissue or cell type.
Results
We illustrate the methods on available datasets of a classification of acute leukemias and of 60 cancer cell lines. For tumor classification, the results are comparable to those previously obtained. For the cancer cell lines dataset, we found four clusters of genes that are related to drug response for many of the 90 drugs that we considered. In addition, for each type of cell line, we identified genes that are over- or underexpressed relative to other genes.
Conclusions
The cluster-Rasch model provides a probabilistic model for describing gene expression patterns across samples and can be used to relate gene expression profiles to phenotypes.
3.
Arianne C Richard Paul A Lyons James E Peters Daniele Biasci Shaun M Flint James C Lee Eoin F McKinney Richard M Siegel Kenneth GC Smith 《BMC genomics》2014,15(1)
Background
Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study.
Results
Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) in selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this “gold-standard” comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues.
Conclusions
Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges but suffer from expression-dependent accuracy bias and precision that varies across datasets. We urge microarray users to consider expression-level effects in signal interpretation and to evaluate noise properties in each dataset independently.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-649) contains supplementary material, which is available to authorized users.
4.
An accurate classifier with linguistic interpretability using a small number of relevant genes is beneficial to microarray data analysis and development of inexpensive diagnostic tests. Several frequently used techniques for designing classifiers of microarray data, such as support vector machines, neural networks, k-nearest neighbor, and logistic regression models, suffer from low interpretability. This paper proposes an interpretable gene expression classifier (named iGEC) with an accurate and compact fuzzy rule base for microarray data analysis. The design of iGEC has three objectives to be simultaneously optimized: maximal classification accuracy, minimal number of rules, and minimal number of used genes. An "intelligent" genetic algorithm IGA is used to efficiently solve the design problem with a large number of tuning parameters. The performance of iGEC is evaluated using eight commonly-used data sets. It is shown that iGEC has an accurate, concise, and interpretable rule base (1.1 rules per class) on average in terms of test classification accuracy (87.9%), rule number (3.9), and used gene number (5.0). Moreover, iGEC not only has better performance than the existing fuzzy rule-based classifier in terms of the above-mentioned objectives, but also is more accurate than some existing non-rule-based classifiers.
5.
6.
7.
8.
Assessing gene significance from cDNA microarray expression data via mixed models. (cited by 29: 0 self-citations, 29 by others)
R D Wolfinger G Gibson E D Wolfinger L Bennett H Hamadeh P Bushel C Afshari R S Paules 《Journal of computational biology》2001,8(6):625-637
The determination of a list of differentially expressed genes is a basic objective in many cDNA microarray experiments. We present a statistical approach that allows direct control over the percentage of false positives in such a list and, under certain reasonable assumptions, improves on existing methods with respect to the percentage of false negatives. The method accommodates a wide variety of experimental designs and can simultaneously assess significant differences between multiple types of biological samples. Two interconnected mixed linear models are central to the method and provide a flexible means to properly account for variability both across and within genes. The mixed model also provides a convenient framework for evaluating the statistical power of any particular experimental design and thus enables a researcher to a priori select an appropriate number of replicates. We also suggest some basic graphics for visualizing lists of significant genes. Analyses of published experiments studying human cancer and yeast cells illustrate the results.
9.
New stochastic models are developed for the dynamics of a viral infection and an immune response during the early stages of infection. The stochastic models are derived based on the dynamics of deterministic models. The simplest deterministic model is a well-known system of ordinary differential equations which consists of three populations: uninfected cells, actively infected cells, and virus particles. This basic model is extended to include some factors of the immune response related to Human Immunodeficiency Virus-1 (HIV-1) infection. For the deterministic models, the basic reproduction number, R0, is calculated and it is shown that if R0<1, the disease-free equilibrium is locally asymptotically stable and is globally asymptotically stable in some special cases. The new stochastic models are systems of stochastic differential equations (SDEs) and continuous-time Markov chain (CTMC) models that account for the variability in cellular reproduction and death, the infection process, the immune system activation, and viral reproduction. Two viral release strategies are considered: budding and bursting. The CTMC model is used to estimate the probability of virus extinction during the early stages of infection. Numerical simulations are carried out using parameter values applicable to HIV-1 dynamics. The stochastic models provide new insights, distinct from the basic deterministic models. For the case R0>1, the deterministic models predict the viral infection persists in the host. But for the stochastic models, there is a positive probability of viral extinction. It is shown that the probability of a successful invasion depends on the initial viral dose, whether the immune system is activated, and whether the release strategy is bursting or budding.
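The extinction result described above can be illustrated with a minimal sketch. It is not the authors' model: it assumes a budding-only CTMC with a fixed pool of target cells (a common early-infection simplification), simulates only the embedded jump chain, and treats any run whose population reaches `cap` as an established infection. The function name, its parameters, and any rate values passed to it are hypothetical.

```python
import random


def extinction_probability(v0, beta_t0, c, p, delta, runs=300, cap=100, seed=0):
    """Estimate the probability that an infection started by v0 virions dies out.

    Assumed (hypothetical) per-capita rates:
      c        virion clearance
      beta_t0  infection of a target cell by a virion (target pool held fixed)
      delta    death of an infected cell
      p        budding of one virion from an infected cell
    """
    rng = random.Random(seed)
    extinct = 0
    for _ in range(runs):
        v, i = v0, 0  # free virions, infected cells
        while 0 < v + i < cap:
            rates = [c * v, beta_t0 * v, delta * i, p * i]
            event = rng.choices(range(4), weights=rates)[0]
            if event == 0:    # a virion is cleared
                v -= 1
            elif event == 1:  # a virion infects a target cell
                v -= 1
                i += 1
            elif event == 2:  # an infected cell dies
                i -= 1
            else:             # an infected cell buds one virion
                v += 1
        extinct += (v + i == 0)
    return extinct / runs
```

Calling the function with the same rates but different values of `v0` gives a rough picture of how the initial viral dose changes the chance of extinction; a bursting release strategy would need a different event list.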
10.
11.
An evolutionary approach for gene selection and classification of microarray data based on SVM error-bound theories (cited by 1: 0 self-citations, 1 by others)
Microarrays have thousands to tens-of-thousands of gene features, but only a few hundred patient samples are available. The fundamental problem in microarray data analysis is identifying genes whose disruption causes congenital or acquired disease in humans. In this paper, we propose a new evolutionary method that can efficiently select a subset of potentially informative genes for support vector machine (SVM) classifiers. The proposed evolutionary method uses SVM with a given subset of gene features to evaluate the fitness function, and new subsets of features are selected based on the estimates of generalization error of SVMs and frequency of occurrence of the features in the evolutionary approach. Thus, in theory, selected genes reflect to some extent the generalization performance of SVM classifiers. We compare our proposed method with several existing methods and find that the proposed method can obtain better classification accuracy with a smaller number of selected genes than the existing methods.
12.
Lai Y 《Biostatistics (Oxford, England)》2007,8(4):744-755
Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables. When these variables are simultaneously screened by a statistical test, it is necessary to consider the adjustment for multiple hypothesis testing. The false discovery rate has been proposed and widely used to address this issue. A related problem is the estimation of the proportion of true null hypotheses. The long-standing difficulty with this problem is the identifiability of the nonparametric model. In this study, we propose a moment-based method coupled with sample splitting for estimating this proportion. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method will solve the identifiability problem and give its optimal performance. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that the identifiability can be achieved. Theoretical aspects of the approximation error are discussed. The proposed estimation method is completely nonparametric and simple with an explicit formula. Simulation studies show the favorable performance of the proposed method when it is compared to the other existing methods. Two microarray gene expression data sets are considered for applications.
13.
Recent advances in technologies such as DNA microarrays have provided an abundance of gene expression data on the genomic scale. One of the most important projects in the post-genome era is the systemic identification of gene expression networks. However, inferring internal gene expression structure from experimentally observed time-series data is an inverse problem. We have therefore developed a system for inferring network candidates based on experimental observations. Moreover, we have proposed an analytical method for extracting common core binomial genetic interactions from various network candidates. Common core binomial genetic interactions are reliable interactions with a higher possibility of existence, and are important for understanding the dynamic behavior of gene expression networks. Here, we discuss an efficient method for inferring genetic interactions that combines a Step-by-step strategy (Y. Maki, Y. Takahashi, Y. Arikawa, S. Watanabe, K. Aoshima, Y. Eguchi, T. Ueda, S. Aburatani, S. Kuhara, M. Okamoto, An integrated comprehensive workbench for inferring genetic networks: Voyagene, Journal of Bioinformatics and Computational Biology 2(3) (2004) 533.) with an analysis method for extracting common core binomial genetic interactions.
14.
Hierarchical Bayes models for cDNA microarray gene expression (cited by 2: 0 self-citations, 2 by others)
cDNA microarrays are used in many contexts to compare mRNA levels between samples of cells. Microarray experiments typically give us expression measurements on 1000-20 000 genes, but with few replicates for each gene. Traditional methods using means and standard deviations to detect differential expression are not satisfactory in this context. A handful of alternative statistics have been developed, including several empirical Bayes methods. In the present paper we present two full hierarchical Bayes models for detecting gene expression, of which one (D) describes our microarray data very well. We also compare the full Bayes and empirical Bayes approaches with respect to model assumptions, false discovery rates and computer running time. The proposed models are compared to existing empirical Bayes models in a simulation study and for a set of data (Yuen et al., 2002), where 27 genes have been categorized by quantitative real-time PCR. It turns out that the existing empirical Bayes methods have at least as good performance as the full Bayes ones.
15.
A third-order algorithm for stochastic dynamics (SD) simulations is proposed, identical to the powerful molecular dynamics leap-frog algorithm in the limit of infinitely small friction coefficient γ. It belongs to the class of SD algorithms in which the integration time step Δt is not limited by the condition Δt ≤ γ⁻¹, but only by the properties of the systematic force. It is shown how constraints, such as bond length or bond angle constraints, can be incorporated in the computational scheme. It is argued that the third-order Verlet-type SD algorithm proposed earlier may be simplified without losing its third-order accuracy. The leap-frog SD algorithm is proven to be equivalent to the Verlet-type SD algorithm. Both these SD algorithms are slightly more economical on computer storage than the Beeman-type SD algorithm.
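For orientation, the zero-friction limit mentioned in this abstract is the standard molecular dynamics leap-frog scheme, sketched below. This is not the SD algorithm itself (it has no friction or random-force terms); the function name, its signature, and the harmonic-oscillator usage are assumptions made for illustration.

```python
import math


def leapfrog(x, v_half, force, mass, dt, steps):
    """Plain leap-frog integration in one dimension.

    x is the position at time t and v_half the velocity at t - dt/2;
    velocities and positions are advanced alternately, half a step apart.
    """
    for _ in range(steps):
        v_half += force(x) / mass * dt  # v(t + dt/2) = v(t - dt/2) + a(t) dt
        x += v_half * dt                # x(t + dt)   = x(t) + v(t + dt/2) dt
    return x, v_half


# Usage: unit-mass harmonic oscillator started at x = 1 with v(0) = 0.
dt = 0.01
steps = round(2 * math.pi / dt)
v_start = 0.0 - (-1.0) * dt / 2  # approximate v(-dt/2) from v(0) and a(0) = -1
x_end, _ = leapfrog(1.0, v_start, lambda x: -x, 1.0, dt, steps)
```

After roughly one period the position should return close to its starting value, which is a quick sanity check on the step size.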
16.
Biclustering, which performs simultaneous clustering of rows (e.g., genes) and columns (e.g., conditions), has proved of great value for finding interesting patterns from microarray data. To find biclusters, a model called pCluster was proposed. A pCluster consists of a set of genes and a set of conditions, where the expression levels of these genes have a similar variation under these conditions. Based on this model, most of the previous methods need to compute MDSs (maximum dimension sets) for every two genes in the microarray data. Since the number of genes is far larger than the number of conditions, this step is inefficient. Another method called MicroCluster was proposed. This method does not compute MDSs for every two genes, and transforms the problem into a graph problem. However, it needs to solve the Maximal Clique problem, which is NP-Complete. To avoid the above disadvantages, in this paper, we propose a new method, CE-Tree (Condition-Enumeration Tree), for finding pClusters. Instead of generating MDSs for every two genes, we generate only MDSs for every two conditions. Then, based only on these MDSs, we expand the CE-Tree in a special local breadth-first within global depth-first manner to efficiently find all pClusters. We also utilize the idea of the traditional hash join approach to efficiently support the CE-Tree. From the simulation results, we show that the CE-Tree method could find pClusters more efficiently than those previous methods.
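The "similar variation" condition can be made concrete with a small sketch. The abstract does not state the formula; the code assumes the commonly cited δ-pCluster definition, in which every 2×2 submatrix (two genes x, y and two conditions a, b) must satisfy |(d_xa − d_xb) − (d_ya − d_yb)| ≤ δ. It is a brute-force check for illustration only, not the CE-Tree method, and the function name and toy matrix are hypothetical.

```python
from itertools import combinations


def is_pcluster(matrix, genes, conditions, delta):
    """Return True if the chosen rows and columns form a delta-pCluster.

    Assumes the usual pScore definition over every pair of genes and
    every pair of conditions; matrix[g][c] is the expression level of
    gene g under condition c.
    """
    for x, y in combinations(genes, 2):
        for a, b in combinations(conditions, 2):
            p_score = abs(
                (matrix[x][a] - matrix[x][b]) - (matrix[y][a] - matrix[y][b])
            )
            if p_score > delta:
                return False
    return True


# Toy data: rows 0 and 1 differ only by a constant shift; row 2 does not fit.
expression = [
    [1, 2, 3],
    [11, 12, 13],
    [5, 9, 0],
]
```

Checking every gene pair against every condition pair in this way grows quickly with matrix size, which is the cost that MDS-based methods such as the one described above aim to avoid.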
17.
Real-time polymerase chain reaction-based exponential sample amplification for microarray gene expression profiling (cited by 1: 0 self-citations, 1 by others)
Conventional approaches to target labeling for gene expression analysis using microarray technology typically require relatively large amounts of RNA, a serious limitation when the available sample is limited. Here we describe an alternative exponential sample amplification method that uses quantitative real-time polymerase chain reaction (QRT-PCR) to follow the amplification and eliminate the overamplified cDNA, which could distort the quantitative ratio of the starting mRNA population. Probes from nonamplified, PCR-amplified, and real-time-PCR-amplified cDNA samples were generated from lipopolysaccharide-treated and nontreated mouse macrophages and hybridized to mouse cDNA microarrays. Signals obtained from the three protocols were compared. Reproducibility and reliability of the methods were determined. The Pearson correlation coefficients for replica experiments were r=0.927 and r=0.687 for QRT-PCR-amplification and PCR-overamplification protocols, respectively. A chi-squared test showed that overamplification resulted in major biases in expression ratios, while these alterations could be eliminated by following the cycling status with QRT-PCR. Our exponential sample amplification protocol preserves the original expression ratios and allows unbiased gene expression analysis from minute amounts of starting material.
18.
19.
Information on gene expression in colon tumors versus normal human colon was recently generated by an oligonucleotide microarray study. We used the associated database to search for genes that display age-dependent variations in expression. Statistically significant evidence was obtained that such genes are present in both the tumor and normal tissue databases. Besides the analysis of all genes included in the database, three subsets of genes were analyzed separately: genes controlled by p53, and genes coding for ribosomal proteins and for nuclear-encoded mitochondrial proteins. Among the genes controlled by p53 some show an age-dependent change in expression in tumor tissues, in the sense compatible with an activation of p53 at higher age. A decreased expression of some ribosomal genes at advanced age was detected both in tumor and normal tissues. No significant age-dependent expression could be detected for genes encoding mitochondrial proteins.
20.
It is shown here how gene knock-out experiments can be simulated in Random Boolean Networks (RBN), which are well-known simplified models of genetic networks. The results of the simulations are presented and compared with those of actual experiments in S. cerevisiae. RBN with two incoming links per node have been considered, and the Boolean functions have been chosen at random among the set of so-called canalizing functions. Genes are knocked-out (i.e. silenced) one at a time, and the variations in the expression levels of the other genes, with respect to the unperturbed case, are considered. Two important variables are defined: (i) avalanches, which measure the size of the perturbation generated by knocking out a single gene, and (ii) susceptibilities, which measure how often the expression of a given gene is modified in these experiments. A remarkable observation is that the distributions of avalanches and susceptibilities are very robust, i.e. they are very similar in different random networks; this should be contrasted with the distribution of other variables that show a high variance in RBN. Moreover, the distribution of avalanches and susceptibilities of the RBN models are close to those observed in actual experiments performed with S. cerevisiae, where the changes in gene expression levels have been recorded with DNA microarrays. These findings suggest that these distributions might be "generic" properties, common to a wide range of genetic models and real genetic networks. The importance of such generic properties is discussed.
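The knock-out procedure described above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes synchronous updates, clamps the knocked-out gene to 0 at every step, and counts a gene as part of the avalanche if its state ever differs from the unperturbed run (the paper's exact criterion may differ). For two inputs, the canalizing functions are taken to be all 16 Boolean functions except XOR and XNOR. All function names are hypothetical.

```python
import random


def canalizing_tables():
    """Truth tables of the 14 two-input functions other than XOR and XNOR.

    A table t is indexed as t[2 * a + b] for inputs a and b.
    """
    tables = [tuple((n >> k) & 1 for k in range(4)) for n in range(16)]
    return [t for t in tables if t not in {(0, 1, 1, 0), (1, 0, 0, 1)}]


def make_network(n, rng):
    """Random network: each gene gets two input genes and one truth table."""
    tables = canalizing_tables()
    inputs = [(rng.randrange(n), rng.randrange(n)) for _ in range(n)]
    funcs = [rng.choice(tables) for _ in range(n)]
    return inputs, funcs


def run(inputs, funcs, state, steps, knocked_out=None):
    """Synchronous updates; returns the list of visited states."""
    state = list(state)
    trajectory = []
    for _ in range(steps):
        if knocked_out is not None:
            state[knocked_out] = 0  # silence the gene at every step
        trajectory.append(tuple(state))
        state = [funcs[g][2 * state[a] + state[b]]
                 for g, (a, b) in enumerate(inputs)]
    return trajectory


def avalanche(inputs, funcs, state, steps, gene):
    """Number of other genes whose trajectory changes when `gene` is silenced."""
    wild = run(inputs, funcs, state, steps)
    mutant = run(inputs, funcs, state, steps, knocked_out=gene)
    return sum(
        1 for g in range(len(state))
        if g != gene and any(w[g] != m[g] for w, m in zip(wild, mutant))
    )
```

Repeating `avalanche` over every gene of a network gives an avalanche distribution; counting, for each gene, how many knock-outs change it gives the corresponding susceptibility under the same simplifying assumptions.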