Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
This paper studies the problem of building multiclass classifiers for tissue classification based on gene expression. The recent development of microarray technologies has enabled biologists to quantify the expression of tens of thousands of genes in a single experiment. Biologists have begun collecting gene expression data for a large number of samples. One of the urgent issues in the use of microarray data is to develop methods for characterizing samples based on their gene expression. The most basic step in this research direction is binary sample classification, which has been studied extensively over the past few years. This paper investigates the next step: multiclass classification of samples based on gene expression. The characteristics of expression data (e.g. a large number of genes with a small sample size) make the classification problem more challenging. The process of building multiclass classifiers is divided into two components: (i) selection of the features (i.e. genes) to be used for training and testing and (ii) selection of the classification method. This paper compares various feature selection methods as well as various state-of-the-art classification methods on various multiclass gene expression datasets. Our study indicates that the multiclass classification problem is much more difficult than the binary one for the gene expression datasets. The difficulty lies in the fact that the data are of high dimensionality and that the sample size is small. The classification accuracy appears to degrade very rapidly as the number of classes increases. In particular, the accuracy was very low regardless of the choice of methods for large-class datasets (e.g. NCI60 and GCM). While increasing the number of samples is a plausible solution to the problem of accuracy degradation, it is important to develop algorithms that can effectively analyze multiple-class expression data for these special datasets.

2.

Background  

With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes, e.g., for disease diagnosis. Several widely used gene selection methods select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these gene selection methods is that they may produce gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy.
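The idea of removing redundant genes by looking at gene-to-gene correlation can be illustrated with a small sketch. This is not the method of the paper above, which the abstract does not specify; it is a generic greedy filter under assumed choices: a simple two-class separation score, Pearson correlation as the redundancy measure, and a hypothetical threshold `max_corr`. All function and parameter names are illustrative.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def separation_score(values, labels):
    """Assumed per-gene score: distance between class means over summed spreads."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)


def select_genes(expr, labels, k, max_corr=0.9):
    """Greedily pick up to k high-scoring genes, skipping redundant ones.

    expr maps gene name -> list of expression values (one per sample);
    labels holds 0/1 class labels in the same sample order.
    """
    ranked = sorted(expr, key=lambda g: separation_score(expr[g], labels),
                    reverse=True)
    chosen = []
    for gene in ranked:
        if len(chosen) >= k:
            break
        # Keep the gene only if it is not strongly correlated with any
        # gene that has already been selected.
        if all(abs(pearson(expr[gene], expr[c])) <= max_corr for c in chosen):
            chosen.append(gene)
    return chosen
```

A plain top-k ranking would pick two near-copies of the same gene if both score well; the correlation check lets a lower-ranked but complementary gene take the second slot instead.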

3.
4.
Microarrays are a new technology that allows biologists to better understand the interactions between diverse pathologic states at the gene level. However, the amount of data generated by these tools becomes problematic, even though the data are supposed to be analyzed automatically (e.g., for diagnostic purposes). The issue becomes more complex when the expression data involve multiple states. We present a novel approach to the gene selection problem in multi-class gene expression-based cancer classification, which combines support vector machines and genetic algorithms. This new method is able to select small subsets and still improve the classification accuracy.

5.
Multi-gene phylogenetic analyses were conducted to address the evolution of Clavicipitaceae (Ascomycota). Data are presented here for approximately 5900 base pairs from portions of seven loci: the nuclear ribosomal small and large subunit DNA (nrSSU and nrLSU), beta-tubulin, elongation factor 1alpha (EF-1alpha), the largest and second largest subunits of RNA polymerase II (RPB1 and RPB2), and mitochondrial ATP Synthase subunit 6 (mtATP6). These data were analyzed in a complete 66-taxon matrix and 91-taxon supermatrix that included some missing data. Separate phylogenetic analyses, with data partitioned according to genes, produced some conflicting results. The results of separate analyses from RPB1 and RPB2 are in agreement with the combined analyses that resolve a paraphyletic Clavicipitaceae comprising three well-supported clades (i.e., Clavicipitaceae clade A, B, and C), whereas the tree obtained from mtATP6 is in strong conflict with the monophyly of Clavicipitaceae clade B and the sister-group relationship of Hypocreaceae and Clavicipitaceae clade C. The distribution of relative contribution of nodal support for each gene partition was assessed using both partitioned Bremer support (PBS) values and combinational bootstrap (CB) analyses, the latter of which analyzed bootstrap proportions from all possible combinations of the seven gene partitions. These results suggest that CB analyses provide a more consistent estimate of nodal support than PBS and that combining heterogeneous gene partitions, which individually support a limited number of nodes, results in increased support for overall tree topology. Analyses of the 91-taxa supermatrix data sets revealed that some nodes were more strongly supported by increased taxon sampling. Identifying the localized incongruence of mtATP6 and analyses of complete and supermatrix data sets strengthen the evidence for rejecting the monophyly of Clavicipitaceae and much of the current subfamilial classification of the family. 
Although the monophyly of the grass-associated subfamily Clavicipitoideae (e.g., Claviceps, Balansia, and Epichloë) is strongly supported, the subfamily Cordycipitoideae (e.g., Cordyceps and Torrubiella) is not monophyletic. In particular, species of the genus Cordyceps, which are pathogens of arthropods and truffles, are found in all three clavicipitaceous clades. These results imply that most characters used in the current familial classification of Clavicipitaceae are not diagnostic of monophyly.

6.
A wide range of research areas in molecular biology and medical biochemistry require a reliable enzyme classification system, e.g., drug design, metabolic network reconstruction and systems biology. When research scientists in the above-mentioned areas wish to refer unambiguously to an enzyme and its function, they use the EC number introduced by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). However, each of these applications depends critically on the consistency and reliability of the underlying data. We have developed tools for the validation of the EC number classification scheme. In this paper, we present validated data of 3788 enzymatic reactions including 229 sub-subclasses of the EC classification system. Over 80% agreement was found between our assignment and the EC classification. For 61 (i.e., only 2.5%) reactions we found that their assignment was inconsistent with the rules of the nomenclature committee; they have to be transferred to other sub-subclasses. We demonstrate that our validation results can be used to initiate corrections and improvements to the EC number classification scheme.

7.
Diplopods (millipedes) are known for their irregular body segmentation. Most importantly, the number of dorsal segmental cuticular plates (tergites) does not match the number of ventral structures (e.g., sternites). Controversial theories exist to explain the origin of this so-called diplosegmentation. We have studied the embryology of a representative diplopod, Glomeris marginata, and have analyzed the segmentation genes engrailed (en), hedgehog (hh), cubitus-interruptus (ci), and wingless (wg). We show that dorsal segments can be distinguished from ventral segments. They differ not only in number and developmental history, but also in gene expression patterns. engrailed, hedgehog, and cubitus-interruptus are expressed in both ventral and dorsal segments, but at different intrasegmental locations, whereas wingless is expressed only in the ventral segments, but not in the dorsal segments. Ventrally, the patterns are similar to what has been described from Drosophila and other arthropods, consistent with a conserved role of these genes in establishing parasegment boundaries. On the dorsal side, however, the gene expression patterns are different and inconsistent with a role in boundary formation between segments, but they suggest that these genes might function to establish the tergite borders. Our data suggest a profound and rather complete decoupling of dorsal and ventral segmentation leading to the dorsoventral discrepancies in the number of segmental elements. Based on gene expression, we propose a model that may resolve the hitherto controversial issue of the correlation between dorsal tergites and ventral leg pairs in basal diplopods (e.g., Glomeris) and is suggestive also for derived, ring-forming diplopods (e.g., Juliformia).

8.

9.
Stability of antibiotics under growth conditions for thermophilic anaerobes   Total citations: 1, self-citations: 0, citations by others: 1
It was shown that the inhibitory effect of kanamycin and streptomycin in a growing culture of Clostridium thermohydrosulfuricum JW 102 is of limited duration. To screen a large number of antibiotics, their stability during incubation under the growth conditions of thermophilic clostridia was determined at 72 and 50 °C by using a 0.2% yeast extract-amended prereduced mineral medium with a pH of 7.3 or 5.0. Half-lives were determined in a modified MIC test with Escherichia coli, Staphylococcus aureus, and Bacillus megaterium as indicator strains. All compounds tested were either similarly stable at the two temperatures or more stable at 50 than at 72 °C. The half-life (t1/2) at pH 7.3 and 72 °C ranged from 3.3 h (k = 7.26 day⁻¹, where k [degradation constant] = 1/t1/2) for ampicillin to no detectable loss of activity for kanamycin, neomycin, and other antibiotics. Apparently some compounds (e.g., lasalocid and neomycin) became more potent during incubation (k greater than 0). A change to pH 5.0 caused some compounds to become more labile (e.g., kanamycin) and others (e.g., streptomycin) to become more stable than at pH 7.3.
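The relation between half-life and degradation constant quoted in this abstract, k = 1/t1/2 with t1/2 expressed in days, can be checked with a few lines of arithmetic. The sketch below uses that definition as stated; the function names are illustrative, and the remaining-activity helper assumes ordinary half-life decay, which the abstract does not spell out.

```python
def degradation_constant_per_day(half_life_hours):
    """k = 1 / t_half, with the half-life converted from hours to days."""
    half_life_days = half_life_hours / 24.0
    return 1.0 / half_life_days


def remaining_fraction(elapsed_hours, half_life_hours):
    """Fraction of activity left after elapsed_hours (assumed half-life decay)."""
    return 0.5 ** (elapsed_hours / half_life_hours)


# Ampicillin at pH 7.3 and 72 degrees C: half-life of 3.3 h.
k_ampicillin = degradation_constant_per_day(3.3)
```

For a half-life of 3.3 h this gives k of about 7.27 per day, which matches the reported 7.26 per day up to rounding of the half-life.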

10.
Tissue classification with gene expression profiles.   Total citations: 29, self-citations: 0, citations by others: 29
Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer-related cellular processes. Gene expression data is also expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor(s) and normal clinical samples: The first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al., 1999). The second consists of approximately 100,000 clones, measured in 32 ovarian samples (unpublished extension of the data set described in Schummer et al. (1999)). The third set consists of approximately 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al., 1999). We examine the use of scoring methods, measuring separation of tissue type (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. We present results of performing leave-one-out cross validation (LOOCV) experiments on the three data sets, employing a nearest neighbor classifier, SVM (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and a novel clustering-based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias. We demonstrate a success rate of at least 90% in tumor versus normal classification, using sets of selected genes, with, as well as without, cellular-contamination-related members. These results are insensitive to the exact selection mechanism, over a certain range.
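Leave-one-out cross validation with a nearest neighbor classifier, one of the combinations named above, can be sketched in a few lines. This is a generic illustration, not the authors' pipeline: it assumes Euclidean distance, a single nearest neighbor, and no gene selection step, and all names are hypothetical.

```python
import math


def nearest_neighbor_label(train, query):
    """Return the label of the training sample closest to query (Euclidean)."""
    closest = min(train, key=lambda item: math.dist(item[0], query))
    return closest[1]


def loocv_accuracy(samples):
    """Leave-one-out cross validation for a 1-nearest-neighbor classifier.

    samples is a list of (feature_vector, label) pairs. Each sample is held
    out in turn, classified from the remaining ones, and scored.
    """
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(rest, features) == label:
            correct += 1
    return correct / len(samples)
```

With only tens of samples, as in the data sets described here, LOOCV uses almost all of the data for training in every round. Any gene selection would have to be repeated inside each round to keep the estimate unbiased.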

11.
12.
DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important prestep in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The selected gene sets from the proposed procedure appear to perform better than or comparably to several results reported in the literature using univariate analysis without performing a multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and a gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using internal and external cross-validation. Using the SVM classification, the overall accuracies for assigning the 55 samples to one of the nine treatments are above 80% for internal cross-validation. OmegaF has slightly higher accuracy rates than omegaT. The overall predicted accuracies are above 70% for the external cross-validation; the two gene sets omegaT and omegaF performed equally well.
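The abstract does not say which FWE-controlling procedure is used. As an illustration only, the sketch below shows two standard procedures that bound the probability of one or more false positives at level alpha: the Bonferroni correction and Holm's step-down method. The per-gene p-values are assumed to come from some univariate test (for example a t-test or F-test); the function names are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Keep genes whose p-value is at most alpha / m (m = number of genes)."""
    m = len(p_values)
    return [gene for gene, p in p_values.items() if p <= alpha / m]


def holm_select(p_values, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ...

    Stops at the first p-value that fails its threshold; every gene before
    that point is selected.
    """
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    selected = []
    for rank, (gene, p) in enumerate(ordered):
        if p > alpha / (m - rank):
            break
        selected.append(gene)
    return selected
```

Both procedures control the family-wise error rate. Holm's method never selects fewer genes than Bonferroni, which matters when thousands of genes are tested on a few dozen samples.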

13.
In clinical practice, many diseases such as glioblastoma, leukemia, diabetes, and prostate cancer have multiple subtypes. Classifying subtypes accurately using genomic data will help provide individualized treatments that target specific disease subtypes. However, it is often difficult to obtain satisfactory classification accuracy using only one type of data, because the subtypes of a disease can exhibit similar patterns in one data type. Fortunately, multiple types of genomic data are often available due to the rapid development of genomic techniques. This raises the question of whether classification performance can be significantly improved by combining multiple types of genomic data. In this article, we classified four subtypes of glioblastoma multiforme (GBM) with multiple types of genome-wide data (e.g., mRNA and miRNA expression) from The Cancer Genome Atlas (TCGA) project. We proposed a multi-class compressed sensing-based detector (MCSD) for this study. The MCSD was trained with data from TCGA and then applied to subtype GBM patients using an independent test data set. We performed the classification on the same patient subjects with three data types, i.e., miRNA expression data, mRNA (or gene expression) data, and their combination. The classification accuracy is 69.1% with the miRNA expression data, 52.7% with the mRNA expression data, and 90.9% with the combination of both mRNA and miRNA expression data. In addition, some biomarkers identified by the integrated approaches have been confirmed by results from the published literature. These results indicate that the combined analysis can significantly improve the accuracy of classifying GBM subtypes and identify potential biomarkers for disease diagnosis.

14.
15.
Hatchery broodstocks used for genetic conservation or aquaculture may represent their ancestral gene pools rather poorly. This is especially likely when the fish that found a broodstock are close relatives of each other. We re-analysed microsatellite data from a breeding experiment on red sea bream to demonstrate how lost genetic variation might be recovered when gene frequencies have been distorted by consanguineous founders in a hatchery. A minimal-kinship criterion based on a relatedness estimator was used to select subsets of breeders which represented the maximum number of founder lineages (i.e., carried the fewest identical copies of ancestral genes). UPGMA clustering of Nei's genetic distances grouped these selected subsets with the parental gene pool, rather than with the entire, highly drifted offspring generation. The selected subsets also captured much of the expected heterozygosity and allelic diversity of the parental gene pool. Independent pedigree data on the same fish showed that the selected subsets had more contributing parents and more founder equivalents than random subsets of the same size. The estimated mean coancestry was lower in the selected subsets, meaning that inbreeding in subsequent generations would be lower if they were used as breeders. The procedure appears suitable for reducing the genetic distortion due to consanguineous and over-represented founders of a hatchery gene pool.

16.
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
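Feature hashing of k-grams, as described in this abstract, can be sketched as follows. The choice of hash function (MD5 from Python's standard `hashlib`, used here only because it is deterministic across runs), the bucket count, and all function names are assumptions made for illustration; the abstract does not state which hash function the authors used.

```python
import hashlib


def kgrams(sequence, k):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def hashed_features(sequence, k, n_buckets):
    """Map k-gram counts into a fixed-length vector of n_buckets entries.

    Different k-grams may collide in the same bucket; their counts are then
    simply added together.
    """
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1
    return vector
```

For the 20-letter amino acid alphabet, a bag of k-grams needs up to 20^k dimensions, whereas the hashed vector has a length fixed in advance regardless of k. The price is that colliding k-grams can no longer be told apart.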

17.
Copy number variants (CNVs) play an important role in the etiology of many diseases such as cancers and psychiatric disorders. Due to a modest marginal effect size or the rarity of the CNVs, collapsing rare CNVs together and collectively evaluating their effect serves as a key approach to evaluating the collective effect of rare CNVs on disease risk. While a plethora of powerful collapsing methods are available for sequence variants (e.g., SNPs) in association analysis, these methods cannot be directly applied to rare CNVs due to the CNV-specific challenges, i.e., the multi-faceted nature of CNV polymorphisms (e.g., CNVs vary in size, type, dosage, and details of gene disruption), and etiological heterogeneity (e.g., heterogeneous effects of duplications and deletions that occur within a locus or in different loci). Existing CNV collapsing analysis methods (a.k.a. the burden test) tend to have suboptimal performance due to the fact that these methods often ignore heterogeneity and evaluate only the marginal effects of a CNV feature. We introduce CCRET, a random effects test for collapsing rare CNVs when searching for disease associations. CCRET is applicable to variants measured on a multi-categorical scale, collectively modeling the effects of multiple CNV features, and is robust to etiological heterogeneity. Multiple confounders can be simultaneously corrected. To evaluate the performance of CCRET, we conducted extensive simulations and analyzed large-scale schizophrenia datasets. We show that CCRET has powerful and robust performance under multiple types of etiological heterogeneity, and has performance comparable to or better than existing methods when there is no heterogeneity.

18.
Ecological Monographs, 2011, 81(4): 635-663
Ecology is inherently multivariate, but high-dimensional data are difficult to understand. Dimension reduction with ordination analysis helps with both data exploration and clarification of the meaning of inferences (e.g., randomization tests, variation partitioning) about a statistical population. Most such inferences are asymmetric, in that variables are classified as either response or explanatory (e.g., factors, predictors). But this asymmetric approach has limitations (e.g., abiotic variables may not entirely explain correlations between interacting species). We study symmetric population-level inferences by modeling correlations and co-occurrences, using these models for out-of-sample prediction. Such modeling requires a novel treatment of ordination axes as random effects, because fixed effects only allow within-sample predictions. We advocate an iterative methodology for random-effects ordination: (1) fit a set of candidate models differing in complexity (e.g., number of axes); (2) use information criteria to choose among models; (3) compare model predictions with data; (4) explore dimension-reduced graphs (e.g., biplots); (5) repeat 1–4 if model performance is poor. We describe and illustrate random-effects ordination models (with software) for two types of data: multivariate-normal (e.g., log morphometric data) and presence–absence community data. A large simulation experiment with multivariate-normal data demonstrates good performance of (1) a small-sample-corrected information criterion and (2) factor analysis relative to principal component analysis. Predictive comparison of multiple alternative models is a powerful form of scientific reasoning: we have shown that unconstrained ordination can be based on such reasoning.

19.
Gene family evolution is determined by microevolutionary processes (e.g., point mutations) and macroevolutionary processes (e.g., gene duplication and loss), yet macroevolutionary considerations are rarely incorporated into gene phylogeny reconstruction methods. We present a dynamic program to find the most parsimonious gene family tree with respect to a macroevolutionary optimization criterion, the weighted sum of the number of gene duplications and losses. The existence of a polynomial delay algorithm for duplication/loss phylogeny reconstruction stands in contrast to most formulations of phylogeny reconstruction, which are NP-complete. We next extend this result to obtain a two-phase method for gene tree reconstruction that takes both micro- and macroevolution into account. In the first phase, a gene tree is constructed from sequence data, using any of the previously known algorithms for gene phylogeny construction. In the second phase, the tree is refined by rearranging regions of the tree that do not have strong support in the sequence data to minimize the duplication/loss cost. Components of the tree with strong support are left intact. This hybrid approach incorporates both micro- and macroevolutionary considerations, yet its computational requirements are modest in practice because the two-phase approach constrains the search space. Our hybrid algorithm can also be used to resolve nonbinary nodes in a multifurcating gene tree. We have implemented these algorithms in a software tool, NOTUNG 2.0, that can be used as a unified framework for gene tree reconstruction or as an exploratory analysis tool that can be applied post hoc to any rooted tree with bootstrap values. The NOTUNG 2.0 graphical user interface can be used to visualize alternate duplication/loss histories, root trees according to duplication and loss parsimony, manipulate and annotate gene trees, and estimate gene duplication times.
It also offers a command line option that enables high-throughput analysis of a large number of trees.

20.