共查询到20条相似文献,搜索用时 0 毫秒
1.
Background
A key component in protein structure prediction is a scoring or discriminatory function that can distinguish near-native conformations from misfolded ones. Various types of scoring functions have been developed to accomplish this goal, but their performance is not adequate to solve the structure selection problem. In addition, there is poor correlation between the scores and the accuracy of the generated conformations. 相似文献2.
In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifier and to provide more insights into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classification techniques--which belong to the family of embedded feature selection methods--for bioinformatics studies with high-dimensional input. Classification objective functions, penalty functions and computational algorithms are discussed. Our goal is to make interested researchers aware of these feature selection and classification methods that are applicable to high-dimensional bioinformatics data. 相似文献
3.
Monte Carlo feature selection for supervised classification 总被引:4,自引:0,他引:4
Draminski M Rada-Iglesias A Enroth S Wadelius C Koronacki J Komorowski J 《Bioinformatics (Oxford, England)》2008,24(1):110-117
MOTIVATION: Pre-selection of informative features for supervised classification is a crucial, albeit delicate, task. It is desirable that feature selection provides the features that contribute most to the classification task per se and which should therefore be used by any classifier later used to produce classification rules. In this article, a conceptually simple but computer-intensive approach to this task is proposed. The reliability of the approach rests on multiple construction of a tree classifier for many training sets randomly chosen from the original sample set, where samples in each training set consist of only a fraction of all of the observed features. RESULTS: The resulting ranking of features may then be used to advantage for classification via a classifier of any type. The approach was validated using Golub et al. leukemia data and the Alizadeh et al. lymphoma data. Not surprisingly, we obtained a significantly different list of genes. Biological interpretation of the genes selected by our method showed that several of them are involved in precursors to different types of leukemia and lymphoma rather than being genes that are common to several forms of cancers, which is the case for the other methods. AVAILABILITY: Prototype available upon request. 相似文献
4.
A major challenge for genomewide disease association studies is the high cost of genotyping large number of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods. Supplementary website: http://htsnp.stanford.edu/FSFS/. 相似文献
5.
Agnieszka S Juncker Lars J Jensen Andrea Pierleoni Andreas Bernsel Michael L Tress Peer Bork Gunnar von Heijne Alfonso Valencia Christos A Ouzounis Rita Casadio Søren Brunak 《Genome biology》2009,10(2):1-6
A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome. 相似文献
6.
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high dimensional input spaces, for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by hashing the features into a low-dimensional space, using a hash function, i.e., by mapping features into hash keys, where multiple features can be mapped (at random) to the same hash key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks. 相似文献
7.
In order to classify the real/pseudo human precursor microRNA (pre-miRNAs) hairpins with ab initio methods, numerous features are extracted from the primary sequence and second structure of pre-miRNAs. However, they include some redundant and useless features. It is essential to select the most representative feature subset; this contributes to improving the classification accuracy. We propose a novel feature selection method based on a genetic algorithm, according to the characteristics of human pre-miRNAs. The information gain of a feature, the feature conservation relative to stem parts of pre-miRNA, and the redundancy among features are all considered. Feature conservation was introduced for the first time. Experimental results were validated by cross-validation using datasets composed of human real/pseudo pre-miRNAs. Compared with microPred, our classifier miPredGA, achieved more reliable sensitivity and specificity. The accuracy was improved nearly 12%. The feature selection algorithm is useful for constructing more efficient classifiers for identification of real human pre-miRNAs from pseudo hairpins. 相似文献
8.
Methods for extracting useful information from the datasets produced by microarray experiments are at present of much interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods are based on evaluating genes in pairs and evaluating how well a pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared the performance relative to two standard methods. To assess the ability to generalize class differences, we studied how well the gene sets we select are suited for learning a classifier. We show that the gene sets selected by our methods outperform the standard methods, in some cases by a large margin, in terms of cross-validation prediction accuracy of the learned classifier. We show that on two public datasets, accurate diagnoses can be made using only 15-30 genes. Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes. When looking for differential expression between experiment classes, it may not be sufficient to look at each gene in a separate universe. Evaluating combinations of genes reveals interesting information that will not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information. 相似文献
9.
Humberto Pérez-Espinosa Carlos A. Reyes-García Luis Villaseñor-Pineda 《Biomedical signal processing and control》2012,7(1):79-87
In this paper we report the results obtained from experiments with a database of emotional speech in English in order to find the most important acoustic features to estimate Emotion Primitives which determine the emotional content on speech. We are interested in exploiting the potential benefits of continuous emotion models, so in this paper we demonstrate the feasibility of applying this approach to annotation of emotional speech and we explore ways to take advantage of this kind of annotation to improve the automatic classification of basic emotions. 相似文献
10.
Metabolome analysis by flow injection electrospray mass spectrometry (FIE-MS) fingerprinting generates measurements relating to large numbers of m/z signals. Such data sets often exhibit high variance with a paucity of replicates, thus providing a challenge for data mining. We describe data preprocessing and modeling methods that have proved reliable in projects involving samples from a range of organisms. The protocols interact with software resources specifically for metabolomics provided in a Web-accessible data analysis package FIEmspro (http://users.aber.ac.uk/jhd) written in the R environment and requiring a moderate knowledge of R command-line usage. Specific emphasis is placed on describing the outcome of modeling experiments using FIE-MS data that require further preprocessing to improve quality. The salient features of both poor and robust (i.e., highly generalizable) multivariate models are outlined together with advice on validating classifiers and avoiding false discovery when seeking explanatory variables. 相似文献
11.
Background
Using DNA microarrays, we have developed two novel models for tumor classification and target gene prediction. First, gene expression profiles are summarized by optimally selected Self-Organizing Maps (SOMs), followed by tumor sample classification by Fuzzy C-means clustering. Then, the prediction of marker genes is accomplished by either manual feature selection (visualizing the weighted/mean SOM component plane) or automatic feature selection (by pair-wise Fisher's linear discriminant). 相似文献12.
Isolating the core functional elements of an RNA is normally performed during the characterization of a new RNA in order to simplify further biochemical analysis. The removal of extraneous sequence is challenging and can lead to biases that result from the incomplete sampling of deletion variants. An impartial solution to this problem is to construct a library containing a large number of deletion constructs and to select functional RNA isolates that are at least as efficient as their full-length progenitors. Here, we use nonhomologous recombination and selection to isolate the catalytic core of a pyrimidine nucleotide synthase ribozyme. A variable-length pool of approximately 10(8) recombinant molecules that included deletions, inversions, and translocations of a 271-nucleotide-long ribozyme isolate was constructed by digesting and randomly religating its DNA genome. In vitro selection for functional ribozymes was then performed in a size-dependent and a size-independent manner. The final pools had nearly equivalent catalytic rates even though their length distributions were completely different, indicating that a diverse range of deletion constructs were functionally active. Four short sequence islands, requiring as little as 81 nt of sequence, were found within all of the truncated ribozymes and could be folded into a secondary structure consisting of three helix-loops. Our findings suggest that nonhomologous recombination is a highly efficient way to isolate a ribozyme's core motif and could prove to be a useful method for evolving new ribozyme functions from pre-existing sequences in a manner that may have played an important role early in evolution. 相似文献
13.
The thermostability of proteins is particularly relevant for enzyme engineering. Developing a computational method to identify mesophilic proteins would be helpful for protein engineering and design. In this work, we developed support vector machine based method to predict thermophilic proteins using the information of amino acid distribution and selected amino acid pairs. A reliable benchmark dataset including 915 thermophilic proteins and 793 non-thermophilic proteins was constructed for training and testing the proposed models. Results showed that 93.8% thermophilic proteins and 92.7% non-thermophilic proteins could be correctly predicted by using jackknife cross-validation. High predictive successful rate exhibits that this model can be applied for designing stable proteins. 相似文献
14.
Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data 总被引:3,自引:0,他引:3
Xuegong Zhang Xin Lu Qian Shi Xiu-qin Xu Hon-chiu E Leung Lyndsay N Harris James D Iglehart Alexander Miron Jun S Liu Wing H Wong 《BMC bioinformatics》2006,7(1):197-13
Background
Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. 相似文献15.
A speech act is a linguistic action intended by a speaker. Speech act classification is an essential part of a dialogue understanding system because the speech act of an utterance is closely tied with the user's intention in the utterance. We propose a neural network model for Korean speech act classification. In addition, we propose a method that extracts morphological features from surface utterances and selects effective ones among the morphological features. Using the feature selection method, the proposed neural network can partially increase precision and decrease training time. In the experiment, the proposed neural network showed better results than other models using comparatively high-level linguistic features. Based on the experimental result, we believe that the proposed neural network model is suitable for real field applications because it is easy to expand the neural network model into other domains. Moreover, we found that neural networks can be useful in speech act classification if we can convert surface sentences into vectors with fixed dimensions by using an effective feature selection method. 相似文献
16.
17.
A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression 总被引:1,自引:0,他引:1
This paper studies the problem of building multiclass classifiers for tissue classification based on gene expression. The recent development of microarray technologies has enabled biologists to quantify gene expression of tens of thousands of genes in a single experiment. Biologists have begun collecting gene expression for a large number of samples. One of the urgent issues in the use of microarray data is to develop methods for characterizing samples based on their gene expression. The most basic step in the research direction is binary sample classification, which has been studied extensively over the past few years. This paper investigates the next step-multiclass classification of samples based on gene expression. The characteristics of expression data (e.g. large number of genes with small sample size) makes the classification problem more challenging. The process of building multiclass classifiers is divided into two components: (i) selection of the features (i.e. genes) to be used for training and testing and (ii) selection of the classification method. This paper compares various feature selection methods as well as various state-of-the-art classification methods on various multiclass gene expression datasets. Our study indicates that multiclass classification problem is much more difficult than the binary one for the gene expression datasets. The difficulty lies in the fact that the data are of high dimensionality and that the sample size is small. The classification accuracy appears to degrade very rapidly as the number of classes increases. In particular, the accuracy was very low regardless of the choices of the methods for large-class datasets (e.g. NCI60 and GCM). While increasing the number of samples is a plausible solution to the problem of accuracy degradation, it is important to develop algorithms that are able to analyze effectively multiple-class expression data for these special datasets. 相似文献
18.
19.
Microarray data classification is one of the most important emerging clinical applications in the medical community. Machine learning algorithms are most frequently used to complete this task. We selected one of the state-of-the-art kernel-based algorithms, the support vector machine (SVM), to classify microarray data. As a large number of kernels are available, a significant research question is what is the best kernel for patient diagnosis based on microarray data classification using SVM? We first suggest three solutions based on data visualization and quantitative measures. Different types of microarray problems then test the proposed solutions. Finally, we found that the rule-based approach is most useful for automatic kernel selection for SVM to classify microarray data. 相似文献
20.
Liqi Li Sanjiu Yu Weidong Xiao Yongsheng Li Lan Huang Xiaoqi Zheng Shiwen Zhou Hua Yang 《BMC bioinformatics》2014,15(1)