Similar literature
 Found 20 similar articles
1.
2.
We introduce a new method of protein secondary structure prediction based on the theory of support vector machines (SVMs). The SVM is a new approach to supervised pattern classification that has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification and gene function prediction from microarray expression profiles. In these cases, the performance of SVMs either matches or is significantly better than that of traditional machine learning approaches, including neural networks. The first use of the SVM approach to predict protein secondary structure is described here. Unlike previous studies, we first constructed several binary classifiers, then assembled a tertiary (three-state) classifier for the secondary structure states (helix, sheet and coil) from these binary classifiers. The SVM method achieved a segment overlap accuracy of SOV = 76.2% through sevenfold cross-validation on a database of 513 non-homologous protein chains with multiple sequence alignments, which outperforms existing methods. Meanwhile, the three-state overall per-residue accuracy Q3 reached 73.5%, which is at least comparable to existing single prediction methods. Furthermore, a useful "reliability index" for the predictions was developed. In addition, SVMs have many attractive features, including effective avoidance of overfitting, the ability to handle large feature spaces and information condensing of the given data set. The SVM method can be conveniently applied to many other pattern classification tasks in biology.
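
The step from binary classifiers to a three-state prediction, and the Q3 score, can be sketched as follows. The abstract does not say how the binary classifiers were combined, so the "largest score wins" rule below is an assumption, and the function names and scores are hypothetical.

```python
def assemble_three_state(scores_h, scores_e, scores_c):
    """Combine three one-vs-rest score lists into one state string.

    Assumption: each list holds one decision score per residue, and the
    state with the largest score is chosen (H = helix, E = sheet, C = coil).
    """
    states = "HEC"
    prediction = []
    for triple in zip(scores_h, scores_e, scores_c):
        best = max(range(3), key=lambda i: triple[i])
        prediction.append(states[best])
    return "".join(prediction)


def q3(predicted, observed):
    """Three-state per-residue accuracy, in percent."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical scores for a four-residue fragment.
predicted = assemble_three_state(
    [0.9, -0.2, -1.0, 0.1],
    [-0.5, 0.7, -0.3, 0.0],
    [-0.1, 0.1, 0.4, 0.3],
)
print(predicted, q3(predicted, "HECH"))
```

Q3 counts residues only; the SOV measure quoted in the abstract additionally rewards correctly placed segments and is not reproduced here.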

3.
4.
The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data calls for the same caution as the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only do potential features easily outnumber samples by a factor of 10^3, but it is also easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general-purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers (e.g. support vector machine, SVM) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing the stability and predictive value of the resulting list of biomarkers is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies.
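
The selection-bias point can be illustrated with a minimal sketch in which feature ranking is repeated inside every cross-validation fold, so that held-out samples never influence which features are kept. This is not the DAP itself: the mean-difference ranking and the nearest-centroid classifier are simple stand-ins chosen for brevity, and all names are hypothetical.

```python
import random


def rank_features(rows, labels, top_k):
    """Return the indices of the top_k features by |mean(class 1) - mean(class 0)|."""
    def class_mean(j, cls):
        values = [r[j] for r, y in zip(rows, labels) if y == cls]
        return sum(values) / len(values)

    n_features = len(rows[0])
    scores = [abs(class_mean(j, 1) - class_mean(j, 0)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:top_k]


def nearest_centroid_predict(train_rows, train_labels, features, row):
    """Predict the class whose centroid (on the selected features) is closest."""
    best_cls, best_dist = None, float("inf")
    for cls in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == cls]
        centroid = [sum(m[j] for m in members) / len(members) for j in features]
        dist = sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_cls, best_dist = cls, dist
    return best_cls


def cross_validated_accuracy(rows, labels, k_folds=5, top_k=5, seed=0):
    """k-fold accuracy with feature ranking redone on each training fold."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k_folds] for i in range(k_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Ranking sees training samples only; ranking on all samples
        # before splitting is what introduces selection bias.
        features = rank_features(train_rows, train_labels, top_k)
        for i in test_idx:
            predicted = nearest_centroid_predict(train_rows, train_labels, features, rows[i])
            correct += predicted == labels[i]
    return correct / len(rows)
```

The sketch assumes binary labels 0 and 1 and that every training fold contains both classes.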

5.
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families, as shown by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.
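
The residual-error figures follow directly from the quoted accuracies, and the n-gram features are plain substring counts. A short sketch of both, with hypothetical function names and an invented example peptide:

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count every overlapping substring of length n in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def residual_error_reduction(baseline_accuracy, new_accuracy):
    """Percentage of the baseline's error removed by the new classifier.

    Both accuracies are given in percent.
    """
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (baseline_error - new_error) / baseline_error


print(ngram_counts("MKMKA", 2))
# Level I: SVM 88.4% vs Naive Bayes 93.0%; level II: 86.3% vs 92.4%.
print(round(residual_error_reduction(88.4, 93.0), 1))
print(round(residual_error_reduction(86.3, 92.4), 1))
```

For level I the error falls from 11.6 to 7.0 percentage points, and 4.6 / 11.6 is about 39.7%; for level II it falls from 13.7 to 7.6, and 6.1 / 13.7 is about 44.5%, matching the abstract.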

6.
7.
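Diagnosis of hepatocellular carcinoma from 31P magnetic resonance spectroscopy based on support vector machines   Total citations: 1, self-citations: 1, citations by others: 0
The support vector machine is a new machine learning method developed on the basis of statistical learning theory and is widely used in pattern recognition. Support vector machine models were applied to 31P magnetic resonance spectroscopy data to classify liver tissue, distinguishing hepatocellular carcinoma, cirrhosis and normal liver tissue. Support vector machine classifiers based on polynomial and radial basis function kernels were compared, and recognition rates were obtained for the three liver classes. The experiments show that a support vector machine classification model based on 31P magnetic resonance spectroscopy data can make diagnostic predictions for the liver in vivo.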

8.
Vocal individuality has been documented in a variety of mammalian species and it has been proposed that this individuality can be used as a vocal fingerprint to monitor individuals. Here we provide and test a classification method using Mel-frequency cepstral coefficients (MFCCs) to extract features from Bornean gibbon female calls. Our method is semi-automated as it requires manual pre-processing to identify and extract calls from the original recordings. We compared two methods of MFCC feature extraction: (1) averaging across all time windows and (2) creating a standardized number of time windows for each call. We analysed 376 calls from 33 gibbon females and, using linear discriminant analysis, found that we were able to improve female identification accuracy from 95.7% with spectrogram features to 98.4% accuracy when averaging MFCCs across time windows, and 98.9% accuracy when using a standardized number of windows. We divided our data randomly into training and test data-sets, and tested the accuracy of support vector machine (SVM) predictions over 100 iterations. We found that we could predict female identity in the test data-set with a 98.8% accuracy. Using SVM on our entire data-set, we were able to predict female identity with 99.5% accuracy (validated by leave-one-out cross-validation). Lastly, we used the method presented here to classify four females recorded during three or more recording seasons using SVM with limited success. We provide evidence that MFCC feature extraction is effective for distinguishing between female Bornean gibbons, and make suggestions for future vocal fingerprinting applications.
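
Both feature-extraction variants have to turn calls of different lengths into vectors of equal length. The sketch below assumes the MFCCs are already computed as one row of coefficients per time window. The first function averages over all windows. The second is a stand-in for the standardized-window approach: it pools existing windows into a fixed number of segments, which is an assumption and may differ from how the authors standardized their windows.

```python
def average_over_windows(frames):
    """Average each coefficient over all time windows of one call."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def fixed_number_of_windows(frames, n_segments):
    """Pool the windows into n_segments contiguous groups and concatenate
    the per-group averages, giving n_segments * n_coefficients features."""
    if len(frames) < n_segments:
        raise ValueError("call has fewer windows than requested segments")
    features = []
    for s in range(n_segments):
        start = s * len(frames) // n_segments
        stop = (s + 1) * len(frames) // n_segments
        features.extend(average_over_windows(frames[start:stop]))
    return features


# Hypothetical call: four time windows, two coefficients each.
call = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(average_over_windows(call))
print(fixed_number_of_windows(call, 2))
```

Averaging discards the time course of the call, whereas the segmented version keeps a coarse temporal profile at the cost of a longer feature vector.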

9.
The identification of catalytic residues is an essential step in functional characterization of enzymes. We present a purely structural approach to this problem, which is motivated by the difficulty of evolution-based methods to annotate structural genomics targets that have few or no homologs in the databases. Our approach combines a state-of-the-art support vector machine (SVM) classifier with novel structural features that augment structural clues by spatial averaging and Z scoring. Special attention is paid to the class imbalance problem that stems from the overwhelming number of non-catalytic residues in enzymes compared to catalytic residues. This problem is tackled by: (1) optimizing the classifier to maximize a performance criterion that considers both Type I and Type II errors in the classification of catalytic and non-catalytic residues; (2) under-sampling non-catalytic residues before SVM training; and (3) during SVM training, penalizing errors in learning catalytic residues more than errors in learning non-catalytic residues. Tested on four enzyme datasets, one specifically designed by us to mimic the structural genomics scenario and three previously evaluated datasets, our structure-based classifier is never inferior to similar structure-based classifiers and comparable to classifiers that use both structural and evolutionary features. In addition to the evaluation of the performance of catalytic residue identification, we also present detailed case studies on three proteins. This analysis suggests that many false positive predictions may correspond to binding sites and other functional residues. A web server that implements the method, our self-designed database, and the source code of the programs are publicly available at http://www.cs.bgu.ac.il/~meshi/functionPrediction.

10.
Individually specific acoustic signals in birds are used in territorial defence. These signals enable a reduction of energy expenditure due to individual recognition between rivals and the associated threat levels. Mechanisms and acoustic cues used for individual recognition seem to be versatile among birds. However, most studies so far have been conducted on oscine species. Few studies have focused on exactly how the potential for individual recognition changes with distance between the signaller and receiver. We studied a nocturnally active rail species, the corncrake, which utters a seemingly simple disyllabic call. The inner call structure, however, is quite complex and expressed as intervals between maximal amplitude peaks, called pulse-to-pulse durations (PPD). The inner call is characterized by very low within-individual variation and high between-individual differences. These variations and differences enable recognition of individuals. We conducted our propagation experiments in a natural corncrake habitat. We found that PPD was not affected by transmission. Correct individual identification was possible regardless of the distance and position of the microphone, which was above the ground. The results for sounds from the extreme distance propagated through the vegetation were even better than those for sounds transmitted above the vegetation. These results support the idea that PPD structure has evolved under selection favouring individual recognition in a species signalling at night, in a dense environment and close to the ground.

11.
MOTIVATION: Obtaining soluble proteins in sufficient concentrations is a recurring limiting factor in various experimental studies. Solubility is an individual trait of proteins which, under a given set of experimental conditions, is determined by their amino acid sequence. Accurate theoretical prediction of solubility from sequence is instrumental for setting priorities on targets in large-scale proteomics projects. RESULTS: We present a machine-learning approach called PROSO to assess the chance of a protein to be soluble upon heterologous expression in Escherichia coli based on its amino acid composition. The classification algorithm is organized as a two-layered structure in which the output of primary support vector machine (SVM) classifiers serves as input for a secondary Naive Bayes classifier. Experimental progress information from the TargetDB database as well as previously published datasets were used as the source of training data. In comparison with previously published methods our classification algorithm possesses improved discriminatory capacity characterized by the Matthews Correlation Coefficient (MCC) of 0.434 between predicted and known solubility states and the overall prediction accuracy of 72% (75 and 68% for positive and negative class, respectively). We also provide experimental verification of our predictions using solubility measurements for 31 mutational variants of two different proteins.
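
For reference, the Matthews Correlation Coefficient is computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the counts in the example are hypothetical and are not taken from the PROSO evaluation.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 100 soluble and 100 insoluble proteins.
print(matthews_corrcoef(tp=75, tn=68, fp=32, fn=25))
```

Unlike overall accuracy, the MCC ranges from -1 to 1 and stays informative when the two classes differ in size, which is presumably why it is reported alongside the per-class accuracies.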

12.
Many classifiers have been proposed for optical printed Chinese character recognition (OPCCR). Among them, the support vector machine (SVM) might be the best. However, the SVM is a two-class classifier, and its computation becomes time-consuming when it is applied to the multi-class problem in OPCCR. We therefore propose a neighbor-classes-based SVM (NC-SVM) to reduce the computational cost of SVM. Experiments on NC-SVM classification for OPCCR show that the proposed NC-SVM can effectively reduce the computation time in OPCCR.

13.
Acoustic individual discrimination has been demonstrated for a wide range of animal taxa. However, there has been far less scientific effort to demonstrate the effectiveness of automatic individual identification, which could greatly facilitate research, especially when data are collected via an acoustic localization system (ALS). In this study, we examine the accuracy of acoustic caller recognition in long calls (LCs) emitted by Bornean male orangutans (Pongo pygmaeus wurmbii) derived from two data-sets: the first consists of high-quality recordings taken during individual focal follows (N = 224 LCs by 14 males) and the second consists of LC recordings with variable microphone-caller distances stemming from ALS (N = 123 LCs by 10 males). The LC is a long-distance vocalization. We therefore expect that even the low-quality test-set should yield caller recognition results significantly better than by chance. Automatic individual identification was accomplished using software originally developed for human speaker recognition (i.e. the MSR identity toolbox). We obtained a 93.3% correct identification rate with high-quality recordings, and 72.23% with recordings stemming from the ALS with variable microphone-caller distances (20–420 m). These results show that automatic individual identification is possible even though the accuracy declines compared with the results of high-quality recordings due to severe signal degradations (e.g. sound attenuation, environmental noise contamination, and echo interference) with increasing distance. We therefore suggest that acoustic individual identification with speaker recognition software can be a valuable tool to apply to data obtained through an ALS, thereby facilitating field research on vocal communication.

14.
Song S, Zhan Z, Long Z, Zhang J, Yao L. PLoS ONE 2011, 6(2): e17191

Background

Support vector machine (SVM) has been widely used as an accurate and reliable method to decipher brain patterns from functional MRI (fMRI) data. Previous studies have not found a clear benefit of non-linear (polynomial kernel) SVM over linear SVM. Here, a more effective non-linear SVM using a radial basis function (RBF) kernel is compared with linear SVM. Unlike traditional studies, which focused merely on either the evaluation of different types of SVM or the voxel selection methods, we aimed to investigate the overall performance of linear and RBF SVM for fMRI classification together with voxel selection schemes, in terms of classification accuracy and computation time.

Methodology/Principal Findings

Six different voxel selection methods were employed to decide which voxels of fMRI data would be included in SVM classifiers with linear and RBF kernels in classifying objects of four categories. The overall performances of the voxel selection and classification methods were then compared. Results showed that: (1) voxel selection had an important impact on the classification accuracy of the classifiers: in a relatively low-dimensional feature space, RBF SVM significantly outperformed linear SVM; in a relatively high-dimensional space, linear SVM performed better than its counterpart; (2) considering classification accuracy and computation time together, linear SVM with relatively more voxels as features and RBF SVM with a small set of voxels (after PCA) achieved better accuracy in shorter time.

Conclusions/Significance

The present work provides the first empirical result of linear and RBF SVM in classification of fMRI data, combined with voxel selection methods. Based on the findings, if only classification accuracy is of concern, RBF SVM with an appropriately small set of voxels and linear SVM with relatively more voxels are the two suggested solutions; if users are more concerned about computational time, RBF SVM with a relatively small set of voxels, with part of the principal components kept as features, is the better choice.

15.
DNA microarrays (gene chips), frequently used in biological and medical studies, measure the expressions of thousands of genes per sample. Using microarray data to build accurate classifiers for diseases is an important task. This paper introduces an algorithm, called Committee of Decision Trees by Attribute Behavior Diversity (CABD), to build highly accurate ensembles of decision trees for such data. Since a committee's accuracy is greatly influenced by the diversity among its member classifiers, CABD uses two new ideas to "optimize" that diversity, namely (1) the concept of attribute behavior-based similarity between attributes, and (2) the concept of attribute usage diversity among trees. The ideas are effective for microarray data, since such data have many features and behavior similarity between genes can be high. Experiments on microarray data for six cancers show that CABD outperforms previous ensemble methods significantly and outperforms SVM, and show that the diversified features used by CABD's decision tree committee can be used to improve performance of other classifiers such as SVM. CABD has potential for other high-dimensional data, and its ideas may apply to ensembles of other classifier types.

16.
Over the past decade, dramatic declines in frog populations have been noticed worldwide. To examine this decline, monitoring frogs is becoming increasingly important. Compared to traditional field survey methods, recent advances in acoustic sensor technology have greatly extended spatial and temporal scales for monitoring animal populations. In this paper, we examine the problem of monitoring frog populations by analysing acoustic sensor data, where the population is reflected by community calling activity and species richness. Specifically, a novel acoustic event detection (AED) algorithm is first proposed to filter out those recordings without frog calls. Then, multi-label learning is used to classify each individual recording with six acoustic features: linear predictive coding coefficients, Mel-frequency cepstral coefficients, linear-frequency cepstral coefficients, acoustic complexity index, acoustic diversity index, and acoustic evenness index. Next, frog community calling activity and species richness are estimated by accumulating the results of AED and multi-label learning, respectively. Finally, ordinary least squares (OLS) regression is conducted to reveal the relationship between frog populations (frog calling activity and species richness) and weather variables (maximum temperature and rainfall). Experimental results demonstrate that our proposed intelligent system can significantly facilitate the effort to estimate frog community calling activity and species richness with comparable accuracies. The statistical results of OLS indicate that rainfall pattern has a lagged impact on frog community calling activity (significant on the first day after a rainy day) and species richness (significant on the fourth day after a rainy day). Temperature is shown to affect species richness but is less likely to change calling activity.

17.
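G protein-coupled receptors are very important signaling-molecule receptors, and their dysfunction leads to many diseases. Building on earlier work, the authors combined sequence feature analysis with support vector machine techniques to identify G protein-coupled receptor molecules and their types by analysing differences in sequence features. For the first time, the absolute codon usage frequencies of the mRNA sequences corresponding to G protein-coupled receptors were extracted as features, mainly because they carry information on both the codon usage bias of the gene and the amino acid composition of the encoded protein. The results show that, for predicting G protein-coupled receptor sequences and their types, the support vector machine classifier is best designed with a combined feature set containing both the absolute codon usage frequencies of the gene sequence and the dipeptide usage frequencies of the protein sequence, with a radial basis function kernel as the kernel function.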

18.
Microarrays have thousands to tens-of-thousands of gene features, but only a few hundred patient samples are available. The fundamental problem in microarray data analysis is identifying genes whose disruption causes congenital or acquired disease in humans. In this paper, we propose a new evolutionary method that can efficiently select a subset of potentially informative genes for support vector machine (SVM) classifiers. The proposed evolutionary method uses SVM with a given subset of gene features to evaluate the fitness function, and new subsets of features are selected based on the estimates of generalization error of SVMs and frequency of occurrence of the features in the evolutionary approach. Thus, in theory, selected genes reflect to some extent the generalization performance of SVM classifiers. We compare our proposed method with several existing methods and find that the proposed method can obtain better classification accuracy with a smaller number of selected genes than the existing methods.

19.
Sunn pest (Eurygaster integriceps Put.) causes severe damage to wheat fields annually, reducing production by up to 50%. Rapid identification of pest concentration points and estimation of infestation levels in fields can be useful for production management and reducing the use of chemical sprays. Because of the limited ability to detect pests on the ground and limited access to high-resolution satellite imagery, aerial photography was considered for crop pest and disease detection. In this study, the feasibility of soft computing approaches and image processing to identify areas infected with sunn pest using near-infrared and visible light aerial imagery was investigated. An irrigated winter wheat field was surveyed for five consecutive months, from February to June. The spectral vegetation indices (SVI) were extracted and analysed for both near-infrared and visible light images. To detect infected spikes, a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel was used. The reflectance of the Red and near-infrared (NIR) bands and the Ratio Vegetation Index (RVI) for near-infrared images, as well as the reflectance of the Red and Green bands and the normalised green blue difference index (NGBDI) for visible light images, had the greatest impact on the performance of the SVM classifiers. The SVM classifiers were validated using the confusion matrix method. The best accuracy and performance of the detection system was achieved in February and March, when the healthy wheat plant was still green. The mean accuracy for these two months was 0.97 and 0.93 for the SVM classifiers for NIR and visible light, respectively.
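
The two indices named in the abstract are simple band ratios. The sketch below uses their usual definitions, RVI = NIR / Red and NGBDI = (Green - Blue) / (Green + Blue); the abstract does not spell these out, so treating them as the study's exact formulas is an assumption, and the reflectance values are hypothetical.

```python
def ratio_vegetation_index(nir, red):
    """RVI: near-infrared reflectance divided by red reflectance."""
    if red == 0:
        raise ValueError("red reflectance must be non-zero")
    return nir / red


def normalised_green_blue_difference_index(green, blue):
    """NGBDI: (green - blue) / (green + blue), bounded between -1 and 1
    for non-negative reflectances."""
    if green + blue == 0:
        raise ValueError("green + blue must be non-zero")
    return (green - blue) / (green + blue)


# Hypothetical per-pixel reflectances.
print(ratio_vegetation_index(nir=0.5, red=0.1))
print(normalised_green_blue_difference_index(green=0.6, blue=0.2))
```

Healthy green canopy reflects strongly in the near-infrared and weakly in red, so a high RVI suggests vigorous vegetation; per-pixel values like these would be among the inputs to a classifier.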

20.
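Identification of microRNA precursors by the k-gram method   Total citations: 3, self-citations: 0, citations by others: 3
MicroRNAs (miRNAs) are short functional non-coding RNA sequences in animals and plants that take part in regulating gene expression. The first miRNA was discovered experimentally; however, identifying miRNAs by experimental means remains technically very challenging and incomplete. miRNA gene identification therefore calls for computational methods to compensate for the shortcomings of experimental methods. A completely new method for identifying miRNA precursors is proposed. In building the identification model, the primary sequence is combined with the secondary structure of the sequence, the k-gram method is used to map the sequence information into a high-dimensional feature space, features are then extracted by a feature selection method, and these features are used to build an SVM-based model for identifying miRNA precursors. A hidden Markov model (HMM) learning method was also used for comparison. The experimental results show that the method is effective and achieves high sensitivity and specificity.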
