Similar literature
1.
Proximity based GPCRs prediction in transform domain   Total citations: 1, self-citations: 0, citations by others: 1
In this work, we predict G-protein coupled receptors (GPCRs) using the hydrophobicity of amino acid sequences and the Fast Fourier Transform for feature generation. We analyze whether the GPCR classification strategy depends on the way the feature space is exploited. We show that sequence-pattern information can easily be exploited in the frequency domain using proximity rather than by increasing the margin of separation between the classes. We thus develop a simple proximity-based approach, the nearest neighbor (NN) classifier, for classifying the 17 GPCR subfamilies. The NN classifier outperformed the one-against-all implementation of the support vector machine on both the jackknife and independent dataset tests. The results validate the importance of understanding and efficiently exploiting the feature space. They also show that simple classification strategies may outperform complex ones because of the efficient exploitation of the feature space.
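The pipeline described in this abstract (hydrophobicity signal, frequency-domain features, nearest-neighbor classification) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the Kyte-Doolittle hydropathy scale, the fixed padding length `n_fft`, the number of retained coefficients `n_bins`, and all function names are assumptions made here. A naive DFT stands in for an FFT routine; it computes the same coefficients, only more slowly.

```python
import cmath
import math

# Kyte-Doolittle hydropathy index (assumed scale; the abstract does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity_signal(seq, n_fft=64):
    """Map residues to hydropathy values, then truncate or zero-pad to n_fft samples."""
    values = [KYTE_DOOLITTLE[a] for a in seq.upper() if a in KYTE_DOOLITTLE]
    values = values[:n_fft]
    return values + [0.0] * (n_fft - len(values))

def spectrum_features(seq, n_fft=64, n_bins=8):
    """Magnitudes of DFT coefficients 1..n_bins (the DC term is skipped)."""
    x = hydrophobicity_signal(seq, n_fft)
    feats = []
    for k in range(1, n_bins + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
                    for t in range(n_fft))
        feats.append(abs(coeff) / n_fft)
    return feats

def nearest_neighbor(query, train):
    """Return the label of the training vector closest to query (Euclidean distance)."""
    return min(train, key=lambda item: math.dist(query, item[0]))[1]
```

Here `train` is a list of `(feature_vector, label)` pairs. Padding every sequence to the same length keeps coefficient `k` at the same frequency across sequences, which is what makes a plain Euclidean distance between spectra meaningful.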

2.
Sleep apnoea is a very common sleep disorder that can cause symptoms such as daytime sleepiness, irritability and poor concentration. This paper presents a combined feature extraction approach for the detection of sleep apnoea, based on nonlinear features extracted from the electrocardiogram (ECG) reconstructed phase space (RPS) together with commonly used frequency-domain features. Here, 6 nonlinear features extracted from the ECG RPS are combined with 3 frequency-based features to form the final feature set. The nonlinear features consist of detrended fluctuation analysis (DFA), correlation dimension (CD), the 3 largest Lyapunov exponents (LLEs) and spectral entropy (SE). The proposed feature set achieves about 94.8% accuracy on the Physionet sleep apnoea dataset using a kernel-based SVM classifier. This research also shows that using nonlinear analysis to detect sleep apnoea can potentially improve the classification accuracy of apnoea detection systems.

3.
Hayat M, Khan A, Yeasin M. Amino Acids, 2012, 42(6): 2447-2460
Knowledge of the types of membrane protein provides useful clues in deducing the functions of uncharacterized membrane proteins. An automatic method for efficiently identifying uncharacterized proteins is thus highly desirable. In this work, we have developed a novel method for predicting membrane protein types by exploiting the discrimination capability of the difference in amino acid composition at the N and C terminus through split amino acid composition (SAAC). We also show that ensemble classification can better exploit this discriminating capability of SAAC. In this study, membrane protein types are classified using three feature extraction and several classification strategies. An ensemble classifier, Mem-EnsSAAC, is then developed using the best feature extraction strategy. Pseudo amino acid (PseAA) composition, discrete wavelet analysis (DWT), SAAC, and a hybrid model are employed for feature extraction. The nearest neighbor, probabilistic neural network, support vector machine, random forest, and AdaBoost classifiers are used as individual classifiers. The predictions of the individual learners are combined using a genetic algorithm to form the ensemble classifier, Mem-EnsSAAC, which yields an accuracy of 92.4 and 92.2% for the jackknife and independent dataset tests, respectively. Performance measures such as MCC, sensitivity, specificity, F-measure, and Q-statistics show that SAAC-based prediction yields significantly higher performance than PseAA- and DWT-based systems, and is also the best reported so far. The proposed Mem-EnsSAAC is able to predict membrane protein types with high accuracy and, consequently, can be very helpful in drug discovery. It can be accessed at http://111.68.99.218/membrane.
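As a rough illustration of split amino acid composition, the sketch below computes separate 20-dimensional composition vectors for the N-terminal segment, the middle, and the C-terminal segment and concatenates them. The segment length `terminal_len=25` and the function names are assumptions made for this example; the abstract does not state the segment lengths the authors used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segment):
    """Fraction of each of the 20 standard amino acids in a segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(a) / n for a in AMINO_ACIDS]

def saac(seq, terminal_len=25):
    """Concatenate N-terminal, middle, and C-terminal compositions (60 values)."""
    seq = seq.upper()
    # Shrink the terminal segments for short sequences so the parts never overlap.
    t = min(terminal_len, len(seq) // 3)
    n_term = seq[:t]
    middle = seq[t:len(seq) - t]
    c_term = seq[len(seq) - t:]
    return composition(n_term) + composition(middle) + composition(c_term)
```

Non-standard residue symbols count toward a segment's length but fall into no bin, so a segment's 20 values may sum to slightly less than 1.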

4.
Classification is a data mining task whose goal is to learn, from a training dataset, a model that can predict the class of a new data instance, while clustering aims to discover natural instance groupings within a given dataset. Learning cluster-based classification systems involves partitioning a training set into data subsets (clusters) and building a local classification model for each data cluster. The class of a new instance is predicted by first assigning the instance to its nearest cluster and then using that cluster’s local classification model to predict the instance’s class. In this paper, we present an ant colony optimization (ACO) approach to building cluster-based classification systems. Our ACO approach optimizes the number of clusters, the positioning of the clusters, and the choice of classification algorithm to use as the local classifier for each cluster. We also present an ensemble approach that allows the system to decide on the class of a given instance by considering the predictions of all local classifiers, employing a weighted voting mechanism based on the fuzzy degree of membership in each cluster. Our experimental evaluation employs five widely used classification algorithms: naïve Bayes, nearest neighbour, Ripper, C4.5, and support vector machines, and results are reported on a suite of 54 popular UCI benchmark datasets.

5.
Genomics, 2020, 112(5): 3089-3096
Automatic classification of glaucoma from fundus images is a vital diagnostic tool for Computer-Aided Diagnosis (CAD) systems. In this work, a novel fused feature extraction technique and ensemble classifier fusion are proposed for the diagnosis of glaucoma. The proposed method comprises three stages. Initially, the fundus images are subjected to preprocessing, followed by feature extraction and feature fusion by Intra-Class and Extra-Class Discriminative Correlation Analysis (IEDCA). The feature fusion approach eliminates between-class correlation while retaining sufficient Feature Dimension (FD) for Correlation Analysis (CA). The fused features are then fed individually to the classifiers, namely Support Vector Machine (SVM), Random Forest (RF) and K-Nearest Neighbor (KNN). Finally, a classifier fusion is designed that combines the decisions of the ensemble of classifiers using the Consensus-based Combining Method (CCM). CCM-based classifier fusion adjusts the weights iteratively after comparing the outputs of all the classifiers. The proposed fusion classifier improves accuracy and convergence compared with the individual algorithms. A classification accuracy of 99.2% is achieved by the two-level hybrid fusion approach. The method is evaluated on the public High Resolution Fundus (HRF) and DRIVE datasets with cross-dataset validation.

6.
The fractal dimension D may be calculated in many ways, since its strict definition, the Hausdorff definition, is too complicated for practical estimation. In this paper we perform a comparative study of ten methods of fractal analysis of time series. In Benoit, a commercial program for fractal analysis, five methods of computing the fractal dimension of time series (rescaled range analysis, power spectral analysis, roughness-length, variogram and wavelet methods) are available. We have implemented five other algorithms for calculating D: Higuchi's fractal dimension, relative dispersion analysis, running fractal dimension, a method based on mathematical morphology and a method based on intensity differences. For biomedical signals, the results obtained by means of different algorithms are different, but consistent.
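Of the algorithms listed, Higuchi's method is compact enough to sketch. The version below follows the standard published definition (curve lengths L(k) averaged over the k starting offsets, then the slope of log L(k) against log(1/k)); the default `k_max=8` and the function name are assumptions made for this example, and this is not the code used in the study.

```python
import math

def higuchi_fd(x, k_max=8):
    """Estimate the Higuchi fractal dimension of a 1-D series x.

    Raises ValueError for a constant series, whose curve length is zero.
    """
    n = len(x)
    if k_max < 2 or n < 2 * k_max:
        raise ValueError("need k_max >= 2 and at least 2 * k_max samples")
    log_inv_k, log_len = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):                      # starting offsets 0..k-1
            steps = (n - 1 - m) // k            # number of increments of size k
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, steps + 1))
            # Higuchi's normalisation of the curve length for this offset
            lengths.append(dist * (n - 1) / (steps * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_len.append(math.log(sum(lengths) / len(lengths)))
    # Least-squares slope of log L(k) against log(1/k)
    mean_x = sum(log_inv_k) / k_max
    mean_y = sum(log_len) / k_max
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_inv_k, log_len))
    den = sum((a - mean_x) ** 2 for a in log_inv_k)
    return num / den
```

For a straight line the curve length is exactly proportional to 1/k, so the estimate is 1; rougher signals give values closer to 2.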

7.
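Early diagnosis of cancer can significantly improve the survival rate of cancer patients, and this is especially true for patients with hepatocellular carcinoma. Machine learning is an effective tool for cancer classification. Selecting a low-dimensional feature subset with high classification accuracy from complex, high-dimensional cancer datasets is a difficult problem in cancer classification. This paper proposes a two-stage feature selection method, SC-BPSO: a new filter method, the SC filter, is designed by combining the Spearman correlation coefficient and the chi-square test of independence as the filter's evaluation function, and the SC filter is then combined with a wrapper method based on binary particle swarm optimization (BPSO) to achieve two-stage feature selection. The method is applied to a cancer classification problem on high-dimensional data, distinguishing normal samples from hepatocellular carcinoma samples. First, 130 liver tissue microRNA sequence datasets (64 hepatocellular carcinoma, 66 normal liver tissue) from the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) are preprocessed, and the MiRME algorithm is used to extract three types of features from the raw sequence files: microRNA expression levels, editing levels, and post-editing expression levels. Then, the parameters of the SC-BPSO algorithm are tuned for the hepatocellular carcinoma classification scenario, and the key feature subset is selected. Finally, classification models are built and predictions are made; the results are compared with those obtained from the feature subsets selected by the information gain filter, the information gain ratio filter, and the BPSO wrapper feature selection algorithms, using four classifiers with identical parameters: random forest, support vector machine, decision tree, and KNN. With the feature subset selected by the SC-BPSO algorithm, the classification accuracy reaches 98.4%. The results show that, compared with the other three feature selection algorithms, SC-BPSO can effectively find feature subsets that are smaller and more accurate. This may be of considerable significance for cancer classification problems involving high-dimensional data with few samples.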

8.
Categorizing the bioacoustic and ecoacoustic properties of animals is of great interest to biologists and ecologists. Multidisciplinary studies in engineering have also contributed significantly to the development of acoustic analysis. Observing animals living in their ecological environment provides information in many areas, such as global warming, climate change, monitoring of endangered animals, and agricultural activities. However, classifying bioacoustic sounds manually is very hard. Therefore, automated bioacoustic sound classification is crucial for ecological science. This work presents a new multispecies bioacoustic sound dataset and a novel machine learning model to classify bird and anuran species automatically from their sounds. In this model, a new nonlinear textural feature generation function is presented that uses the Twine cipher substitution box (S-box); this feature generation function is named twine-pat. By using twine-pat and the tunable Q-factor wavelet transform, a multilevel feature generation network is presented. Iterative ReliefF (IRF) is employed to select the most valuable features. Two shallow classifiers are used to calculate results. The presented model reached 98.75% accuracy using a k-nearest neighbor (kNN) classifier. The results clearly demonstrate the success of the presented model.

9.
The study of soil mean weight diameter (MWD), essential for sustainable soil management, has recently received much attention. As the estimation of MWD is challenging, labor-intensive, and time-consuming, there is a crucial need to develop a predictive estimation method that generates the information required for soil health assessment while saving the time and cost involved in soil analysis. Pedotransfer functions (PTFs) are used to estimate parameters that are ‘difficult to measure’ and time-consuming from ‘easy to measure’ parameters. In the current study, an empirical PTF, i.e., multi-linear regression (MLR), and four machine learning based PTFs, i.e., artificial neural network (ANN), support vector machine (SVM), classification and regression trees (CART), and random forest (RF), were used for mean weight diameter prediction in Karnal district of Haryana, India. A total of 121 soil samples from 0-15 and 15-30 cm soil depths were collected from seventeen villages of the Nilokheri, Nissing, and Assandh blocks of Karnal district. Soil parameters such as bulk density (BD), fractal dimension (D), soil texture (i.e., sand, silt, and clay), organic carbon (OC), and glomalin content were used as the input variables. Two input combinations, i.e., one with texture data (dataset 1) and the other with fractal dimension data replacing texture (dataset 2), were used, and the complete dataset (121 samples) was divided into training and testing datasets in a 4:1 ratio. The model performance was evaluated by statistical parameters such as mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), normalized root mean square error (NRMSE), and coefficient of determination (R²). The comparison showed that including the fractal dimension in the input dataset improved the prediction capability of ANN, SVM, and RF. MLR and CART showed lower predictive ability than the other three approaches (i.e., ANN, SVM, and RF).
In the training dataset, RMSE (mm) for the SVM model was 8.33% lower with D than with texture as the input, whereas, in the testing dataset, it was 16.67% lower. Because SVM is more flexible and effectively captures non-linear relationships, it performed better than the other models in predicting MWD. As seen in this study, the SVM model with input data D is the best in its class and has a high potential for MWD prediction in the Karnal district of Haryana, India.
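The evaluation statistics named in this abstract have standard definitions, sketched below for reference. One assumption is labeled explicitly: NRMSE is normalised here by the mean of the observed values, whereas other conventions (for example, normalising by the observed range) are also common, and the abstract does not say which one was used. The function names are chosen for this example.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def nrmse(y_true, y_pred):
    """RMSE divided by the mean observation (assumed normalisation)."""
    return rmse(y_true, y_pred) / (sum(y_true) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For observed values `[1, 2, 3, 4]` and predictions `[1, 2, 3, 5]`, these give MAE 0.25, MAPE 6.25%, RMSE 0.5, NRMSE 0.2, and R² 0.8.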

10.
Shen HB, Chou KC. Amino Acids, 2007, 32(4): 483-488
Predicting membrane protein type is both an important and challenging topic in current molecular and cellular biology. This is because knowledge of membrane protein type often provides useful clues for determining, or sheds light upon, the function of an uncharacterized membrane protein. With the explosion of newly found protein sequences in the post-genomic era, there is great demand for a computational method that quickly and reliably identifies the types of membrane proteins according to their primary sequences. In this paper, a novel classifier, the so-called "ensemble classifier", was introduced. It is formed by fusing a set of nearest neighbor (NN) classifiers, each of which is defined in a different pseudo amino acid composition space. The type of a query protein is determined by the outcome of voting among these constituent individual classifiers. It was demonstrated through the self-consistency test, jackknife test, and independent dataset test that the ensemble classifier outperformed other existing classifiers widely used in the biological literature. It is anticipated that the idea of the ensemble classifier can also be used to improve prediction quality in classifying other attributes of proteins according to their sequences.

11.
MOTIVATION: An important challenge in the use of large-scale gene expression data for biological classification occurs when the expression dataset being analyzed involves multiple classes. Key issues that need to be addressed under such circumstances are the efficient selection of good predictive gene groups from datasets that are inherently 'noisy', and the development of new methodologies that can enhance the successful classification of these complex datasets. METHODS: We have applied genetic algorithms (GAs) to the problem of multi-class prediction. A GA-based gene selection scheme is described that automatically determines the members of a predictive gene group, as well as the optimal group size, that maximize classification success using a maximum likelihood (MLHD) classification method. RESULTS: The GA/MLHD-based approach achieves higher classification accuracies than other published predictive methods on the same multi-class test dataset. It also permits substantial feature reduction in classifier gene sets without compromising predictive accuracy. We propose that GA-based algorithms may represent a powerful new tool in the analysis and exploration of complex multi-class gene expression data. AVAILABILITY: Supplementary information, data sets and source code are available at http://www.omniarray.com/bioinformatics/GA.

12.
G-protein-coupled receptors (GPCRs) are the largest family of cell surface receptors that, via trimeric guanine nucleotide-binding proteins (G-proteins), initiate some signaling pathways in the eukaryotic cell. Many diseases involve malfunction of GPCRs, making their role evident in drug discovery. Thus, the automatic prediction of GPCRs can be very helpful in the pharmaceutical industry. However, prediction of GPCRs, their families, and their subfamilies is a challenging task. In this article, GPCRs are classified into families, subfamilies, and sub-subfamilies using pseudo-amino-acid composition and multiscale energy representation of different physicochemical properties of amino acids. The aim of the current research is to assess different feature extraction strategies and to develop a hybrid feature extraction strategy that can exploit the discrimination capability in both the spatial and transform domains for GPCR classification. Support vector machine, nearest neighbor, and probabilistic neural network classifiers are used for classification. The overall performance of each classifier is computed individually for each feature extraction strategy. It is observed that, using the jackknife test, the proposed GPCR–hybrid method provides the best results reported so far. The GPCR–hybrid web predictor, intended to help researchers working on GPCRs in the fields of biochemistry and bioinformatics, is available at http://111.68.99.218/GPCR.

13.
To evaluate the possibility that an unknown protein corresponds to a resistance gene against Xanthomonas oryzae pv. oryzae, a different mode of pseudo amino acid composition (PseAAC) is proposed to formulate the protein samples by integrating the amino acid composition with the chaos game representation (CGR) method. Numerical comparisons of triangle, quadrangle and 12-vertex polygon CGR are carried out to evaluate the efficiency of using these fractal figures in classifiers. The numerical results show that, among the three polygon methods, the triangle method gives a good fractal visualization and performs the best in classifier construction. By using triangle + 12-vertex polygon CGR as the mathematical feature, the classifier achieves 98.13% in the jackknife test and an MCC of 0.8462.

14.
This paper proposes a new power spectral-based hybrid genetic algorithm-support vector machine (SVMGA) technique to classify five types of electrocardiogram (ECG) beats, namely normal beats and four manifestations of heart arrhythmia. The method employs three modules: a feature extraction module, a classification module and an optimization module. The feature extraction module extracts spectral features and three timing-interval features from the electrocardiogram. Non-parametric power spectral density (PSD) estimation methods are used to extract the spectral features. A support vector machine (SVM) is employed as the classifier to recognize the ECG beats. We investigate and compare two classification approaches. In the first, the SVM parameters are specified experimentally by trial and error. In the second, the relevant parameters are optimized through an intelligent algorithm. These parameters are the Gaussian radial basis function (GRBF) kernel parameter σ and the penalty parameter C of the SVM classifier. Their performance in classifying ECG signals is then evaluated on eight files obtained from the MIT–BIH arrhythmia database. The classification accuracy of the SVMGA approach proves superior to that of the SVM with constant, manually selected parameters.

15.
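Medical CT image retrieval based on SIFT features and an approximate nearest neighbor algorithm   Total citations: 1, self-citations: 0, citations by others: 1
For medical X-ray computed tomography (CT) images, a retrieval method based on scale-invariant feature transform (SIFT) features and an approximate nearest neighbor algorithm is proposed. First, the SIFT algorithm is used to obtain the feature points of an image and the corresponding feature vectors; an approximate nearest neighbor algorithm is then used to match and search the SIFT feature vectors, yielding the sequence of images in the database most similar to the reference image. Experimental results show that the method retrieves results whose details match the target image and greatly increases retrieval speed. Compared with traditional texture-based retrieval methods, it performs better in both precision and the similarity of the retrieved results to the target image, meeting the requirements of medical CT image retrieval.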

16.
Dabney AR, Storey JD. PLoS ONE, 2007, 2(10): e1002
Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
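For readers unfamiliar with the baseline, a plain nearest-centroid classifier can be sketched as below. This shows only the basic rule (class means as centroids, Euclidean distance); it does not implement the feature selection or shrinkage estimates proposed in the paper, and the function names are assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_centroids(samples, labels):
    """Compute one centroid (per-feature mean) for each class label."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))
```

Feature selection in this setting amounts to choosing which coordinates of `x` enter the distance computation.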

17.
There is a strong research interest in identifying the surface roughness of the carotid arterial inner wall via texture analysis for early diagnosis of atherosclerosis. The purpose of this study is to assess the efficacy of texture analysis methods for identifying arterial roughness in the early stage of atherosclerosis. Ultrasound images of common carotid arteries of 15 normal mice fed a normal diet and 28 apoE−/− mice fed a high-fat diet were recorded by a high-frequency ultrasound system (Vevo 2100, frequency: 40 MHz). Six different texture feature sets were extracted based on the following methods: first-order statistics, fractal dimension texture analysis, spatial gray level dependence matrix, gray level difference statistics, the neighborhood gray tone difference matrix, and the statistical feature matrix. Statistical analysis indicates that 11 of 19 texture features can be used to distinguish between normal and abnormal groups (p<0.05). When the 11 optimal features were used as inputs to a support vector machine classifier, we achieved over 89% accuracy, 87% sensitivity and 93% specificity. The accuracy, sensitivity and specificity for the k-nearest neighbor classifier were 73%, 75% and 70%, respectively. The results show that it is feasible to identify arterial surface roughness based on texture features extracted from ultrasound images of the carotid arterial wall. This method is shown to be useful for early detection and diagnosis of atherosclerosis.
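Accuracy, sensitivity and specificity, as reported above, follow from the four cells of a binary confusion matrix. The helper below is a minimal sketch with a hypothetical function name; the counts in the usage note are made-up illustrative numbers, not data from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

For example, `binary_metrics(tp=8, fp=1, tn=9, fn=2)` gives a sensitivity of 0.8, a specificity of 0.9 and an accuracy of 0.85.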

18.
MOTIVATION: Ranking gene feature sets is a key issue for both phenotype classification, for instance, tumor classification in a DNA microarray experiment, and prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data, and applies this classifier in turn to each data observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier, and then computes whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance. RESULTS: A model-based approach is used to compare the ranking performances of resubstitution and cross-validation for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-state distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation relative to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
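The two error estimators contrasted in this abstract can be illustrated with the 3-nearest-neighbor rule on one-dimensional data. This is a toy sketch with hypothetical function names, not the model-based experiment from the paper; it only shows how the estimators are computed and why resubstitution tends to be optimistic (each point votes for itself).

```python
from collections import Counter

def knn_predict(train, x, k=3):
    """Majority label among the k training points nearest to x (1-D features)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def resubstitution_error(data, k=3):
    """Fit on all data, then test on the same data."""
    wrong = sum(knn_predict(data, x, k) != y for x, y in data)
    return wrong / len(data)

def loo_error(data, k=3):
    """Leave-one-out cross-validation: hold out each point in turn."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        wrong += knn_predict(rest, x, k) != y
    return wrong / len(data)
```

On the small dataset `[(-1.5, "a"), (0.0, "a"), (1.0, "a"), (1.8, "b"), (3.0, "b"), (4.5, "b")]`, resubstitution reports an error of 0 while leave-one-out reports 2/6, because the two points nearest the class boundary are only classified correctly when they can vote for themselves.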

19.
The diagnostic interpretation of medical images is a complex task aiming to detect potential abnormalities. One of the most used features in this process is texture, which is a key component in the human understanding of images. Many studies have been conducted to develop algorithms for texture quantification. The relevance of fractal geometry in medical image analysis is justified by the proven self-similarity of anatomical objects when imaged with a finite resolution. In recent years, fractal geometry has been applied extensively in many medical signal analysis applications. The use of these geometries relies heavily on estimation of the fractal features. Various methods have been proposed to estimate the fractal dimension or multifractal spectrum of a signal. This article presents an overview of these algorithms, the way they work, their benefits and limits, and their application in the field of medical signal analysis.

20.
Afridi TH, Khan A, Lee YS. Amino Acids, 2012, 42(4): 1443-1454
Mitochondria are all-important organelles of eukaryotic cells since they are involved in processes associated with cellular mortality and human diseases. Therefore, trustworthy techniques are highly required for the identification of new mitochondrial proteins. We propose the Mito-GSAAC system for prediction of mitochondrial proteins. The aim of this work is to investigate an effective feature extraction strategy and to develop an ensemble approach that can better exploit the advantages of this feature extraction strategy for mitochondria classification. We investigate four kinds of protein representations for prediction of mitochondrial proteins: amino acid composition, dipeptide composition, pseudo amino acid composition, and split amino acid composition (SAAC). Individual classifiers such as support vector machine (SVM), k-nearest neighbor, multilayer perceptron, random forest, AdaBoost, and bagging are first trained. An ensemble classifier is then built using genetic programming (GP) to evolve a complex but effective decision space from the individual decision spaces of the trained classifiers. The highest prediction performance for the jackknife test is 92.62%, obtained using the GP-based ensemble classifier on SAAC features, which is the highest accuracy reported so far on the Mitochondria dataset being used. On the Malaria Parasite Mitochondria dataset, the highest accuracy is obtained by SVM using SAAC, and it is further enhanced to 93.21% using the GP-based ensemble. It is observed that SAAC has better discrimination power for mitochondria prediction than the rest of the feature extraction strategies. Thus, the improved prediction performance is largely due to the better capability of SAAC for discriminating between mitochondrial and non-mitochondrial proteins at the N and C terminus and the effective combination capability of GP. Mito-GSAAC can be accessed at .
It is expected that the novel approach and the accompanying predictor will have a major impact on Molecular Cell Biology, Proteomics, Bioinformatics, System Biology, and Drug Development.
