Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
In this article, we present COMSAT, a hybrid framework for residue contact prediction of transmembrane (TM) proteins, integrating a support vector machine (SVM) method and a mixed integer linear programming (MILP) method. COMSAT consists of two modules: COMSAT_SVM, which is trained mainly on position-specific scoring matrix features, and COMSAT_MILP, which is an ab initio method based on optimization models. Contacts predicted by the SVM model are ranked by SVM confidence scores, and a threshold is trained to improve the reliability of the predicted contacts. For TM proteins with no contacts above the threshold, COMSAT_MILP is used. The proposed hybrid contact prediction scheme was tested on two independent TM protein sets based on the contact definition of 14 Å between Cα-Cα atoms. First, using a rigorous leave-one-protein-out cross validation on the training set of 90 TM proteins, an accuracy of 66.8%, a coverage of 12.3%, a specificity of 99.3% and a Matthews correlation coefficient (MCC) of 0.184 were obtained for residue pairs that are at least six amino acids apart. Second, when tested on a test set of 87 TM proteins, the proposed method showed a prediction accuracy of 64.5%, a coverage of 5.3%, a specificity of 99.4% and an MCC of 0.106. COMSAT shows satisfactory results when compared with 12 other state-of-the-art predictors, and is more robust in terms of prediction accuracy as the length and complexity of TM proteins increase. COMSAT is freely accessible at http://hpcc.siat.ac.cn/COMSAT/ . Proteins 2016; 84:332–348. © 2016 Wiley Periodicals, Inc.
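The contact definition used in the COMSAT abstract (Cα-Cα distance within 14 Å, residue pairs at least six positions apart) can be illustrated with a minimal sketch. The function name `ca_contacts`, the inclusive `<=` comparison, the reading of "six amino acids apart" as an index difference of at least 6, and the toy coordinates are all assumptions made for illustration; this is not the COMSAT code.

```python
from math import dist

def ca_contacts(ca_coords, cutoff=14.0, min_separation=6):
    """Return index pairs (i, j) whose C-alpha atoms are within `cutoff`
    angstroms and whose sequence separation is at least `min_separation`."""
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        # Only consider pairs far enough apart along the sequence.
        for j in range(i + min_separation, n):
            if dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts

# Hypothetical toy hairpin: residues 0-4 run along x, residues 5-9 run back
# along a parallel line 5 A away, with 3.8 A between consecutive residues.
toy_hairpin = [(3.8 * k, 0.0, 0.0) for k in range(5)] + \
              [(3.8 * (9 - k), 5.0, 0.0) for k in range(5, 10)]

print(ca_contacts(toy_hairpin))              # all long-range pairs at 14 A
print(ca_contacts(toy_hairpin, cutoff=6.0))  # only the directly facing pairs
```

A predictor such as the one described would be scored against a map like this; the sketch only shows how the reference contacts could be derived from coordinates.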

2.
《Genomics》2021,113(5):2919-2924
Drug resistance of pathogenic bacteria has become increasingly serious due to the abuse of antibiotics in recent years. Researchers have found that cell wall lyases are effective antibacterial agents that can specifically recognize target bacteria and degrade bacterial peptidoglycan. Traditional wet-lab experiments for identifying lyases are usually expensive, time-consuming and laborious. Therefore, there is an urgent need to develop computational prediction tools to identify lyases quickly and accurately. In this paper, a new predictor, CWLy-RF, is proposed based on the random forest (RF) algorithm to identify cell wall lyases. In this method, we combined three features, namely, 400D, 188D and the composition of k-spaced amino acid group pairs, using mixed-feature representation methods. Afterward, we improved the feature representation ability by selecting the top 100 features with the information gain method and trained a predictive model using RF. The constructed prediction model was evaluated using 10-fold cross-validation. The accuracy obtained was 96.09%, the AUC was 0.993, the MCC was 0.922, the sensitivity was 94.92%, and the specificity was 97.32%. We show that the proposed predictor CWLy-RF is superior to other recent models, and it will hopefully become an effective and useful tool for identifying lyases.
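One of the three feature types named in the CWLy-RF abstract, the composition of k-spaced amino acid group pairs, can be sketched as follows. The five physicochemical groups and the normalization by the number of valid pairs follow a commonly used convention and are assumptions here, as are the function name `cksaagp` and the `k_max` parameter; the abstract does not specify them.

```python
# Assumed grouping of the 20 standard residues into five classes.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}

def cksaagp(seq, k_max=2):
    """Frequency of each ordered group pair separated by k residues,
    for k = 0..k_max. Returns a dict keyed by (k, group1, group2)."""
    names = list(GROUPS)
    features = {}
    for k in range(k_max + 1):
        counts = {(a, b): 0 for a in names for b in names}
        total = 0
        for i in range(len(seq) - k - 1):
            g1 = RESIDUE_TO_GROUP.get(seq[i])
            g2 = RESIDUE_TO_GROUP.get(seq[i + k + 1])
            if g1 and g2:  # skip pairs containing non-standard residues
                counts[(g1, g2)] += 1
                total += 1
        for (a, b), c in counts.items():
            features[(k, a, b)] = c / total if total else 0.0
    return features

# Hypothetical toy sequence, not taken from any dataset in the abstract.
feats = cksaagp("AKAKAK", k_max=1)
print(feats[(0, "aliphatic", "positive")], feats[(1, "aliphatic", "aliphatic")])
```

Each value of k contributes 25 features, so a vector like this would be concatenated with the other descriptors before feature selection.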

3.
4.
There is a strong research interest in identifying the surface roughness of the carotid arterial inner wall via texture analysis for early diagnosis of atherosclerosis. The purpose of this study is to assess the efficacy of texture analysis methods for identifying arterial roughness in the early stage of atherosclerosis. Ultrasound images of common carotid arteries of 15 normal mice fed a normal diet and 28 apoE−/− mice fed a high-fat diet were recorded by a high-frequency ultrasound system (Vevo 2100, frequency: 40 MHz). Six different texture feature sets were extracted based on the following methods: first-order statistics, fractal dimension texture analysis, spatial gray level dependence matrix, gray level difference statistics, the neighborhood gray tone difference matrix, and the statistical feature matrix. Statistical analysis indicates that 11 of 19 texture features can be used to distinguish between normal and abnormal groups (p<0.05). When the 11 optimal features were used as inputs to a support vector machine classifier, we achieved over 89% accuracy, 87% sensitivity and 93% specificity. The accuracy, sensitivity and specificity for the k-nearest neighbor classifier were 73%, 75% and 70%, respectively. The results show that it is feasible to identify arterial surface roughness based on texture features extracted from ultrasound images of the carotid arterial wall. This method is shown to be useful for early detection and diagnosis of atherosclerosis.

5.
《Genomics》2020,112(2):1282-1289
DNase I hypersensitive sites (DHSs) are related to DNA regulatory elements, so understanding DHSs is of great significance for biomedical research. However, traditional experiments are not very good at identifying recombinant sites of a large number of emerging DNA sequences by sequencing. Some machine learning methods have been proposed to identify DHSs, but most ignore the spatial autocorrelation of the DNA sequence. In this paper, we propose a predictor called iDHS-DSAMS to identify DHSs based on the benchmark datasets. We develop a feature extraction method called dinucleotide-based spatial autocorrelation (DSA). We then use Min-Redundancy-Max-Relevance (mRMR) to remove irrelevant and redundant features, selecting a 100-dimensional feature vector. Finally, we use an ensemble bagged tree as the classifier, trained on datasets oversampled with SMOTE. Five-fold cross validation tests on two benchmark datasets indicate that the proposed method outperforms its existing counterparts in accuracy (Acc), Matthews correlation coefficient (MCC), sensitivity (Sn) and specificity (Sp).

6.
Yan C  Hu J  Wang Y 《Amino acids》2008,35(1):65-73
Identification of outer membrane proteins (OMPs) from genomes is an important task. This paper presents a k-nearest neighbor (K-NN) method for discriminating OMPs. The method makes predictions based on a weighted Euclidean distance that is computed from residue composition. The method achieves 89.1% accuracy with 0.668 MCC (Matthews correlation coefficient) in discriminating OMPs and non-OMPs. The performance of the method is improved by incorporating homologous information into the calculation of residue composition. The final method achieves an accuracy of 96.1%, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the method proposed in this study outperforms the others.
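The core idea of the OMP abstract, a k-nearest neighbor vote using a weighted Euclidean distance over residue composition, can be sketched as below. The uniform weights, the value of k, the function names and the toy sequences are hypothetical placeholders; the abstract does not state how the weights are chosen, and the homology-enriched composition is not reproduced here.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard residues in `seq`."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def weighted_distance(x, y, weights):
    """Euclidean distance with one weight per residue type."""
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def knn_predict(query, training, weights, k=3):
    """Majority vote among the k training sequences nearest to `query`.
    `training` is a list of (sequence, label) pairs."""
    q = composition(query)
    ranked = sorted(
        training,
        key=lambda item: weighted_distance(q, composition(item[0]), weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data; real training would use curated OMP/non-OMP sets.
training = [
    ("GYGYGYTNTN", "OMP"),
    ("GYTNGYTNGA", "OMP"),
    ("LLLLAAAVVV", "non-OMP"),
    ("LLAAVVIILL", "non-OMP"),
]
uniform_weights = [1.0] * len(AMINO_ACIDS)  # placeholder weights
print(knn_predict("GYTNGYGY", training, uniform_weights))
```

Raising the weight of residue types that differ most between the two classes would make those dimensions dominate the distance, which is the general motivation for weighting.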

7.
8.
Lytic enzymes were isolated from 14 strains of phage-infected Staphylococcus aureus. Cell walls were prepared from the same uninfected strains of bacteria. Comparison of the lytic rates was made for each enzyme, with each of the cell walls as substrate. Differences in the rate of substrate utilization of the various cell wall types exceeded 10-fold. Cell walls from strains 42E, 29, and 77 were the best substrates, whereas cell walls from strains 3C, 80, and 187 were the poorest substrates. The cell wall amino acid composition is discussed as related to lytic enzyme specificity. A possible explanation of phage typing of staphylococcal cells, based on enzyme activity and cell wall composition, is presented.

9.
In this paper, we present an effective and efficient diagnosis system based on particle swarm optimization (PSO) enhanced fuzzy k-nearest neighbor (FKNN) for Parkinson's disease (PD) diagnosis. In the proposed system, named PSO–FKNN, both the continuous and binary versions of PSO were used to perform parameter optimization and feature selection simultaneously. On the one hand, the neighborhood size k and the fuzzy strength parameter m of the FKNN classifier are adaptively specified by the continuous PSO. On the other hand, binary PSO is used to choose the most discriminative subset of features for prediction. The effectiveness of the PSO–FKNN model has been rigorously evaluated on the PD data set in terms of classification accuracy, sensitivity, specificity and the area under the receiver operating characteristic (ROC) curve (AUC). Compared with existing methods from previous studies, the proposed system achieved the highest classification accuracy reported so far via 10-fold cross-validation analysis, with a mean accuracy of 97.47%. The proposed diagnosis system might therefore serve as a powerful new tool for diagnosing PD.
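The two FKNN parameters mentioned in the abstract, the neighborhood size k and the fuzzy strength m, enter the classifier as sketched below. This follows the standard fuzzy k-NN formulation with crisp training labels and inverse-distance weights raised to 2/(m-1); whether the paper uses exactly this variant is an assumption, the PSO search itself is not shown, and the function name and toy points are hypothetical.

```python
from math import dist

def fknn_memberships(query, training, k=3, m=2.0, eps=1e-12):
    """Fuzzy k-NN class memberships for `query`.
    `training` is a list of (feature_vector, label) pairs; m must be > 1."""
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    exponent = 2.0 / (m - 1.0)
    # Closer neighbors receive larger weights; eps guards against zero distance.
    weighted = [(label, 1.0 / (dist(query, x) ** exponent + eps))
                for x, label in neighbors]
    total = sum(w for _, w in weighted)
    memberships = {}
    for label, w in weighted:
        memberships[label] = memberships.get(label, 0.0) + w / total
    return memberships

# Hypothetical two-feature toy data, not the PD data set from the abstract.
training = [
    ((0.0, 0.0), "healthy"),
    ((0.0, 1.0), "healthy"),
    ((5.0, 5.0), "PD"),
    ((5.0, 6.0), "PD"),
]
print(fknn_memberships((0.2, 0.2), training, k=3, m=2.0))
```

A larger m flattens the weights toward a plain majority vote, while m close to 1 lets the nearest neighbor dominate, which is why tuning k and m jointly can matter.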

10.
Lipocalins are functionally diverse proteins that are composed of 120–180 amino acid residues. Members of this family have several important biological functions including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response activity in plants, odorant binding, prostaglandin biosynthesis, cellular homeostasis regulation, immunity, immunotherapy and so on. Identification of lipocalins from protein sequence is challenging because of the poor sequence identity, which often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach to predict lipocalins from protein sequence using sequence-derived properties. LipoPred was trained using a dataset consisting of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated on an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. On the training dataset, LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred performed better than the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, the successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be used efficiently to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at .
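The drop in MCC from 0.74 to 0.16 on the independent set, despite nearly unchanged sensitivity and specificity, is what one would expect from class imbalance (140 positives against 21,447 negatives). The sketch below reconstructs approximate confusion-matrix counts by rounding the reported rates, which is an assumption since the abstract gives no raw counts, and recomputes accuracy and MCC from them. Function names are illustrative.

```python
from math import sqrt

def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Approximate (TP, FN, TN, FP) by rounding rate * class size."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn

def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Rates and class sizes as reported for the independent test set.
tp, fn, tn, fp = confusion_from_rates(0.8857, 0.8422, 140, 21447)
print(tp, fn, tn, fp)
print(round(accuracy(tp, fn, tn, fp), 4), round(mcc(tp, fn, tn, fp), 2))
```

With these reconstructed counts, false positives outnumber true positives by roughly 27 to 1, so precision is low even though specificity is above 84%, and MCC reflects that.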

11.
Shi SP  Qiu JD  Sun XY  Suo SB  Huang SY  Liang RP 《PloS one》2012,7(6):e38772
Protein methylation is predominantly found on lysine and arginine residues, and carries many important biological functions, including gene regulation and signal transduction. Given their important involvement in gene expression, protein methylation and its regulatory enzymes are implicated in a variety of human disease states such as cancer, coronary heart disease and neurodegenerative disorders. Thus, identification of methylation sites can be very helpful for drug design for various related diseases. In this study, we developed a method called PMeS to improve the prediction of protein methylation sites based on an enhanced feature encoding scheme and support vector machine. The enhanced feature encoding scheme was composed of the sparse property coding, normalized van der Waals volume, position weight amino acid composition and accessible surface area. PMeS achieved a promising performance with a sensitivity of 92.45%, a specificity of 93.18%, an accuracy of 92.82% and a Matthews correlation coefficient of 85.69% for arginine, as well as a sensitivity of 84.38%, a specificity of 93.94%, an accuracy of 89.16% and a Matthews correlation coefficient of 78.68% for lysine in 10-fold cross validation. Compared with other existing methods, PMeS provides better predictive performance and greater robustness. It can be anticipated that PMeS might be useful to guide future experiments needed to identify potential methylation sites in proteins of interest. The online service is available at http://bioinfo.ncu.edu.cn/inquiries_PMeS.aspx.

12.
13.
Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.

14.
Nicotinamide adenine dinucleotide (NAD) plays an important role in cellular metabolism and acts as a hydride-accepting and hydride-donating coenzyme in energy production. Identification of NAD-protein interacting sites can significantly aid in understanding NAD-dependent metabolism and pathways, and could further contribute useful information for drug development. In this study, a computational method is proposed to predict NAD-protein interacting sites using sequence information and structure-based information. All models developed in this work are evaluated using 7-fold cross validation. Results show that using the position specific scoring matrix (PSSM) as an input feature is quite encouraging for predicting NAD interacting sites. To account for the unbalanced dataset, an ensemble support vector machine (SVM), which is an assembly of many individual SVM classifiers, is developed to predict the NAD interacting sites. The overall accuracy (Acc) thus obtained was 87.31%, with a Matthews correlation coefficient (MCC) of 0.56. In contrast, the corresponding accuracy of the single SVM approach was only 80.86%, with an MCC of 0.38. These results indicate that the prediction accuracy could be remarkably improved via the ensemble SVM classifier approach.

15.
Hayat M  Khan A  Yeasin M 《Amino acids》2012,42(6):2447-2460
Knowledge of the types of membrane protein provides useful clues in deducing the functions of uncharacterized membrane proteins. An automatic method for efficiently identifying uncharacterized proteins is thus highly desirable. In this work, we have developed a novel method for predicting membrane protein types by exploiting the discrimination capability of the difference in amino acid composition at the N and C termini through split amino acid composition (SAAC). We also show that ensemble classification can better exploit this discriminating capability of SAAC. In this study, membrane protein types are classified using three feature extraction strategies and several classification strategies. An ensemble classifier, Mem-EnsSAAC, is then developed using the best feature extraction strategy. Pseudo amino acid (PseAA) composition, discrete wavelet analysis (DWT), SAAC, and a hybrid model are employed for feature extraction. The nearest neighbor, probabilistic neural network, support vector machine, random forest, and Adaboost are used as individual classifiers. The predicted results of the individual learners are combined using a genetic algorithm to form an ensemble classifier, Mem-EnsSAAC, yielding an accuracy of 92.4% and 92.2% for the jackknife and independent dataset tests, respectively. Performance measures such as MCC, sensitivity, specificity, F-measure, and Q-statistics show that SAAC-based prediction yields significantly higher performance compared to PseAA- and DWT-based systems, and is also the best reported so far. The proposed Mem-EnsSAAC is able to predict membrane protein types with high accuracy and, consequently, can be very helpful in drug discovery. It can be accessed at http://111.68.99.218/membrane.
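Split amino acid composition, as described in the abstract, computes composition separately for the N-terminal, middle and C-terminal parts of a sequence so that differences at the termini are not averaged away. A minimal sketch follows; the terminal segment length of 25 residues, the function names and the toy sequence are assumptions, since the abstract does not give these details.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(segment):
    """Fraction of each standard residue in `segment` (zeros if empty)."""
    n = len(segment) or 1
    return [segment.count(a) / n for a in AMINO_ACIDS]

def split_aac(seq, terminal_len=25):
    """60-dimensional SAAC vector: N-terminal, middle, C-terminal composition."""
    if len(seq) < 2 * terminal_len:
        raise ValueError("sequence shorter than the two terminal segments")
    n_term = seq[:terminal_len]
    c_term = seq[-terminal_len:]
    middle = seq[terminal_len:-terminal_len]
    return aa_composition(n_term) + aa_composition(middle) + aa_composition(c_term)

# Hypothetical toy sequence with a distinct residue in each region.
toy = "A" * 25 + "G" * 10 + "W" * 25
features = split_aac(toy)
print(len(features))
```

Plain amino acid composition of the same toy sequence would report a mixture of A, G and W, whereas the split vector keeps each region's signal separate.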

16.
Membrane proteins are a vital type of protein that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides valuable information for predicting novel examples of the membrane protein types. However, classification of membrane protein types can be both time-consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, a neural network-based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, and includes seven feature sets: amino acid composition, sequence length, 2-gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. For the independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers also exhibit improved performance on other measures such as sensitivity, specificity, Matthews correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor.

17.
《Translational oncology》2020,13(11):100816
Merkel cell carcinoma (MCC) is a rare primary cutaneous neuroendocrine carcinoma. About 80% of MCCs are caused by Merkel cell polyomavirus (MCPyV), and about 20% of the tumors arise from severe UV exposure, which produces a more aggressive type of MCC. MCC tends to have an increased incidence rate among elderly and immunosuppressed individuals. At the therapeutic level, sub-classification of MCC through molecular subtyping has emerged as a promising technique for MCC prognosis. In the current study, two consistent, distinct molecular subtypes of MCC were identified using gene expression profiling data. Subtype I MCCs were associated with spliceosome, DNA replication and cellular pathways. On the other hand, genes overexpressed in subtype II were found to be active in the TNF signalling pathway and the MAPK signalling pathway. We proposed different therapeutic targets based on subtype specificity, such as PTCH1, CDKN2A and AURKA for subtype I, and MCL1 and FGFR2 for subtype II. Such findings may help in understanding the intrinsic subtypes of MCC and the pathways involved in the oncogenesis of each subtype, and will further advance the development of specific therapeutic strategies for these MCC subtypes.

18.
Many important cellular processes are performed by molecular machines, composed of multiple proteins that physically interact to execute biological functions. An example is the bacterial peptidoglycan (PG) synthesis machine, responsible for the synthesis of the main component of the cell wall and the target of many contemporary antibiotics. One approach for the identification of essential components of a cellular machine involves the determination of its minimal protein composition. Staphylococcus aureus is a Gram-positive pathogen, renowned for its resistance to many commonly used antibiotics and prevalence in hospitals. Its genome encodes a low number of proteins with PG synthesis activity (9 proteins), when compared to other model organisms, and is therefore a good model for the study of a minimal PG synthesis machine. We deleted seven of the nine genes encoding PG synthesis enzymes from the S. aureus genome without affecting normal growth or cell morphology, generating a strain capable of PG biosynthesis catalyzed only by two penicillin-binding proteins, PBP1 and the bi-functional PBP2. However, multiple PBPs are important in clinically relevant environments, as bacteria with a minimal PG synthesis machinery became highly susceptible to cell wall-targeting antibiotics and host lytic enzymes, and displayed impaired virulence in a Drosophila infection model that depends on the presence of specific peptidoglycan receptor proteins, namely PGRP-SA. The fact that S. aureus can grow and divide with only two active PG synthesizing enzymes shows that most of these enzymes are redundant in vitro and identifies the minimal PG synthesis machinery of S. aureus. However, a complex molecular machine is important in environments other than in vitro growth, as the expendable PG synthesis enzymes play an important role in the pathogenicity and antibiotic resistance of S. aureus.

19.

Background

Cellular respiration is the process by which cells obtain energy from glucose and is a very important biological process in living cells. To carry out cellular respiration, cells need a pathway to store and transport electrons: the electron transport chain. The function of the electron transport chain is to produce a trans-membrane proton electrochemical gradient as a result of oxidation–reduction reactions. In these oxidation–reduction reactions, metal ions play a very important role as electron donors and acceptors. For example, Fe ions are in complex I and complex II, and Cu ions are in complex IV. Therefore, identifying metal-binding sites in electron transporters is an important step in helping biologists better understand the workings of the electron transport chain.

Methods

We propose a method based on Position Specific Scoring Matrix (PSSM) profiles and significant amino acid pairs to identify metal-binding residues in electron transport proteins.

Results

We selected a non-redundant set of 55 metal-binding electron transport proteins as our dataset. The proposed method predicts metal-binding sites in electron transport proteins with an average 10-fold cross-validation accuracy of 93.2% and 93.1% for metal-binding cysteine and histidine, respectively. Compared with the general metal-binding predictor from A. Passerini et al., the proposed method improves sensitivity by over 9% and specificity by 14% on the independent dataset in identifying metal-binding cysteines. For metal-binding histidines, the proposed method improves sensitivity by almost 76% at the same specificity, and the MCC improves from 0.28 to 0.88.

Conclusions

We have developed a novel approach based on PSSM profiles and significant amino acid pairs for identifying metal-binding sites in electron transport proteins. The proposed approach achieved a significant improvement on an independent test set of metal-binding electron transport proteins.

20.
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a newly proposed feature, SCORE. This feature is generated by a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacterial genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
