Similar literature
20 similar articles found
1.
According to their main EC (Enzyme Commission) numbers, enzymes are classified into six main classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. A new method has been developed to predict the enzymatic attribute of proteins by using the functional domain composition to represent a given protein sequence. The advantage of doing so is that both sequence-order-related and function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset of proteins with less than 20% sequence identity to each other, in order to remove homologous bias. The overall success rate thus obtained was 85% in identifying the enzyme family classes (including the identification of nonenzyme protein sequences). This success rate is significantly higher than those obtained by other methods on such a stringent dataset. This indicates that using the functional domain composition to represent protein samples for statistical prediction is very promising and will become a powerful tool in bioinformatics and proteomics.
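The jackknife (leave-one-out) test named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' predictor: the nearest-neighbor rule, the helper names `jackknife_accuracy` and `nearest_neighbor`, and the toy binary domain vectors are all invented for the example.

```python
def jackknife_accuracy(samples, labels, classify):
    """Leave-one-out test: each sample is predicted from all the other samples."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def nearest_neighbor(train_x, train_y, query):
    """Hypothetical stand-in classifier: copy the label of the training
    vector that shares the most domains with the query."""
    def shared_domains(a, b):
        return sum(x & y for x, y in zip(a, b))

    best = max(range(len(train_x)), key=lambda j: shared_domains(train_x[j], query))
    return train_y[best]


# Toy functional-domain-composition vectors (1 = domain present); invented data.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 1]]
y = ["hydrolase", "hydrolase", "ligase", "ligase", "hydrolase"]

acc = jackknife_accuracy(X, y, nearest_neighbor)
print(f"jackknife success rate: {acc:.2f}")
```

Because every sample is held out exactly once, the jackknife result does not depend on a random split, which is one reason it is often preferred for small benchmark datasets.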

2.
3.
Background: The type III secreted effectors (T3SEs) are indispensable proteins for the growth and reproduction of Gram-negative bacteria. In particular, the pathogenesis of Gram-negative bacteria depends on the type III secreted effectors: by injecting T3SEs into a host cell, the bacteria can destroy the host cell's immunity. The high diversity of T3SE sequences and the lack of defined secretion signals make T3SEs difficult to identify and predict. Moreover, the study of the pathological system associated with T3SEs remains a hot topic in bioinformatics. Some computational tools have been developed to meet the growing demand for the recognition of T3SEs and the study of type III secretion systems (T3SS). Although these tools can help biological experiments in certain procedures, there is still room for improvement, even for the current best model, as the existing methods adopt hand-designed features and traditional machine learning methods. Methods: In this study, we propose a powerful predictor based on deep learning methods, called WEDeepT3. Our work consists mainly of three key steps. First, we train word embedding vectors for protein sequences on a large-scale amino acid sequence database. Second, we combine the word vectors with traditional features extracted from protein sequences, such as PSSM, to construct a more comprehensive feature representation. Finally, we construct a deep neural network model for the prediction of type III secreted effectors. Results: The feature representation of WEDeepT3 consists of both word embedding and position-specific features. Working together with convolutional neural networks, the new model achieves superior performance to the state-of-the-art methods, demonstrating the effectiveness of the new feature representation and the powerful learning ability of deep models.
Conclusion: WEDeepT3 exploits both semantic information of k-mer fragments and evolutionary information of protein sequences to accurately differentiate between T3SEs and non-T3SEs. WEDeepT3 is available at bcmi.sjtu.edu.cn/~yangyang/WEDeepT3.html.

4.
Protein–DNA interactions play important roles in many biological processes. To understand the molecular mechanisms of protein–DNA interaction, it is necessary to identify the DNA-binding sites in DNA-binding proteins. In the last decade, computational approaches have been developed to predict protein–DNA-binding sites based solely on protein sequences. In this study, we developed a novel predictor based on the support vector machine algorithm coupled with the maximum relevance minimum redundancy method followed by incremental feature selection. To predict the protein–DNA interaction sites, we incorporated not only features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility, but also five three-dimensional (3D) structural features calculated from PDB data. Feature analysis showed that the 3D structural features indeed contributed to the prediction of DNA-binding sites, and the prediction performance was better with 3D structural features than without them. Analysis of the features from each site also showed that the features of the DNA-binding site itself contribute the most to the prediction. Our prediction method may become a useful tool for identifying DNA-binding sites, and the feature analysis described in this paper may provide useful insights for in-depth investigations into the mechanisms of protein–DNA interaction.
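The incremental feature selection (IFS) step named in this abstract can be sketched as below. The sketch assumes the features have already been ranked (for example by an mRMR-style criterion); the function name `incremental_feature_selection`, the feature names, and the scores returned by `toy_evaluate` are hypothetical and only stand in for a real cross-validated classifier.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Add ranked features one at a time and keep the best-scoring prefix."""
    best_score = float("-inf")
    best_subset = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)  # e.g. cross-validated accuracy of a classifier
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score


# Hypothetical ranking and made-up scores, for illustration only.
ranked = ["conservation", "depth_index", "disorder", "accessibility"]
toy_scores = {1: 0.61, 2: 0.66, 3: 0.65, 4: 0.64}


def toy_evaluate(subset):
    return toy_scores[len(subset)]


selected, score = incremental_feature_selection(ranked, toy_evaluate)
print(selected, score)
```

Only prefixes of the ranked list are evaluated, so a ranking of n features costs n model evaluations rather than the 2^n needed for an exhaustive subset search.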

5.
BQ Li, KY Feng, L Chen, T Huang, YD Cai. PLoS ONE 2012, 7(8): e43927
Prediction of protein-protein interaction (PPI) sites is one of the most challenging problems in computational biology. Although great progress has been made by employing various machine learning approaches with numerous characteristic features, the problem is still far from being solved. In this study, we developed a novel predictor based on the Random Forest (RF) algorithm with the Minimum Redundancy Maximal Relevance (mRMR) method followed by incremental feature selection (IFS). We incorporated features of physicochemical/biochemical properties, sequence conservation, residual disorder, secondary structure and solvent accessibility. We also included five 3D structural features to predict protein-protein interaction sites and achieved an overall accuracy of 0.672997 and an MCC of 0.347977. Feature analysis showed that 3D structural features such as Depth Index (DPX) and surface curvature (SC) contributed most to the prediction of protein-protein interaction sites. Site-specific feature analysis also showed that the features of individual residues from PPI sites contribute most to the determination of protein-protein interaction sites. It is anticipated that our prediction method will become a useful tool for identifying PPI sites, and that the feature analysis described in this paper will provide useful insights into the mechanisms of interaction.
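The two figures reported in this abstract, overall accuracy and the Matthews correlation coefficient (MCC), are both derived from a binary confusion matrix. The sketch below shows the standard formulas; the confusion counts are invented for illustration and are not the study's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all sites that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented confusion counts for a binary interaction-site classifier.
tp, tn, fp, fn = 60, 75, 25, 40
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}, MCC = {mcc(tp, tn, fp, fn):.3f}")
```

MCC ranges from -1 to 1 and stays informative when the two classes are unbalanced, which is why it is commonly reported next to accuracy for site-prediction tasks.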

6.
Plaque morphology and biomechanics are believed to be closely associated with plaque progression. In this paper, we test the hypothesis that integrating morphological and biomechanical risk factors would result in better predictive power for plaque progression. A sample of 374 intravascular ultrasound (IVUS) slices was obtained from 9 patients with IVUS follow-up data. 3D fluid-structure interaction models were constructed to obtain both structural stress/strain and fluid biomechanical conditions. Data for eight morphological and biomechanical risk factors were extracted for each slice. Plaque area increase (PAI) and wall thickness increase (WTI) were chosen as two measures of plaque progression. The progression measures and risk factors were fed to generalized linear mixed models and linear mixed-effect models to perform prediction and correlation analysis, respectively. All combinations of the eight risk factors were tested exhaustively to identify the optimal predictor(s) with the highest prediction accuracy, defined as the sum of sensitivity and specificity. When using a single risk factor, plaque wall stress (PWS) at baseline was the best predictor of plaque progression (PAI and WTI). The optimal predictor among all possible combinations for PAI was PWS + PWSn + Lipid percent + Min cap thickness + Plaque Area (PA) + Plaque Burden (PB) (prediction accuracy = 1.5928), while Wall Thickness (WT) + Plaque Wall Strain (PWSn) + Plaque Area (PA) was the best for WTI (1.2589). This indicated that PAI was a more predictable measure than WTI. The combination including both morphological and biomechanical parameters had improved prediction accuracy compared to predictions using only morphological features.
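The exhaustive search over risk-factor combinations described in this abstract can be sketched as follows. The placeholder names `factor_1` to `factor_8` and the scoring function `toy_score` are hypothetical; in the study the score of a combination is the sum of sensitivity and specificity obtained from a fitted model, which this sketch does not reproduce.

```python
from itertools import combinations


def all_nonempty_subsets(items):
    """Yield every non-empty combination of the given items."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def best_combination(items, score):
    """Return the combination with the highest score."""
    return max(all_nonempty_subsets(items), key=score)


# Placeholder names for eight risk factors.
factors = [f"factor_{i}" for i in range(1, 9)]
n_subsets = sum(1 for _ in all_nonempty_subsets(factors))


def toy_score(subset):
    # Hypothetical score: rewards two made-up informative factors, penalizes size.
    informative = {"factor_1", "factor_3"}
    return len(informative & set(subset)) - 0.01 * len(subset)


best = best_combination(factors, toy_score)
print(n_subsets, best)
```

With eight factors there are 2^8 - 1 = 255 non-empty combinations, so an exhaustive search is cheap here; the same approach becomes impractical once the number of candidate factors grows much larger.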

7.
Summary: Erwinia chrysanthemi is a soft-rot pathogenic enterobacterium that provokes maceration of host plant tissues by producing extracellular cell-wall-degrading enzymes, among which are pectate lyases, pectin methyl esterases, and cellulases. Cell wall degradation in leaves and petiole tissue of infected Saintpaulia ionantha plants has been investigated in order to define the structural and temporal framework of wall deconstruction. The degradation of the major cell wall components, pectins and cellulose, was studied by both classical histochemical techniques (Calcofluor and periodic acid-thiocarbohydrazide-silver proteinate staining) and immunocytochemistry (tissue printing for detection of pectate lyases; monoclonal antibodies JIM5 and JIM7 for detection of pectic substrates). The results show that the bacteria progress within the host plant via the intercellular spaces of the leaf parenchyma and the petiole cortex. Maceration symptoms and secretion of pectate lyases PelA, -D, and -E can be directly correlated to the spread of the bacteria. Wall degradation is very heterogeneous. Loss of reactivity with JIM5 and JIM7 was progressive and/or clearcut. The primary and middle lamella appear to be the most susceptible regions of the wall. The innermost layer of the cell wall frequently resists complete deconstruction. At the wall intersects and around intercellular spaces, resistant domains and highly degraded domains occurred simultaneously. All results lead to the hypothesis that both the spatial organisation of the wall and its accessibility to enzymes are highly variable between regions. The use of mutants lacking pectate lyases PelA, -D, -E or -B, -C confirms the important role that PelA, PelD, and PelE play in the rapid degradation of pectins from the host cell walls. In contrast, PelB and PelC do not seem essential for degradation of the wall, though they can be detected in leaves infected with wild-type bacteria.
With Calcofluor staining, regularly localised cellulose-rich and cellulose-poor domains were observed in pectin-deprived walls.
Abbreviations: MAb, monoclonal antibody; PATAg, periodic acid-thiocarbohydrazide-silver proteinate

8.

Background

Traditionally, it is believed that the native structure of a protein corresponds to a global minimum of its free energy. However, with the growing number of known tertiary (3D) protein structures, researchers have discovered that some proteins can alter their structures in response to a change in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role with respect to the protein function. To this end, we propose a machine learning method for the prediction of the flexible/rigid regions of proteins (referred to as FlexRP); the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and the 3D structure prediction.

Results

The flexible/rigid regions were defined based on a dataset, which includes protein sequences that have multiple experimental structures, and which was previously used to study the structural conservation of proteins. Sequences drawn from this dataset were represented based on feature sets that were proposed in prior research, such as PSI-BLAST profiles, composition vector and binary sequence encoding, and a newly proposed representation based on frequencies of k-spaced amino acid pairs. These representations were processed by feature selection to reduce the dimensionality. Several machine learning methods for the prediction of flexible/rigid regions and two recently proposed methods for the prediction of conformational changes and unstructured regions were compared with the proposed method. The FlexRP method, which applies Logistic Regression and collocation-based representation with 95 features, obtained 79.5% accuracy. The two runner-up methods, which apply the same sequence representation and Support Vector Machines (SVM) and Naïve Bayes classifiers, obtained 79.2% and 78.4% accuracy, respectively. The remaining considered methods are characterized by accuracies below 70%. Finally, the Naïve Bayes method is shown to provide the highest sensitivity for the prediction of flexible regions, while FlexRP and SVM give the highest sensitivity for rigid regions.

Conclusion

A new sequence representation that uses k-spaced amino acid pairs is shown to be the most efficient in the prediction of the flexible/rigid regions of protein sequences. The proposed FlexRP method provides the highest prediction accuracy of about 80%. The experimental tests show that the FlexRP and SVM methods achieved high overall accuracy and the highest sensitivity for rigid regions, while the best quality of the predictions for flexible regions is achieved by the Naïve Bayes method.
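The k-spaced amino acid pair representation mentioned in this abstract can be illustrated with a short sketch. It assumes that a k-spaced pair means two residues separated by exactly k other residues (so k = 0 gives ordinary dipeptides); the function name `k_spaced_pair_frequencies` and the toy sequence are invented, and the sketch does not reproduce the 95 selected features used by FlexRP.

```python
from collections import Counter


def k_spaced_pair_frequencies(sequence, k):
    """Relative frequencies of residue pairs separated by exactly k residues."""
    pairs = Counter(
        sequence[i] + sequence[i + k + 1]
        for i in range(len(sequence) - k - 1)
    )
    total = sum(pairs.values())
    if total == 0:
        return {}  # sequence too short to contain any k-spaced pair
    return {pair: count / total for pair, count in pairs.items()}


# Toy sequence; with k = 1 the pairs are A.D, C.A, D.C, A.D.
freqs = k_spaced_pair_frequencies("ACDACD", k=1)
print(freqs)
```

Concatenating the frequency vectors for several values of k gives a fixed-length description of a sequence regardless of its length, which can then be passed to feature selection and a classifier.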

9.
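Outer membrane proteins (OMPs) are a class of proteins with important biological functions. Predicting OMPs with bioinformatics methods can help predict their secondary and tertiary structures and discover new OMPs in genomes. This paper proposes computing the amino acid composition, the dipeptide composition, and weighted multi-order correlation coefficients of amino acid residue indices from protein sequences, combining the three types of features, and using the support vector machine (SVM) algorithm to identify OMPs. Recognition results were computed for several feature combinations involving four residue indices, and the effects of the order and weight of the correlation coefficients on prediction performance are discussed. Ten-fold cross-validation and independent tests on the datasets show that the combined-feature method reaches maximum recognition accuracies of 96.96% for OMPs and 97.33% for non-OMPs, outperforming several existing methods. Results of identifying OMPs in five bacterial genomes show that the combined-feature method is highly specific, and its accuracy in recognizing OMPs of known structure in the PDB database exceeds 99%. This indicates that the method can serve as an effective tool for screening OMPs within genomes.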

10.
12.
Thosea sinensis Walker (TSW) spreads rapidly and severely damages tea plants. Finding a reliable operational method for identifying TSW-damaged areas via remote sensing has therefore been a focus of the research community. Such methods also make it possible to plan the precise application of pesticides and to prevent the subsequent spread of the pest. In this work, five-band images from a multispectral red-edge camera on an unmanned aerial vehicle (UAV) platform were obtained and used for monitoring TSW in tea plantations. By combining the minimum redundancy maximum relevance (mRMR) method with the selected spectral features, a comprehensive spectral selection strategy was proposed. Then, based on the selected spectral features, three classic machine learning algorithms, random forest (RF), support vector machine (SVM), and k-nearest neighbors (KNN), were used to construct pest monitoring models, which were evaluated and compared. The results showed that the proposed strategy obtained ideal monitoring accuracy using only a combination of a few optimized features (2 or 4). For differentiating healthy and TSW-damaged areas (2-class model), the monitoring accuracies of all three models were above 96%. The RF model used the fewest features, only SAVI and Bandred. For further discriminating the pest incidence levels (3-class model), the monitoring accuracies of all three models were above 80%, and the RF algorithm based on the SAVI, Bandred, VARI_green, and Bandred_edge features achieved the highest accuracy (OAA of 87% and Kappa of 0.79). Considering computational cost and model accuracy, this work recommends the RF model based on a few optimal feature combinations to monitor TSW and distinguish its severity in tea plantations. According to the UAV remote sensing mapping results, the TSW infestation exhibited an aggregated distribution pattern.
The spatial information on occurrence and severity can offer effective guidance for precise control of the pest. In addition, the methods provide a reference for monitoring other leaf-eating pests, effectively improving the management level of plant protection in tea plantations and guaranteeing the yield and quality of tea plantations.

13.
It has been hypothesized that mechanical risk factors may be used to predict future atherosclerotic plaque rupture. Truly predictive methods for plaque rupture and methods to identify the best predictor(s) from all the candidates are lacking in the literature. A novel combination of computational and statistical models based on serial magnetic resonance imaging (MRI) was introduced to quantify the sensitivity and specificity of mechanical predictors and to identify the best candidate for plaque rupture site prediction. Serial in vivo MRI data of carotid plaque from one patient were acquired, with the follow-up scan showing ulceration. 3D computational fluid-structure interaction (FSI) models using both baseline and follow-up data were constructed, and plaque wall stress (PWS), plaque wall strain (PWSn), and flow maximum shear stress (FSS) were extracted from all 600 matched nodal points (100 points per matched slice, baseline matching follow-up) on the lumen surface for analysis. Each of the 600 points was marked "ulcer" or "nonulcer" using the follow-up scan. Predictive statistical models for each of the seven combinations of PWS, PWSn, and FSS were trained using the follow-up data and applied to the baseline data to assess their sensitivity and specificity using the 600 data points for ulcer predictions. Sensitivity of prediction is defined as the proportion of the true positive outcomes that are predicted to be positive. Specificity of prediction is defined as the proportion of the true negative outcomes that are correctly predicted to be negative. Using probability 0.3 as a threshold to infer ulcer occurrence at the prediction stage, the combination of PWS and PWSn provided the best predictive accuracy with (sensitivity, specificity) = (0.97, 0.958). Sensitivity and specificity given by PWS, PWSn, and FSS individually were (0.788, 0.968), (0.515, 0.968), and (0.758, 0.928), respectively.
The proposed computational-statistical process provides a novel method and a framework to assess the sensitivity and specificity of various risk indicators and offers the potential to identify the optimized predictor for plaque rupture using serial MRI with the follow-up scan showing ulceration as the gold standard for method validation. While serial MRI data with actual rupture are hard to acquire, this single-case study suggests that a combination of multiple predictors may provide potential improvement to existing plaque assessment schemes. With large-scale patient studies, this predictive modeling process may provide more solid ground for rupture predictor selection strategies and methods for image-based plaque vulnerability assessment.
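The sensitivity and specificity definitions given in this abstract, together with the probability threshold of 0.3, can be sketched as below. The function name `sensitivity_specificity`, the choice to treat a probability equal to the threshold as a positive prediction, and the toy probabilities are assumptions for illustration; they are not the study's 600 nodal points.

```python
def sensitivity_specificity(probabilities, truths, threshold=0.3):
    """Threshold predicted probabilities and compare with the true outcomes."""
    # Assumption: a probability equal to the threshold counts as a positive call.
    predicted = [p >= threshold for p in probabilities]
    tp = sum(1 for p, t in zip(predicted, truths) if p and t)
    fn = sum(1 for p, t in zip(predicted, truths) if not p and t)
    tn = sum(1 for p, t in zip(predicted, truths) if not p and not t)
    fp = sum(1 for p, t in zip(predicted, truths) if p and not t)
    sensitivity = tp / (tp + fn)  # true positives among all actual positives
    specificity = tn / (tn + fp)  # true negatives among all actual negatives
    return sensitivity, specificity


# Invented ulcer probabilities and outcomes (True = "ulcer" at follow-up).
probs = [0.9, 0.35, 0.1, 0.2, 0.5, 0.05]
truth = [True, True, True, False, False, False]

sens, spec = sensitivity_specificity(probs, truth)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Lowering the threshold generally raises sensitivity at the cost of specificity, so the reported pair of values is tied to the chosen threshold.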

14.
Surface proteins in Gram-positive bacteria are frequently implicated in virulence. We have focused on a group of extracellular cell wall-attached proteins (CWPs) containing an LPXTG motif for cleavage and covalent coupling to peptidoglycan by sortase enzymes. A hidden Markov model (HMM) approach for predicting the LPXTG-anchored cell wall proteins of Gram-positive bacteria was developed and compared against existing methods. The HMM is parsimonious in terms of the number of freely estimated parameters, and it has proved to be very sensitive and specific on a training set of 55 experimentally verified LPXTG-anchored cell wall proteins as well as on reliable data sets of globular and transmembrane proteins. In order to identify such proteins in Gram-positive bacteria, a comprehensive analysis of 94 completely sequenced genomes has been performed. We identified, in total, 860 LPXTG-anchored cell wall proteins, a number that is significantly higher than those obtained by other available methods. Of these proteins, 237 are hypothetical proteins according to the annotation of SwissProt, and 88 had no homologs in the SwissProt database; this might be evidence that they are members of newly identified families of CWPs. The prediction tool, the database with the proteins identified in the genomes, and supplementary material are available online at http://bioinformatics.biol.uoa.gr/CW-PRED/.

15.
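N4-methylcytosine (4mC) is an important epigenetic modification that plays an important role in DNA repair, expression, and replication. Accurate identification of 4mC sites helps in studying its biological functions and mechanisms in depth. Because experimental identification of 4mC sites is both time-consuming and expensive, especially given the rapid accumulation of gene sequences, effective complementary computational methods are urgently needed. An online platform for fast and accurate prediction of 4mC sites is therefore necessary. To date, there has been no comprehensive analysis and evaluation of machine learning (ML) methods with the different features needed to build the required prediction models. We constructed multiple feature sets and, using five ML methods (such as random forest, support vector machine, and ensemble learning), propose a prediction method called "DNA4mcEL". Under randomized 10-fold cross-validation, DNA4mcEL improved prediction accuracy over existing predictors for all six species: C. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, and G. pickeringii. The DNA4mcEL predictor based on this method significantly outperforms existing predictors on this task. We hope that this comprehensive survey and the strategy for building more accurate models can serve as a useful guide to inspire the future development of computational methods for N4-methylcytosine prediction and to speed up the discovery of new N4-methylcytosine sites. The standalone version of DNA4mcEL is freely available at https://github.com/kukuky00/DNA4mcEL.git.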

16.
Structural class characterizes the overall folding type of a protein or its domain. This paper develops an accurate method for in silico prediction of structural classes from low homology (twilight zone) protein sequences. The proposed LLSC-PRED method applies a linear logistic regression classifier and a custom-designed, feature-based sequence representation to provide predictions. The main advantages of LLSC-PRED are its comprehensive representation, which includes 58 features describing the composition and physicochemical properties of the sequences, and the transparency of the prediction model. The representation also includes predicted secondary structure content, thus for the first time exploring the synergy between these two related predictions. Based on tests performed with a large set of 1673 twilight zone domains, the prediction accuracy of LLSC-PRED, which exceeds 62%, is shown to be better than the accuracy of over a dozen recently published competing in silico methods and similar to the accuracy of other, non-transparent classifiers that use the proposed representation.

17.
Fuchs A, Kirschner A, Frishman D. Proteins 2009, 74(4): 857-871
Despite rapidly increasing numbers of available 3D structures, membrane proteins still account for less than 1% of all structures in the Protein Data Bank. Recent high-resolution structures indicate a clearly broader structural diversity of membrane proteins than initially anticipated, motivating the development of reliable structure prediction methods specifically tailored for this class of molecules. One important prediction target capturing all major aspects of a protein's 3D structure is its contact map. Our analysis shows that computational methods trained to predict residue contacts in globular proteins perform poorly when applied to membrane proteins. We have recently published a method to identify interacting alpha-helices in membrane proteins based on the analysis of coevolving residues in predicted transmembrane regions. Here, we present a substantially improved algorithm for the same problem, which uses a newly developed neural network approach to predict helix-helix contacts. In addition to the input features commonly used for contact prediction of soluble proteins, such as windowed residue profiles and residue distance in the sequence, our network also incorporates features that apply to membrane proteins only, such as residue position within the transmembrane segment and its orientation toward the lipophilic environment. The obtained neural network can predict contacts between residues in transmembrane segments with nearly 26% accuracy. It is therefore the first published contact predictor developed specifically for membrane proteins performing with equal accuracy to state-of-the-art contact predictors available for soluble proteins. The predicted helix-helix contacts were employed in a second step to identify interacting helices. For our dataset consisting of 62 membrane proteins of solved structure, we gained an accuracy of 78.1%. 
Because the reliable prediction of helix interaction patterns is an important step in the classification and prediction of membrane protein folds, our method will be a helpful tool in compiling a structural census of membrane proteins.

18.
This paper presents a novel feature vector based on the physicochemical properties of amino acids for predicting protein structural classes. The proposed method is divided into three stages. First, protein sequences are mapped to a discrete time series representation using physicochemical scales. Next, a wavelet-based time-series technique is used to extract features from the mapped amino acid sequence, and a fixed-length feature vector is constructed for classification. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for the prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.

19.
With the development of bioinformatics, more and more protein sequence information has become available. Meanwhile, the number of known protein–protein interactions (PPIs) is still very limited. In this article, we propose a new method for predicting interacting protein pairs using a Bayesian method based on a new feature representation. We trained our model using data on 6,459 PPI pairs from the yeast Saccharomyces cerevisiae core subset. On six species from the DIP database, our model achieves an average prediction accuracy of 93.67%. The results show that our method is superior to other methods in both computing time and prediction accuracy.

20.

Background

Long noncoding RNAs (lncRNAs) are widely involved in the initiation and development of cancer. Although some computational methods have been proposed to identify cancer-related lncRNAs, there is still a demand to improve prediction accuracy and efficiency. In addition, the rapidly updated cancer data and the discovery of new mechanisms also make it possible to improve cancer-related lncRNA prediction algorithms. In this study, we introduce CRlncRC, a novel Cancer-Related lncRNA Classifier that integrates manifold features with five machine-learning techniques.

Results

CRlncRC was built on the integration of four categories of features: genomic, expression, epigenetic, and network. Five learning techniques were exploited to develop an effective classification model: Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), and K-Nearest Neighbors (KNN). Using ten-fold cross-validation, we showed that RF is the best model for classifying cancer-related lncRNAs (AUC = 0.82). The feature importance analysis indicated that epigenetic and network features play key roles in the classification. In addition, compared with other existing classifiers, CRlncRC exhibited better performance in both sensitivity and specificity. We further applied CRlncRC to lncRNAs from the TANRIC (The Atlas of non-coding RNA in Cancer) dataset and identified 121 cancer-related lncRNA candidates. These potential cancer-related lncRNAs showed cancer-related indications, and many of them are supported by convincing literature evidence.

Conclusions

Our results indicate that CRlncRC is a powerful method for identifying cancer-related lncRNAs. Machine-learning-based integration of multiple features, especially epigenetic and network features, contributed greatly to cancer-related lncRNA prediction. RF outperforms the other learning techniques in terms of model sensitivity and specificity. In addition, using the CRlncRC method, we predicted a set of cancer-related lncRNAs, all of which displayed a strong relevance to cancer and may serve as a valuable starting point for further functional studies of cancer-related lncRNAs.
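The evaluation reported for CRlncRC, ten-fold cross-validation scored by AUC, can be sketched with two small helpers. This is a generic illustration and not the CRlncRC code: the function names `k_fold_indices` and `auc`, the strided fold assignment, and the toy scores are assumptions. The AUC is computed as the fraction of positive-negative pairs that the classifier ranks correctly, with ties counted as half.

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indices 0..n_samples-1 into k disjoint test folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs negatives."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented classifier scores; label 1 = cancer-related, 0 = not.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

folds = k_fold_indices(25, k=10)
print(len(folds), auc(scores, labels))
```

In a full evaluation each fold would serve once as the test set while the model is trained on the remaining nine, and the held-out scores would then be passed to `auc`.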