Similar articles
20 similar articles found.
1.
Hu LL  Wan SB  Niu S  Shi XH  Li HP  Cai YD  Chou KC 《Biochimie》2011,93(3):489-496
Palmitoylation is a universal and important lipid modification involved in a series of basic cellular processes, such as membrane trafficking, protein stability and protein aggregation. With the avalanche of new protein sequences generated in the post-genomic era, it is highly desirable to develop computational methods for rapidly and effectively identifying the potential palmitoylation sites of uncharacterized proteins, so as to provide timely information for revealing the mechanism of protein palmitoylation. Using the Incremental Feature Selection approach based on amino acid factors, conservation, disorder features, and specific features of palmitoylation sites, a new predictor named IFS-Palm was developed. The overall success rate achieved by the jackknife test on a newly constructed benchmark dataset was 90.65%. An in-depth analysis showed that palmitoylation was intimately correlated with the features of the upstream residue directly adjacent to the cysteine site, as well as with the conservation of cysteine. Meanwhile, protein disorder regions might also play an important role in this post-translational modification. These findings may provide useful insights for revealing the mechanisms of palmitoylation.
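
Several abstracts in this list combine incremental feature selection (IFS) with a jackknife (leave-one-out) test. The sketch below illustrates the general idea only: features are assumed to be ranked already (for example by mRMR), and a 1-nearest-neighbor classifier with squared Euclidean distance stands in for whatever classifier a given paper used. The function names and the toy data are hypothetical and are not taken from any of the cited papers.

```python
import math

def loo_1nn_accuracy(X, y, cols):
    """Jackknife (leave-one-out) accuracy of a 1-nearest-neighbor
    classifier that only looks at the feature columns in `cols`."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = math.inf, None
        for j in range(len(X)):
            if i == j:
                continue  # the held-out sample never votes for itself
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_dist:
                best_dist, best_label = d, y[j]
        correct += best_label == y[i]
    return correct / len(X)

def incremental_feature_selection(X, y, ranked):
    """Add ranked features one at a time and keep the prefix with the
    highest jackknife accuracy (the shortest prefix wins ties)."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        acc = loo_1nn_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranked[:best_k], best_acc

# Hypothetical toy data: column 0 separates the classes, column 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 0.9], [1.2, 9.2]]
y = [0, 0, 0, 1, 1, 1]

selected, acc = incremental_feature_selection(X, y, ranked=[0, 1])
print("selected features:", selected, "jackknife accuracy:", acc)
```

On this toy data the noisy second column lowers the leave-one-out accuracy, so the search keeps only the first feature. Real studies evaluate hundreds of ranked features in the same way and report the size of the best-scoring prefix as the optimal feature set.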

2.
Drug–drug interaction (DDI) defines a situation in which one drug affects the activity of another when both are administered together. DDI is a common cause of adverse drug reactions and sometimes also leads to improved therapeutic effects. Therefore, it is of great interest to discover novel DDIs according to their molecular properties and mechanisms in a robust and rigorous way. This paper attempts to predict effective DDIs using the following properties: (1) chemical interaction between drugs; (2) protein interactions between the targets of drugs; and (3) target enrichment of KEGG pathways. The data consisted of 7323 pairs of DDIs collected from the DrugBank and 36,615 pairs of drugs constructed by randomly combining two drugs. Each drug pair was represented by 465 features derived from the aforementioned three categories of properties. The random forest algorithm was adopted to train the prediction model. Some feature selection techniques, including minimum redundancy maximum relevance and incremental feature selection, were used to extract key features as the optimal input for the prediction model. The extracted key features may help to gain insights into the mechanisms of DDIs and provide some guidelines for the relevant clinical medication developments, and the prediction model can give new clues for identification of novel DDIs.

3.
Li BQ  Hu LL  Niu S  Cai YD  Chou KC 《Journal of Proteomics》2012,75(5):1654-1665
S-nitrosylation (SNO) is one of the most important and universal post-translational modifications (PTMs), regulating various cellular functions and signaling events. Identification of the exact S-nitrosylation sites in proteins may facilitate the understanding of the molecular mechanisms and biological function of S-nitrosylation. Unfortunately, traditional experimental approaches for detecting S-nitrosylation sites are often laborious and time-consuming; computational methods could overcome this drawback. In this work, we developed a novel predictor based on the nearest neighbor algorithm (NNA) with the maximum relevance minimum redundancy (mRMR) method followed by incremental feature selection (IFS). Features of physicochemical/biochemical properties, sequence conservation, residual disorder, amino acid occurrence frequency, secondary structure and solvent accessibility were used to represent the peptides concerned. Feature analysis showed that all features except residual disorder affected identification of the S-nitrosylation sites. Site-specific feature analysis also showed that features of sites away from the central cysteine might contribute to S-nitrosylation site determination in a subtle manner. It is anticipated that our prediction method may become a useful tool for identifying protein S-nitrosylation sites and that the feature analysis described in this paper may provide useful insights for in-depth investigation into the mechanism of S-nitrosylation.

4.
Synthetic lethality is the synthesis of mutations leading to cell death. Tumor-specific synthetic lethality has been targeted in research to improve cancer therapy. With the advances of techniques in molecular biology, such as RNAi and CRISPR/Cas9 gene editing, efforts have been made to systematically identify synthetic lethal interactions, especially for frequently mutated genes in cancers. However, elucidating the mechanism of synthetic lethality remains a challenge because of the complexity of its influencing conditions. In this study, we proposed a new computational method to identify critical functional features that can accurately predict synthetic lethal interactions. This method incorporates several machine learning algorithms and encodes protein-coding genes by an enrichment system derived from gene ontology terms and Kyoto Encyclopedia of Genes and Genomes pathways to represent their functional features. We built a random forest-based prediction engine by using 2120 selected features and obtained a Matthews correlation coefficient of 0.532. We examined the top 15 features and found that most of them have potential roles in synthetic lethality according to previous studies. These results demonstrate the ability of our proposed method to predict synthetic lethal interactions and provide a basis for further characterization of these particular genetic combinations.
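
Several entries report a Matthews correlation coefficient (MCC) alongside or instead of accuracy. As a reminder of what that number measures, here is a minimal sketch of the standard binary MCC computed from confusion-matrix counts. Returning 0.0 when the denominator vanishes is a common convention assumed here, not something stated in the abstracts, and the counts in the example are invented for illustration.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return numerator / denominator if denominator else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Invented, imbalanced example: a classifier that always answers
# "negative" on 5 positives and 95 negatives.
counts = dict(tp=0, tn=95, fp=0, fn=5)
print("accuracy:", accuracy(**counts))
print("MCC:", matthews_corrcoef(**counts))
```

The example shows why MCC is preferred on imbalanced data such as negative pairs generated by random combination: the trivial classifier scores 95% accuracy while its MCC is 0.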

5.
Information on protein quaternary structure can help to understand the biological functions of proteins. Because wet-lab experiments are both time-consuming and costly, we adopt a novel computational approach to assign proteins to 10 kinds of quaternary structures. After coding each protein by its biochemical and physicochemical properties, feature selection was carried out using the Incremental Feature Selection (IFS) method. The optimal feature set thus obtained consisted of 97 features, with which the prediction model was built. The overall prediction success rate evaluated by the jackknife test is 74.90%, much higher than the 10% (1/10) expected from a random guess. Further feature analysis indicates that protein secondary structure contributes most to the prediction of protein quaternary structure.

6.
Hematopoiesis is a complicated process involving a series of biological sub-processes that lead to the formation of various blood components. A widely accepted model of early hematopoiesis proceeds from long-term hematopoietic stem cells (LT-HSCs) to multipotent progenitors (MPPs) and then to lineage-committed progenitors. However, the molecular mechanisms of early hematopoiesis have not been fully characterized. In this study, we applied a computational strategy to identify the gene expression signatures distinguishing three types of closely related hematopoietic cells collected in recent studies: (1) hematopoietic stem cell/multipotent progenitor cells; (2) LT-HSCs; and (3) hematopoietic progenitor cells. Each cell in these cell types was represented by its gene expression profile across a total of 20,475 genes. The expression features were analyzed by a Monte-Carlo Feature Selection (MCFS) method, resulting in a feature list. Then, incremental feature selection (IFS) and a support vector machine (SVM) optimized with the sequential minimal optimization (SMO) algorithm were employed to identify the optimal classifier, which had the highest Matthews correlation coefficient (MCC) value of 0.889 and used 6698 features to represent cells. In addition, through an updated program of the MCFS method, seventeen decision rules were obtained, which classify the three cell types with an overall accuracy of 0.812. Using a literature review, both the rules and the top features used for building the optimal classifier were confirmed to be commonly used or potential biological markers for distinguishing the three cell types of HSPCs. This article is part of a Special Issue entitled: Accelerating Precision Medicine through Genetic and Genomic Big Data Analysis edited by Yudong Cai & Tao Huang.

7.
Glycation is a chemical reaction by which a sugar molecule bonds with a protein without the help of enzymes. It is often a cause of disease, and knowledge about glycation is therefore very important. In this paper, we present iProtGly‐SS, a protein lysine glycation site identification method based on features extracted from sequence and secondary structural information. In the experiments, we found the best combination of feature groups: Amino Acid Composition, Secondary Structure Motifs, and Polarity. We trained our model with a support vector machine classifier on an optimal set of features chosen by a group-based forward feature selection technique. On standard benchmark datasets, our method significantly outperforms existing methods for glycation prediction. A web server for iProtGly‐SS is implemented and publicly available at http://brl.uiu.ac.bd/iprotgly-ss/ .

8.
Lysine acetylation and ubiquitination are two primary post-translational modifications (PTMs) in most eukaryotic proteins. Lysine residues are targets for both types of PTMs, resulting in different cellular roles. With the increasing availability of protein sequences and PTM data, it is challenging to distinguish the two types of PTMs on lysine residues. Experimental approaches are often laborious and time consuming. There is an urgent need for computational tools to distinguish between lysine acetylation and ubiquitination. In this study, we developed a novel method, called DAUFSA (distinguish between lysine acetylation and lysine ubiquitination with feature selection and analysis), to discriminate ubiquitinated and acetylated lysine residues. The method incorporated several types of features: PSSM (position-specific scoring matrix) conservation scores, amino acid factors, secondary structures, solvent accessibilities, and disorder scores. By using the mRMR (maximum relevance minimum redundancy) method and the IFS (incremental feature selection) method, an optimal feature set containing 290 features was selected from all incorporated features. A dagging-based classifier constructed by the optimal features achieved a classification accuracy of 69.53%, with an MCC of .3853. An optimal feature set analysis showed that the PSSM conservation score features and the amino acid factor features were the most important attributes, suggesting differences between acetylation and ubiquitination. Our study results also supported previous findings that different motifs were employed by acetylation and ubiquitination. The feature differences between the two modifications revealed in this study are worthy of experimental validation and further investigation.

9.

Background

Previous studies on protein-DNA interaction mostly focused on the bound structure of DNA-binding proteins but few paid enough attention to the unbound structures. As more new proteins are discovered, it is useful and imperative to develop algorithms for the functional prediction of unbound proteins. In our work, we apply an alpha shape model to represent the surface structure of the protein-DNA complex and extract useful statistical and geometric features, and use structural alignment and support vector machines for the prediction of unbound DNA-binding proteins.

Results

The performance of our method is evaluated by discriminating a set of 104 DNA-binding proteins from 401 non-DNA-binding proteins. In the same test, the proposed method outperforms the other method using conditional probability. The results achieved by our proposed method (precision, 83.33%; accuracy, 86.53%; and MCC, 0.5368) demonstrate its good performance.

Conclusions

In this study we develop an effective method for the prediction of protein-DNA interactions based on statistical and geometric features and support vector machines. Our results show that interface surface features play an important role in protein-DNA interaction. Our technique is able to predict unbound DNA-binding proteins and to discriminate DNA-binding proteins from proteins that bind with other molecules.

10.
In this study, predictors are developed for protein submitochondrial locations based on various sequence features. Information about the submitochondrial location of a mitochondrial protein can provide a much better understanding of its function. We use ten representative models of protein samples, such as pseudo amino acid composition, dipeptide composition, functional domain composition, the combined discrete model based on prediction of solvent accessibility and secondary structure elements, the discrete model of pairwise sequence similarity, etc. We construct a predictor based on support vector machines (SVMs) for each representative model. The overall prediction accuracy in the leave-one-out cross-validation test obtained by the predictor based on the discrete model of pairwise sequence similarity is 1% better than the best existing computational system for this problem. Moreover, we develop a method based on ordered weighted averaging (OWA), which is one of the data fusion operators. OWA is applied to the 11 best SVM-based classifiers constructed from various sequence features. This method is called Mito-Loc. The overall leave-one-out cross-validation accuracy obtained by Mito-Loc is about 95%. This indicates that our proposed approach (Mito-Loc) is superior to the best existing approach reported so far.
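
Ordered weighted averaging (OWA) can be illustrated in a few lines. The sketch below shows the operator itself and one plausible way to fuse per-class scores from several classifiers with it. The weights, the scores and the class labels are hypothetical, and how Mito-Loc actually derives its weights is not stated in the abstract.

```python
def owa(values, weights):
    """Ordered weighted average: the weights attach to rank positions
    (largest value first), not to particular sources."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def fuse(scores_by_class, weights):
    """Aggregate each class's classifier scores with OWA and return
    the class with the highest fused score."""
    fused = {label: owa(scores, weights)
             for label, scores in scores_by_class.items()}
    return max(fused, key=fused.get)

# Hypothetical scores from three classifiers for two candidate locations.
scores = {
    "inner membrane": [0.9, 0.6, 0.7],
    "matrix": [0.8, 0.8, 0.1],
}
print(fuse(scores, weights=[0.5, 0.3, 0.2]))
```

The choice of weights moves the operator between familiar extremes: all weight on the first position gives the maximum, all weight on the last gives the minimum, and equal weights give the arithmetic mean.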

11.
Protein oxidation is a ubiquitous post-translational modification that plays important roles in various physiological and pathological processes. Because protein oxidation can also occur as an experimental artifact or be caused by oxygen in the air during sample collection and analysis, and because it is both time-consuming and expensive to determine protein oxidation sites purely by biochemical experiments, it would be of great benefit to develop in silico methods for rapidly and effectively identifying protein oxidation sites. In this study, we developed a computational method to address this problem. Our method was based on the nearest neighbor algorithm, into which the maximum relevance minimum redundancy and incremental feature selection approaches were incorporated. From the initial 735 features, 16 features were selected as the optimal feature set. Of these 16 optimized features, 10 were associated with the position-specific scoring matrix conservation scores, three with the amino acid factors, one with the propensity of conservation of residues on protein surface, one with the side chain count of carbon atom deviation from mean, and one with the solvent accessibility. Our prediction model achieved an overall success rate of 75.82%, indicating that it is encouraging and promising for practical applications. Also, the 16 optimal features obtained through this study may provide useful clues and insights for an in-depth understanding of the mechanism of protein oxidation.

12.
The study of rat proteins is an indispensable task in experimental medicine and drug development. The function of a rat protein is closely related to its subcellular location. Based on this concept, we construct a benchmark rat proteins dataset and develop a combined approach for predicting the subcellular localization of rat proteins. From the protein primary sequence, multiple sequential features are obtained by using discrete Fourier analysis, a position conservation scoring function and increment of diversity, and these sequential features are selected as input parameters of the support vector machine. By the jackknife test, the overall success rate of prediction is 95.6% on the rat proteins dataset. When our method is applied to the apoptosis proteins dataset and the Gram-negative bacterial proteins dataset with the jackknife test, the overall success rates are 89.9% and 96.4%, respectively. These results indicate that our proposed method is quite promising and may play a complementary role to the existing predictors in this area.

13.
Newcastle disease virus (NDV), an avian orthoavulavirus, is the causative agent of Newcastle disease and can cause epidemics when the disease is not treated. Several vaccines based on attenuated and inactivated viruses have been reported previously, but they are rendered ineffective over time by changes in the viral genome. Therefore, we aimed to develop an effective multi-epitope vaccine against the haemagglutinin neuraminidase (HN) protein of 26 NDV strains from Pakistan through modern immunoinformatic approaches. As a result, a vaccine chimaera was constructed by combining T-cell and B-cell epitopes with the appropriate linkers and adjuvant. The designed vaccine was highly immunogenic, non-allergenic and antigenic; therefore, the potential 3D structure of the multi-epitope vaccine was constructed, refined and validated. A molecular docking study of the multi-epitope vaccine candidate with the chicken Toll-like receptor-4 indicated successful binding. An in silico immunological simulation was used to evaluate the candidate vaccine's ability to elicit an effective immune response. According to the computational studies, the proposed multi-epitope vaccine is physically stable and may induce immune responses, which suggests that it is a strong candidate against 26 Newcastle disease virus strains from Pakistan.

14.
The phylogenetic relationship among the three genera of the family Streptomycetaceae was examined using the small and large subunit ribosomal RNA genes, and the gyrB, rpoB, trpB, atpD and recA genes. The total stretches of the analyzed ribosomal genes were 4.2 kb, and those of the five protein coding genes were 4.5 kb. The resultant phylogenetic trees confirmed that each genus formed an independent clade in the majority of cases. The G+C contents of rRNA genes were 56.9-58.9 mol%, and those of protein coding genes were 65.4-72.4 mol%, the latter being closer to those of the genomic DNAs. The average nucleotide sequence identities between the organisms were 94.1-96.4% for rRNA genes and 85.7-90.6% for protein coding genes, indicating that protein coding genes can give higher resolution than rRNA genes. In addition, the protein coding gene trees were more stable than the rRNA gene trees, supported by higher bootstrap values and other treeing algorithms. Moreover, the genome data of six Streptomyces species indicated that many protein coding genes exhibited higher correlations with genome relatedness. The combined gene sequences were also shown to give a better resolution with higher stability than any single gene, though not necessarily more correlated with genome relatedness. It is evident from this study that rRNA gene based phylogeny can be misleading, and that protein coding genes have a number of advantages over rRNA genes as phylogenetic markers, including a high correlation with genome relatedness.
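
The two summary statistics quoted above, G+C content and pairwise nucleotide identity, are simple to compute from sequences. The sketch below uses one common definition of each: identity is counted over aligned columns in which neither sequence has a gap. The abstract does not say which definition the authors used, so this is an assumption, and the sequences are invented.

```python
def gc_content(seq):
    """G+C content in percent over the unambiguous bases of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T/U bases")
    return 100.0 * gc / total

def percent_identity(a, b):
    """Identity in percent between two pre-aligned sequences, counted
    over columns where neither sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

# Invented example sequences, already aligned.
print(gc_content("GGCCAATT"))
print(percent_identity("ACGT-A", "ACCTTA"))
```

Other identity definitions divide by the full alignment length or by the shorter sequence length, which gives slightly different percentages for the same alignment.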

15.
In this paper, we intend to predict protein structural classes (α, β, α+β, or α/β) for low-homology data sets. Two widely used data sets were employed, 1189 (containing 1092 proteins) and 25PDB (containing 1673 proteins), with sequence homology of 40% and 25%, respectively. We propose to decompose the chaos game representation of proteins into two kinds of time series. Then, a novel and powerful nonlinear analysis technique, recurrence quantification analysis (RQA), is applied to analyze these time series. For a given protein sequence, a total of 16 characteristic parameters can be calculated with RQA, which are treated as the feature representation of the protein sequence. Based on this feature representation, the structural class of each protein is predicted with Fisher's linear discriminant algorithm. The jackknife test is used to test and compare our method with other existing methods. The overall accuracies with the step-by-step procedure are 65.8% and 64.2% for the 1189 and 25PDB data sets, respectively. With the widely used one-against-others procedure, we compare our method with five other existing methods. In particular, the overall accuracies of our method are 6.3% and 4.1% higher for the two data sets, respectively. Furthermore, only 16 parameters are used in our method, fewer than in other methods. This suggests that the current method may play a complementary role to the existing methods and is promising for the prediction of protein structural classes.

16.
The problem of predicting enzymes and non-enzymes from protein sequence information is still open in bioinformatics. It is becoming more important as the amount of sequence information grows exponentially over time. We describe a novel approach for predicting whether a protein is an enzyme or a non-enzyme from its amino-acid sequence using an artificial neural network (ANN). Using 61 sequence-derived features alone, we achieved 79% correct prediction of enzymes/non-enzymes (in a set of 660 proteins). For the complete set of 61 parameters using 5-fold cross-validated classification, the ANN model yields a superior model (accuracy = 78.79 ± 6.86%, Q(pred) = 74.734 ± 17.08%, sensitivity = 84.48 ± 6.73%, specificity = 77.13 ± 13.39%). The second ANN module is based on the PSSM matrix. Using the same 5-fold cross-validation set, this ANN model predicts enzymes/non-enzymes with higher accuracy (accuracy = 80.37 ± 6.59%, Q(pred) = 67.466 ± 12.41%, sensitivity = 0.9070 ± 3.37%, specificity = 74.66 ± 7.17%).
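
Figures of the form "mean ± spread" over cross-validation folds, as quoted above, are typically produced by computing each metric per fold and then summarizing. The sketch below does this for accuracy, sensitivity and specificity. It assumes the spread is the sample standard deviation across folds, which the abstract does not confirm, it omits Q(pred) because the abstract does not define it, and the fold counts are invented.

```python
from statistics import mean, stdev

def fold_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity (in percent) for one fold."""
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": 100.0 * tp / (tp + fn),   # true-positive rate
        "specificity": 100.0 * tn / (tn + fp),   # true-negative rate
    }

def summarize(folds):
    """Map each metric name to (mean, sample standard deviation)
    across the folds; needs at least two folds."""
    per_fold = [fold_metrics(*counts) for counts in folds]
    return {
        name: (mean(m[name] for m in per_fold),
               stdev(m[name] for m in per_fold))
        for name in per_fold[0]
    }

# Invented (tp, tn, fp, fn) counts for two folds.
folds = [(40, 40, 10, 10), (45, 35, 15, 5)]
for name, (avg, sd) in summarize(folds).items():
    print(f"{name} = {avg:.2f} ± {sd:.2f}%")
```

With only five folds the standard deviation is itself a noisy estimate, which is one reason the spreads quoted in such abstracts can be large.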

17.
Protein–protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein–protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein–protein complexes to their binding affinities using machine learning approaches. We set up a database of 185 protein–protein complexes for which the interacting pairs are heterodimers and experimental binding affinities are available. We also developed a set of 610 features from the sequences of protein complexes and used the Ranker search method, which combines an Attribute evaluator with the Ranker method, to select specific features. We analyzed several machine learning algorithms to discriminate protein‐protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10‐fold cross‐validation accuracy of 76.1% with a combination of nine features using support vector machines. Further, we observed an accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein–protein interaction networks and human–pathogen interactions based on the strength of interactions. Proteins 2014; 82:2088–2096. © 2014 Wiley Periodicals, Inc.

18.
We have developed an entirely sequence-based method that identifies and integrates relevant features that can be used to assign proteins of unknown function to functional classes, and enzyme categories for enzymes. We show that strategies for the elucidation of protein function may benefit from a number of functional attributes that are more directly related to the linear sequence of amino acids, and hence easier to predict, than protein structure. These attributes include features associated with post-translational modifications and protein sorting, but also much simpler aspects such as the length, isoelectric point and composition of the polypeptide chain.

19.
Zheng Wu  Ming Lu  Tingting Li 《Amino acids》2014,46(8):1919-1928
Tyrosine phosphorylation plays crucial roles in numerous physiological processes. The level of phosphorylation depends on the combined action of protein tyrosine kinases and protein tyrosine phosphatases. Detection of possible phosphorylation and dephosphorylation sites can provide useful information for functional studies of the relevant proteins. Several studies have focused on the identification of protein tyrosine kinase substrates. However, compared with that of protein tyrosine kinases, the prediction of substrates of protein tyrosine phosphatases, which are involved in balancing the protein phosphorylation level, lags behind. This paper describes a method that uses the k-nearest neighbor algorithm to identify the substrate sites of three protein tyrosine phosphatases based on the sequence features of manually collected dephosphorylation sites. In the performance evaluation, both sensitivities and specificities reached above 75% for all three protein tyrosine phosphatases. Finally, the method was applied to a set of known tyrosine phosphorylation sites to search for candidate substrates.

20.
The North American mid-continent population of lesser snow geese (Anser caerulescens caerulescens) breeds in coastal areas of the Hudson Bay region. Breeding success has been highly variable, particularly in recent decades. The availability of long-term data sets of weather and the breeding success of geese allowed us to determine whether climatic variables in spring and early summer (May–June) are reliable predictors of different attributes of the reproductive biology of snow geese. A large region of strong anomalous cooling in north-eastern North America has been the dominant anomalous climatic feature since the mid-1970s. The cooling, which becomes established during winter, persists into spring and early summer when migration, nesting and hatching of geese are occurring. Redundancy analysis (RDA) of the data sets was performed to identify dominant correlations and regression relationships between climatic and goose variables. Individual goose response variables were further explored with stepwise multiple regression and bipartial regression. The selected climatic data explained 96.7% of the year-to-year variance in the goose data. The first four orthogonal axes out of seven possible axes explained 92.2% of the total variance. Date of last snow on the ground and mean daily temperature from 6 to 20 May formed the lowest and highest predictor scores, respectively. Initiation date and hatching date at the low end and total clutch size and clutch size at hatch at the high end were associated with these extremes, particularly in certain years. Days of freezing rain in May and total rainfall were correlated with nest failure. Bivariate correlation/regression showed that the most parsimonious model for nest initiation day was based on four climatic predictors, for hatching day four predictors, and for clutch size at hatch, nine predictors.
Both the multiple regression analyses and the redundancy analyses confirm the high degree of predictability of goose reproductive variables from selected climatic variables. As discussed, the correlations reflect both direct and indirect effects of climate on the reproductive biology of geese. The correlations are strongest in the early season and weaken by early summer.
