Similar literature
 20 similar articles found (search time: 31 ms)
1.
Fang Y, Guo Y, Feng Y, Li M. Amino Acids, 2008, 34(1):103-109
Summary. DNA-binding proteins play a pivotal role in gene regulation, so an automated and efficient method for the timely identification of novel DNA-binding proteins is vitally important. In this study, we propose a method that predicts DNA-binding proteins from the primary sequences of proteins alone. DNA-binding proteins were encoded by auto-cross covariance transform, pseudo-amino acid composition, and dipeptide composition, both individually and in the different combinations of the three encodings; the resulting feature matrices were then fed to support vector machine classifiers to predict DNA-binding proteins. All modules were trained and validated by the jackknife cross-validation test. Comparing the performance of these alternative modules, the best result was obtained with pseudo-amino acid composition, with an overall accuracy of 96.6% and a sensitivity of 90.7%. The results suggest that novel DNA-binding proteins can be predicted efficiently using only the primary sequences. Authors' address: Menglong Li, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, P.R. China
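
Of the three encodings named in this abstract, dipeptide composition is the simplest to illustrate. The following is a minimal Python sketch, not the authors' implementation; the function name and the decision to skip pairs containing non-standard residues are assumptions made here for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def dipeptide_composition(seq):
    """Return a 400-dimensional vector of dipeptide frequencies (hypothetical helper)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    # Slide over adjacent residue pairs; assumption: pairs with
    # non-standard residues are ignored rather than raising an error.
    for a, b in zip(seq, seq[1:]):
        if a + b in counts:
            counts[a + b] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]
```

Each protein, whatever its length, is mapped to a fixed 400-element vector, which is what allows it to be passed to a classifier such as a support vector machine.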

2.
Predicting the cofactors of oxidoreductases plays an important role in inferring their catalytic mechanism. Feature extraction is a critical part of prediction systems, requiring raw sequence data to be transformed into appropriate numerical feature vectors while minimizing information loss. In this paper, we present an amino acid composition distribution method for extracting useful features from the primary sequence, with the k-nearest neighbor algorithm as the classifier. The overall prediction accuracy evaluated by 10-fold cross-validation reached 90.74%. Compared with eight other feature extraction methods, our method improved the overall prediction accuracy by 3.49% to 15.74%. Our experimental results confirm that the proposed method is very useful and may be used for other bioinformatics predictions. Interestingly, when features extracted by our method and Chou's amphiphilic pseudo-amino acid composition were combined, the overall accuracy reached 92.53%.

3.
Apoptosis proteins are very important for understanding the mechanism of programmed cell death. The localization of an apoptosis protein can provide valuable information about its molecular function, but predicting that localization is a challenging task. In our previous work we proposed an increment of diversity (ID) method using protein sequence information for this prediction task. In this work, based on the concept of Chou's pseudo-amino acid composition [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Genet. (Erratum: Chou, K.C., 2001, vol. 44, 60) 43, 246-255; Chou, K.C., 2005. Using amphiphilic pseudo-amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19], a different pseudo-amino acid composition using hydropathy distribution information is introduced. A novel ID_SVM algorithm, which combines ID with a support vector machine (SVM), is proposed. This method is applied to three data sets (317 apoptosis proteins, 225 apoptosis proteins and 98 apoptosis proteins). In jackknife tests it achieves higher predictive success rates than the previous algorithms.
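
The abstract does not define the increment of diversity. The sketch below assumes the definition commonly given in the literature, D(X) = N log N - Σ n_i log n_i for a count vector X with total N, and ID(X, Y) = D(X + Y) - D(X) - D(Y). The natural logarithm and the function names are assumptions for illustration, not details taken from the paper.

```python
import math

def diversity(counts):
    """Diversity measure D = N*log(N) - sum(n_i*log(n_i)) of a count vector."""
    n = sum(counts)
    if n == 0:
        return 0.0
    # Terms with a zero count contribute nothing (0*log 0 is taken as 0).
    return n * math.log(n) - sum(c * math.log(c) for c in counts if c > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller values mean more similar sources."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)
```

Under this definition, two sources with the same relative composition give an increment of zero, and a query would be assigned to the class whose reference counts yield the smallest increment.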

4.
Du P, Wang X, Xu C, Gao Y. Analytical Biochemistry, 2012, 425(2):117-119
The pseudo-amino acid composition has been widely used to convert complicated protein sequences of various lengths into fixed-length digital feature vectors while keeping considerable sequence-order information. However, so far the only software available to the public is the web server PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC), which has some limitations in dealing with large-scale datasets. Here, we propose a new cross-platform stand-alone software program, called PseAAC-Builder (http://www.pseb.sf.net), which can be used to generate various modes of Chou's pseudo-amino acid composition in a much more efficient and flexible way. It is anticipated that PseAAC-Builder may become a useful tool for studying various protein attributes.
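
To make the idea of a fixed-length vector that keeps sequence-order information concrete, here is a minimal Python sketch of the basic (type 1) pseudo-amino acid composition as it is usually described. It is not the code of PseAAC-Builder or of the PseAAC web server. The function name, the parameter names `lam` and `weight`, and the toy two-letter property table in the usage note are assumptions; real use would pass standardized physicochemical property tables for the 20 amino acids.

```python
def pseaac(seq, properties, lam=2, weight=0.05):
    """Type 1 pseudo-amino acid composition (illustrative sketch).

    properties: list of dicts, each mapping residue -> property value
                (assumed to be standardized by the caller).
    lam:        number of sequence-order correlation tiers.
    weight:     weight of the correlation factors.
    """
    if not 0 < lam < len(seq):
        raise ValueError("lam must satisfy 0 < lam < len(seq)")
    alphabet = sorted(properties[0])
    n = len(seq)
    # Conventional composition: normalized occurrence frequencies.
    freqs = [seq.count(a) / n for a in alphabet]
    # k-th tier correlation factor: mean squared property difference
    # between residues that are k positions apart.
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(n - k):
            total += sum((p[seq[i]] - p[seq[i + k]]) ** 2
                         for p in properties) / len(properties)
        thetas.append(total / (n - k))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

With the toy table `[{"A": 0.0, "B": 1.0}]`, the call `pseaac("ABAB", [{"A": 0.0, "B": 1.0}], lam=1, weight=0.5)` returns three components that each equal one third: two composition terms and one order-correlation term.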

5.
Pattern recognition receptors (PRRs) play a key role in the innate immune response by recognizing pathogen-associated molecular patterns derived from a diverse collection of microbial pathogens. PRRs form a superfamily of proteins related to host health and disease. Thus, prediction of the PRR family might supply biologically significant information for the functional annotation of PRRs and the development of novel drugs. In this paper, a computational method is proposed for predicting the families of PRRs. The prediction was performed on the basis of amino acid composition and pseudo-amino acid composition (PseAAC) from the primary sequences of proteins using support vector machines. A non-redundant dataset consisting of 332 PRRs in seven families was constructed for training and testing. It was demonstrated that the different families of PRRs were quite closely correlated with amino acid composition as well as PseAAC. In the jackknife test, the overall accuracies of the amino acid composition-based and PseAAC-based classifiers reached 96.1% and 97.9%, respectively. The results indicate that families of PRRs are predictable with high accuracy. It is anticipated that this computational method might be a powerful tool for the automated assignment of families of PRRs.

6.
Nanni L, Lumini A. Amino Acids, 2008, 34(4):653-660
Given a protein that is localized in the mitochondria, it is very important to know its submitochondria localization in order to understand its function. In this work, we propose a submitochondria localizer whose feature extraction method is based on Chou's pseudo-amino acid composition. The pseudo-amino acid based features are obtained by combining pseudo-amino acid compositions with hundreds of amino-acid indices and amino-acid substitution matrices; from this huge set of features a small set of 15 "artificial" features is then created. The feature creation is performed by genetic programming, which combines one or more "original" features by means of mathematical operators. Finally, the set of combined features is used to train a radial basis function support vector machine. This method is named GP-Loc. Moreover, we also propose a method with very few parameters, named ALL-Loc, in which all the "original" features are used to train a linear support vector machine. The overall prediction accuracy obtained by GP-Loc is 89% under jackknife cross-validation, which outperforms the result reported in the literature (85.2%) on the same dataset. The overall prediction accuracy obtained by ALL-Loc is 83.9%.

7.
Liu N, Wang T. FEBS Letters, 2006, 580(22):5321-5327
So far, various approaches for phylogenetic analysis have been developed. Almost all of them focus on analyzing nucleic acid sequences or protein primary structures. In this paper, we take the physicochemical properties of amino acids into account and introduce the hydropathy profile of amino acids into phylogenetic analysis. We find that this addition is effective and that our method may be used to complement phylogenetic analysis.
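
The abstract does not say how the hydropathy profile is built. One common construction, assumed here purely for illustration, is a sliding-window average of the Kyte-Doolittle hydropathy index; the window size and function name below are likewise assumptions rather than details from the paper.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=3):
    """Sliding-window mean hydropathy along a protein sequence (sketch)."""
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    scores = [KYTE_DOOLITTLE[r] for r in seq]  # KeyError on non-standard residues
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]
```

Profiles of this kind turn each protein into a numeric signal, so sequences can be compared through their physicochemical behaviour rather than residue identity alone.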

8.
Li ZC, Zhou XB, Dai Z, Zou XY. Amino Acids, 2009, 37(2):415-425
Prior knowledge of a protein's structural class can provide useful information about its overall structure, so quick and accurate computational determination of protein structural class is very important in protein science. One key to any computational method is an accurate representation of the protein samples. Here, based on the concept of Chou’s pseudo-amino acid composition (AAC, Chou, Proteins: structure, function, and genetics, 43:246–255, 2001), a novel feature extraction method that combines continuous wavelet transform (CWT) with principal component analysis (PCA) is introduced for the prediction of protein structural classes. First, a digital signal was obtained by mapping each amino acid according to various physicochemical properties. Second, CWT was used to extract a new feature vector based on the wavelet power spectrum (WPS), which contains richer sequence-order information in the frequency and time domains, and PCA was then used to reorganize the feature vector to decrease information redundancy and computational complexity. Finally, a pseudo-amino acid composition feature vector was formed to represent the primary sequence by coupling the AAC vector with a set of new WPS feature vectors in an orthogonal space obtained by PCA. As a showcase, the rigorous jackknife cross-validation test was performed on the working datasets. The results indicate that prediction quality was improved, and that the current approach to protein representation may serve as a useful complementary vehicle for classifying other attributes of proteins, such as enzyme family class, subcellular localization, membrane protein type, and protein secondary structure.

9.
We introduce a new approach to compare DNA primary sequences. The core of our method is a new measure of pairwise distance between sequences. Using the primitive discrimination substrings of sequences S and Q, a discrimination measure DM(S, Q) is defined for their similarity analysis. The proposed method does not require multiple alignments and is fully automatic. To illustrate its utility, we construct phylogenetic trees on two independent data sets. The results indicate that the method is efficient and powerful.

10.
In an attempt to define the phylogenetic relationships among 17 phenotypically related species of the genera Enterobacter, Pantoea, Serratia, Klebsiella and Erwinia, we determined almost all of their groE operon sequences using the polymerase chain reaction direct sequencing method. The number of nucleotide substitutions per site was 0.12 ± 0.030, which was 3.6-fold higher than that of 16S rDNA. As a result, we were able to construct molecular phylogenetic trees with a finer resolution than those based on the 16S rDNA sequences. The phylogenetic trees based on the nucleotide sequences and deduced amino acid sequences of the groE operons indicated that the members of the genera Enterobacter, Pantoea and Klebsiella were closely related to each other, while the Serratia and Erwinia species, except Erwinia carotovora, formed distinct clades. The close relationship between Enterobacter aerogenes and Klebsiella pneumoniae, which had been suggested by biochemical tests and DNA hybridization, was also supported by our molecular phylogenetic trees.

11.
Graphical representation of DNA sequences is one of the most popular techniques for alignment-free sequence comparison. Here, we propose a new method for the feature extraction of DNA sequences represented by binary images, in which the similarity between DNA sequences is estimated from the frequency histograms of local bitmap patterns of the images. The time complexity of our method is linear in the length of the DNA sequences, which makes it practical even when long sequences, such as whole genome sequences, are compared. We tested five distance measures for the estimation of sequence similarities, and found that the histogram intersection and Manhattan distance are the most appropriate ones for phylogenetic analyses.
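
The two measures singled out in this abstract can be sketched in a few lines of Python. The function names are hypothetical, and the histograms are assumed to be normalized to sum to one before comparison; how the local bitmap patterns themselves are extracted is not covered here.

```python
def normalize(hist):
    """Scale a histogram of raw counts so that its bins sum to 1."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram is empty")
    return [h / total for h in hist]

def manhattan_distance(p, q):
    """Sum of absolute bin-wise differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def histogram_intersection(p, q):
    """Sum of bin-wise minima; equals 1 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(p, q))
```

For normalized histograms the two are linked by the identity intersection = 1 - manhattan / 2, so they rank pairs of sequences in the same order, which is consistent with both performing similarly well.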

12.
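In this study, 23 species from 19 genera in 8 tribes of the subfamily Euphorinae (Hymenoptera: Braconidae) were selected as the ingroup, and 8 species from 8 genera in 6 other braconid subfamilies as the outgroup; for the first time, homologous ribosomal 28S rDNA D2 gene sequence fragments were combined with 41 morphological characters in a phylogenetic study of the subfamily. With the three cyclostome subfamilies Rogadinae, Braconinae and Doryctinae as the root, the molecular data and the combined molecular and non-molecular data for Euphorinae were analyzed by maximum parsimony (MP) in PAUP*4.0 and by Bayesian inference in MrBayes3.0B4; PAUP*4.0 was also used to analyze the base composition and base substitutions of the euphorine 28S rDNA D2 sequence fragments. The results show that the GC content of the 28S rDNA D2 fragments of Euphorinae ranges from 40.00% to 49.25% and that, at variable sites among members of Euphorinae, transversions outnumber transitions. The phylogenetic trees produced by the different analyses and algorithms all indicate that Euphorinae, as currently defined on morphology, is not a monophyletic group but a paraphyletic group intermixed with Neoneurinae and Cenocoelinae. Within Euphorinae, the tribes Meterorini and Microctonini (excluding the genus Orionis) are monophyletic, whereas Centistini, Cosmophorini, Euphorini and Dinocampini are paraphyletic. The view that Meterorini occupies a basal position within Euphorinae is partly supported, and Microctonini is judged to be a relatively derived group. In addition, the relationships among genera within Euphorinae recovered by the different algorithms are not fully consistent, indicating that phylogenetic relationships within the subfamily (among genera and tribes) require further study.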
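
The two sequence statistics reported here, GC content and the balance of transitions and transversions, were computed by the authors in PAUP*4.0. The Python sketch below only illustrates what those quantities mean; the function names are hypothetical, and the two sequences passed to the second function are assumed to be already aligned and of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def gc_content(seq):
    """Percentage of G and C among unambiguous bases."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no unambiguous bases in sequence")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)

def count_substitutions(a, b):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        # Skip identical sites, gaps and ambiguity codes.
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return transitions, transversions
```

For example, comparing `"AACC"` with `"GTTC"` gives two transitions (A to G, C to T) and one transversion (A to T).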

13.
Many podoviruses that infect marine picocyanobacteria have been isolated, and they may play an important role in regulating the biomass and population composition of picocyanobacteria. However, little is known about the diversity and population dynamics of autochthonous cyanopodoviruses in marine environments. Using a set of newly designed PCR primers which specifically amplify the DNA pol from cyanopodoviruses, a total of 221 DNA pol sequences were retrieved from eight Chesapeake Bay virioplankton communities collected at different times and locations. All DNA pol sequences clustered with the eight known podoviruses that infect different marine picocyanobacteria, and could be divided into at least 10 different subclusters (I-X). The presence of these cyanopodovirus genotypes based on PCR amplification of DNA pol gene sequences was supported by the existence of similar DNA pol genotypes in metagenome libraries of Chesapeake Bay virioplankton assemblages. The composition of cyanopodoviruses in the Bay also exhibited distinct winter and summer patterns, which were likely related to corresponding seasonal changes in the composition of cyanobacterial populations. Our study suggests that diverse and dynamic populations of cyanopodoviruses are present in the estuarine environment. The PCR method developed in this study provides a specific and sensitive tool to explore the abundance, distribution and phylogenetic diversity of cyanopodoviruses in aquatic environments. Linking the dynamics of host and viral populations in the natural environment is critical to a broader characterization of the ecological role of virioplankton within microbial communities.

14.
The amino acid gamma-aminobutyric-acid receptors (GABAARs) belong to the ligand-gated ion channels (LGICs) superfamily. GABAARs are highly diverse in the central nervous system. These channels play a key role in regulating behavior. As a result, the prediction of GABAARs from the amino acid sequence would be helpful for research on these receptors. We have developed a method to predict these proteins using features obtained from Chou's pseudo-amino acid composition concept, with the support vector machine as a powerful machine learning approach. The predictor efficiency was assessed by five-fold cross-validation. This method achieved an overall accuracy and Matthews correlation coefficient (MCC) of 94.12% and 0.88, respectively. Furthermore, to evaluate the effect and power of each feature, the minimum Redundancy Maximum Relevance (mRMR) feature selection method was implemented. An interesting finding in this study is the presence of all six characters (hydrophobicity, hydrophilicity, side chain mass, pK1, pK2 and pI), or combinations of them, among the five top-ranked features (pK2 and pI, hydrophobicity and mass, pK1, hydrophilicity and mass) obtained from the mRMR feature selection method. The results show a biologically justifiable ranking of the attributes: pK2 and pI; hydrophobicity, hydrophilicity and mass; mass and pK1; pK2 and mass. Based on our results, using the concept of Chou's pseudo-amino acid composition and the support vector machine is an effective approach for the prediction of GABAARs.
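
The two figures of merit quoted above, overall accuracy and the Matthews correlation coefficient, follow directly from the confusion matrix of a binary classifier. The sketch below shows the standard formulas; the function names are hypothetical, and returning 0.0 when the MCC denominator vanishes is a convention chosen here, not something stated in the abstract.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention used here when a row or column is empty
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, the MCC stays near zero for a classifier that simply predicts the majority class, which is why it is often reported alongside accuracy on unbalanced datasets.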

15.
Based on parsimony analyses of eight new SSU rDNA sequences and 24 homologous sequences retrieved from the DNA databases, we suggest a possible phylogenetic relationship of Elaphomycetales with Eurotiales and Onygenales. The three Elaphomyces sequences we included cluster strongly together (bootstrap value 100%) within a monophyletic group (100%) of Elaphomycetales, Eurotiales, and Onygenales. Earlier reports that another cleistothecial lineage (Erysiphe) is related to Leotiales are supported by our discovery that a further cleistothecial species, Amylocarpus encephaloides, also shows affinity to Leotiales. Ascosphaeraceae and Eremascaceae are possibly better accommodated in Onygenales. We describe a new DNA extraction method in which sonication is used to disrupt thick-walled spores. It is useful for both fresh and dried fungal material.

16.
Summary. We present compositional statistics, a new method of phylogenetic inference, which is an extension of evolutionary parsimony. Compositional statistics takes account of the base composition of the compared sequences by using nucleotide positions that evolutionary parsimony ignores. It shares with evolutionary parsimony the features of rate invariance and the fundamental distinction between transitions and transversions. Of the presently available methods of phylogenetic inference, compositional statistics is based on the fewest and mildest assumptions about the mode of DNA sequence evolution. It is therefore applicable to phylogenetic studies of the most distantly related organisms or molecules. This was illustrated by analyzing conservative positions in the DNA sequences of the large subunit of RNA polymerase from three archaebacterial groups, a eubacterium, a chloroplast, and the three eukaryotic polymerases. Internally consistent results, which are in accord with our knowledge of organelle origin and archaebacterial physiology, were achieved.

17.
Up to now, various approaches for phylogenetic analysis have been developed. Almost all of them focus on analyzing nucleic acid sequences or protein primary sequences. In this paper, we propose a new sequence distance for the efficient reconstruction of phylogenetic trees, based on the length distribution of the common subsequences between two sequences. We describe some applications of this method, which not only show its validity but also suggest a number of novel phylogenetic insights.

18.
With the rapid growth of protein sequence data, it is indispensable to develop automated and reliable predictive methods for protein function annotation. One approach to facilitating protein function prediction is to classify proteins into functional families from the primary sequence. Because enzymes are the most important group of all proteins, the accurate prediction of enzyme family classes and subfamily classes is closely related to their biological functions. In this paper, for the prediction of enzyme subfamily classes, Chou's amphiphilic pseudo-amino acid composition [Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19] has been adopted to represent the protein samples for training the 'one-versus-rest' support vector machine. As a demonstration, the jackknife test was performed on the dataset that contains 2640 oxidoreductase sequences classified into 16 subfamily classes [Chou, K.C., Elrod, D.W., 2003. Prediction of enzyme family classes. J. Proteome Res. 2, 183-190]. The overall accuracy thus obtained was 80.87%. The significant enhancement in accuracy indicates that the current method might play a complementary role to the existing methods.
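
The jackknife test used here and in several of the abstracts above is leave-one-out cross-validation: each sample is held out in turn, the predictor is built on the remaining samples, and the held-out sample is then classified. The Python sketch below shows that loop. To stay self-contained it plugs in a 1-nearest-neighbour rule as a stand-in classifier, which is an assumption for illustration and not the one-versus-rest support vector machine of the paper; all function names are hypothetical.

```python
def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training vector (squared Euclidean)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]

def jackknife_accuracy(features, labels, predict=nearest_neighbour_predict):
    """Leave-one-out accuracy: every sample is tested exactly once."""
    correct = 0
    for i in range(len(features)):
        # Train on everything except sample i, then test on sample i.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Because every sample is tested exactly once and the split involves no random choice, the jackknife gives a single, reproducible accuracy for a given dataset, at the cost of retraining the predictor once per sample.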

19.
Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q. Amino Acids, 2008, 34(4):565-572
The rapidly increasing number of sequences entering the genome databanks calls for automated methods to analyze them. Information on the subcellular localization of newly found protein sequences is important for revealing their functions in a timely manner and for conducting studies of systems biology at the cellular level. Based on the concept of Chou’s pseudo-amino acid composition, a series of useful information and techniques, such as residue conservation scores, von Neumann entropies, multi-scale energy, and the weighted auto-correlation function, were used to generate the pseudo-amino acid components for representing the protein samples. On this basis, a hybridization predictor was developed for identifying uncharacterized proteins among the following 12 subcellular localizations: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Higher success rates were obtained than those reported by previous investigators, suggesting that the current approach is quite promising and may become a useful high-throughput tool in the relevant areas.

20.
DNA sequences encoding the C2 to V3 region of envelope glycoprotein gp120 of human immunodeficiency virus type 1 (HIV-1) were amplified by PCR from uncultured peripheral blood mononuclear cells obtained from 24 of 25 HIV-1-seropositive patients from Cyprus. By using a heteroduplex mobility assay (HMA), all amplified products were studied genetically and compared with 16 previously characterized HIV-1 strains belonging to subtypes A through F. HMA results revealed that HIV-1 gp120 sequences from 15 of our patients were of subtype B of HIV-1, whereas one isolate was of subtype C. However, gp120 sequences from eight patients had no obvious similarities to the known subtypes as defined by HMA. DNA sequencing and phylogenetic analyses of molecular clones confirmed the HMA results and placed the eight undefined HIV-1 isolates into three distinct genetic clusters. On the basis of the branch topology and branch lengths of the phylogenetic tree, we conclude that one group consisting of three clones from two patients represents a new HIV-1 env subtype, which we have termed subtype I. The remaining two sequence clusters, consisting of five sequences from four patients and two sequences from two other patients, are distantly related to subtypes A and F. These data demonstrate the extensive heterogeneity of HIV-1 in Cyprus, including the presence of a new subtype.
