Similar articles
1.
Liu H, Han H, Li J, Wong L. 《In silico biology》 2004, 4(3):255-269
The translation initiation site (TIS) prediction problem is to correctly identify TISs in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy supports a better understanding of protein coding from nucleotide sequences, which is an important step in genomic analysis. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. The method consists of three sequential steps. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results are better than those reported previously using the same data sets. Our high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around TISs in cDNA and mRNA sequences.
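The first step of the method above (candidate features from k-gram amino acid patterns) might look roughly like the sketch below. It is only an illustration: the function names, the 30-nucleotide window, and the choice to count k-grams only in the in-frame region downstream of each candidate ATG are assumptions, not details taken from the abstract. Feature selection and classification are not shown.

```python
from collections import Counter

# Standard genetic code, with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(nt):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    nt = nt.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def candidate_atgs(seq):
    """Return the start index of every ATG (or AUG) in the sequence."""
    seq = seq.upper().replace("U", "T")
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


def kgram_features(seq, atg_pos, k=2, window=30):
    """Count amino acid k-grams in the in-frame window after a candidate ATG.

    The window length (in nucleotides) is a hypothetical parameter.
    """
    downstream = translate(seq[atg_pos + 3: atg_pos + 3 + window])
    return Counter(downstream[i:i + k]
                   for i in range(len(downstream) - k + 1))
```

Each candidate ATG then yields one sparse count vector, which a later selection step could prune before training a classifier.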

2.
The prediction of translation initiation sites (TISs) in eukaryotic mRNAs has been a challenging problem in computational molecular biology. In this paper, we present a new algorithm to recognize TISs with very high accuracy. Our algorithm includes two novel ideas. First, we introduce a class of new sequence-similarity kernels based on string editing, called edit kernels, for use with support vector machines (SVMs) in a discriminative approach to predict TISs. The edit kernels are simple and have significant biological and probabilistic interpretations. Although the edit kernels are not positive definite, it is easy to make the kernel matrix positive definite by adjusting the parameters. Second, we convert the region of an input mRNA sequence downstream of a putative TIS into an amino acid sequence before applying SVMs, to avoid the high redundancy in the genetic code. The algorithm has been implemented and tested on previously published data. Our experimental results on real mRNA data show that both ideas improve the prediction accuracy greatly and that our method performs significantly better than those based on neural networks and SVMs with polynomial kernels or Salzberg kernels.
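The abstract does not give the form of the edit kernels. As a rough illustration of a kernel built from string editing, the sketch below assumes the form K(x, y) = exp(-gamma * d(x, y)), where d is the Levenshtein distance. Both this form and the parameter `gamma` are assumptions and may differ from the authors' definition.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_kernel(a, b, gamma=0.5):
    """Assumed kernel form: similarity decays exponentially with distance."""
    return math.exp(-gamma * edit_distance(a, b))


def gram_matrix(seqs, gamma=0.5):
    """Pairwise kernel values, e.g. for an SVM that accepts precomputed kernels."""
    return [[edit_kernel(a, b, gamma) for b in seqs] for a in seqs]
```

A larger `gamma` pushes the Gram matrix toward the identity, which is one plausible reading of making the kernel matrix positive definite by adjusting the parameters.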

3.
As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful for their functional annotation. In order to improve the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to account for ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. Through fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful for predicting the subcellular location.
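The three-part split described above can be sketched as follows. The region lengths (`n_len`, `c_len`) and function names are hypothetical, and only single amino acid composition is computed. The twin amino acid and distance-frequency features, as well as the four N-terminal subregions, are left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_regions(seq, n_len=40, c_len=40):
    """Split a protein into N-terminal, middle, and C-terminal parts.

    The default lengths are illustrative assumptions, not values from the paper.
    """
    c_start = max(n_len, len(seq) - c_len)  # keep the regions from overlapping
    return seq[:n_len], seq[n_len:c_start], seq[c_start:]


def composition(region):
    """Relative frequency of each of the 20 standard amino acids."""
    total = len(region) or 1  # avoid division by zero for empty regions
    return [region.count(aa) / total for aa in AMINO_ACIDS]


def regional_features(seq, n_len=40, c_len=40):
    """Concatenate the three regional compositions into one 60-value vector."""
    feats = []
    for region in split_regions(seq, n_len, c_len):
        feats.extend(composition(region))
    return feats
```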

4.
With the rapid growth of DNA databases for human and other eukaryotic model organisms, a great number of genes need to be identified in these databases. Exact recognition of translation initiation sites (TISs) of eukaryotic genes is very important for understanding the translation initiation process, predicting the detailed structure of eukaryotic genes, and annotating uncharacterized sequences. The problem has not been solved satisfactorily, especially for recognizing TISs of eukaryotic genes with shorter first exons. Extracting new features and finding powerful new algorithms for recognizing TISs of eukaryotic genes is therefore an important task. In this paper, the important characteristics of shorter flanking fragments around TISs are extracted and an expectation-maximization (EM) algorithm based on incomplete data is used to recognize TISs of eukaryotic genes. The accuracy is up to 87.8% over a six-fold cross-validation test. The result shows that the identification variables are effectively extracted and that the EM algorithm is a powerful tool for predicting the TISs of eukaryotic genes. The algorithm can also be applied to other classification or clustering tasks in bioinformatics.

5.
Knowledge of the three-dimensional structure of a protein is essential for describing and understanding its function. Today, a large number of known protein sequences contrasts with a small number of identified structures. Thus, the need arises to predict structure from sequence without using time-consuming experimental identification. In this paper the performance of Support Vector Machines (SVMs) is compared to Neural Networks and to standard statistical classification methods such as Discriminant Analysis and Nearest Neighbor Classification. We show that SVMs can beat the competing methods on a dataset of 268 protein sequences to be classified into a set of 42 fold classes. We discuss misclassification with respect to biological function and similarity. In a second step we examine the performance of SVMs when the embedding is varied from frequencies of single amino acids to frequencies of triplets of amino acids. This work shows that SVMs provide a promising alternative to standard statistical classification and prediction methods in functional genomics.

6.
Pytela R. 《The EMBO journal》 1988, 7(5):1371-1378
Clones encoding the Mac-1 alpha chain were selected from a mouse macrophage cDNA library by screening with oligonucleotide probes based on the sequence of a genomic clone encoding the N-terminus of the mature protein. The sequence of overlapping clones (4282 nt) was determined and translated into a protein of 1137 amino acids and a signal peptide of 15 amino acids. The Mac-1 sequence was found to be related to the alpha chain sequences of three other members of the integrin family of cell adhesion receptors, i.e. the fibroblast receptors for fibronectin and vitronectin and the platelet glycoprotein IIb/IIIa. All four sequences share a number of structural features, such as the position of 13 cysteine residues, a transmembrane domain near the C-terminus and the location of three putative binding sites for divalent cations. Furthermore, a conserved sequence motif is repeated seven times in the N-terminal half of the molecule and three of these repeats include putative Ca/Mg-binding sites of the general structure DXDXDGXXD. On the other hand, Mac-1 contains a unique domain of 220 amino acids inserted into the N-terminal part of the integrin structure. This additional domain is homologous to a repeated domain found in von Willebrand factor, cartilage matrix protein and in the complement factors B and C2. In two of these proteins, the homologous domain is likely to be involved in binding to collagen fibrils. Therefore, Mac-1 may also bind to collagen, which could play a role in the interaction of leukocytes with the subendothelial matrix.
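A motif written as DXDXDGXXD, with X standing for any residue, can be located in a protein sequence with a regular expression. The sketch below is a generic illustration and is not part of the original study. The function name and the toy sequence in the usage note are made up.

```python
import re

# 'X' in DXDXDGXXD means "any residue", so it maps to '.' in the pattern.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=(D.D.DG..D))")


def find_motif(protein):
    """Return (0-based start, matched text) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(protein.upper())]
```

For example, `find_motif("AAADKDGDGSLDAAA")` reports one hit starting at index 3.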

7.
A model has been developed that permits the prediction of mRNA nucleic acid sequence from the sequences of the translated proteins. The model relies on the information obtained from the comparison of protein sequences in related species to reduce the number of possible codons for those amino acids where mutations are observed. The predictions so obtained have been tested by applying the model to proteins whose mRNA sequences are known. The model's predictions have been found to be 100% accurate if three or more different amino acids are known at a given position and if the protein sequences are restricted to relatively closely related species (within the same class). The use of this model may permit a reduction of the mRNA sequence degeneracy and therefore be helpful in the synthesis of cDNA probes or for the prediction of restriction endonuclease sites. Computer programs have been developed to ease the use of the model.

8.
Remote homology detection refers to the detection of structural homology in evolutionarily related proteins with low sequence similarity. Supervised learning algorithms such as the support vector machine (SVM) are currently the most accurate methods. In most of these SVM-based methods, efforts have been dedicated to developing new kernels that make better use of pairwise alignment scores or sequence profiles. Moreover, amino acids' physicochemical properties are not generally used in the feature representation of protein sequences. In this article, we present a remote homology detection method that incorporates two novel features: (1) a protein's primary sequence is represented using amino acids' physicochemical properties and (2) the similarity between two proteins is measured using recurrence quantification analysis (RQA). An optimization scheme was developed to select the amino acid indices (up to 10 for a protein family) that best characterize the given protein family. The selected amino acid indices may offer a better biological explanation of the protein family classification problem than alignment-based methods do. An SVM-based classifier then works on the space described by the RQA metrics. The classification scheme is named SVM-RQA. Experiments at the superfamily level of the SCOP1.53 dataset show that, without using alignment or sequence profile information, the features generated from amino acid indices produce results comparable to those obtained by the published state-of-the-art SVM kernels. In the future, better prediction accuracies can be expected by combining the alignment-based features with our amino acid property-based features. Supplementary information including the raw dataset, the best-performing amino acid indices for each protein family and the computed RQA metrics for all protein sequences can be downloaded from http://ym151113.ym.edu.tw/svm-rqa.

9.
This paper presents a novel feature vector based on physicochemical properties of amino acids for predicting protein structural classes. The proposed method is divided into three stages. First, a discrete time series representation of protein sequences is built using a physicochemical scale. Next, a wavelet-based time-series technique is proposed for extracting features from the mapped amino acid sequence, and a fixed-length feature vector for classification is constructed. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.
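The first two stages above (mapping residues to a numeric scale, then summarizing wavelet coefficients as variances) can be illustrated with a hand-written Haar transform. This is a sketch under stated assumptions: the abstract names neither the wavelet nor the ten properties, so the Haar wavelet and the Kyte-Doolittle hydropathy scale are stand-ins, and the function names are hypothetical.

```python
import math
from statistics import pvariance

# Kyte-Doolittle hydropathy values, used here as a single example scale.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def to_signal(protein):
    """Map a protein to a numeric series, skipping non-standard residues."""
    return [KD[aa] for aa in protein.upper() if aa in KD]


def haar_step(signal):
    """One Haar level: pairwise scaled sums (approx) and differences (detail).

    A trailing unpaired sample is dropped.
    """
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]
    return approx, detail


def wavelet_variance_features(protein, levels=3):
    """Variance of the detail coefficients at each level (fixed length)."""
    signal = to_signal(protein)
    feats = []
    for _ in range(levels):
        if len(signal) < 2:
            feats.append(0.0)  # sequence too short for this level
            continue
        signal, detail = haar_step(signal)
        feats.append(pvariance(detail))
    return feats
```

Repeating this for several property scales and concatenating the results would give a fixed-length vector regardless of sequence length.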

10.
Although ribosome-profiling and translation initiation sequencing (TI-seq) analyses have identified many noncanonical initiation codons, the precise detection of translation initiation sites (TISs) remains a challenge, mainly because of experimental artifacts of such analyses. Here, we describe a new method, TISCA (TIS detection by translation Complex Analysis), for the accurate identification of TISs. TISCA proved to be more reliable for TIS detection than existing tools, and it identified a substantial number of near-cognate codons in Kozak-like sequence contexts. Analysis of proteomics data revealed the presence of methionine at the NH2-terminus of most proteins derived from near-cognate initiation codons. Although eukaryotic initiation factor 2 (eIF2), eIF2A and eIF2D have previously been shown to contribute to translation initiation at near-cognate codons, we found that most noncanonical initiation events are most probably dependent on eIF2, consistent with the initial amino acid being methionine. Comprehensive identification of TISs by TISCA should facilitate characterization of the mechanism of noncanonical initiation.

11.
The parasitic protozoan Trichomonas vaginalis is known to contain the ubiquitous and highly conserved protein actin. A genomic library and a cDNA library have been screened to identify and clone the actin gene(s) of T. vaginalis. The nucleotide sequence of one gene and its flanking regions have been determined. The open reading frame encodes a protein of 376 amino acids. The sequence is not interrupted by any introns and the promoter could be represented by a 10 bp motif close to a consensus motif also found upstream of most sequenced T. vaginalis genes. The five different clones isolated from the cDNA library have similar sequences and encode three actin proteins differing only by one or two amino acids. A phylogenetic analysis of 31 actin sequences by distance matrix and parsimony methods, using centractin as outgroup, gives congruent trees with Parabasala branching above Diplomonadida.

12.
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.
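Step (b), selecting relevant features, can be illustrated with a simple entropy-based score. The sketch below ranks discrete features by information gain. It is a stand-in for the entropy measure mentioned above, whose exact definition is not given in the abstract, and the function names and the layout of `feature_table` are assumptions.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * math.log2(p)
    return h


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain


def rank_features(feature_table, labels, top=2):
    """Return the names of the `top` features with the highest gain.

    `feature_table` maps a feature name to its per-sample values.
    """
    scores = {name: information_gain(vals, labels)
              for name, vals in feature_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

With k-gram presence flags as features and true/false TIS labels, the top-ranked names would be the candidates passed on to a classifier in step (c).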

13.
14.

Background  

Computational prediction methods are currently used to identify genes in prokaryote genomes. However, identification of the correct translation initiation sites remains a difficult task. Accurate translation initiation sites (TISs) are important not only for the annotation of unknown proteins but also for the prediction of operons, promoters, and small non-coding RNA genes, as this typically makes use of the intergenic distance. A further problem is that most existing methods are optimized for Escherichia coli data sets; applying these methods to newly sequenced bacterial genomes may not result in an equivalent level of accuracy.

15.
The cloning and sequencing of a full-length cDNA of GAFP-1 (Gastrodia antifungal protein), an antifungal protein from Gastrodia elata Bl. f. flavida S. Chow, is reported. Degenerate primers were designed based on the N-terminal partial sequence from purified GAFP-1 to amplify the corresponding cDNA by rapid amplification of cDNA ends (RACE). A cDNA was obtained that contains an open reading frame for a peptide of 171 amino acids which matches the known peptide sequences. A 5'UTR (untranslated region) of 55 bp was found upstream of the translation initiation site. Two polyadenylation sites were located downstream of the stop codon. GAFP-1 cDNA and its deduced amino acid sequence share high homology with the mannose-binding lectins from Epipactis helleborine, Listera ovata and snowdrop (Galanthus nivalis). The cDNA can now be used for testing the potential of GAFP-1 for engineering fungal resistance in crop plants.

16.
The cDNA sequence coding for the coat protein of cucumber mosaic virus (Japanese Y strain) was cloned, and its nucleotide sequence was determined. The sequence contains an open reading frame that encodes the coat protein composed of 218 amino acids. The nucleotide and deduced amino acid sequences of the coat protein of this strain were compared with those of the Q strain; the homologies of the sequences were 78% and 81%, respectively. Further study of the sequences gave an insight into the genome organization and the molecular features of the coat protein. The coding region can be divided into three characteristic regions. The N-terminal region has conserved features in the positively charged structure, the hydropathy pattern and the predicted secondary structure, although the amino acid sequence varies, mainly due to frameshift mutations. It is noteworthy that the positions of arginine residues in this region are highly conserved. Both the nucleotide and amino acid sequences of the central region are well conserved. The amino acid sequence of the C-terminal region is not conserved because of frameshift mutations; however, the total number of amino acids is conserved. The nucleotide sequence of the 3'-noncoding region is divergent, but it could form a tRNA-like structure similar to those reported for other viruses. Detailed investigation suggests that the Y and Q strains are evolutionarily distant.

17.
A computational system for the prediction and classification of human G-protein coupled receptors (GPCRs) has been developed based on the support vector machine (SVM) method and protein sequence information. The feature vectors used to develop the SVM prediction models consist of statistically significant features selected from single amino acid, dipeptide, and tripeptide compositions of protein sequences. Furthermore, the difference in length distribution between GPCRs and non-GPCRs has also been exploited to improve the prediction performance. Testing with annotated human protein sequences demonstrates that this system achieves good performance for both prediction and classification of human GPCRs.
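A raw feature vector of the kind described above (composition features plus sequence length) might be assembled as in the sketch below. The function names are hypothetical, tripeptide composition is omitted to keep the vector small, and the statistical selection of significant features is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(seq, k):
    """Relative frequency of every possible amino acid k-mer, in fixed order."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    windows = max(len(seq) - k + 1, 0)
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing non-standard residues
            counts[kmer] += 1
    return [counts[key] / windows if windows else 0.0 for key in keys]


def sequence_features(seq):
    """20 single-residue + 400 dipeptide frequencies + sequence length."""
    return kmer_composition(seq, 1) + kmer_composition(seq, 2) + [float(len(seq))]
```

Because length is on a very different scale from the frequencies, the features would normally be rescaled before being passed to an SVM.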

18.
19.
20.
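Accurate localization of translation initiation sites (TISs, i.e., the 5' ends of genes) is a key problem in prokaryotic gene prediction, and the diversity of genomic GC content and of translation initiation mechanisms is an important factor limiting current TIS prediction performance. By integrating complex information on genome structure (including GC content, the sequences adjacent to the TIS and upstream regulatory signals, sequence coding potential, and operon structure), a mathematical statistical model characterizing translation initiation mechanisms was developed and used to design a new TIS prediction algorithm, MED-StartPlus. MED-StartPlus was systematically compared and evaluated against similar methods, including RBSfinder, GS-Finder, MED-Start, TiCo, and Hon-yaku. Tests were performed on two kinds of data sets: the 14 currently available gene data sets with confirmed TISs, and data sets of genes with known function from 300 species. The results show that the prediction accuracy of MED-StartPlus is higher overall than that of similar methods. MED-StartPlus has a clear advantage in particular for genomes with high GC content and for genomes with complex translation initiation mechanisms.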
