Found 20 similar articles (search time: 15 ms)
1.
MOTIVATION: A large body of experimental and theoretical evidence suggests that local structural determinants are frequently encoded in short segments of protein sequence. Although the local structural information, once recognized, is particularly useful in protein structural and functional analyses, it remains a difficult problem to identify embedded local structural codes based solely on sequence information. RESULTS: In this paper, we describe a local structure prediction method aiming at predicting the backbone structures of nine-residue sequence segments. Two elements are the keys for this local structure prediction procedure. The first key element is the LSBSP1 database, which contains a large number of non-redundant local structure-based sequence profiles for nine-residue structure segments. The second key element is the consensus approach, which identifies a consensus structure from a set of hit structures. The local structure prediction procedure starts by matching a query sequence segment of nine consecutive amino acid residues to all the sequence profiles in the local structure-based sequence profile database (LSBSP1). The consensus structure, which is at the center of the largest structural cluster of the hit structures, is predicted to be the native state structure adopted by the query sequence segment. This local structure prediction method is assessed with a large set of random test protein structures that have not been used in constructing the LSBSP1 database. The benchmark results indicate that the prediction capacities of the novel local structure prediction procedure exceed the prediction capacities of the local backbone structure prediction methods based on the I-sites library by a significant margin. AVAILABILITY: All the computational and assessment procedures have been implemented in the integrated computational system PrISM.1 (Protein Informatics System for Modeling). 
The system and associated databases for LINUX systems can be downloaded from the website: http://www.columbia.edu/~ay1/.
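The consensus step described in this abstract (picking the structure at the center of the largest cluster of hit structures) can be sketched roughly as follows. This is only an illustration, not the PrISM implementation: the pairwise distance matrix, the `cutoff` value, and the rule "the center is the hit with the most neighbours within the cutoff" are all assumptions made for the example.

```python
from typing import Sequence


def consensus_index(dist: Sequence[Sequence[float]], cutoff: float) -> int:
    """Return the index of the hit that has the most neighbours within `cutoff`.

    `dist` is a symmetric matrix of pairwise structural distances
    (for example RMSD) between the hit structures. Ties go to the
    lowest index.
    """
    best_idx, best_count = -1, -1
    for i, row in enumerate(dist):
        # Count the other hits that lie within the cutoff of hit i.
        count = sum(1 for j, d in enumerate(row) if j != i and d <= cutoff)
        if count > best_count:
            best_idx, best_count = i, count
    return best_idx


# Hypothetical pairwise distances between four hit structures.
rmsd = [
    [0.0, 0.5, 1.2, 3.0],
    [0.5, 0.0, 0.4, 3.1],
    [1.2, 0.4, 0.0, 2.9],
    [3.0, 3.1, 2.9, 0.0],
]
center = consensus_index(rmsd, cutoff=1.0)
```

Here hit 1 is within the cutoff of both hit 0 and hit 2, so it is taken as the consensus; hit 3 is an outlier with no neighbours.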
2.
A structure-based method for protein sequence alignment (total citations: 1; self-citations: 0; citations by others: 1)
Kann MG Thiessen PA Panchenko AR Schäffer AA Altschul SF Bryant SH 《Bioinformatics (Oxford, England)》2005,21(8):1451-1456
MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database. RESULTS: In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family. AVAILABILITY: A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated into the Cn3D structure/alignment viewer. CONTACT: bryant@ncbi.nlm.nih.gov.
3.
Grishin NV 《Cell》2012,149(7):1424-1425
A daring experiment is performed. Using sequence alignments to predict contacts between residues in protein spatial structures, Hopf et al. are publishing untested de novo structure models for 11 transmembrane protein families. Will their models stand the test of time and hold up to experimentation? The prospects are excellent.
4.
SUMMARY: 3MOTIF is a web application that visually maps conserved sequence motifs onto three-dimensional protein structures in the Protein Data Bank (PDB; Berman et al., Nucleic Acids Res., 28, 235-242, 2000). Important properties of motifs such as conservation strength and solvent accessible surface area at each position are visually represented on the structure using a variety of color shading schemes. Users can manipulate the displayed motifs using the freely available Chime plugin. AVAILABILITY: http://motif.stanford.edu/3motif/
5.
Searching the protein sequence database (total citations: 1; self-citations: 0; citations by others: 1)
As the volume of protein sequence data grows, rapid methods for searching the protein sequence database become of primary
importance. Rigorous comparison of sequences is obtained with the well-known dynamic programming algorithms. However, these
algorithms are not rapid enough to use for routinely searching the entire database. In this paper we discuss some methods
that can be used for rapid searches.
6.
A model has been developed that permits the prediction of mRNA nucleic acid sequence from the sequences of the translated proteins. The model relies on the information obtained from the comparison of protein sequences in related species to reduce the number of possible codons for those amino acids where mutations are observed. The predictions so obtained have been tested by applying the model to proteins whose mRNA sequences are known. The model's predictions have been found to be 100% accurate if three or more different amino acids are known at a given position and if the protein sequences are restricted to relatively closely related species (within the same class). The use of this model may permit a reduction of the mRNA sequence degeneracy and therefore be helpful in the synthesis of cDNA probes or for the prediction of restriction endonuclease sites. Computer programs have been developed to ease the use of the model.
7.
Shepherd AJ Martin NJ Johnson RG Kellam P Orengo CA 《Bioinformatics (Oxford, England)》2002,18(12):1666-1672
MOTIVATION: The PFDB (Protein Family Database) is a new database designed to integrate protein family-related data with relevant functional and genomic data. It currently manages biological data for three projects: the CATH protein domain database (Orengo et al., 1997; Pearl et al., 2001), the VIDA virus domains database (Albà et al., 2001) and the Gene3D database (Buchan et al., 2001). The PFDB has been designed to accommodate protein families identified by a variety of sequence based or structure based protocols and provides a generic resource for biological research by enabling mapping between different protein families and diverse biochemical and genetic data, including complete genomes. RESULTS: A characteristic feature of the PFDB is that it has a number of meta-level entities (for example aggregation, collection and inclusion) represented as base tables in the final design. The explicit representation of relationships at the meta-level has a number of advantages, including flexibility, both in terms of the range of queries that can be formulated and the ability to integrate new biological entities within the existing design. A potential drawback with this approach (poor performance caused by the number of joins across meta-level tables) is avoided by implementing the PFDB with materialized views using the mature relational database technology of Oracle 8i. The resultant database is both fast and flexible. This paper presents the principles on which the database has been designed and implemented, and describes the current status of the database and query facilities supported.
8.
Mott R 《Journal of molecular biology》2000,300(3):649-659
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile. The approximation takes account of the scoring scheme (i.e. gap penalty and substitution matrix or profile), sequence composition and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of databank searches. The method is based on the theoretical ideas introduced by R. Mott and R. Tribe in 1999. Extensive simulation studies show that score thresholds produced by the method are accurate to within ±5% in 95% of cases. We also investigate factors which affect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently, it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood.
9.
For the identification of newly sequenced proteins it is necessary to have a large stock of known proteins for comparison. In this paper we present an automatically generated protein sequence database. The translation program introduced allows a periodical translation of every new release of the EMBL database. Possible errors of the translation are discussed as well as the reliability of the nucleotide sequence data, which turns out to be quite good. A comparison of our translated database with some established ones is given.
Received on December 15, 1987; accepted on April 19, 1988
10.
To elucidate the role of high mass accuracy in mass spectrometric peptide mapping and database searching, selected proteins were subjected to tryptic digestion and the resulting mixtures were analyzed by electrospray ionization on a 7 Tesla Fourier transform mass spectrometer with a mass accuracy of 1 ppm. Two extreme cases were examined in detail: equine apomyoglobin, which digested easily and gave very few spurious masses, and bovine alpha-lactalbumin, which under the conditions used, gave many spurious masses. The effectiveness of accurate mass measurements in minimizing false protein matches was examined by varying the mass error allowed in the search over a wide range (2-500 ppm). For the "clean" data obtained from apomyoglobin, very few masses were needed to return valid protein matches, and the mass error allowed in the search had little effect up to 500 ppm. However, in the case of alpha-lactalbumin more mass values were needed, and low mass errors increased the search specificity. Mass errors below 30 ppm were particularly useful in eliminating false protein matches when few mass values were used in the search. Collision-induced dissociation of an unassigned peak in the alpha-lactalbumin digest provided sufficient data to unambiguously identify the peak as a fragment from alpha-lactalbumin and eliminate a large number of spurious proteins found in the peptide mass search. The results show that even with a relatively high mass error (0.8 Da for mass differences between singly charged product ions), collision-induced dissociation can help identify proteins in cases where unfavorable digest conditions or modifications render digest peaks unidentifiable by a simple mass mapping search.
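The effect of the allowed mass error on a peptide mass search can be illustrated with a small sketch. It uses the standard definition of relative mass error in parts per million; the function names and the mass values are hypothetical and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def ppm_error(observed: float, theoretical: float) -> float:
    """Relative mass error in parts per million (ppm)."""
    return (observed - theoretical) / theoretical * 1e6


def match_masses(
    observed: Sequence[float], theoretical: Sequence[float], tol_ppm: float
) -> List[Tuple[float, float]]:
    """Return every (observed, theoretical) pair whose error is within tol_ppm."""
    return [
        (o, t)
        for o in observed
        for t in theoretical
        if abs(ppm_error(o, t)) <= tol_ppm
    ]


# Hypothetical peptide masses in Da, chosen only to illustrate the tolerance.
theoretical_masses = [1000.0, 1500.0]
observed_masses = [1000.001, 1500.3]  # about 1 ppm and 200 ppm off

tight = match_masses(observed_masses, theoretical_masses, tol_ppm=2.0)
loose = match_masses(observed_masses, theoretical_masses, tol_ppm=500.0)
```

With a 2 ppm tolerance only the first observed mass matches; widening the tolerance to 500 ppm admits the second one as well. That is the mechanism by which a loose tolerance lets spurious matches into a search.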
11.
An object-oriented database system has been developed which is being used to store protein structure data. The database can be queried using the logic programming language Prolog or the query language Daplex. Queries retrieve information by navigating through a network of objects which represent the primary, secondary and tertiary structures of proteins. Routines written in both Prolog and Daplex can integrate complex calculations with the retrieval of data from the database, and can also be stored in the database for sharing among users. Thus object-oriented databases are better suited to prototyping applications and answering complex queries about protein structure than relational databases. This system has been used to find loops of varying length and anchor positions when modelling homologous protein structures.
12.
13.
A comprehensive, non-redundant composite protein sequence database is described. The database, OWL, is an amalgam of data from six publicly-available primary sources, and is generated using strict redundancy criteria. The database is updated monthly and its size has increased almost eight-fold in the last six years: the current version contains > 76,000 entries. For added flexibility, OWL is distributed with a tailor-made query language, together with a number of programs for database exploration, information retrieval and sequence analysis, which together form an integrated database and software resource for protein sequences.
14.
In the recent past, there has been a resurgence of interest in Chikungunya virus (CHIKV), attributed to massive outbreaks of Chikungunya fever in the South-East Asia Region. This is reflected in a substantial increase in submissions of CHIKV genome sequences to the NCBI (National Center for Biotechnology Information) database. Here we present a database, "CHIKVPRO", containing structural and functional annotation of Chikungunya virus proteins (25 strains) submitted to the NCBI repository. The CHIKV genome encodes 9 proteins: 4 non-structural and 5 structural. The CHIKVPRO database aims to provide the virology community with a single authoritative resource for the CHIKV proteome, covering physicochemical and molecular properties, proteolytic cleavage sites, hydrophobicity, transmembrane prediction, and classification into functional families using SVMProt and other Expasy tools. AVAILABILITY: The database is freely available at http://www.chikvpro.info/
15.
Sommer I Rahnenführer J Domingues FS de Lichtenberg U Lengauer T 《Bioinformatics (Oxford, England)》2004,20(5):770-776
MOTIVATION: We introduce a new approach to using the information contained in sequence-to-function prediction data in order to recognize protein template classes, a critical step in predicting protein structure. The data on which our method is based comprise probabilities of functional categories; for given query sequences these probabilities are obtained by a neural net that has previously been trained on a variety of functionally important features. On a training set of sequences we assess the relevance of individual functional categories for identifying a given structural family. Using a combination of the most relevant categories, the likelihood of a query sequence to belong to a specific family can be estimated. RESULTS: The performance of the method is evaluated using cross-validation. For a fixed structural family and for every sequence, a score is calculated that measures the evidence for family membership. Even for structural families of small size, family members receive significantly higher scores. For some examples, we show that the relevant functional features identified by this method are biologically meaningful. The proposed approach can be used to improve existing sequence-to-structure prediction methods. AVAILABILITY: Matlab code is available on request from the authors. The data are available at http://www.mpisb.mpg.de/~sommer/Fun2Struc/
16.
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
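A sequence-derived identifier of this kind can be sketched in a few lines. The abstract only says that SEGUIDs are derived from the primary sequence; the construction below (SHA-1 digest of the upper-cased sequence, Base64-encoded, with the trailing `=` padding removed) follows the commonly documented definition of SEGUID and should be read as an assumption here, not as a quotation from the paper. The example sequences are hypothetical.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Compute a SEGUID-style checksum for a protein sequence.

    Assumed construction: SHA-1 of the upper-cased sequence,
    Base64-encoded, with trailing '=' padding stripped.
    """
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Hypothetical sequences: same residues in different case, plus one variant.
id_upper = seguid("MKTAYIAKQR")
id_lower = seguid("mktayiakqr")
id_other = seguid("MKTAYIAKQK")
```

Because the identifier depends only on the residues, the same sequence gets the same key whichever database it came from, while a single-residue change produces a different key. A 20-byte SHA-1 digest always yields a 27-character string under this encoding.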
17.
Betel D Breitkreuz KE Isserlin R Dewar-Darch D Tyers M Hogue CW 《PLoS computational biology》2007,3(9):1783-1789
The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain–motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information.
18.
A basis set of protein canonical fragments, or centroids, represents the range of local structure found in globular proteins. We develop a methodology to predict centroids from the amino acid sequence. The predictor gives the probability of each centroid in the basis set at each locus along the backbone. The predictor selects the best-fit centroid at about 40% of the loci. The predicted probabilities are accurate and can be used to judge the confidence of each centroid prediction. For example, when filtering out centroids with <0.50 probability, the predictor is 65% accurate, although such high-probability centroids occur at only 28% of the loci. Centroids with high probability can be interpreted as segments that are highly influenced by the amino acid sequence, whereas centroids with low probability can be interpreted as segments that are more likely influenced by tertiary contacts. Low-resolution starting-point structures can be generated by fitting the predicted centroids together.
19.
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.
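The credibility limit defined in this abstract can be made concrete with a toy calculation. The sketch below treats α as a fraction between 0 and 1 and assumes that the posterior is represented by a weighted ensemble of equal-length encoded solutions. The encoding, the weights, and the function names are hypothetical and are not the authors' implementation.

```python
from typing import Sequence


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length encodings differ."""
    return sum(x != y for x, y in zip(a, b))


def credibility_limit(
    estimate: Sequence,
    ensemble: Sequence[Sequence],
    weights: Sequence[float],
    alpha: float,
) -> int:
    """Smallest Hamming radius around `estimate` that contains at least
    a (1 - alpha) share of the posterior weight."""
    total = sum(weights)
    # Sort ensemble members by their distance from the point estimate.
    pairs = sorted((hamming(estimate, s), w) for s, w in zip(ensemble, weights))
    mass = 0.0
    for dist, w in pairs:
        mass += w / total
        if mass >= 1 - alpha - 1e-12:  # small tolerance for rounding error
            return dist
    return pairs[-1][0]


# Hypothetical ensemble of encoded solutions with posterior weights.
ensemble = ["AAAA", "AAAB", "AABB", "BBBB"]
weights = [50, 30, 15, 5]
radius_95 = credibility_limit("AAAA", ensemble, weights, alpha=0.05)
radius_80 = credibility_limit("AAAA", ensemble, weights, alpha=0.20)
```

In this toy ensemble, a radius of 1 around the estimate covers 80% of the weight and a radius of 2 covers 95%. A tighter limit for a given α means the posterior is more concentrated around the estimate.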
20.
In order to understand the mechanism of protein folding and to assist the rational de novo design of fast-folding, non-aggregating and stable artificial enzymes, it is very helpful to be able to simulate protein folding reactions and to predict the structures of proteins and other biomacromolecules. Here, we use a method of computer programming called "evolutionary computer programming", in which a program evolves depending on the evolutionary pressure exerted on it. In the application presented here, a computer program for folding simulations, the evolutionary pressure was directed towards finding deep minima in the energy landscape of protein folding faster. After only 20 evolution steps, the evolved program was able to find deep minima in the energy landscape more than 10 times faster than the original program.