首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We propose new methods for finding similarities in protein structure databases. These methods extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. The feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. It quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times, while keeping the sensitivity similar. Our technique can also be incorporated with DALI and CE to improve their running times by a factor of 2 and 2.7 respectively. The software is available online at http://bioserver.cs.ucsb.edu/.  相似文献   

2.
MOTIVATION: As the sizes of three-dimensional (3D) protein structure databases are growing rapidly nowadays, exhaustive database searching, in which a 3D query structure is compared to each and every structure in the database, becomes inefficient. We propose a rapid 3D protein structure retrieval system named 'ProtDex2', in which we adopt the techniques used in information retrieval systems in order to perform rapid database searching without having access to every 3D structure in the database. The retrieval process is based on the inverted-file index constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the 3D protein structures in the database. ProtDex2 is a significant improvement, both in terms of speed and accuracy, upon its predecessor system, ProtDex. RESULTS: The experimental results show that ProtDex2 is very much faster than two well-known protein structure comparison methods, DALI and CE, yet not sacrificing on the accuracy of the comparison. When comparing with a similar SSE-based method, namely TopScan, ProtDex2 is much faster with comparable degree of accuracy. AVAILABILITY: The software is available at: http://xena1.ddns.comp.nus.edu.sg/~genesis/PD2.htm  相似文献   

3.
The question of how best to compare and classify the (three‐dimensional) structures of proteins is one of the most important unsolved problems in computational biology. To help tackle this problem, we have developed a novel shape‐density superposition algorithm called 3D‐Blast which represents and superposes the shapes of protein backbone folds using the spherical polar Fourier correlation technique originally developed by us for protein docking. The utility of this approach is compared with several well‐known protein structure alignment algorithms using receiver‐operator‐characteristic plots of queries against the “gold standard” CATH database. Despite being completely independent of protein sequences and using no information about the internal geometry of proteins, our results from searching the CATH database show that 3D‐Blast is highly competitive compared to current state‐of‐the‐art protein structure alignment algorithms. A novel and potentially very useful feature of our approach is that it allows an average or “consensus” fold to be calculated easily for a given group of protein structures. We find that using consensus shapes to represent entire fold families also gives very good database query performance. We propose that using the notion of consensus fold shapes could provide a powerful new way to index existing protein structure databases, and that it offers an objective way to cluster and classify all of the currently known folds in the protein universe. Proteins 2012. © 2011 Wiley Periodicals, Inc.  相似文献   

4.
Proteins that contain similar structural elements often have analogous functions regardless of the degree of sequence similarity or structure connectivity in space. In general, protein structure comparison (PSC) provides a straightforward methodology for biologists to determine critical aspects of structure and function. Here, we developed a novel PSC technique based on angle-distance image (A-D image) transformation and matching, which is independent of sequence similarity and connectivity of secondary structure elements (SSEs). An A-D image is constructed by utilizing protein secondary structure information. According to various types of SSEs, the mutual SSE pairs of the query protein are classified into three different types of sub-images. Subsequently, corresponding sub-images between query and target protein structures are compared using modified cross-correlation approaches to identify the similarity of various patterns. Structural relationships among proteins are displayed by hierarchical clustering trees, which facilitate the establishment of the evolutionary relationships between structure and function of various proteins.Four standard testing datasets and one newly created dataset were used to evaluate the proposed method. The results demonstrate that proteins from these five datasets can be categorized in conformity with their spatial distribution of SSEs. Moreover, for proteins with low sequence identity that share high structure similarity, the proposed algorithms are an efficient and effective method for structural comparison.  相似文献   

5.
Searching for protein structure-function relationships using three-dimensional (3D) structural coordinates represents a fundamental approach for determining the function of proteins with unknown functions. Since protein structure databases are rapidly growing in size, the development of a fast search method to find similar protein substructures by comparison of protein 3D structures is essential. In this article, we present a novel protein 3D structure search method to find all substructures with root mean square deviations (RMSDs) to the query structure that are lower than a given threshold value. Our new algorithm runs in O(m + N/m(0.5)) time, after O(N log N) preprocessing, where N is the database size and m is the query length. The new method is 1.8-41.6 times faster than the practically best known O(N) algorithm, according to computational experiments using a huge database (i.e., >20,000,000 C-alpha coordinates).  相似文献   

6.
Noble WS  Kuang R  Leslie C  Weston J 《The FEBS journal》2005,272(20):5119-5128
Perhaps the most widely used applications of bioinformatics are tools such as psi-blast for searching sequence databases. We describe a recently developed protein database search algorithm called rankprop. rankprop relies upon a precomputed network of pairwise protein similarities. The algorithm performs a diffusion operation from a specified query protein across the protein similarity network. The resulting activation scores, assigned to each database protein, encode information about the global structure of the protein similarity network. This type of algorithm has a rich history in associationist psychology, artificial intelligence and web search. We describe the rankprop algorithm and its relatives, and we provide evidence that the algorithm successfully improves upon the rankings produced by psi-blast.  相似文献   

7.
Comparison and classification of folding patterns from a database of protein structures is crucial to understand the principles of protein architecture, evolution and function. Current search methods for proteins with similar folding patterns are slow and computationally intensive. The sharp growth in the number of known protein structures poses severe challenges for methods of structural comparison. There is a need for methods that can search the database of structures accurately and rapidly. We provide several methods to search for similar folding patterns using a concise tableau representation of proteins that encodes the relative geometry of secondary structural elements. Our first approach allows the extraction of identical and very closely-related protein folding patterns in constant-time (per hit). Next, we address the hard computational problem of extraction of maximally-similar subtableaux, when comparing two tableaux. We solve the problem using Quadratic and Linear integer programming formulations and demonstrate their power to identify subtle structural similarities, especially when protein structures significantly diverge. Finally, we describe a rapid and accurate method for comparing a query structure against a database of protein domains, TableauSearch. TableauSearch is rapid enough to search the entire structural database in seconds on a standard desktop computer. Our analysis of TableauSearch on many queries shows that the method is very accurate in identifying similarities of folding patterns, even between distantly related proteins. AVAILABILITY: A web server implementing the TableauSearch is available from http://hollywood.bx.psu.edu/TabSearch.  相似文献   

8.
An object-oriented database system has been developed which is being used to store protein structure data. The database can be queried using the logic programming language Prolog or the query language Daplex. Queries retrieve information by navigating through a network of objects which represent the primary, secondary and tertiary structures of proteins. Routines written in both Prolog and Daplex can integrate complex calculations with the retrieval of data from the database, and can also be stored in the database for sharing among users. Thus object-oriented databases are better suited to prototyping applications and answering complex queries about protein structure than relational databases. This system has been used to find loops of varying length and anchor positions when modelling homologous protein structures.  相似文献   

9.
The Protein Mutant Database.   总被引:3,自引:0,他引:3       下载免费PDF全文
Currently the protein mutant database (PMD) contains over 81 000 mutants, including artificial as well as natural mutants of various proteins extracted from about 10 000 articles. We recently developed a powerful viewing and retrieving system (http://pmd.ddbj.nig.ac.jp), which is integrated with the sequence and tertiary structure databases. The system has the following features: (i) mutated sequences are displayed after being automatically generated from the information described in the entry together with the sequence data of wild-type proteins integrated. This is a convenient feature because it allows one to see the position of altered amino acids (shown in a different color) in the entire sequence of a wild-type protein; (ii) for those proteins whose 3D structures have been experimentally determined, a 3D structure is displayed to show mutation sites in a different color; (iii) a sequence homology search against PMD can be carried out with any query sequence; (iv) a summary of mutations of homologous sequences can be displayed, which shows all the mutations at a certain site of a protein, recorded throughout the PMD.  相似文献   

10.
To unscramble the relationship between protein function and protein structure, it is essential to assess the protein similarity from different aspects. Although many methods have been proposed for protein structure alignment or comparison, alternative similarity measures are still strongly demanded due to the requirement of fast screening and query in large-scale structure databases. In this paper, we first formulate a novel representation of a protein structure, i.e., Feature Sequence of Surface (FSS). Then, a new score scheme is developed to measure the similarity between two representations. To verify the proposed method, numerical experiments are conducted in four different protein data sets. We also classify SARS coronavirus to verify the effectiveness of the new method. Furthermore, preliminary results of fast classification of the whole CATH v2.5.1 database based on the new macrostructure similarity are given as a pilot study. We demonstrate that the proposed approach to measure the similarities between protein structures is simple to implement, computationally efficient, and surprisingly fast. In addition, the method itself provides a new and quantitative tool to view a protein structure.  相似文献   

11.
Our algorithm predicts short linear functional motifs in proteins using only sequence information. Statistical models for short linear functional motifs in proteins are built using the database of short sequence fragments taken from proteins in the current release of the Swiss-Prot database. Those segments are confirmed by experiments to have single-residue post-translational modification. The sensitivities of the classification for various types of short linear motifs are in the range of 70%. The query protein sequence is dissected into short overlapping fragments. All segments are represented as vectors. Each vector is then classified by a machine learning algorithm (Support Vector Machine) as potentially modifiable or not. The resulting list of plausible post-translational sites in the query protein is returned to the user. We also present a study of the human protein kinase C family as a biological application of our method.  相似文献   

12.
It has been known that topologically different proteins of the same class sometimes share the same spatial arrangement of secondary structure elements (SSEs). However, the frequency by which topologically different structures share the same spatial arrangement of SSEs is unclear. It is important to estimate this frequency because it provides both a deeper understanding of the geometry of protein folds and a valuable suggestion for predicting protein structures with novel folds. Here we clarified the frequency with which protein folds share the same SSE packing arrangement with other folds, the types of spatial arrangement of SSEs that are frequently observed across different folds, and the diversity of protein folds that share the same spatial arrangement of SSEs with a given fold, using a protein structure alignment program MICAN, which we have been developing. By performing comprehensive structural comparison of SCOP fold representatives, we found that approximately 80% of protein folds share the same spatial arrangement of SSEs with other folds. We also observed that many protein pairs that share the same spatial arrangement of SSEs belong to the different classes, often with an opposing N- to C-terminal direction of the polypeptide chain. The most frequently observed spatial arrangement of SSEs was the 2-layer α/β packing arrangement and it was dispersed among as many as 27% of SCOP fold representatives. These results suggest that the same spatial arrangements of SSEs are adopted by a wide variety of different folds and that the spatial arrangement of SSEs is highly robust against the N- to C-terminal direction of the polypeptide chain.  相似文献   

13.
MOTIVATION: Protein families can be defined based on structure or sequence similarity. We wanted to compare two protein family databases, one based on structural and one on sequence similarity, to investigate to what extent they overlap, the similarity in definition of corresponding families, and to create a list of large protein families with unknown structure as a resource for structural genomics. We also wanted to increase the sensitivity of fold assignment by exploiting protein family HMMs. RESULTS: We compared Pfam, a protein family database based on sequence similarity, to Scop, which is based on structural similarity. We found that 70% of the Scop families exist in Pfam while 57% of the Pfam families exist in Scop. Most families that occur in both databases correspond well to each other, but in some cases they are different. Such cases highlight situations in which structure and sequence approaches differ significantly. The comparison enabled us to compile a list of the largest families that do not occur in Scop; these are suitable targets for structure prediction and determination, and may be useful to guide projects in structural genomics. It can be noted that 13 out of the 20 largest protein families without a known structure are likely transmembrane proteins. We also exploited Pfam to increase the sensitivity of detecting homologs of proteins with known structure, by comparing query sequences to Pfam HMMs that correspond to Scop families. For SWISSPROT+TREMBL, this yielded an increase in fold assignment from 31% to 42% compared to using FASTA only. This method assigned a structure to 22% of the proteins in Saccharomyces cerevisiae, 24% in Escherichia coli, and 16% in Methanococcus jannaschii.  相似文献   

14.
In protein structures, the fold is described according to the spatial arrangement of secondary structure elements (SSEs: α‐helices and β‐strands) and their connectivity. The connectivity or the pattern of links among SSEs is one of the most important factors for understanding the variety of protein folds. In this study, we introduced the connectivity strings that encode the connectivities by using the types, positions, and connections of SSEs, and computationally enumerated all the connectivities of two‐layer αβ sandwiches. The calculated connectivities were compared with those in natural proteins determined using MICAN, a nonsequential structure comparison method. For 2α‐4β, among 23,000 of all connectivities, only 48 were free from irregular connectivities such as loop crossing. Of these, only 20 were found in natural proteins and the superfamilies were biased toward certain types of connectivities. A similar disproportional distribution was confirmed for most of other spatial arrangements of SSEs in the two‐layer αβ sandwiches. We found two connectivity rules that explain the bias well: the abundances of interlayer connecting loops that bridge SSEs in the distinct layers; and nonlocal β‐strand pairs, two spatially adjacent β‐strands located at discontinuous positions in the amino acid sequence. A two‐dimensional plot of these two properties indicated that the two connectivity rules are not independent, which may be interpreted as a rule for the cooperativity of proteins.  相似文献   

15.

Background  

The identification of protein domains plays an important role in protein structure comparison. Domain query size and composition are critical to structure similarity search algorithms such as the Vector Alignment Search Tool (VAST), the method employed for computing related protein structures in NCBI Entrez system. Currently, domains identified on the basis of structural compactness are used for VAST computations. In this study, we have investigated how alternative definitions of domains derived from conserved sequence alignments in the Conserved Domain Database (CDD) would affect the domain comparisons and structure similarity search performance of VAST.  相似文献   

16.
MOTIVATION: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size and construction/updating of the interval database. RESULTS: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested, this query algorithm is 5-500-fold faster than other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are approximately 100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications. AVAILABILITY: NCList data structure is part of Pygr, a bioinformatics graph database library, available at http://sourceforge.net/projects/pygr  相似文献   

17.
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.  相似文献   

18.
MOTIVATION: Structural alignments of superfamily members often exhibit insertions and deletions of secondary structure elements (SSEs), yet conserved subsets of SSEs appear to be important for maintaining the fold and facilitating common functionalities. RESULTS: A database of aligned SSEs was constructed from the structure-based alignments of protein superfamily members in the CAMPASS database. SSEs were classified into several types on the basis of their length and solvent accessibility and counts were made for the replacements of SSEs in different types at structurally aligned positions. The results, summarized as log-odds substitution matrices, can be used for two types of comparisons: (1) structure against structure, both with secondary structure assignments; and (2) structure against sequence with predicted secondary structures. The conservation of SSEs at each alignment position was defined as the deviation of observed SSE frequencies from the uniform distribution. This offers a useful resource to define and examine the core of superfamily folds. Even when the structure of only a single member of a superfamily is known, the extended method can be used to predict the conservation of SSEs. Such information will be useful when modelling the structure of other members of a superfamily or identifying structurally and functionally important positions in the fold.  相似文献   

19.
GlycoSuiteDB is a relational database that curates information from the scientific literature on glyco-protein derived glycan structures, their biological sources, the references in which the glycan was described and the methods used to determine the glycan structure. To date, the database includes most published O:-linked oligosaccharides from the last 50 years and most N:-linked oligosaccharides that were published in the 1990s. For each structure, information is available concerning the glycan type, linkage and anomeric configuration, mass and composition. Detailed information is also provided on native and recombinant sources, including tissue and/or cell type, cell line, strain and disease state. Where known, the proteins to which the glycan structures are attached are reported, and cross-references to the SWISS-PROT/TrEMBL protein sequence databases are given if applicable. The GlycoSuiteDB annotations include literature references which are linked to PubMed, and detailed information on the methods used to determine each glycan structure are noted to help the user assess the quality of the structural assignment. GlycoSuiteDB has a user-friendly web interface which allows the researcher to query the database using mono-isotopic or average mass, monosaccharide composition, glycosylation linkages (e.g. N:- or O:-linked), reducing terminal sugar, attached protein, taxonomy, tissue or cell type and GlycoSuiteDB accession number. Advanced queries using combinations of these parameters are also possible. GlycoSuiteDB can be accessed on the web at http://www.glycosuite.com.  相似文献   

20.
PALI (release 1.2) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families. The data set of homologous protein structures has been derived by consulting the SCOP database (release 1.50) and the data set comprises 604 families of homologous proteins involving 2739 protein domain structures with each family made up of at least two members. Each member in a family has been structurally aligned with every other member in the same family (pairwise alignment) and all the members in the family are also aligned using simultaneous super-position (multiple alignment). The structural alignments are performed largely automatically, with manual interventions especially in the cases of distantly related proteins, using the program STAMP (version 4.2). Every family is also associated with two dendrograms, calculated using PHYLIP (version 3.5), one based on a structural dissimilarity metric defined for every pairwise alignment and the other based on similarity of topologically equivalent residues. These dendrograms enable easy comparison of sequence and structure-based relationships among the members in a family. Structure-based alignments with the details of structural and sequence similarities, superposed coordinate sets and dendrograms can be accessed conveniently using a web interface. The database can be queried for protein pairs with sequence or structural similarities falling within a specified range. Thus PALI forms a useful resource to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity. PALI also contains over 653 'orphans' (single member families). Using the web interface involving PSI_BLAST and PHYLIP it is possible to associate the sequence of a new protein with one of the families in PALI and generate a phylogenetic tree combining the query sequence and proteins of known 3-D structure. The database with the web interfaced search and dendrogram generation tools can be accessed at http://pauling.mbu.iisc.ernet. in/ approximately pali.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号