Similar literature
 20 similar articles found; search took 46 ms
1.
2.
Protein sequence similarity searches using patterns as seeds.   Total citations: 18 (self-citations: 1, citations by others: 17)
Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
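The seed-finding step described in this abstract can be illustrated with a small sketch. It assumes the pattern is written in a PROSITE-style syntax and handles only a subset of it (single residues, `x`, `[...]`, `{...}`, and `(n)` or `(n,m)` repeats). The helper names `prosite_to_regex` and `find_seeds` are hypothetical and are not part of PHI-BLAST; alignment extension and score statistics are not shown.

```python
import re

# One pattern element: a residue, x, [set] or {excluded set}, plus an optional repeat.
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # (n) or (n,m) repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_seeds(pattern: str, sequence: str) -> list[tuple[int, int]]:
    """Return (start, end) of every pattern occurrence, overlaps included."""
    # A lookahead lets finditer report occurrences that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1), m.end(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(find_seeds("[AG]-x(4)-G-K-[ST]", "MGPSGSGKSTL"))
```

Each returned span marks a place in a database sequence where an alignment to the query could be anchored; how the alignment is then built is left out of this sketch.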

3.
Serial BLAST searching   Total citations: 2 (self-citations: 0, citations by others: 2)
MOTIVATION: The translating BLAST algorithms are powerful tools for finding protein-coding genes because they identify amino acid similarities in nucleotide sequences. Unfortunately, these kinds of searches are computationally intensive and often represent bottlenecks in sequence analysis pipelines. Tuning parameters for speed can make the searches much faster, but one risks losing low-scoring alignments. However, high scoring alignments are relatively resistant to such changes in parameters, and this fact makes it possible to use a serial strategy where a fast, insensitive search is used to pre-screen a database for similar sequences, and a slow, sensitive search is used to produce the sequence alignments. RESULTS: Serial BLAST searches improve both the speed and sensitivity.
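The serial strategy (a cheap pre-screen followed by an expensive search on the survivors) can be sketched without BLAST itself. In the sketch below both stages are stand-ins chosen for illustration: the pre-screen counts shared k-mers and the sensitive stage is a plain Smith-Waterman score with a linear gap penalty. All function names, scoring values, and thresholds are assumptions, not the parameters used in the paper.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prescreen(query: str, database: dict[str, str], k: int = 3, min_shared: int = 2) -> list[str]:
    """Fast, insensitive stage: keep entries sharing at least min_shared k-mers."""
    q = kmer_set(query, k)
    return [name for name, seq in database.items()
            if len(q & kmer_set(seq, k)) >= min_shared]


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Slow, sensitive stage: best Smith-Waterman local alignment score."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            value = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


def serial_search(query: str, database: dict[str, str]) -> list[tuple[int, str]]:
    """Run the expensive scoring only on entries that pass the pre-screen."""
    candidates = prescreen(query, database)
    return sorted(((local_score(query, database[name]), name) for name in candidates),
                  reverse=True)


if __name__ == "__main__":
    db = {"a": "ACGTACGT", "b": "TTTTTTTT", "c": "ACGTTTTT"}
    print(serial_search("ACGTACGT", db))
```

The design point is that the second stage never sees entries the first stage rejected, so total cost is dominated by the cheap filter whenever most of the database is unrelated to the query.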

4.
Sequence comparison methods based on position-specific score matrices (PSSMs) have proven a useful tool for recognition of the divergent members of a protein family and for annotation of functional sites. Here we investigate one of the factors that affects overall performance of PSSMs in a PSI-BLAST search, the algorithm used to construct the seed alignment upon which the PSSM is based. We compare PSSMs based on alignments constructed by global sequence similarity (ClustalW and ClustalW-pairwise), local sequence similarity (BLAST), and local structure similarity (VAST). To assess performance with respect to identification of conserved functional or structural sites, we examine the accuracy of the three-dimensional molecular models predicted by PSSM-sequence alignments. Using the known structures of those sequences as the standard of truth, we find that model accuracy varies with the algorithm used for seed alignment construction in the pattern local-structure (VAST) > local-sequence (BLAST) > global-sequence (ClustalW). Using structural similarity of query and database proteins as the standard of truth, we find that PSSM recognition sensitivity depends primarily on the diversity of the sequences included in the alignment, with an optimum around 30-50% average pairwise identity. We discuss these observations, and suggest a strategy for constructing seed alignments that optimize PSSM-sequence alignment accuracy and recognition sensitivity.
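To make the dependence of a PSSM on its seed alignment concrete, here is a minimal sketch of deriving log-odds column scores from an ungapped alignment. It assumes a uniform background distribution and a flat pseudocount, which is simpler than what PSI-BLAST does (no sequence weighting, no substitution-matrix-based pseudocounts); the function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(alignment: list[str], alphabet: str, pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One dict of log2-odds scores per alignment column."""
    background = 1.0 / len(alphabet)          # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are simply ignored.
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


def score_sequence(pssm: list[dict[str, float]], seq: str) -> float:
    """Sum of column scores for an ungapped sequence of the same length."""
    return sum(column[c] for column, c in zip(pssm, seq))


if __name__ == "__main__":
    pssm = build_pssm(["ACG", "ACG", "ACT"], "ACGT")
    for candidate in ("ACG", "ACT", "TTT"):
        print(candidate, round(score_sequence(pssm, candidate), 3))
```

Because every score is computed from the column counts, changing which sequences enter the seed alignment, or how they are aligned, changes the matrix directly.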

5.
BioParser     
The widely used programs BLAST (in this article, 'BLAST' includes both the National Center for Biotechnology Information [NCBI] BLAST and the Washington University version WU BLAST) and FASTA for similarity searches in nucleotide and protein databases usually result in copious output. However, when large query sets are used, human inspection rapidly becomes impractical. BioParser is a Perl program for parsing BLAST and FASTA reports. Making extensive use of the BioPerl toolkit, the program filters, stores and returns components of these reports in either ASCII or HTML format. BioParser is also capable of automatically feeding a local MySQL database with the parsed information, allowing subsequent filtering of hits and/or alignments with specific attributes. For this reason, BioParser is a valuable tool for large-scale similarity analyses by improving the access to the information present in BLAST or FASTA reports, facilitating extraction of useful information of large sets of sequence alignments, and allowing for easy handling and processing of the data. AVAILABILITY: BioParser is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.0 license terms (http://creativecommons.org/licenses/by-nc-nd/2.0/) and is available upon request. Additional information can be found at the BioParser website (http://www.dbbm.fiocruz.br/BioParser.html).

6.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/ SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. 
In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.
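The percentages quoted in this abstract measure how many residue pairs of a reference (structure-based) alignment are reproduced by a sequence alignment. A minimal sketch of that kind of measure follows, assuming each alignment is given as two gapped rows with `-` as the gap character; the function names are hypothetical and the paper's exact bookkeeping may differ.

```python
def aligned_pairs(row_a: str, row_b: str) -> set[tuple[int, int]]:
    """Residue index pairs (i, j) placed in the same column of a pairwise alignment."""
    pairs = set()
    i = j = 0                       # ungapped positions in each sequence
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        if x != "-":
            i += 1
        if y != "-":
            j += 1
    return pairs


def fraction_correct(test: tuple[str, str], reference: tuple[str, str]) -> float:
    """Share of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(*reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(*test) & ref) / len(ref)


if __name__ == "__main__":
    reference = ("ACDEF", "ACDEF")
    test = ("AC-DEF", "ACD-EF")
    print(fraction_correct(test, reference))
```

Dividing by the size of the reference set means a short but accurate alignment scores low, which matches the abstract's observation that longer alignments recover a larger fraction of the structurally aligned residues.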

7.
8.
MOTIVATION: Word-matching algorithms such as BLAST are routinely used for sequence comparison. These algorithms typically use areas of matching words to seed alignments which are then used to assess the degree of sequence similarity. In this paper, we show that by formally separating the word-matching and sequence-alignment process, and using information about word frequencies to generate alignments and similarity scores, we can create a new sequence-comparison algorithm which is both fast and sensitive. The formal split between word searching and alignment allows users to select an appropriate alignment method without affecting the underlying similarity search. The algorithm has been used to develop software for identifying entries in DNA sequence databases which are contaminated with vector sequence. RESULTS: We present three algorithms, RAPID, PHAT and SPLAT, which together allow vector contaminations to be found and assessed extremely rapidly. RAPID is a word search algorithm which uses probabilities to modify the significance attached to different words; PHAT and SPLAT are alignment algorithms. An initial implementation has been shown to be approximately an order of magnitude faster than BLAST. The formal split between word searching and alignment not only offers considerable gains in performance, but also allows alignment generation to be viewed as a user interface problem, allowing the most useful output method to be selected without affecting the underlying similarity search. Receiver Operator Characteristic (ROC) analysis of an artificial test set allows the optimal score threshold for identifying vector contamination to be determined. ROC curves were also used to determine the optimum word size (nine) for finding vector contamination. An analysis of the entire expressed sequence tag (EST) subset of EMBL found a contamination rate of 0.27%. 
A more detailed analysis of the 50 000 ESTs in est10.dat (an EST subset of EMBL) finds an error rate of 0.86%, principally due to two large-scale projects. AVAILABILITY: A Web page for the software exists at http://bioinf.man.ac.uk/rapid, or it can be downloaded from ftp://ftp.bioinf.man.ac.uk/RAPID CONTACT: crispin@cs.man.ac.uk

9.
Fast and exact comparison of large genomic sequences remains a challenging task in biosequence analysis. We consider the problem of finding all epsilon-matches between two sequences, i.e., all local alignments over a given length with an error rate of at most epsilon. We study this problem theoretically, giving an efficient q-gram filter for solving it. Two applications of the filter are also discussed, in particular genomic sequence assembly and BLAST-like sequence comparison. Our results show that the method is 25 times faster than BLAST, while not being heuristic.
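The idea behind a q-gram filter can be shown with the classical q-gram lemma: two strings within edit distance e, the longer of which has length n, share at least n + 1 - q(e + 1) q-grams, so any pair sharing fewer can be discarded without computing an alignment. The sketch below applies that classical bound to two whole strings; it is not the filter of the paper, which targets local epsilon-matches, and the function names are hypothetical.

```python
from collections import Counter


def qgram_threshold(n: int, q: int, e: int) -> int:
    """Minimum number of shared q-grams implied by edit distance <= e."""
    return n + 1 - q * (e + 1)


def shared_qgrams(a: str, b: str, q: int) -> int:
    """Number of q-gram occurrences common to both strings (multiset intersection)."""
    count_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    count_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum((count_a & count_b).values())


def may_match(a: str, b: str, q: int, e: int) -> bool:
    """False means the pair cannot be within edit distance e; True means verify it."""
    return shared_qgrams(a, b, q) >= qgram_threshold(max(len(a), len(b)), q, e)


if __name__ == "__main__":
    print(may_match("ACGTACGTAC", "ACGTTCGTAC", q=3, e=1))
    print(may_match("ACGTACGTAC", "TTTTTTTTTT", q=3, e=1))
```

A filter of this kind only rejects pairs that provably cannot match, so it loses no true matches; pairs that pass still need a full alignment to confirm them.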

10.
A structural class in the MemGen classification of membrane proteins is a set of evolutionary related proteins sharing a similar global fold. A structural class contains both closely related pairs of proteins for which homology is clear from sequence comparison and very distantly related pairs, for which it is not possible to establish homology based on sequence similarity alone. In the latter case the evolutionary link is based on hydropathy profile analysis. Here, we use these evolutionary related sets of proteins to analyze the relationship between E-values in BLAST searches, sequence similarities in multiple sequence alignments and structural similarities in hydropathy profile analyses. Two structural classes of secondary transporters termed ST[3], which includes the Ion Transporter (IT) superfamily and ST[4], which includes the DAACS family (TC# 2.A.23) were extracted from the NCBI protein database. ST[3] contains 2051 unique sequences distributed over 32 families and 59 subfamilies. ST[4] is a smaller class containing 399 unique sequences distributed over 2 families and 7 subfamilies. One subfamily in ST[4] contains a new class of binding protein dependent secondary transporters. Comparison of the averaged hydropathy profiles of the subfamilies in ST[3] and ST[4] revealed that the two classes represent different folds. Divergence of the sequences in ST[4] is much smaller than observed in ST[3], suggesting different constraints on the proteins during evolution. 
Analysis of the correlation between the evolutionary relationship of pairs of proteins in a class and the BLAST E-value revealed that: (i) the BLAST algorithm is unable to pick up the majority of the links between proteins in structural class ST[3], (ii) "low complexity filtering" and "composition based statistics" improve the specificity, but strongly reduce the sensitivity of BLAST searches for distantly related proteins, indicating that these filters are too stringent for the proteins analyzed, and (iii) the E-value cut-off, which may be used to evaluate evolutionary significance of a hit in a BLAST search is very different for the two structural classes of membrane proteins.

11.
Database scanning programs such as BLAST and FASTA are used nowadays by most biologists for the post-genomic processing of DNA or protein sequence information (in particular to retrieve the structure/function of uncharacterized proteins). Unfortunately, their results can be polluted by identical alignments (called redundancies) coming from the same protein or DNA sequences present in different entries of the database. This makes the efficient use of the listed alignments difficult. Pretreatment of databases has been proposed to suppress strictly identical entries. However, there still remain many identical alignments since redundancies may occur locally for entries corresponding to various fragments of the same sequence or for entries corresponding to very homologous sequences but differing at the level of a few residues such as ortholog proteins. In the present work, we show that redundant alignments can be indeed numerous even when working with a pretreated non-redundant data bank, going as high as 60% of the output results according to the query and the bank. Therefore the accuracy and the efficiency of the post-genomic work will be greatly increased if these redundancies are removed. To solve this up to now unaddressed problem, we have developed an algorithm that allows for the efficient and safe suppression of all the redundancies with no loss of information. This algorithm is based on various filtering steps that we describe here in the context of the Automat similarity search program, and such an algorithm should also be added to the other similarity search programs (BLAST, FASTA, etc.).

12.
A structural class in the MemGen classification of membrane proteins is a set of evolutionary related proteins sharing a similar global fold. A structural class contains both closely related pairs of proteins for which homology is clear from sequence comparison and very distantly related pairs, for which it is not possible to establish homology based on sequence similarity alone. In the latter case the evolutionary link is based on hydropathy profile analysis. Here, we use these evolutionary related sets of proteins to analyze the relationship between E-values in BLAST searches, sequence similarities in multiple sequence alignments and structural similarities in hydropathy profile analyses. Two structural classes of secondary transporters termed ST[3], which includes the Ion Transporter (IT) superfamily and ST[4], which includes the DAACS family (TC# 2.A.23) were extracted from the NCBI protein database. ST[3] contains 2051 unique sequences distributed over 32 families and 59 subfamilies. ST[4] is a smaller class containing 399 unique sequences distributed over 2 families and 7 subfamilies. One subfamily in ST[4] contains a new class of binding protein dependent secondary transporters. Comparison of the averaged hydropathy profiles of the subfamilies in ST[3] and ST[4] revealed that the two classes represent different folds. Divergence of the sequences in ST[4] is much smaller than observed in ST[3], suggesting different constraints on the proteins during evolution. 
Analysis of the correlation between the evolutionary relationship of pairs of proteins in a class and the BLAST E-value revealed that: (i) the BLAST algorithm is unable to pick up the majority of the links between proteins in structural class ST[3], (ii) ‘low complexity filtering’ and ‘composition based statistics’ improve the specificity, but strongly reduce the sensitivity of BLAST searches for distantly related proteins, indicating that these filters are too stringent for the proteins analyzed, and (iii) the E-value cut-off, which may be used to evaluate evolutionary significance of a hit in a BLAST search is very different for the two structural classes of membrane proteins.

13.
This paper presents a novel approach to profile-profile comparison. The method compares two input profiles (like those that are generated by PSI-BLAST) and assigns a similarity score to assess their statistical similarity. Our profile-profile comparison tool, which allows for gaps, can be used to detect weak similarities between protein families. It has also been optimized to produce alignments that are in very good agreement with structural alignments. Tests show that the profile-profile alignments are indeed highly correlated with similarities between secondary structure elements and tertiary structure. Exhaustive evaluations show that our method is significantly more sensitive in detecting distant homologies than the popular profile-based search programs PSI-BLAST and IMPALA. The relative improvement is the same order of magnitude as the improvement of PSI-BLAST relative to BLAST. Our new tool often detects similarities that fall within the twilight zone of sequence similarity.

14.

Background  

DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.

15.
MOTIVATION: The popular BLAST algorithm is based on a local similarity search strategy, so its high-scoring segment pairs (HSPs) do not have global alignment information. When scientists use BLAST to search for a target protein or DNA sequence in a huge database like the human genome map, the existence of repeated fragments, homologues or pseudogenes in the genome often makes the BLAST result filled with redundant HSPs. Therefore, we need a computational strategy to alleviate this problem. RESULTS: In the gene discovery group of Celera Genomics, I developed a two-step method, i.e. a BLAST step plus an LIS step, to align thousands of cDNA and protein sequences into the human genome map. The LIS step is based on a mature computational algorithm, Longest Increasing Subsequence (LIS) algorithm. The idea is to use the LIS algorithm to find the longest series of consecutive HSPs in the BLAST output. Such a BLAST+LIS strategy can be used as an independent alignment tool or as a complementary tool for other alignment programs like Sim4 and GenWise. It can also work as a general purpose BLAST result processor in all sorts of BLAST searches. Two examples from Celera were shown in this paper.
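The LIS step can be sketched as follows. The sketch assumes each HSP is reduced to a `(query_start, subject_start)` pair, sorts HSPs by query position, and then finds the longest strictly increasing subsequence of subject positions with the standard O(n log n) patience method. The function name and the reduced HSP representation are assumptions; the paper's handling of HSP lengths, overlaps, strands, and scores is not modeled.

```python
from bisect import bisect_left


def longest_colinear_chain(hsps: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Longest series of HSPs increasing in both query and subject position."""
    # Ties in query_start are ordered by descending subject_start so that two
    # HSPs starting at the same query position can never be chained together.
    ordered = sorted(hsps, key=lambda h: (h[0], -h[1]))
    tails: list[int] = []        # tails[k]: smallest subject_start ending a chain of length k + 1
    tails_idx: list[int] = []    # index in `ordered` of the HSP stored in tails[k]
    prev: list[int | None] = [None] * len(ordered)

    for i, (_, subject_start) in enumerate(ordered):
        k = bisect_left(tails, subject_start)
        if k == len(tails):
            tails.append(subject_start)
            tails_idx.append(i)
        else:
            tails[k] = subject_start
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else None

    # Walk the predecessor links back from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


if __name__ == "__main__":
    hsps = [(50, 50), (10, 100), (40, 300), (20, 900), (60, 400), (30, 200)]
    print(longest_colinear_chain(hsps))
```

HSPs left out of the chain, such as hits to repeats or pseudogenes elsewhere in the genome, are the redundant ones the abstract refers to.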

16.
Position-specific substitution matrices, known as profiles, derived from multiple sequence alignments are currently used to search sequence databases for distantly related members of protein families. The performance of the database searches is enhanced by using (i) a sequence weighting scheme which assigns higher weights to more distantly related sequences based on branch lengths derived from phylogenetic trees, (ii) exclusion of positions with mainly padding characters at sites of insertions or deletions and (iii) the BLOSUM62 residue comparison matrix. A natural consequence of these modifications is an improvement in the alignment of new sequences to the profiles. However, the accuracy of the alignments can be further increased by employing a similarity residue comparison matrix. These developments are implemented in a program called PROFILEWEIGHT which runs on Unix and Vax computers. The only input required by the program is the multiple sequence alignment. The output from PROFILEWEIGHT is a profile designed to be used by existing searching and alignment programs. Test results from database searches with four different families of proteins show the improved sensitivity of the weighted profiles.

17.
We present a novel protein structure database search tool, 3D-BLAST, that is useful for analyzing novel structures and can return a ranked list of alignments. This tool has the features of BLAST (for example, robust statistical basis, and effective and reliable search capabilities) and employs a kappa-alpha (κ, α) plot derived structural alphabet and a new substitution matrix. 3D-BLAST searches more than 12,000 protein structures in 1.2 s and yields good results in zones with low sequence similarity.

18.
When investigators undertake searches of DNA databases, they normally discard large numbers of alignments that demonstrate very weak resemblances to each other, retaining only those that show statistically significant levels of resemblance. We show here that a great deal of information can be extracted from these weak alignments by examining them en masse. This is done by building three-dimensional similarity landscapes from the alignments, landscapes that reveal whether an unusual number of individually nonsignificant alignments tend to match up to a particular region of the query sequence being searched. The power of the search is increased by the use of libraries consisting entirely of introns or of exons. We show that (1) similarity landscapes with a variety of features can be generated from both intron and exon libraries, using introns or exons as query sequences; (2) the landscape features are real and not a statistical artifact; (3) well-known protein motifs used as query sequences can generate various landscape features; and (4) there is some evidence for resemblances between short regions of sequence carried by introns and exons. One possible interpretation of these results is that both introns and exons may have been built up during their evolution from short regions of sequence that as a result are now widely distributed throughout eukaryotic genomes. Such an interpretation would imply that these short regions have common ancestry. Alternatively, the wide sharing of short pieces of DNA may reflect regions with particular structural properties that have arisen through convergent evolution. The similarity-landscape approach can be used to detect such widespread structural motifs and sequence motifs in the genome that might be missed by less-global searches. 
It can also be used in conjunction with algorithms developed for detecting significant multiple alignments by isolating promising subsets of the databases that can be examined in more detail. Correspondence to: C. Wills

19.
20.
Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of sequences. They return results significantly similar to the query sequence and that are typically highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, where the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by introducing non-redundancy during database construction, but it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases that produce non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods are able to achieve more diverse yet significant result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluation, the proposed diversification methods significantly outperform original BLAST results and other baselines. A web based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAST
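Post-processing a ranked hit list for diversity can be illustrated with a simple greedy heuristic: walk the hits in rank order and keep one only if it is not too similar to any hit already kept. This is a generic sketch, not the Div-BLAST method; the k-mer Jaccard similarity, the threshold value, and the function names are all assumptions made for illustration.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two k-mer sets (0.0 when both are empty)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def diversify(ranked_hits: list[tuple[str, str]], max_similarity: float = 0.5,
              k: int = 3) -> list[tuple[str, str]]:
    """Keep hits, best first, that are dissimilar to every hit kept so far."""
    kept: list[tuple[str, str]] = []
    for name, seq in ranked_hits:
        if all(jaccard(seq, kept_seq, k) <= max_similarity for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


if __name__ == "__main__":
    hits = [("h1", "ACDEFGHIK"), ("h2", "ACDEFGHIR"), ("h3", "WWYYPPLLM")]
    print([name for name, _ in diversify(hits)])
```

Because the selection runs after the search, the similarity threshold can be chosen per query, which is the flexibility the abstract contrasts with fixing a non-redundancy level when the database is built.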
