Similar articles
20 similar articles found.
1.
MOTIVATION: Many studies have shown that database searches using position-specific score matrices (PSSMs) or profiles as queries are more effective at identifying distant protein relationships than are searches that use simple sequences as queries. One popular program for constructing a PSSM and comparing it with a database of sequences is Position-Specific Iterated BLAST (PSI-BLAST). RESULTS: This paper describes a new software package, IMPALA, designed for the complementary procedure of comparing a single query sequence with a database of PSI-BLAST-generated PSSMs. We illustrate the use of IMPALA to search a database of PSSMs for protein folds, and one for protein domains involved in signal transduction. IMPALA's sensitivity to distant biological relationships is very similar to that of PSI-BLAST. However, IMPALA employs a more refined analysis of statistical significance and, unlike PSI-BLAST, guarantees the output of the optimal local alignment by using the rigorous Smith-Waterman algorithm. Also, it is considerably faster when run with a large database of PSSMs than is BLAST or PSI-BLAST when run against the complete non-redundant protein database.
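
The sequence-versus-PSSM comparison described in this abstract can be illustrated with a minimal Smith-Waterman sketch. This is not IMPALA's implementation: the function name `smith_waterman_pssm`, the list-of-dicts PSSM layout, the linear gap penalty, and the default score of -1 for residues missing from a column are all assumptions made for illustration.

```python
def smith_waterman_pssm(query, pssm, gap=-4):
    """Return the best local alignment score of `query` against a PSSM.

    `pssm` is a list with one dict per profile position, mapping a
    residue to its position-specific score (hypothetical layout).
    A linear gap penalty is assumed; real tools typically use affine gaps.
    """
    cols = len(pssm)
    prev = [0] * (cols + 1)          # previous row of the DP matrix
    best = 0
    for residue in query:
        curr = [0] * (cols + 1)
        for j in range(1, cols + 1):
            # Score for aligning this residue to profile position j.
            match = prev[j - 1] + pssm[j - 1].get(residue, -1)
            # Local alignment: never let a cell drop below zero.
            curr[j] = max(0, match, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, with a three-position profile that rewards A, C and D in turn, a query containing "ACD" scores the sum of the three column scores, while a query with no matching residues scores zero.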

2.
MOTIVATION: Blast programs are very efficient in finding relatively strong similarities, but some very distantly related sequences are given a very high Expect value and are ranked very low in Blast results. We have developed Ballast, a program to predict local maximum segments (LMSs, i.e. sequence segments conserved relative to their flanking regions) from a single Blast database search and to highlight these divergent homologues. The TBlastN database searches can also be processed with the help of information from a joint BlastP search. RESULTS: We have applied the Ballast algorithm to BlastP searches performed with sequences belonging to well described dispersed families (aminoacyl-tRNA synthetases; helicases) against the SwissProt 38 database. We show that Ballast is able to build an appropriate conservation profile and that LMSs are predicted that are consistent with the signatures and motifs described in the literature. Furthermore, by comparing the Blast, PsiBlast and Ballast results obtained on a well defined database of structurally related sequences, we show that the LMSs provide a scoring scheme that can concentrate on top ranking distant homologues better than Blast. Using the graphical user interface available on the Web, specific LMSs may be selected to detect divergent homologues sharing the corresponding properties with the query sequence without requiring any additional database search.

3.
UniRef: comprehensive and non-redundant UniProt reference clusters   (Cited by: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

4.
We describe a web-based resource to identify, search and analyze sequence patterns conserved in the multiple sequence alignments of orthologous promoters from closely related/distant Saccharomyces spp. The webtool interfaces with a database where conserved sequence patterns (greater than 4 bp) have been previously extracted from genome-wide promoter alignments, allowing one to carry out user-defined genome-wide searches for conserved sequences to assist in the discovery of novel promoter elements based on comparative genomics. The web-based server can be accessed at http://www2.imtech.res.in/ anand/sacch_prom_pat.html.

5.
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive database searches.
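
Clustering at a chosen identity level and keeping only representatives can be sketched as a greedy incremental procedure. The abstract does not describe the program's actual algorithm, so everything here is an assumption for illustration: the names `identity` and `greedy_cluster`, the longest-first ordering, and especially the toy ungapped identity measure, which stands in for a real alignment-based identity.

```python
def identity(a, b):
    """Toy identity: matching positions over the shorter length, no gaps.

    This is a stand-in for an alignment-based identity (assumption).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.9):
    """Return {representative: [members]} by greedy incremental clustering."""
    clusters = {}
    # Process longest sequences first so they become representatives.
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)   # join the first matching cluster
                break
        else:
            clusters[s] = [s]             # no match: start a new cluster
    return clusters
```

The keys of the returned dictionary play the role of the reduced "representatives only" database mentioned in the abstract.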

6.
Shotgun: getting more from sequence similarity searches.   (Cited by: 1; self-citations: 0; citations by others: 1)
MOTIVATION: As genomic sequencing reveals the range of structural classes generated through the evolution of proteins, analysis of the superfamilies to which they belong can contribute important insights for understanding their structure-function relationships. Current database search techniques fall short of identifying the majority of distant sequence relationships at statistically significant levels. We developed the Shotgun program in an effort to enhance the sensitivity and utility of current database search output. RESULTS: We have developed and used the Shotgun program both to identify new superfamily members and to reconstruct several known enzyme superfamilies using BLAST database searches. An analysis of the false-positive rates generated in the analysis and other control experiments provides evidence that high Shotgun scores indicate real evolutionary relationships. Shotgun is also a useful tool for identifying subgroup relationships within superfamilies and for testing hypotheses about related protein families. AVAILABILITY: By request from the Babbitt lab homepage: http://mako.cgl.ucsf.edu/babbittlab/ CONTACT: babbitt@cgl.ucsf.edu

7.
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like “linker” sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another one additionally containing designed sequences. Our results showed that augmented database searches showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be “plugged into” routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our statistically well-supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.
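
Roulette wheel selection, the sampling idea named in this abstract, picks each item with probability proportional to its weight. The sketch below applies it column by column to a toy profile of residue frequencies. The abstract does not give the authors' actual design procedure, so the names `roulette_pick` and `design_sequence`, the profile layout, and the per-column independence are assumptions.

```python
import random


def roulette_pick(freqs, rng):
    """Pick a residue with probability proportional to its frequency."""
    total = sum(freqs.values())
    spin = rng.uniform(0, total)      # where the wheel stops
    cumulative = 0.0
    for residue, weight in freqs.items():
        cumulative += weight
        if spin < cumulative:
            return residue
    return residue  # guard against floating-point round-off at the top end


def design_sequence(profile, seed=0):
    """Sample one sequence from a profile (a list of {residue: frequency})."""
    rng = random.Random(seed)         # seeded for reproducible output
    return "".join(roulette_pick(column, rng) for column in profile)
```

Calling `design_sequence` with different seeds yields different protein-like sequences that all respect the column preferences of the profile.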

8.
SUMMARY: We have developed a program, MPBLAST, that increases the throughput of batch BLASTN searches by multiplexing (concatenating) query sequences and thereby reducing the number of actual database searches performed. Throughput was observed to increase in reciprocal proportion to the component sequence length. For sequencing read-sized queries of 500 bp, an order of magnitude speed-up was seen. AVAILABILITY: Free (see http://blast.wustl.edu) CONTACT: [ikorf, gish]@watson.wustl.edu
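
Multiplexing queries by concatenation requires bookkeeping so that a hit on the joined sequence can be attributed to the right component query. The sketch below shows one way to do that bookkeeping. It is not MPBLAST's code: the function names `multiplex` and `demultiplex` and the run-of-N spacer between queries are assumptions for illustration.

```python
from bisect import bisect_right


def multiplex(queries, spacer="N" * 20):
    """Join (name, sequence) pairs; return joined sequence, starts, names."""
    parts, starts, names = [], [], []
    pos = 0
    for name, seq in queries:
        starts.append(pos)            # 0-based start of this query
        names.append(name)
        parts.append(seq)
        pos += len(seq) + len(spacer)
    return spacer.join(parts), starts, names


def demultiplex(position, starts, names):
    """Map a 0-based position on the joined query to (name, local position)."""
    idx = bisect_right(starts, position) - 1   # last start <= position
    return names[idx], position - starts[idx]
```

A position that falls inside a spacer maps to a local coordinate past the end of the preceding query, so a caller would need to discard such hits.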

9.
H Jörnvall. FEBS Letters 1999, 456(1): 85-88
Motifer is a software tool able to find directly in nucleotide databases very distant homologues to an amino acid query sequence. It focuses searches on a specific amino acid pattern, scoring the matching and intervening residues as specified by the user. The program has been developed for searching databases of expressed sequence tags (ESTs), but it is also well suited to search genomic sequences. The query sequence can be a variable pattern with alternative amino acids or gaps and the sequences searched can contain introns or sequencing errors with accompanying frame shifts. Other features include options to generate a searchable output, set the maximal sequencing error frequency, limit searches to given species, or exclude already known matches. Motifer can find sequence homologues that other search algorithms would deem unrelated or would not find because of sequencing errors or too large a number of other homologues. The ability of Motifer to find relatives to a given sequence is exemplified by searches for members of the transforming growth factor-beta family and for proteins containing a WW-domain. The functions aimed at enhancing EST searches are illustrated by the 'in silico' cloning of a novel cytochrome P450 enzyme.

10.
A computer algorithm has been developed which identifies tRNA genes and tRNA-like structures in DNA sequences. The program searches the sequence string for specific base positions that correspond to the invariant and semi-invariant bases found in tRNAs. The tRNA nature of the sequence is confirmed by the presence of complementary base pairing at the tRNA's calculated 5' and 3' ends (which in situ constitutes the amino-acyl stem region). The program achieves greater than 96% accuracy when run against known tRNA sequences in the GenBank database. The program is modular and is readily modified to allow searching either a file or database. The program is written in "C" and operates on a D.E.C. Vax 750. The utility of the algorithm is demonstrated by the identification of a distinctive tRNA structure in an intron of a published bovine hemoglobin gene.
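
The confirmation step described here, complementary pairing between the 5' and 3' ends of a candidate, can be sketched as a simple stem check on a DNA string. This is not the published program's logic: the names `stem_pairs` and `has_acceptor_stem`, the 7-base stem length, the 6-pair minimum, and the tolerance of G-T wobble pairs are all assumptions chosen for illustration.

```python
# Watson-Crick pairs plus G-T wobble, written for DNA (assumed rule set).
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def stem_pairs(five_prime, three_prime):
    """Count pairs between a 5' segment and the antiparallel 3' segment."""
    return sum((a, b) in PAIRS for a, b in zip(five_prime, reversed(three_prime)))


def has_acceptor_stem(candidate, stem_len=7, min_pairs=6):
    """True if the two ends of `candidate` can form a stem of `stem_len` bases."""
    if len(candidate) < 2 * stem_len:
        return False
    return stem_pairs(candidate[:stem_len], candidate[-stem_len:]) >= min_pairs
```

In a fuller pipeline this check would run only on windows that already passed the invariant-base screen mentioned in the abstract.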

11.
Position-specific substitution matrices, known as profiles, derived from multiple sequence alignments are currently used to search sequence databases for distantly related members of protein families. The performance of the database searches is enhanced by using (i) a sequence weighting scheme which assigns higher weights to more distantly related sequences based on branch lengths derived from phylogenetic trees, (ii) exclusion of positions with mainly padding characters at sites of insertions or deletions and (iii) the BLOSUM62 residue comparison matrix. A natural consequence of these modifications is an improvement in the alignment of new sequences to the profiles. However, the accuracy of the alignments can be further increased by employing a similarity residue comparison matrix. These developments are implemented in a program called PROFILEWEIGHT which runs on Unix and Vax computers. The only input required by the program is the multiple sequence alignment. The output from PROFILEWEIGHT is a profile designed to be used by existing searching and alignment programs. Test results from database searches with four different families of proteins show the improved sensitivity of the weighted profiles.

12.
Histone and histone fold sequences and structures: a database.   (Cited by: 4; self-citations: 3; citations by others: 1)
A database of aligned histone protein sequences has been constructed based on the results of homology searches of the major public sequence databases. In addition, sequences of proteins identified as containing the histone fold motif and structures of all known histone and histone fold proteins have been included in the current release. Database resources include information on conflicts between similar sequence entries in different source databases, multiple sequence alignments, and links to the Entrez integrated information retrieval system at the National Center for Biotechnology Information (NCBI). The database currently contains over 1000 protein sequences. All sequences and alignments in this database are available through the World Wide Web at: http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES/.

13.
MOTIVATION: Noise in database searches resulting from random sequence similarities increases as the databases expand rapidly. The noise problems are not a technical shortcoming of the database search programs, but a logical consequence of the idea of homology searches. The effect can be observed in simulation experiments. RESULTS: We have investigated noise levels in pairwise alignment based database searches. The noise levels of 38 releases of the SwissProt database display perfect logarithmic growth with the total length of the databases. Clustering of real biological sequences reduces noise levels, but the effect is marginal.

14.
We have developed a web-based tool for design of specific PCR primers and probes. The program allows you to enter primer sequence information as well as an optional probe, and sequence similarity searches (MegaBLAST) will be performed to see if the sequences match the same sequence entry in the specified database. If primers (and probe) match, this will be reported. The program can handle overlapping amplicons, amplification from a single primer, ambiguous bases and other problematic cases.

15.
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of the genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.

16.
17.
Serial BLAST searching   (Cited by: 2; self-citations: 0; citations by others: 2)
MOTIVATION: The translating BLAST algorithms are powerful tools for finding protein-coding genes because they identify amino acid similarities in nucleotide sequences. Unfortunately, these kinds of searches are computationally intensive and often represent bottlenecks in sequence analysis pipelines. Tuning parameters for speed can make the searches much faster, but one risks losing low-scoring alignments. However, high scoring alignments are relatively resistant to such changes in parameters, and this fact makes it possible to use a serial strategy where a fast, insensitive search is used to pre-screen a database for similar sequences, and a slow, sensitive search is used to produce the sequence alignments. RESULTS: Serial BLAST searches improve both the speed and sensitivity.
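
The serial strategy in this abstract, a fast insensitive pre-screen followed by a slow sensitive search on the survivors, can be sketched with stand-in components. Nothing below calls BLAST: a shared k-mer filter stands in for the fast stage and a plain Smith-Waterman score stands in for the sensitive stage. All function names and parameter values are assumptions for illustration.

```python
def kmers(seq, k):
    """Set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fast_screen(query, database, k=4, min_shared=2):
    """Cheap, insensitive stage: keep entries sharing enough k-mers."""
    q = kmers(query, k)
    return {name: seq for name, seq in database.items()
            if len(q & kmers(seq, k)) >= min_shared}


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Slow, sensitive stage: Smith-Waterman local alignment score."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def serial_search(query, database):
    """Run the expensive scorer only on entries that pass the pre-screen."""
    survivors = fast_screen(query, database)
    return sorted(((local_score(query, seq), name)
                   for name, seq in survivors.items()), reverse=True)
```

The design point is that the quadratic-time scorer never sees entries the cheap filter rejects, which is where the speed gain of a serial strategy comes from.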

18.
Motivation: A large number of new DNA sequences with virtually unknown functions are generated as the Human Genome Project progresses. Therefore, it is essential to develop computer algorithms that can predict the functionality of DNA segments according to their primary sequences, including algorithms that can predict promoters. Although several promoter-predicting algorithms are available, they have high false-positive detections and the rate of promoter detection needs to be improved further. Results: In this research, PromFD, a computer program to recognize vertebrate RNA polymerase II promoters, has been developed. Both vertebrate promoters and non-promoter sequences are used in the analysis. The promoters are obtained from the Eukaryotic Promoter Database. Promoters are divided into a training set and a test set. Non-promoter sequences are obtained from the GenBank sequence databank, and are also divided into a training set and a test set. The first step is to search out, among all possible permutations, patterns of strings 5–10 bp long that are significantly over-represented in the promoter set. The program also searches IMD (Information Matrix Database) matrices that have a significantly higher presence in the promoter set. The results of the searches are stored in the PromFD database, and the program PromFD scores input DNA sequences according to their content of the database entries. PromFD predicts promoters: their locations and the location of potential TATA boxes, if found. The program can detect 71% of promoters in the training set with a false-positive rate of under 1 in every 13 000 bp, and 47% of promoters in the test set with a false-positive rate of under 1 in every 9800 bp. PromFD uses a new approach and its false-positive identification rate is better compared with other available promoter recognition algorithms. The source code for PromFD is in the 'C++' language. Availability: PromFD is available for Unix platforms by anonymous ftp to: beagle.colorado.edu, cd pub, get promFD.tar. A Java version of the program is also available for Netscape 2.0, by http://beagle.colorado.edu/chenq. Contact: E-mail: chenq{at}beagle.colorado.edu
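
The first step described in this abstract, finding short strings that are over-represented in promoters relative to non-promoters, can be sketched as a frequency-ratio comparison. The abstract does not state PromFD's significance test, so the simple ratio threshold, the pseudocount, and the names `kmer_counts` and `overrepresented` are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every length-k substring across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def overrepresented(promoters, background, k=5, min_ratio=2.0, pseudo=1.0):
    """Return {k-mer: ratio} for k-mers enriched in `promoters`.

    The ratio compares relative frequencies; a pseudocount keeps k-mers
    that are absent from the background from dividing by zero.
    """
    p = kmer_counts(promoters, k)
    b = kmer_counts(background, k)
    p_total = sum(p.values()) or 1
    b_total = sum(b.values()) or 1
    enriched = {}
    for kmer, n in p.items():
        ratio = (n / p_total) / ((b[kmer] + pseudo) / (b_total + pseudo))
        if ratio >= min_ratio:
            enriched[kmer] = ratio
    return enriched
```

Running this for each k from 5 to 10 would cover the string lengths mentioned in the abstract; a real analysis would replace the ratio cut-off with a proper statistical test.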

19.
Profile matching methods are commonly used in searches in protein sequence databases to detect evolutionary relationships. We describe here a sensitive protocol, which detects remote similarities by searching in a specialized database of sequences belonging to a fold. We have assessed this protocol by exploring the relationships we detect among sequences known to belong to specific folds. We find that searches within sequences adopting a fold are more effective in detecting remote similarities and evolutionary connections than searches in a database of all sequences. We also discuss the implications of using this strategy to link sequence and structure space.

20.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.
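
The linking rule in this abstract, two queries are linked when their searches share at least one hit, reduces to a set intersection over all query pairs. The sketch below shows that rule only. The name `linked_pairs` and the input layout (a dictionary from query identifier to a set of hit identifiers) are assumptions, and the superfamily restriction and alignment-quality analysis from the abstract are not modeled.

```python
from itertools import combinations


def linked_pairs(hits):
    """Find query pairs connected through shared intermediate hits.

    `hits` maps each query id to the set of database ids it matched.
    Returns {(query_a, query_b): set of shared intermediate ids}.
    """
    links = {}
    for a, b in combinations(sorted(hits), 2):
        shared = hits[a] & hits[b]
        if shared:
            links[(a, b)] = shared
    return links
```

For large query sets an inverted index from hit to queries would avoid the all-pairs loop, but the pairwise form keeps the rule easy to read.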
