首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Bioinformatic tools have become essential to biologists in their quest to understand the vast quantities of sequence data, and now whole genomes, which are being produced at an ever increasing rate. Much of these sequence data are single-pass sequences, such as sample sequences from organisms closely related to other organisms of interest which have already been sequenced, or cDNAs or expressed sequence tags (ESTs). These single-pass sequences often contain errors, including frameshifts, which complicate the identification of homologues, especially at the protein level. Therefore, sequence searches with this type of data are often performed at the nucleotide level. The most commonly used sequence search algorithms for the identification of homologues are Washington University's and the National Center for Biotechnology Information's (NCBI) versions of the BLAST suites of tools, which are to be found on websites all over the world. The work reported here examines the use of these tools for comparing sample sequence datasets to a known genome. It shows that care must be taken when choosing the parameters to use with the BLAST algorithms. NCBI's version of gapped BLASTn gives much shorter, and sometimes different, top alignments to those found using Washington University's version of BLASTn (which also allows for gaps), when both are used with their default parameters. Most of the differences in performance were found to be due to the choices of default parameters rather than underlying differences between the two algorithms. Washington University's version, used with defaults, compares very favourably with the results obtained using the accurate but computationally intensive Smith-Waterman algorithm.  相似文献   

2.
3.
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of the genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.  相似文献   

4.
MOTIVATION: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring match of length at most 5 to the query. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operator characteristic) scores similar to those obtained with current default parameters. AVAILABILITY: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords  相似文献   

5.
Sequence analysis has become essential to the study of genomes and biological research in general. Basic Local Alignment Search Tool (BLAST) leads the way as the most accepted method for performing necessary query searches and analysis of discovered genes. Combating growing data sizes, with the goal of speeding up job runtimes, scientist are resorting to grid computing technologies. However, grid environments are characterized by dynamic, heterogeneous, and transient state of available resources causing major hindrance to users when trying to realize user-desired levels of service. This paper analyzes performance characteristics of NCBI BLAST on several resources and captures influence of resource characteristics and job parameters on BLAST job runtime across those resources. Obtained results are summarized as a set of principles characterizing performance of NCBI BLAST across homogeneous and heterogeneous environments. These principles are then applied and verified through creation of a grid-enabled BLAST wrapper application called Dynamic BLAST. Results show runtime savings up to 50% and resource utilization improvement of approximately 40%.  相似文献   

6.
A structural class in the MemGen classification of membrane proteins is a set of evolutionary related proteins sharing a similar global fold. A structural class contains both closely related pairs of proteins for which homology is clear from sequence comparison and very distantly related pairs, for which it is not possible to establish homology based on sequence similarity alone. In the latter case the evolutionary link is based on hydropathy profile analysis. Here, we use these evolutionary related sets of proteins to analyze the relationship between E-values in BLAST searches, sequence similarities in multiple sequence alignments and structural similarities in hydropathy profile analyses. Two structural classes of secondary transporters termed ST[3], which includes the Ion Transporter (IT) superfamily and ST[4], which includes the DAACS family (TC# 2.A.23) were extracted from the NCBI protein database. ST[3] contains 2051 unique sequences distributed over 32 families and 59 subfamilies. ST[4] is a smaller class containing 399 unique sequences distributed over 2 families and 7 subfamilies. One subfamily in ST[4] contains a new class of binding protein dependent secondary transporters. Comparison of the averaged hydropathy profiles of the subfamilies in ST[3] and ST[4] revealed that the two classes represent different folds. Divergence of the sequences in ST[4] is much smaller than observed in ST[3], suggesting different constraints on the proteins during evolution. Analysis of the correlation between the evolutionary relationship of pairs of proteins in a class and the BLAST E-value revealed that: (i) the BLAST algorithm is unable to pick up the majority of the links between proteins in structural class ST[3], (ii) "low complexity filtering" and "composition based statistics" improve the specificity, but strongly reduce the sensitivity of BLAST searches for distantly related proteins, indicating that these filters are too stringent for the proteins analyzed, and (iii) the E-value cut-off, which may be used to evaluate evolutionary significance of a hit in a BLAST search is very different for the two structural classes of membrane proteins.  相似文献   

7.
A structural class in the MemGen classification of membrane proteins is a set of evolutionary related proteins sharing a similar global fold. A structural class contains both closely related pairs of proteins for which homology is clear from sequence comparison and very distantly related pairs, for which it is not possible to establish homology based on sequence similarity alone. In the latter case the evolutionary link is based on hydropathy profile analysis. Here, we use these evolutionary related sets of proteins to analyze the relationship between E-values in BLAST searches, sequence similarities in multiple sequence alignments and structural similarities in hydropathy profile analyses. Two structural classes of secondary transporters termed ST[3], which includes the Ion Transporter (IT) superfamily and ST[4], which includes the DAACS family (TC# 2.A.23) were extracted from the NCBI protein database. ST[3] contains 2051 unique sequences distributed over 32 families and 59 subfamilies. ST[4] is a smaller class containing 399 unique sequences distributed over 2 families and 7 subfamilies. One subfamily in ST[4] contains a new class of binding protein dependent secondary transporters. Comparison of the averaged hydropathy profiles of the subfamilies in ST[3] and ST[4] revealed that the two classes represent different folds. Divergence of the sequences in ST[4] is much smaller than observed in ST[3], suggesting different constraints on the proteins during evolution. Analysis of the correlation between the evolutionary relationship of pairs of proteins in a class and the BLAST E-value revealed that: (i) the BLAST algorithm is unable to pick up the majority of the links between proteins in structural class ST[3], (ii) ‘low complexity filtering’ and ‘composition based statistics’ improve the specificity, but strongly reduce the sensitivity of BLAST searches for distantly related proteins, indicating that these filters are too stringent for the proteins analyzed, and (iii) the E-value cut-off, which may be used to evaluate evolutionary significance of a hit in a BLAST search is very different for the two structural classes of membrane proteins.  相似文献   

8.
Serial BLAST searching   总被引:2,自引:0,他引:2  
MOTIVATION: The translating BLAST algorithms are powerful tools for finding protein-coding genes because they identify amino acid similarities in nucleotide sequences. Unfortunately, these kinds of searches are computationally intensive and often represent bottlenecks in sequence analysis pipelines. Tuning parameters for speed can make the searches much faster, but one risks losing low-scoring alignments. However, high scoring alignments are relatively resistant to such changes in parameters, and this fact makes it possible to use a serial strategy where a fast, insensitive search is used to pre-screen a database for similar sequences, and a slow, sensitive search is used to produce the sequence alignments. RESULTS: Serial BLAST searches improve both the speed and sensitivity.  相似文献   

9.
10.
11.
Yeast glycoproteins are representative of low-complexity sequences, those sequences rich in a few types of amino acids. Low-complexity protein sequences comprise more than 10% of the proteome but are poorly aligned by existing methods. Under default conditions, BLAST and FASTA use the scoring matrix BLOSUM62, which is optimized for sequences with diverse amino acid compositions. Because low-complexity sequences are rich in a few amino acids, these tools tend to align the most common residues in nonhomologous positions, thereby generating anomalously high scores, deviations from the expected extreme value distribution, and small e values. This anomalous scoring prevents BLOSUM62-based BLAST and FASTA from identifying correct homologs for proteins with low-complexity sequences, including Saccharomyces cerevisiae wall proteins. We have devised and empirically tested scoring matrices that compensate for the overrepresentation of some amino acids in any query sequence in different ways. These matrices were tested for sensitivity in finding true homologs, discrimination against nonhomologous and random sequences, conformance to the extreme value distribution, and accuracy of e values. Of the tested matrices, the two best matrices (called E and gtQ) gave reliable alignments in BLAST and FASTA searches, identified a consistent set of paralogs of the yeast cell wall test set proteins, and improved the consistency of secondary structure predictions for cell wall proteins.  相似文献   

12.
Sequence similarity tools, such as BLAST, seek sequences most similar to a query from a database of sequences. They return results significantly similar to the query sequence and that are typically highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach, where the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools for this discipline. Some redundancy can be avoided by introducing non-redundancy during database construction, but it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases that produce non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results extracted from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences that is returned as a result of a sequence similarity query. We evaluate the effectiveness of the proposed methods in post-processing BLAST and PSI-BLAST results. We also assess the functional diversity of the returned results based on available Gene Ontology annotations. Additionally, we include a comparison with a current redundancy elimination tool, CD-HIT. Our experiments show that the proposed methods are able to achieve more diverse yet significant result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluation, the proposed diversification methods significantly outperform original BLAST results and other baselines. A web based tool implementing the proposed methods, Div-BLAST, can be accessed at cedar.cs.bilkent.edu.tr/Div-BLAST  相似文献   

13.
MOTIVATION: Two proteins can have a similar 3-dimensional structure and biological function, but have sequences sufficiently different that traditional protein sequence comparison algorithms do not identify their relationship. The desire to identify such relations has led to the development of more sensitive sequence alignment strategies. One such strategy is the Intermediate Sequence Search (ISS), which connects two proteins through one or more intermediate sequences. In its brute-force implementation, ISS is a strategy that repetitively uses the results of the previous query as new search seeds, making it time-consuming and difficult to analyze. RESULTS: Saturated BLAST is a package that performs ISS in an efficient and automated manner. It was developed using Perl and Perl/Tk and implemented on the LINUX operating system. Starting with a protein sequence, Saturated BLAST runs a BLAST search and identifies representative sequences for the next generation of searches. The procedure is run until convergence or until some predefined criteria are met. Saturated BLAST has a friendly graphic user interface, a built-in BLAST result parser, several multiple alignment tools, clustering algorithms and various filters for the elimination of false positives, thereby providing an easy way to edit, visualize, analyze, monitor and control the search. Besides detecting remote homologies, Saturated BLAST can be used to maintain protein family databases and to search for new genes in genomic databases.  相似文献   

14.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/ SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.  相似文献   

15.
Getting the most from PSI-BLAST   总被引:1,自引:0,他引:1  
Most biologists now conduct sequence searches as a matter of course. But how do we know that a relationship predicted by a homology search is a true, rather than false, hit with the same score? Many biologists design their own experiments with exquisite care yet still assume that results from programs with more than 20 adjustable parameters are 100% reliable. This article explains some of the key steps in getting the most from PSI-Blast, one of the most popular and powerful homology search programs currently available.  相似文献   

16.
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allow maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI.  相似文献   

17.
ABSTRACT: BACKGROUND: Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a "finite-size" correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score. FINDINGS: We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences. CONCLUSIONS: The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST + package and at the NCBI BLAST web site (http://blast.ncbi.nlm.nih.gov).  相似文献   

18.
Repetitive DNA elements discovered in the basidiomycete Chondrostereum purpureum were characterized and validated for use as genetic markers. Regions of these marker sequences were similar to retrotransposon and retrotransposon-like sequences, as indicated by BLAST searches of NCBI databases. These sequences occur in multiple DNA fragments of variable length in a given C. purpureum isolate, and thus can serve as strain-specific genetic markers. The segregation of the markers within a progeny set demonstrated their stability through meiosis. The population structure of C. purpureum was assessed using the markers. There was no evidence of a barrier to gene flow between C. purpureum populations separated by 1400 km and no indication of population sub-structuring based on host or geographical source of isolate. Repetitive fragments were amplified from four other species, suggesting the occurrence of these retrotransposon-like elements in other basidiomycetes and the potential utility of these markers for other fungi.  相似文献   

19.
MOTIVATION: A number of free-standing programs have been developed in order to help researchers find potential coding regions and deduce gene structure for long stretches of what is essentially 'anonymous DNA'. As these programs apply inherently different criteria to the question of what is and is not a coding region, multiple algorithms should be used in the course of positional cloning and positional candidate projects to assure that all potential coding regions within a previously-identified critical region are identified. RESULTS: We have developed a gene identification tool called GeneMachine which allows users to query multiple exon and gene prediction programs in an automated fashion. BLAST searches are also performed in order to see whether a previously-characterized coding region corresponds to a region in the query sequence. A suite of Perl programs and modules are used to run MZEF, GENSCAN, GRAIL 2, FGENES, RepeatMasker, Sputnik, and BLAST. The results of these runs are then parsed and written into ASN.1 format. Output files can be opened using NCBI Sequin, in essence using Sequin as both a workbench and as a graphical viewer. The main feature of GeneMachine is that the process is fully automated; the user is only required to launch GeneMachine and then open the resulting file with Sequin. Annotations can then be made to these results prior to submission to GenBank, thereby increasing the intrinsic value of these data. AVAILABILITY: GeneMachine is freely-available for download at http://genome.nhgri.nih.gov/genemachine. A public Web interface to the GeneMachine server for academic and not-for-profit users is available at http://genemachine.nhgri.nih.gov. The Web supplement to this paper may be found at http://genome.nhgri.nih.gov/genemachine/supplement/.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号