首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 93 毫秒
1.
Sequence alignment is an important bioinformatics tool for identifying homology, but searching against the full set of available sequences is likely to result in many hits to poorly annotated sequences providing very little information. Consequently, we often want alignments against a specific subset of sequences: for instance, we are looking for sequences from a particular species, sequences that have known 3d-structures, sequences that have a reliable (curated) function annotation, and so on. Although such subset databases are readily available, they only represent a small fraction of all sequences. Thus, the likelihood of finding close homologs for query sequences is smaller, and the alignments will in general have lower scores. This makes it difficult to distinguish hits to homologous sequences from random hits to unrelated sequences. Here, we propose a method that addresses this problem by first aligning query sequences against a large database representing the corpus of known sequences, and then constructing indirect (or transitive) alignments by combining the results with alignments from the large database against the desired target database. We compare the results to direct pairwise alignments, and show that our method gives us higher sensitivity alignments against the target database.  相似文献   

2.
Phylogenomics of prokaryotic ribosomal proteins   总被引:1,自引:0,他引:1  
Yutin N  Puigbò P  Koonin EV  Wolf YI 《PloS one》2012,7(5):e36972
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.  相似文献   

3.
SRS (Sequence Retrieval System) is a widely used keyword search engine for querying biological databases. BLAST2 is the most widely used tool to query databases by sequence similarity search. These tools allow users to retrieve sequences by shared keyword or by shared similarity, with many public web servers available. However, with the increasingly large datasets available it is now quite common that a user is interested in some subset of homologous sequences but has no efficient way to restrict retrieval to that set. By allowing the user to control SRS from the BLAST output, BLAST2SRS (http://blast2srs.embl.de/) aims to meet this need. This server therefore combines the two ways to search sequence databases: similarity and keyword.  相似文献   

4.
As an archive of sequence data for over 165,000 species, GenBank is an indispensable resource for phylogenetic inference. Here we describe an informatics processing pipeline and online database, the PhyLoTA Browser (http://loco.biosci.arizona.edu/pb), which offers a view of GenBank tailored for molecular phylogenetics. The first release of the Browser is computed from 2.6 million sequences representing the taxonomically enriched subset of GenBank sequences for eukaryotes (excluding most genome survey sequences, ESTs, and other high-throughput data). In addition to summarizing sequence diversity and species diversity across nodes in the NCBI taxonomy, it reports 87,000 potentially phylogenetically informative clusters of homologous sequences, which can be viewed or downloaded, along with provisional alignments and coarse phylogenetic trees. At each node in the NCBI hierarchy, the user can display a "data availability matrix" of all available sequences for entries in a subtaxa-by-clusters matrix. This matrix provides a guidepost for subsequent assembly of multigene data sets or supertrees. The database allows for comparison of results from previous GenBank releases, highlighting recent additions of either sequences or taxa to GenBank and letting investigators track progress on data availability worldwide. Although the reported alignments and trees are extremely approximate, the database reports several statistics correlated with alignment quality to help users choose from alternative data sources.  相似文献   

5.
Sequence comparison methods based on position-specific score matrices (PSSMs) have proven a useful tool for recognition of the divergent members of a protein family and for annotation of functional sites. Here we investigate one of the factors that affects overall performance of PSSMs in a PSI-BLAST search, the algorithm used to construct the seed alignment upon which the PSSM is based. We compare PSSMs based on alignments constructed by global sequence similarity (ClustalW and ClustalW-pairwise), local sequence similarity (BLAST), and local structure similarity (VAST). To assess performance with respect to identification of conserved functional or structural sites, we examine the accuracy of the three-dimensional molecular models predicted by PSSM-sequence alignments. Using the known structures of those sequences as the standard of truth, we find that model accuracy varies with the algorithm used for seed alignment construction in the pattern local-structure (VAST) > local-sequence (BLAST) > global-sequence (ClustalW). Using structural similarity of query and database proteins as the standard of truth, we find that PSSM recognition sensitivity depends primarily on the diversity of the sequences included in the alignment, with an optimum around 30-50% average pairwise identity. We discuss these observations, and suggest a strategy for constructing seed alignments that optimize PSSM-sequence alignment accuracy and recognition sensitivity.  相似文献   

6.
7.
We describe EnteriX, a suite of three web-based visualization tools for graphically portraying alignment information from comparisons among several fixed and user-supplied sequences from related enterobacterial species, anchored on a reference genome (http://bio.cse.psu.edu/). The first visualization, Enteric, displays stacked pairwise alignments between a reference genome and each of the related bacteria, represented schematically as PIPs (Percent Identity Plots). Encoded in the views are large-scale genomic rearrangement events and functional landmarks. The second visualization, Menteric, computes and displays 1 Kb views of nucleotide-level multiple alignments of the sequences, together with annotations of genes, regulatory sites and conserved regions. The third, a Java-based tool named Maj, displays alignment information in two formats, corresponding roughly to the Enteric and Menteric views, and adds zoom-in capabilities. The uses of such tools are diverse, from examining the multiple sequence alignment to infer conserved sites with potential regulatory roles, to scrutinizing the commonalities and differences between the genomes for pathogenicity or phylogenetic studies. The EnteriX suite currently includes >15 enterobacterial genomes, generates views centered on four different anchor genomes and provides support for including user sequences in the alignments.  相似文献   

8.
9.
SUMMARY: MuSeqBox is a program to parse BLAST output and store attributes of BLAST hits in tabular form. The user can apply a number of selection criteria to filter out hits with particular attributes. MuSeqBox provides a powerful annotation tool for large sets of query sequences that are simultaneously compared against a database with any of the standard stand-alone or network-client BLAST programs. We discuss such application to the problem of annotation and analysis of EST collections. AVAILABILITY: The program was written in standard C++ and is freely available to noncommercial users by request from the authors. The program is also available over the web at http://bioinformatics.iastate.edu/bioinformatics2go/mb/MuSeqBox.html.  相似文献   

10.
In recent years we have witnessed a growth in sequencing yield, the number of samples sequenced, and as a result–the growth of publicly maintained sequence databases. The increase of data present all around has put high requirements on protein similarity search algorithms with two ever-opposite goals: how to keep the running times acceptable while maintaining a high-enough level of sensitivity. The most time consuming step of similarity search are the local alignments between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, the majority of the protein similarity search methods prior to doing the exact local alignment apply heuristics to reduce the number of possible candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4–5 times faster than SSEARCH, 6–25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases  相似文献   

11.
MOTIVATION: Pairwise local sequence alignment is commonly used to search data bases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices available to choose from, and in the literature it is sometimes recommended to try several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and data base sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated, if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted. RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.  相似文献   

12.
MOTIVATION: Two proteins can have a similar 3-dimensional structure and biological function, but have sequences sufficiently different that traditional protein sequence comparison algorithms do not identify their relationship. The desire to identify such relations has led to the development of more sensitive sequence alignment strategies. One such strategy is the Intermediate Sequence Search (ISS), which connects two proteins through one or more intermediate sequences. In its brute-force implementation, ISS is a strategy that repetitively uses the results of the previous query as new search seeds, making it time-consuming and difficult to analyze. RESULTS: Saturated BLAST is a package that performs ISS in an efficient and automated manner. It was developed using Perl and Perl/Tk and implemented on the LINUX operating system. Starting with a protein sequence, Saturated BLAST runs a BLAST search and identifies representative sequences for the next generation of searches. The procedure is run until convergence or until some predefined criteria are met. Saturated BLAST has a friendly graphic user interface, a built-in BLAST result parser, several multiple alignment tools, clustering algorithms and various filters for the elimination of false positives, thereby providing an easy way to edit, visualize, analyze, monitor and control the search. Besides detecting remote homologies, Saturated BLAST can be used to maintain protein family databases and to search for new genes in genomic databases.  相似文献   

13.
Rfam is a collection of multiple sequence alignments and covariance models representing non-coding RNA families. Rfam is available on the web in the UK at http://www.sanger.ac.uk/Software/Rfam/ and in the US at http://rfam.wustl.edu/. These websites allow the user to search a query sequence against a library of covariance models, and view multiple sequence alignments and family annotation. The database can also be downloaded in flatfile form and searched locally using the INFERNAL package (http://infernal.wustl.edu/). The first release of Rfam (1.0) contains 25 families, which annotate over 50 000 non-coding RNA genes in the taxonomic divisions of the EMBL nucleotide database.  相似文献   

14.
Reconstructing the evolutionary history of protein sequences will provide a better understanding of divergence mechanisms of protein superfamilies and their functions. Long-term protein evolution often includes dynamic changes such as insertion, deletion, and domain shuffling. Such dynamic changes make reconstructing protein sequence evolution difficult and affect the accuracy of molecular evolutionary methods, such as multiple alignments and phylogenetic methods. Unfortunately, currently available simulation methods are not sufficiently flexible and do not allow biologically realistic dynamic protein sequence evolution. We introduce a new method, indel-Seq-Gen (iSG), that can simulate realistic evolutionary processes of protein sequences with insertions and deletions (indels). Unlike other simulation methods, iSG allows the user to simulate multiple subsequences according to different evolutionary parameters, which is necessary for generating realistic protein families with multiple domains. iSG tracks all evolutionary events including indels and outputs the "true" multiple alignment of the simulated sequences. iSG can also generate a larger sequence space by allowing the use of multiple related root sequences. With all these functions, iSG can be used to test the accuracy of, for example, multiple alignment methods, phylogenetic methods, evolutionary hypotheses, ancestral protein reconstruction methods, and protein family classification methods. We empirically evaluated the performance of iSG against currently available methods by simulating the evolution of the G protein-coupled receptor and lipocalin protein families. We examined their true multiple alignments, reconstruction of the transmembrane regions and beta-strands, and the results of similarity search against a protein database using the simulated sequences. We also presented an example of using iSG for examining how phylogenetic reconstruction is affected by high indel rates.  相似文献   

15.
Genome-wide multiple sequence alignments (MSAs) are a necessary prerequisite for an increasingly diverse collection of comparative genomic approaches. Here we present a versatile method that generates high-quality MSAs for non-protein-coding sequences. The NcDNAlign pipeline combines pairwise BLAST alignments to create initial MSAs, which are then locally improved and trimmed. The program is optimized for speed and hence is particulary well-suited to pilot studies. We demonstrate the practical use of NcDNAlign in three case studies: the search for ncRNAs in gammaproteobacteria and the analysis of conserved noncoding DNA in nematodes and teleost fish, in the latter case focusing on the fate of duplicated ultra-conserved regions. Compared to the currently widely used genome-wide alignment program TBA, our program results in a 20- to 30-fold reduction of CPU time necessary to generate gammaproteobacterial alignments. A showcase application of bacterial ncRNA prediction based on alignments of both algorithms results in similar sensitivity, false discovery rates, and up to 100 putatively novel ncRNA structures. Similar findings hold for our application of NcDNAlign to the identification of ultra-conserved regions in nematodes and teleosts. Both approaches yield conserved sequences of unknown function, result in novel evolutionary insights into conservation patterns among these genomes, and manifest the benefits of an efficient and reliable genome-wide alignment package. The software is available under the GNU Public License at http://www.bioinf.uni-leipzig.de/Software/NcDNAlign/.  相似文献   

16.
Identification of ectomycorrhizal (ECM) fungi is often achieved through comparisons of ribosomal DNA internal transcribed spacer (ITS) sequences with accessioned sequences deposited in public databases. A major problem encountered is that annotation of the sequences in these databases is not always complete or trustworthy. In order to overcome this deficiency, we report on UNITE, an open-access database. UNITE comprises well annotated fungal ITS sequences from well defined herbarium specimens that include full herbarium reference identification data, collector/source and ecological data. At present UNITE contains 758 ITS sequences from 455 species and 67 genera of ECM fungi. UNITE can be searched by taxon name, via sequence similarity using blastn, and via phylogenetic sequence identification using galaxie. Following implementation, galaxie performs a phylogenetic analysis of the query sequence after alignment either to pre-existing generic alignments, or to matches retrieved from a blast search on the UNITE data. It should be noted that the current version of UNITE is dedicated to the reliable identification of ECM fungi. The UNITE database is accessible through the URL http://unite.zbi.ee  相似文献   

17.
SUMMARY: BLAST statistics have been shown to be extremely useful for searching for significant similarity hits, for amino acid and nucleotide sequences. Although these statistics are well understood for pairwise comparisons, there has been little success developing statistical scores for multiple alignments. In particular, there is no score for multiple alignment that is well founded and treated as a standard. We extend the BLAST theory to multiple alignments. Following some simple assumptions, we present and justify a significance score for multiple segments of a local multiple alignment. We demonstrate its usefulness in distinguishing high and moderate quality multiple alignments from low quality ones, with supporting experiments on orthologous vertebrate promoter sequences.  相似文献   

18.
The Conserved Domain Database (CDD) is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE(R). This allows users to search for domain types by name, for example, or to view the domain architecture of any protein in Entrez's sequence database. CDD can be accessed on the WorldWideWeb at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. Users may also employ the CD-Search service to identify conserved domains in new sequences, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. CD-Search results, and pre-computed links from Entrez's protein database, are calculated using the RPS-BLAST algorithm and Position Specific Score Matrices (PSSMs) derived from CDD alignments. CD-Searches are also run by default for protein-protein queries submitted to BLAST(R) at http://www.ncbi.nlm.nih.gov/BLAST. CDD mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI. Structure information is used to identify the core substructure likely to be present in all family members, and to produce sequence alignments consistent with structure conservation. This alignment model allows NCBI curators to annotate 'columns' corresponding to functional sites conserved among family members.  相似文献   

19.
GenBank.   总被引:2,自引:1,他引:2       下载免费PDF全文
The GenBank(R) sequence database (http://www.ncbi.nlm.nih.gov/) incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (WWW) or Sequin programs to send their sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez , which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE(R) abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, e-mail and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services of interest to biologists.  相似文献   

20.
A comparison of scoring functions for protein sequence profile alignment   总被引:3,自引:0,他引:3  
MOTIVATION: In recent years, several methods have been proposed for aligning two protein sequence profiles, with reported improvements in alignment accuracy and homolog discrimination versus sequence-sequence methods (e.g. BLAST) and profile-sequence methods (e.g. PSI-BLAST). Profile-profile alignment is also the iterated step in progressive multiple sequence alignment algorithms such as CLUSTALW. However, little is known about the relative performance of different profile-profile scoring functions. In this work, we evaluate the alignment accuracy of 23 different profile-profile scoring functions by comparing alignments of 488 pairs of sequences with identity < or =30% against structural alignments. We optimize parameters for all scoring functions on the same training set and use profiles of alignments from both PSI-BLAST and SAM-T99. Structural alignments are constructed from a consensus between the FSSP database and CE structural aligner. We compare the results with sequence-sequence and sequence-profile methods, including BLAST and PSI-BLAST. RESULTS: We find that profile-profile alignment gives an average improvement over our test set of typically 2-3% over profile-sequence alignment and approximately 40% over sequence-sequence alignment. No statistically significant difference is seen in the relative performance of most of the scoring functions tested. Significantly better results are obtained with profiles constructed from SAM-T99 alignments than from PSI-BLAST alignments. AVAILABILITY: Source code, reference alignments and more detailed results are freely available at http://phylogenomics.berkeley.edu/profilealignment/  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号