Similar literature
20 similar documents found (search time: 31 ms)
1.
Reciprocal Best Hits (RBH) are a common proxy for orthology in comparative genomics. Essentially, an RBH is found when the proteins encoded by two genes, each in a different genome, find each other as the best-scoring match in the other genome. NCBI's BLAST is the software most commonly used for the sequence comparisons needed to find RBHs. Since sequence comparison can be time consuming, we decided to compare the number and quality of RBHs detected using algorithms that run in a fraction of the time required by BLAST. We tested BLAT, LAST and UBLAST. All three programs ran in a hundredth to a 25th of the time required to run BLAST. A reduction in the number of homologs and RBHs found by the faster algorithms compared to BLAST becomes apparent as the genomes compared become more dissimilar, with BLAT, a program optimized for quickly finding very similar sequences, missing both the most homologs and the most RBHs. Though LAST produced the number of homologs and RBHs closest to those produced with BLAST, UBLAST was very close, with either program producing between 0.6 and 0.8 of the RBHs found by BLAST between dissimilar genomes, while in more similar genomes the differences were barely apparent. UBLAST ran faster than LAST, making it the best option among the programs tested.
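The RBH definition given in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the hit tuples `(query_id, subject_id, score)`, the function names, and the tie-breaking rule (first hit seen wins) are all assumptions made for the example; the hit lists could come from any search program's output.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical hit record: (query_id, subject_id, score)
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Toy data: genome A searched against genome B, and the reverse search.
A_VS_B = [("a1", "b1", 200.0), ("a1", "b2", 150.0),
          ("a2", "b2", 90.0), ("a3", "b1", 80.0)]
B_VS_A = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]
```

With the toy data, `a2` picks `b2` but `b2` prefers `a1`, so only `("a1", "b1")` qualifies. A faster search program that misses a hit in either direction removes the pair, which is how reduced sensitivity translates into fewer RBHs.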

2.
Tekaia F, Yeramian E. Gene, 2012, 492(1): 199-211
The proper detection of orthologs is crucial for evolutionary studies of genes and species. Despite large efforts to solve this problem, the methodological situation appears unsettled to a large extent, and the “quest for orthologs” is still an ongoing task in large-scale genome comparisons. Here, we introduce a simple operational framework for the detection of orthologs and their classification. The operational framework relies on well-established principles, optimizing their implementation for the considered purposes, and chaining components in coherent procedures: 1) We take advantage of the efficiency and simplicity of Reciprocal Best Hit (RBH) detection, remedying (by design) its limitation to 1:1 detections. The procedure is based on the partitioning of Reciprocal Best Hits, with the further merging of partitions that include members of the same paralogous classes (“SuperPartitions of Orthologs” (SPOs)). 2) We then resort to the conservation profiles of the obtained clusters, allowing simple detection of SPOs containing duplicated members. Based on accepted evolutionary principles, such members can be further tagged as in-paralogs (co-orthologs) or out-paralogs. The method is illustrated and validated by extensive genomic analyses. The performance of the overall approach is characterized in global terms for three sets of species (Chlamydiae, Mycobacteria, Aspergilli), showing that at least 75% of the sets of orthologs contain at most one protein from a given species. The sets including more than one protein from a given species are shown to contain in-paralogs in proportions varying from 28% to 58%. The characterizations also show that the large majority of SPOs are associated with ancestral motifs, and are accordingly not prone to chaining effects that might be triggered by multi-domain proteins. Further, the SPO formulation is compared to other similarity-based ortholog detection methods. Beyond core common results, significant differences are observed between the various methods, which can be accounted for to a large extent on conceptual grounds, relative to the different merging schemes involved. Such comparisons highlight a major advantage of the SPO approach concerning the proper clustering of associated paralogs, which appear to be often dispatched spuriously into distinct orthologous classes. Finally, the perspectives for future applications and elaborations of SPO-based compositional analyses are discussed.
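The general idea of merging pairwise RBH links into clusters and then flagging clusters with duplicated members can be illustrated as follows. This is a generic single-linkage sketch using a union-find structure, not the SPO procedure itself, whose partitioning and merging rules are not given in the abstract. The `species:gene` identifier convention and all function names are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_pairs(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group identifiers connected by any chain of pairs (single linkage)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for member in parent:
        groups[find(member)].append(member)
    return sorted(sorted(group) for group in groups.values())


def has_duplicated_species(cluster: Iterable[str]) -> bool:
    """True if any species (the part before ':') occurs more than once."""
    counts = Counter(member.split(":", 1)[0] for member in cluster)
    return any(n > 1 for n in counts.values())


# Toy RBH links among three hypothetical genomes A, B and C.
RBH_LINKS = [("A:g1", "B:g1"), ("B:g1", "C:g1"),
             ("A:g2", "B:g2"), ("B:g2", "A:g3")]
```

Here the second cluster contains two genes from species `A`, so `has_duplicated_species` flags it as a candidate for the in-paralog versus out-paralog classification that the abstract describes.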

3.

Background  

Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold-standard phylogenetic approach of comparing the available organismal phylogeny to the gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss are involved. In addition, there is increasing interest in identifying the orthologs most likely to have retained similar function.

4.
The accuracy of global Smith-Waterman alignments and Pareto-optimal alignments has been examined as a function of the degree of sequence similarity (percent of coincidence, % id, and the number of remote fragments, NGap). An algorithm has been developed that constructs a set of three to six alignments, the best of which is on average more accurate than the best alignment that can be constructed using the Smith-Waterman algorithm. For weakly homologous sequences (% id 15, NGap 20), the increase in accuracy is on average about 8%, with the average accuracy of the global Smith-Waterman alignments being about 38% (the accuracy was estimated on model test sets).

5.
Iurlova AA, Makunin IV, Zhimulev IF. Genetika, 2010, 46(9): 1272-1275
Different genome regions differ in replication timing during the S phase. Late-replicating sequences are often underreplicated in the Drosophila salivary-gland polytene chromosomes. The SuUR gene, whose mutation changes the replication time of late-replicating regions in salivary-gland cells, has been identified in Drosophila melanogaster. The SUUR protein lacks homologs in a BLAST search, and only moderate homology is observed between its N-terminal end and chromatin-remodeling proteins of the SWI2/SNF2 family. The gene and the protein were analyzed in insects. Orthologs of the SuUR gene were found in all annotated Drosophila species. The number of amino acid substitutions in the SUUR protein proved to be extremely high, corresponding to that of rapidly evolving genes. Orthologs with low homology were found in the mosquitoes Anopheles gambiae, Aedes aegypti, and Culex quinquefasciatus. No orthologs of the SuUR gene were detected beyond Diptera.

6.
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so correctly align a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. 
In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17%, and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.

7.
The accuracy of global Smith-Waterman alignments and Pareto-optimal alignments has been examined as a function of the degree of sequence similarity (percent of coincidence, %id, and the number of removed fragments, NGap). An algorithm has been developed that constructs a set of three to six alignments, the best of which is on average more accurate than the best alignment that can be constructed using the Smith-Waterman algorithm. For weakly homologous sequences (%id 15, NGap 20), the increase in accuracy is on average about 8%, with the average accuracy of the global Smith-Waterman alignments being about 38% (the accuracy was estimated on model test sets).

8.
SUMMARY: BLAST statistics have been shown to be extremely useful for finding significant similarity hits for both amino acid and nucleotide sequences. Although these statistics are well understood for pairwise comparisons, there has been little success in developing statistical scores for multiple alignments. In particular, there is no score for multiple alignment that is well founded and treated as a standard. We extend the BLAST theory to multiple alignments. Following some simple assumptions, we present and justify a significance score for multiple segments of a local multiple alignment. We demonstrate its usefulness in distinguishing high- and moderate-quality multiple alignments from low-quality ones, with supporting experiments on orthologous vertebrate promoter sequences.

9.
The classic algorithms of Needleman-Wunsch and Smith-Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). To process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces that are suitable for Needleman-Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. As the number of alignment programs applied on a whole genome scale continues to increase, so does the disagreement in their results. The alignments produced by different programs vary greatly, especially in non-coding regions of eukaryotic genomes where the biologically correct alignment is hard to find. Parametric alignment is one possible remedy. This methodology resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. This alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters.

10.
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.

11.
All popular algorithms for pair-wise alignment of protein primary structures (e.g. Smith-Waterman (SW), FASTA, BLAST, etc.) use only amino acid sequences. The SW algorithm is the most accurate among them, i.e. it produces the alignments most similar to those obtained by superposition of protein 3D structures. But even the SW algorithm is unable to restore the 3D-based alignment if the similarity of the amino acid sequences (%id) is below 30%. We have proposed a novel alignment method that explicitly takes into account the secondary structure of the compared proteins. We have shown that it creates significantly more accurate alignments than the SW algorithm. In particular, for sequences with %id < 30% the average accuracy of the new method is 58%, compared to 35% for the SW algorithm (the accuracy of an algorithmic sequence alignment is the fraction of positions of a "gold standard" alignment, obtained by superposition of the corresponding 3D structures, that it restores). The accuracy of the proposed method is approximately the same for experimental and for theoretically predicted secondary structures. Thus the method can be applied to the alignment of protein sequences even if the protein 3D structure is unknown. The program is available at ftp://194.149.64.196/STRUSWER/.
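The accuracy measure defined in this abstract (the share of reference-alignment positions that a sequence alignment restores) can be made concrete with a short sketch. It assumes, for illustration only, that each alignment is given as two equal-length gapped strings with `-` as the gap character, and that a "position" means a pair of aligned residues; the function names are hypothetical and the authors' exact bookkeeping may differ.

```python
from typing import Set, Tuple

Alignment = Tuple[str, str]  # two gapped rows of equal length


def aligned_pairs(row_a: str, row_b: str, gap: str = "-") -> Set[Tuple[int, int]]:
    """Return the (index_in_a, index_in_b) pairs placed in the same column."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # residue counters for the ungapped sequences
    for char_a, char_b in zip(row_a, row_b):
        if char_a != gap and char_b != gap:
            pairs.add((i, j))
        if char_a != gap:
            i += 1
        if char_b != gap:
            j += 1
    return pairs


def alignment_accuracy(test: Alignment, reference: Alignment) -> float:
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference_pairs = aligned_pairs(*reference)
    if not reference_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(*test) & reference_pairs) / len(reference_pairs)


# Toy example: the reference has gaps, the test alignment is ungapped.
REFERENCE = ("ACDE-", "A-DEF")
TEST = ("ACDE", "ADEF")
```

In the toy example the reference aligns three residue pairs and the ungapped test alignment recovers only the first, so the accuracy is one third.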

12.
Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology-based modeling of three-dimensional (3D) structure. We investigated the correspondence between "gold standard" alignments of 3D protein structures and the sequence alignments produced by the Smith-Waterman algorithm, currently the most sensitive method for pair-wise alignment of sequences. The results of this analysis enabled the development of a novel method to align a pair of protein sequences. The comparison of the Smith-Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, the "islands" between gaps. Approximately one third of the islands in the gold standard alignments have a negative or low positive score, and their recognition is below the sensitivity limit of the Smith-Waterman algorithm. From the perspective of alignment accuracy, the time the algorithm spends working in these unalignable regions is wasted. We considered the features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high-scoring regions. This algorithm is considerably faster than the Smith-Waterman algorithm, whereas the resulting alignments are on average of the same quality with respect to the gold standard. This finding shows that a decrease in alignment accuracy is not necessarily the price of computational efficiency.

13.
The review considers original works on the primary structure of biopolymers that were carried out from 1983 to 2003. Most of the works were supported by the Russian program Human Genome and earlier similar Russian programs. Little-known publications of 1983-1993 and recent unpublished results are described in detail. In the field of genome comparisons, these concern the OWEN hierarchic algorithm, which aligns syntenic regions of two genome sequences. The resulting global alignment is obtained as an ordered chain of local similarities. Alignment of sequences of about 10^6 nucleotides takes several minutes. The concept of local similarity conflicts is generalized to multiple comparisons. New algorithms for aligning protein sequences are described and compared with the Smith-Waterman algorithm, which is currently the most accurate. The ANCHOR hierarchic algorithm generates alignments of much the same accuracy and is twice as fast as the Smith-Waterman algorithm. The STRSWer algorithm takes into account the secondary structures of the proteins under study. With the secondary structures predicted using the PSI-PRED software for pairs of proteins having 10-30% similarity, the average accuracy of alignments generated by STRSWer is 15% higher than that achieved with the Smith-Waterman algorithm.

14.
15.
Comparative accuracy of methods for protein sequence similarity search   Total citations: 2 (self-citations: 0, citations by others: 2)
MOTIVATION: Searching a protein sequence database for homologs is a powerful tool for discovering the structure and function of a sequence. Two new methods for searching sequence databases have recently been described: Probabilistic Smith-Waterman (PSW), which is based on Hidden Markov models for a single sequence using a standard scoring matrix, and a new version of BLAST (WU-BLAST2), which uses Sum statistics for gapped alignments. RESULTS: This paper compares and contrasts the effectiveness of these methods with three older methods (Smith-Waterman: SSEARCH, FASTA and BLASTP). The analysis indicates that the new methods are useful, and often offer improved accuracy. These tools are compared using a curated (by Bill Pearson) version of the annotated portion of PIR 39. Three different statistical criteria are utilized: equivalence number, minimum errors and the receiver operating characteristic. For complete-length protein query sequences from large families, PSW's accuracy is superior to that of the other methods, but its accuracy is poor when used with partial-length query sequences. False negatives are twice as common as false positives irrespective of the search method if a family-specific threshold score that minimizes the total number of errors (i.e. the most favorable threshold score possible) is used. Thus, sensitivity, not selectivity, is the major problem. Among the analyzed methods using default parameters, the best accuracy was obtained from SSEARCH and PSW for complete-length proteins, and from the two BLAST programs, plus SSEARCH, for partial-length proteins.
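The "minimum errors" criterion, as the abstract describes it, picks the score threshold that minimizes the total number of false positives plus false negatives for a family. A minimal sketch of that idea follows; the input format (a list of `(score, is_family_member)` pairs), the rule that a hit is called positive when its score is at or above the threshold, and the function name are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def minimum_errors(scored: List[Tuple[float, bool]]) -> Tuple[float, int, int]:
    """Return (threshold, false_positives, false_negatives) minimizing FP + FN.

    A hit counts as a predicted family member when score >= threshold.
    Ties are resolved in favour of the lowest threshold.
    """
    # Candidate thresholds: every observed score, plus one above all of them
    # (which corresponds to calling nothing a family member).
    candidates = sorted({score for score, _ in scored})
    candidates.append(float("inf"))

    best = None
    for threshold in candidates:
        false_pos = sum(1 for s, member in scored if s >= threshold and not member)
        false_neg = sum(1 for s, member in scored if s < threshold and member)
        if best is None or false_pos + false_neg < best[1] + best[2]:
            best = (threshold, false_pos, false_neg)
    return best


# Toy search result: three true family members, three unrelated sequences.
SCORED_HITS = [(90.0, True), (80.0, True), (70.0, False),
               (60.0, True), (50.0, False), (40.0, False)]
```

For the toy data, a threshold of 60 admits one unrelated sequence and misses no family member, giving one error in total; a threshold of 80 also gives one error, but as a false negative instead. Counting which kind of error dominates at the optimal threshold is the comparison behind the abstract's statement about sensitivity versus selectivity.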

16.
Bioinformatic tools have become essential to biologists in their quest to understand the vast quantities of sequence data, and now whole genomes, which are being produced at an ever-increasing rate. Many of these sequence data are single-pass sequences, such as sample sequences from organisms closely related to other organisms of interest that have already been sequenced, or cDNAs or expressed sequence tags (ESTs). These single-pass sequences often contain errors, including frameshifts, which complicate the identification of homologues, especially at the protein level. Therefore, sequence searches with this type of data are often performed at the nucleotide level. The most commonly used sequence search algorithms for the identification of homologues are Washington University's and the National Center for Biotechnology Information's (NCBI) versions of the BLAST suites of tools, which are to be found on websites all over the world. The work reported here examines the use of these tools for comparing sample sequence datasets to a known genome. It shows that care must be taken when choosing the parameters to use with the BLAST algorithms. NCBI's version of gapped BLASTn gives much shorter, and sometimes different, top alignments than those found using Washington University's version of BLASTn (which also allows for gaps), when both are used with their default parameters. Most of the differences in performance were found to be due to the choices of default parameters rather than underlying differences between the two algorithms. Washington University's version, used with defaults, compares very favourably with the results obtained using the accurate but computationally intensive Smith-Waterman algorithm.

17.
MOTIVATION: Although pairwise sequence alignment is essential in comparative genomic sequence analysis, it has proven difficult to precisely determine the gap penalties for a given pair of sequences. A common practice is to employ default penalty values. However, there are a number of problems associated with using gap penalties. First, alignment results can vary depending on the gap penalties, making it difficult to explore appropriate parameters. Second, the statistical significance of an alignment score is typically based on a theoretical model of non-gapped alignments, which may be misleading. Finally, there is no way to control the number of gaps for a given pair of sequences, even if the number of gaps is known in advance. RESULTS: In this paper, we develop and evaluate the performance of an alignment technique that allows the researcher to set the number of allowable gaps a priori, rather than using gap penalties. We compare this approach with the Smith-Waterman and Needleman-Wunsch techniques on a set of structurally aligned protein sequences. We demonstrate that this approach outperforms the other techniques, especially for short sequences (56-133 residues) with low similarity (<25%). Further, by employing a statistical measure, we show that it can be used to assess the quality of the alignment in relation to the true alignment with the associated optimal number of gaps. AVAILABILITY: The implementation of the described methods, SANK_AL, is available at http://cbbc.murdoch.edu.au/ CONTACT: matthew@cbbc.murdoch.edu.au.

18.
In recent years we have witnessed a growth in sequencing yield and in the number of samples sequenced and, as a result, the growth of publicly maintained sequence databases. This increase in data has put high demands on protein similarity search algorithms, which have two ever-opposing goals: keeping running times acceptable while maintaining a high enough level of sensitivity. The most time-consuming step of a similarity search is the local alignment of query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, aligning a query to the whole database is usually too slow. Therefore, the majority of protein similarity search methods apply heuristics to reduce the number of candidate sequences in the database before doing the exact local alignment. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4–5 times faster than SSEARCH, 6–25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.
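The quadratic cost mentioned in this abstract comes from the dynamic-programming table that Smith-Waterman fills: one cell per pair of positions in the two sequences. The following is a minimal sketch of the score computation only, under simplifying assumptions that are not taken from the paper: a fixed match/mismatch score instead of a substitution matrix, and a linear gap penalty instead of affine gaps. It is unrelated to the SW#db implementation.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty).

    Runs in O(len(a) * len(b)) time and O(len(b)) memory, keeping only
    the previous row of the dynamic-programming table.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

Each cell depends only on its left, upper and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, which is the kind of structure that parallel implementations can exploit.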

19.
Orthologs are generally under selective pressure against loss of function, while paralogs usually accumulate mutations and finally die or deviate in terms of function or regulation. Most ortholog detection methods contaminate the resulting datasets with a substantial number of paralogs. Therefore we aimed to implement a straightforward method that allows the detection of ortholog clusters with a reduced number of paralogs from completely sequenced genomes. The described cross-species expansion of the reciprocal best BLAST hit method is a time-effective method for ortholog detection, which results in 68% truly orthologous clusters, and the procedure specifically enriches for single-copy orthologs. The detection of true orthologs can provide a phylogenetic toolkit to better understand evolutionary processes. In a study across six photosynthetic eukaryotes, nuclear genes of putative mitochondrial origin were shown to be over-represented among single-copy orthologs. These orthologs are involved in fundamental biological processes like amino acid metabolism or translation. Molecular clock analyses based on this dataset yielded divergence time estimates for the red/green algae (1,142 MYA), green algae/land plant (725 MYA), mosses/seed plant (496 MYA), gymno-/angiosperm (385 MYA) and monocotyledons/core eudicotyledons (301 MYA) splits. Electronic supplementary material: The online version of this article (doi:) contains supplementary material, which is available to authorized users.

20.