Similar literature
Found 20 similar articles (search time: 31 ms)
1.
To study the genomic divergence between human and chimpanzee, large-scale genomic sequence alignments were performed. The genomic sequences of human and chimpanzee were first masked with RepeatMasker and the repeats were excluded before alignments. The repeats were then reinserted into the alignments of nonrepetitive segments and entire sequences were aligned again. A total of 2.3 million base pairs (Mb) of genomic sequences, including repeats, were aligned and the average nucleotide divergence was estimated to be 1.22%. The Jukes-Cantor (JC) distances (nucleotide divergences) in nonrepetitive (1.44 Mb) and repetitive sequences (0.86 Mb) are 1.14% and 1.34%, respectively, suggesting a slightly higher average rate in repetitive sequences. Annotated coding and noncoding regions of homologous chimpanzee genes were also retrieved from GenBank and compared. The average synonymous and nonsynonymous divergences in 88 coding genes are 1.48% and 0.55%, respectively. The JC distances in intron, 5' flanking, 3' flanking, promoter, and pseudogene regions are 1.47%, 1.41%, 1.68%, 0.75%, and 1.39%, respectively. It is not clear why the genetic distances in most of these regions are somewhat higher than those in genomic sequences. One possible explanation is that some of the genes may be located in regions with higher mutation rates.
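The Jukes-Cantor distances quoted in this abstract correct the observed proportion of differing sites p for multiple substitutions at the same site, using the standard formula d = -(3/4) ln(1 - 4p/3). The following is a minimal sketch of that correction, not the authors' pipeline; the function names and the gap-skipping convention are assumptions made for illustration.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped
    (an assumed convention for this sketch).
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor (JC69) distance for an observed difference proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    # At divergences near 1%, the correction is very small.
    print(jukes_cantor_distance(0.0122))
```

At the low divergences reported here the corrected distance is only marginally larger than the raw proportion of differences, which is why the abstract can treat the JC distance and the nucleotide divergence as nearly interchangeable.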

2.
SUMMARY: In the segment-by-segment approach to sequence alignment, pairwise and multiple alignments are generated by comparing gap-free segments of the sequences under study. This method is particularly efficient in detecting local homologies, and it has been used to identify functional regions in large genomic sequences. Herein, an algorithm is outlined that calculates optimal pairwise segment-by-segment alignments in essentially linear space. AVAILABILITY: The program is available at the Bielefeld Bioinformatics Server (BiBiServ) at http://bibiserv.techfak.uni-bielefeld.de/dialign/

3.
PALMA: mRNA to genome alignments using large margin algorithms   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Despite many years of research on how to properly align sequences in the presence of sequencing errors, alternative splicing and micro-exons, the correct alignment of mRNA sequences to genomic DNA is still a challenging task. RESULTS: We present a novel approach based on large margin learning that combines accurate splice site predictions with common sequence alignment techniques. By solving a convex optimization problem, our algorithm, called PALMA, tunes the parameters of the model such that true alignments score higher than other alignments. We study the accuracy of alignments of mRNAs containing artificially generated micro-exons to genomic DNA. In a carefully designed experiment, we show that our algorithm accurately identifies the intron boundaries as well as boundaries of the optimal local alignment. It outperforms all other methods: for 5702 artificially shortened EST sequences from Caenorhabditis elegans and human, it correctly identifies the intron boundaries in all except two cases. The best other method is a recently proposed method called exalin, which misaligns 37 of the sequences. Our method also demonstrates robustness to mutations, insertions and deletions, retaining accuracy even at high noise levels. AVAILABILITY: Datasets for training, evaluation and testing, additional results and a stand-alone alignment tool implemented in C++ and Python are available at http://www.fml.mpg.de/raetsch/projects/palma

4.
Post-processing long pairwise alignments   Cited by: 2 (self-citations: 0, citations by others: 2)
MOTIVATION: The local alignment problem for two sequences requires determining similar regions, one from each sequence, and aligning those regions. For alignments computed by dynamic programming, current approaches for selecting similar regions may have potential flaws. For instance, the criterion of Smith and Waterman can lead to inclusion of an arbitrarily poor internal segment. Other approaches can generate an alignment scoring less than some of its internal segments. RESULTS: We develop an algorithm that decomposes a long alignment into sub-alignments that avoid these potential imperfections. Our algorithm runs in time proportional to the original alignment's length. Practical applications to alignments of genomic DNA sequences are described.

5.
The extant mammalian groups Monotremata, Marsupialia and Placentalia are, according to the 'Theria' hypothesis, traditionally classified into two subclasses. The subclass Prototheria includes the monotremes and subclass Theria marsupials and placental mammals. Based on some morphological and molecular data, an alternative proposition, the Marsupionta hypothesis, favours a sister group relationship between monotremes and marsupials to the exclusion of placental mammals. Phylogenetic analyses of single genes and even multiple gene alignments have not yet been able to conclusively resolve this basal mammalian divergence. We have examined this problem using one data set composed of expressed sequence tags (EST) and another containing 1 510 509 nucleotide (nt) sites from 1358 inferred cDNA genomic sequences. All analyses of the concatenated sequences unambiguously supported the Theria hypothesis. The Marsupionta hypothesis was rejected with high statistical confidence from both data sets. In spite of the strong support for Theria, a non-negligible number of single genes supported either of the two alternative hypotheses. The divergence between monotremes and therian mammals was estimated to have taken place 168–178 Mya, a dating compatible with the fossil record. Considering the long common evolutionary branch of therians, it is surprising that sequence data from many thousand amino acid sites were needed to conclusively resolve their relationship to monotremes. This finding draws attention to other mammalian divergences that have been taken as unequivocally settled based on much smaller alignments. EST data provide a comprehensive random sample of protein coding sequences and an economic way to produce large amounts of data for phylogenetic analysis of species for which genomic sequences are not yet available.

6.
The application of Needleman-Wunsch alignment techniques to biological sequences is complicated by two serious problems when the sequences are long: the running time, which scales as the product of the lengths of sequences, and the difficulty in obtaining suitable parameters that produce meaningful alignments. The running time problem is often corrected by reducing the search space, using techniques such as banding, or chaining of high-scoring pairs. The parameter problem is more difficult to fix, partly because the probabilistic model, which Needleman-Wunsch is equivalent to, does not capture a key feature of biological sequence alignments, namely the alternation of conserved blocks and seemingly unrelated nonconserved segments. We present a solution to the problem of designing efficient search spaces for pair hidden Markov models that align biological sequences by taking advantage of their associated features. Our approach leads to an optimization problem, for which we obtain a 2-approximation algorithm, and that is based on the construction of Manhattan networks, which are close relatives of Steiner trees. We describe the underlying theory and show how our methods can be applied to alignment of DNA sequences in practice, successfully reducing the Viterbi algorithm search space of alignment PHMMs by three orders of magnitude.
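The running-time problem and the banding remedy mentioned in this abstract can be illustrated with a small global-alignment scorer. This is a sketch of textbook Needleman-Wunsch with a linear gap penalty and an optional diagonal band, not the Manhattan-network method of the paper; the scoring values and the `band` parameter are assumptions chosen for illustration.

```python
NEG_INF = float("-inf")


def nw_score(a, b, match=1, mismatch=-1, gap=-2, band=None):
    """Global alignment score of a and b (Needleman-Wunsch, linear gaps).

    If band is given, only cells with |i - j| <= band are filled, so the
    work per row drops from len(b) to about 2 * band + 1 cells.
    Only two rows are kept in memory at any time.
    """
    n, m = len(a), len(b)
    if band is not None and abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")

    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap if band is None or j <= band else NEG_INF
            for j in range(m + 1)]

    for i in range(1, n + 1):
        cur = [NEG_INF] * (m + 1)
        lo = 0 if band is None else max(0, i - band)
        hi = m if band is None else min(m, i + band)
        if lo == 0:
            cur[0] = i * gap
        for j in range(max(1, lo), hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                         prev[j] + gap,       # gap in b
                         cur[j - 1] + gap)    # gap in a
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(nw_score("GATTACA", "GATTACA"))
    print(nw_score("ACGT", "AGT", band=1))
```

A band restricts the search to alignments that stay close to the main diagonal, so the score it returns is optimal only among those alignments; that trade-off is what motivates more careful search-space designs.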

7.
8.
Benchmarking tools for the alignment of functional noncoding DNA   Cited by: 1 (self-citations: 0, citations by others: 1)

Background

Numerous tools have been developed to align genomic sequences. However, their relative performance in specific applications remains poorly characterized. Alignments of protein-coding sequences typically have been benchmarked against "correct" alignments inferred from structural data. For noncoding sequences, where such independent validation is lacking, simulation provides an effective means to generate "correct" alignments with which to benchmark alignment tools.

Results

Using rates of noncoding sequence evolution estimated from the genus Drosophila, we simulated alignments over a range of divergence times under varying models incorporating point substitution, insertion/deletion events, and short blocks of constrained sequences such as those found in cis-regulatory regions. We then compared "correct" alignments generated by a modified version of the ROSE simulation platform to alignments of the simulated derived sequences produced by eight pairwise alignment tools (Avid, BlastZ, Chaos, ClustalW, DiAlign, Lagan, Needle, and WABA) to determine the off-the-shelf performance of each tool. As expected, the ability to align noncoding sequences accurately decreases with increasing divergence for all tools, and declines faster in the presence of insertion/deletion evolution. Global alignment tools (Avid, ClustalW, Lagan, and Needle) typically have higher sensitivity over entire noncoding sequences as well as in constrained sequences. Local tools (BlastZ, Chaos, and WABA) have lower overall sensitivity as a consequence of incomplete coverage, but have high specificity to detect constrained sequences as well as high sensitivity within the subset of sequences they align. Tools such as DiAlign, which generate both local and global outputs, produce alignments of constrained sequences with both high sensitivity and specificity for divergence distances in the range of 1.25–3.0 substitutions per site.

Conclusion

For species with genomic properties similar to Drosophila, we conclude that a single pair of optimally diverged species analyzed with a high performance alignment tool can yield accurate and specific alignments of functionally constrained noncoding sequences. Further algorithm development, optimization of alignment parameters, and benchmarking studies will be necessary to extract the maximal biological information from alignments of functional noncoding DNA.

9.
The chronological scenario of the evolution of hominoid primates has been thoroughly investigated since the advent of the molecular clock hypothesis. With the availability of genomic sequences for all hominid genera and other anthropoids, we may have reached the point at which the information from sequence data alone will not provide further evidence for the inference of the hominid evolution timescale. To verify this conjecture, we have compiled a genomic data set for all of the anthropoid genera. Our estimate places the Homo/Pan divergence at approximately 7.4 Ma, the Gorilla lineage divergence at approximately 9.7 Ma, the basal Hominidae divergence at 18.1 Ma and the basal Hominoidea divergence at 20.6 Ma. By inferring the theoretical limit distribution of posterior densities under a Bayesian framework, we show that it is unlikely that lengthier alignments or the availability of new genomic sequences will provide additional information to reduce the uncertainty associated with the divergence time estimates of the four hominid genera. A reduction of this uncertainty will be achieved only by the inclusion of more informative calibration priors.

10.
MOTIVATION: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. RESULTS: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRS to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length.
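The idea of locality-sensitive hashing for ungapped matches can be sketched as follows: each window of length d is hashed on k randomly chosen positions, so two windows that differ at only a few sites still land in the same bucket with reasonable probability, and repeating the sampling raises the chance of catching them. This is a generic illustration under assumed parameter names (`d`, `k`, `rounds`), not the LSH-ALL-PAIRS implementation.

```python
import random
from collections import defaultdict


def hamming(x, y):
    """Number of mismatching positions between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def lsh_candidate_pairs(seq_a, seq_b, d, k, rounds, seed=0):
    """Return window start pairs (i, j) that collide in at least one round.

    Each round samples k of the d window positions and buckets every
    window of seq_a by the characters at those positions; windows of
    seq_b are then looked up in the same buckets.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(rounds):
        positions = sorted(rng.sample(range(d), k))
        buckets = defaultdict(list)
        for i in range(len(seq_a) - d + 1):
            key = "".join(seq_a[i + p] for p in positions)
            buckets[key].append(i)
        for j in range(len(seq_b) - d + 1):
            key = "".join(seq_b[j + p] for p in positions)
            for i in buckets.get(key, ()):
                candidates.add((i, j))
    return candidates


def verified_matches(seq_a, seq_b, d, k, rounds, max_subs, seed=0):
    """Keep only candidates whose windows differ at <= max_subs sites."""
    return {(i, j)
            for i, j in lsh_candidate_pairs(seq_a, seq_b, d, k, rounds, seed)
            if hamming(seq_a[i:i + d], seq_b[j:j + d]) <= max_subs}
```

Identical windows always collide, while a window pair with r mismatches survives a single round with probability of roughly (1 - r/d)^k, so smaller k or more rounds increases sensitivity at the cost of more candidates to verify.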

11.
12.
Most bioinformatics analyses require the assembly of a multiple sequence alignment. It has long been suspected that structural information can help to improve the quality of these alignments, yet the effect of combining sequences and structures has not been evaluated systematically. We developed 3DCoffee, a novel method for combining protein sequences and structures in order to generate high-quality multiple sequence alignments. 3DCoffee is based on TCoffee version 2.00, and uses a mixture of pairwise sequence alignments and pairwise structure comparison methods to generate multiple sequence alignments. We benchmarked 3DCoffee using a subset of HOMSTRAD, the collection of reference structural alignments. We found that combining TCoffee with the threading program Fugue makes it possible to improve the accuracy of our HOMSTRAD dataset by four percentage points when using one structure only per dataset. Using two structures yields an improvement of ten percentage points. The measures carried out on HOM39, a HOMSTRAD subset composed of distantly related sequences, show a linear correlation between multiple sequence alignment accuracy and the ratio of the number of provided structures to the total number of sequences. Our results suggest that in the case of distantly related sequences, a single structure may not be enough for computing an accurate multiple sequence alignment.

13.
Comparative sequence analyses, including such fundamental bioinformatics techniques as similarity searching, sequence alignment and phylogenetic inference, have become a mainstay for researchers studying type 1 Human Immunodeficiency Virus (HIV-1) genome structure and evolution. Implicit in comparative analyses is an underlying model of evolution, and the chosen model can significantly affect the results. In general, evolutionary models describe the probabilities of replacing one amino acid character with another over a period of time. Most widely used evolutionary models for protein sequences have been derived from curated alignments of hundreds of proteins, usually based on mammalian genomes. It is unclear to what extent these empirical models are generalizable to a very different organism, such as HIV-1, the most extensively sequenced organism in existence. We developed a maximum likelihood model fitting procedure to a collection of HIV-1 alignments sampled from different viral genes, and inferred two empirical substitution models, suitable for describing between- and within-host evolution. Our procedure pools the information from multiple sequence alignments, and the provided software implementation can be run efficiently in parallel on a computer cluster. We describe how the inferred substitution models can be used to generate scoring matrices suitable for alignment and similarity searches. Our models had a consistently superior fit relative to the best existing models and to parameter-rich data-driven models when benchmarked on independent HIV-1 alignments, demonstrating evolutionary biases in amino-acid substitution that are unique to HIV, and that are not captured by the existing models. The scoring matrices derived from the models showed a marked difference from common amino-acid scoring matrices. The use of an appropriate evolutionary model recovered a known viral transmission history, whereas a poorly chosen model introduced phylogenetic error. We argue that our model derivation procedure is immediately applicable to other organisms with extensive sequence data available, such as Hepatitis C and Influenza A viruses.

14.
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
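The difference between contiguous and spaced seeds described in this abstract can be made concrete with a small hit finder. In the sketch below a seed is written as a string of `1` (position must match) and `0` (position is ignored); this notation and the function names are assumptions for illustration, and the indel seeds introduced by the paper are not implemented here.

```python
from collections import defaultdict


def seed_key(seq, start, pattern):
    """Characters of seq at the '1' positions of pattern, from start."""
    return "".join(seq[start + k] for k, c in enumerate(pattern) if c == "1")


def seed_hits(query, target, pattern):
    """All (query_pos, target_pos) pairs that agree on every '1' position.

    A pattern of all '1's is a contiguous seed; any '0' makes it a
    spaced seed that tolerates a mismatch at that position.
    """
    span = len(pattern)
    index = defaultdict(list)
    for j in range(len(target) - span + 1):
        index[seed_key(target, j, pattern)].append(j)
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(query, i, pattern), ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query, target = "ACGTA", "ACCTA"   # one substitution in the middle
    print(seed_hits(query, target, "111"))    # contiguous seed, weight 3
    print(seed_hits(query, target, "1101"))   # spaced seed, weight 3
```

Both seeds have the same weight (three required matches), but only the spaced seed produces a hit on this pair because its ignored position absorbs the substitution. Neither form tolerates an insertion or deletion inside the seed span, which is the gap that indel seeds are meant to fill.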

15.
We generated nucleotide sequences for H-2Kk and H-2Dk from the C3H mouse, as well as for a genomic clone of H-2Db, in order to conduct an evolutionary analysis of the H-2 genes from three haplotypes, k, d, and b. H-2Kk from both the C3H and AKR strains, H-2Kd, H-2Kb, H-2Dk, H-2Ld, H-2Dd, H-2Db, and H-2Dp DNA sequences were aligned, and the alignments used to construct phylogenetic trees inferring the evolutionary relationships among the nine genes by two independent methods. Both approaches yielded trees with similar topologies. In addition, the sequence alignments revealed patterns of nucleotide substitutions which implicate both point mutation and recombination in the divergence of the H-2 genes. Future considerations for evolutionary analysis of class I genes are discussed.

16.
When detecting positive selection in proteins, the prevalence of errors resulting from misalignment and the ability of alignment filters to mitigate such errors are not well understood, but filters are commonly applied to try to avoid false positive results. Focusing on the sitewise detection of positive selection across a wide range of divergence levels and indel rates, we performed simulation experiments to quantify the false positives and false negatives introduced by alignment error and the ability of alignment filters to improve performance. We found that some aligners led to many false positives, whereas others resulted in very few. False negatives were a problem for all aligners, increasing with sequence divergence. Of the aligners tested, PRANK's codon-based alignments consistently performed the best and ClustalW performed the worst. Of the filters tested, GUIDANCE performed the best and Gblocks performed the worst. Although some filters showed good ability to reduce the error rates from ClustalW and MAFFT alignments, none were found to substantially improve the performance of PRANK alignments under most conditions. Our results revealed distinct trends in error rates and power levels for aligners and filters within a biologically plausible parameter space. With the best aligner, a low false positive rate was maintained even with extremely divergent indel-prone sequences. Controls using the true alignment and an optimal filtering method suggested that performance improvements could be gained by improving aligners or filters to reduce the prevalence of false negatives, especially at higher divergence levels and indel rates.

17.
Fast and exact comparison of large genomic sequences remains a challenging task in biosequence analysis. We consider the problem of finding all epsilon-matches between two sequences, i.e., all local alignments over a given length with an error rate of at most epsilon. We study this problem theoretically, giving an efficient q-gram filter for solving it. Two applications of the filter are also discussed, in particular genomic sequence assembly and BLAST-like sequence comparison. Our results show that the method is 25 times faster than BLAST, while not being heuristic.
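A q-gram filter rests on the classical q-gram lemma: two strings within edit distance k, the longer of length n, share at least (n - q + 1) - k*q substrings of length q. Pairs that share fewer q-grams can be discarded without ever aligning them, which is why such a filter loses no true matches. The sketch below applies that lemma to a single pair of strings; it is an illustration of the general principle with assumed function names, not the filter developed in the paper.

```python
from collections import Counter


def qgram_threshold(n, q, k):
    """Minimum number of shared q-grams implied by the q-gram lemma."""
    return (n - q + 1) - k * q


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    count_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    count_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum((count_a & count_b).values())


def passes_filter(a, b, q, k):
    """True if (a, b) cannot be ruled out as a match with <= k errors.

    A True result still needs verification by a real alignment; a False
    result is a safe rejection when the threshold is positive.
    """
    n = max(len(a), len(b))
    return shared_qgrams(a, b, q) >= qgram_threshold(n, q, k)


if __name__ == "__main__":
    print(passes_filter("ACGTACGTAC", "ACGTTCGTAC", q=3, k=1))
    print(passes_filter("AAAAAAAAAA", "CCCCCCCCCC", q=3, k=1))
```

The filter is only useful when the threshold is positive, so q must be chosen small enough relative to the window length and the allowed number of errors.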

18.
Human enteroviruses consist of more than 60 serotypes, reflecting a wide range of evolutionary divergence. They have been genetically classified into four clusters on the basis of sequence homology in the coding region of the single-stranded RNA genome. To explore further the genetic relationships between human enteroviruses and to characterize the evolutionary mechanisms responsible for variation, previously sequenced genomes were subjected to detailed comparison. Bootstrap and genetic similarity analyses were used to systematically scan the alignments of complete genomic sequences. Bootstrap analysis provided evidence from an early recombination event at the junction of the 5' noncoding and coding regions of the progenitors of the current clusters. Analysis within the genetic clusters indicated that enterovirus prototype strains include intraspecies recombinants. Recombination breakpoints were detected in all genomic regions except the capsid protein coding region. Our results suggest that recombination is a significant and relatively frequent mechanism in the evolution of enterovirus genomes.

19.
MOTIVATION: When studying multiple alignments of genomic sequences one frequently aims to locate and count regions which satisfy a set of constraints. These regions may be putatively functional, but researchers may also be interested in quantifying the frequency of occurrences of certain patterns. RESULTS: We have developed a program that applies simple formulas and pattern specifications to multiple alignments, reporting the positions and counts of conforming regions. As an example, we have navigated a 15-species alignment of the CAV2-CAV1 region and outlined some findings regarding PPARgamma binding sites. AVAILABILITY: Our software and the accompanying documentation can be obtained at no charge by contacting the authors. It can also be accessed at http://ranger.uta.edu/~nick/compgen

20.
Sequence conservation between species is useful both for locating coding regions of genes and for identifying functional noncoding segments. Hence interspecies alignment of genomic sequences is an important computational technique. However, its utility is limited without extensive annotation. We describe a suite of software tools, PipTools, and related programs that facilitate the annotation of genes and putative regulatory elements in pairwise alignments. The alignment server PipMaker uses the output of these tools to display detailed information needed to interpret alignments. These programs are provided in a portable format for use on common desktop computers and both the toolkit and the PipMaker server can be found at our Web site (http://bio.cse.psu.edu/). We illustrate the utility of the toolkit using annotation of a pairwise comparison of the mouse MHC class II and class III regions with orthologous human sequences and subsequently identify conserved, noncoding sequences that are DNase I hypersensitive sites in chromatin of mouse cells.
