Found 20 similar articles (search time: 13 ms)
1.
ABSTRACT: BACKGROUND: A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software is restricted to time-reversible models and is not optimized to generate nonhomogeneous data (i.e. placing distinct substitution rates on different lineages). RESULTS: We present the first package designed to generate MSAs evolving under discrete-time Markov processes on phylogenetic trees, directly from probability substitution matrices. Based on the input model and a phylogenetic tree in the Newick format (with branch lengths measured as the expected number of substitutions per site), the algorithm produces DNA alignments of desired length. GenNon-h is publicly available for download. CONCLUSION: The software presented here is an efficient tool to generate DNA MSAs on a given phylogenetic tree. GenNon-h provides the user with nonstationary or nonhomogeneous phylogenetic data that are well suited for testing complex biological hypotheses and for exploring the limits of reconstruction algorithms and their robustness to such models.
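The idea of generating an alignment directly from substitution probability matrices can be illustrated with a minimal sketch. This is not GenNon-h's code: the tree encoding (a dict of child to parent and matrix), the helper `jc_like`, and all numeric values are assumptions made for illustration, and the matrices are supplied directly rather than derived from Newick branch lengths.

```python
import random

BASES = "ACGT"

def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def simulate_alignment(tree, root_dist, length, seed=0):
    """Evolve `length` independent sites down a rooted tree.

    tree: dict mapping child -> (parent, 4x4 transition matrix), with every
          parent listed before its children (hypothetical encoding).
    root_dist: base frequencies at the root, in the order A, C, G, T.
    """
    rng = random.Random(seed)
    states = {"root": [sample(root_dist, rng) for _ in range(length)]}
    for child, (parent, matrix) in tree.items():
        # Each site changes according to the row of the parent's state.
        states[child] = [sample(matrix[s], rng) for s in states[parent]]
    return {name: "".join(BASES[s] for s in seq) for name, seq in states.items()}

def jc_like(p):
    """Hypothetical helper: probability p of changing to each other base."""
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]

# A different matrix on every branch makes the process nonhomogeneous.
tree = {
    "inner": ("root", jc_like(0.02)),
    "A": ("root", jc_like(0.10)),
    "B": ("inner", jc_like(0.05)),
    "C": ("inner", jc_like(0.01)),
}
alignment = simulate_alignment(tree, [0.4, 0.1, 0.1, 0.4], length=30)
for leaf in ("A", "B", "C"):
    print(leaf, alignment[leaf])
```

Because each branch carries its own matrix and the root distribution need not be the stationary distribution of any of them, the same loop covers nonstationary and nonhomogeneous settings.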
2.
Lebrun E Santini JM Brugna M Ducluzeau AL Ouchane S Schoepp-Cothenet B Baymann F Nitschke W 《Molecular biology and evolution》2006,23(6):1180-1191
Previously published phylogenetic trees reconstructed on "Rieske protein" sequences are frequently at odds with each other, with those of other subunits of the parent enzymes and with small-subunit rRNA trees. These differences are shown to be at least partially if not completely due to problems in the reconstruction procedures. A major source of erroneous Rieske protein trees lies in the presence of a large, poorly conserved domain prone to accommodate very long insertions in well-defined structural hot spots, substantially hampering multiple alignments. The remaining smaller domain, in contrast, is too conserved to allow distant phylogenies to be deduced with sufficient confidence. Three-dimensional structures of representatives from this protein family are now available from phylogenetically distant species and from diverse enzymes. Multiple alignments can thus be refined on the basis of these structures. We show that structurally guided alignments of Rieske proteins from Rieske-cytochrome b complexes and arsenite oxidases strongly reduce conflicts between resulting trees and those obtained on their companion enzyme subunits. Further problems encountered during this work, consisting mainly of database errors such as wrong annotations and frameshifts, are described. The obtained results are discussed against the background of hypotheses stipulating pervasive lateral gene transfer in prokaryotes.
3.
Liu K Warnow TJ Holder MT Nelesen SM Yu J Stamatakis AP Linder CR 《Systematic biology》2012,61(1):90-106
Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.
4.
Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage (Total citations: 10, self-citations: 0, citations by others: 10)
The similarity of two nucleotide sequences is often expressed in terms of
evolutionary distance, a measure of the amount of change needed to
transform one sequence into the other. Given two sequences with a small
distance between them, can their similarity be explained by their base
composition alone? The nucleotide order of these sequences contributes to
their similarity if the distance is much smaller than their average
permutation distance, which is obtained by calculating the distances for
many random permutations of these sequences. To determine whether their
similarity can be explained by their dinucleotide and codon usage, random
sequences must be chosen from the set of permuted sequences that preserve
dinucleotide and codon usage. The problem of choosing random dinucleotide
and codon-preserving permutations can be expressed in the language of graph
theory as the problem of generating random Eulerian walks on a directed
multigraph. An efficient algorithm for generating such walks is described.
This algorithm can be used to choose random sequence permutations that
preserve (1) dinucleotide usage, (2) dinucleotide and trinucleotide usage,
or (3) dinucleotide and codon usage. For example, the similarity of two
60-nucleotide DNA segments from the human beta-1 interferon gene
(nucleotides 196-255 and 499-558) is not just the result of their nonrandom
dinucleotide and codon usage.
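The Eulerian-walk formulation can be sketched for the simplest case, preserving dinucleotide usage only (case 1 above). The code below is an illustrative reading of that idea, not the authors' published implementation: each base is a vertex, each adjacent pair is a directed edge, a random "final exit" edge is chosen per vertex until those exits all lead to the last base, and the remaining edges are shuffled before the walk is replayed. Function names are hypothetical, and no claim is made here about the exact sampling distribution.

```python
import random
from collections import Counter

def reaches(vertex, exit_edge, last):
    """Follow the chosen final exits from `vertex`; True if they lead to `last`."""
    seen = set()
    while vertex != last:
        if vertex in seen:
            return False  # the exits form a cycle that never reaches `last`
        seen.add(vertex)
        vertex = exit_edge[vertex]
    return True

def dinucleotide_shuffle(seq, rng=None):
    """Return a random permutation of `seq` with identical dinucleotide counts."""
    rng = rng or random.Random()
    if len(seq) < 3:
        return seq
    # One directed edge per adjacent pair, grouped by source base.
    successors = {}
    for a, b in zip(seq, seq[1:]):
        successors.setdefault(a, []).append(b)
    last = seq[-1]
    others = [v for v in successors if v != last]
    # Re-draw the final exits until every vertex is routed to the last base.
    while True:
        exit_edge = {v: rng.choice(successors[v]) for v in others}
        if all(reaches(v, exit_edge, last) for v in others):
            break
    # Shuffle the other edges of each vertex and keep its final exit at the end.
    order = {}
    for v, succ in successors.items():
        rest = list(succ)
        if v in exit_edge:
            rest.remove(exit_edge[v])
        rng.shuffle(rest)
        if v in exit_edge:
            rest.append(exit_edge[v])
        order[v] = rest
    # Replay the walk from the first base, consuming edges in the new order.
    used = {v: 0 for v in order}
    current, out = seq[0], [seq[0]]
    for _ in range(len(seq) - 1):
        nxt = order[current][used[current]]
        used[current] += 1
        out.append(nxt)
        current = nxt
    return "".join(out)

def dinucleotide_counts(s):
    return Counter(zip(s, s[1:]))

seq = "ACGTTGCAACGGTACC"
shuffled = dinucleotide_shuffle(seq, random.Random(1))
print(shuffled, dinucleotide_counts(shuffled) == dinucleotide_counts(seq))
```

Preserving codon usage as well (case 3) needs additional bookkeeping that this sketch does not attempt.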
5.
Sammeth M Rothgänger J Esser W Albert J Stoye J Harmsen D 《Bioinformatics (Oxford, England)》2003,19(12):1592-1593
Integrating different alignment strategies, a layout editor and tools deriving phylogenetic trees in a 'multiple alignment environment' helps to investigate and enhance results of multiple sequence alignment by hand. QAlign combines algorithms for fast progressive and accurate simultaneous multiple alignment with a versatile editor and a dynamic phylogenetic analysis in a convenient graphical user interface.
6.
7.
8.
9.
AltAVisT: comparing alternative multiple sequence alignments (Total citations: 2, self-citations: 0, citations by others: 2)
We introduce a WWW-based tool that is able to compare two alternative multiple alignments of a given sequence set. Regions where both alignments coincide are color-coded to visualize the local agreement between the two alignments and to identify those regions that can be considered to be reliably aligned. AVAILABILITY: http://bibiserv.techfak.uni-bielefeld.de/altavist/.
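One simple way to find where two alignments of the same sequences coincide is to compare their columns. The sketch below is an assumption about how such a comparison could be done, not a description of AltAVisT's actual scoring: a column counts as agreeing only if the other alignment contains a column pairing exactly the same residues. The function names and the toy alignments are hypothetical.

```python
def column_signatures(alignment):
    """For each column, record the residue index each sequence contributes (None for a gap)."""
    next_index = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for row, char in enumerate(column):
            if char == "-":
                signature.append(None)
            else:
                signature.append(next_index[row])
                next_index[row] += 1
        signatures.append(tuple(signature))
    return signatures

def agreement_mask(aln_a, aln_b):
    """True for each column of aln_a that also occurs, residue for residue, in aln_b."""
    shared = set(column_signatures(aln_b))
    return [signature in shared for signature in column_signatures(aln_a)]

aln_a = ["ACG-T", "A-GCT"]
aln_b = ["ACG-T", "AG-CT"]
print(agreement_mask(aln_a, aln_b))  # [True, False, False, True, True]
```

The resulting mask could drive a color-coding of the first alignment, with `True` columns marked as reliably aligned.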
10.
The 16S rRNA nucleotide sequence of Mycobacterium leprae: phylogenetic position and development of DNA probes (Total citations: 2, self-citations: 0, citations by others: 2)
The almost complete 16S rRNA sequence from Mycobacterium leprae was determined by direct sequencing of the chromosomal gene amplified by the polymerase chain reaction. The primary sequence revealed an insertion of 12 nucleotides at the 5' end of the 16S rRNA gene, which consists of an A-T stretch and appears to be unique for M. leprae. Within the mycobacteria M. leprae branches off with a group of slow-growing species comprising M. scrofulaceum, M. kansasii, M. szulgai, M. malmoense, M. intracellulare and M. avium. A systematic comparison of the nucleotide sequence resulted in the characterization of oligonucleotide probes which are highly specific for M. leprae. The probes hybridized exclusively to 16S rRNA nucleic acids from M. leprae, but not to nucleic acids from 20 cultivable fast- and slow-growing mycobacteria.
11.
SUMMARY: The Kinase Sequence Database (KSD) located at http://kinase.ucsf.edu/ksd contains information on 290 protein kinase families derived by profile-based clustering of the non-redundant list of sequences obtained from a GenBank-wide search. Included in the database are a total of 5,041 protein kinases from over 100 organisms. Clustering into families is based on the extent of homology within the kinase catalytic domain (250-300 residues in length). Alignments of the families are viewed by interactive Excel-based sequence spreadsheets. In addition, KSD features evolutionary trees derived for each family and detailed information on each sequence as well as links to the corresponding GenBank entries. Sequence manipulation tools, such as evolutionary tree generation, novel sequence assignment, and statistical analysis, are also provided. AVAILABILITY: The kinase sequence database is a web-based service accessible at http://kinase.ucsf.edu/ksd CONTACT: buzko@cmp.ucsf.edu; shokat@cmp.ucsf.edu/ksd
12.
Frequentist estimation of coalescence times from nucleotide sequence data using a tree-based partition (Total citations: 5, self-citations: 0, citations by others: 5)
This article proposes a method of estimating the time to the most recent common ancestor (TMRCA) of a sample of DNA sequences. The method is based on the molecular clock hypothesis, but avoids assumptions about population structure. Simulations show that in a wide range of situations, the point estimate has small bias and the confidence interval has at least the nominal coverage probability. We discuss conditions that can lead to biased estimates. Performance of this estimator is compared with existing methods based on the coalescence theory. The method is applied to sequences of Y chromosomes and mtDNAs to estimate the coalescent times of human male and female populations.
13.
14.
15.
Klaere S Gesell T von Haeseler A 《Philosophical transactions of the Royal Society of London. Series B, Biological sciences》2008,363(1512):4041-4047
We introduce another view of sequence evolution. Contrary to other approaches, we model the substitution process in two steps. First, we assume (arbitrary) scaled branch lengths on a given phylogenetic tree. Second, we allocate a Poisson-distributed number of substitutions on the branches. The probability of placing a mutation on a branch is proportional to its relative branch length. More importantly, the action of a single mutation on an alignment column is described by a doubly stochastic matrix, the so-called one-step mutation matrix. This matrix leads to analytical formulae for the posterior probability distribution of the number of substitutions for an alignment column.
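The two-step allocation can be made concrete with a small sketch. It assumes, for illustration only, that the Poisson mean equals the total scaled tree length (the abstract does not state the mean), uses a hand-written Poisson sampler because Python's standard library has none, and invents the branch names and lengths.

```python
import math
import random
from collections import Counter

def poisson(mean, rng):
    """Sample a Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def allocate_substitutions(branch_lengths, rng):
    """Draw a Poisson number of substitutions and place each on a branch
    with probability proportional to that branch's relative length."""
    total = sum(branch_lengths.values())
    n_substitutions = poisson(total, rng)  # assumption: mean = total tree length
    names = list(branch_lengths)
    weights = [branch_lengths[name] / total for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_substitutions))

# Hypothetical scaled branch lengths for a four-branch tree.
branches = {"b1": 0.5, "b2": 1.5, "b3": 0.25, "b4": 0.75}
placed = allocate_substitutions(branches, random.Random(42))
print(dict(placed))
```

Applying a one-step mutation matrix to an alignment column for each placed substitution would be the next step; it is omitted here because the abstract does not give the matrix.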
16.
Mouse heavy chain variable regions: nucleotide sequence of a germ-line VH gene segment (Total citations: 14, self-citations: 4, citations by others: 14)
We have constructed a library of Balb/c mouse embryo DNA in the vector Charon 4A. The library was searched for sequences homologous to the VH region of a cloned cDNA of the UPC10 heavy chain mRNA. In this paper, we describe the structure and the partial nucleotide sequence of one such clone (VH441). The nucleotide sequence of this germ-line gene indicates that it encodes amino acids 1-98 of the X44 and J601 galactan-binding VH regions, but that it differs from the UPC10 VH segment by four single base changes. The VH gene appears to contain a 101-base-long intervening sequence within a precursor sequence identical to the precursor sequence of UPC10. The 3' noncoding sequence of the V gene contains the two conserved sequences found in embryonic V DNA segments, CACAGTG and ACATGAACC, separated by 23 nucleotides, and a sequence CACTGTG separated by 33 nucleotides from the first heptamer.
17.
Shah N Couronne O Pennacchio LA Brudno M Batzoglou S Bethel EW Rubin EM Hamann B Dubchak I 《Bioinformatics (Oxford, England)》2004,20(5):636-643
MOTIVATION: The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient, these visualization algorithms must consistently accommodate a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. RESULTS: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. Multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produce a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. AVAILABILITY: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plug-in 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu
18.
G Y Srinivasarao L S Yeh C R Marzec B C Orcutt W C Barker 《Bioinformatics (Oxford, England)》1999,15(5):382-390
MOTIVATION: The Protein Information Resource (PIR) maintains a database of annotated and curated alignments in order to visually represent interrelationships among sequences in the PIR-International Protein Sequence Database, to spread and standardize protein names, features and keywords among members of a family or superfamily, and to aid us in classifying sequences, in identifying conserved regions, and in defining new homology domains. RESULTS: Release 22.0 (December 1998) of the PIR-ALN database contains a total of 3806 alignments, including 1303 superfamily, 2131 family and 372 homology domain alignments. This is an appropriate dataset to develop and extract patterns, test profiles, train neural networks or build Hidden Markov Models (HMMs). These alignments can be used to standardize and spread annotation to newer members by homology, as well as to understand the modular architecture of multidomain proteins. PIR-ALN includes 529 alignments that can be used to develop patterns not represented in PROSITE, Blocks, PRINTS and Pfam databases. The ATLAS information retrieval system can be used to browse and query the PIR-ALN alignments. AVAILABILITY: PIR-ALN is currently being distributed as a single ASCII text file along with the title, member, species, superfamily and keyword indexes. The quarterly and weekly updates can be accessed via the WWW at pir.georgetown.edu. The quarterly updates can also be obtained by anonymous FTP from the PIR FTP site at NBRF.Georgetown.edu, directory [ANONYMOUS.PIR.ALIGNMENT].
19.
20.
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.
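Classification by exact k-mer matching can be illustrated with a toy example. This is a deliberately simplified sketch and not Kraken's algorithm or interface: it lets only k-mers unique to one reference vote and returns the majority label, and the reference sequences, labels, and function names are all invented for illustration.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(references, k):
    """Map each k-mer to the set of labels whose reference contains it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index

def classify(read, index, k):
    """Majority vote over the read's k-mers that match exactly one label."""
    votes = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer, ())
        if len(labels) == 1:
            votes[next(iter(labels))] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical reference sequences and labels.
references = {"taxon_1": "ACGTACGTGGCCTTAA", "taxon_2": "TTGACCATGCATGCAA"}
index = build_index(references, k=5)
print(classify("GTGGCCTT", index, k=5))  # taxon_1
print(classify("AAAAAAAA", index, k=5))  # None
```

Exact dictionary lookups avoid any per-read alignment step, which is the property that makes k-mer approaches fast.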