Similar literature
 20 similar articles found (search time: 156 ms)
1.
Indels in DNA sequences frequently affect more than a single nucleotide, creating problems for alignment, character coding and phylogenetic analysis. However, the size and frequency of multiple‐residue indels are not usually tested, and with popular alignment packages their reconstruction is indirectly achieved by reducing the affine (gap extension) cost. We explored the length distribution of indels in intron sequences of the gene Mp20 by modifying the gap opening and gap extension costs. Given a “known” tree for the study group, global homology levels were greatest under low gap cost, with gap extension costs of roughly 0.4‐fold the opening cost. Different approaches to gap coding and weighting suggested that taxonomic congruence was correlated with high frequencies of multiple‐position indels, with indel length frequency peaking at 2–5 bp and few indels above 15 bp, but also including a proportion of indels > 100 bp. Only a small minority of indels could be reconstructed as single‐position indels. Consequently, tree topologies improved when homologous multinucleotide indels were recoded as binary characters, which would otherwise be highly homoplastic and weighted characters under single‐position coding. In tree‐generating alignment procedures as implemented in POY, where gap penalty determines the character weight during tree search, the problem of assigning inappropriately high weight to multiple‐residue indels could partly be overcome by setting the extension costs to about 0.4‐fold lower than gap opening costs. We conclude that multiple consecutive gap positions are not independent characters and hence methods for parsimony reconstruction of long indels are required. Finally, we also observed a general lack of correlation between taxonomic and character congruence, demonstrating the difficulties of applying congruence criteria to decide among competing alignments. 
This highlights the value of recent model‐based alignment procedures which can implement the statistical distributions of indel size classes, and do not rely on potentially circular strategies for optimizing overall congruence. © The Willi Hennig Society 2006.
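As a rough illustration of the affine gap cost scheme discussed in this abstract, the Python sketch below compares the cost of one multi-residue indel with the cost of the same number of single-position indels. The function name and the numeric opening cost are assumptions chosen for illustration; only the 0.4 ratio comes from the abstract, and alignment packages differ on whether the opening cost already covers the first gap position.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap of `length` positions under an affine scheme.

    Assumed convention: the first position pays `gap_open`, and every
    further position pays `gap_extend`.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


gap_open = 10.0               # hypothetical opening cost
gap_extend = 0.4 * gap_open   # extension at 0.4-fold the opening cost

# One 5-bp indel versus five independent 1-bp indels.
one_long_indel = affine_gap_cost(5, gap_open, gap_extend)
five_short_indels = 5 * affine_gap_cost(1, gap_open, gap_extend)
print(one_long_indel, five_short_indels)
```

Under this convention a single long indel is cheaper than several short ones whenever the extension cost is below the opening cost, which is why the extension cost indirectly controls how readily multi-residue indels are reconstructed.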

2.
Nuclear introns are commonly used as phylogenetic markers, but a number of issues related to alignment strategies, indel treatments, and the incorporation of length-variant heterozygotes (LVHs) are not routinely addressed when generating phylogenetic hypotheses. Topological congruence in relation to an extensive mitochondrial DNA multigene phylogeny (derived from 2,423 bp of 12S, 16S, ND4, and CYTB genes) of the Asian pitviper Trimeresurus radiation was used to compare combinations of "by eye" and edited and unedited ClustalX 1.8 alignments of two nuclear introns. Indels were treated as missing data or as fifth character states, or were assigned simple or multistate codes. Upon recovery of the optimal alignment and indel treatment strategy, a total evidence approach was used to investigate the phylogenetic utility of the indels and test new generic arrangements within Trimeresurus. Approximately one third of the intron data partitions exhibited LVHs, suggesting that they are common in introns. Furthermore, a simple concatenation approach can facilitate the incorporation of LVHs into phylogenetic analyses to make use of all available data and investigate mechanisms of molecular evolution. Analyses of ClustalX 1.8-assisted alignments were generally more congruent than the "by eye" alignment and the analysis of a simple coded, edited ClustalX 1.8 (gap opening cost 5, gap extension cost 1) alignment revealed the most congruent tree. The total evidence approach supported the new arrangements within Trimeresurus, suggesting that the phylogeny should be considered as a working benchmark in Asian pitviper systematics. Finally, a critical appraisal of the diverse array of indels (56 to 57 per intron, ranging from 1 to 151 bp in length) suggested that they are a combination of Hennigian and homoplasious events unrelated to indel size or location within the intron. [Alignment; indels; intron analysis; length-variant heterozygotes; Trimeresurus.].
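The idea of coding indels as separate characters can be sketched in Python as follows. This is a deliberately simplified presence/absence coding in which every gap with a distinct start and end becomes one binary character. It is not the exact coding scheme used in the study, it omits the inapplicable states that fuller schemes assign to overlapping gaps, and the taxon names and sequences are invented for illustration.

```python
import re


def gap_intervals(seq):
    """Return (start, end) pairs for each maximal run of '-' in an aligned sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", seq)]


def code_indels(alignment):
    """Code each distinct gap (same start and end) as one binary character.

    Returns the sorted list of gap intervals and a dict mapping each
    sequence name to a string of 0/1 states, one state per interval.
    """
    per_seq = {name: set(gap_intervals(seq)) for name, seq in alignment.items()}
    intervals = sorted(set().union(*per_seq.values()))
    matrix = {
        name: "".join("1" if iv in gaps else "0" for iv in intervals)
        for name, gaps in per_seq.items()
    }
    return intervals, matrix


# Hypothetical three-taxon alignment.
alignment = {
    "taxonA": "ACGT---ACGTA",
    "taxonB": "ACGT---ACG-A",
    "taxonC": "ACGTTTTACGTA",
}
intervals, matrix = code_indels(alignment)
print(intervals)
print(matrix)
```

Here the 3-bp gap shared by taxonA and taxonB is scored once as a single character rather than as three independent gap positions, which is the main difference from treating gaps as a fifth character state.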

3.
Although there has been a recent proliferation in maximum‐likelihood (ML)‐based tree estimation methods based on a fixed multiple sequence alignment (MSA), little research has been done on incorporating indel information in this traditional framework. We show, using a simple model on a single character example, that a trivial alignment of a different form than that previously identified for parsimony is optimal in ML under standard assumptions treating indels as “missing” data, but that it is not optimal when indels are incorporated into the character alphabet. We show that the optimality of the trivial alignment is not an artefact of simplified theory assumptions by demonstrating that trivial alignment likelihoods of five different multiple sequence alignment datasets exhibit this phenomenon. These results demonstrate the need for use of indel information in likelihood analysis on fixed MSAs, and suggest that caution must be exercised when drawing conclusions from software implementations claiming improvements in likelihood scores under an indels‐as‐missing assumption. © The Willi Hennig Society 2012.

4.
In this study we use sensitivity analysis sensu Wheeler (1995) for a matrix entirely composed of DNA sequences. We propose that not only congruence but also phylogenetic structure, as measured by character resampling, should be used to choose among competing weighting regimes. An extensive analysis of a five‐gene data set for Themira (Sepsidae: Diptera) reveals that even with different ways of partitioning the data, measures of topological congruence, character incongruence, and phylogenetic structure favor similar weighting regimes involving the down‐weighting of transitions. We furthermore use sensitivity analysis for obtaining empirical evidence that allows us to select weights for third positions, deciding between treating indels as fifth character states or missing values, and choosing between manual and computational alignments. For our data, sensitivity analysis favors manual alignment over a Clustal‐generated numerical alignment, the treatment of indels as fifth character states over considering them missing values, and equal weights for all positions in protein‐encoding genes over the down‐weighting of third positions. Among the topological congruence measures compared, symmetric tree distance performed best. Partitioned Bremer Support analysis reveals that COI contributes the largest amount of support for our phylogenetic tree for Themira. © The Willi Hennig Society 2005.

5.
The performance of the computer program for phylogenetic analysis, POY, and its two implemented methods, "optimization alignment" and "fixed-states optimization," are explored for four data sets. Four gap costs are analyzed for every partition; some of the partitions (the 18S rRNA) are treated as a single fragment or in increasing fragments of 3, 10, and 30. Comparisons within and among methods are undertaken according to gap cost, number of fragments in which the sequences are divided, tree length, character congruence, topological congruence, primary homology statements, and computation time.

6.
This work presents a novel pairwise statistical alignment method based on an explicit evolutionary model of insertions and deletions (indels). Indel events of any length are possible according to a geometric distribution. The geometric distribution parameter, the indel rate, and the evolutionary time are all maximum likelihood estimated from the sequences being aligned. Probability calculations are done using a pair hidden Markov model (HMM) with transition probabilities calculated from the indel parameters. Equations for the transition probabilities make the pair HMM closely approximate the specified indel model. The method provides an optimal alignment, its likelihood, the likelihood of all possible alignments, and the reliability of individual alignment regions. Human alpha and beta-hemoglobin sequences are aligned, as an illustration of the potential utility of this pair HMM approach.
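The geometric length distribution mentioned in this abstract can be sketched in Python as below. The parameterization (an extension probability `q`, lengths starting at 1) and all function names are assumptions made for illustration. The estimator shown is the textbook maximum likelihood estimate from a list of already observed indel lengths, which is simpler than the joint estimation from unaligned sequences that the abstract describes.

```python
def geometric_indel_pmf(k, q):
    """P(indel length = k) for k >= 1, given extension probability q in [0, 1)."""
    if k < 1 or not 0.0 <= q < 1.0:
        raise ValueError("need k >= 1 and 0 <= q < 1")
    return (1.0 - q) * q ** (k - 1)


def mean_indel_length(q):
    """Expected indel length under the geometric model above."""
    return 1.0 / (1.0 - q)


def estimate_q(lengths):
    """Maximum likelihood estimate of q from observed indel lengths (all >= 1)."""
    if not lengths or min(lengths) < 1:
        raise ValueError("need a non-empty list of lengths >= 1")
    return 1.0 - len(lengths) / sum(lengths)


observed = [1, 1, 2, 4]          # hypothetical observed indel lengths
q_hat = estimate_q(observed)
print(q_hat, mean_indel_length(q_hat), geometric_indel_pmf(3, q_hat))
```

In a pair HMM, a parameter playing the role of `q` would typically appear as the probability of staying in an insert or delete state, so that gap lengths emitted by the model follow this geometric form.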

7.
We use a multigene data set (the mitochondrial locus and nine nuclear gene regions) to test phylogenetic relationships in the South American "lava lizards" (genus Microlophus) and describe a strategy for aligning noncoding sequences that accounts for differences in tempo and class of mutational events. We focus on seven nuclear introns that vary in size and frequency of multibase length mutations (i.e., insertions and deletions; indels) and present a manual alignment strategy that incorporates indels for each intron. Our method is based on mechanistic explanations of intron evolution and does not require a guide tree. We also use a progressive alignment algorithm (Probabilistic Alignment Kit; PRANK) that distinguishes insertions from deletions and avoids the "gapcost" conundrum. We describe an approach to selecting a guide tree purged of ambiguously aligned regions and use this to refine PRANK performance. We show that although manual alignment is successful in finding repeat motifs and the most obvious indels, some regions can only be subjectively aligned, and there are limits to the size and complexity of a data matrix for which this approach can be taken. PRANK alignments identified more parsimony-informative indels while simultaneously increasing nucleotide identity in conserved sequence blocks flanking the indel regions. When comparing manual and PRANK with two widely used methods (CLUSTAL, MUSCLE) for the alignment of the most length-variable intron, only PRANK recovered a tree congruent at deeper nodes with the combined data tree inferred from all nuclear gene regions. We take this concordance as an objective function of alignment quality and present a strongly supported phylogenetic hypothesis for Microlophus relationships. 
From this hypothesis we show that (1) a coded indel data partition derived from the PRANK alignment contributed significantly to nodal support and (2) the indel data set permitted detection of significant conflict between mitochondrial and nuclear data partitions, which we hypothesize arose from secondary contact of distantly related taxa, followed by hybridization and mtDNA introgression.

8.
9.
Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here, we introduce several improvements to indel modeling: 1) While previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; 2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; 3) we develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical data sets; and 4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.

10.
Partition-free congruence analysis: implications for sensitivity analysis   Cited by: 1 (self-citations: 0, citations by others: 1)
A criterion is proposed to compare systematic hypotheses based on multiple sources of information under a diverse set of interpretive assumptions (i.e., sensitivity analysis of Wheeler, 1995). This metric, the Meta‐Retention Index (MRI), is the retention index (RI) of Farris calculated over the set of conventional homologous qualitative characters (ordered, unordered, Sankoff, etc.) and molecular fragment characters sensu Wheeler (1996, 1999). The superiority of this measure to other similar measures (e.g., incongruence length difference test) comes from its independence from partition information. The only values that participate in its calculation are the minimum, maximum and observed cost (= cladogram cost) of each character. The partition (morphology, gene locus) from which the variant may have come is irrelevant. In the special cases where there is only a single data partition, this measure is equivalent to the conventional RI; and in the case where there are single fragment characters per partition (contiguous molecular loci as data sets) the measure is identical to the complement of the Rescaled Incongruence Length Difference (RILD) of Wheeler and Hayashi (1998). The MRI can serve as an optimality criterion for deciding among systematic hypotheses based on the same data, but different sets of analysis assumptions (e.g., character weights, indel costs). The MRI may lose discriminatory power in situations where a minority of highly congruent characters is given high weight. This situation can be detected and seems unlikely to occur frequently in real data sets. © The Willi Hennig Society 2006.
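Since the abstract states that only the minimum, maximum and observed cost of each character enter the calculation, the index can be sketched in a few lines of Python. The sketch assumes the standard ensemble form of the retention index, (sum of maxima minus sum of observed costs) divided by (sum of maxima minus sum of minima); the abstract does not spell out the formula, and the function name and cost values are invented for illustration.

```python
def meta_retention_index(characters):
    """Ensemble retention index over (min_cost, max_cost, observed_cost) triples.

    Characters from any partition (morphology, sequence fragments) are
    pooled; no partition labels are used.
    """
    total_min = sum(c[0] for c in characters)
    total_max = sum(c[1] for c in characters)
    total_obs = sum(c[2] for c in characters)
    if total_max == total_min:
        raise ValueError("index undefined when maximum and minimum costs coincide")
    return (total_max - total_obs) / (total_max - total_min)


# Hypothetical costs: two qualitative characters and one fragment character.
characters = [(1, 4, 2), (2, 6, 3), (10, 30, 18)]
mri = meta_retention_index(characters)
print(round(mri, 4))
```

Because the costs are pooled before the ratio is taken, a heavily weighted fragment character dominates the sums, which matches the abstract's caution about a minority of highly weighted, highly congruent characters.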

11.
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
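The difference between contiguous and spaced seeds can be sketched in Python as follows. The seed is written as a string of `1` (position must match) and `0` (position is ignored), applied to an ungapped comparison of two sequences. The function name, seed patterns and sequences are assumptions for illustration, and the sketch does not implement the indel seeds proposed in the paper.

```python
def seed_hits(a, b, seed):
    """Offsets at which `seed` hits the ungapped comparison of a and b.

    `seed` is a string of '1' (must match) and '0' (don't care).
    """
    n = min(len(a), len(b))
    span = len(seed)
    hits = []
    for i in range(n - span + 1):
        if all(a[i + j] == b[i + j] for j, s in enumerate(seed) if s == "1"):
            hits.append(i)
    return hits


# Hypothetical sequences that differ at positions 3 and 8.
a = "ACGTACGTAC"
b = "ACGAACGTTC"

contiguous_hits = seed_hits(a, b, "1111")   # four consecutive matches required
spaced_hits = seed_hits(a, b, "11011")      # same weight, one ignored position
print(contiguous_hits, spaced_hits)
```

Both seeds require four matching positions, but the spaced seed tolerates a mismatch under its `0` position. Neither can hit across an insertion or deletion, because the comparison is ungapped, which is the limitation the abstract's indel seeds are meant to address.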

12.
Insertion and deletion events (indels) provide a suite of markers with enormous potential for molecular phylogenetics. Using many more indel characters than those in previous studies, we here for the first time address the impact of indel inclusion on the phylogenetic inferences of Arctoidea (Mammalia: Carnivora). Based on 6843 indel characters from 22 nuclear intron loci of 16 species of Arctoidea, our analyses demonstrate that when the indels were not taken into consideration, the tree supporting the monophyly of Ursidae and Pinnipedia and the tree supporting the monophyly of Pinnipedia and Musteloidea were both recovered, whereas inclusion of indels using three different indel coding schemes gave identical phylogenetic tree topologies supporting the monophyly of Ursidae and Pinnipedia. Our work brings new perspectives on the previously controversial placements among Arctoidea families, and provides another example demonstrating the importance of identifying and incorporating indels in the phylogenetic analyses of introns. In addition, comparison of indel incorporation methods revealed that the three indel coding methods are all advantageous over treating indels as missing data, given that incorporating indels produces consistent results across methods. This is the first report of the impact of different indel coding schemes on phylogenetic reconstruction at the family level in Carnivora, which indicates that indels should be taken into account in future phylogenetic analyses.

13.
On gaps.   Cited by: 4 (self-citations: 0, citations by others: 4)
Gaps result from the alignment of sequences of unequal length during primary homology assessment. Viewed as character states originating from particular biological events (mutation), gaps contain historical information suitable for phylogenetic analysis. The effect of gaps as a source of phylogenetic data is explored via sensitivity analysis and character congruence among different data partitions. Example data sets are provided to show that gaps contain important phylogenetic information not recovered by those methods that omit gaps in their calculations. However, gap cost schemes are arbitrary (although they must be explicit) and thus data exploration is a necessity of molecular analyses, while character congruence is necessary as an external criterion for hypothesis decision.

14.
The majority of the available methods for the molecular identification of species use pairwise sequence divergences between the query and reference sequences (DNA barcoding). The presence of multiple insertions and deletions (indels) in the target genomic regions is generally regarded as a problem, as it introduces ambiguities in sequence alignments. However, we have recently shown that a high level of species discrimination is attainable in all taxa of life simply by considering the length of hypervariable regions defined by indel variants. Each species is tagged with a numeric profile of fragment lengths: a true numeric barcode. In this study, we describe a multifunctional computational workbench (named SPInDel for SPecies Identification by Insertions/Deletions) to assist researchers using variable‐length DNA sequences, and we demonstrate its applicability in molecular ecology. The SPInDel workbench provides a step‐by‐step environment for the alignment of target sequences, selection of informative hypervariable regions, design of PCR primers and the statistical validation of the species‐identification process. In our test data sets, we were able to discriminate all species from two genera of frogs (Ansonia and Leptobrachium) inhabiting lowland rainforests and mountain regions of South‐East Asia and species from the most common genus of coral reef fishes (Apogon). Our method can complement conventional DNA barcoding systems when indels are common (e.g. in rRNA genes) without the required step of DNA sequencing. The executable files, source code, documentation and test data sets are freely available at http://www.portugene.com/SPInDel/SPInDel_webworkbench.html .
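The notion of a numeric profile of fragment lengths can be sketched in Python as below. The sketch measures the lengths of the variable regions lying between consecutive conserved anchor motifs. It is not the SPInDel implementation, and the anchor motifs, species labels and sequences are invented for illustration.

```python
def length_profile(seq, anchors):
    """Lengths of the regions between consecutive conserved anchors.

    Anchors are searched left to right as exact substrings. Returns None
    if any anchor cannot be found.
    """
    profile = []
    pos = 0
    prev_end = None
    for anchor in anchors:
        start = seq.find(anchor, pos)
        if start == -1:
            return None
        if prev_end is not None:
            profile.append(start - prev_end)
        prev_end = start + len(anchor)
        pos = prev_end
    return profile


anchors = ["GGCC", "TTAA", "CCGG"]   # hypothetical conserved motifs
species = {
    "species1": "GGCC" + "ACGTA" + "TTAA" + "AC" + "CCGG",
    "species2": "GGCC" + "ACG" + "TTAA" + "ACGTAC" + "CCGG",
}
profiles = {name: length_profile(seq, anchors) for name, seq in species.items()}
print(profiles)
```

Two species are distinguishable under this scheme whenever their profiles differ in at least one region, and the lengths could in principle be read from fragment sizes rather than from sequence.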

15.
Brandström M, Ellegren H. Genetics 2007, 176(3):1691-1701
It is increasingly recognized that insertions and deletions (indels) are an important source of genetic as well as phenotypic divergence and diversity. We analyzed length polymorphisms identified through partial (0.25x) shotgun sequencing of three breeds of domestic chicken made by the International Chicken Polymorphism Map Consortium. A data set of 140,484 short indel polymorphisms in unique DNA was identified after filtering for microsatellite structures. There was a significant excess of tandem duplicates at indel sites, with deletions of a duplicate motif outnumbering the generation of duplicates through insertion. Indel density was lower in microchromosomes than in macrochromosomes, in the Z chromosome than in autosomes, and in 100 bp of upstream sequence, 5'-UTR, and first introns than in intergenic DNA and in other introns. Indel density was highly correlated with single nucleotide polymorphism (SNP) density. The mean density of indels in pairwise sequence comparisons was 1.9 × 10⁻⁴ indel events/bp, approximately 5% of the density of SNPs segregating in the chicken genome. The great majority of indels involved a limited number of nucleotides (median 1 bp), with A-rich motifs being overrepresented at indel sites. The overrepresentation of deletions at tandem duplicates indicates that replication slippage in duplicate sequences is a common mechanism behind indel mutation. The correlation between indel and SNP density indicates common effects of mutation and/or selection on the occurrence of indels and point mutations.

16.
The Sepsidae is, with approximately 300 described species, a relatively small family of cyclorrhaphan flies whose behaviour, morphology, and development have been extensively studied. However, currently the only available tree for Sepsidae is more than 10 years old and was based entirely on morphological characters. Here, we present the results of parsimony and Bayesian analyses based on 75 species, ten genes, and morphology. Parsimony and Bayesian analyses produce largely congruent and well‐supported topologies regardless of whether indels are coded as 5th character states, as missing values, or all sites with indels are removed. The tree confirms the monophyly of Sepsidae and identifies the Ropalomeridae as its sister group. With regard to higher‐level relationships, we identify widespread conflict between the morphological and the DNA sequence data. The proposed hypothesis based on both partitions largely reflects the signal in the molecular data. Particularly surprising is the rejection of two relationship hypotheses with strong morphological support, namely the sister group relationship between Orygma and the remaining Sepsidae and the monophyly of the Sepsis species group. Our partitioned Bremer support (PBS) analyses imply that indel coding has a stronger effect on the relative performance of individual gene partitions than the exclusion of alignment‐ambiguous sequences or the location of a gene on the mitochondrial or nuclear genome. However, these analyses also reveal unexpectedly strong fluctuations in PBS values given that indel treatment has only a minor effect on tree topology and jackknife support. These unexpected fluctuations highlight the need for a comparative study across multiple data sets that investigates the influence of conflict and indel treatment on PBS values. © The Willi Hennig Society 2008.

17.
Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance, pathogenicity and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries, and yet are of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in the agriculturally important bacterium Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.

18.
The robustness of clades to parameter variation may be a desirable quality or even a goal in phylogenetic analyses. Sensitivity analyses used to assess clade stability have invoked the incongruence length difference (ILD or WILD) metric, a measure of congruence among datasets, to compare a series of most‐parsimonious results from re‐running analyses under different analytical conditions. It is also common practice to select a single “optimal” parameter set that minimizes WILD across all parameter sets. However, the divergent molecular evolution of ribosomal genes and protein‐encoding genes—specifically the bias against transversion events in coding genes of conserved function—suggests that deployment of multiple parameter sets could outperform the use of a single parameter set applied to all molecules. We explored congruence in five published datasets by including mixed parameter sets in our sensitivity analysis. In four cases, mixed parameter sets outperformed the previously reported, single optimal parameter set. Conversely, multiple parameter sets did not outperform a single optimal parameter set in a case in which actual strong topological conflict exists between data partitions. Exploration of mixed parameter sets may prove useful when combining ribosomal and protein‐encoding genes, due to the relatively higher frequency of single‐ and double‐base pair indel events in the former, and the relatively lower frequency of transversions in the latter.
© The Willi Hennig Society 2010.
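The selection of a parameter set by minimizing incongruence can be sketched in Python as follows. The sketch assumes the usual normalized incongruence length difference, that is, the combined tree length minus the sum of the individual partition tree lengths, divided by the combined tree length. The parameter-set labels and tree lengths are invented, and the mixed parameter sets explored in the abstract are not modeled.

```python
def ild(combined_length, partition_lengths):
    """Normalized incongruence length difference for one parameter set."""
    if combined_length <= 0:
        raise ValueError("combined tree length must be positive")
    return (combined_length - sum(partition_lengths)) / combined_length


# Hypothetical results: parameter set -> (combined length, per-partition lengths).
runs = {
    "gap1_tv1": (1000, [400, 550]),
    "gap2_tv1": (1200, [500, 670]),
    "gap2_tv2": (1500, [600, 870]),
}

scores = {name: ild(total, parts) for name, (total, parts) in runs.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The tree lengths would come from separate most-parsimonious searches on each partition and on the combined data under the same costs, so each entry in `runs` stands for several full analyses.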

19.
Sawyer SL, Howell WM, Brookes AJ. BioTechniques 2003, 35(2):292-6, 298
Genome variation provides researchers with thousands of markers with which to study human demographic history and phenotypes. Insertion-deletion (indel) polymorphism is an important and abundant form of human genome variation, and convenient methods for genotyping indels are therefore needed. Here we evaluate dynamic allele-specific hybridization (DASH) for its ability to score indels. Evaluation of six model indel DASH assays based on synthetic oligonucleotides showed that length differences of 1-5 bp were accurately scored. Only single probes were required to assay indels of 3-4 bp or less, while longer indels tended to require the use of both allele probes serially. The best results were obtained by central placing of the probe over the indel. Model study findings were confirmed by running indel DASH assays upon PCR-amplified targets representing four polymorphisms from Alzheimer's disease candidate genes APBB1 and LRP1. These indels were genotyped in a set of 121 patients and 156 controls. While no disease association was found, the data quality confirmed that DASH is a robust and useful procedure for genotyping indels of the size range typically found in the human genome.

20.
A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendant states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from a standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable since the homologies derived from the implied alignment depend on its natal cladogram, and any variance between DO and IA + Search would be due to the heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed.
