Similar articles
Found 20 similar articles (search time: 31 ms)
1.
2.
The plant mitochondrial rps3 intron was analyzed for substitution and indel rate variation among 15 monocot and dicot angiosperms from 10 genera, including perennial and annual taxa. Overall, the intron sequence was very conserved among angiosperms. Based on length polymorphism, 10 different alleles were identified among the 10 genera. These allelic differences were mainly attributable to large indels. An insertion of 133 nucleotides, observed in the Alnus intron, was partially or completely absent in the other lineages of the family Betulaceae. This insertion was located within domain IV of the secondary-structure model of this group IIA intron. A mobile element of 47 nucleotides that showed homology to sequences located in the rice rps3 intron and in intergenic regions of plant mitochondrial genomes was found within this insertion. Both substitution and indel rates were low among the Betulaceae sequences, but substitution rates were increasingly larger than indel rates in comparisons involving more distantly related taxa. From a secondary-structure model, regions involved in helical structures were shown to be well preserved from indels as compared to substitutions, but compensatory changes were not observed among the angiosperm sequences analyzed. Using approximate divergence times based on the fossil record, substitution and indel rate heterogeneity was observed between different pairs of annual and perennial taxa. In particular, the annual petunia and primrose evolved more than 15 and 10 times faster, for substitution and indel rates respectively, than the perennial birch and alder. This is the first demonstration of an evolutionary rate difference between perennial and annual forms in noncoding DNA, lending support to neutral causes such as the generation time, population size, and speciation rate effects to explain such rate heterogeneity.
Surprisingly, the sequence from the rps3 intron had a high identity with the sequence of intron 1 from the angiosperm mitochondrial nad5 gene, suggesting a common origin of these two group IIA introns.

3.
To understand how protein segments are inserted and deleted during divergent evolution, a set of pairwise alignments containing exactly one gap, and therefore arising from the first insertion-deletion (indel) event in the time separating the homologs, was examined. The alignments showed that "structure breaking" amino acids (PGDNS) were preferred within and flanking gapped regions, as were two residues with hydrophilic side-chains (QE) that frequently occur at the surface of protein folds. Conversely, hydrophobic residues (FMILYVW) occurred infrequently within and flanking the gapped region. These preferences are modestly different in protein pairs separated by an episode of adaptive evolution than in pairs diverging under strong functional constraints. Surprisingly, regions near an indel have not evolved more rapidly than the sequence pair overall, showing no evidence that an indel event must be compensated by local amino acid replacement. The gap lengths are best approximated by a Zipfian distribution, with the probability of a gap of length L decreasing as L^-1.8. These features are largely independent of the length of the gap and the extent of divergence (measured by both silent and non-silent sequence changes) separating the two proteins. Surprisingly, amino acid repeats were discovered in more than a third of the polypeptide segments in and around the gap. These correspond to repeats in the DNA sequence. This suggests that a signature of the mechanism by which indels occur in the DNA sequence remains in the encoded protein sequences. These data suggest specific tools to score gap placement in an alignment. They also suggest tools that distinguish true indels from gaps created by mistaken gene finding, including under-predicted and over-predicted introns.
By providing mechanisms to identify errors, the tools will enhance the value of genome sequence databases in support of integrated paleogenomics strategies used to extract functional information in a post-genomic environment.
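The Zipfian gap-length model in this abstract can be illustrated with a short sketch. The Python snippet below normalizes a probability proportional to L^-1.8 over a finite range of gap lengths. The cutoff `max_len` and the function name are illustrative assumptions and do not come from the study.

```python
def zipf_gap_probs(exponent=1.8, max_len=50):
    """Return {L: P(L)} with P(L) proportional to L**(-exponent) for L = 1..max_len.

    max_len is an arbitrary truncation chosen for illustration only.
    """
    weights = {L: L ** (-exponent) for L in range(1, max_len + 1)}
    total = sum(weights.values())
    return {L: w / total for L, w in weights.items()}


probs = zipf_gap_probs()
# Under this model a gap of length 1 is 2**1.8 times as likely as a gap of length 2.
ratio_1_to_2 = probs[1] / probs[2]
```

One possible use, again as an assumption rather than the authors' tool, is to take the negative logarithm of each probability as a length-dependent gap penalty.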

4.
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence where indels are common. In this paper, we assume a model of homologous sequence alignment which includes indels and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspect of using indel seeds and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
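The difference between contiguous and spaced seeds can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the pattern notation ('1' for a position that must match, '0' for a don't-care position), the function name, and the toy sequences are all hypothetical. The indel seeds proposed in the abstract are not implemented here.

```python
def seed_hits(query, target, pattern):
    """Return offsets i at which query and target agree on every '1' position
    of pattern when both are read from offset i (ungapped comparison)."""
    span = len(pattern)
    care = [k for k, c in enumerate(pattern) if c == "1"]
    n = min(len(query), len(target))
    return [
        i
        for i in range(n - span + 1)
        if all(query[i + k] == target[i + k] for k in care)
    ]


query = "ACGTACGT"
target = "ACGAACGT"  # toy example: a single mismatch at index 3

contiguous = seed_hits(query, target, "111")   # weight-3 contiguous seed
spaced = seed_hits(query, target, "1101")      # weight-3 spaced seed
```

In this toy case the spaced seed produces a hit at offset 1 that spans the mismatch, which no contiguous seed of the same weight can do.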

5.
Direct Sanger sequencing of a diploid template containing a heterozygous insertion or deletion results in a difficult-to-interpret mixed trace formed by two allelic traces superimposed onto each other. Existing computational methods for deconvolution of such traces require knowledge of a reference sequence or the availability of both direct and reverse mixed sequences of the same template. We describe a simple yet accurate method, which uses dynamic programming optimization to predict superimposed allelic sequences solely from a string of letters representing peaks within an individual mixed trace. We used the method to decode 104 human traces (mean length 294 bp) containing heterozygous indels of 5 to 30 bp, with a mean of 99.1% of bases per allelic sequence reconstructed correctly and unambiguously. Simulations with artificial sequences have demonstrated that the method yields accurate reconstructions when (1) the allelic sequences forming the mixed trace are sufficiently similar, (2) the analyzed fragment is significantly longer than the indel, and (3) multiple indels, if present, are well-spaced. Because these conditions occur in most encountered DNA sequences, the method is widely applicable. It is available as a free Web application, Indelligent, at http://ctap.inhs.uiuc.edu/dmitriev/indel.asp.

6.
Brandström M, Ellegren H. Genetics, 2007, 176(3): 1691-1701
It is increasingly recognized that insertions and deletions (indels) are an important source of genetic as well as phenotypic divergence and diversity. We analyzed length polymorphisms identified through partial (0.25x) shotgun sequencing of three breeds of domestic chicken made by the International Chicken Polymorphism Map Consortium. A data set of 140,484 short indel polymorphisms in unique DNA was identified after filtering for microsatellite structures. There was a significant excess of tandem duplicates at indel sites, with deletions of a duplicate motif outnumbering the generation of duplicates through insertion. Indel density was lower in microchromosomes than in macrochromosomes, in the Z chromosome than in autosomes, and in 100 bp of upstream sequence, 5'-UTR, and first introns than in intergenic DNA and in other introns. Indel density was highly correlated with single nucleotide polymorphism (SNP) density. The mean density of indels in pairwise sequence comparisons was 1.9 × 10^-4 indel events/bp, approximately 5% of the density of SNPs segregating in the chicken genome. The great majority of indels involved a limited number of nucleotides (median 1 bp), with A-rich motifs being overrepresented at indel sites. The overrepresentation of deletions at tandem duplicates indicates that replication slippage in duplicate sequences is a common mechanism behind indel mutation. The correlation between indel and SNP density indicates common effects of mutation and/or selection on the occurrence of indels and point mutations.

7.
Insertions and deletions (indels) are important types of structural variations. Obtaining accurate genotypes of indels may facilitate further genetic study. There are a few existing methods for calling indel genotypes from sequence reads. However, none of these tools can accurately call indel genotypes for indels of all lengths, especially for low coverage sequence data. In this paper, we present GINDEL, an approach for calling genotypes of both insertions and deletions from sequence reads. GINDEL uses a machine learning approach which combines multiple features extracted from next generation sequencing data. We test our approach on both simulated and real data and compare with existing tools, including Genome STRiP, Pindel and Clever-sv. Results show that GINDEL works well for deletions larger than 50 bp on both high and low coverage data. Also, GINDEL performs well for insertion genotyping on both simulated and real data. For comparison, Genome STRiP performs less well for shorter deletions (50–200 bp) on both simulated and real sequence data from the 1000 Genomes Project. Clever-sv performs well for intermediate deletions (200–1500 bp) but is less accurate when coverage is low. Pindel works well only for high coverage data. To summarize, we show that GINDEL not only can call genotypes of insertions and deletions (both short and long) for high and low coverage population sequence data, but also is more accurate and efficient than other approaches. The program GINDEL can be downloaded at: http://sourceforge.net/p/gindel

8.
9.
Little is known about variation of nucleotide insertions/deletions (indels) within species. In Arabidopsis thaliana, we investigated indel polymorphism patterns between two genome sequences and among 96 accessions at 1215 loci. Our study identified patterns in the variation of indel density, size, GC content and distribution, and a correlation between indels and substitutions. We found that the GC content in indel sequences was lower than that in non-indel sequences and that indels typically occur in regions with lower GC content. Patterns of indel frequency distribution among populations were more consistent with neutral expectation than substitution patterns. We also found that the local level of substitutions is positively correlated with indel density and negatively correlated with their distance to the closest indel, suggesting that indels play an important role in nucleotide variation.

10.
Molecular characterization of the waxy locus in sorghum   Cited by: 1 (self-citations: 0, citations by others: 1)
A comparison of approximately 4.5 kb of nucleotide sequence from the waxy locus (the granule-bound starch synthase I [GBSS I] locus) from a waxy line, BTxARG1, and a non-waxy line, QL39, revealed an extremely high level of sequence conservation. Among a total of 24 nucleotide differences and 9 indels, only 2 nucleotide changes resulted in altered amino acid residues. Protein folding prediction software suggested that one of the amino acid changes (Glu to His) may result in an altered protein structure, which may explain the apparently inactive GBSS I present in BTxARG1. This SNP was not found in the second waxy line, RTx2907, which does not produce GBSS I, and no other SNPs or indels were found in the approximately 4 kb of sequence obtained from RTx2907. Using one indel, the waxy locus was mapped to sorghum chromosome SBI-10, which is syntenous to maize chromosome 9; the waxy locus has been mapped to this maize chromosome. The distribution of indels in a diverse set of sorghum germplasm suggested that there are two broad types of non-waxy GBSS I alleles, each type comprising several alleles, and that the two waxy alleles in BTxARG1 and RTx2907 have evolved from one of the non-waxy allele types. The Glu/His polymorphism was found only in BTxARG1 and derived lines and has potential as a perfect marker for the BTxARG1 source of the waxy allele at the GBSS I locus. The indels correctly predicted the non-waxy phenotype in approximately 65% of diverse sorghum germplasm. The indels co-segregated perfectly with phenotype in two sorghum populations derived from crosses between a waxy and a non-waxy sorghum line, correctly identifying heterozygous lines. Thus, these indel markers or sequence-based SNP markers can be used to follow waxy alleles in sorghum breeding programs in selected pedigrees.

11.
Tandem repeats (TRs) are often present in proteins with crucial functions, including proteins responsible for resistance or pathogenicity and proteins associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries, and yet are of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in the agriculturally important bacterium Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.

12.
Sequence divergence derives from either point substitution or indel (insertion or deletion) processes. We investigated the rates of these two processes both in protein and non-protein coding DNA. We aligned sequence pairs using two pair-hidden Markov models (PHMMs) conjoined by one silent state. The two PHMMs had their own set of parameters to model rates in their respective regions. The aim was to test the hypothesis that the indel mutation rate mimics the point mutation rate. That is, indels are found less often in conserved regions (slow point substitution rate) and more often in non-conserved regions (fast point substitution rate). Both polypeptides and rRNA molecules in our data exhibited a clear distinction between slow and fast rates of the two processes. These two rates served as surrogates to conserved and non-conserved secondary structure components, respectively. With polypeptides we found both the fast indel rate and the fast replacement rate were co-located with hydrophilic residues. We also found that the average concordance, of our alignments with corresponding curated alignments, improves markedly when the model allows either of the two fast rates to colocate with hydrophilic residues. With rRNA molecules, our model did not detect colocation between the fast indel rate and the fast substitution rate. Nevertheless, coupling the indel rates with the point substitution rates across the two regions markedly increased model fit. This result suggests that rRNA pairwise alignments should be modeled after allowing for the two processes to vary simultaneously and independently in the two regions.

13.
We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertions/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.

14.
15.
Positive and negative selection on indel variation may explain the correlation between intron length and recombination levels in natural populations of Drosophila. A nucleotide sequence analysis of the 3.5 kilobase sequence of the alcohol dehydrogenase (Adh) region from 139 Drosophila pseudoobscura strains and one D. miranda strain was used to determine whether positive or negative selection acts on indel variation in a gene that experiences high levels of recombination. A total of 30 deletion and 36 insertion polymorphisms were segregating within D. pseudoobscura populations and no indels were fixed between D. pseudoobscura and its two sibling species D. miranda and D. persimilis. The ratio of Tajima's D to its theoretical minimum value (D_min) was proposed as a metric to assess the heterogeneity in D among D. pseudoobscura loci when the number of segregating sites differs among loci. The magnitude of the D/D_min ratio was found to increase as the rate of population expansion increases, allowing one to assess which loci have an excess of rare variants due to population expansion versus purifying selection. D. pseudoobscura populations appear to have had modest increases in size accounting for some of the observed excess of rare variants. The D/D_min ratio rejected a neutral model for deletion polymorphisms. Linkage disequilibrium among pairs of indels was greater than between pairs of segregating nucleotides. These results suggest that purifying selection removes deletion variation from intron sequences, but not insertion polymorphisms. Genome rearrangement and size-dependent intron evolution are proposed as mechanisms that limit runaway intron expansion.
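A rough sketch of how a D/D_min ratio could be computed is given below in Python. It uses the standard Tajima (1989) constants and takes the sample size n, the number of segregating sites S, and the mean pairwise difference pi as inputs. The abstract does not define D_min, so the sketch assumes it is the value of D when every segregating site is a singleton, which gives a minimum pairwise difference of 2S/n. The function names and the input values are hypothetical and are not data from the study.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S, and mean pairwise difference pi."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


def d_min(n, seg_sites):
    """Assumed minimum of D: every segregating site is a singleton, so pi = 2S/n."""
    return tajimas_d(n, seg_sites, 2.0 * seg_sites / n)


def d_ratio(n, seg_sites, pi):
    """Ratio D / D_min; equals 1 when all variants are singletons."""
    return tajimas_d(n, seg_sites, pi) / d_min(n, seg_sites)


# Hypothetical inputs for illustration only.
ratio = d_ratio(n=20, seg_sites=10, pi=2.0)
```

Because numerator and denominator share the same variance term, the ratio depends only on how far pi falls below S/a1 relative to the assumed all-singleton extreme.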

16.
Mutation rates are used to calibrate molecular clocks and to link genetic variants with human disease. However, mutation rates are not uniform across each eukaryotic genome. Rates for insertion/deletion (indel) mutations have been found to vary widely when examined in vitro and at specific loci in vivo. Here, we report the genome-wide rates of formation and repair of indels made during replication of yeast nuclear DNA. Using over 6000 indels accumulated in four mismatch repair (MMR) defective strains, and statistical corrections for false negatives, we find that indel rates increase by 100 000-fold with increasing homonucleotide run length, representing the greatest effect on replication fidelity of any known genomic parameter. Nonetheless, long genomic homopolymer runs are overrepresented relative to random chance, implying positive selection. Proofreading defects in the replicative polymerases selectively increase indel rates in short repetitive tracts, likely reflecting the distance over which Pols δ and ε interact with duplex DNA upstream of the polymerase active site. In contrast, MMR defects hugely increase indel mutagenesis in long repetitive sequences. Because repetitive sequences are not uniformly distributed among genomic functional elements, the quantitatively different consequences on genome-wide repeat sequence instability conferred by defects in proofreading and MMR have important biological implications.
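Since the abstract relates indel rate to homonucleotide run length, a small sketch of how such runs might be located in a sequence may help. The Python function below is a minimal illustration. Its name, the `min_len` threshold, and the toy sequence are assumptions and are not taken from the study.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Return a list of (start, base, length) for each run of identical bases
    whose length is at least min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Toy sequence: an A run of length 4 and a T run of length 3.
runs = homopolymer_runs("ACAAAAGTTTC", min_len=3)
```

Binning observed indels by the length of the run they fall in would then give a per-run-length rate, under whatever normalization an analysis chooses.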

17.
18.
Analysis of insertions/deletions in protein structures.   Cited by: 17 (self-citations: 0, citations by others: 17)
An analysis of insertions and deletions (indels) occurring in a databank of multiple sequence alignments based on protein tertiary structure is reported. Indels prefer to be short (1 to 5 residues). The average intervening sequence length between them versus the percentage of residue identity in pairwise alignments shows an exponential behaviour, suggesting a stochastic process such that nearly every loop in an ancestral structure is a possible target for indels during evolution. The results also suggest a limit to the average size of indels accommodated by protein structures. The preferred indel conformations are reverse turn and coil as are the preferred conformations at the indel edges (N- and C-terminal sides). Interruptions in helices and strands were observed as very rare events.

19.
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.

20.

Background

Insertions and deletions (indels) are the most abundant form of structural variation in all genomes. Indels have been increasingly recognized as an important source of molecular markers due to high-density occurrence, cost-effectiveness, and ease of genotyping. Coupled with developments in bioinformatics, next-generation sequencing (NGS) platforms enable the discovery of millions of indel polymorphisms by comparing the whole genome sequences of individuals within a species.

Results

A total of 1,973,746 unique indels were identified in 345 maize genomes, with an overall density of 958.79 indels/Mbp, and an average allele number of 2.76, ranging from 2 to 107. There were 264,214 indels with polymorphism information content (PIC) values greater than or equal to 0.5, accounting for 13.39 % of overall indels. Of these highly polymorphic indels, we designed primer pairs for 83,481 and 29,403 indels with major allele differences (i.e. the size difference between the most and second most frequent alleles) greater than or equal to 3 and 8 bp, respectively, based on the differing resolution capabilities of gel electrophoresis. The accuracy of our indel markers was experimentally validated, and among 100 indel markers, average accuracy was approximately 90 %. In addition, we also validated the polymorphism of the indel markers. Of 100 highly polymorphic indel markers, all had polymorphisms with average PIC values of 0.54.
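The polymorphism information content (PIC) threshold used above can be illustrated with a short calculation. The abstract does not say which PIC formula was applied, so the Python sketch below assumes the commonly used Botstein et al. form, PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2, where p_i are allele frequencies. The function name and the example frequencies are hypothetical.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphism information content for allele frequencies that sum to 1
    (assumed Botstein et al. formulation)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


biallelic = pic([0.5, 0.5])            # two equally frequent alleles
triallelic = pic([1 / 3, 1 / 3, 1 / 3])  # three equally frequent alleles
```

Under this assumed formula a biallelic marker cannot exceed a PIC of 0.375, so a PIC of 0.5 or more implies at least three alleles at the locus.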

Conclusions

The maize genome is rich in indel polymorphisms. Intriguingly, the level of polymorphism in genic regions of the maize genome was higher than that in intergenic regions. The polymorphic indel markers developed from this study may enhance the efficiency of genetic research and marker-assisted breeding in maize.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1797-5) contains supplementary material, which is available to authorized users.
