Similar literature
 20 similar articles found (search time: 562 ms)
1.
There has been considerable interest in the problem of making maximum likelihood (ML) evolutionary trees which allow insertions and deletions. This problem is partly one of formulation: how does one define a probabilistic model for such trees which treats insertion and deletion in a biologically plausible manner? A possible answer to this question is proposed here by extending the concept of a hidden Markov model (HMM) to evolutionary trees. The model, called a tree-HMM, allows what may be loosely regarded as learnable affine-type gap penalties for alignments. These penalties are expressed in HMMs as probabilities of transitions between states. In the tree-HMM, this idea is given an evolutionary embodiment by defining trees of transitions. Just as the probability of a tree composed of ungapped sequences is computed, by Felsenstein's method, using matrices representing the probabilities of substitutions of residues along the edges of the tree, so the probabilities in a tree-HMM are computed by substitution matrices for both residues and transitions. How to define these matrices by an ML procedure using an algorithm that learns from a database of protein sequences is shown here. Given these matrices, one can define a tree-HMM likelihood for a set of sequences, assuming a particular tree topology and an alignment of the sequences to the model. If one could efficiently find the alignment which maximizes (or comes close to maximizing) this likelihood, then one could search for the optimal tree topology for the sequences. An alignment algorithm is defined here which, given a particular tree topology, is guaranteed to increase the likelihood of the model. Unfortunately, it fails to find global optima for realistic sequence sets. Thus further research is needed to turn the tree-HMM into a practical phylogenetic tool.

2.
3.
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlepsilon improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
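
The Felsenstein peeling (pruning) recursion mentioned in this abstract can be illustrated with a minimal sketch. It is not the dnamlepsilon implementation: it assumes a plain four-state Jukes-Cantor substitution model with no gap state, and the tree encoding and the names `jc_prob`, `partial` and `column_likelihood` are hypothetical choices made for illustration only.

```python
import math

BASES = "ACGT"

def jc_prob(i, j, t):
    """Jukes-Cantor probability of base i becoming base j over branch length t
    (t measured in expected substitutions per site). Assumed model, not the paper's."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, column):
    """Conditional likelihoods of the data below `node`, one value per base.

    A leaf is a string naming a sequence; an internal node is a tuple of
    (child, branch_length) pairs. `column` maps leaf names to observed bases.
    """
    if isinstance(node, str):
        # Leaf: likelihood 1 for the observed base, 0 for every other base.
        return [1.0 if b == column[node] else 0.0 for b in BASES]
    result = [1.0] * len(BASES)
    for child, t in node:
        child_l = partial(child, column)
        for xi, x in enumerate(BASES):
            # Sum over the child's possible states, then multiply across children.
            result[xi] *= sum(jc_prob(x, y, t) * child_l[yi]
                              for yi, y in enumerate(BASES))
    return result

def column_likelihood(tree, column):
    """Likelihood of one alignment column, with a uniform prior at the root."""
    return sum(0.25 * l for l in partial(tree, column))

# Hypothetical three-leaf tree: ((human, mouse), chicken).
tree = (((("human", 0.1), ("mouse", 0.1)), 0.05), ("chicken", 0.3))
print(column_likelihood(tree, {"human": "A", "mouse": "A", "chicken": "G"}))
```

Each node is visited once per column, so the cost grows linearly with the number of sequences. An indel-aware model of the kind the abstract describes would, under the same recursion, add the gap character as a fifth state and replace `jc_prob` with probabilities derived from the augmented rate matrix.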

4.
Summary: The problem of choosing an alignment of two or more nucleotide sequences is particularly difficult for nucleic acids, such as 5S ribosomal RNA, which do not code for protein and for which secondary structure is unknown. Given a set of costs for the various types of replacement mutations and for base insertion or deletion, we present a dynamic programming algorithm which finds the optimal (least costly) alignment for a set of N sequences simultaneously, where each sequence is associated with one of the N tips of a given evolutionary tree. Concurrently, protosequences are constructed corresponding to the ancestral nodes of the tree. A version of this algorithm, modified to be computationally feasible, is implemented to align the sequences of 5S RNA from nine organisms. Complete sets of alignments and protosequence reconstructions are done for a large number of different configurations of mutation costs. Examination of the family of curves of total replacements inferred versus the ratio of transitions/transversions inferred, each curve corresponding to a given number of insertions-deletions inferred, provides a method for estimating relative costs and relative frequencies for these different types of mutation.
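
The cost-based dynamic programming idea in this abstract can be sketched for the simplest case of two sequences. This is not the N-sequence tree algorithm the abstract describes; it is the standard pairwise recursion, and the function names and default costs (transition 1, transversion 2, indel 3) are assumptions chosen only to show how separate mutation costs enter the recurrence.

```python
PURINES = {"A", "G"}  # every other base (C, T, U) is treated as a pyrimidine

def sub_cost(a, b, transition=1, transversion=2):
    """Cost of replacing base a with base b (illustrative costs, not the paper's)."""
    if a == b:
        return 0
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion

def align_cost(s, t, indel=3, transition=1, transversion=2):
    """Minimum total cost of aligning sequences s and t."""
    n, m = len(s), len(t)
    # dp[i][j] = minimum cost of aligning the prefixes s[:i] and t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel
    for j in range(1, m + 1):
        dp[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1], transition, transversion),
                dp[i - 1][j] + indel,   # base of s aligned to a gap
                dp[i][j - 1] + indel,   # base of t aligned to a gap
            )
    return dp[n][m]

print(align_cost("ACGU", "AGU"))  # one deleted base at the default indel cost
```

Re-running `align_cost` over a grid of cost settings is the pairwise analogue of the abstract's survey of many mutation-cost configurations.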

5.
The human hypervariable minisatellite MS32 has a well characterised internal repeat unit array and high mutation rates have been observed at this locus. Analysis of MS32 mutants has shown that male germline mutations are polarised to one end of the array and frequently involve complex gene conversion-like events, suggesting that tandem repeat instability may be modulated by cis-acting sequences flanking the locus. In order to investigate the processes affecting MS32 mutation rate and mechanism, we have created transgenic mice harbouring an MS32 allele. Here we describe the organisation of eight transgenic insertions. Analysis of these transgenic loci by MVR-PCR and structural analysis of the junctions between mouse flanking DNA and the transgenic loci has shed light on mechanisms of integration and rearrangement of the tandem repeated transgenes. Sequence analysis of the mouse DNA flanking these transgenes has shown that 5 of the 8 insertions have integrated into mouse gamma satellite repeated sequence. This suggests a non-random integration of the MS32 transgene construct into the mouse genome.

6.
R Kraft, L Kadyk, L A Leinwand. Genomics, 1992, 12(3): 555-566
The rodent 4.5 S RNA is an RNA polymerase III product with a sequence related to the Alu family of interspersed repeated DNA. A previous study identified a tandem array of 4.2-kb repeating units that contain the 4.5 S RNA coding sequence as well as many short repetitive sequences. To understand the genomic organization of this gene family, we have isolated and characterized 4.5 S RNA sequences that are part of the tandem array as well as identified members that are not part of the array. One variant 4.5 S RNA gene family member exhibits length polymorphisms in its minisatellite sites relative to the single previously reported gene. The 4.5 S RNA sequences that are not part of the tandem array possess many of the features of processed pseudogenes and are found adjacent to other interspersed repeated elements. These findings suggest that the mouse 4.5 S RNA can behave as a retroposon, resulting in the accumulation of 4.5 S RNA-like elements at many sites in the genome.

7.
A mosaic minisatellite region has been identified in the mitochondrial genome of Norway spruce (Picea abies). The array was composed of three tandem repeats: PaTR1 (32 bp), PaTR2a (26 bp) and PaTR2b (26 bp). PaTR2a and PaTR2b differed by one base substitution. The analysis of 92 trees covering the whole natural distribution area of the species allowed detection of 11 length variants ranging from 131 bp to 447 bp. This high intra-specific polymorphism relies on variation in the number of the tandem repeats. Population genetic parameters estimated among 14 populations suggested high population differentiation (Gst=0.749). The phylogenetic analysis of the 11 sequenced length variants has been performed using a parsimony approach. The topology of the tree showed a good association of groups with geographical origin and a low level of size homoplasy. The phylogenetic reconstruction also suggests that this minisatellite locus has mainly evolved by an increase in the repeat copy number.

8.

Background  

Minisatellites are genomic loci composed of tandem arrays of short repetitive DNA segments. A minisatellite map is a sequence of symbols that represents the tandem repeat array such that the set of symbols is in one-to-one correspondence with the set of distinct repeats. Due to variations in repeat type and organization as well as copy number, the minisatellite maps have been widely used in forensic and population studies. In either domain, researchers need to compare the set of maps to each other, to build phylogenetic trees, to spot structural variations, and to study duplication dynamics. Efficient algorithms for these tasks are required to carry them out reliably and in reasonable time.

9.
A new hypervariable marker for the human alpha-globin gene cluster.   Cited by: 17 (self-citations: 10; citations by others: 7)
We have located a highly polymorphic region of DNA approximately 100 kb upstream of the human alpha-globin genes (the alpha-globin 5' hypervariable region; 5'HVR). The element responsible is a minisatellite sequence comprising a variable copy number tandem repeat array of a G/C-rich 57-bp sequence. This increases the number of minisatellite elements in the vicinity of the alpha-globin genes to five, all of which share a region of sequence identity, thus raising questions concerning the distribution and origins of such tandem repeat sequences. The 5'HVR is highly polymorphic and, together with other hypervariable regions at this locus, provides a valuable genetic marker on the short arm of chromosome 16.

10.
SUMMARY: Multiple sequence alignment is the NP-hard problem of aligning three or more DNA or amino acid sequences in an optimal way so as to match as many characters as possible from the set of sequences. The popular sequence alignment program ClustalW uses the classical method of approximating a sequence alignment, by first computing a distance matrix and then constructing a guide tree to show the evolutionary relationship of the sequences. We show that parallelizing the ClustalW algorithm can result in significant speedup. Our implementation runs on a cluster of workstations and is written in C with the Message Passing Interface (MPI). Experimental results show that speedup of over 5.5 on six processors is obtainable for most inputs. AVAILABILITY: The software is available upon request from the second author.

11.
Tandem repeat loci such as minisatellites and trinucleotide repeats frequently show instability. We have investigated mutation at human minisatellite MS32 (locus D1S8) transferred to transgenic mice. Three lines of hemizygous transgenic mice were studied. A single-copy line (110D) was seen to be relatively stable, whilst two multicopy lines showed structural instability of the transgene in pedigrees (lines 109 and 110A). For both these lines, mutant structures were detected as a result of mutation events having occurred in the germline or early embryo. Structural changes seen included gain or loss of minisatellite repeat units (110A and 109), alteration of DNA flanking the minisatellite repeat array (109 only) or deletion of the entire transgene (109 only). This work demonstrates that tandem repeat transgenes can show instability and thus provide additional systems for the analysis of repetitive DNA structural change in mice.

12.
Hypermutable minisatellites, a human affair?   Cited by: 6 (self-citations: 0; citations by others: 6)
Bois PR. Genomics, 2003, 81(4): 349-355
Minisatellites are a class of highly polymorphic GC-rich tandem repeats. They include some of the most variable loci in the human genome, with mutation rates ranging from 0.5% to >20% per generation. Structurally, they consist of 10- to 100-bp intermingled variant repeats, making them ideal tools for dissecting mechanisms of instability at tandem repeats. Distinct mutation processes generate rare intra-allelic somatic events and frequent complex conversion-like germline mutations in these repeats. Furthermore, turnover of repeats at human minisatellites is controlled by intense recombinational activity in DNA flanking the repeat array. Surprisingly, whereas other mammalian genomes possess minisatellite-like sequences, hypermutable loci have not been identified in them, which suggests human-specific turnover processes at minisatellite arrays. Attempts to transfer minisatellite germline instability to the mouse have failed. However, yeast models are now revealing valuable information regarding the mechanisms regulating instability at these tandem repeats. Finally, minisatellites and tandem repeats provide exquisitely sensitive molecular tools to detect genomic insults such as ionizing radiation exposure. Surprisingly, by a mechanism that remains elusive, there are transgenerational increases in minisatellite instability.

13.
Bishop AJ, Louis EJ, Borts RH. Genetics, 2000, 156(1): 7-20
Two yeast minisatellite alleles were cloned and inserted into a genetically defined interval in Saccharomyces cerevisiae. Analysis of flanking markers in combination with sequencing allowed the determination of the meiotic events that produced minisatellites with altered lengths. Tetrad analysis revealed that gene conversions, deletions, or complex combinations of both were involved in producing minisatellite variants. Similar changes were obtained following selection for nearby gene conversions or crossovers among random spores. The largest class of events involving the minisatellite was a 3:1 segregation of parental-size alleles, a class that would have been missed in all previous studies of minisatellites. Comparison of the sequences of the parental and novel alleles revealed that DNA must have been removed from the recipient array while a newly synthesized copy of donor array sequences was inserted. The length of inserted sequences did not appear to be constrained by the length of DNA that was removed. In cases where one or both sides of the insertion could be determined, the insertion endpoints were consistent with the suggestion that the event was mediated by alignment of homologous stretches of donor/recipient DNA.

14.
Duplication/deletion polymorphism 5' - to the human beta globin gene.   Cited by: 14 (self-citations: 3; citations by others: 11)
DNA sequence analysis of the human beta globin locus has identified an array of simple tandem repeated sequences upstream from the beta globin structural gene. Comparison of several cloned human beta globin alleles demonstrated a high frequency of sequence heteromorphism at this site apparently due to duplication or deletion of single units of the repeat array. At least two such duplication/deletion events are necessary to account for the observed variation. No other sequence variation was observed, suggesting that duplication/deletion events within the tandem repeat array may be at least 13 to 14 times more frequent than nucleotide substitutions in the surrounding DNA.

15.
We present a stochastic sequence evolution model to obtain alignments and estimate mutation rates between two homologous sequences. The model allows two possible evolutionary behaviors along a DNA sequence in order to determine conserved regions and take its heterogeneity into account. In our model, the sequence is divided into slow and fast evolution regions. The boundaries between these sections are not known. It is our aim to detect them. The evolution model is based on a fragment insertion and deletion process working on fast regions only and on a substitution process working on fast and slow regions with different rates. This model induces a pair hidden Markov structure at the level of alignments, thus making efficient statistical alignment algorithms possible. We propose two complementary estimation methods, namely, a Gibbs sampler for Bayesian estimation and a stochastic version of the EM algorithm for maximum likelihood estimation. Both algorithms involve the sampling of alignments. We propose a partial alignment sampler, which is computationally less expensive than the typical whole alignment sampler. We show the convergence of the two estimation algorithms when used with this partial sampler. Our algorithms provide consistent estimates for the mutation rates and plausible alignments and sequence segmentations on both simulated and real data.

16.
In the process of characterizing a rice wx deletion mutant, an AT-rich minisatellite sequence that consisted of units of approximately 80 bp was detected about 2.3 kb downstream of the wx gene. This AT-rich minisatellite was a multiple-copy element (1 × 10^3 to 2 × 10^3 copies per haploid genome) and interspersed in the rice genome. A BLAST homology search indicated that not only the tandem repeat but also both flanking sequences were conserved among copies. According to the characteristics of the termini (5'-CHH ... CTAG-3') and a target site preference for T, this AT-rich minisatellite accompanying the flanking sequences was classified as a novel transposon, Basho. The results of direct amplification of Basho showed that relatively large variation in size existed in the Basho family. We estimate the variation to be generated not only by alteration of the number of units in the minisatellite but also by duplications of larger blocks including the conserved flanking sequences, caused by single-strand mispairing (SSM) at noncontiguous repeats. Because the AT-rich minisatellite contained in Basho possessed several motifs of the matrix attachment region (MAR) in its repeat unit, the functional role as MAR in the rice genome was discussed.

17.
Nucleotide sequencing identified a tandemly repeated sequence array 22 × 10^3 base pairs from the right-hand DNA terminus of the African swine fever virus (ASFV) genome. The sequence of the repeat array and sequences closely flanking it were compared in the genomes of four groups of ASFV isolates that had very different restriction enzyme site maps. Arrays present in one group of ASFV isolates from East Zambia/Malawi varied in length and contained between 8 and 38 copies of a 17-nucleotide repeat unit. Repeat arrays in a second group of ASFV isolates from Europe were less variable in length but consisted of different types of repeat unit that were divergent in sequence. A third genetically diverse ASFV isolate, LIV 13 from a South Zambia Game Park, contained repeat unit types that were similar to those of European viruses. MFUE6 isolate from an East Zambia Game Park contained a shorter version of the European repeat unit. An eight-base-pair core sequence was conserved between the East Zambia/Malawi and European and LIV 13 repeat units. These tandemly repeated sequence arrays share a number of properties with chromosomal minisatellite DNA. Similar tandem repeat arrays have not been described in poxviruses.

18.
The genomic basis of facioscapulohumeral muscular dystrophy (FSHD) is of considerable interest because of the unique nature of the molecular mutation, which is a deletion within a large, complex DNA tandem array (D4Z4). This repeat maps within 30 kb of the 4q telomere. Although D4Z4 repeat units each contain an open reading frame that could encode a homeodomain protein, there is no evidence that the repeat is transcribed, and the underlying disease mechanism probably involves a position effect. A recent study has identified a protein complex bound to D4Z4 that contains YY1 and HMGB2, implicating a role for D4Z4 as a repressor. The 4q telomere has two variants, 4qA and 4qB. Although these alleles are present at almost equal frequencies in the general population, FSHD is associated only with the 4qA allele and never with 4qB. This suggests a functional difference between the telomere variants, either in predisposition to deletions within D4Z4 or in the pathological consequence of the deletion. Comparative mapping studies of the FSHD region in primates, mouse and Fugu rubripes have given insights into the evolutionary history of the D4Z4 repeat and of 4qter, although as yet they have not provided any solutions to the FSHD puzzle.

19.
Abstract

Molecular sequence data have become prominent tools for phylogenetic relationship inference, particularly useful in the analysis of highly diverse taxonomic orders. Ribosomal RNA sequences provide markers that can be used in the study of phylogeny, because their function and structure have been conserved to a large extent throughout the evolutionary history of organisms. These sequences are inferred from cloned or enzymatically amplified gene sequences, or determined by direct RNA sequencing. The first step of the phylogenetic interpretation of nucleic acid sequence variations implies proper alignment of corresponding sequences from various organisms. Best alignment based on similarity criteria is greatly reinforced, in the case of ribosomal RNAs, by secondary structure homologies. Distance matrix methods to infer evolutionary trees are based on the assumption that the phylogenetic distance between each pair of organisms is proportional to the number of nucleotide substitution events. Computed tree inference methods usually take into consideration the possibility of unequal mutation rates among lineages. Divergence times can be estimated on the tree, provided that at least one lineage has been dated by fossil records. We have utilized this approach based on ribosomal RNA sequence comparison to investigate the phylogenetic relationship between dinoflagellates and other eukaryote protists, and to refine controversial phylogenies of the class Dinophyceae.

20.
Exact and heuristic algorithms for the Indel Maximum Likelihood Problem.   Cited by: 1 (self-citations: 0; citations by others: 1)
Given a multiple alignment of orthologous DNA sequences and a phylogenetic tree for these sequences, we investigate the problem of reconstructing the most likely scenario of insertions and deletions capable of explaining the gaps observed in the alignment. This problem, which we call the Indel Maximum Likelihood Problem (IMLP), is an important step toward the reconstruction of ancestral genomic sequences, and is important for studying evolutionary processes, genome function, adaptation and convergence. We solve the IMLP using a new type of tree hidden Markov model whose states correspond to single-base evolutionary scenarios and where transitions model dependencies between neighboring columns. The standard Viterbi and Forward-backward algorithms are optimized to produce the most likely ancestral reconstruction and to compute the level of confidence associated with specific regions of the reconstruction. A heuristic is presented to make the method practical for large data sets, while retaining an extremely high degree of accuracy. The methods are illustrated on a 1-Mb alignment of the CFTR regions from 12 mammals.
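
The standard Viterbi algorithm that this abstract builds on can be sketched for an ordinary HMM as follows. It is not the tree-HMM of the paper: the two states, the per-column symbols and all probabilities in the usage example are hypothetical, and the sketch assumes a non-empty observation sequence and strictly positive probabilities so that logarithms are defined.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty observation sequence (log space)."""
    # best[t][s] = log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s] = predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy example: label alignment columns ("x" = residue, "-" = gap) as base or gap regions.
states = ("base", "gap")
start_p = {"base": 0.5, "gap": 0.5}
trans_p = {"base": {"base": 0.8, "gap": 0.2}, "gap": {"base": 0.2, "gap": 0.8}}
emit_p = {"base": {"x": 0.9, "-": 0.1}, "gap": {"x": 0.1, "-": 0.9}}
print(viterbi("xxx---", states, start_p, trans_p, emit_p))
```

In the abstract's setting the states would instead be single-base evolutionary scenarios on the tree, with transitions capturing the dependence between neighboring alignment columns; the recurrence itself keeps the same shape.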
