Similar articles (20 found)
1.

Background

Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics. As a result one relies on crude measures, like the p- or log-det distances, or makes explicit, and often too simplistic, a priori assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI). MI is, in principle, an objective and model independent similarity measure, but it is not widely used in this context and no algorithm for extracting MI from a given alignment (without assuming an evolutionary model) is known. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, despite the fact that the normalized compression distance based on it has shown promising results.

Results

We describe a simple approach to get robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to estimates obtained from the alignment free methods mentioned above. We point out that, due to the fact that it is not additive, normalized compression distance is not an optimal metric for phylogenetics but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances.

Conclusions

Several versions of MI based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.
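As a rough illustration of the simplified, single-letter Shannon variant mentioned above, the sketch below computes a plug-in mutual information estimate (in bits per aligned column) from the letter-pair frequencies of a global pairwise alignment. The function name, the gap symbol, and the choice to skip gapped columns are assumptions made for this example; the abstract does not specify them, and the authors' Kolmogorov-based estimator and their distance definitions are not reproduced here.

```python
import math
from collections import Counter


def alignment_mutual_information(seq_a, seq_b, gap="-"):
    """Plug-in Shannon MI (bits per column) between two aligned sequences.

    Hypothetical helper: columns containing a gap are skipped, which is
    an assumption of this sketch rather than something the abstract states.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                  # counts of aligned letter pairs
    left = Counter(a for a, _ in pairs)     # single-letter counts, sequence A
    right = Counter(b for _, b in pairs)    # single-letter counts, sequence B
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) written with raw counts: count * n / (n_a * n_b)
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi
```

Two identical sequences with uniform base composition give 2 bits per column, and statistically independent columns give 0. Plug-in estimates of this kind are biased upward for short alignments, so any real use would need a correction that this sketch omits.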

2.
Contact-based sequence alignment   (total citations: 2; self-citations: 1; citations by others: 1)
This paper introduces the novel method of contact-based protein sequence alignment, where structural information in the form of contact mutation probabilities is incorporated into an alignment routine using contact-mutation matrices (CAO: Contact Accepted mutatiOn). The contact-based alignment routine optimizes the score of matched contacts, which involves four (two per contact) instead of two residues per match in pairwise alignments. The first contact refers to a real side-chain contact in a template sequence with known structure, and the second contact is the equivalent putative contact of a homologous query sequence with unknown structure. An algorithm has been devised to perform a pairwise sequence alignment based on contact information. The contact scores were combined with PAM-type (Point Accepted Mutation) substitution scores after parameterization of gap penalties and score weights by means of a genetic algorithm. We show that owing to the structural information contained in the CAO matrices, significantly improved alignments of distantly related sequences can be obtained. This has allowed us to annotate eight putative Drosophila IGF sequences. Contact-based sequence alignment should therefore prove useful in comparative modelling and fold recognition.

3.
4.
Mutation rates are of key importance for understanding evolutionary processes and predicting their outcomes. Empirical mutation rate estimates are available for a number of RNA viruses, but few are available for DNA viruses, which tend to have larger genomes. Whilst some viruses have very high mutation rates, lower mutation rates are expected for viruses with large genomes to ensure genome integrity. Alphabaculoviruses are insect viruses with large genomes and often have high levels of polymorphism, suggesting high mutation rates despite evidence of proofreading activity by the replication machinery. Here, we report an empirical estimate of the mutation rate per base per strand copying (s/n/r) of Autographa californica multiple nucleopolyhedrovirus (AcMNPV). To avoid biases due to selection, we analyzed mutations that occurred in a stable, non-functional genomic insert after five serial passages in Spodoptera exigua larvae. Our results highlight that viral demography and the stringency of mutation calling affect mutation rate estimates, and that using a population genetic simulation model to make inferences can mitigate the impact of these processes on estimates of mutation rate. We estimated a mutation rate of μ = 1×10^−7 s/n/r when applying the most stringent criteria for mutation calling, and estimates of up to μ = 5×10^−7 s/n/r when relaxing these criteria. The rates at which different classes of mutations accumulate provide good evidence for neutrality of mutations occurring within the inserted region. We therefore present a robust approach for mutation rate estimation for viruses with stable genomes, and strong evidence of a much lower alphabaculovirus mutation rate than supposed based on the high levels of polymorphism observed.

5.
Multiple osteochondromas (MO) is an inherited skeletal disorder, and the molecular mechanism of MO remains elusive. Exome sequencing has high chromosomal coverage and accuracy, and has recently been successfully used to identify pathogenic gene mutations. In this study, exome sequencing followed by Sanger sequencing validation was first used to screen gene mutations in two representative MO patients from a Chinese family. After filtering the data against the 1000 Genomes Project and the dbSNP database (build 132), the detected candidate gene mutations were further validated via Sanger sequencing of four other members of the same MO family and 200 unrelated healthy subjects. Immunohistochemistry and multiple sequence alignment were performed to evaluate the importance of the identified causal mutation. A novel frameshift mutation, c.1457insG at codon 486 of exon 6 of the EXT1 gene, was identified, which truncated the glycosyltransferase domain of EXT1. Multiple sequence alignment showed that codon 486 of the EXT1 gene was highly conserved across various vertebrates. Immunohistochemistry demonstrated that chondrocytes with functional EXT1 were fewer in MO than in extragenetic solitary chondromas. The novel deleterious c.1457insG mutation of the EXT1 gene reported in this study expands the causal mutation spectrum of MO, and may be helpful for prenatal genetic screening and early diagnosis of MO.

6.
Characteristics of the new phenotypic variation introduced via mutation have broad implications in evolutionary and medical genetics. Standardized estimates of this mutational variance, VM, span 2 orders of magnitude, but the causes of this remain poorly resolved. We investigated estimate heterogeneity using 2 approaches. First, meta-analyses of ∼150 estimates of standardized VM from 37 mutation accumulation studies did not support a difference among taxa (which differ in mutation rate) but provided equivocal support for differences among trait types (life history vs morphology, predicted to differ in mutation rate). Notably, several experimental factors were confounded with taxon and trait, and further empirical data are required to resolve their influences. Second, we analyzed morphological data from an experiment in Drosophila serrata to determine the potential for unintentional heterogeneity among environments in which phenotypes were measured (i.e. among laboratories or time points) or transient segregation of mutations within mutation accumulation lines to affect standardized VM. Approximating the size of an average mutation accumulation experiment, variability among repeated estimates of (accumulated) mutational variance was comparable to variation among published estimates of standardized VM. This heterogeneity was (partially) attributable to unintended environmental variation or within line segregation of mutations only for wing size, not wing shape traits. We conclude that sampling error contributed substantial variation within this experiment, and infer that it will also contribute substantially to differences among published estimates. We suggest a logistically permissive approach to improve the precision of estimates, and consequently our understanding of the dynamics of mutational variance of quantitative traits.

7.
Benchmarking tools for the alignment of functional noncoding DNA   (total citations: 1; self-citations: 0; citations by others: 1)

Background

Numerous tools have been developed to align genomic sequences. However, their relative performance in specific applications remains poorly characterized. Alignments of protein-coding sequences typically have been benchmarked against "correct" alignments inferred from structural data. For noncoding sequences, where such independent validation is lacking, simulation provides an effective means to generate "correct" alignments with which to benchmark alignment tools.

Results

Using rates of noncoding sequence evolution estimated from the genus Drosophila, we simulated alignments over a range of divergence times under varying models incorporating point substitution, insertion/deletion events, and short blocks of constrained sequences such as those found in cis-regulatory regions. We then compared "correct" alignments generated by a modified version of the ROSE simulation platform to alignments of the simulated derived sequences produced by eight pairwise alignment tools (Avid, BlastZ, Chaos, ClustalW, DiAlign, Lagan, Needle, and WABA) to determine the off-the-shelf performance of each tool. As expected, the ability to align noncoding sequences accurately decreases with increasing divergence for all tools, and declines faster in the presence of insertion/deletion evolution. Global alignment tools (Avid, ClustalW, Lagan, and Needle) typically have higher sensitivity over entire noncoding sequences as well as in constrained sequences. Local tools (BlastZ, Chaos, and WABA) have lower overall sensitivity as a consequence of incomplete coverage, but have high specificity to detect constrained sequences as well as high sensitivity within the subset of sequences they align. Tools such as DiAlign, which generate both local and global outputs, produce alignments of constrained sequences with both high sensitivity and specificity for divergence distances in the range of 1.25–3.0 substitutions per site.

Conclusion

For species with genomic properties similar to Drosophila, we conclude that a single pair of optimally diverged species analyzed with a high performance alignment tool can yield accurate and specific alignments of functionally constrained noncoding sequences. Further algorithm development, optimization of alignment parameters, and benchmarking studies will be necessary to extract the maximal biological information from alignments of functional noncoding DNA.

8.
There is a lack of information on how individual microsatellite loci differ with respect to their mutation properties. Such variation will have an important bearing on our understanding of the ubiquitous occurrence of simple repeat sequences in eukaryotic genomes and on deriving proper mutation models that can be incorporated into genetic distance estimates. We genotyped ~100 families of the bird barn swallow (Hirundo rustica) for two hypervariable (heterozygosity >95%) microsatellite markers: HrU6, an (AAAG)n tetranucleotide repeat, and HrU10, an (AAGAG)n pentanucleotide repeat. A total of 27 germline mutation events were documented, corresponding to mutation rates of 0.57% (HrU6) and 1.56% (HrU10). The mutation rate increased with allele size, at ~0.1% per repeat unit over the observed range of allele sizes (~10–100 repeat units). Single repeat unit changes dominated, with 21/27 mutations representing the gain or loss of one repeat unit. There was no clear difference in the number of gains versus losses nor was there an effect of allele size on the magnitude or direction of mutation. Unexpectedly, the mutation rate of females (maternally transmitted mutations) was 2.5–5 times higher than that of males. Contrasting these observations with mutation data from other microsatellite loci reveals differences not only in the mutation rate, but also in the magnitude, direction and effect of sex on mutation. Thus, microsatellite mutation and evolution may be viewed as a dynamic and variable process.

9.
Hearing loss (HL) is a common disorder with mitochondrial dysfunction as one of the major causes leading to deafness. Mitochondrial dysfunction may be caused by either mutations in nuclear genes leading to defective nuclear-encoded proteins or mutations in mitochondrial genes leading to defective mitochondrial-encoded products. The specific nuclear genes involved in HL can be classified into two categories depending on whether mitochondrial gene mutations co-exist (modifier genes) or not (deafness-causing genes). TFB1M, MTO1, GTPBP3, and TRMU are modifier genes. A mutation in any of these modifier genes may lead to a deafness phenotype when accompanied by the mitochondrial gene mutation. OPA1, TIMM8A, SMAC/DIABLO, MPV17, PDSS1, BCS1L, SUCLA2, C10ORF2, COX10, PLOG1 and RRM2B are deafness-causing genes. A mutation in any of these deafness-causing genes will directly induce variable phenotypic HL.

10.
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments.
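The credibility-limit definition above can be made concrete with a small sketch. It assumes that each alignment is encoded as a fixed-length 0/1 vector (for example, one indicator per candidate residue pair) and that the posterior is represented by equally weighted samples; both the encoding and the function names are assumptions of this example, not details given in the abstract.

```python
import math


def hamming(x, y):
    """Number of positions at which two equal-length encodings differ."""
    return sum(a != b for a, b in zip(x, y))


def credibility_limit(estimate, posterior_samples, alpha=0.05):
    """Smallest Hamming radius around `estimate` that contains at least
    a (1 - alpha) fraction of the posterior samples.

    Hypothetical helper: samples are treated as equally weighted draws
    from the posterior distribution over alignments.
    """
    if not posterior_samples:
        raise ValueError("need at least one posterior sample")
    distances = sorted(hamming(estimate, s) for s in posterior_samples)
    # Rounding guards against floating-point noise before taking the ceiling.
    needed = max(1, math.ceil(round((1 - alpha) * len(distances), 9)))
    return distances[needed - 1]
```

Comparing the value returned for a centroid estimate with the value returned for a maximum similarity estimate on the same samples gives the kind of comparison the abstract describes: a smaller radius means a tighter credibility limit.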

11.
Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the -style estimators.

12.
Alignment ambiguity is a widespread problem in molecular evolutionary studies that has received insufficient attention. Most studies ignore such regions by deleting them before analyses, even though alignment-ambiguous regions can contain useful phylogenetic and evolutionary information. The alignment ambiguity might affect only one taxon, the region being readily alignable and phylogenetically informative across all other taxa. Alternatively, all possible alignments can consistently imply certain relationships. Because they are usually the most rapidly evolving regions, alignment-ambiguous regions might be those that are most able to resolve closely spaced divergences and contribute to estimates of branch lengths, evolutionary rates and divergence times. Three methods to incorporate such regions into phylogenetic and evolutionary analyses have been devised. The multiple analysis method evaluates each plausible alignment separately and seeks areas of congruence among the resultant trees, whereas the elision method combines all plausible alignments into a single analysis. Fragment-level alignment (= fixed states, INAASE) treats the entire unalignable section as a single but highly complex multistate character. Although these methods still need refining, they are preferable to discarding large portions of hard-earned and potentially informative sequence data.

13.
Knowledge of the rate and fitness effects of mutations is essential for understanding the process of evolution. Mutations are inherently difficult to study because they are rare and are frequently eliminated by natural selection. In the ciliate Tetrahymena thermophila, mutations can accumulate in the germline genome without being exposed to selection. We have conducted a mutation accumulation (MA) experiment in this species. Assuming that all mutations are deleterious and have the same effect, we estimate that the deleterious mutation rate per haploid germline genome per generation is U = 0.0047 (95% credible interval: 0.0015, 0.0125), and that germline mutations decrease fitness by s = 11% when expressed in a homozygous state (95% CI: 4.4%, 27%). We also estimate that deleterious mutations are partially recessive on average (h = 0.26; 95% CI: –0.022, 0.62) and that the rate of lethal mutations is <10% of the deleterious mutation rate. Comparisons between the observed evolutionary responses in the germline and somatic genomes and the results from individual-based simulations of MA suggest that the two genomes have similar mutational parameters. These are the first estimates of the deleterious mutation rate and fitness effects from the eukaryotic supergroup Chromalveolata and are within the range of those of other eukaryotes.

14.
The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous mutations over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level and it is for this purpose we have constructed the program RevTrans. RevTrans constructs a multiple DNA alignment by: (i) translating the DNA; (ii) aligning the resulting peptide sequences; and (iii) building a multiple DNA alignment by 'reverse translation' of the aligned protein sequences. In the resulting DNA alignment, gaps occur in groups of three corresponding to entire codons, and analogous codon positions are therefore always lined up. These features are useful when constructing multiple DNA alignments for phylogenetic analysis. RevTrans also accepts user-provided protein alignments for greater control of the alignment process. The RevTrans web server is freely available at http://www.cbs.dtu.dk/services/RevTrans/.
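Step (iii) above, the 'reverse translation' of an aligned protein back onto its coding DNA, can be sketched in a few lines. This is an illustrative reimplementation of the idea, not RevTrans's own code; the function name and the assumption that the DNA is in frame, has no trailing stop codon, and matches the protein residue for residue are choices made for this example. Steps (i) and (ii) are assumed to have been done already.

```python
def reverse_translate_alignment(aligned_protein, dna, gap="-"):
    """Thread the codons of `dna` onto a gapped protein sequence.

    Hypothetical helper: each residue consumes the next codon and each
    protein gap becomes a three-character gap in the DNA alignment.
    """
    residues = sum(1 for c in aligned_protein if c != gap)
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the residue count")
    # Split the ungapped DNA into consecutive codons.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(gap * 3 if c == gap else next(codons)
                   for c in aligned_protein)
```

Applying the function to every row of a protein alignment, each paired with its own DNA sequence, yields a DNA alignment in which gaps come in groups of three and codon positions stay lined up, as described above.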

15.
Induction of back mutations to prototrophy by methylene blue (MB)-sensitized photodynamic (PD) treatment has been studied in wild-type and repair-deficient strains of Salmonella typhimurium carrying either the base-pair substitution mutation hisG46 or the frameshift mutation hisD3052. We found that reversion of the hisG46 mutation was increased in a strain carrying a uvrB deletion and decreased in a strain carrying a recA-type mutation. Reversion of the hisD3052 (frameshift) mutation, on the other hand, was decreased in both uvrB deletion and recA-type strains. The former results are consistent with the hypothesis that the majority of MB-sensitized PD-induced base-pair substitution mutations arise by a mechanism similar to that currently believed to be involved in UV mutagenesis. The latter results suggest that PD-induced frameshift mutations may arise in some other way, and two possible mechanisms involving sequential action of the excision repair and recombinational repair pathways are considered.

16.
The presence of a TP53 gene mutation can influence tumour response to some treatments, especially in breast cancer. In this study, we analysed p53 mRNA expression, LOH at 17p13 and TP53 mutations from exons 2 to 11 in 206 patients with breast carcinoma and correlated the results with disease-free and overall survival. The observed mutations were classified according to their type and location in the three protein domains (transactivation domain, DNA binding domain, oligomerization domain) and correlated with disease-free and overall survival. In our population, neither p53 mRNA expression nor LOH correlated with outcome. Concerning TP53 mutations, 27% of tumours were mutated (53/197) and the presence of a mutation in the TP53 gene was associated with worse overall survival (p = 0.0026) but not with disease-free survival (p = 0.0697), with median survival of 80 months and 78 months, respectively. When alterations were segregated into mutation categories and locations, and related to survival, tumours harbouring mutations other than missense mutations in the DNA binding domain of P53 had the same survival profiles as wild-type tumours. Concerning missense mutations in the DNA binding domain, median disease-free and overall survival was 23 months and 35 months, respectively (p = 0.0021 and p<0.0001, respectively), compared with 78 and 80 months in mutated tumours overall. This work shows that disease-free and overall survival in patients with a frameshift mutation of TP53 or missense mutation in the oligomerization domain are the same as those in wild-type TP53 patients.

17.
Sequence alignment profiles have been shown to be very powerful in creating accurate sequence alignments. Profiles are often used to search a sequence database with a local alignment algorithm. More accurate and longer alignments have been obtained with profile-to-profile comparison. There are several steps that must be performed in creating profile-profile alignments, and each involves choices in parameters and algorithms. These steps include (1) what sequences to include in a multiple alignment used to build each profile, (2) how to weight similar sequences in the multiple alignment and how to determine amino acid frequencies from the weighted alignment, (3) how to score a column from one profile aligned to a column of the other profile, (4) how to score gaps in the profile-profile alignment, and (5) how to include structural information. Large-scale benchmarks consisting of pairs of homologous proteins with structurally determined sequence alignments are necessary for evaluating the efficacy of each scoring scheme. With such a benchmark, we have investigated the properties of profile-profile alignments and found that (1) with optimized gap penalties, most column-column scoring functions behave similarly to one another in alignment accuracy; (2) some functions, however, have much higher search sensitivity and specificity; (3) position-specific weighting schemes in determining amino acid counts in columns of multiple sequence alignments are better than sequence-specific schemes; (4) removing positions in the profile with gaps in the query sequence results in better alignments; and (5) adding predicted and known secondary structure information improves alignments.

18.
Gap costs for multiple sequence alignment   (total citations: 6; self-citations: 0; citations by others: 6)
Standard methods for aligning pairs of biological sequences charge for the most common mutations, which are substitutions, deletions and insertions. Because a single mutation may insert or delete several nucleotides, gap costs that are not directly proportional to gap length are usually the most effective. How to extend such gap costs to alignments of three or more sequences is not immediately obvious, and a variety of approaches have been taken. This paper argues that, since gap and substitution costs together specify optimal alignments, they should be defined using a common rationale. Specifically, a new definition of gap costs for multiple alignments is proposed and compared with previous ones. Since the new definition links a multiple alignment's cost to that of its pairwise projections, it allows knowledge gained about two-sequence alignments to bear on the multiple alignment problem. Also, such linkage is a key element of recent algorithms that have rendered practical the simultaneous alignment of as many as six sequences.
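One way to picture the link between a multiple alignment's cost and its pairwise projections is the sketch below. It scores each projected pair with an affine gap cost (an opening charge plus a per-position charge, so cost is not directly proportional to gap length) and sums over all pairs. The cost values, the function names, and the plain sum over pairs are assumptions chosen for illustration; the abstract does not give the paper's exact definition, and this sketch should not be read as that definition.

```python
from itertools import combinations


def project(row_a, row_b, gap="-"):
    """Pairwise projection: drop columns in which both rows hold a gap."""
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def pairwise_cost(row_a, row_b, mismatch=1, gap_open=2, gap_extend=1, gap="-"):
    """Cost of a two-row alignment; a gap of length k costs
    gap_open + gap_extend * k (illustrative values, not from the paper)."""
    cost = 0
    prev_state = None  # row that carried a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            state = "a" if a == gap else "b"
            if state != prev_state:
                cost += gap_open  # a new gap starts in this column
            cost += gap_extend
        else:
            state = None
            if a != b:
                cost += mismatch
        prev_state = state
    return cost


def sum_of_pairs_cost(rows, **costs):
    """Sum the pairwise costs of all projections of a multiple alignment."""
    return sum(pairwise_cost(*project(a, b), **costs)
               for a, b in combinations(rows, 2))
```

For the rows `AC--GT`, `ACTTGT` and `AC--GA`, the shared gap columns vanish when the first and third rows are projected, so that pair is charged only for its final mismatch, while each pair involving the second row pays for one gap of length two.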

19.

Background

Several founder mutations leading to increased risk of cancer among Ashkenazi Jewish individuals have been identified, and some estimates of the age of the mutations have been published. A variety of different methods have been used previously to estimate the age of the mutations. Here three datasets containing genotype information near known founder mutations are reanalyzed in order to compare three approaches for estimating the age of a mutation. The methods are: (a) the single marker method used by Risch et al. (1995); (b) the intra-allelic coalescent model known as DMLE; and (c) the Goldgar method proposed in Neuhausen et al. (1996), and modified slightly by our group. The three mutations analyzed were MSH2*1906 G->C, APC*I1307K, and BRCA2*6174delT.

Results

All methods depend on accurate estimates of inter-marker recombination rates. The modified Goldgar method allows for marker mutation as well as recombination, but requires prior estimates of the possible haplotypes carrying the mutation for each individual. It does not incorporate population growth rates. The DMLE method simultaneously estimates the haplotypes with the mutation age, and builds in the population growth rate. The single marker estimates, however, are more sensitive to the recombination rates and are unstable. Mutation age estimates based on DMLE are 16.8 generations for MSH2 (95% credible interval (13, 23)), 106 generations for I1307K (86-129), and 90 generations for 6174delT (71-114).

Conclusions

For recent founder mutations where marker mutations are unlikely to have occurred, both DMLE and the Goldgar method can give good results. Caution is necessary for older mutations, especially if the effective population size may have remained small for a long period of time.

20.

Background

Protein sequence profile-profile alignment is an important approach to recognizing remote homologs and generating accurate pairwise alignments. It plays an important role in protein sequence database search, protein structure prediction, protein function prediction, and phylogenetic analysis.

Results

In this work, we integrate predicted solvent accessibility, torsion angles and evolutionary residue coupling information with the pairwise Hidden Markov Model (HMM) based profile alignment method to improve profile-profile alignments. The evaluation results demonstrate that adding predicted relative solvent accessibility and torsion angle information improves the accuracy of profile-profile alignments. The evolutionary residue coupling information is helpful in some cases, but its contribution to the improvement is not consistent.

Conclusion

Incorporating the new structural information such as predicted solvent accessibility and torsion angles into the profile-profile alignment is a useful way to improve pairwise profile-profile alignment methods.
