Similar literature
1.
The selective pressure at the protein level is usually measured by the nonsynonymous/synonymous rate ratio (omega = dN/dS), with omega < 1, omega = 1, and omega > 1 indicating purifying (or negative) selection, neutral evolution, and diversifying (or positive) selection, respectively. The omega ratio is commonly calculated as an average over sites. As every functional protein has some amino acid sites under selective constraints, averaging rates across sites leads to low power to detect positive selection. Recently developed models of codon substitution allow the omega ratio to vary among sites and appear to be powerful in detecting positive selection in empirical data analysis. In this study, we used computer simulation to investigate the accuracy and power of the likelihood ratio test (LRT) in detecting positive selection at amino acid sites. The test compares two nested models: one that allows for sites under positive selection (with omega > 1), and another that does not, with the chi-square distribution used for significance testing. We found that use of the chi-square distribution makes the test conservative, especially when the data contain very short and highly similar sequences. Nevertheless, the LRT is powerful. Although the power can be low with only 5 or 6 sequences in the data, it was nearly 100% in data sets of 17 sequences. Sequence length, sequence divergence, and the strength of positive selection were also found to affect the power of the LRT. The exact distribution assumed for the omega ratio over sites was found not to affect the effectiveness of the LRT.
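
The likelihood ratio test described in this abstract can be sketched in a few lines. The sketch below is illustrative only: it assumes the two nested models differ by two free parameters (the abstract does not state the degrees of freedom), and the function name and log-likelihood values are hypothetical. With 2 degrees of freedom the chi-square tail probability has the closed form exp(-x/2), so no external library is needed.

```python
import math


def lrt_pvalue_df2(lnl_null, lnl_alt):
    """Likelihood ratio test for two nested models.

    Assumption (not from the abstract): the models differ by 2 free
    parameters, so the statistic is compared with chi-square, 2 d.f.
    Returns (test statistic, p-value).
    """
    # Test statistic: twice the log-likelihood difference, floored at 0
    # to guard against small negative values from numerical optimisation.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of chi-square with 2 d.f. is exactly exp(-x/2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical log-likelihoods for the null and positive-selection models.
stat, p = lrt_pvalue_df2(lnl_null=-1000.0, lnl_alt=-995.0)
print(f"2*delta(lnL) = {stat:.2f}, p = {p:.4f}")
```

Because the abstract reports that the chi-square approximation is conservative, a p-value computed this way would, if anything, understate the evidence for positive selection.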

2.
The w statistic introduced by Lockhart et al. (1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol. 15:1183-1188) is a simple and easily calculated statistic intended to detect heterotachy by comparing amino acid substitution patterns between two monophyletic groups of protein sequences. It is defined as the difference between the fraction of varied sites in both groups and the fraction of varied sites in each group. The w test has been used to distinguish a covarion process from equal-rates and rate-variation-across-sites processes. Using simulation, we show that the w test is effective for small data sets and for data sets that have low substitution rates in the groups but can have difficulties when these conditions are not met. Using site entropy as a measure of variability of a sequence site, we modify the w statistic to a w' statistic by assigning as varied in one group those sites that are actually varied in both groups but have a large entropy difference. We show that the w' test has more power to detect two kinds of heterotachy processes (covarion and bivariate rate shifts) in large and variable data. We also show that a test of Pearson's correlation of the site entropies between two monophyletic groups can be used to detect heterotachy and has more power than the w' test. Furthermore, we demonstrate that there are settings where the correlation test as well as the w and w' tests do not detect heterotachy signals in data simulated under a branch length mixture model. In such cases, it is sometimes possible to detect heterotachy through subselection of appropriate taxa. Finally, we discuss the abilities of the three statistical tests to detect a fourth mode of heterotachy: lineage-specific changes in the proportion of variable sites.
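
The entropy-correlation idea mentioned in this abstract can be illustrated as follows. This is a minimal sketch, not the authors' test: it computes the Shannon entropy of each alignment column separately in two groups and then Pearson's correlation between the two entropy vectors. The function names and the toy alignments are hypothetical, and no significance test is included.

```python
import math
from collections import Counter


def site_entropies(alignment):
    """Shannon entropy (in bits) of each column of one aligned group."""
    entropies = []
    for column in zip(*alignment):
        total = len(column)
        counts = Counter(column)
        entropies.append(
            -sum((n / total) * math.log2(n / total) for n in counts.values())
        )
    return entropies


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined when a group has constant entropy")
    return cov / math.sqrt(var_x * var_y)


# Toy groups: A and B vary at the same site, C varies at a different one.
group_a = ["AAC", "AGC", "ATC"]
group_b = ["AAC", "AGC", "AAC"]
group_c = ["AAC", "CAC", "GAC"]
print(pearson(site_entropies(group_a), site_entropies(group_b)))
print(pearson(site_entropies(group_a), site_entropies(group_c)))
```

A low or negative correlation between groups is what one would expect when sites that vary in one group are conserved in the other.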

3.
A software program, CRANN, has been developed to detect adaptive evolution in protein-coding DNA sequences.

4.
The rapid accumulation of genomic sequences in public databases will finally allow large scale studies of gene family evolution, including evaluation of the role of positive Darwinian selection following a duplication event. This will be possible because recent statistical methods of comparing synonymous and nonsynonymous substitution rates permit reliable detection of positive selection at individual amino acid sites and along evolutionary lineages. Here, we summarize maximum-likelihood based methods, and present a framework for their application to analysis of gene families. Using these methods, we investigated the role of positive Darwinian selection in the ECP-EDN gene family of primates and the Troponin C gene family of vertebrates. We also comment on the limitations of these methods and discuss directions for further improvements.

5.
Evidence of positively selected sites in mammalian alpha-defensins   Total citations: 8, self-citations: 0, citations by others: 8
Alpha-defensins are a family of mammalian antimicrobial peptides that exhibit variable activity against a panel of microbes, including bacteria, fungi, and enveloped viruses. We have employed a maximum-likelihood approach to detect evidence of positive selection (adaptive evolution) in the evolution of these important molecules of the innate immune response. We have identified 14 amino acid sites that are predicted to be subject to positive selection. Furthermore, we show that all these sites are located in the mature antimicrobial peptide and not in the prepropeptide region of the molecule, implying that they are of functional importance. These results suggest that mammalian alpha-defensins have been under selective pressure to evolve in response to potentially infectious challenges by fast-evolving microbes.

6.
R Nielsen, Z Yang. Genetics 1998, 148(3):929-936
Several codon-based models for the evolution of protein-coding DNA sequences are developed that account for varying selection intensity among amino acid sites. The "neutral model" assumes two categories of sites at which amino acid replacements are either neutral or deleterious. The "positive-selection model" assumes an additional category of positively selected sites at which nonsynonymous substitutions occur at a higher rate than synonymous ones. This model is also used to identify target sites for positive selection. The models are applied to a data set of the V3 region of the HIV-1 envelope gene, sequenced at different years after the infection of one patient. The results provide strong support for variable selection intensity among amino acid sites. The neutral model is rejected in favor of the positive-selection model, indicating the operation of positive selection in the region. Positively selected sites are found in both the V3 region and the flanking regions.

7.
This paper describes a computer method that uses codon preference to help find protein coding regions in long DNA sequences. The method can distinguish between introns and exons and can help to detect sequencing errors.

8.
Polygalacturonase inhibitor proteins (PGIPs) protect plants against invasion by diverse microbial and invertebrate enemies that use polygalacturonase (PG) to breach the plant cell wall. Directed mutagenesis has identified specific natural mutations conferring novel defensive capability in green bean PGIP against a specific fungal PG. These same sites are identified as positively selected by phylogenetic codon-substitution models, demonstrating the utility of such models for connecting retrospective comparative analyses with contemporary, ecologically relevant variation.

9.
MOTIVATION: At present, computational gene identification methods for microbial genomes predict verified translation termination sites (3' end) with high accuracy, but translation initiation sites (TIS, 5' end) with much lower accuracy. The latter is important for analyzing and understanding the putative protein of a gene and the regulatory machinery of translation. Improving the accuracy of TIS prediction is one of the remaining open problems. RESULTS: In this paper, we develop a four-component statistical model to describe the TIS of prokaryotic genes. The model incorporates several biologically meaningful features: the correlation between the translation termination site and the TIS of genes, the sequence content around the start codon, the sequence content of the consensus signal related to ribosomal binding sites (RBSs), and the correlation between the TIS and the upstream consensus signal. An entirely unsupervised training system is constructed, which takes as input a set of coding open reading frames (ORFs) annotated by any gene finder and gives as output a set of organism-specific parameters (without any prior knowledge or empirical constants and formulas). The novel algorithm is tested on a set of reliable datasets of genes from Escherichia coli and Bacillus subtilis. MED-Start correctly predicts 95.4% of the start sites of 195 experimentally confirmed E. coli genes and 96.6% of 58 reliable B. subtilis genes. Moreover, the test results indicate that the algorithm gives higher accuracy for more reliable datasets and is robust to variation in gene length. MED-Start may be used as a postprocessor for a gene finder. After processing by our program, the improvement in gene start prediction is remarkable: for example, the accuracy of TIS predicted by MED 1.0 increases from 61.7 to 91.5% for 854 verified E. coli genes, while that of GLIMMER 2.02 increases from 63.2 to 92.0% for the same dataset. These results show that our algorithm is one of the most accurate methods for identifying the TIS of prokaryotic genomes. AVAILABILITY: The program MED-Start can be accessed through the website of CTB at Peking University: http://ctb.pku.edu.cn/main/SheGroup/MED_Start.htm.

10.
Functioning as an "address tag" or "zip code" that guides nascent proteins (newly synthesized proteins in the cytosol) to wherever they are needed, signal peptides (also called targeting signals or signal sequences) have become a crucial tool in finding new drugs or reprogramming cells for gene therapy. To use such a tool effectively and in a timely manner, however, the first step is to develop an automated method for quickly and accurately identifying the signal peptide of a given nascent protein. With the avalanche of new protein sequences generated in the post-genomic era, the challenge has become even more urgent and critical. In this paper, five statistical rulers were derived by performing a mutual information analysis. By combining these statistical rulers, a new prediction algorithm was established, and high prediction success rates were observed. The new algorithm may play a complementary role to the existing algorithms in this area. It is anticipated that the mutual information approach introduced here may also be very useful for studying many other sequence-coupling problems in molecular biology.

11.
We describe a new method for identifying the sequences that signal the start of translation, and the boundaries between exons and introns (donor and acceptor sites) in human mRNA. According to the mandatory keyword, ORGANISM, and feature key, CDS, a large set of standard data for each signal site was extracted from the ASCII flat file, gbpri.seq, in the GenBank release 108.0. This was used to generate the scoring matrices, which summarize the sequence information for each signal site. The scoring matrices take into account the independent nucleotide frequencies between adjacent bases in each position within the signal site regions, and the relative weight on each nucleotide in proportion to their probabilities in the known signal sites. Using a scoring scheme that is based on the nucleotide scoring matrices, the method has great sensitivity and specificity when used to locate signals in uncharacterized human genomic DNA. These matrices are especially effective at distinguishing true and false sites.
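
The general idea of a nucleotide scoring matrix can be sketched as below. This is a generic position weight matrix, not the scheme from the paper: the log-odds form, the pseudocount of 1, the uniform background of 0.25, and all function names and example sites are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a per-position log-odds matrix from aligned known signal sites.

    Assumptions: sites are equal-length strings over ACGT; the pseudocount
    and uniform background are illustrative choices.
    """
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix


def score_window(matrix, window):
    """Sum the position-specific scores of one candidate window."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_site(matrix, sequence):
    """Return (score, start index) of the highest-scoring window."""
    width = len(matrix)
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical training sites and a hypothetical sequence to scan.
known_sites = ["ATG", "ATG", "ATG", "ACG"]
pwm = build_log_odds_matrix(known_sites)
print(best_site(pwm, "CCCATGCCC"))
```

In practice a score threshold, chosen on held-out true and false sites, would separate candidate signals from background.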

12.
The aim of this paper is to give measurements indicative of the evolutionary stages of species. Two types of statistics of trinucleotides in coding regions are analysed for 27 species. The first is the codon space, the nucleotide ratio for each of the three codon positions. We apply principal component analysis to this space and extract two principal components that faithfully describe the original distribution of the codon space. The first principal component corresponds to the GC content. The second principal component classifies the species into three evolutionary groups: Archaea, Bacteria and Eukaryota. The second statistic is the real and theoretical frequency of amino acids. The real frequency of an amino acid in a coding sequence is its frequency in the translated protein. The theoretical frequency is the expected frequency calculated from the ratio of nucleotides. We introduce the discrepancy between these two frequencies as an index of non-randomness of nucleotides in the sequence. This index of non-randomness divides the species into two groups: eukaryotes having smaller non-randomness (i.e. being more random) and prokaryotes having higher non-randomness.
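
One way to read the "real" versus "theoretical" amino acid frequencies is sketched below. The abstract gives no formulas, so several choices here are assumptions: the theoretical frequency of a codon is taken as the product of the base frequencies at its three positions, stop codons are excluded and the rest renormalised, the standard genetic code is used, and the discrepancy is summarised as a sum of absolute differences. All names are hypothetical, and input is assumed to contain only A, C, G and T.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons(cds):
    """Split a coding sequence into complete triplets."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def position_base_freqs(cds):
    """Base frequencies at each of the three codon positions."""
    triplets = codons(cds)
    freqs = []
    for pos in range(3):
        counts = Counter(t[pos] for t in triplets)
        freqs.append({b: counts[b] / len(triplets) for b in BASES})
    return freqs


def expected_aa_freqs(pos_freqs):
    """Amino acid frequencies expected if codon positions were independent."""
    expected = Counter()
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            expected[aa] += (
                pos_freqs[0][codon[0]] * pos_freqs[1][codon[1]] * pos_freqs[2][codon[2]]
            )
    total = sum(expected.values())
    return {aa: p / total for aa, p in expected.items()}


def observed_aa_freqs(cds):
    """Amino acid frequencies in the translated sequence, stops excluded."""
    residues = [CODON_TABLE[c] for c in codons(cds) if CODON_TABLE[c] != "*"]
    return {aa: n / len(residues) for aa, n in Counter(residues).items()}


def discrepancy(observed, expected):
    """Illustrative index: sum of absolute frequency differences."""
    keys = set(observed) | set(expected)
    return sum(abs(observed.get(k, 0.0) - expected.get(k, 0.0)) for k in keys)


cds = "ATGCTGAAAGGCTAA"  # hypothetical coding sequence
print(discrepancy(observed_aa_freqs(cds), expected_aa_freqs(position_base_freqs(cds))))
```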

13.
The principle of heterotachy states that the substitution rate of sites in a gene can change through time. In this article, we propose a powerful statistical test to detect sites that evolve according to the process of heterotachy. We apply this test to an alignment of 1289 eukaryotic rRNA molecules to 1) determine how widespread the phenomenon of heterotachy is in ribosomal RNA, 2) test whether these heterotachous sites are nonrandomly distributed, that is, linked to secondary structure features of ribosomal RNA, and 3) determine the impact of heterotachous sites on the bootstrap support of monophyletic groupings. Our study revealed that with 21 monophyletic taxa, approximately two-thirds of the sites in the considered set of sequences are heterotachous. Although the detected heterotachous sites do not appear bound to specific structural features of the small subunit rRNA, their presence is shown to have a large beneficial influence on the bootstrap support of monophyletic groups. Using extensive testing, we show that this may not be due to heterotachy itself but merely due to the increased substitution rate at the detected heterotachous sites.

14.
We have developed a statistical method named MAP (mutagenesis assistant program) to equip protein engineers with a tool to develop promising directed evolution strategies by comparing 19 mutagenesis methods. Instead of conventional transition/transversion bias indicators as benchmarks for comparison, we propose to use three indicators based on the subset of amino acid substitutions generated on the protein level: (1) protein structure indicator; (2) amino acid diversity indicator with a codon diversity coefficient; and (3) chemical diversity indicator. A MAP analysis for a single nucleotide substitution was performed for four genes: (1) heme domain of cytochrome P450 BM-3 from Bacillus megaterium (EC 1.14.14.1); (2) glucose oxidase from Aspergillus niger (EC 1.1.3.4); (3) arylesterase from Pseudomonas fluorescens (EC 3.1.1.2); and (4) alcohol dehydrogenase from Saccharomyces cerevisiae (EC 1.1.1.1). Based on the MAP analysis of these four genes, 19 mutagenesis methods have been evaluated and criteria for an ideal mutagenesis method have been proposed. The statistical analysis showed that existing gene mutagenesis methods are limited and highly biased. An average amino acid substitution per residue of only 3.15-7.4 can be achieved with current random mutagenesis methods. For the four investigated gene sequences, an average fraction of amino acid substitutions of 0.5-7% results in stop codons and 4.5-23.9% in glycine or proline residues. An average fraction of 16.2-44.2% of the amino acid substitutions are preserved, and 45.6% (epPCR method) are chemically different. The diversity remains low even when applying a non-biased method: an average of seven amino acid substitutions per residue, 2.9-4.7% stop codons, 11.1-16% glycine/proline residues, 21-25.8% preserved amino acids, and 55.5% are amino acids with chemically different side-chains. Statistical information for each mutagenesis method can further be used to investigate the mutational spectra in protein regions regarded as important for the property of interest.

15.

Background  

Protein structure comparison is one of the most important problems in computational biology and plays a key role in protein structure prediction, fold family classification, motif finding, phylogenetic tree reconstruction and protein docking.

16.

Background  

It is believed that animal-to-human transmission of severe acute respiratory syndrome (SARS) coronavirus (CoV) is the cause of the SARS outbreak worldwide. The spike (S) protein is one of the best characterized proteins of SARS-CoV and plays a key role in SARS-CoV overcoming the species barrier and accomplishing interspecies transmission from animals to humans, suggesting that it may be the major target of selective pressure. However, the process of adaptive evolution of the S protein and the exact positively selected sites associated with this process remain unknown.

17.
The co-variance of amino acid positions within a multiple alignment of 294 protein kinases from mammals, plants, and bacteria was studied. Applying mutual information (MI), characteristic amino acid sites were identified that markedly discriminate between the different organisms. The relation between the surface accessibility of these sites in the 3D structure of a kinase and their MI content is studied. We extended the method to score a predicted phosphorylation site of this highly conserved catalytic protein kinase region. Based on this score, mammalian and plant protein kinases were grouped together, apart from the bacterial kinases. Thus, the presented method allows us to analyse putative phosphorylation sites in the context of their organism-specific origin.
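
Mutual information between two alignment columns, the quantity this abstract relies on, can be computed as in the sketch below. It shows the textbook definition only; the function names and toy columns are hypothetical, and no correction for small samples or gaps is applied, which a real analysis would likely need.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two equally long alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), count in count_xy.items():
        p_joint = count / n
        p_independent = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


def column(alignment, index):
    """Extract one column from a list of aligned sequences."""
    return [seq[index] for seq in alignment]


# Toy alignment: columns 0 and 1 co-vary perfectly, column 2 is independent of 0.
alignment = ["AGG", "AGT", "CTG", "CTT"]
print(mutual_information(column(alignment, 0), column(alignment, 1)))
print(mutual_information(column(alignment, 0), column(alignment, 2)))
```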

18.
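Distribution model of open reading frame lengths in genomes and genome evolution   Total citations: 3, self-citations: 1, citations by others: 2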
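The distribution of the number of open reading frames (ORFs) by length was analyzed in the genomes of 5 eukaryotes, 15 bacteria, and 10 archaea. The distributions were found to be similar across organisms and to show clear regularity. Fitting and comparing various distribution models showed that, for every organism, this distribution follows a Γ(α,β) distribution, which leads to the hypothesis that the number of ORFs as a function of length in a genome follows a Γ(α,β) distribution. Analysis of the fitted parameters for each genome revealed a clear correlation between the values of α and β and genome evolution. The evolutionary significance of α and β is discussed, and it is concluded that eukaryotes preferentially use long genes. Based on the Γ(α,β) distribution, the upper limit of the number of ORFs in the yeast genome is estimated to be 5870. The method is useful for studying genome evolution and for assessing the reliability of computationally predicted genes.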
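
A quick way to obtain rough Γ(α,β) parameters from a list of ORF lengths is the method of moments, sketched below. The abstract does not say how the fit was done or whether β is a scale or a rate, so both are assumptions here: β is treated as a scale parameter (mean = αβ, variance = αβ²), and the function name and example lengths are hypothetical.

```python
import statistics


def fit_gamma_moments(lengths):
    """Method-of-moments estimates (alpha, beta) for a gamma distribution.

    Assumption: beta is the scale parameter, so mean = alpha * beta and
    variance = alpha * beta**2.
    """
    mean = statistics.fmean(lengths)
    variance = statistics.pvariance(lengths)
    if variance == 0:
        raise ValueError("all lengths are identical; gamma fit is undefined")
    alpha = mean * mean / variance
    beta = variance / mean
    return alpha, beta


# Hypothetical ORF lengths in base pairs.
orf_lengths = [300, 450, 600, 900, 1200, 1500, 2400]
alpha, beta = fit_gamma_moments(orf_lengths)
print(f"alpha = {alpha:.3f}, beta = {beta:.1f}")
```

Maximum-likelihood fitting would generally be preferred for a real comparison of models; the moment estimates are mainly useful as starting values.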

19.
Creevey CJ, McInerney JO. Gene 2002, 300(1-2):43-51
Positive selection or adaptive evolution is thought to be responsible, at least some of the time, for the rapid accumulation of advantageous changes in protein-coding genes. The origin of new enzymatic functions, erection of barriers to heterospecific fertilization, and evasion of host response by pathogens, among other things, are thought to be instances of adaptive evolution. Detecting positive selection in protein-coding genes is fraught with difficulties. Saturation for sequence change, codon usage bias, ephemeral selection events and differential selective pressures on amino acids all contribute to the problem. A number of solutions have been proposed with varying degrees of success; however, they are either not accurate enough or prohibitively computationally intensive. We have developed a character-based method of identifying lineages that undergo positive selection. In our method we assess the possibility that for each internal branch of a phylogenetic tree an event occurred that subsequently gave rise to a greater number of replacement substitutions than might be expected. We classify these replacement substitutions into two categories – whether they subsequently became invariable or changed again in at least one descendant lineage. The former situation indicates that the new character state is under strong selection to preserve its new identity (directional selection), while the latter situation indicates that there is a persistent pressure to change identity (non-directional selection). The method is fast and accurate, easy to implement, sensitive to short-lived selection events and robust with respect to sampling density and proportion of sites under the influence of positive selection.

20.
Selection of oligonucleotide probes for protein coding sequences   Total citations: 7, self-citations: 0, citations by others: 7
MOTIVATION: Large arrays of oligonucleotide probes have become popular tools for analyzing RNA expression. However, to date most oligo collections contain poorly validated sequences or are biased toward untranslated regions (UTRs). Here we present a strategy for picking oligos for microarrays that focuses on a design universe consisting exclusively of protein coding regions. We describe the constraints in oligo design that are imposed by this strategy, as well as a software tool that allows the strategy to be applied broadly. RESULT: In this work we sequentially apply a variety of simple filters to candidate sequences for oligo probes. The primary filter is a rejection of probes that contain contiguous identity with any other sequence in the sample universe that exceeds a pre-established threshold length. We find that rejection of oligos that contain 15 bases of perfect match with other sequences in the design universe is a feasible strategy for oligo selection for probe arrays designed to interrogate mammalian RNA populations. Filters to remove sequences with low complexity and predicted poor probe accessibility narrow the candidate probe space only slightly. Rejection based on global sequence alignment is performed as a secondary, rather than primary, test, leading to an algorithm that is computationally efficient. Splice isoforms pose unique challenges, and we find that isoform prevalence will for the most part have to be determined by analysis of the patterns of hybridization of partially redundant oligonucleotides. AVAILABILITY: The oligo design program OligoPicker and its source code are freely available at our website.
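
The primary filter described here, rejecting probes that share a contiguous perfect match of threshold length with another sequence, can be sketched with a k-mer index. This is not OligoPicker's implementation: the function names are hypothetical, reverse complements are not considered, and the index is assumed to be built from every sequence in the design universe except the probe's own source gene.

```python
def kmers(sequence, k):
    """All contiguous substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_kmer_index(sequences, k=15):
    """Index every k-mer present in the non-target sequences."""
    index = set()
    for seq in sequences:
        index |= kmers(seq, k)
    return index


def passes_identity_filter(probe, index, k=15):
    """True if the probe shares no perfect k-base match with the index."""
    return kmers(probe, k).isdisjoint(index)


# Toy example with a short threshold so the effect is visible.
others = ["ACGTACGTTT"]
index = build_kmer_index(others, k=4)
print(passes_identity_filter("GGGACGTGGG", index, k=4))  # shares ACGT
print(passes_identity_filter("GGGGCCCC", index, k=4))    # no shared 4-mer
```

Storing k-mers in a set keeps each probe check proportional to probe length, which fits the abstract's point that the cheap identity filter runs before any global alignment.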
