Similar articles
Found 20 similar articles (search time: 31 ms)
1.
2.
Highly repetitive sequence within proteins is an abundant feature yet is considered by some to be the protein equivalent of "junk DNA." Homopolymer sequences, the most highly repetitive of this group, are typically encoded by trinucleotide repeats at the DNA level. It is thought that many of these sequences are produced by a replicative slippage mechanism. Recent studies suggest that these highly mutable regions within proteins may allow for rapid morphological evolution emerging from the increased variability afforded by such coding structures. However, in a homopolymer, it is difficult to determine if the repeated amino acid is due to slippage at the DNA level or due to selection at the protein level. Here we develop and test a model to detect cases for which the homopolymer tract has clearly been selected for, with no evidence of slippage at the DNA level. The polyserine tract within the phosphatidylserine receptor protein is used as an excellent example of one such case.
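The abstract above does not describe its model, so the following is only a rough illustrative heuristic, not the authors' method. It assumes that a tract encoded by one repeated codon is more consistent with slippage, while a tract encoded by a mix of synonymous codons is more consistent with protein-level selection. The function names and the `min_len` threshold are hypothetical.

```python
from itertools import groupby


def homopolymer_tracts(protein, min_len=5):
    """Return (start, length, residue) for each run of identical residues.

    min_len is an arbitrary illustrative threshold, not taken from the source.
    """
    tracts = []
    pos = 0
    for residue, run in groupby(protein):
        length = len(list(run))
        if length >= min_len:
            tracts.append((pos, length, residue))
        pos += length
    return tracts


def codon_purity(cds, start, length):
    """Fraction of codons in the tract that equal the tract's most common codon.

    A value near 1.0 means a pure trinucleotide repeat; a low value means the
    same amino acid is encoded by many different synonymous codons.
    """
    codons = [cds[3 * i:3 * i + 3] for i in range(start, start + length)]
    most_common = max(set(codons), key=codons.count)
    return codons.count(most_common) / len(codons)


# Toy sequences (invented for illustration): Met, six Ser, Lys.
protein = "MSSSSSSK"
mixed_cds = "ATG" + "TCT" + "AGC" + "TCA" + "AGT" + "TCC" + "TCG" + "AAA"
pure_cds = "ATG" + "AGC" * 6 + "AAA"
```

With these toy inputs, the mixed-codon tract scores 1/6 and the pure repeat scores 1.0; where to draw the line between the two is left open here.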

3.
The distribution of selection coefficients of new mutations is of key interest in population genetics. In this paper we explore how codon-based likelihood models can be used to estimate the distribution of selection coefficients of new amino acid replacement mutations from phylogenetic data. To obtain such estimates we assume that all mutations at the same site have the same selection coefficient. We first estimate the distribution of selection coefficients from two large viral data sets under the assumption that the viral population size is the same along all lineages of the phylogeny and that the selection coefficients vary among sites. We then implement several new models in which the lineages of the phylogeny may have different population sizes. We apply the new models to a data set consisting of the coding regions from eight primate mitochondrial genomes. The results suggest that there might be little power to determine the exact shape of the distribution of selection coefficients, but that the normal and gamma distributions fit the data significantly better than the exponential distribution.

4.
Tricodon regions on messenger RNAs corresponding to a set of proteins from Escherichia coli were scrutinized for their translation speed. The fractional frequency values of the individual codons as they occur in mRNAs of highly expressed genes from Escherichia coli were taken as an indicative measure of the translation speed. The tricodons were classified by the sum of the frequency values of the constituent codons. Examination of the conformation of the encoded amino acid residues in the corresponding protein tertiary structures revealed a correlation between codon usage in mRNA and topological features of the encoded proteins. Alpha helices on proteins tend to be preferentially coded by translationally fast mRNA regions while the slow segments often code for beta strands and coil regions. Fast regions correspondingly avoid coding for beta strands and coil regions while the slow regions similarly move away from encoding alpha helices. Structural and mechanistic aspects of the ribosome peptide channel support the relevance of sequence fragment translation and subsequent conformation. A discussion is presented relating the observation to the reported kinetic data on the formation and stabilization of protein secondary structural types during protein folding. The observed absence of such strong positive selection for codons in non-highly expressed genes is compatible with existing theories that mutation pressure may well dominate codon selection in non-highly expressed genes.
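As a minimal sketch of the scoring described in this abstract, the function below sums per-codon frequency values over each window of three consecutive codons. The sliding (overlapping) window and the frequency table are assumptions made for illustration; the abstract does not say whether tricodons overlap, and the numbers below are invented, not E. coli data.

```python
def tricodon_scores(mrna, codon_freq):
    """Sum codon frequency values over each window of three consecutive codons.

    Assumes the sequence starts in frame; trailing bases that do not fill a
    codon are ignored. Windows overlap (step of one codon), which is an
    assumption rather than something stated in the abstract.
    """
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [
        sum(codon_freq[c] for c in codons[i:i + 3])
        for i in range(len(codons) - 2)
    ]


# Hypothetical fractional frequencies for the four alanine codons.
toy_freq = {"GCU": 0.5, "GCC": 0.25, "GCA": 0.125, "GCG": 0.125}
scores = tricodon_scores("GCUGCUGCCGCAGCG", toy_freq)
```

Under this reading, a higher score would mark a window as translationally "fast" and a lower score as "slow"; the cutoffs used in the study are not given in the abstract.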

5.
Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the N - 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects. Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.

6.
It is understood that DNA and amino acid substitution rates are highly sequence context-dependent; e.g., C → T substitutions in vertebrates may occur much more frequently at CpG sites, and cysteine substitution rates may depend on support of the context for participation in a disulfide bond. Furthermore, many applications rely on quantitative models of nucleotide or amino acid substitution, including phylogenetic inference and identification of amino acid sequence positions involved in functional specificity. We describe quantification of the context dependence of nucleotide substitution rates using baboon, chimpanzee, and human genomic sequence data generated by the NISC Comparative Sequencing Program. Relative mutation rates are reported for the 96 classes of mutations of the form 5' αβγ 3' → 5' αδγ 3', where α, β, γ, and δ are nucleotides and β ≠ δ, based on maximum likelihood calculations. Our results confirm that C → T substitutions are enhanced at CpG sites compared with other transitions, relatively independent of the identity of the preceding nucleotide. While, as expected, transitions generally occur more frequently than transversions, we find that the most frequent transversions involve the C at CpG sites (CpG transversions) and that their rate is comparable to the rate of transitions at non-CpG sites. A four-class model of the rates of context-dependent evolution of primate DNA sequences, CpG transitions > non-CpG transitions ≈ CpG transversions > non-CpG transversions, captures qualitative features of the mutation spectrum. We find that despite qualitative similarity of mutation rates among different genomic regions, there are statistically significant differences.
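The count of 96 classes can be checked with a short enumeration. The sketch below assumes the usual explanation: there are 4 × 4 × 4 × 3 = 192 triplet substitutions, and each one is equivalent to its reverse complement on the opposite strand, which halves the count. The abstract itself does not spell this out, and the function names and the four-class labeller are illustrative only.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_class(before, after):
    """Pick one representative for a mutation and its opposite-strand twin."""
    twin = (reverse_complement(before), reverse_complement(after))
    return min((before, after), twin)


def mutation_classes():
    """Enumerate strand-collapsed classes of the form abg -> adg with b != d."""
    classes = set()
    for a, b, g, d in product("ACGT", repeat=4):
        if b != d:
            classes.add(canonical_class(a + b + g, a + d + g))
    return sorted(classes)


def four_class_label(before, after):
    """Label a triplet mutation using the four classes named in the abstract."""
    a, b, g = before
    d = after[1]
    # The mutated base is part of a CpG if it is the C of 'CG' or, on the
    # opposite strand, the G of 'CG'.
    at_cpg = (b == "C" and g == "G") or (b == "G" and a == "C")
    transition = (b in PURINES) == (d in PURINES)
    site = "CpG" if at_cpg else "non-CpG"
    kind = "transition" if transition else "transversion"
    return f"{site} {kind}"
```

Collapsing by strand is a modelling choice that fits data where the mutated strand is unknown; a strand-aware analysis would keep all 192 classes.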

7.
La D, Kihara D 《Proteins》2012,80(1):126-141
Protein-protein binding events mediate many critical biological functions in the cell. Typically, functionally important sites in proteins can be well identified by considering sequence conservation. However, protein-protein interaction sites exhibit higher sequence variation than other functional regions, such as catalytic sites of enzymes. Consequently, the mutational behavior leading to weak sequence conservation poses significant challenges to protein-protein interaction site prediction. Here, we present a phylogenetic framework to capture critical sequence variations that favor the selection of residues essential for protein-protein binding. Through the comprehensive analysis of diverse protein families, we show that protein binding interfaces exhibit distinct amino acid substitution patterns as compared with other surface residues. On the basis of this analysis, we have developed a novel method, BindML, which utilizes the substitution models to predict protein-protein binding sites of proteins with unknown interacting partners. BindML estimates the likelihood that a phylogenetic tree of a local surface region in a query protein structure follows the substitution patterns of protein binding interfaces and nonbinding surfaces. BindML is shown to perform well compared to alternative methods for protein binding interface prediction. The methodology developed in this study is very versatile in the sense that it can be generally applied for predicting other types of functional sites, such as DNA, RNA, and membrane binding sites in proteins.

8.
We analyzed the complete genome sequence of Arabidopsis thaliana and sequence data from 83 genes in the outcrossing A. lyrata, to better understand the role of gene expression on the strength of natural selection on synonymous and replacement sites in Arabidopsis. From data on tRNA gene abundance, we find a good concordance between codon preferences and the relative abundance of isoaccepting tRNAs in the complete A. thaliana genome, consistent with models of translational selection. Both EST-based and new quantitative measures of gene expression (MPSS) suggest that codon preferences derived from information on tRNA abundance are more strongly associated with gene expression than those obtained from multivariate analysis, which provides further support for the hypothesis that codon bias in Arabidopsis is under selection mediated by tRNA abundance. Consistent with previous results, analysis of protein evolution reveals a significant correlation between gene expression level and amino acid substitution rate. Analysis by MPSS estimates of gene expression suggests that this effect is primarily the result of a correlation between the number of tissues in which a gene is expressed and the rate of amino acid substitution, which indicates that the degree of tissue specialization may be an important determinant of the rate of protein evolution in Arabidopsis.

9.
Adenylate kinase, the product of the adk locus in Escherichia coli K12, catalyzes the conversion of AMP and ATP to two molecules of ADP. The gene has been cloned by complementation of an adk temperature-sensitive mutation. The DNA sequences of the complete coding region and of the 5'- and 3'-untranslated regions were determined. The resulting protein sequence was found to contain several regions of high homology with cytosolic adenylate kinase of pig muscle (AK1), whose three-dimensional structure has been determined. The most significant of the amino acid exchanges is the replacement of histidine 36 with glutamine. This residue is believed to play a role in catalysis through metal ion binding. The codon usage pattern and the determination of adenylate kinase molecules per cell show that the enzyme is one of the more abundant soluble proteins of the bacterial cells.

10.
Purifying and directional selection in overlapping prokaryotic genes   (Total citations: 4; self-citations: 0; citations by others: 4)
In overlapping genes, the same DNA sequence codes for two proteins using different reading frames. Analysis of overlapping genes can help in understanding the mode of evolution of a coding region from noncoding DNA. We identified 71 pairs of convergent genes, with overlapping 3' ends longer than 15 nucleotides, that are conserved in at least two prokaryotic genomes. Among the overlap regions, we observed a statistically significant bias towards the 123:132 phase (i.e. the second codon base in one gene facing the degenerate third position in the second gene). This phase ensures the least mutual constraint on nonconservative amino acid replacements in both overlapping coding sequences. The excess of this phase is compatible with directional (positive) selection acting on the overlapping coding regions. This could be a general evolutionary mode for genes emerging from noncoding sequences, in which the protein sequence has not been subject to selection.

11.
Nucleotide sequence of the coding region of the mouse N-myc gene.   (Total citations: 11; self-citations: 3; citations by others: 8)
Y Taya, S Mizusawa, S Nishimura 《The EMBO journal》1986,5(6):1215-1219
A genomic clone for the mouse N-myc gene was isolated and the total nucleotide sequence (4807 bp) of the two coding exons and an intron located between them was determined. The amino acid sequence of the N-myc protein was deduced from the DNA sequence. This protein is composed of 462 amino acids, slightly larger than human and mouse c-myc proteins, and is rich in proline like the c-myc protein. Comparison of the amino acid sequences of the mouse N-myc and c-myc proteins showed that conserved sequences are located in eight regions: four regions are in the N-terminal half of the N-myc protein and are separated from each other by regions poorly homologous to those of the c-myc protein, and the four others are located in the C-terminal half, throughout which certain homology exists. A remarkable sequence containing 13 successive acidic amino acids is present in one of the conserved regions located in the middle of the N-myc protein.

12.
Translational selection and yeast proteome evolution   (Total citations: 26; self-citations: 0; citations by others: 26)
Akashi H 《Genetics》2003,164(4):1291-1303

13.
Estimation of evolutionary distances from coding sequences must take into account protein-level selection to avoid relative underestimation of longer evolutionary distances. Current modeling of selection via site-to-site rate heterogeneity generally neglects another aspect of selection, namely position-specific amino acid frequencies. These frequencies determine the maximum dissimilarity expected for highly diverged but functionally and structurally conserved sequences, and hence are crucial for estimating long distances. We introduce a codon-level model of coding sequence evolution in which position-specific amino acid frequencies are free parameters. In our implementation, these are estimated from an alignment using methods described previously. We use simulations to demonstrate the importance and feasibility of modeling such behavior; our model produces linear distance estimates over a wide range of distances, while several alternative models underestimate long distances relative to short distances. Site-to-site differences in rates, as well as synonymous/nonsynonymous and first/second/third-codon-position differences, arise as a natural consequence of the site-to-site differences in amino acid frequencies.

14.
A clone encoding a proline-rich protein (ZmPRP) has been obtained from maize root by differential screening of a maturing elongation root cDNA library. The amino acid sequence deduced from the full-length cDNA contains a putative signal peptide and a highly repetitive sequence containing the PEPK motif, indicating that the ZmPRP mRNA may code for a cell wall protein. The PEPK repeat is also found in a previously reported wheat sequence but differs from the repeated sequences found in hydroxyproline-rich glycoproteins (HRGP) and in dicot proline-rich proteins (PRP). In the maize genome, the ZmPRP protein is encoded by a single gene that is expressed in maturing regions of the root, in the hypocotyl and in the pericarp. In these organs, the ZmPRP mRNA accumulates in the xylem and surrounding cells, and in the epidermis. No ZmPRP mRNA was found in the phloem. The pattern of mRNA accumulation is very similar to the one observed for genes coding for proteins involved in lignin biosynthesis and, like most cell wall proteins, ZmPRP synthesis is also induced by wounding. These data support the hypothesis that ZmPRP is a member of a new class of fibrous proteins involved in the secondary cell wall formation in monocot species.

15.
The TATA-box binding protein (TBP) is one of the 4 DNA-binding proteins that have been shown to associate with the proximal promoter region (−295) of the gene for the bean seed storage protein phaseolin. The −295 promoter is essential for spatial and temporal control of phaseolin gene expression. We designed a pair of degenerate primers based on the highly conserved sequence of the carboxyl-terminal domain of yeast TBP and used PCR to amplify the corresponding sequence from bean cDNA. By using the amplified fragment as a probe, we screened a cDNA library derived from poly A(+) RNA from developing bean seeds and isolated 2 nearly full-length cDNA clones (813 and 826 bp long). The cDNAs encode 2 distinct isoforms of bean TBP, PV1 and PV2, each with an open reading frame of 200 amino acid residues. The 2 cDNA sequences share an 85.8% overall nucleotide sequence identity, with the coding region showing a higher degree of identity (94.4%) than the 5′- and 3′-untranslated regions (69%). The deduced amino acid sequences of the bean TBP isoforms differ in only 3 amino acid residues at positions 5, 9, and 16, all located in the amino-terminal region. The carboxyl-terminal domain of 180 amino acid residues shows a high degree (>82%) of evolutionary sequence conservation with the TBP sequences from other eukaryotic species. This domain possesses the 3 highly conserved structural motifs, namely the 2 direct repeat sequences, a central basic region rich in basic amino acid residues, and a region similar to the sigma factor of prokaryotes. On the basis of this and other findings, we suggest that higher plants in general may have at least 2 copies of the TBP gene, presumably resulting from the global duplication of the genome. GenBank accession numbers: AF015784 and AF015785.

16.
We use flexible backbone protein design to explore the sequence and structure neighborhoods of naturally occurring proteins. The method samples sequence and structure space in the vicinity of a known sequence and structure by alternately optimizing the sequence for a fixed protein backbone using rotamer-based sequence search, and optimizing the backbone for a fixed amino acid sequence using atomic-resolution structure prediction. We find that such a flexible backbone design method better recapitulates protein family sequence variation than sequence optimization on fixed backbones or randomly perturbed backbone ensembles for ten diverse protein structures. For the SH3 domain, the backbone structure variation in the family is also better recapitulated than in randomly perturbed backbones. The potential application of this method as a model of protein family evolution is highlighted by a concerted transition to the amino acid sequence in the structural core of one SH3 domain starting from the backbone coordinates of a homologous structure.

17.
Summary: Chou-Fasman parameters, measuring preferences of each amino acid for different conformational regions in proteins, were used to obtain an amino acid difference index of conformational parameter distance (CPD) values. CPD values were found to be significantly lower for amino acid exchanges representing, in the genetic code, transitions of purines (G↔A) than for exchanges representing either transitions of pyrimidines (C↔U) or transversions of purines and pyrimidines. Inasmuch as the distribution of CPD values in these non-G↔A exchanges resembles that obtained for amino acid pairs with double or triple base differences in their underlying codons, we conclude that the genetic code was not particularly designed to minimize effects of mutation on protein conformation. That natural selection minimizes these changes, however, was shown by tabulating results obtained by the maximum parsimony method for eight protein genealogies with a total occurrence of 4574 base substitutions. At the beginning position of the codons G↔A transitions were in very great excess over other base substitutions, and, conversely, C↔U transitions were deficient. At the middle position of the codons only fast evolving proteins showed an excess of G↔A transitions, as though selection mainly preserved conformation in these proteins while weeding out mutations affecting chemical properties of functional sites in slow evolving proteins. In both fast and slow evolving proteins the net direction of transitions and transversions was found to be from G beginning codons to non-G beginning codons resulting in more commonly occurring amino acids, especially alanine with its generalized conformational properties, being replaced at suitable sites by amino acids with more specialized conformational and chemical properties. Historical circumstances pertaining to the origin of the genetic code and the nature of primordial proteins could account for such directional changes leading to increases in the functional density of proteins. In order to further explore the course of protein evolution, a modified parsimony algorithm was developed for constructing protein genealogies on the basis of minimum CPD length. The algorithm's ability to judge with finer discrimination that in protein evolution certain pathways of amino acid substitution should occur more readily than others was considered a potential advantage over strict maximum parsimony. In developing this CPD algorithm, the path of minimum CPD length through intermediate amino acids allowed by the genetic code for each pair of amino acids was determined. It was found that amino acid exchanges representing two base changes have a considerably lower average CPD value per base substitution than the amino acid exchanges representing single base changes. Amino acid exchanges representing three base changes have yet a further marked reduction in CPD per base change. This shows how extreme constraining effects of stabilizing selection can be circumvented, for by way of intermediate amino acids almost any amino acid can ultimately be substituted for another without damage to an evolving protein's conformation during the process.

18.
We have designed, synthesized, and characterized a 216 amino acid residue sequence encoding a putative idealized alpha/beta-barrel protein. The design was elaborated in two steps. First, the idealized backbone was defined with geometric parameters representing our target fold: a central eight parallel-stranded beta-sheet surrounded by eight parallel alpha-helices, connected together with short structural turns on both sides of the barrel. An automated sequence selection algorithm, based on the dead-end elimination theorem, was used to find the optimal amino acid sequence fitting the target structure. A synthetic gene coding for the designed sequence was constructed and the recombinant artificial protein was expressed in bacteria, purified and characterized. Far-UV CD spectra with prominent bands at 222 nm and 208 nm revealed the presence of alpha-helix secondary structures (50%) in fairly good agreement with the model. A pronounced absorption band in the near-UV CD region, arising from immobilized aromatic side-chains, showed that the artificial protein is folded in solution. Chemical unfolding monitored by tryptophan fluorescence revealed a conformational stability (ΔG(H2O)) of 35 kJ/mol. Thermal unfolding monitored by near-UV CD revealed a cooperative transition with an apparent Tm of 65 °C. Moreover, the artificial protein did not exhibit any affinity for the hydrophobic fluorescent probe 1-anilinonaphthalene-8-sulfonic acid (ANS), providing additional evidence that the artificial barrel is not in the molten globule state, contrary to previously designed artificial alpha/beta-barrels. Finally, 1H NMR spectra of the folded and unfolded proteins provided evidence for specific interactions in the folded protein. Taken together, the results indicate that the de novo designed alpha/beta-barrel protein adopts a stable three-dimensional structure in solution. These encouraging results show that de novo design of an idealized protein structure of more than 200 amino acid residues is now possible, from construction of a particular backbone conformation to determination of an amino acid sequence with an automated sequence selection algorithm.

19.
Markovian models of protein evolution that relax the assumption of independent change among codons are considered. With this comparatively realistic framework, an evolutionary rate at a site can depend both on the state of the site and on the states of surrounding sites. By allowing a relatively general dependence structure among sites, models of evolution can reflect attributes of tertiary structure. To quantify the impact of protein structure on protein evolution, we analyze protein-coding DNA sequence pairs with an evolutionary model that incorporates effects of solvent accessibility and pairwise interactions among amino acid residues. By explicitly considering the relationship between nonsynonymous substitution rates and protein structure, this approach can lead to refined detection and characterization of positive selection. Analyses of simulated sequence pairs indicate that parameters in this evolutionary model can be well estimated. Analyses of lysozyme c and annexin V sequence pairs yield the biologically reasonable result that amino acid replacement rates are higher when the replacements lead to energetically favorable proteins than when they destabilize the proteins. Although the focus here is evolutionary dependence among codons that is associated with protein structure, the statistical approach is quite general and could be applied to diverse cases of evolutionary dependence where surrogates for sequence fitness can be measured or modeled.

20.
Repeat proteins comprise tandem arrays of a small structural motif. Their structure is defined and stabilized by interactions between residues that are close in the primary sequence. Several studies have investigated whether their structural modularity translates into modular thermodynamic properties. Tetratricopeptide repeat proteins (TPRs) are a class in which the repeated unit is a 34 amino acid helix-turn-helix motif. In this work, we use differential scanning calorimetry (DSC) to study the equilibrium stability of a series of TPR proteins with different numbers of an identical consensus repeat, from 2 to 20, CTPRa2 to CTPRa20. The DSC data provide direct evidence that the folding/unfolding transition of CTPR proteins does not fit a two-state folding model. Our results confirm and expand earlier studies on TPR proteins, which showed that apparent two-state unfolding curves are better fit by linear statistical mechanics models: 1D Ising models in which each repeat is treated as an independent folding unit.
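To make the idea of a 1D Ising model for a repeat array concrete, here is a minimal sketch under stated assumptions: each repeat is either folded or unfolded, a folded repeat contributes an intrinsic free energy, and two adjacent folded repeats contribute an interface free energy. The parameter names and the numerical values are hypothetical, energies are expressed in units of RT, and this is not the specific model fitted in the study.

```python
import math


def fraction_folded(n_repeats, g_intrinsic, g_interface):
    """Mean fraction of folded repeats in a linear two-state Ising chain.

    g_intrinsic: free energy of folding one repeat (units of RT).
    g_interface: free energy of the contact between adjacent folded repeats
                 (units of RT). Both values are illustrative assumptions.
    """
    w_fold = math.exp(-g_intrinsic)
    w_pair = math.exp(-g_interface)
    # z[s]: summed Boltzmann weight of partial chains ending in state s
    #       (0 = unfolded, 1 = folded).
    # n[s]: the same sum, with each term multiplied by its folded-repeat count.
    z = [1.0, w_fold]
    n = [0.0, w_fold]
    for _ in range(n_repeats - 1):
        z_unf = z[0] + z[1]
        n_unf = n[0] + n[1]
        z_fold = w_fold * (z[0] + w_pair * z[1])
        n_fold = w_fold * ((n[0] + z[0]) + w_pair * (n[1] + z[1]))
        z, n = [z_unf, z_fold], [n_unf, n_fold]
    return (n[0] + n[1]) / ((z[0] + z[1]) * n_repeats)


# Unfavourable intrinsic folding, favourable interfaces (made-up numbers).
short_array = fraction_folded(2, 2.0, -3.0)
long_array = fraction_folded(20, 2.0, -3.0)
```

With these made-up parameters the 20-repeat array is far more folded than the 2-repeat array, which is the qualitative behaviour such models use to explain why stability grows with repeat number.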
