Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
The genomic era has seen a remarkable increase in the number of genomes being sequenced and annotated. Nonetheless, annotation remains a serious challenge for compositionally biased genomes. For preliminary annotation, popular nucleotide and protein comparison methods such as BLAST are widely employed. These methods score alignments using matrices such as amino acid substitution matrices. Since a nucleotide bias leads to an overall bias in the amino acid composition of proteins, it is possible that a genome with nucleotide bias may have introduced atypical amino acid substitutions in its proteome. Consequently, standard matrices fail to perform well in sequence analysis of these genomes. To address this issue, we examined amino acid substitution in the AT-rich genome of Plasmodium falciparum, chosen as a reference, and reconstituted a substitution matrix in the genome's context. The matrix was used to generate protein sequence alignments for the parasite proteins that improved across the functional regions. We attribute this to the consistency that may have been achieved between the target and background frequencies calculated exclusively in our study. This study has important implications for the annotation of proteins that are of experimental interest but give poor sequence alignments with standard matrices.
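The relationship between target and background frequencies mentioned in this abstract can be illustrated with a generic log-odds score. The sketch below is not the authors' procedure; it assumes the common BLOSUM-style definition, score(i, j) = log2(observed pair frequency / frequency expected from background composition), and the function name `log_odds_matrix` and the toy input are hypothetical.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Return log2-odds scores for unordered residue pairs.

    aligned_pairs: iterable of (a, b) residues taken from trusted alignments.
    Assumption: BLOSUM-style scoring, not the method of the cited study.
    """
    # Count each aligned pair without regard to order.
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total = sum(pair_counts.values())

    # Target frequencies q_ij: observed fraction of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequencies p_i derived from the same pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency of the unordered pair under independence.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(freq / expected)
    return scores


# Toy data: 6 A-A pairs, 2 L-L pairs, 2 A-L pairs.
toy_pairs = [("A", "A")] * 6 + [("L", "L")] * 2 + [("A", "L")] * 2
toy_scores = log_odds_matrix(toy_pairs)
```

Because the background frequencies are derived from the same alignments as the target frequencies, a composition-biased genome yields scores calibrated to its own residue usage, which is the kind of consistency the abstract refers to.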

2.
Amino acid sequences from several thousand homologous gene pairs were compared for two plant genomes, Oryza sativa and Arabidopsis thaliana. The Arabidopsis genes all have similar G+C (guanine plus cytosine) contents, whereas their homologs in rice span a wide range of G+C levels. The results show that those rice genes that display increased divergence in their nucleotide composition (specifically, increased G+C content) showed a corresponding, predictable change in the amino acid compositions of the encoded proteins relative to their Arabidopsis homologs. This trend was not seen in a "control" set of rice genes that had nucleotide contents closer to their Arabidopsis homologs. In addition to showing an overall difference in the amino acid composition of the homologous proteins, we were also able to investigate the biased patterns of amino acid substitution since the divergence of these two species. We found that the amino acid exchange matrix was highly asymmetric when comparing the High G+C rice genes with their Arabidopsis homologs. Finally, we investigated the possible causes of this biased pattern of sequence evolution. Our results indicate that the biased pattern of protein evolution is the consequence, rather than the cause, of the corresponding changes in nucleotide content. In fact, there is an even more marked asymmetry in the patterns of substitution at synonymous nucleotide sites. Surprisingly, there is a very strong negative correlation between the level of nucleotide bias and the length of the coding sequences within the rice genome. This difference in gene length may provide important clues about the underlying mechanisms.

3.
ABSTRACT: BACKGROUND: The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. By definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely achieve the same cellular function. Recently, the COG database was extended by assigning to every protein both the corresponding amino acid sequence and its encoding nucleotide sequence, resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins. RESULTS: Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC's NUCOCOG dataset as the largest one available for that purpose thus far. CONCLUSIONS: Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs and provides a GUI for flexible mining for sequence bias. Thereby, to our knowledge, it is the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context that does not require substantial programming skills.

4.
Identification of functional open reading frames in chloroplast genomes   Cited by: 7 (self-citations: 0; citations by others: 7)
K H Wolfe, P M Sharp. Gene, 1988, 66(2): 215-222
We have used a rapid computer dot-matrix comparison method to identify all DNA regions which have been evolutionarily conserved between the completely sequenced chloroplast genomes of tobacco and a liverwort. Analysis of these regions reveals 74 homologous open reading frames (ORFs) which have been conserved as to length and amino acid sequence; these ORFs also have an excess of nucleotide substitutions at silent sites of codons. Since the nonfunctional parts of these genomes have become saturated with mutations and show no sequence similarity whatsoever, the homologous ORFs are almost certainly functional. A further four pairs of ORFs show homology limited to only a short part of their putative gene products. Amino acid sequence identities range between 50 and 99%; some chloroplast proteins are seen to be among the most slowly evolving of all known proteins. A search of the nucleotide and amino acid sequence databanks has revealed several previously unidentified genes in chloroplast sequences from other species, but no new homologies to prokaryotic genes.

5.
Transmembrane helices are the most readily predictable secondary structure components of proteins. They can be predicted to a high degree of accuracy in a variety of ways. Many of these methods compare new sequence data with the sequence characteristics of known transmembrane domains. However, the known transmembrane sequences are not necessarily representative of a particular organism. We attempt to demonstrate that parameters optimized for the known transmembrane domains are far from optimal when predicting transmembrane regions in a given genome. In particular, we have tested the effect of nucleotide bias upon the composition and hence the prediction characteristics of transmembrane helices. Our analysis shows that nucleotide bias of a genome has a strong and predictable influence upon the occurrences of several of the most important hydrophobic amino acids found within transmembrane helices. Thus, we show that nucleotide bias should be taken into account when determining putative transmembrane domains from sequence data.

6.
A new method of peptide analysis is presented which allows assignment of unknown proteins to coding regions of genomes which have been sequenced. This approach involves comparison of the molecular weights of peptides generated by partial proteolytic digestion with those predicted for a protein whose primary amino acid sequence is deduced from a corresponding nucleotide sequence. The proteolytic digestions are accomplished in situ in the stacking gel of a two-dimensional polyacrylamide gel system. We have used this system to show that two variant proteins of the human mitochondrial DNA, MV-1 and MV-2, are allelic and encoded by the unidentified reading frame 3 (URF 3) gene. This assignment was supported by sequence analysis of a clone of this mtDNA region from a HeLa cell line which expresses the uncommon variant MV-2. Four nucleotide changes were found in HeLa URF 3, relative to the reported sequence from human placenta. Two of these changes alter the primary amino acid sequence of the encoded protein. It is proposed that one of those amino acid changes may account for the observed molecular weight variation in MV-1 and MV-2 by proteolytic cleavage, conformational change, or secondary modification. We have used this method to also assign a mitochondrially translated protein to URF 6. These are the first assignments of mitochondrially synthesized polypeptides to human URF genes and prove conclusively that at least some of these genes are expressed in human cells.

7.
The advent of full genome sequences provides exceptionally rich data sets to explore molecular and evolutionary mechanisms that shape divergence among and within genomes. In this study, we use multivariate analysis to determine the processes driving genome-wide patterns of amino acid usage in the obligate endosymbiont Buchnera and its close free-living relative Escherichia coli. In the AT-rich Buchnera genome, the primary source of variation in amino acid usage differentiates high- and low-expression genes. Amino acids of high-expression Buchnera genes are generally less aromatic and use relatively GC-rich codons, suggesting that selection against aromatic amino acids and against amino acids with AT-rich codons is stronger in high-expression genes. Selection to maintain hydrophobic amino acids in integral membrane proteins is a primary factor driving protein evolution in E. coli but is a secondary factor in Buchnera. In E. coli, gene expression is a secondary force driving amino acid usage, and a correlation with tRNA abundance suggests that translational selection contributes to this effect. Although this and previous studies demonstrate that AT mutational bias and genetic drift influence amino acid usage in Buchnera, this genome-wide analysis argues that selection is sufficient to affect the amino acid content of proteins with different expression and hydropathy levels.

8.
Variations in GC content between genomes have been extensively documented. Genomes with comparable GC contents can, however, still differ in the apportionment of the G and C nucleotides between the two DNA strands. This asymmetric strand bias is known as GC skew. Here, we have investigated the impact of differences in nucleotide skew on the amino acid composition of the encoded proteins. We compared orthologous genes between animal mitochondrial genomes that show large differences in GC and AT skews. Specifically, we compared the mitochondrial genomes of mammals, which are characterized by a negative GC skew and a positive AT skew, to those of flatworms, which show the opposite skews for both GC and AT base pairs. We found that the mammalian proteins are highly enriched in amino acids encoded by CA-rich codons (as predicted by their negative GC and positive AT skews), whereas their flatworm orthologs were enriched in amino acids encoded by GT-rich codons (also as predicted from their skews). We found that these differences in mitochondrial strand asymmetry (measured as GC and AT skews) can have very large, predictable effects on the composition of the encoded proteins.
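The skew measures named in this abstract can be sketched in a few lines. The abstract does not state a formula; the sketch below assumes the commonly used definitions, GC skew = (G - C)/(G + C) and AT skew = (A - T)/(A + T), computed on one strand. The function names and toy sequences are hypothetical.

```python
def base_skew(seq, first, second):
    """Return (first - second) / (first + second) for the given strand.

    Assumption: the conventional skew definition; returns 0.0 when
    neither base occurs, to avoid dividing by zero.
    """
    seq = seq.upper()
    n_first = seq.count(first)
    n_second = seq.count(second)
    total = n_first + n_second
    if total == 0:
        return 0.0
    return (n_first - n_second) / total


def gc_skew(seq):
    # Positive when G outnumbers C on this strand.
    return base_skew(seq, "G", "C")


def at_skew(seq):
    # Positive when A outnumbers T on this strand.
    return base_skew(seq, "A", "T")


# Toy strand that is C-rich and A-rich: negative GC skew, positive AT skew.
toy_strand = "CCCGAAAT"
toy_gc = gc_skew(toy_strand)
toy_at = at_skew(toy_strand)
```

Under these definitions, the toy strand has the same sign pattern the abstract reports for mammalian mitochondrial genomes (negative GC skew, positive AT skew), which corresponds to a CA-rich coding strand.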

9.
Most previous work on the evolution of mobile DNA was limited by incomplete sequence information. Whole genome sequences allow us to overcome this limitation. I study the nucleotide diversity of prominent members of five insertion sequence families whose transposition activity is encoded by a single transposase gene. Eighteen among 376 completely sequenced bacterial genomes and plasmids carry between 3 and 20 copies of a given insertion sequence. I show that these copies generally show very low DNA divergence. Specifically, more than 68% of the transposase genes are identical within a genome. The average number of amino acid replacement substitutions at amino acid replacement sites is Ka = 0.013, that at silent sites is Ks = 0.1. This low intragenomic diversity stands in stark contrast to a much higher divergence of the same insertion sequences among distantly related genomes. Gene conversion among protein-coding genes is unlikely to account for this lack of diversity. The relation between transposition frequencies and silent substitution rates suggests that most insertion sequences in a typical genome are evolutionarily young and have been recently acquired. They may undergo periodic extinction in bacterial lineages. By implication, they are detrimental to their host in the long run. This is also suggested by the highly skewed and patchy distribution of insertion sequences among genomes. In sum, one can think of insertion sequences as slow-acting infectious diseases of cell lineages.

10.
Burton RS, Byrne RJ, Rawson PD. Gene, 2007, 403(1-2): 53-59
Previous work on the harpacticoid copepod Tigriopus californicus has focused on the extensive population differentiation in three mtDNA protein coding genes (COXI, COXII, Cytb). In order to get a more complete understanding of mtDNA evolution in this species, we sequenced three complete mitochondrial genomes (one from each of three California populations) and compared them to two published mtDNA genomes from an Asian congener, Tigriopus japonicus. Several features of the mtDNA genome appear to be conserved within the genus: 1) the unique order of the protein coding genes, rRNA genes and most of the tRNA genes, 2) the genome is compact, varying between 14.3 and 14.6 kb, and 3) all genes are encoded on the same strand of the mtDNA. Within T. californicus, extremely high levels of nucleotide divergence (>20%) are observed across much of the mitochondrial genome. Inferred amino acid sequences of the proteins encoded in the mtDNAs also show high levels of divergence; at the extreme, the three ND3 variants in T. californicus showed >25% amino acid substitutions, compared with <3% amino acid divergence at the previously studied COXI locus. Unusual secondary structures make functional assignments of some tRNAs difficult. The only apparent tRNA(trp) in these genomes completely overlaps the 5' end of the 16S rRNA in all three T. californicus mtDNAs. Although not previously noted, this feature is also conserved in T. japonicus mtDNAs; whether this sequence is processed into a functional tRNA has not been determined. The putative control region contains a duplicated segment of different length (from 88 to 155 bp) in each of the T. californicus sequences. In each case, the duplicated segments are not tandem repeats; despite their different lengths, the distance between the start of the first and the start of the second repeat is conserved (520 bp). The functional significance, if any, of this repeat structure remains unknown.

11.
Breton S, Burger G, Stewart DT, Blier PU. Genetics, 2006, 172(2): 1107-1119
Marine mussels of the genus Mytilus have an unusual mode of mitochondrial DNA (mtDNA) transmission termed doubly uniparental inheritance (DUI). Female mussels are homoplasmic for the F mitotype, which is inherited maternally, while males are usually heteroplasmic, carrying a mixture of the maternal F mitotype and the paternally inherited M genome. Two classes of M genomes have been observed: "standard" M genomes and "recently masculinized" M genomes. The latter are more similar to F genomes at the sequence level but are transmitted paternally like standard M genomes. In this study we report the complete sequences of two standard male M. edulis and one recently masculinized male M. trossulus mitochondrial genome. A comparative analysis, including the previously sequenced M. edulis F and M. galloprovincialis F and M mtDNAs, reveals that these genomes are identical in gene order, but highly divergent in nucleotide and amino acid sequence. The large amount (>20%) of nucleotide substitutions that fall in coding regions implies that there are several amino acid replacements between the F and M genomes, which likely have an impact on the structural and functional properties of the mitochondrial proteome. Correlation of the divergence rate of different protein-coding genes indicates that mtDNA-encoded proteins of the M genome are still under selective constraints, although less highly than genes of the F genome. The mosaic F/M control region of the masculinized F genome provides evidence for lineage-specific sequences that may be responsible for the different mode of transmission genetics. This analysis shows the value of comparative genomics to better understand the mechanisms of maintenance and segregation of mtDNA sequence variants in mytilid mussels.

12.
We sequenced most of the mitochondrial genome of the sawfly Perga condei (Insecta: Hymenoptera: Symphyta: Pergidae) and tested different models of phylogenetic reconstruction in order to resolve the position of the Hymenoptera within the Holometabola, using mitochondrial genomes. The mitochondrial genome sequenced for P. condei had less compositional bias and slower rates of molecular evolution than the honeybee, as well as a less rearranged genome organization. Phylogenetic analyses showed that, when using mitochondrial genomes, both adequate taxon sampling and more realistic models of analysis are necessary to resolve relationships among insect orders. Both parsimony and Bayesian analyses performed better when nucleotide instead of amino acid sequences were used. In particular, this study supports the placement of the Hymenoptera as sister group to the Mecopterida.

13.
14.

Background

Pseudoscorpions are chelicerates and have historically been viewed as being most closely related to solifuges, harvestmen, and scorpions. No mitochondrial genomes of pseudoscorpions have been published, but the mitochondrial genomes of some lineages of Chelicerata possess unusual features, including short rRNA genes and tRNA genes that lack sequence to encode arms of the canonical cloverleaf-shaped tRNA. Additionally, some chelicerates possess an atypical guanine-thymine nucleotide bias on the major coding strand of their mitochondrial genomes.

Results

We sequenced the mitochondrial genomes of two divergent taxa from the chelicerate order Pseudoscorpiones. We find that these genomes possess unusually short tRNA genes that do not encode cloverleaf-shaped tRNA structures. Indeed, in one genome, all 22 tRNA genes lack sequence to encode canonical cloverleaf structures. We also find that the large ribosomal RNA genes are substantially shorter than those of most arthropods. We inferred secondary structures of the LSU rRNAs from both pseudoscorpions, and find that they have lost multiple helices. Based on comparisons with the crystal structure of the bacterial ribosome, two of these helices were likely contact points with tRNA T-arms or D-arms as they pass through the ribosome during protein synthesis. The mitochondrial gene arrangements of both pseudoscorpions differ from the ancestral chelicerate gene arrangement. One genome is rearranged with respect to the location of protein-coding genes, the small rRNA gene, and at least 8 tRNA genes. The other genome contains 6 tRNA genes in novel locations. Most chelicerates with rearranged mitochondrial genes show a genome-wide reversal of the CA nucleotide bias typical for arthropods on their major coding strand, and instead possess a GT bias. Yet despite their extensive rearrangement, these pseudoscorpion mitochondrial genomes possess a CA bias on the major coding strand. Phylogenetic analyses of all 13 mitochondrial protein-coding gene sequences consistently yield trees that place pseudoscorpions as sister to acariform mites.

Conclusion

The well-supported phylogenetic placement of pseudoscorpions as sister to Acariformes differs from some previous analyses based on morphology. However, these two lineages share multiple molecular evolutionary traits, including substantial mitochondrial genome rearrangements, extensive nucleotide substitution, and loss of helices in their inferred tRNA and rRNA structures.

15.
At less than 90 Mbp, the tiny nuclear genome of the carnivorous bladderwort plant Utricularia is an attractive model system for studying molecular evolutionary processes leading to genome miniaturization. Recently, we reported that expression of genes encoding DNA repair and reactive oxygen species (ROS) detoxification enzymes is highest in Utricularia traps, and we argued that ROS mutagenic action correlates with the high nucleotide substitution rates observed in the Utricularia plastid, mitochondrial, and nuclear genomes. Here, we extend our analysis of 100 nuclear genes from Utricularia and related asterid eudicots to examine nucleotide substitution biases and their potential correlation with ROS-induced DNA lesions. We discovered an unusual bias toward GC nucleotides, most prominently in transition substitutions at the third position of codons, which are presumably silent with respect to adaptation. Given the general tendency of biased gene conversion to drive GC bias, and of ROS to induce double strand breaks requiring recombinational repair, we propose that some of the unusual features of the bladderwort and its genome may be more reflective of these nonadaptive processes than of natural selection.

16.
GH Liu, SY Wang, WY Huang, GH Zhao, SJ Wei, HQ Song, MJ Xu, RQ Lin, DH Zhou, XQ Zhu. PLoS ONE, 2012, 7(7): e42172
Complete mitochondrial (mt) genomes and their gene rearrangements are increasingly used as molecular markers for investigating phylogenetic relationships. To add to the complete mt genomes available for Gastropoda, especially Pulmonata, we determined the mt genome of the freshwater snail Galba pervia, which is an important intermediate host for Fasciola spp. in China. The complete mt genome of G. pervia is 13,768 bp in length. The genome is circular and consists of 37 genes: 13 protein-coding genes, 2 rRNA genes, and 22 tRNA genes. The mt gene order of G. pervia showed a novel arrangement (tRNA-His, tRNA-Gly and tRNA-Tyr change positions and directions) when compared with mt genomes of Pulmonata species sequenced to date, indicating divergence among different species within the Pulmonata. A total of 3655 amino acids were deduced from the 13 protein-coding genes. The most frequently used amino acid is Leu (15.05%), followed by Phe (11.24%), Ser (10.76%) and Ile (8.346%). Phylogenetic analyses using the concatenated amino acid sequences of the 13 protein-coding genes, with three different computational algorithms (maximum parsimony, maximum likelihood and Bayesian analysis), all revealed that the Lymnaeidae and Planorbidae are two closely related snail families, consistent with previous classifications based on morphological and molecular studies. The complete mt genome sequence of G. pervia showed a novel gene arrangement, and it represents the first sequenced high-quality mt genome of the family Lymnaeidae. These novel mtDNA data provide additional genetic markers for studying the epidemiology, population genetics and phylogeography of freshwater snails, as well as for understanding the interplay between the intermediate snail hosts and the intramolluscan stages of Fasciola spp.

17.
Acyl-CoA dehydrogenases (ACADs), which are key enzymes in fatty acid and amino acid catabolism, form a large, pan-taxonomic protein family with at least 13 distinct subfamilies. Yet most reported ACAD members have no subfamily assigned, and little is known about the taxonomic distribution and evolution of the subfamilies. In completely sequenced genomes from approximately 210 species (eukaryotes, bacteria and archaea), we detect ACAD subfamilies by rigorous ortholog identification combining sequence similarity search with phylogeny. We then construct taxonomic subfamily-distribution profiles and build phylogenetic trees with orthologous proteins. Subfamily profiles provide unparalleled insight into the organisms’ energy sources based on genome sequence alone and further predict enzyme substrate specificity, thus generating explicit working hypotheses for targeted biochemical experimentation. Eukaryotic ACAD subfamilies are traditionally considered as mitochondrial proteins, but we found evidence that in fungi one subfamily is located in peroxisomes and participates in a distinct β-oxidation pathway. Finally, we discern horizontal transfer, duplication, loss and secondary acquisition of ACAD genes during evolution of this family. Through these unorthodox expansion strategies, the ACAD family is proficient in utilizing a large range of fatty acids and amino acids, strategies that could have shaped the evolutionary history of many other ancient protein families.

18.
Singer GA, Hickey DA. Gene, 2003, 317(1-2): 39-47
A number of recent studies have shown that thermophilic prokaryotes have distinguishable patterns of both synonymous codon usage and amino acid composition, indicating the action of natural selection related to thermophily. On the other hand, several other studies of whole genomes have illustrated that nucleotide bias can have dramatic effects on synonymous codon usage and also on the amino acid composition of the encoded proteins. This raises the possibility that the thermophile-specific patterns observed at both the codon and protein levels are merely reflections of a single underlying effect at the level of nucleotide composition. Moreover, such an effect at the nucleotide level might be due entirely to mutational bias. In this study, we have compared the genomes of thermophiles and mesophiles at three levels: nucleotide content, codon usage and amino acid composition. Our results indicate that the genomes of thermophiles are distinguishable from mesophiles at all three levels and that the codon and amino acid frequency differences cannot be explained simply by the patterns of nucleotide composition. At the nucleotide level, we see a consistent tendency for the frequency of adenine to increase at all codon positions within the thermophiles. Thermophiles are also distinguished by their pattern of synonymous codon usage for several amino acids, particularly arginine and isoleucine. At the protein level, the most dramatic effect is a two-fold decrease in the frequency of glutamine residues among thermophiles. These results indicate that adaptation to growth at high temperature requires a coordinated set of evolutionary changes affecting (i) mRNA thermostability, (ii) stability of codon-anticodon interactions and (iii) increased thermostability of the protein products. We conclude that elevated growth temperature imposes selective constraints at all three molecular levels: nucleotide content, codon usage and amino acid composition. In addition to these multiple selective effects, however, the genomes of both thermophiles and mesophiles are often subject to superimposed large changes in composition due to mutational bias.

19.
Normalized nucleotide and amino acid contents of complete genome sequences can be visualized as radar charts. The shapes of these charts depict the characteristics of an organism’s genome. The normalized values calculated from the genome sequence theoretically exclude experimental errors. Further, because normalization is independent of both target size and kind, this procedure is applicable not only to single genes but also to whole genomes, which consist of a huge number of different genes. In this review, we discuss the applications of the normalization of the nucleotide and predicted amino acid contents of complete genomes to the investigation of genome structure and to evolutionary research from primitive organisms to Homo sapiens. Some of the results could never have been obtained from the analysis of individual nucleotide or amino acid sequences but were revealed only after the normalization of nucleotide and amino acid contents was applied to genome research. The discovery that genome structure was homogeneous was obtained only after normalization methods were applied to the nucleotide or predicted amino acid contents of genome sequences. Normalization procedures are also applicable to evolutionary research. Thus, normalization of the contents of whole genomes is a useful procedure that can help to characterize organisms.

20.
In order to establish the molecular basis of the pathogenicity of the attenuated RC-HL strain of rabies virus used for the production of animal vaccine in Japan, the complete genome sequence of this strain was determined and compared with that of the parental Nishigahara strain which is virulent for adult mice. The viral genome of both strains was composed of 11,926 nucleotides. The nucleotide sequences of the two genomes showed a high homology of 98.9%. The homology of the G gene was lower than those of N, P, M and L genes at both nucleotide and deduced amino acid levels, and the percentage of radical amino acid substitutions on the G protein was the highest among the five proteins. These findings raise the possibility that the structure of the G protein is the most variable among the five proteins of the two strains. Furthermore, we found two clusters of amino acid substitutions on the G and L proteins. The relevance of these clusters to the difference in the pathogenicity between the two strains is discussed.
