Similar literature
 20 similar articles found (search time: 15 ms)
1.

Background

Coalescent simulation is pivotal for understanding population evolutionary models and demographic histories, as well as for developing novel analytical methods for genetic association studies of DNA sequence data. A plethora of coalescent simulators have been developed, but selecting the most appropriate program remains challenging.

Results

We extensively compared the performance of five widely used coalescent simulators (Hudson’s ms, msHOT, MaCS, Simcoal2, and fastsimcoal) to provide a practical guide considering three crucial factors: 1) speed, 2) scalability and 3) recombination hotspot position and intensity accuracy. Although ms represents a popular standard coalescent simulator, it lacks the ability to simulate sequences with recombination hotspots. An extended program, msHOT, has compensated for this deficiency of ms by incorporating recombination hotspots and gene conversion events at arbitrarily chosen locations and intensities, but remains limited in simulating long stretches of DNA sequences. Simcoal2, based on a discrete generation-by-generation approach, can simulate more complex demographic scenarios, but runs comparatively slowly. MaCS and fastsimcoal, both built on fast, modified sequential Markov coalescent algorithms that approximate the standard coalescent, are much more efficient whilst keeping salient features of msHOT and Simcoal2, respectively. Our simulations demonstrate that they are more advantageous than the other programs for a spectrum of evolutionary models. To validate recombination hotspots, the LDhat 2.2 rhomap package, sequenceLDhot and Haploview were compared for hotspot detection, and sequenceLDhot exhibited the best performance based on both real and simulated data.

Conclusions

While ms remains an excellent choice for general coalescent simulations of DNA sequences, MaCS and fastsimcoal are much more scalable and flexible in simulating a variety of demographic events under different recombination hotspot models. Furthermore, sequenceLDhot appears to give the best performance in detecting and validating cross-over hotspots.
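
For readers unfamiliar with what a coalescent simulator does at its core, the following is a minimal sketch of the standard (Kingman) coalescent without recombination, mutation or demography. It is not the algorithm of ms, msHOT, MaCS, Simcoal2 or fastsimcoal; the function names and the time scale (each pair of lineages coalesces at rate 1) are assumptions made only for illustration.

```python
import random


def coalescent_times(n, rng):
    """Waiting times between successive coalescent events for n sampled lineages.

    Assumption: time is scaled so that each pair of lineages coalesces at rate 1,
    so k lineages coalesce at total rate k * (k - 1) / 2.
    """
    times = []
    k = n
    while k > 1:
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))  # exponential waiting time
        k -= 1                               # one pair merges into a single lineage
    return times


def mean_tmrca(n, replicates, seed=1):
    """Average time to the most recent common ancestor over many replicates."""
    rng = random.Random(seed)
    total = sum(sum(coalescent_times(n, rng)) for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    # Theory gives E[TMRCA] = 2 * (1 - 1/n), i.e. 1.8 for n = 10.
    print(mean_tmrca(10, 20000))
```

The simulators compared above add recombination, hotspots and demographic events on top of this basic process, which is where their differences in speed and scalability arise.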

2.
Miguel Arenas  David Posada 《Genetics》2010,184(2):429-437
The coalescent with recombination is a very useful tool in molecular population genetics. Under this framework, genealogies often represent the evolution of the substitution unit, and because of this, the few coalescent algorithms implemented for the simulation of coding sequences force recombination to occur only between codons. However, it is clear that recombination is expected to occur most often within codons. Here we have developed an algorithm that can evolve coding sequences under an ancestral recombination graph that represents the genealogies at each nucleotide site, thereby allowing for intracodon recombination. The algorithm is a modification of Hudson's coalescent in which, in addition to keeping track of events occurring in the ancestral material that reaches the sample, we need to keep track of events occurring in ancestral material that does not reach the sample but that is produced by intracodon recombination. We are able to show that at typical substitution rates the number of nonsynonymous changes induced by intracodon recombination is small and that intracodon recombination does not generally result in inflated estimates of the overall nonsynonymous/synonymous substitution ratio (ω). On the other hand, recombination can bias the estimation of ω at particular codons, resulting in apparent rate variation among sites and in the spurious identification of positively selected sites. Importantly, in this case, allowing for variable synonymous rates across sites greatly reduces the false-positive rate and recovers statistical power. 
Finally, coalescent simulations with intracodon recombination could be used to better represent the evolution of nuclear coding genes or fast-evolving pathogens such as HIV-1. We have implemented this algorithm in a computer program called NetRecodon, freely available at http://darwin.uvigo.es.

The coalescent (Kingman 1982; Hudson 1990) provides an efficient sampling of genealogical histories from a theoretical population evolving under a neutral Wright–Fisher model (Ewens 1979; Kingman 1982; Hudson 1990). Coalescent simulations are commonly used in molecular population genetics to understand the behavior and interactions among evolutionary processes under different scenarios (Innan et al. 2005), such as hypothesis testing (DeChaine and Martin 2006), evaluation and comparison of different analytical methods (Carvajal-Rodriguez et al. 2006), or estimation of population genetic parameters (Beaumont et al. 2002). Indeed, to obtain meaningful biological inferences from these simulations, it is very important that the underlying model is as realistic as possible. In this regard, a number of models have been developed during the last decade that consider different evolutionary processes such as recombination (Simonsen and Churchill 1997; Wiuf and Posada 2003), gene conversion (Wiuf and Hein 2000), selection (Hudson and Kaplan 1988, 1995), and gene flow or demographic history (Slatkin 1987; Pybus and Rambaut 2002).

Despite these advances, and in the face of a plethora of coalescent simulators (Excoffier et al. 2000; Hudson 2002; Posada and Wiuf 2003; Spencer and Coop 2004; Mailund et al. 2005; Schaffner et al. 2005; Marjoram and Wall 2006; Arenas and Posada 2007; Hellenthal and Stephens 2007; Liang et al. 2007), it was not possible until very recently to simulate recombining protein-coding DNA sequences within this framework (Anisimova et al. 2003; Arenas and Posada 2007). 
Importantly, to our knowledge, the algorithms described or implemented so far allow recombination only between codons, not within them. The reason for this unrealistic constraint is that standard codon models describe the probabilities of change along a lineage from one codon to another (Yang 2006), whereas recombination can occur between any two nucleotides, potentially resulting in one or more lineages not being shared by all the positions of the codon. In other words, although the unit for substitution in coding sequences is the codon, the unit for recombination in these sequences is still the nucleotide. Here we describe a new algorithm that overcomes this limitation by allowing for the evolution of different positions of the same codon in distinct genealogies. Furthermore, we use this algorithm to evaluate the effect of intracodon recombination on the generation of nonsynonymous (NS) diversity and on the estimation of the ratio of nonsynonymous-to-synonymous substitution rates (ω or dN/dS) (Li and Gojobori 1983) and the hypotheses derived from it.

3.

Background

The construction of customized nucleic acid sequences allows us to have greater flexibility in gene design for recombinant protein expression. Among the various parameters considered for such DNA sequence design, individual codon usage (ICU) has been implicated as one of the most crucial factors affecting mRNA translational efficiency. However, previous works have also reported the significant influence of codon pair usage, also known as codon context (CC), on the level of protein expression.

Results

In this study, we have developed novel computational procedures for evaluating the relative importance of optimizing ICU and CC for enhancing protein expression. By formulating appropriate mathematical expressions to quantify the ICU and CC fitness of a coding sequence, optimization procedures based on genetic algorithm were employed to maximize its ICU and/or CC fitness. Surprisingly, the in silico validation of the resultant optimized DNA sequences for Escherichia coli, Lactococcus lactis, Pichia pastoris and Saccharomyces cerevisiae suggests that CC is a more relevant design criterion than the commonly considered ICU.

Conclusions

The proposed CC optimization framework can complement and enhance the capabilities of current gene design tools, with potential applications to heterologous protein production and even vaccine development in synthetic biotechnology.
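
The two quantities contrasted in this abstract can be illustrated with a small counting sketch. The abstract does not give the fitness expressions used by the authors, so the code below only tallies individual codon usage and adjacent codon-pair (codon context) usage for a coding sequence; the function names and the toy sequence are assumptions for illustration.

```python
from collections import Counter


def split_codons(cds):
    """Split a coding sequence into complete codons, ignoring trailing bases."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(cds):
    """Individual codon usage (ICU): how often each codon occurs."""
    return Counter(split_codons(cds))


def codon_pair_usage(cds):
    """Codon context (CC): how often each ordered pair of adjacent codons occurs."""
    codons = split_codons(cds)
    return Counter(zip(codons, codons[1:]))


if __name__ == "__main__":
    toy = "ATGGCTGCTGCTTAA"  # hypothetical sequence: ATG GCT GCT GCT TAA
    print(codon_usage(toy)["GCT"])                    # 3
    print(codon_pair_usage(toy)[("GCT", "GCT")])      # 2
```

Any fitness score built on these counts would additionally need a reference usage table for the host organism, which is not provided here.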

4.
Wu J 《BMC genomics》2008,9(Z2):S13

Background

Computational gene prediction tools routinely generate large volumes of predicted coding exons (putative exons). One common limitation of these tools is the relatively low specificity due to the large amount of non-coding regions.

Methods

A statistical approach is developed that largely improves the gene prediction specificity. The key idea is to utilize the evolutionary conservation principle relative to the coding exons. By first exploiting the homology between genomes of two related species, a probability model for the evolutionary conservation pattern of codons across different genomes is developed. A probability model for the dependency between adjacent codons/triplets is added to differentiate coding exons and random sequences. Finally, the log odds ratio is developed to classify putative exons into the group of coding exons and the group of non-coding regions.

Results

The method was tested on pre-aligned human-mouse sequences where the putative exons are predicted by GENSCAN and TWINSCAN. The proposed method is able to improve the exon specificity by 73% and 32% respectively, while the loss of sensitivity is ≤ 1%. The method also keeps 98% of RefSeq gene structures that are correctly predicted by TWINSCAN when removing 26% of predicted genes that are in non-coding regions. The estimated number of true exons in TWINSCAN's predictions is 157,070. The results and the executable codes can be downloaded from http://www.stat.purdue.edu/~jingwu/codon/

Conclusion

The proposed method demonstrates an application of the evolutionary conservation principle to coding exons. It is a complementary method which can be used as an additional criterion to refine many existing gene predictions.
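
The classification step described in the Methods can be sketched as a generic log odds ratio test. This is a deliberately simplified illustration that treats aligned positions as independent; it does not reproduce the conservation or adjacent-codon dependency models of the paper, and the probability tables, labels and threshold below are hypothetical.

```python
import math


def log_odds(observed, p_coding, p_noncoding):
    """Sum of log(P(x | coding) / P(x | non-coding)) over the observed patterns."""
    return sum(math.log(p_coding[x] / p_noncoding[x]) for x in observed)


def classify(observed, p_coding, p_noncoding, threshold=0.0):
    """Label a putative exon by comparing its log odds score with a threshold."""
    score = log_odds(observed, p_coding, p_noncoding)
    return "coding" if score > threshold else "non-coding"


if __name__ == "__main__":
    # Hypothetical probabilities of a codon being conserved or changed between species.
    p_coding = {"conserved": 0.8, "changed": 0.2}
    p_noncoding = {"conserved": 0.5, "changed": 0.5}
    print(classify(["conserved"] * 3, p_coding, p_noncoding))  # coding
    print(classify(["changed"] * 3, p_coding, p_noncoding))    # non-coding
```

In practice the threshold would be tuned to trade specificity against sensitivity, which is the trade-off reported in the Results.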

5.

Background

The Mongolian gerbil is a good model to mimic the Helicobacter pylori-associated pathogenesis of the human stomach. In the current study the gerbil-adapted strain B8 was completely sequenced, annotated and compared to previous genomes, including the 73 supercontigs of the parental strain B128.

Results

The complete genome of H. pylori B8 was manually curated gene by gene, to assign as much function as possible. It consists of a circular chromosome of 1,673,997 bp and of a small plasmid of 6,032 bp carrying nine putative genes. The chromosome contains 1,711 coding sequences, 293 of which are strain-specific, coding mainly for hypothetical proteins, and a large plasticity zone containing a putative type-IV-secretion system and coding sequences with unknown function. The cag pathogenicity island is rearranged such that the cagA gene is located 13,730 bp downstream of the inverted gene cluster cagB-cag1. Directly adjacent to the cagA gene, there are four hypothetical genes and one variable gene with a different codon usage compared to the rest of the H. pylori B8 genome. This indicates that these coding sequences might have been acquired via horizontal gene transfer. The genome comparison of strain B8 to its parental strain B128 delivers 425 unique B8 proteins. Because strain B128 was not fully sequenced and only automatically annotated, only 12 of these proteins are definitive singletons that might have been acquired during the gerbil-adaptation process of strain B128.

Conclusion

Our sequence data and its analysis provide new insight into the high genetic diversity of H. pylori strains. We have shown that the gerbil-adapted strain B8 has the potential to build, possibly by a high rate of mutation and recombination, a dynamic pool of genetic variants (e.g. fragmented genes and repetitive regions) required for the adaptation processes. We hypothesize that these variants are essential for the colonization and persistence of strain B8 in the gerbil stomach during inflammation.

6.

Key message

The core promoter of the antiquitin ALDH7B4 gene was compared between selected Brassicaceae. Conserved cis elements controlling osmotic stress and wound-induced expression were identified and analysed in Arabidopsis thaliana leaves and seeds.

Abstract

Aldehyde dehydrogenases metabolise a wide range of aliphatic and aromatic aldehydes, which become cytotoxic at high levels. Family 7 aldehyde dehydrogenase genes, often described as antiquitins or turgor-responsive genes in plants, are broadly conserved across all domains. Despite the high conservation of the plant ALDH7 proteins and their importance in stress responses, their regulation has not been investigated. Here, we compared ALDH7 genes of different Brassicaceae and found that, in contrast to the gene organisation and protein coding sequences, similarities in the promoter sequences were limited to the first few hundred nucleotides upstream of the translation start codon. The function of this region was studied by isolating the core promoter of the Arabidopsis thaliana ALDH7B4 gene, taken as model. The promoter was found to be responsive to wounding in addition to salt and dehydration stress. Cis-acting elements involved in stress responsiveness were analysed and two conserved ACGT-containing motifs proximal to the translation start codon were found to be essential for the responsiveness to osmotic stress in leaves and in seeds. The integrity of an upstream ACGT motif and a dehydration-responsive element/C-repeat–low temperature-responsive element was found to be necessary for ALDH7B4 expression in seeds and induction by salt, dehydration and ABA in leaves. The comparison of the gene expression in selected Arabidopsis mutants demonstrated that osmotic stress-induced ALDH7B4 expression in leaves and seeds involves both ABA- and lipid-signalling components.

7.

Background

Species of Paris Sect. Marmorata are valuable medicinal plants that synthesize steroidal saponins with pharmacological efficacy. However, the wild resources of the species are threatened by predatory exploitation before molecular genetic studies have uncovered their genomes and evolutionary significance. Thus, the availability of complete chloroplast genome sequences of Sect. Marmorata is crucial to understanding the plastome evolution of this section and facilitating future population genetics studies. Here, we determined chloroplast genomes of Sect. Marmorata, and conducted a whole chloroplast genome comparison.

Results

This study presented detailed sequences and structural variations of chloroplast genomes of Sect. Marmorata. Over 40 large repeats and approximately 130 simple sequence repeats as well as a group of genomic hotspots were detected. Inverted repeat contraction of this section was inferred via comparing the chloroplast genomes with the one of P. verticillata. Additionally, almost all the plastid protein coding genes were found to prefer ending with A/U. Mutation bias and selection pressure predominately shaped the codon bias of most genes. And most of the genes underwent purifying selection, whereas photosynthetic genes experienced a relatively relaxed purifying selection.

Conclusions

Repeat sequences and hotspot regions can be scanned to detect intraspecific and interspecific variability, and selected to infer the phylogenetic relationships of Sect. Marmorata and other species in subgenus Daiswa. Mutation and natural selection were the main forces driving the codon bias pattern of most plastid protein coding genes. Therefore, this study enhances the understanding of the evolution of Sect. Marmorata from the chloroplast genome, and provides genomic insights into genetic analyses of Sect. Marmorata.

8.
9.

Background

Synonymous codon usage varies widely between genomes, and also between genes within genomes. Although there is now a large body of data on variations in codon usage, it is still not clear if the observed patterns reflect the effects of positive Darwinian selection acting at the level of translational efficiency or whether these patterns are due simply to the effects of mutational bias. In this study, we have included both intra-genomic and inter-genomic comparisons of codon usage. This allows us to distinguish more efficiently between the effects of nucleotide bias and translational selection.

Results

We show that there is an extreme degree of heterogeneity in codon usage patterns within the rice genome, and that this heterogeneity is highly correlated with differences in nucleotide content (particularly GC content) between the genes. In contrast to the situation observed within the rice genome, Arabidopsis genes show relatively little variation in both codon usage and nucleotide content. By exploiting a combination of intra-genomic and inter-genomic comparisons, we provide evidence that the differences in codon usage among the rice genes reflect a relatively rapid evolutionary increase in the GC content of some rice genes. We also noted that the degree of codon bias was negatively correlated with gene length.

Conclusion

Our results show that mutational bias can cause a dramatic evolutionary divergence in codon usage patterns within a period of approximately two hundred million years. The heterogeneity of codon usage patterns within the rice genome can be explained by a balance between genome-wide mutational biases and negative selection against these biased mutations. The strength of the negative selection is proportional to the length of the coding sequences. Our results indicate that the large variations in synonymous codon usage are not related to selection acting on the translational efficiency of synonymous codons.
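
Because the argument above rests on nucleotide content, and GC content in particular, a small sketch of how GC content can be measured per gene may help. The abstract does not state which measure the authors used; overall GC and GC at third codon positions (often called GC3) are shown here as common choices, and the function names and toy sequence are assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def gc3_content(cds):
    """GC fraction at third codon positions, using complete codons only."""
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)


if __name__ == "__main__":
    toy = "ATGGCCAAA"  # hypothetical codons ATG GCC AAA, third positions G, C, A
    print(gc_content(toy))   # 4 of 9 bases are G or C
    print(gc3_content(toy))  # 2 of 3 third positions are G or C
```

Comparing such values across genes within one genome and between genomes is the kind of intra- and inter-genomic comparison the abstract describes.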

10.
11.
12.

Background

Visualising the evolutionary history of a set of sequences is a challenge for molecular phylogenetics. One approach is to use undirected graphs, such as median networks, to visualise phylogenies where reticulate relationships such as recombination or homoplasy are displayed as cycles. Median networks contain binary representations of sequences as nodes, with edges connecting those sequences differing at one character; hypothetical ancestral nodes are invoked to generate a connected network which contains all most parsimonious trees. Quasi-median networks are a generalisation of median networks which are not restricted to binary data, although phylogenetic information contained within the multistate positions can be lost during the preprocessing of data. Where the history of a set of samples contains frequent homoplasies or recombination events, quasi-median networks will have a complex topology. Graph reduction or pruning methods have been used to reduce network complexity, but some of these methods are inapplicable to datasets in which recombination has occurred and others are procedurally complex and/or result in disconnected networks.

Results

We address the problems inherent in construction and reduction of quasi-median networks. We describe a novel method of generating quasi-median networks that uses all characters, both binary and multistate, without imposing an arbitrary ordering of the multistate partitions. We also describe a pruning mechanism which maintains at least one shortest path between observed sequences, displaying the underlying relations between all pairs of sequences while maintaining a connected graph.

Conclusion

Application of this approach to 5S rDNA sequence data from sea beet produced a pruned network within which genetic isolation between populations by distance was evident, demonstrating the value of this approach for exploration of evolutionary relationships.
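
The basic edge rule stated in the Background (nodes are sequences, edges join sequences that differ at exactly one character) can be sketched as follows. This covers only the observed sequences; it does not generate the hypothetical ancestral nodes, the quasi-median construction or the pruning method of the paper, and the function names are assumptions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def one_step_edges(sequences):
    """Edges between distinct observed sequences that differ at exactly one character."""
    nodes = sorted(set(sequences))
    return [(a, b) for a, b in combinations(nodes, 2) if hamming(a, b) == 1]


if __name__ == "__main__":
    observed = ["000", "001", "011", "111"]  # hypothetical binary-coded sequences
    print(one_step_edges(observed))
    # [('000', '001'), ('001', '011'), ('011', '111')]
```

When observed sequences are more than one step apart, this graph is disconnected, which is why median-type networks introduce inferred intermediate nodes.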

13.

Background

The moss Physcomitrella patens is an attractive model system for plant biology and functional genome analysis. It shares many biological features with higher plants but has the unique advantage of an efficient homologous recombination system for its nuclear DNA. This allows precise genetic manipulations and targeted knockouts to study gene function, an approach that due to the very low frequency of targeted recombination events is not routinely possible in any higher plant.

Results

As an important prerequisite for a large-scale gene/function correlation study in this plant, we are establishing a collection of Physcomitrella patens transformants with insertion mutations in most expressed genes. A low-redundancy moss cDNA library was mutagenised in E. coli using a derivative of the transposon Tn1000. The resulting gene-disruption library was then used to transform Physcomitrella. Homologous recombination of the mutagenised cDNA with genomic coding sequences is expected to target insertion events preferentially to expressed genes. An immediate phenotypic analysis of transformants is made possible by the predominance of the haploid gametophytic state in the life cycle of the moss. Among the first 16,203 transformants analysed so far, we observed 2636 plants ( = 16.2%) that differed from the wild-type in a variety of developmental, morphological and physiological characteristics.

Conclusions

The high proportion of phenotypic deviations and the wide range of abnormalities observed among the transformants suggest that mutagenesis by gene-disruption library transformation is a useful strategy to establish a highly diverse population of Physcomitrella patens mutants for functional genome analysis.

14.

Background

The human malaria parasite Plasmodium falciparum survives pressures from the host immune system and antimalarial drugs by modifying its genome. Genetic recombination and nucleotide substitution are the two major mechanisms that the parasite employs to generate genome diversity. A better understanding of these mechanisms may provide important information for studying parasite evolution, immune evasion and drug resistance.

Results

Here, we used a high-density tiling array to estimate the genetic recombination rate among 32 progeny of a P. falciparum genetic cross (7G8 × GB4). We detected 638 recombination events and constructed a high-resolution genetic map. Comparing genetic and physical maps, we obtained an overall recombination rate of 9.6 kb per centimorgan and identified 54 candidate recombination hotspots. Similar to centromeres in other organisms, the sequences of P. falciparum centromeres are found in chromosome regions largely devoid of recombination activity. Motifs enriched in hotspots were also identified, including a 12-bp G/C-rich motif with 3-bp periodicity that may interact with a protein containing 11 predicted zinc finger arrays.

Conclusions

These results show that the P. falciparum genome has a high recombination rate, although it also follows the overall rule of meiosis in eukaryotes with an average of approximately one crossover per chromosome per meiosis. GC-rich repetitive motifs identified in the hotspot sequences may play a role in the high recombination rate observed. The lack of recombination activity in centromeric regions is consistent with the observations of reduced recombination near the centromeres of other organisms.

15.

Background

Mounting evidence indicates that HLA-mediated HIV evolution follows highly stereotypic pathways that result in HLA-associated footprints in HIV at the population level. However, it is not known whether characteristic HLA frequency distributions in different populations have resulted in additional unique footprints.

Methods

The phylogenetic dependency network model was applied to assess HLA-mediated evolution in datasets of HIV pol sequences from free plasma viruses and peripheral blood mononuclear cell (PBMC)-integrated proviruses in an immunogenetically unique cohort of Mexican individuals. Our data were compared with data from the IHAC cohort, a large multi-center cohort of individuals from Canada, Australia and the USA.

Results

Forty three different HLA-HIV codon associations representing 30 HLA-HIV codon pairs were observed in the Mexican cohort (q < 0.2). Strikingly, 23 (53%) of these associations differed from those observed in the well-powered IHAC cohort, strongly suggesting the existence of unique characteristics in HLA-mediated HIV evolution in the Mexican cohort. Furthermore, 17 of the 23 novel associations involved HLA alleles whose frequencies were not significantly different from those in IHAC, suggesting that their detection was not due to increased statistical power but to differences in patterns of epitope targeting. Interestingly, the consensus differed in four positions between the two cohorts and three of these positions could be explained by HLA-associated selection. Additionally, different HLA-HIV codon associations were seen when comparing HLA-mediated selection in plasma viruses and PBMC archived proviruses at the population level, with a significantly lower number of associations in the proviral dataset.

Conclusion

Our data support universal HLA-mediated HIV evolution at the population level, resulting in detectable HLA-associated footprints in the circulating virus. However, they also strongly suggest that unique genetic backgrounds in different HIV-infected populations may influence HIV evolution in a particular direction, as particular HLA-HIV codon associations are determined by specific HLA frequency distributions. Our analysis also suggests a dynamic HLA-associated evolution in HIV, with fewer HLA-HIV codon associations observed in the proviral compartment, which is likely enriched in early archived HIV sequences, compared to the plasma virus compartment. These results highlight the importance of comparative HIV evolutionary studies in immunologically different populations worldwide.

16.
We propose a genealogy-sampling algorithm, Sequential Markov Ancestral Recombination Tree (SMARTree), that provides an approach to estimation from SNP haplotype data of the patterns of coancestry across a genome segment among a set of homologous chromosomes. To enable analysis across longer segments of genome, the sequence of coalescent trees is modeled via the modified sequential Markov coalescent (Marjoram and Wall, Genetics 7:16, 2006). To assess performance in estimating these local trees, our SMARTree implementation is tested on simulated data. Our base data set is of the SNPs in 10 DNA sequences over 50 kb. We examine the effects of longer sequences and of more sequences, and of a recombination and/or mutational hotspot. The model underlying SMARTree is an approximation to the full recombinant-coalescent distribution. However, in a small trial on simulated data, recovery of local trees was similar to that of LAMARC (Kuhner et al. Genetics 156:1393-1401, 2000a), a sampler which uses the full model.

17.
Generalized linear mixed model for segregation distortion analysis   Total citations: 1, self-citations: 0, citations by others: 1

Background

Concerted evolution refers to the pattern in which copies of multigene families show high intraspecific sequence homogeneity but high interspecific sequence diversity. Sequence homogeneity of these copies depends on relative rates of mutation and recombination, including gene conversion and unequal crossing over, between misaligned copies. The internally repetitive intergenic spacer (IGS) is located between the genes for the 28S and 18S ribosomal RNAs. To identify patterns of recombination and/or homogenization within IGS repeat arrays, and to identify regions of the IGS that are under functional constraint, we analyzed 13 complete IGS sequences from 10 individuals representing four species in the Daphnia pulex complex.

Results

Gene conversion and unequal crossing over between misaligned IGS repeats generate variation in copy number between arrays, as has been observed in previous studies. Moreover, terminal repeats are rarely involved in these events. Despite the occurrence of recombination, orthologous repeats in different species are more similar to one another than are paralogous repeats within species that diverged less than 4 million years ago. Patterns consistent with concerted evolution of these repeats were observed between species that diverged 8-10 million years ago. Sequence homogeneity varies along the IGS; the most homogeneous regions are downstream of the 28S rRNA gene and in the region containing the core promoter. The inadvertent inclusion of interspecific hybrids in our analysis uncovered evidence of both inter- and intrachromosomal recombination in the nonrepetitive regions of the IGS.

Conclusions

Our analysis of variation in ribosomal IGS from Daphnia shows that levels of homogeneity within and between species result from the interaction between rates of recombination and selective constraint. Consequently, different regions of the IGS are on substantially different evolutionary trajectories.

18.

Background

Steroid 21-hydroxylase deficiency is the most common cause of congenital adrenal hyperplasia (CAH). Detection of underlying mutations in CYP21A2 gene encoding steroid 21-hydroxylase enzyme is helpful both for confirmation of diagnosis and management of CAH patients. Here we report a novel 9-bp insertion in CYP21A2 gene and its structural and functional consequences on P450c21 protein by molecular modeling and molecular dynamics simulations methods.

Methods

A 30-day-old child was referred to our laboratory for molecular diagnosis of CAH. Sequencing of the entire CYP21A2 gene revealed a novel insertion (duplication) of 9-bp in exon 2 of one allele and a well-known mutation I172N in exon 4 of other allele. Molecular modeling and simulation studies were carried out to understand the plausible structural and functional implications caused by the novel mutation.

Results

Insertion of the nine bases in exon 2 resulted in the addition of three valine residues at codon 71 of the P450c21 protein. Molecular dynamics simulations revealed that the mutant exhibits faster unfolding kinetics and an overall destabilization of the structure due to the triple valine insertion.

Conclusion

The novel 9-bp insertion in exon 2 of the CYP21A2 gene significantly lowers the structural stability of P450c21, thereby leading to the probable loss of its function.
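
Why a 9-bp insertion adds exactly three residues without shifting the reading frame can be shown with a toy translation. The sequences below are hypothetical and are not taken from CYP21A2; the abstract does not give the inserted bases, so GTG (a valine codon in the standard genetic code) is used purely for illustration, with a deliberately partial codon table.

```python
# Partial standard genetic code, limited to the codons used in this toy example.
CODON_TABLE = {"ATG": "M", "GTG": "V", "GCT": "A", "AAA": "K", "TAA": "*"}


def translate(dna):
    """Translate complete codons of a DNA string using the partial table above."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def insert_bases(dna, position, inserted):
    """Return dna with `inserted` placed before the base at index `position`."""
    return dna[:position] + inserted + dna[position:]


if __name__ == "__main__":
    reference = "ATGGCTAAATAA"                           # hypothetical: M A K stop
    mutant = insert_bases(reference, 3, "GTGGTGGTG")     # 9 bp = 3 in-frame codons
    print(translate(reference))  # MAK*
    print(translate(mutant))     # MVVVAK*
```

Because nine is a multiple of three, the downstream codons are unchanged, which is consistent with the abstract attributing the effect to the inserted valines rather than to a frameshift.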

19.

Background

We present a C++ class library for Monte Carlo simulation of molecular systems, including proteins in solution. The design is generic and highly modular, enabling multiple developers to easily implement additional features. The statistical mechanical methods are documented by extensive use of code comments that – subsequently – are collected to automatically build a web-based manual.

Results

We show how an object oriented design can be used to create an intuitively appealing coding framework for molecular simulation. This is exemplified in a minimalistic C++ program that can calculate protein protonation states. We further discuss performance issues related to high level coding abstraction.

Conclusion

C++ and the Standard Template Library (STL) provide a high-performance platform for generic molecular modeling. Automatic generation of code documentation from inline comments has proven particularly useful in that no separate manual needs to be maintained.

20.
The COG database: an updated version includes eukaryotes   Total citations: 4, self-citations: 0, citations by others: 4

Background

The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.

Results

We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.

Conclusion

The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
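
The coverage figures quoted in the Results can be checked with simple arithmetic. The sketch below uses only the counts stated in the abstract; the helper name is an assumption.

```python
def coverage_percent(included, total):
    """Percentage of `total` proteins that fall into orthologous groups."""
    return 100.0 * included / total


if __name__ == "__main__":
    cog = coverage_percent(138_458, 185_505)  # proteins in COGs / proteins in 66 genomes
    kog = coverage_percent(59_838, 110_655)   # proteins in KOGs / eukaryotic gene products
    print(round(cog, 1))  # 74.6, reported as 75%
    print(round(kog, 1))  # 54.1, reported as ~54%
```

Both values agree with the rounded percentages given in the abstract.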
