Similar literature
1.
It has been hypothesized that the length of an exon tends to increase with the GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction is applicable not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes in eight eukaryotic species into three groups (the first exon, the internal exons, and the last exon) and computed the Spearman correlation between the exon length and the percentage GC (%GC) for each of the three groups. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotic genomes, CDS length and %GC are positively correlated in each of the 68 completely sequenced prokaryotic genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).
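The statistics named in this abstract (%GC, Spearman correlation, and a partial correlation that controls for a third variable) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the input format is an assumption, and the partial correlation uses the standard first-order formula, which may differ from the exact procedure in the paper.

```python
from statistics import mean


def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
```

Under these assumptions, `spearman(lengths, [gc_percent(s) for s in seqs])` would give the length versus %GC correlation for one exon group, and `partial_correlation` would be called with the three pairwise correlations among average CDS length, genomic GC content, and genome size.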

2.
3.
4.
5.
6.
7.
8.
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such a highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream of which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular, it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
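The scanning idea described in this abstract can be sketched in Python under strong simplifying assumptions. The sketch treats exons as hypothetical `(acceptor, donor, score)` tuples, uses an additive score, and considers two exons compatible whenever they do not overlap; frame compatibility and the Gene Model are deliberately left out, so this is an illustration of the stored-and-updated best score, not the published algorithm. The function name `best_gene_score` is also an assumption.

```python
def best_gene_score(exons):
    """Return the highest additive score of a chain of nonoverlapping exons.

    exons: list of (acceptor, donor, score) tuples with acceptor <= donor.
    Simplification: any exon whose donor lies strictly before the next
    exon's acceptor is treated as a compatible predecessor.
    """
    n = len(exons)
    # The two orderings the abstract refers to: by acceptor and by donor.
    by_acceptor = sorted(range(n), key=lambda i: exons[i][0])
    by_donor = sorted(range(n), key=lambda i: exons[i][1])

    best_ending_at = [0.0] * n  # best gene score ending in each exon
    best_upstream = 0.0         # best gene ending before the current acceptor
    overall = 0.0               # 0.0 corresponds to predicting no gene
    j = 0                       # pointer into the donor-sorted list

    for i in by_acceptor:
        acceptor, _donor, score = exons[i]
        # Fold in every exon that ends before this acceptor. Each such exon
        # has a smaller acceptor, so its best score is already final.
        while j < n and exons[by_donor[j]][1] < acceptor:
            best_upstream = max(best_upstream, best_ending_at[by_donor[j]])
            j += 1
        best_ending_at[i] = score + best_upstream
        overall = max(overall, best_ending_at[i])
    return overall
```

Once the two sorted orders are available, each exon is visited once by the outer loop and once by the inner pointer, so the scan itself is linear in the number of exons; the sorting step in this sketch adds its own cost, and recovering the exons of the best gene would additionally require storing back-pointers.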

9.
Intron 3 and the flanking exons of the calmodulin gene have been amplified, cloned, and sequenced from 18 members of the gastropod genus Littorina. From the 48 sequences, at least five different gene copies have been identified and their functionality characterized using a strategy based upon the potential protein product predicted from flanking exon data. The functionality analyses suggest that four of the genes code for functional copies of calmodulin. All five copies have been identified across a wide range of littorinid species although not ubiquitously. Using this novel approach based on intron sequences, we have identified an unprecedented number of potential calmodulin copies in Littorina, exceeding that reported for any other invertebrate. This suggests a higher number of, and more ancient, gene duplications than previously detected in a single genus. Reviewing Editor: Dr. Debashish Bhattacharya

10.
The accelerated rate of genomic sequencing has led to an abundance of completely sequenced genomes. Annotation of the open reading frames (ORFs) (i.e., gene prediction) in these genomes is an important task and is most often performed computationally based on features in the nucleic acid sequence. Using recent advances in proteomics, we set out to predict the set of ORFs for an organism based principally on expressed protein-based evidence. Using a novel search strategy, we mapped peptides detected in a whole-cell lysate of Mycoplasma pneumoniae onto a genomic scaffold and extended these "hits" into ORFs bounded by traditional genetic signals to generate a "proteogenomic map". We were able to generate an ORF model for M. pneumoniae strain FH using proteomic data with a high correlation to models based on sequence features. Ultimately, we detected over 81% of the genomically predicted ORFs in M. pneumoniae strain M129 (the originally sequenced strain). We were also able to detect several new ORFs not originally predicted by genomic methods, various N-terminal extensions, and some evidence that would suggest that certain predicted ORFs are bogus. Some of these differences may be a result of the strain analyzed but demonstrate the robustness of protein analysis across closely related genomes. This technique is a cost-effective means to add value to genome annotation, and a prerequisite for proteome quantitation and in vivo interaction measures.

11.
Rapidly evolving proteins can aid the identification of genes underlying phenotypic adaptation across taxa, but functional and structural elements of genes can also affect evolutionary rates. In plants, the ‘edges’ of exons, flanking intron junctions, are known to contain splice enhancers and to have a higher degree of conservation compared to the remainder of the coding region. However, the extent to which these regions may be masking indicators of positive selection or account for the relationship between dN/dS and other genomic parameters is unclear. We investigate the effects of exon edge conservation on the relationship of dN/dS to various sequence characteristics and gene expression parameters in the model plant Arabidopsis thaliana. We also obtain lineage‐specific dN/dS estimates, making use of the recently sequenced genome of Thellungiella parvula, the second closest sequenced relative after the sister species Arabidopsis lyrata. Overall, we find that the effect of exon edge conservation, as well as the use of lineage‐specific substitution estimates, upon dN/dS ratios partly explains the relationship between the rates of protein evolution and expression level. Furthermore, the removal of exon edges shifts dN/dS estimates upwards, increasing the proportion of genes potentially under adaptive selection. We conclude that lineage‐specific substitutions and exon edge conservation have an important effect on dN/dS ratios and should be considered when assessing their relationship with other genomic parameters.

12.
The small genome size (740 Mb), short life cycle (3 months) and high economic importance as a food crop legume make chickpea (Cicer arietinum L.) an important system for genomics research. Although several genetic linkage maps using various markers and genomic tools have become available, sequencing efforts and their use are limited in chickpea genomic research. In this study, we explored the genome organization of chickpea by sequencing approximately 500 kb from 11 BAC clones (three representing ascochyta blight resistance QTL1 (ABR-QTL1) and eight randomly selected BAC clones). Our analysis revealed that these sequenced chickpea genomic regions have a gene density of one per 9.2 kb, an average gene length of 2,500 bp, an average of 4.7 exons per gene, with an average exon and intron size of 401 and 316 bp, respectively, and approximately 8.6% repetitive elements. Other features analyzed included exon and intron length, number of exons per gene, protein length and GC content. Although there are reports on high synteny among legume genomes, the microsynteny between the 500 kb chickpea and available Medicago truncatula genomic sequences varied depending on the region analyzed. The GBrowse-based annotation of these BACs is available at http://www.genome.ou.edu/plants_totals.html. We believe that our work provides significant information that supports a chickpea genome sequencing effort in the future.

13.
Kim DW, Choi SH, Kim RN, Kim SH, Paik SG, Nam SH, Kim DW, Kim A, Kang A, Park HS. Génome, 2010, 53(9): 658-666
The sequencing and comparative genomic analysis of LMBR1 loci in mammals or other species, including human, would be very important in understanding evolutionary genetic changes underlying the evolution of limb development. In this regard, comparative genomic annotation of the false killer whale LMBR1 locus could shed new light on the evolution of limb development. We sequenced two false killer whale BAC clones, corresponding to 156 kb and 144 kb, respectively, harboring the tightly linked RNF32, LMBR1, and NOM1 genes. Our annotation of the false killer whale LMBR1 gene showed that it consists of 17 exons (1473 bp), in contrast to 18 exons (1596 bp) in human, and it displays 93.1% and 95.6% nucleotide and amino acid sequence similarity, respectively, compared with the human gene. In particular, we discovered that exon 10, deleted in the false killer whale LMBR1 gene, is present only in primates, and this fact strongly implies that exon 10 might be crucial in determining primate-specific limb development. ZRS and TFBS sequences have been well conserved across 11 species, suggesting that these regions could be involved in an important function of limb development and limb patterning. The neighboring gene RNF32 showed several lineage-conserved exons, such as exons 2 through 9 conserved in eutherian mammals, exons 3 through 9 conserved in mammals, and exons 5 through 9 conserved in vertebrates. The other neighboring gene, NOM1, had undergone a substitution (ATG→GTA) at the start codon, giving rise to a 36 bp shorter N-terminal sequence compared with the human sequence. Our comparative analysis of the false killer whale LMBR1 genomic locus provides important clues regarding the genetic regions that may play crucial roles in limb development and patterning.

14.
Genomic sequencing of avian haemosporidian parasites (Haemosporida) has been challenging due to excessive contamination from host DNA. In this study, we developed a cost-effective protocol to obtain parasite sequences from naturally infected birds, based on targeted sequence capture and next generation sequencing. With the genomic data of Haemoproteus tartakovskyi as a reference, we successfully sequenced up to 1000 genes from each of the 15 selected samples belonging to nine different cytochrome b lineages, eight of which belong to Haemoproteus and one to Plasmodium. The targeted sequences were enriched to ~10⁴-fold, and mixed infections were identified as well as the proportions of each mixed lineage. We found that the total number of reads and the proportions of exons sequenced decreased when the parasite lineage became more divergent from the reference genome. For each of the samples, the recovery of sequences from different exons varied with the function and GC content of the exon. From the obtained sequences, we detected within-lineage variation in both mitochondrial and nuclear genes, which may be a result of local adaptation to different host species and environmental conditions. This targeted sequence capture protocol can be applied to a broader range of species and will open a new door for further studies on disease diagnostics and comparative analysis of haemosporidian evolution.

15.
The genome sequencing of H37Rv strain of Mycobacterium tuberculosis was completed in 1998 followed by the whole genome sequencing of a clinical isolate, CDC1551 in 2002. Since then, the genomic sequences of a number of other strains have become available making it one of the better studied pathogenic bacterial species at the genomic level. However, annotation of its genome remains challenging because of high GC content and dissimilarity to other model prokaryotes. To this end, we carried out an in-depth proteogenomic analysis of the M. tuberculosis H37Rv strain using Fourier transform mass spectrometry with high resolution at both MS and tandem MS levels. In all, we identified 3176 proteins from Mycobacterium tuberculosis representing ~80% of its total predicted gene count. In addition to protein database search, we carried out a genome database search, which led to identification of ~250 novel peptides. Based on these novel genome search-specific peptides, we discovered 41 novel protein coding genes in the H37Rv genome. Using peptide evidence and alternative gene prediction tools, we also corrected 79 gene models. Finally, mass spectrometric data from N terminus-derived peptides confirmed 727 existing annotations for translational start sites while correcting those for 33 proteins. We report creation of a high confidence set of protein coding regions in Mycobacterium tuberculosis genome obtained by high resolution tandem mass-spectrometry at both precursor and fragment detection steps for the first time. This proteogenomic approach should be generally applicable to other organisms whose genomes have already been sequenced for obtaining a more accurate catalogue of protein-coding genes.

16.
17.

Background  

Data from a number of completely sequenced eukaryotic genomes are available in the public domain. Eukaryotic genes are either 'intron containing' or 'intronless'. Eukaryotic 'intronless' genes are interesting datasets for comparative genomics and evolutionary studies. The SEGE database containing a collection of eukaryotic single exon genes is available. However, SEGE is derived using GenBank. The redundant, incomplete and heterogeneous qualities of GenBank data are a bottleneck for biological investigation in comparative genomics and evolutionary studies. Such studies often require representative gene sets from each genome and this is possible only by deriving specific datasets from completely sequenced genome data. Thus Genome SEGE, a database for 'intronless' genes in completely sequenced eukaryotic genomes, has been constructed.

18.
The split structure of most mammalian protein-coding genes allows for the potential to produce multiple different mRNA and protein isoforms from a single gene locus through the process of alternative splicing (AS). We propose a computational approach called UNCOVER based on a pair hidden Markov model to discover conserved coding exonic sequences subject to AS that have so far gone undetected. Applying UNCOVER to orthologous introns of known human and mouse genes predicts skipped exons or retained introns present in both species, while discriminating them from conserved noncoding sequences. The accuracy of the model is evaluated on a curated set of genes with known conserved AS events. The prediction of skipped exons in the approximately 1% of the human genome represented by the ENCODE regions leads to more than 50 new exon candidates. Five novel predicted AS exons were validated by RT-PCR and sequencing analysis of 15 introns with strong UNCOVER predictions and lacking EST evidence. These results imply that a considerable number of conserved exonic sequences and associated isoforms are still completely missing from the current annotation of known genes. UNCOVER also identifies a small number of candidates for conserved intron retention.

19.
20.