首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Localizing triplet periodicity in DNA and cDNA sequences   总被引:1,自引:0,他引:1  

Background  

The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to fact that coding exons contain a series of three nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied on each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.  相似文献   

2.
The mouse (Mus musculus) is the premier animal model for understanding human disease and development. Here we show that a comprehensive understanding of mouse biology is only possible with the availability of a finished, high-quality genome assembly. The finished clone-based assembly of the mouse strain C57BL/6J reported here has over 175,000 fewer gaps and over 139 Mb more of novel sequence, compared with the earlier MGSCv3 draft genome assembly. In a comprehensive analysis of this revised genome sequence, we are now able to define 20,210 protein-coding genes, over a thousand more than predicted in the human genome (19,042 genes). In addition, we identified 439 long, non–protein-coding RNAs with evidence for transcribed orthologs in human. We analyzed the complex and repetitive landscape of 267 Mb of sequence that was missing or misassembled in the previously published assembly, and we provide insights into the reasons for its resistance to sequencing and assembly by whole-genome shotgun approaches. Duplicated regions within newly assembled sequence tend to be of more recent ancestry than duplicates in the published draft, correcting our initial understanding of recent evolution on the mouse lineage. These duplicates appear to be largely composed of sequence regions containing transposable elements and duplicated protein-coding genes; of these, some may be fixed in the mouse population, but at least 40% of segmentally duplicated sequences are copy number variable even among laboratory mouse strains. Mouse lineage-specific regions contain 3,767 genes drawn mainly from rapidly-changing gene families associated with reproductive functions. The finished mouse genome assembly, therefore, greatly improves our understanding of rodent-specific biology and allows the delineation of ancestral biological functions that are shared with human from derived functions that are not.  相似文献   

3.

Background

Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly.

Results

WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm.

Conclusions

Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.  相似文献   

4.
While genome sequencing efforts reveal the basic building blocksof life, a genome sequence alone is insufficient for elucidatingbiological function. Genome annotation—the process ofidentifying genes and assigning function to each gene in a genomesequence—provides the means to elucidate biological functionfrom sequence. Current state-of-the-art high-throughput genomeannotation uses a combination of comparative (sequence similaritydata) and non-comparative (ab initio gene prediction algorithms)methods to identify protein-coding genes in genome sequences.Because approaches used to validate the presence of predictedprotein-coding genes are typically based on expressed RNA sequences,they cannot independently and unequivocally determine whethera predicted protein-coding gene is translated into a protein.With the ability to directly measure peptides arising from expressedproteins, high-throughput liquid chromatography-tandem massspectrometry-based proteomics approaches can be used to verifycoding regions of a genomic sequence. Here, we highlight severalways in which high-throughput tandem mass spectrometry-basedproteomics can improve the quality of genome annotations andsuggest that it could be efficiently applied during the genecalling process so that the improvements are propagated throughthe subsequent functional annotation process.   相似文献   

5.
Prediction of protein-coding regions and other features of primary DNA sequence have greatly contributed to experimental biology. Significant challenges remain in genome annotation methods, including the identification of small or overlapping genes and the assessment of mRNA splicing or unconventional translation signals in expression. We have employed a combined analysis of compositional biases and conservation together with frame-specific G+C representation to reevaluate and annotate the genome sequences of mouse and rat cytomegaloviruses. Our analysis predicts that there are at least 34 protein-coding regions in these genomes that were not apparent in earlier annotation efforts. These include 17 single-exon genes, three new exons of previously identified genes, a newly identified four-exon gene for a lectin-like protein (in rat cytomegalovirus), and 10 probable frameshift extensions of previously annotated genes. This expanded set of candidate genes provides an additional basis for investigation in cytomegalovirus biology and pathogenesis.  相似文献   

6.
Viruses employ various means to evade immune detection. Reduction of CD8(+) T cell epitopes is one of the common strategies used for this purpose. Hepatitis B virus (HBV), a member of the Hepadnaviridae family, has four open reading frames, with about 50% overlap between the genes they encode. We computed the CD8(+) T cell epitope density within HBV proteins and the mutations within the epitopes. Our results suggest that HBV accumulates escape mutations that reduce the number of epitopes. These mutations are not equally distributed among genes and reading frames. While the highly expressed core and X proteins are selected to have low epitope density, polymerase, which is expressed at low levels, does not undergo the same selection. In overlapping regions, mutations in one protein-coding sequence also affect the other protein-coding sequence. We show that mutations lead to the removal of epitopes in X and surface proteins even at the expense of the addition of epitopes in polymerase. The total escape mutation rate for overlapping regions is lower than that for nonoverlapping regions. The lower epitope replacement rate for overlapping regions slows the evolutionary escape rate of these regions but leads to the accumulation of mutations more robust in the transfer between hosts, such as mutations preventing proteasomal cleavage into epitopes.  相似文献   

7.
We develop an approximate maximum likelihood method to estimate flanking nucleotide context-dependent mutation rates and amino acid exchange-dependent selection in orthologous protein-coding sequences and use it to analyze genome-wide coding sequence alignments from mammals and yeast. Allowing context-dependent mutation provides a better fit to coding sequence data than simpler (context-independent or CpG "hotspot") models and significantly affects selection parameter estimates. Allowing asymmetric (nonreciprocal) selection on amino acid exchanges gives a better fit than simple dN/dS or symmetric selection models. Relative selection strength estimates from our models show good agreement with independent estimates derived from human disease-causing and engineered mutations. Selection strengths depend on local protein structure, showing expected biophysical trends in helical versus nonhelical regions and increased asymmetry on polar-hydrophobic exchanges with increased burial. The more stringent selection that has previously been observed for highly expressed proteins is primarily concentrated in buried regions, supporting the notion that such proteins are under stronger than average selection for stability. Our analyses indicate that a highly parameterized model of mutation and selection is computationally tractable and is a useful tool for exploring a variety of biological questions concerning protein and coding sequence evolution.  相似文献   

8.
We determined the complete mitochondrial DNA (mtDNA) sequence of a fluke, Paramphistomum cervi (Digenea: Paramphistomidae). This genome (14,014 bp) is slightly larger than that of Clonorchis sinensis (13,875 bp), but smaller than those of other digenean species. The mt genome of P. cervi contains 12 protein-coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes and 2 non-coding regions (NCRs), a complement consistent with those of other digeneans. The arrangement of protein-coding and ribosomal RNA genes in the P. cervi mitochondrial genome is identical to that of other digeneans except for a group of Schistosoma species that exhibit a derived arrangement. The positions of some transfer RNA genes differ. Bayesian phylogenetic analyses, based on concatenated nucleotide sequences and amino-acid sequences of the 12 protein-coding genes, placed P. cervi within the Order Plagiorchiida, but relationships depicted within that order were not quite as expected from previous studies. The complete mtDNA sequence of P. cervi provides important genetic markers for diagnostics, ecological and evolutionary studies of digeneans.  相似文献   

9.
Vlahovicek K  Munteanu MG  Pongor S 《Genetica》1999,106(1-2):63-73
Bending is a local conformational micropolymorphism of DNA in which the original B-DNA structure is only distorted but not extensively modified. Bending can be predicted by simple static geometry models as well as by a recently developed elastic model that incorporate sequence dependent anisotropic bendability (SDAB). The SDAB model qualitatively explains phenomena including affinity of protein binding, kinking, as well as sequence-dependent vibrational properties of DNA. The vibrational properties of DNA segments can be studied by finite element analysis of a model subjected to an initial bending moment. The frequency spectrum is obtained by applying Fourier analysis to the displacement values in the time domain. This analysis shows that the spectrum of the bending vibrations quite sensitively depends on the sequence, for example the spectrum of a curved sequence is characteristically different from the spectrum of straight sequence motifs of identical basepair composition. Curvature distributions are genome-specific, and pronounced differences are found between protein-coding and regulatory regions, respectively, that is, sites of extreme curvature and/or bendability are less frequent in protein-coding regions. A WWW server is set up for the prediction of curvature and generation of 3D models from DNA sequences (http://www.icgeb.trieste.it/dna).This revised version was published online in October 2005 with corrections to the Cover Date.  相似文献   

10.
《遗传学报》2020,47(1):49-60
Noncoding RNAs(ncRNAs) play important roles in many biological processes and provide materials for evolutionary adaptations beyond protein-coding genes, such as in the arms race between the host and pathogen. However, currently, a comprehensive high-resolution analysis of primate genomes that includes the latest annotated ncRNAs is not available. Here, we developed a computational pipeline to estimate the selections that act on noncoding regions based on comparisons with a large number of reference sequences in introns adjacent to the interested regions. Our method yields result comparable with those of the established codon-based method and phyloP method for coding genes; thus, it provides a holistic framework for estimating the selection on the entire genome. We further showed that fastevolving protein-coding genes and their corresponding 50 UTRs have a significantly lower frequency of the CpG dinucleotides than those evolving at an average pace, and these fast-evolving genes are enriched in the process of immunity and host defense. We also identified fast-evolving miRNAs with antiviral functions in cells. Our results provide a resource for high-resolution evolution analysis of the primate genomes.  相似文献   

11.
DAGchainer: a tool for mining segmental genome duplications and synteny   总被引:8,自引:0,他引:8  
SUMMARY: Given the positions of protein-coding genes along genomic sequence and probability values for protein alignments between genes, DAGchainer identifies chains of gene pairs sharing conserved order between genomic regions, by identifying paths through a directed acyclic graph (DAG). These chains of collinear gene pairs can represent segmentally duplicated regions and genes within a single genome or syntenic regions between related genomes. Automated mining of the Arabidopsis genome for segmental duplications illustrates the use of DAGchainer.  相似文献   

12.
To elucidate the primary structure of preproenkephalin (A) mRNA expressed by haploid germ cells (round spermatids) in rat testis, we have screened a lambda gt11 cDNA library for preproenkephalin cDNA inserts. The largest cDNA insert contained a protein-coding sequence encoding 269 amino acid residues as well as 327 and 309 bases of the 5'- and 3'-untranslated regions, respectively. The protein-coding region plus 3'-untranslated region of the mRNA was over 99% homologous to that of brain preproenkephalin mRNA, whereas the 5'-untranslated region contained a distinct sequence including a partial sequence of intron A of the preproenkephalin gene [(1984) J. Biol. Chem. 259, 14301-14308; (1984) J. Biol. Chem. 259, 14309-14313]. Northern blot analysis using a 5'-end-specific probe showed that this type of preproenkephalin mRNA exists exclusively in the germ cells.  相似文献   

13.
Complete structure of the chloroplast genome of Arabidopsis thaliana.   总被引:7,自引:0,他引:7  
The complete nucleotide sequence of the chloroplast genome of Arabidopsis thaliana has been determined. The genome as a circular DNA composed of 154,478 bp containing a pair of inverted repeats of 26,264 bp, which are separated by small and large single copy regions of 17,780 bp and 84,170 bp, respectively. A total of 87 potential protein-coding genes including 8 genes duplicated in the inverted repeat regions, 4 ribosomal RNA genes and 37 tRNA genes (30 gene species) representing 20 amino acid species were assigned to the genome on the basis of similarity to the chloroplast genes previously reported for other species. The translated amino acid sequences from respective potential protein-coding genes showed 63.9% to 100% sequence similarity to those of the corresponding genes in the chloroplast genome of Nicotiana tabacum, indicating the occurrence of significant diversity in the chloroplast genes between two dicot plants. The sequence data and gene information are available on the World Wide Web database KAOS (Kazusa Arabidopsis data Opening Site) at http://www.kazusa.or.jp/arabi/.  相似文献   

14.
Datura stramonium is a widely used poisonous plant with great medicinal and economic value. Its chloroplast (cp) genome is 155,871 bp in length with a typical quadripartite structure of the large (LSC, 86,302 bp) and small (SSC, 18,367 bp) single-copy regions, separated by a pair of inverted repeats (IRs, 25,601 bp). The genome contains 113 unique genes, including 80 protein-coding genes, 29 tRNAs and four rRNAs. A total of 11 forward, 9 palindromic and 13 tandem repeats were detected in the D. stramonium cp genome. Most simple sequence repeats (SSR) are AT-rich and are less abundant in coding regions than in non-coding regions. Both SSRs and GC content were unevenly distributed in the entire cp genome. All preferred synonymous codons were found to use A/T ending codons. The difference in GC contents of entire genomes and of the three-codon positions suggests that the D. stramonium cp genome might possess different genomic organization, in part due to different mutational pressures. The five most divergent coding regions and four non-coding regions (trnH-psbA, rps4-trnS, ndhD-ccsA, and ndhI-ndhG) were identified using whole plastome alignment, which can be used to develop molecular markers for phylogenetics and barcoding studies within the Solanaceae. Phylogenetic analysis based on 68 protein-coding genes supported Datura as a sister to Solanum. This study provides valuable information for phylogenetic and cp genetic engineering studies of this poisonous and medicinal plant.  相似文献   

15.
Microsatellites within genes: structure, function, and evolution   总被引:39,自引:0,他引:39  
  相似文献   

16.
Molecular characterizations of bacteria often employ ribosomal DNA (rDNA) to establish the identity and relationships among organisms, but the use of rRNA sequences can be problematic as the result of alignment ambiguities caused by indels, the lack of informative characters, and varying functional constraints over the molecule. Although protein-coding regions have been used as an alternative to rRNA, there is neither consensus among the genes examined nor ways to rapidly obtain sequence information for such genes from uncharacterized bacterial species. To standardize the set of protein-coding loci assayed in bacterial genomes, we examined over 100 widely distributed genes to identify sets of universal primers for use in the PCR amplification of protein coding regions that are common to virtually all bacteria. From this set, we developed primer sets that each target of 10 genes spanning an array of genomic locations and functional categories. Although many of the primers contain sequence degeneracies that aid in targeting genes across diverse taxa, most are adequate for direct sequencing of amplification products, thereby eliminating intermediate cloning before sequence determination. We foresee the analysis of these protein-coding regions as being complementary to ribosomal DNA for answering questions pertaining to bacterial identification, classification, phylogenetics and evolution.  相似文献   

17.
Constructing multiple homologous alignments for protein-coding DNA sequences is crucial for a variety of bioinformatic analyses but remains computationally challenging. With the growing amount of sequence data available and the ongoing efforts largely dependent on protein-coding DNA alignments, there is an increasing demand for a tool that can process a large number of homologous groups and generate multiple protein-coding DNA alignments. Here we present a parallel tool - ParaAT that is capable of parallelly constructing multiple protein-coding DNA alignments for a large number of homologs. As testified on empirical datasets, ParaAT is well suited for large-scale data analysis in the high-throughput era, providing good scalability and exhibiting high parallel efficiency for computationally demanding tasks. ParaAT is freely available for academic use only at http://cbb.big.ac.cn/software.  相似文献   

18.
A total of sixty-two clones were selected from a TAC (transformation-competent artificial chromosome) genomic library of the Lotus japonicus accession MG-20 based on the sequence information of expressed sequence tags (ESTs), cDNA and gene information, and their nucleotide sequences were determined. The length of the sequenced regions in this study is 6,682,189 bp, and the total length of the regions sequenced so far is 18,711,484 bp together with the nucleotide sequences of 121 TAC clones previously reported. By comparison with the sequences in protein and EST databases and analysis with computer programs for gene modeling, a total of 573 potential protein-coding genes with known or predicted functions, 91 gene segments and 272 pseudogenes were identified in the newly sequenced regions. Each of the sequenced clones was localized onto the linkage map of two accessions of L. japonicus, Gifu B-129 and Miyakojima MG-20, using simple sequence repeat length polymorphism (SSLP) or derived cleaved amplified polymorphic sequence (dCAPS) markers generated based on the nucleotide sequences of the clones. The sequence data, gene information and mapping information are available through the World Wide Web at http://www.kazusa.or.jp/lotus/.  相似文献   

19.
20.
GCWIND is a microcomputer (IBM-PC compatible) program for theidentification of protein-coding open reading frames. The programis similar to the FRAME program, but the latter has only beenimplemented for a specialized graphics package. The base compositions(%G+C)for each of the three possible reading phases throughthe DNA sequence are displayed separately, together with thepositions of potential translation initiation and terminationcodons (on the leading and complementary strands), to providean immediate representation of those regions within the sequencethat have coding potential.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号