Similar literature
1.
De novo origin of coding sequence remains an obscure issue in molecular evolution. One possible path for adding DNA segments to a gene, or subtracting them from it, is a stop codon shift. A single-nucleotide substitution can destroy the existing stop codon, leading to uninterrupted translation up to the next stop codon in the gene's reading frame, or create a premature stop codon via a nonsense mutation. Furthermore, frameshifts caused by short indels near a gene's end may lead to premature stop codons or to translation past the existing stop codon. Here, we describe how the length of the coding sequence of prokaryotic genes evolves through changes in the positions of stop codons. We observed cases of addition of regions of 3′UTR to genes due to mutations at the existing stop codon, and cases of subtraction of C-terminal coding segments due to nonsense mutations upstream of the stop codon. Many of the observed stop codon shifts cannot be attributed to sequencing errors or rare deleterious variants segregating within bacterial populations. The additions of regions of 3′UTR tend to occur in those genes in which they are facilitated by nearby downstream in-frame triplets which may serve as new stop codons. Conversely, subtractions of coding sequence often give rise to in-frame stop codons located nearby. The amino acid composition of the added region is significantly biased compared to the overall amino acid composition of the genes. Our results show that in prokaryotes, shift of the stop codon is an underappreciated contributor to functional evolution of gene length.
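The two substitution-driven routes described in this abstract (loss of the existing stop codon and gain of a premature one) can be illustrated with a minimal Python sketch. The function names and the toy sequence are hypothetical and are not taken from the paper; the standard stop codons TAA, TAG, and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed genetic code)


def first_in_frame_stop(seq, start=0):
    """Return the index of the first in-frame stop codon at or after `start`, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def stop_shift(seq, pos, new_base):
    """Apply one substitution and return (old_stop_index, new_stop_index)."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    return first_in_frame_stop(seq), first_in_frame_stop(mutated)


# Hypothetical toy gene followed by a few bases of 3' flanking sequence:
# ATG GCT CAA TAA | GGT TGA CC
gene = "ATGGCTCAATAAGGTTGACC"

# Destroying the stop codon (TAA -> CAA) extends translation to the next in-frame stop.
print(stop_shift(gene, 9, "C"))

# A nonsense mutation (CAA -> TAA) truncates the coding sequence.
print(stop_shift(gene, 6, "T"))
```

In the first call the stop moves downstream into the flanking region; in the second it moves upstream. Frameshifts caused by indels would need a separate treatment, since they change the reading frame rather than a single codon.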

2.
3.
The cloning of several plant genes directly involved in triggering a disease resistance response has shown that numerous resistance genes in the nucleotide binding site (NBS)/leucine-rich repeat (LRR) class have similar conserved amino acid sequences. In this study, we used a short soybean DNA sequence, previously cloned based on its conserved NBS, as a probe to identify full-length resistance gene candidates. Two homologous, but genetically independent genes were identified. One gene maps to the soybean molecular linkage group (MLG) F and a second is coded on MLG E. The first gene contains a 3,279 nucleotide open reading frame (ORF) sequence and possesses all the functional motifs characteristic of previously cloned NBS/LRR resistance genes. The N-terminal sequence of the deduced gene product is highly characteristic of other resistance genes in the subgroup of NBS/LRR genes which show homology to the Toll/Interleukin-1 receptor genes. The C-terminal region is somewhat more divergent as seen in other cloned disease resistance genes. This region of the F-linked gene contains an LRR region that is characterized by two alternatively spliced products which produce gene products with either a four-repeat or a ten-repeat LRR. The second cloned gene that maps to soybean MLG E contains 1,565 nucleotides of ORF in the N-terminal domain. Despite strong homology, however, the 3′ region of this gene contains several in-frame stop codons and apparent frame shifts compared to the F-linked gene, suggesting that its functionality as a disease resistance gene is questionable. These two disease resistance gene candidates are shown to be closely related to one another and to the members of the NBS/LRR class of disease resistance genes. Received: 29 November 1999 / Accepted: 22 December 1999

4.
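Zhong Zhi (钟智), Li Hong (李宏). Acta Biophysica Sinica (《生物物理学报》), 2008, 24(5): 379-392
Taking the 5′ UTR sequences of bacterial and archaeal genomes as the object of study, we analyzed the distribution of the AUG triplet in the three reading frames of the 5′ UTR and found that both bacterial and archaeal genomes show a pronounced depletion of AUG in reading frame 1. This depletion indicates that an AUG upstream of the start codon is likely to affect translation initiation of the gene. The analysis shows that the vast majority of AUGs occur as part of a uORF (upstream open reading frame), whereas uAUGs (upstream AUGs) are few, especially in reading frame 1; moreover, in reading frame 1 of bacterial genomes, uAUGs occur more often upstream of genes that carry an SD sequence. Comparison shows that sequences led by a uAUG have a weaker synonymous codon usage bias than true coding sequences, which may indicate that synonymous codon usage bias in bacteria and archaea is also one of the important factors determining accurate translation initiation.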
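The frame-by-frame AUG count that this abstract relies on can be sketched as below. This is only an illustration: the function name is hypothetical, and the frame convention (frame 1 taken to mean "in frame with the annotated start codon") is an assumption, since the abstract does not define its frame numbering.

```python
def aug_counts_by_frame(utr):
    """Count AUG triplets in a 5' UTR, grouped by frame relative to the start codon.

    `utr` is assumed to end immediately before the annotated start codon.
    Frame 1 is assumed here to be the frame shared with that start codon.
    """
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA input
    counts = {1: 0, 2: 0, 3: 0}
    n = len(utr)
    for i in range(n - 2):
        if utr[i:i + 3] == "AUG":
            # Distance from this triplet to the start codon, modulo 3.
            frame = (n - i) % 3 + 1
            counts[frame] += 1
    return counts


# Hypothetical 9-nt leader: the AUG at offset 0 is in frame with the start codon,
# the AUG at offset 5 is not.
print(aug_counts_by_frame("AUGCCAUGA"))
```

Summing such counts over all annotated genes of a genome, and comparing them with an expectation from base composition, would be one way to look for the depletion the abstract reports.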

5.
6.
7.
Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translations before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment. We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence. MACSE is distributed as an open-source Java executable with freely available source code and can be used via a web interface at: http://mbb.univ-montp2.fr/macse.
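For reference, the classical Needleman-Wunsch algorithm that this abstract uses as its complexity baseline can be sketched as follows. This is the textbook global alignment with a linear gap penalty, not MACSE's frameshift-aware method; the scoring values are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Classical global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Time and space are O(len(a) * len(b)).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# A single deleted nucleotide costs one gap here, but the algorithm has no
# notion of codons, so it cannot keep the reading frame intact.
print(needleman_wunsch("ATGGCT", "ATGCT"))
```

The example shows the limitation the abstract points at: a nucleotide-level aligner places the gap wherever the score is best, with no regard for codon boundaries.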

8.
ABSTRACT: BACKGROUND: Gene prediction algorithms (or gene callers) are an essential tool for analyzing shotgun nucleic acid sequence data. Gene prediction is a ubiquitous step in sequence analysis pipelines; it reduces the volume of data by identifying the most likely reading frame for a fragment, permitting the out-of-frame translations to be ignored. In this study we evaluate five widely used ab initio gene-calling algorithms (FragGeneScan, MetaGeneAnnotator, MetaGeneMark, Orphelia, and Prodigal) for accuracy on short (75-1000 bp) fragments containing sequence error from previously published artificial data and "real" metagenomic datasets. RESULTS: While gene prediction tools have similar accuracies predicting genes on error-free fragments, in the presence of sequencing errors considerable differences between tools become evident. For error-containing short reads, FragGeneScan finds more prokaryotic coding regions than do MetaGeneAnnotator, MetaGeneMark, Orphelia, or Prodigal. This improved detection of genes in error-containing fragments, however, comes at the cost of much lower (50%) specificity and overprediction of genes in noncoding regions. CONCLUSIONS: Ab initio gene callers offer a significant reduction in the computational burden of annotating individual nucleic acid reads and are used in many metagenomic annotation systems. For predicting reading frames on raw reads, we find the hidden Markov model approach in FragGeneScan is more sensitive than other gene prediction tools, while Prodigal, MGA, and MGM are better suited for higher-quality sequences such as assembled contigs.

9.
We identified 411 processed sequences in the Arabidopsis thaliana genome based on the fact that they have lost their intron(s) and have a length that is at least 95% of the length of the gene that gave rise to them. These sequences were generated by 230 different genes and clearly originated from retrotransposition events because most of them (91%) have a poly(A)-tail. They are composed of 376 sequences with frame shifts and/or premature stop codons (processed pseudogenes) and 35 sequences without disablements (processed genes). Eleven of these processed genes are likely functional retrotransposed genes because they have low Ka/Ks ratios and high Ks values, and their sequences match numerous Arabidopsis ESTs. Processed sequences are mostly randomly distributed in the Arabidopsis genome and their rate of accumulation has steadily been decreasing since it peaked some 50 MYA. In contrast with the situation observed in mammals, the processed sequences found in the Arabidopsis genome originate from genes with high copy numbers and not from highly expressed genes. The patterns of spontaneous mutations in Arabidopsis are slightly different from those of mammals but are similar to those observed in Drosophila. This suggests that methylated cytosine deamination is less frequent in Arabidopsis than in mammals. Electronic supplementary material is available for this article and accessible for authorised users. [Reviewing Editor: Dr. Juergen Brosius]

10.
Along with the rapid advances of next-generation sequencing technologies, more and more species are added to the list of organisms whose whole genomes are sequenced. However, the assembled draft genome of many organisms consists of numerous small contigs, due to the short length of the reads generated by next-generation sequencing platforms. In order to improve the assembly and bring the genome contigs together, more genome resources are needed. In this study, we developed a strategy to generate a valuable genome resource, physical map contig-specific sequences, which are randomly distributed genome sequences in each physical contig. A two-dimensional tagging method was used to create specific tags for 1,824 physical contigs, which dramatically reduced the cost. A total of 94,111,841 100-bp reads and 315,277 assembled contigs were identified as containing physical map contig-specific tags. The physical map contig-specific sequences along with the currently available BAC end sequences were then used to anchor the catfish draft genome contigs. A total of 156,457 genome contigs (~79% of the whole genome sequencing assembly) were anchored and grouped into 1,824 pools, in which 16,680 unique genes were annotated. The physical map contig-specific sequences are valuable resources to link the physical map, the genetic linkage map, and draft whole genome sequences, and consequently can improve whole genome sequence assembly and scaffolding as well as genome-wide comparative analysis. The strategy developed in this study could also be adopted in other species whose whole genome assembly is still facing a challenge.

11.
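Using degenerate primers and genomic PCR, the genomic DNA sequences of the full-length coding regions of expressed and silenced 1Ay high-molecular-weight glutenin subunit genes were obtained and determined from different germplasm accessions of Triticum urartu. The coding-region sequence of the expressed 1Ay gene is highly homologous to previously published coding-region sequences of y-type high-molecular-weight glutenin subunit genes, and the primary structure of the 1Ay subunit deduced from it is similar to that of known high-molecular-weight glutenin subunits. In bacterial cells, the cloned coding-region sequence of the expressed 1Ay gene could be induced to produce 1Ay protein, which resembled the 1Ay subunit from seeds in electrophoretic mobility and antigenicity, indicating that the cloned sequence faithfully represents the full-length coding region of the expressed 1Ay gene. However, the coding-region sequence of the silenced 1Ay gene cloned in this study contains three premature stop codons and therefore cannot be translated into a complete 1Ay protein. The potential value of the expressed 1Ay gene for improving wheat grain processing quality and the mechanism of 1Ay gene silencing are discussed.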

12.
More than 90% of human genes are rich in intronic latent 5′ splice sites whose utilization in pre-mRNA splicing would introduce in-frame stop codons into the resultant mRNAs. We have therefore hypothesized that suppression of splicing (SOS) at latent 5′ splice sites regulates alternative 5′ splice site selection in a way that prevents the production of toxic nonsense mRNAs and verified this idea by showing that the removal of such in-frame stop codons is sufficient to activate latent splicing. Splicing control by SOS requires recognition of the mRNA reading frame, presumably recognizing the start codon sequence. Here we show that AUG sequences are indeed essential for SOS. Although protein translation does not seem to be required for SOS, the first AUG is shown here to be necessary but not sufficient. We further show that latent splicing can be elicited upon treatment with pactamycin (a drug known to block translation by its ability to recognize an RNA fold) but not by treatment with other drugs that inhibit translation through other mechanisms. The effect of pactamycin on SOS is dependent neither on steady-state translation nor on the pioneer round of translation. This effect is found for both transfected and endogenous genes, indicating that SOS is a natural mechanism.

13.
Properties of mRNA leading regions that modulate protein synthesis are poorly understood (besides effects of their secondary structure). Here I explore how coding properties of leading regions may account for their disparate efficiencies. Trinucleotides that form off-frame stop codons decrease costs of ribosomal slippages during protein synthesis: protein activity (as a proxy of gene expression, and as measured in experiments using artificial variants of 5' leading sequences of beta galactosidase in Escherichia coli) increases proportionally to the number of stop motifs in any frame in the 5' leading region. This suggests that stop codons in the 5' leading region, upstream of the recognized coding sequence, terminate occasional translations that start before ribosomes reach the mRNA's recognized start codon, increasing efficiency. This hypothesis is confirmed by further analyses: mRNAs with 5' leading regions containing in the same frame a start preceding a stop codon (in any frame) produce less enzymatic activity than those with the stop preceding the start. Hence coding properties, in addition to other properties, such as the secondary structure of the 5' leading region, regulate translation. This experimentally (a) confirms that within coding regions, off-frame stops increase protein synthesis efficiency by stopping frameshifted translation early; (b) suggests that this occurs for all frames also in 5' leading regions; and (c) suggests that several alternative start codons that function at different probabilities should routinely be considered for all genes in the region of the recognized initiation codon. An unknown number of short peptides might be translated from coding and non-coding regions of RNAs.

14.
A BAC-based physical map of the channel catfish genome (cited 3 times in total: 0 self-citations, 3 citations by others)
Xu P, Wang S, Liu L, Thorsen J, Kucuktas H, Liu Z. Genomics, 2007, 90(3): 380-388
Catfish is the major aquaculture species in the United States. To enhance its genome studies involving genetic linkage and comparative mapping, a bacterial artificial chromosome (BAC) contig-based physical map of the channel catfish (Ictalurus punctatus) genome was generated using four-color fluorescence-based fingerprints. Fingerprints of 34,580 BAC clones (5.6× genome coverage) were generated for the FPC assembly of the BAC contigs. A total of 3307 contigs were assembled using a cutoff value of 1 × 10^-20. Each contig contains an average of 9.25 clones with an average size of 292 kb. The combined contig size for all contigs was 0.965 Gb, approximately the genome size of the channel catfish. The reliability of the contig assembly was assessed by both hybridization of gene probes to BAC clones contained in the fingerprinted assembly and validation of randomly selected contigs using overgo probes designed from BAC end sequences. The presented physical map should greatly enhance genome research in the catfish, particularly aiding in the identification of genomic regions containing genes underlying important performance traits.
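The figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; the function names are hypothetical, and the implied number of clones in contigs is a derived estimate, not a value reported by the authors.

```python
def combined_size_gb(n_contigs, avg_contig_kb):
    """Combined contig length in Gb, given the contig count and mean size in kb."""
    return n_contigs * avg_contig_kb / 1e6  # 1 Gb = 1e6 kb


def implied_clones_in_contigs(n_contigs, avg_clones_per_contig):
    """Estimated number of clones placed in contigs (derived, not reported)."""
    return n_contigs * avg_clones_per_contig


# Values taken from the abstract.
size = combined_size_gb(3307, 292)
clones = implied_clones_in_contigs(3307, 9.25)

print(f"combined contig size: {size:.4f} Gb")
print(f"implied clones in contigs: {clones:.0f} of 34580 fingerprinted")
```

The product of 3307 contigs and 292 kb comes to about 0.9656 Gb, consistent with the reported 0.965 Gb once rounding of the 292 kb average is taken into account.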

15.
Assembling individual genomes from complex community metagenomic data remains a challenging issue for environmental studies. We evaluated the quality of genome assemblies from community short read data (Illumina 100 bp paired-end sequences) using datasets recovered from freshwater and soil microbial communities as well as in silico simulations. Our analyses revealed that the genome of a single genotype (or species) can be accurately assembled from a complex metagenome when it shows at least about 20× coverage. At lower coverage, however, the derived assemblies contained a substantial fraction of non-target sequences (chimeras), which explains, at least in part, the higher number of hypothetical genes recovered in metagenomic relative to genomic projects. We also provide examples of how to detect intrapopulation structure in metagenomic datasets and estimate the type and frequency of errors in assembled genes and contigs from datasets of varied species complexity.
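To make the roughly 20× threshold concrete, mean coverage can be estimated as reads × read length / genome size. The sketch below assumes that simple definition (no correction for unmapped or duplicated reads); the function names and the 5 Mb genome size are hypothetical illustration values, not figures from the study.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Mean sequencing depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size


def reads_for_coverage(target, read_length, genome_size):
    """Smallest whole number of reads that reaches `target`-fold mean coverage."""
    return -(-target * genome_size // read_length)  # ceiling division on integers


# Hypothetical 5 Mb genome sequenced with 100 bp reads.
genome = 5_000_000
print(mean_coverage(1_000_000, 100, genome))
print(reads_for_coverage(20, 100, genome))
```

In a metagenome these are reads belonging to the target population only, so the total sequencing effort required scales inversely with that population's relative abundance.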

16.
Selenoprotein is biosynthesized by the incorporation of selenocysteine into proteins, where the TGA codon in the open reading frame does not act as a stop signal but is translated into selenocysteine. The dual functions of TGA result in mis-annotation or lack of selenoproteins in the sequenced genomes of many species. Available computational tools fail to correctly predict selenoproteins. Thus, we developed a new method to identify selenoproteins from the genome of Anopheles gambiae computationally. Based on released genomic information, several programs were edited with PERL language to identify selenocysteine insertion sequence (SECIS) element, the coding potential of TGA codons, and cysteine-containing homologs of selenoprotein genes. Our results showed that 11365 genes were terminated with TGA codons, 918 of which contained SECIS elements. Similarity search revealed that 58 genes contained Sec/Cys pairs and similar flanking regions around in-frame TGA codons. Finally, 7 genes were found to fully meet requirements for selenoproteins, although they have not been annotated as selenoproteins in NCBI databases. Deduced from their basic properties, the newly found selenoproteins in the genome of Anopheles gambiae are possibly related to in vivo oxidation tolerance and protein regulation in order to interfere with anopheles' vectorial capacity of Plasmodium. This study may also provide theoretical bases for the prevention of malaria from anopheles transmission.

17.
18.
Selenoprotein is biosynthesized by the incorporation of selenocysteine into proteins, where the TGA codon in the open reading frame does not act as a stop signal but is translated into selenocysteine. The dual functions of TGA result in mis-annotation or lack of selenoproteins in the sequenced genomes of many species. Available computational tools fail to correctly predict selenoproteins. Thus, we developed a new method to identify selenoproteins from the genome of Anopheles gambiae computationally. Based on released genomic information, several programs were edited with PERL language to identify selenocysteine insertion sequence (SECIS) element, the coding potential of TGA codons, and cysteine-containing homologs of selenoprotein genes. Our results showed that 11365 genes were terminated with TGA codons, 918 of which contained SECIS elements. Similarity search revealed that 58 genes contained Sec/Cys pairs and similar flanking regions around in-frame TGA codons. Finally, 7 genes were found to fully meet requirements for selenoproteins, although they have not been annotated as selenoproteins in NCBI databases. Deduced from their basic properties, the newly found selenoproteins in the genome of Anopheles gambiae are possibly related to in vivo oxidation tolerance and protein regulation in order to interfere with anopheles' vectorial capacity of Plasmodium. This study may also provide theoretical bases for the prevention of malaria from anopheles transmission.
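The first screening step described here, locating TGA codons that sit in frame inside a candidate coding sequence, can be sketched in a few lines. This is only an illustration in Python (the authors report using PERL programs); the function name and toy sequence are hypothetical, and SECIS detection and the Sec/Cys homology search are not attempted.

```python
def in_frame_tga_positions(orf):
    """Return 0-based offsets of in-frame TGA codons, excluding the terminal codon.

    `orf` is assumed to start at the first base of the start codon and to have
    a length that is a multiple of three.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    last_codon_start = len(orf) - 3
    return [i for i in range(0, last_codon_start, 3) if orf[i:i + 3] == "TGA"]


# Hypothetical toy sequence: ATG TGA GCT TGC TGA
# The internal TGA at offset 3 is a selenocysteine candidate; the final TGA is
# treated as the terminator and is not reported.
print(in_frame_tga_positions("ATGTGAGCTTGCTGA"))
```

An internal in-frame TGA is only a candidate: as the abstract indicates, a downstream SECIS element and supporting homology are needed before a gene is called a selenoprotein.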

19.
Genes for seven putative serine proteases (ChpA–ChpG) belonging to the trypsin subfamily and homologous to the virulence factor pat-1 were identified on the chromosome of Clavibacter michiganensis subsp. michiganensis (Cmm) NCPPB382. All proteases have signal peptides indicating export of these proteins. Their putative function is suggested by two motifs and an aspartate residue typical for serine proteases. Furthermore, six cysteine residues are located at conserved positions. The genes are clustered in a chromosomal region of about 50 kb with a significantly lower G + C content than common for Cmm. The genes chpA, chpB and chpD are pseudogenes as they contain frame shifts and/or in-frame stop codons. The genes chpC and chpG were inactivated by the insertion of an antibiotic resistance cassette. The chpG mutant was not impaired in virulence. However, in planta the titre of the chpC mutant was drastically reduced and only weak disease symptoms were observed. Complementation of the chpC mutant by the wild-type allele restored full virulence. ChpC is the first chromosomal gene of Cmm identified so far that affects the interaction of the pathogen with the host plant.

20.
It has been hypothesized that the length of an exon tends to increase with the GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction is applicable not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes in eight eukaryotic species into three groups (the first exon, the internal exons, and the last exon) and computed the Spearman correlation between the exon length and the percentage GC (%GC) for each of the three groups. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotic genomes, CDS length and %GC are positively correlated in each of the 68 completely sequenced prokaryotic genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).
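The statistic used throughout this abstract, the Spearman correlation between sequence length and %GC, can be computed with the standard library alone. The sketch below is a plain implementation (average ranks for ties, then Pearson correlation on the ranks); the function names and the toy sequences are hypothetical and carry no biological meaning.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy sequences standing in for CDS or exon sequences.
seqs = ["ATATAT", "ATGCATAT", "GCGCATGCAT", "GCGCGCGCATGC"]
lengths = [len(s) for s in seqs]
gc = [gc_percent(s) for s in seqs]
print(spearman(lengths, gc))
```

Applied to real data, `seqs` would hold the last coding exons, internal exons, or whole CDSs of one genome, with one correlation computed per group as in the study.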
