Similar literature
1.
In the process of making full-length cDNA, predicting protein-coding regions helps both in the preliminary analysis of genes and in subsequent steps. However, unfinished cDNA contains artifacts, including many sequencing errors, which hinder the correct evaluation of coding sequences. Predictions for short sequences are especially difficult because they provide little information for evaluating coding potential. In this paper, we describe ANGLE, a new program for predicting coding sequences in low-quality cDNA. To achieve error-tolerant prediction, ANGLE uses a machine-learning approach that represents coding sequences better by making maximal use of the limited information in input sequences. Our method uses not only codon usage but also protein structure information, which is difficult to exploit in stochastic model-based algorithms, and it optimizes the use of the limited information in a short segment when deciding coding potential, so that predictive accuracy does not depend on the length of the input sequence. The performance of ANGLE is compared with that of ESTSCAN on four datasets, each with a different error rate (one frame-shift error or one substitution error per 200-500 nucleotides), and on one error-free dataset. ANGLE outperforms ESTSCAN by 9.26% in average Matthews correlation coefficient on the short-sequence dataset (< 1000 bases). On the long-sequence dataset, ANGLE achieves comparable performance.
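The abstract above reports its comparison as a Matthews correlation coefficient. As a reminder of what that statistic measures, here is a minimal sketch that computes it from confusion-matrix counts. The function name and the convention of returning 0 for an empty marginal are assumptions made for illustration; nothing here reproduces ANGLE, ESTSCAN, or their results.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn/fp/fn could be, for example, counts of nucleotides labelled
    coding or non-coding (an illustrative choice, not from the paper).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (every label wrong) through 0 (no better than chance) to +1 (every label right), which makes it less sensitive to class imbalance than plain accuracy.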

2.
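A new EST clustering method   Total citations: 11, self-citations: 0, citations by others: 11
This study developed an EST (expressed sequence tag) clustering method, ESTClustering, for analyzing the large volumes of data produced by large-scale EST sequencing and obtaining high-quality, non-redundant expressed sequences. During clustering, the method uses the MEGABLAST tool to compare consensus sequences for homology and uses the phrap program to check the assembly of each EST cluster. This clustering strategy reduces the impact of sequencing errors, effectively identifies gene family members, and avoids interference from alternative splicing. Compared with the UniGene clustering method of NCBI (National Center for Biotechnology Information), the ESTClustering results better reflect the diversity of expressed sequences. In a clustering test on 112,256 Arabidopsis ESTs, ESTClustering produced 23,581 EST clusters, of which 13,597 corresponded to coding sequences in the Arabidopsis genome, close to the number of predicted genes in that genome that are supported by EST evidence. Applying the method to 147,191 collected rice EST sequences yielded 33,896 EST clusters.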

3.
The aim of this study was to analyze patterns of nucleotide composition and codon usage in the pea aphid genome (Acyrthosiphon pisum). A collection of 60,000 expressed sequence tags (ESTs) in the pea aphid has been used to automatically reconstruct 5809 coding sequences (CDSs), based on similarity with known proteins and on coding style recognition. Reconstructions were manually checked for ribosomal proteins, leading to a tentative reconstruction of the near-complete set for this category. Pea aphid coding sequences showed a shift toward AT (especially at the third codon position) compared to Drosophila homologues. Genes with a putative high level of expression (ribosomal and other genes with high EST support) remained more GC3-rich and had a distinct codon usage from bulk sequences: they exhibited a preference for C-ending codons and CGT (for arginine), which thus appeared optimal for translation. However, the discrimination was not as strong as in Drosophila, suggesting a reduced degree of translational selection. The space of variation in codon usage for A. pisum appeared to be larger than in Drosophila, with a substantial fraction of genes that remained GC3-rich. Some of those (in particular some structural proteins) also showed high levels of codon bias and a very strong preference for C-ending codons, which could be explained either by strong translational selection or by other mechanisms. Finally, genomic traces were analyzed to build 206 fragments containing a full CDS, which allowed studying the correlations between the GC contents of coding sequences and those of noncoding (flanking and intron) sequences.
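The abstract above relies on GC3, the G+C fraction at third codon positions. A minimal sketch of that quantity follows, assuming the input is an in-frame coding sequence that starts at codon position 1; the function name is hypothetical and this is not the authors' pipeline.

```python
def gc3(cds):
    """Fraction of G or C at third codon positions of an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # ignore a trailing partial codon
    third = cds[2:usable:3]            # every third base, starting at index 2
    if not third:
        return 0.0
    return sum(base in "GC" for base in third) / len(third)
```

For example, `gc3("ATGGCCAAATTT")` looks at the third bases G, C, A and T of the four codons and returns 0.5.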

4.
Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translations before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment. We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence. MACSE is distributed as an open-source Java executable file with freely available source code and can be used via a web interface at: http://mbb.univ-montp2.fr/macse.

5.
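With advances in large-scale sequencing technology, the number of sequences deposited in databases is growing rapidly. Most of them are ESTs (expressed sequence tags) of unknown function, and hints about the function of an EST are generally obtained through protein-EST sequence alignment. Because ESTs contain about 5% errors, the most serious being frameshift errors, the usual approach of translating an EST in six frames into protein sequences and then aligning them has difficulty handling frameshift errors. By considering the various possible errors in EST sequences and back-translating the amino acid sequence into a nucleotide sequence, at the nucleotide…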
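The conventional six-frame translation that this abstract contrasts itself with can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and uses hypothetical function names; it is not the back-translation method the abstract goes on to describe.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Map (strand, offset) to the protein translation of that frame."""
    seq = seq.upper()
    strands = (("+", seq), ("-", reverse_complement(seq)))
    return {(strand, off): translate(s[off:])
            for strand, s in strands for off in range(3)}
```

Because each frame is translated with a fixed offset from start to end, a single inserted or deleted base moves the true protein from one frame to another partway through the read, which is why frameshift errors are hard to handle this way.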

6.
Selection on Codon Usage for Error Minimization at the Protein Level   Total citations: 1, self-citations: 0, citations by others: 1
Given the structure of the genetic code, synonymous codons differ in their capacity to minimize the effects of errors due to mutation or mistranslation. I suggest that this may lead, in protein-coding genes, to a preference for codons that minimize the impact of errors at the protein level. I develop a theoretical measure of error minimization for each codon, based on amino acid similarity. This measure is used to calculate the degree of error minimization for 82 genes of Drosophila melanogaster and 432 rodent genes and to study its relationship with CG content, the degree of codon usage bias, and the rate of nucleotide substitution. I show that (i) Drosophila and rodent genes tend to prefer codons that minimize errors; (ii) this cannot be merely the effect of mutation bias; (iii) the degree of error minimization is correlated with the degree of codon usage bias; (iv) the amino acids that contribute more to codon usage bias are the ones for which synonymous codons differ more in the capacity to minimize errors; and (v) the degree of error minimization is correlated with the rate of nonsynonymous substitution. These results suggest that natural selection for error minimization at the protein level plays a role in the evolution of coding sequences in Drosophila and rodents. Reviewing Editor: Dr. Massimo Di Giulio

7.
Parasitic dinoflagellates of the genus Amoebophrya play important roles in the ecology of estuaries and open ocean environments. Little is known of the cell and molecular biology of Amoebophrya, but the genus is intermediate on phylogenetic trees between apicomplexans and typical dinophycean dinoflagellates. Here, we constructed four cDNA libraries, from different stages after infecting the host, Karlodinium veneficum, with Amoebophrya sp. These libraries were used to generate 898 expressed sequence tags (ESTs), with sequences attributed to either the host or parasite, based on AT bias, codon usage, and occurrence during infection. Overall, 209 sequences were attributable to the parasite and 685 to the host. The 50 putative parasite sequences with good protein matches in GenBank were used to find the same protein from host ESTs. For 26 genes, both host and parasite sequences were identified, of which 20 encoded ribosomal proteins. PCR assays for seven predicted parasite genes and two host genes were used to confirm attributions. The most common host and parasite ESTs were compared to see if multiple gene copies were present. The host plastocyanin gene had multiple sequence variants, but parasite rps27a contained only one polymorphism, likely due to an amplification error. Amplification, cloning, and sequencing of five parasite protein-coding genes suggested that the parasite has a single sequence for each gene, but three host genes were found to have multiple variants. The genome of Amoebophrya sp. infecting K. veneficum appears to have an organization more similar to other eukaryotes than to the tandem gene arrangements found in dinoflagellates.

8.
BACKGROUND: Single nucleotide polymorphisms (SNPs) are important tools in studying complex genetic traits and genome evolution. Computational strategies for SNP discovery make use of the large number of sequences present in public databases (in most cases as expressed sequence tags (ESTs)) and are considered to be faster and more cost-effective than experimental procedures. A major challenge in computational SNP discovery is distinguishing allelic variation from sequence variation between paralogous sequences, in addition to recognizing sequencing errors. For the majority of the public EST sequences, trace or quality files are lacking, which makes detection of reliable SNPs even more difficult because it has to rely on sequence comparisons only.

9.
By introducing synonymous mutations into the coding sequences of the GP64sp and FibHsp signal peptides, the influences of mRNA secondary structure and codon usage of signal sequences on protein expression and secretion were investigated using the baculovirus/insect cell expression system. The results showed that mRNA structural stability of the signal sequences was not correlated with the protein production and secretion levels, and FibHsp was more tolerant of codon changes than GP64sp. Codon bias analyses revealed that codons for GP64sp were well de-optimized and contained more non-optimal codons than FibHsp. Synonymous mutations in GP64sp sufficiently increased its average codon usage frequency and resulted in a dramatic reduction of the activity and secretion of luciferase. A protein degradation inhibition assay with MG-132 showed that higher codon usage frequency in the signal sequence increased the production as well as the degradation of luciferase protein, indicating that the synonymous codon substitutions in the signal sequence caused misfolding of luciferase instead of slowing down protein production. Meanwhile, we found that introducing more non-optimal codons into FibHsp could increase the production and secretion levels of luciferase, which suggests a new strategy to improve the production of secretory proteins in insect cells.

10.
It is proven that under the independent codon model, the likelihood of a DNA coding sequence read according to the correct frame is asymptotically larger than that read with an incorrect frame. Based on this proposition, a single set of codon usage probabilities is enough for discriminating the six frames of coding sequences under the independent codon model. The direct coding sequence of the Escherichia coli genome is taken as an example to examine codon independence by using mutual information and chi-square analysis. The contrast between the coding frame and the two offset frames is evident. A self-learning approach for generating a training set is proposed to estimate the probability parameters.
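The frame-discrimination idea in the abstract above can be sketched as follows: under an independent codon model, the log-likelihood of a reading frame is the sum of log codon probabilities, so one table of codon usage probabilities is enough to score every frame. The function names, the add-one pseudocount, and the per-codon averaging (used so that frames with different codon counts stay comparable) are assumptions made for this illustration, not details taken from the paper. Only the three forward frames are scored; the other three would be obtained by scoring the reverse complement in the same way.

```python
import math
from collections import Counter
from itertools import product


def codon_log_probs(training_cds, pseudocount=1.0):
    """Estimate log codon probabilities from in-frame training sequences."""
    counts = Counter()
    for cds in training_cds:
        cds = cds.upper()
        counts.update(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    codons = ["".join(c) for c in product("ACGT", repeat=3)]
    total = sum(counts[c] for c in codons) + pseudocount * len(codons)
    return {c: math.log((counts[c] + pseudocount) / total) for c in codons}


def frame_scores(seq, log_p):
    """Mean log-probability per codon for forward offsets 0, 1 and 2."""
    seq = seq.upper()
    scores = []
    for off in range(3):
        codons = [seq[i:i + 3] for i in range(off, len(seq) - 2, 3)]
        known = [log_p[c] for c in codons if c in log_p]  # skip codons with N
        scores.append(sum(known) / len(known) if known else float("-inf"))
    return scores


def best_frame(seq, log_p):
    """Offset (0, 1 or 2) with the highest mean log-likelihood."""
    scores = frame_scores(seq, log_p)
    return scores.index(max(scores))
```

With a toy model trained only on GCT codons, `best_frame("AGCTGCTGCTGCT", log_p)` picks offset 1, the frame in which the read is a run of GCT.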

11.
Translational selection on codon usage in Xenopus laevis   Total citations: 2, self-citations: 0, citations by others: 2
A correspondence analysis of codon usage in Xenopus laevis revealed that the first axis is strongly correlated with the base composition at third codon positions. The second axis discriminates between putatively highly expressed genes and the other coding sequences, with expression levels being confirmed by the analysis of expressed sequence tag (EST) frequencies. The comparison of codon usage of the sequences displaying the extreme values on the second axis indicates that several codons are statistically more frequent among the highly expressed (mainly housekeeping) genes. Translational selection appears, therefore, to influence synonymous codon usage in Xenopus.

12.
In recent years, the amount of molecular sequencing data from Tetrahymena thermophila has dramatically increased. We analyzed G + C content, codon usage, initiator codon context and stop codon sites in the extremely A + T rich genome of this ciliate. Average G + C content was 38% for protein coding regions, 21% for 5' non-coding sequences, 19% for 3' non-coding sequences, 15% for introns, 19% for micronuclear limited sequences and 17% for macronuclear retained sequences flanking micronuclear specific regions. The 75 available T. thermophila protein coding sequences favored codons ending in T and, where possible, avoided those with G in the third position. Highly expressed genes were relatively G + C-rich and exhibited an extremely biased pattern of codon usage while developmentally regulated genes were more A + T-rich and showed less codon usage bias. Regions immediately preceding Tetrahymena translation initiator codons were generally A-rich. For the 60 stop codons examined, the frequency of G in the end + 1 site was much higher than expected whereas C never occupied this position.

13.
A lambdaZAP Express cDNA library was constructed with mRNA obtained from immature miracidia within eggs, hatched miracidia, and sporocysts of Echinostoma paraensei. This cDNA library was amplified and 213 expressed sequence tag (EST) sequences (averaging 466 nucleotides in length) were obtained. The mean percentage of unresolved bases within the EST sequences was 0.4%, ranging from 0 to 4.6%. The 213 ESTs represent 151 unique messages. BLAST (version 2.0.8) analysis disclosed that 64 unique E. paraensei messages (42.4%) had significant similarities (BLAST score ≤ e-5), at deduced amino acid or nucleotide levels, with known sequences in the nonredundant GenBank databases or the dbEST database (NCBI). The remaining 57.6% of the unique EST-encoded messages scored nonsignificant hits. Most of the E. paraensei messages that could be assigned a cellular role based on sequence similarities were involved in gene/protein expression. Several ESTs scored highest similarities with sequences obtained from trematode species. A total of 22,560 nucleotides present in open reading frames from ESTs that aligned with known sequences was used to determine codon usage for E. paraensei. Analysis of a subset of eight ESTs that contained full-length open reading frames did not reveal a bias in codon usage. Also, EST sequences were found to contain 3' untranslated regions with an average length of 69.9 ± 88.4 nucleotides (n = 46). The EST sequences were submitted to GenBank/dbEST, adding to the 51 available Echinostoma-derived sequences, to provide reference information for both phylogenetic analysis and study of general trematode biology.

14.
BACKGROUND: Gene finding is a complicated procedure that encapsulates algorithms for coding sequence modeling, identification of promoter regions, issues concerning overlapping genes and more. In the present study we focus on coding sequence modeling algorithms; that is, algorithms for identification and prediction of the actual coding sequences from genomic DNA. In this respect, we promote a novel multivariate method known as Canonical Powered Partial Least Squares (CPPLS) as an alternative to the commonly used Interpolated Markov model (IMM). Comparisons between the methods were performed on DNA, codon and protein sequences with highly conserved genes taken from several species with different genomic properties. RESULTS: The multivariate CPPLS approach classified coding sequence substantially better than the commonly used IMM on the same set of sequences. We also found that the use of CPPLS with codon representation gave significantly better classification results than both IMM with protein (p < 0.001) and with DNA (p < 0.001). Further, although the mean performance was similar, the variation of CPPLS performance on codon representation was significantly smaller than for IMM (p < 0.001). CONCLUSIONS: The performance of coding sequence modeling can be substantially improved by using an algorithm based on the multivariate CPPLS method applied to codon or DNA frequencies.

15.
MOTIVATION: Overlapping gene coding sequences (CDSs) are particularly common in viruses but also occur in more complex genomes. Detecting such genes with conventional gene-finding algorithms can be difficult for several reasons. If an overlapping CDS is on the same read-strand as a known CDS, then there may not be a distinct promoter or mRNA. Furthermore, the constraints imposed by double-coding can result in atypical codon biases. However, these same constraints lead to particular mutation patterns that may be detectable in sequence alignments. RESULTS: In this paper, we investigate several statistics for detecting double-coding sequences with pairwise alignments--including a new maximum-likelihood method. We also develop a model for double-coding sequence evolution. Using simulated sequences generated with the model, we characterize the distribution of each statistic as a function of sequence composition, length, divergence time and double-coding frame. Using these results, we develop several algorithms for detecting overlapping CDSs. The algorithms were tested on known overlapping CDSs and other overlapping open reading frames (ORFs) in the hepatitis B virus (HBV), Escherichia coli and Salmonella typhimurium genomes. The algorithms should prove useful for detecting novel overlapping genes--especially short coding ORFs in viruses. AVAILABILITY: Programs may be obtained from the authors. SUPPLEMENTARY INFORMATION: http://biochem.otago.ac.nz/double.html.

16.
In the present study, we developed a method for detecting sequences whose similarity to a target sequence is statistically significant, and we examined the distribution of these sequences in the E. coli K-12 genome. The target sequences examined are as follows: (i) short repeats: Crossover hot-spot instigator (Chi) sequence, replication termination (Ter) sequence, and DnaA binding sequence (DnaA box); (ii) potential stem-loop structure repeats: palindromic unit (PU), boxC sequences, and intergenic repeat unit (IRU); (iii) potential RNA coding repeats: rRNAs, PAIR, TRIP, and QUAD; and (iv) potential protein coding repeats: insertion elements (ISs) and Long Direct Repeats (LDRs). We also examined the distribution of these sequences on leading and lagging strands. We obtained another four statistically significant LDR sequences with more than 187 bp matched to LDR-A near the LDR loci, suggesting that these regions might be used as high recombination hot spots for LDR. Adaptation of individual LDRs to the E. coli genome is also discussed on the basis of codon usage.

17.
Partial cDNA sequencing was used to obtain 169 expressed sequence tags (ESTs) in the moss Physcomitrella patens. The source of ESTs was a random cDNA library constructed from 7-day-old protonemata following treatment with 10^-4 M abscisic acid (ABA). Analysis of the ESTs identified 69% with homology to known sequences, 61% of which had significant homology to sequences of plant origin. More importantly, at least 11 ESTs had significant similarities to genes which are implicated in plant stress responses, including responses which may involve ABA. These included a cDNA associated with desiccation tolerance, two heat shock protein genes, one cold acclimation protein cDNA and five others that may be involved in either oxidative or chemical stress or both, i.e., Zn/Cu-superoxide dismutase, NADPH protochlorophyllide oxidoreductase (PorB), selenium binding protein, glutathione peroxidase and glutathione S transferase. Analysis of codon usage between P. patens and seed plants indicated that although mosses and higher plants are to a large extent similar, minor variations also exist that may represent the distinctiveness of each group.

18.
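A preliminary study of coding sequences without 3-base periodicity   Total citations: 4, self-citations: 0, citations by others: 4
Analysis of the Fourier spectra of 120 relatively short coding sequences (<1,200 bp) shows that 3-base periodicity is not always present in short coding sequences. Statistical analysis suggests that whether a coding sequence shows 3-base periodicity is related to its base composition and distribution, to the choice and order of amino acids in the encoded protein, and to synonymous codon usage. In general, non-period-3 sequences have a higher A+U content than G+C content, while the opposite holds for period-3 sequences; bases are distributed more evenly over the three codon positions in non-period-3 sequences than in period-3 sequences; and the bias in codon and amino acid usage is weaker in non-period-3 sequences than in period-3 sequences. These phenomena should be taken fully into account when Fourier analysis is used to predict genes and exons in DNA sequences.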
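One common way to look for 3-base periodicity with Fourier analysis is to turn the sequence into four binary indicator tracks (one per base), sum their spectral power, and compare the power at frequency 1/3 with the average over all frequencies. The sketch below follows that recipe. The function names and the peak-to-mean ratio are assumptions made for illustration, and the abstract above does not say which exact statistic the authors used.

```python
import cmath


def spectrum_power(seq, freq):
    """Total power at `freq` (cycles per base) over the four base indicators."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Discrete Fourier sum over positions where this base occurs.
        s = sum(cmath.exp(-2j * cmath.pi * freq * n)
                for n, b in enumerate(seq) if b == base)
        total += abs(s) ** 2
    return total


def period3_ratio(seq):
    """Power at frequency 1/3 divided by the mean power over k/N, k = 1..N/2."""
    n = len(seq)
    peak = spectrum_power(seq, 1 / 3)
    background = [spectrum_power(seq, k / n) for k in range(1, n // 2 + 1)]
    mean_bg = sum(background) / len(background) if background else 0.0
    return peak / mean_bg if mean_bg > 0 else 0.0
```

A perfectly periodic toy sequence such as `"ACG" * 30` concentrates almost all of its power at 1/3 and gives a large ratio; a sequence with bases spread evenly over the three codon positions would give a ratio near 1, which is the situation the abstract warns about.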

19.
A computer program that runs on MS-DOS personal computers is described that assists in the design of synthetic genes coding for proteins. The goal of the program is the design of a gene which (i) contains as many unique restriction sites as possible and (ii) uses a specific codon usage. A gene designed according to these criteria is (i) suitable for 'modular mutagenesis' experiments and (ii) optimized for expression. The program 'reverse-translates' protein sequences into degenerate DNA sequences, generates a map of potential restriction sites and locates sequence positions where unique restriction sites can be accommodated. The nucleic acid sequence is then 'refined' according to a specific codon usage to remove any degeneracy. Unique restriction sites, if potentially present, can be 'forced' into the degenerate nucleic acid sequence by using 'priority codes' assigned to different restriction sequences.

20.
Synonymous codons are neutral at the protein level; therefore, natural selection at the protein level should have no effect on their frequencies. Synonymous codons, however, differ in their capacity to reduce the effects of errors: after mutation, certain codons keep on coding for the same amino acid or for amino acids with similar properties, while other synonymous codons produce very different amino acids. Therefore, the impact of errors on a coding sequence (genetic robustness) can be measured by analysing its codon usage. I analyse the codon usage of sequenced nuclear and cytoplasmic genomes and I show that there is extensive variation in genetic robustness at the DNA sequence level, both among genomes and among genes of the same genome. I also show theoretically that robustness can be adaptive, that is, natural selection may lead to a preference for codons that reduce the impact of errors. If selection occurs only among the mutants of a codon (e.g. among the progeny before the adult phase), however, the codons that are more sensitive to the effects of mutations may increase in frequency because they more easily get rid of deleterious mutations. I also suggest other possible explanations for the evolution of genetic robustness at the codon level.
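To make the point that synonymous codons differ in their sensitivity to mutation concrete, the sketch below counts, for a given codon, how many of its nine single-nucleotide mutants still encode the same amino acid. This is a deliberately crude proxy that assumes the standard genetic code and weights all mutations equally; the measure described in the abstract also takes the similarity between amino acids into account, which is not modelled here. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction(codon):
    """Fraction of the 9 single-base mutants that keep the amino acid."""
    codon = codon.upper()
    amino_acid = CODON_TABLE[codon]
    unchanged = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue  # not a mutation
            mutant = codon[:pos] + base + codon[pos + 1:]
            unchanged += CODON_TABLE[mutant] == amino_acid
    return unchanged / 9
```

Two leucine codons already differ under this proxy: CTG keeps its amino acid in 4 of 9 mutants, TTA in only 2 of 9, while ATG (methionine) has no synonymous mutant at all.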
