Similar articles
 20 similar articles found (search time: 46 ms)
1.
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such a highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream of which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular, it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
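The chaining idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it ignores frame compatibility and the Gene Model, assumes additive scores and exons with acceptor <= donor, and the names `Exon` and `best_gene_score` are hypothetical. The exons are visited in acceptor order while a second pointer advances in donor order, so the best gene ending upstream is stored and updated rather than searched for again at every exon.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    acceptor: int   # first position of the exon
    donor: int      # last position of the exon
    score: float


def best_gene_score(exons):
    """Highest additive score of a chain of non-overlapping exons."""
    n = len(exons)
    by_acceptor = sorted(range(n), key=lambda i: exons[i].acceptor)
    by_donor = sorted(range(n), key=lambda i: exons[i].donor)
    best_ending_at = [0.0] * n   # best gene score ending in exon i
    best_upstream = 0.0          # best gene ending upstream of the current acceptor
    j = 0                        # pointer into the donor-sorted list
    for i in by_acceptor:
        # Fold in every exon whose donor lies strictly upstream of this acceptor.
        while j < n and exons[by_donor[j]].donor < exons[i].acceptor:
            best_upstream = max(best_upstream, best_ending_at[by_donor[j]])
            j += 1
        best_ending_at[i] = best_upstream + exons[i].score
    return max(best_ending_at, default=0.0)


exons = [Exon(1, 10, 5.0), Exon(5, 20, 8.0), Exon(12, 30, 4.0), Exon(25, 40, 3.0)]
print(best_gene_score(exons))
```

Each pointer moves forward only, so the scan itself is linear in the number of exons; the two sorts in this sketch add an O(n log n) step that would disappear if the exon lists were already ordered by position.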

2.
3.
MOTIVATION: Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and an essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or profile of a homologous family member(s). RESULTS: A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon or tron code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence and a protein sequence or profile so that the spliced and translated sequence optimally matches the reference, in the same way as standard protein sequence alignment but allowing for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient (CC) was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences. AVAILABILITY: The source code for the program 'aln' written in ANSI-C and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc. CONTACT: gotoh@cancer-c.pref.saitama.jp
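The abstract above reports accuracy as a nucleotide-level correlation coefficient without giving the formula. The sketch below assumes the Matthews-type coefficient that is commonly used in gene-prediction benchmarks; that choice, and the function name `nucleotide_cc`, are assumptions rather than details taken from the abstract.

```python
import math


def nucleotide_cc(true_coding, predicted_coding, length):
    """Correlation between annotated and predicted coding positions.

    true_coding and predicted_coding are sets of 0-based positions;
    length is the total number of nucleotides in the sequence.
    """
    tp = len(true_coding & predicted_coding)   # coding, predicted coding
    fp = len(predicted_coding - true_coding)   # non-coding, predicted coding
    fn = len(true_coding - predicted_coding)   # coding, missed
    tn = length - tp - fp - fn                 # non-coding, predicted non-coding
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


truth = set(range(10, 20))
shifted = set(range(15, 25))   # prediction displaced by five nucleotides
print(nucleotide_cc(truth, truth, 100), nucleotide_cc(truth, shifted, 100))
```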

4.
A total of 17 P1 and TAC clones each representing an assigned region of chromosome 5 were isolated from P1 and TAC genomic libraries of Arabidopsis thaliana Columbia, and their nucleotide sequences were determined. The length of the clones sequenced in this study summed up to 1,081,958 bp. As we have previously reported the sequence of 9,072,622 bp by analysis of 125 P1 and TAC clones, the total length of the sequences of chromosome 5 determined so far is now 10,154,580 bp. The sequences were subjected to similarity search against protein and EST databases and analysis with computer programs for gene modeling. As a consequence, a total of 253 potential protein-coding genes with known or predicted functions were identified. The positions of exons which do not show apparent similarity to known genes were also assigned using computer programs for exon prediction. The average density of the genes identified in this study was 1 gene per 4277 bp. Introns were observed in 74% of the potential protein genes, and the average number per gene and the average length of the introns were 4.3 and 168 bp, respectively. The sequence data and gene information are available on the World Wide Web database KAOS (Kazusa Arabidopsis data Opening Site) at http://www.kazusa.or.jp/arabi/.
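The totals quoted in the abstract above can be cross-checked with a few lines of arithmetic. The variable names are illustrative only; the figures come from the abstract.

```python
new_bp = 1_081_958        # sequenced in this study
previous_bp = 9_072_622   # reported previously
genes = 253               # potential protein-coding genes identified

total_bp = new_bp + previous_bp   # cumulative chromosome 5 sequence
bp_per_gene = new_bp / genes      # average spacing between genes

print(total_bp, round(bp_per_gene))
```

The sum reproduces the stated 10,154,580 bp, and the ratio rounds to the stated density of one gene per 4277 bp.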

5.
Comparative sequence analysis is a powerful approach to identify functional elements in genomic sequences. Herein, we describe AGenDA (Alignment-based GENe Detection Algorithm), a novel method for gene prediction that is based on long-range alignment of syntenic regions in eukaryotic genome sequences. Local sequence homologies identified by the DIALIGN program are searched for conserved splice signals to define potential protein-coding exons; these candidate exons are then used to assemble complete gene structures. The performance of our method was tested on a set of 105 human-mouse sequence pairs. These test runs showed that sensitivity and specificity of AGenDA are comparable with the best gene-prediction program that is currently available. However, since our method is based on a completely different type of input information, it can detect genes that are not detectable by standard methods and vice versa. Thus, our approach seems to be a useful addition to existing gene-prediction programs. Availability: DIALIGN is available through the Bielefeld Bioinformatics Server (BiBiServ) at http://bibiserv.techfak.uni-bielefeld.de/dialign/. The gene-prediction program AGenDA described in this paper will be available through the BiBiServ or MIPS web server at http://mips.gsf.de.

6.
Identifying the 3'-terminal exon in human DNA.   Total citations: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: We present JTEF, a new program for finding 3' terminal exons in human DNA sequences. This program is based on quadratic discriminant analysis, a standard non-linear statistical pattern recognition method. The quadratic discriminant functions used for building the algorithm were trained on a set of 3' terminal exons of type 3tuexon (those containing the true STOP codon). RESULTS: We showed that the average predictive accuracy of JTEF is higher than that of the best presently available programs (GenScan and Genemark.hmm), based on a test set of 65 human DNA sequences with 121 genes. In particular, JTEF performs well on larger genomic contigs containing multiple genes and significant amounts of intergenic DNA. It will become a valuable tool for genome annotation and gene functional studies. AVAILABILITY: JTEF is available free for academic users on request from ftp://cshl.org/pub/science/mzhanglab/JTEF and will be made available through the World Wide Web (http://argon.cshl.org/).

7.
The present century has witnessed an unprecedented rise in genome sequences owing to various genome-sequencing programs. However, the same has not been replicated with cDNA or expressed sequence tags (ESTs). Hence, prediction of the protein coding sequence of genes from this enormous collection of genomic sequences presents a significant challenge. While robust high throughput methods of cloning and expression could be used to meet protein requirements, lack of intron information creates a bottleneck. Computational programs designed for recognizing intron–exon boundaries for a particular organism or group of organisms have their own limitations. Keeping this in view, we describe here a method for construction of an intron-less gene from genomic DNA in the absence of cDNA/EST information and an organism-specific gene prediction program. The method outlined is a sequential application of bioinformatics to predict correct intron–exon boundaries and splicing by overlap extension PCR for spliced gene synthesis. The gene construct so obtained can then be cloned for protein expression. The method is simple and can be used for any eukaryotic gene expression.

8.
The Arabidopsis thaliana Em1 gene has been mapped to the lower arm of chromosome III. Fine analysis of 60 kb around this gene, based largely on identification and sequencing of cognate cDNAs, has allowed us to identify 15 genes or putative genes. Cognate cDNAs exist for ten of these genes, indicating that they are effectively expressed. Analysis by sequence alignment and intracellular localization prediction programs allows attribution of a potential protein product to these genes which show no obvious functional relationship. Comparison of the true exon/intron structure based on cDNA sequences with that proposed by three commonly used prediction programs shows that, in the absence of further information, the results of these predictions on anonymous genomic sequences should be interpreted with caution. Examination of the non-coding sequence showed the presence of a novel repeated, palindromic element. The results of this detailed analysis show that in-depth studies will be necessary to exploit correctly the complete A. thaliana genome sequence.

9.
10.
The complete genomic organization of the two mucin genes MUC2 and MUC6 was obtained by comparison of new and published mRNA sequences with newly available human genomic sequence. The two genes are located 38.5 kb apart in a head-to-head orientation within a gene complex on chromosome 11p15.5. The N-terminal organization of MUC6 is highly similar to that of MUC2, containing the D1, D2, D', and D3 Von Willebrand factor domains followed by the large tandem repeat domains located in exons 31 and 30, respectively. MUC6 has a much smaller C-terminal domain (101 amino acids) encoded by 2 exons containing only the CK domain, compared with MUC2, which has a C-terminal domain of 859 amino acids containing the D4, C, D, and CK domains, encoded by 19 exons. The gene structures agreed partially but not completely with predictions from gene prediction programs.

11.
The Japanese pufferfish Fugu rubripes with a genome of about 400 Mb is becoming increasingly recognized as a vertebrate model organism for comparative gene analysis (see Elgar 1996 for review). We have isolated and sequenced two Fugu cosmids spanning a genomic region of 66 kb containing the Fugu homolog to the human PCOLCE-I (Glöckner et al. 1998). We then examined if RUMMAGE-DP, a newly developed analysis tool for gene discovery which was designed for human and mouse genomic DNA, can be used for automatic annotation of Fugu genomic sequence. The exon prediction programs contained in RUMMAGE-DP performed better overall for the human sequence than for the Fugu contig. The GENSCAN program was the only exon prediction program that performed equally well for both organisms. We show that RUMMAGE-DP is very useful in automatic analysis of Fugu sequences. Comparative analysis of the genomic structure of the PCOLCE-I genes in Fugu and human reveals that the exon/intron structure throughout the protein coding region is almost identical. We defined an additional domain based on the high degree of similarity of 26 aa between mammals and Fugu. The PCOLCE-I protein in both organisms contains two highly conserved CUB domains. Exons 6 and 7 are the only coding exons that differ in length between the two species. We assume that these exons do not code for any catalytic domain of the protein. Analysis of the remaining five Fugu genes within the 66 kb interval revealed no conserved synteny with the corresponding human 7q22 region. Received: 13 October 1998 / Accepted: 25 July 1999

12.
Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of uncertainty in a set of information. In this study, the entropy of each gene and of its set of exons was calculated for orders one to four. Based on the relative entropy of genes and exons, the Kullback-Leibler divergence was calculated. The resulting Kullback-Leibler distances for the gene and exon sets were used as input to 7 clustering algorithms: single, complete, average, weighted, centroid, median, and K-means. To aggregate the clustering results, the AdaBoost algorithm was used. Finally, the output of the AdaBoost algorithm was examined with the GeneMANIA prediction server to assess the results from a gene annotation point of view. All calculations were performed using the MATLAB Engineering Software (2015). Examining the metabolic pathways of the clustered genes on the basis of their annotations showed that the proposed clustering method yielded correct, logical, and fast results. At the same time, the method avoids the disadvantages of alignment: it considers genes at their actual length and content, and it does not require large amounts of memory for long sequences. We believe that the proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. The proposed method can also be seen as a predictive method for genes with weak genomic annotations.
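The entropy and Kullback-Leibler quantities named in the abstract above can be sketched in Python (the study itself used MATLAB). This is an illustration under stated assumptions: "order k" is taken to mean statistics over k-mers, a pseudocount is added so that no probability is zero, and the function names are hypothetical. None of these details are given in the abstract.

```python
import math
from collections import Counter
from itertools import product


def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed k-mer frequencies over the alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must be nonzero wherever p is."""
    return sum(p[m] * math.log2(p[m] / q[m]) for m in p if p[m] > 0)


gene = kmer_distribution("ATGGCCGTAAGTTTTCAGGGATAA", 2)   # toy gene sequence
exons = kmer_distribution("ATGGCCGGATAA", 2)              # toy concatenated exons
print(entropy(gene), kl_divergence(gene, exons))
```

A matrix of such divergences between genes could then be passed to a clustering routine; that step is left out here.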

13.
MOTIVATION: The annotation of the Arabidopsis thaliana genome remains a problem in terms of time and quality. To improve the annotation process, we want to choose the most appropriate tools to use inside a computer-assisted annotation platform. We therefore need evaluation of prediction programs with Arabidopsis sequences containing multiple genes. RESULTS: We have developed AraSet, a data set of contigs of validated genes, enabling the evaluation of multi-gene models for the Arabidopsis genome. Besides conventional metrics to evaluate gene prediction at the site and the exon levels, new measures were introduced for the prediction at the protein sequence level as well as for the evaluation of gene models. This evaluation method is of general interest and could apply to any new gene prediction software and to any eukaryotic genome. The GeneMark.hmm program appears to be the most accurate software at all three levels for the Arabidopsis genomic sequences. Gene modeling could be further improved by combination of prediction software. AVAILABILITY: The AraSet sequence set, the Perl programs and complementary results and notes are available at http://sphinx.rug.ac.be:8080/biocomp/napav/. CONTACT: Pierre.Rouze@gengenp.rug.ac.be.
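As an illustration of the conventional exon-level metrics mentioned in the abstract above, the sketch below (in Python, not the Perl of the AraSet programs) counts a predicted exon as correct only when both boundaries match an annotated exon exactly. It follows the usual gene-prediction convention in which "specificity" is the fraction of predicted exons that are correct; the function name and the toy coordinates are hypothetical.

```python
def exon_level_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact-match exon prediction.

    Exons are (start, end) tuples on the same sequence and strand.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    correct = len(true_set & pred_set)
    sensitivity = correct / len(true_set) if true_set else 0.0   # correct / annotated
    specificity = correct / len(pred_set) if pred_set else 0.0   # correct / predicted
    return sensitivity, specificity


annotated = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410)]   # second exon misses the donor site by 10 nt
print(exon_level_accuracy(annotated, predicted))
```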

14.
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation.

15.
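To investigate the role of the SMARCA1 gene in the pathogenesis of patients from an SFMS family in Shandong, China, we used in silico hybridization combined with DNA sequence analysis to first determine the genomic structure of SMARCA1. The genomic DNA of the gene spans more than 71.7 kb and contains 24 exons and 23 introns, and all exon-intron junctions follow the GT-AG rule. Elucidating the genomic structure lays a foundation for mutation detection and for analysis of the gene's biological function. Building on this analysis, we used PCR amplification combined with sequencing to screen all exons and exon-intron junction sequences of SMARCA1 for mutations in patients from one SFMS family identified in Shandong Province. No disease-causing mutation was detected, suggesting that SFMS in this Shandong family is not caused by mutations within the coding region of SMARCA1.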
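The GT-AG rule mentioned in the abstract above states that introns begin with the dinucleotide GT and end with AG. A minimal check, assuming 0-based, end-exclusive exon coordinates on the forward strand, might look like this; the function name and the toy sequence are hypothetical and are not SMARCA1 data.

```python
def introns_follow_gt_ag(genomic, exons):
    """True if every intron between consecutive exons starts with GT and ends with AG.

    exons is a list of (start, end) pairs sorted by position.
    """
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        intron = genomic[end:next_start].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


# Toy sequence: exon 1, one intron, exon 2.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(introns_follow_gt_ag(genomic, [(0, 6), (18, 24)]))
```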

16.
The structure of the 3' one-third of the dystrophin gene has not previously been established. We have used vectorette PCR on a yeast artificial chromosome containing part of the human dystrophin gene to determine that there are 20 exons in this region and to characterize adjacent intron sequences of each one. Combined with previous information on the remainder of the gene, this study shows that the coding sequence is distributed between 79 exons. We have used PCR between exons to measure the distances that separate the more closely clustered exons. Vectorette PCR products were used as probes on Southern blots to assign all the 3' exons to genomic HindIII fragments that are commonly detected in the analysis of dystrophin gene deletions. The results will be useful for determining the effect of genomic deletions on the translational reading frame, for setting up genomic PCR assays to confirm point mutations, for analyzing splice site mutations, and for investigating potential cis-acting elements involved in tissue-specific alternative splicing. Vectorette PCR using primers derived from cDNA sequence represents an efficient and widely applicable method for establishing gene structure and obtaining intron sequence flanking exons, starting from a genomic clone and a cDNA sequence.
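Once exon sizes are known, the effect of a deletion on the translational reading frame reduces to arithmetic: removing whole internal coding exons preserves the frame only if the removed coding length is a multiple of three. The sketch below assumes that every deleted exon is entirely coding. The function name is hypothetical and the exon lengths are made-up illustrative values, not dystrophin data.

```python
def deletion_keeps_frame(exon_lengths, first_deleted, last_deleted):
    """True if deleting exons first_deleted..last_deleted keeps the reading frame.

    Exon numbers are 1-based and inclusive; exon_lengths are coding lengths in nt.
    """
    removed = sum(exon_lengths[first_deleted - 1:last_deleted])
    return removed % 3 == 0


lengths = [90, 62, 93, 78, 100]   # hypothetical exon lengths in nucleotides
print(deletion_keeps_frame(lengths, 3, 4))   # 93 + 78 = 171 nt, in frame
print(deletion_keeps_frame(lengths, 2, 2))   # 62 nt, frameshift
```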

17.

Background

A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project.

Results

AUGUSTUS can be used as an ab initio program, that is, as a program that uses only one single genomic sequence as input information. In addition, it is able to combine information from the genomic sequence under study with external hints from various sources of information. For EGASP, we used genomic sequence alignments as well as alignments to expressed sequence tags (ESTs) and protein sequences as additional sources of information. Within the category of ab initio programs AUGUSTUS predicted significantly more genes correctly than any other ab initio program. At the same time it predicted the smallest number of false positive genes and the smallest number of false positive exons among all ab initio programs. The accuracy of AUGUSTUS could be further improved when additional extrinsic data, such as alignments to EST, protein and/or genomic sequences, was taken into account.

Conclusion

AUGUSTUS turned out to be the most accurate ab initio gene finder among the tested tools. Moreover it is very flexible because it can take information from several sources simultaneously into consideration.

18.
19.
The complete protein sequence of the human aldolase C isozyme has been determined from recombinant genomic clones. A genomic fragment of 6673 base pairs was isolated and the DNA sequence determined. Aldolase protein sequences, being highly conserved, allowed the derivation of the sequence of this isozyme by comparison of open reading frames in the genomic DNA to the protein sequence of other human aldolase enzymes. The protein sequence of the third aldolase isozyme found in vertebrates, aldolase C, completes the primary structural determination for this family of isozymes. Overall, the aldolase C isozyme shared 81% amino acid homology with aldolase A and 70% homology with aldolase B. The comparisons with other aldolase isozymes revealed several aldolase C-specific residues which could be involved in its function in the brain. The data indicated that the gene structure of aldolase C is the same as other aldolase genes in birds and mammals, having nine exons separated by eight introns, all in precisely the same positions, only the intron sizes being different. Eight of these exons contain the protein coding region comprised of 363 amino acids. The entire gene is approximately 4 kilobases.

20.