Similar literature
 Found 20 similar documents (search time: 625 ms)
1.
Gene identification in novel eukaryotic genomes by self-training algorithm   Total citations: 8, self-citations: 0, citations by others: 8
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, the genomic organization of novel eukaryotic genomes is diverse, and ab initio gene finding tools tuned for previously studied species are rarely suitable for efficacious gene hunting in the DNA sequences of a new genome. Gene identification methods based on mapping cDNA and expressed sequence tags (ESTs) to genomic DNA, or on alignments to closely related genomes, rely on the existence of abundant cDNA and EST data and/or the availability of reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither of these types of data may be available in sufficient amounts until rather late stages of sequencing a novel genome. Nevertheless, we have shown that gene finding in eukaryotic genomes can be carried out in parallel with the estimation of statistical models directly from as yet anonymous genomic DNA. The suggested method of running gene prediction in parallel with model parameter estimation follows the path of iterative Viterbi training. Rounds of labeling genomic sequence into coding and non-coding regions are followed by rounds of model parameter estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably to or better than conventional methods in which supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed.
Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.

2.

Background

The design of oligonucleotides and PCR primers for studying large genomes is complicated by the redundancy of sequences. The eukaryotic genomes are particularly difficult to study due to abundant repeats. The speed of most existing primer evaluation programs is not sufficient for large-scale experiments.

Results

In order to improve the efficiency and success rate of automatic primer/oligo design, we created a novel method which allows rapid masking of repeats in large sequence files, for example in eukaryotic genomes. It also allows the detection of all alternative binding sites of PCR primers and the prediction of PCR products. The new method was implemented in a collection of efficient programs, the GENOMEMASKER package. The performance of the programs was compared to other similar programs. We also modified the PRIMER3 program, to be able to design primers from lowercase-masked sequences.

Conclusion

The GENOMEMASKER package is able to mask the entire human genome for non-unique primers within 6 hours and find locations of all binding sites for 10 000 designed primer pairs within 10 minutes. Additionally, it predicts all alternative PCR products from large genomes for given primer pairs.
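
As a rough illustration of what "lowercase-masked" means for a downstream primer designer, the sketch below extracts the unmasked (uppercase) stretches of a sequence, which are the only places a primer would be allowed to sit. It is not code from GENOMEMASKER or PRIMER3; the function name `unmasked_regions` and the `min_len` parameter are hypothetical and chosen only for this example.

```python
import re
from typing import List, Tuple


def unmasked_regions(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of uppercase runs.

    Assumption for this sketch: masked (non-unique) positions are written
    in lowercase, and everything in uppercase is usable.
    """
    regions = []
    for match in re.finditer(r"[A-Z]+", seq):
        start, end = match.span()
        # Keep only runs long enough to hold a primer of min_len bases.
        if end - start >= min_len:
            regions.append((start, end))
    return regions


if __name__ == "__main__":
    masked = "ACGTacgtGGCCAAtt"
    for start, end in unmasked_regions(masked, min_len=4):
        print(start, end, masked[start:end])
```

A primer of length 20, for example, could then only be placed inside intervals returned with `min_len=20`.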

3.
The development of new strategies for the in vivo modification of eukaryotic genomes has become an important objective of current research. Site-specific recombination has proven useful, as it allows controlled manipulation of murine, plant, and yeast genomes. Here we provide the first evidence that the prokaryotic site-specific recombinase (beta-recombinase), which catalyzes only intramolecular recombination, is active in eukaryotic environments. beta-Recombinase, encoded by the beta gene of the Gram-positive broad host range plasmid pSM19035, has been functionally expressed in eukaryotic cell lines, demonstrating high avidity for the nuclear compartment and forming a clear speckled pattern when assayed by indirect immunofluorescence. In simian COS-1 cells, transient beta-recombinase expression promoted deletion of a DNA fragment lying between two directly oriented specific recognition/crossing over sequences (six sites) located as an extrachromosomal DNA substrate. The same result was obtained in a recombination-dependent lacZ activation system tested in a cell line that stably expresses the beta-recombinase protein. In stable NIH/3T3 clones bearing different numbers of copies of the target sequences integrated at distinct chromosomal locations, transient beta-recombinase expression also promoted deletion of the intervening DNA, independently of the insertion position of the target sequences. The utility of this new recombination tool for the manipulation of eukaryotic genomes, used either alone or in combination with the other recombination systems currently in use, is discussed.

4.
The Z curve database: a graphic representation of genome sequences   Total citations: 7, self-citations: 0, citations by others: 7
MOTIVATION: Genome projects for many prokaryotic and eukaryotic species have been completed, and more are currently underway. The availability of a large number of genomic sequences creates a need for graphic tools that let researchers study genomes in a perceivable form. The Z curve is one such tool for visualizing genomes. The Z curve is a unique three-dimensional curve representation of a given DNA sequence, in the sense that each can be uniquely reconstructed given the other. A Z curve database covering more than 1000 genomes has been established here. RESULTS: The database contains the Z curves for archaea, bacteria, eukaryota, organelles, phages, plasmids, viroids and viruses whose genomic sequences are currently available. All the 3-dimensional Z curves and their three component curves are stored in the database. The applications of the Z curve database to comparative genomics, gene prediction, computation of G+C content with a windowless technique, prediction of replication origins and termini of bacterial and archaeal genomes, and the study of local deviations from Chargaff Parity Rule 2, among others, are presented in detail. The Z curve database reported here is a treasure trove in which biologists could find useful biological knowledge.
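
The one-to-one correspondence between a DNA sequence and its Z curve can be made concrete with a small sketch. It uses the standard Z curve definition, in which the three components track purine versus pyrimidine (x), amino versus keto (y), and weak versus strong hydrogen bonding (z) along the sequence. The function names below are hypothetical and are not taken from the database's own software; ambiguous bases such as N are not handled and will raise a `KeyError`.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int, int]

# Step taken in (x, y, z) for each base:
#   x: purine (A, G) vs pyrimidine (C, T)
#   y: amino (A, C) vs keto (G, T)
#   z: weak H-bond (A, T) vs strong H-bond (G, C)
STEP: Dict[str, Point] = {
    "A": (1, 1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
    "T": (-1, -1, 1),
}
# Each base has a distinct step, so the mapping can be inverted.
BASE: Dict[Point, str] = {step: base for base, step in STEP.items()}


def z_curve(seq: str) -> List[Point]:
    """Return the cumulative Z curve points for a DNA sequence."""
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def sequence_from_z_curve(points: List[Point]) -> str:
    """Reconstruct the DNA sequence from consecutive Z curve points."""
    prev = (0, 0, 0)
    bases = []
    for point in points:
        step = (point[0] - prev[0], point[1] - prev[1], point[2] - prev[2])
        bases.append(BASE[step])
        prev = point
    return "".join(bases)


if __name__ == "__main__":
    curve = z_curve("AGCT")
    print(curve)
    print(sequence_from_z_curve(curve))
```

Because the four bases map to four distinct steps, differencing consecutive points recovers the sequence exactly, which is the sense in which each representation determines the other.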

5.
Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs, most of them based on machine learning methods, have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is an especially difficult task due to the absence of a large training set of validated genes. Here we present a new gene-finding program, SCGPred, which improves prediction accuracy by combining multiple sources of evidence. SCGPred can run as a supervised method on previously well-studied genomes and as an unsupervised method on novel genomes. In tests on datasets composed of large DNA sequences from human and from a novel genome, that of Ustilago maydis, SCGPred achieves a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.

6.
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.

7.
The accurate prediction of higher eukaryotic gene structures and regulatory elements directly from genomic sequences is an important early step in the understanding of newly assembled contigs and finished genomes. As more new genomes are sequenced, comparative approaches are becoming increasingly practical and valuable for predicting genes and regulatory elements. We demonstrate the effectiveness of a comparative method called pattern filtering; it utilizes synteny between two or more genomic segments for the annotation of genomic sequences. Pattern filtering optimally detects the signatures of conserved functional elements despite the stochastic noise inherent in evolutionary processes, allowing more accurate annotation of gene models. We anticipate that pattern filtering will facilitate sequence annotation and the discovery of new functional elements by the genetics and genomics communities.

8.
Coding information is the main source of heterogeneity (non-randomness) in the sequences of microbial genomes. The heterogeneity corresponds to a cluster structure in triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in microbial genomic sequences and explained its properties. We show that codon usage of bacterial genomes is a multi-linear function of their genomic G+C-content with high accuracy. Based on the analysis of 143 completely sequenced bacterial genomes available in GenBank in August 2004, we show that there are four "pure" types of the 7-cluster structure observed. Animated 3D scatter plots of the cluster structure for all 143 genomes are collected in a database, which is available on our web site (http://www.ihes.fr/~zinovyev/7clusters). The findings can be readily introduced into software for gene prediction, sequence alignment or classification of microbial genomes.

9.
10.
Tailed double-stranded DNA viruses (order Caudovirales) represent the dominant morphotype among viruses infecting bacteria. Analysis and comparison of complete genome sequences of tailed bacterial viruses provided insights into their origin and evolution. Structural and genomic studies have unexpectedly revealed that tailed bacterial viruses are evolutionarily related to eukaryotic herpesviruses. Organisms from the third domain of life, Archaea, are also infected by viruses that, in their overall morphology, resemble tailed viruses of bacteria. However, high-resolution structural information is currently unavailable for any of these viruses, and only a few complete genomes have been sequenced so far. Here we identified nine proviruses that are clearly related to tailed bacterial viruses and integrated into chromosomes of species belonging to four different taxonomic orders of the Archaea. This more than doubled the number of genome sequences available for comparative studies. Our analyses indicate that highly mosaic tailed archaeal virus genomes evolve by homologous and illegitimate recombination with genomes of other viruses, by diversification, and by acquisition of cellular genes. Comparative genomics of these viruses and related proviruses revealed a set of conserved genes encoding putative proteins similar to virion assembly and maturation, as well as genome packaging proteins of tailed bacterial viruses and herpesviruses. Furthermore, fold prediction and structural modeling experiments suggest that the major capsid proteins of tailed archaeal viruses adopt the same topology as the corresponding proteins of tailed bacterial viruses and eukaryotic herpesviruses. Data presented in this study strongly support the hypothesis that tailed viruses infecting archaea share a common ancestry with tailed bacterial viruses and herpesviruses.

11.
12.
Phosphate (PO(4)) is an important limiting nutrient in marine environments. Marine cyanobacteria scavenge PO(4) using the high-affinity periplasmic phosphate binding protein PstS. The pstS gene has recently been identified in genomes of cyanobacterial viruses as well. Here, we analyse genes encoding transporters in genomes from viruses that infect eukaryotic phytoplankton. We identified inorganic PO(4) transporter-encoding genes from the PHO4 superfamily in several virus genomes, along with other transporter-encoding genes. Homologues of the viral pho4 genes were also identified in genome sequences from the genera that these viruses infect. Genome sequences were available from host genera of all the phytoplankton viruses analysed except the host genus Bathycoccus. Pho4 was recovered from Bathycoccus by sequencing a targeted metagenome from an uncultured Atlantic Ocean population. Phylogenetic reconstruction showed that pho4 genes from pelagophytes, haptophytes and infecting viruses were more closely related to homologues in prasinophytes than to those in what, at the species level, are considered to be closer relatives (e.g. diatoms). We also identified PHO4 superfamily members in ocean metagenomes, including new metagenomes from the Pacific Ocean. The environmental sequences grouped with pelagophytes, haptophytes, prasinophytes and viruses as well as bacteria. The analyses suggest that multiple independent pho4 gene transfer events have occurred between marine viruses and both eukaryotic and bacterial hosts. Additionally, pho4 genes were identified in available genomes from viruses that infect marine eukaryotes but not those that infect terrestrial hosts. Commonalities in marine host-virus gene exchanges indicate that manipulation of host PO(4) uptake is an important adaptation for viral proliferation in marine systems. Our findings suggest that PO(4) availability may not serve as a simple bottom-up control of marine phytoplankton.

13.
Cot-based cloning and sequencing (CBCS), a synthesis of Cot analysis, DNA cloning and high-throughput sequencing, promises to accelerate the study of eukaryotic genomes. In particular, CBCS will (1) permit efficient gene discovery in species with substantial quantities of repetitive DNA, (2) allow the sequence complexity (i.e. all the unique sequence information) of large genomes to be elucidated at a fraction of the cost of shotgun sequencing, and (3) enhance genome sequencing efforts by facilitating capture of low-copy sequences not secured by EST sequencing. CBCS should accelerate comparative genomics research, especially in large genomes such as those of many crops.

14.
MOTIVATION: Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences, difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases and ambiguous sequence homology between plant and pathogen genes. RESULTS: A novel method is described, which is independent of the availability of homologous genes and relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to a probabilistic algorithm (PF-IND) for expressed sequence tag (EST) classification also based on codon bias differences. Our software (Eclat) has achieved a classification accuracy of 93.1% on a test set of 3217 EST sequences from Hordeum vulgare and Blumeria graminis, which is a significant improvement compared to PF-IND (prediction accuracy of 81.2% on the same test set). EST sequences with at least 50 nt of coding sequence can be classified using Eclat with high confidence. Eclat allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences. AVAILABILITY: Eclat is freely available on the Internet (http://mips.gsf.de/proj/est) or on request as a standalone version. CONTACT: friedel@informatik.uni-muenchen.de.

15.
H Liu  Y Fu  J Xie  J Cheng  SA Ghabrial  G Li  X Yi  D Jiang 《PloS one》2012,7(7):e42147
Genome sequences of viruses can contribute greatly to the study of viral evolution, diversity and the interaction between viruses and hosts. Traditional molecular cloning methods for obtaining RNA viral genomes are time-consuming and often difficult because many viruses occur in extremely low titers. DsRNA viruses in the families Partitiviridae, Totiviridae, Endornaviridae, and Chrysoviridae, and other related unclassified dsRNA viruses, are generally associated with symptomless or persistent infections of their hosts. These characteristics indicate that samples or materials derived from eukaryotic organisms and used to construct cDNA libraries and for EST sequencing might carry these viruses without the researchers easily detecting them. Therefore, the EST databases may include numerous unknown viral sequences. In this study, we performed in silico cloning, a procedure for obtaining the full or partial cDNA sequence of a gene by bioinformatics analysis, using known dsRNA viral sequences as queries to search against the NCBI Expressed Sequence Tag (EST) database. From this analysis, we obtained 119 novel virus-like sequences related to members of the families Endornaviridae, Chrysoviridae, Partitiviridae, and Totiviridae. Many of them were identified in cDNA libraries of eukaryotic lineages that were not known to be hosts for these viruses. Furthermore, comprehensive phylogenetic analysis of these newly discovered virus-like sequences with known dsRNA viruses revealed that these dsRNA viruses may have co-evolved with their respective host supergroups over a long evolutionary time, while potential horizontal transmissions of viruses between different host supergroups are also possible. We also found that some of the plant partitiviruses may have originated from fungal viruses by horizontal transmission. These findings extend our knowledge of the diversity and possible host range of dsRNA viruses and offer insight into the origin and evolution of relevant viruses with their hosts.

16.
Subirana JA  Anokian E 《Gene》2011,473(2):76-81
A very simple new program, G-SQUARES, is presented. It is useful for visualizing the composition and basic structural features of whole genomes and selected chromosome regions. The frequency of all dimer and tetramer sequences is reported. Overall structural features, such as the tendency for alternation, are calculated. Different sequences can easily be compared visually. Furthermore, the visualized features suggest further studies that should be carried out. Examples are presented for Alu sequences, CpG islands, and whole eukaryotic and bacterial genomes.
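
The kind of dimer and tetramer frequency table mentioned in this abstract can be sketched in a few lines. This is only an illustration, not G-SQUARES itself: the function name `kmer_frequencies` is hypothetical, and counting overlapping k-mers on a single strand is an assumption, since the abstract does not say how the program counts.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Return the relative frequency of every overlapping k-mer in seq.

    Assumption for this sketch: k-mers are counted with a sliding window
    of step 1 on the given strand only.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no k-mers to report.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    sequence = "ACACACGTGT"
    print(kmer_frequencies(sequence, 2))  # dimers
    print(kmer_frequencies(sequence, 4))  # tetramers
```

Comparing such tables between two sequences (for example, an Alu element and a CpG island) is one simple way to see compositional differences before plotting them.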

17.
Physical partitioning techniques are routinely employed (during sample preparation stage) for segregating the prokaryotic and eukaryotic fractions of metagenomic samples. In spite of these efforts, several metagenomic studies focusing on bacterial and archaeal populations have reported the presence of contaminating eukaryotic sequences in metagenomic data sets. Contaminating sequences originate not only from genomes of micro-eukaryotic species but also from genomes of (higher) eukaryotic host cells. The latter scenario usually occurs in the case of host-associated metagenomes. Identification and removal of contaminating sequences is important, since these sequences not only impact estimates of microbial diversity but also affect the accuracy of several downstream analyses. Currently, the computational techniques used for identifying contaminating eukaryotic sequences, being alignment based, are slow, inefficient, and require huge computing resources. In this article, we present Eu-Detect, an alignment-free algorithm that can rapidly identify eukaryotic sequences contaminating metagenomic data sets. Validation results indicate that on a desktop with modest hardware specifications, the Eu-Detect algorithm is able to rapidly segregate DNA sequence fragments of prokaryotic and eukaryotic origin, with high sensitivity. A Web server for the Eu-Detect algorithm is available at http://metagenomics.atc.tcs.com/Eu-Detect/.

18.
The present century has witnessed an unprecedented rise in the number of genome sequences owing to various genome-sequencing programs. However, this has not been matched by cDNA or expressed sequence tag (EST) data. Hence, prediction of the protein-coding sequences of genes from this enormous collection of genomic sequences presents a significant challenge. While robust high-throughput methods of cloning and expression could be used to meet protein requirements, the lack of intron information creates a bottleneck. Computational programs designed for recognizing intron–exon boundaries for a particular organism or group of organisms have their own limitations. With this in view, we describe here a method for constructing an intron-less gene from genomic DNA in the absence of cDNA/EST information and of an organism-specific gene prediction program. The method outlined is a sequential application of bioinformatics to predict correct intron–exon boundaries and of splicing by overlap extension PCR for spliced gene synthesis. The gene construct so obtained can then be cloned for protein expression. The method is simple and can be used for the expression of any eukaryotic gene.

19.
20.
Comparative ab initio prediction of gene structures using pair HMMs   Total citations: 3, self-citations: 0, citations by others: 3
We present a novel comparative method for the ab initio prediction of protein coding genes in eukaryotic genomes. The method simultaneously predicts the gene structures of two un-annotated input DNA sequences which are homologous to each other and retrieves the subsequences which are conserved between the two DNA sequences. It is capable of predicting partial, complete and multiple genes and can align pairs of genes which differ by events of exon-fusion or exon-splitting. The method employs a probabilistic pair hidden Markov model. We generate annotations using our model with two different algorithms: the Viterbi algorithm in its linear memory implementation and a new heuristic algorithm, called the stepping stone, for which both memory and time requirements scale linearly with the sequence length. We have implemented the model in a computer program called DOUBLESCAN. In this article, we introduce the method and confirm the validity of the approach on a test set of 80 pairs of orthologous DNA sequences from mouse and human. More information can be found at: http://www.sanger.ac.uk/Software/analysis/doublescan/
