首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
While genome sequencing efforts reveal the basic building blocksof life, a genome sequence alone is insufficient for elucidatingbiological function. Genome annotation—the process ofidentifying genes and assigning function to each gene in a genomesequence—provides the means to elucidate biological functionfrom sequence. Current state-of-the-art high-throughput genomeannotation uses a combination of comparative (sequence similaritydata) and non-comparative (ab initio gene prediction algorithms)methods to identify protein-coding genes in genome sequences.Because approaches used to validate the presence of predictedprotein-coding genes are typically based on expressed RNA sequences,they cannot independently and unequivocally determine whethera predicted protein-coding gene is translated into a protein.With the ability to directly measure peptides arising from expressedproteins, high-throughput liquid chromatography-tandem massspectrometry-based proteomics approaches can be used to verifycoding regions of a genomic sequence. Here, we highlight severalways in which high-throughput tandem mass spectrometry-basedproteomics can improve the quality of genome annotations andsuggest that it could be efficiently applied during the genecalling process so that the improvements are propagated throughthe subsequent functional annotation process.   相似文献   

2.
Gene identification in novel eukaryotic genomes by self-training algorithm   总被引:8,自引:0,他引:8  
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.  相似文献   

3.

Background

Gene prediction is a challenging but crucial part in most genome analysis pipelines. Various methods have evolved that predict genes ab initio on reference sequences or evidence based with the help of additional information, such as RNA-Seq reads or EST libraries. However, none of these strategies is bias-free and one method alone does not necessarily provide a complete set of accurate predictions.

Results

We present IPred (Integrative gene Prediction), a method to integrate ab initio and evidence based gene identifications to complement the advantages of different prediction strategies. IPred builds on the output of gene finders and generates a new combined set of gene identifications, representing the integrated evidence of the single method predictions.

Conclusion

We evaluate IPred in simulations and real data experiments on Escherichia Coli and human data. We show that IPred improves the prediction accuracy in comparison to single method predictions and to existing methods for prediction combination.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1315-9) contains supplementary material, which is available to authorized users.  相似文献   

4.
5.

Background

A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project.

Results

AUGUSTUS can be used as an ab initio program, that is, as a program that uses only one single genomic sequence as input information. In addition, it is able to combine information from the genomic sequence under study with external hints from various sources of information. For EGASP, we used genomic sequence alignments as well as alignments to expressed sequence tags (ESTs) and protein sequences as additional sources of information. Within the category of ab initio programs AUGUSTUS predicted significantly more genes correctly than any other ab initio program. At the same time it predicted the smallest number of false positive genes and the smallest number of false positive exons among all ab initio programs. The accuracy of AUGUSTUS could be further improved when additional extrinsic data, such as alignments to EST, protein and/or genomic sequences, was taken into account.

Conclusion

AUGUSTUS turned out to be the most accurate ab initio gene finder among the tested tools. Moreover it is very flexible because it can take information from several sources simultaneously into consideration.
  相似文献   

6.
7.

Background

Complete genome annotation is a necessary tool as Anopheles gambiae researchers probe the biology of this potent malaria vector.

Results

We reannotate the A. gambiae genome by synthesizing comparative and ab initio sets of predicted coding sequences (CDSs) into a single set using an exon-gene-union algorithm followed by an open-reading-frame-selection algorithm. The reannotation predicts 20,970 CDSs supported by at least two lines of evidence, and it lowers the proportion of CDSs lacking start and/or stop codons to only approximately 4%. The reannotated CDS set includes a set of 4,681 novel CDSs not represented in the Ensembl annotation but with EST support, and another set of 4,031 Ensembl-supported genes that undergo major structural and, therefore, probably functional changes in the reannotated set. The quality and accuracy of the reannotation was assessed by comparison with end sequences from 20,249 full-length cDNA clones, and evaluation of mass spectrometry peptide hit rates from an A. gambiae shotgun proteomic dataset confirms that the reannotated CDSs offer a high quality protein database for proteomics. We provide a functional proteomics annotation, ReAnoXcel, obtained by analysis of the new CDSs through the AnoXcel pipeline, which allows functional comparisons of the CDS sets within the same bioinformatic platform. CDS data are available for download.

Conclusion

Comprehensive A. gambiae genome reannotation is achieved through a combination of comparative and ab initio gene prediction algorithms.  相似文献   

8.
MOTIVATION: Computational gene prediction methods are an important component of whole genome analyses. While ab initio gene finders have demonstrated major improvements in accuracy, the most reliable methods are evidence-based gene predictors. These algorithms can rely on several different sources of evidence including predictions from multiple ab initio gene finders, matches to known proteins, sequence conservation and partial cDNAs to predict the final product. Despite the success of these algorithms, prediction of complete gene structures, especially for alternatively spliced products, remains a difficult task. RESULTS: LOCUS (Length Optimized Characterization of Unknown Spliceforms) is a new evidence-based gene finding algorithm which integrates a length-constraint into a dynamic programming-based framework for prediction of gene products. On a Caenorhabditis elegans test set of alternatively spliced internal exons, its performance exceeds that of current ab initio gene finders and in most cases can accurately predict the correct form of all the alternative products. As the length information used by the algorithm can be obtained in a high-throughput fashion, we propose that integration of such information into a gene-prediction pipeline is feasible and doing so may improve our ability to fully characterize the complete set of mRNAs for a genome. AVAILABILITY: LOCUS is available from http://ural.wustl.edu/software.html  相似文献   

9.
In most eukaryotes, transfer RNAs (tRNAs) are one of the very few classes of genes remaining in the mitochondrial genome, but some mitochondria have lost these vestiges of their prokaryotic ancestry. Sequencing of mitogenomes from the flowering plant genus Silene previously revealed a large range in tRNA gene content, suggesting rapid and ongoing gene loss/replacement. Here, we use this system to test longstanding hypotheses about how mitochondrial tRNA genes are replaced by importing nuclear-encoded tRNAs. We traced the evolutionary history of these gene loss events by sequencing mitochondrial genomes from key outgroups (Agrostemma githago and Silene [=Lychnis] chalcedonica). We then performed the first global sequencing of purified plant mitochondrial tRNA populations to characterize the expression of mitochondrial-encoded tRNAs and the identity of imported nuclear-encoded tRNAs. We also confirmed the utility of high-throughput sequencing methods for the detection of tRNA import by sequencing mitochondrial tRNA populations in a species (Solanum tuberosum) with known tRNA trafficking patterns. Mitochondrial tRNA sequencing in Silene revealed substantial shifts in the abundance of some nuclear-encoded tRNAs in conjunction with their recent history of mt-tRNA gene loss and surprising cases where tRNAs with anticodons still encoded in the mitochondrial genome also appeared to be imported. These data suggest that nuclear-encoded counterparts are likely replacing mitochondrial tRNAs even in systems with recent mitochondrial tRNA gene loss, and the redundant import of a nuclear-encoded tRNA may provide a mechanism for functional replacement between translation systems separated by billions of years of evolutionary divergence.  相似文献   

10.
Filarial parasitic nematodes (Filarioidea) cause substantial disease burden to humans and animals around the world. Recently there has been a coordinated global effort to generate, annotate, and curate genomic data from nematode species of medical and veterinary importance. This has resulted in two chromosome-level assemblies (Brugia malayi and Onchocerca volvulus) and 11 additional draft genomes from Filarioidea. These reference assemblies facilitate comparative genomics to explore basic helminth biology and prioritize new drug and vaccine targets. While the continual improvement of genome contiguity and completeness advances these goals, experimental functional annotation of genes is often hindered by poor gene models. Short-read RNA sequencing data and expressed sequence tags, in cooperation with ab initio prediction algorithms, are employed for gene prediction, but these can result in missing clade-specific genes, fragmented models, imperfect mapping of gene ends, and lack of isoform resolution. Long-read RNA sequencing can overcome these drawbacks and greatly improve gene model quality. Here, we present Iso-Seq data for B. malayi and Dirofilaria immitis, etiological agents of lymphatic filariasis and canine heartworm disease, respectively. These data cover approximately half of the known coding genomes and substantially improve gene models by extending untranslated regions, cataloging novel splice junctions from novel isoforms, and correcting mispredicted junctions. Furthermore, we validated computationally predicted operons, manually curated new operons, and merged fragmented gene models. We carried out analyses of poly(A) tails in both species, leading to the identification of non-canonical poly(A) signals. Finally, we prioritized and assessed known and putative anthelmintic targets, correcting or validating gene models for molecular cloning and target-based anthelmintic screening efforts. Overall, these data significantly improve the catalog of gene models for two important parasites, and they demonstrate how long-read RNA sequencing should be prioritized for ongoing improvement of parasitic nematode genome assemblies.  相似文献   

11.
12.
The deluge of data generated by genome sequencing has led to an increasing reliance on bioinformatic predictions, since the traditional experimental approach of characterizing gene function one at a time cannot possibly keep pace with the sequence-based discovery of novel genes. We have utilized Biolog phenotype MicroArrays to identify phenotypes of gene knockout mutants in the opportunistic pathogen and versatile soil bacterium Pseudomonas aeruginosa in a relatively high-throughput fashion. Seventy-eight P. aeruginosa mutants defective in predicted sugar and amino acid membrane transporter genes were screened and clear phenotypes were identified for 27 of these. In all cases, these phenotypes were confirmed by independent growth assays on minimal media. Using qRT-PCR, we demonstrate that the expression levels of 11 of these transporter genes were induced from 4- to 90-fold by their substrates identified via phenotype analysis. Overall, the experimental data showed the bioinformatic predictions to be largely correct in 22 out of 27 cases, and led to the identification of novel transporter genes and a potentially new histamine catabolic pathway. Thus, rapid phenotype identification assays are an invaluable tool for confirming and extending bioinformatic predictions.  相似文献   

13.
14.

Background

Cryptococcus neoformans, a basidiomycetous fungus of universal occurrence, is a significant opportunistic human pathogen causing meningitis. Owing to an increase in the number of immunosuppressed individuals along with emergence of drug-resistant strains, C. neoformans is gaining importance as a pathogen. Although, whole genome sequencing of three varieties of C. neoformans has been completed recently, no global proteomic studies have yet been reported.

Results

We performed a comprehensive proteomic analysis of C. neoformans var. grubii (Serotype A), which is the most virulent variety, in order to provide protein-level evidence for computationally predicted gene models and to refine the existing annotations. We confirmed the protein-coding potential of 3,674 genes from a total of 6,980 predicted protein-coding genes. We also identified 4 novel genes and corrected 104 predicted gene models. In addition, our studies led to the correction of translational start site, splice junctions and reading frame used for translation in a number of proteins. Finally, we validated a subset of our novel findings by RT-PCR and sequencing.

Conclusions

Proteogenomic investigation described here facilitated the validation and refinement of computationally derived gene models in the intron-rich genome of C. neoformans, an important fungal pathogen in humans.  相似文献   

15.
16.
17.
We present a bacterial genome computational analysis pipeline, called GenVar. The pipeline, based on the program GeneWise, is designed to analyze an annotated genome and automatically identify missed gene calls and sequence variants such as genes with disrupted reading frames (split genes) and those with insertions and deletions (indels). For a given genome to be analyzed, GenVar relies on a database containing closely related genomes (such as other species or strains) as well as a few additional reference genomes. GenVar also helps identify gene disruptions probably caused by sequencing errors. We exemplify GenVar's capabilities by presenting results from the analysis of four Brucella genomes. Brucella is an important human pathogen and zoonotic agent. The analysis revealed hundreds of missed gene calls, new split genes and indels, several of which are species specific and hence provide valuable clues to the understanding of the genome basis of Brucella pathogenicity and host specificity.  相似文献   

18.
Since historical times, the inherent human fascination with pearls turned the freshwater pearl mussel Margaritifera margaritifera (Linnaeus, 1758) into a highly valuable cultural and economic resource. Although pearl harvesting in M. margaritifera is nowadays residual, other human threats have aggravated the species conservation status, especially in Europe. This mussel presents a myriad of rare biological features, e.g. high longevity coupled with low senescence and Doubly Uniparental Inheritance of mitochondrial DNA, for which the underlying molecular mechanisms are poorly known. Here, the first draft genome assembly of M. margaritifera was produced using a combination of Illumina Paired-end and Mate-pair approaches. The genome assembly was 2.4 Gb long, possessing 105,185 scaffolds and a scaffold N50 length of 288,726 bp. The ab initio gene prediction allowed the identification of 35,119 protein-coding genes. This genome represents an essential resource for studying this species’ unique biological and evolutionary features and ultimately will help to develop new tools to promote its conservation.  相似文献   

19.
20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号