期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy

Franziska Zickmann Bernhard Y Renard 《BMC genomics》2015,16(1)

Background

Gene prediction is a challenging but crucial part in most genome analysis pipelines. Various methods have evolved that predict genes ab initio on reference sequences or evidence based with the help of additional information, such as RNA-Seq reads or EST libraries. However, none of these strategies is bias-free and one method alone does not necessarily provide a complete set of accurate predictions.

Results

We present IPred (Integrative gene Prediction), a method to integrate ab initio and evidence based gene identifications to complement the advantages of different prediction strategies. IPred builds on the output of gene finders and generates a new combined set of gene identifications, representing the integrated evidence of the single method predictions.

Conclusion

We evaluate IPred in simulations and real data experiments on Escherichia Coli and human data. We show that IPred improves the prediction accuracy in comparison to single method predictions and to existing methods for prediction combination.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1315-9) contains supplementary material, which is available to authorized users. 相似文献

2.

Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm

Alexandre Lomsadze Paul D. Burns Mark Borodovsky 《Nucleic acids research》2014,42(15):e119

相似文献

3.

Evaluating High-Throughput Ab Initio Gene Finders to Discover Proteins Encoded in Eukaryotic Pathogen Genomes Missed by Laboratory Techniques

Stephen J. Goodswen Paul J. Kennedy John T. Ellis 《PloS one》2012,7(11)

Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers. 相似文献

4.

UnSplicer: mapping spliced RNA-seq reads in compact genomes and filtering noisy splicing

Paul D. Burns Yang Li Jian Ma Mark Borodovsky 《Nucleic acids research》2014,42(4):e25

Accurate mapping of spliced RNA-Seq reads to genomic DNA has been known as a challenging problem. Despite significant efforts invested in developing efficient algorithms, with the human genome as a primary focus, the best solution is still not known. A recently introduced tool, TrueSight, has demonstrated better performance compared with earlier developed algorithms such as TopHat and MapSplice. To improve detection of splice junctions, TrueSight uses information on statistical patterns of nucleotide ordering in intronic and exonic DNA. This line of research led to yet another new algorithm, UnSplicer, designed for eukaryotic species with compact genomes where functional alternative splicing is likely to be dominated by splicing noise. Genome-specific parameters of the new algorithm are generated by GeneMark-ES, an ab initio gene prediction algorithm based on unsupervised training. UnSplicer shares several components with TrueSight; the difference lies in the training strategy and the classification algorithm. We tested UnSplicer on RNA-Seq data sets of Arabidopsis thaliana, Caenorhabditis elegans, Cryptococcus neoformans and Drosophila melanogaster. We have shown that splice junctions inferred by UnSplicer are in better agreement with knowledge accumulated on these well-studied genomes than predictions made by earlier developed tools. 相似文献

5.

MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes

Huaiqiu Zhu Gang-Qing Hu Yi-Fan Yang Jin Wang Zhen-Su She 《BMC bioinformatics》2007,8(1):97

Background

Despite a remarkable success in the computational prediction of genes in Bacteria and Archaea, a lack of comprehensive understanding of prokaryotic gene structures prevents from further elucidation of differences among genomes. It continues to be interesting to develop new ab initio algorithms which not only accurately predict genes, but also facilitate comparative studies of prokaryotic genomes. 相似文献

6.

Gene structure conservation aids similarity based gene prediction 总被引：4，自引：1，他引：3

下载免费PDF全文

Meyer IM Durbin R 《Nucleic acids research》2004,32(2):776-783

One of the primary tasks in deciphering the functional contents of a newly sequenced genome is the identification of its protein coding genes. Existing computational methods for gene prediction include ab initio methods which use the DNA sequence itself as the only source of information, comparative methods using multiple genomic sequences, and similarity based methods which employ the cDNA or protein sequences of related genes to aid the gene prediction. We present here an algorithm implemented in a computer program called Projector which combines comparative and similarity approaches. Projector employs similarity information at the genomic DNA level by directly using known genes annotated on one DNA sequence to predict the corresponding related genes on another DNA sequence. It therefore makes explicit use of the conservation of the exon–intron structure between two related genes in addition to the similarity of their encoded amino acid sequences. We evaluate the performance of Projector by comparing it with the program Genewise on a test set of 491 pairs of independently confirmed mouse and human genes. It is more accurate than Genewise for genes whose proteins are <80% identical, and is suitable for use in a combined gene prediction system where other methods identify well conserved and non-conserved genes, and pseudogenes. 相似文献

7.

PSPP: A Protein Structure Prediction Pipeline for Computing Clusters

Michael S. Lee Rajkumar Bondugula Valmik Desai Nela Zavaljevski In-Chul Yeh Anders Wallqvist Jaques Reifman 《PloS one》2009,4(7)

Background

Protein structures are critical for understanding the mechanisms of biological systems and, subsequently, for drug and vaccine design. Unfortunately, protein sequence data exceed structural data by a factor of more than 200 to 1. This gap can be partially filled by using computational protein structure prediction. While structure prediction Web servers are a notable option, they often restrict the number of sequence queries and/or provide a limited set of prediction methodologies. Therefore, we present a standalone protein structure prediction software package suitable for high-throughput structural genomic applications that performs all three classes of prediction methodologies: comparative modeling, fold recognition, and ab initio. This software can be deployed on a user''s own high-performance computing cluster.

Methodology/Principal Findings

The pipeline consists of a Perl core that integrates more than 20 individual software packages and databases, most of which are freely available from other research laboratories. The query protein sequences are first divided into domains either by domain boundary recognition or Bayesian statistics. The structures of the individual domains are then predicted using template-based modeling or ab initio modeling. The predicted models are scored with a statistical potential and an all-atom force field. The top-scoring ab initio models are annotated by structural comparison against the Structural Classification of Proteins (SCOP) fold database. Furthermore, secondary structure, solvent accessibility, transmembrane helices, and structural disorder are predicted. The results are generated in text, tab-delimited, and hypertext markup language (HTML) formats. So far, the pipeline has been used to study viral and bacterial proteomes.

Conclusions

The standalone pipeline that we introduce here, unlike protein structure prediction Web servers, allows users to devote their own computing assets to process a potentially unlimited number of queries as well as perform resource-intensive ab initio structure prediction. 相似文献

8.

OMIGA: Optimized Maker-Based Insect Genome Annotation

Jinding Liu Huamei Xiao Shuiqing Huang Fei Li 《Molecular genetics and genomics : MGG》2014,289(4):567-573

相似文献

9.

Combining Physicochemical and Evolutionary Information for Protein Contact Prediction

Michael Schneider Oliver Brock 《PloS one》2014,9(10)

We introduce a novel contact prediction method that achieves high prediction accuracy by combining evolutionary and physicochemical information about native contacts. We obtain evolutionary information from multiple-sequence alignments and physicochemical information from predicted ab initio protein structures. These structures represent low-energy states in an energy landscape and thus capture the physicochemical information encoded in the energy function. Such low-energy structures are likely to contain native contacts, even if their overall fold is not native. To differentiate native from non-native contacts in those structures, we develop a graph-based representation of the structural context of contacts. We then use this representation to train an support vector machine classifier to identify most likely native contacts in otherwise non-native structures. The resulting contact predictions are highly accurate. As a result of combining two sources of information—evolutionary and physicochemical—we maintain prediction accuracy even when only few sequence homologs are present. We show that the predicted contacts help to improve ab initio structure prediction. A web service is available at http://compbio.robotics.tu-berlin.de/epc-map/. 相似文献

10.

Efficient Plant Gene Identification Based on Interspecies Mapping of Full-Length cDNAs

Naoki Amano Tsuyoshi Tanaka Hisataka Numa Hiroaki Sakai Takeshi Itoh 《DNA research》2010,17(5):271-279

We present an annotation pipeline that accurately predicts exon–intron structures and protein-coding sequences (CDSs) on the basis of full-length cDNAs (FLcDNAs). This annotation pipeline was used to identify genes in 10 plant genomes. In particular, we show that interspecies mapping of FLcDNAs to genomes is of great value in fully utilizing FLcDNA resources whose availability is limited to several species. Because low sequence conservation at 5′- and 3′-ends of FLcDNAs between different species tends to result in truncated CDSs, we developed an improved algorithm to identify complete CDSs by the extension of both ends of truncated CDSs. Interspecies mapping of 71 801 monocot FLcDNAs to the Oryza sativa genome led to the detection of 22 142 protein-coding regions. Moreover, in comparing two mapping programs and three ab initio prediction programs, we found that our pipeline was more capable of identifying complete CDSs. As demonstrated by monocot interspecies mapping, in which nucleotide identity between FLcDNAs and the genome was ∼80%, the resultant inferred CDSs were sufficiently accurate. Finally, we applied both inter- and intraspecies mapping to 10 monocot and dicot genomes and identified genes in 210 551 loci. Interspecies mapping of FLcDNAs is expected to effectively predict genes and CDSs in newly sequenced genomes. 相似文献

11.

Genome/transcriptome analysis of the chigger mite Leptotrombidium pallidum,a major vector for scrub typhus,with a special focus on genes more abundantly expressed in larval stage

《Journal of Asia》2020,23(3):816-824

Leptotrombidium pallidum is the major vector mite for Orientia tsutsugamushi, the causative agent of scrub typhus, in Asian countries, including Korea. Despite its medical importance, L. pallidum has little genetic information available to date. To analyze the L. pallidum genome, we extracted genomic DNA (gDNA) from a single female of a 7-generation inbred L. pallidum colony and amplified the gDNA by whole genome amplification (WGA). The resulting amplified gDNA was used to construct paired-end and mate-pair libraries that were sequenced using Illumina platforms (HiSeq2000 and MiSeq). More than 45 Gb of sequence reads from both paired-end and mate-pair libraries of the WGA gDNA were trimmed and then de novo assembled using CLC Assembly Cell v.4.0 for contig assembly and SSPACE for scaffolding. The assembly generated approximately 6,545 scaffolds with an N50 value of 92,945 and total size of ~ 193 Mb. For gene predictions, the PASA and GeneWise models were used, and ab initio gene predictions were performed independently, resulting in the prediction of 15,842 genes. RNA-Seq expression profiles revealed constitutive expression of 11,572 unique protein-coding genes in larva, 12,364 in protonymph, 12,872 in male adult, and 12,617 in female adult stages. Of the 15,842 predicted genes, 10,885 were commonly expressed through all L. pallidum stages. Genes selectively over-transcribed in the larval stage, which is when host parasitization and disease transmission occur, were further annotated, and their putative roles were discussed. 相似文献

12.

Re-annotation of the woodland strawberry (Fragaria vesca) genome

Omar Darwish Rachel Shahan Zhongchi Liu Janet P Slovin Nadim W Alkharouf 《BMC genomics》2015,16(1)

相似文献

13.

FREAD revisited: Accurate loop structure prediction using a database search algorithm

Yoonjoo Choi Charlotte M. Deane 《Proteins》2010,78(6):1431-1440

Loops are the most variable regions of protein structure and are, in general, the least accurately predicted. Their prediction has been approached in two ways, ab initio and database search. In recent years, it has been thought that ab initio methods are more powerful. In light of the continued rapid expansion in the number of known protein structures, we have re‐evaluated FREAD, a database search method and demonstrate that the power of database search methods may have been underestimated. We found that sequence similarity as quantified by environment specific substitution scores can be used to significantly improve prediction. In fact, FREAD performs appreciably better for an identifiable subset of loops (two thirds of shorter loops and half of the longer loops tested) than the ab initio methods of MODELLER, PLOP, and RAPPER. Within this subset, FREAD's predictive ability is length independent, in general, producing results within 2Å RMSD, compared to an average of over 10Å for loop length 20 for any of the other tested methods. We also benchmarked the prediction protocols on a set of 212 loops from the model structures in CASP 7 and 8. An extended version of FREAD is able to make predictions for 127 of these, it gives the best prediction of the methods tested in 61 of these cases. In examining FREAD's ability to predict in the model environment, we found that whole structure quality did not affect the quality of loop predictions. Proteins 2010. © 2009 Wiley‐Liss, Inc. 相似文献

14.

Gene identification in novel eukaryotic genomes by self-training algorithm 总被引：8，自引：0，他引：8

Lomsadze A Ter-Hovhannisyan V Chernoff YO Borodovsky M 《Nucleic acids research》2005,33(20):6494-6506

Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification. 相似文献

15.

Dihedral angle and secondary structure database of short amino acid fragments

下载免费PDF全文

Dayalan S Gooneratne ND Bevinakoppa S Schroder H 《Bioinformation》2006,1(3):78-80

Dihedral angles of amino acids are of considerable importance in protein tertiary structure prediction as they define the backbone of a protein and hence almost define the protein's entire conformation. Most ab initio protein structure prediction methods predict the secondary structure of a protein before predicting the tertiary structure because three-dimensional fold consists of repeating units of secondary structures. Hence, both dihedral angles and secondary structures are important in tertiary structure prediction of proteins. Here we describe a database called DASSD (Dihedral Angle and Secondary Structure Database of Short Amino acid Fragments) that contains dihedral angle values and secondary structure details of short amino acid fragments of lengths 1, 3 and 5. Information stored in this database was extracted from a set of 5,227 non-redundant high resolution (less than 2-angstroms) protein structures. In total, DASSD stores details for about 733,000 fragments. This database finds application in the development of ab initio protein structure prediction methods using fragment libraries and fragment assembly techniques. It is also useful in protein secondary structure prediction.

Availability 相似文献

16.

Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation

Jill L. Wegrzyn John D. Liechty Kristian A. Stevens Le-Shin Wu Carol A. Loopstra Hans A. Vasquez-Gross William M. Dougherty Brian Y. Lin Jacob J. Zieve Pedro J. Martínez-García Carson Holt Mark Yandell Aleksey V. Zimin James A. Yorke Marc W. Crepeau Daniela Puiu Steven L. Salzberg Pieter J. de Jong Keithanne Mockaitis Doreen Main Charles H. Langley David B. Neale 《Genetics》2014,196(3):891-909

The largest genus in the conifer family Pinaceae is Pinus, with over 100 species. The size and complexity of their genomes (∼20–40 Gb, 2n = 24) have delayed the arrival of a well-annotated reference sequence. In this study, we present the annotation of the first whole-genome shotgun assembly of loblolly pine (Pinus taeda L.), which comprises 20.1 Gb of sequence. The MAKER-P annotation pipeline combined evidence-based alignments and ab initio predictions to generate 50,172 gene models, of which 15,653 are classified as high confidence. Clustering these gene models with 13 other plant species resulted in 20,646 gene families, of which 1554 are predicted to be unique to conifers. Among the conifer gene families, 159 are composed exclusively of loblolly pine members. The gene models for loblolly pine have the highest median and mean intron lengths of 24 fully sequenced plant genomes. Conifer genomes are full of repetitive DNA, with the most significant contributions from long-terminal-repeat retrotransposons. In depth analysis of the tandem and interspersed repetitive content yielded a combined estimate of 82%. 相似文献

17.

Using mRNAs lengths to accurately predict the alternatively spliced gene products in Caenorhabditis elegans

Agrawal R Stormo GD 《Bioinformatics (Oxford, England)》2006,22(10):1239-1244

MOTIVATION: Computational gene prediction methods are an important component of whole genome analyses. While ab initio gene finders have demonstrated major improvements in accuracy, the most reliable methods are evidence-based gene predictors. These algorithms can rely on several different sources of evidence including predictions from multiple ab initio gene finders, matches to known proteins, sequence conservation and partial cDNAs to predict the final product. Despite the success of these algorithms, prediction of complete gene structures, especially for alternatively spliced products, remains a difficult task. RESULTS: LOCUS (Length Optimized Characterization of Unknown Spliceforms) is a new evidence-based gene finding algorithm which integrates a length-constraint into a dynamic programming-based framework for prediction of gene products. On a Caenorhabditis elegans test set of alternatively spliced internal exons, its performance exceeds that of current ab initio gene finders and in most cases can accurately predict the correct form of all the alternative products. As the length information used by the algorithm can be obtained in a high-throughput fashion, we propose that integration of such information into a gene-prediction pipeline is feasible and doing so may improve our ability to fully characterize the complete set of mRNAs for a genome. AVAILABILITY: LOCUS is available from http://ural.wustl.edu/software.html 相似文献

18.

Integrated modeling of protein-coding genes in the Manduca sexta genome using RNA-Seq data from the biochemical model insect

《Insect biochemistry and molecular biology》2015

The genome sequence of Manduca sexta was recently determined using 454 technology. Cufflinks and MAKER2 were used to establish gene models in the genome assembly based on the RNA-Seq data and other species' sequences. Aided by the extensive RNA-Seq data from 50 tissue samples at various life stages, annotators over the world (including the present authors) have manually confirmed and improved a small percentage of the models after spending months of effort. While such collaborative efforts are highly commendable, many of the predicted genes still have problems which may hamper future research on this insect species. As a biochemical model representing lepidopteran pests, M. sexta has been used extensively to study insect physiological processes for over five decades. In this work, we assembled Manduca datasets Cufflinks 3.0, Trinity 4.0, and Oases 4.0 to assist the manual annotation efforts and development of Official Gene Set (OGS) 2.0. To further improve annotation quality, we developed methods to evaluate gene models in the MAKER2, Cufflinks, Oases and Trinity assemblies and selected the best ones to constitute MCOT 1.0 after thorough crosschecking. MCOT 1.0 has 18,089 genes encoding 31,666 proteins: 32.8% match OGS 2.0 models perfectly or near perfectly, 11,747 differ considerably, and 29.5% are absent in OGS 2.0. Future automation of this process is anticipated to greatly reduce human efforts in generating comprehensive, reliable models of structural genes in other genome projects where extensive RNA-Seq data are available. 相似文献

19.

From Gigabyte to Kilobyte: A Bioinformatics Protocol for Mining Large RNA-Seq Transcriptomics Data

Jilong Li Jie Hou Lin Sun Jordan Maximillian Wilkins Yuan Lu Chad E. Niederhuth Benjamin Ryan Merideth Thomas P. Mawhinney Valeri V. Mossine C. Michael Greenlief John C. Walker William R. Folk Mark Hannink Dennis B. Lubahn James A. Birchler Jianlin Cheng 《PloS one》2015,10(4)

RNA-Seq techniques generate hundreds of millions of short RNA reads using next-generation sequencing (NGS). These RNA reads can be mapped to reference genomes to investigate changes of gene expression but improved procedures for mining large RNA-Seq datasets to extract valuable biological knowledge are needed. RNAMiner—a multi-level bioinformatics protocol and pipeline—has been developed for such datasets. It includes five steps: Mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks. To demonstrate its utility, we applied RNAMiner to datasets generated from Human, Mouse, Arabidopsis thaliana, and Drosophila melanogaster cells, and successfully identified differentially expressed genes, clustered them into cohesive functional groups, and constructed novel gene regulatory networks. The RNAMiner web service is available at http://calla.rnet.missouri.edu/rnaminer/index.html. 相似文献

20.

Comprehensive Human Transcription Factor Binding Site Map for Combinatory Binding Motifs Discovery

Arnoldo J. Müller-Molina Hans R. Sch?ler Marcos J. Araúzo-Bravo 《PloS one》2012,7(11)

相似文献