Found 20 similar articles; search took 46 ms
1.
Functional and structural genomics using PEDANT Total citations: 11, self-citations: 0, citations by others: 11
Frishman D Albermann K Hani J Heumann K Metanomski A Zollner A Mewes HW 《Bioinformatics (Oxford, England)》2001,17(1):44-57
MOTIVATION: Enormous demand for fast and accurate analysis of biological sequences is fuelled by the pace of genome analysis efforts. There is also an acute need for reliable up-to-date genomic databases integrating both functional and structural information. Here we describe the current status of the PEDANT software system for high-throughput analysis of large biological sequence sets and the genome analysis server associated with it. RESULTS: The principal features of PEDANT are: (i) completely automatic processing of data using a wide range of bioinformatics methods, (ii) manual refinement of annotation, (iii) automatic and manual assignment of gene products to a number of functional and structural categories, (iv) extensive hyperlinked protein reports, and (v) advanced DNA and protein viewers. The system is easily extensible and allows custom methods, databases, and categories to be included with minimal or no programming effort. PEDANT is actively used as a collaborative environment to support several on-going genome sequencing projects. The main purpose of the PEDANT genome database is to quickly disseminate well-organized information on completely sequenced and unfinished genomes. It currently includes 80 genomic sequences and in many cases serves as the only source of exhaustive information on a given genome. The database also acts as a vehicle for a number of research projects in bioinformatics. Using SQL queries, it is possible to correlate a large variety of pre-computed properties of gene products encoded in complete genomes with each other and compare them with data sets of special scientific interest. In particular, the availability of structural predictions for over 300 000 genomic proteins makes PEDANT the most extensive structural genomics resource available on the web.
2.
Hyoung-Sam Heo Sanghyuk Lee Yeon Ja Choi S. June Oh 《Biochemical and biophysical research communications》2010,397(1):120-126
Peptide mass fingerprinting (PMF) has become one of the most widely used methods for rapid identification of proteins in proteomics research. Many peaks, however, remain unassigned after PMF analysis, partly because of post-translational modification and the limited scope of protein sequences. Almost all PMF tools employ only known or predicted protein sequences and do not include open reading frames (ORFs) in the genome, which eliminates the chance of finding novel functional peptides. Unlike most tools that search protein sequences from known coding sequences, the tool we developed uses a database for theoretical small ORFs (tsORFs) and a PMF application using a tsORFs database (tsORFdb). The tsORFdb is a database for ORFeome that encompasses all potential tsORFs derived from whole genome sequences as well as the predicted ones. The massProphet system tries to extend the search scope to include the ORFeome using the tsORFdb. The tsORFdb and massProphet should be useful for proteomics research to give information about unknown small ORFs as well as predicted and registered proteins.
3.
Identification of six novel genes by experimental validation of GeneMachine predicted genes Total citations: 1, self-citations: 0, citations by others: 1
Makalowska I Sood R Faruque MU Hu P Robbins CM Eddings EM Mestre JD Baxevanis AD Carpten JD 《Gene》2002,284(1-2):203-213
4.
Millares P Lacourse EJ Perally S Ward DA Prescott MC Hodgkinson JE Brophy PM Rees HH 《PloS one》2012,7(3):e33590
Lack of genomic sequence data and the relatively high cost of tandem mass spectrometry have hampered proteomic investigations into helminths, such as resolving the mechanism underpinning globally reported anthelmintic resistance. Whilst detailed mechanisms of resistance remain unknown for the majority of drug-parasite interactions, gene mutations and changes in gene and protein expression are proposed key aspects of resistance. Comparative proteomic analysis of drug-resistant and -susceptible nematodes may reveal protein profiles reflecting drug-related phenotypes. Using the gastro-intestinal nematode, Haemonchus contortus as case study, we report the application of freely available expressed sequence tag (EST) datasets to support proteomic studies in unsequenced nematodes. EST datasets were translated to theoretical protein sequences to generate a searchable database. In conjunction with matrix-assisted laser desorption ionisation time-of-flight mass spectrometry (MALDI-TOF-MS), Peptide Mass Fingerprint (PMF) searching of databases enabled a cost-effective protein identification strategy. The effectiveness of this approach was verified in comparison with MS/MS de novo sequencing with searching of the same EST protein database and subsequent searches of the NCBInr protein database using the Basic Local Alignment Search Tool (BLAST) to provide protein annotation. Of 100 proteins from 2-DE gel spots, 62 were identified by MALDI-TOF-MS and PMF searching of the EST database. Twenty randomly selected spots were analysed by electrospray MS/MS and MASCOT Ion Searches of the same database. The resulting sequences were subjected to BLAST searches of the NCBI protein database to provide annotation of the proteins and confirm concordance in protein identity from both approaches. Further confirmation of protein identifications from the MS/MS data were obtained by de novo sequencing of peptides, followed by FASTS algorithm searches of the EST putative protein database. 
This study demonstrates the cost-effective use of available EST databases and inexpensive, accessible MALDI-TOF MS in conjunction with PMF for reliable protein identification in unsequenced organisms.
5.
Gene identification in genomic DNA from eukaryotes is complicated by the vast combinatorial possibilities of potential exon assemblies. If the gene encodes a protein that is closely related to known proteins, gene identification is aided by matching similarity of potential translation products to those target proteins. The genomic DNA and protein sequences can be aligned directly by scoring the implied residues of in-frame nucleotide triplets against the protein residues in conventional ways, while allowing for long gaps in the alignment corresponding to introns in the genomic DNA. We describe a novel method for such spliced alignment. The method derives an optimal alignment based on scoring for both sequence similarity of the predicted gene product to the protein sequence and intrinsic splice site strength of the predicted introns. Application of the method to a representative set of 50 known genes from Arabidopsis thaliana showed significant improvement in prediction accuracy compared to previous spliced alignment methods. The method is also more accurate than ab initio gene prediction methods, provided sufficiently close target proteins are available. In view of the fast growth of public sequence repositories, we argue that close targets will be available for the majority of novel genes, making spliced alignment an excellent practical tool for high-throughput automated genome annotation.
6.
Genomic BLAST: custom-defined virtual databases for complete and unfinished genomes Total citations: 10, self-citations: 0, citations by others: 10
Cummings L Riley L Black L Souvorov A Resenchuk S Dondoshansky I Tatusova T 《FEMS microbiology letters》2002,216(2):133-138
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of the genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.
7.
8.
9.
Erythritol is a noncariogenic, low calorie sweetener. It is safe for people with diabetes and obese people. Candida magnoliae is an industrially important organism because of its ability to produce erythritol as a major product. The genome of C. magnoliae has not been sequenced yet, limiting the available proteome database. Therefore, systematic approaches were employed to construct the proteome map of C. magnoliae. Proteomic analysis with systematic approaches is based on two-dimensional electrophoresis, matrix-assisted laser desorption ionization time of flight mass spectrometry (MALDI-TOF MS), tandem mass spectrometry (MS/MS) and database interrogation. First, 24 spots were analyzed using peptide mass fingerprinting along with MALDI-TOF MS with high mass accuracy. Only four spots were reliably identified as carbonyl reductase and its isoforms. The reason for low sequence coverage seemed to be that these identification strategies were based on the presence of the protein database obtained from the publicly accessible genome database and the availability of cross-species protein identification. MS/MS (MS/MS ion search and de novo sequencing) in combination with similarity searches allowed successful identification of 39 spots. Several proteins including transaldolase identified by MS/MS ion searches were further confirmed by partial sequences from the expressed sequence tag database. In this study, 51 protein spots were analyzed and then potentially identified. The identified proteins were involved in glycolysis, stress response, other essential metabolisms and cell structures.
10.
Frishman D Mokrejs M Kosykh D Kastenmüller G Kolesov G Zubrzycki I Gruber C Geier B Kaps A Albermann K Volz A Wagner C Fellenberg M Heumann K Mewes HW 《Nucleic acids research》2003,31(1):207-211
The PEDANT genome database (http://pedant.gsf.de) provides exhaustive automatic analysis of genomic sequences by a large variety of established bioinformatics tools through a comprehensive Web-based user interface. One hundred and seventy seven completely sequenced and unfinished genomes have been processed so far, including large eukaryotic genomes (mouse, human) published recently. In this contribution, we describe the current status of the PEDANT database and novel analytical features added to the PEDANT server in 2002. Those include: (i) integration with the BioRS data retrieval system which allows fast text queries, (ii) pre-computed sequence clusters in each complete genome, (iii) a comprehensive set of tools for genome comparison, including genome comparison tables and protein function prediction based on genomic context, and (iv) computation and visualization of protein-protein interaction (PPI) networks based on experimental data. The availability of functional and structural predictions for 650 000 genomic proteins in well organized form makes PEDANT a useful resource for both functional and structural genomics.
11.
Separation of proteins by two-dimensional gel electrophoresis (2-DE) coupled with identification of proteins through peptide mass fingerprinting (PMF) by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) is a widely used technique for proteomic analysis. This approach relies, however, on the presence of the proteins studied in publicly accessible protein databases or the availability of annotated genome sequences of an organism. In this work, we investigated the reliability of using raw genome sequences for identifying proteins by PMF without the need for additional information such as amino acid sequences. The method is demonstrated for proteomic analysis of Klebsiella pneumoniae grown anaerobically on glycerol. Of 197 spots excised from 2-DE gels and submitted for mass spectrometric analysis, 164 spots were clearly identified as 122 individual proteins. 95% of the 164 spots can be successfully identified merely by using peptide mass fingerprints and a strain-specific protein database (ProtKpn) constructed from the raw genome sequences of K. pneumoniae. Cross-species protein searching in the public databases resulted in the identification of 57% of the 66 highly expressed protein spots, compared with 97% using the ProtKpn database. 10 dha regulon related proteins that are essential for the initial enzymatic steps of anaerobic glycerol metabolism were successfully identified using the ProtKpn database, whereas none of them could be identified by cross-species searching. In conclusion, the use of a strain-specific protein database constructed from raw genome sequences makes it possible to reliably identify most of the proteins from 2-DE analysis simply through peptide mass fingerprinting.
12.
Protein identification via peptide mass fingerprinting (PMF) remains a key component of high-throughput proteomics experiments in post-genomic science. Candidate protein identifications are made using bioinformatic tools from peptide peak lists obtained via mass spectrometry (MS). These algorithms rely on several search parameters, including the number of potential uncut peptide bonds matching the primary specificity of the hydrolytic enzyme used in the experiment. Typically, up to one of these "missed cleavages" is considered by the bioinformatics search tools, usually after digestion of the in silico proteome by trypsin. Using two distinct, nonredundant datasets of peptides identified via PMF and tandem MS, a simple predictive method based on information theory is presented which is able to identify experimentally defined missed cleavages with up to 90% accuracy from amino acid sequence alone. Using this simple protocol, we are able to "mask" candidate protein databases so that confident missed cleavage sites need not be considered for in silico digestion. We show that this leads to an improvement in database searching, with two different search engines, using the PMF dataset as a test set. In addition, the improved approach is also demonstrated on an independent PMF data set of known proteins that also has corresponding high-quality tandem MS data, validating the protein identifications. This approach has wider applicability for proteomics database searching, and the program for predicting missed cleavages and masking Fasta-formatted protein sequence databases has been made available via http://ispider.smith.man.ac.uk/MissedCleave.
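As a rough illustration of the in silico digestion step described in the abstract above, the following Python sketch enumerates tryptic peptides while allowing a configurable number of missed cleavages. It assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P. The function names and the `masked_sites` parameter are hypothetical; `masked_sites` only mimics the idea of masking sites that are confidently predicted to stay uncut, and it does not implement the information-theoretic predictor from the paper.

```python
def tryptic_fragments(protein, masked_sites=frozenset()):
    """Split a protein after K/R (not before P), skipping masked positions.

    masked_sites holds 0-based residue indices where cleavage is suppressed
    (a hypothetical stand-in for predicted missed-cleavage sites).
    """
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        cleavable = residue in "KR" and next_residue != "P"
        if cleavable and i not in masked_sites:
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def peptides_with_missed_cleavages(protein, max_missed=1, masked_sites=frozenset()):
    """Join up to max_missed + 1 adjacent fragments into candidate peptides."""
    fragments = tryptic_fragments(protein, masked_sites)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    print(peptides_with_missed_cleavages("MKAARPGKT", max_missed=1))
```

Masking a site reduces the number of candidate peptides a search engine has to score, which is one plausible reading of why the authors report improved database searching.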
13.
BAC-end Sequence Analysis and a Draft Physical Map of the Common Bean (Phaseolus vulgaris L.) Genome
Jessica A. Schlueter Jose Luis Goicoechea Kristi Collura Navdeep Gill Jer-Young Lin Yeisoo Yu Dave Kudrna Andrea Zuccolo C. Eduardo Vallejos Monica Muñoz-Torres Matthew W. Blair Joe Tohme Jeff Tomkins Phillip McClean Rod A. Wing Scott A. Jackson 《Tropical plant biology》2008,1(1):40-48
Common bean (Phaseolus vulgaris L.) is a legume that is an important source of dietary protein in developing countries throughout the world. Utilizing the G19833 BAC library for P. vulgaris from Clemson University, 89,017 BAC-end sequences were generated giving 62,588,675 base pairs of genomic sequence covering approximately 9.54% of the genome. Analysis of these sequences in combination with 1,404 shotgun sequences from the cultivar Bat7 revealed that approximately 49.2% of the genome contains repetitive sequence and 29.3% is genic. Compared to other legume BAC-end sequencing projects, it appears that P. vulgaris has higher predicted levels of repetitive sequence, but this may be due to a more intense identification strategy combining both similarity-based matches as well as de novo identification of repeats. In addition, fingerprints for 41,717 BACs were obtained and assembled into a draft physical map consisting of 1,183 clone contigs and 6,385 singletons with ~9x coverage of the genome.
14.
15.
Marine R Polson SW Ravel J Hatfull G Russell D Sullivan M Syed F Dumas M Wommack KE 《Applied and environmental microbiology》2011,77(22):8071-8079
Construction of DNA fragment libraries for next-generation sequencing can prove challenging, especially for samples with low DNA yield. Protocols devised to circumvent the problems associated with low starting quantities of DNA can result in amplification biases that skew the distribution of genomes in metagenomic data. Moreover, sample throughput can be slow, as current library construction techniques are time-consuming. This study evaluated Nextera, a new transposon-based method that is designed for quick production of DNA fragment libraries from a small quantity of DNA. The sequence read distribution across nine phage genomes in a mock viral assemblage met predictions for six of the least-abundant phages; however, the rank order of the most abundant phages differed slightly from predictions. De novo genome assemblies from Nextera libraries provided long contigs spanning over half of the phage genome; in four cases where full-length genome sequences were available for comparison, consensus sequences were found to match over 99% of the genome with near-perfect identity. Analysis of areas of low and high sequence coverage within phage genomes indicated that GC content may influence coverage of sequences from Nextera libraries. Comparisons of phage genomes prepared using both Nextera and a standard 454 FLX Titanium library preparation protocol suggested that the coverage biases according to GC content observed within the Nextera libraries were largely attributable to bias in the Nextera protocol rather than to the 454 sequencing technology. Nevertheless, given suitable sequence coverage, the Nextera protocol produced high-quality data for genomic studies. For metagenomics analyses, effects of GC amplification bias would need to be considered; however, the library preparation standardization that Nextera provides should benefit comparative metagenomic analyses.
16.
Isabelle O'Bryon Sarah C. Jenson Eric D. Merkley 《Protein science : a publication of the Protein Society》2020,29(9):1864-1878
Mass spectrometry‐based proteomics is a popular and powerful method for precise and highly multiplexed protein identification. The most common method of analyzing untargeted proteomics data is called database searching, where the database is simply a collection of protein sequences from the target organism, derived from genome sequencing. Experimental peptide tandem mass spectra are compared to simplified models of theoretical spectra calculated from the translated genomic sequences. However, in several interesting application areas, such as forensics, archaeology, venomics, and others, a genome sequence may not be available, or the correct genome sequence to use is not known. In these cases, de novo peptide identification can play an important role. De novo methods infer peptide sequence directly from the tandem mass spectrum without reference to a sequence database, usually using graph‐based or machine learning algorithms. In this review, we provide a basic overview of de novo peptide identification methods and applications, briefly covering de novo algorithms and tools, and focusing in more depth on recent applications from venomics, metaproteomics, forensics, and characterization of antibody drugs.
17.
Dorella FA Fachin MS Billault A Dias Neto E Soravito C Oliveira SC Meyer R Miyoshi A Azevedo V 《Genetics and molecular research : GMR》2006,5(4):653-663
Corynebacterium pseudotuberculosis is a gram-positive bacterium that causes caseous lymphadenitis in sheep and goats. However, despite the economic losses caused by caseous lymphadenitis, there is little information about the molecular mechanisms of pathogenesis of this bacterium. Genomic libraries constructed in bacterial artificial chromosome (BAC) vectors have become the method of choice for clone development in high-throughput genomic-sequencing projects. Large-insert DNA libraries are useful for isolation and characterization of important genomic regions and genes. In order to identify targets that might be useful for genome sequencing, we constructed a C. pseudotuberculosis BAC library in the vector pBeloBAC11. This library contains about 18,000 BAC clones, with inserts ranging in size from 25 to 120 kb, theoretically representing a 390-fold coverage of the C. pseudotuberculosis genome (estimated to be 2.5-3.1 Mb). Many genomic survey sequences (GSSs) with homology to C. diphtheriae, C. glutamicum, C. efficiens, and C. jeikeium proteins were observed within a sample of 215 sequenced clones, confirming their close phylogenetic relationship. Computer analyses of GSSs did not detect chimeric, deleted, or rearranged BAC clones, showing that this library has low redundancy. This GSSs collection is now available for further genetic and physical analysis of the C. pseudotuberculosis genome. The GSS strategy that we used to develop our library proved to be efficient for the identification of genes and will be an important tool for mapping, assembly, comparative, and functional genomic studies in a C. pseudotuberculosis genome sequencing project that will begin this year.
18.
With the continuing accomplishments of the human genome project, high-throughput strategies to identify DNA sequences that are important in mammalian gene regulation are becoming increasingly feasible. In contrast to the historic, labour-intensive, wet-laboratory methods for identifying regulatory sequences, many modern approaches are heavily focused on the computational analysis of large genomic data sets. Data from inter-species genomic sequence comparisons and genome-wide expression profiling, integrated with various computational tools, are poised to contribute to the decoding of genomic sequence and to the identification of those sequences that orchestrate gene regulation. In this review, we highlight several genomic approaches that are being used to identify regulatory sequences in mammalian genomes.
19.
Sun XM Tang YP Meng XZ Zhang WW Li S Deng ZR Xu ZK Song RT 《Acta biochimica et biophysica Sinica》2006,38(11):812-820
Dunaliella is a genus of wall-less unicellular eukaryotic green algae. Its exceptional resistance to salt and various other stresses has made it an ideal model for stress tolerance studies. However, very little is known about its genome and genomic sequences. In this study, we sequenced and analyzed a 29,268 bp genomic fragment from Dunaliella viridis. The fragment showed low sequence homology to the GenBank database. At the nucleotide level, only a segment with significant sequence homology to 18S rRNA was found. The fragment contained six putative genes, but only one gene showed significant homology at the protein level to the GenBank database. The average GC content of this sequence was 51.1%, which was much lower than that of the closely related green alga Chlamydomonas (65.7%). Significant segmental duplications were found within this fragment. The duplicated sequences accounted for about 35.7% of the entire region. Large numbers of simple sequence repeats (microsatellites) were found, with a strong bias towards the (AC)_n type (76%). Analysis of other Dunaliella genomic sequences in the GenBank database (total 25,749 bp) was in agreement with these findings. These sequence features made it difficult to sequence Dunaliella genomic sequences. Further investigation should be made to reveal the biological significance of these unique sequence features.
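The two sequence statistics mentioned in the abstract above, GC content and (AC)_n microsatellites, can be computed along the following lines. This is a minimal Python sketch, not the authors' pipeline: the function names are hypothetical, and the `min_repeats` threshold of 4 is an arbitrary assumption, since the abstract does not say how repeats were called.

```python
import re


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def find_simple_repeats(seq, unit="AC", min_repeats=4):
    """Return (start, end, repeat_count) for tandem runs of `unit`.

    min_repeats is an assumed cutoff chosen for illustration only.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_repeats))
    hits = []
    for match in pattern.finditer(seq.upper()):
        length = match.end() - match.start()
        hits.append((match.start(), match.end(), length // len(unit)))
    return hits


if __name__ == "__main__":
    example = "TTACACACACGG"
    print(gc_content(example), find_simple_repeats(example))
```

Positions are 0-based and end-exclusive, following Python slicing conventions, so `seq[start:end]` recovers each repeat run.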
20.
To improve the utility of increasingly large numbers of available unannotated and initially poorly annotated genomic sequences for proteome analysis, we demonstrate that effective protein identification can be made on a large and unannotated genome. The strategy developed is to translate the unannotated genome sequence into amino acid sequence encoding putative proteins in all six reading frames, to identify peptides by tandem mass spectrometry (MS/MS), to localize them on the genome sequence, and to preliminarily annotate the protein via a similarity search by BLAST. These tasks have been optimized and automated. Optimization to obtain multiple peptide matches in effect extends the searchable region and results in more robust protein identification. The viability of this strategy is demonstrated with the identification of 223 cilia proteins in the unicellular eukaryotic model organism Tetrahymena thermophila, whose initial genomic sequence draft was released in November 2003. To the best of our knowledge, this is the first demonstration of large-scale protein identification based on such a large, unannotated genome. Of the 223 cilia proteins, 84 have no similarity to proteins in NCBI's nonredundant (nr) database. This methodology allows identifying the locations of the genes encoding these novel proteins, which is a necessary first step to downstream functional genomic experimentation.
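The first step of the strategy described above, translating an unannotated sequence in all six reading frames, can be sketched in Python as follows. The function names are hypothetical and the code uses the standard genetic code as a simplifying assumption. Tetrahymena nuclear genes use a ciliate code in which TAA and TAG encode glutamine, so a real analysis of that organism would need a different codon table.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse-complement an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames["%s%d" % (strand, offset + 1)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are kept as `*` so that downstream code can split each frame into candidate open reading frames before building a searchable protein database.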