Similar literature
 20 similar articles found
1.
Complete sets of cloned protein-encoding open reading frames (ORFs), or ORFeomes, are essential tools for large-scale proteomics and systems biology studies. Here we describe human ORFeome version 3.1 (hORFeome v3.1), currently the largest publicly available resource of full-length human ORFs (available at ). Generated by Gateway recombinational cloning, this collection contains 12,212 ORFs, representing 10,214 human genes, and corresponds to a 51% expansion of the original hORFeome v1.1. An online human ORFeome database, hORFDB, was built and serves as the central repository for all cloned human ORFs (http://horfdb.dfci.harvard.edu). This expansion of the original ORFeome resource greatly increases the potential experimental search space for large-scale proteomics studies, which will lead to the generation of more comprehensive datasets.

2.
Origin and properties of non-coding ORFs in the yeast genome.   Cited by: 4 (self-citations: 0, citations by others: 4)
In a recent paper we estimated the total number of protein-coding open reading frames (ORFs) in the Saccharomyces cerevisiae genome, based on their properties, at about 4800. This number is much smaller than the widely accepted 5800-6000. In this paper we analyse differences between the set of ORFs with known phenotypes annotated in the Munich Information Centre for Protein Sequences (MIPS) database and ORFs for which the probability of coding, as calculated by us, is very low. We have found that many of the latter ORFs have properties of antisense sequences of coding ORFs, which suggests that they could have been generated by duplication of coding sequences. Since coding sequences generate ORFs inside themselves, with especially high frequency in the antisense sequences, we have looked for homology between known proteins and hypothetical polypeptides generated by the ORFs under consideration in all six phases. For many ORFs we have found paralogues and orthologues in phases different from the phase assumed to be coding in the MIPS database.
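
The idea of inspecting an ORF "in all six phases" can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the ATG-to-stop definition of an ORF, and the `min_codons` threshold are all assumptions made for illustration, and the input is assumed to contain only A, C, G and T.

```python
# Illustrative sketch only: a naive six-frame ORF scan (not the authors' method).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of ATG...stop ORFs in one frame of `seq`.

    Coordinates are 0-based and end-exclusive, relative to `seq`; the stop
    codon is included. Only the first ATG after the previous stop opens an ORF.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) tuples for all six phases.

    For the "-" strand, coordinates refer to the reverse-complemented string.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                found.append((strand, frame, start, end))
    return found
```

Calling `six_frame_orfs` on a sequence and on its reverse complement reports the same ORF on opposite strands, which is the kind of sense/antisense relationship the abstract discusses.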

3.
WorfDB (Worm ORFeome DataBase; http://worfdb.dfci.harvard.edu) was created to integrate and disseminate the data from the cloning of the complete set of approximately 19 000 predicted protein-encoding Open Reading Frames (ORFs) of Caenorhabditis elegans (also referred to as the 'worm ORFeome'). WorfDB serves as a central data repository enabling the scientific community to search for availability and quality of cloned ORFs. So far, ORF sequence tags (OSTs) obtained for all individual clones have allowed exon structure corrections for approximately 3400 ORFs originally predicted by the C. elegans sequencing consortium. In addition, we now have OSTs for approximately 4300 predicted genes for which no ESTs were available. The database contains this OST information along with data pertinent to the cloning process. WorfDB could serve as a model database for other metazoan ORFeome cloning projects.

4.
Functional characterization of the human genome requires tools for systematically modulating gene expression in both loss-of-function and gain-of-function experiments. We describe the production of a sequence-confirmed, clonal collection of over 16,100 human open-reading frames (ORFs) encoded in a versatile Gateway vector system. Using this ORFeome resource, we created a genome-scale expression collection in a lentiviral vector, thereby enabling both targeted experiments and high-throughput screens in diverse cell types.

5.
GENIUS II is an automated database system in which open reading frames (ORFs) in complete genomes are assigned to known protein three-dimensional (3D) structures. The system uses the multiple intermediate sequence search method, in which query and target sequences are linked by intermediate sequences gathered by PSI-BLAST search. When the system was applied to 129 complete genomes, 43.8% of the ORFs in the genomes, on average, were assigned to known 3D structures. The results are freely available at the GENIUS II web site.

6.
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.
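
To make the phrase "Markov chain models for coding and non-coding regions" concrete, here is a deliberately simplified sketch of scoring a sequence by the log-odds of two Markov chains. It is not the GeneMark implementation: the first-order, phase-independent model, the add-one pseudocount, and the function names `train_markov` and `log_odds` are assumptions chosen to keep the example short. Sequences are assumed to be upper-case and to contain only A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | prev).

    Simplification (assumption): one homogeneous chain per class, with an
    additive pseudocount so that no transition has zero probability.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def log_odds(seq, coding, noncoding):
    """Sum of log(P_coding / P_noncoding) over transitions.

    A positive value means the coding model explains `seq` better.
    """
    return sum(
        math.log(coding[p][n] / noncoding[p][n]) for p, n in zip(seq, seq[1:])
    )
```

In practice the two models would be trained on annotated coding and non-coding regions of the same genome, and a candidate ORF would be kept only if its score exceeds some threshold; both the training sets and the threshold are left open here.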

7.
Protein identification via peptide mass fingerprinting (PMF) remains a key component of high-throughput proteomics experiments in post-genomic science. Candidate protein identifications are made using bioinformatic tools from peptide peak lists obtained via mass spectrometry (MS). These algorithms rely on several search parameters, including the number of potential uncut peptide bonds matching the primary specificity of the hydrolytic enzyme used in the experiment. Typically, up to one of these "missed cleavages" is considered by the bioinformatics search tools, usually after digestion of the in silico proteome by trypsin. Using two distinct, nonredundant datasets of peptides identified via PMF and tandem MS, a simple predictive method based on information theory is presented which is able to identify experimentally defined missed cleavages with up to 90% accuracy from amino acid sequence alone. Using this simple protocol, we are able to "mask" candidate protein databases so that confident missed cleavage sites need not be considered for in silico digestion. We show that this leads to an improvement in database searching, with two different search engines, using the PMF dataset as a test set. In addition, the improved approach is also demonstrated on an independent PMF data set of known proteins that also has corresponding high-quality tandem MS data, validating the protein identifications. This approach has wider applicability for proteomics database searching, and the program for predicting missed cleavages and masking Fasta-formatted protein sequence databases has been made available via http://ispider.smith.man.ac.uk/MissedCleave.
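
A minimal sketch of in silico tryptic digestion with a limit on missed cleavages may help here. It is not the authors' program: the cleavage rule (cut after K or R unless the next residue is P) is a common convention assumed for illustration, and the `masked` parameter is a hypothetical stand-in for sites that a predictor has flagged as confidently missed.

```python
def tryptic_sites(protein):
    """Return cut positions i, meaning a cut between residues i-1 and i.

    Assumed rule: cleave C-terminal to K or R, except before P.
    """
    return [
        i + 1
        for i, aa in enumerate(protein[:-1])
        if aa in "KR" and protein[i + 1] != "P"
    ]


def digest(protein, max_missed=1, masked=frozenset()):
    """Return peptides containing up to `max_missed` missed cleavages.

    Cut positions listed in `masked` are treated as never cleaved, so they
    do not count towards `max_missed`.
    """
    cuts = [c for c in tryptic_sites(protein) if c not in masked]
    bounds = [0] + cuts + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        # j - i - 1 is the number of missed cleavages in the peptide.
        for j in range(i + 1, min(i + 2 + max_missed, len(bounds))):
            peptides.append(protein[bounds[i]:bounds[j]])
    return peptides
```

For the toy sequence `"AKRPGKLR"`, masking cut position 2 and allowing no further missed cleavages yields `AKRPGK` and `LR`, whereas the unmasked digest with one allowed missed cleavage enumerates five peptides. Masking therefore shrinks the set of candidate peptides the search engine has to consider.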

8.
Rice has become a model plant for genomic studies of monocot species, because of its relatively small genome size (430 Mb), high synteny with other important crop species such as maize, barley and wheat, the release of draft sequences of both indica[1] and japonica[2] genomes, and the near completion of the map-based sequencing of the rice genome by the International Rice Genome Sequencing Project. Currently, more than 340 Mb of non-overlapping genomic sequences including completely sequenced…

9.
Yeast chromosome III: new gene functions.   Cited by: 19 (self-citations: 1, citations by others: 18)
E V Koonin, P Bork, C Sander. The EMBO Journal, 1994, 13(3): 493-503

10.
Chlamydia trachomatis represents a group of human pathogenic obligate intracellular and gram-negative bacteria. The genome of C. trachomatis D comprises 894 open reading frames (ORFs). In this study the global expression of genes in C. trachomatis A, D and L2, which are responsible for different chlamydial diseases, was investigated using a proteomics approach. Based on silver-stained two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) gels with purified elementary bodies (EB) and autoradiography of gels with 35S-labeled C. trachomatis proteins, up to 700 protein spots were detectable within the range of the immobilized pH gradient (IPG) system used. Using mass spectrometry and N-terminal sequencing followed by database searching, we identified 250 C. trachomatis proteins from purified EB, of which 144 were derived from different genes, representing 16% of the ORFs predicted from the C. trachomatis D genome and the 7.5 kb C. trachomatis plasmid. Important findings include identification of proteins from the type III secretion apparatus, enzymes from the central metabolism and confirmation of expression of 25 hypothetical ORFs and five polymorphic membrane proteins. Comparison of serovars generated novel data on genetic variability as indicated by electrophoretic variation and potentially important examples of serovar-specific differences in protein abundance. The availability of the complete genome made it feasible to map and to identify proteins of C. trachomatis on a large scale, and the integration of our data in a 2-D PAGE database will create a basis for post-genomic research, important for the understanding of chlamydial development and pathogenesis.

11.
Emerging evidence places small proteins (≤50 amino acids) more centrally in physiological processes. Yet, their functional identification and the systematic genome annotation of their cognate small open-reading frames (smORFs) remain challenging both experimentally and computationally. Ribosome profiling or Ribo-Seq (that is, deep sequencing of ribosome-protected fragments) enables detection of actively translated open-reading frames (ORFs) and empirical annotation of coding sequences (CDSs) using the in-register translation pattern that is characteristic of genuinely translating ribosomes. Multiple identifiers of ORFs that use the 3-nt periodicity in Ribo-Seq data sets have been successful in eukaryotic smORF annotation. They have difficulties evaluating prokaryotic genomes due to their unique architecture (e.g. polycistronic messages, overlapping ORFs, leaderless translation, non-canonical initiation etc.). Here, we present a new algorithm, smORFer, which detects putative smORFs in prokaryotic organisms with high accuracy. The unique feature of smORFer is that it uses an integrated approach: it considers structural features of the genetic sequence along with in-frame translation and uses a Fourier transform to convert these parameters into a measurable score to faithfully select smORFs. The algorithm is executed in a modular way, and depending on the data available for a particular organism, different modules can be selected for the smORF search.
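
As a rough illustration of how a Fourier transform can turn 3-nt periodicity into a score, the sketch below computes the share of spectral power that sits at the 3-nt period in a per-nucleotide coverage profile. It is not smORFer's scoring function: the name `periodicity_score`, the one-sided power fraction, and the numerical cutoff for flat profiles are all assumptions.

```python
import cmath


def periodicity_score(counts):
    """Fraction of non-DC spectral power found at the 3-nt period.

    `counts` is a per-nucleotide coverage profile (for example, read counts
    per position) across one candidate ORF whose length is a multiple of 3.
    """
    n = len(counts)
    if n == 0 or n % 3:
        raise ValueError("length must be a positive multiple of 3")

    def power(k):
        # Squared magnitude of the k-th discrete Fourier coefficient.
        coeff = sum(
            c * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, c in enumerate(counts)
        )
        return abs(coeff) ** 2

    # Real input has a symmetric spectrum, so k = 1 .. n // 2 is enough.
    spectrum = {k: power(k) for k in range(1, n // 2 + 1)}
    total = sum(spectrum.values())
    dc = float(sum(counts)) ** 2
    # Treat a numerically flat profile as having no periodicity at all.
    if total <= 1e-12 * max(dc, 1.0):
        return 0.0
    return spectrum[n // 3] / total
```

A profile in which reads pile up on every third nucleotide scores close to 1, while a profile with a different period or no structure scores close to 0. Any cutoff on this score would have to be calibrated on real data.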

12.
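Proteins from the middle silk gland of the silkworm were separated by high-resolution two-dimensional electrophoresis, and some of the more highly expressed protein spots were identified by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS). A local peptide mass fingerprint database was built using the GPMAW (General Protein/Mass Analysis for Windows) software together with the database of proteins predicted from the silkworm genome, and the peptide mass fingerprints obtained were analyzed against it. With two-dimensional gel electrophoresis and image analysis, silver nitrate staining and Coomassie Brilliant Blue staining resolved more than 500 and more than 100 protein spots, respectively. These spots were concentrated mainly in the molecular weight range of 15-90 kD, with isoelectric points between pH 3.5 and 7. Of the 25 Coomassie-stained spots analyzed by MALDI-TOF-MS, more than 60% gave strong PMF (peptide mass fingerprint) signal peaks. Compared with Mascot searches, searches against the silkworm peptide mass fingerprint database not only accurately identified proteins that had already been reported, which validates the search method, but also identified new proteins that had been predicted by the silkworm genome database but not previously reported. This establishes a complete set of methods suited to silkworm proteome research and provides an important reference for proteome studies of other silk-producing insects.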

13.
Proteomic analysis   Cited by: 10 (self-citations: 0, citations by others: 10)
The field of proteomics is becoming increasingly important as genome sequences are being completed and annotated. Recent advances in proteomics include experimental and mathematical proofs of the need to complement microarray analysis with protein analysis, improved sensitivity for mass spectrometric analysis of separated proteins, better informatic tools for gel analysis and protein spot annotation, first steps towards automated experimental procedures, and new technology for quantitation of protein changes.

14.
In this study, the full mitochondrial genome of a basidiomycete fungus, Pleurotus ostreatus, was sequenced and analyzed. It is a circular DNA molecule of 73 242 bp and contains 44 known genes encoding 18 proteins and 26 RNA genes. The protein-coding genes include 14 common mitochondrial genes, one ribosomal small subunit protein 3 gene, one RNA polymerase gene and two DNA polymerase genes. In addition, one RNA polymerase gene and one DNA polymerase gene were identified in a mitochondrial plasmid. These two genes show relatively low similarity to their homologs in the mitochondrial genome, but they are nearly identical to the known mitochondrial plasmid genes from another Pleurotus ostreatus strain. This suggests that the plasmid may mediate the horizontal gene transfer (HGT) of the DNA and RNA polymerase genes into the mitochondrial genome, and that such a transfer may be an ancient event. Phylogenetic analysis based on the cox1 ORFs verified the traditional classification of Pleurotus ostreatus among fungi. However, discordances were observed in the phylogenetic trees based on the six cox1 intronic ORFs of Pleurotus ostreatus and their homologs in other species, suggesting that these intronic ORFs are foreign DNA sequences obtained through HGT. In summary, this analysis provides valuable information towards the understanding of the evolution of fungal mtDNA.

15.
Modern proteomics approaches include techniques to examine the expression, localization, modifications, and complex formation of proteins in cells. In order to address issues of protein function in vitro using classical biochemical and biophysical approaches, high-throughput methods of cloning the appropriate reading frames, and expressing and purifying proteins efficiently are an important goal of modern proteomics approaches. This process becomes more difficult as functional proteomics efforts focus on the proteins from higher organisms, since issues of correctly identifying intron-exon boundaries and efficiently expressing and solubilizing the (often) multi-domain proteins from higher eukaryotes are challenging. Recently, 12,000 open-reading-frame (ORF) sequences from Caenorhabditis elegans have become available for functional proteomics studies [Nat. Gen. 34 (2003) 35]. We have implemented a high-throughput screening procedure to express, purify, and analyze by mass spectrometry hexa-histidine-tagged C. elegans ORFs in Escherichia coli using metal affinity ZipTips. We find that over 65% of the expressed proteins are of the correct mass as analyzed by matrix-assisted laser desorption MS. Many of the remaining proteins indicated to be "incorrect" can be explained by high-throughput cloning or genome database annotation errors. This provides a general understanding of the expected error rates in such high-throughput cloning projects. The ZipTip purified proteins can be further analyzed under both native and denaturing conditions for functional proteomics efforts.

16.
Separation of proteins by two-dimensional gel electrophoresis (2-DE) coupled with identification of proteins through peptide mass fingerprinting (PMF) by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) is a widely used technique for proteomic analysis. This approach relies, however, on the presence of the proteins studied in publicly accessible protein databases or the availability of annotated genome sequences of an organism. In this work, we investigated the reliability of using raw genome sequences for identifying proteins by PMF without the need for additional information such as amino acid sequences. The method is demonstrated for proteomic analysis of Klebsiella pneumoniae grown anaerobically on glycerol. Of 197 spots excised from 2-DE gels and submitted for mass spectrometric analysis, 164 spots were clearly identified as 122 individual proteins. 95% of the 164 spots can be successfully identified merely by using peptide mass fingerprints and a strain-specific protein database (ProtKpn) constructed from the raw genome sequences of K. pneumoniae. Cross-species protein searching in the public databases resulted in the identification of 57% of the 66 highly expressed protein spots, in comparison to 97% by using the ProtKpn database. 10 dha regulon related proteins that are essential for the initial enzymatic steps of anaerobic glycerol metabolism were successfully identified using the ProtKpn database, whereas none of them could be identified by cross-species searching. In conclusion, the use of a strain-specific protein database constructed from raw genome sequences makes it possible to reliably identify most of the proteins from 2-DE analysis simply through peptide mass fingerprinting.
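
The database-matching step of peptide mass fingerprinting can be sketched in a few lines. This is a toy under stated assumptions, not the search procedure used in the study: scoring is a plain count of matched peaks, digestion allows no missed cleavages and no modifications, peaks are assumed to be singly protonated monoisotopic masses, and the names `peptide_mh`, `tryptic_peptides` and `rank_by_pmf` as well as the 0.2 Da tolerance are hypothetical.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def peptide_mh(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M+H]+."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def tryptic_peptides(protein):
    """Fully cleaved tryptic peptides (cut after K/R unless followed by P)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def rank_by_pmf(peaks, database, tol=0.2):
    """Rank proteins by how many observed peaks they explain within `tol` Da.

    `database` maps protein names to sequences, e.g. proteins predicted from
    a genome. Returns (name, matched_peak_count) pairs, best first.
    """
    scores = {}
    for name, seq in database.items():
        theo = [peptide_mh(p) for p in tryptic_peptides(seq)]
        scores[name] = sum(any(abs(m - t) <= tol for t in theo) for m in peaks)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A protein can only be ranked if its sequence is in `database`, which is why a strain-specific database built from the organism's own genome can succeed where cross-species searching finds nothing.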

17.
Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

18.
Identifying all essential genomic components is critical for the assembly of minimal artificial life. In the genome-reduced bacterium Mycoplasma pneumoniae, we found that small ORFs (smORFs; < 100 residues), accounting for 10% of all ORFs, are the most frequently essential genomic components (53%), followed by conventional ORFs (49%). Essentiality of smORFs may be explained by their function as members of protein and/or DNA/RNA complexes. In larger proteins, essentiality applied to individual domains and not entire proteins, a notion we could confirm by expression of truncated domains. The fraction of essential non-coding RNAs (ncRNAs) non-overlapping with essential genes is 5% higher than that of non-transcribed regions (0.9%), pointing to the important functions of the former. We found that the minimal essential genome comprises 33% (269,410 bp) of the M. pneumoniae genome. Our data highlight an unexpected hidden layer of smORFs with essential functions, as well as non-coding regions, thus changing the focus when aiming to define the minimal essential genome.

19.
We report the isolation and identification of 10828 putative full-length cDNAs (FL-cDNAs) from an indica rice cultivar, Minghui 63, with the long-term goal of isolating all full-length cDNAs from the indica genome. Comparison with the databases showed that 780 of them are new rice cDNAs with no match in the japonica cDNA database. In total, 9078 of the FL-cDNAs contained predicted ORFs matching japonica FL-cDNAs, and 6543 could find homologous proteins with complete ORFs. 53% of the matched FL-cDNAs isolated in this study had a longer 5′UTR than the japonica FL-cDNAs. In silico mapping showed that 9776 (90.28%) of the FL-cDNAs had matched genomic sequences in the japonica genome and 10046 (92.78%) had matched genomic sequences in the indica genome. The average nucleotide sequence identity between the two subspecies is 99.2%. A majority of the FL-cDNAs (90%) could be classified with GO (gene ontology) terms based on homologous proteins. More than 60% of the new cDNAs isolated in this study had no homology to known proteins. This set of FL-cDNAs should be useful for functional genomics and proteomics studies.

20.
The availability of entire genome sequences is expected to revolutionize the way in which biology and medicine are conducted for years to come. However, achieving this promise still requires significant effort in the areas of gene annotation, cloning and expression of thousands of known and heretofore unknown protein-encoding genes. Traditional technologies of manipulating genes are too cumbersome and inefficient when one is dealing with more than a few genes at a time. Entire libraries composed of all protein-encoding open reading frames (ORFs) cloned in highly flexible vectors will be needed to take full advantage of the information found in any genome sequence. The creation of such ORFeome resources using novel technologies for cloning and expressing entire proteomes constitutes an effective gateway from whole genome sequencing efforts to downstream 'omics' applications.
