Similar articles
 20 similar articles found (search time: 93 ms)
1.
Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

2.
Addressing protein localization within the nucleus   Cited by: 1 (self-citations: 0, citations by others: 1)
Bridging the gap between the number of gene sequences in databases and the number of gene products that have been functionally characterized in any way is a major challenge for biology. A key characteristic of proteins, which can begin to elucidate their possible functions, is their subcellular location. A number of experimental approaches can reveal the subcellular localization of proteins in mammalian cells. However, genome databases now contain predicted sequences for a large number of potentially novel proteins that have yet to be studied in any way, let alone have their subcellular localization determined. Here we ask whether using bioinformatics tools to analyse the sequence of proteins whose subnuclear localizations have been determined can reveal characteristics or signatures that might allow us to predict localization for novel protein sequences.

3.
Tang SN  Sun JM  Xiong WW  Cong PS  Li TH 《Biochimie》2012,94(3):847-853
Mycobacterium, the most common disease-causing genus, infects billions of people and is notoriously difficult to treat. Understanding the subcellular localization of mycobacterial proteins can provide essential clues for protein function and drug discovery. In this article, we present a novel approach that focuses on local sequence information to identify localization motifs that are generated by a merging algorithm and are selected based on a binomially distributed model. These localization motifs are employed as features for identifying the subcellular localization of mycobacterial proteins. Our approach provides more accurate results than previous methods and was tested on an independent dataset recently obtained from an experimental study to provide a first and reasonably accurate prediction of subcellular localization. Our approach can also be used for large-scale prediction of new protein entries in the UniProtKB database and of protein sequences obtained experimentally. In addition, our approach identified many local motifs involved in subcellular localization that also interact with the environment. Thus, our method may have widespread applications both in the study of the functions of mycobacterial proteins and in the search for potential vaccine targets and for drug design.

4.
Issac B  Raghava GP 《BioTechniques》2002,33(3):548-50, 552, 554-6
Similarity searches are a powerful method for solving important biological problems such as database scanning, evolutionary studies, gene prediction, and protein structure prediction. FASTA is a widely used sequence comparison tool for rapid database scanning. Here we describe the GWFASTA server that was developed to assist the FASTA user in similarity searches against partially and/or completely sequenced genomes. GWFASTA consists of more than 60 microbial genomes, eight eukaryote genomes, and proteomes of annotated genomes. In fact, it provides the maximum number of databases for similarity searching from a single platform. GWFASTA allows the submission of more than one sequence as a single query for a FASTA search. It also provides integrated post-processing of FASTA output, including compositional analysis of proteins, multiple sequence alignment, and phylogenetic analysis. Furthermore, it summarizes the search results organism-wise for prokaryotes and chromosome-wise for eukaryotes. Thus, the integration of different tools for sequence analyses makes GWFASTA a powerful tool for biologists.

5.
Shen HB  Chou KC 《Biopolymers》2007,85(3):233-240
Viruses can produce progeny only within a host cell, and their actions depend both on their destructive tendencies toward a specific host cell and on environmental conditions. Therefore, knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very useful for in-depth study of their functions and mechanisms as well as for designing antiviral drugs. An analysis of the Swiss-Prot database (version 50.0, released on May 30, 2006) indicates that only 23.5% of viral protein entries are annotated with their subcellular locations. For the gene ontology database, the corresponding percentage is 23.8%. Such a gap calls for the development of high-throughput tools for timely annotation of the localization of viral proteins within host and virus-infected cells. In this article, a predictor called "Virus-PLoc" has been developed that fuses many basic classifiers, each engineered according to the K-nearest neighbor rule. The overall jackknife success rate obtained by Virus-PLoc in identifying the subcellular compartments of viral proteins was 80% for a benchmark dataset in which none of the proteins has more than 25% sequence identity to any other in the same location site. Virus-PLoc will be freely available as a web server at http://202.120.37.186/bioinf/virus for public use. Furthermore, Virus-PLoc has been used to provide large-scale predictions of all viral protein entries in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The results thus obtained have been deposited in a downloadable file prepared with Microsoft Excel and named "Tab_Virus-PLoc.xls." This file is available at the same website and will be updated twice a year to include the new entries of viral proteins and reflect the continuous development of Virus-PLoc.
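
The K-nearest neighbor rule and the jackknife (leave-one-out) test mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the Virus-PLoc implementation: the feature vectors, the function names `knn_predict`, `fused_predict` and `jackknife_accuracy`, and the choice of fusing classifiers by majority vote over several values of k are all assumptions made for illustration only.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples nearest to `query`.

    `train` is a list of (feature_vector, label) pairs (hypothetical format).
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def fused_predict(train, query, ks=(1, 3, 5)):
    """Fuse several basic KNN classifiers (one per k) by majority vote.

    Voting over k values is an assumed fusion scheme, not the published one.
    """
    votes = Counter(knn_predict(train, query, k) for k in ks)
    return votes.most_common(1)[0][0]


def jackknife_accuracy(samples, k=3):
    """Leave each sample out in turn, predict it from the rest, and score."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, vector, k) == label:
            correct += 1
    return correct / len(samples)


# Toy data: two well-separated clusters standing in for two location classes.
samples = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((10.0, 10.0), "b"), ((10.0, 11.0), "b"), ((11.0, 10.0), "b"),
]
print(jackknife_accuracy(samples, k=1))
print(fused_predict(samples, (9.0, 9.0)))
```

In a real predictor the vectors would be derived from the protein sequences or their annotations; here they are arbitrary points chosen so that the two classes are trivially separable.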

6.
Identification of the novel PE multigene family was an unexpected finding of the genomic sequencing of Mycobacterium tuberculosis. Presently, the biological role of the PE and PE_PGRS proteins encoded by this unique family of mycobacterial genes remains unknown. In this report, a representative PE_PGRS gene (Rv1818c/PE_PGRS33) was selected to investigate the role of these proteins. Cell fractionation studies and fluorescence analysis of recombinant strains of Mycobacterium smegmatis and M. tuberculosis expressing green fluorescent protein (GFP)-tagged proteins indicated that the Rv1818c gene product localized in the mycobacterial cell wall, mostly at the bacterial cell poles, where it is exposed to the extracellular milieu. Further analysis of this PE_PGRS protein showed that the PE domain is necessary for subcellular localization. In addition, the PGRS domain, but not PE, affects bacterial shape and colony morphology when Rv1818c is overexpressed in M. smegmatis and M. tuberculosis. Taken together, the results indicate that PE_PGRS and PE proteins can be associated with the mycobacterial cell wall and influence cellular structure as well as the formation of mycobacterial colonies. Regulated expression of PE genes could have implications for the survival and pathogenesis of mycobacteria within the human host and in other environmental niches.

7.
Using liquid chromatography-mass spectrometry, 528 proteins were identified that are expressed during growth at 4 °C in the cold-adapted archaeon Methanococcoides burtonii. Of those, 135 were annotated previously as unique or conserved hypothetical proteins. We have performed a comprehensive, integrated analysis of the latter proteins using threading, InterProScan, predicted subcellular localization and visualization of conserved gene context across multiple prokaryotic genomes. Functional information was obtained for 55 proteins, providing new insight into the physiology of M. burtonii. Many of the proteins were predicted to be involved in DNA/RNA binding or modification and cell signaling, suggesting a complex, uncharacterized regulatory network controlling cellular processes during growth at low temperature. Novel enzymatic functions were predicted for several proteins, including a putative candidate gene for the posttranslational modification of the key methanogenesis enzyme coenzyme M methyl reductase. A bacterial-like CRISPR locus was identified as a strong candidate for archaeal-bacterial lateral gene transfer. Gene context analysis proved a valuable augmentation to the other predictive methods in several cases, by revealing conserved gene associations and annotations in other microbial genomes. Our results underscore the importance of addressing the "hypothetical protein problem" for a complete understanding of cell physiology.

8.
Sequence conserved for subcellular localization   Cited by: 6 (self-citations: 0, citations by others: 6)
The more proteins diverge in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) The subcellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.

9.
10.
Increasing evidence demonstrates the importance of long coiled-coil proteins for the spatial organization of cellular processes. Although several protein classes with long coiled-coil domains have been studied in animals and yeast, our knowledge about plant long coiled-coil proteins is very limited. The repeat nature of the coiled-coil sequence motif often prevents the simple identification of homologs of animal coiled-coil proteins by generic sequence similarity searches. As a consequence, counterparts of many animal proteins with long coiled-coil domains, like lamins, golgins, or microtubule organization center components, have not been identified yet in plants. Here, all Arabidopsis proteins predicted to contain long stretches of coiled-coil domains were identified by applying the algorithm MultiCoil to a genome-wide screen. A searchable protein database, ARABI-COIL (http://www.coiled-coil.org/arabidopsis), was established that integrates information on number, size, and position of predicted coiled-coil domains with subcellular localization signals, transmembrane domains, and available functional annotations. ARABI-COIL serves as a tool to sort and browse Arabidopsis long coiled-coil proteins to facilitate the identification and selection of candidate proteins of potential interest for specific research areas. Using the database, candidate proteins were identified for Arabidopsis membrane-bound, nuclear, and organellar long coiled-coil proteins.

11.
12.

Background

Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes.

Results

Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.

Conclusion

MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.

13.
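Lin Hao 《生物信息学》 (Chinese Journal of Bioinformatics) 2009,7(4):252-254
Because the subcellular location of a protein correlates strongly with its primary sequence, the increment of diversity was used to describe the similarity between proteins in amino acid composition and dipeptide composition, and a modified Mahalanobis discriminant (here called the IDQD method) was applied to predict the subcellular location of mycobacterial proteins. Jackknife tests were run on protein datasets with different levels of sequence similarity. The results show that when the sequence similarity of the dataset is at most 70%, the prediction accuracy is stable at about 75%. On the full set of 852 proteins the prediction success rate reaches 87.7%, which is better than the accuracy of existing algorithms and indicates that IDQD is an effective method for predicting the subcellular location of mycobacterial proteins.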
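
The increment of diversity named in this abstract can be sketched in Python. The sketch assumes the commonly used definitions D(X) = N ln N − Σ nᵢ ln nᵢ and ID(X, S) = D(X + S) − D(X) − D(S), applied here to amino acid counts only. The abstract does not give the formulas, so they are an assumption, as are the function names and the toy class profiles. The modified Mahalanobis discriminant step is omitted; the query is simply assigned to the class with the smallest increment.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def counts(sequence):
    """Count each of the 20 standard amino acids in a sequence."""
    c = Counter(sequence.upper())
    return [c[aa] for aa in AMINO_ACIDS]


def diversity(n):
    """Diversity measure D = N ln N - sum(n_i ln n_i) of a count vector."""
    total = sum(n)
    if total == 0:
        return 0.0
    return total * log(total) - sum(x * log(x) for x in n if x > 0)


def increment_of_diversity(x, s):
    """ID(X, S) = D(X + S) - D(X) - D(S); small values mean similar sources."""
    merged = [a + b for a, b in zip(x, s)]
    return diversity(merged) - diversity(x) - diversity(s)


def nearest_class(query, class_profiles):
    """Assign the query to the class whose profile gives the smallest ID."""
    q = counts(query)
    return min(class_profiles,
               key=lambda name: increment_of_diversity(q, class_profiles[name]))


# Hypothetical class profiles built from made-up sequences.
profiles = {"cytoplasm": counts("AAAAC"), "membrane": counts("WWWWL")}
print(nearest_class("AAC", profiles))
```

Two count vectors with identical proportions give an increment of zero, and vectors with no shared residues give the largest values, which is why the measure can serve as a dissimilarity score.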

14.
In the past, a large number of methods have been developed for predicting various characteristics of a protein from its composition. In order to exploit the full potential of protein composition, we developed the web server COPid to assist researchers in annotating the function of a protein from its composition using the whole or part of the protein. COPid has three modules called search, composition and analysis. The search module allows searching of protein sequences in six different databases. Search results list database proteins in ascending order of Euclidean distance or descending order of compositional similarity with the query sequence. The composition module allows calculation of the composition of a sequence and the average composition of a group of sequences. The composition module also allows computing the composition of various types of amino acids (e.g. charged, polar, hydrophobic residues). The analysis module provides the following options: i) comparing the composition of two classes of proteins, ii) creating a phylogenetic tree based on the composition and iii) generating input patterns for machine learning techniques. We have evaluated the performance of composition-based (or alignment-free) similarity search in the subcellular localization of proteins. It was found that the alignment-free method performs reasonably well in predicting certain classes of proteins. The COPid web server is available at http://www.imtech.res.in/raghava/copid/.
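
A composition-based, alignment-free search of the kind this abstract describes can be sketched in a few lines of Python. This is not COPid's code: the function names, the handling of non-standard residues and the toy database are assumptions for illustration. It computes the 20-component amino acid composition of each sequence and ranks database entries by ascending Euclidean distance to the query.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def euclidean_distance(a, b):
    """Euclidean distance between two composition vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_composition(query, database):
    """Return (name, distance) pairs sorted by ascending distance to the query.

    `database` maps a sequence name to its sequence (hypothetical format).
    """
    q = composition(query)
    hits = [(name, euclidean_distance(q, composition(seq)))
            for name, seq in database.items()]
    return sorted(hits, key=lambda hit: hit[1])


# Made-up sequences; a real search would read them from a sequence database.
database = {"protein_x": "WWWWLLLL", "protein_y": "AAACAAAC"}
for name, distance in rank_by_composition("AAAC", database):
    print(name, round(distance, 3))
```

Because only residue frequencies are compared, two sequences with the same composition but different residue order are indistinguishable, which is the trade-off of skipping the alignment.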

15.
In plant genomes, the function of a substantial percentage of the putative protein-coding open reading frames (ORFs) is unknown. These ORFs have no significant sequence similarity to known proteins, which complicates the task of functional study of these proteins. Efforts are being made to explore methods that are complementary to, or may be used in combination with, sequence alignment and clustering methods. A web-based protein functional class prediction software, SVMProt, has shown some capability for predicting functional class of distantly related proteins. Here the usefulness of SVMProt for functional study of novel plant proteins is evaluated. To test SVMProt, 49 plant proteins (without a sequence homolog in the Swiss-Prot protein database, not in the SVMProt training set, and with functional indications provided in the literature) were selected from a comprehensive search of MEDLINE abstracts and Swiss-Prot databases in 1999-2004. These represent unique proteins the function of which, at present, cannot be confidently predicted by sequence alignment and clustering methods. The predicted functional class of 31 proteins was consistent, and that of four other proteins was weakly consistent, with published functions. Overall, the functional class of 71.4% of these proteins was consistent, or weakly consistent, with functional indications described in the literature. SVMProt shows a certain level of ability to provide useful hints about the functions of novel plant proteins with no similarity to known proteins.

16.
Amino acid sequence analysis corresponding to the PPE proteins in H37Rv and CDC 1551 strains of the Mycobacterium tuberculosis genomes resulted in the identification of a previously uncharacterized 225 amino acid-residue common region in 22 proteins. The pairwise sequence identities were as low as 18%. Conservation of amino acid residues was observed at fifteen positions that were distributed over the whole length of the region. The secondary structure corresponding to this region is predicted to be a mixture of α-helices and β-strands. Although the function is not known, proteins with this region specific to mycobacterial species may be associated with a common function. We further observed another group of 20 PPE proteins corresponding to the conserved C-terminal region comprising 44 amino acid residues with GFxGT and PxxPxxW sequence motifs. This region is preceded by a hydrophobic region, comprising 40–100 amino acid residues, that is flanked by charged amino acid residues. Identification of conserved regions described above may be useful to detect related proteins from other genomes and assist the design of suitable experiments to test their corresponding functions. Amino acid sequence analysis corresponding to the PE proteins resulted in the identification of tandem repeats comprising 41–43 amino acid residues in the C-terminal variable regions in two PE proteins (Rv0978 and Rv0980). These correspond to the AB repeats that were first identified in some proteins of the Methanosarcina mazei genome, and were demonstrated as surface antigens. We observed the AB repeats also in several other proteins of hitherto uncharacterized function in Archaea and Bacteria genomes. Some of these proteins are also associated with another repeat called the C-repeat or the PKD-domain comprising 85 amino acid residues. The secondary structure corresponding to the AB repeat is predicted mainly as 4 β-strands.
We suggest that proteins with AB repeats in Mycobacterium tuberculosis and other genomes may be associated as surface antigens. The M. leprae genome, however, does not contain either the AB or C-repeats and different proteins may therefore be recruited as surface antigens in the M. leprae genome compared to the M. tuberculosis genome.

17.
Functional and topological characterization of protein interaction networks   Cited by: 1 (self-citations: 0, citations by others: 1)
The elucidation of the cell's large-scale organization is a primary challenge for post-genomic biology, and understanding the structure of protein interaction networks offers an important starting point for such studies. We compare four available databases that approximate the protein interaction network of the yeast, Saccharomyces cerevisiae, aiming to uncover the network's generic large-scale properties and the impact of the proteins' function and cellular localization on the network topology. We show how each database supports a scale-free topology with hierarchical modularity, indicating that these features represent a robust and generic property of the protein interaction network. We also find strong correlations between the network's structure and the functional role and subcellular localization of its protein constituents, concluding that most functional and/or localization classes appear as relatively segregated subnetworks of the full protein interaction network. The uncovered systematic differences between the four protein interaction databases reflect their relative coverage for different functional and localization classes and provide a guide for their utility in various bioinformatics studies.

18.
Approximately one-third of the world's population is infected with Mycobacterium tuberculosis, the causative agent of tuberculosis. Secreted and membrane proteins that interact with the host play important roles in the pathogenicity of the bacteria and are potential drug targets or components of vaccines. In the present study, subcellular fractionation in combination with membrane enrichment was used to comprehensively analyze the M. tuberculosis proteome. The proteome of the M. tuberculosis cell wall, membrane, cytosol, lysate, and culture filtrate was defined with a high coverage. Exceptional enrichment for membrane proteins was achieved using wheat germ agglutinin (WGA)-affinity two-phase partitioning, a technique that has to date not yet been exploited for the enrichment of mycobacterial membranes. Overall, 1051 M. tuberculosis protein groups including 183 transmembrane proteins have been identified by LC-MS/MS analysis using stringent database search criteria with a minimum of two peptides and an estimated FDR of less than 1%. With many mycobacterial antigens and lipoglycoproteins identified, the results from this study suggest that many of the newly discovered proteins could represent potential candidates mediating host-pathogen interactions. In addition, this data set provides experimental information about protein localization and thus serves as a valuable resource for M. tuberculosis proteome research.

19.
Analysis by two-dimensional gel electrophoresis revealed that Mycobacterium avium expresses several proteins unique to an intracellular infection. One abundant protein with an apparent molecular mass of 50 kDa was isolated, and the N-terminal sequence was determined. It matches a sequence in the M. tuberculosis database (Sanger) with similarity to the enzyme isocitrate lyase of both Corynebacterium glutamicum and Rhodococcus fascians. Only marginal similarity was observed between this open reading frame (ORF) (termed icl) and a second distinct ORF (named aceA) which exhibits a low similarity to other isocitrate lyases. Both ORFs can be found as distinct genes in the various mycobacterial databases recently published. Isocitrate lyase is a key enzyme in the glyoxylate cycle and is essential as an anaplerotic enzyme for growth on acetate and certain fatty acids as a carbon source. In this study we express and purify Icl, as well as AceA proteins, and show that both exhibit isocitrate lyase activity. Various known inhibitors for isocitrate lyase were effective. Furthermore, we present evidence that in both M. avium and M. tuberculosis the production and activity of the isocitrate lyase is enhanced under minimal growth conditions when supplemented with acetate or palmitate.

20.
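MYB transcription factors play a very important role in anther development and pollen formation in higher plants, and MYB80 is an important transcription factor involved in tapetum development and male sterility. In this study, using Arabidopsis thaliana AtMYB80 as the reference sequence, BLAST analysis identified 2, 2 and 6 homologous sequences of the MYB80 gene in Chinese cabbage (Brassica rapa), cabbage (B. oleracea) and rapeseed (B. napus), respectively. Bioinformatics methods were used to analyse the nucleotide sequences and the encoded amino acid sequences for composition, subcellular localization, phosphorylation sites, hydrophobicity/hydrophilicity, protein secondary and tertiary structure, and functional domains. The results show that the MYB80 transcription factors are localized to the nucleus, have multiple distinct phosphorylation sites, and have hydrophilic peptide chains. Secondary and tertiary structure prediction shows that α-helices and random coils are the main structural elements of the MYB80 proteins. Conserved domain analysis shows that the N-terminus contains 2 tandem SANT domains, so the proteins belong to the R2R3-type MYB transcription factors. Multiple sequence alignment and phylogenetic tree analysis show that the sequence similarity between B. napus and B. rapa or B. oleracea is greater than 92%, and that the functional domains of the MYB80 transcription factors are highly homologous and strongly conserved in sequence. These results are important for further elucidating the biological function of MYB80 in B. napus and the molecular mechanism of fertility regulation, and they provide a basis for exploiting heterosis in B. napus.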
