Similar articles
 Found 20 similar articles (search time: 187 ms)
1.
Knowledge of the interactions between proteins and nucleic acids is the basis for understanding various biological activities and designing new drugs. Accurately identifying nucleic-acid-binding residues remains a challenging task. In this paper, we propose an accurate predictor, GraphBind, for identifying nucleic-acid-binding residues on proteins based on an end-to-end graph neural network. Considering that binding sites often exhibit highly conserved patterns in local tertiary structures, we first construct graphs based on the structural contexts of target residues and their spatial neighborhoods. Then, hierarchical graph neural networks (HGNNs) are used to embed the latent local patterns of structural and bio-physicochemical characteristics for binding residue recognition. We comprehensively evaluate GraphBind on DNA/RNA benchmark datasets. The results demonstrate that GraphBind outperforms state-of-the-art methods. Moreover, GraphBind is extended to the prediction of other ligand-binding residues to verify its generalization capability. The GraphBind web server is freely available at http://www.csbio.sjtu.edu.cn/bioinf/GraphBind/.
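A minimal sketch of the graph construction step described above, assuming one representative 3D coordinate per residue and a single distance cutoff for both node selection and edges. The function name `build_residue_graph`, the `radius` parameter and the edge rule are illustrative assumptions, not GraphBind's actual implementation.

```python
import math


def build_residue_graph(coords, target, radius):
    """Build a local graph around residue `target`.

    coords: list of (x, y, z) positions, one per residue (assumed input).
    Returns (nodes, edges), where nodes are residue indices within `radius`
    of the target and edges connect node pairs closer than `radius`.
    """
    center = coords[target]
    # Nodes: the target residue and its spatial neighbors.
    nodes = [i for i, c in enumerate(coords) if math.dist(c, center) <= radius]
    # Edges: every pair of selected residues that lies within the cutoff.
    edges = [
        (i, j)
        for pos, i in enumerate(nodes)
        for j in nodes[pos + 1:]
        if math.dist(coords[i], coords[j]) <= radius
    ]
    return nodes, edges


# Toy coordinates along a line; residue 3 is far from the others.
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (20, 0, 0)]
nodes, edges = build_residue_graph(coords, target=1, radius=4)
print(nodes, edges)  # [0, 1, 2] [(0, 1), (1, 2)]
```

In a real pipeline each node would also carry structural and physicochemical features; they are omitted here to keep the neighborhood logic visible.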

2.

Background

Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated.

Results

In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction.

Conclusions

The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-297) contains supplementary material, which is available to authorized users.
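The Matthews correlation coefficient (MCC) reported in the Conclusions is a common choice for binding-residue prediction because binding residues are usually far fewer than non-binding ones. The sketch below computes MCC from confusion-matrix counts using the standard formula; the counts in the example are made up for illustration and are not taken from the TargetVita paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced dataset (30 binding vs 970 non-binding residues).
mcc = matthews_corrcoef(tp=20, tn=960, fp=10, fn=10)
print(round(mcc, 4))  # 0.6564
```

Accuracy for the same counts would be 98%, which illustrates why MCC is the more informative figure on imbalanced data.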

3.
4.
Protein attribute prediction from primary sequences is an important task, and extracting discriminative features is one of its most crucial aspects. Because a single-view feature cannot reflect all the information of a protein, fusing multi-view features is considered a promising route to improve prediction accuracy. In this paper, we propose a novel framework for protein multi-view feature fusion: first, features from different views are combined in parallel to form complex feature vectors; then, we extend classic principal component analysis to generalized principal component analysis for further feature extraction from the parallel-combined complex features, which lie in a complex space. Finally, the extracted features are used for prediction. Experimental results on different benchmark datasets and machine learning algorithms demonstrate that the parallel strategy outperforms the traditional serial approach and is particularly helpful for extracting the core information buried among multi-view feature sets. A web server for protein structural class prediction based on the proposed method (COMSPA) is freely available for academic use at: http://www.csbio.sjtu.edu.cn/bioinf/COMSPA/.

5.
Predicting protein subcellular locations has attracted much attention in the past decade. However, one of the most challenging problems is that many proteins have been found to exist simultaneously in, or move between, two or more different cell components in a eukaryotic cell. Few previous predictors were able to deal with such multiplex proteins, although they have extremely important implications for future drug discovery in terms of their specific subcellular targeting. Approximately 20% of the human proteome consists of such multiplex proteins with multiple sample labels. In order to efficiently handle such multiplex human proteins, we have developed a novel multi-label (ML) learning and prediction framework called ML-PLoc, which decomposes the multi-label prediction problem into multiple independent binary classification problems. ML-PLoc is constructed based on support vector machines (SVMs) and sequential evolution information. Experimental results show that ML-PLoc can achieve an overall accuracy of 64.6% and a recall ratio of 67.2% on a benchmark dataset consisting of 14 human subcellular locations, and is very powerful for dealing with multiplex proteins. The current approach represents a new strategy for dealing with multi-label biological problems. ML-PLoc software is freely available for academic use at: http://www.csbio.sjtu.edu.cn/bioinf/ML-PLoc.
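The decomposition of a multi-label problem into independent binary problems can be sketched as follows. This is only an illustration of the general idea under simple assumptions: the helper names `decompose_multilabel` and `predict_labels` are hypothetical, and the per-label classifiers are stand-in callables rather than the SVMs used by ML-PLoc.

```python
def decompose_multilabel(samples, all_labels):
    """Turn multi-label samples into one binary dataset per label.

    samples: list of (features, set_of_labels).
    Returns {label: [(features, is_positive), ...]}.
    """
    return {
        label: [(features, label in labels) for features, labels in samples]
        for label in all_labels
    }


def predict_labels(features, binary_classifiers):
    """Union of all labels whose binary classifier accepts the sample."""
    return {label for label, clf in binary_classifiers.items() if clf(features)}


samples = [
    ("P1", {"nucleus"}),
    ("P2", {"nucleus", "cytoplasm"}),  # a multiplex protein
    ("P3", {"mitochondrion"}),
]
datasets = decompose_multilabel(samples, ["nucleus", "cytoplasm", "mitochondrion"])
print(datasets["nucleus"])  # [('P1', True), ('P2', True), ('P3', False)]
```

Because each label gets its own classifier, a protein can receive several locations at once, which is what allows multiplex proteins to be handled.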

6.
Identifying the interactions between proteins and ligands is significant for drug discovery and design. Considering the diverse binding patterns of ligands, ligand-specific methods are trained per ligand to predict binding residues. However, most existing ligand-specific methods ignore shared binding preferences among various ligands and generally cover only a limited number of ligands that have a sufficient number of known binding proteins. In this study, we propose a relation-aware framework, LigBind, with graph-level pre-training to enhance ligand-specific binding residue predictions for 1159 ligands, which can effectively cover ligands with few known binding proteins. LigBind first pre-trains a graph neural network-based feature extractor for ligand-residue pairs and relation-aware classifiers for similar ligands. Then, LigBind is fine-tuned with ligand-specific binding data, where a domain-adaptive neural network is designed to automatically leverage the diversity and similarity of various ligand-binding patterns for accurate binding residue prediction. We construct ligand-specific benchmark datasets of 1159 ligands and 16 unseen ligands, which are used to evaluate the effectiveness of LigBind. The results demonstrate LigBind's efficacy on large-scale ligand-specific benchmark datasets, and it generalizes well to unseen ligands. LigBind also enables accurate identification of the ligand-binding residues in the main protease, papain-like protease and RNA-dependent RNA polymerase of SARS-CoV-2. The web server and source code of LigBind are available at http://www.csbio.sjtu.edu.cn/bioinf/LigBind/ and https://github.com/YYingXia/LigBind/ for academic use.

7.
One of the fundamental goals in proteomics and cell biology is to identify the functions of proteins in various cellular organelles and pathways. Information on the subcellular locations of proteins can provide useful insights for revealing their functions and understanding how they interact with each other in cellular network systems. Most of the existing methods for predicting plant protein subcellular localization can only cover three or four location sites, and none of them can be used to deal with multiplex plant proteins that can simultaneously exist at, or move between, two or more different location sites. Such multiplex proteins might have special biological functions worthy of particular notice. The present study was devoted to improving the existing plant protein subcellular location predictors in these two respects. A new predictor called "Plant-mPLoc" is developed by integrating gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify plant proteins among the following 12 location sites: (1) cell membrane, (2) cell wall, (3) chloroplast, (4) cytoplasm, (5) endoplasmic reticulum, (6) extracellular, (7) Golgi apparatus, (8) mitochondrion, (9) nucleus, (10) peroxisome, (11) plastid, and (12) vacuole. Compared with the existing methods for predicting plant protein subcellular localization, the new predictor is much more powerful and flexible. In particular, it can also deal with multiple-location proteins, which is beyond the reach of any existing predictor specialized for identifying plant protein subcellular localization. As a user-friendly web server, Plant-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/.
Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web server to get the desired results. It is anticipated that the Plant-mPLoc predictor presented in this paper will become a very useful tool in plant science as well as in all the relevant areas.

8.
The calpain family of Ca2+-dependent cysteine proteases plays a vital role in many important biological processes that are closely related to a variety of pathological states. Activated calpains selectively cleave relevant substrates at specific cleavage sites, yielding multiple fragments that can have different functions from the intact substrate protein. Until now, our knowledge about calpain functions and their substrate cleavage mechanisms has been limited because the experimental determination and validation of calpain binding are usually laborious and expensive. In this work, we aim to develop a new computational approach (LabCaS) for accurate prediction of calpain substrate cleavage sites from amino acid sequences. To overcome the imbalance of negative and positive samples in machine-learning training, a problem that affects most former approaches when splitting sequences into short peptides, we designed a conditional random field algorithm that can label the potential cleavage sites directly from the entire sequences. By integrating multiple amino acid features and those derived from sequences, LabCaS achieves accurate recognition of the cleavage sites for most calpain proteins. In a jackknife test on a set of 129 benchmark proteins, LabCaS achieves an AUC score of 0.862. The LabCaS program is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/LabCaS . Proteins 2013. © 2012 Wiley Periodicals, Inc.

9.
Pan XY, Tian Y, Huang Y, Shen HB. Genomics, 2011, 97(5): 257-264
Epistatic miniarray profiling (E-MAP) is a powerful tool for analyzing gene functions and their biological relevance. However, E-MAP data suffer from a large proportion of missing values, which often results in misleading and biased analysis results. Effective missing value estimation methods for E-MAP are urgently needed. Although several independent algorithms can be applied to achieve this goal, their performance varies significantly across datasets, indicating that different algorithms have their own advantages and disadvantages. In this paper, we propose a novel ensemble approach, EMDI, which imputes missing values based on high-level diversity and consists of two global and four local base estimators. Experimental results on five E-MAP datasets show that EMDI outperforms all single base algorithms, demonstrating that an appropriate combination provides complementarity among different methods. Comparison results between several fusion strategies also demonstrate that the proposed high-level diversity scheme is superior to the others. EMDI is freely available at www.csbio.sjtu.edu.cn/bioinf/EMDI/.

10.
The residue contact map is essential for protein three-dimensional structure determination. However, most current contact prediction methods based on residue co-evolution suffer from a high rate of false positives introduced by indirect and transitive contacts (i.e., residues A-B and B-C are in contact, but A-C are not). Built on the work by Feizi et al. (Nat Biotechnol 2013; 31:726-733), which demonstrated a general network model to distinguish direct dependencies by network deconvolution, this study presents a new balanced network deconvolution (BND) algorithm to identify the optimized dependency matrix without a limit on the eigenvalue range in the applied network systems. The algorithm was used to filter the contact predictions of five widely used co-evolution methods. On proteins from three benchmark datasets, namely the 9th Critical Assessment of protein Structure Prediction (CASP9), CASP10, and the PSICOV (precise structural contact prediction using sparse inverse covariance estimation) database, BND improves the medium- and long-range contact predictions at the L/5 cutoff by 55.59% and 47.68%, respectively, without additional central processing unit cost. The improvement is statistically significant, with a P-value < 5.93 × 10⁻³ in the Student's t-test. A further comparison with the ab initio structure predictions in CASPs showed that the usefulness of current co-evolution-based contact prediction for three-dimensional structure modeling relies on the number of homologous sequences existing in the sequence databases. BND can be used as a general contact refinement method, which is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/BND/ . Proteins 2015; 83:485-496. © 2014 Wiley Periodicals, Inc.
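The "L/5 cutoff" mentioned above refers to scoring only the top L/5 predicted contacts, where L is the sequence length. A minimal sketch of that evaluation follows, assuming the commonly used sequence-separation ranges (medium range 12 to 23 residues, long range 24 or more); the abstract does not state the exact ranges, and the function name `top_l5_precision` is hypothetical.

```python
def top_l5_precision(scores, true_contacts, length, min_sep, max_sep=None):
    """Precision of the top L/5 predicted contacts in a separation range.

    scores: {(i, j): predicted score} with i < j.
    true_contacts: set of (i, j) pairs that are in contact in the native structure.
    """
    # Keep only pairs whose sequence separation falls in the requested range.
    in_range = [
        (pair, score)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_sep
        and (max_sep is None or pair[1] - pair[0] <= max_sep)
    ]
    # Rank by predicted score, highest first, and keep the top L/5.
    in_range.sort(key=lambda item: item[1], reverse=True)
    top = [pair for pair, _ in in_range[: max(1, length // 5)]]
    if not top:
        return 0.0
    return sum(pair in true_contacts for pair in top) / len(top)


# Toy example: L = 10, so the top 2 long-range predictions are scored.
scores = {(0, 30): 0.9, (1, 40): 0.8, (2, 50): 0.1, (3, 5): 0.99}
native = {(0, 30), (2, 50)}
print(top_l5_precision(scores, native, length=10, min_sep=24))  # 0.5
```

The short-range pair (3, 5) is excluded despite its high score, so only one of the two scored predictions is a true contact.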

11.
Conotoxins are small disulfide-rich peptides that are invaluable channel-targeted peptides and target neuronal receptors. They show prospects for being potent pharmaceuticals in the treatment of Alzheimer's disease, Parkinson's disease, and epilepsy. Accurate and fast prediction of the conotoxin superfamily is very helpful for understanding its biological and pharmacological functions, especially in the post-genomic era. In the present study, we have developed a novel approach called PredCSF for predicting the conotoxin superfamily directly from the amino acid sequence by fusing different kinds of sequential features using modified one-versus-rest SVMs. The input features to the PredCSF classifiers are composed of physicochemical properties, evolutionary information, predicted secondary structure and amino acid composition, where the most important features are further screened by random forest feature selection to improve the prediction performance. The prediction results show that PredCSF can obtain an overall accuracy of 90.65% on a benchmark dataset constructed from the most recent database, which consists of 4 main conotoxin superfamilies and 1 non-conotoxin class. Systematic experiments also show that combining different features is helpful for enhancing the prediction power when dealing with complex biological problems. PredCSF is expected to be a powerful tool for in silico identification of novel conotoxins and is freely available for academic use at http://www.csbio.sjtu.edu.cn/bioinf/PredCSF.

12.
The fold pattern of a protein is one level deeper than its structural classification, and hence is more challenging and complicated to predict. Many efforts have been made in this regard, but so far all the reported success rates are still under 70%, indicating that it is extremely difficult to enhance the success rate even by 1% or 2%. To address this problem, a novel approach is proposed here that combines functional domain information and sequential evolution information through a fusion ensemble classifier. The predictor thus developed is called PFP-FunDSeqE. Tests were performed for identifying proteins among their 27 fold patterns. Compared with the existing predictors tested on the same stringent benchmark dataset, the new predictor can, for the first time, achieve a success rate of over 70%. The PFP-FunDSeqE predictor is freely available to the public as a web server at http://www.csbio.sjtu.edu.cn/bioinf/PFP-FunDSeqE/.

13.
Du P, Wang X, Xu C, Gao Y. Analytical Biochemistry, 2012, 425(2): 117-119
The pseudo-amino acid composition has been widely used to convert complicated protein sequences with various lengths to fixed-length digital feature vectors while keeping considerable sequence order information. However, so far the only software available to the public is the web server PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC), which has some limitations in dealing with large-scale datasets. Here, we propose a new cross-platform stand-alone software program, called PseAAC-Builder (http://www.pseb.sf.net), which can be used to generate various modes of Chou's pseudo-amino acid composition in a much more efficient and flexible way. It is anticipated that PseAAC-Builder may become a useful tool for studying various protein attributes.
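To make the idea of a fixed-length vector that retains sequence-order information concrete, here is a minimal sketch of a type-1 pseudo-amino acid composition with a single property scale. It is an illustration only: the function name `pseaac`, the default `lam` and `weight` values, and the toy property scale are assumptions, and neither the PseAAC server nor PseAAC-Builder is claimed to work this way internally.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseaac(seq, prop, lam=2, weight=0.05):
    """Return a vector of 20 + lam components for a protein sequence.

    prop: {amino acid: numeric property value} (one scale, for simplicity).
    lam: number of sequence-order correlation tiers; must be < len(seq).
    """
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    # Conventional amino acid composition (20 frequencies).
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    # Tier-k correlation: mean squared property difference between
    # residues that are k positions apart along the chain.
    thetas = [
        sum((prop[seq[i]] - prop[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Toy property scale (NOT real physicochemical values), just to run the code.
toy_scale = {a: i / 19 for i, a in enumerate(AMINO_ACIDS)}
vector = pseaac("MKVLAAGIVGLLLA", toy_scale, lam=2)
print(len(vector))  # 22
```

Sequences of any length map to the same 20 + lam dimensions, which is what lets them be fed to standard classifiers.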

14.
Cryo-electron microscopy (cryo-EM) single-particle analysis is a revolutionary imaging technique to resolve and visualize biomacromolecules. Image alignment in cryo-EM is an important and basic step to improve the precision of the image distance calculation. However, it is a very challenging task due to the low signal-to-noise ratio of the images. Therefore, we propose a new deep unsupervised difference learning (UDL) strategy with a novel pseudo-label-guided learning network architecture and apply it to pair-wise image alignment in cryo-EM. The training framework is fully unsupervised. Furthermore, a variant of UDL called joint UDL (JUDL) is also proposed, which is capable of utilizing the similarity information of the whole dataset and thus further increasing the alignment precision. Assessments on both real-world and synthetic cryo-EM single-particle image datasets suggest that the new unsupervised joint alignment method can achieve more accurate alignment results. Our method is highly efficient because it takes advantage of GPU devices. The source code of our methods is publicly available at "http://www.csbio.sjtu.edu.cn/bioinf/JointUDL/" for academic use.

15.
Thiopeptides are a growing class of sulfur-rich, highly modified heterocyclic peptides that are mainly active against Gram-positive bacteria, including various drug-resistant pathogens. Recent studies also reveal that many thiopeptides inhibit the proliferation of human cancer cells, further expanding their potential for clinical use. Thiopeptide biosynthesis shares a common paradigm, featuring a ribosomally synthesized precursor peptide and conserved posttranslational modifications, to afford a characteristic core system, but differs in tailoring to furnish individual members. Identification of new thiopeptide gene clusters, by taking advantage of the increasing information on DNA sequences from bacteria, may facilitate new thiopeptide discovery and enrichment of the unique biosynthetic elements to produce novel drug leads by applying the principle of combinatorial biosynthesis. In this study, we have developed a web-based tool, ThioFinder, to rapidly identify thiopeptide biosynthetic gene clusters from DNA sequences using a profile Hidden Markov Model approach. Fifty-four new putative thiopeptide biosynthetic gene clusters were found in the sequenced bacterial genomes of previously unknown producing microorganisms. ThioFinder is fully supported by an open-access database, ThioBase, which contains comprehensive information on the 99 known thiopeptides regarding the chemical structure, biological activity, producing organism, and biosynthetic gene (cluster), along with the associated genome if available. The ThioFinder website offers researchers a unique resource and great flexibility for sequence analysis of thiopeptide biosynthetic gene clusters. ThioFinder is freely available at http://db-mml.sjtu.edu.cn/ThioFinder/.

16.
Proteases are vitally important to life cycles and have become a main target in drug development. According to their action mechanisms, proteases are classified into six types: (1) aspartic, (2) cysteine, (3) glutamic, (4) metallo, (5) serine, and (6) threonine. Given the sequence of an uncharacterized protein, can we identify whether it is a protease or non-protease? If it is, what type does it belong to? To address these problems, a 2-layer predictor, called "ProtIdent", is developed by fusing the functional domain and sequential evolution information: the first layer is for identifying the query protein as protease or non-protease; if it is a protease, the process will automatically go to the second layer to further identify it among the six types. The overall success rates in both cases by rigorous cross-validation tests were higher than 92%. ProtIdent is freely accessible to the public as a web server at http://www.csbio.sjtu.edu.cn/bioinf/Protease.
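The two-layer control flow described above can be sketched as a simple cascade. The function `two_layer_predict` and the two classifier arguments are hypothetical placeholders; the real ProtIdent layers are built on functional domain and sequential evolution features, which are not reproduced here.

```python
PROTEASE_TYPES = ("aspartic", "cysteine", "glutamic", "metallo", "serine", "threonine")


def two_layer_predict(sequence, is_protease, protease_type):
    """Layer 1 decides protease vs non-protease; layer 2 assigns one of six types.

    is_protease: callable(sequence) -> bool      (stand-in for layer 1)
    protease_type: callable(sequence) -> str     (stand-in for layer 2)
    """
    if not is_protease(sequence):
        return "non-protease"  # layer 2 is never consulted
    predicted = protease_type(sequence)
    if predicted not in PROTEASE_TYPES:
        raise ValueError(f"unexpected protease type: {predicted}")
    return predicted


# Dummy classifiers for demonstration only; they carry no biological meaning.
result = two_layer_predict(
    "MKTAYIAKQR",
    is_protease=lambda s: len(s) > 5,
    protease_type=lambda s: "serine",
)
print(result)  # serine
```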

17.
Predicting subcellular localization of human proteins is a challenging problem, particularly when query proteins may have a multiplex character, i.e., simultaneously exist at, or move between, two or more different subcellular location sites. In a previous study, we developed a predictor called "Hum-mPLoc" to deal with the multiplex problem for the human protein system. However, Hum-mPLoc has the following shortcomings. (1) The accession number of a query protein is required as input in order to obtain a higher expected success rate through the higher-level prediction pathway; but many proteins, such as synthetic and hypothetical proteins as well as newly discovered proteins not yet deposited in databanks, do not have accession numbers. (2) Neither functional domain nor sequential evolution information was taken into account in Hum-mPLoc, and hence its power may be reduced accordingly. In view of this, a top-down strategy to address these shortcomings has been implemented. The new predictor thus obtained is called Hum-mPLoc 2.0, where the accession number is no longer needed as input. Moreover, both the functional domain information and the sequential evolution information have been fused into the predictor by an ensemble classifier. As a consequence, the prediction power has been significantly enhanced. The web server of Hum-mPLoc 2.0 is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/hum-multi-2/.

18.
By incorporating information on gene ontology, functional domain, and sequential evolution, a new predictor called Gneg-mPLoc was developed. It can be used to identify Gram-negative bacterial proteins among the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. It can also be used to deal with the case when a query protein may simultaneously exist in more than one location. Compared with the original predictor called Gneg-PLoc, the new predictor is much more powerful and flexible. For a newly constructed stringent benchmark dataset in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same subset (location), the overall jackknife success rate achieved by Gneg-mPLoc was 85.5%, which was more than 14% higher than the corresponding rate achieved by Gneg-PLoc. As a user-friendly web server, Gneg-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/.
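The jackknife success rate quoted above comes from a leave-one-out test: each protein is held out in turn, the predictor is trained on the rest, and the held-out protein is predicted. A minimal sketch follows, using a toy one-dimensional nearest-neighbor classifier as a stand-in; `jackknife_success_rate` and `train_nearest_neighbor` are hypothetical names, and the data are invented.

```python
def jackknife_success_rate(samples, train):
    """Leave-one-out success rate.

    samples: list of (features, label).
    train: callable(list of samples) -> predict(features) -> label.
    """
    correct = 0
    for i, (features, label) in enumerate(samples):
        # Train on everything except the i-th sample, then predict it.
        model = train(samples[:i] + samples[i + 1:])
        correct += model(features) == label
    return correct / len(samples)


def train_nearest_neighbor(training):
    """Toy 1-D nearest-neighbor classifier used only for the demonstration."""
    def predict(x):
        return min(training, key=lambda s: abs(s[0] - x))[1]
    return predict


data = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (1.1, "b"), (0.6, "a")]
print(jackknife_success_rate(data, train_nearest_neighbor))  # 0.8
```

The last sample is misclassified because its nearest remaining neighbor carries the other label, giving 4 correct predictions out of 5.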

19.
Pattern genes are a group of genes that have a modularized expression behavior under serial physiological conditions. The identification of pattern genes will provide a path toward a global and dynamic understanding of gene functions and their roles in particular biological processes or events, such as development and pathogenesis. In this study, we present PaGenBase, a novel repository for the collection of tissue- and time-specific pattern genes, including specific genes, selective genes, housekeeping genes and repressed genes. The PaGenBase database is now freely accessible at http://bioinf.xmu.edu.cn/PaGenBase/. In the current version (PaGenBase 1.0), the database contains 906,599 pattern genes derived from the literature or from data mining of more than 1,145,277 gene expression profiles in 1,062 distinct samples collected from 11 model organisms. Four statistical parameters were used to quantitatively evaluate the pattern genes. Moreover, three methods (quick search, advanced search and browse) were designed for rapid and customized data retrieval. The potential applications of PaGenBase are also briefly described. In summary, PaGenBase will serve as a resource for the global and dynamic understanding of gene function and will facilitate high-level investigations in a variety of fields, including the study of development, pathogenesis and novel drug discovery.

20.

Background

Metagenomics can reveal the vast majority of microbes that have been missed by traditional cultivation-based methods. Due to its extremely wide range of application areas, fast metagenome sequencing simulation systems with high fidelity are in great demand to facilitate the development and comparison of metagenomics analysis tools.

Results

We present here a customizable metagenome simulation system: NeSSM (Next-generation Sequencing Simulator for Metagenomics). Combining the complete genomes currently available, a community composition table, and sequencing parameters, it can simulate metagenome sequencing better than existing systems. Sequencing error models based on the explicit distribution of errors at each base and sequencing coverage bias are incorporated in the simulation. In order to improve the fidelity of simulation, NeSSM provides tools to estimate the sequencing error models, sequencing coverage bias and community composition directly from existing metagenome sequencing data. Currently, NeSSM supports single-end and paired-end sequencing for both 454 and Illumina platforms. In addition, a GPU (graphics processing unit) version of NeSSM has also been developed to accelerate the simulation. By comparing the simulated sequencing data from NeSSM with experimental metagenome sequencing data, we have demonstrated that NeSSM performs better in many aspects than existing popular metagenome simulators, such as MetaSim, GemSIM and Grinder. The GPU version of NeSSM is more than one order of magnitude faster than MetaSim.

Conclusions

NeSSM is a fast simulation system for high-throughput metagenome sequencing. It can help in developing tools and evaluating strategies for metagenomics analysis, and it is freely available for academic users at http://cbb.sjtu.edu.cn/~ccwei/pub/software/NeSSM.php.
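The three inputs named in the Results (genomes, a community composition table, and sequencing parameters) suggest the following heavily simplified sketch of single-end read simulation. It assumes a uniform substitution error rate and uniform read start positions, whereas NeSSM is described as using per-base error distributions and coverage bias; `simulate_reads` and all of its parameters are hypothetical and do not reflect NeSSM's interface.

```python
import random


def simulate_reads(genomes, abundance, n_reads, read_len, error_rate, seed=0):
    """Sample single-end reads from a mock community.

    genomes: {name: DNA string}; abundance: {name: relative abundance}.
    Each genome must be at least `read_len` bases long.
    Returns a list of (source genome name, read sequence).
    """
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundance[name] for name in names]
    reads = []
    for _ in range(n_reads):
        # Pick the source genome in proportion to its abundance.
        name = rng.choices(names, weights=weights)[0]
        genome = genomes[name]
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        # Introduce substitution errors independently at each base.
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((name, "".join(bases)))
    return reads


community = {"genomeA": "ACGT" * 10, "genomeB": "TTGCA" * 8}
reads = simulate_reads(community, {"genomeA": 0.7, "genomeB": 0.3},
                       n_reads=5, read_len=8, error_rate=0.01)
for source, read in reads:
    print(source, read)
```

Fixing the seed makes a simulated dataset reproducible, which is useful when comparing analysis tools on the same reads.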
