Computational Identification of Potential Molecular Interactions in Arabidopsis |
| |
Authors: | Mingzhi Lin Bin Hu Lijuan Chen Peng Sun Yi Fan Ping Wu Xin Chen |
| |
Affiliation: | Department of Bioinformatics, Zhejiang University, Hangzhou, People's Republic of China, 310058 (M.L., B.H., L.C., Y.F., X.C.); and State Key Laboratory of Plant Physiology and Biochemistry, Zhejiang University, Hangzhou, People's Republic of China, 310058 (P.S., P.W.) |
| |
Abstract: | Knowledge of the protein interaction network is useful for molecular mechanism studies. Several major repositories have been established to collect and organize reported protein interactions. Many interactions have been reported in several model organisms, yet only a very limited number of plant interactions can thus far be found in these major databases. Computational identification of potential plant interactions, therefore, is desired to facilitate relevant research. In this work, we constructed a support vector machine model to predict potential Arabidopsis (Arabidopsis thaliana) protein interactions based on a variety of indirect evidence. In a 100-iteration bootstrap evaluation, the confidence of our predicted interactions was estimated to be 48.67%, and these interactions were expected to cover 29.02% of the entire interactome. The sensitivity of our model was validated with an independent evaluation data set consisting of newly reported interactions that did not overlap with the examples used in model training and testing. Results showed that our model successfully recognized 28.91% of the new interactions, similar to its expected sensitivity (29.02%). Applying this model to all possible Arabidopsis protein pairs resulted in 224,206 potential interactions, which is the largest and most accurate set of predicted Arabidopsis interactions at present. To facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource, with detailed annotations and more specific per-interaction confidence measurements. This database and related documents are freely accessible at http://www.cls.zju.edu.cn/pair/.

The complex cellular functions of an organism rely on physical interactions between proteins. Deciphering the protein-protein interaction network to understand higher-level phenotypes and their regulation has long been a major focus of both experimental and computational biologists.
A number of high-throughput (HTP) assays have been developed to identify in vitro protein interactions from several model organisms (Uetz et al., 2000; Giot et al., 2003; Li et al., 2004). A number of initiatives, such as IntAct (Kerrien et al., 2006), Molecular INTeraction database (Chatr-aryamontri et al., 2007), the Database of Interacting Proteins (Salwinski et al., 2004), Biomolecular Interaction Network Database (BIND; Alfarano et al., 2005), and BioGRID (Stark et al., 2006), have been established to systematically collect and organize the interaction data reported by both proteome-scale HTP experiments and traditional low-throughput studies focusing on individual proteins or pathways.

Arabidopsis (Arabidopsis thaliana) has long been studied as a model organism to investigate the physiology, biochemistry, growth, development, and metabolism of a flowering plant at the molecular level. Studies of the molecular mechanisms behind various phenotypes and their regulation in Arabidopsis may be facilitated by a comprehensive reference protein interaction network, on the basis of which working hypotheses could be formulated with more guidance and confidence. However, due to technological limitations, most experimentally reported protein interactions in available databases are from other organisms, and only a very limited number of plant interactions can be found in these databases. Therefore, an accurate prediction of the Arabidopsis interactome would be valuable to assist relevant research.

Studies on the computational identification of potential interactions started along with the advent of HTP interaction-detection technologies, which often produced a large number of false positives (Deane et al., 2002). Indirect evidence of protein interaction (e.g. protein colocalization and relevance in function) was hence introduced to boost the confidence of HTP results (Jansen et al., 2003).
Further investigations demonstrated that direct inference of protein interactions from such indirect evidence alone was possible (Scott and Barton, 2007). The accuracy and effectiveness of using indirect evidence to predict interactions have also been thoroughly assessed (Qi et al., 2006; Suthram et al., 2006). These works offered valuable insights into how protein interactions may be predicted accurately on a proteomic scale. In other organisms such as Homo sapiens, the prediction of an entire interactome has already been shown to be feasible and useful (Rhodes et al., 2005).

Meanwhile, several efforts have been made to collect and organize a comprehensive map of Arabidopsis molecular interactions. For instance, around 20,000 interactions were inferred by homology to known interactions in other organisms (Geisler-Lee et al., 2007). Another work predicted 23,396 interactions based on multiple indirect data and curated 4,666 interactions from the literature and enzyme complexes (Cui et al., 2008). The Arabidopsis reactome database was established to describe the functions of 2,195 proteins with 8,269 reactions in 318 superpathways (Tsesmetzis et al., 2008). In addition, a general interaction database, IntAct (Kerrien et al., 2006), has allocated a special unit to actively curate all plant protein interactions from the literature and submitted data sets, and it now contains 2,649 Arabidopsis interactions. However, in yeast, approximately 18,000 protein-protein interactions have been estimated for approximately 6,000 genes (Yu et al., 2008). Assuming the same rate of interaction, approximately 200,000 protein interactions would be expected for approximately 20,000 Arabidopsis genes. Therefore, the current collection of Arabidopsis interactions is still significantly limited.
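
The extrapolation from yeast reproduces the quoted figure if "the same rate of interaction" is read as a rate per possible protein pair rather than per gene. The short Python sketch below works through the arithmetic under that assumption; the function name `extrapolate_interactions` is illustrative and does not come from the original work.

```python
from math import comb


def extrapolate_interactions(known_interactions, known_genes, target_genes):
    """Scale an interaction count by the number of unordered protein pairs.

    Assumption (not stated explicitly in the text): the interaction rate is
    constant per possible pair of proteins, one protein per gene.
    """
    rate_per_pair = known_interactions / comb(known_genes, 2)
    return rate_per_pair * comb(target_genes, 2)


# Yeast figures quoted in the text: ~18,000 interactions for ~6,000 genes.
estimate = extrapolate_interactions(18_000, 6_000, 20_000)

# Linear scaling with gene number, shown only for contrast.
linear_estimate = 18_000 / 6_000 * 20_000

print(f"pairwise scaling: {estimate:,.0f}")
print(f"linear scaling:   {linear_estimate:,.0f}")
```

Pairwise scaling gives a value close to 200,000, whereas linear scaling with gene number gives 60,000, so the pairwise reading is the one that matches the estimate in the text.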
Moreover, most previous prediction works did not provide rigorous confidence measurements for their predicted interactions, which further limited their scope of application.

Recent advances in statistical learning have produced a powerful algorithm, the support vector machine (SVM), which may be used to predict interactions based on multiple indirect data. Although the foundations of SVM were laid in the 1960s, the method was only formally proposed in the 1990s by Vapnik (1998, 2000). Research on its theory and applications has thrived since. It has been applied to a wide range of problems, including text categorization (de Vel et al., 2001; Kim et al., 2001), image classification and object detection (Ben-Yacoub et al., 1999; Karlsen et al., 2000), flood stage forecasting (Liong and Sivapragasam, 2002), microarray gene expression data analysis (Brown et al., 2000), drug design (Zhao et al., 2006a, 2006b), protein solvent accessibility prediction (Yuan et al., 2002), and protein fold prediction (Ding and Dubchak, 2001; Hua and Sun, 2001). Many studies have reported that SVM was consistently superior to other supervised learning methods (Brown et al., 2000; Burbidge et al., 2001; Cai et al., 2003).

In this work, with careful preparation of example data and selection of indirect evidence, we constructed an SVM model to predict potential Arabidopsis interactions. False positives were tightly controlled. With the high-confidence model, we identified a total of 224,206 potential interactions, which were expected to be 48.67% accurate and to cover 29.02% of the entire Arabidopsis interactome. More specific confidence measurements were also assigned on a per-interaction basis. To facilitate the use of our results, we present the Predicted Arabidopsis Interactome Resource (PAIR; http://www.cls.zju.edu.cn/pair/), featuring detailed annotations and a friendly user interface. |
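
The figures quoted above can be related to one another with a little arithmetic. The Python sketch below assumes that "accurate" refers to precision (the fraction of predicted interactions that are real) and "cover" to sensitivity (the fraction of real interactions that are predicted), which is how the abstract uses these terms. The derived values are a back-of-the-envelope consistency check, not results reported by the authors, and the variable names are illustrative.

```python
# Figures quoted in the text.
predicted = 224_206      # predicted interactions
precision = 0.4867       # estimated confidence of the predictions
sensitivity = 0.2902     # estimated coverage of the interactome

# Expected number of predictions that are real interactions.
expected_true = predicted * precision

# Interactome size implied if those true positives make up 29.02% of it.
# Assumption: both percentages describe the same set of predictions.
implied_interactome = expected_true / sensitivity

print(f"expected true interactions: {expected_true:,.0f}")
print(f"implied interactome size:   {implied_interactome:,.0f}")
```

Under these assumptions, roughly 109,000 of the predictions would be expected to be real. The implied interactome size depends entirely on the two estimated percentages and should be read only as an order-of-magnitude figure.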
| |
Keywords: | |
|
|