Focus Issue on Plant Cell Walls: A Bioinformatics Approach to the Identification, Classification, and Analysis of Hydroxyproline-Rich Glycoproteins
Authors:Allan M Showalter  Brian Keppler  Jens Lichtenberg  Dazhang Gu  Lonnie R Welch
Institution:Molecular and Cellular Biology Program, Department of Environmental and Plant Biology (A.M.S., B.K.), and Center for Intelligent, Distributed, and Dependable Systems, Russ College of Engineering and Technology (J.L., D.G., L.R.W.), Ohio University, Athens, Ohio 45701–2979
Abstract: Hydroxyproline-rich glycoproteins (HRGPs) are a superfamily of plant cell wall proteins that function in diverse aspects of plant growth and development. This superfamily consists of three members: hyperglycosylated arabinogalactan proteins (AGPs), moderately glycosylated extensins (EXTs), and lightly glycosylated proline-rich proteins (PRPs). Hybrid and chimeric versions of HRGP molecules also exist. In order to “mine” genomic databases for HRGPs and to facilitate and guide research in the field, the BIO OHIO software program was developed to identify and classify AGPs, EXTs, PRPs, hybrid HRGPs, and chimeric HRGPs from proteins predicted from DNA sequence data. This bioinformatics program is based on searching for biased amino acid compositions and for particular protein motifs associated with known HRGPs. HRGPs identified by the program are subsequently analyzed to elucidate the following: (1) repeating amino acid sequences, (2) signal peptide and glycosylphosphatidylinositol lipid anchor addition sequences, (3) similar HRGPs via Basic Local Alignment Search Tool (BLAST), (4) expression patterns of their genes, (5) other HRGP, glycosyl transferase, prolyl 4-hydroxylase, and peroxidase genes coexpressed with their genes, and (6) gene structure and whether genetic mutants exist in their genes. The program was used to identify and classify 166 HRGPs from Arabidopsis (Arabidopsis thaliana) as follows: 85 AGPs (including classical AGPs, lysine-rich AGPs, arabinogalactan peptides, fasciclin-like AGPs, plastocyanin AGPs, and other chimeric AGPs), 59 EXTs (including SP5 EXTs, SP5/SP4 EXTs, SP4 EXTs, SP4/SP3 EXTs, a SP3 EXT, “short” EXTs, leucine-rich repeat-EXTs, proline-rich extensin-like receptor kinases, and other chimeric EXTs), 18 PRPs (including PRPs and chimeric PRPs), and four AGP/EXT hybrid HRGPs.

The genomics era has produced vast amounts of biological data that await examination.
In order to “mine” such data effectively, a bioinformatics approach can be utilized to identify genes of interest, subject them to various in silico analyses, and extract relevant biological information on them from various public databases. Examination of such data produces novel insights with respect to the genes in question and can be used to facilitate and guide further research in the field. Such is the case here, where bioinformatics tools were developed to identify, classify, and analyze members of the Hyp-rich glycoprotein (HRGP) superfamily encoded by the Arabidopsis (Arabidopsis thaliana) genome.

HRGPs are a superfamily of plant cell wall proteins that are subdivided into three families, arabinogalactan proteins (AGPs), extensins (EXTs), and Pro-rich proteins (PRPs), and have been extensively reviewed (Showalter, 1993; Kieliszewski and Lamport, 1994; Nothnagel, 1997; Cassab, 1998; José-Estanyol and Puigdomènech, 2000; Seifert and Roberts, 2007). However, it has become increasingly clear that the HRGP superfamily is perhaps better represented as a spectrum of molecules ranging from the highly glycosylated AGPs to the moderately glycosylated EXTs and finally to the lightly glycosylated PRPs. Moreover, hybrid HRGPs, composed of HRGP modules from different families, and chimeric HRGPs, composed of one or more HRGP modules within a non-HRGP protein, also can be considered part of the HRGP superfamily. Given that many HRGPs are composed of repetitive protein sequences, particularly the EXTs and PRPs, and many have low sequence similarity to one another, particularly the AGPs, BLAST searches typically identify only a few closely related family members and do not represent a particularly effective means to identify members of the HRGP superfamily in a comprehensive manner.

Building upon the work of Schultz et al.
(2002) that focused on the AGP family, a new bioinformatics software program, BIO OHIO, developed at Ohio University, makes it possible to search all 28,952 proteins encoded by the Arabidopsis genome and identify putative HRGP genes. Two distinct types of searches are possible with this program. First, the program can search for biased amino acid compositions in the genome-encoded protein sequences. For example, classical AGPs can be identified by their biased amino acid compositions of greater than 50% Pro (P), Ala (A), Ser (S), and Thr (T), as indicated by greater than 50% PAST. Similarly, arabinogalactan peptides (AG peptides) are identified by biased amino acid compositions of greater than 35% PAST, but the protein (i.e. peptide) must also be between 50 and 90 amino acids in length. Likewise, PRPs can be identified by a biased amino acid composition of greater than 45% PVKCYT. Second, the program can search for specific amino acid motifs that are commonly found in known HRGPs. For example, SP4 pentapeptide and SP3 tetrapeptide motifs are associated with EXTs, a fasciclin H1 motif is found in fasciclin-like AGPs (FLAs), and PPVX(K/T) (where X is any amino acid) and KKPCPP motifs are found in several known PRPs (Fowler et al., 1999). In addition to searching for HRGPs, the program can analyze proteins identified by a search. For example, the program checks for potential signal peptide sequences and glycosylphosphatidylinositol (GPI) plasma membrane anchor addition sequences, both of which are associated with HRGPs (Showalter, 1993, 2001; Youl et al., 1998; Sherrier et al., 1999; Svetek et al., 1999).
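The two search types described above can be illustrated with a minimal Python sketch. This is not the BIO OHIO code; the function names, the one-letter sequence input, and the regular expressions are assumptions made for illustration. In particular, SP3 and SP4 are read here as Ser followed by three or four Pro residues (SPPP and SPPPP), consistent with the tetrapeptide and pentapeptide description, and the fasciclin H1 motif is omitted because its sequence is not given in the text.

```python
import re


def composition_percent(seq, residues):
    """Percentage of positions in seq occupied by any residue in `residues`."""
    if not seq:
        return 0.0
    wanted = set(residues)
    return 100.0 * sum(1 for aa in seq if aa in wanted) / len(seq)


def is_classical_agp_candidate(seq):
    # Greater than 50% Pro, Ala, Ser, Thr (PAST), as stated in the text.
    return composition_percent(seq, "PAST") > 50.0


def is_ag_peptide_candidate(seq):
    # Greater than 35% PAST and 50 to 90 amino acids long, as stated in the text.
    return 50 <= len(seq) <= 90 and composition_percent(seq, "PAST") > 35.0


def is_prp_candidate_by_composition(seq):
    # Greater than 45% PVKCYT, as stated in the text.
    return composition_percent(seq, "PVKCYT") > 45.0


# Hypothetical motif table: the pattern strings are this sketch's reading of
# the motif names, not patterns taken from the program itself.
MOTIFS = {
    "SP3": re.compile(r"SPPP"),
    "SP4": re.compile(r"SPPPP"),
    "PPVX(K/T)": re.compile(r"PPV.[KT]"),
    "KKPCPP": re.compile(r"KKPCPP"),
}


def count_motifs(seq):
    """Count non-overlapping matches of each motif in seq.

    Note that every SP4 also contains an SP3, so the two counts are not
    independent in this simple version.
    """
    return {name: len(pattern.findall(seq)) for name, pattern in MOTIFS.items()}
```

Applied to every predicted protein in a genome, filters of this kind yield candidate lists only; the text makes clear that candidates are then examined further (signal peptide, GPI anchor addition sequence, repeat distribution, BLAST) before classification.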
Moreover, the program can identify repeated amino acid sequences within the sequence and has the ability to search for biased amino acid compositions within a sliding window of user-defined size, making it possible to identify HRGP domains within a protein sequence.

Here, we report on the use of this bioinformatics program in identifying, classifying, and analyzing members of the HRGP superfamily (i.e. AGPs, EXTs, PRPs, hybrid HRGPs, and chimeric HRGPs) in the genetic model plant Arabidopsis. An overview of this bioinformatics approach is presented in Figure 1. In addition, public databases and programs were accessed and utilized to extract relevant biological information on these HRGPs in terms of their expression patterns, most similar sequences via BLAST analysis, available genetic mutants, and coexpressed HRGP, glycosyl transferase (GT), prolyl 4-hydroxylase (P4H), and peroxidase genes in Arabidopsis. This information provides new insight into the HRGP superfamily and can be used by researchers to facilitate and guide further research in the field. Moreover, the bioinformatics tools developed here can be readily applied to protein sequences from other species to analyze their HRGPs or, for that matter, any given protein family by altering the input parameters.

Figure 1. Bioinformatics workflow diagram summarizing the identification, classification, and analysis of HRGPs (AGPs, EXTs, and PRPs) in Arabidopsis. Classical AGPs were defined as containing greater than 50% PAST coupled with the presence of AP, PA, SP, and TP repeats distributed throughout the protein, Lys-rich AGPs were a subgroup of classical AGPs that included a Lys-rich domain, and chimeric AGPs were defined as containing greater than 50% PAST coupled with the localized distribution of AP, PA, SP, and TP repeats.
AG peptides were defined to be 50 to 90 amino acids in length and containing greater than 35% PAST coupled with the presence of AP, PA, SP, and TP repeats distributed throughout the peptide. FLAs were defined as having a fasciclin domain coupled with the localized distribution of AP, PA, SP, and TP repeats. Extensins were defined as containing two or more SP3 or SP4 repeats coupled with the distribution of such repeats throughout the protein; chimeric extensins were similarly identified but were distinguished from the extensins by the localized distribution of such repeats in the protein; and short extensins were defined to be less than 200 amino acids in length coupled with the extensin definition. PRPs were identified as containing greater than 45% PVKCYT or two or more KKPCPP or PVX(K/T) repeats coupled with the distribution of such repeats and/or PPV throughout the protein. Chimeric PRPs were similarly identified but were distinguished from PRPs by the localized distribution of such repeats in the protein. Hybrid HRGPs (i.e. AGP/EXT hybrids) were defined as containing two or more repeat units used to identify AGPs, extensins, or PRPs. The presence of a signal peptide was used to provide added support for the identification of an HRGP but was not used in an absolute fashion. Similarly, the presence of a GPI anchor addition sequence was used to provide added support for the identification of classical AGPs and AG peptides, which are known to contain such sequences. BLAST searches were also used to provide some support to our classification if the query sequence showed similarity to other members of an HRGP subfamily. Note that some AGPs, particularly chimeric AGPs, and PRPs were identified from an Arabidopsis database annotation search and that two chimeric extensins were identified from the primary literature as noted in the text.
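
The sliding-window composition search mentioned in the introduction, which allows HRGP domains to be located within a larger (for example, chimeric) protein, could be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the BIO OHIO implementation: the function names, the default window size of 50, and the step of merging overlapping windows into domains are all hypothetical choices.

```python
def biased_windows(seq, residues="PAST", window=50, threshold=50.0):
    """Return (start, end, percent) for each window exceeding the threshold.

    `window` is the user-defined window size; `threshold` is a percentage.
    Coordinates are 0-based with an exclusive end.
    """
    if window <= 0 or len(seq) < window:
        return []
    wanted = set(residues)
    flags = [1 if aa in wanted else 0 for aa in seq]
    count = sum(flags[:window])  # matches in the first window
    hits = []
    for start in range(len(seq) - window + 1):
        if start > 0:
            # Slide by one: add the residue entering, drop the one leaving.
            count += flags[start + window - 1] - flags[start - 1]
        percent = 100.0 * count / window
        if percent > threshold:
            hits.append((start, start + window, percent))
    return hits


def merge_windows(hits):
    """Merge overlapping or touching windows into (start, end) domains."""
    domains = []
    for start, end, _ in hits:
        if domains and start <= domains[-1][1]:
            domains[-1] = (domains[-1][0], max(domains[-1][1], end))
        else:
            domains.append((start, end))
    return domains


# Usage: a PAST-rich stretch embedded in an otherwise unbiased sequence.
example = "G" * 30 + "PAST" * 5 + "G" * 30
print(merge_windows(biased_windows(example, "PAST", window=20, threshold=50.0)))
```

A single merged domain covering only part of a protein would be consistent with the "localized distribution" used above to describe chimeric HRGPs, whereas domains spanning most of the sequence would suggest repeats "distributed throughout the protein". The legend does not state a numeric cutoff for that distinction, so none is assumed here.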