Similar articles
Found 20 similar articles (search time: 31 ms)
1.
When using cDNA microarrays, normalization to correct labeling bias is a common preliminary step before further data analysis, its objective being to reduce variation between arrays. To date, assessment of the effectiveness of normalization has mainly been confined to the ability to detect differentially expressed genes. Since a major use of microarrays is expression-based phenotype classification, it is important to evaluate microarray normalization procedures relative to classification. Using a model-based approach, we model the systematic-error process to generate synthetic gene-expression values with known ground truth. These synthetic expression values are subjected to typical normalization methods and passed through a set of classification rules, the objective being to carry out a systematic study of the effect of normalization on classification. Three normalization methods are considered: offset, linear regression, and Lowess regression. Seven classification rules are considered: 3-nearest neighbor, linear support vector machine, linear discriminant analysis, regular histogram, Gaussian kernel, perceptron, and multiple perceptrons with majority voting. The results of the first three are presented in the paper, with the full results given on a complementary website. The conclusion from the different experimental models considered in the study is that normalization can significantly benefit classification under difficult experimental conditions, with linear and Lowess regression slightly outperforming the offset method.
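The offset and linear-regression normalizations named above can be pictured on synthetic data. This is a minimal numpy sketch under invented assumptions (the intensity model, bias parameters, and array size are illustrative, not the paper's error model); Lowess would replace the global line with a locally weighted fit, e.g. `statsmodels`' `lowess`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic two-channel array: true log-ratios near zero,
# distorted by an intensity-dependent labeling bias.
A = rng.uniform(6, 14, size=1000)                 # mean log2 intensity per gene
M = 0.3 + 0.05 * A + rng.normal(0, 0.2, 1000)     # biased log2 ratio per gene

# Offset normalization: subtract the global median log-ratio.
M_offset = M - np.median(M)

# Linear-regression normalization: remove the intensity-dependent trend.
slope, intercept = np.polyfit(A, M, 1)
M_linreg = M - (slope * A + intercept)

print(round(float(np.median(M_offset)), 3), round(float(np.mean(M_linreg)), 3))
```

The offset method can only shift all ratios by a constant, while the regression methods also remove the intensity-dependent component, which matches the abstract's finding that they perform slightly better.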

3.
A variety of high-throughput methods have made it possible to generate detailed temporal expression data for a single gene or large numbers of genes. Common methods for analysis of these large data sets can be problematic. One challenge is the comparison of temporal expression data obtained from different growth conditions where the patterns of expression may be shifted in time. We propose the use of wavelet analysis to transform the data obtained under different growth conditions to permit comparison of expression patterns from experiments that have time shifts or delays. We demonstrate this approach using detailed temporal data for a single bacterial gene obtained under 72 different growth conditions. This general strategy can be applied in the analysis of data sets of thousands of genes under different conditions.
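One hedged way to picture the time-shift problem: smooth two hypothetical expression profiles with a wavelet kernel, then estimate the delay between conditions from the cross-correlation peak. The profiles, kernel width, and shift below are invented for illustration, not the paper's 72-condition data.

```python
import numpy as np

def ricker(points, a):
    # Mexican-hat (Ricker) wavelet, a common CWT kernel.
    t = np.arange(points) - (points - 1) / 2
    return (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

# Hypothetical expression profiles: the same response under two growth
# conditions, the second delayed by 10 time points.
t = np.arange(200)
profile_a = np.exp(-0.5 * ((t - 80) / 15) ** 2)
profile_b = np.exp(-0.5 * ((t - 90) / 15) ** 2)

# Smooth both with the wavelet before alignment to suppress noise, then
# read the delay off the cross-correlation peak.
w = ricker(31, 4.0)
sa = np.convolve(profile_a, w, mode="same")
sb = np.convolve(profile_b, w, mode="same")
xcorr = np.correlate(sb, sa, mode="full")
delay = int(np.argmax(xcorr)) - (len(sa) - 1)
print(delay)  # estimated shift of condition B relative to condition A
```

Once each profile has been transformed, shifted experiments can be compared after removing the estimated delay.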

4.
Knowledge of elaborate structures of protein complexes is fundamental for understanding their functions and regulations. Although cross-linking coupled with mass spectrometry (MS) has been presented as a feasible strategy for structural elucidation of large multisubunit protein complexes, this method has proven challenging because of technical difficulties in unambiguous identification of cross-linked peptides and determination of cross-linked sites by MS analysis. In this work, we developed a novel cross-linking strategy using a newly designed MS-cleavable cross-linker, disuccinimidyl sulfoxide (DSSO). DSSO contains two symmetric collision-induced dissociation (CID)-cleavable sites that allow effective identification of DSSO-cross-linked peptides based on their distinct fragmentation patterns unique to cross-linking types (i.e. interlink, intralink, and dead end). The CID-induced separation of interlinked peptides in MS/MS permits MS3 analysis of single peptide chain fragment ions with defined modifications (due to DSSO remnants) for easy interpretation and unambiguous identification using existing database searching tools. Integration of data analyses from three generated data sets (MS, MS/MS, and MS3) allows high confidence identification of DSSO cross-linked peptides. The efficacy of the newly developed DSSO-based cross-linking strategy was demonstrated using model peptides and proteins. In addition, this method was successfully used for structural characterization of the yeast 20 S proteasome complex. In total, 13 non-redundant interlinked peptides of the 20 S proteasome were identified, representing the first application of an MS-cleavable cross-linker for the characterization of a multisubunit protein complex. 
Given its effectiveness and simplicity, this cross-linking strategy can find a broad range of applications in elucidating the structural topology of proteins and protein complexes. Proteins form stable and dynamic multisubunit complexes under different physiological conditions to maintain cell viability and normal cell homeostasis. Detailed knowledge of protein interactions and protein complex structures is fundamental to understanding how individual proteins function within a complex and how the complex functions as a whole. However, structural elucidation of large multisubunit protein complexes has been difficult because of a lack of technologies that can effectively handle their dynamic and heterogeneous nature. Traditional methods such as nuclear magnetic resonance (NMR) analysis and x-ray crystallography can yield detailed information on protein structures; however, NMR spectroscopy requires large quantities of pure protein in a specific solvent, whereas x-ray crystallography is often limited by the crystallization process. In recent years, chemical cross-linking coupled with mass spectrometry (MS) has become a powerful method for studying protein interactions (1–3). Chemical cross-linking stabilizes protein interactions through the formation of covalent bonds and allows the detection of stable, weak, and/or transient protein-protein interactions in native cells or tissues (4–9). In addition to capturing protein interacting partners, many studies have shown that chemical cross-linking can yield low resolution structural information about the constraints within a molecule (2, 3, 10) or protein complex (11–13). The application of chemical cross-linking, enzymatic digestion, and subsequent mass spectrometric and computational analyses for the elucidation of three-dimensional protein structures offers distinct advantages over traditional methods because of its speed, sensitivity, and versatility. 
Identification of cross-linked peptides provides distance constraints that aid in constructing the structural topology of proteins and/or protein complexes. Although this approach has been successful, effective detection and accurate identification of cross-linked peptides as well as unambiguous assignment of cross-linked sites remain extremely challenging due to their low abundance and complicated fragmentation behavior in MS analysis (2, 3, 10, 14). Therefore, new reagents and methods are urgently needed to allow unambiguous identification of cross-linked products and to improve the speed and accuracy of data analysis to facilitate its application in structural elucidation of large protein complexes. A number of approaches have been developed to facilitate MS detection of low abundance cross-linked peptides from complex mixtures. These include selective enrichment using affinity purification with biotinylated cross-linkers (15–17) and click chemistry with alkyne-tagged (18) or azide-tagged (19, 20) cross-linkers. In addition, Staudinger ligation has recently been shown to be effective for selective enrichment of azide-tagged cross-linked peptides (21). Apart from enrichment, detection of cross-linked peptides can be achieved by isotope-labeled (22–24), fluorescently labeled (25), and mass tag-labeled cross-linking reagents (16, 26). These methods can identify cross-linked peptides with MS analysis, but interpretation of the data generated from interlinked peptides (two peptides connected with the cross-link) by automated database searching remains difficult. Several bioinformatics tools have thus been developed to interpret MS/MS data and determine interlinked peptide sequences from complex mixtures (12, 14, 27–32). Although promising, further developments are still needed to make such data analyses as robust and reliable as analyzing MS/MS data of single peptide sequences using existing database searching tools (e.g. 
Protein Prospector, Mascot, or SEQUEST). Various types of cleavable cross-linkers with distinct chemical properties have been developed to facilitate MS identification and characterization of cross-linked peptides. These include UV photocleavable (33), chemically cleavable (19), isotopically coded cleavable (24), and MS-cleavable reagents (16, 26, 34–38). MS-cleavable cross-linkers have received considerable attention because the resulting cross-linked products can be identified based on their characteristic fragmentation behavior observed during MS analysis. Gas-phase cleavage sites result in the detection of a “reporter” ion (26), single peptide chain fragment ions (35–38), or both reporter and fragment ions (16, 34). In each case, further structural characterization of the peptide product ions generated during the cleavage reaction can be accomplished by subsequent MSn analysis. Among these linkers, the “fixed charge” sulfonium ion-containing cross-linker developed by Lu et al. (37) appears to be the most attractive, as it allows specific and selective fragmentation of cross-linked peptides regardless of their charge and amino acid composition based on their studies with model peptides. Despite the availability of multiple types of cleavable cross-linkers, most applications have been limited to the study of model peptides and single proteins. Additionally, complicated synthesis and fragmentation patterns have kept most of the known MS-cleavable cross-linkers from wide adoption by the community. Here we describe the design and characterization of a novel and simple MS-cleavable cross-linker, DSSO, and its application to model peptides and proteins and the yeast 20 S proteasome complex. In combination with new software developed for data integration, we were able to identify DSSO-cross-linked peptides from complex peptide mixtures with speed and accuracy. 
Given its effectiveness and simplicity, we anticipate a broader application of this MS-cleavable cross-linker in the study of the structural topology of other protein complexes using cross-linking and mass spectrometry.
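The distinct fragmentation pattern described above can be caricatured in code: an MS-cleavable linker leaves each constituent peptide carrying one of two remnant masses, so an interlinked pair shows up as fragment doublets with a fixed spacing. The remnant masses below are assumptions for illustration (approximate values commonly cited for DSSO-type sulfoxide linkers; check the original report before relying on them).

```python
# Assumed remnant masses in Da (hypothetical/approximate, for illustration).
ALKENE = 54.0106
THIOL = 85.9826
DOUBLET = THIOL - ALKENE   # characteristic fixed spacing (~31.97 Da)

def find_doublets(masses, tol=0.01):
    """Return pairs of fragment masses separated by the remnant doublet,
    the signature used here to flag interlinked peptides."""
    masses = sorted(masses)
    pairs = []
    for i, m1 in enumerate(masses):
        for m2 in masses[i + 1:]:
            if abs((m2 - m1) - DOUBLET) < tol:
                pairs.append((m1, m2))
    return pairs

# Hypothetical MS2 masses for an interlinked pair: each peptide appears
# once with each remnant after gas-phase cleavage of the linker.
pep_a, pep_b = 1000.50, 1500.75
spectrum = [pep_a + ALKENE, pep_a + THIOL, pep_b + ALKENE, pep_b + THIOL]
print(find_doublets(spectrum))
```

Each detected doublet corresponds to a single peptide chain with a defined modification, which is what makes subsequent MS3 identification with standard database search tools tractable.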

5.
A decoding algorithm is tested that mechanistically models the progressive alignments that arise as the mRNA moves past the rRNA tail during translation elongation. Each of these alignments provides an opportunity for hybridization between the single-stranded 3′-terminal nucleotides of the 16S rRNA and the spatially accessible window of mRNA sequence, from which a free energy value can be calculated. Using this algorithm we show that a periodic, energetic pattern of frequency 1/3 is revealed. This periodic signal exists in the majority of coding regions of eubacterial genes, but not in the non-coding regions encoding the 16S and 23S rRNAs. Signal analysis reveals that the population of coding regions of each bacterial species has a mean phase that is correlated in a statistically significant way with species (G+C) content. These results suggest that the periodic signal could function as a synchronization signal for the maintenance of reading frame and that codon usage provides a mechanism for manipulation of signal phase.
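Detecting a frequency-1/3 component and its phase is a standard periodogram exercise. A minimal numpy sketch under invented assumptions (the free-energy series below is synthetic noise plus a weak period-3 cosine, not the paper's hybridization energies):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-position free-energy series: a weak period-3 component
# (frequency 1/3) buried in noise, as described for coding regions.
n = 300
x = 0.5 * np.cos(2 * np.pi * np.arange(n) / 3 + 0.4) + rng.normal(0, 1.0, n)

coeffs = np.fft.rfft(x - x.mean())
power = np.abs(coeffs) ** 2
freqs = np.fft.rfftfreq(n)               # cycles per sequence position
k = int(np.argmax(power[1:])) + 1        # skip the DC bin
peak_freq = freqs[k]
peak_phase = float(np.angle(coeffs[k]))  # the per-gene "phase" statistic
print(peak_freq)
```

Averaging `peak_phase` over all coding regions of a species would give the population mean phase that the abstract correlates with (G+C) content.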

6.
Optimal performance of LC-MS/MS platforms is critical to generating high quality proteomics data. Although individual laboratories have developed quality control samples, there is no widely available performance standard of biological complexity (and associated reference data sets) for benchmarking of platform performance for analysis of complex biological proteomes across different laboratories in the community. Individual preparations of the yeast Saccharomyces cerevisiae proteome have been used extensively by laboratories in the proteomics community to characterize LC-MS platform performance. The yeast proteome is uniquely attractive as a performance standard because it is the most extensively characterized complex biological proteome and the only one associated with several large scale studies estimating the abundance of all detectable proteins. In this study, we describe a standard operating protocol for large scale production of the yeast performance standard and offer aliquots to the community through the National Institute of Standards and Technology, where the yeast proteome is under development as a certified reference material to meet the long term needs of the community. Using a series of metrics that characterize LC-MS performance, we provide a reference data set demonstrating typical performance of commonly used ion trap instrument platforms in expert laboratories; the results provide a basis for laboratories to benchmark their own performance, to improve upon current methods, and to evaluate new technologies. 
Additionally, we demonstrate how the yeast reference, spiked with human proteins, can be used to benchmark the power of proteomics platforms for detection of differentially expressed proteins at different levels of concentration in a complex matrix, thereby providing a metric to evaluate and minimize preanalytical and analytical variation in comparative proteomics experiments. Access to proteomics performance standards is essential for several reasons. First, to generate the highest quality data possible, proteomics laboratories routinely benchmark and perform quality control (QC) monitoring of the performance of their instrumentation using standards. Second, appropriate standards greatly facilitate the development of improvements in technologies by providing a timeless standard with which to evaluate new protocols or instruments that claim to improve performance. For example, it is common practice for an individual laboratory considering purchase of a new instrument to require the vendor to run “demo” samples so that data from the new instrument can be compared head to head with existing instruments in the laboratory. Third, large scale proteomics studies designed to aggregate data across laboratories can be facilitated by the use of a performance standard to measure reproducibility across sites or to compare the performance of different LC-MS configurations or sample processing protocols used between laboratories to facilitate development of optimized standard operating procedures (SOPs). Most individual laboratories have adopted their own QC standards, which range from mixtures of known synthetic peptides to digests of bovine serum albumin or more complex mixtures of several recombinant proteins (1). However, because each laboratory performs QC monitoring in isolation, it is difficult to compare the performance of LC-MS platforms throughout the community. Several standards for proteomics are available for request or purchase (2, 3). 
RM8327 is a mixture of three peptides developed as a reference material in collaboration between the National Institute of Standards and Technology (NIST) and the Association of Biomolecular Resource Facilities. Mixtures of 15–48 purified human proteins are also available, such as the HUPO (Human Proteome Organisation) Gold MS Protein Standard (Invitrogen), the Universal Proteomics Standard (UPS1; Sigma), and CRM470 from the European Union Institute for Reference Materials and Measurements. Although defined mixtures of peptides or proteins can address some benchmarking and QC needs, there is an additional need for more complex reference materials to fully represent the challenges of LC-MS data acquisition in complex matrices encountered in biological samples (2, 3). Although it has not been widely distributed as a reference material, the yeast Saccharomyces cerevisiae proteome has been extensively used by the proteomics community to characterize the capabilities of a variety of LC-MS-based approaches (4–15). Yeast provides a uniquely attractive complex performance standard for several reasons. Yeast encodes a complex proteome consisting of ∼4,500 proteins expressed during normal growth conditions (7, 16–18). The concentration range of yeast proteins is sufficient to challenge the dynamic range of conventional mass spectrometers; the abundance of proteins ranges from fewer than 50 to more than 10^6 molecules per cell (4, 15, 16). Additionally, it is the most extensively characterized complex biological proteome and the only one associated with several large scale studies estimating the abundance of all detectable proteins (5, 9, 16, 17, 19, 20) as well as LC-MS/MS data sets showing good correlation between LC-MS/MS detection efficiency and the protein abundance estimates (4, 11, 12, 15). Finally, it is inexpensive and easy to produce large quantities of yeast protein extract for distribution. In this study, we describe large scale production of a yeast S. 
cerevisiae performance standard, which we offer to the community through NIST. Through a series of interlaboratory studies, we created a reference data set characterizing the yeast performance standard and defining reasonable performance of ion trap-based LC-MS platforms in expert laboratories using a series of performance metrics. This publicly available data set provides a basis for additional laboratories using the yeast standard to benchmark their own performance as well as to improve upon the current status by evolving protocols, improving instrumentation, or developing new technologies. Finally, we demonstrate how the yeast performance standard, spiked with human proteins, can be used to benchmark the power of proteomics platforms for detection of differentially expressed proteins at different levels of concentration in a complex matrix.
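The spike-in benchmark idea reduces to a simple question: against an unchanged background proteome, can the platform flag the spiked proteins at a given concentration? A hedged numpy sketch with invented numbers (protein count, replicate count, noise level, and the 3x spike are all assumptions, not the study's design):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical log2 abundances: 100 yeast background proteins (unchanged
# between conditions) plus one human protein spiked at 3x into condition B.
n_prot, reps = 100, 4
base = rng.normal(20, 2, size=n_prot)
A = base[:, None] + rng.normal(0, 0.1, size=(n_prot, reps))
B = base[:, None] + rng.normal(0, 0.1, size=(n_prot, reps))
B[0] += np.log2(3)          # the spiked protein (~1.58 log2 units up)

def t_stat(a, b):
    # Per-protein two-sample t statistic (unequal-variance form).
    va = a.var(axis=1, ddof=1) / a.shape[1]
    vb = b.var(axis=1, ddof=1) / b.shape[1]
    return (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(va + vb)

t = t_stat(A, B)
hits = np.where(np.abs(t) > 10)[0]   # crude threshold, for illustration only
print(hits)
```

Repeating this at decreasing spike levels traces out the detection power of a platform, which is the metric the study proposes for comparing preanalytical and analytical variation.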

7.
Cross-linking/mass spectrometry resolves protein–protein interactions and protein folds with the help of distance constraints. Cross-linkers with specific properties, such as isotope-labeled or collision-induced dissociation (CID)-cleavable cross-linkers, are in frequent use to simplify the identification of cross-linked peptides. Here, we analyzed the mass spectrometric behavior of 910 unique cross-linked peptides in high-resolution MS1 and MS2 from published data, and validated the observations on a ninefold larger set from currently unpublished data, to explore whether a detailed understanding of their fragmentation behavior would allow computational delivery of information that otherwise would be obtained via isotope labels or CID cleavage of cross-linkers. Isotope-labeled cross-linkers reveal cross-linked and linear fragments in fragmentation spectra. We show that fragment mass and charge alone provide this information, alleviating the need for isotope labeling for this purpose. Isotope-labeled cross-linkers also indicate cross-linker-containing, albeit not specifically cross-linked, peptides in MS1. We observed that acquisition can be guided to enrich cross-linked peptides more than twofold, with minimal losses, based on peptide mass and charge alone. With the help of CID-cleavable cross-linkers, individual spectra with only linear fragments can be recorded for each peptide in a cross-link. We show that cross-linked fragments of ordinary cross-linked peptides can be linearized computationally and that a simplified subspectrum can be extracted that is enriched in information on one of the two linked peptides. This allows identifying candidates for this peptide in a simplified database search, as we propose in a search strategy here. 
We conclude that the specific behavior of cross-linked peptides in mass spectrometers can be exploited to relax the requirements on cross-linkers. Cross-linking/mass spectrometry extends the use of mass-spectrometry-based proteomics from identification (1, 2), quantification (3), and characterization of protein complexes (4) into resolving protein structures and protein–protein interactions (5–8). Chemical reagents (cross-linkers) covalently connect amino acid pairs that are within a cross-linker-specific distance range in the native three-dimensional structure of a protein or protein complex. A cross-linking/mass spectrometry experiment is typically conducted in four steps: (1) cross-linking of the target protein or complex, (2) protein digestion (usually with trypsin), (3) LC-MS analysis, and (4) database search. The digested peptide mixture consists of linear and cross-linked peptides, and the latter can be enriched by strong cation exchange (9) or size exclusion chromatography (10). Cross-linked peptides are of high value as they provide direct information on the structure and interactions of proteins. Cross-linked peptides fragment under collision-induced dissociation (CID) conditions primarily into b- and y-ions, as do their linear counterparts. An important difference regarding database searches between linear and cross-linked peptides stems from not knowing which peptides might be cross-linked. Therefore, one has to consider each single peptide and all pairwise combinations of peptides in the database. Having n peptides leads to (n^2 + n)/2 possible pairwise combinations. This leads to two major challenges: with increasing size of the database, both search time and the risk of identifying false positives increase. 
One way of circumventing these problems is to use MS2-cleavable cross-linkers (11, 12), at the cost of limited experimental design and choice of cross-linker. In a first database search approach (13), all pairwise combinations of peptides in a database were considered in a concatenated and linearized form. Thereby, all possible single-bond fragments are covered by one of the two database entries per peptide pair, and the cross-link can be identified by a normal protein identification algorithm. Already the second search approach split the peptides for the purpose of their identification (14). Linear fragments were used to retrieve candidate peptides from the database that are then matched based on the known mass of the cross-linked pair and scored as a pair against the spectrum. Isotope-labeled cross-linkers were used to sort the linear and cross-linked fragments apart. Many other search tools and approaches have been developed since (10, 15–19); see (20) for a more detailed list, at least some of which follow the general idea of an open modification search (21–24). As a general concept for the open modification search of cross-linked peptides, a cross-linked pair represents two peptides, each with an unknown modification given by the mass of the other peptide and the cross-linker. One identifies both peptides individually and then matches them based on knowing the mass of the cross-linked pair (14, 22, 24). Alternatively, one peptide is identified first and, using that peptide and the cross-linker as a modification mass, the second peptide is identified from the database (21, 23). An important element of the open modification search approach is that it essentially converts the quadratic search space of cross-linked peptides into a linear search space of modified peptides. 
Still, many peptides and many modification positions have to be considered, especially when working with large databases or when using highly reactive cross-linkers with limited amino acid selectivity (25). We hypothesize that detailed knowledge of the fragmentation behavior of cross-linked peptides might reveal ways to improve their identification. Detailed analyses of the fragmentation behavior of linear peptides exist (26–28), and the analysis of the fragmentation behavior of cross-linked peptides has guided the design of scores (24, 29). Further, cross-link-specific ions have been observed in higher energy collision dissociation (HCD) data (30). Isotope-labeled cross-linkers are used to distinguish cross-linked from linear fragments, generally in low-resolution MS2 of cross-linked peptides (14). We compared the mass spectrometric behavior of cross-linked peptides to that of linear peptides, using 910 high-resolution fragment spectra matched to unique cross-linked peptides from multiple different public datasets at 5% peptide-spectrum match (PSM) false discovery rate (FDR). In addition, we repeated all experiments with a larger sample set that contains 8,301 spectra, also including data from ongoing studies in our lab (Supplemental material S9–S12). This paper presents the mass spectrometric signature of cross-linked peptides that we identified in our analysis and the resulting heuristics that are incorporated into an integrated strategy for the analysis and identification of cross-linked peptides. We present computational strategies that indicate the possibility of alleviating the need for mass-spectrometrically restricted cross-linker choice.
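The quadratic search space and the pair-matching step can be made concrete with a toy sketch. The peptide masses and linker mass below are placeholders, not values for any real reagent or database; real tools score fragment spectra rather than just matching precursor masses.

```python
from itertools import combinations_with_replacement

# Toy database of peptide masses (hypothetical values) and an assumed
# cross-linker mass; real searches use full sequence databases.
peptides = {"PEPA": 800.40, "PEPB": 1200.60, "PEPC": 950.55, "PEPD": 1100.20}
LINKER = 158.00   # placeholder mass, not that of any specific reagent

def candidate_pairs(precursor, tol=0.02):
    """All peptide pairs whose summed mass plus the linker matches the
    cross-linked precursor: the (n^2 + n)/2 search space named in the text."""
    hits = []
    for (na, ma), (nb, mb) in combinations_with_replacement(peptides.items(), 2):
        if abs(ma + mb + LINKER - precursor) < tol:
            hits.append((na, nb))
    return hits

# The pair count matches the closed form (n^2 + n)/2.
n = len(peptides)
assert sum(1 for _ in combinations_with_replacement(peptides, 2)) == (n * n + n) // 2

print(candidate_pairs(800.40 + 1200.60 + LINKER))
```

An open modification search inverts this: identify one peptide from its linear fragments, then treat the remaining precursor mass as a modification, reducing the quadratic pairing problem to a linear lookup.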

9.
A complete understanding of the biological functions of large signaling peptides (>4 kDa) requires comprehensive characterization of their amino acid sequences and post-translational modifications, which presents significant analytical challenges. In the past decade, there has been great success with mass spectrometry-based de novo sequencing of small neuropeptides. However, these approaches are less applicable to larger neuropeptides because of the inefficient fragmentation of peptides larger than 4 kDa and their lower endogenous abundance. The conventional proteomics approach focuses on large-scale determination of protein identities via database searching, lacking the ability for in-depth elucidation of individual amino acid residues. Here, we present a multifaceted MS approach for identification and characterization of large crustacean hyperglycemic hormone (CHH)-family neuropeptides, a class of peptide hormones that play central roles in the regulation of many important physiological processes of crustaceans. Six crustacean CHH-family neuropeptides (8–9.5 kDa), including two novel peptides with extensive disulfide linkages and PTMs, were fully sequenced without reference to genomic databases. High-definition de novo sequencing was achieved by a combination of bottom-up, off-line top-down, and on-line top-down tandem MS methods. Statistical evaluation indicated that these methods provided complementary information for sequence interpretation and increased the local identification confidence of each amino acid. Further investigations by MALDI imaging MS mapped the spatial distribution and colocalization patterns of various CHH-family neuropeptides in the neuroendocrine organs, revealing that two CHH-subfamilies are involved in distinct signaling pathways. Neuropeptides and hormones comprise a diverse class of signaling molecules involved in numerous essential physiological processes, including analgesia, reward, food intake, learning and memory (1). 
Disorders of the neurosecretory and neuroendocrine systems influence many pathological processes. For example, obesity results from failure of energy homeostasis in association with endocrine alterations (2, 3). Previous work from our lab, using crustaceans as model organisms, found that multiple neuropeptides were implicated in control of food intake, including RFamides, tachykinin-related peptides, RYamides, and pyrokinins (4–6). Crustacean hyperglycemic hormone (CHH)-family neuropeptides play a central role in energy homeostasis of crustaceans (7–17). A hyperglycemic response to the CHHs was first reported after injection of crude eyestalk extract in crustaceans. Based on their preprohormone organization, the CHH family can be grouped into two subfamilies: subfamily-I containing CHH, and subfamily-II containing molt-inhibiting hormone (MIH) and mandibular organ-inhibiting hormone (MOIH). The preprohormones of subfamily-I have a CHH precursor-related peptide (CPRP) that is cleaved off during processing, whereas preprohormones of subfamily-II lack the CPRP (9). Uncovering their physiological functions will provide new insights into neuroendocrine regulation of energy homeostasis. Characterization of CHH-family neuropeptides is challenging. They are comprised of more than 70 amino acids and often contain multiple post-translational modifications (PTMs) and complex disulfide bridge connections (7). In addition, physiological concentrations of these peptide hormones are typically below the picomolar level, and most crustacean species do not have available genome and proteome databases to assist MS-based sequencing. MS-based neuropeptidomics provides a powerful tool for rapid discovery and analysis of a large number of endogenous peptides from the brain and the central nervous system. Our group and others have greatly expanded the peptidomes of many model organisms (3, 18–33). 
For example, we have discovered more than 200 neuropeptides, with several neuropeptide families consisting of as many as 20–40 members, in a simple crustacean model system (5, 6, 25–31, 34). However, the majority of these neuropeptides are small peptides 5–15 amino acid residues in length, leaving a gap in the identification of larger signaling peptides from organisms without sequenced genomes. The observed lack of larger peptide hormones can be attributed to the lack of effective de novo sequencing strategies for neuropeptides larger than 4 kDa, which are inherently more difficult to fragment using conventional techniques (34–37). Although classical proteomics studies examine larger proteins, these tools are limited to identification based on database searching, with one or more peptides matching but without complete amino acid sequence coverage (36, 38). Large populations of neuropeptides from 4–10 kDa exist in the nervous systems of both vertebrates and invertebrates (9, 39, 40). Understanding their functional roles requires sufficient molecular knowledge and a unique analytical approach. Therefore, developing effective and reliable methods for de novo sequencing of large neuropeptides at the individual amino acid residue level fills an urgent gap in neurobiology. In this study, we present a multifaceted MS strategy aimed at high-definition de novo sequencing and comprehensive characterization of the CHH-family neuropeptides in the crustacean central nervous system. The high-definition de novo sequencing was achieved by a combination of three methods: (1) enzymatic digestion and LC-tandem mass spectrometry (MS/MS) bottom-up analysis to generate detailed sequences of proteolytic peptides; (2) off-line LC fractionation and subsequent top-down MS/MS to obtain high-quality fragmentation maps of intact peptides; and (3) on-line LC coupled to top-down MS/MS to allow rapid sequence analysis of low abundance peptides. 
Combining the three methods overcomes the limitations of each, and thus offers complementary and high-confidence determination of amino acid residues. We report the complete sequence analysis of six CHH-family neuropeptides including the discovery of two novel peptides. With the accurate molecular information, MALDI imaging and ion mobility MS were conducted for the first time to explore their anatomical distribution and biochemical properties.
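The idea of raising per-residue confidence by combining methods can be sketched as set arithmetic over residue coverage. Everything below is invented toy data (the coverage sets and the two-of-three rule stand in for the paper's statistical evaluation, which is more sophisticated).

```python
# Hypothetical per-residue support from the three methods: each entry is
# the set of residue positions covered by matched fragments (toy data).
sequence_length = 20
coverage = {
    "bottom_up":       set(range(0, 12)),
    "offline_topdown": set(range(5, 20)),
    "online_topdown":  set(range(0, 8)) | set(range(15, 20)),
}

# Call a residue "high confidence" when at least two of the three methods
# cover it; residues seen by only one method remain tentative.
support = [sum(pos in cov for cov in coverage.values())
           for pos in range(sequence_length)]
high_conf = [pos for pos, s in enumerate(support) if s >= 2]
print(len(high_conf), "of", sequence_length, "residues supported by >= 2 methods")
```

This mirrors the complementarity argument: each method alone leaves stretches uncovered, while their combination supports most of the sequence.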

10.
A Boolean network is a model used to study the interactions between different genes in genetic regulatory networks. In this paper, we present several algorithms using gene ordering and feedback vertex sets to identify singleton attractors and small attractors in Boolean networks. We analyze the average-case time complexities of some of the proposed algorithms. For instance, it is shown that the outdegree-based ordering algorithm for finding singleton attractors works in O(c^n) average time for a constant c < 2 that depends on the maximum indegree K, which is much faster than the naive O(2^n)-time algorithm, where n is the number of genes. We performed extensive computational experiments on these algorithms, which resulted in good agreement with theoretical results. In contrast, we give a simple and complete proof that finding an attractor with the shortest period is NP-hard.
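The baseline the paper improves on is easy to state in code: a singleton attractor is a fixed point of the network's update map, and the naive algorithm checks all 2^n states. A minimal sketch with a made-up 3-gene network (the update functions are illustrative, not from the paper):

```python
from itertools import product

# Toy 3-gene Boolean network; each function maps the full state tuple to
# the gene's next value. A singleton attractor is a fixed point of the map.
functions = [
    lambda s: s[1] and not s[2],   # gene 0
    lambda s: s[0],                # gene 1
    lambda s: s[0] and s[1],       # gene 2
]

def step(state):
    return tuple(int(f(state)) for f in functions)

# Naive O(2^n) enumeration; the paper's algorithms prune this search
# using gene orderings and feedback vertex sets.
singletons = [s for s in product((0, 1), repeat=len(functions)) if step(s) == s]
print(singletons)
```

The gene-ordering idea prunes partial states as soon as an already-assigned gene is forced to flip, which is what yields the sub-2^n average-case bounds the abstract cites.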

11.
Insulin plays a central role in the regulation of vertebrate metabolism. The hormone, the post-translational product of a single-chain precursor, is a globular protein containing two chains, A (21 residues) and B (30 residues). Recent advances in human genetics have identified dominant mutations in the insulin gene causing permanent neonatal-onset DM2 (1–4). The mutations are predicted to block folding of the precursor in the ER of pancreatic β-cells. Although expression of the wild-type allele would in other circumstances be sufficient to maintain homeostasis, studies of a corresponding mouse model (5–7) suggest that the misfolded variant perturbs wild-type biosynthesis (8, 9). Impaired β-cell secretion is associated with ER stress, distorted organelle architecture, and cell death (10). These findings have renewed interest in insulin biosynthesis (11–13) and the structural basis of disulfide pairing (14–19).
Protein evolution is constrained not only by structure and function but also by susceptibility to toxic misfolding.

12.
Mitochondria play a central role in energy metabolism and cellular survival, and consequently mitochondrial dysfunction is associated with a number of human pathologies. Reversible protein phosphorylation is emerging as a central mechanism in the regulation of several mitochondrial processes. In skeletal muscle, mitochondrial dysfunction is linked to insulin resistance in humans with obesity and type 2 diabetes. We performed a phosphoproteomics study of functional mitochondria isolated from human muscle biopsies with the aim of obtaining a comprehensive overview of mitochondrial phosphoproteins. Combining an efficient mitochondrial isolation protocol with several different phosphopeptide enrichment techniques and LC-MS/MS, we identified 155 distinct phosphorylation sites in 77 mitochondrial phosphoproteins, including 116 phosphoserine, 23 phosphothreonine, and 16 phosphotyrosine residues. The relatively high number of phosphotyrosine residues suggests an important role for tyrosine phosphorylation in mitochondrial signaling. Many of the mitochondrial phosphoproteins are involved in oxidative phosphorylation, the tricarboxylic acid cycle, and lipid metabolism, i.e. processes proposed to be involved in insulin resistance. We also assigned phosphorylation sites in mitochondrial proteins involved in amino acid degradation, importers and transporters, calcium homeostasis, and apoptosis. Bioinformatics analysis of kinase motifs revealed that many of these mitochondrial phosphoproteins are substrates for protein kinase A, protein kinase C, casein kinase II, and DNA-dependent protein kinase. Our results demonstrate the feasibility of performing phosphoproteome analysis of organelles isolated from human tissue and provide novel targets for functional studies of reversible phosphorylation in mitochondria.
Future comparative phosphoproteome analysis of mitochondria from healthy and diseased individuals will provide insights into the role of abnormal phosphorylation in pathologies such as type 2 diabetes.

Mitochondria are the primary energy-generating systems in eukaryotes. They play a crucial role in oxidative metabolism, including carbohydrate metabolism, fatty acid oxidation, and the urea cycle, as well as in calcium signaling and apoptosis (1, 2). Mitochondrial dysfunction is centrally involved in a number of human pathologies, such as type 2 diabetes, Parkinson disease, and cancer (3). The most prevalent form of cellular protein post-translational modification (PTM), reversible phosphorylation (4–6), is emerging as a central mechanism in the regulation of mitochondrial functions (7, 8). The steadily increasing numbers of reported mitochondrial kinases, phosphatases, and phosphoproteins imply an important role of protein phosphorylation in different mitochondrial processes (9–11).

Mass spectrometry (MS)-based proteome analysis is a powerful tool for global profiling of proteins and their PTMs, including protein phosphorylation (12, 13). A variety of proteomics techniques have been developed for specific enrichment of phosphorylated proteins and peptides and for phosphopeptide-specific data acquisition at the MS level (14). Enrichment methods based on affinity chromatography, such as titanium dioxide (TiO2) (15–17), zwitterionic hydrophilic interaction chromatography (ZIC-HILIC) (18), immobilized metal affinity chromatography (IMAC) (19, 20), and ion exchange chromatography (strong anion exchange and strong cation exchange) (21, 22), have shown high efficiencies for enrichment of phosphopeptides (14). Recently, we demonstrated that calcium phosphate precipitation (CPP) is highly effective for enriching phosphopeptides (23).
It is now generally accepted that no single method is comprehensive; rather, combinations of different enrichment methods produce distinct, partially overlapping phosphopeptide data sets that enhance the overall results of phosphoproteome analysis (24, 25). Phosphopeptide sequencing by mass spectrometry has seen tremendous advances during the last decade (26). For example, MS/MS product ion scanning, multistage activation, and precursor ion scanning are effective methods for identifying serine (Ser)-, threonine (Thr)-, and tyrosine (Tyr)-phosphorylated peptides (14, 26).

A "complete" mammalian mitochondrial proteome was reported by Mootha and co-workers (27) and included 1098 proteins. The mitochondrial phosphoproteome has been characterized in a series of studies covering yeast, mouse and rat liver, porcine heart, and plants (19, 28–31). To date, the largest data set, by Deng et al. (30), identified 228 different phosphoproteins and 447 phosphorylation sites in rat liver mitochondria. However, the in vivo phosphoproteome of human mitochondria has not been determined. A comprehensive mitochondrial phosphoproteome is warranted for further elucidation of the largely unknown mechanisms by which protein phosphorylation modulates diverse mitochondrial functions.

The percutaneous muscle biopsy technique is an important tool in the diagnosis and management of human muscle disorders and has been widely used to investigate metabolism and various cellular and molecular processes in normal and abnormal human muscle, in particular the molecular mechanisms underlying insulin resistance in obesity and type 2 diabetes (32). Skeletal muscle is rich in mitochondria and hence a good source for comprehensive proteomics and functional analysis of mitochondria (32, 33).

The major aim of the present study was to obtain a comprehensive overview of site-specific phosphorylation of mitochondrial proteins in functionally intact mitochondria isolated from human skeletal muscle.
Combining an efficient protocol for isolation of skeletal muscle mitochondria with several different state-of-the-art phosphopeptide enrichment methods and high-performance LC-MS/MS, we identified 155 distinct phosphorylation sites in 77 mitochondrial phosphoproteins, many of which have not been reported before. We characterized this mitochondrial phosphoproteome using bioinformatics tools to classify the phosphoproteins by functional group and to identify kinase substrate motifs.
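The kinase motif analysis described in this abstract amounts to scanning the sequence window around each phosphosite against kinase consensus patterns. The sketch below shows that idea with two deliberately simplified consensus motifs; the regex patterns and the `match_kinases` helper are illustrative stand-ins, not the pipeline used in the study (real motif scoring typically uses position-specific matrices).

```python
import re

# Simplified, illustrative kinase consensus motifs (assumed for this
# example; production pipelines use position-specific scoring matrices)
MOTIFS = {
    "PKA": re.compile(r"[RK][RK].[ST]"),  # basophilic: R/K-R/K-x-S/T
    "CK2": re.compile(r"[ST]..[DE]"),     # acidophilic: S/T-x-x-D/E
}

def match_kinases(window):
    """Return the kinases whose consensus motif matches the sequence
    window surrounding a phosphosite."""
    return [kinase for kinase, pattern in MOTIFS.items()
            if pattern.search(window)]

print(match_kinases("LRRASLG"))  # kemptide-like PKA site -> ['PKA']
```

Each phosphosite is represented by a short flanking-sequence window; a site can match several kinases, which is why the function returns a list.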

13.
In this study, we present a fully automated tool, called IDEAL-Q, for label-free quantitation analysis. It accepts raw data in the standard mzXML format as well as search results from major search engines, including Mascot, SEQUEST, and X!Tandem, as input. To quantify as many identified peptides as possible, IDEAL-Q uses an efficient algorithm to predict the elution time of a peptide that is unidentified in a specific LC-MS/MS run but identified in other runs. The predicted elution time is then used to detect peak clusters of the assigned peptide. Detected peptide peaks are processed by statistical and computational methods and further validated by signal-to-noise ratio, charge state, and isotopic distribution criteria (SCI validation) to filter out noisy data. The performance of IDEAL-Q has been evaluated in several experiments. First, a serially diluted protein mixed with Escherichia coli lysate showed a high correlation with expected ratios and demonstrated good linearity (R2 = 0.996). Second, in a biological replicate experiment on THP-1 cell lysate, IDEAL-Q quantified 87% (1,672 peptides) of all identified peptides, surpassing the 45.7% (909 peptides) achieved by the conventional identity-based approach, which only quantifies peptides identified in all LC-MS/MS runs. Manual validation of all 11,940 peptide ions in six replicate LC-MS/MS runs revealed that 97.8% of the peptide ions were correctly aligned and 93.3% were correctly validated by SCI. The mean protein ratio of 1.00 ± 0.05 demonstrates the high accuracy of IDEAL-Q without human intervention. Finally, IDEAL-Q was applied again to the biological replicate experiment, but with an additional SDS-PAGE step, to show its compatibility with label-free experiments that include fractionation. For flexible workflow design, IDEAL-Q supports different fractionation strategies and various normalization schemes, including multiple spiked internal standards.
User-friendly interfaces are provided to facilitate convenient inspection, validation, and modification of quantitation results. In summary, IDEAL-Q is an efficient, user-friendly, and robust quantitation tool, and it is available for download.

Quantitative analysis of protein expression promises to provide fundamental understanding of biological changes and to enable biomarker discovery in clinical applications. In recent years, various stable isotope labeling techniques, e.g. ICAT (1), enzymatic labeling using 18O/16O (2, 3), stable isotope labeling by amino acids in cell culture (4), and isobaric tagging for relative and absolute quantitation (2, 5), coupled with LC-MS/MS, have been widely used for large-scale quantitative proteomics. However, several factors, such as the limited number of samples, the complexity of isotopic labeling procedures, and the high cost of reagents, limit the applicability of isotopic labeling techniques to high-throughput analysis. Unlike the labeling approaches, the label-free quantitation approach quantifies protein expression across multiple LC-MS/MS analyses directly, without using any labeling technique (7–9). Thus, it is particularly useful for analyzing clinical specimens in highly multiplexed quantitation (10, 11); theoretically, it can be used to compare any number of samples. Despite these significant advantages, data analysis in label-free experiments is an intractable problem because of the experimental procedures. First, although high reproducibility in LC is considered a critical prerequisite, variations such as the aging of separation columns, changes in sample buffers, and fluctuations in temperature cause chromatographic shifts in the retention times of analytes in different LC-MS/MS runs and thus complicate the analysis.
In addition, under the label-free approach, many technical replicate analyses across a large number of samples are often acquired; comparing the resulting large number of data files further complicates data analysis and yields lower quantitation accuracy than labeling methods. Hence, an accurate, automated computational tool is required to effectively solve the problem of chromatographic shift, analyze large amounts of experimental data, and provide convenient user interfaces for manual validation of quantitation results.

The rapid emergence of new label-free techniques for biomarker discovery has inspired the development of a number of bioinformatics tools in recent years. For example, Scaffold (Proteome Software) and Census (12) process PepXML search results to quantify relative protein expression based on spectral counting (13–15), which uses the number of MS/MS spectra assigned to a protein to determine the relative protein amount. Spectral counting has demonstrated a high correlation with protein abundance; however, achieving good quantitation accuracy with the technique requires high-speed MS/MS data acquisition. Moreover, manipulations of the exclusion/inclusion strategy also significantly affect the accuracy of spectral counting. Because peptide-level quantitation is also important for post-translational modification studies, the accuracy of spectral counting at the peptide level deserves further study.

Another type of quantitation analysis determines peptide abundance from MS1 peak signals. According to several studies, MS1 peak signals across different LC-MS/MS runs can be highly reproducible and correlate well with protein abundance in complex biological samples (7–9). Quantitation analysis methods based on MS1 peak signals can be classified into three categories: identity-based, pattern-based, and hybrid methods (16).
Identity-based methods (7–9) depend on the results of MS/MS sequencing to identify and detect peptide signals in MS1 data. However, because the data acquisition speed of MS scanning is limited, a considerable number of low-abundance peptides may never be selected for MS/MS sequencing. Only a few peptides can be repetitively identified in all LC-MS/MS runs and subsequently quantified; thus, only a small fraction of identified peptides are quantified, resulting in a small number of quantifiable peptides/proteins.

In contrast to identity-based methods, pattern-based methods (17–23), including the publicly available MSight (20), MZmine (21, 22), and msInspect (23), attempt to quantify all peptide peaks in MS1 data to increase the number of quantifiable peptides. These methods first detect all peaks in each MS1 data set and then align the detected peaks across different LC-MS/MS runs. In pattern-based methods, however, efficient detection and alignment of the peaks between each pair of LC-MS/MS runs is a major challenge. To align the peaks, several methods based on dynamic programming or image pattern recognition have been proposed (24–26). The algorithms applied in these methods require intensive computation, and their computation time increases dramatically with the number of compared samples because all the LC-MS/MS runs must be processed. Pattern-based approaches are therefore infeasible for processing large numbers of samples. Furthermore, pattern recognition algorithms may fail on data containing noise or overlapping peptide signals (i.e. co-eluting peptides). The hybrid quantitation approach (16, 27–30) combines a pattern recognition algorithm with peptide identification results to align shifted peptides for quantitation. The pioneering accurate mass and time tag strategy (27) takes advantage of very sensitive, highly mass-accurate instruments with a wide dynamic range, e.g. FTICR-MS and TOF-MS, for quantitation analysis.
PEPPeR (16) and SuperHirn (28) apply pattern recognition algorithms to align peaks and use the peptide identification results as landmarks to improve the alignment. However, because these methods still align all peaks in MS1 data, they suffer the same computation time problem as pattern-based methods.

To resolve the computation-intensive problem in the hybrid approach, we present a fully automated software system, called IDEAL-Q, for label-free quantitation, including differential protein expression and protein modification analysis. Instead of using computation-intensive pattern recognition methods, IDEAL-Q uses a computation-efficient fragmental regression method for identity-based alignment of all confidently identified peptides in a local elution time domain. It then performs peptide cross-assignment by mapping predicted elution time profiles across multiple LC-MS experiments. To improve quantitation accuracy, IDEAL-Q applies three validation criteria to the detected peptide peak clusters to filter out noisy signals, false peptide peak clusters, and co-eluting peaks. Because of these key features, i.e. fragmental regression and stringent validation, IDEAL-Q can substantially increase the number of quantifiable proteins as well as the quantitation accuracy compared with other extracted ion chromatogram (XIC)-based tools. Notably, to accommodate different experimental designs, IDEAL-Q supports various built-in normalization procedures, including normalization based on multiple internal standards, to eliminate systematic biases. It also adapts to different fractionation strategies for in-depth proteomics profiling.

We evaluated the performance of IDEAL-Q on three levels: 1) quantitation of a standard protein mixture, 2) large-scale proteome quantitation using replicate cell lysates, and 3) proteome-scale quantitative analysis of protein expression incorporating an additional fractionation step.
We demonstrated that IDEAL-Q can quantify up to 89% of identified proteins (703 proteins) in the replicate THP-1 cell lysate. Moreover, by manual validation of all 11,940 peptide ions corresponding to 1,990 identified peptides, 93% of peptide ions were shown to be accurately quantified. In another experiment, on replicate data containing large chromatographic shifts obtained from two independent LC-MS/MS instruments, IDEAL-Q demonstrated robust quantitation and the ability to rectify such shifts. Finally, we applied IDEAL-Q to the THP-1 replicate experiment with an additional SDS-PAGE fractionation step. Equipped with user-friendly visualization interfaces and convenient data output for publication, IDEAL-Q represents a generic, robust, and comprehensive tool for label-free quantitative proteomics.
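The core of IDEAL-Q's cross-run alignment is predicting where a peptide identified in one run elutes in another. A minimal sketch of that idea, using a local linear fit over nearby co-identified "anchor" peptides, is shown below; the function name, the k-nearest-anchor neighborhood, and the synthetic retention times are assumptions for illustration, not the published fragmental regression algorithm.

```python
def predict_rt(anchors, rt_a, k=10):
    """Predict the retention time in run B of a peptide observed at
    rt_a in run A, by fitting a line to the k anchor peptides
    (identified in both runs) closest to rt_a. A simplified stand-in
    for IDEAL-Q's fragmental regression, not the published algorithm."""
    nearest = sorted(anchors, key=lambda p: abs(p[0] - rt_a))[:k]
    xs = [p[0] for p in nearest]
    ys = [p[1] for p in nearest]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in nearest)
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y + slope * (rt_a - mean_x)

# Synthetic runs: run B elutes everything 0.5 min later than run A
anchors = [(t, t + 0.5) for t in range(10, 61)]
print(predict_rt(anchors, 32.2))  # ~32.7
```

Fitting only over a local neighborhood is what lets this handle drift that varies along the gradient, which a single global regression line cannot.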

14.
Given the ease of whole genome sequencing with next-generation sequencers, structural and functional gene annotation is now almost purely based on automated prediction. However, errors in gene structure are frequent, with the correct determination of start codons being one of the main concerns. Here, we combine derivatization of protein N termini using (N-Succinimidyloxycarbonylmethyl)tris(2,4,6-trimethoxyphenyl)phosphonium bromide (TMPP Ac-OSu) as a labeling reagent with the COmbined FRActional DIagonal Chromatography (COFRADIC) sorting method to enrich labeled N-terminal peptides for mass spectrometry detection. Protein digestion was performed in parallel with three proteases to obtain reliable automatic validation of protein N termini. The analysis of these N-terminal-enriched fractions by high-resolution tandem mass spectrometry allowed the annotation refinement of 534 proteins of the model marine bacterium Roseobacter denitrificans OCh114. This study is especially efficient with regard to mass spectrometry analytical time. Of the 534 validated N termini, 480 confirmed existing gene annotations, 41 highlighted erroneous start codon annotations, and five revealed entirely new, previously mis-annotated genes; the mass spectrometry data also suggested the existence of multiple start sites for eight different genes, a result that challenges the current view of protein translation initiation. Finally, we identified several proteins for which classical genome homology-driven annotation was inconsistent, questioning the validity of automatic annotation pipelines and emphasizing the need for complementary proteomic data. All data have been deposited to ProteomeXchange with identifier PXD000337.

Recent developments in mass spectrometry and bioinformatics have established proteomics as a common and powerful technique for identifying and quantifying proteins at a very broad scale, but also for characterizing their post-translational modifications and interaction networks (1, 2).
In addition to the avalanche of proteomic data currently being reported, many genome sequences are being established using next-generation sequencing, fostering proteomic investigations of new cellular models. Proteogenomics is a relatively recent field in which high-throughput proteomic data are used to verify coding regions within model genomes and refine the annotation of their sequences (2–8). Because genome annotation is now fully automated, accurate annotation of model organisms with experimental data is crucial. Many proteomics-assisted genome re-annotation projects for microorganisms have been reported recently, such as for Mycoplasma pneumoniae (9), Rhodopseudomonas palustris (10), Shewanella oneidensis (11), Thermococcus gammatolerans (12), Deinococcus deserti (13), Salmonella typhimurium (14), Mycobacterium tuberculosis (15, 16), Shigella flexneri (17), Ruegeria pomeroyi (18), and Candida glabrata (19), as well as for higher organisms such as Anopheles gambiae (20) and Arabidopsis thaliana (4, 5).

The most frequently reported problem in automatic annotation systems is the correct identification of the translational start codon (21–23). The error rate depends on the primary annotation system, but also on the organism, as reported for Halobacterium salinarum and Natronomonas pharaonis (24), Deinococcus deserti (21), and Ruegeria pomeroyi (18), where the error rate is estimated at above 10%. Identification of the correct translational start site is essential for the genetic and biochemical analysis of a protein because errors can seriously impact subsequent biological studies. If the N terminus is not correctly identified, the protein will be considered in either a truncated or an extended form, leading to errors in bioinformatic analyses (e.g. prediction of its molecular weight, isoelectric point, or cellular localization) and major difficulties during its experimental characterization.
For example, a truncated protein may be heterologously produced as an unfolded polypeptide recalcitrant to structure determination (25). Moreover, N-terminal modifications, which are poorly documented in annotation databases, may occur (26, 27).

Unfortunately, the poor polypeptide sequence coverage obtained for the numerous low-abundance proteins in current shotgun MS/MS proteomic studies means that the overall detection of N-terminal peptides in proteogenomic studies is relatively low. Different methods for establishing the most extensive list of protein N termini, grouped under the so-called "N-terminomics" theme, have been proposed to selectively enrich these peptides or improve their detection (2, 28, 29). Large N-terminome studies have recently been reported based on resin-assisted enrichment of N-terminal peptides (30) or terminal amine isotopic labeling of substrates (TAILS) coupled to depletion of internal peptides with a water-soluble aldehyde-functionalized polymer (31–35). Among the numerous N-terminal-oriented methods (2), specific labeling of the N terminus of intact proteins with N-tris(2,4,6-trimethoxyphenyl)phosphonium acetyl succinamide (TMPP-Ac-OSu) has proven reliable (21, 36–39). TMPP-derivatized N-terminal peptides have properties advantageous for LC-MS/MS analysis: (1) increased hydrophobicity, because of the trimethoxyphenyl moiety added to the peptides, which increases their retention times in reverse-phase chromatography; (2) improved ionization, because of the introduction of a positively charged group; and (3) a much simpler fragmentation pattern in tandem mass spectrometry.
Other reported approaches rely on acetylation, followed by trypsin digestion and then biotinylation of free amino groups (40); guanidination of lysine lateral chains followed by N-biotinylation of the N termini and trypsin digestion (41); or reductive amination of all free amino groups with formaldehyde preceding trypsin digestion (42). Recently, we applied the TMPP method to the proteome of the bacterium Deinococcus deserti, isolated from upper sand layers of the Sahara desert (13). The method enabled the detection of N-terminal peptides allowing the confirmation of 278 translation initiation codons, the correction of 73 translation starts, and the identification of non-canonical translation initiation codons (21). However, most TMPP-labeled N-terminal peptides are hidden among the more abundant internal peptides generated by proteolysis of a complex proteome, precluding their detection. This results in disproportionately few N-terminal validations: 5% and 8% of the total polypeptides encoded in the theoretical proteomes of Mycobacterium smegmatis (37) and Deinococcus deserti (21), with totals of 342 and 278 validations, respectively.

An interesting chromatographic method to fractionate peptide mixtures for gel-free high-throughput proteome analysis has been developed over the last several years and applied to various problems (43, 44). This technique, known as COmbined FRActional DIagonal Chromatography (COFRADIC), uses a double chromatographic separation with a chemical reaction in between to change the physico-chemical properties of the extraneous peptides so that they can be resolved from the peptides of interest. Its previous applications include the separation of methionine-containing peptides (43), N-terminal peptide enrichment (45, 46), sulfur amino acid-containing peptides (47), and phosphorylated peptides (48).
COFRADIC was identified as the best method for identification of N-terminal peptides of two archaea, resulting in the identification of 240 polypeptides (9% of the theoretical proteome) for Halobacterium salinarum and 220 (8%) for Natronomonas pharaonis (24).

Taking advantage of the specificity of TMPP labeling, the resolving power of COFRADIC for enrichment, and the increased information gained from multiple proteases, we performed a proteogenomic analysis of a marine bacterium from the Roseobacter clade, Roseobacter denitrificans OCh114. This approach allowed us to validate or correct 534 unique proteins (13% of the theoretical proteome) with TMPP-labeled N-terminal signatures obtained by high-resolution tandem mass spectrometry. We corrected 41 annotations and detected five new open reading frames in the R. denitrificans genome. We further identified eight distinct proteins showing direct evidence of multiple start sites.

15.
Mathematical tools developed in the context of Shannon information theory were used to analyze the meaning of the BLOSUM score, which was split into three components termed the BLOSUM spectrum (or BLOSpectrum). These relate, respectively, to the sequence convergence (the stochastic similarity of the two protein sequences), to the background frequency divergence (the typicality of the amino acid probability distribution in each sequence), and to the target frequency divergence (the compliance of the amino acid variations between the two sequences with the protein model implicit in the BLOCKS database). This treatment sharpens protein sequence comparison, provides a rationale for the biological significance of the obtained score, and helps to identify weakly related sequences. Moreover, the BLOSpectrum can guide the choice of the most appropriate scoring matrix, tailoring it to the evolutionary divergence associated with the two sequences, or indicate whether a compositionally adjusted matrix could perform better.
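Two of the three BLOSpectrum components are divergences between amino acid frequency distributions. The sketch below computes a relative entropy of that flavor on a toy four-letter alphabet; the exact component definitions follow the paper and are not reproduced here, and the toy frequencies are invented.

```python
from math import log2

def kl_divergence(observed, background):
    """Relative entropy D(observed || background) in bits: how atypical
    a sequence's amino acid composition is relative to background
    frequencies. The BLOSpectrum's frequency-divergence components are
    divergences of this flavor (exact definitions are in the paper)."""
    return sum(f * log2(f / p)
               for f, p in zip(observed, background) if f > 0)

# Toy 4-letter alphabet: a strongly skewed composition vs. uniform
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(round(kl_divergence(skewed, uniform), 3))  # 0.643 bits
```

A divergence of zero means the composition is perfectly typical of the background; larger values flag compositionally biased sequences, for which a compositionally adjusted matrix may score more faithfully.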

16.
17.
18.
The combination of chemical cross-linking and mass spectrometry has recently been shown to constitute a powerful tool for studying protein–protein interactions and elucidating the structure of large protein complexes. However, computational methods for interpreting the complex MS/MS spectra of linked peptides are still in their infancy, making high-throughput application of this approach largely impractical. Because of the lack of large annotated datasets, most current approaches do not capture the specific fragmentation patterns of linked peptides and therefore are not optimal for the identification of cross-linked peptides. Here we propose a generic approach to address this problem and demonstrate it using disulfide-bridged peptide libraries to (i) efficiently generate large mass spectral reference data for linked peptides at low cost and (ii) automatically train an algorithm that can efficiently and accurately identify linked peptides from MS/MS spectra. We show that with this approach we were able to identify thousands of MS/MS spectra from disulfide-bridged peptides through comparison with proteome-scale sequence databases and to significantly improve the sensitivity of cross-linked peptide identification. This allowed us to identify 60% more direct pairwise interactions between the protein subunits in the 20S proteasome complex than existing tools in cross-linking studies of proteasome complexes. The basic framework of this approach and the MS/MS reference dataset generated should be valuable resources for the future development of new tools for the identification of linked peptides.

The study of protein–protein interactions is crucial to understanding how cellular systems function because proteins act in concert through a highly organized set of interactions. Most cellular processes are carried out by large macromolecular assemblies and regulated through complex cascades of transient protein–protein interactions (1).
In the past several years numerous high-throughput studies have pioneered the systematic characterization of protein–protein interactions in model organisms (2–4). Such studies mainly utilize two techniques: the yeast two-hybrid system, which aims at identifying binary interactions (5), and affinity purification combined with tandem mass spectrometry analysis for the identification of multi-protein assemblies (6–8). Together these have led to a rapid expansion of known protein–protein interactions in human and other model organisms; a recent estimate puts the number of interactions catalogued to date at more than one million (9).

But despite rapid progress, most current techniques determine only whether proteins interact, which is only the first step toward understanding how they interact. A more complete picture comes from characterizing the three-dimensional structures of protein complexes, which provide mechanistic insights into how interactions occur and into the high specificity observed inside the cell. Traditionally, the gold-standard methods for solving protein structures are x-ray crystallography and NMR, and there have been several efforts, similar to structural genomics (10), aiming to comprehensively solve the structures of protein complexes (11, 12). Although growth in the number of structures of protein monomers in the Protein Data Bank has accelerated in recent years (11), the growth for protein complexes has remained comparatively small (9). Many factors, including their large size, transient nature, and dynamic interactions, have prevented many complexes from being solved via traditional structural biology approaches.
Thus, the development of complementary analytical techniques with which to probe the structure of large protein complexes continues to evolve (13–18).

Recent developments have advanced the analysis of protein structure and interaction by combining cross-linking with tandem mass spectrometry (17, 19–24). The basic idea behind this technique is to capture and identify pairs of amino acid residues that are spatially close to each other. When these linked pairs of residues are from the same protein (intraprotein cross-links), they provide distance constraints that help one infer the possible conformations of protein structures. Conversely, when pairs of residues come from different proteins (interprotein cross-links), they provide information about how proteins interact with one another. Although cross-linking strategies date back almost a decade (25, 26), the difficulty of analyzing the complex MS/MS spectra generated from linked peptides made this approach challenging, and therefore it was not widely used. With recent advances in mass spectrometry instrumentation, there has been renewed interest in employing this strategy to determine protein structures and identify protein–protein interactions. However, most studies thus far have focused on purified protein complexes. With today's mass spectrometers capable of analyzing tens of thousands of spectra in a single experiment, it is now potentially feasible to extend this approach to the analysis of complex biological samples. Researchers have tried to realize this goal using both experimental and computational approaches. Indeed, a plethora of chemical cross-linking reagents are now available for stabilizing these complexes, and some are designed to allow easier peptide identification when employed in concert with MS analysis (20, 27, 28). There have also been several recent efforts to develop computational methods for the automatic identification of linked peptides from MS/MS spectra (29–36).
However, because large annotated training sets are lacking, most approaches to date either borrow fragmentation models learned from unlinked, linear peptides or learn fragmentation statistics from training data of limited size (30, 37), which may not generalize well across different samples. In some cases it is possible to generate relatively large training data, but doing so is very labor intensive and can involve hundreds of separate LC-MS/MS runs (36). Here, using disulfide-bridged peptides as an example, we propose a novel method that uses a combinatorial peptide library to (a) efficiently generate a large mass spectral reference dataset for linked peptides and (b) automatically train our new algorithm, MXDB, which can efficiently and accurately identify linked peptides from MS/MS spectra.  相似文献   
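As a worked example of the mass bookkeeping behind identifying disulfide-bridged peptide pairs: the precursor mass of the pair is the sum of the two linear peptide masses minus two hydrogens lost when the S–S bond forms. The residue masses below are standard monoisotopic values; the peptide sequences are hypothetical.

```python
# Monoisotopic residue masses (Da) for a few amino acids, plus the mass
# of water (peptide termini) and of a hydrogen atom.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "C": 103.00919,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111}
WATER = 18.01056
HYDROGEN = 1.00783

def peptide_mass(seq):
    """Monoisotopic mass of a linear peptide."""
    return sum(RESIDUE[aa] for aa in seq) + WATER

def disulfide_pair_mass(seq1, seq2):
    """Mass of two peptides joined by a disulfide bridge: forming the
    S-S bond releases two hydrogen atoms."""
    return peptide_mass(seq1) + peptide_mass(seq2) - 2 * HYDROGEN

# Hypothetical cysteine-containing peptides from a combinatorial library.
print(round(disulfide_pair_mass("ACGK", "SCVR"), 4))
```

Matching observed precursor masses against such computed pair masses is only the first filtering step; the MS/MS fragment spectrum is then needed to confirm which pair was actually linked.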

19.
Liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based proteomics provides a wealth of information about the proteins present in biological samples. In bottom-up LC-MS/MS-based proteomics, proteins are enzymatically digested into peptides prior to query by LC-MS/MS. Thus, the information directly available from the LC-MS/MS data is at the peptide level. If a protein-level analysis is desired, the peptide-level information must be rolled up into protein-level information. We propose a principal component analysis-based statistical method, ProPCA, for efficiently estimating relative protein abundance from bottom-up label-free LC-MS/MS data that incorporates both spectral count information and LC-MS peptide ion peak attributes, such as peak area, volume, or height. ProPCA may be used effectively with a variety of quantification platforms and is easily implemented. We show that ProPCA outperformed existing quantitative methods for peptide-protein roll-up, including spectral counting methods and other methods for combining LC-MS peptide peak attributes. The performance of ProPCA was validated using a data set derived from the LC-MS/MS analysis of a mixture of protein standards (the UPS2 proteomic dynamic range standard introduced by The Association of Biomolecular Resource Facilities Proteomics Standards Research Group in 2006). Finally, we applied ProPCA to a comparative LC-MS/MS analysis of digested total cell lysates prepared for LC-MS/MS analysis by alternative lysis methods and show that ProPCA identified more differentially abundant proteins than competing methods. One of the fundamental goals of proteomics in the biological sciences is to identify and quantify all proteins present in a sample. LC-MS/MS-based proteomics methodologies offer a promising approach to this problem (1–3). These methodologies allow for the acquisition of a vast amount of information about the proteins present in a sample. 
However, extracting reliable protein abundance information from LC-MS/MS data remains challenging. In this work, we were primarily concerned with the analysis of data acquired using bottom-up label-free LC-MS/MS-based proteomics techniques, where “bottom-up” refers to the fact that proteins are enzymatically digested into peptides prior to query by the LC-MS/MS instrument platform (4), and “label-free” indicates that analyses are performed without the aid of stable isotope labels. One challenge inherent in the bottom-up approach is that the information directly available from the LC-MS/MS data is at the peptide level. When a protein-level analysis is desired, as is often the case in discovery-driven LC-MS research, the peptide-level information must be rolled up into protein-level information. Spectral counting (5–10) is a straightforward and widely used example of peptide-protein roll-up for LC-MS/MS data. Information experimentally acquired in single-stage (MS) and tandem (MS/MS) spectra may lead to the assignment of MS/MS spectra to peptide sequences, in a database-driven or database-free manner, using various peptide identification software platforms (SEQUEST (11) and Mascot (12), for instance); the identified peptide sequences correspond, in turn, to proteins. In principle, the number of tandem spectra matched to peptides corresponding to a given protein, the spectral count (SC), is positively associated with the abundance of that protein (5). In spectral counting techniques, raw or normalized SCs are used as a surrogate for protein abundance. Spectral counting methods have been moderately successful in quantifying protein abundance and identifying significant proteins in various settings. 
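The spectral counting roll-up just described can be sketched in a few lines. The sketch below also computes a normalized spectral abundance factor (NSAF), one common length-normalized variant; the peptide-spectrum assignments and protein lengths are hypothetical.

```python
from collections import Counter

# Hypothetical peptide-spectrum matches: each MS/MS spectrum was matched
# to a peptide belonging to one of these proteins.
psm_proteins = ["P1", "P1", "P2", "P1", "P3", "P2", "P1", "P3", "P3", "P3"]

# Protein lengths (residues), used for normalization: longer proteins
# yield more peptides and hence more spectra at equal abundance.
lengths = {"P1": 450, "P2": 120, "P3": 300}

counts = Counter(psm_proteins)                     # raw spectral counts
saf = {p: counts[p] / lengths[p] for p in counts}  # length-normalized
total = sum(saf.values())
nsaf = {p: v / total for p, v in saf.items()}      # sums to 1 over proteins

print(counts, nsaf)
```

Note how P2, with fewer raw spectra than P1, ends up with the highest NSAF because it is a much shorter protein, which is exactly the bias the normalization is meant to correct.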
However, SC-based methods do not make full use of the information available from peaks in the LC-MS domain, and this surely leads to a loss of efficiency. Peaks in the LC-MS domain corresponding to peptide ion species are highly sensitive to differences in protein abundance (13, 14). Identifying LC-MS peaks that correspond to detected peptides and measuring quantitative attributes of these peaks (such as height, area, or volume) offers a promising alternative to spectral counting. These methods have become especially popular in applications using stable isotope labeling (15). However, challenges remain, especially in the label-free analysis of complex proteomics samples, where complications in peak detection, alignment, and integration are a significant obstacle. In practice, alignment, identification, and quantification of LC-MS peptide peak attributes (PPAs) may be accomplished using recently developed peak matching platforms (16–18). A highly sensitive indicator of protein abundance may be obtained by rolling up PPA measurements into protein-level information (16, 19, 20). Existing peptide-protein roll-up procedures based on PPAs typically take the mean of (possibly normalized) PPA measurements over all peptides corresponding to a protein to obtain a protein-level estimate of abundance. Despite the promise of PPA-based procedures for protein quantification, their performance may vary widely depending on the particular roll-up procedure used; furthermore, PPA-based procedures are limited by difficulties in accurately identifying and measuring peptide peak attributes. These two issues are related, as the latter affects the robustness of PPA-based roll-up methods. 
Indeed, existing peak matching and quantification platforms tend to produce PPA measurement data sets with substantial missingness (16, 19, 21), especially when working with very complex samples, where wide dynamic ranges and ion suppression must be overcome. Missingness may, in turn, lead to instability in protein-level abundance estimates. A good peptide-protein roll-up procedure that utilizes PPAs should account for this missingness and the resulting instability in a principled way. However, even in the absence of missingness, there is no consensus in the existing literature on peptide-protein roll-up for PPA measurements. In this work, we propose ProPCA, a peptide-protein roll-up method for efficiently extracting protein abundance information from bottom-up label-free LC-MS/MS data. ProPCA is an easily implemented, unsupervised method related to principal component analysis (PCA) (22). ProPCA optimally combines SC and PPA data to obtain estimates of relative protein abundance. It addresses missingness in PPA measurement data in a unified way while capitalizing on the strengths of both SCs and PPA-based roll-up methods. In particular, ProPCA adapts to the quality of the available PPA measurement data. If the PPA measurement data are poor and, in the extreme case, no PPA measurements are available, ProPCA is equivalent to spectral counting. On the other hand, if there is no missingness in the PPA measurement data set, the ProPCA estimate is a weighted mean of PPA measurements and spectral counts, where the weights are chosen to reflect the ability of the spectral counts and of each peptide to predict protein abundance. Below, we assess the performance of ProPCA using a data set obtained from the LC-MS/MS analysis of protein standards (the UPS2 proteomic dynamic range standard set manufactured by Sigma-Aldrich) and show that ProPCA outperformed other existing roll-up methods by multiple metrics. 
The applicability of ProPCA is not limited by the quantification platform used to obtain SCs and PPA measurements. To demonstrate this, we show that ProPCA continued to perform well when used with an alternative quantification platform. Finally, we applied ProPCA to a comparative LC-MS/MS analysis of digested total human hepatocellular carcinoma (HepG2) cell lysates prepared for LC-MS/MS analysis by alternative lysis methods. We show that ProPCA identified more differentially abundant proteins than competing methods.  相似文献   
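The core idea of a PCA-based roll-up can be sketched as projecting each protein's peptide-level features onto their first principal component, so that features which co-vary with protein abundance are weighted accordingly. This is a simplified sketch, not the authors' ProPCA implementation (which also handles missing PPA values); the data are hypothetical.

```python
# Hypothetical peptide-level data for one protein across 5 samples:
# column 0 = spectral count, columns 1-2 = log peak areas of two
# peptides. All three features track the (unknown) protein abundance.
X = [
    [4, 21.0, 18.2],
    [6, 21.9, 19.1],
    [2, 20.1, 17.5],
    [9, 22.8, 19.9],
    [5, 21.4, 18.6],
]

def first_pc_scores(rows, iters=200):
    """Project samples onto the first principal component of the
    column-centered data, found by power iteration on X^T X."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    C = [[r[j] - means[j] for j in range(m)] for r in rows]
    v = [1.0] * m                       # initial direction guess
    for _ in range(iters):
        Xv = [sum(C[i][j] * v[j] for j in range(m)) for i in range(n)]
        w = [sum(C[i][j] * Xv[i] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]       # converges to the top eigenvector
    return [sum(C[i][j] * v[j] for j in range(m)) for i in range(n)]

scores = first_pc_scores(X)  # one relative-abundance estimate per sample
print(scores)
```

The sign of a principal component is arbitrary, so the scores are meaningful only up to a global flip; sample 4 (the highest counts and peak areas) lands at one extreme and sample 3 at the other.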

20.
iTRAQ (isobaric tags for relative and absolute quantitation) is a mass spectrometry technology that allows quantitative comparison of protein abundance by measuring the peak intensities of reporter ions released from iTRAQ-tagged peptides by fragmentation during MS/MS. However, current data analysis techniques for iTRAQ struggle to report reliable relative protein abundance estimates and suffer from problems of precision and accuracy. The precision of the data is affected by variance heterogeneity: low-signal data have higher relative variability, yet low-abundance peptides dominate data sets. Accuracy is compromised because ratios are compressed toward 1, leading to underestimation of true changes in abundance. This study investigated both issues and proposed a methodology that combines the peptide measurements to give a robust protein estimate even when the data for a protein are sparse or of low intensity. Our data indicated that ratio compression arises from contamination during precursor ion selection, which occurs at a consistent proportion within an experiment and thus results in a linear relationship between expected and observed ratios; we proposed that a correction factor can be calculated from proteins spiked at known ratios. We then demonstrated that variance heterogeneity is present in iTRAQ data sets irrespective of the analytical package, LC-MS/MS instrumentation, and iTRAQ labeling kit (4-plex or 8-plex) used. We proposed an additive-multiplicative error model for peak intensities in MS/MS quantitation and demonstrated that a variance-stabilizing normalization is able to address this error structure and stabilize the variance across the entire intensity range. The resulting uniform variance structure simplifies the downstream analysis. 
Heterogeneity of variance consistent with an additive-multiplicative model has been reported in other MS-based quantitation, including fields outside of proteomics; consequently, the variance-stabilizing normalization methodology has the potential to increase the capabilities of MS quantitation across diverse areas of biology and chemistry. Different techniques are being used and developed in the field of proteomics to allow quantitative comparison of samples between one state and another. These can be divided into gel-based (1–4) and mass spectrometry-based (5–8) techniques. Comparative studies have found that each technique has strengths and weaknesses and that they play complementary roles in proteomics (9, 10). There is significant interest in stable isotope labeling strategies for proteins or peptides because every measurement can then include an internal reference for relative quantitation, which significantly increases the sensitivity of detecting changes in abundance. Isobaric labeling techniques such as tandem mass tags (11, 12) and isobaric tags for relative and absolute quantitation (iTRAQ) (13, 14) allow multiplexing of four, six, or eight separately labeled samples within one experiment. In contrast to most other quantitative proteomics methods, in which precursor ion intensities are measured, here the measurement and ensuing quantitation of iTRAQ reporter ions occur after fragmentation of the precursor ion. Differentially labeled peptides are selected in MS as a single-mass precursor ion because the mass difference of the tags is equalized by a balance group. The reporter ions are liberated only in MS/MS, after the reporter ion and balance groups fragment from the labeled peptides during CID. 
iTRAQ has been applied to a wide range of biological questions, from bacteria under nitrate stress (15) to mouse models of cerebellar dysfunction (16). For the majority of MS-based quantitation methods (including MS/MS-based methods like iTRAQ), measurements are made at the peptide level and then combined to compute a summarized value for the protein from which they arose. An advantage is that the protein can be identified and quantified from data for multiple peptides, often with multiple values per distinct peptide, thereby enhancing confidence in both the identity and the abundance. However, the question arises of how to summarize the peptide readings to obtain an estimate of the protein ratio. This involves some form of averaging, and we need to consider the distribution of the data, in particular the following three aspects. (i) Are the data centered around a single mode (which would be related to the true protein quantitation), or are there phenomena that make them multimodal? (ii) Are the data approximately symmetric (non-skewed) around the mode? (iii) Are there outliers? In the case of multimodality, the various phenomena should be separated into their own variables so that the multimodality can be dissected. Li et al. (17) developed the ASAPRatio method for ICAT data, which includes a complex data combination strategy: peptide abundance ratios are calculated by combining data from multiple fractions across MS runs and then averaged across peptides to give an abundance ratio for each parent protein. GPS Explorer, a software package developed for iTRAQ, assumes normality in the peptide ratios for a protein once an outlier filter is applied (18). The iTRAQ package ProQuant assumes that the peptide ratio data for a protein follow a log-normal distribution (19). Averaging can be via the mean (20), a weighted average (21, 22), or weighted correlation (23). Some of these methods try to take into account the varying precision of the peptide measurements. 
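One simple, outlier-tolerant way to summarize peptide-level ratios into a protein ratio, in the spirit of the averaging strategies listed above, is the median of the log-ratios. The values below are hypothetical and include one deliberate outlier of the kind Dixon-style filters target.

```python
import math
import statistics

# Hypothetical iTRAQ reporter-ion ratios for the peptides of one protein;
# the last value is an outlier.
peptide_ratios = [1.8, 2.1, 1.9, 2.3, 2.0, 7.5]

# Work in log space so up- and down-regulation are treated symmetrically,
# then take the median, which is robust to the outlier.
log_ratios = [math.log2(r) for r in peptide_ratios]
protein_ratio = 2 ** statistics.median(log_ratios)

print(round(protein_ratio, 2))  # close to 2, barely moved by the 7.5
```

Compare this with the plain mean of the raw ratios (about 2.9 here), which the single outlier drags well away from the bulk of the measurements.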
There are many different ideas of how to process peptide data, but as yet no systematic study has been completed to guide the analysis and ensure that the methods being used are appropriate. The quality of a quantitation method can be considered in terms of precision, which refers to how well repeated measurements agree with each other, and accuracy, which refers to how much they deviate, on average, from the true value. Both types of variability are inherent to the measurement process. Precision is affected by random errors: non-reproducible and unpredictable fluctuations around the true value. (In)accuracy, by contrast, is caused by systematic biases that go consistently in the same direction. In iTRAQ, systematic biases can arise from inconsistencies in iTRAQ labeling efficiency and protein digestion (22). Typically, ratiometric normalization has been used to address this tag bias: all peptide ratios are multiplied by a global normalization factor chosen to center the ratio distribution on 1 (19, 22). Even after such normalization, concerns have been raised that iTRAQ has imperfect accuracy, with ratios shrunken toward 1, and this underestimation has been reported across multiple MS platforms (23–27). It has been suggested that the underestimation arises from co-eluting peptides with similar m/z values, which are co-selected during ion selection and co-fragmented during CID (23, 27). As the majority of these will be at a 1:1 ratio across the reporter ion tags (as required for normalization in iTRAQ experiments), they contribute a background value equally to each of the iTRAQ reporter ion signals and diminish the computed ratios. With regard to random errors, iTRAQ data exhibit heterogeneity of variance; that is, the variance of the signal depends on its mean. In particular, the coefficient of variation (CV) is higher in data from low intensity peaks than in data from high intensity peaks (16, 22, 23). 
This has also been observed in other MS-based quantitation techniques when quantifying from the MS signal (28–30). Different approaches have been proposed to model the variance heterogeneity. Pavelka et al. (31) used a power law global error model in conjunction with quantitation data derived from spectral counts. Other authors have proposed that the higher CV at low signal arises because most MS instrumentation measures ion counts as whole numbers (32). Anderle et al. (28) described a two-component error model, in which the Poisson statistics of ion counts measured as whole numbers dominate at the low intensity end of the dynamic range and multiplicative effects dominate at the high intensity end, and demonstrated its fit to label-free LC-MS quantitation data. Earlier, in the 1990s, Rocke and Lorenzato (29) proposed a two-component additive-multiplicative error model in an environmental toxin monitoring study utilizing gas chromatography MS. How can the variance heterogeneity be addressed in the data analysis? Current approaches include outlier removal (18, 25), weighted means (21, 22), inclusion filters (16, 22), logarithmic transformation (19), and weighted correlation analysis (23). Outlier removal methods, for example those using Dixon's test, assume a normal distribution, for which there is little empirical basis. The inclusion filter method, in which low intensity data are excluded, reduces protein coverage considerably if the heterogeneity is to be significantly reduced. The weighted mean method lets high intensity readings contribute more to the mean than low intensity readings. Filtering, outlier removal, and weighted methods are of limited use for peptides for which only a few low intensity readings were made, yet such cases typically dominate the data sets. Even with a logarithmic transformation, heterogeneity has been reported for iTRAQ data (16, 19, 22). 
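Under the two-component model just described, the variance of a measured intensity with true signal mu is sigma_add^2 + (sigma_mult * mu)^2, so the CV blows up at low signal and flattens toward sigma_mult at high signal. A quick sketch (the parameter values are illustrative, not fitted to any real instrument):

```python
# Two-component (additive-multiplicative) error model:
#   y = mu + mu * eps_mult + eps_add
# so Var(y) = (sigma_mult * mu)^2 + sigma_add^2.
SIGMA_ADD = 50.0    # dominates at low intensity (illustrative value)
SIGMA_MULT = 0.10   # dominates at high intensity (illustrative value)

def cv(mu):
    """Coefficient of variation predicted by the model."""
    var = (SIGMA_MULT * mu) ** 2 + SIGMA_ADD ** 2
    return var ** 0.5 / mu

for mu in (100, 1_000, 10_000, 100_000):
    print(f"mu={mu:>7}  CV={cv(mu):.3f}")
```

The printed CVs fall from roughly 0.5 at mu = 100 toward the multiplicative floor of 0.10, reproducing the pattern reported for low versus high intensity peaks.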
Current methods struggle to address the issue while maintaining sensitivity. Here we investigate the data analysis issues that relate to precision and accuracy in quantitation and propose a robust methodology designed to make use of all the data without ad hoc filtering rules. The additive-multiplicative model mentioned above motivates the so-called generalized logarithm transformation, which addresses heterogeneity of variance by approximately stabilizing the variance of the transformed signal across its whole dynamic range (33). Huber et al. (33) provide an open source software package, variance-stabilizing normalization (VSN), that determines the data-dependent transformation parameters. Here we report that applying this transformation is beneficial for the analysis of iTRAQ data. We investigated the error structure of iTRAQ quantitation data using different peak identification and quantitation packages, LC-MS/MS data collection systems, and both the 4-plex and 8-plex iTRAQ systems, and demonstrated the usefulness of the VSN transformation for addressing heterogeneity of variance. Furthermore, we considered the correlations between multiple peptide-level readings for the same protein and propose a method to summarize them into a protein abundance estimate. We used same-same comparisons to assess the magnitude of experimental variability and then a set of well-characterized complex biological samples to assess the power of the method to detect true differential abundance. We assessed the accuracy of the system with a four-protein mixture at known ratios spanning a fold-change range of 1–4. From this, we propose a methodology to address the accuracy issues of iTRAQ.  相似文献   
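The generalized logarithm at the heart of VSN can be written down directly: glog(x) = log((x + sqrt(x^2 + c)) / 2) behaves like log(x) for large x but stays finite at and below zero, which is what stabilizes the variance under the additive-multiplicative model. The constant c below is illustrative; VSN estimates the transformation parameters from the data.

```python
import math

def glog(x, c=2500.0):
    """Generalized logarithm: ~log(x) for x >> sqrt(c), finite at x <= 0.
    The offset c is data-dependent in VSN; 2500 is an illustrative value."""
    return math.log((x + math.sqrt(x * x + c)) / 2.0)

# Unlike log, glog tolerates zero and near-zero reporter intensities,
# and converges to log at high intensity.
for x in (0.0, 10.0, 100.0, 10_000.0):
    log_part = f"log={math.log(x):.3f}" if x > 0 else "log undefined"
    print(f"x={x:>8}  glog={glog(x):.3f}  {log_part}")
```

At x = 10,000 the difference between glog and log is already negligible, while at x = 0, where log is undefined, glog returns a finite value, so low-intensity readings need not be filtered out before transformation.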
