Similar literature
 20 similar documents found (search time: 31 ms)
1.
We propose a new concept, peptide detectability, which could be an important factor in explaining the relationship between a protein's quantity and the peptides identified from it in a high-throughput proteomics experiment. We define peptide detectability as the probability of observing a peptide in a standard sample analyzed by a standard proteomics routine, and argue that it is an intrinsic property of the peptide sequence and neighboring regions in the parent protein. To test this hypothesis, we first used publicly available data and data from our own synthetic samples in which quantities of model proteins were controlled. We then applied machine learning approaches to demonstrate that peptide detectability can be predicted from its sequence and the neighboring regions in the parent protein with satisfactory accuracy. The utility of this approach for protein quantification is demonstrated by the fact that, in the synthetic protein mixtures, peptides with higher detectability were generally identified at lower concentrations than those with lower detectability. These results establish a direct link between protein concentration and peptide detectability. We show that for each protein there exists a level of peptide detectability above which peptides are detected and below which peptides are not detected in an experiment. We call this level the minimum acceptable detectability for identified peptides (MDIP), which can be calibrated to predict protein concentration. Triplicate analysis of a biological sample showed that these MDIP values are consistent among the three data sets.
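
The MDIP idea described in this abstract can be illustrated with a minimal Python sketch. It assumes that each peptide of a protein already carries a predicted detectability score in [0, 1] and a flag saying whether it was identified. The function name `mdip` and the example values are hypothetical, and taking the lowest detectability among the identified peptides is only one simple reading of the definition, not necessarily the authors' calibration procedure.

```python
from typing import Iterable, Optional, Tuple


def mdip(peptides: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Return the lowest predicted detectability among identified peptides.

    Each item is (predicted_detectability, was_identified).
    Returns None when no peptide of the protein was identified.
    """
    identified = [score for score, seen in peptides if seen]
    return min(identified) if identified else None


# Hypothetical peptides of one protein: (detectability, identified?)
example = [(0.91, True), (0.74, True), (0.52, True), (0.33, False), (0.10, False)]
print(mdip(example))
```

Under this reading, a protein present at a lower concentration would be expected to show a higher MDIP, because only its most detectable peptides are observed.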

2.
The proteins secreted by prostate cancer cells (PC3(AR)6) were separated by strong anion exchange chromatography, digested with trypsin and analyzed by unbiased liquid chromatography tandem mass spectrometry with an ion trap. The spectra were matched to peptides within proteins using a goodness of fit algorithm that showed a low false positive rate. The parent ions for MS/MS were randomly and independently sampled from a log-normal population and therefore could be analyzed by ANOVA. Normal distribution analysis confirmed that the parent and fragment ion intensity distributions were sampled over 99.9% of their range that was above the background noise. Arranging the ion intensity data with the identified peptide and protein sequences in structured query language (SQL) permitted the quantification of ion intensity across treatments, proteins and peptides. The intensity of 101,905 fragment ions from 1421 peptide precursors of 583 peptides from 233 proteins separated over 11 sample treatments were computed together in one ANOVA model using the statistical analysis system (SAS) prior to Tukey-Kramer honestly significant difference (HSD) testing. Thus complex mixtures of proteins were identified and quantified with a high degree of confidence using an ion trap without isotopic labels, multivariate analysis or comparing chromatographic retention times.  相似文献   

3.
Most gene prediction algorithms for prokaryotes are based on Hidden Markov Models or similar machine-learning approaches, which imply the optimization of a large number of parameters. The present paper presents a novel method for the classification of coding and non-coding regions in prokaryotic genomes, based on a suitably defined compression index of a DNA sequence. The main features of this new method are its non-parametric logic and the construction of a dictionary of words extracted from the sequences. These dictionaries can be very useful for further analyses of the genomic sequences themselves. The proposed approach has been applied to several complete prokaryotic genomes, obtaining optimal scores of correctly recognized coding and non-coding regions. Several false-positive and false-negative cases have been investigated in detail, revealing that this approach can fail in the presence of highly structured coding regions (e.g., genes coding for modular proteins) or quasi-random non-coding regions (e.g., regions hosting non-functional fragments of copies of functional genes; regions hosting promoters or other protein-binding sequences). We perform an overall comparison with other gene-finder software, since at this step we are not interested in building another gene-finder system, but only in exploring the possibility of the suggested approach.
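
To make the notion of a compression index concrete, here is a rough Python sketch. The abstract does not specify how the authors define their index or build their dictionary, so this example swaps in the general-purpose `zlib` compressor from the Python standard library as a stand-in. The function name `compression_index` and both toy sequences are assumptions made purely for illustration.

```python
import random
import zlib


def compression_index(sequence: str) -> float:
    """Compressed size divided by original size (lower means more regular).

    Uses zlib as a stand-in compressor; the sequence must be non-empty.
    """
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


repetitive = "ACG" * 1000                 # highly structured toy sequence
rng = random.Random(0)                    # fixed seed for repeatability
shuffled = "".join(rng.choice("ACGT") for _ in range(3000))  # quasi-random toy sequence

print(compression_index(repetitive), compression_index(shuffled))
```

The repetitive sequence compresses far better than the quasi-random one, which is the kind of contrast an index of this type relies on when separating regions of a genome.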

4.
In mass spectrometry-based protein quantification, peptides that are shared across different protein sequences are often discarded as being uninformative with respect to each of the parent proteins. We investigate the use of shared peptides which are ubiquitous (~50% of peptides) in mass spectrometric data-sets for accurate protein identification and quantification. Different from existing approaches, we show how shared peptides can help compute the relative amounts of the proteins that contain them. Also, proteins with no unique peptide in the sample can still be analyzed for relative abundance. Our article uses shared peptides in protein quantification and makes use of combinatorial optimization to reduce the error in relative abundance measurements. We describe the topological and numerical properties required for robust estimates, and use them to improve our estimates for ill-conditioned systems. Extensive simulations validate our approach even in the presence of experimental error. We apply our method to a model of Arabidopsis thaliana root knot nematode infection, and investigate the differential role of several protein family members in mediating host response to the pathogen.  相似文献   
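
The claim that shared peptides can still determine the relative amounts of their parent proteins can be sketched as a small linear system. The example below is an assumption-laden illustration: it uses ordinary least squares on a hypothetical peptide-to-protein membership matrix, whereas the abstract describes combinatorial optimization, so this is a substitute technique rather than the authors' method. All names and numbers are invented.

```python
from typing import List


def solve_least_squares(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve min ||A x - b||^2 via the normal equations (A^T A) x = A^T b."""
    n = len(a[0])
    # Build the augmented normal-equation matrix [A^T A | A^T b].
    m = [[sum(row[i] * row[j] for row in a) for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(a, b))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("ill-conditioned system: proteins are not separable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Rows are peptides, columns are proteins A and B (1 = peptide occurs in protein).
membership = [[1, 0],   # peptide unique to A
              [1, 1],   # peptide shared by A and B
              [1, 1]]   # second shared peptide
intensity = [2.0, 5.0, 5.0]   # hypothetical measured peptide abundances
abundance = solve_least_squares(membership, intensity)
print(abundance)
```

In this toy case protein B has no unique peptide, yet its abundance is still recovered from the shared peptides. When two proteins share exactly the same peptides, the system becomes singular and the function raises an error, which is a simple instance of the ill-conditioned systems the abstract mentions.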

5.
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website (http://llp.berkeley.edu) and at Harvard Dataverse (http://dx.doi.org/10.7910/DVN/JMFHTN).

6.
A database has been compiled documenting the peptide conformations and geometries from 70 diverse proteins refined at 1.75 Å or better. Analysis of the well-ordered residues within the database shows φ,ψ distributions that have more fine structure than is generally observed. Also, clear evidence is presented that the peptide covalent geometry depends on conformation, with the interpeptide N-Cα-C bond angle varying by nearly ±5° from its standard value. The observed deviations from standard peptide geometry are greatest near the edges of well-populated regions, consistent with strain occurring in these conformations. Minimization of such hidden strain could be an important factor in thermostability of proteins. These empirical data describing how equilibrium peptide geometry varies as a function of conformation confirm and extend quantum mechanics calculations, and have predictive value that will aid both theoretical and experimental analyses of protein structure.

7.
Fibroblasts increase the catabolism of certain intracellular proteins in response to serum withdrawal, and these proteins contain specific peptide regions that may be required for their increased degradation. We show that the increased degradation of microinjected ribonuclease A during serum withdrawal can be blocked by co-injection of a pentapeptide corresponding to residues 7-11 of ribonuclease A, Lys-Phe-Glu-Arg-Gln. Furthermore, similar peptide sequences appear to play a widespread role in targeting proteins for enhanced degradation. Affinity-purified antibodies raised against the pentapeptide are able to precipitate 20-35% of radiolabeled cytosolic proteins from fibroblasts. Such proteins are preferentially degraded when cells are deprived of serum while nonimmunoprecipitable proteins are degraded at the same rate in the presence and absence of serum. Immunoreactive cytosolic proteins also exist in rat liver and kidney, and these proteins are depleted when protein degradation rates are enhanced due to starvation. Several types of evidence suggest that the peptides recognized in cellular proteins are similar to Lys-Phe-Glu-Arg-Gln but are not this exact sequence. Analyses of amino acid sequences for four proteins whose degradative rates are enhanced in response to serum withdrawal and for four proteins that are degraded in a serum-independent manner indicate two possible peptide motifs related to Lys-Phe-Glu-Arg-Gln that may target cellular proteins for enhanced degradation. These results, combined with previous studies (McElligott, M. A., Miao, P., and Dice, J. F. (1985) J. Biol. Chem. 260, 11986-11993), suggest that these peptide regions target specific proteins to a lysosomal pathway of degradation during serum withdrawal.  相似文献   

9.
《Journal of Proteomics》2010,73(1):103-111
A goodness of fit test may be used to assign tandem mass spectra of peptides to amino acid sequences and to directly calculate the expected probability of mis-identification. The product of the peptide expectation values directly yields the probability that the parent protein has been mis-identified. A relational database could capture the mass spectral data and the best fit results, and permit subsequent calculations by a general statistical analysis system. The many files of the HUPO blood protein data correlated by X!TANDEM against the proteins of ENSEMBL were collected into a relational database. A redundant set of 247,077 proteins and peptides was correlated by X!TANDEM and collapsed to a set of 34,956 peptides from 13,379 distinct proteins. About 6875 distinct proteins were represented by only a single distinct peptide, 2866 proteins showed 2 distinct peptides, and 3454 proteins showed at least three distinct peptides by X!TANDEM. More than 99% of the peptides were associated with proteins that had cumulative expectation values, i.e. probability of false positive identification, of one in one hundred or less. The distribution of peptides per protein from X!TANDEM was significantly different from that expected from random assignment of peptides.
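
The statement that the product of peptide expectation values yields the protein-level probability of mis-identification can be written out as a short Python sketch. It works in log space to avoid underflow when many small values are multiplied. The function name and the example expectation values are hypothetical, and treating the peptide values as independent is an assumption of this sketch.

```python
import math
from typing import Iterable


def protein_log10_expectation(peptide_expectations: Iterable[float]) -> float:
    """Sum the log10 peptide expectation values, i.e. the log10 of their product."""
    return sum(math.log10(e) for e in peptide_expectations)


# Hypothetical expectation values for three distinct peptides of one protein.
peptides = [1e-2, 1e-3, 5e-2]
log_e = protein_log10_expectation(peptides)
print(f"protein expectation ~ 10^{log_e:.2f}")
```

A protein supported by several peptides with modest individual scores can therefore reach a very small cumulative expectation value, while a protein supported by a single peptide keeps that peptide's value.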

10.
Intrinsic disorder in the Protein Data Bank   (total citations: 2; self-citations: 0; citations by others: 2)
The Protein Data Bank (PDB) is the preeminent source of protein structural information. PDB contains over 32,500 experimentally determined 3-D structures solved using X-ray crystallography or nuclear magnetic resonance spectroscopy. Intrinsically disordered regions fail to form a fixed 3-D structure under physiological conditions. In this study, we compare the amino-acid sequences of proteins whose structures are determined by X-ray crystallography with the corresponding sequences from the Swiss-Prot database. The analyzed dataset includes 16,370 structures, which represent 18,101 PDB chains and 5,434 different proteins from 910 different organisms (2,793 eukaryotic, 2,109 bacterial, 288 viral, and 244 archaeal). In this dataset, on average, each Swiss-Prot protein is represented by 7 PDB chains with 76% of the crystallized regions being represented by more than one structure. Intriguingly, the complete sequences of only approximately 7% of proteins are observed in the corresponding PDB structures, and only approximately 25% of the total dataset have >95% of their lengths observed in the corresponding PDB structures. This suggests that the vast majority of PDB proteins is shorter than their corresponding Swiss-Prot sequences and/or contain numerous residues, which are not observed in maps of electron density. To determine the prevalence of disordered regions in PDB, the residues in the Swiss-Prot sequences were grouped into four general categories, "Observed" (which correspond to structured regions), "Not observed" (regions with missing electron density, potentially disordered), "Uncharacterized," and "Ambiguous," depending on their appearance in the corresponding PDB entries. This non-redundant set of residues can be viewed as a 'fragment' or empirical domain database that contains a set of experimentally determined structured regions or domains and a set of experimentally verified disordered regions or domains. 
We studied the propensities and properties of residues in these four categories and analyzed their relations to the predictions of disorder using several algorithms. "Not observed," "Ambiguous," and "Uncharacterized" regions were shown to possess the amino acid compositional biases typical of intrinsically disordered proteins. The application of four different disorder predictors (PONDR® VL-XT, VL3-BA, VSL1P, and IUPred) revealed that the vast majority of residues in the "Observed" dataset are ordered, and that the "Not observed" regions are mostly disordered. The "Uncharacterized" regions possess some tendency toward order, whereas the predictions for the short "Ambiguous" regions are really ambiguous. Long "Ambiguous" regions (>70 amino acid residues) are mostly predicted to be ordered, suggesting that they are likely to be "wobbly" domains. Overall, we showed that completely ordered proteins are not highly abundant in PDB and many PDB sequences have disordered regions. In fact, in the analyzed dataset approximately 10% of the PDB proteins contain regions of consecutive missing or ambiguous residues longer than 30 amino-acids and approximately 40% of the proteins possess short regions (≥10 and <30 amino-acid long) of missing and ambiguous residues.

12.
13.
S. J. Leach 《Biopolymers》1983,22(1):425-440
Most of a protein surface is potentially antigenic, consisting of numerous overlapping domains each complementary to antibody-combining sites. These domains may include peptide sequences that are demonstrably antigenic but only when antibodies from the appropriate host individuals and species are used. Methods for locating antigenic peptide sequences are described in which hydrophilic polyamide supports are used for peptide synthesis, then solid-phase radioimmunoassay with antisera and protein A. Most antigenic domains, however, comprise amino acid side chains contributed by two or more nearby polypeptide chains. Such domains can be identified by comparing the cross-reactivities of groups of very closely related proteins towards monoclonal antibodies raised to one of them. Such studies, using myoglobins, have identified a number of residues not previously shown to be antigenic and have provided a guide for the choice of synthetic peptides which are likely to carry several immunodominant side chains. One such peptide corresponding to residues (72–89) of beef myoglobin has been shown, using CD and antibodies to the parent protein, to have interesting conformational and antigenic properties. The peptide (25–55) is also antigenic.  相似文献   

14.
The exotic functions of antifreeze proteins (AFPs) and antifreeze glycopeptides (AFGPs) have recently attracted much interest for development as commercial products. AFPs and AFGPs inhibit ice crystal growth by lowering the freezing point of water without changing its melting point. Our group isolated the Antarctic yeast Glaciozyma antarctica, which expresses an antifreeze protein that helps it survive at sub-zero temperatures. The protein is unique and novel, as indicated by its low sequence homology to other AFPs. We explore the structure-function relationship of G. antarctica AFP using approaches ranging from protein structure prediction, peptide design and antifreeze activity assays to nuclear magnetic resonance (NMR) studies and molecular dynamics simulation. The predicted secondary structure of G. antarctica AFP shows several α-helices, which are assumed to be responsible for its antifreeze activity. We designed several peptide fragments derived from the amino acid sequences of the α-helical regions of the parent AFP; they also showed substantial antifreeze activity, although below that of the original AFP. The relationship between peptide structure and activity was explored by NMR spectroscopy and molecular dynamics simulation. NMR results show that the antifreeze activity of the peptides correlates with their helicity and geometrical straightness. Furthermore, molecular dynamics simulation suggests that the activity of the designed peptides can be explained in terms of structural rigidity/flexibility: the most active peptide shows higher structural stability and lower flexibility than the less active, less rigid peptides. This is the first detailed report of downsizing a yeast AFP into peptide fragments with measurable antifreeze activity.

15.
Short structured peptides can provide scaffolds for protease-resistant peptide therapeutics, serve as useful building blocks in biomedical and biotechnological applications, and shed light on the role of secondary structure elements in protein folding. It is well known that directed evolution is a powerful method for creating proteins and peptides with novel properties, and a system for the selection of short peptides based on structure from a randomized library would be an important advancement. In this study, phage particles monovalently displaying a short peptide and an N-terminal 6×His tag on their P3 coat protein were bound to nickel agarose resin and were subsequently challenged with a protease that specifically cleaves at a site within the peptide. The extent to which phage is proteolytically released from the resin was found to be dependent on the structural properties of the inserted peptide sequences. As proofs-of-concept, a structured peptide has been isolated from a pool of flexible peptides using a trypsin selection, and a flexible peptide has been isolated from a pool of structured peptides using a chymotrypsin selection. This selection system will be a strong technological platform for the creation of short peptides with interesting structural properties using directed evolution.  相似文献   

16.
DNA in a single-stranded form (ssDNA) exists transiently within the cell and comprises the telomeres of linear chromosomes and the genomes of some DNA viruses. As with RNA, in the single-stranded state, some DNA sequences are able to fold into complex secondary and tertiary structures that may be recognized by proteins and participate in gene regulation. To better understand how such DNA elements might fold and interact with proteins, and to compare recognition features to those of a structured RNA, we used in vitro selection to identify ssDNAs that bind an RNA-binding peptide from the HIV Rev protein with high affinity and specificity. The large majority of selected binders contain a non-Watson-Crick G.T base-pair and an adjacent C:G base-pair and both are essential for binding. This GT motif can be presented in different DNA contexts, including a nearly perfect duplex and a branched three-helix structure, and appears to be recognized in large part by arginine residues separated by one turn of an alpha-helix. Interestingly, a very similar GT motif is necessary also for protein binding and function of a well-characterized model ssDNA regulatory element from the proenkephalin promoter.  相似文献   

17.
The Abelson murine leukemia virus transforming gene product is a phosphorylated protein encoded by both viral and cellular sequences. This gene product has an amino-terminal region derived from the gag gene of its parent virus and a carboxyl-terminal region (abl) derived from a normal murine cellular gene. Using a combination of partial proteolytic cleavage techniques and antisera specific for gag and abl sequences, we mapped in vivo phosphorylation sites to different regions of the protein. Phosphoproteins encoded by strain variants and transformation-defective mutants of Abelson murine leukemia virus with defined deletions in the primary sequence of the abl region were compared by two-dimensional limit-digest peptide mapping. Specific phosphorylation pattern differences for wild-type and mutant proteins probably represented deletions of specific phosphate acceptor sites in the abl region. An in vitro autophosphorylation activity copurified with the Abelson murine leukemia virus protein from transformation-competent strains. A peptide analysis of such in vitro reactions demonstrated that these phosphorylation sites were restricted to the amino-terminal region, and the specific sites appeared to be unrelated to the sites found on proteins phosphorylated in vivo. Thus, the autophosphorylation reaction probably correlates with an activity important in transformation, but the specific end product in vitro bears little resemblance to its function in vivo.

18.
19.
Cell-penetrating peptides have proven themselves as valuable vectors for intracellular delivery. Relatively little is known about the frequency of cell-penetrating sequences in native proteins and their functional role. By computational comparison of peptide sequences, we recently predicted that intracellular loops of G-protein coupled receptors (GPCR) have high probability for occurrence of cell-penetrating motifs. Since the loops are also receptor and G-protein interaction sites, we postulated that the short cell-penetrating peptides, derived from GPCR, when applied extracellularly can pass the membrane and modulate G-protein activity similarly to parent receptor proteins. Two model systems were analyzed as proofs of the principle. A peptide based on the C-terminal intracellular sequence of the rat angiotensin receptor (AT1AR) is shown to internalize into live cells and elicit blood vessel contraction even in the presence of AT1AR antagonist Sar1-Thr8-angiotensin II. The peptide interacts with the same selectivity towards G-protein subtypes as agonist-activated AT1AR and blockade of phospholipase C abolishes its effect. Another cell-penetrating peptide, G53-2 derived from human glucagon-like peptide receptor (GLP-1R) is shown to induce insulin release from isolated pancreatic islets. The mechanism was again found to be shared with the original GLP-1R, namely G11-mediated inositol 1,4,5-triphosphate release pathway. These data reveal a novel possibility to mimic the effects of signalling transmembrane proteins by application of shorter peptide fragments.  相似文献   

20.
Ormeci L  Gursoy A  Tunca G  Erman B 《Proteins》2007,66(1):29-40
The probabilities of the various basins in Ramachandran maps are examined critically. The theoretical basis of probability calculations both from molecular computations and from protein libraries are discussed. The well-defined basins of the Ramachandran maps are treated as rotational isomeric states. Statistical independence and dependence of the states of different residues along the peptide chain are discussed. The Flory isolated pair hypothesis, near neighbor correlations, context effects, and long-range correlations are examined critically. A method of evaluating long-range correlations in helical and extended sequences is introduced in analogy with earlier polymer theory. Three different protein libraries are constructed where data is considered from residues in the (i) coiled regions, (ii) all regions, and (iii) only the helical and extended regions of proteins. Singlet and pairwise dependent probabilities calculated from these libraries are used to predict whether a given sequence is helical or extended. Predictions using pairwise dependence were not better than those using singlet probabilities. Modeling of long-range correlations improved the predictions significantly. Removal of the Chameleon sequences from the data set also improved the predictions, but to a lesser extent.  相似文献   
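
The difference between singlet and pairwise dependent probabilities discussed in this abstract can be sketched in a few lines of Python. The example assumes that each residue has already been assigned to a rotational isomeric state, written here as a single letter (`H` for helical, `E` for extended). The function names and the two toy chains are hypothetical, and the sketch ignores residue type, which a real library would condition on.

```python
from collections import Counter
from typing import Dict, List, Tuple


def singlet_probabilities(chains: List[str]) -> Dict[str, float]:
    """P(state), counted over every residue of every chain."""
    counts = Counter(state for chain in chains for state in chain)
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


def pairwise_probabilities(chains: List[str]) -> Dict[Tuple[str, str], float]:
    """P(next state | current state), counted over adjacent residue pairs."""
    pair_counts = Counter(
        (a, b) for chain in chains for a, b in zip(chain, chain[1:])
    )
    first_counts: Counter = Counter()
    for (a, _b), n in pair_counts.items():
        first_counts[a] += n
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


# Toy state assignments for two short chains (H = helical, E = extended).
chains = ["HHHE", "EEHH"]
singlet = singlet_probabilities(chains)
pairwise = pairwise_probabilities(chains)
print(singlet)
print(pairwise)
```

If neighboring residues were statistically independent, P(next | current) would equal the singlet probability of the next state. In the toy data P(H | H) is 0.75 while P(H) is 0.625, so the neighbor carries some information.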
