Similar articles
Found 20 similar articles (search time: 31 ms)
1.
Kwon D, Vannucci M, Song JJ, Jeong J, Pfeiffer RM. Proteomics 2008, 8(15): 3019-3029
In recent years there has been an increased interest in using protein mass spectroscopy to discriminate diseased from healthy individuals with the aim of discovering molecular markers for disease. A crucial step before any statistical analysis is the pre-processing of the mass spectrometry data. Statistical results are typically strongly affected by the specific pre-processing techniques used. One important pre-processing step is the removal of chemical and instrumental noise from the mass spectra. Wavelet denoising techniques are a standard method for denoising. Existing techniques, however, do not accommodate errors that vary across the mass spectrum, but instead assume a homogeneous error structure. In this paper we propose a novel wavelet denoising approach that deals with heterogeneous errors by incorporating a variance change point detection method in the thresholding procedure. We study our method on real and simulated mass spectrometry data and show that it improves the performance of peak detection methods.
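
The homogeneous-error baseline that this abstract contrasts itself with can be sketched roughly as follows. This is not the authors' change-point method: it is a minimal single-level Haar transform with one global soft threshold, and the function names, the MAD-based noise estimate and the universal threshold are assumptions chosen for illustration.

```python
import math
import statistics


def haar_forward(signal):
    """One level of the orthonormal Haar transform (input length must be even)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def soft_threshold(x, t):
    """Shrink x towards zero by t; values with |x| <= t become zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def denoise(signal):
    """Denoise with a single global threshold (assumes one noise level everywhere)."""
    approx, detail = haar_forward(signal)
    # Noise scale estimated from the median absolute detail coefficient.
    sigma = statistics.median(abs(d) for d in detail) / 0.6745
    # Universal threshold sigma * sqrt(2 log n).
    t = sigma * math.sqrt(2.0 * math.log(len(signal)))
    detail = [soft_threshold(d, t) for d in detail]
    return haar_inverse(approx, detail)
```

A heterogeneous-error variant in the spirit of the abstract would estimate `sigma` separately on each segment between detected variance change points instead of once for the whole spectrum.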

2.
We addressed the problem of discriminating between 24 diseased and 17 healthy specimens on the basis of protein mass spectra. To prepare the data, we performed mass to charge ratio (m/z) normalization, baseline elimination, and conversion of absolute peak height measures to height ratios. After preprocessing, the major difficulty encountered was the extremely large number of variables (1676 m/z values) versus the number of examples (41). Dimensionality reduction was treated as an integral part of the classification process; variable selection was coupled with model construction in a single ten-fold cross-validation loop. We explored different experimental setups involving two peak height representations, two variable selection methods, and six induction algorithms, all on both the original 1676-mass data set and on a prescreened 124-mass data set. Highest predictive accuracies (1-2 off-sample misclassifications) were achieved by a multilayer perceptron and Naïve Bayes, with the latter displaying more consistent performance (hence greater reliability) over varying experimental conditions. We attempted to identify the most discriminant peaks (proteins) on the basis of scores assigned by the two variable selection methods and by neural network based sensitivity analysis. These three scoring schemes consistently ranked four peaks as the most relevant discriminators: 11683, 1403, 17350 and 66107.

3.
Wagner M, Naik D, Pothen A. Proteomics 2003, 3(9): 1692-1698
We report our results in classifying protein matrix-assisted laser desorption/ionization-time of flight mass spectra obtained from serum samples into diseased and healthy groups. We discuss in detail five of the steps in preprocessing the mass spectral data for biomarker discovery, as well as our criterion for choosing a small set of peaks for classifying the samples. Cross-validation studies with four selected proteins yielded misclassification rates in the 10-15% range for all the classification methods. Three of these proteins or protein fragments are down-regulated and one up-regulated in lung cancer, the disease under consideration in this data set. When cross-validation studies are performed, care must be taken to ensure that the test set does not influence the choice of the peaks used in the classification. Misclassification rates are lower when both the training and test sets are used to select the peaks used in classification versus when only the training set is used. This expectation was validated for various statistical discrimination methods when thirteen peaks were used in cross-validation studies. One particular classification method, a linear support vector machine, exhibited especially robust performance when the number of peaks was varied from four to thirteen, and when the peaks were selected from the training set alone. Experiments with the samples randomly assigned to the two classes confirmed that misclassification rates were significantly higher in such cases than those observed with the true data. This indicates that our findings are indeed significant. We found closely matching masses in a database for protein expression in lung cancer for three of the four proteins we used to classify lung cancer. Data from additional samples, increased experience with the performance of various preprocessing techniques, and affirmation of the biological roles of the proteins that help in classification, will strengthen our conclusions in the future.  
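
The warning about peak selection inside cross-validation can be illustrated with a small sketch. It assumes leave-one-out folds, a class-mean-difference peak score and a nearest-centroid classifier; none of these is stated in the abstract, and all function names are hypothetical. The point is only that `select_peaks` sees the training rows of each fold and never the held-out sample.

```python
def select_peaks(X, y, k):
    """Return indices of the k peaks with the largest gap between class means."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_feat), key=lambda j: -scores[j])[:k]


def nearest_centroid_predict(X_train, y_train, x, peaks):
    """Predict the class whose centroid (on the chosen peaks) is closest to x."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in peaks]

    def sq_dist(c):
        return sum((x[j] - cj) ** 2 for j, cj in zip(peaks, c))

    return 1 if sq_dist(centroid(1)) < sq_dist(centroid(0)) else 0


def loo_error(X, y, k):
    """Leave-one-out error with peak selection repeated inside every fold."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        # Peaks are chosen without looking at the held-out sample i.
        peaks = select_peaks(X_train, y_train, k)
        pred = nearest_centroid_predict(X_train, y_train, X[i], peaks)
        errors += int(pred != y[i])
    return errors / len(X)
```

Calling `select_peaks(X, y, k)` once on the full data and reusing the result in every fold is the leakage the abstract describes, and it tends to give optimistically low error estimates.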

4.
Prospective accuracy for longitudinal markers   Cited by: 1 (self-citations: 0, citations by others: 1)
Zheng Y, Heagerty PJ. Biometrics 2007, 63(2): 332-341
In this article we focus on appropriate statistical methods for characterizing the prognostic value of a longitudinal clinical marker. Frequently it is possible to obtain repeated measurements. If the measurement has the ability to signify a pending change in the clinical status of a patient then the marker has the potential to guide key medical decisions. Heagerty, Lumley, and Pepe (2000, Biometrics 56, 337-344) proposed characterizing the diagnostic accuracy of a marker measured at baseline by calculating receiver operating characteristic curves for cumulative disease or death incidence by time t. They considered disease status as a function of time, D(t) = 1(T ≤ t) [...] a marker measured at time s (s ≥ 0, after the baseline time) can discriminate between people who become diseased and those who do not in a subsequent time interval [s, t]. We assume the disease status is derived from an observed event time T and thus interest is in individuals who transition from disease free to diseased. We seek methods that also allow the inclusion of prognostic covariates that permit patient-specific decision guidelines when forecasting a future change in health status. Our proposal is to use flexible semiparametric models to characterize the bivariate distribution of the event time and marker values at an arbitrary time s. We illustrate the new methods by analyzing a well-known data set from HIV research, the Multicenter AIDS Cohort Study data.
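
The accuracy question posed for the interval [s, t] can be made concrete with a crude empirical sketch. It assumes every event time is fully observed (no censoring) and uses a single marker cutoff, so it is a simplification and not the semiparametric approach the abstract proposes; the function name and argument layout are hypothetical.

```python
def prospective_accuracy(markers, event_times, s, t, cutoff):
    """Empirical sensitivity and specificity of a marker measured at time s.

    Cases are subjects still event-free at s whose event occurs in (s, t];
    controls are subjects whose event time exceeds t.
    Assumes uncensored event times.
    """
    at_risk = [(m, T) for m, T in zip(markers, event_times) if T > s]
    cases = [m for m, T in at_risk if T <= t]
    controls = [m for m, T in at_risk if T > t]
    sensitivity = sum(m > cutoff for m in cases) / len(cases)
    specificity = sum(m <= cutoff for m in controls) / len(controls)
    return sensitivity, specificity
```

Sweeping `cutoff` over the observed marker values and plotting sensitivity against one minus specificity would trace an empirical receiver operating characteristic curve for that choice of s and t.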

5.
MOTIVATION: Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. RESULTS: Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well as or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.

6.
To understand the effect of abnormal brood odors on the initiation or control of hygienic behavior in honey bees, we employed the associative learning paradigm, proboscis extension reflex conditioning. Bees from two genetic lines (hygienic and non-hygienic) were able to discriminate between high concentrations of two floral odors equally well. Differential discrimination abilities were observed between the two lines when healthy and diseased brood odors were used, with the bees from the hygienic line discriminating between the pair of brood odors better than the non-hygienic bees. These results suggest that hygienic behavior in individual bees is associated with the bees' responses to olfactory stimuli emanating from diseased brood.

7.
In a comparative proteome analysis of peripheral blood mononuclear cells (PBMCs), we analyzed 130 two-dimensional gels obtained from 33 healthy control individuals and 32 patients diagnosed with rheumatoid arthritis (RA). We found 16 protein spots that are deregulated in patients with RA and, using peptide mass fingerprinting and Western blot analyses, identified these spots as belonging to 9 distinct proteins. A hierarchical clustering procedure organizes the study subjects into two main clusters based on the expression of these 16 protein spots, one that contains mostly healthy control individuals and the other mostly RA patients. The majority of the proteins differentially expressed in RA patients when compared with healthy controls can be detected as protein fragments in PBMCs obtained from RA patients. This set of deregulated proteins includes several factors that have been shown to be autoantigens in autoimmune diseases.

8.
MOTIVATION: A common problem in the emerging field of metabolomics is the consolidation of signal lists derived from metabolic profiling of different cell/tissue/fluid states where a number of replicate experiments were collected on each state. RESULTS: We describe an approach for the consolidation of peak lists based on hierarchical clustering, first within each set of replicate experiments and then between the sets of replicate experiments. The problems of finding the dendrogram tree cutoff which gives the optimal number of peak clusters and the effect of different clustering methods were addressed. When applied to gas chromatography-mass spectrometry metabolic profiling data acquired on Leishmania mexicana, this approach resulted in robust data matrices which completely separated the wild-type and two mutant parasite lines based on their metabolic profile.
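
One way to picture peak-list consolidation with a dendrogram cutoff is the sketch below. It assumes peaks are described by a single coordinate (for example retention time) and uses single linkage, which the abstract does not specify; for one-dimensional data, cutting a single-linkage dendrogram at height `cutoff` is the same as splitting the sorted positions wherever the gap between neighbours exceeds `cutoff`. The function name is hypothetical.

```python
def consolidate_peaks(positions, cutoff):
    """Group 1-D peak positions into clusters (single linkage cut at `cutoff`)."""
    clusters = []
    for p in sorted(positions):
        # Join the current cluster if the gap to its last member is small enough.
        if clusters and p - clusters[-1][-1] <= cutoff:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters
```

Following the two-stage idea in the abstract, the same function could first be applied to the pooled peaks of each replicate set, and then to the cluster means across sets.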

9.
ABSTRACT: BACKGROUND: The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. RESULTS: We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. CONCLUSIONS: Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems.

10.
Through the use of proboscis-extension reflex conditioning, we demonstrate that honey bees (Apis mellifera L.) bred for hygienic behavior (a behavioral mechanism of disease resistance) are able to discriminate between odors of healthy and diseased brood at a lower stimulus level than bees from a non-hygienic line. Electroantennogram recordings confirmed that hygienic bees exhibit increased olfactory sensitivity to low concentrations of the odor of chalkbrood infected pupae (a fungal disease caused by Ascosphaera apis). Three-week-old hygienic bees were able to discriminate between the brood odors significantly better than three-week-old non-hygienic bees. However, the differential performance in brood odor discrimination was primarily genetically based, not a direct result of age, experience, or the temporary behavioral state of the bee. Lower stimulus thresholds for both the olfactory and behavioral responses of hygienic bees may facilitate their ability to detect, uncap and remove diseased brood rapidly from the nest. In contrast, non-hygienic bees, possessing higher response thresholds, may not be able to detect diseased brood as easily. Our results provide an example of how physiological and behavioral differences between the hygienic and non-hygienic honey bee lines, operating at the level of the individual, could produce colony-specific behavioral phenotypes.

11.
Pre-eclampsia is a multi-system disorder of pregnancy with major maternal and perinatal implications. Emerging therapeutic strategies are most likely to be maximally effective if commenced weeks or even months prior to the clinical presentation of the disease. Although widespread plasma alterations precede the clinical onset of pre-eclampsia, no single plasma constituent has emerged as a sensitive or specific predictor of risk. Consequently, currently available methods of identifying the condition prior to clinical presentation are of limited clinical use. We have exploited genetic programming, a powerful data mining method, to identify patterns of metabolites that distinguish plasma from patients with pre-eclampsia from that taken from healthy, matched controls. High-resolution gas chromatography time-of-flight mass spectrometry (GC-tof-MS) was performed on 87 plasma samples from women with pre-eclampsia and 87 matched controls. Normalised peak intensity data were fed into the Genetic Programming (GP) system which was set up to produce a model that gave an output of 1 for patients and 0 for controls. The model was trained on 50% of the data generated and tested on a separate hold-out set of 50%. The model generated by GP from the GC-tof-MS data identified a metabolomic pattern that could be used to produce two simple rules that together discriminate pre-eclampsia from normal pregnant controls using just 3 of the metabolite peak variables, with a sensitivity of 100% and a specificity of 98%. Thus, pre-eclampsia can be diagnosed at the level of small-molecule metabolism in blood plasma. These findings justify a prospective assessment of metabolomic technology as a screening tool for pre-eclampsia, while identification of the metabolites involved may lead to an improved understanding of the aetiological basis of pre-eclampsia and thus the development of targeted therapies.

12.
The data generated from protein fingerprints with an alpha-helical peptide array were analyzed using several statistical methods such as hierarchical clustering analysis and principal component analysis to discriminate target proteins.

13.
Spectral clustering of protein sequences   Cited by: 1 (self-citations: 0, citations by others: 1)
An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].

14.
Iterative cluster analysis of protein interaction data   Cited by: 3 (self-citations: 0, citations by others: 3)
MOTIVATION: Generation of fast tools of hierarchical clustering to be applied when distances among elements of a set are constrained, causing frequent distance ties, as happens in protein interaction data. RESULTS: We present in this work the program UVCLUSTER, which iteratively explores distance datasets using hierarchical clustering. Once the user selects a group of proteins, UVCLUSTER converts the set of primary distances among them (i.e. the minimum number of steps, or interactions, required to connect two proteins) into secondary distances that measure the strength of the connection between each pair of proteins when the interactions for all the proteins in the group are considered. We show that this novel strategy has advantages over conventional clustering methods to explore protein-protein interaction data. UVCLUSTER easily incorporates the information of the largest available interaction datasets to generate comprehensive primary distance tables. The versatility, simplicity of use and high speed of UVCLUSTER on standard personal computers suggest that it can be a benchmark analytical tool for interactome data analysis. AVAILABILITY: The program is available upon request from the authors, free for academic users. Additional information available at http://www.uv.es/genomica/UVCLUSTER.
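
The "primary distance" defined in this abstract, the minimum number of interaction steps connecting two proteins, is a shortest-path length in an unweighted graph and can be sketched with a breadth-first search. This is not UVCLUSTER's code; the function name and the edge-list input format are assumptions, and the secondary distances are not covered.

```python
from collections import deque


def primary_distances(interactions, source):
    """Minimum number of interaction steps from `source` to each reachable protein."""
    # Build an undirected adjacency map from the list of interacting pairs.
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist
```

Because the distances are small integers, many protein pairs share the same value, which is the source of the frequent distance ties the abstract mentions.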

15.
When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.

16.
17.
Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology ('MCAM') employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression. Overall, we offer MCAM as a broadly applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.

18.
MOTIVATION: Independent component analysis (ICA) is a signal processing technique that can be utilized to recover independent signals from a set of their linear mixtures. We propose ICA for the analysis of signals obtained from large proteomics investigations such as clinical multi-subject studies based on MALDI-TOF MS profiling. The method is validated on simulated and experimental data to demonstrate its capability of correctly extracting protein profiles from MALDI-TOF mass spectra. RESULTS: The comparison on peak detection with an open-source and two commercial methods shows its superior reliability in reducing the false discovery rate of protein peak masses. Moreover, the integration of ICA and statistical tests for detecting the differences in peak intensities between experimental groups makes it possible to identify protein peaks that could be indicators of a diseased state. This data-driven approach proves to be a promising tool for biomarker-discovery studies based on MALDI-TOF MS technology. AVAILABILITY: The MATLAB implementation of the method described in the article and both simulated and experimental data are freely available at http://www.unich.it/proteomica/bioinf/.

19.
Clustering is one of the most powerful tools in computational biology. The conventional wisdom is that events that occur in clusters are probably not random. In protein docking, the underlying principle is that clustering occurs because long-range electrostatic and/or desolvation forces steer the proteins to a low free-energy attractor at the binding region. Something similar occurs in the docking of small molecules, although in this case shorter-range van der Waals forces play a more critical role. Based on the above, we have developed two different clustering strategies to predict docked conformations based on the clustering properties of a uniform sampling of low free-energy protein-protein and protein-small molecule complexes. We report on significant improvements in the automated prediction and discrimination of docked conformations by using the cluster size and consensus as a ranking criterion. We show that the success of clustering depends on identifying the appropriate clustering radius of the system. The clustering radius for protein-protein complexes is consistent with the range of the electrostatics and desolvation free energies (i.e., between 4 and 9 Angstroms); for protein-small molecule docking, the radius is set by van der Waals interactions (i.e., at approximately 2 Angstroms). Without any a priori information, a simple analysis of the histogram of distance separations between the set of docked conformations can evaluate the clustering properties of the data set. Clustering is observed when the histogram is bimodal. Data clustering is optimal if one chooses the clustering radius to be the minimum after the first peak of the bimodal distribution. We show that using this optimal radius further improves the discrimination of near-native complex structures.
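
The rule of taking the clustering radius at the minimum after the first peak of the distance histogram can be sketched as below. The fixed bin width, the simple walk-up and walk-down search, and the function name are assumptions for illustration; the sketch expects a reasonably smooth histogram and does no smoothing itself, so noisy counts could stop the search early.

```python
def clustering_radius(distances, bin_width):
    """Centre of the first histogram minimum that follows the first peak."""
    n_bins = int(max(distances) / bin_width) + 1
    counts = [0] * n_bins
    for d in distances:
        counts[int(d / bin_width)] += 1

    i = 0
    # Climb to the first peak of the histogram.
    while i + 1 < n_bins and counts[i + 1] >= counts[i]:
        i += 1
    # Descend to the minimum that follows it.
    while i + 1 < n_bins and counts[i + 1] < counts[i]:
        i += 1
    return (i + 0.5) * bin_width
```

Here `distances` would hold the pairwise separations between docked conformations, in the same units as the intended radius.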

20.
Until today, a definite diagnosis of Creutzfeldt–Jakob disease (CJD) can only be made neuropathologically. During the patient's lifetime, early and differential diagnosis is often a problem. With SELDI we analyzed cerebrospinal fluid (CSF) from 32 CJD patients, 32 patients having other dementing diseases and 31 non‐demented control subjects for diagnosis‐dependent protein pattern differences. In a screening set of patients, peaks that discriminate best between groups were identified. These peaks were subsequently analyzed using an independent validation set of patients. Diagnostic accuracies were compared with established markers like tau protein and 14‐3‐3‐protein. Potential marker proteins were purified and identified by LC‐MS/MS. In the validation set only one peak of 8.6 kDa out of ten in the screening set could be confirmed. This protein was identified to be ubiquitin and increased levels in CSF (but not in serum) of CJD patients were confirmed by Western blot. Ubiquitin allows the correct diagnosis of those CJD cases missed by tau protein or 14‐3‐3‐protein. We conclude that ubiquitin is a promising additional CSF biomarker for diagnosis of CJD, especially in cases where differential diagnosis is difficult. The selective increase of ubiquitin in CSF of CJD patients might point to an involvement of ubiquitin in the pathophysiological process.
