Similar literature
Found 20 similar articles (search time: 31 ms)
1.

Background  

Recent technological advances in mass spectrometry pose challenges in computational mathematics and statistics to process the mass spectral data into predictive models with clinical and biological significance. We discuss several classification-based approaches to finding protein biomarker candidates using protein profiles obtained via mass spectrometry, and we assess their statistical significance. Our overall goal is to implicate peaks that have a high likelihood of being biologically linked to a given disease state, and thus to narrow the search for biomarker candidates.

2.
The potential for obtaining a true mass spectrometric protein identification result depends on the choice of algorithm as well as on experimental factors that influence the information content in the mass spectrometric data. Current methods can never prove definitively that a result is true, but an appropriate choice of algorithm can provide a measure of the statistical risk that a result is false, i.e., the statistical significance. We recently demonstrated an algorithm, Probity, which assigns the statistical significance to each result. For any choice of algorithm, the difficulty of obtaining statistically significant results depends on the number of protein sequences in the sequence collection searched. By simulations of random protein identifications and using the Probity algorithm, we here demonstrate explicitly how the statistical significance depends on the number of sequences searched. We also provide an example on how the practitioner's choice of taxonomic constraints influences the statistical significance.
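
The dependence on collection size described above can be illustrated with a rough sketch. It is not the Probity model: it assumes, purely for illustration, that each of the N sequences searched reaches the observed score by chance independently with the same probability p. The function name is hypothetical.

```python
import math


def best_hit_false_positive_risk(p_single: float, n_sequences: int) -> float:
    """Chance that at least one of n random sequences reaches the observed score.

    Illustrative assumption: every sequence matches by chance independently
    with the same probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if p_single == 1.0:
        return 1.0 if n_sequences > 0 else 0.0
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_sequences * math.log1p(-p_single))
```

Under this assumption the risk grows with the number of sequences searched, so the same raw score is less significant against a larger collection, and a taxonomic constraint that shrinks the collection lowers the risk.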

3.
MOTIVATION: Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. RESULTS: Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well as or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.

4.
In this work, the commonly used algorithms for mass spectrometry based protein identification, Mascot, MS-Fit, ProFound and SEQUEST, were studied with respect to the selectivity and sensitivity of their searches. The influence of various search parameters was also investigated. Approximately 6600 searches were performed using different search engines with several search parameters to establish a statistical basis. The applied mass spectrometric data set was chosen from a current proteome study. The huge amount of data could only be handled with computational assistance. We present a software solution for fully automated triggering of several peptide mass fingerprinting (PMF) and peptide fragmentation fingerprinting (PFF) algorithms. The development of this high-throughput method made an intensive evaluation based on data acquired in a typical proteome project possible. Previous evaluations of PMF and PFF algorithms were mainly based on simulations.

5.
Rodent tumorigenicity experiments are conducted to determine the safety of substances for human exposure. The carcinogenicity of a substance is generally determined by statistical tests that compare the effects of treatment on the rate of tumor development at several body sites. The statistical analysis of such studies often includes hypothesis testing of the dose effect at each of the sites. However, the multiplicity of the significance tests may cause an excess overall false positive rate. In consideration of this problem, recent interest has focused on developing methods to test simultaneously for the treatment effect at multiple sites. In this paper, we propose a test that is based on the count of tumor-bearing sites. The test is appropriate regardless of tumor lethality or of treatment-related differences in the underlying mortality. Simulations are given which compare the performance of the proposed test to several other tests including a Bonferroni adjustment of site-specific tests, and the test is illustrated using the data from the large ED01 experiment.

6.
An algorithm for protein identification based on mass spectrometric proteolytic peptide mapping and genome database searching is presented. The algorithm ranks database proteins based on direct calculation of the probability of random matching and assigns the statistical significance to each result. We investigate the performance of the algorithm by simulation and show that the algorithm responds to random data in the desired manner and that the statistical significance computed indicates the risk that a particular identification result is false.

7.
Global gel-free proteomic analysis by mass spectrometry has been widely used as an important tool for exploring complex biological systems at the whole genome level. Simultaneous analysis of a large number of protein species is a complicated and challenging task. The challenges exist throughout all stages of a global gel-free proteomic analysis: experimental design, peptide/protein identification, data preprocessing and normalization, and inferential analysis. In addition to various efforts to improve the analytical technologies, statistical methodologies have been applied in all stages of proteomic analyses to help extract relevant information efficiently from large proteomic datasets. In this review, we summarize current applications of statistics in several stages of global gel-free proteomic analysis by mass spectrometry. We discuss the challenges associated with the applications of various statistical tools. Whenever possible, we also propose potential solutions on how to improve the data collection and interpretation for mass-spectrometry-based global proteomic analysis using more sophisticated and/or novel statistical approaches.

8.
Automated methods for assigning peptides to observed tandem mass spectra typically return a list of peptide-spectrum matches, ranked according to an arbitrary score. In this article, we describe methods for converting these arbitrary scores into more useful statistical significance measures. These methods employ a decoy sequence database as a model of the null hypothesis, and use false discovery rate (FDR) analysis to correct for multiple testing. We first describe a simple FDR inference method and then describe how estimating and taking into account the percentage of incorrectly identified spectra in the entire data set can lead to increased statistical power.
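
A minimal sketch of a simple decoy-based FDR estimate follows. It assumes separate target and decoy searches against databases of equal size, and it estimates the FDR at a score threshold as the number of decoy matches divided by the number of target matches at or above that threshold. The function name and this particular estimator are illustrative choices, not necessarily the article's exact procedure, and the refinement that accounts for the percentage of incorrect spectra is not included.

```python
from typing import Optional, Sequence


def decoy_fdr_threshold(
    target_scores: Sequence[float],
    decoy_scores: Sequence[float],
    alpha: float,
) -> Optional[float]:
    """Return the lowest score threshold whose estimated FDR is <= alpha.

    Estimated FDR at threshold t = (#decoy scores >= t) / (#target scores >= t).
    Returns None when no threshold satisfies the bound.
    """
    # Candidate thresholds are the observed target scores, tried from the
    # most permissive (lowest) upwards, so the first hit keeps the most matches.
    for threshold in sorted(set(target_scores)):
        n_target = sum(score >= threshold for score in target_scores)
        n_decoy = sum(score >= threshold for score in decoy_scores)
        # n_target >= 1 here because threshold is itself a target score
        if n_decoy / n_target <= alpha:
            return threshold
    return None
```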

9.
BACKGROUND: HDX mass spectrometry is a powerful platform to probe protein structure dynamics during ligand binding, protein folding, enzyme catalysis, and similar processes. HDX mass spectrometry analysis derives the protein structure dynamics from the mass increase of a protein whose backbone protons have exchanged with solvent deuterium. Coupled with enzyme digestion and MS/MS analysis, HDX mass spectrometry can be used to study the regional dynamics of a protein based on the m/z value or percentage of deuterium incorporation for the digested peptides in the HDX experiments. Various software packages have been developed to analyze HDX mass spectrometry data. Despite this progress, proper and explicit statistical treatment is still lacking in most of the current HDX mass spectrometry software. In order to address this issue, we have developed the HDXanalyzer for the statistical analysis of HDX mass spectrometry data using R, Python, and RPY2. IMPLEMENTATION AND RESULTS: The HDXanalyzer package contains three major modules: the data processing module, the statistical analysis module, and the user interface. RPY2 is employed to enable the connection of these three components, where the data processing module is implemented using Python and the statistical analysis module is implemented with R. RPY2 creates a low-level interface for R and allows the effective integration of the statistical module for data processing. The data processing module generates the centroid for the peptides in the form of an m/z value, and the differences of centroids between the peptides derived from apo and ligand-bound protein allow us to evaluate whether the regions have significant changes in structure dynamics or not. Another option of the software is to calculate the deuterium incorporation rate for the comparison. The two types of statistical analyses are the paired Student's t-test and the linear combination of the intercept for multiple regression and ANCOVA model.
The user interface is implemented with wxPython to facilitate the data visualization in graphs and the statistical analysis output presentation. In order to evaluate the software, a previously published xylanase HDX mass spectrometry analysis dataset is processed and presented. The results from the different statistical analysis methods are compared and shown to be similar. The statistical analysis results are overlaid with the three-dimensional structure of the protein to highlight the regional structure dynamics changes in the xylanase enzyme. CONCLUSION: Statistical analysis provides crucial evaluation of whether a protein region is significantly protected or unprotected during the HDX mass spectrometry studies. Although there are several other available software programs to process HDX experimental data, HDXanalyzer is the first software program to offer multiple statistical methods to evaluate the changes in protein structure dynamics based on HDX mass spectrometry analysis. Moreover, the statistical analysis can be carried out for both m/z value and deuterium incorporation rate. In addition, the software package can be used for the data generated from a wide range of mass spectrometry instruments.
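
The paired comparison of centroids can be sketched as below. This is not HDXanalyzer code (the package delegates its statistics to R through RPY2). It is a standard-library illustration that assumes one centroid m/z value per time point for the apo and the ligand-bound peptide, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    apo: Sequence[float], bound: Sequence[float]
) -> Tuple[float, int]:
    """Paired Student's t statistic and degrees of freedom for centroid pairs."""
    if len(apo) != len(bound) or len(apo) < 2:
        raise ValueError("need at least two matched pairs")
    # Per-pair centroid shift (bound minus apo)
    diffs = [b - a for a, b in zip(apo, bound)]
    spread = stdev(diffs)  # sample standard deviation of the shifts
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_value = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1
```

The p-value then comes from Student's t distribution with the returned degrees of freedom, which the Python standard library does not provide. In the package described above that step would fall to R.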

10.
Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted. The results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and better than the database statistical significance estimates reported by popular database search programs like BLAST, PSI-BLAST (without pretrained PSSMs), and SSEARCH on a benchmark database. With pretrained PSSMs, however, PSI-BLAST results are significantly better. Further, using position-specific substitution matrices for estimating pairwise statistical significance gives significantly better results even than PSI-BLAST using pretrained PSSMs.

11.
12.
Eriksson J, Fenyö D. Proteomics 2002, 2(3):262-270
A rapid and accurate method for testing the significance of protein identities determined by mass spectrometric analysis of protein digests and genome database searching is presented. The method is based on direct computation using a statistical model of the random matching of measured and theoretical proteolytic peptide masses. Protein identification algorithms typically rank the proteins of a genome database according to a score based on the number of matches between the masses obtained by mass spectrometry analysis and the theoretical proteolytic peptide masses of a database protein. The random matching of experimental and theoretical masses can cause false results. A result is significant only if the score characterizing the result deviates significantly from the score expected from a false result. A distribution of the score (number of matches) for random (false) results is computed directly from our model of the random matching, which allows significance testing under any experimental and database search constraints. In order to mimic protein identification data quality in large-scale proteome projects, low-to-high quality proteolytic peptide mass data were generated in silico and subsequently submitted to a database search program designed to include significance testing based on direct computation. This simulation procedure demonstrates the usefulness of direct significance testing for automatically screening for samples that must be subjected to peptide sequence analysis by e.g. tandem mass spectrometry in order to determine the protein identity.

13.
Summary. In this article, we apply the recently developed Bayesian wavelet-based functional mixed model methodology to analyze MALDI-TOF mass spectrometry proteomic data. By modeling mass spectra as functions, this approach avoids reliance on peak detection methods. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while adjusting for clinical or experimental covariates that may affect both the intensities and locations of peaks in the spectra. For example, this provides a straightforward way to account for systematic block and batch effects that characterize these data. From the model output, we identify spectral regions that are differentially expressed across experimental conditions, in a way that takes both statistical and clinical significance into account and controls the Bayesian false discovery rate to a prespecified level. We apply this method to two cancer studies.

14.
In this work we show that in genome-wide association studies (GWAS) there is a strong bias in favor of genes covered by larger numbers of SNPs. Thus, we state here that there is a need to correct for such bias when performing downstream gene-level analysis, e.g. pathway analysis and gene-set analysis. We investigate several methods of obtaining gene-level statistical significance in GWAS, and compare their effectiveness in correcting such bias. We also propose a simple algorithm based on the first order statistic that corrects such bias.
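
One way such a correction based on the first order statistic could look is sketched below. It is not necessarily the authors' algorithm. It assumes that the n SNP p-values of a gene are independent and uniform under the null, which linkage disequilibrium violates in real data, so that the smallest of them has distribution function 1 - (1 - x)^n. The function name is hypothetical.

```python
from typing import Sequence


def gene_level_p_from_min(snp_p_values: Sequence[float]) -> float:
    """Gene-level p-value from the smallest SNP p-value.

    Illustrative assumption: the n SNP p-values are independent and uniform
    under the null, so P(min <= x) = 1 - (1 - x)^n.
    """
    if not snp_p_values:
        raise ValueError("at least one SNP p-value is required")
    n_snps = len(snp_p_values)
    p_min = min(snp_p_values)
    return 1.0 - (1.0 - p_min) ** n_snps
```

With the same smallest p-value, a gene covered by more SNPs receives a larger corrected p-value, which counteracts the bias toward densely covered genes.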

15.
Ossipova E, Fenyö D, Eriksson J. Proteomics 2006, 6(7):2079-2085
The two central problems in protein identification by searching a protein sequence collection with MS data are the optimal use of experimental information to allow for identification of low abundance proteins and the accurate assignment of the probability that a result is false. For comprehensive MS-based protein identification, it is necessary to choose an appropriate algorithm and optimal search conditions. We report a systematic study of the quality of PMF-based protein identifications under different sequence collection search conditions using the Probity algorithm, which assigns the statistical significance to each result. We employed 2244 PMFs from 2-DE-separated human blood plasma proteins, and performed identification under various search constraints: mass accuracy (0.01-0.3 Da), maximum number of missed cleavage sites (0-2), and size of the sequence collection searched (5.6 × 10^4 to 1.8 × 10^5). By counting the number of significant results (significance levels 0.05, 0.01, and 0.001) for each condition, we demonstrate the search condition impact on the successful outcome of proteome analysis experiments. A mass correction procedure utilizing mass deviations of albumin matching peptides was tested in an attempt to improve the statistical significance of identifications, and iterative searching was employed for identification of multiple proteins from each PMF.

16.
17.
18.
Summary. Recently, it has been suggested that an association exists between breakpoints involved in constitutional rearrangements and fragile sites; however, statistical analyses of this relationship are controversial. We have analyzed 1200 breakpoints from different constitutional rearrangements, 1522 breakpoints with respect to their recurrence and 217 breakpoints from sperm chromosomes as reported by several authors. The coincidence between breakpoints and fragile sites was 35.3%, 43.6% and 41.9%, respectively. The statistical significance of these coincidences depends on whether factors such as the relative length of the bands or the recurrence of the rearrangements are taken into account.

19.
Diao G, Lin DY. Biometrics 2005, 61(3):789-798
Statistical methods for the detection of genes influencing quantitative traits with the aid of genetic markers are well developed for normally distributed, fully observed phenotypes. Many experiments are concerned with failure-time phenotypes, which have skewed distributions and which are usually subject to censoring because of random loss to follow-up, failures from competing causes, or limited duration of the experiment. In this article, we develop semiparametric statistical methods for mapping quantitative trait loci (QTLs) based on censored failure-time phenotypes. We formulate the effects of the QTL genotype on the failure time through the Cox (1972, Journal of the Royal Statistical Society, Series B 34, 187-220) proportional hazards model and derive efficient likelihood-based inference procedures. In addition, we show how to assess statistical significance when searching several regions or the entire genome for QTLs. Extensive simulation studies demonstrate that the proposed methods perform well in practical situations. Applications to two animal studies are provided.

20.
Statistical analysis of diversification with species traits   Total citations: 1, self-citations: 0, citations by others: 1
Testing whether some species traits have a significant effect on diversification rates is central in the assessment of macroevolutionary theories. However, we still lack a powerful method to tackle this objective. I present a new method for the statistical analysis of diversification with species traits. The required data are observations of the traits on recent species, the phylogenetic tree of these species, and reconstructions of ancestral values of the traits. Several traits, either continuous or discrete, and in some cases their interactions, can be analyzed simultaneously. The parameters are estimated by the method of maximum likelihood. The statistical significance of the effects in a model can be tested with likelihood ratio tests. A simulation study showed that past random extinction events do not affect the Type I error rate of the tests, whereas statistical power is decreased, though some power is still kept if the effect of the simulated trait on speciation is strong. The use of the method is illustrated by the analysis of published data on primates. The analysis of these data showed that the apparent overall positive relationship between body mass and species diversity is actually an artifact due to a clade-specific effect. Within each clade the effect of body mass on speciation rate was in fact negative. The present method makes it possible to take both effects (clade and body mass) into account simultaneously.
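
The likelihood ratio test mentioned above can be sketched for the simplest case, in which the larger model adds a single parameter (for example, one trait effect) to the smaller one. The sketch assumes the usual chi-square approximation with one degree of freedom. The function name is hypothetical, and the log-likelihoods are taken as given rather than fitted here.

```python
import math
from typing import Tuple


def likelihood_ratio_test_1df(
    loglik_null: float, loglik_alt: float
) -> Tuple[float, float]:
    """Likelihood ratio statistic and p-value for one extra parameter.

    Assumes nested models and the chi-square approximation with 1 degree of
    freedom, whose upper tail at x equals erfc(sqrt(x / 2)).
    """
    # Clamp at zero: the nested null cannot truly fit better than the
    # alternative, so a negative value only reflects numerical noise.
    statistic = max(2.0 * (loglik_alt - loglik_null), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```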
