20 similar articles found (search time: 0 ms)
1.
BACKGROUND: Comparing distributions of data is an important goal in many applications. For example, determining whether two samples (e.g., a control and test sample) are statistically significantly different is useful to detect a response, or to provide feedback regarding instrument stability by detecting when collected data varies significantly over time. METHODS: We apply a variant of the chi-squared statistic to comparing univariate distributions. In this variant, a control distribution is divided such that an equal number of events fall into each of the divisions, or bins. This approach is thereby a mini-max algorithm, in that it minimizes the maximum expected variance for the control distribution. The control-derived bins are then applied to test sample distributions, and a normalized chi-squared value is computed. We term this algorithm Probability Binning. RESULTS: Using a Monte-Carlo simulation, we determined the distribution of chi-squared values obtained by comparing sets of events derived from the same distribution. Based on this distribution, we derive a conversion of any given chi-squared value into a metric that is analogous to a t-score, i.e., it can be used to estimate the probability that a test distribution is different from a control distribution. We demonstrate that this metric scales with the difference between two distributions, and can be used to rank samples according to similarity to a control. Finally, we demonstrate the applicability of this metric to ranking immunophenotyping distributions to suggest that it indeed can be used to objectively determine the relative distance of distributions compared to a single control. CONCLUSION: Probability Binning, as shown here, provides a useful metric for determining the probability that two or more flow cytometric data distributions are different. This metric can also be used to rank distributions to identify which are most similar or dissimilar. 
In addition, the algorithm can be used to quantitate contamination of even highly-overlapping populations. Finally, as demonstrated in an accompanying paper, Probability Binning can be used to gate on events that represent significantly different subsets from a control sample. Published 2001 Wiley-Liss, Inc.
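The binning step described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the statistic used here, the sum over bins of (c_i - s_i)^2 / (c_i + s_i) with c_i and s_i the control and test fractions in bin i, is an assumed form of a normalized chi-squared value, since the abstract does not give the formula.

```python
import bisect


def control_bin_edges(control, n_bins):
    """Interior bin edges placed so each bin holds about the same number of control events."""
    ordered = sorted(control)
    n = len(ordered)
    # Edge k sits at the k/n_bins quantile of the control sample.
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def bin_fractions(sample, edges):
    """Fraction of the sample's events that fall into each control-derived bin."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        # Values equal to an edge are assigned to the bin on its right.
        counts[bisect.bisect_right(edges, x)] += 1
    total = len(sample)
    return [c / total for c in counts]


def normalized_chi_squared(control, test, n_bins=10):
    """Assumed statistic: sum of (c_i - s_i)^2 / (c_i + s_i) over the bins."""
    edges = control_bin_edges(control, n_bins)
    c = bin_fractions(control, edges)
    s = bin_fractions(test, edges)
    return sum((ci - si) ** 2 / (ci + si) for ci, si in zip(c, s) if ci + si > 0)
```

Comparing a control against itself gives a value of zero, and the value grows as the test distribution moves away from the control, which is the ranking behaviour the abstract describes. Converting the value into the t-score-like metric would additionally need the Monte-Carlo calibration mentioned in the abstract, which is not reproduced here.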
2.
Baggerly KA 《Cytometry》2001,45(2):141-150
BACKGROUND: A key problem in immunohistochemistry is assessing when two sample histograms are significantly different. One test that is commonly used for this purpose in the univariate case is the chi-squared test. Comparing multivariate distributions is qualitatively harder, as the "curse of dimensionality" means that the number of bins can grow exponentially. For the chi-squared test to be useful, data-dependent binning methods must be employed. An example of how this can be done is provided by the "probability binning" method of Roederer et al. (1,2,3). METHODS: We derive the theoretical distribution of the probability binning statistic, giving it a more rigorous foundation. We show that the null distribution is a scaled chi-square, and show how it can be related to the standard chi-squared statistic. RESULTS: A small simulation shows how the theoretical results can be used to (a) modify the probability binning statistic to make it more sensitive and (b) suggest variant statistics which, while still exploiting the data-dependent strengths of the probability binning procedure, may be easier to work with. CONCLUSIONS: The probability binning procedure effectively uses adaptive binning to locate structure in high-dimensional data. The derivation of a theoretical basis provides a more detailed interpretation of its behavior and renders the probability binning method more flexible.
3.
4.
Fisher information for a multivariate extreme value distribution (total citations: 7; self-citations: 0; citations by others: 7)
Explicit algebraic formulae for the Fisher information matrix of the multivariate extreme value distribution with generalised extreme value margins and logistic dependence structure are given.
5.
6.
7.
Using radius frequency distribution functions as a metric for quantifying root systems (total citations: 1; self-citations: 0; citations by others: 1)
Root radius frequency distributions have been measured to quantify the effect of plant type, environment and methodology on root systems; however, to date the results of such studies have not been synthesised. We propose that cumulative frequency distribution functions can be used as a metric to describe root systems because (1) statistical properties of the frequency distribution can be determined, (2) the parameters for these can be used as a means of comparison, and (3) the analytical expressions can be easily incorporated into models that are dependent upon root geometry. We collated a database of 96 root radii frequency distributions and botanical and methodology traits for each distribution. To determine if there was a frequency distribution function that was best suited to root radii measurements we fitted the exponential, Rayleigh, normal, log-normal, logistic and Weibull cumulative distribution functions to each distribution in our database. We found that the log-normal function provided the best fit to these distributions and that none of the distribution functions was better or worse suited to particular shapes. We derived analytical expressions for root surface and volume and found that they are a valid and simpler method for incorporating root architectural traits into analytical models. We also found that growth habit and growth media had a significant effect on mean root radius.
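As an illustration of fitting a log-normal cumulative distribution function to a set of root radii, the Python sketch below estimates the two parameters from the logarithms of the radii and measures how far the fitted curve lies from the empirical cumulative frequencies. The abstract does not say how the fits were obtained, so the maximum-likelihood estimate used here is an assumption, and all function names are hypothetical.

```python
import math


def fit_lognormal(radii):
    """Maximum-likelihood log-normal parameters: mean and std. dev. of log(radius)."""
    logs = [math.log(r) for r in radii]
    mu = sum(logs) / len(logs)
    variance = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(variance)


def lognormal_cdf(r, mu, sigma):
    """Log-normal cumulative distribution function evaluated at radius r."""
    return 0.5 * (1.0 + math.erf((math.log(r) - mu) / (sigma * math.sqrt(2.0))))


def max_cdf_gap(radii, mu, sigma):
    """Largest gap between the empirical and fitted CDFs (a Kolmogorov-Smirnov-style distance)."""
    ordered = sorted(radii)
    n = len(ordered)
    gap = 0.0
    for i, r in enumerate(ordered):
        fitted = lognormal_cdf(r, mu, sigma)
        # The empirical CDF steps from i/n to (i + 1)/n at each observed radius.
        gap = max(gap, abs(fitted - i / n), abs(fitted - (i + 1) / n))
    return gap
```

Repeating the same procedure with other candidate functions and comparing the resulting gaps is one simple way to decide which function describes a measured distribution best.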
8.
Paul E. Anderson, Nicholas V. Reo, Nicholas J. DelRaso, Travis E. Doom, Michael L. Raymer 《Metabolomics : Official journal of the Metabolomic Society》2008,4(3):261-272
In many metabolomics studies, NMR spectra are divided into bins of fixed width. This spectral quantification technique, known
as uniform binning, is used to reduce the number of variables for pattern recognition techniques and to mitigate effects from
variations in peak positions; however, shifts in peaks near the boundaries can cause dramatic quantitative changes in adjacent
bins due to non-overlapping boundaries. Here we describe a new Gaussian binning method that incorporates overlapping bins
to minimize these effects. A Gaussian kernel weights the signal contribution relative to distance from bin center, and the
overlap between bins is controlled by the kernel standard deviation. Sensitivity to peak shift was assessed for a series of
test spectra where the offset frequency was incremented in 0.5 Hz steps. For a 4 Hz shift within a bin width of 24 Hz, the
error for uniform binning increased by 150%, while the error for Gaussian binning increased by 50%. Further, using a urinary
metabolomics data set (from a toxicity study) and principal component analysis (PCA), we showed that the information content
in the quantified features was equivalent for Gaussian and uniform binning methods. The separation between groups in the PCA
scores plot, measured by the J2 quality metric, is as good or better for Gaussian binning versus uniform binning.
The Gaussian method is shown to be robust
with regard to peak shift, while still retaining the information needed by classification and multivariate statistical techniques
for NMR-metabolomics data.
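The contrast between uniform and Gaussian binning described in this abstract can be sketched in Python as follows. This is a simplified illustration, not the published method: the function names are hypothetical, the kernel is left unnormalized, and bin centers are assumed to sit at the midpoints of the uniform bins.

```python
import math


def uniform_binning(freqs, intensities, start, width, n_bins):
    """Sum each signal point into the single fixed-width bin that contains it."""
    bins = [0.0] * n_bins
    for f, y in zip(freqs, intensities):
        k = int((f - start) // width)
        if 0 <= k < n_bins:
            bins[k] += y
    return bins


def gaussian_binning(freqs, intensities, start, width, n_bins, sigma):
    """Weight each signal point by a Gaussian of its distance from every bin center."""
    centers = [start + (k + 0.5) * width for k in range(n_bins)]
    bins = []
    for c in centers:
        # A larger sigma gives more overlap between neighbouring bins.
        weighted = sum(
            y * math.exp(-((f - c) ** 2) / (2.0 * sigma ** 2))
            for f, y in zip(freqs, intensities)
        )
        bins.append(weighted)
    return bins
```

With 24 Hz bins, moving a single peak from 23 Hz to 25 Hz carries it across a boundary: the uniform bins switch abruptly from one bin to the next, while the Gaussian-weighted values change only gradually.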
9.
10.
T. G. Pottinger 《Journal of fish biology》2010,76(3):601-621
The response of six species of freshwater fishes, from the families Cyprinidae (common carp Cyprinus carpio, roach Rutilus rutilus and chub Leuciscus cephalus) and Salmonidae (rainbow trout Oncorhynchus mykiss, brown trout Salmo trutta and Arctic charr Salvelinus alpinus), to a standardized stressor was evaluated. A 6 h period of confinement resulted in changes to plasma cortisol, glucose, amino acid and lactate levels compared with unconfined controls. There were significant differences in the response profiles both within and between families. The cyprinid species exhibited higher and more sustained stress‐induced increases in plasma cortisol and glucose than the salmonid species. In cyprinids, plasma lactate and plasma amino acid concentration showed less disturbance following stress than in salmonids. The results of the study, together with an evaluation of previously published data for eight salmonid species and six cyprinid species, support the hypothesis that differences in core elements of the stress response exist between species of fishes, and that this variation may have a systematic basis.
11.
12.
13.
Next-generation sequencing (NGS) technologies allow the sequencing of microbial communities directly from the environment without prior culturing. The output of environmental DNA sequencing consists of many reads from genomes of different unknown species, making the clustering of reads from the same (or similar) species (also known as binning) a crucial step. The difficulties of the binning problem are due to the following four factors: (1) the lack of reference genomes; (2) uneven abundance ratio of species; (3) short NGS reads; and (4) a large number of species (there can be more than a hundred). None of the existing binning tools can handle all four factors. No tools, including both AbundanceBin and MetaCluster 3.0, have demonstrated reasonable performance on a sample with more than 20 species. In this article, we introduce MetaCluster 4.0, an unsupervised binning algorithm that can accurately (with about 80% precision and sensitivity in all cases and at least 90% in some cases) and efficiently bin short reads with varying abundance ratios and is able to handle datasets with 100 species. The novelty of MetaCluster 4.0 stems from solving a few important problems: how to divide reads into groups by a probabilistic approach, how to estimate the 4-mer distribution of each group, how to estimate the number of species, and how to modify MetaCluster 3.0 to handle a large number of species. We show that MetaCluster 4.0 is effective for both simulated and real datasets. Supplementary Material is available at www.liebertonline.com/cmb.
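To make the notion of a 4-mer distribution concrete, the Python sketch below counts every length-4 substring across a group of reads and normalizes the counts to frequencies. It only illustrates the quantity being estimated: the function name is hypothetical, skipping k-mers with ambiguous bases is an assumption, and the estimation procedure inside MetaCluster 4.0 is not described in the abstract and is not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_distribution(reads, k=4):
    """Relative frequency of every k-mer over the alphabet ACGT in a group of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: k-mers containing ambiguous bases such as N are skipped.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}
```

For k = 4 the result is a vector of 256 frequencies, and groups of reads can then be compared by the distance between their vectors.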
14.
Stephanie J. Brodie, James T. Thorson, Gemma Carroll, Elliott L. Hazen, Steven Bograd, Melissa A. Haltuch, Kirstin K. Holsman, Stan Kotwicki, Jameal F. Samhouri, Ellen Willis-Norton, Rebecca L. Selden 《Ecography》2020,43(1):11-24
Species distribution models (SDMs) are a common approach to describing species’ space-use and spatially-explicit abundance. With a myriad of model types, methods and parameterization options available, it is challenging to make informed decisions about how to build robust SDMs appropriate for a given purpose. One key component of SDM development is the appropriate parameterization of covariates, such as the inclusion of covariates that reflect underlying processes (e.g. abiotic and biotic covariates) and covariates that act as proxies for unobserved processes (e.g. space and time covariates). It is unclear how different SDMs apportion variance among a suite of covariates, and how parameterization decisions influence model accuracy and performance. To examine trade-offs in covariation parameterization in SDMs, we explore the attribution of spatiotemporal and environmental variation across a suite of SDMs. We first used simulated species distributions with known environmental preferences to compare three types of SDM: a machine learning model (boosted regression tree), a semi-parametric model (generalized additive model) and a spatiotemporal mixed-effects model (vector autoregressive spatiotemporal model, VAST). We then applied the same comparative framework to a case study with three fish species (arrowtooth flounder, pacific cod and walleye pollock) in the eastern Bering Sea, USA. Model type and covariate parameterization both had significant effects on model accuracy and performance. We found that including either spatiotemporal or environmental covariates typically reproduced patterns of species distribution and abundance across the three models tested, but model accuracy and performance was maximized when including both spatiotemporal and environmental covariates in the same model framework. 
Our results reveal trade-offs in the current generation of SDM tools between accurately estimating species abundance, accurately estimating spatial patterns, and accurately quantifying underlying species–environment relationships. These comparisons between model types and parameterization options can help SDM users better understand sources of model bias and estimate error.
15.
Previous behaviour genetic studies of aggression have yielded inconsistent results: reported heritabilities for different types of aggressive behaviour range from 0 to 0.98. In the present study, 247 adult twin pairs (183 MZ pairs; 64 same-sex DZ pairs) were administered seven self-report questionnaires which yielded 18 measures of aggression. Univariate genetic analyses showed moderate to high heritabilities for 14 of these 18 measures and for a general aggression factor and three correlated aggression factors extracted from the measures. Multivariate genetic analyses showed sizeable genetic correlations between the different dimensions of aggression. Thus, individual differences in many types of aggressive behaviour are attributable to some extent to genetic factors and there is considerable overlap between the genes that operate on different types of aggressive behaviour.
16.
17.
Cavelaars AE, Kunst AE, Geurts JJ, Crialesi R, Grötvedt L, Helmert U, Lahelma E, Lundberg O, Matheson J, Mielck A, Rasmussen NK, Regidor E, do Rosário-Giraldes M, Spuhler T, Mackenbach JP 《BMJ (Clinical research ed.)》2000,320(7242):1102-1107
OBJECTIVE: To investigate international variations in smoking associated with educational level. DESIGN: International comparison of national health, or similar, surveys. SUBJECTS: Men and women aged 20 to 44 years and 45 to 74 years. SETTING: 12 European countries, around 1990. RESULTS: In the 45 to 74 year age group, higher rates of current and ever smoking among lower educated subjects were found in some countries only. Among women this was found in Great Britain, Norway, and Sweden, whereas an opposite pattern, with higher educated women smoking more, was found in southern Europe. Among men a similar north-south pattern was found but it was less noticeable than among women. In the 20 to 44 year age group, educational differences in smoking were generally greater than in the older age group, and smoking rates were higher among lower educated people in most countries. Among younger women, a similar north-south pattern was found as among older women. Among younger men, large educational differences in smoking were found for northern European as well as for southern European countries, except for Portugal. CONCLUSIONS: These international variations in social gradients in smoking, which are likely to be related to differences between countries in their stage of the smoking epidemic, may have contributed to the socioeconomic differences in mortality from ischaemic heart disease being greater in northern European countries. The observed age patterns suggest that socioeconomic differences in diseases related to smoking will increase in the coming decades in many European countries.
18.
19.
Paul E. Anderson, Deirdre A. Mahle, Travis E. Doom, Nicholas V. Reo, Nicholas J. DelRaso, Michael L. Raymer 《Metabolomics : Official journal of the Metabolomic Society》2011,7(2):179-190
The interpretation of nuclear magnetic resonance (NMR) experimental results for metabolomics studies requires intensive signal
processing and multivariate data analysis techniques. A key step in this process is the quantification of spectral features,
which is commonly accomplished by dividing an NMR spectrum into several hundred integral regions or bins. Binning attempts
to minimize effects from variations in peak positions caused by sample pH, ionic strength, and composition, while reducing
the dimensionality for multivariate statistical analyses. Herein we develop an improved novel spectral quantification technique,
dynamic adaptive binning. With this technique, bin boundaries are determined by optimizing an objective function using a dynamic
programming strategy. The objective function measures the quality of a bin configuration based on the number of peaks per
bin. This technique shows a significant improvement over both traditional uniform binning and other adaptive binning techniques.
This improvement is quantified via synthetic validation sets by analyzing an algorithm’s ability to create bins that do not
contain more than a single peak and that maximize the distance from peak to bin boundary. The validation sets are developed
by characterizing the salient distributions in experimental NMR spectroscopic data. Further, dynamic adaptive binning is applied
to a 1H NMR-based experiment to monitor rat urinary metabolites to empirically demonstrate improved spectral quantification.