Similar Documents
20 similar documents retrieved (search time: 31 ms)
1.
MOTIVATION AND RESULTS: Durbin et al. (2002), Huber et al. (2002) and Munson (2001) independently introduced a family of transformations (the generalized-log family) which stabilizes the variance of microarray data up to the first order. We introduce a method for estimating the transformation parameter in tandem with a linear model based on the procedure outlined in Box and Cox (1964). We also discuss means of finding transformations within the generalized-log family which are optimal under other criteria, such as minimum residual skewness and minimum mean-variance dependency. AVAILABILITY: R and Matlab code and test data are available from the authors on request.
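To make the estimation idea concrete, here is a minimal R sketch, not the authors' Box-Cox-based procedure: it applies the generalized-log transform glog(y, c) = log((y + sqrt(y^2 + c^2))/2) to simulated replicated intensities and picks the parameter c on a grid by minimizing the dependence of the replicate standard deviation on the replicate mean (one of the alternative criteria the abstract mentions). All data and constants are illustrative.

```r
# Generalized-log (glog) transform with a simple grid search for its parameter.
# Sketch only: the paper estimates the parameter jointly with a linear model via
# a Box-Cox-type procedure; here the criterion is minimum mean-variance dependency.
glog <- function(y, c) log((y + sqrt(y^2 + c^2)) / 2)

# Toy data: 500 "genes" x 4 replicates with additive + multiplicative noise.
set.seed(1)
mu <- rexp(500, rate = 1 / 1000)
y  <- sapply(1:4, function(i) mu * exp(rnorm(500, sd = 0.2)) + rnorm(500, sd = 50))

mean_var_dependence <- function(c, y) {
  z <- glog(y, c)
  abs(cor(rowMeans(z), apply(z, 1, sd), method = "spearman"))
}

grid  <- 10^seq(0, 4, length.out = 40)      # candidate values of the parameter c
score <- sapply(grid, mean_var_dependence, y = y)
c_hat <- grid[which.min(score)]
c_hat                                        # chosen transformation parameter
```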

2.
Phylogenetic methods for the analysis of species data are widely used in evolutionary studies. However, preliminary data transformations and data reduction procedures (such as a size-correction and principal components analysis, PCA) are often performed without first correcting for nonindependence among the observations for species. In the present short comment and attached R and MATLAB code, I provide an overview of statistically correct procedures for phylogenetic size-correction and PCA. I also show that ignoring phylogeny in preliminary transformations can result in significantly elevated variance and type I error in our statistical estimators, even if subsequent analysis of the transformed data is performed using phylogenetic methods. This means that ignoring phylogeny during preliminary data transformations can possibly lead to spurious results in phylogenetic statistical analyses of species data.
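As one concrete route to the "statistically correct" preliminary steps described above, the sketch below computes phylogenetically independent contrasts for each trait and then runs an ordinary PCA on the contrasts rather than on the raw species values. It uses the ape package with a simulated tree and Brownian-motion traits; it is an illustration of the principle, not the code released with the comment.

```r
# PCA on phylogenetically independent contrasts (PICs) instead of raw species
# values -- one way to avoid the inflated variance / type I error that arises
# when phylogeny is ignored during preliminary data reduction.
library(ape)

set.seed(2)
tree   <- rtree(50)                              # random 50-species phylogeny
traits <- replicate(4, rTraitCont(tree))         # 4 traits evolved by Brownian motion
rownames(traits) <- tree$tip.label

# Independent contrasts for each trait (one column of contrasts per trait).
pic_mat <- apply(traits, 2, pic, phy = tree)

# Ordinary PCA on the contrasts; no centering, since contrasts have expectation
# zero under the Brownian-motion model.
pca <- prcomp(pic_mat, center = FALSE)
summary(pca)
```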

3.
MOTIVATION: Standard statistical techniques often assume that data are normally distributed, with constant variance not depending on the mean of the data. Data that violate these assumptions can often be brought in line with the assumptions by application of a transformation. Gene-expression microarray data have a complicated error structure, with a variance that changes with the mean in a non-linear fashion. Log transformations, which are often applied to microarray data, can inflate the variance of observations near background. RESULTS: We introduce a transformation that stabilizes the variance of microarray data across the full range of expression. Simulation studies also suggest that this transformation approximately symmetrizes microarray data.

4.
We studied the fulfilment of assumptions of normality and homogeneity of error variance, prior to application of analysis of variance (ANOVA), for in vitro clonal propagation data. We assessed the use of data transformations and mean values for situations when the original data did not satisfy the required assumptions. The purpose of the study was to establish whether the use of original, transformed or mean values had any effect on F values, significance levels and clonal heritability values. The F values, significance levels and values of clonal heritability obtained showed analysis of variance to be reliable, despite deviations with respect to normality and homogeneity of variance and despite the fact that sample sizes were unequal. Original data may be used for ANOVA applied to measured variables such as number of shoots per explant, length of tallest shoot, number of 1-cm segments per explant and also derived variables such as the multiplication coefficient. Frequency data can be used for analysis of variance of categorical-type variables such as apical necrosis and percentage of responsive explants. For shoot colour variables, the distributions were very skewed and the variances were very different, but even though the sample sizes were not identical in all cases, lack of homogeneity of variance did not significantly affect F values, significance levels or clonal heritability values, and thus analysis of variance may be applied to the original data. The use of original and frequency data makes interpretation of the results easier than when transformed data are used and also allows us to calculate variance components more accurately than when using mean values, which do not provide as much information. Clonal heritability values from transformed data and mean values showed differences of less than one hundredth compared with those from original data. Box–Cox-transformed data showed slightly lower heritability values than those corresponding to original data, whereas clonal heritability values from both mean data and angular-transformed data were slightly higher than those obtained using original data. In clonal variability studies with a single growth medium, nutritional conditions that encouraged highly unequal growth or characteristics among clones gave rise to data that were unlikely to satisfy the conditions of normality or homogeneity of variance.
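The sketch below shows the kind of assumption checking the study describes, on simulated shoot counts per clone: test normality of the ANOVA residuals and homogeneity of variance across clones, on both the original and a transformed scale. The clones, counts and the square-root transformation are illustrative choices, not the study's data or procedures.

```r
# Checking ANOVA assumptions on original vs. transformed in vitro propagation
# data -- illustrative only, with simulated shoot counts per clone.
set.seed(3)
clone  <- factor(rep(paste0("C", 1:6), each = 20))
shoots <- rpois(120, lambda = rep(c(2, 3, 4, 6, 8, 12), each = 20))  # unequal clone means
dat    <- data.frame(clone, shoots)

check_assumptions <- function(response, group) {
  fit <- aov(response ~ group)
  list(shapiro  = shapiro.test(residuals(fit))$p.value,   # normality of residuals
       bartlett = bartlett.test(response, group)$p.value, # homogeneity of variance
       anova_F  = summary(fit)[[1]][["F value"]][1])
}

check_assumptions(dat$shoots, dat$clone)              # original counts
check_assumptions(sqrt(dat$shoots + 0.5), dat$clone)  # a common count transformation
```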

5.
MOTIVATION: A variance stabilizing transformation for microarray data was recently introduced independently by several research groups. This transformation has sometimes been called the generalized logarithm or glog transformation. In this paper, we derive several alternative approximate variance stabilizing transformations that may be easier to use in some applications. RESULTS: We demonstrate that the started-log and the log-linear-hybrid transformation families can produce approximate variance stabilizing transformations for microarray data that are nearly as good as the generalized logarithm (glog) transformation. These transformations may be more convenient in some applications.
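A small sketch of the two approximate families named in the abstract, alongside the glog, with illustrative constants (not the paper's recommended parameter choices): the started-log log(x + c), and a log-linear hybrid that is logarithmic above a cutoff k and linear, matched in value and slope, below it.

```r
# Approximate variance stabilizers from the two families named in the abstract,
# compared with the generalized log (glog). Constants are illustrative only.
glog        <- function(x, c) log((x + sqrt(x^2 + c^2)) / 2)
started_log <- function(x, c) log(x + c)                  # "started log" family
loglin_hyb  <- function(x, k) ifelse(x > k,               # log-linear hybrid:
                                     log(x),              #   log above the cutoff k,
                                     x / k + log(k) - 1)  #   linear (continuous in value
                                                          #   and slope) below it
x <- seq(0, 2000, by = 10)
cbind(x       = head(x),
      glog    = head(glog(x, 100)),
      started = head(started_log(x, 100)),
      hybrid  = head(loglin_hyb(x, 100)))
```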

6.
Non-normality in the distribution of individual observations of production and quality traits in forest tree breeding may cause inaccurate selection and overestimation of predicted selection gain. The distribution of individual observations of traits such as height, diameter, branch diameter, branch angle and number of branches per whorl is not always normal. We investigated how the observations were distributed and to what degree it is possible to improve normality, homogeneity of error variance and additivity by using empirical power transformations. Computer simulations showed that a seriously skewed distribution impairs selection efficiency and exaggerates selection gain expectations. If the distribution is heavily skewed, transformation might be worthwhile. It does not seem possible to offer any general advice about which variables should be transformed, but in most cases there seems to be no need of any transformation.
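A minimal sketch of an empirical power transformation of the kind mentioned: scan a grid of exponents and keep the one that brings the sample skewness of a right-skewed trait closest to zero. The trait is simulated and the grid is an illustrative choice.

```r
# Empirical power transformation: choose the exponent that makes a skewed
# trait distribution most nearly symmetric. Sketch with simulated data.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3   # simple sample skewness

set.seed(4)
trait <- rlnorm(400, meanlog = 1, sdlog = 0.6)   # right-skewed "branch diameter"

powers <- seq(0.1, 1.5, by = 0.05)
skews  <- sapply(powers, function(p) skewness(trait^p))
best_p <- powers[which.min(abs(skews))]

c(raw_skewness         = skewness(trait),
  best_power           = best_p,
  transformed_skewness = skewness(trait^best_p))
```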

7.
Computer simulations are used to examine the significance levels and powers of several tests which have been employed to compare the means of Poisson distributions. In particular, attention is focused on the behaviour of the tests when the means are small, as is often the case in ecological studies when populations of organisms are sampled using quadrats. Two approaches to testing are considered. The first assumes a log linear model for the Poisson data and leads to tests based on the deviance. The second employs standard analysis of variance tests following data transformations, including the often used logarithmic and square root transformations. For very small means it is found that a deviance-based test has the most favourable characteristics, generally outperforming analysis of variance tests on transformed data; none of the latter appears consistently better than any other. For larger means the standard analysis of variance on untransformed data performs well.
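The two testing approaches compared in the study can be sketched as follows for simulated quadrat counts with small means: a deviance test from a log-linear Poisson GLM, and a standard ANOVA after a square-root transformation. The site means and the particular transformation are illustrative.

```r
# Comparing Poisson means: deviance test from a log-linear GLM vs. a standard
# ANOVA on square-root-transformed counts, for small quadrat counts.
set.seed(5)
site  <- factor(rep(c("A", "B", "C"), each = 25))
count <- rpois(75, lambda = rep(c(0.5, 0.8, 1.2), each = 25))  # small means

# Deviance-based test (log-linear Poisson model).
glm_fit <- glm(count ~ site, family = poisson)
anova(glm_fit, test = "Chisq")

# ANOVA after a common square-root transformation for counts.
aov_fit <- aov(sqrt(count + 3/8) ~ site)
summary(aov_fit)
```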

8.
We introduce a statistical model for microarray gene expression data that comprises data calibration, the quantification of differential expression, and the quantification of measurement error. In particular, we derive a transformation h for intensity measurements, and a difference statistic Δh whose variance is approximately constant along the whole intensity range. This forms a basis for statistical inference from microarray data, and provides a rational data pre-processing strategy for multivariate analyses. For the transformation h, the parametric form h(x)=arsinh(a+bx) is derived from a model of the variance-versus-mean dependence for microarray intensity data, using the method of variance stabilizing transformations. For large intensities, h coincides with the logarithmic transformation, and Δh with the log-ratio. The parameters of h together with those of the calibration between experiments are estimated with a robust variant of maximum-likelihood estimation. We demonstrate our approach on data sets from different experimental platforms, including two-colour cDNA arrays and a series of Affymetrix oligonucleotide arrays.
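A small numerical sketch of the transformation h(x) = arsinh(a + bx) and its agreement with the logarithm at high intensities. The constants a and b here are arbitrary illustrations; in the paper they are estimated from the data by robust maximum likelihood together with the between-array calibration.

```r
# The variance-stabilizing transformation h(x) = arsinh(a + b*x) and its
# convergence to a shifted log transform at high intensities.
a <- 2; b <- 0.02                     # illustrative constants, not fitted values
h <- function(x) asinh(a + b * x)

x <- c(10, 100, 1000, 10000, 100000)
cbind(x,
      h_x    = h(x),
      log_bx = log(b * x),            # for large x, asinh(a + b*x) ~ log(b*x) + log(2)
      diff   = h(x) - (log(b * x) + log(2)))   # shrinks toward 0 as x grows
```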

9.
BIOLOG EcoPlates provide one method for determination of functional diversity indices and community-level physiological profiling of microbial populations based on carbon substrate utilization. In this study, the effect of data transformation on BIOLOG EcoPlate data derived from wetland mesocosms and biofiltration systems was examined. Homoscedasticity, normality, and the number of linear correlations between variables were quantified and evaluated for data that had been transformed using either Taylor or logarithmic transforms. Subsequent multivariate analysis was implemented using the untransformed, Taylor-transformed and logarithmic-transformed data sets. The effects of the transformations on principal component analysis are presented. The transforms are shown to help increase homogeneity of variance, increase normality of the data, and increase the number of significant linear correlations for the data. Separate principal component analyses and ordinations of the data showed the transforms to be well suited to this type of data and in particular illustrate the ability of the logarithmic transform to reduce the influence of high-leverage or outlying observations on the overall analysis and its robustness in terms of treating data from different ecological systems. Although BIOLOG EcoPlates were used in this study to illustrate the use of transformations on multivariate data, the techniques described may be employed on similar microplate data. In addition, if homoscedasticity, normality and the number of linear correlations within a data set are not evaluated and the possibility of transforming the data, using the Taylor, logarithmic or another transform, is not considered, erroneous analyses and misleading conclusions may result when performing multivariate analysis on microplate data.

10.
Both ecological field studies and attempts to extrapolate from laboratory experiments to natural populations generally encounter the high degree of natural variability and chaotic behavior that typify natural ecosystems. Regardless of this variability and non-normal distribution, most statistical models of natural systems use normal error, which assumes independence between the variance and the mean. However, environmental data are often random or clustered and are better described by probability distributions which have more realistic variance-to-mean relationships. Until recently, statistical software packages modeled only with normal error, and researchers had to assume approximate normality on the original or transformed scale of measurement and had to live with the consequences of often incorrectly assuming independence between the variance and mean. Recent developments in statistical software allow researchers to use generalized linear models (GLMs), and analysis can now proceed with probability distributions from the exponential family which more realistically describe natural conditions: binomial (even distribution with variance less than the mean), Poisson (random distribution with variance equal to the mean), negative binomial (clustered distribution with variance greater than the mean). GLMs fit parameters on the original scale of measurement and eliminate the need for obfuscating transformations, reduce bias for proportions with unequal sample size, and provide realistic estimates of variance which can increase the power of tests. Because GLMs permit modeling according to the non-normal behavior of natural systems and obviate the need for normality assumptions, they will likely become a widely used tool for analyzing toxicity data. To demonstrate the broad-scale utility of GLMs, we present several examples where the use of GLMs improved the statistical power of field and laboratory studies to document the rapid ecological recovery of Prince William Sound following the Exxon Valdez oil spill.
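A minimal sketch of the modeling choice the abstract advocates: fit clustered count data directly with Poisson and negative binomial GLMs instead of transforming toward normality. The data are simulated; glm.nb() comes from the MASS package, and the treatment labels are hypothetical.

```r
# Fitting counts with exponential-family GLMs instead of transforming the data.
library(MASS)

set.seed(6)
treatment <- factor(rep(c("oiled", "reference"), each = 40))
counts    <- rnbinom(80, mu = rep(c(5, 9), each = 40), size = 1.5)  # clustered counts

pois_fit <- glm(counts ~ treatment, family = poisson)   # assumes variance = mean
nb_fit   <- glm.nb(counts ~ treatment)                  # allows variance > mean

# Overdispersion makes the Poisson model's standard errors too small; the
# negative binomial model gives more realistic uncertainty for the effect.
summary(pois_fit)$coefficients["treatmentreference", ]
summary(nb_fit)$coefficients["treatmentreference", ]
```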

11.
An analytical quantitative comparison of data from the literature about frequencies of mutations and transformations induced by mutagenic-carcinogenic compounds in mammalian cells was carried out without any selection of unfitting data. The analysis was performed for equitoxic doses and background level. Data on transformation frequency came from 105 experiments performed with 34 carcinogenic compounds; those on mutation frequency came from 66 experiments performed with 26 mutagenic compounds; 7 compounds were assayed for both these activities. The difference in frequency between structural mutations and transformations was about 10^2–10^3 and is statistically extremely significant. These results seem to indicate an absolute difference between structural mutations and transformations. In the framework of other observations, it is suggested that structural alterations in a single gene are perhaps only one component of the steps present in the oncogenetic process. The other steps involved in this process may be regarded as "epigenetic"-type phenomena.

12.
There may be experiments where, due to misadventure or logistic or ethical reasons, final measurements on all experimental units cannot be obtained. If at least 50% of the final measurements have been taken, estimates of the lower quantiles and the median can be obtained. For such curtailed experiments it is shown how quantiles, above those that can be estimated directly from the data set, can be estimated indirectly by exploiting a property of symmetric distributions. The performance of the indirect quantile estimator is compared with that of the direct quantile estimator, and conditions for the indirect estimator to have smaller variance than the direct estimator are presented. It is also shown how the indirect estimator may be pooled with the direct estimator to obtain an improved estimate of the upper quantiles. When it cannot be assumed that the data come from a symmetric distribution, transformations to symmetry may be performed and the indirect estimation technique used on the transformed data; back transformations then yield the estimates of the upper quantiles.
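The symmetry property being exploited can be sketched very simply: for a symmetric distribution, Q(p) = 2*median - Q(1-p), so an upper quantile can be estimated indirectly from order statistics available in the lower, observed part of a curtailed sample. The example below is an illustration only and does not implement the paper's pooled estimator.

```r
# Indirect estimation of an upper quantile from a curtailed experiment, using
# the symmetry relation Q(p) = 2*median - Q(1-p).
set.seed(7)
n    <- 100
full <- rnorm(n, mean = 50, sd = 10)     # what a complete experiment would give
obs  <- sort(full)[1:60]                 # curtailed: only the lowest 60% measured

# Lower quantiles and the median are available directly as order statistics.
q25_direct <- obs[25]                    # ~ 25th percentile
median_hat <- obs[50]                    # ~ 50th percentile

# The 75th percentile is not observable directly, but under symmetry:
q75_indirect <- 2 * median_hat - q25_direct

c(indirect           = q75_indirect,
  truth              = qnorm(0.75, 50, 10),
  full_sample_direct = unname(quantile(full, 0.75)))
```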

13.
MOTIVATION: Pre-processing of SELDI-TOF mass spectrometry data is currently performed on a largely ad hoc basis. This makes comparison of results from independent analyses troublesome and does not provide a framework for distinguishing different sources of variation in data. RESULTS: In this article, we consider the task of pooling a large number of single-shot spectra, a task commonly performed automatically by the instrument software. By viewing the underlying statistical problem as one of heteroscedastic linear regression, we provide a framework for introducing robust methods and for dealing with missing data resulting from a limited span of recordable intensity values provided by the instrument. Our framework provides an interpretation of currently used methods as a maximum-likelihood estimator and allows theoretical derivation of its variance. We observe that this variance depends crucially on the total number of ionic species, which can vary considerably between different pooled spectra. This variation in variance can potentially invalidate the results from naive methods of discrimination/classification, and we outline appropriate data transformations. Introducing methods from robust statistics did not improve the standard errors of the pooled samples. Imputing missing values using the EM algorithm, however, had a notable effect on the result; for our data, the pooled height of peaks which were frequently truncated increased by up to 30%.

14.
Statistical methods for distinguishing the common types of enzyme inhibitors are presented. Steady-state kinetic data in the double-reciprocal form are analyzed. The test for competitive and uncompetitive inhibition simply reveals whether there is a significant difference between the sum of the residual variances for each data set (i.e., each line of the double-reciprocal plot) fitted to a straight line and the residual variance generated by fitting the data points of all the data sets to one of these models (in double-reciprocal form). A standard F test is performed to quantitate the significance of the additional error created by fitting the data to the model. F values are converted to probability values which express the degree to which the data conform to the model. The F test is not directly suitable for verifying noncompetitive inhibitors because they produce both slope and intercept effects. Therefore, the data are first transformed so that the point of convergence of the data sets is moved to the origin of the double-reciprocal graph. Equations are presented to fit each transformed data set to a straight line and also to fit the transformed data sets to a family of straight lines that intersect at the origin. The sums of the residual variances of the first fitting and the total residual variance of the second fitting are then amenable to comparison by the F test because the intercept effects have been abolished. Thus, the degree of conformity to a model describing a family of lines with a common intersection can be assessed. Additional verification of noncompetitive inhibition requires establishing that the point of convergence resides to the left of the 1/v axis, and the statistical rejection of alternative inhibition models.
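A minimal sketch of the nested-model F-test idea for the competitive case, using simulated double-reciprocal data: the unconstrained fit gives each inhibitor concentration its own slope and intercept, while the constrained fit forces a common 1/v intercept; anova() then supplies the F test. This reproduces the spirit of the method with standard R model comparison, not the paper's exact calculations.

```r
# Nested-model F test for competitive inhibition in double-reciprocal form:
# unconstrained fit = separate slope and intercept per inhibitor level;
# constrained fit   = inhibitor-specific slopes but a common 1/v intercept.
set.seed(8)
S    <- rep(c(1, 2, 4, 8, 16), times = 3)          # substrate concentrations
I    <- factor(rep(c(0, 5, 10), each = 5))         # inhibitor concentrations
Vmax <- 10; Km <- 2; Ki <- 4
v    <- Vmax * S / (Km * (1 + as.numeric(as.character(I)) / Ki) + S)
v    <- v * exp(rnorm(15, sd = 0.03))              # multiplicative measurement noise

inv_S <- 1 / S
inv_v <- 1 / v

full   <- lm(inv_v ~ inv_S * I)   # separate slopes and intercepts per level
compet <- lm(inv_v ~ inv_S:I)     # common intercept, level-specific slopes
anova(compet, full)               # small F / large p => competitive model is adequate
```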

15.
High-dimensional data provide many potential confounders that may bolster the plausibility of the ignorability assumption in causal inference problems. Propensity score methods are powerful causal inference tools, which are popular in health care research and are particularly useful for high-dimensional data. Recent interest has surrounded a Bayesian treatment of propensity scores in order to flexibly model the treatment assignment mechanism and summarize posterior quantities while incorporating variance from the treatment model. We discuss methods for Bayesian propensity score analysis of binary treatments, focusing on modern methods for high-dimensional Bayesian regression and the propagation of uncertainty. We introduce a novel and simple estimator for the average treatment effect that capitalizes on conjugacy of the beta and binomial distributions. Through simulations, we show the utility of horseshoe priors and Bayesian additive regression trees paired with our new estimator, while demonstrating the importance of including variance from the treatment regression model. An application to cardiac stent data with almost 500 confounders and 9000 patients illustrates approaches and facilitates comparison with existing alternatives. As measured by a falsifiability endpoint, we improved confounder adjustment compared with past observational research of the same problem.

16.
To compensate for a power analysis based on a poor estimate of variance, internal pilot designs use some fraction of the planned observations to reestimate error variance and modify the final sample size. Ignoring the randomness of the final sample size may bias the final variance estimate and inflate test size. We propose and evaluate three different tests that control test size for an internal pilot in a general linear univariate model with fixed predictors and Gaussian errors. Test 1 uses the first sample plus those observations guaranteed to be collected in the second sample for the final variance estimate. Test 2 depends mostly on the second sample for the final variance estimate. Test 3 uses the unadjusted variance estimate and modifies the critical value to bound test size. We also examine three sample-size modification rules. Only test 2 can control conditional test size, align with a modification rule, and provide simple power calculations. We recommend it if the minimum second (incremental) sample is at least moderate (perhaps 20). Otherwise, the bounding test appears to have the highest power in small samples. Reanalyzing published data highlights some advantages and disadvantages of the various tests.

17.
MOTIVATION: Many standard statistical techniques are effective on data that are normally distributed with constant variance. Microarray data typically violate these assumptions since they come from non-Gaussian distributions with a non-trivial mean-variance relationship. Several methods have been proposed that transform microarray data to stabilize variance and draw its distribution towards the Gaussian. Some methods, such as log or generalized log, rely on an underlying model for the data. Others, such as the spread-versus-level plot, do not. We propose an alternative data-driven multiscale approach, called the Data-Driven Haar-Fisz for microarrays (DDHFm) with replicates. DDHFm has the advantage of being 'distribution-free' in the sense that no parametric model for the underlying microarray data is required to be specified or estimated; hence, DDHFm can be applied very generally, not just to microarray data. RESULTS: DDHFm achieves very good variance stabilization of microarray data with replicates and produces transformed intensities that are approximately normally distributed. Simulation studies show that it performs better than other existing methods. Application of DDHFm to real one-color cDNA data validates these results. AVAILABILITY: The R package of the Data-Driven Haar-Fisz transform (DDHFm) for microarrays is available in Bioconductor and CRAN.
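For intuition about the mechanism, here is a sketch of the classical (Poisson) Haar-Fisz transform: a Haar decomposition whose detail coefficients are divided by the square root of the corresponding local smooth, followed by reconstruction. The data-driven version in the DDHFm package instead estimates the mean-variance relationship from replicates; the code below is only the basic, non-data-driven variant on simulated counts.

```r
# Classical Haar-Fisz variance stabilization for a vector of length 2^J.
haar_fisz <- function(x) {
  n <- length(x)
  J <- log2(n)
  stopifnot(J == round(J))
  s <- x
  detail <- vector("list", J)
  # Forward pass: Haar smooths and Fisz-normalized details, finest level first.
  for (j in 1:J) {
    odd  <- s[seq(1, length(s), by = 2)]
    even <- s[seq(2, length(s), by = 2)]
    sm   <- (odd + even) / 2
    d    <- (odd - even) / 2
    detail[[j]] <- ifelse(sm > 0, d / sqrt(sm), 0)   # Fisz normalization
    s <- sm
  }
  # Backward pass: rebuild the signal using the normalized details.
  u <- s                                             # single coarsest smooth value
  for (j in J:1) {
    f    <- detail[[j]]
    odd  <- u + f
    even <- u - f
    u <- as.vector(rbind(odd, even))                 # interleave back
  }
  u
}

set.seed(9)
x <- rpois(256, lambda = rep(c(2, 20), each = 128))
u <- haar_fisz(x)
c(var_ratio_raw = var(x[1:128]) / var(x[129:256]),   # far from 1: variance tracks the mean
  var_ratio_hf  = var(u[1:128]) / var(u[129:256]))   # close to 1 after Haar-Fisz
```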

18.
Transformation and normalization of oligonucleotide microarray data
MOTIVATION: Most methods of analyzing microarray data or doing power calculations have an underlying assumption of constant variance across all levels of gene expression. The most common transformation, the logarithm, results in data that have constant variance at high levels but not at low levels. Rocke and Durbin showed that data from spotted arrays fit a two-component model and Durbin, Hardin, Hawkins, and Rocke, Huber et al. and Munson provided a transformation that stabilizes the variance as well as symmetrizes and normalizes the error structure. We wish to evaluate the applicability of this transformation to the error structure of GeneChip microarrays. RESULTS: We demonstrate in an example study a simple way to use the two-component model of Rocke and Durbin and the data transformation of Durbin, Hardin, Hawkins and Rocke, Huber et al. and Munson on Affymetrix GeneChip data. In addition we provide a method for normalization of Affymetrix GeneChips simultaneous with the determination of the transformation, producing a data set without chip or slide effects but with constant variance and with symmetric errors. This transformation/normalization process can be thought of as a machine calibration in that it requires a few biologically constant replicates of one sample to determine the constant needed to specify the transformation and normalize. It is hypothesized that this constant needs to be found only once for a given technology in a lab, perhaps with periodic updates. It does not require extensive replication in each study. Furthermore, the variance of the transformed pilot data can be used to do power calculations using standard power analysis programs. AVAILABILITY: SPLUS code for the transformation/normalization for four replicates is available from the first author upon request. A program written in C is available from the last author.

19.
MOTIVATION: DNA microarrays are now capable of providing genome-wide patterns of gene expression across many different conditions. The first level of analysis of these patterns requires determining whether observed differences in expression are significant or not. Current methods are unsatisfactory due to the lack of a systematic framework that can accommodate noise, variability, and low replication often typical of microarray data. RESULTS: We develop a Bayesian probabilistic framework for microarray data analysis. At the simplest level, we model log-expression values by independent normal distributions, parameterized by corresponding means and variances with hierarchical prior distributions. We derive point estimates for both parameters and hyperparameters, and regularized expressions for the variance of each gene by combining the empirical variance with a local background variance associated with neighboring genes. An additional hyperparameter, inversely related to the number of empirical observations, determines the strength of the background variance. Simulations show that these point estimates, combined with a t-test, provide a systematic inference approach that compares favorably with simple t-test or fold methods, and partly compensate for the lack of replication.
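A rough sketch of the regularization idea described above: shrink each gene's empirical variance toward a local background variance computed from genes of similar mean expression, with a pseudo-count controlling the strength, and use the regularized variance in a t-statistic. The window width, pseudo-count and reference mean below are illustrative choices, not the paper's hierarchical estimates.

```r
# Regularized-variance t-statistic: combine each gene's empirical variance with
# a background variance from genes of similar mean expression. Sketch only.
set.seed(10)
n_genes <- 1000; n_rep <- 3
log_expr <- matrix(rnorm(n_genes * n_rep,
                         mean = rep(runif(n_genes, 4, 12), n_rep), sd = 0.3),
                   nrow = n_genes)

gene_mean <- rowMeans(log_expr)
gene_var  <- apply(log_expr, 1, var)

# Background variance: moving average of variances over genes ordered by mean expression.
ord <- order(gene_mean)
bg_var <- rep(NA_real_, n_genes)
bg_var[ord] <- stats::filter(gene_var[ord], rep(1 / 101, 101), sides = 2)
bg_var[is.na(bg_var)] <- mean(gene_var)      # fall back near the ends of the range

v0 <- 10                                      # strength of the background (pseudo-replicates)
reg_var <- (v0 * bg_var + (n_rep - 1) * gene_var) / (v0 + n_rep - 2)

# Regularized one-sample t-statistic against a hypothetical reference mean of 8.
t_reg <- (gene_mean - 8) / sqrt(reg_var / n_rep)
head(t_reg)
```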

20.
《Endocrine practice》2014,20(3):207-212
Objective: To introduce a statistical method of assessing hospital-based non-intensive care unit (non-ICU) inpatient glucose control. Methods: Point-of-care blood glucose (POC-BG) data from hospital non-ICUs were extracted for January 1 through December 31, 2011. Glucose data distribution was examined before and after Box-Cox transformations and compared to normality. Different subsets of data were used to establish upper and lower control limits, and exponentially weighted moving average (EWMA) control charts were constructed from June, July, and October data as examples to determine if out-of-control events were identified differently in nontransformed versus transformed data. Results: A total of 36,381 POC-BG values were analyzed. In all 3 monthly test samples, glucose distributions in nontransformed data were skewed but approached a normal distribution once transformed. Interpretation of out-of-control events from EWMA control chart analyses also revealed differences. In the June test data, an out-of-control process was identified at sample 53 with nontransformed data, whereas the transformed data remained in control for the duration of the observed period. Analysis of July data demonstrated an out-of-control process sooner in the transformed data (sample 55) than in the nontransformed data, whereas for October, transformed data remained in control longer than nontransformed data. Conclusion: Statistical transformations increase the normal behavior of inpatient non-ICU glycemic data sets. The decision to transform glucose data could influence the interpretation and conclusions about the status of inpatient glycemic control. Further study is required to determine whether transformed versus nontransformed data influence clinical decisions or evaluation of interventions. (Endocr Pract. 2014;20:207-212)
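A minimal sketch of the workflow described in the abstract, on simulated point-of-care glucose values: estimate a Box-Cox transformation with MASS::boxcox, transform, and construct an EWMA chart with the standard recursion z_i = w*x_i + (1 - w)*z_{i-1} and variance-based control limits. The smoothing constant, limit width and simulated data are conventional illustrative choices, not those of the study.

```r
# Box-Cox transformation followed by an EWMA control chart, on simulated
# right-skewed glucose values (mg/dL).
library(MASS)

set.seed(11)
glucose <- rlnorm(200, meanlog = log(150), sdlog = 0.25)

# Box-Cox: pick the power that maximizes the profile likelihood.
bc     <- boxcox(glucose ~ 1, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lam    <- bc$x[which.max(bc$y)]
gluc_t <- if (abs(lam) < 1e-6) log(glucose) else (glucose^lam - 1) / lam

# EWMA recursion with standard variance-based control limits.
ewma_chart <- function(x, w = 0.2, L = 3) {
  mu0 <- mean(x); sd0 <- sd(x)
  z <- numeric(length(x)); prev <- mu0
  for (i in seq_along(x)) { z[i] <- w * x[i] + (1 - w) * prev; prev <- z[i] }
  i  <- seq_along(x)
  se <- sd0 * sqrt(w / (2 - w) * (1 - (1 - w)^(2 * i)))
  data.frame(i, z, lcl = mu0 - L * se, ucl = mu0 + L * se,
             out = z < mu0 - L * se | z > mu0 + L * se)
}

chart <- ewma_chart(gluc_t)
which(chart$out)                 # samples flagged as out of control (if any)
```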
