Similar Articles
20 similar articles found (search time: 31 ms)
1.
Different multivariate data analysis techniques based on factor analysis and multivariate curve resolution are shown for the study of biochemical evolutionary processes such as conformational changes and protein folding. Several simulated CD spectral data sets describing different hypothetical protein folding pathways are analyzed and discussed with respect to the ability of factor analysis techniques to detect and resolve the number of components needed to explain the evolution of the CD spectra corresponding to the process (i.e., to detect the presence of intermediate forms). When more than two components (the native and unordered forms) are needed to explain the evolution of the spectra, an iterative multivariate curve resolution procedure based on an alternating least squares algorithm is proposed to estimate the CD spectrum corresponding to the intermediate form.
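As an illustration of the alternating least squares idea behind such curve resolution procedures, here is a minimal sketch for the bilinear model D ≈ C Sᵀ, where D holds the measured CD spectra, C the concentration profiles along the folding pathway, and S the pure component spectra. It alternates unconstrained least-squares updates of C and S on simulated data; real MCR-ALS implementations add constraints (non-negativity, closure, etc.), and all names and values here are illustrative.

```python
import numpy as np

def mcr_als(D, C0, n_iter=100, tol=1e-10):
    """Minimal alternating least squares for the bilinear model
    D ~ C @ S.T (spectra x components); real MCR-ALS adds constraints
    such as non-negativity to each step."""
    C, prev = C0.copy(), np.inf
    for _ in range(n_iter):
        S = np.linalg.lstsq(C, D, rcond=None)[0].T    # update spectra
        C = np.linalg.lstsq(S, D.T, rcond=None)[0].T  # update concentrations
        sse = np.sum((D - C @ S.T) ** 2)
        if prev - sse < tol:                          # fit stopped improving
            break
        prev = sse
    return C, S

# toy example: two species with Gaussian spectra along a folding coordinate
rng = np.random.default_rng(0)
wl = np.linspace(0, 1, 50)
S_true = np.stack([np.exp(-(wl - 0.3) ** 2 / 0.01),
                   np.exp(-(wl - 0.7) ** 2 / 0.02)], axis=1)
C_true = np.stack([np.linspace(1, 0, 20), np.linspace(0, 1, 20)], axis=1)
D = C_true @ S_true.T + 0.01 * rng.standard_normal((20, 50))
C_hat, S_hat = mcr_als(D, C_true + 0.1 * rng.standard_normal(C_true.shape))
```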

2.
MOTIVATION: One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in the thousands). It is known that standard statistical methods in classification are efficient (i.e., in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification one. RESULTS: In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression problem with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and ridge-penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their merits in certain cases, and explain theoretically why they sometimes behave poorly. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.
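A sketch of the flavour of this combination using scikit-learn: PLS components are first extracted from the high-dimensional expression matrix, and an L2-penalized (ridge) logistic regression is then fitted on the resulting scores. This is an illustrative two-step pipeline on synthetic data, not the authors' exact estimator.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 60, 2000                       # few samples, thousands of genes
X = rng.standard_normal((n, p))
y = (X[:, :10].sum(axis=1) + 0.5 * rng.standard_normal(n) > 0).astype(int)

# step 1: PLS dimension reduction (binary label used as the response)
pls = PLSRegression(n_components=3).fit(X, y)
T = pls.transform(X)                  # n x 3 matrix of PLS scores

# step 2: ridge (L2) penalized logistic regression on the PLS scores
clf = LogisticRegression(penalty="l2", C=1.0).fit(T, y)
print("training accuracy:", clf.score(T, y))
```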

3.
Linear mixed model (LMM) analysis has been recently used extensively for estimating additive genetic variances and narrow-sense heritability in many genomic studies. While the LMM analysis is computationally less intensive than the Bayesian algorithms, it remains infeasible for large-scale genomic data sets. In this paper, we advocate the use of a statistical procedure known as symmetric differences squared (SDS) as it may serve as a viable alternative when the LMM methods have difficulty or fail to work with large datasets. The SDS procedure is a general and computationally simple method based only on the least squares regression analysis. We carry out computer simulations and empirical analyses to compare the SDS procedure with two commonly used LMM-based procedures. Our results show that the SDS method is not as good as the LMM methods for small data sets, but it becomes progressively better and can match well with the precision of estimation by the LMM methods for data sets with large sample sizes. Its major advantage is that with larger and larger samples, it continues to work with the increasing precision of estimation while the commonly used LMM methods are no longer able to work under our current typical computing capacity. Thus, these results suggest that the SDS method can serve as a viable alternative particularly when analyzing ‘big’ genomic data sets.
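The precise SDS estimator is defined in the paper; as a loosely related, hedged illustration of how additive variance can be estimated by nothing more than least squares on pairwise quantities, the following Haseman-Elston-style sketch regresses squared phenotype differences on genomic relatedness (under y ~ N(0, σₐ²G + σₑ²I) with unit-diagonal G, the slope estimates −2σₐ²). The actual SDS computation may differ in detail.

```python
import numpy as np

def pairwise_ls_variance(y, G):
    """Least squares regression of squared phenotype differences on
    genomic relatedness (illustrative Haseman-Elston flavour).
    Under y ~ N(0, sa2*G + se2*I) with diag(G) = 1,
        E[(y_i - y_j)^2] = 2*se2 + 2*sa2*(1 - G_ij),
    so the slope of (y_i - y_j)^2 on G_ij estimates -2*sa2."""
    n = len(y)
    i, j = np.triu_indices(n, k=1)             # all unordered pairs
    d2 = (y[i] - y[j]) ** 2
    X = np.column_stack([np.ones(len(d2)), G[i, j]])
    coef, *_ = np.linalg.lstsq(X, d2, rcond=None)
    return -coef[1] / 2.0                      # estimated additive variance

# toy data: phenotypes simulated from a known relatedness matrix
rng = np.random.default_rng(2)
n, m = 500, 200
Z = rng.standard_normal((n, m))
G = Z @ Z.T / m
G /= np.mean(np.diag(G))                       # scale so diag(G) is ~1
y = rng.multivariate_normal(np.zeros(n), 0.5 * G + 0.5 * np.eye(n))
print("estimated additive variance:", pairwise_ls_variance(y, G))
```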

4.
A computational method is presented for minimizing the weighted sum of squares of the differences between observed and expected pairwise distances between species, where the expectations are generated by an additive tree model. The criteria of Fitch and Margoliash (1967, Science 155:279-284) and Cavalli-Sforza and Edwards (1967, Evolution 21:550-570) are both weighted least squares, with different weights. The method presented iterates lengths of adjacent branches in the tree three at a time. The weighted sum of squares never increases during the process of iteration, and the iterates approach a stationary point on the surface of the sum of squares. This iterative approach makes it particularly easy to maintain the constraint that branch lengths never become negative, although negative branch lengths can also be allowed. The method is implemented in a computer program, FITCH, which has been distributed since 1982 as part of the PHYLIP package of programs for inferring phylogenies, and is also implemented in PAUP*. The present method is compared, using some simulated data sets, with an implementation of the method of De Soete (1983, Psychometrika 48:621-626); it is slower than De Soete's method but more effective at finding the least squares tree. The relationship of this method to the neighbor-joining method is also discussed.
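For a fixed topology, the weighted least squares criterion minimizes Σᵢⱼ wᵢⱼ(Dᵢⱼ − Σₖ xᵢⱼₖbₖ)², where xᵢⱼₖ indicates whether branch k lies on the path between species i and j, and the Fitch-Margoliash choice is wᵢⱼ = 1/Dᵢⱼ². A hedged sketch that solves this criterion directly with non-negative least squares on a small fixed four-taxon tree (the paper's algorithm instead iterates adjacent branches three at a time, which scales to large trees):

```python
import numpy as np
from scipy.optimize import nnls

# four-taxon unrooted tree ((A,B),(C,D)); columns are branch lengths
# bA, bB, bC, bD and the internal branch bI; each row marks which
# branches lie on the path between one pair of species
X = np.array([[1, 1, 0, 0, 0],   # A-B
              [1, 0, 1, 0, 1],   # A-C
              [1, 0, 0, 1, 1],   # A-D
              [0, 1, 1, 0, 1],   # B-C
              [0, 1, 0, 1, 1],   # B-D
              [0, 0, 1, 1, 0]],  # C-D
             dtype=float)
D = np.array([0.20, 0.45, 0.50, 0.47, 0.52, 0.15])  # observed distances

w = 1.0 / D**2                    # Fitch-Margoliash weights
sw = np.sqrt(w)[:, None]
# non-negative least squares enforces the branch-length constraint b >= 0
b, _ = nnls(X * sw, D * np.sqrt(w))
print("branch lengths (A, B, C, D, internal):", b)
```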

5.
The log response ratio, lnRR, is the most frequently used effect size statistic for meta-analysis in ecology. However, often missing standard deviations (SDs) prevent estimation of the sampling variance of lnRR. We propose new methods to deal with missing SDs via a weighted average coefficient of variation (CV) estimated from studies in the dataset that do report SDs. Across a suite of simulated conditions, we find that using the average CV to estimate sampling variances for all observations, regardless of missingness, performs with minimal bias. Surprisingly, even with missing SDs, this simple method outperforms the conventional approach (basing each effect size on its individual study-specific CV) with complete data. This is because the conventional method ultimately yields less precise estimates of the sampling variances than using the pooled CV from multiple studies. Our approach is broadly applicable and can be implemented in all meta-analyses of lnRR, regardless of ‘missingness’.
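A sketch of the core computation: the first-order sampling variance of lnRR is CV₁²/n₁ + CV₂²/n₂, so when SDs are missing a weighted-average CV pooled across the studies that do report SDs can be plugged in for every observation. The sample-size weighting used below is an assumption for illustration; the paper's exact weighting scheme may differ.

```python
import numpy as np

def pooled_cv(means, sds, ns):
    """Sample-size-weighted average coefficient of variation, computed
    from the subset of studies that do report SDs (weighting scheme
    assumed here for illustration)."""
    return np.average(sds / means, weights=ns)

def lnrr_and_var(m1, m2, n1, n2, cv1, cv2):
    """Log response ratio and its first-order sampling variance,
    var(lnRR) ~ CV1^2/n1 + CV2^2/n2, using supplied (pooled) CVs."""
    return np.log(m1 / m2), cv1**2 / n1 + cv2**2 / n2

# studies reporting SDs contribute to the pooled CV ...
means = np.array([10.0, 12.0, 9.5])
sds = np.array([2.0, 2.5, 1.8])
ns = np.array([20, 35, 15])
cv = pooled_cv(means, sds, ns)

# ... which is then used for every effect size, including those
# whose own SDs are missing
print(lnrr_and_var(m1=11.0, m2=9.0, n1=25, n2=25, cv1=cv, cv2=cv))
```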

6.
This paper applies the inverse probability weighted least‐squares method to predict total medical cost in the presence of censored data. Since survival time and medical costs may be subject to right censoring and therefore are not always observable, the ordinary least‐squares approach cannot be used to assess the effects of explanatory variables. We demonstrate how inverse probability weighted least‐squares estimation provides consistent asymptotic normal coefficients with easily computable standard errors. In addition, to assess the effect of censoring on coefficients, we develop a test comparing ordinary least‐squares and inverse probability weighted least‐squares estimators. We demonstrate the methods developed by applying them to the estimation of cancer costs using Medicare claims data.
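A hedged sketch of the inverse probability weighting step: a Kaplan-Meier estimate Ĝ of the censoring survival function gives each uncensored subject the weight δᵢ/Ĝ(Tᵢ) (censored subjects get weight zero), after which an ordinary weighted least squares fit is run. The simulation setup and the simple Kaplan-Meier routine below are illustrative, and the paper's standard-error and test computations are omitted.

```python
import numpy as np

def km_survival(times, events):
    """Minimal Kaplan-Meier estimate, evaluated at each subject's own
    time (assumes distinct times; events=1 marks an observed event of
    the distribution being estimated)."""
    order = np.argsort(times)
    e = events[order]
    n = len(e)
    at_risk = n - np.arange(n)
    surv = np.cumprod(np.where(e == 1, 1.0 - 1.0 / at_risk, 1.0))
    out = np.empty(n)
    out[order] = surv
    return out

rng = np.random.default_rng(3)
n = 300
x = rng.standard_normal(n)
cost = 10 + 2 * x + rng.standard_normal(n)     # total medical cost
follow = rng.exponential(2.0, n)               # time until cost fully accrued
cens = rng.exponential(3.0, n)                 # censoring time
T = np.minimum(follow, cens)
delta = (follow <= cens).astype(int)           # 1 = cost fully observed

# Ghat: Kaplan-Meier survival function of the *censoring* time at T_i
Ghat = km_survival(T, 1 - delta)
w = delta / np.clip(Ghat, 1e-8, None)          # censored subjects get weight 0

# weighted least squares of cost on covariates; censored subjects'
# (unobservable) costs are multiplied by zero and never enter the fit
A = np.column_stack([np.ones(n), x])
sw = np.sqrt(w)[:, None]
beta, *_ = np.linalg.lstsq(A * sw, cost * np.sqrt(w), rcond=None)
print("IPW least squares coefficients:", beta)
```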

7.
Lui KJ, Kelly C. Biometrics 2000, 56(1):309-315
Lipsitz et al. (1998, Biometrics 54, 148-160) discussed testing the homogeneity of the risk difference for a series of 2×2 tables. They proposed and evaluated several weighted test statistics, including the commonly used weighted least squares test statistic. Here we suggest various important improvements on these test statistics. First, we propose using the one-sided analogues of the test procedures proposed by Lipsitz et al., because we should only reject the null hypothesis of homogeneity when the variation of the estimated risk differences between centers is large. Second, we generalize their study by redesigning the simulations to include the situations considered by Lipsitz et al. (1998) as special cases. Third, we consider a logarithmic transformation of the weighted least squares test statistic to improve the normal approximation of its sampling distribution. On the basis of Monte Carlo simulations, we note that, as long as the mean treatment group size per table is moderate or large (≥16), this simple test statistic, in conjunction with the commonly used adjustment procedure for sparse data, can be useful when the number of 2×2 tables is small or moderate (≤32). In these situations, in fact, we find that our proposed method generally outperforms all the statistics considered by Lipsitz et al. Finally, we include a general guideline about which test statistic should be used in a variety of situations.
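For orientation, a sketch of the basic weighted least squares homogeneity statistic these improvements build on: with per-table risk differences d̂ₖ and inverse-variance weights wₖ, T = Σ wₖ(d̂ₖ − d̄)² is referred to a chi-square with K−1 degrees of freedom. The one-sided versions, the sparse-data adjustment, and the logarithmic transformation of T proposed in the paper are not shown.

```python
import numpy as np
from scipy.stats import chi2

def wls_homogeneity(p1, n1, p0, n0):
    """Weighted least squares test of homogeneity of risk differences
    across K 2x2 tables (basic version, no sparse-data adjustment)."""
    d = p1 - p0                                     # per-table risk differences
    var = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0   # estimated variances
    w = 1.0 / var                                   # inverse-variance weights
    dbar = np.sum(w * d) / np.sum(w)                # weighted mean difference
    T = np.sum(w * (d - dbar) ** 2)                 # homogeneity statistic
    # the paper's refinement takes a log transformation of T to improve
    # the normal approximation of its sampling distribution (not shown)
    return T, chi2.sf(T, df=len(d) - 1)

# toy example: six centres with fairly homogeneous risk differences
p1 = np.array([0.30, 0.28, 0.35, 0.32, 0.27, 0.31])
p0 = np.array([0.20, 0.19, 0.22, 0.21, 0.18, 0.23])
n1 = n0 = np.full(6, 40)
print(wls_homogeneity(p1, n1, p0, n0))
```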

8.
Partially paired data sets often occur in microarray experiments (Kim et al., 2005; Liu, Liang and Jang, 2006). Discussions of testing with partially paired data are found in the literature (Lin and Stivers, 1974; Ekbohm, 1976; Bhoj, 1978). Bhoj (1978) initially proposed a test statistic that uses a convex combination of paired and unpaired t statistics. Kim et al. (2005) later proposed the t3 statistic, which is a linear combination of paired and unpaired t statistics, and used it to detect differentially expressed (DE) genes in colorectal cancer (CRC) cDNA microarray data. In this paper, we extend Kim et al.'s t3 statistic to the Hotelling's T2-type statistic Tp for detecting DE gene sets of size p. We employ Efron's empirical null principle to incorporate inter-gene correlation in the estimation of the false discovery rate. The proposed Tp statistic is then applied to Kim et al.'s CRC data to detect the DE gene sets of sizes p=2 and p=3. Our results show that for small p, particularly for p=2 and marginally for p=3, the proposed Tp statistic complements the univariate procedure by detecting additional DE genes that were undetected by the univariate test procedure. We also conduct a simulation study to demonstrate that Efron's empirical null principle is robust to departures from the normality assumption.

9.
Susko E. Systematic Biology 2011, 60(5):668-675
Generalized least squares (GLS) methods provide a relatively fast means of constructing a confidence set of topologies. Because they utilize information about the covariances between distances, it is reasonable to expect additional efficiency in estimation and confidence set construction relative to other least squares (LS) methods. Difficulties have been found to arise in a number of practical settings due to estimates of covariance matrices being ill-conditioned or even noninvertible. We present here new ways of estimating the covariance matrices for distances that are much more likely to be positive definite, as the actual covariance matrices are. A thorough investigation of performance is also conducted. An alternative to GLS that has been proposed for constructing confidence sets of topologies is weighted least squares (WLS). As currently implemented, this approach is equivalent to the use of GLS but with covariances set to zero rather than being estimated. In effect, this approach assumes normality of the estimated distances and zero covariances. As the results here illustrate, this assumption leads to poor performance. A 95% confidence set is almost certain to contain the true topology but will contain many more topologies than are needed. On the other hand, the results here also indicate that, among LS methods, WLS performs quite well at estimating the correct topology. It turns out to be possible to improve the performance of WLS for confidence set construction through a relatively inexpensive normal parametric bootstrap that utilizes the same variances and covariances of GLS. The resulting procedure is shown to perform at least as well as GLS and thus provides a reasonable alternative in cases where covariance matrices are ill-conditioned.
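A hedged sketch of the GLS computation for a fixed topology, with one simple way to force the estimated covariance matrix of the distances to be positive definite, namely shrinking it toward its diagonal (the paper develops its own, different estimators; setting the shrinkage weight to 1 recovers the WLS special case discussed above):

```python
import numpy as np

def gls_branch_lengths(X, d, V, shrink=0.1):
    """GLS fit of branch lengths b in d ~ X b for a fixed topology.
    X : path indicator matrix (pairs x branches)
    d : estimated pairwise distances
    V : estimated covariance matrix of d, shrunk toward its diagonal
        so the matrix actually inverted is positive definite;
        shrink=1 zeroes all covariances, i.e. the WLS special case."""
    Vs = (1 - shrink) * V + shrink * np.diag(np.diag(V))
    Vinv = np.linalg.inv(Vs)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ d)

# four-taxon tree ((A,B),(C,D)); columns: bA, bB, bC, bD, internal branch
X = np.array([[1, 1, 0, 0, 0],   # pair A-B
              [1, 0, 1, 0, 1],   # pair A-C
              [1, 0, 0, 1, 1],   # pair A-D
              [0, 1, 1, 0, 1],   # pair B-C
              [0, 1, 0, 1, 1],   # pair B-D
              [0, 0, 1, 1, 0]],  # pair C-D
             dtype=float)
d = np.array([0.20, 0.45, 0.50, 0.47, 0.52, 0.15])
V = 0.001 * (np.eye(6) + 0.3 * np.ones((6, 6)))  # correlated distance errors
print(gls_branch_lengths(X, d, V))
```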

10.
In this work, the application of a multivariate curve resolution procedure based on alternating least squares optimization (MCR-ALS) to the analysis of data from DNA microarrays is proposed. For this purpose, simulated and publicly available experimental data sets have been analyzed. Application of MCR-ALS, a method that operates without any training set, has enabled the relevant information for classifying different cancer lines to be resolved into a small number of components, each defined by a sample profile and a pure gene expression profile. From the resolved sample profiles, a classification of samples according to their origin is proposed. From the resolved pure gene expression profiles, a set of over- or underexpressed genes that could be related to the development of the cancers has been selected. Advantages of the MCR-ALS procedure over previously proposed procedures such as principal component analysis are discussed.

11.
Huang J, Ma S, Xie H. Biometrics 2006, 62(3):813-820
We consider two regularization approaches, the LASSO and the threshold-gradient-directed regularization, for estimation and variable selection in the accelerated failure time model with multiple covariates based on Stute's weighted least squares method. The Stute estimator uses Kaplan-Meier weights to account for censoring in the least squares criterion. The weighted least squares objective function makes the adaptation of this approach to multiple covariate settings computationally feasible. We use V-fold cross-validation and a modified Akaike's Information Criterion for tuning parameter selection, and a bootstrap approach for variance estimation. The proposed method is evaluated using simulations and demonstrated on a real data example.
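A hedged sketch of the Stute weighting with an L1 penalty: the Kaplan-Meier jumps at the ordered observed log-times serve as observation weights in a lasso fit of the AFT model, here via scikit-learn's sample weights. Tuning-parameter selection, the threshold-gradient-directed alternative, and the bootstrap variance step are omitted, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def stute_weights(time, delta):
    """Kaplan-Meier jump weights of Stute's weighted least squares.
    With observations sorted by observed time (ties broken arbitrarily),
        w_i = delta_i/(n-i+1) * prod_{j<i} ((n-j)/(n-j+1))^delta_j."""
    order = np.argsort(time)
    d = delta[order].astype(float)
    n = len(d)
    i = np.arange(1, n + 1)
    ratios = (n - i[:-1]) / (n - i[:-1] + 1.0)            # i = 1..n-1
    cumlog = np.concatenate([[0.0], np.cumsum(d[:-1] * np.log(ratios))])
    w = np.zeros(n)
    w[order] = d / (n - i + 1) * np.exp(cumlog)
    return w

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.standard_normal((n, p))
logT = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.standard_normal(n)  # AFT model
logC = np.log(rng.exponential(8.0, n))                         # censoring
y = np.minimum(logT, logC)                                     # observed log time
delta = (logT <= logC).astype(int)

w = stute_weights(y, delta)                  # censored points get weight 0
fit = Lasso(alpha=0.01).fit(X, y, sample_weight=w)
print("lasso coefficients:", np.round(fit.coef_, 3))
```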

12.
13.
A method is presented for the analysis of data from crossfostering experiments in which parts of litters are reciprocally interchanged at birth. Observed variances and covariances of differently related individuals are expressed as functions of theoretical causal components of phenotypic variance (additive direct, dominance direct, additive maternal, dominance maternal, direct-maternal covariance, and environmental). Causal components are estimated by weighted least squares analysis of this system of equations, including a ridge-regression procedure to examine consequences of correlation between observed components. Ridge regression suggests that dominance direct genetic variance is generally underestimated, but that narrow-sense heritability estimates are reliable.

14.
This paper provides a key element for the calculation of the damage costs of air pollution, namely the valuation of mortality, important because premature mortality makes by far the largest contribution. Whereas several studies have tried to quantify the cost of air pollution mortality by multiplying a number of deaths by the ‘value of prevented fatality’ (also known as ‘value of statistical life’), we explain why such an approach is not correct and why one needs to evaluate the change in life expectancy due to air pollution. Therefore, an estimate for the monetary value of a life year (VOLY) is needed. The most appropriate method for determining VOLY is contingent valuation (CV). To determine VOLY for the EU, we have conducted a CV survey in 9 European countries: France, Spain, UK, Denmark, Germany, Switzerland, Czech Republic, Hungary, and Poland with a total sample size of 1463 persons. Based on the results from this 9-country CV survey we recommend a VOLY estimate of 40,000 € for cost–benefit analysis of air pollution policies for the European Union. As for confidence intervals, we argue that VOLY is at least 25,000 € and at the most 100,000 €.

15.
In (nonlinear) regression with heteroscedastic errors, introduction of a variance model can be useful to obtain good estimators of the regression parameter. For example, the variance model can be used to obtain the optimal weights in weighted least squares. Methodology of this kind is often used in the analysis of assay data in clinical chemistry, pharmacokinetics, and toxicology. In a series of papers in the pharmacological literature, Sheiner and Beal and others advocate the extended least squares (ELS) methodology that combines regression and variance model into a single objective function based on normal-theory maximum likelihood. The inadequacy of this method is folklore in the (mathematical) statistical literature. In this article it is pointed out that this methodology may lead to inconsistent estimators in practically relevant situations. A review is given of other methods that may be preferable to ELS.

16.
In human metabolic profiling studies, between-subject variability is often the dominant feature and can mask the potential classifications of clinical interest. Conventional models such as principal component analysis (PCA) are usually not effective in such situations and it is therefore highly desirable to find a suitable model which is able to discover the underlying pattern hidden behind the high between-subject variability. In this study we employed two clinical metabolomics data sets as the testing grounds, in which such variability had been observed, and we demonstrate that a proper choice of chemometrics model can help to overcome this issue of high between-subject variability. Two data sets were used to represent two different types of experiment designs. The first data set was obtained from a small-scale study investigating volatile organic compounds (VOCs) collected from chronic wounds using a skin patch device and analysed by thermal desorption-gas chromatography-mass spectrometry. Five patients were recruited and for each patient three sites sampled in triplicate: healthy skin, boundary of the lesion and top of the lesion, the aim was to discriminate these three types of samples based on their VOC profile. The second data set was from a much larger study involving 35 healthy subjects, 47 patients with chronic obstructive pulmonary disease and 33 with asthma. The VOCs in the breath of each subject were collected using a mask device and analysed again by GC–MS with the aim of discriminating the three types of subjects based on breath VOC profiles. Multilevel simultaneous component analysis, multilevel partial least squares for discriminant analysis, ANOVA-PCA, and a novel simplified ANOVA-PCA model—which we have named ANOVA-Mean Centre (ANOVA-MC)—were applied on these two data sets. Significantly improved results were obtained by using these models. We also present a novel validation procedure to verify statistically the results obtained from those models.
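A minimal sketch of the multilevel idea shared by these models: the dominant between-subject variation is removed by subtracting each subject's mean profile, and the subsequent analysis (here a plain PCA) operates on the within-subject part. This is a simplified stand-in; the published multilevel and ANOVA-PCA/ANOVA-MC algorithms involve additional steps, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n_subj, n_cls, p = 10, 3, 50              # 10 subjects x 3 sample types, 50 VOCs
subj = np.repeat(np.arange(n_subj), n_cls)
cls = np.tile(np.arange(n_cls), n_subj)   # e.g. healthy / boundary / lesion

# large between-subject offsets mask a much smaller class effect
X = (5.0 * rng.standard_normal((n_subj, p))[subj]       # subject level
     + 0.5 * rng.standard_normal((n_cls, p))[cls]       # class effect
     + 0.2 * rng.standard_normal((n_subj * n_cls, p)))  # measurement noise

# multilevel step: remove each subject's mean profile
subj_means = np.vstack([X[subj == s].mean(axis=0) for s in range(n_subj)])
X_within = X - subj_means[subj]

# PCA on the within-subject part now reflects the class structure
scores = PCA(n_components=2).fit_transform(X_within)
print("class means on PC1:",
      [scores[cls == c, 0].mean().round(2) for c in range(n_cls)])
```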

17.
Partial least squares discriminant analysis (PLS-DA) is a partial least squares regression of a set Y of binary variables describing the categories of a categorical variable on a set X of predictor variables. It is a compromise between the usual discriminant analysis and a discriminant analysis on the significant principal components of the predictor variables. This technique is especially suited to dealing with many more predictors than observations and with multicollinearity, two of the main problems encountered when analysing microarray expression data. We explore the performance of PLS-DA with published data from breast cancer (Perou et al. 2000). Several such analyses were carried out: (1) before vs after chemotherapy treatment, (2) estrogen receptor positive vs negative tumours, and (3) tumour classification. We found that the performance of PLS-DA was extremely satisfactory in all cases and that the discriminant cDNA clones often had a sound biological interpretation. We conclude that PLS-DA is a powerful yet simple tool for analysing microarray data.
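A sketch of a common way to run PLS-DA with scikit-learn: the categorical response is coded as a matrix of binary indicator columns, a PLS regression is fitted, and samples are assigned to the class with the largest predicted indicator. This is a standard PLS-DA recipe on synthetic data, not necessarily the exact variant used in the paper.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
n, p, k = 45, 500, 3                       # far more clones than samples
labels = np.repeat(np.arange(k), n // k)
X = rng.standard_normal((n, p))
X[:, :5] += labels[:, None]                # five informative "clones"

Y = np.eye(k)[labels]                      # binary indicator (dummy) coding
pls = PLSRegression(n_components=2).fit(X, Y)

pred = pls.predict(X).argmax(axis=1)       # class with largest indicator
print("training accuracy:", np.mean(pred == labels))
```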

18.

Background

Ordinary differential equations (ODEs) are often used to understand biological processes. Since ODE-based models usually contain many unknown parameters, parameter estimation is an important step toward a deeper understanding of the process. Parameter estimation is often formulated as a least squares optimization problem in which all experimental data points are treated as equally important. However, this equal-weight formulation ignores the possible existence of relative importance among different data points and may lead to misleading parameter estimates. We therefore propose to introduce weights that account for the relative importance of the data points when formulating the least squares optimization problem. Each weight is defined by the uncertainty of one data point given the other data points: if a data point can be accurately inferred from the other data, its uncertainty is low and its importance is low; conversely, if a data point can hardly be inferred from the other data, its uncertainty is high and it carries more information for estimating the parameters.
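A minimal sketch of such a weighted least squares formulation, using a logistic growth ODE as a stand-in for the paper's G1/S and MAPK models: each residual is scaled by the square root of its weight before optimization. The gradient-based weights below are only a placeholder; the paper derives each weight from the uncertainty of the point given the other points.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def model(t, y, r, K):
    return [r * y[0] * (1 - y[0] / K)]           # logistic growth (stand-in)

def simulate(params, t_obs, y0=0.1):
    sol = solve_ivp(model, (t_obs[0], t_obs[-1]), [y0],
                    t_eval=t_obs, args=tuple(params))
    return sol.y[0]

t_obs = np.linspace(0, 10, 20)                   # evenly spaced time points
rng = np.random.default_rng(7)
y_obs = simulate([1.0, 2.0], t_obs) + 0.02 * rng.standard_normal(t_obs.size)

# placeholder weights: larger where the trajectory changes fastest, so
# points in dynamic regions count more than points in flat regions
w = np.abs(np.gradient(y_obs, t_obs)) + 0.1
w /= w.sum()

def weighted_residuals(params):
    return np.sqrt(w) * (simulate(params, t_obs) - y_obs)

fit = least_squares(weighted_residuals, x0=[0.5, 1.0])
print("estimated (r, K):", fit.x)
```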

Results

A G1/S transition model with 6 and with 12 parameters, and a MAPK module with 14 parameters, were used to test the weighted formulation. In each case, evenly spaced experimental data points were used. Weights calculated in these models showed similar patterns: high weights for data points in dynamic regions and low weights for data points in flat regions. We developed a sampling algorithm to evaluate the weighted formulation, and demonstrated that the weighted formulation reduced the redundancy in the data. For the G1/S transition model with 12 parameters, we examined unevenly spaced experimental data points, strategically sampled to place more measurement points where the weights were relatively high and fewer where they were relatively low. This analysis showed that the proposed weights can be used for designing measurement time points.

Conclusions

Giving a different weight to each data point according to its relative importance compared to other data points is an effective method for improving robustness of parameter estimation by reducing the redundancy in the experimental data.

19.
Linear regression and two-class classification with gene expression data
MOTIVATION: Using gene expression data to classify (or predict) tumor types has received much research attention recently. Due to some special features of gene expression data, several new methods have been proposed, including the weighted voting scheme of Golub et al., the compound covariate method of Hedenfalk et al. (originally proposed by Tukey), and the shrunken centroids method of Tibshirani et al. These methods look different and are more or less ad hoc. RESULTS: We point out a close connection of the three methods with a linear regression model. Casting the classification problem in the general framework of linear regression naturally leads to new alternatives, such as partial least squares (PLS) methods and penalized PLS (PPLS) methods. Using two real data sets, we show the competitive performance of our new methods when compared with the other three methods.

20.
Stocks of commercial fish are often modelled using sampling data of various types, of unknown precision, and from various sources assumed independent. We want each set to contribute to estimates of the parameters in relation to its precision and goodness of fit with the model. Iterative re-weighting of the sets is proposed for linear models, continuing until the weight of each set is found to be proportional to (relative weighting) or equal to (absolute weighting) the inverse of the set-specific residual variances resulting from a generalised least squares fit. Formulae for the residual variances are put forward involving fractional allocation of degrees of freedom, depending on the numbers of independent observations in each set, the numbers of sets contributing to the estimate of each parameter, and the number of weights estimated. To illustrate the procedure, numbers of the 1984 year-class of North Sea cod (a) landed commercially each year, and (b) caught per unit of trawling time by an annual groundfish survey are modelled as a function of age to estimate total mortality Z, the relative catching power of the two fishing methods, and the relative precision of the two sets of observations as indices of stock abundance. It was found that the survey abundance indices displayed residual variance about 29 times higher than that of the annual landings.
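A hedged sketch of the iterative re-weighting loop: starting from equal weights, a weighted least squares fit is computed across all sets, each set's residual variance is re-estimated from its own residuals, the weights are reset to the inverses of those variances, and the cycle repeats. This simplified version omits the paper's fractional allocation of degrees of freedom, and the cod-flavoured toy data are illustrative.

```python
import numpy as np

def iterative_set_reweighting(X_sets, y_sets, n_iter=20):
    """Fit one linear model to several data sets, iteratively resetting
    each set's weight to the inverse of its estimated residual variance
    (simplified: no fractional allocation of degrees of freedom)."""
    sig2 = np.ones(len(X_sets))                     # start with equal weights
    for _ in range(n_iter):
        # weighted least squares via row scaling by 1/sigma_k
        Xw = np.vstack([X / np.sqrt(s) for X, s in zip(X_sets, sig2)])
        yw = np.concatenate([y / np.sqrt(s) for y, s in zip(y_sets, sig2)])
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        # re-estimate each set's residual variance from its own residuals
        sig2 = np.array([np.mean((y - X @ beta) ** 2)
                         for X, y in zip(X_sets, y_sets)])
    return beta, sig2

# toy data in the spirit of the cod example: two abundance indices decline
# log-linearly with age at the same total mortality Z but different noise
rng = np.random.default_rng(8)
age = np.arange(1, 11, dtype=float)
Z, qA, qB = 0.8, 5.0, 2.0
yA = qA - Z * age + 0.1 * rng.standard_normal(10)   # precise landings index
yB = qB - Z * age + 0.6 * rng.standard_normal(10)   # noisy survey index

# separate intercepts per source, shared slope Z
XA = np.column_stack([np.ones(10), np.zeros(10), -age])
XB = np.column_stack([np.zeros(10), np.ones(10), -age])
beta, sig2 = iterative_set_reweighting([XA, XB], [yA, yB])
print("intercepts and Z:", beta, "residual variances:", sig2)
```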
