Similar literature
 20 similar articles found (search time: 15 ms)
1.
Lin DY  Wei LJ  Ying Z 《Biometrics》2002,58(1):1-12
Residuals have long been used for graphical and numerical examinations of the adequacy of regression models. Conventional residual analysis based on the plots of raw residuals or their smoothed curves is highly subjective, whereas most numerical goodness-of-fit tests provide little information about the nature of model misspecification. In this paper, we develop objective and informative model-checking techniques by taking the cumulative sums of residuals over certain coordinates (e.g., covariates or fitted values) or by considering some related aggregates of residuals, such as moving sums and moving averages. For a variety of statistical models and data structures, including generalized linear models with independent or dependent observations, the distributions of these stochastic processes under the assumed model can be approximated by the distributions of certain zero-mean Gaussian processes whose realizations can be easily generated by computer simulation. Each observed process can then be compared, both graphically and numerically, with a number of realizations from the Gaussian process. Such comparisons enable one to assess objectively whether a trend seen in a residual plot reflects model misspecification or natural variation. The proposed techniques are particularly useful in checking the functional form of a covariate and the link function. Illustrations with several medical studies are provided.
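The cumulative-residual idea in this abstract can be sketched for the simplest case, a straight-line least-squares fit checked against its single covariate. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, and the null realizations are produced by multiplying each residual by an independent standard normal draw and removing the part explained by the fitted line, which is assumed here to be an adequate stand-in for the zero-mean Gaussian process the abstract describes.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def residuals_from_line(x, y):
    """Residuals of the least-squares line fitted to (x, y)."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def cumulative_residual_process(x, resid):
    """Running sums of residuals taken in increasing order of x, scaled by 1/sqrt(n)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total, path = 0.0, []
    for i in order:
        total += resid[i]
        path.append(total / math.sqrt(n))
    return path


def sup_abs(x, resid):
    """Largest absolute value reached by the cumulative residual process."""
    return max(abs(w) for w in cumulative_residual_process(x, resid))


def simulated_p_value(x, y, n_sim=500, seed=0):
    """Share of simulated null processes whose supremum reaches the observed one."""
    rng = random.Random(seed)
    resid = residuals_from_line(x, y)
    observed = sup_abs(x, resid)
    exceed = 0
    for _ in range(n_sim):
        # Perturb each residual with a standard normal multiplier (assumed scheme),
        # then remove the part a refitted line would absorb.
        perturbed = [e * rng.gauss(0.0, 1.0) for e in resid]
        null_resid = residuals_from_line(x, perturbed)
        if sup_abs(x, null_resid) >= observed:
            exceed += 1
    return exceed / n_sim
```

A small p-value suggests that the trend in the cumulative residuals is larger than natural variation would produce, for example when a quadratic relationship has been fitted with a straight line.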

2.
Vector recursive residuals are developed for multivariate regression models on a field. A vector response variable is observed at points on a rectangular grid, together with regression variables measured at the same points. Neighbouring values of the response vector may be correlated, and simple models are considered using a direct product structure for the variance matrix. After the vector recursive residuals are obtained, principal component analysis is applied to evaluate any changes that may be occurring in the regression relationship over the field. The method is then applied to the problem of detecting zones of bush fire damage and recovery from LANDSAT data.

3.
Residuals are frequently used to evaluate the validity of the assumptions of statistical models and may also be employed as tools for model selection. For standard (normal) linear models, for example, residuals are used to verify homoscedasticity, linearity of effects, presence of outliers, normality and independence of the errors. Similar uses may be envisaged for three types of residuals that emerge from the fitting of linear mixed models. We review some of the residual analysis techniques that have been used in this context and propose a standardization of the conditional residual that is useful for identifying outlying observations and clusters. We illustrate the procedures with a practical example.

4.
We develop an approach for the exploratory analysis of gene expression data, based upon blind source separation techniques. This approach exploits higher-order statistics to identify a linear model for (logarithms of) expression profiles, described as linear combinations of "independent sources." As a result, it yields "elementary expression patterns" (the "sources"), which may be interpreted as potential regulation pathways. Further analysis of the so-obtained sources shows that they are generally characterized by a small number of specific coexpressed or antiexpressed genes. In addition, the projections of the expression profiles onto the estimated sources often provide significant clustering of conditions. The algorithm relies on a large number of runs of "independent component analysis" with random initializations, followed by a search for "consensus sources." It then provides estimates for independent sources, together with an assessment of their robustness. The results obtained on two datasets (namely, breast cancer data and Bacillus subtilis sulfur metabolism data) show that some of the obtained gene families correspond to well-known families of coregulated genes, which validates the proposed approach.

5.
We propose a new method to estimate and correct for phylogenetic inertia in comparative data analysis. The method, called phylogenetic eigenvector regression (PVR), starts by performing a principal coordinate analysis on a pairwise phylogenetic distance matrix between species. Traits under analysis are regressed on eigenvectors retained by a broken-stick model in such a way that estimated values express phylogenetic trends in data and residuals express independent evolution of each species. This partitioning is similar to that realized by the spatial autoregressive method, but the method proposed here overcomes the problem of low statistical performance that occurs with the autoregressive method when phylogenetic correlation is low or when sample size is too small to detect it. Also, PVR is easier to perform with large samples because it is based on well-known techniques of multivariate and regression analyses. We evaluated the performance of PVR and compared it with the autoregressive method using real datasets and simulations. A detailed worked example using body size evolution of Carnivora mammals indicated that phylogenetic inertia in this trait is elevated and similarly estimated by both methods. In this example, the Type I error of PVR at α = 0.05 was 0.048, but an increase in the number of eigenvectors used in the regression increases the error. Also, similarity between PVR and the autoregressive method, defined by correlation between their residuals, decreased when the number of eigenvalues necessary to express the phylogenetic distance matrix was overestimated. To evaluate the influence of cladogram topology on the distribution of eigenvalues extracted from the double-centered phylogenetic distance matrix, we analyzed 100 randomly generated cladograms (up to 100 species). Multiple linear regression of log-transformed variables indicated that the number of eigenvalues extracted by the broken-stick model can be fully explained by cladogram topology. Therefore, the broken-stick model is an adequate criterion for determining the correct number of eigenvectors to be used by PVR. We also simulated distinct levels of phylogenetic inertia by producing a trend across 10, 25, and 50 species arranged in “comblike” cladograms and then adding random vectors with increased residual variances around this trend. In doing so, we provide an evaluation of the performance of both methods with data generated under different evolutionary models than those tested previously. The results showed that both PVR and the autoregressive method are efficient in detecting inertia in data when sample size is relatively high (more than 25 species) and when phylogenetic inertia is high. However, PVR is more efficient at smaller sample sizes and when the level of phylogenetic inertia is low. These conclusions were also supported by the analysis of 10 real datasets regarding body size evolution in different animal clades. We concluded that PVR can be a useful alternative to an autoregressive method in comparative data analysis.
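The broken-stick criterion used to choose how many eigenvectors PVR retains can be sketched as follows. The function names are hypothetical, and two conventions are assumptions rather than details taken from the abstract: non-positive eigenvalues (which a principal coordinate analysis can produce) are dropped before computing proportions, and retention stops at the first eigenvalue that does not exceed its broken-stick expectation.

```python
def broken_stick_expectations(p):
    """Expected variance share of components 1..p when a unit stick is broken at random."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]


def retained_by_broken_stick(eigenvalues):
    """Number of leading eigenvalues whose share exceeds the broken-stick expectation."""
    # Assumption: only positive eigenvalues take part in the comparison.
    positive = sorted((ev for ev in eigenvalues if ev > 0), reverse=True)
    total = sum(positive)
    expected = broken_stick_expectations(len(positive))
    kept = 0
    for ev, share in zip(positive, expected):
        if ev / total <= share:
            break  # assumed stopping rule: stop at the first failure
        kept += 1
    return kept
```

With three positive eigenvalues the expected shares are about 0.611, 0.278 and 0.111, so eigenvalues of 8, 1 and 1 would retain only the first axis.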

6.
Summary. The majority of the statistical literature for the joint modeling of longitudinal and time-to-event data has focused on the development of models that aim at capturing specific aspects of the motivating case studies. However, little attention has been given to the development of diagnostic and model-assessment tools. The main difficulty in using standard model diagnostics in joint models is the nonrandom dropout in the longitudinal outcome caused by the occurrence of events. In particular, the reference distribution of statistics, such as the residuals, in missing data settings is not directly available and complex calculations are required to derive it. In this article, we propose a multiple-imputation-based approach for creating multiple versions of the completed data set under the assumed joint model. Residuals and diagnostic plots for the complete data model can then be calculated based on these imputed data sets. Our proposals are exemplified using two real data sets.

7.
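A new method is proposed for analyzing the age groups of a captive alpine musk deer population using visible and near-infrared (Vis-NIR) reflectance spectra of feces. Spectral data were collected from 145 alpine musk deer fecal samples (45 from adults, and 50 each from subadults and juveniles) with a FieldSpec® 3 field spectrometer, and the samples were randomly divided into a training set (100 samples) and a test set (45 samples). After Savitzky-Golay smoothing and first-derivative processing, the dimensionality of the spectra was reduced by principal component analysis (PCA). Using the first six principal components (which retain 95.00% of the information in the original spectra) as new variables, age-group models were built from the training samples with three methods: Fisher linear discriminant analysis, Bayes stepwise discriminant analysis, and a back-propagation artificial neural network (BP-ANN). Predictions for the 45 unknown samples in the test set showed that the BP-ANN model had the highest accuracy, 84.44%. For all three methods, accuracy was highest for juvenile fecal samples, reaching up to 93.33%. The analysis showed that subadult samples are transitional in character, whereas juvenile and adult samples are easy to distinguish. The results indicate that rapid, non-contact discrimination of alpine musk deer age groups from the Vis-NIR reflectance spectra of feces is feasible, and that PCA combined with BP-ANN is a preferred method.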

8.
This article is the first of a series of articles detailing the development of near-infrared (NIR) methods for solid-dosage form analysis. Experiments were conducted at the Duquesne University Center for Pharmaceutical Technology to qualify the capabilities of instrumentation and sample handling systems, evaluate the potential effect of one source of a process signature on calibration development, and compare the utility of reflection and transmission data collection methods. A database of 572 production-scale sample spectra was used to evaluate the interbatch spectral variability of samples produced under routine manufacturing conditions. A second database of 540 spectra from samples produced under various compression conditions was analyzed to determine the feasibility of pooling spectral data acquired from samples produced at diverse scales. Instrument qualification tests were performed, and appropriate limits for instrument performance were established. To evaluate the repeatability of the sample positioning system, multiple measurements of a single tablet were collected. With the application of appropriate spectral preprocessing techniques, sample repositioning error was found to be insignificant with respect to NIR analyses of product quality attributes. Sample shielding was demonstrated to be unnecessary for transmission analyses. A process signature was identified in the reflection data. Additional tests demonstrated that the process signature was largely orthogonal to spectral variation due to hardness. Principal component analysis of the compression sample set data demonstrated the potential for quantitative model development. For the data sets studied, reflection analysis was demonstrated to be more robust than transmission analysis. Published: October 6, 2005. The views presented in this article do not necessarily reflect those of the Food and Drug Administration.

9.
Here we focus on factor analysis from a best-practices point of view, by investigating the factor structure of neuropsychological tests and using the results obtained to illustrate how to choose a reasonable solution. The sample (n=1051 individuals) was randomly divided into two groups: one for exploratory factor analysis (EFA) and principal component analysis (PCA), to investigate the number of factors underlying the neurocognitive variables; the second to test the “best fit” model via confirmatory factor analysis (CFA). For the exploratory step, three extraction methods (maximum likelihood, principal axis factoring and principal components) and two rotation methods (orthogonal and oblique) were used. The analysis methodology allowed us to explore how different cognitive/psychological tests correlated/discriminated between dimensions, indicating that to capture latent structures in similar sample sizes and measures, with approximately normal data distribution, reflective models with oblimin rotation might prove the most adequate.

10.
Linear mixed effects models have been widely used in the analysis of data where responses are clustered around some random effects, so that it is not reasonable to assume independence between observations in the same cluster. In most biological applications, it is assumed that the distributions of the random effects and of the residuals are Gaussian. This makes inferences vulnerable to the presence of outliers. Here, linear mixed effects models with normal/independent residual distributions for robust inferences are described. Specific distributions examined include univariate and multivariate versions of the Student-t, the slash and the contaminated normal. A Bayesian framework is adopted and Markov chain Monte Carlo (MCMC) is used to carry out the posterior analysis. The procedures are illustrated using birth weight data on rats in a toxicological experiment. Results from the Gaussian and robust models are contrasted, and it is shown how the implementation can be used for outlier detection. The thick-tailed distributions provide an appealing robust alternative to the Gaussian process in linear mixed models, and they are easily implemented using data augmentation and MCMC techniques.

11.
Application of a combined ARIMA and SVM model to insect pest forecasting
Xiang Changsheng (向昌盛)  Zhou Ziying (周子英) 《Acta Entomologica Sinica》(昆虫学报) 2010,53(9):1055-1060
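Insect pest occurrence forms a complex, dynamic time series. A single forecasting model assumes either linear or nonlinear data, cannot capture the linear and nonlinear patterns of pest occurrence at the same time, and so rarely achieves the desired forecasting accuracy. In this study, an autoregressive integrated moving average (ARIMA) model was first used to model the linear component of the insect occurrence time series, a support vector machine (SVM) was then used to model the nonlinear component, and the two models were finally combined to produce the forecast. The combined model was applied to forecasting the occurrence area of the pine caterpillar Dendrolimus punctatus; the experimental results show that its forecasting accuracy is clearly better than that of either single model, drawing on the strengths of both. The combined model is a practical and feasible method for pest forecasting.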

12.
This study introduces a more recent data analysis method, the Hilbert-Huang transform (HHT), to describe contaminant concentration data of a non-stationary and non-linear nature. In order to improve the modeling of the contaminant concentrations, it is proposed to first process the data using the empirical mode decomposition (EMD) method from HHT to obtain a collection of intrinsic mode functions (IMFs), which can then be modeled separately using either autoregressive moving average (ARMA) models expanded with a seasonal term, or linear regression analysis, depending on the nature of the IMF. Three priority contaminants measured at Niagara-on-the-Lake are selected for this study. It is found that the trend of fluoranthene concentrations from April 1986 to March 1997 is first decreasing and then beginning to increase, while the 1,2,4-trichlorobenzene and dieldrin concentrations are both decreasing. With HHT, appropriate time series models can be identified and constructed for the studied contaminant concentrations to better illustrate the variability of each IMF (and thus the contaminant concentrations) for the studied period. For all data sets modeled in this study, pre-processing the data with HHT gave higher R² values and correlation coefficients and lower sums of squared errors than modeling without HHT. It is thus confirmed that, for the studied data sets, pre-processing the data with HHT and modeling with time series analysis provides a more effective means of identifying and analyzing the trends and variability of contaminant concentrations in the Niagara River.

13.
14.
MOTIVATION: Microarray data are used in a range of application areas in biology, although they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods. RESULTS: The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) datasets and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm. AVAILABILITY: The CMVE software is available upon request from the authors.
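The evaluation protocol in this abstract (hide known entries at a chosen probability, impute them, and score the result with an NRMS error) can be sketched as below. The abstract does not give the NRMS formula, so the sketch assumes one common definition: the root mean square of the imputation errors divided by the root mean square of the true values, both taken over the hidden entries. The imputer shown is a plain row-mean baseline used only as a placeholder; it is not CMVE, BPCA, LSImpute or KNN, and all function names are hypothetical.

```python
import math
import random


def mask_at_random(matrix, prob, seed=0):
    """Hide each entry independently with probability `prob`; returns (masked copy, hidden positions)."""
    rng = random.Random(seed)
    masked = [row[:] for row in matrix]
    hidden = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < prob:
                row[j] = None
                hidden.append((i, j))
    return masked, hidden


def row_mean_impute(masked):
    """Placeholder imputer: fill each missing entry with the mean of the known entries in its row."""
    filled = []
    for row in masked:
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known) if known else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def nrms_error(truth, estimate, hidden):
    """Assumed NRMS definition: RMS of the errors over RMS of the true values, on hidden entries only."""
    if not hidden:
        raise ValueError("no hidden entries to score")
    num = sum((truth[i][j] - estimate[i][j]) ** 2 for i, j in hidden)
    den = sum(truth[i][j] ** 2 for i, j in hidden)
    return math.sqrt(num / den)
```

Repeating the three steps for missing-value probabilities between 0.01 and 0.2 would give one error curve per imputation method, which is the kind of comparison the abstract reports.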

15.
Analyzing growth components in trees
Observed growth, as given, for instance, by the length of successive annual shoots along the main axis of a plant, is mainly the result of two components: an ontogenetic component and an environmental component. An open question is whether the ontogenetic component along an axis at the growth unit or annual shoot scale takes the form of a trend or of a succession of phases. Various methods of analysis ranging from exploratory analysis (symmetric smoothing filters, sample autocorrelation functions) to statistical modeling (multiple change-point models, hidden semi-Markov chains and a hidden hybrid model combining Markovian and semi-Markovian states) are applied to extract and characterize both the ontogenetic and environmental components using contrasting examples. This led us in particular to favor the hypothesis of an ontogenetic component structured as a succession of stationary phases and to highlight phase changes of high magnitude in unexpected situations (for instance, when growth globally decreases). These results shed new light on botanical concepts such as "phase change" and "morphogenetic gradient".

16.
The advent of cheap, powerful microcomputer systems makes the analysis of data via sophisticated techniques available to personnel who are not specialists in computing systems. The DAMP package described here is intended for use on personal computers and has therefore been written in BASIC for portability. The analysis techniques are powerful, comprising algorithms to perform sample-data generation, plotting displays, digital data filtering, auto-correlation functions, fast Fourier transforms and autoregressive modelling. The last technique contains a number of options including the display of z-plane plots, frequency response of the model, residual plotting and auto-correlation of the residuals. Illustrative results are shown from psychological mood data and rat locomotor activity. The package is designed both to instruct a user in the techniques of spectral analysis, and also to provide a range of methods for investigating time and frequency behaviour of biomedical data.
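Two of the techniques listed in this abstract, the auto-correlation function and autoregressive modelling, can be sketched in a few lines. This is an illustration in Python rather than the BASIC of the DAMP package, whose actual algorithms are not described in the abstract; fitting the autoregressive coefficients by the Yule-Walker equations with the Levinson-Durbin recursion is an assumed choice, and the function names are hypothetical.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation r[0..max_lag] of a series about its mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients phi[0..order-1] from autocorrelations r."""
    phi = []
    err = r[0]
    for k in range(1, order + 1):
        # Part of r[k] not yet explained by the current lower-order model.
        acc = r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))
        reflection = acc / err
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)] + [reflection]
        err *= 1.0 - reflection ** 2
    return phi


def ar_residuals(series, phi):
    """One-step prediction errors of the fitted AR model on the mean-centred series."""
    mean = sum(series) / len(series)
    dev = [v - mean for v in series]
    p = len(phi)
    return [dev[t] - sum(phi[j] * dev[t - 1 - j] for j in range(p))
            for t in range(p, len(dev))]
```

Passing the output of `ar_residuals` back through `autocorrelation` gives the auto-correlation of the residuals, which should be close to zero at non-zero lags when the model order is adequate.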

17.
D. N. Alstad 《Hydrobiologia》1981,79(2):137-140
Sampling and statistical techniques are presented to identify nonrandom distributional patterns resulting from microhabitat selection by stream insects. The method is based on the frequency of conspecific combinations in a series of nearest-neighbor pairs. Its use is demonstrated with data from a Rocky Mountain caddisfly community.
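One way a comparison of conspecific nearest-neighbor pairs against random expectation could be set up is sketched below. The abstract does not state the article's own statistic or test, so everything here is an assumption made for illustration: the expected fraction treats the sampled individuals as paired at random without replacement, the p-value comes from reshuffling species labels among the sampled individuals, and the function names are hypothetical.

```python
import random


def conspecific_count(pairs):
    """Number of nearest-neighbor pairs whose two members belong to the same species."""
    return sum(1 for a, b in pairs if a == b)


def expected_conspecific_fraction(pairs):
    """Chance that two individuals drawn without replacement from the sample are conspecific."""
    counts = {}
    for pair in pairs:
        for species in pair:
            counts[species] = counts.get(species, 0) + 1
    n = 2 * len(pairs)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def permutation_p_value(pairs, n_perm=2000, seed=0):
    """One-sided p-value for an excess of conspecific pairs under random re-pairing."""
    rng = random.Random(seed)
    labels = [species for pair in pairs for species in pair]
    observed = conspecific_count(pairs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = zip(labels[0::2], labels[1::2])
        if conspecific_count(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)
```

A small p-value would indicate that conspecifics occur as nearest neighbors more often than random mixing predicts, the kind of nonrandom pattern the abstract attributes to microhabitat selection.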

18.
In areas of the North Pacific that are largely free of overfishing, climate regime shifts – abrupt changes in modes of low‐frequency climate variability – are seen as the dominant drivers of decadal‐scale ecological variability. We assessed the ability of leading modes of climate variability [Pacific Decadal Oscillation (PDO), North Pacific Gyre Oscillation (NPGO), Arctic Oscillation (AO), Pacific‐North American Pattern (PNA), North Pacific Index (NPI), El Niño‐Southern Oscillation (ENSO)] to explain decadal‐scale (1965–2008) patterns of climatic and biological variability across two North Pacific ecosystems (Gulf of Alaska and Bering Sea). Our response variables were the first principal component (PC1) of four regional climate parameters [sea surface temperature (SST), sea level pressure (SLP), freshwater input, ice cover], and PCs 1–2 of 36 biological time series [production or abundance for populations of salmon (Oncorhynchus spp.), groundfish, herring (Clupea pallasii), shrimp, and jellyfish]. We found that the climate modes alone could not explain ecological variability in the study region. Both linear models (for climate PC1) and generalized additive models (for biology PC1–2) invoking only the climate modes produced residuals with significant temporal trends, indicating that the models failed to capture coherent patterns of ecological variability. However, when the residual climate trend and a time series of commercial fishery catches were used as additional candidate variables, the resulting models of biology PC1–2 satisfied assumptions of independent residuals and out‐performed models constructed from the climate modes alone in terms of predictive power. As measured by effect size and Akaike weights, the residual climate trend was the most important variable for explaining biology PC1 variability, and commercial catch the most important variable for biology PC2. Patterns of climate sensitivity and exploitation history for taxa strongly associated with biology PC1–2 suggest plausible mechanistic explanations for these modeling results. Our findings suggest that, even in the absence of overfishing and in areas strongly influenced by internal climate variability, climate regime shift effects can only be understood in the context of other ecosystem perturbations.

19.
A number of circular regression models have been proposed in the literature. In recent years, there has been strong interest in outlier detection in circular regression. An outlier detection procedure can be developed by defining a new statistic in terms of the circular residuals. In this paper, we propose a new measure which transforms the circular residuals into linear measures using a trigonometric function. We then employ the row deletion approach to identify the observations that affect the measure the most, which are candidate outliers. The corresponding cut-off points and the performance of the detection procedure when applied to Down and Mardia’s model are studied via simulations. For illustration, we apply the procedure to circadian data.
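The row deletion idea in this abstract can be sketched with a deliberately simple stand-in model. The abstract does not specify which trigonometric function is used, so the sketch assumes the measure is the mean of one minus the cosine of each circular residual; it also replaces Down and Mardia's model with a constant angular shift, v = u + alpha (mod 2π), fitted by the circular mean of v − u. The function names are hypothetical and no cut-off points are derived.

```python
import math


def fit_shift(u, v):
    """Estimate alpha in v = u + alpha (mod 2*pi) as the circular mean of v - u."""
    s = sum(math.sin(b - a) for a, b in zip(u, v))
    c = sum(math.cos(b - a) for a, b in zip(u, v))
    return math.atan2(s, c)


def mean_cosine_error(u, v, alpha):
    """Assumed measure: mean of 1 - cos(circular residual); 0 means a perfect fit."""
    return sum(1.0 - math.cos(b - a - alpha) for a, b in zip(u, v)) / len(u)


def deletion_influence(u, v):
    """Absolute change in the measure when each observation is deleted and the model refitted."""
    full = mean_cosine_error(u, v, fit_shift(u, v))
    influence = []
    for i in range(len(u)):
        u_rest = u[:i] + u[i + 1:]
        v_rest = v[:i] + v[i + 1:]
        reduced = mean_cosine_error(u_rest, v_rest, fit_shift(u_rest, v_rest))
        influence.append(abs(full - reduced))
    return influence
```

The observation with the largest influence value is the first candidate outlier; in practice the value would be compared with a simulated cut-off point, as the abstract describes.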

20.