Similar literature
 20 similar articles found (search time: 15 ms)
1.
Nardi A  Schemper M 《Biometrics》1999,55(2):523-529
The identification of individuals who 'died far too early' or 'lived far too long' as compared to their survival probabilities from a Cox regression can lead to the detection of new prognostic factors. Methods to identify outliers are generally based on residuals. For Cox regression, only deviance residuals have been considered for this purpose, but we show that these residuals are not very suitable. Instead, we develop and propose two new types of residuals: the suggested log-odds and normal deviate residuals are simple and intuitively appealing and their theoretical properties and empirical performance make them very suitable for outlier identification. Finally, various practical aspects of screening for individuals with outlying survival times are discussed by means of a cancer study example.

2.
Emi Tanaka 《Biometrics》2020,76(4):1374-1382
The aim of plant breeding trials is often to identify crop varieties that are well adapted to target environments. These varieties are identified through genomic prediction from the analysis of multi-environmental field trials (MET) using linear mixed models. The occurrence of outliers in MET is common and known to adversely impact the accuracy of genomic prediction, yet the detection of outliers is often neglected. There are several reasons for this. First, complex data such as a MET give rise to distinct levels of residuals (e.g., at a trial level or individual observation level). This complexity poses additional challenges for an outlier detection method. Second, many linear mixed model software packages that cater for the complex variance structures needed in the analysis of MET are not well streamlined for diagnostics by practitioners. We demonstrate outlier detection methods that are simple to implement in any linear mixed model software package and computationally fast. Although these methods are not optimal for outlier detection, they offer practical value through ease of application in the analysis pipeline of regularly collected data. These are demonstrated using simulation based on two real bread wheat yield METs. In particular, models that consider analysis of yield trials either independently or jointly (thus borrowing strength across trials) are considered. Case studies are presented to highlight the benefit of joint analysis for outlier detection.

3.
Features selection and architecture optimization in connectionist systems (cited 1 time in total: 0 self-citations, 1 citation by others)
In this paper, we propose a features selection measure and an architecture optimization procedure for Multi-Layer Perceptrons (MLP). The algorithm presented in this contribution employs a heuristic measure named HVS (Heuristic for Variable Selection). This new measure allows us to identify and select important variables in the features space. This can be achieved by eliminating redundant features and those which do not contain enough relevant information. The proposed measure is used in a new procedure aimed at selecting the "best" MLP architecture given an initial structure. Application results for two generic problems, regression and discrimination, demonstrate the proposed selection algorithm's effectiveness in identifying optimized connectionist models with higher accuracy. Finally, an extension of HVS, named epsilonHVS, is proposed for discriminative features detection and architecture optimization for Time Delay Neural Networks models (TDNN).

4.

Aim

Species distribution data play a pivotal role in the study of ecology, evolution, biogeography and biodiversity conservation. Although large amounts of location data are available and accessible from public databases, data quality remains problematic. Of the potential sources of error, positional errors are critical for spatial applications, particularly where these errors place observations beyond the environmental or geographical range of species. These outliers need to be identified, checked and removed to improve data quality and minimize the impact on subsequent analyses. Manually checking all species records within large multispecies datasets is prohibitively costly. This work investigates algorithms that may assist in the efficient vetting of outliers in such large datasets.

Location

We used real, spatially explicit environmental data derived from the western part of Victoria, Australia, and simulated species distributions within this same region.

Methods

By adapting species distribution modelling (SDM), we developed a pseudo‐SDM approach for detecting outliers in species distribution data, which was implemented with random forest (RF) and support vector machine (SVM) resulting in two new methods: RF_pdSDM and SVM_pdSDM. Using virtual species, we compared eight existing multivariate outlier detection methods with these two new methods under various conditions.

Results

The two new methods based on the pseudo‐SDM approach had higher true skill statistic (TSS) values than other approaches, with TSS values always exceeding 0. More than 70% of the true outliers in datasets for species with a low and intermediate prevalence can be identified by checking 10% of the data points with the highest outlier scores.

Main conclusions

Pseudo‐SDM‐based methods were more effective than other outlier detection methods. However, this outlier detection procedure can only be considered as a screening tool, and putative outliers must be examined by experts to determine whether they are actual errors or important records within an inherently biased set of data.
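
Editorial sketch: the true skill statistic (TSS) used in the Results is conventionally defined as sensitivity plus specificity minus one, so it ranges from -1 to 1 and a value of 0 means no better than chance. The snippet below is a minimal illustration of that definition together with the "check the top 10% of outlier scores" screening step. The function names and the toy data are hypothetical and are not taken from the paper; the outlier scores themselves would come from whichever detection method is being evaluated.

```python
def true_skill_statistic(is_outlier, flagged):
    """TSS = sensitivity + specificity - 1 for two equal-length boolean sequences."""
    pairs = list(zip(is_outlier, flagged))
    tp = sum(1 for t, f in pairs if t and f)          # true outliers that were flagged
    fn = sum(1 for t, f in pairs if t and not f)      # true outliers that were missed
    tn = sum(1 for t, f in pairs if not t and not f)  # valid records left alone
    fp = sum(1 for t, f in pairs if not t and f)      # valid records wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true outlier and one true non-outlier")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def flag_top_fraction(scores, fraction=0.10):
    """Flag the given fraction of records with the highest outlier scores."""
    n_flag = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:n_flag])
    return [i in top for i in range(len(scores))]
```

With simulated (virtual) species the true outlier labels are known, so `true_skill_statistic(truth, flag_top_fraction(scores))` can be compared across methods; with real records only the flagged subset would be passed to experts for checking.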

5.
Outlier detection and cleaning procedures were evaluated to estimate mathematical restricted variogram models with discrete insect population count data. Because variogram modeling is significantly affected by outliers, methods to detect and clean outliers from data sets are critical for proper variogram modeling. In this study, we examined spatial data in the form of discrete measurements of insect counts on a rectangular grid. Data sets for two well-known insect pests were analyzed: one for the western flower thrips, Frankliniella occidentalis (Pergande), on greenhouse cucumbers and the other for the greenhouse whitefly, Trialeurodes vaporariorum (Westwood), on greenhouse cherry tomatoes. A spatial additive outlier model was constructed to detect outliers in both the isolated and patchy spatial distributions of outliers, and the outliers were cleaned with the neighboring median cleaner. To analyze the effect of outliers, we compared the relative nugget effects of data cleaned of outliers and data still containing outliers after transformation. In addition, the correlation coefficients between the actual and predicted values were compared using the leave-one-out cross-validation method with data cleaned of outliers and non-cleaned data after unbiased back transformation. The outlier detection and cleaning procedure improved geostatistical analysis, particularly by reducing the nugget effect, which greatly impacts the prediction variance of kriging. Consequently, the outlier detection and cleaning procedures used here improved the results of geostatistical analysis with highly skewed and extremely fluctuating data, such as insect counts.
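
Editorial sketch: the abstract names a "neighboring median cleaner" without defining it. The snippet below shows one plausible reading, under stated assumptions: counts sit on a rectangular grid, outlier cells have already been detected by some other step, and each flagged cell is replaced by the median of its unflagged neighbors in the surrounding 3x3 window. The window size, the exclusion of other flagged cells, and the function name are all assumptions and may differ from the procedure used in the study.

```python
from statistics import median


def clean_with_neighbor_median(grid, outlier_cells):
    """Return a copy of grid with each flagged (row, col) cell replaced by the
    median of its unflagged neighbors in the 3x3 window around it."""
    n_rows, n_cols = len(grid), len(grid[0])
    flagged = set(outlier_cells)
    cleaned = [row[:] for row in grid]  # copy so the input grid is not modified
    for r, c in flagged:
        neighbors = [
            grid[rr][cc]
            for rr in range(max(0, r - 1), min(n_rows, r + 2))
            for cc in range(max(0, c - 1), min(n_cols, c + 2))
            if (rr, cc) != (r, c) and (rr, cc) not in flagged
        ]
        if neighbors:  # a cell whose neighbors are all flagged is left unchanged
            cleaned[r][c] = median(neighbors)
    return cleaned
```

Excluding other flagged cells from the window is a design choice aimed at patchy outliers, where several adjacent cells are contaminated and would otherwise pull the median upward.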

6.
The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that utilizes a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data, and implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.

7.
Isolation by distance is usually tested by the correlation of genetic and geographic distances separating all pairwise combinations of populations. However, this method can be significantly biased by only a few highly diverged populations and loses information about individual populations. To detect outlier populations and investigate the relative strengths of gene flow and genetic drift for each population, we propose a decomposed pairwise regression analysis. This analysis was applied to the well-described one-dimensional stepping-stone system of stream-dwelling Dolly Varden charr (Salvelinus malma). When genetic and geographic distances were plotted for all pairs of 17 tributary populations, the correlation was significant but weak (r² = 0.184). Seven outlier populations were determined based on the systematic bias of the regression residuals, followed by Akaike's information criterion. The best model, which included 10 populations, showed a strong pattern of isolation by distance (r² = 0.758), suggesting equilibrium between gene flow and genetic drift in these populations. Each outlier population was also analysed by plotting pairwise genetic and geographic distances against the 10 nonoutlier populations, and categorized into one of three patterns: strong genetic drift, genetic drift with limited gene flow, and a high level of gene flow. These classifications were generally consistent with a priori predictions for each population (physical barrier, population size, anthropogenic impacts). Combining the genetic analysis with field observations, Dolly Varden in this river appeared to form a mainland-island or source-sink metapopulation structure. The generality of the method should benefit many types of spatial genetic analyses.

8.
Jordi Peig  Andy J. Green 《Oikos》2009,118(12):1883-1891
Body condition is assumed to influence an animal's health and fitness. Various non‐destructive methods based on body mass and a measure of body length have been used as condition indices (CIs), but the dominant method amongst ecologists is currently the calculation of residuals from an ordinary least squares (OLS) regression of body mass against length. Recent studies of energy reserves in small mammals and starlings claimed to validate this method, although we argue that they did not include the most appropriate tests since they compared the CI with the absolute size of energy reserves. We present a novel CI (the ‘scaled mass index’) based on the central principle of scaling, with important methodological, biological and conceptual advantages. Through a reanalysis of data from small mammals, starlings and snakes, we show that the scaled mass index is a better indicator of the relative size of energy reserves and other body components than OLS residuals, performing better in all seven species and in 19 out of 20 analyses. We also present an empirical and theoretical comparison of the scaled mass index and OLS residuals as CIs. We argue that the scaled mass index is a useful new tool for ecologists.
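
Editorial sketch: the abstract compares two condition indices without giving formulas. The snippet below illustrates both under stated assumptions. The OLS index is taken as the residual from a regression of ln(mass) on ln(length); the log transformation is an assumption, since the abstract only says "body mass against length". The scaled mass index is written in its commonly quoted form, mass_i * (L0 / length_i) ** b_SMA, where b_SMA is the standardised major axis slope of ln(mass) on ln(length) and L0 is a reference length (here the mean length by default). Function names and defaults are hypothetical.

```python
import math


def _log_moments(mass, length):
    """Sums of squares and cross-products of ln(length) (x) and ln(mass) (y)."""
    x = [math.log(v) for v in length]
    y = [math.log(v) for v in mass]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return x, y, mx, my, sxx, syy, sxy


def ols_residual_index(mass, length):
    """Condition index as the residual of ln(mass) from its OLS fit on ln(length)."""
    x, y, mx, my, sxx, _, sxy = _log_moments(mass, length)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def scaled_mass_index(mass, length, ref_length=None):
    """Mass each animal would have at the reference length, scaled with the SMA slope."""
    _, _, _, _, sxx, syy, sxy = _log_moments(mass, length)
    # SMA slope: ratio of standard deviations, carrying the sign of the covariance.
    b_sma = math.copysign(math.sqrt(syy / sxx), sxy)
    l0 = sum(length) / len(length) if ref_length is None else ref_length
    return [m * (l0 / l) ** b_sma for m, l in zip(mass, length)]
```

The two indices differ in kind: the OLS residual is an additive deviation on the log scale, whereas the scaled mass index is expressed in mass units at a common body size, which is what makes it comparable across individuals of different length.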

9.
In this paper, we propose a simple parametric modal linear regression model where the response variable is gamma distributed, using a new parameterization of this distribution that is indexed by mode and precision parameters. That is, in this new regression model, the modal and precision responses are related to a linear predictor through a link function, and the linear predictor involves covariates and unknown regression parameters. The main advantage of our new parameterization is the straightforward interpretation of the regression coefficients in terms of the mode of the positive response variable, as is usual in the context of generalized linear models, and direct inference in parametric mode regression based on the likelihood paradigm. Furthermore, we discuss residuals and influence diagnostic tools. A Monte Carlo experiment is conducted to evaluate the performance of these estimators in finite samples, with a discussion of the results. Finally, we illustrate the usefulness of the new model by two applications, to biology and demography.

10.
It is important to preprocess high-throughput data generated from mass spectrometry experiments in order to obtain a successful proteomics analysis. Outlier detection is an important preprocessing step. A naive outlier detection approach may miss many true outliers and instead select many non-outliers because of the heterogeneity of the variability observed commonly in high-throughput data. Because of this issue, we developed an outlier detection software program accounting for the heterogeneous variability by utilizing linear, non-linear and non-parametric quantile regression techniques. Our program was developed using the R computer language. As a consequence, it can be used interactively and conveniently in the R environment. AVAILABILITY: An R package, OutlierD, is available at the Bioconductor project at http://www.bioconductor.org

11.
Microarray technologies allow for simultaneous measurement of DNA copy number at thousands of positions in a genome. Gains and losses of DNA sequences reveal themselves through characteristic patterns of hybridization intensity. To identify change points along the chromosomes, we develop a marker clustering method which consists of 2 parts. First, a "circular clustering tree test statistic" attaches a statistic to each marker that measures the likelihood that it is a change point. Then construction of the marker statistics is followed by outlier detection approaches. The method provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform. A simulation study shows good performance in change-point detection, and cancer cell line data are used to illustrate performance when regions of true copy number changes are known.

12.
Many analyses do not consider the problems associated with the effects of population size on encounter recording. Population size could impact on the detection of bird arrival time as there is a higher probability of observing earlier arrival when the population size is greater and the song activity of birds is increased, as occurs with a larger population. As a case study, we have analysed data on the red-backed shrike Lanius collurio collected in Western Poland during 1983–2000. In this period the red-backed shrike’s return to its breeding sites became significantly earlier whilst the contemporary population size increased significantly. To eliminate linear trends through time we have worked on the standardised residuals from regression of both arrival time and population size on year. The correlation between arrival time and population size residuals was significantly negative, further supporting the link between detection and population size. This finding suggests that, in studies of avian migration and its changes over time, the relationship between arrival date and population size needs to be considered. Received: 25 October 2000 / Revised: 5 September 2001 / Accepted: 5 September 2001
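
Editorial sketch: the detrending step described in this abstract (regress each series on year, then correlate the residuals) can be illustrated as follows. This is a minimal sketch, not the authors' code. "Standardised" is taken here to mean division by the residual standard error with n - 2 degrees of freedom, which is an assumption; the Pearson correlation is unaffected by that scaling. All function names are hypothetical.

```python
import math


def ols_residuals(x, y):
    """Residuals of y from a straight-line least squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def standardise(residuals, n_params=2):
    """Divide residuals by the residual standard error (assumed definition)."""
    dof = len(residuals) - n_params
    scale = math.sqrt(sum(r * r for r in residuals) / dof)
    return [r / scale for r in residuals]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def detrended_correlation(years, arrival_day, population_size):
    """Correlation between the year-detrended arrival and population series."""
    arrival_res = standardise(ols_residuals(years, arrival_day))
    population_res = standardise(ols_residuals(years, population_size))
    return pearson(arrival_res, population_res)
```

Removing the linear trend first matters because two series that both drift over time will correlate even when their year-to-year fluctuations are unrelated; a negative value from `detrended_correlation` indicates that years with larger than expected populations also had earlier than expected recorded arrivals.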

13.
Wei WH  Su JS 《Biometrics》1999,55(4):1295-1299
Deletion diagnostics are developed for identifying observations that influence the estimates of regression parameters and the mixture parameter in the families of relative risk functions for failure time data. The diagnostic for the regression parameters is a generalization of Cain and Lange's (1984, Biometrics 40, 493-499) measure of individual influence. The generalizations of martingale residuals, Schoenfeld's partial residuals (1982, Biometrika 69, 239-241), and score residuals by Therneau, Grambsch, and Fleming (1990, Biometrika 77, 147-160) are also obtained. The influence of some observations on regression parameters can be drastically modified as the mixture parameter changes, even for a very small change. In addition, adding or deleting some observations might result in choosing different models. The diagnostics are applied to a family proposed by Guerrero and Johnson (1982, Biometrika 69, 309-314). One illustrative example is presented.

14.
Due to the high sensitivity of diffusion tensor imaging (DTI) to physiological motion, clinical DTI scans often suffer a significant amount of artifacts. Tensor-fitting-based, post-processing outlier rejection is often used to reduce the influence of motion artifacts. Although it is an effective approach, when there are multiple corrupted data, this method may no longer correctly identify and reject the corrupted data. In this paper, we introduce a new criterion called “corrected Inter-Slice Intensity Discontinuity” (cISID) to detect motion-induced artifacts. We compared the performance of algorithms using cISID and other existing methods with regard to artifact detection. The experimental results show that the integration of cISID into fitting-based methods significantly improves the retrospective detection performance at post-processing analysis. The performance of the cISID criterion, if used alone, was inferior to the fitting-based methods, but cISID could effectively identify severely corrupted images with a rapid calculation time. In the second part of this paper, an outlier rejection scheme was implemented on a scanner for real-time monitoring of image quality and reacquisition of the corrupted data. The real-time monitoring, based on cISID and followed by post-processing, fitting-based outlier rejection, could provide a robust environment for routine DTI studies.

15.
In geo-statistics, the Durbin-Watson test is frequently employed to detect the presence of residual serial correlation from least squares regression analyses. However, the Durbin-Watson statistic is only suitable for ordered time or spatial series. If the variables comprise cross-sectional data coming from spatial random sampling, the test will be ineffectual because the value of Durbin-Watson’s statistic depends on the sequence of data points. This paper develops two new statistics for testing serial correlation of residuals from least squares regression based on spatial samples. By analogy with the new form of Moran’s index, an autocorrelation coefficient is defined with a standardized residual vector and a normalized spatial weight matrix. Then by analogy with the Durbin-Watson statistic, two types of new serial correlation indices are constructed. As a case study, the two newly presented statistics are applied to a spatial sample of 29 regions of China. These results show that the new spatial autocorrelation models can be used to test the serial correlation of residuals from regression analysis. In practice, the new statistics can make up for the deficiencies of the Durbin-Watson test.
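
Editorial sketch: the order dependence that this abstract attributes to the Durbin-Watson statistic is easy to see numerically. The snippet computes the standard statistic, DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), and a Moran-type coefficient built from a standardized residual vector z and a spatial weight matrix scaled to sum to one, I = z' W z, which is one reading of the construction the abstract describes and coincides with the usual Moran's I. The two new indices proposed in the paper are not reproduced here, and the function names are hypothetical.

```python
import math


def durbin_watson(residuals):
    """Classical Durbin-Watson statistic; depends on the order of the residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


def moran_index(residuals, weights):
    """Moran-type coefficient z' W z with z standardized (population SD)
    and the weight matrix normalized so that all entries sum to one."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in residuals) / n)
    z = [(e - mean) / sd for e in residuals]
    total = sum(sum(row) for row in weights)
    return sum(
        weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n)
    ) / total
```

For residuals `[1, 1, -1, -1]` the Durbin-Watson statistic is 1.0, but listing the same four values as `[1, -1, 1, -1]` gives 3.0, so the conclusion flips from positive to negative serial correlation purely because of the listing order. The Moran-type coefficient is unchanged when sites are relabelled, provided the rows and columns of the weight matrix are permuted in the same way, which is why it suits unordered spatial samples.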

16.
Most proteomics experiments make use of 'high throughput' technologies such as 2-DE, MS or protein arrays to measure simultaneously the expression levels of thousands of proteins. Such experiments yield large, high-dimensional data sets which usually reflect not only the biological but also technical and experimental factors. Statistical tools are essential for evaluating these data and preventing false conclusions. Here, an overview is given of some typical statistical tools for proteomics experiments. In particular, we present methods for data preprocessing (e.g. calibration, missing values estimation and outlier detection), comparison of protein expression in different groups (e.g. detection of differentially expressed proteins or classification of new observations) as well as the detection of dependencies between proteins (e.g. protein clusters or networks). We also discuss questions of sample size planning for some of these methods.

17.
ABSTRACT: BACKGROUND: Mass spectrometry (MS) data are often generated from various biological or chemical experiments and there may exist outlying observations, which are extreme due to technical reasons. The determination of outlying observations is important in the analysis of replicated MS data because elaborate pre-processing is essential for successful analysis with reliable results and manual outlier detection as one of the pre-processing steps is time-consuming. The heterogeneity of variability and low replication are often obstacles to successful analysis, including outlier detection. Existing approaches, which assume constant variability, can generate many false positives (outliers) and/or false negatives (non-outliers). Thus, a more powerful and accurate approach is needed to account for the heterogeneity of variability and low replication. FINDINGS: We proposed an outlier detection algorithm using projection and quantile regression in MS data from multiple experiments. The performance of the algorithm and program was demonstrated by using both simulated and real-life data. The projection approach with linear, nonlinear, or nonparametric quantile regression was appropriate in heterogeneous high-throughput data with low replication. CONCLUSION: Various quantile regression approaches combined with projection were proposed for detecting outliers. The choice among linear, nonlinear, and nonparametric regressions is dependent on the degree of heterogeneity of the data. The proposed approach was illustrated with MS data with two or more replicates.

18.
By using deviance standardized residuals, the seemingly unrelated regression estimation procedure is extended to generalized linear models, and fitted by an iterative procedure. The matrix of cross products of standardized residuals is asymptotically multivariate normal, and can be used for further multivariate analyses and for hypothesis testing.

19.
The spatial signature of microevolutionary processes structuring genetic variation may play an important role in the detection of loci under selection. However, the spatial location of samples has not yet been used to quantify this. Here, we present a new two‐step method of spatial outlier detection at the individual and deme levels using the power spectrum of Moran eigenvector maps (MEM). The MEM power spectrum quantifies how the variation in a variable, such as the frequency of an allele at a SNP locus, is distributed across a range of spatial scales defined by MEM spatial eigenvectors. The first step (Moran spectral outlier detection: MSOD) uses genetic and spatial information to identify outlier loci by their unusual power spectrum. The second step uses Moran spectral randomization (MSR) to test the association between outlier loci and environmental predictors, accounting for spatial autocorrelation. Using simulated data from two published papers, we tested this two‐step method in different scenarios of landscape configuration, selection strength, dispersal capacity and sampling design. Under scenarios that included spatial structure, MSOD alone was sufficient to detect outlier loci at the individual and deme levels without the need for incorporating environmental predictors. Follow‐up with MSR generally reduced (already low) false‐positive rates, though in some cases led to a reduction in power. The results were surprisingly robust to differences in sample size and sampling design. Our method represents a new tool for detecting potential loci under selection with individual‐based and population‐based sampling by leveraging spatial information that has hitherto been neglected.

20.
León LF  Tsai CL 《Biometrics》2004,60(1):75-84
We propose a new type of residual and an easily computed functional form test for the Cox proportional hazards model. The proposed test is a modification of the omnibus test for testing the overall fit of a parametric regression model, developed by Stute, González Manteiga, and Presedo Quindimil (1998, Journal of the American Statistical Association 93, 141-149), and is based on what we call censoring consistent residuals. In addition, we develop residual plots that can be used to identify the correct functional forms of covariates. We compare our test with the functional form test of Lin, Wei, and Ying (1993, Biometrika 80, 557-572) in a simulation study. The practical application of the proposed residuals and functional form test is illustrated using both a simulated data set and a real data set.
