Similar Articles
 20 similar articles found
1.

Background

Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both classification accuracy and the stability of the selected gene sets still leave room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data or only address the outlier detection problem in isolation.

Results

We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external leave-one-out cross-validation framework is developed to replace the internal cross-validation of the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes redundant genes. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier detected all the outliers found in the original paper (for which the data were provided for analysis), and the genes selected after outlier removal were shown to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets; MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing the detected outliers and flipping the labels of some of the remaining samples. Almost all of the ‘wrong’ (artificially flipped) samples were detected, suggesting that MFMW-outlier is sufficiently powerful to detect outliers in high-dimensional microarray datasets.
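To make the selection-within-validation idea concrete, the sketch below wraps an L1-penalized linear SVM inside an external leave-one-out loop, so that scaling and gene selection are redone on each training fold and never see the held-out sample. It assumes scikit-learn and a generic expression matrix X (samples × genes) with binary labels y; it is a minimal illustration of the principle, not the authors' MFMW-outlier implementation.

```python
# Minimal sketch of external leave-one-out CV around L1-norm SVM gene selection.
# Assumes scikit-learn; X is a (samples x genes) expression matrix, y holds 0/1 labels.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def loo_l1svm(X, y, C=0.1):
    loo = LeaveOneOut()
    predictions, selected_per_fold = [], []
    for train_idx, test_idx in loo.split(X):
        scaler = StandardScaler().fit(X[train_idx])          # fit scaling on the training fold only
        X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
        clf = LinearSVC(penalty="l1", dual=False, C=C, max_iter=20000)
        clf.fit(X_tr, y[train_idx])
        predictions.append(clf.predict(X_te)[0])
        selected_per_fold.append(np.flatnonzero(clf.coef_[0]))  # genes with non-zero weight
    accuracy = np.mean(np.array(predictions) == y)
    return accuracy, selected_per_fold

# Example with random data (replace with a real expression matrix):
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
y = rng.integers(0, 2, size=40)
acc, genes = loo_l1svm(X, y)
print(f"LOOCV accuracy: {acc:.2f}; genes kept in fold 0: {len(genes[0])}")
```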

2.
Gene-expression microarray data are modeled as an exponential expression signal (log-normal distribution) plus additive noise. A variance-stabilizing transformation based on this model is useful for improving the uniformity of variance, which is often assumed by conventional statistical analysis methods. However, the existing method for estimating the transformation parameters may not be ideal because it handles outliers poorly. By employing an information normalization technique, we have developed an improved parameter estimation method that enables statistically more straightforward outlier exclusion and works well even for small sample sizes. Validation with experimental data suggests that it is superior to the conventional method.
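For readers unfamiliar with this class of transformation, the sketch below applies the generalized-log (glog) form commonly used for additive-plus-multiplicative microarray error models. The offset c stands in for the parameter that would be estimated from the data; the improved estimator described in the abstract is not reproduced here.

```python
# Hedged sketch: generalized-log (glog) variance-stabilizing transform for microarray
# intensities, assuming an additive-noise + multiplicative-signal error model.
import numpy as np

def glog(x, c):
    """glog(x) = log((x + sqrt(x**2 + c**2)) / 2); behaves like log(x) for x >> c
    and stays finite (and roughly linear) near zero, stabilizing the variance."""
    x = np.asarray(x, dtype=float)
    return np.log((x + np.sqrt(x**2 + c**2)) / 2.0)

# c would normally be estimated from replicate measurements; here it is purely illustrative.
intensities = np.array([5.0, 50.0, 500.0, 5000.0])
print(glog(intensities, c=100.0))
```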

3.
Some a priori and a posteriori aspects of the identifiability problem for unidentifiable models are discussed. It is argued that the notion of identifiability from parameter bounds has only minor a priori structural relevance. The parameter-bounds rationale may prove a useful a posteriori numerical notion; however, its practical potential needs careful evaluation, as the use of point estimates automatically builds hidden structural constraints into the model. Examples are given.

4.
Microvolt T-wave alternans (TWA) is recognized as a marker for malignant ventricular arrhythmias leading to sudden cardiac death. Its extraordinary pathological significance and life-critical application demand elaborate modeling approaches and efficient analysis schemes. An accurate statistical model encompassing the dynamics of physiological noise and other outliers is essential for detection and estimation of the microvolt-level signal. Anomalies in the parameter values characterizing the distributions of these random phenomena are apt to introduce modeling errors. Recent TWA detection-theoretic approaches assume Laplacian noise because of the leptokurtic distribution of electrode movement (em) and muscular activity (ma) recordings. The statistical analysis presented here shows that this model neglects the asymmetric nature of the probability distributions of em and ma. An analytical model based on the biexponential distribution is proposed to capture both the leptokurtic and the asymmetric character of the noise. A comparative analysis is presented using visual inspection, χ2 goodness-of-fit tests and Monte Carlo simulations. The proposed model achieves best-case matches of 99.14% and 98.13% for em and ma, compared with Laplacian fits of 95.20% and 93.84%, respectively. Conversely, the worst-case fits for em and ma are 96.32% and 92.45% for the biexponential model and 60.47% and 15.18% for the Laplacian model, respectively. The added degree of freedom is likely to increase the complexity of the already challenging TWA detection problem; however, the proposed model achieves a more realistic representation of the real noise data by closely matching its statistical parameters.
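As an illustration of the distributional comparison, the sketch below fits both a symmetric Laplace and an asymmetric Laplace (a biexponential-type) distribution to a skewed, heavy-tailed sample with SciPy. The data are synthetic stand-ins for the em/ma noise records, and a Kolmogorov-Smirnov statistic is used for brevity in place of the χ2 goodness-of-fit reported in the abstract.

```python
# Sketch: comparing a symmetric Laplacian fit with an asymmetric-Laplace (biexponential-
# type) fit for skewed, heavy-tailed noise. The data here are synthetic; the em/ma noise
# records used in the paper are not reproduced.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
noise = stats.laplace_asymmetric.rvs(kappa=1.6, loc=0.0, scale=1.0, size=5000,
                                     random_state=rng)      # skewed, heavy-tailed sample

lap_params = stats.laplace.fit(noise)
alap_params = stats.laplace_asymmetric.fit(noise)

# Kolmogorov-Smirnov distance as a simple goodness-of-fit summary
ks_lap = stats.kstest(noise, "laplace", args=lap_params).statistic
ks_alap = stats.kstest(noise, "laplace_asymmetric", args=alap_params).statistic
print(f"KS distance, symmetric Laplace:  {ks_lap:.4f}")
print(f"KS distance, asymmetric Laplace: {ks_alap:.4f}")
```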

5.
The niche model has been widely used to model the structure of complex food webs, and yet the ecological meaning of the single niche dimension has not been explored. In the niche model, each species has three traits: niche position, diet position and feeding range. Here, a new probabilistic niche model, which allows the maximum likelihood set of trait values to be estimated for each species, is applied to the food web of the Benguela fishery. We also developed the allometric niche model, in which body size is used as the niche dimension. About 80% of the links in the empirical data are predicted by the probabilistic niche model, a significant improvement over recent models. As in the niche model, species are uniformly distributed on the niche axis. Feeding ranges are exponentially distributed, but diet positions are not uniformly distributed below the predator. Species traits are strongly correlated with body size, but the allometric niche model performs significantly worse than the probabilistic niche model. The best-fit parameter set provides a significantly better model of the structure of the Benguela food web than was previously available. The methodology allows the identification of a number of taxa that stand out as outliers, either in the model's poor performance at predicting their predators or prey or in their parameter values. While important, body size alone does not explain the structure of the one-dimensional niche.

6.
While many decisions rely on real-time quantitative PCR (qPCR) analysis, few attempts have hitherto been made to quantify bounds of precision accounting for the various sources of variation involved in the measurement process. Besides influences of more obvious factors such as camera noise and pipetting variation, changing efficiencies within and between reactions affect PCR results to a degree that is not fully recognized. Here, we develop a statistical framework that models measurement error and other sources of variation as they contribute to fluorescence observations during the amplification process and to derived parameter estimates. Evaluation of reproducibility is then based on simulations capable of generating realistic variation patterns. To this end, we start from a relatively simple statistical model for the evolution of efficiency in a single PCR reaction and introduce additional error components, one at a time, to arrive at a stochastic data-generating process capable of simulating the variation patterns witnessed in repeated reactions (technical repeats). Most of the variation in the observed values was adequately captured by the statistical model in terms of the foreseen components. To recreate the dispersion of the repeats' plateau levels while keeping the other aspects of the PCR curves within realistic bounds, additional sources of reagent consumption (side reactions) enter into the model. Once an adequate data-generating model is available, simulations can serve to evaluate various aspects of PCR under the assumptions of the model and beyond.
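A toy version of such a data-generating model is sketched below: per-cycle efficiency declines as a limited reagent pool is consumed, cycle-to-cycle jitter perturbs the efficiency, and multiplicative measurement noise is applied to the observed fluorescence. The functional forms and parameter values are illustrative assumptions, not the components estimated in the paper.

```python
# Minimal sketch of a stochastic qPCR amplification curve: efficiency declines as the
# amplicon count approaches a reagent-limited capacity, and measurement noise is applied
# to the observed fluorescence. All parameter values are illustrative.
import numpy as np

def simulate_qpcr(n_cycles=40, x0=1e3, e0=0.95, capacity=1e10,
                  eff_sd=0.01, meas_cv=0.02, seed=None):
    rng = np.random.default_rng(seed)
    x = np.empty(n_cycles + 1)
    x[0] = x0
    for c in range(n_cycles):
        # efficiency shrinks as the amplicon count approaches the reagent-limited capacity
        eff = e0 * (1.0 - x[c] / capacity)
        eff = np.clip(eff + rng.normal(0.0, eff_sd), 0.0, 1.0)   # cycle-to-cycle jitter
        x[c + 1] = x[c] * (1.0 + eff)
    # multiplicative measurement (camera) noise on the observed signal
    fluorescence = x * rng.lognormal(mean=0.0, sigma=meas_cv, size=x.size)
    return fluorescence

curves = np.array([simulate_qpcr(seed=s) for s in range(5)])     # five technical repeats
print(curves[:, -1])                                             # dispersion of plateau levels
```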

7.
Susan R. Wilson, Genetics, 1980, 95(2):489-502
The statistical methods used by Schaffer, Yardley and Anderson (1977) and by Gibson et al. (1979) to analyze the variation in allele frequencies in two common types of experimental procedure, where the effective population size is finite, are extended to a more general situation involving a greater range of experiments. The analysis developed is more sensitive in detecting changes in allele frequency due to both fluctuating and balancing selection, as well as to directional selection. The error involved in many studies due to ignoring the effective population size structure would appear to be large. The range of hypotheses that can be considered may be increased as well. Finally, the method of determining bounds for the effective population size, when a particular genetic model is known to hold for a data set, is also outlined.

8.
The identifiability problem is addressed for n-compartment linear mammillary and catenary models, for the common case of input and output in the first compartment and prior information about one or more model rate constants. We first define the concept of independent constraints and show that n-compartment linear mammillary or catenary models are uniquely identifiable under n-1 independent constraints. Closed-form algorithms for bounding the constrained parameter space are then developed algebraically, and their validity is confirmed using an independent approach, namely joint estimation of the parameters of all uniquely identifiable submodels of the original multicompartmental model. For the noise-free (deterministic) case, the major effects of additional parameter knowledge are to narrow the bounds of rate constants that remain unidentifiable, as well as to possibly render others identifiable. When noisy data are considered, the means of the bounds of rate constants that remain unidentifiable are also narrowed, but the variances of some of these bound estimates increase. This unexpected result was verified by Monte Carlo simulation of several different models, using both normally and lognormally distributed data assumptions. Extensions and some consequences of this analysis useful for model discrimination and experiment design applications are also noted.

9.
In a companion paper, we demonstrated that dynamic range limitations can confound measurement of the osmotically inactive volume using electrical sensing zone instruments (e.g., Coulter counters), and presented an improved parameter estimation method in which a lognormal function was fit to the cell volume distribution to allow extrapolation beyond the bounds of the data. Presently, we have investigated the effect of dynamic range limitations on measurement of the cell membrane water permeability (Lp), and adapted the lognormal extrapolation method for estimation of Lp from transient volume data. An alternative strategy (the volume limit adjustment method, in which the measured isotonic volume distribution is used to generate model predictions for curve fitting, and the bounds of the dynamic range are adjusted such that extrapolation is not required) was also developed. The performance of these new algorithms was compared to that of a conventional parameter estimation method. The best-fit Lp values from in vitro experiments with mouse insulinoma (MIN6) cells differed significantly between the parameter estimation techniques (p < 0.001). Using in silico experiments, the volume limit adjustment method was shown to be the most accurate (relative error 0.4 ± 3.2%), whereas the conventional method underestimated Lp by 19 ± 2% for MIN6 cells. Parametric analysis revealed that the error associated with the conventional method was sensitive to the dynamic range and the width of the volume distribution. Our initial implementation of the lognormal extrapolation method also yielded significant errors, whereas accuracy of this algorithm improved after including a normalization scheme.

10.
Correspondence noise is a major factor limiting direction discrimination performance in random-dot kinematograms [1]. In the current study we investigated the influence of correspondence noise on Dmax, the upper limit on the spatial displacement of the dots for which coherent motion is still perceived. Human direction discrimination performance was measured, using 2-frame kinematograms with leftward/rightward motion, over a 200-fold range of dot densities and a four-fold range of dot displacements. From these data, Dmax was estimated for the different dot densities tested. A model was proposed to evaluate the correspondence noise in the stimulus. This model summed the outputs of a set of elementary Reichardt-type local detectors that had receptive fields tiling the stimulus and were tuned to the two directions of motion in the stimulus. A key assumption of the model was that the local detectors have the radius of their catchment areas scaled with the displacement they are tuned to detect; the scaling factor k linking the radius to the displacement was the only free parameter in the model, and a single value of k was used to fit all of the psychophysical data collected. This minimal correspondence-noise-based model was able to account for 91% of the variability in human performance across all of the conditions tested. The results highlight the importance of correspondence noise in constraining the largest displacement that can be detected.

11.
Ecological Complexity, 2007, 4(4):223-233
An excitable model of fast phytoplankton and slow zooplankton dynamics is considered for the case of lysogenic viral infection of the phytoplankton population. The phytoplankton population is split into a susceptible (S) and an infected (I) part. Both parts grow logistically, limited by a common carrying capacity. Zooplankton (Z) grazes on susceptibles and infected, following a Holling type-III functional response. The local analysis of the SIZ differential equations yields a number of stationary and/or oscillatory regimes and their combinations. The behaviour under multiplicative noise, modelled by stochastic differential equations, is correspondingly interesting. The external noise can enhance the survival of susceptibles and infected, respectively, that would go extinct in a deterministic environment. In the parameter range of excitability, noise can induce prey–predator oscillations and coherence resonance (CR). In the spatially extended case, synchronized global oscillations can be observed for medium noise intensities. Higher noise intensities give rise to the formation of stationary spatial patterns.
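A minimal numerical sketch of such a system is given below, using Euler-Maruyama integration of a susceptible-infected-zooplankton (SIZ) model with logistic growth under a shared carrying capacity, Holling type-III grazing and multiplicative white noise. The specific functional forms and parameter values are assumptions for illustration, not those of the paper.

```python
# Euler-Maruyama sketch of an S-I-Z system: susceptible (S) and infected (I) phytoplankton
# grow logistically under a shared carrying capacity, zooplankton (Z) grazes on both via a
# Holling type-III response, and each equation carries multiplicative white noise.
# Functional forms and parameter values are illustrative only.
import numpy as np

def simulate_siz(T=500.0, dt=0.01, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    rs, ri, K = 1.0, 0.8, 1.0          # growth rates and shared carrying capacity
    g, h, eps, m = 1.0, 0.3, 0.5, 0.2  # grazing rate, half-saturation, conversion, mortality
    n = int(T / dt)
    S, I, Z = np.empty(n), np.empty(n), np.empty(n)
    S[0], I[0], Z[0] = 0.4, 0.1, 0.2
    for t in range(n - 1):
        prey2 = S[t]**2 + I[t]**2
        graze = g * prey2 / (h**2 + prey2)                  # Holling type-III response
        dS = rs * S[t] * (1 - (S[t] + I[t]) / K) - graze * Z[t] * S[t]**2 / max(prey2, 1e-12)
        dI = ri * I[t] * (1 - (S[t] + I[t]) / K) - graze * Z[t] * I[t]**2 / max(prey2, 1e-12)
        dZ = eps * graze * Z[t] - m * Z[t]
        dW = rng.normal(0.0, np.sqrt(dt), size=3)           # independent Wiener increments
        S[t+1] = max(S[t] + dS * dt + sigma * S[t] * dW[0], 0.0)
        I[t+1] = max(I[t] + dI * dt + sigma * I[t] * dW[1], 0.0)
        Z[t+1] = max(Z[t] + dZ * dt + sigma * Z[t] * dW[2], 0.0)
    return S, I, Z

S, I, Z = simulate_siz()
print(S[-1], I[-1], Z[-1])
```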

12.
Outlier detection and cleaning procedures were evaluated for estimating restricted mathematical variogram models from discrete insect population count data. Because variogram modeling is significantly affected by outliers, methods to detect and clean outliers from data sets are critical for proper variogram modeling. In this study, we examined spatial data in the form of discrete measurements of insect counts on a rectangular grid. Two well-known insect pest population data sets were analyzed; one was the western flower thrips, Frankliniella occidentalis (Pergande), on greenhouse cucumbers and the other was the greenhouse whitefly, Trialeurodes vaporariorum (Westwood), on greenhouse cherry tomatoes. A spatial additive outlier model was constructed to detect outliers in both isolated and patchy spatial distributions of outliers, and the outliers were cleaned with the neighboring-median cleaner. To analyze the effect of outliers, we compared the relative nugget effects of data cleaned of outliers and data still containing outliers after transformation. In addition, the correlation coefficients between the actual and predicted values were compared using leave-one-out cross-validation with data cleaned of outliers and non-cleaned data after unbiased back-transformation. The outlier detection and cleaning procedure improved geostatistical analysis, particularly by reducing the nugget effect, which greatly impacts the prediction variance of kriging. Consequently, the outlier detection and cleaning procedures used here improved the results of geostatistical analysis with highly skewed and extremely fluctuating data, such as insect counts.
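The neighboring-median idea can be sketched in a few lines: flag a grid cell as an outlier when it deviates strongly from the median of its neighbours, then replace flagged cells by that neighbourhood median. The robust threshold used below is a generic MAD-based rule, not the spatial additive outlier model of the paper.

```python
# Sketch of a "neighboring median" cleaner for gridded insect counts. Cells whose deviation
# from the neighbourhood median exceeds a robust threshold are flagged and replaced by the
# neighbourhood median. The flagging rule is a generic robust choice, not the paper's model.
import numpy as np
from scipy import ndimage

def clean_grid(counts, size=3, k=3.0):
    counts = np.asarray(counts, dtype=float)
    footprint = np.ones((size, size), dtype=bool)
    footprint[size // 2, size // 2] = False                  # exclude the cell itself
    med = ndimage.median_filter(counts, footprint=footprint, mode="nearest")
    resid = counts - med
    mad = np.median(np.abs(resid)) + 1e-9                    # robust residual scale
    outliers = np.abs(resid) > k * 1.4826 * mad
    cleaned = np.where(outliers, med, counts)
    return cleaned, outliers

rng = np.random.default_rng(2)
grid = rng.poisson(4, size=(12, 12)).astype(float)
grid[5, 7] = 60.0                                            # inject an isolated spike
cleaned, flags = clean_grid(grid)
print("flagged cells:", np.argwhere(flags))
```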

13.
Litter decomposition rate (k) is typically estimated from proportional litter mass loss data using models that assume constant, normally distributed errors. However, such data often show non-normal errors with reduced variance near bounds (0 or 1), potentially leading to biased k estimates. We compared the performance of nonlinear regression using the beta distribution, which is well-suited to bounded data and this type of heteroscedasticity, to standard nonlinear regression (normal errors) on simulated and real litter decomposition data. Although the beta model often provided better fits to the simulated data (based on the corrected Akaike Information Criterion, AICc), standard nonlinear regression was robust to violation of homoscedasticity and gave k estimates that were as accurate as, or more accurate than, those from nonlinear beta regression. Our simulation results also suggest that k estimates will be most accurate when the study length captures mid- to late-stage decomposition (50–80% mass loss) and the number of measurements through time is ≥5. Regression method and data transformation choices had the smallest impact on k estimates during mid- and late-stage decomposition. Estimates of k were more variable among methods and generally less accurate during early and end-stage decomposition. With real data, neither model was predominantly best; in most cases the models were indistinguishable based on AICc and gave similar k estimates. However, when decomposition rates were high, normal and beta model k estimates often diverged substantially. Therefore, we recommend a pragmatic approach in which both models are compared and the best is selected for a given data set. Alternatively, both models may be used via model averaging to develop weighted parameter estimates. We provide code to perform nonlinear beta regression with freely available software.
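The authors provide their own code; as an independent, simplified illustration, the sketch below fits a single-exponential decay by maximum likelihood under a beta error model, with the proportion of mass remaining at time t treated as Beta-distributed with mean exp(-k·t) and precision φ.

```python
# Minimal sketch of nonlinear beta regression for litter decomposition: the proportion of
# mass remaining at time t is Beta-distributed with mean mu(t) = exp(-k*t) and precision
# phi, and (k, phi) are fit by maximum likelihood. This is a simplified illustration, not
# the code supplied with the paper.
import numpy as np
from scipy import optimize, stats

def neg_log_lik(params, t, y):
    log_k, log_phi = params
    k, phi = np.exp(log_k), np.exp(log_phi)
    mu = np.clip(np.exp(-k * t), 1e-6, 1 - 1e-6)        # mean proportion remaining
    a, b = mu * phi, (1 - mu) * phi                      # beta shape parameters
    return -np.sum(stats.beta.logpdf(y, a, b))

# Simulated decomposition series (true k = 0.8 per year, phi = 50)
rng = np.random.default_rng(3)
t = np.array([0.25, 0.5, 1.0, 1.5, 2.0, 3.0])
mu_true = np.exp(-0.8 * t)
y = stats.beta.rvs(mu_true * 50, (1 - mu_true) * 50, random_state=rng)

fit = optimize.minimize(neg_log_lik, x0=[np.log(0.5), np.log(10.0)], args=(t, y),
                        method="Nelder-Mead")
print("estimated k:", np.exp(fit.x[0]))
```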

14.
The implementation of Student t mixed models in animal breeding has been suggested as a useful statistical tool to effectively mute the impact of preferential treatment or other sources of outliers in field data. Nevertheless, these additional sources of variation are undeclared, and we do not know whether a Student t mixed model is required or whether a standard, less heavily parameterized Gaussian mixed model would be sufficient to serve the intended purpose. Within this context, our aim was to develop the Bayes factor between two nested models that differ only in a bounded variable, in order to easily compare a Student t and a Gaussian mixed model. It is important to highlight that the Student t density converges to a Gaussian process as the degrees of freedom tend to infinity. The two models can then be viewed as nested models that differ in terms of degrees of freedom. The Bayes factor can be easily calculated from the output of a Markov chain Monte Carlo sampling of the more complex model (the Student t mixed model). The performance of this Bayes factor was tested under simulation and on a real dataset, using the deviance information criterion (DIC) as the standard reference criterion. The two statistical tools showed similar trends along the parameter space, although the Bayes factor appeared to be the more conservative. There was considerable evidence favoring the Student t mixed model for data sets simulated under Student t processes with limited degrees of freedom, and moderate advantages associated with using the Gaussian mixed model when working with datasets simulated with 50 or more degrees of freedom. For the analysis of real data (weight of Pietrain pigs at six months), both the Bayes factor and DIC slightly favored the Student t mixed model, with there being a reduced incidence of outlier individuals in this population.

15.
16.
Expression quantitative trait loci (eQTL) analysis enables characterisation of functional genetic variation influencing the expression levels of individual genes. In outbred populations, including humans, eQTLs are commonly analysed using the conventional linear model, adjusting for relevant covariates, assuming an allelic dosage model and a Gaussian error term. However, gene expression data generally have noise that induces heavy-tailed errors relative to the Gaussian distribution, and often include atypical observations, or outliers. Such departures from modelling assumptions can lead to an increased rate of type II errors (false negatives), and to some extent also type I errors (false positives). Careful model checking can reduce the risk of type I errors but often not type II errors, since it is generally too time-consuming to carefully check all models with a non-significant effect in large-scale and genome-wide studies. Here we propose the application of a robust linear model for eQTL analysis to reduce the adverse effects of deviations from the assumption of Gaussian residuals. We present results from a simulation study as well as results from the analysis of real eQTL data sets. Our findings suggest that in many situations robust models have the potential to provide more reliable eQTL results compared to conventional linear models, particularly with respect to reducing type II errors due to non-Gaussian noise. Post-genomic data, such as those generated in genome-wide eQTL studies, are often noisy and frequently contain atypical observations. Robust statistical models have the potential to provide more reliable results and increased statistical power under non-Gaussian conditions. The results presented here suggest that robust models should be considered routinely alongside other commonly used methodologies for eQTL analysis.
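A minimal sketch of the contrast between a conventional and a robust eQTL fit is given below, regressing expression on allelic dosage with ordinary least squares and with Huber M-estimation via statsmodels. Huber weighting is one common robust choice used here purely for illustration; it is not necessarily the estimator of the paper.

```python
# Sketch of an eQTL fit with a robust linear model: expression is regressed on allelic
# dosage (0/1/2) with Huber M-estimation, alongside ordinary least squares for comparison.
# The synthetic data below stand in for real genotype/expression measurements.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
dosage = rng.integers(0, 3, size=n).astype(float)        # additive genotype coding
expr = 0.4 * dosage + rng.standard_t(df=3, size=n)       # heavy-tailed noise
expr[:3] += 8.0                                          # a few outlying samples

X = sm.add_constant(dosage)
ols_fit = sm.OLS(expr, X).fit()
rlm_fit = sm.RLM(expr, X, M=sm.robust.norms.HuberT()).fit()

print("OLS effect, p-value:   ", ols_fit.params[1], ols_fit.pvalues[1])
print("Robust effect, p-value:", rlm_fit.params[1], rlm_fit.pvalues[1])
```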

17.
Emi Tanaka, Biometrics, 2020, 76(4):1374-1382
The aim of plant breeding trials is often to identify crop varieties that are well adapted to target environments. These varieties are identified through genomic prediction from the analysis of multi-environment field trials (MET) using linear mixed models. The occurrence of outliers in a MET is common and is known to adversely impact the accuracy of genomic prediction, yet outlier detection is often neglected. There are a number of reasons for this. First, complex data such as a MET give rise to distinct levels of residuals (e.g., at the trial level or at the individual-observation level); this complexity poses additional challenges for an outlier detection method. Second, many linear mixed model software packages that cater for the complex variance structures needed in the analysis of a MET are not well streamlined for diagnostics by practitioners. We demonstrate outlier detection methods that are simple to implement in any linear mixed model software package and are computationally fast. Although these are not optimal outlier detection methods, they offer practical value through ease of application in the analysis pipeline of regularly collected data. The methods are demonstrated using simulations based on two real bread wheat yield METs. In particular, models that analyse yield trials either independently or jointly (thus borrowing strength across trials) are considered. Case studies are presented to highlight the benefit of joint analysis for outlier detection.

18.
Many recent microarrays hold an enormous number of probe sets, thus raising many practical and theoretical problems in controlling the false discovery rate (FDR). Biologically, it is likely that most probe sets are associated with un-expressed genes, so the measured values are simply noise due to non-specific binding; also many probe sets are associated with non-differentially-expressed (non-DE) genes. In an analysis to find DE genes, these probe sets contribute to the false discoveries, so it is desirable to filter out these probe sets prior to analysis. In the methodology proposed here, we first fit a robust linear model for probe-level Affymetrix data that accounts for probe and array effects. We then develop a novel procedure called FLUSH (Filtering Likely Uninformative Sets of Hybridizations), which excludes probe sets that have statistically small array-effects or large residual variance. This filtering procedure was evaluated on a publicly available data set from a controlled spiked-in experiment, as well as on a real experimental data set of a mouse model for retinal degeneration. In both cases, FLUSH filtering improves the sensitivity in the detection of DE genes compared to analyses using unfiltered, presence-filtered, intensity-filtered and variance-filtered data. A freely-available package called FLUSH implements the procedures and graphical displays described in the article.

19.
We present a new method for developing individualized biomathematical models that predict performance impairment for individuals undergoing total sleep loss. The underlying formulation is based on the two-process model of sleep regulation, which has been extensively used to develop group-average models. However, in the proposed method, the parameters of the two-process model are systematically adjusted to account for an individual's uncertain initial state and unknown trait characteristics, resulting in individual-specific performance prediction models. The method establishes the initial estimates of the model parameters using a set of past performance observations, after which the parameters are adjusted as each new observation becomes available. Moreover, by transforming the nonlinear optimization problem of finding the best estimates of the two-process model parameters into a set of linear optimization problems, the proposed method yields unique parameter estimates. Two distinct data sets are used to evaluate the proposed method. Results on simulated data (with superimposed noise) show that the model parameters asymptotically converge to their true values and that prediction accuracy improves as the number of performance observations increases and the amount of noise in the data decreases. Results from a laboratory study (82 h of total sleep loss), for three sleep-loss phenotypes, suggest that individualized models are consistently more accurate than group-average models, yielding as much as a threefold reduction in prediction errors. In addition, we show that the two-process model of sleep regulation is capable of representing performance data only when the proposed individualized model is used.
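The flavour of the approach can be sketched as follows: during total sleep loss the homeostatic process S rises toward an asymptote, a sinusoid stands in for the circadian process C, and an individual's weighting parameters are re-estimated by (linear) least squares each time a new performance observation arrives. The functional forms and constants below are illustrative assumptions, not the paper's exact two-process formulation.

```python
# Hedged sketch of two-process-style performance prediction under total sleep loss: the
# homeostatic pressure S rises exponentially toward an asymptote while awake, a sinusoid
# stands in for the circadian process C, and an individual's amplitude/offset parameters
# are re-estimated by least squares as observations arrive. Forms and constants are
# illustrative, not the paper's formulation.
import numpy as np

def model(t_hours, params, tau=18.2, period=24.0, phase=16.8):
    a, b, c = params
    S = 1.0 - np.exp(-t_hours / tau)                       # homeostatic build-up while awake
    C = np.cos(2 * np.pi * (t_hours - phase) / period)     # circadian oscillation
    return a * S + b * C + c

def refit(t_obs, y_obs):
    # Linear least squares for (a, b, c): the model is linear in these parameters.
    S = 1.0 - np.exp(-t_obs / 18.2)
    C = np.cos(2 * np.pi * (t_obs - 16.8) / 24.0)
    A = np.column_stack([S, C, np.ones_like(t_obs)])
    params, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    return params

# Stream observations one at a time and update the individualized model.
rng = np.random.default_rng(5)
t_all = np.arange(2.0, 60.0, 2.0)
truth = model(t_all, (25.0, 6.0, 5.0))
obs = truth + rng.normal(0, 2.0, size=t_all.size)
for k in range(3, t_all.size):
    p = refit(t_all[:k], obs[:k])
print("final individual parameters:", p)
```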

20.
Identification of protein coding regions is fundamentally a statistical pattern recognition problem. Discriminant analysis is a statistical technique for classifying a set of observations into predefined classes, and it is useful for solving such problems. It is well known that outliers are present in virtually every data set in any application domain, and classical discriminant analysis methods (including linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA)) do not work well if the data set has outliers. To overcome this difficulty, a robust statistical method is used in this paper. We choose four different coding characters as discriminant variables, and encouraging results are obtained with robust discriminant analysis.
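One standard way to robustify discriminant analysis, sketched below, is to estimate each class's location and scatter with the Minimum Covariance Determinant (MCD) estimator and classify by the smallest robust Mahalanobis distance. Synthetic features replace the four coding characters, and the MCD route is only one possible robust approach, not necessarily the one used in the paper.

```python
# Sketch of robust discriminant analysis: each class's mean and covariance are estimated
# with the Minimum Covariance Determinant (MCD) estimator, which down-weights outliers,
# and observations are assigned to the class with the smallest robust Mahalanobis distance.
# Synthetic features stand in for the paper's coding-character variables.
import numpy as np
from sklearn.covariance import MinCovDet

def fit_robust_da(X, y):
    return {label: MinCovDet(random_state=0).fit(X[y == label]) for label in np.unique(y)}

def predict(models, X):
    labels = np.array(sorted(models))
    # mahalanobis() returns squared robust distances to each class's MCD location
    dists = np.column_stack([models[label].mahalanobis(X) for label in labels])
    return labels[np.argmin(dists, axis=1)]

rng = np.random.default_rng(6)
X0 = rng.normal(0.0, 1.0, size=(100, 4))
X1 = rng.normal(1.5, 1.0, size=(100, 4))
X1[:5] += 10.0                                   # contaminate one class with outliers
X, y = np.vstack([X0, X1]), np.array([0] * 100 + [1] * 100)
models = fit_robust_da(X, y)
print("training accuracy:", np.mean(predict(models, X) == y))
```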
