Similar Documents
20 similar documents found.
1.

Background

Misclassification has been shown to be highly prevalent in binary responses in both livestock and human populations. Leaving these errors uncorrected before analysis has a negative impact on the overall goals of genome-wide association studies (GWAS), including reduced predictive power. A liability threshold model that accommodates misclassification was developed to assess the effects of misdiagnostic errors on GWAS. Four case–control scenarios were simulated; each dataset consisted of 2000 individuals and was analyzed with varying odds ratios for the influential SNPs and misclassification rates of 5% and 10%.

Results

Analyses of binary responses subject to misclassification resulted in underestimation of the effects of influential SNPs and failed to recover their true magnitude and direction. Once the misclassification algorithm was applied, accuracy increased by 12% to 29% and bias was substantially reduced. The proposed method captured the majority of the most significant SNPs that were not identified in the analysis of the misclassified data. In fact, in one of the simulation scenarios, 33% of the influential SNPs were not identified using the misclassified data, compared with the analysis of the data without misclassification; using the proposed method, only 13% were missed. Furthermore, the proposed method identified, with high probability, a large portion of the truly misclassified observations.

Conclusions

The proposed model provides a statistical tool to correct, or at least attenuate, the negative effects of misclassified binary responses in GWAS. The model proved robust across different levels of misclassification probability and odds ratios of significant SNPs: SNP effects and the misclassification probability were accurately estimated, and truly misclassified observations were identified with high probability compared with non-misclassified responses. This study was limited to situations where the misclassification probability was assumed to be the same in cases and controls, which is not always true for real human disease data. It is therefore of interest to evaluate the performance of the proposed model in that situation, which is the current focus of our research.
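
The attenuation of SNP effects described above can be reproduced with a minimal simulation, sketched below. It is an illustration only: it uses a plain single-SNP logistic regression rather than the authors' liability threshold model, and the allele frequency, effect size, and misclassification rate are assumed values.

```python
# Minimal sketch: random flipping of case/control labels attenuates the estimated
# SNP effect in a naive logistic-regression analysis (not the authors' model).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
maf = 0.3                      # assumed minor allele frequency
g = rng.binomial(2, maf, n)    # additive genotype coding 0/1/2
beta = np.log(1.5)             # assumed true per-allele odds ratio of 1.5
p = 1.0 / (1.0 + np.exp(-(-1.0 + beta * g)))
y_true = rng.binomial(1, p)

def fit_or(y, g):
    X = sm.add_constant(g.astype(float))
    res = sm.Logit(y, X).fit(disp=0)
    return np.exp(res.params[1])   # estimated odds ratio for the SNP

# Randomly misclassify 10% of the binary responses
flip = rng.random(n) < 0.10
y_obs = np.where(flip, 1 - y_true, y_true)

print("OR, true labels:        %.2f" % fit_or(y_true, g))
print("OR, 10%% misclassified:  %.2f" % fit_or(y_obs, g))
```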

2.
In this paper we introduce a misclassification model for the meiosis I non-disjunction fraction in numerical chromosomal anomalies known as trisomies. We obtain posteriors, and their moments, for the probability that a non-disjunction occurs in the first division of meiosis and for the misclassification errors. We also extend previous work by providing the exact posterior, and its moments, for the probability that a non-disjunction occurs in the first division of meiosis under the model proposed in the literature, which does not consider that the data are subject to misclassification. We perform Monte Carlo studies to compare the Bayes estimates obtained under both models. An application to Down Syndrome data is also presented. (© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)

3.
Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20^4) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
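
A minimal sketch of the alignment-free 4-gram idea is shown below: contiguous 4-mers are counted in a sequence and the unknown is scored by its overlap with a family "probe". The probe contents, the example sequence, and the simple overlap score are assumptions for illustration; the paper itself uses a Bayesian probabilistic model.

```python
# Sketch: extract 4-grams from a protein sequence and compare them with a
# hypothetical family probe (a set of characteristic 4-grams).
from collections import Counter

def four_grams(seq):
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))

family_probe = {"GKST", "KSTL", "HGKS"}      # hypothetical probe 4-grams
unknown = "MAHGKSTLVQRHGKSGA"                # hypothetical query sequence

counts = four_grams(unknown)
overlap = {g: counts[g] for g in family_probe if g in counts}
score = sum(overlap.values()) / max(1, sum(counts.values()))
print(overlap, round(score, 3))
```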

4.
We develop three Bayesian predictive probability functions based on data in the form of a double sample. One predicts the true, unobservable count of interest in a future sample under a Poisson model with data subject to misclassification; the other two predict the number of misclassified counts in a currently observable, fallible count for an event of interest. We formulate a Gibbs sampler to calculate prediction intervals for these three unobservable random variables and apply the new predictive models to calculate prediction intervals for a real-data example. (© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)

5.

Background

Imperfect diagnostic testing reduces the power to detect significant predictors in classical cross-sectional studies. Assuming that the misclassification in diagnosis is random, this can be dealt with by increasing the sample size of a study. However, the effects of imperfect tests in longitudinal data analyses are not as straightforward to anticipate, especially if the outcome of the test influences behaviour. The aim of this paper is to investigate the impact of imperfect test sensitivity on the determination of predictor variables in a longitudinal study.

Methodology/Principal Findings

To deal with imperfect test sensitivity affecting the response variable, we transformed the observed response variable into a set of possible temporal patterns of true disease status, whose prior probability was a function of the test sensitivity. We fitted a Bayesian discrete-time survival model using an MCMC algorithm that treats the true response patterns as unknown parameters in the model. We applied the approach to epidemiological data on bovine tuberculosis outbreaks in England and investigated the effect of reduced test sensitivity on the determination of risk factors for the disease. We found that reduced test sensitivity changed the set of risk factors for the probability of an outbreak that was selected in the ‘best’ model, and increased the uncertainty surrounding the parameter estimates of a model with a fixed set of risk factors associated with the response variable.

Conclusions/Significance

We propose a novel algorithm to fit discrete-time survival models for longitudinal data where values of the response variable are uncertain. When analysing longitudinal data, uncertainty surrounding the response variable will affect the significance of the predictors and should therefore be accounted for, either at the design stage by increasing the sample size or at the post-analysis stage by conducting appropriate sensitivity analyses.
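
The first step described above can be sketched as below: enumerate the candidate true-status patterns compatible with an observed test sequence and weight each by a sensitivity-based prior. The sketch assumes perfect specificity and independence across time points, which are illustrative simplifications rather than the paper's full model.

```python
# Sketch: prior weights for candidate "true" disease-status patterns given an
# observed sequence of imperfect test results (specificity assumed perfect).
from itertools import product

def pattern_priors(observed, sensitivity):
    T = len(observed)
    priors = {}
    for true in product([0, 1], repeat=T):
        p = 1.0
        for obs_t, true_t in zip(observed, true):
            if true_t == 1:
                p *= sensitivity if obs_t == 1 else (1 - sensitivity)
            else:
                p *= 0.0 if obs_t == 1 else 1.0   # no false positives assumed
        if p > 0:
            priors[true] = p
    total = sum(priors.values())
    return {k: v / total for k, v in priors.items()}

print(pattern_priors(observed=[0, 0, 1], sensitivity=0.8))
```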

6.

Background

The Jaffe and enzymatic methods are the two most common methods for measuring serum creatinine. The Jaffe method is less expensive than the enzymatic method but is also more susceptible to interferences. Interferences can lead to misdiagnosis, but they may vary by patient population. The overall risk associated with the Jaffe method depends on the probability of misclassification and the consequences of misclassification. This study assessed the risk associated with the Jaffe method in an outpatient population. We analyzed the discordance rate in the estimated glomerular filtration rate (eGFR) based on serum creatinine measurements obtained by the Jaffe and enzymatic methods.

Methods

Method comparison and risk analysis. Five hundred twenty-nine eGFRs obtained by the Jaffe and enzymatic methods were compared at four clinical decision limits. We determined the probability of discordance and the consequence of misclassification at each decision limit to evaluate the overall risk.

Results

We obtained 529 paired observations. Of these, 29 (5.5%) were discordant with respect to one of the decision limits (i.e. 15, 30, 45 or 60 ml/min/1.73 m2). The magnitude of the difference (Jaffe result minus enzymatic result) was significant relative to analytical variation in 21 of the 29 (72%) discordant results, but was not significant relative to biological variation. The risk associated with misclassification was greatest at the 60 ml/min/1.73 m2 decision limit because the probability of misclassification and the potential for adverse outcomes were both greatest at that limit.

Conclusion

The Jaffe method is subject to bias due to interfering substances (loss of analytical specificity). The risk of misclassification is greatest at the 60 ml/min/1.73 m2 decision limit; however, the risk of misclassification due to bias is much smaller than the risk of misclassification due to biological variation. The Jaffe method may pose low risk in selected populations if eGFR results near the 60 ml/min/1.73 m2 decision limit are interpreted with caution.
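
As a minimal illustration of the discordance check at the decision limits, the sketch below flags paired eGFR values that fall on opposite sides of a limit. The numerical values are invented, not the study's data.

```python
# Sketch: count paired eGFR results (Jaffe vs enzymatic) that disagree about
# which side of a clinical decision limit the patient falls on.
import numpy as np

limits = [15, 30, 45, 60]                  # ml/min/1.73 m^2
jaffe     = np.array([58.0, 61.0, 44.0, 29.0, 90.0])   # hypothetical values
enzymatic = np.array([62.0, 63.0, 46.0, 31.0, 88.0])

def discordant(a, b, limit):
    return (a < limit) != (b < limit)

for limit in limits:
    mask = discordant(jaffe, enzymatic, limit)
    print(f"limit {limit}: {mask.sum()} discordant of {len(mask)}")
```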

7.
We have developed a new general approach for handling misclassification in discrete covariates or responses in regression models. The simulation and extrapolation (SIMEX) method, which was originally designed for handling additive covariate measurement error, is applied to the case of misclassification. The statistical model for characterizing misclassification is given by the transition matrix Π from the true to the observed variable. We exploit the relationship between the size of misclassification and bias in estimating the parameters of interest. Assuming that Π is known or can be estimated from validation data, we simulate data with higher misclassification and extrapolate back to the case of no misclassification. We show that our method is quite general and applicable to models with misclassified response and/or misclassified discrete regressors. In the case of a binary response with misclassification, we compare our method to the approach of Neuhaus, and to the matrix method of Morrissey and Spiegelman in the case of a misclassified binary regressor. We apply our method to a study on caries with a misclassified longitudinal response.
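
A simplified SIMEX-style sketch for a single misclassified binary regressor with a known, symmetric misclassification probability is given below: extra misclassification is added at increasing levels lambda, the naive estimate is recorded at each level, and a quadratic extrapolant is evaluated at lambda = -1. The simulation design, misclassification rate, and extrapolation function are illustrative assumptions, not the authors' implementation.

```python
# Simplified misclassification-SIMEX sketch for a binary regressor in logistic
# regression, with a symmetric flip probability p_mis assumed known.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p_mis, beta = 5000, 0.15, 1.0

x_true = rng.binomial(1, 0.5, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + beta * x_true))))
x_obs = np.where(rng.random(n) < p_mis, 1 - x_true, x_true)   # observed, misclassified

def naive_beta(x):
    return sm.Logit(y, sm.add_constant(x.astype(float))).fit(disp=0).params[1]

lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
est = []
for lam in lambdas:
    # target compound flip probability corresponding to Pi raised to (1 + lambda)
    p_lam = (1 - (1 - 2 * p_mis) ** (1 + lam)) / 2
    q = (p_lam - p_mis) / (1 - 2 * p_mis)       # extra flip probability to add
    reps = [naive_beta(np.where(rng.random(n) < q, 1 - x_obs, x_obs)) for _ in range(20)]
    est.append(np.mean(reps))

coef = np.polyfit(lambdas, est, 2)               # quadratic extrapolation
print("naive estimate: %.3f" % est[0])
print("SIMEX estimate: %.3f" % np.polyval(coef, -1.0))
print("true beta:      %.3f" % beta)
```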

8.
Linkage is a phenomenon that correlates the genotypes of loci, rather than the phenotypes of one locus to the genotypes of another. It is therefore necessary to convert the observed trait phenotypes into trait-locus genotypes, which can then be analyzed for coinheritance with marker-locus genotypes. However, if the mode of inheritance of the trait is not known accurately, this conversion can often result in errors in the inferred trait-locus genotypes, which, in turn, can lead to the misclassification of the recombination status of meioses. As a result, the recombination fraction can be overestimated in two-point analysis, and false exclusions of the true trait locus can occur in multipoint analysis. We propose a method that increases the robustness of multipoint analysis to errors in the mode-of-inheritance assumptions of the trait, by explicitly allowing for misclassification of trait-locus genotypes. To this end, the definition of the recombination fraction is extended to the complex plane, as Θ = θ + εi; θ is the recombination fraction between the actual ("real") genotypes of the marker and trait loci, and ε is the probability of apparent but false ("imaginary") recombination between the actual and inferred trait-locus genotypes. "Complex" multipoint LOD scores are proven to be stochastically equivalent to conventional two-point LOD scores. The greater robustness to modeling errors normally associated with two-point analysis can thus be extended to multiple two-point analysis and multipoint analysis. The use of complex-valued recombination fractions also allows the stochastic equivalence of "model-based" and "model-free" methods to be extended to multipoint analysis.

9.
A finite mixture distribution model for data collected from twins.
Most analyses of data collected from a classical twin study of monozygotic (MZ) and dizygotic (DZ) twins assume that zygosity has been diagnosed without error. However, large scale surveys frequently resort to questionnaire-based methods of diagnosis which classify twins as MZ or DZ with less than perfect accuracy. This article describes a mixture distribution approach to the analysis of twin data when zygosity is not perfectly diagnosed. Estimates of diagnostic accuracy are used to weight the likelihood of the data according to the probability that any given pair is either MZ or DZ. The performance of this method is compared to fully accurate diagnosis, and to the analysis of samples that include some misclassified pairs. Conventional analysis of samples containing misclassified pairs yields biased estimates of variance components, such that additive genetic variance (A) is underestimated while common environment (C) and specific environment (E) components are overestimated. The bias is non-trivial; for 10% misclassification, true A : C : E variance components of 0.6 : 0.2 : 0.2 are estimated as 0.48 : 0.29 : 0.23, respectively. The mixture distribution yields unbiased estimates, while showing relatively little loss of statistical precision for misclassification rates of 15% or less. The method is shown to perform quite well even when no information on zygosity is available, and may be applied when pair-specific estimates of zygosity probabilities are available.
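
The weighting idea can be sketched with a toy ACE likelihood in which each pair's contribution is a mixture of MZ and DZ bivariate-normal densities, weighted by an assumed pair-specific probability of being MZ. The simulation settings, the 90%/10% zygosity probabilities, and the direct numerical optimization are illustrative assumptions, not the article's software.

```python
# Sketch of a zygosity-mixture likelihood for an ACE twin model on standardized
# trait values, with pair-specific probabilities of being MZ.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def ace_cov(a2, c2, e2, r_gen):
    v = a2 + c2 + e2
    cov = r_gen * a2 + c2
    return np.array([[v, cov], [cov, v]])

# simulate 500 MZ and 500 DZ pairs under A=.6, C=.2, E=.2
mz = rng.multivariate_normal([0, 0], ace_cov(.6, .2, .2, 1.0), 500)
dz = rng.multivariate_normal([0, 0], ace_cov(.6, .2, .2, 0.5), 500)
pairs = np.vstack([mz, dz])
p_mz = np.concatenate([np.full(500, 0.9), np.full(500, 0.1)])   # assumed zygosity probabilities

def negloglik(params):
    a2, c2 = params
    e2 = 1.0 - a2 - c2
    if a2 < 0 or c2 < 0 or e2 <= 0:
        return np.inf
    f_mz = multivariate_normal(mean=[0, 0], cov=ace_cov(a2, c2, e2, 1.0)).pdf(pairs)
    f_dz = multivariate_normal(mean=[0, 0], cov=ace_cov(a2, c2, e2, 0.5)).pdf(pairs)
    return -np.sum(np.log(p_mz * f_mz + (1 - p_mz) * f_dz))

fit = minimize(negloglik, x0=[0.4, 0.3], method="Nelder-Mead")
print("A, C, E estimates:", np.round([*fit.x, 1 - sum(fit.x)], 2))
```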

10.
We develop a Bayesian simulation-based approach for determining the sample size required for estimating a binomial probability, and the difference between two binomial probabilities, where we allow for dependence between two fallible diagnostic procedures. Examples include estimating the prevalence of disease in a single population based on results from two imperfect diagnostic tests applied to sampled individuals, or surveys designed to compare the prevalences of two populations using diagnostic outcomes that are subject to misclassification. We propose a two-stage procedure in which the tests are initially assumed to be independent conditional on true disease status (i.e. conditionally independent). An interval-based sample size determination scheme is performed under this assumption, and data are collected and used to test the conditional independence assumption. If the data reveal the diagnostic tests to be conditionally dependent, structure is added to the model to account for the dependence and the sample size routine is repeated in order to properly satisfy the criterion under the correct model. We also examine the impact on the required sample size of adding an extra heterogeneous population to a study.

11.
Because of the availability of efficient, user-friendly computer analysis programs, the construction of multilocus human genetic maps has become commonplace. At the level of resolution at which most of these maps have been developed, the methods have proved to be robust. This may not be true in the construction of high-resolution linkage maps (3-cM interlocus resolution or less). High-resolution meiotic maps, by definition, have a low probability of recombination occurring in an interval. As such, even low frequencies of errors in typing (1.5% or less) may influence mapping outcomes. To investigate the influence of aberrant observations on high-resolution maps, a Monte Carlo simulation analysis of multipoint linkage data was performed. Introduction of error was observed to reduce power to discriminate orders, dramatically inflate map length, and provide significant support for incorrect over correct orders. These results appear to be due to the misclassification of nonrecombinant gametes as multiple recombinants. Chi-square-like goodness-of-fit analysis appears to be quite sensitive to the appearance of misclassified gametes, providing a simple test for aberrant data sets. Multiple pairwise likelihood analysis appears to be less sensitive than does multipoint analysis and may serve as a check for map validity.
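
The map-inflation effect can be reproduced with a toy simulation, sketched below under assumed values (three ordered markers, phase-known gametes, interval recombination fraction 0.03, 1.5% typing error at the middle locus). It is not the paper's Monte Carlo design; map length is approximated here simply as the sum of two-point recombination fraction estimates.

```python
# Sketch: typing errors at a middle marker create spurious double recombinants
# and inflate the estimated map length in tightly linked data.
import numpy as np

rng = np.random.default_rng(2)
n, theta, err = 10000, 0.03, 0.015      # meioses, interval recomb. fraction, error rate

# phase-known gametes over three ordered markers: 0/1 grand-parental origin
m1 = rng.integers(0, 2, n)
m2 = np.where(rng.random(n) < theta, 1 - m1, m1)
m3 = np.where(rng.random(n) < theta, 1 - m2, m2)

def map_length(a, b, c):
    # crude map length: sum of two-point recombination fraction estimates
    return np.mean(a != b) + np.mean(b != c)

m2_err = np.where(rng.random(n) < err, 1 - m2, m2)   # typing errors at middle locus
print("map length without typing error: %.3f" % map_length(m1, m2, m3))
print("map length with 1.5%% typing error: %.3f" % map_length(m1, m2_err, m3))
```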

12.
Holcroft CA, Spiegelman D. Biometrics 1999, 55(4):1193-1201.
We compared several validation study designs for estimating the odds ratio of disease with misclassified exposure. We assumed that the outcome and the misclassified binary covariate are available and that the error-free binary covariate is measured in a subsample, the validation sample. We considered designs in which the total size of the validation sample is fixed and the probability of selection into the validation sample may depend on the outcome and the misclassified covariate values. Design comparisons were conducted for rare and common disease scenarios, where the optimal design is the one that minimizes the variance of the maximum likelihood estimator of the true log odds ratio relating the outcome to the exposure of interest. Misclassification rates were assumed to be independent of the outcome. We used a sensitivity analysis to assess the effect of misspecifying the misclassification rates. Under the scenarios considered, our results suggest that a balanced design, which allocates equal numbers of validation subjects to each of the four outcome/mismeasured-covariate categories, is preferable for its simplicity and good performance. A user-friendly Fortran program, available from the second author, calculates the optimal sampling fractions for all designs considered and the efficiencies of these designs relative to the optimal hybrid design for any scenario of interest.

13.
We investigate whether the use of a priori knowledge can improve medical decision making. We compare two frameworks of classification – direct and indirect classification – with respect to different classification errors: differential misclassification, observed misclassification and true misclassification. We analyze the general behavior of the classifiers in an artificial example and, being interested in the diagnosis of early glaucoma, we further adapt a simulation model of the optic nerve head. Indirect classifiers outperform direct classifiers in certain parameter situations of a Monte-Carlo study. In summary, we demonstrate that indirect classification provides a flexible framework for improving diagnostic rules by using explicit a priori knowledge in clinical research.

14.
Assuming that some explicit functional relationship acts as a mathematical model, we consider the case in which we are given a finite set of hypercuboids, each of which contains at least one point of the true functional relationship with a certain given probability. We present a procedure for constructing sets for the unknown parameters, together with probabilities that these sets contain the true parameter value. The procedure is illustrated by an example from pharmaceutical technology.

15.
This paper extends an approach for estimating the ancestry probability, the probability that an inbred line is an ancestor of a given hybrid, to account for genotyping errors. The effect of such errors on ancestry probability estimates is evaluated through simulation. The simulation study shows that if misclassification is ignored, then ancestry probabilities may be slightly overestimated. The sensitivity of ancestry probability calculations to the assumed genotyping error rate is also assessed.

16.
HTS data from primary screening are usually analyzed by setting a cutoff for activity, in order to minimize both false-negative and false-positive rates. An alternative approach, based on a calculated probability of being active, is presented here. Given the predicted confirmation rate derived from this probability, the number of primary positives selected for follow-up can be optimized to maximize the number of true positives without picking too many false positives. Typical cutoff-determining methods are more serendipitous in nature and cannot easily be tuned to optimize screening efforts. An additional advantage of calculating a probability of being active for each compound screened is that orthogonal mixtures can be deconvoluted without presetting a deconvolution threshold. An important consequence of using the probability of being active with orthogonal mixtures is that individual compound screening results can be recorded irrespective of whether the assays were performed on single compounds or on cocktails.
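
The selection idea can be sketched as below: compounds are ranked by their estimated probability of being active and the follow-up list is grown as long as the expected confirmation rate stays above a target. The probabilities and the target rate are made-up values for illustration.

```python
# Sketch: choose how many primary positives to follow up from per-compound
# probabilities of being active, rather than from a fixed activity cutoff.
import numpy as np

p_active = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.25, 0.10, 0.05])  # hypothetical
target_confirmation = 0.7

order = np.argsort(-p_active)            # most probable actives first
sorted_p = p_active[order]
running_rate = np.cumsum(sorted_p) / np.arange(1, len(sorted_p) + 1)
n_pick = int(np.max(np.nonzero(running_rate >= target_confirmation)[0]) + 1)

print("pick top", n_pick, "compounds")
print("expected true positives: %.2f" % sorted_p[:n_pick].sum())
print("expected confirmation rate: %.2f" % running_rate[n_pick - 1])
```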

17.
Empirical validation of the Essen-Möller probability of paternity.
The validity of the Essen-Möller formulation of the probability of paternity is supported by demonstrating its correctness in a model genetic system, the ABO system. An analysis was made of 1,393 paternity cases typed uniformly for HLA-A and -B, ABO, Rh, and MNSs, in which the mother named only one man as the child's father and in which both mother and putative father identified themselves as Caucasian. For purposes of analysis, putative fathers not excluded from paternity by the four systems tested were regarded as actual fathers. The joint distribution of observed triplets of ABO phenotypes is shown to be statistically consistent with expected values, and the fractions of "true" fathers for a given triplet closely approximated the probability of paternity calculated using a realistic prior probability. Recent allegations by Li and Chakravarty and by Aickin that the method is fallacious are discussed in terms of the results presented.
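
For reference, the Essen-Möller probability of paternity is a Bayesian posterior built from the likelihood of the mother/child/putative-father phenotypes under paternity (X) versus non-paternity (Y); the classical formula uses a prior of 0.5. The sketch below implements that formula; the numerical likelihoods are hypothetical, not taken from the study.

```python
# Sketch: probability of paternity from the paternity index X/Y and a prior.
def probability_of_paternity(X, Y, prior=0.5):
    paternity_index = X / Y
    return paternity_index * prior / (paternity_index * prior + (1 - prior))

print(probability_of_paternity(X=0.002, Y=0.0004))             # classical prior 0.5
print(probability_of_paternity(X=0.002, Y=0.0004, prior=0.8))  # a more "realistic" prior
```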

18.
Many confidence intervals calculated in practice are potentially not exact, either because the requirements for the interval estimator to be exact are known to be violated, or because the (exact) distribution of the data is unknown. If a confidence interval is approximate, the crucial question is how well its true coverage probability approximates its intended coverage probability. In this paper we propose to use the bootstrap to calculate an empirical estimate for the (true) coverage probability of a confidence interval. In the first instance, the empirical coverage can be used to assess whether a given type of confidence interval is adequate for the data at hand. More generally, when planning the statistical analysis of future trials based on existing data pools, the empirical coverage can be used to study the coverage properties of confidence intervals as a function of type of data, sample size, and analysis scale, and thus inform the statistical analysis plan for the future trial. In this sense, the paper proposes an alternative to the problematic pretest of the data for normality, followed by selection of the analysis method based on the results of the pretest. We apply the methodology to a data pool of bioequivalence studies, and in the selection of covariance patterns for repeated measures data.
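
A minimal sketch of the bootstrap coverage estimate follows: the existing data pool is treated as the population, samples of the planned size are repeatedly drawn from it, the candidate interval is computed each time, and the fraction of intervals covering the pooled "true" value is recorded. The skewed synthetic "pool", the t-based interval, and the sample size are illustrative assumptions.

```python
# Sketch: bootstrap-estimated empirical coverage of a nominal 95% t-interval
# for the mean, evaluated on a skewed data pool.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pool = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # stand-in for a data pool
true_value = pool.mean()
n, B, hits = 20, 2000, 0

for _ in range(B):
    sample = rng.choice(pool, size=n, replace=True)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    hits += (lo <= true_value <= hi)

print("empirical coverage of nominal 95%% interval: %.3f" % (hits / B))
```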

19.
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples when one of the training samples is subject to misclassification or mislabeling, e.g. diseased individuals are incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed. A simulation study explores the properties of the maximum likelihood parameter estimates and of the estimates of the number of mislabeled observations.
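
A simplified sketch of the idea is given below: if true cases are labeled "control" with probability theta, the observed label follows a "defective" logistic model P(labeled case | x) = (1 - theta) p(x), whose parameters can be fitted by maximum likelihood and then used to estimate the probability of true case status among labeled controls. The simulation settings, the symmetric-error-free assumption for labeled cases, and the direct optimization are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: maximum likelihood for logistic discrimination with mislabeled controls.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
p_true = expit(-0.5 + 2.0 * x)
y = rng.binomial(1, p_true)                       # true status
theta_true = 0.2
s = y * rng.binomial(1, 1 - theta_true, n)        # observed label; some cases -> "controls"

def negloglik(params):
    b0, b1, logit_theta = params
    theta = expit(logit_theta)
    p = expit(b0 + b1 * x)
    q = np.clip((1 - theta) * p, 1e-12, 1 - 1e-12)   # P(labeled case | x)
    return -np.sum(s * np.log(q) + (1 - s) * np.log(1 - q))

fit = minimize(negloglik, x0=[0.0, 1.0, 0.0], method="BFGS")
b0, b1, theta = fit.x[0], fit.x[1], expit(fit.x[2])
print("estimated mislabeling rate theta: %.2f" % theta)

# probability of being a true case among those labeled as controls
p_hat = expit(b0 + b1 * x)
p_case_given_control = theta * p_hat / (1 - (1 - theta) * p_hat)
print("mean P(true case | labeled control): %.2f" % p_case_given_control[s == 0].mean())
```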

20.
Within the framework of Fisher's discriminant analysis, we propose a multiclass classification method which embeds variable screening for ultrahigh-dimensional predictors. Leveraging interfeature correlations, we show that the proposed linear classifier recovers informative features with probability tending to one and can asymptotically achieve a zero misclassification rate. We evaluate the finite-sample performance of the method via extensive simulations and use the method to classify post-transplantation rejection types based on patients' gene expression profiles.

