Similar literature
 20 similar documents found
1.
Variability between raters' ordinal scores is commonly observed in imaging tests, leading to uncertainty in the diagnostic process. In breast cancer screening, a radiologist visually interprets mammograms and MRIs, while skin diseases, Alzheimer's disease, and psychiatric conditions are graded based on clinical judgment. Consequently, studies are often conducted in clinical settings to investigate whether a new training tool can improve the interpretive performance of raters. In such studies, a large group of experts each classify a set of patients' test results on two separate occasions, before and after some form of training with the goal of assessing the impact of training on experts' paired ratings. However, due to the correlated nature of the ordinal ratings, few statistical approaches are available to measure association between raters' paired scores. Existing measures are restricted to assessing association at just one time point for a single screening test. We propose here a novel paired kappa to provide a summary measure of association between many raters' paired ordinal assessments of patients' test results before versus after rater training. Intrarater association also provides valuable insight into the consistency of ratings when raters view a patient's test results on two occasions with no intervention undertaken between viewings. In contrast to existing correlated measures, the proposed kappa is a measure that provides an overall evaluation of the association among multiple raters' scores from two time points and is robust to the underlying disease prevalence. We implement our proposed approach in two recent breast-imaging studies and conduct extensive simulation studies to evaluate properties and performance of our summary measure of association.

2.
Large-scale agreement studies are becoming increasingly common in medical settings to gain better insight into discrepancies often observed between experts' classifications. Ordered categorical scales are routinely used to classify subjects' disease and health conditions. Summary measures such as Cohen's weighted kappa are popular approaches for reporting levels of association for pairs of raters' ordinal classifications. However, in large-scale studies with many raters, assessing levels of association can be challenging due to dependencies between many raters each grading the same sample of subjects' results and the ordinal nature of the ratings. Further complexities arise when the focus of a study is to examine the impact of rater and subject characteristics on levels of association. In this paper, we describe a flexible approach based upon the class of generalized linear mixed models to assess the influence of rater and subject factors on association between many raters' ordinal classifications. We propose novel model-based measures for large-scale studies to provide simple summaries of association similar to Cohen's weighted kappa while avoiding prevalence and marginal distribution issues that Cohen's weighted kappa is susceptible to. The proposed summary measures can be used to compare association between subgroups of subjects or raters. We demonstrate the use of hypothesis tests to formally determine if rater and subject factors have a significant influence on association, and describe approaches for evaluating the goodness-of-fit of the proposed model. The performance of the proposed approach is explored through extensive simulation studies and is applied to a recent large-scale breast cancer screening study.
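
For orientation, the classical Cohen's weighted kappa that this abstract uses as its point of comparison can be sketched as follows. This is a minimal illustration for two raters only, not the model-based measure the authors propose; the function name `weighted_kappa`, the integer category coding 0..k-1, and the choice of linear or quadratic disagreement weights are assumptions made for this sketch.

```python
def weighted_kappa(r1, r2, n_categories, weights="quadratic"):
    """Cohen's weighted kappa for two raters (illustrative sketch).

    r1, r2: equal-length sequences of categories coded 0..n_categories-1.
    weights: "linear" or "quadratic" disagreement weights (assumed options).
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    k = n_categories
    if k < 2:
        raise ValueError("need at least two categories")
    n = len(r1)

    # Observed joint proportions of (rater 1 category, rater 2 category).
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximal distance.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when chance disagreement is zero")
    return 1.0 - observed / expected
```

Because the chance-expected term depends on the two raters' marginal distributions, the value shifts with prevalence, which is the sensitivity the abstract refers to.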

3.
The median failure time is often utilized to summarize survival data because it has a more straightforward interpretation for investigators in practice than the popular hazard function. However, existing methods for comparing median failure times for censored survival data either require estimation of the probability density function or involve complicated formulas to calculate the variance of the estimates. In this article, we modify a K-sample median test for censored survival data (Brookmeyer and Crowley, 1982, Journal of the American Statistical Association 77, 433–440) through a simple contingency table approach where each cell counts the number of observations in each sample that are greater than the pooled median or vice versa. Under censoring, this approach would generate noninteger entries for the cells in the contingency table. We propose to construct a weighted asymptotic test statistic that aggregates dependent χ²-statistics formed at the nearest integer points to the original noninteger entries. We show that this statistic follows approximately a χ²-distribution with k − 1 degrees of freedom. For a small sample case, we propose a test statistic based on combined p-values from Fisher’s exact tests, which follows a χ²-distribution with 2 degrees of freedom. Simulation studies are performed to show that the proposed method provides reasonable type I error probabilities and powers. The proposed method is illustrated with two real datasets from phase III breast cancer clinical trials.

4.
We analysed the peer review of grant proposals under Marie Curie Actions, a major EU research funding instrument, which involves two steps: an independent assessment (Individual Evaluation Report, IER) performed remotely by 3 raters, and a consensus opinion reached during a meeting by the same raters (Consensus Report, CR). For 24,897 proposals evaluated from 2007 to 2013, the association between average IER and CR scores was very high across different panels, grant calls and years. Median average deviation (AD) index, used as a measure of inter-rater agreement, was 5.4 points on a 0-100 scale (interquartile range 3.4-8.3), overall, demonstrating a good general agreement among raters. For proposals where one rater disagreed with the other two raters (n=1424; 5.7%), or where all 3 raters disagreed (n=2075; 8.3%), the average IER and CR scores were still highly associated. Disagreement was more frequent for proposals from Economics/Social Sciences and Humanities panels. Greater disagreement was observed for proposals with lower average IER scores. CR scores for proposals with initial disagreement were also significantly lower. Proposals with a large absolute difference between the average IER and CR scores (≥10 points; n=368, 1.5%) generally had lower CR scores. An inter-correlation matrix of individual raters' scores of evaluation criteria of proposals indicated that these scores were, in general, a reflection of raters’ overall scores. Our analysis demonstrated a good internal consistency and general high agreement among raters. Consensus meetings appear to be relevant for particular panels and subsets of proposals with large differences among raters’ scores.

5.
Power investigations, for example, in statistical procedures for the assessment of agreement among multiple raters often require the simultaneous simulation of several dependent binomial or Poisson distributions to appropriately model the stochastic dependencies between the raters' results. Regarding the rather large dimensions of the random vectors to be generated and the even larger number of interactions to be introduced into the simulation scenarios to determine all necessary information on their distributions' dependence structure, one needs efficient and fast algorithms for the simulation of multivariate Poisson and binomial distributions. Therefore two equivalent models for the multivariate Poisson distribution are combined to obtain an algorithm for the quick implementation of its multivariate dependence structure. Simulation of the multivariate Poisson distribution then becomes feasible by first generating and then convolving independent univariate Poisson variates with appropriate expectations. The latter can be computed via linear recursion formulae. Similar means for simulation are also considered for the binomial setting. In this scenario it turns out, however, that exact computation of the probability function is even easier to perform; therefore corresponding linear recursion formulae for the point probabilities of multivariate binomial distributions are presented, which only require information about the index parameter and the (simultaneous) success probabilities, that is the multivariate dependence structure among the binomial marginals.
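
The idea of generating independent univariate Poisson variates and summing them can be sketched with the well-known common-shock construction, in which each subset of components shares one independent Poisson term. This is an illustrative sketch only, not the algorithm of the paper: the names `poisson_variate`, `multivariate_poisson` and the `shock_rates` mapping are hypothetical, the rates are supplied directly rather than computed by the recursion formulae the abstract mentions, and the sampler is Knuth's multiplication method, which is only practical for small rates.

```python
import math
import random


def poisson_variate(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def multivariate_poisson(rng, shock_rates, dim):
    """Draw one dependent Poisson vector of length dim.

    shock_rates maps a tuple of component indices to the rate of an
    independent Poisson term shared by exactly those components
    (a hypothetical parameterisation chosen for this sketch).
    """
    x = [0] * dim
    for subset, lam in shock_rates.items():
        y = poisson_variate(rng, lam)  # independent univariate variate
        for i in subset:
            x[i] += y  # summing shared terms induces the dependence
    return x


# Example: two raters' counts with marginal means 1.5 and 2.5, covariance 0.5.
rng = random.Random(0)
rates = {(0,): 1.0, (1,): 2.0, (0, 1): 0.5}
sample = [multivariate_poisson(rng, rates, 2) for _ in range(5)]
```

In this construction each marginal mean is the sum of the rates of all terms containing that component, and the covariance of two components is the sum of the rates of the terms they share, so only non-negative dependence can be represented.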

6.
Two methods of testing multivariate hypotheses in an I × J × K contingency table are presented. In one case, use is made of the trace of the matrix of sum of squares and sum of products while in the other, determinants of the matrices are used to construct test statistics. Asymptotic equivalence of the methods is shown.

7.
Asymptotic and exact conditional approaches have often been used for testing agreement between two raters with binary outcomes. The exact conditional approach is guaranteed to respect the test size as compared to the traditionally used asymptotic approach based on the standardized Cohen's kappa coefficient. An alternative to the conditional approach is an unconditional strategy which relaxes the restriction of fixed marginal totals as in the conditional approach. Three exact unconditional hypothesis testing procedures are considered in this article: an approach based on maximization, an approach based on the conditional p-value and maximization, and an approach based on estimation and maximization. We compared these testing procedures based on the commonly used Cohen's kappa with regard to test size and power. We recommend the following two exact approaches for use in practice due to power advantages: the approach based on conditional p-value and maximization and the approach based on estimation and maximization.

8.
Various methods for quantifying cellular immunogold labelling on transmission electron microscope thin sections are currently available. All rely on sound random sampling principles and are applicable to single immunolabelling across compartments within a given cell type or between different experimental groups of cells. Although methods are also available to test for colocalization in double/triple immunogold labelling studies, so far, these have relied on making multiple measurements of gold particle densities in defined areas or of inter-particle nearest neighbour distances. Here, we present alternative two-step approaches to codistribution and colocalization assessment that merely require raw counts of gold particles in distinct cellular compartments. For assessing codistribution over aggregate compartments, initial statistical evaluation involves combining contingency table and chi-squared analyses to provide predicted gold particle distributions. The observed and predicted distributions allow testing of the appropriate null hypothesis, namely, that there is no difference in the distribution patterns of proteins labelled by different sizes of gold particle. In short, the null hypothesis is that of colocalization. The approach for assessing colabelling recognises that, on thin sections, a compartment is made up of a set of sectional images (profiles) of cognate structures. The approach involves identifying two groups of compartmental profiles that are unlabelled and labelled for one gold marker size. The proportions in each group that are also labelled for the second gold marker size are then compared. Statistical analysis now uses a 2 × 2 contingency table combined with the Fisher exact probability test. Having identified double labelling, the profiles can be analysed further in order to identify characteristic features that might account for the double labelling. 
In each case, the approach is illustrated using synthetic and/or experimental datasets and can be refined to correct observed labelling patterns to specific labelling patterns. These simple and efficient approaches should be of more immediate utility to those interested in codistribution and colocalization in multiple immunogold labelling investigations.
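
The colabelling step described above (a 2 × 2 contingency table analysed with the Fisher exact probability test) can be sketched as follows. This is a minimal two-sided implementation under stated assumptions: the function name `fisher_exact_2x2` is hypothetical, the two-sided p-value sums all tables that are no more probable than the observed one, and the counts in the example are invented purely for illustration.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Example layout (assumed): rows are profiles labelled / unlabelled
    for the first gold size; columns are profiles labelled / unlabelled
    for the second gold size.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all marginal totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at most as probable as the observed one; the small
    # relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Invented counts: 12 of 15 marker-1-labelled profiles also carry marker 2,
# against 4 of 20 marker-1-unlabelled profiles.
p_value = fisher_exact_2x2(12, 3, 4, 16)
```

A small p-value here would suggest that the proportion of profiles carrying the second marker differs between the two groups.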

9.
Basu S, Banerjee M, Sen A. Biometrics, 2000, 56(2): 577-582
Cohen's kappa coefficient is a widely popular measure for chance-corrected nominal scale agreement between two raters. This article describes Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo (MCMC) methodology. We consider the case of m ≥ 2 independent samples of measured agreement, where in each sample a given subject is rated by two rating protocols on a binary scale. A major focus here is on testing the homogeneity of the kappa coefficient across the different samples. The existing frequentist tests for this case assume exchangeability of rating protocols, whereas our proposed Bayesian test does not make any such assumption. Extensive simulation is carried out to compare the performances of the Bayesian and the frequentist tests. The developed methodology is illustrated using data from a clinical trial in ophthalmology.

10.
Linear rank tests are widely used when testing for independence against stochastic order in a 2 × J contingency table with two treatments and J ordered outcome levels. For this purpose, numerical scores are assigned, possibly by default, to the J outcome levels. When the choice of scores is not apparent, integer (equally spaced) scores are often assigned. We show that this practice generally leads to unnecessarily conservative tests. The use of slightly perturbed scores will result in a less conservative and uniformly more powerful test.

11.
Clinical studies are often concerned with assessing whether different raters/methods produce similar values for measuring a quantitative variable. Use of the concordance correlation coefficient as a measure of reproducibility has gained popularity in practice since its introduction by Lin (1989, Biometrics 45, 255-268). Lin's method is applicable for studies evaluating two raters/two methods without replications. Chinchilli et al. (1996, Biometrics 52, 341-353) extended Lin's approach to repeated measures designs by using a weighted concordance correlation coefficient. However, the existing methods cannot easily accommodate covariate adjustment, especially when one needs to model agreement. In this article, we propose a generalized estimating equations (GEE) approach to model the concordance correlation coefficient via three sets of estimating equations. The proposed approach is flexible in that (1) it can accommodate more than two correlated readings and test for the equality of dependent concordant correlation estimates; (2) it can incorporate covariates predictive of the marginal distribution; (3) it can be used to identify covariates predictive of concordance correlation; and (4) it requires minimal distribution assumptions. A simulation study is conducted to evaluate the asymptotic properties of the proposed approach. The method is illustrated with data from two biomedical studies.
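
As background for this abstract, the basic sample concordance correlation coefficient for two raters without replication can be sketched as below. It is the plain moment estimator, not the GEE approach the authors propose; the function name `concordance_correlation` and the use of 1/n (rather than 1/(n-1)) variances are choices made for this sketch.

```python
def concordance_correlation(x, y):
    """Sample concordance correlation coefficient for paired readings.

    Computes 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2) with 1/n
    moment estimators (an assumption of this sketch).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("coefficient is undefined for constant, identical samples")
    return 2.0 * cov_xy / denom
```

Unlike the Pearson correlation, the coefficient is penalised by a systematic shift between the two sets of readings: for x = [1, 2, 3] and y = [2, 3, 4] the Pearson correlation is 1, while the value here is 4/7.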

12.
13.
Several asymptotic tests were proposed for testing the null hypothesis of marginal homogeneity in square contingency tables with r categories. A simulation study was performed for comparing the power of four finite conservative conditional test procedures and of two asymptotic tests for twelve different contingency schemes for small sample sizes. While an asymptotic test proposed by STUART (1955) showed a rather satisfactory behaviour for moderate sample sizes, an asymptotic test proposed by BHAPKAR (1966) was quite anticonservative. With no a priori information the performance of (r - 1) simultaneous conditional binomial tests with a Bonferroni adjustment proved to be a quite efficient procedure. With assumptions about where to expect the deviations from the null hypothesis, other procedures favouring the larger or smaller conditional sample sizes, respectively, can have a great efficiency. The procedures are illustrated by means of a numerical example from clinical psychology.

14.
S T Gross. Biometrics, 1986, 42(4): 883-893
Published results on the use of the kappa coefficient of agreement have traditionally been concerned with situations where a large number of subjects is classified by a small group of raters. The coefficient is then used to assess the degree of agreement among the raters through hypothesis testing or confidence intervals. A modified kappa coefficient of agreement for multiple categories is proposed and a parameter-free distribution for testing null agreement is provided, for use when the number of raters is large relative to the number of categories and subjects. The large-sample distribution of kappa is shown to be normal in the nonnull case, and confidence intervals for kappa are provided. The results are extended to allow for an unequal number of raters per subject.

15.

Background

We consider the problem of assessing inter-rater agreement when there are missing data and a large number of raters. Previous studies have shown only ‘moderate’ agreement between pathologists in grading breast cancer tumour specimens. We analyse a large but incomplete data-set consisting of 24177 grades, on a discrete 1–3 scale, provided by 732 pathologists for 52 samples.

Methodology/Principal Findings

We review existing methods for analysing inter-rater agreement for multiple raters and demonstrate two further methods. Firstly, we examine a simple non-chance-corrected agreement score based on the observed proportion of agreements with the consensus for each sample, which makes no allowance for missing data. Secondly, treating grades as lying on a continuous scale representing tumour severity, we use a Bayesian latent trait method to model cumulative probabilities of assigning grade values as functions of the severity and clarity of the tumour and of rater-specific parameters representing boundaries between grades 1–2 and 2–3. We simulate from the fitted model to estimate, for each rater, the probability of agreement with the majority. Both methods suggest that there are differences between raters in terms of rating behaviour, most often caused by consistent over- or under-estimation of the grade boundaries, and also considerable variability in the distribution of grades assigned to many individual samples. The Bayesian model addresses the tendency of the agreement score to be biased upwards for raters who, by chance, see a relatively ‘easy’ set of samples.

Conclusions/Significance

Latent trait models can be adapted to provide novel information about the nature of inter-rater agreement when the number of raters is large and there are missing data. In this large study there is substantial variability between pathologists and uncertainty in the identity of the ‘true’ grade of many of the breast cancer tumours, a fact often ignored in clinical studies.

16.
Weighted least-squares approach for comparing correlated kappa   Cited by: 3 (self-citations: 0; citations by others: 3)
Barnhart HX, Williamson JM. Biometrics, 2002, 58(4): 1012-1019
In the medical sciences, studies are often designed to assess the agreement between different raters or different instruments. The kappa coefficient is a popular index of agreement for binary and categorical ratings. Here we focus on testing for the equality of two dependent kappa coefficients. We use the weighted least-squares (WLS) approach of Koch et al. (1977, Biometrics 33, 133-158) to take into account the correlation between the estimated kappa statistics. We demonstrate how the SAS PROC CATMOD can be used to test for the equality of dependent Cohen's kappa coefficients and dependent intraclass kappa coefficients with nominal categorical ratings. We also test for the equality of dependent Cohen's kappa and dependent weighted kappa with ordinal ratings. The major advantage of the WLS approach is that it allows the data analyst a way of testing dependent kappa with popular SAS software. The WLS approach can handle any number of categories. Analyses of three biomedical studies are used for illustration.

17.
A three-step procedure for analysing multi-dimensional contingency tables is introduced by way of a medical example. This procedure has some good properties. Step one captures the relationship structure between the variables connected by the contingency table. Only so-called graphical models, a subclass of hierarchical models with regard to the parameters of the log-linear model, are admitted. The models can be generated by combining hypotheses of pairwise conditional independence. For this, a so-called Extended Combination Procedure is proposed that uses the position of the Chain of (hierarchical) Hypotheses. A useful symbolic notation for ‘Dependence Models’, in addition to that in the form of ‘Independence Models’ and ‘Minimal Sets’, is proposed. Step two analyses the significant conditional pairs with regard to the question of for which attribute level combinations of the condition complexes the relations remain significant. Step three investigates the tables recognised as significant in step two more closely to get ideas about the ‘sources’ of dependencies and the possibilities of collapsing parts of the table. The procedure is mostly used in exploratory data analysis, although the simple steps can also be used to test hypotheses.

18.
An agreement index among more than two raters who employ ordinal classification is proposed here as an extension of the index for agreement between two raters outlined by JOLAYEMI (Biom. J. 32 (1990), 87–93). The method of application is outlined using a clinical diagnosis involving seven pathologists.

19.
Jürgen Habermas has argued against prenatal genetic interventions used to influence traits on the grounds that only biogenetic contingency in the conception of children preserves the conditions that make the presumption of moral equality possible. This argument fails for a number of reasons. The contingency that Habermas points to as the condition of moral equality is an artifact of evolutionary contingency and not inviolable in itself. Moreover, as a precedent for genetic interventions, parents and society already affect children's traits, which is to say there is moral precedent for influencing the traits of descendants. A veil‐of‐ignorance methodology can also be used to justify prenatal interventions through its method of advance consent and its preservation of the contingency of human identities in a moral sense. In any case, the selection of children's traits does not undermine the prospects of authoring a life since their future remains just as contingent morally as if no trait had been selected. Ironically, the prospect of preserving human beings as they are – to counteract genetic drift – might even require interventions to preserve the ability to author a life in a moral sense. In light of these analyses, Habermas' concerns about prenatal genetic interventions cannot succeed as objections to their practice as a matter of principle; the merits of these interventions must be evaluated individually.

20.
The original intrinsic rank test is generalized in that the sizes of the k samples may now be arbitrary, and the number of intrinsic rank intervals need not equal the number of samples. Furthermore, the size of these intervals can be made variable, subject only to relatively mild constraints. These generalizations permit the formulation and testing of more specific hypotheses concerning the commonality of the sample distributions. A generalized intrinsic rank function is used to transform the usual ordinal ranks, obtained from the combined samples, into intrinsic ranks. Original sample identity and intrinsic ranks are then cross-tabulated and evaluated as a 2-way contingency table.
