Similar Articles (20 results)
1.
S T Gross, Biometrics, 1986, 42(4):883-893
Published results on the use of the kappa coefficient of agreement have traditionally been concerned with situations where a large number of subjects is classified by a small group of raters. The coefficient is then used to assess the degree of agreement among the raters through hypothesis testing or confidence intervals. A modified kappa coefficient of agreement for multiple categories is proposed and a parameter-free distribution for testing null agreement is provided, for use when the number of raters is large relative to the number of categories and subjects. The large-sample distribution of kappa is shown to be normal in the nonnull case, and confidence intervals for kappa are provided. The results are extended to allow for an unequal number of raters per subject.
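Nearly every abstract in this list builds on Cohen's kappa, so a minimal Python sketch of the classical two-rater estimator may help orient the reader (this is the standard coefficient, not the modified large-rater version proposed in this paper; the rating data are invented for illustration):

```python
import numpy as np

def cohens_kappa(r1, r2, categories):
    """Chance-corrected agreement between two raters' nominal ratings."""
    n = len(r1)
    # Observed agreement: proportion of subjects rated identically
    po = np.mean(np.array(r1) == np.array(r2))
    # Chance agreement expected from the two raters' marginal distributions
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe)

rater1 = ["a", "a", "b", "b", "a", "b", "a", "a"]
rater2 = ["a", "a", "b", "a", "a", "b", "b", "a"]
kappa = cohens_kappa(rater1, rater2, {"a", "b"})  # observed 0.75, chance 0.53
```

A value near 1 indicates agreement well beyond chance, while 0 means the raters agree no more often than their marginal rating frequencies would produce by accident.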

2.
Barlow W, Biometrics, 1996, 52(2):695-702
The kappa coefficient measures chance-corrected agreement between two observers in the dichotomous classification of subjects. The marginal probability of classification by each rater may depend on one or more confounding variables, however. Failure to account for these confounders may lead to inflated estimates of agreement. A multinomial model is used that assumes both raters have the same marginal probability of classification, but this probability may depend on one or more covariates. The model may be fit using software for conditional logistic regression. Additionally, likelihood-based confidence intervals for the parameter representing agreement may be computed. A simple example is discussed to illustrate model-fitting and application of the technique.

3.
Clinical studies are often concerned with assessing whether different raters/methods produce similar values for measuring a quantitative variable. Use of the concordance correlation coefficient as a measure of reproducibility has gained popularity in practice since its introduction by Lin (1989, Biometrics 45, 255-268). Lin's method is applicable for studies evaluating two raters/two methods without replications. Chinchilli et al. (1996, Biometrics 52, 341-353) extended Lin's approach to repeated measures designs by using a weighted concordance correlation coefficient. However, the existing methods cannot easily accommodate covariate adjustment, especially when one needs to model agreement. In this article, we propose a generalized estimating equations (GEE) approach to model the concordance correlation coefficient via three sets of estimating equations. The proposed approach is flexible in that (1) it can accommodate more than two correlated readings and test for the equality of dependent concordant correlation estimates; (2) it can incorporate covariates predictive of the marginal distribution; (3) it can be used to identify covariates predictive of concordance correlation; and (4) it requires minimal distribution assumptions. A simulation study is conducted to evaluate the asymptotic properties of the proposed approach. The method is illustrated with data from two biomedical studies.
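Lin's concordance correlation coefficient has a simple closed sample form; a small NumPy sketch of the plain two-rater estimator follows (without the GEE covariate modelling this article proposes, and with made-up readings):

```python
import numpy as np

def concordance_cc(x, y):
    """Lin's concordance correlation coefficient between paired readings."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Covariance and variances (maximum-likelihood versions, dividing by n)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    # Denominator penalizes both poor correlation and location shift
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated but shifted by 1
ccc = concordance_cc(method_a, method_b)
```

Unlike Pearson's correlation, which equals 1 for these two methods, the CCC is pulled below 1 by the constant shift, which is exactly why it is preferred as a reproducibility measure.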

4.
Weighted least-squares approach for comparing correlated kappa
Barnhart HX, Williamson JM, Biometrics, 2002, 58(4):1012-1019
In the medical sciences, studies are often designed to assess the agreement between different raters or different instruments. The kappa coefficient is a popular index of agreement for binary and categorical ratings. Here we focus on testing for the equality of two dependent kappa coefficients. We use the weighted least-squares (WLS) approach of Koch et al. (1977, Biometrics 33, 133-158) to take into account the correlation between the estimated kappa statistics. We demonstrate how the SAS PROC CATMOD can be used to test for the equality of dependent Cohen's kappa coefficients and dependent intraclass kappa coefficients with nominal categorical ratings. We also test for the equality of dependent Cohen's kappa and dependent weighted kappa with ordinal ratings. The major advantage of the WLS approach is that it allows the data analyst a way of testing dependent kappa with popular SAS software. The WLS approach can handle any number of categories. Analyses of three biomedical studies are used for illustration.
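The dependent Cohen's and weighted kappas under comparison are each computed from the raters' cross-tabulation; a NumPy sketch of the weighted kappa for ordinal ratings follows (the WLS test of equality itself is not reproduced here, and the ratings are hypothetical, coded 0 to k-1):

```python
import numpy as np

def weighted_kappa(r1, r2, k, weights="linear"):
    """Weighted kappa for two raters' ordinal ratings coded 0..k-1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    # Disagreement weights: |i-j| (linear) or (i-j)^2 (quadratic)
    i, j = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    w = np.abs(i - j) if weights == "linear" else (i - j) ** 2
    # Observed cell proportions and chance-expected proportions
    obs = np.zeros((k, k))
    for a, b in zip(r1, r2):
        obs[a, b] += 1 / n
    p1 = np.bincount(r1, minlength=k) / n
    p2 = np.bincount(r2, minlength=k) / n
    exp = np.outer(p1, p2)
    return 1 - (w * obs).sum() / (w * exp).sum()

inst_a = [0, 1, 1, 2, 0, 2, 1, 0]   # e.g. severity grades from instrument A
inst_b = [0, 1, 2, 2, 0, 1, 1, 0]   # grades from instrument B
kappa_lin = weighted_kappa(inst_a, inst_b, k=3, weights="linear")
kappa_quad = weighted_kappa(inst_a, inst_b, k=3, weights="quadratic")
```

Because all disagreements in this example are off by only one grade, the quadratic weighting, which penalizes large discrepancies more heavily, yields the higher coefficient.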

5.
Large-scale agreement studies are becoming increasingly common in medical settings to gain better insight into discrepancies often observed between experts' classifications. Ordered categorical scales are routinely used to classify subjects' disease and health conditions. Summary measures such as Cohen's weighted kappa are popular approaches for reporting levels of association for pairs of raters' ordinal classifications. However, in large-scale studies with many raters, assessing levels of association can be challenging due to dependencies between many raters each grading the same sample of subjects' results and the ordinal nature of the ratings. Further complexities arise when the focus of a study is to examine the impact of rater and subject characteristics on levels of association. In this paper, we describe a flexible approach based upon the class of generalized linear mixed models to assess the influence of rater and subject factors on association between many raters' ordinal classifications. We propose novel model-based measures for large-scale studies to provide simple summaries of association similar to Cohen's weighted kappa while avoiding the prevalence and marginal distribution issues that Cohen's weighted kappa is susceptible to. The proposed summary measures can be used to compare association between subgroups of subjects or raters. We demonstrate the use of hypothesis tests to formally determine if rater and subject factors have a significant influence on association, and describe approaches for evaluating the goodness-of-fit of the proposed model. The performance of the proposed approach is explored through extensive simulation studies and is applied to a recent large-scale breast cancer screening study.

6.
Guo Y, Manatunga AK, Biometrics, 2009, 65(1):125-134
Assessing agreement is often of interest in clinical studies to evaluate the similarity of measurements produced by different raters or methods on the same subjects. We present a modified weighted kappa coefficient to measure agreement between bivariate discrete survival times. The proposed kappa coefficient accommodates censoring by redistributing the mass of censored observations within the grid where the unobserved events may potentially happen. A generalized modified weighted kappa is proposed for multivariate discrete survival times. We estimate the modified kappa coefficients nonparametrically through a multivariate survival function estimator. The asymptotic properties of the kappa estimators are established, and the performance of the estimators is examined through simulation studies of bivariate and trivariate survival times. We illustrate the application of the modified kappa coefficient in the presence of censored observations with data from a prostate cancer study.

7.

Background

The Clubfoot Assessment Protocol (CAP) was developed for follow-up of children treated for clubfoot. The objective of this study was to analyze the reliability and validity of the six items in the CAP domain Motion Quality when used by inexperienced assessors.

Findings

Four raters (two paediatric orthopaedic surgeons, two senior physiotherapists) used the CAP scores to analyze, on two different occasions, 11 videotapes containing standardized recordings of motion activity according to the domain Motion Quality. These results were compared to a criterion (two well-experienced CAP assessors) for validity and for checking for a learning effect. Weighted kappa statistics, exact percentage observer agreement (Po), percentage observer agreement including one level of difference (Po-1), and the number of scoring scales defined how reliability was to be interpreted. Inter- and intra-rater differences were calculated using medians and interquartile ranges (IQR) on the item level and means and limits of agreement on the domain level. Inter-rater reliability varied between fair and moderate (kappa) with a mean agreement of 48/88% (Po/Po-1). Intra-rater reliability varied between moderate and good with a mean agreement of 63/96%. The intra- and inter-rater differences in the present study were generally small on both the item (0.00) and domain level (-1.10). There was exact agreement of 51% and Po-1 of 91% for the six items against the criterion. No learning effect was found.

Conclusion

The CAP domain Motion Quality can be used by inexperienced assessors with sufficient reliability in daily clinical practice and showed acceptable accuracy compared to the criterion.

8.
The kappa index was developed by Cohen and others for measuring nominal-scale agreement between two raters. This statistic measures the distance from the null hypothesis of independent ratings by two observers. Here a modified kappa is introduced, which also takes into account the distance between the marginal distributions. This distance is interpreted as the so-called interobserver bias. Population analogues are defined for the modified kappa and a related conditional index. For these parameters, asymptotic confidence intervals and tests are derived. The procedures are illustrated by fictitious and real examples.

9.
In clinical research and in more general classification problems, a frequent concern is the reliability of a rating system. In the absence of a gold standard, agreement may be considered as an indication of reliability. When dealing with categorical data, the well-known kappa statistic is often used to measure agreement. The aim of this paper is to obtain a theoretical result about the asymptotic distribution of the kappa statistic with multiple items, multiple raters, multiple conditions, and multiple rating categories (more than two), based on recent work. The result settles a long-lasting quest for the asymptotic variance of the kappa statistic in this situation and allows for the construction of asymptotic confidence intervals. A recent application to clinical endoscopy and to the diagnosis of inflammatory bowel diseases (IBDs) is briefly presented to complement the theoretical perspective.

10.
Quantitative PCR diagnostic platforms are moving towards increased sample throughput, with instruments capable of carrying out thousands of reactions at once already in use. The need for a computational tool to reliably assist in the validation of the results is therefore compelling. In the present study, 328 residual clinical samples provided by Public Health England at Addenbrooke's Hospital (Cambridge, UK) were processed by a TaqMan Array Card assay, generating 15,744 reactions from 54 targets. The amplification data were analysed by the conventional cycle-threshold (CT) method and an improvement of the maxRatio (MR) algorithm developed to filter out reactions with irregular amplification profiles. The reactions were also independently validated by three raters, and a consensus was generated from their classification. The inter-rater agreement by Fleiss' kappa was 0.885; the agreement between either CT or MR and the raters gave Fleiss' kappa values of 0.884 and 0.902, respectively. Based on the consensus classification, the CT and MR methods achieved an assay accuracy of 0.979 and 0.987, respectively. These results suggested that the assumption-free MR algorithm was more reliable than the CT method, with clear advantages for diagnostic settings.
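Fleiss' kappa, used above for the three-rater consensus, generalizes Cohen's kappa to any fixed number of raters per subject; a minimal NumPy sketch with an invented ratings table (subjects in rows, categories in columns, cell values counting how many raters chose each category):

```python
import numpy as np

def fleiss_kappa(table):
    """Fleiss' kappa from a subjects x categories table of rating counts."""
    table = np.asarray(table, float)
    n_sub, r = table.shape[0], table[0].sum()  # r raters per subject
    # Per-subject agreement: proportion of concordant rater pairs
    p_i = (np.square(table).sum(axis=1) - r) / (r * (r - 1))
    p_bar = p_i.mean()
    # Chance agreement from the overall category proportions
    p_j = table.sum(axis=0) / (n_sub * r)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Three raters, four subjects, two categories; unanimous on every subject
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])  # kappa = 1
```

The sketch assumes every subject is rated by the same number of raters; handling an unequal number of raters per subject is exactly the extension discussed in the first abstract of this list.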

11.
Asymptotic and exact conditional approaches have often been used for testing agreement between two raters with binary outcomes. The exact conditional approach is guaranteed to respect the test size, as compared to the traditionally used asymptotic approach based on the standardized Cohen's kappa coefficient. An alternative to the conditional approach is an unconditional strategy, which relaxes the restriction of fixed marginal totals imposed by the conditional approach. Three exact unconditional hypothesis testing procedures are considered in this article: an approach based on maximization, an approach based on the conditional p-value and maximization, and an approach based on estimation and maximization. We compared these testing procedures, based on the commonly used Cohen's kappa, with regard to test size and power. We recommend the following two exact approaches for use in practice due to their power advantages: the approach based on the conditional p-value and maximization, and the approach based on estimation and maximization.

12.
Basu S, Banerjee M, Sen A, Biometrics, 2000, 56(2):577-582
Cohen's kappa coefficient is a widely popular measure for chance-corrected nominal scale agreement between two raters. This article describes Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo (MCMC) methodology. We consider the case of m > or = 2 independent samples of measured agreement, where in each sample a given subject is rated by two rating protocols on a binary scale. A major focus here is on testing the homogeneity of the kappa coefficient across the different samples. The existing frequentist tests for this case assume exchangeability of rating protocols, whereas our proposed Bayesian test does not make any such assumption. Extensive simulation is carried out to compare the performances of the Bayesian and the frequentist tests. The developed methodology is illustrated using data from a clinical trial in ophthalmology.

13.
Assessing the agreement between two or more raters is an important topic in medical practice. Existing techniques, which deal with categorical data, are based on contingency tables. This is often an obstacle in practice, as we have to wait a long time to collect the appropriate sample size of subjects to construct the contingency table. In this paper, we introduce a nonparametric sequential test for assessing agreement, which can be applied as data accrue and does not require a contingency table, facilitating a rapid assessment of agreement. The proposed test is based on the cumulative sum of the number of disagreements between the two raters and a suitable statistic representing the waiting time until the cumulative sum exceeds a predefined threshold. We treat the cases of testing two raters' agreement with respect to one or more characteristics and using two or more classification categories, the case where the two raters extremely disagree, and finally the case of testing more than two raters' agreement. The numerical investigation shows that the proposed test has excellent performance. Compared to the existing methods, the proposed method appears to require a significantly smaller sample size with equivalent power. Moreover, the proposed method is easily generalizable and brings the problem of assessing the agreement between two or more raters and one or more characteristics under a unified framework, thus providing an easy-to-use tool for medical practitioners.
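The core bookkeeping of the sequential idea, a cumulative sum of disagreements checked against a threshold as data accrue, can be sketched as follows (an illustration only; the paper's actual test statistic and its null distribution are not reproduced here, and the ratings are invented):

```python
def disagreement_waiting_time(ratings_a, ratings_b, threshold):
    """Index of the first subject at which the cumulative number of
    disagreements between two raters exceeds a predefined threshold.
    Returns None if the threshold is never exceeded."""
    cum = 0
    for n, (a, b) in enumerate(zip(ratings_a, ratings_b), start=1):
        cum += int(a != b)   # one more disagreement observed
        if cum > threshold:
            return n         # stop: agreement looks too poor
    return None

# Raters disagree on subjects 2 and 3, so a threshold of 1 is crossed at n=3
stop_at = disagreement_waiting_time("aabba", "ababa", threshold=1)
```

The appeal of this scheme is exactly what the abstract claims: the monitoring statistic is updated subject by subject, so no full contingency table is needed before a decision can be reached.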

14.
The horse early conception factor (ECF) test is designed for qualitative determination of the ECF glycoprotein in the mare that has conceived. The objectives of this study were to determine the performance of the horse ECF test for the detection of the non-pregnant mare, and to determine the agreement among subjects or "readers" regarding the interpretation of the test. Blood samples from 60 mares were collected on Days 0, 5, 8, 11 and 18 following ovulation. Pregnancy status diagnosed with the ECF test was compared (2 × 2 table) to pregnancy status diagnosed by palpation per rectum and ultrasound examination on Day 18 following ovulation. Three readers interpreted the ECF test outcome independently. Two laboratories independently interpreted the ECF test outcome from the same serum samples. Agreement was tested by kappa coefficient. Sensitivity, specificity, positive predictive value, negative predictive value, and accuracy ranged from 0.74 to 0.84, 0.14 to 0.33, 0.62 to 0.66, 0.33 to 0.44 and 0.57 to 0.60, respectively. Agreement between readers was substantial (0.60…).

15.
An agreement index among more than two raters who employ ordinal classification is proposed here as an extension of the two-rater agreement index outlined by Jolayemi (Biom. J. 32 (1990), 87-93). The method of application is outlined using a clinical diagnosis involving seven pathologists.

16.
The problem of diagnostic variability between certified cytotechnologists was studied. Three cytology laboratories submitted a total of 28 cervical smears that had a discordance between the cytologic and/or histologic ratings. Eight independent cytotechnologists provided blind readings on each slide, expressed as "absence of cervical intraepithelial neoplasia (CIN)" to "CIN III." The median rating was absence of CIN or CIN I for 8 slides, CIN II for 5 and CIN III for 15. With a kappa value greater than 0 reflecting agreement beyond chance expectation and a value of 0.40 indicating fair agreement, the kappa value for the 8 × 28 ratings was 0.36 (P = .0001), with a 90% confidence interval (CI) between 0.34 and 0.37. The kappa value was 0.14 (P = .10), with a 90% CI between 0.10 and 0.18, on a subsample of nine smears with two or more positive cytology diagnoses but a negative histology. Sixteen of the 28 slides represented cases of histologically proven cancer. Treating cytologic diagnoses of CIN II and CIN III as positive, the sensitivity of the cytologists with reference to histology varied between 71% and 86%, while the specificity ranged from 18% to 62%. The positive predictive value was 1/2.5 to 1/1 and the negative predictive value was 1/6 to 1/1. The predictive power (true positives/false positives) ranged from 1.0 to 2.2. The cytodiagnosis of these cervical smears from cases of discordance thus exhibited limited reliability. Standardization of the relevant cytologic knowledge and its routine application is needed to improve the level of performance.
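The screening metrics reported in abstracts like this one all come straight from a 2 × 2 table of test result versus reference standard; a minimal sketch with hypothetical counts (not this study's data):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard screening metrics from a 2 x 2 diagnostic table."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among the diseased
        "specificity": tn / (tn + fp),  # true negatives among the healthy
        "ppv": tp / (tp + fp),          # precision of a positive call
        "npv": tn / (tn + fn),          # reassurance value of a negative call
    }

# Hypothetical counts for illustration only
m = diagnostic_metrics(tp=12, fp=5, fn=4, tn=7)
```

Note how sensitivity and specificity describe the test conditional on true disease status, while the predictive values run the other way and therefore shift with disease prevalence, which is why the abstract reports both families of numbers.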

17.
In clinical studies, it is often of interest to see the diagnostic agreement among clinicians on certain symptoms. Previous work has focused on the agreement between two clinicians under two different conditions or the agreement among multiple clinicians under one condition. Few have discussed the agreement study with a design where multiple clinicians examine the same group of patients under two different conditions. In this paper, we use the intraclass kappa statistic for assessing nominal scale agreement with such a design. We derive an explicit variance formula for the difference of correlated kappa statistics and conduct hypothesis testing for the equality of kappa statistics. Simulation studies show that the method performs well with realistic sample sizes and may be superior to a method that did not take into account the measurement dependence structure. The practical utility of the method is illustrated on data from an eosinophilic esophagitis (EoE) study.

18.
Endocrine Practice, 2021, 27(6):567-570
Objective

To examine the performance and agreement of 5 modalities for testing sensory neuropathy against a neurothesiometer among Hispanic patients with type 1 diabetes (T1D) in an outpatient setting.

Methods

A cross-sectional study was conducted at a tertiary reference center in Mexico City. Sensitivity, specificity, predictive values, and likelihood ratios were calculated using a VibraTip device, a 128 Hz tuning fork, the Semmes-Weinstein 5.07/10 g monofilament test, the Ipswich touch test (IpTT), and the pinprick test (PPT). The vibration perception threshold (VPT) obtained using a neurothesiometer was used as the standard. Agreement between tests was calculated using kappa coefficients.

Results

Our study included 78 patients (156 examinations), of whom 56.4% were females. The mean age was 38.2 ± 13.0 years, and the mean body mass index was 24.6 ± 4.8 kg/m2. The best sensitivity was found for the IpTT and VibraTip (89.7% and 79.3%, respectively), while the PPT and IpTT had the highest positive predictive values (94.4% and 92.9%, respectively). The highest kappa coefficients were obtained for the IpTT vs neurothesiometer (κ = 0.893, P < .001), followed by VibraTip vs neurothesiometer (κ = 0.782, P < .001). The VibraTip vs IpTT comparison also showed substantial agreement (κ = 0.713, P < .001).

Conclusion

Our findings demonstrated that the IpTT had the best diagnostic performance and agreement compared with the standard in this cohort of Hispanic patients with T1D. The IpTT is a useful, simple test for diabetic neuropathy screening. These findings support its inclusion in future guidelines for diabetic foot examination.

19.
We analysed the peer review of grant proposals under Marie Curie Actions, a major EU research funding instrument, which involves two steps: an independent assessment (Individual Evaluation Report, IER) performed remotely by 3 raters, and a consensus opinion reached during a meeting by the same raters (Consensus Report, CR). For 24,897 proposals evaluated from 2007 to 2013, the association between average IER and CR scores was very high across different panels, grant calls and years. The median average deviation (AD) index, used as a measure of inter-rater agreement, was 5.4 points on a 0-100 scale (interquartile range 3.4-8.3) overall, demonstrating good general agreement among raters. For proposals where one rater disagreed with the other two raters (n=1424; 5.7%), or where all 3 raters disagreed (n=2075; 8.3%), the average IER and CR scores were still highly associated. Disagreement was more frequent for proposals from Economics/Social Sciences and Humanities panels. Greater disagreement was observed for proposals with lower average IER scores. CR scores for proposals with initial disagreement were also significantly lower. Proposals with a large absolute difference between the average IER and CR scores (≥10 points; n=368, 1.5%) generally had lower CR scores. An inter-correlation matrix of individual raters' scores of evaluation criteria of proposals indicated that these scores were, in general, a reflection of raters' overall scores. Our analysis demonstrated a good internal consistency and general high agreement among raters. Consensus meetings appear to be relevant for particular panels and subsets of proposals with large differences among raters' scores.
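The median AD index reported above has a simple form; the sketch below assumes the common "average absolute deviation from the raters' mean" definition per proposal, with the median taken across proposals (the study may use the median-based AD variant, and the scores here are invented):

```python
import numpy as np

def ad_index(scores):
    """Average absolute deviation of raters' scores from their mean."""
    scores = np.asarray(scores, float)
    return float(np.mean(np.abs(scores - scores.mean())))

# Hypothetical proposals, each scored by 3 raters on a 0-100 scale
proposals = [[80, 85, 90], [70, 70, 76], [60, 90, 65]]
median_ad = float(np.median([ad_index(p) for p in proposals]))
```

Lower values indicate tighter agreement: a median AD of 5.4 points on a 0-100 scale, as reported in the study, means the typical rater sits about five points from the panel's average score.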

20.

Background

We consider the problem of assessing inter-rater agreement when there are missing data and a large number of raters. Previous studies have shown only ‘moderate’ agreement between pathologists in grading breast cancer tumour specimens. We analyse a large but incomplete data-set consisting of 24177 grades, on a discrete 1–3 scale, provided by 732 pathologists for 52 samples.

Methodology/Principal Findings

We review existing methods for analysing inter-rater agreement for multiple raters and demonstrate two further methods. Firstly, we examine a simple non-chance-corrected agreement score based on the observed proportion of agreements with the consensus for each sample, which makes no allowance for missing data. Secondly, treating grades as lying on a continuous scale representing tumour severity, we use a Bayesian latent trait method to model cumulative probabilities of assigning grade values as functions of the severity and clarity of the tumour and of rater-specific parameters representing boundaries between grades 1–2 and 2–3. We simulate from the fitted model to estimate, for each rater, the probability of agreement with the majority. Both methods suggest that there are differences between raters in terms of rating behaviour, most often caused by consistent over- or under-estimation of the grade boundaries, and also considerable variability in the distribution of grades assigned to many individual samples. The Bayesian model addresses the tendency of the agreement score to be biased upwards for raters who, by chance, see a relatively ‘easy’ set of samples.

Conclusions/Significance

Latent trait models can be adapted to provide novel information about the nature of inter-rater agreement when the number of raters is large and there are missing data. In this large study there is substantial variability between pathologists and uncertainty in the identity of the ‘true’ grade of many of the breast cancer tumours, a fact often ignored in clinical studies.
