Similar articles (20 results found)
2.
We analysed the peer review of grant proposals under Marie Curie Actions, a major EU research funding instrument, which involves two steps: an independent assessment (Individual Evaluation Report, IER) performed remotely by 3 raters, and a consensus opinion reached during a meeting by the same raters (Consensus Report, CR). For 24,897 proposals evaluated from 2007 to 2013, the association between average IER and CR scores was very high across different panels, grant calls and years. The median average deviation (AD) index, used as a measure of inter-rater agreement, was 5.4 points on a 0-100 scale (interquartile range 3.4-8.3), demonstrating good overall agreement among raters. For proposals where one rater disagreed with the other two (n=1424; 5.7%), or where all 3 raters disagreed (n=2075; 8.3%), the average IER and CR scores were still highly associated. Disagreement was more frequent for proposals from the Economics/Social Sciences and Humanities panels, and greater disagreement was observed for proposals with lower average IER scores. CR scores for proposals with initial disagreement were also significantly lower. Proposals with a large absolute difference between the average IER and CR scores (≥10 points; n=368, 1.5%) generally had lower CR scores. An inter-correlation matrix of individual raters' scores on the evaluation criteria indicated that these scores were, in general, a reflection of the raters' overall scores. Our analysis demonstrated good internal consistency and generally high agreement among raters. Consensus meetings appear to be relevant mainly for particular panels and for subsets of proposals with large differences among raters' scores.
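For orientation, the AD index used above is simply the mean absolute deviation of the raters' scores from their own mean, so 0 indicates perfect agreement. A minimal sketch (the function name is ours, not the paper's):

```python
def ad_index(scores):
    """Average deviation (AD) index: mean absolute deviation of the
    raters' scores from their mean; 0 indicates perfect agreement."""
    m = sum(scores) / len(scores)
    return sum(abs(s - m) for s in scores) / len(scores)

# Three raters scoring one proposal on the 0-100 scale
ad_index([80.0, 85.0, 82.5])  # small value: close agreement
```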

3.
The kappa index was developed by Cohen and others for measuring nominal-scale agreement between two raters. This statistic measures the distance from the null hypothesis of independent ratings by two observers. Here a modified kappa is introduced which also takes into account the distance between the marginal distributions; this distance is interpreted as the so-called interobserver bias. Population analogues are defined for the modified kappa and a related conditional index. For these parameters, asymptotic confidence intervals and tests are derived. The procedures are illustrated by fictitious and real examples.
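Cohen's unmodified kappa, the starting point for this modification, can be sketched directly from a square contingency table (plain Python, our own function name):

```python
def cohens_kappa(table):
    """Cohen's kappa from a square contingency table of two raters'
    nominal ratings: (p_o - p_e) / (1 - p_e), where chance agreement
    p_e is computed from the observed marginal distributions."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n                        # observed agreement
    rows = [sum(row) / n for row in table]                              # rater A marginals
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]   # rater B marginals
    p_e = sum(r * c for r, c in zip(rows, cols))                        # chance agreement
    return (p_o - p_e) / (1 - p_e)

cohens_kappa([[20, 5], [10, 15]])  # -> 0.4
```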

4.
Assessing the agreement between two or more raters is an important topic in medical practice. Existing techniques for categorical data are based on contingency tables, which is often an obstacle in practice: one has to wait a long time to collect a sample of subjects large enough to construct the table. In this paper, we introduce a nonparametric sequential test for assessing agreement which can be applied as data accrue and does not require a contingency table, facilitating a rapid assessment of agreement. The proposed test is based on the cumulative count of disagreements between the two raters and a suitable statistic representing the waiting time until this cumulative sum exceeds a predefined threshold. We treat the cases of testing two raters' agreement with respect to one or more characteristics and with two or more classification categories, the case where the two raters extremely disagree, and finally the case of testing more than two raters' agreement. A numerical investigation shows that the proposed test performs excellently; compared to existing methods it appears to require a significantly smaller sample size for equivalent power. Moreover, the proposed method is easily generalizable and brings the problem of assessing agreement between two or more raters on one or more characteristics under a unified framework, providing an easy-to-use tool for medical practitioners.
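The core waiting-time statistic can be sketched as follows; the choice of threshold and the null distribution of the waiting time are the paper's contribution and are not reproduced here (function name and interface are our assumptions):

```python
def waiting_time(pairs, h):
    """Waiting time (number of subjects rated) until the cumulative
    count of disagreements between two raters first exceeds a
    threshold h, computed as ratings accrue (no contingency table)."""
    disagreements = 0
    for t, (a, b) in enumerate(pairs, start=1):
        disagreements += (a != b)
        if disagreements > h:
            return t          # stop early: evidence against agreement
    return None               # threshold never exceeded in observed data

# Ratings arrive one subject at a time; a long waiting time
# (or never exceeding h) is consistent with good agreement.
waiting_time([(1, 1), (1, 2), (2, 2), (2, 1), (1, 2)], 2)  # -> 5
```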

5.
S T Gross, Biometrics, 1986, 42(4): 883-893
Published results on the use of the kappa coefficient of agreement have traditionally been concerned with situations where a large number of subjects is classified by a small group of raters. The coefficient is then used to assess the degree of agreement among the raters through hypothesis testing or confidence intervals. A modified kappa coefficient of agreement for multiple categories is proposed and a parameter-free distribution for testing null agreement is provided, for use when the number of raters is large relative to the number of categories and subjects. The large-sample distribution of kappa is shown to be normal in the nonnull case, and confidence intervals for kappa are provided. The results are extended to allow for an unequal number of raters per subject.
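The classical multi-rater benchmark in this setting is Fleiss' kappa (note: not the modified coefficient proposed in this paper). A sketch assuming a fixed number m of raters per subject:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multiple raters: counts[i][j] is the number
    of raters assigning subject i to category j, with the same total
    m raters for every subject."""
    N, k = len(counts), len(counts[0])
    m = sum(counts[0])
    # Overall category proportions (chance agreement)
    p_j = [sum(row[j] for row in counts) / (N * m) for j in range(k)]
    # Per-subject observed pairwise agreement
    P_i = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```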

6.
We propose a new procedure for constructing inferences about a measure of interobserver agreement in studies involving a binary outcome and multiple raters. The proposed procedure, based on a chi-square goodness-of-fit test as applied to the correlated binomial model (Bahadur, 1961, in Studies in Item Analysis and Prediction, 158-176), is an extension of the goodness-of-fit procedure developed by Donner and Eliasziw (1992, Statistics in Medicine 11, 1511-1519) for the case of two raters. The new procedure is shown to provide confidence-interval coverage levels that are close to nominal over a wide range of parameter combinations. The procedure also provides a sample-size formula that may be used to determine the required number of subjects and raters for such studies.

7.

Background

We consider the problem of assessing inter-rater agreement when there are missing data and a large number of raters. Previous studies have shown only ‘moderate’ agreement between pathologists in grading breast cancer tumour specimens. We analyse a large but incomplete data-set consisting of 24,177 grades, on a discrete 1–3 scale, provided by 732 pathologists for 52 samples.

Methodology/Principal Findings

We review existing methods for analysing inter-rater agreement for multiple raters and demonstrate two further methods. Firstly, we examine a simple non-chance-corrected agreement score based on the observed proportion of agreements with the consensus for each sample, which makes no allowance for missing data. Secondly, treating grades as lying on a continuous scale representing tumour severity, we use a Bayesian latent trait method to model cumulative probabilities of assigning grade values as functions of the severity and clarity of the tumour and of rater-specific parameters representing boundaries between grades 1–2 and 2–3. We simulate from the fitted model to estimate, for each rater, the probability of agreement with the majority. Both methods suggest that there are differences between raters in terms of rating behaviour, most often caused by consistent over- or under-estimation of the grade boundaries, and also considerable variability in the distribution of grades assigned to many individual samples. The Bayesian model addresses the tendency of the agreement score to be biased upwards for raters who, by chance, see a relatively ‘easy’ set of samples.

Conclusions/Significance

Latent trait models can be adapted to provide novel information about the nature of inter-rater agreement when the number of raters is large and there are missing data. In this large study there is substantial variability between pathologists and uncertainty in the identity of the ‘true’ grade of many of the breast cancer tumours, a fact often ignored in clinical studies.

8.
Weighted least-squares approach for comparing correlated kappa
Barnhart HX, Williamson JM, Biometrics, 2002, 58(4): 1012-1019
In the medical sciences, studies are often designed to assess the agreement between different raters or different instruments. The kappa coefficient is a popular index of agreement for binary and categorical ratings. Here we focus on testing the equality of two dependent kappa coefficients. We use the weighted least-squares (WLS) approach of Koch et al. (1977, Biometrics 33, 133-158) to take into account the correlation between the estimated kappa statistics. We demonstrate how SAS PROC CATMOD can be used to test the equality of dependent Cohen's kappa coefficients and dependent intraclass kappa coefficients with nominal categorical ratings, and the equality of dependent Cohen's kappa and dependent weighted kappa with ordinal ratings. The major advantage of the WLS approach is that it gives the data analyst a way of testing dependent kappas with popular SAS software. The approach can handle any number of categories. Analyses of three biomedical studies are used for illustration.
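The weighted kappa for ordinal ratings referred to above can be sketched with linear disagreement weights (quadratic weights are a common alternative); the SAS testing machinery itself is not reproduced:

```python
def weighted_kappa(table):
    """Cohen's weighted kappa for ordinal ratings from a square k x k
    contingency table, using linear disagreement weights
    w_ij = |i - j| / (k - 1):  kappa_w = 1 - sum(w*p_obs)/sum(w*p_chance)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(row) / n for row in table]
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)        # linear disagreement weight
            num += w * table[i][j] / n      # observed weighted disagreement
            den += w * rows[i] * cols[j]    # chance-expected weighted disagreement
    return 1 - num / den
```

For two categories, linear weighted kappa reduces to ordinary Cohen's kappa.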

9.
The degree of agreement between two raters is re-examined. An alternative statistic which uses the chi-square distribution is proposed. We conclude that this statistic is better than the usual kappa statistic when the classification variable is at least ordinal.

10.

Background

The Clubfoot Assessment Protocol (CAP) was developed for follow-up of children treated for clubfoot. The objective of this study was to analyze the reliability and validity of the six items of the CAP domain Motion Quality when used by inexperienced assessors.

Findings

Four raters (two paediatric orthopaedic surgeons, two senior physiotherapists) used the CAP scores on two different occasions to analyze 11 videotapes containing standardized recordings of motion activity according to the domain Motion Quality. These results were compared to a criterion (two well-experienced CAP assessors) to assess validity and to check for a learning effect. Weighted kappa statistics, exact percentage observer agreement (Po), percentage observer agreement including one level of difference (Po-1) and the number of scoring-scale steps defined how reliability was to be interpreted. Inter- and intra-rater differences were calculated using the median and interquartile range (IQR) at item level, and the mean and limits of agreement at domain level. Inter-rater reliability varied between fair and moderate (kappa), with a mean agreement of 48/88% (Po/Po-1). Intra-rater reliability varied between moderate and good, with a mean agreement of 63/96%. The intra- and inter-rater differences were generally small at both item (0.00) and domain level (-1.10). Against the criterion there was exact agreement of 51% and Po-1 of 91% across the six items. No learning effect was found.

Conclusion

The CAP domain Motion Quality can be used by inexperienced assessors with sufficient reliability in daily clinical practice and showed acceptable accuracy compared to the criterion.

11.
Greater knowledge about the reliability of, and agreement between, methods for assessing body fat (BF) is of interest for assessment, interpretation, and comparison purposes. We aimed to examine intra- and inter-rater reliability, interday variability, and degree of agreement for BF using air-displacement plethysmography (Bod-Pod), dual-energy X-ray absorptiometry (DXA), bioelectrical impedance analysis (BIA), and skinfold measurements in European adolescents. Fifty-four adolescents (25 females) from Zaragoza and 30 (14 females) from Stockholm, aged 13-17 years, participated in this study. Two trained raters in each center assessed BF with Bod-Pod, DXA, BIA, and anthropometry (DXA in Zaragoza only). Intermethod agreement and reliability were studied using a 4-way ANOVA for the same rater on the first day and two additional measurements on a second day, one per rater. Technical error of measurement (TEM) and percentage coefficient of reliability (%R) were also reported. No significant intra-rater, inter-rater, or interday effect was observed for %BF for any method in either city. In Zaragoza, %BF was significantly different when measured by Bod-Pod and BIA in comparison with anthropometry and DXA (all P < 0.001). The same result was observed in Stockholm (P < 0.001), except that DXA was not measured there. Bod-Pod, DXA, BIA, and anthropometry are reliable for repeated %BF assessment within the same day by the same or different raters, or on consecutive days by the same rater. Bod-Pod showed close agreement with BIA, as did DXA with anthropometry; however, Bod-Pod and BIA gave higher %BF values than anthropometry and DXA.
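The technical error of measurement (TEM) mentioned above has a simple closed form for duplicate measurements; a sketch under that two-measurement assumption (the %R coefficient relating TEM to inter-subject variance is omitted):

```python
import math

def tem(first, second):
    """Technical error of measurement for duplicate measurements:
    sqrt(sum(d_i^2) / (2n)), where d_i is the difference between the
    first and second measurement on subject i."""
    n = len(first)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)) / (2 * n))

# Two repeated %BF readings on three subjects
tem([10.0, 12.0, 14.0], [11.0, 11.0, 14.0])  # same units as the measurement
```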

12.
Clinical studies are often concerned with assessing whether different raters or methods produce similar values when measuring a quantitative variable. Use of the concordance correlation coefficient as a measure of reproducibility has gained popularity in practice since its introduction by Lin (1989, Biometrics 45, 255-268). Lin's method is applicable to studies evaluating two raters/two methods without replications. Chinchilli et al. (1996, Biometrics 52, 341-353) extended Lin's approach to repeated measures designs by using a weighted concordance correlation coefficient. However, the existing methods cannot easily accommodate covariate adjustment, especially when one needs to model agreement. In this article, we propose a generalized estimating equations (GEE) approach to model the concordance correlation coefficient via three sets of estimating equations. The proposed approach is flexible in that (1) it can accommodate more than two correlated readings and test for the equality of dependent concordance correlation estimates; (2) it can incorporate covariates predictive of the marginal distribution; (3) it can be used to identify covariates predictive of concordance correlation; and (4) it requires minimal distributional assumptions. A simulation study is conducted to evaluate the asymptotic properties of the proposed approach. The method is illustrated with data from two biomedical studies.
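Lin's coefficient for the basic two-rater, no-replication case can be computed directly with 1/n moment estimators; the GEE extension is beyond this sketch:

```python
def concordance_cc(x, y):
    """Lin's concordance correlation coefficient:
    2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2).
    Equals 1 only when the two raters' readings are identical."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) / n
    sy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sx + sy + (mx - my) ** 2)
```

Unlike the Pearson correlation, a constant shift between raters lowers the CCC: perfectly correlated but offset readings are penalized through the squared mean difference in the denominator.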

13.
Involuntary muscle contractions (spasms) are common after human spinal cord injury (SCI). Our aim was to compare how well two raters independently identified and classified different types of spasms in the same electromyographic (EMG) records using predefined rules. Muscle spasms were identified by the presence, timing and pattern of EMG recorded from paralyzed leg muscles of four subjects with chronic cervical SCI. Spasms were classified as one of five types: unit, tonic, clonus, myoclonus, mixed. In 48 h of data, both raters marked the same spasms most of the time. More variability in the total spasm count arose from differences between muscles (84%; within subjects) than differences between subjects (6.5%) or raters (2.6%). Agreement on spasm classification was high (89%). Differences in spasm count and classification largely occurred when EMG was marked as a single spasm by one rater but split into multiple spasms by the other rater. EMG provides objective measurements of spasm number and type, in contrast to the self-reported spasm counts that are often used to make clinical decisions about spasm management. Data on inter-rater agreement and discrepancies in muscle spasm analysis can drive both the design and evaluation of software to automate spasm identification and classification.

14.
Barlow W, Biometrics, 1996, 52(2): 695-702
The kappa coefficient measures chance-corrected agreement between two observers in the dichotomous classification of subjects. The marginal probability of classification by each rater may depend on one or more confounding variables, however. Failure to account for these confounders may lead to inflated estimates of agreement. A multinomial model is used that assumes both raters have the same marginal probability of classification, but this probability may depend on one or more covariates. The model may be fit using software for conditional logistic regression. Additionally, likelihood-based confidence intervals for the parameter representing agreement may be computed. A simple example is discussed to illustrate model-fitting and application of the technique.

15.
In clinical research, and in more general classification problems, a frequent concern is the reliability of a rating system. In the absence of a gold standard, agreement may be considered an indication of reliability. When dealing with categorical data, the well-known kappa statistic is often used to measure agreement. The aim of this paper is to obtain a theoretical result on the asymptotic distribution of the kappa statistic with multiple items, multiple raters, multiple conditions, and multiple rating categories (more than two), based on recent work. The result settles a long-lasting quest for the asymptotic variance of the kappa statistic in this situation and allows for the construction of asymptotic confidence intervals. A recent application to clinical endoscopy and to the diagnosis of inflammatory bowel diseases (IBDs) is briefly presented to complement the theoretical perspective.

16.
Basu S, Banerjee M, Sen A, Biometrics, 2000, 56(2): 577-582
Cohen's kappa coefficient is a widely popular measure of chance-corrected nominal-scale agreement between two raters. This article describes a Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo (MCMC) methodology. We consider the case of m ≥ 2 independent samples of measured agreement, where in each sample a given subject is rated by two rating protocols on a binary scale. A major focus here is on testing the homogeneity of the kappa coefficient across the different samples. The existing frequentist tests for this case assume exchangeability of rating protocols, whereas our proposed Bayesian test makes no such assumption. Extensive simulation is carried out to compare the performance of the Bayesian and frequentist tests. The developed methodology is illustrated using data from a clinical trial in ophthalmology.

17.
Agreement between raters for binary outcome data is typically assessed using the kappa coefficient. There has been considerable recent work extending logistic regression to provide summary estimates of interrater agreement adjusted for covariates predictive of the marginal probability of classification by each rater. We propose an estimating equations approach which can also be used to identify covariates predictive of kappa. Models may include an arbitrary and variable number of raters per subject and yet do not require any stringent parametric assumptions. Examples used to illustrate this procedure include an investigation of factors affecting agreement between primary and proxy respondents from a case-control study and a study of the effects of gender and zygosity on twin concordance for smoking history.

18.
Lo LJ, Wong FH, Mardini S, Chen YR, Noordhoff MS, Plastic and Reconstructive Surgery, 2002, 110(3): 733-8; discussion 739-41
Reconstruction of bilateral cleft lip nose deformity is difficult and the outcome is inconsistent. This study was conducted to evaluate the gross outcome and the difference in the assessment of nasal appearance as judged by two groups of raters, cleft surgeons and laypersons. Sixty-four patients with bilateral cleft lip were selected for review. The patients' ages ranged from 5 to 30 years. All patients had undergone primary cleft lip repair and secondary nasal reconstruction, and had been followed for at least 6 months. One image for each patient, which included a digitized frontal, lateral, and worm's-eye view, was projected for evaluation by the raters. The raters included five cleft surgeons and five laypersons. A rating scheme was used in which a score of 3 was given for a good, close to normal nasal appearance, 2 for an average result that needed minor revision, and 1 for a poor result that needed major reconstruction. The scores were averaged for each patient in each group and for each group as a whole. The final outcome was judged as good, fair, or poor on the basis of the mean score for each patient. Statistical analysis was performed. The mean score for all patients was 2.08 as assessed by the laypersons and 2.18 as assessed by the cleft surgeon group. There was no statistically significant difference between the two groups. Comparisons on rating scores among different raters revealed a fair agreement on the ratings within each of the two groups. The results were found to be good in 29.7 percent, fair in 64.1 percent, and poor in 6.3 percent of patients when evaluated by the surgeons. When rated by the laypersons, the nasal appearance was found to be good in 26.6 percent, fair in 60.9 percent, and poor in 12.5 percent of patients. This difference in distribution between the two groups was not statistically significant. 
When comparing the results given by the two groups of assessors, there was agreement on nasal appearance in 65.6 percent of patients and a difference in grading in the rest. For the patients graded differently, the surgeons rated one grade higher in 63.6 percent and one grade lower in 36.4 percent of cases; no evaluation differed by two grades. This study shows that the surgical outcome of bilateral cleft lip nose deformity repair, at the authors' institution, is less than optimal. When assessing bilateral cleft lip nose appearance, the judgment of results by cleft surgeons was similar to that of the laypersons. However, ratings differed within each of the two groups, supporting the importance of clearly assessing patient/parent expectations and defining realistic surgical goals.

19.
Two measures of reliability for nominal scales are compared: coefficient kappa and kn, a modification suggested for agreement matrices with free marginals. It is illustrated that the evaluation of two raters' agreement may reach contradictory conclusions depending on whether kappa or kn is used. On the basis of the underlying chance models, it is concluded that kappa and kn cannot be interpreted in the same manner; specifically, when raters disagree, the two measures can be widely discrepant.
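The discrepancy is easy to reproduce: with skewed marginals, Cohen's kappa (chance agreement from observed marginals) and a free-marginal index kn = (p_o - 1/k)/(1 - 1/k) can even disagree in sign. A sketch, assuming the Brennan-Prediger form for kn:

```python
def kappa_vs_kn(table):
    """Contrast Cohen's kappa (chance from the observed marginals)
    with the free-marginal index kn = (p_o - 1/k) / (1 - 1/k),
    which assumes uniform chance over the k categories."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    rows = [sum(row) / n for row in table]
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(rows, cols))
    kappa = (p_o - p_e) / (1 - p_e)
    kn = (p_o - 1 / k) / (1 - 1 / k)
    return kappa, kn

# Highly skewed marginals: raters agree on 90% of subjects,
# yet kappa is slightly negative while kn is 0.8.
kappa_vs_kn([[90, 5], [5, 0]])
```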

20.
Objectives: To determine the degree of interobserver variation in the assessment of conventional cervical smear adequacy as defined by The Bethesda System (TBS) 2001, and to determine the effect of using reference images of known squamous cellularity when performing squamous adequacy assessments.
Methods: Experimental pre-test/post-test design using 70 conventionally prepared cervical smears. Sample smears containing scant squamous cellularity were independently rated on two occasions by six cytotechnologists: Time 1 without the use of reference images, and Time 2 aided by cellularity reference images. The κ statistic was used to compare rater agreement.
Results: The level of agreement increased from an average κ of 0.26 (SD 0.10) at Time 1 to an average κ of 0.40 (SD 0.15) at Time 2. The difference in mean κ values between the two assessments was statistically significant (t = 3.71; P = 0.002). Unanimous agreement among the raters was observed for 15 samples (21.4%) at Time 1 (only one of which was classified as unsatisfactory) and 21 samples (30.0%) at Time 2 (12 of which were classified as unsatisfactory).
Conclusion: Interobserver agreement increased after cellularity reference images were implemented. Using TBS 2001 squamous adequacy criteria and images of known squamous cellularity as references resulted in a decreased number of smears reported as satisfactory.
