
1.

Evaluating the effects of rater and subject factors on measures of association

Kerrie P. Nelson Aya A. Mitani Don Edwards 《Biometrical journal. Biometrische Zeitschrift》2018,60(3):639-656

Large‐scale agreement studies are becoming increasingly common in medical settings to gain better insight into discrepancies often observed between experts' classifications. Ordered categorical scales are routinely used to classify subjects' disease and health conditions. Summary measures such as Cohen's weighted kappa are popular approaches for reporting levels of association for pairs of raters' ordinal classifications. However, in large‐scale studies with many raters, assessing levels of association can be challenging due to dependencies between many raters each grading the same sample of subjects' results and the ordinal nature of the ratings. Further complexities arise when the focus of a study is to examine the impact of rater and subject characteristics on levels of association. In this paper, we describe a flexible approach based upon the class of generalized linear mixed models to assess the influence of rater and subject factors on association between many raters' ordinal classifications. We propose novel model‐based measures for large‐scale studies to provide simple summaries of association similar to Cohen's weighted kappa while avoiding prevalence and marginal distribution issues that Cohen's weighted kappa is susceptible to. The proposed summary measures can be used to compare association between subgroups of subjects or raters. We demonstrate the use of hypothesis tests to formally determine if rater and subject factors have a significant influence on association, and describe approaches for evaluating the goodness‐of‐fit of the proposed model. The performance of the proposed approach is explored through extensive simulation studies and is applied to a recent large‐scale breast cancer screening study.
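For orientation, the pairwise summary that this abstract uses as its point of comparison, Cohen's weighted kappa, can be sketched as follows. This is a minimal illustration and not the model-based measure the authors propose; the function name, the linear disagreement weights, and the integer coding of categories (0 to n_categories - 1) are assumptions made for the example.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Cohen's weighted kappa for two raters, using linear disagreement weights.

    Ratings are assumed to be integers in 0..n_categories-1 (an illustrative coding).
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both raters must rate the same non-empty set of subjects")
    if n_categories < 2:
        raise ValueError("need at least two categories")

    joint = Counter(zip(ratings_a, ratings_b))  # observed cell counts
    marg_a = Counter(ratings_a)                 # rater A marginal counts
    marg_b = Counter(ratings_b)                 # rater B marginal counts

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected under independence
    for i in range(n_categories):
        for j in range(n_categories):
            weight = abs(i - j) / (n_categories - 1)
            observed += weight * joint[(i, j)] / n
            expected += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed / expected
```

Because the expected term is built from the two raters' marginal distributions, the value shifts with prevalence, which is the sensitivity the abstract says its model-based measures are designed to avoid.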

2.

Weighted least-squares approach for comparing correlated kappa (total citations: 3; self-citations: 0; citations by others: 3)

Barnhart HX Williamson JM 《Biometrics》2002,58(4):1012-1019

In the medical sciences, studies are often designed to assess the agreement between different raters or different instruments. The kappa coefficient is a popular index of agreement for binary and categorical ratings. Here we focus on testing for the equality of two dependent kappa coefficients. We use the weighted least-squares (WLS) approach of Koch et al. (1977, Biometrics 33, 133-158) to take into account the correlation between the estimated kappa statistics. We demonstrate how the SAS PROC CATMOD can be used to test for the equality of dependent Cohen's kappa coefficients and dependent intraclass kappa coefficients with nominal categorical ratings. We also test for the equality of dependent Cohen's kappa and dependent weighted kappa with ordinal ratings. The major advantage of the WLS approach is that it allows the data analyst a way of testing dependent kappa with popular SAS software. The WLS approach can handle any number of categories. Analyses of three biomedical studies are used for illustration.

3.

A sequential test for assessing observed agreement between raters


Sotiris Bersimis Athanasios Sachlas Subha Chakraborti 《Biometrical journal. Biometrische Zeitschrift》2018,60(1):128-145

Assessing the agreement between two or more raters is an important topic in medical practice. Existing techniques, which deal with categorical data, are based on contingency tables. This is often an obstacle in practice as we have to wait for a long time to collect the appropriate sample size of subjects to construct the contingency table. In this paper, we introduce a nonparametric sequential test for assessing agreement, which can be applied as data accrue and does not require a contingency table, thus facilitating a rapid assessment of the agreement. The proposed test is based on the cumulative sum of the number of disagreements between the two raters and a suitable statistic representing the waiting time until the cumulative sum exceeds a predefined threshold. We treat the cases of testing two raters' agreement with respect to one or more characteristics and using two or more classification categories, the case where the two raters disagree extremely, and finally the case of testing more than two raters' agreement. The numerical investigation shows that the proposed test has excellent performance. Compared to the existing methods, the proposed method appears to require a significantly smaller sample size with equivalent power. Moreover, the proposed method is easily generalizable and brings the problem of assessing the agreement between two or more raters and one or more characteristics under a unified framework, thus providing an easy-to-use tool to medical practitioners.
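The waiting-time quantity described in this abstract can be illustrated with a short sketch. It only tracks the cumulative number of disagreements as subjects arrive and reports when that count first exceeds a threshold; the function name and the 1-based indexing are assumptions, and the critical values that turn the waiting time into a formal test are not given in the abstract, so they are left out.

```python
def waiting_time_to_threshold(ratings_a, ratings_b, threshold):
    """Return the 1-based index of the subject at which the cumulative number of
    disagreements between two raters first exceeds `threshold`.

    Returns None if the threshold is never exceeded in the data seen so far.
    """
    disagreements = 0
    for t, (a, b) in enumerate(zip(ratings_a, ratings_b), start=1):
        if a != b:
            disagreements += 1
        if disagreements > threshold:
            return t
    return None
```

A long waiting time is informally consistent with agreement, since disagreements accumulate slowly; a short one points the other way.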

4.

Peer Review Evaluation Process of Marie Curie Actions under EU’s Seventh Framework Programme for Research

David G. Pina Darko Hren Ana Marušić 《PloS one》2015,10(6)

We analysed the peer review of grant proposals under Marie Curie Actions, a major EU research funding instrument, which involves two steps: an independent assessment (Individual Evaluation Report, IER) performed remotely by 3 raters, and a consensus opinion reached during a meeting by the same raters (Consensus Report, CR). For 24,897 proposals evaluated from 2007 to 2013, the association between average IER and CR scores was very high across different panels, grant calls and years. Median average deviation (AD) index, used as a measure of inter-rater agreement, was 5.4 points on a 0-100 scale (interquartile range 3.4-8.3), overall, demonstrating a good general agreement among raters. For proposals where one rater disagreed with the other two raters (n=1424; 5.7%), or where all 3 raters disagreed (n=2075; 8.3%), the average IER and CR scores were still highly associated. Disagreement was more frequent for proposals from Economics/Social Sciences and Humanities panels. Greater disagreement was observed for proposals with lower average IER scores. CR scores for proposals with initial disagreement were also significantly lower. Proposals with a large absolute difference between the average IER and CR scores (≥10 points; n=368, 1.5%) generally had lower CR scores. An inter-correlation matrix of individual raters' scores of evaluation criteria of proposals indicated that these scores were, in general, a reflection of raters’ overall scores. Our analysis demonstrated a good internal consistency and general high agreement among raters. Consensus meetings appear to be relevant for particular panels and subsets of proposals with large differences among raters’ scores.
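As a rough illustration of the agreement measure named in this abstract, the sketch below computes an average deviation (AD) index per proposal and then the median across proposals. The abstract does not say whether deviations are taken about the mean or the median of the raters' scores; deviation about the mean is assumed here, and the function names and example scores are hypothetical.

```python
from statistics import median


def ad_index(scores):
    """Average absolute deviation of raters' scores from their mean (assumed centre)."""
    if not scores:
        raise ValueError("need at least one score")
    centre = sum(scores) / len(scores)
    return sum(abs(s - centre) for s in scores) / len(scores)


def median_ad(proposals):
    """Median AD index over a collection of proposals, each a list of rater scores."""
    return median(ad_index(scores) for scores in proposals)


# Hypothetical scores from three raters on a 0-100 scale
example = [[50, 50, 50], [60, 70, 80], [40, 70, 100]]
print(median_ad(example))
```

Smaller values mean the three raters' scores sit closer together, so a median of a few points on a 0-100 scale reads as good agreement.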

5.

Assessing interrater agreement on binary measurements via intraclass odds ratio


Isabella Locatelli Valentin Rousson 《Biometrical journal. Biometrische Zeitschrift》2016,58(4):962-973


6.

Testing the Intraclass Version of Kappa Coefficient of Agreement with Binary Scale and Sample Size Determination

Jun‐mo Nam 《Biometrical journal. Biometrische Zeitschrift》2002,44(5):558-570

The intraclass version of kappa coefficient has been commonly applied as a measure of agreement for two ratings per subject with binary outcome in reliability studies. We present an efficient statistic for testing the strength of kappa agreement using likelihood scores, and derive asymptotic power and sample size formula. Exact evaluation shows that the score test is generally conservative and more powerful than a method based on a chi‐square goodness‐of‐fit statistic (Donner and Eliasziw, 1992, Statistics in Medicine 11, 1511–1519). In particular, when the research question is one directional, the one‐sided score test is substantially more powerful and the reduction in sample size is appreciable.
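The quantity being tested can be made concrete with the usual point estimate of intraclass kappa for two binary ratings per subject. The sketch below covers only that estimate, not the score test or the sample size formula from the paper; the function and argument names are assumptions, and the formula is the standard one under the common-correlation model.

```python
def intraclass_kappa(n_both_positive, n_discordant, n_both_negative):
    """Point estimate of intraclass kappa for two binary ratings per subject.

    pi_hat = (2*n_both_positive + n_discordant) / (2*N) is the pooled positive rate;
    kappa_hat = 1 - n_discordant / (2*N*pi_hat*(1 - pi_hat)).
    """
    n = n_both_positive + n_discordant + n_both_negative
    if n == 0:
        raise ValueError("no subjects")
    pi_hat = (2 * n_both_positive + n_discordant) / (2 * n)
    if pi_hat in (0.0, 1.0):
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - n_discordant / (2 * n * pi_hat * (1.0 - pi_hat))
```

Unlike Cohen's kappa, this version pools the two ratings into a single marginal rate, which suits designs where the two ratings are interchangeable.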

7.

An instrument for assessment of videotapes of general practitioners' performance.

J Cox H Mulholland 《BMJ (Clinical research ed.)》1993,306(6884):1043

OBJECTIVES--To identify those important characteristics of doctors' and patients' behaviour that distinguish between "good" and "bad" consultations when viewed on videotape; to use these characteristics to develop a reliable instrument for assessing general practitioners' performance in their own consultations. DESIGN--Questionnaires completed by patients, general practitioner trainers, and general practitioner trainees. Reliability of draft instrument tested by general practitioner trainers. SETTING--All vocational training schemes for general practice in the Northern region of England. SUBJECTS--First stage: 76 patients in seven groups, 108 general practice trainers in 12 groups, and 122 general practice trainees in 10 groups. Second stage: 85 general practice trainers in 12 groups. MAIN OUTCOME MEASURES--Trainers' ratings of importance; alpha coefficients of draft instrument by trainee, group, and consultation. RESULTS--6890 characteristics of good and bad consultations were consolidated into a draft assessment instrument consisting of 46 pairs of definitions separated by six point bipolar scales. Nine statement pairs given low importance ratings by trainers were eliminated, reducing the instrument to 37 statement pairs. To test reliability, general practitioner trainers used the instrument to assess three consultations. With the exception of one group of trainers, all alpha coefficients exceeded the acceptable level of 0.80. CONCLUSION--The instrument produced is reliable for assessing general practitioners' performance in their own consultations.
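The alpha coefficients reported here are presumably Cronbach's alpha, although the abstract does not name it. Under that assumption, a minimal sketch of the computation looks like this; the data layout (one row per rated unit, one column per item) and the function name are illustrative choices.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per rated unit and one column per item.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("alpha is undefined when all row totals are equal")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Values above 0.80, the threshold quoted in the abstract, indicate that the items vary together closely enough to be treated as one scale.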

8.

The role of facial hair in women's perceptions of men's attractiveness, health, masculinity and parenting abilities

Barnaby J. Dixson Robert C. Brooks 《Evolution and human behavior》2013,34(3):236-241

Facial hair strongly influences people's judgments of men's socio-sexual attributes. However, the nature of these judgments is often contradictory. The levels of intermediate facial hair growth presented to raters and the stage of female raters' menstrual cycles might have influenced past findings. We quantified men's and women's judgments of attractiveness, health, masculinity and parenting abilities for photographs of men who were clean-shaven, lightly or heavily stubbled and fully bearded. We also tested the effect of the menstrual cycle and hormonal contraceptive use on women's ratings. Women judged faces with heavy stubble as most attractive and heavy beards, light stubble and clean-shaven faces as similarly less attractive. In contrast, men rated full beards and heavy stubble as most attractive, followed closely by clean-shaven and light stubble as least attractive. Men and women rated full beards highest for parenting ability and healthiness. Masculinity ratings increased linearly as facial hair increased, and this effect was more pronounced in women in the fertile phase of the menstrual cycle, although attractiveness ratings did not differ according to fertility. Our findings confirm that beardedness affects judgments of male socio-sexual attributes and suggest that an intermediate level of beardedness is most attractive while full-bearded men may be perceived as better fathers who could protect and invest in offspring.

9.

Evaluation of Mitotic Activity Index in Breast Cancer Using Whole Slide Digital Images

Shaimaa Al-Janabi Henk-Jan van Slooten Mike Visser Tjeerd van der Ploeg Paul J. van Diest Mehdi Jiwa 《PloS one》2013,8(12)

Introduction

Mitotic Activity Index (MAI) is an important independent prognostic factor and an integral part of the breast cancer grading system. Thus, correct estimation of this prognostically relevant feature is essential for guiding treatment decisions and assessing patient prognosis. The aim of this study was to validate the use of high resolution Whole Slide Images (WSI) in estimating MAI in breast cancer specimens.

Methods

MAI was evaluated in 100 consecutive breast cancer specimens by three observers on two occasions, microscopically and on WSI with a wash out period of 4 months. MAI was also translated to mitotic scores as in grading. Inter- and intra-observer agreement between microscopic and digital MAI counts and scores was measured.

Results

Almost perfect inter-observer agreements were obtained from counting MAI using a conventional microscope (intra-class correlation coefficient (ICCC) 0.879) as well as on WSI (ICCC 0.924). Kappa coefficients reflected good inter-observer agreements among observers' microscopic mitotic scores (average kappa 0.642). Comparable results were also observed among digital mitotic scores (average kappa 0.635). There were strong to perfect intra-observer agreements between MAI counts and mitotic scores for the two diagnostic modalities (ICCC 0.716–0.863, kappa 0.506–0.617). There were no significant differences in mitotic scores between the two diagnostic modalities.

Conclusion

Scoring mitoses using WSI in breast cancer seems to be just as reliable and reproducible as when using a microscope. Further development of software and image quality will definitely encourage the use of WSI in routine pathology practice.

10.

A Modification of Kappa for Interobserver Bias

J. Krauth 《Biometrical journal. Biometrische Zeitschrift》1984,26(4):435-445

The kappa index was developed by Cohen and others for measuring nominal scale agreement between two raters. This statistic measures the distance from the null hypothesis of independent ratings of two observers. Here a modified kappa is introduced, which takes into account the distance between the marginal distributions as well. This distance is interpreted as the so-called interobserver bias. Population analogues are defined for the modified kappa and a related conditional index. For these parameters asymptotic confidence intervals and tests are derived. The procedures are illustrated by fictitious and real examples.

11.

Modeling kappa for measuring dependent categorical agreement data

Williamson JM Lipsitz SR Manatunga AK 《Biostatistics (Oxford, England)》2000,1(2):191-202

A method for analysing dependent agreement data with categorical responses is proposed. A generalized estimating equation approach is developed with two sets of equations. The first set models the marginal distribution of categorical ratings, and the second set models the pairwise association of ratings with the kappa coefficient (kappa) as a metric. Covariates can be incorporated into both sets of equations. This approach is compared with a latent variable model that assumes an underlying multivariate normal distribution in which the intraclass correlation coefficient is used as a measure of association. Examples are from a cervical ectopy study and the National Heart, Lung, and Blood Institute Veteran Twin Study.

12.

The reliability of peer review in anthrozoology

《Anthrozoös》2013,26(2):175-182


13.

Models for Ordinal Agreement Data

Christof Schuster Alexander von Eye 《Biometrical journal. Biometrische Zeitschrift》2001,43(7):795-808

Statistical models can be used to describe the probabilistic structure underlying cross‐classified agreement data. This article explains how models for ordinal agreement data can be understood in terms of an association component and an agreement component. The association component accounts for the positive association typically present in ordinal ratings of two observers. The agreement component specifies a model for the diagonal cells of the cross‐classified ratings. Several models for ordinal agreement data proposed in the literature are special cases of this approach. A new log‐linear model for agreement data that can also be understood in terms of the two components is presented and illustrated using data from a case‐control study of coronary heart disease.

14.

Powerful Exact Unconditional Tests for Agreement between Two Raters with Binary Endpoints

Guogen Shan Gregory E. Wilding 《PloS one》2014,9(5)

Asymptotic and exact conditional approaches have often been used for testing agreement between two raters with binary outcomes. The exact conditional approach is guaranteed to respect the test size as compared to the traditionally used asymptotic approach based on the standardized Cohen's kappa coefficient. An alternative to the conditional approach is an unconditional strategy which relaxes the restriction of fixed marginal totals as in the conditional approach. Three exact unconditional hypothesis testing procedures are considered in this article: an approach based on maximization, an approach based on the conditional p-value and maximization, and an approach based on estimation and maximization. We compared these testing procedures based on the commonly used Cohen's kappa with regard to test size and power. We recommend the following two exact approaches for use in practice due to power advantages: the approach based on conditional p-value and maximization and the approach based on estimation and maximization.

15.

Dominance index: A simple measure of relative dominance status in primates

Doris Zumpe Richard P. Michael 《American journal of primatology》1986,10(4):291-300

A simple measure of relative dominance status (cardinal rank) is described which we have termed the dominance index. Like more familiar techniques for assessing rank order, it is based on the direction of aggressive and submissive behaviors between all possible paired combinations of animals in a social group. Using data from five groups of female rhesus monkeys, it reliably produced the same ordinal ranks as fight interaction matrices. There was also good agreement with the cardinal ranks produced by two additional measures of dominance and with those produced by observer ratings. The dominance index can be calculated when fights have not actually occurred and is largely independent of the frequency of agonistic interactions. It has, therefore, wide application and can estimate dominance during brief sampling periods (one hour) and also in stable groups when agonistic interactions are low. Its application is described in experiments in which the male in a group of females was changed and the hormonal status of the females was altered. Estrogen increased female dominance status relative to other females.

16.

Unhealthy Days and Quality of Life in Irish Patients with Diabetes

Emma Louise Clifford Margaret M. Collins Claire M. Buckley Anthony P. Fitzgerald Ivan J. Perry 《PloS one》2013,8(12)

Objectives

To study the determinants of health-related quality of life (HRQoL) in Irish patients with diabetes using the Centers for Disease Control's (CDC's) ‘Unhealthy Days’ summary measure and to assess the agreement between this generic HRQoL measure and the disease-specific Audit of Diabetes Dependent Quality of Life (ADDQoL) measure.

Research Design and Methods

Data were analysed from the Diabetes Quality of Life Study, a cross-sectional study of 1,456 people with diabetes in Ireland (71% response rate). Unhealthy days were assessed using the CDC's ‘Unhealthy Days’ summary measure. Quality of life (QoL) was also assessed using the ADDQoL measure. Analyses were conducted primarily using logistic regression. The agreement between the two QoL instruments was measured using the kappa coefficient.

Results

Participants reported a median of 2 unhealthy days per month. In multivariate analyses, female gender (P = 0.001), insulin use (P = 0.030), and diabetes complications (P < 0.001) were significantly associated with more unhealthy days. Older patients had fewer unhealthy days per month (P = 0.003). Agreement between the two measures of QoL (unhealthy days measure and ADDQoL) was poor (kappa = 0.234).

Conclusions

The findings highlight the determinants of HRQoL in patients with diabetes using a generic HRQoL summary measure. The ‘Unhealthy Days’ measure and the ADDQoL have poor agreement; therefore, the ‘Unhealthy Days’ summary measure may be assessing a different construct. Nonetheless, this study demonstrates that the generic ‘Unhealthy Days’ summary measure can be used to detect determinants of HRQoL in patients with diabetes.

17.

Bayesian inference for kappa from single and multiple studies

Basu S Banerjee M Sen A 《Biometrics》2000,56(2):577-582

Cohen's kappa coefficient is a widely popular measure for chance-corrected nominal scale agreement between two raters. This article describes Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo (MCMC) methodology. We consider the case of m ≥ 2 independent samples of measured agreement, where in each sample a given subject is rated by two rating protocols on a binary scale. A major focus here is on testing the homogeneity of the kappa coefficient across the different samples. The existing frequentist tests for this case assume exchangeability of rating protocols, whereas our proposed Bayesian test does not make any such assumption. Extensive simulation is carried out to compare the performances of the Bayesian and the frequentist tests. The developed methodology is illustrated using data from a clinical trial in ophthalmology.

18.

Can primary prevention or selective screening for melanoma be more precisely targeted through general practice? A prospective study to validate a self administered risk score.

A. Jackson C. Wilkinson M. Ranger R. Pill P. August 《BMJ (Clinical research ed.)》1998,316(7124):34-39

OBJECTIVES: To establish whether a questionnaire incorporating MacKie's risk factor flow chart can identify patients at high risk for melanoma so that they can be targeted for primary and secondary prevention. To validate the risk score derived from the questionnaire and test the feasibility of self completion by comparing patients' self reported skin characteristics with a skin examination performed by an experienced general practitioner. DESIGN: Prospective questionnaire survey followed by a comparative study. SETTING: 16 randomly selected group practices in a health district in Cheshire, United Kingdom. SUBJECTS: Questionnaire survey--3105 consecutive patients aged 16 years and over attending for a primary care consultation; comparative study--a self selected subsample of 388 of the 3105 patients. MAIN OUTCOME MEASURES: MacKie risk group for melanoma. Comparison of high risk skin characteristics reported by patients and those noted during a skin examination by a doctor (kappa statistic). RESULTS: 4.3% of patients (87% women) were in the highest risk group and 4.4% (79% men) were in the second highest risk group, as defined by the MacKie score. Agreement between patients' self appraisal of skin characteristics and clinical skin examinations was reflected in kappa values of 0.67 for freckles, 0.60 for moles, and 0.43 for atypical naevi. CONCLUSION: This questionnaire helped to identify a group at high risk for melanoma. Furthermore, good agreement was found when the patient's risk scores were compared with results of the clinical skin examination. This risk score is potentially useful in targeting primary and secondary prevention of melanoma through general practice.

19.

Norovirus infections in preterm infants: wide variety of clinical courses

Sven Armbrust Axel Kramer Dirk Olbertz Kathrin Zimmermann Christoph Fusch 《BMC research notes》2009,2(1):1-6

Background

The Clubfoot Assessment Protocol (CAP) was developed for follow-up of children treated for clubfoot. The objective of this study was to analyze reliability and validity of the six items used in the domain CAPMotion Quality using inexperienced assessors.

Findings

Four raters (two paediatric orthopaedic surgeons, two senior physiotherapists) used the CAP scores to analyze, on two different occasions, 11 videotapes containing standardized recordings of motion activity according to the domain CAPMotion Quality. These results were compared to a criterion (two raters, well experienced CAP assessors) for validity and for checking for learning effect. Weighted kappa statistics, exact percentage observer agreement (Po), percentage observer agreement including one level difference (Po-1) and amount of scoring scales defined how reliability was to be interpreted. Inter- and intra-rater differences were calculated using median and interquartile ranges (IQR) on item level and mean and limits of agreement on domain level. Inter-rater reliability varied between fair and moderate (kappa) and had a mean agreement of 48/88% (Po/Po-1). Intra-rater reliability varied from moderate to good with a mean agreement of 63/96%. The intra- and inter-rater differences in the present study were generally small both on item (0.00) and domain level (-1.10). There was exact agreement of 51% and Po-1 of 91% of the six items with the criterion. No learning effect was found.

Conclusion

The CAPMotion Quality can be used by inexperienced assessors with sufficient reliability in daily clinical practice and showed acceptable accuracy compared to the criterion.
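The two percentage-agreement figures quoted in this abstract, exact agreement (Po) and agreement allowing a one-level difference (Po-1), are simple enough to sketch directly. The function name and the integer coding of the ordinal scores are assumptions for illustration; the study's actual item scales are not described here.

```python
def agreement_rates(ratings_a, ratings_b):
    """Return (Po, Po-1): the share of subjects with identical scores, and the
    share whose two scores differ by at most one level on the ordinal scale."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both raters must score the same non-empty set of subjects")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / n
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / n
    return exact, within_one
```

Reporting both figures side by side, as in 48/88%, shows how much of the disagreement is confined to adjacent scale levels.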

20.

Measuring agreement of multivariate discrete survival times using a modified weighted kappa coefficient

Guo Y Manatunga AK 《Biometrics》2009,65(1):125-134

Summary. Assessing agreement is often of interest in clinical studies to evaluate the similarity of measurements produced by different raters or methods on the same subjects. We present a modified weighted kappa coefficient to measure agreement between bivariate discrete survival times. The proposed kappa coefficient accommodates censoring by redistributing the mass of censored observations within the grid where the unobserved events may potentially happen. A generalized modified weighted kappa is proposed for multivariate discrete survival times. We estimate the modified kappa coefficients nonparametrically through a multivariate survival function estimator. The asymptotic properties of the kappa estimators are established and the performance of the estimators is examined through simulation studies of bivariate and trivariate survival times. We illustrate the application of the modified kappa coefficient in the presence of censored observations with data from a prostate cancer study.