Similar literature
 20 similar articles found
1.
Weighted kappa is defined as a measure of pairwise interobserver agreement. A weighted intraclass kappa coefficient is proposed to measure agreement on a particular response category. An interclass kappa coefficient is proposed for each pair of response categories. Simple estimation procedures are presented for the case where the observers judging one subject are not necessarily the same as those judging another subject. Large-sample standard errors are derived and a numerical example is given.
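
For orientation, the familiar two-rater form of weighted kappa can be sketched as below. This is not the estimator proposed in the abstract, which covers subjects judged by different sets of observers; it assumes two fixed raters, ordinal categories coded as integers starting at 0, and disagreement weights of the form |i - j| raised to a power. The function name `weighted_kappa` and its parameters are illustrative only.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories, power=2):
    """Two-rater weighted kappa for categories coded 0..n_categories-1.

    power=1 gives linear disagreement weights, power=2 quadratic ones.
    Illustrative sketch only; not the multi-observer estimator of the abstract.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    k = n_categories

    # Joint proportion table: observed[i][j] = share of subjects rated i by A and j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed disagreement and weighted chance-expected disagreement.
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) ** power
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]

    if weighted_expected == 0.0:
        raise ValueError("undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected
```

With quadratic weights, near-misses on an ordinal scale are penalized far less than ratings at opposite ends, which is the usual reason for preferring a weighted over an unweighted kappa for graded responses.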

2.

Background

Magnetic Resonance Imaging (MRI) is considered the mainstay imaging investigation in patients suspected of lumbar disc herniations. Both imaging and clinical findings determine the final decision of surgery. The objective of this study was to assess MRI observer variation in patients with sciatica who are potential candidates for lumbar disc surgery.

Methods

Patients for this study were potential candidates (n = 395) for lumbar disc surgery who underwent MRI to assess eligibility for a randomized trial. Two neuroradiologists and one neurosurgeon independently evaluated all MRIs. A four-point scale, ranging from definitely present to definitely absent, was used for both the probability of disc herniation and the probability of root compression. Multiple characteristics of the degenerated disc herniation were scored. For the interobserver agreement analysis, absolute agreement and kappa coefficients were used. Kappa coefficients were categorized as poor (<0.00), slight (0.00–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80) and excellent (0.81–1.00) agreement.

Results

Excellent agreement was found on the affected disc level (kappa range 0.81–0.86) and the nerve root that most likely caused the sciatic symptoms (kappa range 0.86–0.89). Interobserver agreement was moderate to substantial for the probability of disc herniation (kappa range 0.57–0.77) and the probability of nerve root compression (kappa range 0.42–0.69). Absolute pairwise agreement among the readers ranged from 90% to 94% on the question of whether the probability of disc herniation on MRI was above or below 50%. Generally, moderate agreement was observed regarding the characteristics of the symptomatic disc level and of the herniated disc.

Conclusion

The observer variation of MRI interpretation in potential candidates for lumbar disc surgery is satisfactory regarding the characteristics most important to the decision for surgery. However, there is considerable variation between observers in specific characteristics of the symptomatic disc level and herniated disc.
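
The pairwise kappa coefficients and the agreement categories quoted in the Methods of this abstract can be illustrated with a minimal sketch. It assumes two readers rating the same subjects on a nominal scale and uses unweighted Cohen's kappa; the abstract does not say which kappa variant the study computed, and the function names `cohen_kappa` and `agreement_label` are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed proportion of subjects on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa):
    """Map kappa to the bands quoted in the abstract's Methods.

    The quoted bands leave small gaps (e.g. between 0.20 and 0.21);
    assigning such values to the higher band is an assumption made here.
    """
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "excellent"
```

For three readers, as in this study, the function would be applied to each of the three reader pairs, which is consistent with results being reported as kappa ranges.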

3.
Weighted least-squares approach for comparing correlated kappa   Total citations: 3 (self-citations: 0; citations by others: 3)
Barnhart HX, Williamson JM. Biometrics, 2002, 58(4): 1012-1019
In the medical sciences, studies are often designed to assess the agreement between different raters or different instruments. The kappa coefficient is a popular index of agreement for binary and categorical ratings. Here we focus on testing for the equality of two dependent kappa coefficients. We use the weighted least-squares (WLS) approach of Koch et al. (1977, Biometrics 33, 133-158) to take into account the correlation between the estimated kappa statistics. We demonstrate how SAS PROC CATMOD can be used to test for the equality of dependent Cohen's kappa coefficients and dependent intraclass kappa coefficients with nominal categorical ratings. We also test for the equality of dependent Cohen's kappa and dependent weighted kappa with ordinal ratings. The major advantage of the WLS approach is that it gives the data analyst a way of testing dependent kappa coefficients with popular SAS software. The WLS approach can handle any number of categories. Analyses of three biomedical studies are used for illustration.

4.
An in vitro cataract classification system was developed in our laboratories and used to demonstrate a relationship between sustained aspirin intake and the apparent deceleration or retardation of human cataract formation. The purpose of this investigation was to assess the reliability of this cataract classification schema. Sets of extracted human cataractous lenses, which had been photographed in vitro, were randomly assigned to five observers. The task was to classify the lenses on the basis of nuclear and cortical involvement, as reflected in color and area changes along five groupings. Assessments were made on the basis of both intraobserver and interobserver agreement levels, corrected for chance (weighted kappa values). All five examiners showed levels of intraobserver agreement that ranged from "Fair" through "Good" to "Excellent" (0.46-0.83). Each of the five observers was ranked on the basis of his agreement levels with the remaining four observers. The results followed a predictable pattern such that the more experienced the observer in classifying cataracts, the more consistent his rankings vis-à-vis the remaining four evaluators. These results are discussed in the general context of observer variability studies in the field of medicine.

5.
Estimating the number of dolphins in a group is a challenging task. To assess the accuracy and precision of dolphin group size estimates, observer estimates were compared to counts from large-format vertical aerial photographs. During 11 research cruises, a total of 2,435 size estimates of 434 groups were made by 59 observers. Observer estimates were modeled as a function of the photo count in a hierarchical Bayesian framework. Accuracy varied widely among observers, and somewhat less widely among dolphin species. Most observers tended to underestimate, and the tendency increased with group size. Groups of 25, 50, 100, and 500 were underestimated by <1%, 16%, 27%, and 47%, respectively, on average. Precision of group size estimates was low, and estimates were highly variable among observers for the same group. Predicted true group size, given an observer estimate, was larger than the observer estimate for groups of more than about 25 dolphins. Predicted group size had low precision, with coefficients of variation ranging from 0.7 to 1.9. Studies which depend on group size estimates will be improved if the tendency to underestimate group size and the high uncertainty of group size estimates are included in the analysis.

6.
Humans differ in how they perceive, assess, and measure animal behaviour. This is problematic because strong observer bias can reduce statistical power and the accuracy of scientific inference and, in the worst cases, lead to spurious results. Unfortunately, reports and studies of measurement reliability in animal behaviour studies are rare. Here, we investigated two aspects of measurement reliability in working dogs: inter-observer agreement and criterion validity (comparing novice ratings with those given by experts). We extend for the first time a powerful framework used in human psychological studies to investigate three potential aspects of (dis)agreement in nonhuman animal behaviour research: (a) that some behaviours are easier to observe than others; (b) that some subjects are easier to observe than others; and (c) that observers with different levels of experience with the subject animal give the same or different ratings. We found that novice observers with the same level of experience agreed upon measures of a wide range of behaviours. We found no evidence that age of the dogs affected agreement between these same novice observers. However, when observers with different levels of experience (i.e., novices vs. a working dog expert) assessed the same dogs, agreement appeared to be strongly affected by the measurement instrument used to assess behaviour. Given that animal behaviour research often utilizes different observers with different levels of experience, our results suggest that further tests of how different observers may measure behaviour in different ways are needed across a wider variety of organisms and measurement instruments.

7.
Endocrine Practice, 2012, 18(4): 538-548
Objective

To determine the intraobserver and interobserver agreement levels in the evaluation of technetium Tc 99m sestamibi parathyroid scintigraphic images.

Methods

Ninety-eight patients with hyperparathyroidism were included in the study, and their parathyroid images were evaluated by 4 experienced nuclear medicine observers. The 98 cases were evaluated twice by each observer within an interval of 2 weeks. The evaluations were performed directly on workstations with use of digital images. A questionnaire was completed by each observer. The presence of a lesion, the number and the localizations of the lesions, and whether the lesion was clear or doubtful were all evaluated. Cohen kappa statistics and total agreement percentages were calculated by using SPSS version 11.0 software.

Results

The 4 observers performed 8 different evaluations and identified a minimum of 38 and a maximum of 43 cases with a parathyroid lesion (or lesions). Both the intraobserver and the interobserver agreements were “very good” for the presence of a parathyroid lesion. The intraobserver agreement was also “very good” and the interobserver agreement was “good” (for only 1 pair of observers) or “very good” for the evaluation of the number of parathyroid lesions. The intraobserver agreement was “very good” or “good” and the interobserver agreement was “good” for the lesion localization and for the presence of a doubtful lesion.

Conclusion

Parathyroid scintigraphy seems to be an observer-independent method in the detection of a parathyroid lesion, in the determination of the number of lesions, and in the localizations of the lesions. The measured high agreement between observers increases the reliability of parathyroid scintigraphy. (Endocr Pract. 2012;18: 538-548)

8.
By introducing replicate observations into observer agreement studies, one can obtain better measures of observer agreement than heretofore possible. New methodology based on the analysis of latent variables allows a separation of within- and between-observer variation for binary measures of assessment among pairs of observers. Maximum likelihood estimation and hypothesis testing are discussed. The methodology is illustrated using data on the assessment of dysplasia by pathologists.

9.
Weighted kappa was defined as a measure of pairwise interobserver agreement for the case where the observers judging one subject are not necessarily the same as those judging another subject. In this paper improved formulas for the large sample variance of the weighted kappa statistic are derived, a new definition of interclass kappa coefficients is suggested, and the intraclass correlation coefficient is shown to be a special case of weighted kappa.

10.
Mathematical indices for quantitation of fetal heart rate variability have been proposed by numerous authors, but there have only been infrequent attempts to determine which such indices correspond to the semi-subjective evaluation of variability observed by clinicians. We have previously examined most of the published indices by using them to calculate the variability of sets of computer-generated numbers and checking whether they fulfill certain criteria of validity. Two sets of indices (each measuring short-term and long-term variability) were selected as acceptable. Segments of fetal heart rate records from both humans and sheep, with a wide range of subjective variability, were used to compare the mathematically derived indices with the semi-subjective evaluation of three observers. The results show that the mathematical indices of short-term variability correspond closely to its subjective evaluation as present or absent. The long-term variability indices also increase progressively with the observers' evaluations of increasing variability. The agreement among observers, measured by Cohen's kappa test, is generally "substantial", although for some indices the agreement was "moderate" to "almost perfect". We conclude that the two sets of indices examined do quantitate what is clinically regarded as fetal heart rate variability.

11.
Reproducibility of vegetation measurements is critical for large-scale or long-term studies, where numerous observers collect data, but past studies have questioned reproducibility of some techniques. Five methods of evaluating understory composition were appraised for reproducibility among six observers in two forest types in south-central Alaska: ocular estimates in quadrats, overall community species rank and cover estimates, nested rooted frequency, horizontal-vertical profiles, and pin drop (systematic points). One forest type was selected to represent structure of coastal communities, another to represent structure of interior Alaska communities. Three general methods of evaluating reproducibility were considered: standard deviations (precision among observers), components of variance (percentage of total variance attributable to observers), and analysis of variance (significance of observer variance). Observer variances were generally similar among techniques and significant in most cases. No technique stood out as being more reproducible than others. Features of techniques other than reproducibility may be more important when selecting a technique. Management decisions based on vegetation cover data should consider the observer errors involved as well as biological significance.

12.
S T Gross. Biometrics, 1986, 42(4): 883-893
Published results on the use of the kappa coefficient of agreement have traditionally been concerned with situations where a large number of subjects is classified by a small group of raters. The coefficient is then used to assess the degree of agreement among the raters through hypothesis testing or confidence intervals. A modified kappa coefficient of agreement for multiple categories is proposed and a parameter-free distribution for testing null agreement is provided, for use when the number of raters is large relative to the number of categories and subjects. The large-sample distribution of kappa is shown to be normal in the nonnull case, and confidence intervals for kappa are provided. The results are extended to allow for an unequal number of raters per subject.

13.
Validating biodiversity indicators requires an analysis of their applicability, their range of validity and their degree of correlation with the biodiversity they are supposed to represent. In this process, assessing the magnitude of observer effect is an essential step, especially if non-specialist observers are involved. Tree microhabitats – woodpecker cavities, cracks and bark characteristics – are reputed to be easily detected by non-specialists as microhabitat observation does not require prior forestry or ecology knowledge. We therefore quantified the probabilities of true and false positive detections made by observers during inventories. Within two 0.5 ha plots in a forest reserve that has not been harvested for at least 150 years, 14 observers with various backgrounds visually inventoried microhabitats on 106 oak (Quercus petraea and Quercus robur) and beech (Fagus sylvatica) trees. We used parametric and Bayesian statistics to compare these observers’ recorded observations with results from an independent census. The mean number of microhabitats per tree varied widely among observers, from 1.4 to over 3. Only five observers reported a mean number of microhabitats per tree that was statistically equivalent to the reference census. The probability of true detection also varied among observers for each microhabitat (from 0 to 1), as did the probability of false positive detection (from 0 to 0.7). These results show that microhabitat inventories are particularly prone to observer effects. Such strong observer effects weaken the usefulness of microhabitats as biodiversity indicators.
If microhabitat inventories are to be developed, we recommend controlling for observer effects by (i) defining standard operating procedures and multiplying the number of observer training sessions and of consensual standardization censuses; (ii) using pairs of observers to record microhabitats whenever possible (though the efficiency of this method remains to be tested); (iii) planning fieldwork so that the factors of interest are not confused with observer effects; and (iv) integrating observer profiles into the statistical models used to analyze the data.

14.
Ethnicity can be a means by which people identify themselves and others. This type of identification mediates many kinds of social interactions and may reflect adaptations to a long history of group living in humans. Recent admixture in the US between groups from different continents, and the historically strong emphasis on phenotypic differences between members of these groups, presents an opportunity to examine the degree of concordance between estimates of group membership based on genetic markers and on visually based estimates of facial features. We first measured the degree of Native American, European, African and East Asian genetic admixture in a sample of 14 self-identified Hispanic individuals, chosen to cover a broad range of Native American and European genetic admixture proportions. We showed frontal and side-view photographs of the 14 individuals to 241 subjects living in New Mexico, and asked them to estimate the degree of Native American (NA) admixture for each individual. We assess the overall concordance for each observer based on an aggregated measure of the difference between the observer and the genetic estimates. We find that observers reach a significantly higher degree of concordance than expected by chance, and that the degree of concordance as well as the direction of the discrepancy in estimates differs based on the ethnicity of the observer, but not on the observers' age or sex. This study highlights the potentially high degree of discordance between physical appearance and genetic measures of ethnicity, as well as how perceptions of ethnic affiliation are context-specific. We compare our findings to those of previous studies and discuss their implications.

15.
Animal Behaviour, 1986, 34(4): 1016-1025
Fourteen adult female domestic cats were watched by two observers for 3 months. Ratings of 18 aspects of each cat's behavioural style were obtained independently from each observer. Correlations between observers were statistically significant for 15 of the 18 aspects and seven of the correlation coefficients were greater than 0.7. The ratings were compared with results of direct recording methods, where equivalent measures were available and, in five out of six cases, the results of the ratings and direct methods were significantly correlated. The rating method is, therefore, generally reliable and can be adequately validated. Some assessments of observer ratings which are not obviously and easily related to direct recordings may prove particularly useful in developmental studies of alternative modes of behaviour and the origins of individual differences.

16.
Within- and between-group observer variability can confound scientific discovery. If observer variability is quantified and addressed, data collected by participants with wide ranges of experience and training can yield more reliable inferences. The American pika (Ochotona princeps) is a mammalian sentinel of climate change that has received consideration for listing under the United States Endangered Species Act. As a result, numerous pika monitoring initiatives have been started throughout the mountains in western North America. Some initiatives employ research teams of biological science technicians (professionals), whereas many rely on networks of citizen scientists, or volunteers, for data collection. To date, few studies have quantified observer variability during pika surveys; none have explored the reliability of professional crews or volunteers. We conducted pika surveys in Glacier National Park, Montana, to quantify observer variability. We investigated observer variability 1) among a crew of professionals, 2) among volunteers, and 3) between professionals and volunteers. Professionals were more consistent at identifying pika signs and estimating potential home ranges and consistently found more pika signs than did the volunteers, with the exception of pika sightings. Estimates of pika occupancy were consistent at each site among volunteers conducting sitting surveys. We suggest that sitting surveys conducted by volunteers can reliably detect pika site occupancy. However, data on population dynamics of pikas (e.g., density) should be collected by professionals. Observer variability analyses of this nature should be common practice for wildlife-resource managers and scientists, especially with observers of varying levels of experience and motivation. © 2012 The Wildlife Society.

17.
Pinnipeds are often monitored by counting individuals at haul-out sites, but the large numbers of densely packed individuals often present at these sites are difficult to enumerate accurately. Errors in enumeration can induce bias and reduce precision in estimates of population size and trend. We used data from paired observers monitoring walrus haul-outs in Bristol Bay, Alaska, to quantify observer variability and assess its relative importance. The probability of a pair of observers making identical counts was <0.1 for walrus groups with >50 individuals. Mean count differences ranged up to 25% for the largest counts, depending on beach and observers. In at least some cases, there was a clear tendency for counts of one observer to be consistently greater than counts of the other observer in a pair, indicating that counts of at least one of the observers were biased. These results suggest that efforts to improve accuracy of counts will be worthwhile. However, we also found that variation among observers was relatively small compared to variation among visits to a beach so that efforts to account for other sources of variation will be more important.

18.
In epidemiological studies, cases cannot always be interviewed because they are too ill or already deceased. Under these circumstances, proxy interviews are often conducted; however, the veridicality of information about mobile phone use gained by proxy interviews has been doubted. The issue is undecided due to the lack of empirical data. We conducted a study of 119 heterosexual couples. Both partners answered two questionnaires about mobile phone use, one about their own use and one about their partner's use. Overall agreement between self and proxy data, assessed using Cohen's kappa, Passing and Bablok regression, and concordance coefficients, was poor to moderate (e.g., concordance coefficients of 0.55 for duration of use). The only item with good agreement was whether or not a prepaid phone was used (Cohen's kappa 0.78 and 0.63 for male and female estimates, respectively), and to a lesser degree, the onset of mobile phone use (concordance coefficients of 0.66 and 0.61). Poorest agreement was obtained for the side of the head on which the mobile phone was held during calls (kappa coefficients of 0.20 and 0.24 for female and male estimates, respectively). We conclude that the assessment of mobile phone use by proxy data cannot be relied on except for information about onset of mobile phone use, use of prepaid or contract phones, and, to a lesser degree, duration of daily use. Agreement concerning the important information about the side of the head on which the mobile phone is held during calls was poorest and only slightly better than chance. Bioelectromagnetics 33:561–567, 2012. © 2012 Wiley Periodicals, Inc.

19.
Observer variability affects virtually all aspects of clinical medicine and investigation. One important aspect, not previously examined, is the selection of abstracts for presentation at national medical meetings. In the present study, 109 abstracts, submitted to the American Association for the Study of Liver Disease, were evaluated by three “blind” reviewers for originality, design-execution, importance, and overall scientific merit. Of the 77 abstracts rated for all parameters by all observers, interobserver agreement ranged between 81 and 88%. However, corresponding intraclass correlations varied between 0.16 (approaching statistical significance) and 0.37 (p < 0.01). Specific tests of systematic differences in scoring revealed statistically significant levels of observer bias on most of the abstract components. Moreover, the mean differences in interobserver ratings were quite small compared to the standard deviations of these differences. These results emphasize the importance of evaluating the simple percentage of rater agreement within the broader context of observer variability and systematic bias.
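
The contrast this abstract draws between a high percentage of agreement and a low intraclass correlation can be reproduced on toy data. The sketch below assumes exact-match percent agreement between two raters and a one-way random-effects intraclass correlation; the abstract does not specify which agreement definition or ICC form the study used, and the function names are illustrative.

```python
def percent_agreement(ratings_a, ratings_b):
    """Share of subjects given exactly the same score by two raters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    return sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)


def icc_oneway(ratings):
    """One-way random-effects intraclass correlation (single rating).

    `ratings` is a list with one entry per subject; each entry is the list
    of scores that subject received (same number of scores per subject).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 subjects, each with the same k >= 2 scores")

    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]

    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0.0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denominator
```

When nearly every subject receives the same score, raw agreement is high almost by construction, while the intraclass correlation stays near zero because there is little between-subject variance left for the raters to agree about.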

20.
Volunteers are increasingly being recruited into citizen science projects to collect observations for scientific studies. An additional goal of these projects is to engage and educate these volunteers. Thus, there are few barriers to participation, resulting in volunteer observers with varying ability to complete the project’s tasks. To improve the quality of a citizen science project’s outcomes it would be useful to account for inter-observer variation, and to assess the rarely tested presumption that participating in a citizen science project results in volunteers becoming better observers. Here we present a method for indexing observer variability based on the data routinely submitted by observers participating in the citizen science project eBird, a broad-scale monitoring project in which observers collect and submit lists of the bird species observed while birding. Our method for indexing observer variability uses species accumulation curves, lines that describe how the total number of species reported increases with increasing time spent in collecting observations. We find that differences in species accumulation curves among observers equate to higher rates of species accumulation, particularly for harder-to-identify species, and reveal increased species accumulation rates with continued participation. We suggest that these properties of our analysis provide a measure of observer skill, and that the potential to derive post-hoc data-derived measurements of participant ability should be more widely explored by analysts of data from citizen science projects. We see the potential for inferential results from analyses of citizen science data to be improved by accounting for observer skill.
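
The species accumulation curve described in this abstract can be illustrated with a minimal empirical version. The sketch assumes each observation is a pair of elapsed minutes and species name, which is a hypothetical input format and not the eBird checklist schema; the study's index is derived from its own analysis of such curves, whereas this only builds the raw cumulative count for one outing.

```python
def species_accumulation(observations):
    """Return [(minutes, cumulative distinct species)] for one outing.

    `observations` is an iterable of (minutes_elapsed, species_name) pairs;
    this input format is an assumption made for illustration.
    """
    seen = set()
    curve = []
    for minutes, species in sorted(observations):
        seen.add(species)
        if curve and curve[-1][0] == minutes:
            # Several records at the same time: keep one point per time stamp.
            curve[-1] = (minutes, len(seen))
        else:
            curve.append((minutes, len(seen)))
    return curve
```

Comparing such curves between observers at matched effort is one simple way to see the kind of difference in accumulation rate that the abstract interprets as observer skill.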
