Krippendorff K. Estimating the reliability, systematic error and random error of interval data. Educ Psychol Meas. 1970;30:61-70.
McKenzie DP, Mackinnon AJ, Péladeau N, Onghena P, Bruce PC, Clarke DM, Harrigan S, McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? J Psychiatr Res. 1996;30(6):483-92.

When raters tend to agree, the differences between the raters' observations are close to zero. When one rater is generally higher or lower than the other by a consistent amount, the bias (the mean of the differences) differs from zero. If raters tend to disagree, but without a consistent pattern of one rating being higher than the other, the mean difference is still close to zero. Confidence limits (usually 95%) can be calculated both for the bias and for each of the limits of agreement. If the number of categories used is small (e.g. 2 or 3), the probability that 2 raters agree by chance increases dramatically.
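As a rough illustration of the last point, the sketch below computes the probability that two raters agree purely by chance. It assumes the raters choose independently of each other and, for the printed examples, uniformly among the available categories; the function name `chance_agreement` is hypothetical and used only for illustration.

```python
def chance_agreement(probs_a, probs_b):
    """Probability that two independent raters pick the same category.

    probs_a / probs_b: per-category probabilities for rater A and rater B
    (assumed to be in the same category order and to sum to 1).
    """
    return sum(pa * pb for pa, pb in zip(probs_a, probs_b))


if __name__ == "__main__":
    # Assumption for illustration: both raters choose uniformly at random.
    for n_categories in (2, 3, 5, 10):
        uniform = [1.0 / n_categories] * n_categories
        p_chance = chance_agreement(uniform, uniform)
        print(f"{n_categories:>2} categories: chance agreement = {p_chance:.3f}")
```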
This is because both raters must confine themselves to the limited number of options available, which affects the overall agreement rate but not necessarily their propensity for “intrinsic” agreement (agreement is considered “intrinsic” if it is not due to chance). For nominal data without missing values, Fleiss' K and Krippendorff's alpha can be recommended equally for assessing inter-rater reliability. Since the asymptotic confidence interval for Fleiss' K may have a very low coverage probability, only standard bootstrap confidence intervals, as used in our study, can be recommended. If the measurement scale is not nominal and/or if values are missing (completely at random), only Krippendorff's alpha is suitable. The correct choice of measurement scale for categorical variables is essential for an unbiased assessment of reliability. Analysing ordinal variables as if they were nominal significantly underestimates the true reliability of the measurement, as our case study shows. For those interested in a one-size-fits-all approach, Krippendorff's alpha could thus become the measure of choice. As our recommendations cannot easily be applied in the available software solutions, we provide with this article a free R script that calculates both Fleiss' K and Krippendorff's alpha with the proposed bootstrap confidence intervals (Additional file 3). We compared the performance of Fleiss' K and Krippendorff's alpha as measures of inter-rater reliability.
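To illustrate why the choice of scale matters, the following sketch computes Krippendorff's alpha from a coincidence matrix with either a nominal or an ordinal difference function, skipping missing ratings. It is a minimal illustration under stated assumptions, not the R script from Additional file 3: the function name `krippendorff_alpha`, its `metric` argument and the toy ratings are all hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def krippendorff_alpha(units, metric="nominal"):
    """Krippendorff's alpha for a list of rated units.

    units: one inner list per rated subject, one entry per rater,
           with None marking a missing rating (illustrative data layout).
    metric: "nominal" or "ordinal" (categories must be sortable for ordinal).
    """
    # Coincidence matrix: every ordered pair of ratings within a unit
    # contributes 1 / (m_u - 1), where m_u counts the ratings present.
    coincidences = defaultdict(float)
    for unit in units:
        values = [v for v in unit if v is not None]
        if len(values) < 2:
            continue  # a lone rating carries no agreement information
        weight = 1.0 / (len(values) - 1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += weight

    categories = sorted({c for c, _ in coincidences})
    totals = {c: sum(coincidences.get((c, k), 0.0) for k in categories)
              for c in categories}
    n = sum(totals.values())
    if n <= 1:
        return float("nan")  # nothing pairable

    def delta_sq(c, k):
        """Squared difference between two categories under the chosen metric."""
        if c == k:
            return 0.0
        if metric == "nominal":
            return 1.0
        if metric == "ordinal":
            lo, hi = sorted((c, k))
            between = sum(totals[g] for g in categories if lo <= g <= hi)
            return (between - (totals[lo] + totals[hi]) / 2.0) ** 2
        raise ValueError(f"unknown metric: {metric!r}")

    observed = sum(o * delta_sq(c, k) for (c, k), o in coincidences.items()) / n
    expected = sum(totals[c] * totals[k] * delta_sq(c, k)
                   for c in categories for k in categories) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # only one category was ever used
    return 1.0 - observed / expected


# Toy data (made up): 6 subjects, 3 raters, ordered categories 1..4,
# one missing rating. Most disagreements are between adjacent categories.
toy_units = [
    [1, 1, 2],
    [2, 3, 2],
    [3, 3, 4],
    [4, 4, None],
    [1, 2, 1],
    [4, 3, 4],
]

if __name__ == "__main__":
    print("nominal alpha:", round(krippendorff_alpha(toy_units, "nominal"), 3))
    print("ordinal alpha:", round(krippendorff_alpha(toy_units, "ordinal"), 3))
```

On this toy data the nominal alpha is roughly 0.26 and the ordinal alpha roughly 0.78, because the nominal metric penalises a near miss between adjacent categories as heavily as any other disagreement.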
These two coefficients are very flexible, as they can handle two or more raters and categories. In both our simulation study and our case study, the point estimates of Fleiss' K and Krippendorff's alpha were very similar and showed no systematic over- or underestimation. The asymptotic confidence interval for Fleiss' K had a very low coverage probability, while the standard bootstrap interval gave very similar and valid results for both Fleiss' K and Krippendorff's alpha. The limitations of the asymptotic confidence interval stem from the fact that the underlying asymptotic normal distribution holds only under the hypothesis that the true Fleiss' K is zero. For shifted null hypotheses (we simulated true values between 0.4 and 0.93), this standard error is no longer appropriate [18, 23]. Since bootstrap confidence intervals do not rely on assumptions about the underlying distribution, they offer a better approach in cases where deriving the standard error under specific assumptions is not straightforward [24-26].
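The bootstrap idea described above can be sketched as follows: compute Fleiss' K on the full data, then resample subjects with replacement and recompute it many times. This is a hedged illustration rather than the authors' R script. It assumes complete nominal data with the same number of raters per subject, uses a percentile interval as one common bootstrap variant, and the names `fleiss_kappa` and `bootstrap_ci`, the 1000 replicates, the seed and the toy ratings are all assumptions made for the example.

```python
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' K for complete nominal data.

    ratings: one inner list per subject, one category label per rater.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this subject.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_observed = agreement_sum / n_subjects
    p_expected = sum(
        (total / (n_subjects * n_raters)) ** 2 for total in category_totals.values()
    )
    if p_expected == 1.0:
        return float("nan")  # only one category present, K is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


def bootstrap_ci(ratings, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for Fleiss' K, resampling subjects."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        kappa = fleiss_kappa(sample)
        if kappa == kappa:  # simplification: drop undefined (NaN) replicates
            estimates.append(kappa)
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * len(estimates))]
    upper = estimates[int((1.0 - tail) * len(estimates)) - 1]
    return lower, upper


# Toy data (made up): 12 subjects, 3 raters, categories "a", "b", "c".
toy_ratings = [
    ["a", "a", "a"], ["a", "a", "b"], ["b", "b", "b"], ["b", "c", "b"],
    ["c", "c", "c"], ["a", "a", "a"], ["c", "c", "b"], ["b", "b", "b"],
    ["a", "b", "a"], ["c", "c", "c"], ["a", "a", "a"], ["b", "b", "c"],
]

if __name__ == "__main__":
    point = fleiss_kappa(toy_ratings)
    low, high = bootstrap_ci(toy_ratings)
    print(f"Fleiss' K = {point:.3f}, 95% bootstrap CI = ({low:.3f}, {high:.3f})")
```

Resampling whole subjects (rows) rather than individual ratings keeps the ratings of each subject together, which is why no distributional assumption about the coefficient is needed.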