Is there a debate over the various measures of inter-coder reliability? For a short discussion, as well as a practical guide for testing inter-coder reliability in content analysis, I found this: Lombard, Snyder-Duch & Campanella Bracken, Practical Resources for Assessing and Reporting Intercoder Reliability in Content Analysis Research Projects. I've used the more conservative Cohen’s kappa to measure inter-coder reliability. See Landis & Koch, The Measurement of Observer Agreement for Categorical Data, 33 Biometrics 159 (1977). But, is there a preferred measure? If so, why? Is simple percent agreement always disfavored? Is there an article that thoroughly compares the options? Comments are open.
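For readers who want to see the mechanics, here is a minimal sketch (in Python, with purely hypothetical coder data, not drawn from any of the cited studies) of how Cohen's kappa can be computed from two coders' categorical judgments: observed agreement corrected for the agreement expected from each coder's own marginal frequencies.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders assigning categorical labels to the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items on which the two coders give the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each coder's own marginal label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical coding of eight cases as "present" / "absent"
coder1 = ["present", "absent", "absent", "present", "absent", "absent", "present", "absent"]
coder2 = ["present", "absent", "absent", "absent", "absent", "absent", "present", "absent"]
print(cohens_kappa(coder1, coder2))  # about 0.71 on this toy data
```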
I came across another helpful discussion, from James F. Spriggs, II and Thomas G. Hansford, Measuring Legal Change: The Reliability and Validity of Shepard’s Citations, 53 Pol. Res. Q. 327, 334 & n.11 (2000):
"Kappa means that the level of agreement is [that] percent greater than would be expected by change and thus indicates. . . . If Kappa equals 0 then the amount of agreement between the two coders is exactly what one would expect by chance. If Kappa equals 1, then the coders agree perfectly. When evaluating the extent to which the two coders agree, Landis and Koch (1977) [Richard J. Landis & Gary G. Koch, The Measurement of Observer Agreements for Categorical Data, Biometrics 33:159-174 (1977).] attach the following labels to the size of the Kappa statistic: <0.00 is Poor; 0.00-0.20 is Slight; 0.21-0.40 is Fair; 0.41-0.60 is Moderate; 0.61-0.80 is Substantial; and 0.81-1.00 is Almost Perfect."
Posted by: Mark Hall | 24 March 2006 at 12:35 PM
We came across an article by two political scientists that says the "pi coefficient" is the most common reliability measure, especially for more than 2 coders. It also says that a pi value of 0.6 is considered minimally acceptable in communications research. Glenn A. Phelps & John B. Gates, The Myth of Jurisprudence: Interpretive Theory in the Constitutional Opinions of Justices Rehnquist and Brennan, 31 Santa Clara L. Rev. 567, nn. 69, 74 (1991).
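For comparison with the kappa sketch above, here is a minimal Python sketch (again on hypothetical data) of Scott's pi for two coders. The difference from Cohen's kappa is that the expected agreement is computed from the pooled label distribution of both coders rather than from each coder's separate marginals; the usual extension to more than two coders (Fleiss's version) follows the same logic.

```python
from collections import Counter

def scotts_pi(labels_a, labels_b):
    """Scott's pi for two coders: expected agreement uses the pooled label distribution."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pooled = Counter(labels_a) + Counter(labels_b)            # label counts over both coders
    p_expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_observed - p_expected) / (1 - p_expected)

# Same hypothetical "present"/"absent" coding as in the kappa example
coder1 = ["present", "absent", "absent", "present", "absent", "absent", "present", "absent"]
coder2 = ["present", "absent", "absent", "absent", "absent", "absent", "present", "absent"]
print(scotts_pi(coder1, coder2))  # about 0.71 on this toy data
```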
Posted by: Mark Hall and Ron Wright | 13 March 2006 at 10:10 AM
We've been trying to address similar issues; it seems one of the first questions you need to answer concerns the nature of the coding. For dichotomous coding, I've seen suggestions that Pearson's r may be sufficient.
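By way of illustration only (the 0/1 coding vectors below are hypothetical, not from any actual project), computing Pearson's r on two coders' dichotomous codes looks like this; for 0/1 data it is the same quantity as the phi coefficient.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical dichotomous codes (1 = factor present, 0 = absent) from two coders
coder1 = [1, 0, 0, 1, 0, 0, 1, 0]
coder2 = [1, 0, 0, 0, 0, 0, 1, 0]
print(pearson_r(coder1, coder2))  # about 0.75 on this toy data
```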
A few years ago, Lombard et al. wrote a piece similar to the one you've cited:
Lombard, M., J. Snyder-Duch, et al. (2002). "Content Analysis in Mass Communication: Assessment and Reporting of Intercoder Reliability." Human Communication Research 28(4): 587-604.
At any rate, I'd be interested to see what you all find, and will check back if I find other useful sources.
Posted by: Ken Cousins | 02 March 2006 at 09:44 AM
Good question, and that website is a great resource. Simply reporting the percent level of agreement is not sufficient because it tells us nothing about whether the agreement is greater than what one would expect by chance. In typical content coding, some items are so straightforward that there should be no disagreement, so any substantial disagreement on those items is a sign of problems. Often there is also a long list of factors to code for, most of which are absent in most cases. It is then easy to produce a high level of agreement overall, with coders almost always indicating "not present," but the key question is the level of agreement when one or more coders find the factor present. Tricky stuff to get right.
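A quick hypothetical illustration of that last point: suppose two coders each find a rare factor present in only 4 of 100 cases and overlap on 2 of them. They agree on 96 of 100 cases, yet kappa comes out around 0.48, only "Moderate" on the Landis and Koch scale, because nearly all of that agreement is expected by chance.

```python
# Hypothetical: 100 cases, a rare factor found by each coder in only 4 of them.
both_present, only_coder1, only_coder2, both_absent = 2, 2, 2, 94
n = both_present + only_coder1 + only_coder2 + both_absent

p_observed = (both_present + both_absent) / n     # raw percent agreement
p1_present = (both_present + only_coder1) / n     # coder 1's marginal "present" rate
p2_present = (both_present + only_coder2) / n     # coder 2's marginal "present" rate
p_expected = p1_present * p2_present + (1 - p1_present) * (1 - p2_present)
kappa = (p_observed - p_expected) / (1 - p_expected)

print(p_observed)   # 0.96 -- looks impressive
print(kappa)        # about 0.48 -- far less impressive
```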
I recently came across the following article, which has a good but technical discussion of these issues in the context of coding legal cases: Charles A. Johnson, Content-Analytic Techniques and Judicial Research, 15 Am. Politics Q. 169-197 (1987).
At the end of the day, I'm also still not sure how best to do this. Like much of statistics, I suppose it's as much "art" as it is science.
Posted by: Mark Hall | 01 March 2006 at 08:42 AM