I have posted on SSRN my article, forthcoming in the Hastings Law Journal, Coding Complexity: Bringing Law to the Empirical Analysis of the Supreme Court. The article examines the well-known and widely used U.S. Supreme Court Database (created by Harold Spaeth) – and most recently mentioned here – and addresses the Database’s limitations, particularly for those interested in law and legal doctrine.
The key point of the Article is that the Database does not contain complete or accurate information about law and legal doctrine as they appear in Supreme Court opinions. Given Harold Spaeth’s own purposes in creating the Database, these limitations may not be surprising -- although they do raise at least some challenges to his attitudinal model. Unfortunately, however, they are often misunderstood. Scholars all too frequently use the Database in ways that it simply cannot support, raising the possibility of invalid or unreliable results. This post summarizes the Article’s main arguments.
The primary challenges presented by the Database involve the coding for the “issue,” “issue area,” and “legal provision” variables. As their names suggest, these variables are frequently used by researchers interested in studying law and legal doctrine. Yet the coding protocols for these variables (as set forth in the Codebook) are not conducive to such research. Some of the limitations of these variables, illustrated in the short sketch after this list, include:
(A) The “issue” variable is not, despite its name, designed to identify the legal issues in a case. Rather, it is designed to identify the case’s “public policy context.” Schenck v. Pro-Choice Network of Western N.Y. is one example. In Schenck, a group of abortion protesters challenged an injunction limiting their activities as a violation of the First Amendment. The only legal issue in the case involves the First Amendment and the limits it places on judicial power. But the Database codes the case as having an issue of “abortion” because that is the factual, or “public policy,” context in which the case arose.
(B) The coding protocols contain a strong presumption that each case will be assigned only a single issue. The Database therefore does not add a First Amendment issue code to Schenck.
(C) The issue codes are quite underinclusive and somewhat dated. For example, there are no codes for immunities, for sexual harassment, or for the dormant commerce clause.
(D) Each of the approximately 260 issue codes is classified into one, and only one, of 13 “issue areas.” In some cases, the classification makes no sense. For example, in Markman v. Westview Instruments, Inc., the Court addressed whether patent claim construction is a question for the judge or for the jury -- that is, whether the Seventh Amendment guarantees a jury right. The Database classifies Markman under the right-to-jury-trial code, but that code, which does not distinguish between civil and criminal jury rights, is located in the Criminal Procedure issue area.
(E) The legal provision code does not identify cases or judge-made legal doctrines. It is limited to identifying statutes, constitutional provisions, and court rules.
(F) The coding protocols provide that only legal provisions mentioned in a case’s syllabus should be identified. But the syllabus – a short summary of the case – is akin to headnotes. It is not officially part of the case, it is not written by the justices or their law clerks, and it cannot be cited by lawyers or judges.
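To make the stakes of these protocols concrete, here is a minimal sketch, in Python with pandas, of how a researcher might pull “First Amendment cases” out of the Database by issue area -- and why Schenck would not appear. The file name is hypothetical, and the variable names and the numeric issue-area code follow recent SCDB releases; consult the Codebook for the release you actually use.

```python
import pandas as pd

# Load the case-centered Database release (file name is hypothetical).
scdb = pd.read_csv("SCDB_caseCentered.csv")

# Each row is one case, and each case carries exactly one "issue" and
# one "issueArea" code. In recent releases, issueArea 3 denotes the
# First Amendment issue area.
first_amendment = scdb[scdb["issueArea"] == 3]

# Schenck v. Pro-Choice Network carries only an "abortion" issue code --
# its "public policy context" -- so it is missing from this subset even
# though its only legal issue is the First Amendment.
print(first_amendment[["caseId", "issue", "issueArea"]].head())
```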
To some extent, misuses of the Database are likely due to differences in the ways that different disciplines (political science and law) use the same words. To some extent, they stem from scholars’ failure to evaluate their research designs in light of the Database’s coding protocols, which are described in the Database’s Codebook. In my Article, I provide a series of examples of research projects that fail to adequately account for the Database’s limitations and that therefore produce results that may be inaccurate.
To further explore the limitations of the Database and to experiment with more legally nuanced types of coding, I undertook a Recoding Project covering a random sample of 10% of the cases from the last Rehnquist natural court. The details of the project are, of course, explained in the Article. Among other things, I redefined “issue” to mean legal issue; expanded and rearranged the lists of issues and issue areas; placed no limit on the number of issues that could be coded per case; redefined “legal provision” to include seminal cases and legal doctrines; and identified legal provisions by looking at the opinions themselves, not just the syllabi.
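In data terms, the core structural change is from a one-row-per-case table carrying a single issue code to a long-format table that allows arbitrarily many (case, issue) pairs. A minimal sketch of the two shapes, in Python with pandas, using hypothetical column names and toy rows:

```python
import pandas as pd

# Database-style coding: one row per case, one issue code per case.
original = pd.DataFrame({
    "caseId": ["1996-001"],
    "issue":  ["abortion"],  # the single "public policy" code
})

# Recoding Project-style coding: one row per (case, legal issue) pair,
# so a case carries as many issues -- and, analogously, as many legal
# provisions -- as its opinions actually raise.
recoded = pd.DataFrame({
    "caseId":     ["1996-001"] * 3,
    "legalIssue": ["First Amendment limits on injunctions",
                   "scope of equitable/judicial power",
                   "content neutrality"],
})
```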
Some of the key findings of the Recoding Project include (a toy version of the underlying arithmetic follows the list):
(1) I identified an average of 3.7 issues and 2.4 issue areas per case, rather than the single issue and issue area per case identified in the Database.
(2) I identified an average of 2.2 times as many legal provisions per case as the original Database did.
(3) A surprising number of the legal provisions I identified should have been coded in the Database even under its own protocols, because they were mentioned in the syllabi.
(4) In both issue and legal provision coding, the “missing” codes – those that I identified but that the Database did not – disproportionately related to structural and jurisprudential issues, including procedure, the powers and operations of the federal and state governments, and the relationship between different branches of government.
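Findings (1) and (2) rest on simple arithmetic over a long-format recoding like the one sketched above. As a self-contained toy calculation (column names and data are hypothetical, not drawn from the actual Recoding Project):

```python
import pandas as pd

# Toy long-format recoding: one row per (case, legal issue) pair.
recoded = pd.DataFrame({
    "caseId":     ["A", "A", "A", "B", "B"],
    "legalIssue": ["First Amendment", "judicial power", "mootness",
                   "Seventh Amendment jury right", "claim construction"],
})

# Average number of distinct issues per case -- the statistic behind a
# figure like the 3.7 issues per case reported in finding (1).
avg_issues = recoded.groupby("caseId")["legalIssue"].nunique().mean()
print(avg_issues)  # 2.5 for this toy data
```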
These and other findings have a variety of implications for researchers working with the Database. Chief among them is that researchers should not draw conclusions about the Supreme Court’s cases from the numbers and types of issues, issue areas, and legal provisions coded. Researchers all too often rely on such information to draw conclusions about case complexity or about the number of issue “dimensions” in the cases -- pointing to the Database, for example, to justify the assumption that most Supreme Court cases involve only a single issue. But as the Article demonstrates, this single-issue coding is -- or at least may well be -- an artifact of a coding protocol that presumes each case should be assigned only a single issue, so such conclusions are circular. A second important implication cuts in two directions: the Database’s issue and issue area codes do not identify all of the cases involving a particular legal issue, and not every case carrying a particular code in fact involves the legal issue that a researcher might presume from the code’s name.
Thanks for your suggestions, Daniel. I also wonder about tagging as a way to create a more flexible multi-user database.
As for whether the coding issues are secrets -- they are not in the sense that most of the coding issues I identify are described in the Database's Codebook. My concern, however, is that many, many researchers fail to understand the coding limitations and their implications for the results of their own and others' work. I've seen -- and document in the Article -- numerous examples.
Posted by: CarolynShapiro | 01 October 2008 at 03:30 PM
... it has never been a secret that all sorts of coding issues exist. It also has never been a secret that analysis of data of this sort is always of a limited nature.
Posted by: Sean Wilson | 01 October 2008 at 01:08 PM
Professor Shapiro's article raises some very compelling points.
Just to follow up on this post, I would encourage readers to consider advances in Information Science that might be leveraged to increase both the quality and the granularity of the dataset. Here are two potential approaches (a toy sketch of the second follows the list):
(1) Computational Linguistic Techniques (e.g., approaches that search for intra-case linguistic diversity)
(2) Network Measures (including community detection algorithms)
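For instance, here is a toy sketch, in Python with the networkx library, of what a community-detection pass over a case citation network might look like (the cases and citation edges are purely illustrative):

```python
import networkx as nx

# Toy citation network among cases (edges are purely illustrative).
G = nx.Graph([
    ("Schenck", "Madsen"), ("Madsen", "Hill"),
    ("Markman", "Dimick"), ("Dimick", "Tull"),
])

# Greedy modularity maximization: clusters of cases that cite one
# another may track doctrinal areas more flexibly than a fixed
# single-issue code.
communities = nx.algorithms.community.greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```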
Anyway, these are just a couple of thoughts on how to generate a forward-looking agenda.
Best, Dan
Posted by: Daniel Katz | 30 September 2008 at 11:04 AM