As legal scholarship becomes increasingly empirical, legal scholars need to become more aware of and sensitive to relevant scholarly norms that inform empirical work. One underexamined norm relates to expectations concerning data availability and the facilitation of replication. Because replication is central to the empirical scholarship enterprise, scholars owe some duty to facilitate replication by others. Such a duty, of course, necessarily implicates access to data (and, as I discuss below, possibly to coding as well). The nature, extent, and contours of that duty, however, are far from clear and, indeed, vary across disciplines. Two common scenarios illustrate some of the complexities and nuances.
One scenario involves publicly available data, such as those managed and archived by ICPSR at Michigan. For legal scholars who use such data in published work, one obvious expectation (indeed, requirement) is to identify the specific dataset by its ICPSR number in a footnote (or table note). However, is mere dataset identification enough? For example, because some amount of data preparation and manipulation (e.g., collapsing, filtering, re-coding) is almost inevitable, should legal scholars also be expected to make available their coding? Similarly, should table-specific coding be made available by authors? To do so would reduce the burden on subsequent scholars enormously and facilitate replication, follow-up analyses, etc.
A second scenario is even more delicate as it involves original datasets generated by scholars and not publicly archived. To be sure, it is difficult to overstate the effort necessary to develop a first-class dataset. In light of a scholar's often considerable investment of sweat equity and because quality datasets frequently support multiple articles, what obligations attach to scholars in terms of facilitating replication efforts by others? On the one hand, a mechanical requirement to release the complete dataset, including data not yet published on (a "total disclosure duty" position), would likely deter dataset building efforts, at least at the margins. At the other extreme, general scholarly norms would assuredly resist a "no duty to disclose" position as some amount of disclosure and data availability are necessary so that findings can be vetted and knowledge advanced. To be sure, considerable middle ground separates these two polar positions.
I welcome thoughts and perspectives on how empirical legal scholars should resolve these (and related) issues and navigate through this uncertain and evolving terrain.
The last point by Stephen Wasby is what I regard as the most important: scholars always have big ideas and big dreams, but reality sometimes gets in the way. Shouldn't there be some idealism here, some sense of, "I'd rather we get to the truth of the matter here than shoot purely for the credit"?
Posted by: Jason @ Beaneball | 25 March 2006 at 12:53 PM
(1) The PS symposium to which Sara refers deals with most of these issues. (2) A point missed in the comments is that journal editors in political science require that access to the datasets (for those who wish to "check" and re-run the data), not just identification of existing ICPSR data sets, be allowed as a condition precedent to publication. (3) The Heise and Henderson comments about "I did the damn work, I want to use my data" miss a key point: if someone looks at your database, unless that person has considerable skill, that person's paper based on your database won't be accepted for publication; perhaps more important, you probably won't have collected exactly the data that the other person wants -- some variables will be missing, so considerable additional work will still be necessary. In short, I think the "someone else will use my data" argument is overstated. And the tendency will be to go to the ICPSR-type datasets, not to an individual author's dataset (after all, no offense intended, how can we be sure you "did it right" rather than some sloppy RA having miscoded?). I would end with a comment that cuts in a different direction: If your data set is substantial, there will be more than enough "stuff" there for others to use, and I would rather see others use it than have it sit in my basement, or my computer, or wherever -- because none of us ever gets done with the data we collect what we hope (and say) we will do.
Posted by: Stephen L. Wasby | 24 March 2006 at 02:49 PM
I agree with Joe Doherty. Data collection and cleaning is time-consuming work. The expectation I most commonly heard prior to this past year or so is that "others can have access to my data when I am done." When an outside organization pays for the data creation (like the NSF or a foundation), then they typically stipulate a liberal data release policy as a condition of funding.
Re veracity of results (which is essentially how the term “replication” is being used in this thread—not replication of a controlled experiment), it is certainly reasonable for a journal editor to ask for copies of regression results, or possibly alternative specifications of a model, or some concrete evidence of how a variable was coded.
However, immediate full disclosure of a database with lots of sweat equity is counterproductive for the incentive reasons set out by Michael Heise. Until the value of a privately constructed dataset has been largely exhausted, an author need only disclose where / how the data was collected. If I am going to work late, blow my RA budget, spend dozens of hours locating data from arcane library sources, write letters and emails to cajole cooperation from various institutions, deal with IRBs, and generally neglect my family to clean data on weekends, the rewards need to be commensurate with the effort. One dataset is usually good for three, four, or more papers. I’ll be damned if I am going to willingly hand papers two, three or four on a silver platter to a less industrious colleague.
It is not virtuous to disregard the basic economics of intellectual property--if an asset becomes a public good, it will be undersupplied by the market.
Posted by: William Henderson | 22 March 2006 at 09:57 PM
I disagree with the conclusion to Michael's statement,
"[b]ecause replication is central to the empirical scholarship enterprise scholars owe some duty to facilitate replication by others. Such a duty, of course, necessarily implicates access to data (and, as I discuss below, possibly to coding as well)."
While data disclosure is a public good that should be encouraged [I am a rabid consumer of readily available data] the OBLIGATION of a scholar to facilitate replication should extend no further than an exact statement of the procedures employed in data collection and analysis. I grant that there are exceptions -- every bit of data collected with government funds should be disclosed upon publication -- but I do not believe that scholars are ENTITLED to the data used by others in their research. [Excuse the caps, I can't make italics.]
I say this because to assert such a duty only legitimizes a problematic shortcut. In my experience more than one-half of the research effort takes place in the data collection phase; running a regression is relatively nominal compared to the decisions that are made in coding. What proponents of data disclosure are proposing is a duplication of the statistical tests, not a replication of the empirical research. While duplication is important if the results are suspect (Donohue and Wolfers on capital punishment is a case in point), we don't advance the ball very much if we are not equally or more concerned with measurement. (Donohue and Wolfers did fine, by the way, in collecting their own data to replicate research when the data were not forthcoming from the authors.)
Having said that, I do believe that data disclosure is a significant social good. The pedagogical importance of having relevant ELS datasets available for coursework should be uncontroversial, and using existing datasets as foundations for additional data collection is undeniably worthwhile. And there are great incentives to making data public; it probably increases one's citation count. But a professional duty to disclose places an obligation upon the scholar that is not balanced by the entitlement of the reader.
Posted by: Joe Doherty | 22 March 2006 at 05:16 PM
Sorry -- that link isn't working. Try here:
http://gking.harvard.edu/projects/repl.shtml
Posted by: Sara Benesh | 22 March 2006 at 02:07 PM
This is an important and consequential debate. PS: Political Science & Politics published a forum on the topic a few years ago that can be read here: http://gking.harvard.edu/projects/repl.shtm. I think it provides some useful perspectives on the question.
I, myself, am a strong proponent of what Michael calls the "total disclosure" position. I understand the desire to get what you can from your own data before sharing it with the world, but when folks are overly cautious in sharing their data, it makes me suspicious -- what are they trying to hide??
Posted by: Sara Benesh | 22 March 2006 at 02:04 PM