Norms concerning data sharing appear to be in some flux, and variation across disciplines (and journals) is on the rise. Some interesting thoughts on why more scholars do not share their data to facilitate replication appear here. A strong form of the argument favoring data sharing--that journals should make data (and/or code) sharing a condition of peer review--is found here.
While simply massive and (potentially) a bit unwieldy, the Database on Ideology, Money in Politics, and Elections (DIME) is unparalleled in scope, covering over 100 million political contributions made by individuals and organizations to local, state, and federal elections between 1979 and 2012.
DIME "is intended as a general resource for the study of campaign finance and ideology in American politics. The database was developed as part of the project on Ideology in the Political Marketplace, which is an on-going effort to conduct a comprehensive ideological mapping of political elites, interest groups, and donors using the common-space CFscore scaling methodology (Bonica 2013). Constructing the database required a large-scale effort to compile, clean, and process data on contribution records, candidate characteristics, and election outcomes from various sources. The resulting database contains over 100 million political contributions made by individuals and organizations to local, state, and federal elections spanning a period from 1979 to 2012. A corresponding database of candidates and committees provides additional information on state and federal elections."
Motivations for the database's creation include: "to make data on campaign finance and elections (1) more centralized and accessible, (2) easier to work with, and (3) more versatile in terms of the types of questions that can be addressed." Housed at Stanford University, the database website requires users wishing to access the data to establish an account (free of charge) and includes direct access to a codebook.
Dan Kahan (Yale) has been a leading critic of the increasingly popular "M-Turk" studies. Although, as Andy Gelman (Columbia--Statistics) notes, "M-Turk’s combination of low cost and low validity [make] it an attractive option for many researchers," that attractiveness, as Kahan points out (here), should be tempered by critical limitations flowing from an array of M-Turk sampling issues. Whether one agrees or disagrees with Kahan, researchers using M-Turk-generated data certainly need to engage his critique.
In general, a surprising number of researchers fail to develop and execute a strategy for handling "duplicate" entries in a data set. For example, in assessing criminal outcomes researchers need to decide ex ante whether their dependent variable of interest is the "crime" or the "criminal." If it's the latter, then, owing to recidivism, researchers need to think through how to handle the possibility that a single individual may commit multiple, independent crimes. In that event, a single criminal may appear more than once in a data set and, in so doing, raise potential "double count" issues. (Obviously, one may plausibly treat multiple crimes committed by the same individual as separate events and intentionally "double count.") For researchers who want to avoid "double counting," however, identifying and rooting out duplicate cases in large data sets can prove difficult. To this end, a nice (albeit abbreviated) discussion of how to use Stata's -duplicates- command is found here.
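For readers who want the flavor of the workflow, a minimal sketch might look like the following (the variable names defendant_id and crime_date are hypothetical stand-ins for whatever identifies a person and an offense in a given data set):

    * how many rows share the same defendant identifier?
    duplicates report defendant_id

    * flag the duplicates for inspection before dropping anything
    duplicates tag defendant_id, generate(dup)
    list defendant_id crime_date if dup > 0

    * if the unit of analysis is the "criminal," keep one row per defendant
    duplicates drop defendant_id, force

Note that the force option is required whenever the flagged rows are not identical on every variable--which is precisely the situation that calls for an ex ante decision rather than a mechanical fix.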
A common, hands-on task that can be accomplished in various ways (some more parsimonious than others). An example (with coding illustrations and suggestions) that recently emerged on the Stata list (here) might help those new to empirical methods and/or data management.
This is a fascinating and complex topic where the rules/norms appear to be shifting in real time. Some of the complexities are aptly illustrated in a recent post on Andrew Gelman's blog (here). (For those interested, even more valuable than the post itself is the growing list of comments, which vividly illustrate just how tangled the issue can become.)
While I personally don't think such defenses of empirical legal studies remain necessary, I nonetheless note a recent response by Christina Boyd (Georgia--Poli Sci) in the Buffalo Law Review that takes up the task. In addition, Boyd's paper, In Defense of Empirical Legal Studies, includes a helpful summary of (and links to) a few of the leading publicly available ELS-related data sets (with a poli sci tilt).
While generally not news, academic squabbles over coding decisions, including coding decisions involving leading data sets, become news when they spill into the academic literature. A current example involves the Supreme Court Database and its use in a paper by Lee Epstein (Wash U), Christopher Parker (Centenary College), and Jeff Segal (SUNY Stony Brook).
In Do Justices Defend the Speech They Hate? In-Group Bias, Opportunism, and the First Amendment, Epstein et al. exploit the Supreme Court Database, use a "two-level hierarchical model of 4,519 votes in 516 cases," and find that "although liberal justices are (overall) more supportive of free speech claims than conservative justices, the votes of both liberal and conservative justices tend to reflect their preferences toward the speakers' ideological grouping, and not solely an underlying taste for (or against) the First Amendment." (The paper's findings made their way into a New York Times article.)
Another scholar, Todd Pettys (Iowa), however, took issue with Epstein et al., particularly their reliance on some of the data set's coding decisions. In Free Expression, In-Group Bias, and the Court's Conservatives: A Critique of the Epstein-Parker-Segal Study, Pettys digs deeply into the underlying data set and writes: "In a recent, widely publicized study, a prestigious team of political scientists concluded that there is strong evidence of ideological in-group bias among the Supreme Court’s members in First Amendment free-expression cases, with the current four most conservative justices being the Roberts Court’s worst offenders. Beneath the surface of the authors’ conclusions, however, one finds a surprisingly sizable combination of coding errors, superficial case readings, and questionable judgments about litigants’ ideological affiliations. Many of those problems likely flow either from shortcomings that reportedly afflict the Supreme Court Database (the data set that nearly always provides the starting point for empirical studies of the Court) or from a failure to take seriously the importance of attending to cases’ details."
Responding to Pettys' critique of their work, Epstein et al. write: "The upshot is that our coding procedures reject over 80% of the author's allegations, meaning his critique reduces to about 2% of our coding decisions, extrapolating over his non-random audit percentage. Although we think a few of these are debatable, we are happy to concede, and have corrected the dataset to reflect the changes he desires. We have also rerun the analysis. The results do not change in any substantively or statistically significant way (see Appendix A), nor do the results of the summary of our study reported in the New York Times and other outlets (see Appendix B)."
Regardless of who has the better of this particular argument, greater attention to coding conventions and the underlying accuracy of any data set (particularly influential data sets such as the Supreme Court Database) is welcome.
Andrew Gelman (Columbia--Statistics) notes (here) that among statistics' three essential elements, "measurement, comparison, and variation," measurement receives short shrift. Why?
"Part of it is surely that measurement takes effort, and we have other demands on our time. But it’s more than that. I think a large part is that we don’t carefully think about evaluation as a measurement issue and we’re not clear on what we want students to learn and how we can measure this." To this I would add one additional practical aspect. For those conducting secondary analyses of data sets put together by others, most typically defer to measurement decisions already baked into data sets.
Regardless, when measurement goes awry, measurement error creeps into the variables and bad things follow: classical (random) error in an explanatory variable, for instance, attenuates the estimated coefficient toward zero, while nonrandom error can bias estimates in far less predictable directions.
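A quick simulation makes the attenuation point concrete for the classical errors-in-variables case (the parameter values are purely illustrative, and the sketch assumes nothing about any particular data set):

    clear
    set seed 20140101
    set obs 10000
    generate x = rnormal()              // true predictor
    generate y = 1 + 2*x + rnormal()    // true slope is 2
    generate xnoisy = x + rnormal()     // observed with classical measurement error
    regress y x                         // slope estimate near 2
    regress y xnoisy                    // attenuated slope near 1

With the error variance equal to the variance of the true predictor, the reliability ratio is 0.5, so the expected slope on the noisy measure is 2 * 0.5 = 1--half the true effect, from noise alone.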
A small but interesting wrinkle. Many data sets include cases with missing data (hopefully not too many), and these cases will be excluded from many regression specifications. When "describing" the data set, does one use all of the cases (including those excluded from the regression analyses) or only those cases the regression actually uses? If it's the latter, is there an easy way to generate basic summary statistics for the non-excluded cases? (The answer to the final question is "yes," and a helpful explanation--and illustrations--can be found here.)
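For Stata users, one standard approach runs through e(sample), which flags the estimation sample after a model is fit, making the second question nearly a one-liner (variable names hypothetical):

    regress y x1 x2 x3

    * summary statistics for only the observations used in the regression
    summarize y x1 x2 x3 if e(sample)

    * or the post-estimation shortcut, which reports the same sample
    estat summarize

Running -summarize- without the qualifier instead describes every case in memory, including those dropped from the regression for missing data--which is precisely the mismatch the question flags.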
Compare the following two visual displays of quantitative information. The former is described by some (here) as a "bad chart." While perhaps reasonable minds can disagree, the latter is generally recognized as "probably the best statistical graphic ever drawn" (here).
Not infrequently, researchers need to merge two separate data files into one. What should be an easy task is fraught with tricky details. Structural issues (e.g., are you adding new "cases" or, rather, new data to existing cases?) warrant initial attention, as their resolution drives downstream ID file "linking" issues. Parts of this general issue are helpfully discussed in a thread (here).
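In Stata terms, that structural fork maps onto two different commands, and confusing them is a common source of grief (file and variable names hypothetical):

    * adding new CASES (same variables, more rows): append
    use wave1.dta, clear
    append using wave2.dta

    * adding new VARIABLES to existing cases: merge on a shared ID
    use master.dta, clear
    merge 1:1 case_id using extra_vars.dta
    tabulate _merge     // always inspect match rates before analysis

The 1:1 syntax insists that case_id uniquely identifies rows in both files, so Stata will throw an error (rather than silently mismatch) if the ID variable contains duplicates.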
A recent thread on a listserv prompts this reminder about the availability of a leading Supreme Court database that traces its roots back to Prof. Harold Spaeth's pioneering efforts from a few decades ago.
“The Supreme Court Database is the definitive source for researchers, students, journalists, and citizens interested in the U.S. Supreme Court. The Database contains over two hundred pieces of information about each case decided by the Court between the 1946 and 2013 terms. Examples include the identity of the court whose decision the Supreme Court reviewed, the parties to the suit, the legal provisions considered in the case, and the votes of the Justices.”
The database includes "both Case Centered and Justice Centered data. In the Case Centered data the unit of analysis is the case; i.e., each row of the dataset contains information about an individual dispute. These data should be used unless the votes of the individual justices are of interest. The Justice Centered data include a row for each justice participating in each dispute."
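For those deciding between the two releases, a short sketch illustrates the structural difference (the file names below are hypothetical; the variable caseId follows the online codebook, though users should verify names against the particular release they download):

    * justice-centered release: one row per justice per case
    use SCDB_justiceCentered.dta, clear
    bysort caseId: generate votesPerCase = _N
    tabulate votesPerCase    // typically nine; fewer when a justice did not participate

    * case-centered release: one row per case, so caseId should
    * uniquely identify rows (verify for the particular release)
    use SCDB_caseCentered.dta, clear
    isid caseId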
An online codebook (here) and a brief history of the database (here) by Harold Spaeth accompany the database.
Over at the Stata forum, a discussion involving importing large Excel data sets unearthed a very helpful coding nugget. The key is entry #3, which reveals an undocumented setting ("set excelxlsxlargefile on"). According to the poster, activating the setting:
"will allow -import excel- to bypass the size checking. But you should be warned, the library we use to import Excel files has a large memory footprint when dealing with large xlsx files. Also the library currently has no ability to allow user to break during the middle of loading an Excel file. Hence when you attempt to load [a large Excel] xlsx file, the Stata session will become unresponsive until it finishes. During this time, you will not be able to break out using the break button."