Compare the following two visual displays of quantitative information. The former is described by some (here) as a "bad chart." While reasonable minds can perhaps disagree, the latter is generally recognized as "probably the best statistical graphic ever drawn" (here).
Not infrequently, researchers need to merge two separate data files into one. What should be an easy task is fraught with tricky details. Structural issues (e.g., are you adding new "cases" or, rather, new data to existing cases?) warrant initial attention, as their resolution drives the downstream ID "linking" issues. Parts of this general issue are helpfully discussed in a thread (here).
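For the Stata-inclined, here is a minimal sketch of the two structural cases; the file names and the caseid variable are hypothetical stand-ins:

    * Adding new cases (new rows): stack the files.
    use "wave1.dta", clear
    append using "wave2.dta"

    * Adding new variables (new columns) to existing cases: link on a shared ID.
    use "demographics.dta", clear
    merge 1:1 caseid using "outcomes.dta"
    tabulate _merge    // always inspect how many cases actually matched

Note that -merge- presupposes a clean, consistent ID variable in both files, which is precisely where the "linking" issues arise.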
A recent thread on a listserv prompts this reminder about the availability of a leading Supreme Court database that traces its roots back to Prof. Harold Spaeth's pioneering efforts from a few decades ago.
“The Supreme Court Database is the definitive source for researchers, students, journalists, and citizens interested in the U.S. Supreme Court. The Database contains over two hundred pieces of information about each case decided by the Court between the 1946 and 2013 terms. Examples include the identity of the court whose decision the Supreme Court reviewed, the parties to the suit, the legal provisions considered in the case, and the votes of the Justices.”
The database includes "both Case Centered and Justice Centered data. In the Case Centered data the unit of analysis is the case; i.e., each row of the dataset contains information about an individual dispute. These data should be used unless the votes of the individual justices are of interest. The Justice Centered data include a row for each justice participating in each dispute."
An online codebook (here) and a brief history of the database (here) by Harold Spaeth accompany the database.
Over at the Stata forum, a discussion about importing large Excel datasets unearthed a very helpful coding nugget. The key is entry #3, which reveals an undocumented setting ("set excelxlsxlargefile on"). According to the poster, activating the setting:
"will allow -import excel- to bypass the size checking. But you should be warned, the library we use to import Excel files has a large memory footprint when dealing with large xlsx files. Also the library currently has no ability to allow user to break during the middle of loading an Excel file. Hence when you attempt to load [a large Excel] xlsx file, the Stata session will become unresponsive until it finishes. During this time, you will not be able to break out using the break button."
In a recent post over at Concurring Opinions, Harry Surden (Colorado) concludes with the following prediction: "In the not too distant future, such data-driven approaches to engaging in legal prediction are likely to become more common within law. Outside of law, data analytics and machine-learning have been transforming industries ranging from medicine to finance, and it is unlikely that law will remain as comparatively untouched by such sweeping changes as it remains today."
If Surden is even partially correct, we should expect to see data increasingly pressed into the service of a more sophisticated legal outcomes "prediction" business. Of course, the Katz, Bommarito, and Blackman paper's claimed 70.9% successful prediction rate (discussed in Surden's post) needs to be placed in context. Specifically, as many law profs and appellate litigators instinctively know, simply by predicting a reversal one can correctly predict the outcome of a Supreme Court case with approximately 56-73% accuracy (for an extended discussion, click here). While a 70.9% prediction rate is impressive, when it comes to Supreme Court cases the correct baseline is not a Priest-Klein 50%.
Andrew Gelman (Columbia--Statistics) has a nice post (here) that underscores a common point: A general pull towards identifying "typical" responses can deflect researchers from a potentially more interesting story about variation. His second--but often related--point is that it is awfully difficult to overemphasize the need to simply "look" at the data.
As Gelman observes: "The resolution, I think, is that we have to avoid the tendency to think deterministically. There’s variation! As shown in the above histogram, some people reported thinking to be “not at all enjoyable,” some reported it to be “somewhat enjoyable,” and there were a lot of people in the middle. Given this, it’s not so helpful to make statements about what people “typically” enjoy (as in the abstract of the paper)."
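In Stata terms, the "look at the data" step can be as simple as drawing a histogram before fitting anything. A minimal sketch, assuming a hypothetical 1-5 enjoyment rating as in the study Gelman discusses:

    * Hypothetical variable: a discrete 1-5 "enjoyment" rating.
    histogram enjoyment, discrete percent ///
        xtitle("Self-reported enjoyment") ytitle("Percent of respondents")
    summarize enjoyment, detail    // the spread is the story, not just the mean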
Prof. Hans Rosling (Karolinska Institute--International Health) spoke at Cornell recently. His talk made palpably clear why he is the most plausible heir to the legacy Prof. Tufte forged with his seminal The Visual Display of Quantitative Information. Rosling's 20-minute TED Talk, described as "the best stats you've ever seen," has been viewed almost 9 million times. But the most comprehensive archive of Rosling's work is found at the Gapminder website (here). Setting aside the raw aesthetics of Rosling's presentations, his substantive project involves bringing data to bear on trendy hypotheses about world macro trends. Seriously folks, while not necessarily "legal" in the narrow sense, this is nonetheless worth a look.
In a perfect world, data sets arrive complete, without any missing data, and shaped to match the desired unit of analysis. In the real world, however, many data sets require manipulation. A recent thread on the Stata forum underscores the utility of the 'reshape' command.
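A minimal sketch of the wide-to-long move (variable and file names hypothetical):

    * Wide data: one row per case, with columns inc1980, inc1981, ...
    use "panel_wide.dta", clear

    * Long data: one row per case-year, the usual unit for panel analysis.
    reshape long inc, i(caseid) j(year)

    * -reshape wide inc, i(caseid) j(year)- reverses the operation.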
Ronen Avraham (Texas) recently alerted me to an update of one of the more helpful resources around for those who study torts and tort reform. Specifically, the Database of State Tort Law Reforms (DSTLR 5th) seeks to create "one 'canonized' dataset [that] will increase our understanding of tort reform’s impacts on our lives." A fuller (though excerpted) description follows.
"This manuscript of the DSTLR (5th) updates the DSTLR (4th) and contains the most detailed, complete and comprehensive legal dataset of the most prevalent tort reforms in the United States between 1980 and 2012.... The dataset records state laws in all fifty states and the District of Columbia over the last several decades. For each reform we record the effective date, a short description of the reform, whether or not the jury is allowed to know about the reform, whether the reform was upheld or struck down by the states’ courts, as well as whether it was amended by the state legislator... ."
Once again, Northwestern University will be sponsoring a pair of workshops on "Research Design for Causal Inference," organized by Bernie Black (Northwestern) and Mat McCubbins (Duke). Details can be found here; there is no formal registration deadline, but space is limited (and registration for the "main workshop" July 7-11 is already closed to graduate students). These workshops feature world-class faculty, and are an excellent, efficient way to become acquainted with contemporary approaches for making causal inferences from various kinds of observational and experimental data.
lawstat is an "R software package on statistical tests widely utilized in biostatistics, public policy and law." It gathers a range of basic, commonly used statistical procedures and includes examples from those fields. Particularly for those of you new to R, it is worth adding to your list of useful R resources.
Stata's graphics capabilities--while certainly powerful--have always struck me as unduly complicated, particularly when it comes to coding graph commands. Consequently, it is with some trepidation that I note a recent Stata Blog posting that illustrates Stata's animated graphics possibilities. (Such efforts will also require video-editing software, such as Camtasia or FFmpeg.) For those willing to invest the time and effort, however, the end results can be quite clever and interesting. Those inclined (and with time to burn) might want to look at some examples here.
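The basic workflow is frame-by-frame: draw one graph per period, export each as an image, and stitch the images together outside Stata. A minimal sketch (variables, years, and file names hypothetical):

    * Export one PNG per year.
    forvalues y = 1980/2012 {
        twoway scatter outcome predictor if year == `y', title("Year `y'")
        graph export "frame`y'.png", replace
    }
    * Then, from the shell, assemble the frames, e.g.:
    *   ffmpeg -framerate 5 -start_number 1980 -i frame%d.png movie.mp4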
The good folks at UCLA maintain an exceptionally rich and deep collection of online resources germane to ELS. Insofar as much secondary analysis involves pulling data stored at ICPSR, this web-based seminar might be of interest. While it relates specifically to UCLA's data archive, the general points apply to most data archives, including ICPSR.
Building on Johnson & Johnson's recent decision to make all of its clinical data available to scientists around the world (favorable New York Times op-ed here), Dave Schwartz (Chicago-Kent) and co-authors make the case (here) that this same impulse should extend to empirical legal studies as well, particularly studies of patent assertion entities ("PAEs"). An excerpt follows.
"Why is more information about PAE litigation not public? After all, the underlying data relates to litigation in the federal courts, and thus does not implicate privacy concerns like in the medical context. However, most of the raw data has been gathered and coded by private companies. For-profit businesses legitimately desire to use the information within their business and to prevent competitors and others from using commercially valuable information. That said, we believe corporate owners should release as much of the raw data (not merely descriptive statistics) as they can. To the extent that the raw data is not released or shared, society should be extremely cautious before relying upon it to make important public policy decisions."