Many of us doing empirical legal research are involved in data collection projects of various types. Until very recently, my research assistants and I wasted a tremendous amount of time duplicating work and coordinating amongst ourselves. I have recently learned -- from some of our excellent IT staff and some savvy graduate students -- about useful systems we can all use to manage our data collection projects.
The traditional approach is to hire teams of law or graduate students to collect and code data, perhaps reading cases and coding a number of variables. What I did in the past was have the students enter data in either Excel or Stata and then return their portion of the data to me when complete. A research assistant would then create a master dataset from the various sources. This approach is highly inefficient and provides many opportunities for human error.
A better way to perform this sort of data collection is to create a web-based data collection portal that research assistants can use to input coded data. One advantage is that multiple research assistants, located anywhere, can read from and write to the same dataset. Systems can also be built to allow for various levels of integrity checks, supervision, and so on. And, when the coding is complete, it is straightforward to build systems to deliver the data in whatever form you need.
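To illustrate that last point: once everything lives in a database, delivering the data in a given format is a short script. Below is a minimal sketch, assuming a PHP/MySQL setup like the one described later in this post; the database name, credentials, table, and column names are all hypothetical.

    <?php
    // Hypothetical export script: dump a coding table from MySQL to a CSV
    // file that Stata or Excel can read. All names below are placeholders.
    $db = new PDO('mysql:host=localhost;dbname=coding_project', 'user', 'password');
    $rows = $db->query('SELECT case_id, case_name, document_type, citation FROM case_coding');

    $out = fopen('case_coding.csv', 'w');
    fputcsv($out, array('case_id', 'case_name', 'document_type', 'citation')); // header row
    foreach ($rows as $row) {
        fputcsv($out, array($row['case_id'], $row['case_name'],
                            $row['document_type'], $row['citation']));
    }
    fclose($out);
    ?>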
Here at Wash U we have built a system for a project my colleague Margo Schlanger has undertaken called the Civil Rights Litigation Clearinghouse. You'll be hearing about the substance of this exciting project on this blog and elsewhere starting in about a month. A crucial facet of this project was creating an information system to catalog and subsequently disseminate a diverse collection of case documents and, germane to this post, to allow research assistants and the data owner, Margo, to code various aspects of the cases, documents, and court officials.
Below are some screenshots of the administrative back-end of this site. Using this environment, research assistants enter data about a case from actual court documents collected by the data owner. The entered data can include the case name, the document title, the document type, the citation, and so on.

[Screenshot: the data-entry form for case documents]
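For readers curious what sits behind a form like this, here is a minimal sketch of a submission handler, again assuming PHP and MySQL; the table and field names are invented for illustration and are not taken from the actual Clearinghouse system.

    <?php
    // Hypothetical handler for the data-entry form; table and field names are invented.
    $db = new PDO('mysql:host=localhost;dbname=coding_project', 'user', 'password');

    // A simple integrity check: require the core fields before accepting the record.
    foreach (array('case_name', 'document_title', 'document_type') as $field) {
        if (empty($_POST[$field])) {
            die("Missing required field: $field");
        }
    }
    $citation = isset($_POST['citation']) ? $_POST['citation'] : '';

    // A prepared statement keeps stray quotation marks in titles from breaking the insert.
    $stmt = $db->prepare('INSERT INTO documents
                          (case_name, document_title, document_type, citation, status)
                          VALUES (?, ?, ?, ?, ?)');
    $stmt->execute(array($_POST['case_name'], $_POST['document_title'],
                         $_POST['document_type'], $citation, 'in_progress'));
    ?>

In a setup like this, new records would enter with a status such as 'in_progress', which is what the review workflow shown in the next screenshot acts on.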
The data owner plays a governing role in overseeing the work done by their research assistants. It is important that a workflow and auditing system be in place so that information entered into the collection can be reviewed and modified where necessary. In the screen below, the data owner views the status of an existing case. Here the case summary is in progress; the approver may edit what the student wrote and, when ready, approve the changes. Such measures allow data to move through a defined process and allow the public display of items to be controlled at a granular level.

[Screenshot: reviewing and approving the status of an existing case]
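The approval step can be as simple as flipping a status column once the data owner has reviewed the record; the public site then only ever queries approved rows. Again, this is a sketch with hypothetical names, not the Clearinghouse code itself.

    <?php
    // Hypothetical approval action: the data owner reviews a record and approves it.
    $db = new PDO('mysql:host=localhost;dbname=coding_project', 'user', 'password');

    // Move one document from 'in_progress' to 'approved' after review.
    $stmt = $db->prepare("UPDATE documents SET status = 'approved' WHERE document_id = ?");
    $stmt->execute(array($_POST['document_id']));

    // Public-facing pages select only approved records, so unreviewed work never appears.
    $public = $db->query("SELECT case_name, document_title FROM documents WHERE status = 'approved'");
    ?>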
This custom information system has made the full maturation of this very ambitious data collection project feasible. Without a consistent and stable method of collection and interpretation, large datasets often become muddled and disjointed, rendering them difficult if not impossible to share in any meaningful way.
Obviously, planning for and organizing such an intricate piece of technology requires the right sort of collaboration. It is vital that appropriate subject-matter experts be consulted so as to avoid missteps in planning, which is so often the most crucial step in any technology undertaking.
In this case, the base technologies were quite standard: the Apache web server, PHP, and a MySQL back-end. In addition to the hard technologies, it is helpful to have people skilled in information architecture, site design, web usability, database management, and, of course, web development. And this is all in addition to the data owner and their collection, without which the endeavor would not be necessary.
Another task people frequently undertake is harvesting data from various websites. This can be done by humans using point-click-copy-paste, but that, too, is inefficient and opens up lots of opportunities for human error. It also makes it difficult to replicate studies conducted by others and to update existing datasets. There are a variety of scripting languages, including Perl (using LWP::UserAgent for fetching and HTTP::Request::Common for posting forms) and Python, that can be used to harvest data from websites and turn it into usable datasets. The Snoopy PHP class also works well for these purposes.
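To make the harvesting point concrete, here is a short sketch using the Snoopy class just mentioned. The URL and the regular expression are placeholders; the general pattern is to fetch a page, pull out the pieces you care about, and append them to a file (or write them straight into the same database the coders use).

    <?php
    // Hypothetical harvesting script using the Snoopy PHP class.
    include 'Snoopy.class.php';

    $snoopy = new Snoopy;
    if ($snoopy->fetch('http://www.example.gov/dockets?page=1')) {  // placeholder URL
        // Pull out docket numbers with a regular expression (the pattern is made up).
        preg_match_all('/Docket No\. (\d{2}-\d{4})/', $snoopy->results, $matches);

        $out = fopen('dockets.csv', 'a');
        foreach ($matches[1] as $docket) {
            fputcsv($out, array($docket));
        }
        fclose($out);
    }
    ?>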
Using these technologies is more expensive in terms of time and resources during the start-up phase of a research project, but the efficiencies that can be captured are quite significant.
In response to Tracy's question: Yes. Everything is stored in a database, so we can export whatever we want. All of the PDFs on the site have been OCRed and carry a text layer, so that content is fully searchable and exportable for content analysis.
Posted by: Andrew Martin | 07 September 2006 at 08:39 AM
Andrew,
This is terrific information. I am currently working on a project in which the PIs have discussed creating and deploying a similar interface; a database contractor is onboard to help us think the process through. I am thrilled that someone else has tackled something similar. I will be sending along this information to our group. Many thanks, bh.
Posted by: William Henderson | 07 September 2006 at 06:28 AM
Veeerry interesting. One query: can the text fields be independently accessed for content analysis? I'm guessing yes, but you never know until you know.
Posted by: Tracy Lightcap | 06 September 2006 at 09:20 PM
The Texas Criminal Justice Coalition similarly uses supervised interns and volunteers working on a web-based portal to compile racial profiling data from hundreds of reports from local police and sheriffs' departments, nearly all differing somewhat in format but containing (for the most part) several key statutorily required data points. The first reports were done using Excel, much as you describe, and the use of a uniform web interface really cleaned up the process and created the means for more systematic data integrity checks, etc.
(See the most recent (third annual) report and supporting materials here: http://www.criminaljusticecoalition.org/files/userfiles/Racial_Profiling_Data_2006_Analysis_of_Search_and_Consent_Policies_in_Texas.pdf)
Thanks for the data harvesting tips!
Posted by: Gritsforbreakfast | 06 September 2006 at 04:07 PM