Summer has arrived, finals are graded, and ELSers' thoughts inevitably turn to research and writing. With that in mind, I'm beginning a semi-regular series of posts titled "What Not To Do." The goal is to point out some common mistakes people make in presenting empirical/statistical analyses, and to suggest some better practices.
The first subject is naming variables, and it has two parts. The first concerns the names we give variables in our data. While many variables' "natural" names reflect their "natural" coding (think of a variable called age, for example), most (e.g., gender, race, or partyid) do not. This occasionally leaves researchers in a bad spot; returning to analyses done weeks or months before, one might wonder "Does gender=1 mean males or females?" A better practice is to choose variable names that indicate directionality whenever possible: female instead of gender, white instead of race, GOP instead of partyid, and so forth. (Of course, assigning variable and value labels will also solve this problem, and is good practice in its own right...)
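To make the renaming concrete, here is a minimal sketch in Python. The records, the coding scheme (gender: 1 = male, 2 = female; partyid: "R"/"D"), and the helper name add_directional_indicators are all made up for illustration; the real coding would have to come from the codebook.

```python
# Hypothetical raw records. The coding (gender: 1 = male, 2 = female;
# partyid: "R" or "D") is an assumption made for this illustration only.
raw = [
    {"id": 1, "gender": 1, "partyid": "R"},
    {"id": 2, "gender": 2, "partyid": "D"},
    {"id": 3, "gender": 2, "partyid": "R"},
]


def add_directional_indicators(rows):
    """Return copies of rows with 0/1 indicators named for the category coded 1."""
    out = []
    for row in rows:
        new = dict(row)  # copy, so the raw records stay untouched
        new["female"] = 1 if row["gender"] == 2 else 0
        new["GOP"] = 1 if row["partyid"] == "R" else 0
        out.append(new)
    return out


recoded = add_directional_indicators(raw)
```

With names like these, female=1 can only mean one thing, even months later. The original columns are kept, so the recode can always be checked against the source.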
Second, there is an unfortunate tendency to use variable names (of the sort used to identify variables in databases) in tables, figures, text, and the like. An anonymized example (culled from a relatively recent issue of a good law review) is here:
The variable names here are, shall we say, a bit opaque; moreover, they were clearly culled directly from the software output ("SubEqInv"? "LoneClub"?). The result is a table that violates Rule #1 of Tables and Figures: They should "stand on their own."
A better practice is to use variable descriptions, rather than statistical-software variable names, in tables and figures; an example of such a better use (also from a recent issue of a well-regarded law journal) is here:
Here, the names are full descriptions of the variables, and the result is a much clearer picture of what's going on in the analysis.
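One way to keep software names out of the finished table is to store the descriptions alongside the analysis and apply them at the moment the table is written. What follows is only a sketch: the labels, the function name format_table, and the numbers are placeholders invented for illustration, not results from any analysis.

```python
# Hypothetical descriptions keyed by the short names used in the data.
LABELS = {
    "female": "Respondent is female",
    "white": "Respondent is white",
    "GOP": "Respondent identifies as Republican",
}

# Placeholder numbers for illustration only, not results from any analysis.
estimates = [("female", 0.25, 0.10), ("white", -0.50, 0.20), ("GOP", 1.00, 0.30)]


def format_table(rows, labels):
    """Build a plain-text table that shows descriptions, not variable names.

    Looking names up with labels[name] raises KeyError for any variable
    without a description, so an opaque name cannot slip into the table.
    """
    width = max([len("Variable")] + [len(labels[name]) for name, _, _ in rows])
    lines = [f"{'Variable':<{width}}  {'Coef.':>7}  {'S.E.':>7}"]
    for name, coef, se in rows:
        lines.append(f"{labels[name]:<{width}}  {coef:>7.2f}  {se:>7.2f}")
    return "\n".join(lines)


table = format_table(estimates, LABELS)
print(table)
```

The design choice worth noting is the strict lookup: a variable with no description stops the table from being produced at all, rather than quietly appearing under its database name.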
I'll talk more about tables of results in Part II.