Psych researchers have a bit of a reputation for being, shall we say, less-than-delicate programmers. It’s common to hear “it doesn’t matter if it’s ugly, as long as it works.”
I took computer science classes as an undergrad, and style was rigidly enforced. I had one notoriously strict professor who would dock up to a grade on an otherwise completely functional project if it was ugly. It wasn’t just simple elitism; ugly code is often inefficient code, and ugly code is hard to read and understand.
Code quality is something I’m constantly working on. You can see the development in my scripts; I only recently started using dplyr and the rest of the tidyverse in R, and what a difference it’s made to the quality of my code. I cringe a little, looking back at my earliest scripts (and they’re a matter of public record, forever). Cringe is good, though. Cringe signals improvement, and wisdom gained.
I thought I’d share a few of the practices that were drilled into me during my CS education that have helped improve the style, quality, and readability of my code.
1. Comment your code.
Please. If not for others, then for your future self. You will neither remember nor understand your code in the future as well as you think you will. I don’t do it as thoroughly as I ought; I’m not sure anyone does. This is easy to change, and doesn’t take much effort.
Functions should be commented with what their inputs are, what they do to those inputs, and what they return. For a gold star, you can include examples of input and output.
def mean(x):
    '''Takes in a list of integers x and returns the arithmetic mean in floating-point form.'''
    return sum(x) / float(len(x))
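For the gold-star version, one option in Python is to write the examples in the docstring in doctest format, so they can be checked automatically. This is only a sketch of the same mean function; the sample inputs are made up for illustration.

```python
import doctest


def mean(x):
    """Return the arithmetic mean of a list of numbers as a float.

    Input:   x, a non-empty list of ints or floats.
    Returns: the sum of x divided by its length.

    Examples (illustrative values):
    >>> mean([1, 2, 3, 4])
    2.5
    >>> mean([10])
    10.0
    """
    return sum(x) / float(len(x))


if __name__ == "__main__":
    # Runs the examples in the docstring and reports any mismatch.
    doctest.testmod()
```

If an example in the docstring ever stops matching what the function returns, doctest prints the difference, which keeps the comment honest.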
Global variables should be commented with what they are and how they’re used, so that you can change them without having to dig back through the code to make sure you understand what the variable does.
Commenting code makes it much easier for others to understand, and it cuts way down on re-learning if you have to go back and revisit old code.
2. Use sensible variable and function names.
It very rapidly becomes impossible to tell what is happening when all variables are named x, x1, x2, or tmp. While we want variable names to be succinct, they should also make sense and be recognizable. Degrees of freedom can be df. A subject’s height could be subj_height, rather than, say, h or s_h.
This is also good to do when you’re subsetting data. You don’t want to confuse yourself or others about which variable represents the full dataset, and which represents a subset!
Functions should also, ideally, be named after what they do. cartesian_to_polar(x, y) is obvious (if you know your coordinate systems); c2p(x, y) less so.
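As a rough sketch of what a descriptively named function and descriptively named variables might look like together, here is one possible cartesian_to_polar. The body and the target_* names are my own assumptions for illustration, not code from any particular project.

```python
import math


def cartesian_to_polar(x, y):
    """Convert Cartesian coordinates (x, y) to polar (radius, angle).

    The angle is in radians, measured counterclockwise from the positive x-axis.
    """
    radius = math.hypot(x, y)
    angle = math.atan2(y, x)
    return radius, angle


# Descriptive names at the call site, too (values are made up).
target_x, target_y = 3.0, 4.0
target_radius, target_angle = cartesian_to_polar(target_x, target_y)
```

Compare that last line with `r, a = c2p(a1, a2)`, which does the same thing but tells the reader nothing.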
3. Avoid hardcoding.
“Hardcoding” is the practice of writing explicit, fixed values into functions and scripts instead of variables. So if you had a run_experiment function, hardcoded to do 100 trials, it might look like this:
def run_experiment():
    for i in range(100):
        do_trial(i)
        do_other_stuff(i)
And then maybe at the end of the script, you have to reference the number of trials again, maybe to calculate percent correct:
# let's assume for convenience that yes_responses is a list of bools
# (dividing by 100.0 rather than 100 avoids integer division in Python 2)
correct_resp = sum(yes_responses) / 100.0
This works fine, but what if you decide to change the number of trials? Then you’ll have to hunt back through every place you used 100 and change it. What would be a lot easier is to define a variable, num_trials, at the beginning of your script. Then, every time you need to reference this number, use num_trials rather than the hard number. Then, if you change your mind, you only have to change the value of num_trials to change its value everywhere else in the script.
This is especially relevant for experiment scripts, in which values might change over the course of development, or need to change during the experiment itself with the condition or trial type. It’s much more convenient to have all of your parameters mapped to variables in one place so that you only have to change them once to change behavior everywhere. Changes become easy and quick, and it will save heartache.
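Here is a minimal sketch of the run_experiment example rewritten around num_trials. The do_trial body is a hypothetical stand-in that returns a random response, since the real one would depend on your experiment.

```python
import random

# All experiment parameters live together at the top of the script.
num_trials = 100  # trials per run; change it here and nowhere else


def do_trial(i):
    """Hypothetical stand-in for a real trial: returns a random yes/no response."""
    return random.random() < 0.5


def run_experiment():
    """Run num_trials trials and return the list of responses (bools)."""
    responses = []
    for i in range(num_trials):
        responses.append(do_trial(i))
    return responses


yes_responses = run_experiment()

# The same variable is reused here, so the two places can never disagree.
correct_resp = sum(yes_responses) / float(num_trials)
```

Changing num_trials to 50 now changes both the loop and the proportion calculation, with no hunting for stray 100s.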
4. Think modular.
Break routines and operations into functions, particularly if you have to do them over and over again. For example, if you’re writing an experiment, you might want to have a function that handles a single trial, with some inputs that can be adjusted. Maybe you have long-exposure trials and short-exposure trials, for instance. It’s nice to be able to call do_trial(long_ontime) or do_trial(short_ontime), rather than having all of that logic embedded in one monster script. If you need more flexibility, just write the function to accept more variables and pass them in.
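To make that concrete, here is a small sketch of a parameterized do_trial. The exposure values, the stimulus argument, and the returned dictionary are all assumptions made up for illustration; a real trial function would draw the stimulus and collect a response.

```python
# Exposure durations in seconds (made-up values for illustration).
long_ontime = 2.0
short_ontime = 0.25


def do_trial(ontime, stimulus="default_stimulus"):
    """Run one trial and return a record of what was shown.

    A real version would present the stimulus and wait for a response;
    this stand-in only records the settings it was called with.
    """
    return {"stimulus": stimulus, "ontime": ontime}


# The same function covers both trial types.
long_trial = do_trial(long_ontime)
short_trial = do_trial(short_ontime, stimulus="other_stimulus")
```

Adding the optional stimulus argument is the "accept more variables" step: existing calls keep working, and new ones can pass in whatever extra detail they need.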
If you have a function that you use a lot (I have one for saving out data), you can keep it in a separate file and source it in each time you need it, rather than rewriting it each time. Being able to re-use your code saves time and effort.
5. Be succinct.
Often, there’s a verbose way to do something, and a concise way to do something. For instance, in Python, you can very often replace for-loops with list comprehensions. In R, just about every for-loop can be replaced with a combination of calls to the venerable apply family of functions.
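As a quick illustration of the Python case, here is the same conversion written both ways. The reaction times are made-up numbers.

```python
# Made-up reaction times in milliseconds.
reaction_times_ms = [512, 430, 678, 389]

# Verbose: build the list with an explicit loop.
reaction_times_s = []
for rt in reaction_times_ms:
    reaction_times_s.append(rt / 1000.0)

# Concise: the same result with a list comprehension.
reaction_times_s_concise = [rt / 1000.0 for rt in reaction_times_ms]
```

Both versions produce the same list; the comprehension just says it in one line.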
With the tidyverse, massive, ungainly nested calls are much cleaner. Here’s a snippet of some code I wrote with dplyr. Here, I start with a giant data frame of individual trials, each labeled with which pair of subjects it belongs to. I need to add a variable that marks whether each trial is “correct” or “incorrect” based on my criteria, and then take the mean percent correct so that each pair of subjects has a single value, tally up the number of trials completed, and spit it out as a summary table.
summary_stats_subj <- mutate(isubjs,
                             correct = (objects == 1 & diff_judgment == 'easy') |
                                       (objects == 5 & diff_judgment == 'hard')) %>%
  group_by(pair) %>%
  summarize(pct_corr = mean(correct), n = n())
And here’s what that same code would look like without dplyr:
isubjs$correct <- (isubjs$objects == 1 & isubjs$diff_judgment == 'easy') |
                  (isubjs$objects == 5 & isubjs$diff_judgment == 'hard')
summary_stats_subj <- cbind(tapply(isubjs$correct, isubjs$pair, mean),
                            tapply(isubjs$correct, isubjs$pair, length))
That’s a lot uglier, for one thing. I have to do a lot more indexing into my dataframe, I have to call tapply twice (unless I wanted to write a custom function), and I have to manually assemble the results. With dplyr, I can create new variables with one call, don’t have to repeatedly reference the objects I’m working with, and can immediately pass the new objects or values into any subsequent calls. You read dplyr code left to right, rather than inside out, as you have to do with nested calls like mean(rowSums(data[,data$condition == x])). It’s more intuitive and a lot less verbose.
6. Unit test.
This is an example of doing more work now to do less work later. I’ve fallen out of the habit of unit-testing, but with the new year comes an opportunity to return to core values.
The idea is that you test each piece of code (functions, object classes, etc.) as you write it and verify that it works. Even better, you can write tests for code first, before you even write the code itself. Tests make sure that code runs without crashing, that the output it gives you matches what you expect, that objects have the methods and attributes you want and that they all do what you expect, and so on.
Why bother? For one, it creates a nice feedback loop with modularity. Writing code in nice little packages makes it easy to test, which encourages writing code in nice little packages, etc. Two, it will save you a ton of time during debugging. If you write an entire script through, try to run it, and almost inevitably encounter bugs, you then have to search through a lot of possible failure points. Usually that means having to verify that all of the pieces work anyway, in order to zero in on the culprit(s). With unit testing, you know right away whether something is working and if it’s working correctly. This gives you a chance to fix things before you run the whole thing through.
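For a sense of scale, here is a minimal sketch of what a few unit tests for the mean function from earlier might look like, written as plain functions with assert statements. The test names and cases are my own choices for illustration; a runner such as pytest can collect functions named test_*, but the loop at the bottom runs them without any extra tooling.

```python
def mean(x):
    """Return the arithmetic mean of a non-empty list of numbers as a float."""
    return sum(x) / float(len(x))


def test_mean_known_value():
    # The output matches a value worked out by hand.
    assert mean([1, 2, 3, 4]) == 2.5


def test_mean_returns_float():
    # Even all-integer input should give a float back.
    assert isinstance(mean([2, 4]), float)


def test_mean_empty_list_raises():
    # An empty list should fail loudly rather than return junk.
    try:
        mean([])
    except ZeroDivisionError:
        return
    raise AssertionError("mean([]) should raise ZeroDivisionError")


# Run every test; any failure stops the script with an AssertionError.
all_tests = [test_mean_known_value, test_mean_returns_float, test_mean_empty_list_raises]
for test in all_tests:
    test()
tests_run = len(all_tests)
```

Each test checks one small thing, so when one fails you know exactly which behavior broke.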
Following good practices only nets benefits. It makes your old code more accessible to yourself. Possibly more critically, it makes your code more accessible to others. Open data and open materials are becoming more common, and it’s not enough just to throw uncommented code up on GitHub and call it duty done. Part of that responsibility to openness is making code readable, understandable, and transparent. These practices are a good place to start.