Program better, for fun and for profit

Psych researchers have a bit of a reputation for being, shall we say, less-than-delicate programmers. It’s common to hear “it doesn’t matter if it’s ugly, as long as it works.”

I took computer science classes as an undergrad, and style was rigidly enforced. I had one notoriously strict professor who would dock up to a grade on an otherwise completely functional project if it was ugly. It wasn’t just simple elitism; ugly code is often inefficient, and it’s hard to read and understand.

Code quality is something I’m constantly working on. You can see the development in my scripts; I only recently started using dplyr and the rest of the tidyverse in R, and what a difference it’s made to the quality of my code. I cringe a little, looking back at my earliest scripts (and they’re a matter of public record, forever). Cringe is good, though. Cringe signals improvement, and wisdom gained.

I thought I’d share a few of the practices that were drilled into me during my CS education that have helped improve the style, quality, and readability of my code.

1. Comment your code.

Please. If not for others, then for your future self. You will neither remember nor understand your code in the future as well as you think you will. I don’t do it as thoroughly as I ought; I’m not sure anyone does. This is easy to change, and it doesn’t take much effort.

Functions should be commented with what their inputs are, what they do to those inputs, and what they return. For a gold star, you can include examples of input and output.

For example:


'''
A function that takes in a list of integers x
and returns the arithmetic mean in floating-point
form.
'''
def mean(x):
    return sum(x) / float(len(x))

Global variables should be commented with what they are and how they’re used, so that you can change them without having to dig back through the code to make sure you understand what the variable does.
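For instance, a minimal sketch (the variable and its role here are hypothetical) of a global commented with what it is and where it gets used:

# STIM_DIR: folder containing the stimulus images. Read once at startup
# to build the trial list, and again at the end when saving a record of
# which files were shown.
STIM_DIR = 'stimuli/'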

Commenting code makes it much easier for others to understand, and it cuts way down on re-learning if you have to go back and revisit old code.

2. Use sensible variable and function names.

It very rapidly becomes impossible to tell what is happening when all variables are named x, x1, x2, or tmp. While we want variable names to be succinct, they should also make sense and be recognizable. Degrees of freedom can be df. A subject’s height could be subj_height, rather than, say, h or s_h.

This is also good to do when you’re subsetting data. You don’t want to confuse yourself or others about which variable represents the full dataset, and which represents a subset!

Functions should also, ideally, be named after what they do. cartesian_to_polar(x, y) is obvious (if you know your coordinate systems); c2p(x, y) less so.
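A quick sketch with made-up data: when the names carry meaning, the subset can’t be mistaken for the full dataset.

heights = [160.0, 172.5, 181.2, 158.3]  # every subject's height, in cm
tall_heights = [h for h in heights if h > 170.0]  # subset: tall subjects only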

3. Avoid hardcoding.

“Hardcoding” is the practice of writing explicit, fixed values into functions and scripts instead of variables. So if you had a run_experiment function, hardcoded to do 100 trials, it might look like this:


def run_experiment():
    for i in range(100):
        do_trial(i)
        do_other_stuff(i)

And then maybe at the end of the script, you have to reference the number of trials again, maybe to calculate percent correct:


# let's assume for convenience that yes_responses is a list of bools
correct_resp = sum(yes_responses) / 100

This works fine, but what if you decide to change the number of trials? Then you’ll have to hunt back through every place you used 100 and change it. It would be a lot easier to define a variable, num_trials, at the beginning of your script. Then, every time you need to reference this number, use num_trials rather than the hard number. If you change your mind later, you only have to change num_trials in one place for the new value to take effect everywhere in the script.
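Here’s a sketch of the same snippet with the trial count factored out (do_trial, do_other_stuff, and yes_responses are the same placeholders as above):

num_trials = 100  # total number of trials; change it here, and only here

def run_experiment():
    for i in range(num_trials):
        do_trial(i)
        do_other_stuff(i)

correct_resp = sum(yes_responses) / num_trials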

This is especially relevant for experiment scripts, in which values might change over the course of development, or need to vary during the experiment itself with condition or trial type. It’s much more convenient to have all of your parameters mapped to variables in one place, so that you only have to change them once to change behavior everywhere. Changes become quick and easy, and it will save you heartache.

4. Think modular.

Break routines and operations into functions, particularly if you have to do them over and over again. For example, if you’re writing an experiment, you might want to have a function that handles a single trial, with some inputs that can be adjusted. Maybe you have long-exposure trials and short-exposure trials, for instance. It’s nice to be able to call do_trial(long_ontime) or do_trial(short_ontime), rather than having all of that logic embedded in one monster script. If you need more flexibility, just write the function to accept more arguments and pass them in.
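A minimal, runnable sketch of that idea (the names and the print-and-sleep “trial” are stand-ins for real experiment logic):

import time

def do_trial(ontime):
    # Present a stimulus for `ontime` seconds, then move on.
    # Swap the prints for your actual display and response code.
    print("stimulus on")
    time.sleep(ontime)
    print("stimulus off")

long_ontime = 1.0   # seconds
short_ontime = 0.2

do_trial(long_ontime)
do_trial(short_ontime)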

If you have a function that you use a lot (I have one for saving out data), you can keep it in a separate file and source it in each time you need it, rather than rewriting it each time. Being able to re-use your code saves time and effort.

5. Be succinct.

Often, there’s a verbose way to do something, and a concise way to do something. For instance, in Python, you can very often replace for-loops with list comprehensions. In R, just about every for-loop can be replaced with a combination of calls to the venerable apply family of functions.
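For instance, a four-line loop that collects the even numbers collapses into a single comprehension:

evens = []
for n in range(10):
    if n % 2 == 0:
        evens.append(n)

# the equivalent list comprehension
evens = [n for n in range(10) if n % 2 == 0]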

The tidyverse turns massive, ungainly nested calls into much cleaner pipelines. Here’s a snippet of some code I wrote with dplyr. I start with a giant data frame of individual trials, each labeled with the pair of subjects it belongs to. I need to add a variable that marks whether each trial is “correct” or “incorrect” based on my criteria, take the mean percent correct so that each pair of subjects has a single value, tally up the number of trials completed, and spit it all out as a summary table.


summary_stats_subj <- isubjs %>%
    mutate(correct = (objects == 1 & diff_judgment == 'easy') |
                     (objects == 5 & diff_judgment == 'hard')) %>%
    group_by(pair) %>%
    summarize(pct_corr = mean(correct), n = n())

And here’s what that same code would look like without dplyr:


isubjs$correct <- (isubjs$objects == 1 & isubjs$diff_judgment == 'easy') | (isubjs$objects == 5 & isubjs$diff_judgment == 'hard')

summary_stats_subj <- cbind(tapply(isubjs$correct, isubjs$pair, mean), tapply(isubjs$correct, isubjs$pair, length))

That’s a lot uglier, for one thing. I have to do a lot more indexing into my dataframe, I have to call tapply twice (unless I wanted to write a custom function), I have to manually assemble things. With dplyr, I can create new variables with one call, don’t have to repeatedly reference the objects I’m working with, and then I can immediately pass the new objects or values into any subsequent calls. You read dplyr code left to right, rather than inside out, as you have to do with nested calls like mean(rowSums(data[,data$condition == x])). It’s more intuitive and a lot less verbose.

6. Unit-test.

This is an example of doing more work now to do less work later. I’ve fallen out of the habit of unit-testing, but with the new year comes an opportunity to return to core values.

In unit-testing, you write tests in a separate script for little pieces of your code. This is easy in Python, and there’s a nice R package for it too.

The idea is that you test each piece of code (functions, object classes, etc) as you write it and verify that it works. Even better, you can write tests for code first, before you even write the code itself. Tests make sure that code runs without crashing, that the output it gives you matches what you expect, that objects have the methods and attributes you want and that they all do what you expect, and so on.

Why bother? For one, it creates a nice feedback loop with modularity. Writing code in nice little packages makes it easy to test, which encourages writing code in nice little packages, etc. Two, it will save you a ton of time during debugging. If you write an entire script through, try to run it, and almost inevitably encounter bugs, you then have to search through a lot of possible failure points. Usually that means having to verify that all of the pieces work anyway, in order to zero in on the culprit(s). With unit testing, you know right away whether something is working and if it’s working correctly. This gives you a chance to fix things before you run the whole thing through.
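As a minimal sketch, here’s what tests for the mean function from #1 could look like with Python’s built-in unittest module (ordinarily the function would live in its own file and be imported into the test script):

import unittest

def mean(x):
    return sum(x) / float(len(x))

class TestMean(unittest.TestCase):
    def test_known_value(self):
        self.assertEqual(mean([1, 2, 3]), 2.0)

    def test_returns_float(self):
        self.assertIsInstance(mean([1, 2]), float)

if __name__ == '__main__':
    unittest.main()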

 

Following good practices only nets benefits. It makes your old code more accessible to yourself. Possibly more critically, it makes your code more accessible to others. Open data and open materials are becoming more common, and it’s not enough just to throw uncommented code up on Github and call your duty done. Part of that duty is making code readable, understandable, and transparent. These practices are a good place to start.


3 thoughts on “Program better, for fun and for profit”

  1. Re: #5, I for one remain unconvinced that tidyverse idioms sufficiently improve on base R functionality/style to warrant their widespread adoption as a matter of course. Often I see this argued through comparisons to (apologies) poorly written base R — most egregiously when comparing pipes to “massive, ungainly nested calls” that no one ought to be writing in the first place. To pick on the particular example here, the base R version can be written more nicely as:

    isubjs <- within(isubjs, {
        correct <- (objects == 1 & diff_judgment == 'easy') | (objects == 5 & diff_judgment == 'hard')
    })
    summary_stats_subj <- aggregate(correct ~ pair, data = isubjs,
        function(x) c(pct_corr = mean(x), n = length(x)))

    which I submit is as neat, readable, and concise as the tidyverse version, and provides as nice a result.

    More generally, while one can certainly find cases where the “best” tidyverse code is clearly better than the “best” base R code, before recommending the tidyverse as the default way to go we should keep in mind that (a) time/energy spent learning tidyverse tools and idioms in order to clean up code is time/energy that could also be spent (perhaps more economically) simply learning to code in base R better, and (b) there are costs associated with making your code depend on foreign abstractions, both in terms of limiting the audience of people who are familiar enough with the abstraction to find your code understandable, and in terms of making your code even more reliant on outside dependencies rather than being as self-contained as is feasible. This is obviously not to say that we should never use such outside coding schemes, but it is to say that, IMO, good programming practice entails doing so only when we’re quite sure the benefits will outweigh the costs. (I wanted to somehow work in this image but didn’t know how, so here it is awkwardly linked at the end.)


  2. I’m so glad I followed the link at the end.

    You make a really good point, and you’re absolutely right that you can get anywhere you need to go with base R alone (lovely snippet, by the way!). That’s all I did for a while, and I definitely made it work just fine. What finally got me to switch to tidy was how it was built from the base to make the most common data manipulation painless; it speaks the data’s language, rather than vice-versa. It can be nice for people who are less comfortable with programming for that reason; while it does have a somewhat steep price of entry knowledge-wise, it pays a very high return on investment.

    The point about external dependencies is also a really good one, and worth thinking about when deciding how to accomplish something. There’s always a danger that dependencies fail. For me, the benefits of tidy outweigh the costs, but others could certainly come to the opposite, equally reasonable conclusion.


  3. Re: #1 In Python, moving the comment to immediately below the function definition makes it available as a docstring that people can access using `mean.__doc__` anywhere they have access to the function.
