The blog has moved!

Just a heads up: from now on I’ll be blogging at my new site. I’ve moved all of the posts here over to there, updating and correcting some of them and spiffing up all of them.

Where are you now?

My new website is hosted with GitHub Pages. I created it by pretty much following this amazingly detailed tutorial to the letter.

Why the move?

I really like working with Markdown. I never quite liked how my code displayed in WordPress; I think it looks a lot better now, plus it’s nicer to be able to just write all my code and commentary in an R notebook (plus math rendering!) and then push it to GitHub to post new content.

What happens to this site?

For now, nothing! I’m leaving everything as-is as an archive, but I won’t be updating here anymore.

Thanks for reading, and I hope you like the new digs!


Considerations when writing a preregistration if you’re new to all this

(Note: I apparently use preregistration and pre-registration interchangeably, because I guess I want to have my hyphenation cake and eat it, too.)

I’m a selfish preregisterer (Preregisterizer? Preregistrar? Preregistrix?). I preregister experiments primarily for my own benefit. I make as many decisions as I can up front, when I’m unbiased and unattached. Yes, it’s more transparent, and yes, it provides a record to others that you made your decisions ahead of time and not in a post-hoc way. But I mostly do it for myself.

Like Odysseus tying himself to the mast, preregistration saves you from the siren song of totally justifiable post-hoc decisions that just happen to make things work a little more nicely. I do think that usually this isn’t ill-intentioned or even necessarily deliberate, but I do think it is extremely hard (if not impossible) to make decisions in an unbiased way once the data have started coming in.

Preregistration can be an asset, not an extra chore. It eliminates decision points, which speeds up the process considerably (especially in the analysis stage). It provides a record to yourself of your methods, analysis intentions, and motivating rationale. It also, of course, provides this same record to others.

So, what are some things you might want to think about when writing a pre-registration if you’re just starting out?

1. What are your hypotheses and predictions?

Why are you running this study? What hypotheses are you interested in testing, or what theories are you exploring? Do you have predictions about how the data will look, or a set of possible outcomes depending on what mechanisms are at work? Can you generate hypothetical/example data for these possibilities?

2. How many subjects?

How many people are you going to run, and why? Maybe you do a power analysis to figure out how many you would need to find your smallest effect size of interest, maybe you just always like to run 50 people per group. However you decide on your number, you want to decide on this upfront.

Alternatively, if you’re going to be doing a sequential analysis, where you periodically inspect the data over the course of collection and stop once you’ve reached some threshold (such as a Bayes factor of 10 in either support of the alternative or support of the null), you’ll want to specify those details. What is the stopping point? How many batches will you run?
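The bookkeeping for a stopping rule like this is simple to sketch. In the sketch below, compute_bf() is a hypothetical stand-in for whatever Bayes factor calculation you would actually use (e.g. from the BayesFactor package), and the incoming data are simulated with rnorm():

```r
# Sketch of a sequential design: collect data in batches, stop early if
# the Bayes factor crosses a threshold in either direction.
# compute_bf() is a hypothetical stand-in for a real BF calculation.
run_sequential = function(batch_size, max_batches, threshold, compute_bf) {
    data = numeric(0)
    for (batch in seq_len(max_batches)) {
        data = c(data, rnorm(batch_size))  # one new batch of observations
        bf = compute_bf(data)
        if (bf >= threshold || bf <= 1 / threshold) {
            return(list(n = length(data), bf = bf, stopped_early = TRUE))
        }
    }
    list(n = length(data), bf = bf, stopped_early = FALSE)
}
```

The point is just that the batch size, the maximum number of batches, and the threshold are all fixed before any data exist.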

How will you be recruiting people? What’s their compensation? Are there any inclusion criteria they have to meet (right-handed, normal or corrected-to-normal vision, etc.)?

3. What are your methods?

How are you going to run the study? To address this, I usually like to just write the methods section. I think it’s important to be detailed here because you want to prevent nudging of the experimental procedure if it “doesn’t seem to be working.” If possible, can you upload your experimental script as it will be run and all of the stimuli?

If you always test in the same place in the same way, you could write up a general equipment page that you link to in addition to the details of your particular study (if you always use the same eyetracker, for example).

This will make writing the methods section a breeze when the time comes!

4. What is your analysis plan?

How are you going to analyze the data? What kind of tests are you going to do, and between which conditions? Are there conditions under which you will pool observations across conditions? You want to make all of these decisions before you have any idea what the data are.

If possible, can you upload the very script you’ll use to analyze the data?

How are you going to code free responses? If people have to be classified, how are you going to do it? In the research I do, I run a lot of inattentional blindness studies. One of the things we have to decide is how to distinguish “noticers” (people who saw the unexpected event) from “nonnoticers” (people who missed it). This is a very consequential decision, and is definitely not one we want to have any flexibility over after the data have come in.

Are you going to exclude subjects or observations? Under what conditions? Try to imagine every possible reason you might need to exclude someone and list it. These can be things like failing attention checks, performing below chance on some critical accuracy measure, falling asleep during the experiment, reporting colorblindness during a color-based study, and so on.

Practice makes perfect

I wasn’t sure about this whole preregistration thing when I first started. It seemed like just another chore, and I was on the up and up, so why did I need to do this, anyway? But after I ran my first preregistered experiment, I came to appreciate the absence of flexibility. Even in what I expected to be a completely straightforward experiment, there were dozens of little decisions I could have made in the moment. It was so freeing to be able to just stick to the plan and have all the decisions made ahead of time. No agonizing, no doubting, no trying a million different things in the off-chance that some other analysis would look better. I just followed my map right out of the garden of forking paths.

If you want to start pre-registering your work, the Open Science Framework is a great option. There’s also AsPredicted.

Preregistering a study for the first time can feel strange, particularly if you’re accustomed to running non-preregistered studies. When you’re used to dealing with problems and decision points as they emerge over the course of a study, rather than anticipating them ahead of time, it can be hard to even enumerate them. Like anything, though, it gets much easier with practice.

One-day build: Shiny effect sizes

It had been a long time since I had built anything, and I needed to scratch the itch. I decided to build a Shiny port of Daniël Lakens’s beloved effect size calculator spreadsheets. You can find the app itself here.

These sheets are handy for a number of reasons, primarily if a paper or talk reports stats but no effect sizes (or if you want to double-check reported stats or effect sizes). I thought it would be convenient to have an online version.

All of the math is based on what’s used in the spreadsheet, and the default examples are taken from there too.

This was a really fun build, mostly because the actual math was pretty straightforward (and I could easily check it against the spreadsheets to make sure it was working, which is a refreshing change). I got to spend more time digging into the web-dev side of things, dealing with layout and the reactivity of the components of the app. I learned a lot, and it was great to be working with Shiny again.

The code is on GitHub. Suggestions for improvements and bug reports encouraged!

Intro to unit testing in R

I’ve mentioned before that a great coding practice you can ingrain into your work is unit testing. The idea is simple: you write a script that automatically evaluates pieces of your code and checks them against expected behavior. For example, a correlation coefficient should always be between -1 and 1, so you could write a test that raises an error if it encounters an r beyond these values. Or, you could check your data file after it’s read in to make sure it has all rows and columns, nothing missing where it shouldn’t be, and every column of the data type you want.
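As a warm-up before testthat, here’s what that correlation check might look like in plain base R (check_correlation is a made-up helper for illustration, not part of any package):

```r
# Raise an error for any correlation coefficient outside [-1, 1]
check_correlation = function(r) {
    if (any(r < -1 | r > 1)) stop('correlation out of range: ', r)
    invisible(r)
}

# A real correlation always passes the check silently
r = cor(mtcars$mpg, mtcars$wt)
check_correlation(r)
```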

Hadley Wickham wrote an awesome R package that makes writing tests easy and pretty intuitive. In the testthat package, you can bundle a series of expect_that statements into a test_that suite, which ideally would run a small cluster of closely related tests. Multiple test_that suites can be grouped into a context that tests an entire file, or a big chunk of functionality inside a file.

I’ll walk through examples. Code for this is up on GitHub, like always.

So first, to get things set up, you have a file that you want to test, and then you make a test script to go along with it. If you want to set up automated testing, name this file test_something.R. testthat looks for files that start with test_.

Once your file is set up, we can start writing tests! First, you source in the file that you’ll be testing:
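That source() call is one line. To keep this snippet self-contained, it writes a stand-in script to a temp file first; in a real project you would just source your actual analysis file directly:

```r
# Stand-in for your real analysis script, e.g. source('dummy_script.R')
script_file = tempfile(fileext = '.R')
writeLines('answer = 42', script_file)

# Sourcing makes the script's objects available to the tests that follow
source(script_file)
```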


The first thing I’ll be testing is that the data I defined in the dummy script meets my expectations. For reference, this is the data file I defined, but we can pretend I read it in from somewhere:

testing_data = data.frame('letters'=c('a', 'b', 'c', 'd'), 
                          'numbers'=seq(1, 4))

Creative, huh?

So, because this dummy script has multiple different parts that will all get tested, I’m first going to set a context just for the data. Doing so couldn’t be easier:

context('testing data integrity')

This just gives the test block a readable label that prints during output. Typically you have one context per file, but I’ve got multiple in this file for demo purposes.

Within a context, you have tests. Tests are best deployed to test very targeted functionality, and each test contains a small number of expect statements that define expected behavior. Here’s an example with our silly data frame. We want to make sure it has the expected number of columns and rows, so we could write a test that checks for this.

test_that('data dimensions correct', {
    expect_equal(ncol(testing_data), 2)
    expect_equal(nrow(testing_data), 4)
})

All we’re saying is we expect the actual number of columns in our data, returned by ncol(), to be equal to our number of variables (here 2), and that we should have four subjects. The first argument to expect_equal is what we’re testing, and the second argument is what that thing should be equal to (both of these tests pass).

This is sort of a goofy example, but you can imagine real applications–making sure your data has the right number of variables, for instance, or making sure that you have as many files as you do subjects so that no data are missing. If you’ve ever examined your data using View() or head() just to check, you’ve pretty much done the manual equivalent of this test.

We could also check that no values are missing:

test_that('no missing values', {
    expect_identical(testing_data, na.omit(testing_data))
})

This is a different expect statement. This is expect_identical, which has no tolerance (unlike expect_equal). We should have exactly the same thing after removing missing values if we’re not anticipating any. (There’s also an expect_equivalent statement, which ignores the attributes of its comparators. There are about 22 different expect_ statements–you can read more about them in the package documentation.)
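The difference is easy to see with floating-point arithmetic. Roughly speaking, expect_equal is built on all.equal(), which tolerates tiny numerical error, while expect_identical is built on identical(), which doesn’t:

```r
x = 0.1 + 0.2   # actually 0.30000000000000004 in floating point

# all.equal() allows a tiny numerical tolerance...
isTRUE(all.equal(x, 0.3))  # TRUE

# ...while identical() demands an exact match
identical(x, 0.3)          # FALSE
```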

How about we check to make sure our data is of the right type? We can use expect_is(), which checks if the first argument inherits from the class specified in the second argument.

test_that('data types correct', {
    expect_is(testing_data$numbers, 'integer')
    expect_is(testing_data$letters, 'character') #this one fails; they're factors
})

#note that an equivalent statement would be:
#expect_that(testing_data, is_a('data.frame'))

This might seem silly, but I’ve been burned before by not realizing that what I thought was numeric data was actually stored as a factor. Never hurts to check before you start analyzing!

We can also, for example, run tests to make sure outputs from models conform to expectations. Here’s our toy model:

model_data = data.frame('y'=c(rnorm(25, 0, 1), rnorm(25, 1, 1)),
                        'x'=rep(c('c1', 'c2'), each=25))
test.mod = lm(y ~ x, data=model_data)

Now we could test, for example, that we have the expected number of coefficients:

test_that('right number of coefficients', {
    expect_equal(length(test.mod$coefficients), 2)
})

Or that all the factor levels in the data are also in the model (here we have to be mindful of differences in data type; levels() returns a character vector, while the model object’s xlevels is a named list).

test_that('all factor levels present', {
    expect_equivalent(levels(model_data$x), unlist(test.mod$xlevels))
})

How about verifying that the intercept equals the mean of our first group? Because one value is named and the other isn’t, expect_equivalent will be the statement to use:

test_that('mean of group 1 equals intercept', {
    expect_equivalent(mean(model_data$y[model_data$x == 'c1']),
                      test.mod$coefficients['(Intercept)'])
})

This is a very simple model, but you can imagine how it would be useful for a more complicated one.

We can also test custom functions that we’ve written. I’ve written a silly little function that tells you whether a number is even or odd:

even_odd = function(n){
    ifelse(n %% 2 == 0, print('even'), print('odd'))
}

Then we can make sure it reports the correct output. We can do this even though it prints (instead of returning a value) with prints_text:

test_that('even_odd prints the right message', {
    expect_that(even_odd(1), prints_text('odd'))
    expect_that(even_odd(6), prints_text('even'))
})

After you have the library loaded in and you’ve set the working directory to the right place, you can run the testing suite by calling test_file('file_to_test.R'). The advantage to calling this function, rather than just sourcing the file, is that it will run through all tests even if some fail. When you source it, the test script halts at the first error. If you have multiple testing files, you can call test_dir('my_test_dir/') instead, and it will run all of the `test_` files in that directory.

Here’s what output looks like. You can see that all of our tests are grouped under their contexts, which is a nice way to organize things. Each little dot is a successful test, while a number indicates a failure (with more detail printed below).


When we look in detail at the failure, we see that it’s titled with the name of our test_that suite (another good reason to be granular with how you organize your tests!), and then we see what went wrong:


So the context tells us which group of tests threw the error, and then the more detailed error message tells us which test failed (“data types correct” in the “data integrity” context). It also tells us exactly why the test failed. We wanted the data type to be a character, but it was read in as a factor. This alerts us that we need to set stringsAsFactors equal to FALSE when reading in our data (for example).

Here are some general tips on writing tests from Wickham:

  • Focus on testing the external interface to your functions – if you test the internal interface, then it’s harder to change the implementation in the future because as well as modifying the code, you’ll also need to update all the tests.
  • Strive to test each behaviour in one and only one test. Then if that behaviour later changes you only need to update a single test.
  • Avoid testing simple code that you’re confident will work. Instead focus your time on code that you’re not sure about, is fragile, or has complicated interdependencies. That said, I often find I make the most mistakes when I falsely assume that the problem is simple and doesn’t need any tests.
  • Always write a test when you discover a bug. You may find it helpful to adopt the test-first philosophy. There you always start by writing the tests, and then write the code that makes them pass. This reflects an important problem solving strategy: start by establishing your success criteria, how you know if you’ve solved the problem.

The nice thing about testing is that you can re-run the entire, automated set of tests any time you make a change to make sure you didn’t break anything. It’s a lot faster and more consistent than print-debugging or command-line inspection, and it will save you time if you write your tests early (or even before) in the coding process.

Happy testing!

Intro to R Notebooks

I find R Notebooks to be pure joy.

Notebooks are really cool little programming environments that seamlessly combine executable code chunks, inline graphs and tables, LaTeX, and Markdown all in one place. If you like Python, you may already be familiar with IPython or Jupyter notebooks (which support R, in fact). R Notebooks are very similar, although I find them even easier to use. If you like, play along with a basic tutorial notebook I threw together.


Here is a brief, giddy introduction to some of the things I love about R Notebooks, in no particular order.

1. Inserting Code

It’s easy, and not even limited to R–you can insert Python, SQL, Stan, and a few others, too. R Notebooks will output everything inline (printing, tables, plots, etc.) and handle some automatic formatting. For instance, if you print a dataframe, it will automatically put it in this pretty format for you:


You can write R code just as you’re accustomed to. Load in libraries (as seen above), use other packages–anything you want. You can also specify a bunch of arguments in the header of a chunk to customize behavior, like including (or not) that code chunk and its results in the finished file, for example.

You can also do inline code with the `r` wrapper, which shows results inline but not the code. This is useful for referencing a variable that’s subject to change, for example; you can reference the variable name, and R will supply the data and format the text.
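For example, a sentence in the notebook source might read like this (the variable names are just for illustration):

```markdown
We excluded `r n_excluded` participants, leaving `r nrow(clean_data)` in the final sample.
```

When the notebook renders, each backticked expression is replaced by its current value, so the reported counts can never drift out of sync with the data.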

Before rendering:


After rendering:


A code chunk executes in the global environment, so other chunks all have access to anything defined by the others. You can even define global parameters in the header that get used throughout the document (you can also set these values when you knit if you knit with parameters).
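A parameterized header might look like this (the cutoff parameter and its value are just for illustration); anything listed under params: is then available as params$cutoff, and so on, in every chunk:

```markdown
---
title: "Analysis report"
output: html_notebook
params:
  cutoff: 200
---
```

You can override these values at knit time if you knit with parameters, without touching the document body.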

2. Markdown and LaTeX

The text outside of code snippets is Markdown text. This is a handy reference for basic Markdown syntax. Markdown can take some getting used to, but the results look amazing.

You can also do inline LaTeX formatting: just wrap it in math-mode ($…$) tags and format away. R Notebooks even support hover-previewing for single-line expressions.

Perfect matrices!


Greek letters!


Sensuous integrals!


Beautiful equations!


Toil no more with Word’s equation editor, good people.

3. Plots and tables

We’ve all experienced the nightmare that is trying to put a figure in a Word or Pages document. Not only does it end up in the wrong place, but it splits your paragraph apart and adds eight blank pages.

We can all wake up from this nightmare. In R Notebooks, plots are generated under their code chunk and stay there, right where you want them. To clean up the report, you can even hide the code chunk that generated it. No need to save the figures separately–you can just make them on the fly each time.


You can also format and output tables using a variety of packages, which likewise end up where they ought to.

4. Endless possibilities

There is so much nuance and customization you can add to these notebooks.

You can add a bunch of different parameters to the YAML header (including cool stuff like tables of contents), you can make or use templates, you can customize code chunk behavior globally or locally, you can set a bunch of global or local knitr options, and you can use HTML tabsets to turn sections into tabs.

You can turn them into interactive Shiny apps, or embed Shiny apps within them!

Citations and bibliographies are supported, although I have never used them!

Just look at this cheatsheet for a rough idea of what you can do, or all this cool stuff.

If you’re just getting started, or aren’t looking for anything too fancy, the basic notebook functionality will more than meet your needs. If you’re really savvy, you can get a remarkable amount of mileage out of the more hidden features of the notebooks. Either way, something for everybody.

5. Look how pretty!

Even if you’re a total scrub like me, you get really lovely results from these notebooks:


There are a ton of supported file formats, and they all look great. I think the HTML notebooks look the best, myself, but it also knits really nicely to PDF and other formats.

R Notebooks are extraordinarily useful. They’re great if you need to share results with collaborators or want to throw together a quick report on some data. They are super useful for code tutorials and walkthroughs, since you can show the raw code, the results, and also link to helpful resources within the text. If you’re still taking classes, they make for lovely statistics homework. Plus, because the .Rmd files are just text, they’re small and so are easy to version with Git.

They really are a great tool. Easy to use if you just want the basics, but with a lot of customization available if you want to go deeper. The results look great, and as it becomes more important to have a clear, transparent, and reproducible workflow, R Notebooks are a great tool to have in your arsenal.

I’ve made a tutorial notebook available on GitHub for view and/or download (although the LaTeX doesn’t render properly in the GitHub preview) and RPubs (for a better render).

Data in the raw: Violin plots

One of the ways we can increase transparency in science, in addition to posting our data and materials and pre-registering our methods, is to include more information about our raw data in our write-ups and reports. The easiest way to do that is to just show it in a visualization.

The bar chart with error bars (usually ± n * standard error) is a classic plot type, but it obscures a lot of information about the underlying distribution that generated it. Scatterplots and histograms show more of the raw distribution, but they can be messy and hard to cleanly overlay with summary statistics.

Fortunately, violin plots bring together the informativeness of a histogram with the cleanliness of a bar chart, and they can be easily overlaid with summary statistics, error bars, and other information without too much additional clutter.

First, some data:


dat = data.frame('condition'=c(rep('t1', 30), rep('t2', 30)),
                 'value'=c(rnorm(30, 10, 3), rnorm(30, 20, 7)))

Here is the simplest incarnation of a violin plot, for two normally distributed groups:

basic_violin = ggplot(data=dat, aes(x=condition, y=value)) +
    geom_violin(aes(fill=condition, color=condition))


These might be more accurately called “jug” or “vase” plots, since they rarely make pretty violin shapes and more often tend to look like postmodern sculpture. Name aside, a violin plot is a rotated, symmetric kernel density plot that shows the density of points at different values. Where the plot is wide, there is a high density of points; where it is narrow, a low density of points (like height on a histogram). We can see here that group T1 is less variable than T2; T1’s violin is short and squat, meaning most of the points are massed in a small region. T2, on the other hand, is tall and narrow, meaning the points are spread more thinly along a wider range of values.

These guys look a little sparse, though. Why don’t we dress them up a smidge?

Maybe you like a point mean and 2*SE error bars:

errbar_lims = group_by(dat, condition) %>%
    summarize(mean=mean(value), se=sd(value)/sqrt(n()),
              upper=mean+(2*se), lower=mean-(2*se))

mean_se_violin = ggplot() +
    geom_violin(data=dat, aes(x=condition, y=value, fill=condition, color=condition)) +
    geom_point(data=dat, aes(x=condition, y=value), stat="summary", fun.y=mean,
               fun.ymax=mean, fun.ymin=mean, size=3) +
    geom_errorbar(aes(x=errbar_lims$condition, ymax=errbar_lims$upper,
                      ymin=errbar_lims$lower), stat='identity', width=.25)


Or maybe boxplots are your jam:

boxplot_violin <- ggplot(data=dat, aes(x=condition, y=value)) +
    geom_violin(aes(fill=condition, color=condition)) +
    geom_boxplot(width=.1, outlier.shape=NA)


Here’s the way I like to do my violin plots, with each subject’s point plotted plus a horizontal line for the mean. I add a little horizontal jitter to each point to make things easier to see:

scatter_violin <- ggplot(data=dat, aes(x=condition, y=value)) +
    geom_violin(aes(fill=condition, color=condition)) +
    geom_crossbar(stat="summary", fun.y=mean, fun.ymax=mean, fun.ymin=mean, fatten=2, width=.5) +
    geom_point(color="black", size=1, position = position_jitter(w=0.05))


There are a few things I really like about violin plots. One, they add a lot more information without taking up any more space than a bar plot would. Two, they give you an intuitive way to look at the distributions of your data. I don’t think many of us are accustomed to seeing data this way, and it goes to show that even well-behaved data doesn’t necessarily look like we might expect. T1 is just as normal as T2, but T1 “looks” a lot more normal than T2 does. You lose a lot of the characteristics of the data with a bar plot:

bar_plot <- ggplot(data=errbar_lims) +
    geom_bar(aes(x=condition, y=mean, fill=condition, color=condition), stat='identity', position='dodge') +
    geom_errorbar(aes(x=condition, ymax=upper, ymin=lower), stat='identity', width=.25)


It captures the broad strokes, to be sure, but a lot of the finer details disappear.

So there you go. Violin plots! Dress ’em up, dress ’em down. However you like them, they’re a nice plot type to have in your arsenal.

A primer on data wrangling

I found this Twitter thread on the vagaries of data wrangling killing momentum interesting, particularly the notion of how frustrating it must be to be at point A with your data, see Point B, where you’d like to go, and have no idea how to get there. Even for those with programming experience, data wrangling can be an enormous chore.

To that end, I thought I’d walk through a basic overview of how to accomplish some of the operations you might commonly encounter when you first get a data set. If you’re generating the data yourself, you can try to make your life easier by saving it in the format you want.

This blog post is also a notebook on GitHub, so you can download it and play with it as much as you like. It’s also a lot prettier than what WordPress let me get away with (I’ve made a poor reproduction of it here).

Where possible, I’ll show multiple ways to accomplish something and try to highlight packages that will make things easier.

MANDATORY DISCLAIMER: There are often at least six ways to do anything in R. If you don’t like any of the methods here, rest assured that there are others (and probably better ones, too); you can almost certainly find something that suits your style.

Reading in data

CSV Files

Here’s a common situation: one spreadsheet has all of a subject’s raw data, and you have a few dozen spreadsheets. First, let’s talk about an easy way to read that in as painlessly as possible. No for-loops needed here; we’ll just use the trusty apply() family from base R.

#Note: if your data are not in your current directory, either call,
#e.g., setwd('~/my_data_folder') to make the data folder the current
#directory, or specify the path in list.files() and have it return the
#full path of each file rather than the relative path.
alldata = lapply(list.files(pattern = '\\.csv$'), read.csv)

We’re accomplishing a few things in one line. First, the call to list.files simply lists all of the files in your current directory. It has a bunch of optional arguments, too. You can specify a pattern, a regular expression describing the file names you want. Here, I only want .csv files, so I use '\\.csv$' (the \\. matches a literal dot, and the $ anchors the match to the end of the file name). I can specify other arguments, like whether I want the full path names returned, whether I want it to also search sub-directories for files, and so on. After we have this list of files, we simply iterate over it and call read.csv() on each one. The end result is a list, wherein each element is one subject’s data frame.

Now, a list of data frames is not the most useful data format to have. Fortunately, it’s easy to bind this list together into one big data frame. Here’s how to bring it all together in base R.

subjects_all ='rbind', alldata)
head(subjects_all)


Note that the information about which data belong to each subject is lost here. You’ll need to add an identifier column, or make sure that each file has one, before reading it in this way.
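One simple way to add that identifier in base R, using a toy stand-in for the alldata list from above:

```r
# Toy stand-in for the list of per-subject data frames built by lapply()
alldata = list(data.frame(score = c(1, 2)), data.frame(score = c(3, 4)))

# Tag each data frame with its position in the list (substituting the
# file name would give a more informative ID), then bind as before
alldata = lapply(seq_along(alldata), function(i) {
    df = alldata[[i]]
    df$subject = i
    df
})
subjects_all ='rbind', alldata)
```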

How about some good old dplyr?

subjects_all = bind_rows(alldata, .id='subject')


We can use the .id argument to specify an ID column, which will keep track of which data frame each row came from.

We can also use the handy rbindlist function from the data.table/dtplyr package. This will label the data automatically for us according to which data frame it came from; we can call this new column (specified by the id argument) anything we like.

subjects_all = rbindlist(alldata, idcol='subject')

Note also that rbindlist() is an order of magnitude faster than If you’ve got a lot of data, you’ll probably want to go with this function. data.tables are extremely fast and memory-efficient in general, and might be a good option if you’re working with truly huge amounts of data. For most uses, though, this kind of optimization isn’t really necessary.

Text files

To read it in, we just call read.table instead.

allsubjs = lapply(list.files(pattern = '*.txt'), read.table,
                  header=TRUE, colClasses=c('double'))

Just like before, we end up with a list of data frames, one for each subject. In read.table(), unlike read.csv, the header argument defaults to FALSE, so be sure to change that. I also specify colClasses here to tell R what type of data the content of the columns is. Without that, these doubles get read in as factors; doing it now saves a little work later.

We can then bind these together with rbindlist just like we did when we used read.csv.


Need to read in Excel files, or read in each sheet in one file as a separate set of data?

Wickham’s got your back.

Some general notes

There are a lot of arguments you can specify in the read.csv function call that can save you work down the line–I used some of them when I was reading in the text files, but there are many more. You can even tell the function what strings should be read as NA values! This is really handy if you have NULL and want R to treat it as NA. You can also read in only part of the file, which is useful if you have a monster file and want to read it in in chunks.
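Here are both of those tricks in one call, writing a throwaway file first so the snippet is self-contained (the file contents and column names are just for illustration):

```r
# A small file where 'NULL' marks a missing value
f = tempfile(fileext = '.csv')
writeLines(c('id,score', '1,10', '2,NULL', '3,30', '4,40'), f)

# na.strings: treat 'NULL' as NA on the way in
# nrows: read only the first 3 data rows of a (potentially huge) file
dat = read.csv(f, na.strings = 'NULL', nrows = 3)
```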

Reshaping Data

It’s helpful to be able to switch at will between wide format, where each row is a subject and each column contains a variable, and long format, where each row is a single value at a given time or measure (for repeated measures).

Here’s a very simple data set. Each row is a subject’s scores. Column 1 is their subject number, followed by their scores in the control, treatment 1, and treatment 2 conditions. We tend to be most accustomed to seeing data this way. This is “wide” format.

traits = data.frame('id'=seq(1, 10),
                     'control'=floor(rnorm(10, 30, 5)),
                     'treat1'=floor(rnorm(10, 10, 2)),
                     'treat2'=floor(rnorm(10, 15, 3)))

Wide to Long

Now, when we reshape this to “long” format, we will have the id column (the subject number), a variable column (in this case, which test was taken), and the value column (the score on each test). Here it is, melted two ways. In base:

traits_long_base = reshape(traits, idvar="id", direction='long', 
                            v.names=c('score'), timevar='test',
                            times=c('control', 'treat1', 'treat2'), 
                            varying=seq(2, 4))

We have to be pretty careful about specifying our arguments here. The idvar names the column that identifies each subject across the repeated measures. Here, we want the subject number; we want each subject’s score on each test labeled with their unique ID. Direction is fairly self-explanatory; we’re going to long form here. v.names is the name (or names) of the new value columns. Here, we’re collapsing everybody’s scores into a single column, so we call it 'score'. timevar is the variable that changes over time, or over repeated measures. Here it’s which test they took, so we call the new column 'test'. Then we tell it which values to use in this new column with times; we want the name of the test. Finally, varying says which columns of our data vary over time/are our repeated measures; here it’s the final three columns (you can also specify a vector of strings).

Here are the same results with the melt() function from data.table or reshape2. We specify again which column represents our data labels, and then we tell it which columns we want it to treat as our “measures,” which in our case is our three tests (if unspecified, it just uses all non-id variables, so we could have left it out here):

traits_long_m = melt(traits, id.vars="id", 
                    measure.vars=c('control', 'treat1', 'treat2'),
                    variable.name='test', value.name='score')

And we get the same results with tidyr:

traits_long_t = gather(traits, key=test, value=score, control, treat1, treat2)

Here, the key/value pairing tells us about our outcome columns, and then we just list the columns to gather up.

Three roads, same destination. I find melt and gather both much more intuitive than reshape, with gather the easiest of them all to use, but your mileage may vary.

Data is really easy to plot this way:

plot = ggplot(traits_long_t, aes(x=test, y=score, color=test)) +
    geom_point()  # or whatever geom suits your data

Simplicity itself.

Long to Wide

Now, if we want to go the other direction (long to wide), in base R, we call the same function with different arguments:

traits_wide_base = reshape(traits_long_base, direction='wide', 
                            timevar='test', idvar='id')

Now we have the original structure of our data back.

The inverse of melt() is dcast:

traits_wide_m = dcast(traits_long_m, id ~ test, value.var='score')

Right back to where we were.

And to undo gather, we spread:

traits_wide_t = spread(traits_long_t, test, score)

Reshaping with more variables

Here’s a more complex example.

traittest = data.frame('traitA'=factor(rep(c('high', 'med', 'low'), each=4)),
               'traitB'=factor(rep(c('positive', 'negative'), times=6)),
               'test1'=floor(rnorm(12, 10, 2)), 
               'test2'=floor(rnorm(12, 15, 2)))

There are a lot of ways to melt this data. Maybe we want to collapse the tests into a single column–in this case the traits are the identifier variables.

In base:

tt_bytrait_base = reshape(traittest, direction='long', v.names='score',
                           timevar='test', times=c('test1', 'test2'), 
                           varying=c('test1', 'test2'))

With melt():

tt_bytrait_m = melt(traittest, measure.vars=c('test1', 'test2'), 
                    variable.name='test', value.name='score')

With gather:

ttest_bytrait_t = gather(traittest, test, score, test1, test2)

Or, we can let the tests be the identifiers, and collapse the
traits into a single column.


In base:

tt_bytest_base = reshape(traittest, direction='long', v.names='rating',
                          timevar='trait', times=c('traitA', 'traitB'),
                          varying=c('traitA', 'traitB'))


With melt():

tt_bytest_m = melt(traittest, measure.vars=c('traitA', 'traitB'),
                   variable.name='trait', value.name='rating')

With gather:

tt_bytest_t = gather(traittest, trait, rating, traitA, traitB)

Reformatting Data

So we’ve read data in, and can flip it between long and wide at will. Great, but what if the data itself needs to be fixed?

Recoding values

Let’s say you have some data that look like this:

yesno = data.frame('subj'=seq(1,10), 'resp'=rep(c('Y','N'), each=5))

So we have 10 subjects, and each one responded either yes (Y) or no (N) to… something. But maybe we don’t like the way this is coded; Y and N are hard to work with if we want to find average accuracy, for example. Maybe we want 1’s and 0’s instead, which make calculations easy.

If we want to recode these values, we have a few options. We can use indexing, of course, but there are also some functions that will save you some work.

Base has the ifelse function, which performs a logical comparison, and if true, returns the first value; else, the second:

yesno$resp = ifelse(yesno$resp == 'Y', 1, 0)

If we have more than two alternatives, you’ll have to use something like a switch statement:

yesnomaybe = data.frame('subj'=seq(1,15), 
                         'resp'=rep(c('Y','N','M'), each=5))


Now we have three options. Maybe we want ‘yes’ to be 1, ‘no’ to be -1, and ‘maybe’ to be 0. Here’s how you can do it with a switch statement and sapply to call it on each element:

yesnomaybe$resp = sapply(yesnomaybe$resp, 
                             function(x) switch(as.character(x), 
                                          'Y'=1, 'N'=-1, 'M'=0))

In dplyr, we have the recode function; here, it maps the numeric codes back to labels:

yesnomaybe$dplyr_recode = recode(yesnomaybe$resp, 
                           `1`='yes', `-1`='no', `0`='maybe')



Recoding, assuming you don’t have to do it for a huge number of possibilities, goes pretty fast.
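If there are many possibilities, dplyr's case_when() (available in newer versions of the package) scales a bit more gracefully; here's a sketch using the same yes/no/maybe coding:

```r
library(dplyr)

yesnomaybe <- data.frame('subj' = seq(1, 15),
                         'resp' = rep(c('Y', 'N', 'M'), each = 5))

# each left-hand condition maps to the value on the right
yesnomaybe$coded <- case_when(yesnomaybe$resp == 'Y' ~ 1,
                              yesnomaybe$resp == 'N' ~ -1,
                              yesnomaybe$resp == 'M' ~ 0)
```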

Adding variables

Variables can be added to an existing data frame just with the $ operator:

df = data.frame('x'=rnorm(20, 6), 'y'=rnorm(20))
df$z = rnorm(20, 10)


If you need to manipulate two data vectors that are numeric, you can just add, multiply, etc. your columns together to perform these operations elementwise:

df$total = with(df, x + y + z)


You also have a lot of options beyond this, notably transform from base R:

df = transform(df, x = -x)

But now that we’ve updated a column, our total is wrong. Let’s fix it with dplyr’s mutate:

df = mutate(df, corrected_total = x + y + z)


Maybe I now want a dataframe just of the even numbers in the x column, and the residuals from total and corrected total (for… reasons). transmute is like mutate, but it throws away all the extra:

df_even = transmute(df, x_ev = floor(x) %% 2 == 0,
                    resid = total - corrected_total)

If none of these methods fit the bill, you can call apply along all the columns or rows of your data frame and write a custom function to do whatever processing you need.
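For instance, a minimal sketch of sweeping a custom function down the columns (MARGIN = 2) of a made-up data frame:

```r
df <- data.frame(x = rnorm(20), y = rnorm(20), z = rnorm(20))

# compute the range (max - min) of every column
col_ranges <- apply(df, 2, function(col) max(col) - min(col))
```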

Factor levels as column labels

Let’s take the unfortunate case of levels-as-columns, in which all the levels of a factor are columns, and people get a 1 or a 0 for each level instead of their value. Here’s some example data:

levs = data.frame('subj'=seq(1, 4), 'a'=c(0, 0, 1, 0), 
                                    'b'=c(1, 0, 0, 1), 
                                    'c'=c(0, 1, 0, 0))

So, what we have are four subjects, and a factor with three possible levels: A, B, and C. What we want is the subject and the actual level of their factor, so we need a two-column data frame.

Here’s one way we might do that (there are others) that uses some procedures I’ve already shown. First, we’ll reshape the dataframe so that the factor levels end up in one column. This has the advantage of putting the actual values of the factors we want all in one place. Then we filter out the 0s, leaving behind only the levels the subjects actually selected, drop the now-redundant column of 1s, then put the subjects back in the right order.

For these examples, I’ll print out each intermediate stage of manipulation so that you can see what’s happening.

All about that base:

(lev_long = reshape(levs, idvar='subj', direction='long', v.names='value', 
         timevar='trait', times=c('a', 'b', 'c'), varying=c('a', 'b', 'c')))


First, we reshape the data. We need all of the factor-related pieces of information in a single column. We have a column with the possible factor levels, and a column indicating 0 (not the subject’s level) or 1 (the subject’s level).

(lev_filtered = with(lev_long, lev_long[value == 1, 1:2]))

The second step just uses good old-fashioned indexing to keep all rows where the value is 1 (aka, the subject has that level), and to keep only the useful subject and trait columns; the with function tells R to perform all operations within the supplied data set, so we can reference columns by bare names rather than the verbose data_frame$column syntax.

(lev_reformed_base = lev_filtered[order(lev_filtered$subj),])

The final step is reordering the data according to the subject column in ascending order. Now we’ve got our data in a much more sensible format.

Tidyr and dplyr make quick work of this. First, we gather:

(lev_g = gather(levs, trait, value, a, b, c))

Filter out the 0s:

(lev_f = filter(lev_g, value != 0))

Retain only the useful columns:

(lev_s = select(lev_f, subj, trait))


Finally, put the subjects back in order:

(lev_reform = arrange(lev_s, subj))

Here are those steps strung together with piping, which obviates the need for all those separate variable assignments:

levs_reformed = gather(levs, trait, value, a, b, c) %>%
                filter(value != 0) %>%
                select(subj, trait) %>%
                arrange(subj)
What about if we have multiple factors? Here we have a test and a report, each of which has three possible levels: ABC and XYZ, respectively.

mfac = data.frame('subj'=seq(1, 4), 'test.A'=c(0, 1, 0, 1), 
                   'test.B'=c(1, 0, 0, 0),
                   'test.C'=c(0, 0, 1, 0), 'report.X'=c(1, 0, 0, 0),
                   'report.Y'=c(0, 1, 1, 0), 'report.Z'=c(0, 0, 0, 1))

So what we want is a dataframe with three columns: subject number, test, and report. Subject 1 picked test B and report X, subject 2 picked test A and report Y, and so on.

This gets a little more complicated. If we collapse everything into one column, we’re going to have to then spread it back out to separate the factors. We’ve also got the item label merged to its type, which is a problem if we only want the letter designation.

Let’s try with base. Here’s the reshape-filter method:

mfac_long = reshape(mfac, idvar='subj', direction='long', v.names='value', 
                          timevar='measure', times=colnames(mfac)[-1], 
                          varying=colnames(mfac)[-1])
mfac_filt = with(mfac_long, mfac_long[value == 1, 1:2])
type_splits = do.call(rbind, strsplit(as.character(mfac_filt$measure), '.', fixed=TRUE))
mfac_sep = data.frame('subj'=mfac_filt$subj,
                      'type'=type_splits[,1], 'value'=type_splits[,2])
mfac_wide = reshape(mfac_sep, idvar='subj', direction='wide', timevar='type')
(mfac_reformed_base = mfac_wide[order(mfac_wide$subj),])


Pulling this off takes more finagling. Things are fine when we reshape and filter (note the trick used to save some verbiage in reshape(); indexing with a negative excludes that item, so we’re saying we want all column names except the first), but then we have to recover whether our factor was a test or a report separately of its type. This means we have to split the string using strsplit, bind the results into a matrix (because they automatically come out as a list), and then take those newly-made factors and reshape them wide again with the test type and report type as their own columns. One nice thing about this approach, in spite of its many steps, is that it’s totally blind to the content of the labels (provided they are consistently delimited). If they’re labeled in a cooperative way, you don’t need to know how many labels there are or what they say, and they can be in any order.

Here’s another base approach, from my BFF Kelly Chang. This one uses the apply function to sweep a filter down the dataframe, then repackage the results:

labels = c('A', 'B', 'C', 'X', 'Y', 'Z')
filtered = t(apply(mfac[,2:ncol(mfac)], 1, function(x) labels[x==1]))
mfac_kc = data.frame(mfac$subj, filtered)
colnames(mfac_kc) = c('subj', 'test', 'report')

Here, you would supply the labels, rather than recovering them from the data itself (as was done in the previous approach). Here, order is important; the labels need to be in the same order as the corresponding columns for the filter to work.

With tidyr and dplyr, this approach can look something like this (still agnostic to the label content):

mfac_reformed = gather(mfac, measure, value, -subj) %>%
                filter(value != 0) %>%
                select(subj, measure) %>%
                separate(measure, c('test', 'type')) %>%
                spread(test, type) %>%
                arrange(subj)

The first few steps are the same; melt everything down and toss the zero values. Then, we need a step to yank apart the measure’s letter designation and its type. Fortunately, tidyr has a handy separate function that does just this; it pulls apart the joined values into two columns that we can label right away. Then, we need to spread our now distinct factor types back into columns–one for the test and one for the report–and sort by subject.

Note also that the intermediate steps in this last example–separating the two factors from the merged report.X format by splitting the string and reshaping–can be useful on their own if you have data in this form, or if you have one big code for a condition or trial and at some point want to split it into its components. You can also use the colsplit() function from the reshape2 package for this purpose.
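A quick sketch of colsplit() on labels like the ones above (the vector here is made up):

```r
library(reshape2)

measures <- c('test.A', 'report.X', 'test.B')

# split each label at the dot into two named columns
parts <- colsplit(measures, pattern = '\\.', names = c('type', 'value'))
```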

Parting Thoughts

And there you have it–a brief introduction to some common data manipulation tasks, and a few ways to handle them. This is only the thinnest of samples of methods. There are lots of different ways to accomplish things, and packages to help you do it. Many of these methods will undoubtedly have my fingerprints all over them; one of the reasons I approached these problems the way I did is to show how learning a skill in one context–reshaping data for plotting, for example–can be useful in other contexts, like changing a data frame’s fundamental structure. Many roads lead to the same place, and if you don’t like this one, another will get you there just as comfortably, if not more so.

Intro to Bootstrapping

The bootstrap, detailed by Efron and Tibshirani in their 1993 book An Introduction to the Bootstrap, is a powerful, flexible statistical concept based on the idea of “pulling yourself up by your bootstraps.” With today’s computing power, it’s easier than ever to apply and use. It also happens to be one of my favorite statistical tricks because of the elegance of the underlying logic.

The idea behind the bootstrap is simple. We know that if we repeatedly sample groups from the population, our estimate of that population will get increasingly accurate, becoming perfect once we’ve sampled every member of the population (and thus obviating the need for statistics at all). However, this world, like the frictionless worlds of physics problems, doesn’t resemble operating conditions. In the real world, you typically get one sample from your population. If only we could easily resample from the population a few more times.

Bootstrapping gets you the next best thing. We don’t resample from the population. Instead, we continuously resample our own data, with replacement, and generate a so-called bootstrapped distribution. We can then use this distribution to quantify uncertainty on all kinds of measures.

I’ll work through an example to show how this works in practice. We’ll sample from the normal to start, μ = 1.25, σ = 0.5. You can find the code for this on GitHub; I’ve set the seed within the script so you’ll get exactly the results I find here.

Let’s say we run a study with 50 people, and we’re interested in, among other things, getting a good estimate of the mean of the underlying population. So, you take your 50 people, take a mean, get a standard error, and maybe 95% confidence intervals to show the variation (± 1.96*SE).

[Just to note: bootstrapped confidence intervals give you a nice measure of uncertainty, and I like to use them as my error bars, but they don’t get you away from the problems of NHST, if that’s something you’re trying to avoid.]

Here’s a histogram of our sample (I’m not using ggplot2 for this out of unadulterated laziness):


Here are the stats for our mean, calculated the old-fashioned way from an n of 50:

mean: 1.27
SE: 0.18
95% CI: [1.14, 1.40]
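Computed the old-fashioned way, that looks like this (a sketch with a hypothetical seed, so the numbers won't exactly match those above):

```r
set.seed(42)  # hypothetical seed, not the one in the original script
x <- rnorm(50, 1.25, 0.5)

m  <- mean(x)
se <- sd(x) / sqrt(length(x))
ci <- m + c(-1.96, 1.96) * se  # normal-theory 95% CI
```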

Now, let’s bootstrap this mean. A way to conceptualize this is to imagine re-running your 50-person experiment over and over again, except we’re not going to be drawing new subjects from the population. Instead, we’re going to draw groups of 50 subjects from our sample, but do so with replacement. This means that some subjects may show up more than once, and some may never show up at all. We’ll take the mean of each group. We’ll do this many times–here, 1000–and so we end up with a distribution of 1000 means.

For this basic procedure, we can do it in two lines of code:

x <- rnorm(50, 1.25, .5)

boot <- replicate(1000, mean(sample(x, 50, replace=TRUE)))

And here’s the resulting histogram:



The bootstrapped mean (1.27) is of course centered on our sample mean. Now we have a distribution of values for the mean that we could expect to obtain with our sample.

Now, if we wanted to do confidence intervals, we could get the SE of the bootstrapped distribution and do it the usual way. However, we also have an option available to us called the percentile method. In order to find the confidence intervals, we sort all of our bootstrapped values, and then take the 2.5% quantile and the 97.5% quantile. If you repeat the sampling procedure with the same process on the same population (precisely what we do with our bootstrap), 95% of the time the mean falls between these endpoints. If we do this procedure with our particular sample, we get this:
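In code, the percentile method is a single call to quantile() on the bootstrapped distribution (hypothetical seed again, so your exact endpoints may differ):

```r
set.seed(42)  # hypothetical seed
x <- rnorm(50, 1.25, 0.5)
boot <- replicate(1000, mean(sample(x, 50, replace = TRUE)))

# the sorting is implicit: quantile() finds the 2.5% and 97.5% cut points
boot_ci <- quantile(boot, c(0.025, 0.975))
```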

95% bootstrapped CI: [1.14, 1.40]

Here, the bootstrapped CI and the standard-error derived CI match up (not that surprising, since our data is normally distributed and therefore exceptionally well-behaved!). However, if you have constrained data, such as accuracy, bootstrapped confidence intervals will automatically conform to those limits and won’t exceed 100% or dip below 0%.

Here’s an example of bootstrapping the estimate of the effect size. We’re going to assume a between-subjects design here. We’ll make two independent, normally distributed groups with a true effect of .5. I’m going to let the cohen.d() function from the effsize package do the heavy lifting and give me both the estimate and the confidence interval:
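A sketch of what that call might look like (hypothetical seed, so the estimate won't exactly match the numbers reported below; $estimate and $conf.int are the fields effsize returns):

```r
library(effsize)

set.seed(42)  # hypothetical seed
exp_grp <- rnorm(50, .5, 1)
control <- rnorm(50)

d <- cohen.d(exp_grp, control)
d$estimate  # point estimate of d
d$conf.int  # 95% confidence interval
```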

95% CI: [.15, .96]

Here’s how we would bootstrap those same intervals. We resample each group independently, because it’s between-subjects. The procedure is much the same if it’s within subjects, except you would resample your pairs of data because the two groups are not independent in that case. The code:

groups <- data.frame('exp'=rnorm(50, .5, 1), 'control'=rnorm(50))

boot_es <- replicate(1000, cohen.d(sample(groups$exp, 50, replace=TRUE),
                                   sample(groups$control, 50, replace=TRUE))$estimate)

And here’s the histogram of the bootstrapped distribution, with the mean and bootstrapped CI plotted:


bootstrapped mean: .56
95% bootstrapped CI: [.14, .99]

And there you have it: bootstrapped confidence intervals on your effect sizes.

There is an incredible amount of nuance in the bootstrap method, and many different ways to apply it to even the most complex data. You can use it for something as simple as confidence intervals on a mean, or as complicated as regression coefficients. It’s also great for getting CI estimates for statistics or measures that don’t have a formula method. The general idea–resample your own data to get estimates–underlies even the most complex applications of this method. If you understand the basic logic, it’s pretty easy to understand and start using new applications.

If you are mathematically inclined, there have been many proofs and a lot of work showing the first- and second-order correctness of bootstrapping estimations. The method I showed here for confidence intervals is just the percentile method; although it has been shown to work well in a wide variety of situations, there are other approaches, some of which offer bias correction and other improvements (such as the bias-corrected and accelerated, or BCa, intervals).

If you want to get started, there are some R packages that will handle bootstrapping for you:

boot (general-purpose, and can generate many variants of the various methods)
bootES (for effect sizes in particular)

And of course, if you want a really deep dive, you can check out the original book:

Efron & Tibshirani, An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.

Program better, for fun and for profit

Psych researchers have a bit of a reputation for being, shall we say, less-than-delicate programmers. It’s common to hear “it doesn’t matter if it’s ugly, as long as it works.”

I took computer science classes as an undergrad, and style was rigidly enforced. I had one notoriously strict professor who would dock up to a grade on an otherwise completely functional project if it was ugly. It wasn’t just simple elitism; ugly code is often inefficient code, and ugly code is hard to read and understand.

Code quality is something I’m constantly working on. You can see the development in my scripts; I only recently started using dplyr and the rest of the tidyverse in R, and what a difference it’s made to the quality of my code. I cringe a little, looking back at my earliest scripts (and they’re a matter of public record, forever). Cringe is good, though. Cringe signals improvement, and wisdom gained.

I thought I’d share a few of the practices that were drilled into me during my CS education that have helped improve the style, quality, and readability of my code.

1. Comment your code.

Please. If not for others, then for your future self. You will neither remember nor understand your code in the future as well as you think you will. I don’t do it as thoroughly as I ought; I’m not sure anyone does. This is easy to change, and doesn’t take much effort.

Functions should be commented with what their inputs are, what it does to those inputs, and what it returns. For a gold star, you can include examples of input and output.

For example:


# A function that takes in a list of integers x
# and returns the arithmetic mean in floating point
def mean(x):
    return sum(x) / float(len(x))

Global variables should be commented with what they are and how they’re used, so that you can change them without having to dig back through the code to make sure you understand what the variable does.

Commenting code makes it much easier for others to understand, and it cuts way down on re-learning if you have to go back and revisit old code.

2. Use sensible variable and function names.

It very rapidly becomes impossible to tell what is happening when all variables are named x, x1, x2, or tmp. While we want variable names to be succinct, they should also make sense and be recognizable. Degrees of freedom can be df. A subject’s height could be subj_height, rather than, say, h or s_h.

This is also good to do when you’re subsetting data. You don’t want to confuse yourself or others about which variable represents the full dataset, and which represents a subset!

Functions should also, ideally, be named after what they do. cartesian_to_polar(x, y) is obvious (if you know your coordinate systems); c2p(x, y) less so.

3. Avoid hardcoding.

“Hardcoding” is the practice of writing explicit, fixed values into functions and scripts instead of variables. So if you had a run_experiment function, hardcoded to do 100 trials, it might look like this:

def run_experiment():
    for i in range(100):
        run_trial(i)  # run_trial() stands in for the actual trial logic


And then maybe at the end of the script, you have to reference the number of trials again, maybe to calculate percent correct:

#let's assume for convenience that yes_responses is a list of bools

correct_resp = sum(yes_responses)/100

This works fine, but what if you decide to change the number of trials? Then you’ll have to hunt back through every place you used 100 and change it. What would be a lot easier is to define a variable, num_trials, at the beginning of your script. Then, every time you need to reference this number, use num_trials rather than the hard number. Then, if you change your mind, you only have to change the value of num_trials to change its value everywhere else in the script.

This is especially relevant for experiment scripts, in which values might change over the course of development, or need to change during the experiment itself with the condition or trial type. It’s much more convenient to have all of your parameters mapped to variables in one place so that you only have to change them once to change behavior everywhere. Changes become easy and quick, and it will save heartache.
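The same idea in R (sticking with this post's usual language; the variable names are made up for illustration):

```r
num_trials <- 100  # the single place this number lives

# hypothetical data: TRUE for a yes response
yes_responses <- sample(c(TRUE, FALSE), num_trials, replace = TRUE)

# every later reference uses the variable, not the literal 100
pct_correct <- sum(yes_responses) / num_trials
```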

4. Think modular.

Break routines and operations into functions, particularly if you have to do them over and over again. For example, if you’re writing an experiment, you might want to have a function that handles a single trial, with some inputs that can be adjusted. Maybe you have long-exposure trials and short-exposure trials, for instance. It’s nice to be able to call do_trial(long_ontime) or do_trial(short_ontime), rather than having all of that logic embedded in one monster script. If you need more flexibility, just write the function to accept more variables and pass them in.

If you have a function that you use a lot (I have one for saving out data), you can keep it in a separate file and source it in each time you need it, rather than rewriting it each time. Being able to re-use your code saves time and effort.

5. Be succinct.

Often, there’s a verbose way to do something, and a concise way to do something. For instance, in Python, you can very often replace for-loops with list comprehensions. In R, just about every for-loop can be replaced with a combination of calls to the venerable apply family of functions.
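A tiny before-and-after sketch of that replacement:

```r
x <- 1:10

# verbose: an explicit for-loop
squares_loop <- numeric(length(x))
for (i in seq_along(x)) squares_loop[i] <- x[i]^2

# concise: the same thing with sapply
squares_apply <- sapply(x, function(v) v^2)
```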

With the tidyverse, massive, ungainly nested calls are much cleaner. Here’s a snippet of some code I wrote with dplyr. Here, I start with a giant data frame of individual trials, each labeled with which pair of subjects it belongs to. I need to add a variable that marks whether each trial is “correct” or “incorrect” based on my criteria, and then take the mean percent correct so that each pair of subjects has a single value, tally up the number of trials completed, and spit it out as a summary table.

summary_stats_subj <- mutate(isubjs, correct = (objects==1 & diff_judgment == 'easy') |
                                               (objects==5 & diff_judgment == 'hard')) %>%
    group_by(pair) %>%
    summarize(pct_corr = mean(correct), n = n())

And here’s what that same code would look like without dplyr:

isubjs$correct <- (isubjs$objects == 1 & isubjs$diff_judgment == 'easy') | (isubjs$objects==5 & isubjs$diff_judgment == 'hard')

summary_stats_subj <- cbind(tapply(isubjs$correct, isubjs$pair, mean),
                            tapply(isubjs$correct, isubjs$pair, length))

That’s a lot uglier, for one thing. I have to do a lot more indexing into my dataframe, I have to call tapply twice (unless I wanted to write a custom function), I have to manually assemble things. With dplyr, I can create new variables with one call, don’t have to repeatedly reference the objects I’m working with, and then I can immediately pass the new objects or values into any subsequent calls. You read dplyr code left to right, rather than inside out, as you have to do with nested calls like mean(rowSums(data[,data$condition == x])). It’s more intuitive and a lot less verbose.

6. Unit-test.

This is an example of doing more work now to do less work later. I’ve fallen out of the habit of unit-testing, but with the new year comes an opportunity to return to core values.

In unit-testing, you write tests in a separate script for little pieces of your code. This is easy in Python, and there’s a nice R package for it too.

The idea is that you test each piece of code (functions, object classes, etc) as you write it and verify that it works. Even better, you can write tests for code first, before you even write the code itself. Tests make sure that code runs without crashing, that the output it gives you matches what you expect, that objects have the methods and attributes you want and that they all do what you expect, and so on.

Why bother? For one, it creates a nice feedback loop with modularity. Writing code in nice little packages makes it easy to test, which encourages writing code in nice little packages, etc. Two, it will save you a ton of time during debugging. If you write an entire script through, try to run it, and almost inevitably encounter bugs, you then have to search through a lot of possible failure points. Usually that means having to verify that all of the pieces work anyway, in order to zero in on the culprit(s). With unit testing, you know right away whether something is working and if it’s working correctly. This gives you a chance to fix things before you run the whole thing through.
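Here's a minimal sketch with testthat (assuming that's the R package meant above; the function under test is invented for illustration):

```r
library(testthat)

# a toy function to put under test
cartesian_to_polar <- function(x, y) {
  c(r = sqrt(x^2 + y^2), theta = atan2(y, x))
}

test_that('cartesian_to_polar handles a point on the x-axis', {
  result <- cartesian_to_polar(1, 0)
  expect_equal(unname(result['r']), 1)
  expect_equal(unname(result['theta']), 0)
})
```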


Following good practices only nets benefits. It makes your old code more accessible to yourself. Possibly more critically, it makes your code more accessible to others. Open data and open materials are becoming more common, and it’s not enough just to throw uncommented code up on Github and call it duty done. Part of that responsibility to openness is making code readable, understandable, and transparent. These practices are a good place to start.

5 minute simulation: Variation in effect size estimates

If it’s not terribly obvious, shiny is my new favorite toy. It’s incredibly accessible for beginners, gives you great results with minimal effort, and can be as sophisticated as you need it to be.

I decided to throw together a quick simulation to look at the variation in effect size estimates we can expect at different sample sizes. There’s an increasing focus in psych on estimation of effects, rather than simply detection of effects. This is great, but, as it turns out, virtually impossible with a single study unless you are prepared to recruit massive numbers of subjects. Nothing here is new, but I like looking at distributions and playing with sliders, and I’ll take any excuse to make a little shiny widget.

In this simulation, we’re doing a basic, between-groups t-test and drawing samples from the normal distribution.

Here’s what you get if you use tiny (n=10) groups (that is, you’re a researcher with ~flair~) and no true effect is present:

The white line is the true effect, and the two red lines mark out the 95% quantile. N=10, true effect = 0.

Yikes. With samples that small, you could (and will, often!) get an enormous effect when none is present.

Here’s what we get with n=50, no effect present. I’ve left the x-axis fixed to make it easier to compare all of these plots.

N=50, true effect = 0.

This is a dramatic improvement over n=10, but you could still estimate what is considered a small (d=.1, traditionally) to medium (d=.3, traditionally) effect in either direction with appreciable frequency.

Just for fun, here’s n=100 and n=1000.

Even with ns of 100 in each group, you get a pretty good spread on the effect size.

I’ve used d=0 as an example, but you get this spread regardless of what the true d is; it will just shift to center on the true effect. In that case, you’ll detect an effect most of the time, but can be way off about its actual size. This doesn’t mean that you can throw power out the window by arguing that you only care about detection, of course–you’ll “detect” an effect a lot of the time when d=0 with small samples.

These simulations are the result of 3000 replications each, but in the shiny app you can go as low as 40 to roughly simulate meta-analyses (assuming every study had the same sample size).
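The core of a simulation like this might look something like the following sketch (not the actual app code; the d computation uses the standard pooled-SD formula):

```r
set.seed(42)  # hypothetical seed

# one simulated two-group study: returns the estimated Cohen's d
sim_d <- function(n, true_d) {
  g1 <- rnorm(n, true_d)
  g2 <- rnorm(n)
  (mean(g1) - mean(g2)) / sqrt((var(g1) + var(g2)) / 2)
}

# 3000 replications with n = 10 per group and no true effect
d_estimates <- replicate(3000, sim_d(10, 0))
spread <- quantile(d_estimates, c(0.025, 0.975))  # the two red lines
```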

For me, this really drives home just how important replications and meta-analyses–cumulative science in general, really–are, particularly for estimation. When you do a lot of these studies over and over again, as these simulations model, you’ll zero in on the true effect, but a study can’t do it alone.

The shiny app can be found here. You can tweak group size, true effect size, how many simulations are run, and the limits on the x-axis. You can also view a plot of the corresponding p-values.

The source code can be found here.