A few weeks back I posted a short diatribe on the merits and pitfalls of including your uncertainty, or error, in any argument you make. Some of you were quick to sing the praises of our friendly standard deviants, while others were more hesitant to jump on the confidence bandwagon.
However, one common thread amongst the responses was a general uncertainty about uncertainty. That is – what exactly we mean when we say “error bars”. It turns out that error bars are quite common, though quite varied in what they represent.
This post is a follow-up which aims to answer two distinct questions: what exactly are error bars, and which ones should you use? So, without further ado:
What the heck are error bars anyway?
Well, technically this just means “bars that you include with your data that convey the uncertainty in whatever you’re trying to show”. However, there are several standard definitions, three of which I will cover here.
First, we’ll start with the same data as before.
OK, so this is the raw data we’ve collected. As we can see, the values seem to be spread out around a central location in each case. The question that we’d like to answer is: are these two means different? If they are, then we’re all going to switch to banana-themed theses.
Upon first glance, you might want to turn this into a bar plot:
However, as noted before, this leaves out a crucial factor: our uncertainty in these numbers. Remember how the original set of datapoints was spread around its mean. Here, we have lost all of that information.
So, let’s add some error bars!
The standard deviation
The simplest thing that we can do to quantify variability is calculate the “standard deviation”. Basically, this tells us how much the values in each group tend to deviate from their mean. Here is its equation:
As with most equations, this has a pretty intuitive breakdown:
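A sketch of both, assuming the usual sample form with an n - 1 denominator (the symbols below are chosen for illustration, and some texts divide by n instead):

```latex
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

- The term x_i - x̄ is how far each data point sits from its group mean.
- Squaring makes every deviation positive, so points above and below the mean don’t cancel out.
- Summing and dividing by n - 1 gives (roughly) the average squared deviation.
- The square root brings the result back into the same units as the data.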
And here’s what these bars look like when we plot them with our data:
OK, not so bad, but is standard deviation really what we want? We’ve just seen that this tells us about the variability of each point around the mean. However, we don’t really care about comparing one point to another; we actually want to compare one *mean* to another. Which brings us to…
The standard error
Closely related to the standard deviation, the standard error gets more specifically at the kinds of questions you’re usually asking with data. We want to compare means, so rather than reporting variability in the data points, let’s report the variability we’d expect in the means of our groups. This is known as the standard error.
Now, here is where things can get a little convoluted, but the basic idea is this: we’ve collected one data set for each group, which gave us one mean in each group. If we wanted to calculate the variability in the means, then we’d have to repeat this process a bunch of times, calculating the group means each time.
However, we don’t want to do this, so what can we do?
One option is to make an assumption. Specifically, we might assume that if we were to repeat this experiment many, many times, then the group means would roughly follow a normal distribution. Note: this is a big assumption, but it may be reasonable if we expect the Central Limit Theorem to hold in this case.
If we assume that the means are distributed according to a normal distribution, then the standard error (aka, the variability of group means) is defined as this:
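Written out, with s standing for the standard deviation of a group and n for the number of points in it (symbol names assumed here for illustration), the usual form is:

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}}
```

Because of the square root, quadrupling the number of points only halves the standard error.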
Basically, this just says “take the general variability of the points around their group means (the standard deviation), and divide it by the square root of the number of points that we’ve collected”.
This one also makes intuitive sense. If we increase the number of samples that we take each time, then the mean will be more stable from one experiment to another. Don’t believe me? Here are the results of repeating this experiment a thousand times under two conditions: one where we take a small number of points (n) in each group, and one where we take a large number of points.
See how the means are clustered more tightly around their central number when we have a large n? This represents a low standard error. In other words, each experiment is more likely to give a mean close to the one we’d get from any other experiment, so the estimate is more reliable.
This sounds like a much better choice for plotting along with our data, because it directly answers the question “how certain are we that the means we’ve recorded are the ‘true’ values?”
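If you’d like to try this yourself, here is a minimal sketch of that kind of simulation. The true mean (10), true standard deviation (2), the two group sizes (5 and 100), and the function name `simulate_means` are all made up for illustration; they are not the values behind the figures in this post.

```python
import random
import statistics


def simulate_means(n, n_experiments=1000, true_mean=10.0, true_sd=2.0, seed=0):
    """Repeat an experiment many times and return the sample mean from each run.

    All parameter values are illustrative assumptions, not the post's data.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        # One "experiment": draw n points from a normal distribution.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


small_n_means = simulate_means(n=5)
large_n_means = simulate_means(n=100)

# The spread of the means across experiments is an empirical standard error.
spread_small = statistics.stdev(small_n_means)
spread_large = statistics.stdev(large_n_means)

print(f"n=5:   spread of means = {spread_small:.3f} (theory: {2.0 / 5 ** 0.5:.3f})")
print(f"n=100: spread of means = {spread_large:.3f} (theory: {2.0 / 100 ** 0.5:.3f})")
```

The “theory” numbers are just the true standard deviation divided by the square root of n, so the simulated spread should land close to them.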
Let’s see what this looks like:
Wahoo! We’ve made our error bars even tinier. That’s no coincidence. Look at the equation for the standard error: as n increases, the standard error shrinks. And because we divide by the square root of n, the standard error is smaller than the standard deviation whenever a group has more than one data point.
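To see the two numbers side by side, here is a small sketch using only Python’s standard library. The measurements in `values` are hypothetical, invented purely for illustration.

```python
import math
import statistics

# Hypothetical measurements for one group (made up for illustration).
values = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4]

n = len(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)         # standard error of the mean

print(f"n = {n}, SD = {sd:.3f}, SE = {se:.3f}")
```

With eight points, the standard error comes out at the standard deviation divided by the square root of 8, so a bit more than a third of its size.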
OK, there’s one more problem that we actually introduced earlier. As I said before, we made an *assumption* that means would be roughly normally distributed across many experiments. But do we *really* know that this is the case? Is there a better way that we could give our uncertainty in group means, without assuming that things are normally distributed? Fortunately, there is…
Confidence intervals (with bootstrapping)
Confidence intervals have been around for quite some time, but calculating them by brute-force resampling has only become practical as an everyday tool in data analysis over the past few decades, as computing got cheap. I’m going to talk about one way to calculate confidence intervals, a method known as “bootstrapping”. Basically, this uses the following logic:
I’m interested in finding the variability of our sample means across many experiments, but I don’t want to make too many assumptions about how the means would be distributed across many experiments. What can I do? Bootstrapping says “well, if I had the ‘full’ data set, aka every possible datapoint that I could collect, then I could just ‘simulate’ doing many experiments by taking a random sample from that ‘full’ data set”.
I don’t have the full dataset, but I do have the sample that I’ve collected. As such, I’m going to say that the closest thing I’ve got to the true distribution of all the data is the sample that I’ve already got. Thus, I can simulate a bunch of experiments by taking samples from my own data *with replacement*. I’ll calculate the mean of each sample, and see how variable the means are across all of these simulations.
OK, that sounds really complicated, but it’s quite simple to do on our own. Let’s try it.
We need to:
- Take a bunch of samples from our data, with replacement, each the same size as our original dataset. “With replacement” just means that we can sample the same datapoint more than one time.
- For each sample, calculate the mean.
- Look at all of the means to figure out how variable they are, for example by taking their 2.5th and 97.5th percentiles to get a 95% interval.
Doing this requires a bit of computation, so I’m not going to go into the details here. However, at the end of the day what you get is quite similar to what the standard error would predict. Why is this? Because in this case, we know that our data are normally distributed (we created them that way), and for normal data a 95% confidence interval works out to roughly two standard errors on either side of the mean. In real life we can’t be as sure of this, and confidence intervals will tend to differ more from the standard error picture than they do here.
The way to interpret confidence intervals is that if we were to repeat the above process many times (including collecting a sample, then generating a bunch of “bootstrap” samples from that sample, then taking the percentiles of these sample means), then roughly 95% of the time, our interval would contain the “true” mean of the data.
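For anyone who does want the details, here is a minimal sketch of a percentile bootstrap using only Python’s standard library. The function name `bootstrap_ci_95`, the number of resamples (5000), and the simulated `sample` are assumptions made for illustration, not the code or data behind this post.

```python
import random
import statistics


def bootstrap_ci_95(values, n_boot=5000, seed=0):
    """Return an approximate 95% percentile-bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        # Resample WITH replacement, same size as the original sample.
        resample = rng.choices(values, k=n)
        boot_means.append(statistics.fmean(resample))
    # n=40 gives cut points every 2.5%, so the first and last
    # cut points are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(boot_means, n=40)
    return cuts[0], cuts[-1]


# Hypothetical sample: 30 points drawn from a normal distribution.
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(30)]

low, high = bootstrap_ci_95(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5

print(f"sample mean       = {statistics.fmean(sample):.3f}")
print(f"bootstrap 95% CI  = ({low:.3f}, {high:.3f})")
print(f"mean +/- 1.96*SE  = ({statistics.fmean(sample) - 1.96 * se:.3f}, "
      f"{statistics.fmean(sample) + 1.96 * se:.3f})")
```

Because the sample here really is normal, the bootstrap interval and the one built from the standard error should come out close to each other. With skewed or otherwise messy data, the two can drift apart.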
So what should I use?
At the end of the day, there is no one-size-fits-all method that you should always use when showing error bars. So the most important thing above all is that you’re explicit about what kind of error bars you show. The biggest confusions come when authors show the standard error but readers assume it’s the standard deviation, or vice versa.
That said, in general you want to show the standard error or 95% confidence intervals rather than the standard deviation. This is because these are closer to the question you’re really asking: how reliable is the mean of my sample?
As for choosing between these two, I’ve got a personal preference for confidence intervals, as they seem to be the most flexible and require fewer assumptions than the standard error. I’m sure that statisticians will argue this one until the cows come home, but again, being clear is often more important than being perfectly correct.
So that’s it for this short round of stats-tutorials. There are many other ways that we can quantify uncertainty, but these are some of the most common that you’ll see in the wild. If you’ve got a different way of doing this, we’d love to hear from you. Until then, may the p-values be ever in your favor.