Baron Bootstrap

Introduction

The word “bootstrap” comes from an old story about a hero - Baron Munchausen - who is riding around on his horse in a forest and suddenly gets stuck in a swamp. He screams for help but there is no one around who hears his voice! Luckily our hero does not give up and gets a great idea: “what if I just pull myself out of this swamp?”. He grabs the straps of his boots and pulls himself loose. Fantastic - he just invented bootstrapping.

Baron Munchausen ponytailing out of another swamp.1


Physics-defying stories aside, bootstrapping has become a common term for something seemingly impossible or counterintuitive. In this blogpost I will try to generate an intuition for the properties of statistical bootstrapping - resampling from your data to approximate resampling from a population.

The population

In order to explain bootstrapping, we need to generate an example. Let’s assume we want to know the average height of all the people in the Netherlands. With the power of R we can easily generate a population of 17104879 people (according to CBS, the amount of registered inhabitants of the Netherlands as per the creation of this post2). The Dutch are just about the tallest people on the planet, where the men are 180,7 centimeters tall, on average3.

Here is some more information about my population. For the sake of simplicity let’s assume that all inhabitants are actually men (which would be a disaster). The tallest Dutch man is 223 centimeters4 (which is very tall) and the shortest Dutch man is ridiculously hard to find on the internet. I have an intuition that it is further away from the mean of 180.7, which implies some negative skewness, but that’s not what this post is about so let’s also assume we have a non-skewed normal distribution.

After some fiddling with the standard deviation variable, I simulated the following population:

# generate population data
set.seed(3665364)
pop <- rnorm(17104879, mean = 180.7, sd = 7.5)

Let’s look how tall the tallest man from the hypothetical all-men Netherlands is, along with some other statistics!

##  Statistics about the population:
##  --------------------------------
##  The shortest person is  142.0121  cm tall.
##  Mean height is  180.6979  cm.
##  Median height is  180.698  cm.
##  The 5th and 95th percentiles are  168.3612 193.0287 cm.
##  The tallest person is  222.2139  cm tall.

That seems close enough to something I’d consider a population.

The set-up

Normally, when we want to estimate some population parameter such as the mean height, we cannot measure all persons. Therefore, we create a representative sample of persons from this population that we can measure. The size of our sample (\(N\)) should depend on how precise we want our final estimate to be - the standard error of a statistic depends directly on the number of persons in our sample. For the mean, the standard error is usually calculated as follows:

\[ se_\bar{x} = \frac{\hat{\sigma}}{\sqrt{N}}\text{, where } \hat{\sigma} = \sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_i-\bar{x})^2} \]

With increasing \(N\), the \(\hat{\sigma}\) becomes smaller, and the \(se_\bar{x}\) becomes smaller as well. In words: with a larger sample comes an increase in precision (a reduction in error).

We can display the precision in the form of confidence intervals (CI). A \(p\%\) CI about an estimate indicates the area in which upon infinitely repeated sampling the TRUE parameter lies \(p\%\) of the time. In other words, if we would redo our experiment infinite amount of times, a 95% confidence interval will cover the true parameter in 95% of the replications.5

In real life, there is of course no such thing as infinite resampling; we only sample once and use the CI as an indication of precision. Let’s look at how the precision of the mean estimate increases as we take larger samples from the above population.

# randomly sample from population
small <- sample(pop, 20)
medium <- sample(pop, 100)
large <- sample(pop, 1000)

As you can see, the CI from each sample covers the true population value in this case. We can create these confidence intervals because we know the sampling distribution of the mean - the distribution of the mean that arises upon repeated sampling. For means of a normally distributed population, the sampling distribution is the familiar Student’s t-distribution. We use the probability density to determine our 95% CI (the 1.96 in the code above).

But what if we don’t exactly know what the sampling distribution is? For example, what happens if we do not have a normally distributed population? It turns out there is another way of generating CIs.

The bootstrap

The bootstrap has these steps:

  1. approximate the sampling distribution by taking the mean of n repeated samples with replacement from your original sample.

Huh? Only one step? Is it that simple?

Yes:

# Let's bootstrap 10000 times!
mean_sampling_distribution <- numeric(10000)
for (i in 1:10000){
  bootstrap_sample <- sample(medium, replace = T)
  mean_sampling_distribution[i] <- mean(bootstrap_sample)
}

Great! Now we have an empirical sampling distribution. What do we do now in order to get an estimate of our precision in the form of a confidence interval? That is simple too: sort the means attained from the bootstrap samples from low to high, and look at the 2.5th and 97.5th percentile. This yields a 95% bootstrap CI!

CIbootstrap <- quantile(mean_sampling_distribution, 
                        probs = c(0.025,0.975))

As you can see, that’s indeed very close to the original theoretical CI.

The advantage

The advantage of this method is that this does not require the researcher to know the exact form of the sampling distribution. We can now create CIs (and thereby an estimate of precision) for nearly any statistic that we think about, such as (a) the fifth percentile, (b) the median, (c) the ninety-ninth percentile, (d) this completely arbitrary statistic called van kesteren measure that I just came up with: \[\hat{k} = \bar{x}\cdot\frac{1}{N}\sum_{i=1}^N |x_i^\frac{1}{3}-\sqrt{\text{median}(x)}|\]

Probably there are analytical solutions to the sampling distributions of these measures (and I think it is likely that they have been derived at some point in the 1940s) but I don’t know them so I’ll bootstrap:

vkmeasure <- function(x){
  return(mean(x)/length(x)*sum(abs(x^(1/3)-sqrt(median(x)))))
}

perc5 <- median <- perc99 <- vkmeas <- numeric(10000)

for (i in 1:10000){
  bootstrap_sample <- sample(large, replace = T)
  perc5[i] <- quantile(bootstrap_sample, probs = 0.05)
  median[i] <- median(bootstrap_sample)
  perc99[i] <- quantile(bootstrap_sample, probs = 0.99)
  vkmeas[i] <- vkmeasure(bootstrap_sample)
}

We have magically done away with a problem - that of distributional assumptions - by pulling ourselves up from our bootstraps, not unlike our friend Baron Munchausen! Fantastic! Or is it?

Of course, there are some downsides to this approach; it is not the magical solution to getting rid of distributional assumptions in any situation. As you can see in the Fifth Percentile graph in the image above, there is some discreteness to the distribution that we do not expect in the actual sampling distribution of the fifth percentile characteristic:

# Sample 10000 times from the ACTUAL population
perc5actual <- numeric(10000)
for (i in 1:10000){
  smp <- sample(pop, 1000)
  perc5actual[i] <- quantile(smp, probs = 0.05)
}

In this case the bootstrap sampling distribution does not work so well for determining the uncertainty or 95% confidence interval around the statistic of interest. As seen in this figure, the actual sampling distribution looks quite different to that inferred from bootstrap. How would we solve this problem? There are several ways:

  1. find the analytical sampling distribution form and construct a CI based on that
  2. make a larger sample which will contain more different values (often infeasible)
  3. smooth the bootstrap procedure, for example by adding some random noise to the samples6

Take-home message

The bootstrap is a fantastic means of nonparametrically determining the uncertainty of your estimate, but it should be used with care. Inspect the resulting distribution for discreteness such as in the 5th percentile graph above. And mathematical statistics is not completely obsolete ;)


  1. Source: http://en.citizendium.org/wiki/Image:Dore-Munchausen-pull.jpg

  2. Source: https://www.cbs.nl/nl-nl/visualisaties/bevolkingsteller as per the writing of this post.

  3. Source: https://is.gd/cbsdata

  4. Quick google searchy source: https://www.langzijn.nl/tag/langste-man-nederland

  5. this explanation follows a frequentist framework of statistics. For an interesting sidestep and Bayesian credible intervals, do read this paper - http://doi.org/10.3758/s13423-015-0947-8

  6. Yes, the source is a wikipedia link: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Smoothed_bootstrap