```
library(ggplot2)
library(manipulate)
library(firatheme) # https://github.com/vankesteren/firatheme
library(Massign) # https://github.com/vankesteren/Massign
```

A while ago, I blogged that a matrix \(A\) can be seen as an operation and that the determinant of this matrix \(|A|\) says something about the volume of the transformation. This blog is about another property of matrices: the characteristic function.

The characteristic function of a square matrix \(A\) of order \(n\) is defined as follows:

\[p_A(\lambda) = | \lambda I_n - A |\]

Let’s look at this function for the following \(2\times 2\) matrix:

\[A = \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix}\] \[ \begin{align} p_A(\lambda) &= | \lambda I_n - A | \\ &= \left| \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} - \begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix} \right| \\ &= \left| \begin{bmatrix} \lambda - 3 & -1 \\ -2 & \lambda - 4 \end{bmatrix} \right|\\ \end{align}\]

The roots of this function - a polynomial of order \(n\) - are the eigenvalues of the matrix. We can find them using some algebra, remembering that the determinant of a \(2\times 2\) matrix is calculated as \(ad-bc\):

\[ \begin{align} p_A(\lambda) &= (\lambda - 3)(\lambda - 4) - 2\\ &= \lambda^2 - 7\lambda + 10 \\ &= (\lambda - 2)(\lambda - 5) \end{align}\]

So \(\lambda = 2\) or \(\lambda = 5\). These are the two eigenvalues of this matrix.
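As a quick sanity check, we can confirm these roots numerically with base `R`'s `eigen()` - a minimal sketch:

```
# numerically confirm the eigenvalues of A (a sketch using base R)
A <- matrix(c(3, 2, 1, 4), nrow = 2)  # filled column-wise: rows are (3, 1) and (2, 4)
eigen(A)$values  # 5 and 2, matching the roots of the characteristic polynomial
```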

Using the power of `R`, we can get a better intuition for the characteristic function by visualising it. Below is the code for a function that takes a matrix and plots its characteristic function.

```
charfun <- function(mat, from, to) {
  n <- ncol(mat)
  stopifnot(n == nrow(mat))
  x <- seq(from, to, length.out = 3000)
  ggdat <- data.frame(
    x = x,
    y = vapply(x, function(lambda) det(lambda * diag(n) - mat), 1.0)
  )
  ev <- eigen(mat)$values
  ev <- ev[ev <= to & ev > from]
  evdat <- data.frame(x = ev, y = rep(0, length(ev)))
  ggplot(ggdat, aes(x = x, y = y)) +
    geom_hline(yintercept = 0, col = firaCols[5], lwd = 1) +
    geom_vline(xintercept = ev, col = firaCols[2], lwd = 1, lty = 2) +
    geom_line(col = firaCols[1], lwd = 1) +
    geom_point(aes(x, y), evdat, size = 3, col = firaCols[1]) +
    labs(x = "Lambda",
         y = "Characteristic function value",
         title = "Characteristic function of a matrix") +
    theme_fira()
}

A %<-% "3, 1
        2, 4"
charfun(A, 0, 7) + ggtitle("Characteristic function of A")
```

- The characteristic function crosses the axis at 2 and 5, just as we expected.
- The characteristic function is indeed a quadratic function, i.e., a polynomial of order \(n = 2\).

You can play around with this function in `R` by trying out different matrices. Try out a covariance matrix, or different kinds of symmetric and asymmetric matrices!

After running the above `R` chunks, you can run the following to play around with different covariance matrices of the following form:

\[A = \begin{bmatrix} 1 & a & b \\ a & 1 & c \\ b & c & 1 \end{bmatrix}\] Play around with it to see what happens to the eigenvalues of this matrix! For example, note that when a, b, and c are all 0 the eigenvalues are all 1. There are some nice symmetries to be explored here.

```
manipulate(
  {
    A %<-% " 1,
             a, 1
             b, c, 1"
    charfun(A, 0, 2) + ylim(c(-.7, .7))
  },
  a = slider(-1, 1, initial = .5, step = 0.1),
  b = slider(-1, 1, initial = .3, step = 0.1),
  c = slider(-1, 1, initial = .2, step = 0.1)
)
```

I was on holiday and had to wait a while. Fortunately, close to where I was sitting there was a kid playing basketball all by himself. Of course I had to record how many hits and misses he made, to keep it as a nice dataset for further analysis.

Here is the dataset:

```
basket <- c("miss", "miss", "miss", "hit", "miss", "hit", "miss", "miss",
            "hit", "hit", "miss", "miss", "miss", "miss", "miss", "miss",
            "miss", "hit", "hit", "hit", "miss", "hit")
```

Firstly, we can describe how often the kid manages to score a point:

So it looks like the kid misses a bit more than he hits. But there is much more information in this dataset: besides whether the kid hits or misses, we can say something about the *sequence* of these events.

From a sequence of observations, it is possible to construct a transition matrix

\[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} \] where the elements indicate the probability that:

- \(A\): a hit is followed by another hit
- \(B\): a hit is followed by a miss
- \(C\): a miss is followed by a hit
- \(D\): a miss is followed by another miss

In other words, \(B\) and \(C\) indicate how likely it is to *transition* from hit to miss and vice versa, whereas \(A\) and \(D\) indicate how likely it is to stay in the same state (*transition* to self).

If we don’t know the true probabilities, we can enter the observed probabilities into the matrix. Here is an `R` function for generating the transition matrix from the data vector of before:

```
transMat <- function(x, prob = TRUE) {
  X <- t(as.matrix(x))
  tt <- table(c(X[, -ncol(X)]), c(X[, -1]))
  if (prob) tt <- tt / rowSums(tt)
  tt
}

transitionMatrix <- transMat(basket)
print(transitionMatrix, digits = 2)
```

```
##
##        hit miss
##   hit 0.43 0.57
##   miss 0.36 0.64
```

This transition matrix completely defines the 2-state Markov chain. Assuming these probabilities are stable, we can now generate processes just like the one we observed. And we can visualise it nicely using the interactive Markov chain generator post by setosa - go there to play around!

If we assume this chain is stable over time, there is another nice property. Irrespective of the initial probabilities of hitting or missing that we choose, after a few steps of the Markov process the probability of hitting the basket already converges to the *steady state*:
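This convergence can be made concrete with a small sketch. The code below uses the rounded transition probabilities printed above and starts from two opposite initial beliefs; after a handful of steps both end up at the same probabilities:

```
# steady-state convergence sketch, using the rounded transition probabilities
P <- matrix(c(0.43, 0.57,
              0.36, 0.64), nrow = 2, byrow = TRUE,
            dimnames = list(c("hit", "miss"), c("hit", "miss")))
p_hit  <- c(1, 0)  # start: certain the first throw is a hit
p_miss <- c(0, 1)  # start: certain the first throw is a miss
for (step in 1:10) {
  p_hit  <- p_hit %*% P
  p_miss <- p_miss %*% P
}
round(rbind(p_hit, p_miss), 4)  # both rows converge to the same steady state
```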

In general, we can calculate the steady state easily from the transition matrix:

```
library(expm) # for the matrix power operator %^%
someBigNumber <- 1000
diag(transitionMatrix %^% someBigNumber)
```

```
##       hit      miss
## 0.3846154 0.6153846
```
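An alternative to raising the matrix to a large power - a sketch, not from the original analysis - is to take the left eigenvector of the transition matrix belonging to eigenvalue 1 and normalise it. Using the rounded probabilities from the printed matrix:

```
# the steady state as a normalised left eigenvector (eigenvalue 1)
P <- matrix(c(0.43, 0.57,
              0.36, 0.64), nrow = 2, byrow = TRUE)
e <- eigen(t(P))  # left eigenvectors of P are eigenvectors of t(P)
v <- Re(e$vectors[, which.max(Re(e$values))])
v / sum(v)  # the stationary distribution
```

Because the probabilities here are rounded to two decimals, the result differs slightly from the one computed from the full-precision transition matrix.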

Note that this leads to a similar but not identical result compared to simply tabulating the observed probability of hitting or missing, as we did in the first figure:

`table(basket)/length(basket)`

```
## basket
##       hit      miss
## 0.3636364 0.6363636
```

- What should we trust here? The naïve probabilities or the Markov steady state?
- Which assumptions lead to this discrepancy?
- Does the transition matrix really contain more relevant information about this process than the observed hit rate?

Any set of points can be represented in a matrix \(\boldsymbol{X}\). For example:

\[ \boldsymbol{X} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 0 \end{bmatrix}\] The four rows in this matrix correspond to four points in two-dimensional space. You can think of the first column as the x coordinate and the second column as the y coordinate of each point. For our chosen \(\boldsymbol{X}\), these points represent the corners of a unit square.

We can define a transformation matrix \(\boldsymbol{T}\) as a \(2\times 2\) matrix which through post-multiplication transforms these points into *another* set of points in 2-dimensional space \(\boldsymbol{X'}\). For example, we can take the identity matrix:

\[\boldsymbol{T} = \boldsymbol{I} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\]

This matrix represents a trivial transformation, because by definition \(\boldsymbol{X'} = \boldsymbol{X} \times \boldsymbol{I} = \boldsymbol{X}\): the set of transformed points is the same as the set of original points.

But what about a different transformation matrix, say

\[\boldsymbol{T} = \begin{bmatrix} 1 & 0 \\ 0.5 & 1 \end{bmatrix}\]

Now \(\boldsymbol{X'}\) is not equal to \(\boldsymbol{X}\): the points have been transformed! In particular, here we are dealing with a *skew*:

\[\boldsymbol{X'} = \boldsymbol{X} \times \boldsymbol{T} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 0 \end{bmatrix} \times \begin{bmatrix} 1 & 0 \\ 0.5 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0.5 & 1 \\ 1.5 & 1 \\ 1 & 0 \end{bmatrix}\]
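This multiplication is easy to verify numerically - a minimal sketch in base `R`:

```
# verifying the skew transformation numerically
X <- matrix(c(0, 0, 1, 1,    # x coordinates
              0, 1, 1, 0),   # y coordinates
            ncol = 2)
T_skew <- matrix(c(1,   0,
                   0.5, 1), ncol = 2, byrow = TRUE)
X %*% T_skew  # rows: (0, 0), (0.5, 1), (1.5, 1), (1, 0)
```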

Because this is all very abstract and a lot of numbers, below I’ve plotted the four points in \(\boldsymbol{X}\), connected them by lines, coloured the center, and applied the *skew* transformation, yielding \(\boldsymbol{X'}\).

I’ve also gone a bit further and made it interactive^{1}. So you can edit the numbers in the matrix and the unit square will transform accordingly. Play around with it to get an idea of transforming a set of points in 2-dimensional space.

Now that you have gained some intuition for the transformation matrix, I’ll tell you a great geometric trick I learnt from this youtube video: the surface area of the shape defined by \(\boldsymbol{X'}\) is equal to the size (absolute value) of the *determinant* of the transformation matrix \(\boldsymbol{T}\). This was a great revelation for me that made determinants much easier to comprehend. It works in higher dimensions too: the volume of the transformed \(k\)-dimensional unit hypercube equals the absolute value of the determinant of the transformation matrix \(\boldsymbol{T} \in \mathbb{R}^{k \times k}\).

But we’re not there yet: determinants can be negative, whereas volumes and areas can’t. Luckily, the sign of the determinant can be inferred from \(\boldsymbol{X'}\) too. Specifically, it has to do with the *chirality* of the shape defined by \(\boldsymbol{X'}\). If the original square “flips” – that is, the original bottom right point becomes the new top left point or the original bottom left point becomes the new top right point – the sign of the determinant will be negative. In the illustration, that will make the shaded area red instead of blue.
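Both properties - area and orientation - can be checked directly from the determinant. A sketch comparing the skew from before with a flip (an axis swap):

```
# the determinant tracks area and orientation (a sketch)
T_skew <- matrix(c(1,   0,
                   0.5, 1), ncol = 2, byrow = TRUE)
T_flip <- matrix(c(0, 1,
                   1, 0), ncol = 2, byrow = TRUE)  # swaps the x and y axes
det(T_skew)  # 1: the skewed square still has area 1, orientation preserved
det(T_flip)  # -1: same area, but the square has "flipped"
```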


- Try to make \(\boldsymbol{T}\) look like a covariance matrix.
- Try to make the columns in \(\boldsymbol{T}\) linearly dependent.
- Try to flip the rows or columns of \(\boldsymbol{T}\) at any point.

Through exploring interactively what a transformation matrix does to a unit square, we can generate an intuition for the geometric meaning of the determinant.

There is an exciting field in Bayesian statistics all about testing *Informative Hypotheses*^{1}. In a few words, these hypotheses concern order restrictions: *“I think that my treatment group will have lower levels of depression than my control group”*. These order restrictions can also concern more than two groups. In fact, the hypothesis can be an arbitrarily large set of *greater than* (\(>\)), *smaller than* (\(<\)), and *equality* constraints (\(=\)), and it can be about any number of parameters in a statistical model.

Any software project implementing these user-definable arbitrarily complex hypotheses will need to check for *transitivity*. An example: if your hypothesis states that \(A>B\) and \(B>C\) then it is impossible under this hypothesis that \(C>A\) or that \(C=A\). The most famous intransitive relationship: rock-paper-scissors! I recently received the task to figure out a way of assessing whether a set of pairwise constraints is intransitive. This blogpost is about the algorithm I came up with for performing that task.

We can represent an informative hypothesis as a set of pairwise constraints. A pairwise table with on the left hand side (lhs) and right hand side (rhs) the parameters of interest and in the center one of the three available operators (op) would look like this:

| lhs | op | rhs |
|-----|----|-----|
| A   | >  | B   |
| B   | =  | C   |
| C   | >  | D   |
| E   | <  | D   |
| B   | <  | E   |

You may notice that this pairwise table makes for an *intransitive* informative hypothesis: B cannot be smaller than E if it is also greater than D, which in turn is greater than E ^{2}.

There are 2 observations to be made:

- We actually don’t have three operators \(\{<, >, =\}\), but only two: \(\{>, =\}\). For any \(<\) row in the table, we can simply switch around the lhs and rhs columns.
- For the purpose of checking transitivity, the \(=\) relation is fundamentally different than the \(>\) relation. In fact, we can remove all \(=\) relations in the pairwise tables by iteratively replacing the associated lhs and rhs parameters of interest in the entire table with the value (lhs,rhs) and removing the original \(=\) relation.
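The first observation can be sketched in a few lines. Here `constraints` is a hypothetical data frame holding the pairwise table; the `=` merging step is more involved and omitted:

```
# flip every "<" row so that only ">" and "=" remain (a sketch)
constraints <- data.frame(lhs = c("A", "B", "C", "E", "B"),
                          op  = c(">", "=", ">", "<", "<"),
                          rhs = c("B", "C", "D", "D", "E"),
                          stringsAsFactors = FALSE)
flip <- constraints$op == "<"
constraints[flip, c("lhs", "rhs")] <- constraints[flip, c("rhs", "lhs")]
constraints$op[flip] <- ">"
constraints  # rows 4 and 5 are now D > E and E > B
```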

Incorporating the above two observations into the table, we obtain the following transitivity-equivalent pairwise table:

| lhs   | op | rhs   |
|-------|----|-------|
| A     | >  | (B,C) |
| (B,C) | >  | D     |
| D     | >  | E     |
| E     | >  | (B,C) |

From this, I realised we can turn it into a graph. We have a set of vertices \(\{A, (B,C), D, E\}\) and a set of edges \(\{A\rightarrow (B,C),(B,C)\rightarrow D, D \rightarrow E, E \rightarrow (B,C)\}\). Here is the resulting graph:

Now that we have a graph, there is one key observation we can make: **if the graph is cyclic, the relation is intransitive, and if the graph is acyclic, the relation is transitive**. Think about it in terms of rock-paper-scissors again: scissors beats paper, paper beats rock, but rock beats scissors. Or, if you like pokémon: grass beats water, water beats fire, fire beats grass. I never thought this relationship, extremely ingrained into my brain due to my many hours of playing as a kid, would be useful in real life. But I digress.

We can represent a graph in a different way: as an adjacency matrix. In this \(p\times p\) matrix \(\boldsymbol{A}\), where \(p\) is the number of vertices in the graph, element \(a_{ij}=1\) if there is a directed edge from vertex \(i\) to vertex \(j\), and \(a_{ij}=0\) otherwise. The adjacency matrix for our informative hypothesis example is as follows:

\[ \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix} \] The elements correspond to the vertices \(\{A, (B,C), D, E\}\) in order.

So why would we want to do this? With the adjacency matrix \(\boldsymbol{A}\), we can easily check whether the graph is a directed acyclic graph or DAG, and hence whether the hypothesis is transitive. This stackexchange answer by Dan Shved sums it up nicely:

> The graph is a DAG if and only if each matrix \(\boldsymbol{A}^n\), \(n>0\), has only zeroes on the main diagonal. […] The test includes an infinite number of matrices \(\boldsymbol{A}^n\), one for each natural \(n\). In practice it is enough to check each \(n\) from 1 up to the number of vertices in the graph, inclusive.

Let’s program a proof of concept and see if it works!

```
library(expm)
isAcyclic <- function(amat) {
  for (n in 1:ncol(amat)) {
    if (any(diag(amat %^% n) != 0)) return(FALSE)
  }
  TRUE
}
}
```

That’s a short and neat program. It does exactly what the stack exchange answer told us to do: it checks, for each exponent up to \(p\), whether there are nonzero elements on the diagonal. If it finds one, it quits and returns `FALSE`; otherwise it returns `TRUE`. It’s reasonably fast too, for realistic problems where users enter a pairlist of parameter inequalities:

```
library(microbenchmark)
# let's construct a 100x100 matrix
set.seed(147852)
A <- matrix(rbinom(1e2L^2, 1, 0.001), nrow = 1e2L)
diag(A) <- rep(0,1e2L)
microbenchmark(isAcyclic(A), unit = "ms", times = 10L)
```

```
## Unit: milliseconds
##          expr      min       lq     mean   median       uq      max neval
##  isAcyclic(A) 597.8342 691.7009 769.3391 738.0323 892.6885 934.4089    10
```

We get roughly 600 to 950 milliseconds for this \(100\times 100\) matrix, which happens to be acyclic, so the function actually has to run all 100 exponentiations. Realistically, however, we are usually looking at \(10\times 10\) matrices at most: one row for each parameter not set equal to another parameter in the hypothesis.

Actually, I haven’t told you all there is to something as seemingly simple as an adjacency matrix; it’s a great tool. The \(n^{th}\) power of the adjacency matrix indicates, for each vertex, *which vertices you can reach in \(n\) steps*. For \(n=1\) that is of course the adjacency itself, but for \(n=2\) the matrix indicates which vertices you can reach in 2 steps, and so on. So when there is a \(1\) on the diagonal, the associated exponent \(n\) indicates *how many steps it takes to come back to the original vertex*! This means we can give the user feedback on which vertices are involved in the intransitivity, by checking which elements on the diagonal are not equal to 0, like so:

```
acyclicTest <- function(amat) {
  # input:  adjacency matrix
  # output: length-0 vector if the graph is acyclic (transitivity holds),
  #         vector of vertex names if intransitive
  for (i in 1:ncol(amat)) {
    d <- diag(amat %^% i)
    if (any(d != 0)) {
      return(colnames(amat)[which(d != 0)])
    }
  }
  character(0)
}
```

So let’s apply it to our example inequality constrained informative hypothesis:

```
library(Massign)
A %<-% "0, 1, 0, 0
        0, 0, 1, 0
        0, 0, 0, 1
        0, 1, 0, 0"
colnames(A) <- rownames(A) <- c("A", "B,C", "D", "E")
test <- acyclicTest(A)
if (length(test) != 0) {
  paste("Intransitivity detected in: {", paste(test, collapse = ","), "}")
}
```

`## [1] "Intransitivity detected in: { B,C,D,E }"`

This is a highly experimental and non-optimised algorithm. Tell me what you think about it! I think it’s neat and shows how flexible graphs and adjacency matrices are for many different and unexpected applications.

Thanks to Alexandra Sarafoglou for brainstorming!

https://informative-hypotheses.sites.uu.nl/publications/methodological-papers/↩

The conclusion from testing this hypothesis can be simple: the probability of the hypothesis relative to any other hypothesis with even the slightest probability is exactly 0. Thus the bayes factor is 0, we don’t even need to perform our experiment, and we can get on with our lives.↩

I’m currently (finally) learning more about linear algebra, statistical optimisation, and other matrix-related topics. While learning about such abstract topics, it really helps me to convert the material into small `R` programs. While doing this, I stumbled upon a problem: **Matrix construction in R kind of sucks**. Why? Look at this:

```
M <- matrix(c( 1,    0.2, -0.3,  0.4,
               0.2,  1,    0.6, -0.4,
              -0.3,  0.6,  1,    0.4,
               0.4, -0.4,  0.4,  1),
            nrow = 4,
            ncol = 4,
            byrow = TRUE)
```

If I want to create a matrix, I need to (a) create a full vector of values to put in the matrix, (b) decide into how many rows/columns I want to put these values, and (c) decide whether to fill these values in a columnwise (default) or rowwise manner. This last step in particular is a nuisance, because the columnwise default means we cannot have any form of “what you see is what you get” (WYSIWYG) in our script.

To solve this issue for people who want to rapidly create legible matrices, I created the package `Massign`. This package has only one operator, `%<-%`, and it works as follows to create the same matrix as above:

```
M %<-% "  1
         0.2,    1
        -0.3,  0.6,    1
         0.4, -0.4,  0.4,    1"
```

Neat right? There are a few features to it:

In its most basic form, `Massign` makes for legible code because of the “what you see is what you get” nature of the matrix construction.

```
M %<-% "  1,  0.2, -0.3,  0.4
         0.2,    1,  0.6, -0.4
        -0.3,  0.6,    1,  0.4
         0.4, -0.4,  0.4,    1"
M
```

```
##      [,1] [,2] [,3] [,4]
## [1,]  1.0  0.2 -0.3  0.4
## [2,]  0.2  1.0  0.6 -0.4
## [3,] -0.3  0.6  1.0  0.4
## [4,]  0.4 -0.4  0.4  1.0
```

As shown before, when you enter a lower triangular matrix, `Massign` automatically creates a symmetric matrix. This is a major feature, because properly creating the symmetric elements is not simple using the default `matrix()` function.
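For comparison, here is one way to build the same symmetric matrix with base `R` alone - a sketch that fills the lower triangle and then mirrors it into the upper triangle:

```
# base-R sketch of the same symmetric matrix
M <- diag(4)
M[lower.tri(M)] <- c(0.2, -0.3, 0.4, 0.6, -0.4, 0.4)  # filled column-wise
M[upper.tri(M)] <- t(M)[upper.tri(M)]                 # mirror the lower triangle
M
```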

Because `Massign` constructs a `matrix()` call under the hood and evaluates it in the environment from which the operator is called, you are allowed to enter variables inside the text string:

```
phi <- 1.5
M %<-% "1,     1,     1
        1,   phi, phi^2
        1, phi^2, phi^4"
M
```

```
##      [,1] [,2]   [,3]
## [1,]    1 1.00 1.0000
## [2,]    1 1.50 2.2500
## [3,]    1 2.25 5.0625
```

For the same reason, complex numbers work too. It only works with numbers, though, so for character matrices you’re stuck with the `matrix()` function for now.

The `%<-%` operator in `Massign` makes life a little easier for `R` programmers who want to quickly construct a relatively small matrix for prototyping or learning. Take this piece of code, which generates draws from an 8-dimensional multivariate normal distribution with positive correlations:

```
library(Massign)
library(MASS)
S %<-% " 1,
        .5,  1
        .5, .5,  1
        .5, .5, .5,  1
        .5, .5, .5, .5,  1
        .6, .6, .6, .6, .6,  1
        .5, .5, .5, .5, .5, .5,  1
        .4, .4, .4, .4, .4, .4, .4,  1"
X <- mvrnorm(10, mu = rep(0, 8), Sigma = S)
```

The package is on CRAN, so you can install it as follows:

`install.packages("Massign")`

If you have any complaints, tips, issues, or suggestions, you can submit an issue on the project’s GitHub page. Here you can also view the source code of the package!

How do we calculate a sample mean? This is probably one of the most basic questions in statistics, with as its common answer the following: Given a vector of sample values \(\bf{x}\) of length \(N\), the sample mean \(m\) is defined as \[m = \frac{1}{N}\sum^N_{i=1}x_i\]

But this first assumption, *given a vector of sample values*, recently did not hold for me. The sheer size of the vector I needed to process meant it did not fit in my computer’s memory. Even worse, I had no way of knowing exactly how big this vector would be! My question was thus: how do I calculate the mean of a stream of sample values of undetermined length?

The solution for the mean turns out to be rather simple. We initialise the value of the mean to 0, and then *update* our current guess of the mean with the next value’s weighted deviation from the current mean. This happens like so:

\[m_{i} = m_{i-1} + \frac{(x_i-m_{i-1})}{i}\]

Reading the formula in words: the \(i^{th}\) mean is the \((i-1)^{th}\) mean plus the deviation of the \(i^{th}\) input value from this current mean, divided by \(i\), i.e., the number of values that have been input so far.

In code this is simple to implement:

```
set.seed(142857)
# let's assume we receive data of length 1234
streamlength <- 1234
# initialise m_i
m_i <- 0
for (i in 1:streamlength) {
  m_prev <- m_i
  x_i <- rnorm(1, 0, 3)
  m_i <- m_prev + (x_i - m_prev) / i
}
print(m_i)
```

`## [1] -0.00210514`

Note that in the above code we never save the full vector \(\bf{x}\); we only ever save the current and previous versions of the mean. This is perfect for an extremely large, variable length input vector such as the one I talked about in the introduction!

```
set.seed(142857)
# let's assume we receive data of length 1234
streamlength <- 1234
# keep the running mean at every step so we can track convergence
means <- numeric(streamlength)
for (i in 1:streamlength) {
  x_i <- rnorm(1, 0, 3)
  means[i] <- ifelse(i == 1,
                     x_i,
                     means[i - 1] + (x_i - means[i - 1]) / i)
}
```

Notice the big changes at the start of the stream, and the smaller changes at the end, asymptotically converging to the “true” mean value that we set here at 0.

We can also see this as a form of Bayesian updating, if we turn the formula around like so:

\[m_{i} = \frac{(x_i-m_{i-1}) + i \cdot m_{i-1}}{i}\]

Here, we can loosely see \(i \cdot m_{i-1}\) as the prior, \((x_i-m_{i-1})\) as our new data/evidence, \(m_i\) as our posterior, and \(i\) as the normalising constant. Cool!

The algorithm above translates nicely into the Bayesian framework, but as with so many algorithms, it can be made much more efficient. It turns out that all we have to do is remember the `sum` of the values input in the stream and a counter `i` that indicates how many values went in. Then, when asked for the mean, all we need to do is compute \(m_i=\frac{\texttt{sum}}{\texttt{i}}\). Simple!

This is better for three reasons:

- It’s simpler.
- It’s less prone to numerical problems: you only perform one operation per query.
- It extends to the variance and higher-order moments. For the variance, we remember the `sum`, the `sum of squares`, and the counter `i`, and use the formula \(Var(X) = E[X^2] - (E[X])^2\): \(s^2_i=\frac{\texttt{sum of squares}}{\texttt{i}}-\left(\frac{\texttt{sum}}{\texttt{i}}\right)^2\). For each higher-order moment, we need to remember a higher power sum in this framework.

Let’s do it!

```
set.seed(142857)
# let's assume we receive data of length 1234
streamlength <- 1234
# initialise
sum <- 0
sumsq <- 0
i <- 0
for (j in 1:streamlength) {
  value <- rnorm(1, 0, 3)
  sum <- sum + value
  sumsq <- sumsq + value^2
  i <- i + 1
}
list(sum = sum,
     sum_of_squares = sumsq,
     i = i,
     mean = sum / i,
     variance = sumsq / i - (sum / i)^2)
```

```
## $sum
## [1] -2.597743
##
## $sum_of_squares
## [1] 10948.2
##
## $i
## [1] 1234
##
## $mean
## [1] -0.00210514
##
## $variance
## [1] 8.872118
```

For the mean this was simple to implement. The question I’m pondering in the back of my mind throughout all this is the following: can *any* statistic be transformed into such a sequential statistic? How does this work for variance? Standard deviation? The median / other quantiles? If you let me know, I’ll be sure to update this blog post with the additions.

The word “bootstrap” comes from an old story about a hero - Baron Munchausen - who is riding around on his horse in a forest and suddenly gets stuck in a swamp. He screams for help but there is no one around who hears his voice! Luckily our hero does not give up and gets a great idea: “what if I just pull myself out of this swamp?”. He grabs the *straps* of his *boots* and pulls himself loose. Fantastic - he just invented bootstrapping.

Physics-defying stories aside, bootstrapping has become a common term for something seemingly impossible or counterintuitive. In this blogpost I will try to generate an intuition for the properties of *statistical bootstrapping* - resampling from your data to approximate resampling from a population.

In order to explain bootstrapping, we need an example. Let’s assume we want to know the average height of all the people in the Netherlands. With the power of `R` we can easily generate a population of 17104879 people (according to CBS, the number of registered inhabitants of the Netherlands as per the creation of this post^{2}). The Dutch are just about the tallest people on the planet: the men are 180.7 centimeters tall, on average^{3}.

Here is some more information about my population. For the sake of simplicity let’s assume that all inhabitants are actually men (which would be a disaster). The tallest Dutch man is 223 centimeters^{4} (which is very tall) and the shortest Dutch man is ridiculously hard to find on the internet. I have an intuition that it is further away from the mean of 180.7, which implies some negative skewness, but that’s not what this post is about so let’s also assume we have a non-skewed normal distribution.

After some fiddling with the standard deviation variable, I simulated the following population:

```
# generate population data
set.seed(3665364)
pop <- rnorm(17104879, mean = 180.7, sd = 7.5)
```

Let’s look how tall the tallest man from the hypothetical all-men Netherlands is, along with some other statistics!

```
## Statistics about the population:
## --------------------------------
## The shortest person is 142.0121 cm tall.
## Mean height is 180.6979 cm.
## Median height is 180.698 cm.
## The 5th and 95th percentiles are 168.3612 193.0287 cm.
## The tallest person is 222.2139 cm tall.
```

That seems close enough to something I’d consider a population.

Normally, when we want to estimate some population parameter such as the mean height, we cannot measure all persons. Therefore, we create a *representative sample* of persons from this population that we *can* measure. The size of our sample (\(N\)) should depend on how precise we want our final estimate to be - the *standard error* of a statistic depends directly on the number of persons in our sample. For the mean, the standard error is usually calculated as follows:

\[ se_\bar{x} = \frac{\hat{\sigma}}{\sqrt{N}}\text{, where } \hat{\sigma} = \sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_i-\bar{x})^2} \]

With increasing \(N\), the \(\hat{\sigma}\) becomes smaller, and the \(se_\bar{x}\) becomes smaller as well. In words: with a larger sample comes an increase in precision (a reduction in error).
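As a minimal sketch of this formula, here is the analytic standard error for one simulated sample of 100 heights (using the same population parameters as above):

```
# analytic standard error of the mean for a simulated sample (a sketch)
set.seed(1)
x  <- rnorm(100, mean = 180.7, sd = 7.5)
se <- sd(x) / sqrt(length(x))
se  # close to 7.5 / sqrt(100) = 0.75
```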

We can display the precision in the form of *confidence intervals* (CI). A \(p\%\) CI around an estimate indicates the area in which, upon infinitely repeated sampling, the true parameter lies \(p\%\) of the time. In other words, if we were to redo our experiment an infinite number of times, a 95% confidence interval would *cover* the true parameter in 95% of the replications.^{5}

In real life, there is of course no such thing as infinite resampling; we only sample once and use the CI as an indication of precision. Let’s look at how the precision of the mean estimate increases as we take larger samples from the above population.

```
# randomly sample from population
small <- sample(pop, 20)
medium <- sample(pop, 100)
large <- sample(pop, 1000)
```

As you can see, the CI from each sample covers the true population value in this case. We can create these confidence intervals because we know the *sampling distribution* of the mean - the distribution of the mean that arises upon repeated sampling. For the mean of a normally distributed population, the sampling distribution is the familiar Student’s *t*-distribution, whose probability density we use to determine our 95% CI (hence the familiar value `1.96`).

But what if we don’t exactly know what the sampling distribution is? For example, what happens if we do not have a normally distributed population? It turns out there is another way of generating CIs.

The bootstrap has these steps:

- approximate the sampling distribution by taking the mean of each of \(n\) repeated samples, drawn with replacement from your original sample.

Huh? Only one step? Is it that simple?

Yes:

```
# Let's bootstrap 10000 times!
mean_sampling_distribution <- numeric(10000)
for (i in 1:10000) {
  bootstrap_sample <- sample(medium, replace = TRUE)
  mean_sampling_distribution[i] <- mean(bootstrap_sample)
}
```

Great! Now we have an *empirical* sampling distribution. What do we do now in order to get an estimate of our precision in the form of a confidence interval? That is simple too: sort the means attained from the bootstrap samples from low to high, and look at the 2.5th and 97.5th percentile. This yields a 95% bootstrap CI!

```
CIbootstrap <- quantile(mean_sampling_distribution,
                        probs = c(0.025, 0.975))
```

As you can see, that’s indeed very close to the original theoretical CI.

The advantage of this method is that it does not require the researcher to know the exact form of the sampling distribution. We can now create CIs (and thereby an estimate of precision) for nearly any statistic we can think of, such as (a) the fifth percentile, (b) the median, (c) the ninety-ninth percentile, or (d) this completely arbitrary statistic called the *van kesteren measure* that I just came up with: \[\hat{k} = \bar{x}\cdot\frac{1}{N}\sum_{i=1}^N |x_i^\frac{1}{3}-\sqrt{\text{median}(x)}|\]

Probably there are analytical solutions to the sampling distributions of these measures (and I think it is likely that they have been derived at some point in the 1940s) but I don’t know them so I’ll bootstrap:

```
vkmeasure <- function(x) {
  return(mean(x) / length(x) * sum(abs(x^(1/3) - sqrt(median(x)))))
}

perc5 <- median <- perc99 <- vkmeas <- numeric(10000)
for (i in 1:10000) {
  bootstrap_sample <- sample(large, replace = TRUE)
  perc5[i] <- quantile(bootstrap_sample, probs = 0.05)
  median[i] <- median(bootstrap_sample)
  perc99[i] <- quantile(bootstrap_sample, probs = 0.99)
  vkmeas[i] <- vkmeasure(bootstrap_sample)
}
```

We have magically done away with a problem - that of distributional assumptions - by pulling ourselves up from our bootstraps, not unlike our friend Baron Munchausen! Fantastic! Or is it?

Of course, there are some downsides to this approach; it is not a magical solution that gets rid of distributional assumptions in any situation. As you can see in the Fifth Percentile graph in the image above, there is some discreteness to the distribution that we do not expect in the actual sampling distribution of the fifth-percentile statistic:

```
# Sample 10000 times from the ACTUAL population
perc5actual <- numeric(10000)
for (i in 1:10000) {
  smp <- sample(pop, 1000)
  perc5actual[i] <- quantile(smp, probs = 0.05)
}
```

In this case the bootstrap sampling distribution does not work so well for determining the uncertainty or a 95% confidence interval around the statistic of interest. As seen in this figure, the actual sampling distribution looks quite different from the one inferred from the bootstrap. How would we solve this problem? There are several ways:

- find the analytical form of the sampling distribution and construct a CI based on that
- collect a larger sample, which will contain more distinct values (often infeasible)
- smooth the bootstrap procedure, for example by adding some random noise to the samples^{6}
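The third option can be sketched in a few lines of `R`. This is a minimal sketch, not the post's own code: the stand-in sample `x` and the choice of `bw.nrd0()` for the noise bandwidth are my own assumptions.

```r
# Smoothed bootstrap: add a little random noise to each resampled value,
# so the bootstrap distribution is not restricted to the observed values.
set.seed(45)
x <- rchisq(1000, df = 3)  # stand-in for the sample used in the post
h <- bw.nrd0(x)            # rule-of-thumb kernel bandwidth as the noise sd

perc5smooth <- numeric(10000)
for (i in 1:10000) {
  smp <- sample(x, replace = TRUE) + rnorm(length(x), sd = h)
  perc5smooth[i] <- quantile(smp, probs = 0.05)
}

# smoothed bootstrap 95% CI for the fifth percentile
quantile(perc5smooth, probs = c(0.025, 0.975))
```

The added noise smears out the discrete jumps in the bootstrap distribution, at the price of an extra tuning parameter (the bandwidth).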

The bootstrap is a fantastic means of nonparametrically determining the uncertainty of your estimate, but it should be used with care. Inspect the resulting distribution for discreteness such as in the 5th percentile graph above. And mathematical statistics is not completely obsolete ;)

Source: http://en.citizendium.org/wiki/Image:Dore-Munchausen-pull.jpg↩

Source: https://www.cbs.nl/nl-nl/visualisaties/bevolkingsteller at the time of writing this post.↩

Source: https://is.gd/cbsdata↩

Quick Google search source: https://www.langzijn.nl/tag/langste-man-nederland↩

This explanation follows a frequentist framework of statistics. For an interesting sidestep into Bayesian credible intervals, do read this paper: http://doi.org/10.3758/s13423-015-0947-8↩

Yes, the source is a wikipedia link: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Smoothed_bootstrap↩

The work of Verena Zuber and Korbinian Strimmer^{1} ^{2} ^{3} has inspired me to create this post. The journal articles I linked to in the footnote are absolutely worth reading! I certainly enjoyed them. In this blog, I try to convey my understanding of their work on gene selection and data whitening / decorrelation as a preprocessing step.

I stumbled upon whitening through my thesis research. In my thesis, I am looking at filter methods for feature selection in high-dimensional data, specifically in microarray (genetic) data. There are many different microarray and gene sequencing methods, but for simplicity let’s assume that microarray data is *information on the level of gene expression* for each hybridised^{4} gene. The goal with these data is often classification into two groups, e.g., malignant or benign. Because the high-dimensional nature of these data does not allow us to build a simple classification model (sometimes over 20 000 genes are hybridised!), we need to *select* genes which are important for classification^{5}.

Let’s take an example: we want to classify tissue into two categories: green and blue. For this, we collect pieces of green and blue tissue from as many participants (\(n\)) as possible, and we process those pieces to get their high-dimensional genomic microarray data. What results is an \(n \times p\) data matrix, where \(p\) is the number of columns, or genes hybridised^{6}. Our task is to select the subset of \(q \ll p\) genes (features) which predict the classes best.

Aside from using black-box methods such as regularisation, support vector machines, or random forests, the simplest way of selecting the subset of \(q\) genes is through *filter methods*. Many filter methods exist^{7}, but the most straightforward one is as follows: select the \(k\) genes with the highest *differential expression*, that is, \(\text{abs}(\mu_{green}-\mu_{blue})\). The intuition is this: genes whose mean expression differs a lot between groups are very “predictive” of the class their tissues come from. For example, take the two hypothetical genes with the expression levels below:

The gene with the small differential expression has more overlap between classes. Hence, if we would classify based on this gene with a method such as LDA^{8} or logistic regression, our misclassification rate would be higher.

There is a problem with this approach: the variance of gene expression differs from gene to gene. Not taking this into account might mean that you consider a gene with a high mean difference but an even higher variance more important than a gene with a moderate mean difference but a low variance. Luckily, this problem was solved ages ago, by using the following quantity instead of the simple mean difference: \[ \frac{\mu_{green}-\mu_{blue}}{\sigma} \cdot c \], where \(c = \left( \frac{1}{n_{green}} + \frac{1}{n_{blue}} \right)^{-1/2}\)

Yes, this is a *t*-score. As can be seen from the equation, we are correcting for the variance in the original data. We can do this for many genes \((a, b, ...)\) at once, if we collect the variance of each gene expression in a diagonal matrix and the group means in vectors like so:

\[\mathbf{V} = \begin{bmatrix}\sigma^2_{a} & 0 \\ 0 & \sigma^2_{b}\end{bmatrix}, \quad \vec{\mu}_{green} = \begin{bmatrix} \mu^{a}_{green} \\ \mu^{b}_{green} \end{bmatrix}, \quad \vec{\mu}_{blue} = \begin{bmatrix} \mu^{a}_{blue} \\ \mu^{b}_{blue} \end{bmatrix}\]

Then we could write the t-score equation as follows^{9}:

\[t = c \cdot \mathbf{V}^{-1/2}(\vec{\mu}_{green}-\vec{\mu}_{blue})\]
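As a sketch of this formula in `R` (the toy data and variable names are my own, and I use the pooled within-class variance, which is what a two-sample *t*-score calls for):

```r
# Toy expression data for two hypothetical genes (a, b) in two classes
set.seed(45)
n_green <- 20; n_blue <- 20
green <- cbind(a = rnorm(n_green, mean = 5), b = rnorm(n_green, mean = 3))
blue  <- cbind(a = rnorm(n_blue,  mean = 4), b = rnorm(n_blue,  mean = 3))

cc <- (1 / n_green + 1 / n_blue)^(-1/2)

# V: diagonal matrix of pooled within-class variances
pooled_var <- ((n_green - 1) * apply(green, 2, var) +
               (n_blue  - 1) * apply(blue,  2, var)) / (n_green + n_blue - 2)
V <- diag(pooled_var)

# t = c * V^(-1/2) (mu_green - mu_blue), one t-score per gene
tscores <- cc * solve(sqrt(V)) %*% (colMeans(green) - colMeans(blue))
```

Each entry of `tscores` equals the classical equal-variance two-sample *t* statistic for that gene.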

Using this score is the same as performing a differential expression score analysis on *standardised* data^{10}. In standardisation, for each gene expression vector you subtract the mean and divide by the standard deviation. The resulting vector has a mean of 0 and a standard deviation of 1. Standardising basically *rescales* the variable, so the function in `R` to do this is called `scale()`.
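A quick check of that equivalence, with an arbitrary vector of my own:

```r
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)           # manual standardisation
all.equal(as.numeric(scale(x)), z)   # TRUE: scale() does exactly this
```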

Over and above *t*-score filter feature selection, there is one more issue. This issue is more complex, because unlike the previous issue it lives in multivariate space. Consider the following figure:

In this case, Gene a and Gene b individually have a hard time separating the blue and the green category both on their differential expression scores and on their t-scores. You can visualise this by looking at the *marginal distributions*^{11}.

Multivariately, however, there is little overlap between the green and blue classes. This happens because Gene a and Gene b are *correlated*. To correct for this correlation, we can perform another step over and above standardisation: *whitening*, or *decorrelation*. Hence the title of this blog. In the linear algebra notation of transforming the original data \(x\) to the whitened data \(z\) (specifically using ZCA-cor whitening), it is easy to see why it is an *additional* step:

\[z = \mathbf{P}^{-1/2}\mathbf{V}^{-1/2}x\], where \(\mathbf{P}\) indicates the correlation matrix.

So let’s see what this transformation does. Below you can find a scatterplot of randomly generated correlating bivariate data, much like *one of* the ellipses in the graph above. It moves from raw data in the first panel through standardised data (see the axis scale change) to decorrelated data in the third panel. The variance-covariance matrix used for generating the data was as follows:

\[\mathbf{\Sigma} = \begin{bmatrix}5 & 2.4 \\ 2.4 & 2 \end{bmatrix}\]
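The transformation itself can be sketched in a few lines of `R`, using the \(\mathbf{\Sigma}\) above. This is a sketch under my own assumptions: `MASS::mvrnorm` does the simulation, an eigendecomposition gives \(\mathbf{P}^{-1/2}\), and centering is skipped because the simulated means are zero.

```r
set.seed(45)
Sigma <- matrix(c(5, 2.4, 2.4, 2), nrow = 2)
X <- MASS::mvrnorm(1000, mu = c(0, 0), Sigma = Sigma)

V    <- diag(apply(X, 2, var))  # diagonal matrix of sample variances
Xstd <- X %*% solve(sqrt(V))    # standardise: V^(-1/2) x

P <- cor(X)                     # sample correlation matrix
e <- eigen(P)
Pinvsqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

Z <- Xstd %*% Pinvsqrt          # whiten: P^(-1/2) V^(-1/2) x

round(cov(Z), 2)                # the 2x2 identity matrix: white noise
```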

The third panel shows where the name “whitening” comes from: the resulting data looks like bivariate white noise. So what happens if we perform this transformation to the two-class case? Below I generated this type of data and performed the whitening procedure. I have plotted the marginal distributions for Gene a as well, to show the effect of whitening in univariate space (note the difference in scale).

As can be seen from the plots, the whitened data shows a stronger differentiation between the classes in univariate space: the overlapping area in the marginal distribution is relatively low when compared to that of the raw data. **Taking into account the correlation it has, Gene a thus has more information about the classes than we would assume based on its differential expression or its t-score**.

Using this trick, Zuber and Strimmer (2009) developed the *correlation-adjusted t-score*, or cat score, which extends the *t*-score as follows:

\[\text{cat} = c \cdot \mathbf{P}^{-1/2}\mathbf{V}^{-1/2}(\vec{\mu}_{green}-\vec{\mu}_{blue})\]
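A direct transcription of this formula could look as follows. This is a sketch only: with many genes the sample correlation matrix is singular, so the original paper uses a shrinkage estimate of \(\mathbf{P}\) rather than the plain `cor()` I use here, and a proper implementation is available in the authors' `st` package.

```r
# cat = c * P^(-1/2) V^(-1/2) (mu_green - mu_blue)
catscore <- function(green, blue) {
  n_g <- nrow(green); n_b <- nrow(blue)
  cc  <- (1 / n_g + 1 / n_b)^(-1/2)

  # pooled within-class variances (the diagonal of V)
  pooled_var <- ((n_g - 1) * apply(green, 2, var) +
                 (n_b - 1) * apply(blue,  2, var)) / (n_g + n_b - 2)
  mu_diff <- colMeans(green) - colMeans(blue)

  # inverse square root of the correlation matrix via eigendecomposition
  P <- cor(rbind(green, blue))
  e <- eigen(P)
  Pinvsqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

  cc * Pinvsqrt %*% (mu_diff / sqrt(pooled_var))
}
```

When the genes are uncorrelated, \(\mathbf{P}\) is the identity and the cat score reduces to the ordinary *t*-score.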

In their original paper, they show that this indeed works better than the unadjusted *t*-score in a variety of settings. One limitation is that the procedure assumes equal variances in both classes. This might be something to work on!

If you made it all the way here, congratulations! I hope you learnt something. I certainly did while writing and coding all of this information into a legible format. Let me know what you think via email!

Kessy, A., Lewin, A., & Strimmer, K. (2015). Optimal whitening and decorrelation. arXiv preprint arXiv:1512.00809.↩

Zuber, V., & Strimmer, K. (2009). Gene ranking and biomarker discovery under correlation. *Bioinformatics, 25*(20), 2700-2707.↩

Zuber, V., & Strimmer, K. (2011). High-dimensional regression and variable selection using CAR scores. *Statistical Applications in Genetics and Molecular Biology, 10*(1).↩

Hybridisation is the process of the material (often DNA or RNA) attaching to the cells of the microarray matrix. The more specific material there is, the higher the resulting intensity in that matrix cell.↩

or use more complex methods with other disadvantages↩

Note that the problem of high dimensionality is often denoted the \(p \gg n\) problem↩

Look at this PDF page of the CMA R package user manual↩

Isn’t linear algebra great?↩

All the math comes from Kessy, Lewin, & Strimmer (2015)↩

by collapsing the densities of the green and the blue classes onto the margin (either the x or y axis) we can construct a figure such as the first two images in this post. See this image I blatantly ripped from somewhere for an example of a bivariate distribution decomposed into two marginals↩