## Introduction

The work of Verena Zuber and Korbinian Strimmer1 2 3 has inspired me to create this post. The journal articles I linked to in the footnote are absolutely worth reading! I certainly enjoyed them. In this blog, I try to convey my understanding of their work on gene selection and data whitening / decorrelation as a preprocessing step.

## Gene selection

I stumbled upon whitening through my thesis research. In my thesis, I am looking at filter methods for feature selection in high-dimensional data, specifically in microarray (genetic) data. There are many different microarray and gene sequencing methods, but for simplicity let’s assume that microarray data is information on the level of gene expression for each hybridised4 gene. The goal with these data is often classification into two groups, e.g., malignant or benign. Because the high-dimensional nature of these data does not allow us to build a simple classification model (sometimes over 20 000 genes are hybridised!), we need to select genes which are important for classification5.

Let’s take an example: we want to classify tissue in two categories: green and blue. For this, we collect pieces of green and blue tissue from as many participants ($$n$$) as possible, and we process those pieces to get their high-dimensional genomic microarray data. What results is an $$n \times p$$ data matrix, where $$p$$ is the amount of columns or genes hybridised6. Our task is to select the subset $$q \in p$$ of genes (features) which can predict the classes best.

## Filtering

Aside from using black-box methods such as regularisation, support vector machines, or random forests, the most simple way of selecting the subset $$q$$ is through filter methods. Many filter methods exist7, but the most straightforward one is as follows: Select the $$k$$ genes with the highest differential expression, that is $$\text{abs}(\mu_{green}-\mu_{blue})$$. The intuition behind this is this: genes that vary a lot across groups are very “predictive” of the class which their objects of study come from. For example, take the two hypothetical genes with expression levels below:

The gene with the small differential expression has more overlap between classes. Hence, if we would classify based on this gene with a method such as LDA8 or logistic regression, our misclassification rate would be higher.

## Correcting for variance

There is a problem with this approach: the variance of gene expression might differ. Not taking this into account might mean that you consider a gene with high mean difference and even higher variance to be more important than a gene with moderate mean difference but a low variance. Luckily, this problem has been solved ages ago, by using the following quantity instead of the simple mean difference: $\frac{\mu_{green}-\mu_{blue}}{\sigma} \cdot c$, where $$c = \left( \frac{1}{n_{green}} + \frac{1}{n_{blue}} \right)^{-1/2}$$

Yes, this is a t-score. As can be seen from the equation, we are correcting for the variance in the original data. We can do this for many genes $$(a, b, ...)$$ at once, if we collect the variance of each gene expression in a diagonal matrix and the group means in vectors like so:

$\mathbf{V} = \begin{bmatrix}\sigma_{a} & 0 \\ 0 & \sigma_{b}\end{bmatrix}, \quad \vec{\mu}_{green} = \begin{bmatrix} \mu^{a}_{green} \\ \mu^{b}_{green} \end{bmatrix}, \quad \vec{\mu}_{blue} = \begin{bmatrix} \mu^{a}_{blue} \\ \mu^{b}_{blue} \end{bmatrix}$

Then we could write the t-score equation as follows9:

$t = c \cdot \mathbf{V}^{-1/2}(\vec{\mu}_{green}-\vec{\mu}_{blue})$

Using this score is the same as performing a differential expression score analysis on standardised data10. In standardisation, for each gene expression vector you would subtract the mean and divide by the standard deviation. The resulting vector has a standard deviation of 1 and a mean of 0. If you standardise, you basically rescale the variable, so the function in R to do this is called scale().

## Whitening

Over and above t-score filter feature selection, there is one more issue. This issue is more complex, because unlike the previous issue it lives in multivariate space. Consider the following figure: