## Introduction

The work of Verena Zuber and Korbinian Strimmer^{1} ^{2} ^{3} has inspired me to create this post. The journal articles I linked to in the footnote are absolutely worth reading! I certainly enjoyed them. In this blog, I try to convey my understanding of their work on gene selection and data whitening / decorrelation as a preprocessing step.

## Gene selection

I stumbled upon whitening through my thesis research. In my thesis, I am looking at filter methods for feature selection in high-dimensional data, specifically in microarray (genetic) data. There are many different microarray and gene sequencing methods, but for simplicity let’s assume that microarray data is *information on the level of gene expression* for each hybridised^{4} gene. The goal with these data is often classification into two groups, e.g., malignant or benign. Because the high-dimensional nature of these data does not allow us to build a simple classification model (sometimes over 20 000 genes are hybridised!), we need to *select* genes which are important for classification^{5}.

Let’s take an example: we want to classify tissue in two categories: green and blue. For this, we collect pieces of green and blue tissue from as many participants (\(n\)) as possible, and we process those pieces to get their high-dimensional genomic microarray data. What results is an \(n \times p\) data matrix, where \(p\) is the amount of columns or genes hybridised^{6}. Our task is to select the subset \(q \in p\) of genes (features) which can predict the classes best.

## Filtering

Aside from using black-box methods such as regularisation, support vector machines, or random forests, the most simple way of selecting the subset \(q\) is through *filter methods*. Many filter methods exist^{7}, but the most straightforward one is as follows: Select the \(k\) genes with the highest *differential expression*, that is \(\text{abs}(\mu_{green}-\mu_{blue})\). The intuition behind this is this: genes that vary a lot across groups are very “predictive” of the class which their objects of study come from. For example, take the two hypothetical genes with expression levels below:

The gene with the small differential expression has more overlap between classes. Hence, if we would classify based on this gene with a method such as LDA^{8} or logistic regression, our misclassification rate would be higher.

## Correcting for variance

There is a problem with this approach: the variance of gene expression might differ. Not taking this into account might mean that you consider a gene with high mean difference and even higher variance to be more important than a gene with moderate mean difference but a low variance. Luckily, this problem has been solved ages ago, by using the following quantity instead of the simple mean difference: \[ \frac{\mu_{green}-\mu_{blue}}{\sigma} \cdot c \], where \(c = \left( \frac{1}{n_{green}} + \frac{1}{n_{blue}} \right)^{-1/2}\)

Yes, this is a *t*-score. As can be seen from the equation, we are correcting for the variance in the original data. We can do this for many genes \((a, b, ...)\) at once, if we collect the variance of each gene expression in a diagonal matrix and the group means in vectors like so:

\[\mathbf{V} = \begin{bmatrix}\sigma_{a} & 0 \\ 0 & \sigma_{b}\end{bmatrix}, \quad \vec{\mu}_{green} = \begin{bmatrix} \mu^{a}_{green} \\ \mu^{b}_{green} \end{bmatrix}, \quad \vec{\mu}_{blue} = \begin{bmatrix} \mu^{a}_{blue} \\ \mu^{b}_{blue} \end{bmatrix}\]

Then we could write the t-score equation as follows^{9}:

\[t = c \cdot \mathbf{V}^{-1/2}(\vec{\mu}_{green}-\vec{\mu}_{blue})\]

Using this score is the same as performing a differential expression score analysis on *standardised* data^{10}. In standardisation, for each gene expression vector you would subtract the mean and divide by the standard deviation. The resulting vector has a standard deviation of 1 and a mean of 0. If you standardise, you basically *rescale* the variable, so the function in `R`

to do this is called `scale()`

.

## Whitening

Over and above *t*-score filter feature selection, there is one more issue. This issue is more complex, because unlike the previous issue it lives in multivariate space. Consider the following figure:

In this case, Gene a and Gene b individually have a hard time separating the blue and the green category both on their differential expression scores and on their t-scores. You can visualise this by looking at the *marginal distributions*^{11}.

Multivariately, however, there is little overlap between the green and blue classes. This happens because Gene a and Gene b are *correlated*. To correct for this correlation, we can perform another step over and above standardisation: *whitening*, or *decorrelation*. Hence the title of this blog. In the linear algebra notation of transforming the original data \(x\) to the whitened data \(z\) (specifically using ZCA-cor whitening), it is easy to see why it is an *additional* step:

\[z = \mathbf{P}^{-1/2}\mathbf{V}^{-1/2}x\], where \(\mathbf{P}\) indicates the correlation matrix.

So let’s see what this transformation does. Below you can find a scatterplot of randomly generated correlating bivariate data, much like *one of* the ellipses in the graph above. It moves from raw data in the first panel through standardised data (see the axis scale change) to decorrelated data in the third panel. The variance-covariance matrix used for generating the data was as follows:

\[\mathbf{\Sigma} = \begin{bmatrix}5 & 2.4 \\ 2.4 & 2 \end{bmatrix}\]

The third panel shows where the name “whitening” comes from: the resulting data looks like bivariate white noise. So what happens if we perform this transformation to the two-class case? Below I generated this type of data and performed the whitening procedure. I have plotted the marginal distributions for Gene a as well, to show the effect of whitening in univariate space (note the difference in scale).

As can be seen from the plots, the whitened data shows a stronger differentiation between the classes in univariate space: the overlapping area in the marginal distribution is relatively low when compared to that of the raw data. **Taking into account the correlation it has, Gene a thus has more information about the classes than we would assume based on its differential expression or its t-score**.

## cat score

Using this trick, Zuber and Strimmer (2009) developed the *correlation-adjusted t-score*, or cat score, which extends the *t*-score as follows:

\[\text{cat} = c \cdot \mathbf{P}^{-1/2}\mathbf{V}^{-1/2}(\vec{\mu}_{green}-\vec{\mu}_{blue})\]

In their original paper, they show that this indeed works better than the unadjusted t-score in a variety of settings. One assumption that this procedure has is that it assumes equal variance in both classes. This might be something to work on!

If you made it all the way here, congratulations! I hope you learnt something. I certainly did while writing and coding all of this information into a legible format. Let me know what you think via email!

Kessy, A., Lewin, A., & Strimmer, K. (2015). Optimal whitening and decorrelation. arXiv preprint arXiv:1512.00809.↩

Zuber, V., & Strimmer, K. (2009). Gene ranking and biomarker discovery under correlation.

*Bioinformatics, 25*(20), 2700-2707.↩Zuber, V., & Strimmer, K. (2011). High-dimensional regression and variable selection using CAR scores.

*Statistical Applications in Genetics and Molecular Biology, 10*(1).↩Hybridisation is the process of the material (often dna or rna) attaching to the cells of the microarray matrix. The more specific material there is, the higher the resulting intensity in that matrix cell↩

or use more complex methods with other disadvantages↩

Note that the problem of high dimensionality is often denoted the \(n \gg p\) problem↩

Look at this pdf page of the CMA r package user manual↩

Isn’t linear algebra great?↩

All the math comes from Kessy, Lewin, & Strimmer (2015)↩

by collapsing the densities of the green and the blue classes onto the margin (either the x or y axis) we can construct a figure such as the first two images in this post. See this image I blatantly ripped from somewhere for an example of a bivariate distribution decomposed into two marginals↩