Propagating measurement error uncertainty through multiple imputation
Introduction
Measurement error is becoming more common as automatic data labeling happens at ever larger scales. For example, large language models (LLMs) are now routinely used to turn unstructured data into structured tables, i.e., to measure particular constructs of interest. Our goal with these measurements, at least in the social sciences, is usually to investigate a real-world process or test a theory, generally by performing inference with a statistical model (such as linear regression). In this post, I show how to propagate measurement error arising from the labeling process into such a statistical model, without having to do tedious derivations specific to a particular model type. 🙂 This idea is not new; see, e.g., this paper1 or this chapter2.
Conceptually, assume a true value \(X_{true}\) can be measured only with error, yielding \(X_{obs}\). This error can have two types of components: a fixed (deterministic) component and an independent random (stochastic) component. For our purposes, the fixed error can be an offset (intercept), or it can be any function that is invertible over the domain of the data.
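In symbols, one way to write such an error model (assuming the random component enters additively) is:

\[ X_{obs} = f(X_{true}) + \varepsilon, \qquad \varepsilon \perp X_{true}, \]

where \(f\) is the deterministic distortion, assumed invertible over the domain of the data, and \(\varepsilon\) is the independent stochastic component.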
To correct for this, we are going to learn the deterministic and stochastic error from validation data using a nonparametric model, in our case penalized smoothing splines via the R package mgcv. Then, we will use this model to generate posterior samples from newly obtained (error-prone) measurements. Finally, we fit the model of interest several times and pool the results using an old trick from the multiple imputation literature.
Step 1: Measurement model
Assume we want to perform linear regression with a single predictor \(X\) and a single outcome \(Y\). After collecting validation data in which we obtain both the true value as well as the value with measurement error, we notice that both variables have stochastic and deterministic error:
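To make this concrete, here is a hypothetical way such validation data could be simulated. The generating process, coefficients, and noise levels below are my own illustrative assumptions, not values from the post:

```r
# Illustrative simulation of validation data (all parameters here are
# assumptions for this sketch, not the post's actual data).
set.seed(45)
n <- 500
x_true <- rnorm(n)
y_true <- -0.2 + 0.4 * x_true + rnorm(n, sd = 0.3)

# Observed versions: a deterministic distortion (offset + scaling,
# invertible over the data's domain) plus independent random noise.
x_obs <- 0.5 + 0.8 * x_true + rnorm(n, sd = 0.4)
y_obs <- 1.0 + 0.6 * y_true + rnorm(n, sd = 0.3)
```

With data like this, plotting x_true against x_obs (and likewise for y) makes both error components visible.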
Now, we create a model for each, using our penalized spline estimator. NB: for this measurement model, you can use anything you want, as long as it can do reasonable posterior simulation with good (conditional) coverage. In other words, this should be a model that can (a) capture different types of relations, including nonlinear, (b) predict well (on average) in out-of-sample situations, and (c) accurately represent the residual noise. Penalized splines do all of this for a wide class of data-generating processes.
library(mgcv)  # provides gam() with penalized smoothing splines
mod_x <- gam(x ~ s(x_obs))
mod_y <- gam(y ~ s(y_obs))
Step 2: Generate posterior samples
After collecting the validation data, we gather our “real” data, where no gold-standard “true” measurements are available. We now have a standard missing-data situation. What we do, then, is predict those values from the observed data, using posterior sampling to adequately account for the uncertainty, akin to multiple imputation3. For example, we can sample 200 times. Then we simply run the model of interest on each dataset and collect the results in a list:
sim_dat <- lapply(1:200, function(i) {
  data.frame(
    x = simulate(mod_x, data = data.frame(x_obs = x_obs))[, 1],
    y = simulate(mod_y, data = data.frame(y_obs = y_obs))[, 1]
  )
})
sim_fit <- lapply(sim_dat, function(df) lm(y ~ x, data = df))
We have now produced a set of models based on corrected datasets. To get an intuition for the correction, below is a plot with the true (unobserved, hidden) data, the measured (observed, with error) data, and one posterior sample from the model-corrected data.
In addition, I’ve plotted the regression lines. As shown, the regression line for the corrected data looks a lot like the one for the true data, whereas the regression fitted on the measured (error-prone) data shows a quite different intercept and a flatter slope.
Step 3: Pooling multiple model fits
To perform adequate inference, we need to combine the model fits we created by sampling from the posterior of the measurement model. There is a lot of existing research on this topic, most of it coming from the missing data literature on multiple imputation. Specifically, we can apply Rubin’s pooling rules to combine the parameter estimates and their uncertainties into an overall inference for the parameters of the model.
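As a sketch of what Rubin’s rules compute under the hood, here is a minimal base-R version for a single coefficient, applied to a toy list of fits. The toy data are simulated purely for illustration; in the actual workflow, the list would be the sim_fit object from above:

```r
# Minimal sketch of Rubin's pooling rules for one coefficient.
# The toy fits below stand in for the m posterior-sampled model fits.
set.seed(1)
fits <- lapply(1:20, function(i) {
  x <- rnorm(100)
  y <- 0.5 * x + rnorm(100)
  lm(y ~ x)
})

ests <- sapply(fits, function(f) coef(f)["x"])       # per-fit estimates
wvar <- sapply(fits, function(f) vcov(f)["x", "x"])  # per-fit variances
m    <- length(fits)

qbar <- mean(ests)               # pooled point estimate
ubar <- mean(wvar)               # within-fit (average) variance
b    <- var(ests)                # between-fit variance
tvar <- ubar + (1 + 1/m) * b     # Rubin's total variance
se_pooled <- sqrt(tvar)
```

The between-fit variance b is what carries the measurement-error uncertainty into the pooled standard error; if all fits agreed exactly, the result would reduce to the ordinary standard error.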
First, let’s look at a model based on the unobserved “true” data; our target is to get as close as possible to these estimates:
tidy(lm(y_true ~ x_true))
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -0.210    0.0241     -8.74 3.86e-13
2 x_true         0.425    0.0226     18.8  1.81e-30
Then, let’s look at the model we would get if we naïvely applied regression to the uncorrected observed data:
tidy(lm(y_obs ~ x_obs))
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.811    0.0365      22.2 2.93e-35
2 x_obs          0.322    0.0276      11.7 1.04e-18
Finally, let’s pool the results from our posterior-sampled data.
as_tibble(summary(pool(sim_fit)))
# A tibble: 2 × 6
  term        estimate std.error statistic    df  p.value
  <fct>          <dbl>     <dbl>     <dbl> <dbl>    <dbl>
1 (Intercept)   -0.253    0.0505     -5.01  54.8 6.07e- 6
2 x              0.378    0.0492      7.69  50.3 4.91e-10
The naive model deviates strongly from the true model, whereas the pooled model fares much better in terms of the parameter estimates. The standard errors of the pooled model are larger, reflecting the uncertainty stemming from the stochastic measurement error as captured by the nonparametric measurement model. Pooling via the mice package works for many different inferential models, from simple linear regression to complex mixed-effects models.
Conclusion
By learning a measurement model on data with stochastic and/or deterministic measurement error, we can correct inferences in a downstream analysis through posterior sampling. A model-pooling trick from the multiple imputation / missing data literature is then used to combine the sampled model fits. This approach generalizes to any type of variable and any type of model4, relying on the assumption that the estimated measurement model is correct.