Multiple Imputaton - Linear Regression in R

We have discussed the multiple imputation in non-monotone pattern of missingness in the article of Understanding Multiple Imputation in SAS, and sort out how to implement it in SAS. While here, I would like to learn how to use linear regression in multiple imputation to deal with monotone pattern data in R.

In R, we can use the mice package (Multiple Imputation with Chained Equations) to perform multiple imputation where there is an option for the linear regression method.

The chained equations is a variation of a Gibbs Sampler (an MCMC approach) that iterates between drawing estimates of missing values and estimates of parameters for distribution of the variable (both conditional on the other variables).

And what's the linear regression imputation and what are the advantages and disadvantages of it? Please see the below explaination from this article (Multiple Imputation).

In regression imputation, the existing variables are used to predict, and then the predicted value is substituted as if an actually obtained value. This approach has several advantages because the imputation retains a great deal of data over the listwise or pairwise deletion and avoids significantly altering the standard deviation or the shape of the distribution. However, as in a mean substitution, while a regression imputation substitutes a value predicted from other variables, no novel information is added, while the sample size has been increased and the standard error is reduced.

Then let's jump into implementation with the mice package. Here I will use the example data set (low1.sas7bdat) from Mallinckrodt et al. (https://journals.sagepub.com/doi/pdf/10.1177/2168479013501310) which is available via https://www.lshtm.ac.uk/research/centres-projects-groups/missing-data#dia-missing-data.

low1 <- haven::read_sas("./low1.sas7bdat")
head(low1)

## # A tibble: 6 × 6
##   PATIENT POOLINV trt   basval  week change
##     <dbl> <chr>   <chr>  <dbl> <dbl>  <dbl>
## 1    1005 101     2         16     1     -3
## 2    1005 101     2         16     2     -5
## 3    1005 101     2         16     4    -10
## 4    1005 101     2         16     6    -11
## 5    1005 101     2         16     8    -13
## 6    1006 101     2         17     1     -1

As shown above, this is a long format data set, including few variables, and the meaning of them is straightforward to understand literally. In order to meet the mice functions, I will first convert it to wide format with separte variables for each time points (week1 - week8).

low_wide <- low1 %>%
  pivot_wider(names_from = week, 
              names_prefix = "week",
              values_from = change) %>%
  select(-POOLINV)
head(low_wide)

## # A tibble: 6 × 8
##   PATIENT trt   basval week1 week2 week4 week6 week8
##     <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1    1005 2         16    -3    -5   -10   -11   -13
## 2    1006 2         17    -1    -2    -6   -10   -12
## 3    1008 1         32    -6   -12   -17   -20   -22
## 4    1011 1         18    -1    -5    -8    NA    NA
## 5    1012 1         22    -6    -9   -13   -16   -17
## 6    1015 2         29    -6   -14   -14   -20   -25

Then let's have a look at the missing pattern of this example.

low_wide %>%
  select(basval, week1, week2, week4, week8) %>%
  mutate(across(1:5, function(x) {
    if_else(is.na(x), ".", "X")
  })) %>%
  group_by_all() %>%
  count(name = "Freq") 

## # A tibble: 4 × 6
## # Groups:   basval, week1, week2, week4, week8 [4]
##   basval week1 week2 week4 week8  Freq
##   <chr>  <chr> <chr> <chr> <chr> <int>
## 1 X      X     .     .     .         4
## 2 X      X     X     .     .         3
## 3 X      X     X     X     .         9
## 4 X      X     X     X     X       184

Besides you can also use the mice::md.pattern(low_wide) function to display the missing patterns that is similar as the above outputs.

As shown above, the 'X' marks indicate that the data point in a certain visit is completed and '.' marks indicate the data is fully missing. So we can suppose that the example data is monotonic type and missingness is MAR for later analysis purposes.

Now we will generate 5 imputed datasets via mice function by setting m=5 and method = 'norm.predict' (called Linear regression through prediction).

low_imp <- mice(low_wide, method = "norm.predict")
summary(low_imp)

If you would like to check if the missingness are all completed, via low_imp$imp. And you can also specify which imputed datasets to use via setting action argument. The action = 0 will return the orginal dataset with missing values, and action = 1 corresponds to the first imputed datasets.

low_imp_1 <- complete(low_imp, action = 1)

And maybe we'd like to have all imputed datasets in a long format that will be easy to handle and analyze at some point.

low_imp_res <- complete(low_imp, action = "long", include = TRUE)

If you already have an imputed dataset with long format from another imputation method, mice::as.mids() would be a helpful function that can convert it to an object with mids class for the further analysis in mice. The mids class should contains the orginal data as well as imputed datasets with .imp and .id columns inside.

mids <- as.mids(low_imp_res)

The second step is to fit the ANCOVA model for week8 time point with treatment (trt) as independent variable, change from baseline in week 8 (week8) as response variable, and baseline (basval) as covariates.

mods <- with(
  low_imp, 
  lm(week8 ~ basval + trt)
)

The mods is an object with mira class that contains the call and fitted model object for each one of the imputations.

And the final step is to integrate the results from ANCOVA models using Rubin's Rules (https://bookdown.org/mwheymans/bookmi/rubins-rules.html), which is also the method used by proc mianalyze in SAS

pool_res <- pool(mods)
summary(pool_res)

##          term   estimate  std.error  statistic       df      p.value
## 1 (Intercept)  0.5759012 1.73148889  0.3326046 194.3680 7.397912e-01
## 2      basval -0.5329080 0.08089651 -6.5875270 194.3860 4.086519e-10
## 3        trt2 -1.7308118 0.66660413 -2.5964612 194.7855 1.013719e-02

We can see the pooled estimations like coefficient and standard error in the above output, which is obtained by coef() and vcov() functions from pooled models such as lm here. The estimated treatment coefficient is around -1.73 with a standard error around 0.67.

Actually the coefficient and standard error are not the final results we would like to display in the clinical report. We should get the LS-means estimation for each group, and do contrast between two groups and hypothesis test. That will be discussed in the next topic.

Reference

Multiple Imputaton: Linear Regression Flexible Imputation of Missing Data
Missing Data (Rough) Notes
Multiple Imputation with the mice Package
Multiple imputation without a specialist R package
Multiple Imputation