Estimated LS-means from Multiple Imputation of mice using emmeans package

Continue with the question in the previous article (Multiple Imputaton - Linear Regression in R), where we just discussed how to compute the pooled coefficients of ANCOVA using mice package but left out the Ls-means and hypothesis test. Luckly I find out that emmeans package have wrapped this process inside so we can use it to obtain the pooled Ls-means estimation and p-value straightforward wihout pool function of mice.

Supposed that we have fitted the ANCOVA for imputed datasets and get the fitted models mods for each imputation here. Then I will use the emmeans::emmeans() function to estimate the ls-means, which is not the indivival estimate for each imputation but rather the pooled one. The pool process remains to use the Rubin's Rules.

ems <- emmeans::emmeans(mods, ~trt)
data.frame(ems)

##   trt    emmean        SE       df  lower.CL   upper.CL
## 1   1 -10.55330 0.4696385 194.9729 -11.47953  -9.627078
## 2   2 -12.39073 0.4698075 194.8239 -13.31729 -11.464169

Afterwards using the emmeans::contrast() function to do the contrasts analysis and get the CI and p-value for the difference (trt2 - trt1).

conr <- emmeans::contrast(ems, method = list(c(-1, 1)), adjust = "none")
conr_test <- emmeans::test(conr)
data.frame(conr_test)

##   contrast  estimate        SE       df   t.ratio     p.value
## 1 c(-1, 1) -1.837429 0.6657861 194.8238 -2.759788 0.006335778

If we want to validate the above result using SAS procedure, we must first export the csv from the low_imp_res dataframe that we created in the last article.

write.csv(low_imp_res, file = "./low_imputed.csv", row.names = FALSE, na = "")

If we want to validate the above result using SAS procedure, we must first export the csv from the low_imp_res dataframe that we created in the last article. And than fit the ANCOVA model with proc mixed to obtain the ls-means estimate(lsm and diff) for each imputation. Finally use proc mianalyze to pool the results of all imputation for ls-means (comb_lsm) and difference (comb_diff) within two groups.

proc import datafile="&_projpth.\02_Raw Data\low_imputed.csv" 
    out=low_imp(where=(imp ne 0))
    dbms=csv replace;
    getnames = yes;
run;

ods output lsmeans=lsm diffs=diff; 
proc mixed data=low_imp; 
    by imp; 
    class trt / ref=first; 
    model week8=basval trt /ddfm=kr; 
    lsmeans trt / cl pdiff diff;
run;

proc sort data=lsm; by trt; run;
ods output ParameterEstimates=comb_lsm; 
proc mianalyze data=lsm; 
    by trt;
    modeleffects estimate; 
    stderr stderr;
run;

ods output ParameterEstimates=comb_diff; 
proc mianalyze data=diff;
    by trt _trt;
    modeleffects estimate; 
    stderr stderr;
run;

The SAS output can be seen below.

We can observe that the estimate and SE from SAS are consistent with R, but there is a significant discrepancy in DF (degress of freedom). In R df is 197 while in SAS it is 3.94E6. That's because there are different methods to calculate the the df, an older one is used in SAS and adjusted version is used in the mice package. That will also lead to different p-values. For more details can be found in Degrees of Freedom and P-values of Rubins-Rules.

Someone may be curious how to calculate the df in mice and SAS. Let’s repeat the process of calculation using the formulas in above link. The specific formulas are not shown here, which is very easy to understand. So I just convert them as the R code.

The older method used in SAS to calculate the df for the t-distribution is defined as (Rubin (1987), Van Buuren (2018)). The specific formulas are not shown here, which is very easy to understand. So I just convert them as the R code. From below, the lambda can be derived from the between (Vb) and total (Vt) missing data variance, and the m represents the number of imputed datasets. The rounded value is 3.94E6 that is equal to the results in SAS.

m <- 5
Vb <- 0.000372
Vt <- 0.443271

lambda <- (Vb + Vb / m) / Vt
df_old <- (m - 1) / lambda^2
df_old
## [1] 3944121

As the above df is too larger for the pooled result, compared to the dfs in each imputed dataset, which is inappropriate. Barnard and Rubin (1999) adjusted this df by using a new formula (See formula 9.9 in that article.). We should compute the Observed df (df_obs) and then adjusted df (df_adj) where n represents the sample size in the imputed datasets, and k the number of parameters in the fitted model (in my case, there are 3 parameters).

n <- 200
k <- 3
m <- 5
df_obs <- (((n - k) + 1) / ((n - k) + 3)) * (n - k) * (1 - lambda)
df_adj <- (df_old * df_obs) / (df_old + df_obs)
df_adj
## [1] 194.824

Finally we can look at the df for the pooled estimates from pool_res using pool() function. The df for trt2 term is about equal to our computation, as seen below.

pool_res <- pool(mods)
pool_res
## Class: mipo    m = 5 
##          term m   estimate        ubar            b           t dfcom       df         riv
## 1 (Intercept) 5  0.6767737 2.982354968 6.679090e-03 2.990369876   197 194.4394 0.002687443
## 2      basval 5 -0.5354029 0.006510448 1.426767e-05 0.006527569   197 194.4534 0.002629804
## 3        trt2 5 -1.8374286 0.442824415 3.722857e-04 0.443271157   197 194.8238 0.001008849
##        lambda        fmi
## 1 0.002680240 0.01278278
## 2 0.002622906 0.01272531
## 3 0.001007832 0.01110765

updated from 2024-05-28

The df calculation for pooling process in emmeans package for mina class has kept the consistency with mice package, using the the Barnard-Rubin adjustment for small samples (Barnard and Rubin, 1999) that mentioned in the pool() documents, see https://github.com/rvlenth/emmeans/issues/494. Thus we can get the same df in either the mice or emmeans packages. All above results have been updated.