Continue with the question in the previous article (Multiple Imputaton - Linear Regression in R), where we just discussed how to compute the pooled coefficients of ANCOVA using mice
package but left out the Ls-means and hypothesis test. Luckly I find out that emmeans
package have wrapped this process inside so we can use it to obtain the pooled Ls-means estimation and p-value straightforward wihout pool
function of mice
.
Supposed that we have fitted the ANCOVA for imputed datasets and get the fitted models mods
for each imputation here. Then I will use the emmeans::emmeans()
function to estimate the ls-means, which is not the indivival estimate for each imputation but rather the pooled one. The pool process remains to use the Rubin's Rules.
ems <- emmeans::emmeans(mods, ~trt)
data.frame(ems)
## trt emmean SE df lower.CL upper.CL
## 1 1 -10.55330 0.4696385 194.9729 -11.47953 -9.627078
## 2 2 -12.39073 0.4698075 194.8239 -13.31729 -11.464169
Afterwards using the emmeans::contrast()
function to do the contrasts analysis and get the CI and p-value for the difference (trt2 - trt1
).
conr <- emmeans::contrast(ems, method = list(c(-1, 1)), adjust = "none")
conr_test <- emmeans::test(conr)
data.frame(conr_test)
## contrast estimate SE df t.ratio p.value
## 1 c(-1, 1) -1.837429 0.6657861 194.8238 -2.759788 0.006335778
If we want to validate the above result using SAS procedure, we must first export the csv from the low_imp_res
dataframe that we created in the last article.
write.csv(low_imp_res, file = "./low_imputed.csv", row.names = FALSE, na = "")
If we want to validate the above result using SAS procedure, we must first export the csv from the low_imp_res
dataframe that we created in the last article. And than fit the ANCOVA model with proc mixed
to obtain the ls-means estimate(lsm
and diff
) for each imputation. Finally use proc mianalyze
to pool the results of all imputation for ls-means (comb_lsm
) and difference (comb_diff
) within two groups.
proc import datafile="&_projpth.\02_Raw Data\low_imputed.csv"
out=low_imp(where=(imp ne 0))
dbms=csv replace;
getnames = yes;
run;
ods output lsmeans=lsm diffs=diff;
proc mixed data=low_imp;
by imp;
class trt / ref=first;
model week8=basval trt /ddfm=kr;
lsmeans trt / cl pdiff diff;
run;
proc sort data=lsm; by trt; run;
ods output ParameterEstimates=comb_lsm;
proc mianalyze data=lsm;
by trt;
modeleffects estimate;
stderr stderr;
run;
ods output ParameterEstimates=comb_diff;
proc mianalyze data=diff;
by trt _trt;
modeleffects estimate;
stderr stderr;
run;
The SAS output can be seen below.
We can observe that the estimate and SE from SAS are consistent with R, but there is a significant discrepancy in DF (degress of freedom). In R df is 197
while in SAS it is 3.94E6
. That's because there are different methods to calculate the the df, an older one is used in SAS and adjusted version is used in the mice
package. That will also lead to different p-values. For more details can be found in Degrees of Freedom and P-values of Rubins-Rules.
Someone may be curious how to calculate the df in mice
and SAS. Let’s repeat the process of calculation using the formulas in above link. The specific formulas are not shown here, which is very easy to understand. So I just convert them as the R code.
The older method used in SAS to calculate the df for the t-distribution is defined as (Rubin (1987), Van Buuren (2018)). The specific formulas are not shown here, which is very easy to understand. So I just convert them as the R code. From below, the lambda
can be derived from the between (Vb
) and total (Vt
) missing data variance, and the m
represents the number of imputed datasets. The rounded value is 3.94E6
that is equal to the results in SAS.
m <- 5
Vb <- 0.000372
Vt <- 0.443271
lambda <- (Vb + Vb / m) / Vt
df_old <- (m - 1) / lambda^2
df_old
## [1] 3944121
As the above df
is too larger for the pooled result, compared to the dfs in each imputed dataset, which is inappropriate. Barnard and Rubin (1999) adjusted this df by using a new formula (See formula 9.9 in that article.). We should compute the Observed df (df_obs
) and then adjusted df (df_adj
) where n
represents the sample size in the imputed datasets, and k
the number of parameters in the fitted model (in my case, there are 3 parameters).
n <- 200
k <- 3
m <- 5
df_obs <- (((n - k) + 1) / ((n - k) + 3)) * (n - k) * (1 - lambda)
df_adj <- (df_old * df_obs) / (df_old + df_obs)
df_adj
## [1] 194.824
Finally we can look at the df for the pooled estimates from pool_res
using pool()
function. The df
for trt2
term is about equal to our computation, as seen below.
pool_res <- pool(mods)
pool_res
## Class: mipo m = 5
## term m estimate ubar b t dfcom df riv
## 1 (Intercept) 5 0.6767737 2.982354968 6.679090e-03 2.990369876 197 194.4394 0.002687443
## 2 basval 5 -0.5354029 0.006510448 1.426767e-05 0.006527569 197 194.4534 0.002629804
## 3 trt2 5 -1.8374286 0.442824415 3.722857e-04 0.443271157 197 194.8238 0.001008849
## lambda fmi
## 1 0.002680240 0.01278278
## 2 0.002622906 0.01272531
## 3 0.001007832 0.01110765
updated from 2024-05-28
The df
calculation for pooling process in emmeans
package for mina
class has kept the consistency with mice
package, using the the Barnard-Rubin adjustment for small samples (Barnard and Rubin, 1999) that mentioned in the pool()
documents, see https://github.com/rvlenth/emmeans/issues/494. Thus we can get the same df
in either the mice
or emmeans
packages. All above results have been updated.