This article aims to learn the basic calculation process of least-squares means (LS means).
I find it difficult to understand what LS actually means in its literal sense.
The definition from lsmeans
package is shown blow, that have been transitioned to emmeans
package.
Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.
In fact, even when I read this sentence, I was still very confused. What's the reference grid, and how to predict?
So let's see how the LS means is calculated, and the corresponding confidence interval as well.
Firstly import CDSIC pliot dataset, the same as the previous blog article - Conduct an ANCOVA model in R for Drug Trial. And then handle with the adsl
and adlb
to create an analysis dataset ana_dat
so that we can use ANCOVA by lm
function. Supposed that we want to see the CHG
(change from baseline) is affected by independent variable TRTP
(treatment) under the control of covariate variables BASE
(baseline) and AGE
(age).
Filter the dataset by BASE
variable as one missing value can be found in dataset.
library(tidyverse)
library(emmeans)
ana_dat2 <- filter(ana_dat, !is.na(BASE))
Then fit the ANCOVA model by lm
function.
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
# Df Sum Sq Mean Sq F value Pr(>F)
# BASE 1 1.699 1.6989 0.9524 0.3322
# AGE 1 0.001 0.0010 0.0006 0.9811
# TRTP 2 8.343 4.1715 2.3385 0.1034
# Residuals 76 135.570 1.7838
We know that the LS means can be calculated according to reference grid that contains the mean of covariables and total factors for independent variables.
rg <- ref_grid(fit)
# 'emmGrid' object with variables:
# BASE = 5.4427
# AGE = 75.309
# TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
The mean of BASE
and AGE
are, as we can see from the table above, 5.4427
and 75.309
, respectively. Or we can calculate manually like:
summary(ana_dat2[,c("BASE", "AGE")])
# BASE AGE
# Min. : 3.497 Min. :51.00
# 1st Qu.: 4.774 1st Qu.:71.00
# Median : 5.273 Median :77.00
# Mean : 5.443 Mean :75.31
# 3rd Qu.: 5.718 3rd Qu.:81.00
# Max. :10.880 Max. :88.00
Then we can use summary()
or predict()
function to get the predicted value based on reference grid rg
.
rg_pred <- summary(rg)
rg_pred
# BASE AGE TRTP prediction SE df
# 5.44 75.3 Placebo 0.0578 0.506 76
# 5.44 75.3 Xanomeline Low Dose -0.1833 0.211 76
# 5.44 75.3 Xanomeline High Dose 0.5031 0.235 76
The prediction column is the same as from predict(rg)
. The prediction table looks like the predicted values of the different factor levels at the constant mean value.
In fact, we can aslo calculate the predicted value as we have the coefficients estimation of the regression equation from fit$coefficients
> fit$coefficients
(Intercept) BASE AGE
-1.11361290 0.11228582 0.00743963
TRTPXanomeline Low Dose TRTPXanomeline High Dose
-0.24108746 0.44531274
As the TRTP
includes multiple factors so it has been converted into dummy variables:
contrasts(ana_dat2$TRTP)
# Xanomeline Low Dose Xanomeline High Dose
# Placebo 0 0
# Xanomeline Low Dose 1 0
# Xanomeline High Dose 0 1
Now if we want to calculate the predicted value for the Xanomeline Low Dose
factor, it can be as follows:
> 0.11229*5.44+0.00744*75.3-0.24109*1-1.11361
[1] -0.1836104
Back to LS means, from its definition, it seems to be the average of the predicted values.
rg_pred %>%
group_by(TRTP) %>%
summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
# TRTP LSmean
# <fct> <dbl>
# 1 Placebo 0.0578
# 2 Xanomeline Low Dose -0.183
# 3 Xanomeline High Dose 0.503
It's exactly the same results as lsmeans(rg, "TRTP")
by emmeans
package. Or just using emmeans(fit, "TRTP")
can also get the same results
lsmeans(rg, "TRTP")
# TRTP lsmean SE df lower.CL upper.CL
# Placebo 0.0578 0.506 76 -0.949 1.065
# Xanomeline Low Dose -0.1833 0.211 76 -0.603 0.236
# Xanomeline High Dose 0.5031 0.235 76 0.036 0.970
The degree of freedom is 76
as the DF for TRTP
is 2
, and 1
and 1
for each covariables. So the total DF is 81-2-1-1=76
I think.
Using test
we can get the P value when we compare the lsmean to zero.
test(lsmeans(fit, "TRTP"))
# TRTP lsmean SE df t.ratio p.value
# Placebo 0.0578 0.506 76 0.114 0.9093
# Xanomeline Low Dose -0.1833 0.211 76 -0.870 0.3869
# Xanomeline High Dose 0.5031 0.235 76 2.145 0.0351
In fact, the t.ratio
is the t statistics, so we can calculate P value manually, like
2 * pt(abs(0.114), 76, lower.tail = F)
2 * pt(abs(-0.870), 76, lower.tail = F)
2 * pt(abs(2.145), 76, lower.tail = F)
Likewise the confidence interval of lsmean can also be calculated manually based on SE
and DF
, such as for Placebo factor.
> 0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
[1] -0.9499863 1.0655863
I think these steps will go a long way in understanding the meaning of least-squares means, and the logic behind it. Hope to be helpful.
Reference
“emmeans” package
最小二乘均值的估计模型
UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)
Confidence intervals and tests in emmeans
Least-squares Means: The R Package lsmeans