Definition of least-squares means (LS means)

This article walks through the basic calculation of least-squares means (LS means).

I find it difficult to understand what "LS means" actually means in a literal sense.

The definition from the lsmeans package, which has since been superseded by the emmeans package, is shown below.

Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.

Even after reading this sentence, I was still very confused. What is the reference grid, and how are the predictions made?

So let's see how LS means and their confidence intervals are calculated.

First, import the CDISC pilot dataset, the same one used in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process adsl and adlb to create an analysis dataset ana_dat so that we can fit an ANCOVA with the lm function. Suppose we want to see whether CHG (change from baseline) is affected by the independent variable TRTP (treatment) while controlling for the covariates BASE (baseline) and AGE (age).

Remove the rows where BASE is missing, as the dataset contains one missing value.

library(tidyverse)
library(emmeans)
ana_dat2 <- filter(ana_dat, !is.na(BASE))

Then fit the ANCOVA model with the lm function.

fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)

# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838

LS means are calculated on a reference grid, which by default contains the mean of each covariate and every level of each factor.

rg <- ref_grid(fit)
rg

# 'emmGrid' object with variables:
#    BASE = 5.4427
#    AGE = 75.309
#    TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose

The means of BASE and AGE are 5.4427 and 75.309, respectively, as the output above shows. We can also check them manually:

summary(ana_dat2[,c("BASE", "AGE")])

#      BASE             AGE       
# Min.   : 3.497   Min.   :51.00  
# 1st Qu.: 4.774   1st Qu.:71.00  
# Median : 5.273   Median :77.00  
# Mean   : 5.443   Mean   :75.31  
# 3rd Qu.: 5.718   3rd Qu.:81.00  
# Max.   :10.880   Max.   :88.00

Then we can use the summary() or predict() function to get the predicted values on the reference grid rg.

rg_pred <- summary(rg)
rg_pred

# BASE  AGE TRTP                 prediction    SE df
# 5.44 75.3 Placebo                  0.0578 0.506 76
# 5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
# 5.44 75.3 Xanomeline High Dose     0.5031 0.235 76

The prediction column matches the output of predict(rg). The table gives the predicted value for each treatment level with the covariates held constant at their means.
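The same kind of table can be reproduced with base R alone. The sketch below is only an illustration on simulated data: sim_dat, sim_fit and all the values in them are made up for this example and are not the CDISC data. It builds the grid by hand, then derives the predictions and standard errors from the model matrix, the coefficients and vcov().

```r
# Simulated stand-in for ana_dat2 (assumption: all values are made up)
set.seed(123)
n <- 81
trt_levels <- c("Placebo", "Xanomeline Low Dose", "Xanomeline High Dose")
sim_dat <- data.frame(
  BASE = rnorm(n, mean = 5.4, sd = 1),
  AGE  = round(rnorm(n, mean = 75, sd = 8)),
  TRTP = factor(sample(trt_levels, n, replace = TRUE), levels = trt_levels)
)
sim_dat$CHG <- 0.1 * sim_dat$BASE +
  0.4 * (sim_dat$TRTP == "Xanomeline High Dose") + rnorm(n)

sim_fit <- lm(CHG ~ BASE + AGE + TRTP, data = sim_dat)

# Reference grid: covariates at their means, one row per treatment level
grid <- expand.grid(
  BASE = mean(sim_dat$BASE),
  AGE  = mean(sim_dat$AGE),
  TRTP = levels(sim_dat$TRTP)
)

# Design matrix for the grid rows, using the factor coding stored in the fit
X <- model.matrix(delete.response(terms(sim_fit)), grid, xlev = sim_fit$xlevels)

# Prediction = X %*% beta; SE = sqrt of the diagonal of X V X'
grid$prediction <- as.numeric(X %*% coef(sim_fit))
grid$SE <- as.numeric(sqrt(diag(X %*% vcov(sim_fit) %*% t(X))))
grid$df <- df.residual(sim_fit)
grid
```

The result should agree with predict(sim_fit, newdata = grid, se.fit = TRUE), which is a convenient cross-check.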

In fact, we can also calculate the predicted values by hand, since the estimated regression coefficients are available from fit$coefficients:

fit$coefficients

#              (Intercept)                     BASE                      AGE 
#              -1.11361290               0.11228582               0.00743963 
#  TRTPXanomeline Low Dose TRTPXanomeline High Dose 
#              -0.24108746               0.44531274

Because TRTP is a factor with several levels, it is converted into dummy variables:

contrasts(ana_dat2$TRTP)

#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1

Now, to calculate the predicted value for the Xanomeline Low Dose level, plug the covariate means and the dummy value into the regression equation:

0.11229*5.44+0.00744*75.3-0.24109*1-1.11361

# [1] -0.1836104
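The same arithmetic can be done for all three treatment levels at once with a matrix product. This sketch uses only numbers printed earlier in this article (the coefficients and the grid means 5.4427 and 75.309); the object names beta, X and pred are introduced here for illustration.

```r
# Coefficients as printed by fit$coefficients
beta <- c(
  "(Intercept)"              = -1.11361290,
  "BASE"                     =  0.11228582,
  "AGE"                      =  0.00743963,
  "TRTPXanomeline Low Dose"  = -0.24108746,
  "TRTPXanomeline High Dose" =  0.44531274
)

# One row per treatment level: intercept, covariate means from the
# reference grid, then the two dummy variables from contrasts()
X <- rbind(
  "Placebo"              = c(1, 5.4427, 75.309, 0, 0),
  "Xanomeline Low Dose"  = c(1, 5.4427, 75.309, 1, 0),
  "Xanomeline High Dose" = c(1, 5.4427, 75.309, 0, 1)
)

pred <- drop(X %*% beta)
round(pred, 3)
```

Up to rounding of the grid means, these values should agree with the prediction column of summary(rg).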

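Returning to LS means: by definition, they are the averages of the predicted values over the reference grid. Here each treatment level has a single row in the grid, so the average equals the prediction itself.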

rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))

# # A tibble: 3 × 2
#   TRTP                  LSmean
#   <fct>                  <dbl>
# 1 Placebo               0.0578
# 2 Xanomeline Low Dose  -0.183 
# 3 Xanomeline High Dose  0.503

These are exactly the same results as lsmeans(rg, "TRTP") from the emmeans package. Calling emmeans(fit, "TRTP") directly also gives the same results.

lsmeans(rg, "TRTP")

# TRTP                  lsmean    SE df lower.CL upper.CL
# Placebo               0.0578 0.506 76   -0.949    1.065
# Xanomeline Low Dose  -0.1833 0.211 76   -0.603    0.236
# Xanomeline High Dose  0.5031 0.235 76    0.036    0.970
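The averaging step only does real work when the reference grid has more than one row per treatment level, for example when the model contains a second factor. The sketch below is a made-up illustration in base R: the data are simulated, and the SEX factor is hypothetical and not part of the model above. Predictions are averaged over the SEX levels with equal weight, independent of how many subjects fall in each level.

```r
# Simulated data with a hypothetical second factor SEX (assumption)
set.seed(1)
dat <- data.frame(
  TRTP = factor(rep(c("A", "B"), each = 30)),
  SEX  = factor(sample(c("F", "M"), 60, replace = TRUE, prob = c(0.7, 0.3))),
  BASE = rnorm(60, mean = 5, sd = 1)
)
dat$CHG <- 0.2 * dat$BASE + 0.5 * (dat$TRTP == "B") +
  0.3 * (dat$SEX == "M") + rnorm(60)

fit2 <- lm(CHG ~ BASE + TRTP + SEX, data = dat)

# Reference grid: BASE at its mean, every TRTP x SEX combination (4 rows)
grid <- expand.grid(
  BASE = mean(dat$BASE),
  TRTP = levels(dat$TRTP),
  SEX  = levels(dat$SEX)
)
grid$prediction <- predict(fit2, newdata = grid)

# LS mean per treatment = unweighted average of its grid predictions
ls_means <- aggregate(prediction ~ TRTP, data = grid, FUN = mean)
ls_means
```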

The degrees of freedom are 76. This is the residual DF of the model: the 81 observations minus the 5 estimated coefficients (1 for the intercept, 1 each for the covariates BASE and AGE, and 2 for TRTP), so 81-1-1-1-2=76.
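The residual DF count can be checked on any model with the same structure. The sketch below uses simulated data with 81 rows (made-up values; sim_dat and sim_fit are names introduced for this example) and compares df.residual() with the number of observations minus the number of coefficients.

```r
# Simulated data with the same structure as the model above (assumption)
set.seed(42)
n <- 81
sim_dat <- data.frame(
  BASE = rnorm(n),
  AGE  = rnorm(n),
  TRTP = factor(rep(c("Placebo", "Low", "High"), length.out = n)),
  CHG  = rnorm(n)
)
sim_fit <- lm(CHG ~ BASE + AGE + TRTP, data = sim_dat)

n_coef   <- length(coef(sim_fit))  # intercept + BASE + AGE + 2 dummies
resid_df <- df.residual(sim_fit)   # n - n_coef
c(n = n, n_coef = n_coef, resid_df = resid_df)
```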

Using test() we can get the p-value for testing whether each lsmean is equal to zero.

test(lsmeans(fit, "TRTP"))

# TRTP                  lsmean    SE df t.ratio p.value
# Placebo               0.0578 0.506 76   0.114  0.9093
# Xanomeline Low Dose  -0.1833 0.211 76  -0.870  0.3869
# Xanomeline High Dose  0.5031 0.235 76   2.145  0.0351

The t.ratio column is the t statistic, so we can calculate the p-values manually as two-sided tail probabilities of the t distribution with 76 degrees of freedom:

2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)
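The t ratio itself is just the estimate divided by its standard error. A small sketch using the rounded lsmean and SE values from the table above (so the results can differ from the emmeans output in the third decimal):

```r
# Rounded values copied from the lsmeans output above
lsmean <- c("Placebo" = 0.0578,
            "Xanomeline Low Dose" = -0.1833,
            "Xanomeline High Dose" = 0.5031)
se <- c(0.506, 0.211, 0.235)
df <- 76

t_ratio <- lsmean / se                                   # t statistic
p_value <- 2 * pt(abs(t_ratio), df, lower.tail = FALSE)  # two-sided p-value
round(cbind(t_ratio, p_value), 3)
```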

Likewise, the confidence interval of each lsmean can be calculated manually from its SE and DF, for example for the Placebo level:

0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506

# [1] -0.9499863  1.0655863
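The same formula, lsmean plus or minus the t quantile times the SE, can be applied to all three levels at once. This sketch again uses the rounded values from the lsmeans table, so the limits may differ from the emmeans output in the third decimal.

```r
# Rounded values copied from the lsmeans output above
lsmean <- c("Placebo" = 0.0578,
            "Xanomeline Low Dose" = -0.1833,
            "Xanomeline High Dose" = 0.5031)
se <- c(0.506, 0.211, 0.235)
df <- 76

t_crit <- qt(0.975, df)  # critical value for a 95% two-sided interval
ci <- cbind(
  lower.CL = lsmean - t_crit * se,
  upper.CL = lsmean + t_crit * se
)
round(ci, 3)
```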

I think these steps go a long way toward understanding the meaning of least-squares means and the logic behind them. I hope this is helpful.

Reference

“emmeans” package
Estimation model for least-squares means
UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)
Confidence intervals and tests in emmeans
Least-squares Means: The R Package lsmeans