This article walks through the basic calculation of least-squares means (LS means).

I find it difficult to understand what LS actually means in its literal sense.

The definition from the `lsmeans` package, which has since been superseded by the `emmeans` package, is shown below.

Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.

Even after reading this sentence, I was still very confused. What is the reference grid, and how are the predictions made?

So let's see how LS means are calculated, along with the corresponding confidence intervals.

First, import the CDISC pilot dataset, the same one used in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process `adsl` and `adlb` to create an analysis dataset `ana_dat` so that we can fit an ANCOVA with the `lm` function. Suppose we want to see whether `CHG` (change from baseline) is affected by the independent variable `TRTP` (treatment) while controlling for the covariates `BASE` (baseline) and `AGE` (age).

Filter the dataset on the `BASE` variable, because it contains one missing value.

```
library(tidyverse)
library(emmeans)
ana_dat2 <- filter(ana_dat, !is.na(BASE))
```

Then fit the ANCOVA model with the `lm` function.

```
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838
```

LS means are computed from a reference grid, which by default combines the mean of each covariate with every level of each factor.

```
rg <- ref_grid(fit)
rg
# 'emmGrid' object with variables:
# BASE = 5.4427
# AGE = 75.309
# TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
```

The means of `BASE` and `AGE` are, as the output above shows, `5.4427` and `75.309`, respectively. We can also check them manually:

```
summary(ana_dat2[, c("BASE", "AGE")])
#       BASE             AGE
#  Min.   : 3.497   Min.   :51.00
#  1st Qu.: 4.774   1st Qu.:71.00
#  Median : 5.273   Median :77.00
#  Mean   : 5.443   Mean   :75.31
#  3rd Qu.: 5.718   3rd Qu.:81.00
#  Max.   :10.880   Max.   :88.00
```

Then we can use the `summary()` or `predict()` function to get the predicted values on the reference grid `rg`.

```
rg_pred <- summary(rg)
rg_pred
#  BASE  AGE TRTP                 prediction    SE df
#  5.44 75.3 Placebo                  0.0578 0.506 76
#  5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
#  5.44 75.3 Xanomeline High Dose     0.5031 0.235 76
```
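The reference grid and its predictions can also be rebuilt with base R alone. This is only a minimal sketch: since `ana_dat2` is not reproduced here, it assumes the built-in `mtcars` data as a stand-in, with `wt` and `hp` playing the role of the covariates and `cyl` the role of the treatment factor. The names `dat`, `fit_demo` and `grid` are illustrative, not from the original analysis.

```r
# Stand-in data (assumption): mtcars instead of the trial dataset
dat <- transform(mtcars, cyl = factor(cyl))
fit_demo <- lm(mpg ~ wt + hp + cyl, data = dat)

# Reference grid: covariates fixed at their means, one row per factor level
grid <- expand.grid(
  wt  = mean(dat$wt),
  hp  = mean(dat$hp),
  cyl = levels(dat$cyl)
)

# Predicted response on each grid row
grid$prediction <- predict(fit_demo, newdata = grid)
grid
```

Each row of `grid` corresponds to one row of the `summary(rg)` table: the same covariate means, a different factor level.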

The prediction column is the same as the output of `predict(rg)`. The table shows the predicted value for each factor level, with the covariates held constant at their means.

In fact, we can also calculate the predicted values ourselves, since the estimated coefficients of the regression equation are available from `fit$coefficients`

```
fit$coefficients
#              (Intercept)                     BASE                      AGE
#              -1.11361290               0.11228582               0.00743963
#  TRTPXanomeline Low Dose TRTPXanomeline High Dose
#              -0.24108746               0.44531274
```

Because `TRTP` is a factor with multiple levels, it has been converted into dummy variables:

```
contrasts(ana_dat2$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1
```
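To see how the dummy variables end up in the model, it can help to look at the design matrix that `lm()` builds. The sketch below again assumes the built-in `mtcars` data as a stand-in for `ana_dat2` (with `cyl` in place of `TRTP`) and R's default treatment contrasts; `dat` and `X_demo` are illustrative names.

```r
# Stand-in data (assumption): mtcars instead of the trial dataset
dat <- transform(mtcars, cyl = factor(cyl))

# Treatment contrasts: the first level ("4") is the reference, all zeros
contrasts(dat$cyl)

# Design matrix for the formula: intercept, two covariates, two dummy columns
X_demo <- model.matrix(mpg ~ wt + hp + cyl, data = dat)
head(X_demo)
```

The reference level has no column of its own, so its effect is absorbed into the intercept. That is why the coefficient table lists only two treatment terms for three treatment levels.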

Now, if we want to calculate the predicted value for the `Xanomeline Low Dose` level, we can do it as follows:

```
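# Intercept + BASE coefficient * mean(BASE) + AGE coefficient * mean(AGE) + Low Dose dummy
-1.11361 + 0.11229 * 5.44 + 0.00744 * 75.3 - 0.24109 * 1
# [1] -0.1836104
# Close to -0.1833 from the table; the small gap comes from the rounded inputs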
```
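The same hand calculation can be done for all levels at once, and without rounding, by multiplying the design matrix of the reference grid by the coefficient vector. This is a sketch under the assumption that the built-in `mtcars` data stands in for `ana_dat2`; `dat`, `fit_demo`, `grid`, `X_grid` and `manual_pred` are illustrative names.

```r
# Stand-in data (assumption): mtcars instead of the trial dataset
dat <- transform(mtcars, cyl = factor(cyl))
fit_demo <- lm(mpg ~ wt + hp + cyl, data = dat)

# Reference grid: covariate means crossed with every factor level
grid <- expand.grid(wt = mean(dat$wt), hp = mean(dat$hp),
                    cyl = levels(dat$cyl))

# Design matrix of the grid, built with the factor levels of the fitted model
X_grid <- model.matrix(delete.response(terms(fit_demo)),
                       data = grid, xlev = fit_demo$xlevels)

# Prediction = design matrix %*% coefficients
manual_pred <- drop(X_grid %*% coef(fit_demo))

# Compare with predict(); the two columns should agree
cbind(manual = manual_pred, predict = predict(fit_demo, newdata = grid))
```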

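Back to LS means: by definition, they are the predictions, or averages of predictions, over the reference grid. Here the grid has exactly one row per treatment level, so averaging by `TRTP` simply returns each prediction.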

```
rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
#   TRTP                 LSmean
#   <fct>                 <dbl>
# 1 Placebo              0.0578
# 2 Xanomeline Low Dose -0.183
# 3 Xanomeline High Dose 0.503
```

These are exactly the same results that `lsmeans(rg, "TRTP")` from the `emmeans` package gives. Calling `emmeans(fit, "TRTP")` directly also yields the same estimates.

```
lsmeans(rg, "TRTP")
#  TRTP                  lsmean    SE df lower.CL upper.CL
#  Placebo               0.0578 0.506 76   -0.949    1.065
#  Xanomeline Low Dose  -0.1833 0.211 76   -0.603    0.236
#  Xanomeline High Dose  0.5031 0.235 76    0.036    0.970
```

The degrees of freedom are `76`, which is the residual DF of the model: the intercept takes `1`, the covariates `BASE` and `AGE` take `1` each, and `TRTP` takes `2`. With 81 observations, the residual DF is `81-1-1-1-2=76`.
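The degrees-of-freedom bookkeeping is easy to check in code: the residual DF equals the number of observations minus the number of estimated coefficients, intercept included. The sketch below assumes the built-in `mtcars` data as a stand-in for `ana_dat2`; `dat`, `fit_demo`, `n_obs` and `n_coef` are illustrative names.

```r
# Stand-in data (assumption): mtcars instead of the trial dataset
dat <- transform(mtcars, cyl = factor(cyl))
fit_demo <- lm(mpg ~ wt + hp + cyl, data = dat)

n_obs  <- nrow(dat)               # number of observations
n_coef <- length(coef(fit_demo))  # intercept + 2 covariates + 2 dummies

# Residual degrees of freedom, computed by hand and taken from the model
c(manual = n_obs - n_coef, model = df.residual(fit_demo))
```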

Using `test()`, we can get the p-value for comparing each LS mean to zero.

```
test(lsmeans(fit, "TRTP"))
#  TRTP                  lsmean    SE df t.ratio p.value
#  Placebo               0.0578 0.506 76   0.114  0.9093
#  Xanomeline Low Dose  -0.1833 0.211 76  -0.870  0.3869
#  Xanomeline High Dose  0.5031 0.235 76   2.145  0.0351
```

In fact, the `t.ratio` is the t statistic (the LS mean divided by its `SE`), so we can calculate the p-values manually, like

```
# Two-sided p-value: twice the upper-tail probability of |t| with 76 df
2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)
```

Likewise, the confidence interval of an LS mean can be calculated manually from `SE` and `df`, for example for the Placebo level:

```
# estimate +/- t quantile (97.5%, 76 df) * SE
0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
# [1] -0.9499863  1.0655863
```
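Putting the pieces together, the standard errors, t ratios, p-values and confidence limits can all be derived from the design matrix of the reference grid and the coefficient covariance matrix. The sketch below uses the standard error of a linear combination of coefficients, `sqrt(diag(X V X'))`. It assumes the built-in `mtcars` data as a stand-in for `ana_dat2`, and every object name in it is illustrative.

```r
# Stand-in data (assumption): mtcars instead of the trial dataset
dat <- transform(mtcars, cyl = factor(cyl))
fit_demo <- lm(mpg ~ wt + hp + cyl, data = dat)

# Reference grid and its design matrix
grid <- expand.grid(wt = mean(dat$wt), hp = mean(dat$hp),
                    cyl = levels(dat$cyl))
X_grid <- model.matrix(delete.response(terms(fit_demo)),
                       data = grid, xlev = fit_demo$xlevels)

est <- drop(X_grid %*% coef(fit_demo))                      # estimates
se  <- sqrt(diag(X_grid %*% vcov(fit_demo) %*% t(X_grid)))  # standard errors
df  <- df.residual(fit_demo)                                # residual df

t_ratio <- est / se                                         # test against zero
p_value <- 2 * pt(abs(t_ratio), df, lower.tail = FALSE)     # two-sided p-value
lower   <- est - qt(0.975, df) * se                         # 95% lower limit
upper   <- est + qt(0.975, df) * se                         # 95% upper limit

data.frame(cyl = grid$cyl, est, se, df, lower, upper, t_ratio, p_value)
```

The resulting data frame has the same columns as the `lsmeans()` and `test()` outputs, which makes it a convenient cross-check of the manual steps.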

I think these steps go a long way toward understanding the meaning of least-squares means and the logic behind them. I hope this is helpful.

#### Reference

“emmeans” package

Estimation model of least-squares means (最小二乘均值的估计模型)

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Confidence intervals and tests in emmeans

Least-squares Means: The R Package lsmeans