This article aims to learn the basic calculation process of least-squares means (LS means).
I find it difficult to understand what LS actually means in its literal sense.
The definition from
lsmeans package is shown blow, that have been transitioned to
Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.
In fact, even when I read this sentence, I was still very confused. What's the reference grid, and how to predict?
So let's see how the LS means is calculated, and the corresponding confidence interval as well.
Firstly import CDSIC pliot dataset, the same as the previous blog article - Conduct an ANCOVA model in R for Drug Trial. And then handle with the
adlb to create an analysis dataset
ana_dat so that we can use ANCOVA by
lm function. Supposed that we want to see the
CHG(change from baseline) is affected by independent variable
TRTP(treatment) under the control of covariate variables
Filter the dataset by
BASE variable as one missing value can be found in dataset.
library(tidyverse) library(emmeans) ana_dat2 <- filter(ana_dat, !is.na(BASE))
Then fit the ANCOVA model by
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2) anova(fit) # Analysis of Variance Table # # Response: CHG # Df Sum Sq Mean Sq F value Pr(>F) # BASE 1 1.699 1.6989 0.9524 0.3322 # AGE 1 0.001 0.0010 0.0006 0.9811 # TRTP 2 8.343 4.1715 2.3385 0.1034 # Residuals 76 135.570 1.7838
We know that the LS means can be calculated according to reference grid that contains the mean of covariables and total factors for independent variables.
rg <- ref_grid(fit) # 'emmGrid' object with variables: # BASE = 5.4427 # AGE = 75.309 # TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
The mean of
AGE are, as we can see from the table above,
75.309, respectively. Or we can calculate manually like:
summary(ana_dat2[,c("BASE", "AGE")]) # BASE AGE # Min. : 3.497 Min. :51.00 # 1st Qu.: 4.774 1st Qu.:71.00 # Median : 5.273 Median :77.00 # Mean : 5.443 Mean :75.31 # 3rd Qu.: 5.718 3rd Qu.:81.00 # Max. :10.880 Max. :88.00
Then we can use
predict() function to get the predicted value based on reference grid
rg_pred <- summary(rg) rg_pred # BASE AGE TRTP prediction SE df # 5.44 75.3 Placebo 0.0578 0.506 76 # 5.44 75.3 Xanomeline Low Dose -0.1833 0.211 76 # 5.44 75.3 Xanomeline High Dose 0.5031 0.235 76
The prediction column is the same as from
predict(rg). The prediction table looks like the predicted values of the different factor levels at the constant mean value.
In fact, we can aslo calculate the predicted value as we have the coefficients estimation of the regression equation from
> fit$coefficients (Intercept) BASE AGE -1.11361290 0.11228582 0.00743963 TRTPXanomeline Low Dose TRTPXanomeline High Dose -0.24108746 0.44531274
TRTP includes multiple factors so it has been converted into dummy variables:
contrasts(ana_dat2$TRTP) # Xanomeline Low Dose Xanomeline High Dose # Placebo 0 0 # Xanomeline Low Dose 1 0 # Xanomeline High Dose 0 1
Now if we want to calculate the predicted value for the
Xanomeline Low Dose factor, it can be as follows:
> 0.11229*5.44+0.00744*75.3-0.24109*1-1.11361  -0.1836104
Back to LS means, from its definition, it seems to be the average of the predicted values.
rg_pred %>% group_by(TRTP) %>% summarise(LSmean = mean(prediction)) # # A tibble: 3 × 2 # TRTP LSmean # <fct> <dbl> # 1 Placebo 0.0578 # 2 Xanomeline Low Dose -0.183 # 3 Xanomeline High Dose 0.503
It's exactly the same results as
lsmeans(rg, "TRTP") by
emmeans package. Or just using
emmeans(fit, "TRTP") can also get the same results
lsmeans(rg, "TRTP") # TRTP lsmean SE df lower.CL upper.CL # Placebo 0.0578 0.506 76 -0.949 1.065 # Xanomeline Low Dose -0.1833 0.211 76 -0.603 0.236 # Xanomeline High Dose 0.5031 0.235 76 0.036 0.970
The degree of freedom is
76 as the DF for
1 for each covariables. So the total DF is
81-2-1-1=76 I think.
test we can get the P value when we compare the lsmean to zero.
test(lsmeans(fit, "TRTP")) # TRTP lsmean SE df t.ratio p.value # Placebo 0.0578 0.506 76 0.114 0.9093 # Xanomeline Low Dose -0.1833 0.211 76 -0.870 0.3869 # Xanomeline High Dose 0.5031 0.235 76 2.145 0.0351
In fact, the
t.ratio is the t statistics, so we can calculate P value manually, like
2 * pt(abs(0.114), 76, lower.tail = F) 2 * pt(abs(-0.870), 76, lower.tail = F) 2 * pt(abs(2.145), 76, lower.tail = F)
Likewise the confidence interval of lsmean can also be calculated manually based on
DF, such as for Placebo factor.
> 0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506  -0.9499863 1.0655863
I think these steps will go a long way in understanding the meaning of least-squares means, and the logic behind it. Hope to be helpful.
UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)
Confidence intervals and tests in emmeans
Least-squares Means: The R Package lsmeans