The `officer` package can be used to generate editable figures in PowerPoint. "Editable" here means that every element of the figure, including scatter points, the X/Y axes, and labels, can be modified, which is handy for retouching figures afterwards. Reference: Chapter 5 officer for PowerPoint

In fact, `officer` is one package in the `Officeverse` suite, which also includes several other familiar packages:

- `officedown`: generates Word documents from R Markdown
- `flextable`: generates very usable tables
- `rvg`: generates vector graphics
- `mschart`: generates native Microsoft Office charts

Now to the point. Suppose you have a figure generated in R, whether by base R graphics, by ggplot2, or by another plotting package (as long as it yields a `ggplot` object); it can be converted into an editable figure in PowerPoint as follows.

First, generate the figure and wrap it with the `rvg::dml` function as a vector graphic, so that it can later be inserted into individual slides of the PPT.

```r
library(rvg)
p1 <- dml(plot(1:10))

library(ggplot2)
g2 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  theme_classic()
p2 <- dml(ggobj = g2)

library(survival)
library(survminer)
g3 <- survfit(Surv(time, status) ~ sex, data = lung) %>%
  ggsurvplot(data = lung)
p3 <- dml(ggobj = g3$plot)
```

Once the vector-graphic objects are generated, add them to the PPT following the steps below.

First, use `read_pptx` to create an empty PPT file from the default template; then use `add_slide` to add an empty slide; finally, use `ph_with` to insert the vector-graphic object into it. To understand the parameters involved, it helps to know the basic components of a PowerPoint document; see: 2.2 PowerPoint presentation properties

```r
library(officer)

doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, p1, location = ph_location_fullsize())
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, p2, location = ph_location_fullsize())
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, p3, location = ph_location_fullsize())

print(doc, target = "test.pptx")
```

Finally, open the `test.pptx` file and start editing the figures.

In terms of the ANCOVA model, if you would like to add a margin of non-inferiority or superiority, you can just use the `lsmestimate` statement with `testvalue=2` when the margin is 2. For multiple imputation, however, you can't simply add this statement in the analysis step; you should define the margin in the pooling step.

To echo the previous article, I will use the same example data and the same first and second steps of the MI process, and only illustrate the difference in the third step. Assume that the endpoint is the change from baseline at week 6, and given that the drug is intended to reduce the primary indicator, the null hypothesis might be that the CHG in the treatment group minus that in the placebo group is greater than `-2`, meaning the drug's efficacy is not superior to placebo.

```sas
ods output ParameterEstimates=super;
proc mianalyze data=diff theta0=-2;
    modeleffects estimate;
    stderr stderr;
run;
```

The pooled results with a margin of `-2` are as follows.

Now we can see that the `Theta0` value is `-2` rather than the usual default of `0`, and the two-sided p-value is `0.4745`. If we would like to obtain the one-sided p-value, an additional calculation can be done; taking half of the two-sided p-value gives the same result.

```sas
data super;
    set super;
    pval = (1 - probt(abs(tvalue), df));
run;
```

Alternatively, the t-statistic and p-value can be computed directly from the t-distribution, as shown below in R.

```r
est <- -2.803439
theta0 <- -2
se <- 1.123403
df <- 4800.7

t <- (est - theta0) / se
t
# [1] -0.7151832

pval <- pt(t, df)
pval
# [1] 0.2372653
```
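As a quick cross-check, halving the reported two-sided p-value of `0.4745` agrees (up to rounding) with the one-sided `pt()` result above:

```r
# Half of the two-sided p-value reported by PROC MIANALYZE
0.4745 / 2
# [1] 0.23725
```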

The superiority test is used as the example above; a non-inferiority test can follow the same procedure by simply altering the margin.
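For instance, reusing the pooled estimate and standard error from above, a non-inferiority test with an assumed margin of `+2` (an illustrative value, not one taken from a real analysis plan) is just the same t-statistic with a different `theta0` (i.e. `theta0=2` in PROC MIANALYZE):

```r
# Non-inferiority sketch: H0 is that treatment is worse than placebo by
# at least the margin (+2 here, an assumed illustrative value).
est <- -2.803439   # pooled difference from the example above
se  <- 1.123403    # pooled standard error
df  <- 4800.7
margin <- 2

t <- (est - margin) / se
pval <- pt(t, df)  # one-sided p-value
```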

There are plenty of methods that can be applied to missing data, depending on the goal of the clinical trial. The most common and recommended is multiple imputation (MI); other methods such as last observation carried forward (LOCF), observed case (OC), and mixed model for repeated measures (MMRM) are also available for sensitivity analysis.

Multiple imputation is a model-based method, but it is not just a model to impute data; it is also a framework that accommodates various analysis models such as ANCOVA. In general, MI is implemented in three steps, which are the same in both R and SAS.

- Imputation: generate M imputed datasets. Before starting this step, you'd better examine the missing-data pattern: monotone or arbitrary.
- Analysis: generate M sets of estimates from the M imputed datasets using the statistical model.
- Pooling: combine the M sets of estimates into one MI estimate. This is what distinguishes MI from other imputation methods: it not only imputes missing values but also pools the estimates from the multiple imputed datasets. The pooling method is Rubin's Rules (RR), which combines parameter estimates such as mean differences, regression coefficients, and standard errors, and then derives confidence intervals and p-values. The pooling logic is briefly introduced below.

After this routine introduction to MI, let's talk about how to implement an MI model for actual missing data in SAS. I'm also planning to compare the SAS procedure with the `rbmi` R package in the next article. To be honest, I tend to use R instead of SAS in my actual work, so I would like to introduce more R usage in clinical trials.

Here is an example dataset from an antidepressant clinical trial of an active drug versus placebo. The relevant endpoint is the Hamilton 17-item depression rating scale (HAMD17), assessed at baseline and at weeks 1, 2, 4, and 6. This example comes from the `rbmi` package, so I can use the same dataset in R programming. I do some pre-processing and transposing to meet the data requirements of the MI procedure in SAS.

```r
library(rbmi)
library(tidyverse)

data("antidepressant_data")
dat <- antidepressant_data

# Use expand_locf to add rows corresponding to visits with missing outcomes to the dataset
dat <- expand_locf(
  dat,
  PATIENT = levels(dat$PATIENT),   # expand by PATIENT and VISIT
  VISIT = levels(dat$VISIT),
  vars = c("BASVAL", "THERAPY"),   # fill BASVAL and THERAPY with LOCF
  group = c("PATIENT"),
  order = c("PATIENT", "VISIT")
)

dat2 <- pivot_wider(
  dat,
  id_cols = c(PATIENT, THERAPY, BASVAL),
  names_from = VISIT,
  names_prefix = "VISIT",
  values_from = HAMDTL17
)

write.csv(dat2, file = "./antidepressant.csv", na = "", row.names = FALSE)
```

And then import the csv file into SAS, as shown below.

Next, we examine the missing-data pattern in the dataset by running the MI procedure with zero imputations (`nimpute=0`).

```sas
proc mi data=antidepressant nimpute=0;
    var BASVAL VISIT4 VISIT5 VISIT6 VISIT7;
run;
```

The above output indicates that one patient does not fit the monotone missing-data pattern, so the pattern is non-monotone. As to which MI method should be used, the MISSING DATA PATTERNS section and Table 4 of the article MI FOR MI, OR HOW TO HANDLE MISSING INFORMATION WITH MULTIPLE IMPUTATION can be used as references. I will select the MCMC (Markov chain Monte Carlo) method for the multiple imputation that follows.

Now for the first step: I choose MCMC full-data imputation with `impute=full` and specify the BY statement to obtain separate imputed datasets per treatment group. I also specify a seed, but keep in mind that it only applies to the first BY group; subsequent groups do not use the seed you defined but seeds that follow deterministically from it.

```sas
proc sort data=antidepressant;
    by THERAPY;
run;

proc mi data=antidepressant seed=12306 nimpute=100 out=imputed_data;
    mcmc chain=single impute=full initial=em (maxiter=1000) niter=1000 nbiter=1000;
    em maxiter=1000;
    by THERAPY;
    var BASVAL VISIT4 VISIT5 VISIT6 VISIT7;
run;
```

The second step is to run the analysis model on each of the imputed datasets created in the first step. Assume I want to estimate the endpoint of change from baseline at week 6, with the LS mean in each treatment arm from an ANCOVA model, and the difference between the arms. The model includes a fixed effect of treatment and a fixed covariate of baseline.

```sas
data imputed_data2;
    set imputed_data;
    CHG = VISIT7 - BASVAL;
run;

proc sort;
    by _imputation_;
run;

ods output lsmeans=lsm diffs=diff;
proc mixed data=imputed_data2;
    by _imputation_;
    class THERAPY;
    model CHG = BASVAL THERAPY / ddfm=kr;
    lsmeans THERAPY / cl pdiff diff;
run;
```

We can see the LS mean for each imputation (`_Imputation_`) in the `lsm` dataset, where each imputation has two rows (drug and placebo), and the difference between the two groups in the `diff` dataset, as shown below.

The third step is to pool all estimates from the second step, including the LS mean estimates and difference.

```sas
proc sort data=lsm;
    by THERAPY;
run;

ods output ParameterEstimates=combined_lsm;
proc mianalyze data=lsm;
    by THERAPY;
    modeleffects estimate;
    stderr stderr;
run;

ods output ParameterEstimates=combined_diff;
proc mianalyze data=diff;
    by THERAPY _THERAPY;
    modeleffects estimate;
    stderr stderr;
run;
```

For now the imputations have been combined as shown below.

The above results indicate that the imputations have been combined and the final estimate is calculated by Rubin's Rules (RR). The t-statistic, confidence interval, and p-value are based on the t-distribution, so the key is how the estimate and standard error are calculated. The pooled estimate is the mean of the estimates across imputations, and the pooled SE is the square root of `Vtotal`, which can be calculated from formulas 9.2-9.4 cited from https://bookdown.org/mwheymans/bookmi/rubins-rules.html.

I will illustrate the computing process of RR in R, using the `diff` dataset as an example; R makes the matrix operations easy.

```r
diff <- haven::read_sas("./diff.sas7bdat")

est <- mean(diff$Estimate)
est
# [1] -2.803439

n <- nrow(diff)
Vw <- mean(diff$StdErr^2)                     # within-imputation variance
Vb <- sum((diff$Estimate - est)^2) / (n - 1)  # between-imputation variance
Vtotal <- Vw + Vb + Vb / n                    # total variance
se <- sqrt(Vtotal)
se
# [1] 1.123403
```

These `est` and `se` values equal the pooled Estimate of `-2.803439` and StdErr of `1.123403` from SAS.

If you are interested in the other parameters, you may ask how the DF and t-statistic are calculated. I recommend reading the entire article mentioned above to comprehend the complete process, which is not covered here. Once the DF and t-statistic are determined, the confidence interval and p-value can also be computed easily from the t-distribution.
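For reference, the classical Rubin's Rules degrees of freedom can be sketched as below; this is the textbook large-sample formula, and SAS may additionally apply a small-sample (Barnard-Rubin) adjustment, so treat the numbers as illustrative only:

```r
# Classical Rubin's Rules df: nu = (m - 1) * (1 + 1/r)^2, where r is the
# relative increase in variance due to missingness. Vw/Vb values below are
# made-up illustrations, not taken from the article's dataset.
rubin_df <- function(m, Vw, Vb) {
  r <- (1 + 1 / m) * Vb / Vw
  (m - 1) * (1 + 1 / r)^2
}

rubin_df(m = 100, Vw = 1, Vb = 0.1)
```

The t-statistic is then `est / se` compared against a t-distribution with this df.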

Multiple imputation is a recommended and useful tool in trials, providing robust parameter estimates under the missing-data pattern your data have.

In the next article, I will try to illustrate how to use MI in non-inferiority and superiority trials.

Multiple Imputation using SAS and R Programming

MI FOR MI, OR HOW TO HANDLE MISSING INFORMATION WITH MULTIPLE IMPUTATION

Chapter 9 Rubin's Rules

We all know that there are two common methods to compute the confidence limits for a hazard ratio in the SAS `PHREG` procedure:

- Wald's Confidence Limits
- Profile-Likelihood Confidence Limits

However, in R, we commonly use the `confint()`

or `summary()`

function to compute the CI from the `coxph`

model, which assumes normality. So it is identical to Wald's CI.

You can also compute it manually as `exp(EST ± Z * SE)`, as shown below.

```r
library(survival)

m <- coxph(Surv(time, status) ~ ph.ecog, data = na.omit(lung))
ss <- summary(m)
coef <- coef(m)
se <- ss$coefficients[, "se(coef)"]
c(exp(coef - qnorm(0.975) * se), exp(coef + qnorm(0.975) * se))
```

But what's the weakness of Wald's CI? Refer to Why and When to Use Profile Likelihood Based Confidence Intervals. That post explains that since the standard errors of the general linear model are based on asymptotic variance, they may not estimate the standard error well for small samples; in particular, Wald confidence intervals may not perform well. One should only use the Wald confidence interval if the likelihood function is symmetric about the MLE.

So what's the superiority of the Profile Likelihood CI?

In cases where the likelihood function is not symmetric about the MLE, the Profile Likelihood Based Confidence Interval serves better. This is because the Profile Likelihood Based Confidence Interval is based on the asymptotic chi-square distribution of the log likelihood ratio test statistic.

If you use the SAS `PHREG` procedure, you can simply set the `risklimits=pl` option in the MODEL statement to get the profile-likelihood CI for the hazard ratio.

Unfortunately, you cannot get it from the `coxph()` function in the `survival` package. I have tried the `coxphf()` function from the `coxphf` package, but its CI is not identical to SAS's, differing in a few decimal places.

```r
m2 <- coxphf::coxphf(formula = Surv(time, status) ~ ph.ecog, pl = FALSE, data = na.omit(lung))
summary(m2)
```

Anyway, this is an alternative way to compute the Profile Likelihood CI.

The reference mainly introduces four inference methods:

- One-Proportion Inference
- One-Mean Inference
- Two-Proportion Inference
- Two-Mean Inference

When you want to test whether a population proportion differs from some fixed proportion, you can use **One-Proportion Inference**.

For example, suppose you conjecture that the proportion of women in certain churches exceeds 55%. The 55% is the proportion to be inferred, and the hypotheses are `H0: pi=0.55` and `Ha: pi>0.55`. We then observe that one church has 62 women among 100 members, so `phat=62/100`. Do we have enough evidence to reject the null hypothesis? After all, this is only a sample from one church, so we resort to simulation, computing the p-value under the assumption that the null hypothesis is true. We therefore use the `do` and `rflip` functions to build 1000 simulated trials with `pi=0.55`.

```r
library(mosaic)

pi <- 0.55       # probability of success for each toss
n <- 100         # number of times we toss the penny (sample size)
trials <- 1000   # number of trials (number of samples)
observed <- 62   # observed number of heads
phat <- observed / n  # p-hat - the observed proportion of heads

data.sim <- do(trials) * rflip(n, prob = pi)
```

Then count how many of the 1000 simulations produce a proportion at least as large as `phat`; dividing by the number of trials gives the "p-value" we want. This is easy to understand: because it comes from simulation, the result won't match the book's value exactly, but as the number of simulations grows, each run converges to a fairly stable value.

```r
pvalue <- sum(data.sim$prop >= phat) / trials
```

When the value you are inferring is a mean rather than a proportion, consider **One-Mean Inference**.

For example, you conjecture that a car's average mileage per gallon is not equal to 22 miles. The `22` is the mean to be inferred, and the hypotheses are `H0: μ=22` and `Ha: μ≠22`. In the `mtcars` dataset, the observed mean miles per gallon (`mpg`) is 20.09. We can resample the `mtcars` dataset to test the hypotheses above.

```r
mu <- 22
observed <- mean(~mpg, data = mtcars, na.rm = TRUE)
paste("Observed value for sample mean: ", observed)

trials <- 1000
samples <- do(trials) * mosaic::mean(~mpg, data = resample(mtcars))
```

Based on the `samples` simulation data, we can roughly compute a 95% confidence interval.

```r
# Let's compute a 95% Confidence Interval
(ci <- quantile(samples$mean, c(0.025, 0.975)))
#     2.5%    97.5% 
# 18.18406 22.23453
```

The confidence interval contains the null value `μ=22`, so we can preliminarily conclude that we fail to reject the null hypothesis; that is, we reject the conjecture that the average mileage per gallon differs from `22`.

Next, from the resampled data, compute the proportion of resampled means of `mpg` that are greater than or equal to `22`; since the test is two-sided, the p-value must be multiplied by 2.

```r
pvalue <- sum(samples$mean >= mu) / trials
paste("Two-sided p-value is", 2 * pvalue)
# [1] "Two-sided p-value is 0.088"
```

When you conjecture that two proportions differ significantly, consider **Two-Proportion Inference**.

For example, you conjecture that the proportion of women who support some policy differs from the proportion of men who do. These two proportions are what we compare, and the hypotheses are `H0: π1=π2` and `Ha: π1≠π2`. In one sample we observe that `p1=62/100` of men support the policy while `p2=51/100` of women do, with the data as follows:

```r
df <- rbind(
  do(38) * data.frame(Group = "Men", Support = "no"),
  do(62) * data.frame(Group = "Men", Support = "yes"),
  do(49) * data.frame(Group = "Women", Support = "no"),
  do(51) * data.frame(Group = "Women", Support = "yes")
)
(df.summary <- tally(Support ~ Group, data = df))
```

Then compute the difference in proportions between the two groups: simply `0.62-0.51=0.11`, or

```r
observed <- diffprop(Support ~ Group, data = df)
paste("Observed difference in proportions: ", round(observed, 3))
# [1] "Observed difference in proportions: 0.11"
```

Then simulate by shuffling the group labels to obtain the null distribution of the between-group difference.

```r
trials <- 1000
null.dist <- do(trials) * diffprop(Support ~ shuffle(Group), data = df)
histogram(~diffprop, data = null.dist,
          xlab = "Differences in proportions",
          main = "Null distribution for differences in proportions",
          v = observed)
```

Finally, compute the p-value from the null distribution to decide whether we can reject the null hypothesis: if H0 were true, what is the probability of a difference of `0.11` or a more extreme value, and is it small (e.g., less than 0.05)? If the p-value is small enough, we can reject the null hypothesis.

```r
p.value <- prop(~diffprop >= observed, data = null.dist)
paste(" One-sided p-value: ", round(p.value, 3))
# [1] " One-sided p-value: 0.088"
```

When you conjecture that two means differ significantly, consider **Two-Mean Inference**.

For example, suppose the average mercury content differs between albacore and yellowfin tuna. The two means are what we compare, and the hypotheses are `H0: μ1=μ2` and `Ha: μ1≠μ2`. In the dataset `tuna.txt`, the observed albacore mean is `0.35763` and the yellowfin mean is `0.35444`, with a difference of `-0.003`. We can shuffle the group labels of the `tuna` dataset to simulate and test the hypotheses above.

```r
df <- read.delim("http://citadel.sjfc.edu/faculty/ageraci/data/tuna.txt")
str(df)
favstats(~Mercury | Tuna, data = df)

observed <- diffmean(~Mercury | Tuna, data = df, na.rm = TRUE)
paste("Observed difference in the means: ", round(observed, 3))
# [1] "Observed difference in the means: -0.003"
```

Then, as in the two-proportion inference, use the `shuffle` function to permute the groups, simulate 1000 times, compute the p-value to judge how extreme the observed value is under the null hypothesis, and decide whether to reject it.

```r
trials <- 1000
null.dist <- do(trials) * diffmean(Mercury ~ shuffle(Tuna), data = df, na.rm = TRUE)
pvalue <- prop(~diffmean <= observed, data = null.dist)
paste("The one-sided p-value is ", round(pvalue, 3))
# [1] "The one-sided p-value is 0.428"
```

This is a brief record of the reference material. Simulation is a very interesting approach, fairly common in clinical trials, and worth further study; the examples introduced here are simple and easy to understand.

Chapter 7 Simulation-based Inference

Simulation-based inference with mosaic

MOSAIC R packages

Here we won't talk about how to determine which type of missingness your data have; you can refer to the article Multiple Imputation, or the summary (Missing data assumptions and corresponding imputation methods) in Multiple imputation as a valid way of dealing with missing data.

Let's keep it practical and focus on how to impute missing data. For example, LOCF (Last Observation Carried Forward) is a standard method for imputing missing data in clinical trials: it fills in missing values at later time points in the study, but it can lead to biased results. Other methods such as BOCF (Baseline Observation Carried Forward), WOCF (Worst Observation Carried Forward), and multiple imputation are also used, though rarely seen in oncology studies. Another common method, MMRM (Mixed-effect Model for Repeated Measures), is used for continuous endpoints with missing data.

Given that SAS is still the dominant delivery program, here I will record how to use SAS to handle missing data. However, I also strongly suggest using R as an alternative or QC program, as I believe R will be accepted by regulatory authorities, at least as an optional delivery program. I will therefore record how to use the `rbmi` package to deal with missing data (LOCF and multiple imputation) and compute LS means with an ANCOVA model in another article.

Here we create a dummy dataset with three columns: `usubjid`, `avisitn`, and `aval`.

```sas
data data;
    input usubjid $8. avisitn aval;
    datalines;
1001-101 0 85
1001-101 1 84
1001-101 2 86
1001-101 3 .
1001-101 4 .
1001-101 5 85
1002-101 0 90
1002-101 1 .
1002-101 2 91
1002-101 3 92
1002-101 4 .
1002-101 5 .
;
run;

proc sort;
    by usubjid avisitn;
run;
```

Actually there are several approaches to implement LOCF; see LOCF-Different Approaches, Same Results Using LAG Function, RETAIN Statement, and ARRAY Facility. I usually use the `RETAIN` statement, as it is easy to understand and quite elegant. That brings us to the final code: whenever `usubjid` changes, the retained `rn` variable is reset to missing (.); then an `if` statement checks whether the current `aval` is non-missing, updating `rn` with it when it is, and otherwise carrying `rn` forward into the imputed value.

```sas
data locf;
    length dtype $10.;
    retain rn;
    set data;
    by usubjid avisitn;
    if first.usubjid then rn = .;
    if aval ne . then do;
        rn = aval;
        aval_locf = aval;
    end;
    else do;
        aval_locf = rn;
        dtype = "LOCF";
    end;
run;
```

We can see the final dataset below with the LOCF'ed variable, `aval_locf`.

The BOCF and WOCF methods are conservative like LOCF, and their programming logic is roughly the same. The former can be used when subjects drop out due to an adverse event, while the latter can be used for lack of efficacy (LOE).

Multiple imputation (MI) is more robust than LOCF, as it imputes the data multiple times.

The procedure for multiple imputation is generally the same in both SAS and R:

- Imputation: the missing data are imputed `m` times, generating `m` complete datasets from a specified model or distribution.
- Analysis: each of these datasets is analyzed with a certain statistical model or function, generating `m` sets of estimates.
- Pooling: the `m` sets of estimates are combined into one MI result with an appropriate method, such as Rubin's Rules (RR), which is specifically designed to pool parameter estimates and is implemented in both SAS and R packages.

These procedures are easy to understand, so how to implement them?

- In SAS, use the `proc mi` procedure for imputation, a statistical model such as `proc mixed` for analysis, and finally the `proc mianalyze` procedure for pooling.
- In R, although several packages are available, I personally prefer the `mice` and `rbmi` packages, which will be introduced in other articles.

Compared with the two methods above, MMRM (Mixed-effect Model for Repeated Measures) does not impute individual missing values; instead, it treats each subject as a random effect, since the model already accounts for the missing data (the missing data are implicitly imputed).

So it can be seen that MMRM controls type I error well, whereas LOCF may inflate it. Although MI can also control type I error, it is more conservative than MMRM because it tends to underestimate the treatment effect.

There is actually a really impressive article comparing MMRM with MI, including the regulatory authorities' considerations on this topic; referring to it is quite helpful: Handling of Missing Data: Comparison of MMRM (mixed model repeated measures) versus MI (multiple imputation).

- In SAS, you can simply use the `proc mixed` procedure with a maximum-likelihood-based mixed model.
- In R, the `nlme` package is commonly used, but the newer `mmrm` package offers advanced functionality (I've only heard of it so far...).

Multiple Imputation

SAS LOCF For Multiple Variables

SAS LOCF

LOCF-Different Approaches, Same Results Using LAG Function, RETAIN Statement, and ARRAY Facility

LOCF Method and Application in Clinical Data Analysis

临床试验中缺失数据的预防与处理

Handling of Missing Data: Comparison of MMRM (mixed model repeated measures) versus MI (multiple imputation)

This question is about multiple color scales in `ggplot2`. I encountered it when I wanted to construct two different color ranges for the `col` aesthetic, for example in `geom_line` and `geom_text`. Sometimes I may choose another way to visualize the data to avoid this situation, but I really wanted to know how to solve it if I have to use this color strategy.

From my Google search, I found the best solution, thanks to the contribution of Elio Campitelli, the author of the `ggnewscale` package. He demonstrates how to implement two color scales and explains the principle. You can refer to his article, Multiple color (and fill) scales with ggplot2.

Here I just show how it works. First, we prepare the dummy data.

```r
library(tidyverse)
library(ggnewscale)

set.seed(123)
data <- tibble(
  id = rep(1:5, each = 4),
  day = sample(5:20, 20, replace = TRUE),
  linecol = str_c("col", id),
  day2 = day + 2,
  label = rep(c("Group1", "Group2"), each = 10)
)
```

Then I'd like to draw a line plot with labels around it. The line colors are determined by the `linecol` variable, while the label colors come from the `label` group. Let's look at the failing example first, which unsurprisingly errors out.

```r
data %>%
  ggplot(aes(x = day, y = id)) +
  geom_line(aes(col = linecol)) +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue")) +
  geom_text(aes(label = label, col = label)) +
  scale_color_manual(values = c("blue", "orange"), guide = NULL)

# Error
# Scale for colour is already present.
# Adding another scale for colour, which will replace the existing scale.
# Error in `palette()`:
# ! Insufficient values in manual scale. 7 needed but only 2 provided.
```

To solve it, you just need to add one line, as mentioned in the reference article: `structure(ggplot2::standardise_aes_names("colour"), class = "new_aes")`, or the `new_scale_color()` function wrapped in the `ggnewscale` package. If you want to use `scale_fill_*`, replace "colour" with "fill".

So the final code without any error is shown below.

```r
data %>%
  ggplot(aes(x = day, y = id)) +
  geom_line(aes(col = linecol)) +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue")) +
  structure(ggplot2::standardise_aes_names("colour"), class = "new_aes") +
  # new_scale_color() +
  geom_text(aes(label = label, col = label)) +
  scale_color_manual(values = c("blue", "orange"), guide = NULL)
```

There is no doubt that this solution is not very common or formal, and I really hope it can be merged into the `ggplot2` family so that I only need to import one package. Ah!

This method follows Call ChatGPT (or really any other API) from R, a tutorial published as early as March 2nd! Here is an example:

```r
library(httr)

api_key <- "sk-Zzpcse7C0Mabe461NvEbToA3g765nYnmFwGgZ5b"

response <- POST(
  # curl https://api.openai.com/v1/chat/completions
  url = "https://api.openai.com/v1/chat/completions",
  # -H "Authorization: Bearer $OPENAI_API_KEY"
  add_headers(Authorization = paste("Bearer", api_key)),
  # -H "Content-Type: application/json"
  content_type_json(),
  # -d '{
  #   "model": "gpt-3.5-turbo",
  #   "messages": [{"role": "user", "content": "What is a banana?"}]
  # }'
  encode = "json",
  body = list(
    model = "gpt-3.5-turbo",
    messages = list(list(role = "user", content = "How to use ChatGPT API in R?"))
  )
)

chatGPT_answer <- content(response)$choices[[1]]$message$content
chatGPT_answer <- stringr::str_trim(chatGPT_answer)
cat(chatGPT_answer)
```

First, create an API key on the OpenAI API page (you need to log in with an OpenAI account). Then click `Create new secret key`, which generates a string like the `sk-Zzpcse7C0Mabe461NvEbToA3g765nYnmFwGgZ5b` in the example code above (note: the key in the example is fake and unusable, because I altered some characters of my own key). Finally, change the text after `content` in the example code to whatever you want to ask, and you can interact with ChatGPT.

You can also wrap it into a function for convenient reuse, e.g.:

```r
# Calls the ChatGPT API with the given prompt and returns the answer
ask_chatgpt <- function(prompt) {
  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    encode = "json",
    body = list(
      model = "gpt-3.5-turbo",
      messages = list(list(role = "user", content = prompt))
    )
  )
  str_trim(content(response)$choices[[1]]$message$content)
}

answer <- ask_chatgpt("How to use ChatGPT API in R?")
cat(answer)
```

Overall it's quite convenient, right?

Is there a ready-made ChatGPT R package? See ChatGPT coding assistant for RStudio, the `chatgpt` package. It not only provides the `ask_chatgpt()` function for interacting with ChatGPT but also other interesting features; see its documentation for details. Usage is similar to the API method above: set the API key first, then call the corresponding function, just a bit more concise, e.g.:

```r
Sys.setenv(OPENAI_API_KEY = "sk-Zzpcse7C0Mabe461NvEbToA3g765nYnmFwGgZ5b")  # fake key, don't use it.
library(chatgpt)
cat(ask_chatgpt("How to use ChatGPT API in R?"))
#> *** ChatGPT input:
#> How to use ChatGPT API in R?
#> You can use the `httr` package in R to interact with ChatGPT's API.
#> Here's a sample code to get started.
#> 1. Install `httr` package via `install.packages("httr")`
#>
#>     # Load httr library
#>     library(httr)
#>
#>     # Set the API parameters
#>     url <- "https://api.chatgpt.com/chat"
#>     body <- list(
#>       query = "Hi, how are you?",
#>       token = "<your-api-token>"
#>     )
#>
#>     # Send a POST request to the endpoint
#>     response <- POST(url, body = body)
#>
#>     # Extract the response content as a string
#>     content(response, as = "text")
#>
#> In this example, the `url` variable stores the endpoint URL of ChatGPT API. `body` has two parameters:
#> - `query`: The text you want to send to ChatGPT API as input.
#> - `token`: Your API token provided by ChatGPT.
#> After setting the variables, the `POST` function from `httr` package sends a POST request to the API
#> endpoint. Finally, the response content is extracted as a string using `content` function. Make sure
#> to replace `<your-api-token>` with your ChatGPT API token before running the
```

ChatGPT is truly amazing! If the ChatGPT web version doesn't crash, I'll still prefer the web version, because at this stage I only use it for search rather than development.

There are several ways to use `str_split()` inside the `mutate()` function from the `dplyr` package in R. All of them are referred to in this discussion on Stack Overflow; I keep a record of them here for convenient future reference.

First of all, I'll show one wrong way that I used before. Given the dummy data below, suppose you would like to split each string on the `_` delimiter and keep the first half.

```r
library(tidyverse)

data <- tibble(
  label = c("a_1", "b_2", "c_3", "d_4", "e_5")
)
```

From my past experience, I was used to splitting the label column with `str_split(label, "_")[[1]][1]`. But that cannot give the correct output: the values all become "a". You can see below or try it yourself.

```r
data %>% mutate(sublabel = str_split(label, "_")[[1]][1])
# A tibble: 5 × 2
  label sublabel
  <chr> <chr>
1 a_1   a
2 b_2   a
3 c_3   a
4 d_4   a
5 e_5   a
```

Obviously that's wrong. The correct approaches are listed below, summarized from that Stack Overflow discussion.
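To see why it fails, note that `str_split()` returns a list with one character vector per input element, so `[[1]][1]` always grabs the first piece of the first element, which `mutate()` then recycles down the whole column:

```r
library(stringr)

# One character vector per input string
str_split(c("a_1", "b_2"), "_")
#> [[1]]
#> [1] "a" "1"
#>
#> [[2]]
#> [1] "b" "2"

# [[1]][1] picks only the first piece of the first element
str_split(c("a_1", "b_2"), "_")[[1]][1]
#> [1] "a"
```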

- Add the `simplify = TRUE` argument, which returns a character matrix instead of a list, so that `[, 1]` extracts the first half.

  ```r
  data %>% mutate(sublabel = str_split(label, "_", simplify = TRUE)[, 1])
  ```

- Use the `separate()` function instead of `str_split()`; a clever way to avoid the problem entirely.

  ```r
  data %>% separate(label, c("sublabel1", "sublabel2"))
  ```

- Similar to the first, but more explicit: `map_chr()` applies a function to each element of a list, so selecting the first piece of each element is just `map_chr(., 1)`.

  ```r
  data %>% mutate(sublabel = str_split(label, "_") %>% map_chr(., 1))
  ```

This is a brief post, and I hope it will be a reminder for me when I forget something.

I was also confused about the distinction between these two estimates when I was new to drug trials and was asked to calculate the follow-up time.

For the median survival time, I suppose many people know how to obtain it from the Kaplan-Meier curve. But how about the median follow-up time? When calculating follow-up time, we cannot guarantee that all subjects are still on study, so we need a reasonable way to handle subjects who have completed or discontinued; otherwise, directly taking the median will produce an underestimate.

Referring to Schemper and Smith, a very clean approach is the reverse Kaplan-Meier curve. In a tumor trial where the event of interest becomes loss to follow-up, it's easy to understand that we cannot know how long subjects would have been followed had that event not happened. To make the calculation analytical, Schemper and Smith suggest fitting the Kaplan-Meier curve with the status indicator reversed (in R, `1` indicates a censored subject and `0` indicates an event), so that the resulting median "survival" time can be interpreted as the median follow-up time.

Suppose we have a set of OS survival data where `status = 1` indicates death and `0` indicates censoring. In R, the reverse Kaplan-Meier can be computed as below.

```r
library(survival)
library(survminer)  # surv_median() comes from survminer

# status == 0 reverses the indicator: censoring becomes the "event"
fit <- survfit(Surv(time, status == 0) ~ 1, data = os_data)
surv_median(fit)
```
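As a concrete, runnable illustration (not the article's `os_data`), the bundled `survival::lung` dataset, where `status` is coded 1 = censored and 2 = dead, gives the median follow-up via the same trick:

```r
library(survival)

# Reverse KM: treat censoring (status == 1 in lung's coding) as the event
fit_rev <- survfit(Surv(time, status == 1) ~ 1, data = lung)
fit_rev  # the printed median is the median follow-up time
```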

- M Schemper and TL Smith. A note on quantifying follow-up in studies of failure time. Controlled clinical trials (1996) vol. 17 (4) pp. 343-346
- Determining the median followup time

In trials, we would actually draw up a plan defining the rules for imputing partial dates. Here, I simplify the imputation rules as shown below to illustrate their implementation in R and SAS:

- If the day of analysis start date is missing then impute the first day of the month. If both the day and month are missing then impute to 01-Jan.
- If the day of analysis end date is missing then impute the last day of the month. If both the day and month are missing then impute to 31-Dec.
- If the imputed analysis end date is after the last alive date then set it to the last alive date.

First, let's create dummy data in SAS that includes four variables.

```sas
data dummy;
    length USUBJID $20. LSTALVDT $20. AESTDTC $20. AEENDTC $20.;
    input USUBJID $ LSTALVDT $ AESTDTC $ AEENDTC $;
    datalines;
SITE01-001 2023-01-10 2019-06-18 2019-06-29
SITE01-001 2023-01-10 2020-01-02 2020-02
SITE01-001 2023-01-10 2022-03 2022-03
SITE01-001 2023-01-10 2022-06 2022-06
SITE01-001 2023-01-10 2023 2023
;
run;
```

- `USUBJID`: unique subject identifier.
- `LSTALVDT`: last known alive date.
- `AESTDTC`: start date of adverse event.
- `AEENDTC`: end date of adverse event.

From the rules above, imputing the start date is easy: we just concatenate "01" onto a date that is missing its day. For the end date `AENDT`, however, we must work out which day ends each month, for example the 28th or 29th, the 30th or the 31st. So we apply the `intnx` function to get the last day correctly.

```sas
data dummy_2;
  set dummy;
  /* impute the start date */
  if length(AESTDTC) = 7 then do;
    ASTDTF = "D";
    ASTDT = catx('-', AESTDTC, "01");
  end;
  else if length(AESTDTC) = 4 then do;
    ASTDTF = "M";
    ASTDT = catx('-', AESTDTC, "01-01");
  end;
  else if length(AESTDTC) = 10 then ASTDT = AESTDTC;
  /* impute the end date */
  if length(AEENDTC) = 7 then do;
    AENDTF = "D";
    AEENDTC_ = catx('-', AEENDTC, "01");
    AENDT = put(intnx('month', input(AEENDTC_, yymmdd10.), 0, 'E'), yymmdd10.);
  end;
  else if length(AEENDTC) = 4 then do;
    AENDTF = "M";
    AENDT = catx('-', AEENDTC, "12-31");
  end;
  else if length(AEENDTC) = 10 then AENDT = AEENDTC;
  /* cap at the last alive date */
  if input(AENDT, yymmdd10.) > input(LSTALVDT, yymmdd10.) then AENDT = LSTALVDT;
  drop AEENDTC_;
run;
```

From the output we can see that when the day of a date is missing, the imputation flag variable (e.g. `ASTDTF`) is set to "D", and when the month is missing as well, it is set to "M". The code also accounts for leap years and caps the imputed date at the last alive date when it would fall later. So I believe all the dates have been imputed correctly.

Then let's create the same dummy data to see how to implement the rules in R.

```r
library(tidyverse)
library(lubridate)

dummy <- tibble(
  USUBJID = "SITE01-001",
  LSTALVDT = "2023-01-10",
  AESTDTC = c("2019-06-18", "2020-01-02", "2022-03", "2022-06", "2023"),
  AEENDTC = c("2019-06-29", "2020-02", "2022-03", "2022-06", "2023")
)
```

The dummy data is shown below.

```
# A tibble: 5 × 4
  USUBJID    LSTALVDT   AESTDTC    AEENDTC
  <chr>      <chr>      <chr>      <chr>
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29
2 SITE01-001 2023-01-10 2020-01-02 2020-02
3 SITE01-001 2023-01-10 2022-03    2022-03
4 SITE01-001 2023-01-10 2022-06    2022-06
5 SITE01-001 2023-01-10 2023       2023
```

And then we follow the same rules as in SAS to impute the partial dates in R. To impute the last day of each month, we'd better use the `rollback()` and `ceiling_date()` functions from the `lubridate` package, which get the day right even in leap years. The rest are common `tidyverse` functions for data manipulation, such as `case_when()` and `select()`.

```r
dummy_2 <- dummy %>%
  mutate(
    ASTDTF = case_when(
      str_length(AESTDTC) == 4 ~ "M",
      str_length(AESTDTC) == 7 ~ "D"
    ),
    ASTDT_ = case_when(
      str_length(AESTDTC) == 4 ~ str_c(AESTDTC, "01-01", sep = "-"),
      str_length(AESTDTC) == 7 ~ str_c(AESTDTC, "01", sep = "-"),
      is.na(ASTDTF) ~ AESTDTC
    ),
    ASTDT = ymd(ASTDT_),
    AENDTF = case_when(
      str_length(AEENDTC) == 4 ~ "M",
      str_length(AEENDTC) == 7 ~ "D"
    ),
    AENDT_ = case_when(
      str_length(AEENDTC) == 4 ~ str_c(AEENDTC, "12-31", sep = "-"),
      str_length(AEENDTC) == 7 ~ str_c(AEENDTC, "-15"),
      is.na(AENDTF) ~ AEENDTC
    ),
    AENDT = case_when(
      str_length(AEENDTC) == 7 ~ rollback(ceiling_date(ymd(AENDT_), "month")),
      TRUE ~ ymd(AENDT_)
    ),
    AENDT = if_else(AENDT > ymd(LSTALVDT), ymd(LSTALVDT), AENDT)
  ) %>%
  select(-ASTDT_, -AENDT_)
```

Here we can see that the output is consistent with SAS. It's very easy in R, right? You can also use many handy functions to convert between date formats, for example from `date9.` to `yymmdd10.` with `dmy("01Jan2023")`. Honestly, the `lubridate` package provides a whole series of functions for date manipulation, such as `interval()` for calculating the duration of AEs.

```
# A tibble: 5 × 8
  USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDTF ASTDT      AENDTF AENDT
  <chr>      <chr>      <chr>      <chr>      <chr>  <date>     <chr>  <date>
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 NA     2019-06-18 NA     2019-06-29
2 SITE01-001 2023-01-10 2020-01-02 2020-02    NA     2020-01-02 D      2020-02-29
3 SITE01-001 2023-01-10 2022-03    2022-03    D      2022-03-01 D      2022-03-31
4 SITE01-001 2023-01-10 2022-06    2022-06    D      2022-06-01 D      2022-06-30
5 SITE01-001 2023-01-10 2023       2023       M      2023-01-01 M      2023-01-10
```
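As a small illustration of the `interval()` idea mentioned above (my own sketch, using hypothetical imputed dates rather than the dummy data):

```r
# AE duration in days from imputed start/end dates, via lubridate
library(lubridate)

astdt <- ymd("2022-03-01")  # imputed start date
aendt <- ymd("2022-03-31")  # imputed end date

# +1 makes the duration inclusive of both endpoints, the usual AE convention
time_length(interval(astdt, aendt), unit = "day") + 1
# [1] 31
```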

`admiral` Package

Maybe you would ask whether there is a package that can handle date imputation for ADaM, with the manipulation wrapped up in a series of functions covering the common imputation situations. There's no doubt that you can rely on the `admiral` package. Let me show some examples of how to use it for imputing partial dates.

```r
library(admiral)

dummy %>%
  derive_vars_dt(
    dtc = AESTDTC,
    new_vars_prefix = "AST",
    highest_imputation = "M",
    date_imputation = "first"
  ) %>%
  mutate(LSTALVDT = ymd(LSTALVDT)) %>%
  derive_vars_dt(
    dtc = AEENDTC,
    new_vars_prefix = "AEND",
    highest_imputation = "M",
    date_imputation = "last",
    max_dates = vars(LSTALVDT)
  )
```

Isn't the code quite straightforward? If your date variable is a datetime (DTM), you can use `derive_vars_dtm()` instead.

```
# A tibble: 5 × 8
  USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDT      ASTDTF AENDDT     AENDDTF
  <chr>      <date>     <chr>      <chr>      <date>     <chr>  <date>     <chr>
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 2019-06-18 NA     2019-06-29 NA
2 SITE01-001 2023-01-10 2020-01-02 2020-02    2020-01-02 NA     2020-02-29 D
3 SITE01-001 2023-01-10 2022-03    2022-03    2022-03-01 D      2022-03-31 D
4 SITE01-001 2023-01-10 2022-06    2022-06    2022-06-01 D      2022-06-30 D
5 SITE01-001 2023-01-10 2023       2023       2023-01-01 M      2023-01-10 M
```
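For datetimes, a sketch with `derive_vars_dtm()` might look like the following (the datetime values here are hypothetical, and the argument names mirror `derive_vars_dt()` with an extra `time_imputation`):

```r
# Hypothetical sketch: imputing partial datetimes with admiral
library(admiral)
library(dplyr)
library(tibble)

dummy_dtm <- tibble(AESTDTC = c("2019-06-18T10:30", "2020-02", "2022"))

dummy_dtm %>%
  derive_vars_dtm(
    dtc = AESTDTC,
    new_vars_prefix = "AST",
    highest_imputation = "M",
    date_imputation = "first",
    time_imputation = "first"
  )
```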

I'm planning to learn more of the `admiral` package, for example by building the ADaM ADRS dataset. I believe this package greatly improves the R ecosystem for drug trials.

- Common Dating in R: With an example of partial date imputation
- Tips to Manipulate the Partial Dates
- Date and Time Imputation

To reach this purpose, we just need to take two steps:

- Split the data frame by group.
- Add a blank row.

The idea is extremely clear and similar to the SAS process. Here, let's see how to complete these two steps.

Firstly, I create test data like:

```r
library(tidyverse)

data <- iris %>%
  group_by(Species) %>%
  slice_head(n = 3) %>%
  select(Species, everything())

> data
# A tibble: 9 × 5
# Groups:   Species [3]
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5.1         3.5          1.4         0.2
2 setosa              4.9         3            1.4         0.2
3 setosa              4.7         3.2          1.3         0.2
4 versicolor          7           3.2          4.7         1.4
5 versicolor          6.4         3.2          4.5         1.5
6 versicolor          6.9         3.1          4.9         1.5
7 virginica           6.3         3.3          6           2.5
8 virginica           5.8         2.7          5.1         1.9
9 virginica           7.1         3            5.9         2.1
```

Now I'd like to insert a blank row between each `Species`, which means inserting a row between rows 3-4 and rows 6-7. So we use the `group_split` function to split the data by the `Species` variable.

```r
data %>% group_split(Species)
```

We can see the output class is a list, so the next step is to convert this list back into a data frame with blank rows added. Here the functional programming package `purrr` helps: its `map_dfr` function applies a function (here `add_row`) to each element of the list and row-binds the results.

```r
data %>%
  group_split(Species) %>%
  map_dfr(~add_row(.x, .after = Inf))
# A tibble: 12 × 5
   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa              5.1         3.5          1.4         0.2
 2 setosa              4.9         3            1.4         0.2
 3 setosa              4.7         3.2          1.3         0.2
 4 NA                 NA          NA           NA          NA
 5 versicolor          7           3.2          4.7         1.4
 6 versicolor          6.4         3.2          4.5         1.5
 7 versicolor          6.9         3.1          4.9         1.5
 8 NA                 NA          NA           NA          NA
 9 virginica           6.3         3.3          6           2.5
10 virginica           5.8         2.7          5.1         1.9
11 virginica           7.1         3            5.9         2.1
12 NA                 NA          NA           NA          NA
```

The above output is what I expected. And I feel the R code is more concise and clear than the SAS version, don't you think?

In R, the simplest way to replace NA values is the `replace()` function or direct `is.na()` indexing.

```r
library(tidyverse)

data <- tibble(
  a = c(1, 2, NA, 3, 4),
  b = c(5, NA, 6, 7, 8),
  c = c(9, 10, 11, NA, 12)
)
```

For instance, if we want to replace NAs in all columns, the simple functions can be used like:

```r
data[is.na(data)] <- 0
replace(data, is.na(data), 0)
```

In a more realistic scenario we will have both numeric and character columns at the same time, not only the numeric columns of the example above. The previous method is then less convenient, as we must first select the numeric or character columns and then replace their NAs with an appropriate value. After some searching on Google, I think the simpler way is to use `dplyr::mutate_if()` to select columns of a specific type and `replace_na()` to replace the NAs.

```r
data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)

data %>%
  mutate_if(is.numeric, ~replace_na(., 0)) %>%
  mutate_if(is.character, ~replace_na(., "xx"))
```

To be honest, I prefer these combined functions since I'm used to the pipe `%>%` in R, so functions like `mutate_if()`, `mutate_all()`, and `mutate_at()` from the `tidyverse` are very convenient for me.

For instance, to replace NAs with 0 in columns selected by name or index:

```r
data %>% mutate_at(c(1, 2), ~replace_na(., 0))
```
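Note that in recent versions of dplyr the `mutate_if()`/`mutate_at()` family is superseded by `across()`; the same replacement can be sketched as:

```r
# Modern dplyr idiom: replace NAs per column type with across()
library(dplyr)
library(tidyr)

data %>%
  mutate(
    across(where(is.numeric), ~ replace_na(.x, 0)),
    across(where(is.character), ~ replace_na(.x, "xx"))
  )
```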

Besides, the `dplyr::coalesce()` function can also be used to replace NAs in a rather tricky way, although its usual job is to find the first non-missing element.

```r
data %>% mutate(num1 = coalesce(num1, 0))
```

R – Replace NA with Empty String in a DataFrame

R – Replace NA with 0 in Multiple Columns

Let's see a demo.

```r
library(ggplot2)
library(tidyverse)

# Data
data(iris)

ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_boxplot()
```

Adding jittered points to a box plot in `ggplot` is useful for seeing the underlying distribution of the data. You can use the `geom_jitter` function with a few parameters, for example the `width` parameter to adjust the horizontal spread of the jittered points.

```r
ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species, shape = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)
```

Sometimes we might try to add jittered data points to a grouped boxplot, but we cannot use the `geom_jitter()` function directly, as it is just a handy shortcut for `geom_point(position = "jitter")`. Let's see what chart is generated below: a grouped boxplot whose jittered data points overlap across the groups.

```r
ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = group, shape = group)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)
```

So how do we add the jittered data points to the grouped boxplot correctly? We can pass `position_jitterdodge()` as the `position` argument inside the `geom_point` function.

```r
ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = group, shape = group)) +
  geom_boxplot() +
  geom_point(position = position_jitterdodge(jitter.width = 0.25))
```

Right now we get a nice-looking grouped boxplot with clearly separated boxes and jittered data points within each box.

https://r-charts.com/distribution/box-plot-jitter-ggplot2/

https://datavizpyr.com/how-to-make-grouped-boxplot-with-jittered-data-points-in-ggplot2/

Suppose I want to plot a scatter plot with regression lines for the `sashelp.iris` dataset using the GTL (Graph Template Language) process. So I first define a GTL template.

```sas
proc template;
  define statgraph ScatterRegPlot;
    begingraph / backgroundcolor=white border=false
                 datacontrastcolors=(orange purple blue)
                 datasymbols=(circlefilled trianglefilled diamondfilled);
      layout overlay;
        scatterplot x=SepalLength y=SepalWidth / group=Species name='points';
        regressionplot x=SepalLength y=SepalWidth / group=Species degree=3 name='reg';
        discretelegend 'points';
      endlayout;
    endgraph;
  end;
run;
```

Now let's see how to create RTF or PDF with this graph.

For PDF as below:

```sas
ods escapechar="^";
ods listing close;
options nonumber nodate;
ods pdf file="C:/Users/Desktop/example.pdf";
proc sgrender data=sashelp.iris template=ScatterRegPlot;
run;
ods pdf close;
ods listing;
```

For RTF, just change `ods pdf` above to `ods rtf`.

If we just want to save as PNG, as follows:

```sas
ods listing gpath='C:/Users/TJ0695/Desktop' image_dpi=300 style=Journal;
ods graphics / imagename="example" imagefmt=png width=20cm height=15cm;
proc sgrender data=sashelp.iris template=ScatterRegPlot;
run;
ods graphics off;
```

If we increase the DPI to 600, it causes an error like `ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.`. So we need to modify the SAS configuration file to fix this error.

- Run `proc options option=config; run;` to find the configuration file in use.
- Open that file, find the options starting with `-Xms` and `-Xmx`, and change both values from `128m` to `1024m`.
- Restart SAS and rerun the code.

After that, the error doesn't appear again, though the warning is still there.

I find it difficult to understand what LS actually means in its literal sense.

The definition from the `lsmeans` package, which has since transitioned to the `emmeans` package, is shown below:

Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.

In fact, even after reading this sentence, I was still quite confused. What is the reference grid, and how is the prediction made?

So let's see how the LS mean is calculated, along with its corresponding confidence interval.

Firstly, import the CDISC pilot dataset, the same one as in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process `adsl` and `adlb` to create an analysis dataset `ana_dat` so that we can fit the ANCOVA with the `lm` function. Suppose we want to see whether `CHG` (change from baseline) is affected by the independent variable `TRTP` (treatment) while controlling for the covariates `BASE` (baseline) and `AGE` (age).

Filter the dataset on the `BASE` variable, as one missing value can be found in the dataset.

```r
library(tidyverse)
library(emmeans)

ana_dat2 <- filter(ana_dat, !is.na(BASE))
```

Then fit the ANCOVA model with the `lm` function.

```r
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838
```

We know that the LS means are calculated over a reference grid that contains the means of the covariates and all levels of the factor predictors.

```r
rg <- ref_grid(fit)
# 'emmGrid' object with variables:
#     BASE = 5.4427
#     AGE = 75.309
#     TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
```

As we can see from the output above, the means of `BASE` and `AGE` are `5.4427` and `75.309`, respectively. Or we can calculate them manually:

```r
summary(ana_dat2[, c("BASE", "AGE")])
#       BASE             AGE
#  Min.   : 3.497   Min.   :51.00
#  1st Qu.: 4.774   1st Qu.:71.00
#  Median : 5.273   Median :77.00
#  Mean   : 5.443   Mean   :75.31
#  3rd Qu.: 5.718   3rd Qu.:81.00
#  Max.   :10.880   Max.   :88.00
```

Then we can use the `summary()` or `predict()` function to get the predicted values based on the reference grid `rg`.

```r
rg_pred <- summary(rg)
rg_pred
#  BASE  AGE TRTP                 prediction    SE df
#  5.44 75.3 Placebo                  0.0578 0.506 76
#  5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
#  5.44 75.3 Xanomeline High Dose     0.5031 0.235 76
```

The prediction column is the same as that from `predict(rg)`. The table shows the predicted values for the different factor levels with the covariates held constant at their means.

In fact, we can also calculate the predicted values ourselves, as we have the estimated regression coefficients in `fit$coefficients`.

```
> fit$coefficients
             (Intercept)                     BASE                      AGE
             -1.11361290               0.11228582               0.00743963
 TRTPXanomeline Low Dose TRTPXanomeline High Dose
             -0.24108746               0.44531274
```

As `TRTP` includes multiple levels, it has been converted into dummy variables:

```r
contrasts(ana_dat2$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1
```

Now, if we want to calculate the predicted value for the `Xanomeline Low Dose` level, it goes as follows:

```
> 0.11229*5.44 + 0.00744*75.3 - 0.24109*1 - 1.11361
[1] -0.1836104
```
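Equivalently (my own check, not from the original post), all three predictions can be obtained at once by calling `predict()` on a hand-built grid that fixes the covariates at their means:

```r
# Predictions at the covariate means for each treatment level (sketch)
newdat <- data.frame(
  BASE = 5.4427,
  AGE = 75.309,
  TRTP = factor(
    c("Placebo", "Xanomeline Low Dose", "Xanomeline High Dose"),
    levels = levels(ana_dat2$TRTP)
  )
)
predict(fit, newdata = newdat)
```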

Back to LS means, from its definition, it seems to be the average of the predicted values.

```r
rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
#   TRTP                 LSmean
#   <fct>                 <dbl>
# 1 Placebo              0.0578
# 2 Xanomeline Low Dose -0.183
# 3 Xanomeline High Dose 0.503
```

This is exactly the same result as `lsmeans(rg, "TRTP")` from the `emmeans` package; simply calling `emmeans(fit, "TRTP")` also gives the same result.

```r
lsmeans(rg, "TRTP")
# TRTP                 lsmean    SE df lower.CL upper.CL
# Placebo              0.0578 0.506 76   -0.949    1.065
# Xanomeline Low Dose -0.1833 0.211 76   -0.603    0.236
# Xanomeline High Dose 0.5031 0.235 76    0.036    0.970
```

The residual degrees of freedom are `76`: with 81 observations, we spend 1 DF on the intercept, 1 on each of the covariates `BASE` and `AGE`, and 2 on `TRTP`, so `81 - 1 - 1 - 1 - 2 = 76`.

Using `test()` we can get the p-value for comparing each LS mean to zero.

```r
test(lsmeans(fit, "TRTP"))
# TRTP                 lsmean    SE df t.ratio p.value
# Placebo              0.0578 0.506 76   0.114  0.9093
# Xanomeline Low Dose -0.1833 0.211 76  -0.870  0.3869
# Xanomeline High Dose 0.5031 0.235 76   2.145  0.0351
```

In fact, the `t.ratio` column holds the t statistics, so we can calculate the p-values manually:

```r
2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)
```

Likewise, the confidence interval of an LS mean can be calculated manually from the `SE` and `DF`, for example for the Placebo level:

```
> 0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
[1] -0.9499863  1.0655863
```

I think these steps will go a long way toward understanding the meaning of least-squares means and the logic behind them. I hope this is helpful.

“emmeans” package

The estimation model of least-squares means

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Confidence intervals and tests in emmeans

Least-squares Means: The R Package lsmeans

As an example dataset, I'll use `cdiscpilot01` from CDISC, which contains the ADaM and SDTM datasets for a single study. Our purpose is to conduct an efficacy analysis by ANCOVA with LS mean estimation. Suppose we want to know whether or not the treatment has an impact on `Glucose` while accounting for the glucose baseline. The patients are limited to those who reached the `end of treatment` visit but were not discontinued due to an AE.

ANCOVA makes several assumptions about the input data, such as:

- Linearity between the covariate and the outcome variable
- Homogeneity of regression slopes
- The outcome variable should be approximately normally distributed
- Homoscedasticity
- No significant outliers

Perhaps an additional article is needed to discuss how to check these assumptions, but not here; we will suppose that all the ANCOVA assumptions have been met.
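For completeness, here is a brief sketch (my own, assuming the `ana_dat` analysis dataset built later in this section is available) of how two of these checks could be coded:

```r
# Homogeneity of regression slopes: the BASE:TRTP interaction term
# should be non-significant if the slopes are homogeneous
anova(lm(CHG ~ BASE * TRTP, data = ana_dat))

# Approximate normality of the outcome, checked on the model residuals
shapiro.test(residuals(lm(CHG ~ BASE + TRTP, data = ana_dat)))
```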

Install and load the following required packages, then load the `adsl` and `adlbc` datasets from the `cdiscpilot01` study; for details see another article: Example of SDTM and ADaM datasets from the CDISC.

```r
library(tidyverse)
library(emmeans)
library(gtsummary)
library(multcomp)

adsl <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adsl.xpt")
adlb <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adlbc.xpt")
```

Per our purpose, we need to filter the efficacy population and focus on the `Glucose (mg/dL)` lab test.

```r
gluc <- adlb %>%
  left_join(adsl %>% select(USUBJID, EFFFL), by = "USUBJID") %>%
  # PARAMCD is the parameter code; here we focus on Glucose (mg/dL)
  filter(EFFFL == "Y" & PARAMCD == "GLUC") %>%
  arrange(TRTPN) %>%
  mutate(TRTP = factor(TRTP, levels = unique(TRTP)))
```

And then produce the analysis dataset by filtering to the target patients who have reached the end of treatment and have not been discontinued due to an AE.

```r
ana_dat <- gluc %>%
  filter(AVISIT == "End of Treatment" & DSRAEFL == "Y") %>%
  arrange(SUBJID, AVISITN) %>%
  mutate(AVISIT = factor(AVISIT, levels = unique(AVISIT)))
```

Once we have the dataset for analysis, we should examine it first. I find that the `tbl_summary` function in the `gtsummary` package can calculate descriptive statistics and produce a very nice clinical-style table, as shown below:

```r
ana_dat %>%
  dplyr::select(AGEGR1, SEX, RACE, TRTP, AVAL, BASE, CHG) %>%
  tbl_summary(by = TRTP, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is from cdiscpilot01 study.*"))
```

Here we can see the descriptive summary of each variable by treatment group. Certainly we could also do some visualization such as boxplots or scatterplots, but it is not presented here.

We use the `lm` function to fit the ANCOVA model with treatment (`TRTP`) as the independent variable, change from baseline (`CHG`) as the response variable, and baseline (`BASE`) as the covariate.

```r
fit <- lm(CHG ~ BASE + TRTP, data = ana_dat)
summary(fit)
```

The summary output of the regression coefficients is as follows. If you would like the ANOVA table instead, use `anova(fit)` rather than `summary(fit)`.

```
Call:
lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1744 -0.7627 -0.0680  0.5633  5.0349

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.5579     0.8809  -0.633    0.528
BASE                       0.1111     0.1329   0.837    0.405
TRTPXanomeline Low Dose   -0.2192     0.5433  -0.404    0.688
TRTPXanomeline High Dose   0.4447     0.5528   0.804    0.424

Residual standard error: 1.328 on 77 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.06702,   Adjusted R-squared:  0.03068
F-statistic: 1.844 on 3 and 77 DF,  p-value: 0.1462
```

From the above results we can easily read off the regression coefficients of the model and their significance relative to zero. With the coefficients we can predict any change from the baseline and treatment.

Besides, we can use the `contrasts` function to obtain the contrast matrices and thereby understand the dummy variables for `TRTP` in this multiple regression model.

```
> contrasts(ana_dat$TRTP)
                     Xanomeline Low Dose Xanomeline High Dose
Placebo                                0                    0
Xanomeline Low Dose                    1                    0
Xanomeline High Dose                   0                    1
```

From the ANOVA table shown below, it can be seen that the treatment has no statistically significant effect on the change in glucose after controlling for the effect of baseline.

```
> anova(fit)
Analysis of Variance Table

Response: CHG
          Df  Sum Sq Mean Sq F value Pr(>F)
BASE       1   1.699  1.6989  0.9629 0.3295
TRTP       2   8.061  4.0304  2.2844 0.1087
Residuals 77 135.853  1.7643
```

If you would like to make the output prettier, `tbl_regression(fit)` can be used as mentioned before.

If we want to obtain the least squares (LS) mean comparisons between treatment groups, the `emmeans` and `multcomp` packages provide the same results. In addition, the process of calculating the LS mean is well worth learning and understanding.

```r
# by multcomp
postHocs <- glht(fit, linfct = mcp(TRTP = "Tukey"))
summary(postHocs)

# by emmeans
fit_within <- emmeans(fit, "TRTP")
pairs(fit_within, reverse = TRUE)
```

The summary output is shown below:

```
> summary(postHocs)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Linear Hypotheses:
                                                Estimate Std. Error t value Pr(>|t|)
Xanomeline Low Dose - Placebo == 0               -0.2192     0.5433  -0.404   0.9116
Xanomeline High Dose - Placebo == 0               0.4447     0.5528   0.804   0.6937
Xanomeline High Dose - Xanomeline Low Dose == 0   0.6639     0.3113   2.132   0.0855 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
```

Now it's clear that no significant difference was observed for any of the pairs `Low Dose vs. Placebo`, `High Dose vs. Placebo`, and `High Dose vs. Low Dose`; the last comparison is only marginal (p = 0.0855).

https://r4csr.org/efficacy-table.html#efficacy-table

ANCOVA in R

How to perform ANCOVA in R

An Introduction to ANCOVA (Analysis of Variance)

How to Conduct an ANCOVA in R

In addition to reading the corresponding procedure reference in the official documentation, I recommend turning on `ods trace on` to find the names of the statistic tables, and then extracting any table you want. Here is a regression analysis of the `sashelp.cars` dataset; let's see how to get the stat tables.

```sas
ods trace on; /* write ODS table names to log */
proc reg data=sashelp.cars plots=none;
  model Horsepower = EngineSize Weight;
quit;
ods trace off; /* stop writing to log */
```

The logs you can see in SAS are as follows:

```
Output Added:
-------------
Name:       NObs
Label:      Number of Observations
Template:   Stat.Reg.NObs
Path:       Reg.MODEL1.Fit.Horsepower.NObs
-------------

Output Added:
-------------
Name:       ANOVA
Label:      Analysis of Variance
Template:   Stat.REG.ANOVA
Path:       Reg.MODEL1.Fit.Horsepower.ANOVA
-------------

Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics
-------------

Output Added:
-------------
Name:       ParameterEstimates
Label:      Parameter Estimates
Template:   Stat.REG.ParameterEstimates
Path:       Reg.MODEL1.Fit.Horsepower.ParameterEstimates
```

Looking at this log output, you can find each stat table name, like `ParameterEstimates`. That means you can extract it by adding the `ods output ParameterEstimates=rst` statement to store the table in the `rst` dataset, as follows:

```sas
proc reg data=sashelp.cars plots=none; /* same procedure call */
  model Horsepower = EngineSize Weight;
  ods output ParameterEstimates=rst; /* the data set name is 'rst' */
quit;
```

Multiple stat tables can be stored with one `ods output` statement. For example, the statement below stores both the ParameterEstimates table and the ANOVA table at the same time.

```sas
proc reg data=sashelp.cars plots=none;
  model Horsepower = EngineSize Weight;
  ods output ParameterEstimates=parms ANOVA=anvar;
quit;
```

And then, if you want to create a macro variable that contains the value of a certain statistic, such as the slope for EngineSize:

```sas
data _null_;
  set rst;
  if variable="EngineSize" then call symputx("slope1", estimate);
run;

%put &=slope1;
```

Several procedures provide an alternative option for creating output similar to the `ods output` statement mentioned above, for instance the `outest` option of the `proc reg` procedure.

```sas
proc reg data=sashelp.cars noprint outest=rst2 rsquare; /* statistics in 'rst2' */
  model Horsepower = EngineSize Weight;
quit;
```

So you'd better check the SAS documentation to see whether the procedure you use provides such an option.

All of the above refers to the following articles:

ODS OUTPUT: Store any statistic created by any SAS procedure

Find the ODS table names produced by any SAS procedure

A SAS macro to combine portrait and landscape rtf files into one single file

In order to make it suitable for all of the following situations, I have additionally updated it to be more flexible:

- combining multiple tables, figures, and listings at the same time
- using the titles as the index of the table of contents
- ordering the files manually (just a proposed solution, not implemented yet)

First of all, let's look at the RTF structure, which is described in that article. It is divided into three parts: an opening section, a content section, and a closing section. If we look at any of our single RTF files, the structure is the same. Consequently, the RTF combining process can be summarized as follows:

- Read all filenames into SAS (sorted by filename, or in a manually defined order).
- Keep the opening section of the first RTF.
- Remove both the opening and closing sections of every RTF except the first and the last, and add the `\pard\sect` code in front of `\sectd` so that all of the files can be combined correctly.
- Keep the closing section of the last RTF.
- Save the updated RTF code into separate SAS datasets (not a single dataset, as the character length is limited in SAS).

Now let's see the code we can use for this process. Firstly, I import the RTF filenames from the external folder.

```sas
data refList(keep=filepath fn);
  length fref $8 fn $80 filepath $400;
  rc = filename(fref, "&inpath");
  if rc = 0 then dirid = dopen(fref);
  if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
  nfiles = dnum(dirid);
  do i = 1 to nfiles;
    fn = dread(dirid, i);
    fid = mopen(dirid, fn);
    if fid > 0 and index(fn, "rtf") then do;
      filepath = "&inpath\"||left(trim(fn));
      fn = strip(tranwrd(fn, ".rtf", ""));
      output;
    end;
  end;
  rc = dclose(dirid);
run;
```

Secondly, read each line of every RTF file until finding a line that starts with `\sectd`; everything above it is the opening section and everything below is the content section. Also remove the final `}` from every file except the last RTF file.

```sas
data rtfdt&i(where = (ptline=1));
  retain ptline;
  set rtfdt&i end = last;
  if substr(line,1,6)='\sectd' then do;
    ptline = 1;
    /* enable combining portrait and landscape rtf */
    line = "\pard\sect"||compress(tranwrd(line, "\pgnrestart\pgnstarts1", ""));
  end;
  if last and line^='}' then line = substr(strip(line), 1, length(strip(line))-1);
  else if last and line='}' then delete;
run;
```

Thirdly, when the title code is found in the RTF, replace `\pard` with `\pard\outlinelevel1` so that the title can be picked up as an entry in the table of contents.

```sas
%if &titleindex = 1 %then %do;
  data rtfdt&i.;
    set rtfdt&i.;
    retain fl 0;
    if index(line, '\pard\plain\') and (not index(line, '\header\pard'))
      and (not index(line, '\footer\pard')) then fl = 1 + fl;
  run;

  data rtfdt&i;
    set rtfdt&i;
    by fl notsorted;
    if fl = 1 and first.fl then
      /* add index for the contents as per titles */
      line = tranwrd(line, '\pard', '\pard\outlinelevel1');
  run;
%end;
```

At last, don't save all the RTF contents in one single SAS dataset, because the character length is limited in SAS. And append the closing `}` so that the combined RTF file stays complete.

The complete code is shown below:

```sas
/* Example */
/* %s_combrtf(inpath=&inpath, outpath=&outpath, outfile=&outfile); */
/* Parameter Description */
/* inpath      input path */
/* outpath     output path */
/* outfile     output file name */
/* titleindex  whether to add title index, default is 1 */
%macro s_combrtf(inpath=, outpath=, outfile=, titleindex=1);
  data refList(keep=filepath fn);
    length fref $8 fn $80 filepath $400;
    rc = filename(fref, "&inpath");
    if rc = 0 then dirid = dopen(fref);
    if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
    nfiles = dnum(dirid);
    do i = 1 to nfiles;
      fn = dread(dirid, i);
      fid = mopen(dirid, fn);
      if fid > 0 and index(fn, "rtf") then do;
        filepath = "&inpath\"||left(trim(fn));
        fn = strip(tranwrd(fn, ".rtf", ""));
        output;
      end;
    end;
    rc = dclose(dirid);
  run;

  /* sort by filename by default */
  proc sort data=refList sortseq=linguistic(numeric_collation=on) out=sorted_refList;
    by fn;
  quit;

  data fileorder;
    set sorted_refList;
    FileLevel = 2;
    order = .;
  run;

  data _null_;
    set fileorder end=last;
    fnref = strip("filename fnref")||strip(_N_)||right(' "')||strip(filepath)||strip('" lrecl=5000 ;');
    call execute(fnref);
    if last then call symputx('maxn', vvalue(_n_), 'l');
  run;

  %do i=1 %to &maxn.;
    data rtfdt&i.;
      infile fnref&i. truncover;
      informat line $5000.;
      format line $5000.;
      length line $5000.;
      input line $1-5000;
      line = strip(line);
    run;

    /* add title index and adapt to more flexible use */
    %if &titleindex = 1 %then %do;
      data rtfdt&i.;
        set rtfdt&i.;
        retain fl 0;
        if index(line, '\pard\plain\') and (not index(line, '\header\pard'))
          and (not index(line, '\footer\pard')) then fl = 1 + fl;
      run;

      data rtfdt&i;
        set rtfdt&i;
        by fl notsorted;
        if fl = 1 and first.fl then
          /* add index for the contents as per titles */
          line = tranwrd(line, '\pard', '\pard\outlinelevel1');
      run;
    %end;

    %if &i.=1 %then %do;
      data final;
        set rtfdt&i(keep = line) end = last;
        if last and line^='}' then line = substr(strip(line), 1, length(strip(line))-1);
        else if last and line='}' then delete;
      run;
    %end;

    %if &i.^=1 %then %do;
      data rtfdt&i(where = (ptline=1));
        retain ptline;
        set rtfdt&i end = last;
        if substr(line,1,6)='\sectd' then do;
          ptline = 1;
          /* enable combining portrait and landscape rtf */
          line = "\pard\sect"||compress(tranwrd(line, "\pgnrestart\pgnstarts1", ""));
        end;
        if last and line^='}' then line = substr(strip(line), 1, length(strip(line))-1);
        else if last and line='}' then delete;
      run;
    %end;

    %if &i.=&maxn. %then %do;
      %local _cnt;
      data final;
        set final
          %do _cnt=2 %to &maxn;
            rtfdt&_cnt(keep = line)
          %end;
        ;
      run;

      data final;
        set final end = last;
        if last then line = strip(line)||strip("}");
      run;
    %end;
  %end;

  data _null_;
    file "&outpath\&outfile..rtf" lrecl=5000 nopad;
    set final;
    put line;
  run;
%mend;
```

This approach, in my opinion, is quite excellent as it resolves the following issue: different companies put titles and footnotes in different places. Some place them in the header and footer section and some place them in the body of the RTF document. The macro above works no matter where you place the titles and footnotes.

A SAS macro to combine portrait and landscape rtf files into one single file

Combine multiple RTF files to one file

SM05: An Efficient Way to Combine RTF Files and Create Multi-Level Bookmarks and a Hyperlinked TOC

utl-sas-macro-to-combine-rtf-files-into-one-single-file

http://onbiostatistics.blogspot.com/2009/01/data-dredging-vs-data-mining-post-hoc.html

- Ad hoc refers to additional statistical analysis requests for the final report that arise after the final statistical analysis has been completed.
- Post hoc refers to additional statistical analysis requests arising from regulatory review comments after submission.

For both situations the handling is the same. For example, if additional variables are needed, the ADS dataset specification must be supplemented accordingly. The relevant documents can be attached as appendices to the SAP or the validation plan/report, in which case the corresponding version must appear in the file name and title; they can also be kept as stand-alone documents.

A post-hoc analysis, i.e. an "after the event" analysis, means that after data collection is complete, additional groupings and research hypotheses are set up based on the characteristics of the data themselves and then analyzed statistically.

Post hoc analysis is often called data dredging or data fishing, as the motivation is usually to obtain a positive result. For this reason, post-hoc results are generally not accepted by drug regulatory authorities as evidence of drug efficacy.

Points to note during the conduct of a clinical trial:

- All major changes during the trial must be documented.
- Ad-hoc analyses are discouraged (an ad-hoc analysis does not state a hypothesis before the analysis, which violates strict statistical principles, so it can only yield exploratory conclusions).
- Statistical judgments must be based on an objective description and presentation of the clinical trial results.

A post-hoc analysis involves looking at the data after a study has concluded and trying to find patterns that were not primary objectives of the study. In other words, all analyses that were not pre-planned and were conducted as "additional" analyses after completing the experiment are considered post-hoc analyses. A post-hoc study is conducted using data that has already been collected.

https://www.editage.com/insights/zh-hans/node/7139

While both post-hoc and ad-hoc analyses may be performed based on the data or results we have seen, an ad-hoc analysis typically occurs alongside the project, while a post-hoc analysis occurs strictly after the project, after the unblinding of the study, or after the pre-specified analysis results have been reviewed. In this sense, the ad-hoc analysis is better than the post-hoc analysis.
