In actual trials, we would draw up a plan that defines the rules for imputing partial dates. But here I simplify the imputation rules, as shown below, to illustrate their implementation in R and SAS:

- If the day of analysis start date is missing then impute the first day of the month. If both the day and month are missing then impute to 01-Jan.
- If the day of analysis end date is missing then impute the last day of the month. If both the day and month are missing then impute to 31-Dec.
- If the imputed analysis end date is after the last alive date then set it to the last alive date.
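Before turning to SAS and R, the three rules can be sketched in a language-agnostic way. Here is a minimal Python illustration (standard library only; the function name `impute_partial` is just a hypothetical helper), using `calendar.monthrange` so that leap years are handled automatically:

```python
import calendar
from datetime import date
from typing import Optional

def impute_partial(dtc: str, kind: str, last_alive: Optional[date] = None) -> date:
    """Impute a partial ISO-8601 date ('YYYY', 'YYYY-MM', or 'YYYY-MM-DD').

    kind='start': missing day -> first of month; missing month too -> 01-Jan.
    kind='end':   missing day -> last day of month; missing month too -> 31-Dec,
                  then capped at the last known alive date when one is given.
    """
    parts = [int(p) for p in dtc.split("-")]
    if len(parts) == 3:                       # complete date: nothing to impute
        out = date(*parts)
    elif len(parts) == 2:                     # day is missing
        year, month = parts
        day = 1 if kind == "start" else calendar.monthrange(year, month)[1]
        out = date(year, month, day)          # monthrange is leap-year aware
    else:                                     # month and day are missing
        out = date(parts[0], 1, 1) if kind == "start" else date(parts[0], 12, 31)
    if kind == "end" and last_alive is not None and out > last_alive:
        out = last_alive                      # rule 3: cap at the last alive date
    return out

print(impute_partial("2020-02", "end"))                  # 2020-02-29 (leap year)
print(impute_partial("2023", "end", date(2023, 1, 10)))  # 2023-01-10 (capped)
```

The SAS and R implementations below follow exactly this logic, only with their own date machinery.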

Firstly, let’s create dummy data in SAS that includes four variables.

```sas
data dummy;
  length USUBJID $20. LSTALVDT $20. AESTDTC $20. AEENDTC $20.;
  input USUBJID $ LSTALVDT $ AESTDTC $ AEENDTC $;
  datalines;
SITE01-001 2023-01-10 2019-06-18 2019-06-29
SITE01-001 2023-01-10 2020-01-02 2020-02
SITE01-001 2023-01-10 2022-03 2022-03
SITE01-001 2023-01-10 2022-06 2022-06
SITE01-001 2023-01-10 2023 2023
;
run;
```

- `USUBJID`: unique subject identifier.
- `LSTALVDT`: last known alive date.
- `AESTDTC`: start date of adverse event.
- `AEENDTC`: end date of adverse event.

And we can see from the rules above that concatenating "01" onto a date whose day is missing is very easy. However, if we want to calculate `AENDT`, we need to consider which last day matches each month, for example the 28th or 29th, the 30th or the 31st. So we apply the `intnx` function to get the last day correctly.

```sas
data dummy_2;
  set dummy;
  if length(AESTDTC)=7 then do;
    ASTDTF="D";
    ASTDT=catx('-', AESTDTC, "01");
  end;
  else if length(AESTDTC)=4 then do;
    ASTDTF="M";
    ASTDT=catx('-', AESTDTC, "01-01");
  end;
  else if length(AESTDTC)=10 then ASTDT=AESTDTC;

  if length(AEENDTC)=7 then do;
    AENDTF="D";
    AEENDTC_=catx('-', AEENDTC, "01");
    AENDT=put(intnx('month', input(AEENDTC_, yymmdd10.), 0, 'E'), yymmdd10.);
  end;
  else if length(AEENDTC)=4 then do;
    AENDTF="M";
    AENDT=catx('-', AEENDTC, "12-31");
  end;
  else if length(AEENDTC)=10 then AENDT=AEENDTC;

  if input(AENDT, yymmdd10.) > input(LSTALVDT, yymmdd10.) then AENDT=LSTALVDT;
  drop AEENDTC_;
run;
```

From the output we can see that when the day of a date is missing, we set the imputation flag variable (e.g. `ASTDTF`) to "D"; when the month is missing as well, we set it to "M". The code also handles leap years and sets the imputed end date to the last alive date when it falls after that date. So all of the dates have been imputed correctly.

Now let's create the same dummy data and see how to implement the rules in R.

```r
library(tidyverse)
library(lubridate)

dummy <- tibble(
  USUBJID = "SITE01-001",
  LSTALVDT = "2023-01-10",
  AESTDTC = c("2019-06-18", "2020-01-02", "2022-03", "2022-06", "2023"),
  AEENDTC = c("2019-06-29", "2020-02", "2022-03", "2022-06", "2023")
)
```

The dummy data is shown below.

```
# A tibble: 5 × 4
  USUBJID    LSTALVDT   AESTDTC    AEENDTC   
  <chr>      <chr>      <chr>      <chr>     
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29
2 SITE01-001 2023-01-10 2020-01-02 2020-02   
3 SITE01-001 2023-01-10 2022-03    2022-03   
4 SITE01-001 2023-01-10 2022-06    2022-06   
5 SITE01-001 2023-01-10 2023       2023      
```

Then we follow the same rules used in SAS to impute the partial dates in R. To impute the last day of each month correctly, including in leap years, we use the `rollback()` and `ceiling_date()` functions from the `lubridate` package. The rest is everyday `tidyverse` data manipulation with functions like `case_when()` and `select()`.

```r
dummy_2 <- dummy %>%
  mutate(
    ASTDTF = case_when(
      str_length(AESTDTC) == 4 ~ "M",
      str_length(AESTDTC) == 7 ~ "D"
    ),
    ASTDT_ = case_when(
      str_length(AESTDTC) == 4 ~ str_c(AESTDTC, "01-01", sep = "-"),
      str_length(AESTDTC) == 7 ~ str_c(AESTDTC, "01", sep = "-"),
      is.na(ASTDTF) ~ AESTDTC
    ),
    ASTDT = ymd(ASTDT_),
    AENDTF = case_when(
      str_length(AEENDTC) == 4 ~ "M",
      str_length(AEENDTC) == 7 ~ "D"
    ),
    AENDT_ = case_when(
      str_length(AEENDTC) == 4 ~ str_c(AEENDTC, "12-31", sep = "-"),
      str_length(AEENDTC) == 7 ~ str_c(AEENDTC, "-15"),
      is.na(AENDTF) ~ AEENDTC
    ),
    AENDT = case_when(
      str_length(AEENDTC) == 7 ~ rollback(ceiling_date(ymd(AENDT_), "month")),
      TRUE ~ ymd(AENDT_)
    ),
    AENDT = if_else(AENDT > ymd(LSTALVDT), ymd(LSTALVDT), AENDT)
  ) %>%
  select(-ASTDT_, -AENDT_)
```

Here we can see that the output is consistent with SAS. It's very easy in R, right? You can also use many handy functions to convert between date representations, for example from `date9.` to `yymmdd10.` via `dmy("01Jan2023")`. The `lubridate` package provides a whole family of date-manipulation functions, such as `interval()` for calculating the duration of AEs.

```
# A tibble: 5 × 8
  USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDTF ASTDT      AENDTF AENDT     
  <chr>      <chr>      <chr>      <chr>      <chr>  <date>     <chr>  <date>    
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 NA     2019-06-18 NA     2019-06-29
2 SITE01-001 2023-01-10 2020-01-02 2020-02    NA     2020-01-02 D      2020-02-29
3 SITE01-001 2023-01-10 2022-03    2022-03    D      2022-03-01 D      2022-03-31
4 SITE01-001 2023-01-10 2022-06    2022-06    D      2022-06-01 D      2022-06-30
5 SITE01-001 2023-01-10 2023       2023       M      2023-01-01 M      2023-01-10
```
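As a side note on the AE duration just mentioned: once the start and end dates are imputed, the duration is plain date arithmetic. A quick Python sketch of the idea (assuming the usual ADaM convention of counting both the start day and the end day):

```python
from datetime import date

def ae_duration_days(start: date, end: date) -> int:
    """AE duration in days, counting both endpoints (AENDT - ASTDT + 1)."""
    return (end - start).days + 1

print(ae_duration_days(date(2019, 6, 18), date(2019, 6, 29)))  # 12
```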

The `admiral` Package

You might ask whether there is a package that handles date imputation for ADaM: a manipulation framework wrapped in a series of functions that covers the common imputation situations. Without a doubt, you can rely on the `admiral` package. Let me show some examples of how to use it for imputing partial dates.

```r
library(admiral)

dummy %>%
  derive_vars_dt(
    dtc = AESTDTC,
    new_vars_prefix = "AST",
    highest_imputation = "M",
    date_imputation = "first"
  ) %>%
  mutate(LSTALVDT = ymd(LSTALVDT)) %>%
  derive_vars_dt(
    dtc = AEENDTC,
    new_vars_prefix = "AEND",
    highest_imputation = "M",
    date_imputation = "last",
    max_dates = vars(LSTALVDT)
  )
```

Isn't the code quite straightforward? If your date vector is a datetime (DTM), you can use `derive_vars_dtm()` instead.

```
# A tibble: 5 × 8
  USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDT      ASTDTF AENDDT     AENDDTF
  <chr>      <date>     <chr>      <chr>      <date>     <chr>  <date>     <chr>  
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 2019-06-18 NA     2019-06-29 NA     
2 SITE01-001 2023-01-10 2020-01-02 2020-02    2020-01-02 NA     2020-02-29 D      
3 SITE01-001 2023-01-10 2022-03    2022-03    2022-03-01 D      2022-03-31 D      
4 SITE01-001 2023-01-10 2022-06    2022-06    2022-06-01 D      2022-06-30 D      
5 SITE01-001 2023-01-10 2023       2023       2023-01-01 M      2023-01-10 M      
```

I'm planning to learn more about the `admiral` package, for example by building the ADaM ADRS dataset. I believe this package greatly improves the R ecosystem for drug trials.

Common Dating in R: With an example of partial date imputation

Tips to Manipulate the Partial Dates

Date and Time Imputation

To insert a blank row after each group, we just need to take two steps:

- Split the data frame by group.
- Add a blank row.

The idea is extremely clear and similar to the SAS process. Here, let's see how to complete these two steps.

Firstly, I create test data like:

```r
library(tidyverse)

data <- iris %>%
  group_by(Species) %>%
  slice_head(n = 3) %>%
  select(Species, everything())

> data
# A tibble: 9 × 5
# Groups:   Species [3]
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5.1         3.5          1.4         0.2
2 setosa              4.9         3            1.4         0.2
3 setosa              4.7         3.2          1.3         0.2
4 versicolor          7           3.2          4.7         1.4
5 versicolor          6.4         3.2          4.5         1.5
6 versicolor          6.9         3.1          4.9         1.5
7 virginica           6.3         3.3          6           2.5
8 virginica           5.8         2.7          5.1         1.9
9 virginica           7.1         3            5.9         2.1
```

Now I'd like to insert rows between each `Species`, which means inserting a row between rows 3 and 4 and between rows 6 and 7. So we need the `group_split` function to split the data by the `Species` variable.

```r
data %>% group_split(Species)
```

And then we can see that the output is a list, so the next step is to convert this list into a data frame with blank rows. We can use the functional-programming package `purrr`, whose `map_dfr` function applies a function (here, `add_row`) to each element of the list and row-binds the results.

```r
data %>%
  group_split(Species) %>%
  map_dfr(~ add_row(.x, .after = Inf))
# A tibble: 12 × 5
   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa              5.1         3.5          1.4         0.2
 2 setosa              4.9         3            1.4         0.2
 3 setosa              4.7         3.2          1.3         0.2
 4 NA                 NA          NA           NA          NA  
 5 versicolor          7           3.2          4.7         1.4
 6 versicolor          6.4         3.2          4.5         1.5
 7 versicolor          6.9         3.1          4.9         1.5
 8 NA                 NA          NA           NA          NA  
 9 virginica           6.3         3.3          6           2.5
10 virginica           5.8         2.7          5.1         1.9
11 virginica           7.1         3            5.9         2.1
12 NA                 NA          NA           NA          NA  
```

The above output is what I expected. And I feel the R code is more concise and clear than the SAS version; do you think so?

In R, the simplest way to replace NA is with the `replace()` or `is.na()` functions.

```r
library(tidyverse)

data <- tibble(
  a = c(1, 2, NA, 3, 4),
  b = c(5, NA, 6, 7, 8),
  c = c(9, 10, 11, NA, 12)
)
```

For instance, if we want to replace NAs in all columns, the simple functions can be used like:

```r
data[is.na(data)] <- 0
replace(data, is.na(data), 0)
```

In a more realistic scenario, we will have both numeric and character columns at the same time, not only the numeric columns of the example above. The previous method is then less convenient, as we must select the numeric or character columns first and then replace NA with an appropriate value. A simpler way is to use `dplyr::mutate_if()` to select columns of a specific type and `replace_na()` to replace the NAs.

```r
data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)

data %>%
  mutate_if(is.numeric, ~ replace_na(., 0)) %>%
  mutate_if(is.character, ~ replace_na(., "xx"))
```

To be honest, I prefer these combined functions, as I have gotten used to the pipe `%>%` in R, so functions like `mutate_if()`, `mutate_all()`, and `mutate_at()` from the `tidyverse` packages are very convenient for me.

For instance, to replace NAs with 0 in columns selected by name or index:

```r
data %>% mutate_at(c(1, 2), ~ replace_na(., 0))
```

Besides, the `dplyr::coalesce()` function can also be used to replace NAs in a rather tricky way, although it is normally used to find the first non-missing element.

```r
data %>% mutate(num1 = coalesce(num1, 0))
```

R – Replace NA with Empty String in a DataFrame

R – Replace NA with 0 in Multiple Columns

Let's start with a basic boxplot as a demo.

```r
library(ggplot2)
library(tidyverse)

# Data
data(iris)

ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species)) +
  geom_boxplot()
```

Adding jittered points to a box plot in `ggplot` is useful for seeing the underlying distribution of the data. You can use the `geom_jitter` function with a few parameters, for example the `width` parameter to adjust the horizontal spread of the jittered points.

```r
ggplot(iris, aes(x = Species, y = Sepal.Length, colour = Species, shape = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)
```

Sometimes we might try to add jittered data points to a grouped boxplot, but we cannot use the `geom_jitter()` function directly, as it is just a handy shortcut for `geom_point(position = "jitter")`. Let's see what chart is generated below: it makes the grouped boxplot, but with overlapping jittered data points.

```r
ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = group, shape = group)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)
```

So how do we add correctly jittered data points to the grouped boxplot? We can use `position_jitterdodge()` as the `position` parameter inside the `geom_point` function.

```r
ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = group, shape = group)) +
  geom_boxplot() +
  geom_point(position = position_jitterdodge(jitter.width = 0.25))
```

Now we get a nice-looking grouped boxplot with clearly separated boxes and jittered data points within each box.

https://r-charts.com/distribution/box-plot-jitter-ggplot2/

https://datavizpyr.com/how-to-make-grouped-boxplot-with-jittered-data-points-in-ggplot2/

Suppose I want to plot a scatter plot with a regression line for the `sashelp.iris` dataset using GTL (Graph Template Language). So I define a GTL template first.

```sas
proc template;
  define statgraph ScatterRegPlot;
    begingraph / backgroundcolor=white border=false
                 datacontrastcolors=(orange purple blue)
                 datasymbols=(circlefilled trianglefilled diamondfilled);
      layout overlay;
        scatterplot x=SepalLength y=SepalWidth / group=Species name='points';
        regressionplot x=SepalLength y=SepalWidth / group=Species degree=3 name='reg';
        discretelegend 'points';
      endlayout;
    endgraph;
  end;
run;
```

Now let's see how to create RTF or PDF with this graph.

For PDF as below:

```sas
ods escapechar="^";
ods listing close;
options nonumber nodate;

ods pdf file="C:/Users/Desktop/example.pdf";
proc sgrender data=sashelp.iris template=ScatterRegPlot;
run;
ods pdf close;
ods listing;
```

For RTF, just change `ods pdf` above to `ods rtf`.

If we just want to save it as a PNG, do the following:

```sas
ods listing gpath='C:/Users/TJ0695/Desktop' image_dpi=300 style=Journal;
ods graphics / imagename="example" imagefmt=png width=20cm height=15cm;
proc sgrender data=sashelp.iris template=ScatterRegPlot;
run;
ods graphics off;
```

If we increase the DPI to 600, it causes an error like `ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.` So we should modify the SAS configuration file to fix this error.

- Run `proc options option=config; run;` to find the configuration file in use.
- Open that file, find the options starting with `-Xms` or `-Xmx`, and change both from `128m` to `1024m`.
- Reboot SAS and rerun the code.
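For reference, in the SAS configuration file these JVM memory options normally sit inside a `JREOPTIONS` block. A rough sketch of the edited section is shown below; the exact contents and surrounding options vary by installation, so treat this as an assumption rather than something to paste verbatim:

```
-JREOPTIONS (
    ...
    -Xms1024m
    -Xmx1024m
    ...
)
```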

After that, the error no longer appears, though the warning is still there.

I find it difficult to understand what LS actually means in its literal sense.

The definition from the `lsmeans` package (which has since transitioned to the `emmeans` package) is shown below:

Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.

In fact, even after reading this sentence, I was still very confused. What is the reference grid, and how is the prediction made?

So let's see how the LS means are calculated, along with the corresponding confidence intervals.

Firstly, import the CDISC pilot dataset, the same one used in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process `adsl` and `adlb` to create an analysis dataset `ana_dat` so that we can fit an ANCOVA with the `lm` function. Suppose we want to see whether `CHG` (change from baseline) is affected by the independent variable `TRTP` (treatment), controlling for the covariates `BASE` (baseline) and `AGE` (age).

Filter the dataset on the `BASE` variable, as one missing value is present in the dataset.

```r
library(tidyverse)
library(emmeans)

ana_dat2 <- filter(ana_dat, !is.na(BASE))
```

Then fit the ANCOVA model with the `lm` function.

```r
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838
```

We know that the LS means are calculated from a reference grid, which contains the means of the covariates and all levels of the factor variables.

```r
rg <- ref_grid(fit)
# 'emmGrid' object with variables:
#     BASE = 5.4427
#     AGE = 75.309
#     TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
```

As we can see from the output above, the means of `BASE` and `AGE` are `5.4427` and `75.309`, respectively. We can also calculate them manually:

```r
summary(ana_dat2[, c("BASE", "AGE")])
#       BASE             AGE       
#  Min.   : 3.497   Min.   :51.00  
#  1st Qu.: 4.774   1st Qu.:71.00  
#  Median : 5.273   Median :77.00  
#  Mean   : 5.443   Mean   :75.31  
#  3rd Qu.: 5.718   3rd Qu.:81.00  
#  Max.   :10.880   Max.   :88.00
```

Then we can use the `summary()` or `predict()` function to get the predicted values based on the reference grid `rg`.

```r
rg_pred <- summary(rg)
rg_pred
#  BASE  AGE TRTP                 prediction    SE df
#  5.44 75.3 Placebo                  0.0578 0.506 76
#  5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
#  5.44 75.3 Xanomeline High Dose     0.5031 0.235 76
```

The prediction column is the same as the one from `predict(rg)`. The table shows the predicted values for each factor level with the covariates held constant at their mean values.

In fact, we can also calculate the predicted values ourselves, since we have the coefficient estimates of the regression equation from `fit$coefficients`:

```r
> fit$coefficients
             (Intercept)                     BASE                      AGE 
             -1.11361290               0.11228582               0.00743963 
 TRTPXanomeline Low Dose TRTPXanomeline High Dose 
             -0.24108746               0.44531274
```

As `TRTP` includes multiple levels, it has been converted into dummy variables:

```r
contrasts(ana_dat2$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1
```

Now, if we want to calculate the predicted value for the `Xanomeline Low Dose` level, it goes as follows:

```r
> 0.11229*5.44 + 0.00744*75.3 - 0.24109*1 - 1.11361
[1] -0.1836104
```

Back to LS means: from the definition, they appear to be the averages of these predicted values.

```r
rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
#   TRTP                 LSmean
#   <fct>                 <dbl>
# 1 Placebo              0.0578
# 2 Xanomeline Low Dose -0.183 
# 3 Xanomeline High Dose 0.503
```

This is exactly the same result as `lsmeans(rg, "TRTP")` from the `emmeans` package; using `emmeans(fit, "TRTP")` also gives the same result.

```r
lsmeans(rg, "TRTP")
#  TRTP                 lsmean    SE df lower.CL upper.CL
#  Placebo              0.0578 0.506 76   -0.949    1.065
#  Xanomeline Low Dose -0.1833 0.211 76   -0.603    0.236
#  Xanomeline High Dose 0.5031 0.235 76    0.036    0.970
```

The degrees of freedom are `76`: with 81 observations, the model spends 1 DF on the intercept, 2 on `TRTP`, and 1 on each of the two covariates, so the residual DF is `81 - 1 - 2 - 1 - 1 = 76`.

Using `test()`, we can get the p-value for comparing each LS mean to zero.

```r
test(lsmeans(fit, "TRTP"))
#  TRTP                 lsmean    SE df t.ratio p.value
#  Placebo              0.0578 0.506 76   0.114  0.9093
#  Xanomeline Low Dose -0.1833 0.211 76  -0.870  0.3869
#  Xanomeline High Dose 0.5031 0.235 76   2.145  0.0351
```

In fact, the `t.ratio` is the t statistic, so we can calculate the p-values manually, like:

```r
2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)
```

Likewise, the confidence interval of an LS mean can be calculated manually from its `SE` and `DF`, for example for the Placebo level:

```r
> 0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
[1] -0.9499863  1.0655863
```

I think these steps go a long way toward understanding the meaning of least-squares means and the logic behind them. I hope this helps.

“emmeans” package

The estimation model of least-squares means (最小二乘均值的估计模型)

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Confidence intervals and tests in emmeans

Least-squares Means: The R Package lsmeans

As an example dataset, I'll use `cdiscpilot01` from CDISC, which contains the ADaM and SDTM datasets for a single study. Our purpose is to conduct an efficacy analysis by ANCOVA with LS mean estimation. Suppose we want to know whether or not the treatment has an impact on `Glucose` while accounting for the baseline glucose. The patients are limited to those who reached the `End of Treatment` visit and were not discontinued due to an AE.

ANCOVA makes several assumptions about the input data, such as:

- Linearity between the covariate and the outcome variable
- Homogeneity of regression slopes
- The outcome variable should be approximately normally distributed
- Homoscedasticity
- No significant outliers

Maybe we need an additional article to discuss how to check these assumptions, but not here. So we suppose that all of the assumptions have been met for the ANCOVA.

Install and load the following required packages, and then load the `adsl` and `adlbc` datasets from the `cdiscpilot01` study, which are described in another article: Example of SDTM and ADaM datasets from the CDISC.

```r
library(tidyverse)
library(emmeans)
library(gtsummary)
library(multcomp)

adsl <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adsl.xpt")
adlb <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adlbc.xpt")
```

Per our purpose, we need to filter the efficacy population and focus on the `Glucose (mg/dL)` lab test.

```r
gluc <- adlb %>%
  left_join(adsl %>% select(USUBJID, EFFFL), by = "USUBJID") %>%
  # PARAMCD is the parameter code; here we focus on Glucose (mg/dL)
  filter(EFFFL == "Y" & PARAMCD == "GLUC") %>%
  arrange(TRTPN) %>%
  mutate(TRTP = factor(TRTP, levels = unique(TRTP)))
```

Then produce the analysis dataset by filtering for the target patients, those who reached the end of treatment and were not discontinued due to an AE.

```r
ana_dat <- gluc %>%
  filter(AVISIT == "End of Treatment" & DSRAEFL == "Y") %>%
  arrange(SUBJID, AVISITN) %>%
  mutate(AVISIT = factor(AVISIT, levels = unique(AVISIT)))
```

Once we have the dataset for analysis, we should examine it first. The `tbl_summary` function in the `gtsummary` package calculates descriptive statistics and produces a very nice clinical-style table, as shown below:

```r
ana_dat %>%
  dplyr::select(AGEGR1, SEX, RACE, TRTP, AVAL, BASE, CHG) %>%
  tbl_summary(by = TRTP, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is from cdiscpilot01 study.*"))
```

Here we can see the descriptive summary for each variable by treatment group. Certainly we could also do some visualization, like a boxplot or scatterplot, but it is not presented here.

We use the `lm` function to fit the ANCOVA model with treatment (`TRTP`) as the independent variable, change from baseline (`CHG`) as the response variable, and baseline (`BASE`) as the covariate.

```r
fit <- lm(CHG ~ BASE + TRTP, data = ana_dat)
summary(fit)
```

The summary output for the regression coefficients is as follows. If you would like an ANOVA table, use `anova(fit)` instead of `summary(fit)`.

```
Call:
lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1744 -0.7627 -0.0680  0.5633  5.0349 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.5579     0.8809  -0.633    0.528
BASE                       0.1111     0.1329   0.837    0.405
TRTPXanomeline Low Dose   -0.2192     0.5433  -0.404    0.688
TRTPXanomeline High Dose   0.4447     0.5528   0.804    0.424

Residual standard error: 1.328 on 77 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.06702,	Adjusted R-squared:  0.03068 
F-statistic: 1.844 on 3 and 77 DF,  p-value: 0.1462
```

From the above results, we can easily read off the regression coefficients and the fitted model, along with their significance compared to zero. With the coefficients, we can predict any change from baseline given the baseline value and the treatment.

Besides, we can use the `contrasts` function to obtain the contrast matrices and so understand the dummy variables created for `TRTP` in this multiple regression model.

```r
> contrasts(ana_dat$TRTP)
                     Xanomeline Low Dose Xanomeline High Dose
Placebo                                0                    0
Xanomeline Low Dose                    1                    0
Xanomeline High Dose                   0                    1
```

From the ANOVA table shown below, it can be seen that the treatment has no statistically significant effect on the change in glucose after controlling for the effect of baseline.

```r
> anova(fit)
Analysis of Variance Table

Response: CHG
          Df  Sum Sq Mean Sq F value Pr(>F)
BASE       1   1.699  1.6989  0.9629 0.3295
TRTP       2   8.061  4.0304  2.2844 0.1087
Residuals 77 135.853  1.7643
```

If you would like to make the output prettier, `tbl_regression(fit)` can be used, as mentioned before.

If we want to obtain the least-squares (LS) means between treatment groups, the `emmeans` and `multcomp` packages provide the same results. In addition, the process of calculating the LS means is well worth learning and understanding.

```r
# by multcomp
postHocs <- glht(fit, linfct = mcp(TRTP = "Tukey"))
summary(postHocs)

# by emmeans
fit_within <- emmeans(fit, "TRTP")
pairs(fit_within, reverse = TRUE)
```

The summary output is shown below:

```
> summary(postHocs)

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Linear Hypotheses:
                                                Estimate Std. Error t value Pr(>|t|)  
Xanomeline Low Dose - Placebo == 0               -0.2192     0.5433  -0.404   0.9116  
Xanomeline High Dose - Placebo == 0               0.4447     0.5528   0.804   0.6937  
Xanomeline High Dose - Xanomeline Low Dose == 0   0.6639     0.3113   2.132   0.0855 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
```

Now it's clear that no significant difference was observed for any of the pairs `Low Dose vs. Placebo`, `High Dose vs. Placebo`, and `High Dose vs. Low Dose`; the last comparison is only marginal (adjusted p = 0.0855).

https://r4csr.org/efficacy-table.html#efficacy-table

ANCOVA in R

How to perform ANCOVA in R

An Introduction to ANCOVA (Analysis of Variance)

How to Conduct an ANCOVA in R

In addition to reading the corresponding procedure reference in the official documentation, I recommend using `ods trace on` to find the ODS table names, and then extracting any table you want. Here is a regression analysis of the `sashelp.cars` dataset; let's see how to get the stat tables.

```sas
ods trace on;  /* write ODS table names to log */
proc reg data=sashelp.cars plots=none;
  model Horsepower = EngineSize Weight;
quit;
ods trace off; /* stop writing to log */
```

The log output in SAS is as follows:

```
Output Added:
-------------
Name:       NObs
Label:      Number of Observations
Template:   Stat.Reg.NObs
Path:       Reg.MODEL1.Fit.Horsepower.NObs
-------------
Output Added:
-------------
Name:       ANOVA
Label:      Analysis of Variance
Template:   Stat.REG.ANOVA
Path:       Reg.MODEL1.Fit.Horsepower.ANOVA
-------------
Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics
-------------
Output Added:
-------------
Name:       ParameterEstimates
Label:      Parameter Estimates
Template:   Stat.REG.ParameterEstimates
Path:       Reg.MODEL1.Fit.Horsepower.ParameterEstimates
```

By looking at this output, you can find each stat table name, like `ParameterEstimates`. That means you can extract it by adding the `ods output ParameterEstimates=rst` statement to store the table in the `rst` dataset, as follows:

```sas
proc reg data=sashelp.cars plots=none;  /* same procedure call */
  model Horsepower = EngineSize Weight;
  ods output ParameterEstimates=rst;    /* the data set name is 'rst' */
quit;
```

Multiple stat tables can be stored with one `ods output` statement. For example, the statement below stores both the ParameterEstimates table and the ANOVA table at the same time.

```sas
proc reg data=sashelp.cars plots=none;
  model Horsepower = EngineSize Weight;
  ods output ParameterEstimates=parms ANOVA=anvar;
quit;
```

And then, if you want to create a macro variable that contains the value of a certain statistic, such as the slope for EngineSize:

```sas
data _null_;
  set rst;
  if variable="EngineSize" then call symputx("slope1", estimate);
run;
%put &=slope1;
```

Several procedures provide an alternative option for creating an output dataset similar to the `ods output` statement mentioned above, for instance the `outest` option in the `proc reg` procedure.

```sas
proc reg data=sashelp.cars noprint outest=rst2 rsquare;  /* statistics in 'rst2' */
  model Horsepower = EngineSize Weight;
quit;
```

So you'd better check the SAS documentation to see whether the procedure you are using provides such an option.

All of the above refers to the following articles:

ODS OUTPUT: Store any statistic created by any SAS procedure

Find the ODS table names produced by any SAS procedure

A SAS macro to combine portrait and landscape rtf files into one single file

In order to make it suitable for every condition below, I will additionally update it to be more flexible:

- containing multiple tables, figures, and listings at the same time
- using the titles as the index for the table of contents
- ordering the files manually (just a proposed solution, not implemented yet)

First of all, let's look at the structure of an RTF file, as described in that article.

It is divided into three parts: an opening section, a content section, and a closing section. If we look at a single RTF file of our own, the structure is the same. Consequently, the RTF combining process can be summarized as follows:

- Read all filenames into SAS (sorted by filename or ordered manually).
- Keep the opening section of the first RTF.
- Remove both the opening and closing sections of every RTF except the first and the last, and add the `\pard\sect` code in front of `\sectd` so that all of the files can be combined correctly.
- Keep the closing section of the last RTF.
- Save the updated RTF code of each file into its own SAS dataset (not a single dataset, as the character length is limited in SAS).

Now let's see the code used in this process. First, I import the RTF filenames from the external folder.

```sas
data refList(keep=filepath fn);
  length fref $8 fn $80 filepath $400;
  rc = filename(fref, "&inpath");
  if rc = 0 then dirid = dopen(fref);
  if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
  nfiles = dnum(dirid);
  do i = 1 to nfiles;
    fn = dread(dirid, i);
    fid = mopen(dirid, fn);
    if fid > 0 and index(fn, "rtf") then do;
      filepath = "&inpath\" || left(trim(fn));
      fn = strip(tranwrd(fn, ".rtf", ""));
      output;
    end;
  end;
  rc = dclose(dirid);
run;
```

Secondly, read each line of an RTF file until finding the line that starts with `\sectd`: everything above it is the opening section, and everything below is the content section. Also remove the final `}` from every RTF file except the last one.

```sas
data rtfdt&i(where=(ptline=1));
  retain ptline;
  set rtfdt&i end=last;
  if substr(line,1,6)='\sectd' then do;
    ptline = 1;
    /* enable combining portrait and landscape rtf */
    line = "\pard\sect" || compress(tranwrd(line, "\pgnrestart\pgnstarts1", ""));
  end;
  if last and line^='}' then line=substr(strip(line), 1, length(strip(line))-1);
  else if last and line='}' then delete;
run;
```

Thirdly, when the title code is found in the RTF, replace `\pard` with `\pard\outlinelevel1` so that the title can be used as an index entry in the table of contents.

```sas
%if &titleindex = 1 %then %do;
  data rtfdt&i.;
    set rtfdt&i.;
    retain fl 0;
    if index(line, '\pard\plain\') and (not index(line, '\header\pard'))
       and (not index(line, '\footer\pard')) then fl = 1 + fl;
  run;

  data rtfdt&i;
    set rtfdt&i;
    by fl notsorted;
    if fl=1 and first.fl then
      /* add index for the contents as per titles */
      line = tranwrd(line, '\pard', '\pard\outlinelevel1');
  run;
%end;
```

At last, don't save the RTF contents above in one single SAS dataset, because the character length is limited in SAS. And append the `}` as the closing section to keep the combined RTF file complete.

The complete code is shown below:

```sas
/* Example */
/* %s_combrtf(inpath=&inpath, outpath=&outpath, outfile=&outfile); */
/* Parameter Description                                  */
/*   inpath     input path                                */
/*   outpath    output path                               */
/*   outfile    output file name                          */
/*   titleindex whether to add title index, default is 1  */
%macro s_combrtf(inpath=, outpath=, outfile=, titleindex=1);
  data refList(keep=filepath fn);
    length fref $8 fn $80 filepath $400;
    rc = filename(fref, "&inpath");
    if rc = 0 then dirid = dopen(fref);
    if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
    nfiles = dnum(dirid);
    do i = 1 to nfiles;
      fn = dread(dirid, i);
      fid = mopen(dirid, fn);
      if fid > 0 and index(fn, "rtf") then do;
        filepath = "&inpath\" || left(trim(fn));
        fn = strip(tranwrd(fn, ".rtf", ""));
        output;
      end;
    end;
    rc = dclose(dirid);
  run;

  /* sort by filename by default */
  proc sort data=refList sortseq=linguistic(numeric_collation=on) out=sorted_refList;
    by fn;
  quit;

  data fileorder;
    set sorted_refList;
    FileLevel = 2;
    order = .;
  run;

  data _null_;
    set fileorder end=last;
    fnref = strip("filename fnref") || strip(_N_) || right(' "') ||
            strip(filepath) || strip('" lrecl=5000 ;');
    call execute(fnref);
    if last then call symputx('maxn', vvalue(_n_), 'l');
  run;

  %do i=1 %to &maxn.;
    data rtfdt&i.;
      infile fnref&i. truncover;
      informat line $5000.;
      format line $5000.;
      length line $5000.;
      input line $1-5000;
      line = strip(line);
    run;

    /* add title index and adapt to more flexible use */
    %if &titleindex = 1 %then %do;
      data rtfdt&i.;
        set rtfdt&i.;
        retain fl 0;
        if index(line, '\pard\plain\') and (not index(line, '\header\pard'))
           and (not index(line, '\footer\pard')) then fl = 1 + fl;
      run;

      data rtfdt&i;
        set rtfdt&i;
        by fl notsorted;
        if fl=1 and first.fl then
          /* add index for the contents as per titles */
          line = tranwrd(line, '\pard', '\pard\outlinelevel1');
      run;
    %end;

    %if &i.=1 %then %do;
      data final;
        set rtfdt&i(keep=line) end=last;
        if last and line^='}' then line=substr(strip(line), 1, length(strip(line))-1);
        else if last and line='}' then delete;
      run;
    %end;

    %if &i.^=1 %then %do;
      data rtfdt&i(where=(ptline=1));
        retain ptline;
        set rtfdt&i end=last;
        if substr(line,1,6)='\sectd' then do;
          ptline = 1;
          /* enable combining portrait and landscape rtf */
          line = "\pard\sect" || compress(tranwrd(line, "\pgnrestart\pgnstarts1", ""));
        end;
        if last and line^='}' then line=substr(strip(line), 1, length(strip(line))-1);
        else if last and line='}' then delete;
      run;
    %end;

    %if &i.=&maxn. %then %do;
      %local _cnt;
      data final;
        set final
          %do _cnt=2 %to &maxn;
            rtfdt&_cnt(keep=line)
          %end;
        ;
      run;

      data final;
        set final end=last;
        if last then line = strip(line) || strip("}");
      run;
    %end;
  %end;

  data _null_;
    file "&outpath\&outfile..rtf" lrecl=5000 nopad;
    set final;
    put line;
  run;
%mend;
```

This approach, in my opinion, is quite excellent, as it resolves the following issues:

I know that different companies put titles and footnotes in different places: some place them in the header and footer sections, and some place them in the body of the RTF document. The above macro will work no matter where you place the titles and footnotes.

A SAS macro to combine portrait and landscape rtf files into one single file

Combine multiple RTF files to one file

SM05: An Efficient Way to Combine RTF Files and Create Multi-Level Bookmarks and a Hyperlinked TOC

utl-sas-macro-to-combine-rtf-files-into-one-single-file

http://onbiostatistics.blogspot.com/2009/01/data-dredging-vs-data-mining-post-hoc.html

- Ad hoc refers to additional statistical analysis requests that target the final report and arise after the final statistical analysis has been completed.
- Post hoc refers to additional statistical analysis requests arising from regulatory review comments after submission.

Both situations are handled in the same way. For example, if additional variables are needed, the ADS dataset specification document must be updated accordingly. The related documents can be attached as appendices to the SAP or the validation plan/report, in which case the corresponding version must be shown in the file name and title. They can also be kept as standalone documents.

A post-hoc analysis means that, after data collection is complete, additional groupings are defined and research hypotheses are proposed based on the characteristics of the data themselves, and statistical analyses are then performed.

Post hoc analysis is often called data dredging or data fishing, since the motivation is usually to obtain positive results. Therefore, post-hoc results are generally not accepted by drug regulatory authorities as evidence of drug efficacy.

Points to note during a clinical trial: - All major changes during the trial must be documented; - Ad-hoc analyses are discouraged (an ad-hoc analysis does not state its hypothesis before the analysis, which violates rigorous statistical principles, so it can only serve as an exploratory conclusion); - Statistical judgments must be based on an objective description and presentation of the clinical trial results.

A post-hoc analysis involves looking at the data after a study has been concluded, and trying to find patterns that were not primary objectives of the study. In other words, all analyses that were not pre-planned and were conducted as 'additional' analyses after completing the experiment are considered to be post-hoc analyses. A post-hoc study is conducted using data that has already been collected. https://www.editage.com/insights/zh-hans/node/7139

While both post-hoc and ad-hoc analyses may be performed based on the data or results we have seen, an ad-hoc analysis typically occurs alongside the project, while a post-hoc analysis occurs strictly after the project, after the unblinding of the study, or after the pre-specified analysis results have been reviewed. In this sense, ad-hoc analysis is preferable to post-hoc analysis.

Looking back, these very basic usages turn out to be the code I use most often in daily work.

set-keep: extract specific variables

`/*set-keep: select variables*/ data keep; set sashelp.class(keep=name sex); /*take the class dataset from the sashelp library; keep is similar to class[,c("name","sex")] in R: keep selects variables, while drop removes them*/ run;`

set-rename: rename variables

`/*set-rename: rename variables*/ data keep; set sashelp.class(keep=name sex rename=(name=name_new sex=sex_new)); run;`

set-where: conditional selection

`/*set-where: filter by condition*/ data keep; set sashelp.class(keep=name sex where=(sex='M')); run;`

set-in: temporary variables

`/*set-in: temporary flag variables*/ /*Perhaps the biggest difference between SAS and R is that SAS does not hold content directly in memory but in datasets, so to operate on where each row comes from we first assign the information to temporary variables*/ data keep; set one(in=a) two(in=b); /*a flags rows from dataset one and b flags rows from dataset two while the two datasets are concatenated*/ in_one=a; in_two=b; /*copy the temporary flags a and b to the new variables in_one and in_two*/ if a=1 then flag=1;else flag=0; /*create a new variable flag that marks rows satisfying the condition*/ run;`

set-nobs: count observations

`/*set-nobs: total number of observations, like nrow in R*/ data nobs11(keep=total); set sashelp.class nobs=total_obs; /*nobs passes the row count of the class dataset to the temporary variable total_obs, which is then copied to the real variable total and output*/ total=total_obs; output; stop; run;`

The nobs=total_obs and total=total_obs steps do the counting: nobs puts the row count into the temporary variable total_obs, and its value then has to be copied into the real variable total to appear in the output.

Also note that output plus stop writes a single row; the stop statement halts the step after one iteration, otherwise it would generate a 19-by-1 column filled with the number 19.

Concatenating datasets (stacking rows); obs=10 here means taking the first 10 rows

`/*set: concatenate datasets*/ data concatenat; set sashelp.class sashelp.class(obs=10); /*stacks the two datasets; sashelp.class(obs=10) is a slice of the first 10 rows*/ run;`

`/*merge: join datasets side by side*/proc sort data=chapt3.merge_a;by x;run;proc sort data=chapt3.merge_c;by x;run; data d;merge chapt3.merge_a chapt3.merge_c;by x;run;`

SAS requires the datasets to be sorted by the key variable before they can be merged.

- Sort: proc sort data=library.dataset; by variable; run;
- Merge: merge dataset1 dataset2; by x;

Note that merging requires a by, and the by statement is written as a separate statement.

- where-between/and

Combining set and where, as shown earlier, works quite well. Some further possibilities:

`where x between 10 and 20;/* X[10,20] */where x not between 10 and 20;where x between y*0.1 and y*0.5;where x between 'a' and 'c';`

where-between/and can serve as one form of slicing; the dataset option (obs=10) is another.

`where x in(1,2);/*select observations where the variable equals one of the listed values*/`

This selects observations whose variable takes certain values.

where with missing values

`/*where: select missing values*/ where x is missing; where x is null; /*locate missing values in a numeric variable, like is.na() in R*/`

This is somewhat like the is.na() function in R.

`where x;/*select observations where the numeric variable x is neither 0 nor missing*/where x and y; where x ne '';/*for a character variable, select the non-blank observations*/`

There are also some special shorthand forms: for example, where x directly selects observations that are neither 0 nor missing, which is quite convenient, and x ne '' means x is not equal to blank.

where with character values

`where x like 'D_an'; where x like '%ab%' or x like '%z%'; /*character matching: an underscore stands for exactly one character, while % stands for any number of characters*/`

As in SQL, like is used to match character values. Note that 'D_an' matches D and an with exactly one character in between, while 'D%an' allows any number of characters between D and an.
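For intuition, the same wildcard semantics can be sketched in a few lines of Python by translating a LIKE pattern into a regular expression (a hypothetical helper for illustration only, not part of SAS or SQL):

```python
import re

def like(pattern: str, s: str) -> bool:
    """Rough SQL LIKE matcher: '_' matches exactly one character, '%' any run."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # percent: zero or more characters
        elif ch == "_":
            parts.append(".")       # underscore: exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), s) is not None

print(like("D_an", "Dean"))    # True:  exactly one character between D and an
print(like("D%an", "Declan"))  # True:  any number of characters allowed
print(like("D_an", "Declan"))  # False: 'ecl' is more than one character
```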

`/*append: base= data= force options*//*base is the base dataset, data is the dataset to append to it, and force forces the append even when the variables differ, which is rarely needed*/proc append base=null data=sashelp.class(where=(sex='M'));run;`

proc append is invoked as proc append base=base-dataset data=dataset-to-append.

I suggest there are two methods to implement this requirement.

- to do it within
`proc report`

- to do it before
`proc report`

Let's say that we have an example `sashelp.class`

, and wish to divide it into two groups so that we can present them separately.

For the first method, we need to add a `gr`

variable for grouping, such as female and male.

`data final; set sashelp.class; if sex="M" then gr=1; else gr=2;run;proc sort data = final; by gr age;run;`

And then, in the `proc report`

procedure, add one line code `compute after gr;`

as shown below. You can see it works.

`proc report data=final; column gr name sex age height weight; define gr / group order noprint; compute after gr; line @1 ""; endcomp;run;`

To use the second method, we simply add one blank row to the `final`

dataset using `call missing(of _all_)`

.

`data final2; set final; by gr; output; if last.gr then do; call missing(of _all_); output; end; drop gr;run;`

Just a trick; I hope it helps anyone who's learning SAS.

If you want to review the full datasets for the pilot project, I would recommend cloning the cdisc-org/sdtm-adam-pilot-project git repository, as follows:

`git clone https://github.com/cdisc-org/sdtm-adam-pilot-project.git`

Otherwise, if necessary, you can also get the CDISC pilot projects from the `phuse-scripts`

repository.

`git clone https://github.com/phuse-org/phuse-scripts.git`

Suppose you just want to get the SDTM and ADaM datasets and run some tests in R; then I would import those datasets from R packages, like the `admiral`

package, and install and try it right here.

`library(admiral)data(admiral_adsl)adsl <- admiral_adslhead(adsl[1:5,1:5])# A tibble: 5 × 5 STUDYID USUBJID SUBJID RFSTDTC RFENDTC <chr> <chr> <chr> <chr> <chr> 1 CDISCPILOT01 01-701-1015 1015 2014-01-02 2014-07-022 CDISCPILOT01 01-701-1023 1023 2012-08-05 2012-09-023 CDISCPILOT01 01-701-1028 1028 2013-07-19 2014-01-144 CDISCPILOT01 01-701-1033 1033 2014-03-18 2014-04-145 CDISCPILOT01 01-701-1034 1034 2014-07-01 2014-12-30`

You can also find the function list directly right here, to see which ADaM datasets are available for use, such as `adae`

, `adeg`

, `advs`

and so on.

If you want to import SDTM datasets, you should use the function like `data(admiral_ae)`

from `admiral.test`

package.

`library(admiral.test)data(admiral_ae)ae <- admiral_aehead(ae[1:5,1:5])# A tibble: 5 × 5 STUDYID DOMAIN USUBJID AESEQ AESPID <chr> <chr> <chr> <dbl> <chr> 1 CDISCPILOT01 AE 01-701-1015 1 E07 2 CDISCPILOT01 AE 01-701-1015 2 E08 3 CDISCPILOT01 AE 01-701-1015 3 E06 4 CDISCPILOT01 AE 01-701-1023 3 E10 5 CDISCPILOT01 AE 01-701-1023 1 E08`

I find it quite practical, don't you? In addition to the `admiral`

package, I also find that the `r2rtf`

and `clinUtils`

R packages contain example datasets for the CDISC pilot projects. But neither of them is quite complete, so they are just an alternative option.

Overall, from my perspective, these example datasets are a great resource for developing R packages or Shiny dashboards for pharmaceutical use.

https://github.com/phuse-org/phuse-scripts

https://pharmaverse.github.io/admiral/index.html

https://github.com/pharmaverse/pharmaverse

https://github.com/atorus-research/CDISC_pilot_replication

https://cran.r-project.org/web/packages/admiral.test/admiral.test.pdf

https://rdrr.io/cran/clinUtils/f/vignettes/clinUtils-vignette.Rmd

First of all, let me share a good resource about SAS macros related to our topic. It lists a series of useful macros, and the code is very standard and worth learning; highly recommended.

For R, I'm willing to recommend the `fs`

R package, which provides a cross-platform, uniform interface for file system operations.

And then let's begin with our topics.

Suppose you want to identify the list of files in a particular directory; in R you can easily choose `list.files()`

. For example, list the files in a specific directory such as the current directory.

`list.files(path = "./")`

You can also get the files within the subfolders, and just match the `.txt`

files; simply use

`list.files(path = "./", pattern = "\\.txt$", recursive = TRUE)`

In SAS, to the best of my knowledge, there are two approaches. The first approach is to use a `filename`

statement with a `pipe`

device type and `dir`

command in the Windows environment.

`filename dirlist pipe 'dir /b E:\Tp\*.txt';data list; length line $200; infile dirlist; input; line = strip(_infile_);run;filename dirlist clear;proc print data=list; run;`

The second approach is to use the functions `dopen`

and `dread`

with the help from `dnum`

, as the following example.

`filename root "E:\Tp";data list; * -- return variables --; length name $ 512; * -- directory to inventory --; dirid = dopen('root'); if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.'; nfiles = dnum(dirid); do i = 1 to dnum(dirid); * -- directory item name --; name = dread(dirid, i); output; end; rc = dclose(dirid); dirid = 0;run;`

More details on the second approach can be found in these links.

Suppose you want to check whether a file called `README.md`

exists in the current directory; then you can choose the `file.exists()`

function in R, that returns `TRUE`

if the file exists, and `FALSE`

otherwise.

`file.exists("./README.md")`

In SAS, `fileexist`

function verifies the existence of a file; it returns 1 if the file exists and 0 otherwise.

`%let fpath=E:\Tp\test.sas;%macro fileexists(filepath); %if %sysfunc(fileexist(&filepath)) %then %put NOTE: The external file &filepath exists.; %else %put ERROR: The external file &filepath does not exist.;%mend fileexists;%fileexists(&fpath);`

Besides if you want to check if a dataset exists, you can choose the `exist`

function. For checking whether a variable exists, `varnum`

is recommended.

If you want to create a blank file in R then

`file.create("./text.txt")`

In SAS, I'm not sure if the below is the normal way, but it’s definitely simple anyway.

`data _null_; file "E:\Tp\test.txt";run;`

If you want to delete a specific file in R, then

`file.remove("./test.txt")`

In SAS, I think we can simply use `fdelete`

function.

`filename defile 'E:\Tp\test.txt';data _null_; rc=fdelete('defile');run;`

Creating a directory is very similar to a file. The function in R is `dir.create()`

that is very convenient to use. In SAS it can be accomplished using the `dlcreatedir`

option and `libname`

statement with 2 lines of code.

`options dlcreatedir;libname folder 'E:\Tp\dummy';`

If you want to create or copy multiple folders or directories, more detailed information can be found in Using SAS® to Create Directories and Duplicate Files.

You can use the macro shown below to list all the RTF files, i.e. those whose extension is `.rtf`

.

`/*Example*//*%ListFilesSpecifyExtension(dir=C:\Users\demo,ext=rtf,out=rst);*//*Parameter Description*//*dir input directory*//*ext file name extension*//*out output dataset*/%macro ListFilesSpecifyExtension(dir=, ext=, out=); %local filrf rc did name i; %let rc=%sysfunc(filename(filrf,&dir)); %let did=%sysfunc(dopen(&filrf)); /* Use the %IF statement to make sure the directory can be opened. If not, end the macro. */ %if &did eq 0 %then %do; %put Directory &dir cannot be open or does not exist; %return; %end; data &out; length FileName $200; %do i = 1 %to %sysfunc(dnum(&did)); %let name=%qsysfunc(dread(&did,&i)); %if %qupcase(%qscan(&name,-1,.)) = %upcase(&ext) %then %do; Filename = "&name"; output; %end; %end; run; %let rc=%sysfunc(dclose(&did)); %let rc=%sysfunc(filename(filrf));%mend ListFilesSpecifyExtension;`

Directory Listings in SAS

Obtaining A List of Files In A Directory Using SAS® Functions

http://sasunit.sourceforge.net/v15/doc/files.html

Check if a Specified Object Exists

Using SAS® to Create Directories and Duplicate Files

The Bland-Altman method generally refers to the Bland-Altman plot, which is used to display the relationship between two paired quantitative measurement tests or assays (Bland & Altman, 1986 and 1999). For example, a new product might be compared with the registered product or a previous-generation product; alternatively, the comparator can be a reference or gold standard method. In CLSI EP09 it is sometimes called a difference plot rather than a Bland-Altman plot.

As we can see from the above, the Bland-Altman plot is a scatter plot that clearly shows the relationship between the differences and the magnitude of the measurements. The X axis represents the mean of the two measurements, and the Y axis represents their difference.

Sometimes the difference is modeled as a constant difference, but it can also be a proportional difference, depending on the distribution of the measurements. If the difference is not related to the magnitude, we have a constant difference between the two assays throughout the whole X axis range; a proportional difference, by contrast, is related and proportional to the magnitude. Such plots can be visually inspected to determine the underlying variability characteristic of this relationship.

The assumption of the Bland-Altman method is that the differences are normally distributed. Of course, we can never be sure the measurements follow a normal distribution exactly, but in many cases a departure from normality in the differences has little impact on the Bland-Altman analysis.

From my side, though, I propose that the ranges of the two assays should not be too different; make sure they are of a similar magnitude.

There are some definitions we should know for Bland-Altman analysis.

- **Bias** refers to the mean of the differences between the two measurements. It is drawn as the middle line in the Bland-Altman plot and is useful for detecting a systematic difference.
- **95% CI of Bias** refers to the 95% confidence interval of the mean difference, which illustrates the magnitude of the systematic difference.
- **Limits of Agreement (LoA)** refers to the 95% prediction interval of the differences, drawn as the upper and lower lines in the Bland-Altman plot. This indicator is very important in clinical trials: we usually compare the LoA with the clinical acceptance criteria to demonstrate that the bias of the new product is acceptable in clinical practice.
- **95% CI of LoA** refers to the 95% confidence interval of the LoA, which shows the error or precision of the upper and lower limits of agreement.

To explain the calculation process more clearly, here I use the R to implement it.

Suppose you have two sets of measurements from different assays. The mean of the differences is `5`

, and the corresponding standard deviation is `0.8`

.

`d <- 5sd <- 0.8`

We can easily get the LoA by the following formula, as it is the prediction interval of the differences.

`LoA <- c(d - 1.96 * sd, d + 1.96 * sd)`

And the CI for `d`

and `LoA`

would be a bit more complicated, as shown below from the NCSS Bland-Altman Plot and Analysis documentation.
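Restated in standard notation (my own reconstruction from the description, written to be consistent with the R code in this post), for mean difference $\bar d$, standard deviation $s$, and sample size $n$:

```latex
\mathrm{SE}(\bar d) = \frac{s}{\sqrt{n}}, \qquad
\mathrm{CI}_{\bar d} = \bar d \pm t_{1-\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}}

\mathrm{LoA} = \bar d \pm 1.96\,s, \qquad
\mathrm{SE}(\mathrm{LoA}) = s\,\sqrt{\frac{1}{n} + \frac{1.96^{2}}{2(n-1)}}, \qquad
\mathrm{CI}_{\mathrm{LoA}} = \mathrm{LoA} \pm t_{1-\alpha/2,\,n-1}\cdot\mathrm{SE}(\mathrm{LoA})
```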

From the above formula, the standard error of `LoA`

CI is about 1.71 times that of `d`

. This `1.71`

often appears in Bland-Altman related articles; at least now we know how to calculate this number. Just drop the `n`

from both sides of the equation.

`sqrt(1 + 1.96^2 / 2)`

For the CIs, we can also easily calculate them according to the above formula; for example, the 95% two-sided confidence interval.

But here we must decide whether to use the t distribution or the normal distribution, which determines whether we use the t statistic or the z statistic. Suppose I use the t distribution here and define the sample size `n`

as 200, so the degrees of freedom are equal to `n-1`

.

`n <- 200t <- qt(1 - 0.05 / 2, 200-1)d_se <- sd / sqrt(n)d_CI <- c(d - t * d_se, d + t * d_se)> d_CI[1] 4.888449 5.111551`

Then the corresponding CI of `LoA`

`LoA_se <- sd * sqrt(1 / n + 1.96^2 / (2 * (n - 1)))LoA_CI <- LoA + c(- t * LoA_se, t * LoA_se)> LoA_CI[1] 3.241041 6.758959`
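As a quick cross-check outside R, the same numbers can be reproduced with the Python standard library (a sketch; since the standard library has no t distribution, the quantile qt(0.975, 199) ≈ 1.972 is hardcoded here as an assumption):

```python
import math

d, sd, n = 5.0, 0.8, 200
t = 1.972  # approx. 97.5th percentile of the t distribution with 199 df (hardcoded)

loa = (d - 1.96 * sd, d + 1.96 * sd)                      # limits of agreement
d_se = sd / math.sqrt(n)                                  # SE of the mean difference
loa_se = sd * math.sqrt(1 / n + 1.96**2 / (2 * (n - 1)))  # SE of each limit

d_ci = (d - t * d_se, d + t * d_se)
loa_ci = (loa[0] - t * loa_se, loa[1] + t * loa_se)

# the "1.71" factor: SE(LoA) / SE(mean difference), approximating n - 1 by n
factor = math.sqrt(1 + 1.96**2 / 2)
```

The results agree with the R output above to three decimal places, with the small remaining difference coming from the rounded t quantile.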

What's the best result for a Bland-Altman analysis in clinical trials?

Basically, in the Bland-Altman plot we hope that the spread of the scatter points is consistent across the range of the X axis and that only a few points fall outside the `LoA`

. Moreover, the `LoA`

or `LoA`

CI meets the clinical requirements.

All of the above is my rough understanding; the main purpose is to note the calculation of the CI for the LoA. The references are shown below:

Implementing Bland-Altman plots (agreement assessment) in Python

Bland-Altman Plot and Analysis

Bland-Altman plot

Application of Bland-Altman analysis in evaluating the agreement of clinical measurement methods

**Please indicate the source**: http://www.bioinfo-scrounger.com

Suppose you have a new product for which you want to conduct a method comparison with an existing product. The primary endpoint is that the AUC of your product is not worse than that of the compared product. There is no doubt that this is a non-inferiority trial design. So suppose the non-inferiority margin is `-0.15`

, and the one-sided hypotheses are as below:

- Margin: -0.15
- H0: AUC1 - AUC2 ≤ -0.1500
- H1: AUC1 - AUC2 > -0.1500

And then which method could we use to assess the statistical significance? Commonly, two methods are used to estimate the AUC. One is the empirical (nonparametric) method of DeLong et al. (1988), which does not depend on the strong normality assumption that the binormal method makes. The other is the binormal method presented by Metz (1978) and McClish (1989). The bootstrap technique is also suggested for both methods, as in the `rocNIT`

R package.

In this post, I'm mainly going to record how to use the nonparametric method to compare paired AUCs, referring to the Liu (2006) article listed in the references at the end.

The method is implemented in R code to demonstrate each step. I don't reproduce the formulas here; please read that article, which is very clear and easy to understand.

For paired AUCs, the two products are measured on the same subjects. So I create simulated data with 80 subjects and the corresponding two measurements, one tested by the new product and one by the compared product.

Given the following simulation data, which can be downloaded from Paired_Criteria_adjusted.txt

The nonparametric estimation of the ROC curve area is based on the Mann-Whitney U statistic.
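As a minimal sketch of that estimator (toy scores, not the article's data), the AUC is the scaled sum of pairwise comparisons between the group with the condition present and the group with it absent:

```python
def u_stat(x: float, y: float) -> float:
    """Pairwise Mann-Whitney kernel: 1 if x > y, 0.5 on ties, 0 otherwise."""
    return 1.0 if x > y else 0.5 if x == y else 0.0

def mw_auc(v1, v0):
    """Nonparametric AUC: average kernel over all (present, absent) score pairs."""
    return sum(u_stat(x, y) for x in v1 for y in v0) / (len(v1) * len(v0))

v1 = [0.9, 0.8, 0.7]  # scores in subjects with the condition present
v0 = [0.6, 0.8, 0.4]  # scores in subjects with the condition absent
auc = mw_auc(v1, v0)  # 7.5 / 9: seven wins and one tie out of nine pairs
```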

And then we need to calculate the estimated variance of `auc1 - auc2`

Once we have the estimated variance, the difference of the two paired AUCs, and the margin, we can obtain the Z statistic, and thereby the p value and the one-sided 95% lower limit as well.
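That last step can be sketched on its own, independent of the AUC machinery (plain Python, with illustrative numbers rather than the simulated dataset):

```python
import math
from statistics import NormalDist

def ni_test(auc_diff: float, margin: float, variance: float, alpha: float = 0.05):
    """One-sided non-inferiority test for a difference of paired AUCs."""
    se = math.sqrt(variance)
    z = (auc_diff - margin) / se                      # Z statistic against the margin
    p = 1 - NormalDist().cdf(z)                       # one-sided p value
    lower = auc_diff - NormalDist().inv_cdf(1 - alpha) * se  # one-sided lower limit
    return z, p, lower

# illustrative numbers: difference 0, margin -0.15, estimated variance 0.01
z, p, lower = ni_test(0.0, -0.15, 0.01)
```

Non-inferiority is concluded when p < alpha, or equivalently when the one-sided lower limit exceeds the margin.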

The complete code is as follows:

`# Helper functionsh_get_u_statistic <- function(x, y) { if (x > y) { return(1) } if (x == y) { return(0.5) } if (x < y) { return(0) }}h_auc_v10_v01 <- function(n1, n0, v1, v0) { v10 <- NULL v01 <- NULL # Mann-Whitney U statistic auc <- sum(sapply(v1, function(x) { sapply(v0, function(y) {h_get_u_statistic(x, y)}) })) / (n1 * n0) for (i in 1:n1) { v10 <- c(v10, sum( sapply(v0, function(y) {h_get_u_statistic(v1[i], y)}) ) / n0) } for (i in 1:n0) { v01 <- c(v01, sum( sapply(v1, function(x) {h_get_u_statistic(x, v0[i])}) ) / n1) } return(list(auc = auc, v10 = v10, v01 = v01))}# To get the auc and corresponding intermediate parameters `v10` and `v01`get_auc <- function(response, var) { dat <- cbind(response, var) n0 <- sum(response == 0, na.rm = TRUE) n1 <- sum(response == 1, na.rm = TRUE) var0 <- var[response == 0] var1 <- var[response == 1] c( list(n1 = n1, n0 = n0, var1 = var1, var0 = var0), h_auc_v10_v01(n1 = n1, n0 = n0, v1 = var1, v0 = var0) )}# The main program.auc.test <- function(mroc1, mroc2, margin, alpha = 0.05) { mod1 <- mroc1 mod2 <- mroc2 n1 <- mod1$n1 n0 <- mod1$n0 auc1 <- mod1$auc auc2 <- mod2$auc s10_11 <- sum((mod1$v10 - auc1)^2) / (n1 - 1) s10_22 <- sum((mod2$v10 - auc2)^2) / (n1 - 1) s10_12 <- sum((mod1$v10 - auc1) * (mod2$v10 - auc2)) / (n1 - 1) s01_11 <- sum((mod1$v01 - auc1)^2) / (n0 - 1) s01_22 <- sum((mod2$v01 - auc2)^2) / (n0 - 1) s01_12 <- sum((mod1$v01 - auc1) * (mod2$v01 - auc2)) / (n0 - 1) variance <- (s10_11 + s10_22 - 2 * s10_12) / n1 + (s01_11 + s01_22 - 2 * s01_12) / n0 auc_diff <- auc1 - auc2 z <- (auc_diff - margin) / sqrt(variance) p <- 1 - pnorm(z) lower_limit <- auc_diff - (qnorm(1 - alpha) * sqrt(variance)) list(Difference = auc_diff, `Non-Inferiority Pvalue` = p, `One-Sided 95% Lower Limit` = lower_limit)}`

Now let's run the main program. Firstly, import the simulation data.

`library(dplyr)data <- data.table::fread("./Paired_Criteria_adjusted.txt") %>% mutate(Response = if_else(Condition == "Present", 1, 0))`

And then obtain the results of the two ROC models.

`mroc1 <- get_auc(response = data$Response, var = data$Method1)mroc2 <- get_auc(response = data$Response, var = data$Method2)`

In the end, calculate the final output, especially the p value and the one-sided lower limit.

`> auc.test(mroc1, mroc2, margin = -0.15, alpha = 0.05)$Difference[1] -0.04771115$`Non-Inferiority Pvalue`[1] 0.04447758$`One-Sided 95% Lower Limit`[1] -0.1466274`

From this outcome, we can conclude that we reject the H0 hypothesis: not only is the p value less than 0.05, but the one-sided lower limit is also greater than -0.15. So we can conclude that Method1 (the new product) is not worse than Method2 (the registered product) with a margin of -0.15.

The above is my note on a non-inferiority test in the diagnostics area. Actually, I prefer to use an R package or NCSS software to perform the test, as hand-rolled code is sometimes incorrect. So the code above is helpful for understanding the principle, rather than intended for direct use.

Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves

Comparing Two ROC Curves – Paired Design - NCSS


In R, you can use the `dplyr`

package to `left_join`

or `inner_join`

and other functions `*_join()`

to handle your data in any form. As for transposing, we generally call it **pivoting data**, and use the corresponding `pivot_wider()`

and `pivot_longer()`

functions from the `tidyr`

package to reshape your data from long to wide, or the opposite. For how to pivot data in R, please refer to my post: Pivoting data in R

As for SAS, let's see it with step-by-step examples.

SAS defines merge processing in three ways.

- one-to-one reading
- concatenating
- match-merging

The first two ways are the usage of `set`

, which are not discussed in this post. The third way, I suppose, is the most common in our data processing.

First, we create two example datasets as shown below.

`proc sort data = sashelp.class; by Name;run;data cls1; set sashelp.class(keep = Name Sex obs = 10);run;data cls2; set sashelp.class(keep = Name Age firstobs = 5 obs = 15);run;`

Then we define the most important argument `by`

to specify which variable to join by. Before merging, we must make sure the dataset is sorted by that `by`

variable.

**Why?**

A simple answer is that SAS match-merge is based on the classic sequential match algorithm, which rests on the premise that all input streams are sorted identically.
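To see why sorting is the premise, here is the sequential match idea in a few lines of Python (an illustration of the algorithm only, shown as an inner join on a toy example; SAS's actual match-merge also handles unmatched rows and one-to-many cases):

```python
def match_merge(a, b):
    """Inner-join two lists of (key, value) pairs, both pre-sorted by key.

    Each list is scanned exactly once, which only works because both
    input streams are ordered identically -- the same premise SAS relies on.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1           # key only in a: advance a
        elif a[i][0] > b[j][0]:
            j += 1           # key only in b: advance b
        else:                # keys match: emit a combined row, advance both
            out.append((a[i][0], a[i][1], b[j][1]))
            i += 1
            j += 1
    return out

cls1 = [("Alfred", "M"), ("Alice", "F"), ("Barbara", "F")]
cls2 = [("Alice", 13), ("Barbara", 13), ("Carol", 14)]
merged = match_merge(cls1, cls2)  # [('Alice', 'F', 13), ('Barbara', 'F', 13)]
```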

`data clss; merge cls1 cls2; by Name;run;`

Actually, in my opinion, the above code is not commonly used; it is just similar to `full_join`

all x rows, followed by unmatched y rows. Maybe I would like to use merge processing such as `left_join`

, `inner_join`

. In this case we need to specify the `IN=`

dataset option on the datasets in the `merge`

statement.

For instance, to post-process like `left_join`

.

`data clss2; merge cls1(in = x) cls2(in = y); by Name; if x;run;`

And to post-process like `inner_join`

.

`data clss2; merge cls1(in = x) cls2(in = y); by Name; if x and y;run;`

It can be seen above that we can use `IN=`

to control which rows are kept.

However, considering that we have to sort the datasets first, I sometimes prefer to use `proc sql`

to merge data, since it is close to the form I use in R.

`proc sql; create table clss3 as select x.*, y.* from cls1 as x left join cls2 as y on x.Name=y.Name;quit;`

In my opinion, transpose processing is best learned and understood with a few examples; simply memorizing the arguments is easy to confuse. I suggest running each piece of code, looking at the output, and thinking about how it is achieved.

So let's see examples showing how to transpose a dataset from long to wide, i.e. rows to columns.

First, create an example data from `sashelp.shoes`

dataset.

`data shoes; set sashelp.shoes; if Subsidiary in ("Johannesburg" "Nairobi"); keep Region Product Subsidiary Sales Inventory;run;`

By default, the transpose procedure only transposes the numeric columns from long to wide and ignores any character column. But in normal work we generally define a series of options like `prefix`

, `name`

and `label`

. Besides with the `var`

statement we can select which column or columns to transpose. And with the `id`

statement you can use the values of a column as the new variable names.

`proc transpose data = shoes(where = (Subsidiary = "Johannesburg")) out = res; var Sales; id Product;run;`

If you want to group the data by a variable, then add the `by`

statement.

`proc transpose data = shoes out = res; var Sales; id Product; by Subsidiary;run;`

If you want to re-define the columns `_NAME_`

and `_LABEL_`

, then add the `name`

and `label`

options.

`proc transpose data = shoes out = res name = var_name label = label_name; var Sales; id Product; by Subsidiary;run;`

Then let's see an example showing how to transpose data from wide back to long.

`proc transpose data = res(drop = var_name label_name) out = res2; var Boot Sandal Slipper; by Subsidiary;run;`

SAS MERGING TUTORIAL

MATCH MERGING DATA FILES IN SAS | SAS LEARNING MODULES

SAS: an introduction to data merging

Complete Guide to PROC TRANSPOSE in SAS

HOW TO RESHAPE DATA WIDE TO LONG USING PROC TRANSPOSE | SAS LEARNING MODULES


For the `proc sql`

method, `inobs`

or `outobs`

can both be used to select N rows from a dataset, but it is worth noting that they behave differently when you join tables: `inobs` limits the rows read from the input tables, while `outobs` limits the rows written to the output.

`proc sql inobs = 5 /*outobs=5*/; create table cls as select * from sashelp.class;quit;`

For a data step instead of `proc sql`

, the most straightforward method is to utilize the `obs`

that is very similar to the `sql`

method. If you would like to select a range of rows, just add another option, `firstobs`

.

`data raw_o; set sashelp.class(firstobs = 5 obs = 10);run;`

Utilizing the `_N_`

variable with the `IF-ELSE`

statement for this purpose is, I suppose, sometimes more flexible.

`data raw_1; set sashelp.class; if 5<= _N_ <=10 then output;run;`

So how about selecting the last rows? We first have to get the total number of rows into a macro variable, and then use the `_N_`

variable once.

`proc sql noprint; select count(*) into :n_rows trimmed from sashelp.class;quit;data raw_2; set sashelp.class; if &n_rows.-4<=_N_<=&n_rows. then output;run;`

By the way, to select N observations randomly, we can use the `proc surveyselect`

procedure and define `method = srs`

as the simple random sampling method, so that we get 5 random rows from this dataset.

`proc surveyselect data = sashelp.class out = rd_class method = srs sampsize = 5 seed = 123456;run;`

Besides selecting by row number, we can also add a new row number within each group, as shown in the following example:

`proc sort data = sashelp.class out = sorted_class; by age;run;data sorted_class_2; set sorted_class; by age; if first.age then new_row_number=0; new_row_number+1;run;`

First off, we want to create macro variables to store information. If we just want to store a single value:

`proc sql; select count(name) into: n_name trimmed from sashelp.class;quit;%put &n_name;`

Storing multiple values is also very similar.

`proc sql; select count(name),mean(height) format=10.2 into: n_name trimmed, :mean_height trimmed from sashelp.class;quit;%put &n_name &mean_height;`

Or we may simply want a list of values assigned to a list of macro variables.

`proc sql; select distinct(name) into: n1-:n19 from sashelp.class;quit;`

But the above example requires you to know the total number of distinct values in the dataset in advance. So the more common way is to store the column values in a single macro variable, separated by any delimiter you want.

`proc sql; select distinct(name) into: nameList separated by ' ' from sashelp.class;/* %let numNames = &sqlobs;*/quit;%put &nameList.;`

Suppose I just want the second element of this macro variable; what should I do? The `%scan`

function is enough to reach our purpose.

`%put %scan(&nameList,2);`

Obviously, it seems not as convenient to extract the element as `nameList[2]`

in R, but it is enough to use in SAS.

Another way is to loop over the elements of the macro variable, as below, assigning each element to a new column.

`%let cntName = %sysfunc(countw(&nameList));data raw; array names[&cntName] $200 name1-name&cntName; do i=1 to &cntName; names[i]=scan("&nameList", i); end; drop i;run;`

Otherwise, there are still many great documents showing how to store and manipulate lists in SAS, like Choosing the Best Way to Store and Manipulate Lists in SAS

How to use the SAS SCAN Function?

Creating Lists! Using the Powerful INTO Clause with PROC SQL to Store Information in Macro Lists

How to Select the First N Rows in SAS

Using SAS® Macro Variable Lists to Create Dynamic Data-Driven Programs

Storing and Using a List of Values in a Macro Variable


In R, I prefer to use `unique()`

or `dplyr::distinct`

toolkit to remove duplicates, and `is.na()`

, `na.omit()`

functions or external packages like `mice`

to handle missing values.

We can use the `proc sort`

to remove rows that have duplicate values across all columns of the dataset.

`proc sort data = sashelp.cars(keep = make type origin) out = without_dups nodupkey; by _all_;run;`

In some special conditions, we would like to select only unique/distinct rows from a dataset according to a specific column, keeping the first row of values for that column.

`proc sort data = sashelp.cars out = make_without_dups nodupkey; by Make;run;`

In clinical trial data, missing values are a common occurrence when no data is stored for a variable in an observation. A missing value appears as a single period (`.`) in numeric variables and as a blank in character variables.

As we know, the reasons for missing values can be summarized as below:

- Missing completely at random (MCAR)
- Missing at random (MAR), not completely random
- Not missing at random (NMAR)

So how to handle the missing values?

Suppose we did a reaction time study with six subjects, in which each subject's reaction time was measured three times. The data are shown below.

`data times; input id trial1 trial2 trial3; cards;1 1.5 1.4 1.6 2 1.5 . 1.9 3 . 2.0 1.6 4 . . 2.2 5 2.1 2.3 2.26 1.8 2.0 1.9;run;`

As you can see below, we can use some useful functions to count the number of missing observations, like `nmiss`

for numeric and `cmiss`

for character. Or `missing`

to indicate whether the argument contains a missing value. And then keep only the rows that have no missing values.

`data raw_0; set times (where = (nmiss(trial1,trial2,trial3) = 0));run;`

Or just check a specific variable, like the `trial1`

column.

`data raw_1; set times; missing_flag = missing(trial1);run;`

First off, let's try to replace all missing values with zero in every numeric column in a simple way: create an implicit array `NumVar`

to hold all numeric variables in the dataset and then loop over it. If you just want to replace one column, so then add that variable name instead of `_numeric_`

.

```sas
data raw_3;
  set times;
  array NumVar _numeric_;
  do over NumVar;
    if NumVar = . then NumVar = 0;
  end;
run;
```
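The equivalent zero-replacement in R is essentially a one-liner; a sketch, assuming the same `times` data re-created as a data frame:

```r
times <- data.frame(id = 1:6,
                    trial1 = c(1.5, 1.5, NA, NA, 2.1, 1.8),
                    trial2 = c(1.4, NA, 2.0, NA, 2.3, 2.0),
                    trial3 = c(1.6, 1.9, 1.6, 2.2, 2.2, 1.9))

# replace every NA in the whole data frame with 0
times_zero <- times
times_zero[is.na(times_zero)] <- 0

# or replace NA in one column only
times$trial1[is.na(times$trial1)] <- 0
```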

If the task is more complicated, such as replacing missing values with the mean rather than zero, how would we address it? I think `proc stdize` with the `reponly` option is a good solution.

```sas
/* replace missing values with zero */
/*
proc stdize data = times out = stdize_vars reponly missing = 0;
run;
*/

/* replace missing values with the variable mean */
proc stdize data = times out = stdize_vars reponly method = mean;
  var trial1 trial2; /* or _numeric_, or omit to use all numeric variables */
run;
```
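The mean-imputation counterpart in R can be sketched with a small helper function (again assuming the `times` data frame from above):

```r
times <- data.frame(id = 1:6,
                    trial1 = c(1.5, 1.5, NA, NA, 2.1, 1.8),
                    trial2 = c(1.4, NA, 2.0, NA, 2.3, 2.0),
                    trial3 = c(1.6, 1.9, 1.6, 2.2, 2.2, 1.9))

# replace NA in a vector with the mean of its non-missing values
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# apply to the selected columns only
times[c("trial1", "trial2")] <- lapply(times[c("trial1", "trial2")], impute_mean)
```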

Imputing missing values is a complicated data-manipulation process that works well only if you select the right method for each variable. I will not go deeper into doing this in SAS for now, since I prefer to use R for imputation.

Here I just list a few useful SAS procedures so that I can read and recall them later if needed:

- `proc hpimpute`
- `proc mi`, `proc reg`, `proc mianalyze`
- `proc surveyimpute`

Hope the notes above are helpful for you.

https://www.statology.org/sas-remove-duplicates/

https://www.statology.org/sas-replace-missing-values-with-zero/

http://www.philasug.org/Presentations/201910/Handling_Missing_Data_in_SAS.pdf

https://sasnrd.com/sas-replace-missing-values-zero/

**Please indicate the source**: http://www.bioinfo-scrounger.com

The following examples show how to resolve the questions below (simple, but quite common):

- How to count distinct values
- How to count variables by group
- How to produce the frequency table of variables
- How to calculate the statistics for variables

In R, `Hmisc::describe` is one available option, though far from the only one; other external packages or `base` functions like `summary()` can also be used very well.

Here we use the `proc sql` procedure with the SAS dataset BirthWgt to count the non-missing values of the `Race` variable.

```sas
proc sql;
  select count(Race) as cnt_race
    from sashelp.BirthWgt;
quit;
```

But simply counting the total number of non-missing `Race` values does not tell us much. Suppose we instead want to count the `Married` variable grouped by `Race`:

```sas
proc sql;
  select Race, count(Married) as cnt_married
    from sashelp.BirthWgt
    group by Race;
quit;
```

If you want to count distinct values, add the `distinct` keyword inside the `count` function.

```sas
proc sql;
  select count(distinct Married) as distinct_married
    from sashelp.BirthWgt;
quit;
```
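For comparison, the three counts above map naturally onto `dplyr` verbs in R; a sketch with a toy data frame standing in for BirthWgt (the values are made up):

```r
library(dplyr)

bw <- data.frame(Race    = c("Asian", "Asian", "Black", "White"),
                 Married = c("Yes", "No", "Yes", "Yes"))

# count non-missing values of a variable
sum(!is.na(bw$Race))

# count by group
bw %>% group_by(Race) %>% summarise(cnt_married = sum(!is.na(Married)))

# count distinct values
n_distinct(bw$Married)
```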

We can use `proc freq` to create frequency tables for one or more variables. For example, we can tabulate the `SomeCollege` variable (keeping missing values as a category), by `Race`, writing the output to a dataset `result` that includes cumulative frequencies and percentages.

```sas
/* sashelp is read-only, so sort into a work dataset first */
proc sort data = sashelp.BirthWgt out = birthwgt_sorted;
  by Race;
run;

proc freq data = birthwgt_sorted;
  tables SomeCollege / out = result missing outcum;
  by Race;
run;
```

By the way, if you add a statistics option such as `chisq` to the `tables` statement, the output also includes the Chi-Square test statistics.
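A rough R analogue of such a frequency table can be built from `table()` and `prop.table()`; a sketch with made-up data:

```r
bw <- data.frame(Race        = c("Asian", "Asian", "Black", "White"),
                 SomeCollege = c("Yes", NA, "No", "Yes"))

# frequency counts, keeping missing values as their own category
tab <- table(bw$SomeCollege, useNA = "ifany")
tab
prop.table(tab)            # proportions
cumsum(prop.table(tab))    # cumulative proportions
```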

Alternatively, we can use `proc tabulate` to quickly create a table displaying multiple statistics.

```sas
proc tabulate data = sashelp.cars;
  var weight;
  table weight * (N Min Q1 Median Mean Q3 Max);
run;
```

But I think `proc means` is more convenient when you want to save the output to a dataset:

```sas
proc means data = sashelp.cars n nmiss mean std median p25 p75 min max;
  var weight;
  output out = weight_tbl
    n = n nmiss = nmiss mean = mean std = std median = median
    p25 = p25 p75 = p75 min = min max = max;
run;
```
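The same descriptive statistics in R can come from `summary()` or an explicit `dplyr::summarise()` call; a sketch on a small made-up weight vector:

```r
library(dplyr)

weight <- c(3500, 4000, 2800, NA, 3100)

# quick overview
summary(weight)

# explicit statistics, mirroring the proc means output
tibble(weight) %>%
  summarise(n      = sum(!is.na(weight)),
            nmiss  = sum(is.na(weight)),
            mean   = mean(weight, na.rm = TRUE),
            std    = sd(weight, na.rm = TRUE),
            median = median(weight, na.rm = TRUE),
            p25    = quantile(weight, 0.25, na.rm = TRUE),
            p75    = quantile(weight, 0.75, na.rm = TRUE))
```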

https://www.statology.org/sas-count-distinct/

https://www.statology.org/sas-count-by-group/

https://www.statology.org/sas-frequency-table/

https://www.codeleading.com/article/53981053526/

https://www.statology.org/proc-tabulate-sas/
