KeepNotes blog - Stay hungry, Stay Foolish. (Kai, Hexo) http://www.bioinfo-scrounger.com/

Partial Date Imputation
http://www.bioinfo-scrounger.com/archives/date_imputation/ (published 2023-01-18)

Partial dates are very common in clinical trials; AE records, for example, may have some parts of the date or time missing. However, when you create the ADaM dataset for AEs, variables like ASTDT (Analysis Start Date) and AENDT (Analysis End Date) are numeric, so they can be derived only when the date is complete, and only then can you calculate durations.

In actual trials, we would draw up a plan defining the rules for imputing partial dates. Here, I simplify the imputation rules as shown below to illustrate their implementation in R and SAS:

• If the day of the analysis start date is missing, impute it to the first day of the month. If both the day and month are missing, impute to 01-Jan.
• If the day of the analysis end date is missing, impute it to the last day of the month. If both the day and month are missing, impute to 31-Dec.
• If the imputed analysis end date is after the last known alive date, set it to the last alive date.
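Before turning to the SAS and R implementations, the three rules can be sketched as a pair of small base-R helpers. This is only an illustrative sketch; the function names `impute_start_dt` and `impute_end_dt` are mine, not from any package:

```r
# Partial dates arrive as "YYYY" (4 chars), "YYYY-MM" (7 chars),
# or complete "YYYY-MM-DD" (10 chars).
impute_start_dt <- function(dtc) {
  full <- ifelse(nchar(dtc) == 4, paste0(dtc, "-01-01"),
          ifelse(nchar(dtc) == 7, paste0(dtc, "-01"), dtc))
  as.Date(full)
}

impute_end_dt <- function(dtc, last_alive = NULL) {
  full <- ifelse(nchar(dtc) == 4, paste0(dtc, "-12-31"),
          ifelse(nchar(dtc) == 7, paste0(dtc, "-01"), dtc))
  d <- as.Date(full)
  mm <- nchar(dtc) == 7
  if (any(mm)) {
    # last day of month = first day of the next month minus one (leap-year safe)
    d[mm] <- do.call(c, lapply(d[mm], function(x)
      seq(x, by = "month", length.out = 2)[2] - 1))
  }
  # rule 3: cap the imputed end date at the last known alive date
  if (!is.null(last_alive)) d <- pmin(d, as.Date(last_alive))
  d
}

impute_start_dt("2020-02")                        # 2020-02-01
impute_end_dt("2020-02")                          # 2020-02-29 (leap year)
impute_end_dt("2023", last_alive = "2023-01-10")  # 2023-01-10
```

The SAS and R versions below implement the same logic against a dummy dataset.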

#### Manipulating in SAS

Firstly, let’s create dummy data in SAS that includes four variables.

```sas
data dummy;
    length USUBJID $20. LSTALVDT $20. AESTDTC $20. AEENDTC $20.;
    input USUBJID $ LSTALVDT $ AESTDTC $ AEENDTC $;
    datalines;
SITE01-001 2023-01-10 2019-06-18 2019-06-29
SITE01-001 2023-01-10 2020-01-02 2020-02
SITE01-001 2023-01-10 2022-03 2022-03
SITE01-001 2023-01-10 2022-06 2022-06
SITE01-001 2023-01-10 2023 2023
;
run;
```
• `USUBJID`, unique subject identifier.
• `LSTALVDT`, last known alive date.
• `AESTDTC`, start date of adverse event.
• `AEENDTC`, end date of adverse event.

We can see from the above rules that concatenating "01" onto a date that is missing the day is very easy. However, if we want to derive `AENDT`, we need to know the correct last day of each month, for example the 28th or 29th, the 30th or the 31st. So we apply the `intnx` function to get the last day correctly.

```sas
data dummy_2;
    set dummy;

    if length(AESTDTC)=7 then do;
        ASTDTF="D";
        ASTDT=catx('-', AESTDTC, "01");
    end;
    else if length(AESTDTC)=4 then do;
        ASTDTF="M";
        ASTDT=catx('-', AESTDTC, "01-01");
    end;
    else if length(AESTDTC)=10 then ASTDT=AESTDTC;

    if length(AEENDTC)=7 then do;
        AENDTF="D";
        AEENDTC_=catx('-', AEENDTC, "01");
        AENDT=put(intnx('month', input(AEENDTC_, yymmdd10.), 0, 'E'), yymmdd10.);
    end;
    else if length(AEENDTC)=4 then do;
        AENDTF="M";
        AENDT=catx('-', AEENDTC, "12-31");
    end;
    else if length(AEENDTC)=10 then AENDT=AEENDTC;

    if input(AENDT, yymmdd10.) > input(LSTALVDT, yymmdd10.) then AENDT=LSTALVDT;
    drop AEENDTC_;
run;
```

From the output we can see that when the day of the date is missing, we set the imputation flag variable (e.g. `ASTDTF`) to 'D'. If the month is also missing, we set it to 'M'. The code also accounts for leap years and sets the date to the last alive date when the imputed date falls later. So I believe all the dates have been imputed correctly.

#### Manipulating in R

Then let's create the same dummy data in R and see how to implement the rules there.

```r
library(tidyverse)
library(lubridate)

dummy <- tibble(
  USUBJID = "SITE01-001",
  LSTALVDT = "2023-01-10",
  AESTDTC = c("2019-06-18", "2020-01-02", "2022-03", "2022-06", "2023"),
  AEENDTC = c("2019-06-29", "2020-02", "2022-03", "2022-06", "2023")
)
```

The dummy data is shown below.

```
# A tibble: 5 × 4
  USUBJID    LSTALVDT   AESTDTC    AEENDTC
  <chr>      <chr>      <chr>      <chr>
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29
2 SITE01-001 2023-01-10 2020-01-02 2020-02
3 SITE01-001 2023-01-10 2022-03    2022-03
4 SITE01-001 2023-01-10 2022-06    2022-06
5 SITE01-001 2023-01-10 2023       2023
```

Then we follow the same rules as in SAS to impute the partial dates in R. To impute the last day of each month, we'd better use the `rollback()` and `ceiling_date()` functions from the `lubridate` package, which get the correct day even in leap years. The rest is common `tidyverse` data manipulation with functions like `case_when()` and `select()`.

```r
dummy_2 <- dummy %>%
  mutate(
    ASTDTF = case_when(
      str_length(AESTDTC) == 4 ~ "M",
      str_length(AESTDTC) == 7 ~ "D"
    ),
    ASTDT_ = case_when(
      str_length(AESTDTC) == 4 ~ str_c(AESTDTC, "01-01", sep = "-"),
      str_length(AESTDTC) == 7 ~ str_c(AESTDTC, "01", sep = "-"),
      is.na(ASTDTF) ~ AESTDTC
    ),
    ASTDT = ymd(ASTDT_),
    AENDTF = case_when(
      str_length(AEENDTC) == 4 ~ "M",
      str_length(AEENDTC) == 7 ~ "D"
    ),
    AENDT_ = case_when(
      str_length(AEENDTC) == 4 ~ str_c(AEENDTC, "12-31", sep = "-"),
      str_length(AEENDTC) == 7 ~ str_c(AEENDTC, "-15"),
      is.na(AENDTF) ~ AEENDTC
    ),
    AENDT = case_when(
      str_length(AEENDTC) == 7 ~ rollback(ceiling_date(ymd(AENDT_), "month")),
      TRUE ~ ymd(AENDT_)
    ),
    AENDT = if_else(AENDT > ymd(LSTALVDT), ymd(LSTALVDT), AENDT)
  ) %>%
  select(-ASTDT_, -AENDT_)
```

Here we can see that the output is consistent with the SAS result. It's very easy in R, right? You can also use many handy functions to convert between date formats, for example from `date9.` to `yymmdd10.` style with `dmy("01Jan2023")`. Honestly, the `lubridate` package provides a whole series of functions for date manipulation, such as `interval()` for calculating AE durations.

```
# A tibble: 5 × 8
  USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDTF ASTDT      AENDTF AENDT
  <chr>      <chr>      <chr>      <chr>      <chr>  <date>     <chr>  <date>
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 NA     2019-06-18 NA     2019-06-29
2 SITE01-001 2023-01-10 2020-01-02 2020-02    NA     2020-01-02 D      2020-02-29
3 SITE01-001 2023-01-10 2022-03    2022-03    D      2022-03-01 D      2022-03-31
4 SITE01-001 2023-01-10 2022-06    2022-06    D      2022-06-01 D      2022-06-30
5 SITE01-001 2023-01-10 2023       2023       M      2023-01-01 M      2023-01-10
```
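On the duration point: once `ASTDT` and `AENDT` are proper `Date` columns, an AE duration in days is plain date arithmetic (`lubridate::interval()` offers the same thing). A small base-R sketch, using the imputed dates from row 3 of the output as an example:

```r
astdt <- as.Date("2022-03-01")  # imputed analysis start date (row 3 above)
aendt <- as.Date("2022-03-31")  # imputed analysis end date (row 3 above)

# duration in days, counting both endpoints (a common ADaM convention)
adurn <- as.numeric(aendt - astdt) + 1
adurn  # 31
```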

#### Using `admiral` Package

Maybe you would ask whether there is a package that handles date imputation for ADaM: a manipulation framework wrapped in a series of functions that sorts out the common imputation situations. There's no doubt you can rely on the `admiral` package. Let me show some examples here to demonstrate how to use it for imputing partial dates.

```r
library(admiral)

dummy %>%
  derive_vars_dt(
    dtc = AESTDTC,
    new_vars_prefix = "AST",
    highest_imputation = "M",
    date_imputation = "first"
  ) %>%
  mutate(LSTALVDT = ymd(LSTALVDT)) %>%
  derive_vars_dt(
    dtc = AEENDTC,
    new_vars_prefix = "AEND",
    highest_imputation = "M",
    date_imputation = "last",
    max_dates = vars(LSTALVDT)
  )
```

Isn't the code quite straightforward? If your date variable is a datetime (DTM), you can use `derive_vars_dtm()` instead.

```
# A tibble: 5 × 8
  USUBJID    LSTALVDT   AESTDTC    AEENDTC    ASTDT      ASTDTF AENDDT     AENDDTF
  <chr>      <date>     <chr>      <chr>      <date>     <chr>  <date>     <chr>
1 SITE01-001 2023-01-10 2019-06-18 2019-06-29 2019-06-18 NA     2019-06-29 NA
2 SITE01-001 2023-01-10 2020-01-02 2020-02    2020-01-02 NA     2020-02-29 D
3 SITE01-001 2023-01-10 2022-03    2022-03    2022-03-01 D      2022-03-31 D
4 SITE01-001 2023-01-10 2022-06    2022-06    2022-06-01 D      2022-06-30 D
5 SITE01-001 2023-01-10 2023       2023       2023-01-01 M      2023-01-10 M
```
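For reference, a datetime version might look like the sketch below. This is my guess at typical usage, based on `derive_vars_dtm()` having a similar interface plus a `time_imputation` argument for the missing time part, so check the admiral documentation before relying on it:

```r
# Sketch only: assumes AESTDTC held datetime strings such as "2022-03-01T10:20"
dummy %>%
  derive_vars_dtm(
    dtc = AESTDTC,
    new_vars_prefix = "AST",
    highest_imputation = "M",
    date_imputation = "first",
    time_imputation = "first"
  )
```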

I'm planning to learn the `admiral` package further, for example by building the ADaM ADRS dataset. I suppose this package greatly improves the R ecosystem for drug trials.

R - Add a blank row after each group
http://www.bioinfo-scrounger.com/archives/add_blank_row_in_r/ (published 2023-01-13)

I'm an R lover and believe that anything SAS can do, R can do better, since R is such a powerful language for statistical analysis in clinical trials. I once posted an article about how to insert blank rows in SAS, so I looked up how to do the same in R.

To reach this goal, we just need two steps:

• Split the data frame by group.
• Append a blank row to each group and bind the groups back into one data frame.

The idea is extremely clear and similar to the SAS process. Here, let's see how to complete these two steps.

Firstly, I create test data like:

```r
library(tidyverse)

data <- iris %>%
  group_by(Species) %>%
  slice_head(n = 3) %>%
  select(Species, everything())

> data
# A tibble: 9 × 5
# Groups:   Species
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa              5.1         3.5          1.4         0.2
2 setosa              4.9         3            1.4         0.2
3 setosa              4.7         3.2          1.3         0.2
4 versicolor          7           3.2          4.7         1.4
5 versicolor          6.4         3.2          4.5         1.5
6 versicolor          6.9         3.1          4.9         1.5
7 virginica           6.3         3.3          6           2.5
8 virginica           5.8         2.7          5.1         1.9
9 virginica           7.1         3            5.9         2.1
```

Now I'd like to insert a blank row between each `Species`, which means inserting a row between rows 3 and 4 and between rows 6 and 7. So we need the `group_split()` function to split the data by the `Species` variable.

```r
data %>% group_split(Species)
```

We can see that the output class is a list, so the next step is to convert this list back into a data frame with blank rows added. We can use the functional programming package `purrr`, whose `map_dfr()` function applies a function (here `add_row()`) to each element of the list and row-binds the results.

```r
data %>%
  group_split(Species) %>%
  map_dfr(~add_row(.x, .after = Inf))

# A tibble: 12 × 5
   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa              5.1         3.5          1.4         0.2
 2 setosa              4.9         3            1.4         0.2
 3 setosa              4.7         3.2          1.3         0.2
 4 NA                 NA          NA           NA          NA
 5 versicolor          7           3.2          4.7         1.4
 6 versicolor          6.4         3.2          4.5         1.5
 7 versicolor          6.9         3.1          4.9         1.5
 8 NA                 NA          NA           NA          NA
 9 virginica           6.3         3.3          6           2.5
10 virginica           5.8         2.7          5.1         1.9
11 virginica           7.1         3            5.9         2.1
12 NA                 NA          NA           NA          NA
```

The above output is what I expected. And I feel the R code is more concise and clear than the SAS version; do you think so?

#### Reference

Insert a Blank Row After Each Group of Data

R - Replace NA with Zero and Empty String in Multiple Columns
http://www.bioinfo-scrounger.com/archives/replace_strings/ (published 2022-12-30)

This casual note records how to use R to replace NA with 0 or any string. NAs can be generated for different reasons, like unclean data, data transformation, or missing values. Either way, we have to convert NA to zero or other strings in order to present them in tables or listings.

In R, the simplest way to replace NA is with the `replace()` or `is.na()` functions.

```r
library(tidyverse)

data <- tibble(
  a = c(1, 2, NA, 3, 4),
  b = c(5, NA, 6, 7, 8),
  c = c(9, 10, 11, NA, 12)
)
```

For instance, if we want to replace NAs in all columns, either of these works:

```r
data[is.na(data)] <- 0
replace(data, is.na(data), 0)
```

In a more realistic scenario, we have both numeric and character columns at the same time, not only the numeric columns of the example above. The prior method is then less convenient, as we must select the numeric or character columns first and then replace NA with an appropriate value. After searching on Google, I think the simpler way is to use `dplyr::mutate_if()` to check and select columns of a specific type, and `replace_na()` to replace the NAs.

```r
data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)

data %>%
  mutate_if(is.numeric, ~replace_na(., 0)) %>%
  mutate_if(is.character, ~replace_na(., "xx"))
```

To be honest, I prefer these combined functions, as I'm used to the pipe `%>%` style in R, so the related `mutate_if()`, `mutate_all()`, and `mutate_at()` functions in the `tidyverse` package are very convenient for me.

For instance, you can replace NAs with 0 in selected columns, by name or index, as shown below.

```r
data %>%
  mutate_at(c(1, 2), ~replace_na(., 0))
```

Besides, the `dplyr::coalesce()` function can also be used to replace NAs in a rather tricky way, although it is usually used to find the first non-missing element.

```r
data %>%
  mutate(num1 = coalesce(num1, 0))
```
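In current tidyverse style, the `mutate_if()`/`mutate_at()` family has been superseded by `across()`, which also pairs nicely with `coalesce()`. A sketch on the same `data` (restated here so the snippet is self-contained):

```r
library(dplyr)

data <- tibble(
  num1 = c(NA, 1, NA),
  num2 = c(2, NA, 3),
  chr1 = c("a", NA, "b"),
  chr2 = c("c", "d", NA)
)

# replace NA with 0 in every numeric column in one step
data %>%
  mutate(across(where(is.numeric), ~coalesce(.x, 0)))
```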

Boxplot With Jittered Points in R
http://www.bioinfo-scrounger.com/archives/jittered_boxplot/ (published 2022-12-29)

The box plot is commonly used to show the distribution of data and to look for outliers. From the box we can also see the 25% and 75% quartiles, as well as the median. As a result, it's a very helpful chart.

Let's see a demo.

```r
library(ggplot2)
library(tidyverse)

# Data
data(iris)

ggplot(iris, aes(x = Species, y = Sepal.Length,
                 colour = Species)) +
  geom_boxplot()
```

Adding jittered points to the box plot in `ggplot2` is useful to see the underlying distribution of the data. You can use the `geom_jitter()` function with a few parameters, for example the `width` parameter to adjust the spread of the jittered points.

```r
ggplot(iris, aes(x = Species, y = Sepal.Length,
                 colour = Species, shape = Species)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)
```
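The grouped examples below rely on a dataset `iris2` whose construction isn't shown in this excerpt. A plausible stand-in (my assumption, not necessarily what the original post used) is `iris` plus an artificial two-level `group` factor:

```r
# iris2 is not defined in the excerpt; this construction is an assumption:
# the built-in iris data plus a random two-level grouping variable
set.seed(123)
iris2 <- iris
iris2$group <- factor(sample(c("A", "B"), nrow(iris), replace = TRUE))
```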

Sometimes we might try to add jittered data points to a grouped boxplot, but we cannot use the `geom_jitter()` function directly, as it's just a handy shortcut for `geom_point(position = "jitter")`. Let's see what chart is generated below: a grouped boxplot with overlapping jittered data points.

```r
ggplot(iris2, aes(x = Species, y = Sepal.Length,
                  colour = group, shape = group)) +
  geom_boxplot() +
  geom_jitter(width = 0.25)
```

Naturally, how do we add correctly positioned jittered data points to the grouped boxplot? We can use `position_jitterdodge()` as the `position` parameter inside the `geom_point()` function.

```r
ggplot(iris2, aes(x = Species, y = Sepal.Length,
                  colour = group, shape = group)) +
  geom_boxplot() +
  geom_point(position = position_jitterdodge(jitter.width = 0.25))
```

Right now, we get a nice looking grouped boxplot with clearly separated boxes and jittered data points within each box.

How to save graphs in SAS
http://www.bioinfo-scrounger.com/archives/save_graphs_in_sas/ (published 2022-12-27)

Recently, I was a little confused about how to create or save PNG graphs in SAS. Normally we would create RTF or PDF instead, but sometimes there is a specific requirement to save as PNG directly. So we need to know how to do this when a graph is generated by the SGPLOT or GTL procedure.

Suppose I want to plot a scatter plot with regression lines for the `sashelp.iris` dataset via GTL (Graph Template Language). So I define a GTL template first.

```sas
proc template;
    define statgraph ScatterRegPlot;
    begingraph / backgroundcolor=white border=false
                 datacontrastcolors=(orange purple blue)
                 datasymbols=(circlefilled trianglefilled DiamondFilled);
        layout overlay;
            scatterplot x=SepalLength y=SepalWidth / group=Species name='points';
            regressionplot x=SepalLength y=SepalWidth / group=Species degree=3 name='reg';
            discretelegend 'points';
        endlayout;
    endgraph;
    end;
run;
```

Now let's see how to create RTF or PDF with this graph.

For PDF as below:

```sas
ods escapechar="^";
ods listing close;
options nonumber nodate;

ods pdf file="C:/Users/Desktop/example.pdf";
proc sgrender data=sashelp.iris template=ScatterRegPlot;
run;
ods pdf close;

ods listing;
```

For RTF just change `ods pdf` above to `ods rtf`.

If we just want to save as PNG, as follows:

```sas
ods listing gpath='C:/Users/TJ0695/Desktop' image_dpi=300 style=Journal;
ods graphics / imagename="example" imagefmt=png width=20cm height=15cm;
proc sgrender data=sashelp.iris template=ScatterRegPlot;
run;
ods graphics off;
```

If we increase the DPI to 600, it causes an error like `ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.` So we need to modify the SAS configuration file to fix this error.

• Run `proc options option=config; run;` to find the configuration file in use.
• Open that file, find the options starting with `-Xms` and `-Xmx`, and change both from `128m` to `1024m`.
• Restart SAS and rerun the code.

After that the error doesn't appear again but the warning is still there.

#### Reference

Generating PNG files in SAS using ODS Graphics

Definition of least-squares means (LS means)
http://www.bioinfo-scrounger.com/archives/definition-lsmeans/ (published 2022-09-19)

This article walks through the basic calculation process of least-squares means (LS means).

I find it difficult to understand what LS actually means in its literal sense.

The definition from the `lsmeans` package (which has since been transitioned to the `emmeans` package) is shown below:

Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.

In fact, even when I read this sentence, I was still very confused. What's the reference grid, and how to predict?

So let's see how the LS means are calculated, along with the corresponding confidence intervals.

First, import the CDISC pilot dataset, the same one as in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process `adsl` and `adlb` to create an analysis dataset `ana_dat` so that we can run an ANCOVA with the `lm` function. Suppose we want to see whether `CHG` (change from baseline) is affected by the independent variable `TRTP` (treatment) while controlling for the covariates `BASE` (baseline) and `AGE` (age).

Filter the dataset on the `BASE` variable, as one missing value is present in the dataset.

```r
library(tidyverse)
library(emmeans)

ana_dat2 <- filter(ana_dat, !is.na(BASE))
```

Then fit the ANCOVA model by `lm` function.

```r
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838
```

We know that the LS means are calculated over a reference grid, which contains the means of the covariates and all factor levels of the independent variables.

```r
rg <- ref_grid(fit)
# 'emmGrid' object with variables:
#    BASE = 5.4427
#    AGE = 75.309
#    TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
```

The means of `BASE` and `AGE` are, as we can see from the table above, `5.4427` and `75.309`, respectively. We can also calculate them manually:

```r
summary(ana_dat2[, c("BASE", "AGE")])
#      BASE             AGE
# Min.   : 3.497   Min.   :51.00
# 1st Qu.: 4.774   1st Qu.:71.00
# Median : 5.273   Median :77.00
# Mean   : 5.443   Mean   :75.31
# 3rd Qu.: 5.718   3rd Qu.:81.00
# Max.   :10.880   Max.   :88.00
```

Then we can use `summary()` or `predict()` function to get the predicted value based on reference grid `rg`.

```r
rg_pred <- summary(rg)
rg_pred
# BASE  AGE TRTP                 prediction    SE df
# 5.44 75.3 Placebo                  0.0578 0.506 76
# 5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
# 5.44 75.3 Xanomeline High Dose     0.5031 0.235 76
```

The prediction column is the same as from `predict(rg)`. The table shows the predicted values for the different factor levels with the covariates held constant at their means.

In fact, we can also calculate the predicted values ourselves, as we have the coefficient estimates of the regression equation from `fit$coefficients`:

```r
> fit$coefficients
             (Intercept)                     BASE                      AGE
             -1.11361290               0.11228582               0.00743963
 TRTPXanomeline Low Dose TRTPXanomeline High Dose
             -0.24108746               0.44531274
```

As `TRTP` includes multiple levels, it has been converted into dummy variables:

```r
contrasts(ana_dat2$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1
```

Now if we want to calculate the predicted value for the `Xanomeline Low Dose` factor, it can be as follows:

```r
> 0.11229*5.44 + 0.00744*75.3 - 0.24109*1 - 1.11361
-0.1836104
```

Back to LS means: from the definition, they seem to be the averages of the predicted values.

```r
rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
#   TRTP                  LSmean
#   <fct>                  <dbl>
# 1 Placebo               0.0578
# 2 Xanomeline Low Dose  -0.183
# 3 Xanomeline High Dose  0.503
```

It's exactly the same result as `lsmeans(rg, "TRTP")` from the `emmeans` package. Using `emmeans(fit, "TRTP")` directly also gives the same results.

```r
lsmeans(rg, "TRTP")
# TRTP                  lsmean    SE df lower.CL upper.CL
# Placebo               0.0578 0.506 76   -0.949    1.065
# Xanomeline Low Dose  -0.1833 0.211 76   -0.603    0.236
# Xanomeline High Dose  0.5031 0.235 76    0.036    0.970
```

The residual degrees of freedom are `76`: with 81 observations, we lose 1 for the intercept, 1 each for the covariates `BASE` and `AGE`, and 2 for `TRTP`, so the total is 81 - 1 - 1 - 1 - 2 = 76.
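This accounting holds for any `lm` fit: residual DF = number of observations minus number of estimated coefficients (including the intercept). A quick base-R check on a stand-in model (`mtcars` here, since the trial data isn't bundled with R):

```r
# Stand-in model: two numeric covariates plus a 3-level factor,
# structurally like the ANCOVA above
fit_demo <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

length(coef(fit_demo))  # 5 coefficients: intercept, wt, hp, two cyl dummies
df.residual(fit_demo)   # 32 - 5 = 27
```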

Using `test()` we can get the p-value when comparing each LS mean to zero.

```r
test(lsmeans(fit, "TRTP"))
# TRTP                  lsmean    SE df t.ratio p.value
# Placebo               0.0578 0.506 76   0.114  0.9093
# Xanomeline Low Dose  -0.1833 0.211 76  -0.870  0.3869
# Xanomeline High Dose  0.5031 0.235 76   2.145  0.0351
```

In fact, the `t.ratio` is the t statistic, so we can calculate the p-values manually:

```r
2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)
```

Likewise, the confidence interval of an LS mean can also be calculated manually from `SE` and `DF`, for example for the Placebo level:

```r
> 0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
-0.9499863  1.0655863
```

I think these steps will go a long way toward understanding the meaning of least-squares means and the logic behind them. Hope this helps.

Conduct an ANCOVA model in R for Drug Trial
http://www.bioinfo-scrounger.com/archives/ANCOVA-model-in-R/ (published 2022-09-12)

This article illustrates how to conduct an ANCOVA (Analysis of Covariance) model to determine whether or not the change from baseline glucose is significantly affected by treatment. In other words, we use ANCOVA to compare the adjusted means of two or more independent groups while accounting for one or more covariates.

As an example dataset, I'll use `cdiscpilot01` from CDISC, which contains the ADaM and SDTM datasets for a single study. Our purpose is to conduct an efficacy analysis by ANCOVA with LS mean estimation. Suppose we want to know whether or not the treatment has an impact on `Glucose` while accounting for the glucose baseline. The patients are limited to those who reached the `End of Treatment` visit and were not discontinued due to an AE.

#### Assumptions

ANCOVA makes several assumptions about the input data, such as:

• Linearity between the covariate and the outcome variable
• Homogeneity of regression slopes
• The outcome variable should be approximately normally distributed
• Homoscedasticity
• No significant outliers

Perhaps we need an additional article to talk about how to check these assumptions, but not here. So we assume that all assumptions have been met for the ANCOVA.

#### Data preparation

Install and load the following required packages, and then load the `adsl` and `adlbc` datasets from the `cdiscpilot01` study; for details, see another article: Example of SDTM and ADaM datasets from the CDISC.

```r
library(tidyverse)
library(emmeans)
library(gtsummary)
library(multcomp)

adsl <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adsl.xpt")
adlb <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adlbc.xpt")
```

#### Data manipulation

Per the purpose, we need to filter the efficacy population and focus on `Glucose (mg/dL)` lab test.

```r
gluc <- adlb %>%
  left_join(adsl %>% select(USUBJID, EFFFL), by = "USUBJID") %>%
  # PARAMCD is parameter code and here we focus on Glucose (mg/dL)
  filter(EFFFL == "Y" & PARAMCD == "GLUC") %>%
  arrange(TRTPN) %>%
  mutate(TRTP = factor(TRTP, levels = unique(TRTP)))
```

And then produce the analysis dataset by filtering for the target patients who have reached the end of treatment and were not discontinued due to an AE.

```r
ana_dat <- gluc %>%
  filter(AVISIT == "End of Treatment" & DSRAEFL == "Y") %>%
  arrange(SUBJID, AVISITN) %>%
  mutate(AVISIT = factor(AVISIT, levels = unique(AVISIT)))
```

#### Summary of analysis datasets

Once we have the dataset for analysis, we should examine it first. I find that the `tbl_summary()` function in the `gtsummary` package can calculate descriptive statistics and produce a very nice clinical-style table, as shown below:

```r
ana_dat %>%
  dplyr::select(AGEGR1, SEX, RACE, TRTP, AVAL, BASE, CHG) %>%
  tbl_summary(by = TRTP, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is from cdiscpilot01 study.*"))
```

Here we can see the descriptive summary for each variable by treatment group. Certainly we could also do some visualization, like boxplots or scatterplots, but that isn't presented here.

#### Fit ANCOVA model

We use the `lm` function to fit the ANCOVA model with treatment (`TRTP`) as the independent variable, change from baseline (`CHG`) as the response variable, and baseline (`BASE`) as the covariate.

```r
fit <- lm(CHG ~ BASE + TRTP, data = ana_dat)
summary(fit)
```

The summary output for the regression coefficients follows. If you would like the ANOVA table instead, use `anova(fit)` rather than `summary(fit)`.

```
Call:
lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1744 -0.7627 -0.0680  0.5633  5.0349

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.5579     0.8809  -0.633    0.528
BASE                       0.1111     0.1329   0.837    0.405
TRTPXanomeline Low Dose   -0.2192     0.5433  -0.404    0.688
TRTPXanomeline High Dose   0.4447     0.5528   0.804    0.424

Residual standard error: 1.328 on 77 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.06702,   Adjusted R-squared:  0.03068
F-statistic: 1.844 on 3 and 77 DF,  p-value: 0.1462
```

From the above results, we can easily read off the regression coefficients and the model, along with each coefficient's significance compared to zero. With the coefficients, we can predict any change based on baseline and treatment.

Besides, we can use the `contrasts` function to obtain the contrast matrices and thus understand the dummy variables for `TRTP` in this multiple regression model.

```r
> contrasts(ana_dat$TRTP)
                     Xanomeline Low Dose Xanomeline High Dose
Placebo                                0                    0
Xanomeline Low Dose                    1                    0
Xanomeline High Dose                   0                    1
```

From the ANOVA table shown below, it can be seen that treatment has no statistically significant effect on the change in glucose after controlling for the effect of baseline.

```r
> anova(fit)
Analysis of Variance Table

Response: CHG
          Df  Sum Sq Mean Sq F value Pr(>F)
BASE       1   1.699  1.6989  0.9629 0.3295
TRTP       2   8.061  4.0304  2.2844 0.1087
Residuals 77 135.853  1.7643
```

If you would like to make the output more pretty, `tbl_regression(fit)` can be used as mentioned before.

If we want to obtain the least-squares (LS) mean comparisons between treatment groups, the `emmeans` and `multcomp` packages provide the same results. In addition, the process of calculating the LS mean is well worth learning and understanding.

```r
# by multcomp
postHocs <- glht(fit, linfct = mcp(TRTP = "Tukey"))
summary(postHocs)

# by emmeans
fit_within <- emmeans(fit, "TRTP")
pairs(fit_within, reverse = TRUE)
```

The summary output as shown below:

```
> summary(postHocs)

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Linear Hypotheses:
                                                Estimate Std. Error t value Pr(>|t|)
Xanomeline Low Dose - Placebo == 0               -0.2192     0.5433  -0.404   0.9116
Xanomeline High Dose - Placebo == 0               0.4447     0.5528   0.804   0.6937
Xanomeline High Dose - Xanomeline Low Dose == 0   0.6639     0.3113   2.132   0.0855 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
```

Now it's clear that no significant difference was observed for the pairs `Low Dose vs. Placebo` and `High Dose vs. Placebo`, and only a marginal one (adjusted p = 0.0855) for `High Dose vs. Low Dose`.

Capturing statistic tables from SAS procedure - SASlearner
http://www.bioinfo-scrounger.com/archives/statistic_tables/ (published 2022-09-11)

As a new SAS user, a common question that is always asked and searched on Google is: how can I get a statistic into a table?

In addition to reading the corresponding procedure reference in the official documentation, I recommend using `ods trace on` to find the stat table names, and then extracting any table you want. Here is a regression analysis of the `sashelp.cars` dataset; let's see how to get the stat tables.

``ods trace on;                           /* write ODS table names to log */
proc reg data=sashelp.cars plots=none;
   model Horsepower = EngineSize Weight;
quit;
ods trace off;                          /* stop writing to log */``

The logs you can see in SAS are as follows:

``Output Added:
-------------
Name:       NObs
Label:      Number of Observations
Template:   Stat.Reg.NObs
Path:       Reg.MODEL1.Fit.Horsepower.NObs
-------------
Output Added:
-------------
Name:       ANOVA
Label:      Analysis of Variance
Template:   Stat.REG.ANOVA
Path:       Reg.MODEL1.Fit.Horsepower.ANOVA
-------------
Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics
-------------
Output Added:
-------------
Name:       ParameterEstimates
Label:      Parameter Estimates
Template:   Stat.REG.ParameterEstimates
Path:       Reg.MODEL1.Fit.Horsepower.ParameterEstimates
-------------``

By looking at the output you can find each stat table name, like `ParameterEstimates`. That means you can extract it by adding the `ods output ParameterEstimates=rst` statement to store the table in the `rst` dataset, as follows:

``proc reg data=sashelp.cars plots=none;  /* same procedure call */
   model Horsepower = EngineSize Weight;
   ods output ParameterEstimates=rst;    /* the data set name is 'rst' */
quit;``

Multiple stat tables can be stored with one `ods output` statement. For example, the statement below stores both the ParameterEstimates and ANOVA tables at the same time.

``proc reg data=sashelp.cars plots=none;
   model Horsepower = EngineSize Weight;
   ods output ParameterEstimates=parms ANOVA=anvar;
quit;``

And then, if you want to create a macro variable that contains the value of a certain statistic, such as the slope for EngineSize:

``data _null_;
    set rst;
    if variable="EngineSize" then
        call symputx("slope1", estimate);
run;
%put &=slope1;``

Several procedures provide an alternative option for creating an output similar to the `ods output` mentioned above, for instance the `outest` option of the `proc reg` procedure.

``proc reg data=sashelp.cars noprint outest=rst2 rsquare; /* statistics in 'rst2' */
   model Horsepower = EngineSize Weight;
quit;``

So you'd better check the SAS documentation to see whether the procedure you use provides such an option.

All of the above refers to the following articles:

]]>
Combine RTF files into one file http://www.bioinfo-scrounger.com/archives/combine_rtf/ 2022-09-11T15:24:28.000Z 2022-09-11T15:35:16.000Z This post is just a note based on the article shown below, which I think would be beneficial for anyone who is as new as I am, as this requirement is fairly common in pharmaceutical programming.

A SAS macro to combine portrait and landscape rtf files into one single file

To make it suitable for all of the following conditions, I have additionally updated it to be more flexible.

• containing multiple tables, figures and listings at the same time
• using the title as an entry in the table of contents
• ordering the files manually (only a proposed solution; not implemented yet)

First of all, let's look at the RTF structure, as described in that article.

It is divided into three parts: an opening section, a content section and a closing section. If we look at any single RTF, the structure is the same. Consequently, the RTF combining process can be summarized as follows:

• Read all filenames into SAS (sorted by filename or in a manually defined order).
• Keep the opening section of the first RTF.
• Remove the opening section of every RTF except the first, and the closing section of every RTF except the last. Add `\pard\sect` in front of `\sectd` so that all of the files can be combined correctly.
• Keep the closing section of the last RTF.
• Save the updated RTF code of each file into its own SAS dataset. (They should not be saved as a single value, as the character length is limited in SAS.)
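For readers outside SAS, the combining rules above can be sketched in Python. This is a rough illustration with hypothetical file paths, assuming each input RTF contains a `\sectd` control word:

```python
from pathlib import Path

def combine_rtf(paths, out_path):
    """Combine RTF files following the rules above: keep the first file's
    opening section, replace later files' openings with \\pard\\sect, and
    keep only the last file's closing brace."""
    pieces = []
    for i, path in enumerate(paths):
        text = Path(path).read_text()
        if i > 0:
            # drop the opening section and restart the section instead
            text = r"\pard\sect" + text[text.find(r"\sectd"):]
        if i < len(paths) - 1:
            # drop the closing brace of every file except the last
            text = text.rstrip()
            if text.endswith("}"):
                text = text[:-1]
        pieces.append(text)
    Path(out_path).write_text("".join(pieces))
```

This skips RTF-specific details such as removing `\pgnrestart\pgnstarts1`, but it captures the opening/content/closing logic the SAS macro implements.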

Now let's see the code we can use in this process. Firstly, I import the RTF filenames from the external folder.

``data refList(keep=filepath fn);
    length fref $8 fn $80 filepath $400;
    rc = filename(fref, "&inpath");
    if rc = 0 then
        dirid = dopen(fref);
    if dirid <= 0 then
        putlog 'ERR' 'OR: Unable to open directory.';
    nfiles = dnum(dirid);
    do i = 1 to nfiles;
        fn = dread(dirid, i);
        fid = mopen(dirid, fn);
        if fid > 0 and index(fn,"rtf") then do;
            filepath="&inpath\"||left(trim(fn));
            fn = strip(tranwrd(fn,".rtf",""));
            output;
        end;
    end;
    rc = dclose(dirid);
run;``

Secondly, read each line of the RTF file until finding a line that starts with `\sectd`: everything above it is the opening section, and everything below is the content section. Then remove the trailing `}` from every file except the last one.

``data rtfdt&i(where = (ptline=1));
    retain ptline;
    set rtfdt&i end = last;
    if substr(line,1,6)='\sectd' then do;
        ptline = 1;
        /*enable to combine portrait and landscape rtf*/
        line="\pard\sect"||compress(tranwrd(line,"\pgnrestart\pgnstarts1",""));
    end;
    if last and line^='}' then
        line=substr(strip(line),1,length(strip(line))-1);
    else if last and line='}' then delete;
run;``

Thirdly, when you find the title code in the RTF, replace `\pard` with `\pard\outlinelevel1` so that the title can be used as an entry in the table of contents.

``%if &titleindex = 1 %then %do;
    data rtfdt&i.;
        set rtfdt&i.;
        retain fl 0;
        if index(line,'\pard\plain\') and (not index(line,'\header\pard')) and (not index(line, '\footer\pard')) then
            fl=1+fl;
    run;
    data rtfdt&i;
        set rtfdt&i;
        by fl notsorted;
        if fl=1 and first.fl then /*add index for the contents as per titles*/
            line=tranwrd(line,'\pard','\pard\outlinelevel1');
    run;
%end;``

Finally, don't save the above RTF contents as one single value, because the character length is limited in SAS. And add the `}` as the closing section to keep the RTF file complete.

The complete code is shown below:

``/*Example*/
/*%s_combrtf(inpath=&inpath,outpath=&outpath,outfile=&outfile);*/
/*Parameter Description*/
/*inpath        input path*/
/*outpath       output path*/
/*outfile       output file name*/
/*titleindex    whether to add title index, default is 1*/
%macro s_combrtf(inpath= ,outpath= ,outfile= ,titleindex=1);
    data refList(keep=filepath fn);
        length fref $8 fn $80 filepath $400;
        rc = filename(fref, "&inpath");
        if rc = 0 then
            dirid = dopen(fref);
        if dirid <= 0 then
            putlog 'ERR' 'OR: Unable to open directory.';
        nfiles = dnum(dirid);
        do i = 1 to nfiles;
            fn = dread(dirid, i);
            fid = mopen(dirid, fn);
            if fid > 0 and index(fn,"rtf") then do;
                filepath="&inpath\"||left(trim(fn));
                fn = strip(tranwrd(fn,".rtf",""));
                output;
            end;
        end;
        rc = dclose(dirid);
    run;

    /*sort by filename by default*/
    proc sort data = refList sortseq = linguistic(numeric_collation=on) out = sorted_refList;
        by fn;
    quit;

    data fileorder;
        set sorted_refList;
        FileLevel = 2;
        order = .;
    run;

    data _null_;
        set fileorder end=last;
        fnref=strip("filename fnref")||strip(_N_)||right(' "')||strip(filepath)||strip('" lrecl=5000 ;');
        call execute(fnref);
        if last then
            call symputx('maxn',vvalue(_n_), 'l');
    run;

    %do i=1 %to &maxn.;
        data rtfdt&i.;
            infile fnref&i. truncover;
            informat line $5000.;
            format line $5000.;
            length line $5000.;
            input line $1-5000;
            line=strip(line);
        run;

        /*add title index and adapt to more flexible*/
        %if &titleindex = 1 %then %do;
            data rtfdt&i.;
                set rtfdt&i.;
                retain fl 0;
                if index(line,'\pard\plain\') and (not index(line,'\header\pard')) and (not index(line, '\footer\pard')) then
                    fl=1+fl;
            run;
            data rtfdt&i;
                set rtfdt&i;
                by fl notsorted;
                if fl=1 and first.fl then
                    /*add index for the contents as per titles*/
                    line=tranwrd(line,'\pard','\pard\outlinelevel1');
            run;
        %end;

        %if &i.=1 %then %do;
            data final;
                set rtfdt&i(keep = line) end = last;
                if last and line^='}' then
                    line=substr(strip(line),1,length(strip(line))-1);
                else if last and line='}' then delete;
            run;
        %end;

        %if &i.^=1 %then %do;
            data rtfdt&i(where = (ptline=1));
                retain ptline;
                set rtfdt&i end = last;
                if substr(line,1,6)='\sectd' then do;
                    ptline = 1;
                    /*enable to combine portrait and landscape rtf*/
                    line="\pard\sect"||compress(tranwrd(line,"\pgnrestart\pgnstarts1",""));
                end;
                if last and line^='}' then
                    line=substr(strip(line),1,length(strip(line))-1);
                else if last and line='}' then delete;
            run;
        %end;

        %if &i.=&maxn. %then %do;
            %local _cnt;
            data final;
                set final
                    %do _cnt=2 %to &maxn;
                        rtfdt&_cnt(keep = line)
                    %end;
                ;
            run;
            data final;
                set final end = last;
                if last then
                    line=strip(line)||strip("}");
            run;
        %end;
    %end;

    data _null_;
        file "&outpath\&outfile..rtf" lrecl=5000 nopad;
        set final;
        put line;
    run;
%mend;``

This approach, in my opinion, is quite excellent, as it resolves the following issues:

I know that different companies put titles and footnotes in different places. Some place them in the header and footer sections, and some place them in the body of the RTF document. The above macro works no matter how you place the titles and footnotes.

#### Reference

]]>
Post-hoc and ad-hoc of statistical analysis in clinical trials http://www.bioinfo-scrounger.com/archives/posthoc_adhoc/ 2022-09-08T15:19:56.000Z 2022-09-08T15:22:12.000Z These are some notes from past study, referencing and excerpting various online materials, so I cannot guarantee the accuracy of the descriptions.

http://onbiostatistics.blogspot.com/2009/01/data-dredging-vs-data-mining-post-hoc.html

• Post hoc also refers to additional statistical analysis requests arising from regulatory review comments after submission.

Post-hoc analysis is often called data dredging or data fishing, as its motivation is frequently to obtain positive results. Consequently, post-hoc analysis results are generally not accepted by drug regulatory authorities as evidence of drug efficacy.

A post-hoc analysis involves looking at the data after a study has been concluded, and trying to find patterns that were not primary objectives of the study. In other words, all analyses that were not pre-planned and were conducted as 'additional' analyses after completing the experiment are considered to be post-hoc analyses. A post-hoc study is conducted using data that has already been collected. https://www.editage.com/insights/zh-hans/node/7139

While both post-hoc and ad-hoc analyses may be performed based on the data or results we have seen, an ad-hoc analysis typically occurs alongside the project, whereas a post-hoc analysis occurs strictly after the project, after the unblinding of the study, or after the pre-specified analysis results have been reviewed. In this sense, ad-hoc analysis is better regarded than post-hoc analysis.

]]>
SAS 数据集操作 (SAS Dataset Manipulation) http://www.bioinfo-scrounger.com/archives/sas_data_manipulation/ 2022-09-08T14:47:33.000Z 2022-09-11T15:24:12.000Z These are notes I took when I first started learning SAS. Seeing them again feels familiar and a little nostalgic — the feeling of a beginner taking notes from a book.

#### Usage of `set`

1. `set` with `keep` — extract specific variables

`` /*set-keep: select variables*/
 data keep;
     set sashelp.class(keep=name sex);
     /*from the class dataset in the sashelp library; keep is equivalent to
       class[, c("name","sex")] in R — keep selects variables, drop removes them*/
 run;``
2. `set` with `rename` — rename variables

`` /*set-rename: rename variables*/
 data keep;
     set sashelp.class(keep=name sex rename=(name=name_new sex=sex_new));
 run;``
3. `set` with `where` — conditional selection

`` /*set-where: select by condition*/
 data keep;
     set sashelp.class(keep=name sex where=(sex='M'));
 run;``
4. `set` with `in=` — temporary variables

`` /*set-in: temporary variables*/
 /*One big difference between SAS and R is that SAS does not hold everything
   directly in memory; data lives in datasets. To work with dataset contents,
   we first assign them to temporary variables.*/
 data keep;
     set one(in=a) two(in=b);  /*concatenate one and two; a and b flag which dataset each row came from*/
     in_one=a;
     in_two=b;                 /*copy the temporary flags a and b to real variables in_one and in_two*/
     if a=1 then flag=1;
     else flag=0;              /*build a new variable flag based on the condition*/
 run;``
5. `set` with `nobs=` — total observation count

`` /*set-nobs: total number of observations, like nrow in R*/
 data nobs11(keep=total);
     set sashelp.class nobs=total_obs; /*nobs passes the row count of class to the
                                         temporary variable total_obs, which is then
                                         assigned to the real variable total and output*/
     total=total_obs;
     output;
     stop;
 run;``

The counting is done via `nobs=total_obs` together with `total=total_obs`: `nobs` puts the row count into the temporary variable `total_obs`, which then has to be assigned to the real variable `total` before it can be output.

Also note that `output` plus `stop` outputs a single row; `stop` halts the data step after one iteration, otherwise a 19×1 column filled with the number 19 would be generated.

6. Dataset concatenation — stacking observations; `obs=10` here takes the first 10 rows

`` /*set: concatenate datasets*/
 data concatenat;
     set sashelp.class sashelp.class(obs=10); /*stack observations; sashelp.class(obs=10) is a slice*/
 run;``

#### Usage of `merge` — merging datasets side by side

``/*merge: side-by-side merge*/
proc sort data=chapt3.merge_a;
    by x;
run;
proc sort data=chapt3.merge_c;
    by x;
run;
data d;
    merge chapt3.merge_a chapt3.merge_c;
    by x;
run;``

SAS requires the datasets to be sorted by the by-variables before they can be merged.

• Sorting: `proc sort data=libref.dataset; by variable; run;`
• Merging: `merge dataset1 dataset2; by x;`

#### `where` — selecting by condition

1. where-between/and

``where x between 10 and 20;       /* x in [10, 20] */
where x not between 10 and 20;
where x between y*0.1 and y*0.5;
where x between 'a' and 'c';``

where-between/and can be used as a form of slicing; the dataset option `(obs=10)` is another way to slice.

``where x in (1,2); /*select observations where the variable equals one of the listed values*/``

2. `where` with missing values

``/*where: select missing values*/
where x is missing;
where x is null;  /*for numeric variables, locate missing values, like is.na() in R*/``

``where x;         /*numeric variable: select observations where x is neither 0 nor missing*/
where x and y;
where x ne '';   /*character variable: select non-blank observations*/``

3. `where` with character patterns

``where x like 'D_an';
where x like '%ab%' or x like '%z%'; /*character matching: an underscore matches exactly one
                                       character, % matches any number of characters*/``

#### `proc append` — appending observations

``/*proc append: base= data= force*/
/*base is the base dataset; data is the dataset to be appended to it;
  force forces the append when structures differ, rarely used*/
proc append base=null data=sashelp.class(where=(sex='M'));
run;``

]]>
<p>These are notes I took when I first started learning SAS; seeing them again feels familiar, like a beginner taking notes from a book.</p> <p>Looking back, these very basic usages are actually the code used most often in everyday work.</p>
How to insert blank rows - SASlearner http://www.bioinfo-scrounger.com/archives/insert-blank-rows/ 2022-08-29T14:41:49.000Z 2022-08-29T14:43:52.000Z If you'd like to add one blank row in your report, as shown below:

There are two methods to implement this requirement.

• to do that in `proc report`
• to do that before `proc report`

Let's say that we have an example `sashelp.class`, and wish to divide it into two groups so that we can present them separately.

For the first method, we need to add a `gr` variable for grouping, such as female and male.

``data final;
    set sashelp.class;
    if sex="M" then
        gr=1;
    else gr=2;
run;
proc sort data = final;
    by gr age;
run;``

And then, in the `proc report` procedure, add one line code `compute after gr;` as shown below. You can see it works.

``proc report data=final;
    column gr name sex age height weight;
    define gr / group order noprint;
    compute after gr;
        line @1 "";
    endcomp;
run;``

To use the second method, we simply add one blank row to the `final` dataset using `call missing(of _all_)`.

``data final2;
    set final;
    by sex;
    output;
    if last.sex then do;
        call missing(of _all_);
        output;
    end;
    drop gr;
run;``

Just a trick — I hope it helps anyone who's learning SAS.

]]>
Example of SDTM and ADaM datasets from the CDISC http://www.bioinfo-scrounger.com/archives/sdtm_adam_cdisc/ 2022-08-28T13:59:28.000Z 2023-01-18T14:25:59.000Z Here I present a set of SDTM and ADaM datasets from the CDISC pilot project as resources for R development and programming.

If you want to review the full datasets for the pilot project, I would recommend cloning the cdisc-org/sdtm-adam-pilot-project git repository, as follows:

``git clone https://github.com/cdisc-org/sdtm-adam-pilot-project.git``

Otherwise, you can also get the CDISC pilot project from the `phuse-scripts` repository if necessary.

``git clone https://github.com/phuse-org/phuse-scripts.git``

Supposing you just want to get SDTM and ADaM datasets and run some tests in R, I would import those datasets from R packages, such as the `admiral` package. Install it and try it right here.

``library(admiral)
data(admiral_adsl)
adsl <- admiral_adsl
head(adsl[1:5, 1:5])
# A tibble: 5 × 5
  STUDYID      USUBJID     SUBJID RFSTDTC    RFENDTC
  <chr>        <chr>       <chr>  <chr>      <chr>
1 CDISCPILOT01 01-701-1015 1015   2014-01-02 2014-07-02
2 CDISCPILOT01 01-701-1023 1023   2012-08-05 2012-09-02
3 CDISCPILOT01 01-701-1028 1028   2013-07-19 2014-01-14
4 CDISCPILOT01 01-701-1033 1033   2014-03-18 2014-04-14
5 CDISCPILOT01 01-701-1034 1034   2014-07-01 2014-12-30``

You can also find the function list directly right here, to see which ADaM datasets are available for use, such as `adae`, `adeg`, `advs` and so on.

If you want to import SDTM datasets, you should use functions like `data(admiral_ae)` from the `admiral.test` package.

``library(admiral.test)
data(admiral_ae)
ae <- admiral_ae
head(ae[1:5, 1:5])
# A tibble: 5 × 5
  STUDYID      DOMAIN USUBJID     AESEQ AESPID
  <chr>        <chr>  <chr>       <dbl> <chr>
1 CDISCPILOT01 AE     01-701-1015     1 E07
2 CDISCPILOT01 AE     01-701-1015     2 E08
3 CDISCPILOT01 AE     01-701-1015     3 E06
4 CDISCPILOT01 AE     01-701-1023     3 E10
5 CDISCPILOT01 AE     01-701-1023     1 E08``

I find it quite practical, don't you? In addition to the `admiral` package, I also find that the `r2rtf` and `clinUtils` R packages contain example datasets for the CDISC pilot project. But neither of them is quite complete, so they are just an alternative option.

Overall, from my perspective, these example datasets are a great resource for developing R packages or Shiny dashboards for pharmaceutical use.

#### Reference

]]>
File and Directory Manipulation - SAS&R http://www.bioinfo-scrounger.com/archives/file-dir-manipulation/ 2022-08-18T12:55:37.000Z 2022-08-18T13:02:04.000Z In this article, we are going to present how to work with files and folders in R and SAS.

First of all, here is a good resource about SAS macros related to our topic. It lists a series of useful macros, and the code is very standard and well worth learning — highly recommended.

For R, I recommend the `fs` package, which provides a cross-platform, uniform interface to file system operations.

And then let's begin with our topics.

#### List of files

Suppose you want to list the files in a particular directory; in R you can simply use `list.files()`. For example, to list the files in a specific directory such as the current one:

``list.files(path = "./")``

You can also get the files within subfolders and match only the `.txt` files; simply use

``list.files(path = "./", pattern = "\\.txt$", recursive = TRUE)``

In SAS, to the best of my knowledge, there are two approaches. The first is to use a `filename` statement with a `pipe` device type and the `dir` command in a Windows environment.

``filename dirlist pipe 'dir /b E:\Tp\*.txt';
data list;
    length line $200;
    infile dirlist;
    input;
    line = strip(_infile_);
run;
filename dirlist clear;
proc print data=list; run;``

The second approach is to use the functions `dopen` and `dread` with the help from `dnum`, as the following example.

``filename root "E:\Tp";
data list;
    * --  return variables  --;
    length name $ 512;
    * --  directory to inventory  --;
    dirid = dopen('root');
    if dirid <= 0 then
        putlog 'ERR' 'OR: Unable to open directory.';
    nfiles = dnum(dirid);
    do i = 1 to dnum(dirid);
        * --  directory item name  --;
        name = dread(dirid, i);
        output;
    end;
    rc = dclose(dirid);
    dirid = 0;
run;``

The more details for the second approach can be found in these links.

#### File Exists

Suppose you want to check whether a file called `README.md` exists in the current directory; in R you can use the `file.exists()` function, which returns `TRUE` if the file exists and `FALSE` otherwise.

``file.exists("./README.md")``

In SAS, the `fileexist` function verifies the existence of a file, returning 1 if the file exists and 0 otherwise.

``%let fpath=E:\Tp\test.sas;
%macro fileexists(filepath);
    %if %sysfunc(fileexist(&filepath)) %then
        %put NOTE: The external file &filepath exists.;
    %else
        %put ERROR: The external file &filepath does not exist.;
%mend fileexists;
%fileexists(&fpath);``

Besides, if you want to check whether a dataset exists, you can use the `exist` function; for checking whether a variable exists, `varnum` is recommended.

#### File Creates

If you want to create a blank file in R then

``file.create("./test.txt")``

In SAS, I'm not sure if the below is the normal way, but it’s definitely simple anyway.

``data _null_;
    file "E:\Tp\test.txt";
run;``

#### File Deletes

If you want to delete a specific file in R, then

``file.remove("./test.txt")``

In SAS, I think we can simply use `fdelete` function.

``filename defile 'E:\Tp\test.txt';
data _null_;
    rc=fdelete('defile');
run;``

#### Directory creates

Creating a directory is very similar to a file. The function in R is `dir.create()` that is very convenient to use. In SAS it can be accomplished using the `dlcreatedir` option and `libname` statement with 2 lines of code.

``options dlcreatedir;
libname folder 'E:\Tp\dummy';``

If you want to create or copy multiple folders or directories, more detailed information can be found in Using SAS® to Create Directories and Duplicate Files.

#### List specific extension files

You can use the macro shown below to list all the RTF files, i.e. files with the `.rtf` extension.

``/*Example*/
/*%ListFilesSpecifyExtension(dir=C:\Users\demo,ext=rtf,out=rst);*/
/*Parameter Description*/
/*dir         input directory*/
/*ext         file name extension*/
/*out         output dataset*/
%macro ListFilesSpecifyExtension(dir=, ext=, out=);
    %local filrf rc did name i;
    %let rc=%sysfunc(filename(filrf,&dir));
    %let did=%sysfunc(dopen(&filrf));
    /* Use the %IF statement to make sure the directory can be opened. If not, end the macro. */
    %if &did eq 0 %then %do;
        %put Directory &dir cannot be open or does not exist;
        %return;
    %end;
    data &out;
        length FileName $200;
        %do i = 1 %to %sysfunc(dnum(&did));
            %let name=%qsysfunc(dread(&did,&i));
            %if %qupcase(%qscan(&name,-1,.)) = %upcase(&ext) %then %do;
                Filename = "&name";
                output;
            %end;
        %end;
    run;
    %let rc=%sysfunc(dclose(&did));
    %let rc=%sysfunc(filename(filrf));
%mend ListFilesSpecifyExtension;``

#### Reference

]]>
Bland-Altman Analysis http://www.bioinfo-scrounger.com/archives/bland-altman-analysis/ 2022-05-26T14:08:11.000Z 2022-05-26T14:19:46.375Z The Bland-Altman analysis is the most common method of assessing the agreement in method comparison in IVD CT or CE trials.

The Bland-Altman method generally refers to the Bland-Altman plot, which is used to display the relationship between two paired quantitative measurement tests or assays (Bland & Altman, 1986 and 1999). For example, a new product might be compared with the registered product or a previous-generation product. Alternatively, the comparator can be a reference or gold-standard method. In CLSI EP09 it is sometimes called a difference plot instead of the Bland-Altman plot.

As we can see from above, the Bland-Altman is a scatter plot that clearly shows the relationship between the differences and the magnitude of measurement. The X axis represents the mean of two measurements, and the Y axis represents the difference.

Sometimes the difference is constant, but it can also be proportional, depending on the distribution of the measurements. If the difference is not related to the magnitude, we have a constant difference between the two assays across the whole X-axis range. A proportional difference, by contrast, is related and proportional to the magnitude. Such plots can therefore be visually inspected to determine the underlying variability characteristic of this relationship.

#### Assumption

The assumption of the Bland-Altman method is that the differences are normally distributed. But we all know we cannot guarantee the measurements follow a normal distribution exactly. In many cases there is actually no big impact on the Bland-Altman analysis when the distribution of the differences is not strictly normal.

But from my side, I propose that the ranges of the two assays should not be too different, ensuring they are of similar magnitude.

#### Basics

There are some definitions we should know for Bland-Altman analysis.

• Bias: the mean of the differences between the two measurements. It is drawn as the middle line in the Bland-Altman plot and is useful for detecting a systematic difference.
• 95% CI of bias: the 95% confidence interval of the mean difference, which illustrates the magnitude of the systematic difference.
• Limits of agreement (LoA): the 95% prediction interval of the differences, drawn as the upper and lower lines in the Bland-Altman plot. This indicator is very important in clinical trials; we always need to compare the LoA with the clinical acceptance criteria to demonstrate that the bias of the new product is acceptable in clinical practice.
• 95% CI of LoA: the 95% confidence interval of the LoA, which demonstrates the error (precision) of the upper and lower limits of agreement.

#### Analysis

To explain the calculation process more clearly, here I use the R to implement it.

Suppose you have two measurements from different assays. The mean of their differences is `5`, and the corresponding standard deviation is `0.8`.

``d <- 5
sd <- 0.8``

We can easily get the LoA by the formula as it’s the prediction interval of the differences.

``LoA <- c(d - 1.96 * sd, d + 1.96 * sd)``

And the CI for `d` and `LoA` would be a bit complicated, as shown below from NCSS Bland-Altman Plot and Analysis documentation.

From the above formula, the standard error of the `LoA` CI is about 1.71 times that of `d`. This `1.71` often appears in Bland-Altman related articles; now at least we know how to calculate it — just drop the `n` from both sides of the equation.

``sqrt(1 + 1.96^2 / 2)``
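As a quick numeric cross-check of the factor and the limits themselves, here is the same arithmetic sketched in Python (standard library only; using the normal quantile 1.96 throughout, so the figures are approximate rather than exact t-based values):

```python
import math

d, sd = 5.0, 0.8   # mean difference and SD of the differences
z = 1.96           # normal quantile for 95% limits

# limits of agreement: 95% prediction interval for a single difference
loa = (d - z * sd, d + z * sd)      # approximately (3.43, 6.57)

# inflation factor for the SE of the LoA relative to the SE of the bias
factor = math.sqrt(1 + z**2 / 2)    # approximately 1.71
```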

For the CI, we can also easily calculate them according to the above formula, for example the 95% two-side confidence interval.

But here we must decide whether to use the t distribution or the normal distribution, as that determines whether we use the t statistic or the z statistic. Suppose I use the t distribution and define the sample size `n` as 200, so the degrees of freedom equal `n-1`.

``n <- 200
t <- qt(1 - 0.05 / 2, n - 1)
d_se <- sd / sqrt(n)
d_CI <- c(d - t * d_se, d + t * d_se)
> d_CI
4.888449 5.111551``

Then the corresponding CI of `LoA`

``LoA_se <- sd * sqrt(1 / n + 1.96^2 / (2 * (n - 1)))
LoA_CI <- LoA + c(- t * LoA_se, t * LoA_se)
> LoA_CI
3.241041 6.758959``

#### Results

What's the best result for Bland Altman analysis in the clinical trials?

Basically, in the Bland-Altman plot we hope the spread of the scatter points is consistent across the range of the X axis, with only a few points falling outside the `LoA`. Moreover, the `LoA` or its CI should meet the clinical requirements.

#### Reference

All of the above is my rough understanding; the main purpose is to note the calculation of the CI for the LoA. The references are shown below:

]]>
Non-Inferiority Test for Paired ROC Curves http://www.bioinfo-scrounger.com/archives/non-inferiority-test-roc/ 2022-05-06T12:44:48.000Z 2022-05-06T12:48:38.538Z This post talks about how to compare two paired areas under ROC curves (AUC) for diagnostic accuracy using a non-inferiority test.

Suppose you have a new product that you want to compare with an existing product. The primary endpoint is that the AUC of your product is not worse than that of the comparator — no doubt a non-inferiority trial design. So, supposing the non-inferiority margin is `-0.15`, the one-sided hypotheses are below:

• Margin: -0.15
• H0: AUC1 - AUC2 ≤ -0.1500
• H1: AUC1 - AUC2 > -0.1500
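With an AUC difference and its standard error in hand, the decision rule itself is a one-sided z-test against the margin. A minimal Python sketch (the difference and the standard error `0.06` here are illustrative placeholders, not values from the study data):

```python
from statistics import NormalDist

auc_diff = -0.0477  # illustrative observed AUC1 - AUC2
se = 0.06           # hypothetical standard error of the difference
margin = -0.15      # non-inferiority margin

z = (auc_diff - margin) / se
p = 1 - NormalDist().cdf(z)                         # one-sided p-value
lower = auc_diff - NormalDist().inv_cdf(0.95) * se  # one-sided 95% lower limit

# Non-inferiority is claimed when p < 0.05, or equivalently when lower > margin.
```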

And then, which method could we use to assess statistical significance? Commonly, two methods are used to estimate the AUC. One is the empirical (nonparametric) method by DeLong et al. (1988), which doesn't depend on the strong normality assumption that the Binormal method makes. The other is the Binormal method presented by Metz (1978) and McClish (1989). The bootstrap technique is also suggested for both methods, as in the `rocNIT` R package.

In this post, I'm mainly going to record how to use the nonparametric method to compare paired AUCs, referring to the aforementioned Liu (2006) article.

The method is implemented in R code to demonstrate each step. I won't post the formulas here; please read that article, which is very clear and easy to understand.

For paired AUCs, the two products are measured on the same subjects. So I create simulated data with 80 subjects and the corresponding two measurements, tested by the new product and the comparator product.

The nonparametric estimation of the ROC curve area is based on the Mann-Whitney U statistic.
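Concretely, the Mann-Whitney AUC estimate is the proportion of (diseased, non-diseased) pairs in which the diseased subject's score is higher, with ties counted as one half. A small Python sketch of the idea (an illustration, not the author's R code):

```python
def auc_mann_whitney(pos, neg):
    """Nonparametric AUC estimate: average over all (pos, neg) pairs of
    1 if pos > neg, 0.5 if tied, and 0 otherwise."""
    total = 0.0
    for x in pos:
        for y in neg:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(pos) * len(neg))

# e.g. auc_mann_whitney([3, 4, 5], [1, 2, 3]) gives 8.5 / 9, about 0.944
```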

And then we need to calculate the estimated variance of `auc1 - auc2`.

Once we have the estimated variance, the difference of the two paired AUCs and the margin, we can obtain the Z statistic, from which the p-value and the one-sided 95% lower limit can both be calculated.

The complete code is as follows:

``# Helper functions
h_get_u_statistic <- function(x, y) {
  if (x > y) {
    return(1)
  }
  if (x == y) {
    return(0.5)
  }
  if (x < y) {
    return(0)
  }
}

h_auc_v10_v01 <- function(n1, n0, v1, v0) {
  v10 <- NULL
  v01 <- NULL
  # Mann-Whitney U statistic
  auc <- sum(sapply(v1, function(x) {
    sapply(v0, function(y) {h_get_u_statistic(x, y)})
  })) / (n1 * n0)
  for (i in 1:n1) {
    v10 <- c(v10, sum(
      sapply(v0, function(y) {h_get_u_statistic(v1[i], y)})
    ) / n0)
  }
  for (i in 1:n0) {
    v01 <- c(v01, sum(
      sapply(v1, function(x) {h_get_u_statistic(x, v0[i])})
    ) / n1)
  }
  return(list(auc = auc, v10 = v10, v01 = v01))
}

# To get the auc and corresponding intermediate parameters `v10` and `v01`
get_auc <- function(response, var) {
  dat <- cbind(response, var)
  n0 <- sum(response == 0, na.rm = TRUE)
  n1 <- sum(response == 1, na.rm = TRUE)
  var0 <- var[response == 0]
  var1 <- var[response == 1]
  c(
    list(n1 = n1, n0 = n0, var1 = var1, var0 = var0),
    h_auc_v10_v01(n1 = n1, n0 = n0, v1 = var1, v0 = var0)
  )
}

# The main program.
auc.test <- function(mroc1, mroc2, margin, alpha = 0.05) {
  mod1 <- mroc1
  mod2 <- mroc2
  n1 <- mod1$n1
  n0 <- mod1$n0
  auc1 <- mod1$auc
  auc2 <- mod2$auc
  s10_11 <- sum((mod1$v10 - auc1)^2) / (n1 - 1)
  s10_22 <- sum((mod2$v10 - auc2)^2) / (n1 - 1)
  s10_12 <- sum((mod1$v10 - auc1) * (mod2$v10 - auc2)) / (n1 - 1)
  s01_11 <- sum((mod1$v01 - auc1)^2) / (n0 - 1)
  s01_22 <- sum((mod2$v01 - auc2)^2) / (n0 - 1)
  s01_12 <- sum((mod1$v01 - auc1) * (mod2$v01 - auc2)) / (n0 - 1)
  variance <- (s10_11 + s10_22 - 2 * s10_12) / n1 + (s01_11 + s01_22 - 2 * s01_12) / n0
  auc_diff <- auc1 - auc2
  z <- (auc_diff - margin) / sqrt(variance)
  p <- 1 - pnorm(z)
  lower_limit <- auc_diff - (qnorm(1 - alpha) * sqrt(variance))
  list(Difference = auc_diff, `Non-Inferiority Pvalue` = p, `One-Sided 95% Lower Limit` = lower_limit)
}``

Now let's run the main program. First, import the simulation data.

``
library(dplyr)  # for %>%, mutate() and if_else()

data <- data.table::fread("./Paired_Criteria_adjusted.txt") %>%
  mutate(Response = if_else(Condition == "Present", 1, 0))
``

And then obtain the results of the two ROC models.

``
mroc1 <- get_auc(response = data$Response, var = data$Method1)
mroc2 <- get_auc(response = data$Response, var = data$Method2)
``

Finally, calculate the outputs, especially the p value and the one-sided lower limit.

``
> auc.test(mroc1, mroc2, margin = -0.15, alpha = 0.05)
$Difference
[1] -0.04771115

$`Non-Inferiority Pvalue`
[1] 0.04447758

$`One-Sided 95% Lower Limit`
[1] -0.1466274
``

From this outcome, we can reject the null hypothesis: the p value is less than 0.05, and the one-sided lower limit is greater than -0.15. So we can conclude that Method1 (the new product) is non-inferior to Method2 (the registered product) with a margin of 0.15.

Above are my notes on a non-inferiority test in the diagnostics area. In practice I prefer to use an R package or NCSS software for this test, as hand-written code can easily be wrong. So the code above is meant to illustrate the principle rather than to be used directly.

#### Reference

]]>
<p>This post is to talk about how to compare two paired areas under ROC curves(AUC) for diagnostic accuracy by non-inferiority test.</p> <blockquote> <p>Reference: <a href="https://pubmed.ncbi.nlm.nih.gov/16158400/" target="_blank" rel="noopener">Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves</a></p> </blockquote>
Merge and Transpose - SASlearner http://www.bioinfo-scrounger.com/archives/merge-and-transpose/ 2022-04-23T13:30:25.000Z 2022-04-25T07:21:03.424Z If you often work with data manipulation, you obviously need to know how to merge and transpose data, as these operations are very common in data processing.

In R, you can use the `dplyr` package's `left_join()`, `inner_join()`, and the other `*_join()` functions to merge data in any form. As for transposing, we generally call it pivoting, and use the corresponding `pivot_wider()` and `pivot_longer()` functions from the `tidyr` package to reshape data from long to wide, or the opposite. For how to pivot data in R, refer to my post: Pivoting data in R

As for SAS, let's see this with step-by-step examples.

#### Merge

In SAS, merge processing takes three approaches:

• Concatenating
• Interleaving
• Match-merging

The first two are uses of the `set` statement and are not discussed in this post. The third, I suppose, is the most common in our data processing.

First, we create two example datasets as shown below.

``
proc sort data = sashelp.class;
    by Name;
run;

data cls1;
    set sashelp.class(keep = Name Sex obs = 10);
run;

data cls2;
    set sashelp.class(keep = Name Age firstobs = 5 obs = 15);
run;
``

Then we specify the most important statement, `by`, to indicate which variable to join by. Before merging, we must make sure each dataset is sorted by that `by` variable.

Why?

A simple answer is that the SAS match-merge is based on the classic sequential match algorithm, and the latter is based on the premise that all input streams are sorted identically.
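That sequential match idea can be sketched in a few lines of Python (a toy two-pointer walk over two streams already sorted by key, keeping matched pairs only, like `if x and y`; real SAS match-merge also handles unmatched and one-to-many cases):

```python
def match_merge(left, right):
    """Toy match-merge of two lists of (key, value) pairs,
    both assumed sorted by key; yields matched rows only."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1  # advance the stream with the smaller key
        else:
            j += 1
    return out

pairs = match_merge([("Alfred", "M"), ("Alice", "F")],
                    [("Alice", 13), ("Barbara", 13)])
# Only "Alice" appears in both streams
```

If either input were unsorted, the single forward pass would silently miss matches, which is exactly why SAS requires sorted inputs.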

``
data clss;
    merge cls1 cls2;
    by Name;
run;
``

Actually, in my opinion, the above code is not commonly used; it behaves like a `full_join`: all x rows, plus unmatched y rows. More often I want merge processing like `left_join` or `inner_join`. In that case we need to specify the `IN=` dataset option in the `merge` statement.

For instance, to process like `left_join`:

``
data clss2;
    merge cls1(in = x) cls2(in = y);
    by Name;
    if x;
run;
``

And to process like `inner_join`:

``
data clss2;
    merge cls1(in = x) cls2(in = y);
    by Name;
    if x and y;
run;
``

As seen above, we can use `IN=` to control which rows are kept.

However, considering we have to sort the datasets first, I sometimes prefer `proc sql` to merge data, since it is close to the form I use in R.

``
proc sql;
    create table clss3 as
        select x.*, y.* from cls1 as x
            left join cls2 as y on x.Name = y.Name;
quit;
``

#### Transpose

In my opinion, transpose processing is better learned and understood through a few examples; simply memorizing the arguments is confusing. I suggest running each piece of code, looking at the output, and thinking about how it was produced.

So let's see examples that transpose a dataset from long to wide, i.e. rows to columns.

First, create example data from the `sashelp.shoes` dataset.

``
data shoes;
    set sashelp.shoes;
    if Subsidiary in ("Johannesburg" "Nairobi");
    keep Region Product Subsidiary Sales Inventory;
run;
``

By default, the transpose procedure transposes only the numeric columns from long to wide and ignores character columns. In practice, though, we generally define a series of options such as `prefix`, `name`, and `label`. With the `var` statement we select which column or columns to transpose, and with the `id` statement we use a column's values as the new variable names.

``
proc transpose data = shoes(where = (Subsidiary = "Johannesburg"))
    out = res;
    var Sales;
    id Product;
run;
``

If you want to group the data by a variable, add the `by` statement.

``
proc transpose data = shoes out = res;
    var Sales;
    id Product;
    by Subsidiary;
run;
``

If you want to rename the `_NAME_` and `_LABEL_` columns, add the `name=` and `label=` options.

``
proc transpose data = shoes out = res name = var_name label = label_name;
    var Sales;
    id Product;
    by Subsidiary;
run;
``

Then let's see an example of transposing data from wide to long.

``
proc transpose data = res(drop = var_name label_name) out = res2;
    var Boot Sandal Slipper;
    by Subsidiary;
run;
``
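Outside SAS, the long-to-wide reshaping that `proc transpose` performs can be sketched with a plain dictionary (illustrative Python, mirroring the `by`/`id`/`var` roles above; the sales figures are made up):

```python
def pivot_wider(rows, by, id_col, var):
    """Toy long-to-wide pivot: one output record per `by` value,
    with the `id_col` values becoming the new column names."""
    out = {}
    for row in rows:
        out.setdefault(row[by], {})[row[id_col]] = row[var]
    return out

long_rows = [
    {"Subsidiary": "Nairobi", "Product": "Boot", "Sales": 10},
    {"Subsidiary": "Nairobi", "Product": "Sandal", "Sales": 5},
    {"Subsidiary": "Johannesburg", "Product": "Boot", "Sales": 8},
]
wide = pivot_wider(long_rows, by="Subsidiary", id_col="Product", var="Sales")
# wide["Nairobi"] holds one Boot column and one Sandal column
```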

#### Reference

]]>
<p>If you often work with data manipulation, obviously you need to know how to <strong>merge</strong> and <strong>transpose</strong> data as that is very common in our data processing.</p>
Select N Rows or K-th Elements from Macro Variable - SASlearner http://www.bioinfo-scrounger.com/archives/n-rows-kth-element/ 2022-04-18T14:32:23.000Z 2022-04-18T14:34:13.502Z Selecting N rows from a dataset, or the K-th element from a macro variable, is a common data manipulation task. This post lists ways to resolve these questions.

#### First or Last N Rows

For the `proc sql` method, either `inobs=` or `outobs=` can be used to select N rows from a dataset, but note the difference: `inobs=` limits the number of rows read from each input table, while `outobs=` limits the number of rows written to the output, which matters when you join tables.

``
proc sql inobs = 5 /*outobs=5*/;
    create table cls as
        select * from sashelp.class;
quit;
``

For a data step instead of `proc sql`, the most straightforward method is the `obs=` option, which is very similar to the SQL method. If you would like to select a range of rows, add the `firstobs=` option as well.

``
data raw_o;
    set sashelp.class(firstobs = 5 obs = 10);
run;
``

Utilizing the automatic `_N_` variable with an `IF` statement is, I suppose, more flexible sometimes.

``
data raw_1;
    set sashelp.class;
    if 5 <= _N_ <= 10 then output;
run;
``

So how about selecting the last N rows? We have to know the total number of rows first, for example by storing it in a macro variable, and then use the `_N_` variable again.

``
/* store the total number of rows in &n_rows first */
proc sql noprint;
    select count(*) into: n_rows trimmed from sashelp.class;
quit;

data raw_2;
    set sashelp.class;
    if &n_rows.-4 <= _N_ <= &n_rows. then
        output;
run;
``

By the way, to select N observations randomly, we can use the `proc surveyselect` procedure with `method = srs` (simple random sampling) to get 5 random rows from this dataset.

``
proc surveyselect data = sashelp.class out = rd_class
    method = srs sampsize = 5 seed = 123456;
run;
``

Besides the overall row number, we can also add a new row number within each group, as in the following example:

``
proc sort data = sashelp.class out = sorted_class;
    by age;
run;

data sorted_class_2;
    set sorted_class;
    by age;
    if first.age then new_row_number = 0;
    new_row_number + 1;
run;
``
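The `first.age` counter logic maps onto a simple group walk. A Python sketch with `itertools.groupby` (the names and ages are made up, and as with SAS `by` processing, the rows must already be sorted by the group key):

```python
from itertools import groupby
from operator import itemgetter

def add_row_number_by_group(rows, key):
    """Append a 1-based row number that restarts within each group;
    rows must already be sorted by `key`, like SAS `by` processing."""
    numbered = []
    for _, group in groupby(rows, key=itemgetter(key)):
        for n, row in enumerate(group, start=1):
            numbered.append({**row, "new_row_number": n})
    return numbered

rows = [{"name": "Joyce", "age": 11}, {"name": "Thomas", "age": 11},
        {"name": "James", "age": 12}]
numbered = add_row_number_by_group(rows, key="age")
# The counter restarts when age changes: 1, 2, 1
```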

#### K-th Elements from Macro Variable

First off, we need to create a macro variable to store information. If we just want to store a single value:

``
proc sql;
    select count(name) into: n_name trimmed from sashelp.class;
quit;
%put &n_name;
``

Storing multiple values is also very similar.

``
proc sql;
    select count(name), mean(height) format=10.2
        into: n_name trimmed, :mean_height trimmed
    from sashelp.class;
quit;
%put &n_name &mean_height;
``

Or we may simply want the values to be assigned to a list of macro variables.

``
proc sql;
    select distinct(name) into: n1-:n19 from sashelp.class;
quit;
``

But in the above example you have to know the total number of distinct values in advance. So perhaps the more common way is to store the column values in one macro variable, separated by any delimiter you want.

``
proc sql;
    select distinct(name) into: nameList separated by ' '
        from sashelp.class;
/*    %let numNames = &sqlobs;*/
quit;
%put &nameList.;
``

Suppose I just want the second element of this macro variable. What should I do? The `%scan` function is enough for our purpose.

``%put %scan(&nameList,2);``
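For comparison, `%scan` is the macro-language analogue of splitting a delimited string and indexing into it with a 1-based position. An illustrative Python stand-in (the `scan` helper and its sample names are mine, not SAS code):

```python
def scan(text, k, delimiter=" "):
    """Return the k-th word of a delimited string (1-based),
    loosely mimicking the SAS %scan function; out-of-range
    positions return an empty string."""
    words = text.split(delimiter)
    return words[k - 1] if 1 <= k <= len(words) else ""

name_list = "Alfred Alice Barbara"
second = scan(name_list, 2)  # the second element
```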

Obviously, this is not as convenient as extracting an element from a vector in R, but it is enough for use in SAS.

Another way is to loop over the elements of the macro variable, as below, assigning each element to a new column.

``
%let cntName = %sysfunc(countw(&nameList));

data raw;
    array names[&cntName] $200 name1-name&cntName;
    do i = 1 to &cntName;
        names[i] = scan("&nameList", i);
    end;
    drop i;
run;
``

Beyond that, many good papers have been published on how to store and manipulate lists in SAS, such as Choosing the Best Way to Store and Manipulate Lists in SAS

#### Reference

]]>
<p>Selecting N rows from a dataset or K-th element from a macro variable is a common data manipulation process. This post is listing the ways how to resolve these questions.</p>
Handling Duplicates and Missing values - SASlearner http://www.bioinfo-scrounger.com/archives/headling-duplicates-missings/ 2022-04-13T12:04:52.000Z 2022-04-13T12:08:14.331Z Handling duplicates and missing values is a very common data manipulation process. This post takes a few examples to show how to accomplish it in SAS.

In R, I prefer `unique()` or `dplyr::distinct()` to remove duplicates, and `is.na()` and `na.omit()`, or external packages like `mice`, to handle missing values.

#### Duplicates

We can use the `proc sort` to remove rows that have duplicate values across all columns of the dataset.

``
proc sort data = sashelp.cars(keep = make type origin) out = without_dups nodupkey;
    by _all_;
run;
``

In some conditions, we would like to keep only distinct rows as per a specific column, retaining the first row for each value of that column.

``
proc sort data = sashelp.cars out = make_without_dups nodupkey;
    by Make;
run;
``

#### Missing Values

In clinical trial data, missing values are a common occurrence when no data is stored for a variable in an observation. A numeric missing value is displayed as a single period (`.`), and a character missing value as a blank.

As we know, the reasons for missing values can be summarized as below:

• Missing completely at random (MCAR)
• Missing at random (MAR), not completely random
• Not missing at random (NMAR)

So how to handle the missing values?

##### Removing observations

Suppose we did a reaction time study with six subjects, each measured three times. The data is shown below.

``
data times;
    input id trial1 trial2 trial3;
    cards;
1 1.5 1.4 1.6
2 1.5  .  1.9
3  .  2.0 1.6
4  .   .  2.2
5 2.1 2.3 2.2
6 1.8 2.0 1.9
;
run;
``

As shown below, we can use some useful functions to count missing values, such as `nmiss` for numeric and `cmiss` for character variables, or `missing` to test whether an argument contains a missing value. We can then keep only the rows with no missing values.

``
data raw_0;
    set times (where = (nmiss(trial1,trial2,trial3) = 0));
run;
``

Or just flag a specific variable, like the `trial1` column.

``
data raw_1;
    set times;
    missing_flag = missing(trial1);
run;
``
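The same `nmiss`-style filtering is easy to sketch outside SAS as well (illustrative Python, with `None` standing in for the SAS numeric missing `.`; the rows echo the `times` dataset above):

```python
def n_missing(*values):
    """Count missing values, like the SAS nmiss function."""
    return sum(v is None for v in values)

trials = [
    {"id": 1, "trial1": 1.5, "trial2": 1.4, "trial3": 1.6},
    {"id": 2, "trial1": 1.5, "trial2": None, "trial3": 1.9},
    {"id": 4, "trial1": None, "trial2": None, "trial3": 2.2},
]
complete = [r for r in trials
            if n_missing(r["trial1"], r["trial2"], r["trial3"]) == 0]
# Only subject 1 has no missing trials
```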
##### Replacing Values

First, let's replace all missing values with zero in every numeric column, in a simple way: create an implicit array `NumVar` holding all numeric variables in the dataset, and loop over it. If you just want to replace one column, put that variable name in place of `_numeric_`.

``
data raw_3;
    set times;
    array NumVar _numeric_;
    do over NumVar;
        if NumVar = . then
            NumVar = 0;
    end;
run;
``

If your problem is more complicated, such as replacing missing values with the mean instead of zero, how would we address it? I suppose `proc stdize` is a good solution.

``
/*proc stdize data = times out = stdize_vars reponly missing = 0; run;*/
proc stdize data = times out = stdize_vars reponly method = mean;
    var trial1 trial2; /* or _numeric_, or empty */
run;
``
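The `method = mean` replacement amounts to: compute a column's mean over the non-missing values, then fill the gaps with it. A minimal Python sketch of that idea (illustrative only, not what `proc stdize` does internally):

```python
def impute_mean(rows, column):
    """Replace missing (None) values in `column` with the mean of
    the observed values, like proc stdize reponly method=mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, column: mean if r[column] is None else r[column]}
            for r in rows]

rows = [{"trial1": 1.5}, {"trial1": None}, {"trial1": 2.1}]
filled = impute_mean(rows, "trial1")
# The missing value becomes (1.5 + 2.1) / 2 = 1.8
```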
##### Imputation Method

Imputing missing values is a complicated process that works well only if you select the correct method for the specific variables. I will not go further into how to do it in SAS for now, since I prefer to use R for imputation.

Here I just list a few useful SAS procedures, so that I can recall them later if needed.

• `proc hpimpute`
• `PROC MI`, `PROC REG`, `PROC MIANALYZE`
• `proc surveyimpute`

Hope the above notes are helpful for you.

#### Reference

]]>
<p>Handling the duplicates and missing values in data manipulation is a very common process. This post is taking a few examples to list how to accomplish it from a datasets in SAS.</p>
Displaying Descriptive Statistics for Variables - SASlearner http://www.bioinfo-scrounger.com/archives/descriptive-statistics/ 2022-04-11T14:21:45.000Z 2022-04-11T14:23:33.819Z This post talks about how to display descriptive statistics for variables quickly, in the sense that we would like a simple and agile way to accomplish it in SAS.

The following examples show how to resolve the questions below (very simple, but quite common):

• How to count distinct values
• How to count variables by group
• How to produce the frequency table of variables
• How to calculate the statistics for variables

In R, `Hmisc::describe` is one option, but not the only one; other external packages or `base` functions like `summary` also work very well.

#### Count Values or Distinct Values

Here we use the `proc sql` procedure with the SAS dataset `sashelp.BirthWgt` to count the `Race` variable.

``
proc sql;
    select count(Race) as cnt_race
        from sashelp.BirthWgt;
quit;
``

But just counting the total number of `Race` values does not make much sense. Suppose we would like to count the `Married` values grouped by the `Race` variable:

``
proc sql;
    select Race, count(Married) as cnt_married
        from sashelp.BirthWgt
        group by Race;
quit;
``

If you want to count distinct values, add `distinct` inside the `count` function.

``
proc sql;
    select count(distinct Married) as distinct_married
        from sashelp.BirthWgt;
quit;
``
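In plain terms, `count` and `count(distinct ...)` correspond to counting non-missing values versus counting the unique non-missing values. A rough Python sketch of that distinction (illustrative, not full SQL semantics; the `married` sample is made up):

```python
def count_values(values, distinct=False):
    """Count non-missing values; with distinct=True, count unique
    non-missing values, a rough analogue of COUNT / COUNT(DISTINCT)."""
    present = [v for v in values if v is not None]
    return len(set(present)) if distinct else len(present)

married = ["Yes", "No", "Yes", None, "Yes"]
total = count_values(married)                  # non-missing values
unique = count_values(married, distinct=True)  # distinct values
```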

#### Frequency Table

We can use `proc freq` to create frequency tables for one or more variables. For example, here is a table for the `SomeCollege` variable including missing values, grouped by `Race`, with the output saved to a `result` dataset including cumulative frequencies and percentages.

``
/* sashelp is read-only, so sort into a work copy first */
proc sort data = sashelp.BirthWgt out = BirthWgt;
    by Race;
run;

proc freq data = BirthWgt;
    tables SomeCollege /out=result missing outcum;
    by Race;
run;
``
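A minimal stand-in for such a one-way frequency table, with percent and cumulative columns, can be written with `collections.Counter` (illustrative Python; the sample values are made up, and `proc freq` output has more detail):

```python
from collections import Counter

def freq_table(values):
    """One-way frequency table with percent and cumulative columns,
    loosely mirroring proc freq output."""
    counts = Counter(values)
    total = sum(counts.values())
    table, cum = [], 0
    for value, count in counts.most_common():
        cum += count
        table.append({"value": value, "count": count,
                      "percent": 100 * count / total,
                      "cum_count": cum,
                      "cum_percent": 100 * cum / total})
    return table

table = freq_table(["Yes", "No", "Yes", "Yes"])
# Most frequent level first; cumulative percent ends at 100
```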

By the way, if you add a statistics option like `chisq`, the output will also include the chi-square test statistics.

#### Descriptive Statistics

Alternatively, we can use `proc tabulate` to display multiple statistics quickly in a table.

``
proc tabulate data = sashelp.cars;
    var weight;
    table weight * (N Min Q1 Median Mean Q3 Max);
run;
``

But I think `proc means` is more convenient for saving the output:

``
proc means data = sashelp.cars n nmiss mean std median p25 p75 min max;
    var weight;
    output out=weight_tbl n=n nmiss=nmiss mean=mean std=std median=median
        p25=p25 p75=p75 min=min max=max;
run;
``
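For reference, the same summary statistics are easy to reproduce outside SAS. A Python sketch with the standard `statistics` module (the data values are made up, and the quartile convention may differ slightly from `proc means` defaults):

```python
import statistics

def describe(values):
    """n, mean, std, median and quartiles for a numeric column,
    roughly matching the proc means request above."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "median": median,
        "p25": q1,
        "p75": q3,
        "min": min(values),
        "max": max(values),
    }

stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```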