I find it difficult to understand what "LS means" actually means in a literal sense.

The definition from the `lsmeans` package (which has since transitioned to the `emmeans` package) is shown below:

> Least-squares means (LS means for short) for a linear model are simply predictions—or averages thereof—over a regular grid of predictor settings which I call the reference grid.

In fact, even after reading this sentence, I was still very confused. What is the reference grid, and how are the predictions made?

So let's see how the LS means are calculated, along with the corresponding confidence intervals.

First, import the CDISC pilot dataset, the same one used in the previous blog article, Conduct an ANCOVA model in R for Drug Trial. Then process `adsl` and `adlb` to create an analysis dataset `ana_dat`, so that we can run ANCOVA with the `lm` function. Suppose we want to see whether `CHG` (change from baseline) is affected by the independent variable `TRTP` (treatment) while controlling for the covariates `BASE` (baseline) and `AGE` (age).

Filter the dataset on the `BASE` variable, as it contains one missing value.

```r
library(tidyverse)
library(emmeans)
ana_dat2 <- filter(ana_dat, !is.na(BASE))
```

Then fit the ANCOVA model with the `lm` function.

```r
fit <- lm(CHG ~ BASE + AGE + TRTP, data = ana_dat2)
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9524 0.3322
# AGE        1   0.001  0.0010  0.0006 0.9811
# TRTP       2   8.343  4.1715  2.3385 0.1034
# Residuals 76 135.570  1.7838
```

We know that LS means are calculated from a reference grid, which contains the mean of each covariate and all levels of the independent factor.

```r
rg <- ref_grid(fit)
rg
# 'emmGrid' object with variables:
#     BASE = 5.4427
#     AGE = 75.309
#     TRTP = Placebo, Xanomeline Low Dose, Xanomeline High Dose
```

The means of `BASE` and `AGE` are `5.4427` and `75.309`, respectively, as shown above. We can also calculate them manually:

```r
summary(ana_dat2[, c("BASE", "AGE")])
#      BASE             AGE       
# Min.   : 3.497   Min.   :51.00  
# 1st Qu.: 4.774   1st Qu.:71.00  
# Median : 5.273   Median :77.00  
# Mean   : 5.443   Mean   :75.31  
# 3rd Qu.: 5.718   3rd Qu.:81.00  
# Max.   :10.880   Max.   :88.00
```

Then we can use the `summary()` or `predict()` function to get the predicted values based on the reference grid `rg`.

```r
rg_pred <- summary(rg)
rg_pred
#  BASE  AGE TRTP                 prediction    SE df
#  5.44 75.3 Placebo                  0.0578 0.506 76
#  5.44 75.3 Xanomeline Low Dose     -0.1833 0.211 76
#  5.44 75.3 Xanomeline High Dose     0.5031 0.235 76
```

The prediction column is the same as the output of `predict(rg)`. The table shows the predicted values for the different factor levels with the covariates held constant at their means.

In fact, we can also calculate the predicted values ourselves, since we have the coefficient estimates of the regression equation from `fit$coefficients`:

```r
fit$coefficients
#              (Intercept)                     BASE                      AGE 
#              -1.11361290               0.11228582               0.00743963 
#  TRTPXanomeline Low Dose TRTPXanomeline High Dose 
#              -0.24108746               0.44531274
```

As `TRTP` has multiple levels, it has been converted into dummy variables:

```r
contrasts(ana_dat2$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1
```
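The same treatment-contrast coding can be sketched outside R (a toy helper for illustration, not part of `emmeans`): the first level acts as the reference, and every other level gets its own indicator column.

```python
# Treatment-contrast (dummy) coding for a 3-level factor:
# one indicator column per non-reference level.
levels = ["Placebo", "Xanomeline Low Dose", "Xanomeline High Dose"]

def dummy_code(value, levels):
    # the first level is the reference, so it maps to all zeros
    return [1 if value == lev else 0 for lev in levels[1:]]

print(dummy_code("Xanomeline Low Dose", levels))  # [1, 0]
```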

Now, if we want to calculate the predicted value for the `Xanomeline Low Dose` level, it goes as follows:

```r
0.11229*5.44 + 0.00744*75.3 - 0.24109*1 - 1.11361
# [1] -0.1836104
```
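The same arithmetic with the full-precision coefficients and reference-grid values from above (a plain re-computation, not an `emmeans` call):

```python
# Reproduce the manual prediction for "Xanomeline Low Dose" from the
# fitted coefficients (values taken from fit$coefficients above).
coef = {
    "(Intercept)": -1.11361290,
    "BASE": 0.11228582,
    "AGE": 0.00743963,
    "TRTPXanomeline Low Dose": -0.24108746,
    "TRTPXanomeline High Dose": 0.44531274,
}

# Reference-grid settings: covariates at their means, dummy = 1 for Low Dose.
x = {"BASE": 5.4427, "AGE": 75.309, "TRTPXanomeline Low Dose": 1.0}

pred = coef["(Intercept)"] + sum(coef[k] * v for k, v in x.items())
print(round(pred, 4))  # -0.1833, matching the reference-grid prediction
```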

Back to LS means: from the definition, they appear to be the averages of the predicted values.

```r
rg_pred %>%
  group_by(TRTP) %>%
  summarise(LSmean = mean(prediction))
# # A tibble: 3 × 2
#   TRTP                 LSmean
#   <fct>                 <dbl>
# 1 Placebo              0.0578
# 2 Xanomeline Low Dose -0.183 
# 3 Xanomeline High Dose 0.503
```

These are exactly the same results as `lsmeans(rg, "TRTP")` from the `emmeans` package. Calling `emmeans(fit, "TRTP")` directly gives the same results as well.

```r
lsmeans(rg, "TRTP")
# TRTP                 lsmean    SE df lower.CL upper.CL
# Placebo              0.0578 0.506 76   -0.949    1.065
# Xanomeline Low Dose -0.1833 0.211 76   -0.603    0.236
# Xanomeline High Dose 0.5031 0.235 76    0.036    0.970
```

The degrees of freedom are `76`: with `81` observations, we subtract `2` for `TRTP`, `1` for each of the two covariates, and `1` for the intercept, so the residual DF is `81-2-1-1-1=76`.
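The same bookkeeping in a few lines (the observation count of 81 comes from the dataset above):

```python
# Residual degrees of freedom: observations minus estimated parameters
n = 81                 # non-missing observations in ana_dat2
p = 1 + 1 + 1 + 2      # intercept + BASE + AGE + two TRTP dummy variables
df_resid = n - p
print(df_resid)  # 76
```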

Using `test()`, we can get the p-value when comparing each LS mean to zero.

```r
test(lsmeans(fit, "TRTP"))
# TRTP                 lsmean    SE df t.ratio p.value
# Placebo              0.0578 0.506 76   0.114  0.9093
# Xanomeline Low Dose -0.1833 0.211 76  -0.870  0.3869
# Xanomeline High Dose 0.5031 0.235 76   2.145  0.0351
```

In fact, `t.ratio` is the t statistic, so we can calculate the p-values manually, like:

```r
2 * pt(abs(0.114), 76, lower.tail = FALSE)
2 * pt(abs(-0.870), 76, lower.tail = FALSE)
2 * pt(abs(2.145), 76, lower.tail = FALSE)
```

Likewise, the confidence interval of an LS mean can be calculated manually from its `SE` and `DF`, for example for the Placebo level:

```r
0.0578 + c(-1, 1) * qt(0.975, 76) * 0.506
# [1] -0.9499863  1.0655863
```
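The same interval in plain arithmetic, taking the critical value `qt(0.975, 76) ≈ 1.99167` from R as a given:

```python
# 95% CI of the Placebo LS mean: estimate +/- t_crit * SE
lsmean, se = 0.0578, 0.506
t_crit = 1.99167          # qt(0.975, 76), taken from R output

lower = lsmean - t_crit * se
upper = lsmean + t_crit * se
print(round(lower, 3), round(upper, 3))  # close to -0.950 and 1.066
```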

I think these steps go a long way toward understanding the meaning of least-squares means and the logic behind them. I hope this is helpful.

“emmeans” package

Estimation model for least-squares means (最小二乘均值的估计模型)

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Confidence intervals and tests in emmeans

Least-squares Means: The R Package lsmeans

As an example dataset, I'll use `cdiscpilot01` from CDISC, which contains the ADaM and SDTM datasets for a single study. Our purpose is to conduct an efficacy analysis by ANCOVA with LS mean estimation. Suppose we want to know whether or not the treatment has an impact on `Glucose` while accounting for the glucose baseline. The patients are limited to those who reached the `end of treatment` visit and were not discontinued due to an AE.

ANCOVA makes several assumptions about the input data, such as:

- Linearity between the covariate and the outcome variable
- Homogeneity of regression slopes
- The outcome variable should be approximately normally distributed
- Homoscedasticity
- No significant outliers

Perhaps we need an additional article to discuss how to check these assumptions, but not here. So we suppose that all the assumptions have been met for the ANCOVA.

Install and load the following required packages, then load the `adsl` and `adlbc` datasets from the `cdiscpilot01` study, which is described in another article: Example of SDTM and ADaM datasets from the CDISC.

```r
library(tidyverse)
library(emmeans)
library(gtsummary)
library(multcomp)
adsl <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adsl.xpt")
adlb <- haven::read_xpt(file = "./phuse-scripts/data/adam/cdiscpilot01/adlbc.xpt")
```

Per our purpose, we need to filter the efficacy population and focus on the `Glucose (mg/dL)` lab test.

```r
gluc <- adlb %>%
  left_join(adsl %>% select(USUBJID, EFFFL), by = "USUBJID") %>%
  # PARAMCD is the parameter code; here we focus on Glucose (mg/dL)
  filter(EFFFL == "Y" & PARAMCD == "GLUC") %>%
  arrange(TRTPN) %>%
  mutate(TRTP = factor(TRTP, levels = unique(TRTP)))
```

Then produce the analysis dataset by filtering for the target patients who reached the end of treatment and were not discontinued due to an AE.

```r
ana_dat <- gluc %>%
  filter(AVISIT == "End of Treatment" & DSRAEFL == "Y") %>%
  arrange(SUBJID, AVISITN) %>%
  mutate(AVISIT = factor(AVISIT, levels = unique(AVISIT)))
```

Once we have the dataset for analysis, we should examine it first. I find that the `tbl_summary` function in the `gtsummary` package can calculate descriptive statistics and produce a very nice clinical-style table, as shown below:

```r
ana_dat %>%
  dplyr::select(AGEGR1, SEX, RACE, TRTP, AVAL, BASE, CHG) %>%
  tbl_summary(by = TRTP, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is from cdiscpilot01 study.*"))
```

Here we can see the descriptive summary for each variable by treatment group. We could certainly also do some visualization, such as boxplots or scatterplots, but that is not presented here.

We use the `lm` function to fit the ANCOVA model, with treatment (`TRTP`) as the independent variable, change from baseline (`CHG`) as the response variable, and baseline (`BASE`) as the covariate.

```r
fit <- lm(CHG ~ BASE + TRTP, data = ana_dat)
summary(fit)
```

The summary output for the regression coefficients is as follows. If you would like an ANOVA table instead, use `anova(fit)` rather than `summary(fit)`.

```
Call:
lm(formula = CHG ~ BASE + TRTP, data = ana_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1744 -0.7627 -0.0680  0.5633  5.0349 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.5579     0.8809  -0.633    0.528
BASE                       0.1111     0.1329   0.837    0.405
TRTPXanomeline Low Dose   -0.2192     0.5433  -0.404    0.688
TRTPXanomeline High Dose   0.4447     0.5528   0.804    0.424

Residual standard error: 1.328 on 77 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.06702,    Adjusted R-squared:  0.03068 
F-statistic: 1.844 on 3 and 77 DF,  p-value: 0.1462
```

From the above results, we can easily read off the regression coefficients and the fitted model, along with their significance when compared to zero. With the coefficients, we can predict any change based on baseline and treatment.

Besides, we can use the `contrasts` function to obtain the contrast matrix, to understand the dummy variables for `TRTP` in this multiple regression model.

```r
contrasts(ana_dat$TRTP)
#                      Xanomeline Low Dose Xanomeline High Dose
# Placebo                                0                    0
# Xanomeline Low Dose                    1                    0
# Xanomeline High Dose                   0                    1
```

From the ANOVA table shown below, it can be seen that the treatment has no statistically significant effect on the change in glucose after controlling for the effect of baseline.

```r
anova(fit)
# Analysis of Variance Table
#
# Response: CHG
#           Df  Sum Sq Mean Sq F value Pr(>F)
# BASE       1   1.699  1.6989  0.9629 0.3295
# TRTP       2   8.061  4.0304  2.2844 0.1087
# Residuals 77 135.853  1.7643
```

If you would like prettier output, `tbl_regression(fit)` can be used, as mentioned before.

If we want to obtain the least-squares (LS) means between treatment groups, either the `emmeans` or the `multcomp` package gives the same results. In addition, the process of calculating the LS means is well worth learning and understanding.

```r
# by multcomp
postHocs <- glht(fit, linfct = mcp(TRTP = "Tukey"))
summary(postHocs)

# by emmeans
fit_within <- emmeans(fit, "TRTP")
pairs(fit_within, reverse = TRUE)
```

The summary output is shown below:

```r
summary(postHocs)
#   Simultaneous Tests for General Linear Hypotheses
#
# Multiple Comparisons of Means: Tukey Contrasts
#
# Fit: lm(formula = CHG ~ BASE + TRTP, data = ana_dat)
#
# Linear Hypotheses:
#                                                 Estimate Std. Error t value Pr(>|t|)
# Xanomeline Low Dose - Placebo == 0               -0.2192     0.5433  -0.404   0.9116
# Xanomeline High Dose - Placebo == 0               0.4447     0.5528   0.804   0.6937
# Xanomeline High Dose - Xanomeline Low Dose == 0   0.6639     0.3113   2.132   0.0855 .
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# (Adjusted p values reported -- single-step method)
```

Looking at the adjusted p-values, it's clear that no significant difference was observed for any of the pairs `Low Dose vs. Placebo`, `High Dose vs. Placebo`, or `High Dose vs. Low Dose`; the last comparison approaches, but does not reach, the 0.05 level.

https://r4csr.org/efficacy-table.html#efficacy-table

ANCOVA in R

How to perform ANCOVA in R

An Introduction to ANCOVA (Analysis of Variance)

How to Conduct an ANCOVA in R

In addition to reading the corresponding procedure reference in the official documentation, I recommend using `ods trace on` to find the stat table names, and then extracting any table you want. Here is a regression analysis of the `sashelp.cars` dataset; let's see how to get the stat tables.

```sas
ods trace on; /* write ODS table names to log */
proc reg data=sashelp.cars plots=none;
  model Horsepower = EngineSize Weight;
quit;
ods trace off; /* stop writing to log */
```

The log output in SAS is as follows:

```
Output Added:
-------------
Name:       NObs
Label:      Number of Observations
Template:   Stat.Reg.NObs
Path:       Reg.MODEL1.Fit.Horsepower.NObs
-------------

Output Added:
-------------
Name:       ANOVA
Label:      Analysis of Variance
Template:   Stat.REG.ANOVA
Path:       Reg.MODEL1.Fit.Horsepower.ANOVA
-------------

Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.Horsepower.FitStatistics
-------------

Output Added:
-------------
Name:       ParameterEstimates
Label:      Parameter Estimates
Template:   Stat.REG.ParameterEstimates
Path:       Reg.MODEL1.Fit.Horsepower.ParameterEstimates
-------------
```

By looking at the output, you can find each stat table's name, like `ParameterEstimates`. That means you can extract it by adding an `ods output ParameterEstimates=rst` statement to store the table in the `rst` dataset, as follows:

```sas
proc reg data=sashelp.cars plots=none; /* same procedure call */
  model Horsepower = EngineSize Weight;
  ods output ParameterEstimates=rst; /* the data set name is 'rst' */
quit;
```

Multiple stat tables can be stored with one `ods output` statement. For example, the statement below stores both the ParameterEstimates table and the ANOVA table at the same time.

```sas
proc reg data=sashelp.cars plots=none;
  model Horsepower = EngineSize Weight;
  ods output ParameterEstimates=parms ANOVA=anvar;
quit;
```

And then, if you want to create a macro variable that contains the value of a certain statistic, such as the slope for EngineSize:

```sas
data _null_;
  set rst;
  if variable="EngineSize" then call symputx("slope1", estimate);
run;
%put &=slope1;
```

Several procedures provide an alternative option for creating output similar to the `ods output` mentioned above, for instance the `outest` option of `proc reg`.

```sas
proc reg data=sashelp.cars noprint outest=rst2 rsquare; /* statistics in 'rst2' */
  model Horsepower = EngineSize Weight;
quit;
```

So you'd better check the SAS documentation to see whether the procedure you use offers such an option.

All of the above refers to the following articles:

ODS OUTPUT: Store any statistic created by any SAS procedure

Find the ODS table names produced by any SAS procedure

A SAS macro to combine portrait and landscape rtf files into one single file

In order to make it suitable for all of the following conditions, I will additionally update it to be more flexible:

- containing multiple tables, figures and listings at the same time
- using the titles as the entries of the table of contents
- ordering the files manually (just a proposed solution, not implemented yet)

First of all, let's look at the RTF structure, which is described in that article.

An RTF file is divided into three parts: an opening section, a content section, and a closing section. If we look at any one of our single RTF files, the structure is the same. Consequently, the RTF combining process can be summarized as follows:

- Read all filenames into SAS (sorted by filename, or in a manually defined order).
- Keep the opening section of the first RTF.
- Remove both the opening and closing sections of every RTF except the first and the last, and add the `\pard\sect` code in front of `\sectd` so that the files can be combined correctly.
- Keep the closing section of the last RTF.
- Save the updated RTF code into separate SAS datasets (not a single dataset, as the character length is limited in SAS).
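The steps above can be sketched in a few lines (a hypothetical simplification for illustration; real RTF files need the full macro's handling of headers, page setup, and encodings):

```python
# Sketch of the combining logic: keep the opening of the first file,
# splice every later file at \sectd, and close with a single '}'.
def combine_rtf(texts):
    parts = []
    for i, txt in enumerate(texts):
        txt = txt.rstrip()
        if txt.endswith("}"):
            txt = txt[:-1]                      # drop each file's closing brace
        if i > 0:
            pos = txt.find(r"\sectd")
            txt = r"\pard\sect" + txt[pos:]     # drop the opening section
        parts.append(txt)
    return "".join(parts) + "}"                 # restore one closing section

docs = [r"{\rtf1 head\sectd body1}", r"{\rtf1 head\sectd body2}"]
combined = combine_rtf(docs)
print(combined)
```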

Now let's look at the code for this process. First, I import the RTF filenames from the external folder.

```sas
data refList(keep=filepath fn);
  length fref $8 fn $80 filepath $400;
  rc = filename(fref, "&inpath");
  if rc = 0 then dirid = dopen(fref);
  if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
  nfiles = dnum(dirid);
  do i = 1 to nfiles;
    fn = dread(dirid, i);
    fid = mopen(dirid, fn);
    if fid > 0 and index(fn,"rtf") then do;
      filepath="&inpath\"||left(trim(fn));
      fn = strip(tranwrd(fn,".rtf",""));
      output;
    end;
  end;
  rc = dclose(dirid);
run;
```

Second, read each line of the RTF file until finding a line that starts with `\sectd`, which marks the boundary: everything above is the opening section and everything below is the content section. Also remove the final `}` from every RTF file except the last.

```sas
data rtfdt&i(where = (ptline=1));
  retain ptline;
  set rtfdt&i end = last;
  if substr(line,1,6)='\sectd' then do;
    ptline = 1;
    /* enable combining portrait and landscape rtf */
    line="\pard\sect"||compress(tranwrd(line,"\pgnrestart\pgnstarts1",""));
  end;
  if last and line^='}' then line=substr(strip(line),1,length(strip(line))-1);
  else if last and line='}' then delete;
run;
```

Third, when the title code is found in the RTF, replace `\pard` with `\pard\outlinelevel1` so that the title is picked up as an entry in the table of contents.

```sas
%if &titleindex = 1 %then %do;
  data rtfdt&i.;
    set rtfdt&i.;
    retain fl 0;
    if index(line,'\pard\plain\') and (not index(line,'\header\pard'))
      and (not index(line, '\footer\pard')) then fl=1+fl;
  run;

  data rtfdt&i;
    set rtfdt&i;
    by fl notsorted;
    if fl=1 and first.fl then
      /* add index for the contents as per titles */
      line=tranwrd(line,'\pard','\pard\outlinelevel1');
  run;
%end;
```

Finally, do not save the RTF contents above into one single SAS dataset, because the character length is limited in SAS. And append the `}` as the closing section to keep the RTF file complete.

The complete code is shown below:

```sas
/* Example */
/* %s_combrtf(inpath=&inpath, outpath=&outpath, outfile=&outfile); */
/* Parameter Description */
/* inpath     input path */
/* outpath    output path */
/* outfile    output file name */
/* titleindex whether to add title index, default is 1 */
%macro s_combrtf(inpath= ,outpath= ,outfile= ,titleindex=1);
  data refList(keep=filepath fn);
    length fref $8 fn $80 filepath $400;
    rc = filename(fref, "&inpath");
    if rc = 0 then dirid = dopen(fref);
    if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
    nfiles = dnum(dirid);
    do i = 1 to nfiles;
      fn = dread(dirid, i);
      fid = mopen(dirid, fn);
      if fid > 0 and index(fn,"rtf") then do;
        filepath="&inpath\"||left(trim(fn));
        fn = strip(tranwrd(fn,".rtf",""));
        output;
      end;
    end;
    rc = dclose(dirid);
  run;

  /* sort by filename by default */
  proc sort data = refList sortseq = linguistic(numeric_collation=on)
            out = sorted_refList;
    by fn;
  quit;

  data fileorder;
    set sorted_refList;
    FileLevel = 2;
    order = .;
  run;

  data _null_;
    set fileorder end=last;
    fnref=strip("filename fnref")||strip(_N_)||right(' "')||strip(filepath)||strip('" lrecl=5000 ;');
    call execute(fnref);
    if last then call symputx('maxn',vvalue(_n_), 'l');
  run;

  %do i=1 %to &maxn.;
    data rtfdt&i.;
      infile fnref&i. truncover;
      informat line $5000.;
      format line $5000.;
      length line $5000.;
      input line $1-5000;
      line=strip(line);
    run;

    /* add title index and adapt to be more flexible */
    %if &titleindex = 1 %then %do;
      data rtfdt&i.;
        set rtfdt&i.;
        retain fl 0;
        if index(line,'\pard\plain\') and (not index(line,'\header\pard'))
          and (not index(line, '\footer\pard')) then fl=1+fl;
      run;
      data rtfdt&i;
        set rtfdt&i;
        by fl notsorted;
        if fl=1 and first.fl then
          /* add index for the contents as per titles */
          line=tranwrd(line,'\pard','\pard\outlinelevel1');
      run;
    %end;

    %if &i.=1 %then %do;
      data final;
        set rtfdt&i(keep = line) end = last;
        if last and line^='}' then line=substr(strip(line),1,length(strip(line))-1);
        else if last and line='}' then delete;
      run;
    %end;

    %if &i.^=1 %then %do;
      data rtfdt&i(where = (ptline=1));
        retain ptline;
        set rtfdt&i end = last;
        if substr(line,1,6)='\sectd' then do;
          ptline = 1;
          /* enable combining portrait and landscape rtf */
          line="\pard\sect"||compress(tranwrd(line,"\pgnrestart\pgnstarts1",""));
        end;
        if last and line^='}' then line=substr(strip(line),1,length(strip(line))-1);
        else if last and line='}' then delete;
      run;
    %end;

    %if &i.=&maxn. %then %do;
      %local _cnt;
      data final;
        set final
        %do _cnt=2 %to &maxn;
          rtfdt&_cnt(keep = line)
        %end;
        ;
      run;
      data final;
        set final end = last;
        if last then line=strip(line)||strip("}");
      run;
    %end;
  %end;

  data _null_;
    file "&outpath\&outfile..rtf" lrecl=5000 nopad;
    set final;
    put line;
  run;
%mend;
```

This approach is, in my opinion, quite excellent, as it resolves the following issue: different companies put titles and footnotes in different places. Some place them in the header and footer sections, and some place them in the body of the RTF document. The macro above works no matter where you place the titles and footnotes.

A SAS macro to combine portrait and landscape rtf files into one single file

Combine multiple RTF files to one file

SM05: An Efficient Way to Combine RTF Files and Create Multi-Level Bookmarks and a Hyperlinked TOC

utl-sas-macro-to-combine-rtf-files-into-one-single-file

http://onbiostatistics.blogspot.com/2009/01/data-dredging-vs-data-mining-post-hoc.html

- Ad hoc refers to additional statistical analysis requests for the final report that come up after the final statistical analysis has been completed.
- Post hoc refers to additional statistical analysis requests arising from regulatory review comments after submission.

The same handling applies to both situations. For example, if additional variables are needed, the ADS dataset specification document must likewise be updated. The related documents can be attached as appendices to the SAP or to the validation plan/report, in which case the corresponding version must appear in the file name and title; they can also be saved as standalone documents.

A post-hoc analysis, i.e. an after-the-fact analysis, refers to defining additional groupings and proposing research hypotheses based on the characteristics of the data themselves, after data collection is complete, and then performing the statistical analysis.

Post-hoc analysis is often called data dredging or data fishing, and its motivation is often to obtain positive results. Therefore, post-hoc results are generally not accepted by drug regulatory authorities as evidence of drug efficacy.

Points to note during a clinical trial:

- All major changes during the clinical trial need to be documented;
- Ad-hoc analyses are discouraged (an ad-hoc analysis does not state a hypothesis before analyzing, which violates strict statistical principles, so it can only serve as an exploratory conclusion);
- Statistical judgments need to be based on an objective description and presentation of the clinical trial results.

A post-hoc analysis involves looking at the data after a study has concluded and trying to find patterns that were not primary objectives of the study. In other words, all analyses that were not pre-planned and were conducted as "additional" analyses after completing the experiment are considered post-hoc analyses. A post-hoc study is conducted using data that have already been collected. https://www.editage.com/insights/zh-hans/node/7139

While both post-hoc and ad-hoc analyses may be performed based on data or results we have already seen, an ad-hoc analysis typically occurs alongside the project, whereas a post-hoc analysis occurs strictly after the project, after the unblinding of the study, or after the pre-specified analysis results have been reviewed. In this sense, ad-hoc analysis is better than post-hoc analysis.

Looking back, these very basic usages are the code I use most often in everyday work.

set-keep: extract specific variables

```sas
/* set-keep: select variables */
data keep;
  set sashelp.class(keep=name sex);
  /* From the class dataset in the sashelp library: keep is like
     class[, c("name", "sex")] in R; keep extracts variables, drop removes them */
run;
```

set-rename: rename variables

```sas
/* set-rename: rename variables */
data keep;
  set sashelp.class(keep=name sex rename=(name=name_new sex=sex_new));
run;
```

set-where: conditional selection

```sas
/* set-where: select by condition */
data keep;
  set sashelp.class(keep=name sex where=(sex='M'));
run;
```

set-in: temporary variables

```sas
/* set-in: temporary flag variables */
/* Perhaps the biggest difference between SAS and R is that SAS does not hold
   content directly in memory but in datasets; to operate on dataset contents,
   you first assign them to temporary variables */
data keep;
  set one(in=a) two(in=b); /* stack one and two; a and b flag the source dataset */
  in_one=a;
  in_two=b; /* assign the temporary variables a and b to new variables in_one and in_two */
  if a=1 then flag=1; else flag=0; /* construct a new variable flag for the condition */
run;
```

set-nobs: count observations

```sas
/* set-nobs: total number of observations, like nrow in R */
data nobs11(keep=total);
  set sashelp.class nobs=total_obs; /* nobs passes the row count of class to the
                                       temporary variable total_obs */
  total=total_obs;
  output;
  stop;
run;
```

The counting works through `nobs=total_obs` together with `total=total_obs`: the row count is captured by `nobs`, passed to the temporary variable `total_obs`, and then assigned to the actual variable `total` for output.

Also note that `output` plus `stop` writes a single row; `stop` halts after the first observation, otherwise a 19x1 column would be generated, each cell filled with the number 19.

Dataset concatenation; here `obs=10` takes the first 10 rows.

```sas
/* set: dataset concatenation */
data concatenat;
  set sashelp.class sashelp.class(obs=10); /* stack datasets; (obs=10) is a slice */
run;
```

```sas
/* merge: horizontal merge */
proc sort data=chapt3.merge_a; by x; run;
proc sort data=chapt3.merge_c; by x; run;

data d;
  merge chapt3.merge_a chapt3.merge_c;
  by x;
run;
```

SAS requires the datasets to be sorted on the key before they can be merged.

- Sort: `proc sort data=libref.dataset; by variable; run;`
- Merge: `merge dataset1 dataset2; by x;`

Note that the merge needs a `by`, and `by` is a statement of its own.

- where-between/and

Combining `set` and `where`, as above, works well. There are more possibilities:

```sas
where x between 10 and 20;  /* x in [10, 20] */
where x not between 10 and 20;
where x between y*0.1 and y*0.5;
where x between 'a' and 'c';
```

`where ... between ... and` can serve as one form of slicing, as can the dataset option `(obs=10)`.

```sas
where x in (1, 2); /* select observations where the variable equals certain values */
```

This selects observations whose variable takes certain values.

Applying `where` to missing values:

```sas
/* where: select missing values */
where x is missing;
where x is null; /* for numeric variables, locate missing values, like is.na() */
```

This is a bit like the `is.na()` function in R.

```sas
where x;        /* select observations where numeric x is neither 0 nor missing */
where x and y;  /* for character variables, select observations */
where x ne '';
```

There are also some special shorthand forms: for example, `where x` directly selects the non-zero, non-missing observations, which is quite convenient, and `x ne ''` means x is not blank.

Selecting character values with `where`:

```sas
where x like 'D_an';
where x like '%ab%' or x like '%z%';
/* character matching: an underscore stands for one character, % for any number of characters */
```

As in SQL, `like` is used to match character content. Note that `D_an` matches D and an with exactly one character in between, while `D%an` allows many characters in between.
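The wildcard rules map directly onto regular expressions; a small sketch of how such `like` patterns match (a toy helper, not SAS code):

```python
import re

def sql_like(pattern, s):
    # translate SQL LIKE wildcards: '_' matches one character, '%' any run
    regex = "".join(
        "." if ch == "_" else ".*" if ch == "%" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, s) is not None

print(sql_like("D_an", "Dean"))    # True: '_' fills exactly one character
print(sql_like("D%an", "Duncan"))  # True: '%' fills any number of characters
```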

```sas
/* append: base= data= force */
/* base is the master dataset, data is the dataset to append;
   force forces the append and is rarely used */
proc append base=null data=sashelp.class(where=(sex='M'));
run;
```

`proc append` performs the append: `proc append base=<base dataset> data=<dataset to append>`.

I suggest two methods to implement this requirement:

- do it in `proc report`
- do it before `proc report`

Let's say we have the example dataset `sashelp.class` and wish to divide it into two groups so that we can present them separately.

For the first method, we need to add a `gr` variable for grouping, such as female and male. (Note that `SEX` in `sashelp.class` is coded `M`/`F`.)

```sas
data final;
  set sashelp.class;
  if sex="M" then gr=1;
  else gr=2;
run;

proc sort data = final;
  by gr age;
run;
```

Then, in the `proc report` procedure, add one line of code, `compute after gr;`, as shown below. You can see it works.

```sas
proc report data=final;
  column gr name sex age height weight;
  define gr / group order noprint;
  compute after gr;
    line @1 "";
  endcomp;
run;
```

For the second method, we simply add one blank row to the `final` dataset using `call missing(of _all_)`.

```sas
data final2;
  set final;
  by sex;
  output;
  if last.sex then do;
    call missing(of _all_);
    output;
  end;
  drop gr;
run;
```

Just a trick; I hope it helps anyone who's learning SAS.

If you want to review the full datasets of the pilot project, I would recommend cloning the cdisc-org/sdtm-adam-pilot-project git repository, as follows:

```
git clone https://github.com/cdisc-org/sdtm-adam-pilot-project.git
```

Otherwise, if necessary, you can also get the CDISC pilot project from the `phuse-scripts` repository.

```
git clone https://github.com/phuse-org/phuse-scripts.git
```

Supposing you just want to get the SDTM and ADaM datasets and run some tests in R, I would import them from R packages such as `admiral`. Install it and try it right away.

```r
library(admiral)
data(admiral_adsl)
adsl <- admiral_adsl
head(adsl[1:5, 1:5])
# A tibble: 5 × 5
#   STUDYID      USUBJID     SUBJID RFSTDTC    RFENDTC   
#   <chr>        <chr>       <chr>  <chr>      <chr>     
# 1 CDISCPILOT01 01-701-1015 1015   2014-01-02 2014-07-02
# 2 CDISCPILOT01 01-701-1023 1023   2012-08-05 2012-09-02
# 3 CDISCPILOT01 01-701-1028 1028   2013-07-19 2014-01-14
# 4 CDISCPILOT01 01-701-1033 1033   2014-03-18 2014-04-14
# 5 CDISCPILOT01 01-701-1034 1034   2014-07-01 2014-12-30
```

You can also check the package's function list directly to see which ADaM datasets are available for use, such as `adae`, `adeg`, `advs`, and so on.

If you want to import SDTM datasets, use a function like `data(admiral_ae)` from the `admiral.test` package.

```r
library(admiral.test)
data(admiral_ae)
ae <- admiral_ae
head(ae[1:5, 1:5])
# A tibble: 5 × 5
#   STUDYID      DOMAIN USUBJID     AESEQ AESPID
#   <chr>        <chr>  <chr>       <dbl> <chr> 
# 1 CDISCPILOT01 AE     01-701-1015     1 E07   
# 2 CDISCPILOT01 AE     01-701-1015     2 E08   
# 3 CDISCPILOT01 AE     01-701-1015     3 E06   
# 4 CDISCPILOT01 AE     01-701-1023     3 E10   
# 5 CDISCPILOT01 AE     01-701-1023     1 E08
```

I find it quite practical, don't you? In addition to the `admiral` package, I also find that the `r2rtf` and `clinUtils` R packages contain example datasets from the CDISC pilot project. But neither of them is quite complete, so they are just an alternative option.

Overall, from my perspective, these example datasets are a great resource for developing R packages or Shiny dashboards for pharmaceutical use.

https://github.com/phuse-org/phuse-scripts

https://pharmaverse.github.io/admiral/index.html

https://github.com/pharmaverse/pharmaverse

https://github.com/atorus-research/CDISC_pilot_replication

https://cran.r-project.org/web/packages/admiral.test/admiral.test.pdf

https://rdrr.io/cran/clinUtils/f/vignettes/clinUtils-vignette.Rmd

First of all, let me share a good resource about SAS macros related to our topic. It lists a series of useful macros, and the code is very standard and worth learning; highly recommended.

For R, I recommend the `fs` package, which provides a cross-platform, uniform interface to file system operations.

And then let's begin with our topic.
And then let's begin with our topics.

Suppose you want to list the files in a particular directory. In R you can simply use `list.files()`, for example to list the files in the current directory:

```r
list.files(path = "./")
```

You can also include files within subfolders and match only `.txt` files (note that `pattern` is a regular expression, so the dot should be escaped):

```r
list.files(path = "./", pattern = "\\.txt$", recursive = TRUE)
```

In SAS, to the best of my knowledge, there are just two approaches. The first is to use a `filename` statement with the `pipe` device type and the `dir` command in a Windows environment.

```sas
filename dirlist pipe 'dir /b E:\Tp\*.txt';

data list;
  length line $200;
  infile dirlist;
  input;
  line = strip(_infile_);
run;

filename dirlist clear;

proc print data=list;
run;
```

The second approach is to use the `dopen` and `dread` functions, with help from `dnum`, as in the following example.

```sas
filename root "E:\Tp";

data list;
  * -- return variables --;
  length name $ 512;
  * -- directory to inventory --;
  dirid = dopen('root');
  if dirid <= 0 then putlog 'ERR' 'OR: Unable to open directory.';
  nfiles = dnum(dirid);
  do i = 1 to dnum(dirid);
    * -- directory item name --;
    name = dread(dirid, i);
    output;
  end;
  rc = dclose(dirid);
  dirid = 0;
run;
```

More details on the second approach can be found in these links.

Suppose you want to check whether a file called `README.md` exists in the current directory. In R you can use the `file.exists()` function, which returns `TRUE` if the file exists and `FALSE` otherwise.

```r
file.exists("./README.md")
```

In SAS, the `fileexist` function verifies the existence of a file, returning 1 if the file exists and 0 otherwise.

```sas
%let fpath=E:\Tp\test.sas;

%macro fileexists(filepath);
  %if %sysfunc(fileexist(&filepath)) %then
    %put NOTE: The external file &filepath exists.;
  %else
    %put ERROR: The external file &filepath does not exist.;
%mend fileexists;

%fileexists(&fpath);
```

Besides, if you want to check whether a dataset exists, you can use the `exist` function; to check whether a variable exists, `varnum` is recommended.

If you want to create a blank file in R:

```r
file.create("./text.txt")
```

In SAS, I'm not sure whether the following is the standard way, but it's definitely simple:

```sas
data _null_;
  file "E:\Tp\test.txt";
run;
```

If you want to delete a specific file in R:

```r
file.remove("./test.txt")
```

In SAS, I think we can simply use the `fdelete` function.

```sas
filename defile 'E:\Tp\test.txt';

data _null_;
  rc=fdelete('defile');
run;
```

Creating a directory is very similar to creating a file. The R function is `dir.create()`, which is very convenient to use. In SAS, it can be accomplished with the `dlcreatedir` option and a `libname` statement, in two lines of code.

```sas
options dlcreatedir;
libname folder 'E:\Tp\dummy';
```

If you want to create or copy multiple folders or directories, more detailed information can be found in Using SAS® to Create Directories and Duplicate Files.

You can use the macro shown below to list all the RTF files, i.e. those with the `.rtf` extension.

```sas
/* Example */
/* %ListFilesSpecifyExtension(dir=C:\Users\demo, ext=rtf, out=rst); */
/* Parameter Description */
/* dir  input directory */
/* ext  file name extension */
/* out  output dataset */
%macro ListFilesSpecifyExtension(dir=, ext=, out=);
  %local filrf rc did name i;
  %let rc=%sysfunc(filename(filrf,&dir));
  %let did=%sysfunc(dopen(&filrf));

  /* Use the %IF statement to make sure the directory can be opened.
     If not, end the macro. */
  %if &did eq 0 %then %do;
    %put Directory &dir cannot be open or does not exist;
    %return;
  %end;

  data &out;
    length FileName $200;
    %do i = 1 %to %sysfunc(dnum(&did));
      %let name=%qsysfunc(dread(&did,&i));
      %if %qupcase(%qscan(&name,-1,.)) = %upcase(&ext) %then %do;
        Filename = "&name";
        output;
      %end;
    %end;
  run;

  %let rc=%sysfunc(dclose(&did));
  %let rc=%sysfunc(filename(filrf));
%mend ListFilesSpecifyExtension;
```

Directory Listings in SAS

Obtaining A List of Files In A Directory Using SAS® Functions

http://sasunit.sourceforge.net/v15/doc/files.html

Check if a Specified Object Exists

Using SAS® to Create Directories and Duplicate Files

The Bland-Altman method generally refers to the Bland-Altman plot, which is used to display the relationship between two paired quantitative measurement tests or assays (Bland & Altman, 1986 and 1999). For example, a new product might be compared with a registered product or a previous-generation product; alternatively, the comparator can be a reference or gold-standard method. In CLSI EP09 it is sometimes called a difference plot rather than a Bland-Altman plot.

As we can see from the above, the Bland-Altman plot is a scatter plot that clearly shows the relationship between the differences and the magnitude of the measurements. The X axis represents the mean of the two measurements, and the Y axis represents their difference.

Sometimes the difference is constant, but it can also be proportional, depending on the distribution of the measurements. If the difference is not related to the magnitude, we have a constant difference between the two assays throughout the X-axis range. A proportional difference, by contrast, is related to, and proportional to, the magnitude. Such plots can therefore be visually inspected to determine the underlying variability characteristics of the relationship.

The assumption of Bland-Altman analysis is that the differences are normally distributed. Of course, we can never be sure the measurements follow a normal distribution exactly, but in many cases a departure from normality in the differences has little impact on the Bland-Altman analysis.

From my side, though, I propose that the ranges of the two assays should not be too different; make sure they are of a similar magnitude.

There are some definitions we should know for Bland-Altman analysis.

- **Bias**: the mean of the differences between the two measurements. It is the middle line in the Bland-Altman plot and is useful for detecting a systematic difference.
- **95% CI of bias**: the 95% confidence interval of the mean difference, which illustrates the magnitude of the systematic difference.
- **Limits of agreement (LoA)**: the 95% prediction interval of the differences, drawn as the upper and lower lines in the Bland-Altman plot. This indicator is very important in clinical trials: we usually compare the LoA with the clinical acceptance criteria to demonstrate that the bias of the new product is acceptable in clinical practice.
- **95% CI of LoA**: the 95% confidence interval of the LoA, which shows the error or precision of the upper and lower limits.

To explain the calculation process more clearly, here I use the R to implement it.

Suppose you have paired measurements from two assays. The mean of the differences is `5`, and the corresponding standard deviation is `0.8`.

```r
d <- 5
sd <- 0.8
```

We can easily get the LoA by the formula, as it is the 95% prediction interval of the differences.

```r
LoA <- c(d - 1.96 * sd, d + 1.96 * sd)
```

The CIs for `d` and `LoA` are a bit more complicated; the formulas are given in the NCSS Bland-Altman Plot and Analysis documentation.

From those formulas, the standard error of the `LoA` CI is about 1.71 times that of `d`. This `1.71` often appears in Bland-Altman articles; now at least we know how to calculate it: just drop the `n` from both sides of the equation.

```r
sqrt(1 + 1.96^2 / 2)
```

The CIs can then be calculated from the formulas above, for example the 95% two-sided confidence interval.

But we must decide whether to use the t distribution or the normal distribution, which determines whether we use the t statistic or the z statistic. Suppose I use the t distribution and a sample size `n` of 200, so the degrees of freedom are `n - 1`.

```r
n <- 200
t <- qt(1 - 0.05 / 2, 200 - 1)
d_se <- sd / sqrt(n)
d_CI <- c(d - t * d_se, d + t * d_se)

> d_CI
[1] 4.888449 5.111551
```

Then the corresponding CI of `LoA`:

```r
LoA_se <- sd * sqrt(1 / n + 1.96^2 / (2 * (n - 1)))
LoA_CI <- LoA + c(-t * LoA_se, t * LoA_se)

> LoA_CI
[1] 3.241041 6.758959
```
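To make the whole calculation concrete, here is a minimal Python sketch of the same steps. The function name `bland_altman` is mine, and it uses the normal quantile throughout, whereas the R code above uses the t quantile for the CIs, so the numbers differ slightly for small `n`.

```python
from statistics import NormalDist

def bland_altman(d, sd, n, alpha=0.05):
    """Bias, limits of agreement, and their CIs for a Bland-Altman analysis.

    Normal-quantile version: the t quantile used in the post gives slightly
    wider intervals for small n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # ~1.96 for alpha = 0.05
    loa = (d - 1.96 * sd, d + 1.96 * sd)         # 95% limits of agreement
    d_se = sd / n ** 0.5                         # SE of the mean difference
    bias_ci = (d - z * d_se, d + z * d_se)
    # SE of each limit of agreement: sd * sqrt(1/n + 1.96^2 / (2*(n - 1)))
    loa_se = sd * (1 / n + 1.96 ** 2 / (2 * (n - 1))) ** 0.5
    loa_ci = ((loa[0] - z * loa_se, loa[0] + z * loa_se),
              (loa[1] - z * loa_se, loa[1] + z * loa_se))
    return {"bias": d, "bias_ci": bias_ci, "loa": loa, "loa_ci": loa_ci}
```

With `d = 5`, `sd = 0.8`, `n = 200`, this reproduces the LoA of roughly (3.43, 6.57) and CIs close to the R results above.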

What's the best result for a Bland-Altman analysis in clinical trials?

Basically, in the Bland-Altman plot we hope the spread of the scatter points is consistent across the X-axis range, only a few points fall outside the `LoA`, and the `LoA` or its CI meets the clinical requirements.

The above is my rough understanding; the main purpose here is to note the calculation of the CI for the LoA. The references are listed below:

Bland-Altman Plots(一致性评价)在python中的实现

Bland-Altman Plot and Analysis

Bland-Altman plot

Bland-Altman 分析在临床测量方法一致性评价中的应用

**Please indicate the source**: http://www.bioinfo-scrounger.com

Suppose you have a new product for which you want to conduct a method comparison with an existing product. The primary endpoint is that the AUC of your product is not worse than that of the compared product. No doubt this is a non-inferiority trial design. Suppose the non-inferiority margin is `-0.15`; the one-sided hypotheses are:

- Margin: -0.15
- H0: AUC1 - AUC2 ≤ -0.1500
- H1: AUC1 - AUC2 > -0.1500

Then which method can we use to assess statistical significance? Commonly, two methods are used to estimate the AUC. One is the empirical (nonparametric) method of DeLong et al. (1988), which doesn't depend on the strong normality assumption that the binormal method makes. The other is the binormal method presented by Metz (1978) and McClish (1989). The bootstrap technique is also suggested for both methods, as in the `rocNIT` R package.

In this post, I'm mainly going to record how to use the nonparametric method to compare paired AUCs, following the Liu (2006) article listed in the references.

The method is implemented in R code to demonstrate each step. I won't reproduce the formulas here; please read that article, which is very clear and easy to understand.

For paired AUCs, the two products are measured on the same subjects. So I create simulated data with 80 subjects and the corresponding two measurements, one tested by the new product and one by the compared product.

The simulation data can be downloaded from Paired_Criteria_adjusted.txt.

The nonparametric estimation of the ROC curve area is based on the Mann-Whitney U statistic.
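As a sketch of that statistic, the Mann-Whitney AUC estimate can be written in a few lines of Python (the helper name `auc_mann_whitney` is illustrative):

```python
def auc_mann_whitney(pos, neg):
    """Nonparametric AUC estimate: the proportion of positive/negative
    pairs where the positive scores higher, counting ties as one half."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in pos for y in neg)
    return wins / (len(pos) * len(neg))
```

This is exactly what the `h_auc_v10_v01` helper in the R code below computes with nested `sapply` calls.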

Then we need to calculate the estimated variance of `auc1 - auc2`.

With the estimated variance, the difference of the two paired AUCs, and the margin, we can obtain the Z statistic, and thereby the p-value and the one-sided 95% lower limit.

The complete code is as follows:

```r
# Helper functions
h_get_u_statistic <- function(x, y) {
  if (x > y) {
    return(1)
  }
  if (x == y) {
    return(0.5)
  }
  if (x < y) {
    return(0)
  }
}

h_auc_v10_v01 <- function(n1, n0, v1, v0) {
  v10 <- NULL
  v01 <- NULL
  # Mann-Whitney U statistic
  auc <- sum(sapply(v1, function(x) {
    sapply(v0, function(y) {h_get_u_statistic(x, y)})
  })) / (n1 * n0)
  for (i in 1:n1) {
    v10 <- c(v10, sum(
      sapply(v0, function(y) {h_get_u_statistic(v1[i], y)})
    ) / n0)
  }
  for (i in 1:n0) {
    v01 <- c(v01, sum(
      sapply(v1, function(x) {h_get_u_statistic(x, v0[i])})
    ) / n1)
  }
  return(list(auc = auc, v10 = v10, v01 = v01))
}

# Get the AUC and the corresponding intermediate parameters `v10` and `v01`
get_auc <- function(response, var) {
  dat <- cbind(response, var)
  n0 <- sum(response == 0, na.rm = TRUE)
  n1 <- sum(response == 1, na.rm = TRUE)
  var0 <- var[response == 0]
  var1 <- var[response == 1]
  c(
    list(n1 = n1, n0 = n0, var1 = var1, var0 = var0),
    h_auc_v10_v01(n1 = n1, n0 = n0, v1 = var1, v0 = var0)
  )
}

# The main program.
auc.test <- function(mroc1, mroc2, margin, alpha = 0.05) {
  mod1 <- mroc1
  mod2 <- mroc2
  n1 <- mod1$n1
  n0 <- mod1$n0
  auc1 <- mod1$auc
  auc2 <- mod2$auc
  s10_11 <- sum((mod1$v10 - auc1)^2) / (n1 - 1)
  s10_22 <- sum((mod2$v10 - auc2)^2) / (n1 - 1)
  s10_12 <- sum((mod1$v10 - auc1) * (mod2$v10 - auc2)) / (n1 - 1)
  s01_11 <- sum((mod1$v01 - auc1)^2) / (n0 - 1)
  s01_22 <- sum((mod2$v01 - auc2)^2) / (n0 - 1)
  s01_12 <- sum((mod1$v01 - auc1) * (mod2$v01 - auc2)) / (n0 - 1)
  variance <- (s10_11 + s10_22 - 2 * s10_12) / n1 +
    (s01_11 + s01_22 - 2 * s01_12) / n0
  auc_diff <- auc1 - auc2
  z <- (auc_diff - margin) / sqrt(variance)
  p <- 1 - pnorm(z)
  lower_limit <- auc_diff - (qnorm(1 - alpha) * sqrt(variance))
  list(
    Difference = auc_diff,
    `Non-Inferiority Pvalue` = p,
    `One-Sided 95% Lower Limit` = lower_limit
  )
}
```

Now let's run the main program. First, import the simulation data.

```r
data <- data.table::fread("./Paired_Criteria_adjusted.txt") %>%
  mutate(Response = if_else(Condition == "Present", 1, 0))
```

Then obtain the results of the two ROC models.

```r
mroc1 <- get_auc(response = data$Response, var = data$Method1)
mroc2 <- get_auc(response = data$Response, var = data$Method2)
```

Finally, calculate the output, especially the p-value and the one-sided limit.

```r
> auc.test(mroc1, mroc2, margin = -0.15, alpha = 0.05)
$Difference
[1] -0.04771115

$`Non-Inferiority Pvalue`
[1] 0.04447758

$`One-Sided 95% Lower Limit`
[1] -0.1466274
```

From this outcome we reject the null hypothesis: the p-value is less than 0.05, and equivalently the one-sided lower limit is greater than -0.15. So we can conclude that Method1 (the new product) is not worse than Method2 (the registered product) with a margin of -0.15.

Above is my note on a non-inferiority test in the diagnostics area. In practice I prefer to use an R package or NCSS software for the test, since hand-written code can always contain mistakes. The code above is meant to help understand the principle, not for direct use.

Tests of equivalence and non-inferiority for diagnostic accuracy based on the paired areas under ROC curves

Comparing Two ROC Curves – Paired Design - NCSS

**Please indicate the source**: http://www.bioinfo-scrounger.com

In R, you can use the `dplyr` package's `left_join`, `inner_join`, and the other `*_join()` functions to merge your data in any form. As for transposing, we generally call it **pivoting data**, and use the corresponding `pivot_wider()` and `pivot_longer()` functions from the `tidyr` package to reshape data from long to wide or the opposite. For how to pivot data in R, see my post Pivoting data in R.

As for the SAS, let's see this with step-by-step examples.

In SAS, it defines the merge processing as three approaches.

- one-to-one reading
- concatenating
- Match-merging

The first two are uses of the `set` statement and are not discussed in this post. The third is, I suppose, the most common in our data processing.

First, we create two example datasets as shown below.

```sas
proc sort data = sashelp.class;
  by Name;
run;

data cls1;
  set sashelp.class(keep = Name Sex obs = 10);
run;

data cls2;
  set sashelp.class(keep = Name Age firstobs = 5 obs = 15);
run;
```

Then we define the most important statement, `by`, to specify which variable to join on. Before merging, we must make sure each dataset is sorted by that `by` variable.

**Why?**

A simple answer is that the SAS match-merge is based on the classic sequential match algorithm, and the latter is based on the premise that all input streams are sorted identically.

```sas
data clss;
  merge cls1 cls2;
  by Name;
run;
```
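The sequential match algorithm behind `merge`/`by` can be sketched in Python. This is a simplified model that assumes unique, already-sorted keys, not SAS's full by-group semantics:

```python
def match_merge(xs, ys, key):
    """Two-pointer merge of two lists of dicts, each sorted by `key` with
    unique key values. Matched rows are combined; unmatched rows from
    either side pass through (full-join behavior, like MERGE without IN=)."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        kx, ky = xs[i][key], ys[j][key]
        if kx == ky:                       # keys match: combine the rows
            out.append({**xs[i], **ys[j]})
            i += 1
            j += 1
        elif kx < ky:                      # advance the stream that is behind
            out.append(dict(xs[i]))
            i += 1
        else:
            out.append(dict(ys[j]))
            j += 1
    out.extend(dict(r) for r in xs[i:])    # leftovers pass through
    out.extend(dict(r) for r in ys[j:])
    return out
```

The two-pointer walk is why both inputs must be sorted identically: the algorithm only ever looks at the current front of each stream.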

Actually, in my opinion, the code above is not what I use most often; it is similar to a `full_join` (all x rows, followed by unmatched y rows). More often I want merges like `left_join` or `inner_join`. In that case we need to specify the `IN=` option in the `merge` statement.

For instance, to process like `left_join`:

```sas
data clss2;
  merge cls1(in = x) cls2(in = y);
  by Name;
  if x;
run;
```

To process like `inner_join`:

```sas
data clss2;
  merge cls1(in = x) cls2(in = y);
  by Name;
  if x and y;
run;
```

As seen above, we can use `IN=` to control which rows are kept.

However, considering that we have to sort the datasets first, I sometimes prefer `proc sql` to merge data, since it is close to the form I use in R.

```sas
proc sql;
  create table clss3 as
  select x.*, y.*
  from cls1 as x
  left join cls2 as y
  on x.Name = y.Name;
quit;
```

In my opinion, transpose processing is best learned and understood through a few examples; simply memorizing the arguments is confusing. I suggest running each piece of code, looking at the output, and thinking about how it was produced.

So let's see examples of transposing a dataset from long to wide, i.e. rows to columns.

First, create an example data from `sashelp.shoes`

dataset.

```sas
data shoes;
  set sashelp.shoes;
  if Subsidiary in ("Johannesburg" "Nairobi");
  keep Region Product Subsidiary Sales Inventory;
run;
```

By default, the transpose procedure transposes only the numeric columns from long to wide and ignores character columns. In practice we generally define a series of options such as `prefix`, `name`, and `label`. With the `var` statement we select which column or columns to transpose, and with the `id` statement we use a column's values as the new variable names.

```sas
proc transpose data = shoes(where = (Subsidiary = "Johannesburg")) out = res;
  var Sales;
  id Product;
run;
```
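For intuition, the `var`/`id` long-to-wide reshaping can be sketched in Python. The helper `pivot_wider` is named after the tidyr function mentioned earlier, but is only an illustration:

```python
def pivot_wider(rows, id_col, names_from, values_from):
    """Minimal long-to-wide pivot: one output row per id value,
    one new column per distinct `names_from` value."""
    out = {}
    for r in rows:
        # create the output row for this id on first sight, then
        # add one column named after the names_from value
        out.setdefault(r[id_col], {id_col: r[id_col]})[r[names_from]] = r[values_from]
    return list(out.values())
```

Here `id_col` plays the role of the `by` variable, `names_from` the `id` statement, and `values_from` the `var` statement.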

If you want to group the data by a variable, add the `by` statement.

```sas
proc transpose data = shoes out = res;
  var Sales;
  id Product;
  by Subsidiary;
run;
```

If you want to rename the `_NAME_` and `_LABEL_` columns, add the `name` and `label` options.

```sas
proc transpose data = shoes out = res name = var_name label = label_name;
  var Sales;
  id Product;
  by Subsidiary;
run;
```

Then an example showing how to transpose data from wide to long.

```sas
proc transpose data = res(drop = var_name label_name) out = res2;
  var Boot Sandal Slipper;
  by Subsidiary;
run;
```

SAS MERGING TUTORIAL

MATCH MERGING DATA FILES IN SAS | SAS LEARNING MODULES

SAS：数据合并简介

Complete Guide to PROC TRANSPOSE in SAS

HOW TO RESHAPE DATA WIDE TO LONG USING PROC TRANSPOSE | SAS LEARNING MODULES

**Please indicate the source**: http://www.bioinfo-scrounger.com

For the `proc sql` method, either `inobs` or `outobs` can be used to select N rows from a dataset, but it is worth noting that they behave differently when you join tables.

```sas
proc sql inobs = 5; /* or outobs = 5 */
  create table cls as
  select * from sashelp.class;
quit;
```

Outside of `proc sql`, the most straightforward method is the `obs` dataset option, which is very similar to the SQL method. If you would like to select a range of rows, just add another option, `firstobs`.

```sas
data raw_o;
  set sashelp.class(firstobs = 5 obs = 10);
run;
```

Utilizing the automatic `_N_` variable with an `IF` statement is, I suppose, more flexible sometimes.

```sas
data raw_1;
  set sashelp.class;
  if 5 <= _N_ <= 10 then output;
run;
```

So how about selecting the last N rows? We have to know the total number of rows first (here stored in the macro variable `n_rows`), and then use the `_N_` variable again.

```sas
proc sql noprint;
  select count(*) into :n_rows trimmed from sashelp.class;
quit;

data raw_2;
  set sashelp.class;
  if &n_rows.-4 <= _N_ <= &n_rows. then output;
run;
```

BTW, to select N observations randomly, we can use the `proc surveyselect` procedure with `method = srs` (simple random sampling) to get 5 random rows from this dataset.

```sas
proc surveyselect data = sashelp.class out = rd_class
                  method = srs sampsize = 5 seed = 123456;
run;
```

Besides the row number, we can also add a new row number by group, as in the following example:

```sas
proc sort data = sashelp.class out = sorted_class;
  by age;
run;

data sorted_class_2;
  set sorted_class;
  by age;
  if first.age then new_row_number = 0;
  new_row_number + 1;
run;
```
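The `first.age` plus sum-statement pattern can be sketched in Python (illustrative helper, assuming the rows are already sorted by the key, just as the data step requires):

```python
from itertools import groupby

def row_number_by_group(rows, key):
    """Add a within-group row counter: reset to 1 at each new key value,
    then increment — the same effect as first.<by-var> plus a sum statement."""
    out = []
    for _, grp in groupby(rows, key=lambda r: r[key]):
        for i, r in enumerate(grp, start=1):
            out.append({**r, "new_row_number": i})
    return out
```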

First off, we may want to store information in macro variables. If we just want to store a single value:

```sas
proc sql;
  select count(name) into :n_name trimmed
  from sashelp.class;
quit;

%put &n_name;
```

Storing multiple values is also very similar.

```sas
proc sql;
  select count(name), mean(height) format = 10.2
  into :n_name trimmed, :mean_height trimmed
  from sashelp.class;
quit;

%put &n_name &mean_height;
```

Or we may simply want a list of values assigned to a list of macro variables.

```sas
proc sql;
  select distinct(name) into :n1-:n19
  from sashelp.class;
quit;
```

But the above example seems to require knowing the total number of distinct values in advance. So the more common way is to store the column values in a single macro variable, separated by any delimiter you want.

```sas
proc sql;
  select distinct(name) into :nameList separated by ' '
  from sashelp.class;
  /* %let numNames = &sqlobs; */
quit;

%put &nameList.;
```

Suppose I just want the second element of this macro variable; what should I do? The `%scan` function is enough for that purpose.

`%put %scan(&nameList,2);`

Obviously, extracting an element is not as convenient as `nameList[2]` in R, but it is enough in SAS.

Another way is to loop over the macro variable with `%do`, as below, assigning each element to a new column.

```sas
%let cntName = %sysfunc(countw(&nameList));

data raw;
  array names[&cntName] $200 name1-name&cntName;
  do i = 1 to &cntName;
    names[i] = scan("&nameList", i);
  end;
  drop i;
run;
```

There are also many great papers showing how to store and manipulate lists in SAS, like Choosing the Best Way to Store and Manipulate Lists in SAS.

How to use the SAS SCAN Function?

Creating Lists! Using the Powerful INTO Clause with PROC SQL to Store Information in Macro Lists

How to Select the First N Rows in SAS

Using SAS® Macro Variable Lists to Create Dynamic Data-Driven Programs

Storing and Using a List of Values in a Macro Variable

**Please indicate the source**: http://www.bioinfo-scrounger.com

In R, I prefer `unique()` or `dplyr::distinct()` to remove duplicates, and `is.na()`, `na.omit()`, or external packages like `mice` to handle missing values.

We can use the `proc sort`

to remove rows that have duplicate values across all columns of the dataset.

```sas
proc sort data = sashelp.cars(keep = make type origin)
          out = without_dups nodupkey;
  by _all_;
run;
```

In some conditions, we would like to keep only unique/distinct rows as per a specific column, keeping the first row for each value of that column.

```sas
proc sort data = sashelp.cars out = make_without_dups nodupkey;
  by Make;
run;
```
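The keep-first-per-key behavior can be sketched in Python (illustrative helper; unlike `proc sort nodupkey`, it keeps the input order rather than sorting first):

```python
def nodupkey(rows, keys):
    """Keep the first row for each distinct combination of key columns."""
    seen, out = set(), []
    for r in rows:
        k = tuple(r[c] for c in keys)
        if k not in seen:      # first occurrence of this key wins
            seen.add(k)
            out.append(r)
    return out
```

Passing all columns as `keys` mimics `by _all_`; passing a single column mimics `by Make`.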

In clinical trial data, missing values are a common occurrence when no data is stored for a variable in an observation. A missing numeric value displays as a single period (`.`), and a missing character value as a blank.

By the mechanism behind the missingness, the reasons can be summarized as below:

- Missing completely at random (MCAR)
- Missing at random (MAR), not completely random
- Not missing at random (NMAR)

So how to handle the missing values?

Suppose we did a reaction-time study with six subjects, and each subject's reaction time was measured three times. The data are shown below.

```sas
data times;
  input id trial1 trial2 trial3;
  cards;
1 1.5 1.4 1.6
2 1.5 .   1.9
3 .   2.0 1.6
4 .   .   2.2
5 2.1 2.3 2.2
6 1.8 2.0 1.9
;
run;
```

As you see below, we can use some useful functions to count missing observations, like `nmiss` for numeric and `cmiss` for character variables, or `missing` to indicate whether an argument contains a missing value. Here we keep only the rows with no missing values.

```sas
data raw_0;
  set times (where = (nmiss(trial1, trial2, trial3) = 0));
run;
```

Or flag a specific variable, like the `trial1` column.

```sas
data raw_1;
  set times;
  missing_flag = missing(trial1);
run;
```

First off, let's replace all missing values with zero in every column in a simple way: create an implicit array `NumVar` holding all numeric variables in the dataset and loop over it. If you just want to replace one column, put that variable name in place of `_numeric_`.

```sas
data raw_3;
  set times;
  array NumVar _numeric_;
  do over NumVar;
    if NumVar = . then NumVar = 0;
  end;
run;
```

If your problem is more complicated, such as replacing missing values with the mean instead of zero, how would we address it? I suppose `proc stdize` is a good solution.

```sas
/*
proc stdize data = times out = stdize_vars reponly missing = 0;
run;
*/
proc stdize data = times out = stdize_vars reponly method = mean;
  var trial1 trial2; /* or _numeric_, or empty */
run;
```
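Mean imputation as done by `reponly method = mean` can be sketched in Python (illustrative helper, with `None` standing in for a SAS missing value):

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values,
    leaving observed values untouched."""
    observed = [v for v in column if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in column]
```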

Imputing missing values is a complicated data manipulation process that works well only if you select the correct method for the specific variables. I won't learn more about doing it in SAS for now, since I prefer R for imputation.

Here I just list a few useful SAS procedures so that I can read up on and recall them later if needed.

- `proc hpimpute`
- `proc mi`, `proc reg`, `proc mianalyze`
- `proc surveyimpute`

Hope the notes above are helpful for you.

https://www.statology.org/sas-remove-duplicates/

https://www.statology.org/sas-replace-missing-values-with-zero/

http://www.philasug.org/Presentations/201910/Handling_Missing_Data_in_SAS.pdf

https://sasnrd.com/sas-replace-missing-values-zero/

**Please indicate the source**: http://www.bioinfo-scrounger.com

The following examples show how to resolve the below questions (just very simple but quite common):

- How to count distinct values
- How to count variables by group
- How to produce the frequency table of variables
- How to calculate the statistics for variables

In R, `Hmisc::describe` is one option, but not the only one; other external packages or `base` functions like `summary` also work very well.

Here we use the `proc sql` procedure with the SAS dataset BirthWgt to count the `Race` variable.

```sas
proc sql;
  select count(Race) as cnt_race
  from sashelp.BirthWgt;
quit;
```

But just counting the total number of the `Race` variable doesn't make much sense. If we would like to count the `Married` variable grouped by the `Race` variable:

```sas
proc sql;
  select Race, count(Married) as cnt_married
  from sashelp.BirthWgt
  group by Race;
quit;
```

If you want to count distinct values, add `distinct` inside the `count` function.

```sas
proc sql;
  select count(distinct Married) as distinct_married
  from sashelp.BirthWgt;
quit;
```

We can use `proc freq` to create frequency tables for one or more variables. For example, for the `SomeCollege` variable including missing values, grouped by `Race`, with the output saved to a `result` dataset including cumulative frequencies and percentages (note `sashelp` is read-only, so we sort to a work copy first):

```sas
proc sort data = sashelp.BirthWgt out = birthwgt;
  by Race;
run;

proc freq data = birthwgt;
  tables SomeCollege / out = result missing outcum;
  by Race;
run;
```
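A one-variable frequency table with cumulative statistics, like the `outcum` output, can be sketched in Python (illustrative helper):

```python
from collections import Counter

def freq_table(values):
    """Frequency, percent, cumulative frequency, and cumulative percent
    for one variable, ordered by value."""
    n = len(values)
    rows, cum = [], 0
    for value, count in sorted(Counter(values).items()):
        cum += count
        rows.append({"value": value, "count": count,
                     "percent": 100 * count / n,
                     "cum_count": cum, "cum_percent": 100 * cum / n})
    return rows
```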

BTW, if you add a statistics option like `chisq`, the output includes the chi-square test statistics.

Otherwise we can use `proc tabulate`

to create a table for displaying multiple statistics quickly.

```sas
proc tabulate data = sashelp.cars;
  var weight;
  table weight * (N Min Q1 Median Mean Q3 Max);
run;
```

But I think `proc means` is more convenient for saving the output, like:

```sas
proc means data = sashelp.cars n nmiss mean std median p25 p75 min max;
  var weight;
  output out = weight_tbl
    n = n nmiss = nmiss mean = mean std = std median = median
    p25 = p25 p75 = p75 min = min max = max;
run;
```

https://www.statology.org/sas-count-distinct/

https://www.statology.org/sas-count-by-group/

https://www.statology.org/sas-frequency-table/

https://www.codeleading.com/article/53981053526/

https://www.statology.org/proc-tabulate-sas/

**Please indicate the source**: http://www.bioinfo-scrounger.com

In other programming languages like R, it's very easy and convenient to show column names:

```r
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
```

In SAS, as far as I know, there are two approaches to accomplish the same purpose.

One way is to use `proc contents`, which is very straightforward.

```sas
proc contents data = sashelp.class;
run;
```

If we want to store them in a new table (dataset), it can be used like this.

```sas
proc contents data = sashelp.class memtype = data
              out = cols nodetails noprint;
run;
```

The second way to list the column names is to use a dictionary table, which is also very common for other purposes.

We can use either a `data` step or the `proc sql` procedure.

For the `data` step, `sashelp.vcolumn` is a view of the dictionary table, so we just filter the rows of the `sashelp.vcolumn` dataset (note the stored values are uppercase).

```sas
data columns;
  set sashelp.vcolumn;
  where libname = 'SASHELP' and memname = 'CLASS';
run;
```

If we would like to save the column names in a macro variable, I feel `proc sql` is the better solution, using the corresponding `dictionary.columns` table.

```sas
proc sql;
  select name into :cols separated by ' '
  from dictionary.columns
  where libname = 'SASHELP' and memname = 'CLASS';
quit;

%put &cols;
```

If you're not sure which column holds what you want, try `describe` as below.

```sas
proc sql;
  describe table dictionary.columns;
quit;

/* output, not run */
create table DICTIONARY.COLUMNS
  (
   libname char(8) label='Library Name',
   memname char(32) label='Member Name',
   memtype char(8) label='Member Type',
   name char(32) label='Column Name',
   type char(4) label='Column Type',
   length num label='Column Length',
   npos num label='Column Position',
   varnum num label='Column Number in Table',
   label char(256) label='Column Label',
   format char(49) label='Column Format',
   informat char(49) label='Column Informat',
   idxusage char(9) label='Column Index Type',
   sortedby num label='Order in Key Sequence',
   xtype char(12) label='Extended Type',
   notnull char(3) label='Not NULL?',
   precision num label='Precision',
   scale num label='Scale',
   transcode char(3) label='Transcoded?',
   diagnostic char(256) label='Diagnostic Message from File Open Attempt'
  );
```

Now that we have the macro variable `cols`, it can be used in any procedure, like `proc freq`.

```sas
proc freq data = sashelp.class nlevels;
  tables &cols;
run;
```

Which one is the better solution? No doubt it depends on your actual case.

Code to list table columns and data types

How to List Column Names in SAS

**Please indicate the source**: http://www.bioinfo-scrounger.com

When we run programs in SAS, we usually check the log to see whether the programs ran successfully. The most common approach is to check it in SAS directly. In clinical trials, however, it can be beneficial to write the logs generated by your programs to external files so that all log files can be checked at once, for example scanning for errors or warnings without visual inspection. This post demonstrates how to use SAS code to output logs to external files.

So far I know two approaches to reach this purpose.

The first way is to use `proc printto`

like:

```sas
proc printto log = "./log1.txt" new;
run;

data class;
  set sashelp.cars;
  where origin = "Asia";
run;

/* revert the log to the log window */
proc printto;
run;
```

Put the file path in the `LOG=` option and the output file will contain the log. The `new` option makes SAS replace the previous contents of the log file; otherwise SAS appends the new log to the current file.

You just need to put your running code between the two `proc printto` procedures, which is very easy and straightforward.

The second way is to use a `dm` statement with `replace`, which is also very easy; I like it indeed.

```sas
data class;
  set sashelp.cars;
  where origin = "Asia";
run;

dm 'log; log; file "./log1.txt" replace;';
run;
```

Put this standard code at the end of your program; only the `file` option needs to be specified.

But some say the `dm` approach is limited by the number of log lines SAS keeps in the log window.

So then which approach do you prefer?

https://cloud.tencent.com/developer/article/1523962

https://sasnrd.com/sas-output-log-text-proc-printto/

**Please indicate the source**: http://www.bioinfo-scrounger.com

The about page of a Hexo blog is rendered from markdown by default. If you want a custom page instead, e.g. one built with HTML/CSS, how can that be done?

Modify the `_config.yml` file in Hexo's root directory:

`skip_render: "about/*"`

As shown above, this tells Hexo to skip rendering all files under the `about` directory.

Then just place your generated HTML/CSS/JS files into the about directory and regenerate the site:

```shell
hexo clean
hexo g -d
```

Here's a question: how do we generate the HTML files in the first place?

My initial idea was to find a nice HTML template and modify its HTML/CSS myself.

But I found that a bit time-consuming, since my HTML skills are limited. After some googling I found a very handy tool: https://nicepage.com/

As the GIF above shows, it works like PowerPoint: just drag and drop, very convenient. You only need a rough understanding of what each element means in HTML, and you can get started in ten minutes.

I'm noting this tool down here in case I need it again someday.

After finishing the design in NicePage, download the packaged HTML files and extract them into Hexo's about directory. The final result can be seen at: https://www.bioinfo-scrounger.com/about/

**Please indicate the source**: http://www.bioinfo-scrounger.com

This is a short summary of collaborative R package development experience in our team, plus some basic knowledge preparation for developing R packages on my own.

For the basics, consult the documents referenced throughout this post.

First, decide on the package name. You can follow these rules (copied from What's in a Name):

- must start with letter
- no underscores
- periods allowable or use CamelCase
- can have numbers
- should be Google-able

You can use the `available` package to check whether a name is already taken. NOTE: very practical!

```r
available::available("mcIVD", browse = FALSE) # if we want "mcIVD"
```

Use the `usethis::create_package()` function from the `usethis` package to create the package. Many operations in R package development rely on this package; see https://usethis.r-lib.org/reference/index.html for the usage of each function.

```r
install.packages(c("devtools", "usethis"))
# Create a new package
usethis::create_package("~/newPackage")
```

The basic structure of an R package generally includes:

- Functions
- Documents
- Data
- Vignettes
- Versions
- Dependencies

For those not fluent with the git command line (like me), I recommend **GitHub Desktop** or **GitKraken**; I find the latter better, especially for multi-person collaborative package development. Finally, push to GitHub.

Whether developing alone or collaborating, pay attention to your R coding habits. Google publishes an **R style guide** for reference, e.g. the Google R Style Guide.

A few points are worth special attention:

- Maximum line length is 80 characters.
- Use two spaces to indent code; never use tabs or mix tabs and spaces.
- ......

There are also naming conventions for functions and methods: I'm used to underscores for functions and CamelCase for methods, though I'm not sure that's appropriate.

It's generally recommended to use the `styler` package to standardize code formatting:

`install.packages("styler")`

Then, in RStudio's Addins menu, choose the appropriate action, such as Style active file.

The package description lives in the **DESCRIPTION** file. Part of it should be defined when the package is created, and the rest is updated continuously during development.

You can refer to the DESCRIPTION of other mature packages: copy and modify.

The difference between Imports and Suggests is as follows:

Imports lists the packages that are required. At install time, if these packages are not yet installed, they will be installed too.

If you only use some classes, methods, or (a few) functions from a package without fully loading it, list the package here, preferably with a version number (checked by R CMD check). In code, reference another package's namespace with the `::` or `:::` operators; correspondingly, the imports must be declared in the NAMESPACE file. Suggests lists packages that are merely recommended: they may be used for example data, tests, vignettes, or by only a few functions in the package, so we only need to check that they are installed before using them.

If a package is only used in the examples, tests, or vignettes, there is no need to "depend on" or "import" it; "suggesting" it is enough. The version number should likewise be added, as R CMD check uses it. Of course, considering that readers may want to reproduce the examples/tests/vignettes, it is best to guard the code with a conditional like `if (require(pkgname))`: run it if TRUE, return an error if FALSE.

The differences between Depends and Imports:

- The only difference is that Depends attaches the package, while Imports only loads it.
- In general, just list the packages you need under Imports and use `::` in your functions to access their functions. Packages in Imports or Depends are installed automatically if missing, ensuring that `::` works.

Imports, Depends, and Suggests all reference CRAN packages; to reference a Bioconductor package, add `biocViews:` before them.

The `usethis` package also has helpers for the DESCRIPTION file, e.g.:

```r
# Bump the version number
usethis::use_version()
```

BugReports is a URL for submitting bugs, instead of emailing the author. A good idea is to use GitHub and have bugs submitted in the project's issues section.

Generally, the package's functions live in the `R` subdirectory.

When writing functions, don't forget to add `roxygen`-style comments; click `Insert Roxygen Skeleton` in RStudio to generate a documentation skeleton quickly.

After commenting, use `devtools::document()` to generate the function documentation in the `man` subdirectory. These operations all have shortcuts and RStudio buttons, depending on personal habit.

Roxygen comments generally contain the following parts; see:

- https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
- https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html

If multiple people collaborate, it is best to keep everyone's Roxygen format consistent,

especially commenting habits: for example, don't start titles with a verb; keep them noun phrases, and so on.

Generally speaking, R has four official type systems: base types, S3, S4, and RC. Packages included in Bioconductor are usually S4; S3 is looser than S4 and therefore simpler, but less robust.

See: Advanced R

For a complex package, writing test code for functions and methods is essential; this can be done with `testthat`.

It is very important: although writing tests adds a little work, in practice it saves a lot of debugging time.

```r
# Set up the test infrastructure
usethis::use_testthat()
# Install testthat
install.packages("testthat")
```

Each `test_*.R` file can be run individually by clicking **Run Tests** in RStudio.

Mature R packages often have a documentation website, or guidance, for users to consult.

First generate a URL with GitHub Pages (private repos are not supported; the repo must be public), then install the `pkgdown` package:

`install.packages("pkgdown")`

After initialization, you can manually edit the `_pkgdown.yml` file to customize the site, then push to GitHub:

```r
usethis::use_pkgdown()
pkgdown::build_site()
```

Change the settings in the GitHub repo's Settings page, and finally refresh the site.

See also: https://pkgdown.r-lib.org/articles/pkgdown.html

Vignettes are generally a must for a mature R package:

`usethis::use_vignette("my-vignette")`

Reference: https://r-pkgs.org/vignettes.html

To add external and internal data: the `data-raw` folder stores the code that produces the internal data, and the `data` folder archives the cleaned-up Rdata corresponding to the raw data, plus the R code documenting those Rdata objects.

In addition, raw data that needs parsing can be placed in `inst/extdata` and accessed with `system.file()`.

Add some other documents to complete the package framework, such as `NEWS.md`:

`usethis::use_news_md()`

For example, if you only want the pipe (`%>%`) without importing the whole `dplyr` package:

`usethis::use_pipe()`

Create a `README.Rmd` document:

`usethis::use_readme_rmd()`

值得注意的是，每次想push comments或merge分支的时候，记得check下整个R包，看看是否有errors未解决

还有Spell Checking，CRAN会检查拼写（或者增加一个WORDLIST文件用于标注一些正确的拼写），如：

`devtools::spell_check()# after you have fixed the issues, runspelling::update_wordlist()`

Finally, you can make a logo/hex sticker; see "Owning a hex sticker", or use the `hexSticker` package directly.

The above is the basic framework and checklist I have compiled for getting started with R package development; feel free to reach out with any questions.

This article is from http://www.bioinfo-scrounger.com. Please indicate the source when reposting.

If you pass the `input` function a character string `20211109` together with a SAS informat like `yymmdd8.` that specifies how SAS must interpret the string, you will get the numeric value `22593`. What does that number mean? A SAS date value represents the number of days between January 1, 1960 and the specified date: dates before it are negative numbers, while dates after it are positive.
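As a quick cross-check, the same count of days since `1960-01-01` can be reproduced in base R:

```r
# Days between the SAS epoch (1960-01-01) and 2021-11-09,
# matching the SAS date value 22593
as.numeric(as.Date("2021-11-09") - as.Date("1960-01-01"))
# [1] 22593
```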

```sas
data date_1;
  input date_char $8.;
  sas_date_value=input(date_char, yymmdd8.);
  /* sas_date_format=put(sas_date_value, yymmdd10.); */
  datalines;
20211109
;
run;

proc contents data=date_1;
run;
```

As the SAS date value is not human-readable, you can use the `PUT` function to convert the date value to a formatted character string, or simply use a `FORMAT` statement to apply a date format to the variable directly.

```sas
data date_1;
  input date_char $8.;
  sas_date_value=input(date_char, yymmdd8.);
  sas_date_format=put(sas_date_value, yymmdd10.);
  datalines;
20211109
;
run;

data date_2;
  input date_char $8.;
  sas_date=input(date_char, yymmdd8.);
  format sas_date yymmdd10.;
  datalines;
20211109
;
run;
```

The same idea exists in R, but the reference date (epoch) is `1970-01-01` instead of `1960-01-01`.

```r
x <- as.Date("1970-01-01", "%Y-%m-%d")
as.numeric(x)
# [1] 0
```

The most common requirement is to return a person's age; the `YRDIF` function can handle it.

```sas
data c_age;
  input date1 date9.;
  age=yrdif(date1, today(), "Actual");
  format date1 yymmdd10. age 5.1;
  datalines;
14JUN1990
03MAY2000
;
run;
```
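For comparison, a rough base-R analogue can be sketched as below. The `age_years` helper is hypothetical, and dividing elapsed days by 365.25 only approximates SAS's "Actual" day-count basis.

```r
# Approximate age in years: elapsed days / 365.25
# (not SAS's exact day-count convention, but close for most purposes)
age_years <- function(birth, ref = Sys.Date()) {
  as.numeric(difftime(ref, birth, units = "days")) / 365.25
}
age_years(as.Date("1990-06-14"), as.Date("2000-05-03"))
```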

In SAS you can calculate the difference between two dates with the `INTCK` function.

```sas
data date_init;
  format mydate1 mydate2 yymmdd10.;
  input mydate1 :date9. mydate2 :date9.;
  datalines;
13JUN2020 18JUN2020
22JUN2020 20JUL2020
01JAN2020 31DEC2020
03MAY2020 19AUG2020
;
run;
```

Note: when reading two variables with informats, it's better to add `:` in front of each informat (modified list input) to avoid unexpected errors.

Firstly, calculate the difference in days.

```sas
data data_d;
  set date_init;
  diff_days_disc=intck("day", mydate1, mydate2);
run;
```

Then the difference in months is as follows. If you set the method argument to "C" (continuous), SAS counts only complete months between the two dates, so if fewer days than a full month have elapsed, the result is zero; with "D" (discrete, the default), SAS counts the month boundaries crossed.

```sas
data date_m;
  set date_init;
  diff_months_disc = intck('month', mydate1, mydate2, 'D');
  diff_months_cont = intck('month', mydate1, mydate2, 'C');
run;
```
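The discrete ('D') month count can be mimicked in base R by counting the month boundaries crossed between the two dates. The `months_disc` helper below is hypothetical, shown only to illustrate what the discrete method counts.

```r
# Count month boundaries crossed between two dates,
# like SAS intck('month', d1, d2, 'D')
months_disc <- function(d1, d2) {
  (as.integer(format(d2, "%Y")) - as.integer(format(d1, "%Y"))) * 12L +
    (as.integer(format(d2, "%m")) - as.integer(format(d1, "%m")))
}
months_disc(as.Date("2020-06-22"), as.Date("2020-07-20"))
# [1] 1
```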

Then the difference in weeks is as follows, analogous to the month calculation.

```sas
data date_w;
  set date_init;
  diff_weeks_disc = intck('week', mydate1, mydate2, 'D');
  diff_weeks_cont = intck('week', mydate1, mydate2, 'C');
run;
```

Additionally, if you want to answer a question like "what is the date 5 days from now?", the `INTNX` function is frequently used in that situation.

```sas
data _null_;
  next_day = intnx("day", today(), 1);
  previous_day = intnx("day", today(), -1);
  add_5_days = intnx("day", today(), 5);
  put "Today: %sysfunc(today(), EURDFWKX28.)";
  put "Next Day: " next_day EURDFWKX28.;
  put "Previous Day: " previous_day EURDFWKX28.;
  put "Today +5 Days: " add_5_days EURDFWKX28.;
run;
```
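In base R the same day shifts are plain date arithmetic, since `Date` objects are day counts under the hood:

```r
# Equivalent of intnx("day", today(), n) in base R
today <- Sys.Date()
today + 1   # next day
today - 1   # previous day
today + 5   # today + 5 days
```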

If you want to extract the day from a SAS date, the `day` function is an easy way to accomplish it: it returns the day of the month, just as `month` returns the month of the year.

Besides, if you want to extract a combined month-and-year value, the `put` function may be more appropriate.

```sas
data date_extr;
  set date_init(keep=mydate1);
  ext_day=day(mydate1);
  ext_week=week(mydate1);
  ext_year=year(mydate1);
  monthyear = put(mydate1, monyy7.);
run;
```
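The same extractions can be sketched in base R with `format()`; note the month-name output depends on the locale.

```r
d <- as.Date("2020-06-13")
as.integer(format(d, "%d"))   # day of month
as.integer(format(d, "%V"))   # ISO week number
as.integer(format(d, "%Y"))   # year
format(d, "%b%Y")             # month-year, like SAS monyy7. (locale-dependent)
```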

Conversely, the `MDY` function is used to combine the day, month, and year values into a SAS date; `MYD`, `YMD`, `YDM`, `DMY`, and `DYM` are similar.
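A base-R sketch of the same "assemble a date from parts" idea (the `mdy_num` helper is hypothetical, for illustration):

```r
# Combine month, day, year into a Date, like SAS mdy(m, d, y)
mdy_num <- function(m, d, y) as.Date(sprintf("%04d-%02d-%02d", y, m, d))
mdy_num(11, 9, 2021)
# [1] "2021-11-09"
```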

In R, the `lubridate` package provides many convenient functions to create, convert, and manipulate dates.

You can use `lubridate::ymd()`, `lubridate::mdy()`, and `lubridate::dmy()` to convert character strings to dates:

```r
ymd(c("1998-3-10", "2018-01-17", "18-1-17"))
```

If you want to extract the day, month, and year numbers from a date, the following functions may be useful.

```r
mday(as.Date("2021-11-09"))
# [1] 9
month(as.Date("2021-11-09"))
# [1] 11
year(as.Date("2021-11-09"))
# [1] 2021
```

We can add or subtract dates directly, but sometimes the `difftime` function is a better choice.

```r
x <- as.Date(c('2021-09-01', '2021-11-01'))
difftime(x[2], x[1], units = 'days')
# Time difference of 61 days
```

In addition, the `dseconds()`, `dminutes()`, `dhours()`, `ddays()`, `dweeks()`, and `dyears()` functions are convenient if you want to return, for example, the date 5 days from a given day.

```r
d <- ymd("2021-11-09")
d + ddays(5)
# [1] "2021-11-14"
```

- EXTRACT DAY, MONTH AND YEAR FROM DATE OR TIMESTAMP IN SAS
- How to Easily Convert a Number to a Date in SAS
- How to Easily Calculate the Difference Between Two SAS Dates
- Complete Guide for SAS INTNX Function with Examples

**Please indicate the source**: http://www.bioinfo-scrounger.com