The `INPUT` function takes a character string such as `20211109` together with a SAS informat like `yymmdd8.` that tells SAS how to interpret the string, and returns a numeric value, here `22593`. What does that number mean? A SAS date value is the number of days between January 1, 1960 and the given date: dates before 1960 yield negative numbers, dates after yield positive numbers.

```sas
data date_1;
  input date_char $8.;
  sas_date_value=input(date_char, yymmdd8.);
  /* sas_date_format=put(sas_date_value, yymmdd10.); */
  datalines;
20211109
;
run;

proc contents data=date_1;
run;
```
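As a quick cross-check of the `22593` above, the same day count can be reproduced in Python (an illustrative sketch, not part of the SAS workflow):

```python
from datetime import date

# A SAS date value is the number of days since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)

def sas_date_value(d: date) -> int:
    """Days from the SAS epoch to d (negative before 1960)."""
    return (d - SAS_EPOCH).days

print(sas_date_value(date(2021, 11, 9)))   # 22593
print(sas_date_value(date(1959, 12, 31)))  # -1
```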

As the raw SAS date value is not human-readable, you can use the `PUT` function to convert the date value into a formatted date string, or simply use a `FORMAT` statement to apply a display format to the variable directly.

```sas
data date_1;
  input date_char $8.;
  sas_date_value=input(date_char, yymmdd8.);
  sas_date_format=put(sas_date_value, yymmdd10.);
  datalines;
20211109
;
run;

data date_2;
  input date_char $8.;
  sas_date=input(date_char, yymmdd8.);
  format sas_date yymmdd10.;
  datalines;
20211109
;
run;
```

The same concept exists in R, but the origin date is `1970-01-01` instead of `1960-01-01`.

```r
> x <- as.Date("1970-01-01", "%Y-%m-%d")
> as.numeric(x)
[1] 0
```
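The two origins differ by a fixed number of days, which a short Python sketch can verify (illustrative only):

```python
from datetime import date

# R's origin is 1970-01-01, SAS's is 1960-01-01: the same calendar
# date therefore gets two different numeric values, offset by a constant.
offset = (date(1970, 1, 1) - date(1960, 1, 1)).days
print(offset)  # 3653 (ten years, three of them leap years)

d = date(2021, 11, 9)
sas_value = (d - date(1960, 1, 1)).days  # 22593
r_value = (d - date(1970, 1, 1)).days    # 18940
print(sas_value - r_value == offset)     # True
```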

A very common requirement is to compute a person’s age; the `YRDIF` function handles it.

```sas
data c_age;
  input date1 date9.;
  age=yrdif(date1, today(), "Actual");
  format date1 yymmdd10. age 5.1;
  datalines;
14JUN1990
03MAY2000
;
run;
```

In SAS you can calculate the difference between two dates with the `INTCK` function.

```sas
data date_init;
  format mydate1 mydate2 yymmdd10.;
  input mydate1 :date9. mydate2 :date9.;
  datalines;
13JUN2020 18JUN2020
22JUN2020 20JUL2020
01JAN2020 31DEC2020
03MAY2020 19AUG2020
;
run;
```

Note: when reading two date variables with informats on the same `input` statement, it is safer to prefix each informat with a colon (`:`) to avoid unexpected errors.

First, calculate the difference in days.

```sas
data data_d;
  set date_init;
  diff_days_disc=intck("day", mydate1, mydate2);
run;
```

Next, the difference in months. With the `'C'` (continuous) argument, SAS counts only complete months between the two dates, so if the span is less than one full month the result is zero; the default `'D'` (discrete) method counts month boundaries crossed.

```sas
data date_m;
  set date_init;
  diff_months_disc = intck('month', mydate1, mydate2, 'D');
  diff_months_cont = intck('month', mydate1, mydate2, 'C');
run;
```
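The difference between the discrete (`'D'`, boundary count) and continuous (`'C'`, full intervals) methods can be mimicked in Python; this is a simplified sketch of INTCK's month logic that ignores end-of-month edge cases:

```python
from datetime import date

def months_discrete(a: date, b: date) -> int:
    """Default INTCK: count month *boundaries* crossed between a and b."""
    return (b.year - a.year) * 12 + (b.month - a.month)

def months_continuous(a: date, b: date) -> int:
    """'C' method: count *full* months elapsed (sketch only; SAS also
    handles end-of-month cases this simple day comparison ignores)."""
    full = months_discrete(a, b)
    if b.day < a.day:
        full -= 1
    return full

print(months_discrete(date(2020, 6, 22), date(2020, 7, 20)))    # 1
print(months_continuous(date(2020, 6, 22), date(2020, 7, 20)))  # 0
```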

The difference in weeks works the same way as the month calculation.

```sas
data date_w;
  set date_init;
  diff_weeks_disc = intck('week', mydate1, mydate2, 'D');
  diff_weeks_cont = intck('week', mydate1, mydate2, 'C');
run;
```

Additionally, to answer questions such as "what is the date 5 days from now?", the `INTNX` function is the usual tool.

```sas
data _null_;
  next_day = intnx("day", today(), 1);
  previous_day = intnx("day", today(), -1);
  add_5_days = intnx("day", today(), 5);
  put "Today: %sysfunc(today(), EURDFWKX28.)";
  put "Next Day: " next_day EURDFWKX28.;
  put "Previous Day: " previous_day EURDFWKX28.;
  put "Today +5 Days: " add_5_days EURDFWKX28.;
run;
```

To extract the day from a SAS date, the `day` function is the easy way: it returns the day number within the month. Likewise, `week` returns the week number within the year, and `year` returns the year itself. If you want a combined month-year value, the `put` function with a suitable format is more appropriate.

```sas
data date_extr;
  set date_init(keep=mydate1);
  ext_day=day(mydate1);
  ext_week=week(mydate1);
  ext_year=year(mydate1);
  monthyear=put(mydate1, monyy7.);
run;
```

Conversely, the `MDY` function combines separate day, month, and year values into a SAS date value; note that its arguments are given in month, day, year order.

In R, the `lubridate` package provides many convenient functions to create, convert, and manipulate dates. Use `lubridate::ymd()`, `lubridate::mdy()`, or `lubridate::dmy()` to convert character strings to dates, according to the component order.

`ymd(c("1998-3-10", "2018-01-17", "18-1-17"))`

To extract the day, month, and year numbers from a date, the following functions are useful.

```r
> mday(as.Date("2021-11-09"))
[1] 9
> month(as.Date("2021-11-09"))
[1] 11
> year(as.Date("2021-11-09"))
[1] 2021
```

You can add to or subtract from dates directly, but sometimes the `difftime` function is the better choice.

```r
> x <- as.Date(c('2021-09-01', '2021-11-01'))
> c(difftime(x[2], x[1], units='days'))
Time difference of 61 days
```
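The same 61-day difference falls out of plain date subtraction in Python (illustrative):

```python
from datetime import date

# Same arithmetic as R's difftime(..., units="days")
delta = date(2021, 11, 1) - date(2021, 9, 1)
print(delta.days)  # 61
```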

In addition, the duration helpers `dseconds()`, `dminutes()`, `dhours()`, `ddays()`, `dweeks()`, and `dyears()` are convenient for date arithmetic, for example to get the date 5 days from a given day.

```r
> d <- ymd("2021-11-09")
> d + ddays(5)
[1] "2021-11-14"
```

- EXTRACT DAY, MONTH AND YEAR FROM DATE OR TIMESTAMP IN SAS
- How to Easily Convert a Number to a Date in SAS
- How to Easily Calculate the Difference Between Two SAS Dates
- Complete Guide for SAS INTNX Function with Examples

**Please indicate the source**: http://www.bioinfo-scrounger.com

This post is based on **"Chapter 8 A graphical compendium"** in **<SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>**.

I believe that data science is about more than building predictive models; data visualization is also an integral part, especially when done in a convincing way.

From my initial exposure to SAS, its plotting features are less practical and convenient than R's. If the goal is just the standardized figures of clinical trials, SAS is adequate, but for complex or more personalized plots it falls short. One main reason, I think, is that SAS is not open source, so far fewer developers are willing to contribute and share.

Although SAS plotting is not particularly powerful, we still need to learn it, and look at the R equivalents along the way. Below is a collection of commonly used plot types.

Bar plots: in SAS, use `sgplot` with the `vbar` statement. `styleattrs` defines the group colors, `barwidth` sets the bar width, `transparency` the transparency, `stat` the statistic shown on the y-axis, and `grid` adds grid lines.

```sas
/* Barplot */
proc sgplot data=sashelp.class;
  styleattrs datacolors=(red green);
  /* title "Sex vs. Mean weight."; */
  vbar sex / response=weight barwidth=0.75 group=sex transparency=0.5 stat=mean;
  yaxis grid;
run;
```

To add error bars, include the `limits=both` option.

```sas
**** Barplot with errorbar;
proc sgplot data=sashelp.cars;
  vbar type / response=msrp stat=mean limitstat=stddev limits=both;
run;
```

For a stacked bar plot, add the `groupdisplay=stack` option, optionally combined with `stat=percent`, `categoryorder`, and so on.

```sas
**** Stacked barplot, forcing on groupdisplay param;
proc sgplot data=sashelp.cars(where=(Type="SUV"));
  vbar Type / Group=Origin groupdisplay=stack stat=percent categoryorder=respdesc;
run;
```

For grouped (clustered) bar plots, add the `group` and `groupdisplay=cluster` options.

```sas
**** Barplot by groups, the groupdisplay=cluster;
proc sgplot data=sashelp.cars;
  vbar origin / group=DriveTrain groupdisplay=cluster;
run;
```

To make the y-axis of a grouped bar plot show percentages within each group:

```sas
/* Define the percent within each group, not overall */
proc sgplot data=sashelp.cars pctlevel=group;
  vbar origin / group=DriveTrain groupdisplay=cluster stat=percent;
run;
```

In R, the base function is `barplot()`; alternatively use a plotting package, such as `ggplot2::geom_bar()` from the `ggplot2` package or `ggpubr::ggbarplot()` from the `ggpubr` package. To see what base `barplot()` can draw, try `example(barplot)`; for ggplot2 and ggpubr resources, see the reference below.

https://rpkgs.datanovia.com/ggpubr/reference/index.html

```r
ggbarplot(df, "dose", "len",
          fill = "dose", color = "dose",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"))
```

Next, exporting and saving the generated image in SAS: `ods` statements specify the output location, the image type, and the file name. (Note: if you do not want to customize the color and transparency via `fillattrs`, you can simply set `fillattrs=GraphData2`.)

```sas
ods _all_ close;
ods listing gpath="C:\Plots";
ods graphics / imagename="barplot" imagefmt=png height=10cm width=15cm;
proc sgplot data=sashelp.class;
  vbar sex / response=weight barwidth=0.5 fillattrs=(color=skyblue) transparency=0.5 stat=mean;
  yaxis grid;
run;
ods listing close;
```

To output to RTF (for report generation this really is SAS's strength, especially standardized report styles), add `ods rtf` to specify the output destination and path; `ods graphics / noborder` removes the border from the output graph.

```sas
ods _all_ close;
ods rtf file="C:\Plots\barplot.rtf";
ods graphics / noborder;
ods graphics / outputfmt=JPEG;
ods escapechar="^";
options nodate nonumber pageno=1;
proc sgplot data=sashelp.class;
  vbar sex / response=weight barwidth=0.5 fillattrs=GraphData2 transparency=0.5 stat=mean;
  yaxis grid;
run;
ods rtf close;
```

The dot plot in SAS is a bit different from R's: its response axis seems to support only numeric variables (I am not sure whether I have this right), and a corresponding `stat` option is required. In that case, how is it different from a scatter plot? I would rather have a categorical x-axis. Below is a SAS dot plot with error bars, where the x-axis shows the mean of Height; to add grouping (colors, shapes, and so on), add a `group=` variable.

```sas
proc sgplot data=sashelp.class;
  dot Age / response=Height stat=mean limitstat=stddev numstd=1;
run;
```

In R, dot plots are generally used with a categorical x-axis; the `ggdotplot()` function from the `ggpubr` package works well, and its `add = "boxplot"` argument adds a boxplot to the dot plot, as shown below:

```r
library(ggpubr)
ggdotplot(ToothGrowth, "dose", "len",
          add = "boxplot", color = "dose", fill = "dose",
          palette = c("#00AFBB", "#E7B800", "#FC4E07"))
```

To combine a boxplot with a dot plot in SAS, use the `vbox` and `scatter` statements together, adding `jitter` to avoid overlapping points.

```sas
proc sgplot data=sashelp.cars;
  vbox MPG_City / category=Cylinders boxwidth=0.5 nooutliers;
  scatter x=Cylinders y=MPG_City / jitter transparency=0.6 markerattrs=(color=red symbol=CircleFilled);
run;
```

Histograms are generally used to look at the distribution of the data; in SAS:

```sas
proc sgplot data=sashelp.cars;
  histogram MPG_City / nbins=30;
  xaxis values=(0 to 70 by 10);
  density MPG_City;
run;
```

To overlay several histograms, add the `group` option, or overlay multiple `histogram` statements as below.

```sas
**** overlapping histogram by two columns;
proc sgplot data=sashelp.cars;
  histogram MPG_City / binstart=0 binwidth=2 transparency=0.5;
  histogram MPG_Highway / binstart=0 binwidth=2 transparency=0.5;
run;
```

In R, the base function `hist()`, the `ggplot2` package, or the `ggpubr` package can all do this, e.g.:

```r
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each=200)),
  weight = c(rnorm(200, 55), rnorm(200, 58)))
gghistogram(wdata, x = "weight", y = "..density..",
            add = "mean", rug = TRUE,
            fill = "sex", palette = "jco", add_density = TRUE)
```

Box plots show not only the distribution but also potential outliers; they are widely applicable, flexible to present, and very commonly used. In SAS, I again use the `vbox` statement of `sgplot` as the example:

```sas
proc sgplot data=sashelp.heart;
  vbox Cholesterol / category=DeathCause group=Sex clusterwidth=0.5 boxwidth=0.8
                     meanattrs=(size=5) outlierattrs=(size=5);
  xaxis display=(noline nolabel noticks);
  yaxis display=(noline nolabel noticks);
run;
```

In R, to produce a figure like the one above, you can use the `ggplot2` or `ggpubr` package, as in the `ggpubr::ggboxplot()` call below. `palette = "jco"` applies the JCO journal color palette, an excellent and especially practical feature of `ggpubr`.

```r
ggboxplot(ToothGrowth, "dose", "len",
          fill = "supp", palette = "jco", bxp.errorbar = T)
```

Bubble plots: SAS has the `bubble` statement under `sgplot`, e.g.:

```sas
proc sgplot data=sashelp.cars;
  bubble x=Horsepower y=MPG_Highway size=Cylinders;
run;
```

In R, use the `geom_point()` function from the `ggplot2` package, much as for a scatter plot, mapping a variable to the point size.

Scatter plots are an extremely widely used form of display, usually shown together with fitted curves and other information; in SAS, use the `scatter` statement under `sgplot`. The `attrpriority` option is interesting: if it equals `none`, then in addition to the colors of markers and lines, the line patterns and marker symbols also follow the group variable; if it equals `color`, only the colors of markers and lines follow the group variable.

```sas
ods graphics on / border=off attrpriority=none height=12cm width=15cm;
proc sgplot data=sashelp.iris;
  styleattrs datacontrastcolors=(magenta orange brown) datasymbols=(star square triangle);
  scatter x=PetalLength y=PetalWidth / group=Species;
  reg x=PetalLength y=PetalWidth / group=Species clm;
run;
ods graphics off;
```

To add prediction limits (confidence intervals for prediction) and confidence limits (confidence intervals for the population mean), add the `cli` and `clm` options respectively.

In R, to achieve the effect of the figure above, use the `ggscatter()` function from the `ggpubr` package:

```r
ggscatter(iris, x = "Petal.Length", y = "Petal.Width",
          color = "Species", shape = "Species",
          add.params = list(linetype = "Species"),
          add = "reg.line", conf.int = T)
```

Scatter plot matrices give a quick overview of the overall distribution of the data, especially when several variables need to be displayed pairwise. In SAS, use the `matrix` statement of `proc sgscatter`.

```sas
**** Matrix scatter plots;
proc sgscatter data=sashelp.cars;
  matrix MPG_Highway Horsepower EngineSize / diagonal=(histogram normal);
run;
```

In R, for a simple scatter plot matrix you can call `plot()` directly on the matrix-like data, e.g.:

`plot(iris[,-5] , pch=20 , cex=1.5)`

Or use the `pairs()` function:

`pairs(iris[,-5])`

To get histograms on the diagonal as in SAS, use the `pairs.panels()` function from the `psych` package, e.g.:

```r
psych::pairs.panels(
  iris[,-5],
  method = "pearson",  # correlation method
  hist.col = "#00AFBB",
  density = TRUE,      # show density plots
  ellipses = FALSE,    # show correlation ellipses
  smooth = FALSE)
```

A common need is a point-and-line plot, grouped and with error bars; in SAS:

```sas
**** Line plot with errorbar in different groups;
proc sgplot data=sashelp.cars;
  vline Cylinders / response=msrp stat=mean group=type limits=both limitstat=stddev markers;
run;
```

In R, use the `ggline()` function from the `ggpubr` package:

```r
ggline(ToothGrowth, x = "dose", y = "len",
       color = "supp", linetype = "supp", add = "mean_se",
       palette = c("#00AFBB", "#E7B800"))
```

ROC curves are common in diagnostic studies. In SAS, the simple approach is the `plots` option of `proc logistic`; alternatively, export the sensitivity and specificity data and draw the curve with a `series` statement.

```sas
ods graphics on;
proc logistic data=sashelp.heart plots(only)=roc;
  class sex Chol_Status BP_Status Weight_Status Smoking_Status;
  model status(event="Dead")=AgeAtStart sex Chol_Status BP_Status Weight_Status Smoking_Status;
  ods output roccurve=ROCdata;
run;
ods graphics off;

** Or use sgplot procedure;
ods graphics on / border=off height=12cm width=15cm;
proc sgplot data=ROCdata aspect=1;
  xaxis label="False Positive Fraction" values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
  yaxis label="True Positive Fraction" values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
  lineparm x=0 y=0 slope=1 / transparency=.3 lineattrs=(color=gray);
  series x=_1mspec_ y=_sensit_;
  inset ("Group 1 AUC" = "0.6891") / border opaque position=bottomright;
  title "ROC curves for both groups";
run;
ods graphics off;
```
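The AUC under an ROC curve is equivalent to the Mann-Whitney statistic: the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. A self-contained Python sketch with made-up scores (ties counted as half), purely to illustrate the definition:

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```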

In R, several popular packages compute AUC values and draw ROC curves, such as `pROC` and `ROCR`, but I still feel ROC curves are best drawn with ggplot2, together with the `plotROC` package.

In survival analysis, the Kaplan-Meier plot is the key way to show patients' survival over time. In SAS, the simple approach is the `plots` option of `proc lifetest`; alternatively, export the survival estimates and draw them with a `series` statement.

```sas
**** Kaplan-Meier plot;
proc lifetest data=sashelp.bmt plots=survival;
  time t*status(0);
  strata Group;
run;

** or Failure probability;
proc lifetest data=sashelp.bmt plots=survival(failure);
  time t*status(0);
  strata Group;
run;
```

In R, the survival analysis packages I use most are `survival` and `survminer`, the former for analysis and the latter for visualization; they work really well.

References:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)


This post is based on **"Chapter 3 Statistical and mathematical functions"**, **"Chapter 4 Programming and operating system interface"** and **"Chapter 5 Common statistical procedures"** in **<SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>**.

This post looks at how common statistical methods are implemented in R and SAS, mainly the latter.

For example, given a Z-score, to get the integral of the normal CDF from negative infinity up to that value: in R use `pnorm()`, in SAS the `cdf` function.

```
# R code
y = pnorm(1.96, mean=0, sd=1)

# SAS code
data normal;
  y = cdf("NORMAL", 1.96, 0, 1);
run;
```

Other distributions work similarly; each has a fairly fixed calling pattern that is easy to look up.

Setting the random seed:

```
# R code
set.seed(12345)

# SAS code
call streaminit(12345);
```

Generating random numbers: for normally distributed values, R has `rnorm()`; SAS has `normal` or `rand`, but these return a single value, so a loop is needed to generate several.

```
# R code
rnorm(10)

# SAS code
data rand;
  call streaminit(12345);
  do i=1 to 10;
    x=rand("normal", 0, 1);
    output;
  end;
run;
```

The basic mathematical functions are much the same in R and SAS.

Rounding is especially common; in SAS, for example:

```sas
data sign;
  nextintx = ceil(3.49);
  justintx = floor(3.49);
  roundx = round(3.49, 0.1);
  roundint = round(3.49, 0.01);
  movetozero = int(3.49);
run;
```
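For comparison, the Python standard library offers the same family of rounding operations; this mapping is my own illustration, and note that Python's `round()` resolves exact ties by rounding half to even, which can differ from SAS's `ROUND`:

```python
import math

# Rough Python counterparts of the SAS calls above (illustrative mapping)
print(math.ceil(3.49))    # 4   (SAS ceil)
print(math.floor(3.49))   # 3   (SAS floor)
print(round(3.49, 1))     # 3.5 (SAS round(x, 0.1); beware half-to-even ties)
print(math.trunc(3.49))   # 3   (SAS int: truncate toward zero)
```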

R additionally has `signif()`; R's `trunc()` should match SAS's `int` (I believe).

In R, the most common loop is the for loop, or functions such as the `apply` family or `ddply`; in SAS it is `do ... end`, `do while`, `do until`, and so on.

```
# R code
x <- numeric(10)
for (i in 1:10){
  x[i] <- rnorm(1)
}

# SAS code
data;
  do i = 1 to 10;
    x = normal(0);
    output;
  end;
run;
```

To generate a sequence such as 1, 3, 5, 7, 9, R only needs the `seq()` function; SAS is slightly more work, needing a loop.

```sas
data ds;
  do x = 1 to 9 by 2;
    output;
  end;
run;
```

To generate repeated numbers or characters, R's `rep()` function does it; in SAS, again a loop:

```
# R code
data.frame(x1 = rep(1:3, each=2),
           x2 = rep(c("M","F"), times=3))

# SAS code
data ds;
  do x1 = 1 to 3;
    do x2 = "M","F";
      output;
    end;
  end;
run;
```
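For comparison, the same sequences can be built in Python (illustrative only):

```python
from itertools import chain, repeat

# R seq(1, 9, by=2) / SAS "do x = 1 to 9 by 2"
print(list(range(1, 10, 2)))   # [1, 3, 5, 7, 9]

# R rep(1:3, each=2)
print(list(chain.from_iterable(repeat(i, 2) for i in [1, 2, 3])))
# [1, 1, 2, 2, 3, 3]

# R rep(c("M","F"), times=3)
print(["M", "F"] * 3)          # ['M', 'F', 'M', 'F', 'M', 'F']
```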

SAS's approach to summary analysis always makes me feel it is less a programming language than an analysis tool. SAS "thoughtfully" bundles many summary functions (mean, stddev, max, min, and so on) into a single proc, which looks convenient but costs the language flexibility and makes it rigid; the philosophy is quite different from other programming languages, which feels odd to me. Just a small rant.

In R or Python, a summary analysis usually just calls whichever functions are needed and collects the output into a data frame or array.

In SAS, you call `proc means` or `proc univariate`, as follows:

Using proc means for summary statistics

```sas
/* proc means produces printed output and an output dataset */
proc means data=sashelp.iris N mean stddev max min;
  class Species;
  var PetalLength PetalWidth SepalLength SepalWidth;
  output out=ds;
run;
```

Using proc univariate for detailed summary statistics (note: use ODS to capture the results):

```sas
ods output BasicMeasures=ss;
/* ods trace on/listing; */
proc univariate data=sashelp.iris all;
  class Species;
  var PetalLength PetalWidth SepalLength SepalWidth;
run;
/* ods trace off; */
```

For example, to get quantiles (25%, 50%, 75%, 95%) with `univariate`:

```sas
ods output Quantiles=qt;
proc univariate data=sashelp.iris all;
  var PetalLength PetalWidth SepalLength SepalWidth;
run;
```

Or customize which percentiles are output via the `output` statement:

```sas
proc univariate data=sashelp.iris;
  class Species;
  var PetalLength;
  output out=iris_percentile pctlpts=0,25,50,75,95,100 pctlpre=P_;
run;
```

Percentiles can also be computed with `means`:

```sas
proc means data=sashelp.iris p25 p50 p75 p95;
  class Species;
  var PetalLength;
  output out=perc p25=p_25 p50=p_50 p75=p_75 p95=p_95;
run;
```

In R it is the `quantile()` function. But beware: with the default arguments, as below, the results differ from SAS.

`quantile(c(1:10), c(0.25,0.5,0.75,0.95))`

You must specify the quantile algorithm, e.g. `type = 3`, to reproduce the SAS results:

`quantile(c(1:10), c(0.25,0.5,0.75,0.95), type = 3)`
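The two conventions are easy to state precisely. Below is a Python sketch of R's default type 7 (linear interpolation between order statistics) next to type 3 (the SAS-compatible nearest rule), checked on the values 1 through 10; this is my own re-implementation for illustration:

```python
import math

def quantile_type7(x, p):
    """R's default (type 7): linear interpolation between order statistics."""
    x = sorted(x)
    h = (len(x) - 1) * p
    j = math.floor(h)
    g = h - j
    return x[j] + g * (x[min(j + 1, len(x) - 1)] - x[j])

def quantile_type3(x, p):
    """R's type 3, the SAS-compatible definition: nearest order statistic,
    with exact halves resolved toward the even-ranked statistic."""
    x = sorted(x)
    n = len(x)
    h = n * p - 0.5
    j = math.floor(h)          # 1-based rank of the lower candidate
    g = h - j
    k = j if (g == 0 and j % 2 == 0) else j + 1
    return x[min(max(k, 1), n) - 1]

data = list(range(1, 11))
print([quantile_type3(data, p) for p in (0.25, 0.5, 0.75, 0.95)])
# [2, 5, 8, 10]  -- matches quantile(..., type = 3)
print([quantile_type7(data, p) for p in (0.25, 0.5, 0.75, 0.95)])
```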

To standardize the columns (variables) of a dataset, i.e. standardizing to a given mean and standard deviation, SAS has `proc standard`, where you can specify the mean and std; the output is the Z-score standardized data.

```sas
proc standard data=sashelp.iris out=iris2 mean=0 std=1;
  var PetalLength PetalWidth;
run;
```

In R you can use the `scale()` function, combined with `apply()` and friends to scale each variable in turn:

`scale(iris$Sepal.Length)`

To compute the mean and its normal-theory confidence interval, SAS again uses `proc means` (this proc really does hold everything); `lclm` and `uclm` request the lower and upper confidence limits respectively.

```sas
proc means data=sashelp.iris lclm mean uclm;
  var PetalLength;
run;
```

In R, you can use functions such as `qt()` for the t statistic or `qnorm()` for the z statistic and plug them into the confidence-interval formula, or simply use `t.test()`:

`t.test(iris$Sepal.Length)$conf.int`

Contingency tables are very common when computing diagnostic metrics. SAS puts this in `proc freq`, as below; `nopercent nocol norow` control how the table is displayed. In R, the usual approach is the `table()` function combined with other functions or formulas.

```sas
data dumy;
  input x y @@;
  datalines;
0 1 1 0 1 1
0 1 1 1 1 0
1 1 1 1 0 0
1 0 0 0 0 1
;
run;

proc freq data=dumy;
  tables x*y / out=freqtable nopercent nocol norow;
run;
```

To compute the chi-square test or relative risk, add the `chisq` and `relrisk` options; in R, use `chisq.test()`:

参数；而R则是用`chisq.test()`

```sas
proc freq data=dumy;
  tables x*y / chisq relrisk;
run;
```

To compute Fisher's exact test, add the `exact` option; in R, use `fisher.test()`.

To compute the kappa statistic (agreement analysis), add the `agree` option. In R, note that base `kappa()` computes a matrix condition number, not Cohen's kappa; packages such as `irr` or `psych` provide the agreement statistic.

```sas
proc freq data=dumy;
  tables x*y / agree;
run;
```

From the table you can also compute diagnostic metrics such as sensitivity and specificity, straight from the usual formulas.
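As an illustration, here is a Python sketch computing sensitivity and specificity from the 2x2 counts of the dummy data above, under the assumption (for illustration only) that `x` is the true status and `y` the test result:

```python
from collections import Counter

# The (x, y) pairs from the SAS datalines above; assume, purely for
# illustration, that x is the true status and y is the test result.
pairs = [(0,1), (1,0), (1,1), (0,1), (1,1), (1,0),
         (1,1), (1,1), (0,0), (1,0), (0,0), (0,1)]
n = Counter(pairs)
tp, fn = n[(1, 1)], n[(1, 0)]   # true positives, false negatives
tn, fp = n[(0, 0)], n[(0, 1)]   # true negatives, false positives
sensitivity = tp / (tp + fn)    # 4/7
specificity = tn / (tn + fp)    # 2/5
print(round(sensitivity, 3), specificity)  # 0.571 0.4
```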

I am still not quite used to this closed-source style: bundling fixed analyses into particular proc steps looks convenient, but in practice it can make an analysis more complicated, even though the help documentation is very detailed.

The above are qualitative statistics; now some quantitative ones. Correlation coefficients, for instance, are very common in diagnostic studies of quantitative assays: in SAS use `proc corr`, in R the `cor()` or `cor.test()` functions.

```
# R code
cor.test(iris$Sepal.Length, iris$Sepal.Width)

# SAS code
proc corr data=sashelp.iris;
  var PetalLength PetalWidth;
run;
```

Normality tests, such as the Shapiro-Wilk test and the Kolmogorov-Smirnov test:

```sas
proc univariate data=sashelp.iris normal;
  var PetalLength;
run;
```

In R, pick the function for the chosen method, e.g. `shapiro.test(x)`, or use one of the packages that bundle a series of such tests.

T-tests: SAS runs a two-sample t-test from one classification variable and one response variable. **Note: the output reports results under both equal and unequal variances.**

```sas
data scores;
  input Gender $ Score @@;
  datalines;
f 75 f 76 f 80 f 77 f 80 f 77 f 73
m 82 m 80 m 85 m 85 m 78 m 87 m 82
;
run;

proc ttest data=scores;
  class Gender;
  var Score;
run;
```

In R it is `t.test(y ~ x, data)`, or `t.test(y1, x1)`, and so on.

The t-test above is parametric and assumes normality; sometimes a non-parametric test such as the Wilcoxon test or the Kolmogorov-Smirnov test is more appropriate. In SAS, use `proc npar1way`.

```sas
proc npar1way data=scores wilcoxon edf;
  class Gender;
  var Score;
run;
```

In R, use `wilcox.test()`.

The log-rank test appears frequently with Kaplan-Meier plots and the Cox proportional hazards model; in SAS, use `proc lifetest`.

```sas
proc lifetest data=sashelp.BMT plots=survival(atrisk=0 to 2500 by 500);
  time T*Status(0);
  strata Group / test=logrank adjust=sidak;
run;
```

In R, use the `survdiff()` function from the `survival` package.

References:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)


This post is based on section **2.3 Data management** and **2.4 Date and time variables** in <SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>.

In R, "dataset operations" usually means working with data frames, though other data types come up as well. In SAS, everything is the dataset; compared with other languages, SAS really has very few data types.

Most data handling comes down to combination, collation, and subsetting.

In R this is simple: row indexing extracts any rows you like. In SAS you can use the `firstobs` and `obs` dataset options.

Extract the rows starting from row i:

```sas
data class;
  set sashelp.class (firstobs=5);
run;
```

Extract the first i rows:

```sas
data class;
  set sashelp.class (obs=5);
run;
```

Extract rows i through j:

```sas
data class;
  set sashelp.class (firstobs=5 obs=10);
run;
```

This means filtering the rows of a dataset by variable values; in R the usual function is `dplyr::filter()`, while in SAS you use a `where` clause.

```
# R code
filter(iris, Species == "setosa")

# SAS code
data class;
  set sashelp.class(where=(Sex="M"));
run;
```

In R this is again equivalent to row filtering, so `dplyr::filter()` works; in SAS you can split a dataset with several IF statements.

```sas
data female male;
  set sashelp.class;
  if Sex="F" then output female;
  if Sex="M" then output male;
run;
```

This means selecting the columns (variables) of a dataset; in R the usual function is `dplyr::select()`, and in SAS the `keep` option, e.g.:

```
# R code
dplyr::select(iris, c("Sepal.Length", "Sepal.Width"))

# SAS code
data class;
  set sashelp.class(keep=Name Sex Age);
run;
```

Random sampling is quite useful in simulation. Logically there are many ways to do it, similar to subsetting a dataset, but ready-made functions exist: in R consider `dplyr::sample_n()`, and in SAS `proc surveyselect`.

```
# R code
# replace = FALSE
dplyr::sample_n(iris, 10)

# SAS code
/* simple random sampling without replacement */
proc surveyselect data=sashelp.class out=outds n=10 method=srs;
run;
```

Printing the observation number, i.e. the row number or row index: there are certainly many ways, e.g. the automatic `_n_` variable in SAS and `row.names()` in R.

```
# R code
row.names(iris)

# SAS code
data _null_;
  set sashelp.class;
  put _n_;
run;
```

Deduplication splits into two cases:

- deduplicate over the whole dataset
- deduplicate by specified variables (columns)

In R, use the `dplyr::distinct()` function; in SAS, use `proc sort` with the `nodupkey` option.

```
# R code
unique(iris)  # all duplicate rows removed
dplyr::distinct(iris, Species, .keep_all = TRUE)  # deduplicate by Species

# SAS code
/* Remove duplicated values by all variables */
proc sort data=sashelp.retail out=retail_without_duplicated_values nodupkey;
  by _all_;
run;

/* Remove duplicated values by Year variable */
proc sort data=sashelp.retail out=retail_with_unique_value nodupkey;
  by Year;
run;
```

The opposite of keeping unique values: return the rows that are duplicated. In R, `duplicated()` handles vectors, while data frames need a few functions combined; in SAS, it is again `proc sort`, with the `nouniquekey` option.

```
# R code
library(dplyr)
iris %>% add_count(Species) %>% filter(n > 1)

# SAS code
/* Return duplicated values by Year variable */
proc sort data=sashelp.retail out=retail_with_duplicated_value nouniquekey;
  by Year Month;
run;
```

This is a very common need in data processing, commonly called converting between long and wide data, i.e. wide-to-long or long-to-wide. In R, the nicest functions nowadays are `tidyr::pivot_longer()` and `tidyr::pivot_wider()`; in SAS, use `proc transpose`.

```
# R code
# wide to long transform
relig_income %>%
  pivot_longer(!religion, names_to = "income", values_to = "count")

# long to wide transform
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)

# SAS code
/* proc transpose, the same as pivot for long to wide or wide to long */
data wide;
  input SID $ Programming State English;
  datalines;
S01 98 100 80
S02 84 98 94
S03 89 92 88
;
run;

/* wide to long transform */
proc transpose data=wide out=long(rename=(_name_=Coursename col1=Score));
  var Programming State English;
  by SID;
run;

/* long to wide transform */
proc transpose data=long out=Rewide(drop=_name_);
  var Score;
  by SID;
  id Coursename;
run;
```
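The reshaping logic itself is simple; here is a pure-Python sketch of the same wide-to-long and long-to-wide round trip (using two of the score rows from the SAS example, for illustration only):

```python
# Wide records, as in the SAS example above
wide = [
    {"SID": "S01", "Programming": 98, "State": 100, "English": 80},
    {"SID": "S02", "Programming": 84, "State": 98, "English": 94},
]

# wide -> long (one row per SID x course), like pivot_longer / proc transpose
long = [
    {"SID": r["SID"], "Coursename": c, "Score": r[c]}
    for r in wide
    for c in ("Programming", "State", "English")
]

# long -> wide again, like pivot_wider
rewide = {}
for row in long:
    rewide.setdefault(row["SID"], {"SID": row["SID"]})[row["Coursename"]] = row["Score"]
print(list(rewide.values()) == wide)  # True
```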

To split one dataset into datasets with different variables, R would use column indexing; SAS can combine `keep` and `output`, e.g.:
搭配，如：

```sas
/* one dataset to two, each keeping the same observations */
data class1(keep=name sex) class2(keep=age height weight);
  set sashelp.class;
  output class1;
  output class2;
run;
```

Combining datasets divides into row-wise and column-wise combination, corresponding to R's `rbind()` and `cbind()`. SAS offers several methods, such as `set` and `merge`.
```sas
/* Set */
data class;
  set class1;
  set class2;
run;

/* Sometimes the same as merge */
data class;
  merge class1 class2;
run;
```

The above are simple examples; concatenation has more elaborate forms too, e.g. besides `set` there is also `proc append`, or SQL's UNION.
方法，或则SQL的union等

Sometimes we want to join two datasets on specified columns (variables), i.e. SQL's left join, right join, or inner join, in other words merging datasets. In R you can use `left_join()`, `right_join()`, `inner_join()` and so on, very close to SQL and easy to remember; in SAS, either sort first and then merge, or use SQL.
等，跟SQL很类似，非常好记；在SAS中，要么先sort再merge，要么用sql语句

```
# R code
band_members %>% inner_join(band_instruments, by = "name")

# SAS code
/* The correct way is to sort first and then merge */
data class1(keep=name sex) class2(keep=name age height weight);
  set sashelp.class;
  output class1;
  output class2;
run;
proc sort data=class1; by name; run;
proc sort data=class2; by name; run;
data class;
  merge class1 class2;
  by name;
run;

/* proc sql to implement left join, right join, inner join */
data class1(keep=name sex) class2(keep=name age height weight);
  set sashelp.class;
  output class1;
  if _n_ in (1,5,10,15) then output class2;
run;
data class2;
  set class2;
  if name="Janet" then name="Janey";
run;
proc sql;
  create table class_left as
  select a.*, b.*
  from class1 as a
  left join class2 as b
  on a.name=b.name;
quit;
```
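The left-join logic itself can be sketched in a few lines of pure Python (a toy illustration using two names that happen to appear in `sashelp.class`):

```python
# Minimal left join on "name", mirroring the merge/SQL logic above
class1 = [{"name": "Alfred", "sex": "M"}, {"name": "Alice", "sex": "F"}]
class2 = [{"name": "Alfred", "age": 14}]

lookup = {r["name"]: r for r in class2}
joined = [{**a, **lookup.get(a["name"], {})} for a in class1]
print(joined)
# [{'name': 'Alfred', 'sex': 'M', 'age': 14}, {'name': 'Alice', 'sex': 'F'}]
```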

The following operations are specific to SAS datasets themselves:

```sas
/* select, copy, change name, delete */
proc datasets;
  copy in=sashelp out=work;
  select class;
quit;

proc datasets lib=work;
  change class=student;
quit;

proc datasets lib=work;
  delete student;
quit;

/* delete all datasets */
proc datasets lib=work kill memtype=data;
run;

/* only save class data */
proc datasets lib=work;
  save class;
run;
```

To create date variables, R commonly uses `as.Date()`; SAS converts with `input` or uses `mdy`. Note that the former returns a Date object, while SAS returns the number of days since 1960-01-01, a numeric value.
```
# R code
as.Date("2014-04-29")
as.Date(Sys.time())

# SAS code
data dt;
  dayvar = input("04/29/2014", mmddyy10.);
  dayvar2 = mdy(4, 29, 2014);
  todays_date = today();
run;
```

References:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)


This post is based on sections 2.1 and 2.2 of Data management in <SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>.

In R, how you format data depends on its type, and there are flexible ways to handle each case. In SAS, there is `proc format`, dedicated to exactly this need.

```sas
proc format;
  value sex
    1 = "Male"
    2 = "Female"
    . = "Unknown";
  value bp
    140-high = "high"
    135-140 = "mid"
    other = "low";
  value $gp
    "A1" = "C";
run;

data report;
  format gender sex. sbp bp. group $gp.;
  input id gender sbp group $;
  datalines;
100001 1 160 A
100002 2 133 A1
100003 . 120 B
;
run;
```

As the example shows, `proc format` mainly affects the output format of the data; there is also `informat`, which controls how SAS reads data in, e.g.:
是影响SAS输入数据的格式，如：

```sas
data fs;
  informat x $2.;
  input x $ y;
  x1 = x + 1;
  y1 = y + 1;
  datalines;
1100 1200
;
run;
```

A SAS dataset that carries formats cannot be opened when the format catalog is missing or has the wrong version; the formats must be loaded in each session before use, either via `fmtsearch=()` or from a dataset, e.g.:

`option fmtsearch=(libname)`

```sas
proc format library=work cntlout=work.fmt;
  value rang
    1-<20 = "normal"
    low-<1 = "abnormal"
    20<-high = "abnormal, not clinically significant"
    . = "MISS";
run;

data temp1;
  input ORRES1;
  LBTESTCD1=put(ORRES1, rang30.);
  cards;
100
209.232
218
-1
;
run;
```

When the format file is missing, the main options are:

- discard the formats entirely: `option nofmterr;`
- keep a format catalog like the one above, located via the `library=` option
- keep the formats as a dataset, via the `cntlout=` option

In R, to print a variable or value from a dataset (data frame, list, or vector), you simply reference it explicitly; there are many flexible ways, e.g.:

```r
print(ds$col1)
head(ds, 5)
ds
```

In SAS, you must first declare a dataset, explicitly or implicitly, in a data step or proc step, and then operate on it.

```sas
proc print data=sashelp.class (obs=5);
  var Name Age Sex;
run;
```

In R, the `comment()` function can attach a label to a variable, though it does not seem widely used day to day. In SAS, clinical-trial requirements mean variables are routinely given labels, e.g.:

```sas
proc print data=sashelp.class(obs=5) label;
  label Age="This is a modified label";
run;
```

In R, there are several ways to rename a dataset's columns, e.g.:

```r
names(df)[1] <- c("col1")
colnames(df)[1:2] <- c("col1", "col2")
dplyr::rename()
```

In SAS, use the `rename=` dataset option:

```sas
data class;
  set sashelp.class (rename=(Sex=Gender));
run;
```

In R, convert to character with `as.character()` and to numeric with `as.numeric()`. In SAS, most conversion is done with `put` and `input`. Taking the sashelp `class` dataset as an example: Age, Height, and Weight are numeric (numeric values are right-aligned), and `put` converts them from numeric to character:

```sas
/* convert from numeric to character with put */
data class;
  set sashelp.class (obs=5);
  C_age=put(Age, 8.);
  C_height=put(Height, 8.);
  C_weight=put(Weight, 8.);
  drop Age Height Weight;
run;
```

Then `input` converts character back to numeric:

实现字符串型转化为数值型

```sas
data class2;
  set work.class;
  age=input(C_age, 8.);
  keep Name Sex age;
run;
```

There is also an even simpler way to achieve the same thing:

```sas
data class2;
  set work.class;
  age=C_age+0;
  keep Name Sex age;
run;
```

In R, there are several ways to create a conditional variable in a data frame; I usually use `dplyr::mutate()` together with `if_else()` and similar helpers. SAS is similar: use IF statements in a data step.

```sas
data class;
  set sashelp.class;
  if Age>15 and Sex eq "M" then gp="GroupA";
  else if Age<=15 and Sex eq "F" then gp="GroupB";
  else gp="";
run;
```

String manipulation: between them, the SAS and R string functions cover nearly every need. For example, extract characters 2 through 4 of a string:

```
# R code
stringr::str_sub("abcdef", start = 2, end = 4)

# SAS code
data _null_;
  str=substr("abcdef", 2, 3);
  put str=;
run;
```

One slightly uncomfortable thing about SAS: every time you want to print something, you need a data step or a proc. That is a bit cumbersome, and not very "programming"!

Check whether a string matches a given pattern:

```
# R code
stringr::str_detect("Hello world!", "world")
stringr::str_match("Hello world!", "world")

# SAS code
data _null_;
  /* Use PRXMATCH to find the position of the pattern match. */
  position=prxmatch("/world/", "Hello world!");
  put position=;
  if position then put "word, Match!";
run;
```

Extract the matched substring from a string:

```
# R code
stringr::str_match("AE 2021-01-01", "\\w+\\s+(.*)")[1,2]

# SAS code
data _null_;
  re=prxparse("/\w+\s+(.*)/");
  if prxmatch(re, "AE 2021-01-01") then do;
    date=prxposn(re, 1, "AE 2021-01-01");
  end;
  put date=;
run;
```
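The same capture-group extraction works in Python's `re` module (illustrative):

```python
import re

# Same idea as PRXMATCH/PRXPOSN: capture everything after the first word
m = re.match(r"\w+\s+(.*)", "AE 2021-01-01")
if m:
    print(m.group(1))  # 2021-01-01
```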

Replace characters within a string: in R the `stringr` package again helps, while SAS has `tranwrd`:

```
# R code
stringr::str_replace_all("a_b_c", "_", "+")

# SAS code
data _null_;
  my_string = "a_b_c";
  my_new_string = tranwrd(my_string, "_", "+");
  put "My String: " my_string;
  put "My New String: " my_new_string;
run;
```

Get the length of a string: in R use `nchar()` (`length()` returns the length of a vector); in SAS, the `length` function:

```
# R code
nchar("12345")

# SAS code
data _null_;
  len=length("12345");
  put len=;
run;
```

Concatenate strings: `paste()` in R, the `||` operator in SAS:

```
# R code
paste("Hello", "World!")
stringr::str_c("Hello ", "World!")

# SAS code
data _null_;
  newcharvar="Hello " || "World!";
  put newcharvar=;
run;
```

Split strings: in R, the common tools are `strsplit()` or `stringr::str_split()`; in SAS, the basic tool is `scan`, and the `countw` function counts how many words result:

```
# R code
strsplit("Smith John", " ")

# SAS code
data have;
  name="Smith John";
  lastname=scan(name, 1, " ");
  firstname=scan(name, 2, " ");
run;
```

Test whether a string or variable belongs to a vector or set: `%in%` in R, `in` in SAS. The return values differ: the former gives TRUE/FALSE, the latter 1/0.

```
# R code
"a" %in% c("a", "b")

# SAS code
data _null_;
  res=("a" in ("a","b"));
  put res=;
run;
```

References:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)


Later I found the book "SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)", whose content is exactly what I wanted to do: starting from practical needs, especially data management, list the R and SAS implementations side by side. That should make things easier to remember.

So I plan to follow the book's structure and record my SAS notes as a series.

This post is based on Data input and output (chapter 1) from <SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>.

In R, a file ending in `.rda` is generally an R dataset; use `load()` to read it and `save()` to write data to the `.rda` format.

`load(file = "mydata.rda")`

In SAS, datasets end in `.sas7bdat` and are accessed through a `libname`; saving works the same way.

```sas
libname libref "C:\Users\temp";
data ds2;
  set libref.ds;
run;
```

In R, there are many functions and packages for reading TXT or CSV files, such as the base functions `read.table()` and `read.csv()`; I often use the `data.table` and `readr` packages. In SAS, the usual tool is `proc import`.

```sas
/* input csv files */
filename mycsv "C:/demo.csv";
proc import out=mycsv datafile=mycsv dbms=csv replace;
  getnames=yes;
  guessingrows=20;
  datarow=2;
run;

/* input txt files with tab delimiter */
filename mytxt "C:/demo.txt";
proc import out=mytxt datafile=mytxt dbms=dlm replace;
  delimiter='09'x;
  getnames=yes;
run;
```

In R, reading sheets from an external XLSX file is usually done with a package such as `xlsx`; the `openxlsx` package is sometimes also a good choice.

`df <- xlsx::read.xlsx(file = "data.xlsx", sheetIndex = 1)`

In SAS, it is again `proc import`, with `dbms=excel`:

```sas
/* input excel files */
filename myexcel "C:/class.xlsx";
proc import out=myxls datafile=myexcel dbms=excel replace;
run;
```

In R, the common data structures are vectors (`c()`), lists (`list()`), and data frames (`data.frame()`). In SAS, use `input` to enter custom data by hand:

```sas
data wide;
  input SID $ Programming State English;
  datalines;
S01 98 100 80
S02 84 98 94
S03 89 92 88
;
run;
```

In R, view part of the data with `print()` and inspect its structure with `str()`; the SAS counterparts are `proc print` and `proc contents`.

For output in R: native data uses `save()`, TXT/CSV uses `write.table()`, and XLSX uses `xlsx::write.xlsx()` or `openxlsx::write.xlsx()`. In SAS, writing a native dataset mirrors reading one:

```sas
libname libref "C:\Users\temp";
data libref.ds;
  set work.ds;
run;
```

TXT, CSV, and XLS can all be written with `proc export`:

```sas
proc export data=ds outfile="C:/Users/temp/filename.xls" dbms=excel;
run;
```

References:

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)


Git is a version control system used to track changes in computer files. Git's primary purpose is to manage any changes made in one or more projects over a given period of time. It helps coordinate work among members of a project team and tracks progress over time. Git also helps both programming professionals and non-technical users by monitoring their project files.

**What is Gitlab?**

GitLab is a web-based Git repository that provides free open and private repositories, issue-following capabilities, and wikis. It is a complete DevOps platform that enables professionals to perform all the tasks in a project—from project planning and source code management to monitoring and security. Furthermore, it allows teams to collaborate and build better software.

The text above is copied from https://www.simplilearn.com/tutorials/git-tutorial/what-is-gitlab

Nowadays GitLab is widely used in many companies, because it's free, I think...

Generally I'm accustomed to using GitHub Desktop to connect to Git repositories, such as those on GitHub. This was my first time using GitLab, so I searched for some information on Google. Below is my summary.

Open Git Bash (if you don't have it, install Git first).

Confirm if you have an existing SSH key pair.

- Go to your home directory and check the `.ssh/` subfolder to see whether you already have an SSH key.
- Typically the public key is `id_ed25519.pub` for ED25519 (preferred), `id_rsa.pub` for RSA, or `id_ecdsa.pub` for ECDSA.

If you have none of them, generate an SSH key pair, for example for ED25519:

`ssh-keygen -t ed25519 -C "<comment>"`

Copy the contents of your public key file in Git Bash, for example:

`cat ~/.ssh/id_ed25519.pub | clip`

Then add the SSH key to your GitLab account.

Sign in to GitLab, and select **Preferences** from your avatar in the top right corner.

- Select **SSH keys** on the left sidebar.
- Paste the key into the Key box and then select **Add key**.

Verify that you can connect to GitLab. Open Git Bash and run this command, replacing `gitlab.example.com` with your GitLab instance URL:

`ssh -T git@gitlab.example.com`

If this is your first time connecting, you may be asked to verify the authenticity of the GitLab host. Otherwise, you should receive a welcome message.

Now that the general configuration is completed, you can clone your repository as a normal Git process by SSH.

`git clone git@ssh.xxx/test.git`

The above steps refer to https://docs.gitlab.com/ee/ssh/

If you would like to use RStudio, you should paste your ssh key from RStudio into the Gitlab key box.

Then select RStudio **File -> New Project -> Version Control -> Git**.

Copy the `SSH url` you get from GitLab, and paste it into **Repository URL**. Click on **Create Project** to start cloning and setting up the RStudio environment.

Now you can use simple `commit`

and `push`

commands in RStudio as in Git.

Download GitHub Desktop first (if you haven't already).

Go to the specific GitLab repository that you want to clone, and click **Settings -> Access Tokens** on the left sidebar.

Add a project access token, including Name, Expires (optional) and Scopes. After filling them in, click on **Create project access token**.

Copy your access token and store it somewhere as we will use it later.

Open GitHub Desktop, and click **File -> Clone a repository**. Then paste the URL of your repository into the URL field and choose the destination folder (Local path). After that, select **Clone**.

While cloning, a window will pop up asking you to enter a username and password; the password is the access token we created before. Enter it and click **Save and retry** to restart cloning.

Finally, you can work with the repository between local and remote just as you are used to doing with GitHub.

The above steps refer to the post "How to use Github Desktop with Gitlab Repository".

**Please indicate the source**: http://www.bioinfo-scrounger.com

First, I have some Shiny code containing Chinese characters, as shown below:

`library(shiny)ui <- fluidPage( p("这个是一个中文测试"), checkboxGroupInput( inputId = "test", label = "选择一门语言:", choiceNames = c("中文", "英语", "其他"), choiceValues = c(1, 2, 3) ))server <- function(input, output, session) { }shinyApp(ui, server)`

The above Shiny app opens normally in the RStudio IDE.

I then published it to **shinyapps.io** and an **RStudio Connect server**, and both failed: the former reported that package XXX was not found, while the latter reported a problem with character XXX. I also found the following warning on the Deploy page:

`Warning message:In fileDependencies.R(file) : Failed to parse C:/Users/guk8/AppData/Local/Temp/RtmpKeuyz5/file3edc27cd319/app.R ; dependencies in this file will not be discovered.`

From some Google searching, my initial suspicion was that the Chinese character encoding was the cause. But then why didn't shinyapps.io report a Chinese character problem?

Before answering this question, we can try removing the Chinese from the `choiceNames` argument of `checkboxGroupInput` and replacing it with English, as shown below:

`library(shiny)ui <- fluidPage( p("这个是一个中文测试"), checkboxGroupInput( inputId = "test", label = "选择一门语言:", # choiceNames = c("中文", "英语", "其他"), choiceNames = c("Chinese", "English", "Others"), choiceValues = c(1, 2, 3) ))server <- function(input, output, session) { }shinyApp(ui, server)`

**Now the Shiny app can be published normally.**

What is the reason for this? For now, we can assume that shinyapps.io does not support Chinese characters in places other than the label in a Shiny app published from my local PC (Win10).

Now let's look at the R locale on my local PC:

`> Sys.getlocale()[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"`

And the R code encoding is as follows:

`> options()$encoding[1] "UTF-8"`

At this point I suspected the problem was the encoding language of my local PC, so I ran the Shiny app above (with Chinese in the `choiceNames` argument) on RStudio Server on Linux.

The Linux locale is as follows:

`anlan@ubuntu:~$ localeLANG=zh_CN.GBKLANGUAGE=zh_CN:zh:en_US:enLC_CTYPE="zh_CN.GBK"LC_NUMERIC="zh_CN.GBK"LC_TIME="zh_CN.GBK"LC_COLLATE="zh_CN.GBK"LC_MONETARY="zh_CN.GBK"LC_MESSAGES="zh_CN.GBK"LC_PAPER="zh_CN.GBK"LC_NAME="zh_CN.GBK"LC_ADDRESS="zh_CN.GBK"LC_TELEPHONE="zh_CN.GBK"LC_MEASUREMENT="zh_CN.GBK"LC_IDENTIFICATION="zh_CN.GBK"LC_ALL=`

And the R locale is as follows:

`> Sys.getlocale()[1] "LC_CTYPE=zh_CN.UTF-8;LC_NUMERIC=C;LC_TIME=zh_CN.UTF-8;LC_COLLATE=zh_CN.UTF-8;LC_MONETARY=zh_CN.UTF-8;LC_MESSAGES=zh_CN.UTF-8;LC_PAPER=zh_CN.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=zh_CN.UTF-8;LC_IDENTIFICATION=C"`

Under this system, the app can be published to shinyapps.io normally, as shown below:

From some material on Google, it appears that to encode characters like Chinese, the R locale should be changed to UTF-8 (that may not be the only reason, but at least I can now publish a Chinese Shiny app normally from this Linux system...).

Besides shinyapps.io, we can also publish to RStudio Connect.

But we must pay attention to the locale of the RStudio Connect server.

My initial locale under Linux was `zh_CN.GBK`, and I could not publish normally from either RStudio Server or RStudio Desktop on my PC; the error was roughly as follows:

`invalid multibyte stringr at xxx line`

accompanied by the following warnings:

`Warning message:In fileDependencies.R(file) : Failed to parse C:/Users/guk8/AppData/Local/Temp/RtmpKeuyz5/file3edc27cd319/app.R ; dependencies in this file will not be discovered.`

At first I thought the `parse()` function was failing to compile the Chinese characters, but even after converting the Chinese characters to UTF-8 I still could not publish normally.

There is a rather tricky workaround for this problem: write the Chinese characters into an external file, word.txt, read it in with `read.table()` specifying the `fileEncoding = "UTF-8"` argument, assign the Chinese characters to a vector or list, and finally reference it in the Shiny UI code.

I tried this method and it works!!! But it is too cumbersome...

In the end, I guessed the origin of the problem lay with the RStudio Connect server, probably its locale, so I added the following lines to `/etc/default/locale`:

`LANG=en_US.utf8LANGUAGE=en_US:en`

That is, the server's LANG was changed from the original `zh_CN.GBK` to `en_US.utf8`.

The server locale is now as follows:

`anlan@ubuntu:~$ localeLANG=en_US.utf8LANGUAGE=en_US:enLC_CTYPE="en_US.utf8"LC_NUMERIC="en_US.utf8"LC_TIME="en_US.utf8"LC_COLLATE="en_US.utf8"LC_MONETARY="en_US.utf8"LC_MESSAGES="en_US.utf8"LC_PAPER="en_US.utf8"LC_NAME="en_US.utf8"LC_ADDRESS="en_US.utf8"LC_TELEPHONE="en_US.utf8"LC_MEASUREMENT="en_US.utf8"LC_IDENTIFICATION="en_US.utf8"LC_ALL=`

And the R locale on the server is now as follows:

`> Sys.getlocale()[1] "LC_CTYPE=en_US.utf8;LC_NUMERIC=C;LC_TIME=en_US.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_US.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_US.utf8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.utf8;LC_IDENTIFICATION=C"`

Now Chinese Shiny apps on both my PC's RStudio Desktop and RStudio Server can be published to RStudio Connect normally...

Although I can't explain the exact cause, it was probably the server's language encoding that prevented publishing before; at least the problem is solved...

The above is my debugging process over the past two days; I hope it helps when you encounter Chinese character errors in Shiny~

**Please indicate the source**: http://www.bioinfo-scrounger.com

*Defining variables once*

**Why create the analysis data set?** One of the primary reasons for creating analysis data sets is to have variable derivation in a single place so that we can avoid searching and changing each variable in different programs multiple times.

*Defining Study Populations*

- **Intent-to-treat (ITT)**: all patients randomized to study therapy, with the intent that they be treated. Patients are analyzed according to their randomized treatment group.
- **As-treated**: patients analyzed according to the study intervention they actually received; patients may get a treatment that they were not randomized to.
- **Per-protocol**: all patients who did not experience a subjectively defined serious protocol violation during the study.
- **Safety**: all patients who actually received the study drug.

The following analysis-set concepts are excerpted (and translated) from: https://www.163.com/dy/article/G0LAV9TD053438SI.html

Before statistical analysis, the efficacy and safety analysis populations must be defined, and the analysis must follow those definitions. The common analysis sets are as follows:

- **Intention-to-treat population (ITT)**: the so-called ITT analysis, which includes all randomized patients. Note that if a patient is randomized to arm A, that patient must remain in arm A for all subsequent ITT analyses, even if he received arm B's treatment or no treatment at all. This may look baffling, but its most important purpose is to keep the baseline characteristics of the two arms balanced and comparable: through randomization, all variables other than the study factor are balanced and matched away, so the intervention effect can be fully observed. The as-treated set, by contrast, analyzes data according to the treatment patients actually received. For single-arm studies, the ITT concept is usually not involved; it generally refers to all enrolled patients (usually defined by signing the informed consent form).
- **Full analysis set (FAS)**: a subset of the ITT set, called modified ITT (mITT) analysis in some studies. It is the data set obtained after a minimal and unbiased exclusion of randomized subjects, preserving the completeness of the original data and reducing bias, although there is currently no consensus on this issue. ICH E9 describes a few specific reasons that may lead to subjects being excluded from the full analysis set, including: (1) failure to satisfy major entry criteria; (2) never taking a single dose of trial medication; (3) having no data after randomization. The FAS can serve as the primary analysis set.
- **Per-protocol set (PP)**: a subset of the FAS, consisting of subjects with no serious protocol violations regarding entry criteria, treatment received, primary endpoint measurement, and so on; it analyzes only subjects who complied with the intervention. Personally, I think the FAS and PP sets do not differ much, so many studies report either a FAS or a PP analysis together with the ITT analysis. (Note in particular that removing a patient from an analysis set must have a sufficient reason, generally decided jointly by the investigator, sponsor and statistician; and for a blinded study it must happen before unblinding, before unblinding, before unblinding (important things said three times), because modifying data after unblinding raises suspicion of data manipulation and will generally be questioned by regulators (isn't the painful lesson of the NEJ-009 study still fresh?).)
- **Safety set (SS)**: unlike the efficacy sets above, the safety set is used to evaluate the safety of the trial drug. It generally must satisfy three conditions: (1) randomized; (2) took at least one dose of the trial drug; (3) has at least one safety evaluation.

The following is excerpted (and translated) from ICH E9:

- Generally speaking, it is advantageous to demonstrate that the main trial results are insensitive to the choice of analysis set.
- In some cases, it is best to plan to use different analysis sets to explore the sensitivity of the conclusions.
- In superiority trials, the full analysis set is used for the primary analysis (apart from exceptional circumstances), because it tends to avoid the over-optimistic estimates of efficacy that the per-protocol set may produce: including non-compliers in the full analysis set generally reduces the estimated treatment effect.
- However, in an equivalence or non-inferiority trial, use of the full analysis set is generally not conservative, and its role should be considered very carefully.

*Defining Baseline Observations*

"Baseline" is a common clinical concept used to describe the state of a patient before some intervention, so that subsequent comparisons start from a balanced state. For example, with cholesterol measurements, the baseline value is usually the last reading prior to the medical intervention.

Deriving Last Observation Carried Forward (LOCF) variables. For example, you want the last observation carried forward so long as the measures occur within a five-day window before the pill is taken.
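The LOCF rule can be sketched outside SAS as well; here is a minimal Python illustration (the dates and values are made up) that keeps the last measurement taken within a five-day window before dosing:

```python
from datetime import date

# Hypothetical measurements for one subject: (measure_date, value),
# plus the intervention (dosing) date.
measures = [
    (date(2020, 1, 2), 120.0),  # 8 days before dosing: outside the window
    (date(2020, 1, 6), 118.0),  # 4 days before dosing: eligible
    (date(2020, 1, 9), 115.0),  # 1 day before dosing: eligible and latest
]
dose_date = date(2020, 1, 10)

def locf(measures, dose_date, window_days=5):
    """Last observation carried forward: the latest value measured
    within `window_days` days before (and not after) dosing."""
    eligible = [
        (d, v) for d, v in measures
        if 0 <= (dose_date - d).days <= window_days
    ]
    if not eligible:
        return None
    return max(eligible)[1]  # max by date -> last observation

print(locf(measures, dose_date))  # 115.0
```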

*Defining Study Day*

- Calculating a continuous study day: `study_day = event_date - intervention_date + 1`. In this approach, day 1 is the initial intervention day.
- Calculating a study day without day zero: if event_date is earlier than intervention_date, then `study_day = event_date - intervention_date`; if event_date is on or after intervention_date, then `study_day = event_date - intervention_date + 1`. In this approach too, day 1 is the initial intervention day.

The first way is useful to graph or calculate durations that span the day before the therapeutic intervention day.

The second way is more intuitive as the day before intervention is represented by study day “-1”, so it is used more often, especially in CDISC SDTM.

However, the first way is recommended in CDISC ADaM. Whether you are deriving data based on the CDISC models or not, you should calculate study day variables **in a consistent fashion** across a clinical trial or set of trials for an application.
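The two conventions can be sketched in Python (a hypothetical illustration, not CDISC code):

```python
from datetime import date

def study_day_continuous(event, intervention):
    """First way: continuous study day; day 1 is the intervention day,
    so the day before the intervention is day 0."""
    return (event - intervention).days + 1

def study_day_no_zero(event, intervention):
    """Second way: there is no day zero; the day before the
    intervention is day -1."""
    delta = (event - intervention).days
    return delta if event < intervention else delta + 1

start = date(2020, 6, 1)
print(study_day_continuous(date(2020, 5, 31), start))  # 0
print(study_day_no_zero(date(2020, 5, 31), start))     # -1
print(study_day_no_zero(date(2020, 6, 1), start))      # 1
```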

*Windowing Data*

A tag is some descriptive label such as "Visit 5", "Baseline", or "Abnormal". For example, baseline observations must occur before initial drug dosing.

*Transposing Data*

Normalized data may also be described as "stacked", **"vertical"** or "tall and skinny", while non-normalized data are often called "flat", **"wide"** or "short and fat".

So normalized or non-normalized data may mean long data or wide data, which is why we need to transpose data so that the dependent variable is present on the same observation as the independent variables.

In SAS, I think the PROC TRANSPOSE procedure is powerful enough to handle these needs, whether from long data to wide or from wide data to long.

`**** INPUT SAMPLE NORMALIZED SYSTOLIC BLOOD PRESSURE VALUES.**** SUBJECT = PATIENT NUMBER, VISIT = VISIT NUMBER,**** SBP = SYSTOLIC BLOOD PRESSURE.;data sbp;input subject $ visit sbp;datalines;101 1 160101 3 140101 4 130101 5 120202 1 141202 2 151202 3 161202 4 171202 5 181;run;**** TRANSPOSE THE NORMALIZED SBP VALUES TO A FLAT STRUCTURE.;proc transpose data = sbp out = sbpflat prefix = VISIT; by subject; id visit; var sbp;run;`

This procedure in R will be handled by `pivot_longer()`

and `pivot_wider()`

functions.
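For comparison, here is a pandas sketch of the same reshape, using the subject/visit/sbp columns from the SAS example:

```python
import pandas as pd

# Normalized ("tall and skinny") systolic blood pressure values
sbp = pd.DataFrame({
    "subject": ["101", "101", "101", "101", "202", "202", "202", "202", "202"],
    "visit":   [1, 3, 4, 5, 1, 2, 3, 4, 5],
    "sbp":     [160, 140, 130, 120, 141, 151, 161, 171, 181],
})

# Long to wide, like PROC TRANSPOSE with "id visit" and "prefix=VISIT"
wide = sbp.pivot(index="subject", columns="visit", values="sbp")
wide.columns = [f"VISIT{v}" for v in wide.columns]

# Wide back to long, like pivot_longer() in R
long = (wide.reset_index()
            .melt(id_vars="subject", var_name="visit", value_name="sbp")
            .dropna())
```

Note that subject 101 has no visit 2, so `wide` holds a missing value there, just as the SAS output would.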

*Categorical Data and Why Zero and Missing Results Differ Greatly*

Missing data:

- The response is unknown.
- The observation will not be included in population analysis and denominator definitions.

Zero data:

- The response is known.
- The response is “NO” when the categorical variable is Boolean variable.
- The observation will be included in population analysis and denominator definitions.
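A tiny Python sketch of why the denominator changes (the responses are made up):

```python
# Responses to a Boolean question: 1 = "YES", 0 = "NO", None = missing/unknown
responses = [1, 0, 0, None, 1, None, 0]

# Known responses: zeros stay in, missing values drop out
known = [r for r in responses if r is not None]

yes_rate = sum(known) / len(known)        # denominator excludes missing: 2/5
naive_rate = sum(known) / len(responses)  # wrongly counts missing as "NO": 2/7
```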

*Performing Many-to-Many Comparisons/joins*

Imagine you have a data set of adverse event data and a data set of concomitant medications, and you want to know if a concomitant medication was given to a patient during the time of the adverse event.

Such data sets are usually joined with `proc sql` in SAS, or with `left_join()` in R, which is a very common data manipulation procedure.

*Common Analysis Data Sets*

- The **critical variables analysis data set** always has a single observation per subject to simplify the process of merging with other data sets. The whole purpose of the critical variables data set is to capture in one place the essential analysis stratification variables that are used throughout the statistical analysis and reporting.
- The purpose of **change-from-baseline analysis data sets** is to measure what effect some therapeutic intervention had on some kind of diagnostic measure. A measure is taken before and after therapy, and a difference (and sometimes a percentage difference) is calculated for each post-baseline measure.
- A **time-to-event analysis data set** captures the time between therapeutic intervention and some other particular event. Two variables are defined as follows. **Event/Censor**: a binomial outcome such as "success/failure", "death/life", "heart attack/no heart attack". If the event happened to the subject, the event variable is set to 1; if it is certain that the patient did not experience the event, it is set to 0; otherwise, this variable should be missing. **Time to Event**: the time (usually study day) from therapeutic intervention to the event date or censor date. If the event occurred for a subject, the time to event is the study day of that event; if not, the time to event is set to the censor date, which is often the last known follow-up date for the subject.

As mentioned, survival data, also called time-to-event data, is very common in survival analysis, such as the Kaplan-Meier curve, the log-rank test and the Cox proportional hazards model.

Often the censor date is the last known date of patient follow-up, but a patient could be censored for other reasons, such as having taken a protocol-prohibited medication.

Creating time-to-event data sets can be a difficult programming task, especially during interim data analyses, such as for a DSMB. This is usually because the event data itself is captured in more than one place in the case report form and the censor date may be difficult to obtain.

For example, perhaps the event of interest is death. You may have to search the adverse events CRF page, the study termination CRF page, clinical endpoint committee CRFs, and perhaps a special death events CRF page just to gather all of the known death events and dates. For subjects who did not experience the event of interest, you may not have a study termination form to provide the censoring date, so you may have to use some surrogate data to create a censor date.
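Once the event dates and follow-up dates have been gathered from those sources, the final derivation itself is simple; here is a Python sketch (hypothetical dates, using the continuous study-day convention):

```python
from datetime import date

def time_to_event(intervention, event_date, last_followup):
    """Return (event_flag, time): if the event occurred, (1, study day of
    the event); otherwise (0, study day of the last known follow-up),
    i.e. the subject is censored."""
    if event_date is not None:
        return 1, (event_date - intervention).days + 1
    return 0, (last_followup - intervention).days + 1

print(time_to_event(date(2020, 1, 1), date(2020, 3, 1), None))   # (1, 61)
print(time_to_event(date(2020, 1, 1), None, date(2020, 6, 30)))  # (0, 182)
```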

**Please indicate the source**: http://www.bioinfo-scrounger.com

The alluvial plot and Sankey diagram are both forms of visualization for general flow diagrams. These plot types are designed to show the change in magnitude of some quantity as it flows between states. Although the difference between alluvial plots and sankey diagrams is often discussed online, as in the issue "Alluvial diagram vs Sankey diagram?", here I just focus on how to study the connection and flow of data between different categorical features in R. So we don't mind using the two terms interchangeably.

Note: an alluvial diagram is a subcategory of Sankey diagrams, in which nodes are grouped into vertical axes (sometimes called steps).

To illustrate the following cases, I will first load the flight data from the `nycflights13` package. This comprehensive data set contains all flights that departed from the New York City airports JFK, LGA and EWR in 2013, including the three columns we're concerned with: origin (airport of origin), dest (destination airport) and carrier (airline code). For a better demonstration, I select the top five destinations and top four carriers.

`top_dest <- flights %>% count(dest) %>% top_n(5, n) %>% pull(dest) top_carrier <- flights %>% filter(dest %in% top_dest) %>% count(carrier) %>% top_n(4, n) %>% pull(carrier) fly <- flights %>% filter(dest %in% top_dest & carrier %in% top_carrier)`

Let's take a look at the sankey. Google defines a sankey as:

A sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains or multiple paths through a set of stages.

In R, we can plot a sankey diagram with the `ggsankey` package in the ggplot2 framework. This package kindly provides a function, `make_long()`, to transform our common wide data into long format.

`fly <- flights %>% filter(dest %in% top_dest & carrier %in% top_carrier) %>% ggsankey::make_long(origin, carrier, dest)`

The transformed data contains four columns, corresponding to stage and node: `x` and `next_x` are for the stage, and `node` and `next_node` are for the node. Hence, at least four columns are required, so that the columns fit the parameters of the plotting functions. More usages are illustrated in the package documentation (https://github.com/davidsjoberg/ggsankey).

So a basic sankey diagram is as follows:

`ggplot(fly, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node), label = node)) + geom_sankey(flow.alpha = 0.6, node.color = "gray30") + geom_sankey_label(size = 3, color = "white", fill = "gray40") + scale_fill_viridis_d() + theme_sankey(base_size = 18) + labs(x = NULL) + theme(legend.position = "none", plot.title = element_text(hjust = .5))`

Furthermore, the `networkD3` package is also able to plot sankey diagrams, but it is not as easy to use, I think.

After my initial use, the `alluvial` and `ggalluvial` packages seem very suitable for R users to create alluvial plots. The former has its own specific syntax, while the latter can be integrated seamlessly into ggplot2, the same as `ggsankey`.

Actually, to be honest, both of them are convenient; you can choose either one according to your situation. For example, the `alluvial` package is demonstrated below:

`fly <- flights %>% filter(dest %in% top_dest & carrier %in% top_carrier) %>% count(origin, carrier, dest) %>% mutate(origin = fct_relevel(as.factor(origin), c("EWR","JFK","LGA")))alluvial(fly %>% select(-n), freq = fly$n, border = NA, alpha = 0.5, col=case_when(fly$origin == "JFK" ~ "red", fly$origin == "EWR" ~ "blue", TRUE ~ "orange"), cex = 0.75, axis_labels = c("Origin", "Carrier", "Destination"))`

The detailed usage can be found on this page (https://github.com/mbojan/alluvial).

If you would like more customized adjustments, maybe `ggsankey` is better: as a ggplot2 extension, it has enough functions to modify the plot following your thoughts.

We still don't need to take much time to transform the data, because `ggalluvial` also has a very convenient function that does the same procedure as `make_long()` in `ggsankey`. If your data is in a "wide" format, like the flight dataset, the `to_lodes_form()` function will help you easily.

`fly <- flights %>% filter(dest %in% top_dest & carrier %in% top_carrier) %>% count(origin, carrier, dest) %>% mutate( origin = fct_relevel(as.factor(origin), c("LGA", "EWR","JFK")), col = origin ) %>% ggalluvial::to_lodes_form(key = type, axes = c("origin", "carrier", "dest"))ggplot(data = fly, aes(x = type, stratum = stratum, alluvium = alluvium, y = n)) + # geom_lode(width = 1/6) + geom_flow(aes(fill = col), width = 1/6, color = "darkgray", curve_type = "cubic") + # geom_alluvium(aes(fill = stratum)) + geom_stratum(color = "grey", width = 1/6) + geom_label(stat = "stratum", aes(label = after_stat(stratum))) + theme( panel.background = element_blank(), axis.text.y = element_blank(), axis.text.x = element_text(size = 15, face = "bold"), axis.title = element_blank(), axis.ticks = element_blank(), legend.position = "none" ) + scale_fill_viridis_d()`

The `geom_alluvium()` layer is a bit different from `geom_flow()`; which one to choose depends on the type of dataset you provide and the purpose you want to demonstrate.

Above are my notes on a previous question. If you have any questions, please read the following references, which are easier to understand.

https://github.com/davidsjoberg/ggsankey

https://www.displayr.com/sankey-diagrams-r/

https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html

https://corybrunson.github.io/ggalluvial/

https://www.r-bloggers.com/2019/06/data-flow-visuals-alluvial-vs-ggalluvial-in-r/

https://cran.rstudio.com/web/packages/ggalluvial/vignettes/ggalluvial.html

**Please indicate the source**: http://www.bioinfo-scrounger.com

For simplicity: if you have many clinical datasets and would like to find out which of them contain a certain variable, like Sex, what should you do? With R thinking, I would read all the datasets, derive their variable names, and then find out which datasets have this specific variable. If you apply this way in SAS, maybe you need to create a macro and utilize the open and varnum functions. However, there is a simpler tip to achieve it, as follows:

`libname mydata "C:/Users/anlan/Documents/CYFRA";data sex_tb; set sashelp.vcolumn; where libname="MYDATA" and name="SEX"; keep memname name type length;run;`

Just query sashelp.vcolumn, filtering the rows by libname (`libname="MYDATA"`) and column name (`name="SEX"`). You can hardly find this tip in common books, but it's very useful in our work; I generally call it "work experience".

The second example: how to derive date data from a string? I think any derivation process can be split into two procedures, matching and extracting. In R, the popular string-related package is obviously `stringr`; so what about SAS? SAS does not have a toolkit package/macro quite like it, but some functions may be useful, like `prxchange`. After reading the documentation of `prxchange`, I feel its regular-expression matching has some similarities with Perl, as follows:

`data work.demo; input tmp $20.; datalines;AE 2021-01-01CM 2021-02-01MH 2021-03-01;run;data demo_date; set demo; format date YYMMDD10.;/* date=prxchange("s/\w+\s+(.*)/$1/",1,tmp);*/ date=input(prxchange("s/\w+\s+(.*)/$1/",1,tmp),YYMMDD10.);run;proc contents data=demo_date; run;`

If you just want to judge whether a word or number exists in a string, in other words to match words, maybe `prxmatch` and `find` are helpful. If you want to extract words instead of just matching them, `prxchange` or `prxparse` plus `prxposn` is preferred. By the way, the latter is closer to the logic of R.

`data work.num; input tmp $; datalines;AE01CM02MH03;run;data num_ext; length tmp type1 type2 $ 20; keep tmp type1 type2; re=prxparse("/([a-zA-Z]+)(\d+)/"); set num; if prxmatch(re, tmp) then do; type1=prxposn(re, 1, tmp); type2=prxposn(re, 2, tmp); output; end;run;`
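For reference, the same match-and-extract logic in Python's `re` module (which, like the SAS PRX functions, uses Perl-style regular expressions):

```python
import re

rows = ["AE01", "CM02", "MH03"]

# Same pattern as the SAS prxparse example: letters, then digits
pat = re.compile(r"([a-zA-Z]+)(\d+)")

parsed = []
for tmp in rows:
    m = pat.match(tmp)  # like prxmatch
    if m:
        # like prxposn(re, 1, tmp) and prxposn(re, 2, tmp)
        parsed.append((m.group(1), m.group(2)))

print(parsed)  # [('AE', '01'), ('CM', '02'), ('MH', '03')]
```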

The third question, what’s the difference between informat and format?

`informat`

describes how the data is presented in the text file.`format`

describes how you want SAS to present the data when you look at it.

Remember, formats do not change the underlying data, just how it is printed on the screen.

Here is an instance: reading a `date9.` date and displaying it as `yymmdd10.` demonstrates the usage of the `format` statement.

`data aes;input @1 subject $3. @5 ae_start date9. @15 ae_stop date9. @25 adverse_event $15.;format ae_start yymmdd10. ae_stop yymmdd10.;datalines;101 01JAN2004 02JAN2004 Headache101 15JAN2004 03FEB2004 Back Pain102 03NOV2003 10DEC2003 Rash102 03JAN2004 10JAN2004 Abdominal Pain102 04APR2004 04APR2004 Constipation;run;`

Another situation: we typically store the number 1 for male and 2 for female. It would be embarrassing to hand a client a report with unexplained 1 and 2 codes, so we need a format to dress up the report.

`data report;input ID Gender State $;datalines;100001 1 LA100002 2 LA100003 . AL;run;proc format ;value sex 1 = "Male" 2 = "Female" . = "Unknown";run ;proc print data = report; var id gender state; format gender sex.;run;`

An `informat` is usually used with the `input` statement to read various styles of variables into SAS.

Informats usage:

- Character informats: `$INFORMAT w.`
- Numeric informats: `INFORMAT w.d`
- Date/time informats: `INFORMAT w.`

`data death; input @1 subject $3. @5 death 1.; datalines;101 0102 0;run;`

The fourth question: what's the difference between the keep option on the set statement and on the data statement?

If you place the keep option on the SET statement, SAS keeps only the specified variables when it reads the input data set. On the other hand, if you place the keep option on the DATA statement, SAS reads everything and keeps the specified variables only when it writes to the output data set. From this explanation, we can easily see that the former is faster than the latter when the input dataset is very large, because fewer variables are read in.
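A rough pandas analogue of this trade-off is restricting columns at read time versus reading everything and trimming afterwards (the CSV content here is made up):

```python
import pandas as pd
from io import StringIO

csv_text = "subject,sex,age,height,weight\n101,M,34,180,80\n102,F,29,165,60\n"

# Like KEEP= on the SET statement: only the needed columns are parsed in
slim = pd.read_csv(StringIO(csv_text), usecols=["subject", "sex"])

# Like KEEP on the DATA statement: everything is read, then trimmed on output
trimmed = pd.read_csv(StringIO(csv_text))[["subject", "sex"]]
```

Both give the same result; the first simply avoids parsing the unused columns.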

Above are some of my partial interview questions. It's such a pity that I'm not fully prepared, as I am only proficient in R, not SAS. So I'm planning to use my free time to learn and summarize SAS, just like when I learned Perl, R and Python.

By the way, I think it is a good book, as it not only provides some SAS cases but also introduces knowledge from the pharmaceutical industry.

**Please indicate the source**: http://www.bioinfo-scrounger.com

`submit /R`

statement. A few months ago, I consulted SAS support about how to import plots drawn by R in IML into RTF templates directly, as I could not find any useful information on Google. Unfortunately, SAS support told me that if the plot was created in R, it would need to be saved within the submit block using R code as well. That means if you want to directly import R graphics into RTF, you should use some R function to achieve it. However, I found a post (Mixing SAS and R Plots) by chance that illustrates how to mix SAS plots and R plots in one graph, which solves my question perfectly. So let's have a look at how to generate a plot with R in SAS and import it into RTF or PDF. It is not direct, but a trick, I think.

Firstly, as usual we draw a R graphic in SAS.

`proc iml; submit /R; library(ggpubr) data("ToothGrowth") ggviolin(ToothGrowth, x = "dose", y = "len", add = "none") %>% ggadd(c("boxplot", "jitter"), color = "dose") %>% ggsave(filename = "C:/Users/anlan/Documents/plots/violin.png", width = 10, height = 8, units = "cm", dpi = 300) endsubmit; call ImportDataSetFromR("work.ToothGrowth", "ToothGrowth");run;quit;`

Then set the output destination to an RTF file and the graphics to PNG with 12cm height and 15cm width. It should be noted that the height and width refer to the size of the whole graphic, not the actual plot size.

`ods rtf file = "C:/Users/anlan/Documents/plots/outgraph.rtf"; ods graphics / noborder height=12cm width=15cm outputfmt=png; `

Next we use the SAS Graph Template Language (GTL) to define a template, using the `drawimage` statement to import the R graphic into SAS. The width and height parameters in the `drawimage` statement are used to adjust the size of the image's bounding box (its actual size). In this example, width=90 widthunit=percent means the plot is scaled down to 90%.

`proc template; define statgraph plottemp; begingraph; layout overlay; drawimage "C:/Users/guk8/Documents/plots/violin.png" / width=90 widthunit=percent height=90 heightunit=percent; endlayout; endgraph; end;run;`

The final plot in RTF is as follows:

`proc sgrender template=plottemp; run;`

Actually, the drawimage statement is designed to import external graphics into a SAS graph, to display a mixed graph. For example, I'd like to show a graph with an image at the bottom right of the overall graph.

`proc template; define statgraph mgraphic; begingraph; entrytitle "Mix SAS and external graphics"; layout overlay; boxplot y=len x=dose / name="box" group=supp groupdisplay=cluster spread=true; discretelegend "box"; drawimage "C:/Users/anlan/Documents/plots/violin.png" / width=45 widthunit=percent height=45 heightunit=percent anchor=bottomright x=98 y=2 drawspace=wallpercent ; endlayout; endgraph; end;run; proc sgrender data=work.ToothGrowth template=mgraphic; label dose="Dose";run;`

The mixed graphic is as follows:

Great, it seems easy to achieve. Obviously it can definitely be achieved in R as well, using some useful functions.

Mixing SAS and R Plots

DRAWIMAGE Statement

Hands-on Graph Template Language: Part B

BOXPLOT Statement

SAS Boxplot – Explore the Major Types of Boxplots in SAS

**Please indicate the source**: http://www.bioinfo-scrounger.com

To be perfectly honest, I'm not quite sure that what I do is correct, as I'm a new recruit to SAS. However, I have strong experience in R, so I'm accustomed to thinking about problems and solving them in R.

Due to work, I need to learn to manipulate data with R and SAS simultaneously. In this post I expect to use SAS to complete the same procedure as that blog post (Logistic Regression for biomarkers). The steps that will be covered are the following:

- Check variables distributions and correlation
- Fit logistic regression model
- Predict the probability of the event
- Compare two ROC curve

First I load the same dataset from the R package `mlbench` through the IML procedure. The `submit` and `endsubmit` statements can wrap R code in SAS and run it.

`proc iml; submit /R; data("PimaIndiansDiabetes2", package = "mlbench") endsubmit; call ImportDataSetFromR("work.diabetes2", "PimaIndiansDiabetes2");run;quit;`

We can take a look at the frequency of the categorical variable in a summary table as follows:

`proc freq data=diabetes2; tables diabetes;run;`

We can also check the continuous variables as follows:

`proc means data=diabetes2; var age glucose insulin mass pedigree pregnant pressure triceps;run;`

Moreover, I choose graphs to demonstrate the distribution and correlation of the variables; I always think graphs are more informative. For instance, a histogram makes it easy to examine the distribution and look for outliers.

`proc template; define statgraph multiple_charts; begingraph; entrytitle "Two distributions"; /* Define Chart Grid */ layout lattice / rows = 1 columns = 2; /* Chart 1 */ layout overlay; entry "Glucose Histogram" / location=outside; histogram glucose / binwidth=10; endlayout; /* Chart 2 */ layout overlay; entry "Pressure Histogram" / location=outside; histogram pressure / binwidth=5; endlayout; endlayout; endgraph; end;run; proc sgrender data=diabetes2 template=multiple_charts;run;`

As for variable correlation, a heatmap with the correlation values overlaid is often a better way.

`* calculate correlation matrix for the data;ods output PearsonCorr=Corr_P;proc corr data=diabetes2; var age glucose insulin mass pedigree pregnant pressure triceps;run;proc sort data=Corr_P; by Variable;run;* transform wide to long;proc transpose data=Corr_P out=CorrLong(rename=(COL1=Corr)) name=VarID; var age glucose insulin mass pedigree pregnant pressure triceps; by Variable;run;proc sgplot data=CorrLong noautolegend; heatmap x=Variable y=VarID / colorresponse=Corr colormodel=ThreeColorRamp; *Colorresponse allows discrete squares for each correlation.; text x=Variable y=VarID text=Corr / textattrs=(size=10pt); /*Create a variable that contans that info and set text=VARIABLE */ label Corr='Pearson Correlation'; yaxis reverse display=(nolabel); xaxis display=(nolabel); gradlegend;run;`

These two figures show that glucose and pressure are basically normally distributed, and that they have no relatively high correlation with each other.

Before fitting the model, we firstly reformat the diabetes variable and keep the glucose and pressure variables without any NA.

`proc format; value $diabetes "pos"=1 "neg"=0;run;data inputData; set diabetes2(keep=diabetes glucose pressure); if nmiss(of _numeric_) + cmiss(of _character_) > 0 then delete; /*remove NA rows*/ format diabetes $diabetes.;run;`
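The same preprocessing in pandas, for comparison (the data values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "diabetes": ["pos", "neg", "pos", "neg"],
    "glucose":  [148.0, 85.0, None, 110.0],
    "pressure": [72.0, 66.0, 64.0, None],
})

# Recode the outcome to 1/0 and drop rows with any missing value,
# like the nmiss/cmiss deletion in the SAS step above
input_data = (
    df.assign(diabetes=(df["diabetes"] == "pos").astype(int))
      .dropna()
)
```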

To fit the logistic regression model in SAS, generally we will use the following code:

```sas
ods graphics on;
proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose pressure;
  output out=estimates p=est_response;
  ods output roccurve=ROCdata;
run;
```

The `plots(only)=roc` option means we only want to display a ROC plot, and we can get all predicted probabilities from the `est_response` column in the `estimates` dataset. With the ODS OUTPUT statement, we save the ROC-curve data into `ROCdata` directly.

Indeed this seems convenient, but I find it inflexible, because it can lead to code that is hard to reuse.

In this case, I don’t specify `class` variables. Otherwise, if you specify class variables with the param option set to either `ref` or `glm`, SAS will automatically create dummy variables. (Without specifying param, the default coding for two-level factor variables is -1/1, rather than the 0/1 we prefer.)

The partial model results are as follows:

We can see that the variable estimates are equal to those from R. In addition, we automatically get the odds ratio estimates for each variable as well; actually, I can also calculate the odds ratios myself through `exp(coef)`.

If you would like to know the predicted probability for new data, the `lsmeans` statement may be useful, which has the same effect as `predict()` in R.

Now it is time to compare two ROC curves, so first I fit two logistic regression models: one with glucose only, the other with glucose plus pressure.

```sas
proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose;
  output out=estimates p=est_response;
  ods output roccurve=rocdata1;
run;

proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose pressure;
  output out=estimates p=est_response;
  ods output roccurve=rocdata2;
run;

data plotdata;
  set rocdata1(in=a) rocdata2(in=b);
  if a then group="mod1";
  if b then group="mod2";
  keep _1mspec_ _sensit_ group; /* semicolon was missing before run */
run;
```

I use `set` to combine the two ROC datasets for plotting, and then use PROC SGPLOT with a SERIES statement to draw the ROC curves.

In `proc sgplot`, the `aspect=1` option requests a square plot, which is customary for an ROC plot in which both axes use the [0,1] range. The `inset` statement writes the individual group AUC (area under the ROC curve) values inside the plot area.

```sas
proc sgplot data=plotdata aspect=1;
  /* styleattrs wallcolor=grayEE; */
  series x=_1mspec_ y=_sensit_ / group=group;
  lineparm x=0 y=0 slope=1 / transparency=.3 lineattrs=(color=gray);
  xaxis label="False Positive Fraction" values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
  yaxis label="True Positive Fraction" values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
  inset ("glucose AUC" = "0.7877" "glucose+pressure AUC" = "0.7913") / border position=bottomright;
  title "ROC curves for logistic regression";
run;
```

Above is my first note on using SAS for analysis and visualization. Going forward, I plan to compare code between R and SAS to push my SAS learning, and I hope to make steady progress.

Thanks to the post (Using SAS to Estimate a Logistic Regression Model) for clarifying logistic regression in SAS.

Modify the ROC plot produced by PROC LOGISTIC Plot and compare ROC curves from a fitted model used to score validation or other data

Example 78.7 ROC Curve, Customized Odds Ratios, Goodness-of-Fit Statistics, R-Square, and Confidence Limits

Using SAS to Estimate a Logistic Regression Model

sas_correlation_heat_map.sas

SAS Series 20: Logistic Regression with PROC LOGISTIC

**Please indicate the source**: http://www.bioinfo-scrounger.com

As we know, logistic regression can be applied in different ways, such as:

- Calculate odds ratios (OR) to identify potential risk factors.
- Construct a model as a classifier to estimate the probability that an instance belongs to a class.
- Adjust for potential confounding factors so that we can estimate the effect of the factor of interest on the endpoint.

For example:

Suppose we’re interested in knowing how variables such as age, sex, and body mass index affect blood pressure. In this case body mass index may be the factor of most interest, while age and sex are confounders. The blood pressure outcome should then be a categorical variable with two levels: high blood pressure and normal blood pressure.

In my case, I have a known biomarker as a reference marker, and I’d like to add another marker to the reference one as a marker combination, then estimate whether the combination is better than the reference marker alone. So how do we select the appropriate statistical methods?

Obviously, the most straightforward idea is to compare the sensitivity and specificity between the combined marker and the reference marker.

- Superiority of sensitivity: the alternative hypothesis is that the number of additional cancer cases identified by the combined marker relative to the reference marker is larger than zero. The p-value is calculated by a binomial test.
- Non-inferiority of specificity: the alternative hypothesis is that the misclassifications as positive by the combined marker exceed those of the reference marker by less than 10%. The p-value is calculated using an approximate standard normal distribution based on restricted maximum likelihood estimation (RMLE).
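As a sketch of the sensitivity comparison, an exact binomial test can be run on the discordant cancer cases; the counts below are made-up numbers for illustration, not from any real study:

```r
# Hypothetical discordant counts among cancer cases (made-up numbers):
n_comb_only <- 12  # positive by the combined marker only
n_ref_only  <- 4   # positive by the reference marker only

# Under H0 (no sensitivity gain), each discordant case is equally likely
# to fall either way, so n_comb_only ~ Binomial(n_comb_only + n_ref_only, 0.5)
res <- binom.test(n_comb_only, n_comb_only + n_ref_only,
                  p = 0.5, alternative = "greater")
res$p.value  # one-sided p-value for superiority of sensitivity (about 0.038 here)
```

With 12 of 16 discordant cases favoring the combined marker, the one-sided p-value is below 0.05, so in this toy example the sensitivity gain would be declared significant.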

However, if you want to compare the AUC between the reference marker and the combined marker, a logistic regression can meet our needs perfectly. So we use the reference marker and the combined marker as independent variables and the disease condition (cancer/control) as the dependent variable. We hope to see that the combined AUC is larger than the reference one.

Before we perform logistic regression, some details may be useful to our model and worth considering in advance.

- Remove potential outliers
- Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
- Remove highly correlated predictors to minimize overfitting; their presence might lead to an unstable model solution. (In practice, this third consideration is often neglected.)

So, how do we fit a logistic regression model and calculate the AUC? I'd like to take a few notes on the analysis process in R and SAS. By the way, I think R is much better than SAS for statistical analysis and visualization. **This is the spirit and power of open source, which makes our work better and better.**

I take the `PimaIndiansDiabetes2` data set from the `mlbench` package as an example, which is the “Pima Indians Diabetes Database”. Load the data, select the two variables of interest plus the response, and remove NAs.

```r
library(tidyverse)

data("PimaIndiansDiabetes2", package = "mlbench")
data <- select(PimaIndiansDiabetes2, c("glucose", "pressure", "diabetes")) %>%
  na.omit()
```

First, I think it’s necessary to examine the distribution of and correlation between these variables (glucose and pressure), as follows:

```r
## distribution
ggpubr::ggarrange(
  ggpubr::gghistogram(data = data, x = "glucose"),
  ggpubr::gghistogram(data = data, x = "pressure"),
  nrow = 1, ncol = 2
)
```

```r
## correlation
library(ggcorrplot)
# note: corr expects a correlation matrix (cor()), not the p-value matrix from cor_pmat()
ggcorrplot(corr = cor(PimaIndiansDiabetes2[, 1:8], use = "pairwise.complete.obs"),
           method = "circle")
```

Here, we can see that these two variables are approximately normally distributed and not strongly correlated.

Then I fit a simple model based on the glucose predictor variable.

```r
model1 <- glm(diabetes ~ glucose, data = data, family = binomial)
summary(model1)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -5.61173171 0.442288629 -12.68794 6.897596e-37
## glucose      0.03951014 0.003397783  11.62821 2.962420e-31
```

The output above shows the beta coefficients and the corresponding significance levels. The intercept is `-5.61` and the coefficient of the glucose variable is `0.039`. `Std. Error` represents the accuracy of the coefficient: the larger it is, the less confident we are in the estimate. The `z value` is the estimate divided by the standard error, and it determines the corresponding `p-value`.

To interpret the coefficients, we should know the meaning of the logistic beta coefficients. The estimate is the regression coefficient, so exp(coef) is the odds ratio, which means **the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0)**.
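For example, with the glucose coefficient reported in the model output above, the odds ratio can be computed directly (a quick check; the 10-unit scaling is my own illustration):

```r
# Glucose coefficient from the summary(model1) output above
b_glucose <- 0.03951014

# Odds ratio per 1-unit increase in glucose
or_1 <- exp(b_glucose)
round(or_1, 4)   # 1.0403: each extra unit multiplies the odds by about 4%

# Odds ratios multiply, so a 10-unit increase gives exp(10 * b)
or_10 <- exp(10 * b_glucose)
round(or_10, 4)  # 1.4845
```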

We all know that the s-shaped curve is defined as `p = exp(y) / [1 + exp(y)]` (James et al. 2014). This can also be written simply as `p = 1 / [1 + exp(-y)]`, where `y = b0 + b1*x`, `exp()` is the exponential function, and `p` is the probability of the event occurring (1) given x. Mathematically, this is written as `p(event=1|x)` and abbreviated as `p(x)`, so `p(x) = 1 / [1 + exp(-(b0 + b1*x))]`.

Based on this formula, given a new glucose plasma concentration it is easy to predict the probability of the patient being diabetes-positive. In R, we can use the `predict()` function to calculate the probability instead of applying the logistic equation ourselves.

```r
mod_prob1 <- predict(model1, newdata = data, type = "response")
```
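As a sanity check on the formula, we can plug in the coefficients reported above by hand (the glucose value of 150 is an arbitrary choice of mine):

```r
# Coefficients from the summary(model1) output above
b0 <- -5.61173171
b1 <-  0.03951014

glucose <- 150
y <- b0 + b1 * glucose   # linear predictor
p <- 1 / (1 + exp(-y))   # logistic transform
round(p, 3)              # 0.578
```

This is the same value `predict(model1, type = "response")` returns for a glucose of 150, since `predict()` just applies this equation.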

We can also apply `geom_smooth()` to fit an s-shaped probability curve using the above `data`.

```r
data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  theme_light() +
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabete-pos"
  )
```

Back to my case, my purpose is to compare the reference marker and the combined marker. Suppose glucose is the reference marker and glucose plus pressure is the combined one. So I fit a multiple logistic regression with the glucose and pressure variables.

```r
model2 <- glm(diabetes ~ glucose + pressure, data = data, family = binomial)
summary(model2)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -6.49941142 0.659445793 -9.855869 6.465488e-23
## glucose      0.03836257 0.003428241 11.190160 4.556103e-29
## pressure     0.01406869 0.007478525  1.881212 5.994305e-02
```

In the same way, calculate the probabilities from the multiple regression, and then compare the two models by AUC.

```r
mod_prob2 <- predict(model2, newdata = data, type = "response")

plotres <- data.frame(
  event = ifelse(data$diabetes == "pos", 1, 0),
  glucose = mod_prob1,
  pressure = mod_prob2,
  stringsAsFactors = FALSE
) %>%
  pivot_longer(cols = 2:3)
```

To plot multiple ROC curves on the same plot, the `plotROC` package can help us; it is perfect for this use.

```r
library(plotROC)

p <- ggplot(as.data.frame(plotres), aes(d = event, m = value, color = name)) +
  geom_roc(n.cuts = 0) +
  style_roc()

p + annotate("text", x = .75, y = .25,
             label = paste(c("glucose", "pressure"), "AUC =",
                           round(calc_auc(p)$AUC, 4), collapse = "\n"))
```

From this ROC output, the “combined marker” is perhaps not better than the “reference marker”. Obviously, that is my fault for choosing unsuitable dummy data, but I think this post is still useful for understanding logistic regression in biomarker combinations.

**Thanks to the article (Logistic Regression Essentials in R) for clarifying logistic regression.**

Logistic Regression Essentials in R

Heart Disease Prediction using Logistic Regression

Heart Disease Prediction

Understanding Logistic Regression using R

Chapter 10 Logistic Regression

Generate ROC Curve Charts for Print and Interactive Use


PowerPoint is a creative tool that can help you make any hex sticker you like. You can search for templates online and make your own additions. However, using PowerPoint to manipulate the image and create semi-circular text will take longer than you might hope to get the hexagon shape right.

The biggest advantage of PowerPoint templates is that, provided you’re proficient in PowerPoint, you’re sure to be able to make a hex sticker.

The `hexSticker` package can produce a series of pretty figures generated by base plot, lattice, and ggplot2. It seems that any R plot can be added to a hex sticker, certainly including other graphs.

```r
library(hexSticker)

imgurl <- "./interactive.png"
hexSticker::sticker(imgurl, package = "easyIVD", p_size = 20, p_y = 1.5,
                    s_x = 1, s_y = .75, s_width = .5, filename = "imgfile.png")
```

The advantage of the hexSticker package is that, since everything is generated by R code, it’s easy to control each parameter and the result is highly reproducible. Moreover, it’s more convenient to share with others than PowerPoint templates.

More detailed configuration can be found in the `?hexSticker::sticker` documentation.

I think it’s a brilliant tool; it won a prize in the 2020 RStudio Shiny Contest. Since trying it, I believe it’s the most convenient tool for making hex stickers: it offers detailed configuration and is thoughtful and friendly.

You can specify the hex name, image configuration, hexagon border, and spotlight details, and add a URL to the sticker, which I believe is enough for your personal design.

This tool is built with R Shiny, so you only need to visit the web page (https://connect.thinkr.fr/hexmake/) to begin designing your personalized sticker.

The home page is shown below:

After a series of configurations, my hex sticker is complete and looks pretty good. If you are also interested, brainstorm freely and make yours even more creative.

Making a Hexagon Sticker

hexSticker: create hexagon sticker in R

Build your Own Hex Sticker


In this situation, as an option we can consider using maximally selected rank statistics to find the cutoff.

The statistics depends on your data types.

What are maximally selected rank statistics? Briefly speaking, the method assumes that an unknown cutoff in X (the independent variable) splits the observations into two groups with respect to the response Y, chosen so that the statistic comparing the two groups is maximal. This statistic is an appropriately standardized two-sample linear rank statistic of the responses, representing the difference between the two groups.

The hypothesis test finds the maximum of the standardized statistics over all possible cutoffs, which provides the best separation of the response into two groups.

So maximally selected rank statistics can be used for both estimation and evaluation of a simple cutpoint model.

The `surv_cutpoint()` function in the `survminer` package wraps `maxstat` to determine the optimal cutpoint for each variable.

A simple example below shows how to use the `maxstat` package to find a statistically significant cutoff.

Load the survival data from the maxstat package.

```r
library(survival)
library(maxstat)

data(DLBCL)
mod <- maxstat.test(Surv(time, cens) ~ MGE, data = DLBCL,
                    smethod = "LogRank", pmethod = "condMC", B = 9999)
> mod

	Maximally selected LogRank statistics using condMC

data:  Surv(time, cens) by MGE
M = 3.1772, p-value = 0.009701
sample estimates:
estimated cutpoint 
         0.1860526 
```

The argument `smethod` selects which kind of statistic is computed and `pmethod` specifies the kind of p-value approximation. The argument `B` specifies the number of Monte Carlo replications to perform (defaulting to 10000).

For the overall survival time, the estimated cutpoint is a mean gene expression of 0.186, and the maximum of the log-rank statistics is M = 3.1772. The probability that, under the null hypothesis, the maximally selected log-rank statistic is greater than M = 3.1772 is less than 0.01 (p = 0.0097).

If you have more than one predictor to evaluate, you can assess them simultaneously and find out which one separates the groups best.

```r
mod2 <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data = DLBCL,
                     smethod = "LogRank", pmethod = "exactGauss", abseps = 0.01)
> mod2

	Optimally Selected Prognostic Factors

Call:
maxstat.test.data.frame(formula = Surv(time, cens) ~ MGE + IPI,
    data = DLBCL, smethod = "LogRank", pmethod = "exactGauss", abseps = 0.01)

Selected:
	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint 
                 1 

Adjusted p.value: 0.03430325, error: 0.001754899

> mod2$maxstats
[[1]]
	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by MGE
M = 3.0602, p-value = 0.02721
sample estimates:
estimated cutpoint 
         0.1860526 

[[2]]
	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint 
                 1 
```

The p-value of the global test of the null hypothesis that survival is independent of both IPI and MGE is 0.034, and IPI provides a better split into two groups than MGE does.

The result can be visualized with the `plot()` function:

`plot(mod2)`

Maximally Selected Rank Statistics in R

https://www.jianshu.com/p/0851baac137c

https://www.iikx.com/news/statistics/1747.html


For instance, how do you create a pie chart or a donut chart? With R and the `ggplot2` package, the function `coord_polar()` is recommended, as a pie chart is just a stacked bar chart in polar coordinates.

Create a simple dataset:

```r
df <- data.frame(
  group = c("Female", "Male", "Child"),
  value = c(25, 30, 45),
  Perc  = c("25%", "30%", "45%")
)
```

And then create a pie chart:

```r
# use df here: df2 is only created later
ggplot(df, aes(x = "", y = value, fill = group)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
```

The above is the default style; it may look different from pie charts made with business tools, so we want to remove the axis ticks and labels. To go further, we probably want text annotations, so we customize the pie chart with the `theme()` and `geom_text()` functions. For pretty color palettes, the `ggsci` package is highly recommended.

One caveat: if you add mark labels with the `geom_text()` function, sort your fill (group/factor) variable first, otherwise the text labels will be placed in the wrong positions.

Create a custom theme, and calculate the position of each label in the pie chart.

```r
mytheme <- theme_minimal() +
  theme(
    axis.title = element_blank(),
    axis.text.x = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    legend.key.size = unit(15, "pt"),
    legend.text = element_text(size = 12),
    legend.position = "top"
  )

df2 <- df %>%
  mutate(
    cs  = rev(cumsum(rev(value))),
    pos = value / 2 + lead(cs, 1, default = 0)
  )
```

The pie chart below looks more polished than the previous one.

```r
ggplot(df2, aes(x = "", y = value, fill = fct_inorder(group))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  ggsci::scale_fill_npg() +
  mytheme +
  geom_text(aes(y = pos, label = Perc), size = 5) +  # label must be inside aes()
  guides(fill = guide_legend(title = NULL))
```

Categorical data are often easier to read in a donut chart than in a pie chart, although I have always thought they are essentially the same. Unlike the pie chart, to draw a donut chart we must specify `x = 2` in `aes()` and add `xlim()` to limit the x range.

```r
ggplot(df2, aes(x = 2, y = value, fill = fct_inorder(group))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 200) +
  xlim(.2, 2.5) +
  ggsci::scale_fill_npg() +
  theme_void() +
  geom_text(aes(y = pos, label = Perc), size = 5, col = "white") +
  guides(fill = guide_legend(title = NULL))
```

However, it’s not easy to remember all the parameters for pie and donut charts. Is there a simpler way? I prefer the `ggpie()` and `ggdonutchart()` functions in the ggpubr package.

For more details, refer to https://rpkgs.datanovia.com/ggpubr/reference/index.html, which also includes other useful plotting functions. Here is the donut chart as an example.

In addition, I think the `geom_label_repel()` function (from the ggrepel package) is better for adding text annotations.

```r
df2 <- df2 %>% mutate(group = fct_inorder(group), tmp = "")

ggpubr::ggdonutchart(data = df2, x = "value", label = "tmp", lab.pos = "in",
                     fill = "group", color = "black", palette = "npg") +
  geom_label_repel(aes(y = pos, label = paste0(group, " (", Perc, ")"),
                       fill = group),
                   segment.color = pal_npg("nrc")(3), segment.size = 0.8,
                   data = df2, size = 4, show.legend = FALSE,
                   nudge_y = 1, color = "black") +
  guides(fill = FALSE)
```

A **prettier design** can be found in this blog post (Donut chart with ggplot2).

This tip is about adding labels to a dodged barplot, when you have to specify `position=position_dodge()` and `width =` simultaneously. If you are careless, the text annotations will end up in the wrong positions. In this situation, we must specify a **consistent width** value in all position-related functions, such as `geom_bar()`, `position_dodge()`, and `geom_text()`.

```r
df <- data.frame(
  supp = rep(c("VC", "OJ"), each = 3),
  dose = rep(c("D0.5", "D1", "D2"), 2),
  len  = c(6.8, 15, 33, 4.2, 10, 29.5)
)

ggplot(data = df, aes(x = dose, y = len, fill = supp)) +
  geom_bar(stat = "identity", color = "black",
           position = position_dodge(0.65), width = 0.65) +
  theme_minimal() +
  geom_text(aes(label = len), vjust = -0.5, color = "black",
            position = position_dodge(0.65), size = 3.5) +
  scale_fill_brewer(palette = "Blues")
```

Then what about a stacked barplot? We must calculate a position variable and pass it to `geom_text()` as the y value.

```r
# start from df (the dodged-barplot data above), not df2
df2 <- arrange(df, dose, supp) %>%
  plyr::ddply("dose", transform, label_ypos = cumsum(len))

ggplot(data = df2, aes(x = dose, y = len, fill = supp)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = label_ypos, label = len), vjust = 1.6,
            color = "black", size = 3.5) +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal()
```

Obviously, if you use the `ggbarplot()` function from the ggpubr package, you need to calculate and remember fewer parameters. (https://rpkgs.datanovia.com/ggpubr/reference/ggbarplot.html)

ggplot2 pie chart : Quick start guide - R software and data visualization

ggplot2 barplots : Quick start guide - R software and data visualization

https://ggplot2.tidyverse.org/reference/

http://www.sthda.com/english/wiki/ggplot2-essentials

https://rpkgs.datanovia.com/ggpubr/reference/index.html

Plotting Pie and Donut Chart with ggpubr pckage in R


CLSI EP05-A3 and EP15-A3 are used as references.

Definition of intermediate precision:

Intermediate precision (also called within-laboratory or within-device) is a measure of precision under a defined set of conditions: same measurement procedure, same measuring system, same location, and replicate measurements on the same or similar objects over an extended period of time. It may include changes to other conditions such as new calibrations, operators, or reagent lots. ——Intermediate precision

Take throwing darts as an example:

- Accuracy: the score you get on the dartboard. The higher the score, the better.
- Precision: the spread of your throws. If your darts land very close together, your technique is very stable.

If you want to estimate the precision of a certain test, three indicators are useful for judging whether it is good enough to use:

- %CV: coefficient of variation expressed as a percentage
- %CV_R: repeatability coefficient of variation
- %CV_WL: within-laboratory coefficient of variation

We all know that it’s impossible to ensure every test is equal as there are so many factors that would influence our results, such as:

- Day
- Run
- Reagent lot
- Calibrator lot
- Calibration cycle
- Operator
- Instrument
- Laboratory

The first two of the above are usually the main factors to be considered.

So there is always some variation in the measured results compared to the true values. It consists of systematic error (bias) and random error; precision measures the random error.

Consider a single-site 20x2x2 study: 20 days, two runs per day, and two replicates per run. The associated factors, days and runs, are included in the statistical analysis, which is used to estimate two types of precision: repeatability (within-run precision) and within-laboratory precision (within-device precision).

Once the sources of variation have been identified, an ANOVA model can be used to calculate the SDs and %CVs in the statistical processing of the data. The variation is usually divided into three components:

Within-run precision (or repeatability) measures the results from replicated measurements of a given sample, in a single run, under essentially constant conditions. This variation is basically caused by random error inside the instrument, such as variation in the pipetted volumes of sample and reagent.

Between-run precision measures the variation between different runs (e.g. run 1 and run 2). Changing runs may change the operating conditions, such as temperature, instrument status, etc.

Between-day precision measures the variation between days, which is easy to understand; it may be caused by factors such as humidity.

This protocol (20x2x2) estimates repeatability (within-run precision) and within-laboratory (intermediate) precision, following CLSI EP05.

From the description above, we can see the protocol is a classic nested (hierarchical) design, where replicates are nested within runs and runs are nested within days. So in this situation nested ANOVA is appropriate; with two factors involved, this corresponds to a two-way nested ANOVA.

To estimate the precision of this single-site 20x2x2 design, we should follow a nested linear components-of-variance model involving two factors, “day” and “run”, with “run” nested within “day”. This model can be analyzed with a two-way nested ANOVA. Note that the design is balanced, because it specifies the same number of runs for each day and the same number of replicates for each run.

The above screenshot from CLSI EP05-A3 can help us to understand the nested linear components-of-variance model. We can especially know that the residual in the model represents the within-run factor.
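A common way to write this nested components-of-variance model is the following (the indices and symbols are my own notation, not copied from the CLSI document):

```latex
y_{ijk} = \mu + D_i + R_{j(i)} + \varepsilon_{ijk},
\qquad i = 1,\dots,20;\; j = 1,2;\; k = 1,2
```

where $D_i \sim N(0,\sigma^2_{day})$ is the day effect, $R_{j(i)} \sim N(0,\sigma^2_{run})$ is the run-within-day effect, and $\varepsilon_{ijk} \sim N(0,\sigma^2_{error})$ is the within-run (repeatability) error. The within-laboratory variance is then $\sigma^2_{WL} = \sigma^2_{day} + \sigma^2_{run} + \sigma^2_{error}$, which is exactly how `Swl` is computed from `Vday`, `Vrun`, and `Verror` below.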

Nested random effects are when each member of one group is contained entirely within a single unit of another group; the canonical example is students in classrooms. Crossed random effects are when this nesting does not hold. An example would be different seeds and different fields used for planting crops: seeds of the same type can be planted in different fields, and each field can have multiple seed types in it.

Whether random effects are nested or crossed is a property of the data, not the model. In other words, you should tell the model which factors in the data are nested or crossed.

I don’t describe the experiment and workflow in this section; they are described clearly in the CLSI EP05 and EP15 documents.

Let’s talk about how to calculate the %CV and SD, which can be divided into at least two categories based on how many factors are involved.

As a first step, I load a simple 20x2x2 design data set from the R package `VCA`, with 2 replicates, 2 runs, and 20 days for a single sample, where y is the test measurement.

- One reagent lot, a single sample
- One instrument system
- 20 test days
- Two runs per day
- Two replicate measurements per run

```r
library(VCA)
data(dataEP05A2_2)

> summary(dataEP05A2_2)
      day     run          y
 1      : 4   1:40   Min.   :68.87
 2      : 4   2:40   1st Qu.:73.22
 3      : 4          Median :75.39
 4      : 4          Mean   :75.41
 5      : 4          3rd Qu.:77.37
 6      : 4          Max.   :83.02
 (Other):56
```

In the second step, I use the `aov()` function in R to fit the nested linear components-of-variance model; in this situation, runs are nested within days.

```r
res <- aov(y ~ day/run, data = dataEP05A2_2)
ss <- summary(res)
> ss
            Df Sum Sq Mean Sq F value  Pr(>F)
day         19  319.0  16.787   4.512   3e-05 ***
day:run     20  187.4   9.372   2.519 0.00634 **
Residuals   40  148.8   3.720
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

In the third step, calculate the SD and %CV of the day, run, and error variation following the formulas given in EP05-A3. By the way, the error CV (`CVerror`) corresponds to %CV_R, also called within-run or repeatability precision, and %CV_WL is the within-laboratory precision.

```r
nrep <- 2
nrun <- 2
nday <- 20

Verror <- ss[[1]]$`Mean Sq`[3]
Vrun   <- (ss[[1]]$`Mean Sq`[2] - ss[[1]]$`Mean Sq`[3]) / nrep
Vday   <- (ss[[1]]$`Mean Sq`[1] - ss[[1]]$`Mean Sq`[2]) / (nrun * nrep)

Serror <- sqrt(Verror)
Sday   <- sqrt(Vday)
Srun   <- sqrt(Vrun)
Swl    <- sqrt(Vday + Vrun + Verror)
> print(c(Swl, Sday, Srun, Serror))
[1] 2.898293 1.361533 1.681086 1.928803

CVerror <- Serror / mean(dataEP05A2_2$y) * 100
> CVerror
[1] 2.557875

CVwl <- Swl / mean(dataEP05A2_2$y) * 100
> CVwl
[1] 3.843561
```

In the fourth step, calculate the confidence intervals of the SD and %CV, which rely on chi-square distribution quantiles at the estimated degrees of freedom. Take the error %CV as an example.

```r
alpha <- 0.05

CVCI <- c(CVerror * sqrt(ss[[1]]$Df[3] / qchisq(1 - alpha/2, df = 40)),
          CVerror * sqrt(ss[[1]]$Df[3] / qchisq(alpha/2, df = 40)))
> CVCI
[1] 2.100049 3.272809

CVCI_oneSide <- c(CVerror * sqrt(ss[[1]]$Df[3] / qchisq(1 - alpha, df = 40)),
                  CVerror * sqrt(ss[[1]]$Df[3] / qchisq(alpha, df = 40)))
> CVCI_oneSide
[1] 2.166476 3.142029
```

**Fortunately, the standard calculation steps above have been packaged into an R package, the VCA package. So we can simply apply the `anovaVCA()` function to fit and summarize the model; for CI calculation, the `VCAinference()` function can be used. It sounds great.**

Fit model:

```r
res <- anovaVCA(y ~ day/run, dataEP05A2_2)
> res

Result Variance Component Analysis:
-----------------------------------

  Name    DF SS         MS        VC       %Total    SD       CV[%]
1 total   54.78206                8.400103 100       2.898293 3.843561
2 day     19 318.961943 16.787471 1.853772 22.068447 1.361533 1.805592
3 day:run 20 187.447626 9.372381  2.82605  33.643043 1.681086 2.229366
4 error   40 148.811221 3.720281  3.720281 44.288509 1.928803 2.557875

Mean: 75.40645 (N = 80)

Experimental Design: balanced  |  Method: ANOVA
```

Calculate CI for SD and %CV:

```r
> VCAinference(res)

Inference from (V)ariance (C)omponent (A)nalysis
------------------------------------------------

> VCA Result:
-------------

  Name    DF      SS       MS      VC     %Total  SD     CV[%]
1 total   54.7821                  8.4001 100     2.8983 3.8436
2 day     19      318.9619 16.7875 1.8538 22.0684 1.3615 1.8056
3 day:run 20      187.4476 9.3724  2.8261 33.643  1.6811 2.2294
4 error   40      148.8112 3.7203  3.7203 44.2885 1.9288 2.5579

Mean: 75.4064 (N = 80)

Experimental Design: balanced  |  Method: ANOVA

> VC:
-----
        Estimate CI LCL  CI UCL One-Sided LCL One-Sided UCL
total     8.4001 5.9669 12.7046        6.2987       11.8680
day       1.8538
day:run   2.8261
error     3.7203 2.5077  6.0906        2.6689        5.6135

> SD:
-----
        Estimate CI LCL CI UCL One-Sided LCL One-Sided UCL
total     2.8983 2.4427 3.5644        2.5097        3.4450
day       1.3615
day:run   1.6811
error     1.9288 1.5836 2.4679        1.6337        2.3693

> CV[%]:
--------
        Estimate CI LCL CI UCL One-Sided LCL One-Sided UCL
total     3.8436 3.2394 4.7269        3.3282        4.5686
day       1.8056
day:run   2.2294
error     2.5579 2.1000 3.2728        2.1665        3.1420

95% Confidence Level
SAS PROC MIXED method used for computing CIs
```

These functions can also handle more complicated designs, so we don't need to write our own functions any more.

Visualizing Nested and Cross Random Effects

R-Package VCA for Variance Component Analysis

How to Perform a Nested ANOVA in R (Step-by-Step)

Lab 8 - Nested and Repeated Measures ANOVA

What’s with the precision?


I thought I understood ANOVA well, but when I went to apply a MANOVA model I found I was totally wrong. I didn't even have a clear understanding of which variables, continuous or categorical, should be used in ANOVA. So I decided to take notes to figure out the differences between ANOVA, MANOVA, and ANCOVA.

ANOVA is a statistical technique that assesses potential differences in a continuous dependent variable across levels of categorical independent variables. Commonly, ANOVAs are used in three ways: one-way, two-way, and N-way ANOVA.

- **Independence of observations**: there are no hidden relationships among observations.
- **Normally-distributed dependent variable**: the dependent variable follows a normal distribution. If this is not met, you can try a data transformation.
- **Homogeneity of variance**: the variances in each group are similar. If this is not met, you may be able to use a non-parametric alternative, such as the Kruskal-Wallis test.
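As an illustrative sketch (using the built-in `iris` data and base R functions only; the variable choices are just for demonstration), these assumptions can be checked before running an ANOVA:

```r
# Fit a one-way ANOVA so normality can be tested on the residuals
fit <- aov(Sepal.Length ~ Species, data = iris)

# Normality: Shapiro-Wilk test on the model residuals
shapiro.test(residuals(fit))

# Homogeneity of variance: Bartlett test across groups
bartlett.test(Sepal.Length ~ Species, data = iris)

# Non-parametric alternative if the assumptions are violated
kruskal.test(Sepal.Length ~ Species, data = iris)
```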

Types of data in ANOVA, T test and Chi-Squared Test

| X independent variables | X groups | Y | Analysis |
|---|---|---|---|
| categorical | Two or more groups | quantitative | ANOVA |
| categorical | Just two groups | quantitative | T test |
| categorical | Two or more groups | categorical | Chi-Squared Test |

A one-way ANOVA has just one independent variable affecting a dependent variable, and the independent variable can have 2 or more categories to compare.

The null hypothesis for the test is that means in groups are equal, which means there is no difference among group means. Therefore, a significant result means that the means are unequal. If you want to compare two groups, use the T-test instead.

ANOVA uses the F-test for statistical significance. If the variance within groups is smaller than the variance between groups, the F-test will yield a higher F-value, which means greater significance.

ANOVA only tells you whether there are differences among the groups (levels) of the independent variable, but not which differences are significant. To find out how the levels differ from one another, perform a Tukey HSD post hoc test.
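As a minimal sketch (using the built-in `mtcars` data, with `cyl` treated as a factor; the variables are just for illustration), a Tukey HSD post hoc test follows directly from the fitted `aov` object:

```r
# One-way ANOVA: mpg across the three cylinder groups
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)

# Tukey HSD: pairwise comparisons between every pair of levels,
# with adjusted p-values and confidence intervals
TukeyHSD(fit)
```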

A two-way ANOVA has two independent variables (two categorical variables), which is the main difference from a one-way ANOVA. These categorical variables are also called factors, and each factor can be split into multiple levels. So if one factor has 3 levels and another factor also has 3 levels, there will be 3x3 = 9 groups.

Use a two-way ANOVA when you want to know how two independent variables, in combination, affect a dependent variable. A two-way ANOVA with interaction tests three null hypotheses at the same time:

- There is no difference in group means at any level of the first independent variable.
- There is no difference in group means at any level of the second independent variable.
- The effect of one independent variable does not depend on the effect of the other independent variable (a.k.a. no interaction effect)

If you want a two-way ANOVA without the interaction effect, you only need the first two hypotheses.

```r
library(dplyr)

data <- mtcars[, c("am", "mpg", "hp", "vs")] %>%
  mutate(am = factor(am), vs = factor(vs))
summary(data)

# One-way ANOVA
one.way <- aov(mpg ~ am, data = data)
summary(one.way)

# Two-way ANOVA
two.way <- aov(mpg ~ am + vs, data = data)
summary(two.way)

# Two-way ANOVA with interaction
two.way <- aov(mpg ~ am * vs, data = data)
summary(two.way)
```

We know that a one-way or two-way ANOVA has only one dependent variable, but MANOVA is not limited to one. MANOVA stands for multivariate analysis of variance, and it is used when there are two or more dependent variables. Its purpose is to find out whether the dependent variables differ across the independent variables simultaneously.

MANOVA assumes that independent variables are categorical and dependent variables are continuous, the same as ANOVA.

Instead of a univariate F value, we would obtain a multivariate F value, and several test statistics are available: Wilks' λ, Hotelling's trace, Pillai's criterion.

Sometimes a one-way ANOVA cannot detect significance for each dependent variable between groups (levels of the independent variable), so we conclude that there is no relationship between the dependent and independent variables. However, when we apply MANOVA to these dependent variables simultaneously, it may conclude that the dependent variables are affected by the independent variables.

If you're still confused about this, try reading the post Comparison of MANOVA to ANOVA Using an Example, which gives a better example to interpret.

When you would otherwise need to perform a series of one-way ANOVAs because you have multiple dependent variables to analyze, using MANOVA can protect against Type I errors.

Example:

- dependent variables: Sepal.Length and Petal.Length
- independent variables: Species

Fit model and summarize:

```r
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length

# MANOVA test
res.manova <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# define the test statistic, Wilks
summary(res.manova, test = "Wilks")
```

ANCOVA is an extension of ANOVA that can be used to adjust for other factors that might affect the outcome, such as age, gender, or drug use. It can also be used to combine a categorical variable with a continuous one (one predictor is categorical, another is quantitative), or to use variables on a scale as predictors. In that case the covariate is a variable of interest, not one you want to control for.

Therefore, you can enter any covariates you want into an ANCOVA. The more you enter, the fewer degrees of freedom you will have, which reduces the statistical power. And the lower the power, the less you can rely on the results of the test.

Before performing ANCOVA, besides normal distribution and homogeneity of variance, we need to verify that the covariate and the independent variable are independent of each other, since adding a covariate into a model only makes sense if the covariate and the independent variable act independently on the dependent variable.

NOTE: if you use Type I sums of squares for the model, you must pay attention to the order of terms; the covariate goes first (and there is no interaction term).
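A short sketch of why the order matters with Type I (sequential) sums of squares, using the same iris variables: each term is only adjusted for the terms listed before it, so swapping the order changes the ANOVA table.

```r
# Type I (sequential) SS: each term is adjusted only for the terms before it.
# Covariate first (the recommended order for ANCOVA with Type I SS):
summary(aov(Petal.Length ~ Sepal.Length + Species, data = iris))

# Factor first: the sums of squares attributed to each term change
summary(aov(Petal.Length ~ Species + Sepal.Length, data = iris))
```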

Example:

- dependent variables: Petal.Length
- independent variables: Species
- covariate: Sepal.Length

Fit model and summarize:

```r
# fit ANCOVA model
fit <- aov(Petal.Length ~ Sepal.Length + Species, data = iris)

# view summary of model (Type II sums of squares)
car::Anova(fit, type = 2)
```

What is the difference between ANOVA & MANOVA?

ANOVA Test: Definition, Types, Examples

ANOVA (Analysis of Variance)

How to Conduct an ANCOVA in R

ANCOVA example

ANCOVA in R

Doing and reporting your first ANOVA and ANCOVA in R

ANCOVA -- Notes and R Code

ANCOVA: Analysis of Covariance

An introduction to the two-way ANOVA

ANOVA in R: A step-by-step guide

An introduction to the one-way ANOVA

Understanding confounding variables


From now on, where possible, I will try my best to keep notes in English to practice writing for work.

Recently I discussed the non-standard evaluation mode in the dplyr package with a colleague. Before that conversation, I had always described this mode as "dynamic variables" when searching Google to solve related problems. Then I learned that this dynamic mode is called "non-standard evaluation" in dplyr.

In order to keep a tidy environment, most dplyr verbs use tidy evaluation, a special type of non-standard evaluation used throughout the tidyverse. It defines the concept of data masking, which lets you use data variables as if they were variables in the environment. It also provides tidy selection, so you can choose variables based on their position (e.g. 1, 2, 3), name, or type (e.g. is.numeric).
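For example (a minimal sketch on the built-in `iris` data), tidy selection lets you pick columns by position, name, or type:

```r
library(dplyr)

# select by position
iris %>% select(1, 2) %>% names()

# select by name
iris %>% select(Species) %>% names()

# select by type
iris %>% select(where(is.numeric)) %>% names()
```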

- If you want to learn more about the difference between non-standard evaluation and standard evaluation, the post Dynamic column/variable names with dplyr using Standard Evaluation functions will be helpful.
- If you want to know about data masking and tidy selection, the vignettes Programming with dplyr-vignettes or Programming with dplyr are suitable for learning.
- The dplyr team recommends reading the Metaprogramming chapters in Advanced R (a book) if we'd like to learn more about the underlying theory, or precisely how it differs from non-standard evaluation.

For this post, I mainly record some common solutions for using dynamic variables (also called intermediate variables, or NSE) in dplyr. Although the features above make some tasks easier, sometimes we may be confused about how to use NSE in `mutate()`, `summarise()`, `group_by()`, and `filter()`, especially in self-defined function arguments or ggplot2 arguments.

In other words, I need to learn how to use non-standard evaluation (NSE) in dplyr calls.

Use the `.data` pronoun to refer to variables stored as strings.

```r
library(tidyverse)

GraphVar <- "dist"
cars %>%
  group_by(.data[["speed"]]) %>%
  summarise(
    Sum = sum(.data[[GraphVar]], na.rm = TRUE),
    Count = n()
  ) %>%
  head()
```

Use `:=` to take the name of a new column in the output data frame from a string variable.

```r
library(tidyverse)

var <- "value"
iris %>%
  mutate(!!var := ifelse(Sepal.Length > 5, 1, 0)) %>%
  head()
```

The easiest way to remember and operate is to use the constructor `sym()` when we want to unquote something that looks like code instead of a string, which is often used in ggplot2 and R Shiny.

```r
library(tidyverse)

grp.var <- "Species"
uniq.var <- "Sepal.Width"
iris %>%
  group_by(!!sym(grp.var)) %>%
  summarise(n_uniq = n_distinct(!!sym(uniq.var)))
```

For the tricks inside functions, there are two situations, depending on the type of the variables: env-variables or data-variables.

- Env-variables are "programming" variables that live in an environment. They are usually created with `<-`.
- Data-variables are "statistical" variables that live in a data frame. I understand them as column names.

If the function arguments are not strings, the variable names can be automatically quoted by surrounding them with doubled braces (`{{ }}`).

```r
library(tidyverse)

mean_by <- function(data, var, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(avg = mean({{ var }}))
}

mean_by(starwars, group = species, var = height) %>% head()
```

If we'd like to pass the arguments as character strings, we need to construct symbols from those strings.

```r
library(tidyverse)

mean_by <- function(data, var, group) {
  group <- sym(group)
  var <- sym(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, group = "species", var = "height") %>% head()
```

If you want to accept user-supplied expressions, such as `height * 100`, doubled braces still work, but `sym()` does not. In this situation, we need to replace `sym()` with `enquo()`.

```r
library(tidyverse)

mean_by <- function(data, var, group) {
  group <- enquo(group)
  var <- enquo(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, var = height * 100, group = as.factor(species)) %>% head()
```

dplyr allows multiple grouping variables, which can be passed through the `...` argument.

```r
library(tidyverse)

mean_by <- function(data, var, ...) {
  var <- enquo(var)
  data %>%
    group_by(...) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, height, species, eye_color)
```

The above is supplementary material for a previous blog post (https://www.bioinfo-scrounger.com/archives/R-dplyr-tricks/).

Dynamic column/variable names with dplyr using Standard Evaluation functions

Programming with dplyr-vignettes

Programming with dplyr

https://stackoverflow.com/questions/27975124/pass-arguments-to-dplyr-functions
