Git is a version control system used to track changes in computer files. Its primary purpose is to manage the changes made to one or more projects over time. It helps coordinate work among the members of a project team, tracks progress over time, and serves both programming professionals and non-technical users by monitoring their project files.

**What is GitLab?**

GitLab is a web-based Git repository manager that provides free open and private repositories, issue tracking, and wikis. It is a complete DevOps platform that enables professionals to perform all the tasks in a project—from project planning and source code management to monitoring and security. Furthermore, it allows teams to collaborate and build better software.

The above is quoted from https://www.simplilearn.com/tutorials/git-tutorial/what-is-gitlab

Nowadays GitLab is widely used in many companies, not least because its core features are free.

Generally I’m accustomed to using GitHub Desktop to connect to Git repositories such as GitHub. This was my first time using GitLab, so I searched for some information on Google. Below is my summary.

Open your Git Bash (if you don’t have it, please install Git first).

Confirm if you have an existing SSH key pair.

- Go to your home directory and check the `.ssh/` subfolder to see whether you already have SSH keys.
- The common public key file names are `id_ed25519.pub` for ED25519 (preferred), `id_rsa.pub` for RSA, and `id_ecdsa.pub` for ECDSA.

If you have none of them, you should generate an SSH key pair; for ED25519, for example:

`ssh-keygen -t ed25519 -C "<comment>"`

Copy the contents of your public key file to the clipboard from Git Bash, for example:

`cat ~/.ssh/id_ed25519.pub | clip`

Then add the SSH key to your GitLab account.

Sign in to GitLab and select **Preferences** from your avatar in the top-right corner.

- Select **SSH Keys** on the left sidebar.
- Paste the key into the **Key** box and then select **Add key**.

Verify that you can connect to GitLab. Open Git Bash and run this command, replacing `gitlab.example.com` with your GitLab instance URL:

`ssh -T git@gitlab.example.com`

If this is your first time connecting, you may be asked to verify the authenticity of the GitLab host; otherwise you should receive a welcome message.

Now that the general configuration is complete, you can clone your repository over SSH as in a normal Git workflow.

`git clone git@ssh.xxx/test.git`

All of the above follows https://docs.gitlab.com/ee/ssh/

If you would like to use RStudio, you should paste the SSH key from RStudio into the GitLab key box.

Then in RStudio select **File -> New Project -> Version Control -> Git**.

Copy the `SSH url` from GitLab and paste it into **Repository URL**. Click **Create Project** to start cloning and set up the RStudio environment.

Now you can use simple `commit` and `push` commands in RStudio just as in Git.

Download GitHub Desktop first (if you haven’t already).

Go to the GitLab repository that you want to clone, and click **Settings -> Access Tokens** on the left sidebar.

Add a project access token, filling in the Name, Expiration (optional), and Scopes. After filling these in, click **Create project access token**.

Copy your access token and store it somewhere safe, as we will use it later.

Open GitHub Desktop and click **File -> Clone repository**. Paste the URL of your repository into the URL field, choose the destination folder (local path), and then select **Clone**.

While cloning, a window will pop up asking for a username and password; the password is the access token we created earlier. Enter it and click **Save and retry** to restart cloning.

Finally, you can work with the repository between local and remote just as you are used to with GitHub.

All of the above follows “How to use Github Desktop with Gitlab Repository”.

**Please indicate the source**: http://www.bioinfo-scrounger.com

First, I have a Shiny app containing Chinese characters, as shown below:

```r
library(shiny)
ui <- fluidPage(
  p("这个是一个中文测试"),
  checkboxGroupInput(
    inputId = "test",
    label = "选择一门语言:",
    choiceNames = c("中文", "英语", "其他"),
    choiceValues = c(1, 2, 3)
  )
)
server <- function(input, output, session) { }
shinyApp(ui, server)
```

This Shiny app opens normally in the RStudio IDE.

I then published it to both **shinyapps.io** and an **RStudio Connect server**, and both failed: the former reported that package XXX was not found, the latter reported a problem with character XXX. I also saw the following warning on the Deploy page:

```
Warning message:
In fileDependencies.R(file) :
  Failed to parse C:/Users/guk8/AppData/Local/Temp/RtmpKeuyz5/file3edc27cd319/app.R ;
  dependencies in this file will not be discovered.
```

After some searching on Google, my initial suspicion was the encoding of the Chinese characters. But then why is the shinyapps.io error not about Chinese characters?

Before answering that question, we can try removing the Chinese from the `choiceNames` argument of `checkboxGroupInput` and replacing it with English, as shown below:

```r
library(shiny)
ui <- fluidPage(
  p("这个是一个中文测试"),
  checkboxGroupInput(
    inputId = "test",
    label = "选择一门语言:",
    # choiceNames = c("中文", "英语", "其他"),
    choiceNames = c("Chinese", "English", "Others"),
    choiceValues = c(1, 2, 3)
  )
)
server <- function(input, output, session) { }
shinyApp(ui, server)
```

**Now the Shiny app publishes normally.**

Why is that? For now, we can conclude that shinyapps.io does not support Chinese anywhere other than `label` in a Shiny app published from my local PC (Windows 10).

Here is the R locale on my local PC:

```
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
```

And the R code encoding at that point:

```
> options()$encoding
[1] "UTF-8"
```

At this point I suspected the encoding language of my local PC, so I moved the Shiny app with Chinese in the `choiceNames` argument to RStudio Server on Linux.

The Linux locale is as follows:

```
anlan@ubuntu:~$ locale
LANG=zh_CN.GBK
LANGUAGE=zh_CN:zh:en_US:en
LC_CTYPE="zh_CN.GBK"
LC_NUMERIC="zh_CN.GBK"
LC_TIME="zh_CN.GBK"
LC_COLLATE="zh_CN.GBK"
LC_MONETARY="zh_CN.GBK"
LC_MESSAGES="zh_CN.GBK"
LC_PAPER="zh_CN.GBK"
LC_NAME="zh_CN.GBK"
LC_ADDRESS="zh_CN.GBK"
LC_TELEPHONE="zh_CN.GBK"
LC_MEASUREMENT="zh_CN.GBK"
LC_IDENTIFICATION="zh_CN.GBK"
LC_ALL=
```

And the R locale:

```
> Sys.getlocale()
[1] "LC_CTYPE=zh_CN.UTF-8;LC_NUMERIC=C;LC_TIME=zh_CN.UTF-8;LC_COLLATE=zh_CN.UTF-8;LC_MONETARY=zh_CN.UTF-8;LC_MESSAGES=zh_CN.UTF-8;LC_PAPER=zh_CN.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=zh_CN.UTF-8;LC_IDENTIFICATION=C"
```

On this system, the app publishes to shinyapps.io normally.

From some material on Google, it appears that to encode characters like Chinese, the R locale needs to be UTF-8 (that may not be the only factor, but at least I can now publish a Chinese Shiny app normally from this Linux system).

Besides shinyapps.io, we can also publish to RStudio Connect, but we must pay attention to the RStudio Connect server’s locale.

My Linux locale was originally `zh_CN.GBK`, and I could not publish from either RStudio Server or desktop RStudio on the PC. The error was roughly:

```
invalid multibyte string at xxx line
```

accompanied by the following warnings:

```
Warning message:
In fileDependencies.R(file) :
  Failed to parse C:/Users/guk8/AppData/Local/Temp/RtmpKeuyz5/file3edc27cd319/app.R ;
  dependencies in this file will not be discovered.
```

At first I thought the `parse()` function was failing on the Chinese characters, but even after converting the Chinese characters to UTF-8 it still would not publish.

There is a somewhat tricky workaround for the problem above: write the Chinese characters to an external file, word.txt, read it in with `read.table()` specifying the `fileEncoding = "UTF-8"` argument, assign the Chinese strings to a vector or list, and finally reference that object in the Shiny UI code.

I tried this method and it works! But it is far too cumbersome.
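A minimal sketch of that workaround (the file name word.txt is from the post; ASCII placeholders stand in for the Chinese labels here, and `tempdir()` is used just to keep the sketch self-contained):

```r
# Keep the labels in an external UTF-8 file and read them at app start.
f <- file.path(tempdir(), "word.txt")
writeLines(c("Chinese", "English", "Others"), f)
labels <- read.table(f, fileEncoding = "UTF-8",
                     stringsAsFactors = FALSE)$V1
# `labels` can then be passed to choiceNames in the Shiny UI.
```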

In the end, I guessed that the origin of the problem was the RStudio Connect server, probably its locale. So I added the following to `/etc/default/locale`:

```
LANG=en_US.utf8
LANGUAGE=en_US:en
```

that is, changing the server’s LANG from the original `zh_CN.GBK` to `en_US.utf8`.

The server locale is now:

```
anlan@ubuntu:~$ locale
LANG=en_US.utf8
LANGUAGE=en_US:en
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
```

And the R locale on the server:

```
> Sys.getlocale()
[1] "LC_CTYPE=en_US.utf8;LC_NUMERIC=C;LC_TIME=en_US.utf8;LC_COLLATE=en_US.utf8;LC_MONETARY=en_US.utf8;LC_MESSAGES=en_US.utf8;LC_PAPER=en_US.utf8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.utf8;LC_IDENTIFICATION=C"
```

Now Chinese Shiny apps publish to RStudio Connect normally from both desktop RStudio on the PC and RStudio Server.

Although I cannot explain the exact cause, it was probably the server’s language encoding that blocked publishing before. At least the problem is solved.

That was my debugging process over the past two days; I hope it helps anyone who runs into Chinese-character errors in Shiny.


*Defining variables once*

**Why create the analysis data set?** One of the primary reasons for creating analysis data sets is to have variable derivation in a single place so that we can avoid searching and changing each variable in different programs multiple times.

*Defining Study Populations*

- **Intent-to-treat (ITT)**: all patients who were randomized to study therapy, with the intent that they be treated. Patients are analyzed according to their randomized treatment group.
- **As-treated**: patients are analyzed according to the study intervention they actually received; a patient may have received a treatment they were not randomized to.
- **Per-protocol**: all patients who did not experience a subjectively defined serious protocol violation during the study.
- **Safety**: all patients who actually received the study drug.

The following analysis-set concepts are excerpted from: https://www.163.com/dy/article/G0LAV9TD053438SI.html

Before the statistical analysis of the data, the analysis populations for efficacy and safety must be defined, and the analysis must then follow those definitions. The common analysis sets are as follows:

- **Intention-to-treat population (ITT)**: the so-called ITT analysis set includes all patients after randomization. Note that if a patient is randomized to arm A, all subsequent ITT analyses must keep that patient in arm A, even if the patient actually received arm B’s treatment or no treatment at all. This may seem counterintuitive, but its most important purpose is to keep the baseline characteristics of the two arms balanced and comparable: through randomization, all variables other than the factor under study are balanced and matched out, so the intervention effect can be observed cleanly. The as-treated set, by contrast, analyzes patients according to the treatment they actually received. For single-arm studies, the ITT concept usually does not come up; it generally refers to all enrolled patients (typically defined by signing the informed consent form).
- **Full analysis set (FAS)**: a subset of the ITT set, referred to in some studies as the modified ITT (mITT) analysis. It is the data set obtained after the minimal and unbiased exclusion of randomized subjects, preserving the integrity of the original data set and reducing bias, though there is currently no consensus on this question. ICH E9 describes a limited set of specific reasons that may lead to subjects being excluded from the full analysis set, including (1) failure to satisfy major entry criteria; (2) never having taken a single dose of the drug; (3) having no data after randomization. The FAS can serve as the primary analysis set.
- **Per-protocol set (PP)**: a subset of the FAS, comprising subjects with no serious protocol violations regarding entry criteria, treatment received, primary endpoint measurement, and so on; it analyzes only subjects who complied with the intervention. Personally, I think the FAS and PP sets do not differ greatly, which is why many studies choose either the FAS or the PP analysis to report alongside the ITT analysis. (Note in particular that removing a patient from an analysis set requires solid justification, usually agreed upon jointly by the investigator, the sponsor, and the statistician; and for blinded studies, the decision must be made before unblinding—before unblinding, before unblinding (important things are said three times)—because changing data after unblinding raises suspicion of data manipulation and will usually be questioned by regulators (the painful lesson of the NEJ-009 study is still fresh).)
- **Safety set (SS)**: unlike the efficacy sets above, the safety set is used to evaluate the safety of the study drug. It generally requires: (1) randomization; (2) at least one dose of the study drug; (3) at least one safety assessment.

The following is excerpted from ICH E9:

- In general, it is advantageous to demonstrate that the main trial results are not sensitive to the choice of analysis set.
- In some cases, it may be best to plan analyses on different sets to explore the sensitivity of the conclusions.
- In superiority trials, the full analysis set is used for the primary analysis (apart from exceptional circumstances), because it tends to avoid the over-optimistic estimates of efficacy that the per-protocol set can produce: including poor compliers in the full analysis set will generally reduce the estimated treatment effect.
- However, in an equivalence or non-inferiority trial, use of the full analysis set is generally not conservative, and its role should be considered very carefully.

*Defining Baseline Observations*

“Baseline” is a common clinical concept used to describe the state of a patient before some intervention, so that a subsequent comparison starts from a balanced state. For cholesterol measurements, for example, the baseline value is usually the last reading prior to the medical intervention.

A related task is deriving Last Observation Carried Forward (LOCF) variables. For example, you may want the last observation carried forward as long as the measurement occurs within a five-day window before the pill is taken.
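A minimal base-R sketch of that LOCF baseline (the study days and cholesterol values are hypothetical; dosing is day 0):

```r
# Hypothetical cholesterol readings, with study day relative to dosing.
meas <- data.frame(
  day  = c(-10, -4, -2, 3),
  chol = c(210, 205, 200, 190)
)
# Keep pre-dose readings within the five-day window, then carry the
# last one forward as the baseline.
window <- meas[meas$day < 0 & meas$day >= -5, ]
baseline_locf <- window$chol[which.max(window$day)]
```

Here the day -10 reading is outside the window, so the day -2 value (200) becomes the baseline.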

*Defining Study Day*

- Calculating a continuous study day: `study_day = event_date - intervention_date + 1`. In this approach, day 1 represents the initial intervention.
- Calculating a study day without day zero: if event_date is earlier than intervention_date, then `study_day = event_date - intervention_date`; if event_date is on or after intervention_date, then `study_day = event_date - intervention_date + 1`. Again, day 1 represents the initial intervention.

The first way is useful for graphing or calculating durations that span the day before the therapeutic intervention day.

The second way is more intuitive as the day before intervention is represented by study day “-1”, so it is used more often, especially in CDISC SDTM.

However, the first way is the one recommended in CDISC ADaM. Whether or not you derive data based on the CDISC models, you should calculate study day variables **in a consistent fashion** across a clinical trial or set of trials for an application.
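Both conventions can be sketched in a few lines of base R (the dates are hypothetical):

```r
intervention <- as.Date("2021-01-10")
events <- as.Date(c("2021-01-08", "2021-01-10", "2021-01-15"))
diff <- as.numeric(events - intervention)

# Way 1 (continuous, ADaM style): intervention day is 1, day before is 0
study_day_1 <- diff + 1

# Way 2 (no day zero, SDTM style): day before intervention is -1
study_day_2 <- ifelse(diff < 0, diff, diff + 1)
```

For the event two days before intervention, way 1 gives -1 while way 2 gives -2; on and after the intervention date the two agree.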

*Windowing Data*

A tag is a descriptive label such as “Visit 5”, “Baseline”, or “Abnormal”. For example, baseline observations must occur before initial drug dosing.

*Transposing Data*

Normalized data may also be described as “stacked”, **“vertical”**, or “tall and skinny”, while non-normalized data are often called “flat”, **“wide”**, or “short and fat”.

So normalized and non-normalized data correspond to long and wide data, which is why we need to transpose data so that the dependent variable is present on the same observation as the independent variables.

In SAS, I think the `proc transpose` procedure is a powerful tool for these needs, whether going from long data to wide or from wide data to long.

```sas
**** INPUT SAMPLE NORMALIZED SYSTOLIC BLOOD PRESSURE VALUES.
**** SUBJECT = PATIENT NUMBER, VISIT = VISIT NUMBER,
**** SBP = SYSTOLIC BLOOD PRESSURE.;
data sbp;
  input subject $ visit sbp;
  datalines;
101 1 160
101 3 140
101 4 130
101 5 120
202 1 141
202 2 151
202 3 161
202 4 171
202 5 181
;
run;

**** TRANSPOSE THE NORMALIZED SBP VALUES TO A FLAT STRUCTURE.;
proc transpose data = sbp
               out = sbpflat
               prefix = VISIT;
  by subject;
  id visit;
  var sbp;
run;
```

In R, this procedure is handled by the `pivot_longer()` and `pivot_wider()` functions.
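The same transpose can be sketched in base R with `reshape()`, mirroring the SAS step above (tidyr’s `pivot_wider(names_from = visit, values_from = sbp)` would do the same job; base R is used here just to stay dependency-free):

```r
# Same SBP data as the SAS example, in long ("normalized") form.
sbp <- data.frame(
  subject = rep(c("101", "202"), c(4, 5)),
  visit   = c(1, 3, 4, 5, 1, 2, 3, 4, 5),
  sbp     = c(160, 140, 130, 120, 141, 151, 161, 171, 181)
)
# Long to wide: one row per subject, one sbp.<visit> column per visit.
sbp_wide <- reshape(sbp, idvar = "subject", timevar = "visit",
                    direction = "wide", v.names = "sbp")
```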

*Categorical Data and Why Zero and Missing Results Differ Greatly*

Missing data:

- The response is unknown.
- The observation will not be included in population analysis and denominator definitions.

Zero data:

- The response is known.
- The response is “NO” when the categorical variable is a Boolean variable.
- The observation will be included in population analysis and denominator definitions.
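The denominator consequence is easy to see with a tiny hypothetical Boolean response: a missing value drops out of the denominator, while a known “NO” still counts in it.

```r
response <- c("YES", "NO", NA, "YES")
denominator <- sum(!is.na(response))  # 3, not 4: NA is excluded
pct_yes <- sum(response == "YES", na.rm = TRUE) / denominator
```

So the “YES” rate is 2/3 here, not 2/4; recoding the NA to “NO” would change the result, which is exactly why zero and missing must be kept distinct.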

*Performing Many-to-Many Comparisons/joins*

Imagine you have a data set of adverse event data and a data set of concomitant medications, and you want to know if a concomitant medication was given to a patient during the time of the adverse event.

They are usually joined with `proc sql` in SAS, or with `left_join()` in R, which is a very common data-manipulation procedure.
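A base-R sketch of that comparison with hypothetical data: join the adverse-event and concomitant-medication records by subject (a many-to-many match), then keep the pairs whose date ranges overlap.

```r
ae <- data.frame(subject  = c("101", "101"),
                 event    = c("Headache", "Rash"),
                 ae_start = as.Date(c("2021-01-05", "2021-02-01")),
                 ae_stop  = as.Date(c("2021-01-10", "2021-02-03")))
cm <- data.frame(subject  = "101",
                 drug     = "Aspirin",
                 cm_start = as.Date("2021-01-07"),
                 cm_stop  = as.Date("2021-01-09"))

pairs  <- merge(ae, cm, by = "subject")  # cartesian join within subject
during <- pairs[pairs$cm_start <= pairs$ae_stop &
                pairs$cm_stop >= pairs$ae_start, ]
```

Only the headache overlaps the aspirin dosing window, so `during` keeps that single pair.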

*Common Analysis Data Sets*

- The **critical variables analysis data set** always has a single observation per subject, to simplify the process of merging with other data sets. Its whole purpose is to capture in one place the essential analysis stratification variables that are used throughout statistical analysis and reporting.
- The purpose of **change-from-baseline analysis data sets** is to measure what effect a therapeutic intervention had on some kind of diagnostic measure. A measure is taken before and after therapy, and a difference (and sometimes a percentage difference) is calculated for each post-baseline measure.
- A **time-to-event analysis data set** captures information about the time between therapeutic intervention and some other particular event. Two variables are defined as follows:
  - **Event/Censor**: a binomial outcome such as “success/failure,” “death/life,” or “heart attack/no heart attack.” If the event happened to the subject, the event variable is set to 1. If it is certain that the patient did not experience the event, the event variable is set to 0. Otherwise, this variable should be missing.
  - **Time to Event**: the time (usually study day) from therapeutic intervention to the event date or censor date. If the event occurred for a subject, the time to event is the study day of that event. If the event did not occur, the time to event is set to the censor date, which is often the last known follow-up date for the subject.

Survival data, also called time-to-event data, is central to survival analysis, for instance the Kaplan-Meier curve, the log-rank test, and the Cox proportional hazards model.

Often the censor date is the last known date of patient follow-up, but a patient could be censored for other reasons, such as having taken a protocol-prohibited medication.

Creating time-to-event data sets can be a difficult programming task, especially during interim data analyses, such as for a DSMB. This is usually because the event data itself is captured in more than one place in the case report form and the censor date may be difficult to obtain.

For example, perhaps the event of interest is death. You may have to search the adverse events CRF page, the study termination CRF page, clinical endpoint committee CRFs, and perhaps a special death events CRF page just to gather all of the known death events and dates. For subjects who did not experience the event of interest, you may not have a study termination form to provide the censoring date, so you may have to use some surrogate data to create a censor date.
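The two derived variables described above can be sketched in base R with hypothetical study days (death is the event of interest here):

```r
subj <- data.frame(
  subject   = c("101", "102", "103"),
  death_day = c(120, NA, NA),   # study day of death, if it occurred
  last_day  = c(120, 300, 45)   # last known follow-up study day
)
# Event/Censor: 1 if the event occurred, 0 otherwise (censored).
subj$event <- ifelse(is.na(subj$death_day), 0, 1)
# Time to event: event day if it occurred, otherwise the censor day.
subj$time  <- ifelse(subj$event == 1, subj$death_day, subj$last_day)
```

A pair like this (`time`, `event`) is exactly what `survival::Surv()` expects for Kaplan-Meier or Cox modelling.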


The alluvial plot and the Sankey diagram are both forms of general flow diagram. These plot types are designed to show how the magnitude of some quantity changes as it flows between states. Although the difference between alluvial plots and Sankey diagrams is much discussed online (see, for example, the issue “Alluvial diagram vs Sankey diagram?”), here the aim is simply to study the connection and flow of data between different categorical features in R, so we will not mind using the two terms interchangeably.

Note: an alluvial diagram is a subcategory of Sankey diagrams in which the nodes are grouped into vertical axes (sometimes called steps).

To illustrate the following cases, I will first load the flight data from the `nycflights13` package. This comprehensive data set contains all flights that departed from the New York City airports JFK, LGA, and EWR in 2013, including three columns we care about: origin (airport of origin), dest (destination airport), and carrier (airline code). For a better demonstration, I select the top five destinations and the top four carriers.

```r
top_dest <- flights %>%
  count(dest) %>%
  top_n(5, n) %>%
  pull(dest)

top_carrier <- flights %>%
  filter(dest %in% top_dest) %>%
  count(carrier) %>%
  top_n(4, n) %>%
  pull(carrier)

fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier)
```

Let’s take a look at the Sankey diagram first. Google defines a sankey as:

A sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains or multiple paths through a set of stages.

In R, we can plot a Sankey diagram with the `ggsankey` package in the ggplot2 framework. This package kindly provides a function, `make_long()`, to transform our usual wide data to long.

```r
fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  ggsankey::make_long(origin, carrier, dest)
```

The transformed data contains four columns, corresponding to stage and node: `x` and `next_x` for the stage, and `node` and `next_node` for the node. Hence, at least four columns are required, so that the columns fit the parameters of the plotting functions. More usage is illustrated in the package documentation (https://github.com/davidsjoberg/ggsankey).

So a basic Sankey diagram looks like the following:

```r
ggplot(fly, aes(x = x, next_x = next_x, node = node,
                next_node = next_node, fill = factor(node), label = node)) +
  geom_sankey(flow.alpha = 0.6, node.color = "gray30") +
  geom_sankey_label(size = 3, color = "white", fill = "gray40") +
  scale_fill_viridis_d() +
  theme_sankey(base_size = 18) +
  labs(x = NULL) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = .5))
```

Furthermore, the `networkD3` package is also able to plot Sankey diagrams, but it is not as easy to use, I think.

After my initial use, the `alluvial` and `ggalluvial` packages seem very well suited for R users creating alluvial plots. The former has its own specific syntax, whereas the latter integrates seamlessly into ggplot2, just like `ggsankey`.

Actually, to be honest, both of them are convenient; you can choose either one according to your situation. For example, the `alluvial` package is demonstrated below:

```r
fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest) %>%
  mutate(origin = fct_relevel(as.factor(origin), c("EWR", "JFK", "LGA")))

alluvial(fly %>% select(-n),
         freq = fly$n,
         border = NA,
         alpha = 0.5,
         col = case_when(fly$origin == "JFK" ~ "red",
                         fly$origin == "EWR" ~ "blue",
                         TRUE ~ "orange"),
         cex = 0.75,
         axis_labels = c("Origin", "Carrier", "Destination"))
```

Detailed usage can be found on the package site (https://github.com/mbojan/alluvial).

If you would like more customized adjustments, `ggalluvial` is probably the better choice: as a ggplot2 extension, it has enough functions to modify the plot to match your ideas.

Nor do we need to spend much time transforming data, because `ggalluvial` also has a very convenient function that does the same job as `make_long()` in `ggsankey`: if your data is in a “wide” format, like the flights data set, the `to_lodes_form()` function will help you easily.

```r
fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest) %>%
  mutate(
    origin = fct_relevel(as.factor(origin), c("LGA", "EWR", "JFK")),
    col = origin
  ) %>%
  ggalluvial::to_lodes_form(key = type, axes = c("origin", "carrier", "dest"))

ggplot(data = fly,
       aes(x = type, stratum = stratum, alluvium = alluvium, y = n)) +
  # geom_lode(width = 1/6) +
  geom_flow(aes(fill = col), width = 1/6, color = "darkgray",
            curve_type = "cubic") +
  # geom_alluvium(aes(fill = stratum)) +
  geom_stratum(color = "grey", width = 1/6) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  theme(
    panel.background = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 15, face = "bold"),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "none"
  ) +
  scale_fill_viridis_d()
```

`geom_alluvium()` is a bit different from `geom_flow()`; which one to choose depends on the type of data set you provide and the purpose you want to demonstrate.

Above are my notes on a previous question. If you have any questions, please read the following references, which are easier to understand.

https://github.com/davidsjoberg/ggsankey

https://www.displayr.com/sankey-diagrams-r/

https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html

https://corybrunson.github.io/ggalluvial/

https://www.r-bloggers.com/2019/06/data-flow-visuals-alluvial-vs-ggalluvial-in-r/

https://cran.rstudio.com/web/packages/ggalluvial/vignettes/ggalluvial.html


For simplicity: suppose you have many clinical data sets and want to find out which of them contain a certain variable, such as Sex. What should we do? Thinking in R, I would read all the data sets, collect their column names (variables), and then find which data sets contain that specific variable. To do the same in SAS, you might think you need to create a macro using the `open` and `varnum` functions. However, there is a tip that achieves it directly, as follows:

```sas
libname mydata "C:/Users/anlan/Documents/CYFRA";

data sex_tb;
  set sashelp.vcolumn;
  where libname="MYDATA" and name="SEX";
  keep memname name type length;
run;
```

Just query `sashelp.vcolumn`, filtering the rows by libname (`libname="MYDATA"`) and column name (`name="SEX"`). You will not find this tip in common books, but it is very useful in our work; I generally call it “work experience”.
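The R version of the same idea can be sketched over a named list of data frames (the tiny in-memory data sets here are hypothetical; in practice you would first read the files with something like `haven::read_sas()`):

```r
datasets <- list(
  dm = data.frame(USUBJID = "001", SEX = "M"),
  ae = data.frame(USUBJID = "001", AETERM = "Headache")
)
# Keep only the data sets whose column names include "SEX".
has_sex <- names(Filter(function(df) "SEX" %in% names(df), datasets))
```

Here only the `dm` data set survives the filter.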

The second example: how to derive a date from a string? I think any such derivation can be split into two steps: matching and extracting. In R, the popular string package is obviously `stringr`; how about SAS? SAS does not have an equivalent toolkit package/macro, but some functions are useful, such as `prxchange`. After reading the `prxchange` documentation, I feel it has some similarity to Perl regular-expression matching, as follows:

```sas
data work.demo;
  input tmp $20.;
  datalines;
AE 2021-01-01
CM 2021-02-01
MH 2021-03-01
;
run;

data demo_date;
  set demo;
  format date YYMMDD10.;
  /* date=prxchange("s/\w+\s+(.*)/$1/",1,tmp); */
  date=input(prxchange("s/\w+\s+(.*)/$1/",1,tmp),YYMMDD10.);
run;

proc contents data=demo_date;
run;
```

If you just want to judge whether a word or number exists in a string—in other words, to match words—`prxmatch` and `find` may be helpful. If you want to extract words instead of matching them, `prxchange` or `prxparse` plus `prxposn` is preferred; by the way, the latter is closer to the R logic.

```sas
data work.num;
  input tmp $;
  datalines;
AE01
CM02
MH03
;
run;

data num_ext;
  length tmp type1 type2 $ 20;
  keep tmp type1 type2;
  re=prxparse("/([a-zA-Z]+)(\d+)/");
  set num;
  if prxmatch(re, tmp) then do;
    type1=prxposn(re, 1, tmp);
    type2=prxposn(re, 2, tmp);
    output;
  end;
run;
```
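For comparison, the base-R counterpart of that `prxparse`/`prxposn` step uses `regexec()` and `regmatches()` to pull out the same capture groups (`stringr::str_match()` would give an equivalent matrix):

```r
tmp <- c("AE01", "CM02", "MH03")
# Element 1 of each match is the whole string; 2 and 3 are the groups.
m <- regmatches(tmp, regexec("([a-zA-Z]+)([0-9]+)", tmp))
type1 <- vapply(m, `[`, "", 2)  # letter part
type2 <- vapply(m, `[`, "", 3)  # digit part
```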

The third question: what is the difference between an informat and a format?

- An `informat` describes how the data is presented in the text file.
- A `format` describes how you want SAS to present the data when you look at it.

Remember, formats do not change the underlying data, just how it is printed on the screen.

As an instance, converting a `date9.` date to `yymmdd10.` demonstrates the use of the `format` statement.

```sas
data aes;
  input @1 subject $3.
        @5 ae_start date9.
        @15 ae_stop date9.
        @25 adverse_event $15.;
  format ae_start yymmdd10. ae_stop yymmdd10.;
  datalines;
101 01JAN2004 02JAN2004 Headache
101 15JAN2004 03FEB2004 Back Pain
102 03NOV2003 10DEC2003 Rash
102 03JAN2004 10JAN2004 Abdominal Pain
102 04APR2004 04APR2004 Constipation
;
run;
```

In another common situation, we store the number 1 for male and 2 for female. It would be embarrassing to hand a client a report with these unexplained 1s and 2s, so we need a format to dress up the report.

```sas
data report;
  input ID Gender State $;
  datalines;
100001 1 LA
100002 2 LA
100003 . AL
;
run;

proc format;
  value sex
    1 = "Male"
    2 = "Female"
    . = "Unknown";
run;

proc print data = report;
  var id gender state;
  format gender sex.;
run;
```

An `informat` is usually used with the `input` statement to read various styles of variables into SAS.

Informat usage:

- Character informats: `$INFORMATw.`
- Numeric informats: `INFORMATw.d`
- Date/time informats: `INFORMATw.`

```sas
data death;
  input @1 subject $3.
        @5 death 1.;
  datalines;
101 0
102 0
;
run;
```

The fourth question: what is the difference between placing the keep option on the SET statement versus on the DATA statement?

If you place the keep option on the SET statement, SAS keeps the specified variables when it reads the input data set. On the other hand, if you place the keep option on the DATA statement, SAS keeps the specified variables when it writes the output data set. From this explanation, it follows that the former is faster than the latter when the input data set is very large, because fewer variables need to be read in.

Above are some of my interview questions. It is a pity that I was not fully prepared, as I am proficient in R but not in SAS. So I plan to use my free time to learn and summarize SAS, just as I did when learning Perl, R, and Python.

By the way, I think it is a good book, as it not only provides some SAS cases but also introduces knowledge from the pharmaceutical industry.


A few months ago, I consulted SAS support about how to import plots drawn by R inside IML (via the `submit /R` statement) directly into RTF templates, as I could not find any useful information on Google. Unfortunately, SAS support told me that if the plot is created in R, it also needs to be saved within the submit block using R code; in other words, if you want to import R graphics into RTF directly, you need some R function to achieve it. However, I happened upon a post (Mixing SAS and R Plots) that illustrates how to mix SAS plots and R plots in one graph, which solves my question perfectly. So let’s have a look at how to generate a plot with R in SAS and import it into RTF or PDF. It is not direct—more of a trick, I think.

First, as usual, we draw an R graphic in SAS.

```sas
proc iml;
  submit /R;
    library(ggpubr)
    data("ToothGrowth")
    ggviolin(ToothGrowth, x = "dose", y = "len", add = "none") %>%
      ggadd(c("boxplot", "jitter"), color = "dose") %>%
      ggsave(filename = "C:/Users/anlan/Documents/plots/violin.png",
             width = 10, height = 8, units = "cm", dpi = 300)
  endsubmit;
  call ImportDataSetFromR("work.ToothGrowth", "ToothGrowth");
run;
quit;
```

Then set the output destination to an RTF template, with PNG graphics 12 cm high and 15 cm wide. Note that this height and width refer to the size of the whole graphic, not the actual plot size.

```sas
ods rtf file = "C:/Users/anlan/Documents/plots/outgraph.rtf";
ods graphics / noborder height=12cm width=15cm outputfmt=png;
```

Next, we use the SAS Graph Template Language (GTL) to define a template and use the `drawimage` statement to import the R graphic into SAS. The width and height parameters of the `drawimage` statement adjust the size of the image’s bounding box (the actual size). In this example, `width=90 widthunit=percent` means the plot is scaled down to 90%.

```sas
proc template;
  define statgraph plottemp;
    begingraph;
      layout overlay;
        drawimage "C:/Users/guk8/Documents/plots/violin.png" /
          width=90 widthunit=percent
          height=90 heightunit=percent;
      endlayout;
    endgraph;
  end;
run;
```

The final plot is rendered into the RTF as follows:

`proc sgrender template=plottemp; run;`

Actually, the `drawimage` statement is designed to import external graphics into a SAS graph, in order to display a mixed graph. For example, suppose I would like to show a graph with an image at the bottom-right of the overall graph.

```sas
proc template;
  define statgraph mgraphic;
    begingraph;
      entrytitle "Mix SAS and external graphics";
      layout overlay;
        boxplot y=len x=dose / name="box" group=supp
                groupdisplay=cluster spread=true;
        discretelegend "box";
        drawimage "C:/Users/anlan/Documents/plots/violin.png" /
          width=45 widthunit=percent
          height=45 heightunit=percent
          anchor=bottomright x=98 y=2 drawspace=wallpercent;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=work.ToothGrowth template=mgraphic;
  label dose="Dose";
run;
```

The mixed graphic is as follows:

Great, it seems easy to achieve. Obviously it can also be achieved in R using some useful functions.

Mixing SAS and R Plots

DRAWIMAGE Statement

Hands-on Graph Template Language: Part B

BOXPLOT Statement

SAS Boxplot – Explore the Major Types of Boxplots in SAS


To be perfectly honest, I am not entirely sure that what I do here is correct, as I am a new recruit to SAS. However, I have strong experience in R, so I am accustomed to thinking about problems and solving them in R.

For work, I need to learn to manipulate data with R and SAS simultaneously. In this post I aim to reproduce in SAS the same procedure as in that earlier blog post (Logistic Regression for biomarkers). The steps that will be covered are the following:

- Check variable distributions and correlation
- Fit a logistic regression model
- Predict the probability of the event
- Compare two ROC curves

First, I load the same data set from the R package `mlbench` via the SAS/IML procedure; the `submit` and `endsubmit` statements wrap R code in SAS and run it.

```sas
proc iml;
  submit /R;
    data("PimaIndiansDiabetes2", package = "mlbench")
  endsubmit;
  call ImportDataSetFromR("work.diabetes2", "PimaIndiansDiabetes2");
run;
quit;
```

We can take a look at the frequency of the categorical variable in a summary table as follows:

```sas
proc freq data=diabetes2;
  tables diabetes;
run;
```

We can also check the continuous variables as follows:

```sas
proc means data=diabetes2;
  var age glucose insulin mass pedigree pregnant pressure triceps;
run;
```

Moreover, I choose graphs to demonstrate the distribution and correlation of the variables; I always think graphs are more informative. For instance, a histogram makes it easy to examine a distribution and look for outliers.

```sas
proc template;
  define statgraph multiple_charts;
    begingraph;
      entrytitle "Two distributions";
      /* Define Chart Grid */
      layout lattice / rows = 1 columns = 2;
        /* Chart 1 */
        layout overlay;
          entry "Glucose Histogram" / location=outside;
          histogram glucose / binwidth=10;
        endlayout;
        /* Chart 2 */
        layout overlay;
          entry "Pressure Histogram" / location=outside;
          histogram pressure / binwidth=5;
        endlayout;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=diabetes2 template=multiple_charts;
run;
```

As for variable correlation, a correlation heatmap is often a better choice.

```sas
/* Calculate correlation matrix for the data */
ods output PearsonCorr=Corr_P;
proc corr data=diabetes2;
  var age glucose insulin mass pedigree pregnant pressure triceps;
run;

proc sort data=Corr_P;
  by Variable;
run;

/* Transform wide to long */
proc transpose data=Corr_P out=CorrLong(rename=(COL1=Corr)) name=VarID;
  var age glucose insulin mass pedigree pregnant pressure triceps;
  by Variable;
run;

proc sgplot data=CorrLong noautolegend;
  /* colorresponse draws a discrete square for each correlation */
  heatmap x=Variable y=VarID / colorresponse=Corr colormodel=ThreeColorRamp;
  text x=Variable y=VarID text=Corr / textattrs=(size=10pt);
  label Corr='Pearson Correlation';
  yaxis reverse display=(nolabel);
  xaxis display=(nolabel);
  gradlegend;
run;
```

These two figures show that glucose and pressure are approximately normally distributed, and that they are not highly correlated.

Before fitting the model, we first reformat the diabetes variable and keep only rows where the glucose and pressure variables have no NAs.

```sas
proc format;
  value $diabetes "pos"="1" "neg"="0";
run;

data inputData;
  set diabetes2(keep=diabetes glucose pressure);
  if nmiss(of _numeric_) + cmiss(of _character_) > 0 then delete; /* remove NA rows */
  format diabetes $diabetes.;
run;
```

To fit the logistic regression model in SAS, we generally use the following code:

```sas
ods graphics on;
proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose pressure;
  output out=estimates p=est_response;
  ods output roccurve=ROCdata;
run;
```

The `plots(only)=roc` option means we only want to display the ROC plot, and we can get all the predicted probabilities from the `est_response` column of the `estimates` dataset. With the ODS output, we save the ROC curve data into `ROCdata` directly.

Indeed this seems very considerate, but I think it’s just not flexible, because the resulting code is hard to reuse.

In this case, I don’t specify `class` variables. If you do specify class variables with the `param` option set to either `ref` or `glm`, SAS will automatically create dummy variables. (Without specifying `param`, the default coding for two-level factor variables is -1/1, rather than the 0/1 we prefer.)

The partial model results are shown below:

We can see that the variable estimates are equal to those from R. In addition, we automatically get the odds ratio estimate for each variable as well; I could also calculate the odds ratios myself as `exp(coef)`.

If you would like to predict the probability for new data, `lsmeans` may be useful; it has the same effect as `predict()` in R.

Now it’s time to compare two ROC curves, so first I have to fit two logistic regression models: one with glucose only, the other with glucose plus pressure.

```sas
proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose;
  output out=estimates p=est_response;
  ods output roccurve=rocdata1;
run;

proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose pressure;
  output out=estimates p=est_response;
  ods output roccurve=rocdata2;
run;

data plotdata;
  set rocdata1(in=a) rocdata2(in=b);
  if a then group="mod1";
  if b then group="mod2";
  keep _1mspec_ _sensit_ group;
run;
```

We use `set` to combine the two ROC datasets for plotting, and then use `proc sgplot` with a `series` statement to draw the ROC curves.

In `proc sgplot`, the `aspect=1` option requests a square plot, which is customary for an ROC plot since both axes use the [0,1] range. The `inset` statement writes the individual group AUC (area under the ROC curve) values inside the plot area.

```sas
proc sgplot data=plotdata aspect=1;
  /* styleattrs wallcolor=grayEE; */
  series x=_1mspec_ y=_sensit_ / group=group;
  lineparm x=0 y=0 slope=1 / transparency=.3 lineattrs=(color=gray);
  xaxis label="False Positive Fraction" values=(0 to 1 by 0.25) grid
        offsetmin=.05 offsetmax=.05;
  yaxis label="True Positive Fraction" values=(0 to 1 by 0.25) grid
        offsetmin=.05 offsetmax=.05;
  inset ("glucose AUC" = "0.7877" "glucose+pressure AUC" = "0.7913") /
        border position=bottomright;
  title "ROC curves for logistic regression";
run;
```

Above is my first note on using SAS for analysis and visualization. Over the coming period, I plan to compare R and SAS code side by side to push my SAS learning forward. I hope to make steady progress.

Thanks to the post (Using SAS to Estimate a Logistic Regression Model) for making logistic regression in SAS clearer to me.

Modify the ROC plot produced by PROC LOGISTIC

Plot and compare ROC curves from a fitted model used to score validation or other data

Example 78.7 ROC Curve, Customized Odds Ratios, Goodness-of-Fit Statistics, R-Square, and Confidence Limits

Using SAS to Estimate a Logistic Regression Model

sas_correlation_heat_map.sas

SAS Series 20: PROC LOGISTIC logistic regression


As we know, logistic regression can be applied in different contexts, for example:

- Calculating odds ratios (OR) to identify potential risk factors.
- Constructing a model as a classifier to estimate the probability that an instance belongs to a class.
- Adjusting for potential confounding factors so that we can estimate the effect of the factor of interest on the endpoint.

For example:

Suppose we’re interested in knowing how variables such as age, sex, and body mass index affect blood pressure. In this case body mass index may be the factor of most interest, while age and sex are confounding factors. The blood pressure outcome should then be a categorical variable, split into two levels: high blood pressure and normal blood pressure.

In my case, I have a known biomarker as a reference marker, and I’d like to add another marker to the reference one to form a marker combination, then estimate whether the marker combination is better than the reference marker alone. So how do we select the appropriate statistical methods?

Obviously, the most straightforward idea is to compare the sensitivity and specificity between the combined marker and the reference marker.

- Superiority of sensitivity: the alternative hypothesis is that the number of additional cancer cases identified by the combined marker relative to the reference marker is larger than zero. The p-value is calculated with a binomial test.
- Non-inferiority of specificity: the alternative hypothesis is that the number of misclassifications as positive by the combined marker is less than 10% relative to the reference. The p-value is calculated using an approximate standard normal distribution based on restricted maximum likelihood estimation (RMLE).
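As a minimal sketch of the sensitivity comparison (the counts below are hypothetical, not from my study), one way to set up the binomial test is on the discordant cases, i.e. the cases the two markers classify differently:

```r
# Hypothetical counts, purely for illustration:
extra_by_combined <- 12  # cancer cases positive by the combined marker only
extra_by_reference <- 3  # cancer cases positive by the reference marker only

# One-sided exact binomial test: does the combined marker identify more
# additional cases than the reference among the discordant cases?
res <- binom.test(extra_by_combined,
                  extra_by_combined + extra_by_reference,
                  p = 0.5, alternative = "greater")
res$p.value
```

With these made-up counts the one-sided p-value is small, suggesting superior sensitivity; the exact setup in a real study follows the protocol's hypothesis definitions.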

However, if you want to compare the AUC between the reference marker and the combined marker, logistic regression meets our needs perfectly: use the reference marker and the combined marker as the independent variables and the disease status (cancer/control) as the dependent variable. We would be pleased to see the combined AUC come out larger than the reference one.

Before we perform logistic regression, some preprocessing details are worth considering in advance, as they may benefit the model.

- Remove potential outliers
- Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
- Remove highly correlated predictors to minimize overfitting; the presence of highly correlated predictors can lead to an unstable model solution. This third consideration is often neglected in practice.
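As a minimal sketch of the third check (the data, names, and the 0.9 threshold are my own choices, not from this post), highly correlated predictor pairs can be flagged before modeling:

```r
# Flag predictor pairs with |r| > 0.9 so one of each pair can be dropped.
set.seed(42)
X <- data.frame(a = rnorm(100), b = rnorm(100))
X$c <- X$b + rnorm(100, sd = 0.05)  # 'c' is nearly a duplicate of 'b'

cm <- abs(cor(X))
diag(cm) <- 0                    # ignore self-correlation
which(cm > 0.9, arr.ind = TRUE)  # rows/cols of highly correlated pairs
```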

So how do we fit a logistic regression model and calculate the AUC? I’d like to take some notes on the analysis process in R and SAS. By the way, I think R is much better than SAS for statistical analysis and visualization. **This is the spirit and power of open source, which makes our work better and better.**

I take the `PimaIndiansDiabetes2` data set from the `mlbench` package as an example; it is the Pima Indians Diabetes Database. Load the data, select the two variables of interest plus the response, and remove NAs.

```r
library(tidyverse)
data("PimaIndiansDiabetes2", package = "mlbench")
data <- select(PimaIndiansDiabetes2, c("glucose", "pressure", "diabetes")) %>%
  na.omit()
```

First, I think it’s necessary to examine the distribution of and correlation between these variables (glucose and pressure), as follows:

```r
## distribution
ggpubr::ggarrange(
  ggpubr::gghistogram(data = data, x = "glucose"),
  ggpubr::gghistogram(data = data, x = "pressure"),
  nrow = 1, ncol = 2
)
```

```r
## correlation
library(ggcorrplot)
# ggcorrplot expects a correlation matrix; cor_pmat() computes p-values,
# which can be supplied separately via the p.mat argument
ggcorrplot(corr = cor(PimaIndiansDiabetes2[, 1:8], use = "pairwise.complete.obs"),
           method = "circle")
```

Here we can see that these two variables are approximately normally distributed and not correlated with each other.

Then I fit a simple model based on the glucose predictor variable.

```r
model1 <- glm(diabetes ~ glucose, data = data, family = binomial)
summary(model1)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -5.61173171 0.442288629 -12.68794 6.897596e-37
## glucose      0.03951014 0.003397783  11.62821 2.962420e-31
```

The output above shows the beta coefficients and their corresponding significance levels. The intercept is `-5.61` and the coefficient of the glucose variable is `0.039`. `Std. Error` represents the accuracy of the coefficient: the larger it is, the less confident we are in the estimate. The `z value` is the estimate divided by its standard error, from which the `p-value` is derived.

To interpret these values, we should understand the meaning of the logistic beta coefficients. The estimate is the regression coefficient, so exp(coef) is the odds ratio, which means **the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0)**.
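Using the glucose coefficient reported above, the odds ratio can be computed directly (in practice `exp(coef(model1))` does this for the whole fitted model):

```r
# Odds ratio for glucose from the coefficient shown above: each 1-unit
# increase in glucose multiplies the odds of being diabetes-positive
# by about 1.04.
or_glucose <- exp(0.03951014)
round(or_glucose, 4)
```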

We all know that the s-shaped curve is defined as `p = exp(y) / [1 + exp(y)]` (James et al. 2014). This can also be written simply as `p = 1 / [1 + exp(-y)]`, where:

- `y = b0 + b1*x`
- `exp()` is the exponential function
- `p` is the probability of the event occurring (1) given x; mathematically this is written as `p(event=1|x)` and abbreviated as `p(x)`, so `p(x) = 1 / [1 + exp(-(b0 + b1*x))]`

Based on this formula, given a new plasma glucose concentration it is easy to predict the probability that the patient is diabetes-positive. In R, we can use the `predict()` function to calculate the probability instead of applying the logistic equation by hand.

`mod_prob1 <- predict(model1, newdata = data, type = "response")`
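To connect the formula with `predict()`, here is a quick manual check using the `model1` coefficients reported above (the glucose value of 150 is an arbitrary example of my own):

```r
# p(x) = 1 / (1 + exp(-(b0 + b1*x))), with the model1 coefficients above
b0 <- -5.61173171
b1 <- 0.03951014
glucose_new <- 150

p_manual <- 1 / (1 + exp(-(b0 + b1 * glucose_new)))
p_manual
```

This should agree with `predict(model1, data.frame(glucose = 150), type = "response")`.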

We can also apply `geom_smooth()` to fit an s-shaped probability curve using the `data` above.

```r
data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  theme_light() +
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabete-pos"
  )
```

Back to my case: my purpose is to compare the reference marker and the combined marker. Suppose glucose is the reference marker and glucose plus pressure is the combined one. I therefore fit a multiple logistic regression with the glucose and pressure variables.

```r
model2 <- glm(diabetes ~ glucose + pressure, data = data, family = binomial)
summary(model2)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -6.49941142 0.659445793 -9.855869 6.465488e-23
## glucose      0.03836257 0.003428241 11.190160 4.556103e-29
## pressure     0.01406869 0.007478525  1.881212 5.994305e-02
```

In the same way, calculate the probabilities from the multiple regression, and then compare the two models by AUC.

```r
mod_prob2 <- predict(model2, newdata = data, type = "response")
plotres <- data.frame(event = ifelse(data$diabetes == "pos", 1, 0),
                      glucose = mod_prob1,
                      pressure = mod_prob2,
                      stringsAsFactors = F) %>%
  pivot_longer(cols = 2:3)
```

To plot multiple ROC curves on the same figure, the `plotROC` package can help us; it is a pleasure to use.

```r
library(plotROC)
p <- ggplot(as.data.frame(plotres), aes(d = event, m = value, color = name)) +
  geom_roc(n.cuts = 0) +
  style_roc()
p + annotate("text", x = .75, y = .25,
             label = paste(c("glucose", "pressure"), "AUC =",
                           round(calc_auc(p)$AUC, 4), collapse = "\n"))
```

From this ROC output, the “combined marker” does not appear to be better than the “reference marker”. Obviously that is my fault for choosing unsuitable dummy data, but I think this post is still useful for understanding what logistic regression for biomarker combinations looks like.

**Thanks to this article (Logistic Regression Essentials in R) for making logistic regression clearer to me.**

Logistic Regression Essentials in R

Heart Disease Prediction using Logistic Regression

Heart Disease Prediction

Understanding Logistic Regression using R

Chapter 10 Logistic Regression

Generate ROC Curve Charts for Print and Interactive Use


PowerPoint is a creative tool that can help you make any hex sticker you want. You can search for templates online and make your own additions. However, using PowerPoint to manipulate the image and create semi-circular text will take longer than you’d hope in order to get the hexagon shape right.

The biggest advantage of PowerPoint templates is that, provided you’re proficient in PowerPoint, you’re sure to be able to make a hex sticker.

The `hexSticker` package can turn figures generated by base plot, lattice, and ggplot2 into pretty stickers. It seems that any R plot can be added to a hex sticker, and certainly external images too.

```r
library(hexSticker)
imgurl <- "./interactive.png"
hexSticker::sticker(imgurl, package = "easyIVD", p_size = 20, p_y = 1.5,
                    s_x = 1, s_y = .75, s_width = .5,
                    filename = "imgfile.png")
```

The advantage of the `hexSticker` package is that, since the sticker is generated by R code, it’s easy to control every parameter and the result is highly reproducible. It’s also more convenient to share with others than a PowerPoint template.

More detailed configuration can be found in the `?hexSticker::sticker` documentation.

The hexmake app is another brilliant tool; it won a prize in the 2020 RStudio Shiny Contest. Since trying it, I believe it’s the most convenient tool for making hex stickers: it offers more detailed configuration and is thoughtful and friendly.

You can specify hex name, image configurations, hexagon border, spotlight details and add url in the sticker, which I believe is enough for your personal design.

This tool is built with R Shiny, so you only need to visit the web app (https://connect.thinkr.fr/hexmake/) to begin designing your personalized sticker.

The home page is shown below:

After a series of configurations, my hex sticker was complete, and it looks pretty good. If you’re also interested, let your brainstorming expand and make yours even more creative.

Making a Hexagon Sticker

hexSticker: create hexagon sticker in R

Build your Own Hex Sticker


In this situation, as an option we can consider using maximally selected rank statistics to find the cutoff.

Which statistic is used depends on your data type.

What are maximally selected rank statistics? Briefly speaking, the method assumes that an unknown cutpoint in X (the independent variable) determines two groups of observations with respect to the response Y, and searches for the cutpoint that maximizes the statistic between the two groups. The statistic is an appropriately standardized two-sample linear rank statistic of the responses, representing the difference between the two groups.

The hypothesis test finds the maximum of the standardized statistics over all possible cutpoints, which provides the best separation of the response into two groups.

So maximally selected rank statistics can be used for estimation as well as evaluation of a simple cutpoint model.

The `surv_cutpoint()` function in the `survminer` package wraps `maxstat` to determine the optimal cutpoint for each variable.

The simple example below shows how to use the `maxstat` package to find a statistically significant cutoff.

Load the survival data from the maxstat package.

```r
library(survival)
library(maxstat)
data(DLBCL)
mod <- maxstat.test(Surv(time, cens) ~ MGE, data = DLBCL,
                    smethod = "LogRank", pmethod = "condMC", B = 9999)
> mod

	Maximally selected LogRank statistics using condMC

data:  Surv(time, cens) by MGE
M = 3.1772, p-value = 0.009701
sample estimates:
estimated cutpoint 
         0.1860526 
```

The argument `smethod` selects which kind of statistic is computed, and `pmethod` specifies the kind of p-value approximation. The argument `B` specifies the number of Monte Carlo replications to perform (defaulting to 10000).

For the overall survival time, the estimated cutpoint is 0.186 mean gene expression, and the maximum of the log-rank statistics is M = 3.1772. The probability that, under the null hypothesis, the maximally selected log-rank statistic is greater than M = 3.1772 is less than 0.0097.

If you have more than one independent variable to evaluate, you can evaluate these predictors simultaneously and find out which one separates the groups best.

```r
mod2 <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data = DLBCL,
                     smethod = "LogRank", pmethod = "exactGauss",
                     abseps = 0.01)
> mod2

	Optimally Selected Prognostic Factors

Call: maxstat.test.data.frame(formula = Surv(time, cens) ~ MGE + IPI,
    data = DLBCL, smethod = "LogRank", pmethod = "exactGauss",
    abseps = 0.01)

Selected:

	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint 
                 1 

Adjusted p.value: 0.03430325 , error: 0.001754899

> mod2$maxstats
[[1]]

	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by MGE
M = 3.0602, p-value = 0.02721
sample estimates:
estimated cutpoint 
         0.1860526 

[[2]]

	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint 
                 1 
```

The p-value of the global test for the null hypothesis “survival is independent of both IPI and MGE” is 0.034, and IPI provides a better separation into two groups than MGE does.

The result can be visualized with the `plot()` function:

`plot(mod2)`

Maximally Selected Rank Statistics in R

https://www.jianshu.com/p/0851baac137c

https://www.iikx.com/news/statistics/1747.html


For instance, how do you create a pie chart or a donut chart? With R and the `ggplot2` package, the function `coord_polar()` is recommended, as a pie chart is just a stacked bar chart in polar coordinates.

Create a simple dataset:

```r
df <- data.frame(
  group = c("Female", "Male", "Child"),
  value = c(25, 30, 45),
  Perc = c("25%", "30%", "45%")
)
```

And then create a pie chart:

```r
ggplot(df, aes(x = "", y = value, fill = group)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
```

The above is the default style. It looks a bit different from the pie charts produced by business tools, so we want to remove the axis ticks and tick labels. To go further, we probably need to add text annotations, so we should build a customized pie chart with the `theme()` and `geom_text()` functions. For pretty color palettes, the `ggsci` package is highly recommended.

One caveat: if you add labels with the `geom_text()` function, sort your fill (group/factor) variable first, otherwise the text labels will be placed in the wrong positions.

Create a custom theme, and calculate the position of each label in the pie chart.

```r
mytheme <- theme_minimal() +
  theme(
    axis.title = element_blank(),
    axis.text.x = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    legend.key.size = unit(15, "pt"),
    legend.text = element_text(size = 12),
    legend.position = "top"
  )

df2 <- df %>%
  mutate(
    cs = rev(cumsum(rev(value))),
    pos = value/2 + lead(cs, 1, default = 0)
  )
```

The pie chart below looks more polished than the previous one.

```r
ggplot(df2, aes(x = "", y = value, fill = fct_inorder(group))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  ggsci::scale_fill_npg() +
  mytheme +
  geom_text(aes(y = pos, label = Perc), size = 5) +
  guides(fill = guide_legend(title = NULL))
```

Categorical data are often better understood in a donut chart than in a pie chart, although I always think they are essentially the same. Unlike the pie chart, to draw a donut chart we must specify `x = 2` in `aes()` and add `xlim()` to limit the x range.

```r
ggplot(df2, aes(x = 2, y = value, fill = fct_inorder(group))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 200) +
  xlim(.2, 2.5) +
  ggsci::scale_fill_npg() +
  theme_void() +
  geom_text(aes(y = pos, label = Perc), size = 5, col = "white") +
  guides(fill = guide_legend(title = NULL))
```

However, the parameters for pie and donut charts are not easy to remember. Is there a simpler way? I think the `ggpie()` and `ggdonutchart()` functions in the ggpubr package are preferable.

For more details, refer to https://rpkgs.datanovia.com/ggpubr/reference/index.html, which also includes other useful plot functions. Here is the donut chart as an example.

In addition, I think the `geom_label_repel()` function is a better way to add text annotations.

```r
library(ggrepel)  # geom_label_repel
library(ggsci)    # pal_npg

df2 <- df2 %>% mutate(group = fct_inorder(group), tmp = "")
ggpubr::ggdonutchart(data = df2, x = "value", label = "tmp", lab.pos = "in",
                     fill = "group", color = "black", palette = "npg") +
  geom_label_repel(aes(y = pos, label = paste0(group, " (", Perc, ")"),
                       fill = group),
                   segment.color = pal_npg("nrc")(3), segment.size = 0.8,
                   data = df2, size = 4, show.legend = F, nudge_y = 1,
                   color = "black") +
  guides(fill = FALSE)
```

A **prettier design** can be found in this blog post. (Donut chart with ggplot2)

This tip is about adding labels to a dodged barplot, where I have to specify `position = position_dodge()` and a `width =` value simultaneously. If you are careless, you will find the text annotations in the wrong positions.

In this situation, we must specify a **consistent width** value in all position-related functions, such as `geom_bar()`, `position_dodge()`, and `geom_text()`.

```r
df <- data.frame(supp = rep(c("VC", "OJ"), each = 3),
                 dose = rep(c("D0.5", "D1", "D2"), 2),
                 len = c(6.8, 15, 33, 4.2, 10, 29.5))
ggplot(data = df, aes(x = dose, y = len, fill = supp)) +
  geom_bar(stat = "identity", color = "black",
           position = position_dodge(0.65), width = 0.65) +
  theme_minimal() +
  geom_text(aes(label = len), vjust = -0.5, color = "black",
            position = position_dodge(0.65), size = 3.5) +
  scale_fill_brewer(palette = "Blues")
```

Then what about a stacked barplot? We must calculate a position variable and pass it to `geom_text()` as the y value.

```r
df2 <- arrange(df, dose, supp) %>%
  plyr::ddply("dose", transform, label_ypos = cumsum(len))
ggplot(data = df2, aes(x = dose, y = len, fill = supp)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = label_ypos, label = len), vjust = 1.6,
            color = "black", size = 3.5) +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal()
```

Obviously, if you use the `ggbarplot()` function from the ggpubr package, there are fewer parameters to calculate and remember. (https://rpkgs.datanovia.com/ggpubr/reference/ggbarplot.html)

ggplot2 pie chart : Quick start guide - R software and data visualization

ggplot2 barplots : Quick start guide - R software and data visualization

https://ggplot2.tidyverse.org/reference/

http://www.sthda.com/english/wiki/ggplot2-essentials

https://rpkgs.datanovia.com/ggpubr/reference/index.html

Plotting Pie and Donut Chart with ggpubr pckage in R


CLSI EP05-A3 and EP15-A3 serve as the references.

Definition of intermediate precision:

Intermediate precision (also called within-laboratory or within-device) is a measure of precision under a defined set of conditions: same measurement procedure, same measuring system, same location, and replicate measurements on the same or similar objects over an extended period of time. It may include changes to other conditions such as new calibrations, operators, or reagent lots. ——Intermediate precision

Take throwing darts as an example:

- Accuracy: the score you get on the dartboard; the higher the score, the better.
- Precision: the spread of your throws; if the darts land very close together, your technique is very stable.

If you want to estimate the precision of a certain test, these three indicators are useful for judging whether it’s good enough for use:

- %CV: coefficient of variation expressed as a percentage
- %CV_R: repeatability coefficient of variation
- %CV_WL: within-laboratory coefficient of variation

We all know that it’s impossible to make every test result identical, as there are many factors that can influence the results, such as:

- Day
- Run
- Reagent lot
- Calibrator lot
- Calibration cycle
- Operator
- Instrument
- Laboratory

The first two of the above are usually the main factors to be considered.

So there is always some variation in the measured results compared to the true values. It consists of systematic error (bias) and random error; precision measures the random error.

Consider a single-site 20x2x2 study: 20 days, two runs per day, and two replicates per run. The factors day and run are included in the statistical analysis, which is used to estimate two types of precision: repeatability (within-run precision) and within-laboratory precision (within-device precision).

Once the sources of variation have been identified, an ANOVA model can be used to calculate the SDs and %CVs in the statistical processing of the data. The usual factors can be divided into three components:

Within-run precision (repeatability) measures the variation among replicate results for a given sample, within a single run, under essentially constant conditions. This variation is mainly caused by random error inside the instrument, such as variation in the pipetted volumes of sample and reagent.

Between-run precision measures the variation between different runs (e.g. run 1 and run 2). The run factor captures changes in operating conditions, such as temperature or instrument status.

Between-day precision measures the variation between days, which is easy to understand; it may be caused by humidity, for example.

This protocol (20x2x2) estimates the repeatability (within-run) and within-laboratory (intermediate) precision following CLSI EP05.

From the description above, we can see that the protocol is a classic nested (hierarchical) design, where replicates are nested within runs and runs are nested within days. In this situation a nested ANOVA is appropriate; with two factors involved, this corresponds to a two-way nested ANOVA.

To estimate the precision of this single-site 20x2x2 design, we should follow a nested linear components-of-variance model involving two factors, “day” and “run”, with “run” nested within “day”. I think this model can be analyzed using a two-way nested ANOVA. Note that the design is balanced, because it specifies the same number of runs for each day and the same number of replicates for each run.

The above screenshot from CLSI EP05-A3 can help us to understand the nested linear components-of-variance model. We can especially know that the residual in the model represents the within-run factor.

Nested random effects occur when each member of one group is contained entirely within a single unit of another group; the canonical example is students within classrooms. Crossed random effects occur when this nesting does not hold. An example would be different seeds and different fields used for planting crops: seeds of the same type can be planted in different fields, and each field can have multiple seed types in it.

Whether random effects are nested or crossed is a property of the data, not the model. In other words, you must tell the model which factors are nested or crossed.
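As a small sketch with simulated data (the values and factor sizes are made up, mirroring the 20x2x2 layout used below), the nested formula `day/run` in `aov()` expands to `day + day:run`, i.e. a separate run effect inside each day:

```r
# Runs nested within days: y ~ day/run is equivalent to y ~ day + day:run.
set.seed(7)
sim <- expand.grid(rep = 1:2, run = factor(1:2), day = factor(1:20))
sim$y <- 75 + rnorm(nrow(sim), sd = 2)  # simulated measurements

fit <- aov(y ~ day/run, data = sim)
summary(fit)  # rows: day, day:run, Residuals (the within-run component)
```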

I won’t describe the experiment and workflow in this section; they are described clearly in the CLSI EP05 and EP15 documents.

Let’s talk about how to calculate the %CV and SD, which can be divided into at least two categories based on how many factors are involved.

The first step: I load data from a simple (20x2x2) design from the R package `VCA`, including 2 replicates, 2 runs, and 20 days for a single sample, where y is the test measurement. The design is:

- One reagent lot, a single sample
- One instrument system
- 20 test days
- Two runs per day
- Two replicate measurements per run

```r
library(VCA)
data(dataEP05A2_2)
> summary(dataEP05A2_2)
      day     run          y        
 1      : 4   1:40   Min.   :68.87  
 2      : 4   2:40   1st Qu.:73.22  
 3      : 4          Median :75.39  
 4      : 4          Mean   :75.41  
 5      : 4          3rd Qu.:77.37  
 6      : 4          Max.   :83.02  
 (Other):56                        
```

The second step: I use a nested ANOVA via the `aov` function in R to fit the nested linear components-of-variance model. Here, runs are nested within days.

```r
res <- aov(y ~ day/run, data = dataEP05A2_2)
ss <- summary(res)
> ss
            Df Sum Sq Mean Sq F value  Pr(>F)    
day         19  319.0  16.787   4.512   3e-05 ***
day:run     20  187.4   9.372   2.519 0.00634 ** 
Residuals   40  148.8   3.720                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

The third step: calculate the SD and %CV for the day, run, and error variation following the formulas in EP05-A3. By the way, the error CV (`CVerror`) corresponds to %CV_R, also called within-run or repeatability precision, and %CV_WL is the within-laboratory precision.

```r
nrep <- 2
nrun <- 2
nday <- 20
Verror <- ss[[1]]$`Mean Sq`[3]
Vrun <- (ss[[1]]$`Mean Sq`[2] - ss[[1]]$`Mean Sq`[3]) / nrep
Vday <- (ss[[1]]$`Mean Sq`[1] - ss[[1]]$`Mean Sq`[2]) / (nrun * nrep)
Serror <- sqrt(Verror)
Sday <- sqrt(Vday)
Srun <- sqrt(Vrun)
Swl <- sqrt(Vday + Vrun + Verror)
> print(c(Swl, Sday, Srun, Serror))
[1] 2.898293 1.361533 1.681086 1.928803

CVerror <- Serror / mean(dataEP05A2_2$y) * 100
> CVerror
[1] 2.557875
CVwl <- Swl / mean(dataEP05A2_2$y) * 100
> CVwl
[1] 3.843561
```

The fourth step: calculate the confidence intervals of the SD and %CV, which rely on the chi-square distribution with the DF of the estimate. Take the error %CV as an example.

```r
alpha <- 0.05
CVCI <- c(CVerror * sqrt(ss[[1]]$Df[3] / qchisq(1 - alpha/2, df = 40)),
          CVerror * sqrt(ss[[1]]$Df[3] / qchisq(alpha/2, df = 40)))
> CVCI
[1] 2.100049 3.272809
CVCI_oneSide <- c(CVerror * sqrt(ss[[1]]$Df[3] / qchisq(1 - alpha, df = 40)),
                  CVerror * sqrt(ss[[1]]$Df[3] / qchisq(alpha, df = 40)))
> CVCI_oneSide
[1] 2.166476 3.142029
```

**Fortunately, the standard calculation steps above have been packaged into an R package: VCA. We can simply apply the `anovaVCA` function to fit the model and summarize it, and use the `VCAinference` function for the CI calculation. It sounds very good.**

Fit model:

```r
res <- anovaVCA(y ~ day/run, dataEP05A2_2)
> res

Result Variance Component Analysis:
-----------------------------------

  Name    DF       SS         MS        VC       %Total    SD       CV[%]   
1 total   54.78206                      8.400103 100       2.898293 3.843561
2 day     19       318.961943 16.787471 1.853772 22.068447 1.361533 1.805592
3 day:run 20       187.447626 9.372381  2.82605  33.643043 1.681086 2.229366
4 error   40       148.811221 3.720281  3.720281 44.288509 1.928803 2.557875

Mean: 75.40645 (N = 80) 

Experimental Design: balanced  |  Method: ANOVA
```

Calculate CI for SD and %CV:

```r
> VCAinference(res)

Inference from (V)ariance (C)omponent (A)nalysis
------------------------------------------------

> VCA Result:
-------------

  Name    DF      SS       MS      VC     %Total  SD     CV[%] 
1 total   54.7821                  8.4001 100     2.8983 3.8436
2 day     19      318.9619 16.7875 1.8538 22.0684 1.3615 1.8056
3 day:run 20      187.4476 9.3724  2.8261 33.643  1.6811 2.2294
4 error   40      148.8112 3.7203  3.7203 44.2885 1.9288 2.5579

Mean: 75.4064 (N = 80) 

Experimental Design: balanced  |  Method: ANOVA

> VC:
-----
        Estimate CI LCL  CI UCL  One-Sided LCL One-Sided UCL
total   8.4001   5.9669  12.7046 6.2987        11.8680      
day     1.8538                                              
day:run 2.8261                                              
error   3.7203   2.5077  6.0906  2.6689        5.6135       

> SD:
-----
        Estimate CI LCL CI UCL One-Sided LCL One-Sided UCL
total   2.8983   2.4427 3.5644 2.5097        3.4450       
day     1.3615                                            
day:run 1.6811                                            
error   1.9288   1.5836 2.4679 1.6337        2.3693       

> CV[%]:
--------
        Estimate CI LCL CI UCL One-Sided LCL One-Sided UCL
total   3.8436   3.2394 4.7269 3.3282        4.5686       
day     1.8056                                            
day:run 2.2294                                            
error   2.5579   2.1000 3.2728 2.1665        3.1420       

95% Confidence Level
SAS PROC MIXED method used for computing CIs
```

These functions can also handle more complicated designs, so we no longer need to write our own functions or build a package.

Visualizing Nested and Cross Random Effects

R-Package VCA for Variance Component Analysis

How to Perform a Nested ANOVA in R (Step-by-Step)

Lab 8 - Nested and Repeated Measures ANOVA

What’s with the precision?

**Please indicate the source**: http://www.bioinfo-scrounger.com

I thought I understood ANOVA completely. But when I tried to apply the MANOVA model, I found I was totally wrong. I didn't even have a clear understanding of which variables, continuous or categorical, should be used in ANOVA. So I decided to keep notes to figure out the differences between ANOVA, MANOVA, and ANCOVA.

ANOVA is a statistical technique that assesses differences in a dependent variable across the levels of categorical variables. Commonly, ANOVAs are used in three ways: one-way ANOVA, two-way ANOVA, and N-way ANOVA.

Its assumptions are:

- **Independence of observations**: there are no hidden relationships among observations.
- **Normally distributed dependent variable**: the dependent variable follows a normal distribution. If this is not met, you can try a data transformation.
- **Homogeneity of variance**: the variances in each group are similar. If this is not met, you may be able to use a non-parametric alternative, like the Kruskal-Wallis test.

Types of data in ANOVA, T test and Chi-Squared Test

X independent variables | X groups | Y | Analysis |
---|---|---|---|
categorical | Two or more groups | quantitative | ANOVA |
categorical | Just two groups | quantitative | T test |
categorical | Two or more groups | categorical | Chi-Squared Test |

One-way ANOVA has just one independent variable affecting a dependent variable, and the independent variable can have two or more categories to compare.

The null hypothesis for the test is that the group means are equal, meaning there is no difference among group means. Therefore, a significant result means that the means are unequal. If you only want to compare two groups, use the T test instead.

ANOVA uses the F-test for statistical significance. If the variance within groups is smaller than the variance between groups, the F-test yields a larger F-value, which means stronger significance.

ANOVA only tells you whether there are differences among the groups (levels), but not which differences are significant. To find out how the levels differ from one another, perform a TukeyHSD post hoc test.
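For instance, a minimal sketch using the built-in `PlantGrowth` data (not a dataset from this post):

```r
# One-way ANOVA on a built-in dataset with three groups
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# Tukey's HSD: pairwise group differences with adjusted p-values
TukeyHSD(fit)
```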

Two-way ANOVA has two independent variables, i.e., two categorical variables, which is the biggest difference from one-way ANOVA. These variables are also called factors, and each factor can be split into multiple levels. So if one factor can be split into 3 levels and the other factor can also be split into 3 levels, there will be 3 x 3 = 9 groups.

Use a two-way ANOVA when you want to know how two independent variables, in combination, affect a dependent variable. A two-way ANOVA with interaction tests three null hypotheses at the same time:

- There is no difference in group means at any level of the first independent variable.
- There is no difference in group means at any level of the second independent variable.
- The effect of one independent variable does not depend on the effect of the other independent variable (a.k.a. no interaction effect).

If you want a two-way ANOVA without the interaction effect, you only need the first two hypotheses.

```r
data <- mtcars[, c("am", "mpg", "hp", "vs")] %>%
  mutate(am = factor(am), vs = factor(vs))
summary(data)

# One-way ANOVA
one.way <- aov(mpg ~ am, data = data)
summary(one.way)

# Two-way ANOVA
two.way <- aov(mpg ~ am + vs, data = data)
summary(two.way)

# Two-way ANOVA with interaction
two.way <- aov(mpg ~ am * vs, data = data)
summary(two.way)
```

We know that a one-way or two-way ANOVA has only one dependent variable, but MANOVA is not limited to one. MANOVA stands for multivariate analysis of variance, and it is used when there are two or more dependent variables. Its purpose is to find out whether the dependent variables differ across the independent variables simultaneously.

MANOVA assumes that independent variables are categorical and dependent variables are continuous, the same as ANOVA.

Instead of a univariate F value, we obtain a multivariate F value, and several test statistics are available: Wilks' λ, Hotelling's trace, and Pillai's criterion.

Sometimes a series of one-way ANOVAs cannot find a significant difference for any single dependent variable between groups (levels of the independent variable), and we would conclude that there is no relation between the dependent and independent variables. However, when we apply MANOVA to these dependent variables simultaneously, it may conclude that the dependent variables are jointly affected by the independent variables.

If you're still confused about this, try reading the post Comparison of MANOVA to ANOVA Using an Example, which interprets it with a better example.

When you would otherwise need to perform a series of one-way ANOVAs because you have multiple dependent variables to analyze, using MANOVA can protect against inflation of the Type I error rate.

Example:

dependent variables: Sepal.Length and Petal.Length

independent variable: Species

```r
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length

# MANOVA test
res.manova <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# define statistics, Wilks
summary(res.manova, test = "Wilks")
```

ANCOVA is like an extension of ANOVA, and it can be used to adjust for other factors that might affect the outcome, such as age, gender, or drug use. It can also be used to combine a categorical predictor with a continuous one (one factor is categorical, the other is quantitative), or with scale variables as predictors. In that case, the covariate is a variable of interest, not just one you want to control for.

You can therefore enter any covariates you want into an ANCOVA. However, the more you enter, the fewer degrees of freedom you will have, which reduces the statistical power. And the lower the power, the less you can rely on the results of the test.

Before performing ANCOVA, besides normality and homogeneity of variance, we need to verify that the covariate and the independent variable are independent of each other, since adding a covariate to a model only makes sense if the covariate and the independent variable act independently on the dependent variable.

NOTE: if you use type I sums of squares for the model, you must mind the order of the terms; the covariate goes first (and there is no interaction term).

Example:

dependent variables: Petal.Length

independent variable: Species

covariate: Sepal.Length

```r
# fit ANCOVA model
fit <- aov(Petal.Length ~ Sepal.Length + Species, iris)

# view summary of model
car::Anova(fit, type = 2)
```

What is the difference between ANOVA & MANOVA?

ANOVA Test: Definition, Types, Examples

ANOVA (Analysis of Variance)

How to Conduct an ANCOVA in R

ANCOVA example

ANCOVA in R

Doing and reporting your first ANOVA and ANCOVA in R

ANCOVA -- Notes and R Code

ANCOVA: Analysis of Covariance

An introduction to the two-way ANOVA

ANOVA in R: A step-by-step guide

An introduction to the one-way ANOVA

Understanding confounding variables

**Please indicate the source**: http://www.bioinfo-scrounger.com

From now on, I will try my best to keep notes in English, to practice writing for work.

Recently I discussed the non-standard evaluation mode of the dplyr package with a colleague. Before that conversation, I had always described this mode as "dynamic variables" when searching Google for related problems. It was then that I learned this dynamic mode is called "non-standard evaluation" in dplyr.

To keep a tidy environment, most dplyr verbs use tidy evaluation, which is a special type of non-standard evaluation used throughout the tidyverse. It defines a concept called data masking: you can use data variables as if they were variables in the environment. There is also tidy selection, where you can choose variables based on their position (eg. 1, 2, 3), name, or type (eg. is.numeric).

- If you want to learn more about the difference between non-standard evaluation and standard evaluation, the post (Dynamic column/variable names with dplyr using Standard Evaluation functions) will be helpful.
- If you want to know about data masking and tidy selection, the vignette Programming with dplyr is suitable for learning.
- The dplyr team recommends reading the Metaprogramming chapters in Advanced R (a book) if we'd like to learn more about the underlying theory, or precisely how tidy evaluation differs from other forms of non-standard evaluation.
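As a quick illustration of tidy selection by position, name, and type (a minimal sketch, not from the original post):

```r
library(dplyr)

# choose variables by position, by name, or by type
by_pos  <- select(iris, 1, 2)               # position
by_name <- select(iris, Species)            # name
by_type <- select(iris, where(is.numeric))  # type
```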

For this post, I mainly record some common solutions for using dynamic variables (also called intermediate variables, NSE) in dplyr. Although the features above make some tasks easier, sometimes we may be confused about how to use NSE in `mutate()`, `summarise()`, `group_by()`, or `filter()`, especially in self-defined function arguments or ggplot2 arguments.

In other words, I need to learn how to use non-standard evaluation (NSE) in dplyr calls.

Use the `.data` pronoun to pass string variables.

```r
library(tidyverse)

GraphVar <- "dist"
cars %>%
  group_by(.data[["speed"]]) %>%
  summarise(Sum = sum(.data[[GraphVar]], na.rm = TRUE),
            Count = n()) %>%
  head()
```

Use `:=` to name columns in the output data frame from string variables.

```r
var <- "value"
iris %>%
  mutate(!!var := ifelse(Sepal.Length > 5, 1, 0)) %>%
  head()
```

The easiest way to remember and use is the constructor `sym()`, for when we want to unquote something that looks like code instead of a string; this is often used in ggplot2 and R Shiny.

```r
grp.var <- "Species"
uniq.var <- "Sepal.Width"

iris %>%
  group_by(!!sym(grp.var)) %>%
  summarise(n_uniq = n_distinct(!!sym(uniq.var)))
```

For the tricks on the function side, there are two situations depending on the type of variables: env-variables or data-variables.

- Env-variables are "programming" variables that live in an environment. They are usually created with `<-`.
- Data-variables are "statistical" variables that live in a data frame. I understand them as column names.

If the function arguments are not strings, the variable names can be automatically quoted by surrounding them with doubled braces (`{{ }}`).

```r
mean_by <- function(data, var, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(avg = mean({{ var }}))
}

mean_by(starwars, group = species, var = height) %>% head()
```

If we'd like to pass the arguments as character strings, we need to construct symbols from the strings.

```r
mean_by <- function(data, var, group) {
  group <- sym(group)
  var <- sym(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, group = "species", var = "height") %>% head()
```

If you want to pass in user-supplied expressions, such as `height * 100`, doubled braces work normally, but `sym()` does not. In this situation, we need to replace `sym()` with `enquo()`.

```r
mean_by <- function(data, var, group) {
  group <- enquo(group)
  var <- enquo(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, var = height * 100, group = as.factor(species)) %>% head()
```

dplyr allows multiple grouping variables, which can be passed through the `...` object.

```r
mean_by <- function(data, var, ...) {
  var <- enquo(var)
  data %>%
    group_by(...) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, height, species, eye_color)
```

The above is a supplement to a previous blog post (https://www.bioinfo-scrounger.com/archives/R-dplyr-tricks/).

Dynamic column/variable names with dplyr using Standard Evaluation functions

Programming with dplyr vignettes

Programming with dplyr

https://stackoverflow.com/questions/27975124/pass-arguments-to-dplyr-functions

**Please indicate the source**: http://www.bioinfo-scrounger.com

Have a good understanding of SDTM domains and their structure. The SDTM Implementation Guide (SDTMIG) is there to help with this.

Read the SDTMIG... it will make the SDTM mapping process much smoother.

- Build EDC from CRF
- Get Raw Datasets(source data) from EDC
- Map Raw Datasets(source data) to SDTM Datasets

6 key steps in a typical mapping process:

- Identify all the datasets you want to map.
- Identify all the SDTM datasets that correlate with those datasets.
- Get the dataset metadata. **(What does that mean?)**
- Get the SDTM dataset metadata that corresponds to Step 3.
- Map the variables in the datasets identified in Step 1 to the SDTM domain variables.
- Create custom domains for any other datasets that don't have corresponding SDTM datasets.

There are 9 likely scenarios in a typical SDTM mapping process. Get to grips with these, and SDTM mapping becomes much more achievable.

The direct carry forward.

Variables that are already SDTM compliant can be carried forward directly to the SDTM datasets. They don't need to be modified.

**(Nothing needs to be done; just capture them directly.)**

The variable rename

You need to rename some variables to be able to map to the corresponding SDTM variable.

**For example, if the original variable is GENDER, it should be renamed SEX to comply with SDTM standards.**

The variable attribute change

Variable attributes must be mapped as well as variable names. Attributes like label, type, length and format must comply with the SDTM attributes.

**(These variable attributes should comply with the SDTM attributes.)**

The reformat

The format that a value is stored in is changed. However the value itself does not change.

**For example, converting a SAS date to an ISO 8601 format character string. (Does it mean changing the format of the value itself?)**

The combine

Sometimes multiple variables must be combined to form a single SDTM variable.

**(It means that some variables can't be carried over directly; sometimes a transformation is needed.)**

The split

A non-SDTM variable might need to be split into 2 or more SDTM variables to comply with SDTM standards.

**(It's the opposite of the combine step.)**

The derivation

Some SDTM variables are obtained by deriving a result from data in the non-SDTM dataset.

**For example, instead of manually entering a patient's age, use the date of birth and the study start date to derive it.**

The variable value map and new code list application

Some variable values need to be recoded or mapped to match with the values of a corresponding SDTM variable. This mapping is recommended for variables with a code list attached that has controlled terminology that can't be extended.

**You should map all values in the controlled terminology, not just the values present in the dataset. This covers values that are not in the dataset currently but may come in during future dataset updates.**

The horizontal-to-vertical data structure transpose

There are situations where the structure of the non-CDISC dataset is completely different to its corresponding SDTM dataset. In such cases you need to transform its structure to one that is SDTM-compliant.

For example, the Vital Signs dataset. When data is collected in wide form, every test and recorded value is stored in separate variables. SDTM requires data to be stored in lean form. Therefore, the dataset must be transposed to have the tests, values and unit under 3 variables. If there are variables that can't be mapped to an SDTM variable, they would go into supplemental qualifiers.

**(Such like long and wide pivot transform)**
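This transpose can be sketched with tidyr (hypothetical wide-form vital-signs data; the variable names are illustrative, not from the article):

```r
library(tidyr)

# Hypothetical wide-form vital signs: one column per test
vs_wide <- data.frame(
  USUBJID = c("001", "002"),
  SYSBP   = c(120, 135),  # systolic blood pressure (mmHg)
  DIABP   = c(80, 88)     # diastolic blood pressure (mmHg)
)

# Lean (vertical) form: test code, result, and unit as 3 variables
vs_long <- pivot_longer(vs_wide, cols = c(SYSBP, DIABP),
                        names_to = "VSTESTCD", values_to = "VSORRES")
vs_long$VSORRESU <- "mmHg"
```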

There are things you can do to make SDTM mapping easier.

- Part of the trouble is that SDTM mapping is typically done at the end of the clinical trial process, once patient data has been collected. Retrospectively trying to make your results data fit the SDTM structure takes a lot of time and effort.
- For this reason, it's best practice to align raw datasets with CDISC standards before collecting any patient data.
- That means implementing SDTM right from the start, when designing CRFs. Doing it this way makes it much easier to convert your datasets. And it saves time later on in the process when you're pulling your submission deliverables together. You can submit your study much more quickly.

All of the above is excerpted; if you have any questions, please read the original article.

**Please indicate the source**: http://www.bioinfo-scrounger.com

The Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964. It is a data transformation commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After a Box-Cox transformation, the correlation between the unobservable error and the predictors can be reduced to some extent.

The main feature of the Box-Cox transformation is the introduction of a parameter, lambda, which is estimated from the data itself and thereby determines which transformation to apply. The Box-Cox transformation can noticeably improve the normality, symmetry, and equality of variances of the data without losing information, and later generalizations and improvements have extended its range of application.

The core parameter of the Box-Cox transformation is lambda (λ), which ranges from -5 to 5. So our main goal is to select the optimal lambda value by some method.
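The formula here appeared as an image in the original post; the standard definition of the transformation is:

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[4pt]
\ln y, & \lambda = 0
\end{cases}
```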

The y values above must be non-negative; for a dataset containing negative values, the formula is as follows:
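The original formula was an image; the standard two-parameter form adds a shift λ₂ so that y + λ₂ > 0:

```latex
y^{(\boldsymbol{\lambda})} =
\begin{cases}
\dfrac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1}, & \lambda_1 \neq 0 \\[4pt]
\ln (y + \lambda_2), & \lambda_1 = 0
\end{cases}
```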

Common methods for computing lambda:

- maximum likelihood estimation
- Bayesian methods

Usually, when we want to make non-normal data normal, the first thing that comes to mind is taking the log (a logarithmic transformation); there are also the reciprocal transformation, the square-root transformation, and so on. The Box-Cox transformation is a general name for this family of transformations: different lambda values correspond to different transformation forms.

From the above, we only need to compute a lambda to perform the Box-Cox transformation.

First, generate a vector from an F distribution:

```r
set.seed(250)
x <- rf(500, 30, 30)
hist(x, breaks = 15)
qqnorm(x)
```

Both the histogram and the Q-Q plot above show that the simulated vector is not normal. Next, use the `boxcox` function from the `EnvStats` package to perform the Box-Cox power transformation, choosing maximum likelihood to compute lambda:

```r
library(EnvStats)
boxcox.list <- boxcox(x, objective.name = "Log-Likelihood")
> boxcox.list

Results of Box-Cox Transformation
---------------------------------

Objective Name:                  Log-Likelihood
Data:                            x
Sample Size:                     500

 lambda Log-Likelihood
   -2.0      -429.0778
   -1.5      -334.4623
   -1.0      -264.8572
   -0.5      -221.4762
    0.0      -204.6382
    0.5      -213.9799
    1.0      -248.6916
    1.5      -307.6451
    2.0      -389.4097
```

As shown above, the log-likelihood is largest when lambda is 0, which corresponds to the log transformation. If you want to estimate lambda directly, add the `optimize = TRUE` argument:

```r
boxcox(x, objective.name = "Log-Likelihood", optimize = TRUE)
```

Or show the profile of lambda graphically:

```r
plot(boxcox.list, xlim = c(-2, 2))
```

Finally, look at the Q-Q plot after the Box-Cox transformation:

```r
plot(boxcox.list, plot.type = "Q-Q Plots")
```


Design techniques for avoiding bias

In clinical trials, two important design techniques for avoiding bias are **blinding** and **randomization**; these should be general features of the controlled clinical trials included in a marketing application.

Randomization and blinding are two common methods of minimizing bias in clinical trials. Since we do not know the true efficacy of a drug in the target population (say, all patients in the world with first-line non-small cell lung cancer), we can only infer that efficacy through medical research.

The protocol should describe specific procedures intended to minimize, as far as possible, any foreseeable irregularities during the conduct of the trial that could compromise the validity of the statistical analysis, including protocol violations, loss to follow-up, and missing values. The protocol should consider how to reduce the frequency of these problems and how to handle them when they appear in the data analysis.

Blinded trials

Blinding is intended to control intentional or unintentional bias in the conduct of a clinical trial and in the interpretation of its results. Such bias can arise when knowledge of the treatment influences the recruitment and allocation of patients, the care of patients, patients' attitudes toward treatment, the assessment of endpoints, the handling of dropouts, the exclusion of data from analysis, and so on. **The fundamental aim is to prevent knowledge of which treatment is being given wherever bias could arise.**

Depending on the degree of blinding, a trial can be fully blinded, partially blinded, or unblinded. The following terms are commonly used:

- Single-blind: the investigator and/or the study staff know which treatment is given, but the patient does not, or vice versa.
- Double-blind: no patients, and none of the sponsor's or investigator's personnel involved in treatment or clinical evaluation, know which treatment is received, including those who select eligible patients, evaluate outcomes, or assess protocol compliance.
- Triple-blind: in addition to blinding subjects and investigators, other trial personnel, including clinical trial monitors, research assistants, and statisticians, also do not know the treatment allocation.
- Open-label: a trial with no blinding. Everyone, including subjects, investigators, monitors, data managers, and statisticians, knows which treatment each patient receives.
- Double-blind, double-dummy: if the test and control products differ in dosage form, dosing schedule, or dose, a double-dummy technique is often needed to maintain the blind.

Considerations for unblinded designs. To minimize bias, the following methods can be considered:

- Before subject screening and enrollment are completed, neither subjects nor investigators know the allocation (i.e., allocation concealment).
- Where ethically permissible, subjects do not learn their allocation until treatment is completed.
- Use blinded data review.
- The applicant should justify the use of a partially blinded or unblinded design and detail the specific measures taken to control bias (e.g., objectively determinable endpoints to avoid assessment bias, standard operating procedures to reduce performance bias). Central imaging review, central laboratories, and adjudication committees can also help.

Blind review

The checking and evaluation of data during the period between trial completion (the last observation on the last subject) and the breaking of the blind, for the purpose of finalizing the planned analysis.

Premature unblinding

In a double-blind trial, the sponsor provides the investigator with a set of sealed codes (in many current projects, randomization is performed in a randomization system), and the protocol should specify the method of unblinding and the personnel authorized to break the blind.

In a blinded trial, the blind is generally broken only for the statistical analysis at the end of the trial. However, to protect subject safety, premature unblinding is needed in emergencies, for example when an SAE occurs whose relationship to the study drug cannot be determined, when an overdose occurs, or when a serious drug interaction with a concomitant medication arises, and the treatment received must be known urgently to decide on rescue therapy.

The above is an excerpt and reorganization of some concepts related to blinding.

**Please indicate the source**: http://www.bioinfo-scrounger.com

A SAS encoding problem

When SAS is configured with the u8 sasv9.cfg file, the -ENCODING option of the SAS session becomes UTF-8. If the input data is in another encoding, such as euc-cn (Simplified Chinese, EUC), and is not transcoded to UTF-8 for the SAS session, you may see the following error:

**ERROR: Some character data was lost during transcoding in the data set MYDATA.DS3. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.**

To troubleshoot SAS encoding problems, you can follow the steps in the SAS documentation: Migrating Data to UTF-8 for SAS.

For the ERROR above, see: Determine Whether the CVP Engine Is Needed to Read Your Data without Truncation, i.e., invoke the CVP engine.

The first LIBNAME statement points to the original data set. Use a second LIBNAME statement to point to the location of the library that will contain the new data set.

```sas
libname mylib cvp "path to original data set";
libname mylib2 "path to new data set";
```

Use PROC DATASETS with the COPY statement and the OVERRIDE= option. When you specify OVERRIDE=(ENCODING=SESSION OUTREP=SESSION) in the COPY statement, the new data set is created in the host data representation and encoding of the SAS session that is executing the COPY statement. Add the CONTENTS statement to view a description of the content of the new data set.

```sas
proc datasets nolist;
    copy in=mylib out=mylib2 override=(encoding=session outrep=session);
    contents data=mylib2.mydata;
run;
```

Therefore, both of the following approaches work (both based on the principle above):

```sas
libname mylib cvp "./Documents/test";
libname mylib2 "./Documents/format";

proc datasets nolist;
    copy in=mylib out=mylib2 override=(encoding=session outrep=session);
    contents data=mylib2.foo;
run;
```

Or:

```sas
libname inlib cvp "./Documents/test";
libname outlib "./Documents/format" outencoding="UTF-8";

proc datasets nolist;
    copy in=inlib out=outlib; /* OUTENCODING= on the libname transcodes the copy */
    contents data=outlib.foo;
run;
```

If the problem is not the ERROR above, you can try the following methods for common encoding problems:

- Using the FILE Statement to Specify an Encoding for Writing to an External File
- Using the FILENAME Statement to Specify an Encoding for Reading an External File
- Using the FILENAME Statement to Specify an Encoding for Writing to an External File
- Changing Encoding for Message Body and Attachment
- Using the INFILE= Statement to Specify an Encoding for Reading from an External File

For specific code, see: ENCODING Examples.

The above is my own process for troubleshooting SAS encoding problems, for reference only.

**Please indicate the source**: http://www.bioinfo-scrounger.com

Hypothesis testing on a correlation coefficient is a common analysis. Usually the null hypothesis H0 is `ρ = 0`; that is, we want to see whether the correlation differs significantly from 0. In this case the statistic follows a t distribution, so we first compute the t-statistic.
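The formula, an image in the original post, is the standard one, with n - 2 degrees of freedom:

```latex
t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} \;\sim\; t_{n-2}
```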

Using R's `cor.test` function:

```r
data("iris")
> cor.test(iris$Sepal.Length, iris$Petal.Length)

        Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor
0.8717538
```

Translating the formula into code (note that the degrees of freedom are n - 2 = 148):

```r
r <- 0.87175
> r / sqrt((1 - r^2) / (150 - 2))
[1] 21.64563
pvalue <- 2 * pt(-abs(21.64563), df = 150 - 2)
```

The two approaches give the same result.

Suppose we do not want to compare the correlation with 0, but with a specific value `ρ0`; then we first need to apply the Fisher transformation to the correlation.
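The original formulas here were images; the standard forms of the Fisher transformation and the resulting test statistic are:

```latex
z_r = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad
Z = \frac{z_r - z_{\rho_0}}{\sqrt{1/(n-3)}} \;\sim\; N(0,1)
```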

What is the Fisher transformation useful for?

Fisher (1973, p. 199) describes the following practical applications of the z transformation:

- testing whether a population correlation is equal to a given value
- testing for equality of two population correlations
- combining correlation estimates from different samples

Here we focus on the first item above, i.e., testing against a given value.

Continuing the iris example, suppose I want to compare the correlation with `ρ0 = 0.8`; then:

```r
> (1/2 * log((1 + 0.87175)/(1 - 0.87175)) - 1/2 * log((1 + 0.8)/(1 - 0.8))) / sqrt(1/(150 - 3))
[1] 2.930596
> 2 * pnorm(abs(2.930596), lower.tail = F)
[1] 0.003383124
```

The result above agrees with the NCSS software, but differs slightly from the result of SAS's `proc corr` (mainly in the final p-value):

```sas
proc corr data=sashelp.iris nosimple fisher(rho0=0.8 biasadj=no);
    var SepalLength PetalLength;
run;
```

Note that the Fisher z statistic in the SAS output refers to `Zρ`, not `Zρ - Zρ0`.

From the formula above, the distribution of the Fisher-transformed z is not exactly standard normal, but it approaches normality rapidly as the sample size increases:

For the transformed z_r, the approximate variance V(z_r) = 1/(n-3) is independent of the correlation ρ. Furthermore, even though the distribution of z_r is not strictly normal, it tends to normality rapidly as the sample size increases for any values of ρ (Fisher 1973, pp. 200-201).

The calculation formula is as follows:
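The formula, an image in the original, is the standard confidence interval on the z scale, together with the back-transformation to r:

```latex
z_r \pm \frac{z_{1-\alpha/2}}{\sqrt{n-3}}, \qquad
r = \frac{e^{2z} - 1}{e^{2z} + 1} = \tanh(z)
```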

The result above is the confidence interval on the transformed z scale; it then needs to be transformed back into the confidence interval of the correlation:

```r
# Correlation coefficient
r <- 0.87175

# Z statistics
Z_upper <- 1/2 * log((1 + r)/(1 - r)) + qnorm(p = 1 - 0.05/2, lower.tail = T) / sqrt(150 - 3)
Z_lower <- 1/2 * log((1 + r)/(1 - r)) - qnorm(p = 1 - 0.05/2, lower.tail = T) / sqrt(150 - 3)

# Correlation confidence interval
Cor_upper <- (exp(2 * Z_upper) - 1) / (exp(2 * Z_upper) + 1)
Cor_lower <- (exp(2 * Z_lower) - 1) / (exp(2 * Z_lower) + 1)

> c(Cor_lower, Cor_upper)
[1] 0.8270314 0.9055052
```

These results agree with R's `cor.test` and SAS's `proc corr`, so the calculation is correct.

The formulas above all come from:

SAS The CORR Procedure

NCSS Correlation

**PS. For other correlation hypothesis test methods and calculators, see https://www.psychometrica.de/correlation.html**, quite an interesting site...

Other references:

https://stats.stackexchange.com/questions/14220/how-to-test-hypothesis-that-correlation-is-equal-to-given-value-using-r

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Fisher_Transformation

https://cran.r-project.org/web/packages/cocor/cocor.pdf

https://www.personality-project.org/r/html/paired.r.html

**Please indicate the source**: http://www.bioinfo-scrounger.com

A reference range refers to laboratory measurement data derived from a large population of healthy people, analyzed statistically by age and sex to obtain the distribution of values in the vast majority of that population, which is then used to set the reference limits. For abnormal values that do not exceed the reference limits by much, clinicians can respond case by case according to the patient's clinical presentation, either starting treatment or continuing observation.

What is a medical decision level:

It refers to values that clinicians should master and use when diagnosing and treating disease. Unlike reference limits, these are additional cut-off values: by observing whether a measured value is above or below them, a clinician can rule out or confirm a diagnosis, grade or classify certain diseases, or estimate prognosis, prompting the physician to take a particular clinical action, such as ordering further examinations or starting a particular treatment.

Reference range vs. medical decision level

Difference | Reference range | Medical decision level |
---|---|---|
Source | derived from statistical analysis of healthy people by age and sex, giving the distribution of values in the vast majority of the population | derived from the observation and accumulation of large amounts of clinical patient data, used to characterize the occurrence, development, and change of disease |
Role | an indicator for judging the health status of a population; diagnosis and treatment decisions must be combined with clinical symptoms | the concentration of a measured component that plays a key role in diagnosing or treating disease; a measurement level at which clinical action must be taken |
Values | one upper and one lower limit, or only an upper or only a lower limit; clinicians judge the patient's health status via these limits | multiple upper or lower limits may be set according to different diagnostic criteria, treatment requirements, and treatment options; clinicians take different actions depending on which limit is crossed |
Importance | important, especially for routine testing of the general population, with significant guidance for the early detection of disease | important; by observing whether a value is above or below these limits, it can rule out or confirm a diagnosis, grade or classify certain diseases, or estimate prognosis, prompting the appropriate clinical action |

Take the white blood cell count as an example:

- Its reference range is (4-10) × 10^9/L.
- Medical decision levels, their clinical significance, and the corresponding actions:
- 0.5 × 10^9/L: below this value, the patient is highly susceptible to infection; preventive treatment and infection-prevention measures should be taken.
- 3 × 10^9/L: below this value indicates leukopenia; other tests should be performed, such as a differential count and examination of a peripheral blood smear, and the medication history should be reviewed.
- 11 × 10^9/L: above this value indicates leukocytosis; a differential count helps analyze the cause and type, and the source of infection should be sought if necessary.
- 30 × 10^9/L: above this value suggests possible leukemia; a differential count, a peripheral blood smear, and a bone marrow examination should be performed.

By extension, what are "normal value", "normal reference value", "reference value", and "reference range"?

"Normal value" and similar terms are really just a concept: the measured result lies within a relatively normal range, so it can also be called the "normal range". Defining normal values has certain requirements: the values come from measurements of people who are in a relatively healthy state, and generally the central 95% of measurements from the selected, relatively healthy population defines the normal limits, so about 5% of healthy people still fall in the "abnormal" region.

From this we can see that normal values are calculated in almost the same way as a reference interval...
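The central-95% idea can be sketched non-parametrically with simulated data (the analyte and numbers below are hypothetical, not from the post):

```r
# Simulated measurements from a "healthy" population
set.seed(123)
healthy <- rnorm(1000, mean = 7, sd = 1)

# Central 95%: the 2.5th and 97.5th percentiles as reference limits
ref_int <- quantile(healthy, probs = c(0.025, 0.975))
```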

Therefore, when your lab result slightly exceeds the normal range, it does not necessarily mean you are ill: there may be measurement error or interfering factors, or you may simply be one of the healthy people outside the 95%. "Normal value", "normal reference value", "reference value", and "reference range" all express the same meaning, but the word "normal" has many limitations, so experts now recommend the unified terms "reference value" or "reference range".

The appendix of WS/T 402-2012 also describes reference intervals and cut-off values (medical decision values) separately:

A medical decision level is different from a reference interval; it is established on the basis of other scientific and medical knowledge. It is derived in a different way from a reference interval and is usually associated with specific medical conditions.

How is a reference interval calculated?

See Analysis of Reference Value, which follows EP28-A3C... It can also be read together with WS/T 402-2012.

For the predicted bias (Bc) at a medical decision level (Xc), the Technical Guideline on Comparison Studies with Equivalent Marketed Products for IVD Reagents Exempt from Clinical Trials states:

Generally, the predicted bias at the medical decision level and its 95% confidence interval are compared with the acceptance limit for bias claimed by the applicant. The acceptance limit is either set by the applicant according to clinical needs after consulting clinical institutions, or set with reference to relevant domestic and international standards. If the 95% confidence interval of the predicted bias does not exceed the claimed acceptance limit, the investigational IVD reagent shows no significant bias against the comparator reagent, and the two are equivalent.

The calculation of the predicted bias at the MDL can follow EP09-A3, as part of the regression analysis:

- If the regression method is OLS, substitute Xc for x in the regression equation `y = a + bx` and compute `Bc = a + (b - 1) * Xc`; its CI can then be computed by formula.
- For Deming / weighted Deming regression, the jackknife method is recommended for computing the CI.
- For Passing-Bablok regression, the jackknife method is likewise recommended for the CI.

The calculation formulas can all be found in the EP documents, or you can call the functions in the R package `mcr` (written by Roche Diagnostics).

All of the above concerns the case where the investigational and comparator reagents show a linear relationship. In the (less common) non-linear case, the predicted bias at the MDL may need to be calculated somewhat differently:

If the study aims to provide an estimate of predicted bias at a specific medical decision level Xc, points near that value (concentration) can be used to provide such an estimate. At least 20 such points should be selected, either the 10 closest points on each side of the MDL, or a concentration interval around the medical decision concentration. The points should be selected based on the ranked averages of the investigational and comparator measurements. Estimating the predicted bias from the ranking of single measurements (investigational or comparator) may introduce an inappropriate bias into the result. (EP28-A3C, 9.1.5)

**Please indicate the source**: http://www.bioinfo-scrounger.com
