The alluvial plot and the Sankey diagram are both visualizations of general flow diagrams. These plot types are designed to show how the magnitude of some quantity changes as it flows between states. The difference between the two is often discussed online (see, for example, the issue "Alluvial diagram vs Sankey diagram?"), but this post is about how to study the connection and flow of data between categorical features in R, so we will use the two terms interchangeably.

Note: an Alluvial diagram is a subcategory of Sankey diagrams where nodes are grouped in vertical nodes (sometimes called steps).

To illustrate the following cases, I first load the flight data from the `nycflights13` package. This comprehensive data set contains all flights that departed from the New York City airports JFK, LGA, and EWR in 2013, including three columns we care about: origin (airport of origin), dest (destination airport), and carrier (airline code). For a better demonstration, I select the top five destinations and top four carriers.

```r
library(nycflights13)
library(dplyr)

top_dest <- flights %>% count(dest) %>% top_n(5, n) %>% pull(dest)
top_carrier <- flights %>% filter(dest %in% top_dest) %>%
  count(carrier) %>% top_n(4, n) %>% pull(carrier)
fly <- flights %>% filter(dest %in% top_dest & carrier %in% top_carrier)
```

Let's take a look at the Sankey diagram first. Google defines a sankey as:

> A sankey diagram is a visualization used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains or multiple paths through a set of stages.

In R, we can plot a sankey diagram with the `ggsankey` package in the ggplot2 framework. This package kindly provides a function, `make_long()`, to transform our usual wide data into long format.

```r
fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  ggsankey::make_long(origin, carrier, dest)
```

The resulting long data contains four columns, corresponding to stage and node: the stage columns supply the `x` and `next_x` aesthetics, and the node columns supply `node` and `next_node`. Hence, at least four columns are required. More usage is illustrated in the package documentation (https://github.com/davidsjoberg/ggsankey).

So a basic sankey diagram looks like this:

```r
library(ggplot2)
library(ggsankey)

ggplot(fly, aes(x = x, next_x = next_x, node = node, next_node = next_node,
                fill = factor(node), label = node)) +
  geom_sankey(flow.alpha = 0.6, node.color = "gray30") +
  geom_sankey_label(size = 3, color = "white", fill = "gray40") +
  scale_fill_viridis_d() +
  theme_sankey(base_size = 18) +
  labs(x = NULL) +
  theme(legend.position = "none", plot.title = element_text(hjust = .5))
```

Furthermore, the `networkD3` package can also plot Sankey diagrams, but I find it less easy to use.

After my initial use, the `alluvial` and `ggalluvial` packages both proved very suitable for creating alluvial plots in R. The former has its own specific syntax, while the latter integrates seamlessly into ggplot2, just like `ggsankey`.

To be honest, both are convenient; you can choose either according to your situation. For example, the `alluvial` package is demonstrated below:

```r
library(alluvial)

fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest) %>%
  mutate(origin = fct_relevel(as.factor(origin), c("EWR", "JFK", "LGA")))

alluvial(fly %>% select(-n), freq = fly$n, border = NA, alpha = 0.5,
         col = case_when(fly$origin == "JFK" ~ "red",
                         fly$origin == "EWR" ~ "blue",
                         TRUE ~ "orange"),
         cex = 0.75,
         axis_labels = c("Origin", "Carrier", "Destination"))
```

The detailed usage can be found on this page (https://github.com/mbojan/alluvial).

If you would like more customized adjustments, maybe `ggalluvial` is better: as a ggplot2 extension, it has enough functions to modify the plot to your liking.

We also don't need to spend much time transforming data, because `ggalluvial` has a convenient function that does the same job as `make_long()` in `ggsankey`: if your data is in a "wide" format, like the flight dataset, the `to_lodes_form()` function will help you easily.

```r
fly <- flights %>%
  filter(dest %in% top_dest & carrier %in% top_carrier) %>%
  count(origin, carrier, dest) %>%
  mutate(
    origin = fct_relevel(as.factor(origin), c("LGA", "EWR", "JFK")),
    col = origin
  ) %>%
  ggalluvial::to_lodes_form(key = type, axes = c("origin", "carrier", "dest"))

ggplot(data = fly, aes(x = type, stratum = stratum, alluvium = alluvium, y = n)) +
  # geom_lode(width = 1/6) +
  geom_flow(aes(fill = col), width = 1/6, color = "darkgray", curve_type = "cubic") +
  # geom_alluvium(aes(fill = stratum)) +
  geom_stratum(color = "grey", width = 1/6) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  theme(
    panel.background = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 15, face = "bold"),
    axis.title = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "none"
  ) +
  scale_fill_viridis_d()
```

`geom_alluvium()` differs a bit from `geom_flow()`; which one to use depends on the type of dataset you provide and what you want to demonstrate.

These are my notes on a previous question. If you have any questions, please read the following references, which may be easier to understand.

https://github.com/davidsjoberg/ggsankey

https://www.displayr.com/sankey-diagrams-r/

https://cran.r-project.org/web/packages/alluvial/vignettes/alluvial.html

https://corybrunson.github.io/ggalluvial/

https://www.r-bloggers.com/2019/06/data-flow-visuals-alluvial-vs-ggalluvial-in-r/

https://cran.rstudio.com/web/packages/ggalluvial/vignettes/ggalluvial.html

**Please indicate the source**: http://www.bioinfo-scrounger.com

Suppose you have many clinical datasets and want to find out which ones contain a certain variable, such as Sex. What should we do? Thinking in R, I would read all datasets, derive the variable names, and then find which datasets contain that specific variable. If you apply this approach in SAS, maybe you need to create a macro and use the open and varnum functions. However, there is a tip to achieve it, as follows:

```sas
libname mydata "C:/Users/anlan/Documents/CYFRA";

data sex_tb;
  set sashelp.vcolumn;
  where libname="MYDATA" and name="SEX";
  keep memname name type length;
run;
```

Just query `sashelp.vcolumn` and filter the rows by library name (`libname="MYDATA"`) and column name (`name="SEX"`). You won't find this tip in common books, but it's very useful in daily work; I generally call it "work experience".

The second example: how to derive date data from a string? I think any derivation process can be split into two steps, matching and extracting. In R, the popular string-handling package is obviously `stringr`; how about in SAS? SAS does not have an equivalent toolkit package/macro, but some functions may be useful, like `prxchange`. After reading the documentation of `prxchange`, I feel its regular-expression matching is quite similar to Perl's, as follows:

```sas
data work.demo;
  input tmp $20.;
  datalines;
AE 2021-01-01
CM 2021-02-01
MH 2021-03-01
;
run;

data demo_date;
  set demo;
  format date YYMMDD10.;
  /* date=prxchange("s/\w+\s+(.*)/$1/",1,tmp); */
  date=input(prxchange("s/\w+\s+(.*)/$1/",1,tmp),YYMMDD10.);
run;

proc contents data=demo_date;
run;
```

If you just want to judge whether a word or number exists in a string, in other words to match it, maybe `prxmatch` and `find` are helpful. If you want to extract words instead of matching them, `prxchange` or `prxparse` plus `prxposn` is preferred. By the way, the latter is closer to the R logic.

```sas
data work.num;
  input tmp $;
  datalines;
AE01
CM02
MH03
;
run;

data num_ext;
  length tmp type1 type2 $ 20;
  keep tmp type1 type2;
  re = prxparse("/([a-zA-Z]+)(\d+)/");
  set num;
  if prxmatch(re, tmp) then do;
    type1 = prxposn(re, 1, tmp);
    type2 = prxposn(re, 2, tmp);
    output;
  end;
run;
```

The third question: what's the difference between an informat and a format? An `informat` describes how the data is represented in the input text file, while a `format` describes how you want SAS to present the data when you look at it. Remember, formats do not change the underlying data, only how it is displayed on the screen.

As an instance, converting a `date9.` date to `yymmdd10.` demonstrates the usage of the `format` statement.

```sas
data aes;
  input @1 subject $3. @5 ae_start date9. @15 ae_stop date9. @25 adverse_event $15.;
  format ae_start yymmdd10. ae_stop yymmdd10.;
  datalines;
101 01JAN2004 02JAN2004 Headache
101 15JAN2004 03FEB2004 Back Pain
102 03NOV2003 10DEC2003 Rash
102 03JAN2004 10JAN2004 Abdominal Pain
102 04APR2004 04APR2004 Constipation
;
run;
```

Another scenario: we typically store the number 1 for male and 2 for female. It would be embarrassing to hand a client a report with unexplained 1s and 2s, so we need a format to dress up the report.

```sas
data report;
  input ID Gender State $;
  datalines;
100001 1 LA
100002 2 LA
100003 . AL
;
run;

proc format;
  value sex 1 = "Male"
            2 = "Female"
            . = "Unknown";
run;

proc print data=report;
  var id gender state;
  format gender sex.;
run;
```

An `informat` is usually used with the `input` statement to read various styles of variables into SAS.

Informat usage:

- Character informats: `$INFORMATw.`
- Numeric informats: `INFORMATw.d`
- Date/time informats: `INFORMATw.`

```sas
data death;
  input @1 subject $3. @5 death 1.;
  datalines;
101 0
102 0
;
run;
```

The fourth question: what's the difference between putting the keep= option on the SET statement versus on the DATA statement?

If you place the keep= option on the SET statement, SAS keeps only the specified variables when it reads the input data set. If you place it on the DATA statement, SAS reads all variables but keeps only the specified ones when it writes the output data set. From this we can see that the former is faster than the latter when the input dataset is very large, because fewer variables are read in.
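A minimal sketch of the two placements (the dataset and variable names here are hypothetical):

```sas
/* keep= on the SET statement: only these variables are read from the input */
data subset1;
  set big_data(keep=id sex age);
run;

/* keep= on the DATA statement: all variables are read,
   but only these are written to the output */
data subset2(keep=id sex age);
  set big_data;
run;
```

Both produce the same output dataset; only the amount of data read during the step differs.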

These are some of my interview questions. It's a pity that I wasn't fully prepared, as I am proficient in R but not in SAS. So I'm planning to use my free time to learn and summarize SAS, just as I did when learning Perl, R, and Python.

By the way, I think it's a good book, as it not only provides some SAS cases but also introduces knowledge from the pharmaceutical industry.


In SAS/IML, R code can be wrapped in a `submit /R` ... `endsubmit` block. A few months ago, I asked SAS support how to import plots created by R in IML directly into RTF templates, as I could not find any useful information on Google. Unfortunately, SAS support told me that if the plot was created in R, it would need to be saved within the submit block using R code as well. That means if you want to import R graphics into RTF directly, you should use some R function to achieve it. However, by chance I found a post (Mixing SAS and R Plots) that illustrates how to mix SAS plots and R plots in one graph, which solves my question perfectly. So let's look at how to generate a plot with R in SAS and import it into RTF or PDF. It's not a direct route, but a trick, I think.

Firstly, as usual, we draw an R graphic in SAS.

```sas
proc iml;
  submit /R;
    library(ggpubr)
    data("ToothGrowth")
    ggviolin(ToothGrowth, x = "dose", y = "len", add = "none") %>%
      ggadd(c("boxplot", "jitter"), color = "dose") %>%
      ggsave(filename = "C:/Users/anlan/Documents/plots/violin.png",
             width = 10, height = 8, units = "cm", dpi = 300)
  endsubmit;
  call ImportDataSetFromR("work.ToothGrowth", "ToothGrowth");
run;
quit;
```

Then we set the output destination to an RTF file and the graphics format to PNG at 12 cm height and 15 cm width. Note that the height and width refer to the size of the graphic canvas, not the actual plot size.

```sas
ods rtf file = "C:/Users/anlan/Documents/plots/outgraph.rtf";
ods graphics / noborder height=12cm width=15cm outputfmt=png;
```

Next we use the SAS Graph Template Language (GTL) to define a template and the `drawimage` statement to import the R graphic into SAS. The width and height parameters of the `drawimage` statement adjust the size of the image's bounding box (its actual size). In this example, `width=90 widthunit=percent` means the plot is scaled down to 90%.

```sas
proc template;
  define statgraph plottemp;
    begingraph;
      layout overlay;
        drawimage "C:/Users/guk8/Documents/plots/violin.png" /
          width=90 widthunit=percent height=90 heightunit=percent;
      endlayout;
    endgraph;
  end;
run;
```

The final plot is rendered into the RTF as follows:

```sas
proc sgrender template=plottemp;
run;
```

Actually, the `drawimage` statement is designed to import external graphics into a SAS graph, so that a mixed figure can be displayed. For example, I'd like to show a graph with an image at the bottom-right corner of the overall plot.

```sas
proc template;
  define statgraph mgraphic;
    begingraph;
      entrytitle "Mix SAS and external graphics";
      layout overlay;
        boxplot y=len x=dose / name="box" group=supp groupdisplay=cluster spread=true;
        discretelegend "box";
        drawimage "C:/Users/anlan/Documents/plots/violin.png" /
          width=45 widthunit=percent height=45 heightunit=percent
          anchor=bottomright x=98 y=2 drawspace=wallpercent;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=work.ToothGrowth template=mgraphic;
  label dose="Dose";
run;
```

The mixed graphic is as follows:

Great, it seems easy to achieve. It can certainly be done in R with some useful functions as well.

Mixing SAS and R Plots

DRAWIMAGE Statement

Hands-on Graph Template Language: Part B

BOXPLOT Statement

SAS Boxplot – Explore the Major Types of Boxplots in SAS


To be perfectly honest, I'm not sure that what I do here is correct, as I'm a new recruit to SAS. However, I have strong experience in R, so I'm accustomed to thinking about and solving problems in R.

Because of my work, I need to learn to manipulate data in R and SAS simultaneously. In this post I aim to use SAS to reproduce the procedure from that blog post (Logistic Regression for biomarkers). The steps covered are the following:

- Check variables distributions and correlation
- Fit logistic regression model
- Predict the probability of the event
- Compare two ROC curves

Firstly, I load the same dataset from the R package `mlbench` via the SAS/IML procedure. The `submit` and `endsubmit` statements can wrap R code in SAS and run it.

```sas
proc iml;
  submit /R;
    data("PimaIndiansDiabetes2", package = "mlbench")
  endsubmit;
  call ImportDataSetFromR("work.diabetes2", "PimaIndiansDiabetes2");
run;
quit;
```

We can look at the frequencies of the categorical variable in a summary table as follows:

```sas
proc freq data=diabetes2;
  tables diabetes;
run;
```

We can also check the continuous variables as follows:

```sas
proc means data=diabetes2;
  var age glucose insulin mass pedigree pregnant pressure triceps;
run;
```

Moreover, I choose graphs to demonstrate the distribution and correlation of the variables; I always think graphs are more informative. For instance, a histogram makes it easy to examine the distribution and look for outliers.

```sas
proc template;
  define statgraph multiple_charts;
    begingraph;
      entrytitle "Two distributions";
      /* Define chart grid */
      layout lattice / rows=1 columns=2;
        /* Chart 1 */
        layout overlay;
          entry "Glucose Histogram" / location=outside;
          histogram glucose / binwidth=10;
        endlayout;
        /* Chart 2 */
        layout overlay;
          entry "Pressure Histogram" / location=outside;
          histogram pressure / binwidth=5;
        endlayout;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=diabetes2 template=multiple_charts;
run;
```

As for variable correlation, a correlation heatmap is often a better choice.

```sas
* calculate correlation matrix for the data;
ods output PearsonCorr=Corr_P;
proc corr data=diabetes2;
  var age glucose insulin mass pedigree pregnant pressure triceps;
run;

proc sort data=Corr_P;
  by Variable;
run;

* transform wide to long;
proc transpose data=Corr_P out=CorrLong(rename=(COL1=Corr)) name=VarID;
  var age glucose insulin mass pedigree pregnant pressure triceps;
  by Variable;
run;

proc sgplot data=CorrLong noautolegend;
  /* colorresponse= draws a discrete square for each correlation */
  heatmap x=Variable y=VarID / colorresponse=Corr colormodel=ThreeColorRamp;
  text x=Variable y=VarID text=Corr / textattrs=(size=10pt);
  label Corr='Pearson Correlation';
  yaxis reverse display=(nolabel);
  xaxis display=(nolabel);
  gradlegend;
run;
```

These two figures show that glucose and pressure are approximately normally distributed and are not highly correlated.

Before fitting the model, we first reformat the diabetes variable and keep only the glucose and pressure variables, dropping any rows with NA values.

```sas
proc format;
  value $diabetes "pos"=1 "neg"=0;
run;

data inputData;
  set diabetes2(keep=diabetes glucose pressure);
  if nmiss(of _numeric_) + cmiss(of _character_) > 0 then delete; /* remove NA rows */
  format diabetes $diabetes.;
run;
```

To fit the logistic regression model in SAS, generally we will use the following code:

```sas
ods graphics on;
proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose pressure;
  output out=estimates p=est_response;
  ods output roccurve=ROCdata;
run;
```

The `plots(only)=roc` option means we only want to display the ROC plot, and we can get all predicted probabilities from the `est_response` column of the `estimates` dataset. With ODS output, we save the ROC-curve data into the `ROCdata` dataset directly.

Indeed, it seems very considerate, but I think it's not very flexible for further programming.

In this case, I don't specify any `class` variables. If you do specify class variables with the param option set to either `ref` or `glm`, SAS will automatically create dummy variables. (Without specifying param, the default coding for two-level factor variables is -1/1 rather than the 0/1 we prefer.)

Part of the model output is shown below:

We can see that the variable estimates equal those from R. In addition, we automatically get the odds-ratio estimate for each variable as well; actually, I could also calculate the odds ratios myself via `exp(coef)`.

If you would like to predict the probability for new data, maybe `lsmeans` is useful, which has a similar effect to `predict()` in R.

Now it's time to compare two ROC curves, so I first fit two logistic regression models: one with glucose only, the other with glucose plus pressure.

```sas
proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose;
  output out=estimates p=est_response;
  ods output roccurve=rocdata1;
run;

proc logistic data=inputData plots(only)=roc;
  model diabetes(event="1") = glucose pressure;
  output out=estimates p=est_response;
  ods output roccurve=rocdata2;
run;

data plotdata;
  set rocdata1(in=a) rocdata2(in=b);
  if a then group="mod1";
  if b then group="mod2";
  keep _1mspec_ _sensit_ group;
run;
```

We use `set` to combine the two ROC datasets for plotting, and then use proc sgplot with a series statement to draw the ROC curves.

In `proc sgplot`, the `aspect=1` option requests a square plot, which is customary for an ROC plot in which both axes use the [0,1] range. The `inset` statement writes the individual group AUC (area under the ROC curve) values inside the plot area.

```sas
proc sgplot data=plotdata aspect=1;
  /* styleattrs wallcolor=grayEE; */
  series x=_1mspec_ y=_sensit_ / group=group;
  lineparm x=0 y=0 slope=1 / transparency=.3 lineattrs=(color=gray);
  xaxis label="False Positive Fraction" values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
  yaxis label="True Positive Fraction" values=(0 to 1 by 0.25) grid offsetmin=.05 offsetmax=.05;
  inset ("glucose AUC" = "0.7877" "glucose+pressure AUC" = "0.7913") / border position=bottomright;
  title "ROC curves for logistic regression";
run;
```

The above is my first note on using SAS for analysis and visualization. In the coming period, I plan to compare R and SAS code for common tasks to push my SAS learning forward. I hope to make some progress.

Thanks to the post (Using SAS to Estimate a Logistic Regression Model) for making logistic regression in SAS clearer to me.

Modify the ROC plot produced by PROC LOGISTIC

Plot and compare ROC curves from a fitted model used to score validation or other data

Example 78.7 ROC Curve, Customized Odds Ratios, Goodness-of-Fit Statistics, R-Square, and Confidence Limits

Using SAS to Estimate a Logistic Regression Model

sas_correlation_heat_map.sas

SAS Series 20: Logistic Regression with PROC LOGISTIC


As we know, logistic regression can be applied in different ways, for example:

- Calculate OR values to find potential risk factors.
- Construct a model as a classifier to estimate the probability that an instance belongs to a class.
- Adjust for potential confounding factors so that we can estimate the effect of the factor of interest on the endpoint.

For example:

Suppose we're interested in knowing how variables such as age, sex, and body mass index affect blood pressure. In this case, body mass index may be the factor of most interest, while age and sex are confounders. Blood pressure is then treated as a categorical variable with two levels: high blood pressure and normal blood pressure.

In my case, I have a known biomarker as a reference marker, and I'd like to add another marker to it to form a marker combination, and then estimate whether the combination is better than the reference marker alone. So how do we select the appropriate statistical methods?

Obviously, the most straightforward idea is to compare the sensitivity and specificity between the combined marker and the reference marker.

- Superiority of sensitivity: the alternative hypothesis is that the number of additional cancer cases identified by the combined marker relative to the reference marker is larger than zero. The p-value is calculated by a binomial test.
- Non-inferiority of specificity: the alternative hypothesis is that the number of misclassifications as positive by the combined marker is less than 10% more than by the reference marker. The p-value is calculated using an approximate standard normal distribution based on restricted maximum likelihood estimation (RMLE).
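To make the first bullet concrete, here is a minimal sketch of such a binomial test in base R; the counts are made up for illustration:

```r
# Hypothetical: among 10 cancer cases on which the two markers disagree,
# the combined marker is the one that detects 8 of them.
n_discordant <- 10
combined_wins <- 8

# One-sided binomial test: does the combined marker detect
# significantly more than half of the discordant cases?
res <- binom.test(combined_wins, n_discordant, p = 0.5,
                  alternative = "greater")
res$p.value  # about 0.055
```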

However, if you want to compare the AUC between the reference marker and the combined marker, a logistic regression meets our needs perfectly: take the reference marker and the combined marker as independent variables and the disease condition (cancer/control) as the dependent variable. We hope to see that the combined AUC is larger than the reference one.

Before we perform logistic regression, some details may be useful to our model and worth considering in advance.

- Remove potential outliers.
- Make sure that the predictor variables are approximately normally distributed; if not, you can use a log, root, or Box-Cox transformation.
- Remove highly correlated predictors to minimize overfitting; the presence of highly correlated predictors can lead to an unstable model solution. In practice, this third consideration is often neglected.
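As a toy illustration of the second point (all numbers simulated), a right-skewed predictor can be pulled toward symmetry with a log transform:

```r
# Simulate a right-skewed (log-normal) predictor
set.seed(42)
x <- exp(rnorm(200, mean = 0, sd = 1))

# Simple moment-based skewness
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# The log transform makes the distribution far less skewed
c(raw = skew(x), logged = skew(log(x)))
```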

In that case, how do we fit a logistic regression model and calculate the AUC? I'd like to take some notes on the analysis process in R and SAS. By the way, I think R is much better than SAS for statistical analysis and visualization. **This is the spirit and power of open source, which makes our work better and better.**

I take the `PimaIndiansDiabetes2` data set from the `mlbench` package as an example, which is the "Pima Indians Diabetes Database". Load the data, select the two variables of interest plus the response, and remove NAs.

```r
library(tidyverse)
data("PimaIndiansDiabetes2", package = "mlbench")
data <- select(PimaIndiansDiabetes2, c("glucose", "pressure", "diabetes")) %>%
  na.omit()
```

Firstly, I think it's necessary to examine the distributions of, and correlation between, these variables (glucose and pressure), as follows:

```r
## distribution
ggpubr::ggarrange(
  ggpubr::gghistogram(data = data, x = "glucose"),
  ggpubr::gghistogram(data = data, x = "pressure"),
  nrow = 1, ncol = 2
)
```

```r
## correlation
library(ggcorrplot)
ggcorrplot(corr = cor_pmat(PimaIndiansDiabetes2[,1:8]), method = "circle")
```

Here, we can see that these two variables are approximately normally distributed and not correlated.

Then I fit a simple model with glucose as the predictor variable.

```r
model1 <- glm(diabetes ~ glucose, data = data, family = binomial)
summary(model1)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -5.61173171 0.442288629 -12.68794 6.897596e-37
## glucose      0.03951014 0.003397783  11.62821 2.962420e-31
```

The output above shows the beta coefficients and their significance levels. The intercept is `-5.61` and the coefficient of the glucose variable is `0.039`. The `Std. Error` represents the accuracy of the coefficient estimate: the larger it is, the less confident we are in the estimate. The `z value` is the estimate divided by its standard error, and from it the p-value is derived.

To interpret the beta coefficients, remember that the estimate is the regression coefficient on the log-odds scale, so exp(coef) is the odds ratio, which means **the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in its absence (x = 0)**.
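As a quick numerical check using the glucose coefficient from the output above:

```r
# Glucose coefficient reported by summary(model1) above
coef_glucose <- 0.03951014

# Odds ratio per 1-unit increase in glucose
exp(coef_glucose)       # about 1.04

# Odds ratio per 10-unit increase in glucose
exp(10 * coef_glucose)
```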

We all know that the s-shaped curve is defined as `p = exp(y) / [1 + exp(y)]` (James et al. 2014). This can also be written simply as `p = 1 / [1 + exp(-y)]`, where:

- `y = b0 + b1*x`
- `exp()` is the exponential function
- `p` is the probability of the event occurring (1) given x. Mathematically, this is written as `p(event=1|x)` and abbreviated as `p(x)`, so `p(x) = 1 / [1 + exp(-(b0 + b1*x))]`
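Plugging the fitted coefficients from above into this formula gives a hand-rolled version of what `predict(..., type = "response")` computes:

```r
# Fitted coefficients reported by summary(model1) above
b0 <- -5.61173171
b1 <- 0.03951014

# p(x) = 1 / (1 + exp(-(b0 + b1*x)))
p_diabetes <- function(glucose) 1 / (1 + exp(-(b0 + b1 * glucose)))

# Probability of being diabetes-positive at glucose = 150
p_diabetes(150)  # about 0.58
```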

Based on this formula, given a new plasma glucose concentration, it is easy to predict the probability that the patient is diabetes-positive. In R, we can use the `predict()` function to calculate the probability instead of applying the logistic equation ourselves.

```r
mod_prob1 <- predict(model1, newdata = data, type = "response")
```

We can also apply `geom_smooth()` to fit an s-shaped probability curve using the `data` above.

```r
data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  theme_light() +
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-pos"
  )
```

Back to my case: my purpose is to compare the reference marker and the combined marker. Suppose glucose is the reference marker and glucose plus pressure is the combined one. I therefore fit a multiple logistic regression with the glucose and pressure variables.

```r
model2 <- glm(diabetes ~ glucose + pressure, data = data, family = binomial)
summary(model2)$coef
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -6.49941142 0.659445793 -9.855869 6.465488e-23
## glucose      0.03836257 0.003428241 11.190160 4.556103e-29
## pressure     0.01406869 0.007478525  1.881212 5.994305e-02
```

In the same way, calculate the probabilities based on the multiple regression, and then compare the two models by AUC.

```r
mod_prob2 <- predict(model2, newdata = data, type = "response")
plotres <- data.frame(event = ifelse(data$diabetes == "pos", 1, 0),
                      glucose = mod_prob1,
                      pressure = mod_prob2,
                      stringsAsFactors = FALSE) %>%
  pivot_longer(cols = 2:3)
```

To plot multiple ROC curves on the same plot, the `plotROC` package can help us; it is a pleasure to use.

```r
library(plotROC)
p <- ggplot(as.data.frame(plotres), aes(d = event, m = value, color = name)) +
  geom_roc(n.cuts = 0) +
  style_roc()
p + annotate("text", x = .75, y = .25,
             label = paste(c("glucose", "pressure"), "AUC =",
                           round(calc_auc(p)$AUC, 4), collapse = "\n"))
```

From this ROC output, the "combined marker" may be no better than the "reference marker". Obviously, that's my fault for choosing unsuitable dummy data, but I think this post is still useful for understanding what logistic regression for biomarker combinations looks like.

**Thanks to the article (Logistic Regression Essentials in R) for making logistic regression clearer to me.**

Logistic Regression Essentials in R

Heart Disease Prediction using Logistic Regression

Heart Disease Prediction

Understanding Logistic Regression using R

Chapter 10 Logistic Regression

Generate ROC Curve Charts for Print and Interactive Use


PowerPoint is a creative tool that can help you make any hex sticker you want: you can search for templates online and make your own additions. However, using PowerPoint to manipulate the image and create semi-circular text will take longer than you might hope to get the hexagon shape right.

The biggest advantage of PowerPoint templates is that, provided you're proficient in PowerPoint, you're sure to be able to make a hex sticker.

The `hexSticker` package can produce a series of pretty stickers from figures generated by base plot, lattice, and ggplot2. It seems that any R plot can be added to a hex sticker, including extra graphics.

```r
library(hexSticker)
imgurl <- "./interactive.png"
hexSticker::sticker(imgurl, package="easyIVD", p_size=20, p_y = 1.5,
                    s_x=1, s_y=.75, s_width=.5, filename="imgfile.png")
```

The advantage of the `hexSticker` package is that, since the sticker is generated by R code, it's easy to control every parameter and the result is fully reproducible. Moreover, it's more convenient to share with others than a PowerPoint template.

More configuration details are in the `?hexSticker::sticker` documentation.

I think hexmake is a brilliant tool; it won a prize in the 2020 RStudio Shiny Contest. Since trying it, I believe it's the most convenient tool for making hex stickers: it has more detailed configuration options and is thoughtful and friendly.

You can specify the hex name, image configuration, hexagon border, spotlight details, and add a URL to the sticker, which I believe is enough for your personal design.

This tool is built with R Shiny, so you only need to visit the web app (https://connect.thinkr.fr/hexmake/) to begin designing your personalized sticker.

The home page is shown below:

After a series of configurations, my hex sticker was complete and looks pretty good. If you are also interested, brainstorm and make yours even more creative.

Making a Hexagon Sticker

hexSticker: create hexagon sticker in R

Build your Own Hex Sticker


In this situation, one option is to use maximally selected rank statistics to find the cutoff. The exact statistic depends on your data types.

What is a maximally selected rank statistic? Briefly speaking, we assume that an unknown cutoff in X (the independent variable) splits the observations into two groups with respect to the response Y, and we look for the cutoff at which the statistic comparing the two groups is largest. The statistic is an appropriately standardized two-sample linear rank statistic of the responses that represents the difference between the two groups.

The hypothesis test then finds the maximum of the standardized statistics over all possible cutoffs, which provides the best separation of the responses into two groups.

So maximally selected rank statistics can be used for estimation as well as evaluation of a simple cutpoint model.

The `surv_cutpoint()` function in the `survminer` package wraps `maxstat` to determine the optimal cutpoint for each variable.

The simple example below shows how to use the `maxstat` package to find a statistically significant cutoff.

Load the survival data from the maxstat package.

```r
library(survival)
library(maxstat)
data(DLBCL)
mod <- maxstat.test(Surv(time, cens) ~ MGE, data = DLBCL,
                    smethod = "LogRank", pmethod = "condMC", B = 9999)
> mod

	Maximally selected LogRank statistics using condMC

data:  Surv(time, cens) by MGE
M = 3.1772, p-value = 0.009701
sample estimates:
estimated cutpoint 
         0.1860526 
```

The `smethod` argument selects the kind of statistic to be computed, and `pmethod` specifies the kind of p-value approximation. The argument `B` sets the number of Monte Carlo replications to be performed (it defaults to 10000).

For the overall survival time, the estimated cutpoint is 0.186 mean gene expression, and the maximum of the log-rank statistics is M = 3.1772. The probability that, under the null hypothesis, the maximally selected log-rank statistic exceeds M = 3.1772 is less than 0.0097.

If you have more than one predictor to evaluate, you can assess them simultaneously and find out which one separates the groups best.

```r
mod2 <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data = DLBCL,
                     smethod = "LogRank", pmethod = "exactGauss", abseps = 0.01)
> mod2

	 Optimally Selected Prognostic Factors 

Call:
maxstat.test.data.frame(formula = Surv(time, cens) ~ MGE + IPI,
    data = DLBCL, smethod = "LogRank", pmethod = "exactGauss",
    abseps = 0.01)

Selected:

	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint 
                 1 

Adjusted p.value: 0.03430325 , error: 0.001754899

> mod2$maxstats
[[1]]

	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by MGE
M = 3.0602, p-value = 0.02721
sample estimates:
estimated cutpoint 
         0.1860526 

[[2]]

	Maximally selected LogRank statistics using exactGauss

data:  Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint 
                 1 
```

The p-value of the global test of the null hypothesis that survival is independent of both IPI and MGE is 0.034, and IPI provides a better split into two groups than MGE does.

The result can be visualized with the `plot()` function:

```r
plot(mod2)
```

Maximally Selected Rank Statistics in R

https://www.jianshu.com/p/0851baac137c

https://www.iikx.com/news/statistics/1747.html


For instance, how do we create a pie chart or a donut chart? In R with the `ggplot2` package, the function `coord_polar()` is recommended, since a pie chart is just a stacked bar chart in polar coordinates.

Create a simple dataset:

```r
df <- data.frame(
  group = c("Female", "Male", "Child"),
  value = c(25, 30, 45),
  Perc  = c("25%", "30%", "45%")
)
```

And then create a pie chart:

```r
ggplot(df, aes(x = "", y = value, fill = group)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
```

The above is the default style. It may look a little different from pie charts produced by business tools, so we want to remove the axis ticks and labels. To go further, we probably need to add text annotations, so we customize the chart with `theme()` and `geom_text()`. For pretty colour palettes, the `ggsci` package is highly recommended.

One point to note: if you add labels with `geom_text()`, sort your fill (group/factor) variable first, otherwise the text labels will end up in the wrong positions.

Create a custom theme, and calculate the position of each label in the pie chart.

```r
mytheme <- theme_minimal() +
  theme(
    axis.title = element_blank(),
    axis.text.x = element_blank(),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank(),
    legend.key.size = unit(15, "pt"),
    legend.text = element_text(size = 12),
    legend.position = "top"
  )

df2 <- df %>%
  mutate(
    cs = rev(cumsum(rev(value))),
    pos = value / 2 + lead(cs, 1, default = 0)
  )
```

The pie chart below looks more polished than the previous one.

```r
ggplot(df2, aes(x = "", y = value, fill = fct_inorder(group))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  ggsci::scale_fill_npg() +
  mytheme +
  geom_text(aes(y = pos, label = Perc), size = 5) +
  guides(fill = guide_legend(title = NULL))
```

Categorical data are often easier to read in a donut chart than in a pie chart, although I have always thought they are essentially the same. Unlike the pie chart, to draw a donut chart we must set `x = 2` in `aes()` and add `xlim()` to carve out the hole.

```r
ggplot(df2, aes(x = 2, y = value, fill = fct_inorder(group))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 200) +
  xlim(0.2, 2.5) +
  ggsci::scale_fill_npg() +
  theme_void() +
  geom_text(aes(y = pos, label = Perc), size = 5, col = "white") +
  guides(fill = guide_legend(title = NULL))
```

However, the parameters of pie and donut charts are not easy to remember. Is there a simpler way? I think the `ggpie()` and `ggdonutchart()` functions in the ggpubr package are preferable.

For more details, refer to https://rpkgs.datanovia.com/ggpubr/reference/index.html, which also covers other useful plotting functions. Here is a donut chart as an example.
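A minimal `ggpie()` call on the same toy data might look like this (a sketch; argument names follow the ggpubr reference, and the "npg" palette comes from ggsci):

```r
library(ggpubr)

df <- data.frame(
  group = c("Female", "Male", "Child"),
  value = c(25, 30, 45),
  Perc  = c("25%", "30%", "45%")
)

# one call replaces the geom_bar() + coord_polar() + theme() recipe
ggpie(df, x = "value", label = "Perc",
      fill = "group", color = "white", palette = "npg")
```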

In addition, I think the `geom_label_repel()` function (from the ggrepel package) is a better way to add text annotations.

```r
library(ggrepel)

df2 <- df2 %>% mutate(group = fct_inorder(group), tmp = "")

ggpubr::ggdonutchart(data = df2, x = "value", label = "tmp",
                     lab.pos = "in", fill = "group",
                     color = "black", palette = "npg") +
  geom_label_repel(aes(y = pos, label = paste0(group, " (", Perc, ")"),
                       fill = group),
                   segment.color = pal_npg("nrc")(3),
                   segment.size = 0.8,
                   data = df2, size = 4, show.legend = FALSE,
                   nudge_y = 1, color = "black") +
  guides(fill = "none")
```

A **prettier design** can be found in this blog post (Donut chart with ggplot2).

This tip is about how to add labels to a dodged barplot, where we have to specify `position = position_dodge()` and `width =` simultaneously. If you are careless, the text annotations will end up in the wrong positions.

In this situation, we must use a **consistent width** value in all position-related functions, namely `geom_bar()`, `position_dodge()` and `geom_text()`.

```r
df <- data.frame(
  supp = rep(c("VC", "OJ"), each = 3),
  dose = rep(c("D0.5", "D1", "D2"), 2),
  len  = c(6.8, 15, 33, 4.2, 10, 29.5)
)

ggplot(data = df, aes(x = dose, y = len, fill = supp)) +
  geom_bar(stat = "identity", color = "black",
           position = position_dodge(0.65), width = 0.65) +
  theme_minimal() +
  geom_text(aes(label = len), vjust = -0.5, color = "black",
            position = position_dodge(0.65), size = 3.5) +
  scale_fill_brewer(palette = "Blues")
```

Then how about a stacked barplot? We must calculate a position variable and pass it to `geom_text()` as the y axis.

```r
df2 <- arrange(df, dose, supp) %>%
  plyr::ddply("dose", transform, label_ypos = cumsum(len))

ggplot(data = df2, aes(x = dose, y = len, fill = supp)) +
  geom_bar(stat = "identity") +
  geom_text(aes(y = label_ypos, label = len), vjust = 1.6,
            color = "black", size = 3.5) +
  scale_fill_brewer(palette = "Blues") +
  theme_minimal()
```

Obviously, if you use the `ggbarplot()` function from the ggpubr package, there are fewer parameters to calculate and remember. (https://rpkgs.datanovia.com/ggpubr/reference/ggbarplot.html)

ggplot2 pie chart : Quick start guide - R software and data visualization

ggplot2 barplots : Quick start guide - R software and data visualization

https://ggplot2.tidyverse.org/reference/

http://www.sthda.com/english/wiki/ggplot2-essentials

https://rpkgs.datanovia.com/ggpubr/reference/index.html

Plotting Pie and Donut Chart with ggpubr pckage in R


CLSI EP05-A3 and EP15-A3 are used as the references.

Definition of intermediate precision:

Intermediate precision (also called within-laboratory or within-device precision) is a measure of precision under a defined set of conditions: same measurement procedure, same measuring system, same location, and replicate measurements on the same or similar objects over an extended period of time. It may include changes to other conditions such as new calibrations, operators, or reagent lots. (Intermediate precision)

Take throwing darts as an example:

- Accuracy: the score you get on the dartboard; the higher the score, the better.
- Precision: the spread of your throws. If your darts land very close together, your technique is very stable.

If you want to estimate the precision of a certain test, these three indicators are useful for figuring out whether it’s good enough to use.

- %CV: coefficient of variation expressed as a percentage
- %CV_R: repeatability coefficient of variation
- %CV_WL: within-laboratory coefficient of variation

We all know that it’s impossible to ensure every test gives identical results, since many factors can influence them, such as:

- Day
- Run
- Reagent lot
- Calibrator lot
- Calibration cycle
- Operator
- Instrument
- Laboratory

The first two of the above are usually the main factors to be considered.

So there is always some variation in the measured results compared to the true values. It consists of systematic error (bias) and random error; precision measures the random error.

Consider a single-site 20x2x2 study: 20 days, with two runs per day and two replicates per run. The associated factors, days and runs, are included in the statistical analysis, which is used to estimate two types of precision: repeatability (within-run precision) and within-laboratory precision (within-device precision).

Once the sources of variation have been identified, an ANOVA model can be used to calculate the SDs and %CVs in the statistical processing of the data. The usual factors can be divided into three components:

Within-run precision (or repeatability) measures the variation of replicate results for a given sample within a single run, under essentially constant conditions. This variation is mainly caused by random error inside the instrument, such as variation in the pipetted volumes of sample and reagent.

Between-run precision measures the variation between different runs (e.g. run 1 vs run 2). Changing runs may change the operating conditions, such as temperature, instrument status, etc.

Between-day precision measures the variation between days, which is easy to understand, e.g. variation caused by humidity etc.

This protocol (20x2x2) estimates repeatability (within-run precision) and within-laboratory (intermediate) precision following CLSI EP15.

From the description above, the protocol is a classic nested (hierarchical) design: replicates are nested within runs, and runs are nested within days. In this situation a nested ANOVA is appropriate; with two factors involved, this corresponds to a two-way nested ANOVA.

To estimate the precision of this single-site 20x2x2 design, we follow a nested components-of-variance model involving two factors, “day” and “run”, with “run” nested within “day”. I think this model can be analyzed with a two-way nested ANOVA. Note that the design is balanced, because it specifies the same number of runs for each day and the same number of replicates for each run.

The screenshot above, from CLSI EP05-A3, helps us understand the nested components-of-variance model. In particular, the residual in the model represents the within-run factor.
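Written out explicitly, the nested components-of-variance model and the ANOVA estimators used later are (standard EP05-A3 quantities, restated here for reference):

```latex
y_{ijk} = \mu + d_i + r_{ij} + \varepsilon_{ijk},
\qquad i = 1,\dots,20;\; j = 1,2;\; k = 1,2,

d_i \sim N(0,\sigma^2_{\mathrm{day}}),\quad
r_{ij} \sim N(0,\sigma^2_{\mathrm{run}}),\quad
\varepsilon_{ijk} \sim N(0,\sigma^2_{\mathrm{error}}),

\hat\sigma^2_{\mathrm{error}} = MS_{\mathrm{error}},\qquad
\hat\sigma^2_{\mathrm{run}} = \frac{MS_{\mathrm{run}} - MS_{\mathrm{error}}}{n_{\mathrm{rep}}},\qquad
\hat\sigma^2_{\mathrm{day}} = \frac{MS_{\mathrm{day}} - MS_{\mathrm{run}}}{n_{\mathrm{run}}\, n_{\mathrm{rep}}},

\hat\sigma^2_{\mathrm{WL}} = \hat\sigma^2_{\mathrm{day}} + \hat\sigma^2_{\mathrm{run}} + \hat\sigma^2_{\mathrm{error}}.
```

These are exactly the quantities `Verror`, `Vrun`, `Vday` and `Swl` computed in the R code in this post.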

Nested random effects occur when each member of one group is contained entirely within a single unit of another group; the canonical example is students in classrooms. Crossed random effects occur when this nesting does not hold. An example would be different seeds and different fields used for planting crops: seeds of the same type can be planted in different fields, and each field can contain multiple seed types.

Whether random effects are nested or crossed is a property of the data, not of the model. In other words, you should tell the model which data are nested or crossed.

I don’t describe the experiment and workflow in this section; they are documented clearly in the CLSI EP05 and EP15 documents.

Let’s talk about how to calculate the %CV and SD, which can be divided into at least two categories depending on how many factors are involved.

As the first step, I load a simple 20x2x2 design data set from the R package `VCA`, containing 2 replicates, 2 runs and 20 days for a single sample, where y is the measured value.

- One reagent lot, a single sample
- One instrument system
- 20 test days
- Two runs per day
- Two replicate measurements per run

```r
library(VCA)
data(dataEP05A2_2)
> summary(dataEP05A2_2)
      day     run          y        
 1      : 4   1:40   Min.   :68.87  
 2      : 4   2:40   1st Qu.:73.22  
 3      : 4          Median :75.39  
 4      : 4          Mean   :75.41  
 5      : 4          3rd Qu.:77.37  
 6      : 4          Max.   :83.02  
 (Other):56                        
```

In the second step, I fit the nested components-of-variance model with a nested ANOVA using the `aov()` function in R. In this situation, runs are nested within days.

```r
res <- aov(y ~ day/run, data = dataEP05A2_2)
ss <- summary(res)
> ss
            Df Sum Sq Mean Sq F value  Pr(>F)    
day         19  319.0  16.787   4.512   3e-05 ***
day:run     20  187.4   9.372   2.519 0.00634 ** 
Residuals   40  148.8   3.720                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

In the third step, calculate the SD and %CV of the day, run and error components following the formulas given in EP05-A3. Note that the error CV (`CVerror`) corresponds to %CV_R, also called within-run or repeatability precision, and %CV_WL is the within-laboratory precision.

```r
nrep <- 2
nrun <- 2
nday <- 20
Verror <- ss[[1]]$`Mean Sq`[3]
Vrun   <- (ss[[1]]$`Mean Sq`[2] - ss[[1]]$`Mean Sq`[3]) / nrep
Vday   <- (ss[[1]]$`Mean Sq`[1] - ss[[1]]$`Mean Sq`[2]) / (nrun * nrep)
Serror <- sqrt(Verror)
Sday   <- sqrt(Vday)
Srun   <- sqrt(Vrun)
Swl    <- sqrt(Vday + Vrun + Verror)
> print(c(Swl, Sday, Srun, Serror))
[1] 2.898293 1.361533 1.681086 1.928803
CVerror <- Serror / mean(dataEP05A2_2$y) * 100
> CVerror
[1] 2.557875
CVwl <- Swl / mean(dataEP05A2_2$y) * 100
> CVwl
[1] 3.843561
```

In the fourth step, calculate the confidence intervals of the SD and %CV, which rely on chi-square distribution quantiles at the estimated degrees of freedom. Take the error %CV as an example.

```r
alpha <- 0.05
CVCI <- c(CVerror * sqrt(ss[[1]]$Df[3] / qchisq(1 - alpha/2, df = 40)),
          CVerror * sqrt(ss[[1]]$Df[3] / qchisq(alpha/2, df = 40)))
> CVCI
[1] 2.100049 3.272809
CVCI_oneSide <- c(CVerror * sqrt(ss[[1]]$Df[3] / qchisq(1 - alpha, df = 40)),
                  CVerror * sqrt(ss[[1]]$Df[3] / qchisq(alpha, df = 40)))
> CVCI_oneSide
[1] 2.166476 3.142029
```

**Fortunately, the standard calculation steps above have been packaged into an R package: the VCA package. We simply apply the `anovaVCA()` function to fit the model and summarize it; for the CI calculation, the `VCAinference()` function can be used.**

Fit model:

```r
res <- anovaVCA(y ~ day/run, dataEP05A2_2)
> res

Result Variance Component Analysis:
-----------------------------------

  Name    DF       SS         MS        VC       %Total    SD       CV[%]   
1 total   54.78206                      8.400103 100       2.898293 3.843561
2 day     19       318.961943 16.787471 1.853772 22.068447 1.361533 1.805592
3 day:run 20       187.447626 9.372381  2.82605  33.643043 1.681086 2.229366
4 error   40       148.811221 3.720281  3.720281 44.288509 1.928803 2.557875

Mean: 75.40645 (N = 80) 

Experimental Design: balanced  |  Method: ANOVA
```

Calculate CI for SD and %CV:

```r
> VCAinference(res)

Inference from (V)ariance (C)omponent (A)nalysis
------------------------------------------------

> VCA Result:
-------------

  Name    DF      SS       MS      VC     %Total  SD     CV[%] 
1 total   54.7821                  8.4001 100     2.8983 3.8436
2 day     19      318.9619 16.7875 1.8538 22.0684 1.3615 1.8056
3 day:run 20      187.4476 9.3724  2.8261 33.643  1.6811 2.2294
4 error   40      148.8112 3.7203  3.7203 44.2885 1.9288 2.5579

Mean: 75.4064 (N = 80) 

Experimental Design: balanced  |  Method: ANOVA

> VC:
-----
        Estimate CI LCL  CI UCL  One-Sided LCL One-Sided UCL
total   8.4001   5.9669  12.7046 6.2987        11.8680      
day     1.8538                                              
day:run 2.8261                                              
error   3.7203   2.5077  6.0906  2.6689        5.6135       

> SD:
-----
        Estimate CI LCL CI UCL One-Sided LCL One-Sided UCL
total   2.8983   2.4427 3.5644 2.5097        3.4450       
day     1.3615                                            
day:run 1.6811                                            
error   1.9288   1.5836 2.4679 1.6337        2.3693       

> CV[%]:
--------
        Estimate CI LCL CI UCL One-Sided LCL One-Sided UCL
total   3.8436   3.2394 4.7269 3.3282        4.5686       
day     1.8056                                            
day:run 2.2294                                            
error   2.5579   2.1000 3.2728 2.1665        3.1420       

95% Confidence Level
SAS PROC MIXED method used for computing CIs
```

These functions can also handle complicated designs, so we don’t need to write our own functions any more.

Visualizing Nested and Cross Random Effects

R-Package VCA for Variance Component Analysis

How to Perform a Nested ANOVA in R (Step-by-Step)

Lab 8 - Nested and Repeated Measures ANOVA

What’s with the precision?


I thought I understood ANOVA completely. But when I wanted to apply a MANOVA model, I found I was totally wrong; I did not even have a clear idea of which variables, continuous or categorical, belong where in an ANOVA. So I decided to take notes to figure out the differences between ANOVA, MANOVA and ANCOVA.

ANOVA is a statistical technique that assesses differences in a dependent variable across the levels of categorical variables. Commonly, ANOVAs are used in three ways: one-way ANOVA, two-way ANOVA and N-way ANOVA.

- **Independence of observations**: there are no hidden relationships among observations.
- **Normally-distributed dependent variable**: the dependent variable follows a normal distribution. If this is not met, you can try a data transformation.
- **Homogeneity of variance**: the variances in each group are similar. If this is not met, you may be able to use a non-parametric alternative, like the Kruskal-Wallis test.
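These assumptions can be checked with a few base-R tests. A minimal sketch on the mtcars data used later in this post (the choice of `mpg` and `am` is illustrative):

```r
data(mtcars)

# Normality: test the residuals of the fitted one-way ANOVA
fit <- aov(mpg ~ factor(am), data = mtcars)
shapiro.test(residuals(fit))

# Homogeneity of variance across groups
bartlett.test(mpg ~ factor(am), data = mtcars)

# Non-parametric alternative if the assumptions fail
kruskal.test(mpg ~ factor(am), data = mtcars)
```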

Types of data in ANOVA, t-test and Chi-Squared Test

| X (independent variable) | X groups | Y (dependent variable) | Analysis |
|---|---|---|---|
| categorical | two or more groups | quantitative | ANOVA |
| categorical | just two groups | quantitative | t-test |
| categorical | two or more groups | categorical | Chi-Squared Test |

A one-way ANOVA has just one independent variable affecting a dependent variable, and the independent variable can have two or more categories to compare.

The null hypothesis of the test is that the group means are equal, i.e. there is no difference among group means. Therefore, a significant result means that the means are unequal. If you only want to compare two groups, use the t-test instead.

ANOVA uses the F-test for statistical significance. If the variance within groups is smaller than the variance between groups, the F-test yields a larger F-value, which means higher significance.

ANOVA only tells you whether there are differences among the levels of the independent variable, not which differences are significant. To find out how the levels differ from one another, perform a Tukey HSD post hoc test.
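A minimal post hoc sketch in base R (`cyl` is chosen only because it has three levels):

```r
# pairwise comparisons of group means with family-wise adjusted p-values
fit <- aov(mpg ~ factor(cyl), data = mtcars)
TukeyHSD(fit)
```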

A two-way ANOVA has two independent (categorical) variables, which is the main difference from a one-way ANOVA. These variables are also called factors, and each factor can be split into multiple levels. So if one factor has 3 levels and the other factor also has 3 levels, there will be 3x3 = 9 groups.

Use a two-way ANOVA when you want to know how two independent variables, in combination, affect a dependent variable. A two-way ANOVA with interaction tests three null hypotheses at the same time:

- There is no difference in group means at any level of the first independent variable.
- There is no difference in group means at any level of the second independent variable.
- The effect of one independent variable does not depend on the effect of the other independent variable (a.k.a. no interaction effect)

If you want a two-way ANOVA without an interaction effect, you only need the first two hypotheses.

```r
data <- mtcars[, c("am", "mpg", "hp", "vs")] %>%
  mutate(am = factor(am), vs = factor(vs))
summary(data)

# One-way ANOVA
one.way <- aov(mpg ~ am, data = data)
summary(one.way)

# Two-way ANOVA
two.way <- aov(mpg ~ am + vs, data = data)
summary(two.way)

# Two-way ANOVA with interaction
two.way <- aov(mpg ~ am * vs, data = data)
summary(two.way)
```

We know that a one- or two-way ANOVA has only one dependent variable, but MANOVA is not limited in this way. MANOVA stands for multivariate analysis of variance, and it is used when there are two or more dependent variables. Its purpose is to find out whether the dependent variables differ across the independent variables simultaneously.

MANOVA assumes that independent variables are categorical and dependent variables are continuous, the same as ANOVA.

Instead of a univariate F value, we would obtain a multivariate F value, and several test statistics are available: Wilks' λ, Hotelling's trace, Pillai's criterion.

Sometimes a one-way ANOVA cannot detect significance for each dependent variable separately between the groups (levels of the independent variables), so we would conclude that there is no relation between the dependent and independent variables. However, when we apply MANOVA to these dependent variables simultaneously, it may conclude that the dependent variables are affected by the independent variables.

If you’re still confused, try reading the post Comparison of MANOVA to ANOVA Using an Example, which gives a good worked example.

When you would otherwise need to run a series of one-way ANOVAs because you have multiple dependent variables, using MANOVA protects against inflated Type I error.

Example:

dependent variables: Sepal.Length and Petal.Length

independent variables: Species

```r
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length

# MANOVA test
res.manova <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

# choose the test statistic, e.g. Wilks
summary(res.manova, test = "Wilks")
```

ANCOVA is like an extension of ANOVA that can adjust for other factors that might affect the outcome, such as age, gender or drug use. It can also combine a categorical variable with a continuous one (one factor categorical, the other quantitative), or use variables on a scale as predictors. In that case the covariate is a variable of interest, not merely one you want to control for.

Therefore, you can enter any covariates you want into an ANCOVA. But the more you enter, the fewer degrees of freedom you have, which reduces the statistical power; and the lower the power, the less you can rely on the results of the test.

Before performing ANCOVA, besides normality and homogeneity of variance, we need to verify that the covariate and the independent variable are independent of each other, since adding a covariate into a model only makes sense if the covariate and the independent variable act independently on the dependent variable.

NOTE: if you use Type I sums of squares for the model, you must mind the order; the covariate goes first (and there is no interaction term).

Example:

dependent variables: Petal.Length

independent variables: Species

covariate: Sepal.Length

```r
# fit ANCOVA model
fit <- aov(Petal.Length ~ Sepal.Length + Species, data = iris)

# view summary of the model
car::Anova(fit, type = 2)
```

What is the difference between ANOVA & MANOVA?

ANOVA Test: Definition, Types, Examples

ANOVA (Analysis of Variance)

How to Conduct an ANCOVA in R

ANCOVA example

ANCOVA in R

Doing and reporting your first ANOVA and ANCOVA in R

ANCOVA -- Notes and R Code

ANCOVA: Analysis of Covariance

An introduction to the two-way ANOVA

ANOVA in R: A step-by-step guide

An introduction to the one-way ANOVA

Understanding confounding variables


From now on, I will try my best to keep my notes in English, as writing practice for work.

Recently I discussed the non-standard evaluation mode of the dplyr package with a colleague. Before that conversation, I had always searched Google for “dynamic variables” to solve related problems; then I learned that this dynamic mode is called “non-standard evaluation” (NSE) in dplyr.

To keep a tidy environment, most dplyr verbs use tidy evaluation, a special type of non-standard evaluation used throughout the tidyverse. It defines the concept of data masking, so you can use data variables as if they were variables in the environment. Tidy selection additionally lets you choose variables by position (e.g. 1, 2, 3), name or type (e.g. is.numeric).

- If you want to learn more about the difference between non-standard evaluation and standard evaluation, the post (Dynamic column/variable names with dplyr using Standard Evaluation functions) will be helpful.
- If you want to know the data masking and tidy selection, the vignette Programming with dplyr-vigenettes or Programming with dplyr is suitable for learning.
- The dplyr team recommends reading the Metaprogramming chapters in Advanced R (a book) if we’d like to learn more about the underlying theory, or precisely how tidy evaluation differs from ordinary non-standard evaluation.

In this post, I mainly record some common solutions for using dynamic variables (also called intermediate variables, or NSE) in dplyr. Although the features above make some tasks easier, we may still be confused about how to use NSE in `mutate()`, `summarise()`, `group_by()` and `filter()`, especially inside self-defined function arguments or ggplot2 arguments.

In other words, I need to learn how to use non-standard evaluation (NSE) in dplyr calls.

Use the `.data` pronoun to pass string variables.

```r
library(tidyverse)
GraphVar <- "dist"

cars %>%
  group_by(.data[["speed"]]) %>%
  summarise(Sum = sum(.data[[GraphVar]], na.rm = TRUE),
            Count = n()) %>%
  head()
```

Use `:=` to name an output column from a string variable.

```r
var <- "value"
iris %>%
  mutate(!!var := ifelse(Sepal.Length > 5, 1, 0)) %>%
  head()
```

The easiest approach to remember and use is the constructor `sym()`, for when we want to unquote something that looks like code instead of a string; this is often used with ggplot2 and R Shiny.

```r
grp.var <- "Species"
uniq.var <- "Sepal.Width"

iris %>%
  group_by(!!sym(grp.var)) %>%
  summarise(n_uniq = n_distinct(!!sym(uniq.var)))
```
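The same trick works inside ggplot2, since `aes()` supports quasiquotation. A small sketch (`.data[[x.var]]` is the more modern equivalent):

```r
library(ggplot2)
library(rlang)

x.var <- "Sepal.Length"
y.var <- "Petal.Length"

# unquote the symbols built from strings inside aes()
ggplot(iris, aes(x = !!sym(x.var), y = !!sym(y.var), color = Species)) +
  geom_point()
```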

For the tricks inside functions, two situations should be distinguished, depending on the type of variable: env-variables or data-variables.

- Env-variables are “programming” variables that live in an environment. They are usually created with `<-`.
- Data-variables are “statistical” variables that live in a data frame. I understand them as column names.

If the function arguments are not strings, the variable names can be automatically quoted by surrounding them in doubled braces (`{{ }}`).

```r
mean_by <- function(data, var, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(avg = mean({{ var }}))
}

mean_by(starwars, group = species, var = height) %>% head()
```

We need to construct symbols to convert the strings if we’d like to pass characters as the arguments.

```r
mean_by <- function(data, var, group) {
  group <- sym(group)
  var <- sym(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, group = "species", var = "height") %>% head()
```

If you want to accept user-supplied expressions, such as `height * 100`, doubled braces work normally, but `sym()` does not. In this situation, we need to replace `sym()` with `enquo()`.

```r
mean_by <- function(data, var, group) {
  group <- enquo(group)
  var <- enquo(var)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, var = height * 100, group = as.factor(species)) %>% head()
```

dplyr allows multiple grouping variables, which can be passed through the `...` object:

```r
mean_by <- function(data, var, ...) {
  var <- enquo(var)
  data %>%
    group_by(...) %>%
    summarise(avg = mean(!!var))
}

mean_by(starwars, height, species, eye_color)
```

The above is a supplement to a previous blog post (https://www.bioinfo-scrounger.com/archives/R-dplyr-tricks/).

Dynamic column/variable names with dplyr using Standard Evaluation functions

Programming with dplyr-vigenettes

Programming with dplyr

https://stackoverflow.com/questions/27975124/pass-arguments-to-dplyr-functions


Have a good understanding of the SDTM domains and their structure; the SDTM Implementation Guide (SDTMIG) is there to help with this.

Read the SDTMIG: it will make the SDTM mapping process much smoother.

- Build the EDC from the CRF
- Get raw datasets (source data) from the EDC
- Map the raw datasets (source data) to SDTM datasets

6 key steps in a typical mapping process:

- Identify all the datasets you want to map.
- Identify all the SDTM datasets that correlate with those datasets.
- Get the dataset metadata. **(What does this mean?)**
- Get the SDTM dataset metadata that corresponds to Step 3.
- Map the variables in the datasets identified in Step 1 to the SDTM domain variables.
- Create custom domains for any other datasets that don't have corresponding SDTM datasets.

There are 9 likely scenarios in a typical SDTM mapping process. Get to grips with these, and SDTM mapping becomes much more achievable.

The direct carry forward.

Variables that are already SDTM compliant can be carried forward directly to the SDTM datasets. They don't need to be modified.

**(Nothing needs to be done; just capture it directly.)**

The variable rename

You need to rename some variables to be able to map to the corresponding SDTM variable.

**For example, if the original variable is GENDER, it should be renamed SEX to comply with SDTM standards.**

The variable attribute change

Variable attributes must be mapped as well as variable names. Attributes like label, type, length and format must comply with the SDTM attributes.

**(These variable attributes should comply with the SDTM attributes.)**

The reformat

The format that a value is stored in is changed. However the value itself does not change.

**For example, converting a SAS date to an ISO 8601 format character string. (Does this mean changing the format of the value itself?)**

The combine

Sometimes multiple variables must be combined to form a single SDTM variable.

**(It means that some SDTM variables can't be carried over directly; sometimes a transformation is needed.)**

The split

A non-SDTM variable might need to be split into 2 or more SDTM variables to comply with SDTM standards.

**(It's the opposite of the combine step.)**

The derivation

Some SDTM variables are obtained by deriving a result from data in the non-SDTM dataset.

**For example, instead of manually entering a patient's age, derive it from the date of birth and the study start date.**

The variable value map and new code list application

Some variable values need to be recoded or mapped to match the values of a corresponding SDTM variable. This mapping is recommended for variables with an attached code list whose controlled terminology can't be extended.

**You should map all values in the controlled terminology, not just the values present in the dataset. This covers values that are not in the dataset currently but may appear in future dataset updates.**

The horizontal-to-vertical data structure transpose

There are situations where the structure of the non-CDISC dataset is completely different to its corresponding SDTM dataset. In such cases you need to transform its structure to one that is SDTM-compliant.

For example, take the Vital Signs dataset. When data is collected in wide form, every test and recorded value is stored in separate variables. SDTM requires the data to be stored in a lean (long) form, so the dataset must be transposed so that the tests, values and units sit under three variables. Variables that can't be mapped to an SDTM variable go into the supplemental qualifiers.

**(Much like a wide-to-long pivot transform.)**
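The wide-to-long transpose can be sketched with `tidyr::pivot_longer()`. The SYSBP/DIABP columns and the VSTESTCD/VSORRES/VSORRESU names below are illustrative, not taken from a real CRF:

```r
library(dplyr)
library(tidyr)

raw_vs <- data.frame(
  USUBJID = c("001", "002"),
  SYSBP   = c(120, 135),   # systolic blood pressure
  DIABP   = c(80, 88)      # diastolic blood pressure
)

# one row per test: test code, result and unit under three variables
raw_vs %>%
  pivot_longer(cols = c(SYSBP, DIABP),
               names_to = "VSTESTCD", values_to = "VSORRES") %>%
  mutate(VSORRESU = "mmHg")
```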

There are things you can do to make SDTM mapping easier.

- Part of the trouble is that SDTM mapping is typically done at the end of the clinical trial process, once patient data has been collected. Retrospectively trying to make your results data fit the SDTM structure takes a lot of time and effort.
- For this reason, it's best practice to align raw datasets with CDISC standards before collecting any patient data.
- That means implementing SDTM right from the start, when designing CRFs. Doing it this way makes it much easier to convert your datasets, and it saves time later in the process when you're pulling your submission deliverables together. You can submit your study much more quickly.

All of the above is excerpted; if you have any questions, please read the original paper.

**Please indicate the source**: http://www.bioinfo-scrounger.com

The Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964. It is a data transformation commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After a Box-Cox transformation, the unobservable error and its correlation with the predictor variables can be reduced to some extent.

The key feature of the Box-Cox transformation is that it introduces a parameter lambda, which is estimated from the data itself to determine the form of transformation to apply. The Box-Cox transformation can markedly improve the normality, symmetry and homoscedasticity of the data without losing information; later generalizations and refinements have extended its range of application.

The core parameter of the Box-Cox transformation is lambda (λ), which ranges from -5 to 5, so our main goal is to select the optimal lambda value by some method. The one-parameter transform is:

```
y(λ) = (y^λ - 1) / λ   when λ ≠ 0
y(λ) = log(y)          when λ = 0
```

The y values above must be non-negative; for a dataset containing negative values, the two-parameter form is used:

```
y(λ) = ((y + λ2)^λ1 - 1) / λ1   when λ1 ≠ 0
y(λ) = log(y + λ2)              when λ1 = 0
```

Common methods for estimating lambda:

- maximum likelihood estimation
- Bayesian methods

When we want to turn non-normal data into normal data, the first thing that comes to mind is usually a log transformation; there are also the reciprocal transformation, the square-root transformation, and so on. The Box-Cox transformation is a collective name for this family: different lambda values correspond to different transformations.
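As a minimal sketch, the one-parameter transform can be written directly, showing that λ = 0 reduces to the log transform:

```r
# One-parameter Box-Cox transform (sketch): lambda = 0 gives log(y),
# other lambdas give scaled power transforms.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))   # the one-parameter form requires positive data
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox(c(1, 2, 4), 0)   # identical to log(c(1, 2, 4))
box_cox(c(1, 2, 4), 1)   # identity up to a shift: y - 1
```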

From the above, we only need to estimate a lambda in order to perform the Box-Cox transformation.

First, generate a vector from an F distribution:

```r
set.seed(250)
x <- rf(500, 30, 30)
hist(x, breaks = 15)
qqnorm(x)
```

Both the histogram and the Q-Q plot show that the simulated vector is not normally distributed. Next, use the `boxcox` function from the `EnvStats` package to perform the Box-Cox power transformation, choosing maximum likelihood to estimate lambda:

```r
library(EnvStats)
boxcox.list <- boxcox(x, objective.name = "Log-Likelihood")
> boxcox.list

Results of Box-Cox Transformation
---------------------------------

Objective Name:                  Log-Likelihood
Data:                            x
Sample Size:                     500

 lambda   Log-Likelihood
   -2.0        -429.0778
   -1.5        -334.4623
   -1.0        -264.8572
   -0.5        -221.4762
    0.0        -204.6382
    0.5        -213.9799
    1.0        -248.6916
    1.5        -307.6451
    2.0        -389.4097
```

As seen above, the log-likelihood is largest when lambda is 0, which corresponds to the log transformation. To estimate lambda directly, add the `optimize = TRUE` argument:

```r
boxcox(x, objective.name = "Log-Likelihood", optimize = TRUE)
```

Or display the lambda profile graphically:

```r
plot(boxcox.list, xlim = c(-2, 2))
```

Finally, look at the Q-Q plot after the Box-Cox transformation:

```r
plot(boxcox.list, plot.type = "Q-Q Plots")
```

References:

Design techniques for avoiding bias

In clinical trials, two important design techniques for avoiding bias are **blinding** and **randomization**; these should be routine features of the controlled clinical trials included in a marketing application.

Randomization and blinding are two common methods for minimizing bias in clinical trials. Because we do not know a drug's true efficacy in the target population (say, all first-line non-small-cell lung cancer patients worldwide), we can only infer that efficacy through medical research.

The protocol should describe specific procedures intended to minimize, as far as possible, any foreseeable irregularities during the conduct of the trial that could compromise the quality of the statistical analysis, including protocol violations, loss to follow-up and missing values. The protocol should consider how to reduce the frequency of these problems and how to handle them when they appear in the data analysis.

Blinded trials

Blinding is intended to control intentional or unintentional bias during the conduct of a clinical trial and in the interpretation of its results. Such bias arises when knowledge of the treatment influences case recruitment and allocation, patient care, patients' attitude toward treatment, assessment of endpoints, handling of losses to follow-up, exclusion of data from analysis, and so on. **Its fundamental purpose is to prevent knowledge of which treatment is being given whenever bias could arise.**

Depending on the degree of blinding, trials may be fully blinded, partially blinded or unblinded. The following terms are also in use:

- Single-blind: the investigators and/or their staff know which treatment is given but the patients do not, or the reverse.
- Double-blind: none of the patients, nor any sponsor or investigator staff involved in treatment or clinical evaluation, know who received which treatment, including those who select eligible patients, assess outcomes or evaluate protocol compliance.
- Triple-blind: in addition to the subjects and investigators, other trial personnel, including monitors, research assistants and statisticians, are also unaware of the treatment allocation.
- Open-label: an unblinded trial; everyone, including subjects, investigators, monitors, data managers and statistical analysts, knows which treatment each patient received.
- Double-dummy (double-blind, double-dummy): when the test and comparator products differ in dosage form, dosing schedule or dose, a double-blind, double-dummy technique is often needed to maintain blinding.

Precautions for unblinded designs: to minimize bias, the following measures can be considered:

- Before screening and enrollment are complete, neither subjects nor investigators know the allocation (i.e., allocation concealment)
- Where ethics permit, subjects do not learn their allocation until they finish treatment
- Use blinded data review
- The applicant should justify the choice of a partially blinded or unblinded design and detail the specific measures taken to control bias (e.g., objectively assessable endpoints to avoid assessment bias, standard operating procedures to reduce performance bias), such as a central imaging facility, a central laboratory, or an adjudication committee.

Blind review

The checking and evaluation of data carried out between trial completion (the last observation on the last patient) and unblinding, in order to finalize the planned analysis.

Premature unblinding

In a double-blind trial, the sponsor provides the investigators with a set of sealed randomization codes (in many projects nowadays, randomization is performed in a randomization system), and the trial protocol specifies the method of code-breaking and the personnel authorized to perform it.

A blinded trial is normally unblinded only when statistical analysis begins after the trial ends. To protect subject safety, however, the blind may need to be broken early in emergencies: for example, when an SAE occurs whose relationship to the study drug cannot be judged, when an overdose is taken, or when a serious drug interaction with concomitant medication occurs, and the treatment must be known urgently to decide on rescue measures.

The above is a compilation of excerpts and notes on blinding concepts.


A SAS encoding problem

After SAS is configured with the UTF-8 configuration file (the u8 sasv9.cfg), the SAS -ENCODING option becomes UTF-8. If the input data is in another encoding, such as euc-cn (Simplified Chinese, EUC), and is not transcoded into the UTF-8 SAS session, you may see the following error:

**ERROR: Some character data was lost during transcoding in the data set MYDATA.DS3. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.**

To resolve SAS encoding problems, you can follow SAS's own troubleshooting guide: Migrating Data to UTF-8 for SAS.

For the ERROR above, see: Determine Whether the CVP Engine Is Needed to Read Your Data without Truncation, i.e., invoke the CVP engine.

The first LIBNAME statement points to the original data set. Use a second LIBNAME statement to point to the location of the library that will contain the new data set.

```sas
libname mylib cvp "path to original data set";
libname mylib2 "path to new data set";
```

Use PROC DATASETS with the COPY statement and the OVERRIDE= option. When you specify OVERRIDE=(ENCODING=SESSION OUTREP=SESSION) in the COPY statement, the new data set is created in the host data representation and encoding of the SAS session that is executing the COPY statement. Add the CONTENTS statement to view a description of the content of the new data set.

```sas
proc datasets nolist;
   copy in=mylib out=mylib2 override=(encoding=session outrep=session);
   contents data=mylib2.mydata;
run;
```

Therefore, both of the following approaches work (both rest on the principle above):

```sas
libname mylib cvp "./Documents/test";
libname mylib2 "./Documents/format";
proc datasets nolist;
   copy in=mylib out=mylib2 override=(encoding=session outrep=session);
   contents data=mylib2.foo;
run;
```

Or:

```sas
libname inlib cvp "./Documents/test";
libname outlib "./Documents/format" outencoding="UTF-8";
proc datasets nolist;
   copy in=inlib out=outlib;   /* a copy step is presumably intended here, so that outlib.foo exists */
   contents data=outlib.foo;
run;
```

If you are seeing something other than the ERROR above, you can try the following methods for common encoding problems:

- Using the FILE Statement to Specify an Encoding for Writing to an External File
- Using the FILENAME Statement to Specify an Encoding for Reading an External File
- Using the FILENAME Statement to Specify an Encoding for Writing to an External File
- Changing Encoding for Message Body and Attachment
- Using the INFILE= Statement to Specify an Encoding for Reading from an External File

For the specific code, refer to: ENCODING Examples.

The above is my own process for solving SAS encoding problems, for reference only.


Hypothesis testing on a correlation is a common analysis. Usually the null hypothesis H0 is `ρ=0`, i.e., we want to see whether the correlation differs significantly from 0. In this case the statistic follows a t distribution, so we first compute the t-statistic:

```
t = r / sqrt((1 - r^2) / (n - 2))
```

Using R's `cor.test` function:

```r
> data("iris")
> cor.test(iris$Sepal.Length, iris$Petal.Length)

	Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 
```

Translated into the formula:

```r
r <- 0.87175
> r / sqrt((1 - r^2) / (150 - 2))
[1] 21.64563
pvalue <- 2 * pt(-abs(21.64563), df = 150 - 2)
```

The two approaches give the same result.

Suppose we want to compare the correlation not with 0 but with a specific `ρ0`; then we first need to apply the Fisher transformation to the correlation:

```
z = 1/2 * log((1 + r) / (1 - r))
```

What is the Fisher transformation useful for?

Fisher (1973, p. 199) describes the following practical applications of the z transformation:

- testing whether a population correlation is equal to a given value
- testing for equality of two population correlations
- combining correlation estimates from different samples

Here we mainly look at the first item above, i.e., testing against a given value.

Continuing the iris example, suppose I want to compare the correlation against `ρ0=0.8`; then:

```r
> (1/2*log((1+0.87175)/(1-0.87175)) - 1/2*log((1+0.8)/(1-0.8))) / sqrt(1/(150-3))
[1] 2.930596
> 2 * pnorm(abs(2.930596), lower.tail = F)
[1] 0.003383124
```
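The two steps above can be wrapped in a small helper (the function name is mine, not from any package); note that R's built-in `atanh()` is exactly the Fisher transformation:

```r
# Test H0: rho = rho0 via the Fisher z transformation
# (assumes a bivariate normal sample of size n)
cor_test_rho0 <- function(r, n, rho0) {
  z    <- atanh(r) - atanh(rho0)   # atanh() is the Fisher transformation
  se   <- 1 / sqrt(n - 3)
  stat <- z / se
  p    <- 2 * pnorm(abs(stat), lower.tail = FALSE)
  list(statistic = stat, p.value = p)
}

cor_test_rho0(0.87175, 150, 0.8)   # statistic ~ 2.9306, p ~ 0.00338
```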

The result above agrees with the NCSS software, but differs slightly from SAS's `proc corr` (mainly in the final p-value):

```sas
proc corr data=sashelp.iris nosimple fisher(rho0=0.8 biasadj=no);
   var SepalLength PetalLength;
run;
```

Note that the Fisher z statistic in the SAS output refers to `Zρ`, not to `Zρ-Zρ0`.

From the formula above, although the distribution of the Fisher-transformed z is not exactly standard normal, it can be treated as approximately normal as the sample size increases:

For the transformed z_r, the approximate variance V(z_r) = 1/(n-3) is independent of the correlation ρ. Furthermore, even though the distribution of z_r is not strictly normal, it tends to normality rapidly as the sample size increases for any value of ρ (Fisher 1973, pp. 200–201).
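A quick simulation (a sketch, with true ρ = 0) illustrates that the variance of the transformed sample correlation is indeed close to 1/(n-3):

```r
# Simulate the Fisher z of sample correlations from independent normals
set.seed(123)
n <- 50
z <- replicate(5000, {
  x <- rnorm(n)
  y <- rnorm(n)      # true correlation is 0
  atanh(cor(x, y))   # Fisher transformation
})
c(simulated = var(z), theoretical = 1 / (n - 3))
```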

The confidence interval is computed as:

```
z_r ± z_(1-α/2) / sqrt(n - 3)
```

This gives the confidence interval on the transformed z scale; it then has to be converted back into the confidence interval for the correlation, via r = (e^(2z) - 1) / (e^(2z) + 1):

```r
# Correlation coefficient
r <- 0.87175
# Z statistics
Z_upper <- 1/2 * log((1+r)/(1-r)) + qnorm(p = 1 - 0.05/2, lower.tail = T) / sqrt(150 - 3)
Z_lower <- 1/2 * log((1+r)/(1-r)) - qnorm(p = 1 - 0.05/2, lower.tail = T) / sqrt(150 - 3)
# Correlation confidence interval
Cor_upper <- (exp(2 * Z_upper) - 1) / (exp(2 * Z_upper) + 1)
Cor_lower <- (exp(2 * Z_lower) - 1) / (exp(2 * Z_lower) + 1)
> c(Cor_lower, Cor_upper)
[1] 0.8270314 0.9055052
```
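The same interval can be written more compactly with R's built-in `atanh()`/`tanh()`, which are exactly the Fisher transformation and its inverse:

```r
r <- 0.87175
n <- 150
# 95% CI on the Fisher z scale, then back-transform to the correlation scale
ci_z <- atanh(r) + c(-1, 1) * qnorm(0.975) / sqrt(n - 3)
tanh(ci_z)   # ~ 0.827, ~ 0.906
```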

The results above agree with both R's `cor.test` and SAS's `proc corr`, so everything checks out.

All formulas above are taken from:

SAS The CORR Procedure

NCSS Correlation

**PS. For other correlation hypothesis-testing methods and their results, see: https://www.psychometrica.de/correlation.html**, quite an interesting site...

Other references:

https://stats.stackexchange.com/questions/14220/how-to-test-hypothesis-that-correlation-is-equal-to-given-value-using-r

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Fisher_Transformation

https://cran.r-project.org/web/packages/cocor/cocor.pdf

https://www.personality-project.org/r/html/paired.r.html


What is a reference range (reference interval):

It refers to laboratory measurement data from a large healthy population, analyzed statistically by age and sex, yielding the distribution range of the vast majority of the population, which is used to set the reference limits. For abnormal values only slightly outside the reference limits, clinicians can act according to the patient's clinical presentation: they may start treatment, or simply keep the patient under observation.

What is a medical decision level (MDL):

These are values that clinicians should know and use when diagnosing and treating disease. They are limits distinct from reference limits: by observing whether a measured value is above or below them, a clinician can rule a disease in or out, grade or classify certain diseases, or estimate prognosis, prompting a particular clinical action such as further examination in some direction or a specific treatment.

Reference range vs. medical decision level

| Aspect | Reference range | Medical decision level |
|---|---|---|
| Source | Statistical analysis of a healthy population by age and sex, giving the distribution range of the vast majority of the population | Observation and accumulation of large amounts of clinical patient data, used to characterize the onset, progression and change of disease |
| Role | An indicator for judging the health status of a population; diagnosis and treatment decisions also require the clinical picture | The concentration of a measured analyte that plays a key role in diagnosing or treating a disease; a level at which clinical action must be taken |
| Values | An upper and a lower limit, or only one of the two; clinicians judge a patient's health status against these limits | Multiple upper or lower limits may be set according to different diagnostic criteria, treatment goals and choices of therapy, with a different clinical action for each limit |
| Importance | Important, especially for routine testing of the general population; significant guidance for early detection of disease | Important: whether a value is above or below these limits can rule a disease in or out, grade or classify it, or estimate prognosis, prompting further examination or treatment |

Take the white blood cell (WBC) count as an example:

- Its reference range is (4–10)×10^9/L
- Clinical significance of, and actions at, its medical decision levels:
  - 0.5×10^9/L: below this value the patient is highly susceptible to infection; preventive treatment and infection-control measures should be taken.
  - 3×10^9/L: below this value indicates leukopenia; further tests such as a differential count and review of a peripheral blood smear should be done, and the medication history reviewed.
  - 11×10^9/L: above this value indicates leukocytosis; a differential count helps analyze the cause and type, and the source of infection should be sought if necessary.
  - 30×10^9/L: above this value suggests possible leukemia; a differential count, peripheral blood smear review and bone marrow examination should be performed.

By extension, what do "normal value", "normal reference value", "reference value" and "reference range" mean?

A "normal value" is really just a concept: the measured result falls within a relatively normal range, so it may also be called the "normal range". Setting normal values follows certain requirements: they come from measurements on people in a relatively healthy state, and the limits are usually drawn to cover 95% of the measured values of the selected healthy population, so about 5% of healthy people still have results falling in the "abnormal" region.

From this you can see that the way a normal value is computed is more or less the same as a reference interval...

Therefore, a lab result slightly outside the normal range does not necessarily mean you are ill: there may be measurement error, interfering factors, or you may simply be one of the healthy people outside the 95%. "Normal value", "normal reference value", "reference value" and "reference range" all express the same idea, but the word "normal" is rather limiting, so experts now recommend the terms reference value or reference range.

The appendix of WS/T 402-2012 also describes reference intervals and cut-off values (medical decision limits) separately:

A medical decision level differs from a reference interval; it is established on the basis of other scientific and medical knowledge. It is derived differently from a reference interval and is usually tied to specific medical conditions.

How is a reference interval computed?

See Analysis of Reference Value, which follows EP28-A3C... it can also be read together with WS/T 402-2012.
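As a rough sketch, the common nonparametric approach takes the central 95% (2.5th to 97.5th percentiles) of a healthy-population sample; the data below are simulated, not real:

```r
# Simulated "healthy population" WBC values (x10^9/L), for illustration only
set.seed(1)
wbc <- rnorm(240, mean = 7, sd = 1.5)

# Nonparametric 95% reference interval: 2.5th and 97.5th percentiles
quantile(wbc, probs = c(0.025, 0.975))
```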

For the expected bias (Bc) at a medical decision level (Xc), the guideline 《免于进行临床试验的体外诊断试剂同品种比对技术指导原则》 states:

Generally, the expected bias at the medical decision level and its 95% confidence interval are compared with the acceptable-bias limit claimed by the applicant. The acceptable-bias limit is set either by the applicant, after consulting clinical institutions, according to clinical need, or by reference to relevant domestic and international standards. If the 95% confidence interval of the expected bias does not exceed the applicant's claimed acceptable-bias limit, the candidate and comparator IVD reagents show no significant bias in their results and are considered equivalent.

The method for computing the expected bias at the MDL can be found in EP09-A3, as part of its regression analysis:

- If the regression method is OLS, substitute Xc for x in the regression equation `y = a + bx`, compute `Bc = a + (b - 1) * Xc`, and then calculate its CI from the corresponding formula
- For Deming/weighted Deming regression, the jackknife method is recommended for computing the CI
- For Passing-Bablok regression, the jackknife method is likewise recommended
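The OLS case above, with made-up regression coefficients, looks like this:

```r
# Hypothetical OLS fit of candidate (y) on comparator (x): y = a + b * x
a  <- 0.2
b  <- 1.05
Xc <- 11    # medical decision level, e.g. WBC 11 x 10^9/L

Bc <- a + (b - 1) * Xc   # expected bias at Xc
Bc                       # 0.75
```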

The formulas can all be found in the EP documents, or you can call the functions in the R package `mcr` (written by Roche Diagnostics).

All of the above assumes the candidate and comparator reagents are linearly related. In the (less common) non-linear case, the expected bias at the MDL may have to be computed somewhat differently:

If the goal of the study is to provide an estimate of the expected bias at a specific medical decision level Xc, points near that value (concentration) can be used to provide such an estimate. At least 20 such points should be selected, either the 10 nearest points on each side of the MDL, or a concentration interval around the medical decision concentration. The points should be selected after ranking by the average of the candidate and comparator measurements; computing the expected bias from points ranked by a single measurement (candidate or comparator) may introduce inappropriate bias into the result. (EP28-A3C, 9.1.5)

References:


Adapted from: DATA MINING Desktop Survival Guide by Graham Williams

The goal here is to understand the Cox proportional hazards model and its predictions.

First, look at the Cox results for the lung dataset with sex as the covariate:

```r
library(survival)
l.coxph <- coxph(Surv(time, status) ~ sex, data = lung)
> summary(l.coxph)
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.021 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001
```

Here `sex=1` denotes male and `sex=2` female. From the output, coef is log(hazard ratio), i.e., the beta in the Cox regression equation, so the hazard ratio (HR) is 0.5880, i.e., exp(coef).

An HR below 1 means females have a lower hazard than males (based on the ordering of the sex values: second group relative to the first group). exp(-coef) is the reverse, first group relative to the second; its value above 1 likewise says males have a higher hazard than females.

Next, I want to use the Cox model fitted on the training set to predict prognosis for a new test set. This can be done with the `predict` function; let's look at the difference between its type arguments (`type="lp"`, `risk`, `expected` and `terms`):

```r
library(dplyr)   # for %>%, select, bind_cols (needed but not loaded in the original post)
library(tibble)  # for rownames_to_column, tibble

results <- rownames_to_column(lung, var = "id") %>%
  select(c("id", "time", "status", "sex")) %>%
  bind_cols(tibble(lp       = predict(l.coxph, type = "lp"),
                   risk     = predict(l.coxph, type = "risk"),
                   expected = predict(l.coxph, type = "expected"),
                   terms    = predict(l.coxph, type = "terms")))
> head(results, 10)
   id time status sex         lp      risk  expected        sex
1   1  306      2   1  0.2096146 1.2332026 0.8280030  0.2096146
2   2  455      2   1  0.2096146 1.2332026 1.3880051  0.2096146
3   3 1010      1   1  0.2096146 1.2332026 3.5060567  0.2096146
4   4  210      2   1  0.2096146 1.2332026 0.5185439  0.2096146
5   5  883      2   1  0.2096146 1.2332026 3.5060567  0.2096146
6   6 1022      1   1  0.2096146 1.2332026 3.5060567  0.2096146
7   7  310      2   2 -0.3214090 0.7251266 0.4999485 -0.3214090
8   8  361      2   2 -0.3214090 0.7251266 0.6107055 -0.3214090
9   9  218      2   1  0.2096146 1.2332026 0.5367625  0.2096146
10 10  166      2   1  0.2096146 1.2332026 0.3309150  0.2096146
```

With `type="lp"`, the computed values are the components of the linear predictor. A value above 0 means the subject has a higher risk than the mean survival (computed from the training set); below 0 means lower risk. Following that logic, it is computed as:

```r
> (lung[1, "sex"] - l.coxph$means) %*% coef(l.coxph)["sex"]
          [,1]
[1,] 0.2096146
```

With `type="risk"`, the computed value is the exponentiated result (the risk score exp(lp)), i.e., the subject's risk value. A value above 1 means the subject is above the population average, below 1 the reverse. By sorting on risk, you can pick out, say, the top 20% of subjects in the population being predicted:

```r
> exp((lung[1, "sex"] - l.coxph$means) %*% coef(l.coxph)["sex"])
         [,1]
[1,] 1.233203
```
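As a sanity check (a sketch of mine, not from the original post), the risk prediction is exactly the exponentiated linear predictor:

```r
library(survival)
fit <- coxph(Surv(time, status) ~ sex, data = lung)

# type = "risk" should equal exp() of type = "lp"
all.equal(predict(fit, type = "risk"),
          exp(predict(fit, type = "lp")))
```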

With `type="expected"`, the computed value is the expected number of events for that subject over the follow-up period. The survival probability for a subject is equal to exp(-expected).

With `type="terms"`, if the Cox model contains only one covariate (and it is binary), the result is the same as lp. As I understand it, this type refers to the model terms.

Now suppose I also add the `ph.ecog` covariate:

```r
options(na.action = na.exclude)
l.coxph <- coxph(Surv(time, status) ~ sex + ph.ecog, data = lung)
results <- rownames_to_column(lung, var = "id") %>%
  select(c("id", "time", "status", "sex", "ph.ecog")) %>%
  bind_cols(tibble(lp       = predict(l.coxph, type = "lp"),
                   risk     = predict(l.coxph, type = "risk"),
                   expected = predict(l.coxph, type = "expected"),
                   terms    = predict(l.coxph, type = "terms")))
```

Check the number of combinations of these covariates:

```r
> table(results$sex, results$ph.ecog)
   
     0  1  2  3
  1 36 71 29  1
  2 27 42 21  0
```

Then the lp values take 7 distinct levels (one per non-empty combination):

```r
> length(table(results$lp))
[1] 7
```

Whereas from the terms output, there are only 2 + 4 = 6 distinct values:

```r
> length(table(results$terms))
[1] 6
```

The above is a simple compilation of material found online.


Adding comments while writing scripts is an essential step; it helps both yourself and others, and is sometimes a required part of the job...

My usual style is a `#` followed by dashes; sometimes, to keep it simple, I just type a `#` and then a run of `-` characters. But to keep the code looking tidy, you might deliberately keep the number of dashes uniform; surely you don't want to count the dashes every time?

RStudio offers a quick way: press `Ctrl + Shift + R`, type the comment text, and a comment header line is generated. Combined with RStudio's top-level outline, you can jump to each comment header at any time:

```r
# This is the first comments header ---------------------------------------
```

No more deliberately counting dashes; the format is the same every time.

If you find a single comment line or comment block not pretty enough, or want to put more into it, consider the `bannerCommenter` package, which offers functions such as `banner`, `block`, `boxup` and `open_box`.

My personal favorite is `banner`, for example:

```r
> bannerCommenter::banner("This is the first comments header", centre = TRUE, bandChar = "-")

##---------------------------------------------------------------
##              This is the first comments header              --
##---------------------------------------------------------------
```

Although the comment is displayed in the console window, it has in fact already been saved to the system clipboard; just press `Ctrl+V` in the script to paste it in.

If you want to add fixed project information to your code, such as project name, author, date and description, the best approach is to add a header to the script; package loading, global variables and fixed boilerplate code can also live in the script header.

Someone has compiled the key pieces of information to include, such as:

- **Script name:** I try to use descriptive names. Something like summarising_soil_hazards.R is almost always going to be better than an uninformative name like script1.R…
- **Purpose of the script:** Just writing this down often helps clarify what exactly I am trying to do.
- **Author(s):** I share many of my scripts with my students, colleagues and clients – it is good to know where the script originated if they have any questions, comments or wish to suggest improvements
- **Date Created:** This date is automatically filled in with my template script. (Some people also like to include an updated field, but I tend to forget to fill this in and would rather get this data from version control)
- **Copyright statement / Usage Restrictions:** For intellectual property (IP) reasons it is good to know where the copyright resides, and how you are happy for people to use your work. You may wish to use a formal type of licence.
- **Contact Information:** It helps if people know who to contact, and how they can reach you the best. It's not uncommon for me to include both my personal and work email accounts. I do this in case one of them ceases to be in service in the future for whatever reason.
- **Notes:** This is a free-text space which I use to jot down any thoughts or more detailed notes about the script, or even work to do.

Often the simplest method is to find someone else's header template online, tweak it until it meets your own needs, save it locally, and copy/paste it whenever needed.

RStudio provides an even more convenient tool: **snippets**

Click RStudio's Tools -> Global Options -> Code -> Editing -> "Edit Snippets", scroll to the very bottom, and enter the template below (customize the content as you like):

```
snippet template_header
	## ---------------------------
	##
	## Script name:
	##
	## Purpose of script:
	##
	## Author: Dr. Timothy Farewell
	##
	## Date Created: `r paste(Sys.Date())`
	##
	## Copyright (c) Timothy Farewell, `r paste(format(Sys.Date(), "%Y"))`
	## Email: hello@timfarewell.co.uk
	##
	## ---------------------------
	##
	## Notes:
	##
	##
	## ---------------------------

	## set working directory for Mac and PC
	setwd("~/Google Drive/")            # Tim's working directory (mac)
	setwd("C:/Users/tim/Google Drive/") # Tim's working directory (PC)

	## ---------------------------

	options(scipen = 6, digits = 4) # I prefer to view outputs in non-scientific notation
	memory.limit(30000000)          # this is needed on some PCs to increase memory allowance, but has no impact on macs.

	## ---------------------------

	## load up the packages we will need: (uncomment as required)
	require(tidyverse)
	require(data.table)
	# source("functions/packages.R")  # loads up all the packages we need

	## ---------------------------

	## load up our functions into memory
	# source("functions/summarise_data.R")

	## ---------------------------
```

After saving, type template_header in a script and press the Tab key to expand it.

I find it very handy.

References:

https://bookdown.org/yih_huynh/Guide-to-R-Book/r-conventions.html

bannerCommenter

My easy R script header template


The SAS System, in full the Statistical Analysis System, was originally written by two biostatistics graduate students at North Carolina State University; the SAS Institute was founded in 1976 and formally released the SAS software. Over the years, SAS has been adopted by nearly thirty thousand organizations in more than 120 countries and regions, with over three million direct users, spanning finance, healthcare, manufacturing, transport, telecommunications, government, education and research.

I heard of SAS back in my student days, but it is extremely expensive... and it doesn't seem widely used in academia? So I had never touched it.

Recently I happened to need the SAS IML module, mainly because it supports calling R and doing matrix computations!!! Some of the code can build on my existing R background. So here is a brief note on how to configure the SAS/IML module.

On top of Base SAS, different modules can be added for different capabilities: SAS/STAT (statistical analysis), SAS/GRAPH (graphics), SAS/QC (quality control), SAS/ETS (econometrics and time series analysis), SAS/OR (operations research), SAS/IML (interactive matrix programming language), SAS/FSP (interactive menu-driven fast data processing), SAS/AF (interactive full-screen application development), and so on.

I won't cover the other modules here (I haven't figured them out either...). The IML module requires an additional fee when purchasing SAS; once it is installed alongside Base SAS, matrix computation works normally, for example:

```sas
proc iml;
   a = {1 2, 3 4};
   b = {0 2, 1 1};
   c = a#b;   /* elementwise product */
   d = a*b;   /* matrix product */
   print c, d;
quit;
```

To call R code from within the IML module, some configuration is needed. First install R itself (obviously required); for the choice of R version, see: What versions of R are supported by SAS?

As you can see there, the range of supported R versions depends on the IML version, which can be determined with the following code:

```sas
/* List all SAS modules */
proc product_status;
run;
```

For instance, my IML is 15.2, so R 3.6.3 should be installable. (PS: 4.x.x can actually be installed, but support for the newest R versions seems shaky, so install 3.6.3.)

After installing R, add it to the system environment variables, then run the following code to check whether RLANG is enabled:

```sas
PROC OPTIONS OPTION=RLANG;
RUN;
```

If the log says `RLANG Support access to R language interfaces`, everything is fine and R code can be called normally. Usually, however, it says `NORLANG Do not support access to R language interfaces`.

That message tells us we need to add `-RLANG` to the SAS configuration file. Open `C:\Program Files\SASHome\SASFoundation\9.4\sasv9.cfg` and add the following at the first line:

```
-RLANG
-SET R_HOME "C:\Program Files\R\R-3.6.3"
```

Note: the configuration above must be placed above the existing configuration line `-config "C:\Program Files\SASHome\SASFoundation\9.4\nls\zh\sasv9.cfg"` in the file.

Restart the SAS process, then try again:

```sas
PROC OPTIONS OPTION=RLANG;
RUN;
```

In principle, it should now report that the R language is accessible. Test the R version with the following code:

```sas
proc iml;
   submit / r;
      print(R.version)
   endsubmit;
quit;
```

Next, test the conversion between SAS and R data formats with `ExportDataSetToR`, which reads a SAS dataset, converts it, and hands it to R functions for processing. There are several similar routines, such as `ExportMatrixToR`, `ImportDataSetFromR` and `ImportMatrixFromR`:

```sas
proc iml;
   call ExportDataSetToR("Sashelp.Class", "class");
   submit / R;
      names(class)
      class
   endsubmit;
quit;
```

Note: if the code above fails with `ERROR: SAS is unable to transcode character data to the R encoding.`, the possible causes are:

- A 4.x.x version of R is being used; switching to a 3.x.x version should fix it
- The R and SAS encodings differ; in my case, SAS supported Chinese encoding but R was English-only and did not support Chinese, hence the error. Switching R to a Chinese-capable encoding solved it

One last gripe: there is surprisingly little SAS material online, and most of it is several years old; Google rarely turns up anything recent...

References:

Calling R Functions from SAS

Using R in the SAS System

SAS and R - stop choosing, start combining and get benefits!

SAS/IML User's Guide

What is the encoding problem when Calling R from Proc IML

What versions of R are supported by SAS?

Twelve advantages to calling R from the SAS/IML language

Usage Note 54806: Using R in the Open Source Integration node in SAS® Enterprise Miner(tm)

Calling R's SMOTE method from SAS

https://support.sas.com/kb/54/806.html

