Statistical functions and procedures (SAS & R)

This is reference to the "Chapter 3 Statistical and mathematical functions", "Chapter 4 Programming and operating system interface" and "Chapter 5 Common statistical procedures" in <SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>.

本篇主要是看下常见的统计学方法在R以及SAS中是如何实现的，主要是看后者。。。

Probability density function

比如已知Z-score，在正态分布的CDF曲线上，想返回从负无穷到Z值的积分，在R中用pnorm()，在SAS中则是cdf方法

# R code
y = pnorm(1.96, mean=0, sd=1)

# SAS code
data normal;
  y = cdf("NORMAL", 1.96, 0, 1);
run;

其他分布的用法也类似，都有比较固定的使用方法，搜一搜就有了

Setting the random number seed

设置随机种子

# R code
set.seed(12345)

# SAS code
call streaminit(12345);

Normal random variables

生成随机数，假如是要满足正态分布的随机数，在R中用rnorm()；在SAS中则是normal或者rand，但是其只输出单个值，需要利用for循环生成多个随机数

# R code
rnorm(10)

# SAS code
data rand;
  call streaminit(12345);
  do i=1 to 10;
    x=rand("normal", 0, 1);
    output;
  end;
run;

Basic mathematical functions

一些数学运算相关的函数，在R和SAS中都差不多。。。

Integer functions

四舍五入比较常用，如在SAS中有：

data sign;
  nextintx = ceil(3.49);
  justintx = floor(3.49);
  roundx = round(3.49, 0.1);
  roundint = round(3.49, 0.01);
  movetozero = int(3.49);
run;

在R中还多signif()，trunc()函数跟SAS的int一致（应该是吧？）

Looping

在R中，最常见的循环是用for循环，或者其他函数（如apply家族或者ddply等函数）；在SAS中则是do end, do wile, do until等

# R code
x <- numeric(10)
for (i in 1:10){
    x[i] <- rnorm(1)
}

# SAS code
data;
  do i = 1 to 10;
    x = normal(0);
    output;
  end;
run;

Sequence of values or patterns

生成一系列的数字，如1，3，5，7，9等数列，在R中用seq()函数即可，SAS稍微麻烦点，需要用loop。。。

data ds;
  do x = 1 to 9 by 2;
    output;
  end;
run;

Grid of values

生成有重复的数值/字符，在R中用rep()函数就可实现，SAS则还是得用loop?

# R code
data.frame(x1 = rep(1:3, each=2), x2 = rep(c("M","F"), time=3))

# SAS code
data ds;
  do x1 = 1 to 3;
    do x2 = "M","F";
      output;
    end;
  end;
run;

Summary statistics

SAS的summary分析总让我觉得SAS是一个不算编程语言，只能说其是一个分析工具。。。在summary分析中SAS很“贴心”将多个计算函数（如mean, stdev, max, min等等）放在某个proc中，但是这样会让整个编程语言缺少灵活性，变得非常的“死板”。。让其理念跟其他编程语言完全不一样了，总觉得怪怪的。。。这里吐槽下。。。

在R中，或者Python中，summary分析，一般常见就是要用哪个函数就调用对应的，然后输出成一个数据框/数组即可

在SAS中，则是调用proc mean或者proc univariate，如以下所示：

Using proc means for summary statistics

/*proc means contains printed output and data output*/
proc means data=sashelp.iris N mean stddev max min;
  class Species;
  var PetalLength PetalWidth SepalLength SepalWidth;
  output out=ds;
run;

Using proc univariate for detailed summary statistics（注：需要用ods输出结果）

ods output BasicMeasures=ss;
/*ods trace on/listing;*/
proc univariate data=sashelp.iris all;
  class Species;
  var PetalLength PetalWidth SepalLength SepalWidth;
run;
/*ods trace off;*/

Calculating Percentiles

比如要生成分位数（25%，50%，75%，95%），用univarate如下：

ods output Quantiles=qt;
proc univariate data=sashelp.iris all;
  var PetalLength PetalWidth SepalLength SepalWidth;
run;

或者用output自定义输出分位数

proc univariate data=sashelp.iris;
  class Species;
  var PetalLength;
  output out=iris_percentile 
    pctlpts = 0,25,50,75,95,100
    pctlpre = P_;
run;

也可以用means实现百分位数的计算

proc means data=sashelp.iris p25 p50 p75 p95;
  class Species;
  var PetalLength;
  output out=perc
    p25=p_25
    p50=p_50
    p75=p_75
    p95=p_95;
run;

在R中则是用quantile()函数，但是！假如是用默认参数，如下所示这种，其结果跟SAS是不一样的

quantile(c(1:10), c(0.25,0.5,0.75,0.95))

必须指定分位数的计算方法，如type = 3，才能输出SAS对应的结果

quantile(c(1:10), c(0.25,0.5,0.75,0.95), type = 3)

Centering, normalizing, and scaling

若想对数据集的列（变量）standardized，如Standardizing to a Given Mean and Standard Deviation，在SAS是用standard步，可指定mean和std，输出的结果即是Z-score标准化后的

proc standard data=sashelp.iris out=iris2 mean=0 std=1;
  var PetalLength PetalWidth;
run;

在R中在可以使用scale()函数，搭配apply()等函数可以依次对各个变量做scale

scale(iris$Sepal.Length)

Mean and 95% confidence interval

计算基于正态分布的均值及置信区间，SAS还是用means步，这个proc真是啥都放在里面。。。lclm和uclm分别指上限和下限

proc means data=sashelp.iris lclm mean uclm;
  var PetalLength;
run;

在R里，可以用一些函数比如qt()计算t统计量或者qnorm()计算z统计量，然后利用置信区间的公式计算最终的CI；也可以用t.test()计算

t.test(iris$Sepal.Length)$conf.int

Contingency tables

这个contingency table在计算诊断指标时非常常用，SAS把这放在了freq步里，如下所示；nopercent nocol norow用于规定display table；在R中则常用table()函数并联合其他计算函数或者公式

data dumy;
  input x y @@;
  datalines;
0 1 1 0 1 1
0 1 1 1 1 0
1 1 1 1 0 0
1 0 0 0 0 1
run;

proc freq data=dumy;
  tables x*y / out=freqtable nopercent nocol norow;
run;

若想计算Chi-Square或者RR等值，则增加chisq、relrisk参数；而R则是用chisq.test()

proc freq data=dumy;
  tables x*y / chisq relrisk;
run;

若想计算Fisher's exact test，则添加exact参数；而R则是用fisher.test()

若想计算Kappa值（一致性分析），则添加agree参数；而R则是用kappa()

proc freq data=dumy;
  tables x*y / agree;
run;

还可以计算诊断相关的灵敏度以及特异度等指标，根据公式计算即可

有点不太习惯SAS这种非开源的工具，其会根据需要将一些固定的分析放到某些proc步中，虽然这样看起来蛮方便的，但是实际使用中会使得分析变得复杂，尽管其有很详细的help文档。。。

Correlation

上面是定性的一些统计指标，下面列举些定量的指标；如相关系数在诊断试验中很常见，常用于定量试剂，在SAS中用corr步，在R中则是cor()或者cor.test()函数

# R code
cor.test(iris$Sepal.Length, iris$Sepal.Width)

# SAS code
proc corr data=sashelp.iris;
  var PetalLength PetalWidth;
run;

Tests for normality

正态性检验，如the Shapiro-Wilk test, the Kolmogorov-Smirnov test

proc univariate data=sashelp.iris normal;
  var PetalLength;
run;

在R中则是选择方法对应的函数，如shapiro.test(x)，或者一些相关的R包（会将一系列方法整合在一起）

Student's t test

T检验，SAS支持组间T检验通过一个分类变量一个对应值的形式；**注：*结果中会输出方差齐性和不齐性两种结果**

data scores;
   input Gender $ Score @@;
   datalines;
f 75  f 76  f 80  f 77  f 80  f 77  f 73
m 82  m 80  m 85  m 85  m 78  m 87  m 82
;
run;

proc ttest data=scores;
  class Gender;
  var Score;
run;

而R则是用t.test(y ~ x, data)，或者t.test(y1, x1)等方法

Nonparametric tests

上述的T检验是参数假设检验，假设其满足正态分布；但是有时非参的假设检验更加合适，如Wilcoxon test, Kolmogorov-Smirnov test；在SAS中可以用npar1way步

proc npar1way data=scores wilcoxon edf;
  class Gender;
  var Score;
run;

在R中则是使用wilcox.test()

Logrank test

Logrank test在Kaplan-Meier plot和Cox proportional hazards model中比较常见；在SAS中可以用lifetest步

proc lifetest data=sashelp.BMT plots=survival(atrisk=0 to 2500 by 500);
   time T * Status(0);
   strata Group / test=logrank adjust=sidak;
run;

在R中则是用survival包的survdiff()函数

参考资料：

SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)

本文出自于http://www.bioinfo-scrounger.com转载请注明出处