This is reference to the 2.3 section of Data management and 2.4 Date and time variables in <SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)>.
在R中常说的数据集操作是指处理数据框类型的数据,当然有时也会是其他的数据类型。在SAS中就是数据集,SAS相比其他编程方法来说数据类型还是太少了。。
处理数据,常见的不外乎combination, collation, and subsetting
Subsetting datasets by rows
在R中比较简单,根据行索引即可提取任意行的数据集;在SAS则可以通过firstobs
和obs
来定义
提取起始于第i行的数据集
data class;
set sashelp.class (firstobs=5);
run;
提取前i行的数据集
data class;
set sashelp.class (obs=5);
run;
提取第i行到第j行的数据集
data class;
set sashelp.class (firstobs=5 obs=10);
run;
Subsetting datasets by observations
这是指根据变量过滤数据集的行,在R中常用的函数是dplyr::filter()
;而是SAS中则是搭配where
语句使用
# R code
filter(iris, Species == "setosa")
# SAS code
data class;
set sashelp.class(where=(Sex="M"));
run;
Splitting a Dataset
在R中,可以相当于过滤行,即可用dplyr::filter()
函数来实现;在SAS中则可以通过多个IF语句分割数据集
data female male;
set sashelp.class;
if Sex="F" then output female;
if Sex="M" then output male;
run;
Drop or keep variables in a dataset
这是指根据变量过滤数据集的行,在R中常用的函数是dplyr::select()
;在SAS中则可以通过keep
来实现,如:
# R code
dplyr::select(iris, c("Sepal.Length", "Sepal.Width"))
# SAS code
data class;
set sashelp.class(keep= Name Sex Age);
run;
Random sample of a dataset
这个在simulation中蛮好用的,从逻辑上将实现方式有很多,类似于subsetting datasets;但是也有现成的函数可以直接用,在R中可以考虑用dplyr::sample_n()
函数,在SAS则是surveyselect
# R code
# replace = FALSE
dplyr::sample_n(iris, 10)
# SAS code
/* sample without replacement 简单随机抽样*/
proc surveyselect data=sashelp.class out=outds n=10 method=srs; run;
Observation number
输出观测编号,也就是行号,或者说行索引;方法肯定很多,比如SAS是用_n_
变量,R则可以row.names()
# R code
row.names(iris)
# SAS code
data _null_;
set sashelp.class;
put _n_;
run;
Keep unique values
这种可以分成两种情况:
- 整个数据集去重
- 基于指定的变量(列)来去重
在R中可以用dplyr::distinct()
函数,在SAS中则是用sort
步和nodupkey
参数
# R code
unique(iris) # All duplicate rows removed
dplyr::distinct(iris, Species, .keep_all = TRUE) # Species rows removed
# SAS code
/*Romove duplicated value by all variables*/
proc sort data=sashelp.retail out=retail_without_duplicated_values nodupkey;
by _all_;
run;
/*Remove duplicatd values by Year variable*/
proc sort data=sashelp.retail out=retail_with_unique_value nodupkey;
by Year;
run;
Identify duplicated values
这个与上述的 keep unique values 刚好相反,返回有重复的行;在R中对于向量可以用duplicated
函数,对于数据框则需要几个函数配合以下;在SAS中还是用sort
步和nouniquekey
参数
# R code
iris %>% add_count(Species) %>% filter(n>1)
/*Return duplicatd values by Year variable*/
proc sort data=sashelp.retail out=retail_with_duplicated_value nouniquekey;
by Year Month;
run;
Convert from wide to long format or long to wide format
这个在数据处理中是一个非常常见的需求,俗称长数据和宽数据的转化,如宽转长或者长转宽;在R中,现在比较好用的函数是tidyr::pivot_longer()
or tidyr::pivot_wider()
;在SAS则是用transpose
步
# R code
# wide to long transform
relig_income %>%
pivot_longer(!religion, names_to = "income", values_to = "count")
# long to wide transform
fish_encounters %>%
pivot_wider(names_from = station, values_from = seen, values_fill = 0)
# SAS code
/*proc transpose, the same as pivot for long to wide or wide to long*/
data wide;
input SID $ Programming State English;
datalines;
S01 98 100 80
S02 84 98 94
S03 89 92 88
;
run;
/*wide to long transform*/
proc transpose data=wide out=long(rename=(_name_=Coursename col1=Score));
var Programming State English;
by SID;
run;
/*long to wide transform*/
proc transpose data=long out=Rewide(drop=_name_);
var Score;
by SID;
id Coursename;
run;
Concatenate and stack datasets
将一个数据集拆分成不同变量的数据集,R会使用列索引的方式,SAS则可以用keep
和output
搭配,如:
/* one to one, having sample observation*/
data class1(keep=name sex) class2(keep=age height weight);
set sashelp.class;
output class1;
output class2;
run;
常见的合并数据集,可以分成列合并和行合并,对应在R中常用的就是rbind()
和cbind()
;在SAS中方法就比较多了,如set
、merge
/*Set*/
data class;
set class1;
set class2;
run;
/*Sometime is the same as merge*/
data class;
merge class1 class2;
run;
以上是简单的例子,其实对数据的拼接也有比较复杂的用法,比如除了set
外,还有append
方法,或则SQL的union等
Merge datasets
有时我们是想根据指定的列(或变量),连接两个数据集,在SQL中就是left join, right join 或者 inner join,也就是merge datasets;在R中,可以使用left_join()
、right_join()
、inner_join()
等,跟SQL很类似,非常好记;在SAS中,要么先sort再merge,要么用sql语句
# R code
band_members %>% inner_join(band_instruments, by = "name")
# SAS code
/*For correct way is first sort and then merge*/
data class1(keep=name sex) class2(keep=name age height weight);
set sashelp.class;
output class1;
output class2;
run;
proc sort data=class1;
by name;
run;
proc sort data=class2;
by name;
run;
data class;
merge class1 class2;
by name;
run;
/*proc sql to implementary left join, right join, inner join*/
data class1(keep=name sex) class2(keep=name age height weight);
set sashelp.class;
output class1;
if _n_ in (1,5,10,15) then output class2;
run;
data class2
set class2;
if name="Janet" then name="Janey";
run;
proc sql;
create table class_left as
select a.*,b.*
from class1 as a left join class2 as b
on a.name=b.name;
quit;
Select, copy, change name, delete datasets
这里是单指SAS中datasets的一些操作了
/*select, copy, change name, delete*/
proc datasets;
copy in=sashelp out=work;
select class;
quit;
proc datasets lib=work;
change class=student;
quit;
proc datasets lib=work;
delete student;
quit;
/*delete all datasets*/
proc datasets lib=work kill memtype=data;
run;
/*only save class data*/
proc datasets lib=work;
save class;
run;
Create date variable
创建日期型变量,在R中常用as.Date()
函数,在SAS中则是input转化或者mdy用法
值得注意的是,前者是返回日期型变量,后者SAS则是返回距离1960.1.1号的天数,是数值型。。。
# R code
as.Date("2014-04-29")
as.Date(Sys.time())
data dt;
dayvar = input("04/29/2014", mmddyy10.);
dayvar2 = mdy(4, 29, 2014);
todays_date = today();
run;
参考资料:
SAS and R: Data Management, Statistical Analysis, and Graphics (second edition)
本文出自于http://www.bioinfo-scrounger.com转载请注明出处