Programming in pharma using SAS

There is no denying that SAS is an essential skill for statistical programmers in the pharmaceutical industry. SAS is a general-purpose language whose usage varies with the requirements of each field, so if you want to become a qualified statistical programmer, I think it makes sense to learn SAS by following the actual requirements of the pharmaceutical industry. The purpose of this post is therefore to record some practical SAS applications so that I can understand and remember the syntax clearly.


As a simple first example, suppose you have many clinical datasets and want to find out which of them contain a certain variable, such as Sex. What should you do? Thinking in R, I would read all the datasets, collect their column names (variables), and then check which datasets contain this variable. Doing the same in SAS would probably require a macro built on the open and varnum functions. However, there is a simpler trick, as follows:

libname mydata "C:/Users/anlan/Documents/CYFRA";

data sex_tb;
  set sashelp.vcolumn;
  where libname="MYDATA" and name="SEX";
  keep memname name type length;
run;

Just filter sashelp.vcolumn by library (libname="MYDATA") and column name (name="SEX"). You will not find this tip in most common books, but it’s very useful in our work; I generally call this kind of knowledge “work experience”.
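The same lookup can also be sketched with PROC SQL against dictionary.columns, the table that sashelp.vcolumn is a view of. This is only a minimal sketch that assumes the MYDATA library from above is already assigned. The upcase() call is my own addition: the name column keeps the case in which each variable was created, so a variable stored as Sex or sex would be missed by an exact comparison with "SEX".

```sas
proc sql;
  /* dictionary.columns holds one row per variable per dataset */
  create table sex_tb as
  select memname, name, type, length
  from dictionary.columns
  where libname = "MYDATA"       /* library names are stored in upper case */
    and upcase(name) = "SEX";    /* matches Sex, sex and SEX alike */
quit;
```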


The second example: how do you derive a date from a string? I think any derivation can be split into two steps, matching and extracting. In R, the popular string package is obviously stringr, so what about SAS? SAS does not have an equivalent toolkit, but some functions are useful, such as prxchange. After reading the documentation of prxchange, I found that it uses Perl-style regular expressions, as follows:

data work.demo;
  input tmp $20.;
  datalines;
AE 2021-01-01
CM 2021-02-01
MH 2021-03-01
;
run;

data demo_date;
  set demo;
  format date YYMMDD10.;
/*  date=prxchange("s/\w+\s+(.*)/$1/",1,tmp);*/
  date=input(prxchange("s/\w+\s+(.*)/$1/",1,tmp),YYMMDD10.);
run;
proc contents data=demo_date; run;

If you only want to check whether a word or number occurs in a string, in other words to match, prxmatch and find are helpful. If you want to extract text rather than just match it, prxchange or prxparse + prxposn is preferred. By the way, the latter is closer to the R way of thinking.

data work.num;
  input tmp $;
  datalines;
AE01
CM02
MH03
;
run;

data num_ext;
  length tmp type1 type2 $ 20;
  keep tmp type1 type2;
  re=prxparse("/([a-zA-Z]+)(\d+)/");
  set num;
  if prxmatch(re, tmp) then
    do;
      type1=prxposn(re, 1, tmp);
      type2=prxposn(re, 2, tmp);
      output;
    end;
run;
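
For the pure matching case mentioned earlier, a minimal sketch using the work.num dataset created above might look like this. The dataset name num_match and the variables has_digit and has_ae are made up for illustration. prxmatch returns the position where the pattern first matches (0 if there is no match), and find returns the position of a substring (0 if it is absent).

```sas
data num_match;
  set num;
  /* position of the first digit in tmp, 0 when there is none */
  has_digit = prxmatch("/\d/", tmp);
  /* position of the literal substring "AE", 0 when it is absent */
  has_ae = find(tmp, "AE");
run;
```

Because both functions return 0 for "no match", the results can be used directly in an if condition, just like prxmatch(re, tmp) in the extraction step above.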

The third question: what’s the difference between an informat and a format?

  • An informat describes how the data is presented in the raw input (for example, a text file), so that SAS knows how to read it.
  • A format describes how you want SAS to present the data when you look at it.

Remember, formats do not change the underlying data, just how it is displayed on the screen or in printed output.

For instance, reading dates with the date9. informat and displaying them as yymmdd10. shows how the format statement is used.

data aes;
input @1 subject $3. @5 ae_start date9. 
  @15 ae_stop date9. @25 adverse_event $15.;
format ae_start yymmdd10. ae_stop yymmdd10.;
datalines;
101 01JAN2004 02JAN2004 Headache
101 15JAN2004 03FEB2004 Back Pain
102 03NOV2003 10DEC2003 Rash
102 03JAN2004 10JAN2004 Abdominal Pain
102 04APR2004 04APR2004 Constipation
;
run;
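
To see that a format only changes how a value is displayed, one could look at the same date unformatted and formatted side by side. This is a minimal sketch that assumes the aes dataset above; the dataset name aes_check and the variables raw_start and shown_start are hypothetical.

```sas
data aes_check;
  set aes;
  /* a new variable created by assignment has no format attached,
     so it shows the stored number (days since 01JAN1960) */
  raw_start = ae_start;
  /* put() applies a format and returns the displayed text */
  shown_start = put(ae_start, date9.);
run;

proc print data=aes_check;
  var subject ae_start raw_start shown_start;
run;
```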

Another case: we typically store the number 1 for male and 2 for female. It would be embarrassing to hand a client a report full of unexplained 1s and 2s, so we need a format to dress up the report.

data report;
input ID Gender State $;
datalines;
100001 1 LA
100002 2 LA
100003 . AL
;
run;

proc format ;
value sex
  1 = "Male"
  2 = "Female"
  . = "Unknown";
run ;

proc print data = report;
  var id gender state;
  format gender sex.;
run;

An informat is usually used with the input statement to read differently styled raw values into SAS variables.

Informat usage:

  • Character informats: $INFORMATw.
  • Numeric informats: INFORMATw.d
  • Date/time informats: INFORMATw.

data death;
input @1 subject $3. @5 death 1.;
datalines;
101 0
102 0
;
run;
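
The three informat families listed above can be sketched in a single input statement. This is only an illustration with made-up data; the dataset name visits and its variables are hypothetical.

```sas
data visits;
  input @1  subject  $3.      /* character informat: $w.          */
        @5  dose     comma6.  /* numeric informat: reads 1,250    */
        @12 visit_dt date9.;  /* date informat: reads 01JAN2004   */
  format visit_dt yymmdd10.;  /* display only, stored value is a number */
  datalines;
101 1,250  01JAN2004
102   500  15FEB2004
;
run;
```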


The fourth question: what’s the difference between the keep= option on the set statement and on the data statement?

If you place the keep= option on the set statement, SAS keeps only the specified variables when it reads the input dataset. If you place it on the data statement, SAS reads every variable and keeps only the specified ones when it writes the output dataset. It follows that the former is faster than the latter when the input dataset is very large, because fewer variables are read into the step.
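
A minimal sketch of the two placements, reusing the aes dataset from the format example; aes_in, aes_out and duration are hypothetical names.

```sas
/* keep= on the input dataset: only subject and ae_start are read */
data aes_in;
  set aes(keep=subject ae_start);
run;

/* keep= on the output dataset: all variables are read and available
   during the step, but only subject and duration are written */
data aes_out(keep=subject duration);
  set aes;
  duration = ae_stop - ae_start + 1;
run;
```

In the second step ae_start and ae_stop can still be used in the calculation even though they are not written out, which would not work if they had been left out of a keep= on the set statement.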


The above are some of my interview questions. It’s a pity that I wasn’t fully prepared, as I am proficient in R but not in SAS. So I’m planning to use my free time to learn and summarize SAS, just as I did when I learned Perl, R and Python.

By the way, I think is a good book, as it not only provides some SAS cases but also introduces knowledge from the pharmaceutical industry.


Please indicate the source: http://www.bioinfo-scrounger.com