0%

Handling Duplicates and Missing values - SASlearner

Handling the duplicates and missing values in data manipulation is a very common process. This post is taking a few examples to list how to accomplish it from a datasets in SAS.

In R, I prefer to use unique() or dplyr::distinct toolkit to remove duplicates, and is.na(), na.omit() functions or external packages like mice to handle missing values.

Duplicates

We can use the proc sort to remove rows that have duplicate values across all columns of the dataset.

proc sort data = sashelp.cars(keep = make type origin) out = without_dups nodupkey;
    by _all_;
run;

In some special condition, we would like to select only unique/distinct rows from a dataset as per a specific column and keep the first row of values for that column.

proc sort data = sashelp.cars out = make_without_dups nodupkey;
    by Make;
run;

Missing Values

In clinical trial data, missing data or missing values is a common occurrence when no data is stored for the variable in the observation. It can be occurred in numeric or character variables as a single period (.).

For all we know, according to the missing values, the reasons can be summarized as below:

  • Missing completely at random (MCAR)
  • Missing at random (MAR), not completely random
  • Not missing at random (NMAR)

So how to handle the missing values?

Removing observations

Suppose we did a reaction time study with six subjects, and the subjects reaction time was measured by three times. That data is as shown below.

data times;
    input id trial1 trial2 trial3;
    cards;
1 1.5 1.4 1.6 
2 1.5  .  1.9 
3  .  2.0 1.6 
4  .   .  2.2 
5 2.1 2.3 2.2
6 1.8 2.0 1.9
;
run;

As you see below, we can use some useful functions to count the number of missing observations, like nmiss for numeric and cmiss for character. Or missing to indicate whether the argument contains a missing value. And then filter any rows that have more than one missing value.

data raw_0;
    set times (where = (nmiss(trial1,trial2,trial3) = 0));
run;

Or just indicate the specific variable, like trial1 column.

data raw_1;
    set times;
    missing_flag = missing(trial1);
run;
Replacing Values

First off, let's try to replace all missing values with zero in every column in a simple way, which is creating an implicit Array NumVar to hold all numeric variables in the dataset and then loop over it. If you just want to replace one column, so then add that variable name instead of _numeric_.

data raw_3;
    set times;
    array NumVar _numeric_;

    do over NumVar;
        if NumVar=. then
            NumVar=0;
    end;
run;

If your question is more complicated, such as not replaced by zero but by mean, then how would we address it? I suppose that proc stdize is a good solution.

/*proc stdize data = times out = stdize_vars reponly missing = 0; run;*/
proc stdize data = times out = stdize_vars reponly method = mean;
    var trial1 trial2; /* or _numeric_, or empty*/
run;
Imputation Method

Imputation missing values is a complicated data manipulation process that can work well if you select the correct method for specific variables. But I would not learn more about how to do it with SAS by now, since I prefer to use R for imputation.

Here just list a few of useful sas procedures so that I can read and recall them later if needed.

  • proc hpimpute
  • PROC MI, PROC REG, PROC MIANALYZE
  • proc surveyimpute

Hope above notes will be helpful for you.

Reference

https://www.statology.org/sas-remove-duplicates/
https://www.statology.org/sas-replace-missing-values-with-zero/
http://www.philasug.org/Presentations/201910/Handling_Missing_Data_in_SAS.pdf
https://sasnrd.com/sas-replace-missing-values-zero/

Please indicate the source: http://www.bioinfo-scrounger.com