Reading notes for the book SAS Programming in the Pharmaceutical Industry (1)

These are my reading notes on the e-book “SAS Programming in the Pharmaceutical Industry”, written to record some key points about pharma programming.

Chapter 4 Transforming Data and Creating Analysis Data Sets

Defining variables once

Why create an analysis data set? One of the primary reasons is to keep each variable derivation in a single place, so that a change to a derivation is made once instead of being hunted down and repeated across many different programs.

Defining Study Populations

  • Intent-to-treat (ITT): all patients who were randomized to study therapy, that is, all patients it was intended to treat. Patients are analyzed according to their randomized treatment group.
  • As-treated: patients are analyzed according to the study intervention they actually received, which may differ from the treatment they were randomized to.
  • Per-protocol: all patients who did not experience a subjectively defined serious protocol violation during the study.
  • Safety: all patients who actually received the study drug.



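  • Intention-to-treat (ITT) population: the ITT analysis set includes all patients after randomization. Note that a patient randomized to group A must stay in group A in every ITT analysis, even if they actually received the group B treatment or no treatment at all. This may look strange, but the main purpose is to keep the baseline characteristics of the groups balanced and comparable: randomization balances out every variable other than the factor under study, so the effect of the intervention can be observed properly. The as-treated set, by contrast, analyzes patients according to the treatment they actually received. For single-arm studies the ITT concept rarely comes up; there it usually means all enrolled patients (generally defined by a signed informed consent form).
  • Full analysis set (FAS): a subset of the ITT set, called modified ITT (mITT) in some studies. It is the data set obtained from all randomized subjects after minimal and justified exclusions, which keeps the original data set as complete as possible and reduces bias, although there is currently no consensus on this issue. ICH E9 describes only a few specific circumstances that may lead to excluding a subject from the full analysis set: (1) failure to satisfy major entry criteria; (2) never having taken a dose of study medication; (3) having no data at all after randomization. The FAS can serve as the primary analysis set.
  • Per-protocol (PP) set: a subset of the FAS made up of subjects with no serious protocol violations in entry criteria, treatment received, or measurement of the primary endpoints; it analyzes only the subjects who complied with the intervention. In my opinion the FAS and the PP set do not differ much, so many studies pick either the FAS analysis or the PP analysis and report it together with the ITT analysis. (Note in particular that excluding a patient from any analysis set requires solid justification, usually agreed jointly by the investigator, the sponsor, and the statistician. For blinded studies this must be done before unblinding, before unblinding, before unblinding (important things are said three times), because changes made after unblinding look like data manipulation and will generally be questioned by regulators. The painful lesson of the NEJ-009 study is still fresh.)
  • Safety set (SS): unlike the efficacy sets above, the safety set is used to evaluate the safety of the study drug. A subject generally has to meet three conditions: (1) randomized; (2) received at least one dose of study drug; (3) has at least one safety assessment.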


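  • In general, it is an advantage to show that the main trial results are not sensitive to the choice of analysis set.
  • In some situations it is best to plan an exploration of how sensitive the conclusions are to the choice of analysis set.
  • In superiority trials the full analysis set is used for the primary analysis (apart from exceptional circumstances), because it tends to avoid the over-optimistic efficacy estimates that a per-protocol analysis can produce: the non-compliers included in the full analysis set will generally reduce the estimated treatment effect.
  • In an equivalence or non-inferiority trial, however, use of the full analysis set is generally not conservative, and its role should be considered very carefully.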
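
The population definitions above usually end up as flag variables in a subject-level data set. Here is a minimal sketch of that idea. The data set, the input variables (randdt, firstdose, n_safety, major_pv), the flag names, and the per-protocol rule are all my own assumptions for illustration; the real rules come from the protocol and the statistical analysis plan.

```sas
/* Hypothetical subject-level input: one row per subject */
data subj;
    input subject $ randdt :date9. firstdose :date9. n_safety major_pv;
    format randdt firstdose date9.;
    datalines;
101 01JAN2020 02JAN2020 3 0
102 03JAN2020 . 0 0
103 05JAN2020 06JAN2020 2 1
;
run;

data popflags;
    set subj;
    length ittfl saffl ppfl $1;
    /* ITT: every randomized subject */
    ittfl = ifc(not missing(randdt), 'Y', 'N');
    /* Safety: randomized, at least one dose, at least one safety assessment */
    saffl = ifc(not missing(randdt) and not missing(firstdose)
                and n_safety >= 1, 'Y', 'N');
    /* Per-protocol (assumed rule): randomized, dosed, no major violation */
    ppfl = ifc(not missing(randdt) and not missing(firstdose)
               and major_pv = 0, 'Y', 'N');
run;
```

With this toy input, subject 102 is in the ITT set only (never dosed) and subject 103 is in the ITT and safety sets but not the per-protocol set.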

Defining Baseline Observations

“Baseline” is a common clinical concept: it describes the state of a patient before an intervention, so that later measurements have a reference point to be compared against. For cholesterol measurements, for example, the baseline value is usually the last reading taken before the medical intervention.

Baseline values are often derived as Last Observation Carried Forward (LOCF) variables. For example, you may carry the last observation forward only if it was measured within a five-day window before the pill is taken.
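
A minimal sketch of the five-day-window LOCF baseline described above. The data set and variable names are hypothetical, and I assume the window means 1 to 5 days before the dose date, so same-day readings are left out because their timing relative to dosing is unknown.

```sas
/* Hypothetical cholesterol readings; dosedt is the first dose date */
data chol;
    input subject $ labdt :date9. dosedt :date9. chol;
    format labdt dosedt date9.;
    datalines;
101 20DEC2019 02JAN2020 210
101 30DEC2019 02JAN2020 205
101 01JAN2020 02JAN2020 200
101 10JAN2020 02JAN2020 180
102 20DEC2019 05JAN2020 250
;
run;

proc sort data = chol;
    by subject labdt;
run;

data baseline (keep = subject baseline basedt);
    set chol;
    by subject;
    retain baseline basedt;
    format basedt date9.;
    /* reset the carried values at the start of each subject */
    if first.subject then call missing(baseline, basedt);
    /* carry forward non-missing readings taken 1 to 5 days before dosing */
    if not missing(chol) and 1 <= dosedt - labdt <= 5 then do;
        baseline = chol;
        basedt   = labdt;
    end;
    /* one output row per subject */
    if last.subject then output;
run;
```

Subject 101 gets the 01JAN2020 reading (200) as baseline, while subject 102 has no reading inside the window, so the baseline stays missing instead of being filled with an old value.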

Defining Study Day

  • Calculating a continuous study day: study_day = event_date - intervention_date + 1. In this approach, study day 1 is the day of the initial intervention and the day before it is study day 0.
  • Calculating a study day without day zero: if event_date is earlier than intervention_date, then study_day = event_date - intervention_date; if event_date is on or after intervention_date, then study_day = event_date - intervention_date + 1. Study day 1 is again the day of the initial intervention, but the day before it is study day -1.

The first way is useful for graphing, or for calculating durations that span the time before and after the therapeutic intervention, because the scale has no gap at zero.

The second way is more intuitive because the day before the intervention is study day “-1”, so it is used more often, notably in CDISC SDTM.

CDISC ADaM follows the same no-day-zero convention for relative day variables whose names end in DY, so a continuous variant that includes day 0 has to be stored under a different name. Whether or not you derive data based on the CDISC models, you should calculate study day variables consistently across a clinical trial or set of trials for an application.
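
The two study day calculations can be sketched in one DATA step. The variable names event_date, intervention_date, and study_day come from the formulas above; study_day_cont and the sample dates are my own additions for illustration.

```sas
data studyday;
    input event_date :date9. intervention_date :date9.;
    format event_date intervention_date date9.;
    /* Approach 1: continuous scale, the day before intervention is day 0 */
    study_day_cont = event_date - intervention_date + 1;
    /* Approach 2: no day zero, the day before intervention is day -1 */
    if nmiss(event_date, intervention_date) = 0 then
        study_day = event_date - intervention_date
                    + (event_date >= intervention_date);
    datalines;
31DEC2019 02JAN2020
01JAN2020 02JAN2020
02JAN2020 02JAN2020
03JAN2020 02JAN2020
;
run;
```

For these four rows the continuous version gives -1, 0, 1, 2 and the no-day-zero version gives -2, -1, 1, 2. The comparison (event_date >= intervention_date) evaluates to 1 or 0 in SAS, which adds the extra day only on or after the intervention.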

Windowing Data

Windowing assigns a tag to an observation based on when it occurred. A tag is a descriptive label such as “Visit 5”, “Baseline”, or “Abnormal”. For example, an observation can only be tagged “Baseline” if it occurs before initial drug dosing.
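
As a rough sketch, a windowing rule can be written as a chain of conditions on the study day. The windows, tag names, and data below are assumptions made up for illustration (real windows are defined in the protocol or analysis plan), and study_day is assumed to use the no-day-zero convention.

```sas
/* Hypothetical input: one measurement per row with its study day */
data measures;
    input subject $ study_day value;
    datalines;
101 -3 160
101 1 158
101 15 150
101 30 141
101 60 139
;
run;

data windowed;
    set measures;
    length tag $8;
    /* check missing first: a missing value compares lower than any number */
    if missing(study_day) then tag = ' ';
    /* before day 1 means before initial dosing */
    else if study_day < 1 then tag = 'Baseline';
    else if 8 <= study_day <= 21 then tag = 'Week 2';
    else if 22 <= study_day <= 35 then tag = 'Week 4';
    /* anything outside the assumed windows */
    else tag = 'Unsched';
run;
```

Here the day -3 record is tagged “Baseline”, days 15 and 30 fall into the two visit windows, and days 1 and 60 are left as unscheduled.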

Transposing Data

Normalized data may also be described as “stacked”, “vertical” or “tall and skinny”, while non-normalized data are often called “flat”, “wide” or “short and fat”.

In other words, normalized data are long data and non-normalized data are wide data. We often need to transpose data so that the dependent variable is present on the same observation as the independent variables.

In SAS, I think PROC TRANSPOSE is a powerful procedure for these needs, whether going from long data to wide or from wide data to long.

data sbp;
    input subject $ visit sbp;
    datalines;
101 1 160
101 3 140
101 4 130
101 5 120
202 1 141
202 2 151
202 3 161
202 4 171
202 5 181
;
run;

/* long to wide: one row per subject, one VISITn column per visit */
proc transpose data = sbp out = sbpflat prefix = VISIT;
    by subject;
    id visit;
    var sbp;
run;
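
The opposite direction, wide to long, can be sketched with PROC TRANSPOSE as well. The sbpwide data set and the names visitvar and sbplong are hypothetical; the extra DATA step is just one way to turn the column names back into a numeric visit.

```sas
/* Hypothetical wide data: one row per subject, one column per visit */
data sbpwide;
    input subject $ visit1 visit2 visit3;
    datalines;
101 160 . 140
202 141 151 161
;
run;

/* wide to long: one row per subject and visit column */
proc transpose data = sbpwide
               out  = sbplong (rename = (col1 = sbp))
               name = visitvar;
    by subject;
    var visit1-visit3;
run;

data sbplong (drop = visitvar);
    set sbplong;
    /* keep only the digits of names such as "visit2" to get the visit number */
    visit = input(compress(visitvar, , 'kd'), 8.);
    /* drop visits that were never recorded */
    if not missing(sbp);
run;
```

The result has one row per recorded visit, so subject 101 ends up with two rows and subject 202 with three.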

In R, the same reshaping is handled by the pivot_longer() and pivot_wider() functions.

Categorical Data and Why Zero and Missing Results Differ Greatly

Missing data:

  • The response is unknown.
  • The observation will not be included in population analysis and denominator definitions.

Zero data:

  • The response is known.
  • The response is “NO” when the categorical variable is a Boolean variable.
  • The observation will be included in population analysis and denominator definitions.
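
A small sketch of why this matters for denominators, using a made-up 0/1 response flag where one subject's response is unknown.

```sas
/* Hypothetical responses: 1 = yes, 0 = no, . = unknown */
data resp;
    input subject $ response;
    datalines;
101 1
102 0
103 0
104 .
;
run;

/* N counts only non-missing values, so the mean of the 0/1 flag is
   responders divided by subjects with a known response: 1/3, not 1/4 */
proc means data = resp n nmiss mean;
    var response;
run;
```

If the missing value had been coded as 0 by mistake, the denominator would become 4 and the response rate would drop to 25%.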

Performing Many-to-Many Comparisons/joins

Imagine you have a data set of adverse event data and a data set of concomitant medications, and you want to know if a concomitant medication was given to a patient during the time of the adverse event.

In SAS this is usually done with a PROC SQL join, and in R with left_join(); both are very common data manipulation steps.
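
A minimal sketch of such a join, assuming made-up adverse event and concomitant medication data sets with complete start and end dates (partial or ongoing dates would need extra handling). Two date ranges overlap when each one starts on or before the other one ends.

```sas
/* Hypothetical adverse events */
data ae;
    input subject $ aeterm $ aestart :date9. aeend :date9.;
    format aestart aeend date9.;
    datalines;
101 HEADACHE 05JAN2020 07JAN2020
101 NAUSEA 20JAN2020 22JAN2020
102 RASH 10JAN2020 12JAN2020
;
run;

/* Hypothetical concomitant medications */
data cm;
    input subject $ cmtrt $ cmstart :date9. cmend :date9.;
    format cmstart cmend date9.;
    datalines;
101 ASPIRIN 06JAN2020 06JAN2020
101 ANTACID 01JAN2020 31JAN2020
102 CREAM 15JAN2020 20JAN2020
;
run;

proc sql;
    /* keep every AE; attach each medication whose dates overlap it */
    create table ae_cm as
    select a.subject, a.aeterm, a.aestart, a.aeend,
           c.cmtrt, c.cmstart, c.cmend
    from ae as a
         left join cm as c
         on  a.subject = c.subject
         and c.cmstart <= a.aeend
         and c.cmend   >= a.aestart
    order by a.subject, a.aestart, c.cmstart;
quit;
```

The headache matches two medications, the nausea matches one, and the rash matches none, so it is kept with missing medication columns because of the left join.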

Common Analysis Data Sets

  • The critical variables analysis data set always has a single observation per subject, which simplifies merging with other data sets. Its whole purpose is to capture in one place the essential analysis stratification variables that are used throughout the statistical analysis and reporting.
  • The purpose of change-from-baseline analysis data sets is to measure what effect some therapeutic intervention had on some kind of diagnostic measure. A measure is taken before and after therapy, and a difference, and sometimes a percentage difference, is calculated for each post-baseline measure.
  • A time-to-event analysis data set captures the time between the therapeutic intervention and some other particular event. Two variables are defined as follows:
    • Event/Censor: a binary outcome such as “success/failure”, “death/life”, or “heart attack/no heart attack”. If the event happened to the subject, the event variable is set to 1. If it is certain that the patient did not experience the event, the event variable is set to 0. Otherwise, this variable should be missing.
    • Time to Event: this variable captures the time (usually the study day) from the therapeutic intervention to the event date or censor date. If the event occurred for a subject, the time to event is the study day of that event. If the event did not occur, the time to event is the study day of the censor date, which is often the last known follow-up date for the subject.
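
For the change-from-baseline data set in the list above, a minimal sketch could look like this. The data are made up, visit 0 is assumed to be the baseline record, and the names base, chg, and pchg are only illustrative.

```sas
/* Hypothetical blood pressure data; visit 0 is assumed to be baseline */
data bp;
    input subject $ visit sbp;
    datalines;
101 0 160
101 1 150
101 2 140
202 0 140
202 1 147
;
run;

proc sort data = bp;
    by subject visit;
run;

data cfb;
    set bp;
    by subject;
    retain base;
    /* never carry a baseline over from the previous subject */
    if first.subject then base = .;
    if visit = 0 then base = sbp;
    else if not missing(base) and not missing(sbp) then do;
        chg = sbp - base;
        /* avoid dividing by a zero baseline */
        if base ne 0 then pchg = 100 * chg / base;
    end;
run;
```

Subject 101 gets changes of -10 and -20 (-6.25% and -12.5%), and subject 202 gets a change of 7 (5%). Baseline rows keep chg and pchg missing.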

Time-to-event data, also called survival data, is the input for survival analysis methods such as the Kaplan-Meier curve, the log-rank test, and the Cox proportional hazards model.

Often the censor date is the last known date of patient follow-up, but a patient could be censored for other reasons, such as having taken a protocol-prohibited medication.
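
A minimal sketch of deriving the event and time-to-event variables. The input data set and its variable names are hypothetical, and I assume that a known last follow-up date with no event date means the subject was event-free up to that date.

```sas
/* Hypothetical input: treatment start, event date, last follow-up date */
data subj;
    input subject $ trtdt :date9. eventdt :date9. lastfudt :date9.;
    format trtdt eventdt lastfudt date9.;
    datalines;
101 02JAN2020 15MAR2020 15MAR2020
102 03JAN2020 . 30JUN2020
103 05JAN2020 . .
;
run;

data tte;
    set subj;
    if not missing(eventdt) then do;
        /* event observed: time is the study day of the event */
        event = 1;
        days  = eventdt - trtdt + 1;
    end;
    else if not missing(lastfudt) then do;
        /* censored: time is the study day of the last follow-up */
        event = 0;
        days  = lastfudt - trtdt + 1;
    end;
    /* otherwise event and days both stay missing */
run;
```

With event coded as 1 and censoring as 0, a data set like this can be passed to PROC LIFETEST with a statement such as time days*event(0);, where the value in parentheses marks the censored records.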

Creating time-to-event data sets can be a difficult programming task, especially during interim data analyses, such as those for a data safety monitoring board (DSMB). This is usually because the event data itself is captured in more than one place in the case report form (CRF) and the censor date may be difficult to obtain.

For example, perhaps the event of interest is death. You may have to search the adverse events CRF page, the study termination CRF page, clinical endpoint committee CRFs, and perhaps a special death events CRF page just to gather all of the known death events and dates. For subjects who did not experience the event of interest, you may not have a study termination form to provide the censoring date, so you may have to use some surrogate data to create a censor date.

Please indicate the source: http://www.bioinfo-scrounger.com