主要为了理解Cox比例风险模型以及其预测结果
首先看下lung数据集对于sex协变量的cox结果
library(survival)
l.coxph <- coxph(Surv(time, status) ~ sex, data=lung)
>summary(l.coxph)
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
sex -0.5310 0.5880 0.1672 -3.176 0.00149 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
sex 0.588 1.701 0.4237 0.816
Concordance= 0.579 (se = 0.021 )
Likelihood ratio test= 10.63 on 1 df, p=0.001
Wald test = 10.09 on 1 df, p=0.001
Score (logrank) test = 10.33 on 1 df, p=0.001
其中sex=1
代表male,sex=2
代表female;从上述结果可看出,coef为log(Hazards ratio),即cox回归方程中的beta;所以HR(Hazards ratio)为0.5880,即exp(coef)
HR小于1,说明female的风险低于male(根据sex值的大小来,second group relative to the first group);而exp(-coef)则是反之,是first group relative to the second,其值大于1,同样说明male风险高于female
接下来我想基于训练集得到的cox回归模型,预测新测试集的预后结果,可使用predict
函数,并看下其不同参数(type="lp"
,risk
,expected
和terms
)的区别
results <- rownames_to_column(lung, var = "id") %>%
select(c("id", "time", "status", "sex")) %>%
bind_cols(tibble(lp = predict(l.coxph, type="lp"),
risk = predict(l.coxph, type="risk"),
expected = predict(l.coxph, type="expected"),
terms = predict(l.coxph, type="terms")))
> head(results, 10)
id time status sex lp risk expected sex
1 1 306 2 1 0.2096146 1.2332026 0.8280030 0.2096146
2 2 455 2 1 0.2096146 1.2332026 1.3880051 0.2096146
3 3 1010 1 1 0.2096146 1.2332026 3.5060567 0.2096146
4 4 210 2 1 0.2096146 1.2332026 0.5185439 0.2096146
5 5 883 2 1 0.2096146 1.2332026 3.5060567 0.2096146
6 6 1022 1 1 0.2096146 1.2332026 3.5060567 0.2096146
7 7 310 2 2 -0.3214090 0.7251266 0.4999485 -0.3214090
8 8 361 2 2 -0.3214090 0.7251266 0.6107055 -0.3214090
9 9 218 2 1 0.2096146 1.2332026 0.5367625 0.2096146
10 10 166 2 1 0.2096146 1.2332026 0.3309150 0.2096146
对于type="lp"
参数,所计算的值可以作为线性预测变量的各个组成部分,若值大于0,说明该subject有相比mean survival(这个计算自训练集)有high risk,小于0则是low risk。按照上述逻辑,其计算方式如下:
> (lung[1,"sex"] - l.coxph$means) %*% coef(l.coxph)["sex"]
[,1]
[1,] 0.2096146
对于type="risk"
参数,所计算的值作为exp后的结果(risk score exp(lp) ),计算subject的risk value;若value大于1,说明其高于人群平均水平,低于1则反之;通过对risk排序,可得到待预测的人群中top 20%的subjects
> exp((lung[1,"sex"] - l.coxph$means) %*% coef(l.coxph)["sex"])
[,1]
[1,] 1.233203
对于type="expected"
参数,其计算的是对应subject在随访观察期内预期发生的事件的数目,The survival probability for a subject is equal to exp(-expected)
对于type="terms"
参数,假如cox中只有一个变量(且这个变量是二分类变量),其计算结果同lp。。我是这么理解的,这个type是指类的意思
假如我再加上ph.ecog
变量
options(na.action=na.exclude)
l.coxph <- coxph(Surv(time, status) ~ sex+ph.ecog, data=lung)
results <- rownames_to_column(lung, var = "id") %>%
select(c("id", "time", "status", "sex", "ph.ecog")) %>%
bind_cols(tibble(lp = predict(l.coxph, type="lp"),
risk = predict(l.coxph, type="risk"),
expected = predict(l.coxph, type="expected"),
terms = predict(l.coxph, type="terms")))
查看这个协变量的组合个数:
> table(results$sex, results$ph.ecog)
0 1 2 3
1 36 71 29 1
2 27 42 21 0
那么lp参数下就会有7个系数:
> length(table(results$lp))
[1] 7
而从type参数来看,其类只有2+4=6个:
> length(table(results$terms))
[1] 6
以上是根据网上资料简单的搬抄。。。。
本文出自于http://www.bioinfo-scrounger.com转载请注明出处