When we mention how to find the cutoff, the first response in our brain may be the ROC curve. Absolutely, ROC curve is a very common approach in the biomarker field to look for a fit cutoff to a reagent. However in the ROC curve, the dependent variable must be two categorical variables, which is not universal to different types of data, such as quantitative variables, survival(censored) variables.
In this situation, as an option we can consider using maximally selected rank statistics to find the cutoff.
The statistics depends on your data types.
What is the maximally selected rank statistics? Briefly speaking it assumes that an unknown cutoff in X (independent variable) determines two groups of observations regarding the response Y, and the two groups have the biggest statistics between each other. This statistics is an appropriate standardized two-sample linear rank statistic of the responses that represents the difference between two groups.
The hypothesis test is to find out the maximum of the standardized statistics of all possible cutoffs, which can provide the best separation of the response into two groups.
So Maximally selected rank statistics can be used for estimation as well as evaluation of a simple cutpoint model.
surv_cutpoint()
function in the survminer
package wrapped maxstat
in it to determine the optimal cutpoint for each variable.
A simple example below is about how to use the maxstat
package to find a statistically significant cutoff.
Load the survival data from the maxstat package.
library(survival)
library(maxstat)
data(DLBCL)
mod <- maxstat.test(Surv(time, cens) ~ MGE, data = DLBCL,
smethod = "LogRank", pmethod = "condMC", B = 9999)
> mod
Maximally selected LogRank statistics using condMC
data: Surv(time, cens) by MGE
M = 3.1772, p-value = 0.009701
sample estimates:
estimated cutpoint
0.1860526
This argument smthod
provides several kinds of statistics to be computed and the pmethod
is used to specify the kind of p-value approximation. The argument B
specifies the number of Monte-Carlo replications to be performed (and defaults to 10000).
For the overall survival time, the estimated cutpoint is 0.186 mean gene expression, the maximum of the log-rank statistics is M = 3.1772. The probability that, under the null hypothesis, the maximally selected log-rank statistic is greater M = 3.171 is less then than 0.0097.
If you have more than one dependent variable to be evaluated, you need to evaluate these predictors simultaneously and find out which one is better than the others.
mod2 <- maxstat.test(Surv(time, cens) ~ MGE + IPI, data = DLBCL,
smethod = "LogRank", pmethod = "exactGauss", abseps=0.01)
> mod2
Optimally Selected Prognostic Factors
Call: maxstat.test.data.frame(formula = Surv(time, cens) ~ MGE + IPI,
data = DLBCL, smethod = "LogRank", pmethod = "exactGauss",
abseps = 0.01)
Selected:
Maximally selected LogRank statistics using exactGauss
data: Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint
1
Adjusted p.value:
0.03430325 , error: 0.001754899
> mod2$maxstats
[[1]]
Maximally selected LogRank statistics using exactGauss
data: Surv(time, cens) by MGE
M = 3.0602, p-value = 0.02721
sample estimates:
estimated cutpoint
0.1860526
[[2]]
Maximally selected LogRank statistics using exactGauss
data: Surv(time, cens) by IPI
M = 2.9603, p-value = 0.01104
sample estimates:
estimated cutpoint
1
The p-value of the global test for the null hypothesis “survival is independent from both IPI and MGE” is 0.034 and IPI provides a better distinction into two groups than MGE does.
The visualization of the result can be shown by plot()
function
plot(mod2)
Reference
Maximally Selected Rank Statistics in R
https://www.jianshu.com/p/0851baac137c
https://www.iikx.com/news/statistics/1747.html
Please indicate the source: http://www.bioinfo-scrounger.com