Receiver Operating Characteristic

In [101]:
suppressPackageStartupMessages(library(tidyverse))

Logistic regression concept

In logistic regression, we want to relate some predictor variable (e..g gene expression value) to the probability of a binary outcome. We would like a linear model since that is mathematically and computationally tractable, but run into the issue that probabilities are restricted to [0,1]. Logistic regression solves this by transforming the output of a linear model into the [0, 1] range.

In [102]:
n <- 50
x <- seq(-4, 4, length.out = n)
a = 0
b = 1

lx <- a + b*x
px <- exp(lx)/(1 + exp(lx))
y <- rbinom(n, 1, px)

df <- data.frame(y=y, x=x, lx=lx, px=px)
In [103]:
options(repr.plot.width=4, repr.plot.height=3)

A linear model cannot be directly interpreted as a probability

In [104]:
ggplot(df, aes(x=x, y=lx)) +
geom_point(color='blue') +
labs(x="Gene expression value",
     y="Linear model relating gene\nexpression to probability of outcome")
Data type cannot be displayed:
../../_images/Computation_Wk4_Day2_PM_ROC_6_1.png
In [105]:
ggplot(df, aes(x=x, y=y)) +
geom_point() +
geom_line(aes(y=px), color='red') +
labs(x="Gene expression value", y="Binary outcome",
     title="Probability of outcome given value")
Data type cannot be displayed:
../../_images/Computation_Wk4_Day2_PM_ROC_7_1.png

Simulation from logistic model

In [106]:
sim.binomial <- function(n, b0, b1, b2, b3) {
    x1 <- rnorm(n)
    x2 <- rnorm(n)
    x3 <- rnorm(n)
    lx <- b0 + b1 * x1 + b2 * x2 + b3 * x3
    px <- exp(lx)/(1 + exp(lx))
    y <- rbinom(n, 1, px)
    data.frame(y, px, x1, x2, x3)
}

LOOCV

In [107]:
set.seed(31219)
data <- sim.binomial(100, 0.1, 0, 1, 0)
In [108]:
head(data)
ypxx1x2x3
0 0.5830298 1.2472628 0.23522348 1.4787281
1 0.4546145 0.5804904 -0.28204321-0.8217013
1 0.5417889 0.3065363 0.06754626 1.3946841
1 0.8080377 0.4604304 1.33730956 0.6502840
0 0.5436883 1.2094387 0.07519995 0.6585957
1 0.4011430 0.2338796 -0.50070506-0.8225897
In [119]:
n <- nrow(data)
pred <- numeric(n)
for (i in 1:n) {
    train <- data[-i,]
    test <- data[i,]
    model <- glm(y ~ x1 + x2 + x3, data=train, family="binomial")
    pred[i] <- predict(model, test, type="response")
}
yhat <- ifelse(pred < 0.5, 0, 1)
preddat <- data.frame(probhat=pred, yhat=yhat, y=data$y)

Confusion matrix

In [120]:
suppressPackageStartupMessages(library(caret))
In [123]:
tbl <- table(yhat, data$y)
confusionMatrix(tbl)
Confusion Matrix and Statistics


yhat  0  1
   0 31 26
   1 21 22

               Accuracy : 0.53
                 95% CI : (0.4276, 0.6306)
    No Information Rate : 0.52
    P-Value [Acc > NIR] : 0.4607

                  Kappa : 0.0547
 Mcnemar's Test P-Value : 0.5596

            Sensitivity : 0.5962
            Specificity : 0.4583
         Pos Pred Value : 0.5439
         Neg Pred Value : 0.5116
             Prevalence : 0.5200
         Detection Rate : 0.3100
   Detection Prevalence : 0.5700
      Balanced Accuracy : 0.5272

       'Positive' Class : 0

ROC and AUC

In [124]:
suppressPackageStartupMessages(library(ROCR))

Helper functions

In [125]:
ROC <- function(preddat, mlab = "probhat", ylab = "y") {
    m <- preddat[[mlab]]
    y <- preddat[[ylab]]
    performance(prediction(m, y), measure = "tpr", x.measure = "fpr")
}

AUC <- function(preddat, mlab = "probhat", ylab = "y") {
    m <- preddat[[mlab]]
    y <- preddat[[ylab]]
    performance(prediction(m, y), "auc")@y.values[[1]]
}

Calculate and plot ROC and AUC

In [126]:
my.roc <- ROC(preddat)
my.auc <- AUC(preddat)
In [127]:
my.auc
0.530048076923077
In [128]:
options(repr.plot.width=4, repr.plot.height=4)
In [129]:
plot(my.roc, col = "blue")
abline(0, 1)
../../_images/Computation_Wk4_Day2_PM_ROC_25_0.png

Resubstitution ROC

In [134]:
model.1 <- glm(y ~ x1 + x2 + x3, data=data, family="binomial")
pred.1 <- predict(model.1, type="response")
yhat.1 <- ifelse(pred.1 < 0.5, 0, 1)
preddat.1 <- data.frame(probhat=pred.1, yhat=yhat.1, y=data$y)
In [136]:
rs.roc <- ROC(preddat.1)
rs.auc <- AUC(preddat.1)
In [137]:
rs.auc
0.616987179487179
In [144]:
plot(my.roc, col = "blue", main="Resubstitution ROC is optimistic")
plot(rs.roc, col = "red", add=TRUE)
abline(0, 1)
../../_images/Computation_Wk4_Day2_PM_ROC_30_0.png