Statistics review 14: Logistic regression
=========================================

R code accompanying the `paper `__

Key learning points
-------------------

-  Modeling the dependence of a binary response variable on one or more
   explanatory variables

.. code:: r

    suppressPackageStartupMessages(library(tidyverse))

.. code:: r

    options(repr.plot.width=4, repr.plot.height=3)

.. code:: r

    marker <- seq(0.75, 4.75, by=0.5)
    patients <- c(182, 233, 224, 236, 225, 215, 221, 200, 264)
    deaths <- c(7, 27, 44, 91, 130, 168, 194, 191, 260)
    df <- data.frame(marker=marker, patients=patients, deaths=deaths)
    df
.. parsed-literal::

      marker patients deaths
    1   0.75      182      7
    2   1.25      233     27
    3   1.75      224     44
    4   2.25      236     91
    5   2.75      225    130
    6   3.25      215    168
    7   3.75      221    194
    8   4.25      200    191
    9   4.75      264    260
.. code:: r

    df1 <- df %>% mutate(prop=deaths/patients)
    df1
.. parsed-literal::

      marker patients deaths       prop
    1   0.75      182      7 0.03846154
    2   1.25      233     27 0.11587983
    3   1.75      224     44 0.19642857
    4   2.25      236     91 0.38559322
    5   2.75      225    130 0.57777778
    6   3.25      215    168 0.78139535
    7   3.75      221    194 0.87782805
    8   4.25      200    191 0.95500000
    9   4.75      264    260 0.98484848
.. code:: r

    ggplot(df1, aes(x=marker, y=prop)) + geom_point()

.. image:: SR14_Logistic_regression_files/SR14_Logistic_regression_5_1.png

The death proportions follow an S-shaped curve in the marker. The logit
transformation, :math:`\text{logit}(p) = \log\left(\frac{p}{1-p}\right)`,
maps proportions from :math:`(0, 1)` onto the whole real line, and is
the link function used in logistic regression.

.. code:: r

    df2 <- df1 %>% mutate(logit=log(prop/(1-prop)))
    df2
.. parsed-literal::

      marker patients deaths       prop      logit
    1   0.75      182      7 0.03846154 -3.2188758
    2   1.25      233     27 0.11587983 -2.0320393
    3   1.75      224     44 0.19642857 -1.4087672
    4   2.25      236     91 0.38559322 -0.4658742
    5   2.75      225    130 0.57777778  0.3136576
    6   3.25      215    168 0.78139535  1.2738164
    7   3.75      221    194 0.87782805  1.9720213
    8   4.25      200    191 0.95500000  3.0550489
    9   4.75      264    260 0.98484848  4.1743873
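The logits increase roughly linearly with the marker (see the plot
below), which is exactly the assumption of a logistic regression. As a
sketch (not part of the original notebook), the model can be fitted to
these grouped data by giving ``glm()`` the counts of deaths and
survivors:

.. code:: r

    # A sketch, not in the original notebook: fit the logistic model to the
    # grouped data. A two-column matrix of (successes, failures) on the
    # left-hand side tells glm() these are binomial counts.
    fit <- glm(cbind(deaths, patients - deaths) ~ marker,
               family = binomial, data = df)
    coef(fit)  # intercept and slope on the log-odds (logit) scale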
.. code:: r

    ggplot(df2, aes(x=marker, y=logit)) + geom_point()

.. image:: SR14_Logistic_regression_files/SR14_Logistic_regression_7_1.png

Maximum likelihood estimation
-----------------------------

The parameters of many statistical models are estimated by maximizing
the likelihood. The likelihood is the probability (or probability
density) of the observed data, viewed as a function of the model
parameters. We will illustrate by example to give some intuition.

Simulating sample data from a population
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    mu <- 4
    sigma <- 1
    x <- rnorm(100, mu, sigma)

.. code:: r

    ggplot(data.frame(x=x), aes(x=x)) +
        geom_histogram(binwidth=0.5, color='grey')

.. image:: SR14_Logistic_regression_files/SR14_Logistic_regression_12_1.png

Likelihood
~~~~~~~~~~

We have 100 samples independently drawn from the same normal
distribution - that is, the samples are IID (independent, identically
distributed). The probability density of a single sample :math:`x_i` is
:math:`N(x_i \mid \mu, \sigma)`. Because the samples are independent,
the likelihood of seeing all 100 samples is just the product of 100
such densities:

.. math::

   L(\mu) = \prod_{i=1}^{100} N(x_i \mid \mu, \sigma)

If we fix the standard deviation to be 1 and vary the mean, we see that
the likelihood function has a maximum near the true population mean
(:math:`\mu = 4`). Hence we can get a good estimate of :math:`\mu` by
maximizing the likelihood. Typically, a numerical optimization
algorithm is used to find the maximum - the value of :math:`\mu` (or
other parameter) where this maximum occurs is the **maximum likelihood
estimate** (MLE) of the population parameter.

.. code:: r

    mus <- seq(0, 8, length.out = 100)
    lik <- vector('numeric', length=100)
    for (i in 1:100) {
        lik[i] <- prod(dnorm(x, mean=mus[i], sd=1))
    }

.. code:: r

    ggplot(data.frame(mu=mus, lik=lik), aes(x=mu, y=lik)) + geom_line()

.. image:: SR14_Logistic_regression_files/SR14_Logistic_regression_15_1.png

Working in log space
~~~~~~~~~~~~~~~~~~~~

In practice, multiplying many small numbers to compute a likelihood
causes underflow because of the finite precision of floating-point
arithmetic. So we work with log probabilities and sum them rather than
taking products; since the logarithm is monotonic, the maximum of the
log likelihood occurs at the same parameter value, and the basic idea
of maximizing the (log) likelihood is unchanged.

.. code:: r

    mus <- seq(0, 8, length.out = 100)
    loglik <- vector('numeric', length=100)
    for (i in 1:100) {
        # dnorm(..., log=TRUE) returns the log density directly, which is
        # itself safe from underflow
        loglik[i] <- sum(dnorm(x, mean=mus[i], sd=1, log=TRUE))
    }

.. code:: r

    ggplot(data.frame(mu=mus, loglik=loglik), aes(x=mu, y=loglik)) + geom_line()

.. image:: SR14_Logistic_regression_files/SR14_Logistic_regression_18_1.png
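Rather than a grid search, a one-dimensional optimizer can find the MLE
directly. A minimal sketch (not in the original notebook); with
:math:`\sigma` fixed at 1, the MLE of :math:`\mu` is exactly the sample
mean, so the optimizer should agree with ``mean(x)``:

.. code:: r

    # A sketch, not in the original notebook: minimize the negative
    # log-likelihood with a 1-D numerical optimizer.
    negloglik <- function(mu) -sum(dnorm(x, mean=mu, sd=1, log=TRUE))
    optimize(negloglik, interval=c(0, 8))$minimum  # should be close to mean(x)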
Data with binary outcomes
-------------------------

We will try to predict admission to graduate school based on GRE, GPA
and rank. Rank is the prestige of the undergraduate institution, with 1
= most prestigious and 4 = least prestigious.

.. code:: r

    admit <- read.csv("http://stats.idre.ucla.edu/stat/data/binary.csv")

.. code:: r

    admit$rank <- as.factor(admit$rank)

.. code:: r

    str(admit)

.. parsed-literal::

    'data.frame':   400 obs. of  4 variables:
     $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
     $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
     $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
     $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...

.. code:: r

    head(admit)

.. parsed-literal::

      admit gre  gpa rank
    1     0 380 3.61    3
    2     1 660 3.67    3
    3     1 800 4.00    1
    4     1 640 3.19    4
    5     0 520 2.93    4
    6     1 760 3.00    2
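As a quick sanity check before fitting (an added sketch, not in the
original notebook), we can cross-tabulate admission against rank to
confirm that every combination is observed:

.. code:: r

    # A sketch, not in the original notebook: check that each admit/rank
    # combination has observations before fitting the model.
    xtabs(~ admit + rank, data = admit)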
.. code:: r

    m <- glm(admit ~ gre + gpa + rank, data=admit, family="binomial")

.. code:: r

    summary(m)

.. parsed-literal::

    Call:
    glm(formula = admit ~ gre + gpa + rank, family = "binomial",
        data = admit)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.6268  -0.8662  -0.6388   1.1490   2.0790

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
    gre          0.002264   0.001094   2.070 0.038465 *
    gpa          0.804038   0.331819   2.423 0.015388 *
    rank2       -0.675443   0.316490  -2.134 0.032829 *
    rank3       -1.340204   0.345306  -3.881 0.000104 ***
    rank4       -1.551464   0.417832  -3.713 0.000205 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 499.98  on 399  degrees of freedom
    Residual deviance: 458.52  on 394  degrees of freedom
    AIC: 470.52

    Number of Fisher Scoring iterations: 4
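On the log-odds scale, the fitted model (coefficients rounded from the
summary above, with rank2-rank4 as dummy variables relative to the
rank 1 baseline) is:

.. math::

   \log\frac{\hat{p}}{1-\hat{p}} = -3.990 + 0.0023\,\mathrm{gre}
   + 0.804\,\mathrm{gpa} - 0.675\,\mathrm{rank2}
   - 1.340\,\mathrm{rank3} - 1.551\,\mathrm{rank4}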
.. code:: r

    round(confint(m), 4)

.. parsed-literal::

    Waiting for profiling to be done...

.. parsed-literal::

                  2.5 %  97.5 %
    (Intercept) -6.2716 -1.7925
    gre          0.0001  0.0044
    gpa          0.1603  1.4641
    rank2       -1.3009 -0.0567
    rank3       -2.0277 -0.6704
    rank4       -2.4000 -0.7535
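``confint()`` on a glm profiles the likelihood, which is why R prints
the "Waiting for profiling" message. As an added comparison (not in the
original notebook), Wald intervals based on the asymptotic normality of
the estimates are available through ``confint.default()``:

.. code:: r

    # A sketch, not in the original notebook: Wald confidence intervals
    # (estimate +/- 1.96 standard errors) for comparison with the
    # profile-likelihood intervals above.
    round(confint.default(m), 4)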
.. code:: r

    library(aod)

Test significance of GRE
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    wald.test(b = coef(m), Sigma = vcov(m), Terms = 2)

.. parsed-literal::

    Wald test:
    ----------

    Chi-squared test:
    X2 = 4.3, df = 1, P(> X2) = 0.038

Test significance of GPA
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    wald.test(b = coef(m), Sigma = vcov(m), Terms = 3)

.. parsed-literal::

    Wald test:
    ----------

    Chi-squared test:
    X2 = 5.9, df = 1, P(> X2) = 0.015

Test significance of Rank
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    wald.test(b = coef(m), Sigma = vcov(m), Terms = 4:6)

.. parsed-literal::

    Wald test:
    ----------

    Chi-squared test:
    X2 = 20.9, df = 3, P(> X2) = 0.00011
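The Wald test for rank tests all three rank coefficients jointly
(``Terms = 4:6``). An alternative (an added sketch, not in the original
notebook) is a likelihood-ratio test, comparing the deviance of the
model with and without rank:

.. code:: r

    # A sketch, not in the original notebook: likelihood-ratio test of the
    # overall rank effect by comparing nested models.
    m0 <- update(m, . ~ . - rank)
    anova(m0, m, test = "Chisq")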
Getting odds ratios
-------------------

.. code:: r

    round(exp(cbind(OR = coef(m), confint(m))), 4)

.. parsed-literal::

    Waiting for profiling to be done...

.. parsed-literal::

                    OR  2.5 % 97.5 %
    (Intercept) 0.0185 0.0019 0.1665
    gre         1.0023 1.0001 1.0044
    gpa         2.2345 1.1739 4.3238
    rank2       0.5089 0.2723 0.9448
    rank3       0.2618 0.1316 0.5115
    rank4       0.2119 0.0907 0.4707
Interpretation
^^^^^^^^^^^^^^

For every unit increase in GPA, the odds of being admitted increase by
a factor of 2.2345, holding the other variables constant. Similarly,
the odds of admission for applicants from a rank 2 institution are
0.5089 times those of applicants from a rank 1 institution, all else
equal.
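As a quick check (an added sketch, not in the original notebook), the
odds ratios are simply the exponentiated coefficients, and the small
per-point GRE effect is easier to read when rescaled to 100 points:

.. code:: r

    # A sketch, not in the original notebook: odds ratios come straight
    # from exponentiating the coefficients.
    exp(coef(m)["gpa"])        # ~2.2345, odds ratio per GPA point
    exp(100 * coef(m)["gre"])  # odds ratio per 100 GRE points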
Making predictions
------------------

.. code:: r

    head(admit)

.. parsed-literal::

      admit gre  gpa rank
    1     0 380 3.61    3
    2     1 660 3.67    3
    3     1 800 4.00    1
    4     1 640 3.19    4
    5     0 520 2.93    4
    6     1 760 3.00    2
.. code:: r

    new.data <- with(admit, data.frame(gre=mean(gre), gpa=mean(gpa), rank=factor(1:4)))

.. code:: r

    new.data
.. parsed-literal::

        gre    gpa rank
    1 587.7 3.3899    1
    2 587.7 3.3899    2
    3 587.7 3.3899    3
    4 587.7 3.3899    4
.. code:: r

    new.data$p <- predict(m, new.data, type="response")
    new.data
.. parsed-literal::

        gre    gpa rank         p
    1 587.7 3.3899    1 0.5166016
    2 587.7 3.3899    2 0.3522846
    3 587.7 3.3899    3 0.2186120
    4 587.7 3.3899    4 0.1846684
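``type="response"`` returns probabilities directly. A sketch (not in
the original notebook) of attaching approximate 95% confidence bands:
predict on the link (log-odds) scale, where the normal approximation is
more reasonable, then back-transform with ``plogis()``:

.. code:: r

    # A sketch, not in the original notebook: standard errors are only
    # available on the link scale, so compute the interval there and then
    # map it to the probability scale with the inverse logit, plogis().
    pr <- predict(m, new.data, type="link", se.fit=TRUE)
    new.data$lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
    new.data$upr <- plogis(pr$fit + 1.96 * pr$se.fit)
    new.data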
Exercise
--------

.. code:: r

    library(titanic)

.. code:: r

    head(titanic_train)
.. parsed-literal::

      PassengerId Survived Pclass                                                 Name    Sex Age SibSp Parch           Ticket    Fare Cabin Embarked
    1           1        0      3                              Braund, Mr. Owen Harris   male  22     1     0        A/5 21171  7.2500              S
    2           2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38     1     0         PC 17599 71.2833   C85        C
    3           3        1      3                               Heikkinen, Miss. Laina female  26     0     0 STON/O2. 3101282  7.9250              S
    4           4        1      1         Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0           113803 53.1000  C123        S
    5           5        0      3                             Allen, Mr. William Henry   male  35     0     0           373450  8.0500              S
    6           6        0      3                                     Moran, Mr. James   male  NA     0     0           330877  8.4583              Q
.. code:: r

    str(titanic_train)

.. parsed-literal::

    'data.frame':   891 obs. of  12 variables:
     $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
     $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
     $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
     $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
     $ Sex        : chr  "male" "female" "female" "female" ...
     $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
     $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
     $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
     $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
     $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
     $ Cabin      : chr  "" "C85" "" "C123" ...
     $ Embarked   : chr  "S" "C" "S" "S" ...

**1**. Fit a logistic regression model of survival against sex, age,
and passenger class (Pclass). Remember to convert to factors as
appropriate. Evaluate and interpret the impact of sex, age and
passenger class on survival.
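A possible starting point (a sketch only, not a worked solution; the
interpretation is left to you):

.. code:: r

    # A sketch, not a full solution: convert categorical predictors to
    # factors and fit the logistic regression. Note that glm() drops rows
    # with missing Age by default.
    titanic <- titanic_train %>%
        mutate(Sex = as.factor(Sex), Pclass = as.factor(Pclass))
    m1 <- glm(Survived ~ Sex + Age + Pclass, data = titanic, family = "binomial")
    summary(m1)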