Generalized Linear Models¶
Recall the linear model
Linear models are appropriate where the response variable (\(y\)) is continuous and the errors (\(\epsilon\)) are normally distributed. Many experimental designs (for example, case-control) have categorical outcomes and/or outcomes whose errors are not normally distributed. Often, these experiments may be modeled with a more general category of models called ‘generalized linear models’ (or GLMs).
To understand GLMs, we need to be more precise about what our linear regression is actually doing. We are really saying:
or equivalently
That is, our model is specifying the conditional expectation of observations \(Y\) on the idependent variable \(x\). In particular, that expectation is linear.
What Happens When \(\mathbb{E}(Y|x)\) is Not Linear?¶
A natural question to ask is: Can we find a function of the expectation that is linear? We would like this function to be invertible, so that we are really just transforming things. The answer is ‘yes’ - if - we assume that \(Y\) has a distribution that is part of the ‘exponential family’.
This family has the following form (note - this is vector notation)
Kinda ugly, but also pretty general. Note that the normal distribution is in this family! So is the binomial, negative binomial, Poisson, Gamma,...
Model¶
Suppose our outcome variable \(Y\) has a distribution in the exponential family. We need an invertible function \(g\) so that:
Why???
Because when we apply \(g\) to the expectation:
a miracle occurs...
Because it has miraculous properties, we give \(g\) a name: \(g\) is called a ‘link’ function. It ‘links’ the conditional expectation of our outcome variable to a linear function. Everybody loves linear functions! Yay!
Now, DESeq2 uses a negative binomial model, and we’ll talk about that - but first lets look at a model that is commonly used in case-control type settings.
Logistic Regression¶
In logistic regression, we model the probability of the outcome \(y\), given the independent variable \(x\). That is, \(Y|x\) is Bernoulli, with success probability \(p\) given by:
Note that \(p\) is bounded between \(0\) and \(1\) and is defined for all values of \(\alpha + \beta x\).
We use the notation \(expit(t)\) to denote the following:
\(expit\) is known as the logistic function.
and \(logit(t)\) to denote:
Note that:
so that
Because \(logit(p) = logit(\mathbb{E}(Y|x))\), it is called the link function.
In [1]:
expit<-function(t){
return(1/(1+exp(-t)))
}
x<-expit((-10:10))
plot(x)
So, how do we perform a logistic regression in R? First, let’s get a data set:
In [2]:
Titanic<-read.csv("titanic.csv")
In [3]:
attach(Titanic)
#Titanic
library(plyr)
Titanic$pclass<-revalue(pclass, c("1st"=1, "2nd"=1,"3rd"=0))
Titanic
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generate
Out[3]:
row.names | pclass | survived | name | age | embarked | home.dest | room | ticket | boat | sex | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | Allen, Miss Elisabeth Walton | 29 | Southampton | St Louis, MO | B-5 | 24160 L221 | 2 | female |
2 | 2 | 1 | 0 | Allison, Miss Helen Loraine | 2 | Southampton | Montreal, PQ / Chesterville, ON | C26 | female | ||
3 | 3 | 1 | 0 | Allison, Mr Hudson Joshua Creighton | 30 | Southampton | Montreal, PQ / Chesterville, ON | C26 | (135) | male | |
4 | 4 | 1 | 0 | Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) | 25 | Southampton | Montreal, PQ / Chesterville, ON | C26 | female | ||
5 | 5 | 1 | 1 | Allison, Master Hudson Trevor | 0.9167 | Southampton | Montreal, PQ / Chesterville, ON | C22 | 11 | male | |
6 | 6 | 1 | 1 | Anderson, Mr Harry | 47 | Southampton | New York, NY | E-12 | 3 | male | |
7 | 7 | 1 | 1 | Andrews, Miss Kornelia Theodosia | 63 | Southampton | Hudson, NY | D-7 | 13502 L77 | 10 | female |
8 | 8 | 1 | 0 | Andrews, Mr Thomas, jr | 39 | Southampton | Belfast, NI | A-36 | male | ||
9 | 9 | 1 | 1 | Appleton, Mrs Edward Dale (Charlotte Lamson) | 58 | Southampton | Bayside, Queens, NY | C-101 | 2 | female | |
10 | 10 | 1 | 0 | Artagaveytia, Mr Ramon | 71 | Cherbourg | Montevideo, Uruguay | (22) | male | ||
11 | 11 | 1 | 0 | Astor, Colonel John Jacob | 47 | Cherbourg | New York, NY | 17754 L224 10s 6d | (124) | male | |
12 | 12 | 1 | 1 | Astor, Mrs John Jacob (Madeleine Talmadge Force) | 19 | Cherbourg | New York, NY | 17754 L224 10s 6d | 4 | female | |
13 | 13 | 1 | 1 | Aubert, Mrs Leontine Pauline | NA | Cherbourg | Paris, France | B-35 | 17477 L69 6s | 9 | female |
14 | 14 | 1 | 1 | Barkworth, Mr Algernon H. | NA | Southampton | Hessle, Yorks | A-23 | B | male | |
15 | 15 | 1 | 0 | Baumann, Mr John D. | NA | Southampton | New York, NY | male | |||
16 | 16 | 1 | 1 | Baxter, Mrs James (Helene DeLaudeniere Chaput) | 50 | Cherbourg | Montreal, PQ | B-58/60 | 6 | female | |
17 | 17 | 1 | 0 | Baxter, Mr Quigg Edmond | 24 | Cherbourg | Montreal, PQ | B-58/60 | male | ||
18 | 18 | 1 | 0 | Beattie, Mr Thomson | 36 | Cherbourg | Winnipeg, MN | C-6 | male | ||
19 | 19 | 1 | 1 | Beckwith, Mr Richard Leonard | 37 | Southampton | New York, NY | D-35 | 5 | male | |
20 | 20 | 1 | 1 | Beckwith, Mrs Richard Leonard (Sallie Monypeny) | 47 | Southampton | New York, NY | D-35 | 5 | female | |
21 | 21 | 1 | 1 | Behr, Mr Karl Howell | 26 | Cherbourg | New York, NY | C-148 | 5 | male | |
22 | 22 | 1 | 0 | Birnbaum, Mr Jakob | 25 | Cherbourg | San Francisco, CA | (148) | male | ||
23 | 23 | 1 | 1 | Bishop, Mr Dickinson H. | 25 | Cherbourg | Dowagiac, MI | B-49 | 7 | male | |
24 | 24 | 1 | 1 | Bishop, Mrs Dickinson H. (Helen Walton) | 19 | Cherbourg | Dowagiac, MI | B-49 | 7 | female | |
25 | 25 | 1 | 1 | Bjornstrm-Steffansson, Mr Mauritz Hakan | 28 | Southampton | Stockholm, Sweden / Washington, DC | D | male | ||
26 | 26 | 1 | 0 | Blackwell, Mr Stephen Weart | 45 | Southampton | Trenton, NJ | (241) | male | ||
27 | 27 | 1 | 1 | Blank, Mr Henry | 39 | Cherbourg | Glen Ridge, NJ | A-31 | 7 | male | |
28 | 28 | 1 | 1 | Bonnell, Miss Caroline | 30 | Southampton | Youngstown, OH | C-7 | 8 | female | |
29 | 29 | 1 | 1 | Bonnell, Miss Elizabeth | 58 | Southampton | Birkdale, England Cleveland, Ohio | C-103 | 8 | female | |
30 | 30 | 1 | 0 | Borebank, Mr John James | NA | Southampton | London / Winnipeg, MB | D-21/2 | male | ||
31 | ⋮ | NA | ⋮ | NA | ⋮ | NA | NA | NA | NA | NA | NA |
1284 | 1284 | 0 | 0 | Vestrom, Miss Hulda Amanda Adolfina | NA | female | |||||
1285 | 1285 | 0 | 0 | Vonk, Mr Jenko | NA | male | |||||
1286 | 1286 | 0 | 0 | Ware, Mr Frederick | NA | male | |||||
1287 | 1287 | 0 | 0 | Warren, Mr Charles William | NA | male | |||||
1288 | 1288 | 0 | 0 | Wazli, Mr Yousif | NA | male | |||||
1289 | 1289 | 0 | 0 | Webber, Mr James | NA | male | |||||
1290 | 1290 | 0 | 1 | Wennerstrom, Mr August Edvard | NA | male | |||||
1291 | 1291 | 0 | 0 | Wenzel, Mr Linhart | NA | male | |||||
1292 | 1292 | 0 | 0 | Widegren, Mr Charles Peter | NA | male | |||||
1293 | 1293 | 0 | 0 | Wiklund, Mr Jacob Alfred | NA | male | |||||
1294 | 1294 | 0 | 1 | Wilkes, Mrs Ellen | NA | female | |||||
1295 | 1295 | 0 | 0 | Willer, Mr Aaron | NA | male | |||||
1296 | 1296 | 0 | 0 | Willey, Mr Edward | NA | male | |||||
1297 | 1297 | 0 | 0 | Williams, Mr Howard Hugh | NA | male | |||||
1298 | 1298 | 0 | 0 | Williams, Mr Leslie | NA | male | |||||
1299 | 1299 | 0 | 0 | Windelov, Mr Einar | NA | male | |||||
1300 | 1300 | 0 | 0 | Wirz, Mr Albert | NA | male | |||||
1301 | 1301 | 0 | 0 | Wiseman, Mr Phillippe | NA | male | |||||
1302 | 1302 | 0 | 0 | Wittevrongel, Mr Camiel | NA | male | |||||
1303 | 1303 | 0 | 1 | Yalsevac, Mr Ivan | NA | male | |||||
1304 | 1304 | 0 | 0 | Yasbeck, Mr Antoni | NA | male | |||||
1305 | 1305 | 0 | 1 | Yasbeck, Mrs Antoni | NA | female | |||||
1306 | 1306 | 0 | 0 | Youssef, Mr Gerios | NA | male | |||||
1307 | 1307 | 0 | 0 | Zabour, Miss Hileni | NA | female | |||||
1308 | 1308 | 0 | 0 | Zabour, Miss Tamini | NA | female | |||||
1309 | 1309 | 0 | 0 | Zakarian, Mr Artun | NA | male | |||||
1310 | 1310 | 0 | 0 | Zakarian, Mr Maprieder | NA | male | |||||
1311 | 1311 | 0 | 0 | Zenn, Mr Philip | NA | male | |||||
1312 | 1312 | 0 | 0 | Zievens, Rene | NA | female | |||||
1313 | 1313 | 0 | 0 | Zimmerman, Leo | NA | male |
In [4]:
fit.survive<-glm(survived ~ sex + age +pclass,family= "binomial")
In [5]:
summary(fit.survive)
Out[5]:
Call:
glm(formula = survived ~ sex + age + pclass, family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9784 -0.6520 -0.3142 0.5894 2.7022
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.522163 0.471008 9.601 < 2e-16 ***
sexmale -3.086709 0.241063 -12.805 < 2e-16 ***
age -0.049309 0.008732 -5.647 1.63e-08 ***
pclass2nd -1.495229 0.281986 -5.302 1.14e-07 ***
pclass3rd -2.841271 0.338897 -8.384 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 869.54 on 632 degrees of freedom
Residual deviance: 539.71 on 628 degrees of freedom
(680 observations deleted due to missingness)
AIC: 549.71
Number of Fisher Scoring iterations: 5
In [6]:
fit.survive<-glm(survived ~ sex + age + pclass + sex:pclass + age:pclass,family= "binomial")
In [7]:
summary(fit.survive)
Out[7]:
Call:
glm(formula = survived ~ sex + age + pclass + sex:pclass + age:pclass,
family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0532 -0.6111 -0.3829 0.4168 2.5174
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.736466 0.756171 6.264 3.76e-10 ***
sexmale -3.685635 0.510927 -7.214 5.45e-13 ***
age -0.042485 0.013044 -3.257 0.00113 **
pclass2nd 0.232062 1.095137 0.212 0.83218
pclass3rd -3.978596 0.888172 -4.480 7.48e-06 ***
sexmale:pclass2nd -0.655681 0.735503 -0.891 0.37268
sexmale:pclass3rd 1.870332 0.631770 2.960 0.00307 **
age:pclass2nd -0.049047 0.023427 -2.094 0.03629 *
age:pclass3rd 0.007216 0.021396 0.337 0.73593
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 869.54 on 632 degrees of freedom
Residual deviance: 517.73 on 624 degrees of freedom
(680 observations deleted due to missingness)
AIC: 535.73
Number of Fisher Scoring iterations: 5
In [8]:
detach(Titanic)
Negative Binomial¶
When \(Y\) follows a negative binomial distribution, we model the mean as:
What is the link function?
The following example is lifted from http://www.ats.ucla.edu/stat/r/dae/nbreg.htm
In [9]:
library(foreign)
library(ggplot2)
library(MASS)
dat <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
In [10]:
dat
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generate
Out[10]:
id | gender | math | daysabs | prog | |
---|---|---|---|---|---|
1 | 1001 | male | 63 | 4 | 2 |
2 | 1002 | male | 27 | 4 | 2 |
3 | 1003 | female | 20 | 2 | 2 |
4 | 1004 | female | 16 | 3 | 2 |
5 | 1005 | female | 2 | 3 | 2 |
6 | 1006 | female | 71 | 13 | 2 |
7 | 1007 | female | 63 | 11 | 2 |
8 | 1008 | male | 3 | 7 | 2 |
9 | 1009 | male | 51 | 10 | 2 |
10 | 1010 | male | 49 | 9 | 3 |
11 | 1011 | female | 31 | 4 | 3 |
12 | 1012 | male | 22 | 5 | 2 |
13 | 1013 | female | 73 | 5 | 2 |
14 | 1014 | female | 77 | 6 | 1 |
15 | 1015 | male | 10 | 1 | 3 |
16 | 1016 | male | 89 | 0 | 2 |
17 | 1017 | female | 34 | 1 | 2 |
18 | 1018 | female | 35 | 0 | 2 |
19 | 1019 | male | 77 | 5 | 1 |
20 | 1020 | male | 4 | 24 | 2 |
21 | 1021 | female | 21 | 2 | 2 |
22 | 1022 | male | 61 | 0 | 2 |
23 | 1023 | male | 41 | 1 | 2 |
24 | 1024 | male | 63 | 0 | 2 |
25 | 1025 | female | 19 | 8 | 2 |
26 | 1026 | male | 55 | 6 | 1 |
27 | 1027 | male | 6 | 7 | 2 |
28 | 1028 | female | 21 | 0 | 2 |
29 | 1029 | male | 70 | 2 | 2 |
30 | 1030 | male | 79 | 3 | 1 |
31 | ⋮ | NA | ⋮ | ⋮ | ⋮ |
285 | 2127 | male | 53 | 8 | 1 |
286 | 2128 | female | 43 | 1 | 3 |
287 | 2129 | male | 46 | 1 | 3 |
288 | 2130 | female | 35 | 7 | 2 |
289 | 2131 | female | 63 | 6 | 3 |
290 | 2132 | female | 75 | 0 | 3 |
291 | 2133 | female | 84 | 8 | 3 |
292 | 2134 | male | 41 | 0 | 2 |
293 | 2135 | female | 32 | 1 | 3 |
294 | 2136 | male | 12 | 0 | 2 |
295 | 2137 | male | 61 | 4 | 3 |
296 | 2138 | female | 99 | 17 | 2 |
297 | 2139 | male | 67 | 9 | 1 |
298 | 2140 | female | 35 | 0 | 3 |
299 | 2141 | female | 70 | 0 | 3 |
300 | 2142 | female | 87 | 1 | 3 |
301 | 2143 | male | 70 | 3 | 3 |
302 | 2144 | male | 57 | 1 | 2 |
303 | 2145 | male | 44 | 2 | 3 |
304 | 2146 | female | 26 | 2 | 3 |
305 | 2147 | male | 63 | 2 | 3 |
306 | 2148 | female | 8 | 5 | 2 |
307 | 2150 | male | 63 | 3 | 3 |
308 | 2151 | female | 59 | 7 | 3 |
309 | 2152 | female | 46 | 1 | 3 |
310 | 2153 | male | 26 | 1 | 2 |
311 | 2154 | female | 79 | 3 | 3 |
312 | 2155 | female | 59 | 0 | 2 |
313 | 2156 | female | 90 | 0 | 3 |
314 | 2157 | female | 77 | 2 | 3 |
In [11]:
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
summary(dat)
Out[11]:
id gender math daysabs prog
1001 : 1 female:160 Min. : 1.00 Min. : 0.000 General : 40
1002 : 1 male :154 1st Qu.:28.00 1st Qu.: 1.000 Academic :167
1003 : 1 Median :48.00 Median : 4.000 Vocational:107
1004 : 1 Mean :48.27 Mean : 5.955
1005 : 1 3rd Qu.:70.00 3rd Qu.: 8.000
1006 : 1 Max. :99.00 Max. :35.000
(Other):308
In [12]:
ggplot(dat, aes(daysabs, fill = prog)) + geom_histogram(binwidth = 1) + facet_grid(prog ~
., margins = TRUE, scales = "free")
In [13]:
m1 <- glm.nb(daysabs ~ math + prog, data = dat)
In [14]:
summary(m1)
Out[14]:
Call:
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1547 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
math -0.005993 0.002505 -2.392 0.0167 *
progAcademic -0.440760 0.182610 -2.414 0.0158 *
progVocational -1.278651 0.200720 -6.370 1.89e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
Null deviance: 427.54 on 313 degrees of freedom
Residual deviance: 358.52 on 310 degrees of freedom
AIC: 1741.3
Number of Fisher Scoring iterations: 1
Theta: 1.033
Std. Err.: 0.106
2 x log-likelihood: -1731.258
In [ ]: