Generalized Linear Models

Recall the linear model

\[y = \alpha + \beta x + \epsilon\]

Linear models are appropriate where the response variable (\(y\)) is continuous and the errors (\(\epsilon\)) are normally distributed. Many experimental designs (for example, case-control) have categorical outcomes and/or outcomes whose errors are not normally distributed. Often, these experiments may be modeled with a more general category of models called ‘generalized linear models’ (or GLMs).

To understand GLMs, we need to be more precise about what our linear regression is actually doing. We are really saying:

\[y = \mu + \epsilon\]

or equivalently

\[\mathbb{E}(Y|x) = \mu = \alpha + \beta x\]

That is, our model is specifying the conditional expectation of observations \(Y\) on the idependent variable \(x\). In particular, that expectation is linear.

What Happens When \(\mathbb{E}(Y|x)\) is Not Linear?

A natural question to ask is: Can we find a function of the expectation that is linear? We would like this function to be invertible, so that we are really just transforming things. The answer is ‘yes’ - if - we assume that \(Y\) has a distribution that is part of the ‘exponential family’.

This family has the following form (note - this is vector notation)

\[f_Y(y|\theta) = h(y) g(\theta)exp(\eta(\theta)\cdot T(x))\]

Kinda ugly, but also pretty general. Note that the normal distribution is in this family! So is the binomial, negative binomial, Poisson, Gamma,...

Model

Suppose our outcome variable \(Y\) has a distribution in the exponential family. We need an invertible function \(g\) so that:

\[\mathbb{E}(Y|x) = \mu = g^{-1}(\alpha + \beta x)\]

Why???

Because when we apply \(g\) to the expectation:

\[g(\mu) = g(g^{-1}(\alpha + \beta x))\]

a miracle occurs...

\[g(\mu) = \alpha + \beta x\]

Because it has miraculous properties, we give \(g\) a name: \(g\) is called a ‘link’ function. It ‘links’ the conditional expectation of our outcome variable to a linear function. Everybody loves linear functions! Yay!

Now, DESeq2 uses a negative binomial model, and we’ll talk about that - but first lets look at a model that is commonly used in case-control type settings.

Logistic Regression

In logistic regression, we model the probability of the outcome \(y\), given the independent variable \(x\). That is, \(Y|x\) is Bernoulli, with success probability \(p\) given by:

\[\mathbb{E}(Y|x) = p = \frac{1}{1+e^{-(\alpha + \beta x)}}\]

Note that \(p\) is bounded between \(0\) and \(1\) and is defined for all values of \(\alpha + \beta x\).

We use the notation \(expit(t)\) to denote the following:

\[expit(t) = \frac{1}{1+e^{-t}}\]

\(expit\) is known as the logistic function.

and \(logit(t)\) to denote:

\[logit(t) = \log\left(\frac{1}{1-t}\right)\]

Note that:

\[logit(expit(t)) = t\]

so that

\[logit(p) = \alpha + \beta x\]

Because \(logit(p) = logit(\mathbb{E}(Y|x))\), it is called the link function.

In [1]:
expit<-function(t){
    return(1/(1+exp(-t)))
}
x<-expit((-10:10))
plot(x)

So, how do we perform a logistic regression in R? First, let’s get a data set:

In [2]:
Titanic<-read.csv("titanic.csv")
In [3]:
attach(Titanic)
#Titanic
library(plyr)
Titanic$pclass<-revalue(pclass, c("1st"=1, "2nd"=1,"3rd"=0))
Titanic
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generate
Out[3]:
row.namespclasssurvivednameageembarkedhome.destroomticketboatsex
1111Allen, Miss Elisabeth Walton29SouthamptonSt Louis, MOB-524160 L2212female
2210Allison, Miss Helen Loraine2SouthamptonMontreal, PQ / Chesterville, ONC26female
3310Allison, Mr Hudson Joshua Creighton30SouthamptonMontreal, PQ / Chesterville, ONC26(135)male
4410Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)25SouthamptonMontreal, PQ / Chesterville, ONC26female
5511Allison, Master Hudson Trevor0.9167SouthamptonMontreal, PQ / Chesterville, ONC2211male
6611Anderson, Mr Harry47SouthamptonNew York, NYE-123male
7711Andrews, Miss Kornelia Theodosia63SouthamptonHudson, NYD-713502 L7710female
8810Andrews, Mr Thomas, jr39SouthamptonBelfast, NIA-36male
9911Appleton, Mrs Edward Dale (Charlotte Lamson)58SouthamptonBayside, Queens, NYC-1012female
101010Artagaveytia, Mr Ramon71CherbourgMontevideo, Uruguay(22)male
111110Astor, Colonel John Jacob47CherbourgNew York, NY17754 L224 10s 6d(124)male
121211Astor, Mrs John Jacob (Madeleine Talmadge Force)19CherbourgNew York, NY17754 L224 10s 6d4female
131311Aubert, Mrs Leontine PaulineNACherbourgParis, FranceB-3517477 L69 6s9female
141411Barkworth, Mr Algernon H.NASouthamptonHessle, YorksA-23Bmale
151510Baumann, Mr John D.NASouthamptonNew York, NYmale
161611Baxter, Mrs James (Helene DeLaudeniere Chaput)50CherbourgMontreal, PQB-58/606female
171710Baxter, Mr Quigg Edmond24CherbourgMontreal, PQB-58/60male
181810Beattie, Mr Thomson36CherbourgWinnipeg, MNC-6male
191911Beckwith, Mr Richard Leonard37SouthamptonNew York, NYD-355male
202011Beckwith, Mrs Richard Leonard (Sallie Monypeny)47SouthamptonNew York, NYD-355female
212111Behr, Mr Karl Howell26CherbourgNew York, NYC-1485male
222210Birnbaum, Mr Jakob25CherbourgSan Francisco, CA(148)male
232311Bishop, Mr Dickinson H.25CherbourgDowagiac, MIB-497male
242411Bishop, Mrs Dickinson H. (Helen Walton)19CherbourgDowagiac, MIB-497female
252511Bjornstrm-Steffansson, Mr Mauritz Hakan28SouthamptonStockholm, Sweden / Washington, DC Dmale
262610Blackwell, Mr Stephen Weart45SouthamptonTrenton, NJ(241)male
272711Blank, Mr Henry39CherbourgGlen Ridge, NJA-317male
282811Bonnell, Miss Caroline30SouthamptonYoungstown, OHC-78female
292911Bonnell, Miss Elizabeth58SouthamptonBirkdale, England Cleveland, OhioC-1038female
303010Borebank, Mr John JamesNASouthamptonLondon / Winnipeg, MBD-21/2male
31NANANANANANANANA
1284128400Vestrom, Miss Hulda Amanda AdolfinaNAfemale
1285128500Vonk, Mr JenkoNAmale
1286128600Ware, Mr FrederickNAmale
1287128700Warren, Mr Charles WilliamNAmale
1288128800Wazli, Mr YousifNAmale
1289128900Webber, Mr JamesNAmale
1290129001Wennerstrom, Mr August EdvardNAmale
1291129100Wenzel, Mr LinhartNAmale
1292129200Widegren, Mr Charles PeterNAmale
1293129300Wiklund, Mr Jacob AlfredNAmale
1294129401Wilkes, Mrs EllenNAfemale
1295129500Willer, Mr AaronNAmale
1296129600Willey, Mr EdwardNAmale
1297129700Williams, Mr Howard HughNAmale
1298129800Williams, Mr LeslieNAmale
1299129900Windelov, Mr EinarNAmale
1300130000Wirz, Mr AlbertNAmale
1301130100Wiseman, Mr PhillippeNAmale
1302130200Wittevrongel, Mr CamielNAmale
1303130301Yalsevac, Mr IvanNAmale
1304130400Yasbeck, Mr AntoniNAmale
1305130501Yasbeck, Mrs AntoniNAfemale
1306130600Youssef, Mr GeriosNAmale
1307130700Zabour, Miss HileniNAfemale
1308130800Zabour, Miss TaminiNAfemale
1309130900Zakarian, Mr ArtunNAmale
1310131000Zakarian, Mr MapriederNAmale
1311131100Zenn, Mr PhilipNAmale
1312131200Zievens, ReneNAfemale
1313131300Zimmerman, LeoNAmale
In [4]:
fit.survive<-glm(survived ~ sex + age +pclass,family= "binomial")
In [5]:
summary(fit.survive)
Out[5]:

Call:
glm(formula = survived ~ sex + age + pclass, family = "binomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9784  -0.6520  -0.3142   0.5894   2.7022

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.522163   0.471008   9.601  < 2e-16 ***
sexmale     -3.086709   0.241063 -12.805  < 2e-16 ***
age         -0.049309   0.008732  -5.647 1.63e-08 ***
pclass2nd   -1.495229   0.281986  -5.302 1.14e-07 ***
pclass3rd   -2.841271   0.338897  -8.384  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 869.54  on 632  degrees of freedom
Residual deviance: 539.71  on 628  degrees of freedom
  (680 observations deleted due to missingness)
AIC: 549.71

Number of Fisher Scoring iterations: 5

In [6]:
fit.survive<-glm(survived ~ sex + age + pclass + sex:pclass + age:pclass,family= "binomial")
In [7]:
summary(fit.survive)
Out[7]:

Call:
glm(formula = survived ~ sex + age + pclass + sex:pclass + age:pclass,
    family = "binomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0532  -0.6111  -0.3829   0.4168   2.5174

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)        4.736466   0.756171   6.264 3.76e-10 ***
sexmale           -3.685635   0.510927  -7.214 5.45e-13 ***
age               -0.042485   0.013044  -3.257  0.00113 **
pclass2nd          0.232062   1.095137   0.212  0.83218
pclass3rd         -3.978596   0.888172  -4.480 7.48e-06 ***
sexmale:pclass2nd -0.655681   0.735503  -0.891  0.37268
sexmale:pclass3rd  1.870332   0.631770   2.960  0.00307 **
age:pclass2nd     -0.049047   0.023427  -2.094  0.03629 *
age:pclass3rd      0.007216   0.021396   0.337  0.73593
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 869.54  on 632  degrees of freedom
Residual deviance: 517.73  on 624  degrees of freedom
  (680 observations deleted due to missingness)
AIC: 535.73

Number of Fisher Scoring iterations: 5

In [8]:
detach(Titanic)

Negative Binomial

When \(Y\) follows a negative binomial distribution, we model the mean as:

\[\mathbb{E}(Y|x) = exp(\alpha + \beta x)\]

What is the link function?

The following example is lifted from http://www.ats.ucla.edu/stat/r/dae/nbreg.htm

In [9]:
library(foreign)
library(ggplot2)
library(MASS)
dat <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
In [10]:
dat
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generatedWarning message:
In `[<-.factor`(`*tmp*`, ri, value = "⋮"): invalid factor level, NA generate
Out[10]:
idgendermathdaysabsprog
11001male6342
21002male2742
31003female2022
41004female1632
51005female232
61006female71132
71007female63112
81008male372
91009male51102
101010male4993
111011female3143
121012male2252
131013female7352
141014female7761
151015male1013
161016male8902
171017female3412
181018female3502
191019male7751
201020male4242
211021female2122
221022male6102
231023male4112
241024male6302
251025female1982
261026male5561
271027male672
281028female2102
291029male7022
301030male7931
31NA
2852127male5381
2862128female4313
2872129male4613
2882130female3572
2892131female6363
2902132female7503
2912133female8483
2922134male4102
2932135female3213
2942136male1202
2952137male6143
2962138female99172
2972139male6791
2982140female3503
2992141female7003
3002142female8713
3012143male7033
3022144male5712
3032145male4423
3042146female2623
3052147male6323
3062148female852
3072150male6333
3082151female5973
3092152female4613
3102153male2612
3112154female7933
3122155female5902
3132156female9003
3142157female7723
In [11]:
dat <- within(dat, {
    prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
    id <- factor(id)
})

summary(dat)
Out[11]:
id         gender         math          daysabs               prog
 1001   :  1   female:160   Min.   : 1.00   Min.   : 0.000   General   : 40
 1002   :  1   male  :154   1st Qu.:28.00   1st Qu.: 1.000   Academic  :167
 1003   :  1                Median :48.00   Median : 4.000   Vocational:107
 1004   :  1                Mean   :48.27   Mean   : 5.955
 1005   :  1                3rd Qu.:70.00   3rd Qu.: 8.000
 1006   :  1                Max.   :99.00   Max.   :35.000
 (Other):308
In [12]:
ggplot(dat, aes(daysabs, fill = prog)) + geom_histogram(binwidth = 1) + facet_grid(prog ~
    ., margins = TRUE, scales = "free")
In [13]:
m1 <- glm.nb(daysabs ~ math + prog, data = dat)
In [14]:
summary(m1)
Out[14]:

Call:
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
    link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1547  -1.0192  -0.3694   0.2285   2.5273

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     2.615265   0.197460  13.245  < 2e-16 ***
math           -0.005993   0.002505  -2.392   0.0167 *
progAcademic   -0.440760   0.182610  -2.414   0.0158 *
progVocational -1.278651   0.200720  -6.370 1.89e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)

    Null deviance: 427.54  on 313  degrees of freedom
Residual deviance: 358.52  on 310  degrees of freedom
AIC: 1741.3

Number of Fisher Scoring iterations: 1


              Theta:  1.033
          Std. Err.:  0.106

 2 x log-likelihood:  -1731.258
In [ ]: