Linear Regression

A major goal of this morning’s lecture was to introduce the idea of intra-subject variability and how it affects the modeling of a particular data set. We will begin by briefly reviewing simple linear models and discussing how to fit them in R. Then we will look more closely at the Rails example from this morning.

Simple Linear Models

A simple linear model is of the form

\[y_i = \alpha + \beta x_i + \epsilon_i\]

The goal of a linear regression is to find the line

\[y = \alpha +\beta x\]

that best fits the data, i.e. the line for which the sum of the squared errors is minimized. A key assumption in such a model is that the errors \(\epsilon_i\) are independent and normally distributed with mean zero and a common variance. This is important because when this assumption is violated, the standard confidence interval and p-value calculations for the slope \(\beta\) are no longer valid.
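
Written out, “minimizing the sum of squared errors” means choosing the estimates

\[(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha,\beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2\]

and this is exactly the criterion that R’s lm function, used below, optimizes.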

We will use a sample data set from R to illustrate a simple linear regression.

In [1]:
# load the data set
data(ToothGrowth)

# Let's see what it looks like:
ToothGrowth
Out[1]:
   len  supp dose
1  4.2   VC  0.5
2  11.5  VC  0.5
3  7.3   VC  0.5
4  5.8   VC  0.5
5  6.4   VC  0.5
6  10    VC  0.5
7  11.2  VC  0.5
8  11.2  VC  0.5
9  5.2   VC  0.5
10 7     VC  0.5
11 16.5  VC  1
12 16.5  VC  1
13 15.2  VC  1
14 17.3  VC  1
15 22.5  VC  1
16 17.3  VC  1
17 13.6  VC  1
18 14.5  VC  1
19 18.8  VC  1
20 15.5  VC  1
21 23.6  VC  2
22 18.5  VC  2
23 33.9  VC  2
24 25.5  VC  2
25 26.4  VC  2
26 32.5  VC  2
27 26.7  VC  2
28 21.5  VC  2
29 23.3  VC  2
30 29.5  VC  2
31 15.2  OJ  0.5
32 21.5  OJ  0.5
33 17.6  OJ  0.5
34 9.7   OJ  0.5
35 14.5  OJ  0.5
36 10    OJ  0.5
37 8.2   OJ  0.5
38 9.4   OJ  0.5
39 16.5  OJ  0.5
40 9.7   OJ  0.5
41 19.7  OJ  1
42 23.3  OJ  1
43 23.6  OJ  1
44 26.4  OJ  1
45 20    OJ  1
46 25.2  OJ  1
47 25.8  OJ  1
48 21.2  OJ  1
49 14.5  OJ  1
50 27.3  OJ  1
51 25.5  OJ  2
52 26.4  OJ  2
53 22.4  OJ  2
54 24.5  OJ  2
55 24.8  OJ  2
56 30.9  OJ  2
57 26.4  OJ  2
58 27.3  OJ  2
59 29.4  OJ  2
60 23    OJ  2

These data are the results of an experiment in which 60 guinea pigs were divided into six groups of ten. Each group received a daily dose of vitamin C (0.5, 1, or 2 mg), delivered either as ascorbic acid (coded VC) or as orange juice (OJ), and tooth length was then measured. We’ll do two separate fits: one for VC and one for OJ.

In [2]:
# Let's see a scatter plot
with(subset(ToothGrowth,  supp == "VC"), plot(dose, len))
title(main="Growth of Tooth vs Dose of Vitamin C")
In [3]:
with(subset(ToothGrowth,  supp == "OJ"), plot(dose, len))
title(main="Growth of Tooth vs Dose of OJ")
In [4]:
# Fit for vitamin C
fit.tooth.VC<-lm(ToothGrowth$len ~ ToothGrowth$dose,subset=(ToothGrowth$supp=="VC"))
summary(fit.tooth.VC)
Out[4]:

Call:
lm(formula = ToothGrowth$len ~ ToothGrowth$dose, subset = (ToothGrowth$supp ==
    "VC"))

Residuals:
    Min      1Q  Median      3Q     Max
-8.2264 -2.6029  0.0814  2.2288  7.4893

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)         3.295      1.427   2.309   0.0285 *
ToothGrowth$dose   11.716      1.079  10.860 1.51e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.685 on 28 degrees of freedom
Multiple R-squared:  0.8082,        Adjusted R-squared:  0.8013
F-statistic: 117.9 on 1 and 28 DF,  p-value: 1.509e-11

In [5]:
# The same fit for vitamin C, this time using the data= argument (cleaner and preferred)
fit.tooth.VC<-lm(len ~ dose, subset=(supp=="VC"), data=ToothGrowth)
summary(fit.tooth.VC)
Out[5]:

Call:
lm(formula = len ~ dose, data = ToothGrowth, subset = (supp ==
    "VC"))

Residuals:
    Min      1Q  Median      3Q     Max
-8.2264 -2.6029  0.0814  2.2288  7.4893

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.295      1.427   2.309   0.0285 *
dose          11.716      1.079  10.860 1.51e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.685 on 28 degrees of freedom
Multiple R-squared:  0.8082,        Adjusted R-squared:  0.8013
F-statistic: 117.9 on 1 and 28 DF,  p-value: 1.509e-11

Discussion

  • What is the biological hypothesis?
  • What is the statistical hypothesis?
  • How do we test the statistical hypothesis?
  • What is your conclusion?

Look at OJ and Tooth Growth

In [6]:
# Fit for OJ
fit.tooth.OJ<-lm(ToothGrowth$len ~ ToothGrowth$dose,subset=(ToothGrowth$supp=="OJ"))
summary(fit.tooth.OJ)
Out[6]:

Call:
lm(formula = ToothGrowth$len ~ ToothGrowth$dose, subset = (ToothGrowth$supp ==
    "OJ"))

Residuals:
    Min      1Q  Median      3Q     Max
-7.2557 -3.7979 -0.0643  3.3521  7.9386

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        11.550      1.722   6.708 2.79e-07 ***
ToothGrowth$dose    7.811      1.302   6.001 1.82e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.446 on 28 degrees of freedom
Multiple R-squared:  0.5626,        Adjusted R-squared:  0.547
F-statistic: 36.01 on 1 and 28 DF,  p-value: 1.825e-06

Note that the lm function (short for “linear model”) takes its model specification in a specific formula format:

\[y\_values \sim x\_values\]

or, more formally stated: Dependent Variable ~ Independent Variable.

R Formula Syntax

A more complicated linear model might include more than one main effect and/or interaction terms such as:

\[y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 +\epsilon\]

Below is a table of R formula syntax and the corresponding model. There are others, but that is a topic for more advanced study.

Syntax                    Model
\(x_1 + x_2\)             \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
\(x_1 : x_2\)             \(y = \alpha + \beta_1 x_1 x_2 + \epsilon\)
\(x_1 * x_2\)             \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon\)
\(x_1 * x_2 * x_3\)       \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon\)
\((x_1 + x_2 + x_3)^2\)   \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \epsilon\)
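
As a quick check of the table, the * operator is shorthand for the main effects plus their interaction, so the two calls below fit identical models. This is a minimal sketch using the ToothGrowth data loaded earlier; the object names fit.a and fit.b are ours, not part of the data set.

# '*' expands to main effects plus interaction: these two fits are identical
fit.a <- lm(len ~ dose * supp, data = ToothGrowth)
fit.b <- lm(len ~ dose + supp + dose:supp, data = ToothGrowth)
all.equal(coef(fit.a), coef(fit.b))   # TRUE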

Work!

Load the R data set called “mtcars”.

  • Perform a linear regression on mpg using cyl and hp as main effects.
  • Add wt (weight) as a third main effect.
  • Add interaction terms for hp and wt (one possible sketch follows).
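
If you want to check your work, below is one possible sketch. It assumes only the standard mtcars columns (cyl, hp, wt); the object names fit1 through fit3 are just for illustration.

data(mtcars)

# main effects only
fit1 <- lm(mpg ~ cyl + hp, data = mtcars)
summary(fit1)

# add wt as a third main effect
fit2 <- lm(mpg ~ cyl + hp + wt, data = mtcars)
summary(fit2)

# hp * wt adds the hp:wt interaction to the main effects
fit3 <- lm(mpg ~ cyl + hp * wt, data = mtcars)
summary(fit3)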

Rails Example

In [7]:
# Rails example

# Just loading some library functions
library(nlme)
library(lattice)

# Get the dataset (it is part of R)
data(Rail)

# Look at the data
Rail
Out[7]:
   Rail travel
1   1    55
2   1    53
3   1    54
4   2    26
5   2    37
6   2    32
7   3    78
8   3    91
9   3    85
10  4    92
11  4   100
12  4    96
13  5    49
14  5    51
15  5    50
16  6    80
17  6    85
18  6    83
In [8]:
#Plot the data

xyplot(travel~Rail,data=Rail,ylab="Zero-force travel time (nano-seconds)")
In [9]:
# Linear Model ignoring random effect

fit.rails.simple<-lm(travel~1,data=Rail)
summary(fit.rails.simple)
res<-fit.rails.simple$residuals
qqnorm(res);qqline(res)

Out[9]:

Call:
lm(formula = travel ~ 1, data = Rail)

Residuals:
   Min     1Q Median     3Q    Max
-40.50 -16.25   0.00  18.50  33.50

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   66.500      5.573   11.93  1.1e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.65 on 17 degrees of freedom

The above plot is called a ‘Q-Q plot’. The straight line represents the quantiles of the normal distribution, while the dots are the quantiles of the residuals. If the errors were normally distributed, the dots would fall close to or on the line.

Note: In this example, we are fitting a one-parameter (intercept-only) model, i.e.

\[y_i = \alpha + \epsilon_i\]

where \(\alpha\) is the overall mean travel time. The least-squares estimate of \(\alpha\) is simply the sample mean, and the residual standard error is the sample standard deviation, as the next cell confirms.

In [10]:
mean(Rail$travel)
sd(Rail$travel)
Out[10]:
66.5
Out[10]:
23.6450467390978
In [11]:

# Linear Model incorporating random effect
fit.rails.mm<-lme(travel~1,random=~1|Rail,data=Rail)
summary(fit.rails.mm)
res<-fit.rails.mm$residuals
Out[11]:
Linear mixed-effects model fit by REML
 Data: Rail
      AIC      BIC   logLik
  128.177 130.6766 -61.0885

Random effects:
 Formula: ~1 | Rail
        (Intercept) Residual
StdDev:    24.80547 4.020779

Fixed effects: travel ~ 1
            Value Std.Error DF  t-value p-value
(Intercept)  66.5  10.17104 12 6.538173       0

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.61882658 -0.28217671  0.03569328  0.21955784  1.61437744

Number of Observations: 18
Number of Groups: 6
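
Compare the two fits. The ordinary lm, which treats all 18 observations as independent, reported a standard error of 5.57 for the mean; the mixed model, which recognizes that the observations come in clusters of three from only 6 rails, reports 10.17. The mixed model also decomposes the variability into a large between-rail component (SD about 24.8) and a small within-rail component (SD about 4.0), whereas lm lumped everything into a single residual SD of 23.65.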

Simulation
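
To quantify how badly ignoring the clustering misleads us, we simulate data from a null model with no fixed effect (the true mean is zero), fit both lm and lme, and record the p-value for the intercept from each. Under the null hypothesis, a well-calibrated test produces uniformly distributed p-values and rejects at the 0.05 level only 5% of the time.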

In [12]:
sim.ranef=function(nk,n,se,sb,verbose=FALSE)
    {
        # n experimental units with nk replicates each
        N=n*nk

        # generate errors from a normal distribution with mean zero and standard deviation se
        e=rnorm(N,0,se)

        # generate random effects from a normal with mean zero and standard deviation sb;
        # we draw n values (one per unit) and replicate each of them nk times
        b=rep(rnorm(n,0,sb),each=nk)

        # assign a label to each of the n units (e.g. each rail), repeated nk times
        id=rep(1:n,each=nk)

        # simulated outcome: the true mean is 0, so the null hypothesis holds
        y=0+e+b

        # create a data frame with the id, random effects, errors and outcome
        dat=data.frame(id,b,e,y)

        # fit both models: standard linear model (lm) and linear mixed-effects model (lme)
        mod0=summary(lm(y~1,data=dat))
        mod1=summary(lme(y~1,random=~1|id,data=dat))

        # extract the p-values (each package stores them in a slightly different table)
        pval0=mod0$coef["(Intercept)","Pr(>|t|)"]
        pval1=mod1$tTable["(Intercept)","p-value"]

        # change the output based on the verbose option
        if(verbose)
            {
                out=list(dat,mod0,mod1) # a list can hold objects of any type
            }
        else
            {
                out=c(pval0,pval1) # output only the p-values from the two models
            }
        out
    }



In [13]:
# run the first simulation

set.seed(210)

# simulate 3 measurements on each of 6 subjects

ex1=sim.ranef(3,6,0.25,0.5,verbose=TRUE)

#print the simulated data set - note that it is stored in element 1 of a list.
ex1[[1]]
Out[13]:
   id  b           e            y
1   1  -0.1488067   0.007527368  -0.1412794
2   1  -0.1488067   0.3107647     0.161958
3   1  -0.1488067  -0.1323212    -0.281128
4   2   1.498304   -0.1476185     1.350686
5   2   1.498304   -0.02297237    1.475332
6   2   1.498304   -0.1831132     1.315191
7   3  -0.5167083   0.4015476    -0.1151607
8   3  -0.5167083  -0.2814648    -0.7981732
9   3  -0.5167083  -0.1040145    -0.6207228
10  4   0.1644428   0.4328612     0.597304
11  4   0.1644428   0.4492927     0.6137355
12  4   0.1644428  -0.111888      0.05255479
13  5   0.1138402   0.524783      0.6386232
14  5   0.1138402  -0.02295953    0.09088062
15  5   0.1138402  -0.5665937    -0.4527536
16  6   0.8765584  -0.0401512     0.8364071
17  6   0.8765584  -0.05409372    0.8224646
18  6   0.8765584  -0.04234974    0.8342086
In [14]:
# plot the simulation data using xyplot (from the lattice package)
xyplot(y~id,data=ex1[[1]])

# print the output from the two fits.

ex1[[2]]
ex1[[3]]


Out[14]:

Call:
lm(formula = y ~ 1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-1.15262 -0.48920  0.02518  0.47682  1.12088

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.3545     0.1623   2.184   0.0433 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6885 on 17 degrees of freedom

Out[14]:
Linear mixed-effects model fit by REML
 Data: dat
       AIC      BIC    logLik
  30.77925 33.27889 -12.38963

Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev:   0.6534632  0.31193

Fixed effects: y ~ 1
                Value Std.Error DF  t-value p-value
(Intercept) 0.3544515 0.2767212 12 1.280898  0.2244

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.80653704 -0.47853843  0.09922849  0.56888267  1.69225071

Number of Observations: 18
Number of Groups: 6
In [15]:
# Run the simulation 1000 times
set.seed(210)
B=1000
res=replicate(B,sim.ranef(3,6,0.25,0.5,verbose=FALSE)) # verbose=FALSE: return only the two p-values

# Compute the type I error rates
mean(res[1,]<0.05)
mean(res[2,]<0.05)

Out[15]:
0.247
Out[15]:
0.072
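
Both numbers estimate the type I error rate, which should be near the nominal 0.05. The naive lm rejects about 25% of the time, roughly five times too often, while lme stays close to the nominal level. The Q-Q plots in the next cell show the same thing: under the null, a calibrated test yields uniform p-values, so the dots should follow the identity line.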
In [16]:
par(mfrow=c(1,3))
qqplot(qunif(ppoints(500)), res[1,],cex=0.1,xlab="Uniform Distribution",ylab="lm")
abline(0,1)
qqplot(qunif(ppoints(500)), res[2,],cex=0.1,xlab="Uniform Distribution",ylab="lme")
abline(0,1)
plot(res[1,],res[2,],pch=19,cex=0.5,xlab="lm",ylab="lme")
abline(0,1)

In [17]:
# Increase the sample size: 50 experimental units instead of 6, still 3 replicates each
B=1000
res=replicate(B,sim.ranef(3,50,0.25,0.5,verbose=FALSE))
mean(res[1,]<0.05)

mean(res[2,]<0.05)


par(mfrow=c(1,3))
qqplot(qunif(ppoints(500)), res[1,],cex=0.1,xlab="Uniform Distribution",ylab="lm")
abline(0,1)
qqplot(qunif(ppoints(500)), res[2,],cex=0.1,xlab="Uniform Distribution",ylab="lme")
abline(0,1)
plot(res[1,],res[2,],pch=19,cex=0.5,xlab="lm",ylab="lme")
abline(0,1)
Out[17]:
0.215
Out[17]:
0.052
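
More data does not fix the problem: with 50 units instead of 6, lm's type I error remains inflated at about 21%, while lme sits essentially at the nominal 5%. The bias comes from ignoring the correlation structure, not from having too small a sample.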
In [18]:
# Clustering Effect on Two sample problem
sim.twosample.clustered=function(nk,n,se,sb,verbose=FALSE)
    {
        # two groups, each with n units and nk replicates per unit
        N=n*nk
        # within-unit errors for each group (sd = se)
        e0=rnorm(N,0,se)
        e1=rnorm(N,0,se)
        # unit-level random effects for each group (sd = sb), each replicated nk times
        b0=rep(rnorm(n,0,sb),each=nk)
        b1=rep(rnorm(n,0,sb),each=nk)
        # outcomes: both groups have true mean zero, so the null hypothesis holds
        y0=e0+b0
        y1=e1+b1
        # two-sample t-test that ignores the clustering; return its p-value
        t.test(y0,y1)$p.value
    }

set.seed(2314)
B=1000
# Simulate with no clustering effect (sb=0)
pval0=replicate(B,sim.twosample.clustered(3,10,0.25,0))
# Simulate with a clustering effect (sb>0)
pval1=replicate(B,sim.twosample.clustered(3,10,0.25,0.5))
rm(B)

mean(pval0<0.05)
mean(pval1<0.05)
par(mfrow=c(1,2))
qqplot(qunif(ppoints(500)), pval0,cex=0.1,xlab="Uniform Distribution")
abline(0,1)
qqplot(qunif(ppoints(500)), pval1,cex=0.1,xlab="Uniform Distribution")
abline(0,1)
Out[18]:
0.049
Out[18]:
0.252
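
With no clustering (sb = 0) the two-sample t-test attains its nominal 5% level, but when replicates within a unit are correlated (sb = 0.5) the type I error inflates to about 25%. Treating correlated replicates as if they were independent observations makes the test badly anti-conservative.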