Statistics review 8: Qualitative data – tests of association

R code accompanying paper)

Key learning points

  • How to investigate relationships between two qualitative (categorical) variables
  • \(\chi^2\) test of association
  • Test for trend (one variable is ordinal)
  • Risk measurement
  • Evaluating proportions
suppressPackageStartupMessages(library(tidyverse))
options(repr.plot.width=4, repr.plot.height=3)

Categorical data

Unordered

  • sex
  • blood group
  • classification of disease
  • whether the patient survived

Ordinal

  • Age group
  • Clinical scales (e.g. Pain scale)

\(\chi^2\) test of association

none <-  c(686, 451, 168)
exit <- c(152, 35, 58)
bacteremia <- c(96, 38, 22)

o <- data.frame(none=none, exit=exit, bacteremia=bacteremia,
                row.names= c("jugular", "subclavian", "femoral"))

Observed counts

o
noneexitbacteremia
jugular68615296
subclavian451 3538
femoral168 5822
m <- addmargins(as.matrix(o))
m
noneexitbacteremiaSum
jugular 686152 96 934
subclavian 451 35 38 524
femoral 168 58 22 248
Sum1305245 156 1706

Expected counts

e <- outer(m[1:3,4], m[4,1:3]/m[4,4])
round(e, 1)
noneexitbacteremia
jugular714.5134.185.4
subclavian400.8 75.347.9
femoral189.7 35.622.7

Calculate \(\chi^2\) test statistic

T <- sum((o - e)^2/e)
round(T, 2)
51.26
df <- (nrow(o) -1) * (ncol(o)  - 1)
df
4

Calculate p-value from \(\chi^2\) distribution

1 - pchisq(T, df)
1.96774818661538e-10

Use built in test

chisq.test(o)
    Pearson's Chi-squared test

data:  o
X-squared = 51.262, df = 4, p-value = 1.968e-10

Residuals

t <- addmargins(as.matrix(prop.table(o)))
t
noneexitbacteremiaSum
jugular0.402110200.089097300.056271980.5474795
subclavian0.264361080.020515830.022274330.3071512
femoral0.098475970.033997660.012895660.1453693
Sum0.764947250.143610790.091441971.0000000

Adjusted standardized residues

The numbers don’t agree with table 4 in the paper. Student to check by hand calculations?

round((o - e)/sqrt(e*(1-t[1:3,4])*(1-t[4,1:3])), 1)
noneexitbacteremia
jugular-3.3 4.7 3.5
subclavian 3.3-6.0-1.9
femoral-1.8 4.3-0.2

Two by two tables

treatment <- c(3, 47)
control <- c(8, 37)
df1 <- data.frame(treatment=treatment, contorl=control)
df1
treatmentcontorl
3 8
4737
chisq.test(df1, correct = FALSE)
    Pearson's Chi-squared test

data:  df1
X-squared = 3.2089, df = 1, p-value = 0.07324

Fisher’s exact test

fisher.test(df1)
    Fisher's Exact Test for Count Data

data:  df1
p-value = 0.1083
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.04775944 1.35642029
sample estimates:
odds ratio
 0.2989279

Yates’ continuity correction

chisq.test(df1, correct=TRUE)
    Pearson's Chi-squared test with Yates' continuity correction

data:  df1
X-squared = 2.1617, df = 1, p-value = 0.1415

Test for trend

alert <- c(1110, 108)
vopr <- c(54, 14)
unresponsive <- c(14, 6)
df2 <- data.frame(alert=alert, vopr=vopr, unresponsive=unresponsive,
                  row.names=c("sruvived", "died"))
df2
alertvoprunresponsive
sruvived111054 14
died 10814 6
chisq.test(df2)
Warning message in chisq.test(df2):
“Chi-squared approximation may be incorrect”
    Pearson's Chi-squared test

data:  df2
X-squared = 19.383, df = 2, p-value = 6.18e-05

Calling the trend test

The trend test takes two arguments

  • Number of events
  • Number of trials

So we have to a little massaging to get the data in the correct format.

We can use either row as the number of events

Using number of survivors.

prop.trend.test(as.matrix(df2[1,]), colSums(df2))
    Chi-squared Test for Trend in Proportions

data:  as.matrix(df2[1, ]) out of colSums(df2) ,
 using scores: 1 2 3
X-squared = 19.328, df = 1, p-value = 1.101e-05

Using number of deaths.

prop.trend.test(as.matrix(df2[2,]), colSums(df2))
    Chi-squared Test for Trend in Proportions

data:  as.matrix(df2[2, ]) out of colSums(df2) ,
 using scores: 1 2 3
X-squared = 19.328, df = 1, p-value = 1.101e-05

Interpretation as slope of linear regression model

points <- c(replicate(1110, c(1,1)), replicate(108, c(1,2)),
            replicate(54, c(2,1)),  replicate(14, c(2,2)),
            replicate(14, c(3,1)), replicate(6, c(3,2)))
length(points)
2612
df3 <- data.frame(matrix(points, ncol=2, byrow = T))
colnames(df3) <- c("x", "y")
summary(lm(y ~ x - 1 , data=df3))
Call:
lm(formula = y ~ x - 1, data = df3)

Residuals:
     Min       1Q   Median       3Q      Max
-1.81677  0.06108  0.06108  0.06108  1.06108

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x 0.938922   0.009996   93.93   <2e-16 *
---
Signif. codes:  0 ‘*’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4085 on 1305 degrees of freedom
Multiple R-squared:  0.8711,        Adjusted R-squared:  0.871
F-statistic:  8822 on 1 and 1305 DF,  p-value: < 2.2e-16
ggplot(df3, aes(x=x, y=y)) + geom_jitter(width = .3, height=.3) +
geom_smooth(method="lm") +
scale_x_continuous(breaks=1:3, labels=c("alert", "vopr", "unresp")) +
scale_y_continuous(breaks=1:2, labels=c("survided", "died")) +
xlab("") + ylab("")
_images/SR08_Qualitative_data_-_tests_of_association_44_1.png

Measurement of risk

died <- c(38, 59)
survived <- c(79, 60)
df4 <- data.frame(died=died, survived=survived, row.names=c("early", "standard"))
df4
diedsurvived
early3879
standard5960
round(100*df4[1,1]/sum(df4[1,]), 1)
32.5
round(100*df4[2,1]/sum(df4[2,]), 1)
49.6
round(df4[1,1]/sum(df4[1,2]), 2)
0.48
round(df4[2,1]/sum(df4[2,2]), 2)
0.98

Confidence interval for a proportion

Confidence interval for death with early treatment

n <- sum(df4[1,])
p <- df4[1,1]/n
alpha <- 0.05
k <- qnorm(1 - alpha/2)
se <- sqrt(p*(1-p)/n)
round(100*c(p - k*se, p + k*se), 1)
  1. 24
  2. 41

Confidence interval for death with early treatment

n <- sum(df4[2,])
p <- df4[2,1]/n
alpha <- 0.05
k <- qnorm(1 - alpha/2)
se <- sqrt(p*(1-p)/n)
round(100*c(p - k*se, p + k*se), 1)
  1. 40.6
  2. 58.6

Comparing risks

Risk ratio (relative risk)

round(df4[1,1]/sum(df4[1,])/ (df4[2,1]/sum(df4[2,])), 2)
0.66

Odds ratio

round((df4[1,1]/sum(df4[1,2]))/(df4[2,1]/sum(df4[2,2])), 2)
0.49

Testing for difference between two proportions

df4
diedsurvived
early3879
standard5960
df4 <- df4 %>% mutate(total=died+survived)
df4
diedsurvivedtotal
38 79 117
59 60 119
prop.test(df4[,1], df4[,3], correct = FALSE)
    2-sample test for equality of proportions without continuity
    correction

data:  df4[, 1] out of df4[, 3]
X-squared = 7.1271, df = 1, p-value = 0.007593
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.29458384 -0.04744015
sample estimates:
   prop 1    prop 2
0.3247863 0.4957983

Matched samples

breath_plus <- c(40, 4)
breath_minus <- c(8, 32)
df5 <- data.frame(breath_plus=breath_plus, breath_minus=breath_minus,
                 row.names=c("xxoid_plus", "oxoid_minus"))
df5
breath_plusbreath_minus
xxoid_plus40 8
oxoid_minus 432

Manual (with Yates correction)

T <- ((8-4)-1)^2/((8+4))
T
0.75
round(1 - pchisq(T, 1), 4)
0.3865

Using built in test

mcnemar.test(as.matrix(df5))
    McNemar's Chi-squared test with continuity correction

data:  as.matrix(df5)
McNemar's chi-squared = 0.75, df = 1, p-value = 0.3865

Exercises

suppressPackageStartupMessages(library(MASS))

The rows indicate smoking history, and the columns indicate frequency of exercise

The allowed values in Smoke are “Heavy”, “Regul” (regularly), “Occas” (occasionally) and “Never”, and “Freq” (frequently), “Some” and “None” for Exercise.

tbl = table(survey$Smoke, survey$Exer)
tbl
      Freq None Some
Heavy    7    1    3
Never   87   18   84
Occas   12    3    4
Regul    9    1    7

1. Is there a evidence for a relationship between smoking history and frequency of exercise? Solve this a) by manual calculation, and b) by using the chisq.test.

2. Is there evidence for a trend for exercise with frequency of smoking if we consider smoking as an ordinal variables?

3. Create a new table that uses only the (Freq, None) columns and (Heavy, Never) rows. Compare the results of the \(\chi^2\) test with and without Yates continuity correction, as well as the Fisher exact test.