Statistics review 8: Qualitative data – tests of association ============================================================ R code accompanying `paper `__) Key learning points ------------------- - How to investigate relationships between two qualitative (categorical) variables - :math:`\chi^2` test of association - Test for trend (one variable is ordinal) - Risk measurement - Evaluating proportions .. code:: r suppressPackageStartupMessages(library(tidyverse)) .. code:: r options(repr.plot.width=4, repr.plot.height=3) Categorical data ---------------- Unordered ~~~~~~~~~ - sex - blood group - classification of disease - whether the patient survived Ordinal ~~~~~~~ - Age group - Clinical scales (e.g. Pain scale) :math:`\chi^2` test of association ---------------------------------- .. code:: r none <- c(686, 451, 168) exit <- c(152, 35, 58) bacteremia <- c(96, 38, 22) o <- data.frame(none=none, exit=exit, bacteremia=bacteremia, row.names= c("jugular", "subclavian", "femoral")) Observed counts ~~~~~~~~~~~~~~~ .. code:: r o .. raw:: html
noneexitbacteremia
jugular68615296
subclavian451 3538
femoral168 5822
.. code:: r m <- addmargins(as.matrix(o)) m .. raw:: html
noneexitbacteremiaSum
jugular 686152 96 934
subclavian 451 35 38 524
femoral 168 58 22 248
Sum1305245 156 1706
Expected counts ~~~~~~~~~~~~~~~ .. code:: r e <- outer(m[1:3,4], m[4,1:3]/m[4,4]) round(e, 1) .. raw:: html
noneexitbacteremia
jugular714.5134.185.4
subclavian400.8 75.347.9
femoral189.7 35.622.7
Calculate :math:`\chi^2` test statistic ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: r T <- sum((o - e)^2/e) round(T, 2) .. raw:: html 51.26 .. code:: r df <- (nrow(o) -1) * (ncol(o) - 1) df .. raw:: html 4 Calculate p-value from :math:`\chi^2` distribution ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: r 1 - pchisq(T, df) .. raw:: html 1.96774818661538e-10 Use built in test ~~~~~~~~~~~~~~~~~ .. code:: r chisq.test(o) .. parsed-literal:: Pearson's Chi-squared test data: o X-squared = 51.262, df = 4, p-value = 1.968e-10 Residuals --------- .. code:: r t <- addmargins(as.matrix(prop.table(o))) t .. raw:: html
noneexitbacteremiaSum
jugular0.402110200.089097300.056271980.5474795
subclavian0.264361080.020515830.022274330.3071512
femoral0.098475970.033997660.012895660.1453693
Sum0.764947250.143610790.091441971.0000000
Adjusted standardized residues ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The numbers don't agree with table 4 in the paper. Student to check by hand calculations? .. code:: r round((o - e)/sqrt(e*(1-t[1:3,4])*(1-t[4,1:3])), 1) .. raw:: html
noneexitbacteremia
jugular-3.3 4.7 3.5
subclavian 3.3-6.0-1.9
femoral-1.8 4.3-0.2
Two by two tables ----------------- .. code:: r treatment <- c(3, 47) control <- c(8, 37) df1 <- data.frame(treatment=treatment, contorl=control) df1 .. raw:: html
treatmentcontorl
3 8
4737
.. code:: r chisq.test(df1, correct = FALSE) .. parsed-literal:: Pearson's Chi-squared test data: df1 X-squared = 3.2089, df = 1, p-value = 0.07324 Fisher’s exact test ------------------- .. code:: r fisher.test(df1) .. parsed-literal:: Fisher's Exact Test for Count Data data: df1 p-value = 0.1083 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.04775944 1.35642029 sample estimates: odds ratio 0.2989279 Yates’ continuity correction ---------------------------- .. code:: r chisq.test(df1, correct=TRUE) .. parsed-literal:: Pearson's Chi-squared test with Yates' continuity correction data: df1 X-squared = 2.1617, df = 1, p-value = 0.1415 Test for trend -------------- .. code:: r alert <- c(1110, 108) vopr <- c(54, 14) unresponsive <- c(14, 6) df2 <- data.frame(alert=alert, vopr=vopr, unresponsive=unresponsive, row.names=c("sruvived", "died")) df2 .. raw:: html
alertvoprunresponsive
sruvived111054 14
died 10814 6
.. code:: r chisq.test(df2) .. parsed-literal:: Warning message in chisq.test(df2): “Chi-squared approximation may be incorrect” .. parsed-literal:: Pearson's Chi-squared test data: df2 X-squared = 19.383, df = 2, p-value = 6.18e-05 Calling the trend test ~~~~~~~~~~~~~~~~~~~~~~ The trend test takes two arguments - Number of events - Number of trials So we have to a little massaging to get the data in the correct format. We can use either row as the number of events ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Using number of survivors. .. code:: r prop.trend.test(as.matrix(df2[1,]), colSums(df2)) .. parsed-literal:: Chi-squared Test for Trend in Proportions data: as.matrix(df2[1, ]) out of colSums(df2) , using scores: 1 2 3 X-squared = 19.328, df = 1, p-value = 1.101e-05 Using number of deaths. .. code:: r prop.trend.test(as.matrix(df2[2,]), colSums(df2)) .. parsed-literal:: Chi-squared Test for Trend in Proportions data: as.matrix(df2[2, ]) out of colSums(df2) , using scores: 1 2 3 X-squared = 19.328, df = 1, p-value = 1.101e-05 Interpretation as slope of linear regression model ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: r points <- c(replicate(1110, c(1,1)), replicate(108, c(1,2)), replicate(54, c(2,1)), replicate(14, c(2,2)), replicate(14, c(3,1)), replicate(6, c(3,2))) .. code:: r length(points) .. raw:: html 2612 .. code:: r df3 <- data.frame(matrix(points, ncol=2, byrow = T)) colnames(df3) <- c("x", "y") .. code:: r summary(lm(y ~ x - 1 , data=df3)) .. parsed-literal:: Call: lm(formula = y ~ x - 1, data = df3) Residuals: Min 1Q Median 3Q Max -1.81677 0.06108 0.06108 0.06108 1.06108 Coefficients: Estimate Std. Error t value Pr(>|t|) x 0.938922 0.009996 93.93 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4085 on 1305 degrees of freedom Multiple R-squared: 0.8711, Adjusted R-squared: 0.871 F-statistic: 8822 on 1 and 1305 DF, p-value: < 2.2e-16 .. code:: r ggplot(df3, aes(x=x, y=y)) + geom_jitter(width = .3, height=.3) + geom_smooth(method="lm") + scale_x_continuous(breaks=1:3, labels=c("alert", "vopr", "unresp")) + scale_y_continuous(breaks=1:2, labels=c("survided", "died")) + xlab("") + ylab("") .. image:: SR08_Qualitative_data_-_tests_of_association_files/SR08_Qualitative_data_-_tests_of_association_44_1.png Measurement of risk ------------------- .. code:: r died <- c(38, 59) survived <- c(79, 60) df4 <- data.frame(died=died, survived=survived, row.names=c("early", "standard")) .. code:: r df4 .. raw:: html
diedsurvived
early3879
standard5960
Risk of death on early treatment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r round(100*df4[1,1]/sum(df4[1,]), 1) .. raw:: html 32.5 Risk of death on standard treatment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r round(100*df4[2,1]/sum(df4[2,]), 1) .. raw:: html 49.6 Odds of death on early treatment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r round(df4[1,1]/sum(df4[1,2]), 2) .. raw:: html 0.48 Odds of death on standard treatment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r round(df4[2,1]/sum(df4[2,2]), 2) .. raw:: html 0.98 Confidence interval for a proportion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Confidence interval for death with early treatment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r n <- sum(df4[1,]) p <- df4[1,1]/n .. code:: r alpha <- 0.05 k <- qnorm(1 - alpha/2) se <- sqrt(p*(1-p)/n) round(100*c(p - k*se, p + k*se), 1) .. raw:: html
  1. 24
  2. 41
Confidence interval for death with early treatment ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r n <- sum(df4[2,]) p <- df4[2,1]/n .. code:: r alpha <- 0.05 k <- qnorm(1 - alpha/2) se <- sqrt(p*(1-p)/n) round(100*c(p - k*se, p + k*se), 1) .. raw:: html
  1. 40.6
  2. 58.6
Comparing risks ~~~~~~~~~~~~~~~ Risk ratio (relative risk) ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code:: r round(df4[1,1]/sum(df4[1,])/ (df4[2,1]/sum(df4[2,])), 2) .. raw:: html 0.66 Odds ratio ^^^^^^^^^^ .. code:: r round((df4[1,1]/sum(df4[1,2]))/(df4[2,1]/sum(df4[2,2])), 2) .. raw:: html 0.49 Testing for difference between two proportions ---------------------------------------------- .. code:: r df4 .. raw:: html
diedsurvived
early3879
standard5960
.. code:: r df4 <- df4 %>% mutate(total=died+survived) df4 .. raw:: html
diedsurvivedtotal
38 79 117
59 60 119
.. code:: r prop.test(df4[,1], df4[,3], correct = FALSE) .. parsed-literal:: 2-sample test for equality of proportions without continuity correction data: df4[, 1] out of df4[, 3] X-squared = 7.1271, df = 1, p-value = 0.007593 alternative hypothesis: two.sided 95 percent confidence interval: -0.29458384 -0.04744015 sample estimates: prop 1 prop 2 0.3247863 0.4957983 Matched samples --------------- .. code:: r breath_plus <- c(40, 4) breath_minus <- c(8, 32) df5 <- data.frame(breath_plus=breath_plus, breath_minus=breath_minus, row.names=c("xxoid_plus", "oxoid_minus")) df5 .. raw:: html
breath_plusbreath_minus
xxoid_plus40 8
oxoid_minus 432
Manual (with Yates correction) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: r T <- ((8-4)-1)^2/((8+4)) T .. raw:: html 0.75 .. code:: r round(1 - pchisq(T, 1), 4) .. raw:: html 0.3865 Using built in test ~~~~~~~~~~~~~~~~~~~ .. code:: r mcnemar.test(as.matrix(df5)) .. parsed-literal:: McNemar's Chi-squared test with continuity correction data: as.matrix(df5) McNemar's chi-squared = 0.75, df = 1, p-value = 0.3865 Exercises --------- .. code:: r suppressPackageStartupMessages(library(MASS)) The rows indicate smoking history, and the columns indicate frequency of exercise ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never", and "Freq" (frequently), "Some" and "None" for Exercise. .. code:: r tbl = table(survey$Smoke, survey$Exer) tbl .. parsed-literal:: Freq None Some Heavy 7 1 3 Never 87 18 84 Occas 12 3 4 Regul 9 1 7 **1**. Is there a evidence for a relationship between smoking history and frequency of exercise? Solve this a) by manual calculation, and b) by using the chisq.test. **2**. Is there evidence for a trend for exercise with frequency of smoking if we consider smoking as an ordinal variables? **3**. Create a new table that uses only the (Freq, None) columns and (Heavy, Never) rows. Compare the results of the :math:`\chi^2` test with and without Yates continuity correction, as well as the Fisher exact test.