Statistics review 6: Nonparametric methods

R code accompanying paper

Key learning points

  • Common nonparametric methods
  • Advantages and disadvantages of nonparametric versus parametric methods
suppressPackageStartupMessages(library(tidyverse))
options(repr.plot.width=4, repr.plot.height=3)

The sign test

rr <- c(0.75, 2.03, 2.29, 2.11, 0.80, 1.50, 0.79, 1.01,
        1.23, 1.48, 2.45, 1.02, 1.03, 1.30, 1.54, 1.27)
ggplot(data.frame(rr=rr), aes(x=rr)) +
  geom_histogram(breaks=seq(0.5, 2.5, 0.25), color='gray', fill='lightgray')
[Figure: histogram of rr]
n <- length(rr)
k <- sum(rr > 1)
p <- 0.5
S <- min(k, n-k)
S
3

Calculation of p value from binomial distribution

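Under the null hypothesis, each observation is equally likely to fall above or below 1, so S behaves like the number of heads in n tosses of a fair coin. How likely are we to see 3 heads or fewer in 16 tosses? That probability, doubled for a two-sided test, is the p-value.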

round(2 * pbinom(S, n, p), 2)
0.02

This is the same as summing the probabilities for 0, 1, 2, and 3 heads.

round(2 * sum(dbinom(0:S, n, p)), 2)
0.02
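
As a cross-check, base R's binom.test performs the same exact binomial calculation. The sketch below repeats the data so that it runs on its own; the names k_above and p_manual are introduced here for illustration only.

```r
# Relative risks from the example above, repeated so the block is self-contained
rr <- c(0.75, 2.03, 2.29, 2.11, 0.80, 1.50, 0.79, 1.01,
        1.23, 1.48, 2.45, 1.02, 1.03, 1.30, 1.54, 1.27)

n <- length(rr)
k_above <- sum(rr > 1)          # observations above the null value of 1

# Manual two-sided p-value, computed as in the cells above
p_manual <- 2 * pbinom(min(k_above, n - k_above), n, 0.5)

# Built-in exact binomial test of H0: P(rr > 1) = 0.5
res <- binom.test(k_above, n, p = 0.5)

round(c(manual = p_manual, binom.test = res$p.value), 2)
```

The two values agree here because the null distribution with p = 0.5 is symmetric; for other values of p, binom.test defines the two-sided p-value differently from simple doubling.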

Sign test for paired data

before <- c(39.7, 59.1, 56.1, 57.7, 60.6, 37.8, 58.2, 33.6, 56.0, 65.3)
after <- c(52.9, 56.7, 61.9, 71.4, 67.7, 50.0, 60.7, 51.3, 59.5, 59.8)

df <- data.frame("Subject" = 1:length(before), "Before" = before, "After" = after)
df
Subject  Before  After
      1    39.7   52.9
      2    59.1   56.7
      3    56.1   61.9
      4    57.7   71.4
      5    60.6   67.7
      6    37.8   50.0
      7    58.2   60.7
      8    33.6   51.3
      9    56.0   59.5
     10    65.3   59.8
d <- df$After - df$Before
n <- length(d)
k <- sum(d > 0)
S <- min(k, n-k)
round(2 * pbinom(S, n, p), 2)
0.11

A t-test on the same differences gives a smaller p-value, at the cost of assuming that the differences follow a normal distribution.

t.test(d)
    One Sample t-test

data:  d
t = 2.8681, df = 9, p-value = 0.01853
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  1.432413 12.127587
sample estimates:
mean of x
     6.78

The Wilcoxon signed rank test

df$Difference <- d
df <- df %>% mutate(rank=rank(abs(Difference))) %>%
  mutate(sign=sign(Difference)) %>%
  arrange(rank)
df
Subject  Before  After  Difference  rank  sign
      2    59.1   56.7        -2.4     1    -1
      7    58.2   60.7         2.5     2     1
      9    56.0   59.5         3.5     3     1
     10    65.3   59.8        -5.5     4    -1
      3    56.1   61.9         5.8     5     1
      5    60.6   67.7         7.1     6     1
      6    37.8   50.0        12.2     7     1
      1    39.7   52.9        13.2     8     1
      4    57.7   71.4        13.7     9     1
      8    33.6   51.3        17.7    10     1
df %>% filter(sign == 1) %>% summarise(Rp = sum(rank))
Rp
50
df %>% filter(sign == -1) %>% summarise(Rn = sum(rank))
Rn
5

The null distribution of the signed rank statistic has no simple closed form, so we'll just use the built-in R function wilcox.test. The statistic V that it reports is the sum of the positive ranks (50 here).

wilcox.test(df$After, df$Before, paired=TRUE)
    Wilcoxon signed rank test

data:  df$After and df$Before
V = 50, p-value = 0.01953
alternative hypothesis: true location shift is not equal to 0
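
The exact p-value above can also be reproduced without wilcox.test. The sketch below uses base R's psignrank (the distribution function of the signed rank statistic) and, as a check, a brute-force enumeration of all sign patterns. It repeats the paired data so that it runs on its own; R_neg, patterns and R_all are names introduced here for illustration.

```r
# Paired data from the example above, repeated so the block is self-contained
before <- c(39.7, 59.1, 56.1, 57.7, 60.6, 37.8, 58.2, 33.6, 56.0, 65.3)
after  <- c(52.9, 56.7, 61.9, 71.4, 67.7, 50.0, 60.7, 51.3, 59.5, 59.8)
d <- after - before
n <- length(d)

r <- rank(abs(d))               # ranks of the absolute differences (no ties here)
R_neg <- sum(r[d < 0])          # sum of the ranks of the negative differences

# Exact two-sided p-value from the signed rank distribution
p_exact <- 2 * psignrank(R_neg, n)

# Same value by brute force: under H0 each rank is equally likely to carry
# a negative sign, so enumerate all 2^n sign patterns (1 = negative).
patterns <- as.matrix(expand.grid(rep(list(0:1), n)))
R_all <- as.vector(patterns %*% seq_len(n))
p_enum <- 2 * mean(R_all <= R_neg)

c(psignrank = p_exact, enumeration = p_enum)
```

Enumeration is only practical for small n, which is also when the exact distribution matters most; with ties or zero differences wilcox.test falls back to a normal approximation instead.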

The Wilcoxon rank sum or Mann–Whitney test

dose <- c(7.2, 15.7, 19.1, 21.6, 26.8, 27.4, 28.5, 32.8, 36.3, 43.2, 44.7,
          5.6, 14.6, 18.2, 21.6, 23.1, 28.3, 31.7, 32.4, 36.8)
grp <- c(rep("Nonprotocolized", 11), rep("Protocolized", 9))
df <- data.frame(dose=dose, grp=grp)
df <- df %>% mutate(rank=rank(dose))
df
dose  grp              rank
 7.2  Nonprotocolized   2.0
15.7  Nonprotocolized   4.0
19.1  Nonprotocolized   6.0
21.6  Nonprotocolized   7.5
26.8  Nonprotocolized  10.0
27.4  Nonprotocolized  11.0
28.5  Nonprotocolized  13.0
32.8  Nonprotocolized  16.0
36.3  Nonprotocolized  17.0
43.2  Nonprotocolized  19.0
44.7  Nonprotocolized  20.0
 5.6  Protocolized      1.0
14.6  Protocolized      3.0
18.2  Protocolized      5.0
21.6  Protocolized      7.5
23.1  Protocolized      9.0
28.3  Protocolized     12.0
31.7  Protocolized     14.0
32.4  Protocolized     15.0
36.8  Protocolized     18.0
df %>% filter(grp == "Protocolized") %>% summarise(sum=sum(rank))
sum
84.5
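# Doses for each group, kept as one-column data frames
x <- df %>% filter(grp == "Protocolized") %>% select(dose)
y <- df %>% filter(grp == "Nonprotocolized") %>% select(dose)
# x[,] and y[,] drop the one-column data frames to plain numeric vectors
wilcox.test(x[,], y[,])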
Warning message in wilcox.test.default(x[, ], y[, ]):
“cannot compute exact p-value with ties”
    Wilcoxon rank sum test with continuity correction

data:  x[, ] and y[, ]
W = 39.5, p-value = 0.4703
alternative hypothesis: true location shift is not equal to 0
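
The W reported above is not the rank sum itself (84.5) but the rank sum of the first sample minus its smallest possible value, n_x(n_x + 1)/2. A minimal sketch of that relationship follows, with the data repeated so that it runs on its own; n_x, R_x and W_manual are names introduced here for illustration.

```r
# Doses from the example above, repeated so the block is self-contained
dose <- c(7.2, 15.7, 19.1, 21.6, 26.8, 27.4, 28.5, 32.8, 36.3, 43.2, 44.7,
          5.6, 14.6, 18.2, 21.6, 23.1, 28.3, 31.7, 32.4, 36.8)
grp <- c(rep("Nonprotocolized", 11), rep("Protocolized", 9))

r <- rank(dose)                           # tied 21.6 values share the midrank 7.5
n_x <- sum(grp == "Protocolized")
R_x <- sum(r[grp == "Protocolized"])      # rank sum for the first sample

# W is the rank sum minus its minimum possible value
W_manual <- R_x - n_x * (n_x + 1) / 2

# suppressWarnings() hides the "cannot compute exact p-value with ties" warning
res <- suppressWarnings(
  wilcox.test(dose[grp == "Protocolized"], dose[grp == "Nonprotocolized"]))

c(rank_sum = R_x, W_manual = W_manual, W_builtin = unname(res$statistic))
```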
t.test(x, y)
    Welch Two Sample t-test

data:  x and y
t = -0.83661, df = 17.926, p-value = 0.4138
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.991115   6.023438
sample estimates:
mean of x mean of y
 23.58889  27.57273

Advantages of nonparametric methods

  • Nonparametric methods require no or very limited assumptions to be made about the format of the data, and they may therefore be preferable when the assumptions required for parametric methods are not valid.
  • Nonparametric methods can be useful for dealing with unexpected, outlying observations that might be problematic with a parametric approach.
  • Nonparametric methods are intuitive and are simple to carry out by hand, for small samples at least.
  • Nonparametric methods are often useful in the analysis of ordered categorical data in which assignation of scores to individual categories may be inappropriate.
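
The point about outlying observations can be illustrated with the paired data from the sign test section. In the sketch below, the largest difference is replaced by a made-up extreme value (177 instead of 17.7, imagined as a recording error; this value is an assumption for illustration, not part of the original data), and the three tests are rerun. The helper sign_p is also introduced here only for this example.

```r
# Paired data from the sign test section, repeated so the block is self-contained
before <- c(39.7, 59.1, 56.1, 57.7, 60.6, 37.8, 58.2, 33.6, 56.0, 65.3)
after  <- c(52.9, 56.7, 61.9, 71.4, 67.7, 50.0, 60.7, 51.3, 59.5, 59.8)
d <- after - before

# Hypothetical recording error: the largest difference (17.7) entered as 177
d_out <- d
d_out[which.max(d)] <- 177

# Two-sided sign test p-value for a vector of differences
sign_p <- function(d) {
  n <- length(d)
  k <- sum(d > 0)
  2 * pbinom(min(k, n - k), n, 0.5)
}

out <- data.frame(
  test     = c("sign", "signed rank", "t"),
  original = c(sign_p(d),     wilcox.test(d)$p.value,     t.test(d)$p.value),
  outlier  = c(sign_p(d_out), wilcox.test(d_out)$p.value, t.test(d_out)$p.value)
)
out
```

The signs and ranks of the differences are unchanged by the extreme value, so the sign test and signed rank test give the same p-values as before, whereas the t-test depends on the mean and standard deviation and is affected.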

Disadvantages of nonparametric methods

  • Nonparametric methods may lack power as compared with more traditional approaches.
  • Nonparametric methods are geared toward hypothesis testing rather than estimation of effects.
  • Tied values can be problematic when these are common.
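
Although these methods are geared toward testing, some estimation is still possible. As a hedged sketch, wilcox.test accepts conf.int = TRUE, which returns a Hodges-Lehmann type estimate of the location shift (the pseudomedian of the differences) with a confidence interval. The paired data are repeated so that the block runs on its own.

```r
# Paired data from the sign test section, repeated so the block is self-contained
before <- c(39.7, 59.1, 56.1, 57.7, 60.6, 37.8, 58.2, 33.6, 56.0, 65.3)
after  <- c(52.9, 56.7, 61.9, 71.4, 67.7, 50.0, 60.7, 51.3, 59.5, 59.8)

# conf.int = TRUE adds an estimate of the After - Before shift and its interval
res <- wilcox.test(after, before, paired = TRUE, conf.int = TRUE)

res$estimate    # pseudomedian of the paired differences
res$conf.int    # 95% confidence interval for the shift
```

The estimate is a pseudomedian rather than a mean, so it is not directly comparable with the mean difference reported by the t-test.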