Simulations and Statistical Inference¶

In [1]:

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

In [2]:

options(repr.plot.width=4, repr.plot.height=3)

Set random number seed for reproucibility¶

In [3]:

set.seed(42)

Generating numbers¶

Using the c function

In [4]:

c(1,2,3,5,8,13)

1
2
3
5
8
13

Using the : operator

In [5]:

1:10

1
2
3
4
5
6
7
8
9
10

In [6]:

10:1

10
9
8
7
6
5
4
3
2
1

Using the seq function

In [7]:

seq(1, 10)

1
2
3
4
5
6
7
8
9
10

In [8]:

seq(1, 10, 2)

1
3
5
7
9

In [9]:

seq(10, 1, -1)

10
9
8
7
6
5
4
3
2
1

In [10]:

seq(0, 1, length.out = 11)

0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1

Using `rep`¶

rep is useful for generating repeating data patterns.

In [11]:

rep(3, 5)

3
3
3
3
3

In [12]:

rep(1:3, times=5)

1
2
3
1
2
3
1
2
3
1
2
3
1
2
3

In [13]:

rep(1:3, each=5)

1
1
1
1
1
2
2
2
2
2
3
3
3
3
3

In [14]:

rep(1:3, length.out=10)

1
2
3
1
2
3
1
2
3
1

Using `sample`¶

Simulate n tooses of a coin

In [15]:

n <- 10

In [16]:

t1 <- sample(c('H', 'T'), n, replace = TRUE)
t1

'T'
'T'
'H'
'T'
'T'
'T'
'T'
'H'
'T'
'T'

In [17]:

table(t1)

t1
H T
2 8

Simulate n tosses of a biased coin

In [18]:

t2 <- sample(c('H', 'T'), n, replace = TRUE, prob = c(0.3, 0.7))
t2

'T'
'H'
'H'
'T'
'T'
'H'
'H'
'T'
'T'
'T'

In [19]:

table(t2)

t2
H T
4 6

Simulate n rolls of a 6-sided die

In [20]:

n <- 100

In [21]:

d1 <- sample(1:6, n, replace=TRUE)
d1

6
1
6
6
1
4
3
6
3
6
5
5
3
5
1
5
1
2
6
4
3
3
1
6
3
6
6
4
6
4
3
3
3
5
1
5
5
2
2
4
5
6
5
4
6
2
2
5
5
2
1
1
2
3
2
5
1
3
4
1
4
1
3
4
5
4
2
1
1
2
5
1
2
6
6
5
2
4
5
4
4
2
2
3
6
6
5
5
4
1
4
6
5
3
4
4
1
3
4
5

In [22]:

table(d1)

d1
 1  2  3  4  5  6
16 14 15 18 20 17

Sampling without replacement. For example, if we wanted to assiggn 16 samples to treatment A or B at random such that exactly half had each treatment.

In [23]:

sample(rep(c('A', 'B'), each=8))

'A'
'A'
'B'
'A'
'B'
'A'
'B'
'B'
'A'
'B'
'B'
'A'
'A'
'B'
'B'
'A'

Random number generators¶

Disscrete distributionns¶

Sampling from a Bernoullli distribution returns TRUE for success and FALSE for failure.

In [24]:

rbernoulli(n=10, p=0.5)

TRUE
FALSE
TRUE
TRUE
FALSE
FALSE
FALSE
TRUE
TRUE
TRUE

In [25]:

as.integer(rbernoulli(n=10, p=0.5))

0
1
0
0
1
0
1
0
1
1

Sampling from a Binomial distribution returns the number of successes in size trials for n experiments.

In [26]:

rbinom(n=10, p=0.5, size=5)

2
0
1
3
4
3
3
2
3
1

Sampling from a negative binomial distribution returns the number of failures until size succcesses are observed for n experiemnts.

In [27]:

rnbinom(n=10, size=5, prob=0.5)

2
5
6
7
6
6
5
1
0
9

Sampling from a Poisson distribution returns the number of successes in n experiments if the average success rate per experiment is lambda.

In [28]:

rpois(n=10, lambda = 3)

3
5
3
1
3
7
3
2
2
3

Note: We can give different parameters for each experiment in these distributios.

In [54]:

rpois(n=10, lambda=1:10)

2
2
3
5
6
3
6
8
9
9

Continuous distributions¶

Sampling from a stnadard uniform distribution.

In [29]:

runif(5)

0.649875837611035
0.336419132305309
0.0609497462864965
0.451310850214213
0.838755033444613

Sampling form a uniform distribuiotn,

In [30]:

runif(5, 90, 100)

95.7463733432814
93.5335037740879
95.4742607823573
98.9271859382279
94.8999057058245

Sampling from a stnadard normal distribution.

In [31]:

rnorm(5)

-0.94773529068477
1.76797780755595
0.917327777548805
-0.894775408854386
-0.688165944958397

Looping¶

In [90]:

for (i in 1:10) {
    print(mean(rnorm(10)))
}

[1] 0.2660734
[1] -0.09742615
[1] -0.3068994
[1] 0.2910631
[1] 0.1062726
[1] 0.05955272
[1] 0.4080255
[1] -0.1243275
[1] -0.779981
[1] -0.187303

Saving variabels generated in a loop

In [91]:

n <- 10
vars <- numeric(n)
for (i in 1:n) {
    vars[i] <- mean(rnorm(10))
}
vars

-0.25116755575127
0.708916638976887
0.461551515258198
-0.475557761331935
-0.0598841272831985
0.272569234838853
0.0543736652184016
0.135002662203851
-0.190719191248592
0.0387989282162199

Using `replicate`¶

replicate is like rep but works for a function (such as a random number geenerator)

In [32]:

replicate(3, rnorm(5))

-0.9984839	0.15461594	-0.88368060
-2.1072414	-0.80160087	0.07714041
0.8574190	-0.04518119	0.83720136
1.1260757	-1.03824075	0.09992556
-0.1983684	1.55953311	0.30272834

replicate is quite useful for simulations. For exampe, suppose we want to know the distribution of the sample mean if we sampled 100 numbers from the standard normal distribution 1,000 times.

In [ ]:

n_expts <- 1000
n <- 100

Using for loop¶

In [86]:

set.seed(123)
mus <- numeric(n_expts)
for (i in 1:n_expts) {
    mus[i] <- mean(rnorm(n))
}
hist(mus)

../_images/cliburn_Statistical_Inference_01_64_0.png

Using replicate¶

In [89]:

set.seed(123)
mus <- replicate(n_expts, mean(rnorm(n)))
hist(mus)

../_images/cliburn_Statistical_Inference_01_66_0.png

Makign a `data.frame` of simulated data¶

Let’s simulate the following experiment.

There are 10 subjects in Group A and 10 subjects in Group B with ranodm PIDs from 10000-99999
We measure 5 genes in each subject. The genes have the same distribution for each subject, but different genes have differnt distribtutions:
- gene1 \(\sim N(10, 1)\)
- gene2 \(\sim N(11, 2)\)
- gene3 \(\sim N(12, 3)\)
- gene4 \(\sim N(13, 4)\)
- gene5 \(\sim (N(14, 5)\)

In [48]:

replicate(5, rnorm(3, 1:5, 1))

0.6536426	1.334889	1.5432302	0.6573478	2.249920
1.3585636	2.068446	0.8936747	4.2271172	3.146313
3.4023661	2.388056	2.6394386	1.5861398	2.943069

In [75]:

n <- 10
n_genes <- 5
min_pid <- 10000
max_pid <- 99999
groupings <- c('A', 'B')
n_groups <- length(groupings)
gene_mus <- 10:14
gene_sigmas <- 1:5
pad_width <- 3

In [76]:

pids <- sample(min_pid:max_pid, n_groups*n)
groups <- sample(rep(groupings, each=n))
genes <- t(replicate(n_groups*n, rnorm(n_genes, gene_mus, gene_sigmas)))
gene_names <- paste('gene', str_pad(1:n_genes, width = pad_width, pad='0'), sep='')
colnames(genes) <- gene_names

In [77]:

df <- data.frame(
    pid = pids,
    grp = groups,
    genes
)

In [58]:

sample_n(df, 3)

	pid	grp	gene001	gene002	gene003	gene004	gene005
16	38729	B	1.530708	8.613943	10.617613	9.215513	9.880912
13	22215	B	14.545609	10.625483	6.288087	9.741466	9.624057
14	89658	A	13.008224	14.321570	8.115676	12.608564	7.962775

Breakdown of simjulation¶

Set up simulation confiugration parameters.

In [70]:

n <- 10
n_genes <- 5
min_pid <- 10000
max_pid <- 99999
groupings <- c('A', 'B')
n_groups <- length(groupings)
gene_mus <- 10:14
gene_sigmas <- 1:5
pad_width <- 3

Create unique PIDs for each subject

In [62]:

pids <- sample(min_pid:max_pid, n_groups*n)

Assign a group to each subject at ranodm

In [67]:

groups <- sample(rep(groupings, each=n))

Make up 5 genes from different distributions for each subject

In [68]:

genes <- t(replicate(n_groups*n, rnorm(n_genes, gene_mus, gene_sigmas)))

Make nice names for each gene

In [71]:

gene_names <- paste('gene', str_pad(1:n_genes, width = pad_width, pad='0'), sep='')

Assign names to gene columns

In [72]:

colnames(genes) <- gene_names

Create data.frame to storre simulated data

In [73]:

df <- data.frame(
    pid = pids,
    grp = groups,
    genes
)

Peek into data.frame

In [74]:

sample_n(df, 3)

	pid	grp	gene001	gene002	gene003	gene004	gene005
16	72254	B	10.726324	9.749522	9.772832	19.65021	13.226052
18	21661	B	10.448878	12.689833	13.292832	19.66553	14.330836
14	91408	B	9.125949	11.009273	13.052141	14.95065	8.745456

Simulations and Statistical Inference¶

Set random number seed for reproucibility¶

Generating numbers¶

Using rep¶

Using sample¶

Random number generators¶

Disscrete distributionns¶

Continuous distributions¶

Looping¶

Using replicate¶

Using for loop¶

Using replicate¶

Makign a data.frame of simulated data¶

Breakdown of simjulation¶

Using `rep`¶

Using `sample`¶

Using `replicate`¶

Makign a `data.frame` of simulated data¶