Statistics review 1: Presenting and summarising data
====================================================

R code to accompany the paper.

Key learning points
-------------------

-  The first step in any analysis is to describe and summarize the data
-  Understand the following:

   -  qualitative data (unordered and ordered) and quantitative data
      (discrete and continuous)
   -  how these types of data can be represented figuratively
   -  the two important features of a quantitative dataset (location
      and variability)
   -  the measures of location (mean, median and mode)
   -  the measures of variability (range, interquartile range, standard
      deviation and variance)
   -  common distributions of clinical data
   -  simple transformations of positively skewed data.

.. code:: r

    suppressPackageStartupMessages(library(tidyverse))

Set default plot size to 4" by 3"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    options(repr.plot.width=4, repr.plot.height=3)

Summary Statistics
------------------

Read in data set from file
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    df1 <- read.csv('data/haemoglobin.csv', header=FALSE, col.names=c("hb"))

Working with data frames
^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    class(df1)

.. parsed-literal::

    'data.frame'

.. code:: r

    head(df1)

.. parsed-literal::

    hb
    5.4
    8.2
    6.4
    8.3
    6.4
    8.3

.. code:: r

    str(df1)

.. parsed-literal::

    'data.frame': 48 obs. of 1 variable:
     $ hb: num 5.4 8.2 6.4 8.3 6.4 8.3 7 8.6 7.1 8.8 ...

.. code:: r

    dim(df1)

.. parsed-literal::

    48 1

.. code:: r

    summary(df1)

.. parsed-literal::

           hb
     Min.   : 5.400
     1st Qu.: 8.750
     Median : 9.800
     Mean   : 9.869
     3rd Qu.:10.800
     Max.   :14.100

Dataframe indexing
^^^^^^^^^^^^^^^^^^

.. code:: r

    df1$hb

.. parsed-literal::

    5.4 8.2 6.4 8.3 6.4 8.3 7 8.6 7.1 8.8 7.3 8.9
    7.7 9.1 8.1 9.3 9.3 9.9 9.4 9.9 9.4 9.9 9.4 10.1
    9.4 10.3 9.5 10.3 9.7 10.4 9.7 10.4 10.5 11.9 10.5 12.3
    10.6 12.6 10.8 12.7 10.8 13 11.3 13.3 11.7 14 11.7 14.1

.. code:: r

    df1[,1]

.. parsed-literal::

    5.4 8.2 6.4 8.3 6.4 8.3 7 8.6 7.1 8.8 7.3 8.9
    7.7 9.1 8.1 9.3 9.3 9.9 9.4 9.9 9.4 9.9 9.4 10.1
    9.4 10.3 9.5 10.3 9.7 10.4 9.7 10.4 10.5 11.9 10.5 12.3
    10.6 12.6 10.8 12.7 10.8 13 11.3 13.3 11.7 14 11.7 14.1

.. code:: r

    df1[1,]

.. parsed-literal::

    5.4

.. code:: r

    df1[5:10,]

.. parsed-literal::

    6.4 8.3 7 8.6 7.1 8.8
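Because ``df1`` has a single column, row indexing such as ``df1[1,]`` returns a plain numeric vector rather than a data frame. A minimal sketch of how ``drop=FALSE`` keeps the data frame structure is given here. The data frame ``toy`` is a made-up stand-in (an assumption for illustration, not the haemoglobin data) so that the block runs on its own.

.. code:: r

    # Hypothetical single-column data frame standing in for df1
    toy <- data.frame(hb=c(5.4, 8.2, 6.4, 8.3))

    # Default indexing drops the data frame structure when one column remains
    v <- toy[2:3, ]
    class(v)   # "numeric"

    # drop=FALSE keeps the result as a data frame
    d <- toy[2:3, , drop=FALSE]
    class(d)   # "data.frame"
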

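Measuring location
~~~~~~~~~~~~~~~~~~

Using a custom function
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    # Arithmetic mean: sum of the values divided by how many there are
    sr.mean <- function(x) {
        sum(x)/length(x)
    }

.. code:: r

    sr.mean(df1$hb)

.. parsed-literal::

    9.86875

Using built-in functions
^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    # Note: base R's mode() returns the storage mode of its argument
    # ("numeric" here), not the most frequent value
    df1 %>% summarize(mean=mean(hb), median=median(hb), mode=mode(hb))

.. parsed-literal::

    mean     median  mode
    9.86875  9.8     numeric

The ``mode`` column shows ``numeric`` because base R's ``mode()`` reports how the vector is stored. It is not the statistical mode (the most frequent value), for which base R has no built-in function.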
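Since base R does not provide a function for the statistical mode, one possible sketch is shown here. The name ``sr.mode`` is hypothetical (chosen to match ``sr.mean``), and the function returns every value that is tied for the highest count.

.. code:: r

    # Hypothetical helper: most frequent value(s) of a vector
    sr.mode <- function(x) {
        ux <- unique(x)                    # distinct values, in order of first appearance
        counts <- tabulate(match(x, ux))   # how often each distinct value occurs
        ux[counts == max(counts)]          # all values that attain the maximum count
    }

    sr.mode(c(1, 2, 2, 3))   # 2

Under these assumptions it could be applied to the haemoglobin data as ``sr.mode(df1$hb)``.
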

Visualizing the data distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    ggplot(df1, aes(x=hb)) +
        geom_histogram(binwidth=0.5, fill="grey", color="darkgrey")

.. image:: SR01_Presenting_and_summarising_data_files/SR01_Presenting_and_summarising_data_25_1.png

Measuring variability
~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    range(df1$hb)

.. parsed-literal::

    5.4 14.1

.. code:: r

    quantile(df1$hb, c(0.25, 0.75))

.. parsed-literal::

    25%: 8.75
    75%: 10.8

.. code:: r

    sd(df1$hb)

.. parsed-literal::

    1.97291837512189

.. code:: r

    var(df1$hb)

.. parsed-literal::

    3.89240691489362

.. code:: r

    df1 %>% summarize(min=min(hb), max=max(hb),
                      iqr=quantile(hb, 0.75) - quantile(hb, 0.25),
                      sd=sd(hb), var=var(hb))

.. parsed-literal::

    min  max   iqr   sd        var
    5.4  14.1  2.05  1.972918  3.892407
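The measures above are closely related: the variance is the square of the standard deviation, and the interquartile range is the difference between the 75th and 25th percentiles. A short sketch that checks these relationships with base R's ``IQR()`` and ``all.equal()`` follows. The vector ``x`` is made up for illustration and is not the haemoglobin data.

.. code:: r

    # Made-up values, used only to illustrate the relationships
    x <- c(5.4, 8.2, 6.4, 8.3, 7.0, 8.6, 9.1, 10.3)

    # Variance is the squared standard deviation
    all.equal(sd(x)^2, var(x))

    # Interquartile range computed by hand and with the built-in IQR()
    iqr.manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))
    all.equal(iqr.manual, IQR(x))

    # Width of the range: largest value minus smallest value
    all.equal(diff(range(x)), max(x) - min(x))
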

Using a convenience function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    summary(df1)

.. parsed-literal::

           hb
     Min.   : 5.400
     1st Qu.: 8.750
     Median : 9.800
     Mean   : 9.869
     3rd Qu.:10.800
     Max.   :14.100

Common distributions and simple transformations
-----------------------------------------------

**Exercise**. Read in the ``urea.csv`` data file into a data frame ``df2`` and name the column ``urea``.

.. code:: r

    df2 <- read.csv("data/urea.csv", header=FALSE, col.names=c("urea"))

.. code:: r

    head(df2, n=3)

.. parsed-literal::

    urea
    16.007049
    13.647212
    6.653046

.. code:: r

    g <- ggplot(df2, aes(x=urea))
    g <- g + geom_histogram(binwidth=1, fill="grey", color="darkgrey")
    g

.. image:: SR01_Presenting_and_summarising_data_files/SR01_Presenting_and_summarising_data_38_1.png

Skewness
^^^^^^^^

The histogram has a long right tail, so we say the data have a positive, or right, skew. The name comes from a statistical measure called skewness, which is positive for distributions with a long right tail and negative for those with a long left tail.

.. code:: r

    # Install the package that provides skewness(), then load it
    install.packages("e1071", repos = "http://cran.r-project.org")
    library(e1071)

.. parsed-literal::

    The downloaded binary packages are in
    	/var/folders/3l/tbmzdkss71152d8t9n1f8nx40000gn/T//RtmpdeIh9P/downloaded_packages

.. code:: r

    skewness(df2$urea)

.. parsed-literal::

    1.80179083039024

Log transform of the data
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    # Add a column holding the natural log of urea
    df2$trans <- log(df2$urea)

.. code:: r

    g <- ggplot(df2, aes(x=trans))
    g <- g + geom_histogram(binwidth=0.3, fill="grey", color="darkgrey")
    g

.. image:: SR01_Presenting_and_summarising_data_files/SR01_Presenting_and_summarising_data_44_1.png

.. code:: r

    head(df2)

.. parsed-literal::

    urea       trans
    16.007049  2.773029
    13.647212  2.613535
    6.653046   1.895075
    5.107674   1.630744
    19.325193  2.961410
    10.141074  2.316594
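To see the effect of a log transform on skewness, the measure can be computed before and after transforming. The sketch below uses one common moment-based definition, the third central moment divided by the 1.5 power of the second central moment. The helper name ``sr.skew`` is hypothetical, the simulated ``x`` is an assumption standing in for right-skewed data such as ``urea``, and the values may differ slightly from those of ``e1071::skewness`` depending on which estimator that function uses.

.. code:: r

    # Hypothetical helper: moment-based skewness
    sr.skew <- function(x) {
        d <- x - mean(x)    # deviations from the mean
        m2 <- mean(d^2)     # second central moment
        m3 <- mean(d^3)     # third central moment
        m3 / m2^1.5
    }

    set.seed(1)
    x <- rlnorm(1000)       # simulated log-normal (right-skewed) data

    sr.skew(x)              # expected to be clearly positive
    sr.skew(log(x))         # expected to be close to zero
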

Finding geometric mean
^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    # Geometric mean: back-transform the mean of the logged values
    gm <- function(x) {
        return(exp(mean(log(x))))
    }

.. code:: r

    df2 %>% summarise(mean=mean(urea), geom.mean=gm(urea))

.. parsed-literal::

    mean      geom.mean
    10.97242  8.504802
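The geometric mean is the :math:`n`-th root of the product of the :math:`n` values, which is the same as back-transforming the mean of the logged values. A minimal check on a made-up vector follows. The vector ``x`` is an assumption for illustration, and ``gm`` is repeated from above so that the block runs on its own.

.. code:: r

    gm <- function(x) {
        exp(mean(log(x)))
    }

    # Made-up positive values, used only for illustration
    x <- c(1, 10, 100)

    gm(x)                        # 10, up to floating-point error
    prod(x)^(1/length(x))        # same value via the product definition
    all.equal(gm(x), prod(x)^(1/length(x)))

For positive data the geometric mean never exceeds the arithmetic mean, which is consistent with the ``urea`` result above (8.50 against 10.97).
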

Exercise
~~~~~~~~

**1**. Load the file ``ph.csv`` into a data frame called ``df``. Label the column ``ph``.

**2**. Plot a histogram of the data. Calculate the skewness of the ``ph`` column.

**3**. Left-skewed data can sometimes be made more "normal" by an exponential transformation. That is, if the original data is :math:`x`, the transformed data is :math:`e^x`. Create another column named ``trans`` with the transformed data.

**4**. Plot a histogram of the transformed data.

**5**. Write your own function called ``sr.sd`` to calculate the standard deviation using the formula in Table 3.

**6**. Create a new table using the ``summarise`` function with the mean and standard deviation of both the original and the transformed data values. This data frame should have 1 row and 4 columns named ``orig.mean``, ``orig.sd``, ``trans.mean``, ``trans.sd``.