Statistics review 1: Presenting and summarising data
====================================================

R code to accompany the paper.

Key learning points
-------------------

-  The first step in any analysis is to describe and summarize the data
-  Understand the following

   -  qualitative data (unordered and ordered) and quantitative data
      (discrete and continuous)
   -  how these types of data can be represented graphically
   -  the two important features of a quantitative dataset (location
      and variability)
   -  the measures of location (mean, median and mode)
   -  the measures of variability (range, interquartile range,
      standard deviation and variance)
   -  common distributions of clinical data
   -  simple transformations of positively skewed data.

.. code:: r

    suppressPackageStartupMessages(library(tidyverse))

Set default plot size to 4" by 3"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    options(repr.plot.width=4, repr.plot.height=3)

Summary Statistics
------------------

Read in data set from file
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    df1 <- read.csv('data/haemoglobin.csv', header=FALSE, col.names=c("hb"))

Working with data frames
^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    class(df1)

.. parsed-literal::

    'data.frame'

.. code:: r

    head(df1)

.. parsed-literal::

    hb
    5.4
    8.2
    6.4
    8.3
    6.4
    8.3


.. code:: r

    str(df1)

.. parsed-literal::

    'data.frame':   48 obs. of  1 variable:
     $ hb: num  5.4 8.2 6.4 8.3 6.4 8.3 7 8.6 7.1 8.8 ...

.. code:: r

    dim(df1)

.. parsed-literal::

    48 1

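
The key learning points distinguish qualitative data (unordered and ordered) from quantitative data (discrete and continuous). As a rough sketch of how these types might be stored in an R data frame, the example below uses made-up values; the data frame ``patients`` and its column names are illustrative assumptions and do not come from the data files used here.

.. code:: r

    # Hypothetical example data (not from haemoglobin.csv)
    patients <- data.frame(
        # qualitative, unordered
        blood.group = factor(c("A", "O", "B", "O")),
        # qualitative, ordered
        pain = factor(c("mild", "severe", "none", "mild"),
                      levels = c("none", "mild", "severe"),
                      ordered = TRUE),
        # quantitative, discrete
        admissions = c(0L, 2L, 1L, 0L),
        # quantitative, continuous
        hb = c(9.4, 10.1, 8.3, 12.6)
    )

    # First class of each column
    sapply(patients, function(col) class(col)[1])

    # Frequency table for a qualitative variable
    table(patients$blood.group)

Ordered factors support comparisons such as ``patients$pain > "none"``, whereas unordered factors do not.
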

.. code:: r

    summary(df1)

.. parsed-literal::

           hb
     Min.   : 5.400
     1st Qu.: 8.750
     Median : 9.800
     Mean   : 9.869
     3rd Qu.:10.800
     Max.   :14.100

Dataframe indexing
^^^^^^^^^^^^^^^^^^

.. code:: r

    df1$hb

.. parsed-literal::

    5.4 8.2 6.4 8.3 6.4 8.3 7 8.6 7.1 8.8 7.3 8.9
    7.7 9.1 8.1 9.3 9.3 9.9 9.4 9.9 9.4 9.9 9.4 10.1
    9.4 10.3 9.5 10.3 9.7 10.4 9.7 10.4 10.5 11.9 10.5 12.3
    10.6 12.6 10.8 12.7 10.8 13 11.3 13.3 11.7 14 11.7 14.1


.. code:: r

    df1[,1]

.. parsed-literal::

    5.4 8.2 6.4 8.3 6.4 8.3 7 8.6 7.1 8.8 7.3 8.9
    7.7 9.1 8.1 9.3 9.3 9.9 9.4 9.9 9.4 9.9 9.4 10.1
    9.4 10.3 9.5 10.3 9.7 10.4 9.7 10.4 10.5 11.9 10.5 12.3
    10.6 12.6 10.8 12.7 10.8 13 11.3 13.3 11.7 14 11.7 14.1


.. code:: r

    df1[1,]

.. parsed-literal::

    5.4

.. code:: r

    df1[5:10,]

.. parsed-literal::

    6.4 8.3 7 8.6 7.1 8.8

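
Because ``df1`` has a single column, row indexing such as ``df1[5:10,]`` returns a plain numeric vector rather than a data frame. A minimal sketch of the difference, using a small made-up data frame ``toy`` (an assumption for illustration, not the haemoglobin data):

.. code:: r

    # Hypothetical one-column data frame
    toy <- data.frame(hb = c(5.4, 8.2, 6.4, 8.3))

    # With one column, the result is dropped to a numeric vector
    v <- toy[2:3, ]

    # drop = FALSE keeps the data frame structure
    d <- toy[2:3, , drop = FALSE]

    class(v)
    class(d)

Passing ``drop = FALSE`` can be useful when later code expects a data frame, for example in a ``%>%`` pipeline.
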

Measuring location
~~~~~~~~~~~~~~~~~~

Using a custom function
^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    sr.mean <- function(x) {
        sum(x)/length(x)
    }

.. code:: r

    sr.mean(df1$hb)

.. parsed-literal::

    9.86875

Using built-in functions
^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    df1 %>% summarize(mean=mean(hb), median=median(hb), mode=mode(hb))

.. parsed-literal::

    mean     median  mode
    9.86875  9.8     numeric

Note that ``mode()`` is base R's storage-mode function, so the ``mode`` column above reports ``numeric`` rather than the most frequent value.

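
Base R has no built-in function for the statistical mode; ``mode()`` describes how an object is stored. One possible sketch tabulates the values and returns the most frequent one(s). The name ``sr.mode`` is a hypothetical helper chosen to match ``sr.mean``, and the input vector is made up for illustration.

.. code:: r

    sr.mode <- function(x) {
        # Frequency of each distinct value
        counts <- table(x)
        # Keep every value that reaches the highest frequency (ties possible)
        modes <- names(counts)[counts == max(counts)]
        as.numeric(modes)
    }

    sr.mode(c(9.3, 9.4, 9.4, 9.9, 9.4, 9.9))

Applied to ``df1$hb``, this should return 9.4, which appears four times in the values listed earlier. When several values tie for the highest count, all of them are returned.
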

Visualizing the data distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    ggplot(df1, aes(x=hb)) +
        geom_histogram(binwidth=0.5, fill="grey", color="darkgrey")

.. image:: SR01_Presenting_and_summarising_data_files/SR01_Presenting_and_summarising_data_25_1.png

Measuring variability
~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    range(df1$hb)

.. parsed-literal::

    5.4 14.1


.. code:: r

    quantile(df1$hb, c(0.25, 0.75))

.. parsed-literal::

    25%   75%
    8.75  10.8


.. code:: r

    sd(df1$hb)

.. parsed-literal::

    1.97291837512189

.. code:: r

    var(df1$hb)

.. parsed-literal::

    3.89240691489362

.. code:: r

    df1 %>% summarize(min=min(hb), max=max(hb),
                      iqr=quantile(hb, 0.75) - quantile(hb, 0.25),
                      sd=sd(hb), var=var(hb))

.. parsed-literal::

    min  max   iqr   sd        var
    5.4  14.1  2.05  1.972918  3.892407

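
Base R also offers shortcuts for some of these measures: ``IQR()`` returns the interquartile range directly, and ``diff(range(x))`` gives the range as a single number. A short sketch on a made-up vector ``x`` (illustrative values only, not the haemoglobin data):

.. code:: r

    # Hypothetical values for illustration
    x <- c(5.4, 7.0, 8.3, 9.4, 9.9, 10.4, 11.7, 14.1)

    # Range as a single number: maximum minus minimum
    diff(range(x))

    # Interquartile range, two equivalent ways
    IQR(x)
    unname(quantile(x, 0.75) - quantile(x, 0.25))

    # The standard deviation is the square root of the variance
    sqrt(var(x))
    sd(x)
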

Using a convenience function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: r

    summary(df1)

.. parsed-literal::

           hb
     Min.   : 5.400
     1st Qu.: 8.750
     Median : 9.800
     Mean   : 9.869
     3rd Qu.:10.800
     Max.   :14.100

Common distributions and simple transformations
-----------------------------------------------

**Exercise**. Read the ``urea.csv`` data file into a data frame ``df2`` and name the column ``urea``.

.. code:: r

    df2 <- read.csv("data/urea.csv", header=FALSE, col.names=c("urea"))

.. code:: r

    head(df2, n=3)

.. parsed-literal::

    urea
    16.007049
    13.647212
    6.653046


.. code:: r

    g <- ggplot(df2, aes(x=urea))
    g <- g + geom_histogram(binwidth=1, fill="grey", color="darkgrey")
    g

.. image:: SR01_Presenting_and_summarising_data_files/SR01_Presenting_and_summarising_data_38_1.png

Skewness
^^^^^^^^

The histogram has a long right tail, so we say the data have a positive (right) skew. The name comes from a statistical measure called skewness, which is positive for distributions with a long right tail and negative for those with a long left tail.

.. code:: r

    install.packages("e1071", repos = "http://cran.r-project.org")
    library(e1071)

.. parsed-literal::

    The downloaded binary packages are in
        /var/folders/3l/tbmzdkss71152d8t9n1f8nx40000gn/T//RtmpdeIh9P/downloaded_packages

.. code:: r

    skewness(df2$urea)

.. parsed-literal::

    1.80179083039024

Log transform of the data
^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    df2$trans <- log(df2$urea)

.. code:: r

    g <- ggplot(df2, aes(x=trans))
    g <- g + geom_histogram(binwidth=0.3, fill="grey", color="darkgrey")
    g

.. image:: SR01_Presenting_and_summarising_data_files/SR01_Presenting_and_summarising_data_44_1.png

.. code:: r

    head(df2)

.. parsed-literal::

    urea       trans
    16.007049  2.773029
    13.647212  2.613535
    6.653046   1.895075
    5.107674   1.630744
    19.325193  2.961410
    10.141074  2.316594

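
To illustrate why a log transform helps with positively skewed data, the sketch below simulates log-normal values (not the urea data) and compares a simple moment-based skewness before and after the transform. The helper ``sr.skew`` and the simulation parameters are assumptions made for this example; its value may differ slightly from ``e1071::skewness``, which supports several estimator types.

.. code:: r

    # Moment-based skewness: third central moment over variance^(3/2)
    sr.skew <- function(x) {
        d <- x - mean(x)
        mean(d^3) / mean(d^2)^1.5
    }

    # Simulated right-skewed data (hypothetical, for illustration only)
    set.seed(1)
    sim <- rlnorm(1000, meanlog = 2, sdlog = 0.6)

    # Clearly positive for the raw values
    sr.skew(sim)

    # Close to zero after the log transform
    sr.skew(log(sim))

The log of a log-normal variable is normally distributed, so the skewness of the transformed values should be near zero.
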

Finding geometric mean
^^^^^^^^^^^^^^^^^^^^^^

.. code:: r

    gm <- function(x) {
        return(exp(mean(log(x))))
    }

.. code:: r

    df2 %>% summarise(mean=mean(urea), geom.mean=gm(urea))

.. parsed-literal::

    mean      geom.mean
    10.97242  8.504802

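
The geometric mean is the mean of the log-transformed values converted back to the original scale, which is equivalent to the n-th root of the product of the values. A small check on made-up numbers (the vector ``x`` is illustrative and not taken from the urea data):

.. code:: r

    gm <- function(x) {
        exp(mean(log(x)))
    }

    # Hypothetical values spanning two orders of magnitude
    x <- c(1, 10, 100)

    # Geometric mean via logs
    gm(x)

    # Same value via the n-th root of the product
    prod(x)^(1 / length(x))

    # The arithmetic mean is pulled up by the largest value
    mean(x)

For the urea data, ``exp(mean(log(df2$urea)))`` is exactly what ``gm(df2$urea)`` computes, which is why the geometric mean sits below the arithmetic mean for right-skewed data.
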

Exercise
~~~~~~~~

**1**. Load the file ``ph.csv`` into a data frame called ``df``. Label the column ``ph``.

**2**. Plot a histogram of the data. Calculate the skewness of the ``ph`` column.

**3**. Left-skewed data can sometimes be made more "normal" by an exponential transformation. That is, if the original data is :math:`x`, the transformed data is :math:`e^x`. Create another column named ``trans`` with the transformed data.

**4**. Plot a histogram of the transformed data.

**5**. Write your own function called ``sr.sd`` to calculate the standard deviation using the formula in Table 3.

**6**. Create a new table using the ``summarise`` function with the mean and standard deviation of both the original and the transformed data values. This data frame should have 1 row and 4 columns named ``orig.mean``, ``orig.sd``, ``trans.mean``, ``trans.sd``.