{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using `tidyr` to create tidy data sets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Description**\n",
"> Easily Tidy Data with 'spread()' and 'gather()' Functions\n",
"\n",
"A tidy data frame is one where\n",
"\n",
"- Each column is a variable\n",
"- Each row is an observatino\n",
"- Each value is a cell\n",
"\n",
"As we have seen, tidy data sets can be easily plotted with `ggplot2` and manipulated with `dplyr`. In this notebook, we see how to convert messy data sets into tidy ones using the verbs `gather`, `spread`, `separate`, `extract`, `separte_rows` and `unite`.\n",
"\n",
"For more complicated tidying jobs, see the full range of functions in the [`tidyr` documentation](https://cran.r-project.org/web/packages/tidyr/tidyr.pdf)."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"library(tidyverse)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"We will work with a subset of the pilot metadata that we are familiar with from the `dplyr` session."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"path='../josh/info/2018_pilot_metadata.tsv'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Parsed with column specification:\n",
"cols(\n",
" Label = col_character(),\n",
" RNA_sample_num = col_integer(),\n",
" Media = col_character(),\n",
" Strain = col_character(),\n",
" Replicate = col_integer(),\n",
" experiment_person = col_character(),\n",
" libprep_person = col_character(),\n",
" enrichment_method = col_character(),\n",
" RIN = col_double(),\n",
" concentration_fold_difference = col_double(),\n",
" `i7 index` = col_character(),\n",
" `i5 index` = col_character(),\n",
" `i5 primer` = col_character(),\n",
" `i7 primer` = col_character(),\n",
" `library#` = col_integer()\n",
")\n"
]
}
],
"source": [
"df <- read_tsv(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subset the data"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"df <- df %>% select(\n",
" Label,\n",
" RNA_sample_num,\n",
" Media,\n",
" Strain,\n",
" Replicate,\n",
" concentration_fold_difference,\n",
" `i7 index`,\n",
" `i5 index`, \n",
" `i5 primer`,\n",
" `i7 primer`,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"Label | RNA_sample_num | Media | Strain | Replicate | concentration_fold_difference | i7 index | i5 index | i5 primer | i7 primer |
\n",
"\n",
"\t2_MA_C | 2 | YPD | H99 | 2 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 |
\n",
"\t9_MA_C | 9 | YPD | mar1d | 3 | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 |
\n",
"\t10_MA_C | 10 | YPD | mar1d | 4 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 |
\n",
"\t14_MA_C | 14 | TC | H99 | 2 | 1.57 | ATTACTCG | TCAGAGCC | i504 | i701 |
\n",
"\t15_MA_C | 15 | TC | H99 | 3 | 2.85 | ATTACTCG | CTTCGCCT | i505 | i701 |
\n",
"\t21_MA_C | 21 | TC | mar1d | 3 | 1.81 | ATTACTCG | TAAGATTA | i506 | i701 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llllllllll}\n",
" Label & RNA\\_sample\\_num & Media & Strain & Replicate & concentration\\_fold\\_difference & i7 index & i5 index & i5 primer & i7 primer\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & 2 & YPD & H99 & 2 & 1.34 & ATTACTCG & AGGCTATA & i501 & i701 \\\\\n",
"\t 9\\_MA\\_C & 9 & YPD & mar1d & 3 & 2.23 & ATTACTCG & GCCTCTAT & i502 & i701 \\\\\n",
"\t 10\\_MA\\_C & 10 & YPD & mar1d & 4 & 4.37 & ATTACTCG & AGGATAGG & i503 & i701 \\\\\n",
"\t 14\\_MA\\_C & 14 & TC & H99 & 2 & 1.57 & ATTACTCG & TCAGAGCC & i504 & i701 \\\\\n",
"\t 15\\_MA\\_C & 15 & TC & H99 & 3 & 2.85 & ATTACTCG & CTTCGCCT & i505 & i701 \\\\\n",
"\t 21\\_MA\\_C & 21 & TC & mar1d & 3 & 1.81 & ATTACTCG & TAAGATTA & i506 & i701 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | RNA_sample_num | Media | Strain | Replicate | concentration_fold_difference | i7 index | i5 index | i5 primer | i7 primer | \n",
"|---|---|---|---|---|---|\n",
"| 2_MA_C | 2 | YPD | H99 | 2 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | \n",
"| 9_MA_C | 9 | YPD | mar1d | 3 | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 | \n",
"| 10_MA_C | 10 | YPD | mar1d | 4 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | \n",
"| 14_MA_C | 14 | TC | H99 | 2 | 1.57 | ATTACTCG | TCAGAGCC | i504 | i701 | \n",
"| 15_MA_C | 15 | TC | H99 | 3 | 2.85 | ATTACTCG | CTTCGCCT | i505 | i701 | \n",
"| 21_MA_C | 21 | TC | mar1d | 3 | 1.81 | ATTACTCG | TAAGATTA | i506 | i701 | \n",
"\n",
"\n"
],
"text/plain": [
" Label RNA_sample_num Media Strain Replicate concentration_fold_difference\n",
"1 2_MA_C 2 YPD H99 2 1.34 \n",
"2 9_MA_C 9 YPD mar1d 3 2.23 \n",
"3 10_MA_C 10 YPD mar1d 4 4.37 \n",
"4 14_MA_C 14 TC H99 2 1.57 \n",
"5 15_MA_C 15 TC H99 3 2.85 \n",
"6 21_MA_C 21 TC mar1d 3 1.81 \n",
" i7 index i5 index i5 primer i7 primer\n",
"1 ATTACTCG AGGCTATA i501 i701 \n",
"2 ATTACTCG GCCTCTAT i502 i701 \n",
"3 ATTACTCG AGGATAGG i503 i701 \n",
"4 ATTACTCG TCAGAGCC i504 i701 \n",
"5 ATTACTCG CTTCGCCT i505 i701 \n",
"6 ATTACTCG TAAGATTA i506 i701 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"head(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Use `gather` to combine multiple columns into one\n",
"\n",
"We note thet there are two \"index\" and \"primer\" columns - for a tidy data frame, we probably want to combine them. We cna do this using `gather`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Warm-up\n",
"\n",
"The use of `gather` can be confusing, so we will start with a toy example."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [],
"source": [
"name <- c('ann', 'bob', 'charlie')\n",
"jan <- c(102, 155, 211)\n",
"feb <- c(112, 150, 211)\n",
"mar <- c(123, 147, 213)\n",
"apr <- c(130, 140, 210)\n",
"\n",
"wts <- tibble(name=name, jan=jan, feb=feb, mar=mar, apr=apr)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"name | jan | feb | mar | apr |
\n",
"\n",
"\tann | 102 | 112 | 123 | 130 |
\n",
"\tbob | 155 | 150 | 147 | 140 |
\n",
"\tcharlie | 211 | 211 | 213 | 210 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lllll}\n",
" name & jan & feb & mar & apr\\\\\n",
"\\hline\n",
"\t ann & 102 & 112 & 123 & 130 \\\\\n",
"\t bob & 155 & 150 & 147 & 140 \\\\\n",
"\t charlie & 211 & 211 & 213 & 210 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"name | jan | feb | mar | apr | \n",
"|---|---|---|\n",
"| ann | 102 | 112 | 123 | 130 | \n",
"| bob | 155 | 150 | 147 | 140 | \n",
"| charlie | 211 | 211 | 213 | 210 | \n",
"\n",
"\n"
],
"text/plain": [
" name jan feb mar apr\n",
"1 ann 102 112 123 130\n",
"2 bob 155 150 147 140\n",
"3 charlie 211 211 213 210"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Messy data\n",
"\n",
"In the current form, it is not possible to subgroup by month or to plot by month easily."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `gather` to creat a new column called `weight` to store monthly weights"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"name | month | weight |
\n",
"\n",
"\tann | jan | 102 |
\n",
"\tbob | jan | 155 |
\n",
"\tcharlie | jan | 211 |
\n",
"\tann | feb | 112 |
\n",
"\tbob | feb | 150 |
\n",
"\tcharlie | feb | 211 |
\n",
"\tann | mar | 123 |
\n",
"\tbob | mar | 147 |
\n",
"\tcharlie | mar | 213 |
\n",
"\tann | apr | 130 |
\n",
"\tbob | apr | 140 |
\n",
"\tcharlie | apr | 210 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" name & month & weight\\\\\n",
"\\hline\n",
"\t ann & jan & 102 \\\\\n",
"\t bob & jan & 155 \\\\\n",
"\t charlie & jan & 211 \\\\\n",
"\t ann & feb & 112 \\\\\n",
"\t bob & feb & 150 \\\\\n",
"\t charlie & feb & 211 \\\\\n",
"\t ann & mar & 123 \\\\\n",
"\t bob & mar & 147 \\\\\n",
"\t charlie & mar & 213 \\\\\n",
"\t ann & apr & 130 \\\\\n",
"\t bob & apr & 140 \\\\\n",
"\t charlie & apr & 210 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"name | month | weight | \n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| ann | jan | 102 | \n",
"| bob | jan | 155 | \n",
"| charlie | jan | 211 | \n",
"| ann | feb | 112 | \n",
"| bob | feb | 150 | \n",
"| charlie | feb | 211 | \n",
"| ann | mar | 123 | \n",
"| bob | mar | 147 | \n",
"| charlie | mar | 213 | \n",
"| ann | apr | 130 | \n",
"| bob | apr | 140 | \n",
"| charlie | apr | 210 | \n",
"\n",
"\n"
],
"text/plain": [
" name month weight\n",
"1 ann jan 102 \n",
"2 bob jan 155 \n",
"3 charlie jan 211 \n",
"4 ann feb 112 \n",
"5 bob feb 150 \n",
"6 charlie feb 211 \n",
"7 ann mar 123 \n",
"8 bob mar 147 \n",
"9 charlie mar 213 \n",
"10 ann apr 130 \n",
"11 bob apr 140 \n",
"12 charlie apr 210 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wts %>% gather(key=month, value=weight, jan:apr)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
"wts.1 <- wts %>% gather(key=month, value=weight, jan:apr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### To get menth sorted correctly, we make it a factor"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"name | month | weight |
\n",
"\n",
"\tann | jan | 102 |
\n",
"\tbob | jan | 155 |
\n",
"\tcharlie | jan | 211 |
\n",
"\tann | feb | 112 |
\n",
"\tbob | feb | 150 |
\n",
"\tcharlie | feb | 211 |
\n",
"\tann | mar | 123 |
\n",
"\tbob | mar | 147 |
\n",
"\tcharlie | mar | 213 |
\n",
"\tann | apr | 130 |
\n",
"\tbob | apr | 140 |
\n",
"\tcharlie | apr | 210 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" name & month & weight\\\\\n",
"\\hline\n",
"\t ann & jan & 102 \\\\\n",
"\t bob & jan & 155 \\\\\n",
"\t charlie & jan & 211 \\\\\n",
"\t ann & feb & 112 \\\\\n",
"\t bob & feb & 150 \\\\\n",
"\t charlie & feb & 211 \\\\\n",
"\t ann & mar & 123 \\\\\n",
"\t bob & mar & 147 \\\\\n",
"\t charlie & mar & 213 \\\\\n",
"\t ann & apr & 130 \\\\\n",
"\t bob & apr & 140 \\\\\n",
"\t charlie & apr & 210 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"name | month | weight | \n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| ann | jan | 102 | \n",
"| bob | jan | 155 | \n",
"| charlie | jan | 211 | \n",
"| ann | feb | 112 | \n",
"| bob | feb | 150 | \n",
"| charlie | feb | 211 | \n",
"| ann | mar | 123 | \n",
"| bob | mar | 147 | \n",
"| charlie | mar | 213 | \n",
"| ann | apr | 130 | \n",
"| bob | apr | 140 | \n",
"| charlie | apr | 210 | \n",
"\n",
"\n"
],
"text/plain": [
" name month weight\n",
"1 ann jan 102 \n",
"2 bob jan 155 \n",
"3 charlie jan 211 \n",
"4 ann feb 112 \n",
"5 bob feb 150 \n",
"6 charlie feb 211 \n",
"7 ann mar 123 \n",
"8 bob mar 147 \n",
"9 charlie mar 213 \n",
"10 ann apr 130 \n",
"11 bob apr 140 \n",
"12 charlie apr 210 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wts.1 %>% mutate(month = factor(month, \n",
" ordered = TRUE, \n",
" levels = c('jan', 'feb', 'mar', 'apr')))"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [],
"source": [
"wts.2 <- wts.1 %>% mutate(month = factor(month, \n",
" ordered = TRUE, \n",
" levels = c('jan', 'feb', 'mar', 'apr')))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now we can easily work wtih the tidy data set"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"month | min | max | mean |
\n",
"\n",
"\tjan | 102 | 211 | 156.0000 |
\n",
"\tfeb | 112 | 211 | 157.6667 |
\n",
"\tmar | 123 | 213 | 161.0000 |
\n",
"\tapr | 130 | 210 | 160.0000 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" month & min & max & mean\\\\\n",
"\\hline\n",
"\t jan & 102 & 211 & 156.0000\\\\\n",
"\t feb & 112 & 211 & 157.6667\\\\\n",
"\t mar & 123 & 213 & 161.0000\\\\\n",
"\t apr & 130 & 210 & 160.0000\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"month | min | max | mean | \n",
"|---|---|---|---|\n",
"| jan | 102 | 211 | 156.0000 | \n",
"| feb | 112 | 211 | 157.6667 | \n",
"| mar | 123 | 213 | 161.0000 | \n",
"| apr | 130 | 210 | 160.0000 | \n",
"\n",
"\n"
],
"text/plain": [
" month min max mean \n",
"1 jan 102 211 156.0000\n",
"2 feb 112 211 157.6667\n",
"3 mar 123 213 161.0000\n",
"4 apr 130 210 160.0000"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"wts.2 %>% \n",
"group_by(month) %>% \n",
"summarize_at('weight', funs(min, max, mean)) "
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"options(repr.plot.width=6, repr.plot.height=3)"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtAAAAFoCAMAAABJ+DwrAAADAFBMVEUAAAAAujgBAQECAgID\nAwMEBAQFBQUGBgYHBwcICAgJCQkKCgoLCwsMDAwNDQ0ODg4PDw8QEBARERESEhITExMUFBQV\nFRUWFhYXFxcYGBgZGRkaGhobGxscHBwdHR0eHh4fHx8gICAhISEiIiIjIyMkJCQlJSUmJiYn\nJycoKCgpKSkqKiorKyssLCwtLS0uLi4vLy8wMDAxMTEyMjIzMzM0NDQ1NTU2NjY3Nzc4ODg5\nOTk6Ojo7Ozs8PDw9PT0+Pj4/Pz9AQEBBQUFCQkJDQ0NERERFRUVGRkZHR0dISEhJSUlKSkpL\nS0tMTExNTU1OTk5PT09QUFBRUVFSUlJTU1NUVFRVVVVWVlZXV1dYWFhZWVlaWlpbW1tcXFxd\nXV1eXl5fX19gYGBhYWFhnP9iYmJjY2NkZGRlZWVmZmZnZ2doaGhpaWlqampra2tsbGxtbW1u\nbm5vb29wcHBxcXFycnJzc3N0dHR1dXV3d3d4eHh5eXl6enp7e3t8fHx9fX1+fn5/f3+AgICB\ngYGCgoKDg4OEhISFhYWGhoaHh4eIiIiJiYmKioqLi4uMjIyOjo6Pj4+QkJCRkZGSkpKTk5OU\nlJSVlZWWlpaXl5eYmJiZmZmampqbm5ucnJydnZ2enp6fn5+goKChoaGioqKjo6OkpKSlpaWm\npqanp6epqamqqqqrq6usrKytra2urq6vr6+wsLCxsbGysrKzs7O0tLS1tbW2tra3t7e4uLi5\nubm6urq7u7u8vLy9vb2+vr6/v7/AwMDBwcHCwsLDw8PExMTFxcXGxsbHx8fIyMjJycnKysrL\ny8vMzMzNzc3Ozs7Pz8/Q0NDR0dHS0tLT09PU1NTV1dXW1tbX19fY2NjZ2dna2trb29vc3Nzd\n3d3e3t7f39/g4ODh4eHi4uLj4+Pk5OTl5eXm5ubn5+fo6Ojp6enq6urr6+vs7Ozt7e3u7u7v\n7+/w8PDx8fHy8vLz8/P09PT19fX29vb39/f4dm34+Pj5+fn6+vr7+/v8/Pz9/f3+/v7///9R\nfbnwAAAACXBIWXMAABJ0AAASdAHeZh94AAAeIklEQVR4nO3de2AU1b0H8Kkt6m1F9NpcHinB\nR0FrS+sjoNfWC1SrgrKQAAGMCOECFrRKr7dS4KoICNUKWi2IodTa+kCJoqUVapSAvCKEAgJh\nC0FQKsLGkEA2aR6bPXdmNjvJTmbC78ycs7Psfr9/7OzsfDn72/FjnCTLqjAESaIoXg+AICID\n0EhSBaCRpApAI0kVgEaSKgCNJFUAGkmqADSSVBEEuuLgmRJj5M+8noQcY2SvByFHDCpHEQQ6\n4D9TYoz8ideTkGOM7PUg5IhB5SgAnfgxRvZ6EHLEoHIUgE78GCN7PQg5YlA5CkAnfoyRvR6E\nHDGoHAWgEz/GyF4PQo4YVI4C0IkfY2SvByFHDCpHAejEjzGy14OQIwaVowB04scY2etByBGD\nylEAOvFjjOz1IOSIQeUoAO06K9Pb33cdY2RRC2YUtH1sb/qfRS0P0PGMMTJAx0QFvT59gaAn\nEIPKUQDadZIB9J4iFXTJQysFPYEYVI6ScKALh17e8/Z31ZP+1p2ZfZa2bEXFGFkg6NcH9Bq0\nyu8vHtu799jiln1RMUZ2vdKW3Ct7P7DHn7F0WK8++a3O9R+uHK5dcqjQt0++pteID4RNHP8k\nHOgBg15bPugm9STfvnbv9Es+NraiYowsEPR1L76e3Wvrvptufn35zT/eF90Xtb440KU3Zr/5\n3Hd/6c+4Kn913qV7Ws71dc+sawY96Pblb4/pXSJq4vgn0UDvm73W7190pXqSf+33f5i+1tiK\nijGyQNC/9/t3fW/u6xkf+v3rM16P7otaXxzol76tSl2Q58941O8vSl/fcq7nRL4pzChYmaE2\n9l71oqiJ459EA+3f+8YTE76nneTlfv9WDXTzVlSMkQWC1r6gjZ70VF9tr8+C6L6o9cWBfuzm\nyDbjVb9/c/r6lnP9ShT0b9Iz1KTPEzVx/JNooHfe0vfBF5+5MvKNiw66IPFBb1dvR937pA66\n75PRfVHriwP98K2RrXZOVdAx5zoC+ndXuH0OPWJQOUqigX7pst1+/8IzDPQy9RKj91OvZWz0\n+zdkLI/ui1pfHOj8Xjv8/sVDoqBjznUE9Pvpq/3+j3LWiJo4/kk00K+nL1q78KqL3z2TQF98\n7e/fHN57574Bt75ZcMuAfdF9UeuLA723b847y659IAo65lw3f1M4+sblb2Rdt1fUxPFPooH2\nz+j9ndFrRw6PQL60yNiKijGyMNCrR7zUr9dQ9V+5LWP0H9sZ+6JijOx6pQ9zLu/9wMdR0DHn\nuhn0zilX9cpxfbbFoHKUhAMtO8bIeC+HvIhB5SgAnfgxRvZ6EHLEoHIUgE78GCN7PQg5YlA5\nCkAnfoyRvR6EHDGoHAWgEz/GyF4PQo4YVI4C0IkfY2SvByFHDCpHAejEjzGy14OQIwaVowB0\n4scY2etByBGDylEAOvFjjOz1IOSIQeUogkAHyy1zIlxrfcCcxiZarzp8ktSrCNfZHDFGPmF9\nPBiuJD0D+aU1uH5pxsg2xxtDtGcIhqtIvQrySwvbHBCDylFEgQ5Y5gSrtT5gTmOY1qtmVaTe\nl6ze5ogxcoX18RpWSXqGCvJLY7ReNTtpc8QY2eZ4qIn2DEHiyatgdbQFG+wmEoPKUQDaFIAG\naAbQ7QSg4xqANgWgAZoBdDsB6LgGoE0BaIBmAN1OADquAWhTABqgGUC3E4COawDaFIAGaAbQ\n7QSg4xqANiUFQf83IZZ/EKBtA9AALSYAbQpAAzQD6HYiHDTVH6VHLloOBtC2AegzADS1B9AA\nHQBoQQFoUwDadQ+gAToA0IIC0KYAtOseQAN0AKAFBaBNAWjXvTMC9ImFY0fP+oSx0LKJ4xY1\ntGybA9C2Aei4hgr64am7/E/kVrD8vOKSSQuZsW0OQNsGoOMaIuhyX6n6VTl3dU3OBsa2ZVdG\nt9HjAG0bgI5riKCPv6peXtSN+Gupr5qxxiHbo1v1UPUv1ayus0w9C1kfMCfMaL1G1kDq1bMm\nmyPGS7I5HiI/Q7xeWrVpZIorao9c5Om1OsnxD8c3hXVPjD+5KVu7l1sY3ao3FZlqXpAynNQ0\neT0ANWYeFFfUHrnI0/M0ZNDh98dPr2Qbh2n3c9dEt9qBKjU2HxRXyYifkhay+5Q0U4KM+Nl2\nrN7miPGCbEauYbQPgDvB7D49z5RGu4+kM6WanbI5Yowc2aW4ovbIRZ7eGfHZdpUzJhSFGSv1\n1ajX0kNKotvoYVxD2wbX0HENEXT4wblBbRscUczYrqyK6DZ6HKBtA9BxDRH0jiFFO9QE2JIp\nB8qmPs2MbXMA2jYAHdcQQb/l07OKhfLHj1us/WKledscgLYNQMc1+NW3KQDtugfQAB0AaEEB\naFMA2nUPoAE6ANCCAtCmALTrHkADdACgBQWgTQFo1z2ABugAQAsKQJsC0K57AA3QAYAWFIA2\nBaBd9wAaoAMALSgAbQpAu+4BNEAHAFpQANoUgHbdA2iADgC0oAC0KQDtugfQAB0AaEEBaFMA\n2nUPoAE6ANCCAtCmALTrHkADdACgBUUQ6Lpay9SzkPUBc8KM1mtkDaRene0TGyPbPkO9y2cw\npYn40hrsXtop08gUV9QeucjTa3WS4x9BoGtPWqaaNVgfMKcpTOvVsRpS7xRrtDlijHzK+ng9\nC5KegfzSQozWq7M7iSeMkSP7FFfUHrnI02t1kuMfXHKYgksO171kuOQAaNsAdFwD0KYAtOse\nQAN0AKAFBaBNAWjXPYAG6ABACwpAmwLQrnsADdABgBYUgDYFoF33ABqgAwAtKABtCkC77gE0\nQAcAWlAA2hSAdt0DaIAOALSgALQpAO26B9AAHQBoQQFoUwDadQ+gAToA0IIC0KYAtOseQAN0\nAKAFBaBNAWjXPYAG6ABAC0qqg/4KJVbrALRt7wwB3ZirfdzCGz41WYyFlk0ct6jBOAjQtgHo\nuIYKun7nkz4N9DOzS0pKtjOWn1dcMmmhcRigbQPQcQ0VdMH4MTroae/ouzU5Gxjbll0ZPQzQ\ntgHouIZ+ybFfB507J+/O2UdYqa9avQgZon6pZvXvqSl1+XlZnn0UGAm01Tr4KDDb3hnyUWA6\n6Crf3N07Z+YFN2VrD+UWqjcVmWpekDWfvDTptyTQ3g5aZ9qnuKL2yEWenqfhBB0qDzNWPbxo\n4zDtodw16k3tH9RsrbZMDWu0PmBOE6P16lkdqRdkIZsjxusJ6rsk0NSinJdWZYwc2ae4ovbI\nRZ5eq5NsmYFZn916Xtd7tNf1ynUXnn/NUvXO4Kxtt1yYubLhwZ6d7jii7h8ceXGnfn/hgBwN\n7yWHnvtWlPpqVN1DSqKPJP81NLXX9qXhGtqUgT/st+KTxWdNUL8zU66fP+37yhsq6Ctu2ryn\n3zl9Hzuw/OwcxnZ0Sv/lY73P+h1ZpxFO0B/dr97U5mwJjihmbFdWRfQgQAO00TstaOU97bYH\nY9nd69VLqk4/VUF/7RBjRcpI9cDQDMb69/iSsYYB559qfyWLcIIOjpv19z2z7g+xJVMOlE19\n2jgI0K7hpxDoi7TbiWmMlWtfEQPnjVFB91LvlSra92KT01iF8rhWWaEUknlGw3vJceiRUWMX\nqt93h/LHj1ucSr9YAWhxoK/Wbielaaj++L/9z1U00Npj+5SX1dspaWyz0pzXyDyjwa++AVp0\n77Sg+2i3Guhnz75ozJLtGW1AlyjTi/Qc5ZYI0AAtukcGXX3uuJB6r3Mb0FXKTK3yeRH//60F\noAFadI8M+mPlOfXOaiXXDJrdnHacsaZbuoa4JQI0QIvukUHXd+/26B/u69K984tm0Ns7dpv5\nyLXKn/glAjRAi+7RQE/uxdiun3Tqceehzf0mtQZ9n/YDD3929wt+tMqBRIAGaNG9M+TNSe0G\noGX0vmIaObIr0h9A2wSgZfQAmj8A7QFUricGaK4AtAdQuZ4YoLkC0B5A5XpigOYKQHsAleuJ\nAZorAO0BVK4nTjrQoVrbNLT7B0kBaA+gcj1x0oGu/5dt+N+60SYA7QFUricGaK4AtAdQuZ4Y\noLkC0B5A5XpigOZKkoC2P0ctMZ1rgOaFSu0lEOgxpZHt+p9xLgPQMnoAzZ9WoMvLy5W3y7Uc\nn/ENzmUAWkYPoPnTCrTSKjdxLgPQMnoAzZ9WoBcsWKDcu0DPbw5zLgPQMnoAzZ/Ya+gBOxwu\nU3PCMidZnfUBc0JhWq+GVVs+TgEdaRojV+m7JFfUouieXjxhfJoPi7wCiitqj1zk6bU6yVY5\nI37KUWedehayOWJK2G4FUxpZg+XjFNCRpmlkkitqUXRPL9ZVm0amuKL2yEWeXp354yVjc1rQ\n+2678Pz+OxnrsDnn0m+/0bIlJRZ05aTLukZC/OPR4JJDRi9FLzkyBxR+0O86FXL/stCj/1Zr\nbEmJBT1BuXpsnh4i5GgAWkYvNUGHnyxj7OU0FfQixj5VDhhbUmJBdx4ephKOjSTQFKf/Amjn\n/hIRNGtY/1xeZw30B4yVa6Cbt6TEgk5z8PmlegBaRi81QQdv6PnQ28s00BuaQW9wDPqOqVTB\npgC0jF5qgl51nvpN5VIxoMsy8vk/fEkLQMvopSbodcorZUu7n7vTLeg+WtKVjr31OzyaGUDL\n6aUmaDaryzdHHMgaHIH89YPGlpQW0ANjwoFZC0DL6KUoaFdJ7LePAjRAcwagPYAK0PEC3bE5\n3/zuj+dz/V10gJbRA2j+xIJ+Kl25cnjOD5QBM0d/oxPxKlwPQMvoATR/YkEvOVf/X3m/d+Fa\n9sUlQzmWAWgZPYDmTyzoPvdGtlNvYuy5izmWAWgZPYDmTyzoTnMi23kXMPby1zmW4QQt0ilA\nu/CX7KD/6xr9raz1mTcwNuE7HMsAtIxecoIOh2zTxEHOJrGg//rVvq/t3Ln8P5WC2slnzedY\nBqBl9JITtNyYfg79Wob2V2Q7L2Pl50yu51gGoGX0khN0Y9A2PORsYv7FSt26Zc8XnmIsVG1Z\ntwtAy+glJ+h6+9d7xv6mEKAB2iqCP5eja+Qdd3F4tx1AA7RVxILu+n0Wr3fbATRAWwWXHADt\nzl/ygz5V+NrRWsu/tdKYe1K9DS2bOG5RQ8u2OQAtowfQ/DGBzj9fUYqKur3cdoqdT/o00Pl5\nxSWTFrZsmwPQMnoAzZ9Y0KvOGlCgFH3+E+Uv5l7B+DEa6JqcDYxty66MbqOHAVpGLzVBbzF9\nkTXv84C+8fuNTCliTdf+V9vmfg10qa9avfgYsj26jR4MiocK0ADNrPbbT2z5/NlMA80eubBt\nUwe9KVu7m1sY3WowMtW8EOlQXInucS3YKpH3DZBcUYuie3qRmT8pjuKK2iMXeXqnSTxB95gR\nAT0jo21TB71xmHY3d010q95UjVGzolEPxZXoHteCjY3G62nQd0muqEXRPb3YWGOMHHkFFFfU\nHrnI02t1kq1yetDrMjv1U//jfzy3S7e7jrfsOwCd860KDfSxbtk2oEt96ukNDSmJbqMHcckh\no5eqlxyXv73u9k7l4b7Xry26vk84uu8E9MHze8xTps9I6/gPG9DBEcWM7cqqiG4BWmYvVUGv\nZKym84K1HT5l7PDX1kX3nYBmO/pp77a72erruw6aLZlyoGzq0y1bgJbYS1XQ2gemD5u4uKe2\nd9mS6L4j0Ix9ubmkyrIZAR3KHz9ucUPLFqAl9lIVtPbT4Oz7fttL2+v52+i+E9C9p73v7C2p\nAC2jl6qg31IvMbo8X9jhCGOfdfgguk+TGAv6EkXpOOR5ns8vaA5Ay+ilKOhzL1m5aXC36vC1\nP9q44YeZ4eg+TaLpkuPwnyZdrihX/M9qIuRoAFpGLzVB7/atuvqCWw8wdmx0ly65x1v2SbH4\nofXR312h8L4JD6Bl9FITtLuY5Vb8edr1HZSOt3EuA9AyegDNn1jQD1z1VaXTHb8ubv9XPRYB\naBk9gOZPLGhF6f2qo4/wB2gZPYDmTyzoaTeco3Qe/sw2btQALaMH0PwxX0PXrps36AKl462c\nywC0jB5A88fixxl7nsVPOeRCBWibiAdd+vzIzspZ10xfy7kMQMvoATR/YkF3VZTOd/3pGP8y\nAC2jl5ygWdg2/PDaJBZ0//klzhYFaBm9JAUtNWI/l0O0P0oPoF348+SSw/5PJ9wHzYj2R+kB\ntAt/AG0TgJbRA2j+ALQHUAHaJgDNuyBAu4BK7QG0qx5Au/AH0DYBaBk9gOYPQHsAFaBtAtC8\nCwK0C6jUniDQHTa0/bMhZdtpJAK0B1AB2iYU0IeUJe2tLgh0feR/nEhxJbrHtWCo5Z3ejfou\nyRW1KLqnF0Mt/5Qjr4DiitojF3l6ofbfTu8cdP1BFfSJ6VvaW13UV+hyPRRXontcC5a3fELa\nCX2X5IpaFN37imnkyC7FFbVHLvL0ytv/GLrTgj42Mq3r1HrWoeCOCy57nbF9t114fv+dKvA/\npw3WLjlU6JWTe3Ty7bdcHZcchOCSI56XHE0/GLTpj//xMOvQfcXusV+vZ5kDCj/od50K+vJl\nnzSDHtC/qHh0txNWqwO0B1AB2iY66FUdValLxrIO8xk7qBwKP1nG2MtpKuinIt8Udtiw5ewK\n9W73d6xWB2gPoAK0TXTQT1wf6XV4j7EvlEOsYf1zeZ010H+Lgn5R6aBGWWi1OkB7ABWgbaKD\nnvujSE/7plAFHbyh50NvL0uL7EdAv3mR/eoA7QFUgLaJDnrFBacYe/WWKOhV59UxtjQGtF/5\nWD0ZQ/dYrQ7QHkAFaJvooBt6Dt361iU/j4Jep7xStrT7uTtbgWbDfvDB+oGXW34cEkB7ABWg\nbRL5sd2nQ/+9289ro6DZrC7fHHEga3Br0MF7u18w1PpDcgHaA6gAbRO8l4N3QYB2AZXaA2hX\nPYB24Q+gbQLQMnoAzR+A9gAqQNsEoHkXBGgXUKk9gHbVA2gX/gDaJgAtowfQ/AFoD6CmNmi5\nAWgPoAK0vAC0B1ABWl4A2gOoAC0vAO0BVICWF4D2ACpAywtAewAVoOUFoD2ACtDywgv6DZ+a\nLMZCyyaOW9RgPAzQMnoAzR9e0M/MLikp2c5Yfl5xyaSWv3YL0DJ6AM0fXtDTIh+GUJOzgbFt\n2ZXRhwFaRg+g+cMLOndO3p2zj7BSXzVjjUPUL9UsOE/N2lo9FFeie1wL1ra8WyCyS3JFLYru\n6cXaU6aRKa6oPXKRp1cr4C0ZjsMJuso3d/fOmXnBTdnaXm6helORqeaFyGGKK9E9rgVbpUm/\nJbmiFkX39CKrMw1OcUXtkYs8PU/DCTpUHmasenjRxmHaXu4a9abpiJrjFXoorkT3uBasqDBe\nSqW+S3JFLYru6cWKL42RI6+A4oraIxd5eq1Ocvzj6Md2960o9dWouoeURB/BNbSMHq6h+cMJ\n+qP7T6qXcjlbgiOKGduVZfyrCNAyegDNH07QwXGz/r5n1v0htmTKgbKpT7c8DtASegDNH95L\njkOPjBq78IR6uZE/ftxi/GJFbg+g+YNffXsAFaDlBaA9gArQ8gLQHkAFaHkBaA+gArS8ALQH\nUAFaXgDaA6gALS8A7QFUgJYXgPYAKkDLC0B7ABWg5QWgPYAK0PIC0B5ABWh5AWgPoAK0vAC0\nB1ABWl4A2gOoAC0vAO0BVICWF4D2ACpAywtAewAVoOUFoD2ACtDyAtAeQAVoeQFoD6ACtLwA\ntAdQAVpeANoDqAAtL4JA11brobgS3eNasLraGDmo75JcUYuie3qxusoYOfIKKK6oPXKRp9fq\nJMc/okCf0kNxJbrHteCplo/yrNZ3Sa6oRdE9vXjK+MxiFnkFFFfUHrnI02t1kuMfXHIQgkuO\nlLvkAGgZPYDmD0B7ABWg5QWgPYAK0PIC0B5ABWh5AWgPoAK0vAC0B1ABWl4A2gOoAC0vAO0B\nVICWF4D2ACpAywtAewAVoOUFoD2ACtDyAtAeQAVoeQFoD6ACtLwAtAdQAVpeANoDqAAtLwDt\nAVSAlheA9gAqQMsLQHsAFaDlBaA9gArQ8gLQHkAFaHkBaA+gArS8ALQHUAFaXpyCDi2bOG5R\ng7EL0DJ6AM0fp6Dz84pLJi00dgFaRg+g+eMQdE3OBsa2ZRsf6gPQMnoAzR+HoEt91Yw1Dtmu\n3m0oVlNWqYfiSnSPa8HKls/VOqnvklxRi6J7erHyhDFy5BVQXFF75CJPr9VJjn8cgt6Urd3m\nFqo3FZlqXog8THEluse1YKs06bckV9Si6J5eZHWmwSmuqD1ykafnaRyC3jhMu81do97UPKtm\nY41l6lij9QFzmhit18DqSb1aFrI5YrwEm+ONrM7lM5ji+qWdPN3I4TDtGRpdnzxTbF+aK5Lu\n4viSQx06NKQkut98DW3OCVZrfcCcxjCtV82qSL0vWb3NEeMlVFgfr2GVpGeoIL80RutVs5M2\nR4yRbY6HmmjPECSevApWR1uwwW4i5x5dxyHo4IhixnZlVRj71i8MoAE6znH6Y7slUw6UTX3a\n2AVo2wB0XOP4Fyv548ctbvOLFXMAGqDjHLG/+jYHoAE6zgFoUwAaoBlAtxOAjmsA2hSABmgG\n0O0EoOMagDYFoAGa2YI+dvhz2qk58imtd/TwMVrR9omNkW1AU5/huOiX9sXhL2yOGCO7fAbq\nSzt2+J+0BY8ctjkgBpWjCAJtE3/mfFpx1I203kuZ75N65Zm/oC3YJr/N3ErqHcycTVvw7uto\nvVczV9OKbTL0J7TeC5kbSb0jmf9HW3BCZhOtGMcAtCkADdD2AWiAjnMA2hSABmj7VBZspxUL\n36b1/AX/JPVqCzbTFmyT3QXHSb2TBSWnL2n54C1a70DBZ7Rim6xZRevtLfiC1Ksu+Ii24LqC\nMK0Yx8gFjSBxDkAjSRWARpIqHoPePWFG9K7f5+Ug9Jx5I595E7uIHNDHfO/Sir963Pgr7x6f\n6yQeOXEmjkPkgK5+yU8rznrVuMt9rhtJ37PTWkk9svyJySPLj6RLjqy97Mis0SNnfKLe3ffE\npHs2WNdm+nwPseCiCaPmfq6e690PjppeZlUbXvLY3TPL8+8ds4q1LPvRXXPaNpuP7r9r70O5\nMz+1a6XWyMIndjOy9MgD/eDMHbum/0K9O+No0yvD66172hePmTM+9v96bLXfN6V49+xRJy1a\nwx8OnrpvRHH4razalmWnvHesbbP56P5h047Vv3RXjU0rxUYWPbGbkaVHGuhwwVHG1t6l3v0L\nYwHfUeueeq79WacYaxpf7PdtYax+jNWvIYarX3qWqee3wne0Zdk3LYrRo/t9JerOPausW6k2\nsuCJXY0sPfK+Qod2r3pmjHZSdjF2sh0dhb4sNb6Vfu3j8tj8Zy1awz9m7GX1P2pBdRlj2b9b\nrdd8dL9P+xr0xPM2rRQbWfTEbkaWHmmg66b99PfF792ln/V2dWzKjdz1+4Lq7fznLVqtznXM\nsm0SPbrfV6XuzVts3Uq1kQVP7Gpk6ZEGemtOA2NrCDqO+A4xVvX4p37fZvW/hnf/1aLV6lzH\nLNsm0aP7fZu0/7K+w6cjWUcWPLGrkaVHGujdvqKja8YP++S0Otj8B3btnjUl5B82cUvpnLG1\nFq1W5zpm2TaJHt3vm1C8b25ukE9Hso4seGJXI0uPvGvoV+/OfeLovDmRUz3c5ueU2rmuWzx+\n1ONfsMNzt/581KOW/0iMc60u03rZtmk+ut+39YGR+g+UeHQk68iiJ3YzsvRIAj3U25e438f/\nRl2MzBsnI0uPHNBlvn9IWZcaB6caI3MndUDv8M1slLEuOfynGiPzJ3VAh7z8CHdnwchJErwf\nGkmqADSSVAFoJKkC0EhSBaCRpApAI0kVgJaagX28niDVAtCyskApB+j4B6BlBaA9CUDLCkB7\nEoB2lMFZ2265MHNlw4M9O91xRN3fOqhL10Hb1DsDsz679byu91SxAYqijFFBHxyc1nVi5enW\nQ0QFoB1l8BU3bd7T75y+jx1YfnYOY387u8f0GRef/TcV9A/7rfhk8VkT2I57lbdL2cD07vcv\nHaZM8nre1AlAO8rgrx1irEgZqd4dmsGaen8rwFh5+lVhNlB5T31sYI/oJYeSz1j4qsu8nTaV\nAtCOMriXelOqvKDeTk5jZcrj2oNzlENs4EXavYlpUdAdQ+ru2K4ejppiAWhHGXy1erNPeVm9\nnZLG1igrtQffVArZQO0Am2SA7q3t5gF03ALQjhILenUE9EpldfOPNVpA67sAHb8AtKPEgt6v\n/Ep7cJ5yEKC9DkA7SizopiszKhj7svt3m2JAHwfo+AegHSUWNHu3w2WPPnKp/mM7A/RvlBkf\nAnTcA9CO0gr0fdoPPIpv69JloP6LFV3wZPWxQz/+xs8AOu4BaCSpAtBIUgWgkaQKQCNJFYBG\nkioAjSRVABpJqgA0klQBaCSp8v90QAgsMZexuQAAAABJRU5ErkJggg==",
"text/plain": [
"plot without title"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ggplot(wts.2, aes(x=month, y=weight, fill=name)) + \n",
"geom_bar(stat='identity') + \n",
"facet_wrap('name')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `gather` on experiment metadata"
]
},
{
"cell_type": "code",
"execution_count": 235,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | i7 index | i5 index |
\n",
"\n",
"\t2_MA_C | ATTACTCG | AGGCTATA |
\n",
"\t9_MA_C | ATTACTCG | GCCTCTAT |
\n",
"\t10_MA_C | ATTACTCG | AGGATAGG |
\n",
"\t14_MA_C | ATTACTCG | TCAGAGCC |
\n",
"\t15_MA_C | ATTACTCG | CTTCGCCT |
\n",
"\t21_MA_C | ATTACTCG | TAAGATTA |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & i7 index & i5 index\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & ATTACTCG & AGGCTATA \\\\\n",
"\t 9\\_MA\\_C & ATTACTCG & GCCTCTAT \\\\\n",
"\t 10\\_MA\\_C & ATTACTCG & AGGATAGG \\\\\n",
"\t 14\\_MA\\_C & ATTACTCG & TCAGAGCC \\\\\n",
"\t 15\\_MA\\_C & ATTACTCG & CTTCGCCT \\\\\n",
"\t 21\\_MA\\_C & ATTACTCG & TAAGATTA \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | i7 index | i5 index | \n",
"|---|---|---|---|---|---|\n",
"| 2_MA_C | ATTACTCG | AGGCTATA | \n",
"| 9_MA_C | ATTACTCG | GCCTCTAT | \n",
"| 10_MA_C | ATTACTCG | AGGATAGG | \n",
"| 14_MA_C | ATTACTCG | TCAGAGCC | \n",
"| 15_MA_C | ATTACTCG | CTTCGCCT | \n",
"| 21_MA_C | ATTACTCG | TAAGATTA | \n",
"\n",
"\n"
],
"text/plain": [
" Label i7 index i5 index\n",
"1 2_MA_C ATTACTCG AGGCTATA\n",
"2 9_MA_C ATTACTCG GCCTCTAT\n",
"3 10_MA_C ATTACTCG AGGATAGG\n",
"4 14_MA_C ATTACTCG TCAGAGCC\n",
"5 15_MA_C ATTACTCG CTTCGCCT\n",
"6 21_MA_C ATTACTCG TAAGATTA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, ends_with('index')) %>% \n",
"head()"
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | index | sequence |
\n",
"\n",
"\t2_MA_C | i7 index | ATTACTCG |
\n",
"\t9_MA_C | i7 index | ATTACTCG |
\n",
"\t10_MA_C | i7 index | ATTACTCG |
\n",
"\t14_MA_C | i7 index | ATTACTCG |
\n",
"\t15_MA_C | i7 index | ATTACTCG |
\n",
"\t21_MA_C | i7 index | ATTACTCG |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & index & sequence\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 9\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 10\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 14\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 15\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 21\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | index | sequence | \n",
"|---|---|---|---|---|---|\n",
"| 2_MA_C | i7 index | ATTACTCG | \n",
"| 9_MA_C | i7 index | ATTACTCG | \n",
"| 10_MA_C | i7 index | ATTACTCG | \n",
"| 14_MA_C | i7 index | ATTACTCG | \n",
"| 15_MA_C | i7 index | ATTACTCG | \n",
"| 21_MA_C | i7 index | ATTACTCG | \n",
"\n",
"\n"
],
"text/plain": [
" Label index sequence\n",
"1 2_MA_C i7 index ATTACTCG\n",
"2 9_MA_C i7 index ATTACTCG\n",
"3 10_MA_C i7 index ATTACTCG\n",
"4 14_MA_C i7 index ATTACTCG\n",
"5 15_MA_C i7 index ATTACTCG\n",
"6 21_MA_C i7 index ATTACTCG"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, ends_with('index')) %>% \n",
"gather(index, sequence, -Label) %>% \n",
"head()"
]
},
{
"cell_type": "code",
"execution_count": 239,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | index | sequence |
\n",
"\n",
"\t2_MA_C | i7 | ATTACTCG |
\n",
"\t9_MA_C | i7 | ATTACTCG |
\n",
"\t10_MA_C | i7 | ATTACTCG |
\n",
"\t14_MA_C | i7 | ATTACTCG |
\n",
"\t15_MA_C | i7 | ATTACTCG |
\n",
"\t21_MA_C | i7 | ATTACTCG |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & index & sequence\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & i7 & ATTACTCG \\\\\n",
"\t 9\\_MA\\_C & i7 & ATTACTCG \\\\\n",
"\t 10\\_MA\\_C & i7 & ATTACTCG \\\\\n",
"\t 14\\_MA\\_C & i7 & ATTACTCG \\\\\n",
"\t 15\\_MA\\_C & i7 & ATTACTCG \\\\\n",
"\t 21\\_MA\\_C & i7 & ATTACTCG \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | index | sequence | \n",
"|---|---|---|---|---|---|\n",
"| 2_MA_C | i7 | ATTACTCG | \n",
"| 9_MA_C | i7 | ATTACTCG | \n",
"| 10_MA_C | i7 | ATTACTCG | \n",
"| 14_MA_C | i7 | ATTACTCG | \n",
"| 15_MA_C | i7 | ATTACTCG | \n",
"| 21_MA_C | i7 | ATTACTCG | \n",
"\n",
"\n"
],
"text/plain": [
" Label index sequence\n",
"1 2_MA_C i7 ATTACTCG\n",
"2 9_MA_C i7 ATTACTCG\n",
"3 10_MA_C i7 ATTACTCG\n",
"4 14_MA_C i7 ATTACTCG\n",
"5 15_MA_C i7 ATTACTCG\n",
"6 21_MA_C i7 ATTACTCG"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, ends_with('index')) %>% \n",
"gather(index, sequence, -Label) %>% \n",
"mutate(index=str_extract(index, '^..')) %>%\n",
"head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Use `spread` to convert one column into multiple\n",
"\n",
"We can undo `gather` with `spread`. This is less commonly used than `gather`."
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | index | sequence |
\n",
"\n",
"\t2_MA_C | i7 index | ATTACTCG |
\n",
"\t9_MA_C | i7 index | ATTACTCG |
\n",
"\t10_MA_C | i7 index | ATTACTCG |
\n",
"\t14_MA_C | i7 index | ATTACTCG |
\n",
"\t15_MA_C | i7 index | ATTACTCG |
\n",
"\t21_MA_C | i7 index | ATTACTCG |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & index & sequence\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 9\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 10\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 14\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 15\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\t 21\\_MA\\_C & i7 index & ATTACTCG \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | index | sequence | \n",
"|---|---|---|---|---|---|\n",
"| 2_MA_C | i7 index | ATTACTCG | \n",
"| 9_MA_C | i7 index | ATTACTCG | \n",
"| 10_MA_C | i7 index | ATTACTCG | \n",
"| 14_MA_C | i7 index | ATTACTCG | \n",
"| 15_MA_C | i7 index | ATTACTCG | \n",
"| 21_MA_C | i7 index | ATTACTCG | \n",
"\n",
"\n"
],
"text/plain": [
" Label index sequence\n",
"1 2_MA_C i7 index ATTACTCG\n",
"2 9_MA_C i7 index ATTACTCG\n",
"3 10_MA_C i7 index ATTACTCG\n",
"4 14_MA_C i7 index ATTACTCG\n",
"5 15_MA_C i7 index ATTACTCG\n",
"6 21_MA_C i7 index ATTACTCG"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, ends_with('index')) %>% \n",
"gather(index, sequence, -Label) %>% \n",
"head()"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | i5 index | i7 index |
\n",
"\n",
"\t1_MA_J | AGGCTATA | CGCTCATT |
\n",
"\t1_RZ_J | AGGCTATA | GAGATTCC |
\n",
"\t10_MA_C | AGGATAGG | ATTACTCG |
\n",
"\t10_RZ_C | AGGATAGG | TCCGGAGA |
\n",
"\t11_MA_J | AGGATAGG | CGCTCATT |
\n",
"\t11_RZ_J | AGGATAGG | GAGATTCC |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & i5 index & i7 index\\\\\n",
"\\hline\n",
"\t 1\\_MA\\_J & AGGCTATA & CGCTCATT \\\\\n",
"\t 1\\_RZ\\_J & AGGCTATA & GAGATTCC \\\\\n",
"\t 10\\_MA\\_C & AGGATAGG & ATTACTCG \\\\\n",
"\t 10\\_RZ\\_C & AGGATAGG & TCCGGAGA \\\\\n",
"\t 11\\_MA\\_J & AGGATAGG & CGCTCATT \\\\\n",
"\t 11\\_RZ\\_J & AGGATAGG & GAGATTCC \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | i5 index | i7 index | \n",
"|---|---|---|---|---|---|\n",
"| 1_MA_J | AGGCTATA | CGCTCATT | \n",
"| 1_RZ_J | AGGCTATA | GAGATTCC | \n",
"| 10_MA_C | AGGATAGG | ATTACTCG | \n",
"| 10_RZ_C | AGGATAGG | TCCGGAGA | \n",
"| 11_MA_J | AGGATAGG | CGCTCATT | \n",
"| 11_RZ_J | AGGATAGG | GAGATTCC | \n",
"\n",
"\n"
],
"text/plain": [
" Label i5 index i7 index\n",
"1 1_MA_J AGGCTATA CGCTCATT\n",
"2 1_RZ_J AGGCTATA GAGATTCC\n",
"3 10_MA_C AGGATAGG ATTACTCG\n",
"4 10_RZ_C AGGATAGG TCCGGAGA\n",
"5 11_MA_J AGGATAGG CGCTCATT\n",
"6 11_RZ_J AGGATAGG GAGATTCC"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, ends_with('index')) %>% \n",
"gather(index, sequence, -Label) %>% \n",
"spread(index, sequence) %>%\n",
"head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Use `separate` to split a single column containing multiple values\n",
"\n",
"The label column actually contains three pieces of informaiton separated by underscores - a sample number, the RNA enrichment methd, and code for the person presenting the sample. Assuuming that we do not have this informatio nduplicate in other columns, it woujld be tricky to compare, say, results by the sampling method.\n",
"\n",
"The `separate` function is designed to address this problem."
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | concentration_fold_difference |
\n",
"\n",
"\t2_MA_C | 1.34 |
\n",
"\t9_MA_C | 2.23 |
\n",
"\t10_MA_C | 4.37 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" Label & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & 1.34 \\\\\n",
"\t 9\\_MA\\_C & 2.23 \\\\\n",
"\t 10\\_MA\\_C & 4.37 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | concentration_fold_difference | \n",
"|---|---|---|\n",
"| 2_MA_C | 1.34 | \n",
"| 9_MA_C | 2.23 | \n",
"| 10_MA_C | 4.37 | \n",
"\n",
"\n"
],
"text/plain": [
" Label concentration_fold_difference\n",
"1 2_MA_C 1.34 \n",
"2 9_MA_C 2.23 \n",
"3 10_MA_C 4.37 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, concentration_fold_difference) %>% \n",
"head(3)"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"sample_num | enrichment | person | concentration_fold_difference |
\n",
"\n",
"\t2 | MA | C | 1.34 |
\n",
"\t9 | MA | C | 2.23 |
\n",
"\t10 | MA | C | 4.37 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" sample\\_num & enrichment & person & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t 2 & MA & C & 1.34\\\\\n",
"\t 9 & MA & C & 2.23\\\\\n",
"\t 10 & MA & C & 4.37\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"sample_num | enrichment | person | concentration_fold_difference | \n",
"|---|---|---|\n",
"| 2 | MA | C | 1.34 | \n",
"| 9 | MA | C | 2.23 | \n",
"| 10 | MA | C | 4.37 | \n",
"\n",
"\n"
],
"text/plain": [
" sample_num enrichment person concentration_fold_difference\n",
"1 2 MA C 1.34 \n",
"2 9 MA C 2.23 \n",
"3 10 MA C 4.37 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, concentration_fold_difference) %>% \n",
"separate(Label, sep='_', into=c('sample_num', 'enrichment', 'person')) %>%\n",
"head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### It is now easy to condition on sample_number, enrichment method or person\n",
"\n",
"Here is a summary of the data."
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"person | enrichment | n | concentration_fold_difference |
\n",
"\n",
"\tC | MA | 24 | 2.3675 |
\n",
"\tC | RZ | 24 | 2.3675 |
\n",
"\tC | TOT | 3 | 1.3400 |
\n",
"\tJ | MA | 24 | 3.4350 |
\n",
"\tJ | RZ | 24 | 3.4350 |
\n",
"\tJ | TOT | 3 | 1.9800 |
\n",
"\tP | MA | 24 | 3.0800 |
\n",
"\tP | RZ | 24 | 3.0800 |
\n",
"\tP | TOT | 3 | 2.0500 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" person & enrichment & n & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t C & MA & 24 & 2.3675\\\\\n",
"\t C & RZ & 24 & 2.3675\\\\\n",
"\t C & TOT & 3 & 1.3400\\\\\n",
"\t J & MA & 24 & 3.4350\\\\\n",
"\t J & RZ & 24 & 3.4350\\\\\n",
"\t J & TOT & 3 & 1.9800\\\\\n",
"\t P & MA & 24 & 3.0800\\\\\n",
"\t P & RZ & 24 & 3.0800\\\\\n",
"\t P & TOT & 3 & 2.0500\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"person | enrichment | n | concentration_fold_difference | \n",
"|---|---|---|---|---|---|---|---|---|\n",
"| C | MA | 24 | 2.3675 | \n",
"| C | RZ | 24 | 2.3675 | \n",
"| C | TOT | 3 | 1.3400 | \n",
"| J | MA | 24 | 3.4350 | \n",
"| J | RZ | 24 | 3.4350 | \n",
"| J | TOT | 3 | 1.9800 | \n",
"| P | MA | 24 | 3.0800 | \n",
"| P | RZ | 24 | 3.0800 | \n",
"| P | TOT | 3 | 2.0500 | \n",
"\n",
"\n"
],
"text/plain": [
" person enrichment n concentration_fold_difference\n",
"1 C MA 24 2.3675 \n",
"2 C RZ 24 2.3675 \n",
"3 C TOT 3 1.3400 \n",
"4 J MA 24 3.4350 \n",
"5 J RZ 24 3.4350 \n",
"6 J TOT 3 1.9800 \n",
"7 P MA 24 3.0800 \n",
"8 P RZ 24 3.0800 \n",
"9 P TOT 3 2.0500 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, concentration_fold_difference) %>% \n",
"separate(Label, into=c('sample_num', 'enrichment', 'person'), sep='_') %>%\n",
"add_count(enrichment) %>%\n",
"group_by(person, enrichment, n) %>%\n",
"summarize_at('concentration_fold_difference', mean) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Use `unite` to craate a single variable from multiple columns\n",
"\n",
"Sometimes we want to do the opposite and combine multiple columns into a single column. Use `unite` to do this."
]
},
{
"cell_type": "code",
"execution_count": 276,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"sample_num | enrichment | person | concentration_fold_difference |
\n",
"\n",
"\t2 | MA | C | 1.34 |
\n",
"\t9 | MA | C | 2.23 |
\n",
"\t10 | MA | C | 4.37 |
\n",
"\t14 | MA | C | 1.57 |
\n",
"\t15 | MA | C | 2.85 |
\n",
"\t21 | MA | C | 1.81 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" sample\\_num & enrichment & person & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t 2 & MA & C & 1.34\\\\\n",
"\t 9 & MA & C & 2.23\\\\\n",
"\t 10 & MA & C & 4.37\\\\\n",
"\t 14 & MA & C & 1.57\\\\\n",
"\t 15 & MA & C & 2.85\\\\\n",
"\t 21 & MA & C & 1.81\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"sample_num | enrichment | person | concentration_fold_difference | \n",
"|---|---|---|---|---|---|\n",
"| 2 | MA | C | 1.34 | \n",
"| 9 | MA | C | 2.23 | \n",
"| 10 | MA | C | 4.37 | \n",
"| 14 | MA | C | 1.57 | \n",
"| 15 | MA | C | 2.85 | \n",
"| 21 | MA | C | 1.81 | \n",
"\n",
"\n"
],
"text/plain": [
" sample_num enrichment person concentration_fold_difference\n",
"1 2 MA C 1.34 \n",
"2 9 MA C 2.23 \n",
"3 10 MA C 4.37 \n",
"4 14 MA C 1.57 \n",
"5 15 MA C 2.85 \n",
"6 21 MA C 1.81 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, concentration_fold_difference) %>% \n",
"separate(Label, sep='_', into=c('sample_num', 'enrichment', 'person')) %>%\n",
"head"
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | concentration_fold_difference |
\n",
"\n",
"\t2_MA_C | 1.34 |
\n",
"\t9_MA_C | 2.23 |
\n",
"\t10_MA_C | 4.37 |
\n",
"\t14_MA_C | 1.57 |
\n",
"\t15_MA_C | 2.85 |
\n",
"\t21_MA_C | 1.81 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" Label & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & 1.34 \\\\\n",
"\t 9\\_MA\\_C & 2.23 \\\\\n",
"\t 10\\_MA\\_C & 4.37 \\\\\n",
"\t 14\\_MA\\_C & 1.57 \\\\\n",
"\t 15\\_MA\\_C & 2.85 \\\\\n",
"\t 21\\_MA\\_C & 1.81 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | concentration_fold_difference | \n",
"|---|---|---|---|---|---|\n",
"| 2_MA_C | 1.34 | \n",
"| 9_MA_C | 2.23 | \n",
"| 10_MA_C | 4.37 | \n",
"| 14_MA_C | 1.57 | \n",
"| 15_MA_C | 2.85 | \n",
"| 21_MA_C | 1.81 | \n",
"\n",
"\n"
],
"text/plain": [
" Label concentration_fold_difference\n",
"1 2_MA_C 1.34 \n",
"2 9_MA_C 2.23 \n",
"3 10_MA_C 4.37 \n",
"4 14_MA_C 1.57 \n",
"5 15_MA_C 2.85 \n",
"6 21_MA_C 1.81 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"select(Label, concentration_fold_difference) %>% \n",
"separate(Label, sep='_', into=c('sample_num', 'enrichment', 'person')) %>%\n",
"unite(Label, c('sample_num', 'enrichment', 'person'), sep='_') %>% \n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Extract parts with regular expressions\n",
"\n",
"Sometiems there is no simple delimiter between parts of a value. In such cases, we may need to use reuglar expressions to extract parts from a string."
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | concentration_fold_difference |
\n",
"\n",
"\t2MAC | 1.34 |
\n",
"\t9MAC | 2.23 |
\n",
"\t10MAC | 4.37 |
\n",
"\t14MAC | 1.57 |
\n",
"\t15MAC | 2.85 |
\n",
"\t21MAC | 1.81 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" Label & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t 2MAC & 1.34 \\\\\n",
"\t 9MAC & 2.23 \\\\\n",
"\t 10MAC & 4.37 \\\\\n",
"\t 14MAC & 1.57 \\\\\n",
"\t 15MAC & 2.85 \\\\\n",
"\t 21MAC & 1.81 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | concentration_fold_difference | \n",
"|---|---|---|---|---|---|\n",
"| 2MAC | 1.34 | \n",
"| 9MAC | 2.23 | \n",
"| 10MAC | 4.37 | \n",
"| 14MAC | 1.57 | \n",
"| 15MAC | 2.85 | \n",
"| 21MAC | 1.81 | \n",
"\n",
"\n"
],
"text/plain": [
" Label concentration_fold_difference\n",
"1 2MAC 1.34 \n",
"2 9MAC 2.23 \n",
"3 10MAC 4.37 \n",
"4 14MAC 1.57 \n",
"5 15MAC 2.85 \n",
"6 21MAC 1.81 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bad_labels <- df %>% \n",
"select(Label, concentration_fold_difference) %>% \n",
"separate(Label, \n",
" sep='_', \n",
" into=c('sample_num', 'enrichment', 'person')) %>%\n",
"unite(Label, c('sample_num', 'enrichment', 'person'), sep='')\n",
"head(bad_labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The regular expression consists of 3 capture groups in parentheses that will form the new columns.\n",
"\n",
"`'([0-9]+)(.*)(.$)'`\n",
"\n",
"where\n",
"\n",
"- Capture group 1 `[0-9]+` means match one or more `+` of any digits `[0-9]`\n",
"- Capture group 2 `.*` means match zero or more `*` of any character `.`\n",
"- Capture gorup 3 `.$` means match any character `.` at the end `$`"
]
},
{
"cell_type": "code",
"execution_count": 306,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"sample_num | enrichment | person | concentration_fold_difference |
\n",
"\n",
"\t2 | MA | C | 1.34 |
\n",
"\t9 | MA | C | 2.23 |
\n",
"\t10 | MA | C | 4.37 |
\n",
"\t14 | MA | C | 1.57 |
\n",
"\t15 | MA | C | 2.85 |
\n",
"\t21 | MA | C | 1.81 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" sample\\_num & enrichment & person & concentration\\_fold\\_difference\\\\\n",
"\\hline\n",
"\t 2 & MA & C & 1.34\\\\\n",
"\t 9 & MA & C & 2.23\\\\\n",
"\t 10 & MA & C & 4.37\\\\\n",
"\t 14 & MA & C & 1.57\\\\\n",
"\t 15 & MA & C & 2.85\\\\\n",
"\t 21 & MA & C & 1.81\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"sample_num | enrichment | person | concentration_fold_difference | \n",
"|---|---|---|---|---|---|\n",
"| 2 | MA | C | 1.34 | \n",
"| 9 | MA | C | 2.23 | \n",
"| 10 | MA | C | 4.37 | \n",
"| 14 | MA | C | 1.57 | \n",
"| 15 | MA | C | 2.85 | \n",
"| 21 | MA | C | 1.81 | \n",
"\n",
"\n"
],
"text/plain": [
" sample_num enrichment person concentration_fold_difference\n",
"1 2 MA C 1.34 \n",
"2 9 MA C 2.23 \n",
"3 10 MA C 4.37 \n",
"4 14 MA C 1.57 \n",
"5 15 MA C 2.85 \n",
"6 21 MA C 1.81 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bad_labels %>% \n",
"extract(Label, \n",
" into=c('sample_num', 'enrichment', 'person'), \n",
" '([0-9]+)(.*)(.$)') %>%\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using `separate_rows` rows when multiple values are in a single cell.\n",
"\n",
"Very occassionally, we find data sets where multiple values are stored in a single cell. We will make up and exmple here as this does not occur in the metadata data set."
]
},
{
"cell_type": "code",
"execution_count": 330,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | primer |
\n",
"\n",
"\t1_MA_J | AGGCTATA,CGCTCATT |
\n",
"\t1_RZ_J | AGGCTATA,GAGATTCC |
\n",
"\t10_MA_C | AGGATAGG,ATTACTCG |
\n",
"\t10_RZ_C | AGGATAGG,TCCGGAGA |
\n",
"\t11_MA_J | AGGATAGG,CGCTCATT |
\n",
"\t11_RZ_J | AGGATAGG,GAGATTCC |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" Label & primer\\\\\n",
"\\hline\n",
"\t 1\\_MA\\_J & AGGCTATA,CGCTCATT\\\\\n",
"\t 1\\_RZ\\_J & AGGCTATA,GAGATTCC\\\\\n",
"\t 10\\_MA\\_C & AGGATAGG,ATTACTCG\\\\\n",
"\t 10\\_RZ\\_C & AGGATAGG,TCCGGAGA\\\\\n",
"\t 11\\_MA\\_J & AGGATAGG,CGCTCATT\\\\\n",
"\t 11\\_RZ\\_J & AGGATAGG,GAGATTCC\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | primer | \n",
"|---|---|---|---|---|---|\n",
"| 1_MA_J | AGGCTATA,CGCTCATT | \n",
"| 1_RZ_J | AGGCTATA,GAGATTCC | \n",
"| 10_MA_C | AGGATAGG,ATTACTCG | \n",
"| 10_RZ_C | AGGATAGG,TCCGGAGA | \n",
"| 11_MA_J | AGGATAGG,CGCTCATT | \n",
"| 11_RZ_J | AGGATAGG,GAGATTCC | \n",
"\n",
"\n"
],
"text/plain": [
" Label primer \n",
"1 1_MA_J AGGCTATA,CGCTCATT\n",
"2 1_RZ_J AGGCTATA,GAGATTCC\n",
"3 10_MA_C AGGATAGG,ATTACTCG\n",
"4 10_RZ_C AGGATAGG,TCCGGAGA\n",
"5 11_MA_J AGGATAGG,CGCTCATT\n",
"6 11_RZ_J AGGATAGG,GAGATTCC"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bad_primer <- df %>% \n",
"select(Label, `i5 index`, `i7 index`) %>% \n",
"unite(primer, c(`i5 index`, `i7 index`), sep=',') %>%\n",
"arrange(Label)\n",
"head(bad_primer)"
]
},
{
"cell_type": "code",
"execution_count": 331,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | primer |
\n",
"\n",
"\t1_MA_J | AGGCTATA |
\n",
"\t1_MA_J | CGCTCATT |
\n",
"\t1_RZ_J | AGGCTATA |
\n",
"\t1_RZ_J | GAGATTCC |
\n",
"\t10_MA_C | AGGATAGG |
\n",
"\t10_MA_C | ATTACTCG |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" Label & primer\\\\\n",
"\\hline\n",
"\t 1\\_MA\\_J & AGGCTATA \\\\\n",
"\t 1\\_MA\\_J & CGCTCATT \\\\\n",
"\t 1\\_RZ\\_J & AGGCTATA \\\\\n",
"\t 1\\_RZ\\_J & GAGATTCC \\\\\n",
"\t 10\\_MA\\_C & AGGATAGG \\\\\n",
"\t 10\\_MA\\_C & ATTACTCG \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | primer | \n",
"|---|---|---|---|---|---|\n",
"| 1_MA_J | AGGCTATA | \n",
"| 1_MA_J | CGCTCATT | \n",
"| 1_RZ_J | AGGCTATA | \n",
"| 1_RZ_J | GAGATTCC | \n",
"| 10_MA_C | AGGATAGG | \n",
"| 10_MA_C | ATTACTCG | \n",
"\n",
"\n"
],
"text/plain": [
" Label primer \n",
"1 1_MA_J AGGCTATA\n",
"2 1_MA_J CGCTCATT\n",
"3 1_RZ_J AGGCTATA\n",
"4 1_RZ_J GAGATTCC\n",
"5 10_MA_C AGGATAGG\n",
"6 10_MA_C ATTACTCG"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bad_primer %>% \n",
"separate_rows(primer, sep=',') %>%\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Challenge: Reconstruct column structure of orignial metadata file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `separate_rows` will place the items in each row in alternating columns, we create a temporary variable to store the position in the original cell."
]
},
{
"cell_type": "code",
"execution_count": 332,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | primer | tmp |
\n",
"\n",
"\t1_MA_J | AGGCTATA | 1 |
\n",
"\t1_MA_J | CGCTCATT | 2 |
\n",
"\t1_RZ_J | AGGCTATA | 1 |
\n",
"\t1_RZ_J | GAGATTCC | 2 |
\n",
"\t10_MA_C | AGGATAGG | 1 |
\n",
"\t10_MA_C | ATTACTCG | 2 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & primer & tmp\\\\\n",
"\\hline\n",
"\t 1\\_MA\\_J & AGGCTATA & 1 \\\\\n",
"\t 1\\_MA\\_J & CGCTCATT & 2 \\\\\n",
"\t 1\\_RZ\\_J & AGGCTATA & 1 \\\\\n",
"\t 1\\_RZ\\_J & GAGATTCC & 2 \\\\\n",
"\t 10\\_MA\\_C & AGGATAGG & 1 \\\\\n",
"\t 10\\_MA\\_C & ATTACTCG & 2 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | primer | tmp | \n",
"|---|---|---|---|---|---|\n",
"| 1_MA_J | AGGCTATA | 1 | \n",
"| 1_MA_J | CGCTCATT | 2 | \n",
"| 1_RZ_J | AGGCTATA | 1 | \n",
"| 1_RZ_J | GAGATTCC | 2 | \n",
"| 10_MA_C | AGGATAGG | 1 | \n",
"| 10_MA_C | ATTACTCG | 2 | \n",
"\n",
"\n"
],
"text/plain": [
" Label primer tmp\n",
"1 1_MA_J AGGCTATA 1 \n",
"2 1_MA_J CGCTCATT 2 \n",
"3 1_RZ_J AGGCTATA 1 \n",
"4 1_RZ_J GAGATTCC 2 \n",
"5 10_MA_C AGGATAGG 1 \n",
"6 10_MA_C ATTACTCG 2 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bad_primer %>% \n",
"separate_rows(primer, sep=',') %>%\n",
"mutate(tmp=rep_len(1:2, n())) %>% \n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then `spread` the primer values by the temporary variable and rename the columns as appropriate."
]
},
{
"cell_type": "code",
"execution_count": 333,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | i5 index | i7 index |
\n",
"\n",
"\t1_MA_J | AGGCTATA | CGCTCATT |
\n",
"\t1_RZ_J | AGGCTATA | GAGATTCC |
\n",
"\t10_MA_C | AGGATAGG | ATTACTCG |
\n",
"\t10_RZ_C | AGGATAGG | TCCGGAGA |
\n",
"\t11_MA_J | AGGATAGG | CGCTCATT |
\n",
"\t11_RZ_J | AGGATAGG | GAGATTCC |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & i5 index & i7 index\\\\\n",
"\\hline\n",
"\t 1\\_MA\\_J & AGGCTATA & CGCTCATT \\\\\n",
"\t 1\\_RZ\\_J & AGGCTATA & GAGATTCC \\\\\n",
"\t 10\\_MA\\_C & AGGATAGG & ATTACTCG \\\\\n",
"\t 10\\_RZ\\_C & AGGATAGG & TCCGGAGA \\\\\n",
"\t 11\\_MA\\_J & AGGATAGG & CGCTCATT \\\\\n",
"\t 11\\_RZ\\_J & AGGATAGG & GAGATTCC \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | i5 index | i7 index | \n",
"|---|---|---|---|---|---|\n",
"| 1_MA_J | AGGCTATA | CGCTCATT | \n",
"| 1_RZ_J | AGGCTATA | GAGATTCC | \n",
"| 10_MA_C | AGGATAGG | ATTACTCG | \n",
"| 10_RZ_C | AGGATAGG | TCCGGAGA | \n",
"| 11_MA_J | AGGATAGG | CGCTCATT | \n",
"| 11_RZ_J | AGGATAGG | GAGATTCC | \n",
"\n",
"\n"
],
"text/plain": [
" Label i5 index i7 index\n",
"1 1_MA_J AGGCTATA CGCTCATT\n",
"2 1_RZ_J AGGCTATA GAGATTCC\n",
"3 10_MA_C AGGATAGG ATTACTCG\n",
"4 10_RZ_C AGGATAGG TCCGGAGA\n",
"5 11_MA_J AGGATAGG CGCTCATT\n",
"6 11_RZ_J AGGATAGG GAGATTCC"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"bad_primer %>% \n",
"separate_rows(primer, sep=',') %>%\n",
"mutate(tmp=rep_len(1:2, n())) %>% \n",
"spread(tmp, primer) %>%\n",
"rename(`i5 index`=`1`, `i7 index`=`2`) %>%\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comapre with original"
]
},
{
"cell_type": "code",
"execution_count": 334,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | i5 index | i7 index |
\n",
"\n",
"\t1_MA_J | AGGCTATA | CGCTCATT |
\n",
"\t1_RZ_J | AGGCTATA | GAGATTCC |
\n",
"\t10_MA_C | AGGATAGG | ATTACTCG |
\n",
"\t10_RZ_C | AGGATAGG | TCCGGAGA |
\n",
"\t11_MA_J | AGGATAGG | CGCTCATT |
\n",
"\t11_RZ_J | AGGATAGG | GAGATTCC |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" Label & i5 index & i7 index\\\\\n",
"\\hline\n",
"\t 1\\_MA\\_J & AGGCTATA & CGCTCATT \\\\\n",
"\t 1\\_RZ\\_J & AGGCTATA & GAGATTCC \\\\\n",
"\t 10\\_MA\\_C & AGGATAGG & ATTACTCG \\\\\n",
"\t 10\\_RZ\\_C & AGGATAGG & TCCGGAGA \\\\\n",
"\t 11\\_MA\\_J & AGGATAGG & CGCTCATT \\\\\n",
"\t 11\\_RZ\\_J & AGGATAGG & GAGATTCC \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | i5 index | i7 index | \n",
"|---|---|---|---|---|---|\n",
"| 1_MA_J | AGGCTATA | CGCTCATT | \n",
"| 1_RZ_J | AGGCTATA | GAGATTCC | \n",
"| 10_MA_C | AGGATAGG | ATTACTCG | \n",
"| 10_RZ_C | AGGATAGG | TCCGGAGA | \n",
"| 11_MA_J | AGGATAGG | CGCTCATT | \n",
"| 11_RZ_J | AGGATAGG | GAGATTCC | \n",
"\n",
"\n"
],
"text/plain": [
" Label i5 index i7 index\n",
"1 1_MA_J AGGCTATA CGCTCATT\n",
"2 1_RZ_J AGGCTATA GAGATTCC\n",
"3 10_MA_C AGGATAGG ATTACTCG\n",
"4 10_RZ_C AGGATAGG TCCGGAGA\n",
"5 11_MA_J AGGATAGG CGCTCATT\n",
"6 11_RZ_J AGGATAGG GAGATTCC"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>%\n",
"select(Label, `i5 index`, `i7 index`) %>%\n",
"arrange(Label) %>%\n",
"head"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Challenge exercise \n",
"\n",
"Let's look at the index and primer columne more closely. It seems that the index and primer values are linked and are just different ways of labeling the same thing. So we really should be have columns like this\n",
"\n",
"| primer_type | code | seq |\n",
"| - | - | - |\n",
"| i5 | i501 | AGGCTATA | \n",
"| i5 | i502 | GCCTCTAT |\n",
"| i7 | i701 | ATTACTCG |\n",
"| i7 | i702 | TCCGGAGA |\n",
"\n",
"We can do this by combining the operatios we have seen above."
]
},
{
"cell_type": "code",
"execution_count": 232,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"Label | RNA_sample_num | Media | Strain | Replicate | concentration_fold_difference | primer_type | code | seq |
\n",
"\n",
"\t2_MA_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA |
\n",
"\t2_RZ_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA |
\n",
"\t2_TOT_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA |
\n",
"\t1_MA_J | 1 | YPD | H99 | 1 | 3.64 | i5 | 01 | AGGCTATA |
\n",
"\t1_RZ_J | 1 | YPD | H99 | 1 | 3.64 | i5 | 01 | AGGCTATA |
\n",
"\t4_MA_P | 4 | YPD | H99 | 4 | 2.05 | i5 | 01 | AGGCTATA |
\n",
"\t4_RZ_P | 4 | YPD | H99 | 4 | 2.05 | i5 | 01 | AGGCTATA |
\n",
"\t9_MA_C | 9 | YPD | mar1d | 3 | 2.23 | i5 | 02 | GCCTCTAT |
\n",
"\t9_RZ_C | 9 | YPD | mar1d | 3 | 2.23 | i5 | 02 | GCCTCTAT |
\n",
"\t3_MA_J | 3 | YPD | H99 | 3 | 1.98 | i5 | 02 | GCCTCTAT |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lllllllll}\n",
" Label & RNA\\_sample\\_num & Media & Strain & Replicate & concentration\\_fold\\_difference & primer\\_type & code & seq\\\\\n",
"\\hline\n",
"\t 2\\_MA\\_C & 2 & YPD & H99 & 2 & 1.34 & i5 & 01 & AGGCTATA \\\\\n",
"\t 2\\_RZ\\_C & 2 & YPD & H99 & 2 & 1.34 & i5 & 01 & AGGCTATA \\\\\n",
"\t 2\\_TOT\\_C & 2 & YPD & H99 & 2 & 1.34 & i5 & 01 & AGGCTATA \\\\\n",
"\t 1\\_MA\\_J & 1 & YPD & H99 & 1 & 3.64 & i5 & 01 & AGGCTATA \\\\\n",
"\t 1\\_RZ\\_J & 1 & YPD & H99 & 1 & 3.64 & i5 & 01 & AGGCTATA \\\\\n",
"\t 4\\_MA\\_P & 4 & YPD & H99 & 4 & 2.05 & i5 & 01 & AGGCTATA \\\\\n",
"\t 4\\_RZ\\_P & 4 & YPD & H99 & 4 & 2.05 & i5 & 01 & AGGCTATA \\\\\n",
"\t 9\\_MA\\_C & 9 & YPD & mar1d & 3 & 2.23 & i5 & 02 & GCCTCTAT \\\\\n",
"\t 9\\_RZ\\_C & 9 & YPD & mar1d & 3 & 2.23 & i5 & 02 & GCCTCTAT \\\\\n",
"\t 3\\_MA\\_J & 3 & YPD & H99 & 3 & 1.98 & i5 & 02 & GCCTCTAT \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"Label | RNA_sample_num | Media | Strain | Replicate | concentration_fold_difference | primer_type | code | seq | \n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 2_MA_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA | \n",
"| 2_RZ_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA | \n",
"| 2_TOT_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA | \n",
"| 1_MA_J | 1 | YPD | H99 | 1 | 3.64 | i5 | 01 | AGGCTATA | \n",
"| 1_RZ_J | 1 | YPD | H99 | 1 | 3.64 | i5 | 01 | AGGCTATA | \n",
"| 4_MA_P | 4 | YPD | H99 | 4 | 2.05 | i5 | 01 | AGGCTATA | \n",
"| 4_RZ_P | 4 | YPD | H99 | 4 | 2.05 | i5 | 01 | AGGCTATA | \n",
"| 9_MA_C | 9 | YPD | mar1d | 3 | 2.23 | i5 | 02 | GCCTCTAT | \n",
"| 9_RZ_C | 9 | YPD | mar1d | 3 | 2.23 | i5 | 02 | GCCTCTAT | \n",
"| 3_MA_J | 3 | YPD | H99 | 3 | 1.98 | i5 | 02 | GCCTCTAT | \n",
"\n",
"\n"
],
"text/plain": [
" Label RNA_sample_num Media Strain Replicate concentration_fold_difference\n",
"1 2_MA_C 2 YPD H99 2 1.34 \n",
"2 2_RZ_C 2 YPD H99 2 1.34 \n",
"3 2_TOT_C 2 YPD H99 2 1.34 \n",
"4 1_MA_J 1 YPD H99 1 3.64 \n",
"5 1_RZ_J 1 YPD H99 1 3.64 \n",
"6 4_MA_P 4 YPD H99 4 2.05 \n",
"7 4_RZ_P 4 YPD H99 4 2.05 \n",
"8 9_MA_C 9 YPD mar1d 3 2.23 \n",
"9 9_RZ_C 9 YPD mar1d 3 2.23 \n",
"10 3_MA_J 3 YPD H99 3 1.98 \n",
" primer_type code seq \n",
"1 i5 01 AGGCTATA\n",
"2 i5 01 AGGCTATA\n",
"3 i5 01 AGGCTATA\n",
"4 i5 01 AGGCTATA\n",
"5 i5 01 AGGCTATA\n",
"6 i5 01 AGGCTATA\n",
"7 i5 01 AGGCTATA\n",
"8 i5 02 GCCTCTAT\n",
"9 i5 02 GCCTCTAT\n",
"10 i5 02 GCCTCTAT"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df %>% \n",
"unite(i7, c(`i7 primer`, `i7 index`), sep=':') %>%\n",
"unite(i5, c(`i5 primer`, `i5 index`), sep=':') %>%\n",
"gather(primer_type, val, c(i7, i5)) %>%\n",
"separate(val, into=c('code', 'seq'), sep=':') %>%\n",
"mutate(code=str_extract(code, '..$')) %>%\n",
"arrange(primer_type, code) %>% \n",
"head(10)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "r"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "3.4.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}