{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using `tidyr` to create tidy data sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Description**\n", "> Easily Tidy Data with 'spread()' and 'gather()' Functions\n", "\n", "A tidy data frame is one where\n", "\n", "- Each column is a variable\n", "- Each row is an observatino\n", "- Each value is a cell\n", "\n", "As we have seen, tidy data sets can be easily plotted with `ggplot2` and manipulated with `dplyr`. In this notebook, we see how to convert messy data sets into tidy ones using the verbs `gather`, `spread`, `separate`, `extract`, `separte_rows` and `unite`.\n", "\n", "For more complicated tidying jobs, see the full range of functions in the [`tidyr` documentation](https://cran.r-project.org/web/packages/tidyr/tidyr.pdf)." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "library(tidyverse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We will work with a subset of the pilot metadata that we are familiar with from the `dplyr` session." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "path='../josh/info/2018_pilot_metadata.tsv'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Parsed with column specification:\n", "cols(\n", " Label = col_character(),\n", " RNA_sample_num = col_integer(),\n", " Media = col_character(),\n", " Strain = col_character(),\n", " Replicate = col_integer(),\n", " experiment_person = col_character(),\n", " libprep_person = col_character(),\n", " enrichment_method = col_character(),\n", " RIN = col_double(),\n", " concentration_fold_difference = col_double(),\n", " `i7 index` = col_character(),\n", " `i5 index` = col_character(),\n", " `i5 primer` = col_character(),\n", " `i7 primer` = col_character(),\n", " `library#` = col_integer()\n", ")\n" ] } ], "source": [ "df <- read_tsv(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subset the data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df <- df %>% select(\n", " Label,\n", " RNA_sample_num,\n", " Media,\n", " Strain,\n", " Replicate,\n", " concentration_fold_difference,\n", " `i7 index`,\n", " `i5 index`, \n", " `i5 primer`,\n", " `i7 primer`,\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
LabelRNA_sample_numMediaStrainReplicateconcentration_fold_differencei7 indexi5 indexi5 primeri7 primer
2_MA_C 2 YPD H99 2 1.34 ATTACTCGAGGCTATAi501 i701
9_MA_C 9 YPD mar1d 3 2.23 ATTACTCGGCCTCTATi502 i701
10_MA_C 10 YPD mar1d 4 4.37 ATTACTCGAGGATAGGi503 i701
14_MA_C 14 TC H99 2 1.57 ATTACTCGTCAGAGCCi504 i701
15_MA_C 15 TC H99 3 2.85 ATTACTCGCTTCGCCTi505 i701
21_MA_C 21 TC mar1d 3 1.81 ATTACTCGTAAGATTAi506 i701
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " Label & RNA\\_sample\\_num & Media & Strain & Replicate & concentration\\_fold\\_difference & i7 index & i5 index & i5 primer & i7 primer\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & 2 & YPD & H99 & 2 & 1.34 & ATTACTCG & AGGCTATA & i501 & i701 \\\\\n", "\t 9\\_MA\\_C & 9 & YPD & mar1d & 3 & 2.23 & ATTACTCG & GCCTCTAT & i502 & i701 \\\\\n", "\t 10\\_MA\\_C & 10 & YPD & mar1d & 4 & 4.37 & ATTACTCG & AGGATAGG & i503 & i701 \\\\\n", "\t 14\\_MA\\_C & 14 & TC & H99 & 2 & 1.57 & ATTACTCG & TCAGAGCC & i504 & i701 \\\\\n", "\t 15\\_MA\\_C & 15 & TC & H99 & 3 & 2.85 & ATTACTCG & CTTCGCCT & i505 & i701 \\\\\n", "\t 21\\_MA\\_C & 21 & TC & mar1d & 3 & 1.81 & ATTACTCG & TAAGATTA & i506 & i701 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | RNA_sample_num | Media | Strain | Replicate | concentration_fold_difference | i7 index | i5 index | i5 primer | i7 primer | \n", "|---|---|---|---|---|---|\n", "| 2_MA_C | 2 | YPD | H99 | 2 | 1.34 | ATTACTCG | AGGCTATA | i501 | i701 | \n", "| 9_MA_C | 9 | YPD | mar1d | 3 | 2.23 | ATTACTCG | GCCTCTAT | i502 | i701 | \n", "| 10_MA_C | 10 | YPD | mar1d | 4 | 4.37 | ATTACTCG | AGGATAGG | i503 | i701 | \n", "| 14_MA_C | 14 | TC | H99 | 2 | 1.57 | ATTACTCG | TCAGAGCC | i504 | i701 | \n", "| 15_MA_C | 15 | TC | H99 | 3 | 2.85 | ATTACTCG | CTTCGCCT | i505 | i701 | \n", "| 21_MA_C | 21 | TC | mar1d | 3 | 1.81 | ATTACTCG | TAAGATTA | i506 | i701 | \n", "\n", "\n" ], "text/plain": [ " Label RNA_sample_num Media Strain Replicate concentration_fold_difference\n", "1 2_MA_C 2 YPD H99 2 1.34 \n", "2 9_MA_C 9 YPD mar1d 3 2.23 \n", "3 10_MA_C 10 YPD mar1d 4 4.37 \n", "4 14_MA_C 14 TC H99 2 1.57 \n", "5 15_MA_C 15 TC H99 3 2.85 \n", "6 21_MA_C 21 TC mar1d 3 1.81 \n", " i7 index i5 index i5 primer i7 primer\n", "1 ATTACTCG AGGCTATA i501 i701 \n", "2 ATTACTCG GCCTCTAT i502 i701 \n", "3 ATTACTCG AGGATAGG i503 i701 \n", "4 ATTACTCG TCAGAGCC i504 i701 \n", "5 ATTACTCG CTTCGCCT i505 i701 \n", "6 ATTACTCG TAAGATTA i506 i701 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Use `gather` to combine multiple columns into one\n", "\n", "We note thet there are two \"index\" and \"primer\" columns - for a tidy data frame, we probably want to combine them. We cna do this using `gather`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Warm-up\n", "\n", "The use of `gather` can be confusing, so we will start with a toy example." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "name <- c('ann', 'bob', 'charlie')\n", "jan <- c(102, 155, 211)\n", "feb <- c(112, 150, 211)\n", "mar <- c(123, 147, 213)\n", "apr <- c(130, 140, 210)\n", "\n", "wts <- tibble(name=name, jan=jan, feb=feb, mar=mar, apr=apr)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
namejanfebmarapr
ann 102 112 123 130
bob 155 150 147 140
charlie211 211 213 210
\n" ], "text/latex": [ "\\begin{tabular}{r|lllll}\n", " name & jan & feb & mar & apr\\\\\n", "\\hline\n", "\t ann & 102 & 112 & 123 & 130 \\\\\n", "\t bob & 155 & 150 & 147 & 140 \\\\\n", "\t charlie & 211 & 211 & 213 & 210 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "name | jan | feb | mar | apr | \n", "|---|---|---|\n", "| ann | 102 | 112 | 123 | 130 | \n", "| bob | 155 | 150 | 147 | 140 | \n", "| charlie | 211 | 211 | 213 | 210 | \n", "\n", "\n" ], "text/plain": [ " name jan feb mar apr\n", "1 ann 102 112 123 130\n", "2 bob 155 150 147 140\n", "3 charlie 211 211 213 210" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "wts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Messy data\n", "\n", "In the current form, it is not possible to subgroup by month or to plot by month easily." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `gather` to creat a new column called `weight` to store monthly weights" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
namemonthweight
ann jan 102
bob jan 155
charliejan 211
ann feb 112
bob feb 150
charliefeb 211
ann mar 123
bob mar 147
charliemar 213
ann apr 130
bob apr 140
charlieapr 210
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " name & month & weight\\\\\n", "\\hline\n", "\t ann & jan & 102 \\\\\n", "\t bob & jan & 155 \\\\\n", "\t charlie & jan & 211 \\\\\n", "\t ann & feb & 112 \\\\\n", "\t bob & feb & 150 \\\\\n", "\t charlie & feb & 211 \\\\\n", "\t ann & mar & 123 \\\\\n", "\t bob & mar & 147 \\\\\n", "\t charlie & mar & 213 \\\\\n", "\t ann & apr & 130 \\\\\n", "\t bob & apr & 140 \\\\\n", "\t charlie & apr & 210 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "name | month | weight | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| ann | jan | 102 | \n", "| bob | jan | 155 | \n", "| charlie | jan | 211 | \n", "| ann | feb | 112 | \n", "| bob | feb | 150 | \n", "| charlie | feb | 211 | \n", "| ann | mar | 123 | \n", "| bob | mar | 147 | \n", "| charlie | mar | 213 | \n", "| ann | apr | 130 | \n", "| bob | apr | 140 | \n", "| charlie | apr | 210 | \n", "\n", "\n" ], "text/plain": [ " name month weight\n", "1 ann jan 102 \n", "2 bob jan 155 \n", "3 charlie jan 211 \n", "4 ann feb 112 \n", "5 bob feb 150 \n", "6 charlie feb 211 \n", "7 ann mar 123 \n", "8 bob mar 147 \n", "9 charlie mar 213 \n", "10 ann apr 130 \n", "11 bob apr 140 \n", "12 charlie apr 210 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "wts %>% gather(key=month, value=weight, jan:apr)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "wts.1 <- wts %>% gather(key=month, value=weight, jan:apr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To get menth sorted correctly, we make it a factor" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
namemonthweight
ann jan 102
bob jan 155
charliejan 211
ann feb 112
bob feb 150
charliefeb 211
ann mar 123
bob mar 147
charliemar 213
ann apr 130
bob apr 140
charlieapr 210
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " name & month & weight\\\\\n", "\\hline\n", "\t ann & jan & 102 \\\\\n", "\t bob & jan & 155 \\\\\n", "\t charlie & jan & 211 \\\\\n", "\t ann & feb & 112 \\\\\n", "\t bob & feb & 150 \\\\\n", "\t charlie & feb & 211 \\\\\n", "\t ann & mar & 123 \\\\\n", "\t bob & mar & 147 \\\\\n", "\t charlie & mar & 213 \\\\\n", "\t ann & apr & 130 \\\\\n", "\t bob & apr & 140 \\\\\n", "\t charlie & apr & 210 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "name | month | weight | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| ann | jan | 102 | \n", "| bob | jan | 155 | \n", "| charlie | jan | 211 | \n", "| ann | feb | 112 | \n", "| bob | feb | 150 | \n", "| charlie | feb | 211 | \n", "| ann | mar | 123 | \n", "| bob | mar | 147 | \n", "| charlie | mar | 213 | \n", "| ann | apr | 130 | \n", "| bob | apr | 140 | \n", "| charlie | apr | 210 | \n", "\n", "\n" ], "text/plain": [ " name month weight\n", "1 ann jan 102 \n", "2 bob jan 155 \n", "3 charlie jan 211 \n", "4 ann feb 112 \n", "5 bob feb 150 \n", "6 charlie feb 211 \n", "7 ann mar 123 \n", "8 bob mar 147 \n", "9 charlie mar 213 \n", "10 ann apr 130 \n", "11 bob apr 140 \n", "12 charlie apr 210 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "wts.1 %>% mutate(month = factor(month, \n", " ordered = TRUE, \n", " levels = c('jan', 'feb', 'mar', 'apr')))" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "wts.2 <- wts.1 %>% mutate(month = factor(month, \n", " ordered = TRUE, \n", " levels = c('jan', 'feb', 'mar', 'apr')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now we can easily work wtih the tidy data set" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
monthminmaxmean
jan 102 211 156.0000
feb 112 211 157.6667
mar 123 213 161.0000
apr 130 210 160.0000
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " month & min & max & mean\\\\\n", "\\hline\n", "\t jan & 102 & 211 & 156.0000\\\\\n", "\t feb & 112 & 211 & 157.6667\\\\\n", "\t mar & 123 & 213 & 161.0000\\\\\n", "\t apr & 130 & 210 & 160.0000\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "month | min | max | mean | \n", "|---|---|---|---|\n", "| jan | 102 | 211 | 156.0000 | \n", "| feb | 112 | 211 | 157.6667 | \n", "| mar | 123 | 213 | 161.0000 | \n", "| apr | 130 | 210 | 160.0000 | \n", "\n", "\n" ], "text/plain": [ " month min max mean \n", "1 jan 102 211 156.0000\n", "2 feb 112 211 157.6667\n", "3 mar 123 213 161.0000\n", "4 apr 130 210 160.0000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "wts.2 %>% \n", "group_by(month) %>% \n", "summarize_at('weight', funs(min, max, mean)) " ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "options(repr.plot.width=6, repr.plot.height=3)" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtAAAAFoCAMAAABJ+DwrAAADAFBMVEUAAAAAujgBAQECAgID\nAwMEBAQFBQUGBgYHBwcICAgJCQkKCgoLCwsMDAwNDQ0ODg4PDw8QEBARERESEhITExMUFBQV\nFRUWFhYXFxcYGBgZGRkaGhobGxscHBwdHR0eHh4fHx8gICAhISEiIiIjIyMkJCQlJSUmJiYn\nJycoKCgpKSkqKiorKyssLCwtLS0uLi4vLy8wMDAxMTEyMjIzMzM0NDQ1NTU2NjY3Nzc4ODg5\nOTk6Ojo7Ozs8PDw9PT0+Pj4/Pz9AQEBBQUFCQkJDQ0NERERFRUVGRkZHR0dISEhJSUlKSkpL\nS0tMTExNTU1OTk5PT09QUFBRUVFSUlJTU1NUVFRVVVVWVlZXV1dYWFhZWVlaWlpbW1tcXFxd\nXV1eXl5fX19gYGBhYWFhnP9iYmJjY2NkZGRlZWVmZmZnZ2doaGhpaWlqampra2tsbGxtbW1u\nbm5vb29wcHBxcXFycnJzc3N0dHR1dXV3d3d4eHh5eXl6enp7e3t8fHx9fX1+fn5/f3+AgICB\ngYGCgoKDg4OEhISFhYWGhoaHh4eIiIiJiYmKioqLi4uMjIyOjo6Pj4+QkJCRkZGSkpKTk5OU\nlJSVlZWWlpaXl5eYmJiZmZmampqbm5ucnJydnZ2enp6fn5+goKChoaGioqKjo6OkpKSlpaWm\npqanp6epqamqqqqrq6usrKytra2urq6vr6+wsLCxsbGysrKzs7O0tLS1tbW2tra3t7e4uLi5\nubm6urq7u7u8vLy9vb2+vr6/v7/AwMDBwcHCwsLDw8PExMTFxcXGxsbHx8fIyMjJycnKysrL\ny8vMzMzNzc3Ozs7Pz8/Q0NDR0dHS0tLT09PU1NTV1dXW1tbX19fY2NjZ2dna2trb29vc3Nzd\n3d3e3t7f39/g4ODh4eHi4uLj4+Pk5OTl5eXm5ubn5+fo6Ojp6enq6urr6+vs7Ozt7e3u7u7v\n7+/w8PDx8fHy8vLz8/P09PT19fX29vb39/f4dm34+Pj5+fn6+vr7+/v8/Pz9/f3+/v7///9R\nfbnwAAAACXBIWXMAABJ0AAASdAHeZh94AAAeIklEQVR4nO3de2AU1b0H8Kkt6m1F9NpcHinB\nR0FrS+sjoNfWC1SrgrKQAAGMCOECFrRKr7dS4KoICNUKWi2IodTa+kCJoqUVapSAvCKEAgJh\nC0FQKsLGkEA2aR6bPXdmNjvJTmbC78ycs7Psfr9/7OzsfDn72/FjnCTLqjAESaIoXg+AICID\n0EhSBaCRpApAI0kVgEaSKgCNJFUAGkmqADSSVBEEuuLgmRJj5M+8noQcY2SvByFHDCpHEQQ6\n4D9TYoz8ideTkGOM7PUg5IhB5SgAnfgxRvZ6EHLEoHIUgE78GCN7PQg5YlA5CkAnfoyRvR6E\nHDGoHAWgEz/GyF4PQo4YVI4C0IkfY2SvByFHDCpHAejEjzGy14OQIwaVowB04scY2etByBGD\nylEAOvFjjOz1IOSIQeUoAO06K9Pb33cdY2RRC2YUtH1sb/qfRS0P0PGMMTJAx0QFvT59gaAn\nEIPKUQDadZIB9J4iFXTJQysFPYEYVI6ScKALh17e8/Z31ZP+1p2ZfZa2bEXFGFkg6NcH9Bq0\nyu8vHtu799jiln1RMUZ2vdKW3Ct7P7DHn7F0WK8++a3O9R+uHK5dcqjQt0++pteID4RNHP8k\nHOgBg15bPugm9STfvnbv9Es+NraiYowsEPR1L76e3Wvrvptufn35zT/eF90Xtb440KU3Zr/5\n3Hd/6c+4Kn913qV7Ws71dc+sawY96Pblb4/pXSJq4vgn0UDvm73W7190pXqSf+33f5i+1tiK\nijGyQNC/9/t3fW/u6xkf+v3rM16P7otaXxzol76tSl2Q58941O8vSl/fcq7nRL4pzChYmaE2\n9l71oqiJ459EA+3f+8YTE76nneTlfv9WDXTzVlSMkQWC1r6gjZ70VF9tr8+C6L6o9cWBfuzm\nyDbjVb9/c/r6lnP9ShT0b9Iz1KTPEzVx/JNooHfe0vfBF5+5MvKNiw66IPFBb1dvR937pA66\n75PRfVHriwP98K2RrXZOVdAx5zoC+ndXuH0OPWJQOUqigX7pst1+/8IzDPQy9RKj91OvZWz0\n+zdkLI/ui1pfHOj8Xjv8/sVDoqBjznUE9Pvpq/3+j3LWiJo4/kk00K+nL1q78KqL3z2TQF98\n7e/fHN57574Bt75ZcMuAfdF9UeuLA723b847y659IAo65lw3f1M4+sblb2Rdt1fUxPFPooH2\nz+j9ndFrRw6PQL60yNiKijGyMNCrR7zUr9dQ9V+5LWP0H9sZ+6JijOx6pQ9zLu/9wMdR0DHn\nuhn0zilX9cpxfbbFoHKUhAMtO8bIeC+HvIhB5SgAnfgxRvZ6EHLEoHIUgE78GCN7PQg5YlA5\nCkAnfoyRvR6EHDGoHAWgEz/GyF4PQo4YVI4C0IkfY2SvByFHDCpHAejEjzGy14OQIwaVowB0\n4scY2etByBGDylEAOvFjjOz1IOSIQeUogkAHyy1zIlxrfcCcxiZarzp8ktSrCNfZHDFGPmF9\nPBiuJD0D+aU1uH5pxsg2xxtDtGcIhqtIvQrySwvbHBCDylFEgQ5Y5gSrtT5gTmOY1qtmVaTe\nl6ze5ogxcoX18RpWSXqGCvJLY7ReNTtpc8QY2eZ4qIn2DEHiyatgdbQFG+wmEoPKUQDaFIAG\naAbQ7QSg4xqANgWgAZoBdDsB6LgGoE0BaIBmAN1OADquAWhTABqgGUC3E4COawDaFIAGaAbQ\n7QSg4xqANiUFQf83IZZ/EKBtA9AALSYAbQpAAzQD6HYiHDTVH6VHLloOBtC2AegzADS1B9AA\nHQBoQQFoUwDadQ+gAToA0IIC0KYAtOseQAN0AKAFBaBNAWjXvTMC9ImFY0fP+oSx0LKJ4xY1\ntGybA9C2Aei4hgr64am7/E/kVrD8vOKSSQuZsW0OQNsGoOMaIuhyX6n6VTl3dU3OBsa2ZVdG\nt9HjAG0bgI5riKCPv6peXtSN+Gupr5qxxiHbo1v1UPUv1ayus0w9C1kfMCfMaL1G1kDq1bMm\nmyPGS7I5HiI/Q7xeWrVpZIorao9c5Om1OsnxD8c3hXVPjD+5KVu7l1sY3ao3FZlqXpAynNQ0\neT0ANWYeFFfUHrnI0/M0ZNDh98dPr2Qbh2n3c9dEt9qBKjU2HxRXyYifkhay+5Q0U4KM+Nl2\nrN7miPGCbEauYbQPgDvB7D49z5RGu4+kM6WanbI5Yowc2aW4ovbIRZ7eGfHZdpUzJhSFGSv1\n1ajX0kNKotvoYVxD2wbX0HENEXT4wblBbRscUczYrqyK6DZ6HKBtA9BxDRH0jiFFO9QE2JIp\nB8qmPs2MbXMA2jYAHdcQQb/l07OKhfLHj1us/WKledscgLYNQMc1+NW3KQDtugfQAB0AaEEB\naFMA2nUPoAE6ANCCAtCmALTrHkADdACgBQWgTQFo1z2ABugAQAsKQJsC0K57AA3QAYAWFIA2\nBaBd9wAaoAMALSgAbQpAu+4BNEAHAFpQANoUgHbdA2iADgC0oAC0KQDtugfQAB0AaEEBaFMA\n2nUPoAE6ANCCAtCmALTrHkADdACgBUUQ6Lpay9SzkPUBc8KM1mtkDaRene0TGyPbPkO9y2cw\npYn40hrsXtop08gUV9QeucjTa3WS4x9BoGtPWqaaNVgfMKcpTOvVsRpS7xRrtDlijHzK+ng9\nC5KegfzSQozWq7M7iSeMkSP7FFfUHrnI02t1kuMfXHKYgksO171kuOQAaNsAdFwD0KYAtOse\nQAN0AKAFBaBNAWjXPYAG6ABACwpAmwLQrnsADdABgBYUgDYFoF33ABqgAwAtKABtCkC77gE0\nQAcAWlAA2hSAdt0DaIAOALSgALQpAO26B9AAHQBoQQFoUwDadQ+gAToA0IIC0KYAtOseQAN0\nAKAFBaBNAWjXPYAG6ABAC0qqg/4KJVbrALRt7wwB3ZirfdzCGz41WYyFlk0ct6jBOAjQtgHo\nuIYKun7nkz4N9DOzS0pKtjOWn1dcMmmhcRigbQPQcQ0VdMH4MTroae/ouzU5Gxjbll0ZPQzQ\ntgHouIZ+ybFfB507J+/O2UdYqa9avQgZon6pZvXvqSl1+XlZnn0UGAm01Tr4KDDb3hnyUWA6\n6Crf3N07Z+YFN2VrD+UWqjcVmWpekDWfvDTptyTQ3g5aZ9qnuKL2yEWenqfhBB0qDzNWPbxo\n4zDtodw16k3tH9RsrbZMDWu0PmBOE6P16lkdqRdkIZsjxusJ6rsk0NSinJdWZYwc2ae4ovbI\nRZ5eq5NsmYFZn916Xtd7tNf1ynUXnn/NUvXO4Kxtt1yYubLhwZ6d7jii7h8ceXGnfn/hgBwN\n7yWHnvtWlPpqVN1DSqKPJP81NLXX9qXhGtqUgT/st+KTxWdNUL8zU66fP+37yhsq6Ctu2ryn\n3zl9Hzuw/OwcxnZ0Sv/lY73P+h1ZpxFO0B/dr97U5mwJjihmbFdWRfQgQAO00TstaOU97bYH\nY9nd69VLqk4/VUF/7RBjRcpI9cDQDMb69/iSsYYB559qfyWLcIIOjpv19z2z7g+xJVMOlE19\n2jgI0K7hpxDoi7TbiWmMlWtfEQPnjVFB91LvlSra92KT01iF8rhWWaEUknlGw3vJceiRUWMX\nqt93h/LHj1ucSr9YAWhxoK/Wbielaaj++L/9z1U00Npj+5SX1dspaWyz0pzXyDyjwa++AVp0\n77Sg+2i3Guhnz75ozJLtGW1AlyjTi/Qc5ZYI0AAtukcGXX3uuJB6r3Mb0FXKTK3yeRH//60F\noAFadI8M+mPlOfXOaiXXDJrdnHacsaZbuoa4JQI0QIvukUHXd+/26B/u69K984tm0Ns7dpv5\nyLXKn/glAjRAi+7RQE/uxdiun3Tqceehzf0mtQZ9n/YDD3929wt+tMqBRIAGaNG9M+TNSe0G\noGX0vmIaObIr0h9A2wSgZfQAmj8A7QFUricGaK4AtAdQuZ4YoLkC0B5A5XpigOYKQHsAleuJ\nAZorAO0BVK4nTjrQoVrbNLT7B0kBaA+gcj1x0oGu/5dt+N+60SYA7QFUricGaK4AtAdQuZ4Y\noLkC0B5A5XpigOZKkoC2P0ctMZ1rgOaFSu0lEOgxpZHt+p9xLgPQMnoAzZ9WoMvLy5W3y7Uc\nn/ENzmUAWkYPoPnTCrTSKjdxLgPQMnoAzZ9WoBcsWKDcu0DPbw5zLgPQMnoAzZ/Ya+gBOxwu\nU3PCMidZnfUBc0JhWq+GVVs+TgEdaRojV+m7JFfUouieXjxhfJoPi7wCiitqj1zk6bU6yVY5\nI37KUWedehayOWJK2G4FUxpZg+XjFNCRpmlkkitqUXRPL9ZVm0amuKL2yEWeXp354yVjc1rQ\n+2678Pz+OxnrsDnn0m+/0bIlJRZ05aTLukZC/OPR4JJDRi9FLzkyBxR+0O86FXL/stCj/1Zr\nbEmJBT1BuXpsnh4i5GgAWkYvNUGHnyxj7OU0FfQixj5VDhhbUmJBdx4ephKOjSTQFKf/Amjn\n/hIRNGtY/1xeZw30B4yVa6Cbt6TEgk5z8PmlegBaRi81QQdv6PnQ28s00BuaQW9wDPqOqVTB\npgC0jF5qgl51nvpN5VIxoMsy8vk/fEkLQMvopSbodcorZUu7n7vTLeg+WtKVjr31OzyaGUDL\n6aUmaDaryzdHHMgaHIH89YPGlpQW0ANjwoFZC0DL6KUoaFdJ7LePAjRAcwagPYAK0PEC3bE5\n3/zuj+dz/V10gJbRA2j+xIJ+Kl25cnjOD5QBM0d/oxPxKlwPQMvoATR/YkEvOVf/X3m/d+Fa\n9sUlQzmWAWgZPYDmTyzoPvdGtlNvYuy5izmWAWgZPYDmTyzoTnMi23kXMPby1zmW4QQt0ilA\nu/CX7KD/6xr9raz1mTcwNuE7HMsAtIxecoIOh2zTxEHOJrGg//rVvq/t3Ln8P5WC2slnzedY\nBqBl9JITtNyYfg79Wob2V2Q7L2Pl50yu51gGoGX0khN0Y9A2PORsYv7FSt26Zc8XnmIsVG1Z\ntwtAy+glJ+h6+9d7xv6mEKAB2iqCP5eja+Qdd3F4tx1AA7RVxILu+n0Wr3fbATRAWwWXHADt\nzl/ygz5V+NrRWsu/tdKYe1K9DS2bOG5RQ8u2OQAtowfQ/DGBzj9fUYqKur3cdoqdT/o00Pl5\nxSWTFrZsmwPQMnoAzZ9Y0KvOGlCgFH3+E+Uv5l7B+DEa6JqcDYxty66MbqOHAVpGLzVBbzF9\nkTXv84C+8fuNTCliTdf+V9vmfg10qa9avfgYsj26jR4MiocK0ADNrPbbT2z5/NlMA80eubBt\nUwe9KVu7m1sY3WowMtW8EOlQXInucS3YKpH3DZBcUYuie3qRmT8pjuKK2iMXeXqnSTxB95gR\nAT0jo21TB71xmHY3d010q95UjVGzolEPxZXoHteCjY3G62nQd0muqEXRPb3YWGOMHHkFFFfU\nHrnI02t1kq1yetDrMjv1U//jfzy3S7e7jrfsOwCd860KDfSxbtk2oEt96ukNDSmJbqMHcckh\no5eqlxyXv73u9k7l4b7Xry26vk84uu8E9MHze8xTps9I6/gPG9DBEcWM7cqqiG4BWmYvVUGv\nZKym84K1HT5l7PDX1kX3nYBmO/pp77a72erruw6aLZlyoGzq0y1bgJbYS1XQ2gemD5u4uKe2\nd9mS6L4j0Ix9ubmkyrIZAR3KHz9ucUPLFqAl9lIVtPbT4Oz7fttL2+v52+i+E9C9p73v7C2p\nAC2jl6qg31IvMbo8X9jhCGOfdfgguk+TGAv6EkXpOOR5ns8vaA5Ay+ilKOhzL1m5aXC36vC1\nP9q44YeZ4eg+TaLpkuPwnyZdrihX/M9qIuRoAFpGLzVB7/atuvqCWw8wdmx0ly65x1v2SbH4\nofXR312h8L4JD6Bl9FITtLuY5Vb8edr1HZSOt3EuA9AyegDNn1jQD1z1VaXTHb8ubv9XPRYB\naBk9gOZPLGhF6f2qo4/wB2gZPYDmTyzoaTeco3Qe/sw2btQALaMH0PwxX0PXrps36AKl462c\nywC0jB5A88fixxl7nsVPOeRCBWibiAdd+vzIzspZ10xfy7kMQMvoATR/YkF3VZTOd/3pGP8y\nAC2jl5ygWdg2/PDaJBZ0//klzhYFaBm9JAUtNWI/l0O0P0oPoF348+SSw/5PJ9wHzYj2R+kB\ntAt/AG0TgJbRA2j+ALQHUAHaJgDNuyBAu4BK7QG0qx5Au/AH0DYBaBk9gOYPQHsAFaBtAtC8\nCwK0C6jUniDQHTa0/bMhZdtpJAK0B1AB2iYU0IeUJe2tLgh0feR/nEhxJbrHtWCo5Z3ejfou\nyRW1KLqnF0Mt/5Qjr4DiitojF3l6ofbfTu8cdP1BFfSJ6VvaW13UV+hyPRRXontcC5a3fELa\nCX2X5IpaFN37imnkyC7FFbVHLvL0ytv/GLrTgj42Mq3r1HrWoeCOCy57nbF9t114fv+dKvA/\npw3WLjlU6JWTe3Ty7bdcHZcchOCSI56XHE0/GLTpj//xMOvQfcXusV+vZ5kDCj/od50K+vJl\nnzSDHtC/qHh0txNWqwO0B1AB2iY66FUdValLxrIO8xk7qBwKP1nG2MtpKuinIt8Udtiw5ewK\n9W73d6xWB2gPoAK0TXTQT1wf6XV4j7EvlEOsYf1zeZ010H+Lgn5R6aBGWWi1OkB7ABWgbaKD\nnvujSE/7plAFHbyh50NvL0uL7EdAv3mR/eoA7QFUgLaJDnrFBacYe/WWKOhV59UxtjQGtF/5\nWD0ZQ/dYrQ7QHkAFaJvooBt6Dt361iU/j4Jep7xStrT7uTtbgWbDfvDB+oGXW34cEkB7ABWg\nbRL5sd2nQ/+9289ro6DZrC7fHHEga3Br0MF7u18w1PpDcgHaA6gAbRO8l4N3QYB2AZXaA2hX\nPYB24Q+gbQLQMnoAzR+A9gAqQNsEoHkXBGgXUKk9gHbVA2gX/gDaJgAtowfQ/AFoD6CmNmi5\nAWgPoAK0vAC0B1ABWl4A2gOoAC0vAO0BVICWF4D2ACpAywtAewAVoOUFoD2ACtDywgv6DZ+a\nLMZCyyaOW9RgPAzQMnoAzR9e0M/MLikp2c5Yfl5xyaSWv3YL0DJ6AM0fXtDTIh+GUJOzgbFt\n2ZXRhwFaRg+g+cMLOndO3p2zj7BSXzVjjUPUL9UsOE/N2lo9FFeie1wL1ra8WyCyS3JFLYru\n6cXaU6aRKa6oPXKRp1cr4C0ZjsMJuso3d/fOmXnBTdnaXm6helORqeaFyGGKK9E9rgVbpUm/\nJbmiFkX39CKrMw1OcUXtkYs8PU/DCTpUHmasenjRxmHaXu4a9abpiJrjFXoorkT3uBasqDBe\nSqW+S3JFLYru6cWKL42RI6+A4oraIxd5eq1Ocvzj6Md2960o9dWouoeURB/BNbSMHq6h+cMJ\n+qP7T6qXcjlbgiOKGduVZfyrCNAyegDNH07QwXGz/r5n1v0htmTKgbKpT7c8DtASegDNH95L\njkOPjBq78IR6uZE/ftxi/GJFbg+g+YNffXsAFaDlBaA9gArQ8gLQHkAFaHkBaA+gArS8ALQH\nUAFaXgDaA6gALS8A7QFUgJYXgPYAKkDLC0B7ABWg5QWgPYAK0PIC0B5ABWh5AWgPoAK0vAC0\nB1ABWl4A2gOoAC0vAO0BVICWF4D2ACpAywtAewAVoOUFoD2ACtDyAtAeQAVoeQFoD6ACtLwA\ntAdQAVpeANoDqAAtL4JA11brobgS3eNasLraGDmo75JcUYuie3qxusoYOfIKKK6oPXKRp9fq\nJMc/okCf0kNxJbrHteCplo/yrNZ3Sa6oRdE9vXjK+MxiFnkFFFfUHrnI02t1kuMfXHIQgkuO\nlLvkAGgZPYDmD0B7ABWg5QWgPYAK0PIC0B5ABWh5AWgPoAK0vAC0B1ABWl4A2gOoAC0vAO0B\nVICWF4D2ACpAywtAewAVoOUFoD2ACtDyAtAeQAVoeQFoD6ACtLwAtAdQAVpeANoDqAAtLwDt\nAVSAlheA9gAqQMsLQHsAFaDlBaA9gArQ8gLQHkAFaHkBaA+gArS8ALQHUAFaXpyCDi2bOG5R\ng7EL0DJ6AM0fp6Dz84pLJi00dgFaRg+g+eMQdE3OBsa2ZRsf6gPQMnoAzR+HoEt91Yw1Dtmu\n3m0oVlNWqYfiSnSPa8HKls/VOqnvklxRi6J7erHyhDFy5BVQXFF75CJPr9VJjn8cgt6Urd3m\nFqo3FZlqXog8THEluse1YKs06bckV9Si6J5eZHWmwSmuqD1ykafnaRyC3jhMu81do97UPKtm\nY41l6lij9QFzmhit18DqSb1aFrI5YrwEm+ONrM7lM5ji+qWdPN3I4TDtGRpdnzxTbF+aK5Lu\n4viSQx06NKQkut98DW3OCVZrfcCcxjCtV82qSL0vWb3NEeMlVFgfr2GVpGeoIL80RutVs5M2\nR4yRbY6HmmjPECSevApWR1uwwW4i5x5dxyHo4IhixnZlVRj71i8MoAE6znH6Y7slUw6UTX3a\n2AVo2wB0XOP4Fyv548ctbvOLFXMAGqDjHLG/+jYHoAE6zgFoUwAaoBlAtxOAjmsA2hSABmgG\n0O0EoOMagDYFoAGa2YI+dvhz2qk58imtd/TwMVrR9omNkW1AU5/huOiX9sXhL2yOGCO7fAbq\nSzt2+J+0BY8ctjkgBpWjCAJtE3/mfFpx1I203kuZ75N65Zm/oC3YJr/N3ErqHcycTVvw7uto\nvVczV9OKbTL0J7TeC5kbSb0jmf9HW3BCZhOtGMcAtCkADdD2AWiAjnMA2hSABmj7VBZspxUL\n36b1/AX/JPVqCzbTFmyT3QXHSb2TBSWnL2n54C1a70DBZ7Rim6xZRevtLfiC1Ksu+Ii24LqC\nMK0Yx8gFjSBxDkAjSRWARpIqHoPePWFG9K7f5+Ug9Jx5I595E7uIHNDHfO/Sir963Pgr7x6f\n6yQeOXEmjkPkgK5+yU8rznrVuMt9rhtJ37PTWkk9svyJySPLj6RLjqy97Mis0SNnfKLe3ffE\npHs2WNdm+nwPseCiCaPmfq6e690PjppeZlUbXvLY3TPL8+8ds4q1LPvRXXPaNpuP7r9r70O5\nMz+1a6XWyMIndjOy9MgD/eDMHbum/0K9O+No0yvD66172hePmTM+9v96bLXfN6V49+xRJy1a\nwx8OnrpvRHH4razalmWnvHesbbP56P5h047Vv3RXjU0rxUYWPbGbkaVHGuhwwVHG1t6l3v0L\nYwHfUeueeq79WacYaxpf7PdtYax+jNWvIYarX3qWqee3wne0Zdk3LYrRo/t9JerOPausW6k2\nsuCJXY0sPfK+Qod2r3pmjHZSdjF2sh0dhb4sNb6Vfu3j8tj8Zy1awz9m7GX1P2pBdRlj2b9b\nrdd8dL9P+xr0xPM2rRQbWfTEbkaWHmmg66b99PfF792ln/V2dWzKjdz1+4Lq7fznLVqtznXM\nsm0SPbrfV6XuzVts3Uq1kQVP7Gpk6ZEGemtOA2NrCDqO+A4xVvX4p37fZvW/hnf/1aLV6lzH\nLNsm0aP7fZu0/7K+w6cjWUcWPLGrkaVHGujdvqKja8YP++S0Otj8B3btnjUl5B82cUvpnLG1\nFq1W5zpm2TaJHt3vm1C8b25ukE9Hso4seGJXI0uPvGvoV+/OfeLovDmRUz3c5ueU2rmuWzx+\n1ONfsMNzt/581KOW/0iMc60u03rZtmk+ut+39YGR+g+UeHQk68iiJ3YzsvRIAj3U25e438f/\nRl2MzBsnI0uPHNBlvn9IWZcaB6caI3MndUDv8M1slLEuOfynGiPzJ3VAh7z8CHdnwchJErwf\nGkmqADSSVAFoJKkC0EhSBaCRpApAI0kVgJaagX28niDVAtCyskApB+j4B6BlBaA9CUDLCkB7\nEoB2lMFZ2265MHNlw4M9O91xRN3fOqhL10Hb1DsDsz679byu91SxAYqijFFBHxyc1nVi5enW\nQ0QFoB1l8BU3bd7T75y+jx1YfnYOY387u8f0GRef/TcV9A/7rfhk8VkT2I57lbdL2cD07vcv\nHaZM8nre1AlAO8rgrx1irEgZqd4dmsGaen8rwFh5+lVhNlB5T31sYI/oJYeSz1j4qsu8nTaV\nAtCOMriXelOqvKDeTk5jZcrj2oNzlENs4EXavYlpUdAdQ+ru2K4ejppiAWhHGXy1erNPeVm9\nnZLG1igrtQffVArZQO0Am2SA7q3t5gF03ALQjhILenUE9EpldfOPNVpA67sAHb8AtKPEgt6v\n/Ep7cJ5yEKC9DkA7SizopiszKhj7svt3m2JAHwfo+AegHSUWNHu3w2WPPnKp/mM7A/RvlBkf\nAnTcA9CO0gr0fdoPPIpv69JloP6LFV3wZPWxQz/+xs8AOu4BaCSpAtBIUgWgkaQKQCNJFYBG\nkioAjSRVABpJqgA0klQBaCSp8v90QAgsMZexuQAAAABJRU5ErkJggg==", "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ggplot(wts.2, aes(x=month, y=weight, fill=name)) + \n", "geom_bar(stat='identity') + \n", "facet_wrap('name')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `gather` on experiment metadata" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labeli7 indexi5 index
2_MA_C ATTACTCGAGGCTATA
9_MA_C ATTACTCGGCCTCTAT
10_MA_C ATTACTCGAGGATAGG
14_MA_C ATTACTCGTCAGAGCC
15_MA_C ATTACTCGCTTCGCCT
21_MA_C ATTACTCGTAAGATTA
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & i7 index & i5 index\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & ATTACTCG & AGGCTATA \\\\\n", "\t 9\\_MA\\_C & ATTACTCG & GCCTCTAT \\\\\n", "\t 10\\_MA\\_C & ATTACTCG & AGGATAGG \\\\\n", "\t 14\\_MA\\_C & ATTACTCG & TCAGAGCC \\\\\n", "\t 15\\_MA\\_C & ATTACTCG & CTTCGCCT \\\\\n", "\t 21\\_MA\\_C & ATTACTCG & TAAGATTA \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | i7 index | i5 index | \n", "|---|---|---|---|---|---|\n", "| 2_MA_C | ATTACTCG | AGGCTATA | \n", "| 9_MA_C | ATTACTCG | GCCTCTAT | \n", "| 10_MA_C | ATTACTCG | AGGATAGG | \n", "| 14_MA_C | ATTACTCG | TCAGAGCC | \n", "| 15_MA_C | ATTACTCG | CTTCGCCT | \n", "| 21_MA_C | ATTACTCG | TAAGATTA | \n", "\n", "\n" ], "text/plain": [ " Label i7 index i5 index\n", "1 2_MA_C ATTACTCG AGGCTATA\n", "2 9_MA_C ATTACTCG GCCTCTAT\n", "3 10_MA_C ATTACTCG AGGATAGG\n", "4 14_MA_C ATTACTCG TCAGAGCC\n", "5 15_MA_C ATTACTCG CTTCGCCT\n", "6 21_MA_C ATTACTCG TAAGATTA" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, ends_with('index')) %>% \n", "head()" ] }, { "cell_type": "code", "execution_count": 237, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelindexsequence
2_MA_C i7 indexATTACTCG
9_MA_C i7 indexATTACTCG
10_MA_C i7 indexATTACTCG
14_MA_C i7 indexATTACTCG
15_MA_C i7 indexATTACTCG
21_MA_C i7 indexATTACTCG
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & index & sequence\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 9\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 10\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 14\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 15\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 21\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | index | sequence | \n", "|---|---|---|---|---|---|\n", "| 2_MA_C | i7 index | ATTACTCG | \n", "| 9_MA_C | i7 index | ATTACTCG | \n", "| 10_MA_C | i7 index | ATTACTCG | \n", "| 14_MA_C | i7 index | ATTACTCG | \n", "| 15_MA_C | i7 index | ATTACTCG | \n", "| 21_MA_C | i7 index | ATTACTCG | \n", "\n", "\n" ], "text/plain": [ " Label index sequence\n", "1 2_MA_C i7 index ATTACTCG\n", "2 9_MA_C i7 index ATTACTCG\n", "3 10_MA_C i7 index ATTACTCG\n", "4 14_MA_C i7 index ATTACTCG\n", "5 15_MA_C i7 index ATTACTCG\n", "6 21_MA_C i7 index ATTACTCG" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, ends_with('index')) %>% \n", "gather(index, sequence, -Label) %>% \n", "head()" ] }, { "cell_type": "code", "execution_count": 239, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelindexsequence
2_MA_C i7 ATTACTCG
9_MA_C i7 ATTACTCG
10_MA_C i7 ATTACTCG
14_MA_C i7 ATTACTCG
15_MA_C i7 ATTACTCG
21_MA_C i7 ATTACTCG
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & index & sequence\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & i7 & ATTACTCG \\\\\n", "\t 9\\_MA\\_C & i7 & ATTACTCG \\\\\n", "\t 10\\_MA\\_C & i7 & ATTACTCG \\\\\n", "\t 14\\_MA\\_C & i7 & ATTACTCG \\\\\n", "\t 15\\_MA\\_C & i7 & ATTACTCG \\\\\n", "\t 21\\_MA\\_C & i7 & ATTACTCG \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | index | sequence | \n", "|---|---|---|---|---|---|\n", "| 2_MA_C | i7 | ATTACTCG | \n", "| 9_MA_C | i7 | ATTACTCG | \n", "| 10_MA_C | i7 | ATTACTCG | \n", "| 14_MA_C | i7 | ATTACTCG | \n", "| 15_MA_C | i7 | ATTACTCG | \n", "| 21_MA_C | i7 | ATTACTCG | \n", "\n", "\n" ], "text/plain": [ " Label index sequence\n", "1 2_MA_C i7 ATTACTCG\n", "2 9_MA_C i7 ATTACTCG\n", "3 10_MA_C i7 ATTACTCG\n", "4 14_MA_C i7 ATTACTCG\n", "5 15_MA_C i7 ATTACTCG\n", "6 21_MA_C i7 ATTACTCG" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, ends_with('index')) %>% \n", "gather(index, sequence, -Label) %>% \n", "mutate(index=str_extract(index, '^..')) %>%\n", "head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Use `spread` to convert one column into multiple\n", "\n", "We can undo `gather` with `spread`. This is less commonly used than `gather`." ] }, { "cell_type": "code", "execution_count": 240, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelindexsequence
2_MA_C i7 indexATTACTCG
9_MA_C i7 indexATTACTCG
10_MA_C i7 indexATTACTCG
14_MA_C i7 indexATTACTCG
15_MA_C i7 indexATTACTCG
21_MA_C i7 indexATTACTCG
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & index & sequence\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 9\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 10\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 14\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 15\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\t 21\\_MA\\_C & i7 index & ATTACTCG \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | index | sequence | \n", "|---|---|---|---|---|---|\n", "| 2_MA_C | i7 index | ATTACTCG | \n", "| 9_MA_C | i7 index | ATTACTCG | \n", "| 10_MA_C | i7 index | ATTACTCG | \n", "| 14_MA_C | i7 index | ATTACTCG | \n", "| 15_MA_C | i7 index | ATTACTCG | \n", "| 21_MA_C | i7 index | ATTACTCG | \n", "\n", "\n" ], "text/plain": [ " Label index sequence\n", "1 2_MA_C i7 index ATTACTCG\n", "2 9_MA_C i7 index ATTACTCG\n", "3 10_MA_C i7 index ATTACTCG\n", "4 14_MA_C i7 index ATTACTCG\n", "5 15_MA_C i7 index ATTACTCG\n", "6 21_MA_C i7 index ATTACTCG" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, ends_with('index')) %>% \n", "gather(index, sequence, -Label) %>% \n", "head()" ] }, { "cell_type": "code", "execution_count": 241, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labeli5 indexi7 index
1_MA_J AGGCTATACGCTCATT
1_RZ_J AGGCTATAGAGATTCC
10_MA_C AGGATAGGATTACTCG
10_RZ_C AGGATAGGTCCGGAGA
11_MA_J AGGATAGGCGCTCATT
11_RZ_J AGGATAGGGAGATTCC
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & i5 index & i7 index\\\\\n", "\\hline\n", "\t 1\\_MA\\_J & AGGCTATA & CGCTCATT \\\\\n", "\t 1\\_RZ\\_J & AGGCTATA & GAGATTCC \\\\\n", "\t 10\\_MA\\_C & AGGATAGG & ATTACTCG \\\\\n", "\t 10\\_RZ\\_C & AGGATAGG & TCCGGAGA \\\\\n", "\t 11\\_MA\\_J & AGGATAGG & CGCTCATT \\\\\n", "\t 11\\_RZ\\_J & AGGATAGG & GAGATTCC \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | i5 index | i7 index | \n", "|---|---|---|---|---|---|\n", "| 1_MA_J | AGGCTATA | CGCTCATT | \n", "| 1_RZ_J | AGGCTATA | GAGATTCC | \n", "| 10_MA_C | AGGATAGG | ATTACTCG | \n", "| 10_RZ_C | AGGATAGG | TCCGGAGA | \n", "| 11_MA_J | AGGATAGG | CGCTCATT | \n", "| 11_RZ_J | AGGATAGG | GAGATTCC | \n", "\n", "\n" ], "text/plain": [ " Label i5 index i7 index\n", "1 1_MA_J AGGCTATA CGCTCATT\n", "2 1_RZ_J AGGCTATA GAGATTCC\n", "3 10_MA_C AGGATAGG ATTACTCG\n", "4 10_RZ_C AGGATAGG TCCGGAGA\n", "5 11_MA_J AGGATAGG CGCTCATT\n", "6 11_RZ_J AGGATAGG GAGATTCC" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, ends_with('index')) %>% \n", "gather(index, sequence, -Label) %>% \n", "spread(index, sequence) %>%\n", "head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Use `separate` to split a single column containing multiple values\n", "\n", "The label column actually contains three pieces of informaiton separated by underscores - a sample number, the RNA enrichment methd, and code for the person presenting the sample. Assuuming that we do not have this informatio nduplicate in other columns, it woujld be tricky to compare, say, results by the sampling method.\n", "\n", "The `separate` function is designed to address this problem." ] }, { "cell_type": "code", "execution_count": 246, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelconcentration_fold_difference
2_MA_C 1.34
9_MA_C 2.23
10_MA_C4.37
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " Label & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & 1.34 \\\\\n", "\t 9\\_MA\\_C & 2.23 \\\\\n", "\t 10\\_MA\\_C & 4.37 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | concentration_fold_difference | \n", "|---|---|---|\n", "| 2_MA_C | 1.34 | \n", "| 9_MA_C | 2.23 | \n", "| 10_MA_C | 4.37 | \n", "\n", "\n" ], "text/plain": [ " Label concentration_fold_difference\n", "1 2_MA_C 1.34 \n", "2 9_MA_C 2.23 \n", "3 10_MA_C 4.37 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, concentration_fold_difference) %>% \n", "head(3)" ] }, { "cell_type": "code", "execution_count": 249, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
sample_numenrichmentpersonconcentration_fold_difference
2 MA C 1.34
9 MA C 2.23
10 MA C 4.37
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " sample\\_num & enrichment & person & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t 2 & MA & C & 1.34\\\\\n", "\t 9 & MA & C & 2.23\\\\\n", "\t 10 & MA & C & 4.37\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "sample_num | enrichment | person | concentration_fold_difference | \n", "|---|---|---|\n", "| 2 | MA | C | 1.34 | \n", "| 9 | MA | C | 2.23 | \n", "| 10 | MA | C | 4.37 | \n", "\n", "\n" ], "text/plain": [ " sample_num enrichment person concentration_fold_difference\n", "1 2 MA C 1.34 \n", "2 9 MA C 2.23 \n", "3 10 MA C 4.37 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, concentration_fold_difference) %>% \n", "separate(Label, sep='_', into=c('sample_num', 'enrichment', 'person')) %>%\n", "head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### It is now easy to condition on sample_number, enrichment method or person\n", "\n", "Here is a summary of the data." ] }, { "cell_type": "code", "execution_count": 278, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
personenrichmentnconcentration_fold_difference
C MA 24 2.3675
C RZ 24 2.3675
C TOT 3 1.3400
J MA 24 3.4350
J RZ 24 3.4350
J TOT 3 1.9800
P MA 24 3.0800
P RZ 24 3.0800
P TOT 3 2.0500
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " person & enrichment & n & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t C & MA & 24 & 2.3675\\\\\n", "\t C & RZ & 24 & 2.3675\\\\\n", "\t C & TOT & 3 & 1.3400\\\\\n", "\t J & MA & 24 & 3.4350\\\\\n", "\t J & RZ & 24 & 3.4350\\\\\n", "\t J & TOT & 3 & 1.9800\\\\\n", "\t P & MA & 24 & 3.0800\\\\\n", "\t P & RZ & 24 & 3.0800\\\\\n", "\t P & TOT & 3 & 2.0500\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "person | enrichment | n | concentration_fold_difference | \n", "|---|---|---|---|---|---|---|---|---|\n", "| C | MA | 24 | 2.3675 | \n", "| C | RZ | 24 | 2.3675 | \n", "| C | TOT | 3 | 1.3400 | \n", "| J | MA | 24 | 3.4350 | \n", "| J | RZ | 24 | 3.4350 | \n", "| J | TOT | 3 | 1.9800 | \n", "| P | MA | 24 | 3.0800 | \n", "| P | RZ | 24 | 3.0800 | \n", "| P | TOT | 3 | 2.0500 | \n", "\n", "\n" ], "text/plain": [ " person enrichment n concentration_fold_difference\n", "1 C MA 24 2.3675 \n", "2 C RZ 24 2.3675 \n", "3 C TOT 3 1.3400 \n", "4 J MA 24 3.4350 \n", "5 J RZ 24 3.4350 \n", "6 J TOT 3 1.9800 \n", "7 P MA 24 3.0800 \n", "8 P RZ 24 3.0800 \n", "9 P TOT 3 2.0500 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, concentration_fold_difference) %>% \n", "separate(Label, into=c('sample_num', 'enrichment', 'person'), sep='_') %>%\n", "add_count(enrichment) %>%\n", "group_by(person, enrichment, n) %>%\n", "summarize_at('concentration_fold_difference', mean) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Use `unite` to craate a single variable from multiple columns\n", "\n", "Sometimes we want to do the opposite and combine multiple columns into a single column. Use `unite` to do this." ] }, { "cell_type": "code", "execution_count": 276, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
sample_numenrichmentpersonconcentration_fold_difference
2 MA C 1.34
9 MA C 2.23
10 MA C 4.37
14 MA C 1.57
15 MA C 2.85
21 MA C 1.81
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " sample\\_num & enrichment & person & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t 2 & MA & C & 1.34\\\\\n", "\t 9 & MA & C & 2.23\\\\\n", "\t 10 & MA & C & 4.37\\\\\n", "\t 14 & MA & C & 1.57\\\\\n", "\t 15 & MA & C & 2.85\\\\\n", "\t 21 & MA & C & 1.81\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "sample_num | enrichment | person | concentration_fold_difference | \n", "|---|---|---|---|---|---|\n", "| 2 | MA | C | 1.34 | \n", "| 9 | MA | C | 2.23 | \n", "| 10 | MA | C | 4.37 | \n", "| 14 | MA | C | 1.57 | \n", "| 15 | MA | C | 2.85 | \n", "| 21 | MA | C | 1.81 | \n", "\n", "\n" ], "text/plain": [ " sample_num enrichment person concentration_fold_difference\n", "1 2 MA C 1.34 \n", "2 9 MA C 2.23 \n", "3 10 MA C 4.37 \n", "4 14 MA C 1.57 \n", "5 15 MA C 2.85 \n", "6 21 MA C 1.81 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, concentration_fold_difference) %>% \n", "separate(Label, sep='_', into=c('sample_num', 'enrichment', 'person')) %>%\n", "head" ] }, { "cell_type": "code", "execution_count": 277, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelconcentration_fold_difference
2_MA_C 1.34
9_MA_C 2.23
10_MA_C4.37
14_MA_C1.57
15_MA_C2.85
21_MA_C1.81
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " Label & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & 1.34 \\\\\n", "\t 9\\_MA\\_C & 2.23 \\\\\n", "\t 10\\_MA\\_C & 4.37 \\\\\n", "\t 14\\_MA\\_C & 1.57 \\\\\n", "\t 15\\_MA\\_C & 2.85 \\\\\n", "\t 21\\_MA\\_C & 1.81 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | concentration_fold_difference | \n", "|---|---|---|---|---|---|\n", "| 2_MA_C | 1.34 | \n", "| 9_MA_C | 2.23 | \n", "| 10_MA_C | 4.37 | \n", "| 14_MA_C | 1.57 | \n", "| 15_MA_C | 2.85 | \n", "| 21_MA_C | 1.81 | \n", "\n", "\n" ], "text/plain": [ " Label concentration_fold_difference\n", "1 2_MA_C 1.34 \n", "2 9_MA_C 2.23 \n", "3 10_MA_C 4.37 \n", "4 14_MA_C 1.57 \n", "5 15_MA_C 2.85 \n", "6 21_MA_C 1.81 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "select(Label, concentration_fold_difference) %>% \n", "separate(Label, sep='_', into=c('sample_num', 'enrichment', 'person')) %>%\n", "unite(Label, c('sample_num', 'enrichment', 'person'), sep='_') %>% \n", "head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Extract parts with regular expressions\n", "\n", "Sometiems there is no simple delimiter between parts of a value. In such cases, we may need to use reuglar expressions to extract parts from a string." ] }, { "cell_type": "code", "execution_count": 293, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelconcentration_fold_difference
2MAC 1.34
9MAC 2.23
10MAC4.37
14MAC1.57
15MAC2.85
21MAC1.81
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " Label & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t 2MAC & 1.34 \\\\\n", "\t 9MAC & 2.23 \\\\\n", "\t 10MAC & 4.37 \\\\\n", "\t 14MAC & 1.57 \\\\\n", "\t 15MAC & 2.85 \\\\\n", "\t 21MAC & 1.81 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | concentration_fold_difference | \n", "|---|---|---|---|---|---|\n", "| 2MAC | 1.34 | \n", "| 9MAC | 2.23 | \n", "| 10MAC | 4.37 | \n", "| 14MAC | 1.57 | \n", "| 15MAC | 2.85 | \n", "| 21MAC | 1.81 | \n", "\n", "\n" ], "text/plain": [ " Label concentration_fold_difference\n", "1 2MAC 1.34 \n", "2 9MAC 2.23 \n", "3 10MAC 4.37 \n", "4 14MAC 1.57 \n", "5 15MAC 2.85 \n", "6 21MAC 1.81 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bad_labels <- df %>% \n", "select(Label, concentration_fold_difference) %>% \n", "separate(Label, \n", " sep='_', \n", " into=c('sample_num', 'enrichment', 'person')) %>%\n", "unite(Label, c('sample_num', 'enrichment', 'person'), sep='')\n", "head(bad_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The regular expression consists of 3 capture groups in parentheses that will form the new columns.\n", "\n", "`'([0-9]+)(.*)(.$)'`\n", "\n", "where\n", "\n", "- Capture group 1 `[0-9]+` means match one or more `+` of any digits `[0-9]`\n", "- Capture group 2 `.*` means match zero or more `*` of any character `.`\n", "- Capture gorup 3 `.$` means match any character `.` at the end `$`" ] }, { "cell_type": "code", "execution_count": 306, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
sample_numenrichmentpersonconcentration_fold_difference
2 MA C 1.34
9 MA C 2.23
10 MA C 4.37
14 MA C 1.57
15 MA C 2.85
21 MA C 1.81
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " sample\\_num & enrichment & person & concentration\\_fold\\_difference\\\\\n", "\\hline\n", "\t 2 & MA & C & 1.34\\\\\n", "\t 9 & MA & C & 2.23\\\\\n", "\t 10 & MA & C & 4.37\\\\\n", "\t 14 & MA & C & 1.57\\\\\n", "\t 15 & MA & C & 2.85\\\\\n", "\t 21 & MA & C & 1.81\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "sample_num | enrichment | person | concentration_fold_difference | \n", "|---|---|---|---|---|---|\n", "| 2 | MA | C | 1.34 | \n", "| 9 | MA | C | 2.23 | \n", "| 10 | MA | C | 4.37 | \n", "| 14 | MA | C | 1.57 | \n", "| 15 | MA | C | 2.85 | \n", "| 21 | MA | C | 1.81 | \n", "\n", "\n" ], "text/plain": [ " sample_num enrichment person concentration_fold_difference\n", "1 2 MA C 1.34 \n", "2 9 MA C 2.23 \n", "3 10 MA C 4.37 \n", "4 14 MA C 1.57 \n", "5 15 MA C 2.85 \n", "6 21 MA C 1.81 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bad_labels %>% \n", "extract(Label, \n", " into=c('sample_num', 'enrichment', 'person'), \n", " '([0-9]+)(.*)(.$)') %>%\n", "head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `separate_rows` rows when multiple values are in a single cell.\n", "\n", "Very occassionally, we find data sets where multiple values are stored in a single cell. We will make up and exmple here as this does not occur in the metadata data set." ] }, { "cell_type": "code", "execution_count": 330, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelprimer
1_MA_J AGGCTATA,CGCTCATT
1_RZ_J AGGCTATA,GAGATTCC
10_MA_C AGGATAGG,ATTACTCG
10_RZ_C AGGATAGG,TCCGGAGA
11_MA_J AGGATAGG,CGCTCATT
11_RZ_J AGGATAGG,GAGATTCC
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " Label & primer\\\\\n", "\\hline\n", "\t 1\\_MA\\_J & AGGCTATA,CGCTCATT\\\\\n", "\t 1\\_RZ\\_J & AGGCTATA,GAGATTCC\\\\\n", "\t 10\\_MA\\_C & AGGATAGG,ATTACTCG\\\\\n", "\t 10\\_RZ\\_C & AGGATAGG,TCCGGAGA\\\\\n", "\t 11\\_MA\\_J & AGGATAGG,CGCTCATT\\\\\n", "\t 11\\_RZ\\_J & AGGATAGG,GAGATTCC\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | primer | \n", "|---|---|---|---|---|---|\n", "| 1_MA_J | AGGCTATA,CGCTCATT | \n", "| 1_RZ_J | AGGCTATA,GAGATTCC | \n", "| 10_MA_C | AGGATAGG,ATTACTCG | \n", "| 10_RZ_C | AGGATAGG,TCCGGAGA | \n", "| 11_MA_J | AGGATAGG,CGCTCATT | \n", "| 11_RZ_J | AGGATAGG,GAGATTCC | \n", "\n", "\n" ], "text/plain": [ " Label primer \n", "1 1_MA_J AGGCTATA,CGCTCATT\n", "2 1_RZ_J AGGCTATA,GAGATTCC\n", "3 10_MA_C AGGATAGG,ATTACTCG\n", "4 10_RZ_C AGGATAGG,TCCGGAGA\n", "5 11_MA_J AGGATAGG,CGCTCATT\n", "6 11_RZ_J AGGATAGG,GAGATTCC" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bad_primer <- df %>% \n", "select(Label, `i5 index`, `i7 index`) %>% \n", "unite(primer, c(`i5 index`, `i7 index`), sep=',') %>%\n", "arrange(Label)\n", "head(bad_primer)" ] }, { "cell_type": "code", "execution_count": 331, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelprimer
1_MA_J AGGCTATA
1_MA_J CGCTCATT
1_RZ_J AGGCTATA
1_RZ_J GAGATTCC
10_MA_C AGGATAGG
10_MA_C ATTACTCG
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " Label & primer\\\\\n", "\\hline\n", "\t 1\\_MA\\_J & AGGCTATA \\\\\n", "\t 1\\_MA\\_J & CGCTCATT \\\\\n", "\t 1\\_RZ\\_J & AGGCTATA \\\\\n", "\t 1\\_RZ\\_J & GAGATTCC \\\\\n", "\t 10\\_MA\\_C & AGGATAGG \\\\\n", "\t 10\\_MA\\_C & ATTACTCG \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | primer | \n", "|---|---|---|---|---|---|\n", "| 1_MA_J | AGGCTATA | \n", "| 1_MA_J | CGCTCATT | \n", "| 1_RZ_J | AGGCTATA | \n", "| 1_RZ_J | GAGATTCC | \n", "| 10_MA_C | AGGATAGG | \n", "| 10_MA_C | ATTACTCG | \n", "\n", "\n" ], "text/plain": [ " Label primer \n", "1 1_MA_J AGGCTATA\n", "2 1_MA_J CGCTCATT\n", "3 1_RZ_J AGGCTATA\n", "4 1_RZ_J GAGATTCC\n", "5 10_MA_C AGGATAGG\n", "6 10_MA_C ATTACTCG" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bad_primer %>% \n", "separate_rows(primer, sep=',') %>%\n", "head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Challenge: Reconstruct column structure of orignial metadata file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `separate_rows` will place the items in each row in alternating columns, we create a temporary variable to store the position in the original cell." ] }, { "cell_type": "code", "execution_count": 332, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labelprimertmp
1_MA_J AGGCTATA1
1_MA_J CGCTCATT2
1_RZ_J AGGCTATA1
1_RZ_J GAGATTCC2
10_MA_C AGGATAGG1
10_MA_C ATTACTCG2
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & primer & tmp\\\\\n", "\\hline\n", "\t 1\\_MA\\_J & AGGCTATA & 1 \\\\\n", "\t 1\\_MA\\_J & CGCTCATT & 2 \\\\\n", "\t 1\\_RZ\\_J & AGGCTATA & 1 \\\\\n", "\t 1\\_RZ\\_J & GAGATTCC & 2 \\\\\n", "\t 10\\_MA\\_C & AGGATAGG & 1 \\\\\n", "\t 10\\_MA\\_C & ATTACTCG & 2 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | primer | tmp | \n", "|---|---|---|---|---|---|\n", "| 1_MA_J | AGGCTATA | 1 | \n", "| 1_MA_J | CGCTCATT | 2 | \n", "| 1_RZ_J | AGGCTATA | 1 | \n", "| 1_RZ_J | GAGATTCC | 2 | \n", "| 10_MA_C | AGGATAGG | 1 | \n", "| 10_MA_C | ATTACTCG | 2 | \n", "\n", "\n" ], "text/plain": [ " Label primer tmp\n", "1 1_MA_J AGGCTATA 1 \n", "2 1_MA_J CGCTCATT 2 \n", "3 1_RZ_J AGGCTATA 1 \n", "4 1_RZ_J GAGATTCC 2 \n", "5 10_MA_C AGGATAGG 1 \n", "6 10_MA_C ATTACTCG 2 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bad_primer %>% \n", "separate_rows(primer, sep=',') %>%\n", "mutate(tmp=rep_len(1:2, n())) %>% \n", "head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then `spread` the primer values by the temporary variable and rename the columns as appropriate." ] }, { "cell_type": "code", "execution_count": 333, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labeli5 indexi7 index
1_MA_J AGGCTATACGCTCATT
1_RZ_J AGGCTATAGAGATTCC
10_MA_C AGGATAGGATTACTCG
10_RZ_C AGGATAGGTCCGGAGA
11_MA_J AGGATAGGCGCTCATT
11_RZ_J AGGATAGGGAGATTCC
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & i5 index & i7 index\\\\\n", "\\hline\n", "\t 1\\_MA\\_J & AGGCTATA & CGCTCATT \\\\\n", "\t 1\\_RZ\\_J & AGGCTATA & GAGATTCC \\\\\n", "\t 10\\_MA\\_C & AGGATAGG & ATTACTCG \\\\\n", "\t 10\\_RZ\\_C & AGGATAGG & TCCGGAGA \\\\\n", "\t 11\\_MA\\_J & AGGATAGG & CGCTCATT \\\\\n", "\t 11\\_RZ\\_J & AGGATAGG & GAGATTCC \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | i5 index | i7 index | \n", "|---|---|---|---|---|---|\n", "| 1_MA_J | AGGCTATA | CGCTCATT | \n", "| 1_RZ_J | AGGCTATA | GAGATTCC | \n", "| 10_MA_C | AGGATAGG | ATTACTCG | \n", "| 10_RZ_C | AGGATAGG | TCCGGAGA | \n", "| 11_MA_J | AGGATAGG | CGCTCATT | \n", "| 11_RZ_J | AGGATAGG | GAGATTCC | \n", "\n", "\n" ], "text/plain": [ " Label i5 index i7 index\n", "1 1_MA_J AGGCTATA CGCTCATT\n", "2 1_RZ_J AGGCTATA GAGATTCC\n", "3 10_MA_C AGGATAGG ATTACTCG\n", "4 10_RZ_C AGGATAGG TCCGGAGA\n", "5 11_MA_J AGGATAGG CGCTCATT\n", "6 11_RZ_J AGGATAGG GAGATTCC" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "bad_primer %>% \n", "separate_rows(primer, sep=',') %>%\n", "mutate(tmp=rep_len(1:2, n())) %>% \n", "spread(tmp, primer) %>%\n", "rename(`i5 index`=`1`, `i7 index`=`2`) %>%\n", "head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comapre with original" ] }, { "cell_type": "code", "execution_count": 334, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
Labeli5 indexi7 index
1_MA_J AGGCTATACGCTCATT
1_RZ_J AGGCTATAGAGATTCC
10_MA_C AGGATAGGATTACTCG
10_RZ_C AGGATAGGTCCGGAGA
11_MA_J AGGATAGGCGCTCATT
11_RZ_J AGGATAGGGAGATTCC
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " Label & i5 index & i7 index\\\\\n", "\\hline\n", "\t 1\\_MA\\_J & AGGCTATA & CGCTCATT \\\\\n", "\t 1\\_RZ\\_J & AGGCTATA & GAGATTCC \\\\\n", "\t 10\\_MA\\_C & AGGATAGG & ATTACTCG \\\\\n", "\t 10\\_RZ\\_C & AGGATAGG & TCCGGAGA \\\\\n", "\t 11\\_MA\\_J & AGGATAGG & CGCTCATT \\\\\n", "\t 11\\_RZ\\_J & AGGATAGG & GAGATTCC \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | i5 index | i7 index | \n", "|---|---|---|---|---|---|\n", "| 1_MA_J | AGGCTATA | CGCTCATT | \n", "| 1_RZ_J | AGGCTATA | GAGATTCC | \n", "| 10_MA_C | AGGATAGG | ATTACTCG | \n", "| 10_RZ_C | AGGATAGG | TCCGGAGA | \n", "| 11_MA_J | AGGATAGG | CGCTCATT | \n", "| 11_RZ_J | AGGATAGG | GAGATTCC | \n", "\n", "\n" ], "text/plain": [ " Label i5 index i7 index\n", "1 1_MA_J AGGCTATA CGCTCATT\n", "2 1_RZ_J AGGCTATA GAGATTCC\n", "3 10_MA_C AGGATAGG ATTACTCG\n", "4 10_RZ_C AGGATAGG TCCGGAGA\n", "5 11_MA_J AGGATAGG CGCTCATT\n", "6 11_RZ_J AGGATAGG GAGATTCC" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>%\n", "select(Label, `i5 index`, `i7 index`) %>%\n", "arrange(Label) %>%\n", "head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenge exercise \n", "\n", "Let's look at the index and primer columne more closely. It seems that the index and primer values are linked and are just different ways of labeling the same thing. So we really should be have columns like this\n", "\n", "| primer_type | code | seq |\n", "| - | - | - |\n", "| i5 | i501 | AGGCTATA | \n", "| i5 | i502 | GCCTCTAT |\n", "| i7 | i701 | ATTACTCG |\n", "| i7 | i702 | TCCGGAGA |\n", "\n", "We can do this by combining the operatios we have seen above." ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
LabelRNA_sample_numMediaStrainReplicateconcentration_fold_differenceprimer_typecodeseq
2_MA_C 2 YPD H99 2 1.34 i5 01 AGGCTATA
2_RZ_C 2 YPD H99 2 1.34 i5 01 AGGCTATA
2_TOT_C 2 YPD H99 2 1.34 i5 01 AGGCTATA
1_MA_J 1 YPD H99 1 3.64 i5 01 AGGCTATA
1_RZ_J 1 YPD H99 1 3.64 i5 01 AGGCTATA
4_MA_P 4 YPD H99 4 2.05 i5 01 AGGCTATA
4_RZ_P 4 YPD H99 4 2.05 i5 01 AGGCTATA
9_MA_C 9 YPD mar1d 3 2.23 i5 02 GCCTCTAT
9_RZ_C 9 YPD mar1d 3 2.23 i5 02 GCCTCTAT
3_MA_J 3 YPD H99 3 1.98 i5 02 GCCTCTAT
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllllll}\n", " Label & RNA\\_sample\\_num & Media & Strain & Replicate & concentration\\_fold\\_difference & primer\\_type & code & seq\\\\\n", "\\hline\n", "\t 2\\_MA\\_C & 2 & YPD & H99 & 2 & 1.34 & i5 & 01 & AGGCTATA \\\\\n", "\t 2\\_RZ\\_C & 2 & YPD & H99 & 2 & 1.34 & i5 & 01 & AGGCTATA \\\\\n", "\t 2\\_TOT\\_C & 2 & YPD & H99 & 2 & 1.34 & i5 & 01 & AGGCTATA \\\\\n", "\t 1\\_MA\\_J & 1 & YPD & H99 & 1 & 3.64 & i5 & 01 & AGGCTATA \\\\\n", "\t 1\\_RZ\\_J & 1 & YPD & H99 & 1 & 3.64 & i5 & 01 & AGGCTATA \\\\\n", "\t 4\\_MA\\_P & 4 & YPD & H99 & 4 & 2.05 & i5 & 01 & AGGCTATA \\\\\n", "\t 4\\_RZ\\_P & 4 & YPD & H99 & 4 & 2.05 & i5 & 01 & AGGCTATA \\\\\n", "\t 9\\_MA\\_C & 9 & YPD & mar1d & 3 & 2.23 & i5 & 02 & GCCTCTAT \\\\\n", "\t 9\\_RZ\\_C & 9 & YPD & mar1d & 3 & 2.23 & i5 & 02 & GCCTCTAT \\\\\n", "\t 3\\_MA\\_J & 3 & YPD & H99 & 3 & 1.98 & i5 & 02 & GCCTCTAT \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "Label | RNA_sample_num | Media | Strain | Replicate | concentration_fold_difference | primer_type | code | seq | \n", "|---|---|---|---|---|---|---|---|---|---|\n", "| 2_MA_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA | \n", "| 2_RZ_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA | \n", "| 2_TOT_C | 2 | YPD | H99 | 2 | 1.34 | i5 | 01 | AGGCTATA | \n", "| 1_MA_J | 1 | YPD | H99 | 1 | 3.64 | i5 | 01 | AGGCTATA | \n", "| 1_RZ_J | 1 | YPD | H99 | 1 | 3.64 | i5 | 01 | AGGCTATA | \n", "| 4_MA_P | 4 | YPD | H99 | 4 | 2.05 | i5 | 01 | AGGCTATA | \n", "| 4_RZ_P | 4 | YPD | H99 | 4 | 2.05 | i5 | 01 | AGGCTATA | \n", "| 9_MA_C | 9 | YPD | mar1d | 3 | 2.23 | i5 | 02 | GCCTCTAT | \n", "| 9_RZ_C | 9 | YPD | mar1d | 3 | 2.23 | i5 | 02 | GCCTCTAT | \n", "| 3_MA_J | 3 | YPD | H99 | 3 | 1.98 | i5 | 02 | GCCTCTAT | \n", "\n", "\n" ], "text/plain": [ " Label RNA_sample_num Media Strain Replicate concentration_fold_difference\n", "1 2_MA_C 2 YPD H99 2 1.34 \n", "2 2_RZ_C 2 YPD H99 2 1.34 \n", "3 2_TOT_C 2 YPD H99 2 1.34 \n", "4 1_MA_J 1 YPD H99 1 3.64 \n", "5 1_RZ_J 1 YPD H99 1 3.64 \n", "6 4_MA_P 4 YPD H99 4 2.05 \n", "7 4_RZ_P 4 YPD H99 4 2.05 \n", "8 9_MA_C 9 YPD mar1d 3 2.23 \n", "9 9_RZ_C 9 YPD mar1d 3 2.23 \n", "10 3_MA_J 3 YPD H99 3 1.98 \n", " primer_type code seq \n", "1 i5 01 AGGCTATA\n", "2 i5 01 AGGCTATA\n", "3 i5 01 AGGCTATA\n", "4 i5 01 AGGCTATA\n", "5 i5 01 AGGCTATA\n", "6 i5 01 AGGCTATA\n", "7 i5 01 AGGCTATA\n", "8 i5 02 GCCTCTAT\n", "9 i5 02 GCCTCTAT\n", "10 i5 02 GCCTCTAT" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df %>% \n", "unite(i7, c(`i7 primer`, `i7 index`), sep=':') %>%\n", "unite(i5, c(`i5 primer`, `i5 index`), sep=':') %>%\n", "gather(primer_type, val, c(i7, i5)) %>%\n", "separate(val, into=c('code', 'seq'), sep=':') %>%\n", "mutate(code=str_extract(code, '..$')) %>%\n", "arrange(primer_type, code) %>% \n", "head(10)" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "r" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.4" } }, "nbformat": 4, "nbformat_minor": 2 }