R tidyverse
exercise (Solutions)¶
We will be working with the output files from the STAR aligner for this exercise. Thse files have four columns
column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)
For explanation, see STAR quantMode geneCounts values
Based on the protocol we are using column 4 is the sense strand reads and column 3 is the anti-sense read counts, so we will be working with columns 1 and 4.
In [1]:
library(tidyverse)
Warning message:
“package ‘tidyverse’ was built under R version 3.4.2”── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.4
✔ tibble 1.4.2 ✔ dplyr 0.7.4
✔ tidyr 0.8.0 ✔ stringr 1.3.1
✔ readr 1.1.1 ✔ forcats 0.3.0
Warning message:
“package ‘tibble’ was built under R version 3.4.3”Warning message:
“package ‘tidyr’ was built under R version 3.4.3”Warning message:
“package ‘purrr’ was built under R version 3.4.2”Warning message:
“package ‘dplyr’ was built under R version 3.4.2”Warning message:
“package ‘forcats’ was built under R version 3.4.3”── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
We will work with the following data from the 4 lanes.
1_MA_J_S18_L001_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L002_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L003_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L004_R1_001ReadsPerGene.out.tab
found in the directory
/data/hts/2018/foot
Setting up variables¶
1. Use file.path
to create a path to the data directory and save
it as the variable data_dir
.
In [ ]:
2. Save the filenames in a variable filenames
In [ ]:
Explore one data file¶
3. Read the data from the first file into a data.frame
or
tibble
called df
. Note that the file does not have a header row.
Name the columns id
,us
, fs
and rs
.
In [ ]:
4. View the first and last 10 lines of df
In [ ]:
5. Save the lines from 5 onwards into a new data.frame
called
df_genes
.
In [ ]:
6. Create a new file from df_genes
contining only the 1st and
4th columns and save as a new variable df_final
.
In [ ]:
7. Now do steps 3, 5 and 6 for the other 3 files using a loop, and
combine them with df_final
using full_join
on the id
column
to end up with a data.frame with 5 columns (id and 4 count columns).
In [ ]:
8. Rename the counts columns as lane1
, lane2
, lane3
and
lane4
. At this point you should have a data.frame
that looks
like this
id | lane1 | lane2 | lane3 | lane4 |
---|---|---|---|---|
gene0 | 0 | 0 | 0 | 1 |
gene1 | 0 | 0 | 0 | 0 |
gene2 | 8 | 4 | 10 | 3 |
In [ ]:
9. Create a new column containng the sum of lanes 1-4 called
counts
and save as df_wiht_counts
.
In [ ]:
10. Keep only the id
and coutns
columns and remove reow
where the gene count is 0 and save as df_counts
.
- How many genes with non-zero counts are there?
- What is the gene(s) with the highest count?
- What are the top 10 largest counts - i.e. the set with the largest number of genes having the same count?
In [ ]: