R tidyverse exercise (Solutions)

We will be working with the output files from the STAR aligner for this exercise. Thse files have four columns

column 1: gene ID
column 2: counts for unstranded RNA-seq
column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

For explanation, see STAR quantMode geneCounts values

Based on the protocol we are using column 4 is the sense strand reads and column 3 is the anti-sense read counts, so we will be working with columns 1 and 4.

In [1]:
library(tidyverse)
Warning message:
“package ‘tidyverse’ was built under R version 3.4.2”── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.4
✔ tidyr   0.8.0     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
Warning message:
“package ‘tibble’ was built under R version 3.4.3”Warning message:
“package ‘tidyr’ was built under R version 3.4.3”Warning message:
“package ‘purrr’ was built under R version 3.4.2”Warning message:
“package ‘dplyr’ was built under R version 3.4.2”Warning message:
“package ‘forcats’ was built under R version 3.4.3”── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

We will work with the following data from the 4 lanes.

1_MA_J_S18_L001_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L002_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L003_R1_001ReadsPerGene.out.tab
1_MA_J_S18_L004_R1_001ReadsPerGene.out.tab

found in the directory

/data/hts/2018/foot

Setting up variables

1. Use file.path to create a path to the data directory and save it as the variable data_dir.

In [ ]:




2. Save the filenames in a variable filenames

In [ ]:




Explore one data file

3. Read the data from the first file into a data.frame or tibble called df. Note that the file does not have a header row. Name the columns id,us, fs and rs.

In [ ]:




4. View the first and last 10 lines of df

In [ ]:




5. Save the lines from 5 onwards into a new data.frame called df_genes.

In [ ]:




6. Create a new file from df_genes contining only the 1st and 4th columns and save as a new variable df_final.

In [ ]:




7. Now do steps 3, 5 and 6 for the other 3 files using a loop, and combine them with df_final using full_join on the id column to end up with a data.frame with 5 columns (id and 4 count columns).

In [ ]:




8. Rename the counts columns as lane1, lane2, lane3 and lane4. At this point you should have a data.frame that looks like this

id lane1 lane2 lane3 lane4
gene0 0 0 0 1
gene1 0 0 0 0
gene2 8 4 10 3
In [ ]:




9. Create a new column containng the sum of lanes 1-4 called counts and save as df_wiht_counts.

In [ ]:




10. Keep only the id and coutns columns and remove reow where the gene count is 0 and save as df_counts.

  • How many genes with non-zero counts are there?
  • What is the gene(s) with the highest count?
  • What are the top 10 largest counts - i.e. the set with the largest number of genes having the same count?
In [ ]: