Introduction to FASTQ Files

Shell Variables

As before, we will use shell variables to make it easier to refer to the directories we are working with. Shell variables are specific to a shell session, and each notebook runs in its own shell session, so the variables do not carry over between notebooks.

So the first thing we will do is assign the variables in this notebook.

In [1]:
# Hack to handle broken pipes - IGNORE.
cleanup () {
    :
}

trap "cleanup" SIGPIPE
In [2]:
set -u
DATA_BASE="/data/hts2018_pilot"
RAW_FASTQS="$DATA_BASE/Granek_4837_180427A5"
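Because every later command depends on these variables, it can help to confirm that the directory they point to actually exists before going further. The following is a minimal sketch, not part of the original notebook; the `status` variable is a hypothetical name introduced only for this check.

```shell
# Re-create the variables from the cell above.
DATA_BASE="/data/hts2018_pilot"
RAW_FASTQS="$DATA_BASE/Granek_4837_180427A5"

# -d is true only if the path exists and is a directory.
# "status" is a hypothetical helper variable for this sketch.
if [ -d "$RAW_FASTQS" ]; then
    status="found"
else
    status="missing"
fi

echo "RAW_FASTQS=$RAW_FASTQS ($status)"
```

If this reports "missing", the commands below will fail with "No such file or directory" until the path is corrected.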

Looking at a FASTQ

Let’s take a quick look at our data. For our first pass at analysis, we are just going to be working with the first read data (R1) from one sample.

In [3]:
ls -lSrh "$RAW_FASTQS"
ls: /data/hts2018_pilot/Granek_4837_180427A5: No such file or directory

Compression: gzip, zcat, etc

The “.gz” at the end of the FASTQ file name indicates that the FASTQ file was compressed using a program named gzip. This is pretty common because FASTQ files can be huge. cat is a program for viewing text files; zcat is a special version of this program that lets you view compressed text files without first decompressing them.
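To see the compression round trip on something small, here is a sketch that builds a made-up two-read FASTQ in a temporary directory, compresses it, and prints it back. The file name `toy.fastq` and its contents are invented for illustration. The sketch uses `gzip -dc`, which decompresses to standard output just as zcat does for gzip files; on some systems (macOS, for example) zcat looks for a “.Z” suffix instead, so `gzip -dc` is the more portable spelling.

```shell
# Work in a throwaway directory so nothing real is touched.
tmpdir=$(mktemp -d)

# A made-up FASTQ with two reads (four lines per read).
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$tmpdir/toy.fastq"

# gzip replaces toy.fastq with toy.fastq.gz.
gzip "$tmpdir/toy.fastq"
ls "$tmpdir"

# Decompress to standard output; the .gz file on disk is left unchanged.
gzip -dc "$tmpdir/toy.fastq.gz"
```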

Here we will use:

  • zcat : to show the compressed FASTQ
  • head : to grab only the first 10 lines, since the whole file has over 5 x 10^6 lines (which would almost certainly hang our web browser)

In [4]:
zcat "$RAW_FASTQS/27_MA_P_S38_L002_R1_001.fastq.gz" | head
zcat: can't stat: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz (/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz.Z): No such file or directory
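The same pipe idea can summarize a file without printing it. Each FASTQ record occupies four lines, so the number of reads is the line count divided by four. The sketch below demonstrates this on a made-up three-read file; `toy_dir`, `n_lines`, and `n_reads` are hypothetical names used only here, and `gzip -dc` stands in for zcat.

```shell
# Build a small gzipped FASTQ to count (three made-up reads).
toy_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' \
    | gzip -c > "$toy_dir/toy.fastq.gz"

# Count decompressed lines; tr strips the padding some wc versions add.
n_lines=$(gzip -dc "$toy_dir/toy.fastq.gz" | wc -l | tr -d ' ')

# Four lines per record: header, sequence, separator, qualities.
n_reads=$((n_lines / 4))

echo "lines: $n_lines, reads: $n_reads"
```

Pointing the `gzip -dc` command at a real FASTQ in `$RAW_FASTQS` should work the same way, though it has to decompress the whole file and so takes longer.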

less

less is a program for taking an interactive look at a text file, like a FASTQ: it lets you scroll, search, etc. less won’t work in the bash notebook; if you want to try it out, you need to use a terminal.

To switch to a terminal, click on the Jupyter “File” menu and select “Open”. A new browser window/tab should open with your Jupyter “home base”. There, click on the “Files” tab if it is not already active, then click on “New” and select “Terminal”, which should open a new live terminal.

Since we want to look at a compressed (gzipped) FASTQ, we will use a version of less called zless, which decompresses on the fly.

At the terminal’s command prompt, type (or paste):

zless /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz

You should see the first few lines of the file. Notice that it looks like the examples we saw in lecture.

zless (and its standard cousin less) can do a lot of things. Here are a few important keystrokes:

  • q : quit
  • space : scroll down a page
  • up/down arrow : scroll up/down by a line

What do quality scores mean?

See the Quality Scores notebook for a “translation” of quality scores. The Wikipedia article on FASTQs is also a useful resource.
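As a rough illustration of that translation, the sketch below assumes the common Phred+33 encoding, in which a quality score is the character's ASCII code minus 33. That encoding is an assumption here; the Quality Scores notebook is the reference for the data in this course. The quality string `qual` is made up for the example.

```shell
# A made-up quality string (assumption: Phred+33 encoding).
qual='II#5'

# od prints each byte of the string as an unsigned decimal ASCII code.
scores=""
for code in $(printf '%s' "$qual" | od -An -tu1); do
    # Phred+33: score = ASCII code - 33.
    scores="$scores $((code - 33))"
done
scores="${scores# }"   # drop the leading space

echo "$qual -> $scores"
```

Under this assumption, “I” (ASCII 73) corresponds to a score of 40 and “#” (ASCII 35) to a score of 2.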