Demultiplex a Raw FASTQ

In [1]:
source demux2015_config.sh
mkdir -p $DEMUX

Raw Data

Description

This data is from an experiment testing the transcriptional response of Escherichia coli to growth in high pH media. Samples were sequenced with 101bp paired-end reads.

The samples were sequenced in 3 pools: * dryrun * groups_EG * groups_KNP

We will just work with the dryrun data.

In [2]:
ls $CURDATA
ls: /data/HTS_2015_data/runs_combined: No such file or directory

Using fastq-multx to demultiplex

fastq-multx comes from ea-utils. It is (somewhat) straightforward to use. We can get some information on how to use it by running the command fastq-multx with no arguments

In [3]:
fastq-multx
Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...
Version: 1.02.772

Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.
Output file can be n/a to discard the corresponding data (use this for the barcode read)

Barcodes file (-B) looks like this:

<id1> <sequence1>
<id2> <sequence2> ...

Default is to guess the -bol or -eol based on clear stats.

If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used.

If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.

Grouped barcodes file (-l or -L) looks like this:

<id1> <sequence1> <group1>
<id1> <sequence1> <group1>
<id2> <sequence2> <group2>...

Mated reads, if supplied, are kept in-sync

Options:

-o FIL1     Output files (one per input, required)
-g SEQFIL   Determine barcodes from the indexed read SEQFIL
-l BCFIL    Determine barcodes from any read, using BCFIL as a master list
-L BCFIL    Determine barcodes from <read1.fq>, using BCFIL as a master list
-B BCFIL    Use barcodes from BCFIL, no determination step, codes in <read1.fq>
-H          Use barcodes from illumina's header, instead of a read
-b          Force beginning of line (5') for barcode matching
-e          Force end of line (3') for batcode matching
-t NUM      Divide threshold for auto-determine by factor NUM (1), > 1 = more sensitive
-G NAME     Use group(s) matching NAME only
-x          Don't trim barcodes off before writing out destination
-n          Don't execute, just print likely barcode list
-v C        Verify that mated id's match up to character C (Use ' ' for illumina)
-m N        Allow up to N mismatches, as long as they are unique (1)
-d N        Require a minimum distance of N between the best and next best (2)
-q N        Require a minimum phred quality of N to accept a barcode base (0)

Since we have barcodes in a separate file, we are contrained in how we run it. Here are the command line arguments that we will be using:

  • -B BARCODE_FILE : a list of known barcodes, and the associated sample names
  • -o OUTPUT_FILE(s) : fastq-multx will produce a separate file for each barcode (two files when paired-end reads are input). This option provides a template for naming the output file - the program will fill in the “%” with the barcode.
  • -m : number of mismatches to allow in barcode
  • -d : minimum edit distance between the best and next best match
  • -x : don’t trim barcodes
  • I1_FASTQ : the index read FASTQ, which will be used to demultiplex other reads
  • R1_FASTQ : the R1 raw data to demultiplex
  • R2_FASTQ : (optional) if data is paired-end, the R2 raw data to demultiplex

You already know what is in the FASTQ file, but the barcode file is new. Let’s take a look …

In [4]:
cat ${CURDATA}/dryrun_barcodes.tab
cat: /data/HTS_2015_data/runs_combined/dryrun_barcodes.tab: No such file or directory

OK, now we are ready to run the demuxing …

In [5]:
fastq-multx -m1 -d1 -x -B ${CURDATA}/dryrun_barcodes.tab \
    ${CURDATA}/dryrun_combined_I1.fastq.gz \
    ${CURDATA}/dryrun_combined_R1.fastq.gz \
    ${CURDATA}/dryrun_combined_R2.fastq.gz \
    -o ${DEMUX}/i1.%.fq.gz ${DEMUX}/r1.%.fq.gz ${DEMUX}/r2.%.fq.gz > ${DEMUX}/dryrun_demux.stdout
Error opening file '/data/HTS_2015_data/runs_combined/dryrun_barcodes.tab': No such file or directory

Results

We redirected STDOUT to ${DEMUX}/dryrun_demux.stdout, so we can look at that file to get some run statistics.

Note that, while normally an error message such as gzip: stdout: Broken pipe would be reason for concern, we can safely ignore it in this case - it has something to do with running things within Jupyter.

Let’s take a look at the output …

In [6]:
cat ${DEMUX}/dryrun_demux.stdout

For each barcode in the barcode file, it tells us: 1. Id: the barcode name (as given in the barcode file) 2. Count: the number of reads corresponding to that barcode 3. Files(s): The demultiplexed files generated for this barcode

Unfortunately this is hard to read because it is giving the full path for the files. This is really our fault, because when we ran it we used full paths. There are a couple of ways we can more easily view this file.

We can use the cut program to show us the first N characters of each line

In [7]:
cut -c1-50 ${DEMUX}/dryrun_demux.stdout

This lets us see the Id and count information, but the files are truncated. Alternatively we can use the sed program to strip out the full path from the filenames

In [8]:
line_old="${DEMUX}/"
line_new=''
sed "s%$line_old%$line_new%g" ${DEMUX}/dryrun_demux.stdout

Here we can see that for each barcode, three files are generated, the “i1” file, which contains all of the index reads corresponding to the barcode, “r1” file, which contains all of the first reads corresponding to the barcode, and the “r2” file, which contains all of the second reads corresponding to the barcode. We can confirm this by looking at the files in our “DEMUX” directory…

In [9]:
ls ${DEMUX}
dryrun_demux.stdout

Note that neither the cut nor the sed command that we used alter the original file, no changes are saved, the results of the commands are just displayed. See the original is still the same …

In [10]:
cat ${DEMUX}/dryrun_demux.stdout