Demultiplex a Raw FASTQ¶
In [1]:
source demux2015_config.sh
mkdir -p $DEMUX
Raw Data¶
Description¶
This data is from an experiment testing the transcriptional response of Escherichia coli to growth in high pH media. Samples were sequenced with 101bp paired-end reads.
The samples were sequenced in 3 pools: * dryrun * groups_EG * groups_KNP
We will just work with the dryrun data.
In [2]:
ls $CURDATA
ls: /data/HTS_2015_data/runs_combined: No such file or directory
Using fastq-multx to demultiplex¶
fastq-multx
comes from ea-utils. It is (somewhat) straightforward to
use. We can get some information on how to use it by running the command
fastq-multx
with no arguments
In [3]:
fastq-multx
Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...
Version: 1.02.772
Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.
Output file can be n/a to discard the corresponding data (use this for the barcode read)
Barcodes file (-B) looks like this:
<id1> <sequence1>
<id2> <sequence2> ...
Default is to guess the -bol or -eol based on clear stats.
If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used.
If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.
Grouped barcodes file (-l or -L) looks like this:
<id1> <sequence1> <group1>
<id1> <sequence1> <group1>
<id2> <sequence2> <group2>...
Mated reads, if supplied, are kept in-sync
Options:
-o FIL1 Output files (one per input, required)
-g SEQFIL Determine barcodes from the indexed read SEQFIL
-l BCFIL Determine barcodes from any read, using BCFIL as a master list
-L BCFIL Determine barcodes from <read1.fq>, using BCFIL as a master list
-B BCFIL Use barcodes from BCFIL, no determination step, codes in <read1.fq>
-H Use barcodes from illumina's header, instead of a read
-b Force beginning of line (5') for barcode matching
-e Force end of line (3') for batcode matching
-t NUM Divide threshold for auto-determine by factor NUM (1), > 1 = more sensitive
-G NAME Use group(s) matching NAME only
-x Don't trim barcodes off before writing out destination
-n Don't execute, just print likely barcode list
-v C Verify that mated id's match up to character C (Use ' ' for illumina)
-m N Allow up to N mismatches, as long as they are unique (1)
-d N Require a minimum distance of N between the best and next best (2)
-q N Require a minimum phred quality of N to accept a barcode base (0)
Since we have barcodes in a separate file, we are contrained in how we run it. Here are the command line arguments that we will be using:
- -B BARCODE_FILE : a list of known barcodes, and the associated sample names
- -o OUTPUT_FILE(s) : fastq-multx will produce a separate file for each barcode (two files when paired-end reads are input). This option provides a template for naming the output file - the program will fill in the “%” with the barcode.
- -m : number of mismatches to allow in barcode
- -d : minimum edit distance between the best and next best match
- -x : don’t trim barcodes
- I1_FASTQ : the index read FASTQ, which will be used to demultiplex other reads
- R1_FASTQ : the R1 raw data to demultiplex
- R2_FASTQ : (optional) if data is paired-end, the R2 raw data to demultiplex
You already know what is in the FASTQ file, but the barcode file is new. Let’s take a look …
In [4]:
cat ${CURDATA}/dryrun_barcodes.tab
cat: /data/HTS_2015_data/runs_combined/dryrun_barcodes.tab: No such file or directory
OK, now we are ready to run the demuxing …
In [5]:
fastq-multx -m1 -d1 -x -B ${CURDATA}/dryrun_barcodes.tab \
${CURDATA}/dryrun_combined_I1.fastq.gz \
${CURDATA}/dryrun_combined_R1.fastq.gz \
${CURDATA}/dryrun_combined_R2.fastq.gz \
-o ${DEMUX}/i1.%.fq.gz ${DEMUX}/r1.%.fq.gz ${DEMUX}/r2.%.fq.gz > ${DEMUX}/dryrun_demux.stdout
Error opening file '/data/HTS_2015_data/runs_combined/dryrun_barcodes.tab': No such file or directory
Results¶
We redirected STDOUT to ${DEMUX}/dryrun_demux.stdout
, so we can look
at that file to get some run statistics.
Note that, while normally an error message such as
gzip: stdout: Broken pipe
would be reason for concern, we can safely
ignore it in this case - it has something to do with running things
within Jupyter.
Let’s take a look at the output …
In [6]:
cat ${DEMUX}/dryrun_demux.stdout
For each barcode in the barcode file, it tells us: 1. Id: the barcode name (as given in the barcode file) 2. Count: the number of reads corresponding to that barcode 3. Files(s): The demultiplexed files generated for this barcode
Unfortunately this is hard to read because it is giving the full path for the files. This is really our fault, because when we ran it we used full paths. There are a couple of ways we can more easily view this file.
We can use the cut
program to show us the first N characters of each
line
In [7]:
cut -c1-50 ${DEMUX}/dryrun_demux.stdout
This lets us see the Id and count information, but the files are
truncated. Alternatively we can use the sed
program to strip out the
full path from the filenames
In [8]:
line_old="${DEMUX}/"
line_new=''
sed "s%$line_old%$line_new%g" ${DEMUX}/dryrun_demux.stdout
Here we can see that for each barcode, three files are generated, the “i1” file, which contains all of the index reads corresponding to the barcode, “r1” file, which contains all of the first reads corresponding to the barcode, and the “r2” file, which contains all of the second reads corresponding to the barcode. We can confirm this by looking at the files in our “DEMUX” directory…
In [9]:
ls ${DEMUX}
dryrun_demux.stdout
Note that neither the cut
nor the sed
command that we used alter
the original file, no changes are saved, the results of the commands are
just displayed. See the original is still the same …
In [10]:
cat ${DEMUX}/dryrun_demux.stdout