Demultiplex a Raw FASTQ¶
Let’s get started on the bioinformatics analysis!
Organization¶
The first thing we want to do is make sure we keep things organized, so we are going to keep different aspects of the project in separate directories. We will keep everything from this session in one place. This way we don’t clutter our home directory with the files we are going to generate.
Shell Variables¶
We will use variables to make it easier to refer to the directories we are working with. We can substitute these variables anyplace we would use a directory or file name.
In [1]:
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs
CURDATA=/data/HTS_2015_data/runs_combined
Notice that the directory refered to by $DEMUX does not exist yet.
In [2]:
# the "| cat" is a hack that prevents problems with jupyter
ls -d $DEMUX | cat
/Users/cliburn/work/scratch/2015_output/demux_fastqs
Making a new directory¶
In [3]:
mkdir -p $DEMUX
mkdir
makes a new directory.
-p
tells mkdir to not give an error if the directory we tell it to
make already exists.
$DEMUX
tells mkdir that we want to make a new directory with the
path stored in the “$DEMUX” variable.
Now let’s check to be sure that worked. We will run ls
and check
that the “$DEMUX” directory now exists.
In [4]:
ls -d $DEMUX
/Users/cliburn/work/scratch/2015_output/demux_fastqs
Raw Data¶
Description¶
This data is from an experiment testing the transcriptional response of Escherichia coli to growth in high pH media. Samples were sequenced with 101bp paired-end reads.
The samples were sequenced in 3 pools: * dryrun * groups_EG * groups_KNP
In [5]:
ls $CURDATA
ls: /data/HTS_2015_data/runs_combined: No such file or directory
Using fastq-multx to demultiplex¶
fastq-multx
comes from ea-utils. It is (somewhat) straightforward to
use. We can get some information on how to use it by running the command
fastq-multx -h
In [6]:
# the "| cat" is a hack that prevents problems with jupyter
fastq-multx -h | cat
Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...
Version: 1.02.772
Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.
Output file can be n/a to discard the corresponding data (use this for the barcode read)
Barcodes file (-B) looks like this:
<id1> <sequence1>
<id2> <sequence2> ...
Default is to guess the -bol or -eol based on clear stats.
If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used.
If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.
Grouped barcodes file (-l or -L) looks like this:
<id1> <sequence1> <group1>
<id1> <sequence1> <group1>
<id2> <sequence2> <group2>...
Mated reads, if supplied, are kept in-sync
Options:
-o FIL1 Output files (one per input, required)
-g SEQFIL Determine barcodes from the indexed read SEQFIL
-l BCFIL Determine barcodes from any read, using BCFIL as a master list
-L BCFIL Determine barcodes from <read1.fq>, using BCFIL as a master list
-B BCFIL Use barcodes from BCFIL, no determination step, codes in <read1.fq>
-H Use barcodes from illumina's header, instead of a read
-b Force beginning of line (5') for barcode matching
-e Force end of line (3') for batcode matching
-t NUM Divide threshold for auto-determine by factor NUM (1), > 1 = more sensitive
-G NAME Use group(s) matching NAME only
-x Don't trim barcodes off before writing out destination
-n Don't execute, just print likely barcode list
-v C Verify that mated id's match up to character C (Use ' ' for illumina)
-m N Allow up to N mismatches, as long as they are unique (1)
-d N Require a minimum distance of N between the best and next best (2)
-q N Require a minimum phred quality of N to accept a barcode base (0)
Since we have barcodes in a separate file, we are contrained in how we run it. Here are the command line arguments that we will be using:
- -B BARCODE_FILE : a list of known barcodes, and the associated sample names
- -o OUTPUT_FILE(s) : fastq-multx will produce a separate file for each barcode (two files when paired-end reads are input). This option provides a template for naming the output file - the program will fill in the “%” with the barcode.
- -m : number of mismatches to allow in barcode
- -d : minimum edit distance between the best and next best match
- -x : don’t trim barcodes
- I1_FASTQ : the index read FASTQ, which will be used to demultiplex other reads
- R1_FASTQ : the R1 raw data to demultiplex
- R2_FASTQ : (optional) if data is paired-end, the R2 raw data to demultiplex
You already know what is in the FASTQ file, but the barcode file is new. Let’s take a look . . .
In [7]:
cat ${CURDATA}/dryrun_barcodes.tab
cat: /data/HTS_2015_data/runs_combined/dryrun_barcodes.tab: No such file or directory
OK, now we are ready to run the demuxing . . .
In [8]:
fastq-multx -m1 -d1 -x -B ${CURDATA}/dryrun_barcodes.tab \
${CURDATA}/dryrun_combined_I1.fastq.gz \
${CURDATA}/dryrun_combined_R1.fastq.gz \
${CURDATA}/dryrun_combined_R2.fastq.gz \
-o ${DEMUX}/i1.%.fq.gz ${DEMUX}/r1.%.fq.gz ${DEMUX}/r2.%.fq.gz > ${DEMUX}/dryrun_demux.stdout
Error opening file '/data/HTS_2015_data/runs_combined/dryrun_barcodes.tab': No such file or directory
Results¶
We redirected STDOUT to ${DEMUX}/dryrun_demux.stdout
, so we can look
at that file to get some run statistics.
Note that, while normally an error message such as
gzip: stdout: Broken pipe
would be reason for concern, we can safely
ignore it in this case - it has something to do with running things
within Jupyter.
Let’s take a look at the output . . .
In [9]:
cat ${DEMUX}/dryrun_demux.stdout
For each barcode in the barcode file, it tells us: 1. Id: the barcode name (as given in the barcode file) 2. Count: the number of reads corresponding to that barcode 3. Files(s): The demultiplexed files generated for this barcode
Unfortunately this is hard to read because it is giving the full path for the files. This is really our fault, because when we ran it we used full paths. There are a couple of ways we can more easily view this file.
We can use the cut
program to show us the first N characters of each
line
In [10]:
cut -c1-50 ${DEMUX}/dryrun_demux.stdout
This lets us see the Id and count information, but the files are
truncated. Alternatively we can use the sed
program to strip out the
full path from the filenames
In [11]:
line_old='/home/jovyan/work/scratch/2015_output/demux_fastqs/'
line_new=''
sed "s%$line_old%$line_new%g" ${DEMUX}/dryrun_demux.stdout
Here we can see that for each barcode, three files are generated, the “i1” file, which contains all of the index reads corresponding to the barcode, “r1” file, which contains all of the first reads corresponding to the barcode, and the “r2” file, which contains all of the second reads corresponding to the barcode. We can confirm this by looking at the files in our “DEMUX” directory. . .
In [12]:
ls ${DEMUX}
dryrun_demux.stdout
Note that neither the cut
nor the sed
command that we used alter
the original file, no changes are saved, the results of the commands are
just displayed. See the original is still the same . . .
In [13]:
cat ${DEMUX}/dryrun_demux.stdout