Trimming and Filtering a FASTQ

Shell Variables

Assign the variables in this notebook.

In [1]:
# We used these in the last notebook
CUROUT=$HOME/work/scratch/2015_output
DEMUX=$CUROUT/demux_fastqs

# The following are new for this notebook
TRIMMED=$CUROUT/trimmed_fastqs
INFO=$HOME/work/myinfo

Making New Directories

Make the directories that are new in this notebook

In [2]:
mkdir -p $TRIMMED
mkdir -p $INFO

Now let’s check to be sure that worked. We will run ls and check that these directories now exist in the $CUROUT directory.

In [3]:
ls $CUROUT
counts          demux_fastqs    qc_output       trimmed_fastqs

Trimming and Filtering

Now we get into some actual preprocessing. We will use fastq-mcf to trim adapter from our reads and do some quality filtering. We need to trim adapter, because if a fragment is short enough, we will sequence all the way through the fragment and into the adapter. Obviously the adapter sequence in not found in the genome, and can keep the read from aligning properly. To do the trimming, we need to generate an adapter file.

Making an adapter file

The first step is to get the adapter sequence. For our test file, the index is 16 (barcode CCGTCC), this is from NEBNext Multiplex Oligos for Illumina (Index Primers Set 2). So let’s look in the manual:

https://www.neb.com/~/media/Catalog/All-Products/6B6FC6C03B274E7FA0FDBF13015AB194/Datacards%20or%20Manuals/manualE7500.pdf

We will get the sequence for the universal primer and the index primer. Now we need to make the adapter file; this needs to be in FASTA format.

  1. click on the jupyter “File” menu, and select “Open”.
  2. When the the new browser window/tab opens, click on the “Files” tab if it is not already active.
  3. Click on the “home” symbol to go to the top level directory, then click on “myinfo”
  4. In the “New” menu select “Text File”.
  5. In this text file, type “>adapter_3prime”, this is the name for the adapter sequence, the “>” is part of the FASTA format indicating that it is the sequence name.
  6. Next copy the part of the 3’ common part of the adapter, the part that is to the right of the barcode sequence, and paste it on the line after the name.
  7. We also want to include the reverse complement of the adapter, in case that is what will actually be sequenced. The easiest way to do that is to use https://www.bioinformatics.org/sms/rev_comp.html to generate the reverse complement, then name it something like “adapter_3prime_RC”
  8. Now clean up by making sure that . . .
    1. The sequence is on its own line
    2. The sequence has a name on the line before it
    3. The name is preceded by a “>”
    4. All spaces and non-sequence characters have been removed
  9. We need to repeat steps 5-8 for the universal adapter sequence that is in the same PDF.
  10. Click on “untitled.txt” to change the file name to “neb_adapters.fasta”
  11. Save the file.

fastq-mcf

You can run fastq-mcf -h to get details about running fastq-mcf. We will adjust run parameters, because some of the defaults set a low bar (even the author acknowleges this).

In [4]:
# the "| cat" is a hack that prevents problems with jupyter
fastq-mcf -h | cat
Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.803

Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters.   Removes the adapter
sequences from the fastq file(s).

Stats go to stderr, unless -o is specified.

Specify -0 to turn off all default settings

If you specify multiple 'paired-end' inputs, then a -o option is
required for each.  IE: -o read1.clip.q -o read2.clip.fq

Options:
    -h       This help
    -o FIL   Output file (stats to stdout)
    -O N     Only output the first N records (all)
    -s N.N   Log scale for adapter minimum-length-match (2.2)
    -t N     % occurance threshold before adapter clipping (0.25)
    -m N     Minimum clip length, overrides scaled auto (1)
    -p N     Maximum adapter difference percentage (10)
    -l N     Minimum remaining sequence length (19)
    -L N     Maximum remaining sequence length (none)
    -D N     Remove duplicate reads : Read_1 has an identical N bases (0)
    -k N     sKew percentage-less-than causing cycle removal (2)
    -x N     'N' (Bad read) percentage causing cycle removal (20)
    -q N     quality threshold causing base removal (10)
    -w N     window-size for quality trimming (1)
    -H       remove >95% homopolymer reads (no)
    -X       remove low complexity reads (no)
    -0       Set all default parameters to zero/do nothing
    -U|u     Force disable/enable Illumina PF filtering (auto)
    -P N     Phred-scale (auto)
    -R       Don't remove N's from the fronts/ends of reads
    -n       Don't clip, just output what would be done
    -K       Only keep clipped reads
    -S       Save all discarded reads to '.skip' files
    -C N     Number of reads to use for subsampling (300k)
    -d       Output lots of random debugging stuff

Quality adjustment options:
    --cycle-adjust      CYC,AMT   Adjust cycle CYC (negative = offset from end) by amount AMT
    --phred-adjust      SCORE,AMT Adjust score SCORE by amount AMT
    --phred-adjust-max  SCORE     Adjust scores > SCORE to SCOTE

Filtering options*:
    --[mate-]qual-mean  NUM       Minimum mean quality score
    --[mate-]qual-gt    NUM,THR   At least NUM quals > THR
    --[mate-]max-ns     NUM       Maxmium N-calls in a read (can be a %)
    --[mate-]min-len    NUM       Minimum remaining length (same as -l)
    --homopolymer-pct   PCT       Homopolymer filter percent (95)
    --lowcomplex-pct    PCT       Complexity filter percent (95)
    --keep-clipped                Only keep clipped (same as -K)
    --max-output-reads   N        Only output first N records (same as -O)

If mate- prefix is used, then applies to second non-barcode read only

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters

Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.

Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed.  Disable for methyl-seq, etc.

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)

Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage.  -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

*Quality filters are evaluated after clipping/trimming

Homopolymer filtering is a subset of low-complexity, but will not
be separately tracked unless both are turned on.

Running fastq-mcf

  1. neb_adapters.fasta : the adapter file
  2. r1.8A_pilot.fq.gz : the FASTQ with the data (fastq-mcf, like most NGS analysis software, detects gzipped files and automatically decompresses on the fly)
  3. -q 20 : if a read has any bases with quality score lower than this, trim them and anything 3’ of that base
  4. -x 0.5 : if this percentage (or higher) of the reads have an “N” in a given position, trim all reads to that position
  5. -o r1.test.trim.fastq.gz : output file (the .gz ending tells fastq-mcf to compress the output file)
In [5]:
fastq-mcf $INFO/neb_adapters.fasta \
    $DEMUX/r1.8A_pilot.fq.gz \
    -q 20 -x 0.5 \
    -o $TRIMMED/r1.8A_pilot.trim.fastq.gz
Error opening adapter file '/Users/cliburn/work/myinfo/neb_adapters.fasta': No such file or directory

Notice that it was the reverse complement of the adapter that was actually found, so its a good thing that we actually included it.

at this point we could run fastqc on the output of fastq-mcf to see if statistics have improved, but we will skip that for now.

In [6]:
ls $TRIMMED