Trimming and Filtering a FASTQ

Shell Variables

Retyping shell variables in every notebook is getting old, and its error prone. Let’s centralize these so we can share them between notebooks. We can create a shell script that contains the shell variables that we need, and then we can source it in each notebook. Let’s call it bioinf_intro_config.sh. We can do this using the Jupyter text editor.

In [1]:
source bioinf_intro_config.sh

Making New Directories

Make the directories that are new in this notebook

In [2]:
mkdir -p $TRIMMED
mkdir -p $MYINFO

Now let’s check to be sure that worked. We will run ls and check that these directories now exist in the $CUROUT directory.

In [3]:
ls $CUROUT
count_out               igv                     star_out
demo_chmod              myinfo                  stuff_for_igv.tgz
genome                  qc_output               trimmed_fastqs

Trimming and Filtering

Now we get into some actual preprocessing. We will use fastq-mcf to trim adapter from our reads and do some quality filtering. We need to trim adapter, because if a fragment is short enough, we will sequence all the way through the fragment and into the adapter. Obviously the adapter sequence in not found in the genome, and can keep the read from aligning properly. To do the trimming, we need to generate an adapter file.

Making an adapter file

The first step is to get the adapter sequence. We can get this from the manual, but sequences from a PDF can pick up weird characters, so we are better off getting the adapter sequences from the Primer Sample Sheet.

We can download and display the Sample Sheet using curl:

In [4]:
curl "https://www.neb.com/-/media/nebus/files/excel/e7600_nextseq_v4.csv?la=en"
[Header],,,,,,,,,
IEMFileVersion,4,,,,,,,,
Date,,,,,,,,,
Workflow,,,,,,,,,
Application,,,,,,,,,
Assay,,,,,,,,,
Description,,,,,,,,,
Chemistry,,,,,,,,,
,,,,,,,,,
[Reads],,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
[Settings],,,,,,,,,
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,
,,,,,,,,,
[Data],,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,,,,D701,ATTACTCG,D501,AGGCTATA,,
2,,,,D702,TCCGGAGA,D502,GCCTCTAT,,
3,,,,D703,CGCTCATT,D503,AGGATAGG,,
4,,,,D704,GAGATTCC,D504,TCAGAGCC,,
5,,,,D705,ATTCAGAA,D505,CTTCGCCT,,
6,,,,D706,GAATTCGT,D506,TAAGATTA,,
7,,,,D707,CTGAAGCT,D507,ACGTCCTG,,
8,,,,D708,TAATGCGC,D508,GTCAGTAC,,
9,,,,D709,CGGCTATG,D501,AGGCTATA,,
10,,,,D710,TCCGCGAA,D502,GCCTCTAT,,
11,,,,D711,TCTCGCGC,D503,AGGATAGG,,
12,,,,D712,AGCGATAG,D504,TCAGAGCC,,

We want the adapter sequences from the sample sheet:

Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,

Now we need to make the adapter file; this needs to be in FASTA format.

  1. Browse to scratch/bioinf_intro/myinfo

  2. Click on the jupyter “File” menu, and select “Open”.

  3. When the the new browser window/tab opens, click on the “Files” tab if it is not already active.

  4. Click on the “home” symbol to go to the top level directory, then click on “myinfo”

  5. In the “New” menu select “Text File”.

  6. In this text file, paste the adapter lines from above.

  7. We also want to include the reverse complement of the adapter, in case the adapter contamination as sequenced is the reverse completement of what is given. The easiest way to do that is to use https://www.bioinformatics.org/sms/rev_comp.html to generate the reverse complement, then name it something like “Adapter_RC”

  8. Now clean up by making sure that …

    1. Each sequence is on its own line

    2. Each sequence has a name on the line before it

    3. The sequence name is preceded by a “>”

    4. All commas and spaces need to be removed, and non-sequence characters need to be removed from the sequence lines Now it should look like this:

      >Adapter
      AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
      >AdapterRead2
      AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
      >Adapter_rc
      TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
      >AdapterRead2_rc
      ACACTCTTTCCCTACACGACGCTCTTCCGATCT
      
  9. Click on “untitled.txt” to change the file name to “neb_e7600_adapters.fasta”

  10. Save the file.

fastq-mcf

You can run fastq-mcf -h to get details about running fastq-mcf. We will adjust run parameters, because some of the defaults set a low bar (even the author acknowleges this).

In [5]:
# the "| cat" is a hack that prevents problems with jupyter
fastq-mcf -h | cat
Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.803

Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters.   Removes the adapter
sequences from the fastq file(s).

Stats go to stderr, unless -o is specified.

Specify -0 to turn off all default settings

If you specify multiple 'paired-end' inputs, then a -o option is
required for each.  IE: -o read1.clip.q -o read2.clip.fq

Options:
    -h       This help
    -o FIL   Output file (stats to stdout)
    -O N     Only output the first N records (all)
    -s N.N   Log scale for adapter minimum-length-match (2.2)
    -t N     % occurance threshold before adapter clipping (0.25)
    -m N     Minimum clip length, overrides scaled auto (1)
    -p N     Maximum adapter difference percentage (10)
    -l N     Minimum remaining sequence length (19)
    -L N     Maximum remaining sequence length (none)
    -D N     Remove duplicate reads : Read_1 has an identical N bases (0)
    -k N     sKew percentage-less-than causing cycle removal (2)
    -x N     'N' (Bad read) percentage causing cycle removal (20)
    -q N     quality threshold causing base removal (10)
    -w N     window-size for quality trimming (1)
    -H       remove >95% homopolymer reads (no)
    -X       remove low complexity reads (no)
    -0       Set all default parameters to zero/do nothing
    -U|u     Force disable/enable Illumina PF filtering (auto)
    -P N     Phred-scale (auto)
    -R       Don't remove N's from the fronts/ends of reads
    -n       Don't clip, just output what would be done
    -K       Only keep clipped reads
    -S       Save all discarded reads to '.skip' files
    -C N     Number of reads to use for subsampling (300k)
    -d       Output lots of random debugging stuff

Quality adjustment options:
    --cycle-adjust      CYC,AMT   Adjust cycle CYC (negative = offset from end) by amount AMT
    --phred-adjust      SCORE,AMT Adjust score SCORE by amount AMT
    --phred-adjust-max  SCORE     Adjust scores > SCORE to SCOTE

Filtering options*:
    --[mate-]qual-mean  NUM       Minimum mean quality score
    --[mate-]qual-gt    NUM,THR   At least NUM quals > THR
    --[mate-]max-ns     NUM       Maxmium N-calls in a read (can be a %)
    --[mate-]min-len    NUM       Minimum remaining length (same as -l)
    --homopolymer-pct   PCT       Homopolymer filter percent (95)
    --lowcomplex-pct    PCT       Complexity filter percent (95)
    --keep-clipped                Only keep clipped (same as -K)
    --max-output-reads   N        Only output first N records (same as -O)

If mate- prefix is used, then applies to second non-barcode read only

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters

Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.

Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed.  Disable for methyl-seq, etc.

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)

Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage.  -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

*Quality filters are evaluated after clipping/trimming

Homopolymer filtering is a subset of low-complexity, but will not
be separately tracked unless both are turned on.

Running fastq-mcf

  1. neb_e7600_adapters.fasta : the adapter file
  2. 27_MA_P_S38_L002_R1_001.fastq.gz : the FASTQ with the data (fastq-mcf, like most NGS analysis software, detects gzipped files and automatically decompresses on the fly)
  3. -q 20 : if a read has any bases with quality score lower than this, trim them and anything 3’ of that base
  4. -x 0.5 : if this percentage (or higher) of the reads have an “N” in a given position, trim all reads to that position
  5. -o 27_MA_P_S38_L002_R1_001.trim.fastq.gz : output file (the .gz ending tells fastq-mcf to compress the output file)
In [6]:
fastq-mcf $MYINFO/neb_e7600_adapters.fasta \
    $RAW_FASTQS/27_MA_P_S38_L002_R1_001.fastq.gz \
    -q 20 -x 0.5 \
    -o $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz
Command Line: /Users/cliburn/work/scratch/bioinf_intro/myinfo/neb_e7600_adapters.fasta /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz -q 20 -x 0.5 -o /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz
Scale used: 2.2
gunzip: can't stat: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz (/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz.gz): No such file or directory
Phred: 64
No records in file /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz

at this point we could run fastqc on the output of fastq-mcf to see if statistics have improved, but we will skip that for now.

In [7]:
ls $TRIMMED