Trimming and Filtering a FASTQ¶
Shell Variables¶
Retyping shell variables in every notebook is getting old, and its error
prone. Let’s centralize these so we can share them between notebooks. We
can create a shell script that contains the shell variables that we
need, and then we can source
it in each notebook. Let’s call it
bioinf_intro_config.sh
. We can do this using the Jupyter text
editor.
In [1]:
source bioinf_intro_config.sh
Making New Directories¶
Make the directories that are new in this notebook
In [2]:
mkdir -p $TRIMMED
mkdir -p $MYINFO
Now let’s check to be sure that worked. We will run ls
and check
that these directories now exist in the $CUROUT
directory.
In [3]:
ls $CUROUT
count_out igv star_out
demo_chmod myinfo stuff_for_igv.tgz
genome qc_output trimmed_fastqs
Trimming and Filtering¶
Now we get into some actual preprocessing. We will use fastq-mcf
to
trim adapter from our reads and do some quality filtering. We need to
trim adapter, because if a fragment is short enough, we will sequence
all the way through the fragment and into the adapter. Obviously the
adapter sequence in not found in the genome, and can keep the read from
aligning properly. To do the trimming, we need to generate an adapter
file.
Making an adapter file¶
The first step is to get the adapter sequence. We can get this from the manual, but sequences from a PDF can pick up weird characters, so we are better off getting the adapter sequences from the Primer Sample Sheet.
We can download and display the Sample Sheet using curl
:
In [4]:
curl "https://www.neb.com/-/media/nebus/files/excel/e7600_nextseq_v4.csv?la=en"
[Header],,,,,,,,,
IEMFileVersion,4,,,,,,,,
Date,,,,,,,,,
Workflow,,,,,,,,,
Application,,,,,,,,,
Assay,,,,,,,,,
Description,,,,,,,,,
Chemistry,,,,,,,,,
,,,,,,,,,
[Reads],,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
[Settings],,,,,,,,,
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,
,,,,,,,,,
[Data],,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,,,,D701,ATTACTCG,D501,AGGCTATA,,
2,,,,D702,TCCGGAGA,D502,GCCTCTAT,,
3,,,,D703,CGCTCATT,D503,AGGATAGG,,
4,,,,D704,GAGATTCC,D504,TCAGAGCC,,
5,,,,D705,ATTCAGAA,D505,CTTCGCCT,,
6,,,,D706,GAATTCGT,D506,TAAGATTA,,
7,,,,D707,CTGAAGCT,D507,ACGTCCTG,,
8,,,,D708,TAATGCGC,D508,GTCAGTAC,,
9,,,,D709,CGGCTATG,D501,AGGCTATA,,
10,,,,D710,TCCGCGAA,D502,GCCTCTAT,,
11,,,,D711,TCTCGCGC,D503,AGGATAGG,,
12,,,,D712,AGCGATAG,D504,TCAGAGCC,,
We want the adapter sequences from the sample sheet:
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,
Now we need to make the adapter file; this needs to be in FASTA format.
Browse to scratch/bioinf_intro/myinfo
Click on the jupyter “File” menu, and select “Open”.
When the the new browser window/tab opens, click on the “Files” tab if it is not already active.
Click on the “home” symbol to go to the top level directory, then click on “myinfo”
In the “New” menu select “Text File”.
In this text file, paste the adapter lines from above.
We also want to include the reverse complement of the adapter, in case the adapter contamination as sequenced is the reverse completement of what is given. The easiest way to do that is to use https://www.bioinformatics.org/sms/rev_comp.html to generate the reverse complement, then name it something like “Adapter_RC”
Now clean up by making sure that …
Each sequence is on its own line
Each sequence has a name on the line before it
The sequence name is preceded by a “>”
All commas and spaces need to be removed, and non-sequence characters need to be removed from the sequence lines Now it should look like this:
>Adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA >AdapterRead2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT >Adapter_rc TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT >AdapterRead2_rc ACACTCTTTCCCTACACGACGCTCTTCCGATCT
Click on “untitled.txt” to change the file name to “neb_e7600_adapters.fasta”
Save the file.
fastq-mcf¶
You can run fastq-mcf -h
to get details about running fastq-mcf. We
will adjust run parameters, because some of the defaults set a low bar
(even the author acknowleges this).
In [5]:
# the "| cat" is a hack that prevents problems with jupyter
fastq-mcf -h | cat
Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.803
Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters. Removes the adapter
sequences from the fastq file(s).
Stats go to stderr, unless -o is specified.
Specify -0 to turn off all default settings
If you specify multiple 'paired-end' inputs, then a -o option is
required for each. IE: -o read1.clip.q -o read2.clip.fq
Options:
-h This help
-o FIL Output file (stats to stdout)
-O N Only output the first N records (all)
-s N.N Log scale for adapter minimum-length-match (2.2)
-t N % occurance threshold before adapter clipping (0.25)
-m N Minimum clip length, overrides scaled auto (1)
-p N Maximum adapter difference percentage (10)
-l N Minimum remaining sequence length (19)
-L N Maximum remaining sequence length (none)
-D N Remove duplicate reads : Read_1 has an identical N bases (0)
-k N sKew percentage-less-than causing cycle removal (2)
-x N 'N' (Bad read) percentage causing cycle removal (20)
-q N quality threshold causing base removal (10)
-w N window-size for quality trimming (1)
-H remove >95% homopolymer reads (no)
-X remove low complexity reads (no)
-0 Set all default parameters to zero/do nothing
-U|u Force disable/enable Illumina PF filtering (auto)
-P N Phred-scale (auto)
-R Don't remove N's from the fronts/ends of reads
-n Don't clip, just output what would be done
-K Only keep clipped reads
-S Save all discarded reads to '.skip' files
-C N Number of reads to use for subsampling (300k)
-d Output lots of random debugging stuff
Quality adjustment options:
--cycle-adjust CYC,AMT Adjust cycle CYC (negative = offset from end) by amount AMT
--phred-adjust SCORE,AMT Adjust score SCORE by amount AMT
--phred-adjust-max SCORE Adjust scores > SCORE to SCOTE
Filtering options*:
--[mate-]qual-mean NUM Minimum mean quality score
--[mate-]qual-gt NUM,THR At least NUM quals > THR
--[mate-]max-ns NUM Maxmium N-calls in a read (can be a %)
--[mate-]min-len NUM Minimum remaining length (same as -l)
--homopolymer-pct PCT Homopolymer filter percent (95)
--lowcomplex-pct PCT Complexity filter percent (95)
--keep-clipped Only keep clipped (same as -K)
--max-output-reads N Only output first N records (same as -O)
If mate- prefix is used, then applies to second non-barcode read only
Adapter files are 'fasta' formatted:
Specify n/a to turn off adapter clipping, and just use filters
Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.
Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.
Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed. Disable for methyl-seq, etc.
Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)
Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage. -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.
*Quality filters are evaluated after clipping/trimming
Homopolymer filtering is a subset of low-complexity, but will not
be separately tracked unless both are turned on.
Running fastq-mcf¶
- neb_e7600_adapters.fasta : the adapter file
- 27_MA_P_S38_L002_R1_001.fastq.gz : the FASTQ with the data (fastq-mcf, like most NGS analysis software, detects gzipped files and automatically decompresses on the fly)
- -q 20 : if a read has any bases with quality score lower than this, trim them and anything 3’ of that base
- -x 0.5 : if this percentage (or higher) of the reads have an “N” in a given position, trim all reads to that position
- -o 27_MA_P_S38_L002_R1_001.trim.fastq.gz : output file (the .gz ending tells fastq-mcf to compress the output file)
In [6]:
fastq-mcf $MYINFO/neb_e7600_adapters.fasta \
$RAW_FASTQS/27_MA_P_S38_L002_R1_001.fastq.gz \
-q 20 -x 0.5 \
-o $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz
Command Line: /Users/cliburn/work/scratch/bioinf_intro/myinfo/neb_e7600_adapters.fasta /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz -q 20 -x 0.5 -o /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L002_R1_001.trim.fastq.gz
Scale used: 2.2
gunzip: can't stat: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz (/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz.gz): No such file or directory
Phred: 64
No records in file /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L002_R1_001.fastq.gz
at this point we could run fastqc on the output of fastq-mcf to see if statistics have improved, but we will skip that for now.
In [7]:
ls $TRIMMED