{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Trimming and Filtering a FASTQ\n",
    "\n",
    "\n",
    "## Shell Variables\n",
    "Retyping shell variables in every notebook is getting old, and its error prone.  Let's centralize these so we can share them between notebooks.  We can create a shell script that contains the shell variables that we need, and then we can `source` it in each notebook.  Let's call it `bioinf_intro_config.sh`.  We can do this using the Jupyter text editor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source bioinf_intro_config.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making New Directories\n",
    "Make the directories that are new in this notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mkdir -p $TRIMMED\n",
    "mkdir -p $MYINFO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check to be sure that worked.  We will run `ls` and check that these directories now exist in the `$CUROUT` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls $CUROUT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Trimming and Filtering\n",
    "Now we get into some actual preprocessing.  We will use `fastq-mcf` to trim adapter from our reads and do some quality filtering.  We need to trim adapter, because if a fragment is short enough, we will sequence all the way through the fragment and into the adapter.  Obviously the adapter sequence in not found in the genome, and can keep the read from aligning properly.  To do the trimming, we need to generate an adapter file.\n",
    "\n",
    "## Making an adapter file\n",
    "The first step is to get the adapter sequence.  We can get this from the [manual](https://www.neb.com/-/media/catalog/datacards-or-manuals/manuale7600.pdf), but sequences from a PDF can pick up weird characters, so we are better off getting the adapter sequences from the [Primer Sample Sheet](https://www.neb.com/-/media/nebus/files/excel/e7600_nextseq_v4.csv?la=en).  \n",
    "\n",
    "We can download and display the Sample Sheet using `curl`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "curl \"https://www.neb.com/-/media/nebus/files/excel/e7600_nextseq_v4.csv?la=en\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want the adapter sequences from the sample sheet: \n",
    "```\n",
    "Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,\n",
    "AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,\n",
    "```\n",
    "\n",
    "Now we need to make the adapter file; this needs to be in FASTA format.\n",
    "\n",
    "0. Browse to scratch/bioinf_intro/myinfo\n",
    "1. Click on the jupyter \"File\" menu, and select \"Open\".  \n",
    "2. When the the new browser window/tab opens, click on the \"Files\" tab if it is not already active.\n",
    "3. Click on the \"home\" symbol to go to the top level directory, then click on \"myinfo\"\n",
    "4. In the \"New\" menu select \"Text File\".\n",
    "5. In this text file, paste the adapter lines from above.\n",
    "7. We also want to include the reverse complement of the adapter, in case the adapter contamination as sequenced is the reverse completement of what is given.  The easiest way to do that is to use https://www.bioinformatics.org/sms/rev_comp.html to generate the reverse complement, then name it something like \"Adapter_RC\"\n",
    "8. Now clean up by making sure that . . .\n",
    "    1. Each sequence is on its own line\n",
    "    2. Each sequence has a name on the line before it\n",
    "    3. The sequence name is preceded by a \">\"\n",
    "    4. All commas and spaces need to be removed, and non-sequence characters need to be removed from the sequence lines\n",
    "Now it should look like this:\n",
    "```\n",
    ">Adapter\n",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA\n",
    ">AdapterRead2\n",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT\n",
    ">Adapter_rc\n",
    "TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT\n",
    ">AdapterRead2_rc\n",
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT\n",
    "```\n",
    "10. Click on \"untitled.txt\" to change the file name to \"neb_e7600_adapters.fasta\"\n",
    "11. Save the file.\n",
    "\n",
    "\n",
    "## fastq-mcf\n",
    "You can run `fastq-mcf -h` to get details about running fastq-mcf.  We will adjust run parameters, because some of the defaults set a low bar (even the author acknowleges this)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the \"| cat\" is a hack that prevents problems with jupyter\n",
    "fastq-mcf -h | cat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running fastq-mcf\n",
    "1. neb_e7600_adapters.fasta : the adapter file\n",
    "2. 27_MA_P_S38_L002_R1_001.fastq.gz : the FASTQ with the data (fastq-mcf, like most NGS analysis software, detects gzipped files and automatically decompresses on the fly)\n",
    "3. -q 20 : if a read has any bases with quality score lower than this, trim them and anything 3' of that base\n",
    "4. -x 0.5 : if this percentage (or higher) of the reads have an \"N\" in a given position, trim all reads to that position\n",
    "5. -o 27_MA_P_S38_L002_R1_001.trim.fastq.gz : output file (the .gz ending tells fastq-mcf to compress the output file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fastq-mcf $MYINFO/neb_e7600_adapters.fasta \\\n",
    "    $RAW_FASTQS/27_MA_P_S38_L002_R1_001.fastq.gz \\\n",
    "    -q 20 -x 0.5 \\\n",
    "    -o $TRIMMED/27_MA_P_S38_L002_R1_001.trim.fastq.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "at this point we could run fastqc on the output of fastq-mcf to see if statistics have improved, but we will skip that for now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls $TRIMMED"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}