Looping with Globs

It can be tedious and error prone to individually specify the files we want to loop over. Regular expressions, in the form of bash globs allow us to automatically select groups of files to loop over.

Shell Variables

Assign the variables in this notebook.

In [1]:
source bioinf_intro_config.sh
mkdir -p $TRIMMED $STAR_OUT

Globs?

Previously we specified the files we wanted to loop over explicitly:

In [2]:
for FASTQ in 27_MA_P_S38_L002_R1 27_MA_P_S38_L001_R1
    do
        echo "RUNNING FASTQ: ${FASTQ}"
    done
RUNNING FASTQ: 27_MA_P_S38_L002_R1
RUNNING FASTQ: 27_MA_P_S38_L001_R1

This is appropriate in some situations, but often when there is a group of files that we want to work with, we can find a simple way to list these files without specifying them one-by-one. For example, since the reads for each FASTQ are split across four lanes, we might want to analyze the data from all four lanes at once. There are several ways that we can specify the FASTQs from all four lanes of FASTQ 27_MA:

  1. The easiest is to use the * wildcard, which matches any number of any characters (including zero), so we can match the FASTQs from all four lanes like this: $RAW_FASTQS/27_MA*
  2. * is easy and useful, but often it is better to be more specific, otherwise we might match something unintentionally. Since the only difference in the names between the lanes is the “L001”, “L002”, “L003”, and “L004”, we can narrow our match using ?, which matches any single character: 27_MA_P_S38_L00?_R1_001.fastq.gz
  3. The best approach is to specify exactly what characters we are allowing in that variable position. Square brackets allow you to list specific characters that can match [1234] or a range [1-4]: 27_MA_P_S38_L00[1-4]_R1_001.fastq.gz

Globs can be confusing. Keep in mind that here we are using the globs to search through all the file names in a directory and list the files with a name that matches a specific pattern.

In [3]:
echo $RAW_FASTQS/27_MA*
/data/hts2018_pilot/Granek_4837_180427A5/27_MA*
In [4]:
echo $RAW_FASTQS/27_MA_P_S38_L00?_R1_001.fastq.gz
/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00?_R1_001.fastq.gz
In [5]:
echo $RAW_FASTQS/27_MA_P_S38_L00[1234]_R1_001.fastq.gz
/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1234]_R1_001.fastq.gz
In [6]:
echo $RAW_FASTQS/27_MA_P_S38_L00[1-4]_R1_001.fastq.gz
/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-4]_R1_001.fastq.gz

More Complex Globs

We can combine multiple wildcards in a glob to match a more complex set of filenames. For example we could match MA samples 10 through 19, Lanes 1 through 4, with the following glob.

In [7]:
ls $RAW_FASTQS/1?_MA_*_L00[1-4]_R1_001.fastq.gz
ls: /data/hts2018_pilot/Granek_4837_180427A5/1?_MA_*_L00[1-4]_R1_001.fastq.gz: No such file or directory

Looping over a glob

Now we can put together for loops with globs, to loop over all the lanes from library 27_MA

In [8]:
for FASTQ in $RAW_FASTQS/27_MA_P_S38_L00[1-4]_R1_001.fastq.gz
    do
        echo "RUNNING FASTQ: ${FASTQ}"
    done
RUNNING FASTQ: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-4]_R1_001.fastq.gz

String Manipulation

But we still have a bit of work to do. Before when we were manually specifying each FASTQ, we looped over a substring of the full path, then added on the prefix and suffix, for example: $TRIMMED/${FASTQ}_001.trim.fastq.gz. But now we are grabbing the full path, and we need to manipulate it so that we can generate output file names that are different than the input, and put the output in different directories. We can do all this with the basename bash function. The simple form of basename removes the whole directory portion of a path and just returns the filename:

In [9]:
for FASTQ in $RAW_FASTQS/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        echo "basename gives us: $(basename ${FASTQ})"
        echo ""
    done
FULL PATH IS: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
basename gives us: 27_MA_P_S38_L00[1-2]_R1_001.fastq.gz

If you give basename a string as a second argument, it will strip that string from the end of the path (if it is there):

In [10]:
for FASTQ in $RAW_FASTQS/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        echo "basename gives us: $(basename ${FASTQ} '_001.fastq.gz')"
        echo ""
    done
FULL PATH IS: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
basename gives us: 27_MA_P_S38_L00[1-2]_R1

Note that if the string you give is not a suffix of the path, nothing is stripped from the end of the path, but the directory prefix will still be removed:

In [11]:
for FASTQ in $RAW_FASTQS/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        echo "with '_001.fastq.gz': $(basename ${FASTQ} '_001.fastq.gz')"
        echo "with 'fastq':         $(basename ${FASTQ} 'fastq')"
        echo "with 'foo':           $(basename ${FASTQ} 'foo')"
        echo ""
    done
FULL PATH IS: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
with '_001.fastq.gz': 27_MA_P_S38_L00[1-2]_R1
with 'fastq':         27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
with 'foo':           27_MA_P_S38_L00[1-2]_R1_001.fastq.gz

And we can assign the results of basename to a variable for later use:

In [12]:
for FASTQ in $RAW_FASTQS/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        FASTQ_BASE="$(basename ${FASTQ} '_001.fastq.gz')"
        echo "basename gives us: $FASTQ_BASE"
        echo "OUTPUT PATH: ${TRIMMED}/${FASTQ_BASE}_001.trim.fastq.gz"
    done
FULL PATH IS: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
basename gives us: 27_MA_P_S38_L00[1-2]_R1
OUTPUT PATH: /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L00[1-2]_R1_001.trim.fastq.gz

With globs and basename in our toolbox, we are ready to conquer the world or at least run multiple FASTQs through our pipeline, without breaking a sweat!

A globy pipeline

In [13]:
for FASTQ in $RAW_FASTQS/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
    do
        FASTQ_BASE="$(basename ${FASTQ} '_001.fastq.gz')"
        echo "---------------- TRIMMING: $FASTQ_BASE ----------------"
        fastq-mcf \
            $MYINFO/neb_e7600_adapters.fasta \
            $RAW_FASTQS/${FASTQ_BASE}_001.fastq.gz \
            -q 20 -x 0.5 \
            -o $TRIMMED/${FASTQ_BASE}_001.trim.fastq.gz

        echo "---------------- MAPPING: $FASTQ_BASE ----------------"
        STAR \
            --runMode alignReads \
            --twopassMode None \
            --genomeDir $GENOME_DIR \
            --readFilesIn $TRIMMED/${FASTQ_BASE}_001.trim.fastq.gz \
            --readFilesCommand gunzip -c \
            --outFileNamePrefix ${STAR_OUT}/${FASTQ_BASE}_ \
            --quantMode GeneCounts \
            --outSAMtype BAM SortedByCoordinate
    done
---------------- TRIMMING: 27_MA_P_S38_L00[1-2]_R1 ----------------
Command Line: /Users/cliburn/work/scratch/bioinf_intro/myinfo/neb_e7600_adapters.fasta /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz -q 20 -x 0.5 -o /Users/cliburn/work/scratch/bioinf_intro/trimmed_fastqs/27_MA_P_S38_L00[1-2]_R1_001.trim.fastq.gz
Scale used: 2.2
gunzip: can't stat: /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz (/data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz.gz): No such file or directory
Phred: 64
No records in file /data/hts2018_pilot/Granek_4837_180427A5/27_MA_P_S38_L00[1-2]_R1_001.fastq.gz
---------------- MAPPING: 27_MA_P_S38_L00[1-2]_R1 ----------------
STAR: Bad Option: --runMode.
Usage:  STAR cmd [options] [-find] file1 ... filen [find expression]

Use     STAR -help
and     STAR -xhelp
to get a list of valid cmds and options.

Use     STAR H=help
to get a list of valid archive header formats.

Use     STAR diffopts=help
to get a list of valid diff options.

And let’s check the result

In [14]:
ls ${STAR_OUT}
In [15]:
head ${STAR_OUT}/27_MA_P_S38_L00?_R1_ReadsPerGene.out.tab
head: /Users/cliburn/work/scratch/bioinf_intro/star_out/27_MA_P_S38_L00?_R1_ReadsPerGene.out.tab: No such file or directory

Glob References