Set environment

In [1]:
cleanup () {
    :
}

trap "cleanup" SIGPIPE
In [3]:
set -u

Set directories

In [2]:
CURDIR=/home/jovyan/work/HTS2018
INFODIR=${CURDIR}/Info
PATHFILE=/home/jovyan/work/HTS-R25-DEV-2018/Info/PathwaysByGeneIds_Summary.txt
In [4]:
mkdir -p $INFODIR

Read data

In [5]:
head -5 $PATHFILE
Pathway Id      Map - Painted With Transformed Genes (new window)       Pathway Unique Gene Count       Genes
ec00010 "<a class=\new-window\"""
"                     data-name=\""pathway_map\"""
"                     href=\""/fungidb/app/record/pathway/KEGG/ec00010?geneStepId=110536770&exclude_incomplete_ec=0&exact_match_only=1\"">ec00010 (decorated)</a>"""   Glycolysis / Gluconeogenesis    34      CNAG_00038 | CNAG_00057 | CNAG_00515 | CNAG_00735 | CNAG_00797 | CNAG_01078 | CNAG_01120 | CNAG_01675 | CNAG_01820 | CNAG_01955 | CNAG_02035 | CNAG_02377 | CNAG_02489 | CNAG_02736 | CNAG_02903 | CNAG_03072 | CNAG_03358 | CNAG_03916 | CNAG_04217 | CNAG_04523 | CNAG_04659 | CNAG_04676 | CNAG_05059 | CNAG_05113 | CNAG_06035 | CNAG_06313 | CNAG_06628 | CNAG_06699 | CNAG_06770 | CNAG_07004 | CNAG_07316 | CNAG_07559 | CNAG_07660 | CNAG_07745
ec00020 "<a class=\new-window\"""

Extract the pathway and gene list

Since the first line is the header, we can skip it (tail -n +2 starts the output at line 2).

In [6]:
head -n 1 $PATHFILE
Pathway Id      Map - Painted With Transformed Genes (new window)       Pathway Unique Gene Count       Genes

Pathway ids

In [7]:
tail -n +2 "$PATHFILE" | cut -f 1 | grep '^\w' > "$INFODIR/pathway_names.txt"

Gene lists

In [8]:
tail -n +2 "$PATHFILE" | cut -f 4 | grep '^CNAG' > "$INFODIR/pathway_genes.txt"
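
The grep filters matter because each record in this file spans several physical lines: the quoted HTML link in the second column contains embedded newlines, so cut treats every continuation line as a row of its own. A minimal sketch on a made-up record (the temporary file and its contents are hypothetical, not taken from the real data) shows which lines survive each filter:

```shell
# Hypothetical miniature of one record: the pathway id is on the first
# physical line, the gene list is the 4th tab-separated field of the last.
demo=$(mktemp)
printf 'Pathway Id\tMap\tCount\tGenes\n' > "$demo"
printf 'ec00010\t"<a class=x"\n' >> "$demo"
printf '"  data-name=y"\n' >> "$demo"
printf '"  href=z</a>"\tGlycolysis\t2\tCNAG_00038 | CNAG_00057\n' >> "$demo"

# Field 1 of every physical line; only the record's first line
# starts with a word character, the others start with a quote.
demo_ids=$(tail -n +2 "$demo" | cut -f 1 | grep '^\w')

# Field 4 of every physical line; only the record's last line
# has a gene list there (cut passes tab-free lines through whole).
demo_genes=$(tail -n +2 "$demo" | cut -f 4 | grep '^CNAG')

echo "$demo_ids"
echo "$demo_genes"
rm -f "$demo"
```

Note that \w in grep is a GNU extension; on a strictly POSIX grep, a bracket expression such as '^[[:alnum:]_]' should behave the same way.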

Check that the files were created

Make sure the file sizes are not zero.

In [9]:
ls -l  $INFODIR/pathway_*
-rw-r--r-- 1 jovyan users 425126 Jul 17 16:48 /home/jovyan/work/HTS2018/Info/pathway_genes.txt
-rw-r--r-- 1 jovyan users  18144 Jul 17 16:48 /home/jovyan/work/HTS2018/Info/pathway_names.txt
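
The size check can also be scripted instead of read off the ls listing. A minimal sketch, assuming a bash-compatible shell: the helper name check_nonempty is hypothetical, while [ -s FILE ] is the standard test for "exists and has a size greater than zero".

```shell
# Hypothetical helper: report every argument that is missing or empty.
# Returns 0 only if all files exist and are non-empty.
check_nonempty () {
    local f status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}
```

With the files from this notebook it could be called as check_nonempty "$INFODIR/pathway_names.txt" "$INFODIR/pathway_genes.txt".
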
In [10]:
head -3 $INFODIR/pathway_names.txt
ec00010
ec00020
ec00030
In [11]:
head -3 $INFODIR/pathway_genes.txt
CNAG_00038 | CNAG_00057 | CNAG_00515 | CNAG_00735 | CNAG_00797 | CNAG_01078 | CNAG_01120 | CNAG_01675 | CNAG_01820 | CNAG_01955 | CNAG_02035 | CNAG_02377 | CNAG_02489 | CNAG_02736 | CNAG_02903 | CNAG_03072 | CNAG_03358 | CNAG_03916 | CNAG_04217 | CNAG_04523 | CNAG_04659 | CNAG_04676 | CNAG_05059 | CNAG_05113 | CNAG_06035 | CNAG_06313 | CNAG_06628 | CNAG_06699 | CNAG_06770 | CNAG_07004 | CNAG_07316 | CNAG_07559 | CNAG_07660 | CNAG_07745
CNAG_00061 | CNAG_00747 | CNAG_01120 | CNAG_01264 | CNAG_01657 | CNAG_01680 | CNAG_02736 | CNAG_03225 | CNAG_03226 | CNAG_03266 | CNAG_03375 | CNAG_03596 | CNAG_03674 | CNAG_03920 | CNAG_04189 | CNAG_04217 | CNAG_04468 | CNAG_04535 | CNAG_04640 | CNAG_05059 | CNAG_05236 | CNAG_05907 | CNAG_07004 | CNAG_07356 | CNAG_07363 | CNAG_07660 | CNAG_07851 | CNAG_07944
CNAG_00030 | CNAG_00057 | CNAG_00684 | CNAG_00827 | CNAG_01216 | CNAG_01395 | CNAG_01541 | CNAG_01675 | CNAG_01984 | CNAG_02133 | CNAG_02296 | CNAG_03048 | CNAG_03245 | CNAG_03335 | CNAG_03882 | CNAG_03916 | CNAG_04676 | CNAG_05365 | CNAG_05379 | CNAG_06313 | CNAG_06770 | CNAG_07445 | CNAG_07561

Check that both files contain the same number of lines

We need to make sure each pathway id matches exactly one gene list.

In [12]:
tail -n +2 "$PATHFILE" | cut -f 1 | grep '^\w' | wc -l
1859
In [13]:
tail -n +2 "$PATHFILE" | cut -f 4 | grep '^CNAG' | wc -l
1859
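
Equal counts can also be asserted in a script, and once they match, paste joins the two files line by line so that each pathway id sits next to its gene list. The sketch below runs on two small made-up files (the temporary files and their contents are hypothetical); with the real data, the two extracted files in $INFODIR would take their place. Equal line counts are a necessary check, not proof of correct pairing; the pairing additionally assumes both files keep the record order of the source file, which the extraction pipelines preserve.

```shell
# Hypothetical stand-ins for pathway_names.txt and pathway_genes.txt.
names=$(mktemp)
genes=$(mktemp)
printf 'ec00010\nec00020\n' > "$names"
printf 'CNAG_00038 | CNAG_00057\nCNAG_00061 | CNAG_00747\n' > "$genes"

# Read from stdin so wc prints only the number, not the file name.
n_names=$(wc -l < "$names")
n_genes=$(wc -l < "$genes")

if [ "$n_names" -eq "$n_genes" ]; then
    # paste joins line i of each file with a tab between them.
    paired=$(paste "$names" "$genes")
    echo "$paired"
else
    echo "line counts differ: $n_names vs $n_genes" >&2
fi
rm -f "$names" "$genes"
```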