Set environment¶
In [1]:
cleanup () {
:
}
trap "cleanup" SIGPIPE
In [3]:
set -u
Set directories
In [2]:
CURDIR=/home/jovyan/work/HTS2018
INFODIR=${CURDIR}/Info
PATHFILE=/home/jovyan/work/HTS-R25-DEV-2018/Info/PathwaysByGeneIds_Summary.txt
In [4]:
mkdir -p $INFODIR
Read data¶
In [5]:
head -5 $PATHFILE
Pathway Id Map - Painted With Transformed Genes (new window) Pathway Unique Gene Count Genes
ec00010 "<a class=\new-window\"""
" data-name=\""pathway_map\"""
" href=\""/fungidb/app/record/pathway/KEGG/ec00010?geneStepId=110536770&exclude_incomplete_ec=0&exact_match_only=1\"">ec00010 (decorated)</a>""" Glycolysis / Gluconeogenesis 34 CNAG_00038 | CNAG_00057 | CNAG_00515 | CNAG_00735 | CNAG_00797 | CNAG_01078 | CNAG_01120 | CNAG_01675 | CNAG_01820 | CNAG_01955 | CNAG_02035 | CNAG_02377 | CNAG_02489 | CNAG_02736 | CNAG_02903 | CNAG_03072 | CNAG_03358 | CNAG_03916 | CNAG_04217 | CNAG_04523 | CNAG_04659 | CNAG_04676 | CNAG_05059 | CNAG_05113 | CNAG_06035 | CNAG_06313 | CNAG_06628 | CNAG_06699 | CNAG_06770 | CNAG_07004 | CNAG_07316 | CNAG_07559 | CNAG_07660 | CNAG_07745
ec00020 "<a class=\new-window\"""
Extract the pathway and gene list¶
The first line is the header, so we can skip it when extracting the columns.
In [6]:
head -n 1 $PATHFILE
Pathway Id Map - Painted With Transformed Genes (new window) Pathway Unique Gene Count Genes
pathway¶
In [7]:
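# Skip the header, take column 1, and keep only lines that start with a word
# character; this drops the continuation lines of the multi-line HTML field.
tail -n +2 "$PATHFILE" | cut -f 1 | grep '^\w' > "$INFODIR/pathway_names.txt"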
gene list¶
In [8]:
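# Skip the header, take column 4, and keep only lines that start with a gene ID;
# on the last physical line of each record, column 4 holds the gene list.
tail -n +2 "$PATHFILE" | cut -f 4 | grep '^CNAG' > "$INFODIR/pathway_genes.txt"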
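The `head -5` output above suggests that each record spans three physical lines, because the quoted HTML field contains line breaks. The following is a minimal sketch of why the `grep` filters are needed, run on a hypothetical toy file (`toy_pathways.txt`, with shortened HTML and made-up gene counts) rather than on the real `$PATHFILE`. It assumes GNU `grep`, which supports `\w`.

```shell
# Hypothetical toy file mimicking the layout seen in the head output:
# one header line, then records whose quoted HTML field spans three lines.
DEMO_DIR=$(mktemp -d)
DEMO_FILE="$DEMO_DIR/toy_pathways.txt"
printf 'Pathway Id\tMap\tPathway\tUnique Gene Count\tGenes\n' > "$DEMO_FILE"
printf 'ec00010\t"<a class=new-window"\n' >> "$DEMO_FILE"
printf '" data-name=pathway_map"\n' >> "$DEMO_FILE"
printf '" href=x>ec00010</a>"\tGlycolysis\t2\tCNAG_00038 | CNAG_00057\n' >> "$DEMO_FILE"
printf 'ec00020\t"<a class=new-window"\n' >> "$DEMO_FILE"
printf '" data-name=pathway_map"\n' >> "$DEMO_FILE"
printf '" href=x>ec00020</a>"\tCitrate cycle\t2\tCNAG_00061 | CNAG_00747\n' >> "$DEMO_FILE"

# Column 1 is a pathway ID only on the first physical line of each record;
# the other lines start with a quote and are dropped by grep '^\w'.
DEMO_IDS=$(tail -n +2 "$DEMO_FILE" | cut -f 1 | grep '^\w')

# Column 4 is the gene list only on the last physical line of each record;
# cut prints tab-free lines unchanged, so grep '^CNAG' removes them.
DEMO_GENES=$(tail -n +2 "$DEMO_FILE" | cut -f 4 | grep '^CNAG')

echo "$DEMO_IDS"
echo "$DEMO_GENES"
```

Without the `grep` step, the output of `cut` would also contain the HTML continuation lines and empty lines, so the two files would no longer line up record by record.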
Check that the files were created¶
Make sure the file sizes are not zero.
In [9]:
ls -l $INFODIR/pathway_*
-rw-r--r-- 1 jovyan users 425126 Jul 17 16:48 /home/jovyan/work/HTS2018/Info/pathway_genes.txt
-rw-r--r-- 1 jovyan users 18144 Jul 17 16:48 /home/jovyan/work/HTS2018/Info/pathway_names.txt
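Reading sizes off the `ls -l` listing works, but the same check can be scripted. The sketch below uses the standard `[ -s FILE ]` test, which is true only when the file exists and is not empty. The helper name `check_nonempty` and the throwaway demo files are assumptions made for illustration; they are not part of the original notebook.

```shell
# Hypothetical helper: report whether a file exists and has a non-zero size.
check_nonempty () {
    if [ -s "$1" ]; then
        echo "OK: $1"
    else
        echo "EMPTY OR MISSING: $1" >&2
        return 1
    fi
}

# Demo on throwaway files so the sketch runs anywhere.
SIZE_DEMO_DIR=$(mktemp -d)
echo "ec00010" > "$SIZE_DEMO_DIR/pathway_names.txt"   # non-empty file
: > "$SIZE_DEMO_DIR/pathway_genes.txt"                # deliberately empty file

check_nonempty "$SIZE_DEMO_DIR/pathway_names.txt"
check_nonempty "$SIZE_DEMO_DIR/pathway_genes.txt" || echo "second file failed the check"
```

In the notebook itself, the helper could be called on `$INFODIR/pathway_names.txt` and `$INFODIR/pathway_genes.txt`.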
In [10]:
head -3 $INFODIR/pathway_names.txt
ec00010
ec00020
ec00030
In [11]:
head -3 $INFODIR/pathway_genes.txt
CNAG_00038 | CNAG_00057 | CNAG_00515 | CNAG_00735 | CNAG_00797 | CNAG_01078 | CNAG_01120 | CNAG_01675 | CNAG_01820 | CNAG_01955 | CNAG_02035 | CNAG_02377 | CNAG_02489 | CNAG_02736 | CNAG_02903 | CNAG_03072 | CNAG_03358 | CNAG_03916 | CNAG_04217 | CNAG_04523 | CNAG_04659 | CNAG_04676 | CNAG_05059 | CNAG_05113 | CNAG_06035 | CNAG_06313 | CNAG_06628 | CNAG_06699 | CNAG_06770 | CNAG_07004 | CNAG_07316 | CNAG_07559 | CNAG_07660 | CNAG_07745
CNAG_00061 | CNAG_00747 | CNAG_01120 | CNAG_01264 | CNAG_01657 | CNAG_01680 | CNAG_02736 | CNAG_03225 | CNAG_03226 | CNAG_03266 | CNAG_03375 | CNAG_03596 | CNAG_03674 | CNAG_03920 | CNAG_04189 | CNAG_04217 | CNAG_04468 | CNAG_04535 | CNAG_04640 | CNAG_05059 | CNAG_05236 | CNAG_05907 | CNAG_07004 | CNAG_07356 | CNAG_07363 | CNAG_07660 | CNAG_07851 | CNAG_07944
CNAG_00030 | CNAG_00057 | CNAG_00684 | CNAG_00827 | CNAG_01216 | CNAG_01395 | CNAG_01541 | CNAG_01675 | CNAG_01984 | CNAG_02133 | CNAG_02296 | CNAG_03048 | CNAG_03245 | CNAG_03335 | CNAG_03882 | CNAG_03916 | CNAG_04676 | CNAG_05365 | CNAG_05379 | CNAG_06313 | CNAG_06770 | CNAG_07445 | CNAG_07561
Check that both contain the same number of lines¶
We need to make sure each pathway ID is matched with exactly one gene list.
In [12]:
cat $PATHFILE | tail -n +2 | cut -f 1 | grep '^\w' | wc -l
1859
In [13]:
cat $PATHFILE | tail -n +2 | cut -f 4 | grep ^CNAG | wc -l
1859
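The comparison of the two counts can also be automated. The sketch below compares line counts and, only if they agree, joins the two files side by side with `paste`. It runs on small hypothetical demo files; the file names and the `pathway_table.txt` output are assumptions for illustration. Note that equal counts are a necessary check but do not by themselves prove the pairing is right: the pairing also relies on both files preserving the record order of the source file.

```shell
# Throwaway demo files standing in for pathway_names.txt and pathway_genes.txt.
PAIR_DEMO_DIR=$(mktemp -d)
printf 'ec00010\nec00020\n' > "$PAIR_DEMO_DIR/names.txt"
printf 'CNAG_00038 | CNAG_00057\nCNAG_00061 | CNAG_00747\n' > "$PAIR_DEMO_DIR/genes.txt"

# Reading from stdin makes wc print only the number, without the file name.
N_NAMES=$(wc -l < "$PAIR_DEMO_DIR/names.txt")
N_GENES=$(wc -l < "$PAIR_DEMO_DIR/genes.txt")

if [ "$N_NAMES" -eq "$N_GENES" ]; then
    # Line i of names.txt is joined with line i of genes.txt, tab-separated.
    paste "$PAIR_DEMO_DIR/names.txt" "$PAIR_DEMO_DIR/genes.txt" \
        > "$PAIR_DEMO_DIR/pathway_table.txt"
    cat "$PAIR_DEMO_DIR/pathway_table.txt"
else
    echo "Line counts differ: $N_NAMES names vs $N_GENES gene lists" >&2
fi
```

A quick look at the first few lines of the joined table is a simple way to spot a misaligned pairing.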