The Unix Shell: File and Directory Management

Listing files

In [93]:
ls
Bash_Exercise_1.ipynb
Bash_Exercise_1_Solutions.ipynb
Bash_Exercise_1_Solutions.sh
Bash_Exercise_2.ipynb
Bash_Exercise_2_Solutoins.ipynb
Bash_Exercise_2_Solutoins.sh
Bash_Exercise_Solutions.ipynb
Bash in Jupyter.ipynb
Bash_in_Jupyter.ipynb
Bash_tutorial-Copy1.ipynb
Bash_tutorial.ipynb
Bash_tutorial_prep.ipynb
data
data2
figs
hello.txt
nursery.txt
Process-RNA-seq-counts.ipynb
ref
R_Graphic_ ggplot2.ipynb
R Graphics Base.ipynb
R_Graphics_Base.ipynb
R_Graphics_Exercise.ipynb
R_Graphics_Exercise_Solutions.ipynb
R_Graphics_Exercise_Solutions.r
R Graphics ggplot2.ipynb
R Graphics Overview.ipynb
R_Graphics_Overview.ipynb
R_tidyverse_1.ipynb
R_tidyverse_2.ipynb
R_tidyverse_3.ipynb
R_tidyyverse_Exercise.ipynb
R_tidyyverse_Exercise_Solutions.ipynb
R_tidyyverse_Exercise_Solutions.r
stderr.txt
stdout.txt
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_02___Working_with_Text.ipynb
The_Unix_Shell_03___Finding_Stuff.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Shell_Scripts.ipynb
The_Unix_Shell___Exercises.ipynb
The_Unix_Shell___Exercises_Solutions.ipynb
In [94]:
ls ref
cn.txt                                              foo_Copy.ipynb
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf  header.txt

Include hidden files

In [95]:
ls -a
.
..
Bash_Exercise_1.ipynb
Bash_Exercise_1_Solutions.ipynb
Bash_Exercise_1_Solutions.sh
Bash_Exercise_2.ipynb
Bash_Exercise_2_Solutoins.ipynb
Bash_Exercise_2_Solutoins.sh
Bash_Exercise_Solutions.ipynb
Bash in Jupyter.ipynb
Bash_in_Jupyter.ipynb
Bash_tutorial-Copy1.ipynb
Bash_tutorial.ipynb
Bash_tutorial_prep.ipynb
data
data2
figs
hello.txt
.ipynb_checkpoints
nursery.txt
Process-RNA-seq-counts.ipynb
ref
R_Graphic_ ggplot2.ipynb
R Graphics Base.ipynb
R_Graphics_Base.ipynb
R_Graphics_Exercise.ipynb
R_Graphics_Exercise_Solutions.ipynb
R_Graphics_Exercise_Solutions.r
R Graphics ggplot2.ipynb
R Graphics Overview.ipynb
R_Graphics_Overview.ipynb
R_tidyverse_1.ipynb
R_tidyverse_2.ipynb
R_tidyverse_3.ipynb
R_tidyyverse_Exercise.ipynb
R_tidyyverse_Exercise_Solutions.ipynb
R_tidyyverse_Exercise_Solutions.r
stderr.txt
stdout.txt
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_02___Working_with_Text.ipynb
The_Unix_Shell_03___Finding_Stuff.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Shell_Scripts.ipynb
The_Unix_Shell___Exercises.ipynb
The_Unix_Shell___Exercises_Solutions.ipynb

Include hidden files but exclude the current (.) and parent (..) directories

In [96]:
ls -A
Bash_Exercise_1.ipynb
Bash_Exercise_1_Solutions.ipynb
Bash_Exercise_1_Solutions.sh
Bash_Exercise_2.ipynb
Bash_Exercise_2_Solutoins.ipynb
Bash_Exercise_2_Solutoins.sh
Bash_Exercise_Solutions.ipynb
Bash in Jupyter.ipynb
Bash_in_Jupyter.ipynb
Bash_tutorial-Copy1.ipynb
Bash_tutorial.ipynb
Bash_tutorial_prep.ipynb
data
data2
figs
hello.txt
.ipynb_checkpoints
nursery.txt
Process-RNA-seq-counts.ipynb
ref
R_Graphic_ ggplot2.ipynb
R Graphics Base.ipynb
R_Graphics_Base.ipynb
R_Graphics_Exercise.ipynb
R_Graphics_Exercise_Solutions.ipynb
R_Graphics_Exercise_Solutions.r
R Graphics ggplot2.ipynb
R Graphics Overview.ipynb
R_Graphics_Overview.ipynb
R_tidyverse_1.ipynb
R_tidyverse_2.ipynb
R_tidyverse_3.ipynb
R_tidyyverse_Exercise.ipynb
R_tidyyverse_Exercise_Solutions.ipynb
R_tidyyverse_Exercise_Solutions.r
stderr.txt
stdout.txt
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_02___Working_with_Text.ipynb
The_Unix_Shell_03___Finding_Stuff.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Shell_Scripts.ipynb
The_Unix_Shell___Exercises.ipynb
The_Unix_Shell___Exercises_Solutions.ipynb
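
The difference between ls, ls -A and ls -a is easiest to see in a small directory. The following is a minimal sketch, assuming a throwaway directory from mktemp and two hypothetical file names; it counts the entries each form reports.

```shell
# Scratch directory with hypothetical file names, so nothing real is touched.
scratch=$(mktemp -d)
touch "$scratch/visible.txt" "$scratch/.hidden"

# When its output is piped, ls prints one entry per line, so wc -l counts entries.
plain=$(( $(ls "$scratch" | wc -l) ))          # visible.txt only
almost_all=$(( $(ls -A "$scratch" | wc -l) ))  # adds .hidden
all=$(( $(ls -a "$scratch" | wc -l) ))         # adds . and .. as well

echo "ls: $plain, ls -A: $almost_all, ls -a: $all"
rm -r "$scratch"
```

The counts differ by exactly the hidden file and then the two special entries . and ..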

Show details

In [97]:
ls -l
total 4380
-rw-r--r-- 1 jovyan  1000    9683 Jul  9 12:39 Bash_Exercise_1.ipynb
-rw-r--r-- 1 jovyan  1000   15460 Jul  9 12:39 Bash_Exercise_1_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000    1323 Jul  9 12:39 Bash_Exercise_1_Solutions.sh
-rw-r--r-- 1 jovyan  1000    6217 Jul  9 12:39 Bash_Exercise_2.ipynb
-rw-r--r-- 1 jovyan  1000   12757 Jul  9 12:39 Bash_Exercise_2_Solutoins.ipynb
-rw-r--r-- 1 jovyan  1000    1126 Jul  9 12:39 Bash_Exercise_2_Solutoins.sh
-rw-r--r-- 1 jovyan  1000   15448 Jul  1 21:42 Bash_Exercise_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000   65260 Jun 27 21:28 Bash in Jupyter.ipynb
-rw-r--r-- 1 jovyan  1000   65260 Jul  9 12:39 Bash_in_Jupyter.ipynb
-rw-r--r-- 1 jovyan users   36489 Jul  9 14:29 Bash_tutorial-Copy1.ipynb
-rw-r--r-- 1 jovyan  1000    5967 Jul  9 12:39 Bash_tutorial.ipynb
-rw-r--r-- 1 jovyan  1000   29522 Jul  9 12:39 Bash_tutorial_prep.ipynb
drwxr-xr-x 2 jovyan users    4096 Jul  5 15:01 data
drwxr-xr-x 2 jovyan users    4096 Jul  9 14:59 data2
drwxr-xr-x 2 jovyan  1000    4096 Jul  5 15:01 figs
-rw-r--r-- 1 jovyan users      45 Jul  9 15:15 hello.txt
-rw-r--r-- 1 jovyan users      24 Jul  9 15:17 nursery.txt
-rw-r--r-- 1 jovyan  1000   12634 Jul  9 12:39 Process-RNA-seq-counts.ipynb
drwxr-xr-x 2 jovyan users    4096 Jul  9 15:00 ref
-rw-r--r-- 1 jovyan  1000 1398902 Jul  9 12:39 R_Graphic_ ggplot2.ipynb
-rw-r--r-- 1 jovyan  1000  155396 Jun 27 21:28 R Graphics Base.ipynb
-rw-r--r-- 1 jovyan  1000  155396 Jul  9 12:39 R_Graphics_Base.ipynb
-rw-r--r-- 1 jovyan  1000   10670 Jul  9 12:39 R_Graphics_Exercise.ipynb
-rw-r--r-- 1 jovyan  1000  195947 Jul  9 12:39 R_Graphics_Exercise_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000     938 Jul  9 12:39 R_Graphics_Exercise_Solutions.r
-rw-r--r-- 1 jovyan  1000 1398902 Jun 27 21:28 R Graphics ggplot2.ipynb
-rw-r--r-- 1 jovyan  1000  158687 Jun 27 21:28 R Graphics Overview.ipynb
-rw-r--r-- 1 jovyan  1000  158687 Jul  9 12:39 R_Graphics_Overview.ipynb
-rw-r--r-- 1 jovyan  1000  106176 Jul  9 12:39 R_tidyverse_1.ipynb
-rw-r--r-- 1 jovyan  1000   82867 Jul  9 12:39 R_tidyverse_2.ipynb
-rw-r--r-- 1 jovyan  1000  126650 Jul  9 12:39 R_tidyverse_3.ipynb
-rw-r--r-- 1 jovyan  1000   29862 Jul  9 12:39 R_tidyyverse_Exercise.ipynb
-rw-r--r-- 1 jovyan  1000    7260 Jul  9 12:39 R_tidyyverse_Exercise_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000      70 Jul  9 12:39 R_tidyyverse_Exercise_Solutions.r
-rw-r--r-- 1 jovyan users      76 Jul  9 15:22 stderr.txt
-rw-r--r-- 1 jovyan users       0 Jul  9 15:22 stdout.txt
-rw-r--r-- 1 jovyan  1000   53864 Jul  9 15:12 The_Unix_Shell_01___File_and_Directory_Management.ipynb
-rw-r--r-- 1 jovyan  1000    7306 Jul  9 15:25 The_Unix_Shell_02___Working_with_Text.ipynb
-rw-r--r-- 1 jovyan  1000   19041 Jul  9 15:35 The_Unix_Shell_03___Finding_Stuff.ipynb
-rw-r--r-- 1 jovyan  1000   16736 Jul  9 15:51 The_Unix_Shell_04___Regular_Expresssions.ipynb
-rw-r--r-- 1 jovyan  1000   13236 Jul  9 12:39 The_Unix_Shell_05___Shell_Scripts.ipynb
-rw-r--r-- 1 jovyan  1000    4928 Jul  9 12:39 The_Unix_Shell___Exercises.ipynb
-rw-r--r-- 1 jovyan  1000    9293 Jul  9 12:39 The_Unix_Shell___Exercises_Solutions.ipynb

Show only directories

In [98]:
ls -d */
data/  data2/  figs/  ref/

Alternative using grep

In [99]:
ls -l | grep -E '^d'
drwxr-xr-x 2 jovyan users    4096 Jul  5 15:01 data
drwxr-xr-x 2 jovyan users    4096 Jul  9 14:59 data2
drwxr-xr-x 2 jovyan  1000    4096 Jul  5 15:01 figs
drwxr-xr-x 2 jovyan users    4096 Jul  9 15:00 ref

Show only files (every line of the listing that does not start with d)

In [100]:
ls -l | grep -Ev '^d'
total 4380
-rw-r--r-- 1 jovyan  1000    9683 Jul  9 12:39 Bash_Exercise_1.ipynb
-rw-r--r-- 1 jovyan  1000   15460 Jul  9 12:39 Bash_Exercise_1_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000    1323 Jul  9 12:39 Bash_Exercise_1_Solutions.sh
-rw-r--r-- 1 jovyan  1000    6217 Jul  9 12:39 Bash_Exercise_2.ipynb
-rw-r--r-- 1 jovyan  1000   12757 Jul  9 12:39 Bash_Exercise_2_Solutoins.ipynb
-rw-r--r-- 1 jovyan  1000    1126 Jul  9 12:39 Bash_Exercise_2_Solutoins.sh
-rw-r--r-- 1 jovyan  1000   15448 Jul  1 21:42 Bash_Exercise_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000   65260 Jun 27 21:28 Bash in Jupyter.ipynb
-rw-r--r-- 1 jovyan  1000   65260 Jul  9 12:39 Bash_in_Jupyter.ipynb
-rw-r--r-- 1 jovyan users   36489 Jul  9 14:29 Bash_tutorial-Copy1.ipynb
-rw-r--r-- 1 jovyan  1000    5967 Jul  9 12:39 Bash_tutorial.ipynb
-rw-r--r-- 1 jovyan  1000   29522 Jul  9 12:39 Bash_tutorial_prep.ipynb
-rw-r--r-- 1 jovyan users      45 Jul  9 15:15 hello.txt
-rw-r--r-- 1 jovyan users      24 Jul  9 15:17 nursery.txt
-rw-r--r-- 1 jovyan  1000   12634 Jul  9 12:39 Process-RNA-seq-counts.ipynb
-rw-r--r-- 1 jovyan  1000 1398902 Jul  9 12:39 R_Graphic_ ggplot2.ipynb
-rw-r--r-- 1 jovyan  1000  155396 Jun 27 21:28 R Graphics Base.ipynb
-rw-r--r-- 1 jovyan  1000  155396 Jul  9 12:39 R_Graphics_Base.ipynb
-rw-r--r-- 1 jovyan  1000   10670 Jul  9 12:39 R_Graphics_Exercise.ipynb
-rw-r--r-- 1 jovyan  1000  195947 Jul  9 12:39 R_Graphics_Exercise_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000     938 Jul  9 12:39 R_Graphics_Exercise_Solutions.r
-rw-r--r-- 1 jovyan  1000 1398902 Jun 27 21:28 R Graphics ggplot2.ipynb
-rw-r--r-- 1 jovyan  1000  158687 Jun 27 21:28 R Graphics Overview.ipynb
-rw-r--r-- 1 jovyan  1000  158687 Jul  9 12:39 R_Graphics_Overview.ipynb
-rw-r--r-- 1 jovyan  1000  106176 Jul  9 12:39 R_tidyverse_1.ipynb
-rw-r--r-- 1 jovyan  1000   82867 Jul  9 12:39 R_tidyverse_2.ipynb
-rw-r--r-- 1 jovyan  1000  126650 Jul  9 12:39 R_tidyverse_3.ipynb
-rw-r--r-- 1 jovyan  1000   29862 Jul  9 12:39 R_tidyyverse_Exercise.ipynb
-rw-r--r-- 1 jovyan  1000    7260 Jul  9 12:39 R_tidyyverse_Exercise_Solutions.ipynb
-rw-r--r-- 1 jovyan  1000      70 Jul  9 12:39 R_tidyyverse_Exercise_Solutions.r
-rw-r--r-- 1 jovyan users      76 Jul  9 15:22 stderr.txt
-rw-r--r-- 1 jovyan users       0 Jul  9 15:22 stdout.txt
-rw-r--r-- 1 jovyan  1000   53864 Jul  9 15:12 The_Unix_Shell_01___File_and_Directory_Management.ipynb
-rw-r--r-- 1 jovyan  1000    7306 Jul  9 15:25 The_Unix_Shell_02___Working_with_Text.ipynb
-rw-r--r-- 1 jovyan  1000   19041 Jul  9 15:35 The_Unix_Shell_03___Finding_Stuff.ipynb
-rw-r--r-- 1 jovyan  1000   16736 Jul  9 15:51 The_Unix_Shell_04___Regular_Expresssions.ipynb
-rw-r--r-- 1 jovyan  1000   13236 Jul  9 12:39 The_Unix_Shell_05___Shell_Scripts.ipynb
-rw-r--r-- 1 jovyan  1000    4928 Jul  9 12:39 The_Unix_Shell___Exercises.ipynb
-rw-r--r-- 1 jovyan  1000    9293 Jul  9 12:39 The_Unix_Shell___Exercises_Solutions.ipynb
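
Filtering ls -l with grep is fine for a quick look, but it also lets the total line through and depends on the listing format. As a hedged alternative, the sketch below lets the shell test each entry directly; the scratch directory and the entry names in it are made up for illustration.

```shell
# Scratch directory with hypothetical entries, including a name with spaces.
scratch=$(mktemp -d)
mkdir "$scratch/data" "$scratch/figs"
touch "$scratch/hello.txt" "$scratch/R Graphics Base.ipynb"

files=0
dirs=0
for entry in "$scratch"/*; do
    if [ -d "$entry" ]; then        # -d is true for directories
        dirs=$((dirs + 1))
    elif [ -f "$entry" ]; then      # -f is true for regular files
        files=$((files + 1))
    fi
done

echo "directories: $dirs, regular files: $files"
rm -r "$scratch"
```

Quoting "$entry" keeps file names that contain spaces in one piece.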

Sort by last modified time

In [101]:
ls -lt | head -3
total 4380
-rw-r--r-- 1 jovyan  1000   16736 Jul  9 15:51 The_Unix_Shell_04___Regular_Expresssions.ipynb
-rw-r--r-- 1 jovyan  1000   19041 Jul  9 15:35 The_Unix_Shell_03___Finding_Stuff.ipynb

Human readable output

In [102]:
ls -lth | head -3
total 4.3M
-rw-r--r-- 1 jovyan  1000  17K Jul  9 15:51 The_Unix_Shell_04___Regular_Expresssions.ipynb
-rw-r--r-- 1 jovyan  1000  19K Jul  9 15:35 The_Unix_Shell_03___Finding_Stuff.ipynb
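
To see the effect of -t without relying on the real notebook files, here is a small sketch that sets known modification times with touch -t (format CCYYMMDDhhmm). The directory and file names are hypothetical.

```shell
scratch=$(mktemp -d)

# Give two files known, different modification times.
touch -t 201807091200 "$scratch/old.txt"
touch -t 201807091500 "$scratch/new.txt"

newest=$(ls -t "$scratch" | head -n 1)    # -t: most recently modified first
oldest=$(ls -tr "$scratch" | head -n 1)   # -r reverses the order

echo "newest: $newest, oldest: $oldest"
rm -r "$scratch"
```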

Recursive listing

In [103]:
ls -R
.:
Bash_Exercise_1.ipynb
Bash_Exercise_1_Solutions.ipynb
Bash_Exercise_1_Solutions.sh
Bash_Exercise_2.ipynb
Bash_Exercise_2_Solutoins.ipynb
Bash_Exercise_2_Solutoins.sh
Bash_Exercise_Solutions.ipynb
Bash in Jupyter.ipynb
Bash_in_Jupyter.ipynb
Bash_tutorial-Copy1.ipynb
Bash_tutorial.ipynb
Bash_tutorial_prep.ipynb
data
data2
figs
hello.txt
nursery.txt
Process-RNA-seq-counts.ipynb
ref
R_Graphic_ ggplot2.ipynb
R Graphics Base.ipynb
R_Graphics_Base.ipynb
R_Graphics_Exercise.ipynb
R_Graphics_Exercise_Solutions.ipynb
R_Graphics_Exercise_Solutions.r
R Graphics ggplot2.ipynb
R Graphics Overview.ipynb
R_Graphics_Overview.ipynb
R_tidyverse_1.ipynb
R_tidyverse_2.ipynb
R_tidyverse_3.ipynb
R_tidyyverse_Exercise.ipynb
R_tidyyverse_Exercise_Solutions.ipynb
R_tidyyverse_Exercise_Solutions.r
stderr.txt
stdout.txt
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_02___Working_with_Text.ipynb
The_Unix_Shell_03___Finding_Stuff.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Shell_Scripts.ipynb
The_Unix_Shell___Exercises.ipynb
The_Unix_Shell___Exercises_Solutions.ipynb

./data:
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv

./data2:
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv

./figs:
fig1.png  fig2.png  fig3.png  fig4.png  fig5.png

./ref:
cn.txt                                              foo_Copy.ipynb
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf  header.txt

Globbing

The use of wildcards to specify Unix paths is known as globbing.

* represents any number of characters (including none)

In [104]:
ls *ipynb
Bash_Exercise_1.ipynb
Bash_Exercise_1_Solutions.ipynb
Bash_Exercise_2.ipynb
Bash_Exercise_2_Solutoins.ipynb
Bash_Exercise_Solutions.ipynb
Bash in Jupyter.ipynb
Bash_in_Jupyter.ipynb
Bash_tutorial-Copy1.ipynb
Bash_tutorial.ipynb
Bash_tutorial_prep.ipynb
Process-RNA-seq-counts.ipynb
R_Graphic_ ggplot2.ipynb
R Graphics Base.ipynb
R_Graphics_Base.ipynb
R_Graphics_Exercise.ipynb
R_Graphics_Exercise_Solutions.ipynb
R Graphics ggplot2.ipynb
R Graphics Overview.ipynb
R_Graphics_Overview.ipynb
R_tidyverse_1.ipynb
R_tidyverse_2.ipynb
R_tidyverse_3.ipynb
R_tidyyverse_Exercise.ipynb
R_tidyyverse_Exercise_Solutions.ipynb
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_02___Working_with_Text.ipynb
The_Unix_Shell_03___Finding_Stuff.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Shell_Scripts.ipynb
The_Unix_Shell___Exercises.ipynb
The_Unix_Shell___Exercises_Solutions.ipynb
In [105]:
ls *Text*
The_Unix_Shell_02___Working_with_Text.ipynb

? represents exactly one character

In [107]:
ls data
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv
In [109]:
ls data/unc_genes_v?.tsv
data/unc_genes_v1.tsv  data/unc_genes_v2.tsv

Character sets

  • [abc] represents a or b or c
  • [a-z] represents any lowercase letter
  • [!abc] represents any character except a, b or c (a leading ! negates the set)
In [110]:
ls data
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv
In [111]:
ls data/[a-d]*
data/duke_demographics.tsv  data/duke_proteins_v1.tsv
data/duke_genes_v1.tsv      data/duke_proteins_v2.tsv
data/duke_genes_v2.tsv
In [112]:
ls data/[!a-d]*
data/gene_counts.txt       data/unc_genes_v1.tsv  data/unc_proteins_v1.tsv
data/unc_demographics.tsv  data/unc_genes_v2.tsv  data/unc_proteins_v2.tsv
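
The wildcards can be compared side by side by counting matches. This is a minimal sketch, assuming a scratch directory with hypothetical file names modelled on the data folder; set -- stores the matches as positional parameters and $# counts them.

```shell
scratch=$(mktemp -d)
touch "$scratch/duke_genes_v1.tsv" "$scratch/duke_genes_v2.tsv" \
      "$scratch/unc_genes_v1.tsv" "$scratch/unc_genes_v10.tsv" \
      "$scratch/gene_counts.txt"

# set -- replaces the positional parameters with the glob matches.
set -- "$scratch"/*.tsv;            n_star=$#      # any run of characters before .tsv
set -- "$scratch"/unc_genes_v?.tsv; n_question=$#  # exactly one character after v
set -- "$scratch"/[a-d]*;           n_set=$#       # names starting with a, b, c or d
set -- "$scratch"/[!a-d]*;          n_negated=$#   # names starting with anything else

echo "*: $n_star  ?: $n_question  [a-d]: $n_set  [!a-d]: $n_negated"
rm -r "$scratch"
```

Note that unc_genes_v10.tsv is matched by * but not by ?, because ? stands for exactly one character.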

Directory navigation

Show current directory

In [113]:
pwd
/home/jovyan/work/HTS2018-notebooks/cliburn

Move to parent directory

In [114]:
cd ..
In [115]:
pwd
/home/jovyan/work/HTS2018-notebooks

Move back to the previous directory

In [116]:
cd -
/home/jovyan/work/HTS2018-notebooks/cliburn

Move using relative addressing

In [117]:
cd data
In [118]:
pwd
/home/jovyan/work/HTS2018-notebooks/cliburn/data

Move using absolute addressing

Move to data folder

In [119]:
cd /home/jovyan/work/HTS2018-notebooks/cliburn/data
In [120]:
pwd
/home/jovyan/work/HTS2018-notebooks/cliburn/data

Move back to cliburn directory

In [138]:
cd /home/jovyan/work/HTS2018-notebooks/cliburn/
In [139]:
pwd
/home/jovyan/work/HTS2018-notebooks/cliburn
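
The navigation commands above can be combined in a short script. This is only a sketch, assuming a scratch directory tree whose names (cliburn, data) mirror the ones used here; it records where each cd lands.

```shell
start=$PWD                             # remember where we began
scratch=$(mktemp -d)
mkdir -p "$scratch/cliburn/data"

cd "$scratch/cliburn/data"             # absolute path: starts with /
cd ..                                  # relative path: parent directory
after_parent=$(basename "$PWD")
cd - > /dev/null                       # previous directory; cd - prints it, so silence that
after_dash=$(basename "$PWD")

echo "after cd ..: $after_parent, after cd -: $after_dash"
cd "$start"
rm -r "$scratch"
```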

Making and removing new directories

In [140]:
mkdir foo
In [141]:
ls -d */
data/  data2/  figs/  foo/  ref/

Making intermediate directories automatically

In [142]:
mkdir a/b/c/d
mkdir: cannot create directory ‘a/b/c/d’: No such file or directory

In [143]:
mkdir -p a/b/c/d
In [144]:
ls -R a
a:
b

a/b:
c

a/b/c:
d

a/b/c/d:

Deleting directories

In [145]:
rmdir foo

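rmdir only works if the directory is empty

The | cat part is not necessary on the command line; it is used here only so that Run All Cells keeps going, because Jupyter stops on non-zero exit codes. The | cat syntax “pipes” the standard output of rmdir data to the cat program, and the exit code of a pipeline is that of its last command (cat), which succeeds even though rmdir fails. The error message still appears because it is written to standard error, which the pipe does not capture.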

In [146]:
rmdir data | cat
rmdir: failed to remove ‘data’: Directory not empty
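
The effect of | cat on the exit code can be checked directly. A minimal sketch, assuming bash defaults (pipefail not set) and a hypothetical non-empty scratch directory:

```shell
scratch=$(mktemp -d)
mkdir "$scratch/data"
touch "$scratch/data/file.tsv"         # make the directory non-empty

# Without the pipe, the failing rmdir reports a non-zero exit code.
plain_status=0
rmdir "$scratch/data" 2> /dev/null || plain_status=$?

# With the pipe, the pipeline's exit code is that of cat, which succeeds.
rmdir "$scratch/data" 2> /dev/null | cat
piped_status=$?

echo "rmdir alone: $plain_status, rmdir | cat: $piped_status"
rm -r "$scratch"
```

With set -o pipefail the second status would also be non-zero, so the trick relies on the default setting.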

Remove the intermediate directories as well

In [147]:
rmdir -p a/b/c/d
In [148]:
ls -d */
data/  data2/  figs/  ref/
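
mkdir -p and rmdir -p mirror each other: one creates every missing component of a path, the other removes every component that is empty. The sketch below assumes a scratch directory and uses a relative path for rmdir -p, since with an absolute path it would also try to remove the parent directories above the scratch area.

```shell
scratch=$(mktemp -d)

mkdir -p "$scratch/a/b/c/d"            # creates a, a/b, a/b/c and a/b/c/d
mkdir -p "$scratch/a/b/c/d"            # no error if the path already exists
if [ -d "$scratch/a/b/c/d" ]; then created=yes; else created=no; fi

# Run rmdir -p from inside the scratch directory so it stops at a.
( cd "$scratch" && rmdir -p a/b/c/d )
if [ -e "$scratch/a" ]; then removed=no; else removed=yes; fi

echo "created: $created, removed: $removed"
rmdir "$scratch"                       # scratch itself is empty again
```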

Working with files

Making an empty file

In [149]:
touch foo.txt
In [150]:
ls *txt
foo.txt  hello.txt  nursery.txt  stderr.txt  stdout.txt

Deleting a file

In [151]:
rm foo.txt
In [152]:
ls *txt
hello.txt  nursery.txt  stderr.txt  stdout.txt
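
touch creates a file only if it does not exist; on an existing file it updates the timestamp and leaves the contents alone. A small sketch with a hypothetical file in a scratch directory:

```shell
scratch=$(mktemp -d)

touch "$scratch/foo.txt"                            # new, empty file
size_new=$(( $(wc -c < "$scratch/foo.txt") ))       # size in bytes

echo "content" > "$scratch/foo.txt"                 # 7 characters plus a newline
touch "$scratch/foo.txt"                            # existing file: contents untouched
size_after=$(( $(wc -c < "$scratch/foo.txt") ))

rm "$scratch/foo.txt"
if [ -e "$scratch/foo.txt" ]; then deleted=no; else deleted=yes; fi

echo "new: $size_new bytes, after second touch: $size_after bytes, deleted: $deleted"
rmdir "$scratch"
```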

Viewing a file

In [153]:
ls data
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv
In [154]:
cat date/duke_genes | head
cat: date/duke_genes: No such file or directory
In [155]:
head -n 3 ref/header.txt
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
In [156]:
tail -n 3 ref/header.txt
Mt      ena     CDS     24096   24848   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR99114"; protein_version "1";
Mt      ena     start_codon     24096   24098   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     stop_codon      24849   24851   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

tail can start from a specified line number with -n +N (here, from line 4 onwards)

In [157]:
tail -n +4 ref/header.txt | head
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5322    5422    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-2";
1       ena     CDS     5322    5422    .       -       1       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     exon    3958    5263    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-3";
tail: error writing ‘standard output’: Broken pipe
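
The line arithmetic of head and tail is easier to follow on a file whose lines are numbered. A minimal sketch, assuming a temporary six-line file created just for this purpose:

```shell
tmpfile=$(mktemp)
printf 'line%s\n' 1 2 3 4 5 6 > "$tmpfile"   # printf repeats the format for each argument

first_two=$(head -n 2 "$tmpfile")            # first 2 lines
last_two=$(tail -n 2 "$tmpfile")             # last 2 lines
from_fourth=$(tail -n +4 "$tmpfile")         # line 4 to the end (lines are counted from 1)

printf '%s\n' "$from_fourth"
rm "$tmpfile"
```

So tail -n +4 skips the first three lines, which is a common way to drop a fixed-length header.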

Copying and moving files

In [158]:
ls
Bash_Exercise_1.ipynb
Bash_Exercise_1_Solutions.ipynb
Bash_Exercise_1_Solutions.sh
Bash_Exercise_2.ipynb
Bash_Exercise_2_Solutoins.ipynb
Bash_Exercise_2_Solutoins.sh
Bash_Exercise_Solutions.ipynb
Bash in Jupyter.ipynb
Bash_in_Jupyter.ipynb
Bash_tutorial-Copy1.ipynb
Bash_tutorial.ipynb
Bash_tutorial_prep.ipynb
data
data2
figs
hello.txt
nursery.txt
Process-RNA-seq-counts.ipynb
ref
R_Graphic_ ggplot2.ipynb
R Graphics Base.ipynb
R_Graphics_Base.ipynb
R_Graphics_Exercise.ipynb
R_Graphics_Exercise_Solutions.ipynb
R_Graphics_Exercise_Solutions.r
R Graphics ggplot2.ipynb
R Graphics Overview.ipynb
R_Graphics_Overview.ipynb
R_tidyverse_1.ipynb
R_tidyverse_2.ipynb
R_tidyverse_3.ipynb
R_tidyyverse_Exercise.ipynb
R_tidyyverse_Exercise_Solutions.ipynb
R_tidyyverse_Exercise_Solutions.r
stderr.txt
stdout.txt
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_02___Working_with_Text.ipynb
The_Unix_Shell_03___Finding_Stuff.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Shell_Scripts.ipynb
The_Unix_Shell___Exercises.ipynb
The_Unix_Shell___Exercises_Solutions.ipynb

Copying files

In [159]:
cp "The_Unix_Shell_01___File_and_Directory_Management.ipynb" foo.ipynb
In [160]:
ls f*ipynb
foo.ipynb

Copying directories (Recursive copy)

In [161]:
cp -R data data2
In [162]:
ls -R data2
data2:
data                   duke_proteins_v1.tsv  unc_genes_v1.tsv
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv

data2/data:
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv
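
Note that the listing shows a nested data2/data directory: because data2 already existed, cp -R copied data into it rather than over it. The sketch below reproduces both cases in a scratch directory; the names copy and gene_counts.txt are only illustrative.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/data"
touch "$scratch/data/gene_counts.txt"

# Destination does not exist yet: copy becomes a duplicate of data.
cp -R "$scratch/data" "$scratch/copy"
if [ -f "$scratch/copy/gene_counts.txt" ]; then first=duplicate; else first=unexpected; fi

# Destination now exists: data is copied inside it, giving copy/data.
cp -R "$scratch/data" "$scratch/copy"
if [ -f "$scratch/copy/data/gene_counts.txt" ]; then second=nested; else second=unexpected; fi

echo "first copy: $first, second copy: $second"
rm -r "$scratch"
```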

Renaming a file

In [163]:
mv foo.ipynb foo_Copy.ipynb
In [164]:
ls foo*
foo_Copy.ipynb

Move a file to a new location

In [165]:
mv foo_Copy.ipynb ref/
In [166]:
ls ref
cn.txt                                              foo_Copy.ipynb
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf  header.txt

File compression and archival

Combine multiple files into a single archive file

In [167]:
ls data
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv
In [168]:
man tar | head -n 20
bash: man: command not found
In [169]:
tar -cvf data.tar data
data/
data/unc_proteins_v2.tsv
data/duke_genes_v1.tsv
data/duke_demographics.tsv
data/unc_genes_v2.tsv
data/gene_counts.txt
data/unc_proteins_v1.tsv
data/duke_proteins_v2.tsv
data/unc_demographics.tsv
data/unc_genes_v1.tsv
data/duke_proteins_v1.tsv
data/duke_genes_v2.tsv
In [170]:
ls *tar
data.tar
In [171]:
ls -d */
data/  data2/  figs/  ref/
In [172]:
rm -rf data/
In [173]:
ls -d */
data2/  figs/  ref/

Compress the archive

In [174]:
gzip data.tar
In [175]:
ls *gz
data.tar.gz

Uncompress

In [176]:
gunzip data.tar.gz
In [177]:
ls *tar
data.tar

Recover original files

In [178]:
tar -xvf data.tar
data/
data/unc_proteins_v2.tsv
data/duke_genes_v1.tsv
data/duke_demographics.tsv
data/unc_genes_v2.tsv
data/gene_counts.txt
data/unc_proteins_v1.tsv
data/duke_proteins_v2.tsv
data/unc_demographics.tsv
data/unc_genes_v1.tsv
data/duke_proteins_v1.tsv
data/duke_genes_v2.tsv
In [179]:
ls data/
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv
In [180]:
rm data.tar

Archive and compress in one step

In [181]:
tar -cvzf data.tar.gz data
data/
data/unc_proteins_v2.tsv
data/duke_genes_v1.tsv
data/duke_demographics.tsv
data/unc_genes_v2.tsv
data/gene_counts.txt
data/unc_proteins_v1.tsv
data/duke_proteins_v2.tsv
data/unc_demographics.tsv
data/unc_genes_v1.tsv
data/duke_proteins_v1.tsv
data/duke_genes_v2.tsv
In [182]:
rm -rf data/
In [183]:
ls data*
data.tar.gz

data2:
data                   duke_proteins_v1.tsv  unc_genes_v1.tsv
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv

Uncompress and recover

In [184]:
tar -xvzf data.tar.gz
data/
data/unc_proteins_v2.tsv
data/duke_genes_v1.tsv
data/duke_demographics.tsv
data/unc_genes_v2.tsv
data/gene_counts.txt
data/unc_proteins_v1.tsv
data/duke_proteins_v2.tsv
data/unc_demographics.tsv
data/unc_genes_v1.tsv
data/duke_proteins_v1.tsv
data/duke_genes_v2.tsv
In [185]:
ls data*
data.tar.gz

data:
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
duke_proteins_v1.tsv   unc_genes_v1.tsv

data2:
data                   duke_proteins_v1.tsv  unc_genes_v1.tsv
duke_demographics.tsv  duke_proteins_v2.tsv  unc_genes_v2.tsv
duke_genes_v1.tsv      gene_counts.txt       unc_proteins_v1.tsv
duke_genes_v2.tsv      unc_demographics.tsv  unc_proteins_v2.tsv
In [186]:
rm data.tar.gz
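
Since the originals are deleted with rm -rf right after archiving, it can be worth checking the archive first. The following is a sketch under assumptions: a scratch directory, one hypothetical data file, tar -t to list the archive without extracting, and -C to choose the directory tar works in.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/data" "$scratch/restore"
echo "gene1 10" > "$scratch/data/gene_counts.txt"

# Create a compressed archive of data; -C makes tar change into scratch first.
tar -czf "$scratch/data.tar.gz" -C "$scratch" data

# List the contents (-t) before deleting anything.
tar -tzf "$scratch/data.tar.gz"

# Extract into a separate directory and compare with the original.
tar -xzf "$scratch/data.tar.gz" -C "$scratch/restore"
if cmp -s "$scratch/data/gene_counts.txt" "$scratch/restore/data/gene_counts.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi

echo "round trip: $roundtrip"
rm -r "$scratch"
```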

Checksums

When working with genomic data, we deal with very large files. There is a small risk that these files will be corrupted over time or during data transfer. To detect such changes, we use a “checksum” function: a function that generates a long, seemingly random number, called a checksum, from the contents of the file. When the file contents change, so does the checksum. In theory, two different files can generate the same checksum, but in practice the probability of this happening by accident is too small to worry about.

There are several different algorithms for generating checksums, and at least three Unix commands to do so (cksum, md5sum and sha1sum), but they all work very similarly for our purposes.

The strategy is:

  • Generate and store a checksum together with a data file whose integrity you care about
  • When you use or download the data, re-generate the checksum (using the same algorithm, e.g. MD5) and compare it with the stored checksum
In [193]:
cat << EOF > hello.txt
1 Hello, bash
2 Hello, again
3 Hello
4 again
EOF
In [194]:
cat hello.txt
1 Hello, bash
2 Hello, again
3 Hello
4 again
In [195]:
cksum hello.txt
1567754519 45 hello.txt
In [196]:
md5sum hello.txt
a68554400613f5445c13c57907e976ed  hello.txt
In [197]:
sha1sum hello.txt
57eae725420bf0075d17f849cc8e75379bea6eb6  hello.txt

If we alter hello.txt in any way, the checksum will be different

In [198]:
cat hello.txt
1 Hello, bash
2 Hello, again
3 Hello
4 again
In [199]:
md5sum hello.txt > hello.md5
In [200]:
cat hello.md5
a68554400613f5445c13c57907e976ed  hello.txt

Now make a small change to hello.txt

In [201]:
cat > test1.txt << EOF
One, two buckle my shoe
Three, four lock the door
EOF
In [202]:
cat > hello.txt << EOF
1 Hello, bash
2 Hella, again
3 Hello
4 again
EOF
In [203]:
cat hello.txt
1 Hello, bash
2 Hella, again
3 Hello
4 again
In [204]:
md5sum hello.txt
0d8c8172f2a69f5845f21cb03a436be3  hello.txt
In [205]:
md5sum -c hello.md5
hello.txt: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

Restore original text

In [206]:
cat > hello.txt << EOF
1 Hello, bash
2 Hello, again
3 Hello
4 again
EOF
In [207]:
md5sum hello.txt > test.md5
In [208]:
md5sum -c hello.md5
hello.txt: OK
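
In a script, the exit code of md5sum -c is usually more convenient than its printed output. A minimal sketch, assuming GNU md5sum and hypothetical files in a scratch directory:

```shell
scratch=$(mktemp -d)
printf '1 Hello, bash\n2 Hello, again\n' > "$scratch/hello.txt"
md5sum "$scratch/hello.txt" > "$scratch/hello.md5"   # store the reference checksum

# md5sum -c exits with 0 when every listed file matches, non-zero otherwise.
if md5sum -c "$scratch/hello.md5" > /dev/null 2>&1; then before=intact; else before=changed; fi

# Change a single character and check again.
printf '1 Hello, bash\n2 Hella, again\n' > "$scratch/hello.txt"
if md5sum -c "$scratch/hello.md5" > /dev/null 2>&1; then after=intact; else after=changed; fi

echo "before edit: $before, after edit: $after"
rm -r "$scratch"
```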

Checksums for multiple files

In [209]:
echo "aaaaa" > a.txt
echo "bbbbb" > b.txt
echo "ccccc" > c.txt

Generate md5 checksum file

In [210]:
md5sum a.txt b.txt c.txt > MD5_CHECKSUM
In [211]:
cat MD5_CHECKSUM
4c850c5b3b2756e67a91bad8e046ddac  a.txt
369d9bb6f2313be57f7a55502eb420ba  b.txt
34d9ae3c9b1fa64d91bdb00f3c0d6cd5  c.txt

Modify one file

In [212]:
echo "bbcbb" > b.txt

Check file integrity for all files

In [213]:
md5sum -c MD5_CHECKSUM
a.txt: OK
b.txt: FAILED
c.txt: OK
md5sum: WARNING: 1 computed checksum did NOT match
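
With many files, only the failures are usually of interest. One possible approach, sketched here with GNU md5sum and the same hypothetical a.txt, b.txt and c.txt recreated in a scratch directory, is to filter out the OK lines:

```shell
scratch=$(mktemp -d)
echo "aaaaa" > "$scratch/a.txt"
echo "bbbbb" > "$scratch/b.txt"
echo "ccccc" > "$scratch/c.txt"

failed=$(
    cd "$scratch" || exit 1
    md5sum a.txt b.txt c.txt > MD5_CHECKSUM
    echo "bbcbb" > b.txt                     # modify one file
    # LC_ALL=C keeps the messages untranslated; the summary warning goes to stderr.
    LC_ALL=C md5sum -c MD5_CHECKSUM 2> /dev/null | grep -v ': OK$'
)

echo "$failed"
rm -r "$scratch"
```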