High Priority

If you have created or modified files in your Jupyter container that you would like to preserve, we recommend that you follow the instructions for Maintaining Personal Files.

  • We recommend you do this before August 20th, regardless of whether you plan to renew your Docker container or not.
  • Keep in mind that you only need to worry about this for files that you have created or modified; the course material that we created and shared with you will continue to be publicly available.

Important Notes

See details below, but please keep the following in mind:

  1. The course material will remain in the HTS2018-notebooks repo and will be publicly available in perpetuity (or for as long as https://gitlab.oit.duke.edu/ continues to exist), regardless of your affiliation with Duke (or lack thereof).
  2. The configuration and build information for the course Docker container will remain in the jupyter-HTS-2018 repo and will be publicly available in perpetuity (or for as long as https://gitlab.oit.duke.edu/ continues to exist), regardless of your affiliation with Duke (or lack thereof).
  3. The HTS 2018 Docker image will remain at https://hub.docker.com/r/mccahill/jupyter-hts-2018/ and will be publicly available in perpetuity (or for as long as https://hub.docker.com continues to exist), regardless of your affiliation with Duke (or lack thereof).

Maintaining Personal Files

Tar and Download

Tarring

The simplest thing to do is to make a tarball of the whole directory that contains your personal files. Assuming that you have just been copying and renaming files in the HTS2018-notebooks directory, you can run the following to tar the whole HTS2018-notebooks directory and save it to a file called HTS2018-notebooks.tar.gz:

In [1]:
tar -cvzf ~/work/HTS2018-notebooks.tar.gz ~/work/HTS2018-notebooks

If you only want to save your notebooks, you can use the following to grab just the notebooks from HTS2018-notebooks:

In [2]:
find ~/work/HTS2018-notebooks \
    -name "*.ipynb" \
    -not -path "*/.ipynb_checkpoints/*" \
    | tar -cvzf ~/work/HTS2018-notebooks.tar.gz -T -

And if you only modified notebooks, and saved them with a standard naming scheme (e.g. sticking __MYSTUFF__ in the name, like renaming demultiplex.ipynb to demultiplex__MYSTUFF__.ipynb), you could use the following to grab only the modified files from HTS2018-notebooks:

In [3]:
find ~/work/HTS2018-notebooks \
    -name "*__MYSTUFF__*" \
    -not -path "*/.ipynb_checkpoints/*" \
    | tar -cvzf ~/work/HTS2018-notebooks.tar.gz -T -
You can check what ended up in the tarball with:

In [4]:
tar -tvf ~/work/HTS2018-notebooks.tar.gz
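If you want to convince yourself what the find | tar pipeline will and will not pick up, here is a self-contained sketch you can run anywhere; the files and the temp directory are invented for the demo, not real course paths:

```shell
# Build a throwaway directory tree that mimics the course layout
demo=$(mktemp -d)
mkdir -p "$demo/HTS2018-notebooks/.ipynb_checkpoints"
echo original   > "$demo/HTS2018-notebooks/demultiplex.ipynb"
echo modified   > "$demo/HTS2018-notebooks/demultiplex__MYSTUFF__.ipynb"
echo checkpoint > "$demo/HTS2018-notebooks/.ipynb_checkpoints/demultiplex__MYSTUFF__.ipynb"

# Archive only the renamed copy, skipping .ipynb_checkpoints
find "$demo/HTS2018-notebooks" \
    -name "*__MYSTUFF__*" \
    -not -path "*/.ipynb_checkpoints/*" \
    | tar -czf "$demo/mystuff.tar.gz" -T -

# List the archive: exactly one member should appear
tar -tzf "$demo/mystuff.tar.gz"
```

Only demultiplex__MYSTUFF__.ipynb should be listed; the unmodified notebook and the checkpoint copy are excluded.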

Downloading the tarball

  1. Go to the file browser and click on the new file. You should be prompted to download it.
  2. Once you have the file, your system’s archive program should be able to open it and allow you to extract to a directory on your computer.

Accessing Course Material

You can access the course material in three different ways:

  1. You can browse and download the materials from the HTS2018-notebooks repo.
  2. You can clone the repo using git: git clone https://gitlab.oit.duke.edu/HTS2018/HTS2018-notebooks.git
  3. You can browse the material at http://people.duke.edu/~ccc14/duke-hts-2018/

Docker Containers

Course Instances

Duke Affiliates

Container Renewal

By August 10th you should receive an email telling you that your HTS2018 container will expire on August 20th unless you renew it. If you want to maintain your container, we strongly recommend that you mark these dates on your calendar and be proactive: if you have not received an email by August 10th, go to VM Manage and renew it. In the past course participants have lost their containers and contents because they failed to renew in time.

Reduced Capabilities

Soon after the course ends, the containers will be shifted to a server with drastically reduced resources. That means that analyses run on those containers may not work and may crash the container.

Non-affiliates

For those not affiliated with Duke, the containers will be deactivated by August 20th. If you want to save any files from the course that you have modified, you must follow the instructions for Maintaining Personal Files before August 20th.

Running the Docker Image Elsewhere

Running on your local machine

Quick Start

The script that was demonstrated in class for installing the course Jupyter image and downloading the HTS2018-notebooks repo is at https://gitlab.oit.duke.edu/HTS2018/HTS2018-notebooks/blob/master/reproducibility/prep_local_jupyter.sh. It should work in any environment that meets the requirements listed below. Run it like this (on your local computer or server):

bash prep_local_jupyter.sh TARGET_DIR

Where TARGET_DIR is the parent directory for downloading data and repos. Once you start the Jupyter container it will continue running, even if you are not actively using Jupyter.

For detailed instructions, see Install docker and Run docker below.

Requirements
  • git
  • docker
  • bash

Install docker

To run a container on your local machine or laptop, download the docker program from https://www.docker.com. There is a tab at the top of the page that says ‘Get Docker’. You can run it on OS X, Windows, and Linux:

  • Docker for Mac
  • Docker for Windows
  • Linux: Docker Community Edition

Run docker

If you successfully ran prep_local_jupyter.sh, as described above, you can skip this step and go on to Open the notebook in your browser.

Once you have the docker program installed, open the program (you should get a terminal screen with command line). Enter the command:

docker pull mccahill/jupyter-hts-2018

This will pull down the course docker image from dockerhub. It may take a few minutes. Next, run the command to start a container:

docker run --name hts-course \
    -v YOUR_DIRECTORY_WITH_COURSE_MATERIAL:/home/jovyan/work \
    -d -p 127.0.0.1:9999:8888 \
    -e PASSWORD="YOUR_CHOSEN_NOTEBOOK_PASSWORD" \
    -e NB_UID=1000 \
    -t mccahill/jupyter-hts-2018

The most important parts of this command are YOUR_DIRECTORY_WITH_COURSE_MATERIAL and YOUR_CHOSEN_NOTEBOOK_PASSWORD:

  • YOUR_DIRECTORY_WITH_COURSE_MATERIAL (bind mounting): the directory you extracted your course materials into. So, if you put them in your home directory, it might look something like: -v /home/janice/HTS2018-notebooks:/home/jovyan/work
  • YOUR_CHOSEN_NOTEBOOK_PASSWORD: whatever password you want to use to protect your notebook. This command runs the notebook so that it is only ‘seen’ by your local computer: no one else on the internet can access it, and you cannot access it remotely, so the password is a bit of overkill. Use it anyway. An example might be -e PASSWORD="Pssst_this_is_Secret", except that this is a terrible password; follow the standard rules of avoiding dictionary words and using a mix of uppercase, lowercase, and special symbols.
  • The -d -p 127.0.0.1:9999:8888 part of the command tells docker to run the notebook so that it is only visible to the local machine. It is absolutely possible to run it as a server to be accessed across the web, but there are security risks associated with that, so if you want to do this, proceed with great caution and get help.

Open the notebook in your browser

Open a browser and point it to 127.0.0.1:9999. You should get a Jupyter screen asking for a password; this is the password you created in the docker run command. Now you should be able to run anything you like from the course. Depending on your laptop’s resources (RAM, cores), this might be slow, so be aware and start by testing only one file (vs the entire course data set).

Stopping Docker

The container will continue running, even if you do not have Jupyter open in a web browser. If you don’t plan to use it for a while, you might want to shut it down so it isn’t using resources on your computer. Here are two ways to do that:

Kitematic

Kitematic is included in the Docker for Mac and Docker for Windows installations.

Commandline

You may want to familiarize yourself with the following Docker commands:

  • docker stop
  • docker rm
  • docker ps -a
  • docker images
  • docker rmi
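A typical cleanup session might look like the following sketch (it assumes docker is installed and that you used the container and image names from this guide; the comments note what each step does):

```shell
docker ps -a                           # list all containers, running or stopped
docker stop hts-course                 # stop the running course container
docker rm hts-course                   # remove the stopped container
                                       # (files in the bind-mounted directory
                                       #  survive on your disk)
docker images                          # list downloaded images
docker rmi mccahill/jupyter-hts-2018   # delete the image to reclaim disk space
```

Stopping and removing the container does not touch the course files you bind-mounted; removing the image only means a later docker run would have to pull it again.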

Windows Note

These instructions have not been tested in a Windows environment. If you have problems with them, please give us feedback.

Running on a Duke server (Duke Affiliates)

If you want to use the HTS Docker container for research, contact Mark DeLong and the research computing people. Andy Ingham (one of Mark DeLong’s staff) can help with setting up a research VM with the HTS Docker container. If there is enough interest we could also look at having an option for an HTS image that could be automatically provisioned as part of the Research Toolkits/RAPID service for Duke researchers in general.

Running on a non-Duke server

Sys admins at other institutions may be able to help set you up with compute resources if you let them know that you want to use a Docker Hub image. Mark McCahill (OIT Wizard, class 50, and the person who keeps our course containers humming along) mark.mccahill@duke.edu has generously offered to consult with IT people at other institutions to get you up and running. You can also run Docker images in AWS or Azure cloud environments. Mark McCahill can probably give you some pointers here too.

Select Duke Computing Resources

Individual Virtual Machines

  • Virtual Computing Manager: Duke offers free small VMs (2 cores, 2GB RAM) to all affiliates through a system called VCM. For running the Jupyter containers you will want some version of Linux, such as “Ubuntu 16.04”
  • Research Toolkits: Duke faculty researchers have access to moderate size VMs (4 core, 32GB RAM)

Assorted Notes on Approaches to Reproducible Analysis

Scripting Options

I have used shell scripts, makefiles, Jupyter notebooks, and Rmarkdown notebooks as “master scripts” for running complex analyses. My current preference is for Rmarkdown notebooks. Jupyter and Rmarkdown let you combine code from different languages in a single notebook, so you can have a bash cell followed by an R cell followed by a python cell. You can’t put other languages in shell scripts or makefiles, so if you want to combine multiple languages with these, you need to put that code in its own script and run that script similar to how you would run a binary executable.
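For example, Rmarkdown selects the language engine per chunk, so one notebook can drive a pipeline end to end. A minimal sketch (the file names are invented; bash, r, and python are standard knitr chunk engines):

````markdown
```{bash}
# shell step: peek at the raw data
ls *.fastq.gz | head
```

```{r}
# R step: summarize a (hypothetical) counts table
counts <- read.csv("counts.csv")
summary(counts)
```

```{python}
# python step runs in the same notebook
print("hello from python")
```
````

Rendering such a file (e.g. with rmarkdown::render()) runs each chunk in its own engine and stitches the results into a single report.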

One thing I like about Rmarkdown is how easy it is to generate good looking reports/manuscripts.

Singularity

I switched from Docker to Singularity for several reasons that are mostly summed up by the fact that Singularity was designed for use in a research environment. As far as you are concerned the most important issue is that you have to have root to run Docker on a machine, so no cluster sysadmin will ever let you run your own Docker container. Singularity runs fine in user space and is installed on both of the Duke clusters I have access to. If you need to convince the cluster admin to install it, see this: http://singularity.lbl.gov/install-request.

I put together these notes about getting started with Singularity. They were intended to run on a Duke VCM or RAPID VM, but should work on any machine where you have root. If you don’t have root on any linux machine, you can try to run Singularity on your local machine: http://singularity.lbl.gov/install-mac or http://singularity.lbl.gov/install-windows
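As a taste of what this looks like in practice, Singularity can run images straight from Docker Hub without root. A sketch using the course image (the exact name of the file that pull produces varies with your Singularity version):

```shell
# Pull the course image directly from Docker Hub (no root needed)
singularity pull docker://mccahill/jupyter-hts-2018

# Open an interactive shell inside the resulting image
# (file name below is what older Singularity versions produce; adjust as needed)
singularity shell jupyter-hts-2018.simg
```

Because the image runs in user space, this works on shared clusters where docker itself is not allowed.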

Etc

I almost never run anything locally on my computer; I do everything through web interfaces (RStudio Server) and command-line sessions over ssh, both running on remote servers. I end up running most of my research projects on the Duke VCM or RAPID machines, which I can get away with because microbial datasets tend to be relatively modest.

These days when I start a new project, I prepare a Singularity Recipe with the software I expect to need and build a Singularity image from it. I usually include RStudio Server in these images so I can use that interface to develop Rmarkdown notebooks and run the analysis. For the occasional project where I need a bigger machine I still build the image on my VCM or RAPID machine (building a singularity image requires sudo, and these are the only machines where I have root), and copy the image over to a bigger server. I have access to one biggish server where I can run RStudio Server from a Singularity image, but I don’t think the clusters on campus allow this, so I am stuck working over an ssh session.

All my scripts, config files, metadata, and Singularity Recipes/Dockerfiles live in a git repo that is synced to Github/Bitbucket/Gitlab. This makes it easy to move this stuff between servers if I need to upgrade. I usually end up having to rsync data and Singularity images between servers, because only the VCM and RAPID servers can share storage. If I do this, I will tend to move everything over to the bigger server, instead of trying to work in two places, because it gets confusing. The one exception is that if I need to update the singularity image that has to be done on one of the VMs I control.

As much as possible, I avoid running things on clusters because they make things more complicated: I can’t use RStudio and I have to fiddle with SLURM. When I do need to work on a machine that won’t allow RStudio, I will often edit files with Emacs running locally (because it is set up how I like and has a nice mac os interface) using tramp-mode. Several other text editors allow you to edit remote files over ssh.