Data Provenance

Provenance:

  1. origin, source
  2. the history of ownership of a valued object or work of art or literature

Data Lineage/Data Provenance:

Data lineage covers the data’s origins, what happens to it, and where it moves over time. It gives visibility into a data analytics process and makes it much easier to trace errors back to their root cause.

It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis.
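As a rough illustration of such a record, the sketch below writes one provenance row for a single file: its checksum, where it came from, and when and where the record was made. The file names, the column layout, and the "example sequencing core" source label are all hypothetical, chosen only for this example.

```shell
# Hypothetical example: record what a data file looked like on arrival
workdir=$(mktemp -d)
data_file="$workdir/data1.csv"
record="$workdir/provenance.tsv"      # hypothetical record file

echo "1,2,3,4" > "$data_file"         # toy data so the sketch runs on its own

# One tab-separated row per file: checksum, name, stated source, time, host, user
checksum=$(md5sum "$data_file" | awk '{print $1}')
printf 'md5\tfile\tsource\trecorded_utc\thost\tuser\n' > "$record"
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  "$checksum" "${data_file##*/}" "example sequencing core" \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$(id -un)" >> "$record"

cat "$record"
```

A record like this only covers the file's arrival; each later processing step would need its own entry to give a full history.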

Shell Variables

Assign the variables in this notebook.

In [1]:
source bioinf_intro_config.sh

Checksums

MD5SUM

Get example code from the TriCEM talk

Running MD5SUM

In [2]:
ls $RAW_FASTQS
ls: /data/hts2018_pilot/Granek_4837_180427A5: No such file or directory

In [3]:
head $RAW_FASTQS/Granek_4837_180427A5.checksum
head: /data/hts2018_pilot/Granek_4837_180427A5/Granek_4837_180427A5.checksum: No such file or directory

In [4]:
cd $DATA_BASE
md5sum -c $RAW_FASTQS/Granek_4837_180427A5.checksum
bash: cd: /data/hts2018_pilot: No such file or directory
parseopts.c:76: setup_check: fopen '/data/hts2018_pilot/Granek_4837_180427A5/Granek_4837_180427A5.checksum': No such file or directory

In [6]:
md5sum --help
Usage: md5sum [<option>] <file> [<file> [...] ]
       md5sum [<option>] --check <file>

Note:  These options are mostly compatible with GNU md5sum
       -s, -h, and -V are not available in GNU md5sum

 -b, --binary         Read files in binary mode
 -c, --check <file>   Check MD5 sums from <file>
 -t, --text           Read files in ASCII mode

 -s, --status         Silent mode: Use exit code to determine verification

 -h, --help           Display this help message and exit
 -V, --version        Display program version and exit
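The `--status` option listed above is handy inside scripts, where the exit code matters more than the printed report. A minimal sketch, assuming an `md5sum` that accepts `--status` together with `-c` (GNU md5sum does as well); the temporary directory and file names are made up for the example.

```shell
workdir=$(mktemp -d)
echo "1,2,3,4" > "$workdir/data1.csv"
md5sum "$workdir/data1.csv" > "$workdir/data1.md5"

# --status prints nothing; exit code 0 means every listed file matched
if md5sum --status -c "$workdir/data1.md5"; then
  result_before="verified"
else
  result_before="FAILED"
fi

# Modify the file, then verify again
echo "changed" >> "$workdir/data1.csv"
if md5sum --status -c "$workdir/data1.md5"; then
  result_after="verified"
else
  result_after="FAILED"
fi

echo "before edit: $result_before, after edit: $result_after"
# prints: before edit: verified, after edit: FAILED
```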

Data Protection

chmod

Follow the chmod example code from the TriCEM talk

In [7]:
DEMO=$CUROUT/demo_chmod
mkdir -p $DEMO
cd $DEMO

In [8]:
pwd
/Users/cliburn/work/scratch/bioinf_intro/demo_chmod
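One way the chmod demo could continue is sketched below. It uses a temporary directory as a stand-in for `$DEMO` so that it runs on its own; the file name is hypothetical. Note that root bypasses permission bits, so the overwrite is only blocked for ordinary users.

```shell
demo=$(mktemp -d)                    # stand-in for the notebook's $DEMO directory
echo "1,2,3,4" > "$demo/data1.csv"

ls -l "$demo/data1.csv"              # permission string includes a "w" for the owner
chmod a-w "$demo/data1.csv"          # remove write permission for user, group, and others
ls -l "$demo/data1.csv"              # no "w" left in the permission string

# Try an accidental overwrite; the braces let us silence the shell's error message
if { echo "oops" > "$demo/data1.csv"; } 2>/dev/null; then
  echo "write succeeded (running as root?)"
else
  echo "write blocked"
fi
```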

TriCEM Talk

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

Overview

Slides online

A “webpage” version of these slides is available at https://bit.ly/2JD1lvm

What is Reproducible Analysis?

“Reproducible Analysis is an important part of reproducible research. Reproducible analysis requires that all components of the analysis be archived so that anyone can independently repeat the analysis and arrive at exactly the same results” - Josh Granek

Who is “anyone”

  • Labmates
  • Collaborators
  • Competitors
  • Everyone else interested in your work
  • Someone who wants to apply your analysis to their own data
  • You, 6 months from now

Not Covered Here

  • Reproducibility in the Wet Lab
  • Lots of Other Stuff
  • Important Details

Soapbox

  1. Setup
  2. Stand on

Reproducible Analysis

… is like eating your vegetables

Three Pillars of Reproducible Analysis

  • Raw Data
  • Analysis
  • Compute Environment

Quick Start Guide

  • Archive It!

Raw Data

  • READ-ONLY
  • Provenance
  • Archive

READ-ONLY

```{bash eval=FALSE, echo=TRUE}
chmod -R a-w my_raw_data_directory
```
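After a recursive `chmod`, it can be worth checking that nothing was missed. The sketch below is one possible audit, using `find` with octal `-perm` tests to list anything that still has a write bit set. The temporary directory stands in for `my_raw_data_directory`, and the files inside it are toy data.

```shell
raw=$(mktemp -d)                     # stand-in for my_raw_data_directory
mkdir -p "$raw/run1"
echo "1,2,3,4" > "$raw/run1/data1.csv"
echo "5,6,7,8" > "$raw/data2.csv"

chmod -R a-w "$raw"

# List files or directories that still have a write bit for owner, group, or others
writable=$(find "$raw" \( -perm -200 -o -perm -020 -o -perm -002 \) -print)
if [ -z "$writable" ]; then
  echo "no writable files or directories remain"
else
  echo "still writable:"
  echo "$writable"
fi
```

To remove the demo directory afterwards, write permission has to be restored first, for example with `chmod -R u+w "$raw"`.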

Provenance: Checksums

```{bash echo=TRUE, include=TRUE}
mkdir -p /tmp/mydata
echo "1,2,3,4" > /tmp/mydata/data1.csv
echo "5,6,7,8" > /tmp/mydata/data2.csv

md5sum /tmp/mydata/*.csv
```

Provenance: Checksums (continued)

```{bash echo=TRUE, include=TRUE}
md5sum /tmp/mydata/*.csv > /tmp/mydata/mydata_md5.txt
```

```{bash echo=TRUE, include=TRUE}
md5sum -c /tmp/mydata/mydata_md5.txt
```

```{bash echo=TRUE, include=TRUE, error=TRUE}
echo "5,6,7,8,9" > /tmp/mydata/data2.csv
md5sum -c /tmp/mydata/mydata_md5.txt
```
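MD5 is adequate for catching accidental corruption, but it is not collision-resistant. Where that matters, the same workflow can be run with `sha256sum`, which in GNU coreutils accepts the same `-c` option. The sketch below swaps in SHA-256 for the steps above; it is an alternative, not what the talk used, and it works in a temporary directory rather than `/tmp/mydata`.

```shell
workdir=$(mktemp -d)
echo "1,2,3,4" > "$workdir/data1.csv"
echo "5,6,7,8" > "$workdir/data2.csv"

# Record checksums, then verify: both files should report OK
sha256sum "$workdir"/*.csv > "$workdir/mydata_sha256.txt"
sha256sum -c "$workdir/mydata_sha256.txt"

# Change one file and verify again, capturing the report and the exit status
echo "5,6,7,8,9" > "$workdir/data2.csv"
report=$(sha256sum -c "$workdir/mydata_sha256.txt" 2>/dev/null)
status=$?

echo "$report"                  # data1.csv is OK, data2.csv is FAILED
echo "exit status: $status"     # non-zero because one file did not match
```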

Archiving Raw Sequence Data @ NCBI

Other Data Types


Analysis Methods

  • Script Everything
  • Use Version Control

Script Everything

Publish Full Analysis Pipeline

  • Scripts (e.g., R, Python, MATLAB)
    • run parameters are embedded
  • Metadata
  • Documentation
  • Manuscript (optional)
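To make "run parameters are embedded" concrete, here is a small sketch of a scripted analysis step. The threshold `min_value`, the toy input, and the output file names are all hypothetical. The point is only that the parameter lives in the script and a copy of it is saved next to the results it produced.

```shell
# Hypothetical analysis step: every run parameter is set in the script itself
min_value=4                               # hypothetical filtering threshold
indir=$(mktemp -d)                        # stand-in for the raw data directory
outdir=$(mktemp -d)                       # stand-in for the output directory
printf '1\n5\n3\n8\n' > "$indir/input.csv"   # toy input so the sketch runs on its own

# Save the parameters alongside the results
{
  echo "script=${0##*/}"
  echo "min_value=$min_value"
  echo "run_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
} > "$outdir/run_parameters.txt"

# The analysis itself: keep values at or above the threshold
awk -v min="$min_value" '$1 >= min' "$indir/input.csv" > "$outdir/filtered.csv"
cat "$outdir/filtered.csv"
```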

What is Version Control?

```{r, out.width = "400px"}
knitr::include_graphics("http://www.phdcomics.com/comics/archive/phd101212s.gif")
```

What is Version Control?

  • Track
  • Backup
  • Rewind
  • Branch
  • Collaborate
  • Publish
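A few of the verbs above (track, rewind) can be sketched with basic git commands. This assumes git is installed. The repository location, the file `params.sh`, and the user identity are invented for the example.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
# Identity set for this repository only, so the sketch runs without a global git config
git config user.name "Example User"
git config user.email "user@example.com"

# Track: commit a first version of a file
echo 'min_value=4' > params.sh
git add params.sh
git commit -q -m "Add run parameters"

# Track: commit a change to the same file
echo 'min_value=5' > params.sh
git commit -q -am "Raise threshold"

git log --oneline                        # two commits, newest first
git diff HEAD~1 HEAD -- params.sh        # what changed between them

# Rewind: restore the file as it was in the previous commit
git checkout -q HEAD~1 -- params.sh
cat params.sh
```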

What is Version Control?

```{r, out.height = "400px"}
knitr::include_graphics("./git_diff.png")
```

Version Control Software

  • git
  • mercurial
  • etc

Git-repository Hosts

Reproducible Computing Environment

Containerization for Reproducible Research

  • Versioning: Lock down the specific computing environment used for an analysis
  • Portability: Runs on Linux, Mac, and Windows
  • Shareability: Docker Hub/Singularity Hub
  • Scalability: Runs on a laptop, massive server, and everything in between
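As a sketch of the "versioning" point, the shell snippet below writes a minimal Dockerfile that pins the base image to a specific release instead of `latest`. The image contents, the build-context directory, and the image name `my-analysis-env` are hypothetical. A tag can still be moved by its publisher, and the installed package versions are not pinned here, so this is a starting point rather than a fully locked-down environment.

```shell
ctx=$(mktemp -d)                     # hypothetical build context directory

cat > "$ctx/Dockerfile" <<'EOF'
# Pin the base image to a specific release rather than "latest"
FROM ubuntu:22.04
# Tools needed by the analysis (coreutils provides md5sum)
RUN apt-get update && apt-get install -y --no-install-recommends coreutils \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /analysis
CMD ["md5sum", "--version"]
EOF

cat "$ctx/Dockerfile"

# With Docker installed, the image could then be built and run with:
#   docker build -t my-analysis-env "$ctx"
#   docker run --rm my-analysis-env
```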

Container Platforms

Organization

My strategy

  1. Raw Data directory: must be read-only
  2. Output directory: everything generated by a script
  3. Git repository: Code and metadata
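The three-part layout above might be set up along these lines. The directory names are hypothetical, and storing the checksum file in the repository directory is an assumption about what "metadata" includes.

```shell
project=$(mktemp -d)/my_project      # hypothetical project root
mkdir -p "$project/raw_data" "$project/output" "$project/repo"

# Toy raw data, with its checksum stored alongside the code as metadata
echo "1,2,3,4" > "$project/raw_data/data1.csv"
md5sum "$project/raw_data/data1.csv" > "$project/repo/raw_data_md5.txt"

# 1. Raw data directory: read-only
chmod -R a-w "$project/raw_data"

# 2. Output directory stays writable; scripts write everything they generate here
# 3. Repository directory: would be initialized with `git init "$project/repo"`

ls -l "$project"
```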