Data Provenance¶

Provenance:¶

origin, source
the history of ownership of a valued object or work of art or literature

Data Lineage/Data Provenance:¶

Data lineage covers the data’s origins, what happens to it, and where it moves over time. It gives visibility into a data analytics process and makes it much easier to trace errors back to their root cause.

It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis.
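As a minimal sketch of what such a record might look like in practice, the snippet below appends one line per input file to a plain-text log: when the file was registered, on which host, and its MD5 checksum. The function name `record_provenance`, the log file name, and the tab-separated layout are all assumptions made for illustration, not part of any standard.

```shell
# Hypothetical helper: append a provenance record for one file to a log.
# Fields (tab-separated): UTC timestamp, host name, file path, MD5 checksum.
record_provenance() {
    local file="$1" log="$2"
    local sum
    sum=$(md5sum "$file" | cut -d' ' -f1)
    printf '%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -n)" "$file" "$sum" >> "$log"
}

# Example: register a small demo file (paths are illustrative).
PROV_DEMO=$(mktemp -d)
echo "1,2,3,4" > "$PROV_DEMO/data1.csv"
record_provenance "$PROV_DEMO/data1.csv" "$PROV_DEMO/provenance.tsv"
cat "$PROV_DEMO/provenance.tsv"
```

A log like this is only a starting point; it records where a file came from and what it looked like, but not the processing steps applied afterwards.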

Shell Variables¶

Assign the variables in this notebook.

In [1]:

source bioinf_intro_config.sh
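The contents of `bioinf_intro_config.sh` are not shown in this notebook. Judging only from the paths that appear in the cell outputs, it presumably defines at least `DATA_BASE`, `RAW_FASTQS`, and `CUROUT`. A sketch consistent with those outputs might look like this; the definitions are assumptions reconstructed from the outputs, and the real file may define more variables or build them differently.

```shell
# Sketch of bioinf_intro_config.sh (assumed contents, inferred from cell outputs)
DATA_BASE="/data/hts2018_pilot"                     # path in the `cd $DATA_BASE` error
RAW_FASTQS="$DATA_BASE/Granek_4837_180427A5"        # path in the `ls $RAW_FASTQS` error
CUROUT="/Users/cliburn/work/scratch/bioinf_intro"   # parent of the demo_chmod directory
export DATA_BASE RAW_FASTQS CUROUT
```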

Checksums¶

MD5SUM¶

Get example code from the TriCEM talk

Running MD5SUM¶

In [2]:

ls $RAW_FASTQS

ls: /data/hts2018_pilot/Granek_4837_180427A5: No such file or directory

In [3]:

head $RAW_FASTQS/Granek_4837_180427A5.checksum

head: /data/hts2018_pilot/Granek_4837_180427A5/Granek_4837_180427A5.checksum: No such file or directory

In [4]:

cd $DATA_BASE
md5sum -c $RAW_FASTQS/Granek_4837_180427A5.checksum

bash: cd: /data/hts2018_pilot: No such file or directory
parseopts.c:76: setup_check: fopen '/data/hts2018_pilot/Granek_4837_180427A5/Granek_4837_180427A5.checksum': No such file or directory

In [6]:

md5sum --help

Usage: md5sum [<option>] <file> [<file> [...] ]
       md5sum [<option>] --check <file>

Note:  These options are mostly compatible with GNU md5sum
       -s, -h, and -V are not available in GNU md5sum

 -b, --binary         Read files in binary mode
 -c, --check <file>   Check MD5 sums from <file>
 -t, --text           Read files in ASCII mode

 -s, --status         Silent mode: Use exit code to determine verification

 -h, --help           Display this help message and exit
 -V, --version        Display program version and exit
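The `--status` option listed above is handy in scripts, where the exit code matters more than the per-file report. The sketch below assumes an `md5sum` that accepts `--check` and `--status` (both appear in the help text above, and GNU md5sum accepts the same long options); the directory and file names are made up for the demonstration.

```shell
# Build a throwaway directory with one data file and its checksum list.
SUM_DEMO=$(mktemp -d)
echo "1,2,3,4" > "$SUM_DEMO/data1.csv"
(cd "$SUM_DEMO" && md5sum data1.csv > checksums.md5)

# Verify silently; the exit code alone says whether every file matched.
if (cd "$SUM_DEMO" && md5sum --status --check checksums.md5); then
    FIRST_CHECK="ok"
else
    FIRST_CHECK="mismatch"
fi
echo "first check: $FIRST_CHECK"

# Alter the file and verify again; the check should now fail.
echo "1,2,3,4,5" > "$SUM_DEMO/data1.csv"
if (cd "$SUM_DEMO" && md5sum --status --check checksums.md5); then
    SECOND_CHECK="ok"
else
    SECOND_CHECK="mismatch"
fi
echo "second check: $SECOND_CHECK"
```

Relative paths inside a checksum file are resolved against the current working directory, which is presumably why the notebook cell changes directory before running `md5sum -c`.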

Data Protection¶

chmod¶

Follow the md5sum example code from the TriCEM talk

In [7]:

DEMO=$CUROUT/demo_chmod
mkdir -p $DEMO
cd $DEMO

In [8]:

pwd

/Users/cliburn/work/scratch/bioinf_intro/demo_chmod
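The notebook stops after creating the demo directory. A minimal sketch of how the demonstration could continue is shown below: create a file, remove write permission for everyone, and inspect the result. It uses a temporary directory rather than `$DEMO` so that it runs without the config file, and the file name is hypothetical. Note that a process running as root can still write to the file regardless of the mode bits.

```shell
CHMOD_DEMO=$(mktemp -d)
echo "raw data" > "$CHMOD_DEMO/reads.txt"

# Remove write permission for user, group, and others.
chmod a-w "$CHMOD_DEMO/reads.txt"

# The first column of `ls -l` shows the mode; no "w" should remain.
MODE=$(ls -l "$CHMOD_DEMO/reads.txt" | cut -c1-10)
echo "$MODE"

# An attempted overwrite is expected to fail for a non-root user.
if ! { echo "oops" > "$CHMOD_DEMO/reads.txt"; } 2>/dev/null; then
    echo "write refused"
fi

# Restore write permission so the demo directory can be cleaned up later.
chmod u+w "$CHMOD_DEMO/reads.txt"
```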

TriCEM Talk¶


Overview¶

Slides online¶

A “webpage” version of these slides is available at https://bit.ly/2JD1lvm

What is Reproducible Analysis?¶

“Reproducible Analysis is an important part of reproducible research. Reproducible analysis requires that all components of the analysis be archived so that anyone can independently repeat the analysis and arrive at exactly the same results” - Josh Granek

Who is “anyone”¶

Labmates

Collaborators

Competitors

Everyone else interested in your work

Someone who wants to apply your analysis to their own data

You, 6 months from now

Not Covered Here¶

Reproducibility in the Wet Lab
Lots of Other Stuff
Important Details

Soapbox¶

Setup
Stand on

Reproducible Analysis¶

… is like eating your vegetables

Three Pillars of Reproducible Analysis¶

Raw Data
Analysis
Compute Environment

Quick Start Guide¶

Archive It!

Raw Data¶

READ-ONLY
Provenance
Archive

READ-ONLY¶

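```{bash eval=FALSE, echo=TRUE}
# Remove write permission for everyone, recursively
chmod -R a-w my_raw_data_directory
```

Provenance: Checksums¶

```{bash echo=TRUE, include=TRUE}
# Create two small example files
mkdir -p /tmp/mydata
echo "1,2,3,4" > /tmp/mydata/data1.csv
echo "5,6,7,8" > /tmp/mydata/data2.csv

# Print the MD5 checksum of each file
md5sum /tmp/mydata/*.csv
```

Provenance: Checksums (continued)¶

```{bash echo=TRUE, include=TRUE}
# Save the checksums to a file
md5sum /tmp/mydata/*.csv > /tmp/mydata/mydata_md5.txt
```

```{bash echo=TRUE, include=TRUE}
# Verify the files against the saved checksums
md5sum -c /tmp/mydata/mydata_md5.txt
```

```{bash echo=TRUE, include=TRUE, error=TRUE}
# Change one file, then verify again; the check now fails for data2.csv
echo "5,6,7,8,9" > /tmp/mydata/data2.csv
md5sum -c /tmp/mydata/mydata_md5.txt
```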

Archiving Raw Sequence Data @ NCBI¶

GEO: Gene “Expression” Data
- GEO: Gene Expression Omnibus
  - RNA-Seq
  - ChIP-Seq
SRA: Everything else
- SRA: Sequence Read Archive

Alternatives to NCBI¶

International Nucleotide Sequence Database Collaboration
- European Nucleotide Archive (ENA)
- DNA Database of Japan

Other Data Types¶

Analysis Methods¶

Script Everything
Use Version Control

Script Everything¶

No Excel

Publish Full Analysis Pipeline¶

Scripts (e.g., R, Python, MATLAB)
- run parameters are embedded
Metadata
Documentation
Manuscript (optional)
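To illustrate "run parameters are embedded", here is a sketch of an analysis script that keeps every tunable value at the top of the file, so that the archived script alone records how it was run. The parameter names, the threshold, and the toy filtering step are hypothetical.

```shell
#!/usr/bin/env bash
# --- Run parameters (embedded so the script documents its own settings) ---
MIN_VALUE=5                       # hypothetical threshold: keep values >= this
PIPE_DIR=$(mktemp -d)             # stand-in for a real project directory
INPUT="$PIPE_DIR/values.txt"
OUTPUT="$PIPE_DIR/filtered.txt"

# --- Toy input so the sketch is self-contained ---
printf '%s\n' 3 5 8 1 > "$INPUT"

# --- Analysis step: keep values at or above the threshold ---
awk -v min="$MIN_VALUE" '$1 >= min' "$INPUT" > "$OUTPUT"

# --- Record the parameters next to the result ---
echo "MIN_VALUE=$MIN_VALUE" > "$PIPE_DIR/run_parameters.txt"
cat "$OUTPUT"
```

Because nothing is passed on the command line, rerunning the committed script reproduces the same settings without relying on anyone's memory or shell history.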

What is Version Control?¶

{r, out.width = "400px"} knitr::include_graphics("http://www.phdcomics.com/comics/archive/phd101212s.gif")

What is Version Control?¶

Track
Backup
Rewind
Branch
Collaborate
Publish
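Several of the verbs above map directly onto git commands. The sketch below walks through tracking, rewinding, and branching in a throwaway repository. It assumes `git` is installed; the file name, branch name, and author identity are placeholders.

```shell
GIT_DEMO=$(mktemp -d)
cd "$GIT_DEMO"
git init -q

# Placeholder identity, set only for this throwaway repository.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Track: record two versions of a file.
echo "threshold = 5" > analysis.txt
git add analysis.txt
git commit -q -m "Add analysis parameters"
echo "threshold = 10" > analysis.txt
git commit -q -a -m "Raise threshold"

# Rewind: restore the file as it was one commit ago.
git checkout -q HEAD~1 -- analysis.txt
cat analysis.txt

# Branch: start a separate line of work without disturbing the original.
git checkout -q -b experiment
git log --oneline
```

Backup, Collaborate, and Publish additionally need a remote repository, which is where the hosting services come in.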

What is Version Control?¶

{r, out.height = "400px"} knitr::include_graphics("./git_diff.png")

Version Control Software¶

git
mercurial
etc

Git-repository Hosts¶

GitHub (Education Discount)
Bitbucket
GitLab
- Duke GitLab

Reproducible Computing Environment¶

Containerization for Reproducible Research¶

Versioning: Lock down the specific computing environment used for an analysis
Portability: Runs on Linux, Mac, and Windows
Shareability: Docker Hub/Singularity Hub
Scalability: Runs on a laptop, massive server, and everything in between

Container Platforms¶

Organization¶

My strategy¶

Raw Data directory: must be read-only
Output directory: everything generated by a script
Git repository: Code and metadata
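A sketch of how this three-part layout could be set up from the shell follows. The directory names are hypothetical, and `git init` is left as a comment so that the block runs even where git is not installed.

```shell
PROJECT=$(mktemp -d)              # stand-in for a real project location

# Raw data: populate once, then make it read-only.
mkdir -p "$PROJECT/raw_data"
echo "1,2,3,4" > "$PROJECT/raw_data/data1.csv"
chmod -R a-w "$PROJECT/raw_data"

# Output: everything here is generated by scripts and can be rebuilt.
mkdir -p "$PROJECT/output"

# Code and metadata: kept under version control.
mkdir -p "$PROJECT/repo"
# (cd "$PROJECT/repo" && git init)   # uncomment where git is available

ls -l "$PROJECT"
```

Keeping the output directory separate means it can be deleted and regenerated at any time, which is a quick test of whether the scripts really capture the whole analysis. To remove the demo tree afterwards, write permission has to be restored first with `chmod -R u+w`.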

Alternatives¶

Resources¶

Resources¶

Scientific Computing Advice¶

Git for Version Control¶