# Data Provenance

## [Provenance](https://www.merriam-webster.com/dictionary/provenance):
1. origin, source
2. the history of ownership of a valued object or work of art or literature

## [Data Lineage/Data Provenance](https://en.wikipedia.org/wiki/Data_lineage):
Data lineage includes the data's origins, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process.

It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis. 

## Shell Variables
Load the shell variables used in this notebook from the configuration script.

In [None]:
source bioinf_intro_config.sh

## Checksums
### MD5SUM


TODO: Get example code from the TriCEM talk.



### Running MD5SUM

In [None]:
ls $RAW_FASTQS

In [None]:
head $RAW_FASTQS/Granek_4837_180427A5.checksum

In [None]:
# Run the check from $DATA_BASE so that relative paths in the checksum file resolve
cd "$DATA_BASE"
md5sum -c "$RAW_FASTQS/Granek_4837_180427A5.checksum"

In [None]:
md5sum --help

## Data Protection
### chmod
TODO: Follow the md5sum example code from the TriCEM talk.


In [None]:
# Create a scratch directory for the chmod demo and move into it
DEMO="$CUROUT/demo_chmod"
mkdir -p "$DEMO"
cd "$DEMO"

In [None]:
pwd
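
A minimal sketch of what the chmod demo could look like, assuming a small stand-in file is enough to show the effect. The file name `raw_data.csv` and the temp-directory fallback for `$CUROUT` are hypothetical and not part of the original notebook.

```bash
# Hypothetical demo directory; falls back to a temp dir if $CUROUT is unset
DEMO="${CUROUT:-$(mktemp -d)}/demo_chmod"
mkdir -p "$DEMO"

# Create a small stand-in for a raw data file (hypothetical name)
echo "1,2,3,4" > "$DEMO/raw_data.csv"

# Remove write permission for user, group, and others
chmod a-w "$DEMO/raw_data.csv"
ls -l "$DEMO/raw_data.csv"

# For a regular (non-root) user this append should now be refused
echo "5,6,7,8" >> "$DEMO/raw_data.csv" || echo "write refused: file is read-only"

# Restore write permission for the owner so the demo can be rerun or cleaned up
chmod u+w "$DEMO/raw_data.csv"
```

Note that root bypasses these permission bits, so the append is only refused when the cell runs as an ordinary user.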

# TriCEM Talk

---
title: "Reproducible Analysis"
author: "Josh Granek"
date: "June 8, 2018"
output:
  md_document:
    variant: markdown_github
  beamer_presentation: default
  ioslides_presentation: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```


# Overview
## Slides online
A "webpage" version of these slides is available at https://bit.ly/2JD1lvm

## What is Reproducible Analysis? {.build}
 "Reproducible Analysis is an important part of reproducible research. Reproducible analysis requires that all components of the analysis be archived so that _anyone_ can independently repeat the analysis and arrive at exactly the same results" - Josh Granek

### Who is "anyone"
> - Labmates
> - Collaborators
> - Competitors
> - Everyone else interested in your work
> - Someone who wants to apply your analysis to their own data
> - You, 6 months from now

## Not Covered Here
 - Reproducibility in the Wet Lab
 - Lots of Other Stuff
 - Important Details

## Soapbox
 1. Set up
 2. Stand on

## Reproducible Analysis
 . . . is like eating your vegetables

## Three Pillars of Reproducible Analysis {.build}
 - Raw Data
 - Analysis
 - Compute Environment
 
### Quick Start Guide
 - Archive It!

# Raw Data

## Raw Data
 - READ-ONLY
 - Provenance
 - Archive

## READ-ONLY
```{bash eval=FALSE, echo=TRUE}
chmod -R a-w my_raw_data_directory
```
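
One way to confirm that the recursive `chmod` reached every file is to search for anything that still has a write bit set. This is a sketch under assumptions: the directory is a throwaway temp copy with a hypothetical `run1` subdirectory, and `-perm /222` is GNU `find` syntax.

```bash
# Throwaway stand-in for my_raw_data_directory (hypothetical contents)
RAW_DIR="$(mktemp -d)/my_raw_data_directory"
mkdir -p "$RAW_DIR/run1"
echo "1,2,3,4" > "$RAW_DIR/run1/data1.csv"

# Remove write permission from the directory and everything beneath it
chmod -R a-w "$RAW_DIR"

# List any file or directory that still has a write bit set for anyone
STILL_WRITABLE=$(find "$RAW_DIR" -perm /222)

if [ -z "$STILL_WRITABLE" ]; then
    echo "raw data is read-only"
else
    echo "still writable:"
    echo "$STILL_WRITABLE"
fi

# To delete the demo later: chmod -R u+w "$RAW_DIR" && rm -rf "$RAW_DIR"
```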

## Provenance: Checksums
```{bash echo=TRUE, include=TRUE}
mkdir -p /tmp/mydata
echo "1,2,3,4" > /tmp/mydata/data1.csv
echo "5,6,7,8" > /tmp/mydata/data2.csv

md5sum /tmp/mydata/*.csv
```
 
## Provenance: Checksums (continued)

```{bash echo=TRUE, include=TRUE}
md5sum /tmp/mydata/*.csv > /tmp/mydata/mydata_md5.txt
```
 
```{bash echo=TRUE, include=TRUE}
md5sum -c /tmp/mydata/mydata_md5.txt
```

```{bash echo=TRUE, include=TRUE, error=TRUE}
echo "5,6,7,8,9" > /tmp/mydata/data2.csv
md5sum -c /tmp/mydata/mydata_md5.txt
```
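
Because `md5sum -c` exits with a non-zero status when any file fails verification, the check can gate the rest of a pipeline. The sketch below assumes GNU `md5sum`. The helper name `verify_raw_data` and the temp directory are hypothetical, and the manifest uses relative paths, so the check runs from inside the data directory.

```bash
# Throwaway data directory with the same example files as above
WORK=$(mktemp -d)
echo "1,2,3,4" > "$WORK/data1.csv"
echo "5,6,7,8" > "$WORK/data2.csv"
(cd "$WORK" && md5sum *.csv > mydata_md5.txt)

# Hypothetical helper: succeeds only if every file matches the manifest
verify_raw_data() {
    # --quiet suppresses the per-file OK lines; failures are still reported
    (cd "$1" && md5sum --quiet -c mydata_md5.txt)
}

if verify_raw_data "$WORK"; then STATUS_BEFORE=ok; else STATUS_BEFORE=changed; fi

# Simulate accidental modification of a raw data file
echo "5,6,7,8,9" > "$WORK/data2.csv"

if verify_raw_data "$WORK"; then STATUS_AFTER=ok; else STATUS_AFTER=changed; fi

echo "before edit: $STATUS_BEFORE, after edit: $STATUS_AFTER"
```

In a real analysis script the `else` branch would typically stop the run rather than record a status.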

## Archiving Raw Sequence Data @ NCBI
 - [GEO: Gene "Expression" Data](https://www.ncbi.nlm.nih.gov/geo/)
     - GEO: Gene Expression Omnibus
     - RNA-Seq
     - ChIP-Seq
 - [SRA: Everything else](https://www.ncbi.nlm.nih.gov/sra)
     - SRA: Sequence Read Archive

## Alternatives to NCBI
 - [International Nucleotide Sequence Database Collaboration](http://www.insdc.org)
 - [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena)
 - [DNA Database of Japan](https://www.ddbj.nig.ac.jp/index-e.html)

## Other Data Types
 ???????????????

# Analysis Methods
## Analysis Methods
 - Script Everything
 - Use Version Control
 
## Script Everything
 - [No Excel](https://www.bloomberg.com/news/articles/2013-04-18/faq-reinhart-rogoff-and-the-excel-error-that-changed-history)

## Publish Full Analysis Pipeline
 - Scripts (e.g. R, Python, Matlab, etc.)
     - run parameters are embedded
 - Metadata
 - Documentation
 - Manuscript (optional)
 
## What is Version Control?
```{r, out.width = "400px"}
knitr::include_graphics("http://www.phdcomics.com/comics/archive/phd101212s.gif")
```

## What is Version Control?
 - Track
 - Backup
 - Rewind
 - Branch
 - Collaborate
 - Publish
 
## What is Version Control?
```{r, out.height = "400px"}
knitr::include_graphics("./git_diff.png")
```


 


 
## Version Control Software
 - git
 - mercurial
 - etc

## Git-repository Hosts
 - GitHub ([Education Discount](https://help.github.com/categories/teaching-and-learning-with-github-education/))
 - Bitbucket
 - GitLab
 - Duke GitLab

## Reproducible Computing Environment
### Containerization for Reproducible Research

 - *Versioning*: Lock down the specific computing environment used for an analysis
 - Portability: Runs on Linux, Mac, and Windows
 - Shareability: [Docker Hub](https://hub.docker.com)/[Singularity Hub](https://singularity-hub.org)
 - Scalability: Runs on a laptop, massive server, and everything in between
 
### Container Platforms
 - [Docker](https://docs.docker.com/get-started/#docker-concepts)
 - [Singularity](http://singularity.lbl.gov/)

# Organization
## My strategy
1. Raw Data directory: must be *read-only*
2. Output directory: everything generated by a script
3. Git repository: Code and metadata 
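
The three-part layout above could be set up roughly as follows. This is only a sketch: the directory names (`raw_data`, `output`, `analysis`), the stand-in data file, and the choice to keep the checksum manifest in the repository as metadata are all assumptions, not part of the original strategy.

```bash
# Hypothetical project root; a temp dir keeps the sketch self-contained
PROJECT="$(mktemp -d)/my_project"
RAW="$PROJECT/raw_data"     # 1. raw data, read-only once populated
OUT="$PROJECT/output"       # 2. everything generated by a script
REPO="$PROJECT/analysis"    # 3. code and metadata, tracked in git
mkdir -p "$RAW" "$OUT" "$REPO"

# Stand-in raw data file
echo "1,2,3,4" > "$RAW/data1.csv"

# Record checksums as metadata alongside the code (an assumed convention)
(cd "$RAW" && md5sum *.csv) > "$REPO/raw_data_md5.txt"

# Lock the raw data
chmod -R a-w "$RAW"

# Scripts read from the raw data directory and write only to the output directory
wc -l < "$RAW/data1.csv" > "$OUT/line_count.txt"

# Put the code and metadata under version control if git is installed
if command -v git >/dev/null 2>&1; then
    git init -q "$REPO"
fi
```

Keeping generated files out of the repository and out of the raw data directory means the output directory can be deleted and regenerated at any time.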

## Alternatives
1. [ProjectTemplate](https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/#tip-projecttemplate---a-possible-solution)
2. [Resources on Project Directory Organization](https://discuss.ropensci.org/t/resources-on-project-directory-organization/340)
3. [A Quick Guide to Organizing Computational Biology Projects](https://doi.org/10.1371/journal.pcbi.1000424)
4. [Designing projects](https://nicercode.github.io/blog/2013-04-05-projects/)

# Resources
## Resources
### Scientific Computing Advice
 - [Good enough practices in scientific computing](https://doi.org/10.1371/journal.pcbi.1005510)
 - [Best Practices for Scientific Computing](https://doi.org/10.1371/journal.pbio.1001745)

### Git for Version Control
 - [Introduction to Version Control with Git](https://swcarpentry.github.io/git-novice/)
 - [Installing Git](https://swcarpentry.github.io/workshop-template/#git)
 - [Sourcetree: a free GUI for git](https://www.sourcetreeapp.com)
 - [Git in RStudio](https://gitlab.oit.duke.edu/IBIEM/IBIEM_2017_2018/blob/master/git_material/git_overview.md)


