Data Provenance¶
Provenance:¶
- origin, source
- the history of ownership of a valued object or work of art or literature
Data Lineage/Data Provenance:¶
Data lineage covers the data’s origins, what happens to it, and where it moves over time. It gives visibility into a data analytics process and makes it much easier to trace errors back to their root cause.
It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems use such information, called data provenance, to address similar validation and debugging challenges. Data provenance refers to records of the inputs, entities, systems, and processes that influence data of interest, providing a historical record of the data and its origins. The generated evidence supports forensic activities such as data-dependency analysis, error/compromise detection and recovery, auditing, and compliance analysis.
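One lightweight way to keep such a record is a plain-text file stored next to the data. The sketch below is only an illustration: the file name `PROVENANCE.txt`, the field names, and the source description are hypothetical and not part of any standard.

```shell
#!/bin/sh
# Hypothetical example: record where a data file came from and
# what it looked like when it arrived.
workdir=$(mktemp -d)
printf '1,2,3,4\n' > "$workdir/data1.csv"

# Field names are illustrative assumptions, not a standard format.
{
  printf 'file: %s\n' "data1.csv"
  printf 'source: %s\n' "example instrument export (placeholder)"
  printf 'received: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'md5: %s\n' "$(md5sum "$workdir/data1.csv" | cut -d' ' -f1)"
} > "$workdir/PROVENANCE.txt"

cat "$workdir/PROVENANCE.txt"
```

Keeping the checksum in the same record ties the description of the origin to one exact version of the file.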
Checksums¶
MD5SUM¶
Get example code from TriCEM talk
Running MD5SUM¶
In [2]:
ls $RAW_FASTQS
ls: /data/hts2018_pilot/Granek_4837_180427A5: No such file or directory
In [3]:
head $RAW_FASTQS/Granek_4837_180427A5.checksum
head: /data/hts2018_pilot/Granek_4837_180427A5/Granek_4837_180427A5.checksum: No such file or directory
In [4]:
cd $DATA_BASE
md5sum -c $RAW_FASTQS/Granek_4837_180427A5.checksum
bash: cd: /data/hts2018_pilot: No such file or directory
parseopts.c:76: setup_check: fopen '/data/hts2018_pilot/Granek_4837_180427A5/Granek_4837_180427A5.checksum': No such file or directory
In [6]:
md5sum --help
Usage: md5sum [<option>] <file> [<file> [...] ]
md5sum [<option>] --check <file>
Note: These options are mostly compatible with GNU md5sum
-s, -h, and -V are not available in GNU md5sum
-b, --binary Read files in binary mode
-c, --check <file> Check MD5 sums from <file>
-t, --text Read files in ASCII mode
-s, --status Silent mode: Use exit code to determine verification
-h, --help Display this help message and exit
-V, --version Display program version and exit
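The `--status` option listed above is handy in scripts, where the exit code matters more than the printed report. The following is a minimal sketch with made-up file names. It uses the long form because the help text notes that `-s` is not available in GNU md5sum.

```shell
#!/bin/sh
# Sketch: use the exit code of `md5sum --status -c` to decide whether data is intact.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '1,2,3,4\n' > data1.csv
md5sum data1.csv > checksums.md5

if md5sum --status -c checksums.md5; then
  before="verified"
else
  before="mismatch"
fi

# Simulate an accidental modification of the data file.
printf '1,2,3,4,5\n' > data1.csv

if md5sum --status -c checksums.md5; then
  after="verified"
else
  after="mismatch"
fi

echo "before edit: $before"   # prints "before edit: verified"
echo "after edit:  $after"    # prints "after edit:  mismatch"
```

A pipeline can use the same check as a guard and refuse to start when the raw data no longer matches its recorded checksums.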
TriCEM Talk¶
{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE)
Overview¶
Slides online¶
A “webpage” version of these slides is available at https://bit.ly/2JD1lvm
What is Reproducible Analysis?¶
“Reproducible Analysis is an important part of reproducible research. Reproducible analysis requires that all components of the analysis be archived so that anyone can independently repeat the analysis and arrive at exactly the same results” - Josh Granek
Who is “anyone”¶
- Labmates
- Collaborators
- Competitors
- Everyone else interested in your work
- Someone who wants to apply your analysis to their own data
- You, 6 months from now
Not Covered Here¶
- Reproducibility in the Wet Lab
- Lots of Other Stuff
- Important Details
Soapbox¶
- Setup
- Stand on
Reproducible Analysis¶
… is like eating your vegetables
Raw Data¶
- READ-ONLY
- Provenance
- Archive
READ-ONLY¶
{bash eval=FALSE, echo=TRUE}
chmod -R a-w my_raw_data_directory
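A quick way to confirm the command did what was intended is to list anything that still has a write bit set. This sketch assumes GNU `find` (for the `-perm /222` test) and uses a throwaway directory as a stand-in for `my_raw_data_directory`.

```shell
#!/bin/sh
# Sketch: make a directory tree read-only, then confirm no write bits remain.
raw_dir=$(mktemp -d)   # stand-in for my_raw_data_directory
mkdir -p "$raw_dir/run1"
printf '1,2,3,4\n' > "$raw_dir/run1/data1.csv"

chmod -R a-w "$raw_dir"

# -perm /222 matches entries with a user, group, or other write bit set (GNU find).
still_writable=$(find "$raw_dir" -perm /222 | wc -l)
echo "entries still writable: $still_writable"

# Restore write permission so the temporary directory can be cleaned up.
chmod -R u+w "$raw_dir"
rm -rf "$raw_dir"
```

Removing write permission guards against accidents, not intent: the owner can add the permission back, and an administrator account can still modify the files.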
Provenance: Checksums¶
```{bash echo=TRUE, include=TRUE}
mkdir -p /tmp/mydata
echo "1,2,3,4" > /tmp/mydata/data1.csv
echo "5,6,7,8" > /tmp/mydata/data2.csv
md5sum /tmp/mydata/*.csv
```
Provenance: Checksums (continued)¶
{bash echo=TRUE, include=TRUE}
md5sum /tmp/mydata/*.csv > /tmp/mydata/mydata_md5.txt
{bash echo=TRUE, include=TRUE}
md5sum -c /tmp/mydata/mydata_md5.txt
{bash echo=TRUE, include=TRUE, error=TRUE}
echo "5,6,7,8,9" > /tmp/mydata/data2.csv
md5sum -c /tmp/mydata/mydata_md5.txt
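The checksum file above stores absolute paths (`/tmp/mydata/...`), so it only verifies files at that exact location. One variation, sketched here with temporary directories, is to run `md5sum` from inside the data directory. The file then records relative paths and still works after the data is copied or moved.

```shell
#!/bin/sh
# Sketch: a checksum file with relative paths stays valid after the data moves.
src=$(mktemp -d)
dest=$(mktemp -d)
printf '1,2,3,4\n' > "$src/data1.csv"
printf '5,6,7,8\n' > "$src/data2.csv"

# Run md5sum inside the directory so only file names are recorded.
(cd "$src" && md5sum *.csv > mydata_md5.txt)

# Copy everything elsewhere and verify from the new location.
cp "$src"/* "$dest"/
(cd "$dest" && md5sum -c mydata_md5.txt)
moved_status=$?
echo "exit status after copy: $moved_status"
```

This is the same pattern as changing into the data directory before running `md5sum -c` on a checksum file delivered with the data.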
Archiving Raw Sequence Data @ NCBI¶
- GEO: Gene “Expression” Data
  - GEO: Gene Expression Omnibus
  - RNA-Seq
  - ChIP-Seq
- SRA: Everything else
  - SRA: Sequence Read Archive
Alternatives to NCBI¶
Other Data Types¶
Analysis Methods¶
- Script Everything
- Use Version Control
Publish Full Analysis Pipeline¶
- Scripts (e.g. R, Python, Matlab, etc.)
  - run parameters are embedded
- Metadata
- Documentation
- Manuscript (optional)
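"Run parameters are embedded" means the values that shaped a result live in the script itself rather than in someone's memory or shell history. A minimal sketch follows. The parameter names and values (`MIN_QUALITY`, `TRIM_LENGTH`) are hypothetical.

```shell
#!/bin/sh
# Sketch: parameters are defined at the top of the script and written into the
# output, so every result file states the settings that produced it.
MIN_QUALITY=20      # hypothetical parameter
TRIM_LENGTH=50      # hypothetical parameter

out_dir=$(mktemp -d)
result="$out_dir/result.txt"

{
  echo "# min_quality=$MIN_QUALITY"
  echo "# trim_length=$TRIM_LENGTH"
  echo "analysis output would go here"
} > "$result"

cat "$result"
```

Because the parameters are part of the script, committing the script to version control also records every change to them.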
What is Version Control?¶
{r, out.width = "400px"} knitr::include_graphics("http://www.phdcomics.com/comics/archive/phd101212s.gif")
What is Version Control?¶
- Track
- Backup
- Rewind
- Branch
- Collaborate
- Publish
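With git, the first few of these map onto a handful of commands. The sketch below uses a throwaway repository. The file name, commit messages, branch name, and author identity are all placeholders.

```shell
#!/bin/sh
# Sketch: track, rewind, and branch with git in a temporary repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Placeholder identity, set only for this repository.
git config user.name "Example User"
git config user.email "user@example.com"

# Track: record a first version of a script.
echo 'echo "version 1"' > analysis.sh
git add analysis.sh
git commit -q -m "Add analysis script"

# Track: record a change to it.
echo 'echo "version 2"' > analysis.sh
git commit -q -am "Update analysis script"

# Rewind: restore the file as it was one commit ago.
git checkout -q HEAD~1 -- analysis.sh
rewound=$(cat analysis.sh)

# Branch: try an idea without disturbing the main line of work.
git checkout -q -b experiment
n_commits=$(git rev-list --count HEAD)

echo "restored content: $rewound"
echo "commits so far: $n_commits"
```

Backup, Collaborate, and Publish come from pushing the repository to a hosting service, which is not shown here.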
What is Version Control?¶
{r, out.height = "400px"} knitr::include_graphics("./git_diff.png")
Version Control Software¶
- git
- mercurial
- etc
Git-repository Hosts¶
- GitHub (Education Discount)
- Bitbucket
- GitLab
- Duke GitLab
Reproducible Computing Environment¶
Containerization for Reproducible Research¶
- Versioning: Lock down the specific computing environment used for an analysis
- Portability: Runs on Linux, Mac, and Windows
- Shareability: Docker Hub/Singularity Hub
- Scalability: Runs on a laptop, massive server, and everything in between
Container Platforms¶
Organization¶
My strategy¶
- Raw Data directory: must be read-only
- Output directory: everything generated by a script
- Git repository: Code and metadata
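As a rough illustration of this layout, the sketch below builds the three directories in a temporary location. The directory names (`raw_data`, `output`, `repo`) and the toy conversion step are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: one possible project layout following the three-part strategy.
project=$(mktemp -d)    # stand-in for a real project directory
mkdir -p "$project/raw_data" "$project/output" "$project/repo"

# Raw data: written once, then made read-only.
printf '1,2,3,4\n' > "$project/raw_data/data1.csv"
chmod -R a-w "$project/raw_data"

# Code in the repository reads from raw_data and writes only to output.
tr ',' '\t' < "$project/raw_data/data1.csv" > "$project/output/data1.tsv"

ls "$project"
```

With this separation, everything in the output directory can be deleted and regenerated by rerunning the scripts against the untouched raw data.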