Duke-UNC CFAR Data Workshop

Learning objectives

This two-day CFAR data workshop introduces biologists to elementary data science in the context of HIV/AIDS research. It is aimed at people who work in the laboratory or clinic and who are struggling to keep up with the data their research generates. You may have tens, hundreds or thousands of text files or spreadsheets that need to be combined and analyzed, and you still need to eat, sleep and walk the dog occasionally.

The workshop will introduce you to the use of Jupyter as a scientific notebook that can combine documentation, code and cool visualizations. The coding part will focus on the manipulation of tabular data, since that is the most common way of capturing scientific output. You will learn how to read in tabular data from text files or Excel spreadsheets, how to tidy them for analysis, how to slice and dice the data to answer your research questions, and how to create beautiful plots for exploratory analysis and reporting.
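To give a flavor of what "read, tidy, slice and dice" can look like in practice, here is a minimal sketch. It assumes the pandas library (this page does not name a specific library), and the data, the column names `pid` and `cd4`, and the two in-memory "files" are all made up for illustration.

```python
import io

import pandas as pd  # assumption: pandas is the table library in use

# Two small made-up "files"; in practice these would be paths on disk.
visit1 = io.StringIO("pid,cd4\nA,350\nB,520\n")
visit2 = io.StringIO("pid,cd4\nA,410\nB,480\n")

# Read each file and tag its rows with the visit they came from.
frames = []
for visit, handle in [(1, visit1), (2, visit2)]:
    df = pd.read_csv(handle)
    df["visit"] = visit
    frames.append(df)

# Combine into one tidy table: one row per participant per visit.
combined = pd.concat(frames, ignore_index=True)

# Slice: keep only the rows with a CD4 count above 400.
high = combined[combined["cd4"] > 400]

# Dice: mean CD4 count per participant across visits.
mean_cd4 = combined.groupby("pid")["cd4"].mean()

print(combined)
print(mean_cd4)
```

The same pattern scales from two files to thousands: loop over the file names, read each one into a table, and concatenate the results before asking questions of the combined data.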

Day 1 of the workshop will emphasize working with data, and day 2 will focus on visualization and the generation of statistical plots. Each day consists of four 75-minute sessions. The first three sessions will show how to perform common data science tasks with short code examples, and the last session will be a hands-on exercise meant to integrate everything learned so far. Since data science is not a spectator sport, coding exercises will also be interleaved throughout the first three sessions.

It takes more than two days to train a competent data scientist, and we do not expect you to be proficient by the end of the workshop. Instead, our goal is for you to become comfortable with the idea of writing code for analysis and visualization, and to appreciate its advantages over the exclusive use of Excel. You will have made the transition from total outsider to apprentice data scientist, eager and able to continue learning on your own with the resources given at the end of this page, or with others you find yourself. We expect that you will struggle and be frustrated with coding as you try to apply it to your own data sets. It will get easier over time if you keep practicing, and the rewards are great.

Workshop Overview

  • Day 1 Session 1 (9:45 - 11:00): Jupyter, Python and Data Science
  • Day 1 Session 2 (11:15 - 12:30): Tidying Data
  • Day 1 Session 3 (1:45 - 3:00): Manipulating Data
  • Day 1 Session 4 (3:15 - 4:30): Capstone - Summarizing Clinical and Demographic Data
  • Day 2 Session 5 (9:45 - 11:00): Visualizing Data
  • Day 2 Session 6 (11:15 - 12:30): A Gallery of Plots
  • Day 2 Session 7 (1:45 - 3:00): Customizing Plots and Layouts
  • Day 2 Session 8 (3:15 - 4:30): Capstone - Mapping the Antibody Response to HIV Epitopes

Post-workshop
