Assignment 2: Data formats

This assignment reviews basic Python from BIOS 821 and gives practice manipulating data formats common in medical data.

Text

1. (10 points)

Read the text file data/s01/alice.txt and count the number of occurrences of ‘Alice’ in the text.
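As a rough sketch of the counting step, the snippet below uses `str.count` on an in-memory string. The helper names `count_occurrences` and `count_in_file` and the sample sentence are hypothetical, introduced only for illustration; the UTF-8 encoding is likewise an assumption about the file.

```python
from pathlib import Path


def count_occurrences(text: str, word: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of `word` in `text`."""
    return text.count(word)


def count_in_file(path: str, word: str, encoding: str = "utf-8") -> int:
    """Read a whole text file and count occurrences of `word` (encoding is assumed)."""
    return count_occurrences(Path(path).read_text(encoding=encoding), word)


# Hypothetical sample text standing in for the contents of the file.
sample = "Alice was beginning to get very tired. 'Oh dear!' said Alice."
sample_count = count_occurrences(sample, "Alice")
print(sample_count)
```

Note that `str.count` matches substrings, so a word such as "Alice's" is also counted, while "alice" in lower case is not.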

In [1]:




JSON

2. (20 points)

Use curl to download the file at https://swapi.co/api/people/1/ as luke.json in data/s01. Find the body mass index (BMI, weight in kilograms divided by the square of height in meters) of Luke Skywalker, rounded to 1 decimal place, and print the BMI category for Luke.

BMI Categories:

  • Underweight = <18.5
  • Normal weight = 18.5–24.9
  • Overweight = 25–29.9
  • Obesity = BMI of 30 or greater

Note: Use the json package.
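The calculation can be sketched with the `json` package as follows. The record here is a made-up inline string, not the downloaded file, and it assumes that `height` is given in centimeters and `mass` in kilograms, both stored as strings; check the actual file before relying on those field names. The function `bmi_category` is a hypothetical helper that treats the category table as half-open intervals.

```python
import json


def bmi_category(bmi: float) -> str:
    """Map a BMI value to a category using the thresholds 18.5, 25 and 30."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal weight"
    if bmi < 30:
        return "Overweight"
    return "Obesity"


# Hypothetical record; for a file, use `with open(path) as f: record = json.load(f)`.
record_text = '{"name": "Hypothetical Person", "height": "200", "mass": "80"}'
record = json.loads(record_text)

height_m = float(record["height"]) / 100  # assumed to be in centimeters
mass_kg = float(record["mass"])           # assumed to be in kilograms

bmi = round(mass_kg / height_m ** 2, 1)
category = bmi_category(bmi)
print(bmi, category)
```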

In [1]:




XML

3. (20 points)

Read the XML file in data/s01/patient.xml and find all the unique FHIR tags used. FHIR tags start with {http://hl7.org/fhir}.

Note: Use the xml.etree package.
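One possible way to collect namespaced tags with `xml.etree.ElementTree` is sketched below. The XML snippet is a small hypothetical stand-in, not the contents of patient.xml; for a file, `ET.parse(path).getroot()` would replace `ET.fromstring`.

```python
import xml.etree.ElementTree as ET

FHIR_NS = "{http://hl7.org/fhir}"

# Hypothetical FHIR-like document used only for illustration.
sample_xml = """<Patient xmlns="http://hl7.org/fhir">
  <id value="example"/>
  <name>
    <family value="Doe"/>
    <given value="Jane"/>
    <given value="Q"/>
  </name>
</Patient>"""

root = ET.fromstring(sample_xml)

# root.iter() walks the root and every descendant; a set removes duplicates.
fhir_tags = {
    el.tag
    for el in root.iter()
    if isinstance(el.tag, str) and el.tag.startswith(FHIR_NS)
}

# Strip the namespace prefix for a more readable listing.
short_names = sorted(tag[len(FHIR_NS):] for tag in fhir_tags)
print(short_names)
```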

In [1]:




Time series (structured data)

4. (20 points)

Read the worksheet Tourist arrivals in the file data/s01/touristexp.xls into a pandas data frame. Drop any rows with missing values. Show a table of arrivals to the United States where the rows are the Region of origin and the columns are years.
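The drop, filter and reshape steps might look like the sketch below. The layout of the worksheet is not described here, so the data frame and all of its column names (`Country`, `Region of origin`, `Year`, `Arrivals`) are hypothetical; the real sheet may need different column names or a different reshaping call.

```python
import pandas as pd

# Hypothetical long-format data standing in for the worksheet. For the real file,
# something like pd.read_excel(path, sheet_name="Tourist arrivals") would be used
# (reading .xls files typically requires the xlrd package).
df = pd.DataFrame(
    {
        "Country": ["United States"] * 4 + ["Canada"] * 2,
        "Region of origin": ["Europe", "Europe", "Asia", "Asia", "Europe", "Europe"],
        "Year": [2005, 2010, 2005, 2010, 2005, 2010],
        "Arrivals": [10.0, 12.0, 5.0, None, 3.0, 4.0],
    }
)

clean = df.dropna()                                # drop rows with any missing value
us = clean[clean["Country"] == "United States"]    # keep one destination country

# Rows become regions of origin, columns become years.
table = us.pivot(index="Region of origin", columns="Year", values="Arrivals")
print(table)
```

Because the Asia row for 2010 was dropped, that cell appears as NaN in the pivoted table.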

In [1]:




Image

5. (20 points)

Use the imread function to read in the JPG image data/s01/pony.jp as a numpy array. What are the dimensions of the array? Display the image using matplotlib. Set all values in the red channel to 0. Redisplay the image. Make the rectangular region with width between 300 and 400 pixels and height between 200 and 300 pixels black. Redisplay the image.

Note: In NumPy indexing, the first dimension corresponds to rows, while the second corresponds to columns, with the origin on the top-left corner.
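The indexing convention in the note can be illustrated on a synthetic array. This sketch assumes an RGB image stored as (rows, columns, channels) with the red channel at index 0; the all-white 400 by 500 array is made up and stands in for whatever `imread` returns.

```python
import numpy as np

# Synthetic all-white RGB image: 400 rows (height), 500 columns (width), 3 channels.
# With matplotlib, the real array would come from plt.imread(path) and be shown
# with plt.imshow(img).
img = np.full((400, 500, 3), 255, dtype=np.uint8)
print(img.shape)

# Work on a copy, in case the array returned by an image reader is read-only.
img = img.copy()

# Zero the red channel (assumed to be channel 0) everywhere.
img[:, :, 0] = 0

# Height maps to rows (first axis), width maps to columns (second axis).
img[200:300, 300:400, :] = 0
```

The slice `200:300` excludes row 300, following the usual half-open convention; whether the endpoints should be included is a choice left to the reader.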

In [1]:




Genomics data

6. (10 points)

Join the first 10 lines containing sequence data of the E. coli genome found in the FASTA file data/s01/ecoli.fna into a single string. Note that header lines start with ‘>’ in the FASTA format. Print the reverse complement of the joined sequence in lines of length 80.

Use textwrap to format the fixed-width output.
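A minimal sketch of the join, reverse-complement and wrap steps is shown below on a tiny made-up record. The helper names and the sample sequence are hypothetical, and the translation table assumes only the bases A, C, G and T appear; a file object opened on the FASTA file could be passed in place of the list of lines.

```python
import textwrap
from itertools import islice

# Translation table for complementing bases (assumes only A, C, G, T).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def first_sequence_lines(lines, n: int) -> str:
    """Join the first `n` non-empty, non-header lines into one string."""
    seq_lines = (
        line.strip()
        for line in lines
        if line.strip() and not line.startswith(">")
    )
    return "".join(islice(seq_lines, n))


# Hypothetical FASTA content; only the first 2 sequence lines are used here.
sample_fasta = [">hypothetical_record\n", "AACCGGTT\n", "ATGC\n", "GGGA\n"]

joined = first_sequence_lines(sample_fasta, 2)
rc = reverse_complement(joined)

# textwrap.fill breaks a long unbroken string at the given width.
wrapped = textwrap.fill(rc, width=5)
print(wrapped)
```

A width of 5 is used only to keep the example short; the same call with `width=80` gives 80-character lines.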

In [1]: