Assignment 2: Data formats
This is a review of basic Python from BIOS 821 as well as practice manipulating data formats common in medical data.
Text
1. (10 points)
Read the text file data/s01/alice.txt and count the number of occurrences of ‘Alice’ in the text.
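As a minimal sketch of the counting step, the snippet below works on a short in-memory string rather than the actual file. The sample text and the helper name `count_word` are hypothetical. It assumes that a case-sensitive substring count is what is wanted, so ‘Alice’s’ is counted as well.

```python
# Hypothetical sample text standing in for the contents of the text file.
# For a real file, the string could be obtained with
# pathlib.Path(path).read_text() instead.
sample_text = (
    "Alice was beginning to get very tired. "
    "'Oh dear!' thought Alice. Alice's sister read on."
)


def count_word(text: str, word: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of word in text."""
    return text.count(word)


alice_count = count_word(sample_text, "Alice")
print(alice_count)
```

Because `str.count` matches substrings, a stricter whole-word count would need a different approach, such as a regular expression with word boundaries.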
In [1]:
JSON
2. (20 points)
Use curl to download the file at https://swapi.co/api/people/1/ as luke.json in data/s01. Find the body mass index (BMI) of Luke Skywalker, rounded to 1 decimal place, and print the BMI category for Luke.
BMI Categories:
- Underweight = <18.5
- Normal weight = 18.5–24.9
- Overweight = 25–29.9
- Obesity = BMI of 30 or greater
Note: Use the json package.
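The BMI calculation and category lookup can be sketched as follows. The record is hypothetical and is parsed from a string instead of the downloaded file. The field names `height` and `mass`, their units (centimetres and kilograms) and their storage as strings are assumptions about the JSON layout, so check them against the actual file.

```python
import json

# Hypothetical record; the field names and units are assumptions.
sample_json = '{"name": "Example Person", "height": "175", "mass": "70"}'


def bmi_category(bmi: float) -> str:
    """Map a BMI value to the categories listed in the exercise."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal weight"
    if bmi < 30:
        return "Overweight"
    return "Obesity"


person = json.loads(sample_json)
height_m = float(person["height"]) / 100  # centimetres to metres
mass_kg = float(person["mass"])

# BMI is mass in kilograms divided by the square of height in metres.
bmi = round(mass_kg / height_m ** 2, 1)
category = bmi_category(bmi)
print(bmi, category)
```

For a file on disk, `json.load` on an open file handle would replace `json.loads`.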
In [1]:
XML
3. (20 points)
Read the XML file in data/s01/patient.xml and find all the unique FHIR tags used. FHIR tags start with {http://hl7.org/fhir}.
Note: Use the xml.etree package.
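A minimal sketch of collecting unique namespaced tags with `xml.etree.ElementTree` is shown below. The XML document is a small hypothetical fragment, not the contents of the actual patient file. Elements in a default namespace are reported by ElementTree with the namespace URI in braces before the local name.

```python
import xml.etree.ElementTree as ET

FHIR_NS = "{http://hl7.org/fhir}"

# Hypothetical FHIR-like fragment used only for illustration.
sample_xml = """<Patient xmlns="http://hl7.org/fhir">
  <id value="example"/>
  <name>
    <family value="Doe"/>
    <given value="Jane"/>
    <given value="Ann"/>
  </name>
  <gender value="female"/>
</Patient>"""

root = ET.fromstring(sample_xml)

# root.iter() walks the root element and all of its descendants;
# the set comprehension removes duplicates such as the repeated <given>.
unique_tags = sorted(
    {element.tag for element in root.iter() if element.tag.startswith(FHIR_NS)}
)

for tag in unique_tags:
    print(tag)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.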
In [1]:
Time series (structured data)
4. (20 points)
Read the worksheet Tourist arrivals in the file data/s01/touristexp.xls into a pandas data frame. Drop any rows with missing values. Show a table of arrivals to United States where the rows are the Region or origin and the columns are years.
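The drop, filter and reshape steps can be sketched on a small hypothetical data frame. The layout of the real worksheet is not known here, so the column names `Country`, `Region`, `Year` and `Arrivals` and the long format are assumptions. The worksheet itself would typically be loaded with `pd.read_excel(path, sheet_name=...)`.

```python
import pandas as pd

# Hypothetical long-format records; column names and values are made up.
records = pd.DataFrame(
    {
        "Country": ["United States"] * 4 + ["Canada"] * 2,
        "Region": ["Europe", "Europe", "Asia", "Asia", "Europe", "Asia"],
        "Year": [2015, 2016, 2015, 2016, 2015, 2016],
        "Arrivals": [10.0, 12.0, 7.0, None, 3.0, 4.0],
    }
)

# Drop every row that has at least one missing value.
clean = records.dropna()

# Keep one destination, then reshape: one row per region, one column per year.
us = clean[clean["Country"] == "United States"]
table = us.pivot(index="Region", columns="Year", values="Arrivals")
print(table)
```

In this sketch the dropped Asia 2016 row reappears as a missing cell after pivoting, because the table still needs a value for every region and year combination.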
In [1]:
Image
5. (20 points)
Use the imread function to read in the JPG image data/s01/pony.jp as a numpy array. What are the dimensions of the array? Display the image using matplotlib. Set all values in the red channel to 0. Redisplay the image. Make the region in the rectangle with width between 300 and 400 pixels and height between 200 and 300 pixels black. Redisplay the image.
Note: In NumPy indexing, the first dimension corresponds to rows, while the second corresponds to columns, with the origin on the top-left corner.
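The indexing convention in the note can be illustrated on a synthetic array instead of the actual image. This sketch assumes an RGB layout of shape (rows, columns, channels) with red as channel 0. The image size and pixel values are hypothetical, and displaying the result (for example with matplotlib's `imshow`) is left out.

```python
import numpy as np

# Hypothetical uniform grey image: 400 rows (height) by 500 columns (width).
image = np.full((400, 500, 3), 200, dtype=np.uint8)
print(image.shape)

# Zero the red channel, assumed to be index 0 on the last axis.
no_red = image.copy()
no_red[:, :, 0] = 0

# Height maps to rows (first axis) and width to columns (second axis),
# so height 200-300 and width 300-400 becomes [200:300, 300:400].
blacked = no_red.copy()
blacked[200:300, 300:400, :] = 0
```

Working on copies keeps the earlier stages available for side-by-side display.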
In [1]:
Genomics data
6. (10 points)
Join the first 10 lines containing sequence data of the E. coli genome found in FASTA file data/s01/ecoli.fna into a single string. Note that header lines start with ‘>’ in the FASTA format. Print the reverse complement of the joined sequence in lines of length 80.
Use textwrap to format the fixed width output.
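A minimal sketch of the join, reverse-complement and wrapping steps follows, using a short hypothetical FASTA record held in a string rather than the genome file. The helper name `reverse_complement` is illustrative, and only the bases A, C, G and T (in either case) are handled.

```python
import textwrap

# Hypothetical FASTA content; the real file would be read line by line.
sample_fasta = """>seq1 hypothetical record
AGCTTTTCATTCTGACTGCA
ACGGGCAATATGTCTCTGTG
"""

# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the whole string."""
    return sequence.translate(COMPLEMENT)[::-1]


# Skip header lines (starting with '>') and blank lines.
sequence_lines = [
    line.strip()
    for line in sample_fasta.splitlines()
    if line.strip() and not line.startswith(">")
]

joined = "".join(sequence_lines[:10])  # at most the first 10 sequence lines
revcomp = reverse_complement(joined)

# textwrap.fill breaks the unspaced string into lines of at most 80 characters.
print(textwrap.fill(revcomp, width=80))
```

Since a sequence contains no spaces, `textwrap.fill` relies on its default behaviour of breaking long words at the requested width.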
In [1]: