Data Science entry test (100 points total)

1. Write a Unix statement to assign the string “Hello” to the variabel foo and another statement to display foo on standard output (5 points)

In [1]:

2. Write a Unix command to finds all files containing the word “foo” recursively within the current directory (5 points)

In [1]:

3. What command(s) would you give to make the directories a/b/c with a in the parent directory? (5 points)

In [1]:

4. Which a regular expression that matches abc-123-XYZ and uvwxyz-6789-ABCDEFG but not abc-67.28-XYZ or ABC-123-XYZ or abc-123- (5 points)

In [1]:

5. Write a Unix command to download the file https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/BOD.csv without using a browser. Show the first 3 lines of the file. (5 points)

In [1]:

6. Order the following complexities from smallest to largest: (5 points)

\(O(n^2), O(2^n), O(n \log n), O(n!), O(n)\)
O(n) - smallest
O(n log n)
O(n^2)
O(2^n)
O(n!) - largest

7. Write a function to find the sum of a collection of \(n\) numbers using recursion. Any language that supports recursion can be used in your answer (name the language used). Explain if your solution can benefit from tail-call optimization, and whether the language used supports this compiler optimization. You may use pseudocode if this cannot be easily done in a language that you are familiar with. Show that the function gives the correct sum for the collection (3,1,2,4). (10 points)

In [1]:

8. A hash table consists of a set of numbered buckets and a hash function that maps an item to a bucket (by generating a number). For example, if hash("hello") returns 3, then the string “hello” would be “stored” in bucket 3. Write a hash function that is capable of mapping any word to one of 8 buckets in any language (name the language used). Find the bucket number each word in the sentence “the quick brown fox jumps over the lazy dog” hashes to. You may use pseudocode if this cannot be easily done in a language that you are familiar with. (5 points)

In [1]:

9. For each of the following properties, name an appropriate data structure (5 points)

  • First-In-First-Out (FIFO) semantics
  • Last-In-First-Out (LIFO) semantics
  • Maintains sort order
  • Random access takes constant time (or amortized constant time)
  • Keeps only unique entires (no duplicates)

Solution

- First-In-First-Out (FIFO) semantics => queue
- Last-In-First-Out (LIFO) semantics => stack
- Maintains sort order => heap
- Random access takes constant time (or amortized constant time) => array or hash table
- Keeps only unique entires (no duplicates) => set

10. Give a one sentence description of each of the following: (5 points)

  • git
  • docker
  • make
  • literate programming
  • unit test

Solution

- `git` => distributed version control system
- `docker` => container for reproducible computing environments
- `make` => build system; part of Unix system
- `literate programming` => interweaving comments and executable code
- `unit test` => practice of checking functions using test cases with known answer

11. Give an example of a language that matches each term (5 points)

  • Domain-specific
  • Functional
  • Object-oriented
  • Interpreted
  • Static typing

Solution

- Domain-specific => SQL
- Functional => Scala
- Object-oriented => Java
- Interpreted => Python
- Static typing => C

12. Express the decimal number 3.125 as a binary number. Show the steps taken (5 points)

Solution

\(3.125 = 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}\)

So binary representation is 11.001

13. When multiplying many (non-zero) probabilities, it is advisable to first perform a log transform. Explain why. (5 points)

Solution

Probabilities are between 0 and 1. So the multiplication of many probabilities is likely to be a very small number. As computers store numbers with finite precision, there is a risk of rounding to zero or underflow. By contrast, the log transform of a small number between 0 and 1 is a (possibly large) negative number. The product of numbers is the same as the sum of the log numbers. So there is no risk of underflow.

14. Write a program that prints the numbers from 1 to 100. But for multiples of three print “Data” instead of the number and for the multiples of five print “Science”. For numbers which are multiples of both three and five print “Data Science”. (Name the language used) (10 points)

In [1]:

15. Implement the following algorithm (pseudocode assumes a language with 0-based arrays) and use it to sort the numbers 3,1,4,5,2 in ascending order. What is the Big O complexity of this sorting algorithm? (10 points)

procedure my_sort( A : list of sortable items )
   n = length(A)
   repeat
     swapped = false
     for i = 1 to n-1 inclusive do
       /* if this pair is out of order */
       if A[i-1] > A[i] then
         /* swap them and remember something changed */
         swap( A[i-1], A[i] )
         swapped = true
       end if
     end for
   until not swapped
end procedure
In [1]:

16. Suppose the population of Durham is 100,000 and the population of Chapel Hill is 200,000. Every year, 10% of the people in Durham move to Chapel Hill, and 90% stay in Durham; 20% of the people in Chapel Hill move to Durham and 80% stay in Chapel Hill. Assuming no births, deaths or immigration/emigration, what is the population of Durham after many, many years? (10 points)

In [1]: