Data Science entry test (100 points total)¶
1. Write a Unix statement to assign the string “Hello” to the
variabel foo
and another statement to display foo
on standard
output (5 points)
In [1]:
2. Write a Unix command to finds all files containing the word “foo” recursively within the current directory (5 points)
In [1]:
3. What command(s) would you give to make the directories a/b/c
with a
in the parent directory? (5 points)
In [1]:
4. Which a regular expression that matches abc-123-XYZ
and
uvwxyz-6789-ABCDEFG
but not abc-67.28-XYZ
or ABC-123-XYZ
or
abc-123-
(5 points)
In [1]:
5. Write a Unix command to download the file https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/BOD.csv without using a browser. Show the first 3 lines of the file. (5 points)
In [1]:
6. Order the following complexities from smallest to largest: (5 points)
\(O(n^2), O(2^n), O(n \log n), O(n!), O(n)\)
O(n) - smallest
O(n log n)
O(n^2)
O(2^n)
O(n!) - largest
7. Write a function to find the sum of a collection of \(n\) numbers using recursion. Any language that supports recursion can be used in your answer (name the language used). Explain if your solution can benefit from tail-call optimization, and whether the language used supports this compiler optimization. You may use pseudocode if this cannot be easily done in a language that you are familiar with. Show that the function gives the correct sum for the collection (3,1,2,4). (10 points)
In [1]:
8. A hash table consists of a set of numbered buckets and a hash
function that maps an item to a bucket (by generating a number). For
example, if hash("hello")
returns 3, then the string “hello” would
be “stored” in bucket 3. Write a hash function that is capable of
mapping any word to one of 8 buckets in any language (name the language
used). Find the bucket number each word in the sentence “the quick brown
fox jumps over the lazy dog” hashes to. You may use pseudocode if this
cannot be easily done in a language that you are familiar with. (5
points)
In [1]:
9. For each of the following properties, name an appropriate data structure (5 points)
- First-In-First-Out (FIFO) semantics
- Last-In-First-Out (LIFO) semantics
- Maintains sort order
- Random access takes constant time (or amortized constant time)
- Keeps only unique entires (no duplicates)
Solution
- First-In-First-Out (FIFO) semantics => queue
- Last-In-First-Out (LIFO) semantics => stack
- Maintains sort order => heap
- Random access takes constant time (or amortized constant time) => array or hash table
- Keeps only unique entires (no duplicates) => set
10. Give a one sentence description of each of the following: (5 points)
git
docker
make
literate programming
unit test
Solution
- `git` => distributed version control system
- `docker` => container for reproducible computing environments
- `make` => build system; part of Unix system
- `literate programming` => interweaving comments and executable code
- `unit test` => practice of checking functions using test cases with known answer
11. Give an example of a language that matches each term (5 points)
- Domain-specific
- Functional
- Object-oriented
- Interpreted
- Static typing
Solution
- Domain-specific => SQL
- Functional => Scala
- Object-oriented => Java
- Interpreted => Python
- Static typing => C
12. Express the decimal number 3.125 as a binary number. Show the steps taken (5 points)
Solution
\(3.125 = 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}\)
So binary representation is 11.001
13. When multiplying many (non-zero) probabilities, it is advisable to first perform a log transform. Explain why. (5 points)
Solution
Probabilities are between 0 and 1. So the multiplication of many probabilities is likely to be a very small number. As computers store numbers with finite precision, there is a risk of rounding to zero or underflow. By contrast, the log transform of a small number between 0 and 1 is a (possibly large) negative number. The product of numbers is the same as the sum of the log numbers. So there is no risk of underflow.
14. Write a program that prints the numbers from 1 to 100. But for multiples of three print “Data” instead of the number and for the multiples of five print “Science”. For numbers which are multiples of both three and five print “Data Science”. (Name the language used) (10 points)
In [1]:
15. Implement the following algorithm (pseudocode assumes a language
with 0-based arrays) and use it to sort the numbers 3,1,4,5,2
in
ascending order. What is the Big O complexity of this sorting algorithm?
(10 points)
procedure my_sort( A : list of sortable items )
n = length(A)
repeat
swapped = false
for i = 1 to n-1 inclusive do
/* if this pair is out of order */
if A[i-1] > A[i] then
/* swap them and remember something changed */
swap( A[i-1], A[i] )
swapped = true
end if
end for
until not swapped
end procedure
In [1]:
16. Suppose the population of Durham is 100,000 and the population of Chapel Hill is 200,000. Every year, 10% of the people in Durham move to Chapel Hill, and 90% stay in Durham; 20% of the people in Chapel Hill move to Durham and 80% stay in Chapel Hill. Assuming no births, deaths or immigration/emigration, what is the population of Durham after many, many years? (10 points)
In [1]: