Monte Carlo Integration, numba and pyspark

Instructions

Write code to solve all problems. The grading rubric includes the following criteria:

  • Correctness
  • Readability
  • Efficiency

Please do not copy answers found on the web or elsewhere as it will not benefit your learning. Searching the web for general references etc. is OK. Some discussion with friends is fine too - but again, do not just copy their answer.

Honor Code: By submitting this assignment, you certify that this is your original work.

Exercise 1 (25 points)

Use simple Monte Carlo integration to estimate the integral of the function

f(x) = x \cos 71x + \sin 13x, \ \ 0 \le x \le 1
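
For reference, the simple Monte Carlo estimator averages f over n independent uniform draws on [0, 1]. The symbols x_i, n, \hat{I}_n and \sigma_f below are notation introduced here, not taken from the assignment text.

```latex
\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} f(x_i) \approx \int_0^1 f(x)\,dx,
\qquad x_i \sim \mathrm{Uniform}(0, 1),
\qquad \mathrm{SE}(\hat{I}_n) = \frac{\sigma_f}{\sqrt{n}}
```

Here \sigma_f is the standard deviation of f(x) under the uniform distribution, so the error shrinks like 1/\sqrt{n} regardless of how the n samples are split across processes.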

Python code to do this is provided.

Write parallel code to speed up this calculation, using either ProcessPoolExecutor from concurrent.futures or multiprocessing, with as many cores as are available. Calculate the speed-up relative to the single-process version.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x * np.cos(71*x) + np.sin(13*x)
In [2]:
x = np.linspace(0, 1, 100)
plt.plot(x, f(x))
pass
[Figure: plot of f(x) for 0 ≤ x ≤ 1]
In [6]:
%%time
n = 10000000
x = f(np.random.random(n))
y = 1.0/n * np.sum(x)
print(y)
0.0206429660391
CPU times: user 1.12 s, sys: 294 ms, total: 1.41 s
Wall time: 1.41 s
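
The serial timing above is the baseline for the speed-up. One possible way to organize a parallel version is sketched below; it is not a reference solution. The names `partial_sum` and `mc_parallel`, the fixed seed 0, and the even split of samples across workers are all illustrative assumptions. Each worker gets its own generator from `SeedSequence.spawn` so the random streams do not overlap.

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def f(x):
    return x * np.cos(71 * x) + np.sin(13 * x)


def partial_sum(args):
    # Sum f over n uniform draws on [0, 1) using the generator seeded by `seed`.
    seed, n = args
    rng = np.random.default_rng(seed)
    return np.sum(f(rng.random(n)))


def mc_parallel(n, workers):
    # Split n samples as evenly as possible across the worker processes.
    sizes = [n // workers + (i < n % workers) for i in range(workers)]
    # Independent child seeds, one per worker (seed 0 is an arbitrary choice).
    seeds = np.random.SeedSequence(0).spawn(workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(partial_sum, zip(seeds, sizes)))
    return total / n


if __name__ == "__main__":
    n = 10_000_000
    workers = os.cpu_count() or 1

    start = time.perf_counter()
    serial = partial_sum((np.random.SeedSequence(0), n)) / n
    t_serial = time.perf_counter() - start

    start = time.perf_counter()
    parallel = mc_parallel(n, workers)
    t_parallel = time.perf_counter() - start

    print(f"serial   {serial:.5f} in {t_serial:.2f} s")
    print(f"parallel {parallel:.5f} in {t_parallel:.2f} s")
    print(f"speed-up {t_serial / t_parallel:.2f}x with {workers} workers")
```

The measured speed-up is usually below the number of cores, because starting the processes and returning the partial sums adds overhead that the serial version does not pay.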
In [ ]:




Exercise 2 (25 points)

Write a numba GUFunc to calculate \(Ax + b\), where \(A\) is an \(m \times n\) matrix, \(x\) is an \(n \times 1\) vector, and \(b\) is an \(m \times 1\) vector. Show that it works by applying it to the following data set. The same operation done without GUFuncs is given.

In [14]:
m = 5
n = 4
k = 10

A = np.random.random((k,m,n))
x = np.random.random((k,n))
b = np.random.random((k,m))

for i in range(k):
    print(np.ravel(A[i] @ x[i] + b[i]))
[ 0.58652194  1.52954634  1.31413025  0.81224489  1.3229399 ]
[ 1.33266914  2.39131731  1.76424481  2.18475058  2.34178884]
[ 0.72361559  1.55917151  0.26459967  0.92263731  0.23204147]
[ 0.89639604  2.17564338  1.12019565  1.2042532   1.13893343]
[ 2.07707642  2.25804084  1.01482411  0.39107074  1.08606996]
[ 1.19304489  1.36529161  0.88697774  1.13022133  0.45933287]
[ 1.34251888  1.59933455  1.630729    2.0700424   1.382506  ]
[ 1.74398981  1.66932255  1.63371065  1.88526005  2.41578891]
[ 1.65004242  1.41414463  0.97330151  1.03208207  1.68916298]
[ 1.47932596  2.7067318   1.90861838  2.50126424  2.17164071]
In [ ]:




Exercise 3 (50 points)

Write a pyspark program to find the top 10 words in the English Wikipedia dump, using only articles from the directories that begin with C. Convert words to lowercase, strip all punctuation, and exclude strings that consist entirely of digits. Also exclude the following stop words:

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your

Note: The dataset can be found in Sakai Resources in the folder Wiki Data Set as the zipped file wiki_C.zip. It is almost 1 GB compressed, so it might take a while to download.
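
The cleaning rules above can be sketched as a plain Python function that is then passed to `flatMap`. This is only one possible arrangement: the names `tokenize` and `top_words` are hypothetical, `sc` is assumed to be an existing SparkContext, the path pattern in the usage comment is a guess at how the unzipped data is laid out, and "punctuation" is taken to mean ASCII punctuation from `string.punctuation`.

```python
import string
from operator import add

STOP_WORDS = frozenset(
    "a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".split(",")
)

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(line):
    # Lowercase, split on whitespace, strip punctuation, then drop
    # empty strings, all-digit strings, and stop words.
    words = (w.translate(_PUNCT) for w in line.lower().split())
    return [w for w in words if w and not w.isdigit() and w not in STOP_WORDS]


def top_words(sc, path, n=10):
    # Word count over every file matching `path`, keeping the n most frequent.
    return (
        sc.textFile(path)
        .flatMap(tokenize)
        .map(lambda w: (w, 1))
        .reduceByKey(add)
        .takeOrdered(n, key=lambda kv: -kv[1])
    )


# Hypothetical usage, assuming the archive unpacks into directories starting with C:
# top_words(sc, "wiki_C/C*/*")
```

Keeping `tokenize` free of Spark calls means it can be checked on a few sample strings before running the full job.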

In [ ]: