Monte Carlo Integration, numba and pyspark
Instructions
Write code to solve all problems. The grading rubric includes the following criteria:
- Correctness
- Readability
- Efficiency
Please do not copy answers found on the web or elsewhere, as doing so will not benefit your learning. Searching the web for general references is OK. Some discussion with friends is fine too, but again, do not simply copy their answers.
Honor Code: By submitting this assignment, you certify that this is your original work.
Exercise 1 (25 points)
Use simple Monte Carlo integration to estimate the integral \(\int_0^1 f(x)\,dx\), where \(f(x) = x \cos(71x) + \sin(13x)\). Python code to do this is provided.
Write parallel code to speed up this calculation using ProcessPoolExecutor from concurrent.futures or the multiprocessing module, with as many cores as are available. Calculate the speed-up relative to the single-processor version.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x * np.cos(71*x) + np.sin(13*x)
In [2]:
x = np.linspace(0, 1, 100)
plt.plot(x, f(x))
pass
In [6]:
%%time
n = 10000000
x = f(np.random.random(n))
y = 1.0/n * np.sum(x)
print(y)
0.0206429660391
CPU times: user 1.12 s, sys: 294 ms, total: 1.41 s
Wall time: 1.41 s
In [ ]:
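A minimal sketch of one possible parallel version with ProcessPoolExecutor is shown below. The mc_chunk worker, the chunking scheme, and the per-worker seeding with NumPy's default_rng are illustrative choices, not part of the provided code, and the sketch assumes a fork-based start method (e.g. Linux) so a worker defined in the notebook is visible to the child processes. The speed-up is the serial wall time divided by the parallel wall time measured on the same machine.

import multiprocessing as mp
import time
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def mc_chunk(seed, n_samples):
    # Independent RNG per worker so the chunks use different random draws.
    rng = np.random.default_rng(seed)
    u = rng.random(n_samples)
    return np.sum(u * np.cos(71*u) + np.sin(13*u))

n = 10000000
n_workers = mp.cpu_count()      # use all available cores
chunk = n // n_workers

start = time.time()
with ProcessPoolExecutor(max_workers=n_workers) as pool:
    partial_sums = list(pool.map(mc_chunk, range(n_workers), [chunk] * n_workers))
estimate = sum(partial_sums) / (chunk * n_workers)
parallel_time = time.time() - start

print(estimate, parallel_time)
# Speed-up is the single-processor wall time divided by parallel_time.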
Exercise 2 (25 points)
Write a numba GUFunc to calculate \(Ax + b\), where \(A\) is an \(m \times n\) matrix, \(x\) is an \(n \times 1\) vector, and \(b\) is an \(m \times 1\) vector. Show that it works by applying it to the following data sets. The operation done without using GUFuncs is given.
In [14]:
m = 5
n = 4
k = 10
A = np.random.random((k,m,n))
x = np.random.random((k,n))
b = np.random.random((k,m))
for i in range(k):
    print(np.ravel(A[i] @ x[i] + b[i]))
[ 0.58652194 1.52954634 1.31413025 0.81224489 1.3229399 ]
[ 1.33266914 2.39131731 1.76424481 2.18475058 2.34178884]
[ 0.72361559 1.55917151 0.26459967 0.92263731 0.23204147]
[ 0.89639604 2.17564338 1.12019565 1.2042532 1.13893343]
[ 2.07707642 2.25804084 1.01482411 0.39107074 1.08606996]
[ 1.19304489 1.36529161 0.88697774 1.13022133 0.45933287]
[ 1.34251888 1.59933455 1.630729 2.0700424 1.382506 ]
[ 1.74398981 1.66932255 1.63371065 1.88526005 2.41578891]
[ 1.65004242 1.41414463 0.97330151 1.03208207 1.68916298]
[ 1.47932596 2.7067318 1.90861838 2.50126424 2.17164071]
In [ ]:
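One possible sketch using numba's guvectorize decorator follows. The name gu_ax_plus_b and the explicit inner loops are my own choices; the layout string declares the (m,n),(n),(m)->(m) core operation, so the extra leading k dimension of the arrays above is broadcast automatically.

from numba import float64, guvectorize

@guvectorize([(float64[:, :], float64[:], float64[:], float64[:])],
             '(m,n),(n),(m)->(m)')
def gu_ax_plus_b(A, x, b, out):
    # Core operation on one sample: out = A @ x + b.
    m, n = A.shape
    for i in range(m):
        acc = b[i]
        for j in range(n):
            acc += A[i, j] * x[j]
        out[i] = acc

# The gufunc broadcasts over the leading k dimension of A, x and b,
# so a single call reproduces the loop over k shown above.
print(gu_ax_plus_b(A, x, b))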
Exercise 3 (50 points)
Write a pyspark program to find the top 10 words in the English Wikipedia dump, using only articles from the directories that begin with C. Words should be converted to lowercase, stripped of all punctuation, and strings consisting entirely of numbers should be excluded. Exclude the following stop words:
a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your
Note: The dataset can be found in Sakai Resources in the folder Wiki Data Set as the zipped file wiki_C.zip. It is almost 1 GB compressed, so it might take a while to download.
In [ ]:
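A rough sketch of one possible pyspark approach is below. The glob pattern in PATH is an assumption about where and how the unzipped wiki_C.zip is laid out (directories beginning with C), and the tokenize helper is my own; only the stop-word list is taken from the problem statement.

import string
from pyspark import SparkContext

# Stop words copied from the problem statement.
STOP_WORDS = set('a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your'.split(','))

# Hypothetical location of the unzipped data; adjust to the actual path.
PATH = 'wiki_C/C*/*'

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tokenize(line):
    # Lowercase, strip punctuation, drop pure numbers and stop words.
    words = line.lower().translate(PUNCT_TABLE).split()
    return [w for w in words if w and not w.isdigit() and w not in STOP_WORDS]

sc = SparkContext.getOrCreate()

top10 = (sc.textFile(PATH)
           .flatMap(tokenize)
           .map(lambda w: (w, 1))
           .reduceByKey(lambda a, b: a + b)
           .takeOrdered(10, key=lambda kv: -kv[1]))
print(top10)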