©2019 Raazesh Sainudiin. Attribution 4.0 International (CC BY 4.0)
Fill in your Personal Number, make sure you pass the # ... Test
cells and
submit by email from your official uu.se
student email account to raazesh.sainudiin
@
math.uu.se
with Subject line YOIYUI001 Assignment 2.
You can submit multiple times before the deadline and your highest score will be used.
KEEP TRACK of and REPORT the said variables in each PROBLEM or up to now to help Raaz calibrate the exam fairly - your report will be used to justify exam-difficulty:
timeToCompleteThisProblemInMinutes
an integer in minutes to report how much time it is taking you to complete this problem on your own (ok to discuss with your mates too; but not in exam)didIAttendOrCatchUpFromMyMatesOnTheSecondLab
is a Boolean True
or False
to indicate whether you attended or caught up on (from your class mates who attended) what was done in the second computer lab.timeSpentOnLearningInThisCourseInHours
an integer in hours to report how much time you have spent on this course up to now.0x.ipynb
notebooks on their own where x
is in $\{5,6,7,8\}$ should report 0
hours here.((5*2/3)/1.5)*37=82.22
hours of full-time study on the course including attending lectures, labs, if you were not too busy with other commitments. You may reserve 2*8=16
hours during study week to revise for this course.You should update (download most up to date versions of) the lecture notebooks 0x.ipynb
notebooks before starting the Assignment.
In [ ]:
# Enter your 12 digit personal number here and evaluate this cell
MyPersonalNumber = 'YYYYMMDDXXXX'
#tests
assert(isinstance(MyPersonalNumber, basestring))
assert(MyPersonalNumber.isdigit())
assert(len(MyPersonalNumber)==12)
# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
didIAttendOrCatchUpFromMyMatesOnTheSecondLab = False # replace False by True, if so
timeSpentOnLearningInThisCourseInHours = 0 # replace 0 by a positive integer if it applies
We have seen how to implement a new iterator (just like a function) but with yield
statement (just like return
in a function). This model of computation is called continuation. This is very useful in combinatorics, especially when combined with recursion (Computational Mathematics with SageMath, SIAM, 2019, p. 346). Below is an iterator called generateWords(alphabet,L)
that can generate all words of of a given length L
on a given alphabet
.
Your task is simple!
L
equalling 3 and then 23 using sum
:Wall time
(see Wall Time, it's just the elapsed real time from the start to end of a computation).%%time
# time for list comprehension to compute the sum of [1,1,1,...,2^23]
sumFromListCom = sum( [ 1 for w in generateWords(['H','T'], 23) ] )
will result in output:
CPU times: user 6.94 s, sys: 200 ms, total: 7.14 s
Wall time: 7.11 s
%%time
# time for generator expression to compute the sum of [1,1,1,...,2^23]
sumFromGenEx = sum( ( 1 for w in generateWords(['H','T'], 23) ) )
will result in output:
CPU times: user 5.51 s, sys: 0 ns, total: 5.51 s
Wall time: 5.52 s
(you may have slightly different numbers for time
and Wall time
based on your machine details at the time of computation).
Multiple-choice Question:
Wall time
for generator expression (genex) smaller that for the list comprehension (listcomp) here? Answer Choices
In [182]:
choiceForProblem0 = 'X' # replace X by A, B or C
Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.
In [184]:
try:
assert(choiceForProblem0 in ['A','B','C'])
print "You have chosen one of the possible options. Hopefully, you are correct."
except AssertionError:
print "Try again. you have to choose between 'A', 'B' and 'C'."
In [186]:
# This cell is to help you make the right choice between A, B and C
def generateWords(alphabet, L):
if L == 0:
yield []
else:
for word in generateWords(alphabet, L-1): # here is the recursion when we cann the iterator again on L-1
for L in alphabet:
yield word + [L]
print [ w for w in generateWords(['H','T'], 3) ] # now call the iterator to find all words of length 3 in ['H','T']
print sum( [ 1 for w in generateWords(['H','T'], 3) ] ) # these words can then be counted by list comprehension
print sum( ( 1 for w in generateWords(['H','T'], 3) ) ) # these words can then be counted by generator expression
print 'The number of words of length 3 from an alphabet of size 2 is 2^3 = ', 2^3 # the above sum`s makes sense
Recall how we downloaded Pride and Prejudice and processed it as a String and split it by Chapter
s. These code snippets are at our disposal now - all we need to do is copy-paste the right set of cells from earlier into the cells below here to have the string from that Book for more refined processing.
Think about what algorithmic constructs and methods one will need to split
each sentence by the English words it contains and then count the number of each distinct word.
Now that you have understood for
loops, list
comprehensions and anonymous function
s, and can learn about the needed methods on strings for splitting (which you can search by adding a .
after a srt
and hitting the Tab
button to look through existing methods and followed by ?
for their docstrings), the dictionary
data structure, and already seen how to count the number of ball labels, you are ready for this problem stated below. If you attended the lab then you have an advantage if you tried to work on this with some help from your instructors.
Problem: Process the English words in a text file, such as those in the book Pride and Prejudice by Jane Austin, and obtain the top K
most frequent words that are longer than a given parameter wordLongerThan
which can be any value in $\mathbb{Z}_+ := \{ 0, 1, 2, 3, 4, \ldots \}$, i.e., words that are longer than wordLongerThan
many characters in length.
Your function must be generic and named as follows including input parameter order and names:
frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K)
This function must be capable of:
data/
directory that can be passed as the parameter thisTextFile
dict
type whose:wordlongerThan
and dict
should only contain the top K
most frequent words longer than wordLongerThan
and be already sorted in descending order of in frequency.Use the next cell to submit your answer and for rough-work use more cells as needed in order to copy-paste code snippets from earlier content to get this working. But please remove the cells for rough-work when done.
Note: that you may not import libraries that have not been introduced in the course so far.
In [19]:
# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
timeToCompleteThisProblemInMinutes = 0 # replace 0 by a positive integer if it applies
# Do NOT change the name of the function and names of paramaters !
thisTextFile = 'data/someTextFilename' # try a text file in data/ directory
wordLongerThan = 0 # this can be any larger integer also
K = 20 # this can be any integer larger than 0 also
def frequencyOftheKMostCommonWordsIn(thisTextFile, wordLongerThan, K):
'''explain what the function is supposed to do briefly'''
# write the body of the function and replace 'None' with the correct return value
# ...
# ...
return None
Recall the problem above on counting the number of votes by party across all of Sweden from the Swedish 2018 National Election Data.
Your task is to adapt the code snippets there and others we have encountered thus far to count the total number of votes by each district and return a list
of Integers
giving the number of votes for the top K
districts with the most votes. Your function numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', K)
should work for any valid integer K
.
Note: that you may not import libraries that have not been introduced in the course so far.
unzip issues: If you are unable to call unzip final.csv.zip
on your windows laptop. You can either do it in the computer lab or do the following with internet access to download the large final.csv
file from the internet:
%%sh
cd data
curl -O http://lamastex.org/datasets/public/elections/2018/sv/final.csv
Then you should have the needed data/final.csv
file.
In [25]:
# Report these variables so the exam can be calibrated fairly - your report will be used to justify exam-difficulty
timeToCompleteThisProblemInMinutes = 0 # replace 0 by a positive integer if it applies
# Do NOT change the name of the function and names of paramaters !
K = 20 # this can be any integer larger than 0 also, change K and make sure your function works
filename = 'data/final.csv' # this has to be a csv file with the same structure as out final.csv
def numberOfVotesInKMostVotedDistrictsInSE(filename, K):
'''explain what the function is supposed to do briefly'''
# write the body of the function and replace 'None' with the correct return value
# ...
# ...
return None
Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.
In [27]:
# test that your answer is indeed a probability by evaluating this cell after you replaced XXX above and evaluated it.
try:
assert(numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', 3) == [13435, 10625, 7910])
assert(numberOfVotesInKMostVotedDistrictsInSE('data/final.csv', 1) == [13435])
print("Your answer is correct for two test cases with K=3 and K=1. Hopefully it works correctly for any K")
except AssertionError:
print("Try again! and make sure you are actually producing what is expected of you.")
A disadvantage of using list comprehension is that we cannot create a lot of random numbers as we will have to store the returned list. Since you know about generators your task is to use the following warm-up on generating natural numbers and write an iterator version called lcg
of the function LinConGen
we have been seeing thus far.
In [93]:
def naturals():
'''define the countably infinite set of natural numbers using an iterator'''
n = 1 # the first natural number 1
while True: # an infinite while loop
yield n # output n
n = n + 1 # increment n by 1
In [94]:
# Example run - keep printing the natural numbers using the iterator until we hit 5
for n in naturals():
print(n)
if n >= 5:
break
In [95]:
# printing next from our iterator
generateNaturals = naturals() # let's assign our iterator
print(generateNaturals.next())
print(generateNaturals.next())
In [96]:
zip(naturals(), ['a', 'b', 'c', 'd']) # the second list stops at 4 to give an enumeration that ends
Out[96]:
In [ ]:
# Here is the actual task
# just replace XXX with the right values to make an iterator of function LinConGen
def lcg(m, a, c, x0):
x = XXX
while True:
yield XXX
x = XXX
Evaluate cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.
In [98]:
def linConGen(m, a, c, x0, n):
'''A linear congruential sequence generator.
Param m is the integer modulus to use in the generator.
Param a is the integer multiplier.
Param c is the integer increment.
Param x0 is the integer seed.
Param n is the integer number of desired pseudo-random numbers.
Returns a list of n pseudo-random integer modulo m numbers.'''
x = x0 # the seed
retValue = [Mod(x,m)] # start the list with x=x0
for i in range(2, n+1, 1):
x = Mod(a * x + c, m) # the generator, using modular arithmetic
retValue.append(x) # append the new x to the list
return retValue
try:
m, a, c, x0, n = 2147483648, 65539, 0, 1, 5010
randuListComp = linConGen(m, a, c, x0, n)
randuGenExp = [x for i,x in zip(xrange(n), lcg(m, a, c, x0))]
assert(randuListComp == randuGenExp)
print("It seems like you have passed a test. Hopefully you have the right answer.")
except AssertionError:
print("Try again. You need the output of your iterator lcg to coincide with that of LinConGen")
First read and understand the following simple simulation (originally written by Jenny Harlow). Then you will modify the simulation to find the solution to this problem.
We could use the samplers we have made to do a very simple simulation. Suppose the inter-arrival times, in minutes, of Orbiter buses at an Orbiter stop in Christchurch follows an $Exponential(\lambda = 0.1)$ distribution. Also suppose that this is quite a popular bus stop, and the arrival of people is very predictable: one new person will arrive in each whole minute. This means that the longer another bus takes in coming, the more people arrive to join the queue. Also suppose that the number of free seats available on any bus follows a $de\, Moivre(k=40)$ distribution, i.e, there are equally like to to be 1, or 2, or 3 ... or 40 spare seats. If there are more spare seats than people in the queue, everyone can get onto the bus and nobody is left waiting, but if there are not enough spare seats some people will be left waiting for the next bus. As they wait, more people arrive to join the queue....
This is not very realistic - we would want a better model for how many people arrive at the stop at least, and for the number of spare seats there will be on the bus. However, we are just using this as a simple example that you can do using the random variables you already know how to simulate samples from.
Try to code this example yourself, using our suggested steps. We have put our version the code into a cell below, but you will get more out of this example by trying to do it yourself first.
exponentialSamples
function. Assign the list to a variable named something like busTime
s. These are your 100 simulated bus inter-arrival times. waiting
. leftWaiting
, which to begin with contains just the value assigned to waiting
. boardBus
. busTimes
in turn, i.e. each bus inter-arrival time, and within the for loop:deMoivreFInverse
rather than deMoivreSample
because you only need one value - remember that you will have to pass a simulated $u \thicksim Uniform(0,1)$ to deMoivreFInverse
as well as the value of the parameter $k$).getOnBus
.getOnBus
to the list boardBus
.getOnBus
from the number of people waiting, waiting (e.g., waiting = waiting - getOnBus
will decrement waiting by the number of people who get on the bus).waiting
to the list leftWaiting
. Here is our code to do the bus stop simulation. Yours may be different - maybe it will be better!
You are expected to find the needed functions from the latest notebook this assignment came from and be able to answer this question. Unless you can do it in your head.
In [73]:
def busStopSimulation(buses, lam, seats):
'''A Simple Simulation - see description above!'''
BusTimes = exponentialSample(buses,lam)
waiting = 0 # how many people are waiting at the start of the simulation
BoardBus = [] # empty list
LeftWaiting = [waiting] # list with just waiting in it
for time in BusTimes: # for each bus inter-arrival time
arrivals = floor(time) # people who arrive at the stop before the bus gets there
waiting = waiting + arrivals # add them to the queue
busSeats = deMoivreFInverse(random(), seats) # how many seats available on the bus
getOnBus = min(waiting, busSeats) # how many people can get on the bus
BoardBus.append(getOnBus) # add to the list
waiting = waiting - getOnBus # take the people who board the bus out of the queue
LeftWaiting.append(waiting) # add to the list
return [LeftWaiting, BoardBus, BusTimes]
In [74]:
# let's simulate the people left waiting at the bus stop
set_random_seed(None) # replace None by a integer to fix seed and output of simulation
buses = 100
lam = 0.1
seats = 40
leftWaiting, boardBus, busTimes = busStopSimulation(buses, lam, seats)
print(leftWaiting) # look at the leftWaiting list
print(boardBus) # boad bus
print(busTimes)
We could do an interactive visualisation of this by evaluating the next cell. This will be showing the number of people able to board the bus and the number of people left waiting at the bus stop by the height of lines on the plot.
In [75]:
@interact
def _(seed=[0,123,456], lam=[0.1,0.01], seats=[40,10,1000]):
set_random_seed(seed)
buses=100
leftWaiting, boardBus, busTimes = busStopSimulation(buses, lam,seats)
p1 = line([(0.5,0),(0.5,leftWaiting[0])])
from pylab import cumsum
csBusTimes=list(cumsum(busTimes))
for i in range(1, len(leftWaiting), 1):
p1+= line([(csBusTimes[i-1],0),(csBusTimes[i-1],boardBus[i-1])], rgbcolor='green')
p1+= line([(csBusTimes[i-1]+.01,0),(csBusTimes[i-1]+.01,leftWaiting[i])], rgbcolor='red')
t1 = text("Boarding the bus", (csBusTimes[len(busTimes)-1]/3,max(max(boardBus),max(leftWaiting))+1), \
rgbcolor='green',fontsize=10)
t2 = text("Waiting", (csBusTimes[len(busTimes)-1]*(2/3),max(max(boardBus),max(leftWaiting))+1), \
rgbcolor='red',fontsize=10)
xaxislabel = text("Time", (csBusTimes[len(busTimes)-1],-10),fontsize=10,color='black')
yaxislabel = text("People", (-50,max(max(boardBus),max(leftWaiting))+1),fontsize=10,color='black')
show(p1+t1+t2+xaxislabel+yaxislabel,figsize=[8,5])
Very briefly explain the effect of varying one of the three parameters:
seed
lam
seats
while holding the other two parameters fixed on:
by using the dropdown menus in the @interact
above. Think if the simulation makes sense and explain why. You can write down your answers using keyboard by double-clicking this cell and writing between ---
and ---
.