Homework #3

This notebook is due on Friday, October 28th, 2016 at 11:59 p.m.. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.

Some links that you may find helpful:

FOR THIS HOMEWORK and for all future homework assignments: We will be grading you on:

  1. The quality of your code
  2. The correctness of your code
  3. Whether your code runs.

To that end:

  1. Code quality: make sure that you use functions whenever possible, use descriptive variable names, and use comments to explain what your code does as well as function properties (including what arguments they take, what they do, and what they return).
  2. Whether your code runs: prior to submitting your homework assignment, re-run the entire notebook and test it. Go to the "kernel" menu, select "Restart", and then click "clear all outputs and restart." Then, go to the "Cell" menu and choose "Run all" to ensure that your code produces the correct results. We will take off points for code that does not work correctly.

Your name

Put your name here!


Section 1: 2D random walk

We're going to finish up the "random walk" project that we started in class. In this section, we're going to do it in two dimensions, x and y. You need to write a program that performs a random walk that starts at the origin (x=y=0), picks a random direction (up, down, left, or right), and take one step in that direction. You will then randomly pick a new direction, take a step, and so on, for a total of $N_{step}$ = 1000 steps.

First: Write the code to do this for a single random walk, and keep track of the (x,y) position of your walker for each step in the random walk. Make a plot showing the path for this random walk. Make sure to write a function that decides what direction the walker will go on each step, and returns that information to your program.


In [ ]:
# put your code here

%matplotlib inline
import matplotlib.pyplot as plt
import random
import numpy as np
import numpy.random as rand

# takes nothing, returns an xstep, ystep pair
def step2d():
    '''
    step2d picks a random direction to step in (+/-x, +/-y).
    Takes no arguments, returns an xstep, ystep pair.
    '''
    a = random.randint(0,3)
    if a == 0:  # left
        return -1,0
    elif a == 1:  # right
        return 1,0
    elif a == 2:  # up
        return 0,1
    elif a == 3:  # down
        return 0,-1
    else:  # error check!
        print("aaah!")
        return -9999,-9999

In [ ]:
x = y = 0  # initial position (will be modified)
n_steps = 1000  # number of steps

x_traj = []
y_traj = []

# loop over n_steps, calling step2d() and incrementing trajectory
for i in range(0,n_steps):

    xstep,ystep=step2d()
    x += xstep
    y += ystep

    # keep track of trajectory to plot it!
    x_traj.append(x)
    y_traj.append(y)

plt.plot(x_traj,y_traj)

Second: Modify your code to only keep track of the final distance from the origin (magnitude, not x and y components - in other words, $d = (x^2 + y^2)^{1/2}$), and do the experiment $N_{trial} = 1000$ times, keeping track of the final distances for all of the trials. Plot the distribution of distances from the origin in a histogram, and calculate the mean value.


In [ ]:
# put your code here

distance = []

n_steps = 1000  # number of steps
n_trials = 1000

# as in the previous cell, but now loop over number of trials as 
# well as number of steps.  Just keep track of distances
for j in range(n_trials):

    x=y=0
    
    for i in range(n_steps):

        xstep,ystep=step2d()
        x += xstep
        y += ystep

    distance.append( (x**2+y**2)**0.5)

In [ ]:
plt.hist(distance,bins=40)

In [ ]:
distance = np.array(distance)
print((distance**2).mean())

Question: How does the 2D random walk behave similarly to, and differently from, the 1D random walk that you explored in class? Compare them below.

It's basically the same as a 1D random walk - the x and y values behave exactly the same as separate 1D random walks, and the mean distance^2 is basically the same as a 1D random walk with the same number of steps. The only real difference is that it's 2D and not 1D.


Section 2: 1D random walk with weighting

Now we want to see what happens in the 1D random walk when the "coin toss" is biased - in other words, when you're more likely to take a step in one direction than in the other (i.e., the probability of stepping to the right is $p_{step}$, of stepping to the left is $1-p_{step}$, and $p_{step} \neq 0.5$).

Modify the function for the 1D random walk that you wrote in class to take as an argument a probability that you step in one direction (say, to the right) and then to decide what direction to go. Use that to calculate a distribution of distances from the origin for $N_{step} = 1000$ and $N_{trial} = 1000$, as well as the mean distance from the origin. Plot a histogram of the distribution. Answer the following two questions:

  1. How does this distribution look different than the distribution that you observed for the standard 1D random walk?
  2. How does the distribution of distances traveled, as well as the mean distance from the origin , change as $p_{step}$ varies from 0.5 to 1?

Put your answers here


In [ ]:
# put your code here

n_steps = 1000  # number of steps
n_traj = 1000
prob_right = 0.55


def step1d_weighted(prob_right=0.5):
    '''
    Takes into account a probability of taking a step to the right (implicitly assumed
    to be between 0.0 and 1.0, inclusive) and returns a +1 for a right step, -1 for a left step.
    '''
    a = random.random()
    if a > prob_right:  # step left
        return -1
    else:  # step right
        return 1
    
dist = np.array([])

# loop over some number of trajectories
for trj in range(0,n_traj):
    x = 0  # initial position (will be modified)

    # for each trajectory, loop over number of steps
    for i in range(0,n_steps):
        x += step1d_weighted(prob_right)

    # keep track of all distances traveled
    dist = np.append(dist,x)

plt.hist(dist,bins=20) #,bins=50)    
    
print("average distance is:", dist.mean(), "average abs. distance is:",np.abs(dist).mean())

ANSWER:

  1. The distribution looks relatively similar, except it's now skewed away from the origin and the mean distance is now much larger (and gets larger the more the probability differs from 0.5)

  2. P = probability of stepping to the right. (I am only doing P >= 0.5 because the problem is symmetrical.) For 1000 steps, the distribution looks like this:

P <dist> <abs_dist>
0.5 -1.312 25.128
0.55 98.572 98.604
0.6 198.94 198.48
0.8 600.06 600.04
1.0 1000.0 1009.0

Section 3: Modeling the presidential election

The final part of this homework is the creation of a model of the 2016 Presidential Election that's similar to those used at election prediction sites such as FiveThirtyEight. You've been provided with a link to a CSV (comma-separated value) file containing information about each of the 50 states as well as the District of Columbia, and you'll use this to make election predictions.

A quick civics lesson: The United States does not directly elect the President and Vice President of the United States. Instead, they choose "Electors" that are apportioned to each state based on the most recent U.S. census results (equal to the number of members of Congress to which that state receives). Electors are typically required to vote for the candidate with the majority of the popular vote in their state. At the moment, there are 538 Electors in the "Electoral College.", and to win the presidency a candidate needs to recieve half plus one of those votes, or 270 votes.

The data file: Each row of the provided CSV file contains information about one of the 50 states or Washington D.C. This information includes: the state name and abbreviation, the the number of electoral votes that state receives in the Electoral College, the state's polling information, the number of people surveyed in the last poll, and then information for each of the three candidates with significant popular support: Hilary Clinton, Donald Trump, and Gary Johnson. For each candidate, the number corresponds to the percentage of the voting population that intends to vote for them according to the polls, as well as the margin of error of the poll. Note that we are only using the most recent poll for each state.

Margin of error: The "margin of error" of the poll represents uncertainty in polling data - typically only hundreds of people are polled at any given time, and pollsters are attempting to extrapolate from that sample of people to all of the likely voters in the state. This error is typically calculated assuming that the uncertainty is a "normal" (or "Gaussian") distribution, and the reported error is the "standard deviation". For example, Polls in Michigan indicate that Hilary Clinton is currently favored by 42% of the voting population, with a margin of error of 1.71%. This means that the most likely outcome is for 42% of the population to vote for Clinton, with a 68% likelihood that the real percentage is between 40.29-43.71%, and a 95% chance likelihood that the real percentage is between 38.58-45.42%.

Project instructions

We are going to attempt to duplicate the FiveThirtyEight election predictions, which predict not just who will win the election but what the range of likely outcomes will be. In particular, we're going to reproduce the expected distribution of Electoral College votes for each candidate. To do this, we need to run large numbers of model elections - say, $N_{elec} = 10,000$ - and keep track of the results for each one. To do so, you need to do the following for each of your model elections:

  • Write a function that, for each state, calculates the percentage of the popular vote that votes for each candidate, decides who the winner is (i.e., who has the most votes), and then returns the number of Electoral College votes received by that person for that state. Almost all states are a "winner take all" state with regards to Electoral College votes, so whoever wins the popular vote receives all Electoral College votes for that state. Note that you can use the Python random module's random.normalvariate() function, or the NumPy random module's random.normal() function to calculate the possible outcomes from a Gaussian distribution, given the mean and standard deviation.
  • Write a function that loops through the list of states and keep track of the number of electoral votes awarded to each candidate by each state.
  • Keep track of the total Electoral College votes, as well as the total popular vote, for each candidate in each model election to decide who won.

You will then be asked to answer several questions, as described below.


In [ ]:
import pandas

# reads in the CSV file and puts it into a Pandas data frame called "all_states"
all_states = pandas.read_csv('https://raw.githubusercontent.com/bwoshea/2016_election_info/master/State_polling_info.csv')

In [ ]:
# put your code here!

#all_states.head()

def votes_per_state(row):
    '''
    Takes in a row in a data frame (one state).  Returns the string 'Clinton', 
    'Trump', or 'Johnson', the number of electoral votes won by that 
    candidate from this state, and the number of actual votes that each candidate 
    receives from this state.
    '''

    # randomly sample from the probability distributions for clinton, trump, and johnson
    clinton = rand.normal(row['Clinton'],row['Cl_error'])
    trump   = rand.normal(row['Trump'],  row['Tr_error'])
    johnson = rand.normal(row['Johnson'],row['Jo_error'])
    
    # calculate the number of people who voted for each one
    clinton_vote = clinton*row['Population']/100.0
    trump_vote   = trump*row['Population']/100.0
    johnson_vote = johnson*row['Population']/100.0
    
    # some logic to figure out who the winner is
    if clinton >= trump:
        if clinton > johnson:
            winner = 'Clinton'
        else:
            winner = 'Johnson'
    else:
        if trump > johnson:
            winner = 'Trump'
        else:
            winner = 'Johnson'

    # return the winner, the number of electoral votes, and number of actual
    # votes from this state.
    return winner, row['Electoral votes'], clinton_vote, trump_vote, johnson_vote

def election_result(country_df):
    '''
    Takes in a dataframe with all of the necessary information for all states + washington d.c.
    loops over all of the states, calculating the total number of electoral votes as well as the 
    popular vote.
    
    Return the number of electoral votes as well as the total popular vote that each candidate
    receives for this model election.
    '''

    # counters for electoral votes
    clinton_EC_votes = 0
    trump_EC_votes = 0
    johnson_EC_votes = 0

    # counters for popular votes
    clinton_pop_votes = 0
    trump_pop_votes = 0
    johnson_pop_votes = 0
    
    # loop over the states + DC
    for index, row in country_df.iterrows():

        # for each state, figure out who wins
        winner, number, c_votes, t_votes, j_votes= votes_per_state(row)

        # assign electoral votes to that person
        if winner == 'Clinton':
            clinton_EC_votes += number
        elif winner == 'Trump':
            trump_EC_votes += number
        elif winner == 'Johnson':
            johnson_EC_votes += number
        else:
            print("wtf?")

        # keep track of popular votes for everybody
        clinton_pop_votes += c_votes
        trump_pop_votes += t_votes
        johnson_pop_votes += j_votes
    
    # quick error check
    if (clinton_EC_votes+trump_EC_votes+johnson_EC_votes) != 538:
        print("not 538 votes - something's wrong!")
    
    # return everything
    return clinton_EC_votes, trump_EC_votes, johnson_EC_votes, clinton_pop_votes, trump_pop_votes, johnson_pop_votes

In [ ]:
# loop over number of elections we want to simulate
N_elections = 10000

# lists to keep track of electoral votes
clinton_EC_votes = []
trump_EC_votes = []
johnson_EC_votes = []

# lists to keep track of popular votes
clinton_pop_votes = []
trump_pop_votes = []
johnson_pop_votes = []

# counters to keep track of the number of elections that clinton and trump win
clinton_winner = 0
trump_winner = 0

# loop over the number of elections we want to simulate
for i in range(N_elections):
    
    # get electoral college votes and popular vote for all of the states
    clinton_EC, trump_EC, johnson_EC, clinton_pop, trump_pop, johnson_pop = election_result(all_states)
    
    # figure out who wins and take this into account
    # (I am ignoring Gary Johnson)
    if clinton_EC > trump_EC:
        clinton_winner += 1
    elif trump_EC > clinton_EC:
        trump_winner += 1
    else:
        print("tie!")
    
    # keep track of electoral college votes
    clinton_EC_votes.append(clinton_EC)
    trump_EC_votes.append(trump_EC)
    johnson_EC_votes.append(johnson_EC)

    # keep track of popular votes
    clinton_pop_votes.append(clinton_pop)
    trump_pop_votes.append(trump_pop)
    johnson_pop_votes.append(johnson_pop)
    
    # just to let us know something is happening!
    if i%1000==0:
        print("i: ",i)

Question 1: Plot the histogram of expected Electoral College votes for the three candidates. Who do you expect to win the election?

Put your answer here!


In [ ]:
# put your code and plots here.  Add additional cells if necessary.

a,b,c= plt.hist(clinton_EC_votes,color='b')
a,b,c= plt.hist(trump_EC_votes,color='r')
a,b,c= plt.hist(johnson_EC_votes,color='g')

Question 2: In what percentage of the model elections does Hilary Clinton win the Presidency? How about Donald Trump and Gary Johnson?

Clinton wins in approximately 97.5% of the model elections. Trump wins in about 2.5%, and Gary Johnson wins 0%.


In [ ]:
# put your code and plots here.  Add additional cells if necessary.

print("clinton win %:", 100.0*clinton_winner / (clinton_winner+trump_winner))
print("trump win %:  ", 100.0*trump_winner / (clinton_winner+trump_winner))

Question 3: Let's look at the difference between popular vote and Electoral College vote for one of the two main-party candidates - let's use Hilary Clinton as our example. Make a scatter plot of the expected Electoral College votes vs. the fraction of popular vote received for all of the elections, and put lines indicating 50% of the popular vote as well as the needed 270 Electoral College votes. Is it possible to win more than half of the Electoral College vote but get less than half of the popular votes, or vice versa? Why might this be true?

In this particular election, given that there's a strong third-party candidate it's extremely unlikely that any candidate will get more than 50% of the votes, even the one that wins way more than 270 Electoral College votes. It might be possible to win the popular vote but lose the Electoral College (thinking Al Gore in 2000...) but it isn't happening for this set of election results.


In [ ]:
# put your code and plots here.  Add additional cells if necessary.

clinton_pop_votes = np.array(clinton_pop_votes)
johnson_pop_votes = np.array(johnson_pop_votes)
trump_pop_votes = np.array(trump_pop_votes)

clinton_frac_votes = clinton_pop_votes/(clinton_pop_votes+johnson_pop_votes+trump_pop_votes)
clinton_frac_votes_nojohn = clinton_pop_votes/(clinton_pop_votes+trump_pop_votes)
clinton_EC_votes = np.array(clinton_EC_votes)

plt.plot(clinton_EC_votes,clinton_frac_votes,'k.')
plt.plot(clinton_EC_votes,clinton_frac_votes_nojohn,'g.')

Question 4: Take a look at the results on the FiveThirtyEight election forecast page, and read their description of how they create these forecasts. How well does the distribution that you calculated in Question 1 agree with FiveThirtyEight's Electoral College forecast? If it is different, why do you think that might be?

ANSWER:

Our distribution is different - the median values for Clinton's number of electoral votes that we predicted are lower than FiveThirtyEight's (~325 instead of ~350), and the distribution we predict is narrower. The same is true for Trump's votes, but we predict higher numbers than 538 (but still a narrower distribution). It's not totally obvious why we get different results, but there are several possibilities based on their description of their methodology:

  • We are using instantaneous poll results and a single poll per state, and 538 uses a more complex set of data that includes multiple polls for each state, with a weighting scheme that takes into account how long it has been since the poll was done, and also attempts to correct for bias of each pollster (i.e., some tend to swing Republican or Democrate; historical data can be used to normalize this).
  • 538 uses economic data to try to make adjustments to their predictions, and we don't do that.
  • 538 uses a different error model than we do - a t-distribution instead of a Gaussian distribution of outcomes, which gives more weight to the tails of the distribution. They also take into account something other than Poisson noise (and a modified error taking into account other uncertainties) for each poll so that errors are greater.
  • 538 treats the fraction of undecided voters differently than we do (we ignore it; they divide them among the candidates). This also may tie into the error budget that they use.
  • 538 may do something different with Gary Johnson's voters, and assume that some will defect prior to the election and go to another of the candidates.
  • We are implicitly assuming that the polls are accurate and that all of the population will vote. We don't take into account demographic issues or trying to figure out what likely voters think, which 538 attempts to do at some level.

Section 4: Feedback (required!)


In [ ]:
from IPython.display import HTML
HTML(
"""
<iframe
    src="https://goo.gl/forms/q2zZVDznls9zeqJo2?embedded=true" 
    width="80%" 
    height="1200px" 
    frameborder="0" 
    marginheight="0" 
    marginwidth="0">
    Loading...
</iframe>
"""
)

Congratulations, you're done!

How to submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "Homework assignments" folder. There will be a dropbox labeled "Homework 3". Upload this notebook there.