Section 2 Notebook

In this notebook we will reason about recent presidential approval poll data. We will explore how conditional probability, the Law of Total Probability, and Bayes' Theorem help us better understand a simple survey. Along the way we will learn how pandas, the Python data analysis library, makes it easy to manipulate data tables.

Learning Goals:

  1. Analyze poll data with conditional probability, the Law of Total Probability, and Bayes' Theorem
  2. Learn some basic pandas skills

Poll Data - Presidential Approval

Problem: You want to know whether or not people approve of President Trump, a potential candidate in the upcoming election. We have collected real data from 12 recent CNN polls, listed in the cell below. The original results are available from the CNN polls themselves.

Let $A$ be the event that a respondent says they approve of the way President Trump is handling his job as president. Let $M$ be the event that a respondent answered "No opinion". We are interested in estimating $P(A)$, but that is hard given the small but significant number of respondents who answered "No opinion".

Note 1: We assume in our model that, given enough information, the "No opinion" respondents would make an approve/disapprove decision.

Note 2: The latest CNN poll (Jan 16-19, 2020) had a sample of 1156 respondents. For simplicity we will assume every poll had this sample size.


In [0]:
num_respondents = 1156


dates = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan2020']
data = {}
data['approve'] = [37, 40, 42, 43, 43, 43, 40, 39, 41, 42, 43, 43]
data['disapprove'] = [57, 55, 51, 52, 52, 52, 54, 55, 57, 54, 53, 53]
data['no_opinion'] = [7, 5, 8, 5, 5, 5, 6, 6, 2, 4, 4, 4]

In the cell below, import pandas and make a DataFrame object from the poll data above, using the dates list as the index.

Then, display the data by printing your DataFrame object.

Hint: Instead of using print, try using the DataFrame variable name alone on a single line at the end of the cell. It will look prettier :)


In [0]:
import pandas as pd

polldf = # TODO
polldf
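One possible way to fill in the cell above is sketched here. It assumes the `dates` and `data` variables from the earlier cell (repeated so the snippet runs on its own) and passes the dictionary straight to the `pd.DataFrame` constructor, so each key becomes a column. The values appear to be percentages of respondents.

```python
import pandas as pd

# Poll data copied from the earlier cell so this sketch is self-contained.
dates = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan2020']
data = {
    'approve':    [37, 40, 42, 43, 43, 43, 40, 39, 41, 42, 43, 43],
    'disapprove': [57, 55, 51, 52, 52, 52, 54, 55, 57, 54, 53, 53],
    'no_opinion': [7, 5, 8, 5, 5, 5, 6, 6, 2, 4, 4, 4],
}

# Each dict key becomes a column; the dates list labels the rows.
polldf = pd.DataFrame(data, index=dates)
polldf
```

Individual entries can then be looked up by label, for example `polldf.loc['Oct', 'approve']`.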

a) For each month, what fraction of respondents gave an opinion (approve or disapprove), $P(M^C)$?

Using your DataFrame object created above, compute $P(M^C)$. See pandas.DataFrame.sum to sum rows or columns of the table.

Hint: Try accessing the DataFrame using its column names and then doing elementwise vector math. For example, use polldf['approve'] / ... instead of for loops.


In [0]:
# TODO
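A minimal sketch of one approach to part a) follows. Because the published percentages are rounded, some rows do not sum to exactly 100 (January sums to 101), so this sketch normalizes by each row's total. That normalization is an assumption of the sketch; dividing by 100 instead is also defensible. The name `p_opinion` is made up for illustration.

```python
import pandas as pd

# Poll data copied from the earlier cell so this sketch is self-contained.
dates = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan2020']
data = {
    'approve':    [37, 40, 42, 43, 43, 43, 40, 39, 41, 42, 43, 43],
    'disapprove': [57, 55, 51, 52, 52, 52, 54, 55, 57, 54, 53, 53],
    'no_opinion': [7, 5, 8, 5, 5, 5, 6, 6, 2, 4, 4, 4],
}
polldf = pd.DataFrame(data, index=dates)

# axis=1 sums across the columns, giving one total per row (per poll).
row_totals = polldf.sum(axis=1)

# P(M^C): share of respondents who answered approve or disapprove.
p_opinion = (polldf['approve'] + polldf['disapprove']) / row_totals
p_opinion
```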

b) For each month, what is the probability that a respondent said they approve, given that they gave an opinion, $P(A|M^C)$?

You know the drill :)


In [0]:
# TODO
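For part b), the definition of conditional probability gives $P(A|M^C) = P(A \cap M^C) / P(M^C)$. Every respondent who approves has, by definition, given an opinion, so the ratio reduces to approve / (approve + disapprove). The sketch below is one way to compute it; the name `p_a_given_opinion` is made up for illustration.

```python
import pandas as pd

# Poll data copied from the earlier cell so this sketch is self-contained.
dates = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan2020']
data = {
    'approve':    [37, 40, 42, 43, 43, 43, 40, 39, 41, 42, 43, 43],
    'disapprove': [57, 55, 51, 52, 52, 52, 54, 55, 57, 54, 53, 53],
    'no_opinion': [7, 5, 8, 5, 5, 5, 6, 6, 2, 4, 4, 4],
}
polldf = pd.DataFrame(data, index=dates)

# Respondents who gave an opinion (the event M^C), per poll.
gave_opinion = polldf['approve'] + polldf['disapprove']

# P(A | M^C): approvers as a share of those who gave an opinion.
p_a_given_opinion = polldf['approve'] / gave_opinion
p_a_given_opinion
```

Since this is a ratio of two columns from the same row, it does not depend on whether the row sums to exactly 100.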

c) Compute $P(A)$ under the following assumptions:

  1. $P(A|M) = P(A|M^C)$. That is, people with no opinion would approve at the same rate as those who gave an opinion.
  2. $P(A|M) = 0$. That is, people with no opinion actually disapprove.
  3. $P(A^C|M) = 0$. That is, people with no opinion actually approve.
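As a sketch of the reasoning, the Law of Total Probability splits $P(A)$ over the events $M$ and $M^C$, and each assumption then fixes the unknown term $P(A|M)$:

```latex
\begin{align*}
P(A) &= P(A|M)\,P(M) + P(A|M^C)\,P(M^C) \\[4pt]
\text{Assumption 1: } P(A|M) = P(A|M^C)
  &\;\Rightarrow\; P(A) = P(A|M^C)\,\bigl[P(M) + P(M^C)\bigr] = P(A|M^C) \\
\text{Assumption 2: } P(A|M) = 0
  &\;\Rightarrow\; P(A) = P(A|M^C)\,P(M^C) \\
\text{Assumption 3: } P(A^C|M) = 0 \Rightarrow P(A|M) = 1
  &\;\Rightarrow\; P(A) = P(M) + P(A|M^C)\,P(M^C)
\end{align*}
```

Assumptions 2 and 3 therefore bracket the answer: they give the smallest and largest values of $P(A)$ consistent with the data.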

In [0]:
polldf['P(A) w/ A.1'] = # TODO
polldf['P(A) w/ A.2'] = # TODO
polldf['P(A) w/ A.3'] = # TODO
polldf
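One hedged way to fill in the three columns is to apply the Law of Total Probability, $P(A) = P(A|M)P(M) + P(A|M^C)P(M^C)$, with a different value of $P(A|M)$ for each assumption. The helper names `p_m`, `p_mc`, and `p_a_given_mc` are made up for this sketch, and it again normalizes by row totals because the rounded percentages do not always sum to 100.

```python
import pandas as pd

# Poll data copied from the earlier cell so this sketch is self-contained.
dates = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan2020']
data = {
    'approve':    [37, 40, 42, 43, 43, 43, 40, 39, 41, 42, 43, 43],
    'disapprove': [57, 55, 51, 52, 52, 52, 54, 55, 57, 54, 53, 53],
    'no_opinion': [7, 5, 8, 5, 5, 5, 6, 6, 2, 4, 4, 4],
}
polldf = pd.DataFrame(data, index=dates)

# Compute the building blocks BEFORE adding new columns, so that the
# row totals only include the three original response columns.
gave_opinion = polldf['approve'] + polldf['disapprove']
row_totals = polldf[['approve', 'disapprove', 'no_opinion']].sum(axis=1)
p_mc = gave_opinion / row_totals               # P(M^C)
p_m = 1 - p_mc                                 # P(M)
p_a_given_mc = polldf['approve'] / gave_opinion  # P(A | M^C)

# Law of Total Probability with P(A|M) set by each assumption.
polldf['P(A) w/ A.1'] = p_a_given_mc * p_m + p_a_given_mc * p_mc  # P(A|M) = P(A|M^C)
polldf['P(A) w/ A.2'] = 0 * p_m + p_a_given_mc * p_mc             # P(A|M) = 0
polldf['P(A) w/ A.3'] = 1 * p_m + p_a_given_mc * p_mc             # P(A|M) = 1
polldf
```

In every row the Assumption 1 estimate lies between the other two, which gives a quick sanity check on the arithmetic.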

d) Discuss: Which of the assumptions do you think is best? What assumptions would you employ in practice, or what other data would you gather to support arguments using this survey data?