Lab

Probability and Statistics

Alessandro D. Gagliardi

On Groundhog Day, February 2, a famous groundhog in Punxsutawney, PA is used to predict whether a winter will be long or not based on whether or not he sees his shadow. I collected data on whether he saw his shadow or not from here. I stored some of this data in this table.

Although Phil is on the East Coast, I wondered if the information says anything about whether or not we will experience a rainy winter out here in California. For this, I found rainfall data, and saved it in a table. To see how this was extracted see this notebook.

  1. Make a boxplot of the average rainfall in Northen California comparing the years Phil sees his shadow versus the years he does not.

  2. Construct a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.

  3. Interpret the interval in part 2.

  4. At level, $\alpha = 0.05$ would you reject the null hypothesis that the average rainfall in Northern California during the month of February was the same in years Phil sees his shadow versus years he does not?

  5. What assumptions are you making in forming your confidence interval and in your hypothesis test?


In [2]:
%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import twitter
import yaml
from pymongo import MongoClient

Part 1


In [3]:
rainfall = pd.read_csv('http://stats191.stanford.edu/data/rainfall.csv')
groundhog = pd.read_csv('http://stats191.stanford.edu/data/groundhog.table')

In [4]:
df = rainfall.merge(groundhog, left_on='WY', right_on='year')[['Total', 'shadow']]

In [5]:
df.boxplot(column='Total', by='shadow')


Out[5]:
<matplotlib.axes.AxesSubplot at 0x10ce11f10>

Part 2


In [50]:
mod = sm.OLS.from_formula("Total ~ shadow == 'Y'", df)
res = mod.fit()
print res.summary(alpha = .1)


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Total   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                 -0.036
Method:                 Least Squares   F-statistic:                    0.2996
Date:                Mon, 28 Apr 2014   Prob (F-statistic):              0.591
Time:                        21:18:20   Log-Likelihood:                -88.568
No. Observations:                  21   AIC:                             181.1
Df Residuals:                      19   BIC:                             183.2
Df Model:                           1                                         
=========================================================================================
                            coef    std err          t      P>|t|      [90.0% Conf. Int.]
-----------------------------------------------------------------------------------------
Intercept                56.4160      7.721      7.307      0.000        43.066    69.766
shadow == 'Y'[T.True]    -4.8410      8.845     -0.547      0.591       -20.135    10.453
==============================================================================
Omnibus:                        1.662   Durbin-Watson:                   1.830
Prob(Omnibus):                  0.436   Jarque-Bera (JB):                1.276
Skew:                           0.403   Prob(JB):                        0.528
Kurtosis:                       2.102   Cond. No.                         3.88
==============================================================================

I report the confidence interval [-20.135, 10.453] for the difference between the shadow == 'Y' and the shadow == 'N' means.

Part 3

If I repeated this sample of shadow and rainfall in Northern California (assuming they are IID each year) and I form this confidence interval as t.test does. Then, 90% of the intervals will cover the true underlying difference in the rainfall between years the groundhog sees his shadow or not.

Part 4

As the reported $p$-value is 0.591, I fail to reject the null hypothesis at the 5% level.

Part 5

I am assuming that the rainfall measurements are independent $N(\mu_i,\sigma^2)$ where $\mu_i=\mu_N$ in the shadow == 'N' years and $\mu_i=\mu_Y$ in the shadow == 'Y' years.


1-2 Pairs

Start with the data on US and Canada trends from last week.

  1. Create a histogram of text_len within each group. (Use alpha = 0.5 to overlay them.)

  2. Compute the sample mean and standard deviation in the two groups.

  3. Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
    i.e. given ca_text_len and us_text_len as Series containing the length of each text in the Canadian and US collections respectively:

    text_len_df = pd.concat([pd.DataFrame({'text_len': ca_text_len, 'country': 'CA'}), 
                              pd.DataFrame({'text_len': us_text_len, 'country': 'US'})])
    
  4. Use this DataFrame to create a boxplot of the text_len by country.

  5. Use OLS to compute a 90% confidence interval for the difference in text_len between the two groups. Name a problem with describing the confidence interval of tweet length in this way.

  6. At level $\alpha=5\%$, test the null hypothesis that the average text length does not differ between the two groups. What can you conclude?

  7. Repeat the above steps but pick your own tags and try to find a pair with a more significant difference.


In [1]:
# standard library:
import os  

from pprint import pprint

# other modules:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import twitter
import yaml

from pymongo import MongoClient

credentials = yaml.load(open(os.path.expanduser('~/api_cred.yml')))

auth = twitter.oauth.OAuth(credentials['ACCESS_TOKEN'], 
                           credentials['ACCESS_TOKEN_SECRET'],
                           credentials['API_KEY'],
                           credentials['API_SECRET'])

twitter_api = twitter.Twitter(auth=auth)

# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/

WORLD_WOE_ID = 1
US_WOE_ID = 23424977
CA_WOE_ID = 23424775

# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

us_trends = twitter_api.trends.place(_id=US_WOE_ID)
canada_trends = twitter_api.trends.place(_id=CA_WOE_ID)

us_trends_set = set([trend['name'] for trends in us_trends
                     for trend in trends['trends']]) 
ca_trends_set = set([trend['name'] for trends in canada_trends
                     for trend in trends['trends']])

c = MongoClient()

db = c.lab05

%matplotlib inline

ca_results = list()
ca_only_trends_list = list(ca_trends_set)

for trend in enumerate(ca_only_trends_list):
    q = str(trend)
    print('Trend: '+ q)
    count = 100
    search_results = twitter_api.search.tweets(q=q, count=count)
    statuses = search_results['statuses']
    if len(statuses) > 0:
        ca_results.extend(statuses)
        
print(len(ca_results))

us_results = list()
us_only_trends_list = list(us_trends_set)
for trend in enumerate(us_only_trends_list):
    q = str(trend)
    print('Trend: '+ q)
    count = 100
    search_results = twitter_api.search.tweets(q=q, count=count)
    statuses = search_results['statuses']
    if len(statuses) > 0:
        us_results.extend(statuses)

print(len(us_results))

if len(ca_results) > 0:
    ca_statuses = db.canada.insert(ca_results)
if len(us_results) > 0:
    us_statuses = db.us.insert(us_results)


Trend: (0, u'Ric Flair')
Trend: (1, u'Columbus')
Trend: (2, u'#VoiceTop10')
Trend: (3, u'Peter Maher')
Trend: (4, u'#WeWereBornForThis')
Trend: (5, u'Norma')
Trend: (6, u'#BuyProblemOniTunes')
Trend: (7, u'#BuyOrdinaryOniTunes')
Trend: (8, u'#dontstop')
Trend: (9, u'Brie')
148
Trend: (0, u'#bbwlareunion')
Trend: (1, u'#ShieldVsEvolution')
Trend: (2, u'#AskTeamJeta')
Trend: (3, u'#ordinary')
Trend: (4, u'Cinnamon Roll Pancake')
Trend: (5, u'Tuscaloosa')
Trend: (6, u'James Spann')
Trend: (7, u'#BuyOrdinaryOniTunes')
Trend: (8, u'Carlos Santana')
Trend: (9, u'Sonny Gray')
12

In [53]:
#Create a histogram of text_len within each group. Use alpha = 0.5 to overlay them.
#pprint(ca_results[:5])


us_df = pd.DataFrame({'text_len': len(tweet['text'])} for tweet in us_results)
ca_df = pd.DataFrame({'text_len': len(tweet['text'])} for tweet in ca_results)

#us_df.describe()
#ca_df.describe()

us_df.hist(alpha=0.5)
ca_df.hist(alpha=0.5)
plt.show()

#Compute the sample mean and standard deviation in the two groups.

print(us_df.mean())
print(ca_df.mean())

print(us_df.std())
print(ca_df.std())

#Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
ca_text_len = ca_df.ix[:,0]
us_text_len = us_df.ix[:,0]

text_len_df = pd.concat([pd.DataFrame({'text_len': ca_text_len, 'country': 'CA'}), 
                          pd.DataFrame({'text_len': us_text_len, 'country': 'US'})])

#Use this DataFrame to create a boxplot of the text_len by country.
text_len_df.boxplot(column='text_len', by='country')

#Use OLS to compute a 90% confidence interval for the difference in text_len between the two groups. 
#Name a problem with describing the confidence interval of tweet length in this way.
mod = sm.OLS.from_formula("text_len ~ country", text_len_df)
res = mod.fit()
print res.summary(alpha = .1)

#At level α=5%, test the null hypothesis that the average text length does not differ between the two groups. 
#What can you conclude?
print('P=0.615 thus fail to reject the null hypothesis at alpha = 0.05')

us_rt_df = pd.DataFrame({'retweet_count': tweet['retweet_count']} for tweet in us_results)
ca_rt_df = pd.DataFrame({'retweet_count': tweet['retweet_count']} for tweet in ca_results)

us_rt_df.hist(alpha=0.5)
ca_rt_df.hist(alpha=0.5)
plt.show()

#Compute the sample mean and standard deviation in the two groups.

print(us_rt_df.mean())
print(ca_rt_df.mean())

print(us_rt_df.std())
print(ca_rt_df.std())

#Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
ca_retweet_count = ca_rt_df.ix[:,0]
us_retweet_count = us_rt_df.ix[:,0]

retweet_count_df = pd.concat([pd.DataFrame({'retweet_count': ca_retweet_count, 'country': 'CA'}), 
                          pd.DataFrame({'retweet_count': us_retweet_count, 'country': 'US'})])

#Use this DataFrame to create a boxplot of the text_len by country.
retweet_count_df.boxplot(column='retweet_count', by='country')

#Use OLS to compute a 90% confidence interval for the difference in text_len between the two groups. 
#Name a problem with describing the confidence interval of tweet length in this way.
mod = sm.OLS.from_formula("retweet_count ~ country", retweet_count_df)
res = mod.fit()
print res.summary(alpha = .1)


text_len    129.833333
dtype: float64
text_len    126.574324
dtype: float64
text_len    11.876894
dtype: float64
text_len    22.115372
dtype: float64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               text_len   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.005
Method:                 Least Squares   F-statistic:                    0.2536
Date:                Mon, 28 Apr 2014   Prob (F-statistic):              0.615
Time:                        21:28:25   Log-Likelihood:                -717.36
No. Observations:                 160   AIC:                             1439.
Df Residuals:                     158   BIC:                             1445.
Df Model:                           1                                         
=================================================================================
                    coef    std err          t      P>|t|      [90.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept       126.5743      1.772     71.419      0.000       123.642   129.507
country[T.US]     3.2590      6.471      0.504      0.615        -7.448    13.966
==============================================================================
Omnibus:                       85.313   Durbin-Watson:                   1.276
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              278.487
Skew:                          -2.222   Prob(JB):                     3.37e-61
Kurtosis:                       7.693   Cond. No.                         3.82
==============================================================================
P=0.615 thus fail to reject the null hypothesis at alpha = 0.05
retweet_count    0.5
dtype: float64
retweet_count    83.277027
dtype: float64
retweet_count    0.904534
dtype: float64
retweet_count    523.080941
dtype: float64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.2988
Date:                Mon, 28 Apr 2014   Prob (F-statistic):              0.585
Time:                        21:28:26   Log-Likelihood:                -1221.8
No. Observations:                 160   AIC:                             2448.
Df Residuals:                     158   BIC:                             2454.
Df Model:                           1                                         
=================================================================================
                    coef    std err          t      P>|t|      [90.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        83.2770     41.473      2.008      0.046        14.657   151.897
country[T.US]   -82.7770    151.439     -0.547      0.585      -333.341   167.787
==============================================================================
Omnibus:                      248.822   Durbin-Watson:                   1.253
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            17228.297
Skew:                           7.031   Prob(JB):                         0.00
Kurtosis:                      51.852   Cond. No.                         3.82
==============================================================================

Homework

Use enron.db from last week.

  1. Create a boxplot of the message recipient count (MAX(rno)), splitting the data based on the seniority of the sender.

  2. Compute the sample mean and standard deviation in the two groups.

  3. Create a histogram of the recipient count within each group.

  4. Compute a 90% confidence interval for the difference in supervisor performance between the two groups. What is a problem with this? How might you fix it?

  5. At level $\alpha=5\%$, test the null hypothesis that the average supervisor performance does not differ between the two groups. What assumptions are you making? What can you conclude?

  6. Repeat the test in 5. using OLS.


In [11]:
# standard library:
import os  

from pprint import pprint

# other modules:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')

from pymongo import MongoClient
%matplotlib inline

In [8]:
%%bash
sqlite3 enron.db .tables


Employee          EmployeeWithVars  MessageBase       RecipientBase   
EmployeeBase      Message           Recipient       

In [16]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 5


Done.
Out[16]:
mid rno to_eid
1 1 59
2 1 15
3 1 15
4 1 109
4 2 49

In [28]:
#Create a boxplot of the message recipient count (MAX(rno)), splitting the data based on the seniority of the sender.
senior_max = sql.frame_query("""SELECT rno as recipient_count, eid, seniority, from_eid, mid
                FROM EmployeeBase JOIN MessageBase ON eid = from_eid
                JOIN RecipientBase USING(mid)
                WHERE seniority = 'Senior'""", conn, "eid")
junior_max = sql.frame_query("""SELECT rno as recipient_count, eid, seniority, from_eid, mid
                FROM EmployeeBase JOIN MessageBase ON eid = from_eid
                JOIN RecipientBase USING(mid)
                WHERE seniority = 'Junior'""", conn, "eid")
senior_max.head()
print senior_max.count()
junior_max.head()
print junior_max.count()
#TODO: Generate create data frames from query results and plot box plots


recipient_count    22490
seniority          22490
from_eid           22490
mid                22490
dtype: int64
recipient_count    15898
seniority          15898
from_eid           15898
mid                15898
dtype: int64

In [ ]:
max_df = pd.concat([pd.DataFrame({'retweet_count': ca_retweet_count, 'country': 'CA'}), 
                          pd.DataFrame({'retweet_count': us_retweet_count, 'country': 'US'})])