On Groundhog Day, February 2, a famous groundhog in Punxsutawney, PA is said to predict whether winter will be long based on whether or not he sees his shadow. I collected data on whether he saw his shadow from here, and stored some of it in this table.
Although Phil is on the East Coast, I wondered whether his prediction says anything about whether we will experience a rainy winter out here in California. For this, I found rainfall data and saved it in a table; to see how it was extracted, see this notebook.
Make a boxplot of the average rainfall in Northern California, comparing the years Phil sees his shadow with the years he does not.
Construct a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.
Interpret the interval in part 2.
At level $\alpha = 0.05$, would you reject the null hypothesis that the average February rainfall in Northern California is the same in years Phil sees his shadow as in years he does not?
What assumptions are you making in forming your confidence interval and in your hypothesis test?
In [2]:
%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
In [3]:
rainfall = pd.read_csv('http://stats191.stanford.edu/data/rainfall.csv')
groundhog = pd.read_csv('http://stats191.stanford.edu/data/groundhog.table')
In [4]:
df = rainfall.merge(groundhog, left_on='WY', right_on='year')[['Total', 'shadow']]
In [5]:
df.boxplot(column='Total', by='shadow')
Out[5]:
In [50]:
mod = sm.OLS.from_formula("Total ~ shadow == 'Y'", df)
res = mod.fit()
print(res.summary(alpha=0.1))
I report the confidence interval [-20.135, 10.453] for the difference between the shadow == 'Y' and the shadow == 'N' means.
If I repeatedly sampled shadow and rainfall outcomes in Northern California (assuming each year is IID) and formed this confidence interval the same way each time, then 90% of the intervals would cover the true underlying difference in mean rainfall between years the groundhog sees his shadow and years he does not.
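For the test at $\alpha = 0.05$: the 90% interval above contains 0, so the matching two-sided test fails to reject at $\alpha = 0.10$, and therefore also at $\alpha = 0.05$. As a minimal sketch of running the test directly, here is a pooled two-sample t-test (matching the equal-variance assumption the OLS fit makes); the rainfall values below are simulated stand-ins, not the notebook's actual data.

```python
import numpy as np
from scipy import stats

# Simulated stand-ins for February rainfall totals (illustrative only).
rng = np.random.default_rng(0)
shadow = rng.normal(20.0, 8.0, size=30)      # hypothetical totals, shadow years
no_shadow = rng.normal(24.0, 8.0, size=15)   # hypothetical totals, no-shadow years

# Pooled two-sided two-sample t-test (scipy's default is equal_var=True).
t_stat, p_value = stats.ttest_ind(shadow, no_shadow)
reject = p_value < 0.05
print('t = %.3f, p = %.3f, reject at 5%%: %s' % (t_stat, p_value, reject))
```

With the real data, the same call on the two `Total` subsets of `df` would reproduce the OLS t-test for the `shadow` coefficient.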
Start with the data on US and Canada trends from last week.
Create a histogram of text_len within each group. (Use alpha = 0.5 to overlay them.)
Compute the sample mean and standard deviation in the two groups.
Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
i.e. given ca_text_len and us_text_len as Series containing the length of each text in the Canadian and US collections respectively:
text_len_df = pd.concat([pd.DataFrame({'text_len': ca_text_len, 'country': 'CA'}),
pd.DataFrame({'text_len': us_text_len, 'country': 'US'})])
Use this DataFrame to create a boxplot of the text_len by country.
Use OLS to compute a 90% confidence interval for the difference in text_len between the two groups. Name a problem with describing the confidence interval of tweet length in this way.
At level $\alpha=5\%$, test the null hypothesis that the average text length does not differ between the two groups. What can you conclude?
Repeat the above steps but pick your own tags and try to find a pair with a more significant difference.
In [1]:
# standard library:
import os
from pprint import pprint
# other modules:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm
import twitter
import yaml
from pymongo import MongoClient
credentials = yaml.safe_load(open(os.path.expanduser('~/api_cred.yml')))
auth = twitter.oauth.OAuth(credentials['ACCESS_TOKEN'],
credentials['ACCESS_TOKEN_SECRET'],
credentials['API_KEY'],
credentials['API_SECRET'])
twitter_api = twitter.Twitter(auth=auth)
# The Yahoo! Where On Earth ID for the entire world is 1.
# See https://dev.twitter.com/docs/api/1.1/get/trends/place and
# http://developer.yahoo.com/geo/geoplanet/
WORLD_WOE_ID = 1
US_WOE_ID = 23424977
CA_WOE_ID = 23424775
# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.
us_trends = twitter_api.trends.place(_id=US_WOE_ID)
canada_trends = twitter_api.trends.place(_id=CA_WOE_ID)
us_trends_set = set([trend['name'] for trends in us_trends
for trend in trends['trends']])
ca_trends_set = set([trend['name'] for trends in canada_trends
for trend in trends['trends']])
c = MongoClient()
db = c.lab05
%matplotlib inline
ca_results = list()
ca_only_trends_list = list(ca_trends_set)
for trend in ca_only_trends_list:
    q = trend  # trends are already strings; enumerate() here was a bug
    print('Trend: ' + q)
    count = 100
    search_results = twitter_api.search.tweets(q=q, count=count)
    statuses = search_results['statuses']
    if len(statuses) > 0:
        ca_results.extend(statuses)
print(len(ca_results))
us_results = list()
us_only_trends_list = list(us_trends_set)
for trend in us_only_trends_list:
    q = trend  # trends are already strings; enumerate() here was a bug
    print('Trend: ' + q)
    count = 100
    search_results = twitter_api.search.tweets(q=q, count=count)
    statuses = search_results['statuses']
    if len(statuses) > 0:
        us_results.extend(statuses)
print(len(us_results))
if len(ca_results) > 0:
    ca_statuses = db.canada.insert_many(ca_results)
if len(us_results) > 0:
    us_statuses = db.us.insert_many(us_results)
In [53]:
#Create a histogram of text_len within each group. Use alpha = 0.5 to overlay them.
#pprint(ca_results[:5])
us_df = pd.DataFrame({'text_len': len(tweet['text'])} for tweet in us_results)
ca_df = pd.DataFrame({'text_len': len(tweet['text'])} for tweet in ca_results)
#us_df.describe()
#ca_df.describe()
# Overlay the two histograms on one axis, as the task asks.
plt.hist(us_df['text_len'], alpha=0.5, label='US')
plt.hist(ca_df['text_len'], alpha=0.5, label='CA')
plt.legend()
plt.show()
#Compute the sample mean and standard deviation in the two groups.
print(us_df.mean())
print(ca_df.mean())
print(us_df.std())
print(ca_df.std())
#Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
ca_text_len = ca_df.iloc[:, 0]
us_text_len = us_df.iloc[:, 0]
text_len_df = pd.concat([pd.DataFrame({'text_len': ca_text_len, 'country': 'CA'}),
pd.DataFrame({'text_len': us_text_len, 'country': 'US'})])
#Use this DataFrame to create a boxplot of the text_len by country.
text_len_df.boxplot(column='text_len', by='country')
#Use OLS to compute a 90% confidence interval for the difference in text_len between the two groups.
#Name a problem with describing the confidence interval of tweet length in this way.
mod = sm.OLS.from_formula("text_len ~ country", text_len_df)
res = mod.fit()
print(res.summary(alpha=0.1))
#At level α=5%, test the null hypothesis that the average text length does not differ between the two groups.
#What can you conclude?
print('p = 0.615, so we fail to reject the null hypothesis at alpha = 0.05')
us_rt_df = pd.DataFrame({'retweet_count': tweet['retweet_count']} for tweet in us_results)
ca_rt_df = pd.DataFrame({'retweet_count': tweet['retweet_count']} for tweet in ca_results)
# Overlay the two histograms on one axis.
plt.hist(us_rt_df['retweet_count'], alpha=0.5, label='US')
plt.hist(ca_rt_df['retweet_count'], alpha=0.5, label='CA')
plt.legend()
plt.show()
#Compute the sample mean and standard deviation in the two groups.
print(us_rt_df.mean())
print(ca_rt_df.mean())
print(us_rt_df.std())
print(ca_rt_df.std())
#Create a DataFrame concatenating data from each collection adding a country column to distinguish US from Canada.
ca_retweet_count = ca_rt_df.iloc[:, 0]
us_retweet_count = us_rt_df.iloc[:, 0]
retweet_count_df = pd.concat([pd.DataFrame({'retweet_count': ca_retweet_count, 'country': 'CA'}),
pd.DataFrame({'retweet_count': us_retweet_count, 'country': 'US'})])
#Use this DataFrame to create a boxplot of the retweet_count by country.
retweet_count_df.boxplot(column='retweet_count', by='country')
#Use OLS to compute a 90% confidence interval for the difference in retweet_count between the two groups.
#Note: retweet counts are heavily right-skewed, so a normal-theory interval is questionable here.
mod = sm.OLS.from_formula("retweet_count ~ country", retweet_count_df)
res = mod.fit()
print(res.summary(alpha=0.1))
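Because retweet counts are heavily right-skewed, the normal-theory OLS interval can be misleading. A rank-based Mann-Whitney U test is one robust complement; this is a hedged sketch on simulated counts (the exponential/Poisson mixture below is an illustrative assumption, not the notebook's Twitter data).

```python
import numpy as np
from scipy import stats

# Simulate two heavy-tailed samples of retweet counts (illustrative only).
rng = np.random.default_rng(42)
us_counts = rng.poisson(rng.exponential(5.0, size=200))
ca_counts = rng.poisson(rng.exponential(3.0, size=200))

# Rank-based test: makes no normality assumption about the counts.
u_stat, p_value = stats.mannwhitneyu(us_counts, ca_counts,
                                     alternative='two-sided')
print('U = %s, p = %.4f' % (u_stat, p_value))
```

On the real data, `retweet_count_df` split by `country` would be passed in the same way.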
Use enron.db from last week.
Create a boxplot of the message recipient count (MAX(rno)), splitting the data based on the seniority of the sender.
Compute the sample mean and standard deviation in the two groups.
Create a histogram of the recipient count within each group.
Compute a 90% confidence interval for the difference in recipient count between the two groups. What is a problem with this? How might you fix it?
At level $\alpha=5\%$, test the null hypothesis that the average recipient count does not differ between the two groups. What assumptions are you making? What can you conclude?
Repeat the test in 5. using OLS.
In [11]:
# standard library:
import os
from pprint import pprint
# other modules:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')
%matplotlib inline
In [8]:
%%bash
sqlite3 enron.db .tables
In [16]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 5
Out[16]:
In [28]:
#Create a boxplot of the message recipient count (MAX(rno)), splitting the data based on the seniority of the sender.
#The recipient count of a message is MAX(rno) within each mid, so group by mid.
senior_max = pd.read_sql("""SELECT MAX(rno) AS recipient_count, mid, from_eid, seniority
                            FROM EmployeeBase JOIN MessageBase ON eid = from_eid
                            JOIN RecipientBase USING(mid)
                            WHERE seniority = 'Senior'
                            GROUP BY mid""", conn)
junior_max = pd.read_sql("""SELECT MAX(rno) AS recipient_count, mid, from_eid, seniority
                            FROM EmployeeBase JOIN MessageBase ON eid = from_eid
                            JOIN RecipientBase USING(mid)
                            WHERE seniority = 'Junior'
                            GROUP BY mid""", conn)
print(senior_max.head())
print(len(senior_max))
print(junior_max.head())
print(len(junior_max))
# Combine the query results into one DataFrame and draw the boxplot:
In [ ]:
count_df = pd.concat([pd.DataFrame({'recipient_count': senior_max['recipient_count'], 'seniority': 'Senior'}),
                      pd.DataFrame({'recipient_count': junior_max['recipient_count'], 'seniority': 'Junior'})])
count_df.boxplot(column='recipient_count', by='seniority')