In this homework we ask you three questions that we expect you to answer using data. For each question we ask you to complete a series of tasks that should help guide you through the data analysis. Complete these tasks and then write a short (100 words or less) answer to the question.
Note: We will briefly discuss this homework assignment on Thursday in class.
For this assignment we will use two databases:
The Sean Lahman's Baseball Database which contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. For more details on the latest release, please read the documentation."
Gapminder is a great resource that contains over 500 data sets related to world indicators such as income, GDP and life expectancy.
In this assignment, you will learn how to:
a. Load in CSV files from the web.
b. Create functions in python.
C. Create plots and summary statistics for exploratory data analysis such as histograms, boxplots and scatter plots.
In [428]:
# The %... is an iPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline
#this line above prepares IPython notebook for working with matplotlib
# See all the "as ..." contructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options
In Lecture 1, we showed a plot that provided evidence that the 2002 and 2003 Oakland A's, a team that used data science, had a competitive advantage. Since, others teams have started using data science as well. Use exploratory data analysis to determine if the competitive advantage has since disappeared.
Load in these CSV files from the Sean Lahman's Baseball Database. For this assignment, we will use the 'Salaries.csv' and 'Teams.csv' tables. Read these tables into a pandas DataFrame
and show the head of each table.
Hint Use the requests, StringIO and zipfile modules to get from the web.
In [429]:
# imports
import requests
import StringIO
import zipfile
In [430]:
req = requests.get("http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip", stream=True)
z = zipfile.ZipFile(StringIO.StringIO(req.content))
In [431]:
for zip_item in z.filelist:
print zip_item.filename
In [432]:
salaries_csv = z.extract('Salaries.csv')
teams_csv = z.extract('Teams.csv')
In [433]:
print salaries_csv
print teams_csv
In [434]:
teams = pd.read_csv(teams_csv)
print teams.shape
teams.head()
Out[434]:
In [435]:
teams.dtypes
Out[435]:
In [436]:
salaries = pd.read_csv(salaries_csv)
print salaries.shape
salaries.head()
Out[436]:
In [437]:
salaries.dtypes
Out[437]:
In [438]:
# as_index=False is effectively “SQL-style” grouped output.
# Makes salaries_per_team_per_year a DF instead of group_by object
salaries_per_team_per_year = salaries.groupby(['teamID','yearID'] , as_index=False)
salaries_per_team_per_year.head()
Out[438]:
In [439]:
total_salary_per_team_per_year = salaries_per_team_per_year.sum()
total_salary_per_team_per_year.head()
Out[439]:
In [440]:
# Reduce the Teams Data Frame to only use columns we care about
reduced_teams = teams[['yearID', 'teamID', 'W']]
reduced_teams.head()
Out[440]:
In [441]:
merged_teams = pd.merge(total_salary_per_team_per_year, reduced_teams, on=['yearID', 'teamID'])
print merged_teams.shape
merged_teams.head()
Out[441]:
How would you graphically display the relationship between total wins and total salaries for a given year? What kind of plot would be best? Choose a plot to show this relationship and specifically annotate the Oakland baseball team on the on the plot. Show this plot across multiple years. In which years can you detect a competitive advantage from the Oakland baseball team of using data science? When did this end?
Hints: Use a for
loop to consider multiple years. Use the teamID
(three letter representation of the team name) to save space on the plot.
In [442]:
starting_year = 1998
ending_year = 2007
print merged_teams[(merged_teams['yearID'] >= starting_year) & (merged_teams['yearID'] <= ending_year)].salary.max()
print merged_teams[(merged_teams['yearID'] >= starting_year) & (merged_teams['yearID'] <= ending_year)].salary.min()
In [443]:
oakland_id = 'OAK'
for year in np.arange(starting_year, ending_year + 1):
that_year = merged_teams[merged_teams['yearID'] == year]
plt.scatter(that_year.salary / 1e6, that_year.W)
plt.title('Wins vs. Total Salary ({})'.format(year))
plt.xlabel('Total Salary ($millions)')
plt.ylabel('Wins')
plt.xlim(0, 210)
plt.ylim(30, 130)
plt.annotate(oakland_id,
xy = (that_year['salary'][that_year['teamID'] == oakland_id] / 1e6, that_year['W'][that_year['teamID'] == oakland_id]),
xytext = (-15, 30), textcoords = 'offset points', ha = 'right', va = 'bottom',
bbox = dict(boxstyle = 'round, pad=0.5', fc = 'green', alpha = 0.5),
arrowprops = dict(arrowstyle = '->', facecolor = 'black' , connectionstyle = 'arc3,rad=0')
)
plt.show()
Notes: Used a scatter plot as it is a great way to see the correlation between two variables. In this case, team salaries and wins per season. I looked at these scatter plots from 1998-2007, and it looks like the Oakland A's competitive advantage began in 1999 when the number of wins went up significantly and went up slightly more in 2000. 2001-2003 was when the A's had the greatest comeptitive advantage and were in the top 4 of the league during those years. The competitive advatage declined starting in 2004.
In [444]:
resid_data = pd.DataFrame(index=range(starting_year, ending_year + 1), columns=merged_teams['teamID'].unique())
for year in np.arange(starting_year, ending_year + 1):
data_for_year = merged_teams[merged_teams['yearID'] == year]
x = data_for_year['salary'].values / 1e6
y = data_for_year['W'].values
# y = Ap where A = [x, 1] and p = [m, c] where m is slope and c is intercept
# Add to end of x vector for fitting intercept
A = np.array([x, np.ones(len(x))]).T
m, c = np.linalg.lstsq(A, y)[0]
# regression model predictions
yhat = (m*x+c)
# Annual Residuals
residuals = y - yhat
# Set residual for each team using year as index
for i, team_id in enumerate(data_for_year['teamID']):
resid_data[team_id][year] = residuals[i]
resid_data
Out[444]:
In [445]:
resid_data.index.format()
resid_data.plot(title = 'Residuals from least squares estimates across years', figsize = (15, 8),
color=map(lambda x: 'green' if x=='OAK' else 'gray', resid_data.columns))
plt.xlabel('Year')
plt.ylabel('Residuals')
plt.show()
Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.
Combining both plots from 1(d) and 1(e), it seems that 2001-2003 were the most significant years of competitive advantage for the Oakland A's. Michael Lewis' Moneyball shows how this advantage was a result of the sabermetric approach to assembling a competitive baseball team conducted by the A's general manager Billy Beane. From 2004 on, this advantage seemed to go away as other teams caught on.
Several media reports have demonstrated the income inequality has increased in the US during this last decade. Here we will look at global data. Use exploratory data analysis to determine if the gap between Africa/Latin America/Asia and Europe/NorthAmerica has increased, decreased or stayed the same during the last two decades.
Using the list of countries by continent from World Atlas data, load in the countries.csv
file into a pandas DataFrame and name this data set as countries
. This data set can be found on Github in the 2014_data repository here.
In [446]:
countries = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")
countries.head()
Out[446]:
Using the data available on Gapminder, load in the Income per person (GDP/capita, PPP$ inflation-adjusted) as a pandas DataFrame and name this data set as income
.
Hint: Consider using the pandas function pandas.read_excel()
to read in the .xlsx file directly.
In [447]:
import xlrd
income = pd.read_excel("gdp_per_capita_ppp.xlsx")
income.head()
Out[447]:
Transform the data set to have years as the rows and countries as the columns. Show the head of this data set when it is loaded.
In [448]:
income = income.T
income.columns = income.iloc[0]
income = income.ix[1:]
income.head()
Out[448]:
In [449]:
year = 2000
plt.hist(income.ix[year].dropna().values, bins=30)
plt.title(year)
plt.xlabel('GDP per capita ($USD in 2011)')
plt.ylabel('Frequency')
plt.show()
plt.hist(np.log10(income.ix[year].dropna().values.astype(int)), bins=30)
plt.title(year)
plt.xlabel('GDP per capita (log10 of $USD in 2011)')
plt.ylabel('Frequency')
plt.show()
The best kind of plot for viewing a frequency distribution of values is a histogram. I plotted both the untransformed data and the log10 scale plot since the range of actual values for GDP per capita were extremely large.
In [450]:
"""
Function
--------
mergeByYear
Return a merged DataFrame containing the income,
country name and region for a given year.
Parameters
----------
year : int
The year of interest
Returns
-------
a DataFrame
A pandas DataFrame with three columns titled
'Country', 'Region', and 'Income'.
Example
-------
>>> mergeByYear(2010)
"""
def mergeByYear(yr):
income_by_year = pd.DataFrame({'Income' : income.ix[yr].values, 'Country': income.columns})
income_with_region = pd.merge(income_by_year, countries, on=['Country'])
return income_with_region
In [451]:
mergeByYear(2010).head()
Out[451]:
In [452]:
years = np.arange(1985, 2015 + 1, 5)
for yr in years:
continent_yr_df = mergeByYear(yr)
continent_yr_df.dropna(inplace=True)
continent_yr_df['Income'] = continent_yr_df['Income'].astype(int)
continent_yr_df.boxplot('Income', by='Region', showmeans=True, rot=90)
plt.title("GDP per capita by Continent ({})".format(yr))
plt.xlabel('')
plt.ylabel('GDP per capita ($USD in 2011)')
plt.ylim(10**2, 10**5)
plt.yscale('log')
The largest change by far is Asia which has had trmemdous GDP per capita growth. Africa has had substantial GDP per capita growth, but is still heavily unequal compared to the Europe, Asia, and the Americas. Both North and South America have had small GDP per capita growth of its nations and Europe and Oceania have been the most stagnant in this regard.
Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.
In 2b we saw how per capita GDP was heavily unequally distributed following more of an exponential distribution. There were many counties with very low per capita GDP, and a fat tail of country's with higher per capita GDP.
In 2c, we looked at per capita GDP accross continents. We saw that Asia had explosive growth in that regard, Africa had substantial growth(but is still objectively low relative to other continents), and that more developed continents had miniscule growth. Inter-continental income inequality as measured by GDP per capita has decreased substantially for Asia and relative to more developed continents, and somewhat for Africa. However, the same inequality structure in terms of which continent has more GDP per capita relative to another seems to have not changed over the last 30 years.
Assume you have two list of numbers, X and Y, with distribution approximately normal. X and Y have standard deviation equal to 1, but the average of X is different from the average of Y. If the difference in the average of X and the average of Y is larger than 0, how does the proportion of X > a compare to the proportion of Y > a?
Write a function that analytically calculates the ratio of these two proportions: Pr(X > a)/Pr(Y > a) as function of the difference in the average of X and the average of Y.
Hint: Use the scipy.stats
module for useful functions related to a normal random variable such as the probability density function, cumulative distribution function and survival function.
Update: Assume Y is normally distributed with mean equal to 0.
Show the curve for different values of a (a = 2,3,4 and 5).
In [453]:
"""
Function
--------
ratioNormals
Return ratio of these two proportions:
Pr(X > a)/Pr(Y > a) as function of
the difference in the average of X
and the average of Y.
Parameters
----------
diff : difference in the average of X
and the average of Y.
a : cutoff value
Returns
-------
Returns ratio of these two proportions:
Pr(X > a)/Pr(Y > a)
Example
-------
>>> ratioNormals(diff = 1, a = 2)
"""
def ratioNormals(diff, a):
X = sp.stats.norm(loc=diff, scale=1)
Y = sp.stats.norm(loc=0, scale=1)
return X.sf(a) / Y.sf(a) # .sf is survival function. i.e. Pr(X > a)
In [454]:
diffs = np.linspace(0, 5, 50)
a_vals = range(2, 6)
for a in a_vals:
ratios = [ratioNormals(diff, a) for diff in diffs]
plt.plot(diffs, ratios)
# Plot Description
plt.title('Ratio of Pr(X > a)/Pr(Y > a) vs. difference of mean between X & Y')
plt.xlabel('Difference of mean b/t X & Y')
plt.ylabel('Pr(X > a)/Pr(Y > a)')
plt.legend(["a={}".format(a) for a in a_vals])
plt.yscale('log') # Needed because of a=5 so that other a values can be seen
Now consider the distribution of income per person from two regions: Asia and South America. Estimate the average income per person across the countries in those two regions. Which region has the larger average of income per person across the countries in that region?
Update: Use the year 2012.
In [455]:
country_region_income_2012 = mergeByYear(2012)
asia_sa_2012 = country_region_income_2012[
(country_region_income_2012.Region == 'ASIA') | (country_region_income_2012.Region == 'SOUTH AMERICA')
]
asia_sa_2012 = asia_sa_2012.dropna(subset=['Income'])
asia_sa_2012.Income = asia_sa_2012.Income.astype(int)
grouped_asia_sa_2012 = asia_sa_2012.groupby('Region')
In [456]:
grouped_asia_sa_2012.Income.mean()
Out[456]:
In [457]:
asia_sa_2012.boxplot('Income', by='Region', showmeans=True, rot=90)
plt.title("GDP per capita, Asia v. South America (2012)")
plt.xlabel('')
plt.ylabel('GDP per capita ($USD in 2011)')
plt.ylim(1000, 60000)
Out[457]:
Asia has the larger average of income per person across the countries in its region relative to South America by nearly double.
Calculate the proportion of countries with income per person that is greater than 10,000 dollars. Which region has a larger proportion of countries with income per person greater than 10,000 dollars? If the answer here is different from the answer in 3(b), explain why in light of your answer to 3(a).
Update: Use the year 2012.
In [458]:
grouped_asia_sa_2012.size()
Out[458]:
In [459]:
grouped_asia_sa_2012.apply(lambda x: (x.Income > 10000).mean())
Out[459]:
In [460]:
grouped_asia_sa_2012.Income.std()
Out[460]:
South America has a larger proportion of countries with income per person greater than 10,000 dollars relative to Asia. This is not parallel with the results in 3(b), where where we found Asia has a higher avg. income per person across its countries relative to South America.
The reason for this is: South America has many fewer countries than Asia, and a much smaller standard deviation of gdp cer capita for its countries. The South American avg. gdp cer capita of 13015 is close to its median, while Asia's median is well below its average, and in fact, below that of South America. Asia has a handful of countries with very high gdp cer capita values, and this gives them the advantage when calculating the average. Given those facts, the result is not surprising.
For AC209 Students: Re-run this analysis in Problem 3 but compute the average income per person for each region, instead of the average of the reported incomes per person across countries in the region. Why are these two different? Hint: use this data set.
In [462]:
# Optional: do the same analysis with population weighted gdp per capita for each Region
Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.
The problem starts with the general question of: If group A has larger values than group B on average, does this mean the largest values are from group A?
In short, No. The example of GDP per capita in Asia vs. South America in 2012 showed this assumption doesn't always hold. In this case, the average GDP per capita for a country in Asia was higher than that of South America, but the proportion of countries with GDP per capita higher than $10,000 was higher in South America than in Asia. This is due to a difference in probability distributions of GDP per capita for the countries in either region. Asia had a standard deviation around 4 times as large as South America, as well as a lower median.
In [ ]: