Homework 1. Exploratory Data Analysis

Due: Thursday, September 18, 2014 11:59 PM

Introduction

In this homework we ask you three questions that we expect you to answer using data. For each question we ask you to complete a series of tasks that should help guide you through the data analysis. Complete these tasks and then write a short (100 words or less) answer to the question.

Note: We will briefly discuss this homework assignment on Thursday in class.

Data

For this assignment we will use two databases:

Sean Lahman's Baseball Database, which contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. For more details on the latest release, please read the documentation."
Gapminder is a great resource that contains over 500 data sets related to world indicators such as income, GDP and life expectancy.

Purpose

In this assignment, you will learn how to:

a. Load in CSV files from the web.

b. Create functions in Python.

c. Create summary statistics and plots (such as histograms, boxplots and scatter plots) for exploratory data analysis.

Useful libraries for this assignment

numpy, for arrays
pandas, for data frames
matplotlib, for plotting



In [1]:

    
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Problem 1

In Lecture 1, we showed a plot that provided evidence that the 2002 and 2003 Oakland A's, a team that used data science, had a competitive advantage. Since then, other teams have started using data science as well. Use exploratory data analysis to determine if the competitive advantage has since disappeared.

Problem 1(a)

Load the CSV files from Sean Lahman's Baseball Database. For this assignment, we will use the 'Salaries.csv' and 'Teams.csv' tables. Read each table into a pandas DataFrame and show the head of each table.

Hint: Use the requests, StringIO and zipfile modules to get the files from the web.
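As a rough illustration of this hint, the sketch below reads one CSV member out of a zip archive held in memory. It assumes Python 3, where io.BytesIO plays the role of StringIO for binary data. The helper name read_csv_from_zip and the toy archive contents are made up for the example; in practice the bytes would come from something like requests.get(url).content.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_bytes, member):
    """Return one CSV file stored inside an in-memory zip archive as a DataFrame."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle)


# Build a tiny archive in memory so the example runs without a network call.
# The rows are invented and are not taken from the real database.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "Salaries.csv",
        "yearID,teamID,salary\n2002,OAK,100\n2002,NYA,300\n",
    )

salaries_demo = read_csv_from_zip(buffer.getvalue(), "Salaries.csv")
print(salaries_demo.head())
```

Keeping the archive in memory avoids writing temporary files to disk.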



In [2]:

    
#your code here

Problem 1(b)

Summarize the Salaries DataFrame to show the total salaries for each team for each year. Show the head of the new summarized DataFrame.
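One possible way to build such a summary is a group-by followed by a sum. The sketch below uses an invented toy table rather than the real Salaries data; the column names yearID, teamID and salary are assumed to match the real table.

```python
import pandas as pd

# Toy stand-in for the Salaries table (values are invented).
salaries_toy = pd.DataFrame({
    "yearID": [2002, 2002, 2002, 2003],
    "teamID": ["OAK", "OAK", "NYA", "OAK"],
    "salary": [10, 20, 50, 40],
})

# Sum salary within each (year, team) pair; as_index=False keeps the keys as columns.
total_salaries = (
    salaries_toy.groupby(["yearID", "teamID"], as_index=False)["salary"].sum()
)
print(total_salaries.head())
```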



In [3]:

    
#your code here

Problem 1(c)

Merge the new summarized Salaries DataFrame and Teams DataFrame together to create a new DataFrame showing wins and total salaries for each team for each year. Show the head of the new merged DataFrame.

Hint: Merge the DataFrames using teamID and yearID.
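A minimal sketch of a two-key merge, using invented toy tables. The wins column name W is an assumption about the Teams table, and an inner join is only one reasonable choice here.

```python
import pandas as pd

# Toy stand-ins for the summarized salaries and the Teams table (values invented).
totals_toy = pd.DataFrame({
    "yearID": [2002, 2002],
    "teamID": ["OAK", "NYA"],
    "salary": [30, 50],
})
teams_toy = pd.DataFrame({
    "yearID": [2002, 2002, 2003],
    "teamID": ["OAK", "NYA", "OAK"],
    "W": [103, 103, 96],
})

# An inner join keeps only the (year, team) pairs present in both tables.
merged_toy = pd.merge(totals_toy, teams_toy, on=["yearID", "teamID"], how="inner")
print(merged_toy.head())
```

The 2003 row is dropped because the toy salary table has no matching entry for it.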



In [ ]:

    
#your code here

Problem 1(d)

How would you graphically display the relationship between total wins and total salaries for a given year? What kind of plot would be best? Choose a plot to show this relationship and specifically annotate the Oakland baseball team on the plot. Show this plot across multiple years. In which years can you detect a competitive advantage for the Oakland baseball team from using data science? When did this end?

Hints: Use a for loop to consider multiple years. Use the teamID (three letter representation of the team name) to save space on the plot.



In [4]:

    
#your code here

Problem 1(e)

For AC209 Students: Fit a linear regression to the data from each year and obtain the residuals. Plot the residuals against time to detect patterns that support your answer in 1(d).
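For a single year, the residuals from a straight-line fit could be obtained along these lines. This is a sketch using numpy.polyfit on invented numbers; the helper name linear_fit_residuals is hypothetical, and other fitting tools would work equally well.

```python
import numpy as np


def linear_fit_residuals(x, y):
    """Fit y = slope * x + intercept by least squares and return y minus the fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Invented payroll (in millions) and win totals for four teams in one season.
payroll = [40, 60, 80, 120]
wins = [95, 75, 85, 100]

residuals = linear_fit_residuals(payroll, wins)
print(residuals)
```

A positive residual means a team won more games than its payroll alone would predict under the fitted line.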



In [5]:

    
#your code here

Discussion for Problem 1

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.

Problem 2

Several media reports have demonstrated that income inequality has increased in the US during this last decade. Here we will look at global data. Use exploratory data analysis to determine if the gap between Africa/Latin America/Asia and Europe/North America has increased, decreased or stayed the same during the last two decades.

Problem 2(a)

Using the list of countries by continent from World Atlas data, load the countries.csv file into a pandas DataFrame and name this data set countries. This data set can be found on GitHub in the 2014_data repository here.



In [6]:

    
#your code here

Using the data available on Gapminder, load in the Income per person (GDP/capita, PPP$ inflation-adjusted) as a pandas DataFrame and name this data set as income.

Hint: Consider using the pandas function pandas.read_excel() to read in the .xlsx file directly.



In [7]:

    
#your code here

Transform the data set to have years as the rows and countries as the columns. Show the head of this data set when it is loaded.
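One way to reshape a wide table like this is to move the country names into the index and transpose. The toy table below is invented, and the column name country is an assumption; the real Gapminder file may label its first column differently.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (values invented).
income_wide = pd.DataFrame(
    {"country": ["Aland", "Bland"], 2000: [1000.0, 2000.0], 2001: [1100.0, 2100.0]}
)

# Move the country names into the index, then transpose so years become rows.
income_by_year = income_wide.set_index("country").T
income_by_year.index.name = "year"
print(income_by_year.head())
```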



In [8]:

    
#your code here

Problem 2(b)

Graphically display the distribution of income per person across all countries in the world for any given year (e.g. 2000). What kind of plot would be best?



In [9]:

    
#your code here

Problem 2(c)

Write a function to merge the countries and income data sets for any given year.



In [ ]:

    
"""
Function
--------
mergeByYear

Return a merged DataFrame containing the income, 
country name and region for a given year. 

Parameters
----------
year : int
    The year of interest

Returns
-------
a DataFrame
   A pandas DataFrame with three columns titled 
   'Country', 'Region', and 'Income'. 

Example
-------
>>> mergeByYear(2010)
"""
#your code here

Problem 2(d)

Use exploratory data analysis tools such as histograms and boxplots to explore the distribution of the income per person by region data set from 2(c) for a given year. Describe how these distributions change over recent years.

Hint: Use a for loop to consider multiple years.



In [11]:

    
#your code here

Discussion for Problem 2

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.

Problem 3

In general, if group A has larger values than group B on average, does this mean the largest values are from group A? Discuss after completing each of the problems below.

Problem 3(a)

Assume you have two lists of numbers, X and Y, each approximately normally distributed. X and Y both have standard deviation equal to 1, but the average of X is different from the average of Y. If the difference between the average of X and the average of Y is larger than 0, how does the proportion of X > a compare to the proportion of Y > a?

Write a function that analytically calculates the ratio of these two proportions, Pr(X > a)/Pr(Y > a), as a function of the difference between the average of X and the average of Y.

Hint: Use the scipy.stats module for useful functions related to a normal random variable such as the probability density function, cumulative distribution function and survival function.

Update: Assume Y is normally distributed with mean equal to 0.

Show the curve for different values of a (a = 2,3,4 and 5).
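To make the quantity concrete, here is a standard-library sketch of the tail-probability ratio, assuming X ~ Normal(diff, 1) and Y ~ Normal(0, 1). It computes the normal survival function from math.erfc; scipy.stats.norm.sf(a, loc=diff) is the library counterpart suggested by the hint. The helper names normal_sf and ratio_of_tail_probs are made up for this example.

```python
import math


def normal_sf(x, mean=0.0, sd=1.0):
    """Pr(Z > x) for Z ~ Normal(mean, sd), via the complementary error function."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def ratio_of_tail_probs(diff, a):
    """Pr(X > a) / Pr(Y > a), assuming X ~ Normal(diff, 1) and Y ~ Normal(0, 1)."""
    return normal_sf(a, mean=diff) / normal_sf(a, mean=0.0)


# Print the ratio at a fixed difference of 1 for each cutoff in the problem.
for a in (2, 3, 4, 5):
    print(a, ratio_of_tail_probs(1.0, a))
```

Evaluating the same function over a grid of diff values for each a gives the curves the problem asks for.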



In [ ]:

    
"""
Function
--------
ratioNormals

Return ratio of these two proportions: 
    Pr(X > a)/Pr(Y > a) as a function of 
    the difference in the average of X 
    and the average of Y. 

Parameters
----------
diff : difference in the average of X 
    and the average of Y. 
a : cutoff value

Returns
-------
Returns ratio of these two proportions: 
    Pr(X > a)/Pr(Y > a)
    
Example
-------
>>> ratioNormals(diff = 1, a = 2)
"""
#your code here



In [13]:

    
#your code here

Problem 3(b)

Now consider the distribution of income per person from two regions: Asia and South America. Estimate the average income per person across the countries in those two regions. Which region has the larger average of income per person across the countries in that region?

Update: Use the year 2012.



In [14]:

    
#your code here

Problem 3(c)

For each of the two regions, calculate the proportion of countries with income per person greater than 10,000 dollars. Which region has a larger proportion of countries with income per person greater than 10,000 dollars? If the answer here is different from the answer in 3(b), explain why in light of your answer to 3(a).

Update: Use the year 2012.
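The comparison between a regional mean and a regional proportion above a cutoff can be sketched on invented data as follows. The column names follow the mergeByYear docstring; the countries, region labels and incomes are made up and do not reflect the real 2012 figures.

```python
import pandas as pd

# Invented incomes: one very rich country pulls the first region's mean up.
merged_toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Region": ["ASIA", "ASIA", "ASIA", "SOUTH AMERICA", "SOUTH AMERICA"],
    "Income": [50000, 2000, 3000, 12000, 11000],
})

# Average income per person across the countries in each region.
mean_income = merged_toy.groupby("Region")["Income"].mean()

# The mean of a boolean column is the share of True values in each group.
share_above = (merged_toy["Income"] > 10000).groupby(merged_toy["Region"]).mean()

print(mean_income)
print(share_above)
```

In this toy table the region with the larger mean has the smaller share of countries above the cutoff, so the two summaries can disagree.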



In [15]:

    
#your code here

Problem 3(d)

For AC209 Students: Re-run this analysis in Problem 3 but compute the average income per person for each region, instead of the average of the reported incomes per person across countries in the region. Why are these two different? Hint: use this data set.
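The difference between the two averages can be illustrated with a population-weighted mean. The numbers below are invented, and the sketch assumes a population figure is available for each country.

```python
import numpy as np

# Two invented countries: a large poor one and a small rich one.
income_per_person = np.array([1000.0, 9000.0])
population = np.array([9.0, 1.0])  # in millions, made up for the example

# Average of the country-level figures: every country counts equally.
unweighted = income_per_person.mean()

# Average income per person in the region: every person counts equally.
weighted = np.average(income_per_person, weights=population)

print(unweighted, weighted)
```

The unweighted figure treats each country as one observation, while the weighted figure is total income divided by total population.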



In [16]:

    
#your code here

Discussion for Problem 3

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.