Python for Data Analysis Lightning Tutorials

Pandas Cookbook Series

Python for Data Analysis Lightning Tutorials is a series of tutorials in Data Analysis, Statistics, and Graphics using Python. The Pandas Cookbook series of tutorials provides recipes for common tasks and moves on to more advanced topics in statistics and time series analysis.

Created by Alfred Essa, Dec 22nd, 2013

Note: IPython Notebook and Data files can be found at my Github Site: http://github/alfredessa

Lesson 3.1a (Groupby)

3.1 Introduction

Step1 (Split): The groupby operation splits</i> the dataframe into a group of dataframe based on some criteria. Note that the grouped object is not a dataframe. It has a dictionary-like structure and is also iterable.

Step 2 (Analyze): Once we have a grouped object we can apply</i> functions or run analysis to each group, set of groups, or the entire group.

Step 3 (Combine): We can also combine</i> the results of the analysis into a new data structure(s).

3.11 Load Major League Baseball salaries dataset


In [1]:
import pandas as pd

In [2]:
mlb = pd.read_csv('data/mlbsalaries.csv')

In [4]:
mlb


Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19543 entries, 0 to 19542
Data columns (total 5 columns):
Year        19543  non-null values
Player      19543  non-null values
Salary      19543  non-null values
Position    19543  non-null values
Team        19543  non-null values
dtypes: int64(2), object(3)

3.12 Groupby using Year column


In [5]:
grouped = mlb.groupby('Year')

In [6]:
type(grouped)


Out[6]:
pandas.core.groupby.DataFrameGroupBy

In [7]:
for k,group in grouped:
        print k
        print group


1988
<class 'pandas.core.frame.DataFrame'>
Int64Index: 686 entries, 0 to 685
Data columns (total 5 columns):
Year        686  non-null values
Player      686  non-null values
Salary      686  non-null values
Position    686  non-null values
Team        686  non-null values
dtypes: int64(2), object(3)
1989
<class 'pandas.core.frame.DataFrame'>
Int64Index: 710 entries, 686 to 1395
Data columns (total 5 columns):
Year        710  non-null values
Player      710  non-null values
Salary      710  non-null values
Position    710  non-null values
Team        710  non-null values
dtypes: int64(2), object(3)
1990
<class 'pandas.core.frame.DataFrame'>
Int64Index: 972 entries, 1396 to 2367
Data columns (total 5 columns):
Year        972  non-null values
Player      972  non-null values
Salary      972  non-null values
Position    972  non-null values
Team        972  non-null values
dtypes: int64(2), object(3)
1991
<class 'pandas.core.frame.DataFrame'>
Int64Index: 720 entries, 2368 to 3087
Data columns (total 5 columns):
Year        720  non-null values
Player      720  non-null values
Salary      720  non-null values
Position    720  non-null values
Team        720  non-null values
dtypes: int64(2), object(3)
1992
<class 'pandas.core.frame.DataFrame'>
Int64Index: 753 entries, 3088 to 3840
Data columns (total 5 columns):
Year        753  non-null values
Player      753  non-null values
Salary      753  non-null values
Position    753  non-null values
Team        753  non-null values
dtypes: int64(2), object(3)
1993
<class 'pandas.core.frame.DataFrame'>
Int64Index: 786 entries, 3841 to 4626
Data columns (total 5 columns):
Year        786  non-null values
Player      786  non-null values
Salary      786  non-null values
Position    786  non-null values
Team        786  non-null values
dtypes: int64(2), object(3)
1994
<class 'pandas.core.frame.DataFrame'>
Int64Index: 748 entries, 4627 to 5374
Data columns (total 5 columns):
Year        748  non-null values
Player      748  non-null values
Salary      748  non-null values
Position    748  non-null values
Team        748  non-null values
dtypes: int64(2), object(3)
1995
<class 'pandas.core.frame.DataFrame'>
Int64Index: 805 entries, 5375 to 6179
Data columns (total 5 columns):
Year        805  non-null values
Player      805  non-null values
Salary      805  non-null values
Position    805  non-null values
Team        805  non-null values
dtypes: int64(2), object(3)
1996
<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 6180 to 6947
Data columns (total 5 columns):
Year        768  non-null values
Player      768  non-null values
Salary      768  non-null values
Position    768  non-null values
Team        768  non-null values
dtypes: int64(2), object(3)
1997
<class 'pandas.core.frame.DataFrame'>
Int64Index: 773 entries, 6948 to 7720
Data columns (total 5 columns):
Year        773  non-null values
Player      773  non-null values
Salary      773  non-null values
Position    773  non-null values
Team        773  non-null values
dtypes: int64(2), object(3)
1998
<class 'pandas.core.frame.DataFrame'>
Int64Index: 846 entries, 7721 to 8566
Data columns (total 5 columns):
Year        846  non-null values
Player      846  non-null values
Salary      846  non-null values
Position    846  non-null values
Team        846  non-null values
dtypes: int64(2), object(3)
1999
<class 'pandas.core.frame.DataFrame'>
Int64Index: 919 entries, 8567 to 9485
Data columns (total 5 columns):
Year        919  non-null values
Player      919  non-null values
Salary      919  non-null values
Position    919  non-null values
Team        919  non-null values
dtypes: int64(2), object(3)
2000
<class 'pandas.core.frame.DataFrame'>
Int64Index: 843 entries, 9486 to 10328
Data columns (total 5 columns):
Year        843  non-null values
Player      843  non-null values
Salary      843  non-null values
Position    843  non-null values
Team        843  non-null values
dtypes: int64(2), object(3)
2001
<class 'pandas.core.frame.DataFrame'>
Int64Index: 857 entries, 10329 to 11185
Data columns (total 5 columns):
Year        857  non-null values
Player      857  non-null values
Salary      857  non-null values
Position    857  non-null values
Team        857  non-null values
dtypes: int64(2), object(3)
2002
<class 'pandas.core.frame.DataFrame'>
Int64Index: 849 entries, 11186 to 12034
Data columns (total 5 columns):
Year        849  non-null values
Player      849  non-null values
Salary      849  non-null values
Position    849  non-null values
Team        849  non-null values
dtypes: int64(2), object(3)
2003
<class 'pandas.core.frame.DataFrame'>
Int64Index: 826 entries, 12035 to 12860
Data columns (total 5 columns):
Year        826  non-null values
Player      826  non-null values
Salary      826  non-null values
Position    826  non-null values
Team        826  non-null values
dtypes: int64(2), object(3)
2004
<class 'pandas.core.frame.DataFrame'>
Int64Index: 833 entries, 12861 to 13693
Data columns (total 5 columns):
Year        833  non-null values
Player      833  non-null values
Salary      833  non-null values
Position    833  non-null values
Team        833  non-null values
dtypes: int64(2), object(3)
2005
<class 'pandas.core.frame.DataFrame'>
Int64Index: 831 entries, 13694 to 14524
Data columns (total 5 columns):
Year        831  non-null values
Player      831  non-null values
Salary      831  non-null values
Position    831  non-null values
Team        831  non-null values
dtypes: int64(2), object(3)
2006
<class 'pandas.core.frame.DataFrame'>
Int64Index: 823 entries, 14525 to 15347
Data columns (total 5 columns):
Year        823  non-null values
Player      823  non-null values
Salary      823  non-null values
Position    823  non-null values
Team        823  non-null values
dtypes: int64(2), object(3)
2007
<class 'pandas.core.frame.DataFrame'>
Int64Index: 848 entries, 15348 to 16195
Data columns (total 5 columns):
Year        848  non-null values
Player      848  non-null values
Salary      848  non-null values
Position    848  non-null values
Team        848  non-null values
dtypes: int64(2), object(3)
2008
<class 'pandas.core.frame.DataFrame'>
Int64Index: 858 entries, 16196 to 17053
Data columns (total 5 columns):
Year        858  non-null values
Player      858  non-null values
Salary      858  non-null values
Position    858  non-null values
Team        858  non-null values
dtypes: int64(2), object(3)
2009
<class 'pandas.core.frame.DataFrame'>
Int64Index: 818 entries, 17054 to 17871
Data columns (total 5 columns):
Year        818  non-null values
Player      818  non-null values
Salary      818  non-null values
Position    818  non-null values
Team        818  non-null values
dtypes: int64(2), object(3)
2010
<class 'pandas.core.frame.DataFrame'>
Int64Index: 828 entries, 17872 to 18699
Data columns (total 5 columns):
Year        828  non-null values
Player      828  non-null values
Salary      828  non-null values
Position    828  non-null values
Team        828  non-null values
dtypes: int64(2), object(3)
2011
<class 'pandas.core.frame.DataFrame'>
Int64Index: 843 entries, 18700 to 19542
Data columns (total 5 columns):
Year        843  non-null values
Player      843  non-null values
Salary      843  non-null values
Position    843  non-null values
Team        843  non-null values
dtypes: int64(2), object(3)

3.13 Extract highest-paid player from each group (i.e. year)


In [8]:
# generator comprehension for retrieving highest paid player by year
t = (group.sort_index(by='Salary', ascending=False)[:1] for yr,group in grouped)

In [9]:
t


Out[9]:
<generator object <genexpr> at 0x10f078870>

3.14 Create new dataframe with list of highest-paid players by year


In [10]:
#create empty dataframe for populating
topsalaries = pd.DataFrame()

In [11]:
for line in t:
    topsalaries = topsalaries.append(line)

In [12]:
topsalaries


Out[12]:
Player Position Salary Team Year
321 Gary Carter Catcher 2360714 New York Mets 1988
908 Orel Hershiser Pitcher 2766666 Los Angeles Dodgers 1989
1580 Robin Yount Outfielder 3200000 Milwaukee Brewers 1990
2590 Darryl Strawberry Outfielder 3800000 Los Angeles Dodgers 1991
3441 Bobby Bonilla Outfielder 6100000 New York Mets 1992
4214 Bobby Bonilla Outfielder 6200000 New York Mets 1993
4977 Bobby Bonilla Third Baseman 6300000 New York Mets 1994
6068 Cecil Fielder First Baseman 9237500 Detroit Tigers 1995
6840 Cecil Fielder First Baseman 9237500 Detroit Tigers 1996
7666 Albert Belle Outfielder 10000000 Chicago White Sox 1997
8090 Gary Sheffield Outfielder 10000000 Florida Marlins 1998
9067 Albert Belle Outfielder 11949794 Baltimore Orioles 1999
9743 Kevin J. Brown Pitcher 15714286 Los Angeles Dodgers 2000
10903 Alex Rodriguez Shortstop 22000000 Texas Rangers 2001
11752 Alex Rodriguez Shortstop 22000000 Texas Rangers 2002
12589 Alex Rodriguez Shortstop 22000000 Texas Rangers 2003
13497 Manny Ramirez Outfielder 22500000 Boston Red Sox 2004
14499 Alex Rodriguez Third Baseman 26000000 New York Yankees 2005
15320 Alex Rodriguez Third Baseman 21680727 New York Yankees 2006
16167 Jason Giambi Designated Hitter 23428571 New York Yankees 2007
17023 Alex Rodriguez Third Baseman 28000000 New York Yankees 2008
17846 Alex Rodriguez Third Baseman 33000000 New York Yankees 2009
18675 Alex Rodriguez Third Baseman 33000000 New York Yankees 2010
19513 Alex Rodriguez Third Baseman 32000000 New York Yankees 2011

3.15 Create series of topsalaries and perform plots


In [13]:
ts = topsalaries[['Year','Salary']]

In [15]:
ts = ts.set_index('Year')

In [17]:
ts.plot()


Out[17]:
<matplotlib.axes.AxesSubplot at 0x10f9d57d0>

In [18]:
grouped['Salary'].median().plot()


Out[18]:
<matplotlib.axes.AxesSubplot at 0x10fa18690>

In [ ]: