Python for Data Analysis Lightning Tutorials

Pandas Cookbook Series

Python for Data Analysis Lightning Tutorials is a series of tutorials in Data Analysis, Statistics, and Graphics using Python. The Pandas Cookbook series of tutorials provides recipes for common tasks and moves on to more advanced topics in statistics and time series analysis.

Created by Alfred Essa, Dec 22nd, 2013

Note: IPython Notebook and Data files can be found at my Github Site: http://github/alfredessa

Chapter 2: Common Operations

2.1 Problem. How can I sort in a DataFrame?

2.11 Review: DataFrame Object

The DataFrame data structure in Pandas is a two-dimensional labeled array.

2.12 Import Pandas and Read .CSV File of Major League Baseball Salaries


In [32]:
import pandas as pd

In [33]:
mlb = pd.read_csv('data/mlbsalaries.csv')

In [35]:
mlb.tail()


Out[35]:
Year Player Salary Position Team
19538 2011 Gustavo Molina 455000 Catcher New York Yankees
19539 2011 Ivan Nova 432900 Pitcher New York Yankees
19540 2011 Colin Curtis 420400 Outfielder New York Yankees
19541 2011 Eduardo Nunez 419300 Shortstop New York Yankees
19542 2011 Reegie Corona 414000 Second Baseman New York Yankees

2.13 Review: Use value_counts() method to identify unique values and corresponding counts


In [39]:
mlb.Year.value_counts()


Out[39]:
1990    972
1999    919
2008    858
2001    857
2002    849
2007    848
1998    846
2011    843
2000    843
2004    833
2005    831
2010    828
2003    826
2006    823
2009    818
1995    805
1993    786
1997    773
1996    768
1992    753
1994    748
1991    720
1989    710
1988    686
dtype: int64

In [ ]:

2.14 Create a subset of mlb dataset for Year 2010


In [40]:
yr2010 = mlb[mlb.Year==2010]

In [41]:
yr2010 = yr2010.set_index('Player')

In [42]:
yr2010.head()


Out[42]:
Year Salary Position Team
Player
Torii Hunter 2010 18500000 Outfielder Los Angeles Angels
Bobby Abreu 2010 9000000 Outfielder Los Angeles Angels
Brian Fuentes 2010 9000000 Pitcher Los Angeles Angels
Scott Kazmir 2010 8000000 Pitcher Los Angeles Angels
Joel Pineiro 2010 8000000 Pitcher Los Angeles Angels

In [ ]:

2.15 Sort Operations


In [44]:
# sort row labels
yr2010.sort_index().head()


Out[44]:
Year Salary Position Team
Player
A.J. Burnett 2010 16500000 Pitcher New York Yankees
A.J. Pierzynski 2010 6750000 Catcher Chicago White Sox
Aaron Cook 2010 9625000 Pitcher Colorado Rockies
Aaron Harang 2010 12500000 Pitcher Cincinnati Reds
Aaron Heilman 2010 2150000 Pitcher Arizona Diamondbacks

In [45]:
# sort column labels
yr2010.sort_index(axis=1).head()


Out[45]:
Position Salary Team Year
Player
Torii Hunter Outfielder 18500000 Los Angeles Angels 2010
Bobby Abreu Outfielder 9000000 Los Angeles Angels 2010
Brian Fuentes Pitcher 9000000 Los Angeles Angels 2010
Scott Kazmir Pitcher 8000000 Los Angeles Angels 2010
Joel Pineiro Pitcher 8000000 Los Angeles Angels 2010

In [46]:
# sort column values using order field; note: the order field returns a series
yr2010.Salary.order(ascending=False).head()


Out[46]:
Player
Alex Rodriguez    33000000
CC Sabathia       24285714
Derek Jeter       22600000
Mark Teixeira     20625000
Johan Santana     20144707
Name: Salary, dtype: int64

In [47]:
# sort column values using sort_index method
sorted_yr2010 = yr2010.sort_index(ascending=False, by = ['Salary'])

In [48]:
sorted_yr2010.head(20)


Out[48]:
Year Salary Position Team
Player
Alex Rodriguez 2010 33000000 Third Baseman New York Yankees
CC Sabathia 2010 24285714 Pitcher New York Yankees
Derek Jeter 2010 22600000 Shortstop New York Yankees
Mark Teixeira 2010 20625000 First Baseman New York Yankees
Johan Santana 2010 20144707 Pitcher New York Mets
Miguel Cabrera 2010 20000000 First Baseman Detroit Tigers
Carlos Beltran 2010 19401569 Outfielder New York Mets
Carlos Lee 2010 19000000 Outfielder Houston Astros
Ryan Howard 2010 19000000 First Baseman Philadelphia Phillies
Alfonso Soriano 2010 19000000 Outfielder Chicago Cubs
Carlos Zambrano 2010 18875000 Pitcher Chicago Cubs
John Lackey 2010 18700000 Pitcher Boston Red Sox
Manny Ramirez 2010 18695006 Outfielder Los Angeles Dodgers
Torii Hunter 2010 18500000 Outfielder Los Angeles Angels
Barry Zito 2010 18500000 Pitcher San Francisco Giants
Ichiro Suzuki 2010 18000000 Outfielder Seattle Mariners
Magglio Ordonez 2010 17825976 Outfielder Detroit Tigers
Todd Helton 2010 17775000 First Baseman Colorado Rockies
Aramis Ramirez 2010 16750000 Third Baseman Chicago Cubs
A.J. Burnett 2010 16500000 Pitcher New York Yankees

In [50]:
yr2010.sort_index(ascending=[False,True], by =['Salary', 'Team']).head(20)


Out[50]:
Year Salary Position Team
Player
Alex Rodriguez 2010 33000000 Third Baseman New York Yankees
CC Sabathia 2010 24285714 Pitcher New York Yankees
Derek Jeter 2010 22600000 Shortstop New York Yankees
Mark Teixeira 2010 20625000 First Baseman New York Yankees
Johan Santana 2010 20144707 Pitcher New York Mets
Miguel Cabrera 2010 20000000 First Baseman Detroit Tigers
Carlos Beltran 2010 19401569 Outfielder New York Mets
Alfonso Soriano 2010 19000000 Outfielder Chicago Cubs
Carlos Lee 2010 19000000 Outfielder Houston Astros
Ryan Howard 2010 19000000 First Baseman Philadelphia Phillies
Carlos Zambrano 2010 18875000 Pitcher Chicago Cubs
John Lackey 2010 18700000 Pitcher Boston Red Sox
Manny Ramirez 2010 18695006 Outfielder Los Angeles Dodgers
Torii Hunter 2010 18500000 Outfielder Los Angeles Angels
Barry Zito 2010 18500000 Pitcher San Francisco Giants
Ichiro Suzuki 2010 18000000 Outfielder Seattle Mariners
Magglio Ordonez 2010 17825976 Outfielder Detroit Tigers
Todd Helton 2010 17775000 First Baseman Colorado Rockies
Aramis Ramirez 2010 16750000 Third Baseman Chicago Cubs
A.J. Burnett 2010 16500000 Pitcher New York Yankees

In [51]:
# Top 10 highest paid players
 top10 = yr2010.Salary.order(ascending=False).head()

In [53]:
type(top10)


Out[53]:
pandas.core.series.Series

In [54]:
#plot highest paid
plt.figure()
top10.plot(label='Salaries')
xticks(rotation='vertical')
plt.legend()


Out[54]:
<matplotlib.legend.Legend at 0x110710750>

In [ ]: