Analyzing data with Pandas

First a little setup. Importing the pandas library as pd


In [ ]:

Set some helpful display options. Uncomment the boilerplate in this cell.


In [ ]:
#pd.set_option("max_columns", 150)
#pd.set_option('max_colwidth',40)
#pd.options.display.float_format = '{:,.2f}'.format
#%matplotlib inline

open and read in the Master.csv and Salaries.csv tables in the data/2017/ directory


In [ ]:

check to see what type each object is with print(table_name). You can also use the .info() method to explore the data's structure.


In [ ]:


In [ ]:

print out sample data for each table with table.head()
see additional options by pressing tab after you type the head() method


In [ ]:


In [ ]:

Now we join the two csv's using pd.merge.
We want to keep all the players names in the master data set
even if their salary is missing from the salary data set.
We can always filter the NaN values out later


In [ ]:

see what columns the joined table contains


In [ ]:

check if all the players have a salary assigned. The easiest way is to deduct the length of the joined table from the master table


In [ ]:

Something went wrong. There are now more players in the joined data set than in the master data set.
Some entries probably got duplicated
Let's check if we have duplicate playerIDs by using .value_counts()


In [ ]:

Yep, we do.
Let's filter out an arbitrary player to see why there is duplication


In [ ]:

As we can see, there are now salaries in the dataset for each year of the players carreer.
We only want to have the most recent salary though.
We therefore need to 'deduplicate' the data set.

But first, let's make sure we get the newest year. We can do this by sorting the data on the newest entry


In [ ]:

Now we deduplicate


In [ ]:

And let's do the check again


In [ ]:

Now we van get into the interesting part: analysis!

What is the average (mean, median, max, min) salary?


In [ ]:

Who makes the most money?


In [ ]:


In [ ]:

What are the most common baseball players salaries?

Draw a histogram.


In [ ]:

We can do the same with the column yearID to see how recent our data is.
We have 30 years in our data set, so we need to do some minor tweaking


In [ ]:

Who are the top 10% highest-paid players?

calculate the 90 percentile cutoff


In [ ]:

filter out players that make more money than the cutoff


In [ ]:

use the nlargest to see the top 10 best paid players


In [ ]:

draw a chart


In [ ]:

save the data


In [ ]: