Analyzing data with Pandas

First a little setup. Importing the pandas library as pd



In [ ]:

Set some helpful display options. Uncomment the boilerplate in this cell.



In [ ]:

    
#pd.set_option("max_columns", 150)
#pd.set_option('max_colwidth',40)
#pd.options.display.float_format = '{:,.2f}'.format
#%matplotlib inline

open and read in the Master.csv and Salaries.csv tables in the data/2017/ directory



In [ ]:

check to see what type each object is with print(table_name). You can also use the .info() method to explore the data's structure.



In [ ]:



In [ ]:

print out sample data for each table with table.head()
see additional options by pressing tab after you type the head() method



In [ ]:



In [ ]:

Now we join the two csv's using pd.merge.
We want to keep all the players names in the master data set
even if their salary is missing from the salary data set.
We can always filter the NaN values out later



In [ ]:

see what columns the joined table contains



In [ ]:

check if all the players have a salary assigned. The easiest way is to deduct the length of the joined table from the master table



In [ ]:

Something went wrong. There are now more players in the joined data set than in the master data set.
Some entries probably got duplicated
Let's check if we have duplicate playerIDs by using .value_counts()



In [ ]:

Yep, we do.
Let's filter out an arbitrary player to see why there is duplication



In [ ]:

As we can see, there are now salaries in the dataset for each year of the players carreer.
We only want to have the most recent salary though.
We therefore need to 'deduplicate' the data set.

But first, let's make sure we get the newest year. We can do this by sorting the data on the newest entry



In [ ]:

Now we deduplicate



In [ ]:

And let's do the check again



In [ ]:

Now we van get into the interesting part: analysis!

What is the average (mean, median, max, min) salary?



In [ ]:

Who makes the most money?



In [ ]:



In [ ]:

What are the most common baseball players salaries?

Draw a histogram.



In [ ]:

We can do the same with the column yearID to see how recent our data is.
We have 30 years in our data set, so we need to do some minor tweaking



In [ ]:

Who are the top 10% highest-paid players?

calculate the 90 percentile cutoff



In [ ]:

filter out players that make more money than the cutoff



In [ ]:

use the nlargest to see the top 10 best paid players



In [ ]:

draw a chart



In [ ]:

save the data



In [ ]: