First a little setup. Importing the pandas library as pd
In [ ]:
Set some helpful display options. Uncomment the boilerplate in this cell.
In [ ]:
#pd.set_option("max_columns", 150)
#pd.set_option('max_colwidth',40)
#pd.options.display.float_format = '{:,.2f}'.format
#%matplotlib inline
open and read in the Master.csv and Salaries.csv tables in the data/2017/ directory
In [ ]:
check to see what type each object is with print(table_name). You can also use the .info() method to explore the data's structure.
In [ ]:
In [ ]:
print out sample data for each table with table.head()
see additional options by pressing tab after you type the head() method
In [ ]:
In [ ]:
Now we join the two csv's using pd.merge.
We want to keep all the players names in the master data set
even if their salary is missing from the salary data set.
We can always filter the NaN values out later
In [ ]:
see what columns the joined table contains
In [ ]:
check if all the players have a salary assigned. The easiest way is to deduct the length of the joined table from the master table
In [ ]:
Something went wrong. There are now more players in the joined data set than in the master data set.
Some entries probably got duplicated
Let's check if we have duplicate playerIDs by using .value_counts()
In [ ]:
Yep, we do.
Let's filter out an arbitrary player to see why there is duplication
In [ ]:
As we can see, there are now salaries in the dataset for each year of the players carreer.
We only want to have the most recent salary though.
We therefore need to 'deduplicate' the data set.
But first, let's make sure we get the newest year. We can do this by sorting the data on the newest entry
In [ ]:
Now we deduplicate
In [ ]:
And let's do the check again
In [ ]:
Now we van get into the interesting part: analysis!
In [ ]:
In [ ]:
In [ ]:
Draw a histogram.
In [ ]:
We can do the same with the column yearID to see how recent our data is.
We have 30 years in our data set, so we need to do some minor tweaking
In [ ]:
calculate the 90 percentile cutoff
In [ ]:
filter out players that make more money than the cutoff
In [ ]:
use the nlargest to see the top 10 best paid players
In [ ]:
draw a chart
In [ ]:
save the data
In [ ]: