pandas
Pandas! They are adorable animals. You might think they are the worst animal ever, but that is not true. You might sometimes think pandas is the worst library ever, and that is only kind of true.
The important thing is to use the right tool for the job. pandas is good for some stuff, SQL is good for some stuff, writing raw Python is good for some stuff. You'll figure it out as you go along.
Now let's start coding. Hopefully you did pip install pandas before you started up this notebook.
In [ ]:
# import pandas, but call it pd. Why? Because that's What People Do.
When you import pandas, you use import pandas as pd. That means instead of typing pandas in your code you'll type pd.
You don't have to, but every other person on the planet will be doing it, so you might as well.
Now we're going to read in a file. Our file is called NBA-Census-10.14.2013.csv because we're sports moguls. pandas can read_ different types of files, so try to figure it out by typing pd.read_ and hitting tab for autocomplete.
In [ ]:
# We're going to call this df, which means "data frame"
# It isn't in UTF-8 (I saved it from my mac!) so we need to set the encoding
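Here's a rough sketch of what the import and the read could look like. The real CSV isn't bundled here, so this fakes a three-row file in memory with made-up players. The mac_roman encoding is an assumption based on the "saved it from my mac" hint, so swap in whatever your file actually uses.

```python
import io

import pandas as pd

# Every reader pandas offers starts with read_, same as the tab-complete trick
readers = [name for name in dir(pd) if name.startswith("read_")]
print(readers)

# Made-up stand-in for the real file: three fake rows, encoded the way
# an old Mac might save them (mac_roman, an assumption) instead of UTF-8
raw = "Name,Age,POS\nJosé Example,25,G\nPlayer B,31,F\nPlayer C,22,C\n".encode("mac_roman")

# With the real file this would be something like
# pd.read_csv("NBA-Census-10.14.2013.csv", encoding="mac_roman")
df = pd.read_csv(io.BytesIO(raw), encoding="mac_roman")
print(df)
```

If the accented letter comes out as garbage (or pandas throws a decoding error), the encoding is the first thing to double-check.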
A dataframe is basically a spreadsheet, except it lives in the world of Python or the statistical programming language R. They can't call it a spreadsheet because then people would think those programmers used Excel, which would make them boring and normal and they'd have to wear a tie every day.
Now let's look at our data, since that's what data is for.
In [ ]:
# Let's look at all of it
If we scroll we can see all of it. But maybe we don't want to see all of it. Maybe we hate scrolling?
In [ ]:
# Look at the first few rows
...but maybe we want to see more than a measly five results?
In [ ]:
# Let's look at MORE of the first few rows
But maybe we want to make a basketball joke and see the final four?
In [ ]:
# Let's look at the final few rows
So yes, head and tail work kind of like the terminal commands. That's nice, I guess.
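A quick sketch of head and tail in action, using twelve made-up players instead of the real dataset:

```python
import pandas as pd

# Made-up stand-in data: 12 fake players
df = pd.DataFrame({
    "Name": [f"Player {i}" for i in range(1, 13)],
    "Age": [20 + i for i in range(12)],
})

first_five = df.head()     # no argument means 5 rows
first_ten = df.head(10)    # ask for more
final_four = df.tail(4)    # count from the bottom instead

print(first_five)
print(final_four)
```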
But maybe we're incredibly demanding (which we are) and we want, say, the 6th through the 8th row (which we do). Don't worry (which I know you were), we can do that, too.
In [ ]:
# Show the 6th through the 8th rows
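One way to grab those rows is a plain slice, sketched here on made-up data. The catch is that counting starts at zero and the end of the slice is left out.

```python
import pandas as pd

# Made-up stand-in data: 12 fake players
df = pd.DataFrame({"Name": [f"Player {i}" for i in range(1, 13)]})

# Counting starts at 0, so the 6th row is position 5.
# The end of a slice is left out, so 5:8 means positions 5, 6 and 7.
sixth_to_eighth = df[5:8]
print(sixth_to_eighth)

# .iloc does the same thing and says "by position" out loud
same_rows = df.iloc[5:8]
```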
In [ ]:
# Get the names of the columns, just because
In [ ]:
# If we want to be "correct" we add .values on the end of it
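Something like this, on a one-row made-up dataframe, shows the difference between the two:

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({"Name": ["Player A"], "Age": [25], "POS": ["G"]})

print(df.columns)          # an Index object, with some extra wrapping
print(df.columns.values)   # the plain array underneath

column_names = list(df.columns)   # or just make it a regular list
```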
In [ ]:
# Select only name and age
In [ ]:
# Combining that with .head() to see not-so-many rows
In [ ]:
# We can also do this all in one line, even though it starts looking ugly
# (unlike the cute bears pandas looks ugly pretty often)
NOTE: That was not df['Name', 'Age'], it was df[['Name', 'Age']]. You'll definitely type it wrong all of the time. When things break with pandas it's probably because you forgot to put in a million brackets.
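To see the bracket thing for yourself, here's a small sketch with made-up players. The second half gets it wrong on purpose and catches the error.

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Age": [25, 31, 22],
    "POS": ["G", "F", "C"],
})

# Outer brackets: "I'm selecting from df". Inner brackets: a list of columns.
name_and_age = df[["Name", "Age"]].head(2)
print(name_and_age)

# One set of brackets makes pandas hunt for a single column
# named ('Name', 'Age'), which doesn't exist
try:
    df["Name", "Age"]
    forgot_brackets_error = None
except KeyError as error:
    forgot_brackets_error = error
print("KeyError:", forgot_brackets_error)
```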
In [ ]:
I want to know how many people are in each position. Luckily, pandas can tell me!
In [ ]:
# Grab the POS column, and count the different values in it.
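Now that was a little weird, yes: we used df['POS'] instead of df[['POS']] when viewing the data's details. Single brackets hand back one column (a Series), while double brackets hand back a smaller dataframe.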
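A small sketch of counting positions, with made-up data, that also shows what each bracket style gives you:

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({"POS": ["G", "F", "G", "C", "G", "F"]})

one_column = df["POS"]       # single brackets: a Series
tiny_frame = df[["POS"]]     # double brackets: a one-column DataFrame

# Count how many times each value shows up, most common first
position_counts = one_column.value_counts()
print(position_counts)
```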
But now I'm curious about numbers: how old is everyone? Maybe we could, I don't know, get some statistics about age? Some statistics to describe age?
In [ ]:
# Summary statistics for Age
In [ ]:
# That's pretty good. Does it work for everything? How about the money?
Unfortunately because that has dollar signs and commas it's thought of as a string. We'll fix it in a second, but let's try describing one more thing.
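Here's roughly what that looks like. The data is made up, and the column name Salary is a stand-in, since the real money column may be called something else.

```python
import pandas as pd

# Made-up stand-in data; "Salary" is a hypothetical column name
df = pd.DataFrame({
    "Age": [25, 31, 22, 28],
    "Salary": ["$1,500,000", "$20,000,000", "$800,000", "$5,250,000"],
})

age_stats = df["Age"].describe()
print(age_stats)      # count, mean, std, min, quartiles, max

salary_stats = df["Salary"].describe()
print(salary_stats)   # count, unique, top, freq: it's counting strings, not doing math
```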
In [ ]:
# Doing more describing
In [ ]:
# Take another look at our inches, but only the first few
In [ ]:
# Divide those inches by 12
In [ ]:
# Let's divide ALL of them by 12
In [ ]:
# Can we get statistics on those?
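A sketch of the inches-to-feet math on a few made-up heights. The nice part is that dividing a column divides every row, no loop required.

```python
import pandas as pd

# Made-up stand-in heights, in inches
df = pd.DataFrame({"Ht (In.)": [72, 78, 84, 90]})

print(df["Ht (In.)"].head(2) / 12)   # just the first couple
feet = df["Ht (In.)"] / 12           # every row at once
feet_stats = feet.describe()         # the result is a column too, so describe works
print(feet_stats)
```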
In [ ]:
# Let's look at our original data again
Okay that was nice but unfortunately we can't do anything with it. It's just sitting there, separate from our data. If this were normal code we could do blahblah['feet'] = blahblah['Ht (In.)'] / 12, but since this is pandas, we can't. Right? Right?
In [ ]:
# Store a new column
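It turns out the normal-code approach works fine. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Ht (In.)": [72, 78, 84],
})

# Assigning to a column name that doesn't exist yet creates it
df["feet"] = df["Ht (In.)"] / 12
print(df)
```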
That's cool, maybe we could do the same thing with their salary? Take out the $ and the , and convert it to an integer?
In [ ]:
# Can't just use .replace
In [ ]:
# Need to use this weird .str thing
In [ ]:
# Can't just immediately replace the , either
In [ ]:
# Need to use the .str thing before EVERY string method
In [ ]:
# Describe still doesn't work.
In [ ]:
# Let's convert it to an integer using .astype(int) before we describe it
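Here's one way the cleanup could go, using made-up salaries in a hypothetical Salary column. Passing regex=False is a deliberate choice: depending on your pandas version, a bare "$" may otherwise be read as a regular expression.

```python
import pandas as pd

# Made-up stand-in data; "Salary" is a hypothetical column name
df = pd.DataFrame({"Salary": ["$1,500,000", "$20,000,000", "$800,000"]})

# Plain .replace only swaps values that match completely, so nothing changes here
unchanged = df["Salary"].replace("$", "")

# .str reaches inside each string. regex=False makes pandas treat "$" as a
# literal dollar sign instead of the regex symbol for "end of string".
no_dollars = df["Salary"].str.replace("$", "", regex=False)
no_commas = no_dollars.str.replace(",", "", regex=False)

# Now they're strings of digits, which convert cleanly
salary_numbers = no_commas.astype(int)
print(salary_numbers.describe())
```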
In [ ]:
In [ ]:
# Maybe we can just make them millions?
In [ ]:
# Unfortunately one is "n/a" which is going to break our code, so we can make n/a be 0
In [ ]:
# Remove the .head() piece and save it back into the dataframe
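Putting the whole cleanup together might look something like this. The data and the Salary column name are made up. Turning "n/a" into 0 follows the approach above, but keep in mind those zeros will drag down any averages you compute later.

```python
import pandas as pd

# Made-up stand-in data; "Salary" is a hypothetical column name
df = pd.DataFrame({"Salary": ["$1,500,000", "n/a", "$20,000,000"]})

cleaned = (
    df["Salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .replace("n/a", "0")   # whole-value swap, so plain .replace is the right tool
    .astype(int)
)

# Divide by a million so the numbers are readable, then save it back
df["Salary"] = cleaned / 1_000_000
print(df)
```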
In [ ]:
In [ ]:
# This is just the first few guys in the dataset. Can we order it?
In [ ]:
# Let's try to sort them
Those guys are making nothing! If only there were a way to sort from high to low, a.k.a. descending instead of ascending.
In [ ]:
# It isn't descending = True, unfortunately
In [ ]:
# We can use this to find the oldest guys in the league
In [ ]:
# Or the youngest, by taking out 'ascending=False'
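A sketch of sorting both ways on made-up data. It uses sort_values, which is how current pandas spells it; older tutorials may show a different method name.

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Age": [25, 38, 19, 31],
})

youngest_first = df.sort_values("Age")                  # ascending is the default
oldest_first = df.sort_values("Age", ascending=False)   # flip it

print(oldest_first.head(2))
```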
But sometimes instead of just looking at them, I want to do stuff with them. Play some games with them! Dunk on them, er, describe them! And we don't want to dunk on everyone, only the players above 7 feet tall.
First, we need to check out boolean things.
In [ ]:
# Get a big long list of True and False for every single row.
In [ ]:
# We could use value counts if we wanted
In [ ]:
# But we can also apply this to every single row to say whether YES we want it or NO we don't
In [ ]:
# Instead of putting column names inside of the brackets, we instead
# put the True/False statements. It will only return the players above
# seven feet tall
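The True/False trick, sketched on made-up players with a feet column already in place:

```python
import pandas as pd

# Made-up stand-in data
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "feet": [6.5, 7.1, 6.0, 7.25],
})

is_tall = df["feet"] > 7          # one True/False per row
print(is_tall.value_counts())     # how many of each

seven_footers = df[is_tall]       # keep only the rows marked True
print(seven_footers)
```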
In [ ]:
# Or only the guards
In [ ]:
# Or only the guards who make more than 15 million
In [ ]:
# It might be easier to break down the booleans into separate variables
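Here's a sketch of combining two conditions, first crammed into one line and then broken into named pieces. The data is made up and Salary is a hypothetical column, already in millions.

```python
import pandas as pd

# Made-up stand-in data; "Salary" is a hypothetical column, in millions
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "POS": ["G", "G", "F", "G"],
    "Salary": [18.0, 2.5, 21.0, 15.5],
})

# All in one go: each condition needs its own parentheses, and it's & not "and"
rich_guards = df[(df["POS"] == "G") & (df["Salary"] > 15)]

# Same thing, broken into named pieces
is_guard = df["POS"] == "G"
is_rich = df["Salary"] > 15
also_rich_guards = df[is_guard & is_rich]

print(also_rich_guards)
```

The second version is longer, but when something breaks you can print is_guard or is_rich on its own and see which half is lying to you.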
In [ ]:
# We can save this stuff
In [ ]:
In [ ]:
# Maybe we can compare them to taller players?
In [ ]:
In [ ]:
# This will scream we don't have matplotlib.
matplotlib is a graphing library. It's the Python way to make graphs!
In [ ]:
In [ ]:
# this will open up a weird window that won't do anything
In [ ]:
# So instead you run this code
In [ ]:
But that's ugly. There's a thing called ggplot for R that looks nice. We want to look nice. We want to look like ggplot.
In [ ]:
# Import matplotlib
# What's available?
In [ ]:
# Use ggplot
In [ ]:
# Make a histogram
In [ ]:
# Try some other styles
In [ ]:
That might look better with a little more customization. So let's customize it.
In [ ]:
# Pass in all sorts of stuff!
# Most from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
# range is a matplotlib argument that pandas passes along
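A rough sketch of the style switch and a customized histogram, on made-up heights. The Agg line is only there so the code also runs outside a notebook, and the bin count and range are arbitrary picks, so fiddle with them.

```python
import matplotlib
matplotlib.use("Agg")   # draw off-screen so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in heights, in feet
df = pd.DataFrame({"feet": [6.0, 6.25, 6.5, 6.5, 6.75, 7.0, 7.1]})

print(plt.style.available)   # what's on the menu
plt.style.use("ggplot")

# bins is a pandas argument; range gets handed along to matplotlib's hist
ax = df["feet"].hist(bins=8, range=(6, 8))
ax.set_title("Height in feet")
```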
I want more graphics! Do tall people make more money?!?!
In [ ]:
In [ ]:
In [ ]:
# How does experience relate with the amount of money they're making?
In [ ]:
# At least we can assume height and weight are related
In [ ]:
# At least we can assume height and weight are related
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
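For comparing two columns, a scatter plot straight off the dataframe is one option. This sketch uses made-up numbers, and Weight is a hypothetical column name, so use whatever your weight column is really called.

```python
import matplotlib
matplotlib.use("Agg")   # draw off-screen so this also runs outside a notebook
import pandas as pd

# Made-up stand-in data; "Weight" is a hypothetical column name
df = pd.DataFrame({
    "Ht (In.)": [72, 75, 78, 81, 84],
    "Weight": [180, 200, 215, 240, 260],
})

# x and y are column names; pandas labels the axes for you
ax = df.plot(kind="scatter", x="Ht (In.)", y="Weight")
```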
In [ ]:
In [ ]:
In [ ]:
# We can also use plt separately
# It's SIMILAR but TOTALLY DIFFERENT
In [ ]: