An Introduction to pandas

Pandas! They are adorable animals. You might think they are the worst animal ever, but that is not true. You might sometimes think pandas is the worst library ever, and that is only kind of true.

The important thing is to use the right tool for the job. pandas is good for some stuff, SQL is good for some stuff, and writing raw Python is good for some stuff. You'll figure it out as you go along.

Now let's start coding. Hopefully you did pip install pandas before you started up this notebook.


In [ ]:
# import pandas, but call it pd. Why? Because that's What People Do.

When you import pandas, you use import pandas as pd. That means instead of typing pandas in your code you'll type pd.

You don't have to, but every other person on the planet will be doing it, so you might as well.
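For reference, the import cell probably looks something like this. Printing the version is just an optional sanity check, not something the notebook asks for.

```python
# Import pandas under its conventional short alias
import pandas as pd

# Everything in the library now hangs off the short name
print(pd.__version__)
```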

Now we're going to read in a file. Our file is called NBA-Census-10.14.2013.csv because we're sports moguls. pandas can read_ different types of files, so try to figure it out by typing pd.read_ and hitting tab for autocomplete.


In [ ]:
# We're going to call this df, which means "data frame"
# It isn't in UTF-8 (I saved it from my mac!) so we need to set the encoding
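Here is a minimal sketch of reading a non-UTF-8 CSV. It assumes the file was saved as Mac OS Roman (`mac_roman`), which is a guess based on the "saved it from my mac" comment, and it uses a tiny made-up CSV held in memory instead of the real file.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV, encoded as Mac OS Roman (an assumption)
raw_bytes = "Name,Age\nJosé Example,27\nSam Sample,31\n".encode("mac_roman")

# With the real file this would be something like:
#   df = pd.read_csv("NBA-Census-10.14.2013.csv", encoding="mac_roman")
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="mac_roman")
print(df)
```

If the encoding is wrong you usually get either a `UnicodeDecodeError` or names with garbled accents, so it is worth eyeballing a few rows.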

A dataframe is basically a spreadsheet, except it lives in the world of Python or the statistical programming language R. They can't call it a spreadsheet because then people would think those programmers used Excel, which would make them boring and normal and they'd have to wear a tie every day.

Selecting rows

Now let's look at our data, since that's what data is for.


In [ ]:
# Let's look at all of it

If we scroll we can see all of it. But maybe we don't want to see all of it. Maybe we hate scrolling?


In [ ]:
# Look at the first few rows

...but maybe we want to see more than a measly five results?


In [ ]:
# Let's look at MORE of the first few rows

But maybe we want to make a basketball joke and see the final four?


In [ ]:
# Let's look at the final few rows

So yes, head and tail work kind of like the terminal commands. That's nice, I guess.

But maybe we're incredibly demanding (which we are) and we want, say, the 6th through the 8th row (which we do). Don't worry (which I know you were), we can do that, too.


In [ ]:
# Show the 6th through the 8th rows

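It's kind of like an array, right? Except where in an array we'd say df[0], this time we need to give it two numbers, the start and the end. Counting starts at zero and the end number is left out, so the 6th through the 8th rows are df[5:8].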
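The row-peeking cells above can be sketched like this. The ten players are invented placeholders, not rows from the real file.

```python
import pandas as pd

# Ten made-up players so there is something to slice
df = pd.DataFrame({
    "Name": [f"Player {i}" for i in range(1, 11)],
    "Age": list(range(20, 30)),
})

first_five = df.head()      # head() defaults to five rows
first_eight = df.head(8)    # ask for more
final_four = df.tail(4)     # the last four rows
sixth_to_eighth = df[5:8]   # zero-based start, end not included

print(sixth_to_eighth)
```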

Selecting columns

But jeez, my eyes don't want to go that far over the data. I only want to see, uh, name and age.


In [ ]:
# Get the names of the columns, just because

In [ ]:
# If we want to be "correct" we add .values on the end of it

In [ ]:
# Select only name and age

In [ ]:
# Combining that with .head() to see not-so-many rows

In [ ]:
# We can also do this all in one line, even though it starts looking ugly
# (unlike the cute bears pandas looks ugly pretty often)

NOTE: That was not df['Name', 'Age'], it was df[['Name', 'Age']]. You'll definitely type it wrong all of the time. When things break with pandas it's probably because you forgot to put in a million brackets.
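A small sketch of the column cells and the bracket trap, using a made-up three-row dataframe. Only the column names Name, Age and POS come from this notebook.

```python
import pandas as pd

# Made-up rows; the column names match the ones used in this notebook
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Age": [22, 31, 27],
    "POS": ["G", "C", "F"],
})

print(df.columns)         # an Index object
print(df.columns.values)  # the plain NumPy array underneath

# A list of column names goes inside the selection brackets: two sets of brackets
name_age = df[["Name", "Age"]].head(2)
print(name_age)

# With one set of brackets, pandas reads the two names as a single tuple key
try:
    df["Name", "Age"]
    single_brackets_failed = False
except KeyError:
    single_brackets_failed = True
print(single_brackets_failed)
```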

Describing your data

A powerful tool of pandas is being able to select a portion of your data, because who ordered all that data anyway.


In [ ]:

I want to know how many people are in each position. Luckily, pandas can tell me!


In [ ]:
# Grab the POS column, and count the different values in it.

Now that was a little weird, yes - we used df['POS'] instead of df[['POS']] when viewing the data's details.
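As a rough illustration with invented positions: single brackets hand back one column as a Series, which is the thing that has `value_counts()`, while double brackets hand back a one-column DataFrame.

```python
import pandas as pd

# Made-up positions for five made-up players
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "POS": ["G", "G", "F", "C", "G"],
})

# Count how many times each position appears
pos_counts = df["POS"].value_counts()
print(pos_counts)

# Single brackets vs. double brackets
print(type(df["POS"]).__name__)    # Series
print(type(df[["POS"]]).__name__)  # DataFrame
```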

But now I'm curious about numbers: how old is everyone? Maybe we could, I don't know, get some statistics about age? Some statistics to describe age?


In [ ]:
# Summary statistics for Age

In [ ]:
# That's pretty good. Does it work for everything? How about the money?

Unfortunately because that has dollar signs and commas it's thought of as a string. We'll fix it in a second, but let's try describing one more thing.
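To see the difference, here is a sketch with made-up numbers. The column name `Salary` is a placeholder, since the real money column's name isn't shown here.

```python
import pandas as pd

# "Salary" is a hypothetical column name; the values are invented
df = pd.DataFrame({
    "Age": [20, 25, 30, 35],
    "Salary": ["$1,500,000", "$12,000,000", "$900,000", "$1,500,000"],
})

# Numbers get count, mean, std, min, quartiles and max
age_stats = df["Age"].describe()
print(age_stats)

# Strings only get count, unique, top and freq
salary_stats = df["Salary"].describe()
print(salary_stats)
```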


In [ ]:
# Doing more describing

That's stupid, though, what's an inch even look like? What's 80 inches? I don't have a clue. If only there were some way to manipulate our data.

Manipulating data

Oh wait there is, HA HA HA.


In [ ]:
# Take another look at our inches, but only the first few

In [ ]:
# Divide those inches by 12

In [ ]:
# Let's divide ALL of them by 12

In [ ]:
# Can we get statistics on those?

In [ ]:
# Let's look at our original data again

Okay that was nice but unfortunately we can't do anything with it. It's just sitting there, separate from our data. If this were normal code we could do blahblah['feet'] = blahblah['Ht (In.)'] / 12, but since this is pandas, we can't. Right? Right?


In [ ]:
# Store a new column
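A minimal sketch of storing the result, with invented heights. The column names `Ht (In.)` and `feet` are the ones mentioned in the text above.

```python
import pandas as pd

# Made-up heights in inches
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Ht (In.)": [72, 78, 84],
})

# This computes a new Series but does not keep it anywhere
print(df["Ht (In.)"] / 12)

# Assigning to a new column name saves it into the dataframe
df["feet"] = df["Ht (In.)"] / 12
print(df)
```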

That's cool, maybe we could do the same thing with their salary? Take out the $ and the , and convert it to an integer?


In [ ]:
# Can't just use .replace

In [ ]:
# Need to use this weird .str thing

In [ ]:
# Can't just immediately replace the , either

In [ ]:
# Need to use the .str thing before EVERY string method
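Roughly what those four cells are getting at, using two invented salaries. Passing `regex=False` is my own precaution: in older pandas versions `.str.replace` treated the pattern as a regular expression by default, and `$` means "end of string" in a regex.

```python
import pandas as pd

# Two made-up salary strings
salary = pd.Series(["$1,500,000", "$12,000,000"], name="Salary")

# Plain .replace swaps whole values, and no value is exactly "$",
# so nothing changes
unchanged = salary.replace("$", "")
print(unchanged)

# .str gives access to string methods, and it is needed before each one
cleaned = (
    salary
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
print(cleaned)
```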

In [ ]:
# Describe still doesn't work.

In [ ]:
# Let's convert it to an integer using .astype(int) before we describe it

In [ ]:


In [ ]:
# Maybe we can just make them millions?

In [ ]:
# Unfortunately one is "n/a" which is going to break our code, so we can make n/a be 0

In [ ]:
# Remove the .head() piece and save it back into the dataframe
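Putting the salary cleanup together might look like the sketch below. The column names `Salary` and `millions` are placeholders I made up, and so are the three rows.

```python
import pandas as pd

# Hypothetical column names and invented values, including one "n/a"
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": ["$1,500,000", "n/a", "$12,000,000"],
})

cleaned = (
    df["Salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .replace("n/a", "0")  # whole-value replace: "n/a" becomes "0"
)

# Convert to integers, then scale down to millions and save it
df["millions"] = cleaned.astype(int) / 1_000_000
print(df)
```

Turning "n/a" into 0 keeps the code running, but it also drags the average down, so treat any mean computed this way with some suspicion.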

In [ ]:

The average basketball player makes 3.8 million dollars and is a little over six and a half feet tall.

But who cares about those guys? I don't care about those guys. They're boring. I want the real rich guys!

Sorting and sub-selecting


In [ ]:
# This is just the first few guys in the dataset. Can we order it?

In [ ]:
# Let's try to sort them

Those guys are making nothing! If only there were a way to sort from high to low, a.k.a. descending instead of ascending.


In [ ]:
# It isn't descending = True, unfortunately

In [ ]:
# We can use this to find the oldest guys in the league

In [ ]:
# Or the youngest, by taking out 'ascending=False'
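In current pandas the sorting method is `sort_values` (older versions used a different name, so the original cells may have looked different). A sketch with three made-up players:

```python
import pandas as pd

# Invented ages
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Age": [27, 19, 38],
})

# Ascending is the default, so this puts the youngest first
youngest_first = df.sort_values(by="Age")

# There is no descending=True; you switch ascending off instead
oldest_first = df.sort_values(by="Age", ascending=False)

print(oldest_first)
```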

But sometimes instead of just looking at them, I want to do stuff with them. Play some games with them! Dunk on them! Describe them! And we don't want to dunk on everyone, only the players above 7 feet tall.

First, we need to check out boolean things.


In [ ]:
# Get a big long list of True and False for every single row.

In [ ]:
# We could use value counts if we wanted

In [ ]:
# But we can also apply this to every single row to say whether YES we want it or NO we don't

In [ ]:
# Instead of putting column names inside of the brackets, we instead
# put the True/False statements. It will only return the players above 
# seven feet tall
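The boolean cells above, sketched with four invented heights in a `feet` column:

```python
import pandas as pd

# Made-up heights, already converted to feet
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "feet": [6.5, 7.25, 6.0, 7.08],
})

# One True or False per row
is_tall = df["feet"] > 7
print(is_tall)

# How many of each?
print(is_tall.value_counts())

# Put the True/False Series inside the brackets to keep only the True rows
tall_players = df[is_tall]
print(tall_players)
```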

In [ ]:
# Or only the guards

In [ ]:
# Or only the guards who make more than 15 million

In [ ]:
# It might be easier to break down the booleans into separate variables
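A sketch of combining two conditions, with invented players and a placeholder `millions` column. Conditions are joined with `&`, and each one needs its own parentheses when written inline.

```python
import pandas as pd

# Invented data; "millions" is a hypothetical salary column
df = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "POS": ["G", "G", "C", "F"],
    "millions": [18.0, 2.5, 20.0, 16.0],
})

# All in one line: brackets and parentheses pile up fast
inline = df[(df["POS"] == "G") & (df["millions"] > 15)]

# The same thing with the booleans broken out into named variables
is_guard = df["POS"] == "G"
is_rich = df["millions"] > 15
rich_guards = df[is_guard & is_rich]

print(rich_guards)
```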

In [ ]:
# We can save this stuff

In [ ]:


In [ ]:
# Maybe we can compare them to taller players?

Drawing pictures

Okay okay enough code and enough stupid numbers. I'm visual. I want graphics. Okay????? Okay.


In [ ]:


In [ ]:
# This will scream we don't have matplotlib.

matplotlib is a graphing library. It's the Python way to make graphs!


In [ ]:


In [ ]:
# this will open up a weird window that won't do anything

In [ ]:
# So instead you run this code

In [ ]:

But that's ugly. There's a thing called ggplot for R that looks nice. We want to look nice. We want to look like ggplot.


In [ ]:
# Import matplotlib
# What's available?

In [ ]:
# Use ggplot

In [ ]:
# Make a histogram

In [ ]:
# Try some other styles

In [ ]:

That might look better with a little more customization. So let's customize it.


In [ ]:
# Pass in all sorts of stuff!
# Most from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
# range= is a matplotlib thing (it gets passed through to matplotlib's hist)
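A hedged sketch of the styling and histogram cells, using ten made-up ages. The `Agg` backend line is only there so this runs as a plain script without opening a window. In a notebook you would rely on inline plotting instead.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Ten invented ages
df = pd.DataFrame({"Age": [19, 22, 22, 25, 27, 27, 27, 31, 34, 38]})

# What styles are available?
print(plt.style.available)

# Use the ggplot look
plt.style.use("ggplot")

# bins is a pandas argument; range is handed through to matplotlib
ax = df["Age"].hist(bins=5, range=(15, 40))
ax.set_xlabel("Age")
ax.set_ylabel("Players")
```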

I want more graphics! Do tall people make more money?!?!


In [ ]:


In [ ]:


In [ ]:
# How does experience relate with the amount of money they're making?

In [ ]:
# At least we can assume height and weight are related

In [ ]:
# At least we can assume height and weight are related
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
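One way to draw that kind of comparison is a scatter plot through `DataFrame.plot`. The numbers are invented and `Weight` is a placeholder column name, since the real weight column isn't named here.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook

import pandas as pd

# Invented heights and weights; "Weight" is a hypothetical column name
df = pd.DataFrame({
    "Ht (In.)": [72, 75, 78, 81, 84],
    "Weight": [180, 200, 215, 240, 260],
})

# kind="scatter" needs to know which column goes on which axis
ax = df.plot(kind="scatter", x="Ht (In.)", y="Weight")
```

Swapping the `y` column for a salary or experience column would give the other comparisons from this section.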

In [ ]:


In [ ]:


In [ ]:
# We can also use plt separately
# It's SIMILAR but TOTALLY DIFFERENT

In [ ]: