pandas
Pandas! They are adorable animals. You might think they are the worst animal ever, but that is not true. You might sometimes think pandas is the worst library ever, and that is only kind of true.
The important thing is to use the right tool for the job. pandas is good for some stuff, SQL is good for some stuff, writing raw Python is good for some stuff. You'll figure it out as you go along.
Now let's start coding. Hopefully you did pip install pandas before you started up this notebook.
In [1]:
# import pandas, but call it pd. Why? Because that's What People Do.
import pandas as pd # so that you don't have to type pandas later -- most people use pd instead of pandas
When you import pandas, you use import pandas as pd. That means instead of typing pandas in your code you'll type pd.
You don't have to, but every other person on the planet will be doing it, so you might as well.
Now we're going to read in a file. Our file is called NBA-Census-10.14.2013.csv because we're sports moguls. pandas can read_ different types of files, so try to figure it out by typing pd.read_ and hitting tab for autocomplete.
In [2]:
# We're going to call this df, which means "data frame"
# It isn't in UTF-8 (I saved it from my Mac!) so we need to set the encoding.
# Saved on a Mac, therefore the encoding needs to be mac_roman!
# It will not open if the encoding is not set :(
df = pd.read_csv('NBA-Census-10.14.2013.csv', encoding='mac_roman')
# The encodings you'll bump into most often: utf-8 (what pandas assumes),
# latin-1 (often files saved on a PC) and mac_roman (often files saved on a Mac)
# 'pd.read_csv?' will give you more info about how read_csv works
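To see why the encoding matters, here is a small sketch that does not touch the NBA file at all. The two-row CSV and the names in it are made up for illustration; the only assumption is that the bytes were written as mac_roman, like our file.

```python
import io

import pandas as pd

# Made-up CSV text with an accented character, encoded the way our file was saved
raw = "Name,Age\nJosé,25\nAnn,31\n".encode("mac_roman")

# Those same bytes are not valid UTF-8, so decoding them as UTF-8 fails
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Telling pandas the real encoding reads the accent back correctly
people = pd.read_csv(io.BytesIO(raw), encoding="mac_roman")
print(decoded_as_utf8)
print(people)
```

If you ever get a UnicodeDecodeError from read_csv, trying a different encoding is usually the first thing to check.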
A dataframe is basically a spreadsheet, except it lives in the world of Python or the statistical programming language R. They can't call it a spreadsheet because then people would think those programmers used Excel, which would make them boring and normal and they'd have to wear a tie every day.
Now let's look at our data, since that's what data is for
In [3]:
# Let's look at all of it
print(df)
If we scroll we can see all of it. But maybe we don't want to see all of it. Maybe we hate scrolling?
In [4]:
# Look at the first few rows
df.head() # shows header + first 5 rows!
Out[4]:
...but maybe we want to see more than a measly five results?
In [5]:
# Let's look at MORE of the first few rows
df.head(10) # shows the first 10 rows of the dataframe
Out[5]:
But maybe we want to make a basketball joke and see the final four?
In [6]:
# Let's look at the final few rows
df.tail(4) # shows the final four
Out[6]:
So yes, head and tail work kind of like the terminal commands. That's nice, I guess.
But maybe we're incredibly demanding (which we are) and we want, say, the 6th through the 8th row (which we do). Don't worry (which I know you were), we can do that, too.
In [7]:
# Show the 6th through the 8th rows
# Python counts from 0 and leaves off the end, so that's positions 5, 6 and 7
df[5:8]
Out[7]:
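Slices count positions from zero and leave out the end, which is easy to get wrong by one. A minimal sketch on a made-up ten-row dataframe (the letters are placeholder data, not players):

```python
import pandas as pd

# Made-up data: ten rows, so "A" is the 1st row and "J" is the 10th
toy = pd.DataFrame({"Name": list("ABCDEFGHIJ")})

# A plain slice works by position, starts counting at 0, and excludes the end
sliced = toy[5:8]           # positions 5, 6, 7 -> the 6th, 7th and 8th rows
same_rows = toy.iloc[5:8]   # .iloc says "by position" out loud
print(sliced)
```

Both spellings give the same rows; .iloc is just more explicit about what the numbers mean.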
In [8]:
# Get the names of the columns, just because
df.columns # lists the names of the columns - capitalization has to match exactly when you use them
Out[8]:
In [9]:
# If we want to be "correct" we add .values on the end of it
df.columns.values
Out[9]:
In [10]:
# Select only name and age
columns_we_want = ['Name', 'Age']
# passing the list of columns we want to the data frame
df[columns_we_want]
Out[10]:
In [11]:
# Combining that with .head() to see not-so-many rows
df[columns_we_want].head()
In [12]:
# We can also do this all in one line, even though it starts looking ugly
# (unlike the cute bears pandas looks ugly pretty often)
df[['Name', 'Age']] # brackets brackets
Out[12]:
NOTE: That was not df['Name', 'Age'], it was df[['Name', 'Age']]. You'll definitely type it wrong all of the time. When things break with pandas it's probably because you forgot to put in a million brackets.
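Here is a small sketch of what the different bracket counts actually do. The dataframe is made up for illustration; only the column names are borrowed from our data.

```python
import pandas as pd

# Made-up rows, same column names as the NBA data
toy = pd.DataFrame({"Name": ["Ann", "Bo"], "Age": [31, 25], "POS": ["G", "C"]})

one_column = toy["Name"]             # one label gives back a Series
two_columns = toy[["Name", "Age"]]   # a list of labels gives back a DataFrame

# Without the inner list, pandas looks for ONE column called ('Name', 'Age')
try:
    toy["Name", "Age"]
    missing_brackets_failed = False
except KeyError:
    missing_brackets_failed = True

print(type(one_column).__name__, type(two_columns).__name__, missing_brackets_failed)
```

So the outer brackets mean "give me something from this dataframe" and the inner brackets are just a regular Python list of column names.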
In [13]:
df['POS'] # shows you each position
Out[13]:
I want to know how many people are in each position. Luckily, pandas can tell me!
In [14]:
# Grab the POS column, and count the different values in it.
df['POS'].value_counts() # counts the number of values that match each position
Out[14]:
In [15]:
df['Race'].value_counts() # race of players
Out[15]:
Now that was a little weird, yes - we used df['POS'] instead of df[['POS']] when viewing the data's details.
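A quick sketch of value_counts on a single column, using a handful of made-up positions rather than the real data:

```python
import pandas as pd

# Made-up positions; in the notebook this would be df['POS']
positions = pd.Series(["G", "F", "G", "C", "G", "F"], name="POS")

counts = positions.value_counts()                # most common value comes first
shares = positions.value_counts(normalize=True)  # same thing, as fractions of the total
print(counts)
print(shares)
```

The single brackets matter here because df['POS'] hands back one column (a Series), which is the thing we are counting values in.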
But now I'm curious about numbers: how old is everyone? Maybe we could, I don't know, get some statistics about age? Some statistics to describe age?
In [16]:
# How many players are there of each age?
df['Age'].value_counts() # counts players per age - not really summary statistics yet
Out[16]:
In [17]:
df['Age'].describe() #statistics about NBA players and their ages
Out[17]:
In [18]:
# That's pretty good. Does it work for everything? How about the money?
df.describe() # shows info for all of the numerical data
# EEEK minimum weight = 20 lbs -- seems incorrect
Out[18]:
In [19]:
df['Ht (In.)'].describe()
Out[19]:
In [20]:
#df.columns # look at column names again
df['2013 $'].describe() # this column is strings, not ints -- so we only get count/unique/top/freq, no mean or max :(
Out[20]:
Unfortunately because that has dollar signs and commas it's thought of as a string. We'll fix it in a second, but let's try describing one more thing.
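A minimal sketch of the difference, on a made-up three-row dataframe (the salaries are invented, only the column name matches ours):

```python
import pandas as pd

# Made-up data: one numeric column, one column of text that happens to look like money
toy = pd.DataFrame({
    "Age": [25, 31, 22],
    "2013 $": ["$1,000,000", "$500,000", "n/a"],
})

numeric_summary = toy["Age"].describe()   # count, mean, std, min, quartiles, max
text_summary = toy["2013 $"].describe()   # count, unique, top, freq
print(toy.dtypes)
print(text_summary)
```

Checking df.dtypes is a quick way to find out which columns pandas thinks are numbers and which it thinks are text.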
In [21]:
# Doing more describing
In [22]:
# Take another look at our inches, but only the first few
df['Ht (In.)'].head()
Out[22]:
In [23]:
# Divide those inches by 12
df['Ht (In.)'].head()/12 #divides every single value by 12
Out[23]:
In [24]:
# Let's divide ALL of them by 12
df['Ht (In.)']/12
Out[24]:
In [25]:
# Can we get statistics on those?
height_in_feet = df['Ht (In.)']/12
height_in_feet.describe()
Out[25]:
In [26]:
# Let's look at our original data again
df.head()
Out[26]:
Okay that was nice but unfortunately we can't do anything with it. It's just sitting there, separate from our data. If this were normal code we could do blahblah['feet'] = blahblah['Ht (In.)'] / 12, but since this is pandas, we can't. Right? Right?
In [27]:
# Store a new column
df['Ht (Ft.)'] = df['Ht (In.)']/12 # adds a new column with the height as feet
df.head()
Out[27]:
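The same trick on a tiny made-up dataframe, in case you want to try it without the NBA file. The names and heights are invented.

```python
import pandas as pd

# Made-up heights in inches
toy = pd.DataFrame({"Name": ["Ann", "Bo", "Cy"], "Ht (In.)": [72, 84, 78]})

# The division happens for every row at once, and assigning to a
# column name that doesn't exist yet creates the column
toy["Ht (Ft.)"] = toy["Ht (In.)"] / 12
print(toy)
```

No loop needed: math on a column applies to every value in it.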
In [28]:
df.sort_values('Ht (Ft.)') # automatically sorts from lowest to highest - ascending value
Out[28]:
In [29]:
#shows the tallest players by height in feet
df.sort_values('Ht (Ft.)', ascending=False).head() # ascending=False sorts from highest to lowest
Out[29]:
In [30]:
# shows you who is/isn't above 6'5"
above_or_below_six_five = df['Ht (Ft.)'] > (6 + 5/12) # 6'5" is 6 feet plus 5/12 of a foot
above_or_below_six_five.value_counts() # returns how many players are or are not above 6'5"
Out[30]:
That's cool, maybe we could do the same thing with their salary? Take out the $ and the , and convert it to an integer?
In [31]:
# Can't just use .replace
In [32]:
# Need to use this weird .str thing
In [33]:
# Can't just immediately replace the , either
In [34]:
# Need to use the .str thing before EVERY string method
In [35]:
# Describe still doesn't work.
In [36]:
# Let's convert it to an integer using .astype(int) before we describe it
In [ ]:
In [37]:
# Maybe we can just make them millions?
In [38]:
# Unfortunately one is "n/a" which is going to break our code, so we can make n/a be 0
In [39]:
# Remove the .head() piece and save it back into the dataframe
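The steps in the comments above can be sketched roughly like this. The three salaries are made up, and the new column names salary and salary_millions are hypothetical, picked only for this example; the real notebook may have saved the result somewhere else.

```python
import pandas as pd

# Made-up salaries in the same format as the '2013 $' column
toy = pd.DataFrame({
    "Name": ["Ann", "Bo", "Cy"],
    "2013 $": ["$1,500,000", "$750,000", "n/a"],
})

cleaned = (
    toy["2013 $"]
    .str.replace("$", "", regex=False)  # .str is needed before every string method
    .str.replace(",", "", regex=False)
    .replace("n/a", "0")                # plain .replace swaps whole values, not pieces of them
    .astype(int)
)

toy["salary"] = cleaned                              # hypothetical column name
toy["salary_millions"] = toy["salary"] / 1_000_000   # hypothetical column name
print(toy)
```

Turning "n/a" into 0 keeps the conversion from crashing, but it also drags down the average, so keep that in mind when you describe the column.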
In [ ]:
In [40]:
# This is just the first few guys in the dataset. Can we order it?
In [41]:
# Let's try to sort them
Those guys are making nothing! If only there were a way to sort from high to low, a.k.a. descending instead of ascending.
In [42]:
# It isn't descending = True, unfortunately
In [43]:
# We can use this to find the oldest guys in the league
#shows the oldest players
df.sort_values('Age', ascending=False).head() # ascending=False sorts from highest to lowest
Out[43]:
In [44]:
# Or the youngest, by taking out 'ascending=False'
#shows the youngest players
df.sort_values('Age').head() # automatically sorts from lowest to highest
Out[44]:
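Both directions side by side, on made-up ages, in case the ascending business is still fuzzy:

```python
import pandas as pd

# Made-up players and ages
toy = pd.DataFrame({"Name": ["Ann", "Bo", "Cy", "Di"], "Age": [31, 19, 38, 25]})

youngest_first = toy.sort_values("Age")                  # default: lowest to highest
oldest_first = toy.sort_values("Age", ascending=False)   # flipped: highest to lowest

# sort_values hands back a sorted copy, the original order is untouched
print(oldest_first.head(2))
print(list(toy["Name"]))
```

Because the original dataframe is not changed, you have to save the result into a variable if you want to keep the sorted version.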
But sometimes instead of just looking at them, I want to do stuff with them. Play some games with them! Dunk on them! Or, more realistically, describe them! And we don't want to dunk on everyone, only the players above 7 feet tall.
First, we need to check out boolean things.
In [45]:
# Get a big long list of True and False for every single row.
# True means the player is above 6'5", False means they aren't
above_or_below_six_five = df['Ht (Ft.)'] > (6 + 5/12) # 6'5" is 6 feet plus 5/12 of a foot
# print(above_or_below_six_five)
In [46]:
# We could use value counts if we wanted
above_or_below_six_five.value_counts() # returns how many players are or are not above 6'5"
Out[46]:
In [47]:
# But we can also apply this to every single row to say whether YES we want it or NO we don't
above_seven_feet = df['Ht (Ft.)'] > 7 # True for the seven-footers, False for everyone else
In [48]:
df[df['Race'] == 'Asian']
Out[48]:
In [49]:
# Instead of putting column names inside of the brackets, we instead
# put the True/False statements. It will only return the players above
# seven feet tall
df[df['Ht (Ft.)'] > 7]
In [50]:
# Or only the guards
df[df['POS'] == 'G']
Out[50]:
In [51]:
# Or only the guards who are under 6 feet tall
# are you a guard? AND are below 6 feet tall?
df[(df['POS'] == 'G') & (df['Ht (Ft.)'] < 6)]
Out[51]:
In [52]:
# It might be easier to break down the booleans into separate variables
is_a_guard = df['POS'] == 'G'
is_below_six_feet = df['Ht (Ft.)'] < 6
df[is_a_guard & is_below_six_feet]
Out[52]:
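A small sketch of how the True/False lists combine, on made-up players. The | (or) and ~ (not) pieces go slightly beyond what the cells above use, so treat them as a bonus.

```python
import pandas as pd

# Made-up players
toy = pd.DataFrame({
    "Name": ["Ann", "Bo", "Cy", "Di"],
    "POS": ["G", "C", "G", "F"],
    "Ht (Ft.)": [5.9, 7.1, 6.3, 5.8],
})

is_a_guard = toy["POS"] == "G"
is_below_six_feet = toy["Ht (Ft.)"] < 6

short_guards = toy[is_a_guard & is_below_six_feet]      # both must be True
guards_or_short = toy[is_a_guard | is_below_six_feet]   # either one is enough
not_guards = toy[~is_a_guard]                           # flips True and False
print(short_guards)
```

When you write it all on one line, each comparison needs its own parentheses, because & gets evaluated before == and < otherwise.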
In [53]:
centers = df[df['POS'] == 'C']
guards = df[df['POS'] == 'G']
In [54]:
# We can save this stuff
centers['Ht (Ft.)'].describe()
Out[54]:
In [55]:
guards['Ht (Ft.)'].describe()
Out[55]:
In [56]:
# Maybe we can compare them to taller players?
In [57]:
!pip install matplotlib
In [63]:
%matplotlib inline
import matplotlib.pyplot as plt
# Without matplotlib installed, this would scream that we don't have matplotlib.
df['Ht (Ft.)'].hist()
Out[63]:
matplotlib is a graphing library. It's the Python way to make graphs!
In [64]:
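# Draw and save in the same cell: with inline plots the figure from an earlier
# cell is already gone, so saving on its own gives you a blank image.
# save things as .png and not .jpeg
df['Ht (Ft.)'].hist()
plt.savefig('heights.png')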
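If you want to try saving a plot outside of a notebook, something like this sketch should work. The heights are made up, the file name heights_sketch.png is just a placeholder, and the Agg backend line is only there so it runs without a screen.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to files only, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up heights in feet
heights = pd.Series([6.0, 6.25, 6.5, 6.5, 6.75, 7.0], name="Ht (Ft.)")

ax = heights.hist()   # .hist() hands back the axes it drew on
out_path = os.path.join(tempfile.gettempdir(), "heights_sketch.png")  # placeholder file name
ax.figure.savefig(out_path)   # save the figure that those axes belong to
plt.close(ax.figure)
print(out_path)
```

Saving through ax.figure instead of plt means you always save the plot you just made, not whatever matplotlib considers the current one.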
In [105]:
# this will open up a weird window that won't do anything
In [106]:
# So instead you run this code
In [ ]:
But that's ugly. There's a thing called ggplot for R that looks nice. We want to look nice. We want to look like ggplot.
In [107]:
# Import matplotlib
# What's available?
In [108]:
# Use ggplot
In [109]:
# Make a histogram
In [110]:
# Try some other styles
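The style cells above are empty here, so this is a rough sketch of what they probably looked like rather than the original code. It uses matplotlib's built-in style sheets and made-up heights.

```python
import matplotlib
matplotlib.use("Agg")  # only so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# What's available?
available_styles = plt.style.available
print(available_styles[:5])

# Use ggplot: every plot made after this line picks up the style
plt.style.use("ggplot")

# Make a histogram (made-up heights in feet)
heights = pd.Series([6.0, 6.25, 6.5, 6.5, 6.75, 7.0], name="Ht (Ft.)")
ax = heights.hist()
```

To try some other styles, swap "ggplot" for any other name in the available list and draw the plot again.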
In [ ]:
That might look better with a little more customization. So let's customize it.
In [111]:
# Pass in all sorts of stuff!
# Most from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
# range= is a matplotlib thing that pandas passes along for you
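A sketch of the kind of customization that comment is talking about, with made-up heights. The particular options (4 bins, a 6 to 8 foot range, the title and axis label) are picked for illustration, not taken from the original cell.

```python
import matplotlib
matplotlib.use("Agg")  # only so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up heights in feet
heights = pd.Series([6.0, 6.25, 6.5, 6.5, 6.75, 7.0, 7.25], name="Ht (Ft.)")

fig, ax = plt.subplots(figsize=(6, 4))
heights.hist(
    ax=ax,          # draw on the axes we just made
    bins=4,         # how many bars
    range=(6, 8),   # passed through to matplotlib: only count values from 6 to 8
    grid=False,     # turn off the background grid
)
ax.set_title("Player heights")
ax.set_xlabel("Height (feet)")
```

Anything pandas doesn't recognize itself gets handed on to matplotlib, which is why both sets of docs are worth a look.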
I want more graphics! Do tall people make more money?!?!
In [ ]:
In [ ]:
In [112]:
# How does experience relate with the amount of money they're making?
In [113]:
# At least we can assume height and weight are related
In [114]:
# At least we can assume height and weight are related
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
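A scatter plot is the usual way to eyeball whether two columns are related. This sketch uses made-up numbers, and the weight column name Weight (lb) is hypothetical, so swap in whatever the real column is called.

```python
import matplotlib
matplotlib.use("Agg")  # only so this runs outside a notebook
import pandas as pd

# Made-up heights and weights; 'Weight (lb)' is a hypothetical column name
toy = pd.DataFrame({
    "Ht (In.)": [72, 75, 78, 81, 84],
    "Weight (lb)": [180, 200, 225, 240, 265],
})

# One dot per row, with the two columns as the x and y positions
ax = toy.plot(kind="scatter", x="Ht (In.)", y="Weight (lb)")
```

pandas labels the axes with the column names for you, which is one less thing to type.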
In [ ]:
In [ ]:
In [115]:
# We can also use plt separately
# It's SIMILAR but TOTALLY DIFFERENT
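For comparison, here is roughly what the same kind of plot looks like when you call plt yourself instead of going through the dataframe. The numbers and the Weight (lb) column name are made up, and this is a guess at what the empty cell contained.

```python
import matplotlib
matplotlib.use("Agg")  # only so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up heights and weights; 'Weight (lb)' is a hypothetical column name
toy = pd.DataFrame({
    "Ht (In.)": [72, 75, 78, 81, 84],
    "Weight (lb)": [180, 200, 225, 240, 265],
})

# With plt you hand over the columns yourself and label everything by hand
plt.scatter(toy["Ht (In.)"], toy["Weight (lb)"])
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title("Height vs. weight")
current_axes = plt.gca()
```

Similar result, different ritual: pandas figures out the labels from the dataframe, plt makes you spell them out.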
In [ ]: