In [1]:
import pandas as pd
Always name your initial dataframe df
. Always always always. Why? Because everyone else does, and because
df
is your original dataframe.
In [6]:
import matplotlib.pyplot as plt
%matplotlib inline
Your best friends for a quick peek are are
df.head()
to see the first tendf.columns
to see the column namesdf.info()
to see more detailed info about each columndf.describe()
to see summary statistics about each numeric columnOnce you do these few steps, you should be able to draw up a list of ten or so questions to ask your data.
In [8]:
df=pd.read_excel("richpeople.xlsx")
In [10]:
df.head()
df.columns
Out[10]:
I made a terrible mistake on Monday about how I taught sub-selects (the things that are basically SQL WHERE
statements), and it's all because of how I named my variables.
We're going to take a sample of the data to work with so it looks a little cleaner - only take a few columns of the first 8. It's a complicated line, don't sweat it. Just think about the result for now.
In [11]:
recent = df[df['year'] == 2014]
recent = recent[['name','age','gender','selfmade','networthusbillion','countrycode','rank']]
recent = recent.sort_values(by='rank').head(8)
recent
Out[11]:
Who is from the USA? There two different things we can do. This was a big point of confusion in class
Let's look at them both, then see how they're used.
In [13]:
recent['countrycode']=='USA'
Out[13]:
This just gave us a list of True
and False
values.
On the other hand, we can also use similar syntax to get a list of every person from the USA.
In [ ]:
That one was a two-step process,
True
/False
values (are they from the USA?), thenTrue
/False
list back into the dataframe.For every row that's True
, it let us keep the row. If it was False
, it doesn't let us keep the row.
Let's say we wanted to select every single row that is from USA, has inherited wealth, and is in the top 10 wealthiest people in the world. If we wanted to do it all on one line, we could!
In [ ]:
That looks super confusing. In it we have three separate things we're testing for (a.k.a. getting True/False back)
sample['countrycode'] == 'USA'
sample['selfmade'] == 'inherited'
sample['rank'] < 10
And we're jamming them all into sample[blah blah]
to get back a dataframe for which all of those are true. It works, but it's ugly.
Again, let's remind ourselves what a piece like that looks like, it's just a list of True and False values
In [ ]:
If we felt like the code we had before was a little too complicated, we could break it down into manageable, well-named pieces like this:
In [ ]:
Let's look at each individual piece.
is_american
, which is the same thing as sample['countrycode'] == 'USA'
, is not a list of people. It is not a dataframe. It is a list of true and false values.
In [ ]:
It's the same thing with is_inherited
and is_top_ten
. It's just a shortcut for me, so I don't have to type out sample['selfmade'] == 'inherited'
. We can combine them all using &
to get a list of True
/False
values in which every one of those conditions is true
In [ ]:
Which we can then feed into the dataframe
In [ ]:
In [ ]:
In [ ]:
In [14]:
recent = df[df['year'] == 2014]
recent = recent[['name','countrycode','networthusbillion']]
recent.head()
Out[14]:
I'm going to get my GDP data from this web site, because... I googled GDP 2014 and it came up.
We could manually say okay, so A. Jerrold Perenchio has 2.6 billion dollars, and USA's GDP was 17.418 trillion (17418 billion), so he's at about 0.015% of the GDP. But we are lazy, so we want the computer to do it. In order to do that, though, we need to merge that information into our existing dataset.
I'm also too lazy to look up the GDP for every country, so I'm just going to pick the top ten.
In [15]:
# Make a list of dictionaries
country_gdp = [
{ 'country': 'USA', 'gdp': 17.418 },
{ 'country': 'DEU', 'gdp': 3.859 },
{ 'country': 'CHN', 'gdp': 10.380 },
{ 'country': 'RUS', 'gdp': 1.857 },
{ 'country': 'JPN', 'gdp': 4.616 },
{ 'country': 'BRA', 'gdp': 2.353 },
{ 'country': 'HKG', 'gdp': 0.289 },
{ 'country': 'FRA', 'gdp': 2.846 },
{ 'country': 'GBR', 'gdp': 2.945 },
{ 'country': 'IND', 'gdp': 2.049 }
]
country_gdp
Out[15]:
I'm also too lazy to find a csv that has the data I need, so I'm going to make a dataframe out of a dictionary. It's pretty easy.
In [17]:
# make a data frame
gdp_df=pd.DataFrame(country_gdp)
gdp_df
Out[17]:
In [18]:
gdp_df['billions'] = gdp_df['gdp']*1000
gdp_df
Out[18]:
In [20]:
recent.head(3)
Out[20]:
In [21]:
gdp_df.head(3)
Out[21]:
In [ ]:
# we are matching up:
#countrycode from recent to country from gdp_df
In [24]:
merged=recent.merge(gdp_df, left_on="countrycode", right_on="country")
merged.head()
Out[24]:
In [25]:
len(gdp_df) #len counts the number of columns in a table frame gdp_df
Out[25]:
In [48]:
gdp_df
Out[48]:
In [26]:
merged[merged['countrycode']=='SAU']
Out[26]:
In [27]:
recent[recent['countrycode']=='SAU']
Out[27]:
In [28]:
len(recent)
Out[28]:
In [29]:
len(merged)
Out[29]:
In [30]:
# A left join will allow you to have empty non-matching values so we'll have our Saudis back
left_merged = recent.merge(gdp_df, how="left", left_on="countrycode", right_on="country")
# on = 'country'
left_merged.head()
Out[30]:
In [ ]:
#!= not equal
In [31]:
# .isnull will tell you if something is NaN or not
left_merged['billions'].isnull()
# or use.notnull() to pull out the ones where the value is not missing
only_people_with_gdps = left_merged[left_merged['billions'].notnull()]
only_people_with_gdps.head()
Out[31]:
Hmmm see how rank skips through 1, 4, 5, 6, 6? This is because we only kept rows we have matches for. The second wealthiest person is from Mexico, but since we didn't have GDP from Mexico it left. You can see if we just do a simple .count()
In [ ]:
In [ ]:
In [ ]:
The default kind of join is an inner join. It requires you to have a match on the left hand side AND the right hand side.
If you're in the original dataframe and you don't have a match in the joining dataframe, you disappear. Watch how Carlos "Slim" Helu and Amancio Ortega both disappear.
In [ ]:
In [33]:
import pandas as pd
In [40]:
nba = pd.read_csv("NBA-Census-10.14.2013.csv", encoding='mac_roman')
nba.head()
Out[40]:
In [ ]:
#\n: start a new line
In [ ]:
In [ ]:
In [47]:
nba.sort_values(by ='2013$')
In [ ]:
nba.plot(kind='barh', x='Name', y='')
In [44]:
million_dollar_string.replace ("0","9")
In [45]:
million_dollar_string.replace("$", " ") #" ": empty
In [46]:
million_dollar_string.replce(",", " ")
In [ ]:
million_dollar_string.replace(",", " ").replace("$", " ")