The setup

A lot of times when you're using pandas you're also going to use matplotlib, and when you're using matplotlib you're also going to be making inline graphics. So why not do it all at once?



In [1]:

    
import pandas as pd

The import

Always name your initial dataframe df. Always always always. Why? Because everyone else does, and because

when you slice it and dice it into different pieces (only really rich people, only cats, only flights to Finland, etc)
you can give each of those names, while knowing df is your original dataframe.



In [6]:

    
import matplotlib.pyplot as plt
%matplotlib inline

The examining

Your best friends for a quick peek are are

df.head() to see the first ten
df.columns to see the column names
df.info() to see more detailed info about each column
df.describe() to see summary statistics about each numeric column

Once you do these few steps, you should be able to draw up a list of ten or so questions to ask your data.



In [8]:

    
df=pd.read_excel("richpeople.xlsx")



In [10]:

    
df.head()
df.columns









    Out[10]:





Index(['year', 'name', 'rank', 'citizenship', 'countrycode',
       'networthusbillion', 'selfmade', 'typeofwealth', 'gender', 'age',
       'industry', 'IndustryAggregates', 'region', 'north',
       'politicalconnection', 'founder', 'generationofinheritance', 'sector',
       'company', 'companytype', 'relationshiptocompany', 'foundingdate',
       'gdpcurrentus', 'sourceofwealth', 'notes', 'notes2', 'source',
       'source_2', 'source_3', 'source_4'],
      dtype='object')

Slicing your data

I made a terrible mistake on Monday about how I taught sub-selects (the things that are basically SQL WHERE statements), and it's all because of how I named my variables.

We're going to take a sample of the data to work with so it looks a little cleaner - only take a few columns of the first 8. It's a complicated line, don't sweat it. Just think about the result for now.



In [11]:

    
recent = df[df['year'] == 2014]
recent = recent[['name','age','gender','selfmade','networthusbillion','countrycode','rank']]
recent = recent.sort_values(by='rank').head(8)
recent









    Out[11]:






  
    
      
      name
      age
      gender
      selfmade
      networthusbillion
      countrycode
      rank
    
  
  
    
      284
      Bill Gates
      58.0
      male
      self-made
      76.0
      USA
      1
    
    
      348
      Carlos Slim Helu
      74.0
      male
      self-made
      72.0
      MEX
      2
    
    
      124
      Amancio Ortega
      77.0
      male
      self-made
      64.0
      ESP
      3
    
    
      2491
      Warren Buffett
      83.0
      male
      self-made
      58.2
      USA
      4
    
    
      1377
      Larry Ellison
      69.0
      male
      self-made
      48.0
      USA
      5
    
    
      509
      David Koch
      73.0
      male
      inherited
      40.0
      USA
      6
    
    
      381
      Charles Koch
      78.0
      male
      inherited
      40.0
      USA
      6
    
    
      2185
      Sheldon Adelson
      80.0
      male
      self-made
      38.0
      USA
      8

Who is from the USA? There two different things we can do. This was a big point of confusion in class

We can list whether someone is from the USA or not.
Or, we can use that list to select only the people from the USA.

Let's look at them both, then see how they're used.



In [13]:

    
recent['countrycode']=='USA'









    Out[13]:





284      True
348     False
124     False
2491     True
1377     True
509      True
381      True
2185     True
Name: countrycode, dtype: bool

Value counts vs. groupby

groupby whenever you

This just gave us a list of True and False values.

First person? Definitely from the USA.
Next row? Not from the USA.
Third person? Not from the USA.
Fourth? Yes, they're from the USA.

On the other hand, we can also use similar syntax to get a list of every person from the USA.



In [ ]:

That one was a two-step process,

we got a list of True/False values (are they from the USA?), then
fed that True/False list back into the dataframe.

For every row that's True, it let us keep the row. If it was False, it doesn't let us keep the row.

When to use what

Let's say we wanted to select every single row that is from USA, has inherited wealth, and is in the top 10 wealthiest people in the world. If we wanted to do it all on one line, we could!



In [ ]:

That looks super confusing. In it we have three separate things we're testing for (a.k.a. getting True/False back)

sample['countrycode'] == 'USA'
sample['selfmade'] == 'inherited'
sample['rank'] < 10

And we're jamming them all into sample[blah blah] to get back a dataframe for which all of those are true. It works, but it's ugly.

Again, let's remind ourselves what a piece like that looks like, it's just a list of True and False values



In [ ]:

If we felt like the code we had before was a little too complicated, we could break it down into manageable, well-named pieces like this:



In [ ]:

Let's look at each individual piece.

is_american, which is the same thing as sample['countrycode'] == 'USA', is not a list of people. It is not a dataframe. It is a list of true and false values.



In [ ]:

It's the same thing with is_inherited and is_top_ten. It's just a shortcut for me, so I don't have to type out sample['selfmade'] == 'inherited'. We can combine them all using & to get a list of True/False values in which every one of those conditions is true



In [ ]:

Which we can then feed into the dataframe



In [ ]:

So what I'm saying is, these two things are the same:

Method One: Breaking into separate readable things



In [ ]:

Method Two: Doing it all at once



In [ ]:

Make sure you know the difference between these two

df['countrycode'] == 'USA'
df[df['countrycode'] == 'USA']

Joining data

Now, let's say we wanted to join a bunch of data together. We have a lot of countries here, right? And we'd like to know how much of their wealth is concentrated in these rich people.



In [14]:

    
recent = df[df['year'] == 2014]
recent = recent[['name','countrycode','networthusbillion']]
recent.head()









    Out[14]:






  
    
      
      name
      countrycode
      networthusbillion
    
  
  
    
      1
      A. Jerrold Perenchio
      USA
      2.6
    
    
      5
      Abdulla Al Futtaim
      ARE
      2.5
    
    
      6
      Abdulla bin Ahmad Al Ghurair
      ARE
      4.8
    
    
      8
      Abdullah Al Rajhi
      SAU
      2.4
    
    
      9
      Abdulsamad Rabiu
      NGA
      1.2

What do we combine it with?

I'm going to get my GDP data from this web site, because... I googled GDP 2014 and it came up.

We could manually say okay, so A. Jerrold Perenchio has 2.6 billion dollars, and USA's GDP was 17.418 trillion (17418 billion), so he's at about 0.015% of the GDP. But we are lazy, so we want the computer to do it. In order to do that, though, we need to merge that information into our existing dataset.

I'm also too lazy to look up the GDP for every country, so I'm just going to pick the top ten.



In [15]:

    
# Make a list of dictionaries
country_gdp = [
   { 'country': 'USA', 'gdp': 17.418 },
   { 'country': 'DEU', 'gdp': 3.859 },
   { 'country': 'CHN', 'gdp': 10.380 },
   { 'country': 'RUS', 'gdp': 1.857 },
   { 'country': 'JPN', 'gdp': 4.616 },
   { 'country': 'BRA', 'gdp': 2.353 },
   { 'country': 'HKG', 'gdp': 0.289 },
   { 'country': 'FRA', 'gdp': 2.846 },
   { 'country': 'GBR', 'gdp': 2.945 },
   { 'country': 'IND', 'gdp': 2.049 }
]
country_gdp









    Out[15]:





[{'country': 'USA', 'gdp': 17.418},
 {'country': 'DEU', 'gdp': 3.859},
 {'country': 'CHN', 'gdp': 10.38},
 {'country': 'RUS', 'gdp': 1.857},
 {'country': 'JPN', 'gdp': 4.616},
 {'country': 'BRA', 'gdp': 2.353},
 {'country': 'HKG', 'gdp': 0.289},
 {'country': 'FRA', 'gdp': 2.846},
 {'country': 'GBR', 'gdp': 2.945},
 {'country': 'IND', 'gdp': 2.049}]

I'm also too lazy to find a csv that has the data I need, so I'm going to make a dataframe out of a dictionary. It's pretty easy.



In [17]:

    
# make a data frame
gdp_df=pd.DataFrame(country_gdp)
gdp_df



In [18]:

    
gdp_df['billions'] = gdp_df['gdp']*1000
gdp_df

STEP ONE: Finding a join key

My favorite resource for joining data with pandas is this page. It compares everything to SQL JOINs, which I understand. You could also check out this page which I find kind of confusing.

So what we need to do is match up a column between our two dataframes.



In [20]:

    
recent.head(3)









    Out[20]:






  
    
      
      name
      countrycode
      networthusbillion
    
  
  
    
      1
      A. Jerrold Perenchio
      USA
      2.6
    
    
      5
      Abdulla Al Futtaim
      ARE
      2.5
    
    
      6
      Abdulla bin Ahmad Al Ghurair
      ARE
      4.8



In [21]:

    
gdp_df.head(3)

It looks like we're going to be matching up countrycode from df with country from gdp_df.

STEP TWO: Doing the merge

You merge using .merge, and then give two parameters:

left_on, the first dataframe's column name to merge on
right_on, the second dataframe's column name to merge on



In [ ]:

    
# we are matching up:
#countrycode from recent to country from gdp_df



In [24]:

    
merged=recent.merge(gdp_df, left_on="countrycode", right_on="country")
merged.head()









    Out[24]:






  
    
      
      name
      countrycode
      networthusbillion
      country
      gdp
      billions
    
  
  
    
      0
      A. Jerrold Perenchio
      USA
      2.6
      USA
      17.418
      17418.0
    
    
      1
      Abigail Johnson
      USA
      17.3
      USA
      17.418
      17418.0
    
    
      2
      Aerin Lauder Zinterhofer
      USA
      1.1
      USA
      17.418
      17418.0
    
    
      3
      Alan Gerry
      USA
      1.4
      USA
      17.418
      17418.0
    
    
      4
      Alec Gores
      USA
      2.0
      USA
      17.418
      17418.0



In [25]:

    
len(gdp_df) #len counts the number of columns in a table frame gdp_df









    Out[25]:





10



In [48]:

    
gdp_df

recent['countrycode'].describe()



In [26]:

    
merged[merged['countrycode']=='SAU']









    Out[26]:






  
    
      
      name
      countrycode
      networthusbillion
      country
      gdp
      billions



In [27]:

    
recent[recent['countrycode']=='SAU']









    Out[27]:






  
    
      
      name
      countrycode
      networthusbillion
    
  
  
    
      8
      Abdullah Al Rajhi
      SAU
      2.4
    
    
      1711
      Mohammad Al Amoudi
      SAU
      15.3
    
    
      1713
      Mohammed Al Issa
      SAU
      3.1
    
    
      1902
      Prince Alwaleed Bin Talal Alsaud
      SAU
      20.4
    
    
      1903
      Prince Sultan bin Mohammed bin Saud Al Kabeer
      SAU
      3.5
    
    
      2125
      Saleh Kamel
      SAU
      2.4
    
    
      2292
      Sulaiman Al Rajhi
      SAU
      1.9

when we merged, we didn't have a gdp for Saudi Arabi, as a result, we got rid of all of the billionaires who were from Saudi Arabi because we didn't have a match for them



In [28]:

    
len(recent)









    Out[28]:





1653



In [29]:

    
len(merged)









    Out[29]:





1130



In [30]:

    
# A left join will allow you to have empty non-matching values so we'll have our Saudis back
left_merged = recent.merge(gdp_df, how="left", left_on="countrycode", right_on="country")
# on = 'country'
left_merged.head()









    Out[30]:






  
    
      
      name
      countrycode
      networthusbillion
      country
      gdp
      billions
    
  
  
    
      0
      A. Jerrold Perenchio
      USA
      2.6
      USA
      17.418
      17418.0
    
    
      1
      Abdulla Al Futtaim
      ARE
      2.5
      NaN
      NaN
      NaN
    
    
      2
      Abdulla bin Ahmad Al Ghurair
      ARE
      4.8
      NaN
      NaN
      NaN
    
    
      3
      Abdullah Al Rajhi
      SAU
      2.4
      NaN
      NaN
      NaN
    
    
      4
      Abdulsamad Rabiu
      NGA
      1.2
      NaN
      NaN
      NaN



In [ ]:

    
#!= not equal



In [31]:

    
# .isnull will tell you if something is NaN or not
left_merged['billions'].isnull()
# or use.notnull() to pull out the ones where the value is not missing
only_people_with_gdps = left_merged[left_merged['billions'].notnull()]
only_people_with_gdps.head()









    Out[31]:






  
    
      
      name
      countrycode
      networthusbillion
      country
      gdp
      billions
    
  
  
    
      0
      A. Jerrold Perenchio
      USA
      2.6
      USA
      17.418
      17418.0
    
    
      5
      Abigail Johnson
      USA
      17.3
      USA
      17.418
      17418.0
    
    
      6
      Abilio dos Santos Diniz
      BRA
      2.8
      BRA
      2.353
      2353.0
    
    
      8
      Adi Godrej
      IND
      3.5
      IND
      2.049
      2049.0
    
    
      9
      Aerin Lauder Zinterhofer
      USA
      1.1
      USA
      17.418
      17418.0

Hmmm see how rank skips through 1, 4, 5, 6, 6? This is because we only kept rows we have matches for. The second wealthiest person is from Mexico, but since we didn't have GDP from Mexico it left. You can see if we just do a simple .count()



In [ ]:



In [ ]:

We lost like 500 people! Maybe even if we don't have the GDP of their country we want to keep them?

Types of joins

Let's just look at our data real quick before I explain the two major types of joins.



In [ ]:

Inner joins

The default kind of join is an inner join. It requires you to have a match on the left hand side AND the right hand side.

If you're in the original dataframe and you don't have a match in the joining dataframe, you disappear. Watch how Carlos "Slim" Helu and Amancio Ortega both disappear.



In [ ]:

Left join

The other major kind of join is a left join. The "left" dataframe (the one you type first), gets to keep its rows, even if there aren't any matches in the joining dataframe.



In [33]:

    
import pandas as pd



In [40]:

    
nba = pd.read_csv("NBA-Census-10.14.2013.csv", encoding='mac_roman')
nba.head()









    Out[40]:






  
    
      
      Name
      Age
      Team
      POS
      #
      2013 $
      Ht (In.)
      WT
      EXP
      1st Year
      DOB
      School
      City
      State (Province, Territory, Etc..)
      Country
      Race
      HS Only
    
  
  
    
      0
      Gee, Alonzo
      26
      Cavaliers
      F
      33
      $3,250,000
      78
      219
      4
      2009
      5/29/1987
      Alabama
      Riviera Beach, FL
      Florida
      US
      Black
      No
    
    
      1
      Wallace, Gerald
      31
      Celtics
      F
      45
      $10,105,855
      79
      220
      12
      2001
      7/23/1982
      Alabama
      Sylacauga, AL
      Alabama
      US
      Black
      No
    
    
      2
      Williams, Mo
      30
      Trail Blazers
      G
      25
      $2,652,000
      73
      195
      10
      2003
      12/19/1982
      Alabama
      Jackson, MS
      Mississippi
      US
      Black
      No
    
    
      3
      Gladness, Mickell
      27
      Magic
      C
      40
      $762,195
      83
      220
      2
      2011
      7/26/1986
      Alabama A&M
      Birmingham, AL
      Alabama
      US
      Black
      No
    
    
      4
      Jefferson, Richard
      33
      Jazz
      F
      44
      $11,046,000
      79
      230
      12
      2001
      6/21/1980
      Arizona
      Los Angeles, CA
      California
      US
      Black
      No



In [ ]:

    
#\n: start a new line



In [ ]:



In [ ]:



In [47]:

    
nba.sort_values(by ='2013$')









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1944             try:
-> 1945                 return self._engine.get_loc(key)
   1946             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: '2013$'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-47-4ac2b2f65e1d> in <module>()
----> 1 nba.sort_values(by ='2013$')

/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/core/frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position)
   3149 
   3150             by = by[0]
-> 3151             k = self[by].values
   3152             if k.ndim == 2:
   3153 

/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1995             return self._getitem_multilevel(key)
   1996         else:
-> 1997             return self._getitem_column(key)
   1998 
   1999     def _getitem_column(self, key):

/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2002         # get column
   2003         if self.columns.is_unique:
-> 2004             return self._get_item_cache(key)
   2005 
   2006         # duplicate columns & possible reduce dimensionality

/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1348         res = cache.get(item)
   1349         if res is None:
-> 1350             values = self._data.get(item)
   1351             res = self._box_item_values(item, values)
   1352             cache[item] = res

/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/sz2472/.virtualenvs/pandas/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: '2013$'



In [ ]:

    
nba.plot(kind='barh', x='Name', y='')



In [44]:

    
million_dollar_string.replace ("0","9")









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-44-21a20333e867> in <module>()
----> 1 million_dollar_string.replace ("0","9")

NameError: name 'million_dollar_string' is not defined



In [45]:

    
million_dollar_string.replace("$", " ") #" ": empty









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-45-92160d7d57da> in <module>()
----> 1 million_dollar_string.replace("$", " ")

NameError: name 'million_dollar_string' is not defined



In [46]:

    
million_dollar_string.replce(",", " ")









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-46-421cdd3eb2ae> in <module>()
----> 1 million_dollar_string.replce(",", " ")

NameError: name 'million_dollar_string' is not defined



In [ ]:

    
million_dollar_string.replace(",", " ").replace("$", " ")

Other stuff

Did you see this image already? Well, now you have.

	name	age	gender	selfmade	networthusbillion	countrycode	rank
284	Bill Gates	58.0	male	self-made	76.0	USA	1
348	Carlos Slim Helu	74.0	male	self-made	72.0	MEX	2
124	Amancio Ortega	77.0	male	self-made	64.0	ESP	3
2491	Warren Buffett	83.0	male	self-made	58.2	USA	4
1377	Larry Ellison	69.0	male	self-made	48.0	USA	5
509	David Koch	73.0	male	inherited	40.0	USA	6
381	Charles Koch	78.0	male	inherited	40.0	USA	6
2185	Sheldon Adelson	80.0	male	self-made	38.0	USA	8

	name	countrycode	networthusbillion
1	A. Jerrold Perenchio	USA	2.6
5	Abdulla Al Futtaim	ARE	2.5
6	Abdulla bin Ahmad Al Ghurair	ARE	4.8
8	Abdullah Al Rajhi	SAU	2.4
9	Abdulsamad Rabiu	NGA	1.2

	country	gdp
0	USA	17.418
1	DEU	3.859
2	CHN	10.380
3	RUS	1.857
4	JPN	4.616
5	BRA	2.353
6	HKG	0.289
7	FRA	2.846
8	GBR	2.945
9	IND	2.049

	country	gdp	billions
0	USA	17.418	17418.0
1	DEU	3.859	3859.0
2	CHN	10.380	10380.0
3	RUS	1.857	1857.0
4	JPN	4.616	4616.0
5	BRA	2.353	2353.0
6	HKG	0.289	289.0
7	FRA	2.846	2846.0
8	GBR	2.945	2945.0
9	IND	2.049	2049.0

	name	countrycode	networthusbillion
8	Abdullah Al Rajhi	SAU	2.4
1711	Mohammad Al Amoudi	SAU	15.3
1713	Mohammed Al Issa	SAU	3.1
1902	Prince Alwaleed Bin Talal Alsaud	SAU	20.4
1903	Prince Sultan bin Mohammed bin Saud Al Kabeer	SAU	3.5
2125	Saleh Kamel	SAU	2.4
2292	Sulaiman Al Rajhi	SAU	1.9

	Name	Age	Team	POS	#	2013 $	Ht (In.)	WT	EXP	1st Year	DOB	School	City	State (Province, Territory, Etc..)	Country	Race	HS Only
0	Gee, Alonzo	26	Cavaliers	F	33	$3,250,000	78	219	4	2009	5/29/1987	Alabama	Riviera Beach, FL	Florida	US	Black	No
1	Wallace, Gerald	31	Celtics	F	45	$10,105,855	79	220	12	2001	7/23/1982	Alabama	Sylacauga, AL	Alabama	US	Black	No
2	Williams, Mo	30	Trail Blazers	G	25	$2,652,000	73	195	10	2003	12/19/1982	Alabama	Jackson, MS	Mississippi	US	Black	No
3	Gladness, Mickell	27	Magic	C	40	$762,195	83	220	2	2011	7/26/1986	Alabama A&M	Birmingham, AL	Alabama	US	Black	No
4	Jefferson, Richard	33	Jazz	F	44	$11,046,000	79	230	12	2001	6/21/1980	Arizona	Los Angeles, CA	California	US	Black	No