Series



In [2]:

    
import pandas as pd

A Series is like a cross between a list and a dictionary. The items are stored in an order and there are labels with which you can retrieve them. A Series object also has a name attribute.



In [30]:

    
animals = ["Lion", "Tiger", "Monkey", None]
s = pd.Series(animals)
print(s)
print("The name of this Series: ", s.name)









    



0      Lion
1     Tiger
2    Monkey
3      None
dtype: object
The name of this Series:  None



In [6]:

    
numbers = [1, 2, 3, None]
pd.Series(numbers)









    Out[6]:





0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64



In [7]:

    
import numpy as np
np.NaN == None









    Out[7]:





False



In [9]:

    
np.NaN == np.NaN









    Out[9]:





False



In [10]:

    
np.isnan(np.NaN)









    Out[10]:





True



In [14]:

    
sports = {'Cricket': 'India', 'Football': 'America', 'Soccer': 'Brazil'}
s = pd.Series(sports)
s









    Out[14]:





Cricket       India
Football    America
Soccer       Brazil
dtype: object



In [16]:

    
s.index









    Out[16]:





Index(['India', 'America', 'Brazil'], dtype='object')



In [34]:

    
s = pd.Series(['Cricket', 'Football', 'Soccer'], index = [ 'India', 'America', 'Brazil'])
s









    Out[34]:





India       Cricket
America    Football
Brazil       Soccer
dtype: object

Querying a Series

A pandas Series can be queried either by the index position or the index label. As we saw if you don't give an index to the series, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, you can use the loc attribute.



In [36]:

    
s.iloc[0]









    Out[36]:





'Cricket'



In [37]:

    
s.loc['America']









    Out[37]:





'Football'

iloc and loc are not methods, they are attributes.

Okay, so now we know how to get data out of the series. Let's talk about working with the data. A common task is to want to consider all of the values inside of a series and want to do some sort of operation. This could be trying to find a certain number, summarizing data or transforming the data in some way. A typical programmatic approach to this would be to iterate over all the items in the series, and invoke the operation one is interested in. For instance, we could create a data frame of floating point values. Let's think of these as prices for different products. We could write a little routine which iterates over all of the items in the series and adds them together to get a total. This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, tasks involving mathematics. Pandas and the underlying NumPy libraries support a method of computation called vectorization. Vectorization works with most of the functions in the NumPy library, including the sum function.



In [38]:

    
s = pd.Series(np.random.randint(0,1000,10000))
s.head()









    Out[38]:





0    651
1    125
2    498
3    710
4    610
dtype: int64

Magic functions begin with a percentage sign. If we type % sign and then hit the Tab key, we can see a list of the available magic functions. You could write your own magic functions too, but that's a little bit outside of the scope of this course. We're actually going to use what's called a cellular magic function. These start with two percentage signs and modify a raptor code in the current Jupyter cell. The function we're going to use is called timeit. And as you may have guessed from the name, this function will run our code a few times to determine, on average, how long it takes.



In [40]:

    
%%timeit -n 100
summary = 0
for item in s:
    summary += item









    



100 loops, best of 3: 1.41 ms per loop



In [41]:

    
%%timeit -n 100
np.sum(s)









    



100 loops, best of 3: 143 µs per loop

Related feature in Pandas and NumPy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random variable by 2, we could do so quickly using the += operator directly on the series object.



In [44]:

    
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.iteritems():
    s.set_value(label,  value + 2)









    



10 loops, best of 3: 27.3 ms per loop



In [42]:

    
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.iteritems():
    s.loc[label] = value + 2









    



10 loops, best of 3: 838 ms per loop

But if you find yourself iterating through a series, you should question whether you're doing things in the best possible way. Here's how we would do this using the series set value method.



In [43]:

    
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s += 2









    



10 loops, best of 3: 456 µs per loop

Amazing. Not only is it significantly faster, but it's more concise and maybe even easier to read too. The typical mathematical operations you would expect are vectorized, and the NumPy documentation outlines what it takes to create vectorized functions of your own. One last note on using the indexing operators to access series data. The .loc attribute lets you not only modify data in place, but also add new data as well. If the value you pass in as the index doesn't exist, then a new entry is added. And keep in mind, indices can have mixed types. While it's important to be aware of the typing going on underneath, Pandas will automatically change the underlying NumPy types as appropriate.

Mixed types are also possible



In [48]:

    
s = pd.Series([2,1,2])
s.loc['Animal'] = 'Bear'
s









    Out[48]:





0            2
1            1
2            2
Animal    Bear
dtype: object



In [51]:

    
original_sports = pd.Series({'Archery':'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan'})
cricket_loving_countries = pd.Series(['Australia', 'India', 'England'], index=['Cricket','Cricket','Cricket'])
all_countries = original_sports.append(cricket_loving_countries)
all_countries









    Out[51]:





Archery       Bhutan
Golf        Scotland
Sumo           Japan
Cricket    Australia
Cricket        India
Cricket      England
dtype: object



In [52]:

    
original_sports









    Out[52]:





Archery      Bhutan
Golf       Scotland
Sumo          Japan
dtype: object

There are a couple of important considerations when using append. First, Pandas is going to take your series and try to infer the best data types to use. In this example, everything is a string, so there's no problems here. Second, the append method doesn't actually change the underlying series. It instead returns a new series which is made up of the two appended together. We can see this by going back and printing the original series of values and seeing that they haven't changed. This is actually a significant issue for new Pandas users who are used to objects being changed in place. So watch out for it, not just with append but with other Pandas functions as well.



In [53]:

    
all_countries['Cricket']









    Out[53]:





Cricket    Australia
Cricket        India
Cricket      England
dtype: object

Finally, we see that when we query the appended series for those who have cricket as their national sport, we don't get a single value, but a series itself. This is actually very common, and if you have a relational database background, this is very similar to every table query resulting in a return set which itself is a table.

The DataFrame Data Structure

You can create a DataFrame in many different ways, some of which you might expect. For instance, you can use a group of series, where each series represents a row of data. Or you could use a group of dictionaries, where each dictionary represents a row of data.



In [83]:

    
purchase_1 = pd.Series({'Name':'Kasi',
                        'Item purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name':'Pradeep',
                        'Item purchased': 'Cat Food',
                        'Cost': 21.50})
purchase_3 = pd.Series({'Name':'Sri',
                        'Item purchased': 'Bird Food',
                        'Cost': 5.50})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store1','Store1','Store2'])
df









    Out[83]:






  
    
      
      Cost
      Item purchased
      Name
    
  
  
    
      Store1
      22.5
      Dog Food
      Kasi
    
    
      Store1
      21.5
      Cat Food
      Pradeep
    
    
      Store2
      5.5
      Bird Food
      Sri



In [84]:

    
print(df.loc['Store2'])
type(df.loc['Store2'])









    



Cost                    5.5
Item purchased    Bird Food
Name                    Sri
Name: Store2, dtype: object






    Out[84]:





pandas.core.series.Series



In [86]:

    
print(df.loc['Store1'])
type(df.loc['Store1'])









    



        Cost Item purchased     Name
Store1  22.5       Dog Food     Kasi
Store1  21.5       Cat Food  Pradeep






    Out[86]:





pandas.core.frame.DataFrame

What if we want to do column, for example we want to get a list of all the costs?



In [88]:

    
df.T # This essential turns your column names into indicies









    Out[88]:






  
    
      
      Store1
      Store1
      Store2
    
  
  
    
      Cost
      22.5
      21.5
      5.5
    
    
      Item purchased
      Dog Food
      Cat Food
      Bird Food
    
    
      Name
      Kasi
      Pradeep
      Sri



In [89]:

    
df.T.loc['Cost'] # We can then use the loc method









    Out[89]:





Store1    22.5
Store1    21.5
Store2     5.5
Name: Cost, dtype: object

Since iloc and loc are used for row selection, the Panda's developers reserved indexing operator directly on the DataFrame for column selection. In a Panda's DataFrame, columns always have a name. So this selection is always label based, not as confusing as it was when using the square bracket operator on the series objects. For those familiar with relational databases, this operator is analogous to column projection.



In [85]:

    
print(df['Item purchased'])
type(df['Item purchased'])









    



Store1     Dog Food
Store1     Cat Food
Store2    Bird Food
Name: Item purchased, dtype: object






    Out[85]:





pandas.core.series.Series

Finally, since the result of using the indexing operator is the DataFrame or series, you can chain operations together. For instance, we could have rewritten the query for all Store 1 costs as



In [90]:

    
df.loc['Store1']['Cost']









    Out[90]:





Store1    22.5
Store1    21.5
Name: Cost, dtype: float64

This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting a data, this is not a big deal, though it might be slower than necessary. If you are changing data though, this is an important distinction and can be a source of error.

Here's another method. As we saw, .loc does row selection, and it can take two parameters, the row index and the list of column names. .loc also supports slicing. If we wanted to select all rows, we can use a column to indicate a full slice from beginning to end. And then add the column name as the second parameter as a string. In fact, if we wanted to include multiply columns, we could do so in a list. And Pandas will bring back only the columns we have asked for.



In [91]:

    
df.loc[:, ['Name','Cost']]









    Out[91]:






  
    
      
      Name
      Cost
    
  
  
    
      Store1
      Kasi
      22.5
    
    
      Store1
      Pradeep
      21.5
    
    
      Store2
      Sri
      5.5

So that's selecting and projecting data from a DataFrame based on row and column labels. The key concepts to remember are that the rows and columns are really just for our benefit. Underneath this is just a two axis labeled array, and transposing the columns is easy. Also, consider the issue of chaining carefully, and try to avoid it, it can cause unpredictable results. Where your intent was to obtain a view of the data, but instead Pandas returns to you a copy. In the Panda's world, friends don't let friends chain calls. So if you see it, point it out, and share a less ambiguous solution.



In [92]:

    
df.drop('Store1')









    Out[92]:






  
    
      
      Cost
      Item purchased
      Name
    
  
  
    
      Store2
      5.5
      Bird Food
      Sri

It's easy to delete data in series and DataFrames, and we can use the drop function to do so. This function takes a single parameter, which is the index or roll label, to drop. This is another tricky place for new users to pandas. The drop function doesn't change the DataFrame by default. And instead, returns to you a copy of the DataFrame with the given rows removed. We can see that our original DataFrame is still intact. This is a very typical pattern in Pandas, where in place changes to a DataFrame are only done if need be, usually on changes involving indices. So it's important to be aware of. Drop has two interesting optional parameters. The first is called in place, and if it's set to true, the DataFrame will be updated in place, instead of a copy being returned. The second parameter is the axis, which should be dropped. By default, this value is 0, indicating the row axis. But you could change it to 1 if you want to drop a column.



In [93]:

    
df.drop('Cost',axis=1)









    Out[93]:






  
    
      
      Item purchased
      Name
    
  
  
    
      Store1
      Dog Food
      Kasi
    
    
      Store1
      Cat Food
      Pradeep
    
    
      Store2
      Bird Food
      Sri

There is a second way to drop a column, however. And that's directly through the use of the indexing operator, using the del keyword. This way of dropping data, however, takes immediate effect on the DataFrame and does not return a view.



In [94]:

    
del df['Item purchased']
df









    Out[94]:






  
    
      
      Cost
      Name
    
  
  
    
      Store1
      22.5
      Kasi
    
    
      Store1
      21.5
      Pradeep
    
    
      Store2
      5.5
      Sri

Finally, adding a new column to the DataFrame is as easy as assigning it to some value. For instance, if we wanted to add a new location as a column with default value of none, we could do so by using the assignment operator after the square brackets. This broadcasts the default value to the new column immediately.



In [95]:

    
df['Location'] = None
df









    Out[95]:






  
    
      
      Cost
      Name
      Location
    
  
  
    
      Store1
      22.5
      Kasi
      None
    
    
      Store1
      21.5
      Pradeep
      None
    
    
      Store2
      5.5
      Sri
      None

The common work flow is to read your data into a DataFrame then reduce this DataFrame to the particular columns or rows that you're interested in working with. As you've seen, the Panda's toolkit tries to give you views on a DataFrame. This is much faster than copying data and much more memory efficient too. But it does mean that if you're manipulating the data you have to be aware that any changes to the DataFrame you're working on may have an impact on the base data frame you used originally. Here's an example using our same purchasing DataFrame from earlier. We can create a series based on just the cost category using the square brackets.



In [96]:

    
costs = df['Cost']
costs









    Out[96]:





Store1    22.5
Store1    21.5
Store2     5.5
Name: Cost, dtype: float64

Then we can increase the cost in this series using broadcasting.



In [97]:

    
costs += 2
costs









    Out[97]:





Store1    24.5
Store1    23.5
Store2     7.5
Name: Cost, dtype: float64

Now if we look at our original DataFrame, we see those costs have risen as well. This is an important consideration to watch out for. If you want to explicitly use a copy, then you should consider calling the copy method on the DataFrame for it first.



In [98]:

    
df









    Out[98]:






  
    
      
      Cost
      Name
      Location
    
  
  
    
      Store1
      24.5
      Kasi
      None
    
    
      Store1
      23.5
      Pradeep
      None
    
    
      Store2
      7.5
      Sri
      None

A common workflow is to read the dataset in, usually from some external file. We saw previously how you can do this using Python, and lists, and dictionaries. You can imagine how you might use those dictionaries to create a Pandas DataFrame. Thankfully, Pandas has built-in support for delimited files such as CSV files as well as a variety of other data formats including relational databases, Excel, and HTML tables. I've saved a CSV file called olympics.csv, which has data from Wikipedia that contains a summary list of the medal various countries have won at the Olympics. We can take a look at this file using the shell command cat. Which we can invoke directly using the exclamation point. What happens here is that when the Jupyter notebook sees a line beginning with an exclamation mark, it sends the rest of the line to the operating system shell for evaluation. So cat works on Linux and Macs.



In [142]:

    
!cat olympics.csv









    



0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined Total
 Afghanistan ,14,0,0,2,2,0,0,0,0,0,14,0,0,2,2
 Algeria ,13,5,4,8,17,3,0,0,0,0,16,5,4,8,17
 Argentina ,24,21,25,28,74,18,0,0,0,0,42,21,25,28,74
 Armenia ,6,2,5,7,14,6,0,0,0,0,12,2,5,7,14
 Australasia  ,2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
 Australia  ,26,147,163,187,497,18,5,3,4,12,44,152,166,191,509
 Austria ,27,18,33,36,87,22,59,78,81,218,49,77,111,117,305
 Azerbaijan ,6,7,11,25,43,5,0,0,0,0,11,7,11,25,43
 Bahamas ,16,6,2,6,14,0,0,0,0,0,16,6,2,6,14
 Bahrain ,9,1,1,1,3,0,0,0,0,0,9,1,1,1,3
 Barbados  ,12,0,0,1,1,0,0,0,0,0,12,0,0,1,1
 Belarus ,6,12,27,38,77,6,6,4,5,15,12,18,31,43,92
 Belgium ,26,40,53,55,148,20,1,1,3,5,46,41,54,58,153
 Bermuda ,18,0,0,1,1,7,0,0,0,0,25,0,0,1,1
 Bohemia  ,3,0,1,3,4,0,0,0,0,0,3,0,1,3,4
 Botswana ,10,0,1,0,1,0,0,0,0,0,10,0,1,0,1
 Brazil ,22,30,36,63,129,7,0,0,0,0,29,30,36,63,129
 British West Indies  ,1,0,0,2,2,0,0,0,0,0,1,0,0,2,2
 Bulgaria  ,20,51,86,80,217,19,1,2,3,6,39,52,88,83,223
 Burundi ,6,1,1,0,2,0,0,0,0,0,6,1,1,0,2
 Cameroon ,14,3,1,1,5,1,0,0,0,0,15,3,1,1,5
 Canada ,26,63,102,136,301,22,62,56,52,170,48,125,158,188,471
 Chile  ,23,2,7,4,13,16,0,0,0,0,39,2,7,4,13
 China  ,10,224,164,153,541,10,12,22,19,53,20,236,186,172,594
 Colombia ,19,5,8,14,27,1,0,0,0,0,20,5,8,14,27
 Costa Rica ,15,1,1,2,4,6,0,0,0,0,21,1,1,2,4
 Côte d'Ivoire  ,13,1,1,1,3,0,0,0,0,0,13,1,1,1,3
 Croatia ,7,11,10,12,33,7,4,6,1,11,14,15,16,13,44
 Cuba  ,20,77,68,75,220,0,0,0,0,0,20,77,68,75,220
 Cyprus ,10,0,1,0,1,10,0,0,0,0,20,0,1,0,1
 Czech Republic  ,6,15,17,23,55,6,7,9,8,24,12,22,26,31,79
 Czechoslovakia  ,16,49,49,45,143,16,2,8,15,25,32,51,57,60,168
 Denmark  ,27,45,74,75,194,13,0,1,0,1,40,45,75,75,195
 Djibouti  ,8,0,0,1,1,0,0,0,0,0,8,0,0,1,1
 Dominican Republic ,14,3,2,2,7,0,0,0,0,0,14,3,2,2,7
 Ecuador ,14,1,1,0,2,0,0,0,0,0,14,1,1,0,2
 Egypt  ,22,7,9,13,29,1,0,0,0,0,23,7,9,13,29
 Eritrea ,5,0,0,1,1,0,0,0,0,0,5,0,0,1,1
 Estonia ,12,9,9,16,34,9,4,2,1,7,21,13,11,17,41
 Ethiopia ,13,22,10,21,53,2,0,0,0,0,15,22,10,21,53
 Fiji ,14,1,0,0,1,3,0,0,0,0,17,1,0,0,1
 Finland ,25,101,85,117,303,22,42,62,57,161,47,143,147,174,464
 France  ,28,212,241,262,715,22,31,31,47,109,50,243,272,309,824
 Gabon ,10,0,1,0,1,0,0,0,0,0,10,0,1,0,1
 Georgia ,6,8,7,17,32,6,0,0,0,0,12,8,7,17,32
 Germany  ,16,191,192,232,615,11,78,78,53,209,27,269,270,285,824
 United Team of Germany  ,3,28,54,36,118,3,8,6,5,19,6,36,60,41,137
 East Germany  ,5,153,129,127,409,6,39,36,35,110,11,192,165,162,519
 West Germany  ,5,56,67,81,204,6,11,15,13,39,11,67,82,94,243
 Ghana  ,14,0,1,3,4,1,0,0,0,0,15,0,1,3,4
 Great Britain  ,28,263,295,289,847,22,10,4,12,26,50,273,299,301,873
 Greece  ,28,33,43,40,116,18,0,0,0,0,46,33,43,40,116
 Grenada ,9,1,1,0,2,0,0,0,0,0,9,1,1,0,2
 Guatemala ,14,0,1,0,1,1,0,0,0,0,15,0,1,0,1
 Guyana  ,17,0,0,1,1,0,0,0,0,0,17,0,0,1,1
 Haiti  ,15,0,1,1,2,0,0,0,0,0,15,0,1,1,2
 Hong Kong  ,16,1,1,1,3,4,0,0,0,0,20,1,1,1,3
 Hungary ,26,175,147,169,491,22,0,2,4,6,48,175,149,173,497
 Iceland ,20,0,2,2,4,17,0,0,0,0,37,0,2,2,4
 India  ,24,9,7,12,28,9,0,0,0,0,33,9,7,12,28
 Indonesia ,15,7,12,11,30,0,0,0,0,0,15,7,12,11,30
 Iran  ,16,18,21,29,68,10,0,0,0,0,26,18,21,29,68
 Iraq ,14,0,0,1,1,0,0,0,0,0,14,0,0,1,1
 Ireland ,21,9,10,12,31,6,0,0,0,0,27,9,10,12,31
 Israel ,16,1,1,7,9,6,0,0,0,0,22,1,1,7,9
 Italy  ,27,206,178,193,577,22,37,34,43,114,49,243,212,236,691
 Jamaica  ,17,22,33,22,77,7,0,0,0,0,24,22,33,22,77
 Japan ,22,142,135,162,439,20,10,17,18,45,42,152,152,180,484
 Jordan ,10,1,0,0,1,0,0,0,0,0,10,1,0,0,1
 Kazakhstan ,6,14,21,25,60,6,1,3,3,7,12,15,24,28,67
 Kenya ,14,31,38,31,100,3,0,0,0,0,17,31,38,31,100
 Kosovo ,1,1,0,0,1,0,0,0,0,0,1,1,0,0,1
 North Korea ,10,16,15,23,54,8,0,1,1,2,18,16,16,24,56
 South Korea ,17,90,85,89,264,17,26,17,10,53,34,116,102,99,317
 Kuwait ,12,0,0,2,2,0,0,0,0,0,12,0,0,2,2
 Kyrgyzstan ,6,0,1,2,3,6,0,0,0,0,12,0,1,2,3
 Latvia ,11,3,11,5,19,10,0,4,3,7,21,3,15,8,26
 Lebanon ,17,0,2,2,4,16,0,0,0,0,33,0,2,2,4
 Liechtenstein ,17,0,0,0,0,18,2,2,5,9,35,2,2,5,9
 Lithuania ,9,6,7,12,25,8,0,0,0,0,17,6,7,12,25
 Luxembourg  ,23,1,1,0,2,8,0,2,0,2,31,1,3,0,4
 Macedonia ,6,0,0,1,1,5,0,0,0,0,11,0,0,1,1
 Malaysia  ,13,0,7,4,11,0,0,0,0,0,13,0,7,4,11
 Mauritius ,9,0,0,1,1,0,0,0,0,0,9,0,0,1,1
 Mexico ,23,13,24,30,67,8,0,0,0,0,31,13,24,30,67
 Moldova ,6,0,2,4,6,6,0,0,0,0,12,0,2,4,6
 Mongolia ,13,2,10,14,26,13,0,0,0,0,26,2,10,14,26
 Montenegro ,3,0,1,0,1,2,0,0,0,0,5,0,1,0,1
 Morocco ,14,6,5,12,23,6,0,0,0,0,20,6,5,12,23
 Mozambique ,10,1,0,1,2,0,0,0,0,0,10,1,0,1,2
 Namibia ,7,0,4,0,4,0,0,0,0,0,7,0,4,0,4
 Netherlands  ,26,85,92,108,285,20,37,38,35,110,46,122,130,143,395
 Netherlands Antilles  ,13,0,1,0,1,2,0,0,0,0,15,0,1,0,1
 New Zealand  ,23,46,27,44,117,15,0,1,0,1,38,46,28,44,118
 Niger ,12,0,1,1,2,0,0,0,0,0,12,0,1,1,2
 Nigeria ,16,3,9,12,24,0,0,0,0,0,16,3,9,12,24
 Norway  ,25,56,49,47,152,22,118,111,100,329,47,174,160,147,481
 Pakistan ,17,3,3,4,10,2,0,0,0,0,19,3,3,4,10
 Panama ,17,1,0,2,3,0,0,0,0,0,17,1,0,2,3
 Paraguay ,12,0,1,0,1,1,0,0,0,0,13,0,1,0,1
 Peru  ,18,1,3,0,4,2,0,0,0,0,20,1,3,0,4
 Philippines ,21,0,3,7,10,4,0,0,0,0,25,0,3,7,10
 Poland ,21,66,85,131,282,22,6,7,7,20,43,72,92,138,302
 Portugal ,24,4,8,12,24,7,0,0,0,0,31,4,8,12,24
 Puerto Rico ,18,1,2,6,9,6,0,0,0,0,24,1,2,6,9
 Qatar ,9,0,1,4,5,0,0,0,0,0,9,0,1,4,5
 Romania ,21,89,95,122,306,20,0,0,1,1,41,89,95,123,307
 Russia  ,6,147,126,154,427,6,49,40,35,124,12,196,166,189,551
 Russian Empire  ,3,1,4,3,8,0,0,0,0,0,3,1,4,3,8
 Soviet Union  ,9,395,319,296,1,010,9,78,57,59,194,18,473,376,355,1,204
 Unified Team  ,1,45,38,29,112,1,9,6,8,23,2,54,44,37,135
 Saudi Arabia ,11,0,1,2,3,0,0,0,0,0,11,0,1,2,3
 Samoa ,9,0,1,0,1,0,0,0,0,0,9,0,1,0,1
 Senegal ,14,0,1,0,1,5,0,0,0,0,19,0,1,0,1
 Serbia  ,4,3,6,6,15,2,0,0,0,0,6,3,6,6,15
 Serbia and Montenegro  ,3,2,4,3,9,3,0,0,0,0,6,2,4,3,9
 Singapore ,16,1,2,2,5,0,0,0,0,0,16,1,2,2,5
 Slovakia  ,6,9,11,8,28,6,2,2,1,5,12,11,13,9,33
 Slovenia ,7,5,8,10,23,7,2,4,9,15,14,7,12,19,38
 South Africa ,19,25,32,29,86,6,0,0,0,0,25,25,32,29,86
 Spain  ,23,45,63,41,149,19,1,0,1,2,42,46,63,42,151
 Sri Lanka  ,17,0,2,0,2,0,0,0,0,0,17,0,2,0,2
 Sudan ,12,0,1,0,1,0,0,0,0,0,12,0,1,0,1
 Suriname  ,12,1,0,1,2,0,0,0,0,0,12,1,0,1,2
 Sweden  ,27,145,170,179,494,22,50,40,54,144,49,195,210,233,638
 Switzerland ,28,50,75,67,192,22,50,40,48,138,50,100,115,115,330
 Syria ,13,1,1,1,3,0,0,0,0,0,13,1,1,1,3
 Chinese Taipei  ,14,5,7,12,24,11,0,0,0,0,25,5,7,12,24
 Tajikistan ,6,1,1,2,4,4,0,0,0,0,10,1,1,2,4
 Tanzania  ,13,0,2,0,2,0,0,0,0,0,13,0,2,0,2
 Thailand ,16,9,8,14,31,3,0,0,0,0,19,9,8,14,31
 Togo ,10,0,0,1,1,1,0,0,0,0,11,0,0,1,1
 Tonga ,9,0,1,0,1,1,0,0,0,0,10,0,1,0,1
 Trinidad and Tobago  ,17,3,5,11,19,3,0,0,0,0,20,3,5,11,19
 Tunisia ,14,4,2,7,13,0,0,0,0,0,14,4,2,7,13
 Turkey ,22,39,27,28,94,16,0,0,0,0,38,39,27,28,94
 Uganda ,15,2,3,2,7,0,0,0,0,0,15,2,3,2,7
 Ukraine ,6,35,30,55,120,6,2,1,4,7,12,37,31,59,127
 United Arab Emirates ,9,1,0,1,2,0,0,0,0,0,9,1,0,1,2
 United States  ,27,1,022,794,704,2,520,22,96,102,84,282,49,1,118,896,788,2,802
 Uruguay ,21,2,2,6,10,1,0,0,0,0,22,2,2,6,10
 Uzbekistan ,6,9,6,17,32,6,1,0,0,1,12,10,6,17,33
 Venezuela ,18,2,3,10,15,4,0,0,0,0,22,2,3,10,15
 Vietnam ,15,1,3,0,4,0,0,0,0,0,15,1,3,0,4
 Virgin Islands ,12,0,1,0,1,7,0,0,0,0,19,0,1,0,1
 Yugoslavia  ,16,26,29,28,83,14,0,3,1,4,30,26,32,29,87
 Zambia  ,13,0,1,1,2,0,0,0,0,0,13,0,1,1,2
 Zimbabwe  ,13,3,4,1,8,1,0,0,0,0,14,3,4,1,8
 Independent Olympic Athletes  ,3,1,0,1,2,0,0,0,0,0,3,1,0,1,2
 Independent Olympic Participants  ,1,0,1,2,3,0,0,0,0,0,1,0,1,2,3
 Mixed team  ,3,8,5,4,17,0,0,0,0,0,3,8,5,4,17

We see from the cat output that there seems to be a numeric list of columns followed by a bunch of column identifiers. The column identifiers have some odd looking characters in them. This is the unicode numero sign, which means number of. Then we have rows of data, all columns separated. We can read this into a DataFrame by calling the read_csv function of the module. When we look at the DataFrame we see that the first cell has an NaN in it since it's an empty value, and the rows have been automatically indexed for us.



In [144]:

    
df = pd.read_csv('olympics.csv')
df.head()









    Out[144]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
    
  
  
    
      0
      NaN
      № Summer
      01 !
      02 !
      03 !
      Total
      № Winter
      01 !
      02 !
      03 !
      Total
      № Games
      01 !
      02 !
      03 !
      Combined Total
    
    
      1
      Afghanistan
      14
      0
      0
      2
      2
      0
      0
      0
      0
      0
      14
      0
      0
      2
      2
    
    
      2
      Algeria
      13
      5
      4
      8
      17
      3
      0
      0
      0
      0
      16
      5
      4
      8
      17
    
    
      3
      Argentina
      24
      21
      25
      28
      74
      18
      0
      0
      0
      0
      42
      21
      25
      28
      74
    
    
      4
      Armenia
      6
      2
      5
      7
      14
      6
      0
      0
      0
      0
      12
      2
      5
      7
      14

It seems pretty clear that the first row of data in the DataFrame is what we really want to see as the column names. It also seems like the first column in the data is the country name, which we would like to make an index. Read csv has a number of parameters that we can use to indicate to Pandas how rows and columns should be labeled. For instance, we can use the index call to indicate which column should be the index and we can also use the header parameter to indicate which row from the data file should be used as the header.



In [145]:

    
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
df.head()









    Out[145]:






  
    
      
      № Summer
      01 !
      02 !
      03 !
      Total
      № Winter
      01 !.1
      02 !.1
      03 !.1
      Total.1
      № Games
      01 !.2
      02 !.2
      03 !.2
      Combined Total
    
  
  
    
      Afghanistan
      14
      0
      0
      2
      2
      0
      0
      0
      0
      0
      14
      0
      0
      2
      2
    
    
      Algeria
      13
      5
      4
      8
      17
      3
      0
      0
      0
      0
      16
      5
      4
      8
      17
    
    
      Argentina
      24
      21
      25
      28
      74
      18
      0
      0
      0
      0
      42
      21
      25
      28
      74
    
    
      Armenia
      6
      2
      5
      7
      14
      6
      0
      0
      0
      0
      12
      2
      5
      7
      14
    
    
      Australasia
      2
      3
      4
      5
      12
      0
      0
      0
      0
      0
      2
      3
      4
      5
      12

Now this data came from the all time Olympic games medal table on Wikipedia. If we head to the page we could see that instead of running gold, silver and bronze in the pages, these nice little icons with a one, a two, and a three in them In our csv file these were represented with the strings 01 !, 02 !, and so on. We see that the column values are repeated which really isn't good practice. Panda's recognize this in a panda.1 and .2 to make things more unique. But this labeling isn't really as clear as it could be, so we should clean up the data file. We can of course do this just by going and editing the CSV file directly, but we can also set the column names using the Pandas name property. Panda stores a list of all of the columns in the .columns attribute.



In [146]:

    
df.columns









    Out[146]:





Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined Total'],
      dtype='object')

We can change the values of the column names by iterating over this list and calling the rename method of the data frame. Here we just iterate through all of the columns looking to see if they start with a 01, 02, 03 or numeric character. If they do, we can call rename and set the column parameters to a dictionary with the keys being the column we want to replace and the value being the new value we want. Here we'll slice some of the old values in two, since we don't want to lose the unique appended values. We'll also set the ever-important in place parameter to true so Pandas knows to update this data frame directly.



In [125]:

    
df.rename?



In [147]:

    
for col in df.columns:
    if col[:2]=='01': # if the first two letters are '01'
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True) #mapping changes labels
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
df.head()









    Out[147]:






  
    
      
      # Summer
      Gold
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
    
  
  
    
      Afghanistan
      14
      0
      0
      2
      2
      0
      0
      0
      0
      0
      14
      0
      0
      2
      2
    
    
      Algeria
      13
      5
      4
      8
      17
      3
      0
      0
      0
      0
      16
      5
      4
      8
      17
    
    
      Argentina
      24
      21
      25
      28
      74
      18
      0
      0
      0
      0
      42
      21
      25
      28
      74
    
    
      Armenia
      6
      2
      5
      7
      14
      6
      0
      0
      0
      0
      12
      2
      5
      7
      14
    
    
      Australasia
      2
      3
      4
      5
      12
      0
      0
      0
      0
      0
      2
      3
      4
      5
      12

Querying a DataFrame

Boolean masking is the heart of fast and efficient querying in NumPy. It's analogous a bit to masking used in other computational areas. A Boolean mask is an array which can be of one dimension like a series, or two dimensions like a DataFrame, where each of the values in the array are either true or false. This array is essentially overlaid on top of the data structure that we're querying. And any cell aligned with the true value will be admitted into our final result, and any sign aligned with a false value will not. Boolean masking is powerful conceptually and is the cornerstone of efficient NumPy and pandas querying. This technique is well used in other areas of computer science, for instance, in graphics. But it doesn't really have an analogue in other traditional relational databases, so I think it's worth pointing out here. Boolean masks are created by applying operators directly to the pandas series or DataFrame objects. For instance, in our Olympics data set, you might be interested in seeing only those countries who have achieved a gold medal at the summer Olympics. To build a Boolean mask for this query, we project the gold column using the indexing operator and apply the greater than operator with a comparison value of zero. This is essentially broadcasting a comparison operator, greater than, with the results being returned as a Boolean series.



In [149]:

    
df['Gold']>0









    Out[149]:





 Afghanistan                           False
 Algeria                                True
 Argentina                              True
 Armenia                                True
 Australasia                            True
 Australia                              True
 Austria                                True
 Azerbaijan                             True
 Bahamas                                True
 Bahrain                                True
 Barbados                              False
 Belarus                                True
 Belgium                                True
 Bermuda                               False
 Bohemia                               False
 Botswana                              False
 Brazil                                 True
 British West Indies                   False
 Bulgaria                               True
 Burundi                                True
 Cameroon                               True
 Canada                                 True
 Chile                                  True
 China                                  True
 Colombia                               True
 Costa Rica                             True
 Côte d'Ivoire                          True
 Croatia                                True
 Cuba                                   True
 Cyprus                                False
                                       ...  
 Sri Lanka                             False
 Sudan                                 False
 Suriname                               True
 Sweden                                 True
 Switzerland                            True
 Syria                                  True
 Chinese Taipei                         True
 Tajikistan                             True
 Tanzania                              False
 Thailand                               True
 Togo                                  False
 Tonga                                 False
 Trinidad and Tobago                    True
 Tunisia                                True
 Turkey                                 True
 Uganda                                 True
 Ukraine                                True
 United Arab Emirates                   True
 United States                          True
 Uruguay                                True
 Uzbekistan                             True
 Venezuela                              True
 Vietnam                                True
 Virgin Islands                        False
 Yugoslavia                             True
 Zambia                                False
 Zimbabwe                               True
 Independent Olympic Athletes           True
 Independent Olympic Participants      False
 Mixed team                             True
Name: Gold, dtype: bool

The resultant series is indexed where the value of each cell is either true or false depending on whether a country has won at least one gold medal, and the index is the country name.

So this builds us the Boolean mask, which is half the battle. What we want to do next is overlay that mask on the DataFrame. We can do this using the where function. The where function takes a Boolean mask as a condition, applies it to the DataFrame or series, and returns a new DataFrame or series of the same shape. Let's apply this Boolean mask to our Olympics data and create a DataFrame of only those countries who have won a gold at a summer games.



In [151]:

    
only_gold = df.where(df['Gold']>0)
only_gold.head()









    Out[151]:






  
    
      
      # Summer
      Gold
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
    
  
  
    
      Afghanistan
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      Algeria
      13.0
      5.0
      4.0
      8.0
      17.0
      3.0
      0.0
      0.0
      0.0
      0.0
      16.0
      5.0
      4.0
      8.0
      17.0
    
    
      Argentina
      24.0
      21.0
      25.0
      28.0
      74.0
      18.0
      0.0
      0.0
      0.0
      0.0
      42.0
      21.0
      25.0
      28.0
      74.0
    
    
      Armenia
      6.0
      2.0
      5.0
      7.0
      14.0
      6.0
      0.0
      0.0
      0.0
      0.0
      12.0
      2.0
      5.0
      7.0
      14.0
    
    
      Australasia
      2.0
      3.0
      4.0
      5.0
      12.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2.0
      3.0
      4.0
      5.0
      12.0

We see that the resulting DataFrame keeps the original indexed values, and only data from countries that met the condition are retained. All of the countries which did not meet the condition have NaN data instead. This is okay. Most statistical functions built into the DataFrame object ignore values of NaN.



In [153]:

    
df['Gold'].count()









    Out[153]:





151



In [154]:

    
only_gold['Gold'].count()









    Out[154]:





109



In [156]:

    
only_gold = only_gold.dropna()
only_gold.head()









    Out[156]:






  
    
      
      # Summer
      Gold
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
    
  
  
    
      Algeria
      13.0
      5.0
      4.0
      8.0
      17.0
      3.0
      0.0
      0.0
      0.0
      0.0
      16.0
      5.0
      4.0
      8.0
      17.0
    
    
      Argentina
      24.0
      21.0
      25.0
      28.0
      74.0
      18.0
      0.0
      0.0
      0.0
      0.0
      42.0
      21.0
      25.0
      28.0
      74.0
    
    
      Armenia
      6.0
      2.0
      5.0
      7.0
      14.0
      6.0
      0.0
      0.0
      0.0
      0.0
      12.0
      2.0
      5.0
      7.0
      14.0
    
    
      Australasia
      2.0
      3.0
      4.0
      5.0
      12.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2.0
      3.0
      4.0
      5.0
      12.0
    
    
      Australia
      26.0
      147.0
      163.0
      187.0
      497.0
      18.0
      5.0
      3.0
      4.0
      12.0
      44.0
      152.0
      166.0
      191.0
      509.0

Often we want to drop those rows which have no data. To do this, we can use the drop NA function. You can optionally provide drop NA the axis it should be considering. Remember that the axis is just an indicator for the columns or rows and that the default is zero, which means rows.

When you find yourself talking about pandas and saying phrases like, often I want to, it's quite likely the developers have included a shortcut for this common operation. For instance, in this example, we don't actually have to use the where function explicitly. The pandas developers allow the indexing operator to take a Boolean mask as a value instead of just a list of column names.



In [158]:

    
only_gold = df[df['Gold']>0]
only_gold.head()









    Out[158]:






  
    
      
      # Summer
      Gold
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
    
  
  
    
      Algeria
      13
      5
      4
      8
      17
      3
      0
      0
      0
      0
      16
      5
      4
      8
      17
    
    
      Argentina
      24
      21
      25
      28
      74
      18
      0
      0
      0
      0
      42
      21
      25
      28
      74
    
    
      Armenia
      6
      2
      5
      7
      14
      6
      0
      0
      0
      0
      12
      2
      5
      7
      14
    
    
      Australasia
      2
      3
      4
      5
      12
      0
      0
      0
      0
      0
      2
      3
      4
      5
      12
    
    
      Australia
      26
      147
      163
      187
      497
      18
      5
      3
      4
      12
      44
      152
      166
      191
      509



In [159]:

    
#To get the no of countries who recieved at least one gold in Summer or Winter Olympics
len(df[(df['Gold']>0) | df['Gold.1']>0])









    Out[159]:





110



In [163]:

    
#Are there any countries which won a gold in winter olympics but never in summer olympics
df[(df['Gold']==0) & (df['Gold.1']>0)]









    Out[163]:






  
    
      
      # Summer
      Gold
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
    
  
  
    
      Liechtenstein
      17
      0
      0
      0
      0
      18
      2
      2
      5
      9
      35
      2
      2
      5
      9

Extremely important, and often an issue for new users, is to remember that each Boolean mask needs to be encased in parenthesis because of the order of operations. This can cause no end of frustration if you're not used to it, so be careful.

Indexing DataFrames

The index is essentially a row level label, and we know that rows correspond to axis zero. In our Olympics data, we indexed the data frame by the name of the country. Indices can either be inferred, such as when we create a new series without an index, in which case we get numeric values, or they can be set explicitly, like when we use the dictionary object to create the series, or when we loaded data from the CSV file and specified the header. Another option for setting an index is to use the set_index function. This function takes a list of columns and promotes those columns to an index. Set index is a destructive process, it doesn't keep the current index. If you want to keep the current index, you need to manually create a new column and copy into it values from the index attribute. Let's go back to our Olympics DataFrame. Let's say that we don't want to index the DataFrame by countries, but instead want to index by the number of gold medals that were won at summer games. First we need to preserve the country information into a new column. We can do this using the indexing operator or the string that has the column label. Then we can use the set_index to set index of the column to summer gold medal wins.



In [164]:

    
df['country'] = df.index
df = df.set_index('Gold')
df.head()









    Out[164]:






  
    
      
      # Summer
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
      country
    
    
      Gold
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      0
      14
      0
      2
      2
      0
      0
      0
      0
      0
      14
      0
      0
      2
      2
      Afghanistan
    
    
      5
      13
      4
      8
      17
      3
      0
      0
      0
      0
      16
      5
      4
      8
      17
      Algeria
    
    
      21
      24
      25
      28
      74
      18
      0
      0
      0
      0
      42
      21
      25
      28
      74
      Argentina
    
    
      2
      6
      5
      7
      14
      6
      0
      0
      0
      0
      12
      2
      5
      7
      14
      Armenia
    
    
      3
      2
      4
      5
      12
      0
      0
      0
      0
      0
      2
      3
      4
      5
      12
      Australasia

You'll see that when we create a new index from an existing column it appears that a new first row has been added with empty values. This isn't quite what's happening. And we know this in part because an empty value is actually rendered either as a none or an NaN if the data type of the column is numeric. What's actually happened is that the index has a name. Whatever the column name was in the Jupiter notebook has just provided this in the output. We can get rid of the index completely by calling the function reset_index. This promotes the index into a column and creates a default numbered index.



In [165]:

    
df = df.reset_index()
df.head()









    Out[165]:






  
    
      
      Gold
      # Summer
      Silver
      Bronze
      Total
      # Winter
      Gold.1
      Silver.1
      Bronze.1
      Total.1
      # Games
      Gold.2
      Silver.2
      Bronze.2
      Combined Total
      country
    
  
  
    
      0
      0
      14
      0
      2
      2
      0
      0
      0
      0
      0
      14
      0
      0
      2
      2
      Afghanistan
    
    
      1
      5
      13
      4
      8
      17
      3
      0
      0
      0
      0
      16
      5
      4
      8
      17
      Algeria
    
    
      2
      21
      24
      25
      28
      74
      18
      0
      0
      0
      0
      42
      21
      25
      28
      74
      Argentina
    
    
      3
      2
      6
      5
      7
      14
      6
      0
      0
      0
      0
      12
      2
      5
      7
      14
      Armenia
    
    
      4
      3
      2
      4
      5
      12
      0
      0
      0
      0
      0
      2
      3
      4
      5
      12
      Australasia

One nice feature of pandas is that it has the option to do multi-level indexing. This is similar to composite keys in relational database systems. To create a multi-level index, we simply call set index and give it a list of columns that we're interested in promoting to an index.

Pandas will search through these in order, finding the distinct data and forming composite indices. A good example of this is often found when dealing with geographical data which is sorted by regions or demographics. Let's change data sets and look at some census data for a better example. This data is stored in the file census.csv and comes from the United States Census Bureau. In particular, this is a breakdown of the population level data at the US county level. It's a great example of how different kinds of data sets might be formatted when you're trying to clean them. For instance, in this data set there are two summarized levels, one that contains summary data for the whole country. And one that contains summary data for each state, and one that contains summary data for each county.



In [169]:

    
df = pd.read_csv('census.csv')
df.head()









    Out[169]:






  
    
      
      SUMLEV
      REGION
      DIVISION
      STATE
      COUNTY
      STNAME
      CTYNAME
      CENSUS2010POP
      ESTIMATESBASE2010
      POPESTIMATE2010
      ...
      RDOMESTICMIG2011
      RDOMESTICMIG2012
      RDOMESTICMIG2013
      RDOMESTICMIG2014
      RDOMESTICMIG2015
      RNETMIG2011
      RNETMIG2012
      RNETMIG2013
      RNETMIG2014
      RNETMIG2015
    
  
  
    
      0
      40
      3
      6
      1
      0
      Alabama
      Alabama
      4779736
      4780127
      4785161
      ...
      0.002295
      -0.193196
      0.381066
      0.582002
      -0.467369
      1.030015
      0.826644
      1.383282
      1.724718
      0.712594
    
    
      1
      50
      3
      6
      1
      1
      Alabama
      Autauga County
      54571
      54571
      54660
      ...
      7.242091
      -2.915927
      -3.012349
      2.265971
      -2.530799
      7.606016
      -2.626146
      -2.722002
      2.592270
      -2.187333
    
    
      2
      50
      3
      6
      1
      3
      Alabama
      Baldwin County
      182265
      182265
      183193
      ...
      14.832960
      17.647293
      21.845705
      19.243287
      17.197872
      15.844176
      18.559627
      22.727626
      20.317142
      18.293499
    
    
      3
      50
      3
      6
      1
      5
      Alabama
      Barbour County
      27457
      27457
      27341
      ...
      -4.728132
      -2.500690
      -7.056824
      -3.904217
      -10.543299
      -4.874741
      -2.758113
      -7.167664
      -3.978583
      -10.543299
    
    
      4
      50
      3
      6
      1
      7
      Alabama
      Bibb County
      22915
      22919
      22861
      ...
      -5.527043
      -5.068871
      -6.201001
      -0.177537
      0.177258
      -5.088389
      -4.363636
      -5.403729
      0.754533
      1.107861
    
  

5 rows × 100 columns

I often find that I want to see a list of all the unique values in a given column. In this DataFrame, we see that the possible values for the sum level are using the unique function on the DataFrame. This is similar to the SQL distinct operator. Here we can run unique on the sum level of our current DataFrame and see that there are only two different values, 40 and 50.



In [170]:

    
df['SUMLEV'].unique() #40 belongs to state level data and 50 belongs to county level data









    Out[170]:





array([40, 50])

Let's get rid of all of the rows that are summaries at the state level and just keep the county data.



In [171]:

    
df = df[df['SUMLEV']==50]
df.head()









    Out[171]:






  
    
      
      SUMLEV
      REGION
      DIVISION
      STATE
      COUNTY
      STNAME
      CTYNAME
      CENSUS2010POP
      ESTIMATESBASE2010
      POPESTIMATE2010
      ...
      RDOMESTICMIG2011
      RDOMESTICMIG2012
      RDOMESTICMIG2013
      RDOMESTICMIG2014
      RDOMESTICMIG2015
      RNETMIG2011
      RNETMIG2012
      RNETMIG2013
      RNETMIG2014
      RNETMIG2015
    
  
  
    
      1
      50
      3
      6
      1
      1
      Alabama
      Autauga County
      54571
      54571
      54660
      ...
      7.242091
      -2.915927
      -3.012349
      2.265971
      -2.530799
      7.606016
      -2.626146
      -2.722002
      2.592270
      -2.187333
    
    
      2
      50
      3
      6
      1
      3
      Alabama
      Baldwin County
      182265
      182265
      183193
      ...
      14.832960
      17.647293
      21.845705
      19.243287
      17.197872
      15.844176
      18.559627
      22.727626
      20.317142
      18.293499
    
    
      3
      50
      3
      6
      1
      5
      Alabama
      Barbour County
      27457
      27457
      27341
      ...
      -4.728132
      -2.500690
      -7.056824
      -3.904217
      -10.543299
      -4.874741
      -2.758113
      -7.167664
      -3.978583
      -10.543299
    
    
      4
      50
      3
      6
      1
      7
      Alabama
      Bibb County
      22915
      22919
      22861
      ...
      -5.527043
      -5.068871
      -6.201001
      -0.177537
      0.177258
      -5.088389
      -4.363636
      -5.403729
      0.754533
      1.107861
    
    
      5
      50
      3
      6
      1
      9
      Alabama
      Blount County
      57322
      57322
      57373
      ...
      1.807375
      -1.177622
      -1.748766
      -2.062535
      -1.369970
      1.859511
      -0.848580
      -1.402476
      -1.577232
      -0.884411
    
  

5 rows × 100 columns

Also while this data set is interesting for a number of different reasons, let's reduce the data that we're going to look at to just the total population estimates and the total number of births. We can do this by creating a list of column names that we want to keep then project those and assign the resulting DataFrame to our df variable.



In [173]:

    
columns_to_keep = ['STNAME',
                   'CTYNAME',
                   'BIRTHS2010',
                   'BIRTHS2011',
                   'BIRTHS2012',
                   'BIRTHS2013',
                   'BIRTHS2014',
                   'BIRTHS2015',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015'
                  ]
df = df[columns_to_keep]
df.head()









    Out[173]:






  
    
      
      STNAME
      CTYNAME
      BIRTHS2010
      BIRTHS2011
      BIRTHS2012
      BIRTHS2013
      BIRTHS2014
      BIRTHS2015
      POPESTIMATE2010
      POPESTIMATE2011
      POPESTIMATE2012
      POPESTIMATE2013
      POPESTIMATE2014
      POPESTIMATE2015
    
  
  
    
      1
      Alabama
      Autauga County
      151
      636
      615
      574
      623
      600
      54660
      55253
      55175
      55038
      55290
      55347
    
    
      2
      Alabama
      Baldwin County
      517
      2187
      2092
      2160
      2186
      2240
      183193
      186659
      190396
      195126
      199713
      203709
    
    
      3
      Alabama
      Barbour County
      70
      335
      300
      283
      260
      269
      27341
      27226
      27159
      26973
      26815
      26489
    
    
      4
      Alabama
      Bibb County
      44
      266
      245
      259
      247
      253
      22861
      22733
      22642
      22512
      22549
      22583
    
    
      5
      Alabama
      Blount County
      183
      744
      710
      646
      618
      603
      57373
      57711
      57776
      57734
      57658
      57673

The US Census data breaks down estimates of population data by state and county. We can load the data and set the index to be a combination of the state and county values and see how pandas handles it in a DataFrame. We do this by creating a list of the column identifiers we want to have indexed. And then calling set index with this list and assigning the output as appropriate. We see here that we have a dual index, first the state name and then the county name.



In [174]:

    
df = df.set_index(['STNAME','CTYNAME'])
df.head()









    Out[174]:






  
    
      
      
      BIRTHS2010
      BIRTHS2011
      BIRTHS2012
      BIRTHS2013
      BIRTHS2014
      BIRTHS2015
      POPESTIMATE2010
      POPESTIMATE2011
      POPESTIMATE2012
      POPESTIMATE2013
      POPESTIMATE2014
      POPESTIMATE2015
    
    
      STNAME
      CTYNAME
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Alabama
      Autauga County
      151
      636
      615
      574
      623
      600
      54660
      55253
      55175
      55038
      55290
      55347
    
    
      Baldwin County
      517
      2187
      2092
      2160
      2186
      2240
      183193
      186659
      190396
      195126
      199713
      203709
    
    
      Barbour County
      70
      335
      300
      283
      260
      269
      27341
      27226
      27159
      26973
      26815
      26489
    
    
      Bibb County
      44
      266
      245
      259
      247
      253
      22861
      22733
      22642
      22512
      22549
      22583
    
    
      Blount County
      183
      744
      710
      646
      618
      603
      57373
      57711
      57776
      57734
      57658
      57673

An immediate question which comes up is how we can query this DataFrame. For instance, we saw previously that the loc attribute of the DataFrame can take multiple arguments. And it could query both the row and the columns. When you use a MultiIndex, you must provide the arguments in order by the level you wish to query. Inside of the index, each column is called a level and the outermost column is level zero. For instance, if we want to see the population results from Washtenaw County, you'd want to the first argument as the state of Michigan.



In [175]:

    
df.loc['Michigan', 'Washtenaw County']









    Out[175]:





BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

You might be interested in just comparing two counties. For instance, Washtenaw and Wayne County which covers Detroit. To do this, we can pass the loc method, a list of tuples which describe the indices we wish to query. Since we have a MultiIndex of two values, the state and the county, we need to provide two values as each element of our filtering list.



In [176]:

    
df.loc[[('Michigan','Washtenaw County'),('Michigan','Wayne County')]]









    Out[176]:






  
    
      
      
      BIRTHS2010
      BIRTHS2011
      BIRTHS2012
      BIRTHS2013
      BIRTHS2014
      BIRTHS2015
      POPESTIMATE2010
      POPESTIMATE2011
      POPESTIMATE2012
      POPESTIMATE2013
      POPESTIMATE2014
      POPESTIMATE2015
    
    
      STNAME
      CTYNAME
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Michigan
      Washtenaw County
      977
      3826
      3780
      3662
      3683
      3709
      345563
      349048
      351213
      354289
      357029
      358880
    
    
      Wayne County
      5918
      23819
      23270
      23377
      23607
      23586
      1815199
      1801273
      1792514
      1775713
      1766008
      1759335

Okay so that's how hierarchical indices work in a nutshell. They're a special part of the pandas library which I think can make management and reasoning about data easier. Of course hierarchical labeling isn't just for rows. For example, you can transpose this matrix and now have hierarchical column labels. And projecting a single column which has these labels works exactly the way you would expect it to.

Question



In [204]:

    
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df









    Out[204]:






  
    
      
      Cost
      Item Purchased
      Name
    
  
  
    
      Store 1
      22.5
      Dog Food
      Chris
    
    
      Store 1
      2.5
      Kitty Litter
      Kevyn
    
    
      Store 2
      5.0
      Bird Seed
      Vinod

Reindex the purchase records DataFrame to be indexed hierarchically, first by store, then by person. Name the indexes 'Location' and 'Name'. Then add a new entry to it with the value of: Name: 'Kevyn', Item Purchased:'Kitty Food', 'Cost':3.00, Location:'Store 2'.



In [205]:

    
df = df.set_index([df.index, 'Name'])
df.index.names = ['Location', 'Name']
df = df.append(pd.Series(data={'Cost': 3.00, 'Item Purchased': 'Kitty Food'}, name=('Store 2', 'Kevyn')))
df









    Out[205]:






  
    
      
      
      Cost
      Item Purchased
    
    
      Location
      Name
      
      
    
  
  
    
      Store 1
      Chris
      22.5
      Dog Food
    
    
      Kevyn
      2.5
      Kitty Litter
    
    
      Store 2
      Vinod
      5.0
      Bird Seed
    
    
      Kevyn
      3.0
      Kitty Food

If we want we can also reset the index as columns as follows:



In [206]:

    
df.reset_index()









    Out[206]:






  
    
      
      Location
      Name
      Cost
      Item Purchased
    
  
  
    
      0
      Store 1
      Chris
      22.5
      Dog Food
    
    
      1
      Store 1
      Kevyn
      2.5
      Kitty Litter
    
    
      2
      Store 2
      Vinod
      5.0
      Bird Seed
    
    
      3
      Store 2
      Kevyn
      3.0
      Kitty Food

	Cost	Item purchased	Name
Store1	22.5	Dog Food	Kasi
Store1	21.5	Cat Food	Pradeep
Store2	5.5	Bird Food	Sri

	Store1	Store1	Store2
Cost	22.5	21.5	5.5
Item purchased	Dog Food	Cat Food	Bird Food
Name	Kasi	Pradeep	Sri

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
0	NaN	№ Summer	01 !	02 !	03 !	Total	№ Winter	01 !	02 !	03 !	Total	№ Games	01 !	02 !	03 !	Combined Total
1	Afghanistan	14	0	0	2	2	0	0	0	0	0	14	0	0	2	2
2	Algeria	13	5	4	8	17	3	0	0	0	0	16	5	4	8	17
3	Argentina	24	21	25	28	74	18	0	0	0	0	42	21	25	28	74
4	Armenia	6	2	5	7	14	6	0	0	0	0	12	2	5	7	14

	# Summer	Gold	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined Total
Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Algeria	13.0	5.0	4.0	8.0	17.0	3.0	0.0	0.0	0.0	0.0	16.0	5.0	4.0	8.0	17.0
Argentina	24.0	21.0	25.0	28.0	74.0	18.0	0.0	0.0	0.0	0.0	42.0	21.0	25.0	28.0	74.0
Armenia	6.0	2.0	5.0	7.0	14.0	6.0	0.0	0.0	0.0	0.0	12.0	2.0	5.0	7.0	14.0
Australasia	2.0	3.0	4.0	5.0	12.0	0.0	0.0	0.0	0.0	0.0	2.0	3.0	4.0	5.0	12.0

	# Summer	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined Total	country
Gold
0	14	0	2	2	0	0	0	0	0	14	0	0	2	2	Afghanistan
5	13	4	8	17	3	0	0	0	0	16	5	4	8	17	Algeria
21	24	25	28	74	18	0	0	0	0	42	21	25	28	74	Argentina
2	6	5	7	14	6	0	0	0	0	12	2	5	7	14	Armenia
3	2	4	5	12	0	0	0	0	0	2	3	4	5	12	Australasia

	SUMLEV	REGION	DIVISION	STATE	COUNTY	STNAME	CTYNAME	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	...	RDOMESTICMIG2011	RDOMESTICMIG2012	RDOMESTICMIG2013	RDOMESTICMIG2014	RDOMESTICMIG2015	RNETMIG2011	RNETMIG2012	RNETMIG2013	RNETMIG2014	RNETMIG2015
0	40	3	6	1	0	Alabama	Alabama	4779736	4780127	4785161	...	0.002295	-0.193196	0.381066	0.582002	-0.467369	1.030015	0.826644	1.383282	1.724718	0.712594
1	50	3	6	1	1	Alabama	Autauga County	54571	54571	54660	...	7.242091	-2.915927	-3.012349	2.265971	-2.530799	7.606016	-2.626146	-2.722002	2.592270	-2.187333
2	50	3	6	1	3	Alabama	Baldwin County	182265	182265	183193	...	14.832960	17.647293	21.845705	19.243287	17.197872	15.844176	18.559627	22.727626	20.317142	18.293499
3	50	3	6	1	5	Alabama	Barbour County	27457	27457	27341	...	-4.728132	-2.500690	-7.056824	-3.904217	-10.543299	-4.874741	-2.758113	-7.167664	-3.978583	-10.543299
4	50	3	6	1	7	Alabama	Bibb County	22915	22919	22861	...	-5.527043	-5.068871	-6.201001	-0.177537	0.177258	-5.088389	-4.363636	-5.403729	0.754533	1.107861

		BIRTHS2010	BIRTHS2011	BIRTHS2012	BIRTHS2013	BIRTHS2014	BIRTHS2015	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
STNAME	CTYNAME
Alabama	Autauga County	151	636	615	574	623	600	54660	55253	55175	55038	55290	55347
	Baldwin County	517	2187	2092	2160	2186	2240	183193	186659	190396	195126	199713	203709
	Barbour County	70	335	300	283	260	269	27341	27226	27159	26973	26815	26489
	Bibb County	44	266	245	259	247	253	22861	22733	22642	22512	22549	22583
	Blount County	183	744	710	646	618	603	57373	57711	57776	57734	57658	57673

		BIRTHS2010	BIRTHS2011	BIRTHS2012	BIRTHS2013	BIRTHS2014	BIRTHS2015	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
STNAME	CTYNAME
Michigan	Washtenaw County	977	3826	3780	3662	3683	3709	345563	349048	351213	354289	357029	358880
Michigan	Wayne County	5918	23819	23270	23377	23607	23586	1815199	1801273	1792514	1775713	1766008	1759335

	Cost	Item Purchased	Name
Store 1	22.5	Dog Food	Chris
Store 1	2.5	Kitty Litter	Kevyn
Store 2	5.0	Bird Seed	Vinod

		Cost	Item Purchased
Location	Name
Store 1	Chris	22.5	Dog Food
Store 1	Kevyn	2.5	Kitty Litter
Store 2	Vinod	5.0	Bird Seed
Store 2	Kevyn	3.0	Kitty Food