Title: Data Analysis with Python: Overview of Pandas
Author: Fermín Huarte Larrañaga
Created: 2015
Version: 2.0
Date: June 2017

Bibliography

This IPython Notebook is based almost completely on:

  • "Python for Data Analysis" by Wes McKinney, Ed. O'Reilly, 2012.

Data Analysis with Python: Overview of Pandas

The aim of this session is to have a first experience with tools that should allow you to manipulate, process, clean, and crunch data using Python. By "data" we are referring to structured data such as:

  • multidimensional arrays
  • tabular or spreadsheet-like data
  • time series (not necessarily evenly spaced!)
  • multiple data sets related by key columns

Why Python when analyzing data?

There is a considerable number of alternatives when it comes to analyzing large sets of data, such as R, MATLAB, SAS, Stata, and others. Efficient as they may be, they are often restricted to a small area of application. The versatility of Python and the growing community of Python users in the scientific domain have brought a remarkable improvement in its library support in recent years, making it a strong competitor for data manipulation tasks. Added to Python's strength as a general-purpose programming language, this makes it an excellent choice as a single platform on which to develop a data analysis application.

Essential libraries we will be using:

  • NumPy
  • pandas (new!)
  • matplotlib
  • IPython
  • SciPy

Whetting our appetite

Before learning the basics of data analysis with pandas, we will emulate the author of the pandas module and start by running a not-so-simple example. Do not try to fully understand the instructions given in the next cells; they will be introduced along the session. Relax, try to follow their logic, and enjoy! ;-)

Data from a URL shortening service

In 2011, the URL shortening service bit.ly partnered with the US government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending with .gov or .mil. This data (updated daily with hourly snapshots) can be downloaded as text files. Each line in the hourly snapshot data file contains a record in JSON (JavaScript Object Notation) form.

We will work with this data using, first, standard built-in Python, then the collections module and, finally, pandas.

Standard Python

The following lines will open the file and display its first line. Please download this data file and run the cell.


In [ ]:
data_file = 'usagov_bitly_data2012-03-16-1331923249.txt'
f = open(data_file)  # avoid 'file' as a variable name; it shadows a Python 2 built-in
f.readline()

Now, instead of opening it as a simple text file, we will load the lines of the JSON file into dictionary objects. Let us read the data set using the json (JavaScript Object Notation) module (we will not cover this topic):


In [ ]:
import json
data_file = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(data_file)]

Thanks to the json module, the variable records is now a list of dictionaries, imported from the JSON records.


In [ ]:
print("Variable records is {} and its elements are {}.".format(type(records),type(records[0])))

Let us have a look at the first element:


In [ ]:
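# One possible solution (a sketch): the first record of the list
records[0]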

As said before, this is a typical dictionary structure with keys and values. Find out the keys:


In [ ]:
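# One possible solution: the keys of the first dictionary
records[0].keys()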

Find out the value for the key 'tz' in this first record:


In [ ]:
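# One possible solution:
records[0]['tz']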

In this case (and we were not supposed to know this) 'tz' stands for time zone. Suppose we are interested in identifying the most commonly found time zones in the data set we just imported. Surely each one of you will find a different way to work it out. First, we want to obtain a list of all time zones found in the records; name the list list_of_timezones:


In [ ]:
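# One possible solution: not every record has a 'tz' field, so we keep only those that do
list_of_timezones = [rec['tz'] for rec in records if 'tz' in rec]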

Check the length of list_of_timezones and, for instance, its first ten elements:


In [ ]:
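# One possible solution:
print(len(list_of_timezones))
list_of_timezones[:10]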

Try to think of an algorithm to count the occurrences of the different time zones (including the blank field ''). Hint: you might want to use a dictionary to store the occurrences. (If you can't solve it, follow this link for a possible solution.)


In [ ]:
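# One possible solution: a plain dictionary mapping each time zone to its number of occurrences
counter = {}
for tz in list_of_timezones:
    if tz in counter:
        counter[tz] += 1
    else:
        counter[tz] = 1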

  • How often does 'America/Sao_Paulo' appear?

In [ ]:
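# One possible solution:
counter['America/Sao_Paulo']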

  • How many different timezones are there?

In [ ]:
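# One possible solution: one dictionary key per distinct time zone
len(counter)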




Collections module

The Python standard library provides the collections module, which contains the collections.Counter class. This does the job we just did, but in a nicer way:


In [ ]:
import collections
print("First counter is of ", type(counter))
counter = collections.Counter(counter)  # create a Counter instance from our counter dictionary
print("Now counter is of ", type(counter))
# The Counter class provides new, useful functionality
counter.most_common(10)

The pandas alternative

Now, let us do the same work using pandas. The main pandas data structure is the DataFrame. It can be seen as a representation of a table or spreadsheet of data. First, we will create the DataFrame from the original data file:


In [ ]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
myframe = DataFrame(records)
myframe

myframe is now a DataFrame, a class introduced by pandas to efficiently work with structured data. Check the type:


In [ ]:
type(myframe)

The DataFrame object is composed of Series (another pandas object). They can be seen as the columns of a spreadsheet; for instance, myframe['tz']:


In [ ]:
type(myframe['tz'])

Check the time zones ('tz') in the first ten records of myframe.


In [ ]:
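# One possible solution:
myframe['tz'][:10]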

The Series object has a useful method, value_counts:


In [ ]:
tz_counter = myframe['tz'].value_counts()

In one line of code, all the timezones are grouped and accounted for. Check the result and get the top 5 time zones:


In [ ]:
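# One possible solution: value_counts returns the counts in descending order
tz_counter[:5]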

Far fewer lines of work, right? As we have said repeatedly, there is no need to reinvent the wheel. Probably someone out there solved your problem before you ran into it and, unless you are really, really good, that solution is probably better than yours! ;-)

Next, we might want to plot the data using the matplotlib library.


In [ ]:
# This line configures matplotlib to show figures embedded in the notebook, 
# instead of opening a new window for each figure.
%matplotlib inline

pandas can call matplotlib directly, without invoking the module explicitly. We will make a horizontal bar plot:


In [ ]:
tz_counter[:10].plot(kind='barh', rot=0)

It is kind of odd to realize that the second most popular time zone has a blank label. The DataFrame object of pandas has a fillna method that can replace missing (NA) values or empty strings:


In [ ]:
# First we generate a new Series from a column of myframe; where a value is NaN, insert the word 'Missing'
clean_tz = myframe['tz'].fillna('Missing')
# In this new Series, replace EMPTY STRINGS with the word 'Unknown'. Try to understand BOOLEAN INDEXING
clean_tz[clean_tz == ''] = 'Unknown'
# Use the method VALUE_COUNTS to generate a new Series containing time zones and occurrences
tz_counter = clean_tz.value_counts()
# Finally, plot the top ten values
tz_counter[:10].plot(kind='barh', rot=0)

Let's complicate the example a bit more. The 'a' field in the data set contains information on the browser used to perform the URL shortening. For example, check the content of the 'a' field in the first record of myframe:


In [ ]:
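# One possible solution:
myframe['a'][0]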

Let's generate a Series from the data set containing all the browser data. Try to understand the following line; a good strategy might be to work it out piece by piece:

myframe.a
myframe.a.dropna()

In [ ]:
browser = Series([x.split()[0] for x in myframe.a.dropna()])

As we did with the time zones, we can use the value_counts method on the browser Series to see the most common browsers:


In [ ]:
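# One possible solution: the ten most common browsers
browser.value_counts()[:10]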

Let's decompose the top time zones into people using Windows and people not using Windows. This piece of code requires some more knowledge of pandas; skip the details for now:


In [ ]:
cframe = myframe[myframe.a.notnull()]
# name the array 'op_sys' rather than 'os' to avoid shadowing the standard os module
op_sys = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')
tz_and_os = cframe.groupby(['tz', op_sys])
agg_counter = tz_and_os.size().unstack().fillna(0)
agg_counter[:10]

Let's select the top overall time zones. To do so, we construct an indirect index array from the row counts in agg_counter:


In [ ]:
# argsort returns the indices that would sort the row totals in ascending order
indexer = agg_counter.sum(1).argsort()
indexer[:10]

Next, use take to select the rows in that order and then slice off the last 10 rows:


In [ ]:
count_subset = agg_counter.take(indexer)[-10:]

In [ ]:
count_subset

In [ ]:
count_subset.plot(kind='barh', stacked=True)

Same plot, but with percentages instead of absolute numbers:


In [ ]:
subset_normalized = count_subset.div(count_subset.sum(1), axis=0)
subset_normalized.plot(kind='barh', stacked=True)

After this example, we will go through the basics of pandas.


DataFrame and Series

The two basic Data Structures introduced by pandas are DataFrame and Series.


In [ ]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np  # needed below, e.g. for np.pi

Series

A Series is a one-dimensional array-like object containing an array of data (any NumPy data type is fine) and an associated array of data labels, called its index. The simplest Series one can think of would be formed only by an array of data:


In [ ]:
o1 = Series([-4, 7, 11, 13, -22])
o1

Notice that the representation of the Series shows the index on the left and the values on the right. No index was specified when the Series was created, so a default one has been assigned: integers from 0 to N-1 (N being the length of the data array).


In [ ]:
o1.values

In [ ]:
o1.index

If we need to specify the index:


In [ ]:
o2 = Series([7, 0.2, 11.3, -5], index=['d','e','a','z'])

In [ ]:
o2

In [ ]:
o2.index

Unlike with NumPy arrays, we can use the values of the index when selecting single values from a set of values:


In [ ]:
o2['e']

The following is also equivalent:


In [ ]:
o2.e

Values corresponding to two indices (notice the double square brackets!):


In [ ]:
o2[['z','d']]

NumPy array operations, such as masking using a boolean array, scalar broadcasting, or applying mathematical functions, preserve the index-value link:


In [ ]:
o2[o2 > 0]  # filter positive elements of o2; the indices are preserved. Compare with the same operation on a NumPy array!

In [ ]:
o2*np.pi

Pandas Series have also been described as a fixed-length, ordered dictionary, since a Series actually maps index values to data values. Many functions that expect a dict can be used with a Series:


In [ ]:
'z' in o2

A Python dictionary can be used to create a pandas Series. Here is a list of the top 5 most populated cities (2015) according to Wikipedia:


In [ ]:
pop_data = {'Shanghai': 24150000, 'Karachi': 23500000, 'Beijing': 21516000, 'Tianjin': 14722000, 'Istanbul': 14377000}
print ("pop_data is of type ",type(pop_data))

In [ ]:
ser1 = Series(pop_data)
print("ser1 is of type ",type(ser1))
print("Indices of the Series are: ",ser1.index)
print("Values of the Series are: ",ser1.values)

As you just checked, when passing a dictionary the resulting Series uses the dict keys as its index and sorts the values according to that index.

In the next case we create a Series from a dictionary but selecting the indices we are interested in:


In [ ]:
cities = ['Karachi', 'Istanbul', 'Beijing', 'Moscow']
ser2 = Series(pop_data, index=cities)
ser2

Note that the values found in pop_data have been placed in the appropriate locations. No data was found for 'Moscow', so the value NaN is assigned. This is used in pandas to mark missing or not available (NA) values. To detect missing data in pandas, use isnull and notnull (available both as top-level functions and as Series methods):


In [ ]:
pd.isnull(ser2)

In [ ]:
ser2.isnull()

In [ ]:
ser2.notnull()

An important feature of Series to be highlighted here is that Series are automatically aligned by index label when performing arithmetic operations. It doesn't make much sense to add the population data, but...


In [ ]:
ser1 + ser2

We can assign names to both the Series object and its index using the name attribute:


In [ ]:
ser1.name = 'population'
ser1.index.name = 'city'
ser1

DataFrame

The DataFrame object represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type. The DataFrame has both a row and column index and it can be seen as a dictionary of Series. Under the hood, the data is stored as one or more 2D blocks rather than a list, dict, or some other collection of 1D arrays. See the following example:


In [ ]:
data = {'city': ['Madrid', 'Madrid','Madrid','Barcelona','Barcelona','Sevilla','Sevilla','Girona','Girona','Girona'],
       'year': ['2002', '2006', '2010', '2006', '2010', '2002', '2010', '2002', '2006', '2010'],
       'pop': [5478405, 5953604, 6373532, 5221848, 5488633, 1732697, 1902956, 568690, 668911, 741841]}
pop_frame = DataFrame(data)

In [ ]:
pop_frame

The resulting DataFrame has automatically assigned indices and the columns are sorted. This order can be altered if we specify a sequence of columns:


In [ ]:
DataFrame(data, columns=['year','city','pop'])

You should keep in mind that the indices in a DataFrame are Index objects; they have attributes (such as name, as we will see) and are immutable.

What will happen if we pass a column that is not contained in the data set?


In [ ]:
pop_frame2 = DataFrame(data, columns=['year','city','pop','births'])
pop_frame2

A column in a DataFrame can be retrieved as a Series in two ways:

  • using dict-like notation

In [ ]:
pop_frame2['city']
  • using the DataFrame attribute

In [ ]:
pop_frame2.city

Columns can be modified by assignment. Let's get rid of those NA values in the births column:


In [ ]:
pop_frame2['births'] = 100000
pop_frame2

When assigning lists or arrays to a column, the values' length must match the length of the DataFrame. If we assign a Series, it will instead be conformed exactly to the DataFrame's index, inserting missing values in any holes:


In [ ]:
birth_series = Series([100000, 15000, 98000], index=[0,2,3])

In [ ]:
pop_frame2['births'] = birth_series
pop_frame2

Assigning a column that doesn't exist will result in creating a new column. Columns can be deleted using the del command:


In [ ]:
pop_frame2['Catalunya'] = ((pop_frame2.city == 'Barcelona') | (pop_frame2.city == 'Girona'))
pop_frame2

In [ ]:
del pop_frame2['Catalunya']
pop_frame2.columns

Alternatively, the DataFrame can be built from a nested dict of dicts:


In [ ]:
pop_data = {'Madrid': {'2002': 5478405, '2006': 5953604, '2010': 6373532}, 
            'Barcelona': {'2006': 5221848, '2010': 5488633}, 'Sevilla': {'2002': 1732697, '2010': 1902956}, 
            'Girona': {'2002': 568690, '2006': 668911, '2010': 741841}}
pop_frame3 = DataFrame(pop_data)
pop_frame3

The outer dict keys act as the columns and the inner keys as the unioned row indices. Possible data inputs to construct a DataFrame:

  • 2D NumPy array
  • dict of arrays, lists, or tuples
  • dict of Series
  • dict of dicts
  • list of dicts or Series
  • list of lists or tuples
  • DataFrames
  • NumPy masked array

As in Series, the index and columns in a DataFrame have name attributes:


In [ ]:
pop_frame3.columns.name = 'city'; pop_frame3.index.name = 'year'
pop_frame3

Similarly, the values attribute returns the data contained in the DataFrame as a 2D array:


In [ ]:
pop_frame3.values

Basic functionality

We will not cover all the possible operations using Pandas and the related data structures. We will try to cover some of the basics.

Reindexing

A critical method in pandas is reindex. This means creating a new object with the data of a given structure conformed to a new index. For instance:

  1. Extract the column of pop_frame3 belonging to Barcelona
  2. Check the type of the column, it should be a Series
  3. Find out the indices of the Barcelona Series

In [ ]:
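# One possible solution (a sketch):
pop_bcn = pop_frame3['Barcelona']  # 1. extract the column as a Series
print(type(pop_bcn))               # 2. should be pandas.core.series.Series
pop_bcn.index                      # 3. the indices of the Series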

Call reindex on the Barcelona Series to rearrange the data to the new index ['2010', '2008', '2006', '2004', '2002'] (note the years are strings), following this example:

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [ ]:
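# One possible solution; note that the index labels are strings
pop_bcn2 = pop_bcn.reindex(['2010', '2008', '2006', '2004', '2002'])
pop_bcn2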

The reindex method can be combined with the fill_value= option to fill the non-existing values:

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

In [ ]:
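# One possible solution: fill the non-existing years with 0
pop_bcn2 = pop_bcn.reindex(['2010', '2008', '2006', '2004', '2002'], fill_value=0)
pop_bcn2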

It doesn't make much sense in this case to estimate the non-existing values as zeros. For ordered data such as time series, we can use interpolation or forward/backward filling.


In the case of DataFrames, reindex can alter the (row) index, the columns, or both.


In [ ]:
#Inverting the row indices and adding some more years
years = ['2010', '2008', '2006', '2004', '2002']
pop_frame4 = pop_frame3.reindex(years)

If instead we want to reindex the columns, we need to use the columns keyword:


In [ ]:
cities = ['Madrid', 'Sevilla','Barcelona', 'Girona']
pop_frame4 = pop_frame4.reindex(columns=cities)

Both at once:


In [ ]:
pop_frame4 = pop_frame3.reindex(index = years, columns = cities)

Dropping entries from an axis

From the Barcelona population Series let's get rid of years 2002 and 2008:


In [ ]:
pop_bcn2.drop(['2002', '2008'])

Check that the object was not modified.

In DataFrame, index values can be deleted from both axes. Use the IPython help to find out the use of drop and get rid of all the data related to Madrid and year 2002:


In [ ]:
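# One possible solution: drop the row for year 2002, then the column for Madrid (axis=1)
pop_frame3.drop('2002').drop('Madrid', axis=1)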

Indexing, selection, and filtering

Series

Indexing in a Series works similarly to NumPy array indexing. The main difference is that we can use the Series' index labels instead of integer numbers:


In [ ]:
pop_bcn2.index.name = 'year'

Population in Barcelona in year 2006?


In [ ]:
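# One possible solution:
pop_bcn2['2006']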

Boolean indexing: non-zero data for Barcelona?


In [ ]:
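# One possible solution: a boolean mask preserves the index-value link
pop_bcn2[pop_bcn2 > 0]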

Important: slicing with labels behaves differently from normal Python slicing: the endpoint is inclusive! Give it a try:


In [ ]:
pop_bcn2[:2]

In [ ]:
pop_bcn2['2002':'2006']

In [ ]: