The aim of this session is to have a first experience with tools that allow you to manipulate, process, clean, and crunch data using Python. By "data" we are referring to structured data, such as tabular (spreadsheet-like) data and collections of records like the JSON records we will work with below.
Essential libraries we will be using: NumPy, pandas, and matplotlib.
Before learning the basics of data analysis with pandas, we will emulate the author of the pandas module and start by running a not-so-simple example. Do not try to fully understand the instructions given in the next cells; they will be introduced along the session. Relax, try to follow the logic, and enjoy! ;-)
In 2011, the URL shortening service bit.ly partnered with the US government website usa.gov to provide a feed of anonymous data gathered from users who shorten links ending in .gov or .mil. This data (updated daily with hourly snapshots) can be downloaded as text files. Each line in an hourly snapshot data file contains a record in JSON (JavaScript Object Notation) form.
We will work with this data using, first, standard built-in Python, then the collections module and, finally, pandas.
The following lines will open such a file and display its first line. Please, download this data file and run the cell.
In [ ]:
data_file = 'usagov_bitly_data2012-03-16-1331923249.txt'
file = open(data_file)  # open as a plain text file
file.readline()         # read and display the first line
Now, instead of opening it as a simple text file, we will load each line of the file into a dictionary object. Let us read the data set using the json (JavaScript Object Notation) module (we will not cover this topic):
In [ ]:
import json
data_file = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(data_file)]
Thanks to the json module, the variable records is now a list of dictionaries, one per imported JSON record.
In [ ]:
print("Variable records is {} and its elements are {}.".format(type(records),type(records[0])))
Let us have a look at the first element:
In [ ]:
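# one possible answer (a sketch): display the first record, a dictionary
records[0]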
As said before, this is a typical dictionary
structure with keys and values. Find out the keys:
In [ ]:
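# one possible answer: list the keys of the first record
records[0].keys()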
Find out the value for the key 'tz'
in this first record:
In [ ]:
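# one possible answer: look up the value stored under the key 'tz'
records[0]['tz']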
In this case (and we were not supposed to know this) 'tz' stands for time zone. Suppose we are interested in identifying the most commonly found time zones in the data set we just imported. Surely, each one of you would find a different way to go about it. First, we want to obtain a list of all the time zones found in the records; name the list list_of_timezones:
In [ ]:
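# one possible answer (a sketch): not every record has a 'tz' key, so we skip those that don't
list_of_timezones = [record['tz'] for record in records if 'tz' in record]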
Check the length of list_of_timezones and, for instance, its first ten elements:
In [ ]:
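# one possible answer
print(len(list_of_timezones))
list_of_timezones[:10]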
Try to think of an algorithm to count the occurrences of the different time zones (including the blank field ''). Hint: you might want to use a dictionary to store the occurrences. (If you can't solve it, follow this link for a possible solution.)
In [ ]:
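# one possible solution (a sketch): accumulate the counts in a plain dictionary,
# mapping each time zone to the number of times it appears
counter = {}
for tz in list_of_timezones:
    if tz in counter:
        counter[tz] += 1
    else:
        counter[tz] = 1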
In [ ]:
In [ ]:
In [ ]:
The Python standard library provides the collections module, which contains the collections.Counter class. This does the job we just did, but in a nicer way:
In [ ]:
import collections
print("First counter is of ", type(counter))
counter = collections.Counter(counter)  # generate an instance of the Counter class from our counter dict
print("Now counter is of ", type(counter))
#The Counter class has new useful functionalities
counter.most_common(10)
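Finally, as announced at the beginning, let us repeat the exercise using pandas: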
In [ ]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
myframe = DataFrame(records)
myframe
myframe
is now a DataFrame, a class introduced by pandas to efficiently work with structured data. Check the type:
In [ ]:
type(myframe)
The DataFrame object is composed of Series (another pandas object); they can be seen as the columns of a spreadsheet. For instance, myframe['tz']:
In [ ]:
type(myframe['tz'])
Check the time zones ('tz'
) in the first ten records of myframe
.
In [ ]:
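# one possible answer
myframe['tz'][:10]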
The Series
object has a useful method: value_counts
:
In [ ]:
tz_counter = myframe['tz'].value_counts()
In one line of code, all the time zones are grouped and counted. Check the result and get the top 5 time zones:
In [ ]:
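# one possible answer
tz_counter[:5]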
Far fewer lines of work, right? As we have said repeatedly, there is no need to reinvent the wheel. Probably someone out there solved your problem before you ran into it and, unless you are really, really good, that solution is probably better than yours! ;-)
Next, we might want to plot the data using the matplotlib library.
In [ ]:
# This line configures matplotlib to show figures embedded in the notebook,
# instead of opening a new window for each figure.
%matplotlib inline
Pandas can call matplotlib directly, without invoking the module explicitly. We will make a horizontal bar plot:
In [ ]:
tz_counter[:10].plot(kind='barh', rot=0)
It is kind of odd to realize that the second most popular time zone has a blank label. The DataFrame object of pandas has a fillna method that can replace missing (NA) values; empty strings can then be replaced using boolean indexing:
In [ ]:
#First we generate a new Series from a column of myframe; where a value is NaN, insert the word 'Missing'
clean_tz = myframe['tz'].fillna('Missing')
#In this new Series replace EMPTY VALUES with the word 'Unknown'. Try to understand BOOLEAN INDEXING
clean_tz[clean_tz == ''] = 'Unknown'
#Use the method VALUE_COUNTS to generate a new Series containing time zones and occurrences
tz_counter = clean_tz.value_counts()
#Finally, plot the top ten values
tz_counter[:10].plot(kind='barh', rot=0)
Let's complicate the example a bit more. The 'a' field in the data set contains information on the browser used to perform the URL shortening. For example, check the content of the 'a' field in the first record of myframe:
In [ ]:
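# one possible answer
myframe['a'][0]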
Let's generate a Series containing all the browser data. Try to understand the following line; a good strategy might be to work it out piece by piece:
myframe.a
myframe.a.dropna()
In [ ]:
browser = Series([x.split()[0] for x in myframe.a.dropna()])
As we did with the time zones, we can use the value_counts
method on the browser Series
to see the most common browsers:
In [ ]:
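# one possible answer: count the occurrences of each browser and show the most common ones
browser.value_counts()[:10]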
Let's decompose the top time zones into people using Windows and not using Windows. This piece of code requires some more knowledge of pandas; skip the details for now:
In [ ]:
cframe = myframe[myframe.a.notnull()]
os = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')
tz_and_os = cframe.groupby(['tz',os])
agg_counter = tz_and_os.size().unstack().fillna(0)
agg_counter[:10]
Let's select the top overall time zones. To do so, we construct an indirect index array from the row counts in agg_counter
In [ ]:
# sum over the columns (axis 1) and argsort: the indexer would sort the totals in ascending order
indexer = agg_counter.sum(1).argsort()
indexer[:10]
Next, use take to select the rows in that order and then keep the last 10 rows (the 10 largest):
In [ ]:
count_subset = agg_counter.take(indexer)[-10:]
In [ ]:
count_subset
In [ ]:
count_subset.plot(kind='barh', stacked=True)
Same plot, but with percentages instead of absolute numbers:
In [ ]:
# normalize each row so that it sums to 1, then plot the percentages
subset_normalized = count_subset.div(count_subset.sum(1), axis=0)
subset_normalized.plot(kind='barh', stacked=True)
After this example, we will go through the basics of pandas.
In [ ]:
from pandas import Series, DataFrame
import pandas as pd
In [ ]:
o1 = Series([-4, 7, 11, 13, -22])
o1
Notice that the representation of the Series shows the index on the left and the values on the right. No index was specified when the Series was created, so a default one has been assigned: integers from 0 to N-1 (where N is the length of the data array).
In [ ]:
o1.values
In [ ]:
o1.index
If we need to specify the index:
In [ ]:
o2 = Series([7, 0.2, 11.3, -5], index=['d','e','a','z'])
In [ ]:
o2
In [ ]:
o2.index
Unlike NumPy arrays, we can use labels in the index when selecting single values from a set of values:
In [ ]:
o2['e']
The following is equivalent:
In [ ]:
o2.e
Values corresponding to two indices (notice the double square brackets!):
In [ ]:
o2[['z','d']]
NumPy array operations, such as masking using a boolean array, scalar broadcasting, or applying mathematical functions, preserve the index-value link:
In [ ]:
o2[o2 > 0] #filter positive elements in o2, the indices are conserved. Compare with the same operation in a NumPy array!!!
In [ ]:
o2*np.pi
Pandas Series have also been described as fixed-length, ordered dictionaries, since a Series actually maps index values to data values. Many functions that expect a dict can be used with a Series:
In [ ]:
'z' in o2
A Python dictionary can be used to create a pandas Series. Here is a list of the top 5 most populated cities (2015) according to Wikipedia:
In [ ]:
pop_data = {'Shanghai': 24150000, 'Karachi': 23500000, 'Beijing': 21516000, 'Tianjin': 14722000, 'Istanbul': 14377000}
print ("pop_data is of type ",type(pop_data))
In [ ]:
ser1 = Series(pop_data)
print("ser1 is of type ",type(ser1))
print("Indices of the Series are: ",ser1.index)
print("Values of the Series are: ",ser1.values)
As you just checked, when passing a dictionary, the resulting Series uses the dict keys as its index and sorts the entries by index.
In the next case we create a Series
from a dictionary but selecting the indices we are interested in:
In [ ]:
cities = ['Karachi', 'Istanbul', 'Beijing', 'Moscow']
ser2 = Series(pop_data, index=cities)
ser2
Note that the values found in pop_data have been placed in the appropriate locations. No data was found for 'Moscow', so the value NaN is assigned. This is used in pandas to mark missing or not available (NA) values. In order to detect missing data in pandas, one should use isnull and notnull (both available as top-level functions and as Series methods):
In [ ]:
pd.isnull(ser2)
In [ ]:
ser2.isnull()
In [ ]:
ser2.notnull()
An important feature of Series to be highlighted here is that they are automatically aligned by index label when performing arithmetic operations. It doesn't make much sense to add the population data but...
In [ ]:
ser1 + ser2
We can assign names to both the Series object and its index using the name
attribute:
In [ ]:
ser1.name = 'population'
ser1.index.name = 'city'
ser1
DataFrame
The DataFrame object represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be of a different value type. The DataFrame has both a row and a column index, and it can be seen as a dictionary of Series. Under the hood, the data is stored as one or more 2D blocks rather than a list, dict, or some other collection of 1D arrays. See the following example:
In [ ]:
data = {'city': ['Madrid', 'Madrid','Madrid','Barcelona','Barcelona','Sevilla','Sevilla','Girona','Girona','Girona'],
'year': ['2002', '2006', '2010', '2006', '2010', '2002', '2010', '2002', '2006', '2010'],
'pop': [5478405, 5953604, 6373532, 5221848, 5488633, 1732697, 1902956, 568690, 668911, 741841]}
pop_frame = DataFrame(data)
In [ ]:
pop_frame
The resulting DataFrame has automatically assigned indices and the columns are sorted. This order can be altered if we specify a sequence of columns:
In [ ]:
DataFrame(data, columns=['year','city','pop'])
You should keep in mind that indices in the DataFrame are Index objects; they have attributes (such as name, as we will see) and are immutable.
What will happen if we pass a column that is not contained in the data set?
In [ ]:
pop_frame2 = DataFrame(data, columns=['year','city','pop','births'])
pop_frame2
A column in a DataFrame
can be retrieved as a Series in two ways:
In [ ]:
pop_frame2['city']
Or as a DataFrame attribute:
In [ ]:
pop_frame2.city
Columns can be modified by assignment. Let's get rid of those NA values in the births column:
In [ ]:
pop_frame2['births'] = 100000
pop_frame2
When assigning lists or arrays to a column, the values' length must match the length of the DataFrame. If we assign a Series, it will instead be conformed exactly to the DataFrame's index, inserting missing values in any holes:
In [ ]:
birth_series = Series([100000, 15000, 98000], index=[0,2,3])
In [ ]:
pop_frame2['births'] = birth_series
pop_frame2
Assigning to a column that doesn't exist creates a new column. Columns can be deleted using the del command:
In [ ]:
pop_frame2['Catalunya'] = ((pop_frame2.city == 'Barcelona') | (pop_frame2.city == 'Girona'))
pop_frame2
In [ ]:
del pop_frame2['Catalunya']
pop_frame2.columns
Alternatively, the DataFrame
can be built from a nested dict
of dicts
:
In [ ]:
pop_data = {'Madrid': {'2002': 5478405, '2006': 5953604, '2010': 6373532},
'Barcelona': {'2006': 5221848, '2010': 5488633}, 'Sevilla': {'2002': 1732697, '2010': 1902956},
'Girona': {'2002': 568690, '2006': 668911, '2010': 741841}}
pop_frame3 = DataFrame(pop_data)
pop_frame3
The outer dict keys act as the columns and the inner keys as the unioned row indices. There are many other possible data inputs to construct a DataFrame; see the pandas documentation for the full list.
As in Series
, the index
and columns
in a DataFrame
have name
attributes:
In [ ]:
pop_frame3.columns.name = 'city'; pop_frame3.index.name = 'year'
pop_frame3
Similarly, the values attribute returns the data contained in the DataFrame as a 2D array:
In [ ]:
pop_frame3.values
We will not cover all the possible operations on pandas data structures, just some of the basics. A critical method in pandas is reindex: it creates a new object with the same data but conformed to a new index. For instance:
First, extract the population data for Barcelona from pop_frame3 as a Series (the Barcelona Series):
In [ ]:
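# a possible sketch: selecting a column of pop_frame3 returns a Series;
# the name pop_bcn is our choice, reused in the cells below
pop_bcn = pop_frame3['Barcelona']
pop_bcn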
Call reindex on the Barcelona Series to rearrange the data to a new index ['2010', '2008', '2006', '2004', '2002'], following this example:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [ ]:
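# one possible answer (a sketch), mirroring the example above; note that the year labels are strings
pop_bcn2 = pop_bcn.reindex(['2010', '2008', '2006', '2004', '2002'])
pop_bcn2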
The reindex method can be combined with the fill_value option to fill in the non-existing values:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
In [ ]:
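# one possible answer: the years missing from pop_bcn are filled with 0
pop_bcn2 = pop_bcn.reindex(['2010', '2008', '2006', '2004', '2002'], fill_value=0)
pop_bcn2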
It doesn't make much sense in this case to set the non-existing values to zero. For ordered data such as time series, we can use interpolation or forward/backward filling instead.
In the case of DataFrames, reindex can alter the (row) index, the columns, or both.
In [ ]:
#Inverting the row indices and adding some more years
years = ['2010', '2008', '2006', '2004', '2002']
pop_frame4 = pop_frame3.reindex(years)
If instead we want to reindex the columns, we need to use the columns
keyword:
In [ ]:
cities = ['Madrid', 'Sevilla','Barcelona', 'Girona']
pop_frame4 = pop_frame4.reindex(columns=cities)
Both at once:
In [ ]:
pop_frame4 = pop_frame3.reindex(index = years, columns = cities)
See the pandas documentation (or the IPython help) for the full list of reindex function arguments. Next, entries can be removed from an axis using the drop method:
In [ ]:
pop_bcn2.drop(['2002', '2008'])
Check that the object was not modified.
In a DataFrame, index values can be deleted from either axis. Use the IPython help to find out how drop works and get rid of all the data related to Madrid and to year 2002:
In [ ]:
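# one possible answer (a sketch): drop the 2002 row, then the Madrid column (axis=1)
pop_frame3.drop('2002').drop('Madrid', axis=1)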
In [ ]:
pop_bcn2.index.name = 'year'
Population in Barcelona in year 2006?
In [ ]:
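# one possible answer: label-based selection
pop_bcn2['2006']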
Boolean indexing: non-zero data in Barcelona?
In [ ]:
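# one possible answer: the boolean mask keeps only the entries where the condition holds
pop_bcn2[pop_bcn2 != 0]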
Important: slicing with labels behaves differently from normal Python slicing: the endpoint is inclusive! Give it a try:
In [ ]:
pop_bcn2[:2]
In [ ]:
pop_bcn2['2002':'2006']
In [ ]: