OpenTechSchool Workshop - Hugo Herter - August 2014
Pandas is a Python library providing high-performance, easy-to-use data structures and data analysis tools.
The library is oriented towards DataFrames: table-like data structures that can be manipulated through a rich collection of methods.
If you are familiar with R, think of R Data Frames ported to Python.
Under the hood, Pandas is based on NumPy, a powerful library for numerical computing in Python - you will encounter it a few times in this workshop.
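As a tiny, optional illustration (not needed for the rest of the workshop), the values we will put into the first DataFrame below can equally be produced as a NumPy array:
In [ ]:
import numpy
numpy.arange(0, 50, 10)  # array([ 0, 10, 20, 30, 40]) - the same values as range(0, 50, 10)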
The use case we will be investigating is the following:
When did people register for this event? Up to how many per day? At what hour of the day?
To achieve this, we will fetch the RSVP data for this event from Meetup's API, load it into Pandas and manipulate it in order to answer the questions above.
In [1]:
# Import and configure NumPy, Matplotlib and other useful math tools for running within the Notebook
%pylab inline
In [2]:
# Load Pandas as any Python library:
import pandas
Get your hands on a simple DataFrame
In [3]:
dataframe = pandas.DataFrame(range(0, 50, 10))
dataframe
Out[3]:
You can plot this DataFrame by calling its plot() method.
In [4]:
dataframe.plot()
Out[4]:
By pressing TAB after typing dataframe., the notebook will show you other methods you can use on this DataFrame. Don't hesitate to experiment.
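For instance, here are a few standard DataFrame methods you could try on the small DataFrame above (output not shown here):
In [ ]:
dataframe.describe()  # summary statistics: count, mean, std, min, max, quartiles
dataframe.sum()       # column-wise sum
dataframe.cumsum()    # cumulative sum, which can also be plotted with .plot()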
For this exercise, we will have a look at this event's data on Meetup.
Grab your API authentication key: https://secure.meetup.com/meetup_api/key/
Load the required libraries:
import requests for HTTP requests
Load the data for event 194806872 and parse it:
URL -> JSON -> DataFrame
In [5]:
import requests # load HTTP requests
from __future__ import print_function # use Python 3 syntax for print(...)
api_key = '72d84e1c4c2a50321268137412e73'
event_id = '194806872'
rsvps_url = "https://api.meetup.com/2/rsvps/"
event_url = "https://api.meetup.com/2/event/" + event_id
print("URLs:")
print(rsvps_url)
print(event_url)
In [6]:
rsvps_dict = requests.get(rsvps_url, params={'key': api_key, 'event_id': event_id}).json()
You can then print the first result to see what it looks like. Here, we only print the first value of the sub-element 'results' because the whole content is quite long, but you can experiment with it and try printing all of it or other subsets.
In [7]:
rsvps_dict['results'][0]
Out[7]:
Pandas can load data from many different formats. In this case, our data are structured as a list of dictionaries, which can easily be translated into a table.
Note that the source document is hierarchical, with a depth greater than one for some fields. Once converted to a DataFrame, the value for these fields keeps its hierarchical structure as a Python dictionary (the display is cropped in the table below).
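As a toy illustration (the names and values here are made up, not taken from the Meetup data), a list of dictionaries maps directly onto rows and columns, and a nested dictionary stays as a single cell value:
In [ ]:
toy = [
    {'response': 'yes', 'member': {'name': 'Alice'}},
    {'response': 'no',  'member': {'name': 'Bob'}},
]
pandas.DataFrame(toy)  # the 'member' column keeps the nested dictionaries as plain Python objects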
In [8]:
results = rsvps_dict['results']
rsvps_raw = pandas.DataFrame(results)
rsvps_raw.head() # only prints the top rows of the table
Out[8]:
It would be useful to index the original dataframe by the RSVP creation date, as this is the dimension we are interested in. The created field is a timestamp in milliseconds, while pandas.to_datetime interprets plain integers as nanoseconds, so you have to multiply it by 1,000,000 before converting it to a DateTime index that can more easily be manipulated.
In [9]:
# Index the raw dataframe by RSVP creation date:
# - convert the 'created' timestamp (milliseconds) to nanoseconds, which pandas.to_datetime expects for plain integers
rsvps_raw.index = pandas.to_datetime(rsvps_raw['created'] * 1000000)
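Note that pandas.to_datetime also accepts a unit argument, so an equivalent and arguably more explicit conversion (assuming the timestamps are indeed milliseconds) would be:
In [ ]:
rsvps_raw.index = pandas.to_datetime(rsvps_raw['created'], unit='ms')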
As you can see, the first column of your DataFrame now looks different: it has been set as the index of the table:
In [10]:
rsvps_raw.head(3)
Out[10]:
In [11]:
# Create a new 'clean' DataFrame:
rsvps = pandas.DataFrame(rsvps_raw['response'])
# Here is a small helper that returns the given key from a dictionary,
# or None when the value is not a dictionary (e.g. a missing member photo stored as NaN).
def get_or_nan(dic, key):
    try:
        return dic.get(key)
    except AttributeError:
        return None
rsvps['name'] = rsvps_raw['member'].map(lambda x: x.get('name'))
rsvps['thumb'] = rsvps_raw['member_photo'].map(lambda x: get_or_nan(x, 'thumb_link'))
rsvps['photo'] = rsvps_raw['member_photo'].map(lambda x: get_or_nan(x, 'photo_link'))
rsvps.head(5)
Out[11]:
Pandas contains a lot of shortcuts and handy properties. For example, the date or the hour of a Datetime index can be extracted easily in the form of dataframe.index.date or dataframe.index.hour.
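For example, on a small hand-made DatetimeIndex (the timestamps below are invented for illustration):
In [ ]:
timestamps = pandas.to_datetime(['2014-08-26 13:05:00', '2014-08-26 21:30:00'])
print(timestamps.date)  # the calendar dates, as datetime.date objects
print(timestamps.hour)  # the hours: 13 and 21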
The groupby(field_name) method on a DataFrame allows grouping its content by a certain column for further processing. In this case, you will group it by day and then aggregate the length of each group to get the number of registrations per day. (This will only show dates with at least one registration.)
In [12]:
# Create a new column containing only the date (without the time)
rsvps['day'] = rsvps.index.date
rsvps_per_day = rsvps.groupby('day').agg(len)
rsvps_per_day
Out[12]:
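A shortcut, if you only need the counts rather than a fully aggregated table, is the size() method of the grouped object; it should give the same numbers as the 'response' column above:
In [ ]:
rsvps.groupby('day').size()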
Now that you have your data, you may want to visualize it as a graph using the plot() method. Specifying a marker makes the individual data points easier to see.
In [13]:
rsvps_per_day['response'].plot(marker='o')
Out[13]:
In [14]:
rsvps['hour'] = rsvps.index.hour
rsvps_per_hour = rsvps.groupby('hour').agg(len)
rsvps_per_hour['response'].plot(marker='o')
Out[14]:
From here, you can have a look at the photos of the people who registered at a given hour. In this example, I will show you how to look for people who registered at 13h (after 13h and before 14h) and who have a profile photo.
rsvps.thumb.notnull() will give you a boolean Series you can use to filter out people with no profile photo.
Reminder: .head() is used to only print the first rows of the result.
In [15]:
rsvps.thumb.notnull().head()
Out[15]:
In a similar way, rsvps.hour == 13 will give you a boolean Series you can use to filter people who registered at 13h.
In [16]:
(rsvps.hour == 13).head()
Out[16]:
You can combine both these filters with the & operator, and you should then only get RSVPs between 13h and 14h with a profile picture.
In [63]:
rsvps[rsvps.thumb.notnull() & (rsvps.hour == 13)]
Out[63]:
Finally, you can display the members' photos:
In [18]:
from IPython.display import Image, display
for i in rsvps[rsvps.thumb.notnull() & (rsvps.hour == 13)].thumb:
    display(Image(i))
How would you search for the photos of registrations made during the night?
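One possible answer, treating "night time" as 22h to 6h (an arbitrary choice), combines an hour-based filter with the profile-photo filter:
In [ ]:
night = (rsvps.hour >= 22) | (rsvps.hour < 6)
for i in rsvps[night & rsvps.thumb.notnull()].thumb:
    display(Image(i))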
Finally, you will have a look at some geographical data to see how well they can be manipulated in the notebook.
In this chapter, you will download the list of venues of past OpenTechSchool Brussels events from Meetup, filter empty values and display them on a map of Brussels.
The map section will be handled with folium, a Python library that drives the popular Leaflet JavaScript library for interactive maps.
In [19]:
OTS_venues_url = "https://api.meetup.com/2/venues"
In [20]:
venues_dict = requests.get(OTS_venues_url, params={'key': api_key, 'group_urlname': "OpenTechSchool-Brussels", 'status': 'past'}).json()
In [21]:
venues_dict['results'][0]
Out[21]:
In [58]:
venues = pandas.DataFrame(venues_dict['results'])
venues
Out[58]:
In [59]:
import folium
The following function will allow you to embed the maps directly into the IPython Notebook.
In [60]:
from IPython.display import HTML
def inline_map(map):
    "Embeds the HTML source of the map directly into the IPython notebook."
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'.format(srcdoc=map.HTML.replace('"', '&quot;')))
Finally, show a map of Brussels with a marker for each venue from the DataFrame:
In [61]:
Brussels = [50.850338, 4.351710]
map = folium.Map(location=Brussels, zoom_start=12, tiles='Stamen Toner')
for index, row in venues[venues.lat != 0].iterrows():
    map.simple_marker([row.lat, row.lon], popup=str(row['name']))
inline_map(map)
Out[61]:
I hope you got a taste of the IPython Notebook and Pandas.
You can discover more examples on the web via the nbviewer website.
You can also use the Notebook with the languages Ruby and Julia.