Python Data Analysis Library

OpenTechSchool Workshop - Hugo Herter - August 2014

Description

Pandas is a Python library providing high-performance, easy-to-use data structures and data analysis tools.

The library is oriented towards DataFrames, table-like data structures that can be manipulated by a collection of methods:

  • sort, merge, join, filter, group by, reshape & pivots
  • read/write Excell, CSV, SQL databases, JSON
  • view as tables, plot graphs

If you are familiar with R, think of R Data Frames ported to Python.

Under the hood, Pandas is based on NumPy, a powerful library for numerical computing in Python - you will encounter it a few times in this workshop.

When did people register to this event

The usecase we will be investigating is the following:

When did people register for this event ? Up to how many per day ? What hour during the day ?

To achieve this, we will load the RSVPs information about this event from Meetup's API, load it into Pandas and manipulate it in order to answer the question above.

First steps

From Python data structures to Pandas


In [1]:
# Import and configure NumPy, Matplotlib and other useful math tools for running within the Notebook
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Load Pandas as any Python library:
import pandas

Get your hands on a simple DataFrame


In [3]:
dataframe = pandas.DataFrame(range(0, 50, 10))
dataframe


Out[3]:
0
0 0
1 10
2 20
3 30
4 40

You can plot this DataFrame by calling it's plot() method.


In [4]:
dataframe.plot()


Out[4]:
<matplotlib.axes.AxesSubplot at 0x7f8a471ace90>

By pressing TAB after typing dataframe., the notebook will show you other methods you can use on this DataFrame. Don't hesitate to experiment.

Meetup's data

For this exercise, we will have a look at this event's data on Meetup.

First

Grab your API authentication key: https://secure.meetup.com/meetup_api/key/

Second

Load the required libraries

Third

Load the data and parse it.

URL -> JSON -> DataFrame


In [5]:
import requests  # load HTTP requests
from __future__ import print_function  # use Python 3 syntax for print(...)

api_key = '72d84e1c4c2a50321268137412e73'
event_id = '194806872'

rsvps_url = "https://api.meetup.com/2/rsvps/"
event_url = "https://api.meetup.com/2/event/" + event_id

print("URLs:")
print(rsvps_url)
print(event_url)


URLs:
https://api.meetup.com/2/rsvps/
https://api.meetup.com/2/event/194806872

Fetching data from the Meetup API

First, load the data from the URL rsvps_url. You need to pass it two parameters: your API key and the identifier of the event.


In [6]:
rsvps_dict = requests.get(rsvps_url, params={'key': api_key, 'event_id': event_id}).json()

You can then print the first result to see how it looks like. Here, we only print the first valye of the sub-element 'results' because the whole content is quite long, but you can experiment with it and try printing all of it or other subsets.


In [7]:
rsvps_dict['results'][0]


Out[7]:
{u'created': 1408919587000,
 u'event': {u'event_url': u'http://www.meetup.com/OpenTechSchool-Brussels/events/194806872/',
  u'id': u'194806872',
  u'name': u'Learn Data analysis with IPython Notebook and Pandas',
  u'time': 1409072400000},
 u'group': {u'group_lat': 50.83000183105469,
  u'group_lon': 4.329999923706055,
  u'id': 9462812,
  u'join_mode': u'open',
  u'urlname': u'OpenTechSchool-Brussels'},
 u'guests': 0,
 u'member': {u'member_id': 162880392, u'name': u'Yannick Yemo'},
 u'member_photo': {u'highres_link': u'http://photos2.meetupstatic.com/photos/member/6/8/6/2/highres_220706722.jpeg',
  u'photo_id': 220706722,
  u'photo_link': u'http://photos4.meetupstatic.com/photos/member/6/8/6/2/member_220706722.jpeg',
  u'thumb_link': u'http://photos2.meetupstatic.com/photos/member/6/8/6/2/thumb_220706722.jpeg'},
 u'mtime': 1408919587000,
 u'response': u'yes',
 u'rsvp_id': 1411704412,
 u'venue': {u'address_1': u'Franklin Rooseveltlaan 50, Brussels',
  u'city': u'Brussels',
  u'country': u'be',
  u'id': 21985282,
  u'lat': 50.81473,
  u'lon': 4.381928,
  u'name': u'ULB Building K, third floor, room K.3.201 (Campus Solbosch)',
  u'repinned': True}}

Transformation into a Pandas DataFrame

Pandas can load data from many different formats. In this case, our data are structured as a list of dictionnaries, which can easily be translated into a table.

Note that the source document is hierarchical with a depth greater than one for some fields. Once converted to a DataFrame, the value for these fields will keep its hierarchical structure as a Python dictionnary (display is cropped in the table below).


In [8]:
results = rsvps_dict['results']

rsvps_raw = pandas.DataFrame(results)
rsvps_raw.head()  # only prints the top rows of the table


Out[8]:
created event group guests member member_photo mtime response rsvp_id venue
0 1408919587000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Yannick Yemo', u'member_id': 16288... {u'thumb_link': u'http://photos2.meetupstatic.... 1408919587000 yes 1411704412 {u'city': u'Brussels', u'name': u'ULB Building...
1 1408633578000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Yann', u'member_id': 140547792} {u'thumb_link': u'http://photos3.meetupstatic.... 1408633578000 yes 1407439252 {u'city': u'Brussels', u'name': u'ULB Building...
2 1408636360000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Robert Johnston', u'member_id': 14... NaN 1408636360000 yes 1407497352 {u'city': u'Brussels', u'name': u'ULB Building...
3 1407686959000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Dylan Soubien', u'member_id': 1383... {u'thumb_link': u'http://photos3.meetupstatic.... 1407686959000 no 1391751912 {u'city': u'Brussels', u'name': u'ULB Building...
4 1405346392000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Ludo', u'member_id': 137883082} NaN 1405346392000 yes 1354372332 {u'city': u'Brussels', u'name': u'ULB Building...

Index by 'created' field

It would be useful to index the original dataframe based on the RSVP creation date, as this is the dimension we are interested in. The created field is expressed in microseconds, so you also have to convert it into seconds before converting it to a DateTime object that can more easily be manipulated.


In [9]:
# Index the raw dataframe by RSVP creation date:
# - convers the 'created' field to a DateTime - Pandas wants microseconds
rsvps_raw.index = pandas.to_datetime(rsvps_raw['created'] * 1000000)

As you can see, the first columns of your DataFrame looks different: it has been set as the index of the table:


In [10]:
rsvps_raw.head(3)


Out[10]:
created event group guests member member_photo mtime response rsvp_id venue
created
2014-08-24 22:33:07 1408919587000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Yannick Yemo', u'member_id': 16288... {u'thumb_link': u'http://photos2.meetupstatic.... 1408919587000 yes 1411704412 {u'city': u'Brussels', u'name': u'ULB Building...
2014-08-21 15:06:18 1408633578000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Yann', u'member_id': 140547792} {u'thumb_link': u'http://photos3.meetupstatic.... 1408633578000 yes 1407439252 {u'city': u'Brussels', u'name': u'ULB Building...
2014-08-21 15:52:40 1408636360000 {u'event_url': u'http://www.meetup.com/OpenTec... {u'urlname': u'OpenTechSchool-Brussels', u'gro... 0 {u'name': u'Robert Johnston', u'member_id': 14... NaN 1408636360000 yes 1407497352 {u'city': u'Brussels', u'name': u'ULB Building...

Clean DataFrame

The DataFrame you have just created contains both information we are not interested in and information structured in sub-dictionnaries.

You will therefore create a new DataFrame and copy only the information relevant for our task into it.


In [11]:
# Create a new 'clean' DataFrame:
rsvps = pandas.DataFrame(rsvps_raw['response'])

# Here is a function that will return the given key from a dictionnary if it exists,
# and else return None. You will use it to extract information from dictionnary fields.
def get_or_nan(dic, key):
    try: return dic.get(key)
    except AttributeError: return None

rsvps['name'] = rsvps_raw['member'].map(lambda x: x.get('name'))
rsvps['thumb'] = rsvps_raw['member_photo'].map(lambda x: get_or_nan(x, 'thumb_link'))
rsvps['photo'] = rsvps_raw['member_photo'].map(lambda x: get_or_nan(x, 'photo_link'))

rsvps.head(5)


Out[11]:
response name thumb photo
created
2014-08-24 22:33:07 yes Yannick Yemo http://photos2.meetupstatic.com/photos/member/... http://photos4.meetupstatic.com/photos/member/...
2014-08-21 15:06:18 yes Yann http://photos3.meetupstatic.com/photos/member/... http://photos1.meetupstatic.com/photos/member/...
2014-08-21 15:52:40 yes Robert Johnston None None
2014-08-10 16:09:19 no Dylan Soubien http://photos3.meetupstatic.com/photos/member/... http://photos3.meetupstatic.com/photos/member/...
2014-07-14 13:59:52 yes Ludo None None

Grouping data

Pandas contains a lot of shortcuts and handful properties. For example, the date or the hour of a Datetime column can be extracted easily in the form of dataframe.column.date or dataframe.column.hour.

The method groupby(field name) on a DataFrame allows for grouping it's content by a certain column for further processing. In this case, you will want to group it by day and then aggregate the length of each group to get the number of registrations for each day. (This will only show registrations for dates with at least one registration).


In [12]:
# Create a new column containing only the date (without the time)
rsvps['day'] = rsvps.index.date
rsvps_per_day = rsvps.groupby('day').agg(len)
rsvps_per_day


Out[12]:
response name thumb photo
day
2014-07-14 8 8 8 8
2014-07-15 4 4 4 4
2014-07-16 3 3 3 3
2014-07-17 1 1 1 1
2014-07-18 1 1 1 1
2014-07-21 1 1 1 1
2014-07-23 1 1 1 1
2014-07-26 1 1 1 1
2014-07-28 1 1 1 1
2014-07-30 1 1 1 1
2014-08-01 1 1 1 1
2014-08-02 1 1 1 1
2014-08-03 1 1 1 1
2014-08-04 1 1 1 1
2014-08-06 1 1 1 1
2014-08-09 1 1 1 1
2014-08-10 2 2 2 2
2014-08-11 2 2 2 2
2014-08-12 1 1 1 1
2014-08-17 2 2 2 2
2014-08-18 2 2 2 2
2014-08-19 1 1 1 1
2014-08-20 5 5 5 5
2014-08-21 4 4 4 4
2014-08-22 1 1 1 1
2014-08-23 1 1 1 1
2014-08-24 2 2 2 2
2014-08-25 3 3 3 3

Now that you got your data, you may want to visualize it as a graph using the plot() method. Specifying the marker allows to see more precisely the data points.


In [13]:
rsvps_per_day['response'].plot(marker='o')


Out[13]:
<matplotlib.axes.AxesSubplot at 0x7f8a46d659d0>

Grouping by hour and going further.

You can also look at the hour of the day at which people registered for this event by grouping by hour.


In [14]:
rsvps['hour'] = rsvps.index.hour
rsvps_per_hour = rsvps.groupby('hour').agg(len)
rsvps_per_hour['response'].plot(marker='o')


Out[14]:
<matplotlib.axes.AxesSubplot at 0x7f8a4641a550>

From here, you can have a look at the photo of the people who did register at a given hour. In this example, I will show you how to look for people who did register at 13h (after 13h and before 14h), and who have a profile photo.

rsvps.thumb.notnull() will give you a boolean matrix you can use to filter out people with no profile photo.

Reminder: .head() is used to only print the first rows of the result.


In [15]:
rsvps.thumb.notnull().head()


Out[15]:
created
2014-08-24 22:33:07     True
2014-08-21 15:06:18     True
2014-08-21 15:52:40    False
2014-08-10 16:09:19     True
2014-07-14 13:59:52    False
Name: thumb, dtype: bool

In a similar way, rsvps.hour==13 will give you a boolean matrix you can use to filter people who registeted at 13h.


In [16]:
(rsvps.hour == 13).head()


Out[16]:
created
2014-08-24 22:33:07    False
2014-08-21 15:06:18    False
2014-08-21 15:52:40    False
2014-08-10 16:09:19    False
2014-07-14 13:59:52     True
Name: hour, dtype: bool

You can combine both these filters as follows, and you should only get RSVPs between 13h and 14h with a profile picture.


In [63]:
rsvps[rsvps.thumb.notnull()][rsvps.hour==13]


Out[63]:
response name thumb photo day hour
created
2014-08-20 13:03:10 yes Satblip http://photos3.meetupstatic.com/photos/member/... http://photos3.meetupstatic.com/photos/member/... 2014-08-20 13
2014-07-14 13:05:24 no Steve http://photos2.meetupstatic.com/photos/member/... http://photos2.meetupstatic.com/photos/member/... 2014-07-14 13
2014-07-14 13:46:05 yes Antonio Rebordao http://photos1.meetupstatic.com/photos/member/... http://photos3.meetupstatic.com/photos/member/... 2014-07-14 13
2014-07-28 13:59:23 no Rosalind Lester http://photos3.meetupstatic.com/photos/member/... http://photos3.meetupstatic.com/photos/member/... 2014-07-28 13
2014-07-14 13:42:39 yes Matteo Gagliolo http://photos1.meetupstatic.com/photos/member/... http://photos3.meetupstatic.com/photos/member/... 2014-07-14 13
2014-08-20 13:04:16 yes Augustin http://photos2.meetupstatic.com/photos/member/... http://photos4.meetupstatic.com/photos/member/... 2014-08-20 13

Finally, you can plot the member's photos:


In [18]:
from IPython.display import Image

for i in rsvps[rsvps.thumb.notnull()][rsvps.hour==13].thumb:
    display(Image(i))


How would you search for the photo for registrations during night time ?

Geographical data

Finally, you will have a look at some geographical data to see how well they can be manipulated in the notebook.

In this chapter, you will download the list of venues of past OpenTechSchool Brussels events from Meetup, filter empty values and display them on a map of Brussels.

The Map section will be handled with folium, a Python library to control the famous Leaflet JavaScript library for Interactive Maps.


In [19]:
OTS_venues_url = "https://api.meetup.com/2/venues"

In [20]:
venues_dict = requests.get(OTS_venues_url, params={'key': api_key, 'group_urlname': "OpenTechSchool-Brussels", 'status': 'past'}).json()

In [21]:
venues_dict['results'][0]


Out[21]:
{u'address_1': u'Bessenveldstraat 19',
 u'city': u'Diegem',
 u'country': u'be',
 u'distance': 0,
 u'id': 13589072,
 u'lat': 50.882275,
 u'lon': 4.436641,
 u'name': u'Capgemini Belgium',
 u'rating': 4,
 u'rating_count': 1}

In [58]:
venues = pandas.DataFrame(venues_dict['results'])
venues


Out[58]:
address_1 city country distance id lat lon name phone rating rating_count
0 Bessenveldstraat 19 Diegem be 0 13589072 50.882275 4.436641 Capgemini Belgium NaN 4.00 1
1 Koolmijnenkaai 30-34 Brussels be 0 1277210 50.854667 4.342152 FoAM vzw NaN 5.00 3
2 Steenweg op Etterbeek 180 Brussels be 0 1318152 50.838128 4.380773 Google Belgium +32 477 482870 4.60 25
3 Rue du Moulin 40 | Molenstraat 40 Brussels be 0 14691632 50.850338 4.351710 Hackerspace Brussels NaN 4.67 21
4 30-34 Quai des Charbonnages Brussels be 0 18557642 50.854759 4.342060 iMal NaN 5.00 2
5 Rue Sans Souci 131, 1050 Ixelles Brussels be 0 21499202 50.832249 4.372640 Ixelles Library Sans Soucie NaN 4.67 3
6 Chaussée de Louvain 242 Brussels be 0 19428452 50.850121 4.380240 Labokube NaN 4.50 4
7 Avenue Franklin Roosevelt 50 Brussels be 0 18625132 0.000000 0.000000 Room K.3.401, Building K, Third floor, ULB Cam... NaN 4.00 2
8 Avenue Franklin Roosevelt 50 Brussels be 0 19526042 50.811722 4.381080 Room K.4.401, Building K, Fourth Floor, ULB Ca... NaN 4.25 4
9 Franklin Rooseveltlaan 50, Brussels Brussels be 0 21985282 50.811722 4.381080 ULB Building K, third floor, room K.3.201 (Cam... NaN 0.00 0
10 Franklin Rooseveltlaan 50 Brussels be 0 17352662 0.000000 0.000000 ULB Campus Solbosch Batiment K, third floor, r... NaN 4.00 2

Venues on a Map

First, you will need to import folium. Make sure you have installed it, else use pip install folium.


In [59]:
import folium

The following function will allow you to include the maps directly into IPython Notebook.


In [60]:
from IPython.display import HTML
def inline_map(map):
    "Embeds the HTML source of the map directly into the IPython notebook."
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 510px; border: none"></iframe>'.format(srcdoc=map.HTML.replace('"', '&quot;')))

Finally, show a map of Brussels with a marker for each venue from the DataFrame:


In [61]:
Brussels = [50.850338, 4.351710]
map = folium.Map(location=Brussels, zoom_start=12, tiles='Stamen Toner')

for index, row in venues[venues.lat!=0].iterrows():
    map.simple_marker([row.lat, row.lon], popup=str(row['name']))
    
inline_map(map)


Out[61]:

That's it !

I hope you got the taste of IPython Notebook and Pandas.

You can discover more on the web via the nbviewer website, for example:

You can also use the Notebook with the languages Ruby and Julia.