DataPhilly is a local data meetup group I started back in 2012. I had attended a few data science conferences and was disappointed by the lack of a local meetup group for people interested in data science. And so DataPhilly was born!
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
Through Jupyter's kernel and messaging architecture, the Notebook allows code to be run in a range of different programming languages. For each notebook document that a user opens, the web application starts a kernel that runs the code for that notebook. Each kernel is capable of running code in a single programming language, and kernels are available for dozens of languages.
The default kernel runs Python code. The notebook provides a simple way for users to pick which of these kernels is used for a given notebook.
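If you're curious which kernels are installed on your own machine, you can ask the jupyter_client library (installed alongside the notebook) directly. This is just a quick sketch; the names and paths it prints will depend on your setup:

```python
# List the kernelspecs installed on this machine; output depends on your setup.
from jupyter_client.kernelspec import KernelSpecManager

for name, path in KernelSpecManager().find_kernel_specs().items():
    print(name, '->', path)  # e.g. python3 -> /usr/local/share/jupyter/kernels/python3
```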
Jupyter examples and tutorials can be found in the Jupyter GitHub repo.
The task I'll be walking you through today will demonstrate how to use Python for exploratory data analysis. The dataset I'll use is one I created by querying the Meetup API for the DataPhilly meetup. I'll walk you through using the Jupyter Notebook (the web app we're using now), Pandas (an Excel-like tool for data exploration) and scikit-learn (a Python machine learning library) to explore the DataPhilly dataset. I won't go in depth into these tools, but my hope is that you'll leave my talk wanting to learn more about using Python for exploratory data analysis, and that you'll learn some interesting things about DataPhilly in the process.
First let's start off by initializing our environment:
`%matplotlib inline` initializes matplotlib so that we can display graphs and charts in our notebook.
`import seaborn as sns` imports seaborn, a graphing library built on top of matplotlib.
`import pandas as pd` imports pandas, a tool I'll explain in the next section.
Hint: If you've installed Jupyter Notebook and you're running this on your machine, you can use the run button in the toolbar at the top of the page to execute each cell.
Click on the cell above and the cell below. You'll notice that the cell above is Markdown. You can edit it by double clicking on it. The cell below contains Python code which can be modified and executed. If the code has any output it will be printed out below the cell with Out [n]: in front of it.
In [1]:
%matplotlib inline
import seaborn as sns
import pandas as pd
from matplotlib import rcParams
# Modify aesthetics for visibility during presentation
sns.set_style('darkgrid', {'axes.facecolor': '#C2C2C8'})
sns.set_palette('colorblind')
# Make everything bigger for visibility during presentation
rcParams['figure.figsize'] = 20, 10
rcParams['axes.titlesize'] = 'xx-large'
rcParams['axes.labelsize'] = 'x-large'
rcParams['xtick.labelsize'] = 'x-large'
rcParams['ytick.labelsize'] = 'x-large'
rcParams['legend.fontsize'] = 'xx-large'
rcParams['lines.linewidth'] = 4.0
rcParams['grid.linewidth'] = 2.0
# Hide warnings in the notebook
import warnings
warnings.filterwarnings('ignore')
Pandas is a library that provides data analysis tools for the Python programming language. You can think of it as Excel on steroids, but in Python.
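If you've never used pandas before, here's a tiny standalone example (the event names and numbers below are made up, not from the DataPhilly data) to illustrate the spreadsheet analogy:

```python
import pandas as pd

# A toy table; each column behaves like a spreadsheet column
toy_df = pd.DataFrame({
    'event': ['Kickoff', 'Machine Learning 101', 'Lightning Talks'],
    'rsvps': [25, 40, 31],
})
toy_df['rsvps'].mean()  # 32.0 -- column math works like a spreadsheet formula
```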
To start off, I've used the Meetup API to gather a bunch of data on members of the DataPhilly meetup group. First let's look at the events we've had over the past few years. I've loaded the data into a pandas `DataFrame` and stored it in the file `events.pkl`. A `DataFrame` is a table similar to an Excel spreadsheet. Let's load it and see what it looks like:
In [2]:
events_df = pd.read_pickle('events.pkl')
events_df = events_df.sort_values(by='time')
events_df
Out[2]:
You can access values in a `DataFrame` column like this:
In [3]:
events_df['yes_rsvp_count']
Out[3]:
You can access a row of a `DataFrame` using `iloc`:
In [4]:
events_df.iloc[4]
Out[4]:
We can view the first few rows using the `head` method:
In [5]:
events_df.head()
Out[5]:
And similarly the last few using `tail`:
In [6]:
events_df.tail(3)
Out[6]:
We can see that the `yes_rsvp_count` column contains the number of people who RSVPed yes for each event. First let's look at some basic statistics:
In [7]:
yes_rsvp_count = events_df['yes_rsvp_count']
yes_rsvp_count.sum(), yes_rsvp_count.mean(), yes_rsvp_count.min(), yes_rsvp_count.max()
Out[7]:
When we access a single column of the `DataFrame` like this, we get a `Series` object, which is just a 1-dimensional version of a `DataFrame`.
In [8]:
type(yes_rsvp_count)
Out[8]:
We can use the built-in `describe` method to print out a lot of useful stats in a nice tabular format:
In [9]:
yes_rsvp_count.describe()
Out[9]:
Next I'd like to graph the number of RSVPs over time to see if there are any interesting trends. To do this, let's first `sum` the `waitlist_count` and `yes_rsvp_count` columns and make a new column called `total_RSVP_count`.
In [10]:
events_df['total_RSVP_count'] = events_df['waitlist_count'] + events_df['yes_rsvp_count']
events_df['total_RSVP_count']
Out[10]:
We can plot these values using the `plot` method:
In [11]:
events_df['total_RSVP_count'].plot()
Out[11]:
The `plot` method utilizes the `matplotlib` library behind the scenes to draw the plot. This is interesting, but it would be nice to have the dates of the meetups on the X-axis of the plot.
To accomplish this, let's convert the `time` field from a Unix epoch timestamp to a Python `datetime` using the `apply` method and a function.
In [12]:
events_df.head(2)
Out[12]:
In [13]:
import datetime
def get_datetime_from_epoch(epoch):
    # Meetup timestamps are in milliseconds, so divide by 1000
    return datetime.datetime.fromtimestamp(epoch / 1000.0)
In [14]:
events_df['time'] = events_df['time'].apply(get_datetime_from_epoch)
events_df['time']
Out[14]:
Next let's make the `time` column the index of the `DataFrame` using the `set_index` method and then re-plot our data.
In [15]:
events_df.set_index('time', inplace=True)
events_df[['total_RSVP_count']].plot()
Out[15]:
We can also easily plot multiple columns on the same plot.
In [16]:
all_rsvps = events_df[['yes_rsvp_count', 'waitlist_count', 'total_RSVP_count']]
all_rsvps.plot(title='Attendance over time')
Out[16]:
Alright so I'm seeing some interesting trends here. Let's take a look at something different.
The Meetup API also provides us access to member info. Let's have a look at the data we have available:
In [17]:
members_df = pd.read_pickle('members.pkl')
for column in ['joined', 'visited']:
    members_df[column] = members_df[column].apply(get_datetime_from_epoch)
members_df.head(3)
Out[17]:
You'll notice that I've anonymized the Meetup member_id and the member's name. I've also used the Python module SexMachine to infer members' genders based on their first names. I ran SexMachine on the original names before I anonymized them.
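For reference, here's roughly what that inference step looks like, a minimal sketch based on SexMachine's documented API (the `first_name` column below is hypothetical, since the real names never made it into this dataset):

```python
# Sketch of the gender-inference step; SexMachine returns values like
# 'male', 'female', 'mostly_male', 'mostly_female', and 'andy' (androgynous).
import sexmachine.detector as gender

detector = gender.Detector(case_sensitive=False)
detector.get_gender(u'Bob')    # u'male'
detector.get_gender(u'Sally')  # u'female'

# Applied to a hypothetical pre-anonymization first_name column:
# members_df['gender'] = members_df['first_name'].apply(detector.get_gender)
```

Let's have a closer look at the gender breakdown of our members: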
In [18]:
gender_counts = members_df['gender'].value_counts()
gender_counts
Out[18]:
Next let's use the `hist` method to plot a histogram of `membership_count`. This is the number of groups each member is in.
In [19]:
members_df['membership_count'].hist(bins=20)
Out[19]:
Something looks odd here; let's check out the `value_counts`:
In [20]:
members_df['membership_count'].value_counts().head()
Out[20]:
Okay, so most members are members of 0 meetup groups?! This seems odd! I did a little digging and came up with the answer: members can set their membership details to be private, in which case this value will be zero. Let's filter out these members and recreate the histogram.
In [21]:
members_df_non_zero = members_df[members_df['membership_count'] != 0]
members_df_non_zero['membership_count'].hist(bins=50)
Out[21]:
Okay, so most members are only members of a few meetup groups. There are some outliers that are pretty hard to read; let's try plotting this on a logarithmic scale to see if that helps:
In [22]:
ax = members_df_non_zero['membership_count'].hist(bins=50)
ax.set_yscale('log')
ax.set_xlim(0, 500)
Out[22]:
Let's use a `mask` to filter out the outliers so we can dig into them a little further:
In [23]:
all_the_meetups = members_df[members_df['membership_count'] > 100]
filtered = all_the_meetups[['membership_count', 'city', 'country', 'state']]
filtered.sort_values(by='membership_count', ascending=False)
Out[23]:
The people from Philly might actually be legitimate members; let's use a compound mask to filter them out as well:
In [24]:
all_the_meetups = members_df[
    (members_df['membership_count'] > 100) & (members_df['city'] != 'Philadelphia')
]
filtered = all_the_meetups[['membership_count', 'city', 'country', 'state']]
filtered.sort_values(by='membership_count', ascending=False)
Out[24]:
That's strange; I don't think we've ever had any members from Berlin, San Francisco, or Jerusalem in attendance :-).
Moving on, we also have all the events that each member RSVPed to:
In [25]:
rsvps_df = pd.read_pickle('rsvps.pkl')
rsvps_df.head(3)
Out[25]:
We can utilize the pandas `merge` method to join our members `DataFrame` and our rsvps `DataFrame`:
In [26]:
joined_with_rsvps_df = pd.merge(members_df, rsvps_df, left_on='anon_id', right_on='member_id')
joined_with_rsvps_df.head(3)
Out[26]:
In [27]:
joined_with_rsvps_df.columns
Out[27]:
Now we have a ton of data; let's see what kind of interesting things we can discover. Let's look at some stats on male attendees vs. female attendees:
First we can use the `isin` method to make `DataFrame`s for male and female members.
In [28]:
male_attendees = joined_with_rsvps_df[joined_with_rsvps_df['gender'].isin(['male', 'mostly_male'])]
male_attendees.tail(3)
Out[28]:
In [29]:
female_attendees = joined_with_rsvps_df[joined_with_rsvps_df['gender'].isin(['female', 'mostly_female'])]
female_attendees.tail(3)
Out[29]:
Next we can use the `sum` method to count the number of male and female attendees per event and create a `Series` for each.
In [30]:
# Each event id is a column in the joined DataFrame; summing a column
# counts how many members in the frame RSVPed to that event
event_ids = [
'102502622', '106043892', '107740582', '120425212', '133803672', '138415912', '144769822', '149515412',
'160323532', '168747852', '175993712', '182860422', '206754182', '215265722', '219055217', '219840555',
'220526799', '221245827', '225488147', '89769502', '98833672'
]
male_attendees[event_ids].sum().head(3)
Out[30]:
We can then recombine the male and female `Series` into a new `DataFrame`.
In [31]:
gender_attendance = pd.DataFrame({'male': male_attendees[event_ids].sum(), 'female': female_attendees[event_ids].sum()})
gender_attendance.head(3)
Out[31]:
And then we can use `merge` again to combine this with our events `DataFrame`.
In [32]:
events_with_gender_df = pd.merge(events_df, gender_attendance, left_on='id', right_index=True)
events_with_gender_df.head(3)
Out[32]:
Then we can `plot` the attendance by gender over time:
In [33]:
gender_df = events_with_gender_df[['female', 'male']]
gender_df.plot(title='Attendance by gender over time')
Out[33]:
This might be easier to interpret by looking at the percentage of females in attendance. We can use the `div` (divide) method to calculate this.
In [34]:
female_ratio = gender_df['female'].div(gender_df['male'] + gender_df['female'])
female_ratio.plot(title='Percentage female attendance over time', ylim=(0.0, 1.0))
Out[34]:
The members `DataFrame` also has some other interesting stuff in it. Let's take a look at the `topics` column.
In [35]:
members_df['topics'].iloc[0]
Out[35]:
Let's see if we can identify any trends in members' topics. Let's start off by identifying the most common topics:
In [36]:
from collections import Counter
topic_counter = Counter()
for m in members_df['topics']:
    topic_counter.update([t['name'] for t in m])
topic_counter.most_common(20)
Out[36]:
Next let's create a new `DataFrame` where each column is one of the top 100 topics, and each row is a member. We'll set the values of each cell to either 0 or 1 to indicate that that member has (or doesn't have) that topic.
In [37]:
top_100_topics = set([t[0] for t in topic_counter.most_common(100)])
topic_member_map = {}
for i, m in members_df.iterrows():
    if m['topics']:
        top_topic_count = {}
        for topic in m['topics']:
            if topic['name'] in top_100_topics:
                top_topic_count[topic['name']] = 1
        topic_member_map[m['anon_id']] = top_topic_count
top_topic_df = pd.DataFrame(topic_member_map)
top_topic_df.head(3)
Out[37]:
Okay, for what I'm going to do next, I want the rows to be the members and the columns to be the topics. We can use the `T` (transpose) method to fix this.
In [38]:
top_topic_df = top_topic_df.T
top_topic_df.head(3)
Out[38]:
Next we can use the `fillna` method to fill in the missing values with zeros.
In [39]:
top_topic_df.fillna(0, inplace=True)
top_topic_df.head(3)
Out[39]:
Next let's use a clustering algorithm to see if there are any patterns in the topics members are interested in. A clustering algorithm groups a set of data points so that similar objects are in the same group; this is a classic type of unsupervised machine learning. The scikit-learn documentation includes visualizations of how different clustering algorithms perform on various kinds of data.
K-means clustering is quick and can scale well to larger datasets. Let's see how it performs on our dataset:
In [40]:
from sklearn.cluster import MiniBatchKMeans as KMeans
X = top_topic_df.values  # member-by-topic matrix of 0s and 1s
n_clusters = 3
k_means = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10, random_state=47)
k_means.fit(X)
k_means.labels_
Out[40]:
We've grouped our members into 3 clusters; let's see how many members are in each cluster:
In [41]:
Counter(list(k_means.labels_)).most_common()
Out[41]:
Next let's see which topics are most popular in each cluster:
In [42]:
from collections import defaultdict
cluster_index_map = defaultdict(list)
for i in range(k_means.labels_.shape[0]):
    cluster_index_map[k_means.labels_[i]].append(top_topic_df.index[i])

for cluster_num in range(n_clusters):
    print('Cluster {}'.format(cluster_num))
    f = top_topic_df[top_topic_df.index.isin(cluster_index_map[cluster_num])].sum()
    f2 = f[f > 0]
    f3 = f2.sort_values(ascending=False)
    print(f3[:10])
    print()
So it looks like our biggest cluster (#2) contains members whose primary interest is data science.
The second biggest cluster (#1) contains members whose primary interest is technology; data science just happens to be one of their interests.
The smallest cluster (#0) contains members whose primary interests are around socializing.
Based on this information we might be able to engage members in the "social" (#0) cluster by having more socially oriented events. We might be able to engage the members in cluster (#1) by having more events geared toward beginners.
Hopefully you learned a little bit about DataPhilly and doing exploratory analysis in Python. There's tons of extra data in our datasets that I don't even have time to get into today. If you feel like you missed anything and would like to revisit it, you can find this Notebook and instructions for how to use it in my GitHub repo http://github.com/mdbecker/. If you find something interesting in the data and you'd like to share it with me, I'm @beckerfuffle on Twitter, and you can always contact me through the DataPhilly Meetup page.