Pandas has become the de facto package for data analysis. In this workshop, we will use the basics of pandas to analyze the interests of today's group. We will use meetup.com's API to fetch the interests listed in each of our meetup.com profiles. We will then compute which interests are common and which are uncommon, and find out which two members have the most similar interests. Let's get started by importing the essentials.
In [ ]:
import meetup.api
import pandas as pd
from IPython.display import Image, display, HTML
from itertools import combinations
Next we need your meetup.com API key. You will find it at https://secure.meetup.com/meetup_api/key/ and we also need today's event id. The event id created under Chicago Pythonistas is 233460758 and the one under Chicago Python User Group is 236205125. Use whichever has more RSVPs so that you get more data points. As an additional exercise, you could merge the two sets of RSVPs, but that is not needed for the workshop.
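For the optional merging exercise, one possible approach is sketched below. It assumes each event's members have already been loaded into a data frame with an `id` column; the rows and the names `rsvps_a`, `rsvps_b` and `merged` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the member tables of the two events (made-up rows).
rsvps_a = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cy']})
rsvps_b = pd.DataFrame({'id': [3, 4], 'name': ['Cy', 'Dee']})

# Stack both tables, then keep one row per member id.
merged = (pd.concat([rsvps_a, rsvps_b], ignore_index=True)
            .drop_duplicates(subset='id')
            .reset_index(drop=True))

print(merged)
```

If the two tables carry different topic columns, `pd.concat` fills the missing cells with NaN, so a `fillna(0)` would probably be needed before counting.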
In [ ]:
API_KEY = '3f6d3275d3b6314e73453c4aa27'
event_id = '235484841'
The following functions use the API to load the data into a pandas data frame.
In [114]:
def get_members(event_id):
    # Fetch the RSVPs for the event, then look up each member's full profile.
    client = meetup.api.Client(API_KEY)
    rsvps = client.GetRsvps(event_id=event_id, urlname='_ChiPy_')
    member_id = ','.join([str(i['member']['member_id']) for i in rsvps.results])
    return client.GetMembers(member_id=member_id)


def get_topics(members):
    # Collect the distinct topic names across all members.
    topics = set()
    for member in members.results:
        try:
            for t in member['topics']:
                topics.add(t['name'])
        except KeyError:
            # Profile without a 'topics' entry: nothing to add.
            pass
    return list(topics)


def df_topics(event_id):
    # One row per member: name, id, photo link, then a 0/1 flag per topic.
    members = get_members(event_id=event_id)
    topics = get_topics(members)
    columns = ['name', 'id', 'thumb_link'] + topics
    data = []
    for member in members.results:
        topic_vector = [0] * len(topics)
        try:
            for topic in member['topics']:
                # topics.index() is already zero-based, so use it directly.
                topic_vector[topics.index(topic['name'])] = 1
            data.append([member['name'], member['id'], member['photo']['thumb_link']] + topic_vector)
        except KeyError:
            # Skip members that lack topics, a name, an id or a photo.
            pass
    return pd.DataFrame(data=data, columns=columns)

#df.to_csv('output.csv', sep=";")
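To see the shape of the table without calling the API, here is a minimal sketch of the same 0/1 encoding on made-up member records. The `members` list and the `toy` frame are hypothetical and only mimic the fields read above.

```python
import pandas as pd

# Made-up member records shaped like the fields the functions above read.
members = [
    {'name': 'Ann', 'id': 1, 'topics': [{'name': 'Python'}, {'name': 'Data'}]},
    {'name': 'Bob', 'id': 2, 'topics': [{'name': 'Python'}]},
    {'name': 'Cy', 'id': 3, 'topics': []},
]

# Sorted so the column order is reproducible (a plain set has no fixed order).
topics = sorted({t['name'] for m in members for t in m['topics']})

rows = []
for m in members:
    own = {t['name'] for t in m['topics']}
    # 1 if the member lists the topic, 0 otherwise.
    rows.append([m['name'], m['id']] + [int(t in own) for t in topics])

toy = pd.DataFrame(rows, columns=['name', 'id'] + topics)
print(toy)
```

Sorting the topic names is a design choice of this sketch; it makes column slices such as "from the first topic onward" predictable from run to run.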
Call the df_topics function with the event id and it will return a pandas data frame.
In [204]:
df = df_topics(event_id='235484841')
df.head(n=10)
Out[204]:
In [205]:
df.loc[:, 'Cloud Deployment':].sum().nlargest(10)
Out[205]:
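The cell above counts how many members list each topic and keeps the ten largest counts. A small sketch of the same idea on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'Python': [1, 1, 1],
    'Data': [1, 0, 1],
    'Cloud': [0, 0, 1],
})

# Slice from the first topic column onward, then count the 1s per column.
counts = toy.loc[:, 'Python':].sum()

# Keep the two most common topics.
top2 = counts.nlargest(2)
print(top2)
```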
In [206]:
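# Topic columns only; this assumes 'Cloud Deployment' is the first topic column.
_df = df.loc[:, 'Cloud Deployment':]
import numpy as np
s = _df.sum()
# Rank the topic counts, then take the last of the three top-ranked
# (and of the three bottom-ranked) topics.
most_popular = s.sort_values(ascending=False).rank(ascending=False).nsmallest(3).keys()[-1]
least_popular = s.sort_values().rank().nsmallest(3).keys()[-1]
print('{} {}'.format(most_popular, s[most_popular]))
print('{} {}'.format(least_popular, s[least_popular]))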
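If only the single most and least common topics are needed, `idxmax` and `idxmin` are a shorter alternative to ranking. This is a sketch on made-up counts, not the approach used in the cell above; on ties both methods return the first matching label.

```python
import pandas as pd

# Made-up topic counts.
counts = pd.Series({'Python': 3, 'Data': 2, 'Cloud': 1})

top = counts.idxmax()      # label of the largest count
bottom = counts.idxmin()   # label of the smallest count

print('{}: {}'.format(top, counts[top]))
print('{}: {}'.format(bottom, counts[bottom]))
```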
In [207]:
df[df[most_popular]==1][['name', most_popular]]
Out[207]:
In [208]:
df[df[least_popular]==1][['name', least_popular]]
Out[208]:
In [209]:
basic_details = df.loc[:, :'thumb_link']
# Number of topics each member lists.
df['total'] = _df.sum(axis=1)
print(max(df['total']))
df[['name', 'total']].sort_values(by='total', ascending=False)
# Member(s) with the most topics.
df[df.total == max(df['total'])][['name', 'total']]
Out[209]:
In [210]:
print(df['total'].mean())
In [218]:
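# Assumption: `frame` is df indexed by member name, so rows can be looked up by name.
frame = df.set_index('name')
cc = list(combinations(df['name'], 2))
# For each pair, multiply the two members' 0/1 rows: 1 only where both list the topic.
out = pd.DataFrame([frame.loc[list(c), 'Cloud Deployment':'Artificial Intelligence'].product() for c in cc], index=cc)
# Shared-topic count per pair, most similar first.
print(out.sum(axis=1).sort_values(ascending=False))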
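The pairwise comparison can be hard to follow on the full table, so here is a minimal sketch on made-up data. It counts shared topics for every pair of members by multiplying their 0/1 rows; the `toy` frame, its index and the name `shared` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Made-up 0/1 topic flags, indexed by member name.
toy = pd.DataFrame(
    {'Python': [1, 1, 0], 'Data': [1, 1, 1], 'Cloud': [0, 1, 1]},
    index=['Ann', 'Bob', 'Cy'],
)

# For each pair, multiply the two rows element-wise; a column contributes 1
# only when both members list that topic. The sum is the shared-topic count.
shared = {
    (a, b): int((toy.loc[a] * toy.loc[b]).sum())
    for a, b in combinations(toy.index, 2)
}

# Most similar pairs first.
ranking = pd.Series(shared).sort_values(ascending=False)
print(ranking)
```

This lookup by name assumes member names are unique. With duplicate names, indexing by member id would be the safer choice.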