Data Analysis using Pandas

Pandas has become the defacto package for data analysis. In this workshop, we are going to use the basics of pandas to analyze the interests of today's group. We are going to use meetup.com's api and fetch the list of interests that are listed in each of our meetup.com profile. We will compute which interests are common, which are uncommon, and find out which of the two members have most similar interests. Lets get started by importing the essentials.


In [ ]:
import meetup.api
import pandas as pd
from IPython.display import Image, display, HTML
from itertools import combinations

Next we need your meetup.com API. You will find it https://secure.meetup.com/meetup_api/key/ Also we need today's event id. The event id created under Chicago Pythonistas is 233460758 and that under Chicago Python user group is 236205125. Use the one that has the higher number of RSVPs so that you get more data points. As an additional exercise, you might go for merging the two sets of RSVPs - but that's not needed for the workshop.


In [ ]:
API_KEY = '3f6d3275d3b6314e73453c4aa27'
event_id='235484841'

The following function uses the api and loads the data into a pandas data frame.


In [114]:
def get_members(event_id):
    client = meetup.api.Client(API_KEY)
    rsvps=client.GetRsvps(event_id=event_id, urlname='_ChiPy_')
    member_id = ','.join([str(i['member']['member_id']) for i in rsvps.results])
    return client.GetMembers(member_id=member_id)

def get_topics(members):
    topics = set()
    for member in members.results:
        try:
            for t in member['topics']:
                topics.add(t['name'])
        except:
            pass

    return list(topics)

def df_topics(event_id):
    members = get_members(event_id=event_id)
    topics = get_topics(members)
    columns=['name','id','thumb_link'] + topics
    
    data = [] 
    for member in members.results:
        topic_vector = [0]*len(topics)
        for topic in member['topics']:
            index = topics.index(topic['name'])        
            topic_vector[index-1] = 1
        try:
            data.append([member['name'], member['id'], member['photo']['thumb_link']] + topic_vector)
        except:
            pass
    return pd.DataFrame(data=data, columns=columns)
    
    #df.to_csv('output.csv', sep=";")

So you need to call the df_topics function with the event id and it would give you back a pandas dataframe.

Load data from meetup.com into a dataframe by calling df_topics


In [204]:
df = df_topics(event_id='235484841')
df.head(n=10)


29/30 (10 seconds remaining)
28/30 (10 seconds remaining)
Out[204]:
name id thumb_link Cloud Deployment NoSQL Mobile Technology Dungeons & Dragons Open Data Entrepreneurship Science Fiction ... Business Analytics Cat Rescue Critical Thinking Agnostic Mobile Web Asian Professionals Virtualization National Politics Flamenco Artificial Intelligence
0 abhishek kumar 186173861 http://photos4.meetupstatic.com/photos/member/... 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
1 Aile Oleghe 209272270 http://photos1.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Alexandria 189631525 http://photos2.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Alfredo Nava 146641722 http://photos1.meetupstatic.com/photos/member/... 0 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0
4 Amy Lehman 122663532 http://photos2.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 Anish 150646522 http://photos3.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 Ashley 36278932 http://photos2.meetupstatic.com/photos/member/... 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 Bee DC 146477422 http://photos4.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0
8 Chris Wight 212477108 http://photos1.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 Dan Temkin 209909313 http://photos1.meetupstatic.com/photos/member/... 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 514 columns

What does the first and last 10 rows of the dataset look like?

What are the column names?

Additional Exercise: Can you merge the two data for two events into one data frame and remove the dups?

What are the top 10 most common interests of today’s attendees?


In [205]:
df.ix[:, 'Cloud Deployment':].sum().nlargest(10)


Out[205]:
Using MongoDB in the cloud    21
CSS                           19
Chinese Language              19
Food Photography              18
Hacking                       15
House Music                   13
Kanban                        13
Android Development           12
Self Exploration              11
IBM                           11
dtype: int64

In [206]:
_df = df.ix[:, 'Cloud Deployment':]
import numpy as np
s=_df.sum()
most_popular = s.order(ascending=False).rank(ascending=False).nsmallest(3).keys()[-1]
least_popular = s.order().rank().nsmallest(3).keys()[-1]
print most_popular, s[most_popular] 
print least_popular, s[least_popular]


Chinese Language 19
Cloud Deployment 1

In [207]:
df[df[most_popular]==1][['name', most_popular]]


Out[207]:
name Chinese Language
1 Aile Oleghe 1
3 Alfredo Nava 1
6 Ashley 1
7 Bee DC 1
9 Dan Temkin 1
10 David Locke 1
11 David Matsumura 1
12 Dawn M Graunke 1
18 frank 1
19 Govind G Nair 1
22 Jason Wirth 1
23 Jayna Kehres 1
28 Kem 1
31 Lauren 1
34 Matthew Green 1
46 Suz D 1
47 Tathagata Dasgupta 1
49 Trevor 1
51 Virginia 1

In [208]:
df[df[least_popular]==1][['name', least_popular]]


Out[208]:
name Cloud Deployment
47 Tathagata Dasgupta 1

Which memebers have the highest number of topics of interest?


In [209]:
basic_details = df.ix[:, :'thumb_link']
df['total']= _df.sum(axis=1)
print max(df['total'])
df[['name', 'total']].sort(columns=['total'], ascending=False)
df[df.total == max(df['total'])][['name', 'total']]


50
Out[209]:
name total
3 Alfredo Nava 50
7 Bee DC 50
30 Lamar Smith 50
32 MangoDriver 50
33 Matt Hall 50
50 Venkata sivanaga saisuvarna kris 50

What is the average number of topics of interest?


In [210]:
print df['total'].mean()


19.9090909091

Which two members have the most common overlap of interests?


In [218]:
cc = list(combinations(df['name'],2))
out = pd.DataFrame([frame.loc[c,'Cloud Deployment':'Artificial Intelligence'].product() for c in cc], index=cc)
print out.sum(axis=1).order(ascending=False)


(Alfredo Nava, Ashley)                          15
(Bee DC, Govind G Nair)                         12
(Ashley, Raymond)                               11
(Jaimie Catoe, Raymond)                         11
(Bee DC, Raymond)                               11
(Ashley, Nikhil Sharma)                         11
(Alfredo Nava, Lamar Smith)                     11
(Lamar Smith, MangoDriver)                      11
(Alfredo Nava, Tathagata Dasgupta)              11
(Lamar Smith, Nikhil Sharma)                    11
(Nikhil Sharma, Raymond)                        10
(Alfredo Nava, Bee DC)                          10
(Jaimie Catoe, Jayna Kehres)                    10
(MangoDriver, Raymond)                          10
(Ashley, MangoDriver)                           10
(Jason Wirth, Jennifer Joo)                     10
(Ashley, Lamar Smith)                           10
(Bee DC, Jason Wirth)                           10
(Alfredo Nava, Jason Wirth)                     10
(abhishek kumar, Smitha Shivakumar)              9
(abhishek kumar, Raymond)                        9
(Ashley, Bee DC)                                 9
(Jayna Kehres, MangoDriver)                      9
(Julia Poncela-Casasnovas, Raymond)              9
(Alfredo Nava, Matt Hall)                        9
(Jaimie Catoe, MangoDriver)                      9
(Ashley, Jason Wirth)                            9
(Lamar Smith, Raymond)                           9
(Bee DC, Jennifer Joo)                           9
(Jaimie Catoe, Lamar Smith)                      9
                                                ..
(Ellie A., Jeff)                                 0
(Ellie A., Jennifer Joo)                         0
(Ellie A., Viseth Sen)                           0
(Ellie A., Virginia)                             0
(Ellie A., Venkata sivanaga saisuvarna kris)     0
(Ellie A., Trevor)                               0
(Ellie A., Teja Kodali)                          0
(Ellie A., Tathagata Dasgupta)                   0
(Ellie A., Suz D)                                0
(Ellie A., Smitha Shivakumar)                    0
(Ellie A., Rob Creel)                            0
(Ellie A., Rob)                                  0
(Ellie A., Raymond)                              0
(Ellie A., Patrick Boland)                       0
(Ellie A., Parfait)                              0
(Ellie A., Nikhil Sharma)                        0
(Ellie A., Nicole Carpenter)                     0
(Ellie A., Nick Hattwick)                        0
(Ellie A., Nicholas Kincaid)                     0
(Ellie A., Michael Ward)                         0
(Ellie A., Matthew Green)                        0
(Ellie A., Matt Hall)                            0
(Ellie A., MangoDriver)                          0
(Ellie A., Lauren)                               0
(Ellie A., Lamar Smith)                          0
(Ellie A., Kishon McCormick)                     0
(Ellie A., Kem)                                  0
(Ellie A., Julia Poncela-Casasnovas)             0
(Ellie A., Jordan Dietch)                        0
(Elizabeth Carroll, Will Fuger)                  0
dtype: int64

How many members are there who have no overlaps at all?

Given a member which other member(s) have the most common interests?