Data Analysis using Pandas

Pandas has become the defacto package for data analysis. In this workshop, we are going to use the basics of pandas to analyze the interests of today's group. We are going to use meetup.com's api and fetch the list of interests that are listed in each of our meetup.com profile. We will compute which interests are common, which are uncommon, and find out which of the two members have most similar interests. Lets get started by importing the essentials.



In [ ]:

    
import meetup.api
import pandas as pd
from IPython.display import Image, display, HTML
from itertools import combinations

Next we need your meetup.com API. You will find it https://secure.meetup.com/meetup_api/key/ Also we need today's event id. The event id created under Chicago Pythonistas is 233460758 and that under Chicago Python user group is 236205125. Use the one that has the higher number of RSVPs so that you get more data points. As an additional exercise, you might go for merging the two sets of RSVPs - but that's not needed for the workshop.



In [ ]:

    
API_KEY = '3f6d3275d3b6314e73453c4aa27'
event_id='235484841'

The following function uses the api and loads the data into a pandas data frame.



In [114]:

    
def get_members(event_id):
    client = meetup.api.Client(API_KEY)
    rsvps=client.GetRsvps(event_id=event_id, urlname='_ChiPy_')
    member_id = ','.join([str(i['member']['member_id']) for i in rsvps.results])
    return client.GetMembers(member_id=member_id)

def get_topics(members):
    topics = set()
    for member in members.results:
        try:
            for t in member['topics']:
                topics.add(t['name'])
        except:
            pass

    return list(topics)

def df_topics(event_id):
    members = get_members(event_id=event_id)
    topics = get_topics(members)
    columns=['name','id','thumb_link'] + topics
    
    data = [] 
    for member in members.results:
        topic_vector = [0]*len(topics)
        for topic in member['topics']:
            index = topics.index(topic['name'])        
            topic_vector[index-1] = 1
        try:
            data.append([member['name'], member['id'], member['photo']['thumb_link']] + topic_vector)
        except:
            pass
    return pd.DataFrame(data=data, columns=columns)
    
    #df.to_csv('output.csv', sep=";")

So you need to call the df_topics function with the event id and it would give you back a pandas dataframe.

Load data from meetup.com into a dataframe by calling df_topics



In [204]:

    
df = df_topics(event_id='235484841')
df.head(n=10)









    



29/30 (10 seconds remaining)
28/30 (10 seconds remaining)






    Out[204]:






  
    
      
      name
      id
      thumb_link
      Cloud Deployment
      NoSQL
      Mobile Technology
      Dungeons & Dragons
      Open Data
      Entrepreneurship
      Science Fiction
      ...
      Business Analytics
      Cat Rescue
      Critical Thinking
      Agnostic
      Mobile Web
      Asian Professionals
      Virtualization
      National Politics
      Flamenco
      Artificial Intelligence
    
  
  
    
      0
      abhishek kumar
      186173861
      http://photos4.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      1
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      Aile Oleghe
      209272270
      http://photos1.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      Alexandria
      189631525
      http://photos2.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      Alfredo Nava
      146641722
      http://photos1.meetupstatic.com/photos/member/...
      0
      1
      0
      0
      1
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      1
      0
    
    
      4
      Amy Lehman
      122663532
      http://photos2.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      5
      Anish
      150646522
      http://photos3.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      6
      Ashley
      36278932
      http://photos2.meetupstatic.com/photos/member/...
      0
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      7
      Bee DC
      146477422
      http://photos4.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      1
      0
      0
      0
      0
      0
      0
      1
      0
    
    
      8
      Chris Wight
      212477108
      http://photos1.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      9
      Dan Temkin
      209909313
      http://photos1.meetupstatic.com/photos/member/...
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

10 rows × 514 columns

What does the first and last 10 rows of the dataset look like?

What are the column names?

Additional Exercise: Can you merge the two data for two events into one data frame and remove the dups?

What are the top 10 most common interests of today’s attendees?



In [205]:

    
df.ix[:, 'Cloud Deployment':].sum().nlargest(10)









    Out[205]:





Using MongoDB in the cloud    21
CSS                           19
Chinese Language              19
Food Photography              18
Hacking                       15
House Music                   13
Kanban                        13
Android Development           12
Self Exploration              11
IBM                           11
dtype: int64

What is the third most popular and third least popular topic of interest? Are there ties?



In [206]:

    
_df = df.ix[:, 'Cloud Deployment':]
import numpy as np
s=_df.sum()
most_popular = s.order(ascending=False).rank(ascending=False).nsmallest(3).keys()[-1]
least_popular = s.order().rank().nsmallest(3).keys()[-1]
print most_popular, s[most_popular] 
print least_popular, s[least_popular]









    



Chinese Language 19
Cloud Deployment 1

Which members have the third most popular interest?



In [207]:

    
df[df[most_popular]==1][['name', most_popular]]









    Out[207]:






  
    
      
      name
      Chinese Language
    
  
  
    
      1
      Aile Oleghe
      1
    
    
      3
      Alfredo Nava
      1
    
    
      6
      Ashley
      1
    
    
      7
      Bee DC
      1
    
    
      9
      Dan Temkin
      1
    
    
      10
      David Locke
      1
    
    
      11
      David Matsumura
      1
    
    
      12
      Dawn M Graunke
      1
    
    
      18
      frank
      1
    
    
      19
      Govind G Nair
      1
    
    
      22
      Jason Wirth
      1
    
    
      23
      Jayna Kehres
      1
    
    
      28
      Kem
      1
    
    
      31
      Lauren
      1
    
    
      34
      Matthew Green
      1
    
    
      46
      Suz D
      1
    
    
      47
      Tathagata Dasgupta
      1
    
    
      49
      Trevor
      1
    
    
      51
      Virginia
      1

Which members have the third most popular interest?



In [208]:

    
df[df[least_popular]==1][['name', least_popular]]









    Out[208]:






  
    
      
      name
      Cloud Deployment
    
  
  
    
      47
      Tathagata Dasgupta
      1

Which memebers have the highest number of topics of interest?



In [209]:

    
basic_details = df.ix[:, :'thumb_link']
df['total']= _df.sum(axis=1)
print max(df['total'])
df[['name', 'total']].sort(columns=['total'], ascending=False)
df[df.total == max(df['total'])][['name', 'total']]









    



50






    Out[209]:






  
    
      
      name
      total
    
  
  
    
      3
      Alfredo Nava
      50
    
    
      7
      Bee DC
      50
    
    
      30
      Lamar Smith
      50
    
    
      32
      MangoDriver
      50
    
    
      33
      Matt Hall
      50
    
    
      50
      Venkata sivanaga saisuvarna kris
      50

What is the average number of topics of interest?



In [210]:

    
print df['total'].mean()









    



19.9090909091

Which two members have the most common overlap of interests?



In [218]:

    
cc = list(combinations(df['name'],2))
out = pd.DataFrame([frame.loc[c,'Cloud Deployment':'Artificial Intelligence'].product() for c in cc], index=cc)
print out.sum(axis=1).order(ascending=False)









    



(Alfredo Nava, Ashley)                          15
(Bee DC, Govind G Nair)                         12
(Ashley, Raymond)                               11
(Jaimie Catoe, Raymond)                         11
(Bee DC, Raymond)                               11
(Ashley, Nikhil Sharma)                         11
(Alfredo Nava, Lamar Smith)                     11
(Lamar Smith, MangoDriver)                      11
(Alfredo Nava, Tathagata Dasgupta)              11
(Lamar Smith, Nikhil Sharma)                    11
(Nikhil Sharma, Raymond)                        10
(Alfredo Nava, Bee DC)                          10
(Jaimie Catoe, Jayna Kehres)                    10
(MangoDriver, Raymond)                          10
(Ashley, MangoDriver)                           10
(Jason Wirth, Jennifer Joo)                     10
(Ashley, Lamar Smith)                           10
(Bee DC, Jason Wirth)                           10
(Alfredo Nava, Jason Wirth)                     10
(abhishek kumar, Smitha Shivakumar)              9
(abhishek kumar, Raymond)                        9
(Ashley, Bee DC)                                 9
(Jayna Kehres, MangoDriver)                      9
(Julia Poncela-Casasnovas, Raymond)              9
(Alfredo Nava, Matt Hall)                        9
(Jaimie Catoe, MangoDriver)                      9
(Ashley, Jason Wirth)                            9
(Lamar Smith, Raymond)                           9
(Bee DC, Jennifer Joo)                           9
(Jaimie Catoe, Lamar Smith)                      9
                                                ..
(Ellie A., Jeff)                                 0
(Ellie A., Jennifer Joo)                         0
(Ellie A., Viseth Sen)                           0
(Ellie A., Virginia)                             0
(Ellie A., Venkata sivanaga saisuvarna kris)     0
(Ellie A., Trevor)                               0
(Ellie A., Teja Kodali)                          0
(Ellie A., Tathagata Dasgupta)                   0
(Ellie A., Suz D)                                0
(Ellie A., Smitha Shivakumar)                    0
(Ellie A., Rob Creel)                            0
(Ellie A., Rob)                                  0
(Ellie A., Raymond)                              0
(Ellie A., Patrick Boland)                       0
(Ellie A., Parfait)                              0
(Ellie A., Nikhil Sharma)                        0
(Ellie A., Nicole Carpenter)                     0
(Ellie A., Nick Hattwick)                        0
(Ellie A., Nicholas Kincaid)                     0
(Ellie A., Michael Ward)                         0
(Ellie A., Matthew Green)                        0
(Ellie A., Matt Hall)                            0
(Ellie A., MangoDriver)                          0
(Ellie A., Lauren)                               0
(Ellie A., Lamar Smith)                          0
(Ellie A., Kishon McCormick)                     0
(Ellie A., Kem)                                  0
(Ellie A., Julia Poncela-Casasnovas)             0
(Ellie A., Jordan Dietch)                        0
(Elizabeth Carroll, Will Fuger)                  0
dtype: int64

Data Analysis using Pandas

Load data from meetup.com into a dataframe by calling df_topics

What does the first and last 10 rows of the dataset look like?

What are the column names?

Additional Exercise: Can you merge the two data for two events into one data frame and remove the dups?

What are the top 10 most common interests of today’s attendees?

What is the third most popular and third least popular topic of interest? Are there ties?

Which members have the third most popular interest?

Which members have the third most popular interest?

Which memebers have the highest number of topics of interest?

What is the average number of topics of interest?

Which two members have the most common overlap of interests?

How many members are there who have no overlaps at all?

Given a member which other member(s) have the most common interests?

	name	id	thumb_link	NoSQL	Open Data	Science Fiction	...	Cat Rescue	Flamenco
0	abhishek kumar	186173861	http://photos4.meetupstatic.com/photos/member/...	0	0	1	...	0	0
1	Aile Oleghe	209272270	http://photos1.meetupstatic.com/photos/member/...	0	0	0	...	0	0
2	Alexandria	189631525	http://photos2.meetupstatic.com/photos/member/...	0	0	0	...	0	0
3	Alfredo Nava	146641722	http://photos1.meetupstatic.com/photos/member/...	1	1	0	...	0	1
4	Amy Lehman	122663532	http://photos2.meetupstatic.com/photos/member/...	0	0	0	...	0	0
5	Anish	150646522	http://photos3.meetupstatic.com/photos/member/...	0	0	0	...	0	0
6	Ashley	36278932	http://photos2.meetupstatic.com/photos/member/...	1	0	0	...	0	0
7	Bee DC	146477422	http://photos4.meetupstatic.com/photos/member/...	0	0	0	...	1	1
8	Chris Wight	212477108	http://photos1.meetupstatic.com/photos/member/...	0	0	0	...	0	0
9	Dan Temkin	209909313	http://photos1.meetupstatic.com/photos/member/...	0	0	0	...	0	0

	name	total
3	Alfredo Nava	50
7	Bee DC	50
30	Lamar Smith	50
32	MangoDriver	50
33	Matt Hall	50
50	Venkata sivanaga saisuvarna kris	50