Welcome to our team project for Data Science. In this exercise we will build on the foundations laid in our last two team projects, where we devised an algorithm to group attendees of Python Project Night and built a web-based roster app that can use the algorithm for grouping.

What you will learn in this team project

This project gives you a gentle introduction to handling data with pandas and to using a third-party machine learning SaaS API for image recognition.

Problem Definition

"Diversity is the engine of invention." Justin Trudeau, 2016

Diversity in tech communities is a widely discussed topic. As one of the most active tech communities in the world, we will try in this exercise to measure some aspects of diversity in the tech community. We will use image recognition on the meetup.com profile pictures of the members of the ChiPy user group to estimate how diverse our attendees are. Then we will compare the results with other tech groups in the city and around the world.

Before we go further

The first proof-of-concept implementation did not take long, and that concerned me. If it is so easy to build tools that can potentially be abused or misinterpreted, we need to think through the implications of the tools we build. So if you are concerned, we are on the same page. There is a question around the middle of the project to address this.

Just the beginning

Note that the approach used here is a crude first step and is not without flaws. Like all software, what you will build is incomplete and needs a lot of refinement (that's why this is open source!) before we can get comprehensive results. So take the initial results of your analysis with a copious amount of salt.

For this project, we are going to look at just one facet of diversity - gender diversity of the members.

Setting up your environment

  • You should already have Python 3 installed on your computer. You can download it from here.
  • Instructions to install Jupyter Notebooks
  • Instructions to install dependencies. Executing the cell below should install all the dependencies you need.

In [ ]:
!pip3 install meetup-api pandas pytest matplotlib clarifai

This part of the exercise is straight from the previous team project. We use the meetup.com API to get the ChiPy members who RSVP-ed for one event.


In [1]:
import meetup.api
import pandas as pd


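API_KEY = ''   # your meetup.com API key
event_id = ''  # id of the event whose RSVPs you want to load


def get_members(event_id):
    """Return the member profiles of everyone who RSVP-ed to the event."""
    client = meetup.api.Client(API_KEY)
    rsvps = client.GetRsvps(event_id=event_id, urlname='_ChiPy_')
    # GetMembers takes a comma-separated string of member ids
    member_id = ','.join([str(i['member']['member_id']) for i in rsvps.results])
    return client.GetMembers(member_id=member_id)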

Now let's load the data into a pandas DataFrame.


In [118]:
def load_members_to_data_frame(event_id):
    members = get_members(event_id=event_id)
    columns = ['name', 'id', 'thumb_link']

    data = []
    for member in members.results:
        try:
            data.append([member['name'], member['id'], member['photo']['thumb_link']])
        except KeyError:
            # profiles without a photo (or with other missing fields) are skipped
            print('Discard incomplete profile')
    return pd.DataFrame(data=data, columns=columns)

df = load_members_to_data_frame(event_id=event_id)


29/30 (10 seconds remaining)
28/30 (10 seconds remaining)
Discard incomplete profile
Discard incomplete profile
Discard incomplete profile
Discard incomplete profile
Discard incomplete profile

What do the first and last 10 rows of the dataset look like?
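
As a hint, here is a minimal sketch of how pandas exposes the first and last rows of a frame. The toy dataframe below is a made-up stand-in with the same columns as the members dataframe, so that the example runs without any API keys; in the notebook you would call the same methods on your own df.

```python
import pandas as pd

# Toy stand-in for the members dataframe; names and links are made up.
df = pd.DataFrame({
    'name': ['member_{}'.format(i) for i in range(25)],
    'id': list(range(25)),
    'thumb_link': ['https://example.com/{}.jpg'.format(i) for i in range(25)],
})

first_ten = df.head(10)  # rows 0 through 9
last_ten = df.tail(10)   # rows 15 through 24
print(first_ten)
print(last_ten)
```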


In [ ]:

Next we introduce Clarifai. It is a powerful image recognition as a service platform.

Signing up is very easy.

From the Clarifai API docs:

The API is built around a simple idea. You send inputs (images) to the service and it returns predictions.

The type of prediction is based on what model you run the input through. For example, if you run your input through the 'food' model, the predictions it returns will contain concepts that the 'food' model knows about. If you run your input through the 'color' model, it will return predictions about the dominant colors in your image.

Input Output:

Here are the rest of the docs if you need them.


In [6]:
from clarifai.rest import ClarifaiApp

client_id, client_secret = '', ''  # your keys here


def analyze_image(url):
    app = ClarifaiApp(client_id, client_secret)
    model = app.models.get("general-v1.3")
    return model.predict_by_url(url=url)

Test analyze_image with - http://bit.ly/2s3rxWD


In [ ]:

Test analyze_image with - http://bit.ly/2t4aKkO


In [ ]:

Implement a function get_concepts_from_image that prints just the list of (concept, value) tuples.

Your output should look like:

[('people', 0.9814924), ('woman', 0.9796125), ('adult', 0.9717163), ('one', 0.9707799)]
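
One possible starting point is sketched below. It assumes the prediction comes back as a nested dictionary with the concepts under outputs[0]['data']['concepts'], each carrying a name and a value; check that against what analyze_image actually returns for you. The helper name concepts_from_response and the hand-written, heavily trimmed sample_response are illustrative only, so the example runs without calling the service.

```python
def concepts_from_response(response):
    """Return [(name, value), ...] from a Clarifai-style predict response.

    Assumes the layout response['outputs'][0]['data']['concepts'].
    """
    concepts = response['outputs'][0]['data']['concepts']
    return [(c['name'], c['value']) for c in concepts]


# Hand-written, trimmed response used only for illustration.
sample_response = {
    'outputs': [{
        'data': {
            'concepts': [
                {'name': 'people', 'value': 0.9814924},
                {'name': 'woman', 'value': 0.9796125},
                {'name': 'adult', 'value': 0.9717163},
                {'name': 'one', 'value': 0.9707799},
            ]
        }
    }]
}

print(concepts_from_response(sample_response))
```

In the notebook, get_concepts_from_image could pass the result of analyze_image(url) to a helper like this one.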


In [ ]:

Using a few more examples, look at the different concepts returned. What are the most common concepts for a man? For a woman? What do they share, and where do they differ?


In [ ]:

Implement determine_gender

Clarifai will return a number of concepts, each with a value indicating how confident it is in that prediction. If it can tell whether the picture shows a man or a woman, the returned concepts will include man or woman. They might include boy or girl as well.


In [2]:
def determine_gender(url):
    return 'M'

assert determine_gender('iron_man') == 'M'

Test determine_gender function

Test out your implementation of determine_gender with the profile pictures of your team members. Refine your algorithm based on your results. Some people like to have cats or pandas as their profile pictures. Think of a strategy for handling situations like that.
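
One possible strategy, sketched under assumptions: take the strongest score among the male concepts and among the female concepts, and refuse to guess when neither clears a confidence threshold. The 'U' (unknown) label, the 0.9 threshold, and the helper name gender_from_concepts are illustrative choices, not part of the original exercise. The function works on a list of (concept, value) tuples, so it can be tried without calling the API.

```python
MALE_CONCEPTS = {'man', 'boy'}
FEMALE_CONCEPTS = {'woman', 'girl'}


def gender_from_concepts(concepts, threshold=0.9):
    """Return 'M', 'F', or 'U' (unknown) for a list of (name, value) tuples."""
    male = max((v for n, v in concepts if n in MALE_CONCEPTS), default=0.0)
    female = max((v for n, v in concepts if n in FEMALE_CONCEPTS), default=0.0)
    # No confident signal (a cat picture, a logo) or a tie: do not guess.
    if max(male, female) < threshold or male == female:
        return 'U'
    return 'M' if male > female else 'F'


print(gender_from_concepts([('people', 0.98), ('woman', 0.97), ('adult', 0.97)]))
print(gender_from_concepts([('cat', 0.99), ('cute', 0.95)]))
```

Keeping an explicit unknown bucket makes it visible how many profiles the classifier could not handle, instead of silently counting them in one group.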

Before we bring the pieces together, we need to do a little bit of refining so that we can evaluate our results visually.

We will use IPython's HTML display features by inserting the thumb_link URLs into HTML img tags. Note that the cell below mutates the dataframe itself, so if you execute it more than once it may malform the img tags and the images will not render correctly.


In [ ]:
from IPython.display import Image, display, HTML
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older versions used -1)

df['pic']=df.thumb_link.map(lambda x:'<img src="{0}" height=80 width=80 />'.format(x))
HTML(df[['name','pic']].to_html(escape=False))

Now that we have a visual way of evaluating the results, let's apply your determine_gender function to the list of attendees.

Apply determine_gender to your data frame and display image and gender next to each other
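
A minimal sketch of the mechanics, assuming pandas 1.0 or newer. The determine_gender below is a placeholder that only inspects the URL so that the example runs offline, and the two-row dataframe is made up; in the notebook you would use your real function and the members dataframe.

```python
import pandas as pd

pd.set_option('display.max_colwidth', None)  # keep long img tags intact


def determine_gender(url):
    # Placeholder so the example runs offline; swap in your implementation.
    return 'F' if 'alice' in url else 'M'


df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'thumb_link': ['https://example.com/alice.jpg',
                   'https://example.com/bob.jpg'],
})

# Build the img tag and the prediction from the same source column.
df['pic'] = df['thumb_link'].map(
    lambda x: '<img src="{0}" height=80 width=80 />'.format(x))
df['gender'] = df['thumb_link'].map(determine_gender)

html = df[['name', 'pic', 'gender']].to_html(escape=False)
# In a notebook cell, wrap it for display: HTML(html)
```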


In [ ]:

Let's take a look at the results.

Compare your determine_gender function result with randomly generated results

How good are your results? Are they any better than flipping a coin? Go back and tune determine_gender if needed.
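
One way to make the comparison concrete is to label a handful of profiles by hand and measure how often each method agrees with those labels. The sketch below uses made-up labels and predictions purely to show the mechanics; the accuracy helper and the variable names are illustrative.

```python
import random


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the hand label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Made-up data: hand labels for eight profiles and a classifier's guesses.
hand_labels = ['M', 'F', 'M', 'M', 'F', 'M', 'F', 'M']
predictions = ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'M']

# Coin-flip baseline; a seeded generator keeps the run repeatable.
rng = random.Random(0)
coin_flips = [rng.choice('MF') for _ in hand_labels]

print('classifier:', accuracy(predictions, hand_labels))
print('coin flip: ', accuracy(coin_flips, hand_labels))
```

With so few samples a single coin-flip run is noisy, so it is worth repeating the baseline several times before drawing conclusions.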


In [ ]:

Get the counts

What are the counts of male vs female attendees today?
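
If the predictions live in a dataframe column, value_counts gives the tally directly. The series below is made-up sample data, and the 'U' label for unclassified profiles is an assumption rather than something the exercise prescribes.

```python
import pandas as pd

# Made-up predictions; 'U' marks profiles that could not be classified.
genders = pd.Series(['M', 'F', 'M', 'M', 'U', 'F'], name='gender')

counts = genders.value_counts()
print(counts)
```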


In [ ]:

Putting it together

Now implement a function that takes a meetup.com event id and gives back the male and female counts of the RSVPs.
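
A sketch of one way to wire the pieces together. The loader and the classifier are passed in as parameters (a hypothetical design choice) so that the example can run with fake stand-ins instead of real API calls; gender_counts, fake_loader, and fake_classifier are illustrative names.

```python
from collections import Counter


def gender_counts(event_id, load_thumb_links, classify):
    """Count the labels that classify() assigns to each RSVP picture."""
    links = load_thumb_links(event_id)
    return Counter(classify(link) for link in links)


def fake_loader(event_id):
    # Stand-in for the meetup.com lookup; returns made-up links.
    return ['https://example.com/a.jpg',
            'https://example.com/b.jpg',
            'https://example.com/c.jpg']


def fake_classifier(link):
    # Stand-in for determine_gender.
    return 'F' if link.endswith('a.jpg') else 'M'


counts = gender_counts('12345', fake_loader, fake_classifier)
print(counts['M'], counts['F'])
```

In the notebook, the loader could be a function that returns the thumb_link column of load_members_to_data_frame(event_id), and the classifier could be your determine_gender.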


In [ ]:

Pause and reflect

Discuss within your team and come up with the following:

  • three nefarious usages that your program can have
  • three beneficial usages that your program can have

In [ ]:

Plot the male female counts of the last 10 ChiPy events

This will probably make you hit API limits. To get around that, use the API keys of your team members. Share them on Slack.
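
Once you have a pair of counts per event, a grouped bar chart is one straightforward way to show them. The sketch below uses made-up placeholder counts for three events only to illustrate the matplotlib calls; replace them with the numbers you collect.

```python
import matplotlib.pyplot as plt

# Placeholder data, not real results.
events = ['event 1', 'event 2', 'event 3']
male_counts = [40, 35, 50]
female_counts = [12, 15, 20]

positions = list(range(len(events)))
width = 0.4

fig, ax = plt.subplots()
# Shift each series by half a bar width so the pairs sit side by side.
ax.bar([p - width / 2 for p in positions], male_counts, width, label='male')
ax.bar([p + width / 2 for p in positions], female_counts, width, label='female')
ax.set_xticks(positions)
ax.set_xticklabels(events)
ax.set_ylabel('RSVP count')
ax.legend()
```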


In [ ]:

Generate the same plots for other meetup.com communities in Chicago.

Here are some. Feel free to include others you are aware of.

Feel free to collaborate with other teams on the #team-projects Slack channel so that we can cover all the different user groups. Share the meetup.com URLs that you have found.

Generate the average ratio from those last 10 meetings for ChiPy
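
"Average ratio" can be read in more than one way. The sketch below takes one interpretation: compute the female share of classified attendees for each event, then average those shares across events. The helper name and the sample counts are made up for illustration.

```python
def average_female_share(event_counts):
    """Average, over events, of female / (male + female).

    event_counts is a list of (male_count, female_count) pairs; events
    with no classified attendees are skipped.
    """
    shares = [f / (m + f) for m, f in event_counts if m + f > 0]
    if not shares:
        return None
    return sum(shares) / len(shares)


# Made-up counts for two events: (male, female).
print(average_female_share([(30, 10), (20, 20)]))
```

Averaging per-event shares weights every event equally; pooling all attendees first would instead weight large events more heavily, so decide which question you want to answer.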


In [ ]:

Generate the average ratios for the user groups you used above


In [ ]:

Compare Chicago with the tech communities of other cities in the USA

  • Silicon Valley
  • New York
  • St. Louis
  • Salt Lake City (and Utah)
  • Austin, Texas

Compare different countries with the USA

  • Canada
  • Mexico
  • India
  • United Kingdom
  • Australia
  • China
  • Japan