What might the future developer workforce look like?

Morgan White Data Bootcamp, Spring 2016

Introduction

Diversity, as a topic, has haunted technology companies and development teams for some time now. Their teams are notoriously white and male, in large part because this reflects the demographic of people who have traditionally been in these academic programs.

However, this lack of diversity has been in the news a lot lately, with major scandals at Google. There are many programs encouraging minority and female participation in STEM programs, and it feels like perhaps the next generation of developers will start to look more like the population of the world.

..Or will it?



In [100]:

    
import sys                             # system module
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas
import seaborn.apionly as sns          # fancy matplotlib graphics (no styling)
from pandas_datareader import data, wb         

# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

# these lines make our graphics show up in the notebook
%matplotlib inline             
plotly.offline.init_notebook_mode()

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Plotly version: ', plotly.__version__)
print('Today: ', dt.date.today())









    



Python version: 3.5.1 |Anaconda 4.0.0 (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Pandas version:  0.18.0
Plotly version:  1.9.7
Today:  2016-05-14

Data

2013 National AP Exams I thought a natural starting point was to ask the question: how are current students buying into the existing education programs, and what do those demographics look like over the years?

It turns out the AP exam for Computer Science only became available a few years ago, but the engagement numbers and related demographic information is freely available.



In [101]:

    
url = 'http://home.cc.gatech.edu/ice-gt/uploads/556/DetailedStateInfoAP-CS-A-2006-2013-with-PercentBlackAndHIspanicByState-fixed.xlsx'
ap0 = pd.read_excel(url, sheetname=1, skiprows=[51, 52, 53, 54, 55], header=0)
ap0.shape









    Out[101]:





(50, 29)



In [181]:

    
ap = ap0.drop("Unnamed: 2", 1)
ap = ap.replace({"":0,"*":0})
ap['# male']=ap["Total #"]-ap["# female"]
ap['# male passed']=ap['# passed']-ap['# female passed']
ap['% fem passed']=ap["# female passed"]/ap["# female"]
ap["% male passed"]=ap["# male passed"]/ap["# male"]
ap.set_index("2013 data")
ap.head()









    Out[181]:






  
    
      
      2013 data
      # schools
      Total #
      yield per teacher
      # passed
      % passed
      # female
      # female passed
      % female passed
      %female
      ...
      # Hispanic Females
      # Hispanic Females passed
      % Hispanic females passed
      % hispanic taking exam
      % Hispanic in state
      % taking / % state * 100
      # male
      # male passed
      % fem passed
      % male passed
    
  
  
    
      0
      California
      211
      4964
      23.526066
      3761.0
      75.765512
      1074
      776
      72.253259
      21.635778
      ...
      82
      24*
      29.27*
      7.896857
      37.6
      21.002280
      3890
      2985.0
      0.722533
      0.767352
    
    
      1
      Texas
      271
      3979
      14.682657
      2454.0
      61.673787
      910
      520
      57.142857
      22.870068
      ...
      178
      56*
      31.46*
      18.874089
      37.6
      50.197045
      3069
      1934.0
      0.571429
      0.630173
    
    
      2
      New York
      124
      1858
      14.983871
      1278.0
      68.783638
      377
      216
      57.294430
      20.290635
      ...
      45
      10
      22.2222
      8.073197
      17.6
      45.870437
      1481
      1062.0
      0.572944
      0.717083
    
    
      3
      Virginia
      110
      1655
      15.045455
      1074.0
      60.371517
      308
      207
      67.207792
      18.610272
      ...
      9
      2*
      0
      5.438066
      7.9
      68.836284
      1347
      867.0
      0.672078
      0.643653
    
    
      4
      Maryland
      112
      1629
      14.544643
      1068.0
      65.561694
      323
      190
      58.823529
      19.828115
      ...
      18
      6*
      0
      5.402087
      8.2
      65.879112
      1306
      878.0
      0.588235
      0.672282
    
  

5 rows × 32 columns

Analysis

How many women are taking the AP Computer Science exam?

Before we can start to understand what helps women achieve in this field, we should understand how they are represented. Below we've taken a look by state at how women are represented in the overall test taking population.

We see that women remain extremely underrepresented. Even in states like California and Texas, where participation in the exam is relatively high, women still max out at around 25% of total test takers. We see extremely low numbers after that.



In [163]:

    
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(20,25))

sns.set_color_codes("pastel")
sns.barplot(x="Total #", y="2013 data", data=ap,
            label="Male Test Takers", color="b")
sns.set_color_codes("muted")
sns.barplot(x="# female", y="2013 data", data=ap,
            label="Female Test Takers", color="g")

ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(xlim=(0, 5000), ylabel="",
       xlabel="Female Test Takers as a Proportion of Total Test Takers")
sns.despine(left=True, bottom=True)

Does size matter: is there a relationship between the number of men/women taking the test and their pass rates?

On the whole, based on the results of the analysis below, it is safe to say that having more people in your state take the exam does not greatly impact the rate in which people pass.

However, this analysis also shows that, as a cohort, women are being slightly outperformed by men. Is there a bias in the AP exam or coursework? Are women not finding resources they need as readily? That is not clear from the data, but the performance here is a huge concern. If women are not doing as well on the AP exam for some reason, are they more likely to drop out instead of improve?



In [194]:

    
sns.set(style="whitegrid")
sns.set_color_codes("pastel")
sns.factorplot(x="# male", y="% male passed", data=ap,
            label="Male Test Takers", color="b")
sns.set_color_codes("muted")
sns.factorplot(x="# female", y="% fem passed", data=ap,
            label="Female Test Takers", color="g")









    Out[194]:





<seaborn.axisgrid.FacetGrid at 0x11db669b0>

Do clusters of CompSci programs in a state make female participation more likely?

Is there a correlation between the number of schools with Computer Science programs and women participating in the program?

From the analysis below, it's clear that a higher concentration of schools in each state correlates to a higher participation rate in women. It's not clear if this is because interest in general might be higher, or states with higher concentrations of these programs are actually helping to encourage women to participate.

Further, this is not a guarantee, as there are states with high concentrations of programs not breaking the 400 women ceiling.



In [164]:

    
ap1= sns.swarmplot(y="# schools", x="# female", data=ap, split = True)
fig_cluster = ap1.get_figure()

Do larger or smaller classes encourage female participation better?

From the analysis below, it seems as though, once class size exceeds 10, female participation is much higher. Like much of the analysis in this document, it is not clear whether female participation is higher because overall demand is higher in some states than others (hence the larger classes), or if larger classes encourage female participation (easier to blend in, etc.)



In [165]:

    
ap2= sns.swarmplot(y="yield per teacher", x="# female", data=ap, split = True)
fig_cluster = ap1.get_figure()

Girl Power: Does having girlfriends with you increase women's odds of passing?

States have a lot of variability in encouraging female participation in the Computer Science AP Exam, and in getting higher rates of women to pass the AP exam. It's clear that living in California gains you the best odds of being a female who took and passed the exam. As this exam is administered for another 4 or 5 years, and more schools start administering it, we may see these rates increase.



In [175]:

    
sns.set(style="whitegrid")
f, ax = plt.subplots(figsize=(20,25))

sns.set_color_codes("pastel")
sns.barplot(x="# female", y="2013 data", data=ap,
            label="Female Not Passed", color="r")
sns.set_color_codes("muted")
sns.barplot(x="# female passed", y="2013 data", data=ap,
            label="Female Passed", color="r")


ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(xlim=(0, 1000), ylabel="",
       xlabel="Passed Females as a Proportion of Not Passed Females")
sns.despine(left=True, bottom=True)

Good news, bad news

Being a young woman in Computer Science is clearly a challenge. Fewer women than men are taking the exam, and they are passing it at lower rates than men as well. The increase in popularity of these programs will help bolster female participation: women seem to do better when states have more of these programs, and class size increases. This former Spice Girls fan would like to believe it's girl power - the more women participate, the higher the pass rates might be. However, that might be too idealistic. In reality, women need role models in these fields to encourage them to follow in these relatively new paths.

The reality is that that generation of role models is taking flight today. Perhaps by the time this cohort reaches the workplace, they will indeed find those role models. Either way, diversity is a huge issue that every workplace is looking at. Encouraging women to pursue Computer Science ensures the field will incorporate a balance of perspectives, which is important for its long term success.

Thank you for introducing me. It's been a wonderful journey; one I intend to continue!



In [ ]:

	2013 data	# schools	Total #	yield per teacher	# passed	% passed	# female	# female passed	% female passed	%female	...	# Hispanic Females	# Hispanic Females passed	% Hispanic females passed	% hispanic taking exam	% Hispanic in state	% taking / % state * 100	# male	# male passed	% fem passed	% male passed
0	California	211	4964	23.526066	3761.0	75.765512	1074	776	72.253259	21.635778	...	82	24*	29.27*	7.896857	37.6	21.002280	3890	2985.0	0.722533	0.767352
1	Texas	271	3979	14.682657	2454.0	61.673787	910	520	57.142857	22.870068	...	178	56*	31.46*	18.874089	37.6	50.197045	3069	1934.0	0.571429	0.630173
2	New York	124	1858	14.983871	1278.0	68.783638	377	216	57.294430	20.290635	...	45	10	22.2222	8.073197	17.6	45.870437	1481	1062.0	0.572944	0.717083
3	Virginia	110	1655	15.045455	1074.0	60.371517	308	207	67.207792	18.610272	...	9	2*	0	5.438066	7.9	68.836284	1347	867.0	0.672078	0.643653
4	Maryland	112	1629	14.544643	1068.0	65.561694	323	190	58.823529	19.828115	...	18	6*	0	5.402087	8.2	65.879112	1306	878.0	0.588235	0.672282