Summary Analysis of the 2017 GitHub Open Source Survey

By R. Stuart Geiger (@staeiou), Berkeley Institute for Data Science

Overview

This notebook analyzes the 2017 Open Source Survey, conducted by staff at GitHub, Inc. and other collaborators (see https://opensourcesurvey.org/2017 and https://github.com/github/open-source-survey). The survey was run in 2017, asking over 50 questions on a variety of topics. The survey's designers explain the motivation, design, and distribution of the survey:

In collaboration with researchers from academia, industry, and the community, GitHub designed a survey to gather high quality and novel data on open source software development practices and communities. We collected responses from 5,500 randomly sampled respondents sourced from over 3,800 open source repositories on GitHub.com, and over 500 responses from a non-random sample of communities that work on other platforms. The results are an open data set about the attitudes, experiences, and backgrounds of those who use, build, and maintain open source software."

Purpose and goal

The GitHub survey team presented analyses of some questions when releasing the survey, but there were many more questions asked that are relevant to researchers and community members. This report is an exploratory analysis of all questions asked in the survey, providing a basic summary of the responses to each question. This report presents and plots summary statistics -- mostly frequency counts, proportions, then a frequency or proportion bar graph -- of all questions asked in the survey. Most questions are presented individually, with panel questions grouped together as appropriate. There are no correlations, regressions, or descriptive breakouts between subgroups. Likert-style questions (e.g. Strongly agree <-> strongly disagree) have not been recoded to numerical, scalar values. There are no discussions or interpretations of results. This is left for future work.

The purpose of this notebook is to facilitate future research on this dataset by giving an overview of the kinds of questions asked in the survey, as well as serve as the basis for a PDF report, published on SocArXiv and OSF at https://osf.io/preprints/socarxiv/qps53/. The notebook is public on GitHub at https://github.com/staeiou/github-survey-analysis and others are encouraged to extend it as they see fit.


In [1]:
!pip install pandas seaborn


Requirement already satisfied: pandas in /home/staeiou/conda/lib/python3.5/site-packages
Requirement already satisfied: seaborn in /home/staeiou/conda/lib/python3.5/site-packages
Requirement already satisfied: python-dateutil>=2 in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /home/staeiou/conda/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /home/staeiou/conda/lib/python3.5/site-packages (from python-dateutil>=2->pandas)

In [2]:
import pandas as pd
import matplotlib, matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

%matplotlib inline
pd.options.display.float_format = '{:.2f}%'.format # add % to all floats, all floats here are percentages

In [3]:
## For making pretty tables when nbconverting to latex

pd.set_option('display.notebook_repr_html', True)

def _repr_latex_(self):
    return "\centering{%s}" % self.to_latex()

pd.DataFrame._repr_latex_ = _repr_latex_  # monkey patch pandas DataFrame

Download and unzip data


In [4]:
!unzip -o data_for_public_release.zip


Archive:  data_for_public_release.zip
   creating: data_for_public_release/
  inflating: data_for_public_release/negative_incidents.csv  
  inflating: __MACOSX/data_for_public_release/._negative_incidents.csv  
  inflating: data_for_public_release/notes.txt  
  inflating: data_for_public_release/questionnaire.txt  
  inflating: __MACOSX/data_for_public_release/._questionnaire.txt  
  inflating: data_for_public_release/README.txt  
  inflating: __MACOSX/data_for_public_release/._README.txt  
  inflating: data_for_public_release/survey_data.csv  
  inflating: __MACOSX/data_for_public_release/._survey_data.csv  

In [5]:
!ls data_for_public_release/


negative_incidents.csv	questionnaire.txt  survey_data.csv
notes.txt		README.txt

Data processing

Main dataset

Load main dataset into pandas


In [6]:
pd.options.display.max_rows = 500

In [7]:
survey_df = pd.read_csv("data_for_public_release/survey_data.csv")

In [8]:
print("survey_data.csv length:", len(survey_df))


survey_data.csv length: 6029

In [9]:
survey_complete_df = survey_df.query("STATUS == 'Complete'")
print("survey_data.csv completed responses:", len(survey_complete_df))


survey_data.csv completed responses: 3746

Explore the main dataset with some sample responses


In [10]:
survey_complete_df[0:3].transpose()


Out[10]:
3 4 6
RESPONSE.ID 48 49 51
DATE.SUBMITTED 3/21/17 15:42 3/21/17 15:38 3/21/17 15:41
STATUS Complete Complete Complete
PARTICIPATION.TYPE.FOLLOW 1 1 1
PARTICIPATION.TYPE.USE.APPLICATIONS 1 1 1
PARTICIPATION.TYPE.USE.DEPENDENCIES 1 1 1
PARTICIPATION.TYPE.CONTRIBUTE 1 1 0
PARTICIPATION.TYPE.OTHER 0 0 0
CONTRIBUTOR.TYPE.CONTRIBUTE.CODE Frequently Occasionally NaN
CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS Rarely Rarely NaN
CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE Frequently Rarely NaN
CONTRIBUTOR.TYPE.FILE.BUGS Frequently Frequently NaN
CONTRIBUTOR.TYPE.FEATURE.REQUESTS Frequently Frequently NaN
CONTRIBUTOR.TYPE.COMMUNITY.ADMIN Never Occasionally NaN
EMPLOYMENT.STATUS Employed full time Full time student Employed full time
PROFESSIONAL.SOFTWARE Frequently NaN Frequently
FUTURE.CONTRIBUTION.INTEREST Very interested Very interested Very interested
FUTURE.CONTRIBUTION.LIKELIHOOD Very likely Very likely Somewhat unlikely
OSS.USER.PRIORITIES.LICENSE Very important to have Very important to have Very important to have
OSS.USER.PRIORITIES.CODE.OF.CONDUCT Somewhat important not to have Somewhat important to have Not important either way
OSS.USER.PRIORITIES.CONTRIBUTING.GUIDE Somewhat important to have Very important to have Somewhat important to have
OSS.USER.PRIORITIES.CLA Not important either way Very important to have Don't know what this is
OSS.USER.PRIORITIES.ACTIVE.DEVELOPMENT Somewhat important to have Very important to have Very important to have
OSS.USER.PRIORITIES.RESPONSIVE.MAINTAINERS Somewhat important to have Very important to have Very important to have
OSS.USER.PRIORITIES.WELCOMING.COMMUNITY Very important to have Very important to have Somewhat important to have
OSS.USER.PRIORITIES.WIDESPREAD.USE Somewhat important to have Not important either way Somewhat important to have
OSS.CONTRIBUTOR.PRIORITIES.LICENSE Not important either way NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.CODE.OF.CONDUCT Somewhat important not to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.CONTRIBUTING.GUIDE Not important either way NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.CLA Not important either way NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.ACTIVE.DEVELOPMENT Somewhat important to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.RESPONSIVE.MAINTAINERS Somewhat important to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.WELCOMING.COMMUNITY Somewhat important to have NaN NaN
OSS.CONTRIBUTOR.PRIORITIES.WIDESPREAD.USE Somewhat important to have NaN NaN
SEEK.OPEN.SOURCE Sometimes Always Always
OSS.UX Generally easier to use About the same Generally easier to use
OSS.SECURITY Generally more secure Generally more secure About the same
OSS.STABILITY About the same Generally less stable About the same
INTERNAL.EFFICACY Strongly agree Strongly agree Strongly agree
EXTERNAL.EFFICACY Strongly agree Strongly agree Neither agree nor disagree
OSS.IDENTIFICATION Neither agree nor disagree Strongly agree Neither agree nor disagree
USER.VALUES.STABILITY Moderately important Extremely important Extremely important
USER.VALUES.INNOVATION Not at all important Very important Moderately important
USER.VALUES.REPLICABILITY Very important Very important Moderately important
USER.VALUES.COMPATIBILITY Very important Very important Extremely important
USER.VALUES.SECURITY Very important Very important Extremely important
USER.VALUES.COST Very important Not at all important Very important
USER.VALUES.TRANSPARENCY Very important Extremely important Extremely important
USER.VALUES.USER.EXPERIENCE Extremely important Moderately important Very important
USER.VALUES.CUSTOMIZABILITY Extremely important Very important Extremely important
USER.VALUES.SUPPORT Slightly important Moderately important Not at all important
USER.VALUES.TRUSTED.PRODUCER Very important Slightly important Moderately important
TRANSPARENCY.PRIVACY.BELIEFS People should be able to contribute code witho... People should be able to contribute code witho... People should be able to contribute code witho...
INFO.AVAILABILITY A lot of information about me A lot of information about me A little information about me
INFO.JOB Yes No No
TRANSPARENCY.PRIVACY.PRACTICES.GENERAL I include my real name. I include my real name. I don't publish this kind of content online.
TRANSPARENCY.PRIVACY.PRACTICES.OSS I include my real name. I include my real name. NaN
RECEIVED.HELP Yes Yes Yes
FIND.HELPER Other - Please describe I asked for help in a public forum (e.g. in a ... I asked a specific person for help.
HELPER.PRIOR.RELATIONSHIP We knew each other well. Total strangers, I didn't know of them previou... We knew each other well.
RECEIVED.HELP.TYPE Writing code or otherwise implementing ideas. Installing or using an application. Installing or using an application.
PROVIDED.HELP Yes Yes Yes
FIND.HELPEE I reached out to them to offer unsolicited help. They asked for help in a public forum (e.g. in... They asked me directly for help.
HELPEE.PRIOR.RELATIONSHIP Total strangers, I didn't know of them previou... Total strangers, I didn't know of them previou... We knew each other well.
PROVIDED.HELP.TYPE Writing code or otherwise implementing ideas. Installing or using an application. Installing or using an application.
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE Yes Yes Yes
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION Yes No No
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE Yes Yes Yes
DISCOURAGING.BEHAVIOR.BAD.DOCS Yes Yes Yes
DISCOURAGING.BEHAVIOR.CONFLICT Yes Yes No
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE No No No
OSS.AS.JOB Yes, directly- some or all of my work duties ... NaN NaN
OSS.AT.WORK Frequently NaN Frequently
OSS.IP.POLICY I am free to contribute without asking for per... NaN I'm not sure.
EMPLOYER.POLICY.APPLICATIONS Use of open source applications is acceptable ... NaN Use of open source applications is encouraged.
EMPLOYER.POLICY.DEPENDENCIES Use of open source dependencies is acceptable ... NaN Use of open source dependencies is encouraged.
OSS.HIRING Very important NaN NaN
IMMIGRATION No, I live in the country where I was born. No, I live in the country where I was born. Yes, and I intend to stay permanently.
MINORITY.HOMECOUNTRY NaN NaN No
MINORITY.CURRENT.COUNTRY No No No
GENDER Man Man Man
TRANSGENDER.IDENTITY No No No
SEXUAL.ORIENTATION No Yes No
WRITTEN.ENGLISH Very well Very well Very well
AGE 35 to 44 years 17 or younger 35 to 44 years
FORMAL.EDUCATION Bachelor's degree Secondary (high) school graduate or equivalent Vocational/trade program or apprenticeship
PARENTS.FORMAL.EDUCATION Bachelor's degree Master's degree Bachelor's degree
AGE.AT.FIRST.COMPUTER.INTERNET 13 - 17 years old Younger than 13 years old 13 - 17 years old
LOCATION.OF.FIRST.COMPUTER.INTERNET At home (belonging to me or a family member) At home (belonging to me or a family member) At home (belonging to me or a family member)
PARTICIPATION.TYPE.ANY.REPONSE 1 1 1
POPULATION github github github
OFF.SITE.ID NaN NaN NaN
TRANSLATED 0 0 0

Create lists of variables for bulk analysis


In [11]:
participation_type_vars = ['PARTICIPATION.TYPE.FOLLOW',
       'PARTICIPATION.TYPE.USE.APPLICATIONS',
       'PARTICIPATION.TYPE.USE.DEPENDENCIES', 'PARTICIPATION.TYPE.CONTRIBUTE',
       'PARTICIPATION.TYPE.OTHER']

contrib_type_vars = ['CONTRIBUTOR.TYPE.CONTRIBUTE.CODE',
       'CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS',
       'CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE', 'CONTRIBUTOR.TYPE.FILE.BUGS',
       'CONTRIBUTOR.TYPE.FEATURE.REQUESTS', 'CONTRIBUTOR.TYPE.COMMUNITY.ADMIN']

contrib_other_vars = ['EMPLOYMENT.STATUS', 'PROFESSIONAL.SOFTWARE',
       'FUTURE.CONTRIBUTION.INTEREST', 'FUTURE.CONTRIBUTION.LIKELIHOOD']

contrib_ident_vars = participation_type_vars + contrib_type_vars + contrib_other_vars

In [12]:
user_pri_vars = ['OSS.USER.PRIORITIES.LICENSE', 'OSS.USER.PRIORITIES.CODE.OF.CONDUCT',
       'OSS.USER.PRIORITIES.CONTRIBUTING.GUIDE', 'OSS.USER.PRIORITIES.CLA',
       'OSS.USER.PRIORITIES.ACTIVE.DEVELOPMENT',
       'OSS.USER.PRIORITIES.RESPONSIVE.MAINTAINERS',
       'OSS.USER.PRIORITIES.WELCOMING.COMMUNITY',
       'OSS.USER.PRIORITIES.WIDESPREAD.USE']

contrib_pri_vars = ['OSS.CONTRIBUTOR.PRIORITIES.LICENSE',
       'OSS.CONTRIBUTOR.PRIORITIES.CODE.OF.CONDUCT',
       'OSS.CONTRIBUTOR.PRIORITIES.CONTRIBUTING.GUIDE',
       'OSS.CONTRIBUTOR.PRIORITIES.CLA',
       'OSS.CONTRIBUTOR.PRIORITIES.ACTIVE.DEVELOPMENT',
       'OSS.CONTRIBUTOR.PRIORITIES.RESPONSIVE.MAINTAINERS',
       'OSS.CONTRIBUTOR.PRIORITIES.WELCOMING.COMMUNITY',
       'OSS.CONTRIBUTOR.PRIORITIES.WIDESPREAD.USE']

oss_values_vars = [ 'SEEK.OPEN.SOURCE',
       'OSS.UX', 'OSS.SECURITY', 'OSS.STABILITY', 'INTERNAL.EFFICACY',
       'EXTERNAL.EFFICACY', 'OSS.IDENTIFICATION']

user_values_vars = ['USER.VALUES.STABILITY',
       'USER.VALUES.INNOVATION', 'USER.VALUES.REPLICABILITY',
       'USER.VALUES.COMPATIBILITY', 'USER.VALUES.SECURITY', 'USER.VALUES.COST',
       'USER.VALUES.TRANSPARENCY', 'USER.VALUES.USER.EXPERIENCE',
       'USER.VALUES.CUSTOMIZABILITY', 'USER.VALUES.SUPPORT',
       'USER.VALUES.TRUSTED.PRODUCER']

values_pri_vars = user_pri_vars + contrib_pri_vars + user_values_vars + oss_values_vars

In [13]:
privacy_transp_vars = ['TRANSPARENCY.PRIVACY.BELIEFS',
       'INFO.AVAILABILITY', 'INFO.JOB',
       'TRANSPARENCY.PRIVACY.PRACTICES.GENERAL',
       'TRANSPARENCY.PRIVACY.PRACTICES.OSS']

In [14]:
help_vars = ['RECEIVED.HELP', 'FIND.HELPER',
       'HELPER.PRIOR.RELATIONSHIP', 'RECEIVED.HELP.TYPE', 'PROVIDED.HELP',
       'FIND.HELPEE', 'HELPEE.PRIOR.RELATIONSHIP', 'PROVIDED.HELP.TYPE']

In [15]:
paid_work_vars = ['OSS.AS.JOB',
       'OSS.AT.WORK', 'OSS.IP.POLICY', 'EMPLOYER.POLICY.APPLICATIONS',
       'EMPLOYER.POLICY.DEPENDENCIES', 'OSS.HIRING']

In [16]:
discouraging_vars = ['DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE',
       'DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION',
       'DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE',
       'DISCOURAGING.BEHAVIOR.BAD.DOCS', 'DISCOURAGING.BEHAVIOR.CONFLICT',
       'DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE']

In [17]:
demographic_vars = ['IMMIGRATION',
       'MINORITY.HOMECOUNTRY', 'MINORITY.CURRENT.COUNTRY', 'GENDER',
       'TRANSGENDER.IDENTITY', 'SEXUAL.ORIENTATION', 'WRITTEN.ENGLISH', 'AGE',
       'FORMAL.EDUCATION', 'PARENTS.FORMAL.EDUCATION',
       'AGE.AT.FIRST.COMPUTER.INTERNET', 'LOCATION.OF.FIRST.COMPUTER.INTERNET',
       'PARTICIPATION.TYPE.ANY.REPONSE', 'POPULATION', 'OFF.SITE.ID',
       'TRANSLATED']

In [18]:
survey_vars = [contrib_ident_vars, values_pri_vars, privacy_transp_vars, \
               help_vars, paid_work_vars, discouraging_vars, demographic_vars]

Negative incidents

Load into pandas


In [19]:
neg_df = pd.read_csv("data_for_public_release/negative_incidents.csv")

In [20]:
print("negative_incidents.csv length:", len(survey_df))


negative_incidents.csv length: 6029

Explore the negative dataset with some sample responses


In [21]:
neg_df[0:3].transpose()


Out[21]:
0 1 2
NEGATIVE.WITNESS.RUDENESS 1 1 0
NEGATIVE.WITNESS.NAME.CALLING 1 0 0
NEGATIVE.WITNESS.THREATS 0 0 0
NEGATIVE.WITNESS.IMPERSONATION 0 0 1
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT 0 0 0
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT 0 0 0
NEGATIVE.WITNESS.STALKING 0 0 0
NEGATIVE.WITNESS.SEXUAL.ADVANCES 0 0 0
NEGATIVE.WITNESS.STEREOTYPING 0 0 0
NEGATIVE.WITNESS.DOXXING 0 0 1
NEGATIVE.WITNESS.OTHER 0 0 0
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE 0 0 0
NEGATIVE.EXPERIENCE.RUDENESS 0 1 0
NEGATIVE.EXPERIENCE.NAME.CALLING 0 0 0
NEGATIVE.EXPERIENCE.THREATS 0 0 0
NEGATIVE.EXPERIENCE.IMPERSONATION 0 0 0
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT 0 0 0
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT 0 0 0
NEGATIVE.EXPERIENCE.STALKING 0 0 0
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES 0 0 0
NEGATIVE.EXPERIENCE.STEREOTYPING 0 0 0
NEGATIVE.EXPERIENCE.DOXXING 0 0 0
NEGATIVE.EXPERIENCE.OTHER 0 0 0
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE 1 0 1
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP 0 0 0
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT 0 0 0
NEGATIVE.RESPONSE.BLOCKED.USER 0 0 0
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS 0 0 0
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP 0 0 0
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL 0 0 0
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT 0 0 0
NEGATIVE.RESPONSE.OTHER 0 0 0
NEGATIVE.RESPONSE.IGNORED 0 1 0
RESPONSE.EFFECTIVENESS.ASKED.USER.TO.STOP NaN NaN NaN
RESPONSE.EFFECTIVENESS.SOLICITED.COMMUNITY.SUPPORT NaN NaN NaN
RESPONSE.EFFECTIVENESS.BLOCKED.USER NaN NaN NaN
RESPONSE.EFFECTIVENESS.REPORTED.TO.MAINTAINERS NaN NaN NaN
RESPONSE.EFFECTIVENESS.REPORTED.TO.HOST.OR.ISP NaN NaN NaN
RESPONSE.EFFECTIVENESS.CONSULTED.LEGAL.COUNSEL NaN NaN NaN
RESPONSE.EFFECTIVENESS.CONTACTED.LAW.ENFORCEMENT NaN NaN NaN
RESPONSE.EFFECTIVENESS.OTHER NaN NaN NaN
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING 0 0 1
NEGATIVE.CONSEQUENCES.PSEUDONYM 0 0 0
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE 0 0 0
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME 0 0 0
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE 0 0 0
NEGATIVE.CONSEQUENCES.SUGGEST.COC 0 0 0
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION 0 0 0
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION 0 1 0
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES 0 0 0
NEGATIVE.CONSEQUENCES.OTHER 0 0 0
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE 1 0 0
NEGATIVE.WITNESS.ANY.RESPONSE 1 1 1
NEGATIVE.EXPERIENCE.ANY.RESPONSE 1 1 1
NEGATIVE.RESPONSE.ANY.RESPONSE 0 1 0
NEGATIVE.CONSEQUENCES.ANY.RESPONSE 1 1 1
POPULATION github github github

Create lists of variables for bulk analysis


In [22]:
neg_witness_vars = ['NEGATIVE.WITNESS.RUDENESS', 'NEGATIVE.WITNESS.NAME.CALLING',
       'NEGATIVE.WITNESS.THREATS', 'NEGATIVE.WITNESS.IMPERSONATION',
       'NEGATIVE.WITNESS.SUSTAINED.HARASSMENT',
       'NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT',
       'NEGATIVE.WITNESS.STALKING', 'NEGATIVE.WITNESS.SEXUAL.ADVANCES',
       'NEGATIVE.WITNESS.STEREOTYPING', 'NEGATIVE.WITNESS.DOXXING',
       'NEGATIVE.WITNESS.OTHER', 'NEGATIVE.WITNESS.NONE.OF.THE.ABOVE', 'NEGATIVE.WITNESS.ANY.RESPONSE']

In [23]:
neg_exp_vars = ['NEGATIVE.EXPERIENCE.RUDENESS', 'NEGATIVE.EXPERIENCE.NAME.CALLING',
       'NEGATIVE.EXPERIENCE.THREATS', 'NEGATIVE.EXPERIENCE.IMPERSONATION',
       'NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT',
       'NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT',
       'NEGATIVE.EXPERIENCE.STALKING', 'NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES',
       'NEGATIVE.EXPERIENCE.STEREOTYPING', 'NEGATIVE.EXPERIENCE.DOXXING',
       'NEGATIVE.EXPERIENCE.OTHER', 'NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE', 'NEGATIVE.EXPERIENCE.ANY.RESPONSE']

In [24]:
neg_resp_vars = ['NEGATIVE.RESPONSE.ASKED.USER.TO.STOP',
       'NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT',
       'NEGATIVE.RESPONSE.BLOCKED.USER',
       'NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS',
       'NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP',
       'NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL',
       'NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT',
       'NEGATIVE.RESPONSE.OTHER', 'NEGATIVE.RESPONSE.IGNORED', 'NEGATIVE.RESPONSE.ANY.RESPONSE']

In [25]:
neg_effect_vars = ['RESPONSE.EFFECTIVENESS.ASKED.USER.TO.STOP',
       'RESPONSE.EFFECTIVENESS.SOLICITED.COMMUNITY.SUPPORT',
       'RESPONSE.EFFECTIVENESS.BLOCKED.USER',
       'RESPONSE.EFFECTIVENESS.REPORTED.TO.MAINTAINERS',
       'RESPONSE.EFFECTIVENESS.REPORTED.TO.HOST.OR.ISP',
       'RESPONSE.EFFECTIVENESS.CONSULTED.LEGAL.COUNSEL',
       'RESPONSE.EFFECTIVENESS.CONTACTED.LAW.ENFORCEMENT',
       'RESPONSE.EFFECTIVENESS.OTHER']

In [26]:
neg_conseq_vars = ['NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING',
       'NEGATIVE.CONSEQUENCES.PSEUDONYM',
       'NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE',
       'NEGATIVE.CONSEQUENCES.CHANGE.USERNAME',
       'NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE',
       'NEGATIVE.CONSEQUENCES.SUGGEST.COC',
       'NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION',
       'NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION',
       'NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES', 'NEGATIVE.CONSEQUENCES.OTHER',
       'NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE', 'NEGATIVE.CONSEQUENCES.ANY.RESPONSE']

In [27]:
neg_anyresp_vars = ['NEGATIVE.WITNESS.ANY.RESPONSE', 'NEGATIVE.EXPERIENCE.ANY.RESPONSE',
       'NEGATIVE.RESPONSE.ANY.RESPONSE', 'NEGATIVE.CONSEQUENCES.ANY.RESPONSE']

Analysis


In [28]:
sns.set(font_scale=1.5)

Contributor identity

People participate in open source in different ways. Which of the following activities do you engage in?

Choose all that apply.


In [29]:
participation_type_resp= survey_df[participation_type_vars].apply(pd.Series.value_counts).transpose()
participation_type_resp.columns = ["No", "Yes"]
participation_type_resp


Out[29]:
No Yes
PARTICIPATION.TYPE.FOLLOW 1287 4742
PARTICIPATION.TYPE.USE.APPLICATIONS 454 5575
PARTICIPATION.TYPE.USE.DEPENDENCIES 946 5083
PARTICIPATION.TYPE.CONTRIBUTE 1722 4307
PARTICIPATION.TYPE.OTHER 5742 287

In [ ]:


In [30]:
participation_type_prop = survey_df[participation_type_vars].mean() * 100
participation_type_prop = participation_type_prop.sort_values()
pd.DataFrame(participation_type_prop, columns=["percent"])


Out[30]:
percent
PARTICIPATION.TYPE.OTHER 4.76%
PARTICIPATION.TYPE.CONTRIBUTE 71.44%
PARTICIPATION.TYPE.FOLLOW 78.65%
PARTICIPATION.TYPE.USE.DEPENDENCIES 84.31%
PARTICIPATION.TYPE.USE.APPLICATIONS 92.47%

In [31]:
ax = participation_type_prop.plot(kind='barh')

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[19:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
plt.xlim(0,100)
ax.set_yticklabels(labels)

ax.set_xlabel("Percent of respondents")
t = plt.title("% of people who participate in the following activities:")


Contributon type: How often do you engage in each of the following activities?


In [32]:
contrib_type_responses = survey_df[contrib_type_vars].apply(pd.Series.value_counts).transpose()

#contrib_type_responses.columns = ["Not at all important", "Slightly important","Don't know", "Somewhat important", "Very important"]
contrib_type_responses = contrib_type_responses[["Never", "Rarely", "Occasionally", "Frequently"]]
contrib_type_responses = contrib_type_responses[["Frequently", "Occasionally", "Rarely", "Never"]]
contrib_type_responses = contrib_type_responses.sort_values(by='Frequently')
contrib_type_responses


Out[32]:
Frequently Occasionally Rarely Never
CONTRIBUTOR.TYPE.COMMUNITY.ADMIN 287 417 867 2412
CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS 460 1214 1665 661
CONTRIBUTOR.TYPE.FEATURE.REQUESTS 573 1625 1346 451
CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE 996 944 974 1090
CONTRIBUTOR.TYPE.FILE.BUGS 1067 2073 768 106
CONTRIBUTOR.TYPE.CONTRIBUTE.CODE 1160 1383 1301 189

In [33]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
contrib_type_responses.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[17:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("How often do you engage in each of the following activities?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


Employment status

EMPLOYMENT.STATUS


In [34]:
prop_df = pd.DataFrame((survey_df['EMPLOYMENT.STATUS'].value_counts()))
prop_df.columns=["count"]
prop_df


Out[34]:
count
Employed full time 3615
Full time student 1048
Employed part time 349
Temporarily not working 314
Other - please describe 184
Retired or permanently not working (e.g. due to disability) 90

In [35]:
prop_df = pd.DataFrame((survey_df['EMPLOYMENT.STATUS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[35]:
percent
Employed full time 64.55%
Full time student 18.71%
Employed part time 6.23%
Temporarily not working 5.61%
Other - please describe 3.29%
Retired or permanently not working (e.g. due to disability) 1.61%

In [36]:
ax = pd.DataFrame(survey_df['EMPLOYMENT.STATUS'].value_counts()).plot(kind='barh')
plt.suptitle("Employment status")
t = ax.set_xlabel("Count of responses")


In your main job, how often do you write or otherwise directly contribute to producing software?

PROFESSIONAL.SOFTWARE


In [37]:
prop_df = pd.DataFrame((survey_df['PROFESSIONAL.SOFTWARE'].value_counts()))
prop_df.columns=["count"]
prop_df


Out[37]:
count
Frequently 2747
Occasionally 542
Rarely 339
Never 279

In [38]:
prop_df = pd.DataFrame((survey_df['PROFESSIONAL.SOFTWARE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[38]:
percent
Frequently 70.31%
Occasionally 13.87%
Rarely 8.68%
Never 7.14%

In [39]:
ax = pd.DataFrame(survey_df['PROFESSIONAL.SOFTWARE'].value_counts()).plot(kind='barh')
plt.title("In your main job, how often do you write or\notherwise directly contribute to producing software?")
t = ax.set_xlabel("Count of responses")


How interested are you in contributing to open source projects in the future?

FUTURE.CONTRIBUTION.INTEREST


In [40]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts()))
prop_df.columns=["count"]
prop_df


Out[40]:
count
Very interested 3929
Somewhat interested 1430
Not too interested 125
Not at all interested 24

In [41]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts(normalize=True).round(4).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[41]:
percent
Very interested 71.33%
Somewhat interested 25.96%
Not too interested 2.27%
Not at all interested 0.44%

In [42]:
ax = pd.DataFrame(survey_df['FUTURE.CONTRIBUTION.INTEREST'].value_counts()).plot(kind='barh')
plt.title("How interested are you in contributing\nto open source projects in the future?")
t = ax.set_xlabel("Count of responses")


How likely are you to contribute to open source projects in the future?


In [43]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts()))
prop_df.columns=["count"]
prop_df


Out[43]:
count
Very likely 3271
Somewhat likely 1719
Somewhat unlikely 440
Very unlikely 81

In [44]:
prop_df = pd.DataFrame((survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[44]:
percent
Very likely 59.35%
Somewhat likely 31.19%
Somewhat unlikely 7.98%
Very unlikely 1.47%

In [45]:
ax = pd.DataFrame(survey_df['FUTURE.CONTRIBUTION.LIKELIHOOD'].value_counts()).plot(kind='barh')
plt.title("How likely are you to contribute to\nopen source projects in the future?")
t = ax.set_xlabel("Count of responses")


Priorities and values

When thinking about whether to use open source software, how important are the following things?

OSS.USER.PRIORITIES.*


In [46]:
user_pri_responses = survey_df[user_pri_vars].apply(pd.Series.value_counts).transpose()

#contrib_type_responses.columns = ["Not at all important", "Slightly important","Don't know", "Somewhat important", "Very important"]
user_pri_responses = user_pri_responses[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
user_pri_responses = user_pri_responses.sort_values(by="Very important to have")

In [47]:
idx = []
for i in user_pri_responses.index:
    idx.append(i[20:])
idx = pd.Series(idx)    
user_pri_responses.set_index(idx)


Out[47]:
Very important to have Somewhat important to have Not important either way Somewhat important not to have Very important not to have Don't know what this is
CLA 490 1024 2282 336 157 488
CODE.OF.CONDUCT 848 1461 1993 166 120 209
WIDESPREAD.USE 984 2067 1576 114 47 28
CONTRIBUTING.GUIDE 1212 1866 1516 95 62 62
WELCOMING.COMMUNITY 2062 1822 812 67 33 18
RESPONSIVE.MAINTAINERS 2575 1850 302 31 35 20
ACTIVE.DEVELOPMENT 2768 1722 267 30 31 16
LICENSE 3125 1160 435 31 33 47

In [48]:
user_pri_responses_prop = survey_df[user_pri_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()

#contrib_type_responses.columns = ["Not at all important", "Slightly important","Don't know", "Somewhat important", "Very important"]
user_pri_responses_prop = user_pri_responses_prop[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
user_pri_responses_prop = user_pri_responses_prop.sort_values(by="Very important to have")
user_pri_responses_prop = user_pri_responses_prop * 100

In [49]:
idx = []
for i in user_pri_responses_prop.index:
    idx.append(i[20:])
idx = pd.Series(idx)    
user_pri_responses_prop.set_index(idx)


Out[49]:
Very important to have Somewhat important to have Not important either way Somewhat important not to have Very important not to have Don't know what this is
CLA 10.26% 21.44% 47.77% 7.03% 3.29% 10.22%
CODE.OF.CONDUCT 17.68% 30.46% 41.55% 3.46% 2.50% 4.36%
WIDESPREAD.USE 20.43% 42.92% 32.72% 2.37% 0.98% 0.58%
CONTRIBUTING.GUIDE 25.18% 38.77% 31.50% 1.97% 1.29% 1.29%
WELCOMING.COMMUNITY 42.83% 37.85% 16.87% 1.39% 0.69% 0.37%
RESPONSIVE.MAINTAINERS 53.50% 38.44% 6.27% 0.64% 0.73% 0.42%
ACTIVE.DEVELOPMENT 57.26% 35.62% 5.52% 0.62% 0.64% 0.33%
LICENSE 64.69% 24.01% 9.00% 0.64% 0.68% 0.97%

In [50]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson", "xkcd:green"]
user_pri_responses.plot.barh(stacked=True, ax=ax, figsize=[12,8], color=colors)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[20:].replace(".", " ") # cut off "OSS.USER.PRIORITIES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)

plt.title("When thinking about whether to *use* open source software,\n how important are the following things?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.1), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


When thinking about whether to contribute to an open source project, how important are the following things?

OSS.CONTRIBUTOR.PRIORITIES.*


In [51]:
contrib_pri_responses = survey_df[contrib_pri_vars].apply(pd.Series.value_counts).transpose()

#contrib_type_responses.columns = ["Not at all important", "Slightly important","Don't know", "Somewhat important", "Very important"]
contrib_pri_responses = contrib_pri_responses[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]

contrib_pri_responses = contrib_pri_responses.sort_values(by="Very important to have")

In [52]:
idx = []
for i in contrib_pri_responses.index:
    idx.append(i[27:])
idx = pd.Series(idx)    
contrib_pri_responses.set_index(idx)


Out[52]:
Very important to have Somewhat important to have Not important either way Somewhat important not to have Very important not to have Don't know what this is
WIDESPREAD.USE 387 1016 1666 70 30 12
CLA 419 712 1266 327 166 280
CODE.OF.CONDUCT 655 1145 1085 119 84 96
CONTRIBUTING.GUIDE 1198 1396 500 41 18 24
ACTIVE.DEVELOPMENT 1368 1333 448 21 18 5
WELCOMING.COMMUNITY 1533 1199 411 21 15 7
RESPONSIVE.MAINTAINERS 1994 1022 138 7 16 7
LICENSE 2199 610 337 16 15 18

In [ ]:


In [53]:
contrib_pri_responses_prop = survey_df[contrib_pri_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()

#contrib_type_responses.columns = ["Not at all important", "Slightly important","Don't know", "Somewhat important", "Very important"]
contrib_pri_responses_prop = contrib_pri_responses_prop[["Very important to have",
                                             "Somewhat important to have",
                                             "Not important either way",
                                             "Somewhat important not to have",
                                             "Very important not to have",
                                             "Don't know what this is"]]
contrib_pri_responses_prop = contrib_pri_responses_prop.sort_values(by="Very important to have")
contrib_pri_responses_prop = contrib_pri_responses_prop * 100

In [54]:
idx = []
for i in contrib_pri_responses_prop.index:
    idx.append(i[27:])
idx = pd.Series(idx)    
contrib_pri_responses_prop.set_index(idx)


Out[54]:
Very important to have Somewhat important to have Not important either way Somewhat important not to have Very important not to have Don't know what this is
WIDESPREAD.USE 12.17% 31.94% 52.37% 2.20% 0.94% 0.38%
CLA 13.22% 22.46% 39.94% 10.32% 5.24% 8.83%
CODE.OF.CONDUCT 20.57% 35.96% 34.08% 3.74% 2.64% 3.02%
CONTRIBUTING.GUIDE 37.71% 43.94% 15.74% 1.29% 0.57% 0.76%
ACTIVE.DEVELOPMENT 42.84% 41.75% 14.03% 0.66% 0.56% 0.16%
WELCOMING.COMMUNITY 48.12% 37.63% 12.90% 0.66% 0.47% 0.22%
RESPONSIVE.MAINTAINERS 62.63% 32.10% 4.33% 0.22% 0.50% 0.22%
LICENSE 68.83% 19.09% 10.55% 0.50% 0.47% 0.56%

In [55]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson", "xkcd:green"]
contrib_pri_responses.plot.barh(stacked=True, ax=ax, figsize=[12,8], color=colors)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[27:].replace(".", " ") # cut off "OSS.USER.PRIORITIES."
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)

plt.title("When thinking about whether to *contribute* to an open source project,\nhow important are the following things?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.1), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


How often do you try to find open source options over other kinds of software?

SEEK.OPEN.SOURCE


In [56]:
count_df = pd.DataFrame(data=survey_df['SEEK.OPEN.SOURCE'].value_counts())
count_df.columns = ["count"]
count_df


Out[56]:
count
Always 3407
Sometimes 1111
Rarely 100
Never 25

In [57]:
prop_df = pd.DataFrame((survey_df['SEEK.OPEN.SOURCE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[57]:
percent
Always 73.38%
Sometimes 23.93%
Rarely 2.15%
Never 0.54%

In [58]:
ax = pd.DataFrame(survey_df['SEEK.OPEN.SOURCE'].value_counts()).plot(kind='barh')
plt.title("How often do you try to find open\nsource options over other kinds of software?")
t = ax.set_xlabel("Count of responses")


Open source software usability

OSS.UX: Do you believe that open source software is generally easier to use than closed source (proprietary) software, harder to use, or about the same?


In [59]:
count_df = pd.DataFrame(data=survey_df['OSS.UX'].value_counts())
count_df.columns = ["count"]
count_df


Out[59]:
count
About the same 2027
Generally easier to use 1597
Generally harder to use 897

In [60]:
prop_df = pd.DataFrame((survey_df['OSS.UX'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[60]:
percent
About the same 44.84%
Generally easier to use 35.32%
Generally harder to use 19.84%

In [61]:
ax = pd.DataFrame(survey_df['OSS.UX'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is generally\neasier to use than closed source (proprietary)\nsoftware, harder to use, or about the same?")
t = ax.set_xlabel("Count of responses")


Open source software security

OSS.SECURITY: Do you believe that open source software is generally more secure than closed source (proprietary) software, less secure, or about the same?


In [62]:
count_df = pd.DataFrame(data=survey_df['OSS.SECURITY'].value_counts())
count_df.columns = ["count"]
count_df


Out[62]:
count
Generally more secure 2688
About the same 1537
Generally less secure 295

In [63]:
prop_df = pd.DataFrame((survey_df['OSS.SECURITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[63]:
percent
Generally more secure 59.47%
About the same 34.00%
Generally less secure 6.53%

In [64]:
ax = pd.DataFrame(survey_df['OSS.SECURITY'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is\ngenerally more secure than closed source (proprietary)\nsoftware, less secure, or about the same?")
t = ax.set_xlabel("Count of responses")


Open source software stability

OSS.STABILITY: Do you believe that open source software is generally more stable than closed source (proprietary) software, less stable, or about the same?


In [65]:
count_df = pd.DataFrame(data=survey_df['OSS.STABILITY'].value_counts())
count_df.columns = ["count"]
count_df


Out[65]:
count
About the same 2240
Generally more stable 1399
Generally less stable 877

In [66]:
prop_df = pd.DataFrame((survey_df['OSS.STABILITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[66]:
percent
About the same 49.60%
Generally more stable 30.98%
Generally less stable 19.42%

In [67]:
pd.DataFrame(survey_df['OSS.STABILITY'].value_counts()).plot(kind='barh')
plt.title("Do you believe that open source software is\ngenerally more stable than closed source\n(proprietary), less stable, or about the same?")
t = ax.set_xlabel("Count of responses")



In [ ]:

Identification with open source

How much do you agree or disagree with the following statements:

  • EXTERNAL.EFFICACY: The open source community values contributions from people like me.
  • INTERNAL.EFFICACY: I have the skills and understanding necessary to make meaningful contributions to open source projects.
  • OSS.IDENTIFICATION: I consider myself to be a member of the open source (and/or the Free/Libre software) community.

In [68]:
oss_id_vars = ["INTERNAL.EFFICACY", "EXTERNAL.EFFICACY", "OSS.IDENTIFICATION"]

In [69]:
oss_id_responses = survey_df[oss_id_vars].apply(pd.Series.value_counts).transpose()

#contrib_type_responses.columns = ["Not at all important", "Slightly important","Don't know", "Somewhat important", "Very important"]
oss_id_responses = oss_id_responses[["Strongly agree",
                                     "Somewhat agree",
                                     "Neither agree nor disagree",
                                     "Somewhat disagree",
                                     "Strongly disagree"]]
oss_id_responses = oss_id_responses.sort_values(by="Strongly agree")
oss_id_responses


Out[69]:
Strongly agree Somewhat agree Neither agree nor disagree Somewhat disagree Strongly disagree
EXTERNAL.EFFICACY 1518 1610 1116 150 58
OSS.IDENTIFICATION 1579 1513 863 351 150
INTERNAL.EFFICACY 2052 1685 418 240 62

In [70]:
oss_id_responses_prop = survey_df[oss_id_vars].apply(pd.Series.value_counts, normalize=True).round(4) * 100 
oss_id_responses_prop.transpose()


Out[70]:
Neither agree nor disagree Somewhat agree Somewhat disagree Strongly agree Strongly disagree
INTERNAL.EFFICACY 9.38% 37.81% 5.38% 46.04% 1.39%
EXTERNAL.EFFICACY 25.07% 36.16% 3.37% 34.10% 1.30%
OSS.IDENTIFICATION 19.37% 33.95% 7.88% 35.44% 3.37%

In [71]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.coolwarm
colors = ["xkcd:darkblue", "xkcd:lightblue", "xkcd:beige", "xkcd:salmon", "xkcd:crimson"]
oss_id_responses.plot.barh(stacked=True, ax=ax, figsize=[12,5], cmap=matplotlib.cm.coolwarm, edgecolor='black', linewidth=1)

#print(str(ax.get_yticklabels()))

ax.set_yticklabels(["The open source community values\ncontributions from people like me.",
                    "I consider myself to be a member\nof the open source (and/or the\nFree/Libre software) community.",
                    "I have the skills and understanding\nnecessary to make meaningful\ncontributions to open source projects."])


plt.title("How much do you agree or disagree with the following statements:")

plt.xlabel("Number of responses")


legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.25), ncol=2, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


Transparency vs privacy

Attribution

TRANSPARENCY.PRIVACY.BELIEFS: Which of the following statements is closest to your beliefs about attribution in software development?

  • Records of authorship should be required so that end users know who created the source code they are working with.
  • People should be able to contribute code without attribution, if they wish to remain anonymous.

In [72]:
counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts())
counts_df.columns=["count"]
counts_df


Out[72]:
count
People should be able to contribute code without attribution, if they wish to remain anonymous. 2454
Records of authorship should be required so that end users know who created the source code they are working with. 1594

In [73]:
prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[73]:
percent
People should be able to contribute code without attribution, if they wish to remain anonymous. 60.62%
Records of authorship should be required so that end users know who created the source code they are working with. 39.38%

In [74]:
ax = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.BELIEFS'].value_counts()).plot(kind='barh', figsize=[10,6])
plt.title("Which of the following statements is closest to your\nbeliefs about attribution in software development?")
ax.set_yticklabels(["People should be able to contribute\ncode without attribution, if\nthey wish to remain anonymous.",
                    "Records of authorship should be\nrequired so that end users know\nwho created the source code they are working with."])
t = ax.set_xlabel("Count of responses")


In general, how much information about you is publicly available online?

INFO.AVAILABILITY


In [75]:
count_df = pd.DataFrame(survey_df['INFO.AVAILABILITY'].value_counts())
count_df.columns=["count"]
count_df


Out[75]:
count
Some information about me 1776
A little information about me 1133
A lot of information about me 1011
No information at all about me 140

In [76]:
prop_df = pd.DataFrame((survey_df['INFO.AVAILABILITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[76]:
percent
Some information about me 43.74%
A little information about me 27.91%
A lot of information about me 24.90%
No information at all about me 3.45%

In [77]:
ax = pd.DataFrame(survey_df['INFO.AVAILABILITY'].value_counts()).plot(kind='barh')
plt.title("In general, how much information about\nyou is publicly available online?")
t = ax.set_xlabel("Count of responses")


Do you feel that you need to make information available about yourself online for professional reasons?

INFO.JOB


In [78]:
count_df = pd.DataFrame(survey_df['INFO.JOB'].value_counts())
count_df.columns = ["count"]
count_df


Out[78]:
count
Yes 2327
No 1638

In [79]:
prop_df = pd.DataFrame((survey_df['INFO.JOB'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[79]:
percent
Yes 58.69%
No 41.31%

In [80]:
ax = pd.DataFrame(survey_df['INFO.JOB'].value_counts()).plot(kind='barh', figsize=[10,6])
plt.title("Do you feel that you need to make information available\nabout yourself online for professional reasons?")
t = ax.set_xlabel("Count of responses")


General privacy practices

TRANSPARENCY.PRIVACY.PRACTICES.GENERAL

"Which of the following best describes your practices around publishing content online, such as posts on social media (e.g. Facebook, Instagram, Twitter, etc.), blogs, and other platforms (not including contributions to open source projects)?" (single choice)


In [81]:
counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.GENERAL'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[81]:
count
I include my real name. 1718
I usually use a consistent pseudonym that is easily linked to my real name online. 1141
I don't publish this kind of content online. 517
I usually use a consistent pseudonym that is not linked anywhere with my real name online. 363
I take precautions to use different pseudonymns on different platforms. 270

In [82]:
prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.PRACTICES.GENERAL'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[82]:
percent
I include my real name. 42.85%
I usually use a consistent pseudonym that is easily linked to my real name online. 28.46%
I don't publish this kind of content online. 12.90%
I usually use a consistent pseudonym that is not linked anywhere with my real name online. 9.05%
I take precautions to use different pseudonymns on different platforms. 6.73%

In [83]:
plot_counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.GENERAL'].value_counts())
idx = ['I include my real name.',
       'I usually use a consistent pseudonym that\nis easily linked to my real name online.',
       'I don\'t publish this kind of content online.',
       'I usually use a consistent pseudonym that\nis not linked anywhere with my real name online.',
       'I take precautions to use different\npseudonymns on different platforms.']
plot_counts_df.index = idx

In [84]:
ax = plot_counts_df.plot(kind='barh', figsize=[12,6])
plt.title("Which of the following best describes your\npractices around publishing content online [...] \nnot including contributions to open source projects?")
t = ax.set_xlabel("Count of responses")


OSS privacy practices

"Which of the following best describes your practices when making open source contributions?"


In [85]:
counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.OSS'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[85]:
count
I include my real name. 1845
I usually contribute using a consistent pseudonym that is easily linked to my real name online. 766
I usually contribute using a consistent pseudonym that is not linked anywhere with my real name online. 273
I take precautions to use different usernames in different projects. 42

In [86]:
prop_df = pd.DataFrame((survey_df['TRANSPARENCY.PRIVACY.PRACTICES.OSS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[86]:
percent
I include my real name. 63.06%
I usually contribute using a consistent pseudonym that is easily linked to my real name online. 26.18%
I usually contribute using a consistent pseudonym that is not linked anywhere with my real name online. 9.33%
I take precautions to use different usernames in different projects. 1.44%

In [87]:
plot_counts_df = pd.DataFrame(survey_df['TRANSPARENCY.PRIVACY.PRACTICES.OSS'].value_counts())
idx = ['I include my real name.',
       'I usually use a consistent pseudonym that\nis easily linked to my real name online.',
       'I usually use a consistent pseudonym that\nis not linked anywhere with my real name online.',
       'I take precautions to use different\npseudonymns on different platforms.']
plot_counts_df.index = idx

In [88]:
ax = plot_counts_df.plot(kind='barh', figsize=[12,6])
plt.title("Which of the following best describes your\npractices when making open source contributions?")
t = ax.set_xlabel("Count of responses")


Mentorship / Help

RECEIVED.HELP


In [89]:
counts_df = pd.DataFrame(survey_df['RECEIVED.HELP'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[89]:
count
Yes 2845
No 1064

In [90]:
prop_df = pd.DataFrame((survey_df['RECEIVED.HELP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[90]:
percent
Yes 72.78%
No 27.22%

In [91]:
ax = pd.DataFrame(survey_df['RECEIVED.HELP'].value_counts()).plot(kind='barh')
plt.title("Have you ever received any kind of help from other people\nrelated to using or contributing to an open source project?")
t = ax.set_xlabel("Count of responses")


Thinking of the most recent case where someone helped you, how did you find someone to help you?


In [92]:
counts_df = pd.DataFrame(survey_df['FIND.HELPER'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[92]:
count
I asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and someone responded. 2057
I asked a specific person for help. 403
Someone offered me unsolicited help. 272
Other - Please describe 64

In [93]:
prop_df = pd.DataFrame((survey_df['FIND.HELPER'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[93]:
percent
I asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and someone responded. 73.57%
I asked a specific person for help. 14.41%
Someone offered me unsolicited help. 9.73%
Other - Please describe 2.29%

In [94]:
ax = pd.DataFrame(survey_df['FIND.HELPER'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
plt.title("How did you find someone to help you?")
t = ax.set_yticklabels(['I asked for help in a public forum\n(e.g. in a GitHub Issue, project mailing list, etc.)\nand someone responded.',
 'I asked a specific person for help.',
 'Someone offered me unsolicited help.',
 'Other - Please describe'])


Which best describes your prior relationship with the person who helped you?

HELPER.PRIOR.RELATIONSHIP


In [95]:
counts_df = pd.DataFrame(survey_df['HELPER.PRIOR.RELATIONSHIP'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[95]:
count
Total strangers, I didn't know of them previously. 1565
I knew of them through their contributions to projects, but didn't know them personally. 809
We knew each other a little. 211
We knew each other well. 208

In [96]:
prop_df = pd.DataFrame((survey_df['HELPER.PRIOR.RELATIONSHIP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[96]:
percent
Total strangers, I didn't know of them previously. 56.03%
I knew of them through their contributions to projects, but didn't know them personally. 28.97%
We knew each other a little. 7.55%
We knew each other well. 7.45%

In [97]:
ax = pd.DataFrame(survey_df['HELPER.PRIOR.RELATIONSHIP'].value_counts()).plot(kind='barh')
plt.title("Which best describes your prior\nrelationship with the person who helped you?")
t = ax.set_xlabel("Count of responses")


What kind of problem did they help you with?

RECEIVED.HELP.TYPE


In [98]:
counts_df = pd.DataFrame(survey_df['RECEIVED.HELP.TYPE'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[98]:
count
Writing code or otherwise implementing ideas. 1633
Installing or using an application. 820
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively). 181
Other (please describe) 142
Introductions to other people 13

In [99]:
prop_df = pd.DataFrame((survey_df['RECEIVED.HELP.TYPE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[99]:
percent
Writing code or otherwise implementing ideas. 58.55%
Installing or using an application. 29.40%
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively). 6.49%
Other (please describe) 5.09%
Introductions to other people 0.47%

In [100]:
ax = pd.DataFrame(survey_df['RECEIVED.HELP.TYPE'].value_counts()).plot(kind='barh')
plt.title("What kind of problem did they help you with?")
t = ax.set_xlabel("Count of responses")

t = ax.set_yticklabels(['Writing code or otherwise implementing ideas.',
       'Installing or using an application.',
       'Understanding community norms (e.g. how to submit\na contribution, how to communicate effectively).',
       'Other (please describe)', 'Introductions to other people'])


Have you ever provided help for another person on an open source project?

PROVIDED.HELP


In [101]:
counts_df = pd.DataFrame(survey_df['PROVIDED.HELP'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[101]:
count
Yes 2891
No 1013

In [102]:
prop_df = pd.DataFrame((survey_df['PROVIDED.HELP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[102]:
percent
Yes 74.05%
No 25.95%

In [103]:
ax = pd.DataFrame(survey_df['PROVIDED.HELP'].value_counts()).plot(kind='barh')
plt.title("Have you ever provided help for another\nperson on an open source project?")
t = ax.set_xlabel("Count of responses")



In [ ]:

Thinking of the most recent case where you helped someone, how did you come to help this person?

FIND.HELPEE


In [104]:
counts_df = pd.DataFrame(survey_df['FIND.HELPEE'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[104]:
count
They asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and I responded. 1839
They asked me directly for help. 566
I reached out to them to offer unsolicited help. 405
Other (please describe) 28

In [105]:
prop_df = pd.DataFrame((survey_df['FIND.HELPEE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[105]:
percent
They asked for help in a public forum (e.g. in a GitHub Issue, project mailing list, etc.) and I responded. 64.80%
They asked me directly for help. 19.94%
I reached out to them to offer unsolicited help. 14.27%
Other (please describe) 0.99%

In [106]:
ax = pd.DataFrame(survey_df['FIND.HELPEE'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How did you come to help this person?")
t = ax.set_yticklabels(['They asked for help in a public forum\n(e.g. in a GitHub Issue, project mailing list, etc.)\n and I responded.',
       'They asked me directly for help.',
       'I reached out to them to offer unsolicited help.',
       'Other (please describe)'])


Which best describes your prior relationship with the person you helped?

HELPEE.PRIOR.RELATIONSHIP


In [107]:
counts_df = pd.DataFrame(survey_df['HELPEE.PRIOR.RELATIONSHIP'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[107]:
count
Total strangers, I didn't know of them previously. 1984
We knew each other well. 292
I knew of them through their contributions to projects, but didn't know them personally. 288
We knew each other a little. 275

In [108]:
prop_df = pd.DataFrame((survey_df['HELPEE.PRIOR.RELATIONSHIP'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[108]:
percent
Total strangers, I didn't know of them previously. 69.88%
We knew each other well. 10.29%
I knew of them through their contributions to projects, but didn't know them personally. 10.14%
We knew each other a little. 9.69%

In [109]:
ax = pd.DataFrame(survey_df['HELPEE.PRIOR.RELATIONSHIP'].value_counts()).plot(kind='barh')
plt.title("Which best describes your prior\nrelationship with the person you helped?")
t = ax.set_xlabel("Count of responses")


What kind of problem did you help them with?

PROVIDED.HELP.TYPE


In [110]:
counts_df = pd.DataFrame(survey_df['PROVIDED.HELP.TYPE'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[110]:
count
Writing code or otherwise implementing ideas. 1602
Installing or using an application. 1028
Other (please describe) 101
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively). 99
Introductions to other people. 8

In [111]:
prop_df = pd.DataFrame((survey_df['PROVIDED.HELP.TYPE'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[111]:
percent
Writing code or otherwise implementing ideas. 56.45%
Installing or using an application. 36.22%
Other (please describe) 3.56%
Understanding community norms (e.g. how to submit a contribution, how to communicate effectively). 3.49%
Introductions to other people. 0.28%

In [112]:
ax = pd.DataFrame(survey_df['PROVIDED.HELP.TYPE'].value_counts()).plot(kind='barh')
plt.title("What kind of problem did you help them with?")
t = ax.set_xlabel("Count of responses")
t = ax.set_yticklabels(['Writing code or otherwise implementing ideas.',
       'Installing or using an application.', 'Other (please describe)',
       'Understanding community norms (e.g. how to\nsubmit a contribution, how to communicate effectively).',
       'Introductions to other people.'])


Open Source Software in Paid Work

Do you contribute to open source as part of your professional work?

OSS.AS.JOB: Do you contribute to open source as part of your professional work? In other words, are you paid for any of your time spent on open source contributions?

  • Yes, indirectly- I contribute to open source in carrying out my work duties, but I am not required or expected to do so.
  • No.
  • Yes, directly- some or all of my work duties include contributing to open source projects.

In [113]:
counts_df = pd.DataFrame(survey_df['OSS.AS.JOB'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[113]:
count
Yes, indirectly- I contribute to open source in carrying out my work duties, but I am not required or expected to do so. 896
No. 687
Yes, directly- some or all of my work duties include contributing to open source projects. 464

In [114]:
prop_df = pd.DataFrame((survey_df['OSS.AS.JOB'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[114]:
percent
Yes, indirectly- I contribute to open source in carrying out my work duties, but I am not required or expected to do so. 43.77%
No. 33.56%
Yes, directly- some or all of my work duties include contributing to open source projects. 22.67%

In [115]:
oss_as_job_df = pd.DataFrame(survey_df['OSS.AS.JOB'].value_counts())
oss_as_job_df.index = ["Yes, indirectly", "No", "Yes, directly"]
ax = oss_as_job_df.plot(kind='barh')
plt.title("Do you contribute to open source\nas part of your professional work?")
t = ax.set_xlabel("Count of responses")


How often do you use open source software in your professional work?

OSS.AT.WORK


In [116]:
counts_df = pd.DataFrame(survey_df['OSS.AT.WORK'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[116]:
count
Frequently 2191
Sometimes 300
Rarely 110
Never 65

In [117]:
prop_df = pd.DataFrame((survey_df['OSS.AT.WORK'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[117]:
percent
Frequently 82.18%
Sometimes 11.25%
Rarely 4.13%
Never 2.44%

In [118]:
ax = pd.DataFrame(survey_df['OSS.AT.WORK'].value_counts()).plot(kind='barh')
plt.title("How often do you use open source\nsoftware in your professional work?")
t = ax.set_xlabel("Count of responses")


How does your employer's intellectual property agreement/policy affect your free-time contributions to open source unrelated to your work?

OSS.IP.POLICY


In [119]:
counts_df = pd.DataFrame(survey_df['OSS.IP.POLICY'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[119]:
count
I am free to contribute without asking for permission. 1178
My employer doesn't have a clear policy on this. 695
I am permitted to contribute to open source, but need to ask for permission. 287
I'm not sure. 238
Not applicable 180
I am not permitted to contribute to open source at all. 63

In [120]:
prop_df = pd.DataFrame((survey_df['OSS.IP.POLICY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[120]:
percent
I am free to contribute without asking for permission. 44.60%
My employer doesn't have a clear policy on this. 26.32%
I am permitted to contribute to open source, but need to ask for permission. 10.87%
I'm not sure. 9.01%
Not applicable 6.82%
I am not permitted to contribute to open source at all. 2.39%

In [121]:
ax = pd.DataFrame(survey_df['OSS.IP.POLICY'].value_counts()).plot(kind='barh')
plt.title("How does your employer's intellectual property\nagreement/policy affect your free-time contributions\nto open source unrelated to your work?")
t = ax.set_xlabel("Count of responses")


Which is closest to your employer’s policy on using open source software applications?


In [122]:
counts_df = pd.DataFrame(survey_df['EMPLOYER.POLICY.APPLICATIONS'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[122]:
count
Use of open source applications is encouraged. 1174
Use of open source applications is acceptable if it is the most appropriate tool. 916
My employer doesn't have a clear policy on this. 338
Not applicable 88
I'm not sure. 83
Use of open source applications is rarely, if ever, permitted. 42

In [123]:
prop_df = pd.DataFrame((survey_df['EMPLOYER.POLICY.APPLICATIONS'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[123]:
percent
Use of open source applications is encouraged. 44.45%
Use of open source applications is acceptable if it is the most appropriate tool. 34.68%
My employer doesn't have a clear policy on this. 12.80%
Not applicable 3.33%
I'm not sure. 3.14%
Use of open source applications is rarely, if ever, permitted. 1.59%

In [124]:
ax = pd.DataFrame(survey_df['EMPLOYER.POLICY.APPLICATIONS'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Which is closest to your employer’s policy\non using open source software applications?")


How important do you think your involvement in open source was to getting your current job?

OSS.HIRING


In [125]:
counts_df = pd.DataFrame(survey_df['OSS.HIRING'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[125]:
count
Very important 618
Somewhat important 448
Not at all important 361
Not too important 352
Not applicable-I hadn't made any contributions when I got this job. 254

In [126]:
prop_df = pd.DataFrame((survey_df['OSS.HIRING'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[126]:
percent
Very important 30.40%
Somewhat important 22.04%
Not at all important 17.76%
Not too important 17.31%
Not applicable-I hadn't made any contributions when I got this job. 12.49%

In [127]:
ax = pd.DataFrame(survey_df['OSS.HIRING'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How important do you think your involvement\nin open source was to getting your current job?")


Demographics

Do you currently live in a country other than the one in which you were born?

IMMIGRATION


In [128]:
counts_df = pd.DataFrame(survey_df['IMMIGRATION'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[128]:
count
No, I live in the country where I was born. 2764
Yes, and I intend to stay permanently. 513
Yes, and I am not sure about my future plans. 292
Yes, and I intend to stay temporarily. 165

In [129]:
prop_df = pd.DataFrame((survey_df['IMMIGRATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[129]:
percent
No, I live in the country where I was born. 74.02%
Yes, and I intend to stay permanently. 13.74%
Yes, and I am not sure about my future plans. 7.82%
Yes, and I intend to stay temporarily. 4.42%

In [130]:
ax = pd.DataFrame(survey_df['IMMIGRATION'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Do you currently live in a country other\nthan the one in which you were born?")


Thinking of where you were born, are you a member of an ethnicity or nationality that is a considered a minority in that country?

MINORITY.HOMECOUNTRY


In [131]:
counts_df = pd.DataFrame(survey_df['MINORITY.HOMECOUNTRY'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[131]:
count
No 754
Yes 124
Not sure 45
Prefer not to say 34

In [132]:
prop_df = pd.DataFrame((survey_df['MINORITY.HOMECOUNTRY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[132]:
percent
No 78.79%
Yes 12.96%
Not sure 4.70%
Prefer not to say 3.55%

In [133]:
ax = pd.DataFrame(survey_df['MINORITY.HOMECOUNTRY'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Thinking of where you were born, are\nyou a member of an ethnicity or nationality that\nis a considered a minority in that country?")



In [ ]:

Thinking of where you currently live, are you a member of an ethnicity or nationality that is a considered a minority in that country?

MINORITY.CURRENT.COUNTRY


In [134]:
counts_df = pd.DataFrame(survey_df['MINORITY.CURRENT.COUNTRY'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[134]:
count
No 2837
Yes 546
Not sure 193
Prefer not to say 156

In [135]:
prop_df = pd.DataFrame((survey_df['MINORITY.CURRENT.COUNTRY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[135]:
percent
No 76.02%
Yes 14.63%
Not sure 5.17%
Prefer not to say 4.18%

In [136]:
ax = pd.DataFrame(survey_df['MINORITY.CURRENT.COUNTRY'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Thinking of where you currently live, are you\na member of an ethnicity or nationality that is a\nconsidered a minority in that country?")


What is your gender?

GENDER


In [137]:
counts_df = pd.DataFrame(survey_df['GENDER'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[137]:
count
Man 3387
Prefer not to say 173
Woman 125
Non-binary or Other 39

In [138]:
prop_df = pd.DataFrame((survey_df['GENDER'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[138]:
percent
Man 90.95%
Prefer not to say 4.65%
Woman 3.36%
Non-binary or Other 1.05%

In [139]:
ax = pd.DataFrame(survey_df['GENDER'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is your gender?")


Do you identify as transgender?

TRANSGENDER.IDENTITY


In [140]:
counts_df = pd.DataFrame(survey_df['TRANSGENDER.IDENTITY'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[140]:
count
No 3494
Prefer not to say 158
Yes 33
Not sure 30

In [141]:
prop_df = pd.DataFrame((survey_df['TRANSGENDER.IDENTITY'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[141]:
percent
No 94.05%
Prefer not to say 4.25%
Yes 0.89%
Not sure 0.81%

In [142]:
ax = pd.DataFrame(survey_df['TRANSGENDER.IDENTITY'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Do you identify as transgender?")



In [ ]:

Do you identify as gay, lesbian, or bisexual, asexual, or any other minority sexual orientation?

SEXUAL.ORIENTATION


In [143]:
counts_df = pd.DataFrame(survey_df['SEXUAL.ORIENTATION'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[143]:
count
No 3187
Yes 246
Prefer not to say 201
Not sure 85

In [144]:
prop_df = pd.DataFrame((survey_df['SEXUAL.ORIENTATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[144]:
percent
No 85.70%
Yes 6.61%
Prefer not to say 5.40%
Not sure 2.29%

In [145]:
ax = pd.DataFrame(survey_df['SEXUAL.ORIENTATION'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("Do you identify as gay, lesbian, or bisexual,\nasexual, or any other minority sexual orientation?")


How well can you read and write in English?

WRITTEN.ENGLISH


In [146]:
counts_df = pd.DataFrame(survey_df['WRITTEN.ENGLISH'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[146]:
count
Very well 2865
Moderately well 742
Not very well 108
Not at all 6

In [147]:
prop_df = pd.DataFrame((survey_df['WRITTEN.ENGLISH'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[147]:
percent
Very well 77.00%
Moderately well 19.94%
Not very well 2.90%
Not at all 0.16%

In [148]:
ax = pd.DataFrame(survey_df['WRITTEN.ENGLISH'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How well can you read and write in English?")



In [ ]:

What is your age?

AGE


In [149]:
counts_df = pd.DataFrame(survey_df['AGE'].value_counts().sort_index())
counts_df.columns = ["count"]
counts_df


Out[149]:
count
17 or younger 139
18 to 24 years 871
25 to 34 years 1400
35 to 44 years 772
45 to 54 years 267
55 to 64 years 93
65 years or older 36

In [150]:
prop_df = pd.DataFrame((survey_df['AGE'].value_counts(normalize=True).round(4)*100).sort_index())
prop_df.columns=["percent"]
prop_df


Out[150]:
percent
17 or younger 3.88%
18 to 24 years 24.34%
25 to 34 years 39.13%
35 to 44 years 21.58%
45 to 54 years 7.46%
55 to 64 years 2.60%
65 years or older 1.01%

In [151]:
ax = pd.DataFrame(survey_df['AGE'].value_counts().sort_index()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is your age?")



In [ ]:

What is highest level of formal education that you have completed?

FORMAL.EDUCATION


In [152]:
counts_df = pd.DataFrame(survey_df['FORMAL.EDUCATION'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[152]:
count
Bachelor's degree 1321
Master's degree 852
Some college, no degree 640
Secondary (high) school graduate or equivalent 375
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.) 256
Vocational/trade program or apprenticeship 127
Less than secondary (high) school 126

In [153]:
prop_df = pd.DataFrame((survey_df['FORMAL.EDUCATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[153]:
percent
Bachelor's degree 35.73%
Master's degree 23.05%
Some college, no degree 17.31%
Secondary (high) school graduate or equivalent 10.14%
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.) 6.92%
Vocational/trade program or apprenticeship 3.44%
Less than secondary (high) school 3.41%

In [154]:
order = ["Less than secondary (high) school",
         "Secondary (high) school graduate or equivalent",
         "Vocational/trade program or apprenticeship",
         "Some college, no degree",
         "Bachelor's degree",
         "Master's degree",
         "Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)"]

edu_counts_df = survey_df['FORMAL.EDUCATION'].value_counts()[order]

ax = edu_counts_df.plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is highest level of formal education that you have completed?")



In [ ]:

What is the highest level of formal education that either of your parents completed?

PARENTS.FORMAL.EDUCATION


In [155]:
counts_df = pd.DataFrame(survey_df['PARENTS.FORMAL.EDUCATION'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[155]:
count
Bachelor's degree 961
Master's degree 871
Secondary (high) school graduate or equivalent 566
Some college, no degree 388
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.) 387
Vocational/trade program or apprenticeship 257
Less than secondary (high) school 243

In [156]:
prop_df = pd.DataFrame((survey_df['PARENTS.FORMAL.EDUCATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[156]:
percent
Bachelor's degree 26.16%
Master's degree 23.71%
Secondary (high) school graduate or equivalent 15.41%
Some college, no degree 10.56%
Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.) 10.54%
Vocational/trade program or apprenticeship 7.00%
Less than secondary (high) school 6.62%

In [157]:
order = ["Less than secondary (high) school",
         "Secondary (high) school graduate or equivalent",
         "Vocational/trade program or apprenticeship",
         "Some college, no degree",
         "Bachelor's degree",
         "Master's degree",
         "Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)"]

edu_counts_df = survey_df['PARENTS.FORMAL.EDUCATION'].value_counts()[order]

ax = edu_counts_df.plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("What is highest level of formal education that either of your parents completed?")



In [ ]:


In [ ]:

How old were you when you first had regular access to a computer with an internet connection?

AGE.AT.FIRST.COMPUTER.INTERNET


In [158]:
counts_df = pd.DataFrame(survey_df['AGE.AT.FIRST.COMPUTER.INTERNET'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[158]:
count
Younger than 13 years old 1478
13 - 17 years old 1313
18 - 24 years old 695
25 - 45 years old 202
Older than 45 years old 23

In [159]:
prop_df = pd.DataFrame((survey_df['AGE.AT.FIRST.COMPUTER.INTERNET'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[159]:
percent
Younger than 13 years old 39.83%
13 - 17 years old 35.38%
18 - 24 years old 18.73%
25 - 45 years old 5.44%
Older than 45 years old 0.62%

In [160]:
ax = pd.DataFrame(survey_df['AGE.AT.FIRST.COMPUTER.INTERNET'].value_counts()).plot(kind='barh')
ax.set_xlabel("Count of responses")
t = plt.title("How old were you when you first had regular\naccess to a computer with an internet connection?")



In [ ]:

Where did you first have regular access to a computer with internet connection?

LOCATION.OF.FIRST.COMPUTER.INTERNET


In [161]:
counts_df = pd.DataFrame(survey_df['LOCATION.OF.FIRST.COMPUTER.INTERNET'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[161]:
count
At home (belonging to me or a family member) 2520
In a classroom, computer lab, or library at school 746
At an internet cafe or similar space 182
Other (please describe) 106
At a public library or community center 87
At work (recoded from open ended) 70

In [162]:
prop_df = pd.DataFrame((survey_df['LOCATION.OF.FIRST.COMPUTER.INTERNET'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[162]:
percent
At home (belonging to me or a family member) 67.91%
In a classroom, computer lab, or library at school 20.10%
At an internet cafe or similar space 4.90%
Other (please describe) 2.86%
At a public library or community center 2.34%
At work (recoded from open ended) 1.89%

In [163]:
ax = pd.DataFrame(survey_df['LOCATION.OF.FIRST.COMPUTER.INTERNET'].value_counts()).plot(kind='barh', figsize=[8.5,6])
ax.set_xlabel("Count of responses")
t = plt.title("Where did you first have regular access to a\ncomputer with internet connection?")


Where was the respondent surveyed from?

POPLATION


In [164]:
counts_df = pd.DataFrame(survey_df['POPULATION'].value_counts())
counts_df.columns = ["count"]
counts_df


Out[164]:
count
github 5495
off site community 534

In [165]:
prop_df = pd.DataFrame((survey_df['POPULATION'].value_counts(normalize=True).round(4)*100))
prop_df.columns=["percent"]
prop_df


Out[165]:
percent
github 91.14%
off site community 8.86%

In [166]:
ax = pd.DataFrame(survey_df['POPULATION'].value_counts()).plot(kind='barh', figsize=[8.5,6])
ax.set_xlabel("Count of responses")
t = plt.title("Where was the respondent surveyed from?")


Harassment / Inclusiveness of OSS

Have you ever observed any of the following in the context of an open source project?

DISCOURAGING.BEHAVIOR.*


In [167]:
discouraging_responses = survey_df[discouraging_vars].apply(pd.Series.value_counts).transpose()[["Yes", "No"]]
discouraging_responses


Out[167]:
Yes No
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE 3017 792
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION 1210 2580
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE 2195 1598
DISCOURAGING.BEHAVIOR.BAD.DOCS 3559 263
DISCOURAGING.BEHAVIOR.CONFLICT 1830 1966
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE 649 3158

In [168]:
discouraging_percent = pd.DataFrame(discouraging_responses["Yes"] / discouraging_responses.sum(axis=1) * 100, columns=["percent_yes"]).sort_values(by="percent_yes")
discouraging_percent.round(2)


Out[168]:
percent_yes
DISCOURAGING.BEHAVIOR.UNWELCOMING.LANGUAGE 17.05%
DISCOURAGING.BEHAVIOR.REJECTION.WOUT.EXPLANATION 31.93%
DISCOURAGING.BEHAVIOR.CONFLICT 48.21%
DISCOURAGING.BEHAVIOR.DISMISSIVE.RESPONSE 57.87%
DISCOURAGING.BEHAVIOR.LACK.OF.RESPONSE 79.21%
DISCOURAGING.BEHAVIOR.BAD.DOCS 93.12%

In [169]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
discouraging_responses.sort_values(by="No").plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[22:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Have you ever observed any of the following in the context of an open source project?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')



In [ ]:

Have you ever witnessed any of the following behaviors directed at another person in the context of an open source project? (not including something directed at you)

NEGATIVE.WITNESS.*


In [170]:
neg_witness_responses = neg_df[neg_witness_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_witness_responses.columns = ["Yes", "Blank"]
neg_witness_responses


Out[170]:
Yes Blank
NEGATIVE.WITNESS.RUDENESS 1753 4276
NEGATIVE.WITNESS.NAME.CALLING 789 5240
NEGATIVE.WITNESS.THREATS 162 5867
NEGATIVE.WITNESS.IMPERSONATION 177 5852
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT 237 5792
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT 175 5854
NEGATIVE.WITNESS.STALKING 108 5921
NEGATIVE.WITNESS.SEXUAL.ADVANCES 136 5893
NEGATIVE.WITNESS.STEREOTYPING 423 5606
NEGATIVE.WITNESS.DOXXING 151 5878
NEGATIVE.WITNESS.OTHER 78 5951
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE 1721 4308
NEGATIVE.WITNESS.ANY.RESPONSE 3664 2365

Only 3,664 respondents clicked any boxes in this question, meaning 2,365 did not click the "none of the above" or an option (or even get to this question). We have to adjust the no responses accordingly.


In [171]:
neg_witness_responses_adj = neg_witness_responses
neg_witness_responses_adj["Blank"] = neg_witness_responses_adj["Blank"] - 2365
neg_witness_responses_adj_df = pd.DataFrame(neg_witness_responses_adj["Yes"] / (neg_witness_responses_adj["Yes"] + neg_witness_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [172]:
neg_witness_responses_adj.columns = ["Yes", "No"]
neg_witness_responses_adj[:-1]


Out[172]:
Yes No
NEGATIVE.WITNESS.RUDENESS 1753 1911
NEGATIVE.WITNESS.NAME.CALLING 789 2875
NEGATIVE.WITNESS.THREATS 162 3502
NEGATIVE.WITNESS.IMPERSONATION 177 3487
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT 237 3427
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT 175 3489
NEGATIVE.WITNESS.STALKING 108 3556
NEGATIVE.WITNESS.SEXUAL.ADVANCES 136 3528
NEGATIVE.WITNESS.STEREOTYPING 423 3241
NEGATIVE.WITNESS.DOXXING 151 3513
NEGATIVE.WITNESS.OTHER 78 3586
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE 1721 1943

In [173]:
neg_witness_responses_adj_df.sort_values(by="percent_yes").round(2)


Out[173]:
percent_yes
NEGATIVE.WITNESS.OTHER 2.13%
NEGATIVE.WITNESS.STALKING 2.95%
NEGATIVE.WITNESS.SEXUAL.ADVANCES 3.71%
NEGATIVE.WITNESS.DOXXING 4.12%
NEGATIVE.WITNESS.THREATS 4.42%
NEGATIVE.WITNESS.CROSS.PLATFORM.HARASSMENT 4.78%
NEGATIVE.WITNESS.IMPERSONATION 4.83%
NEGATIVE.WITNESS.SUSTAINED.HARASSMENT 6.47%
NEGATIVE.WITNESS.STEREOTYPING 11.54%
NEGATIVE.WITNESS.NAME.CALLING 21.53%
NEGATIVE.WITNESS.NONE.OF.THE.ABOVE 46.97%
NEGATIVE.WITNESS.RUDENESS 47.84%
NEGATIVE.WITNESS.ANY.RESPONSE 100.00%

In [174]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_witness_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[17:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Have you ever witnessed any of the following behaviors\ndirected at another person in the context of an open source\nproject? (not including something directed at you)")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


Have you ever experienced any of the following behaviors directed at you in the context of an open source project?


In [175]:
neg_exp_responses = neg_df[neg_exp_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_exp_responses.columns = ["Yes", "Blank"]
neg_exp_responses


Out[175]:
Yes Blank
NEGATIVE.EXPERIENCE.RUDENESS 646 5383
NEGATIVE.EXPERIENCE.NAME.CALLING 192 5837
NEGATIVE.EXPERIENCE.THREATS 43 5986
NEGATIVE.EXPERIENCE.IMPERSONATION 45 5984
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT 55 5974
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT 42 5987
NEGATIVE.EXPERIENCE.STALKING 35 5994
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES 25 6004
NEGATIVE.EXPERIENCE.STEREOTYPING 114 5915
NEGATIVE.EXPERIENCE.DOXXING 23 6006
NEGATIVE.EXPERIENCE.OTHER 39 5990
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE 2900 3129
NEGATIVE.EXPERIENCE.ANY.RESPONSE 3638 2391

Only 3,638 respondents clicked any boxes in this question, meaning 2,391 did not click the "none of the above" (or even get to this question). We have to adjust the no responses accordingly.


In [176]:
neg_exp_responses_adj = neg_exp_responses
neg_exp_responses_adj["Blank"] = neg_exp_responses_adj["Blank"] - 2391
neg_exp_responses_adj_df = pd.DataFrame(neg_exp_responses["Yes"] / (neg_exp_responses_adj["Yes"] + neg_exp_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [177]:
neg_exp_responses_adj.columns = ["Yes", "No"]
neg_exp_responses_adj[:-1]


Out[177]:
Yes No
NEGATIVE.EXPERIENCE.RUDENESS 646 2992
NEGATIVE.EXPERIENCE.NAME.CALLING 192 3446
NEGATIVE.EXPERIENCE.THREATS 43 3595
NEGATIVE.EXPERIENCE.IMPERSONATION 45 3593
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT 55 3583
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT 42 3596
NEGATIVE.EXPERIENCE.STALKING 35 3603
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES 25 3613
NEGATIVE.EXPERIENCE.STEREOTYPING 114 3524
NEGATIVE.EXPERIENCE.DOXXING 23 3615
NEGATIVE.EXPERIENCE.OTHER 39 3599
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE 2900 738

In [178]:
neg_exp_responses_adj_df.sort_values(by="percent_yes").round(2)


Out[178]:
percent_yes
NEGATIVE.EXPERIENCE.DOXXING 0.63%
NEGATIVE.EXPERIENCE.SEXUAL.ADVANCES 0.69%
NEGATIVE.EXPERIENCE.STALKING 0.96%
NEGATIVE.EXPERIENCE.OTHER 1.07%
NEGATIVE.EXPERIENCE.CROSS.PLATFORM.HARASSMENT 1.15%
NEGATIVE.EXPERIENCE.THREATS 1.18%
NEGATIVE.EXPERIENCE.IMPERSONATION 1.24%
NEGATIVE.EXPERIENCE.SUSTAINED.HARASSMENT 1.51%
NEGATIVE.EXPERIENCE.STEREOTYPING 3.13%
NEGATIVE.EXPERIENCE.NAME.CALLING 5.28%
NEGATIVE.EXPERIENCE.RUDENESS 17.76%
NEGATIVE.EXPERIENCE.NONE.OF.THE.ABOVE 79.71%
NEGATIVE.EXPERIENCE.ANY.RESPONSE 100.00%

In [179]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_exp_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[20:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Have you ever experienced any of the following behaviors\ndirected at you in the context of an open source project?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


Thinking of the last time you experienced harassment, how did you respond?

NEGATIVE.RESPONSE.*


In [180]:
neg_resp_responses = neg_df[neg_resp_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_resp_responses.columns = ["Yes", "Blank"]
neg_resp_responses


Out[180]:
Yes Blank
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP 194 5835
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT 112 5917
NEGATIVE.RESPONSE.BLOCKED.USER 170 5859
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS 95 5934
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP 20 6009
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL 8 6021
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT 9 6020
NEGATIVE.RESPONSE.OTHER 71 5958
NEGATIVE.RESPONSE.IGNORED 350 5679
NEGATIVE.RESPONSE.ANY.RESPONSE 719 5310

Only 719 respondents clicked any boxes in this question, meaning 5,310 did not click on "I did not react / ignored the incident" or any response (or even get to this question). We have to adjust the no responses accordingly.


In [181]:
neg_resp_responses_adj = neg_resp_responses
neg_resp_responses_adj["Blank"] = neg_resp_responses_adj["Blank"] - 5310
neg_resp_responses_adj_df = pd.DataFrame(neg_resp_responses_adj["Yes"] / (neg_resp_responses_adj["Yes"] + neg_resp_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [182]:
neg_resp_responses_adj.columns = ["Yes", "No"]
neg_resp_responses_adj[:-1]


Out[182]:
Yes No
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP 194 525
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT 112 607
NEGATIVE.RESPONSE.BLOCKED.USER 170 549
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS 95 624
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP 20 699
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL 8 711
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT 9 710
NEGATIVE.RESPONSE.OTHER 71 648
NEGATIVE.RESPONSE.IGNORED 350 369

In [183]:
neg_resp_responses_adj_df.sort_values(by="percent_yes").round(2)


Out[183]:
percent_yes
NEGATIVE.RESPONSE.CONSULTED.LEGAL.COUNSEL 1.11%
NEGATIVE.RESPONSE.CONTACTED.LAW.ENFORCEMENT 1.25%
NEGATIVE.RESPONSE.REPORTED.TO.HOST.OR.ISP 2.78%
NEGATIVE.RESPONSE.OTHER 9.87%
NEGATIVE.RESPONSE.REPORTED.TO.MAINTAINERS 13.21%
NEGATIVE.RESPONSE.SOLICITED.COMMUNITY.SUPPORT 15.58%
NEGATIVE.RESPONSE.BLOCKED.USER 23.64%
NEGATIVE.RESPONSE.ASKED.USER.TO.STOP 26.98%
NEGATIVE.RESPONSE.IGNORED 48.68%
NEGATIVE.RESPONSE.ANY.RESPONSE 100.00%

In [184]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_resp_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[18:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("Thinking of the last time you experienced\nharassment, how did you respond?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')



In [ ]:

How effective were the following responses?

RESPONSE.EFFECTIVENESS.*


In [185]:
neg_effect_responses = neg_df[neg_effect_vars].apply(pd.Series.value_counts).transpose()
neg_effect_responses = neg_effect_responses.replace(np.nan, 0).sort_values(by="Mostly effective")
neg_effect_responses = neg_effect_responses[["Not at all effective", "A little effective", "Somewhat effective", "Mostly effective", "Completely effective"]]

In [186]:
idx = []
for i in neg_effect_responses.index:
    idx.append(i[23:].replace(".", " "))
neg_effect_responses.index = idx
neg_effect_responses.astype(int)


Out[186]:
Not at all effective A little effective Somewhat effective Mostly effective Completely effective
CONTACTED LAW ENFORCEMENT 4 0 2 0 3
CONSULTED LEGAL COUNSEL 1 1 3 2 1
REPORTED TO HOST OR ISP 6 4 6 3 1
OTHER 4 0 4 10 11
REPORTED TO MAINTAINERS 10 11 31 30 13
SOLICITED COMMUNITY SUPPORT 6 22 38 32 14
ASKED USER TO STOP 48 51 50 33 11
BLOCKED USER 6 20 28 56 58

In [187]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues
neg_effect_responses.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

plt.title("How effective were the following responses?\n(counting number of responses)")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')



In [188]:
neg_effect_responses_prop = neg_df[neg_effect_vars].apply(pd.Series.value_counts, normalize=True).round(4).transpose()
neg_effect_responses_prop = neg_effect_responses_prop.replace(np.nan, 0).sort_values(by="Completely effective")
neg_effect_responses_prop = neg_effect_responses_prop[["Not at all effective", "A little effective", "Somewhat effective", "Mostly effective", "Completely effective"]]
neg_effect_responses_prop = neg_effect_responses_prop * 100

In [189]:
idx = []
for i in neg_effect_responses_prop.index:
    idx.append(i[23:].replace(".", " "))
neg_effect_responses_prop.index = idx
neg_effect_responses_prop


Out[189]:
Not at all effective A little effective Somewhat effective Mostly effective Completely effective
REPORTED TO HOST OR ISP 30.00% 20.00% 30.00% 15.00% 5.00%
ASKED USER TO STOP 24.87% 26.42% 25.91% 17.10% 5.70%
SOLICITED COMMUNITY SUPPORT 5.36% 19.64% 33.93% 28.57% 12.50%
CONSULTED LEGAL COUNSEL 12.50% 12.50% 37.50% 25.00% 12.50%
REPORTED TO MAINTAINERS 10.53% 11.58% 32.63% 31.58% 13.68%
CONTACTED LAW ENFORCEMENT 44.44% 0.00% 22.22% 0.00% 33.33%
BLOCKED USER 3.57% 11.90% 16.67% 33.33% 34.52%
OTHER 13.79% 0.00% 13.79% 34.48% 37.93%

In [190]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues
neg_effect_responses_prop.plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

plt.title("How effective were the following responses?\n(proportion of responses)")

plt.xlabel("Proportion of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')


As a result of experiencing or witnessing harassment, which, if any, of the following have you done?

NEGATIVE.CONSEQUENCES.*


In [ ]:


In [191]:
neg_conseq_responses = neg_df[neg_conseq_vars].apply(pd.Series.value_counts).transpose()[[1,0]]
neg_conseq_responses.columns = ["Yes", "Blank"]
neg_conseq_responses


Out[191]:
Yes Blank
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING 390 5639
NEGATIVE.CONSEQUENCES.PSEUDONYM 50 5979
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE 166 5863
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME 48 5981
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE 79 5950
NEGATIVE.CONSEQUENCES.SUGGEST.COC 116 5913
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION 301 5728
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION 248 5781
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES 85 5944
NEGATIVE.CONSEQUENCES.OTHER 90 5939
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE 1094 4935
NEGATIVE.CONSEQUENCES.ANY.RESPONSE 1953 4076

Only 1,953 respondents clicked any boxes in this question, meaning 4,076 did not click a response or the "none of the above" option (or even get to the question). We have to adjust the no responses accordingly.


In [192]:
neg_conseq_responses_adj = neg_conseq_responses
neg_conseq_responses_adj["Blank"] = neg_conseq_responses_adj["Blank"] - 4076
neg_conseq_responses_adj_df = pd.DataFrame(neg_conseq_responses_adj["Yes"] / (neg_conseq_responses_adj["Yes"] + neg_conseq_responses_adj["Blank"]) * 100, columns=["percent_yes"])

In [193]:
neg_conseq_responses_adj.columns = ['Yes', 'No']
neg_conseq_responses_adj[:-1]


Out[193]:
Yes No
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING 390 1563
NEGATIVE.CONSEQUENCES.PSEUDONYM 50 1903
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE 166 1787
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME 48 1905
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE 79 1874
NEGATIVE.CONSEQUENCES.SUGGEST.COC 116 1837
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION 301 1652
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION 248 1705
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES 85 1868
NEGATIVE.CONSEQUENCES.OTHER 90 1863
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE 1094 859

In [194]:
neg_conseq_responses_adj_df.sort_values(by="percent_yes").round(2)


Out[194]:
percent_yes
NEGATIVE.CONSEQUENCES.CHANGE.USERNAME 2.46%
NEGATIVE.CONSEQUENCES.PSEUDONYM 2.56%
NEGATIVE.CONSEQUENCES.CHANGE.ONLINE.PRESENCE 4.05%
NEGATIVE.CONSEQUENCES.OFFLINE.CHANGES 4.35%
NEGATIVE.CONSEQUENCES.OTHER 4.61%
NEGATIVE.CONSEQUENCES.SUGGEST.COC 5.94%
NEGATIVE.CONSEQUENCES.WORK.IN.PRIVATE 8.50%
NEGATIVE.CONSEQUENCES.PUBLIC.COMMUNITY.DISCUSSION 12.70%
NEGATIVE.CONSEQUENCES.PRIVATE.COMMUNITY.DISCUSSION 15.41%
NEGATIVE.CONSEQUENCES.STOPPED.CONTRIBUTING 19.97%
NEGATIVE.CONSEQUENCES.NONE.OF.THE.ABOVE 56.02%
NEGATIVE.CONSEQUENCES.ANY.RESPONSE 100.00%

In [195]:
sns.set(style="whitegrid", font_scale=1.75)
fig, ax = plt.subplots()
cmap=matplotlib.cm.Blues_r
neg_conseq_responses_adj[:-1].sort_values(by='No').plot.barh(stacked=True, ax=ax, figsize=[12,6], cmap=cmap, edgecolor='black', linewidth=1)

labels = []
for l in ax.get_yticklabels():
    title_text = l.get_text()[22:].replace(".", " ") # cut off "CONTRIBUTOR.TYPE"
        
    labels.append(title_text)
    
ax.set_yticklabels(labels)


plt.title("As a result of experiencing or witnessing\nharassment, which, if any, of the following have you done?")

plt.xlabel("Number of responses")



legend = plt.legend(fancybox=True, loc='upper center', bbox_to_anchor=(.5, -.13), ncol=4, shadow=True)
legend.get_frame().set_edgecolor('b')
legend.get_frame().set_facecolor('white')



In [ ]: