The Hacker Within Spring 2017 survey

by R. Stuart Geiger, freely licensed under CC-BY 4.0 and the MIT license

Importing and processing data

Importing libraries



In [1]:

    
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Importing data and previewing



In [2]:

    
df = pd.read_csv("survey.tsv",sep="\t")
df[0:4]









    Out[2]:

	opt_out	R	Python	Julia	Fortran	C / C++	Go	Haskell	Rust	SQL databases	...	Web Scraping	Publication Tools (e.g., markup, LaTeX)	Documentation Tools	Hardware sensors / Internet of Things	Novel Architectures (e.g., GPUs)	Web Development	Software engineering (including unit testing)	Skill level	Personal experience	Presentation style
0	0	NaN	1	NaN	NaN	2.0	NaN	NaN	NaN	NaN	...	NaN	1,3	1.0	NaN	1,3	NaN	1,3	2.0	2.0	2.0
1	0	NaN	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	2.0	NaN	NaN	NaN	2.0	2.0	3.0
2	0	3	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	2.0	NaN	NaN	NaN	NaN	NaN	NaN	2.0	2.0	1.0
3	0	1	1,2	1.0	NaN	NaN	1.0	NaN	NaN	NaN	...	NaN	NaN	NaN	1.0	NaN	NaN	NaN	2.0	2.0	4.0

4 rows × 41 columns

Creating two dataframes: df_topics for interest/experience about topics and df_meta for questions about THW



In [3]:

    
# keep only the per-topic columns
df_topics = df.drop(['opt_out', 'Skill level', 'Personal experience', 'Presentation style'], axis=1)



In [4]:

    
# keep only the three questions about THW itself
df_meta = df[['Skill level', 'Personal experience', 'Presentation style']]

Topic interest

Each topic (e.g. Python, R, GitHub) has one cell per respondent, listing the codes of the options that respondent checked.

If someone clicked "I want this at THW", the cell contains a 1.
If someone clicked "I really want this at THW", the cell contains a 2.
If someone clicked "I know something about this...", the cell contains a 3.

The options are not mutually exclusive: if someone clicked all of them, the value would be "1,2,3", and so on.

Assumptions for calculating interest: if someone clicked only that they wanted a topic, add 1 to the topic's score. If someone clicked that they really wanted it, add 3 to the topic's score. If they clicked both, add just 3, not 4. Option 3 does not affect the interest score; it is counted separately as expertise.
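As a minimal sketch of the scoring rule above, the helpers below work on a single cell. The names `parse_codes` and `interest_score` are hypothetical and not part of the original notebook. The sketch assumes a cell is either missing (NaN), a single number such as 2.0, or a comma-separated string such as "1,3".

```python
import math


def parse_codes(cell):
    """Return the set of checked option codes (ints) in one survey cell."""
    # Missing answers are read by pandas as float NaN.
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return set()
    # Cells may be numbers (1, 2.0) or strings such as "1,3".
    return {int(float(part)) for part in str(cell).split(",") if part.strip()}


def interest_score(cell):
    """Points that one cell adds to a topic's interest score."""
    codes = parse_codes(cell)
    if 2 in codes:   # "really want" counts 3, even if "want" is also checked
        return 3
    if 1 in codes:   # "want" alone counts 1
        return 1
    return 0         # only code 3 (or nothing checked) adds no interest


examples = [float("nan"), 1, 2.0, "1,2", "1,3", "3"]
print([interest_score(cell) for cell in examples])  # [0, 1, 3, 3, 1, 0]
```

Parsing the cell into a set of integer codes makes the "both 1 and 2 still means 3 points" case explicit, instead of relying on substring matches.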



In [5]:

    
topic_interest = {}
topic_teaching = {}

for topic in df_topics:

    topic_interest[topic] = 0
    topic_teaching[topic] = 0

    for row in df_topics[topic]:

        # cells hold values like 1, 2.0, "1,3" or NaN, so compare them as text
        cell = str(row)

        # "want" (1) without "really want" (2): increment interest by 1
        if '1' in cell and '2' not in cell:
            topic_interest[topic] += 1

        # "really want" (2), with or without 1: increment interest by 3
        if '2' in cell:
            topic_interest[topic] += 3

        # "know something about this" (3): count toward expertise
        if '3' in cell:
            topic_teaching[topic] += 1
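
The same tally could also be written with pandas string methods instead of nested loops. This is only a sketch of an alternative, not the notebook's method. The frame `toy` is a made-up miniature standing in for `df_topics`, and the substring test works only because the codes are the single digits 1 to 3.

```python
import pandas as pd

# Hypothetical miniature of df_topics: three respondents, two topics.
toy = pd.DataFrame({
    "Python": ["1", "1,2", None],
    "R": [3.0, None, 1.0],
})

# Convert every cell to text; missing cells become the empty string.
text = toy.apply(lambda col: col.map(lambda v: "" if pd.isna(v) else str(v)))

# One boolean frame per option code, using plain substring matching.
wants = text.apply(lambda col: col.str.contains("1", regex=False))
really_wants = text.apply(lambda col: col.str.contains("2", regex=False))
can_teach = text.apply(lambda col: col.str.contains("3", regex=False))

# 3 points for "really want", otherwise 1 point for "want".
interest = (really_wants * 3 + (wants & ~really_wants) * 1).sum()
teaching = can_teach.sum()

print(interest.to_dict())  # {'Python': 4, 'R': 1}
print(teaching.to_dict())  # {'Python': 0, 'R': 1}
```

Both results are Series indexed by topic, so they can be sorted and plotted directly without first building a dictionary.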

Results



In [6]:

    
topic_interest_df = pd.DataFrame.from_dict(topic_interest, orient="index")
topic_interest_df.sort_values([0], ascending=False)









    Out[6]:

	0
Visualization Tools	34
Python	30
Statistical Analysis	30
Machine Learning	27
Containers (e.g. docker)	21
Parallelizing code	18
GitHub	17
SQL databases	16
Publication Tools (e.g., markup, LaTeX)	16
Linux/UNIX/bash	16
Deep Learning	16
Web Scraping	15
Open Science / Open Data platforms	14
High Performance Computing	14
Reproducible Workflows	14
C / C++	12
How open source projects are run	11
Cloud Computing	11
Hardware sensors / Internet of Things	10
Profiling / performance analysis	10
R	9
Timeseries Analysis	9
jekyll and GitHub pages	9
Mapping and Geospatial Analaysis	9
Julia	8
noSQL databases	8
Software engineering (including unit testing)	7
Textual Analysis	6
Documentation Tools	6
Meshing Tools	5
Haskell	5
Go	4
Image Analysis	4
Web Development	4
Fortran	2
Novel Architectures (e.g., GPUs)	2
Rust	1



In [7]:

    
topic_interest_df = topic_interest_df.sort_values([0], ascending=True)
topic_interest_df.plot(figsize=[8,14], kind='barh', fontsize=20)









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5d17cda160>

Topic expertise



In [8]:

    
topic_teaching_df = pd.DataFrame.from_dict(topic_teaching, orient="index")
topic_teaching_df = topic_teaching_df[topic_teaching_df[0] != 0]
topic_teaching_df.sort_values([0], ascending=False)









    Out[8]:

	0
Python	5
R	4
GitHub	3
SQL databases	3
Reproducible Workflows	3
How open source projects are run	3
Publication Tools (e.g., markup, LaTeX)	3
Linux/UNIX/bash	3
Visualization Tools	2
Machine Learning	2
Software engineering (including unit testing)	1
Novel Architectures (e.g., GPUs)	1
jekyll and GitHub pages	1
Web Scraping	1
Go	1
Statistical Analysis	1
Julia	1
Containers (e.g. docker)	1
Image Analysis	1
Mapping and Geospatial Analaysis	1
Web Development	1



In [9]:

    
topic_teaching_df = topic_teaching_df.sort_values([0], ascending=True)
topic_teaching_df.plot(figsize=[8,10], kind='barh', fontsize=20)









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5d17b2f6d8>

Meta questions about THW



In [10]:

    
# Map the numeric answer codes to readable labels. DataFrame.replace returns a
# new frame, so this avoids the SettingWithCopyWarning that an in-place replace
# on a slice of df would raise.
level_labels = {1: '1: Beginner', 2: '2: Intermediate', 3: '3: Advanced'}
style_labels = {1: "1: 100% presentation / 0% hackathon", 2: "2: 75% presentation / 25% hackathon", 3: "3: 50% presentation / 50% hackathon", 4: "4: 25% presentation / 75% hackathon", 5: "5: 100% hackathon"}
df_meta = df_meta.replace({'Personal experience': level_labels, 'Skill level': level_labels, 'Presentation style': style_labels})



In [11]:

    
df_meta = df_meta.dropna()
df_meta[0:4]









    Out[11]:

	Skill level	Personal experience	Presentation style
0	2: Intermediate	2: Intermediate	2: 75% presentation / 25% hackathon
1	2: Intermediate	2: Intermediate	3: 50% presentation / 50% hackathon
2	2: Intermediate	2: Intermediate	1: 100% presentation / 0% hackathon
3	2: Intermediate	2: Intermediate	4: 25% presentation / 75% hackathon

Personal experience with scientific computing



In [12]:

    
pe_df = df_meta['Personal experience'].value_counts(sort=False).sort_index(ascending=False)
pe_plot = pe_df.plot(kind='barh', fontsize=20, figsize=[8,4])
plt.title("What is your personal experience with scientific computing?", size=20)









    Out[12]:





<matplotlib.text.Text at 0x7f5d17a91860>

What skill level should we aim for?



In [13]:

    
skill_df = df_meta['Skill level'].value_counts(sort=False).sort_values(ascending=False)
skill_plot = skill_df.plot(kind='barh', fontsize=20, figsize=[8,4])
plt.title("What skill level should we aim for?", size=20)









    Out[13]:





<matplotlib.text.Text at 0x7f5d1798e9b0>

What should our sessions look like?



In [14]:

    
style_df = df_meta['Presentation style'].value_counts(sort=False).sort_index(ascending=False)
style_plot = style_df.plot(kind='barh', fontsize=20, figsize=[8,4])
plt.title("Session format", size=20)









    Out[14]:





<matplotlib.text.Text at 0x7f5d179002e8>


