The Hacker Within Fall 2017 survey

by R. Stuart Geiger, freely dual-licensed under CC-BY 4.0 and the MIT license

Importing and processing data

Importing libraries



In [1]:

    
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Importing data and previewing



In [2]:

    
df = pd.read_csv("thwfall2017-survey.csv")
df = df[1:]  # drop the first row under the header

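Oh no, these are some messy column names! Every topic column repeats the full 308-character question prompt before the topic name, so we clean them up by stripping the first 308 characters from each.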



In [3]:

    
df.columns[0:3]









    Out[3]:





Index(['For each of the languages, areas, and topics, please check the boxes in each column if you: \n\n\n\twant something on this at THW (check as many as you want)\n\treally want something on this at THW (check no more than 5)\n\tknow something about this and could help teach (no obligation, check as many as you want) - R',
       'For each of the languages, areas, and topics, please check the boxes in each column if you: \n\n\n\twant something on this at THW (check as many as you want)\n\treally want something on this at THW (check no more than 5)\n\tknow something about this and could help teach (no obligation, check as many as you want) - Python',
       'For each of the languages, areas, and topics, please check the boxes in each column if you: \n\n\n\twant something on this at THW (check as many as you want)\n\treally want something on this at THW (check no more than 5)\n\tknow something about this and could help teach (no obligation, check as many as you want) - Julia'],
      dtype='object')



In [4]:

    
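# The first 43 columns all start with the same 308-character prompt;
# keep only the topic name that follows it, and print the original length as a check.
for column in df.columns[:43]:
    df = df.rename(columns={column: column[308:]})
    print(len(column), column[308:])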









    



309 R
314 Python
313 Julia
315 Fortran
315 C / C++
310 Go
315 Haskell
312 Rust
321 SQL databases
323 noSQL databases
324 Machine Learning
321 Deep Learning
334 High Performance Computing
332 Containers (e.g. docker)
323 Cloud Computing
327 Visualization Tools
328 Statistical Analysis
340 Mapping and Geospatial Analaysis
324 Textual Analysis
322 Image Analysis
327 Timeseries Analysis
326 Parallelizing code
340 Profiling / performance analysis
321 Meshing Tools
323 Linux/UNIX/bash
314 GitHub
340 How open source projects are run
331 jekyll and GitHub pages
330 Reproducible Workflows
342 Open Science / Open Data platforms
320 Web Scraping
357 Publication Tools (e.g., markup languages, LaTeX)
348 Documentation (tools and best practices)
375 Hardware sensors / Internet of Things (e.g., arduino, raspberry pi)
340 Novel Architectures (e.g., GPUs)
323 Web Development
353 Software engineering (including unit testing)
313 Other
320 Other - Text
315 Other.1
322 Other - Text.1
315 Other.2
322 Other - Text.2
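
The fixed offset of 308 works here because every topic column repeats the same prompt. As a sketch of an alternative that does not depend on counting characters, the topic name can be taken as whatever follows the first " - " separator. The `prompt` string and the column names below are shortened stand-ins rather than the real survey headers, and `strip_prompt` is a hypothetical helper.

```python
import pandas as pd

# Shortened stand-in for the long question prompt (the real one is 308 characters).
prompt = "For each of the languages, areas, and topics, please check the boxes"

# Toy frame whose columns mimic the survey layout: "<prompt> - <topic>".
toy = pd.DataFrame(columns=[
    prompt + " - R",
    prompt + " - Python",
    prompt + " - Other - Text",
    "Skill level",  # a column without the prompt should stay untouched
])

def strip_prompt(name):
    """Return the text after the first ' - ', or the name unchanged if there is none."""
    head, sep, tail = name.partition(" - ")
    return tail if sep else name

toy = toy.rename(columns=strip_prompt)
print(list(toy.columns))
```

This assumes that the prompt itself contains no " - " and that columns outside the topic block do not either. The count-based loop above sidesteps that question by only touching the first 43 columns.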



In [5]:

    
topic_list = df.columns[0:37]
topic_list









    Out[5]:





Index(['R', 'Python', 'Julia', 'Fortran', 'C / C++', 'Go', 'Haskell', 'Rust',
       'SQL databases', 'noSQL databases', 'Machine Learning', 'Deep Learning',
       'High Performance Computing', 'Containers (e.g. docker)',
       'Cloud Computing', 'Visualization Tools', 'Statistical Analysis',
       'Mapping and Geospatial Analaysis', 'Textual Analysis',
       'Image Analysis', 'Timeseries Analysis', 'Parallelizing code',
       'Profiling / performance analysis', 'Meshing Tools', 'Linux/UNIX/bash',
       'GitHub', 'How open source projects are run', 'jekyll and GitHub pages',
       'Reproducible Workflows', 'Open Science / Open Data platforms',
       'Web Scraping', 'Publication Tools (e.g., markup languages, LaTeX)',
       'Documentation (tools and best practices)',
       'Hardware sensors / Internet of Things (e.g., arduino, raspberry pi)',
       'Novel Architectures (e.g., GPUs)', 'Web Development',
       'Software engineering (including unit testing)'],
      dtype='object')

Creating two dataframes: df_topics for interest/experience about topics and df_meta for questions about THW



In [6]:

    
df_topics = df[topic_list]



In [7]:

    
# Keep only the three questions about THW itself
df_meta = df[['Skill level', 'Personal experience', 'Presentation style']]

Topic interest

Each topic (e.g. Python, R, GitHub) has one cell per respondent, holding a list of codes for the boxes they checked.

If someone clicked "I want this at THW", there will be a 1.
If someone clicked "I really want this at THW," there will be a 2.
If someone clicked "I know something about this..." there will be a 3.

The boxes are independent of one another, so a cell can hold any combination of codes: if someone checked all three, the value would be "1, 2, 3".

Assumptions for calculating interest: if someone clicked that they just wanted a topic, add 1 to the topic's score. If someone clicked that they really wanted it, add 3. If they clicked both, add just 3, not 4. Teaching offers (code 3) are tallied separately, one per respondent, and do not affect the interest score.



In [8]:

    
topic_interest = {}
topic_teaching = {}

for topic in df_topics:
    
    topic_interest[topic] = 0
    topic_teaching[topic] = 0

    for row in df_topics[topic]:
        
        cell = str(row)

        # "want" (1) without "really want" (2): add 1 to interest
        if '1' in cell and '2' not in cell:
            topic_interest[topic] += 1

        # "really want" (2), with or without 1: add 3 to interest
        if '2' in cell:
            topic_interest[topic] += 3

        # "could help teach" (3): add 1 to the teaching count
        if '3' in cell:
            topic_teaching[topic] += 1
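
To make the scoring assumptions easy to check by hand, here is a minimal sketch that scores one cell at a time. `score_cell` is a hypothetical helper, not part of the original analysis, and the example cells are made up; the exact separator in the real export (with or without a space after the comma) is an assumption, so the sketch accepts both. Unlike the loop above, it matches whole codes rather than substrings.

```python
def score_cell(cell):
    """Return (interest_points, teaching_points) for one survey cell."""
    # Cells look like "1", "1,2" or "1, 2, 3"; unanswered cells arrive as NaN.
    codes = {part.strip() for part in str(cell).split(",")}

    if "2" in codes:        # "really want" wins over "want": 3 points, not 4
        interest = 3
    elif "1" in codes:      # "want" only: 1 point
        interest = 1
    else:
        interest = 0

    teaching = 1 if "3" in codes else 0  # "could help teach" is counted separately
    return interest, teaching

# Made-up cells covering each combination
for cell in ["1", "2", "1,2", "1, 2, 3", "3", float("nan")]:
    print(repr(cell), score_cell(cell))
```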

Results



In [9]:

    
topic_interest_df = pd.DataFrame.from_dict(topic_interest, orient="index")
topic_interest_df.sort_values([0], ascending=False)









    Out[9]:







  
    
      
                                                                      0
Statistical Analysis                                                 44
Python                                                               42
Machine Learning                                                     40
Deep Learning                                                        39
Visualization Tools                                                  33
SQL databases                                                        31
Profiling / performance analysis                                     28
Timeseries Analysis                                                  28
Mapping and Geospatial Analaysis                                     28
Web Scraping                                                         28
High Performance Computing                                           27
Parallelizing code                                                   27
Web Development                                                      24
R                                                                    23
jekyll and GitHub pages                                              22
Containers (e.g. docker)                                             21
Open Science / Open Data platforms                                   20
How open source projects are run                                     20
Software engineering (including unit testing)                        20
Reproducible Workflows                                               20
Documentation (tools and best practices)                             19
Julia                                                                19
Image Analysis                                                       19
Cloud Computing                                                      19
Linux/UNIX/bash                                                      18
Textual Analysis                                                     18
Publication Tools (e.g., markup languages, LaTeX)                    17
GitHub                                                               17
Hardware sensors / Internet of Things (e.g., arduino, raspberry pi)  16
noSQL databases                                                      16
Novel Architectures (e.g., GPUs)                                     13
Meshing Tools                                                        11
C / C++                                                              10
Haskell                                                               9
Fortran                                                               7
Go                                                                    4
Rust                                                                  3
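
The totals in this table come from an explicit loop over every cell. As a sketch of an equivalent, loop-free tally, the same rules can be expressed with pandas string methods. The small `toy_topics` frame is made-up data standing in for `df_topics`, and the variable names are illustrative only.

```python
import pandas as pd

# Made-up responses: one column per topic, one row per respondent
toy_topics = pd.DataFrame({
    "Python": ["1", "1,2", None, "3"],
    "Rust":   [None, "1", "2,3", None],
})

# Treat unanswered cells as empty strings so the string checks are well defined
cells = toy_topics.fillna("").astype(str)

wants = cells.apply(lambda col: col.str.contains("1"))    # checked "want"
really = cells.apply(lambda col: col.str.contains("2"))   # checked "really want"
teaches = cells.apply(lambda col: col.str.contains("3"))  # checked "could help teach"

# 1 point for "want" alone, 3 points for "really want" (never 4 for both)
interest = (wants & ~really).sum() + 3 * really.sum()
teaching = teaches.sum()

print(interest.sort_values(ascending=False))
print(teaching)
```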



In [10]:

    
topic_interest_df = topic_interest_df.sort_values([0], ascending=True)
topic_interest_df.plot(figsize=[8,14], kind='barh', fontsize=20)









    Out[10]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f619396c128>

Topic expertise



In [11]:

    
topic_teaching_df = pd.DataFrame.from_dict(topic_teaching, orient="index")
topic_teaching_df = topic_teaching_df[topic_teaching_df[0] != 0]
topic_teaching_df.sort_values([0], ascending=False)









    Out[11]:







  
    
      
                                                   0
Python                                             9
GitHub                                             6
Linux/UNIX/bash                                    6
Publication Tools (e.g., markup languages, LaTeX)  6
High Performance Computing                         3
R                                                  3
Mapping and Geospatial Analaysis                   3
Parallelizing code                                 2
Open Science / Open Data platforms                 2
Documentation (tools and best practices)           2
Software engineering (including unit testing)      2
C / C++                                            2
Web Development                                    2
Reproducible Workflows                             2
Novel Architectures (e.g., GPUs)                   2
SQL databases                                      2
Deep Learning                                      1
Cloud Computing                                    1
Image Analysis                                     1
Textual Analysis                                   1
Web Scraping                                       1
Profiling / performance analysis                   1
Visualization Tools                                1
Statistical Analysis                               1



In [12]:

    
topic_teaching_df = topic_teaching_df.sort_values([0], ascending=True)
topic_teaching_df.plot(figsize=[8,10], kind='barh', fontsize=20)









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f6193939a90>

Meta questions about THW



In [13]:

    
df_meta = df_meta.dropna()
df_meta[0:4]









    Out[13]:







  
    
      
   Skill level  Personal experience  Presentation style
1            2                    2                   2
2            3                    2                   4
3            2                    2                   2
4            2                    3                   2

Personal experience with scientific computing



In [14]:

    
fig, ax = plt.subplots()
pe_df = df_meta['Personal experience'].value_counts(sort=False).sort_index(ascending=True)
pe_plot = pe_df.plot(kind='barh', fontsize=20, figsize=[8,4], ax=ax)
plt.title("What is your personal experience with scientific computing?", size=20)
ax.set_yticklabels(["Beginner", "Intermediate", "Advanced"])









    Out[14]:





[<matplotlib.text.Text at 0x7f6191056588>,
 <matplotlib.text.Text at 0x7f6191042eb8>,
 <matplotlib.text.Text at 0x7f619100fe10>]
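
Note that `set_yticklabels` relabels the bars by position, so the labels only line up if every answer code occurs at least once. A minimal sketch of one way to guard against a missing code is to attach the labels to the counts before plotting. It assumes that codes 1, 2 and 3 correspond to Beginner, Intermediate and Advanced, as the tick labels in the cell above suggest; the `responses` values are made up.

```python
import pandas as pd

# Assumed mapping from answer codes to labels (taken from the tick labels above)
labels = {1: "Beginner", 2: "Intermediate", 3: "Advanced"}

# Made-up answers in which nobody picked code 1
responses = pd.Series(["2", "2", "3", "2"])

counts = (
    pd.to_numeric(responses)                # survey codes may be read in as strings
    .value_counts()
    .reindex(list(labels), fill_value=0)    # force all codes to appear, in order
    .rename(index=labels)                   # swap codes for readable labels
)
print(counts)
```

Plotting `counts` with `kind='barh'` would then pick up the labels from the index, with an empty bar for any option nobody chose.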

What skill level should we aim for?



In [15]:

    
fig, ax = plt.subplots()
skill_df = df_meta['Skill level'].value_counts(sort=False).sort_index(ascending=True)
skill_plot = skill_df.plot(kind='barh', fontsize=20, figsize=[8,4], ax=ax)
plt.title("What skill level should we aim for?", size=20)
ax.set_yticklabels(["Beginner", "Intermediate", "Advanced"])









    Out[15]:





[<matplotlib.text.Text at 0x7f6190fbfc50>,
 <matplotlib.text.Text at 0x7f6190fbb668>,
 <matplotlib.text.Text at 0x7f6190f906d8>]

What should our sessions look like?



In [17]:

    
fig, ax = plt.subplots()
style_df = df_meta['Presentation style'].value_counts(sort=False).sort_index(ascending=True)
style_plot = style_df.plot(kind='barh', fontsize=20, figsize=[8,4], ax=ax)
plt.title("Session format", size=20)
ax.set_yticklabels(["100% presentation / 0% hackathon",
                    "75% presentation / 25% hackathon",
                    "50% presentation / 50% hackathon",
                    "25% presentation / 75% hackathon",
                    "0% presentation / 100% hackathon"])









    Out[17]:





[<matplotlib.text.Text at 0x7f61902c1e10>,
 <matplotlib.text.Text at 0x7f61902c1e80>,
 <matplotlib.text.Text at 0x7f6190287860>,
 <matplotlib.text.Text at 0x7f619028b400>,
 <matplotlib.text.Text at 0x7f619028bbe0>]


