INTRODUCTION

This notebook performs hierarchical clustering and other distance-based analysis on a data set of occupations and their features. It has the following sections:

  • Set-up: import modules
  • Read in Data: read in and organize data from text files
  • Data Filtering: extract only the features we care about
  • Normalize the Data: transform the features to be on scales from 0 to 1
  • Visualize the Data: histograms of feature distributions
  • Correlation between Features: correlation analysis
  • Weight the Data: weighting of features ultimately not used in further analysis
  • Handle NaN Values: analysis and filtering of NaN values
  • Calculate Euclidean Distance between Occupations: distance between occupations, for later use
  • Density clustering: dbscan clustering and PCA visualization (ultimately not used)
  • Hierarchical clustering: Ward's clustering on the occupations using Euclidean distance
  • Distance-based Analysis: Finding similar occupations and differentiating features using the Euclidean distances

Data Source: O*NET, the Occupational Information Network from the US Department of Labor.

SET-UP


In [1]:
# Load needed modules and functions
import matplotlib.pyplot as plt
%matplotlib inline

import os
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import sklearn
from sklearn.neighbors import NearestNeighbors
from pylab import figure, show

Read in Data


In [2]:
#set up path to the data files
data_folder = os.path.join(os.pardir, "data")

In [3]:
file_names = ['Abilities.txt','Interests.txt','Job Zones.txt', 'Knowledge.txt','Occupation Data.txt','Skills.txt','Work Activities.txt','Work Context Categories.txt','Work Context.txt','Work Styles.txt','Work Values.txt']

In [4]:
#read in each of the files into a dataframe
name_list = []
for name in file_names:
    frame_name = name.replace('.txt','').lower().replace(" ","_").replace(",","")
    name_list.append(frame_name)
    frame = pd.read_table(data_folder + '/' + name, sep= '\t')
    #reformat column names
    frame.columns = [x.lower().replace("*","").replace("-","_").replace(" ","_") for x in frame.columns]
    #create a variable named the frame_name that contains the data
    vars()[frame_name] = frame

Data Filtering


In [5]:
#function that calculates the number of features available in a dataframe (the # rows divided by # of jobs)
def feature(dataframe):
    return len(dataframe)/len(dataframe.onet_soc_code.unique())

Abilities


In [6]:
#In abilities, we only want to keep the rows where scale_id == 'IM'
abilities_final = abilities[abilities.scale_id == 'IM']

In [7]:
len(abilities_final)


Out[7]:
47996

In [8]:
feature(abilities_final)


Out[8]:
52

In [9]:
abilities_final['domain'] = 'Abilities'
abilities_final.head()


Out[9]:
onet_soc_code element_id element_name scale_id data_value n standard_error lower_ci_bound upper_ci_bound recommend_suppress not_relevant date domain_source domain
0 11-1011.00 1.A.1.a.1 Oral Comprehension IM 4.50 8 0.19 4.13 4.87 N n/a 06/2006 Analyst Abilities
2 11-1011.00 1.A.1.a.2 Written Comprehension IM 4.38 8 0.18 4.02 4.73 N n/a 06/2006 Analyst Abilities
4 11-1011.00 1.A.1.a.3 Oral Expression IM 4.50 8 0.19 4.13 4.87 N n/a 06/2006 Analyst Abilities
6 11-1011.00 1.A.1.a.4 Written Expression IM 4.25 8 0.25 3.76 4.74 N n/a 06/2006 Analyst Abilities
8 11-1011.00 1.A.1.b.1 Fluency of Ideas IM 4.00 8 0.19 3.63 4.37 N n/a 06/2006 Analyst Abilities

In [10]:
abilities_pt = abilities_final.pivot_table('data_value',
                                           rows = 'onet_soc_code',
                                           cols = ['domain', 'element_name'],
                                           aggfunc = 'sum')
abilities_pt.head()


Out[10]:
domain Abilities
element_name Arm-Hand Steadiness Auditory Attention Category Flexibility Control Precision Deductive Reasoning Depth Perception Dynamic Flexibility Dynamic Strength Explosive Strength Extent Flexibility ... Speed of Limb Movement Stamina Static Strength Time Sharing Trunk Strength Visual Color Discrimination Visualization Wrist-Finger Speed Written Comprehension Written Expression
onet_soc_code
11-1011.00 1.00 2.25 3.50 1.88 4.00 2.13 1 1 1 1 ... 1.00 1.00 1 2.88 1.00 2.25 3.38 1.00 4.38 4.25
11-1011.03 1.00 1.88 3.38 1.75 4.00 2.00 1 1 1 1 ... 1.00 1.00 1 2.62 1.12 2.00 2.75 1.12 4.00 3.88
11-1021.00 2.38 2.38 3.00 2.25 3.62 1.88 1 1 1 1 ... 1.88 1.88 2 2.62 2.12 2.25 2.75 1.00 3.88 3.88
11-2011.00 1.88 1.88 3.38 1.50 3.88 1.88 1 1 1 1 ... 1.00 1.00 1 2.75 1.25 2.88 3.00 1.25 3.88 3.88
11-2021.00 1.12 2.00 3.38 1.00 3.88 1.75 1 1 1 1 ... 1.00 1.00 1 2.62 1.12 2.75 3.00 1.12 3.88 3.75

5 rows × 52 columns

Knowledge


In [11]:
#In knowledge, we only want to keep the rows where scale_id == 'IM'
knowledge_final = knowledge[knowledge.scale_id == 'IM']

In [12]:
len(knowledge_final)


Out[12]:
30459

In [13]:
feature(knowledge_final)


Out[13]:
33

In [14]:
knowledge_final['domain'] = 'Knowledge'
knowledge_final.head()


Out[14]:
onet_soc_code element_id element_name scale_id data_value n standard_error lower_ci_bound upper_ci_bound recommend_suppress not_relevant date domain_source domain
0 11-1011.00 2.C.1.a Administration and Management IM 4.45 30 0.20 4.04 4.86 N n/a 06/2006 Incumbent Knowledge
2 11-1011.00 2.C.1.b Clerical IM 2.46 30 0.28 1.89 3.04 N n/a 06/2006 Incumbent Knowledge
4 11-1011.00 2.C.1.c Economics and Accounting IM 4.00 30 0.24 3.51 4.49 N n/a 06/2006 Incumbent Knowledge
6 11-1011.00 2.C.1.d Sales and Marketing IM 3.68 30 0.18 3.31 4.05 N n/a 06/2006 Incumbent Knowledge
8 11-1011.00 2.C.1.e Customer and Personal Service IM 3.90 30 0.32 3.25 4.54 N n/a 06/2006 Incumbent Knowledge

In [15]:
knowledge_pt = knowledge_final.pivot_table('data_value',
                                           rows = 'onet_soc_code',
                                           cols = ['domain', 'element_name'],
                                           aggfunc = 'sum')
knowledge_pt.head()


Out[15]:
domain Knowledge
element_name Administration and Management Biology Building and Construction Chemistry Clerical Communications and Media Computers and Electronics Customer and Personal Service Design Economics and Accounting ... Philosophy and Theology Physics Production and Processing Psychology Public Safety and Security Sales and Marketing Sociology and Anthropology Telecommunications Therapy and Counseling Transportation
onet_soc_code
11-1011.00 4.45 1.54 1.99 1.60 2.46 2.53 2.91 3.90 2.03 4.00 ... 2.08 1.52 2.21 2.89 3.04 3.68 1.66 1.71 2.19 2.13
11-1011.03 3.85 2.44 3.69 2.36 2.58 2.84 2.65 3.62 3.72 2.96 ... 1.85 2.48 2.23 2.88 2.40 3.50 2.38 1.58 1.23 2.42
11-1021.00 4.33 1.46 2.92 1.92 3.20 2.32 3.41 3.98 2.97 3.68 ... 1.24 1.84 2.98 2.91 2.92 3.18 1.86 2.28 1.36 2.83
11-2011.00 4.11 1.11 1.12 1.09 3.10 4.33 3.43 3.79 2.94 2.21 ... 1.47 1.06 3.12 2.63 1.78 3.88 1.73 3.06 1.28 1.61
11-2021.00 3.92 1.04 2.16 1.08 2.92 3.80 3.00 4.32 2.88 2.84 ... 1.40 1.24 2.28 2.56 1.64 4.72 2.40 1.76 1.28 1.80

5 rows × 33 columns

Interests


In [16]:
#In interests, we only want to keep rows where scale_id == 'OI'
interests_final = interests[interests.scale_id == 'OI']

In [17]:
len(interests_final)


Out[17]:
5844

In [18]:
feature(interests_final)


Out[18]:
6

In [19]:
interests_final['domain'] = 'Interests'
interests_final.head()


Out[19]:
onet_soc_code element_id element_name scale_id data_value date domain_source domain
0 11-1011.00 1.B.1.a Realistic OI 1.33 06/2008 Analyst Interests
1 11-1011.00 1.B.1.b Investigative OI 2.00 06/2008 Analyst Interests
2 11-1011.00 1.B.1.c Artistic OI 2.67 06/2008 Analyst Interests
3 11-1011.00 1.B.1.d Social OI 3.67 06/2008 Analyst Interests
4 11-1011.00 1.B.1.e Enterprising OI 7.00 06/2008 Analyst Interests

In [20]:
interests_pt = interests_final.pivot_table('data_value',
                                           rows = 'onet_soc_code',
                                           cols = ['domain', 'element_name'],
                                           aggfunc = 'sum')
interests_pt.head()


Out[20]:
domain Interests
element_name Artistic Conventional Enterprising Investigative Realistic Social
onet_soc_code
11-1011.00 2.67 5.33 7 2.00 1.33 3.67
11-1011.03 2.67 4.33 7 4.33 1.00 2.33
11-1021.00 1.00 3.67 7 1.33 1.33 3.33
11-1031.00 3.67 3.00 7 3.67 1.00 4.67
11-2011.00 5.33 4.67 7 2.00 1.67 2.33

Job Zones


In [21]:
#we do not need to do anything to job_zones
job_zones_final = job_zones

In [22]:
len(job_zones_final)


Out[22]:
924

In [23]:
feature(job_zones_final)


Out[23]:
1

In [24]:
job_zones_final.head()


Out[24]:
onet_soc_code job_zone date domain_source
0 11-1011.00 5 06/2006 Analyst
1 11-1011.03 5 07/2013 Analyst
2 11-1021.00 3 06/2008 Analyst
3 11-1031.00 4 06/2008 Analyst
4 11-2011.00 4 06/2010 Analyst

In [25]:
job_zones_final['domain'] = 'Job_Zones'
job_zones_final['element_name'] = 'job_zone'
job_zones_pt = job_zones_final.pivot_table('job_zone',
                                           rows = 'onet_soc_code',
                                           cols = ['domain', 'element_name'],
                                           aggfunc = 'sum')
job_zones_pt.head()


Out[25]:
domain Job_Zones
element_name job_zone
onet_soc_code
11-1011.00 5
11-1011.03 5
11-1021.00 3
11-1031.00 4
11-2011.00 4

Skills


In [26]:
#for skills, we only want to keep rows where scale_id == "IM"
skills_final = skills[skills.scale_id == 'IM']

In [27]:
len(skills_final)


Out[27]:
32305

In [28]:
feature(skills_final)


Out[28]:
35

In [29]:
skills_final.head()


Out[29]:
onet_soc_code element_id element_name scale_id data_value n standard_error lower_ci_bound upper_ci_bound recommend_suppress not_relevant date domain_source
0 11-1011.00 2.A.1.a Reading Comprehension IM 4.38 8 0.18 4.02 4.73 N n/a 06/2010 Analyst
2 11-1011.00 2.A.1.b Active Listening IM 4.38 8 0.18 4.02 4.73 N n/a 06/2010 Analyst
4 11-1011.00 2.A.1.c Writing IM 4.12 8 0.23 3.68 4.57 N n/a 06/2010 Analyst
6 11-1011.00 2.A.1.d Speaking IM 4.38 8 0.18 4.02 4.73 N n/a 06/2010 Analyst
8 11-1011.00 2.A.1.e Mathematics IM 3.00 8 0.19 2.63 3.37 N n/a 06/2010 Analyst

In [30]:
skills_final['domain'] = 'Skills'
skills_pt = skills_final.pivot_table('data_value',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')
skills_pt.head()


Out[30]:
domain Skills
element_name Active Learning Active Listening Complex Problem Solving Coordination Critical Thinking Equipment Maintenance Equipment Selection Installation Instructing Judgment and Decision Making ... Science Service Orientation Social Perceptiveness Speaking Systems Analysis Systems Evaluation Technology Design Time Management Troubleshooting Writing
onet_soc_code
11-1011.00 4.00 4.38 4.50 4.25 4.38 1 1.00 1.00 3.25 4.50 ... 1.62 3.25 4.12 4.38 4.12 4.12 1.75 4.25 1.00 4.12
11-1011.03 3.50 3.88 4.00 3.62 4.00 1 1.12 1.00 3.25 3.75 ... 1.75 3.25 3.75 4.00 3.62 3.62 1.62 3.38 1.12 3.88
11-1021.00 3.50 4.00 3.50 3.62 3.88 1 1.25 1.12 3.12 3.50 ... 1.62 3.12 3.62 4.00 3.12 3.12 1.62 3.50 1.88 3.38
11-2011.00 3.25 4.00 3.50 3.50 3.75 1 1.25 1.00 2.88 3.75 ... 1.50 3.12 4.00 4.00 3.12 3.00 1.62 3.88 1.12 3.75
11-2021.00 3.50 3.88 3.38 3.50 3.88 1 1.00 1.00 3.12 3.62 ... 2.12 3.12 3.75 3.75 3.12 3.25 1.75 3.38 1.00 3.25

5 rows × 35 columns

Work Activities


In [31]:
#for work activities, we only want to keep rows where scale_id == 'IM'
work_activities_final = work_activities[work_activities.scale_id == 'IM']

In [32]:
len(work_activities_final)


Out[32]:
37843

In [33]:
feature(work_activities_final)


Out[33]:
41

In [34]:
work_activities_final['domain'] = 'Work_Activities'
work_activities_pt = work_activities_final.pivot_table('data_value',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')
work_activities_pt.head()


Out[34]:
domain Work_Activities
element_name Analyzing Data or Information Assisting and Caring for Others Coaching and Developing Others Communicating with Persons Outside Organization Communicating with Supervisors, Peers, or Subordinates Controlling Machines and Processes Coordinating the Work and Activities of Others Developing Objectives and Strategies Developing and Building Teams Documenting/Recording Information ... Provide Consultation and Advice to Others Repairing and Maintaining Electronic Equipment Repairing and Maintaining Mechanical Equipment Resolving Conflicts and Negotiating with Others Scheduling Work and Activities Selling or Influencing Others Staffing Organizational Units Thinking Creatively Training and Teaching Others Updating and Using Relevant Knowledge
onet_soc_code
11-1011.00 4.19 2.22 3.91 4.62 4.75 1.32 4.00 4.63 4.55 2.19 ... 3.43 1.61 1.46 4.40 3.14 4.34 2.54 4.11 2.80 3.75
11-1011.03 3.85 2.23 3.64 4.46 4.58 1.36 3.96 4.31 4.12 3.44 ... 3.96 1.12 1.12 3.54 3.65 3.85 2.62 4.50 4.12 4.35
11-1021.00 3.49 3.08 3.41 3.83 3.74 1.99 4.09 3.22 3.56 3.29 ... 3.26 1.97 2.06 3.63 4.09 3.88 3.26 3.60 3.26 3.46
11-2011.00 2.81 2.10 2.68 4.56 4.28 2.22 3.06 3.68 3.27 3.30 ... 2.44 1.58 1.38 2.61 3.72 3.27 2.01 4.54 2.67 3.83
11-2021.00 3.52 2.40 3.54 4.60 4.58 1.32 3.96 4.04 4.24 2.84 ... 3.32 1.48 1.12 3.38 3.92 3.92 2.84 4.48 3.04 3.80

5 rows × 41 columns

Work Context


In [35]:
#in work context, we only want to keep rows where scale_id == 'CX' or 'CT'
work_context_final = work_context[(work_context['scale_id'] == 'CX') | (work_context['scale_id'] == 'CT')]

In [36]:
len(work_context_final)


Out[36]:
52592

In [37]:
feature(work_context_final)


Out[37]:
56

In [38]:
work_context_final_CX = work_context_final[work_context_final['scale_id'] == 'CX']
work_context_final_CT = work_context_final[work_context_final['scale_id'] == 'CT']

In [39]:
work_context_final_CX['domain'] = 'Work_Context'
work_context_CX_pt = work_context_final_CX.pivot_table('data_value',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')
work_context_CX_pt.head()


Out[39]:
domain Work_Context
element_name Consequence of Error Contact With Others Coordinate or Lead Others Cramped Work Space, Awkward Positions Deal With External Customers Deal With Physically Aggressive People Deal With Unpleasant or Angry People Degree of Automation Electronic Mail Exposed to Contaminants ... Spend Time Standing Spend Time Using Your Hands to Handle, Control, or Feel Objects, Tools, or Controls Spend Time Walking and Running Structured versus Unstructured Work Telephone Time Pressure Very Hot or Cold Temperatures Wear Common Protective or Safety Equipment such as Safety Shoes, Glasses, Gloves, Hearing Protection, Hard Hats, or Life Jackets Wear Specialized Protective or Safety Equipment such as Breathing Apparatus, Safety Harness, Full Protection Suits, or Radiation Protection Work With Work Group or Team
onet_soc_code
11-1011.00 3.55 4.84 4.32 1.47 3.83 2.07 3.92 1.80 5.00 1.49 ... 2.47 1.80 1.67 4.75 5.00 4.73 1.63 1.59 1.17 4.35
11-1011.03 2.35 4.38 4.12 1.38 3.73 1.04 2.38 1.72 4.96 1.65 ... 1.96 1.77 1.92 4.36 4.88 3.65 1.69 2.23 1.42 4.48
11-1021.00 3.04 4.76 4.20 1.32 4.48 1.60 3.39 2.32 4.26 2.11 ... 3.13 2.77 2.73 4.79 5.00 4.08 1.82 2.35 1.22 4.28
11-2011.00 2.06 4.65 4.12 1.53 3.89 1.29 2.73 2.56 5.00 1.12 ... 1.99 2.94 2.30 4.41 5.00 4.67 1.29 1.14 1.00 3.99
11-2021.00 2.40 4.64 3.72 1.21 4.00 1.12 2.56 2.08 5.00 1.16 ... 2.12 1.84 1.60 4.32 5.00 4.20 1.28 1.28 1.12 4.44

5 rows × 55 columns


In [40]:
work_context_final_CT['domain'] = 'Work_Context_Time'
work_context_CT_pt = work_context_final_CT.pivot_table('data_value',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')
work_context_CT_pt.head()


Out[40]:
domain Work_Context_Time
element_name Duration of Typical Work Week Work Schedules
onet_soc_code
11-1011.00 2.91 1.00
11-1011.03 2.77 1.35
11-1021.00 2.67 1.37
11-2011.00 2.51 1.04
11-2021.00 2.68 1.28

Work Styles


In [41]:
#in work styles, we can keep everything
work_styles_final = work_styles

In [42]:
len(work_styles_final)


Out[42]:
14752

In [43]:
feature(work_styles_final)


Out[43]:
16

In [44]:
work_styles_final['domain'] = 'Work_Styles'
work_styles_pt = work_styles_final.pivot_table('data_value',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')
work_styles_pt.head()


Out[44]:
domain Work_Styles
element_name Achievement/Effort Adaptability/Flexibility Analytical Thinking Attention to Detail Concern for Others Cooperation Dependability Independence Initiative Innovation Integrity Leadership Persistence Self Control Social Orientation Stress Tolerance
onet_soc_code
11-1011.00 4.66 4.48 4.24 4.26 3.95 4.42 4.67 4.63 4.79 4.22 4.85 4.84 4.61 4.28 4.02 4.75
11-1011.03 4.19 4.23 4.31 4.12 3.48 4.32 4.23 4.27 4.60 4.38 4.58 4.64 4.31 4.00 3.35 4.08
11-1021.00 4.07 4.21 4.22 4.52 3.96 4.26 4.73 3.96 4.36 3.88 4.36 4.50 4.24 4.38 3.56 4.35
11-2011.00 4.30 4.54 4.16 4.70 3.93 4.40 4.74 4.08 4.71 4.51 4.66 4.23 4.23 4.42 3.99 4.39
11-2021.00 4.24 4.24 3.84 4.48 3.72 4.44 4.56 4.20 4.32 4.08 4.40 4.36 4.28 4.04 3.88 4.20

Work Values


In [45]:
#in work values, we want to only keep rows where scale_id == 'EX'
work_values_final = work_values[work_values.scale_id == 'EX']

In [46]:
len(work_values_final)


Out[46]:
5844

In [47]:
feature(work_values_final)


Out[47]:
6

In [48]:
work_values_final['domain'] = 'Work_Values'
work_values_pt = work_values_final.pivot_table('data_value',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')
work_values_pt.head()


Out[48]:
domain Work_Values
element_name Achievement Independence Recognition Relationships Support Working Conditions
onet_soc_code
11-1011.00 6.33 7.00 7.00 5.00 5.33 6.33
11-1011.03 6.67 6.67 6.00 5.00 3.33 6.33
11-1021.00 5.33 6.00 5.67 6.33 4.67 6.00
11-1031.00 5.33 5.00 5.00 5.67 4.00 4.33
11-2011.00 5.33 5.33 5.33 5.00 4.00 5.33

Occupation Metadata


In [49]:
occupation_data['element_name'] = "title"
occupation_data['domain'] = 'Occupation'
occ_data_pt = occupation_data.pivot_table('title',
                                     rows = 'onet_soc_code',
                                     cols = ['domain', 'element_name'],
                                     aggfunc = 'sum')


#combined_df = combined_df.rename(columns=lambda x: x.replace(' ', '_'))
occ_data_pt.Occupation.title = occ_data_pt.Occupation.title.apply(lambda x: x.replace(' ', '_'))
occ_data_pt.Occupation.title = occ_data_pt.Occupation.title.apply(lambda x: x.replace('/', '_'))
occ_data_pt.Occupation.title = occ_data_pt.Occupation.title.apply(lambda x: x.replace(',', '_'))


occ_data_pt.tail()


Out[49]:
domain Occupation
element_name title
onet_soc_code
55-3015.00 Command_and_Control_Center_Specialists
55-3016.00 Infantry
55-3017.00 Radar_and_Sonar_Technicians
55-3018.00 Special_Forces
55-3019.00 Military_Enlisted_Tactical_Operations_and_Air_...

Put Everything Together into One DataFrame


In [50]:
domain_pt_list = [abilities_pt, knowledge_pt, interests_pt, job_zones_pt, skills_pt, work_activities_pt, work_context_CX_pt, work_context_CT_pt, work_styles_pt, work_values_pt]

combined_df = pd.concat(domain_pt_list, axis=1)

combined_df = pd.merge(occ_data_pt, combined_df, left_index = True, right_index = True)

combined_df.head()


Out[50]:
domain Occupation Abilities ... Work_Styles Work_Values
element_name title Arm-Hand Steadiness Auditory Attention Category Flexibility Control Precision Deductive Reasoning Depth Perception Dynamic Flexibility Dynamic Strength Explosive Strength ... Persistence Self Control Social Orientation Stress Tolerance Achievement Independence Recognition Relationships Support Working Conditions
11-1011.00 Chief_Executives 1.00 2.25 3.50 1.88 4.00 2.13 1 1 1 ... 4.61 4.28 4.02 4.75 6.33 7.00 7.00 5.00 5.33 6.33
11-1011.03 Chief_Sustainability_Officers 1.00 1.88 3.38 1.75 4.00 2.00 1 1 1 ... 4.31 4.00 3.35 4.08 6.67 6.67 6.00 5.00 3.33 6.33
11-1021.00 General_and_Operations_Managers 2.38 2.38 3.00 2.25 3.62 1.88 1 1 1 ... 4.24 4.38 3.56 4.35 5.33 6.00 5.67 6.33 4.67 6.00
11-1031.00 Legislators NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 5.33 5.00 5.00 5.67 4.00 4.33
11-2011.00 Advertising_and_Promotions_Managers 1.88 1.88 3.38 1.50 3.88 1.88 1 1 1 ... 4.23 4.42 3.99 4.39 5.33 5.33 5.33 5.00 4.00 5.33

5 rows × 248 columns


In [51]:
# Remove spaces in element names
combined_df = combined_df.rename(columns=lambda x: x.replace(' ', '_'))
combined_df.head()


Out[51]:
domain Occupation Abilities ... Work_Styles Work_Values
element_name title Arm-Hand_Steadiness Auditory_Attention Category_Flexibility Control_Precision Deductive_Reasoning Depth_Perception Dynamic_Flexibility Dynamic_Strength Explosive_Strength ... Persistence Self_Control Social_Orientation Stress_Tolerance Achievement Independence Recognition Relationships Support Working_Conditions
11-1011.00 Chief_Executives 1.00 2.25 3.50 1.88 4.00 2.13 1 1 1 ... 4.61 4.28 4.02 4.75 6.33 7.00 7.00 5.00 5.33 6.33
11-1011.03 Chief_Sustainability_Officers 1.00 1.88 3.38 1.75 4.00 2.00 1 1 1 ... 4.31 4.00 3.35 4.08 6.67 6.67 6.00 5.00 3.33 6.33
11-1021.00 General_and_Operations_Managers 2.38 2.38 3.00 2.25 3.62 1.88 1 1 1 ... 4.24 4.38 3.56 4.35 5.33 6.00 5.67 6.33 4.67 6.00
11-1031.00 Legislators NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 5.33 5.00 5.00 5.67 4.00 4.33
11-2011.00 Advertising_and_Promotions_Managers 1.88 1.88 3.38 1.50 3.88 1.88 1 1 1 ... 4.23 4.42 3.99 4.39 5.33 5.33 5.33 5.00 4.00 5.33

5 rows × 248 columns


In [52]:
combined_df.to_csv("onet_data.csv")

Normalize the Data

We use x - minimum/(maximum - minimum) for the normalization, to get all of the features to be between 0 and 1


In [53]:
def normalize(series):
    maximum = series.max()
    minimum = series.min()
    return [(item - minimum) / (maximum - minimum) for item in series]

In [54]:
normed_df = combined_df.copy()
normed_df.iloc[:,1:] = normed_df.iloc[:,1:].apply(normalize)
normed_df.head()


Out[54]:
domain Occupation Abilities ... Work_Styles Work_Values
element_name title Arm-Hand_Steadiness Auditory_Attention Category_Flexibility Control_Precision Deductive_Reasoning Depth_Perception Dynamic_Flexibility Dynamic_Strength Explosive_Strength ... Persistence Self_Control Social_Orientation Stress_Tolerance Achievement Independence Recognition Relationships Support Working_Conditions
11-1011.00 Chief_Executives 0.000000 0.416667 0.683544 0.234667 0.751004 0.361022 0 0 0 ... 0.894531 0.699571 0.715625 0.918301 0.888333 1.000000 1.000000 0.624765 0.686679 0.966
11-1011.03 Chief_Sustainability_Officers 0.000000 0.293333 0.632911 0.200000 0.751004 0.319489 0 0 0 ... 0.777344 0.579399 0.506250 0.699346 0.945000 0.938086 0.833333 0.624765 0.311445 0.966
11-1021.00 General_and_Operations_Managers 0.394286 0.460000 0.472574 0.333333 0.598394 0.281150 0 0 0 ... 0.750000 0.742489 0.571875 0.787582 0.721667 0.812383 0.778333 0.874296 0.562852 0.900
11-1031.00 Legislators NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.721667 0.624765 0.666667 0.750469 0.437148 0.566
11-2011.00 Advertising_and_Promotions_Managers 0.251429 0.293333 0.632911 0.133333 0.702811 0.281150 0 0 0 ... 0.746094 0.759657 0.706250 0.800654 0.721667 0.686679 0.721667 0.624765 0.437148 0.766

5 rows × 248 columns


In [55]:
normed_df.to_csv("onet_data_normalized.csv")

Visualize the Features

In this section, we view histograms of all of the features. In the beginning, this helped us identify problems with the data set (such as incorrectly filling in NaNs with 0 before the normalization, which skewed the data.


In [56]:
#function to draw histograms for a particular domain
from math import floor,ceil
def draw_histogram(domain_frame):
    fig, axes = plt.subplots(nrows=int((ceil(float(len(domain_frame.columns))/3.0))), ncols=3, figsize = (12,len(domain_frame.columns)))
    plt.subplots_adjust(hspace = 0.4)
    for i,column_name in enumerate(domain_frame.columns):
        row = int(floor(i/3))
        column = i % 3
        domain_frame[column_name].hist(bins=10, ax=axes[row,column]); axes[row,column].set_title(column_name); axes[row,column].set_ylim([0,500])

In [57]:
# draw_histogram(combined_df.Job_Zones)
# normed_df.Interests.Investigative.hist(bins=10)
draw_histogram(normed_df.Interests)



In [58]:
draw_histogram(normed_df.Abilities)



In [59]:
draw_histogram(normed_df.Knowledge)



In [60]:
draw_histogram(normed_df.Skills)



In [61]:
draw_histogram(normed_df.Work_Activities)



In [62]:
draw_histogram(normed_df.Work_Context)



In [63]:
draw_histogram(normed_df.Work_Styles)



In [64]:
draw_histogram(normed_df.Work_Values)


Correlation between Features

We used this to evaluate the relationships between variables. In future work, we could use these results to do feature selection with the end goal of obtaining better clustering results.


In [65]:
corr_df = normed_df.iloc[:,1:].corr()

In [66]:
corr_df.index = corr_df.index.droplevel(0)
corr_df.head()


Out[66]:
domain Abilities ... Work_Styles Work_Values
element_name Arm-Hand_Steadiness Auditory_Attention Category_Flexibility Control_Precision Deductive_Reasoning Depth_Perception Dynamic_Flexibility Dynamic_Strength Explosive_Strength Extent_Flexibility ... Persistence Self_Control Social_Orientation Stress_Tolerance Achievement Independence Recognition Relationships Support Working_Conditions
element_name
Arm-Hand_Steadiness 1.000000 0.551135 -0.272371 0.891912 -0.443138 0.710986 0.371735 0.755574 0.261467 0.783192 ... -0.362671 -0.162780 -0.198899 -0.250157 -0.497735 -0.428879 -0.518425 -0.281836 0.283581 -0.478212
Auditory_Attention 0.551135 1.000000 -0.054200 0.611415 -0.121370 0.658149 0.179921 0.511877 0.263377 0.517501 ... -0.162018 0.015601 -0.064527 -0.001550 -0.274176 -0.183204 -0.257769 -0.131038 0.419780 -0.192113
Category_Flexibility -0.272371 -0.054200 1.000000 -0.267147 0.670081 -0.092014 -0.229921 -0.328293 -0.167502 -0.361720 ... 0.420876 0.006757 0.002206 0.147548 0.579465 0.528414 0.601526 0.104111 0.014712 0.596652
Control_Precision 0.891912 0.611415 -0.267147 1.000000 -0.402879 0.816589 0.288932 0.675230 0.219886 0.714894 ... -0.375893 -0.219412 -0.288178 -0.276792 -0.507982 -0.400672 -0.510065 -0.360595 0.369022 -0.439037
Deductive_Reasoning -0.443138 -0.121370 0.670081 -0.402879 1.000000 -0.194816 -0.328400 -0.452801 -0.085409 -0.477243 ... 0.587241 0.228740 0.174062 0.376672 0.781149 0.745311 0.804703 0.287953 0.071965 0.808228

5 rows × 247 columns


In [67]:
corr_df.columns = corr_df.columns.droplevel(0)
corr_df.head()


Out[67]:
element_name Arm-Hand_Steadiness Auditory_Attention Category_Flexibility Control_Precision Deductive_Reasoning Depth_Perception Dynamic_Flexibility Dynamic_Strength Explosive_Strength Extent_Flexibility ... Persistence Self_Control Social_Orientation Stress_Tolerance Achievement Independence Recognition Relationships Support Working_Conditions
element_name
Arm-Hand_Steadiness 1.000000 0.551135 -0.272371 0.891912 -0.443138 0.710986 0.371735 0.755574 0.261467 0.783192 ... -0.362671 -0.162780 -0.198899 -0.250157 -0.497735 -0.428879 -0.518425 -0.281836 0.283581 -0.478212
Auditory_Attention 0.551135 1.000000 -0.054200 0.611415 -0.121370 0.658149 0.179921 0.511877 0.263377 0.517501 ... -0.162018 0.015601 -0.064527 -0.001550 -0.274176 -0.183204 -0.257769 -0.131038 0.419780 -0.192113
Category_Flexibility -0.272371 -0.054200 1.000000 -0.267147 0.670081 -0.092014 -0.229921 -0.328293 -0.167502 -0.361720 ... 0.420876 0.006757 0.002206 0.147548 0.579465 0.528414 0.601526 0.104111 0.014712 0.596652
Control_Precision 0.891912 0.611415 -0.267147 1.000000 -0.402879 0.816589 0.288932 0.675230 0.219886 0.714894 ... -0.375893 -0.219412 -0.288178 -0.276792 -0.507982 -0.400672 -0.510065 -0.360595 0.369022 -0.439037
Deductive_Reasoning -0.443138 -0.121370 0.670081 -0.402879 1.000000 -0.194816 -0.328400 -0.452801 -0.085409 -0.477243 ... 0.587241 0.228740 0.174062 0.376672 0.781149 0.745311 0.804703 0.287953 0.071965 0.808228

5 rows × 247 columns


In [68]:
corr_pairs_list = []
for i in range(len(corr_df.index)):
    row_name = corr_df.index[i]
    for j in range(i + 1, len(corr_df.columns)):
        column_name = corr_df.columns[j]
        corr_pairs_list.append([row_name,column_name, corr_df.ix[i,j]])

In [69]:
corr_pairs_df = DataFrame(corr_pairs_list)

In [70]:
#here are the correlations sorted in ascending order
#not surprisingly, spending time sitting is negatively correlated with features measuring strength and physical activity
corr_pairs_df.sort(2)


Out[70]:
0 1 2
29786 Spend_Time_Sitting Spend_Time_Standing -0.967267
10446 Trunk_Strength Spend_Time_Sitting -0.872477
29788 Spend_Time_Sitting Spend_Time_Walking_and_Running -0.835168
29605 Spend_Time_Bending_or_Twisting_the_Body Spend_Time_Sitting -0.795219
9843 Stamina Spend_Time_Sitting -0.793382
2380 Extent_Flexibility Spend_Time_Sitting -0.788185
26251 Performing_General_Physical_Activities Spend_Time_Sitting -0.782129
27926 Electronic_Mail Spend_Time_Bending_or_Twisting_the_Body -0.773786
3781 Gross_Body_Coordination Spend_Time_Sitting -0.757728
10045 Static_Strength Spend_Time_Sitting -0.756000
1905 Dynamic_Strength Spend_Time_Sitting -0.754747
25093 Handling_and_Moving_Objects Spend_Time_Sitting -0.751062
11231 Written_Comprehension Spend_Time_Bending_or_Twisting_the_Body -0.723359
10410 Trunk_Strength Electronic_Mail -0.723312
11426 Written_Expression Spend_Time_Bending_or_Twisting_the_Body -0.722919
23201 Writing Spend_Time_Bending_or_Twisting_the_Body -0.717948
10380 Trunk_Strength Interacting_With_Computers -0.717117
25391 Interacting_With_Computers Spend_Time_Bending_or_Twisting_the_Body -0.717089
2344 Extent_Flexibility Electronic_Mail -0.714901
10009 Static_Strength Electronic_Mail -0.714563
21826 Reading_Comprehension Spend_Time_Bending_or_Twisting_the_Body -0.713159
25057 Handling_and_Moving_Objects Electronic_Mail -0.711717
28330 Exposed_to_Minor_Burns,_Cuts,_Bites,_or_Stings Spend_Time_Sitting -0.711399
9979 Static_Strength Interacting_With_Computers -0.709255
27933 Electronic_Mail Spend_Time_Using_Your_Hands_to_Handle,_Control... -0.705413
29716 Spend_Time_Kneeling,_Crouching,_Stooping,_or_C... Spend_Time_Sitting -0.703470
9640 Speed_of_Limb_Movement Spend_Time_Sitting -0.702008
11433 Written_Expression Spend_Time_Using_Your_Hands_to_Handle,_Control... -0.701100
27932 Electronic_Mail Spend_Time_Standing -0.699675
18008 Realistic Speaking -0.697539
... ... ... ...
7591 Rate_Control Response_Orientation 0.915933
21745 Reading_Comprehension Writing 0.916252
7166 Peripheral_Vision Sound_Localization 0.916289
1702 Dynamic_Strength Extent_Flexibility 0.916317
2211 Extent_Flexibility Stamina 0.918172
3369 Glare_Sensitivity Peripheral_Vision 0.918579
5862 Night_Vision Spatial_Orientation 0.919099
30367 Achievement Recognition 0.919637
9675 Stamina Static_Strength 0.923176
6601 Oral_Expression Speaking 0.923189
6291 Oral_Comprehension Oral_Expression 0.926851
1708 Dynamic_Strength Gross_Body_Coordination 0.927770
6150 Number_Facility Mathematics 0.929885
991 Deductive_Reasoning Inductive_Reasoning 0.930154
1737 Dynamic_Strength Static_Strength 0.930837
11075 Written_Comprehension Written_Expression 0.933975
7167 Peripheral_Vision Spatial_Orientation 0.936410
1736 Dynamic_Strength Stamina 0.938521
4960 Mathematical_Reasoning Number_Facility 0.940364
22506 Systems_Analysis Systems_Evaluation 0.943016
7590 Rate_Control Reaction_Time 0.944219
7803 Reaction_Time Response_Orientation 0.944458
3135 Fluency_of_Ideas Originality 0.945320
11345 Written_Expression Writing 0.948572
11139 Written_Comprehension Reading_Comprehension 0.950134
5040 Mathematical_Reasoning Mathematics 0.952128
19 Arm-Hand_Steadiness Manual_Dexterity 0.961529
5855 Night_Vision Peripheral_Vision 0.962100
3612 Gross_Body_Coordination Stamina 0.966249
19224 Equipment_Maintenance Repairing 0.976774

30381 rows × 3 columns


In [71]:
#here are the correlations presented in descending order
corr_pairs_df.sort(2, ascending=False)


Out[71]:
0 1 2
19224 Equipment_Maintenance Repairing 0.976774
3612 Gross_Body_Coordination Stamina 0.966249
5855 Night_Vision Peripheral_Vision 0.962100
19 Arm-Hand_Steadiness Manual_Dexterity 0.961529
5040 Mathematical_Reasoning Mathematics 0.952128
11139 Written_Comprehension Reading_Comprehension 0.950134
11345 Written_Expression Writing 0.948572
3135 Fluency_of_Ideas Originality 0.945320
7803 Reaction_Time Response_Orientation 0.944458
7590 Rate_Control Reaction_Time 0.944219
22506 Systems_Analysis Systems_Evaluation 0.943016
4960 Mathematical_Reasoning Number_Facility 0.940364
1736 Dynamic_Strength Stamina 0.938521
7167 Peripheral_Vision Spatial_Orientation 0.936410
11075 Written_Comprehension Written_Expression 0.933975
1737 Dynamic_Strength Static_Strength 0.930837
991 Deductive_Reasoning Inductive_Reasoning 0.930154
6150 Number_Facility Mathematics 0.929885
1708 Dynamic_Strength Gross_Body_Coordination 0.927770
6291 Oral_Comprehension Oral_Expression 0.926851
6601 Oral_Expression Speaking 0.923189
9675 Stamina Static_Strength 0.923176
30367 Achievement Recognition 0.919637
5862 Night_Vision Spatial_Orientation 0.919099
3369 Glare_Sensitivity Peripheral_Vision 0.918579
2211 Extent_Flexibility Stamina 0.918172
1702 Dynamic_Strength Extent_Flexibility 0.916317
7166 Peripheral_Vision Sound_Localization 0.916289
21745 Reading_Comprehension Writing 0.916252
7591 Rate_Control Response_Orientation 0.915933
... ... ... ...
18008 Realistic Speaking -0.697539
27932 Electronic_Mail Spend_Time_Standing -0.699675
11433 Written_Expression Spend_Time_Using_Your_Hands_to_Handle,_Control... -0.701100
9640 Speed_of_Limb_Movement Spend_Time_Sitting -0.702008
29716 Spend_Time_Kneeling,_Crouching,_Stooping,_or_C... Spend_Time_Sitting -0.703470
27933 Electronic_Mail Spend_Time_Using_Your_Hands_to_Handle,_Control... -0.705413
9979 Static_Strength Interacting_With_Computers -0.709255
28330 Exposed_to_Minor_Burns,_Cuts,_Bites,_or_Stings Spend_Time_Sitting -0.711399
25057 Handling_and_Moving_Objects Electronic_Mail -0.711717
21826 Reading_Comprehension Spend_Time_Bending_or_Twisting_the_Body -0.713159
10009 Static_Strength Electronic_Mail -0.714563
2344 Extent_Flexibility Electronic_Mail -0.714901
25391 Interacting_With_Computers Spend_Time_Bending_or_Twisting_the_Body -0.717089
10380 Trunk_Strength Interacting_With_Computers -0.717117
23201 Writing Spend_Time_Bending_or_Twisting_the_Body -0.717948
11426 Written_Expression Spend_Time_Bending_or_Twisting_the_Body -0.722919
10410 Trunk_Strength Electronic_Mail -0.723312
11231 Written_Comprehension Spend_Time_Bending_or_Twisting_the_Body -0.723359
25093 Handling_and_Moving_Objects Spend_Time_Sitting -0.751062
1905 Dynamic_Strength Spend_Time_Sitting -0.754747
10045 Static_Strength Spend_Time_Sitting -0.756000
3781 Gross_Body_Coordination Spend_Time_Sitting -0.757728
27926 Electronic_Mail Spend_Time_Bending_or_Twisting_the_Body -0.773786
26251 Performing_General_Physical_Activities Spend_Time_Sitting -0.782129
2380 Extent_Flexibility Spend_Time_Sitting -0.788185
9843 Stamina Spend_Time_Sitting -0.793382
29605 Spend_Time_Bending_or_Twisting_the_Body Spend_Time_Sitting -0.795219
29788 Spend_Time_Sitting Spend_Time_Walking_and_Running -0.835168
10446 Trunk_Strength Spend_Time_Sitting -0.872477
29786 Spend_Time_Sitting Spend_Time_Standing -0.967267

30381 rows × 3 columns

Weight the data

We experimented with this weighting scheme to downweight features in the domains that have a high number of features, but ultimately did not use the weighted values to avoid introducing biases into the data


In [72]:
# creating array of column weights when equal
normed_df.columns
weighted_df = normed_df.copy()
domains = ['Abilities','Interests','Job_Zones','Knowledge','Skills','Work_Activities','Work_Context','Work_Context_Time','Work_Styles','Work_Values']
for domain in domains:
    domain_frame = weighted_df[domain]
    n_cols = len(domain_frame.columns)
    weighted_domain_frame = domain_frame/n_cols
    weighted_df[domain] = weighted_domain_frame
    
weighted_df.head()


Out[72]:
domain Occupation Abilities ... Work_Styles Work_Values
element_name title Arm-Hand_Steadiness Auditory_Attention Category_Flexibility Control_Precision Deductive_Reasoning Depth_Perception Dynamic_Flexibility Dynamic_Strength Explosive_Strength ... Persistence Self_Control Social_Orientation Stress_Tolerance Achievement Independence Recognition Relationships Support Working_Conditions
11-1011.00 Chief_Executives 0.000000 0.008013 0.013145 0.004513 0.014442 0.006943 0 0 0 ... 0.055908 0.043723 0.044727 0.057394 0.148056 0.166667 0.166667 0.104128 0.114447 0.161000
11-1011.03 Chief_Sustainability_Officers 0.000000 0.005641 0.012171 0.003846 0.014442 0.006144 0 0 0 ... 0.048584 0.036212 0.031641 0.043709 0.157500 0.156348 0.138889 0.104128 0.051907 0.161000
11-1021.00 General_and_Operations_Managers 0.007582 0.008846 0.009088 0.006410 0.011508 0.005407 0 0 0 ... 0.046875 0.046406 0.035742 0.049224 0.120278 0.135397 0.129722 0.145716 0.093809 0.150000
11-1031.00 Legislators NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.120278 0.104128 0.111111 0.125078 0.072858 0.094333
11-2011.00 Advertising_and_Promotions_Managers 0.004835 0.005641 0.012171 0.002564 0.013516 0.005407 0 0 0 ... 0.046631 0.047479 0.044141 0.050041 0.120278 0.114447 0.120278 0.104128 0.072858 0.127667

5 rows × 248 columns

Handle NaN Values


In [73]:
#get rid of occupations that have to many NAs
nan_count =  len(normed_df.columns) - normed_df.count(axis=1)

In [74]:
nan_count.unique()


Out[74]:
array([  0, 234, 235,  35])

In [75]:
nan_count.hist()


Out[75]:
<matplotlib.axes.AxesSubplot at 0x11442cd90>

In [76]:
len(weighted_df.columns)


Out[76]:
248

In [77]:
len(nan_count[nan_count==35])


Out[77]:
1

We ultimately decided to get rid of any occupations that had any NaNs among the features. Those that had NaNs were almost all NaN, with the exception of one occupation that had 35 NaNs (out of 248 features). We removed because we felt we didn't have sufficient information to calculate meaningful distance measures for these occupations.


In [78]:
#get rid of occupations that have nans
weighted_df_no_na = weighted_df.dropna(how='any')
normed_df_no_na = normed_df.dropna(how='any')

In [79]:
occ_titles = weighted_df_no_na.Occupation.title

Calculate Euclidean Distance between Occupations


In [80]:
#calculate for weighted features
euclid_dist_array_weighted = sklearn.metrics.pairwise.euclidean_distances(weighted_df_no_na.iloc[:,1:])
euclid_dist_df_weighted = DataFrame(euclid_dist_array_weighted).set_index(occ_titles)
euclid_dist_df_weighted.columns = occ_titles
euclid_dist_df_weighted.head()


Out[80]:
title Chief_Executives Chief_Sustainability_Officers General_and_Operations_Managers Advertising_and_Promotions_Managers Marketing_Managers Sales_Managers Public_Relations_and_Fundraising_Managers Administrative_Services_Managers Computer_and_Information_Systems_Managers Treasurers_and_Controllers ... Cleaners_of_Vehicles_and_Equipment Laborers_and_Freight__Stock__and_Material_Movers__Hand Machine_Feeders_and_Offbearers Packers_and_Packagers__Hand Gas_Compressor_and_Gas_Pumping_Station_Operators Pump_Operators__Except_Wellhead_Pumpers Wellhead_Pumpers Refuse_and_Recyclable_Material_Collectors Mine_Shuttle_Car_Operators Tank_Car__Truck__and_Ship_Loaders
title
Chief_Executives 0.000000 0.181263 0.535390 0.308514 0.294546 0.288312 0.286855 0.544765 0.329539 0.137563 ... 0.884232 0.907038 0.868701 0.923370 0.858371 0.836218 0.850957 0.853397 0.863084 0.833481
Chief_Sustainability_Officers 0.181263 0.000000 0.519958 0.314384 0.275115 0.296865 0.301183 0.539763 0.301729 0.195272 ... 0.865993 0.870975 0.848385 0.885358 0.833628 0.833911 0.836714 0.841863 0.861718 0.822669
General_and_Operations_Managers 0.535390 0.519958 0.000000 0.319571 0.277837 0.286914 0.296680 0.159023 0.303830 0.538866 ... 0.462888 0.464778 0.429044 0.491546 0.407295 0.425650 0.424265 0.421402 0.462103 0.394599
Advertising_and_Promotions_Managers 0.308514 0.314384 0.319571 0.000000 0.130092 0.166416 0.122899 0.294198 0.193231 0.309479 ... 0.623232 0.652907 0.619166 0.659830 0.611847 0.601884 0.607961 0.605447 0.624532 0.601082
Marketing_Managers 0.294546 0.275115 0.277837 0.130092 0.000000 0.123180 0.106170 0.297304 0.153269 0.298149 ... 0.642513 0.652808 0.623598 0.670282 0.609355 0.607975 0.609771 0.613613 0.638723 0.593852

5 rows × 922 columns


In [81]:
#calculate for normed features
euclid_dist_array_normed = sklearn.metrics.pairwise.euclidean_distances(normed_df_no_na.iloc[:,1:])
euclid_dist_df_normed = DataFrame(euclid_dist_array_normed).set_index(occ_titles)
euclid_dist_df_normed.columns = occ_titles
euclid_dist_df_normed.head()


Out[81]:
title Chief_Executives Chief_Sustainability_Officers General_and_Operations_Managers Advertising_and_Promotions_Managers Marketing_Managers Sales_Managers Public_Relations_and_Fundraising_Managers Administrative_Services_Managers Computer_and_Information_Systems_Managers Treasurers_and_Controllers ... Cleaners_of_Vehicles_and_Equipment Laborers_and_Freight__Stock__and_Material_Movers__Hand Machine_Feeders_and_Offbearers Packers_and_Packagers__Hand Gas_Compressor_and_Gas_Pumping_Station_Operators Pump_Operators__Except_Wellhead_Pumpers Wellhead_Pumpers Refuse_and_Recyclable_Material_Collectors Mine_Shuttle_Car_Operators Tank_Car__Truck__and_Ship_Loaders
title
Chief_Executives 0.000000 2.669282 3.001914 3.048946 2.585561 2.619643 2.368730 3.202132 3.173096 2.534048 ... 7.231281 6.899580 6.820793 6.664917 6.368751 6.259388 6.882292 6.915280 7.911907 6.563762
Chief_Sustainability_Officers 2.669282 0.000000 2.577008 2.520657 1.812761 2.422856 2.368706 2.709058 2.712034 2.466375 ... 6.394218 6.143485 6.005014 5.671100 5.591187 5.596112 6.064678 6.289219 7.137248 5.972403
General_and_Operations_Managers 3.001914 2.577008 0.000000 2.843368 2.615433 2.393803 2.742282 2.294043 2.600134 3.021511 ... 5.549566 5.235682 5.045208 4.896554 4.607163 4.664212 5.208924 5.207061 6.399461 4.836656
Advertising_and_Promotions_Managers 3.048946 2.520657 2.843368 0.000000 1.759457 2.603743 1.839133 2.242443 2.845571 2.671636 ... 5.955495 5.695384 5.669688 5.310311 5.521899 5.512969 5.923749 5.984562 6.909347 5.960769
Marketing_Managers 2.585561 1.812761 2.615433 1.759457 0.000000 2.068560 1.592365 2.307453 2.698033 2.300741 ... 6.328169 6.063655 6.022700 5.584062 5.817367 5.764398 6.208042 6.330996 7.176900 6.163062

5 rows × 922 columns


In [82]:
max(euclid_dist_df_weighted.max())


Out[82]:
1.2366673906148178

In [83]:
max(euclid_dist_df_normed.max())


Out[83]:
8.4396293425434639

Density Clustering

In the commented out code below, we tried to loop through different parameters for the DBSCAN and save the silhouette coefficient, which we wanted to optimize (get it close to 1). It takes close to an hour to run. The results were not good. We ran into some errors for certain combinations of eps and min_samples values. Additionally, all of our silhouette coefficients were negative or very close to zero, indicating a lot of overlap among the clusters.

We concluded that DBSCAN was not the appropriate clustering method, as our data may not meet the criteria of having high density clusters surrounded by low density errors. Given the high-dimensionality of our data set, it is also challenging to visualize the different occupations. Later, we attempt a PCA transformation to visualize, but find that the variance in our dataset cannot be explained by two or three PCA vectors.

Loop through inputs


In [84]:
# #set up 
# eps_values = np.arange(2.0,4,0.05)
# min_samples = np.arange(5,20)

In [85]:
# density_results = []
# # # http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-pySee 
# normed_array = np.array(normed_df_no_na.iloc[:,1:])

# from sklearn.cluster import DBSCAN
# from sklearn import metrics

# for eps_value in eps_values:
#     for min_sample in min_samples:
#         db = DBSCAN(eps=eps_value, min_samples=min_sample).fit(normed_array)
#         core_samples = db.core_sample_indices_
#         labels = db.labels_

#         #Number of clusters in labels
#         n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
#         coeff = metrics.silhouette_score(normed_array, labels)
#         density_results.append((eps_value, min_sample, n_clusters_, coeff))

In [86]:
# density_results

Two-input Example with PCA visualization of clusters

This is an illustrative output of our clustering outputs

Clustering


In [87]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

normed_array = np.array(normed_df_no_na.iloc[:,1:])

db = DBSCAN(eps=2.1, min_samples=10).fit(normed_array)
core_samples = db.core_sample_indices_
labels = db.labels_

num_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
coeff = metrics.silhouette_score(normed_array, labels)
print 'Number of Clusters: ' + str(num_clusters_)
print 'Noise: ' + str(len(labels[labels==-1])) + ' data points'
print 'Silhouette Coefficient: ' + str(coeff)


Number of Clusters: 5
Noise: 539 data points
Silhouette Coefficient: -0.0617010822337

In [88]:
num_clusters_


Out[88]:
5

PCA

PCA outputs show that less than 50% of the variance in the data set can be explained with two vectors, which is not good


In [90]:
from sklearn.decomposition import PCA
import seaborn as sns
from mpld3 import enable_notebook, disable_notebook
from mpld3 import plugins
enable_notebook()

pca = PCA(n_components=2).fit(normed_array)
pca.explained_variance_ratio_


Out[90]:
array([ 0.3623532 ,  0.12945225])

Here, we can visualize the results with an interactive scatterplot, with the caveat that the locations of the points in space are based on the two PCA components detailed above.


In [91]:
pca_2d = pca.transform(normed_array)

fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'))
N = 100

current_palette = sns.color_palette("husl", num_clusters_ + 1)

for i in range(0, pca_2d.shape[0]):
    db_index = int(db.labels_[i])
    scatter = ax.scatter(pca_2d[i,0], pca_2d[i,1],
                         c= current_palette[db_index],
                         alpha=0.6)
    labels = [np.array(occ_titles)[i]]
    tooltip = plugins.PointLabelTooltip(scatter, labels=labels)
    plugins.connect(fig, tooltip)
ax.grid(color='white', linestyle='solid')

ax.set_title("Scatter Plot (with tooltips!)", size=20)


Out[91]:
<matplotlib.text.Text at 0x114a5bd50>

In the above, the purple dots are noise data points that weren't able to be clustered- this was more than half of the occupations.

The dark green clusters look to be mechanical-repair fields. The light green occupations seem similar.

The gold dots largely involve the medical field.

The blue dots are mechanica/manufactuering engineers

The red dots encompass a wide-variety of office jobs.


In [92]:
disable_notebook()

Hierarchical Clustering

Following dbscan, we try hierarchical clustering. The number of clusters can be varied depending on how fine-grained you want the outputs to be. 50 clusters seems to yield useful results.


In [93]:
import time as time
import mpl_toolkits.mplot3d.axes3d as p3
from sklearn.cluster import Ward
from sklearn import metrics

In [94]:
normed_array = np.array(normed_df_no_na.iloc[:,1:])

In [95]:
# Compute clustering
print "Compute structured hierarchical clustering..."
st = time.time()
ward = Ward(n_clusters=50, connectivity=None).fit(normed_array)
label = ward.labels_
print "Elapsed time: ", time.time() - st
print "Number of points: ", label.size


Compute structured hierarchical clustering...
Elapsed time:  0.807130098343
Number of points:  922

In [96]:
clusters = DataFrame(data = label, index = occ_titles)
clusters.columns = ['cluster']

In [97]:
for cluster in range(clusters.cluster.max()):
    current_cluster = clusters[clusters.cluster == cluster]
    print '\nOccupations in cluster ' + str(cluster) + ' (' + str(len(current_cluster)) + ' total):'
    for occ in current_cluster.index:
        print '   ' + occ


Occupations in cluster 0 (10 total):
   Radio_Operators
   Optometrists
   Audiologists
   Acupuncturists
   Orthoptists
   Opticians__Dispensing
   Hearing_Aid_Specialists
   Gaming_Surveillance_Officers_and_Gaming_Investigators
   Police__Fire__and_Ambulance_Dispatchers
   Air_Traffic_Controllers

Occupations in cluster 1 (21 total):
   Crossing_Guards
   Cooks__Fast_Food
   Food_Preparation_Workers
   Combined_Food_Preparation_and_Serving_Workers__Including_Fast_Food
   Counter_Attendants__Cafeteria__Food_Concession__and_Coffee_Shop
   Waiters_and_Waitresses
   Food_Servers__Nonrestaurant
   Dining_Room_and_Cafeteria_Attendants_and_Bartender_Helpers
   Hosts_and_Hostesses__Restaurant__Lounge__and_Coffee_Shop
   Nonfarm_Animal_Caretakers
   Ushers__Lobby_Attendants__and_Ticket_Takers
   Amusement_and_Recreation_Attendants
   Costume_Attendants
   Locker_Room__Coatroom__and_Dressing_Room_Attendants
   Shampooers
   Models
   Stock_Clerks__Sales_Floor
   Marking_Clerks
   Stock_Clerks-_Stockroom__Warehouse__or_Storage_Yard
   Order_Fillers__Wholesale_and_Retail_Sales
   Graders_and_Sorters__Agricultural_Products

Occupations in cluster 2 (34 total):
   Sociologists
   Anthropologists
   Geographers
   Political_Scientists
   Business_Teachers__Postsecondary
   Mathematical_Science_Teachers__Postsecondary
   Architecture_Teachers__Postsecondary
   Anthropology_and_Archeology_Teachers__Postsecondary
   Area__Ethnic__and_Cultural_Studies_Teachers__Postsecondary
   Economics_Teachers__Postsecondary
   Geography_Teachers__Postsecondary
   Political_Science_Teachers__Postsecondary
   Psychology_Teachers__Postsecondary
   Sociology_Teachers__Postsecondary
   Health_Specialties_Teachers__Postsecondary
   Education_Teachers__Postsecondary
   Library_Science_Teachers__Postsecondary
   Criminal_Justice_and_Law_Enforcement_Teachers__Postsecondary
   Law_Teachers__Postsecondary
   Social_Work_Teachers__Postsecondary
   Art__Drama__and_Music_Teachers__Postsecondary
   Communications_Teachers__Postsecondary
   English_Language_and_Literature_Teachers__Postsecondary
   Foreign_Language_and_Literature_Teachers__Postsecondary
   History_Teachers__Postsecondary
   Philosophy_and_Religion_Teachers__Postsecondary
   Home_Economics_Teachers__Postsecondary
   Recreation_and_Fitness_Studies_Teachers__Postsecondary
   Elementary_School_Teachers__Except_Special_Education
   Middle_School_Teachers__Except_Special_and_Career_Technical_Education
   Secondary_School_Teachers__Except_Special_and_Career_Technical_Education
   Special_Education_Teachers__Middle_School
   Special_Education_Teachers__Secondary_School
   Adult_Basic_and_Secondary_Education_and_Literacy_Teachers_and_Instructors

Occupations in cluster 3 (30 total):
   Nursing_Instructors_and_Teachers__Postsecondary
   Chiropractors
   Dentists__General
   Oral_and_Maxillofacial_Surgeons
   Orthodontists
   Prosthodontists
   Obstetricians_and_Gynecologists
   Surgeons
   Dermatologists
   Hospitalists
   Ophthalmologists
   Physical_Medicine_and_Rehabilitation_Physicians
   Sports_Medicine_Physicians
   Urologists
   Physician_Assistants
   Podiatrists
   Occupational_Therapists
   Physical_Therapists
   Exercise_Physiologists
   Veterinarians
   Registered_Nurses
   Acute_Care_Nurses
   Critical_Care_Nurses
   Clinical_Nurse_Specialists
   Nurse_Midwives
   Nurse_Practitioners
   Licensed_Practical_and_Licensed_Vocational_Nurses
   Orthotists_and_Prosthetists
   Athletic_Trainers
   Midwives

Occupations in cluster 4 (19 total):
   Geothermal_Production_Managers
   Biomass_Power_Plant_Managers
   Nursery_and_Greenhouse_Managers
   Aquacultural_Managers
   Forest_and_Conservation_Technicians
   Forest_Fire_Fighting_and_Prevention_Supervisors
   First-Line_Supervisors_of_Landscaping__Lawn_Service__and_Groundskeeping_Workers
   Pest_Control_Workers
   First-Line_Supervisors_of_Logging_Workers
   First-Line_Supervisors_of_Aquacultural_Workers
   First-Line_Supervisors_of_Agricultural_Crop_and_Horticultural_Workers
   First-Line_Supervisors_of_Animal_Husbandry_and_Animal_Care_Workers
   Forest_and_Conservation_Workers
   Log_Graders_and_Scalers
   First-Line_Supervisors_of_Construction_Trades_and_Extraction_Workers
   Segmental_Pavers
   First-Line_Supervisors_of_Mechanics__Installers__and_Repairers
   First-Line_Supervisors_of_Production_and_Operating_Workers
   Aircraft_Cargo_Handling_Supervisors

Occupations in cluster 5 (23 total):
   Advertising_and_Promotions_Managers
   Marketing_Managers
   Public_Relations_and_Fundraising_Managers
   Agents_and_Business_Managers_of_Artists__Performers__and_Athletes
   Equal_Opportunity_Representatives_and_Officers
   Personal_Financial_Advisors
   Loan_Counselors
   Loan_Officers
   Historians
   Lawyers
   Administrative_Law_Judges__Adjudicators__and_Hearing_Officers
   Arbitrators__Mediators__and_Conciliators
   Judges__Magistrate_Judges__and_Magistrates
   Producers
   Talent_Directors
   Radio_and_Television_Announcers
   Broadcast_News_Analysts
   Reporters_and_Correspondents
   Public_Relations_Specialists
   Copy_Writers
   Insurance_Sales_Agents
   Sales_Agents__Securities_and_Commodities
   Sales_Agents__Financial_Services

Occupations in cluster 6 (19 total):
   Graduate_Teaching_Assistants
   Vocational_Education_Teachers__Postsecondary
   Preschool_Teachers__Except_Special_Education
   Kindergarten_Teachers__Except_Special_Education
   Self-Enrichment_Education_Teachers
   Teacher_Assistants
   Actors
   Music_Directors
   Singers
   Musicians__Instrumental
   Public_Address_System_and_Other_Announcers
   Interpreters_and_Translators
   Speech-Language_Pathology_Assistants
   Funeral_Attendants
   Tour_Guides_and_Escorts
   Childcare_Workers
   Fitness_Trainers_and_Aerobics_Instructors
   Demonstrators_and_Product_Promoters
   Door-To-Door_Sales_Workers__News_and_Street_Vendors__and_Related_Workers

Occupations in cluster 7 (39 total):
   Boilermakers
   Electricians
   Pipe_Fitters_and_Steamfitters
   Plumbers
   Sheet_Metal_Workers
   Solar_Photovoltaic_Installers
   Elevator_Installers_and_Repairers
   Weatherization_Installers_and_Technicians
   Radio__Cellular__and_Tower_Equipment_Installers_and_Repairers
   Telecommunications_Equipment_Installers_and_Repairers__Except_Line_Installers
   Electric_Motor__Power_Tool__and_Related_Repairers
   Electrical_and_Electronics_Repairers__Powerhouse__Substation__and_Relay
   Electronic_Equipment_Installers_and_Repairers__Motor_Vehicles
   Security_and_Fire_Alarm_Systems_Installers
   Aircraft_Mechanics_and_Service_Technicians
   Automotive_Glass_Installers_and_Repairers
   Automotive_Master_Mechanics
   Automotive_Specialty_Technicians
   Bus_and_Truck_Mechanics_and_Diesel_Engine_Specialists
   Farm_Equipment_Mechanics_and_Service_Technicians
   Mobile_Heavy_Equipment_Mechanics__Except_Engines
   Motorboat_Mechanics_and_Service_Technicians
   Motorcycle_Mechanics
   Outdoor_Power_Equipment_and_Other_Small_Engine_Mechanics
   Bicycle_Repairers
   Recreational_Vehicle_Service_Technicians
   Mechanical_Door_Repairers
   Control_and_Valve_Installers_and_Repairers__Except_Mechanical_Door
   Heating_and_Air_Conditioning_Mechanics_and_Installers
   Refrigeration_Mechanics_and_Installers
   Home_Appliance_Repairers
   Telecommunications_Line_Installers_and_Repairers
   Maintenance_and_Repair_Workers__General
   Wind_Turbine_Service_Technicians
   Coin__Vending__and_Amusement_Machine_Servicers_and_Repairers
   Locksmiths_and_Safe_Repairers
   Signal_and_Track_Switch_Repairers
   Ship_Engineers
   Transportation_Vehicle__Equipment_and_Systems_Inspectors__Except_Aviation

Occupations in cluster 8 (35 total):
   Architectural_and_Engineering_Managers
   Water_Resource_Specialists
   Brownfield_Redevelopment_Specialists_and_Site_Managers
   Logistics_Engineers
   Aerospace_Engineers
   Agricultural_Engineers
   Biomedical_Engineers
   Chemical_Engineers
   Civil_Engineers
   Transportation_Engineers
   Electrical_Engineers
   Electronics_Engineers__Except_Computer
   Environmental_Engineers
   Water_Wastewater_Engineers
   Fire-Prevention_and_Protection_Engineers
   Product_Safety_Engineers
   Industrial_Engineers
   Human_Factors_Engineers_and_Ergonomists
   Marine_Architects
   Materials_Engineers
   Mechanical_Engineers
   Mining_and_Geological_Engineers__Including_Mining_Safety_Engineers
   Nuclear_Engineers
   Petroleum_Engineers
   Biochemical_Engineers
   Validation_Engineers
   Energy_Engineers
   Mechatronics_Engineers
   Photonics_Engineers
   Wind_Energy_Engineers
   Industrial_Engineering_Technicians
   Electrical_Engineering_Technologists
   Industrial_Engineering_Technologists
   Materials_Scientists
   Commercial_and_Industrial_Designers

Occupations in cluster 9 (30 total):
   Brickmasons_and_Blockmasons
   Stonemasons
   Construction_Carpenters
   Rough_Carpenters
   Carpet_Installers
   Floor_Layers__Except_Carpet__Wood__and_Hard_Tiles
   Floor_Sanders_and_Finishers
   Tile_and_Marble_Setters
   Cement_Masons_and_Concrete_Finishers
   Terrazzo_Workers_and_Finishers
   Construction_Laborers
   Drywall_and_Ceiling_Tile_Installers
   Tapers
   Glaziers
   Insulation_Workers__Floor__Ceiling__and_Wall
   Insulation_Workers__Mechanical
   Painters__Construction_and_Maintenance
   Paperhangers
   Plasterers_and_Stucco_Masons
   Reinforcing_Iron_and_Rebar_Workers
   Roofers
   Helpers--Brickmasons__Blockmasons__Stonemasons__and_Tile_and_Marble_Setters
   Helpers--Carpenters
   Helpers--Electricians
   Helpers--Painters__Paperhangers__Plasterers__and_Stucco_Masons
   Helpers--Pipelayers__Plumbers__Pipefitters__and_Steamfitters
   Helpers--Roofers
   Roustabouts__Oil_and_Gas
   Riggers
   Structural_Metal_Fabricators_and_Fitters

Occupations in cluster 10 (27 total):
   Chief_Executives
   Sales_Managers
   Administrative_Services_Managers
   Financial_Managers__Branch_or_Department
   Human_Resources_Managers
   Training_and_Development_Managers
   Education_Administrators__Elementary_and_Secondary_School
   Education_Administrators__Postsecondary
   Gaming_Managers
   Lodging_Managers
   Medical_and_Health_Services_Managers
   Postmasters_and_Mail_Superintendents
   Property__Real_Estate__and_Community_Association_Managers
   Social_and_Community_Service_Managers
   Emergency_Management_Directors
   Security_Managers
   Loss_Prevention_Managers
   Training_and_Development_Specialists
   Industrial-Organizational_Psychologists
   Clergy
   Farm_and_Home_Management_Advisors
   Instructional_Coordinators
   Program_Directors
   Gaming_Supervisors
   Spa_Managers
   First-Line_Supervisors_of_Non-Retail_Sales_Workers
   First-Line_Supervisors_of_Office_and_Administrative_Support_Workers

Occupations in cluster 11 (29 total):
   Quality_Control_Systems_Managers
   Construction_Managers
   Government_Property_Inspectors_and_Investigators
   Energy_Auditors
   Security_Management_Specialists
   Telecommunications_Engineering_Specialists
   Surveyors
   Geodetic_Surveyors
   Industrial_Safety_and_Health_Engineers
   Marine_Engineers
   Manufacturing_Engineers
   Robotics_Engineers
   Aerospace_Engineering_and_Operations_Technicians
   Environmental_Engineering_Technicians
   Non-Destructive_Testing_Specialists
   Manufacturing_Engineering_Technologists
   Surveying_Technicians
   Geological_Sample_Test_Technicians
   Environmental_Science_and_Protection_Technicians__Including_Health
   Precision_Agriculture_Technicians
   Occupational_Health_and_Safety_Specialists
   Occupational_Health_and_Safety_Technicians
   Agricultural_Inspectors
   Construction_and_Building_Inspectors
   Power_Distributors_and_Dispatchers
   Airfield_Operations_Specialists
   Traffic_Technicians
   Aviation_Inspectors
   Freight_and_Cargo_Inspectors

Occupations in cluster 12 (23 total):
   Computer_and_Information_Research_Scientists
   Computer_Systems_Analysts
   Information_Security_Analysts
   Computer_Programmers
   Software_Developers__Applications
   Software_Developers__Systems_Software
   Web_Developers
   Database_Administrators
   Computer_Network_Architects
   Software_Quality_Assurance_Engineers_and_Testers
   Computer_Systems_Engineers_Architects
   Web_Administrators
   Document_Management_Specialists
   Cartographers_and_Photogrammetrists
   Computer_Hardware_Engineers
   Architectural_Drafters
   Civil_Drafters
   Electronic_Drafters
   Electrical_Drafters
   Mechanical_Drafters
   Civil_Engineering_Technicians
   Mapping_Technicians
   Remote_Sensing_Technicians

Occupations in cluster 13 (29 total):
   Paving__Surfacing__and_Tamping_Equipment_Operators
   Pile-Driver_Operators
   Operating_Engineers_and_Other_Construction_Equipment_Operators
   Highway_Maintenance_Workers
   Rail-Track_Laying_and_Maintenance_Equipment_Operators
   Continuous_Mining_Machine_Operators
   Mine_Cutting_and_Channeling_Machine_Operators
   Roof_Bolters__Mining
   Helpers--Extraction_Workers
   Automotive_Body_and_Related_Repairers
   Rail_Car_Repairers
   Industrial_Machinery_Mechanics
   Maintenance_Workers__Machinery
   Refractory_Materials_Repairers__Except_Brickmasons
   Aircraft_Structure__Surfaces__Rigging__and_Systems_Assemblers
   Fiberglass_Laminators_and_Fabricators
   Metal-Refining_Furnace_Operators_and_Tenders
   Pourers_and_Casters__Metal
   Welders__Cutters__and_Welder_Fitters
   Layout_Workers__Metal_and_Plastic
   Furniture_Finishers
   Separating__Filtering__Clarifying__Precipitating__and_Still_Machine_Setters__Operators__and_Tenders
   Crushing__Grinding__and_Polishing_Machine_Setters__Operators__and_Tenders
   Mixing_and_Blending_Machine_Setters__Operators__and_Tenders
   Painters__Transportation_Equipment
   Cooling_and_Freezing_Equipment_Operators_and_Tenders
   Sailors_and_Marine_Oilers
   Excavating_and_Loading_Machine_and_Dragline_Operators
   Hoist_and_Winch_Operators

Occupations in cluster 14 (12 total):
   Farm_Labor_Contractors
   Umpires__Referees__and_Other_Sports_Officials
   Bailiffs
   Parking_Enforcement_Workers
   Security_Guards
   Lifeguards__Ski_Patrol__and_Other_Recreational_Protective_Service_Workers
   Baggage_Porters_and_Bellhops
   Couriers_and_Messengers
   Postal_Service_Mail_Carriers
   Driver_Sales_Workers
   Taxi_Drivers_and_Chauffeurs
   Parking_Lot_Attendants

Occupations in cluster 15 (7 total):
   Craft_Artists
   Fine_Artists__Including_Painters__Sculptors__and_Illustrators
   Cooks__Private_Household
   Animal_Trainers
   Animal_Breeders
   Hunters_and_Trappers
   Potters__Manufacturing

Occupations in cluster 16 (15 total):
   Office_Machine_Operators__Except_Computer
   Camera_and_Photographic_Equipment_Repairers
   Musical_Instrument_Repairers_and_Tuners
   Watch_Repairers
   Timing_Device_Assemblers_and_Adjusters
   Prepress_Technicians_and_Workers
   Sewers__Hand
   Tailors__Dressmakers__and_Custom_Sewers
   Jewelers
   Gem_and_Diamond_Workers
   Precious_Metal_Workers
   Dental_Laboratory_Technicians
   Ophthalmic_Laboratory_Technicians
   Photographic_Process_Workers_and_Processing_Machine_Operators
   Etchers_and_Engravers

Occupations in cluster 17 (8 total):
   Emergency_Medical_Technicians_and_Paramedics
   Municipal_Fire_Fighting_and_Prevention_Supervisors
   Municipal_Firefighters
   Forest_Firefighters
   Airline_Pilots__Copilots__and_Flight_Engineers
   Commercial_Pilots
   Ship_and_Boat_Captains
   Pilots__Ship

Occupations in cluster 18 (13 total):
   Insurance_Appraisers__Auto_Damage
   Meeting__Convention__and_Event_Planners
   Assessors
   Appraisers__Real_Estate
   Private_Detectives_and_Investigators
   Concierges
   Travel_Guides
   Advertising_Sales_Agents
   Sales_Representatives__Wholesale_and_Manufacturing__Technical_and_Scientific_Products
   Solar_Sales_Representatives_and_Assessors
   Sales_Representatives__Wholesale_and_Manufacturing__Except_Technical_and_Scientific_Products
   Real_Estate_Brokers
   Real_Estate_Sales_Agents

Occupations in cluster 19 (30 total):
   Paralegals_and_Legal_Assistants
   Court_Reporters
   Title_Examiners__Abstractors__and_Searchers
   Library_Technicians
   Medical_Records_and_Health_Information_Technicians
   Medical_Transcriptionists
   Telemarketers
   Switchboard_Operators__Including_Answering_Service
   Telephone_Operators
   Statement_Clerks
   Billing__Cost__and_Rate_Clerks
   Bookkeeping__Accounting__and_Auditing_Clerks
   Payroll_and_Timekeeping_Clerks
   Court_Clerks
   Municipal_Clerks
   License_Clerks
   File_Clerks
   Interviewers__Except_Eligibility_and_Loan
   Library_Assistants__Clerical
   Human_Resources_Assistants__Except_Payroll_and_Timekeeping
   Receptionists_and_Information_Clerks
   Executive_Secretaries_and_Executive_Administrative_Assistants
   Legal_Secretaries
   Medical_Secretaries
   Secretaries_and_Administrative_Assistants__Except_Legal__Medical__and_Executive
   Data_Entry_Keyers
   Word_Processors_and_Typists
   Insurance_Claims_Clerks
   Office_Clerks__General
   Proofreaders_and_Copy_Markers

Occupations in cluster 20 (8 total):
   Floral_Designers
   Merchandise_Displayers_and_Window_Trimmers
   Massage_Therapists
   Barbers
   Hairdressers__Hairstylists__and_Cosmetologists
   Makeup_Artists__Theatrical_and_Performance
   Manicurists_and_Pedicurists
   Skincare_Specialists

Occupations in cluster 21 (21 total):
   Licensing_Examiners_and_Inspectors
   Coroners
   Forensic_Science_Technicians
   First-Line_Supervisors_of_Correctional_Officers
   First-Line_Supervisors_of_Police_and_Detectives
   Fire_Inspectors
   Fire_Investigators
   Correctional_Officers_and_Jailers
   Police_Detectives
   Police_Identification_and_Records_Officers
   Criminal_Investigators_and_Special_Agents
   Immigration_and_Customs_Inspectors
   Fish_and_Game_Wardens
   Police_Patrol_Officers
   Sheriffs_and_Deputy_Sheriffs
   Transit_and_Railroad_Police
   Animal_Control_Workers
   Embalmers
   Morticians__Undertakers__and_Funeral_Directors
   Ambulance_Drivers_and_Attendants__Except_Emergency_Medical_Technicians
   Transportation_Attendants__Except_Flight_Attendants

Occupations in cluster 22 (32 total):
   Computer-Controlled_Machine_Tool_Operators__Metal_and_Plastic
   Computer_Numerically_Controlled_Machine_Tool_Programmers__Metal_and_Plastic
   Extruding_and_Drawing_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Rolling_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Drilling_and_Boring_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic
   Grinding__Lapping__Polishing__and_Buffing_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic
   Lathe_and_Turning_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic
   Milling_and_Planing_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Machinists
   Model_Makers__Metal_and_Plastic
   Patternmakers__Metal_and_Plastic
   Multiple_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic
   Tool_and_Die_Makers
   Heat_Treating_Equipment_Setters__Operators__and_Tenders__Metal_and_Plastic
   Tool_Grinders__Filers__and_Sharpeners
   Printing_Press_Operators
   Print_Binding_and_Finishing_Workers
   Textile_Cutting_Machine_Setters__Operators__and_Tenders
   Textile_Winding__Twisting__and_Drawing_Out_Machine_Setters__Operators__and_Tenders
   Extruding_and_Forming_Machine_Setters__Operators__and_Tenders__Synthetic_and_Glass_Fibers
   Upholsterers
   Cabinetmakers_and_Bench_Carpenters
   Model_Makers__Wood
   Patternmakers__Wood
   Sawing_Machine_Setters__Operators__and_Tenders__Wood
   Woodworking_Machine_Setters__Operators__and_Tenders__Except_Sawing
   Grinding_and_Polishing_Workers__Hand
   Cutting_and_Slicing_Machine_Setters__Operators__and_Tenders
   Coating__Painting__and_Spraying_Machine_Setters__Operators__and_Tenders
   Adhesive_Bonding_Machine_Operators_and_Tenders
   Glass_Blowers__Molders__Benders__and_Finishers
   Molding_and_Casting_Workers

Occupations in cluster 23 (26 total):
   Electrical_and_Electronics_Installers_and_Repairers__Transportation_Equipment
   Coil_Winders__Tapers__and_Finishers
   Electrical_and_Electronic_Equipment_Assemblers
   Electromechanical_Equipment_Assemblers
   Engine_and_Other_Machine_Assemblers
   Team_Assemblers
   Food_and_Tobacco_Roasting__Baking__and_Drying_Machine_Operators_and_Tenders
   Food_Batchmakers
   Food_Cooking_Machine_Operators_and_Tenders
   Forging_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Cutting__Punching__and_Press_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Molding__Coremaking__and_Casting_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Solderers_and_Brazers
   Welding__Soldering__and_Brazing_Machine_Setters__Operators__and_Tenders
   Plating_and_Coating_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic
   Shoe_Machine_Operators_and_Tenders
   Textile_Bleaching_and_Dyeing_Machine_Operators_and_Tenders
   Textile_Knitting_and_Weaving_Machine_Setters__Operators__and_Tenders
   Extruding__Forming__Pressing__and_Compacting_Machine_Setters__Operators__and_Tenders
   Furnace__Kiln__Oven__Drier__and_Kettle_Operators_and_Tenders
   Packaging_and_Filling_Machine_Operators_and_Tenders
   Semiconductor_Processors
   Paper_Goods_Machine_Setters__Operators__and_Tenders
   Tire_Builders
   Helpers--Production_Workers
   Machine_Feeders_and_Offbearers

Occupations in cluster 24 (12 total):
   Network_and_Computer_Systems_Administrators
   Computer_User_Support_Specialists
   Audio-Visual_and_Multimedia_Collections_Specialists
   Set_and_Exhibit_Designers
   Audio_and_Video_Equipment_Technicians
   Broadcast_Technicians
   Sound_Engineering_Technicians
   Photographers
   Camera_Operators__Television__Video__and_Motion_Picture
   Computer_Operators
   Desktop_Publishers
   Fabric_and_Apparel_Patternmakers

Occupations in cluster 25 (16 total):
   Meter_Readers__Utilities
   Bus_Drivers__Transit_and_Intercity
   Bus_Drivers__School_or_Special_Client
   Heavy_and_Tractor-Trailer_Truck_Drivers
   Light_Truck_or_Delivery_Services_Drivers
   Locomotive_Engineers
   Locomotive_Firers
   Rail_Yard_Engineers__Dinkey_Operators__and_Hostlers
   Railroad_Brake__Signal__and_Switch_Operators
   Railroad_Conductors_and_Yardmasters
   Subway_and_Streetcar_Operators
   Mates-_Ship__Boat__and_Barge
   Motorboat_Operators
   Bridge_and_Lock_Tenders
   Automotive_and_Watercraft_Service_Attendants
   Refuse_and_Recyclable_Material_Collectors

Occupations in cluster 26 (29 total):
   Chief_Sustainability_Officers
   Purchasing_Managers
   Transportation_Managers
   Logistics_Managers
   Clinical_Research_Coordinators
   Regulatory_Affairs_Managers
   Compliance_Managers
   Supply_Chain_Managers
   Buyers_and_Purchasing_Agents__Farm_Products
   Wholesale_and_Retail_Buyers__Except_Farm_Products
   Purchasing_Agents__Except_Wholesale__Retail__and_Farm_Products
   Regulatory_Affairs_Specialists
   Cost_Estimators
   Logisticians
   Logistics_Analysts
   Management_Analysts
   Customs_Brokers
   Business_Continuity_Planners
   Sustainability_Specialists
   Information_Technology_Project_Managers
   Urban_and_Regional_Planners
   Transportation_Planners
   City_and_Regional_Planning_Aides
   Archivists
   Curators
   Librarians
   Travel_Agents
   Cargo_and_Freight_Agents
   Dispatchers__Except_Police__Fire__and_Ambulance

Occupations in cluster 27 (10 total):
   Psychiatric_Technicians
   Home_Health_Aides
   Psychiatric_Aides
   Nursing_Assistants
   Occupational_Therapy_Assistants
   Occupational_Therapy_Aides
   Physical_Therapist_Assistants
   Physical_Therapist_Aides
   Personal_Care_Aides
   Flight_Attendants

Occupations in cluster 28 (17 total):
   Nuclear_Equipment_Operation_Technicians
   Nuclear_Monitoring_Technicians
   Pesticide_Handlers__Sprayers__and_Applicators__Vegetation
   Geothermal_Technicians
   Nuclear_Power_Reactor_Operators
   Power_Plant_Operators
   Stationary_Engineers_and_Boiler_Operators
   Water_and_Wastewater_Treatment_Plant_and_System_Operators
   Chemical_Plant_and_System_Operators
   Gas_Plant_Operators
   Petroleum_Pump_System_Operators__Refinery_Operators__and_Gaugers
   Biomass_Plant_Technicians
   Chemical_Equipment_Operators_and_Tenders
   Conveyor_Operators_and_Tenders
   Gas_Compressor_and_Gas_Pumping_Station_Operators
   Pump_Operators__Except_Wellhead_Pumpers
   Wellhead_Pumpers

Occupations in cluster 29 (17 total):
   Pharmacy_Technicians
   Pharmacy_Aides
   Bartenders
   Slot_Supervisors
   Gaming_Dealers
   Gaming_and_Sports_Book_Writers_and_Runners
   Cashiers
   Gaming_Change_Persons_and_Booth_Cashiers
   Counter_and_Rental_Clerks
   Parts_Salespersons
   Retail_Salespersons
   Gaming_Cage_Workers
   Tellers
   Hotel__Motel__and_Resort_Desk_Clerks
   Order_Clerks
   Reservation_and_Transportation_Ticket_Agents_and_Travel_Clerks
   Postal_Service_Clerks

Occupations in cluster 30 (28 total):
   Anesthesiologists
   Anesthesiologist_Assistants
   Radiation_Therapists
   Respiratory_Therapists
   Nurse_Anesthetists
   Medical_and_Clinical_Laboratory_Technologists
   Histotechnologists_and_Histologic_Technicians
   Medical_and_Clinical_Laboratory_Technicians
   Dental_Hygienists
   Cardiovascular_Technologists_and_Technicians
   Diagnostic_Medical_Sonographers
   Nuclear_Medicine_Technologists
   Radiologic_Technologists
   Magnetic_Resonance_Imaging_Technologists
   Respiratory_Therapy_Technicians
   Surgical_Technologists
   Veterinary_Technologists_and_Technicians
   Ophthalmic_Medical_Technicians
   Neurodiagnostic_Technologists
   Ophthalmic_Medical_Technologists
   Radiologic_Technicians
   Dental_Assistants
   Medical_Assistants
   Medical_Equipment_Preparers
   Veterinary_Assistants_and_Laboratory_Animal_Caretakers
   Phlebotomists
   Endoscopy_Technicians
   Medical_Appliance_Technicians

Occupations in cluster 31 (14 total):
   Epidemiologists
   Medical_Scientists__Except_Epidemiologists
   Neuropsychologists_and_Clinical_Neuropsychologists
   Dietitians_and_Nutritionists
   Pharmacists
   Family_and_General_Practitioners
   Internists__General
   Pediatricians__General
   Allergists_and_Immunologists
   Neurologists
   Nuclear_Medicine_Physicians
   Pathologists
   Preventive_Medicine_Physicians
   Radiologists

Occupations in cluster 32 (13 total):
   Tree_Trimmers_and_Pruners
   Structural_Iron_and_Steel_Workers
   Hazardous_Materials_Removal_Workers
   Septic_Tank_Servicers_and_Sewer_Pipe_Cleaners
   Derrick_Operators__Oil_and_Gas
   Rotary_Drill_Operators__Oil_and_Gas
   Service_Unit_Operators__Oil__Gas__and_Mining
   Explosives_Workers__Ordnance_Handling_Experts__and_Blasters
   Millwrights
   Electrical_Power-Line_Installers_and_Repairers
   Commercial_Divers
   Manufactured_Building_and_Mobile_Home_Installers
   Tank_Car__Truck__and_Ship_Loaders

Occupations in cluster 33 (14 total):
   Education_Administrators__Preschool_and_Childcare_Center_Program
   Health_Educators
   Community_Health_Workers
   Directors__Religious_Activities_and_Education
   Career_Technical_Education_Teachers__Middle_School
   Career_Technical_Education_Teachers__Secondary_School
   Adapted_Physical_Education_Specialists
   Coaches_and_Scouts
   Low_Vision_Therapists__Orientation_and_Mobility_Specialists__and_Vision_Rehabilitation_Therapists
   Recreational_Therapists
   First-Line_Supervisors_of_Personal_Service_Workers
   Nannies
   Recreation_Workers
   Residential_Advisors

Occupations in cluster 34 (11 total):
   General_and_Operations_Managers
   Industrial_Production_Managers
   Storage_and_Distribution_Managers
   Food_Service_Managers
   Chefs_and_Head_Cooks
   First-Line_Supervisors_of_Food_Preparation_and_Serving_Workers
   First-Line_Supervisors_of_Housekeeping_and_Janitorial_Workers
   First-Line_Supervisors_of_Retail_Sales_Workers
   Production__Planning__and_Expediting_Clerks
   First-Line_Supervisors_of_Helpers__Laborers__and_Material_Movers__Hand
   First-Line_Supervisors_of_Transportation_and_Material-Moving_Machine_and_Vehicle_Operators

Occupations in cluster 35 (16 total):
   Electronics_Engineering_Technicians
   Electrical_Engineering_Technicians
   Electro-Mechanical_Technicians
   Robotics_Technicians
   Mechanical_Engineering_Technicians
   Electromechanical_Engineering_Technologists
   Electronics_Engineering_Technologists
   Mechanical_Engineering_Technologists
   Photonics_Technicians
   Manufacturing_Production_Technicians
   Computer__Automated_Teller__and_Office_Machine_Repairers
   Radio_Mechanics
   Avionics_Technicians
   Electrical_and_Electronics_Repairers__Commercial_and_Industrial_Equipment
   Electronic_Home_Entertainment_Equipment_Installers_and_Repairers
   Medical_Equipment_Repairers

Occupations in cluster 36 (12 total):
   Computer_and_Information_Systems_Managers
   Distance_Learning_Coordinators
   Informatics_Nurse_Specialists
   Architects__Except_Landscape_and_Naval
   Landscape_Architects
   Instructional_Designers_and_Technologists
   Art_Directors
   Fashion_Designers
   Interior_Designers
   Directors-_Stage__Motion_Pictures__Television__and_Radio
   Technical_Directors_Managers
   Sales_Engineers

Occupations in cluster 37 (13 total):
   Geospatial_Information_Scientists_and_Technologists
   Geographic_Information_Systems_Technicians
   Mathematicians
   Operations_Research_Analysts
   Statisticians
   Biostatisticians
   Bioinformatics_Scientists
   Astronomers
   Physicists
   Atmospheric_and_Space_Scientists
   Remote_Sensing_Scientists_and_Technologists
   Economists
   Environmental_Economists

Occupations in cluster 38 (20 total):
   Dishwashers
   Janitors_and_Cleaners__Except_Maids_and_Housekeeping_Cleaners
   Maids_and_Housekeeping_Cleaners
   Motion_Picture_Projectionists
   Postal_Service_Mail_Sorters__Processors__and_Processing_Machine_Operators
   Mail_Clerks_and_Mail_Machine_Operators__Except_Postal_Service
   Fabric_Menders__Except_Garment
   Meat__Poultry__and_Fish_Cutters_and_Trimmers
   Slaughterers_and_Meat_Packers
   Foundry_Mold_and_Coremakers
   Laundry_and_Dry-Cleaning_Workers
   Pressers__Textile__Garment__and_Related_Materials
   Sewing_Machine_Operators
   Shoe_and_Leather_Workers_and_Repairers
   Cutters_and_Trimmers__Hand
   Painting__Coating__and_Decorating_Workers
   Cleaning__Washing__and_Metal_Pickling_Equipment_Operators_and_Tenders
   Stone_Cutters_and_Carvers__Manufacturing
   Cleaners_of_Vehicles_and_Equipment
   Packers_and_Packagers__Hand

Occupations in cluster 39 (13 total):
   Environmental_Compliance_Inspectors
   Soil_and_Plant_Scientists
   Zoologists_and_Wildlife_Biologists
   Soil_and_Water_Conservationists
   Range_Managers
   Park_Naturalists
   Foresters
   Environmental_Scientists_and_Specialists__Including_Health
   Environmental_Restoration_Planners
   Geoscientists__Except_Hydrologists_and_Geographers
   Hydrologists
   Archeologists
   Forest_Fire_Inspectors_and_Prevention_Specialists

Occupations in cluster 40 (3 total):
   Athletes_and_Sports_Competitors
   Dancers
   Choreographers

Occupations in cluster 41 (19 total):
   School_Psychologists
   Clinical_Psychologists
   Counseling_Psychologists
   Substance_Abuse_and_Behavioral_Disorder_Counselors
   Educational__Guidance__School__and_Vocational_Counselors
   Marriage_and_Family_Therapists
   Mental_Health_Counselors
   Rehabilitation_Counselors
   Child__Family__and_School_Social_Workers
   Healthcare_Social_Workers
   Mental_Health_and_Substance_Abuse_Social_Workers
   Probation_Officers_and_Correctional_Treatment_Specialists
   Social_and_Human_Service_Assistants
   Psychiatrists
   Speech-Language_Pathologists
   Advanced_Practice_Psychiatric_Nurses
   Naturopathic_Physicians
   Genetic_Counselors
   Patient_Representatives

Occupations in cluster 42 (23 total):
   Treasurers_and_Controllers
   Compensation_and_Benefits_Managers
   Compensation__Benefits__and_Job_Analysis_Specialists
   Market_Research_Analysts_and_Marketing_Specialists
   Accountants
   Auditors
   Budget_Analysts
   Credit_Analysts
   Financial_Analysts
   Insurance_Underwriters
   Financial_Examiners
   Tax_Preparers
   Risk_Management_Specialists
   Fraud_Examiners__Investigators_and_Analysts
   Business_Intelligence_Analysts
   Actuaries
   Clinical_Data_Managers
   Climate_Change_Analysts
   Survey_Researchers
   Social_Science_Research_Assistants
   Judicial_Law_Clerks
   Intelligence_Analysts
   Statistical_Assistants

Occupations in cluster 43 (10 total):
   Dietetic_Technicians
   Transportation_Security_Screeners
   Cooks__Institution_and_Cafeteria
   Cooks__Restaurant
   Cooks__Short_Order
   Shipping__Receiving__and_Traffic_Clerks
   Weighers__Measurers__Checkers__and_Samplers__Recordkeeping
   Bakers
   Butchers_and_Meat_Cutters
   Inspectors__Testers__Sorters__Samplers__and_Weighers

Occupations in cluster 44 (10 total):
   Chemists
   Agricultural_Technicians
   Food_Science_Technicians
   Biological_Technicians
   Chemical_Technicians
   Geophysical_Data_Technicians
   Quality_Control_Analysts
   Museum_Technicians_and_Conservators
   Cytogenetic_Technologists
   Cytotechnologists

Occupations in cluster 45 (16 total):
   Claims_Examiners__Property_and_Casualty_Insurance
   Insurance_Adjusters__Examiners__and_Investigators
   Human_Resources_Specialists
   Credit_Counselors
   Tax_Examiners_and_Collectors__and_Revenue_Agents
   Bill_and_Account_Collectors
   Procurement_Clerks
   Brokerage_Clerks
   Correspondence_Clerks
   Credit_Authorizers
   Credit_Checkers
   Customer_Service_Representatives
   Eligibility_Interviewers__Government_Programs
   Loan_Interviewers_and_Clerks
   New_Accounts_Clerks
   Insurance_Policy_Processing_Clerks

Occupations in cluster 46 (21 total):
   Landscaping_and_Groundskeeping_Workers
   Agricultural_Equipment_Operators
   Nursery_Workers
   Farmworkers_and_Laborers__Crop
   Farmworkers__Farm__Ranch__and_Aquacultural_Animals
   Fishers_and_Related_Fishing_Workers
   Fallers
   Logging_Equipment_Operators
   Pipelayers
   Fence_Erectors
   Earth_Drillers__Except_Oil_and_Gas
   Rock_Splitters__Quarry
   Tire_Repairers_and_Changers
   Helpers--Installation__Maintenance__and_Repair_Workers
   Recycling_and_Reclamation_Workers
   Crane_and_Tower_Operators
   Dredge_Operators
   Loading_Machine_Operators__Underground_Mining
   Industrial_Truck_and_Tractor_Operators
   Laborers_and_Freight__Stock__and_Material_Movers__Hand
   Mine_Shuttle_Car_Operators

Occupations in cluster 47 (8 total):
   Video_Game_Designers
   Multimedia_Artists_and_Animators
   Graphic_Designers
   Music_Composers_and_Arrangers
   Editors
   Technical_Writers
   Poets__Lyricists_and_Creative_Writers
   Film_and_Video_Editors

Occupations in cluster 48 (8 total):
   Natural_Sciences_Managers
   Animal_Scientists
   Food_Scientists_and_Technologists
   Biologists
   Biochemists_and_Biophysicists
   Microbiologists
   Molecular_and_Cellular_Biologists
   Geneticists

In [98]:
#Function to take in a career and return other careers in the same cluster 
def getCluster(occ_title):
    occ_cluster = clusters.ix[occ_title]
    cluster_data = clusters[clusters.cluster == occ_cluster.cluster]
    print '\nOccupations that are similar to ' + occ_title + ' are:'
    for occ in cluster_data.index:
        if occ != occ_title:
            print '   '  + occ
#     return cluster_data

In [99]:
getCluster('Aerospace_Engineers')


Occupations that are similar to Aerospace_Engineers are:
   Architectural_and_Engineering_Managers
   Water_Resource_Specialists
   Brownfield_Redevelopment_Specialists_and_Site_Managers
   Logistics_Engineers
   Agricultural_Engineers
   Biomedical_Engineers
   Chemical_Engineers
   Civil_Engineers
   Transportation_Engineers
   Electrical_Engineers
   Electronics_Engineers__Except_Computer
   Environmental_Engineers
   Water_Wastewater_Engineers
   Fire-Prevention_and_Protection_Engineers
   Product_Safety_Engineers
   Industrial_Engineers
   Human_Factors_Engineers_and_Ergonomists
   Marine_Architects
   Materials_Engineers
   Mechanical_Engineers
   Mining_and_Geological_Engineers__Including_Mining_Safety_Engineers
   Nuclear_Engineers
   Petroleum_Engineers
   Biochemical_Engineers
   Validation_Engineers
   Energy_Engineers
   Mechatronics_Engineers
   Photonics_Engineers
   Wind_Energy_Engineers
   Industrial_Engineering_Technicians
   Electrical_Engineering_Technologists
   Industrial_Engineering_Technologists
   Materials_Scientists
   Commercial_and_Industrial_Designers

Distance-based Analysis

Question 1: Which occupations are the most unique?

We consider unique to be those careers that are outside of the 95th percentile in terms of distance to the nearest neighboring occupation


In [100]:
#calculate, for each occupation, the distance to get to the 1st, 2nd, 3rd, 4th, and 5th closest occupations by distance 
closest_five_distances = DataFrame()

for occ_title in occ_titles:
    distances = DataFrame(euclid_dist_df_normed.xs(occ_title))
    distances
    distances.sort(occ_title,inplace=True)
    low_distances = distances.iloc[1:,:].head(5).T
    low_distances.columns = [1,2,3,4,5]
    closest_five_distances= pd.concat([closest_five_distances,low_distances])

In [101]:
closest_five_stats = closest_five_distances.describe(percentile_width=90)

In [102]:
def get_unique_careers(distance_threshold = closest_five_stats.ix['95%'][1]):
    unique_occs = []
    for occ_title in occ_titles:
        distances = euclid_dist_df_normed.xs(occ_title)
        nearby_occs = len(distances[distances <= distance_threshold])-1
        if nearby_occs == 0:
            unique_occs.append((occ_title, nearby_occs))
    unique_careers = DataFrame(unique_occs)
    print '\nThe most unique occupations are: '
    for occ in unique_careers[0]:
        print '   ' + occ

In [103]:
get_unique_careers()


The most unique occupations are: 
   Insurance_Appraisers__Auto_Damage
   Government_Property_Inspectors_and_Investigators
   Farm_Labor_Contractors
   Electro-Mechanical_Technicians
   Historians
   Floral_Designers
   Merchandise_Displayers_and_Window_Trimmers
   Set_and_Exhibit_Designers
   Actors
   Athletes_and_Sports_Competitors
   Umpires__Referees__and_Other_Sports_Officials
   Dancers
   Choreographers
   Music_Directors
   Singers
   Musicians__Instrumental
   Photographers
   Camera_Operators__Television__Video__and_Motion_Picture
   Massage_Therapists
   Private_Detectives_and_Investigators
   Gaming_Surveillance_Officers_and_Gaming_Investigators
   Crossing_Guards
   Lifeguards__Ski_Patrol__and_Other_Recreational_Protective_Service_Workers
   Transportation_Security_Screeners
   Cooks__Private_Household
   Animal_Trainers
   Costume_Attendants
   Makeup_Artists__Theatrical_and_Performance
   Demonstrators_and_Product_Promoters
   Models
   Door-To-Door_Sales_Workers__News_and_Street_Vendors__and_Related_Workers
   Proofreaders_and_Copy_Markers
   Animal_Breeders
   Graders_and_Sorters__Agricultural_Products
   Agricultural_Equipment_Operators
   Farmworkers_and_Laborers__Crop
   Hunters_and_Trappers
   Fallers
   Coin__Vending__and_Amusement_Machine_Servicers_and_Repairers
   Manufactured_Building_and_Mobile_Home_Installers
   Furnace__Kiln__Oven__Drier__and_Kettle_Operators_and_Tenders
   Airline_Pilots__Copilots__and_Flight_Engineers
   Air_Traffic_Controllers
   Flight_Attendants
   Light_Truck_or_Delivery_Services_Drivers
   Conveyor_Operators_and_Tenders
   Mine_Shuttle_Car_Operators

In [104]:
#function used elsewhere to print out all of the domains and features in an occupation data frame
def print_domains_features(occ_frame):
    for domain in occ_frame.index.get_level_values('domain').unique():
        sub_frame = occ_frame.xs(domain, level=0)
        print '   ' + domain.replace("_", " ")
        for feature in sub_frame.index:
            print '      ' + feature.replace("_", " ")

In [105]:
print 'Here is a list of all of the possible occupations:'
set(occ_titles)


Here is a list of all of the possible occupations:
Out[105]:
{'Accountants',
 'Actors',
 'Actuaries',
 'Acupuncturists',
 'Acute_Care_Nurses',
 'Adapted_Physical_Education_Specialists',
 'Adhesive_Bonding_Machine_Operators_and_Tenders',
 'Administrative_Law_Judges__Adjudicators__and_Hearing_Officers',
 'Administrative_Services_Managers',
 'Adult_Basic_and_Secondary_Education_and_Literacy_Teachers_and_Instructors',
 'Advanced_Practice_Psychiatric_Nurses',
 'Advertising_Sales_Agents',
 'Advertising_and_Promotions_Managers',
 'Aerospace_Engineering_and_Operations_Technicians',
 'Aerospace_Engineers',
 'Agents_and_Business_Managers_of_Artists__Performers__and_Athletes',
 'Agricultural_Engineers',
 'Agricultural_Equipment_Operators',
 'Agricultural_Inspectors',
 'Agricultural_Sciences_Teachers__Postsecondary',
 'Agricultural_Technicians',
 'Air_Traffic_Controllers',
 'Aircraft_Cargo_Handling_Supervisors',
 'Aircraft_Mechanics_and_Service_Technicians',
 'Aircraft_Structure__Surfaces__Rigging__and_Systems_Assemblers',
 'Airfield_Operations_Specialists',
 'Airline_Pilots__Copilots__and_Flight_Engineers',
 'Allergists_and_Immunologists',
 'Ambulance_Drivers_and_Attendants__Except_Emergency_Medical_Technicians',
 'Amusement_and_Recreation_Attendants',
 'Anesthesiologist_Assistants',
 'Anesthesiologists',
 'Animal_Breeders',
 'Animal_Control_Workers',
 'Animal_Scientists',
 'Animal_Trainers',
 'Anthropologists',
 'Anthropology_and_Archeology_Teachers__Postsecondary',
 'Appraisers__Real_Estate',
 'Aquacultural_Managers',
 'Arbitrators__Mediators__and_Conciliators',
 'Archeologists',
 'Architects__Except_Landscape_and_Naval',
 'Architectural_Drafters',
 'Architectural_and_Engineering_Managers',
 'Architecture_Teachers__Postsecondary',
 'Archivists',
 'Area__Ethnic__and_Cultural_Studies_Teachers__Postsecondary',
 'Art_Directors',
 'Art__Drama__and_Music_Teachers__Postsecondary',
 'Assessors',
 'Astronomers',
 'Athletes_and_Sports_Competitors',
 'Athletic_Trainers',
 'Atmospheric__Earth__Marine__and_Space_Sciences_Teachers__Postsecondary',
 'Atmospheric_and_Space_Scientists',
 'Audio-Visual_and_Multimedia_Collections_Specialists',
 'Audio_and_Video_Equipment_Technicians',
 'Audiologists',
 'Auditors',
 'Automotive_Body_and_Related_Repairers',
 'Automotive_Glass_Installers_and_Repairers',
 'Automotive_Master_Mechanics',
 'Automotive_Specialty_Technicians',
 'Automotive_and_Watercraft_Service_Attendants',
 'Aviation_Inspectors',
 'Avionics_Technicians',
 'Baggage_Porters_and_Bellhops',
 'Bailiffs',
 'Bakers',
 'Barbers',
 'Bartenders',
 'Bicycle_Repairers',
 'Bill_and_Account_Collectors',
 'Billing__Cost__and_Rate_Clerks',
 'Biochemical_Engineers',
 'Biochemists_and_Biophysicists',
 'Bioinformatics_Scientists',
 'Biological_Science_Teachers__Postsecondary',
 'Biological_Technicians',
 'Biologists',
 'Biomass_Plant_Technicians',
 'Biomass_Power_Plant_Managers',
 'Biomedical_Engineers',
 'Biostatisticians',
 'Boilermakers',
 'Bookkeeping__Accounting__and_Auditing_Clerks',
 'Brickmasons_and_Blockmasons',
 'Bridge_and_Lock_Tenders',
 'Broadcast_News_Analysts',
 'Broadcast_Technicians',
 'Brokerage_Clerks',
 'Brownfield_Redevelopment_Specialists_and_Site_Managers',
 'Budget_Analysts',
 'Bus_Drivers__School_or_Special_Client',
 'Bus_Drivers__Transit_and_Intercity',
 'Bus_and_Truck_Mechanics_and_Diesel_Engine_Specialists',
 'Business_Continuity_Planners',
 'Business_Intelligence_Analysts',
 'Business_Teachers__Postsecondary',
 'Butchers_and_Meat_Cutters',
 'Buyers_and_Purchasing_Agents__Farm_Products',
 'Cabinetmakers_and_Bench_Carpenters',
 'Camera_Operators__Television__Video__and_Motion_Picture',
 'Camera_and_Photographic_Equipment_Repairers',
 'Cardiovascular_Technologists_and_Technicians',
 'Career_Technical_Education_Teachers__Middle_School',
 'Career_Technical_Education_Teachers__Secondary_School',
 'Cargo_and_Freight_Agents',
 'Carpet_Installers',
 'Cartographers_and_Photogrammetrists',
 'Cashiers',
 'Cement_Masons_and_Concrete_Finishers',
 'Chefs_and_Head_Cooks',
 'Chemical_Engineers',
 'Chemical_Equipment_Operators_and_Tenders',
 'Chemical_Plant_and_System_Operators',
 'Chemical_Technicians',
 'Chemistry_Teachers__Postsecondary',
 'Chemists',
 'Chief_Executives',
 'Chief_Sustainability_Officers',
 'Child__Family__and_School_Social_Workers',
 'Childcare_Workers',
 'Chiropractors',
 'Choreographers',
 'City_and_Regional_Planning_Aides',
 'Civil_Drafters',
 'Civil_Engineering_Technicians',
 'Civil_Engineers',
 'Claims_Examiners__Property_and_Casualty_Insurance',
 'Cleaners_of_Vehicles_and_Equipment',
 'Cleaning__Washing__and_Metal_Pickling_Equipment_Operators_and_Tenders',
 'Clergy',
 'Climate_Change_Analysts',
 'Clinical_Data_Managers',
 'Clinical_Nurse_Specialists',
 'Clinical_Psychologists',
 'Clinical_Research_Coordinators',
 'Coaches_and_Scouts',
 'Coating__Painting__and_Spraying_Machine_Setters__Operators__and_Tenders',
 'Coil_Winders__Tapers__and_Finishers',
 'Coin__Vending__and_Amusement_Machine_Servicers_and_Repairers',
 'Combined_Food_Preparation_and_Serving_Workers__Including_Fast_Food',
 'Commercial_Divers',
 'Commercial_Pilots',
 'Commercial_and_Industrial_Designers',
 'Communications_Teachers__Postsecondary',
 'Community_Health_Workers',
 'Compensation__Benefits__and_Job_Analysis_Specialists',
 'Compensation_and_Benefits_Managers',
 'Compliance_Managers',
 'Computer-Controlled_Machine_Tool_Operators__Metal_and_Plastic',
 'Computer_Hardware_Engineers',
 'Computer_Network_Architects',
 'Computer_Numerically_Controlled_Machine_Tool_Programmers__Metal_and_Plastic',
 'Computer_Operators',
 'Computer_Programmers',
 'Computer_Science_Teachers__Postsecondary',
 'Computer_Systems_Analysts',
 'Computer_Systems_Engineers_Architects',
 'Computer_User_Support_Specialists',
 'Computer__Automated_Teller__and_Office_Machine_Repairers',
 'Computer_and_Information_Research_Scientists',
 'Computer_and_Information_Systems_Managers',
 'Concierges',
 'Construction_Carpenters',
 'Construction_Laborers',
 'Construction_Managers',
 'Construction_and_Building_Inspectors',
 'Continuous_Mining_Machine_Operators',
 'Control_and_Valve_Installers_and_Repairers__Except_Mechanical_Door',
 'Conveyor_Operators_and_Tenders',
 'Cooks__Fast_Food',
 'Cooks__Institution_and_Cafeteria',
 'Cooks__Private_Household',
 'Cooks__Restaurant',
 'Cooks__Short_Order',
 'Cooling_and_Freezing_Equipment_Operators_and_Tenders',
 'Copy_Writers',
 'Coroners',
 'Correctional_Officers_and_Jailers',
 'Correspondence_Clerks',
 'Cost_Estimators',
 'Costume_Attendants',
 'Counseling_Psychologists',
 'Counter_Attendants__Cafeteria__Food_Concession__and_Coffee_Shop',
 'Counter_and_Rental_Clerks',
 'Couriers_and_Messengers',
 'Court_Clerks',
 'Court_Reporters',
 'Craft_Artists',
 'Crane_and_Tower_Operators',
 'Credit_Analysts',
 'Credit_Authorizers',
 'Credit_Checkers',
 'Credit_Counselors',
 'Criminal_Investigators_and_Special_Agents',
 'Criminal_Justice_and_Law_Enforcement_Teachers__Postsecondary',
 'Critical_Care_Nurses',
 'Crossing_Guards',
 'Crushing__Grinding__and_Polishing_Machine_Setters__Operators__and_Tenders',
 'Curators',
 'Customer_Service_Representatives',
 'Customs_Brokers',
 'Cutters_and_Trimmers__Hand',
 'Cutting__Punching__and_Press_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Cutting_and_Slicing_Machine_Setters__Operators__and_Tenders',
 'Cytogenetic_Technologists',
 'Cytotechnologists',
 'Dancers',
 'Data_Entry_Keyers',
 'Database_Administrators',
 'Demonstrators_and_Product_Promoters',
 'Dental_Assistants',
 'Dental_Hygienists',
 'Dental_Laboratory_Technicians',
 'Dentists__General',
 'Dermatologists',
 'Derrick_Operators__Oil_and_Gas',
 'Desktop_Publishers',
 'Diagnostic_Medical_Sonographers',
 'Dietetic_Technicians',
 'Dietitians_and_Nutritionists',
 'Dining_Room_and_Cafeteria_Attendants_and_Bartender_Helpers',
 'Directors-_Stage__Motion_Pictures__Television__and_Radio',
 'Directors__Religious_Activities_and_Education',
 'Dishwashers',
 'Dispatchers__Except_Police__Fire__and_Ambulance',
 'Distance_Learning_Coordinators',
 'Document_Management_Specialists',
 'Door-To-Door_Sales_Workers__News_and_Street_Vendors__and_Related_Workers',
 'Dredge_Operators',
 'Drilling_and_Boring_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Driver_Sales_Workers',
 'Drywall_and_Ceiling_Tile_Installers',
 'Earth_Drillers__Except_Oil_and_Gas',
 'Economics_Teachers__Postsecondary',
 'Economists',
 'Editors',
 'Education_Administrators__Elementary_and_Secondary_School',
 'Education_Administrators__Postsecondary',
 'Education_Administrators__Preschool_and_Childcare_Center_Program',
 'Education_Teachers__Postsecondary',
 'Educational__Guidance__School__and_Vocational_Counselors',
 'Electric_Motor__Power_Tool__and_Related_Repairers',
 'Electrical_Drafters',
 'Electrical_Engineering_Technicians',
 'Electrical_Engineering_Technologists',
 'Electrical_Engineers',
 'Electrical_Power-Line_Installers_and_Repairers',
 'Electrical_and_Electronic_Equipment_Assemblers',
 'Electrical_and_Electronics_Installers_and_Repairers__Transportation_Equipment',
 'Electrical_and_Electronics_Repairers__Commercial_and_Industrial_Equipment',
 'Electrical_and_Electronics_Repairers__Powerhouse__Substation__and_Relay',
 'Electricians',
 'Electro-Mechanical_Technicians',
 'Electromechanical_Engineering_Technologists',
 'Electromechanical_Equipment_Assemblers',
 'Electronic_Drafters',
 'Electronic_Equipment_Installers_and_Repairers__Motor_Vehicles',
 'Electronic_Home_Entertainment_Equipment_Installers_and_Repairers',
 'Electronics_Engineering_Technicians',
 'Electronics_Engineering_Technologists',
 'Electronics_Engineers__Except_Computer',
 'Elementary_School_Teachers__Except_Special_Education',
 'Elevator_Installers_and_Repairers',
 'Eligibility_Interviewers__Government_Programs',
 'Embalmers',
 'Emergency_Management_Directors',
 'Emergency_Medical_Technicians_and_Paramedics',
 'Endoscopy_Technicians',
 'Energy_Auditors',
 'Energy_Engineers',
 'Engine_and_Other_Machine_Assemblers',
 'Engineering_Teachers__Postsecondary',
 'English_Language_and_Literature_Teachers__Postsecondary',
 'Environmental_Compliance_Inspectors',
 'Environmental_Economists',
 'Environmental_Engineering_Technicians',
 'Environmental_Engineers',
 'Environmental_Restoration_Planners',
 'Environmental_Science_Teachers__Postsecondary',
 'Environmental_Science_and_Protection_Technicians__Including_Health',
 'Environmental_Scientists_and_Specialists__Including_Health',
 'Epidemiologists',
 'Equal_Opportunity_Representatives_and_Officers',
 'Etchers_and_Engravers',
 'Excavating_and_Loading_Machine_and_Dragline_Operators',
 'Executive_Secretaries_and_Executive_Administrative_Assistants',
 'Exercise_Physiologists',
 'Explosives_Workers__Ordnance_Handling_Experts__and_Blasters',
 'Extruding__Forming__Pressing__and_Compacting_Machine_Setters__Operators__and_Tenders',
 'Extruding_and_Drawing_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Extruding_and_Forming_Machine_Setters__Operators__and_Tenders__Synthetic_and_Glass_Fibers',
 'Fabric_Menders__Except_Garment',
 'Fabric_and_Apparel_Patternmakers',
 'Fallers',
 'Family_and_General_Practitioners',
 'Farm_Equipment_Mechanics_and_Service_Technicians',
 'Farm_Labor_Contractors',
 'Farm_and_Home_Management_Advisors',
 'Farmworkers__Farm__Ranch__and_Aquacultural_Animals',
 'Farmworkers_and_Laborers__Crop',
 'Fashion_Designers',
 'Fence_Erectors',
 'Fiberglass_Laminators_and_Fabricators',
 'File_Clerks',
 'Film_and_Video_Editors',
 'Financial_Analysts',
 'Financial_Examiners',
 'Financial_Managers__Branch_or_Department',
 'Fine_Artists__Including_Painters__Sculptors__and_Illustrators',
 'Fire-Prevention_and_Protection_Engineers',
 'Fire_Inspectors',
 'Fire_Investigators',
 'First-Line_Supervisors_of_Agricultural_Crop_and_Horticultural_Workers',
 'First-Line_Supervisors_of_Animal_Husbandry_and_Animal_Care_Workers',
 'First-Line_Supervisors_of_Aquacultural_Workers',
 'First-Line_Supervisors_of_Construction_Trades_and_Extraction_Workers',
 'First-Line_Supervisors_of_Correctional_Officers',
 'First-Line_Supervisors_of_Food_Preparation_and_Serving_Workers',
 'First-Line_Supervisors_of_Helpers__Laborers__and_Material_Movers__Hand',
 'First-Line_Supervisors_of_Housekeeping_and_Janitorial_Workers',
 'First-Line_Supervisors_of_Landscaping__Lawn_Service__and_Groundskeeping_Workers',
 'First-Line_Supervisors_of_Logging_Workers',
 'First-Line_Supervisors_of_Mechanics__Installers__and_Repairers',
 'First-Line_Supervisors_of_Non-Retail_Sales_Workers',
 'First-Line_Supervisors_of_Office_and_Administrative_Support_Workers',
 'First-Line_Supervisors_of_Personal_Service_Workers',
 'First-Line_Supervisors_of_Police_and_Detectives',
 'First-Line_Supervisors_of_Production_and_Operating_Workers',
 'First-Line_Supervisors_of_Retail_Sales_Workers',
 'First-Line_Supervisors_of_Transportation_and_Material-Moving_Machine_and_Vehicle_Operators',
 'Fish_and_Game_Wardens',
 'Fishers_and_Related_Fishing_Workers',
 'Fitness_Trainers_and_Aerobics_Instructors',
 'Flight_Attendants',
 'Floor_Layers__Except_Carpet__Wood__and_Hard_Tiles',
 'Floor_Sanders_and_Finishers',
 'Floral_Designers',
 'Food_Batchmakers',
 'Food_Cooking_Machine_Operators_and_Tenders',
 'Food_Preparation_Workers',
 'Food_Science_Technicians',
 'Food_Scientists_and_Technologists',
 'Food_Servers__Nonrestaurant',
 'Food_Service_Managers',
 'Food_and_Tobacco_Roasting__Baking__and_Drying_Machine_Operators_and_Tenders',
 'Foreign_Language_and_Literature_Teachers__Postsecondary',
 'Forensic_Science_Technicians',
 'Forest_Fire_Fighting_and_Prevention_Supervisors',
 'Forest_Fire_Inspectors_and_Prevention_Specialists',
 'Forest_Firefighters',
 'Forest_and_Conservation_Technicians',
 'Forest_and_Conservation_Workers',
 'Foresters',
 'Forestry_and_Conservation_Science_Teachers__Postsecondary',
 'Forging_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Foundry_Mold_and_Coremakers',
 'Fraud_Examiners__Investigators_and_Analysts',
 'Freight_and_Cargo_Inspectors',
 'Funeral_Attendants',
 'Furnace__Kiln__Oven__Drier__and_Kettle_Operators_and_Tenders',
 'Furniture_Finishers',
 'Gaming_Cage_Workers',
 'Gaming_Change_Persons_and_Booth_Cashiers',
 'Gaming_Dealers',
 'Gaming_Managers',
 'Gaming_Supervisors',
 'Gaming_Surveillance_Officers_and_Gaming_Investigators',
 'Gaming_and_Sports_Book_Writers_and_Runners',
 'Gas_Compressor_and_Gas_Pumping_Station_Operators',
 'Gas_Plant_Operators',
 'Gem_and_Diamond_Workers',
 'General_and_Operations_Managers',
 'Genetic_Counselors',
 'Geneticists',
 'Geodetic_Surveyors',
 'Geographers',
 'Geographic_Information_Systems_Technicians',
 'Geography_Teachers__Postsecondary',
 'Geological_Sample_Test_Technicians',
 'Geophysical_Data_Technicians',
 'Geoscientists__Except_Hydrologists_and_Geographers',
 'Geospatial_Information_Scientists_and_Technologists',
 'Geothermal_Production_Managers',
 'Geothermal_Technicians',
 'Glass_Blowers__Molders__Benders__and_Finishers',
 'Glaziers',
 'Government_Property_Inspectors_and_Investigators',
 'Graders_and_Sorters__Agricultural_Products',
 'Graduate_Teaching_Assistants',
 'Graphic_Designers',
 'Grinding__Lapping__Polishing__and_Buffing_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Grinding_and_Polishing_Workers__Hand',
 'Hairdressers__Hairstylists__and_Cosmetologists',
 'Hazardous_Materials_Removal_Workers',
 'Health_Educators',
 'Health_Specialties_Teachers__Postsecondary',
 'Healthcare_Social_Workers',
 'Hearing_Aid_Specialists',
 'Heat_Treating_Equipment_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Heating_and_Air_Conditioning_Mechanics_and_Installers',
 'Heavy_and_Tractor-Trailer_Truck_Drivers',
 'Helpers--Brickmasons__Blockmasons__Stonemasons__and_Tile_and_Marble_Setters',
 'Helpers--Carpenters',
 'Helpers--Electricians',
 'Helpers--Extraction_Workers',
 'Helpers--Installation__Maintenance__and_Repair_Workers',
 'Helpers--Painters__Paperhangers__Plasterers__and_Stucco_Masons',
 'Helpers--Pipelayers__Plumbers__Pipefitters__and_Steamfitters',
 'Helpers--Production_Workers',
 'Helpers--Roofers',
 'Highway_Maintenance_Workers',
 'Historians',
 'History_Teachers__Postsecondary',
 'Histotechnologists_and_Histologic_Technicians',
 'Hoist_and_Winch_Operators',
 'Home_Appliance_Repairers',
 'Home_Economics_Teachers__Postsecondary',
 'Home_Health_Aides',
 'Hospitalists',
 'Hosts_and_Hostesses__Restaurant__Lounge__and_Coffee_Shop',
 'Hotel__Motel__and_Resort_Desk_Clerks',
 'Human_Factors_Engineers_and_Ergonomists',
 'Human_Resources_Assistants__Except_Payroll_and_Timekeeping',
 'Human_Resources_Managers',
 'Human_Resources_Specialists',
 'Hunters_and_Trappers',
 'Hydrologists',
 'Immigration_and_Customs_Inspectors',
 'Industrial-Organizational_Psychologists',
 'Industrial_Engineering_Technicians',
 'Industrial_Engineering_Technologists',
 'Industrial_Engineers',
 'Industrial_Machinery_Mechanics',
 'Industrial_Production_Managers',
 'Industrial_Safety_and_Health_Engineers',
 'Industrial_Truck_and_Tractor_Operators',
 'Informatics_Nurse_Specialists',
 'Information_Security_Analysts',
 'Information_Technology_Project_Managers',
 'Inspectors__Testers__Sorters__Samplers__and_Weighers',
 'Instructional_Coordinators',
 'Instructional_Designers_and_Technologists',
 'Insulation_Workers__Floor__Ceiling__and_Wall',
 'Insulation_Workers__Mechanical',
 'Insurance_Adjusters__Examiners__and_Investigators',
 'Insurance_Appraisers__Auto_Damage',
 'Insurance_Claims_Clerks',
 'Insurance_Policy_Processing_Clerks',
 'Insurance_Sales_Agents',
 'Insurance_Underwriters',
 'Intelligence_Analysts',
 'Interior_Designers',
 'Internists__General',
 'Interpreters_and_Translators',
 'Interviewers__Except_Eligibility_and_Loan',
 'Janitors_and_Cleaners__Except_Maids_and_Housekeeping_Cleaners',
 'Jewelers',
 'Judges__Magistrate_Judges__and_Magistrates',
 'Judicial_Law_Clerks',
 'Kindergarten_Teachers__Except_Special_Education',
 'Laborers_and_Freight__Stock__and_Material_Movers__Hand',
 'Landscape_Architects',
 'Landscaping_and_Groundskeeping_Workers',
 'Lathe_and_Turning_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Laundry_and_Dry-Cleaning_Workers',
 'Law_Teachers__Postsecondary',
 'Lawyers',
 'Layout_Workers__Metal_and_Plastic',
 'Legal_Secretaries',
 'Librarians',
 'Library_Assistants__Clerical',
 'Library_Science_Teachers__Postsecondary',
 'Library_Technicians',
 'License_Clerks',
 'Licensed_Practical_and_Licensed_Vocational_Nurses',
 'Licensing_Examiners_and_Inspectors',
 'Lifeguards__Ski_Patrol__and_Other_Recreational_Protective_Service_Workers',
 'Light_Truck_or_Delivery_Services_Drivers',
 'Loading_Machine_Operators__Underground_Mining',
 'Loan_Counselors',
 'Loan_Interviewers_and_Clerks',
 'Loan_Officers',
 'Locker_Room__Coatroom__and_Dressing_Room_Attendants',
 'Locksmiths_and_Safe_Repairers',
 'Locomotive_Engineers',
 'Locomotive_Firers',
 'Lodging_Managers',
 'Log_Graders_and_Scalers',
 'Logging_Equipment_Operators',
 'Logisticians',
 'Logistics_Analysts',
 'Logistics_Engineers',
 'Logistics_Managers',
 'Loss_Prevention_Managers',
 'Low_Vision_Therapists__Orientation_and_Mobility_Specialists__and_Vision_Rehabilitation_Therapists',
 'Machine_Feeders_and_Offbearers',
 'Machinists',
 'Magnetic_Resonance_Imaging_Technologists',
 'Maids_and_Housekeeping_Cleaners',
 'Mail_Clerks_and_Mail_Machine_Operators__Except_Postal_Service',
 'Maintenance_Workers__Machinery',
 'Maintenance_and_Repair_Workers__General',
 'Makeup_Artists__Theatrical_and_Performance',
 'Management_Analysts',
 'Manicurists_and_Pedicurists',
 'Manufactured_Building_and_Mobile_Home_Installers',
 'Manufacturing_Engineering_Technologists',
 'Manufacturing_Engineers',
 'Manufacturing_Production_Technicians',
 'Mapping_Technicians',
 'Marine_Architects',
 'Marine_Engineers',
 'Market_Research_Analysts_and_Marketing_Specialists',
 'Marketing_Managers',
 'Marking_Clerks',
 'Marriage_and_Family_Therapists',
 'Massage_Therapists',
 'Materials_Engineers',
 'Materials_Scientists',
 'Mates-_Ship__Boat__and_Barge',
 'Mathematical_Science_Teachers__Postsecondary',
 'Mathematicians',
 'Meat__Poultry__and_Fish_Cutters_and_Trimmers',
 'Mechanical_Door_Repairers',
 'Mechanical_Drafters',
 'Mechanical_Engineering_Technicians',
 'Mechanical_Engineering_Technologists',
 'Mechanical_Engineers',
 'Mechatronics_Engineers',
 'Medical_Appliance_Technicians',
 'Medical_Assistants',
 'Medical_Equipment_Preparers',
 'Medical_Equipment_Repairers',
 'Medical_Records_and_Health_Information_Technicians',
 'Medical_Scientists__Except_Epidemiologists',
 'Medical_Secretaries',
 'Medical_Transcriptionists',
 'Medical_and_Clinical_Laboratory_Technicians',
 'Medical_and_Clinical_Laboratory_Technologists',
 'Medical_and_Health_Services_Managers',
 'Meeting__Convention__and_Event_Planners',
 'Mental_Health_Counselors',
 'Mental_Health_and_Substance_Abuse_Social_Workers',
 'Merchandise_Displayers_and_Window_Trimmers',
 'Metal-Refining_Furnace_Operators_and_Tenders',
 'Meter_Readers__Utilities',
 'Microbiologists',
 'Middle_School_Teachers__Except_Special_and_Career_Technical_Education',
 'Midwives',
 'Milling_and_Planing_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Millwrights',
 'Mine_Cutting_and_Channeling_Machine_Operators',
 'Mine_Shuttle_Car_Operators',
 'Mining_and_Geological_Engineers__Including_Mining_Safety_Engineers',
 'Mixing_and_Blending_Machine_Setters__Operators__and_Tenders',
 'Mobile_Heavy_Equipment_Mechanics__Except_Engines',
 'Model_Makers__Metal_and_Plastic',
 'Model_Makers__Wood',
 'Models',
 'Molding__Coremaking__and_Casting_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Molding_and_Casting_Workers',
 'Molecular_and_Cellular_Biologists',
 'Morticians__Undertakers__and_Funeral_Directors',
 'Motion_Picture_Projectionists',
 'Motorboat_Mechanics_and_Service_Technicians',
 'Motorboat_Operators',
 'Motorcycle_Mechanics',
 'Multimedia_Artists_and_Animators',
 'Multiple_Machine_Tool_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Municipal_Clerks',
 'Municipal_Fire_Fighting_and_Prevention_Supervisors',
 'Municipal_Firefighters',
 'Museum_Technicians_and_Conservators',
 'Music_Composers_and_Arrangers',
 'Music_Directors',
 'Musical_Instrument_Repairers_and_Tuners',
 'Musicians__Instrumental',
 'Nannies',
 'Natural_Sciences_Managers',
 'Naturopathic_Physicians',
 'Network_and_Computer_Systems_Administrators',
 'Neurodiagnostic_Technologists',
 'Neurologists',
 'Neuropsychologists_and_Clinical_Neuropsychologists',
 'New_Accounts_Clerks',
 'Non-Destructive_Testing_Specialists',
 'Nonfarm_Animal_Caretakers',
 'Nuclear_Engineers',
 'Nuclear_Equipment_Operation_Technicians',
 'Nuclear_Medicine_Physicians',
 'Nuclear_Medicine_Technologists',
 'Nuclear_Monitoring_Technicians',
 'Nuclear_Power_Reactor_Operators',
 'Nurse_Anesthetists',
 'Nurse_Midwives',
 'Nurse_Practitioners',
 'Nursery_Workers',
 'Nursery_and_Greenhouse_Managers',
 'Nursing_Assistants',
 'Nursing_Instructors_and_Teachers__Postsecondary',
 'Obstetricians_and_Gynecologists',
 'Occupational_Health_and_Safety_Specialists',
 'Occupational_Health_and_Safety_Technicians',
 'Occupational_Therapists',
 'Occupational_Therapy_Aides',
 'Occupational_Therapy_Assistants',
 'Office_Clerks__General',
 'Office_Machine_Operators__Except_Computer',
 'Operating_Engineers_and_Other_Construction_Equipment_Operators',
 'Operations_Research_Analysts',
 'Ophthalmic_Laboratory_Technicians',
 'Ophthalmic_Medical_Technicians',
 'Ophthalmic_Medical_Technologists',
 'Ophthalmologists',
 'Opticians__Dispensing',
 'Optometrists',
 'Oral_and_Maxillofacial_Surgeons',
 'Order_Clerks',
 'Order_Fillers__Wholesale_and_Retail_Sales',
 'Orthodontists',
 'Orthoptists',
 'Orthotists_and_Prosthetists',
 'Outdoor_Power_Equipment_and_Other_Small_Engine_Mechanics',
 'Packaging_and_Filling_Machine_Operators_and_Tenders',
 'Packers_and_Packagers__Hand',
 'Painters__Construction_and_Maintenance',
 'Painters__Transportation_Equipment',
 'Painting__Coating__and_Decorating_Workers',
 'Paper_Goods_Machine_Setters__Operators__and_Tenders',
 'Paperhangers',
 'Paralegals_and_Legal_Assistants',
 'Park_Naturalists',
 'Parking_Enforcement_Workers',
 'Parking_Lot_Attendants',
 'Parts_Salespersons',
 'Pathologists',
 'Patient_Representatives',
 'Patternmakers__Metal_and_Plastic',
 'Patternmakers__Wood',
 'Paving__Surfacing__and_Tamping_Equipment_Operators',
 'Payroll_and_Timekeeping_Clerks',
 'Pediatricians__General',
 'Personal_Care_Aides',
 'Personal_Financial_Advisors',
 'Pest_Control_Workers',
 'Pesticide_Handlers__Sprayers__and_Applicators__Vegetation',
 'Petroleum_Engineers',
 'Petroleum_Pump_System_Operators__Refinery_Operators__and_Gaugers',
 'Pharmacists',
 'Pharmacy_Aides',
 'Pharmacy_Technicians',
 'Philosophy_and_Religion_Teachers__Postsecondary',
 'Phlebotomists',
 'Photographers',
 'Photographic_Process_Workers_and_Processing_Machine_Operators',
 'Photonics_Engineers',
 'Photonics_Technicians',
 'Physical_Medicine_and_Rehabilitation_Physicians',
 'Physical_Therapist_Aides',
 'Physical_Therapist_Assistants',
 'Physical_Therapists',
 'Physician_Assistants',
 'Physicists',
 'Physics_Teachers__Postsecondary',
 'Pile-Driver_Operators',
 'Pilots__Ship',
 'Pipe_Fitters_and_Steamfitters',
 'Pipelayers',
 'Plasterers_and_Stucco_Masons',
 'Plating_and_Coating_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Plumbers',
 'Podiatrists',
 'Poets__Lyricists_and_Creative_Writers',
 'Police_Detectives',
 'Police_Identification_and_Records_Officers',
 'Police_Patrol_Officers',
 'Police__Fire__and_Ambulance_Dispatchers',
 'Political_Science_Teachers__Postsecondary',
 'Political_Scientists',
 'Postal_Service_Clerks',
 'Postal_Service_Mail_Carriers',
 'Postal_Service_Mail_Sorters__Processors__and_Processing_Machine_Operators',
 'Postmasters_and_Mail_Superintendents',
 'Potters__Manufacturing',
 'Pourers_and_Casters__Metal',
 'Power_Distributors_and_Dispatchers',
 'Power_Plant_Operators',
 'Precious_Metal_Workers',
 'Precision_Agriculture_Technicians',
 'Prepress_Technicians_and_Workers',
 'Preschool_Teachers__Except_Special_Education',
 'Pressers__Textile__Garment__and_Related_Materials',
 'Preventive_Medicine_Physicians',
 'Print_Binding_and_Finishing_Workers',
 'Printing_Press_Operators',
 'Private_Detectives_and_Investigators',
 'Probation_Officers_and_Correctional_Treatment_Specialists',
 'Procurement_Clerks',
 'Producers',
 'Product_Safety_Engineers',
 'Production__Planning__and_Expediting_Clerks',
 'Program_Directors',
 'Proofreaders_and_Copy_Markers',
 'Property__Real_Estate__and_Community_Association_Managers',
 'Prosthodontists',
 'Psychiatric_Aides',
 'Psychiatric_Technicians',
 'Psychiatrists',
 'Psychology_Teachers__Postsecondary',
 'Public_Address_System_and_Other_Announcers',
 'Public_Relations_Specialists',
 'Public_Relations_and_Fundraising_Managers',
 'Pump_Operators__Except_Wellhead_Pumpers',
 'Purchasing_Agents__Except_Wholesale__Retail__and_Farm_Products',
 'Purchasing_Managers',
 'Quality_Control_Analysts',
 'Quality_Control_Systems_Managers',
 'Radiation_Therapists',
 'Radio_Mechanics',
 'Radio_Operators',
 'Radio__Cellular__and_Tower_Equipment_Installers_and_Repairers',
 'Radio_and_Television_Announcers',
 'Radiologic_Technicians',
 'Radiologic_Technologists',
 'Radiologists',
 'Rail-Track_Laying_and_Maintenance_Equipment_Operators',
 'Rail_Car_Repairers',
 'Rail_Yard_Engineers__Dinkey_Operators__and_Hostlers',
 'Railroad_Brake__Signal__and_Switch_Operators',
 'Railroad_Conductors_and_Yardmasters',
 'Range_Managers',
 'Real_Estate_Brokers',
 'Real_Estate_Sales_Agents',
 'Receptionists_and_Information_Clerks',
 'Recreation_Workers',
 'Recreation_and_Fitness_Studies_Teachers__Postsecondary',
 'Recreational_Therapists',
 'Recreational_Vehicle_Service_Technicians',
 'Recycling_and_Reclamation_Workers',
 'Refractory_Materials_Repairers__Except_Brickmasons',
 'Refrigeration_Mechanics_and_Installers',
 'Refuse_and_Recyclable_Material_Collectors',
 'Registered_Nurses',
 'Regulatory_Affairs_Managers',
 'Regulatory_Affairs_Specialists',
 'Rehabilitation_Counselors',
 'Reinforcing_Iron_and_Rebar_Workers',
 'Remote_Sensing_Scientists_and_Technologists',
 'Remote_Sensing_Technicians',
 'Reporters_and_Correspondents',
 'Reservation_and_Transportation_Ticket_Agents_and_Travel_Clerks',
 'Residential_Advisors',
 'Respiratory_Therapists',
 'Respiratory_Therapy_Technicians',
 'Retail_Salespersons',
 'Riggers',
 'Risk_Management_Specialists',
 'Robotics_Engineers',
 'Robotics_Technicians',
 'Rock_Splitters__Quarry',
 'Rolling_Machine_Setters__Operators__and_Tenders__Metal_and_Plastic',
 'Roof_Bolters__Mining',
 'Roofers',
 'Rotary_Drill_Operators__Oil_and_Gas',
 'Rough_Carpenters',
 'Roustabouts__Oil_and_Gas',
 'Sailors_and_Marine_Oilers',
 'Sales_Agents__Financial_Services',
 'Sales_Agents__Securities_and_Commodities',
 'Sales_Engineers',
 'Sales_Managers',
 'Sales_Representatives__Wholesale_and_Manufacturing__Except_Technical_and_Scientific_Products',
 'Sales_Representatives__Wholesale_and_Manufacturing__Technical_and_Scientific_Products',
 'Sawing_Machine_Setters__Operators__and_Tenders__Wood',
 'School_Psychologists',
 'Secondary_School_Teachers__Except_Special_and_Career_Technical_Education',
 'Secretaries_and_Administrative_Assistants__Except_Legal__Medical__and_Executive',
 'Security_Guards',
 'Security_Management_Specialists',
 'Security_Managers',
 'Security_and_Fire_Alarm_Systems_Installers',
 'Segmental_Pavers',
 'Self-Enrichment_Education_Teachers',
 'Semiconductor_Processors',
 'Separating__Filtering__Clarifying__Precipitating__and_Still_Machine_Setters__Operators__and_Tenders',
 'Septic_Tank_Servicers_and_Sewer_Pipe_Cleaners',
 'Service_Unit_Operators__Oil__Gas__and_Mining',
 'Set_and_Exhibit_Designers',
 'Sewers__Hand',
 'Sewing_Machine_Operators',
 'Shampooers',
 'Sheet_Metal_Workers',
 'Sheriffs_and_Deputy_Sheriffs',
 'Ship_Engineers',
 'Ship_and_Boat_Captains',
 'Shipping__Receiving__and_Traffic_Clerks',
 'Shoe_Machine_Operators_and_Tenders',
 'Shoe_and_Leather_Workers_and_Repairers',
 'Signal_and_Track_Switch_Repairers',
 'Singers',
 'Skincare_Specialists',
 'Slaughterers_and_Meat_Packers',
 'Slot_Supervisors',
 'Social_Science_Research_Assistants',
 'Social_Work_Teachers__Postsecondary',
 'Social_and_Community_Service_Managers',
 'Social_and_Human_Service_Assistants',
 'Sociologists',
 'Sociology_Teachers__Postsecondary',
 'Software_Developers__Applications',
 'Software_Developers__Systems_Software',
 'Software_Quality_Assurance_Engineers_and_Testers',
 'Soil_and_Plant_Scientists',
 'Soil_and_Water_Conservationists',
 'Solar_Photovoltaic_Installers',
 'Solar_Sales_Representatives_and_Assessors',
 'Solderers_and_Brazers',
 'Sound_Engineering_Technicians',
 'Spa_Managers',
 'Special_Education_Teachers__Middle_School',
 'Special_Education_Teachers__Secondary_School',
 'Speech-Language_Pathologists',
 'Speech-Language_Pathology_Assistants',
 'Sports_Medicine_Physicians',
 'Statement_Clerks',
 'Stationary_Engineers_and_Boiler_Operators',
 'Statistical_Assistants',
 'Statisticians',
 'Stock_Clerks-_Stockroom__Warehouse__or_Storage_Yard',
 'Stock_Clerks__Sales_Floor',
 'Stone_Cutters_and_Carvers__Manufacturing',
 'Stonemasons',
 'Storage_and_Distribution_Managers',
 'Structural_Iron_and_Steel_Workers',
 'Structural_Metal_Fabricators_and_Fitters',
 'Substance_Abuse_and_Behavioral_Disorder_Counselors',
 'Subway_and_Streetcar_Operators',
 'Supply_Chain_Managers',
 'Surgeons',
 'Surgical_Technologists',
 'Survey_Researchers',
 'Surveying_Technicians',
 'Surveyors',
 'Sustainability_Specialists',
 'Switchboard_Operators__Including_Answering_Service',
 'Tailors__Dressmakers__and_Custom_Sewers',
 'Talent_Directors',
 'Tank_Car__Truck__and_Ship_Loaders',
 'Tapers',
 'Tax_Examiners_and_Collectors__and_Revenue_Agents',
 'Tax_Preparers',
 'Taxi_Drivers_and_Chauffeurs',
 'Teacher_Assistants',
 'Team_Assemblers',
 'Technical_Directors_Managers',
 'Technical_Writers',
 'Telecommunications_Engineering_Specialists',
 'Telecommunications_Equipment_Installers_and_Repairers__Except_Line_Installers',
 'Telecommunications_Line_Installers_and_Repairers',
 'Telemarketers',
 'Telephone_Operators',
 'Tellers',
 'Terrazzo_Workers_and_Finishers',
 'Textile_Bleaching_and_Dyeing_Machine_Operators_and_Tenders',
 'Textile_Cutting_Machine_Setters__Operators__and_Tenders',
 'Textile_Knitting_and_Weaving_Machine_Setters__Operators__and_Tenders',
 'Textile_Winding__Twisting__and_Drawing_Out_Machine_Setters__Operators__and_Tenders',
 'Tile_and_Marble_Setters',
 'Timing_Device_Assemblers_and_Adjusters',
 'Tire_Builders',
 'Tire_Repairers_and_Changers',
 'Title_Examiners__Abstractors__and_Searchers',
 'Tool_Grinders__Filers__and_Sharpeners',
 'Tool_and_Die_Makers',
 'Tour_Guides_and_Escorts',
 'Traffic_Technicians',
 'Training_and_Development_Managers',
 'Training_and_Development_Specialists',
 'Transit_and_Railroad_Police',
 'Transportation_Attendants__Except_Flight_Attendants',
 'Transportation_Engineers',
 'Transportation_Managers',
 'Transportation_Planners',
 'Transportation_Security_Screeners',
 'Transportation_Vehicle__Equipment_and_Systems_Inspectors__Except_Aviation',
 'Travel_Agents',
 'Travel_Guides',
 'Treasurers_and_Controllers',
 'Tree_Trimmers_and_Pruners',
 'Umpires__Referees__and_Other_Sports_Officials',
 'Upholsterers',
 'Urban_and_Regional_Planners',
 'Urologists',
 'Ushers__Lobby_Attendants__and_Ticket_Takers',
 'Validation_Engineers',
 'Veterinarians',
 'Veterinary_Assistants_and_Laboratory_Animal_Caretakers',
 'Veterinary_Technologists_and_Technicians',
 'Video_Game_Designers',
 'Vocational_Education_Teachers__Postsecondary',
 'Waiters_and_Waitresses',
 'Watch_Repairers',
 'Water_Resource_Specialists',
 'Water_Wastewater_Engineers',
 'Water_and_Wastewater_Treatment_Plant_and_System_Operators',
 'Weatherization_Installers_and_Technicians',
 'Web_Administrators',
 'Web_Developers',
 'Weighers__Measurers__Checkers__and_Samplers__Recordkeeping',
 'Welders__Cutters__and_Welder_Fitters',
 'Welding__Soldering__and_Brazing_Machine_Setters__Operators__and_Tenders',
 'Wellhead_Pumpers',
 'Wholesale_and_Retail_Buyers__Except_Farm_Products',
 'Wind_Energy_Engineers',
 'Wind_Turbine_Service_Technicians',
 'Woodworking_Machine_Setters__Operators__and_Tenders__Except_Sawing',
 'Word_Processors_and_Typists',
 'Zoologists_and_Wildlife_Biologists'}

In [106]:
#function to find out what are the most important features for a particular career
#this picks all features that are greater than 0.8, or the top 20 if there are more than 20
#we should remove the job_zone and Work_Context_Time domains, because 1 doesn't mean importance for those categories
def important_features(occ_title):
    selected_occ_series = normed_df_no_na[normed_df_no_na.Occupation.title == occ_title].T.drop('Occupation',level=0)
    selected_occ_series.columns = [occ_title] 
    selected_occ_series.drop(['Job_Zones','Work_Context_Time'], level=0, inplace=True)
    important_features = selected_occ_series[selected_occ_series[occ_title] >= 0.8]
    if len(important_features) > 20:
        important_features = selected_occ_series.sort(occ_title, ascending=False).head(20)
    elif len(important_features) < 10:
        important_features = selected_occ_series.sort(occ_title, ascending=False).head(10)
    important_features = important_features.sort(axis=0)
    print 'The most important attributes for ' + occ_title.replace('_', ' ') + ' are:'
    print_domains_features(important_features)
    return important_features
important = important_features('Business_Intelligence_Analysts')


The most important attributes for Business Intelligence Analysts are:
   Skills
      Active Learning
   Work Activities
      Analyzing Data or Information
      Communicating with Supervisors, Peers, or Subordinates
      Establishing and Maintaining Interpersonal Relationships
      Getting Information
      Interacting With Computers
      Interpreting the Meaning of Information for Others
      Processing Information
      Provide Consultation and Advice to Others
      Updating and Using Relevant Knowledge
   Work Context
      Electronic Mail
      Face-to-Face Discussions
      Indoors, Environmentally Controlled
      Spend Time Sitting
      Telephone
      Work With Work Group or Team
   Work Styles
      Analytical Thinking
      Attention to Detail
      Integrity
      Persistence

In [107]:
#function to find out what are the least important features for a particular career
#this picks all features that are equal to 0
def irrelevant_features(occ_title):
    selected_occ_series = normed_df_no_na[normed_df_no_na.Occupation.title == occ_title].T.drop('Occupation',level=0)
    selected_occ_series.columns = [occ_title]  
    selected_occ_series.drop(['Job_Zones','Work_Context_Time'], level=0, inplace=True)
    irrelevant_features = selected_occ_series[selected_occ_series[occ_title] == 0]
    print 'The least important attributes for ' + occ_title + ' are:'
    
    
    for domain in irrelevant_features.index.get_level_values('domain').unique():
        sub_irrelevant_features = irrelevant_features.xs(domain, level=0)
        print '   ' + domain
        for feature in sub_irrelevant_features.index:
            print '      ' + feature
    print ''
    return irrelevant_features
irrelevant = irrelevant_features('Business_Intelligence_Analysts')


The least important attributes for Business_Intelligence_Analysts are:
   Abilities
      Arm-Hand_Steadiness
      Dynamic_Flexibility
      Dynamic_Strength
      Explosive_Strength
      Extent_Flexibility
      Glare_Sensitivity
      Gross_Body_Coordination
      Gross_Body_Equilibrium
      Manual_Dexterity
      Multilimb_Coordination
      Peripheral_Vision
      Rate_Control
      Reaction_Time
      Response_Orientation
      Sound_Localization
      Spatial_Orientation
      Speed_of_Limb_Movement
      Stamina
      Static_Strength
      Wrist-Finger_Speed
   Skills
      Repairing
   Work_Context
      Cramped_Work_Space,_Awkward_Positions
      Exposed_to_Hazardous_Equipment
      Exposed_to_High_Places
      Exposed_to_Radiation
      Exposed_to_Whole_Body_Vibration
      Extremely_Bright_or_Inadequate_Lighting
      In_an_Open_Vehicle_or_Equipment
      Pace_Determined_by_Speed_of_Equipment
      Spend_Time_Climbing_Ladders,_Scaffolds,_or_Poles
      Spend_Time_Keeping_or_Regaining_Balance
      Spend_Time_Kneeling,_Crouching,_Stooping,_or_Crawling
      Wear_Specialized_Protective_or_Safety_Equipment_such_as_Breathing_Apparatus,_Safety_Harness,_Full_Protection_Suits,_or_Radiation_Protection


In [108]:
#function to find 20 related careers to an input career
def closest_occs(occ_title, similar=20):
    closest_df = DataFrame()
    distance_series = euclid_dist_df_normed.xs(occ_title).order().head(similar)
    selected_occ_series = normed_df_no_na[normed_df_no_na.Occupation.title == occ_title].T.drop('Occupation',level=0)
    selected_occ_series.columns = [occ_title]        
    closest_occs_frame = selected_occ_series
    
    for comparison in distance_series.index[1:]:
        compare_occ_series = normed_df_no_na[normed_df_no_na.Occupation.title == comparison].T.drop('Occupation',level=0)
        closest_occs_frame[comparison] = compare_occ_series.iloc[:,0]
#         closest_feature_frame['difference'] = abs(closest_feature_frame[occ_title] - closest_feature_frame[comparison])
    print "The closest occupations to " + occ_title.replace("_", " ") + " are:"
    for x in closest_occs_frame.columns[1:]:
        print '   ' +x.replace("_", " ")
    print '\nSee the table below to see how similar the careers are.'
    
    return closest_occs_frame.sort(occ_title, ascending=False)

closest_occs = closest_occs('Business_Intelligence_Analysts')
closest_occs


The closest occupations to Business Intelligence Analysts are:
   Market Research Analysts and Marketing Specialists
   Risk Management Specialists
   Survey Researchers
   Climate Change Analysts
   Auditors
   Actuaries
   Information Technology Project Managers
   Financial Analysts
   Operations Research Analysts
   Intelligence Analysts
   Regulatory Affairs Specialists
   Business Continuity Planners
   Cost Estimators
   Logistics Analysts
   Personal Financial Advisors
   Document Management Specialists
   Treasurers and Controllers
   Sustainability Specialists
   Marketing Managers

See the table below to see how similar the careers are.
Out[108]:
Business_Intelligence_Analysts Market_Research_Analysts_and_Marketing_Specialists Risk_Management_Specialists Survey_Researchers Climate_Change_Analysts Auditors Actuaries Information_Technology_Project_Managers Financial_Analysts Operations_Research_Analysts Intelligence_Analysts Regulatory_Affairs_Specialists Business_Continuity_Planners Cost_Estimators Logistics_Analysts Personal_Financial_Advisors Document_Management_Specialists Treasurers_and_Controllers Sustainability_Specialists Marketing_Managers
domain element_name
Work_Context Electronic_Mail 1 0.98 1 1 1 0.98 1 0.9925 0.94 1 1 0.9925 1 0.9375 1 0.9375 1 0.9925 0.99 1
Work_Styles Analytical_Thinking 0.9616519 0.9026549 0.8908555 0.8967552 0.9056047 0.9056047 0.9557522 0.7433628 0.8731563 0.9557522 0.9734513 0.8171091 0.6843658 0.8525074 0.8348083 0.79941 0.79941 0.8908555 0.7787611 0.6578171
Work_Context Spend_Time_Sitting 0.9497487 0.9070352 0.9145729 0.8994975 0.9673367 0.9321608 0.9497487 0.8165829 0.8090452 0.8969849 0.9170854 0.8643216 0.7839196 0.8668342 0.8467337 0.8869347 0.7713568 0.919598 0.8417085 0.8115578
Work_Activities Updating_and_Using_Relevant_Knowledge 0.9361022 0.8051118 0.715655 0.8530351 0.8690096 0.8498403 0.8178914 0.7316294 0.8370607 0.8051118 0.8722045 0.8753994 0.7635783 0.7795527 0.7220447 0.8210863 0.7220447 0.7444089 0.8466454 0.658147
Work_Context_Time Duration_of_Typical_Work_Week 0.9226804 0.7319588 0.9381443 0.8298969 0.9329897 0.9072165 0.8247423 0.8762887 0.8247423 0.8402062 0.6391753 0.7938144 0.8917526 0.8247423 0.7628866 0.7319588 0.814433 0.8917526 0.7628866 0.8453608
Work_Activities Communicating_with_Supervisors,_Peers,_or_Subordinates 0.9225352 0.8626761 0.8380282 0.8697183 0.8133803 0.8415493 0.8309859 0.8908451 0.8591549 0.7394366 0.8732394 0.8204225 0.8943662 0.8485915 0.7816901 0.6478873 0.7464789 0.8169014 0.8380282 0.8908451
Analyzing_Data_or_Information 0.9068493 0.9863014 0.8767123 0.9205479 0.8958904 0.8575342 1 0.7945205 0.9890411 0.9506849 0.9808219 0.6767123 0.8493151 0.8849315 0.8547945 0.8684932 0.7260274 0.8958904 0.6712329 0.6273973
Interpreting_the_Meaning_of_Information_for_Others 0.8983051 0.8785311 0.680791 0.8418079 0.8474576 0.6694915 0.8813559 0.6892655 0.7711864 0.8107345 0.8389831 0.7514124 0.7062147 0.6864407 0.5734463 0.6723164 0.6242938 0.7175141 0.7768362 0.5621469
Work_Context Telephone 0.8981233 0.9544236 0.9517426 0.9705094 0.9410188 0.9892761 0.9517426 0.9383378 0.924933 0.8418231 0.9571046 0.9276139 0.9544236 0.9436997 0.9678284 0.9115282 0.9678284 0.9919571 0.9436997 1
Work_Activities Establishing_and_Maintaining_Interpersonal_Relationships 0.8972603 0.8424658 0.6746575 0.7739726 0.7842466 0.6849315 0.6267123 0.7431507 0.8253425 0.5787671 0.7876712 0.7294521 0.8013699 0.6369863 0.6472603 0.8630137 0.7123288 0.6746575 0.8219178 0.8356164
Getting_Information 0.8884462 0.9203187 0.8326693 0.7808765 0.8884462 0.9083665 0.9203187 0.7848606 0.9003984 0.8406375 0.9163347 0.812749 0.8406375 0.9043825 0.7729084 0.8605578 0.7330677 0.8167331 0.7768924 0.812749
Work_Context Face-to-Face_Discussions 0.8816327 0.7265306 0.9102041 0.8489796 0.8285714 0.9346939 0.9142857 0.7755102 0.9755102 0.8204082 0.9510204 0.8530612 0.7428571 0.8244898 0.8653061 0.6979592 0.8040816 0.9714286 0.6204082 0.9346939
Work_Activities Provide_Consultation_and_Advice_to_Others 0.8797468 0.6708861 0.7246835 0.7721519 0.7626582 0.7531646 0.806962 0.6613924 0.7911392 0.8670886 0.6677215 0.7246835 0.8987342 0.7183544 0.7341772 0.7974684 0.8417722 0.7911392 0.8417722 0.6392405
Interacting_With_Computers 0.8636364 0.9166667 0.8510101 0.8838384 0.8207071 0.8383838 0.9393939 0.8686869 0.9191919 0.9343434 0.9040404 0.8611111 0.8560606 0.9267677 0.8434343 0.8737374 0.9393939 0.8055556 0.7828283 0.7676768
Work_Styles Persistence 0.8515625 0.7695313 0.65625 0.6835938 0.7539062 0.734375 0.7265625 0.7929687 0.890625 0.7421875 0.703125 0.765625 0.75 0.671875 0.6875 0.796875 0.671875 0.7460938 0.6835938 0.765625
Work_Context Work_With_Work_Group_or_Team 0.8511236 0.7106742 0.7893258 0.8370787 0.8230337 0.7893258 0.7219101 0.8848315 0.7752809 0.7640449 0.7837079 0.8398876 0.8342697 0.8314607 0.8033708 0.6404494 0.8679775 0.8258427 0.8932584 0.8455056
Skills Active_Learning 0.8438819 0.5780591 0.6329114 0.6877637 0.7890295 0.7383966 0.6877637 0.5780591 0.6877637 0.8438819 0.6329114 0.6877637 0.7383966 0.5780591 0.5780591 0.6329114 0.6877637 0.8438819 0.6877637 0.7383966
Work_Styles Integrity 0.8304348 0.873913 0.8391304 0.8478261 0.8434783 0.9826087 0.8695652 0.6913043 0.826087 0.6304348 0.9173913 0.973913 0.7913043 0.7521739 0.8086957 0.9521739 0.7217391 0.9565217 0.7347826 0.7391304
Attention_to_Detail 0.8289963 0.8587361 0.8698885 0.929368 0.7472119 0.8959108 0.866171 0.7472119 0.9851301 0.7806691 0.9553903 0.9405204 0.7695167 0.9219331 0.8364312 0.7881041 0.8513011 0.8141264 0.7472119 0.8066914
Work_Activities Processing_Information 0.8248588 0.8361582 0.7118644 0.8418079 0.7288136 0.819209 0.8983051 0.7627119 0.9067797 0.8022599 0.8870056 0.7768362 0.7372881 0.7909605 0.7514124 0.7768362 0.7288136 0.8531073 0.6949153 0.5367232
Work_Context Indoors,_Environmentally_Controlled 0.8207071 0.8535354 0.8964646 0.8510101 0.9368687 0.8787879 0.9318182 0.8459596 0.9318182 0.8510101 0.9040404 0.7929293 0.790404 0.9646465 0.7676768 0.9368687 0.9090909 1 0.9015152 0.7777778
Work_Activities Organizing,_Planning,_and_Prioritizing_Work 0.814433 0.8178694 0.6391753 0.7938144 0.790378 0.7972509 0.6494845 0.9656357 0.8762887 0.6632302 0.7731959 0.8453608 0.8006873 0.7044674 0.7147766 0.814433 0.6872852 0.7766323 0.6907216 0.7972509
Work_Styles Adaptability/Flexibility 0.8032129 0.6506024 0.6506024 0.7228916 0.562249 0.5863454 0.5381526 0.7028112 0.8634538 0.5461847 0.7228916 0.7991968 0.7871486 0.562249 0.6827309 0.5783133 0.7148594 0.6746988 0.6626506 0.7309237
Skills Systems_Analysis 0.8012821 0.6794872 0.8814103 0.6794872 0.7628205 0.7211538 0.9230769 0.8012821 0.6794872 0.9230769 0.5192308 0.8397436 0.8814103 0.7211538 0.8397436 0.7211538 0.8012821 0.7628205 0.6794872 0.6794872
Work_Activities Identifying_Objects,_Actions,_and_Events 0.7836066 0.7967213 0.7180328 0.5803279 0.6163934 0.7672131 0.6163934 0.7409836 0.504918 0.6655738 0.8622951 0.6688525 0.7442623 0.5803279 0.6622951 0.747541 0.6229508 0.6131148 0.6655738 0.6360656
Interests Investigative 0.7783333 0.945 0.6116667 0.945 0.8333333 0.445 0.555 0.3333333 0.7783333 0.8883333 0.8883333 0.1666667 0.555 0.2783333 0.3883333 0.3333333 0.3333333 0.2783333 0.555 0.2216667
Work_Values Achievement 0.7783333 0.6116667 0.6666667 0.6116667 0.8333333 0.6116667 0.6116667 0.8333333 0.7783333 0.8333333 0.8333333 0.555 0.8333333 0.5 0.6666667 0.6666667 0.6116667 0.7783333 0.7216667 0.8333333
Work_Context Structured_versus_Unstructured_Work 0.7713311 0.7713311 0.7508532 0.6962457 0.8054608 0.5904437 0.7098976 0.7133106 0.9317406 0.7337884 0.7337884 0.7406143 0.7952218 0.6109215 0.7713311 0.6211604 0.7269625 0.7952218 0.78157 0.7679181
Work_Values Working_Conditions 0.766 0.566 0.534 0.434 0.634 0.534 0.7 0.734 0.6 0.734 0.734 0.634 0.734 0.6 0.566 0.634 0.534 0.866 0.734 0.934
Skills Systems_Evaluation 0.7628205 0.6410256 0.8397436 0.6794872 0.8012821 0.7211538 0.9230769 0.7211538 0.6794872 0.9615385 0.5192308 0.7628205 0.8012821 0.6025641 0.8397436 0.7211538 0.7211538 0.7628205 0.8012821 0.7211538
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Abilities Sound_Localization 0 0 0 0 0 0.105042 0 0 0 0 0.05042017 0.105042 0 0 0 0 0 0 0 0
Wrist-Finger_Speed 0 0.05042017 0.05042017 0 0 0.105042 0.05042017 0 0.05462185 0 0.05042017 0.05042017 0.05042017 0.05042017 0.105042 0.05042017 0.05042017 0 0 0.05042017
Response_Orientation 0 0 0 0 0 0 0 0 0 0 0 0.03314917 0 0.03314917 0.03314917 0 0 0 0 0.03314917
Work_Context Cramped_Work_Space,_Awkward_Positions 0 0.03076923 0.01282051 0.01794872 0.02820513 0.06153846 0 0.08205128 0.04102564 0.02820513 0.08717949 0.03846154 0.07692308 0.07435897 0.1538462 0 0.225641 0.01794872 0.1384615 0.05384615
Abilities Reaction_Time 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03833866 0.03833866 0 0.03833866 0 0 0
Work_Context Exposed_to_Hazardous_Equipment 0 0 0.0225 0 0.01 0.01 0 0.04 0 0.0475 0.0075 0.0375 0.025 0.0625 0.15 0 0.06 0 0.0725 0.03
Exposed_to_High_Places 0 0 0.03768844 0 0.01005025 0.05276382 0 0.07286432 0.005025126 0 0.01507538 0 0.02512563 0.09045226 0.1005025 0 0.07286432 0 0.1356784 0.04271357
Exposed_to_Radiation 0 0 0.01 0 0 0 0 0.015 0 0 0.015 0.0075 0 0 0.05 0 0.02 0 0.0175 0
Exposed_to_Whole_Body_Vibration 0 0 0.005434783 0 0.01086957 0 0 0.02717391 0 0 0.008152174 0 0.008152174 0 0.09782609 0 0 0 0.01086957 0
Extremely_Bright_or_Inadequate_Lighting 0 0.06443299 0.02835052 0.03865979 0.04639175 0.1443299 0.01546392 0.1159794 0.06185567 0.03092784 0.1520619 0.05412371 0.1030928 0.07474227 0.128866 0 0.1237113 0.05927835 0.128866 0.05154639
Abilities Rate_Control 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03846154 0 0.08012821 0 0 0
Peripheral_Vision 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Work_Context In_an_Open_Vehicle_or_Equipment 0 0 0.03457447 0.01861702 0.01861702 0.0106383 0 0.05053191 0 0.0106383 0.03191489 0.01595745 0.02659574 0 0.1808511 0.0106383 0.04255319 0.007978723 0.0106383 0.0212766
Pace_Determined_by_Speed_of_Equipment 0 0.03191489 0.02925532 0.05053191 0 0.0212766 0 0.1382979 0.0106383 0.0106383 0.05053191 0.007978723 0.02659574 0.1037234 0.106383 0.01861702 0.2021277 0.06117021 0.02925532 0.05319149
Skills Repairing 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Work_Context Spend_Time_Climbing_Ladders,_Scaffolds,_or_Poles 0 0 0.01510574 0 0 0.03625378 0 0.03021148 0.006042296 0 0.03021148 0 0.009063444 0.06344411 0.06042296 0 0.02416918 0 0.1450151 0.01208459
Spend_Time_Keeping_or_Regaining_Balance 0 0.02416918 0 0 0.02114804 0 0 0.009063444 0.02719033 0 0.02719033 0.01812689 0.03021148 0.03323263 0.1329305 0.02114804 0.04833837 0.02114804 0.1299094 0
Spend_Time_Kneeling,_Crouching,_Stooping,_or_Crawling 0 0.05737705 0.04371585 0.01092896 0.01912568 0.04371585 0 0.07377049 0.1038251 0 0.07103825 0.04918033 0.05737705 0.06010929 0.09836066 0.01912568 0.1202186 0.04644809 0.1256831 0.01092896
Abilities Multilimb_Coordination 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03428571 0.03428571 0 0 0 0 0
Work_Context Wear_Specialized_Protective_or_Safety_Equipment_such_as_Breathing_Apparatus,_Safety_Harness,_Full_Protection_Suits,_or_Radiation_Protection 0 0 0.01069519 0 0.01069519 0 0 0.02673797 0 0 0.01604278 0.04010695 0.05614973 0.02941176 0.144385 0 0 0.00802139 0.02941176 0.03208556
Abilities Manual_Dexterity 0 0 0 0.2707692 0 0 0.03692308 0 0 0 0.1907692 0.07692308 0.03692308 0.03692308 0.03692308 0 0.3446154 0 0 0
Gross_Body_Equilibrium 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03692308 0 0 0 0 0 0
Gross_Body_Coordination 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03092784 0 0 0 0 0 0
Glare_Sensitivity 0 0 0 0 0 0 0 0 0 0 0.04166667 0 0.04166667 0.04166667 0 0 0 0 0 0.04166667
Extent_Flexibility 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03428571 0 0 0
Explosive_Strength 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Dynamic_Strength 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Dynamic_Flexibility 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Arm-Hand_Steadiness 0 0.03428571 0.03428571 0.2514286 0 0.03428571 0.03428571 0 0 0 0.1771429 0.03428571 0.03428571 0.07142857 0.07142857 0 0.3571429 0 0 0.03428571
Stamina 0 0 0 0 0 0 0 0 0 0 0 0 0 0.03428571 0 0 0.03428571 0 0 0

247 rows × 20 columns


In [109]:
#function to compare two occupations
def compare_occs(occ1, occ2):
    occ1_series = normed_df_no_na[normed_df_no_na.Occupation.title == occ1].T.drop('Occupation',level=0)
    occ1_series.columns = [occ1]
    compare_frame = occ1_series
    occ2_series = normed_df_no_na[normed_df_no_na.Occupation.title == occ2].T.drop('Occupation',level=0)
    compare_frame[occ2] = occ2_series.iloc[:,0]
    compare_frame.drop(['Job_Zones','Work_Context_Time'], level=0, inplace=True)
    compare_frame['difference'] = compare_frame[occ1] - compare_frame[occ2]
    compare_frame['abs_difference'] = abs(compare_frame['difference'])
    
    #get the features with less than 0.1 distance between them that are above 0.8 for at least one of the occs
    shared_high = compare_frame[((compare_frame[occ1] >= 0.8) | (compare_frame[occ2] >= 0.8)) & (compare_frame.abs_difference < 0.1)]
    print '\nBoth ' + occ1.replace("_", " ") + ' and ' + occ2.replace("_", " ") + ' require similarly high degrees of:'
    if len(shared_high) == 0:
        print '   Nothing in common!'
    else:
        print_domains_features(shared_high)
    
    #get the features that are more important for occ1 than occ2
    more_important = compare_frame.sort('difference', ascending=False).head(10)
    print '\n' + occ1.replace("_", " ") + ' require higher degrees of the following attributes than ' + occ2.replace("_", " ") + ':'
    print_domains_features(more_important)
    
    #get the features that are more important for occ2 than occ1
    less_important = compare_frame.sort('difference', ascending=True).head(10)
    print '\n' + occ2.replace("_", " ") + ' require higher degrees of the following attributes than ' + occ1.replace("_", " ") + ':'
    print_domains_features(less_important)

    #get the feature neither occ needs
    not_needed = compare_frame[(compare_frame[occ1] ==0) & (compare_frame[occ2] == 0)]
    print '\nNeither ' + occ1.replace("_"," ") + ' nor ' + occ2.replace("_"," ") + ' require:'
    print_domains_features(not_needed)
    print ''
    
#     return more_important

compare_occs('Business_Intelligence_Analysts','Information_Technology_Project_Managers')


Both Business Intelligence Analysts and Information Technology Project Managers require similarly high degrees of:
   Skills
      Systems Analysis
   Work Activities
      Communicating with Supervisors, Peers, or Subordinates
      Interacting With Computers
      Processing Information
   Work Context
      Electronic Mail
      Indoors, Environmentally Controlled
      Telephone
      Work With Work Group or Team
   Work Styles
      Attention to Detail
      Persistence
   Work Values
      Achievement

Business Intelligence Analysts require higher degrees of the following attributes than Information Technology Project Managers:
   Interests
      Investigative
   Knowledge
      Sales and Marketing
   Skills
      Active Learning
      Programming
   Work Styles
      Independence
      Cooperation
      Analytical Thinking
   Work Activities
      Provide Consultation and Advice to Others
      Interpreting the Meaning of Information for Others
      Updating and Using Relevant Knowledge

Information Technology Project Managers require higher degrees of the following attributes than Business Intelligence Analysts:
   Work Activities
      Resolving Conflicts and Negotiating with Others
      Monitoring and Controlling Resources
      Coordinating the Work and Activities of Others
      Scheduling Work and Activities
   Work Values
      Independence
   Knowledge
      Design
      Personnel and Human Resources
   Work Context
      Deal With Unpleasant or Angry People
   Skills
      Management of Material Resources
      Monitoring

Neither Business Intelligence Analysts nor Information Technology Project Managers require:
   Abilities
      Arm-Hand Steadiness
      Dynamic Flexibility
      Dynamic Strength
      Explosive Strength
      Extent Flexibility
      Glare Sensitivity
      Gross Body Coordination
      Gross Body Equilibrium
      Manual Dexterity
      Multilimb Coordination
      Peripheral Vision
      Rate Control
      Reaction Time
      Response Orientation
      Sound Localization
      Spatial Orientation
      Speed of Limb Movement
      Stamina
      Static Strength
      Wrist-Finger Speed
   Skills
      Repairing