Unemployment, Experience & Higher Education

As time passes and the world changes, different fields become more popular than others. For instance, tech industries now hold a huge portion of the economy, yet a few decades ago computers had not even been invented. As civilization developed, fields evolved and developed along with it. From another perspective, fields differ in the expertise they demand: while some look for graduate-level education, others look for years of experience. Let's analyze a dataset on college majors and unemployment rates and see whether we can justify some of these relations.

The dataset used in this analysis is courtesy of FiveThirtyEight; it is publicly available and can be found on their GitHub page.

Structure of the Dataset

We will be looking at four CSV files. The first three contain major and employment information; the last is a lookup table of majors:

  • all-ages.csv - Overall data from all ages
  • grad-students.csv - Graduate degree holders; also includes columns for non-graduates
  • recent-grads.csv - Recent graduates; also includes gender-related columns
  • majors-list.csv - List of all majors and their codes.

The columns are documented on the GitHub page mentioned above and will be examined during the analysis.
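If you prefer not to download the files manually, pandas can also read them directly from FiveThirtyEight's repository. The snippet below is an optional sketch; the raw URL assumes the files are still hosted under the college-majors directory of the fivethirtyeight/data repository.

# Optional sketch: load the CSVs straight from FiveThirtyEight's GitHub repository
# (the base URL is an assumption about where the files currently live)
import pandas as pd

base_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/"
all_ages = pd.read_csv(base_url + "all-ages.csv")
grad_students = pd.read_csv(base_url + "grad-students.csv")
recent_grads = pd.read_csv(base_url + "recent-grads.csv")
majors = pd.read_csv(base_url + "majors-list.csv")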

Questions

Since we can compare recent-graduate statistics with both the overall data and the overall graduate data, we can pose different questions about how experience and education level relate to unemployment:

  • Which fields require more experience in the field for employment?
  • Which fields require higher education for employment?

Setting up Data


In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline

# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

In [2]:
all_ages = pd.read_csv("all-ages.csv")
grad_students = pd.read_csv("grad-students.csv")
majors = pd.read_csv("majors-list.csv")
recent_grads = pd.read_csv("recent-grads.csv")

Let's take a look at the first few rows of the datasets:


In [3]:
all_ages.head(3)


Out[3]:
Major_code Major Major_category Total Employed Employed_full_time_year_round Unemployed Unemployment_rate Median P25th P75th
0 1100 GENERAL AGRICULTURE Agriculture & Natural Resources 128148 90245 74078 2423 0.026147 50000 34000 80000.0
1 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources 95326 76865 64240 2266 0.028636 54000 36000 80000.0
2 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources 33955 26321 22810 821 0.030248 63000 40000 98000.0

In [4]:
grad_students.head(3)


Out[4]:
Major_code Major Major_category Grad_total Grad_sample_size Grad_employed Grad_full_time_year_round Grad_unemployed Grad_unemployment_rate Grad_median ... Nongrad_total Nongrad_employed Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate Nongrad_median Nongrad_P25 Nongrad_P75 Grad_share Grad_premium
0 5601 CONSTRUCTION SERVICES Industrial Arts & Consumer Services 9173 200 7098 6511 681 0.087543 75000.0 ... 86062 73607 62435 3928 0.050661 65000.0 47000 98000.0 0.096320 0.153846
1 6004 COMMERCIAL ART AND GRAPHIC DESIGN Arts 53864 882 40492 29553 2482 0.057756 60000.0 ... 461977 347166 250596 25484 0.068386 48000.0 34000 71000.0 0.104420 0.250000
2 6211 HOSPITALITY MANAGEMENT Business 24417 437 18368 14784 1465 0.073867 65000.0 ... 179335 145597 113579 7409 0.048423 50000.0 35000 75000.0 0.119837 0.300000

3 rows × 22 columns


In [5]:
print(grad_students.columns)


Index(['Major_code', 'Major', 'Major_category', 'Grad_total',
       'Grad_sample_size', 'Grad_employed', 'Grad_full_time_year_round',
       'Grad_unemployed', 'Grad_unemployment_rate', 'Grad_median', 'Grad_P25',
       'Grad_P75', 'Nongrad_total', 'Nongrad_employed',
       'Nongrad_full_time_year_round', 'Nongrad_unemployed',
       'Nongrad_unemployment_rate', 'Nongrad_median', 'Nongrad_P25',
       'Nongrad_P75', 'Grad_share', 'Grad_premium'],
      dtype='object')

In [6]:
majors.head(3)


Out[6]:
FOD1P Major Major_Category
0 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
1 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
2 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources

In [7]:
recent_grads.head(3)


Out[7]:
Rank Major_code Major Major_category Total Sample_size Men Women ShareWomen Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING Engineering 2339 36 2057 282 0.120564 1976 ... 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING Engineering 756 7 679 77 0.101852 640 ... 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING Engineering 856 3 725 131 0.153037 648 ... 133 340 16 0.024096 73000 50000 105000 456 176 0

3 rows × 21 columns


In [8]:
recent_grads.columns


Out[8]:
Index(['Rank', 'Major_code', 'Major', 'Major_category', 'Total', 'Sample_size',
       'Men', 'Women', 'ShareWomen', 'Employed', 'Full_time', 'Part_time',
       'Full_time_year_round', 'Unemployed', 'Unemployment_rate', 'Median',
       'P25th', 'P75th', 'College_jobs', 'Non_college_jobs', 'Low_wage_jobs'],
      dtype='object')

Question 1 - "Which fields require more experience in the field for employment?"

To answer this question, we can compare the unemployment rates of recent graduates against the unemployment rates of all graduates.

Let's call experience rate = (recent graduate unemployment rate) / (overall graduate unemployment rate)

It's true that, in general, employers prefer to hire more experienced people, but if a field's experience rate is much higher or lower than the others', we can draw some conclusions.


In [9]:
# get a list of all majors
all_majors = all_ages["Major"].values

# create a dictionary for "experience rate"
employment_increase = {}

# loop over all majors
for major in all_majors:
    # get unemployment rates from both datasets 
    if major in grad_students["Major"].values and major in recent_grads["Major"].values:
        grad_rate = grad_students[grad_students["Major"]==major]["Grad_unemployment_rate"].values[0]
        recent_rate = recent_grads[recent_grads["Major"]==major]["Unemployment_rate"].values[0]
        
        # find "experience rate"
        rate = recent_rate/grad_rate
        # place it into dictionary
        employment_increase[major] = rate
list(employment_increase.items())[:10]


/Users/ardbsrn/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:15: RuntimeWarning: divide by zero encountered in double_scalars
Out[9]:
[('MISCELLANEOUS HEALTH MEDICAL PROFESSIONS', 2.9815019705407519),
 ('FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES',
  1.9587923681364077),
 ('LIBERAL ARTS', 1.8585901147462398),
 ('ACTUARIAL SCIENCE', 1.2883521216778024),
 ('MISCELLANEOUS ENGINEERING TECHNOLOGIES', 1.6574805982708147),
 ('CONSTRUCTION SERVICES', 0.68563764485874468),
 ('MICROBIOLOGY', 2.0764231918780278),
 ('ELEMENTARY EDUCATION', 2.2877748756033829),
 ('CHEMICAL ENGINEERING', 1.5838820614361864),
 ('EDUCATIONAL ADMINISTRATION AND SUPERVISION', 0.0)]
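As an aside, the same experience rates can be computed without an explicit loop by merging the two DataFrames on the Major column. The snippet below is just a sketch of an equivalent, more idiomatic pandas approach; it uses only the columns shown above and produces the same ratios.

# Sketch: compute the experience rate via a merge instead of a loop
merged = recent_grads.merge(
    grad_students[["Major", "Grad_unemployment_rate"]], on="Major"
)
merged["Rate"] = merged["Unemployment_rate"] / merged["Grad_unemployment_rate"]
experience_rates = merged.set_index("Major")["Rate"]
experience_rates.head(10)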

In [10]:
# create a new dataframe from dictionary
df = pd.DataFrame.from_dict(employment_increase,orient="index")

# rename the column
df.columns = ["Rate"]

# sort dataframe by rates
df = df.sort_values(by=["Rate"], ascending = False)

#print out first and last 10 values
print(df[:10])
print(df[len(df)-10:len(df)])


                                                         Rate
COURT REPORTING                                           inf
NUCLEAR ENGINEERING                                 15.360212
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL T...   8.066188
BIOMEDICAL ENGINEERING                               4.991650
COMPUTER PROGRAMMING AND DATA PROCESSING             4.631141
ANIMAL SCIENCES                                      4.126263
PUBLIC POLICY                                        4.112736
AGRICULTURAL ECONOMICS                               3.865340
GENERAL MEDICAL AND HEALTH SERVICES                  3.722096
COMMUNITY AND PUBLIC HEALTH                          3.719242
                                                        Rate
MATERIALS ENGINEERING AND MATERIALS SCIENCE         0.541862
SOCIAL PSYCHOLOGY                                   0.485417
ENGINEERING AND INDUSTRIAL MANAGEMENT               0.355584
ENGINEERING MECHANICS PHYSICS AND SCIENCE           0.218669
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOG...  0.212825
EDUCATIONAL ADMINISTRATION AND SUPERVISION          0.000000
SOIL SCIENCE                                        0.000000
MATHEMATICS AND COMPUTER SCIENCE                    0.000000
BOTANY                                              0.000000
MILITARY TECHNOLOGIES                                    NaN

As we can see, there are some NaN, 0, and inf values, probably caused by missing or zero unemployment rates. Let's remove them:


In [11]:
# get rid of NaN and inf values
df = df.replace([np.inf, 0], np.nan)
df = df[df["Rate"].isnull() == False]

#print out first and last 10 values
print(df[:10])
print(df[len(df)-10:len(df)])


                                                         Rate
NUCLEAR ENGINEERING                                 15.360212
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL T...   8.066188
BIOMEDICAL ENGINEERING                               4.991650
COMPUTER PROGRAMMING AND DATA PROCESSING             4.631141
ANIMAL SCIENCES                                      4.126263
PUBLIC POLICY                                        4.112736
AGRICULTURAL ECONOMICS                               3.865340
GENERAL MEDICAL AND HEALTH SERVICES                  3.722096
COMMUNITY AND PUBLIC HEALTH                          3.719242
GEOGRAPHY                                            3.644643
                                                        Rate
MISCELLANEOUS AGRICULTURE                           0.691325
COSMETOLOGY SERVICES AND CULINARY ARTS              0.688208
CONSTRUCTION SERVICES                               0.685638
GENERAL AGRICULTURE                                 0.669821
HUMAN SERVICES AND COMMUNITY ORGANIZATION           0.629672
MATERIALS ENGINEERING AND MATERIALS SCIENCE         0.541862
SOCIAL PSYCHOLOGY                                   0.485417
ENGINEERING AND INDUSTRIAL MANAGEMENT               0.355584
ENGINEERING MECHANICS PHYSICS AND SCIENCE           0.218669
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOG...  0.212825
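As a side note, the same cleanup could be written more concisely with dropna; the one-liner below is an equivalent sketch, not what was run above.

# Sketch: replace inf/0 with NaN and drop those rows in one step
df = df.replace([np.inf, 0], np.nan).dropna(subset=["Rate"])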

Now let's graph two bar plots to get a better visual understanding.


In [12]:
# initialize the figure
fig = plt.figure(figsize=[10,5])
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

# plot first and last 10 values
ax1_indexes = df.index.values[0:10].tolist()
ax1_rates = df["Rate"].values[0:10]
ax1_positions = np.arange(10) + 0.75
ax1_ticks = np.arange(1,11)
ax1.bar(ax1_positions, ax1_rates, 0.7)
ax1.set_xticks(ax1_ticks)
ax1.set_xticklabels(ax1_indexes, rotation=90)

ax2_indexes = df.index.values[len(df)-10:len(df)].tolist()
ax2_rates = df["Rate"].values[len(df)-10:len(df)]
ax2_positions = np.arange(10) + 0.75
ax2_ticks = np.arange(1,11)
ax2.bar(ax2_positions, ax2_rates, 0.7)
ax2.set_xticks(ax2_ticks)
ax2.set_xticklabels(ax2_indexes, rotation=90)
ax2.set_ylim(0,0.8)

plt.show()


As we can see, "NUCLEAR ENGINEERING", "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES", and "BIOMEDICAL ENGINEERING" are the fields that appear to require the most experience, while "ENGINEERING AND INDUSTRIAL MANAGEMENT", "ENGINEERING MECHANICS PHYSICS AND SCIENCE", and "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES" are the ones that require the least.

Obviously this analysis is very rough, considering there might be numerous other factors affecting the rates.

Question 2 - "Which fields require higher education for employment?"

We will repeat the same process as above, but this time we compare the overall graduate data with the overall (all-ages) data.

Let's call education rate = (graduate unemployment rate) / (overall unemployment rate)


In [13]:
# get a list of all majors
all_majors = all_ages["Major"].values

# create a dictionary for "education rate"
education_rate = {}

# loop over all majors
for major in all_majors:
    #get unemployment rates from both datasets
    if major in grad_students["Major"].values and major in recent_grads["Major"].values:
        grad_rate = grad_students[grad_students["Major"]==major]["Grad_unemployment_rate"].values[0]
        all_ages_rate = all_ages[all_ages["Major"]==major]["Unemployment_rate"].values[0]
        
        #find "education rate"
        rate = grad_rate/all_ages_rate
        
        # place it into dictionary
        education_rate[major] = rate
list(education_rate.items())[:10]


/Users/ardbsrn/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:15: RuntimeWarning: divide by zero encountered in double_scalars
Out[13]:
[('MISCELLANEOUS HEALTH MEDICAL PROFESSIONS', 0.50968948483737442),
 ('FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES',
  0.65617247847524596),
 ('LIBERAL ARTS', 0.62314579213711996),
 ('ACTUARIAL SCIENCE', 1.3242801919529743),
 ('MISCELLANEOUS ENGINEERING TECHNOLOGIES', 0.52656965058005312),
 ('CONSTRUCTION SERVICES', 1.7132756761584875),
 ('MICROBIOLOGY', 0.63204820353568303),
 ('ELEMENTARY EDUCATION', 0.53084819575317121),
 ('CHEMICAL ENGINEERING', 0.83384187508015606),
 ('EDUCATIONAL ADMINISTRATION AND SUPERVISION', inf)]

In [14]:
# create a new dataframe and clean inf,0,NaN rows.
df2 = pd.DataFrame.from_dict(education_rate,orient="index")
df2.columns = ["Rate"]
df2 = df2.sort_values(by=["Rate"], ascending = False)
df2 = df2.replace([np.inf, 0], np.nan)
df2 = df2[df2["Rate"].isnull() == False]
print(df2[:10])
print(df2[len(df2)-10:len(df2)])


                                                        Rate
MATHEMATICS AND COMPUTER SCIENCE                    4.132069
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOG...  2.662010
MISCELLANEOUS AGRICULTURE                           2.203710
ENGINEERING AND INDUSTRIAL MANAGEMENT               1.731443
CONSTRUCTION SERVICES                               1.713276
PHARMACOLOGY                                        1.503956
MECHANICAL ENGINEERING RELATED TECHNOLOGIES         1.498525
COSMETOLOGY SERVICES AND CULINARY ARTS              1.472518
HOSPITALITY MANAGEMENT                              1.435785
COMPUTER NETWORKING AND TELECOMMUNICATIONS          1.390355
                                                        Rate
ANIMAL SCIENCES                                     0.288820
SOIL SCIENCE                                        0.288356
COMPUTER PROGRAMMING AND DATA PROCESSING            0.272668
MULTI/INTERDISCIPLINARY STUDIES                     0.260453
NEUROSCIENCE                                        0.255613
MISCELLANEOUS FINE ARTS                             0.246315
BIOMEDICAL ENGINEERING                              0.233408
ASTRONOMY AND ASTROPHYSICS                          0.232833
NUCLEAR ENGINEERING                                 0.171800
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL T...  0.126960

Now let's graph these rates:


In [15]:
# initialize the figure
fig = plt.figure(figsize=[10,5])
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

# plot first and last 10 values
ax1_indexes = df2.index.values[0:10].tolist()
ax1_rates = df2["Rate"].values[0:10]
ax1_positions = np.arange(10) + 0.75
ax1_ticks = np.arange(1,11)
ax1.bar(ax1_positions, ax1_rates, 0.7)
ax1.set_xticks(ax1_ticks)
ax1.set_xticklabels(ax1_indexes, rotation=90)
ax1.set_ylim(0,5)

ax2_indexes = df2.index.values[len(df2)-10:len(df2)].tolist()
ax2_rates = df2["Rate"].values[len(df2)-10:len(df2)]
ax2_positions = np.arange(10) + 0.75
ax2_ticks = np.arange(1,11)
ax2.bar(ax2_positions, ax2_rates, 0.7)
ax2.set_xticks(ax2_ticks)
ax2.set_xticklabels(ax2_indexes, rotation=90)
ax2.set_ylim(0,0.4)

plt.show()


As we can see, "MATHEMATICS AND COMPUTER SCIENCE", "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES", and "MISCELLANEOUS AGRICULTURE" are the fields that appear to require higher education the most, while "ASTRONOMY AND ASTROPHYSICS", "NUCLEAR ENGINEERING", and "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES" are the ones that require it the least.

Again, this is a very rough analysis, considering it involves only two datasets; due to the lack of data, we ignore other contributing factors.
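As a possible follow-up, the two rate tables could be joined on the major name to inspect both metrics side by side. The snippet below is only a sketch and was not part of the analysis above; it assumes df and df2 are the cleaned DataFrames built earlier.

# Sketch: view the experience rate and education rate per major side by side
combined = df.join(df2, lsuffix="_experience", rsuffix="_education")
combined = combined.sort_values("Rate_experience", ascending=False)
combined.head(10)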

