Unemployment, Experience & Higher Education

As time passes and the world changes, different fields become more popular than others. For instance, tech industries now hold a huge portion of the economy, yet a few decades ago computers had not even been invented. As civilization developed, fields evolved and developed along with it. From another perspective, fields differ in the expertise they demand: while some look for graduate-level education, others look for years of experience. Let's analyze a dataset on college majors and unemployment rates and see whether we can justify some of these relations.

The dataset used in this analysis is courtesy of FiveThirtyEight; it is publicly available and can be found on their GitHub page.

Structure of the Dataset

We will be looking at four CSV files. The first three contain major and employment information; the last is a lookup table of majors:

  • all-ages.csv - Overall data from all ages
  • grad-students.csv - Graduate degree holders; also includes columns for non-graduates
  • recent-grads.csv - Recent graduates; also includes gender-related columns
  • majors-list.csv - List of all majors and their codes.

The columns are documented on the GitHub page mentioned above and will be examined during the analysis.
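If you prefer not to download the files manually, pandas can also read them directly from FiveThirtyEight's repository. The snippet below is an optional sketch; the raw URL assumes the files are still hosted under the college-majors directory of the fivethirtyeight/data repository.

# Optional sketch: load the CSVs straight from FiveThirtyEight's GitHub repository
# (the base URL is an assumption about where the files currently live)
import pandas as pd

base_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/"
all_ages = pd.read_csv(base_url + "all-ages.csv")
grad_students = pd.read_csv(base_url + "grad-students.csv")
recent_grads = pd.read_csv(base_url + "recent-grads.csv")
majors = pd.read_csv(base_url + "majors-list.csv")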

Questions

Since we can compare recent-graduate statistics with both the overall data and the overall graduate data, we can pose different questions about how experience and education level relate to unemployment:

  • Which fields require more experience in the field for employment?
  • Which fields require higher education for employment?

Setting up Data


In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline

# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

In [2]:
all_ages = pd.read_csv("all-ages.csv")
grad_students = pd.read_csv("grad-students.csv")
majors = pd.read_csv("majors-list.csv")
recent_grads = pd.read_csv("recent-grads.csv")

Let's take a look at the first few rows of the datasets:


In [3]:
all_ages.head(3)


Out[3]:
Major_code Major Major_category Total Employed Employed_full_time_year_round Unemployed Unemployment_rate Median P25th P75th
0 1100 GENERAL AGRICULTURE Agriculture & Natural Resources 128148 90245 74078 2423 0.026147 50000 34000 80000.0
1 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources 95326 76865 64240 2266 0.028636 54000 36000 80000.0
2 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources 33955 26321 22810 821 0.030248 63000 40000 98000.0

In [4]:
grad_students.head(3)


Out[4]:
Major_code Major Major_category Grad_total Grad_sample_size Grad_employed Grad_full_time_year_round Grad_unemployed Grad_unemployment_rate Grad_median ... Nongrad_total Nongrad_employed Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate Nongrad_median Nongrad_P25 Nongrad_P75 Grad_share Grad_premium
0 5601 CONSTRUCTION SERVICES Industrial Arts & Consumer Services 9173 200 7098 6511 681 0.087543 75000.0 ... 86062 73607 62435 3928 0.050661 65000.0 47000 98000.0 0.096320 0.153846
1 6004 COMMERCIAL ART AND GRAPHIC DESIGN Arts 53864 882 40492 29553 2482 0.057756 60000.0 ... 461977 347166 250596 25484 0.068386 48000.0 34000 71000.0 0.104420 0.250000
2 6211 HOSPITALITY MANAGEMENT Business 24417 437 18368 14784 1465 0.073867 65000.0 ... 179335 145597 113579 7409 0.048423 50000.0 35000 75000.0 0.119837 0.300000

3 rows × 22 columns


In [5]:
print(grad_students.columns)


Index(['Major_code', 'Major', 'Major_category', 'Grad_total',
       'Grad_sample_size', 'Grad_employed', 'Grad_full_time_year_round',
       'Grad_unemployed', 'Grad_unemployment_rate', 'Grad_median', 'Grad_P25',
       'Grad_P75', 'Nongrad_total', 'Nongrad_employed',
       'Nongrad_full_time_year_round', 'Nongrad_unemployed',
       'Nongrad_unemployment_rate', 'Nongrad_median', 'Nongrad_P25',
       'Nongrad_P75', 'Grad_share', 'Grad_premium'],
      dtype='object')

In [6]:
majors.head(3)


Out[6]:
FOD1P Major Major_Category
0 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
1 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
2 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources

In [7]:
recent_grads.head(3)


Out[7]:
Rank Major_code Major Major_category Total Sample_size Men Women ShareWomen Employed ... Part_time Full_time_year_round Unemployed Unemployment_rate Median P25th P75th College_jobs Non_college_jobs Low_wage_jobs
0 1 2419 PETROLEUM ENGINEERING Engineering 2339 36 2057 282 0.120564 1976 ... 270 1207 37 0.018381 110000 95000 125000 1534 364 193
1 2 2416 MINING AND MINERAL ENGINEERING Engineering 756 7 679 77 0.101852 640 ... 170 388 85 0.117241 75000 55000 90000 350 257 50
2 3 2415 METALLURGICAL ENGINEERING Engineering 856 3 725 131 0.153037 648 ... 133 340 16 0.024096 73000 50000 105000 456 176 0

3 rows × 21 columns


In [8]:
recent_grads.columns


Out[8]:
Index(['Rank', 'Major_code', 'Major', 'Major_category', 'Total', 'Sample_size',
       'Men', 'Women', 'ShareWomen', 'Employed', 'Full_time', 'Part_time',
       'Full_time_year_round', 'Unemployed', 'Unemployment_rate', 'Median',
       'P25th', 'P75th', 'College_jobs', 'Non_college_jobs', 'Low_wage_jobs'],
      dtype='object')

Question 1 - "Which fields require more experience in the field for employment?"

To answer this question, we can compare the unemployment rates of recent graduates against the unemployment rates of all graduates.

Let's call experience rate = (recent graduate unemployment rate) / (overall graduate unemployment rate)

It's true that, in general, employers prefer to hire more experienced people, but if a field's experience rate is much higher or lower than the others', we can draw some conclusions.


In [9]:
# get a list of all majors
all_majors = all_ages["Major"].values

# create a dictionary for "experience rate"
employment_increase = {}

# loop over all majors
for major in all_majors:
    # get unemployment rates from both datasets 
    if major in grad_students["Major"].values and major in recent_grads["Major"].values:
        grad_rate = grad_students[grad_students["Major"]==major]["Grad_unemployment_rate"].values[0]
        recent_rate = recent_grads[recent_grads["Major"]==major]["Unemployment_rate"].values[0]
        
        # find "experience rate"
        rate = recent_rate/grad_rate
        # place it into dictionary
        employment_increase[major] = rate
list(employment_increase.items())[:10]


/Users/ardbsrn/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:15: RuntimeWarning: divide by zero encountered in double_scalars
Out[9]:
[('MISCELLANEOUS HEALTH MEDICAL PROFESSIONS', 2.9815019705407519),
 ('FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES',
  1.9587923681364077),
 ('LIBERAL ARTS', 1.8585901147462398),
 ('ACTUARIAL SCIENCE', 1.2883521216778024),
 ('MISCELLANEOUS ENGINEERING TECHNOLOGIES', 1.6574805982708147),
 ('CONSTRUCTION SERVICES', 0.68563764485874468),
 ('MICROBIOLOGY', 2.0764231918780278),
 ('ELEMENTARY EDUCATION', 2.2877748756033829),
 ('CHEMICAL ENGINEERING', 1.5838820614361864),
 ('EDUCATIONAL ADMINISTRATION AND SUPERVISION', 0.0)]
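As an aside, the same experience rates can be computed without an explicit loop by merging the two DataFrames on the Major column. The snippet below is just a sketch of an equivalent, more idiomatic pandas approach; it uses only the columns shown above and produces the same ratios.

# Sketch: compute the experience rate via a merge instead of a loop
merged = recent_grads.merge(
    grad_students[["Major", "Grad_unemployment_rate"]], on="Major"
)
merged["Rate"] = merged["Unemployment_rate"] / merged["Grad_unemployment_rate"]
experience_rates = merged.set_index("Major")["Rate"]
experience_rates.head(10)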

In [10]:
# create a new dataframe from dictionary
df = pd.DataFrame.from_dict(employment_increase,orient="index")

# rename the column
df.columns = ["Rate"]

# sort dataframe by rates
df = df.sort_values(by=["Rate"], ascending = False)

#print out first and last 10 values
print(df[:10])
print(df[len(df)-10:len(df)])


                                                         Rate
COURT REPORTING                                           inf
NUCLEAR ENGINEERING                                 15.360212
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL T...   8.066188
BIOMEDICAL ENGINEERING                               4.991650
COMPUTER PROGRAMMING AND DATA PROCESSING             4.631141
ANIMAL SCIENCES                                      4.126263
PUBLIC POLICY                                        4.112736
AGRICULTURAL ECONOMICS                               3.865340
GENERAL MEDICAL AND HEALTH SERVICES                  3.722096
COMMUNITY AND PUBLIC HEALTH                          3.719242
                                                        Rate
MATERIALS ENGINEERING AND MATERIALS SCIENCE         0.541862
SOCIAL PSYCHOLOGY                                   0.485417
ENGINEERING AND INDUSTRIAL MANAGEMENT               0.355584
ENGINEERING MECHANICS PHYSICS AND SCIENCE           0.218669
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOG...  0.212825
EDUCATIONAL ADMINISTRATION AND SUPERVISION          0.000000
SOIL SCIENCE                                        0.000000
MATHEMATICS AND COMPUTER SCIENCE                    0.000000
BOTANY                                              0.000000
MILITARY TECHNOLOGIES                                    NaN

As we can see, there are some NaN, 0, and inf values, probably caused by missing or zero unemployment rates. Let's remove them:


In [11]:
# get rid of NaN and inf values
df = df.replace([np.inf, 0], np.nan)
df = df[df["Rate"].isnull() == False]

#print out first and last 10 values
print(df[:10])
print(df[len(df)-10:len(df)])


                                                         Rate
NUCLEAR ENGINEERING                                 15.360212
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL T...   8.066188
BIOMEDICAL ENGINEERING                               4.991650
COMPUTER PROGRAMMING AND DATA PROCESSING             4.631141
ANIMAL SCIENCES                                      4.126263
PUBLIC POLICY                                        4.112736
AGRICULTURAL ECONOMICS                               3.865340
GENERAL MEDICAL AND HEALTH SERVICES                  3.722096
COMMUNITY AND PUBLIC HEALTH                          3.719242
GEOGRAPHY                                            3.644643
                                                        Rate
MISCELLANEOUS AGRICULTURE                           0.691325
COSMETOLOGY SERVICES AND CULINARY ARTS              0.688208
CONSTRUCTION SERVICES                               0.685638
GENERAL AGRICULTURE                                 0.669821
HUMAN SERVICES AND COMMUNITY ORGANIZATION           0.629672
MATERIALS ENGINEERING AND MATERIALS SCIENCE         0.541862
SOCIAL PSYCHOLOGY                                   0.485417
ENGINEERING AND INDUSTRIAL MANAGEMENT               0.355584
ENGINEERING MECHANICS PHYSICS AND SCIENCE           0.218669
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOG...  0.212825
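As a side note, the same cleanup could be written more concisely with dropna; the one-liner below is an equivalent sketch, not what was run above.

# Sketch: replace inf/0 with NaN and drop those rows in one step
df = df.replace([np.inf, 0], np.nan).dropna(subset=["Rate"])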

Now let's graph two bar plots to get a better visual understanding.


In [12]:
# initialize the figure
fig = plt.figure(figsize=[10,5])
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

# plot first and last 10 values
ax1_indexes = df.index.values[0:10].tolist()
ax1_rates = df["Rate"].values[0:10]
ax1_positions = np.arange(10) + 0.75
ax1_ticks = np.arange(1,11)
ax1.bar(ax1_positions, ax1_rates, 0.7)
ax1.set_xticks(ax1_ticks)
ax1.set_xticklabels(ax1_indexes, rotation=90)

ax2_indexes = df.index.values[len(df)-10:len(df)].tolist()
ax2_rates = df["Rate"].values[len(df)-10:len(df)]
ax2_positions = np.arange(10) + 0.75
ax2_ticks = np.arange(1,11)
ax2.bar(ax2_positions, ax2_rates, 0.7)
ax2.set_xticks(ax2_ticks)
ax2.set_xticklabels(ax2_indexes, rotation=90)
ax2.set_ylim(0,0.8)

plt.show()


As we can see, "NUCLEAR ENGINEERING", "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES", and "BIOMEDICAL ENGINEERING" are the fields that appear to require the most experience, while "ENGINEERING AND INDUSTRIAL MANAGEMENT", "ENGINEERING MECHANICS PHYSICS AND SCIENCE", and "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES" are the ones that require the least.

Obviously this analysis is very rough, considering there might be numerous other factors affecting the rates.

Question 2 - "Which fields require higher education for employment?"

We will repeat the same process as above, but this time we compare the overall graduate data with the overall (all-ages) data.

Let's call education rate = (graduate unemployment rate) / (overall unemployment rate)


In [13]:
# get a list of all majors
all_majors = all_ages["Major"].values

# create a dictionary for "education rate"
education_rate = {}

# loop over all majors
for major in all_majors:
    #get unemployment rates from both datasets
    if major in grad_students["Major"].values and major in recent_grads["Major"].values:
        grad_rate = grad_students[grad_students["Major"]==major]["Grad_unemployment_rate"].values[0]
        all_ages_rate = all_ages[all_ages["Major"]==major]["Unemployment_rate"].values[0]
        
        #find "education rate"
        rate = grad_rate/all_ages_rate
        
        # place it into dictionary
        education_rate[major] = rate
list(education_rate.items())[:10]


/Users/ardbsrn/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:15: RuntimeWarning: divide by zero encountered in double_scalars
Out[13]:
[('MISCELLANEOUS HEALTH MEDICAL PROFESSIONS', 0.50968948483737442),
 ('FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES',
  0.65617247847524596),
 ('LIBERAL ARTS', 0.62314579213711996),
 ('ACTUARIAL SCIENCE', 1.3242801919529743),
 ('MISCELLANEOUS ENGINEERING TECHNOLOGIES', 0.52656965058005312),
 ('CONSTRUCTION SERVICES', 1.7132756761584875),
 ('MICROBIOLOGY', 0.63204820353568303),
 ('ELEMENTARY EDUCATION', 0.53084819575317121),
 ('CHEMICAL ENGINEERING', 0.83384187508015606),
 ('EDUCATIONAL ADMINISTRATION AND SUPERVISION', inf)]

In [14]:
# create a new dataframe and clean inf,0,NaN rows.
df2 = pd.DataFrame.from_dict(education_rate,orient="index")
df2.columns = ["Rate"]
df2 = df2.sort_values(by=["Rate"], ascending = False)
df2 = df2.replace([np.inf, 0], np.nan)
df2 = df2[df2["Rate"].isnull() == False]
print(df2[:10])
print(df2[len(df2)-10:len(df2)])


                                                        Rate
MATHEMATICS AND COMPUTER SCIENCE                    4.132069
ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOG...  2.662010
MISCELLANEOUS AGRICULTURE                           2.203710
ENGINEERING AND INDUSTRIAL MANAGEMENT               1.731443
CONSTRUCTION SERVICES                               1.713276
PHARMACOLOGY                                        1.503956
MECHANICAL ENGINEERING RELATED TECHNOLOGIES         1.498525
COSMETOLOGY SERVICES AND CULINARY ARTS              1.472518
HOSPITALITY MANAGEMENT                              1.435785
COMPUTER NETWORKING AND TELECOMMUNICATIONS          1.390355
                                                        Rate
ANIMAL SCIENCES                                     0.288820
SOIL SCIENCE                                        0.288356
COMPUTER PROGRAMMING AND DATA PROCESSING            0.272668
MULTI/INTERDISCIPLINARY STUDIES                     0.260453
NEUROSCIENCE                                        0.255613
MISCELLANEOUS FINE ARTS                             0.246315
BIOMEDICAL ENGINEERING                              0.233408
ASTRONOMY AND ASTROPHYSICS                          0.232833
NUCLEAR ENGINEERING                                 0.171800
NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL T...  0.126960

Now let's graph these rates:


In [15]:
# initialize the figure
fig = plt.figure(figsize=[10,5])
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

# plot first and last 10 values
ax1_indexes = df2.index.values[0:10].tolist()
ax1_rates = df2["Rate"].values[0:10]
ax1_positions = np.arange(10) + 0.75
ax1_ticks = np.arange(1,11)
ax1.bar(ax1_positions, ax1_rates, 0.7)
ax1.set_xticks(ax1_ticks)
ax1.set_xticklabels(ax1_indexes, rotation=90)
ax1.set_ylim(0,5)

ax2_indexes = df2.index.values[len(df2)-10:len(df2)].tolist()
ax2_rates = df2["Rate"].values[len(df2)-10:len(df2)]
ax2_positions = np.arange(10) + 0.75
ax2_ticks = np.arange(1,11)
ax2.bar(ax2_positions, ax2_rates, 0.7)
ax2.set_xticks(ax2_ticks)
ax2.set_xticklabels(ax2_indexes, rotation=90)
ax2.set_ylim(0,0.4)

plt.show()


As we can see, "MATHEMATICS AND COMPUTER SCIENCE", "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES", and "MISCELLANEOUS AGRICULTURE" are the fields that appear to require higher education the most, while "ASTRONOMY AND ASTROPHYSICS", "NUCLEAR ENGINEERING", and "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES" are the ones that require it the least.

Again, this is a very rough analysis, considering it involves only two datasets; due to the lack of data, we ignore other contributing factors.
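As a possible follow-up, the two rate tables could be joined on the major name to inspect both metrics side by side. The snippet below is only a sketch and was not part of the analysis above; it assumes df and df2 are the cleaned DataFrames built earlier.

# Sketch: view the experience rate and education rate per major side by side
combined = df.join(df2, lsuffix="_experience", rsuffix="_education")
combined = combined.sort_values("Rate_experience", ascending=False)
combined.head(10)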

