As time passes and the world changes, some fields become more popular than others. For instance, the tech industry now makes up a large share of the economy, yet less than a century ago computers had not even been invented. As civilization developed, fields evolved and developed with it. Fields also differ in the expertise they require: some look for graduate-level education, while others look for years of experience. Let's analyze a dataset on college majors and unemployment rates and see whether we can find some relationships.
The dataset used in this analysis is courtesy of FiveThirtyEight; it has been released to the public and can be found on their GitHub page.
We will be looking at three CSV files that all contain major and employment information: all-ages.csv, grad-students.csv, and recent-grads.csv. A fourth file, majors-list.csv, is loaded as well.
The columns are described on the GitHub page mentioned above, and we will look at them during the analysis.
Since we can compare recent-graduate statistics with the overall data and the overall graduate data, we can ask how experience and education level relate to unemployment: which fields require the most experience, and which fields require higher education the most?
In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline
# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
In [2]:
all_ages = pd.read_csv("all-ages.csv")
grad_students = pd.read_csv("grad-students.csv")
majors = pd.read_csv("majors-list.csv")
recent_grads = pd.read_csv("recent-grads.csv")
Let's take a look at the first few rows of the datasets:
In [3]:
all_ages.head(3)
Out[3]:
In [4]:
grad_students.head(3)
Out[4]:
In [5]:
print(grad_students.columns)
In [6]:
majors.head(3)
Out[6]:
In [7]:
recent_grads.head(3)
Out[7]:
In [8]:
recent_grads.columns
Out[8]:
To see which fields require the most experience, we can compare the unemployment rates of recent graduates with the unemployment rates of overall graduates.
Let's call experience rate = (recent graduate unemployment rate) / (overall graduate unemployment rate)
It's true that employers in general prefer to hire more experienced people, but if the experience rate of a field is a lot higher or lower than that of the others, we can draw some conclusions.
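To make the definition concrete, here is a minimal sketch on made-up numbers. The two small tables and their values are hypothetical stand-ins for recent-grads.csv and grad-students.csv; only the column names Major, Unemployment_rate, and Grad_unemployment_rate are taken from the real files. Indexing both tables by Major lets pandas line up the rows, so the ratio becomes a single division, and a major that is missing from either table comes out as NaN.

```python
import pandas as pd

# Hypothetical toy data; the real values come from the CSV files.
recent = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Unemployment_rate": [0.10, 0.04, 0.06],
})
grad = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR D"],
    "Grad_unemployment_rate": [0.02, 0.04, 0.03],
})

# Index by major so that the division matches rows by name.
recent_rate = recent.set_index("Major")["Unemployment_rate"]
grad_rate = grad.set_index("Major")["Grad_unemployment_rate"]

# experience rate = recent graduate rate / overall graduate rate
experience_rate = (recent_rate / grad_rate).rename("Rate")
print(experience_rate.sort_values(ascending=False))
```

In this toy example MAJOR A ends up with a rate of about 5, MAJOR B with a rate of 1, and MAJOR C and MAJOR D, which each appear in only one table, get NaN.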
In [9]:
# get a list of all majors
all_majors = all_ages["Major"].values
# create a dictionary for "experience rate"
employment_increase = {}
# loop over all majors
for major in all_majors:
    # get unemployment rates from both datasets
    if major in grad_students["Major"].values and major in recent_grads["Major"].values:
        grad_rate = grad_students[grad_students["Major"]==major]["Grad_unemployment_rate"].values[0]
        recent_rate = recent_grads[recent_grads["Major"]==major]["Unemployment_rate"].values[0]
        # find "experience rate"
        rate = recent_rate/grad_rate
        # place it into dictionary
        employment_increase[major] = rate
list(employment_increase.items())[:10]
Out[9]:
In [10]:
# create a new dataframe from dictionary
df = pd.DataFrame.from_dict(employment_increase,orient="index")
# rename the column
df.columns = ["Rate"]
# sort dataframe by rates
df = df.sort_values(by=["Rate"], ascending = False)
#print out first and last 10 values
print(df[:10])
print(df[len(df)-10:len(df)])
As we can see, there are some NaN, 0, and inf values, which are probably caused by missing or zero unemployment rates. Let's remove them:
In [11]:
# get rid of NaN, inf and 0 values
df = df.replace([np.inf, 0], np.nan)
df = df.dropna(subset=["Rate"])
#print out first and last 10 values
print(df[:10])
print(df[len(df)-10:len(df)])
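Where these values come from can be seen in a small sketch. The rates below are made up for illustration and are not taken from the dataset: dividing by a zero rate gives inf, a zero numerator gives 0, and 0/0 or a missing rate gives NaN. Replacing inf and 0 with NaN first means that a single dropna call removes all three kinds of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rates chosen to trigger each problem case.
recent_rate = pd.Series([0.08, 0.05, 0.00, 0.00, np.nan], index=list("ABCDE"))
grad_rate = pd.Series([0.04, 0.00, 0.03, 0.00, 0.02], index=list("ABCDE"))

ratio = recent_rate / grad_rate
# A: ordinary ratio, B: inf (x/0), C: 0 (0/x), D: NaN (0/0), E: NaN (missing)

toy = ratio.to_frame("Rate")
# Turn inf and 0 into NaN, then drop every NaN row in one step.
cleaned = toy.replace([np.inf, 0], np.nan).dropna(subset=["Rate"])
print(cleaned)
```

Only row A survives the cleaning. Note that this treats a rate of exactly 0 as missing data, which is the same assumption the cell above makes.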
Now let's graph two bar plots to get a better visual understanding.
In [12]:
# initialize the figure
fig = plt.figure(figsize=[10,5])
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
# plot first and last 10 values
ax1_indexes = df.index.values[0:10].tolist()
ax1_rates = df["Rate"].values[0:10]
ax1_positions = np.arange(10) + 0.75
ax1_ticks = np.arange(1,11)
ax1.bar(ax1_positions, ax1_rates, 0.7)
# set the tick positions first, then attach the labels to them
ax1.set_xticks(ax1_ticks)
ax1.set_xticklabels(ax1_indexes,rotation = 90)
ax2_indexes = df.index.values[len(df)-10:len(df)].tolist()
ax2_rates = df["Rate"].values[len(df)-10:len(df)]
ax2_positions = np.arange(10) + 0.75
ax2_ticks = np.arange(1,11)
ax2.bar(ax2_positions, ax2_rates, 0.7)
ax2.set_xticks(ax2_ticks)
ax2.set_xticklabels(ax2_indexes,rotation = 90)
ax2.set_ylim(0,0.8)
plt.show()
As we can see, "NUCLEAR ENGINEERING", "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES", and "BIOMEDICAL ENGINEERING" are the fields that require the most experience, while "ENGINEERING AND INDUSTRIAL MANAGEMENT", "ENGINEERING MECHANICS PHYSICS AND SCIENCE", and "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES" are the ones that require the least.
Obviously, this analysis is very rough, since numerous other factors might affect the rates.
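Because the same two-panel bar chart is useful for any table of rates, the plotting steps could also be wrapped in a small function. This is only a sketch and not the notebook's own code: the name plot_extremes and its parameters are hypothetical, the demo frame is random data standing in for df, and the axis limits are left to matplotlib's defaults.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_extremes(frame, column="Rate", n=10):
    """Bar-plot the n highest and n lowest values of `column` side by side."""
    ordered = frame.sort_values(by=column, ascending=False)
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    for ax, part in zip(axes, (ordered.head(n), ordered.tail(n))):
        positions = np.arange(len(part))
        ax.bar(positions, part[column].values, 0.7)
        # Set the tick positions first, then the labels that belong to them.
        ax.set_xticks(positions)
        ax.set_xticklabels(part.index.tolist(), rotation=90)
    return fig, axes


# Hypothetical demo data: 30 fake majors with random rates.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"Rate": rng.uniform(0.1, 3.0, size=30)},
    index=[f"MAJOR {i}" for i in range(30)],
)
fig, (ax_top, ax_bottom) = plot_extremes(demo)
```

With the cleaned table from above, the call would simply be plot_extremes(df). Returning the figure and axes keeps it possible to adjust limits afterwards, for example with ax_bottom.set_ylim(0, 0.8).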
We will repeat the same process as above; this time we are going to use the overall graduate data and the overall data.
Let's call education rate = (graduate unemployment rate) / (overall unemployment rate)
In [13]:
# get a list of all majors
all_majors = all_ages["Major"].values
# create a dictionary for "education rate"
education_rate = {}
# loop over all majors
for major in all_majors:
    # get unemployment rates from both datasets
    if major in grad_students["Major"].values and major in recent_grads["Major"].values:
        grad_rate = grad_students[grad_students["Major"]==major]["Grad_unemployment_rate"].values[0]
        all_ages_rate = all_ages[all_ages["Major"]==major]["Unemployment_rate"].values[0]
        # find "education rate"
        rate = grad_rate/all_ages_rate
        # place it into dictionary
        education_rate[major] = rate
list(education_rate.items())[:10]
Out[13]:
In [14]:
# create a new dataframe and clean inf,0,NaN rows.
df2 = pd.DataFrame.from_dict(education_rate,orient="index")
df2.columns = ["Rate"]
df2 = df2.sort_values(by=["Rate"], ascending = False)
df2 = df2.replace([np.inf, 0], np.nan)
df2 = df2.dropna(subset=["Rate"])
print(df2[:10])
print(df2[len(df2)-10:len(df2)])
Now let's graph these rates:
In [15]:
# initialize the figure
fig = plt.figure(figsize=[10,5])
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
# plot first and last 10 values
ax1_indexes = df2.index.values[0:10].tolist()
ax1_rates = df2["Rate"].values[0:10]
ax1_positions = np.arange(10) + 0.75
ax1_ticks = np.arange(1,11)
ax1.bar(ax1_positions, ax1_rates, 0.7)
# set the tick positions first, then attach the labels to them
ax1.set_xticks(ax1_ticks)
ax1.set_xticklabels(ax1_indexes,rotation = 90)
ax1.set_ylim(0,5)
ax2_indexes = df2.index.values[len(df2)-10:len(df2)].tolist()
ax2_rates = df2["Rate"].values[len(df2)-10:len(df2)]
ax2_positions = np.arange(10) + 0.75
ax2_ticks = np.arange(1,11)
ax2.bar(ax2_positions, ax2_rates, 0.7)
ax2.set_xticks(ax2_ticks)
ax2.set_xticklabels(ax2_indexes,rotation = 90)
ax2.set_ylim(0,0.4)
plt.show()
As we can see, "MATHEMATICS AND COMPUTER SCIENCE", "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES", and "MISCELLANEOUS AGRICULTURE" are the fields that require higher education the most, while "ASTRONOMY AND ASTROPHYSICS", "NUCLEAR ENGINEERING", and "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES" are the ones that require it the least.
Again, this is a very rough analysis: it involves only two datasets, and because of the lack of data we ignore other factors that affect the rates.