Please note that the maps are still saved here (in the map folder) in case nbviewer would throw any error (like 404, which happened to us sometimes)
In [1]:
import numpy as np
import os
import pandas as pd
import json
import folium
import branca
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import IFrame, Image
from ipywidgets import interact
%matplotlib inline
In [2]:
folium.__version__ == '0.5.0'
Out[2]:
We were given the instruction that the unemployed rate of Switzerland on https://www.amstat.ch/v2/index.jsp also contained the people that were not jobless but looking for a job, which didn't make sense with the documentation on https://www.amstat.ch/v2/index.jsp, the raw numbers and the comments on Mattermost.
For those reasons, we will assume that unemployed rate for Switzerland are simply "number of jobless people/Total of active population" and not "number of people looking for a job/Total of active population"
Regarding the European dataset on unemployment rate, we chose to use the latest annual complete dataset which is the one from January to December 2016. We could have chosen the latest complete dataset for a month, but we want to avoid the bias induced by the seasons.
To make sure that we compare rates that are comparable, we chose the Swiss dataset on unemployment rate for the same period. Moreover, we used the interface provided on https://www.amstat.ch/ to rename and select precisely the data that we are interested in. Also note we store the rates per nationalities in a separate file because if we start splitting the datas in a lot of categories (age category, nationalities, etc.) it is then cumbersome to aggregate them.
We show the steps taken to obtain each dataset with pictures at the end of this notebook as an appendice.
After downloading each data files, we clean them to get a cleaner and easier to read data file. Such cleaning can be done using a program such as Excel or Libreoffice. For larger files that cannot be open by these programs, we can use a script. We describe the cleaning below:
Loading data and cleaning them.
In [3]:
#The europe data file
europe_data_file = 'data/european_rate_2016.csv'
europe_df = pd.read_csv(europe_data_file)
#The swiss data file excluding nationalities
swiss_data_file_age = 'data/swiss_rate_tot+age+lang.csv'
swiss_df = pd.read_csv(swiss_data_file_age, encoding="UTF-16")
#The swiss data file with nationalities
swiss_data_file_nationality = 'data/swiss_rate_2016.csv'
swiss_nat_df = pd.read_csv(swiss_data_file_nationality)
#Clean the number so that they are represented as an int
swiss_df['Jobless'] = swiss_df['Jobless'].apply(lambda x: int(x.replace("'", "")))
swiss_df['Total'] = swiss_df['Total'].apply(lambda x: int(x.replace("'", "")))
#Let's get the rate per canton by aggregating the data
swiss_df["tmp"] = swiss_df["Rate"] * swiss_df["Total"]
swiss_df_canton_grouped = swiss_df[["Canton", "Jobless", "Total", "tmp"]].groupby("Canton").sum()
swiss_df_canton_grouped["Rate"] = pd.DataFrame(swiss_df_canton_grouped["tmp"] / swiss_df_canton_grouped["Total"])
swiss_df_canton_grouped = swiss_df_canton_grouped.reset_index()
del swiss_df["tmp"]
del swiss_df_canton_grouped["tmp"]
In [4]:
#Add the nationalities to the swiss dataframe grouped by canton
swiss_nat_df = swiss_nat_df.pivot(index='Canton', columns='Nationality')['Rate']
swiss_df_canton_grouped = pd.merge(swiss_df_canton_grouped, swiss_nat_df.reset_index(), on='Canton')
del swiss_nat_df
In [5]:
europe_df = europe_df[['Name', 'Rate']]
#Create a new single row dataframe for Switzerland
swiss_country_df = pd.DataFrame(
[['Switzerland', swiss_df_canton_grouped.loc[swiss_df_canton_grouped['Canton'] == 'Total']['Rate'].values[0]]],
columns=['Name', 'Rate'])
#Add the single row dataframe of Switzerland to the Europe dataframe
europe_df = europe_df.append(swiss_country_df)
In [6]:
europe_topo_path = 'topojson/europe.topojson.json'
europe_json_data = json.load(open(europe_topo_path, encoding="UTF-8"))
In [7]:
rate_min = min(europe_df['Rate'])
rate_max = max(europe_df['Rate'])
# color scale from min rate to max rate
color_scale = branca.colormap.linear.OrRd.scale(rate_min, rate_max)
color_scale = color_scale.to_step(n=8)
In [8]:
def style_function(country):
rate = europe_df.loc[europe_df['Name'] == country['properties']['NAME']]['Rate'].values
if len(rate) > 0:
# country is in the dataframe
return {
'fillOpacity': 0.8,
'weight': 0.5,
'color': 'black',
'fillColor': color_scale(rate)
}
else:
# country is not in the dataframe, hence we put its color as black
return {
'fillOpacity': 0.2,
'weight': 0.2,
'color': 'black',
'fillColor': 'black'
}
In [9]:
europe_unemployment_map = folium.Map([53,22], tiles='cartodbpositron', zoom_start=4)
g = folium.TopoJson(europe_json_data,
'objects.europe',
style_function=style_function,
).add_to(europe_unemployment_map)
color_scale.caption = 'Unemployment Rate (%)'
color_scale.add_to(europe_unemployment_map)
europe_title = "europe_unemployement_rate"
europe_unemployment_map.save("maps/{}.html".format(europe_title))
print('https://github.com/gregunz/ada2017/tree/master/03%20-%20Interactive%20Viz/maps/{}.html' \
.format(europe_title.replace(" ", "%20").replace("%", "%25")))
europe_unemployment_map
Out[9]:
We used a linear scale because it's the one that is more intuitive. By intuitive we mean that a high color difference is proportional to the rate difference as we usually expect.
But because of the two outliers (Spain and Greece) we cannot really distinguish differences between countries with low rate.
Hence we decided to still use a linear scale but this time the outliers are aggregated together in the last color of the scale (which now means rate of 15 and +) :
In [10]:
rate_max = europe_df['Rate'].quantile(.95)
color_scale = branca.colormap.linear.OrRd.scale(rate_min, rate_max)
color_scale = color_scale.to_step(n=8)
color_scale.index[-1] = np.float(max(europe_df['Rate']))
europe_unemployment_map_2 = folium.Map([53,22], tiles='cartodbpositron', zoom_start=4)
g = folium.TopoJson(europe_json_data,
'objects.europe',
style_function=style_function,
).add_to(europe_unemployment_map_2)
color_scale.caption = 'Unemployment Rate (scale without outliers) (%)'
color_scale.add_to(europe_unemployment_map_2)
europe_title_2 = "europe_unemployement_rate_2"
europe_unemployment_map_2.save("maps/{}.html".format(europe_title_2))
print('https://github.com/gregunz/ada2017/tree/master/03%20-%20Interactive%20Viz/maps/{}.html' \
.format(europe_title_2.replace(" ", "%20").replace("%", "%25")))
europe_unemployment_map_2
Out[10]:
As we can see the recorded unemployment rate in Europe goes from around 3% to approximatelly 24%.
We can clearly notice that Greece and Spain rates are really high, this is obviously expected with the last events that happened in those countries.
Switzerland rate is the lowest, swiss people should be proud of this!
When a country is greyed out, it means that the data was not provided.
In [11]:
swiss_topo_path = r'topojson/ch-cantons.topojson.json'
swiss_json_data = json.load(open(swiss_topo_path, encoding="utf-8"))
In [12]:
# roughly the center of Switzerland
geo_swiss_center = [46.8, 8.35]
# function to draw a map without duplicating lots of code
def draw_swiss_map(df, title, column_name, **args):
swiss_map = folium.Map(geo_swiss_center, tiles='cartodbpositron', zoom_start=7.5)
swiss_map.choropleth(
geo_data=swiss_json_data,
data=df,
topojson='objects.cantons',
columns=['Canton', column_name],
fill_color='OrRd',
key_on='properties.name',
legend_name=title,
**args
)
swiss_map.save("maps/{}.html".format(title))
print('https://github.com/gregunz/ada2017/tree/master/03%20-%20Interactive%20Viz/maps/{}.html' \
.format(title.replace("%", "%25").replace(" ", "%20")))
return swiss_map
In [13]:
draw_swiss_map(swiss_df_canton_grouped, 'Unemployment Rate (%)', 'Rate')
Out[13]:
As we can see above, Eastern and Central Cantons have a lower unemployement rate than the Western Cantons on average.
As mentionned above, the rate we have corresponds to the proportion of people without jobs. We will visualize below the rate of people looking for a job.
Note that it is not the same scale as for the previous map. Here what we want to show is the visualization of the rate of people looking for a job and not compare both rates, if it would have been the case, we would have used the same scale.
In [14]:
# Adding the people looking for a new job that already have a job to the rate
swiss_df_canton_grouped['Looking for a job rate'] = swiss_df_canton_grouped['Rate'] * swiss_df_canton_grouped['Total'] / swiss_df_canton_grouped['Jobless']
In [15]:
draw_swiss_map(swiss_df_canton_grouped, 'Looking for a job (%)', 'Looking for a job rate')
Out[15]:
Both maps follow pretty much the same trends but with a bigger contrast between the Center/East and the West/Border.
To make sure that we use can use the colors to compare the rates between the Swiss and foreign citizens, we use the same colormap for the two following maps.
In [16]:
threshold_scale = np.linspace(
min([swiss_df_canton_grouped['Suisses'].min(), swiss_df_canton_grouped['Etrangers'].min()]),
max([swiss_df_canton_grouped['Suisses'].max(), swiss_df_canton_grouped['Etrangers'].max()]),
6,
dtype=int
).tolist()
In [17]:
draw_swiss_map(swiss_df_canton_grouped,
'Unemployment Rate (%) of Swiss Citizen',
'Suisses',
threshold_scale=threshold_scale)
Out[17]:
In [18]:
draw_swiss_map(
swiss_df_canton_grouped,
"Unemployment Rate (%) of foreign Citizen",
"Etrangers",
threshold_scale=threshold_scale
)
Out[18]:
We can see that more foreign people (in proportion) are unemployed than swiss people in general.
In [19]:
swiss_df_canton_grouped['Foreign - Swiss'] = (swiss_df_canton_grouped['Etrangers'] - swiss_df_canton_grouped['Suisses']) / swiss_df_canton_grouped['Etrangers']
In [20]:
draw_swiss_map(
swiss_df_canton_grouped,
'Differerence Swiss Foreign Unemployment Rate (%) of Citizen',
'Foreign - Swiss'
)
Out[20]:
This map shows in which cantons there are the biggest differences between swiss and foreign citizen in the unemployment rate. The closer we are to 1 means that
In [21]:
swiss_df_canton_grouped['Foreign / Swiss'] = swiss_df_canton_grouped['Etrangers'] / swiss_df_canton_grouped['Suisses']
In [22]:
draw_swiss_map(
swiss_df_canton_grouped,
'Ratio Swiss Foreign Unemployment Rate (%) of Citizen',
'Foreign / Swiss'
)
Out[22]:
Here, we can see that where the unemployment rate was the worse, foreigners and Swiss have almost equal chances to be unemployed. Whereas in the center of Switzerland, foreigners can be up to 7 times more likely to be unemployed.
This could come from a more conservative state of mind present in Central Switzerland and can be observed frequently if we look a the votation results.
We use the predifined 3 ages categories provided by Amstat, that is: 15-24, 25-49 and 50+
In [23]:
swiss_age_canton_df = swiss_df.groupby(["Canton", "Age category"]).sum().reset_index()
fig, ax = plt.subplots(figsize=(20,10))
plot = sns.barplot(x="Canton", y="Rate", hue="Age category", data=swiss_age_canton_df);
plot.set_xticklabels(plot.get_xticklabels(), rotation=90, size=20)
plt.show()
Even though it is not asked, let's plot three different maps. (One for each age category)
In [24]:
ages = ["15-24 ans", "25-49 ans", "50 ans et plus"]
threshold_scale_age = np.linspace(
min([swiss_age_canton_df[swiss_age_canton_df["Age category"] == age]["Rate"].min() for age in ages]),
max([swiss_age_canton_df[swiss_age_canton_df["Age category"] == age]["Rate"].max() for age in ages]),
6,
dtype=float
).tolist()
maps = list()
for age in ages:
maps.append(draw_swiss_map(
swiss_age_canton_df[swiss_age_canton_df["Age category"] == age],
"Unemployement rate (of {})".format(age),
"Rate",
threshold_scale = threshold_scale_age
))
def draw_map_age(age):
return maps[ages.index(age)]
In [25]:
interact(draw_map_age, age=ages);
Unfortunatly, the maps above won't show because we are using an interactive library, but here are some screenshots:
In [26]:
Image('./img/interact1.png')
Out[26]:
In [28]:
Image('./img/interact2.png')
Out[28]:
In [29]:
Image('./img/interact3.png')
Out[29]:
Observations:
More young people are unemployed and it's decreasing with the age. Suprisingly enough people 50+ have good rates!
Note that we let the canton Ticino be in neither side because it is generally omited from the Röstigraben separation. Also we do not separate "upper Valais" and "Lower Valais", we do it that way to be consistent with the data Amstat provides us.
In other words, this is a simplified version separating Latin* cantons from the Germanic* ones
*(we look at the majority, we don't take into account billingual cantons)
In [30]:
# copy of swiss df without row "Total"
rosti_df = swiss_df[swiss_df['Canton'] != 'Total'].copy()
# adding a column with the weighted rate in order to compute the average rate per "röstigraben part"
rosti_df["Weighted rate"] = rosti_df["Rate"] * rosti_df["Total"]
# aggregating weighted rates with respect to their "röstigraben part"
rosti_df = rosti_df.groupby(["Language region"])[["Weighted rate", "Total"]].sum().head()
# adding a column with the rate per "röstigraben part"
rosti_df["Rate by rösti part"] = rosti_df["Weighted rate"] / rosti_df["Total"]
swiss_df_with_rosti = pd.merge(swiss_df, rosti_df.reset_index(), on="Language region")
In [31]:
draw_swiss_map(
swiss_df_with_rosti,
'Unemployment Rate (%) in Röstigraben parts',
'Rate by rösti part',
line_weight=0,
fill_opacity=0.7,
smooth_factor=0.1,
)
Out[31]:
We aggregated the cantons together depending on the language spoken and computed the weighted average of the unemployed rate depending on the number of unemployed people. We can clearly see a huge difference between the French (& Italian) speaking part and the rest (almost a factor of 2), this separation is also called the "Röstigraben" (Ticcino might be handle separately but we decided to show germanic vs latin).
In [32]:
Image('./img/get_europe_rates.png')
Out[32]:
In [33]:
Image('./img/get_swiss_rates_1.png')
Out[33]:
In [34]:
Image('./img/get_swiss_rates_2.png')
Out[34]:
In [35]:
Image('./img/get_swiss_rates_3.png')
Out[35]:
In [36]:
Image('./img/get_swiss_rates_4.png')
Out[36]:
In [37]:
Image('./img/get_swiss_rates_5.png')
Out[37]: