Author: Pascal pascal@bayesimpact.org
Date: 2017-12-19
In November 2017, the CREST asked us to analyze our users along an urban vs. rural dimension. While investigating, we found a dataset from INSEE describing what they call urban entities.
This notebook analyses this dataset. For each city in France, it gives the urban entity the city is part of. According to the documentation, an urban entity is a contiguous urban area with less than 200m between buildings. Areas with fewer than 2,000 inhabitants are considered rural.
Here we use the cleaned_data lib that already does the import and basic cleaning on the data. To get the data required to run this notebook run:
docker-compose run --rm data-analysis-prepare make \
data/geo/french_urban_entities.xls \
data/geo/french_cities.csv \
data/geo/insee_france_cities.tsv
Let's open 3 datasets related to cities: the urban entities, the index of all French cities and the French city stats:
In [1]:
import os
from os import path
import pandas as pd
from bob_emploi.data_analysis.lib import cleaned_data
DATA_FOLDER = os.getenv('DATA_FOLDER')
urban_entities = cleaned_data.french_urban_entities(DATA_FOLDER)
urban_entities.head()
Out[1]:
In [2]:
cities = cleaned_data.french_cities(DATA_FOLDER)
cities.head()
Out[2]:
In [3]:
city_stats = cleaned_data.french_city_stats(DATA_FOLDER)
city_stats.head()
Out[3]:
Pretty nice: they are all indexed by the city ID, or "Code Officiel Géographique", so we can merge these three datasets. While doing that, let's make sure we restrict to current cities only and leave out arrondissements:
In [4]:
all_cities = pd.merge(
    cities[cities.current & ~cities.arrondissement], city_stats,
    right_index=True, left_index=True, how='outer')
all_cities = pd.merge(
    all_cities, urban_entities,
    right_index=True, left_index=True, how='outer')
all_cities.head()
Out[4]:
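As a side note, here is a minimal sketch of how an outer merge on the index behaves, using two toy tables with made-up city IDs and values (they are not taken from the real datasets). The `indicator=True` option adds a `_merge` column showing which side each row came from, which helps explain why some merged rows end up without a name.

```python
import pandas as pd

# Toy stand-ins for the real tables, indexed by city ID (made-up values).
toy_cities = pd.DataFrame(
    {'name': ['Lyon', 'Villeurbanne']}, index=['69123', '69266'])
toy_urban = pd.DataFrame(
    {'urban': [7, 7, 0]}, index=['69123', '69266', '99999'])

# Outer merge on the index keeps IDs present on either side.
merged = pd.merge(
    toy_cities, toy_urban,
    left_index=True, right_index=True, how='outer', indicator=True)

# Rows that exist only in the urban table have no city name.
missing_name = merged[merged['name'].isnull()]
print(merged['_merge'].value_counts())
print(missing_name)
```

Filtering on a non-null name afterwards, as the next cell does, drops the rows that only exist in the right-hand tables.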
In [5]:
official_cities = all_cities[all_cities.name.notnull()]
official_cities.urban.notnull().value_counts()
Out[5]:
Pretty neat! We have urban data for all the cities. Now let's try to get a better understanding of this data.
The two fields we are going to dig into are urban and UU2010. Supposedly urban gives a score where 0 means rural and then, from 1 to 8, it relates to bigger and bigger urban entities. UU2010 gives the ID of the urban entity the city is part of.
Let's do some quick point checks:
In [6]:
official_cities.sort_values('population', ascending=False)[['name', 'urban', 'UU2010']].head()
Out[6]:
That sounds good: the biggest cities are inside the biggest urban entities.
Let's check one of them:
In [7]:
official_cities[official_cities.UU2010 == '00758']\
.sort_values('population', ascending=False)[['name', 'urban', 'UU2010']].head()
Out[7]:
Cool, those are indeed cities that are part of the Lyon urban entity.
Let's check the other side of the spectrum:
In [8]:
official_cities[official_cities.urban == 0][['name', 'urban', 'UU2010', 'population']].head()
Out[8]:
Indeed, those seem like small villages (population count is low); however, they seem to share a common UU2010 value. Apparently that field is not valid for rural cities:
In [9]:
official_cities[official_cities.urban == 0]\
.groupby(['UU2010', 'departement_id_x'])\
.urban.count().to_frame().head()
Out[9]:
Alright, there seems to be a unique UU2010 per département assigned to all rural cities in this département. We will make sure to ignore it.
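One possible way to ignore that placeholder value is to blank out UU2010 for rural rows. The sketch below uses a toy table with made-up IDs, and the `urban_entity_id` column name is hypothetical; it is not something the notebook or the cleaned_data lib defines.

```python
import pandas as pd

# Toy table: two rural cities sharing a placeholder ID, one urban city.
toy = pd.DataFrame(
    {
        'urban': [0, 0, 3],
        'UU2010': ['01000', '01000', '01302'],
    },
    index=['c1', 'c2', 'c3'],
)

# Keep UU2010 only where the city is urban; rural rows become NaN.
toy['urban_entity_id'] = toy['UU2010'].where(toy['urban'] > 0)
print(toy)
```

With the placeholder replaced by a missing value, a later groupby on the new column leaves rural cities out instead of lumping them together per département.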
Now let's see global stats for each level of urban entities:
In [10]:
def _stats_per_urban_group(cities):
    if cities.urban.iloc[0]:
        entities_population = cities.groupby('UU2010').population.sum()
    else:
        # Not grouping as UU2010 has no meaning for rural areas.
        entities_population = cities.population
    return pd.Series({
        'total_population': entities_population.sum().astype(int),
        'min_entity_population': entities_population.min().astype(int),
        'max_entity_population': entities_population.max().astype(int),
        'avg_entity_population': entities_population.mean().astype(int),
        'num_entities': len(entities_population),
    })

urban_stats = official_cities.groupby('urban').apply(_stats_per_urban_group)
urban_stats
Out[10]:
OK, there are many interesting things in those stats. First, the size of entities seems to be globally consistent with the documentation: entity levels are defined by their sizes. For the smaller levels, though, there seem to be some slight inconsistencies, which we'll attribute to the population data not being very precise.
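To make the aggregation logic concrete, here is a small sketch on toy data (all IDs and populations are made up). It is an alternative formulation, not the cell above: rural cities each count as their own entity, while urban cities are first summed per UU2010, and the per-level stats are computed from those entity populations.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'urban': [0, 0, 1, 1, 2],
        'UU2010': ['01000', '01000', '01101', '01101', '01201'],
        'population': [300, 500, 1500, 1200, 6000],
    },
    index=['c1', 'c2', 'c3', 'c4', 'c5'],
)

# For rural rows, use the city ID as the entity key since UU2010 is
# only a placeholder there; for urban rows, use UU2010 itself.
entity_key = toy['UU2010'].where(
    toy['urban'] > 0, 'city:' + toy.index.to_series())

# Population per entity, then summary stats per urban level.
entity_population = toy.groupby(['urban', entity_key])['population'].sum()
toy_stats = entity_population.groupby(level='urban').agg(
    ['sum', 'min', 'max', 'mean', 'count'])
print(toy_stats)
```

In this toy example the two level-1 cities collapse into a single entity, whereas the two rural cities stay separate.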
Let's check the distribution of the number of entities by level:
In [11]:
COLOR_MAP = [
    '#e0f2f1',
    '#c8e6c9',
    '#c5e1a5',
    '#dce775',
    '#ffee58',
    '#ffc107',
    '#ff9800',
    '#ff5722',
    '#795548',
]
urban_stats.num_entities.plot(kind='pie', figsize=(5, 5), colors=COLOR_MAP);
The vast majority of cities are rural, and very few of them are part of the largest urban entities.
Let's look at it from another angle, and check the population distribution:
In [12]:
urban_stats.total_population.plot(kind='pie', figsize=(5, 5), colors=COLOR_MAP);
OK, this is a whole other picture: rural areas account for less than a quarter of the population, and half of the population actually lives in urban entities of level 6 or above (each such entity has more than 100k inhabitants).
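The share of population at or above a given level can also be read as a number rather than from the pie chart. Here is a minimal sketch with a hypothetical helper and made-up totals; on the real data one would presumably pass `urban_stats.total_population` instead.

```python
import pandas as pd


def population_share_at_or_above(total_population, min_level):
    """Fraction of the population living at urban level >= min_level.

    `total_population` is a Series indexed by urban level.
    """
    selected = total_population[total_population.index >= min_level]
    return selected.sum() / total_population.sum()


# Made-up totals per urban level, for illustration only.
toy_totals = pd.Series({0: 20, 1: 10, 5: 20, 6: 25, 8: 25})
share = population_share_at_or_above(toy_totals, 6)
print(share)
```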
Finally, let's plot the urban entities for metropolitan France:
In [13]:
is_in_metropol = (official_cities.longitude > -5) & (official_cities.latitude > 25)
official_cities[is_in_metropol & official_cities.urban.notnull()]\
.sort_values('urban')\
.plot(kind='scatter', x='longitude', y='latitude', s=5, c='urban', figsize=(12, 10));
Nice! The largest urban entities are located where we know the largest cities are, with the added benefit of showing how far each one extends.
The urban entities dataset is quite clean. The main takeaway is that although more than 80% of cities are rural, less than 25% of the population lives in a rural area. The slicing by urban level (from 1 to 8) can also be used to distinguish people living in small or large urban areas, even when their own city is just a small one next to a big one.
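As an illustration of that last point, here is a sketch of a helper that turns the urban level into a coarse label. The function name, the labels, and the choice of level 6 as the cut-off between small and large entities are assumptions made for this example; level 6 is picked only because it is the "larger than 100k inhabitants" level mentioned above.

```python
import math


def urban_context(urban_level):
    """Map an urban level (0 to 8) to a coarse, hypothetical label."""
    if urban_level is None:
        return None
    if isinstance(urban_level, float) and math.isnan(urban_level):
        # Missing urban data: no label.
        return None
    if urban_level == 0:
        return 'rural'
    if urban_level >= 6:
        return 'large urban entity'
    return 'small urban entity'


print(urban_context(0))  # rural
print(urban_context(3))  # small urban entity
print(urban_context(7))  # large urban entity
```

Applied to the urban column with `Series.map`, this would give one label per city regardless of the city's own population.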