matta - view and scaffold d3.js visualizations in IPython notebooks

Let's Make a Map Too (and a Cartogram!)

Inspired by Mike Bostock's Let's Make a Map, we want to make a map too using matta. We will display the communes of Santiago, Chile. To do that we will perform the following steps:

Download the administrative borders of Chile in a shapefile from the Chilean Library of Congress.
Use ogr2ogr to filter and clip the borders of the city of Santiago, as well as converting the result to GeoJSON.
Convert the GeoJSON file to TopoJSON.
Display the TopoJSON file using matta in the IPython notebook.
Download Human Development Index from Wikipedia and make a choropleth/symbol map using matta.



In [2]:

    
from __future__ import print_function, unicode_literals



In [1]:

    
import matta
import json
import unicodedata

# we do this to load the required libraries when viewing on NBViewer
matta.init_javascript(path='https://rawgit.com/carnby/matta/master/matta/libs')









    



/home/egraells/Dropbox/phd/apps/matta/matta/visualizations/cartography/template.css
/home/egraells/Dropbox/phd/apps/matta/matta/visualizations/parsets/template.css






    Out[1]:





matta Javascript code added.

Download Shapefiles

Note We delete data to start from 0



In [4]:

    
!rm -fr data
!mkdir data
!wget http://siit2.bcn.cl/obtienearchivo?id=repositorio/10221/10396/1/division_comunal.zip -O data/division_comunal.zip









    



--2015-12-20 22:18:18--  http://siit2.bcn.cl/obtienearchivo?id=repositorio/10221/10396/1/division_comunal.zip
Resolviendo siit2.bcn.cl (siit2.bcn.cl)... 200.0.66.71
Conectando con siit2.bcn.cl (siit2.bcn.cl)[200.0.66.71]:80... conectado.
Petición HTTP enviada, esperando respuesta... 200 OK
Longitud: 29000232 (28M) [application/zip]
Grabando a: “data/division_comunal.zip”

100%[======================================>] 29.000.232  1,25MB/s   en 22s    

2015-12-20 22:18:47 (1,23 MB/s) - “data/division_comunal.zip” guardado [29000232/29000232]



In [5]:

    
!unzip data/division_comunal.zip -d data/









    



Archive:  data/division_comunal.zip
  inflating: data/Disclaimer.txt     
  inflating: data/division_comunal.dbf  
  inflating: data/division_comunal.prj  
  inflating: data/division_comunal.sbn  
  inflating: data/division_comunal.sbx  
  inflating: data/division_comunal.shp  
  inflating: data/division_comunal.shp.xml  
  inflating: data/division_comunal.shx

Convert to GeoJSON

You can use ogrinfo to see the structure of the source shapefile.



In [6]:

    
!ogrinfo data/division_comunal.shp 'division_comunal' | head -n 30









    



INFO: Open of `data/division_comunal.shp'
      using driver `ESRI Shapefile' successful.

Layer name: division_comunal
Geometry: Polygon
Feature Count: 346
Extent: (-3701712.293900, 3794823.357600) - (704690.560200, 8065196.816300)
Layer SRS WKT:
PROJCS["WGS_1984_UTM_Zone_19S",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137.0,298.257223563]],
        PRIMEM["Greenwich",0.0],
        UNIT["Degree",0.0174532925199433]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["False_Easting",500000.0],
    PARAMETER["False_Northing",10000000.0],
    PARAMETER["Central_Meridian",-69.0],
    PARAMETER["Scale_Factor",0.9996],
    PARAMETER["Latitude_Of_Origin",0.0],
    UNIT["Meter",1.0]]
NOM_REG: String (50.0)
NOM_PROV: String (20.0)
NOM_COM: String (30.0)
SHAPE_LENG: Real (19.11)
DIS_ELEC: Integer (4.0)
CIR_SENA: Integer (4.0)
COD_COMUNA: Integer (4.0)
SHAPE_Le_1: Real (19.11)
SHAPE_Area: Real (19.11)

Now we use ogr2ogr to convert the shapefile into GeoJSON. We will build a file for Santiago (its capital) only.

Notes:

After some manual inspection, we know that NOM_PROV contains the name of the parent administrative divisions of the city. We use a where clause to filter those.
We delete santiago-comunas-geo.json in case it exists (that is, when we re-run the notebook :) ).
We use the -clipdst option to specify a bounding box obtained in this site. We need it because some populated locations are much bigger in administrative terms than in population terms.
We also use the -t_srs EPSG:4326 option to convert the data coordinates to (longitude,latitude) pairs.



In [7]:

    
!ogr2ogr -where "NOM_PROV IN ('Santiago', 'Maipo', 'Cordillera', 'Chacabuco', 'Talagante')" -f GeoJSON \
    -clipdst -70.828155 -33.635036 -70.452573 -33.302953 -t_srs EPSG:4326 \
    data/santiago-comunas-geo.json data/division_comunal.shp

From GeoJSON to TopoJSON



In [8]:

    
!topojson -p --id-property NOM_COM -s 0 -o data/santiago-comunas-topo.json data/santiago-comunas-geo.json









    



bounds: -70.828155 -33.635036 -70.452573 -33.302953 (spherical)
pre-quantization: 0.0418m (3.76e-7°) 0.0369m (3.32e-7°)
topology: 274 arcs, 5195 points
post-quantization: 4.18m (0.0000376°) 3.69m (0.0000332°)
prune: retained 274 / 274 arcs (100%)



In [9]:

    
!ls -lh data/*.json









    



-rw-rw-r-- 1 egraells egraells 449K dic 20 22:19 data/santiago-comunas-geo.json
-rw-rw-r-- 1 egraells egraells  55K dic 20 22:19 data/santiago-comunas-topo.json



In [10]:

    
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

with open('data/santiago-comunas-topo.json', 'r') as f:
    stgo = json.load(f)
    
for g in stgo['objects']['santiago-comunas-geo']['geometries']:
    g['id'] = strip_accents(g['id'].title())
    g['properties']['id'] = g['id']

Here is an example feature:



In [11]:

    
stgo['objects']['santiago-comunas-geo']['geometries'][0]









    Out[11]:





{u'arcs': [[0, 1, 2, 3, 4, 5, 6, 7]],
 u'id': u'Independencia',
 u'properties': {u'CIR_SENA': 7,
  u'COD_COMUNA': 1310,
  u'DIS_ELEC': 19,
  u'NOM_COM': u'Independencia',
  u'NOM_PROV': u'Santiago',
  u'NOM_REG': u'Regi\xf3n Metropolitana de Santiago',
  u'SHAPE_Area': 7514745.51089,
  u'SHAPE_LENG': 11488.6957475,
  u'SHAPE_Le_1': 11718.6870863,
  u'id': u'Independencia'},
 u'type': u'Polygon'}

Display TopoJSON using matta



In [12]:

    
matta.cartography(geometry=stgo, leaflet=False, label='NOM_COM',
                 fill_color={'value': 'id', 'palette': str('husl'), 'scale': 'ordinal', 'n_colors': 40})

Choroplet and Symbol Map

Now that we can load a map and giving some colors to areas, let's try to visualize some data. The easiest way to do so would be to get a Wikipedia page with data for Santiago. From it, we will get the Human Development Index (IDH in Spanish) score for each municipality of the city.



In [13]:

    
import pandas as pd
df = pd.read_html('https://es.wikipedia.org/wiki/Anexo:Comunas_de_Santiago_de_Chile', 
                  attrs={'class': 'sortable'}, header=0)[0]
df.head()









    Out[13]:






  
    
      
      Comuna
      Ubicación?
      Población (2002)?
      Viviendas (2002)?
      Densidad poblacional?
      Crecimiento demográfico (1992-2002)?
      IDH (2003)?
      Pobreza (2006)?
    
  
  
    
      0
      Cerrillos
      surponiente
      71.906
      19.811
      4.32908
      -10
      0,743 (54)
      83
    
    
      1
      Cerro Navia
      norponiente
      148.312
      35.277
      13.48291
      -48
      0,683 (165)
      175
    
    
      2
      Conchalí
      norte
      133.256
      32.609
      12.07029
      -129
      0,707 (118)
      80
    
    
      3
      El Bosque
      sur
      175.594
      42.808
      12.27072
      16
      0,711 (106)
      158
    
    
      4
      Estación Central
      surponiente
      130.394
      32.357
      9.03631
      -75
      0,735 (60)
      73

Data is not clean due to locale settings. Fortunately, we just want the HDI column, which should be easy to convert to a meaningful float. Let's fix the HDI and Population columns.



In [14]:

    
df['Comuna'] = [strip_accents(c).replace('?', '').title() for c in df['Comuna']]
df['HDI'] = [float(c.split()[0].replace(',', '.')) for c in df['IDH (2003)?']]
del df['IDH (2003)?']
df.head()









    Out[14]:






  
    
      
      Comuna
      Ubicación?
      Población (2002)?
      Viviendas (2002)?
      Densidad poblacional?
      Crecimiento demográfico (1992-2002)?
      Pobreza (2006)?
      HDI
    
  
  
    
      0
      Cerrillos
      surponiente
      71.906
      19.811
      4.32908
      -10
      83
      0.743
    
    
      1
      Cerro Navia
      norponiente
      148.312
      35.277
      13.48291
      -48
      175
      0.683
    
    
      2
      Conchali
      norte
      133.256
      32.609
      12.07029
      -129
      80
      0.707
    
    
      3
      El Bosque
      sur
      175.594
      42.808
      12.27072
      16
      158
      0.711
    
    
      4
      Estacion Central
      surponiente
      130.394
      32.357
      9.03631
      -75
      73
      0.735



In [15]:

    
df['Population'] = df['Población (2002)?'] * 1000
df = df.loc[:,('Comuna', 'HDI', 'Population')]
df.head()









    Out[15]:






  
    
      
      Comuna
      HDI
      Population
    
  
  
    
      0
      Cerrillos
      0.743
      71906
    
    
      1
      Cerro Navia
      0.683
      148312
    
    
      2
      Conchali
      0.707
      133256
    
    
      3
      El Bosque
      0.711
      175594
    
    
      4
      Estacion Central
      0.735
      130394

Much better!



In [16]:

    
df.describe()









    Out[16]:






  
    
      
      HDI
      Population
    
  
  
    
      count
      37.000000
      37.000000
    
    
      mean
      0.762865
      146718.648649
    
    
      std
      0.076155
      106077.982722
    
    
      min
      0.657000
      2477.000000
    
    
      25%
      0.709000
      85118.000000
    
    
      50%
      0.737000
      120874.000000
    
    
      75%
      0.804000
      175594.000000
    
    
      max
      0.949000
      492603.000000

Now let's visualize HDI for each municipality.



In [17]:

    
matta.cartography(geometry=stgo, label='NOM_COM', width=700,
                  area_dataframe=df, area_feature_name='Comuna', 
                  area_color={'value': 'HDI', 'scale': 'threshold', 'palette': 'coolwarm', 'n_colors': 7, 'legend': True}, 
                  leaflet=False)

What happened there? We bound a DataFrame (area_dataframe) to the geometry by using the area_feature_name column in the DataFrame. Using this, our code was able to match a feature from the TopoJSON structure with a row from the DataFrame. Note that some areas are not drawn - they were not available in the DataFrame.

We can also visualize this information using a cartogram. According to Wikipedia:

A cartogram is a map in which some thematic mapping variable – such as travel time, population, or Gross National Product – is substituted for land area or distance. The geometry or space of the map is distorted in order to convey the information of this alternate variable.

For instance, we observe that the areas that have a high HDI are, indeed, big. But how much people inhabit them?



In [38]:

    
matta.cartogram(geometry=stgo, width=700,
                area_dataframe=df, area_feature_name='Comuna', area_opacity=1.0,
                area_color={'value': 'HDI', 'scale': 'threshold', 'palette': 'coolwarm', 'n_colors': 7}, 
                area_value='Population', na_value=df['Population'].min() * 0.75)

As we can see, the shape of each area is different now! The differences are not very dramatic, but this is just an example.



In [ ]:

	Comuna	Ubicación?	Población (2002)?	Viviendas (2002)?	Densidad poblacional?	Crecimiento demográfico (1992-2002)?	IDH (2003)?	Pobreza (2006)?
0	Cerrillos	surponiente	71.906	19.811	4.32908	-10	0,743 (54)	83
1	Cerro Navia	norponiente	148.312	35.277	13.48291	-48	0,683 (165)	175
2	Conchalí	norte	133.256	32.609	12.07029	-129	0,707 (118)	80
3	El Bosque	sur	175.594	42.808	12.27072	16	0,711 (106)	158
4	Estación Central	surponiente	130.394	32.357	9.03631	-75	0,735 (60)	73

	Comuna	HDI	Population
0	Cerrillos	0.743	71906
1	Cerro Navia	0.683	148312
2	Conchali	0.707	133256
3	El Bosque	0.711	175594
4	Estacion Central	0.735	130394

	HDI	Population
count	37.000000	37.000000
mean	0.762865	146718.648649
std	0.076155	106077.982722
min	0.657000	2477.000000
25%	0.709000	85118.000000
50%	0.737000	120874.000000
75%	0.804000	175594.000000
max	0.949000	492603.000000