Network analysis

First, import relevant libraries:


In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.dates as mdates
from pylab import *

import igraph as ig # Need to install this in your virtual environment

# from re import sub

from scipy.spatial.distance import squareform, pdist # Use for getting weights for layout

from scipy.cluster.hierarchy import dendrogram, linkage

In [2]:
# import os
# import sys
# sys.path.append('/home/mmalik/optourism-repo' + "/pipeline")
# from firenzecard_analyzer import *

In [2]:
import sys
sys.path.append('../../src/')
from utils.database import dbutils

conn = dbutils.connect()
cursor = conn.cursor()

Then, load the data (takes a few moments):


In [3]:
nodes = pd.read_sql('select * from optourism.firenze_card_locations', con=conn)
nodes.head()


Out[3]:
museum_name longitude latitude museum_id short_name string
0 Basilica di Santa Croce 11.262598 43.768754 1 Santa Croce C
1 Basilica San Lorenzo 11.254430 43.774932 2 San Lorenzo 2
2 Battistero di San Giovanni 11.254966 43.773131 3 Opera del Duomo D
3 Biblioteca Medicea Laurenziana 11.253924 43.774799 4 Laurenziana l
4 Cappella Brancacci 11.243859 43.768334 5 Brancacci b

In [22]:
df = pd.read_sql('select * from optourism.firenze_card_logs', con=conn)
df['museum_id'].replace(to_replace=39,value=38,inplace=True)
df['short_name'] = df['museum_id'].replace(dict(zip(nodes['museum_id'],nodes['short_name'])))
df['string'] = df['museum_id'].replace(dict(zip(nodes['museum_id'],nodes['string'])))
df['date'] = pd.to_datetime(df['entry_time'], format='%Y-%m-%d %H:%M:%S').dt.date
df['hour'] = pd.to_datetime(df['date']) + pd.to_timedelta(pd.to_datetime(df['entry_time'], format='%Y-%m-%d %H:%M:%S').dt.hour, unit='h')
df['total_people'] = df['total_adults'] + df['minors']
df.head()


Out[22]:
user_id museum_name entry_time adults_first_use adults_reuse total_adults minors museum_id short_name string date hour total_people
0 2089098 Palazzo Pitti 2016-09-19 14:49:00 0 1 1 0 38 Pitti P 2016-09-19 2016-09-19 14:00:00 1
1 2089099 Palazzo Pitti 2016-09-19 14:49:00 0 1 1 0 38 Pitti P 2016-09-19 2016-09-19 14:00:00 1
2 2083344 Palazzo Pitti 2016-09-19 14:57:00 0 1 1 0 38 Pitti P 2016-09-19 2016-09-19 14:00:00 1
3 2083335 Palazzo Pitti 2016-09-19 14:57:00 0 1 1 0 38 Pitti P 2016-09-19 2016-09-19 14:00:00 1
4 2083304 Palazzo Pitti 2016-09-19 14:58:00 0 1 1 0 38 Pitti P 2016-09-19 2016-09-19 14:00:00 1

In [23]:
# Helper function for making summary tables/distributions
def frequency(dataframe,columnname):
    out = dataframe[columnname].value_counts().to_frame()
    out.columns = ['frequency']
    out.index.name = columnname
    out.reset_index(inplace=True)
    out.sort_values(columnname,inplace=True)
    out['cumulative'] = out['frequency'].cumsum()/out['frequency'].sum()
    out['ccdf'] = 1 - out['cumulative']
    return out

I propose distinguishing paths from flows. A path is an itinerary, and the flow is the number of people who take the flow. E.g., a family or a tour group produces one path, but adds mulitple people to the overall flow.

We now build a transition graph, a directed graph where an edge represents a person going from one museum to another within the same day.

We also produce the transition matrix, a row-normalized n-by-n matrix of the frequency of transition from the row node to the column node. If you take a vector of the current volumes in each location, and multiply that my the transition matrix, you get a prediction for the number of people on each node at the next time. This prediction can be refined with corrections for daily/weekly patterns and such.

Transition/Origin-Destination (OD) matrix

Now, we make a graph of the transitions for museums. To do this, we make an edgelist out of the above.

Specifically, we want an edgelist where the first column is the origin site, the second column is the destination site, the third column is the number of people (total adults plus rows for minors), and the fourth column is the time stamp of the entry to the destination museum.

But, there's a twist. We want to track when people arrive at the first museum of their day. We can do this by adding a dummy "source" node that everybody starts each day from. We can then query this dummy node to see not only which museum people activate their Firenze card from, but also the museum where they start their other days. For visualizations, we can drop it (or not visualize it).

We could also have people return to this source node at the end of each day (or make a separate "target" node for this purpose), but there would be no timestamp for that arrival so it would complicate the data with missing values. However, we might still want to do this, analogously to find the last museum people tend to visit in a day.

I will create this source node by the following: first, create an indicator for if the previous record is the same day and the same Firenze card. If it is, we make a link from the museum of the previous row and the museum of that row.

If the previous row is either a different day and/or a different user_id, make a link between the dummy "source" node and that row's museum.

I do this below in a different order: I initialize a "from" column with all source, then overwrite with the museum of the previous row if the conditions are met.


In [24]:
df4 = df.groupby(['user_id','entry_time','date','hour','museum_name','short_name','string']).sum()['total_people'].to_frame() # Need to group in this order to be correct further down
df4.head()


Out[24]:
total_people
user_id entry_time date hour museum_name short_name string
1459702 2016-06-22 10:04:00 2016-06-22 2016-06-22 10:00:00 Galleria degli Uffizi Uffizi U 1
2016-06-22 14:26:00 2016-06-22 2016-06-22 14:00:00 Museo Casa Dante M. Casa Dante 3 1
2016-06-22 15:49:00 2016-06-22 2016-06-22 15:00:00 Galleria dell'Accademia di Firenze Accademia A 1
2016-06-23 09:43:00 2016-06-23 2016-06-23 09:00:00 Battistero di San Giovanni Opera del Duomo D 1
2016-06-23 11:14:00 2016-06-23 2016-06-23 11:00:00 Museo Galileo M. Galileo G 1

In [25]:
df4.reset_index(inplace=True)
df4.head(10)


Out[25]:
user_id entry_time date hour museum_name short_name string total_people
0 1459702 2016-06-22 10:04:00 2016-06-22 2016-06-22 10:00:00 Galleria degli Uffizi Uffizi U 1
1 1459702 2016-06-22 14:26:00 2016-06-22 2016-06-22 14:00:00 Museo Casa Dante M. Casa Dante 3 1
2 1459702 2016-06-22 15:49:00 2016-06-22 2016-06-22 15:00:00 Galleria dell'Accademia di Firenze Accademia A 1
3 1459702 2016-06-23 09:43:00 2016-06-23 2016-06-23 09:00:00 Battistero di San Giovanni Opera del Duomo D 1
4 1459702 2016-06-23 11:14:00 2016-06-23 2016-06-23 11:00:00 Museo Galileo M. Galileo G 1
5 1459702 2016-06-23 12:57:00 2016-06-23 2016-06-23 12:00:00 Museo di Palazzo Vecchio M. Palazzo Vecchio V 1
6 1459702 2016-06-23 13:41:00 2016-06-23 2016-06-23 13:00:00 Museo Nazionale del Bargello M. Bargello B 1
7 1459702 2016-06-23 15:05:00 2016-06-23 2016-06-23 15:00:00 Basilica di Santa Croce Santa Croce C 1
8 1473903 2016-06-19 11:24:00 2016-06-19 2016-06-19 11:00:00 Galleria degli Uffizi Uffizi U 1
9 1473903 2016-06-20 12:05:00 2016-06-20 2016-06-20 12:00:00 Battistero di San Giovanni Opera del Duomo D 1

In [26]:
df4['from'] = 'start' # Initialize 'from' column with 'start'
df4['to'] = df4['short_name'] # Copy 'to' column with row's museum_name
df4.head(10)


Out[26]:
user_id entry_time date hour museum_name short_name string total_people from to
0 1459702 2016-06-22 10:04:00 2016-06-22 2016-06-22 10:00:00 Galleria degli Uffizi Uffizi U 1 start Uffizi
1 1459702 2016-06-22 14:26:00 2016-06-22 2016-06-22 14:00:00 Museo Casa Dante M. Casa Dante 3 1 start M. Casa Dante
2 1459702 2016-06-22 15:49:00 2016-06-22 2016-06-22 15:00:00 Galleria dell'Accademia di Firenze Accademia A 1 start Accademia
3 1459702 2016-06-23 09:43:00 2016-06-23 2016-06-23 09:00:00 Battistero di San Giovanni Opera del Duomo D 1 start Opera del Duomo
4 1459702 2016-06-23 11:14:00 2016-06-23 2016-06-23 11:00:00 Museo Galileo M. Galileo G 1 start M. Galileo
5 1459702 2016-06-23 12:57:00 2016-06-23 2016-06-23 12:00:00 Museo di Palazzo Vecchio M. Palazzo Vecchio V 1 start M. Palazzo Vecchio
6 1459702 2016-06-23 13:41:00 2016-06-23 2016-06-23 13:00:00 Museo Nazionale del Bargello M. Bargello B 1 start M. Bargello
7 1459702 2016-06-23 15:05:00 2016-06-23 2016-06-23 15:00:00 Basilica di Santa Croce Santa Croce C 1 start Santa Croce
8 1473903 2016-06-19 11:24:00 2016-06-19 2016-06-19 11:00:00 Galleria degli Uffizi Uffizi U 1 start Uffizi
9 1473903 2016-06-20 12:05:00 2016-06-20 2016-06-20 12:00:00 Battistero di San Giovanni Opera del Duomo D 1 start Opera del Duomo

In [27]:
make_link = (df4['user_id'].shift(1)==df4['user_id'])&(df4['date'].shift(1)==df4['date']) # Row indexes at which to overwrite 'source'
df4['from'][make_link] = df4['short_name'].shift(1)[make_link]
df4.head(10)


Out[27]:
user_id entry_time date hour museum_name short_name string total_people from to
0 1459702 2016-06-22 10:04:00 2016-06-22 2016-06-22 10:00:00 Galleria degli Uffizi Uffizi U 1 start Uffizi
1 1459702 2016-06-22 14:26:00 2016-06-22 2016-06-22 14:00:00 Museo Casa Dante M. Casa Dante 3 1 Uffizi M. Casa Dante
2 1459702 2016-06-22 15:49:00 2016-06-22 2016-06-22 15:00:00 Galleria dell'Accademia di Firenze Accademia A 1 M. Casa Dante Accademia
3 1459702 2016-06-23 09:43:00 2016-06-23 2016-06-23 09:00:00 Battistero di San Giovanni Opera del Duomo D 1 start Opera del Duomo
4 1459702 2016-06-23 11:14:00 2016-06-23 2016-06-23 11:00:00 Museo Galileo M. Galileo G 1 Opera del Duomo M. Galileo
5 1459702 2016-06-23 12:57:00 2016-06-23 2016-06-23 12:00:00 Museo di Palazzo Vecchio M. Palazzo Vecchio V 1 M. Galileo M. Palazzo Vecchio
6 1459702 2016-06-23 13:41:00 2016-06-23 2016-06-23 13:00:00 Museo Nazionale del Bargello M. Bargello B 1 M. Palazzo Vecchio M. Bargello
7 1459702 2016-06-23 15:05:00 2016-06-23 2016-06-23 15:00:00 Basilica di Santa Croce Santa Croce C 1 M. Bargello Santa Croce
8 1473903 2016-06-19 11:24:00 2016-06-19 2016-06-19 11:00:00 Galleria degli Uffizi Uffizi U 1 start Uffizi
9 1473903 2016-06-20 12:05:00 2016-06-20 2016-06-20 12:00:00 Battistero di San Giovanni Opera del Duomo D 1 start Opera del Duomo

In [28]:
edges = df4[['from', 'to', 'total_people', 'entry_time']]
edges.head()


Out[28]:
from to total_people entry_time
0 start Uffizi 1 2016-06-22 10:04:00
1 Uffizi M. Casa Dante 1 2016-06-22 14:26:00
2 M. Casa Dante Accademia 1 2016-06-22 15:49:00
3 start Opera del Duomo 1 2016-06-23 09:43:00
4 Opera del Duomo M. Galileo 1 2016-06-23 11:14:00

In [39]:
supp = edges[edges['from'].shift(-1)=='start'][['to','total_people']]
supp.columns = ['from','total_people']
supp['to'] = 'end'
supp = supp[['from','to','total_people']]
supp.head()
# supp = df4[df4['from'].shift(-1)=='start'][['to','total_people']]
# supp.columns = ['from','weight']
# supp['to'] = 'end'
# supp = supp[['from','to','weight']]
# supp.head(50)


Out[39]:
from to total_people
2 Accademia end 1
7 Santa Croce end 1
8 Uffizi end 1
11 M. Palazzo Vecchio end 1
13 M. Archeologico end 1

In [59]:
supp_edges = supp.groupby(['from','to'])['total_people'].sum().to_frame().reset_index()
supp_edges


Out[59]:
from to total_people
0 Accademia end 15697
1 Brancacci end 1342
2 Cappelle Medicee end 2888
3 Casa Buonarroti end 452
4 La Specola end 553
5 Laurenziana end 369
6 M. Antropologia end 335
7 M. Archeologico end 886
8 M. Bargello end 3915
9 M. Calcio end 10
10 M. Casa Dante end 1772
11 M. Civici Fiesole end 497
12 M. Ebraico end 483
13 M. Ferragamo end 860
14 M. Galileo end 6666
15 M. Geologia end 113
16 M. Horne end 20
17 M. Innocenti end 731
18 M. Marini end 82
19 M. Mineralogia end 80
20 M. Novecento end 489
21 M. Opificio end 106
22 M. Palazzo Davanzati end 185
23 M. Palazzo Vecchio end 11024
24 M. Preistoria end 11
25 M. San Marco end 927
26 M. Santa Maria Novella end 7949
27 M. Stefano Bardini end 127
28 M. Stibbert end 113
29 Opera del Duomo end 17164
30 Orto Botanico end 447
31 Palazzo Medici end 3776
32 Palazzo Strozzi end 2330
33 Pitti end 21642
34 Planetario end 13
35 San Lorenzo end 3690
36 Santa Croce end 9810
37 Torre di Palazzo Vecchio end 8095
38 Uffizi end 17157
39 V. Bardini end 1287

In [60]:
# Create the actual edgelist for the transition matrix (of a first-order Markov chain)
df5 = pd.concat([edges.groupby(['from','to'])['total_people'].sum().to_frame().reset_index(),supp_edges])
df5.columns = ['from','to','weight']
df5.head(10)


Out[60]:
from to weight
0 Accademia Accademia 2
1 Accademia Brancacci 77
2 Accademia Cappelle Medicee 1277
3 Accademia Casa Buonarroti 49
4 Accademia La Specola 30
5 Accademia Laurenziana 301
6 Accademia M. Antropologia 51
7 Accademia M. Archeologico 826
8 Accademia M. Bargello 1022
9 Accademia M. Calcio 2

In [19]:
# Make exportable dynamic edgelist
# df4[['from','to','total_people','entry_time']].sort_values('entry_time').to_csv('dynamic_edgelist.csv')
out = df4[['from','to','total_people','entry_time']].sort_values('total_people',ascending=False)
out.columns = ['source','target','people','datetime']
out.to_csv('dynamic_edgelist.csv',index=False)

In [20]:
# # Make actual graph object to export as gml
# out = df4[['from','to','total_people','entry_time']].sort_values('total_people',ascending=False)
# out.columns = ['source','target','people','datetime']

# g = ig.Graph.TupleList(df4.itertuples(index=False), directed=True, weights=True)
# ig.summary(g)

In [21]:
# g.es.attributes()

In [61]:
# Create and check the graph
g = ig.Graph.TupleList(df5.itertuples(index=False), directed=True, weights=True)
ig.summary(g)


IGRAPH DNW- 43 1280 -- 
+ attr: name (v), weight (e)

In [23]:
# g.vs['name']

In [62]:
# Save the weighted indegree calculated with the source node before dropping it
indeg = g.strength(mode='in',weights='weight')[0:-1] # This drops the "source" node, which is last

In [70]:
indeg


Out[70]:
[42417.0,
 4488.0,
 18759.0,
 1845.0,
 1274.0,
 7399.0,
 1056.0,
 2638.0,
 14415.0,
 14.0,
 5430.0,
 734.0,
 1350.0,
 1555.0,
 15133.0,
 730.0,
 2472.0,
 282.0,
 302.0,
 1325.0,
 1103.0,
 1365.0,
 32757.0,
 8736.0,
 19595.0,
 383.0,
 230.0,
 49889.0,
 1200.0,
 13234.0,
 4262.0,
 34007.0,
 188.0,
 20086.0,
 22979.0,
 16680.0,
 44339.0,
 2139.0,
 296.0,
 28.0,
 2.0,
 0.0]

In [63]:
# # Delete the dummy 'source' node
# g.delete_vertices([v.index for v in g.vs if v['name']==u'start'])
g.simplify(loops=False, combine_edges=sum)
ig.summary(g)


IGRAPH DNW- 43 1280 -- 
+ attr: name (v), weight (e)

In [64]:
# Put in graph attributes to help with plotting
g.vs['label'] = g.vs["name"] 
# g.vs[sub("'","",i.decode('unicode_escape').encode('ascii','ignore')) for i in g2.vs["name"]] # Is getting messed up!

In [65]:
g.vs['label']


Out[65]:
['Accademia',
 'Brancacci',
 'Cappelle Medicee',
 'Casa Buonarroti',
 'La Specola',
 'Laurenziana',
 'M. Antropologia',
 'M. Archeologico',
 'M. Bargello',
 'M. Calcio',
 'M. Casa Dante',
 'M. Civici Fiesole',
 'M. Ebraico',
 'M. Ferragamo',
 'M. Galileo',
 'M. Geologia',
 'M. Innocenti',
 'M. Marini',
 'M. Mineralogia',
 'M. Novecento',
 'M. Opificio',
 'M. Palazzo Davanzati',
 'M. Palazzo Vecchio',
 'M. San Marco',
 'M. Santa Maria Novella',
 'M. Stefano Bardini',
 'M. Stibbert',
 'Opera del Duomo',
 'Orto Botanico',
 'Palazzo Medici',
 'Palazzo Strozzi',
 'Pitti',
 'Planetario',
 'San Lorenzo',
 'Santa Croce',
 'Torre di Palazzo Vecchio',
 'Uffizi',
 'V. Bardini',
 'M. Horne',
 'M. Preistoria',
 'Primo Conti',
 'start',
 'end']

In [66]:
# Get coordinates, requires this lengthy query
latlon = pd.DataFrame({'short_name':g.vs['label']}).merge(nodes[['short_name','longitude','latitude']],left_index=True,how='left',on='short_name')

In [71]:
latlon


Out[71]:
short_name longitude latitude
10 Accademia 11.258516 43.776755
4 Brancacci 11.243859 43.768334
5 Cappelle Medicee 11.252750 43.774914
6 Casa Buonarroti 11.263593 43.769850
11 La Specola 11.247132 43.764626
3 Laurenziana 11.253924 43.774799
18 M. Antropologia 11.257962 43.771754
13 M. Archeologico 11.261037 43.776634
26 M. Bargello 11.257864 43.770509
16 M. Calcio 11.303383 43.777617
14 M. Casa Dante 11.257062 43.771071
12 M. Civici Fiesole 11.293076 43.807254
21 M. Ebraico 11.265515 43.772972
22 M. Ferragamo 11.251063 43.769812
23 M. Galileo 11.256023 43.767683
37 M. Geologia 11.259840 43.778341
15 M. Innocenti 11.260970 43.776340
25 M. Marini 11.250052 43.771906
38 M. Mineralogia 11.259840 43.778341
27 M. Novecento 11.249096 43.773020
17 M. Opificio 11.256901 43.768732
39 M. Palazzo Davanzati 11.254827 43.770237
40 M. Palazzo Vecchio 11.255600 43.769517
19 M. San Marco 11.258964 43.777506
20 M. Santa Maria Novella 11.249420 43.774049
28 M. Stefano Bardini 11.259193 43.765088
46 M. Stibbert 11.255899 43.792889
2 Opera del Duomo 11.254966 43.773131
47 Orto Botanico 11.261745 43.779411
48 Palazzo Medici 11.255910 43.774764
49 Palazzo Strozzi 11.252241 43.771007
62 Pitti 11.248342 43.765178
8 Planetario 11.264543 43.776782
1 San Lorenzo 11.254430 43.774932
0 Santa Croce 11.262598 43.768754
50 Torre di Palazzo Vecchio 11.256007 43.769281
9 Uffizi 11.255607 43.768526
51 V. Bardini 11.256237 43.764011
24 M. Horne 11.259375 43.767443
41 M. Preistoria 11.259883 43.772897
7 Primo Conti 11.292696 43.812167
62 start NaN NaN
62 end NaN NaN

In [67]:
# Latitude is flipped, need to multiply by -1 to get correct orientation
g.vs['x'] = (latlon['longitude']).values.tolist()
g.vs['y'] = (-1*latlon['latitude']).values.tolist()

In [30]:
# # Make distances matrix for a layout based on geography but with enough spacing to not overplot and cause ZeroDivisionError
# # Want to convert the distances into attraction, do with "gravity" of 1/(d^2)
# dist = pd.DataFrame(squareform(pdist(nodes.iloc[:, 1:3])), columns=nodes['short_name'], index=nodes['short_name'])
# grav = dist.pow(-2)
# np.fill_diagonal(grav.values, 0)
# grav.head()

In [31]:
# A = grav.values
# g2 = ig.Graph.Adjacency((A > 0).tolist())
# g2.es['weight'] = A[A.nonzero()]
# g2.vs['label'] = grav.index

In [32]:
# layout = g2.layout("fr",weights='weight')
# ig.plot(g2,layout=layout)

In [33]:
# layout = g.layout_fruchterman_reingold(seed=nodes[['longitude','latitude']].values.tolist(),maxiter=5,maxdelta=.01)
# ig.plot(g,layout=layout)

In [34]:
# g.delete_edges(g.es.find(_between=(g.vs(name_eq='Torre di Palazzo Vecchio'), g.vs(name_eq='M. Palazzo Vecchio'))))
# g.delete_edges(g.es.find(_between=(g.vs(name_eq='M. Palazzo Vecchio'),g.vs(name_eq='Torre di Palazzo Vecchio'))))
# ig.summary(g)


IGRAPH DNW- 41 1197 -- 
+ attr: label (v), name (v), x (v), y (v), weight (e)

In [68]:
visual_style = {}
visual_style['vertex_size'] = [.000075*i for i in indeg] # .00075 is from hand-tuning
visual_style['vertex_label_size'] = [.00025*i for i in indeg]
visual_style['edge_width'] = [np.floor(.001*i) for i in g.es["weight"]] # Scale weights. .001*i chosen by hand. Try also .05*np.sqrt(i)
# visual_style['edge_curved'] = True
# visual_style["autocurve"] = True
ig.plot(g.as_undirected(), **visual_style) # Positions, for reference


Out[68]:

In [69]:
# layout = g.layout_fruchterman_reingold(seed=nodes[['longitude','latitude']].values.tolist(),maxiter=4)
# layout = g.layout_drl(seed=nodes[['longitude','latitude']].values.tolist(), fixed=[True]*len(g.vs))
# layout = g.layout_drl(seed=nodes[['longitude','latitude']].values.tolist(), fixed=[True]*len(g.vs), weights='weight')
# layout = g.layout_graphopt(seed=nodes[['longitude','latitude']].values.tolist(), niter=1, node_mass=1)
visual_style = {}
visual_style['edge_width'] = [np.floor(.001*i) for i in g.es["weight"]] # Scale weights. .001*i chosen by hand. Try also .05*np.sqrt(i)
visual_style['edge_arrow_size'] = [.00025*i for i in g.es["weight"]] # .00025*i chosen by hand. Try also .01*np.sqrt(i)
visual_style['vertex_size'] = [.00075*i for i in indeg] # .00075 is from hand-tuning
visual_style['vertex_label_size'] = [.00025*i for i in indeg]
visual_style['vertex_color'] = "rgba(100, 100, 255, .75)"
visual_style['edge_color'] = "rgba(0, 0, 0, .25)"
# visual_style['edge_curved'] = True
visual_style["autocurve"] = True
ig.plot(g, 'graph.svg', bbox = (1000,1000), **visual_style)


Out[69]:

In [99]:
# print(g2.get_adjacency()) # This was another check; before it was very nearly upper triangular. Now it looks much better. Copy into a text editor and resize to see the whole matrix.

In [50]:
transition_matrix = pd.DataFrame(g.get_adjacency(attribute='weight').data, columns=g.vs['name'], index=g.vs['name'])

In [325]:
# transition_matrix.loc[transition_matrix.idxmax(axis=0).values].idxmax(axis=1).index

In [326]:
# pd.Index(pd.Series(transition_matrix.loc[transition_matrix.idxmax(axis=0).values].max(axis=1).sort_values(ascending=False).index.unique().tolist() + transition_matrix.max(axis=1).sort_values(ascending=False).to_frame().index.tolist()).unique())

In [327]:
# temp = pd.DataFrame({'museum':transition_matrix.idxmax(axis=0).values,'values':transition_matrix.max(axis=0).values}).groupby('museum').max().sort_values('values',ascending=False)
# temp.reset_index(inplace=True)
# temp2 = transition_matrix.max(axis=0).to_frame().reset_index()
# temp2.columns = ['museum','values']
# temp = temp.append(temp2)
# order = temp.groupby('museum').max().sort_values('values',ascending=False).index

In [328]:
# order

In [329]:
# order = pd.Index([u'Opera del Duomo', 
#                   u'Accademia', 
#                   u'Uffizi', 
#                   u'Palazzo Pitti', 
#                   u'M. Palazzo Vecchio', 
#                   u'M. Galileo', 
#                   u'Santa Croce', 
#                   u'Cappelle Medicee', 
#                   u'San Lorenzo', 
#                   u'Laurenziana', 
#                   u'Palazzo Medici', 
#                   u'M. Bargello', 
#                   u'M. Santa Maria Novella', 
#                   u'M. San Marco', 
#                   u'Torre di Palazzo Vecchio', 
#                   u'M. Casa Dante', 
#                   u'Brancacci', 
#                   u'V. Bardini', 
#                   u'M. Archeologico', 
#                   u'Casa Buonarroti', 
#                   u'M. Innocenti', 
#                   u'M. Novecento', 
#                   u'La Specola', 
#                   u'Palazzo Strozzi', 
#                   u'M. Opificio', 
#                   u'Orto Botanico', 
#                   u'M. Geologia', 
#                   u'M. Mineralogia', 
#                   u'M. Ebraico', 
#                   u'M. Antropologia', 
#                   u'M. Ferragamo', 
#                   u'M. Palazzo Davanzati', 
#                   u'M. Horne', 
#                   u'M. Civici Fiesole', 
#                   u'M. Marini', 
#                   u'M. Stefano Bardini', 
#                   u'Planetario', 
#                   u'M. Stibbert', 
#                   u'M. Calcio', 
#                   u'M. Preistoria', 
#                   u'Primo Conti'])

In [51]:
# transition_matrix.reindex()
# transition_matrix.idxmax(axis=0).values is like "which.max()" in R: which columns have the row max
# Get which column has the max.
# Then, get this actual max...

# order = transition_matrix.loc[transition_matrix.idxmax(axis=0).values].max(axis=1).sort_values(ascending=False).unique().to_frame().index
order = transition_matrix.max(axis=1).sort_values(ascending=False).to_frame().index
# order = pd.Index(pd.Series(transition_matrix.loc 
#                            [
#                                transition_matrix.idxmax(axis=0).values # which column has the max? Order by that. 
#                            ] 
#                            .max(axis=1).sort_values(ascending=False).index.unique().tolist() 
#                            + 
#                            transition_matrix.max(axis=0).sort_values(ascending=False).to_frame().index.tolist()).unique())
mat = transition_matrix[order].reindex(order)

In [395]:
transition_matrix.to_csv('transition_matrix.csv')

In [54]:
transition_matrix.as_matrix()


Out[54]:
array([[     2,     77,   1277, ...,      0,      0,      0],
       [    80,      0,     66, ...,      0,      0,      0],
       [  1587,     98,      1, ...,      0,      0,      0],
       ..., 
       [     0,      0,      0, ...,      0,      0,      0],
       [ 20071,   1453,   5599, ...,      2,      0,      0],
       [     0,      0,      0, ...,      0,      0, 144093]])

In [55]:
Z = linkage(transition_matrix.as_matrix(), 'single', 'correlation')
# hcluster.dendrogram(Z, color_threshold=0)

In [56]:
fig = figure()
axdendro = fig.add_axes([0.09,0.1,0.2,0.8])
D = transition_matrix.as_matrix()
Y = linkage(D, method='single', metric='correlation')
Z = dendrogram(Y, orientation='left')
axdendro.set_xticks([])
axdendro.set_yticks([])

# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.8])
index = Z['leaves']
D = D[index,:]
D = D[:,index]
im = axmatrix.matshow(mat, aspect='auto', origin='lower')
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.8])
colorbar(im, cax=axcolor)

fig.show()



In [388]:



Out[388]:
Index([u'Accademia', u'Brancacci', u'Cappelle Medicee', u'Casa Buonarroti',
       u'La Specola', u'Laurenziana', u'M. Antropologia', u'M. Archeologico',
       u'M. Bargello', u'M. Calcio', u'M. Casa Dante', u'M. Civici Fiesole',
       u'M. Ebraico', u'M. Ferragamo', u'M. Galileo', u'M. Geologia',
       u'M. Innocenti', u'M. Marini', u'M. Mineralogia', u'M. Novecento',
       u'M. Opificio', u'M. Palazzo Davanzati', u'M. Palazzo Vecchio',
       u'M. San Marco', u'M. Santa Maria Novella', u'M. Stefano Bardini',
       u'M. Stibbert', u'Opera del Duomo', u'Orto Botanico', u'Palazzo Medici',
       u'Palazzo Pitti', u'Palazzo Strozzi', u'Planetario', u'San Lorenzo',
       u'Santa Croce', u'Torre di Palazzo Vecchio', u'Uffizi', u'V. Bardini',
       u'M. Horne', u'M. Preistoria', u'Primo Conti'],
      dtype='object')

In [342]:
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.,  # font size for the x axis labels
)
plt.show()



In [52]:
plt.matshow(np.log(mat))


Out[52]:
<matplotlib.image.AxesImage at 0x7f434ed2ed50>

In [332]:
mat = mat.div(mat.sum(axis=1), axis=0)

In [53]:
fig = plt.figure(figsize=(10,10))#,dpi=300)
ax = fig.add_subplot(111)
cmap=plt.cm.PuBu
# cax = ax.matshow(transition_matrix[order].reindex(order),cmap=cmap)
cax = ax.matshow(transition_matrix[order].reindex(order),cmap=cmap)
fig.colorbar(cax)

ax.set_xticklabels(['']+order.tolist(),rotation=90)
ax.set_yticklabels(['']+order.tolist())

ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()



In [334]:
fig = plt.figure(figsize=(10,10))#,dpi=300)
ax = fig.add_subplot(111)
cmap=plt.cm.PuBu
# cax = ax.matshow(transition_matrix[order].reindex(order),cmap=cmap)
cax = ax.matshow(np.log(transition_matrix[order].reindex(order)),cmap=cmap)
fig.colorbar(cax)

ax.set_xticklabels(['']+order.tolist(),rotation=90)
ax.set_yticklabels(['']+order.tolist())

ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()



In [51]:
mat2 = mat.drop(['M. Calcio', 'Primo Conti'])
mat2 = mat2.drop(['M. Calcio', 'Primo Conti'],axis=1)

In [52]:
fig = plt.figure(figsize=(10,10))#,dpi=300)
ax = fig.add_subplot(111)
cmap=plt.cm.PuBu
# cax = ax.matshow(transition_matrix[order].reindex(order),cmap=cmap)
cax = ax.matshow(mat2,cmap=cmap)
fig.colorbar(cax)

ax.set_xticklabels(['']+mat2.index.tolist(),rotation=90)
ax.set_yticklabels(['']+mat2.index.tolist())

ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()


Edgelist (to export?)


In [38]:
# Create the actual edgelist for the transition matrix (of a first-order Markov chain)
df5 = df4.groupby(['from','to'])['total_people'].sum().to_frame()
df5.columns = ['weight']
df5.reset_index(inplace=True)
df5.head(10)


Out[38]:
from to weight
0 Accademia Accademia 2
1 Accademia Brancacci 77
2 Accademia Cappelle Medicee 1277
3 Accademia Casa Buonarroti 49
4 Accademia La Specola 30
5 Accademia Laurenziana 301
6 Accademia M. Antropologia 51
7 Accademia M. Archeologico 826
8 Accademia M. Bargello 1022
9 Accademia M. Calcio 2

In [39]:
# Export
df5.to_csv('static_edgelist.csv')
nodes[['short_name','longitude','latitude']].to_csv('node_positions.csv')

In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Other exploratory/summary plots


In [6]:
timeunitname = 'hour' # 'day'
timeunitcode = 'h' # 'D'
df1 = df.groupby(['short_name',timeunitname]).sum()
df1['total_people'] = df1['total_adults']+df1['minors']
df1.drop(['museum_id','user_id','adults_first_use','adults_reuse','total_adults','minors'], axis=1, inplace=True)
df1.head()

In [8]:
df1 = df1.reindex(pd.MultiIndex.from_product([df['short_name'].unique(),pd.date_range('2016-06-01','2016-10-01',freq=timeunitcode)]), fill_value=0)
df1.reset_index(inplace=True)
df1.columns = ['short_name','hour','total_people']
df1.head()


Out[8]:
short_name hour total_people
0 Palazzo Pitti 2016-06-01 00:00:00 0
1 Palazzo Pitti 2016-06-01 01:00:00 0
2 Palazzo Pitti 2016-06-01 02:00:00 0
3 Palazzo Pitti 2016-06-01 03:00:00 0
4 Palazzo Pitti 2016-06-01 04:00:00 0

In [9]:
# multiline plot with group by
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize=(15,8), dpi=300)
for key, grp in df1.groupby(['short_name']):
    if key in ['Accademia','Uffizi','Opera del Duomo']:
        ax.plot(grp['hour'], grp['total_people'], linewidth=.5, label=str(key))
plt.legend(bbox_to_anchor=(1.1, 1), loc='upper right')
ax.set_xlim(['2016-06-03','2016-06-15'])
ax.set_ylim([-1,110])
# plt.xticks(pd.date_range('2016-06-01','2016-06-15',freq='D'))
ticks = pd.date_range('2016-06-03','2016-06-15',freq='D').date
plt.xticks(ticks, ticks, rotation='vertical')
ax.set_xticks(pd.date_range('2016-06-03','2016-06-15',freq='6h'), minor=True, )
# fig.autofmt_xdate()
# ax.fmt_xdata = mdates.DateFormatter('%m-%d')
plt.show()



In [10]:
# multiline plot with group by
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize=(15,8), dpi=300)
for key, grp in df1.groupby(['short_name']):
    ax.plot(grp['hour'], grp['total_people'], linewidth=.5, label=str(key))
plt.legend(bbox_to_anchor=(1.1, 1), loc='upper right')
ax.set_xlim(['2016-06-01','2016-06-15'])
plt.show()



In [11]:
df2 = df.groupby('museum_name').sum()[['total_adults','minors']]
df2['total_people'] = df2['total_adults'] + df2['minors']
df2.sort_values('total_people',inplace=True,ascending=False)
df2.head()


Out[11]:
total_adults minors total_people
museum_name
Battistero di San Giovanni 44047 5842 49889
Galleria degli Uffizi 40622 3717 44339
Galleria dell'Accademia di Firenze 39364 3053 42417
Museo di Palazzo Vecchio 29403 3354 32757
Palazzo Pitti 2 Ð Giardino di Boboli, Museo degli Argenti, Museo delle Porcellan 29142 3155 32297

In [12]:
df2.plot.bar(figsize=(16,8))
plt.title('Number of Firenze card visitors')
plt.xlabel('Museum')
plt.ylabel('Number of people')
# plt.yscale('log')
plt.show()