So... you've got some interesting property of the network uncovered. Is it real? How can you trust that what you've found didn't arise by random chance?
One useful way of thinking by using generative models of random graphs. By "generative" and "random", we mean that the graph was generated using some statistical random model underlying it. This is essentially in the "Bayesian" tradition of modelling.
The alternative is a non-parametric way of thinking. This is where we don't assume some model that generated the data, and instead performs randomization on the data on hand to calculate statistics. Here, we assume that our randomization procedure is an appropriate random model for the network.
Discrete Distributions:
n * bernoulli trials follows binomial distribution.Continuous Distributions:
We can also make up an arbitrary distribution of our own. :-)
Assumptions of this model:
Under this model, what is the distribution of number of rides taken throughout the ? We can probably solve this analytically, but since we have ultra-cheap computational capabilities at our fingertips, might as well do it the brute-force way.
In [ ]:
import networkx as nx
G = nx.read_gpickle('datasets/divvy_2013/divvy_graph.pkl')
total_trips = sum([d['count'] for _,_,d in G.edges(data=True)])
print(total_trips)
In [ ]:
In [ ]:
In [ ]:
import numpy as np
Computing the interval between the 2.5th to the 97.5th percentile effectively gives you a centered 95% mass of the distribution. Knowing this range can be quite useful in other data analysis cases.
Together, we will implement a random model for the number of trips that are taken between two stations.
In this procedure, we will copy the nodes graph G into a new graph G_random. This can be done by:
G_random = nx.Graph()
G_random.add_nodes_from(G.nodes(data=True))
Following that, we will manually map-reduce this computation.
In [ ]:
# Grab all the nodes from G.
G_random = nx.Graph()
G_random.add_nodes_from(G.nodes(data=True))
G_random.nodes(data=True)
In [ ]:
# Implement the procedure by brute force. You will find it takes a bit of time.......... but because we have so many people
# in the class, we can brute force parallelize this :-).
from random import sample, choice
import math
n = 40 # the total number of students with Python 3.4 installed.
for i in range(math.ceil(total_trips/n)):
if i % 1000 == 0:
print(i)
n1 = choice(G_random.nodes())
n2 = choice(G_random.nodes()) #note: n1 and n2 might well be the same nodes.
if (n1, n2) not in G_random.edges():
G_random.add_edge(n1, n2, count=0)
else:
G_random.edge[n1][n2]['count'] += 1
In [ ]:
your_id = 1 # change this number to the ID given to you.
nx.write_gpickle(G_random, 'datasets/divvy_2013/divvy_random{0}.pkl'.format(your_id))
In [ ]:
# Load the graphs with randomly re-distributed trip counts as a list of graphs.
import os
def is_pkl(filename):
"""
Checks if a file is a pickled graph.
"""
if filename.split('.')[-1] == 'pkl':
return True
else:
return False
def is_divvy_random_graph(filename):
"""
Checks if it's a Divvy graph.
"""
if 'divvy_random' in filename:
return True
else:
return False
# Load the data into a list of graphs.
divvy_random_graphs = []
for f in os.listdir('datasets/divvy_2013'):
if is_pkl(f) and is_divvy_random_graph(f):
g = nx.read_gpickle('datasets/divvy_2013/{0}'.format(f))
graphs.append(g)
In [ ]:
# Plot the distribution of counts in a re-distributed set of trips.
import matplotlib.pyplot as plt
%matplotlib inline
G_random_master = nx.Graph()
for g in graphs:
for sc, sk, d in g.edges(data=True):
if (sc, sk) not in G_random_master.edges():
G_random_master.add_edge(sc, sk, count=d['count'])
if (sc, sk) in G_random_master.edges():
G_random_master.edge[sc][sk]['count'] += d['count']
# Remove edges that have no counts
for sc, sk, d in G_random_master.edges(data=True):
if d['count'] == 0:
G_random_master.remove_edge(sc, sk)
# Plot the distribution of counts throughout the network
counts = [d['count'] for _,_,d in G_random_master.edges(data=True)]
plt.bar(Counter(counts).keys(), Counter(counts).values())
plt.show()
In [ ]:
nx.write_gpickle(G_random_master, 'datasets/divvy_2013/divvy_random_master.pkl')
Think about it...
Given the distribution of trip counts in the randomly re-distributed trips, what can we infer about the popularity of certain routes? Are they likely to have shown up in this random model?
In [ ]: