In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Image
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 8)
for a fine tutorial, from which I'm showing slides, see http://www.scottbot.net/HIAL/?page_id=41142
In [2]:
Image("https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/6n-graph2.svg/300px-6n-graph2.svg.png")
###h/t wikipedia
Out[2]:
1 IS CONNECTED TO 1
1 IS CONNECTED TO 2
1 IS CONNECTED TO 5
2 IS CONNECTED IS 1
. . . .
$\begin{pmatrix} 1 & 1 & 0 & 0 & 1 & 0\\ 1 & 0 & 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ \end{pmatrix}$
A $1$ at column $i$ and row $j$ means that $i$ and $j$ are connected.
$6$ is connected only to $4$; $1$ is connected to itself, $2$, and $5$.
This graph is symmetric: it assumes a connection from $1$ to $2$ is the same as one from $2$ to $1$.
Often this is not the case: think of friending someone on a social network: A friending B does not mean B friends A. Such a graph is called a directed graph.
I'm taking this example from http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/ and moving it from R
to python
In [3]:
membership_matrix=pd.read_csv("https://raw.githubusercontent.com/kjhealy/revere/master/data/PaulRevereAppD.csv", index_col=[0])
In [4]:
membership_matrix
Out[4]:
In [5]:
membership_matrix.shape
Out[5]:
We have a matrix of 254 rows and 7 columns
We are intested in
who is connected to whom
which organization most connected to which other organization.
Our quarry is something like the similarity matrices we looked at in doing our work with text.
We are going to create an adjacency matrix, another square matrix that indicates who is connected to whom.
(You are not responsible for this but I want you to know the smoke and mirrors.)
We have to do a little of basic arithmetic with matrices to make this work.
First we need the idea of the transpose, which is just flipping the rows and columns of a matrix.
$\begin{bmatrix} 1 & 2 \
$
The adjacency matrix is the product of a matrix and its transpose.
Say we have a $257 * 7$ matrix; its transpose will be $7 * 257$ matrix.
If we multiply an $M * N$ matrix by an $N * M$ matrix, we get a $M * M$ matrix.
So let's start with the adjacency of every person to every other person.
Our goal, then, is a symmetrical matrix that matches $M$ people to $M$ people.
In python we can use the .dot
method to perform the necessary form of matrix multiplication. We can do this directly on pandas
dataframes.
In [6]:
person_adjacency=membership_matrix.dot(membership_matrix.T)
# multiplying the membership matrix by its transpose
In [7]:
person_adjacency
Out[7]:
Just as easily we can get the adjacency among the clubs.
This time we multipy the transpose by the original matrix.
(Remember: matrix multiplication is not commutative.)
In [8]:
club_adjacency=membership_matrix.T.dot(membership_matrix)
In [9]:
club_adjacency
Out[9]:
In [10]:
club_adjacency["StAndrewsLodge"]
Out[10]:
In [11]:
[value for value in club_adjacency["StAndrewsLodge"]]
Out[11]:
In [12]:
club_adjacency_names=[club for club in club_adjacency]
In [13]:
import networkx as nx
introduces new data type, the graph, with lots of operations to create, modify, analyze, graph, import and export graphs
find the full documentation at https://networkx.github.io/
Begin by initializing a graph, then add edges, nodes or both.
In [14]:
my_first_graph=nx.Graph() # initialize a graph
In [15]:
my_first_graph.add_edge(1,2)
In [16]:
nx.draw(my_first_graph)
In [17]:
my_first_graph.add_edges_from([(2,3), (2,4)]) #add a list of edges, in the form of tuples (x,y)
In [18]:
nx.draw(my_first_graph)
In [19]:
nx.number_of_nodes(my_first_graph)
Out[19]:
In [20]:
my_first_graph.nodes()
Out[20]:
In [21]:
my_first_graph.edges()
Out[21]:
In [22]:
nx.betweenness_centrality(my_first_graph)
Out[22]:
In [23]:
my_first_graph.add_edge(4,5, weight=3)
In [24]:
nx.draw(my_first_graph)
Default simply graphing method not show the weighted edge.
We will fix this.
For examples of drawing in networkx
, see https://networkx.github.io/documentation/latest/gallery.html
In [25]:
G=nx.Graph()
In [26]:
for club in club_adjacency:
for i, j in enumerate(club_adjacency[club]):
G.add_edge(club, club_adjacency_names[i], weight=j)
In [27]:
pos=nx.spring_layout(G,iterations=20)
nx.draw_networkx_edges(G, pos)
nx.draw_networkx_nodes(G, pos)
nx.draw_networkx_labels(G, pos)
plt.show()
In [28]:
H=nx.Graph()
In [29]:
person_adjacency_names=[person for person in person_adjacency]
for person in person_adjacency:
for i, j in enumerate(person_adjacency[person]):
if person != person_adjacency_names[i]:
H.add_edge(person, person_adjacency_names[i], weight=j)
In [30]:
nx.draw(H)
In [31]:
pos=nx.spring_layout(H, iterations=30)
plt.figure(figsize=(40,40))
nx.draw_networkx_edges(H, pos, width=.015)
nx.draw_networkx_nodes(H, pos, node_size=50)
nx.draw_networkx_labels(H, pos)
plt.axis('off')
plt.show()
networkx
, allows a wide range of analysis, including sophisticated supervised and unsupervised learning.Google
's Page-Rank
--the core of its search algorithm--is a graph analysis algorithm.
Intuition is that websites (nodes) referred to by other high-ranking websites (nodes) should be more highly ranked
In [32]:
H_centrality=nx.eigenvector_centrality(H, max_iter=30)
In [33]:
H_centrality
Out[33]:
argh, need to sort that
In [34]:
sorted(H_centrality, key=H_centrality.get, reverse=True)[:10] #useful syntax to know for sorting dictionaries!
Out[34]:
In [35]:
for name in sorted(H_centrality, key=H_centrality.get, reverse=True)[0:10]:
print(name, H_centrality[name])
Point isn't that such techniques are always right:
they do uncover relationship
they have tons of false positives
In [ ]: