The GDELT Global Knowledge Graph (GKG) is a repository of data extracted from the web. Each record consists of delimited sets of named entities extracted from a single web resource, and the co-mentions of entities within the same text can reveal possible relationships between them.
In this example, we'll look at data exported from the GKG covering the first three months of 2016 that references the keywords "Flint" and "water".
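Before touching the real data, here is a minimal sketch (with hypothetical names) of how co-mentions become weighted edges: every pair of people mentioned in the same record gets an edge, and repeated co-mentions increase that edge's weight.
from itertools import combinations
import networkx as nx
example = nx.Graph()
record_persons = ["alice example", "bob example", "carol example"] # hypothetical record
for a, b in combinations(record_persons, 2): # every unordered pair in the record
    if example.has_edge(a, b):
        example[a][b]["weight"] += 1
    else:
        example.add_edge(a, b, weight=1)
print(example.edges(data=True))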
To read the export, we can use the csv library to parse the file. The DictReader class maps each row to a dictionary keyed by the column headers. We can then split out the person names and add co-mention edges to the graph.
In [15]:
import csv
import requests
import networkx as nx
import pylab as plt
import warnings # Wakari issues some warnings due to its versions of matplotlib and networkx - let's ignore them for now.
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (5.0, 5.0) # This sets the size (width and height, in inches) of the plots.
g_names=nx.Graph() # Let's create an undirected graph to hold the nodes and edges
with open('flint_gkg.txt','r') as f:
    data = f.read()
reader = csv.DictReader(data.splitlines(), delimiter='\t') # Parse the text and load each row into a Python dictionary
for row in reader:
    org_found = False
    if row["Organizations"].find("foundation") >= 0: # To keep the size manageable, let's only consider orgs with "foundation" in the name
        org_found = True
    names = row["Persons"].split(";")
    if org_found:
        for s1 in names:
            for s2 in names:
                if s1 != s2:
                    if g_names.has_edge(s1, s2):
                        g_names[s1][s2]['weight'] += 1
                    else:
                        g_names.add_edge(s1, s2, weight=1)
nx.draw(g_names)
Let's gather some stats about this graph:
In [16]:
def graph_stats(g):
    return {
        "num_nodes": nx.number_of_nodes(g),
        "num_edges": nx.number_of_edges(g),
        "edge_density": nx.density(g),
        "max_degree": sorted(nx.degree_centrality(g).values(),reverse=True)[0] # highest (normalized) degree centrality
    }
In [17]:
print graph_stats(g_names)
We can also gather some node-level stats and add the values as node attributes, so we can use them to draw the graph:
In [18]:
nx.set_node_attributes(g_names, 'degree', nx.degree(g_names))
nx.set_node_attributes(g_names, 'betweenness', nx.betweenness_centrality(g_names))
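Note that the (name, values) argument order above is the NetworkX 1.x signature used throughout this notebook; if you are running NetworkX 2.x or later, the values come before the attribute name, roughly like this:
nx.set_node_attributes(g_names, dict(nx.degree(g_names)), 'degree') # NetworkX 2.x+ ordering
nx.set_node_attributes(g_names, nx.betweenness_centrality(g_names), 'betweenness')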
Now each node has its betweenness centrality and degree attached. Let's look at a single node as an example:
In [19]:
print g_names.nodes(data=True)[0]
In [20]:
sorted_nodes = sorted([n for n in g_names.nodes_iter(data=True)], key=lambda x: x[1]["betweenness"], reverse=True)
g = g_names.subgraph([n[0] for n in sorted_nodes[:25]])
print graph_stats(g)
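One portability note: on NetworkX 2.x and later, subgraph() returns a read-only view, so the partitioning step further down (which removes edges from g) would fail. Taking an explicit copy avoids that:
g = g_names.subgraph([n[0] for n in sorted_nodes[:25]]).copy() # NetworkX 2.x+: materialize the view so edges can be removed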
In [21]:
g.edges(data=True)[10]
Out[21]:
In [22]:
plt.rcParams['figure.figsize'] = (15.0, 15.0)
pos=nx.spring_layout(g, k=1, iterations=20)
widths = [e[2]["weight"] for e in g.edges(data=True)]
node_sizes = [n[1]["degree"] * 100 for n in g.nodes(data=True)]
node_colors = [n[1]["betweenness"] * 100 for n in g.nodes(data=True)]
nx.draw(g,pos,node_color=node_colors,node_size=node_sizes, edge_color='#A0CBE2',width=widths,cmap=plt.cm.Blues,with_labels=True)
plt.title('Co-Mentions of People with Flint and Water', y=0.97, fontsize=20, fontweight='bold')
plt.show()
In [23]:
# To generate data for D3, we can use the following code to generate json in the format that many D3 network visualizations expect:
from networkx.readwrite import json_graph
import json as js
data = json_graph.node_link_data(g, attrs=dict(id='name', source='source', target='target', key='key'))
with open("flint_people.json", "w") as text_file:
    text_file.write(js.dumps(data))
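To sanity-check the export, we can load the file back and look at its top-level structure; node_link_data produces a dictionary with 'nodes' and 'links' lists (here the node identifier field is 'name', per the attrs mapping above):
with open("flint_people.json") as f:
    d3_data = js.load(f)
print(sorted(d3_data.keys())) # expect something like ['directed', 'graph', 'links', 'multigraph', 'nodes']
print(d3_data["links"][0])    # one link record: source, target, and the co-mention weight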
Here is a D3 rendering of the same graph:
You may notice that the nodes in this graph tend to cluster somewhat. We can try to quantify this observation using a graph partitioning approach. By splitting the graph into subgraphs, we can isolate the parts of the graph that are mentioned together the most. One simple way to do this is to progressively remove the edges with the highest edge betweenness centrality (the idea behind the Girvan-Newman community detection algorithm).
In [24]:
plt.rcParams['figure.figsize'] = (8.0, 8.0)
current_component_count = nx.number_connected_components(g)
while current_component_count < 4: # Try changing the number of partitions
    attr = nx.edge_betweenness_centrality(g)
    nx.set_edge_attributes(g, 'edge_betweenness', attr)
    edge_list = [e for e in g.edges_iter(data=True)]
    edge_list = sorted(edge_list, key=lambda x: x[2]["edge_betweenness"], reverse=True)
    for e in edge_list:
        g.remove_edge(*e[:2])
        if nx.number_connected_components(g) > current_component_count:
            current_component_count = nx.number_connected_components(g)
            break
cluster_index = 0
for nodes in nx.connected_components(g):
    for n in nodes:
        g.node[n]["cluster_index"] = cluster_index
    cluster_index += 1
pos=nx.spring_layout(g, k=0.1, iterations=10)
widths = [e[2]["weight"] for e in g.edges(data=True)]
node_sizes = [n[1]["degree"] * 100 for n in g.nodes(data=True)]
node_colors = [n[1]["cluster_index"] * 100 for n in g.nodes(data=True)] # Using the cluster index to color the nodes.
nx.draw(g,pos,node_color=node_colors,node_size=node_sizes, edge_color='#A0CBE2',width=widths,cmap=plt.cm.Blues,with_labels=True)
plt.title('Partitioned Co-Mention Graph', y=0.97, fontsize=20, fontweight='bold')
plt.show()
In [25]:
[n[0] + ": " + str(n[1]["cluster_index"]) for n in sorted([n for n in g.nodes_iter(data=True)], key=lambda x: x[1]["cluster_index"])]
Out[25]:
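The manual edge-removal loop above is a hand-rolled version of Girvan-Newman. As a sketch, assuming NetworkX 2.x or later (where networkx.algorithms.community.girvan_newman is available), the same partitioning could be done with the library implementation on a fresh copy of the top-25 subgraph:
from networkx.algorithms.community import girvan_newman
g2 = g_names.subgraph([n[0] for n in sorted_nodes[:25]]) # same top-25 nodes as before
partitions = girvan_newman(g2) # an iterator of successively finer partitions
for partition in partitions:
    if len(partition) >= 4: # stop at the first split into 4+ communities, like the loop above
        break
for i, nodes in enumerate(partition):
    print(i, sorted(nodes))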
There are many other algorithms for detecting clusters in graphs. For the Constellations visualizations, we use the NetworkX module described here:
In [ ]: