This notebook contains an early exploration of WANE files.
The initial files were examples taken from the Internet Archive documentation for WANE files.
This is a learning notebook for reading WANE files as JSON, with the eventual goal of exporting the WANE data as graph files for use with Gephi.
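For orientation, here is a minimal sketch of the record shape the rest of this notebook relies on: one JSON object per line, each with a `url`, a `digest`, and a `named_entities` object holding `persons`, `organizations`, and `locations` lists. The two records below are invented placeholders, not real WANE data, and the field list is inferred from the keys accessed later in this notebook, not from the official specification.

```python
import json

# Two made-up records in a one-object-per-line layout; every value is a placeholder.
sample_lines = [
    '{"url": "http://example.org/a", "digest": "sha1:AAAA", '
    '"named_entities": {"persons": ["Ada Lovelace"], '
    '"organizations": ["Example Org"], "locations": ["London"]}}',
    '{"url": "http://example.org/b", "digest": "sha1:BBBB", '
    '"named_entities": {"persons": [], "organizations": [], "locations": ["Paris"]}}',
]

# Parse each line independently, as the loading cell in this notebook does.
records = [json.loads(line) for line in sample_lines]

for record in records:
    entities = record["named_entities"]
    print(record["url"], len(entities["persons"]), entities["locations"])
```

Note that a record can have an empty entity list, so any later loop over `persons` should cope with zero items.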
In [2]:
import json
import csv
In [3]:
data = []
# Each line of the file is a separate JSON object, so parse line by line.
with open('wane1.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
In [4]:
data[1]
Out[4]:
In [5]:
len(data)
Out[5]:
In [6]:
for item in data:
    print(['url'])  # mistake: prints the literal list ['url'], not the item's URL
In [7]:
for item in data:
    print(item['url'])
In [8]:
type(data)
Out[8]:
In [9]:
for item in data:
    print(item['digest'])
In [10]:
for item in data:
    print(item['named_entities'])
In [11]:
for item in data:
    print(item['named_entities']['persons'])
In [12]:
for item in data:
    print(item['named_entities']['organizations'])
In [13]:
for item in data:
    print(item['named_entities']['locations'])
In [14]:
type(data[2]['named_entities']['locations'])
Out[14]:
In [15]:
data[1]['url']
Out[15]:
So I'm reasonably confident about importing the JSON data at this point and accessing the different items in the data structure. Now the next step is putting this into a graph.
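Before building the graph, it may help to see the target structure in plain Python: each record contributes one (url, entity) pair per named entity. The sketch below is only an assumption about how the persons-only approach in the following cells could be generalized to all three entity lists; `entity_edges` is a hypothetical helper and the sample record is invented.

```python
def entity_edges(records, kinds=("persons", "organizations", "locations")):
    """Return (url, entity) pairs for every named entity of the given kinds."""
    edges = []
    for record in records:
        # Tolerate records that lack the named_entities key or one of the lists.
        entities = record.get("named_entities", {})
        for kind in kinds:
            for name in entities.get(kind, []):
                edges.append((record["url"], name))
    return edges


# Invented sample record, for illustration only.
sample = [
    {
        "url": "http://example.org/a",
        "named_entities": {"persons": ["Ada Lovelace"], "locations": ["London"]},
    }
]

pairs = entity_edges(sample)
print(pairs)
```

A list of 2-tuples like `pairs` can be passed straight to `G.add_edges_from`, which also creates any nodes that do not exist yet.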
In [16]:
import networkx as nx
G=nx.Graph()
In [17]:
for item in data:
    G.add_node(item['url'])
In [18]:
G.number_of_nodes()
Out[18]:
In [19]:
G.nodes()
Out[19]:
In [20]:
for item in data:
    G.add_nodes_from(item['named_entities']['persons'])
In [21]:
edges = []
for item in data:
    # One (url, person) edge per person named in the record.
    for person in item['named_entities']['persons']:
        edges.append((item['url'], person))
In [22]:
print(data[0]['named_entities']['persons'][0])
In [23]:
print(edges[3])
In [24]:
edges
Out[24]:
In [25]:
G.add_edges_from(edges)
In [26]:
G.edges()
Out[26]:
In [27]:
sorted(dict(G.degree()).values())  # dict() also works when degree() returns a view instead of a dict
Out[27]:
In [28]:
G.nodes()
Out[28]:
In [30]:
nx.write_graphml(G, "./graph1.graphml")  # the output is GraphML, so use a matching extension rather than .gml
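The `csv` module imported at the top is never used in this notebook. One plausible use, sketched here as an assumption and not something the notebook does, is writing the edge list as a CSV file with `Source` and `Target` columns, a layout that Gephi's spreadsheet import should recognize as an edge table. `write_edge_csv` and the sample edges are hypothetical.

```python
import csv
import io


def write_edge_csv(edges, handle):
    """Write (source, target) pairs to an open text handle as a CSV edge list."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])  # header row naming the two endpoints
    writer.writerows(edges)


# Invented sample edges, for illustration only.
sample_edges = [
    ("http://example.org/a", "Ada Lovelace"),
    ("http://example.org/a", "London"),
]

# An in-memory buffer keeps the sketch free of side effects on disk.
buffer = io.StringIO()
write_edge_csv(sample_edges, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, open it with `open(path, 'w', newline='')`, as the `csv` documentation recommends, and pass that handle in place of `buffer`.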