In [73]:
import numpy as np
import pandas as pd
from scipy.stats import binom
from urllib.request import urlopen
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
pd.set_option("display.max_columns", 500)
This project represents an initial attempt to use archival evidence to validate and explore the Wikipedia link network associated with the French philosopher Jacques Derrida. A sample of books bearing personal dedications from other intellectuals to Derrida was selected from Derrida's personal library, now held by Princeton University. The hypothesis was that these "dedicator nodes" would serve as useful reference points in analyzing Derrida's ego network as encoded in Wikipedia. The project dataset was constructed in several steps:
1. A sample of 150 dedicators was identified, and their names were reconciled against URIs in the Virtual International Authority File (VIAF). Wikidata identifiers for these nodes were then extracted from VIAF (see the first sketch below for one way to recover this mapping).
2. Wikidata lookups identified the 62 biographical pages that exist for Derrida across the 285 language editions of Wikipedia. Links were scraped from these pages, and each link was checked in Wikidata to filter out irrelevant nodes. Two filtering criteria were applied: (1) the link must represent a person, and (2) that person must have been born in or after 1888, in order to plausibly overlap with Derrida's own lifetime (see the second sketch below).
3. A second iteration of link scraping and lookups harvested the links from the Wikipedia pages of the filtered nodes linked from the Derrida pages.
4. A separate round of two-step harvesting was performed for all pages linking to one of Derrida's pages, again across all relevant Wikipedia language editions. The Wikipedia backlink API was used to identify these links (see the third sketch below).

The final combined network contained 13,105 nodes and 24,780 weighted edges.
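As an illustration of step 1, the sketch below resolves a VIAF identifier to a Wikidata item by querying the public Wikidata SPARQL endpoint, where VIAF IDs are recorded under property P214. This is a minimal sketch under that assumption, not the reconciliation workflow actually used; the VIAF ID in the example comment is a placeholder.

import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def viaf_to_wikidata(viaf_id):
    """Return the QID of the Wikidata item whose VIAF ID (P214) matches viaf_id, or None."""
    query = 'SELECT ?item WHERE { ?item wdt:P214 "%s" . } LIMIT 1' % viaf_id
    response = requests.get(WDQS_ENDPOINT,
                            params={"query": query, "format": "json"},
                            headers={"User-Agent": "derrida-network-sketch/0.1"})
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    if not bindings:
        return None
    # Item URIs look like http://www.wikidata.org/entity/Q42; keep only the QID
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]

# Example call with a placeholder VIAF ID:
# viaf_to_wikidata("12345678")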
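The filtering criteria in step 2 map directly onto Wikidata statements: a candidate link is kept only if its item is an instance of human (P31 = Q5) and has a date of birth (P569) in or after 1888. The sketch below checks a single QID against those criteria using the public Special:EntityData JSON endpoint; it is a simplified illustration of the filter, not the exact code used to build the dataset.

import requests

def passes_filter(qid, earliest_birth_year=1888):
    """Return True if the Wikidata item is a human born in or after earliest_birth_year."""
    url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % qid
    entities = requests.get(url).json()["entities"]
    # The key may differ from qid if the item is a redirect, so take the first entity
    claims = next(iter(entities.values())).get("claims", {})
    # Criterion 1: instance of (P31) must include human (Q5)
    instance_ids = [c["mainsnak"]["datavalue"]["value"]["id"]
                    for c in claims.get("P31", [])
                    if "datavalue" in c["mainsnak"]]
    if "Q5" not in instance_ids:
        return False
    # Criterion 2: at least one date of birth (P569) in or after the cutoff year
    for c in claims.get("P569", []):
        if "datavalue" in c["mainsnak"]:
            time_string = c["mainsnak"]["datavalue"]["value"]["time"]  # e.g. "+1930-07-15T00:00:00Z"
            if time_string.startswith("+") and int(time_string[1:5]) >= earliest_birth_year:
                return True
    return False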
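Step 4's harvesting can be reproduced with the MediaWiki API's list=backlinks module, paging through results with the standard continuation parameters. The sketch below queries a single language edition for pages linking to a given title; the language code and title in the example are illustrative assumptions, whereas the project itself iterated over every edition with a Derrida page.

import requests

def backlinks(title, lang="en"):
    """Yield titles of article-namespace pages linking to `title` in one Wikipedia edition."""
    api_url = "https://%s.wikipedia.org/w/api.php" % lang
    params = {"action": "query", "list": "backlinks", "bltitle": title,
              "blnamespace": 0, "bllimit": "max", "format": "json"}
    while True:
        data = requests.get(api_url, params=params).json()
        for link in data["query"]["backlinks"]:
            yield link["title"]
        if "continue" not in data:
            break
        # Carry the continuation token into the next request
        params.update(data["continue"])

# Example (English edition only):
# english_backlinks = list(backlinks("Jacques Derrida", lang="en"))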
In [74]:
G = nx.read_graphml(urlopen("https://raw.githubusercontent.com/timathom/netsci/master/project/data/full/full.graphml"))
print(nx.info(G))
In [92]:
# Add network nodes to a list
graph = [G.node[n] for n in G.nodes_iter()]
# Create DataFrame from list
df = pd.DataFrame(graph)
df.fillna(0, inplace=True)
ego_net = df[df.loc[:, "ego"] == True]
dedicators = df[df.loc[:, "dedicator"] == True]
dedicators.head()
Out[92]:
In [82]:
# Define simulation function to randomly reassign the dedicator labels
def simulate(dfs, dist):
    # Count the observed dedicators, then draw the same number of nodes
    # at random from the full network
    dtest = dfs["dedicator"] == 1
    dedicators = dfs[dtest].copy()
    dedicators["dedicator"] = np.random.choice(dfs.index, len(dedicators))
    d = dfs.iloc[dedicators["dedicator"]]
    # Record how many of the randomly chosen nodes fall inside the ego network
    etest = d["ego"] == 1
    dist["ego"].append(len(d[etest]))

simulation = df.copy()
# Initialize a dictionary to hold the results
distribution = {"ego": []}
# Run the simulation
for i in range(10000):
    simulate(simulation, distribution)
dfr = pd.DataFrame(distribution)
# Plot the simulated distribution
plt.figure()
dfr.loc[:, "ego"].plot.hist(alpha=0.5)
plt.show()
In [91]:
# Print summary data
percent_total = len(dedicators)/len(df)
percent_ego = len(ego_net[ego_net.loc[:, "dedicator"] == True])/len(ego_net)
summary = pd.DataFrame.from_dict({"Total dedicators": [len(dedicators)],
"Nodes in ego network": [len(ego_net)],
"Dedicators in ego network": [len(ego_net[ego_net.loc[:, "dedicator"] == True])],
"Proportion of dedicators (total)": [percent_total],
"Proportion of dedicators (ego)": [percent_ego],
"Mean dedicators in ego network under null model": [np.mean(dfr.loc[:, "ego"])]
})
summary
Out[91]:
In [68]:
# Probability of observing 47 or more successes among 740 nodes, each with
# success probability 0.01 (upper tail of the binomial distribution)
round(1 - binom.cdf(46, 740, 0.01), 4)
Out[68]: