Let's look at the shape of the data about DAMD and see how to computationally construct a graph, and how that compares to doing so with an interactive tool, such as Table 2 Net.
In [2]:
# first we want some Python tools to make our lives easier
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
with open("20170718 hashtag_damd uncleaned.csv") as fd:
    for row in fd.readlines()[:3]:
        print(row)
That looks like a comma-separated values (CSV) file. There are many other file formats for data, but this one is quite typical. In a CSV, each line is a data item (a tweet in this case), and the columns are variables for each item. Once it is loaded into a table of this kind, we call it a data frame.
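As a small sketch of the row-and-column idea, the snippet below reads an invented two-row CSV from a string instead of a file. Only the column names tweet_id and hashtags mirror the real data; the values are made up for illustration.

```python
import io
import pandas as pd

# Invented two-row CSV; the values are placeholders, not real tweets.
csv_text = "tweet_id,hashtags\n101,damd;data\n102,damd\n"

# Each CSV line becomes one row, each comma-separated field one column.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)
print(list(toy.columns))
```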
In [4]:
damd = pd.read_csv("20170718 hashtag_damd uncleaned.csv")
What variables do we have?
In [5]:
damd.columns
Out[5]:
Let's use tweet_id as the index. It is a unique identifier for the tweets.
In [6]:
damd = pd.read_csv("20170718 hashtag_damd uncleaned.csv", index_col="tweet_id")
damd.head(3)
Out[6]:
To find patterns in the data, we might look at the #hashtags and see whether they form any interesting patterns. Co-occurrence is a useful thing to look at, and is easy to compute from Twitter data.
We might want to build a bipartite graph ("network") $g = \langle N, V \rangle$ of tweets and hashtags, where $N = \{{node}_1, {node}_2 \ldots {node}_n\}$ is a set of nodes ("spheres") and $V = \{{\langle source, target \rangle_1, \langle source, target \rangle _2 \ldots \langle source, target \rangle _m }\}$ is a set of edges ("lines"), to analyze hashtag co-occurrence.
A bipartite graph has two types of nodes, which are not connected within the type, only across. In our case, hashtags are connected to tweets, but tweets are not directly connected to tweets, and hashtags are not directly connected to hashtags. Makes sense, right?
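As a minimal sketch of that definition, the toy graph below uses made-up tweet and hashtag names (not taken from the DAMD data) and asks NetworkX to confirm the two-sided structure.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy data, invented for illustration: two tweets and three hashtags.
toy = nx.Graph()
toy.add_edges_from([
    ("tweet1", "#a"),
    ("tweet1", "#b"),
    ("tweet2", "#b"),
    ("tweet2", "#c"),
])

# Every edge crosses from a tweet to a hashtag, so the graph is bipartite.
is_two_sided = nx.is_bipartite(toy)

# Recover the two node types, starting from the nodes we know are tweets.
tweets, hashtags = bipartite.sets(toy, top_nodes=["tweet1", "tweet2"])

print(is_two_sided)
print(sorted(tweets), sorted(hashtags))
```

Here the two tweets are linked only indirectly, through the shared hashtag #b, which is the kind of connection a co-occurrence analysis looks for.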
This data manipulation can be done interactively with Table 2 Net, but it can also be done programmatically. We will use a Python library called NetworkX.
Below is a Gephi visualization of a graph made with Table 2 Net, coloured by node type (red for tweets, green for hashtags), with labels shown for hashtag nodes of degree 15 or larger. We used the ForceAtlas2 algorithm in Gephi to position the nodes. The central node, the hashtag damd, has been hidden because it carries no information.
First let's take a peek at the shape of the hashtags, how they are stored in the data we have received.
In [7]:
damd.hashtags.head()
Out[7]:
We see that the hashtags column is itself a semicolon-separated list, so our data is kind of three-dimensional. We need to split it up.
From reading the documentation, we know that nx.Graph.add_edge() takes the two endpoints of one edge, source and target, as its arguments. For each tweet, we generate a list of its hashtags, and then add those edges to the graph one by one. So, from the original data shape
tweet1 hashtag1;hashtag2;hashtag3
tweet2 hashtag9;hashtag4
.
.
.
We create an intermediary data shape for line 5
tweet1 hashtag1
tweet1 hashtag2
tweet1 hashtag3
tweet2 hashtag9
tweet2 hashtag4
.
.
.
This suits what the NetworkX API expects.
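The reshaping above can be sketched in plain Python. The dictionary below is a hypothetical stand-in for the hashtags column, holding the same toy rows as the illustration.

```python
# Toy stand-in for the hashtags column: tweet id -> semicolon-separated string.
raw = {
    "tweet1": "hashtag1;hashtag2;hashtag3",
    "tweet2": "hashtag9;hashtag4",
}

# One (tweet, hashtag) pair per hashtag, i.e. one row of the long shape.
pairs = [
    (tweet, hashtag)
    for tweet, hashtags in raw.items()
    for hashtag in hashtags.split(";")
]

for tweet, hashtag in pairs:
    print(tweet, hashtag)
```

Each pair holds exactly the two endpoints of one edge.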
Conveniently NetworkX automatically creates the nodes, so we don't have to think about them. How can it automatically know what the nodes are, if it only looks at links?
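One way to explore that question is to add a single edge to an empty graph and look at which nodes exist afterwards. The node names below are placeholders.

```python
import networkx as nx

demo = nx.Graph()
print(demo.number_of_nodes())

# An edge names its two endpoints, so add_edge() can create any endpoint
# that is not in the graph yet.
demo.add_edge("tweet1", "hashtag1")
print(sorted(demo.nodes()))

# A second edge reuses the existing "tweet1" node instead of duplicating it.
demo.add_edge("tweet1", "hashtag2")
print(demo.number_of_nodes(), demo.number_of_edges())
```

Nodes created implicitly like this carry no attributes, so any node that needs one (such as a type label) has to be added or updated explicitly with add_node().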
In [8]:
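def buildHashtagCooccurrenceGraph(tweets):
    g = nx.Graph(name="Hashtag co-occurrence bipartite")
    # split each tweet's semicolon-separated hashtags string into a list
    for tweet, hashtags in tweets.hashtags.astype(str).map(lambda l: l.split(';')).items():
        # label tweet nodes; hashtag nodes are created implicitly by add_edge()
        g.add_node(tweet, Type="tweet_id")
        for hashtag in hashtags:
            g.add_edge(tweet, hashtag.lower())
    return g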
In [9]:
g = buildHashtagCooccurrenceGraph(damd)
Now, let's briefly inspect the graph g we created.
In [10]:
# nx.info() was removed in NetworkX 3.0; on newer versions, print(g) gives a similar summary
print(nx.info(g))
Save to file, for opening in Gephi.
In [9]:
nx.write_gexf(g, "hashtag-cooccurrence-bipartite-with-python.gexf")
In [10]:
g_table2net = nx.read_gexf("hashtag-cooccurrence-bipartite-with-table2net.gexf")
print(nx.info(g_table2net))
After poking around in Gephi for half an hour setting colours and filters, positioning with ForceAtlas2 and outputting an image, here is a visualization of the graph. It should be equal to the one above, which was visualized from a graph constructed from the data with Table 2 Net.
In graph theory, "isomorphism" (ἴσος isos "equal", and μορφή morphe "form" or "shape") means that two graphs have the same shape: their nodes can be matched one-to-one so that the edges line up, whatever the labels are. Why do we want to know this? We want to check whether we successfully reproduced the process that Table 2 Net performed.
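To make "same shape" concrete, here is a small sketch with invented graphs. It uses nx.is_isomorphic(), which performs an exact check and can be slow on large graphs, so it is shown here only on tiny examples.

```python
import networkx as nx

# Two paths with different labels: same shape, different names.
letters = nx.Graph([("a", "b"), ("b", "c"), ("c", "d")])
numbers = nx.Graph([(1, 2), (2, 3), (3, 4)])

# A star with the same number of nodes and edges, but a different shape.
star = nx.Graph([("hub", "x"), ("hub", "y"), ("hub", "z")])

same_shape = nx.is_isomorphic(letters, numbers)
different_shape = nx.is_isomorphic(letters, star)

print(same_shape, different_shape)
```

Counting nodes and edges is not enough: the path and the star agree on both counts, yet no relabeling turns one into the other.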
In [11]:
# This check is fast but not conclusive: False rules out isomorphism,
# while True only means the graphs could be isomorphic
nx.isomorphism.fast_could_be_isomorphic(g, g_table2net)
Out[11]:
Did we "open the black box" of Table 2 Net and Gephi?