In [1]:
import pandas
import graphistry
import pandas as pd
import igraph
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com') # https://www.graphistry.com/api-request
This dataset was created by a Twitter user who was surprised that one of his very innocuous tweets ("Hey let's grab a coffee") got retweeted several times. Intrigued, he took a closer look at the accounts that retweeted his message and found that they all had unpronounceable names that looked like gibberish. Suspecting that those accounts might be fake, he crawled the Twitter social network around the suspicious accounts to produce this dataset.
The dataset is in a CSV file named twitterDemo.csv, which looks like this:
#dstAccount,srcAccount
arley_leon16,wxite_pymp
michaelinhooo2,wxite_pymp
steeeva,wxite_pymp
...
Each row in twitterDemo.csv denotes a "following" relationship (Twitter's equivalent of friending) between two Twitter accounts.
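To make the row format concrete, here is a minimal sketch that parses the three sample rows listed above from an in-memory string and counts edges per source account. It assumes the header is written without the leading '#' (so that the column names match the ones bound later in the notebook); the names sample_csv, sample_df and edges_per_source are illustrative only.

```python
import io

import pandas as pd

# The three sample rows from the listing above. The header is written
# without the leading '#', which is an assumption made for this sketch.
sample_csv = io.StringIO(
    "dstAccount,srcAccount\n"
    "arley_leon16,wxite_pymp\n"
    "michaelinhooo2,wxite_pymp\n"
    "steeeva,wxite_pymp\n"
)
sample_df = pd.read_csv(sample_csv)

# Count how many rows (edges) each source account appears in.
edges_per_source = sample_df.groupby("srcAccount").size()
print(edges_per_source)
```

In this sample, every row shares the same source account, which is the kind of pattern the graph visualization makes visible at scale.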
In [2]:
follows_df = pd.read_csv('../../data/twitterDemo.csv')
follows_df.sample(3)
Out[2]:
In [4]:
g = graphistry.bind(source='srcAccount', destination='dstAccount').edges(follows_df)
g.plot()
Out[4]:
Can you answer the following questions by exploring the visualization you have just created?
Next, we use igraph, a graph computation library, to compute metrics such as PageRank that help us understand the dataset.
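For intuition about what a PageRank score measures, the following is a rough pure-Python sketch of power-iteration PageRank on an undirected edge list. It is not igraph's implementation; the function pagerank_sketch, its damping and iterations parameters, and the toy star graph are all assumptions made for illustration.

```python
def pagerank_sketch(edges, damping=0.85, iterations=50):
    """Toy power-iteration PageRank on an undirected edge list."""
    # Build an undirected adjacency map: each edge links both endpoints.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}

    for _ in range(iterations):
        new_rank = {}
        for node in neighbors:
            # Each neighbor splits its current rank evenly among its links.
            incoming = sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical star graph: one hub account linked to three leaf accounts.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]
toy_rank = pagerank_sketch(toy_edges)
print(sorted(toy_rank.items(), key=lambda item: -item[1]))
```

The hub ends up with the largest score because every leaf passes all of its rank to it, which is why highly connected accounts stand out when node size is bound to pagerank.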
In [8]:
ig = g.pandas2igraph(follows_df)
igraph.summary(ig)
In [9]:
# Node-level and edge-level centrality metrics
ig.vs['pagerank'] = ig.pagerank(directed=False)
ig.vs['betweenness'] = ig.betweenness(directed=True)
ig.es['ebetweenness'] = ig.edge_betweenness(directed=True)
# Community detection with the spinglass method
ig.vs['community_spinglass'] = ig.community_spinglass(spins=12, stop_temp=0.1, cool_fact=0.9).membership
# Infomap and Louvain (multilevel) are computed on an undirected copy of the graph
uig = ig.copy()
uig.to_undirected()
ig.vs['community_infomap'] = uig.community_infomap().membership
ig.vs['community_louvain'] = uig.community_multilevel().membership
In [31]:
nodes_df = pd.DataFrame([x.attributes() for x in ig.vs])
nodes_df.sample(3)
Out[31]:
In [32]:
g2 = g.nodes(nodes_df).bind(node='__nodeid__', point_color='community_spinglass', point_size='pagerank')
g2.plot()
Out[32]:
Within the visualization, you can filter and drill down into the graph. Try the following:
Open the histogram panel, and add histograms for pagerank, betweenness, ebetweenness, etc. By selecting a region of a histogram or clicking on a bar, you can filter the graph.
You can also manually create filters in the filter panel ("funnel" icon in the left menu bar). For instance, try filtering on point:pagerank such that point:pagerank >= 0.01. This selects the most "influential" accounts, which are the likely botnet owners/customers.
Still in the histogram panel, you can visualize attributes through the node/edge colors of the graph. Try clicking on each of the three square icons on top of each histogram. Notice that when point color is bound to community_spinglass, the "tail" of the network forms a distinct community. What makes those accounts different from the rest?
With the histogram panel open, click on data brush and then lasso a selection on the graph. The histograms highlight the subset of nodes under the selection. You can drag the data brush selection to compare different subgraphs.
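The pagerank filter suggested above (point:pagerank >= 0.01) can also be approximated outside the visualization with a pandas boolean mask. The sketch below uses a small hypothetical table, toy_nodes, as a stand-in for nodes_df; the account names and scores are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for nodes_df; names and scores are invented.
toy_nodes = pd.DataFrame({
    "__nodeid__": ["acct_a", "acct_b", "acct_c"],
    "pagerank": [0.02, 0.005, 0.01],
})

# Keep only the nodes whose pagerank meets the threshold used in the filter panel.
influential = toy_nodes[toy_nodes["pagerank"] >= 0.01]
print(influential)
```

Applied to the real nodes_df, the same mask would list the accounts that remain visible after applying the filter in the UI, assuming the pagerank column was computed as in the cells above.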