Botnet on Twitter?


In [1]:
import pandas
import graphistry
import pandas as pd
import igraph
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com') # https://www.graphistry.com/api-request

Step 1: Loading The Data

This dataset was created by a Twitter user who was surprised that one of his very innocuous tweets ("Hey let's grab a coffee") got retweeted several times. Intrigued, he had a closer look at the accounts that retweeted his message. He found that those accounts all had unpronounceable names that looked like gibberish. Suspecting that those accounts might be fake, he crawled the Twitter social network around the suspicious accounts to produce this dataset.

The dataset is in a CSV file named twitterDemo.csv, which looks like this:

#dstAccount,srcAccount
arley_leon16,wxite_pymp
michaelinhooo2,wxite_pymp
steeeva,wxite_pymp
...

Each row in twitterDemo.csv denotes a "follows" relation (Twitter's equivalent of friending) between two Twitter accounts, named in the srcAccount and dstAccount columns.
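To make the row format concrete, here is a minimal sketch that parses the three sample rows shown above with the standard library and counts how often each account appears in each column. The header is written without the leading `#`, matching the column names pandas reports when the file is loaded; that is an assumption about the file, and so is reading srcAccount as the follower. The helper name `read_edges` is hypothetical.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the listing above; the real file is much larger.
# Assumption: the header has no leading '#'.
SAMPLE_CSV = """dstAccount,srcAccount
arley_leon16,wxite_pymp
michaelinhooo2,wxite_pymp
steeeva,wxite_pymp
"""


def read_edges(text):
    """Parse CSV text into a list of (source, destination) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["srcAccount"], row["dstAccount"]) for row in reader]


edges = read_edges(SAMPLE_CSV)

# Count how often each account appears as a source and as a destination.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

# Every distinct account, whichever column it appears in, becomes a node.
accounts = sorted({name for edge in edges for name in edge})

print(out_degree.most_common(1))
print(len(accounts), "accounts,", len(edges), "edges")
```

An account that appears many times in one column and never in the other, as wxite_pymp does in this sample, is the kind of pattern worth looking for in the full graph.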


In [2]:
follows_df = pandas.read_csv('../../data/twitterDemo.csv')
follows_df.sample(3)


Out[2]:
      dstAccount    srcAccount
3508  ilisitizixox  ijow_opakeb78
1542  upimesevacug  osiz_ixolasor53
1760  _Tu_Moda_     ufewanikebix58

Step 2: First Simple Visualization

We can visualize this subset of the Twitter network as a graph: each node is a Twitter account, and each edge encodes a "follows" relation.


In [4]:
g = graphistry.bind(source='srcAccount', destination='dstAccount').edges(follows_df)

g.plot()


Out[4]:

Can you answer the following questions by exploring the visualization you have just created?

  • Is the structure of the graph what you would expect from a social network?
  • Can you tell which accounts might be fake and which ones are likely real users?

Step 3: Computing Graph Metrics With IGraph

Next, we are going to use IGraph, a graph computation library, to compute metrics such as PageRank that help us understand the dataset.
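For intuition about what a PageRank score measures, here is a minimal power-iteration sketch in plain Python. It is not IGraph's implementation: the function name, the damping factor of 0.85, the fixed iteration count, and the toy edge list are all assumptions chosen for illustration. It also treats edges as directed, whereas the notebook calls `ig.pagerank(directed=False)`.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over directed (src, dst) edges."""
    nodes = sorted({name for edge in edges for name in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes an equal share of its damped rank along every out-edge.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three accounts all point at "hub".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub")]
scores = pagerank(toy_edges)
top = max(scores, key=scores.get)
print(top)
```

The scores sum to 1, so a node's value can be read as its share of the total "attention" flowing through the graph. Accounts that many others point to end up with the largest shares.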


In [8]:
ig = g.pandas2igraph(follows_df)
igraph.summary(ig)


IGRAPH D--- 7889 10063 -- 
+ attr: __nodeid__ (v)

In [9]:
# Node- and edge-level centrality metrics, stored as igraph attributes
ig.vs['pagerank'] = ig.pagerank(directed=False)
ig.vs['betweenness'] = ig.betweenness(directed=True)
ig.es['ebetweenness'] = ig.edge_betweenness(directed=True)

# Community detection with three different algorithms
ig.vs['community_spinglass'] = ig.community_spinglass(spins=12, stop_temp=0.1, cool_fact=0.9).membership
# Infomap and multilevel (Louvain) are run on an undirected copy of the graph
uig = ig.copy()
uig.to_undirected()
ig.vs['community_infomap'] = uig.community_infomap().membership
ig.vs['community_louvain'] = uig.community_multilevel().membership

In [31]:
nodes_df = pd.DataFrame([x.attributes() for x in ig.vs])
nodes_df.sample(3)


Out[31]:
      __nodeid__      betweenness  community_infomap  community_louvain  community_spinglass  pagerank
3922  usovenesucug    0.0          103                33                 5                    0.000094
7659  elin_egutukez   0.0          388                23                 7                    0.000057
5446  ocomamigoyob41  0.0          490                24                 4                    0.000055

In [32]:
g2 = g.nodes(nodes_df).bind(node='__nodeid__', point_color='community_spinglass', point_size='pagerank')
g2.plot()


Out[32]:

Step 4: Visual Drill Downs

Within the visualization, you can filter and drill down into the graph. Try the following:

  1. Open the histogram panel, and add histograms for pagerank, betweenness, ebetweenness, etc. By selecting a region of a histogram or clicking on a bar, you can filter the graph.

  2. You can also manually create filters in the filter panel ("funnel" icon in the left menu bar). For instance, try filtering on point:pagerank such that point:pagerank >= 0.01. This selects the most "influential" accounts, which are the likely botnet owners or customers.

  3. Still in the histogram panel, you can show attributes visually using the graph's node and edge colors. Try clicking on each of the three square icons on top of each histogram. Notice that when point color is bound to community_spinglass, the "tail" of the network forms a distinct community. What makes those accounts different from the rest?

  4. With the histogram panel open, click on data brush and then lasso a selection on the graph. The histograms highlight the subset of nodes under the selection. You can drag the data brush selection to compare different subgraphs.
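The pagerank filter from item 2 can also be expressed in code. The following is a minimal sketch in plain Python; the function name and the sample records, including their pagerank values, are hypothetical and chosen only to show which rows pass the threshold.

```python
def filter_by_pagerank(nodes, threshold=0.01):
    """Keep node records whose 'pagerank' value is at least `threshold`."""
    return [node for node in nodes if node["pagerank"] >= threshold]


# Hypothetical records with made-up scores, shaped like rows of nodes_df.
sample_nodes = [
    {"__nodeid__": "account_a", "pagerank": 0.02},
    {"__nodeid__": "account_b", "pagerank": 0.0001},
    {"__nodeid__": "account_c", "pagerank": 0.01},
]

influential = filter_by_pagerank(sample_nodes)
influential_ids = [node["__nodeid__"] for node in influential]
print(influential_ids)
```

On the nodes_df built earlier, the equivalent pandas expression would be `nodes_df[nodes_df['pagerank'] >= 0.01]`.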

