Botnet on Twitter?



In [1]:

    
import pandas
import graphistry
import pandas as pd
import igraph
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com') # https://www.graphistry.com/api-request

Step 1: Loading The Data

This dataset was created by a Twitter user who was surprised that one of his very innocuous tweets ("Hey let's grab a coffee") got retweeted several times. Intrigued, he took a closer look at the accounts that retweeted his message and found that they all had unpronounceable names that looked like gibberish. Suspecting that those accounts might be fake, he crawled the Twitter social network around the suspicious accounts to produce this dataset.

The dataset is in a CSV file named twitterDemo.csv, which looks like this:

#dstAccount,srcAccount
arley_leon16,wxite_pymp
michaelinhooo2,wxite_pymp
steeeva,wxite_pymp
...

Each row in twitterDemo.csv denotes one Twitter account "following" (Twitter's equivalent of friending) another.
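As a quick sanity check on this format, the sketch below parses the three sample rows shown above using only the standard library and counts how many rows each srcAccount appears in. It assumes the header is read without the leading "#" (the column names in the notebook output carry no "#"), and the variable names are illustrative rather than part of the original notebook.

```python
import csv
import io
from collections import Counter

# The three sample rows from above; the leading "#" in the header is dropped
# here (an assumption, so that column names match the notebook output).
sample = """dstAccount,srcAccount
arley_leon16,wxite_pymp
michaelinhooo2,wxite_pymp
steeeva,wxite_pymp
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Number of rows (edges) in which each account appears as srcAccount.
edges_per_source = Counter(row["srcAccount"] for row in rows)

# Every distinct account, whether it appears as source or destination.
accounts = {row["srcAccount"] for row in rows} | {row["dstAccount"] for row in rows}

print(edges_per_source.most_common(1))
print(len(accounts))
```

On the full file, the same counts would show which accounts sit at the center of many "follows" edges.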



In [2]:

    
follows_df = pandas.read_csv('../../data/twitterDemo.csv')
follows_df.sample(3)









    Out[2]:

          dstAccount     srcAccount
    3508  ilisitizixox   ijow_opakeb78
    1542  upimesevacug   osiz_ixolasor53
    1760  _Tu_Moda_      ufewanikebix58

Step 2: First Simple Visualization

We can visualize this subset of the Twitter network as a graph: Each node is a Twitter account and edges encode the "follows" relation.



In [4]:

    
g = graphistry.bind(source='srcAccount', destination='dstAccount').edges(follows_df)

g.plot()









    Out[4]:

Can you answer the following questions by exploring the visualization you have just created?

Is the structure of the graph what you would expect from a social network?
Can you tell which accounts might be fake and which ones are likely real users?

Step 3: Computing Graph Metrics With IGraph

Next, we use igraph, a graph computation library, to compute metrics such as PageRank and betweenness, along with community assignments, to help us understand the dataset.
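To give some intuition for what a PageRank score measures, here is a minimal power-iteration sketch in plain Python. It is a simplified illustration, not igraph's implementation; the function name, the toy edge list, and the account names are all hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over (src, dst) edge pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share (the "teleport" term).
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # Nodes without outgoing edges spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: three accounts point at "hub", which points back at one.
toy_edges = [("bot1", "hub"), ("bot2", "hub"), ("bot3", "hub"), ("hub", "bot1")]
scores = pagerank(toy_edges)
print(max(scores, key=scores.get))
```

Accounts that many others point to accumulate a high score, which is why PageRank is a useful first signal for spotting the accounts a botnet is promoting.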



In [8]:

    
ig = g.pandas2igraph(follows_df)
igraph.summary(ig)









    



IGRAPH D--- 7889 10063 -- 
+ attr: __nodeid__ (v)



In [9]:

    
# Node- and edge-level centrality metrics
ig.vs['pagerank'] = ig.pagerank(directed=False)
ig.vs['betweenness'] = ig.betweenness(directed=True)
ig.es['ebetweenness'] = ig.edge_betweenness(directed=True)

# Community detection; infomap and Louvain (multilevel) are run on an undirected copy
ig.vs['community_spinglass'] = ig.community_spinglass(spins=12, stop_temp=0.1, cool_fact=0.9).membership
uig = ig.copy()
uig.to_undirected()
ig.vs['community_infomap'] = uig.community_infomap().membership
ig.vs['community_louvain'] = uig.community_multilevel().membership



In [31]:

    
nodes_df = pd.DataFrame([x.attributes() for x in ig.vs])
nodes_df.sample(3)









    Out[31]:

          __nodeid__      betweenness  community_infomap  community_louvain  community_spinglass  pagerank
    3922  usovenesucug    0.0          103                33                 5                    0.000094
    7659  elin_egutukez   0.0          388                23                 7                    0.000057
    5446  ocomamigoyob41  0.0          490                24                 4                    0.000055



In [32]:

    
g2 = g.nodes(nodes_df).bind(node='__nodeid__', point_color='community_spinglass', point_size='pagerank')
g2.plot()









    Out[32]:

Step 4: Visual Drill Downs

Within the visualization, you can filter and drill down into the graph. Try the following:

Open the histogram panel, and add histograms for pagerank, betweenness, ebetweenness, etc. By selecting a region of a histogram or clicking on a bar, you can filter the graph.
You can also create filters manually in the filter panel ("funnel" icon in the left menu bar). For instance, try filtering on point:pagerank such that point:pagerank >= 0.01. This selects the most "influential" accounts, which are the likely botnet owners/customers.
Still in the histogram panel, you can show attributes visually using the graph's node/edge colors. Try clicking on each of the three square icons on top of each histogram. Notice that when point color is bound to community_spinglass, the "tail" of the network forms a distinct community. What makes those accounts different from the rest?
With the histogram panel open, click on data brush and then lasso a selection on the graph. The histograms highlight the subset of nodes under the selection. You can drag the data brush selection to compare different subgraphs.
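The pagerank filter described above can also be approximated outside the visualization. The sketch below applies the same threshold to a small list of node records in plain Python; the account names and scores are made up for illustration and do not come from the dataset.

```python
# Hypothetical node records; real values would come from nodes_df.
toy_nodes = [
    {"__nodeid__": "account_a", "pagerank": 0.0213},
    {"__nodeid__": "account_b", "pagerank": 0.000094},
    {"__nodeid__": "account_c", "pagerank": 0.0131},
    {"__nodeid__": "account_d", "pagerank": 0.000057},
]

THRESHOLD = 0.01  # same cutoff as the point:pagerank >= 0.01 filter

# Keep only nodes at or above the threshold, highest score first.
influential = sorted(
    (node for node in toy_nodes if node["pagerank"] >= THRESHOLD),
    key=lambda node: node["pagerank"],
    reverse=True,
)

for node in influential:
    print(node["__nodeid__"], node["pagerank"])
```

With the notebook's DataFrame, the equivalent selection would be nodes_df[nodes_df['pagerank'] >= 0.01].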


