Tutorial: Data Analysis in Graphistry

  1. Load data
  2. Plot:
    • Simple: input is a list of edges
    • Arbitrary: input is a table (hypergraph transform)
  3. Advanced bindings
  4. Further docs

In [2]:
import graphistry
#graphistry.register(key='MY_API_KEY', server='labs.graphistry.com')

1. Load CSV

Graphistry works seamlessly with Pandas dataframes


In [3]:
import pandas as pd

df = pd.read_csv('./data/honeypot.csv')
df.sample(3)


Out[3]:
attackerIP victimIP victimPort vulnName count time(max) time(min)
168 59.91.217.236 172.31.14.66 445.0 MS08067 (NetAPI) 5 1.416331e+09 1.416330e+09
16 117.194.34.106 172.31.14.66 445.0 MS08067 (NetAPI) 9 1.415973e+09 1.415972e+09
107 195.189.111.210 172.31.14.66 445.0 MS08067 (NetAPI) 8 1.416838e+09 1.416836e+09

2. Plot

A. Simple graphs

  • Build up a set of bindings. Simple graphs are for edge lists, or an edge list + node list.
  • See UI Guide for in-tool activity

Demo graph schema:

  • Edges: Alerts linking attackerIP -> victimIP
  • Nodes: Synthesized from attackerIP -> victimIP edges
  • Default colors: Automatic based on inferred commmunity
  • Default node size: Number of edges

In [4]:
g = graphistry.edges(df).bind(source='attackerIP', destination='victimIP')

In [5]:
g.plot()


Out[5]:

B. Hypergraphs -- Plot arbitrary tables

To quickly understand correlations across all your table's values, hypergraph is a convenient transformation.

A hypergraph will link values occurring in the sample table row to one another. By default, the hypergraph plot does not link values directly to one another, but indirects through a node representing the row.

Demo graph schema:

  • Edges: row -> attckerIP, row -> victimIP, row -> victimPort, row -> volnName
  • Nodes: row, attackerIP, victimIP, victimPort, vulnName
  • Default colors: Automatic based on inferred commmunity
  • Default node size: Number of edges

To allow nodes from the attackerIP and victimIP columns to merge together when they have the same value, instead of generating distinct nodes such as attackerIP::127.0.0.1 and victimIP::127.0.0.1, we combine them into one category, ip. The result is one node ip::127.0.0.1.


In [6]:
hg1 = graphistry.hypergraph(
    df,
    entity_types=['attackerIP', 'victimIP', 'victimPort', 'vulnName'],
    opts={
        'CATAGORIES': {
            'ip': ['attackerIP', 'victimIP'] #merge nodes across these columns
        }
    })

hg1_g = hg1['graph']
hg1_g.plot()


('# links', 880)
('# events', 220)
('# attrib entities', 221)
Out[6]:

For more advanced hypergraph control, we can skip the row node, and control which edges are generated, by enabling direct.

Demo graph schema:

  • Edges:
    • attackerIP -> victimIP, attackerIP -> victimPort, attackerIP -> vulnName
    • victimPort -> victimIP
    • vulnName -> victimIP
  • Nodes: attackerIP, victimIP, victimPort, vulnName
  • Default colors: Automatic based on inferred commmunity
  • Default node size: Number of edges

In [7]:
hg2 = graphistry.hypergraph(
    df,
    entity_types=['attackerIP', 'victimIP', 'victimPort', 'vulnName'],
    direct=True,
    opts={
        'EDGES': { ### OPTIONAL, DEFAULTS TO CREATING ALL-TO-ALL
            'attackerIP': ['victimIP', 'victimPort', 'vulnName'],
            'victimPort': ['victimIP'],
            'vulnName': ['victimIP']         
        },
        'CATAGORIES': {
            'ip': ['attackerIP', 'victimIP'] #merge nodes across these columns
        }
    })

hg2_g = hg2['graph']
hg2_g.plot()


('# links', 1100)
('# events', 220)
('# attrib entities', 221)
Out[7]:

3. Advanced bindings

By default, you do not need to explictly create a table of nodes. However, if you do provide one, you can then drive visual styles based on node attributes.

Demo schema:


In [12]:
# 1. Create nodes, tag type as `attacker`

targets_df = df[['victimIP']].drop_duplicates().rename(columns={'victimIP': 'node_id'})\
    .assign(type='victim')

attackers_df = df.groupby(['attackerIP']).agg({'count': {'attacks': 'sum'}}).reset_index()
attackers_df.columns = attackers_df.columns.get_level_values(0)
attackers_df = attackers_df.rename(columns={'attackerIP': 'node_id'}).assign(type='attacker')
attackers_df

nodes_df = pd.concat([targets_df, attackers_df], ignore_index=True)
nodes_df.sample(3)


Out[12]:
count node_id type
32 3.0 124.123.70.99 attacker
177 2.0 85.192.166.151 attacker
2 6.0 1.235.32.141 attacker

In [9]:
# 2. Plot nodes, and color based on type `attacker`

g2 = g.nodes(nodes_df).bind(node='node_id')

#optional
nodes_df['my_color'] = nodes_df['type'].apply(lambda v: 0 if v == 'attacker' else 2)
nodes_df = nodes_df.fillna(value={'count': (nodes_df['count'].max() + nodes_df['count'].min()) / 2.0 })
g2 = g2.bind(point_size = 'count', point_color='my_color')
g2 = g2.settings(url_params={'workbook': 'my_analysis_wb_1'})

g2.plot()


Out[9]:

Advanced bindings work with hypergraphs too


In [10]:
nodes = hg2_g._nodes

types = list(nodes['type'].unique())
nodes_with_colors = nodes.assign(color=nodes.type.apply(lambda t: types.index(t)))
nodes_with_colors.sample(3)


Out[10]:
attackerIP nodeID nodeTitle type victimIP victimPort vulnName category color
112 220.172.133.215 attackerIP::220.172.133.215 220.172.133.215 attackerIP NaN NaN NaN attackerIP 0
57 179.25.208.154 attackerIP::179.25.208.154 179.25.208.154 attackerIP NaN NaN NaN attackerIP 0
121 31.135.61.170 attackerIP::31.135.61.170 31.135.61.170 attackerIP NaN NaN NaN attackerIP 0

In [11]:
hg2_g\
  .nodes(nodes_with_colors).bind(point_color='color')\
  .settings(url_params={'workbook': 'my_analysis_wb_2'})\
  .plot()


Out[11]:

In [ ]:


In [ ]: