Visualize CSV Mini-App

  • Jupyter: File -> Make a copy
    Colab: File -> Save a copy in Drive
  • Run notebook cells by pressing shift-enter
  • Either edit annd run top cells one-by-one, or edit and run the self-contained version at the bottom

In [1]:
#!pip install graphistry -q

In [2]:
import pandas as pd
import graphistry
#graphistry.register(key='MY_KEY', server='labs.graphistry.com')

1. Upload csv

Use a file by uploading it or via URL.

Run help(pd.read_csv) for more options.

File Upload: Jupyter Notebooks

  • If circle on top right not green, click kernel -> reconnect
  • Go to file directory (/tree) by clicking the Jupyter logo
  • Navigate to the directory page containing your notebook
  • Press the upload button on the top right

File Upload: Google Colab

  • Open the left sidebar by pressing the right arrow on the left
  • Go to the Files tab
  • Press UPLOAD
  • Make sure goes into /content

File Upload: URL

  • Uncomment below line and put in the actual data url
  • Run help(pd.read_csv) for more options

In [3]:
file_path = './events-1551346702.csv'
df = pd.read_csv(file_path)

print('# rows', len(df))
df.sample(min(len(df), 3))


# rows 92883
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (4,15,23) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[3]:
id action action_classification action_level action_summary authorities_link country create_date data_notes date ... status target_audience type uid uid_hash url slug device_id created_at updated_at
47601 54812 NaN II NaN NaN http://www.healthycanadians.gc.ca/recall-alert... CAN NaN NaN NaN ... NaN NaN Recall 23061 02f800fd NaN can-architect-i-system-stat-troponin-i-assay 51467 2019-02-27 11:17:58 UTC 2019-02-27 11:17:58 UTC
23227 26969 NaN NaN NaN NaN https://www.swissmedic.ch/swissmedic/de/home.html CHE NaN NaN NaN ... NaN NaN Field Safety Notice Vk_20080826_01 1c3b037b https://fsca.swissmedic.ch/mep/#?q=Vk_20080826_01 che-hd3-diagnostic-ultrasound-system-model-cat... 24987 2019-02-27 10:54:39 UTC 2019-02-27 10:54:39 UTC
88819 103605 Products should be taken out of service. NaN NaN Recall/Exchange https://lakemedelsverket.se/ SWE NaN CHC 13-02 NaN ... NaN NaN Field Safety Notice 6.6.4-2013-51906 1a780322 https://lakemedelsverket.se/malgrupp/Halso---s... swe-total-protein_2-urine-upro_2-on-advia-chem... 98325 2019-02-28 08:58:19 UTC 2019-02-28 08:58:19 UTC

3 rows × 30 columns


In [4]:
df.columns


Out[4]:
Index(['id', 'action', 'action_classification', 'action_level',
       'action_summary', 'authorities_link', 'country', 'create_date',
       'data_notes', 'date', 'date_initiated_by_firm', 'date_posted',
       'date_terminated', 'date_updated', 'determined_cause', 'documents',
       'icij_notes', 'number', 'reason', 'source', 'status', 'target_audience',
       'type', 'uid', 'uid_hash', 'url', 'slug', 'device_id', 'created_at',
       'updated_at'],
      dtype='object')

2. Optional: Clean up CSV


In [8]:
hits = pd.DataFrame([[c, len(df[c].unique())] for c in df.columns], columns=['col', 'num_uniq']).sort_values('num_uniq')

hits.query('num_uniq > 10 & num_uniq < 9288')


Out[8]:
col num_uniq
2 action_classification 11
6 country 19
19 source 19
16 icij_notes 20
14 determined_cause 44
20 status 279
7 create_date 629
12 date_terminated 3277
11 date_posted 3279
9 date 3857
10 date_initiated_by_firm 5416
29 updated_at 5489
28 created_at 5489

In [10]:
skip_nodes = ['icij_notes', 'determined_cause', 'action_classification', 'icij_notes', 'country', 'status', 'source']
nodes = [x for x in list(hits.query('num_uniq > 10 & num_uniq < 9288')['col']) if not x in skip_nodes]

nodes


Out[10]:
['create_date',
 'date_terminated',
 'date_posted',
 'date',
 'date_initiated_by_firm',
 'updated_at',
 'created_at']

In [12]:
df = df_orig.query('country == "USA"')

3. Configure: Visualize with 3 kinds of graphs

Set mode and the corresponding values:

Mode "A". See graph from table of (src,dst) edges

Mode "B". See hypergraph: Draw row as node and connect it to entities in same row

  • Pick which cols to make nodes
  • If multiple cols share same type (e.g., "src_ip", "dest_ip" are both "ip"), unify them

Mode "C". See by creating multiple nodes, edges per row

  • Pick how different column values point to other column values
  • If multiple cols share same type (e.g., "src_ip", "dest_ip" are both "ip"), unify them

In [13]:
#Pick 'A', 'B', or 'C'
mode = 'B' 
max_rows = 50000


### 'A' == mode
my_src_col = 'attackerIP'
my_dest_col = 'victimIP'



### 'B' == mode
node_cols = nodes
categories = { #optional
    #'date': [ 'create_date', 'date_initiated_by_firm', 'date_posted', 'date_terminated', 'updated_at' ]
    #'ip': ['attacker_IP', 'victimIP']
    #, 'user': ['owner', 'seller'],
}



### 'C' == mode
edges = {
      'attackerIP': [ 'victimIP', 'victimPort', 'vulnName'],
      'victimIP': [ 'victimPort'],
      'vulnName': [ 'victimIP' ]
}
categories = { #optional
      'ip': ['attackerIP', 'victimIP']
       #, user': ['owner', 'seller'], ...
}

4. Plot: Upload & render!


In [14]:
g = None
hg = None
num_rows = min(max_rows, len(df))
if mode == 'A':
    g = graphistry.edges(df.sample(num_rows)).bind(source=my_src_col, destination=my_dest_col)
elif mode == 'B':
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, opts={'CATEGORIES': categories})
    g = hg['graph']
elif mode == 'C':
    nodes = list(edges.keys())
    for dests in edges.values():
        for dest in dests:
            nodes.append(dest)
    node_cols = list(set(nodes))
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, direct=True, opts={'CATEGORIES': categories, 'EDGES': edges})
    g = hg['graph']
  
#hg
print(len(g._edges))

g.plot()


# links 173770
# events 35826
# attrib entities 14368
173770
Uploading 84603 kB. This may take a while...
Out[14]:

Alternative: Combined

Split into data loading and cleaning/configuring/plotting.


In [15]:
#!pip install graphistry -q
import pandas as pd
import graphistry
#graphistry.register(key='MY_KEY', server='labs.graphistry.com')


##########
#1. Load
file_path = './events-1551346702.csv'
df = pd.read_csv(file_path)

print(df.columns)
print('rows:', len(df))
print(df.sample(min(len(df),3)))


Index(['id', 'action', 'action_classification', 'action_level',
       'action_summary', 'authorities_link', 'country', 'create_date',
       'data_notes', 'date', 'date_initiated_by_firm', 'date_posted',
       'date_terminated', 'date_updated', 'determined_cause', 'documents',
       'icij_notes', 'number', 'reason', 'source', 'status', 'target_audience',
       'type', 'uid', 'uid_hash', 'url', 'slug', 'device_id', 'created_at',
       'updated_at'],
      dtype='object')
rows: 92883
           id                                             action  \
47246   54457                                                NaN   
90383  105169  ANSM has been informed of the implementation o...   
51788   58999                                                NaN   

      action_classification  action_level  \
47246                    II           NaN   
90383                   NaN           NaN   
51788                    II           NaN   

                                          action_summary  \
47246                                                NaN   
90383  Surgical Motor - Conmed Hall L3000SM Small Lit...   
51788                                                NaN   

                                        authorities_link country create_date  \
47246  http://www.healthycanadians.gc.ca/recall-alert...     CAN         NaN   
90383                         https://www.ansm.sante.fr/     FRA         NaN   
51788  http://www.healthycanadians.gc.ca/recall-alert...     CAN         NaN   

      data_notes        date  ... status target_audience    type       uid  \
47246        NaN         NaN  ...    NaN             NaN  Recall     21796   
90383        NaN  2018-07-25  ...    NaN             NaN  Recall  R1811289   
51788        NaN         NaN  ...    NaN             NaN  Recall     50818   

       uid_hash                                                url  \
47246  0310010a                                                NaN   
90383  083701c1  https://ansm.sante.fr/S-informer/Informations-...   
51788  03100107                                                NaN   

                                            slug device_id  \
47246  can-encode-lab-designed-zirconia-abutment     51116   
90383   fra-hall-l3000sm-small-lithium-batteries     99819   
51788                 can-chemosite-system-ports     55584   

                    created_at               updated_at  
47246  2019-02-27 11:17:35 UTC  2019-02-27 11:17:35 UTC  
90383  2019-02-28 09:05:31 UTC  2019-02-28 09:05:31 UTC  
51788  2019-02-27 11:22:40 UTC  2019-02-27 11:22:40 UTC  

[3 rows x 30 columns]

In [16]:
##########
#2. Clean
#df = df.rename(columns={'attackerIP': 'src_ip', 'victimIP: 'dest_ip', 'victimPort': 'protocol'})

hits = pd.DataFrame([[c, len(df[c].unique())] for c in df.columns], columns=['col', 'num_uniq']).sort_values('num_uniq')


skip_nodes = ['icij_notes', 'determined_cause', 'action_classification', 'icij_notes', 'country', 'status', 'source']
nodes = [x for x in list(hits.query('num_uniq > 10 & num_uniq < 9288')['col']) if not x in skip_nodes]

df = df.query('country == "USA"')
    
##########
#3. Config - Pick 'A', 'B', or 'C'
mode = 'B' 
max_rows = 50000


### 'A' == mode
my_src_col = 'attackerIP'
my_dest_col = 'victimIP'

### 'B' == mode
node_cols = nodes
categories = { #optional
    #'ip': ['src_ip', 'dest_ip']
    #, 'user': ['owner', 'seller'],
}

### 'C' == mode
edges = {
    'attackerIP': [ 'victimIP', 'victimPort', 'vulnName'],
    'victimIP': [ 'victimPort' ],
    'vulnName': ['victimIP' ]
}
categories = { #optional
    #'ip': ['attackerIP', 'victimIP']
    #, 'user': ['owner', 'seller'], ...
}

##########
#4. Plot
g = None
hg = None
num_rows = min(max_rows, len(df))
if mode == 'A':
    g = graphistry.edges(df.sample(num_rows)).bind(source=my_src_col, destination=my_dest_col)
elif mode == 'B':
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, opts={'CATEGORIES': categories})
    g = hg['graph']
elif mode == 'C':
    nodes = list(edges.keys())
    for dests in edges.values():
        for dest in dests:
            nodes.append(dest)
    node_cols = list(set(nodes))
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, direct=True, opts={'CATEGORIES': categories, 'EDGES': edges})
    g = hg['graph']
  

g.plot()


# links 173770
# events 35826
# attrib entities 14368
Uploading 84590 kB. This may take a while...
Out[16]:

In [ ]: