Visualize CSV Mini-App

Jupyter: File -> Make a copy
Colab: File -> Save a copy in Drive
Run notebook cells by pressing shift-enter
Either edit annd run top cells one-by-one, or edit and run the self-contained version at the bottom



In [1]:

    
#!pip install graphistry -q



In [2]:

    
import pandas as pd
import graphistry
#graphistry.register(key='MY_KEY', server='labs.graphistry.com')

1. Upload csv

Use a file by uploading it or via URL.

Run help(pd.read_csv) for more options.

File Upload: Jupyter Notebooks

If circle on top right not green, click kernel -> reconnect
Go to file directory (/tree) by clicking the Jupyter logo
Navigate to the directory page containing your notebook
Press the upload button on the top right

File Upload: Google Colab

Open the left sidebar by pressing the right arrow on the left
Go to the Files tab
Press UPLOAD
Make sure goes into /content

File Upload: URL

Uncomment below line and put in the actual data url
Run help(pd.read_csv) for more options



In [3]:

    
file_path = './events-1551346702.csv'
df = pd.read_csv(file_path)

print('# rows', len(df))
df.sample(min(len(df), 3))









    



# rows 92883






    



/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (4,15,23) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)






    Out[3]:







  
    
      
      id
      action
      action_classification
      action_level
      action_summary
      authorities_link
      country
      create_date
      data_notes
      date
      ...
      status
      target_audience
      type
      uid
      uid_hash
      url
      slug
      device_id
      created_at
      updated_at
    
  
  
    
      47601
      54812
      NaN
      II
      NaN
      NaN
      http://www.healthycanadians.gc.ca/recall-alert...
      CAN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      Recall
      23061
      02f800fd
      NaN
      can-architect-i-system-stat-troponin-i-assay
      51467
      2019-02-27 11:17:58 UTC
      2019-02-27 11:17:58 UTC
    
    
      23227
      26969
      NaN
      NaN
      NaN
      NaN
      https://www.swissmedic.ch/swissmedic/de/home.html
      CHE
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      Field Safety Notice
      Vk_20080826_01
      1c3b037b
      https://fsca.swissmedic.ch/mep/#?q=Vk_20080826_01
      che-hd3-diagnostic-ultrasound-system-model-cat...
      24987
      2019-02-27 10:54:39 UTC
      2019-02-27 10:54:39 UTC
    
    
      88819
      103605
      Products should be taken out of service.
      NaN
      NaN
      Recall/Exchange
      https://lakemedelsverket.se/
      SWE
      NaN
      CHC 13-02
      NaN
      ...
      NaN
      NaN
      Field Safety Notice
      6.6.4-2013-51906
      1a780322
      https://lakemedelsverket.se/malgrupp/Halso---s...
      swe-total-protein_2-urine-upro_2-on-advia-chem...
      98325
      2019-02-28 08:58:19 UTC
      2019-02-28 08:58:19 UTC
    
  

3 rows × 30 columns



In [4]:

    
df.columns









    Out[4]:





Index(['id', 'action', 'action_classification', 'action_level',
       'action_summary', 'authorities_link', 'country', 'create_date',
       'data_notes', 'date', 'date_initiated_by_firm', 'date_posted',
       'date_terminated', 'date_updated', 'determined_cause', 'documents',
       'icij_notes', 'number', 'reason', 'source', 'status', 'target_audience',
       'type', 'uid', 'uid_hash', 'url', 'slug', 'device_id', 'created_at',
       'updated_at'],
      dtype='object')

2. Optional: Clean up CSV



In [8]:

    
hits = pd.DataFrame([[c, len(df[c].unique())] for c in df.columns], columns=['col', 'num_uniq']).sort_values('num_uniq')

hits.query('num_uniq > 10 & num_uniq < 9288')









    Out[8]:







  
    
      
      col
      num_uniq
    
  
  
    
      2
      action_classification
      11
    
    
      6
      country
      19
    
    
      19
      source
      19
    
    
      16
      icij_notes
      20
    
    
      14
      determined_cause
      44
    
    
      20
      status
      279
    
    
      7
      create_date
      629
    
    
      12
      date_terminated
      3277
    
    
      11
      date_posted
      3279
    
    
      9
      date
      3857
    
    
      10
      date_initiated_by_firm
      5416
    
    
      29
      updated_at
      5489
    
    
      28
      created_at
      5489



In [10]:

    
skip_nodes = ['icij_notes', 'determined_cause', 'action_classification', 'icij_notes', 'country', 'status', 'source']
nodes = [x for x in list(hits.query('num_uniq > 10 & num_uniq < 9288')['col']) if not x in skip_nodes]

nodes









    Out[10]:





['create_date',
 'date_terminated',
 'date_posted',
 'date',
 'date_initiated_by_firm',
 'updated_at',
 'created_at']



In [12]:

    
df = df_orig.query('country == "USA"')

3. Configure: Visualize with 3 kinds of graphs

Set mode and the corresponding values:

Mode "A". See graph from table of (src,dst) edges

Mode "B". See hypergraph: Draw row as node and connect it to entities in same row

Pick which cols to make nodes
If multiple cols share same type (e.g., "src_ip", "dest_ip" are both "ip"), unify them

Mode "C". See by creating multiple nodes, edges per row

Pick how different column values point to other column values
If multiple cols share same type (e.g., "src_ip", "dest_ip" are both "ip"), unify them



In [13]:

    
#Pick 'A', 'B', or 'C'
mode = 'B' 
max_rows = 50000


### 'A' == mode
my_src_col = 'attackerIP'
my_dest_col = 'victimIP'



### 'B' == mode
node_cols = nodes
categories = { #optional
    #'date': [ 'create_date', 'date_initiated_by_firm', 'date_posted', 'date_terminated', 'updated_at' ]
    #'ip': ['attacker_IP', 'victimIP']
    #, 'user': ['owner', 'seller'],
}



### 'C' == mode
edges = {
      'attackerIP': [ 'victimIP', 'victimPort', 'vulnName'],
      'victimIP': [ 'victimPort'],
      'vulnName': [ 'victimIP' ]
}
categories = { #optional
      'ip': ['attackerIP', 'victimIP']
       #, user': ['owner', 'seller'], ...
}

4. Plot: Upload & render!

See UI guide: https://labs.graphistry.com/graphistry/ui.html



In [14]:

    
g = None
hg = None
num_rows = min(max_rows, len(df))
if mode == 'A':
    g = graphistry.edges(df.sample(num_rows)).bind(source=my_src_col, destination=my_dest_col)
elif mode == 'B':
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, opts={'CATEGORIES': categories})
    g = hg['graph']
elif mode == 'C':
    nodes = list(edges.keys())
    for dests in edges.values():
        for dest in dests:
            nodes.append(dest)
    node_cols = list(set(nodes))
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, direct=True, opts={'CATEGORIES': categories, 'EDGES': edges})
    g = hg['graph']
  
#hg
print(len(g._edges))

g.plot()









    



# links 173770
# events 35826
# attrib entities 14368
173770
Uploading 84603 kB. This may take a while...






    Out[14]:

Alternative: Combined

Split into data loading and cleaning/configuring/plotting.



In [15]:

    
#!pip install graphistry -q
import pandas as pd
import graphistry
#graphistry.register(key='MY_KEY', server='labs.graphistry.com')


##########
#1. Load
file_path = './events-1551346702.csv'
df = pd.read_csv(file_path)

print(df.columns)
print('rows:', len(df))
print(df.sample(min(len(df),3)))









    



Index(['id', 'action', 'action_classification', 'action_level',
       'action_summary', 'authorities_link', 'country', 'create_date',
       'data_notes', 'date', 'date_initiated_by_firm', 'date_posted',
       'date_terminated', 'date_updated', 'determined_cause', 'documents',
       'icij_notes', 'number', 'reason', 'source', 'status', 'target_audience',
       'type', 'uid', 'uid_hash', 'url', 'slug', 'device_id', 'created_at',
       'updated_at'],
      dtype='object')
rows: 92883
           id                                             action  \
47246   54457                                                NaN   
90383  105169  ANSM has been informed of the implementation o...   
51788   58999                                                NaN   

      action_classification  action_level  \
47246                    II           NaN   
90383                   NaN           NaN   
51788                    II           NaN   

                                          action_summary  \
47246                                                NaN   
90383  Surgical Motor - Conmed Hall L3000SM Small Lit...   
51788                                                NaN   

                                        authorities_link country create_date  \
47246  http://www.healthycanadians.gc.ca/recall-alert...     CAN         NaN   
90383                         https://www.ansm.sante.fr/     FRA         NaN   
51788  http://www.healthycanadians.gc.ca/recall-alert...     CAN         NaN   

      data_notes        date  ... status target_audience    type       uid  \
47246        NaN         NaN  ...    NaN             NaN  Recall     21796   
90383        NaN  2018-07-25  ...    NaN             NaN  Recall  R1811289   
51788        NaN         NaN  ...    NaN             NaN  Recall     50818   

       uid_hash                                                url  \
47246  0310010a                                                NaN   
90383  083701c1  https://ansm.sante.fr/S-informer/Informations-...   
51788  03100107                                                NaN   

                                            slug device_id  \
47246  can-encode-lab-designed-zirconia-abutment     51116   
90383   fra-hall-l3000sm-small-lithium-batteries     99819   
51788                 can-chemosite-system-ports     55584   

                    created_at               updated_at  
47246  2019-02-27 11:17:35 UTC  2019-02-27 11:17:35 UTC  
90383  2019-02-28 09:05:31 UTC  2019-02-28 09:05:31 UTC  
51788  2019-02-27 11:22:40 UTC  2019-02-27 11:22:40 UTC  

[3 rows x 30 columns]



In [16]:

    
##########
#2. Clean
#df = df.rename(columns={'attackerIP': 'src_ip', 'victimIP: 'dest_ip', 'victimPort': 'protocol'})

hits = pd.DataFrame([[c, len(df[c].unique())] for c in df.columns], columns=['col', 'num_uniq']).sort_values('num_uniq')


skip_nodes = ['icij_notes', 'determined_cause', 'action_classification', 'icij_notes', 'country', 'status', 'source']
nodes = [x for x in list(hits.query('num_uniq > 10 & num_uniq < 9288')['col']) if not x in skip_nodes]

df = df.query('country == "USA"')
    
##########
#3. Config - Pick 'A', 'B', or 'C'
mode = 'B' 
max_rows = 50000


### 'A' == mode
my_src_col = 'attackerIP'
my_dest_col = 'victimIP'

### 'B' == mode
node_cols = nodes
categories = { #optional
    #'ip': ['src_ip', 'dest_ip']
    #, 'user': ['owner', 'seller'],
}

### 'C' == mode
edges = {
    'attackerIP': [ 'victimIP', 'victimPort', 'vulnName'],
    'victimIP': [ 'victimPort' ],
    'vulnName': ['victimIP' ]
}
categories = { #optional
    #'ip': ['attackerIP', 'victimIP']
    #, 'user': ['owner', 'seller'], ...
}

##########
#4. Plot
g = None
hg = None
num_rows = min(max_rows, len(df))
if mode == 'A':
    g = graphistry.edges(df.sample(num_rows)).bind(source=my_src_col, destination=my_dest_col)
elif mode == 'B':
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, opts={'CATEGORIES': categories})
    g = hg['graph']
elif mode == 'C':
    nodes = list(edges.keys())
    for dests in edges.values():
        for dest in dests:
            nodes.append(dest)
    node_cols = list(set(nodes))
    hg = graphistry.hypergraph(df.sample(num_rows), node_cols, direct=True, opts={'CATEGORIES': categories, 'EDGES': edges})
    g = hg['graph']
  

g.plot()









    



# links 173770
# events 35826
# attrib entities 14368
Uploading 84590 kB. This may take a while...






    Out[16]:



In [ ]:

	id	action	action_classification	action_level	action_summary	authorities_link	country	create_date	data_notes	date	...	status	target_audience	type	uid	uid_hash	url	slug	device_id	created_at	updated_at
47601	54812	NaN	II	NaN	NaN	http://www.healthycanadians.gc.ca/recall-alert...	CAN	NaN	NaN	NaN	...	NaN	NaN	Recall	23061	02f800fd	NaN	can-architect-i-system-stat-troponin-i-assay	51467	2019-02-27 11:17:58 UTC	2019-02-27 11:17:58 UTC
23227	26969	NaN	NaN	NaN	NaN	https://www.swissmedic.ch/swissmedic/de/home.html	CHE	NaN	NaN	NaN	...	NaN	NaN	Field Safety Notice	Vk_20080826_01	1c3b037b	https://fsca.swissmedic.ch/mep/#?q=Vk_20080826_01	che-hd3-diagnostic-ultrasound-system-model-cat...	24987	2019-02-27 10:54:39 UTC	2019-02-27 10:54:39 UTC
88819	103605	Products should be taken out of service.	NaN	NaN	Recall/Exchange	https://lakemedelsverket.se/	SWE	NaN	CHC 13-02	NaN	...	NaN	NaN	Field Safety Notice	6.6.4-2013-51906	1a780322	https://lakemedelsverket.se/malgrupp/Halso---s...	swe-total-protein_2-urine-upro_2-on-advia-chem...	98325	2019-02-28 08:58:19 UTC	2019-02-28 08:58:19 UTC

	col	num_uniq
2	action_classification	11
6	country	19
19	source	19
16	icij_notes	20
14	determined_cause	44
20	status	279
7	create_date	629
12	date_terminated	3277
11	date_posted	3279
9	date	3857
10	date_initiated_by_firm	5416
29	updated_at	5489
28	created_at	5489