BlazingSQL + Graphistry: Netflow analysis

This tutorial shows running BlazingSQL (GPU-accelerated SQL) on raw parquet files and visually analyzing the result with Graphistry

Download data



In [ ]:

    
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_0_0.parquet
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_1_0.parquet 
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_2_0.parquet
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_3_0.parquet



In [1]:

    
!ls -alh data









    



total 434M
drwxr-xr-x 2 graphistry graphistry 4.0K Nov 19 23:26 .
drwxr-xr-x 1 graphistry root       4.0K Nov 19 23:37 ..
-rw-r--r-- 1 graphistry graphistry 108M Jul 30 22:02 1_0_0.parquet
-rw-r--r-- 1 graphistry graphistry 135M Jul 30 22:01 1_1_0.parquet
-rw-r--r-- 1 graphistry graphistry  85M Jul 30 22:01 1_2_0.parquet
-rw-r--r-- 1 graphistry graphistry 108M Jul 30 22:01 1_3_0.parquet

Load data into table



In [2]:

    
from blazingsql import BlazingContext 

bc = BlazingContext()









    



BlazingContext ready



In [3]:

    
local_path = !pwd
local_path









    Out[3]:





['/home/graphistry']



In [4]:

    
bc.create_table('netflow', local_path[0] + '/data/*_0.parquet')









    Out[4]:





<pyblazing.apiv2.sql.Table at 0x7fd13cfa7cc0>

Compute IP<>IP flow summaries



In [5]:

    
%%time

result = bc.sql('''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        ''').get()

gdf = result.columns

gdf.head(3)









    



CPU times: user 311 ms, sys: 257 ms, total: 568 ms
Wall time: 4.75 s






    Out[5]:







  
    
      
      source
      destination
      targetPorts
      bytesOut
      bytesIn
      durationSeconds
      firstFlowDate
      lastFlowDate
      attemptCount
    
  
  
    
      0
      10.0.0.13
      46932
      8
      5064
      41
      1360
      2013-04-01 08:57:31
      2013-04-06 08:29:30
      8
    
    
      1
      10.0.0.12
      2605
      20
      12660
      100
      3520
      2013-04-01 09:29:15
      2013-04-07 06:51:00
      20
    
    
      2
      10.0.0.7
      29791
      10
      6320
      50
      1630
      2013-04-01 10:54:31
      2013-04-06 07:03:52
      10

Visualize network



In [6]:

    
import graphistry

#upload protobuf instead of json
graphistry.register(api=2)



In [7]:

    
len(gdf.to_pandas())









    Out[7]:





1303786



In [8]:

    
graphistry.bind(source='source', destination='destination').plot(gdf.to_pandas())









    



WARNING:  Large graph: |nodes| + |edges| = 1369523. Layout/rendering might be slow.






    



Uploading 19425 kB. This may take a while...






    Out[8]:

	source	destination	targetPorts	bytesOut	bytesIn	durationSeconds	firstFlowDate	lastFlowDate	attemptCount
0	10.0.0.13	46932	8	5064	41	1360	2013-04-01 08:57:31	2013-04-06 08:29:30	8
1	10.0.0.12	2605	20	12660	100	3520	2013-04-01 09:29:15	2013-04-07 06:51:00	20
2	10.0.0.7	29791	10	6320	50	1630	2013-04-01 10:54:31	2013-04-06 07:03:52	10