BlazingSQL + Graphistry: Netflow analysis

This tutorial shows running BlazingSQL (GPU-accelerated SQL) on raw parquet files and visually analyzing the result with Graphistry

Download data


In [ ]:
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_0_0.parquet
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_1_0.parquet 
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_2_0.parquet
!wget -Pq data/ https://blazingsql-colab.s3.amazonaws.com/netflow_parquet/1_3_0.parquet

In [1]:
!ls -alh data


total 434M
drwxr-xr-x 2 graphistry graphistry 4.0K Nov 19 23:26 .
drwxr-xr-x 1 graphistry root       4.0K Nov 19 23:37 ..
-rw-r--r-- 1 graphistry graphistry 108M Jul 30 22:02 1_0_0.parquet
-rw-r--r-- 1 graphistry graphistry 135M Jul 30 22:01 1_1_0.parquet
-rw-r--r-- 1 graphistry graphistry  85M Jul 30 22:01 1_2_0.parquet
-rw-r--r-- 1 graphistry graphistry 108M Jul 30 22:01 1_3_0.parquet

Load data into table


In [2]:
from blazingsql import BlazingContext 

bc = BlazingContext()


BlazingContext ready

In [3]:
local_path = !pwd
local_path


Out[3]:
['/home/graphistry']

In [4]:
bc.create_table('netflow', local_path[0] + '/data/*_0.parquet')


Out[4]:
<pyblazing.apiv2.sql.Table at 0x7fd13cfa7cc0>

Compute IP<>IP flow summaries


In [5]:
%%time

result = bc.sql('''
        SELECT
            a.firstSeenSrcIp as source,
            a.firstSeenDestIp as destination,
            count(a.firstSeenDestPort) as targetPorts,
            SUM(a.firstSeenSrcTotalBytes) as bytesOut,
            SUM(a.firstSeenDestTotalBytes) as bytesIn,
            SUM(a.durationSeconds) as durationSeconds,
            MIN(parsedDate) as firstFlowDate,
            MAX(parsedDate) as lastFlowDate,
            COUNT(*) as attemptCount
        FROM
            netflow a
        GROUP BY
            a.firstSeenSrcIp,
            a.firstSeenDestIp
        ''').get()

gdf = result.columns

gdf.head(3)


CPU times: user 311 ms, sys: 257 ms, total: 568 ms
Wall time: 4.75 s
Out[5]:
source destination targetPorts bytesOut bytesIn durationSeconds firstFlowDate lastFlowDate attemptCount
0 10.0.0.13 46932 8 5064 41 1360 2013-04-01 08:57:31 2013-04-06 08:29:30 8
1 10.0.0.12 2605 20 12660 100 3520 2013-04-01 09:29:15 2013-04-07 06:51:00 20
2 10.0.0.7 29791 10 6320 50 1630 2013-04-01 10:54:31 2013-04-06 07:03:52 10

Visualize network


In [6]:
import graphistry

#upload protobuf instead of json
graphistry.register(api=2)

In [7]:
len(gdf.to_pandas())


Out[7]:
1303786

In [8]:
graphistry.bind(source='source', destination='destination').plot(gdf.to_pandas())


WARNING:  Large graph: |nodes| + |edges| = 1369523. Layout/rendering might be slow.
Uploading 19425 kB. This may take a while...
Out[8]: