AWS CloudWatch VPC Flow Logs <> Graphistry

Analyze CloudWatch logs with Graphistry, such as using VPC flow logs to map out an account

This example uses the AWS CLI directly for CloudWatch API access. You can also work from S3 or systems like Athena.
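
For example, if your flow logs are delivered to S3 instead of CloudWatch, the gzipped, space-delimited log files can be loaded straight into pandas. A minimal sketch, assuming s3fs is installed and using a hypothetical bucket and key:

import pandas as pd

# Hypothetical path: VPC flow logs land in S3 as gzipped, space-delimited text with a header row
s3_path = 's3://my-flowlog-bucket/AWSLogs/123456789012/vpcflowlogs/us-west-2/2019/04/28/sample.log.gz'
s3_df = pd.read_csv(s3_path, sep=' ', compression='gzip')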

Install & configure

Set aws_access_key_id, aws_secret_access_key, and your Graphistry key below, or pull them from your environment


In [0]:
!pip install graphistry -q
!pip install awscli -q

In [0]:
!aws configure set region us-west-2
!aws configure set aws_access_key_id "FILL_ME_IN"
!aws configure set aws_secret_access_key "FILL_ME_IN"

In [0]:
import pandas as pd
import json
import graphistry
#graphistry.register(key='FILL_ME_IN', server='FILL_ME_IN')
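
As an alternative to hard-coding credentials above, you can pull them from your environment; the AWS CLI and SDKs automatically honor AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION, and shell commands run via `!` inherit them. A minimal sketch (GRAPHISTRY_API_KEY is a hypothetical variable name):

import os
import graphistry

# The `!aws ...` calls in this notebook inherit these environment variables
os.environ.setdefault('AWS_DEFAULT_REGION', 'us-west-2')
# os.environ['AWS_ACCESS_KEY_ID'] = 'FILL_ME_IN'
# os.environ['AWS_SECRET_ACCESS_KEY'] = 'FILL_ME_IN'

# Hypothetical env var holding your Graphistry key
if 'GRAPHISTRY_API_KEY' in os.environ:
    graphistry.register(key=os.environ['GRAPHISTRY_API_KEY'], server='FILL_ME_IN')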

Record logs

If you do not already have logs, you can record VPC flow logs from your EC2 console (a scripted alternative is sketched after this list):

  • Services -> EC2 -> Network Interfaces -> select interface(s) -> Action -> create flow log
    • Send to CloudWatch; use default settings for IAM and elsewhere
  • When enough data is available, stop logging
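
If you prefer not to click through the console, flow log creation can also be scripted. A minimal sketch with boto3 (assuming boto3 is installed; the ENI ID and IAM role ARN are placeholders):

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Placeholders: substitute your own ENI ID and an IAM role that can write to CloudWatch Logs
ec2.create_flow_logs(
    ResourceType='NetworkInterface',
    ResourceIds=['eni-0123456789abcdef0'],
    TrafficType='ALL',
    LogGroupName='VPCFlowDemo',
    DeliverLogsPermissionArn='arn:aws:iam::123456789012:role/flow-logs-role')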

Download & summarize logs

  • Pick a log group from those available
  • Fetch: See AWS docs on filter-log-events
  • Load into a dataframe
  • Compute summary stats

In [113]:
!aws logs describe-log-groups


{
    "logGroups": [
        {
            "logGroupName": "/aws/lambda/ami-test-AZInfoFunction-1V3BW2PT09ER2",
            "creationTime": 1534508995180,
            "metricFilterCount": 0,
            "arn": "arn:aws:logs:us-west-2:520859498379:log-group:/aws/lambda/ami-test-AZInfoFunction-1V3BW2PT09ER2:*",
            "storedBytes": 1615
        },
        {
            "logGroupName": "VPCFlowDemo",
            "creationTime": 1556422724248,
            "metricFilterCount": 0,
            "arn": "arn:aws:logs:us-west-2:520859498379:log-group:VPCFlowDemo:*",
            "storedBytes": 0
        }
    ]
}

In [40]:
!aws logs filter-log-events --log-group-name VPCFlowDemo > data.json
!ls -al data.json


-rw-r--r-- 1 root root 3761828 Apr 28 20:43 data.json
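
The single CLI call above is fine for a demo-sized group; for larger log groups, you may prefer to page through results programmatically. A minimal sketch using boto3's paginator (assuming boto3 is installed):

import boto3

logs = boto3.client('logs', region_name='us-west-2')
paginator = logs.get_paginator('filter_log_events')

# Accumulate events across all pages of the log group
events = []
for page in paginator.paginate(logGroupName='VPCFlowDemo'):
    events.extend(page['events'])

The DataFrame construction below then applies unchanged, with events in place of data['events'].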

In [108]:
with open('data.json', 'r') as f:
    data = json.load(f)

# Each CloudWatch event message is one space-delimited VPC flow log record
df = pd.DataFrame([x['message'].split(" ") for x in data['events']])
df.columns = ['version', 'accountid', 'interfaceid', 'src_ip', 'dest_ip', 'src_port', 'dest_port', 'protocol', 'packets', 'bytes', 'time_start', 'time_end', 'action', 'status']

print('# rows', len(df))
df.sample(3)


# rows 9671
Out[108]:
version accountid interfaceid src_ip dest_ip src_port dest_port protocol packets bytes time_start time_end action status
3748 2 520859498379 eni-03cefc09700cd0f3b 172.31.18.239 35.188.230.101 443 44448 6 8 3922 1556422848 1556422903 ACCEPT OK
6289 2 520859498379 eni-08275497a357fd66a 172.20.45.114 172.20.59.137 31161 22186 6 2 112 1556423050 1556423110 ACCEPT OK
1396 2 520859498379 eni-092275301fc5694d9 172.20.60.118 172.20.55.224 80 33936 6 2 112 1556422660 1556422718 ACCEPT OK
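
The time_start and time_end fields are unix-epoch seconds stored as strings; if you want human-readable timestamps for filtering or labeling, you can convert them on the df built above. A minimal sketch:

# Parse the epoch-second strings into pandas timestamps
for c in ['time_start', 'time_end']:
    df[c + '_dt'] = pd.to_datetime(df[c].astype(int), unit='s')

df[['time_start', 'time_start_dt', 'time_end', 'time_end_dt']].sample(3)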

In [114]:
# The split() fields are strings; cast packets/bytes to float for numeric aggregation
df2 = df.copy()
for c in ['packets', 'bytes']:
    df2[c] = df2[c].astype(float)

summary_df = df2\
    .groupby(['src_ip', 'dest_ip', 'interfaceid', 'dest_port', 'protocol', 'action', 'status'])\
    .agg({
        'time_start': ['min', 'max'],
        'time_end': ['min', 'max'],
        'packets': ['min', 'max', 'sum', 'count'],
        'bytes': ['min', 'max', 'sum', 'count']
    }).reset_index()
# Flatten the MultiIndex columns produced by agg(), e.g. ('bytes', 'sum') -> 'bytes_sum'
summary_df.columns = [(" ".join(x)).strip().replace(" ", "_") for x in list(summary_df.columns)]
print('# rows', len(summary_df))
summary_df.sample(3)


# rows 5049
Out[114]:
src_ip dest_ip interfaceid dest_port protocol action status time_start_min time_start_max time_end_min time_end_max packets_min packets_max packets_sum packets_count bytes_min bytes_max bytes_sum bytes_count
3107 172.20.55.224 172.20.61.101 eni-016babb4349103670 38076 6 ACCEPT OK 1556422627 1556422627 1556422686 1556422686 2.0 2.0 2.0 1 112.0 112.0 112.0 1
1356 172.20.45.114 172.20.41.131 eni-08275497a357fd66a 3240 6 ACCEPT OK 1556422990 1556422990 1556423050 1556423050 2.0 2.0 2.0 1 112.0 112.0 112.0 1
4311 172.20.60.118 172.20.59.137 eni-092275301fc5694d9 8842 6 ACCEPT OK 1556422660 1556422660 1556422718 1556422718 2.0 2.0 2.0 1 112.0 112.0 112.0 1
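
Before plotting, it can help to sanity-check the heaviest flows in the summary. A minimal sketch listing the top talkers by total bytes:

# Largest flows by total bytes transferred
summary_df.nlargest(10, 'bytes_sum')[['src_ip', 'dest_ip', 'dest_port', 'bytes_sum', 'packets_sum']]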

Plot


In [110]:
hg = graphistry.hypergraph(
    summary_df,
    entity_types=['src_ip', 'dest_ip'], #'dest_port', 'interfaceid', 'action', ...
    direct=True)
hg['graph'].bind(edge_title='bytes_sum').plot()


# links 5049
# events 5049
# attrib entities 255
Out[110]:
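
The commented-out entity_types above hint at richer variants: treating columns like dest_port or interfaceid as node types surfaces which ports and interfaces tie flows together. A minimal sketch with ports as first-class nodes:

hg2 = graphistry.hypergraph(
    summary_df,
    entity_types=['src_ip', 'dest_ip', 'dest_port'],
    direct=True)
hg2['graph'].bind(edge_title='bytes_sum').plot()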
