In this part, we will:

- parse raw Apache access logs into a clean Pandas dataframe
- plot which hosts access which paths with Graphistry
- bundle repeated requests into summary edges for a cleaner graph
- connect the paths visited by the same host to reveal browsing sessions
- plot several entity types at once with a hypergraph
You can download this notebook to run it locally.
In [1]:
import pandas
import graphistry

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# graphistry.register(key='<go to www.graphistry.com/api-request to get one api key>', server='labs.graphistry.com')
Raw Apache logs are a bit tricky to parse:
- The time field contains a space, so it gets split into two columns. We merge them back together.
- The cmd_path_proto field bundles the HTTP command, the path accessed, and the protocol version into a single column. We split it into three columns.

Sample raw data:
136.243.14.137 - - [14/Feb/2015:01:56:03 -0800] "GET /robots.txt HTTP/1.0" 200 252 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
136.243.14.137 - - [14/Feb/2015:01:56:10 -0800] "GET /honeypot//%22http://amunhoney.sourceforge.net//%22 HTTP/1.0" 404 284 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
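To see these quirks concretely, here is a minimal sketch on the first sample line above (splitting on spaces, as the parser below does):

raw = '136.243.14.137 - - [14/Feb/2015:01:56:03 -0800] "GET /robots.txt HTTP/1.0" 200 252 "-" "..."'
parts = raw.split(' ')
parts[3], parts[4]        # ('[14/Feb/2015:01:56:03', '-0800]') -- the timestamp lands in two fields
'GET /robots.txt HTTP/1.0'.split()  # ['GET', '/robots.txt', 'HTTP/1.0'] -- three values in one field
unquote('/honeypot//%22')           # '/honeypot//"' -- URL-encoded paths need decoding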
In [6]:
url = 'http://www.secrepo.com/self.logs/2015/access.log.2015-02-14.gz'

def parseApacheLogs(filename):
    fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto',
              'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']
    data = pandas.read_csv(filename, compression='gzip', sep=' ', header=None, names=fields, na_values=['-'])

    # Pandas' parser mistakenly splits the date into two columns, so we must concatenate them
    time = data.time_part1 + data.time_part2
    time_trimmed = time.map(lambda s: s.strip('[]').split('-')[0])  # Drop the timezone for simplicity
    data['time'] = pandas.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')

    # Split column cmd_path_proto into three columns, and decode the URL (ex: '%20' => ' ')
    data['command'], data['path'], data['protocol'] = zip(*data['cmd_path_proto'].str.split().tolist())
    data['path'] = data['path'].map(lambda s: unquote(s))

    # Drop the merged/split columns and any empty ones
    data1 = data.drop(['time_part1', 'time_part2', 'cmd_path_proto'], axis=1)
    return data1.dropna(axis=1, how='all')

logs = parseApacheLogs(url)
logs[:3]
Out[6]:
In [7]:
def host2pathGraph(logs):
    def getEdgeTable(logs):
        edges = logs.copy()
        # Color edges by HTTP result code
        http_code_to_color = {code: color for color, code in enumerate(edges['http_code'].unique())}
        edges['ecolor'] = edges['http_code'].map(lambda code: http_code_to_color[code])
        return edges

    def getNodeTable(edges):
        # One node per host and one node per path, colored by type
        nodes0 = edges['host'].to_frame('nodeid')
        nodes0['pcolor'] = 96000
        nodes1 = edges['path'].to_frame('nodeid')
        nodes1['pcolor'] = 96001
        return pandas.concat([nodes0, nodes1], ignore_index=True).drop_duplicates()

    edges = getEdgeTable(logs)
    nodes = getNodeTable(edges)
    return (edges, nodes)

g = graphistry.bind(source='host', destination='path', node='nodeid',
                    edge_color='ecolor', point_color='pcolor')
g.plot(*host2pathGraph(logs))
Out[7]:
To avoid crowding the graph with many edges between the same nodes, we bundle multi-edges into a single edge with added summary attributes. A multi-edge is a set of edges that share the same source and destination.
For each bundle of requests, we compute the number of requests, the earliest and latest request times, and the most frequent referer.
The first two computations use Pandas' built-in min and max aggregators. Then, to extract the most frequent referer, we write our own custom aggregator: mostFrequent.
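The custom aggregator relies on value_counts() sorting by descending frequency, so the first index entry is the most frequent value. A quick sanity check:

pandas.Series(['a', 'b', 'a']).value_counts().index[0]  # -> 'a', the most frequent value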
In [8]:
# Bundle edges into a Pandas group when they share the same attributes like 'host' and 'path'
grouped_logs = logs.groupby(['host', 'path', 'user_agent', 'command', 'protocol', 'http_code'])

# Make dataframes count, min_time, max_time, and referer that are indexed by the groupby keys
count = grouped_logs.size().to_frame('count')
min_time = grouped_logs['time'].agg('min').to_frame('time (min)')
max_time = grouped_logs['time'].agg('max').to_frame('time (max)')

def mostFrequent(x):
    # value_counts() sorts by descending frequency, so the first entry is the mode
    s = x.value_counts()
    return s.index[0] if len(s.index) > 0 else None

referer = grouped_logs['referer'].agg(mostFrequent)

# Join into one table based on the same groupby keys.
# We remove the indexes (via reset_index) since we do not need them anymore.
summary = count.join([min_time, max_time, referer]).reset_index()
summary[:3]
Out[8]:
Plot the summarized graph. For an even cleaner view, try using a histogram filter in the visualization to show only nodes with a degree of 100 or less.
In [9]:
g.plot(*host2pathGraph(summary))
Out[9]:
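You can also do this pruning in Pandas before plotting. A minimal sketch, assuming a node's degree is simply the number of summary edges touching it:

deg = pandas.concat([summary['host'], summary['path']]).value_counts()
keep = deg[deg <= 100].index
pruned = summary[summary['host'].isin(keep) & summary['path'].isin(keep)]
g.plot(*host2pathGraph(pruned))

Next, to reveal browsing sessions, we connect every pair of paths visited by the same host (a self-join on host) and color each host's edges distinctly: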
In [10]:
def path2pathGraph(summary):
    host2path = summary[['host', 'path']].copy()
    # Strip query strings so /page?a=1 and /page?a=2 map to the same node
    host2path['path'] = host2path['path'].map(lambda p: p.split('?')[0])
    # Self-join on host: connect every pair of paths visited by the same host
    sessions = pandas.merge(host2path, host2path, on='host').drop_duplicates()
    # Give each host its own edge color
    host2color = {host: 265000 + index for index, host in enumerate(sessions.host.unique())}
    sessions['ecolor'] = sessions['host'].map(lambda x: host2color[x])
    return sessions

sessionEdges = path2pathGraph(summary)
sessionEdges[:3]
Out[10]:
In [11]:
graphistry.bind(source='path_x', destination='path_y', edge_color='ecolor').plot(sessionEdges)
Out[11]:
For example, you can quickly explore the browsing session of an individual host:
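A minimal sketch, filtering the session edges down to one host (here the sample host from the raw data above; any value in sessionEdges['host'] works):

one_host = sessionEdges[sessionEdges['host'] == '136.243.14.137']
graphistry.bind(source='path_x', destination='path_y', edge_color='ecolor').plot(one_host)

We can also relate each host to the user agents it presents: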
In [12]:
graphistry.bind(source='host', destination='user_agent').plot(summary)
Out[12]:
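Finally, we can plot several kinds of entities and their relationships at once using Graphistry's hypergraph transform. With direct=True, it draws edges directly between the chosen entity columns: the EDGES option picks which column pairs become edges, and CATEGORIES places path and referer values in a shared url namespace, so the same URL becomes a single node whether it was requested or linked from.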
In [19]:
hg = graphistry.hypergraph(
    logs,
    entity_types=['host', 'path', 'referer', 'user_agent'],
    direct=True,
    opts={
        'EDGES': {
            'host': ['path', 'user_agent'],
            'user_agent': ['path'],
            'referer': ['path']
        },
        'CATEGORIES': {
            'url': ['path', 'referer']
        }
    })

hg['graph'].plot()
Out[19]:
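The transform returns more than the plottable graph. A minimal sketch for inspecting the generated tables, assuming the standard result keys of graphistry.hypergraph:

hg['entities'][:3]  # one node per unique host/path/referer/user_agent value
hg['edges'][:3]     # the direct edges declared in opts['EDGES']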