In this part, we will:

- parse raw Apache access logs into a clean Pandas dataframe
- plot which hosts access which paths with Graphistry
- bundle repeated requests into summary edges for a cleaner graph
- connect the paths visited by the same host to reveal browsing sessions
- plot several entity types at once with a hypergraph
You can download this notebook to run it locally.
In [1]:
import pandas
import graphistry

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# graphistry.register(key='<go to www.graphistry.com/api-request to get one api key>', server='labs.graphistry.com')
Raw Apache logs are a bit tricky to parse:
- The time field contains a space, so it gets split into two columns. We merge them back together.
- The cmd_path_proto field bundles the HTTP command, the path accessed, and the protocol version into a single column. We split it into three columns.

Sample raw data:
136.243.14.137 - - [14/Feb/2015:01:56:03 -0800] "GET /robots.txt HTTP/1.0" 200 252 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
136.243.14.137 - - [14/Feb/2015:01:56:10 -0800] "GET /honeypot//%22http://amunhoney.sourceforge.net//%22 HTTP/1.0" 404 284 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
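To see these quirks concretely, here is a minimal sketch on the first sample line above (splitting on spaces, as the parser below does):

raw = '136.243.14.137 - - [14/Feb/2015:01:56:03 -0800] "GET /robots.txt HTTP/1.0" 200 252 "-" "..."'
parts = raw.split(' ')
parts[3], parts[4]        # ('[14/Feb/2015:01:56:03', '-0800]') -- the timestamp lands in two fields
'GET /robots.txt HTTP/1.0'.split()  # ['GET', '/robots.txt', 'HTTP/1.0'] -- three values in one field
unquote('/honeypot//%22')           # '/honeypot//"' -- URL-encoded paths need decoding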
In [6]:
url = 'http://www.secrepo.com/self.logs/2015/access.log.2015-02-14.gz'

def parseApacheLogs(filename):
    fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto',
              'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']
    data = pandas.read_csv(filename, compression='gzip', sep=' ', header=None, names=fields, na_values=['-'])

    # Pandas' parser mistakenly splits the date into two columns, so we must concatenate them
    time = data.time_part1 + data.time_part2
    time_trimmed = time.map(lambda s: s.strip('[]').split('-')[0])  # Drop the timezone for simplicity
    data['time'] = pandas.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')

    # Split column cmd_path_proto into three columns, and decode the URL (ex: '%20' => ' ')
    data['command'], data['path'], data['protocol'] = zip(*data['cmd_path_proto'].str.split().tolist())
    data['path'] = data['path'].map(lambda s: unquote(s))

    # Drop the merged/split columns and any empty ones
    data1 = data.drop(['time_part1', 'time_part2', 'cmd_path_proto'], axis=1)
    return data1.dropna(axis=1, how='all')

logs = parseApacheLogs(url)
logs[:3]
Out[6]:
In [7]:
def host2pathGraph(logs):
    def getEdgeTable(logs):
        edges = logs.copy()
        # Color edges by HTTP result code
        http_code_to_color = {code: color for color, code in enumerate(edges['http_code'].unique())}
        edges['ecolor'] = edges['http_code'].map(lambda code: http_code_to_color[code])
        return edges

    def getNodeTable(edges):
        # One node per host and one node per path, colored by type
        nodes0 = edges['host'].to_frame('nodeid')
        nodes0['pcolor'] = 96000
        nodes1 = edges['path'].to_frame('nodeid')
        nodes1['pcolor'] = 96001
        return pandas.concat([nodes0, nodes1], ignore_index=True).drop_duplicates()

    edges = getEdgeTable(logs)
    nodes = getNodeTable(edges)
    return (edges, nodes)

g = graphistry.bind(source='host', destination='path', node='nodeid',
                    edge_color='ecolor', point_color='pcolor')
g.plot(*host2pathGraph(logs))
Out[7]:
To avoid crowding the graph with many edges between the same nodes, we bundle multi-edges into a single edge with added summary attributes. A multi-edge is a set of edges that share the same source and destination.
For each bundle of requests, we compute the number of requests, the earliest and latest request times, and the most frequent referer.
The first two computations use Pandas' built-in min and max aggregators. Then, to extract the most frequent referer, we write our own custom aggregator: mostFrequent.
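The custom aggregator relies on value_counts() sorting by descending frequency, so the first index entry is the most frequent value. A quick sanity check:

pandas.Series(['a', 'b', 'a']).value_counts().index[0]  # -> 'a', the most frequent value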
In [8]:
# Bundle edges into a Pandas group when they share the same attributes like 'host' and 'path'
grouped_logs = logs.groupby(['host', 'path', 'user_agent', 'command', 'protocol', 'http_code'])

# Make dataframes count, min_time, max_time, and referer that are indexed by the groupby keys
count = grouped_logs.size().to_frame('count')
min_time = grouped_logs['time'].agg('min').to_frame('time (min)')
max_time = grouped_logs['time'].agg('max').to_frame('time (max)')

def mostFrequent(x):
    # value_counts() sorts by descending frequency, so the first entry is the mode
    s = x.value_counts()
    return s.index[0] if len(s.index) > 0 else None

referer = grouped_logs['referer'].agg(mostFrequent)

# Join into one table based on the same groupby keys.
# We remove the indexes (via reset_index) since we do not need them anymore.
summary = count.join([min_time, max_time, referer]).reset_index()
summary[:3]
Out[8]:
Plot the summarized graph. For an even cleaner view, try using a histogram filter in the visualization to show only nodes with a degree of 100 or less.
In [9]:
g.plot(*host2pathGraph(summary))
Out[9]:
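You can also do this pruning in Pandas before plotting. A minimal sketch, assuming a node's degree is simply the number of summary edges touching it:

deg = pandas.concat([summary['host'], summary['path']]).value_counts()
keep = deg[deg <= 100].index
pruned = summary[summary['host'].isin(keep) & summary['path'].isin(keep)]
g.plot(*host2pathGraph(pruned))

Next, to reveal browsing sessions, we connect every pair of paths visited by the same host (a self-join on host) and color each host's edges distinctly: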
In [10]:
def path2pathGraph(summary):
    host2path = summary[['host', 'path']].copy()
    # Strip query strings so /page?a=1 and /page?a=2 map to the same node
    host2path['path'] = host2path['path'].map(lambda p: p.split('?')[0])
    # Self-join on host: connect every pair of paths visited by the same host
    sessions = pandas.merge(host2path, host2path, on='host').drop_duplicates()
    # Give each host its own edge color
    host2color = {host: 265000 + index for index, host in enumerate(sessions.host.unique())}
    sessions['ecolor'] = sessions['host'].map(lambda x: host2color[x])
    return sessions

sessionEdges = path2pathGraph(summary)
sessionEdges[:3]
Out[10]:
In [11]:
graphistry.bind(source='path_x', destination='path_y', edge_color='ecolor').plot(sessionEdges)
Out[11]:
For example, you can quickly explore the browsing session of an individual host:
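A minimal sketch, filtering the session edges down to one host (here the sample host from the raw data above; any value in sessionEdges['host'] works):

one_host = sessionEdges[sessionEdges['host'] == '136.243.14.137']
graphistry.bind(source='path_x', destination='path_y', edge_color='ecolor').plot(one_host)

We can also relate each host to the user agents it presents: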
In [12]:
graphistry.bind(source='host', destination='user_agent').plot(summary)
Out[12]:
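Finally, we can plot several kinds of entities and their relationships at once using Graphistry's hypergraph transform. With direct=True, it draws edges directly between the chosen entity columns: the EDGES option picks which column pairs become edges, and CATEGORIES places path and referer values in a shared url namespace, so the same URL becomes a single node whether it was requested or linked from.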
In [19]:
hg = graphistry.hypergraph(
    logs,
    entity_types=['host', 'path', 'referer', 'user_agent'],
    direct=True,
    opts={
        'EDGES': {
            'host': ['path', 'user_agent'],
            'user_agent': ['path'],
            'referer': ['path']
        },
        'CATEGORIES': {
            'url': ['path', 'referer']
        }
    })

hg['graph'].plot()
Out[19]:
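The transform returns more than the plottable graph. A minimal sketch for inspecting the generated tables, assuming the standard result keys of graphistry.hypergraph:

hg['entities'][:3]  # one node per unique host/path/referer/user_agent value
hg['edges'][:3]     # the direct edges declared in opts['EDGES']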