Apache access logs are text files that record the activity of a web site. Analyzing these log files provides useful insights for webmasters and site owners. This example illustrates how to use the data analysis functions of Pandas to analyze log files.
For this exercise we will need to import the following libraries:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dateutil.parser import parse
Log files come in various formats, which differ in the amount of information registered in the log. The two most common formats are the 'common' and the 'combined'. For more details about Apache log formats see Apache's Log Files. Here are a couple of examples:
Common format:
local - - [30/Mar/1995:20:28:05 -0700] "GET 6224.html HTTP/1.0" 200 2644
Combined format:
192.168.0.1 - - [02/Jan/2018:07:20:02 -0500] "GET /index.html HTTP/1.1" 200 8813 "-" "Mozilla/5.0 (Linux; Android 7.0; PRA-LX3 Build/HUAWEIPRA-LX3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.91 Mobile Safari/537.36"
The first element of a log line is the address or hostname of the client.
Next come the id and the user, which are usually not used and are recorded as a '-'.
Then, we have the date and time enclosed between '[' and ']'.
The next element is the request, which comes enclosed in quotes.
Then we have a couple of numbers, the HTTP status code and the response size. In some cases, they may be recorded as '-'.
In the combined format we have two additional elements, the referer and the user agent, both enclosed in quotes.
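Before turning to Pandas, it may help to see this structure explicitly. The following is a minimal sketch that splits a single combined-format line into its fields with a regular expression; the pattern and the field names are our own choice, and real-world logs may need a more forgiving expression:
import re

# A named group per field of the combined format; the referer/agent pair is optional,
# so the same pattern also matches common-format lines
logPattern = re.compile(r'(?P<client>\S+) (?P<id>\S+) (?P<user>\S+) '
                        r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
                        r'(?P<status>\S+) (?P<size>\S+)'
                        r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?')

line = '192.168.0.1 - - [02/Jan/2018:07:20:02 -0500] "GET /index.html HTTP/1.1" 200 8813 "-" "Mozilla/5.0"'
print(logPattern.match(line).groupdict())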
At first, the log lines may seem difficult to parse. Pandas simplifies that task with the read_csv function, which supports many customization options. We are going to need the following: whitespace as the field delimiter, custom column names, a date parser that combines the date and timezone fields, and converters for the numeric columns.
So, let's define our custom functions for parsing integers and dates:
In [2]:
def apacheDateParser(x, y):
    return parse(x + ' ' + y, fuzzy=True)

def myIntParser(x):
    try:
        # int() raises a ValueError if x is not a valid integer
        return int(x)
    except ValueError:
        return np.nan
And now we can read the log file into a Pandas DataFrame.
In [3]:
data = pd.read_csv('<Path to your access_log file>', encoding='iso-8859-1',
                   delim_whitespace=True,
                   header=None,
                   parse_dates={ 'dt': [3,4] },
                   date_parser=apacheDateParser,
                   index_col=0,
                   names=['client','id','user','datetime','tz','request','status','size','referer','agent'],
                   converters={ 'status': myIntParser, 'size': myIntParser },
                   dtype={ 'referer': object, 'agent': object })
We can check how many rows and columns are in the data with
In [4]:
print(data.shape)
And take a look at the data with
In [5]:
data.head()
Out[5]:
In this case, the log file is in the common format, so the referer and agent columns are empty.
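We can verify that with a quick sketch that checks whether both columns contain only missing values:
# Both values are True when the log carries no referer/agent information
data[['referer','agent']].isnull().all()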
We may also want to check that the type conversions work as expected
In [6]:
data.dtypes
Out[6]:
We can check some basic statistics of the data, for example
The mean transfer size is
In [7]:
data['size'].mean()
Out[7]:
and its standard deviation is
In [8]:
data['size'].std()
Out[8]:
We can see that in some cases the size was not a valid value
In [9]:
data['size'][data['size'].isnull()].head()
Out[9]:
When doing analysis on files with this kind of problem, be aware that cases like this introduce error into the results. For example, if we count the number of size records we obtain
In [10]:
data['size'].count()
Out[10]:
This is much lower than the number of rows in the table.
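A small sketch makes the gap explicit by counting how many size values were parsed to NaN:
# Requests whose size was logged as '-' and therefore parsed to NaN by myIntParser
missing = data['size'].isnull().sum()
print(missing, 'missing sizes out of', len(data), 'rows')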
Using a grouping criterion, we can obtain statistics about groups of records that share a certain characteristic. In this example, we are going to characterize the visits to the website by day of the week. So, we start by grouping the dataframe by weekday in order to count the number of hits. In this case, any column will do for the counting.
In [11]:
grpHitsWDay = data[['id']].groupby(data.index.weekday, sort=False)
Observe that the groupby operation returns a DataFrameGroupBy object
In [12]:
grpHitsWDay
Out[12]:
The indices attribute gives us a good intuition of how the groups are represented
In [13]:
grpHitsWDay.indices
Out[13]:
Each weekday (0,...,6) maps to an array that contains the rows of the dataframe belonging to that group. This is quite important because, when computing any metric, we have to use functions that operate over the arrays of the corresponding rows.
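To make this concrete, we can inspect a single group; the sketch below assumes the log actually contains entries for weekday 0 (Monday):
# Rows of the dataframe that fall on weekday 0 (Monday)
mondays = grpHitsWDay.get_group(0)
# The group size equals the length of its index array
print(len(mondays), len(grpHitsWDay.indices[0]))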
For example, counting the number of elements in the array gives us the total hits of the respective day
In [14]:
grpHitsWDay.count()
Out[14]:
Of course, it would be nicer if we gave the days of the week their corresponding names and named the column 'Hits'
In [15]:
hits = grpHitsWDay.count()
hits.index = [ 'Mon','Tue','Wed','Thu','Fri','Sat','Sun' ]
hits.columns = [ 'Hits' ]
hits
Out[15]:
The method describe gives a basic statistical description of the 'Hits' metric
In [16]:
hits.describe()
Out[16]:
Sometimes a picture is worth a thousand words
In [17]:
hits.plot(kind='bar', figsize=(8,6), colormap='summer', title='Hits per weekday', legend=False)
plt.show()
One may want to characterize several metrics at the same time. Let us characterize the behavior of hits and transferred bytes per weekday
In [18]:
grpWDay = data[ ['id','size'] ].groupby(data.index.weekday)
In this case, we want to count the number of hits and sum the number of bytes. The aggregate method allows passing the functions to apply to each column
In [19]:
stats = grpWDay.aggregate({ 'id':lambda x: x.count(), 'size':np.sum })
stats
Out[19]:
A couple of notes: the dictionary passed to aggregate maps each column to the function applied to its values, a lambda that counts the non-null entries of 'id' and np.sum for 'size'. Also, the aggregation skips the NaN values in 'size', so the byte totals only include requests with a valid size.
Of course, everything looks better with nicer names, so let us rename the columns and the indices
In [20]:
stats = grpWDay.aggregate({ 'id':lambda x: x.count(), 'size':np.sum }).rename(columns={'size':'Bytes', 'id':'Hits'})
stats.index=[ 'Mon','Tue','Wed','Thu','Fri','Sat','Sun' ]
stats
Out[20]:
Now let us plot both metrics
In [21]:
stats.plot(kind='bar', figsize=(8,6), colormap='summer', title='Hits & bytes per weekday', subplots=True)
plt.show()
Sometimes, we can obtain additional information by doing some operations over the data. One good example is identifying the resources of the web site. Currently, the dataset contains a request column that gives the method, the resource and the protocol version
In [22]:
data['request'].head(10)
Out[22]:
The HTTP protocol dictates that the three fields are separated by spaces and that the request line cannot contain any additional spaces. Thus, a simple split gives us the three fields. Let's grab the resource and add an extra column to the dataframe
In [23]:
data['resource'] = data['request'].apply(lambda x: x.split()[1])
data['resource'].head(10)
Out[23]:
Here we use the apply method, which invokes a function on each element of the column. We define a lambda function that takes the request string, splits it and returns the element at position 1.
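An equivalent way to build the column is to use pandas' vectorized string methods instead of apply; the following sketch (the name resource_alt is ours, for comparison only) should produce the same result:
# Vectorized alternative: split each request string and take the token at position 1
resource_alt = data['request'].str.split().str[1]
# Should print True, matching the column created with apply
print((resource_alt == data['resource']).all())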
Well then, let us find the number of hits and bytes for each resource
In [24]:
grpRsc = data[ ['id','size'] ].groupby(data['resource'])
stats = grpRsc.aggregate({ 'id':lambda x: x.count(), 'size':np.sum }).rename(columns={'size':'XferBytes', 'id':'Hits'})
stats
Out[24]:
It is commonly the case that we are interested in the top hitters. So, let us sort the information and grab the top ten hitters in descending order
In [25]:
sortedh = stats.sort_values(by='Hits', ascending=False)
sortedh.head(10)
Out[25]:
Observe that the resources with the most hits are not responsible for most of the transferred bytes. Let us look at this from the perspective of transferred bytes
In [26]:
sortedb = stats.sort_values(by='XferBytes', ascending=False)
sortedb.head(10)
Out[26]:
Now we know what to optimize to improve the bandwidth utilization of the website. How about plotting some of this information?
In [27]:
sortedb.head(10).plot(kind='bar', figsize=(8,5), colormap='summer', title='Xfer & Hits (sorted by Xfer)', subplots=True)
plt.show()
Pandas allows doing many different analyses with just a few lines of code, and the same approach can be extended to characterize other aspects of the traffic, for example the visits per hour of the day, as sketched below.
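A possible sketch of the hits per hour, following the same grouping pattern we used for weekdays:
# Hits per hour of the day (0..23), grouping on the hour of the datetime index
hitsHour = data[['id']].groupby(data.index.hour).count()
hitsHour.columns = ['Hits']
hitsHour.plot(kind='bar', figsize=(8,6), colormap='summer', title='Hits per hour', legend=False)
plt.show()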
The complete script is available as logAnalyzer.py in GitHub.