version 1.0.0
lambda
functions that are applied at each worker.127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839
127.0.0.1
####This is the IP address (or host name, if available) of the client (remote host) which made the request to the server.
-
####The "hyphen" in the output indicates that the requested piece of information (user identity from remote machine) is not available.
-
####The "hyphen" in the output indicates that the requested piece of information (user identity from local logon) is not available.
[01/Aug/1995:00:00:01 -0400]
####The time that the server finished processing the request. The format is:
[day/month/year:hour:minute:second timezone]
"GET /images/launch-logo.gif HTTP/1.0"
####This is the first line of the request string from the client. It consists of a three components: the request method (e.g., GET
, POST
, etc.), the endpoint (a Uniform Resource Identifier), and the client protocol version.
200
####This is the status code that the server sends back to the client. This information is very valuable, because it reveals whether the request resulted in a successful response (codes beginning in 2), a redirection (codes beginning in 3), an error caused by the client (codes beginning in 4), or an error in the server (codes beginning in 5). The full list of possible status codes can be found in the HTTP specification (RFC 2616 section 10).
1839
####The last entry indicates the size of the object returned to the client, not including the response headers. If no content was returned to the client, this value will be "-" (or sometimes 0).
search
function. The function returns a pair consisting of a Row object and 1. If the log line fails to match the regular expression, the function returns a pair consisting of the log line string and 0. A '-' value in the content size field is cleaned up by substituting it with 0. The function converts the log line's date string into a Python datetime
object using the given parse_apache_time
function.
In [1]:
import re
import datetime
from pyspark.sql import Row
month_map = {'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,
'Aug':8, 'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12}
def parse_apache_time(s):
""" Convert Apache time format into a Python datetime object
Args:
s (str): date and time in Apache time format
Returns:
datetime: datetime object (ignore timezone for now)
"""
return datetime.datetime(int(s[7:11]),
month_map[s[3:6]],
int(s[0:2]),
int(s[12:14]),
int(s[15:17]),
int(s[18:20]))
def parseApacheLogLine(logline):
""" Parse a line in the Apache Common Log format
Args:
logline (str): a line of text in the Apache Common Log format
Returns:
tuple: either a dictionary containing the parts of the Apache Access Log and 1,
or the original invalid log line and 0
"""
match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
if match is None:
return (logline, 0)
size_field = match.group(9)
if size_field == '-':
size = long(0)
else:
size = long(match.group(9))
return (Row(
host = match.group(1),
client_identd = match.group(2),
user_id = match.group(3),
date_time = parse_apache_time(match.group(4)),
method = match.group(5),
endpoint = match.group(6),
protocol = match.group(7),
response_code = int(match.group(8)),
content_size = size
), 1)
In [2]:
# A regular expression pattern to extract fields from the log line
APACHE_ACCESS_LOG_PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
sc.textfile(logFile)
to convert each line of the file into an element in an RDD.map(parseApacheLogLine)
to apply the parse function to each element (that is, a line from the log file) in the RDD and turn each line into a pair Row
object.
In [3]:
import sys
import os
from test_helper import Test
baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab2', 'apache.access.log.PROJECT')
logFile = os.path.join(baseDir, inputPath)
def parseLogs():
""" Read and parse log file """
parsed_logs = (sc
.textFile(logFile)
.map(parseApacheLogLine)
.cache())
access_logs = (parsed_logs
.filter(lambda s: s[1] == 1)
.map(lambda s: s[0])
.cache())
failed_logs = (parsed_logs
.filter(lambda s: s[1] == 0)
.map(lambda s: s[0]))
failed_logs_count = failed_logs.count()
if failed_logs_count > 0:
print 'Number of invalid logline: %d' % failed_logs.count()
for line in failed_logs.take(20):
print 'Invalid logline: %s' % line
print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (parsed_logs.count(), access_logs.count(), failed_logs.count())
return parsed_logs, access_logs, failed_logs
parsed_logs, access_logs, failed_logs = parseLogs()
APACHE_ACCESS_LOG_PATTERN
regular expression below so that the failed lines will correctly parse, and press Shift-Enter
to rerun parseLogs()
.127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839
search
function, now would be a good time to check up on the documentation. One tip that might be useful is to use an online tester like http://pythex.org or http://www.pythonregex.com. To use it, copy and paste the regular expression string below (located between the single quotes ') and test it against one of the 'Invalid logline' above.
In [4]:
# TODO: Replace <FILL IN> with appropriate code
# This was originally '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
APACHE_ACCESS_LOG_PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)\s*" (\d{3}) (\S+)'
parsed_logs, access_logs, failed_logs = parseLogs()
In [5]:
# TEST Data cleaning (1c)
Test.assertEquals(failed_logs.count(), 0, 'incorrect failed_logs.count()')
Test.assertEquals(parsed_logs.count(), 1043177 , 'incorrect parsed_logs.count()')
Test.assertEquals(access_logs.count(), parsed_logs.count(), 'incorrect access_logs.count()')
map
to the access_logs
RDD. The lambda
function we want for the map is to extract the content_size
field from the RDD. The map produces a new RDD containing only the content_sizes
(one element for each Row object in the access_logs
RDD). To compute the minimum and maximum statistics, we can use min()
and max()
functions on the new RDD. We can compute the average statistic by using the reduce
function with a lambda
function that sums the two inputs, which represent two elements from the new RDD that are being reduced together. The result of the reduce()
is the total content size from the log and it is to be divided by the number of requests as determined using the count()
function on the new RDD.
In [6]:
# Calculate statistics based on the content size.
content_sizes = access_logs.map(lambda log: log.content_size).cache()
print 'Content Size Avg: %i, Min: %i, Max: %s' % (
content_sizes.reduce(lambda a, b : a + b) / content_sizes.count(),
content_sizes.min(),
content_sizes.max())
lambda
function to extract the response_code
field from the access_logs
RDD. The difference here is that we will use a pair tuple instead of just the field itself. Using a pair tuple consisting of the response code and 1 will let us count how many records have a particular response code. Using the new RDD, we perform a reduceByKey
function. reduceByKey
performs a reduce on a per-key basis by applying the lambda
function to each element, pairwise with the same key. We use the simple lambda
function of adding the two values. Then, we cache the resulting RDD and create a list by using the take
function.
In [7]:
# Response Code to Count
responseCodeToCount = (access_logs
.map(lambda log: (log.response_code, 1))
.reduceByKey(lambda a, b : a + b)
.cache())
responseCodeToCountList = responseCodeToCount.take(100)
print 'Found %d response codes' % len(responseCodeToCountList)
print 'Response Code Counts: %s' % responseCodeToCountList
assert len(responseCodeToCountList) == 7
assert sorted(responseCodeToCountList) == [(200, 940847), (302, 16244), (304, 79824), (403, 58), (404, 6185), (500, 2), (501, 17)]
matplotlib
matplotlib
. First we need to extract the labels and fractions for the graph. We do this with two separate map
functions with a lambda
functions. The first map
function extracts a list of of the response code values, and the second map
function extracts a list of the per response code counts divided by the total size of the access logs. Next, we create a figure with figure()
constructor and use the pie()
method to create the pie plot.
In [8]:
labels = responseCodeToCount.map(lambda (x, y): x).collect()
print labels
count = access_logs.count()
fracs = responseCodeToCount.map(lambda (x, y): (float(y) / count)).collect()
print fracs
In [9]:
import matplotlib.pyplot as plt
def pie_pct_format(value):
""" Determine the appropriate format string for the pie chart percentage label
Args:
value: value of the pie slice
Returns:
str: formated string label; if the slice is too small to fit, returns an empty string for label
"""
return '' if value < 7 else '%.0f%%' % value
fig = plt.figure(figsize=(4.5, 4.5), facecolor='white', edgecolor='white')
colors = ['yellowgreen', 'lightskyblue', 'gold', 'purple', 'lightcoral', 'yellow', 'black']
explode = (0.05, 0.05, 0.1, 0, 0, 0, 0)
patches, texts, autotexts = plt.pie(fracs, labels=labels, colors=colors,
explode=explode, autopct=pie_pct_format,
shadow=False, startangle=125)
for text, autotext in zip(texts, autotexts):
if autotext.get_text() == '':
text.set_text('') # If the slice is small to fit, don't show a text label
plt.legend(labels, loc=(0.80, -0.1), shadow=True)
pass
lambda
function to extract the host
field from the access_logs
RDD using a pair tuple consisting of the host and 1 which will let us count how many records were created by a particular host's request. Using the new RDD, we perform a reduceByKey
function with a lambda
function that adds the two values. We then filter the result based on the count of accesses by each host (the second element of each pair) being greater than ten. Next, we extract the host name by performing a map
with a lambda
function that returns the first element of each pair. Finally, we extract 20 elements from the resulting RDD - note that the choice of which elements are returned is not guaranteed to be deterministic.
In [10]:
# Any hosts that has accessed the server more than 10 times.
hostCountPairTuple = access_logs.map(lambda log: (log.host, 1))
hostSum = hostCountPairTuple.reduceByKey(lambda a, b : a + b)
hostMoreThan10 = hostSum.filter(lambda s: s[1] > 10)
hostsPick20 = (hostMoreThan10
.map(lambda s: s[0])
.take(20))
print 'Any 20 hosts that have accessed more then 10 times: %s' % hostsPick20
# An example: [u'204.120.34.185', u'204.243.249.9', u'slip1-32.acs.ohio-state.edu', u'lapdog-14.baylor.edu', u'199.77.67.3', u'gs1.cs.ttu.edu', u'haskell.limbex.com', u'alfred.uib.no', u'146.129.66.31', u'manaus.bologna.maraut.it', u'dialup98-110.swipnet.se', u'slip-ppp02.feldspar.com', u'ad03-053.compuserve.com', u'srawlin.opsys.nwa.com', u'199.202.200.52', u'ix-den7-23.ix.netcom.com', u'151.99.247.114', u'w20-575-104.mit.edu', u'205.25.227.20', u'ns.rmc.com']
lambda
function to extract the endpoint
field from the access_logs
RDD using a pair tuple consisting of the endpoint and 1 which will let us count how many records were created by a particular host's request. Using the new RDD, we perform a reduceByKey
function with a lambda
function that adds the two values. We then cache the results.matplotlib
. We previously imported the matplotlib.pyplot
library, so we do not need to import it again. We perform two separate map
functions with lambda
functions. The first map
function extracts a list of endpoint values, and the second map
function extracts a list of the visits per endpoint values. Next, we create a figure with figure()
constructor, set various features of the plot (axis limits, grid lines, and labels), and use the plot()
method to create the line plot.
In [11]:
endpoints = (access_logs
.map(lambda log: (log.endpoint, 1))
.reduceByKey(lambda a, b : a + b)
.cache())
ends = endpoints.map(lambda (x, y): x).collect()
counts = endpoints.map(lambda (x, y): y).collect()
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, len(ends), 0, max(counts)])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Endpoints')
plt.ylabel('Number of Hits')
plt.plot(counts)
pass
lambda
function to extract the endpoint
field from the access_logs
RDD using a pair tuple consisting of the endpoint and 1 which will let us count how many records were created by a particular host's request. Using the new RDD, we perform a reduceByKey
function with a lambda
function that adds the two values. We then extract the top ten endpoints by performing a takeOrdered
with a value of 10 and a lambda
function that multiplies the count (the second element of each pair) by -1 to create a sorted list with the top endpoints at the bottom.
In [12]:
# Top Endpoints
endpointCounts = (access_logs
.map(lambda log: (log.endpoint, 1))
.reduceByKey(lambda a, b : a + b))
topEndpoints = endpointCounts.takeOrdered(10, lambda s: -1 * s[1])
print 'Top Ten Endpoints: %s' % topEndpoints
assert topEndpoints == [(u'/images/NASA-logosmall.gif', 59737), (u'/images/KSC-logosmall.gif', 50452), (u'/images/MOSAIC-logosmall.gif', 43890), (u'/images/USA-logosmall.gif', 43664), (u'/images/WORLD-logosmall.gif', 43277), (u'/images/ksclogo-medium.gif', 41336), (u'/ksc.html', 28582), (u'/history/apollo/images/apollo-logo1.gif', 26778), (u'/images/launch-logo.gif', 24755), (u'/', 20292)], 'incorrect Top Ten Endpoints'
In [13]:
# TODO: Replace <FILL IN> with appropriate code
# HINT: Each of these <FILL IN> below could be completed with a single transformation or action.
# You are welcome to structure your solution in a different way, so long as
# you ensure the variables used in the next Test section are defined (ie. endpointSum, topTenErrURLs).
not200 = access_logs.filter(lambda log: log.response_code != 200)
endpointCountPairTuple = not200.map(lambda log: (log.endpoint, 1))
endpointSum = endpointCountPairTuple.reduceByKey(lambda a, b : a + b)
topTenErrURLs = endpointSum.takeOrdered(10, lambda s: -1 * s[1])
print 'Top Ten failed URLs: %s' % topTenErrURLs
In [14]:
# TEST Top ten error endpoints (3a)
Test.assertEquals(endpointSum.count(), 7689, 'incorrect count for endpointSum')
Test.assertEquals(topTenErrURLs, [(u'/images/NASA-logosmall.gif', 8761), (u'/images/KSC-logosmall.gif', 7236), (u'/images/MOSAIC-logosmall.gif', 5197), (u'/images/USA-logosmall.gif', 5157), (u'/images/WORLD-logosmall.gif', 5020), (u'/images/ksclogo-medium.gif', 4728), (u'/history/apollo/images/apollo-logo1.gif', 2907), (u'/images/launch-logo.gif', 2811), (u'/', 2199), (u'/images/ksclogosmall.gif', 1622)], 'incorrect Top Ten failed URLs (topTenErrURLs)')
In [15]:
# TODO: Replace <FILL IN> with appropriate code
# HINT: Do you recall the tips from (3a)? Each of these <FILL IN> could be an transformation or action.
hosts = access_logs.map(lambda log: log.host, 1)
uniqueHosts = hosts.distinct()
uniqueHostCount = uniqueHosts.count()
print 'Unique hosts: %d' % uniqueHostCount
In [16]:
# TEST Number of unique hosts (3b)
Test.assertEquals(uniqueHostCount, 54507, 'incorrect uniqueHostCount')
dailyHosts
so that we can reuse it in the next exercise.
In [17]:
# TODO: Replace <FILL IN> with appropriate code
dayToHostPairTuple = access_logs.map(lambda log: (log.date_time.day, log.host)).distinct()
dayGroupedHosts = dayToHostPairTuple.groupByKey()
dayHostCount = dayGroupedHosts.map(lambda (day, hosts): (day, len(hosts)))
dailyHosts = (dayHostCount.sortByKey()
.cache())
dailyHostsList = dailyHosts.take(30)
print 'Unique hosts per day: %s' % dailyHostsList
In [18]:
# TEST Number of unique daily hosts (3c)
Test.assertEquals(dailyHosts.count(), 21, 'incorrect dailyHosts.count()')
Test.assertEquals(dailyHostsList, [(1, 2582), (3, 3222), (4, 4190), (5, 2502), (6, 2537), (7, 4106), (8, 4406), (9, 4317), (10, 4523), (11, 4346), (12, 2864), (13, 2650), (14, 4454), (15, 4214), (16, 4340), (17, 4385), (18, 4168), (19, 2550), (20, 2560), (21, 4134), (22, 4456)], 'incorrect dailyHostsList')
Test.assertTrue(dailyHosts.is_cached, 'incorrect dailyHosts.is_cached')
matplotlib
to plot a "Line" graph of the unique hosts requests by day.daysWithHosts
should be a list of days and hosts
should be a list of number of unique hosts for each corresponding day.collect()
method
In [19]:
# TODO: Replace <FILL IN> with appropriate code
daysWithHosts = dailyHosts.map(lambda log: log[0]).collect()
hosts = dailyHosts.map(lambda log: log[1]).collect()
In [20]:
# TEST Visualizing unique daily hosts (3d)
test_days = range(1, 23)
test_days.remove(2)
Test.assertEquals(daysWithHosts, test_days, 'incorrect days')
Test.assertEquals(hosts, [2582, 3222, 4190, 2502, 2537, 4106, 4406, 4317, 4523, 4346, 2864, 2650, 4454, 4214, 4340, 4385, 4168, 2550, 2560, 4134, 4456], 'incorrect hosts')
In [21]:
fig = plt.figure(figsize=(8,4.5), facecolor='white', edgecolor='white')
plt.axis([min(daysWithHosts), max(daysWithHosts), 0, max(hosts)+500])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Day')
plt.ylabel('Hosts')
plt.plot(daysWithHosts, hosts)
pass
avgDailyReqPerHost
so that we can reuse it in the next exercise.
In [22]:
# TODO: Replace <FILL IN> with appropriate code
dayAndHostTuple = access_logs.map(lambda log: (log.date_time.day, log.host))
groupedByDay = dayAndHostTuple.groupByKey()
sortedByDay = groupedByDay.sortByKey() # Redundant, but was in skeleton code
avgDailyReqPerHost = (sortedByDay
.map(lambda(day, requests): (day, len(requests))) # Num of requests per day
.join(dailyHosts) # Join w/ num of unique hosts per day
.map(lambda(day, (requests, hosts)): (day, requests/hosts)) # Calculate num of requests per unique host
.sortByKey() # Sort (again)
.cache())
avgDailyReqPerHostList = avgDailyReqPerHost.take(30)
print 'Average number of daily requests per Hosts is %s' % avgDailyReqPerHostList
In [23]:
# TEST Average number of daily requests per hosts (3e)
Test.assertEquals(avgDailyReqPerHostList, [(1, 13), (3, 12), (4, 14), (5, 12), (6, 12), (7, 13), (8, 13), (9, 14), (10, 13), (11, 14), (12, 13), (13, 13), (14, 13), (15, 13), (16, 13), (17, 13), (18, 13), (19, 12), (20, 12), (21, 13), (22, 12)], 'incorrect avgDailyReqPerHostList')
Test.assertTrue(avgDailyReqPerHost.is_cached, 'incorrect avgDailyReqPerHost.is_cache')
avgDailyReqPerHost
from the previous exercise, use matplotlib
to plot a "Line" graph of the average daily requests per unique host by day.daysWithAvg
should be a list of days and avgs
should be a list of average daily requests per unique hosts for each corresponding day.
In [24]:
# TODO: Replace <FILL IN> with appropriate code
daysWithAvg = avgDailyReqPerHost.map(lambda a: a[0]).collect()
avgs = avgDailyReqPerHost.map(lambda a: a[1]).collect()
In [25]:
# TEST Average Daily Requests per Unique Host (3f)
Test.assertEquals(daysWithAvg, [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], 'incorrect days')
Test.assertEquals(avgs, [13, 12, 14, 12, 12, 13, 13, 14, 13, 14, 13, 13, 13, 13, 13, 13, 13, 12, 12, 13, 12], 'incorrect avgs')
In [26]:
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, max(daysWithAvg), 0, max(avgs)+2])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Day')
plt.ylabel('Average')
plt.plot(daysWithAvg, avgs)
pass
In [27]:
# TODO: Replace <FILL IN> with appropriate code
badRecords = (access_logs.filter(lambda log: log.response_code == 404)
.cache())
print 'Found %d 404 URLs' % badRecords.count()
In [28]:
# TEST Counting 404 (4a)
Test.assertEquals(badRecords.count(), 6185, 'incorrect badRecords.count()')
Test.assertTrue(badRecords.is_cached, 'incorrect badRecords.is_cached')
In [29]:
# TODO: Replace <FILL IN> with appropriate code
badEndpoints = badRecords.map(lambda log: log.endpoint)
badUniqueEndpoints = badEndpoints.distinct()
badUniqueEndpointsPick40 = badUniqueEndpoints.take(40)
print '404 URLS: %s' % badUniqueEndpointsPick40
In [30]:
# TEST Listing 404 records (4b)
badUniqueEndpointsSet40 = set(badUniqueEndpointsPick40)
Test.assertEquals(len(badUniqueEndpointsSet40), 40, 'badUniqueEndpointsPick40 not distinct')
In [31]:
# TODO: Replace <FILL IN> with appropriate code
badEndpointsCountPairTuple = badRecords.map(lambda log: (log.endpoint, 1))
badEndpointsSum = badEndpointsCountPairTuple.reduceByKey(lambda a, b: a + b)
badEndpointsTop20 = badEndpointsSum.takeOrdered(20, lambda a: -a[1])
print 'Top Twenty 404 URLs: %s' % badEndpointsTop20
In [32]:
# TEST Top twenty 404 URLs (4c)
Test.assertEquals(badEndpointsTop20, [(u'/pub/winvn/readme.txt', 633), (u'/pub/winvn/release.txt', 494), (u'/shuttle/missions/STS-69/mission-STS-69.html', 431), (u'/images/nasa-logo.gif', 319), (u'/elv/DELTA/uncons.htm', 178), (u'/shuttle/missions/sts-68/ksc-upclose.gif', 156), (u'/history/apollo/sa-1/sa-1-patch-small.gif', 146), (u'/images/crawlerway-logo.gif', 120), (u'/://spacelink.msfc.nasa.gov', 117), (u'/history/apollo/pad-abort-test-1/pad-abort-test-1-patch-small.gif', 100), (u'/history/apollo/a-001/a-001-patch-small.gif', 97), (u'/images/Nasa-logo.gif', 85), (u'/shuttle/resources/orbiters/atlantis.gif', 64), (u'/history/apollo/images/little-joe.jpg', 62), (u'/images/lf-logo.gif', 59), (u'/shuttle/resources/orbiters/discovery.gif', 56), (u'/shuttle/resources/orbiters/challenger.gif', 54), (u'/robots.txt', 53), (u'/elv/new01.gif>', 43), (u'/history/apollo/pad-abort-test-2/pad-abort-test-2-patch-small.gif', 38)], 'incorrect badEndpointsTop20')
In [33]:
# TODO: Replace <FILL IN> with appropriate code
errHostsCountPairTuple = badRecords.map(lambda log: (log.host, 1))
errHostsSum = errHostsCountPairTuple.reduceByKey(lambda a, b: a + b)
errHostsTop25 = errHostsSum.takeOrdered(25, lambda a: -a[1])
print 'Top 25 hosts that generated errors: %s' % errHostsTop25
In [34]:
# TEST Top twenty-five 404 response code hosts (4d)
Test.assertEquals(len(errHostsTop25), 25, 'length of errHostsTop25 is not 25')
Test.assertEquals(len(set(errHostsTop25) - set([(u'maz3.maz.net', 39), (u'piweba3y.prodigy.com', 39), (u'gate.barr.com', 38), (u'm38-370-9.mit.edu', 37), (u'ts8-1.westwood.ts.ucla.edu', 37), (u'nexus.mlckew.edu.au', 37), (u'204.62.245.32', 33), (u'163.206.104.34', 27), (u'spica.sci.isas.ac.jp', 27), (u'www-d4.proxy.aol.com', 26), (u'www-c4.proxy.aol.com', 25), (u'203.13.168.24', 25), (u'203.13.168.17', 25), (u'internet-gw.watson.ibm.com', 24), (u'scooter.pa-x.dec.com', 23), (u'crl5.crl.com', 23), (u'piweba5y.prodigy.com', 23), (u'onramp2-9.onr.com', 22), (u'slip145-189.ut.nl.ibm.net', 22), (u'198.40.25.102.sap2.artic.edu', 21), (u'gn2.getnet.com', 20), (u'msp1-16.nas.mr.net', 20), (u'isou24.vilspa.esa.es', 19), (u'dial055.mbnet.mb.ca', 19), (u'tigger.nashscene.com', 19)])), 0, 'incorrect errHostsTop25')
In [35]:
# TODO: Replace <FILL IN> with appropriate code
errDateCountPairTuple = badRecords.map(lambda log: (log.date_time.day, 1))
errDateSum = errDateCountPairTuple.reduceByKey(lambda a, b: a + b)
errDateSorted = (errDateSum.sortByKey()
.cache())
errByDate = errDateSorted.collect()
print '404 Errors by day: %s' % errByDate
In [36]:
# TEST 404 response codes per day (4e)
Test.assertEquals(errByDate, [(1, 243), (3, 303), (4, 346), (5, 234), (6, 372), (7, 532), (8, 381), (9, 279), (10, 314), (11, 263), (12, 195), (13, 216), (14, 287), (15, 326), (16, 258), (17, 269), (18, 255), (19, 207), (20, 312), (21, 305), (22, 288)], 'incorrect errByDate')
Test.assertTrue(errDateSorted.is_cached, 'incorrect errDateSorted.is_cached')
In [37]:
# TODO: Replace <FILL IN> with appropriate code
daysWithErrors404 = errDateSorted.map(lambda a: a[0]).collect()
errors404ByDay = errDateSorted.map(lambda a: a[1]).collect()
In [38]:
# TEST Visualizing the 404 Response Codes by Day (4f)
Test.assertEquals(daysWithErrors404, [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], 'incorrect daysWithErrors404')
Test.assertEquals(errors404ByDay, [243, 303, 346, 234, 372, 532, 381, 279, 314, 263, 195, 216, 287, 326, 258, 269, 255, 207, 312, 305, 288], 'incorrect errors404ByDay')
In [39]:
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, max(daysWithErrors404), 0, max(errors404ByDay)])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Day')
plt.ylabel('404 Errors')
plt.plot(daysWithErrors404, errors404ByDay)
pass
In [40]:
# TODO: Replace <FILL IN> with appropriate code
topErrDate = errDateSorted.takeOrdered(5, lambda a: -a[1])
print 'Top Five dates for 404 requests: %s' % topErrDate
In [41]:
# TEST Five dates for 404 requests (4g)
Test.assertEquals(topErrDate, [(7, 532), (8, 381), (6, 372), (4, 346), (15, 326)], 'incorrect topErrDate')
In [42]:
# TODO: Replace <FILL IN> with appropriate code
hourCountPairTuple = badRecords.map(lambda log: (log.date_time.hour, 1))
hourRecordsSum = hourCountPairTuple.reduceByKey(lambda a, b: a + b)
hourRecordsSorted = (hourRecordsSum.sortByKey()
.cache())
errHourList = hourRecordsSorted.collect()
print 'Top hours for 404 requests: %s' % errHourList
In [43]:
# TEST Hourly 404 response codes (4h)
Test.assertEquals(errHourList, [(0, 175), (1, 171), (2, 422), (3, 272), (4, 102), (5, 95), (6, 93), (7, 122), (8, 199), (9, 185), (10, 329), (11, 263), (12, 438), (13, 397), (14, 318), (15, 347), (16, 373), (17, 330), (18, 268), (19, 269), (20, 270), (21, 241), (22, 234), (23, 272)], 'incorrect errHourList')
Test.assertTrue(hourRecordsSorted.is_cached, 'incorrect hourRecordsSorted.is_cached')
In [44]:
# TODO: Replace <FILL IN> with appropriate code
hoursWithErrors404 = hourRecordsSorted.map(lambda a: a[0]).collect()
errors404ByHours = hourRecordsSorted.map(lambda a: a[1]).collect()
In [45]:
# TEST Visualizing the 404 Response Codes by Hour (4i)
Test.assertEquals(hoursWithErrors404, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'incorrect hoursWithErrors404')
Test.assertEquals(errors404ByHours, [175, 171, 422, 272, 102, 95, 93, 122, 199, 185, 329, 263, 438, 397, 318, 347, 373, 330, 268, 269, 270, 241, 234, 272], 'incorrect errors404ByHours')
In [46]:
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, max(hoursWithErrors404), 0, max(errors404ByHours)])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Hour')
plt.ylabel('404 Errors')
plt.plot(hoursWithErrors404, errors404ByHours)
pass