Worksheet 2: Exploring Two Dimensional Data

Import the Libraries

For this exercise, we will be using:


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Exercise 1: Reading various forms of JSON Data

In the /data/ folder, you will find a series of .json files called dataN.json, numbered 1-4. Each file contains the following data:

birthday first_name last_name
0 5\/3\/67 Robert Hernandez
1 8\/4\/84 Steve Smith
2 9\/13\/91 Anne Raps
3 4\/15\/75 Alice Muller

In [47]:
#Your code here...
file1 = pd.read_json('../../data/data1.json')
file2 = pd.read_json('../../data/data2.json')
file2 = pd.read_json('../../data/data2.json')
file3 = pd.read_json('../../data/data3.json') # add orient=columns
file4 = pd.read_json('../../data/data4.json', orient='split')
combined = pd.concat([file1, file2.T, file3, file4], ignore_index=True)
combined


Out[47]:
birthday first_name last_name
0 5\/3\/67 Robert Hernandez
1 8\/4\/84 Steve Smith
2 9\/13\/91 Anne Raps
3 4\/15\/75 Alice Muller
4 5\/3\/67 Robert Hernandez
5 8\/4\/84 Steve Smith
6 9\/13\/91 Anne Raps
7 4\/15\/75 Alice Muller
8 5\/3\/67 Robert Hernandez
9 8\/4\/84 Steve Smith
10 9\/13\/91 Anne Raps
11 4\/15\/75 Alice Muller
12 5\/3\/67 Robert Hernandez
13 8\/4\/84 Steve Smith
14 9\/13\/91 Anne Raps
15 4\/15\/75 Alice Muller

Exercise 2:

In the data file, there is a webserver file called hackers-access.httpd. For this exercise, you will use this file to answer the following questions:

  1. Which browsers are the top 10 most used browsers in this data?
  2. Which are the top 10 most used operating systems?

In order to accomplish this task, do the following:

  1. Write a function which takes a User Agent string as an argument and returns the relevant data. HINT: You might want to use python's user_agents module, the documentation for which is available here: (https://pypi.python.org/pypi/user-agents)
  2. Next, apply this function to the column which contains the user agent string.
  3. Store this series as a new column in the dataframe
  4. Count the occurances of each value in the new columns

In [48]:
import apache_log_parser
from user_agents import parse


def parse_ua(line):
    parsed_data = parse(line)
    return str(parsed_data).split('/')[1]

def parse_ua_2(line):
    parsed_data = parse(line)
    return str(parsed_data).split('/')[2]

#Read in the log file
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"")

server_log = open("../../data/hackers-access.httpd", "r")
parsed_server_data = []
for line in server_log:
    data = {}
    data = line_parser(line)
    parsed_server_data.append( data )

server_df = pd.DataFrame(parsed_server_data)
server_df['OS'] = server_df['request_header_user_agent'].apply(parse_ua)
server_df['Browser'] = server_df['request_header_user_agent'].apply(parse_ua_2)
server_df['OS'].value_counts().head(10)


Out[48]:
 Windows 7               2041
 Windows Vista            500
 Windows XP               423
 Windows 8.1              221
 Linux                    125
 Mac OS X 10.10.1          66
 Chrome OS 6310.68.0       60
 Mac OS X 10.8.4           12
 Ubuntu                     6
 Mac OS X 10.10.0           2
Name: OS, dtype: int64

In [ ]:
#Apply the functions to the dataframe

In [ ]:
#Get the top 10 values

Exercise 3:

Using the dailybots.csv film, read the file into a DataFrame and perform the following operations:

  1. Filter the DataFrame to include bots from the Government/Politics Industry.
  2. Calculate the ratio of hosts to orgs and add this as a column to the DataFrame and output the result
  3. Calculate the total number of hosts infected by each BotFam in the Government/Politics Industry. You should use the groupby() function which is documented here: (http://pandas.pydata.org/pandas-docs/stable/groupby.html)

In [60]:
#Your code here...
bots = pd.read_csv('../../data/dailybots.csv')
gov_bots = bots[['botfam', 'hosts']][bots['industry'] == 'Government/Politics']
gov_bots.groupby(['botfam']).size()


Out[60]:
botfam
Bedep          89
ConfickerAB    90
Necurs         87
Olmasco        11
PushDo         85
Ramnit         90
Sality         90
Zeus           90
Zusy           72
zeroaccess     89
dtype: int64

In [ ]: