In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# style.use is a function; call it rather than assigning to it
plt.style.use('ggplot')

First, load up the data

Start by creating a DataFrame from the dailybots.csv file, which is in the data directory. The pd.read_csv() function will do this. Take a minute to look at the DataFrame, because we will be using it for the entire worksheet.


In [3]:
data = pd.read_csv( '../data/dailybots.csv' )

Exercise 1: Which industry sees the most Ramnit infections? Least?

Count the number of infected days for "Ramnit" in each industry. Here is how:

  1. First filter the data to remove all the infections we don't care about
  2. Aggregate the data on the column of interest. HINT: You might want to use the groupby() function
  3. Add up the results
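
The three steps above can be sketched on a small made-up table. The column names used here (botfam, industry, hosts) are assumptions about dailybots.csv, so check them against data.columns before reusing the pattern. Whether you sum a column or count rows depends on how you read "infected days"; this sketch sums the hosts column.

```python
import pandas as pd

# Toy stand-in for dailybots.csv. The column names (botfam, industry, hosts)
# are assumptions and may differ in the real file.
toy = pd.DataFrame({
    "botfam":   ["Ramnit", "Ramnit", "Ramnit", "Necurs", "Ramnit"],
    "industry": ["Retail", "Finance", "Retail", "Retail", "Finance"],
    "hosts":    [10, 5, 7, 99, 1],
})

# 1. Filter: keep only the Ramnit rows.
ramnit = toy[toy["botfam"] == "Ramnit"]

# 2 and 3. Aggregate on industry and add up the hosts in each group,
# then sort so the largest total comes first.
by_industry = (
    ramnit.groupby("industry")["hosts"]
          .sum()
          .sort_values(ascending=False)
)

print(by_industry)
```

With the result sorted, by_industry.index[0] is the industry with the most infections and by_industry.index[-1] the one with the fewest.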

In [ ]:

Exercise 2: Calculate the min, max, median and mean infected orgs for each bot family, sort by median

In this exercise, you are asked to calculate the min, max, median and mean of infected orgs for each bot family sorted by median. HINT:

  1. Using the groupby() function, create a grouped data frame
  2. You can do this one metric at a time OR you can use the .agg() function. You might want to refer to the documentation here: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
  3. Sort the values (HINT HINT) by the median column
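
As a rough illustration of the .agg() route, here is the pattern on a tiny invented table. The column names botfam and orgs are assumptions about the real file, and the values are made up.

```python
import pandas as pd

# Toy data; botfam and orgs are assumed column names, not confirmed ones.
toy = pd.DataFrame({
    "botfam": ["A", "A", "A", "B", "B", "B"],
    "orgs":   [1, 2, 9, 4, 5, 6],
})

# Group by family, compute four statistics at once with .agg(),
# then sort the resulting table by its median column.
stats = (
    toy.groupby("botfam")["orgs"]
       .agg(["min", "max", "median", "mean"])
       .sort_values("median", ascending=False)
)

print(stats)
```

Passing a list of function names to .agg() produces one column per statistic, which is why sort_values("median") can refer to the median column by name.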

In [ ]:

Exercise 3: Which date had the most total bot infections, and how many infections were there on that day?

In this exercise you are asked to aggregate and sum the number of infections (hosts) by date. Once you've done that, the next step is to sort in descending order.
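
A minimal sketch of the aggregate-then-sort idea, using made-up rows. The column names date and hosts are assumed to match the real file.

```python
import pandas as pd

# Toy data; date and hosts are assumed column names.
toy = pd.DataFrame({
    "date":  ["2016-06-01", "2016-06-01", "2016-06-02",
              "2016-06-03", "2016-06-03"],
    "hosts": [3, 4, 10, 1, 2],
})

# Sum the hosts for each date, then sort so the busiest day is first.
daily = toy.groupby("date")["hosts"].sum().sort_values(ascending=False)

# The first entry of the sorted Series answers both parts of the question.
top_date = daily.index[0]
top_hosts = daily.iloc[0]

print(top_date, top_hosts)
```

An alternative that skips the sort is daily.idxmax() for the date and daily.max() for the count.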


In [ ]:

Exercise 4: Plot the daily infected hosts for Necurs, Ramnit and PushDo

In this exercise you're going to plot the daily infected hosts for three infection types. In order to do this, you'll need to do the following steps:

  1. Filter the data to remove the bot families we don't care about.
  2. Use groupby() to aggregate the data by date and family, then sum up the hosts in each group
  3. Plot the data. Hint: You might want to use the unstack() function to prepare the data for plotting.
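
To show what unstack() contributes, here is a sketch of the filter, group and reshape steps on invented rows. The column names (date, botfam, hosts) and the extra family "Zeus" are assumptions made for illustration.

```python
import pandas as pd

# Toy data; column names and the "Zeus" rows are illustrative assumptions.
toy = pd.DataFrame({
    "date":   ["2016-06-01"] * 4 + ["2016-06-02"] * 3,
    "botfam": ["Necurs", "Ramnit", "PushDo", "Zeus",
               "Necurs", "Necurs", "Ramnit"],
    "hosts":  [5, 2, 8, 100, 1, 2, 4],
})

# 1. Keep only the three families of interest.
wanted = ["Necurs", "Ramnit", "PushDo"]
subset = toy[toy["botfam"].isin(wanted)]

# 2. Sum hosts for every (date, family) pair.
grouped = subset.groupby(["date", "botfam"])["hosts"].sum()

# 3. unstack() pivots the inner index level (botfam) into columns,
#    giving one row per date and one column per family.
wide = grouped.unstack()

print(wide)
```

Once the data is in this wide shape, calling wide.plot() should draw one line per family, provided matplotlib is available. Families with no rows on a given date show up as NaN.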

In [ ]:

Exercise 5: What is the distribution of infected hosts for each day of the week across all bot families?

Hint: try a box plot and/or violin plot. In order to do this, there are two steps:

  1. First create a day column where the day of the week is represented as an integer. You'll need to convert the date column to an actual date/time object. See here: http://pandas.pydata.org/pandas-docs/stable/timeseries.html
  2. Next, use the .boxplot() method to plot the data. This has grouping built in, so you don't have to call groupby() first.
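
The first step can be sketched as follows, again on made-up rows with assumed column names (date, hosts). The new column name day is a choice made here, not something the data file provides.

```python
import pandas as pd

# Toy data; date and hosts are assumed column names.
toy = pd.DataFrame({
    "date":  ["2016-06-06", "2016-06-07", "2016-06-12", "2016-06-13"],
    "hosts": [10, 20, 30, 40],
})

# Convert the strings to real datetime values.
toy["date"] = pd.to_datetime(toy["date"])

# dt.dayofweek gives an integer per row, with Monday=0 and Sunday=6.
toy["day"] = toy["date"].dt.dayofweek

print(toy)
```

With the day column in place, something like toy.boxplot(column="hosts", by="day") should produce one box per day of the week.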

In [ ]: