In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# style.use is a function and must be called, not assigned to
plt.style.use('ggplot')
First, create a DataFrame from the dailybots.csv file, which can be found in the data directory. You can do this with the pd.read_csv() function. Take a minute to look at the DataFrame, because we will use it for the entire worksheet.
In [2]:
data = pd.read_csv( '../../data/dailybots.csv' )
#Look at a summary of the data
data.describe()
Out[2]:
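Beyond describe(), a few other calls help when getting to know a new DataFrame. The following is a minimal sketch on a hypothetical three-row sample, not the real file; the column names (date, botfam, industry, hosts, orgs) are taken from the cells later in this worksheet, and the values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real data lives in ../../data/dailybots.csv
csv_text = """date,botfam,industry,hosts,orgs
2016-06-01,Ramnit,Education,10,2
2016-06-01,Necurs,Finance,5,1
2016-06-02,Ramnit,Finance,7,3
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())    # first few rows
print(sample.dtypes)    # date is read as text, not as a datetime
print(sample.shape)     # (number of rows, number of columns)
```

Note that describe() only summarizes numeric columns by default, so dtypes is a quick way to see which columns it left out.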
In [3]:
data['botfam'].value_counts()
Out[3]:
Count the number of infected days for "Ramnit" in each industry. How:
groupby() function
In [4]:
grouped_df = data[data.botfam == "Ramnit"].groupby(['industry'])
# Sum only the numeric columns (non-numeric columns are skipped)
grouped_df.sum(numeric_only=True)
Out[4]:
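The cell above sums the numeric columns per industry. If "infected days" is read literally as the number of distinct dates on which an industry appears, nunique() is one possible alternative. This is only a sketch on a made-up toy frame (the values are hypothetical), showing both interpretations side by side.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the worksheet
toy = pd.DataFrame({
    "date": ["2016-06-01", "2016-06-01", "2016-06-02", "2016-06-02"],
    "botfam": ["Ramnit", "Ramnit", "Ramnit", "Necurs"],
    "industry": ["Education", "Finance", "Education", "Finance"],
    "hosts": [10, 5, 7, 3],
    "orgs": [2, 1, 3, 1],
})

# Keep only the Ramnit rows, then group by industry
ramnit = toy[toy["botfam"] == "Ramnit"]

# Interpretation 1: total hosts and orgs per industry
totals = ramnit.groupby("industry")[["hosts", "orgs"]].sum()

# Interpretation 2: number of distinct dates per industry
days = ramnit.groupby("industry")["date"].nunique()

print(totals)
print(days)
```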
In this exercise, calculate the min, max, median, and mean of infected orgs for each bot family, sorted by median. Hint:
groupby() function, create a grouped data frame
In [5]:
group2 = data[['botfam','orgs']].groupby( ['botfam'])
# String names avoid the deprecation warning newer pandas raises for NumPy callables
summary = group2.agg(['min', 'max', 'mean', 'median', 'std'])
summary.sort_values( [('orgs', 'median')], ascending=False)
Out[5]:
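To see how agg() and sort_values() fit together, here is a small sketch on hypothetical data with two made-up families, "A" and "B". It selects the orgs column before aggregating, which is a choice made here for illustration: the result then has flat column names, so the sort key is just "median" rather than the ('orgs', 'median') tuple used in the cell above.

```python
import pandas as pd

# Hypothetical toy data: two bot families with three observations each
toy = pd.DataFrame({
    "botfam": ["A", "A", "A", "B", "B", "B"],
    "orgs": [1, 2, 9, 4, 5, 6],
})

# Selecting a single column first yields flat column labels after agg()
summary = toy.groupby("botfam")["orgs"].agg(["min", "max", "median", "mean"])

# Sort the families by their median, largest first
summary = summary.sort_values("median", ascending=False)
print(summary)
```

Family "A" has the larger mean-to-median gap because of its single outlier (9), which is one reason to sort by median rather than mean.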
In [23]:
df3 = data[['date','hosts']].groupby('date').sum()
df3.sort_values(by='hosts', ascending=False).head(10)
Out[23]:
In this exercise you're going to plot the daily infected hosts for three infection types. In order to do this, you'll need to do the following steps:
groupby() to aggregate the data by date and family, then sum up the hosts in each group.
unstack() function to prepare the data for plotting.
In [7]:
filteredData = data[ data['botfam'].isin(['Necurs', 'Ramnit', 'PushDo']) ][['date', 'botfam', 'hosts']]
groupedFilteredData = filteredData.groupby( ['date', 'botfam']).sum()
groupedFilteredData.unstack(level=1).plot(kind='line', subplots=False)
Out[7]:
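The role of unstack() is easier to see on a tiny example. The sketch below uses hypothetical values: grouping by two keys gives a long result with a two-level index, and unstack() pivots the botfam level into columns so that each family becomes one line when plotted.

```python
import pandas as pd

# Hypothetical toy data; the last two rows share a date and family
toy = pd.DataFrame({
    "date": ["2016-06-01", "2016-06-01", "2016-06-02", "2016-06-02", "2016-06-02"],
    "botfam": ["Necurs", "Ramnit", "Necurs", "Ramnit", "Ramnit"],
    "hosts": [5, 10, 3, 7, 1],
})

# Long form: one row per (date, botfam) pair
long_form = toy.groupby(["date", "botfam"])["hosts"].sum()

# Wide form: dates as rows, one column per bot family
wide = long_form.unstack(level="botfam")
print(wide)
```

Calling wide.plot(kind='line') on a frame shaped like this draws one line per column, which is why the plotting cell unstacks before calling plot().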
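Hint: try a box plot and/or violin plot. In order to do this, there are two steps:
Convert the date column to a datetime and create a day column holding the day of the week as an integer.
.boxplot() method to plot the data. This has grouping built in, so you don't have to group by first.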
In [14]:
data.date = pd.to_datetime( data.date )
data['day'] = data.date.dt.weekday
data[['hosts', 'day']].boxplot( by='day')
Out[14]:
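The weekday numbering can be confusing on a plot axis, so here is a short sketch on hypothetical dates showing what dt.weekday returns and how dt.day_name() gives a readable label. The day_name column is an addition for illustration and is not used elsewhere in this worksheet.

```python
import pandas as pd

# Hypothetical toy data; 2024-01-01 was a Monday
toy = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-07", "2024-01-08"],
    "hosts": [4, 6, 1, 8],
})

# Convert the text dates to real datetimes so the .dt accessor works
toy["date"] = pd.to_datetime(toy["date"])

# weekday counts from Monday=0 through Sunday=6
toy["day"] = toy["date"].dt.weekday
toy["day_name"] = toy["date"].dt.day_name()

# Median hosts per weekday, the center line a box plot would draw
per_day = toy.groupby("day")["hosts"].median()
print(toy)
print(per_day)
```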
In [13]:
grouped = data[['hosts', 'day']].groupby('day')
print( grouped.sum() )
In [ ]:
# Draw one box plot of hosts for each weekday group
grouped.boxplot()