In [3]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

Data Visualization Worksheet

This worksheet will walk you through the basic process of preparing a visualization using Python/Pandas/Matplotlib.

For this exercise, we will be creating a line plot comparing the number of hosts infected by the Bedep and ConfickerAB Bot Families in the Government/Politic sector.

Prepare the data

The data we will be using is in the dailybots.csv file which can be found in the data folder. As is common, we will have to do some data wrangling to get it into a format which we can use to visualize this data. To do that, we'll need to:

  1. Read in the data
  2. Filter the data by industry and botnet The result should look something like this:
date ConflikerAB Bedep
0 2016-06-01 255 430
1 2016-06-02 431 453

The way I chose to do this might be a little more complex, but I wanted you to see all the steps involved.

Step 1 Read in the data

Using the pd.read_csv() function, you can read in the data.


In [5]:
data = pd.read_csv('../../data/dailybots.csv')
data.head()


Out[5]:
date botfam industry hosts orgs
0 2016-06-01 Bedep Education 88 33
1 2016-06-01 Bedep Finance 387 17
2 2016-06-01 Bedep Government/Politics 430 42
3 2016-06-01 Bedep Healthcare/Wellness 42 19
4 2016-06-01 Bedep Manufacturing 184 18

Step 2: Filter the Data

The next step is to filter both by industry and by botfam. In order to get the data into the format I wanted, I did this separately. First, I created a second dataframe called filteredData which only contains the information from the Government/Politics industry.


In [6]:
filteredData = data[data['industry'] == "Government/Politics"]
filteredData.head()


Out[6]:
date botfam industry hosts orgs
2 2016-06-01 Bedep Government/Politics 430 42
8 2016-06-01 ConfickerAB Government/Politics 255 38
14 2016-06-01 Necurs Government/Politics 277 25
23 2016-06-01 PushDo Government/Politics 8 2
28 2016-06-01 Ramnit Government/Politics 53 6

Next, I created a second DataFrame which only contains the information from the ConfickerAB botnet. I also reduced the columns to the date and host count. You'll need to rename the host count so that you can merge the other data set later.


In [7]:
filteredData2 = filteredData[filteredData['botfam']== 'ConfickerAB' ][['date','hosts']]
filteredData2.columns = ['date', 'ConfickerAB']
filteredData2.date = pd.to_datetime( filteredData2.date )
filteredData2.head()


Out[7]:
date ConfickerAB
8 2016-06-01 255
64 2016-06-02 431
120 2016-06-03 367
178 2016-06-04 272
234 2016-06-05 245

Repeat this porcess for the Bedep botfam in a separate dataFrame.

Step 3: Merge the DataFrames.

Next, you'll need to merge the dataframes so that you end up with a dataframe with three columns: the date, the ConfickerAB count, and the the Bedep count. Pandas has a .merge() function which is documented here: http://pandas.pydata.org/pandas-docs/stable/merging.html


In [8]:
filteredData3 = filteredData[filteredData['botfam']== 'Bedep' ][['date','hosts']]
filteredData3.columns = ['date', 'Bedep']
filteredData3.date = pd.to_datetime( filteredData3.date )
finalData = pd.merge(filteredData2, filteredData3, on='date', how='left')
finalData.head()


Out[8]:
date ConfickerAB Bedep
0 2016-06-01 255 430.0
1 2016-06-02 431 453.0
2 2016-06-03 367 338.0
3 2016-06-04 272 259.0
4 2016-06-05 245 284.0

Create the first chart

Using the .plot() method, plot your dataframe and see what you get.


In [9]:
finalData.plot(kind='line' )


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x109ece390>

Step 3 Customizing your plot:

The default plot doesn't look horrible, but there are certainly some improvements which can be made. Try the following:

  1. Change the x-axis to a date by converting the date column to a date object.
  2. Move the Legend to the upper center of the graph
  3. Make the figure size larger.
  4. Instead of rendering both lines on one graph, split them up into two plots
  5. Add axis labels

There are many examples in the documentation which is available: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html


In [10]:
finalData.set_index('date', inplace=True)
finalData.plot(kind='line' )


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a2fdd68>

Move the Legend to the Upper Center of the Graph

For this, you'll have to assign the plot variable to a new variable and then call the formatting methods on it.


In [9]:
nicePlot = finalData.plot( kind="line")
nicePlot.legend(loc='upper center', bbox_to_anchor=(0.5, 1.05),
          ncol=3, fancybox=True, shadow=False)


Out[9]:
<matplotlib.legend.Legend at 0x10a4ae860>

Make the Figure Size Larger:


In [10]:
finalData.plot( kind="line", figsize=(60,40))


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a88bb70>

Adding Subplots

The first thing you'll need to do is call the .subplots( nrows=<rows>, ncols=<cols> ) to create a subplot. Next, plot your charts using the .plot() method. To do add the second plot to your figure, add the ax=axes[n] to the .plot() method.


In [17]:
fig, axes = plt.subplots(nrows=1, ncols=2)

finalData['ConfickerAB'].plot()
finalData['Bedep'].plot(ax=axes[0])


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x10be129e8>

In [58]:
from bokeh.plotting import output_notebook
output_notebook()


Loading BokehJS ...

In [59]:
from bokeh.charts import TimeSeries
from bokeh.io import show

In [60]:
linechart = TimeSeries( data=finalData,
                       title="ConfickerAB Hosts", 
                       legend="top_left" )
show( linechart )



In [ ]: