Worksheet 3: EDA Worksheet

This worksheet covers concepts covered in the first half of Module 1 - Exploratory Data Analysis in One Dimension. It should take no more than 20-30 minutes to complete. Please raise your hand if you get stuck.

There are many ways to accomplish the tasks that you are presented with, however you will find that by using the techniques covered in class, the exercises should be relatively simple.

Import the Libraries

For this exercise, we will be using:

Pandas (http://pandas.pydata.org/pandas-docs/stable/)
Numpy (https://docs.scipy.org/doc/numpy/reference/)
Matplotlib (http://matplotlib.org/api/pyplot_api.html)



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib

Exercise 1: Summarize the Data

For this exercise, you are given a Series of random numbers creatively names random_numbers. For the first exercise please do the following:

Remove all the numbers less than 10
Sort the series
Calculate the Tukey 5 number summary for this dataset
Count the number of even and odd numbers
Find the five largest and 5 smallest numbers in the series



In [38]:

    
#Generate a series of random numbers between 1 and 100. 
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
random_numbers[random_numbers > 10].sort_values().describe()









    Out[38]:





count    43.000000
mean     51.418605
std      24.319532
min      11.000000
25%      33.000000
50%      51.000000
75%      72.000000
max      97.000000
dtype: float64



In [14]:

    
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
random_numbers.sort_values().unique()[0:10]

# even and odd numbers
even_numbers = random_numbers[random_numbers % 2 == 0].count()









    Out[14]:





array([ 1,  3,  6,  7, 10, 13, 17, 22, 26, 28])

Exercise 2: Creating a Histogram

Using the random number Series create a histogram with 10 bins.



In [17]:

    
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
random_numbers.hist(bins=10)









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f498e4c4e10>

Exercise 3: Counting Values

You have been given a list of US phone numbers. The area code is the first three digits. Your task is to produce a summary of how many times each area code appears in the list. To do this you will need to:

Extract the area code from each phone number
Count the unique occurances.



In [18]:

    
phone_numbers = [
'(833) 759-6854',
'(811) 268-9951',
'(855) 449-4648',
'(833) 212-2929',
'(833) 893-7475',
'(822) 346-3086',
'(844) 259-9074',
'(855) 975-8945',
'(811) 385-8515',
'(811) 523-5090',
'(844) 593-5677',
'(833) 534-5793',
'(899) 898-3043',
'(833) 662-7621',
'(899) 146-8244',
'(822) 793-4965',
'(822) 641-7853',
'(833) 153-7848',
'(811) 958-2930',
'(822) 332-3070',
'(833) 223-1776',
'(811) 397-1451',
'(844) 096-0377',
'(822) 000-0717',
'(899) 311-1880']



In [25]:

    
phone_records = pd.Series(phone_numbers)
phone_records.apply(lambda x: int(x.split(' ')[0][1:-1])).unique()

# or
area_codes = phone_number_series.str.slice(1,4)
area_codes = phone_number_series.str.extract('\((\d{3}))', expand=False)









    Out[25]:





array([833, 811, 855, 822, 844, 899])

Exercise 4: Putting it all together: Bot Analysis

First you're going to want to create a data frame from the dailybots.csv file which can be found in the data directory. You should be able to do this with the pd.read_csv() function. Take a minute to look at the dataframe because we are going to be using it for to answer several different questions.



In [39]:

    
data = pd.read_csv( '../../data/dailybots.csv' )
data.describe()









    Out[39]:







  
    
      
      hosts
      orgs
    
  
  
    
      count
      4969.000000
      4969.00000
    
    
      mean
      142.308110
      15.31133
    
    
      std
      240.955294
      21.15150
    
    
      min
      1.000000
      1.00000
    
    
      25%
      10.000000
      3.00000
    
    
      50%
      38.000000
      7.00000
    
    
      75%
      187.000000
      16.00000
    
    
      max
      2031.000000
      122.00000



In [40]:

    
data.info() # useful to see columns









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4969 entries, 0 to 4968
Data columns (total 5 columns):
date        4969 non-null object
botfam      4969 non-null object
industry    4969 non-null object
hosts       4969 non-null int64
orgs        4969 non-null int64
dtypes: int64(2), object(3)
memory usage: 194.2+ KB



In [43]:

    
data.columns()









    



---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-31f5dce5b69f> in <module>()
----> 1 data.columns()

TypeError: 'Index' object is not callable



In [44]:

    
data['botfam'].value_counts()









    Out[44]:





Sality         540
Zeus           540
ConfickerAB    540
Ramnit         540
zeroaccess     538
Bedep          535
Necurs         511
Zusy           498
PushDo         418
Olmasco        309
Name: botfam, dtype: int64

Part 1: Which industry sees the most Ramnit infections? Least?

Count the number of infected days for "Ramnit" in each industry industry. How:

First filter the data to remove all the infections we don't care about
Aggregate the data on the column of interest. HINT: You might want to use the groupby() function
Add up the results



In [45]:

    
grouped_df = data[data.botfam == 'Ramnit'].groupby(['industry'])
grouped_df.sum()









    Out[45]:







  
    
      
      hosts
      orgs
    
    
      industry
      
      
    
  
  
    
      Education
      7492
      3949
    
    
      Finance
      2281
      1357
    
    
      Government/Politics
      4484
      448
    
    
      Healthcare/Wellness
      544
      473
    
    
      Manufacturing
      46303
      1155
    
    
      Retail
      17649
      558

Part 2: Calculate the min, max, median and mean infected orgs for each bot family, sort by median

In this exercise, you are asked to calculate the min, max, median and mean of infected orgs for each bot family sorted by median. HINT:

Using the groupby() function, create a grouped data frame
You can do this one metric at a time OR you can use the .agg() function. You might want to refer to the documentation here: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
Sort the values (HINT HINT) by the median column



In [ ]:

    
group2 = data[['botfam', 'orgs']].groupby(['botfam'])
summary = group2.agg([np.min, np.max, np.mean, np.dedian, np.std])
summary.sort_values(['orgs', 'median'], ascending=False)

Part 3: Which date had the total most bot infections and how many infections on that day?

For the next step, aggregate and sum the number of infections (hosts) by date. Once you've done that, the next step is to sort in descending order.



In [ ]:

    
df3 = data[['date', 'hosts']].groupby('date').sum()
df3.sort_values(by='hosts')

Part 4: Plot the daily infected hosts for Necurs, Ramnit and PushDo

For the final step, you're going to plot the daily infected hosts for three infection types. In order to do this, you'll need to do the following steps:

Filter the data to remove the botfamilies we don't care about.
Use groupby() to aggregate the data by date and family, then sum up the hosts in each group
Plot the data. Hint: You might want to use the unstack() function to prepare the data for plotting.
Use the plot() method to plot the results.



In [ ]:



In [ ]:

    
#Plot the data
groupedFilteredData.unstack(level=1).plot(kind='line', subplots=False)

	hosts	orgs
count	4969.000000	4969.00000
mean	142.308110	15.31133
std	240.955294	21.15150
min	1.000000	1.00000
25%	10.000000	3.00000
50%	38.000000	7.00000
75%	187.000000	16.00000
max	2031.000000	122.00000

	hosts	orgs
industry
Education	7492	3949
Finance	2281	1357
Government/Politics	4484	448
Healthcare/Wellness	544	473
Manufacturing	46303	1155
Retail	17649	558