This worksheet covers the concepts from the first half of Module 1 - Exploratory Data Analysis in One Dimension. It should take no more than 20-30 minutes to complete. Please raise your hand if you get stuck.
There are many ways to accomplish the tasks presented here; however, the techniques covered in class should make the exercises relatively simple.
For this exercise, we will be using:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In [2]:
data = pd.read_csv( '../../data/dailybots.csv' )
#Look at a summary of the data
data.describe()
Out[2]:
For this exercise, you are given a Series of random numbers creatively named random_numbers. For the first exercise please do the following:
In [12]:
#Generate a Series of 50 random integers from 1 to 99 (the upper bound of randint is exclusive).
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
In [13]:
#Your code here...
#Filter the Series
random_numbers = random_numbers[random_numbers >= 10]
#Sort the Series
random_numbers.sort_values(inplace=True)
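#Calculate the Tukey 5 Number Summary (the min, 25%, 50%, 75% and max rows of the output)
#print() is needed because only the last expression in a cell is displayed automatically
print( random_numbers.describe())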
#Count the number of even and odd numbers
even_numbers = random_numbers[random_numbers % 2 == 0].count()
odd_numbers = random_numbers[random_numbers % 2 != 0].count()
print( "Even numbers: " + str(even_numbers))
print( "Odd numbers: " + str(odd_numbers))
#Find the five largest and smallest numbers
print( "Smallest Numbers:")
print( random_numbers.head(5))
print( "Largest Numbers:")
print( random_numbers.tail(5))
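describe() reports the count, mean and standard deviation alongside the quartiles. As a minimal sketch, the five numbers on their own can be pulled out with quantile(). The Series `sample` below is hypothetical data chosen so the result is easy to check by hand; it is not the random data generated above.

```python
import pandas as pd

# Hypothetical sample data, chosen so the quantiles are easy to verify by hand
sample = pd.Series([10, 20, 30, 40, 50])

# Minimum, lower quartile, median, upper quartile, maximum
five_number_summary = sample.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_number_summary)
```

The same call should work on random_numbers; the values will differ on every run because the data is random.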
In [14]:
#Your code here...
random_numbers.hist(bins=10)
Out[14]:
You have been given a list of US phone numbers. The area code is the first three digits. Your task is to produce a summary of how many times each area code appears in the list. To do this you will need to:
In [16]:
phone_numbers = [
'(833) 759-6854',
'(811) 268-9951',
'(855) 449-4648',
'(833) 212-2929',
'(833) 893-7475',
'(822) 346-3086',
'(844) 259-9074',
'(855) 975-8945',
'(811) 385-8515',
'(811) 523-5090',
'(844) 593-5677',
'(833) 534-5793',
'(899) 898-3043',
'(833) 662-7621',
'(899) 146-8244',
'(822) 793-4965',
'(822) 641-7853',
'(833) 153-7848',
'(811) 958-2930',
'(822) 332-3070',
'(833) 223-1776',
'(811) 397-1451',
'(844) 096-0377',
'(822) 000-0717',
'(899) 311-1880']
In [18]:
#Your code here...
phone_number_series = pd.Series(phone_numbers)
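#Option 1: slice out the three characters that follow the opening parenthesis
area_codes = phone_number_series.str.slice(1,4)
#Option 2: extract the three digits between the parentheses with a regular expression
area_codes2 = phone_number_series.str.extract( r'\((\d{3})\)', expand=False)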
area_codes2.value_counts()
Out[18]:
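To see the extract-then-count pattern on something small enough to check by eye, here is a sketch that uses only three of the phone numbers from the list above. The variable names (`sample_numbers`, `sample_area_codes`, `area_code_counts`) are made up for this illustration.

```python
import pandas as pd

# Three numbers taken from the list above
sample_numbers = pd.Series([
    '(833) 759-6854',
    '(811) 268-9951',
    '(833) 212-2929',
])

# The raw string keeps the backslashes intact for the regex engine;
# the single capture group returns the three digits inside the parentheses
sample_area_codes = sample_numbers.str.extract(r'\((\d{3})\)', expand=False)

# Tally how often each area code occurs, most frequent first
area_code_counts = sample_area_codes.value_counts()
print(area_code_counts)
```

Here 833 appears twice and 811 once, so 833 is listed first.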
First, you're going to want to create a data frame from the dailybots.csv file, which can be found in the data directory. You should be able to do this with the pd.read_csv() function. Take a minute to look at the dataframe because we are going to be using it to answer several different questions.
In [4]:
data = pd.read_csv( '../../data/dailybots.csv' )
data.head()
Out[4]:
In [5]:
data.describe()
Out[5]:
In [6]:
data.info()
In [7]:
data['botfam'].value_counts()
Out[7]:
Count the number of infected days for "Ramnit" in each industry. How:
groupby() function
In [8]:
grouped_df = data[data.botfam == "Ramnit"].groupby(['industry'])
grouped_df.sum()
Out[8]:
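Note that sum() adds up the values in each group, while size() counts the rows. The sketch below contrasts the two on a small made-up frame. The rows in `toy` are hypothetical, and it assumes that each row represents one day's report for a bot family in an industry.

```python
import pandas as pd

# Hypothetical rows that mimic the columns used in this worksheet
toy = pd.DataFrame({
    'date': ['2016-06-01', '2016-06-01', '2016-06-02', '2016-06-02'],
    'botfam': ['Ramnit', 'Ramnit', 'Ramnit', 'Necurs'],
    'industry': ['Retail', 'Finance', 'Retail', 'Retail'],
    'hosts': [5, 3, 7, 2],
    'orgs': [2, 1, 3, 1],
})

ramnit = toy[toy['botfam'] == 'Ramnit']

# size() counts rows per industry (assumed: one row per reported day)
days_per_industry = ramnit.groupby('industry').size()

# sum() restricted to the numeric columns totals hosts and orgs instead
totals_per_industry = ramnit.groupby('industry')[['hosts', 'orgs']].sum()

print(days_per_industry)
print(totals_per_industry)
```

Selecting the numeric columns before calling sum() also keeps the text columns (date, botfam) out of the result.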
In this exercise, you are asked to calculate the min, max, median and mean of infected orgs for each bot family sorted by median. HINT:
groupby() function, create a grouped data frame
In [9]:
group2 = data[['botfam','orgs']].groupby( ['botfam'])
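#Passing the aggregation names as strings gives the columns predictable names (min, max, mean, median, std)
summary = group2.agg(['min', 'max', 'mean', 'median', 'std'])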
summary.sort_values( [('orgs', 'median')], ascending=False)
Out[9]:
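The group, aggregate, then sort pattern can be checked by hand on a tiny example. This is only a sketch with hypothetical values in `toy`; selecting the single orgs column before agg() is a choice made here so that the result has flat column names.

```python
import pandas as pd

# Hypothetical data: two bot families with three observations each
toy = pd.DataFrame({
    'botfam': ['Ramnit', 'Ramnit', 'Ramnit', 'Necurs', 'Necurs', 'Necurs'],
    'orgs': [1, 2, 6, 10, 20, 30],
})

# One row per bot family, one column per statistic
summary = toy.groupby('botfam')['orgs'].agg(['min', 'max', 'median', 'mean'])

# Largest median first
summary_by_median = summary.sort_values('median', ascending=False)
print(summary_by_median)
```

For Ramnit the median (2) and the mean (3) differ because of the single larger value, which is why the choice of sort column matters.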
In [10]:
df3 = data[['date','hosts']].groupby('date').sum()
df3.sort_values(by='hosts', ascending=False).head(10)
Out[10]:
For the final step, you're going to plot the daily infected hosts for three infection types. In order to do this, you'll need to do the following steps:
groupby() to aggregate the data by date and family, then sum up the hosts in each group
unstack() function to prepare the data for plotting
plot() method to plot the results
In [11]:
filteredData = data[ data['botfam'].isin(['Necurs', 'Ramnit', 'PushDo']) ][['date', 'botfam', 'hosts']]
groupedFilteredData = filteredData.groupby( ['date', 'botfam']).sum()
groupedFilteredData.unstack(level=1).plot(kind='line', subplots=False)
Out[11]:
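It can help to look at what unstack() does to the grouped data before plotting it. The sketch below uses a small hypothetical frame (`toy`) with the same column names; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical daily records; Ramnit has two rows on the first day
toy = pd.DataFrame({
    'date': ['2016-06-01', '2016-06-01', '2016-06-01', '2016-06-02', '2016-06-02'],
    'botfam': ['Necurs', 'Ramnit', 'Ramnit', 'Necurs', 'Ramnit'],
    'hosts': [4, 1, 2, 6, 5],
})

# Long format: one total per (date, botfam) pair
daily_hosts = toy.groupby(['date', 'botfam'])['hosts'].sum()

# Wide format: dates stay in the index, each bot family becomes a column
wide = daily_hosts.unstack(level='botfam')
print(wide)
```

Calling plot() on a frame shaped like `wide` draws one line per column, which is why the unstack step comes before plotting.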