This worksheet reviews the concepts from the first half of Module 1 - Exploratory Data Analysis in One Dimension. It should take no more than 20-30 minutes to complete. Please raise your hand if you get stuck.
There are many ways to accomplish the tasks presented here; however, the techniques covered in class should make the exercises relatively simple.
For this exercise, we will be using:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# %matplotlib inline renders plots in the notebook without the namespace pollution of %pylab
%matplotlib inline
For this exercise, you are given a Series of random numbers creatively named random_numbers. For the first exercise please do the following:
In [38]:
# Generate a series of 50 random integers from 1 to 99 (randint excludes the upper bound).
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
random_numbers[random_numbers > 10].sort_values().describe()
Out[38]:
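The filter-then-summarize pattern in the cell above can be sketched on its own as follows. This is a minimal illustration, not the worksheet's official answer: the names rng, sample, mask and filtered are introduced here for the example, and the seed is an arbitrary choice made only so the run is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)
sample = pd.Series(rng.integers(1, 100, size=50))  # integers from 1 to 99

# A boolean mask keeps only the elements where the condition is True.
mask = sample > 10
filtered = sample[mask].sort_values()

# describe() reports count, mean, std, min, quartiles and max.
summary = filtered.describe()
print(summary)
```

Note that filtering keeps the original index labels, so the sorted result carries the positions the values had in the unfiltered Series.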
In [14]:
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
random_numbers.sort_values().unique()[0:10]
# count the even and odd numbers
even_numbers = random_numbers[random_numbers % 2 == 0].count()
odd_numbers = random_numbers[random_numbers % 2 != 0].count()
Out[14]:
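As a small worked check of the counting idea, the sketch below uses a fixed, hand-written Series (the values are made up for illustration) so the results can be verified by eye. Summing a boolean mask is an alternative to filtering and calling count(), since each True counts as 1.

```python
import pandas as pd

# Hand-picked values so the counts are easy to verify by eye.
values = pd.Series([3, 8, 15, 22, 22, 41, 60, 77])

# Summing a boolean mask counts the True entries.
even_count = int((values % 2 == 0).sum())
odd_count = int((values % 2 == 1).sum())

# The three smallest distinct values.
smallest_unique = values.drop_duplicates().nsmallest(3).tolist()

print(even_count, odd_count, smallest_unique)
```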
In [17]:
random_numbers = pd.Series( np.random.randint(1, 100, 50) )
random_numbers.hist(bins=10)
Out[17]:
You have been given a list of US phone numbers. The area code is the first three digits. Your task is to produce a summary of how many times each area code appears in the list. To do this you will need to:
In [18]:
phone_numbers = [
'(833) 759-6854',
'(811) 268-9951',
'(855) 449-4648',
'(833) 212-2929',
'(833) 893-7475',
'(822) 346-3086',
'(844) 259-9074',
'(855) 975-8945',
'(811) 385-8515',
'(811) 523-5090',
'(844) 593-5677',
'(833) 534-5793',
'(899) 898-3043',
'(833) 662-7621',
'(899) 146-8244',
'(822) 793-4965',
'(822) 641-7853',
'(833) 153-7848',
'(811) 958-2930',
'(822) 332-3070',
'(833) 223-1776',
'(811) 397-1451',
'(844) 096-0377',
'(822) 000-0717',
'(899) 311-1880']
In [25]:
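phone_records = pd.Series(phone_numbers)
# Option 1: split off the "(xxx)" part and strip the parentheses
phone_records.apply(lambda x: int(x.split(' ')[0][1:-1])).unique()
# or slice the three digits by position
area_codes = phone_records.str.slice(1, 4)
# or capture them with a regular expression
area_codes = phone_records.str.extract(r'\((\d{3})\)', expand=False)
# summarize how many times each area code appears
area_codes.value_counts()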
Out[25]:
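One possible way to go from raw phone numbers to a per-area-code tally is sketched below on a six-number subset of the list above. The names sample_numbers, codes and code_counts are introduced only for this example.

```python
import pandas as pd

# A short subset of the worksheet's phone numbers.
sample_numbers = pd.Series([
    '(833) 759-6854',
    '(811) 268-9951',
    '(833) 212-2929',
    '(822) 346-3086',
    '(811) 385-8515',
    '(833) 893-7475',
])

# Capture the three digits between the literal parentheses.
codes = sample_numbers.str.extract(r'\((\d{3})\)', expand=False)

# value_counts() tallies each distinct code, most frequent first.
code_counts = codes.value_counts()
print(code_counts)
```

The extracted codes are strings, which is usually what you want for identifiers such as area codes; convert with astype(int) only if you need numeric behavior.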
First, create a data frame from the dailybots.csv file, which can be found in the data directory. You should be able to do this with the pd.read_csv() function. Take a minute to look at the data frame, because we will use it to answer several different questions.
In [39]:
data = pd.read_csv( '../../data/dailybots.csv' )
data.describe()
Out[39]:
In [40]:
data.info() # useful to see columns
In [43]:
data.columns  # columns is an attribute, not a method
In [44]:
data['botfam'].value_counts()
Out[44]:
Count the number of infected days for "Ramnit" in each industry. How:
groupby() function
In [45]:
grouped_df = data[data.botfam == 'Ramnit'].groupby(['industry'])
grouped_df.sum(numeric_only=True)  # skip text columns such as date and botfam
Out[45]:
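The filter-then-group step can be sketched on a tiny stand-in table. The column names botfam, industry and hosts follow the worksheet, but the rows, the industry labels and the family name OtherBot are made up for illustration and do not come from dailybots.csv.

```python
import pandas as pd

# Toy stand-in for the real data; all values here are invented.
toy = pd.DataFrame({
    'botfam':   ['Ramnit', 'Ramnit', 'Ramnit', 'OtherBot', 'OtherBot'],
    'industry': ['Retail', 'Finance', 'Retail', 'Retail', 'Finance'],
    'hosts':    [5, 3, 2, 7, 1],
})

# Keep only the Ramnit rows, then group them by industry.
ramnit = toy[toy['botfam'] == 'Ramnit']

hosts_by_industry = ramnit.groupby('industry')['hosts'].sum()  # totals per group
rows_by_industry = ramnit.groupby('industry').size()           # row counts per group

print(hosts_by_industry)
print(rows_by_industry)
```

sum() adds up a numeric column within each group, while size() counts rows; which one answers the question depends on what each row of the real file represents.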
In this exercise, you are asked to calculate the min, max, median and mean of infected orgs for each bot family sorted by median. HINT:
groupby() function, create a grouped data frame
In [ ]:
group2 = data[['botfam', 'orgs']].groupby(['botfam'])
summary = group2.agg(['min', 'max', 'mean', 'median', 'std'])
# the result has two-level columns, so sort by the ('orgs', 'median') tuple
summary.sort_values(('orgs', 'median'), ascending=False)
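Aggregating with a list of functions produces two-level column labels, which is why the sort key has to be a tuple. The sketch below shows this on a toy table; the family labels A and B and the orgs values are invented for the example.

```python
import pandas as pd

# Toy data: two made-up families with three observations each.
toy = pd.DataFrame({
    'botfam': ['A', 'A', 'A', 'B', 'B', 'B'],
    'orgs':   [1, 2, 9, 4, 5, 6],
})

# A list of function names yields MultiIndex columns such as ('orgs', 'median').
stats = toy.groupby('botfam').agg(['min', 'max', 'mean', 'median'])

# Sort by the median column, addressed by its (column, statistic) tuple.
ranked = stats.sort_values(('orgs', 'median'), ascending=False)
print(ranked)
```

Family A has the larger mean only if its outlier dominates; here its median is 2 against 5 for B, so B ranks first, which illustrates why the median can order groups differently from the mean.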
In [ ]:
df3 = data[['date', 'hosts']].groupby('date').sum()
df3.sort_values(by='hosts')
For the final step, you're going to plot the daily infected hosts for three infection types. In order to do this, you'll need to do the following steps:
groupby() to aggregate the data by date and family, then sum up the hosts in each group
unstack() function to prepare the data for plotting
plot() method to plot the results
In [ ]:
In [ ]:
#Plot the data
groupedFilteredData.unstack(level=1).plot(kind='line', subplots=False)
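The groupby, unstack and plot steps can be sketched end to end on a toy table. Everything below is an assumption made for illustration: the dates, the family labels A, B and C, the host counts, and the names wanted, daily and wide are not taken from the worksheet's data or solution.

```python
import pandas as pd

# Toy data with invented dates, families and host counts.
toy = pd.DataFrame({
    'date':   ['2016-06-01', '2016-06-01', '2016-06-01', '2016-06-02', '2016-06-02', '2016-06-02'],
    'botfam': ['A', 'A', 'B', 'A', 'B', 'C'],
    'hosts':  [1, 2, 10, 4, 20, 99],
})

# Keep only the families of interest (a hypothetical selection).
wanted = ['A', 'B']
filtered = toy[toy['botfam'].isin(wanted)]

# Sum hosts per (date, family); the result is a Series with a two-level index.
daily = filtered.groupby(['date', 'botfam'])['hosts'].sum()

# unstack(level=1) pivots the family level into columns: one row per date.
wide = daily.unstack(level=1)
print(wide)
```

With the data in this wide shape, calling wide.plot(kind='line') draws one line per family against the date index.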