Lesson 2: NumPy and Pandas for 1D Data

01 - Introduction

  • Get familiar with two libraries: NumPy and Pandas
  • They make data analysis code much easier to write
  • They make that code run faster
  • Use them to analyse one-dimensional data

02 - Gapminder Data

The data in this lesson was obtained from the site gapminder.org. The variables included are:

  • Aged 15+ Employment Rate (%)
  • Life Expectancy (years)
  • GDP/capita (US$, inflation adjusted)
  • Primary school completion (% of boys)
  • Primary school completion (% of girls)

04 - One-Dimensional Data in NumPy and Pandas


In [1]:
import pandas as pd
  • Importing pandas can take a few seconds
  • It has many helpful functions, such as read_csv and unique

In [2]:
import numpy as np

05 - NumPy Arrays

  • Both Pandas and NumPy have special data structures for 1D data
  • A NumPy array is similar to a Python list
  • Similarities
    • Access an element by index
    • Access a range of elements
    • Use loops
  • Differences
    • Every element should have the same type
    • An array can hold different types, but it was designed for a single data type
    • Convenient functions like mean and std
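
The similarities and differences listed above can be sketched as follows. This is only an illustration: the sample values are made up and the variable names are not from the lesson.

```python
import numpy as np

sample = [56.7, 52.7, 39.4, 75.8, 53.6]  # made-up sample values
as_list = list(sample)
as_array = np.array(sample)

# Similarities: indexing, slicing and looping work on both.
first_matches = as_list[0] == as_array[0]
middle = as_array[1:3]          # elements at positions 1 and 2
total = 0.0
for value in as_array:
    total += value

# Differences: an array has one dtype and built-in statistics.
array_dtype = as_array.dtype    # float64
array_mean = as_array.mean()
array_std = as_array.std()

# Mixed input is converted to one common type; here the number
# is stored as a string.
mixed = np.array([1, 'a'])

print(array_dtype)
print(array_mean)
print(mixed.dtype)
```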

In [3]:
employments = pd.read_csv('employment_above_15.csv')

In [4]:
employments[0:5]


Out[4]:
Country 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
0 Afghanistan 56.700001 56.500000 56.599998 56.200001 56.200001 56.099998 56.200001 56.200001 56.099998 56.099998 56.500000 56.400002 54.400002 56.000000 54.000000 56.000000 55.700001
1 Albania 52.700001 52.299999 52.400002 52.700001 52.799999 52.599998 52.400002 52.099998 52.099998 51.900002 51.799999 51.799999 51.799999 51.700001 51.500000 51.400002 51.400002
2 Algeria 39.400002 38.900002 39.400002 39.400002 38.099998 38.900002 39.700001 39.500000 39.400002 38.599998 40.400002 41.500000 42.799999 46.400002 48.000000 50.000000 50.500000
3 Angola 75.800003 75.800003 75.500000 75.900002 75.800003 75.900002 75.699997 75.599998 75.599998 75.500000 75.500000 75.599998 75.500000 75.500000 75.599998 75.500000 75.699997
4 Argentina 53.599998 53.799999 53.700001 53.799999 53.500000 54.400002 54.900002 55.000000 54.900002 55.500000 55.599998 55.400002 57.299999 57.700001 58.099998 58.400002 58.400002

In [5]:
#Selecting a column and displaying its first 5 elements
employments.get('1991')[0:5]


Out[5]:
0    56.700001
1    52.700001
2    39.400002
3    75.800003
4    53.599998
Name: 1991, dtype: float64

In [6]:
employments.get('Country')[0:5]


Out[6]:
0    Afghanistan
1        Albania
2        Algeria
3         Angola
4      Argentina
Name: Country, dtype: object

In [7]:
def max_employment(countries, employment):    
    i = employment.argmax()
    return (countries[i], employment[i])

In [8]:
max_employment(employments.get('Country'), employments.get('2007'))


Out[8]:
('Burundi', 83.199996948199995)

Let's look at the element type of a few arrays, which NumPy calls the dtype.


In [9]:
countries = np.array(['Afghanistan','Albania','Algeria','Angola','Argentina','Armenia'])
employment = np.array([56.700001, 52.700001, 39.400002, 75.800003, 53.599998])

print countries.dtype
print employment.dtype

print np.array([0, 1, 2, 3]).dtype
print np.array([True, False, True]).dtype
print np.array(['AL', 'AK']).dtype


|S11
float64
int64
bool
|S2

|S11 means a string with a maximum length of 11 characters; 'Afghanistan', the longest entry, has exactly 11.


In [10]:
print employment.mean()
print employment.std()
print employment.max()
print employment.sum()


55.640001
11.6969402871
75.800003
278.200005

07 - Vectorized Operations

  • NumPy supports vectorized operations
  • A vector is a list of numbers
  • Adding two vectors can mean different things, and different languages implement it differently
  • In NumPy, addition is element-wise
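
To see that "addition" really does differ between data structures, the sketch below compares + on plain Python lists with + on NumPy arrays. The variable names are illustrative only.

```python
import numpy as np

# Python lists: + concatenates the two lists.
list_sum = [1, 2, 3] + [4, 5, 6]

# NumPy arrays: + adds the elements at matching positions.
array_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])

print(list_sum)
print(array_sum)
```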

In [11]:
np.array([1, 2, 3]) + np.array([4, 5, 6])


Out[11]:
array([5, 7, 9])

09 - Multiplying by a Scalar

  • Multiplying an array by a scalar multiplies each element of the array by that scalar

In [12]:
np.array([1, 2, 3]) * 3


Out[12]:
array([3, 6, 9])
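
For contrast, a plain Python list treats * very differently. The short sketch below (names are illustrative) shows both behaviours side by side.

```python
import numpy as np

# Python list: * repeats the list.
repeated = [1, 2, 3] * 3

# NumPy array: * multiplies every element by the scalar.
scaled = np.array([1, 2, 3]) * 3

print(repeated)
print(scaled)
```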

11 - Calculate Overall Completion Rate

More vectorized operations


In [13]:
np.array([1, 2, 3]) + np.array([4, 5, 6])


Out[13]:
array([5, 7, 9])

In [14]:
np.array([1, 2, 3]) + 1


Out[14]:
array([2, 3, 4])

In [15]:
np.array([1, 2, 3]) - np.array([7, 10, 15])


Out[15]:
array([ -6,  -8, -12])

In [16]:
np.array([1, 2, 3]) - 1


Out[16]:
array([0, 1, 2])

In [17]:
np.array([1, 2, 3]) * np.array([4, 5, 6])


Out[17]:
array([ 4, 10, 18])

In [18]:
np.array([1, 2, 3]) * np.array([2])


Out[18]:
array([2, 4, 6])

In [19]:
#Raises ValueError: arrays of length 3 and length 2 cannot be broadcast together
#np.array([1, 2, 3]) * np.array([2, 3])

In [20]:
np.array([2, 3]) ** np.array([2, 3])


Out[20]:
array([ 4, 27])

In [21]:
np.array([5, 6]) ** 2


Out[21]:
array([25, 36])

See this article for more information about bitwise operations.

In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, if you wanted to see if matching terms in two integer vectors were non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.

Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.
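
The difference between bitwise and logical operations is easiest to see on small integer arrays. The following is a minimal sketch with made-up values; only np.logical_and and np.logical_not are taken from NumPy, the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 0])
b = np.array([2, 1, 0])

# Bitwise and compares binary digits: 1 & 2 is 0 even though
# both values are non-zero.
bitwise = a & b                      # values 0, 0, 0

# Logical and only asks whether both values are non-zero.
logical = np.logical_and(a, b)       # True, True, False

# Converting to booleans first makes & act as a logical and.
via_bool = a.astype(bool) & b.astype(bool)

# The same distinction applies to not: ~ flips bits on integers,
# but is a logical not on booleans.
bitwise_not = ~a                     # values -2, -3, -1
logical_not = np.logical_not(a)      # False, False, True
bool_not = ~a.astype(bool)           # False, False, True

print(bitwise)
print(logical)
```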

In the solution, we may want to divide by 2. (a float) instead of just 2 (an integer). This is because in Python 2, dividing an integer by another integer drops the fractional part, so if our inputs are also integers, we may end up losing information. If we divide by a float, we will definitely retain decimal values.
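
The loss of information can be sketched with integer inputs. As an assumption for illustration, the floor division operator // stands in for Python 2's integer /, since // drops the fractional part in both Python 2 and Python 3. The input values are made up.

```python
import numpy as np

female = np.array([56, 23, 65])   # integer inputs (made up)
male = np.array([23, 46, 22])

# Integer division drops the fractional part of each average.
floored = (female + male) // 2    # values 39, 34, 43

# Dividing by a float keeps it.
exact = (female + male) / 2.      # values 39.5, 34.5, 43.5

print(floored)
print(exact)
```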


In [22]:
female_completion = pd.read_csv('female_completion_rate.csv')
male_completion = pd.read_csv('male_completion_rate.csv')

In [23]:
female_completion[0:5]


Out[23]:
Country 1970 1971 1972 1973 1974 1975 1976 1977 1978 ... 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
0 Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan NaN NaN NaN NaN 4.19285 NaN NaN 5.14529 5.91965 ... NaN NaN NaN 18.74188 NaN NaN NaN NaN NaN NaN
2 Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 100.27718 97.70814 NaN NaN NaN 90.41091 89.76010 86.01452 89.53901
4 Algeria NaN NaN 30.90031 33.02938 34.32702 39.7942 45.24156 48.22515 50.49138 ... 90.09179 91.27633 93.30839 94.21432 NaN 97.35583 109.72854 95.13346 95.87439 94.20928

5 rows × 43 columns


In [24]:
male_completion[0:5]


Out[24]:
Country 1970 1971 1972 1973 1974 1975 1976 1977 1978 ... 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
0 Abkhazia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Afghanistan NaN NaN NaN NaN 26.23178 NaN NaN 26.73849 29.07336 ... NaN NaN NaN 48.36070 NaN NaN NaN NaN NaN NaN
2 Akrotiri and Dhekelia NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 101.90853 99.10666 NaN NaN NaN 89.44897 88.83622 86.49044 88.18113
4 Algeria NaN NaN 55.3917 57.41252 60.71733 66.29215 72.58079 75.04380 77.39284 ... 90.34439 92.04006 92.68131 94.66572 NaN 95.47622 121.33472 93.16440 96.09082 94.51607

5 rows × 43 columns


In [25]:
female = np.array([56.0, 23.0, 65.0])
male = np.array([23.0, 45.0, 22.0])

In [26]:
def overall_completion_rate(female_completion, male_completion):
    return (female_completion + male_completion) / 2

In [27]:
overall_completion_rate(female, male)


Out[27]:
array([ 39.5,  34. ,  43.5])

13 - Standardizing Data

  • How does one data point compare to the rest of the data?
  • One way to answer this is to convert each data point to the number of standard deviations it lies from the mean

In [28]:
def standardize_data(values):
    return (values - values.mean()) / values.std()
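
A small worked example may help. The function is repeated so the sketch runs on its own, and the input values are made up so that the mean is 5 and the standard deviation is 2.

```python
import numpy as np

def standardize_data(values):
    # Number of standard deviations each value lies from the mean.
    return (values - values.mean()) / values.std()

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
standardized = standardize_data(values)

# 9.0 is 2 standard deviations above the mean, 2.0 is 1.5 below.
print(standardized)
```

Note that NumPy's std() divides by n by default, whereas the std() of a Pandas Series divides by n - 1, so the same function gives slightly different results on an array and on a Series.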

15 - NumPy Index Arrays


In [29]:
def mean_time_for_paid_students(time_spent, days_to_cancel):
    return time_spent[days_to_cancel >= 7].mean()
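
The function above relies on an index array: comparing an array to a number produces an array of booleans, and indexing with that boolean array keeps only the elements where it is True. A minimal sketch with made-up values:

```python
import numpy as np

time_spent = np.array([12.9, 5.7, 7.2, 2.0, 33.9])   # made-up values
days_to_cancel = np.array([4, 5, 37, 3, 12])

# Boolean array: True where the student stayed at least 7 days.
is_paid = days_to_cancel >= 7        # False, False, True, False, True

# Indexing with it keeps only the matching elements.
paid_time = time_spent[is_paid]      # 7.2 and 33.9
mean_paid_time = paid_time.mean()

print(is_paid)
print(paid_time)
```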

17 - + vs +=


In [30]:
a = np.array([1, 2, 3, 4])
b = a
a += np.array([1, 1, 1, 1]) #Difference here
print b


[2 3 4 5]


In [31]:
a = np.array([1, 2, 3, 4])
b = a
a = a + np.array([1, 1, 1, 1]) #Difference here
print b


[1 2 3 4]

19 - In-Place vs Not In-Place

  • += operates in-place, modifying the existing array, while + creates a new array

In [32]:
a = np.array([1, 2, 3, 4, 5])
slice = a[:3]
slice[0] = 100

a


Out[32]:
array([100,   2,   3,   4,   5])
  • slice is a view of the original array, so changing it also changes a
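
When an independent copy is wanted instead of a view, one option is to call copy() on the slice. The sketch below uses np.shares_memory to check which arrays share data; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
view = a[:3]             # a view: shares memory with a
independent = a[:3].copy()   # a separate copy of the same elements

view[0] = 100            # also changes a[0]
independent[1] = -1      # leaves a untouched

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, independent)

print(a)
print(shares_view)
print(shares_copy)
```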

21 - Pandas Series


In [33]:
def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & \
                 (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & \
                 (variable2 < variable2.mean())
    
    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()
    
    num_different_direction = len(variable1) - num_same_direction
    
    return (num_same_direction, num_different_direction)
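
A usage sketch with two short, made-up Series. The function is repeated (with the line continuations replaced by single expressions) so that the example runs on its own.

```python
import pandas as pd

def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & (variable2 < variable2.mean())
    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()
    num_different_direction = len(variable1) - num_same_direction
    return (num_same_direction, num_different_direction)

v1 = pd.Series([1, 2, 3, 4, 5])   # mean is 3
v2 = pd.Series([2, 1, 4, 5, 3])   # mean is 3

# Positions 0, 1 (both below) and 3 (both above) agree in direction.
result = variable_correlation(v1, v2)
print(result)
```

A value exactly equal to its mean is neither above nor below it, so that pair is counted as "different direction".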

23 - Series Indexes


In [34]:
s = pd.Series([1, 2, 3, 4])

In [35]:
s.describe()


Out[35]:
count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
dtype: float64

In [36]:
countries = np.array(['Albania', 'Algeria', 'Andorra', 'Angola'])
life_expectancy = np.array([74.7, 75., 83.4, 57.6])

life_expectancy


Out[36]:
array([ 74.7,  75. ,  83.4,  57.6])

Some people describe countries[0] as indexing into the array, but the instructor says "position 0" to avoid confusion, because in Pandas index and position are not the same thing.


In [37]:
life_expectancy = pd.Series([74.7, 75., 83.4, 57.6],
                           index = ['Albania', 
                                    'Algeria', 
                                    'Andorra', 
                                    'Angola'])

life_expectancy


Out[37]:
Albania    74.7
Algeria    75.0
Andorra    83.4
Angola     57.6
dtype: float64
  • A NumPy array is like a souped-up Python list
  • A Pandas Series is like a cross between a list and a dictionary

In [38]:
#Access by index
life_expectancy.loc['Angola']


Out[38]:
57.600000000000001

In [39]:
#If we don't specify an index, pandas automatically adds the index 0, 1, 2, ...
pd.Series([74.7, 75., 83.4, 57.6])


Out[39]:
0    74.7
1    75.0
2    83.4
3    57.6
dtype: float64

In [40]:
#Access element by position
print life_expectancy.iloc[0]

#same as (newer pandas versions deprecate positional [] on a labeled Series; prefer iloc)
print life_expectancy[0]


74.7
74.7
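
The difference between index and position is easiest to see when the index labels are themselves integers. The following sketch uses made-up labels 10, 20, 30; it is an illustration, not part of the lesson.

```python
import pandas as pd

s = pd.Series([74.7, 75.0, 83.4], index=[10, 20, 30])

by_label = s.loc[10]       # the element whose label is 10
by_position = s.iloc[0]    # the element at position 0 (the same one)
last = s.iloc[2]           # position 2 carries the label 30

# There is no label 0, so looking it up by label fails.
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label)
print(by_position)
print(label_zero_found)
```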

In [41]:
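def max_employment(employment):
    # idxmax returns the index label (the country) of the maximum value;
    # in newer pandas versions argmax returns the position instead
    max_country = employment.idxmax()
    max_value = employment.loc[max_country]

    return (max_country, max_value)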
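
A usage sketch on a small Series indexed by country. The values are the 2007 figures from the employment table earlier in this lesson, rounded to one decimal. This version uses idxmax, which returns the index label of the largest value; that choice is an assumption made so the sketch also works on recent pandas versions.

```python
import pandas as pd

def max_employment(employment):
    max_country = employment.idxmax()         # label of the largest value
    max_value = employment.loc[max_country]   # look the value up by label
    return (max_country, max_value)

employment_2007 = pd.Series(
    [55.7, 51.4, 50.5, 75.7, 58.4],
    index=['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina'])

result = max_employment(employment_2007)
print(result)
```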

25 - Vectorized Operations and Series Indexes

  • NumPy adds arrays by position
  • What happens when we add two Pandas Series?

In [42]:
s1 = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index = ['a', 'b', 'c', 'd'])

In [43]:
s1


Out[43]:
a    1
b    2
c    3
d    4
dtype: int64

In [44]:
s2


Out[44]:
a    10
b    20
c    30
d    40
dtype: int64

In [45]:
s1 + s2


Out[45]:
a    11
b    22
c    33
d    44
dtype: int64

In [46]:
# Same index labels, in a different order
s3 = pd.Series([10, 20, 30, 40], index = ['b', 'd', 'a', 'c'])

In [47]:
s3


Out[47]:
b    10
d    20
a    30
c    40
dtype: int64

In [48]:
s1 + s3


Out[48]:
a    31
b    12
c    43
d    24
dtype: int64

The values were matched by index label, not by position, before being added.
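
To contrast the two behaviours directly, the sketch below adds the same two Series once as Series (matched by label) and once as plain NumPy arrays via the values attribute (matched by position).

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])

# Series addition: values with the same label are added.
by_label = s1 + s3                    # a: 31, b: 12, c: 43, d: 24

# Array addition: values at the same position are added.
by_position = s1.values + s3.values   # 11, 22, 33, 44

print(by_label)
print(by_position)
```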


In [49]:
s4 = pd.Series([10, 20, 30, 40], index = ['c', 'd', 'e', 'f'])

In [50]:
s4


Out[50]:
c    10
d    20
e    30
f    40
dtype: int64

In [51]:
s1 + s4


Out[51]:
a     NaN
b     NaN
c    13.0
d    24.0
e     NaN
f     NaN
dtype: float64

In [52]:
#If we don't want to show NaN in our solution
(s1 + s4).dropna()


Out[52]:
c    13.0
d    24.0
dtype: float64

28 - Filling Missing Values - Solution


In [53]:
#If we want to give a default value
s1.add(s4, fill_value=0)


Out[53]:
a     1.0
b     2.0
c    13.0
d    24.0
e    30.0
f    40.0
dtype: float64

29 - Pandas Series apply

  • So far we have used built-in functions like mean() and vectorized operations like +
  • apply takes a function and returns a new Series made by calling that function on each element of the Series
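
As a minimal sketch of what apply does, the example below applies a hypothetical helper add_one to a numeric Series and checks that the result matches the vectorized s + 1. It then applies the built-in len to a Series of strings, a case with no arithmetic operator to fall back on.

```python
import pandas as pd

def add_one(x):
    # Hypothetical helper: called once per element.
    return x + 1

s = pd.Series([1, 2, 3, 4])
applied = s.apply(add_one)
vectorized = s + 1           # same result without apply

words = pd.Series(['a', 'bb', 'ccc'])
lengths = words.apply(len)   # length of each string

print(applied)
print(lengths)
```

When a vectorized operation exists, it is usually the simpler choice; apply is most useful for element-wise logic that has no built-in equivalent.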


In [54]:
names = pd.Series([
        'Andre Agassi',
        'Barry Bonds',
        'Christopher Columbus',
        'Daniel Defoe'
    ])

In [55]:
def reverse_name(name):
    split_name = name.split(" ")
    return "{}, {}".format(split_name[1], split_name[0])

In [56]:
reverse_name(names.iloc[0])


Out[56]:
'Agassi, Andre'

In [57]:
def reverse_names(names):
    return names.apply(reverse_name)

In [58]:
reverse_names(names)


Out[58]:
0            Agassi, Andre
1             Bonds, Barry
2    Columbus, Christopher
3            Defoe, Daniel
dtype: object

31 - Plotting in Pandas - Solution


In [59]:
employment = pd.read_csv('employment_above_15.csv', index_col = 'Country')
female_completion = pd.read_csv('female_completion_rate.csv', index_col = 'Country')
male_completion = pd.read_csv('male_completion_rate.csv', index_col = 'Country')
life_expectancy = pd.read_csv('life_expectancy.csv', index_col = 'Country')
gdp_per_capita = pd.read_csv('gdp_per_capita.csv', index_col = 'Country')

In [60]:
_country = 'United States'

employment_country = employment.loc[_country]
female_completion_country = female_completion.loc[_country]
male_completion_country = male_completion.loc[_country]
life_expectancy_country = life_expectancy.loc[_country]
gdp_per_capita_country = gdp_per_capita.loc[_country]

In [61]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [62]:
employment_country.plot()


Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8166640d10>

In [63]:
female_completion_country.plot()


Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81642da610>

In [64]:
male_completion_country.plot()


Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8163ccee90>

In [65]:
life_expectancy_country.plot()


Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8163c0e150>

In [66]:
gdp_per_capita_country.plot()


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8163c195d0>