Lesson 2: NumPy and Pandas for 1D Data

01 - Introduction

You will get familiar with two libraries: NumPy and Pandas.
They make data analysis code much easier to write.
The code also runs faster.
This lesson uses them to analyse one-dimensional data.

02 - Gapminder Data

The data in this lesson was obtained from the site gapminder.org. The variables included are:

Aged 15+ Employment Rate (%)
Life Expectancy (years)
GDP/capita (US$, inflation adjusted)
Primary school completion (% of boys)
Primary school completion (% of girls)

04 - One-Dimensional Data in NumPy and Pandas



In [1]:

    
import pandas as pd

Importing pandas can take a little while.
It has many helpful functions, such as read_csv and unique.



In [2]:

    
import numpy as np

05 - NumPy Arrays

Both Pandas and NumPy have special data structures for 1D data.
A NumPy array is similar to a Python list.
Similarities
- Access an element by index
- Access a range of elements
- Use loops
Differences
- Each element should have the same type
- An array can hold mixed types, but it was designed for a single data type
- Convenient functions like mean and std
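
The points above can be sketched as follows. This is a minimal illustration with made-up values; the names values and mixed are hypothetical and not part of the lesson. The print calls use parentheses so the snippet also runs under Python 3.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])

# Similarities with a Python list: indexing, slicing, looping
first = values[0]
middle = values[1:3]
total = 0.0
for v in values:
    total += v

# Differences: built-in statistics such as mean and std
mean = values.mean()
std = values.std()

# Mixing numbers and strings does not fail; NumPy converts every
# element to one common type (here, a string type)
mixed = np.array([1, 'two', 3.0])

print(first)
print(middle)
print(mixed.dtype.kind)  # 'U' (unicode string) on Python 3
```

Because every element of mixed becomes a string, numeric functions like mean no longer apply to it, which is why arrays are normally kept to a single data type.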



In [3]:

    
employments = pd.read_csv('employment_above_15.csv')



In [4]:

    
employments[0:5]









    Out[4]:






  
    
      
       Country       1991       1992       1993       1994       1995       1996       1997       1998       1999       2000       2001       2002       2003       2004       2005       2006       2007
0  Afghanistan  56.700001  56.500000  56.599998  56.200001  56.200001  56.099998  56.200001  56.200001  56.099998  56.099998  56.500000  56.400002  54.400002  56.000000  54.000000  56.000000  55.700001
1      Albania  52.700001  52.299999  52.400002  52.700001  52.799999  52.599998  52.400002  52.099998  52.099998  51.900002  51.799999  51.799999  51.799999  51.700001  51.500000  51.400002  51.400002
2      Algeria  39.400002  38.900002  39.400002  39.400002  38.099998  38.900002  39.700001  39.500000  39.400002  38.599998  40.400002  41.500000  42.799999  46.400002  48.000000  50.000000  50.500000
3       Angola  75.800003  75.800003  75.500000  75.900002  75.800003  75.900002  75.699997  75.599998  75.599998  75.500000  75.500000  75.599998  75.500000  75.500000  75.599998  75.500000  75.699997
4    Argentina  53.599998  53.799999  53.700001  53.799999  53.500000  54.400002  54.900002  55.000000  54.900002  55.500000  55.599998  55.400002  57.299999  57.700001  58.099998  58.400002  58.400002



In [5]:

    
#Selecting a column and displaying its first 5 elements
employments.get('1991')[0:5]









    Out[5]:





0    56.700001
1    52.700001
2    39.400002
3    75.800003
4    53.599998
Name: 1991, dtype: float64



In [6]:

    
employments.get('Country')[0:5]









    Out[6]:





0    Afghanistan
1        Albania
2        Algeria
3         Angola
4      Argentina
Name: Country, dtype: object



In [7]:

    
def max_employment(countries, employment):    
    i = employment.argmax()
    return (countries[i], employment[i])



In [8]:

    
max_employment(employments.get('Country'), employments.get('2007'))









    Out[8]:





('Burundi', 83.199996948199995)

Let's look at the element type of a few arrays, which NumPy calls the dtype.



In [9]:

    
countries = np.array(['Afghanistan','Albania','Algeria','Angola','Argentina','Armenia'])
employment = np.array([56.700001, 52.700001, 39.400002, 75.800003, 53.599998])

print countries.dtype
print employment.dtype

print np.array([0, 1, 2, 3]).dtype
print np.array([True, False, True]).dtype
print np.array(['AL', 'AK']).dtype









    



|S11
float64
int64
bool
|S2

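|S11 means a string type with a maximum length of 11 characters (the longest name here, 'Afghanistan', has 11). Under Python 3 the same array reports <U11, a Unicode string of the same length.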



In [10]:

    
print employment.mean()
print employment.std()
print employment.max()
print employment.sum()









    



55.640001
11.6969402871
75.800003
278.200005

07 - Vectorized Operations

NumPy supports vectorized operations.
A vector is a list of numbers.
Adding two vectors can mean different things, and different languages implement it differently.
In NumPy, addition is element-wise.



In [11]:

    
np.array([1, 2, 3]) + np.array([4, 5, 6])









    Out[11]:





array([5, 7, 9])
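
For contrast, here is a small sketch of how the same + behaves on plain Python lists and on NumPy arrays. The variable names are illustrative only.

```python
import numpy as np

# On Python lists, + concatenates
list_sum = [1, 2, 3] + [4, 5, 6]

# On NumPy arrays, + adds element by element
array_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])

print(list_sum)   # [1, 2, 3, 4, 5, 6]
print(array_sum)  # [5 7 9]
```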

09 - Multiplying by a Scalar

Multiplying an array by a scalar multiplies each element of the array by that scalar.



In [12]:

    
np.array([1, 2, 3]) * 3









    Out[12]:





array([3, 6, 9])

11 - Calculate Overall Completion Rate

More vectorized operations



In [13]:

    
np.array([1, 2, 3]) + np.array([4, 5, 6])









    Out[13]:





array([5, 7, 9])



In [14]:

    
np.array([1, 2, 3]) + 1









    Out[14]:





array([2, 3, 4])



In [15]:

    
np.array([1, 2, 3]) - np.array([7, 10, 15])









    Out[15]:





array([ -6,  -8, -12])



In [16]:

    
np.array([1, 2, 3]) - 1









    Out[16]:





array([0, 1, 2])



In [17]:

    
np.array([1, 2, 3]) * np.array([4, 5, 6])









    Out[17]:





array([ 4, 10, 18])



In [18]:

    
np.array([1, 2, 3]) * np.array([2])









    Out[18]:





array([2, 4, 6])



In [19]:

    
#Throws ValueError: arrays of length 3 and length 2 cannot be broadcast together
#np.array([1, 2, 3]) * np.array([2, 3])
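
A hedged sketch of the failing case, wrapped in try/except so the message can be inspected without stopping the notebook. The names ok and error_message are hypothetical.

```python
import numpy as np

a = np.array([1, 2, 3])

# Length 1 on one side is fine: the single value is reused for every element
ok = a * np.array([2])

# Lengths 3 and 2 cannot be matched up, so NumPy raises ValueError
try:
    a * np.array([2, 3])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(ok)
print(error_message)
```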



In [20]:

    
np.array([2, 3]) ** np.array([2, 3])









    Out[20]:





array([ 4, 27])



In [21]:

    
np.array([5, 6]) ** 2









    Out[21]:





array([25, 36])

See this article for more information about bitwise operations.

In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, if you wanted to see if matching terms in two integer vectors were non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.

Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.
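
The difference between bitwise and logical "and" on integer arrays can be seen in a small sketch. The values are made up to make the two results disagree.

```python
import numpy as np

a = np.array([1, 2, 0])
b = np.array([2, 1, 0])

# Bitwise and: 1 & 2 is 0 because the two numbers share no set bits
bitwise = a & b

# Logical and: True wherever both entries are non-zero
logical = np.logical_and(a, b)

# Converting to booleans first gives the same answer as logical_and
via_bool = (a != 0) & (b != 0)

print(bitwise)
print(logical)
print(via_bool)
```

Here the bitwise result is all zeros even though the first two positions are non-zero in both arrays, which is why the note above recommends np.logical_and or a conversion to booleans.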

In the solution, we may want to divide by a float (/ 2.) instead of an integer (/ 2). This is because in Python 2, dividing an integer by another integer drops the fractional part, so if our inputs are also integers, we may end up losing information. If we divide by a float (2.), the decimal values are retained.
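
A minimal sketch of the information loss described above. The operator // is used here to reproduce, on any Python version, what / does to integer arrays in Python 2. The input values are hypothetical.

```python
import numpy as np

female = np.array([1, 2, 3])   # integer inputs, made-up values
male = np.array([2, 2, 4])

# Floor division drops the fractional part
floored = (female + male) // 2

# Dividing by a float keeps it
exact = (female + male) / 2.

print(floored)
print(exact)
```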



In [22]:

    
female_completion = pd.read_csv('female_completion_rate.csv')
male_completion = pd.read_csv('male_completion_rate.csv')



In [23]:

    
female_completion[0:5]









    Out[23]:






  
    
      
                 Country       1970       1971       1972       1973       1974       1975       1976       1977       1978  ...       2002       2003       2004       2005       2006       2007       2008       2009       2010       2011
0               Abkhazia        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1            Afghanistan        NaN        NaN        NaN        NaN    4.19285        NaN        NaN    5.14529    5.91965  ...        NaN        NaN        NaN   18.74188        NaN        NaN        NaN        NaN        NaN        NaN
2  Akrotiri and Dhekelia        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
3                Albania        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN  100.27718   97.70814        NaN        NaN        NaN   90.41091   89.76010   86.01452   89.53901
4                Algeria        NaN        NaN   30.90031   33.02938   34.32702    39.7942   45.24156   48.22515   50.49138  ...   90.09179   91.27633   93.30839   94.21432        NaN   97.35583  109.72854   95.13346   95.87439   94.20928
    
  

5 rows × 43 columns



In [24]:

    
male_completion[0:5]









    Out[24]:






  
    
      
                 Country       1970       1971       1972       1973       1974       1975       1976       1977       1978  ...       2002       2003       2004       2005       2006       2007       2008       2009       2010       2011
0               Abkhazia        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
1            Afghanistan        NaN        NaN        NaN        NaN   26.23178        NaN        NaN   26.73849   29.07336  ...        NaN        NaN        NaN   48.36070        NaN        NaN        NaN        NaN        NaN        NaN
2  Akrotiri and Dhekelia        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN
3                Albania        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN  ...        NaN  101.90853   99.10666        NaN        NaN        NaN   89.44897   88.83622   86.49044   88.18113
4                Algeria        NaN        NaN    55.3917   57.41252   60.71733   66.29215   72.58079   75.04380   77.39284  ...   90.34439   92.04006   92.68131   94.66572        NaN   95.47622  121.33472   93.16440   96.09082   94.51607
    
  

5 rows × 43 columns



In [25]:

    
female = np.array([56.0, 23.0, 65.0])
male = np.array([23.0, 45.0, 22.0])



In [26]:

    
def overall_completion_rate(female_completion, male_completion):
    # Divide by a float so integer inputs are not truncated in Python 2
    return (female_completion + male_completion) / 2.



In [27]:

    
overall_completion_rate(female, male)









    Out[27]:





array([ 39.5,  34. ,  43.5])

13 - Standardizing Data

How does one data point compare to the rest of the data?
One way to answer this is to convert each data point to the number of standard deviations it lies from the mean.



In [28]:

    
def standardize_data(values):
    return (values - values.mean()) / values.std()
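
A short usage sketch for the function above, with made-up scores chosen so the mean is 5 and the standard deviation is 2. The name scores is hypothetical.

```python
import numpy as np

def standardize_data(values):
    return (values - values.mean()) / values.std()

scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data
standardized = standardize_data(scores)

# Each entry is now "how many standard deviations from the mean"
print(standardized)
```

After standardizing, the result has mean 0 and standard deviation 1. Note that NumPy's std divides by n by default, while a Pandas Series std divides by n - 1, so the same function gives slightly different numbers on a Series.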

15 - NumPy Index Arrays



In [29]:

    
def mean_time_for_paid_students(time_spent, days_to_cancel):
    return time_spent[days_to_cancel >= 7].mean()
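
The function above relies on a boolean index array. A minimal sketch of the intermediate steps, using hypothetical values for both arrays:

```python
import numpy as np

time_spent = np.array([12.9, 5.3, 0.0, 44.1, 20.0])   # hypothetical minutes
days_to_cancel = np.array([4, 10, 2, 7, 30])          # hypothetical days

# Comparing an array with a scalar gives an array of booleans
is_paid = days_to_cancel >= 7

# Indexing with that boolean array keeps only the positions that are True
paid_time = time_spent[is_paid]
mean_paid = paid_time.mean()

print(is_paid)
print(paid_time)
```

The two arrays must have the same length, since each boolean decides whether the element at the same position is kept.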

17 - + vs +=



In [30]:

    
a = np.array([1, 2, 3, 4])
b = a
a += np.array([1, 1, 1, 1]) #Difference here: += modifies the array in place
print b #prints [2 3 4 5], because b still refers to the same array as a



In [31]:

    
a = np.array([1, 2, 3, 4])
b = a
a = a + np.array([1, 1, 1, 1]) #Difference here: + creates a new array
print b #prints [1 2 3 4], because b still refers to the original array

19 - In-Place vs Not In-Place

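+= operates in-place: it modifies the existing array, so every name that refers to it sees the change. + does not: it creates a new array and leaves the original untouched.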
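
One way to check this is with the is operator, which tells whether two names refer to the same object. A minimal sketch with illustrative variable names:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = a
a += 1                      # in-place: modifies the array that b also names
same_after_iadd = a is b

c = np.array([1, 2, 3, 4])
d = c
c = c + 1                   # builds a new array and rebinds c; d is untouched
same_after_add = c is d

print(same_after_iadd)  # True
print(same_after_add)   # False
```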



In [32]:

    
a = np.array([1, 2, 3, 4, 5])
slice = a[:3]
slice[0] = 100

a









    Out[32]:





array([100,   2,   3,   4,   5])

slice is a view of the original array, not a copy, so changing slice also changes a.
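
If an independent copy is wanted instead, one option is to call copy() on the slice. A minimal sketch; the names view and independent are hypothetical:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

view = a[:3]                  # a view: shares memory with a
independent = a[:3].copy()    # a copy: has its own memory

independent[0] = -1           # does not affect a
view[0] = 100                 # changes a as well

print(a)
print(independent)
```

np.shares_memory(a, view) can be used to confirm which of the two is backed by the original array.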

21 - Pandas Series



In [33]:

    
def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & \
                 (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & \
                 (variable2 < variable2.mean())
    
    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()
    
    num_different_direction = len(variable1) - num_same_direction
    
    return (num_same_direction, num_different_direction)
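
A usage sketch for the function above on two small made-up Series (both have a mean of 3.5). The function is repeated so the snippet is self-contained; x and y are hypothetical names.

```python
import pandas as pd

def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & \
                 (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & \
                 (variable2 < variable2.mean())

    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()

    num_different_direction = len(variable1) - num_same_direction

    return (num_same_direction, num_different_direction)

x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([2, 1, 4, 3, 6, 5])

# Positions 0, 1, 4 and 5 are on the same side of the mean in both Series
result = variable_correlation(x, y)
print(result)
```

A large first number relative to the second suggests the two variables tend to move together.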

23 - Series Indexes



In [34]:

    
s = pd.Series([1, 2, 3, 4])



In [35]:

    
s.describe()









    Out[35]:





count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
dtype: float64



In [36]:

    
countries = np.array(['Albania', 'Algeria', 'Andorra', 'Angola'])
life_expectancy = np.array([74.7, 75., 83.4, 57.6])

life_expectancy









    Out[36]:





array([ 74.7,  75. ,  83.4,  57.6])

Some people refer to countries[0] as indexing into the array. The instructor instead says position 0 to avoid confusion, because in Pandas index and position are not the same thing.



In [37]:

    
life_expectancy = pd.Series([74.7, 75., 83.4, 57.6],
                           index = ['Albania', 
                                    'Algeria', 
                                    'Andorra', 
                                    'Angola'])

life_expectancy









    Out[37]:





Albania    74.7
Algeria    75.0
Andorra    83.4
Angola     57.6
dtype: float64

NumPy arrays are a souped-up version of Python lists.
A Pandas Series is like a cross between a list and a dictionary.



In [38]:

    
#Access by index
life_expectancy.loc['Angola']









    Out[38]:





57.600000000000001



In [39]:

    
#If we don't specify an index, pandas automatically adds the index 0, 1, 2, ...
pd.Series([74.7, 75., 83.4, 57.6])









    Out[39]:





0    74.7
1    75.0
2    83.4
3    57.6
dtype: float64



In [40]:

    
#Access element by position
print life_expectancy.iloc[0]

#same as (positional fallback on a label index; recent pandas versions deprecate this, so prefer .iloc)
print life_expectancy[0]
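
The difference between index and position is easiest to see when the index labels are integers that do not match the positions. A minimal sketch, with an index chosen purely for illustration:

```python
import pandas as pd

s = pd.Series([74.7, 75.0, 83.4, 57.6], index=[10, 20, 30, 40])

by_position = s.iloc[0]     # first element, whatever its label
by_label = s.loc[10]        # element whose index label is 10
last_by_label = s.loc[40]

# With an integer index, plain s[0] looks for the label 0, which does not exist
try:
    s[0]
    plain_lookup = 'found'
except KeyError:
    plain_lookup = 'KeyError'

print(by_position)
print(plain_lookup)
```

Using loc and iloc explicitly avoids having to remember which rule plain square brackets apply.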



In [41]:

    
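def max_employment(employment):
    # idxmax returns the index label of the maximum value. In older pandas
    # argmax did the same, but in recent versions argmax returns the position.
    max_country = employment.idxmax()
    max_value = employment.loc[max_country]
    
    return (max_country, max_value)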
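
A small sketch of looking up the maximum by label on a Series indexed by country. It uses idxmax, which returns the index label; in recent pandas versions argmax returns the integer position instead. The employment values are a made-up subset for illustration.

```python
import pandas as pd

employment = pd.Series(
    [55.7, 51.4, 50.5, 75.7],
    index=['Afghanistan', 'Albania', 'Algeria', 'Angola'],
)

max_country = employment.idxmax()          # label of the largest value
max_value = employment.loc[max_country]    # look the value up by that label

print(max_country)
print(max_value)
```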

25 - Vectorized Operations and Series Indexes

In NumPy arrays, addition is done position by position.
What happens if we add two Pandas Series?



In [42]:

    
s1 = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index = ['a', 'b', 'c', 'd'])



In [43]:

    
s1









    Out[43]:





a    1
b    2
c    3
d    4
dtype: int64



In [44]:

    
s2









    Out[44]:





a    10
b    20
c    30
d    40
dtype: int64



In [45]:

    
s1 + s2









    Out[45]:





a    11
b    22
c    33
d    44
dtype: int64



In [46]:

    
# Same index labels, but in a different order
s3 = pd.Series([10, 20, 30, 40], index = ['b', 'd', 'a', 'c'])



In [47]:

    
s3









    Out[47]:





b    10
d    20
a    30
c    40
dtype: int64



In [48]:

    
s1 + s3









    Out[48]:





a    31
b    12
c    43
d    24
dtype: int64

The two Series were added by matching index labels, not positions.



In [49]:

    
s4 = pd.Series([10, 20, 30, 40], index = ['c', 'd', 'e', 'f'])



In [50]:

    
s4









    Out[50]:





c    10
d    20
e    30
f    40
dtype: int64



In [51]:

    
s1 + s4









    Out[51]:





a     NaN
b     NaN
c    13.0
d    24.0
e     NaN
f     NaN
dtype: float64



In [52]:

    
#If we don't want to show NaN in our solution
(s1 + s4).dropna()









    Out[52]:





c    13.0
d    24.0
dtype: float64

28 - Filling Missing Values - Solution



In [53]:

    
#If we want to give a default value
s1.add(s4, fill_value=0)









    Out[53]:





a     1.0
b     2.0
c    13.0
d    24.0
e    30.0
f    40.0
dtype: float64
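
Note that filling before the addition is not the same as filling after it. A minimal sketch comparing add with fill_value against fillna on the result, reusing s1 and s4 from above:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s4 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

# Missing side is treated as 0, then the addition happens
filled_before = s1.add(s4, fill_value=0)

# Addition happens first (producing NaN), then NaN results are set to 0
filled_after = (s1 + s4).fillna(0)

print(filled_before)
print(filled_after)
```

With fill_value the unmatched values (1, 2, 30, 40) survive; with fillna afterwards they are replaced by 0.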

29 - Pandas Series apply

So far we have used built-in functions like mean() and vectorized operations like +.
When neither covers what we need, apply takes a function, calls it on each element of the Series, and returns a new Series of the results.



In [54]:

    
names = pd.Series([
        'Andre Agassi',
        'Barry Bonds',
        'Christopher Columbus',
        'Daniel Defoe'
    ])



In [55]:

    
def reverse_name(name):
    split_name = name.split(" ")
    return "{}, {}".format(split_name[1], split_name[0])



In [56]:

    
reverse_name(names.iloc[0])









    Out[56]:





'Agassi, Andre'



In [57]:

    
def reverse_names(names):
    return names.apply(reverse_name)



In [58]:

    
reverse_names(names)









    Out[58]:





0            Agassi, Andre
1             Bonds, Barry
2    Columbus, Christopher
3            Defoe, Daniel
dtype: object
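
When a vectorized operation exists, it gives the same result as apply without calling a Python function per element. A minimal sketch; add_one is a hypothetical helper:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

def add_one(x):
    return x + 1

via_apply = s.apply(add_one)   # calls add_one once per element
via_vector = s + 1             # same result in one vectorized step

print(via_apply.equals(via_vector))  # True
```

apply is most useful for cases like reverse_name above, where no built-in vectorized operation does the job.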

31 - Plotting in Pandas - Solution



In [59]:

    
employment = pd.read_csv('employment_above_15.csv', index_col = 'Country')
female_completion = pd.read_csv('female_completion_rate.csv', index_col = 'Country')
male_completion = pd.read_csv('male_completion_rate.csv', index_col = 'Country')
life_expectancy = pd.read_csv('life_expectancy.csv', index_col = 'Country')
gdp_per_capita = pd.read_csv('gdp_per_capita.csv', index_col = 'Country')



In [60]:

    
_country = 'United States'

employment_country = employment.loc[_country]
female_completion_country = female_completion.loc[_country]
male_completion_country = male_completion.loc[_country]
life_expectancy_country = life_expectancy.loc[_country]
gdp_per_capita_country = gdp_per_capita.loc[_country]



In [61]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib



In [62]:

    
employment_country.plot()









    Out[62]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8166640d10>



In [63]:

    
female_completion_country.plot()









    Out[63]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f81642da610>



In [64]:

    
male_completion_country.plot()









    Out[64]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8163ccee90>



In [65]:

    
life_expectancy_country.plot()









    Out[65]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8163c0e150>



In [66]:

    
gdp_per_capita_country.plot()









    Out[66]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f8163c195d0>
