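The data in this lesson was obtained from gapminder.org. The variables included, each loaded from its own CSV file below, are employment (ages 15 and above), female and male completion rates, life expectancy, and GDP per capita.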
In [1]:
import pandas as pd
pandas provides functions such as read_csv and unique that help a lot.
In [2]:
import numpy as np
NumPy provides functions such as mean and std.
In [3]:
employments = pd.read_csv('employment_above_15.csv')
In [4]:
employments[0:5]
Out[4]:
In [5]:
#Selecting a column and displaying its first 5 elements
employments.get('1991')[0:5]
Out[5]:
In [6]:
employments.get('Country')[0:5]
Out[6]:
In [7]:
def max_employment(countries, employment):
    # Position of the largest employment value
    i = employment.argmax()
    return (countries[i], employment[i])
In [8]:
max_employment(employments.get('Country'), employments.get('2007'))
Out[8]:
Let's look at the element type of a few arrays, which NumPy calls the dtype.
In [9]:
countries = np.array(['Afghanistan','Albania','Algeria','Angola','Argentina','Armenia'])
employment = np.array([56.700001, 52.700001, 39.400002, 75.800003, 53.599998])
print countries.dtype
print employment.dtype
print np.array([0, 1, 2, 3]).dtype
print np.array([True, False, True]).dtype
print np.array(['AL', 'AK']).dtype
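|S11 means a byte string with a maximum length of 11, which here is the length of 'Afghanistan', the longest entry in countries. Under Python 3 the same array typically reports <U11, a Unicode string with the same maximum length.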
In [10]:
print employment.mean()
print employment.std()
print employment.max()
print employment.sum()
In [11]:
np.array([1, 2, 3]) + np.array([4, 5, 6])
Out[11]:
In [12]:
np.array([1, 2, 3]) * 3
Out[12]:
In [13]:
np.array([1, 2, 3]) + np.array([4, 5, 6])
Out[13]:
In [14]:
np.array([1, 2, 3]) + 1
Out[14]:
In [15]:
np.array([1, 2, 3]) - np.array([7, 10, 15])
Out[15]:
In [16]:
np.array([1, 2, 3]) - 1
Out[16]:
In [17]:
np.array([1, 2, 3]) * np.array([4, 5, 6])
Out[17]:
In [18]:
np.array([1, 2, 3]) * np.array([2])
Out[18]:
In [19]:
# Raises ValueError: shapes (3,) and (2,) cannot be broadcast together
# np.array([1, 2, 3]) * np.array([2, 3])
In [20]:
np.array([2, 3]) ** np.array([2, 3])
Out[20]:
In [21]:
np.array([5, 6]) ** 2
Out[21]:
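The outputs of the arithmetic cells above are not shown, so the following is a minimal sketch that recomputes a few of them. The variable names are illustrative and not part of the original notebook; the values in the comments follow from element-wise arithmetic and broadcasting of a length-1 array.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

elementwise_sum = a + b                  # array([5, 7, 9])
scalar_product = a * 3                   # array([3, 6, 9])
elementwise_product = a * b              # array([ 4, 10, 18])
broadcast_product = a * np.array([2])    # array([2, 4, 6]), length-1 array is broadcast
elementwise_power = np.array([2, 3]) ** np.array([2, 3])  # array([ 4, 27])

# Arrays of length 3 and 2 cannot be broadcast against each other
try:
    a * np.array([2, 3])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

A scalar or a length-1 array is stretched to match the other operand; any other length mismatch between one-dimensional arrays raises an error.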
See this article for more information about bitwise operations.
In NumPy, a & b performs a bitwise and of a and b. For integer arrays this is not necessarily the same as a logical and, which would test whether matching elements of the two arrays are both non-zero. If a and b are both arrays of booleans, however, bitwise and and logical and give the same result. To perform a logical and on integer arrays, use the NumPy function np.logical_and(a, b) or convert the arrays to booleans first.
Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. If the arrays contain booleans, these are the same as logical or and logical not. For integer-valued arrays, NumPy provides np.logical_or and np.logical_not for the logical versions.
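To make the difference concrete, here is a small sketch on integer arrays. The arrays a and b are made-up values chosen so that the bitwise and logical results disagree.

```python
import numpy as np

a = np.array([1, 2, 0])
b = np.array([2, 1, 0])

# Bitwise and works on the binary digits: 1 & 2 == 0 and 2 & 1 == 0
bitwise_and = a & b                           # array([0, 0, 0])

# Logical and asks whether both elements are non-zero
logical_and = np.logical_and(a, b)            # array([ True,  True, False])

# Converting to booleans first makes & behave like the logical version
via_bool = a.astype(bool) & b.astype(bool)    # array([ True,  True, False])

# The same distinction applies to not
bitwise_not = ~a                              # array([-2, -3, -1])
logical_not = np.logical_not(a)               # array([False, False,  True])
```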
In the solution, we may want to divide by the float 2. instead of the integer 2. In Python 2, dividing an integer by another integer drops the fractional part, so if our inputs are also integers we may lose information. Dividing by a float (2.) always retains decimal values.
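The effect can be sketched without depending on the Python version by calling np.floor_divide, which gives the same result that Python 2's integer / integer produces. The integer rates below are hypothetical and are not taken from the CSV files.

```python
import numpy as np

# Hypothetical integer-valued completion rates
female = np.array([57, 24, 66])
male = np.array([24, 45, 23])

total = female + male                    # array([81, 69, 89])

# What integer / integer yields in Python 2: fractions are dropped
truncated = np.floor_divide(total, 2)    # array([40, 34, 44])

# Dividing by a float keeps the fractions in both Python 2 and Python 3
exact = total / 2.                       # array([40.5, 34.5, 44.5])
```

In Python 3 the / operator always performs true division, so the pitfall only arises there with the // operator.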
In [22]:
female_completion = pd.read_csv('female_completion_rate.csv')
male_completion = pd.read_csv('male_completion_rate.csv')
In [23]:
female_completion[0:5]
Out[23]:
In [24]:
male_completion[0:5]
Out[24]:
In [25]:
female = np.array([56.0, 23.0, 65.0])
male = np.array([23.0, 45.0, 22.0])
In [26]:
def overall_completion_rate(female_completion, male_completion):
    # Divide by the float 2. so that integer inputs keep their fractions in Python 2
    return (female_completion + male_completion) / 2.
In [27]:
overall_completion_rate(female, male)
Out[27]:
In [28]:
def standardize_data(values):
    # Number of standard deviations each value lies from the mean
    return (values - values.mean()) / values.std()
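As a quick check of what standardize_data returns, here is a sketch on a made-up sample whose mean is 5 and whose standard deviation is 2. The sample values are an assumption for illustration only.

```python
import numpy as np

def standardize_data(values):
    # Number of standard deviations each value lies from the mean
    return (values - values.mean()) / values.std()

# Hypothetical sample: mean 5.0, population standard deviation 2.0
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

standardized = standardize_data(sample)
# array([-1.5, -0.5, -0.5, -0.5,  0. ,  0. ,  1. ,  2. ])
```

The standardized values have mean 0 and standard deviation 1. Note that the std method of a NumPy array uses the population formula (ddof=0) by default, whereas the std method of a pandas Series defaults to ddof=1, so the same function gives slightly different numbers on a Series.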
In [29]:
def mean_time_for_paid_students(time_spent, days_to_cancel):
    # Keep only students who stayed at least 7 days, then average their time
    return time_spent[days_to_cancel >= 7].mean()
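The function relies on boolean indexing: the comparison produces an array of booleans, and indexing with it keeps only the elements where the mask is True. A minimal sketch with made-up numbers:

```python
import numpy as np

def mean_time_for_paid_students(time_spent, days_to_cancel):
    # Keep only students who stayed at least 7 days, then average their time
    return time_spent[days_to_cancel >= 7].mean()

# Hypothetical data for four students
time_spent = np.array([10.0, 30.0, 20.0, 50.0])
days_to_cancel = np.array([3, 7, 10, 5])

mask = days_to_cancel >= 7          # array([False,  True,  True, False])
selected = time_spent[mask]         # array([30., 20.])
paid_mean = mean_time_for_paid_students(time_spent, days_to_cancel)  # 25.0
```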
In [30]:
a = np.array([1, 2, 3, 4])
b = a
a += np.array([1, 1, 1, 1])  # In-place: modifies the array that b also refers to
print b
In [31]:
a = np.array([1, 2, 3, 4])
b = a
a = a + np.array([1, 1, 1, 1])  # Creates a new array and rebinds a; b is unchanged
print b
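The two cells above differ only in += versus +. The sketch below repeats both and uses the is operator to show whether a and b still refer to the same array afterwards; the helper variable names are illustrative.

```python
import numpy as np

# In-place addition: the existing array is modified, so b sees the change
a = np.array([1, 2, 3, 4])
b = a
a += np.array([1, 1, 1, 1])
b_after_in_place = b.copy()        # array([2, 3, 4, 5])
same_after_in_place = a is b       # True

# Plain addition: a new array is created and a is rebound to it
a = np.array([1, 2, 3, 4])
b = a
a = a + np.array([1, 1, 1, 1])
b_after_rebinding = b.copy()       # array([1, 2, 3, 4])
same_after_rebinding = a is b      # False
```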
In [32]:
a = np.array([1, 2, 3, 4, 5])
slice = a[:3]
slice[0] = 100
a
Out[32]:
slice refers to a view of the original array rather than a copy, so assigning to slice[0] also changes a[0].
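When an independent array is wanted, one option is to call copy() on the slice. The sketch below contrasts the two cases; the names view and independent are illustrative.

```python
import numpy as np

# Slicing returns a view that shares memory with the original
a = np.array([1, 2, 3, 4, 5])
view = a[:3]
view[0] = 100                          # a is now array([100, 2, 3, 4, 5])
shares = np.shares_memory(a, view)     # True

# An explicit copy is independent of the original
c = np.array([1, 2, 3, 4, 5])
independent = c[:3].copy()
independent[0] = 100                   # c is still array([1, 2, 3, 4, 5])
```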
In [33]:
def variable_correlation(variable1, variable2):
    # True where both variables are above their respective means
    both_above = (variable1 > variable1.mean()) & \
                 (variable2 > variable2.mean())
    # True where both variables are below their respective means
    both_below = (variable1 < variable1.mean()) & \
                 (variable2 < variable2.mean())
    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()
    num_different_direction = len(variable1) - num_same_direction
    return (num_same_direction, num_different_direction)
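A short usage sketch for variable_correlation, with two made-up arrays that both have mean 3:

```python
import numpy as np

def variable_correlation(variable1, variable2):
    both_above = (variable1 > variable1.mean()) & \
                 (variable2 > variable2.mean())
    both_below = (variable1 < variable1.mean()) & \
                 (variable2 < variable2.mean())
    is_same_direction = both_above | both_below
    num_same_direction = is_same_direction.sum()
    num_different_direction = len(variable1) - num_same_direction
    return (num_same_direction, num_different_direction)

# Hypothetical inputs; both arrays have mean 3.0
v1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v2 = np.array([2.0, 1.0, 4.0, 5.0, 3.0])

same, different = variable_correlation(v1, v2)   # 3 and 2
```

Positions 0, 1 and 3 lie on the same side of the mean in both arrays. A value exactly equal to its mean is neither above nor below, so positions 2 and 4 are counted as different.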
In [34]:
s = pd.Series([1, 2, 3, 4])
In [35]:
s.describe()
Out[35]:
In [36]:
countries = np.array(['Albania', 'Algeria', 'Andorra', 'Angola'])
life_expectancy = np.array([74.7, 75., 83.4, 57.6])
life_expectancy
Out[36]:
Some people describe countries[0] as indexing into the array, but the instructor says position 0 to avoid confusion, because in pandas index and position are not the same thing.
In [37]:
life_expectancy = pd.Series([74.7, 75., 83.4, 57.6],
                            index=['Albania',
                                   'Algeria',
                                   'Andorra',
                                   'Angola'])
life_expectancy
Out[37]:
In [38]:
#Access by index
life_expectancy.loc['Angola']
Out[38]:
In [39]:
#If we don't specify an index, pandas automatically adds the index 0, 1, 2, ...
pd.Series([74.7, 75., 83.4, 57.6])
Out[39]:
In [40]:
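#Access element by position
print life_expectancy.iloc[0]
#Same result in the pandas version used here; recent pandas releases deprecate
#this positional fallback on a label-indexed Series, so prefer .iloc
print life_expectancy[0]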
In [41]:
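def max_employment(employment):
    # idxmax returns the index label (the country); in recent pandas versions
    # argmax returns the integer position instead, which .loc cannot look up here
    max_country = employment.idxmax()
    max_value = employment.loc[max_country]
    return (max_country, max_value)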
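Whether argmax returns a label or a position has varied across pandas releases, so the sketch below relies on idxmax, which returns the index label. The employment figures are made up for illustration and are not read from employment_above_15.csv.

```python
import pandas as pd

def max_employment(employment):
    max_country = employment.idxmax()          # index label of the largest value
    max_value = employment.loc[max_country]    # look the value up by label
    return (max_country, max_value)

# Hypothetical employment values indexed by country
employment = pd.Series([55.7, 51.4, 75.7],
                       index=['Afghanistan', 'Albania', 'Angola'])

country, value = max_employment(employment)    # 'Angola', 75.7

# The underlying NumPy array always reports an integer position
position = employment.values.argmax()          # 2
value_by_position = employment.iloc[position]  # 75.7
```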
In [42]:
s1 = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index = ['a', 'b', 'c', 'd'])
In [43]:
s1
Out[43]:
In [44]:
s2
Out[44]:
In [45]:
s1 + s2
Out[45]:
In [46]:
# Indexes are in a different order
s3 = pd.Series([10, 20, 30, 40], index = ['b', 'd', 'a', 'c'])
In [47]:
s3
Out[47]:
In [48]:
s1 + s3
Out[48]:
The two Series were added by matching index labels, not by position.
In [49]:
s4 = pd.Series([10, 20, 30, 40], index = ['c', 'd', 'e', 'f'])
In [50]:
s4
Out[50]:
In [51]:
s1 + s4
Out[51]:
In [52]:
#If we don't want to show NaN in our solution
(s1 + s4).dropna()
Out[52]:
In [53]:
#If we want to give a default value
s1.add(s4, fill_value=0)
Out[53]:
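The outputs of the cells above are not shown, so here is a sketch that recomputes them and converts each result to a dictionary keyed by index label. The variable names on the left are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([10, 20, 30, 40], index=['b', 'd', 'a', 'c'])
s4 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'e', 'f'])

# Same labels in a different order: values are paired by label
reordered_sum = (s1 + s3).to_dict()    # a: 31, b: 12, c: 43, d: 24

# Labels present in only one Series produce NaN
partial_sum = s1 + s4
num_missing = int(partial_sum.isnull().sum())    # 4 (labels a, b, e, f)

# dropna keeps only the labels found in both Series
dropped = partial_sum.dropna().to_dict()         # c: 13.0, d: 24.0

# fill_value=0 treats a missing label as 0 instead of producing NaN
filled = s1.add(s4, fill_value=0).to_dict()
# a: 1.0, b: 2.0, c: 13.0, d: 24.0, e: 30.0, f: 40.0
```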
In [54]:
names = pd.Series([
'Andre Agassi',
'Barry Bonds',
'Christopher Columbus',
'Daniel Defoe'
])
In [55]:
def reverse_name(name):
    # Assumes the name has exactly two parts: "First Last"
    split_name = name.split(" ")
    return "{}, {}".format(split_name[1], split_name[0])
In [56]:
reverse_name(names.iloc[0])
Out[56]:
In [57]:
def reverse_names(names):
    # Apply reverse_name to every element of the Series
    return names.apply(reverse_name)
In [58]:
reverse_names(names)
Out[58]:
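For reference, a self-contained sketch of what apply produces for the four names used above:

```python
import pandas as pd

def reverse_name(name):
    # Assumes the name has exactly two parts: "First Last"
    split_name = name.split(" ")
    return "{}, {}".format(split_name[1], split_name[0])

names = pd.Series([
    'Andre Agassi',
    'Barry Bonds',
    'Christopher Columbus',
    'Daniel Defoe'
])

# apply calls reverse_name once per element and returns a new Series
reversed_names = names.apply(reverse_name)
# 'Agassi, Andre', 'Bonds, Barry', 'Columbus, Christopher', 'Defoe, Daniel'
```

Because reverse_name only looks at the first two parts, a name with a middle name would lose its last part; that case is not handled here.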
In [59]:
employment = pd.read_csv('employment_above_15.csv', index_col = 'Country')
female_completion = pd.read_csv('female_completion_rate.csv', index_col = 'Country')
male_completion = pd.read_csv('male_completion_rate.csv', index_col = 'Country')
life_expectancy = pd.read_csv('life_expectancy.csv', index_col = 'Country')
gdp_per_capita = pd.read_csv('gdp_per_capita.csv', index_col = 'Country')
In [60]:
_country = 'United States'
employment_country = employment.loc[_country]
female_completion_country = female_completion.loc[_country]
male_completion_country = male_completion.loc[_country]
life_expectancy_country = life_expectancy.loc[_country]
gdp_per_capita_country = gdp_per_capita.loc[_country]
In [61]:
%matplotlib inline
In [62]:
employment_country.plot()
Out[62]:
In [63]:
female_completion_country.plot()
Out[63]:
In [64]:
male_completion_country.plot()
Out[64]:
In [65]:
life_expectancy_country.plot()
Out[65]:
In [66]:
gdp_per_capita_country.plot()
Out[66]: