Pandas Exercise

When working on real-world data tasks, you'll quickly realize that a large portion of your time is spent manipulating raw data into a form you can actually work with, a process often called data munging or data wrangling. Different programming languages have different methods and packages to handle this task, with varying degrees of ease; luckily for us, Python has an excellent one called Pandas, which we will be using in this exercise.


In [1]:
import numpy as np
import pandas as pd

Importing data and working with Data Frames

The Data Frame is perhaps the most important object in Pandas and Data Science in Python, providing a plethora of functions for common data tasks. Using only Pandas, do the following exercises.
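As a quick warm-up sketch (not one of the exercises; the column names and values below are made up), a Data Frame can be built directly from a dictionary of columns, and every column immediately supports vectorized operations:

import pandas as pd

toy = pd.DataFrame({'name': ['Ann', 'Bob', 'Carla'],
                    'age': [23, 35, 41]})
print(toy.shape)          # (3, 2): three rows, two columns
print(toy['age'].mean())  # column-wise operations work out of the box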

1 - Download free1.csv from the R Data Repository and save it to the same directory as this notebook. Then import it into your environment as a Data Frame. Now read free2.csv directly into a Data Frame from the URL.


In [2]:
# Import from file
free1 = pd.read_csv('free1.csv')

# Import from URL
free2 = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/Zelig/free2.csv')

In [3]:
print(free1.head())
print(free2.head())


   Unnamed: 0  sex   age  educ   country  y  v1  v2  v3  v4  v5  v6
0      109276  0.0  20.0   4.0   Eurasia  1   4   3   3   5   3   4
1       88178  1.0  25.0   4.0    Oceana  2   3   3   5   5   5   5
2      111063  1.0  56.0   2.0  Eastasia  2   3   2   4   5   5   4
3      161488  0.0  65.0   6.0  Eastasia  2   3   3   5   5   5   5
4       44532  1.0  50.0   5.0    Oceana  1   5   3   5   5   3   5
   Unnamed: 0  sex   age  educ   country  y  v1  v2  v3  v4  v5  v6
0      109276  0.0  20.0   4.0   Eurasia  1   4   3   3   5   3   4
1       88178  1.0  25.0   4.0    Oceana  2   3   3   5   5   5   5
2      111063  1.0  56.0   2.0  Eastasia  2   3   2   4   5   5   4
3      161488  0.0  65.0   6.0  Eastasia  2   3   3   5   5   5   5
4       44532  1.0  50.0   5.0    Oceana  1   5   3   5   5   3   5

2 - Combine your free1 Data Frame with free2 into a single Data Frame, named free_data, and print the first few rows to verify that it worked correctly. From here on out, this combined Data Frame is what we will be working with.


In [4]:
free_data = pd.concat([free1, free2], ignore_index=True)

3 - Print the last 10 rows.


In [5]:
free_data.tail(10)


Out[5]:
Unnamed: 0 sex age educ country y v1 v2 v3 v4 v5 v6
890 56676 1.0 42.0 1.0 Oceana 5 4 1 3 3 2 4
891 58098 0.0 41.0 1.0 Oceana 5 3 4 3 3 4 4
892 117252 1.0 41.0 6.0 Eurasia 5 4 2 1 4 4 3
893 110212 0.0 40.0 3.0 Eurasia 4 3 2 3 4 3 3
894 168326 0.0 24.0 4.0 Eastasia 5 3 4 3 3 3 4
895 95744 1.0 70.0 1.0 Eastasia 3 2 1 1 2 1 1
896 109491 1.0 18.0 4.0 Eurasia 3 1 1 1 1 1 2
897 65788 1.0 19.0 1.0 Eastasia 5 3 3 3 3 3 3
898 147766 0.0 53.0 4.0 Eastasia 4 3 3 3 3 3 3
899 116952 1.0 18.0 3.0 Eurasia 5 4 4 4 4 4 4

4 - Rename the first column (currently unnamed) to id. Print the column names to verify that it worked correctly.


In [6]:
new_col = np.array(free_data.columns)
new_col[0] = 'id'
free_data.columns = new_col
print(free_data.columns)

# solution: free_data.rename(columns={'Unnamed: 0': 'id'}, inplace=True)


Index(['id', 'sex', 'age', 'educ', 'country', 'y', 'v1', 'v2', 'v3', 'v4',
       'v5', 'v6'],
      dtype='object')

5 - What are the number of rows and columns of the Data Frame?


In [7]:
free_data.shape


Out[7]:
(900, 12)

6 - What are the data types of each column? Can quantities like the mean be calculated for each column? If not, which one(s) and why?


In [8]:
free_data.dtypes


Out[8]:
id           int64
sex        float64
age        float64
educ       float64
country     object
y            int64
v1           int64
v2           int64
v3           int64
v4           int64
v5           int64
v6           int64
dtype: object

7 - Print out the first 5 rows of the country column.


In [9]:
free_data['country'].head()


Out[9]:
0     Eurasia
1      Oceana
2    Eastasia
3    Eastasia
4      Oceana
Name: country, dtype: object

8 - How many unique values are in the country column?


In [10]:
len(free_data['country'].unique())


Out[10]:
3

9 - Print out the number of occurrences of each unique value in the country column.


In [11]:
free_data.groupby('country')['id'].count()
# solution: free_data.country.value_counts()


Out[11]:
country
Eastasia    300
Eurasia     300
Oceana      300
Name: id, dtype: int64

10 - Summarize the dataframe.


In [12]:
free_data.describe()


Out[12]:
id sex age educ y v1 v2 v3 v4 v5 v6
count 900.000000 898.000000 892.000000 890.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000
mean 90665.368889 0.556793 40.744395 2.941573 3.520000 2.648889 2.535556 3.664444 4.084444 3.866667 4.380000
std 44234.598996 0.497041 16.743316 1.600394 1.293709 1.151991 1.267310 1.016363 0.955973 0.984869 0.989399
min 142.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 52621.000000 0.000000 27.000000 1.000000 3.000000 2.000000 2.000000 3.000000 3.000000 3.000000 4.000000
50% 108699.000000 1.000000 39.000000 3.000000 4.000000 3.000000 2.000000 4.000000 4.000000 4.000000 5.000000
75% 119329.000000 1.000000 52.000000 4.000000 5.000000 3.000000 3.000000 4.000000 5.000000 5.000000 5.000000
max 171811.000000 1.000000 90.000000 7.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

11 - Were all columns included in the summary? If not, print the summary again, forcing this column to appear in the result.


In [13]:
#free_data['country'] = free_data['country'].astype('category')

In [14]:
free_data.describe(include='all')


Out[14]:
id sex age educ country y v1 v2 v3 v4 v5 v6
count 900.000000 898.000000 892.000000 890.000000 900 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000 900.000000
unique NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN NaN
top NaN NaN NaN NaN Oceana NaN NaN NaN NaN NaN NaN NaN
freq NaN NaN NaN NaN 300 NaN NaN NaN NaN NaN NaN NaN
mean 90665.368889 0.556793 40.744395 2.941573 NaN 3.520000 2.648889 2.535556 3.664444 4.084444 3.866667 4.380000
std 44234.598996 0.497041 16.743316 1.600394 NaN 1.293709 1.151991 1.267310 1.016363 0.955973 0.984869 0.989399
min 142.000000 0.000000 1.000000 1.000000 NaN 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 52621.000000 0.000000 27.000000 1.000000 NaN 3.000000 2.000000 2.000000 3.000000 3.000000 3.000000 4.000000
50% 108699.000000 1.000000 39.000000 3.000000 NaN 4.000000 3.000000 2.000000 4.000000 4.000000 4.000000 5.000000
75% 119329.000000 1.000000 52.000000 4.000000 NaN 5.000000 3.000000 3.000000 4.000000 5.000000 5.000000 5.000000
max 171811.000000 1.000000 90.000000 7.000000 NaN 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

12 - Print rows 100 to 110 of the free1 Data Frame.


In [15]:
free1.iloc[99:110]


Out[15]:
Unnamed: 0 sex age educ country y v1 v2 v3 v4 v5 v6
99 57136 1.0 28.0 1.0 Oceana 5 5 5 5 5 5 5
100 71010 1.0 51.0 1.0 Eastasia 5 5 5 5 5 5 5
101 145298 1.0 20.0 2.0 Eastasia 2 2 1 3 4 5 3
102 162131 1.0 43.0 5.0 Eastasia 3 3 3 3 3 3 3
103 81406 0.0 45.0 1.0 Eastasia 1 1 1 1 1 1 1
104 164869 1.0 61.0 6.0 Eastasia 3 3 2 4 5 5 5
105 110303 0.0 37.0 6.0 Eurasia 3 3 2 3 5 3 3
106 78048 1.0 28.0 1.0 Oceana 2 2 2 4 5 4 5
107 118281 1.0 23.0 3.0 Eurasia 3 3 4 3 3 3 3
108 24024 1.0 75.0 3.0 Oceana 2 2 2 3 4 3 5
109 101158 1.0 25.0 3.0 Eastasia 2 5 4 3 4 5 2

13 - Print rows 100 to 110 of only the first 3 columns in free1 using only indices.


In [16]:
free1.iloc[99:110, 0:3]


Out[16]:
Unnamed: 0 sex age
99 57136 1.0 28.0
100 71010 1.0 51.0
101 145298 1.0 20.0
102 162131 1.0 43.0
103 81406 0.0 45.0
104 164869 1.0 61.0
105 110303 0.0 37.0
106 78048 1.0 28.0
107 118281 1.0 23.0
108 24024 1.0 75.0
109 101158 1.0 25.0

14 - Create and print a list containing the mean and the value counts of each column in the data frame except the country column.


In [17]:
# create a list containing all columns
col = list(free_data.columns)
# exclude the country column
col = col[0:4] + col[5:]
#calculate mean and count and put them in a list
[list(free_data[col].mean()), list(free_data[col].count())]

# solution:
# results = []
# for col in free_data.drop('country', axis=1).columns:
#    results.append((free_data[col].mean(), free_data[col].value_counts()))


Out[17]:
[[90665.368888888886,
  0.55679287305122493,
  40.744394618834079,
  2.9415730337078654,
  3.52,
  2.6488888888888891,
  2.5355555555555553,
  3.6644444444444444,
  4.0844444444444443,
  3.8666666666666667,
  4.3799999999999999],
 [900, 898, 892, 890, 900, 900, 900, 900, 900, 900, 900]]

15 - Create a Data Frame, called demographics, using only the columns sex, age, and educ from the free1 Data Frame. Also create a Data Frame called scores, using only the columns v1, v2, v3, v4, v5, v6 from the free1 Data Frame.


In [18]:
demographics = free1[['sex', 'age', 'educ']]
print(demographics.head())

scores = free1[['v1', 'v2', 'v3', 'v4', 'v5', 'v6']]
print(scores.head())


   sex   age  educ
0  0.0  20.0   4.0
1  1.0  25.0   4.0
2  1.0  56.0   2.0
3  0.0  65.0   6.0
4  1.0  50.0   5.0
   v1  v2  v3  v4  v5  v6
0   4   3   3   5   3   4
1   3   3   5   5   5   5
2   3   2   4   5   5   4
3   3   3   5   5   5   5
4   5   3   5   5   3   5

16 - Loop through each row in scores and grab the largest value found in the v_ columns of each row, storing your results in two lists containing the value and the column name it came from. For example, row 0 is

{'v1': 4, 'v2': 3, 'v3': 3, 'v4': 5, 'v5': 3, 'v6': 4}

the values

('v4', 5)

should be added to your two lists.

We can do this in two ways: vectorized or using a for loop.

In the vectorized version we take the max and the label of the max (idxmax) across the columns (axis=1) and put the resulting values in two lists:


In [19]:
value = list(scores.max(axis=1))
col_name = list(scores.idxmax(axis=1))  # idxmax returns the column label of each row's maximum

# print the first ten values for checking
print(value[0:10])
print(col_name[0:10])


[5, 5, 5, 5, 5, 4, 5, 4, 5, 5]
['v4', 'v3', 'v4', 'v3', 'v1', 'v1', 'v6', 'v1', 'v3', 'v4']

In the for loop version we iterate through each row and append the max and the label of the max for every row to two lists:


In [20]:
value = []
col_name = []

for index, row in scores.iterrows():
    value.append(np.max(row))
    col_name.append(row.idxmax())

# print the first ten values for checking
print(value[0:10])
print(col_name[0:10])


[5, 5, 5, 5, 5, 4, 5, 4, 5, 5]
['v4', 'v3', 'v4', 'v3', 'v1', 'v1', 'v6', 'v1', 'v3', 'v4']

17 - Create a new Data Frame with columns named cat and score from your results in part (16), for the column with the largest score and the actual score respectively.


In [21]:
cat_scores = pd.DataFrame({'cat':col_name,'score':value})
cat_scores.head()


Out[21]:
cat score
0 v4 5
1 v3 5
2 v4 5
3 v3 5
4 v1 5

18 - Using the Data Frame created in part (17), print the frequency of each column being the max score.


In [22]:
cat_scores.groupby('cat').count() / cat_scores.shape[0]


Out[22]:
score
cat
v1 0.186667
v2 0.102222
v3 0.202222
v4 0.220000
v5 0.062222
v6 0.226667

Sorting, Filtering, and Grouping data

Most of the time, we'll want to rearrange the data a bit, include only certain values in our analysis, or put the data into useful groups. Pandas provides syntax and many functions to do this.
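As a minimal sketch of these three operations on the free_data frame built above (the age threshold is arbitrary):

free_data.sort_values('age')                  # sorting by a column
free_data[free_data['age'] > 30]              # filtering with a boolean mask
free_data.groupby('country')['age'].mean()    # grouping and aggregating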

Using only Pandas, do the following exercises.

1 - Using the free1.csv downloaded above, import it as a Data Frame named free_data, rename the first column to id, and print the first few rows.


In [23]:
free_data = pd.read_csv('free1.csv')
col = list(free_data.columns)
col[0] = 'id'
free_data.columns = col

free_data.head()


Out[23]:
id sex age educ country y v1 v2 v3 v4 v5 v6
0 109276 0.0 20.0 4.0 Eurasia 1 4 3 3 5 3 4
1 88178 1.0 25.0 4.0 Oceana 2 3 3 5 5 5 5
2 111063 1.0 56.0 2.0 Eastasia 2 3 2 4 5 5 4
3 161488 0.0 65.0 6.0 Eastasia 2 3 3 5 5 5 5
4 44532 1.0 50.0 5.0 Oceana 1 5 3 5 5 3 5

2 - Sort free_data by country, educ, and then by age in descending order, modifying the original Data Frame.


In [24]:
free_data.sort_values(['country', 'educ', 'age'], ascending=[True, True, False], inplace=True)

free_data.head()


Out[24]:
id sex age educ country y v1 v2 v3 v4 v5 v6
260 129355 1.0 83.0 1.0 Eastasia 4 1 2 4 5 5 5
415 142546 0.0 70.0 1.0 Eastasia 5 2 3 3 4 4 3
445 95744 1.0 70.0 1.0 Eastasia 3 2 1 1 2 1 1
147 128180 0.0 68.0 1.0 Eastasia 3 2 2 5 3 3 5
268 93407 1.0 64.0 1.0 Eastasia 3 1 3 2 2 2 5

3 - Create a new Data Frame called uni containing only rows from free_data which indicate that the person attended university or graduate school. Print the value counts for each country.

This is the dictionary for the educ column:

  1. No formal education
  2. Less than primary school education
  3. Completed primary school
  4. Completed secondary school
  5. Completed high school
  6. Completed college
  7. Completed post-graduate degree

In [25]:
# Interpreting 'attended university or graduate school' as completed high school or above (educ >= 5); the intended cutoff is not entirely clear
uni = free_data[free_data['educ'] >= 5]
uni.groupby('country')['id'].count()


Out[25]:
country
Eastasia    33
Eurasia     27
Oceana      21
Name: id, dtype: int64

4 - Create a list of three Data Frames for those who are less than 25 years old, between 25 and 50 years old, and older than 50.


In [26]:
ages_df_list = [free_data[free_data['age'] < 25],
                free_data[(free_data['age'] >= 25) & (free_data['age'] <= 50)],
                free_data[free_data['age'] > 50]]

ages_df_list[1]


Out[26]:
id sex age educ country y v1 v2 v3 v4 v5 v6
261 137663 0.0 50.0 1.0 Eastasia 4 2 2 3 5 4 5
223 127868 1.0 46.0 1.0 Eastasia 5 1 1 5 5 5 1
103 81406 0.0 45.0 1.0 Eastasia 1 1 1 1 1 1 1
360 138535 1.0 45.0 1.0 Eastasia 5 2 3 4 5 4 5
319 129590 1.0 43.0 1.0 Eastasia 5 1 1 4 4 3 5
348 159708 1.0 43.0 1.0 Eastasia 5 4 2 4 5 3 5
199 156041 1.0 40.0 1.0 Eastasia 4 2 4 4 4 3 5
243 160339 1.0 39.0 1.0 Eastasia 5 4 5 5 3 4 5
74 140966 0.0 35.0 1.0 Eastasia 1 3 4 5 5 3 5
29 8524 1.0 32.0 1.0 Eastasia 4 5 4 5 5 5 5
204 8296 0.0 32.0 1.0 Eastasia 5 3 3 5 5 5 5
122 166412 0.0 31.0 1.0 Eastasia 3 3 3 4 3 4 3
263 114086 0.0 30.0 1.0 Eastasia 3 1 2 2 1 3 4
302 82600 1.0 30.0 1.0 Eastasia 4 2 1 3 4 3 5
198 125327 1.0 28.0 1.0 Eastasia 4 3 2 5 5 3 5
220 125656 0.0 27.0 1.0 Eastasia 3 1 1 3 3 3 4
259 20075 1.0 25.0 1.0 Eastasia 5 1 2 5 5 3 5
344 105904 1.0 25.0 1.0 Eastasia 5 2 1 3 5 5 5
374 149219 1.0 46.0 2.0 Eastasia 5 4 2 4 3 3 4
421 107534 1.0 41.0 2.0 Eastasia 4 2 5 3 2 4 3
242 146950 0.0 38.0 2.0 Eastasia 5 3 2 5 5 4 5
367 162577 1.0 36.0 2.0 Eastasia 4 3 3 3 3 3 3
411 147495 1.0 33.0 2.0 Eastasia 4 3 3 3 3 3 3
303 146857 0.0 31.0 2.0 Eastasia 4 3 2 3 5 3 5
162 82247 1.0 30.0 2.0 Eastasia 2 1 2 3 4 5 4
276 111809 1.0 27.0 2.0 Eastasia 3 1 4 1 2 4 4
150 128252 0.0 48.0 3.0 Eastasia 3 2 3 5 3 3 5
258 124356 1.0 48.0 3.0 Eastasia 5 2 3 5 5 5 5
435 20416 0.0 48.0 3.0 Eastasia 5 2 3 3 2 3 3
286 121943 0.0 47.0 3.0 Eastasia 3 1 1 1 4 5 5
... ... ... ... ... ... ... ... ... ... ... ... ...
129 27969 0.0 44.0 4.0 Oceana 3 3 3 4 4 4 4
85 25324 0.0 43.0 4.0 Oceana 3 5 4 5 4 5 3
96 28150 0.0 42.0 4.0 Oceana 2 2 3 4 4 4 5
146 42669 1.0 40.0 4.0 Oceana 2 2 2 3 5 4 5
334 3073 0.0 40.0 4.0 Oceana 4 3 4 3 4 4 4
120 41762 1.0 35.0 4.0 Oceana 2 2 2 5 4 5 5
166 42641 1.0 35.0 4.0 Oceana 3 1 3 4 4 5 5
254 41344 0.0 34.0 4.0 Oceana 3 2 2 3 5 4 5
284 42168 0.0 33.0 4.0 Oceana 4 3 2 3 4 4 5
241 42743 0.0 30.0 4.0 Oceana 5 2 2 5 5 4 5
80 79340 0.0 29.0 4.0 Oceana 3 3 4 3 3 4 5
358 92330 0.0 29.0 4.0 Oceana 5 3 3 4 4 4 5
155 89971 1.0 28.0 4.0 Oceana 2 1 1 5 3 5 5
61 4949 1.0 26.0 4.0 Oceana 2 5 5 5 4 4 4
266 43344 0.0 26.0 4.0 Oceana 4 2 2 3 4 4 5
1 88178 1.0 25.0 4.0 Oceana 2 3 3 5 5 5 5
174 44669 0.0 25.0 4.0 Oceana 3 1 1 4 3 4 4
4 44532 1.0 50.0 5.0 Oceana 1 5 3 5 5 3 5
14 30513 0.0 49.0 5.0 Oceana 1 3 4 4 4 5 5
19 41371 1.0 41.0 5.0 Oceana 1 2 1 4 4 3 5
281 92195 1.0 31.0 5.0 Oceana 4 2 1 3 4 5 5
23 92556 0.0 30.0 5.0 Oceana 1 2 1 4 5 4 5
133 88111 0.0 30.0 5.0 Oceana 3 3 2 4 5 5 5
175 23450 1.0 30.0 5.0 Oceana 3 1 1 4 4 4 5
251 91858 1.0 29.0 5.0 Oceana 5 2 1 5 5 3 5
196 23272 0.0 37.0 6.0 Oceana 3 2 2 5 5 4 5
34 25441 0.0 42.0 7.0 Oceana 2 4 4 4 4 4 5
401 26614 0.0 33.0 7.0 Oceana 5 3 5 4 5 4 3
151 88856 0.0 30.0 7.0 Oceana 3 2 2 5 4 4 5
58 60125 0.0 41.0 NaN Oceana 1 5 2 4 5 5 5

236 rows × 12 columns

5 - Using a for loop, create a list of 3 Data Frames each containing only one of the 3 countries.


In [27]:
countries_df_list = []
for country in free_data['country'].unique():
    countries_df_list.append(free_data[free_data['country'] == country])

countries_df_list[2]


Out[27]:
id sex age educ country y v1 v2 v3 v4 v5 v6
44 76803 1.0 71.0 1.0 Oceana 2 4 4 4 5 4 5
48 60488 1.0 63.0 1.0 Oceana 4 5 4 5 5 5 5
136 61252 0.0 62.0 1.0 Oceana 5 5 5 5 5 5 5
145 603 1.0 61.0 1.0 Oceana 2 2 2 5 5 5 5
171 42755 1.0 60.0 1.0 Oceana 3 2 2 5 5 4 5
309 88267 0.0 58.0 1.0 Oceana 4 2 2 3 5 5 5
436 60079 0.0 57.0 1.0 Oceana 4 1 3 1 2 3 2
89 76784 1.0 55.0 1.0 Oceana 1 1 1 5 5 5 5
142 463 0.0 55.0 1.0 Oceana 5 5 5 5 5 5 4
20 60279 0.0 54.0 1.0 Oceana 1 5 4 5 5 4 5
140 90873 1.0 53.0 1.0 Oceana 2 2 2 3 4 4 4
154 78538 1.0 52.0 1.0 Oceana 3 1 1 4 4 3 5
252 59708 1.0 51.0 1.0 Oceana 4 2 2 4 3 3 5
163 4028 0.0 50.0 1.0 Oceana 3 1 1 4 5 5 5
312 78421 1.0 48.0 1.0 Oceana 4 2 1 3 3 4 5
336 78608 0.0 48.0 1.0 Oceana 5 2 2 3 4 4 5
167 78558 1.0 47.0 1.0 Oceana 3 2 1 4 5 4 5
13 57867 0.0 46.0 1.0 Oceana 1 5 5 5 5 5 5
67 73821 1.0 45.0 1.0 Oceana 2 4 2 5 3 4 5
132 61135 0.0 45.0 1.0 Oceana 4 4 4 4 4 4 4
408 92438 1.0 44.0 1.0 Oceana 4 3 5 2 4 4 1
269 58757 1.0 43.0 1.0 Oceana 4 3 2 3 4 3 5
440 56676 1.0 42.0 1.0 Oceana 5 4 1 3 3 2 4
441 58098 0.0 41.0 1.0 Oceana 5 3 4 3 3 4 4
118 78703 1.0 40.0 1.0 Oceana 3 4 2 5 5 4 3
245 78405 0.0 40.0 1.0 Oceana 4 3 3 4 4 4 5
170 78731 1.0 38.0 1.0 Oceana 3 2 2 4 5 4 5
93 60470 0.0 37.0 1.0 Oceana 2 2 2 3 5 5 5
116 56613 0.0 35.0 1.0 Oceana 5 5 4 4 5 5 5
111 61465 0.0 34.0 1.0 Oceana 2 2 2 2 4 4 5
... ... ... ... ... ... ... ... ... ... ... ... ...
201 3085 0.0 24.0 4.0 Oceana 3 1 2 3 5 3 5
149 4360 1.0 23.0 4.0 Oceana 3 2 1 4 5 4 5
247 78281 1.0 22.0 4.0 Oceana 3 1 1 3 5 5 5
60 30266 0.0 19.0 4.0 Oceana 1 3 4 5 5 5 5
291 28638 1.0 19.0 4.0 Oceana 4 3 3 3 4 4 5
45 23208 0.0 18.0 4.0 Oceana 2 3 1 3 3 3 3
271 42296 0.0 18.0 4.0 Oceana 4 2 2 3 4 3 5
192 23288 1.0 62.0 5.0 Oceana 3 2 2 4 3 3 5
256 23924 0.0 60.0 5.0 Oceana 3 1 1 3 4 4 5
72 30741 1.0 55.0 5.0 Oceana 4 5 5 5 5 5 5
4 44532 1.0 50.0 5.0 Oceana 1 5 3 5 5 3 5
14 30513 0.0 49.0 5.0 Oceana 1 3 4 4 4 5 5
19 41371 1.0 41.0 5.0 Oceana 1 2 1 4 4 3 5
281 92195 1.0 31.0 5.0 Oceana 4 2 1 3 4 5 5
23 92556 0.0 30.0 5.0 Oceana 1 2 1 4 5 4 5
133 88111 0.0 30.0 5.0 Oceana 3 3 2 4 5 5 5
175 23450 1.0 30.0 5.0 Oceana 3 1 1 4 4 4 5
251 91858 1.0 29.0 5.0 Oceana 5 2 1 5 5 3 5
63 90418 1.0 24.0 5.0 Oceana 2 3 2 5 5 3 5
43 31052 0.0 22.0 5.0 Oceana 1 4 3 4 5 4 5
189 90645 0.0 22.0 5.0 Oceana 3 2 1 4 4 4 5
238 92750 1.0 19.0 5.0 Oceana 4 3 3 4 4 5 5
115 24643 0.0 59.0 6.0 Oceana 3 3 2 4 4 3 5
196 23272 0.0 37.0 6.0 Oceana 3 2 2 5 5 4 5
62 30485 0.0 68.0 7.0 Oceana 2 3 4 3 4 5 5
34 25441 0.0 42.0 7.0 Oceana 2 4 4 4 4 4 5
401 26614 0.0 33.0 7.0 Oceana 5 3 5 4 5 4 3
151 88856 0.0 30.0 7.0 Oceana 3 2 2 5 4 4 5
58 60125 0.0 41.0 NaN Oceana 1 5 2 4 5 5 5
313 60617 NaN 1.0 NaN Oceana 4 3 4 3 4 4 4

150 rows × 12 columns

6 - Create a list of age categories, labeled 0, 1, and 2, for each row corresponding to the three groups made in part (4). Attach this list to the free_data dataframe as a column named age_cat.


In [28]:
# create the list of categories
age_cat = []
for i in range(3):
    # concatenate a list of i's of the same length of the i-th dataframe created in part 4
    age_cat  = age_cat + [i] * ages_df_list[i].shape[0]

# sort values by age
free_data.sort_values('age', ascending=True, inplace=True)
# add the list as a new column
free_data.loc[pd.notnull(free_data['age']), 'age_cat'] = age_cat

free_data.head()


Out[28]:
id sex age educ country y v1 v2 v3 v4 v5 v6 age_cat
313 60617 NaN 1.0 NaN Oceana 4 3 4 3 4 4 4 0.0
316 60716 1.0 17.0 1.0 Oceana 5 2 1 4 4 5 5 0.0
158 76998 0.0 18.0 1.0 Oceana 3 2 2 5 5 5 5 0.0
232 58633 1.0 18.0 1.0 Oceana 5 4 3 5 3 3 5 0.0
449 116952 1.0 18.0 3.0 Eurasia 5 4 4 4 4 4 4 0.0

7 - Print the mean for all columns for each age_cat using groupby.


In [29]:
free_data.groupby('age_cat').mean()


Out[29]:
id sex age educ y v1 v2 v3 v4 v5 v6
age_cat
0.0 97924.298851 0.534884 20.505747 3.317647 3.643678 2.517241 2.597701 3.655172 3.954023 3.781609 4.471264
1.0 92976.567797 0.559322 36.686441 3.034335 3.605932 2.631356 2.474576 3.580508 4.055085 3.830508 4.266949
2.0 81100.097561 0.569106 62.845528 2.552846 3.243902 2.780488 2.593496 3.821138 4.219512 3.975610 4.520325

8 - Print the mean education for each age_cat using groupby.


In [30]:
free_data.groupby('age_cat')['educ'].mean()


Out[30]:
age_cat
0.0    3.317647
1.0    3.034335
2.0    2.552846
Name: educ, dtype: float64

9 - Print summary statistics for each column for those with an education greater than or equal to 5, grouped by age_cat.


In [31]:
free_data[free_data['educ'] >= 5].groupby('age_cat').describe()


Out[31]:
age educ id sex v1 v2 v3 v4 v5 v6 y
age_cat
0.0 count 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000 20.000000
mean 21.750000 5.200000 97745.450000 0.500000 2.800000 3.100000 3.750000 4.200000 4.000000 4.650000 3.150000
std 1.585294 0.410391 35834.636660 0.512989 1.151658 1.483240 0.786398 0.695852 0.858395 0.587143 1.386969
min 19.000000 5.000000 31052.000000 0.000000 1.000000 1.000000 3.000000 3.000000 3.000000 3.000000 1.000000
25% 20.000000 5.000000 90588.250000 0.000000 2.000000 2.000000 3.000000 4.000000 3.000000 4.000000 2.000000
50% 22.000000 5.000000 108682.000000 0.500000 3.000000 3.000000 4.000000 4.000000 4.000000 5.000000 3.000000
75% 23.000000 5.000000 116392.250000 1.000000 4.000000 4.250000 4.000000 5.000000 5.000000 5.000000 4.000000
max 24.000000 6.000000 171662.000000 1.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
1.0 count 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000 43.000000
mean 36.767442 5.488372 89611.930233 0.488372 2.767442 2.395349 3.581395 4.093023 3.720930 4.162791 3.581395
std 7.057037 0.702787 39754.316402 0.505781 1.087531 1.177963 1.096153 0.921023 1.031079 1.111203 1.257976
min 25.000000 5.000000 11633.000000 0.000000 1.000000 1.000000 1.000000 2.000000 1.000000 2.000000 1.000000
25% 30.000000 5.000000 50499.500000 0.000000 2.000000 2.000000 3.000000 4.000000 3.000000 3.000000 3.000000
50% 37.000000 5.000000 108928.000000 0.000000 3.000000 2.000000 4.000000 4.000000 4.000000 5.000000 4.000000
75% 42.000000 6.000000 116362.500000 1.000000 3.500000 3.000000 4.000000 5.000000 4.000000 5.000000 5.000000
max 50.000000 7.000000 162131.000000 1.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
2.0 count 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000
mean 58.555556 5.555556 83721.611111 0.666667 2.888889 2.833333 4.111111 4.166667 4.222222 4.611111 3.222222
std 5.802862 0.704792 54036.236832 0.485071 1.182663 1.465285 0.676400 1.098127 0.808452 0.849837 1.060275
min 51.000000 5.000000 23288.000000 0.000000 1.000000 1.000000 3.000000 1.000000 3.000000 2.000000 1.000000
25% 54.250000 5.000000 34431.750000 0.000000 2.000000 2.000000 4.000000 4.000000 4.000000 5.000000 3.000000
50% 58.000000 5.000000 51936.500000 1.000000 3.000000 2.000000 4.000000 4.500000 4.000000 5.000000 3.000000
75% 63.500000 6.000000 132317.250000 1.000000 3.750000 4.000000 4.750000 5.000000 5.000000 5.000000 4.000000
max 68.000000 7.000000 164869.000000 1.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

10 - Which of the vignettes has the largest mean score for each education level? What about the median?

In this survey, six vignettes were shown to different people, who were asked to rate the degree of freedom of the situation depicted.

These are the possible votes:

  1. Completely free
  2. Very free
  3. Moderately free
  4. Slightly free
  5. Not free at all

and these are the vignettes:

  1. Kay does not like many of the government's policies. She frequently publishes her opinion in newspapers, criticizing decisions by officials and calling for change. She sees little reason these actions could lead to government reprisal.

  2. Michael disagrees with many of the government's policies. Though he knows criticism is frowned upon, he doesn't believe the government would punish someone for expressing critical views. He makes his opinion known on most issues without regard to who is listening.

  3. Bob has political views at odds with the government. He has heard of people occasionally being arrested for speaking out against the government, and government leaders sometimes make political speeches condemning those who criticize. He sometimes writes letters to newspapers about politics, but he is careful not to use his real name.

  4. Connie does not like the government's stance on many issues. She has a friend who was arrested for being too openly critical of governmental leaders, and so she avoids voicing her opinions in public places.

  5. Vito disagrees with many of the government's policies, and is very careful about whom he says this to, reserving his real opinions for family and close friends only. He knows several men who have been taken away by government officials for saying negative things in public.

  6. Sonny lives in fear of being harassed for his political views. Everyone he knows who has spoken out against the government has been arrested or taken away. He never says a word about anything the government does, not even when he is at home alone with his family.

It seems (reasonably enough) that the sixth vignette has the highest mean and median score among the situations proposed, with only two exceptions:


In [34]:
print(free_data.groupby('educ')[['v1', 'v2', 'v3', 'v4', 'v5', 'v6']].mean().idxmax(axis=1))
print(free_data.groupby('educ')[['v1', 'v2', 'v3', 'v4', 'v5', 'v6']].median().idxmax(axis=1))


educ
1.0    v6
2.0    v6
3.0    v6
4.0    v6
5.0    v6
6.0    v6
7.0    v4
dtype: object
educ
1.0    v6
2.0    v6
3.0    v6
4.0    v6
5.0    v6
6.0    v4
7.0    v6
dtype: object

12 - Which country would you say has the most freedom of speech? Be sure to justify your answer quantitatively.

I would say Oceana has the most freedom of speech, because it has the highest mean scores for the vignettes, especially vignettes 4 and 5.

It should also be noted that vignette 5 has a surprisingly low mean in the other two countries, especially when compared to vignettes 3 and 4.


In [33]:
free_data.groupby('country')[['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'y']].describe()


Out[33]:
v1 v2 v3 v4 v5 v6 y
country
Eastasia count 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000
mean 2.440000 2.386667 3.633333 4.020000 3.806667 4.466667 3.660000
std 1.155553 1.180301 1.032471 1.058491 0.953247 0.945968 1.163367
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 2.000000 1.000000 3.000000 3.000000 3.000000 4.000000 3.000000
50% 2.000000 2.000000 4.000000 4.000000 4.000000 5.000000 4.000000
75% 3.000000 3.000000 4.000000 5.000000 5.000000 5.000000 5.000000
max 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
Eurasia count 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000
mean 2.760000 2.706667 3.460000 3.933333 3.633333 4.060000 4.013333
std 1.034181 1.353825 1.014195 0.980800 1.137622 1.088253 1.152684
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 2.000000 2.000000 3.000000 3.000000 3.000000 3.000000 3.000000
50% 3.000000 3.000000 4.000000 4.000000 4.000000 4.000000 4.000000
75% 3.000000 4.000000 4.000000 5.000000 5.000000 5.000000 5.000000
max 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
Oceana count 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000 150.000000
mean 2.746667 2.513333 3.900000 4.300000 4.160000 4.613333 2.886667
std 1.238025 1.251773 0.960635 0.775030 0.760431 0.841752 1.303257
min 1.000000 1.000000 1.000000 2.000000 2.000000 1.000000 1.000000
25% 2.000000 2.000000 3.000000 4.000000 4.000000 5.000000 2.000000
50% 2.500000 2.000000 4.000000 4.000000 4.000000 5.000000 3.000000
75% 3.000000 3.000000 5.000000 5.000000 5.000000 5.000000 4.000000
max 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

13 - Is there a difference of opinion between men and women regarding freedom of speech? If any, does this difference manifest itself across the different countries? Across education levels? Be sure to justify your answers quantitatively.

We can say that women think of themselves as slightly more free than men:


In [34]:
print(free_data.groupby('sex')['y'].mean())  # sex 0 (women) has the lower mean y, i.e. reports feeling slightly more free


sex
0.0    3.417085
1.0    3.600000
Name: y, dtype: float64

and that this difference is larger in Eastasia and almost nonexistent in Eurasia:


In [35]:
print(free_data.groupby(['sex', 'country'])['y'].mean().unstack())


country  Eastasia  Eurasia    Oceana
sex                                 
0.0       3.50000  4.00000  2.840000
1.0       3.76087  4.02381  2.918919

The difference is also present across education levels: curiously, uneducated women and women with a secondary-school education or higher feel freer than men, while women with an intermediate education feel less free than men. We can also observe a tendency, leaving aside uneducated people, for women to feel freer relative to men the more educated they are.

Lastly, it is rather remarkable that the most educated men report not feeling free at all, but this may well be due to the small sample size.


In [36]:
print(free_data.groupby(['sex', 'educ'])['y'].mean().unstack())


educ       1.0       2.0       3.0       4.0       5.0       6.0  7.0
sex                                                                  
0.0   3.600000  3.708333  3.480000  3.250000  3.333333  2.777778  3.4
1.0   3.974359  3.312500  3.446429  3.282051  3.551724  3.333333  5.0

Merging, Indexes, and Apply()

Much of the power of Data Science comes from the ability to join together datasets from very different sources. One could be interested, for example, in whether there is a relationship between housing prices and the prevalence of infectious disease in a given ZIP code. This task is often referred to as a merge or join.

Every Pandas Data Frame has an index. Indices in Pandas are a bit of a complex topic, but for the time being consider them to be a unique identifier for each row in a Data Frame. When performing joins and manipulating Data Frames, it is important to remember that your task may require the creation or change of the Data Frame's index. For more extensive reading on this topic, consult the Pandas Documentation.
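As a small illustration (a sketch, not one of the exercises), any column can be promoted to the index with set_index() and demoted back to an ordinary column with reset_index():

indexed = free_data.set_index('id')  # rows are now labelled by their id
print(indexed.index[:5])             # the index is held separately from the columns
restored = indexed.reset_index()     # move id back to an ordinary column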

And lastly, if you are coming from a programming background like C/C++ or Java, you are likely very accustomed to operating on arrays and lists using for loops. Often this is how you will want to work with Data Frames in Python, but Pandas also provides functionality for functional-style programming by utilizing the Apply() function. This is similar to the apply family of functions in R and the Map() and related functions in Lisp. Making use of Apply() in Python can make your code more concise, readable, and faster when performing operations on an entire Data Frame.
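For instance (a sketch with an arbitrary statistic, not one of the exercises), the same per-column computation can be written as an explicit loop or as a single apply() call:

# loop version: one column at a time
ranges = {}
for col in ['v1', 'v2', 'v3', 'v4', 'v5', 'v6']:
    ranges[col] = free_data[col].max() - free_data[col].min()

# apply() version: the lambda is applied to every column at once
ranges = free_data[['v1', 'v2', 'v3', 'v4', 'v5', 'v6']].apply(lambda c: c.max() - c.min())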

Using only Pandas, perform the following exercises.

1 - Using the free1.csv downloaded above, import it as a Data Frame named free_data and rename the first column to id.


In [37]:
free_data = pd.read_csv('free1.csv')
col = list(free_data.columns)
col[0] = 'id'
free_data.columns = col

free_data.head()


Out[37]:
id sex age educ country y v1 v2 v3 v4 v5 v6
0 109276 0.0 20.0 4.0 Eurasia 1 4 3 3 5 3 4
1 88178 1.0 25.0 4.0 Oceana 2 3 3 5 5 5 5
2 111063 1.0 56.0 2.0 Eastasia 2 3 2 4 5 5 4
3 161488 0.0 65.0 6.0 Eastasia 2 3 3 5 5 5 5
4 44532 1.0 50.0 5.0 Oceana 1 5 3 5 5 3 5

2 - Create a dataframe named free_sub, consisting of the id, country, and y columns from free_data.


In [38]:
free_sub = free_data[['id', 'country', 'y']]
free_sub.head()


Out[38]:
id country y
0 109276 Eurasia 1
1 88178 Oceana 2
2 111063 Eastasia 2
3 161488 Eastasia 2
4 44532 Oceana 1

3 - Create a new Data Frame called ed_level, consisting of the id and three categories of education levels, labeled high, med, and low, for ranges of your choosing. Do this using a for loop.


In [39]:
# create a dataframe with one column
ed_level = pd.DataFrame(free_data['id'])
# add a blank column
ed_level['educ_cat'] = None
# dictionary of labels
educ_cat = {1:'low', 2:'low', 3:'med', 4:'med', 5:'high', 6:'high', 7:'high'}

# for every row in the dataframe if the corresponding educ column in the free data is not missing
# use the dictionary to set the corresponding label
for i in range(0, len(ed_level)):
    if pd.notnull(free_data.loc[i, 'educ']):
        ed_level.loc[i, 'educ_cat'] = educ_cat[free_data.loc[i, 'educ']]

ed_level.head(5)


Out[39]:
id educ_cat
0 109276 med
1 88178 med
2 111063 low
3 161488 high
4 44532 high

4 - Merge free_sub and ed_level together. Which column should the merge be performed on? Do this using both the concat() and merge() functions.

The merge should be performed on the id column, which merge() uses implicitly because it is the only column common to both Data Frames.

The concat function stacks the two Data Frames on top of each other, duplicating the rows (and indices) and filling with NaN the columns missing from each of the original datasets.

The merge function instead performs an inner join, returning the rows of the first dataset enriched with the matching values from the second.


In [40]:
# Solution note: in the reference solution, id is set as the index of both Data Frames before concatenating
pd.concat([free_sub, ed_level])


Out[40]:
country educ_cat id y
0 Eurasia NaN 109276 1.0
1 Oceana NaN 88178 2.0
2 Eastasia NaN 111063 2.0
3 Eastasia NaN 161488 2.0
4 Oceana NaN 44532 1.0
5 Eastasia NaN 95503 1.0
6 Oceana NaN 26276 2.0
7 Oceana NaN 26299 2.0
8 Oceana NaN 89206 2.0
9 Oceana NaN 24342 1.0
10 Eurasia NaN 109960 1.0
11 Eurasia NaN 117509 2.0
12 Eastasia NaN 11633 3.0
13 Oceana NaN 57867 1.0
14 Oceana NaN 30513 1.0
15 Oceana NaN 91576 2.0
16 Oceana NaN 42161 1.0
17 Oceana NaN 74317 1.0
18 Oceana NaN 661 2.0
19 Oceana NaN 41371 1.0
20 Oceana NaN 60279 1.0
21 Eurasia NaN 110851 2.0
22 Oceana NaN 22764 1.0
23 Oceana NaN 92556 1.0
24 Eastasia NaN 38858 4.0
25 Eastasia NaN 71819 4.0
26 Eurasia NaN 108399 2.0
27 Oceana NaN 27707 4.0
28 Eastasia NaN 166386 1.0
29 Eastasia NaN 8524 4.0
... ... ... ... ...
420 NaN low 116011 NaN
421 NaN low 107534 NaN
422 NaN med 111669 NaN
423 NaN low 118728 NaN
424 NaN med 121362 NaN
425 NaN med 108776 NaN
426 NaN high 109604 NaN
427 NaN low 120098 NaN
428 NaN low 116405 NaN
429 NaN high 141024 NaN
430 NaN low 108443 NaN
431 NaN med 110293 NaN
432 NaN med 117227 NaN
433 NaN med 117445 NaN
434 NaN med 110560 NaN
435 NaN med 20416 NaN
436 NaN low 60079 NaN
437 NaN high 52621 NaN
438 NaN med 108533 NaN
439 NaN med 109068 NaN
440 NaN low 56676 NaN
441 NaN low 58098 NaN
442 NaN high 117252 NaN
443 NaN med 110212 NaN
444 NaN med 168326 NaN
445 NaN low 95744 NaN
446 NaN med 109491 NaN
447 NaN low 65788 NaN
448 NaN med 147766 NaN
449 NaN med 116952 NaN

900 rows × 4 columns


In [41]:
pd.merge(free_sub, ed_level)


Out[41]:
id country y educ_cat
0 109276 Eurasia 1 med
1 88178 Oceana 2 med
2 111063 Eastasia 2 low
3 161488 Eastasia 2 high
4 44532 Oceana 1 high
5 95503 Eastasia 1 high
6 26276 Oceana 2 low
7 26299 Oceana 2 med
8 89206 Oceana 2 low
9 24342 Oceana 1 med
10 109960 Eurasia 1 med
11 117509 Eurasia 2 high
12 11633 Eastasia 3 high
13 57867 Oceana 1 low
14 30513 Oceana 1 high
15 91576 Oceana 2 med
16 42161 Oceana 1 med
17 74317 Oceana 1 med
18 661 Oceana 2 med
19 41371 Oceana 1 high
20 60279 Oceana 1 low
21 110851 Eurasia 2 high
22 22764 Oceana 1 med
23 92556 Oceana 1 high
24 38858 Eastasia 4 high
25 71819 Eastasia 4 low
26 108399 Eurasia 2 high
27 27707 Oceana 4 med
28 166386 Eastasia 1 med
29 8524 Eastasia 4 low
... ... ... ... ...
420 116011 Eurasia 5 low
421 107534 Eastasia 4 low
422 111669 Eastasia 5 med
423 118728 Eurasia 5 low
424 121362 Eastasia 5 med
425 108776 Eurasia 5 med
426 109604 Eurasia 4 high
427 120098 Eurasia 4 low
428 116405 Eurasia 4 low
429 141024 Eastasia 5 high
430 108443 Eurasia 5 low
431 110293 Eurasia 5 med
432 117227 Eurasia 5 med
433 117445 Eurasia 5 med
434 110560 Eurasia 4 med
435 20416 Eastasia 5 med
436 60079 Oceana 4 low
437 52621 Eastasia 4 high
438 108533 Eurasia 5 med
439 109068 Eurasia 4 med
440 56676 Oceana 5 low
441 58098 Oceana 5 low
442 117252 Eurasia 5 high
443 110212 Eurasia 4 med
444 168326 Eastasia 5 med
445 95744 Eastasia 3 low
446 109491 Eurasia 3 med
447 65788 Eastasia 5 low
448 147766 Eastasia 4 med
449 116952 Eurasia 5 med

450 rows × 4 columns
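To make the join key explicit rather than relying on the single shared column name, merge() also accepts the on (and how) arguments; this call is equivalent to the one above:

pd.merge(free_sub, ed_level, on='id', how='inner')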

5 - Use the append() function to join together free_sub and ed_level. Are the results the same as in part (4)? If not, how could you reproduce the result of append() by using concat() or merge()?


In [42]:
free_sub.append(ed_level)


Out[42]:
country educ_cat id y
0 Eurasia NaN 109276 1.0
1 Oceana NaN 88178 2.0
2 Eastasia NaN 111063 2.0
3 Eastasia NaN 161488 2.0
4 Oceana NaN 44532 1.0
5 Eastasia NaN 95503 1.0
6 Oceana NaN 26276 2.0
7 Oceana NaN 26299 2.0
8 Oceana NaN 89206 2.0
9 Oceana NaN 24342 1.0
10 Eurasia NaN 109960 1.0
11 Eurasia NaN 117509 2.0
12 Eastasia NaN 11633 3.0
13 Oceana NaN 57867 1.0
14 Oceana NaN 30513 1.0
15 Oceana NaN 91576 2.0
16 Oceana NaN 42161 1.0
17 Oceana NaN 74317 1.0
18 Oceana NaN 661 2.0
19 Oceana NaN 41371 1.0
20 Oceana NaN 60279 1.0
21 Eurasia NaN 110851 2.0
22 Oceana NaN 22764 1.0
23 Oceana NaN 92556 1.0
24 Eastasia NaN 38858 4.0
25 Eastasia NaN 71819 4.0
26 Eurasia NaN 108399 2.0
27 Oceana NaN 27707 4.0
28 Eastasia NaN 166386 1.0
29 Eastasia NaN 8524 4.0
... ... ... ... ...
420 NaN low 116011 NaN
421 NaN low 107534 NaN
422 NaN med 111669 NaN
423 NaN low 118728 NaN
424 NaN med 121362 NaN
425 NaN med 108776 NaN
426 NaN high 109604 NaN
427 NaN low 120098 NaN
428 NaN low 116405 NaN
429 NaN high 141024 NaN
430 NaN low 108443 NaN
431 NaN med 110293 NaN
432 NaN med 117227 NaN
433 NaN med 117445 NaN
434 NaN med 110560 NaN
435 NaN med 20416 NaN
436 NaN low 60079 NaN
437 NaN high 52621 NaN
438 NaN med 108533 NaN
439 NaN med 109068 NaN
440 NaN low 56676 NaN
441 NaN low 58098 NaN
442 NaN high 117252 NaN
443 NaN med 110212 NaN
444 NaN med 168326 NaN
445 NaN low 95744 NaN
446 NaN med 109491 NaN
447 NaN low 65788 NaN
448 NaN med 147766 NaN
449 NaN med 116952 NaN

900 rows × 4 columns

The append function produces the same result as concat: it returns a Data Frame with duplicated rows and indices, setting the missing values to NaN. To reproduce the append() result it is therefore enough to call pd.concat([free_sub, ed_level]) as in part (4), as shown below.
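A minimal check of this equivalence (assuming the pandas version used in this notebook, where DataFrame.append() is still available):

# should print True: the concat() call from part (4) reproduces the append() result exactly
print(free_sub.append(ed_level).equals(pd.concat([free_sub, ed_level])))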

6 - Use numpy to generate two lists of 100 random floats labeled y1 and y2. Now create a sequence of integers over the range 0-100 labeled x1 and a sequence of integers over the range 50-150 labeled x2. Create two DataFrames, dat1 and dat2, consisting of x1 and y1, and x2 and y2 respectively, but having column labels x, y1 and x, y2. Use merge() to join these two Data Frames together on x, using both an inner and an outer join. What is the difference between the two joins?


In [39]:
np.random.seed(0)
# random floats
y1 = np.random.ranf(100)
y2 = np.random.ranf(100)
# sequences of integers
x1 = np.arange(0, 100)
x2 = np.arange(50, 150)
# create the dataframes
dat1 = pd.DataFrame({'x':x1, 'y1':y1})
dat2 = pd.DataFrame({'x':x2, 'y2':y2})

The inner join keeps only the rows of the first dataset that have a counterpart in the second, duplicating them if there is more than one match. In general, this leads to fewer rows being returned:


In [40]:
pd.merge(dat1, dat2)


Out[40]:
x y1 y2
0 50 0.570197 0.677817
1 51 0.438602 0.270008
2 52 0.988374 0.735194
3 53 0.102045 0.962189
4 54 0.208877 0.248753
5 55 0.161310 0.576157
6 56 0.653108 0.592042
7 57 0.253292 0.572252
8 58 0.466311 0.223082
9 59 0.244426 0.952749
10 60 0.158970 0.447125
11 61 0.110375 0.846409
12 62 0.656330 0.699479
13 63 0.138183 0.297437
14 64 0.196582 0.813798
15 65 0.368725 0.396506
16 66 0.820993 0.881103
17 67 0.097101 0.581273
18 68 0.837945 0.881735
19 69 0.096098 0.692532
20 70 0.976459 0.725254
21 71 0.468651 0.501324
22 72 0.976761 0.956084
23 73 0.604846 0.643990
24 74 0.739264 0.423855
25 75 0.039188 0.606393
26 76 0.282807 0.019193
27 77 0.120197 0.301575
28 78 0.296140 0.660174
29 79 0.118728 0.290078
30 80 0.317983 0.618015
31 81 0.414263 0.428769
32 82 0.064147 0.135474
33 83 0.692472 0.298282
34 84 0.566601 0.569965
35 85 0.265389 0.590873
36 86 0.523248 0.574325
37 87 0.093941 0.653201
38 88 0.575946 0.652103
39 89 0.929296 0.431418
40 90 0.318569 0.896547
41 91 0.667410 0.367562
42 92 0.131798 0.435865
43 93 0.716327 0.891923
44 94 0.289406 0.806194
45 95 0.183191 0.703889
46 96 0.586513 0.100227
47 97 0.020108 0.919483
48 98 0.828940 0.714241
49 99 0.004695 0.998847

The outer join instead returns all the rows from both datasets, filling in the matches where they exist (again duplicating rows if there are multiple matches):


In [41]:
pd.merge(dat1, dat2, how='outer')


Out[41]:
x y1 y2
0 0 0.548814 NaN
1 1 0.715189 NaN
2 2 0.602763 NaN
3 3 0.544883 NaN
4 4 0.423655 NaN
5 5 0.645894 NaN
6 6 0.437587 NaN
7 7 0.891773 NaN
8 8 0.963663 NaN
9 9 0.383442 NaN
10 10 0.791725 NaN
11 11 0.528895 NaN
12 12 0.568045 NaN
13 13 0.925597 NaN
14 14 0.071036 NaN
15 15 0.087129 NaN
16 16 0.020218 NaN
17 17 0.832620 NaN
18 18 0.778157 NaN
19 19 0.870012 NaN
20 20 0.978618 NaN
21 21 0.799159 NaN
22 22 0.461479 NaN
23 23 0.780529 NaN
24 24 0.118274 NaN
25 25 0.639921 NaN
26 26 0.143353 NaN
27 27 0.944669 NaN
28 28 0.521848 NaN
29 29 0.414662 NaN
... ... ... ...
120 120 NaN 0.521037
121 121 NaN 0.054338
122 122 NaN 0.199997
123 123 NaN 0.018522
124 124 NaN 0.793698
125 125 NaN 0.223925
126 126 NaN 0.345352
127 127 NaN 0.928081
128 128 NaN 0.704414
129 129 NaN 0.031839
130 130 NaN 0.164694
131 131 NaN 0.621478
132 132 NaN 0.577229
133 133 NaN 0.237893
134 134 NaN 0.934214
135 135 NaN 0.613966
136 136 NaN 0.535633
137 137 NaN 0.589910
138 138 NaN 0.730122
139 139 NaN 0.311945
140 140 NaN 0.398221
141 141 NaN 0.209844
142 142 NaN 0.186193
143 143 NaN 0.944372
144 144 NaN 0.739551
145 145 NaN 0.490459
146 146 NaN 0.227415
147 147 NaN 0.254356
148 148 NaN 0.058029
149 149 NaN 0.434417

150 rows × 3 columns

7 - Create a Data Frame, called scores, consisting of only the y and v_ columns from free_data.


In [46]:
scores = free_data[['y', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6']]

8 - Using a for loop(s), compute the sum and mean for each column in scores.


In [47]:
# initialize an array of zeros to store the results
sums = np.zeros(7)
# calculate the sum with a for loop
for i in range(len(scores)):
    sums += scores.iloc[i]
# calculate the mean outside the loop, dividing by the number of rows
means = sums / len(scores)

print(sums)
print(means)


[ 1584.  1192.  1141.  1649.  1838.  1740.  1971.]
[ 3.52        2.64888889  2.53555556  3.66444444  4.08444444  3.86666667
  4.38      ]

9 - Using the apply() function, compute the sum and mean for each column in scores.


In [48]:
scores.apply(lambda x: [np.sum(x), np.mean(x)])


Out[48]:
y                   [1584, 3.52]
v1     [1192, 2.648888888888889]
v2    [1141, 2.5355555555555553]
v3    [1649, 3.6644444444444444]
v4     [1838, 4.084444444444444]
v5    [1740, 3.8666666666666667]
v6                  [1971, 4.38]
dtype: object

10 - Using the apply() function, label each column in scores as either high, med, or low by first computing the mean for each column and assigning the categories at values of your choosing. Do this by writing a single function you can call with apply().


In [49]:
scores.apply(lambda x: 'low' if np.mean(x) <= 3 else 'med' if np.mean(x) <= 4 else 'high')


Out[49]:
y      med
v1     low
v2     low
v3     med
v4    high
v5     med
v6    high
dtype: object

Time Series

In many situations you may not know the relationship between two variables, but you do know that there ought to be one. Take, for example, the daily price of beef and grain. It is reasonable to assume that there exists some, perhaps even causal, relationship between the two, but due to the complexity of the phenomenon and the vast number of underlying latent variables involved (fuel price, politics, famine, etc.), you likely have little hope of uncovering such a relationship in a reasonable amount of time. However, you do know that these two variables are related in time and may exhibit some pattern that repeats itself over time. Identifying these types of patterns is called Time Series Analysis, and sequencing your data such that each data point is represented as a unique point in time is called a Time Series. The canonical example of a Time Series is, of course, stock market data, which is what we will be using for this exercise.
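As a minimal sketch of what a Time Series looks like in pandas (made-up values, not one of the exercises):

# ten consecutive business days as a DatetimeIndex
days = pd.date_range('2016-01-01', periods=10, freq='B')
# a Series indexed by time: each value corresponds to a unique point in time
ts = pd.Series(np.random.randn(10), index=days)
# time-aware operations become available, e.g. weekly resampling
print(ts.resample('W').mean())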

Do the following exercises.

1 - Create a start and end datetime object, starting at a date of your choosing and ending today.


In [50]:
from datetime import datetime

start = datetime(2016, 1, 1)
end = datetime.today()

print(start, end)


2016-01-01 00:00:00 2017-07-26 10:11:27.586984

2 - For three stocks of your choosing, put their symbols into a list and use pandas to retrieve their data from google for the time frame you created in part (1). Print the results.


In [51]:
import pandas_datareader.data as web

It turns out that the DataReader method returns a Panel object when called with a list of stocks, so we can transform it into a DataFrame and unstack it to obtain the desired result:


In [52]:
stock_code = ['GOOG', 'FB', 'MSFT']

stock_panel = web.DataReader(stock_code, start=start, end=end, data_source='google')

stock_prices = stock_panel.to_frame().unstack(level=1)

stock_prices


Out[52]:
Open High Low Close Volume
minor FB GOOG MSFT FB GOOG MSFT FB GOOG MSFT FB GOOG MSFT FB GOOG MSFT
Date
2016-01-04 101.95 743.00 54.32 102.24 744.06 54.80 99.75 731.26 53.39 102.22 741.84 54.80 37717312.0 3258199.0 52843210.0
2016-01-05 102.89 746.45 54.93 103.71 752.00 55.39 101.66 738.64 54.54 102.73 742.58 55.05 23258238.0 1950691.0 34079674.0
2016-01-06 101.13 730.00 54.32 103.77 747.18 54.40 100.90 728.92 53.64 102.97 743.62 54.05 25096183.0 1947034.0 39518863.0
2016-01-07 100.50 730.31 52.70 101.43 738.50 53.48 97.30 719.06 52.07 97.92 726.39 52.17 45172906.0 2963741.0 56564852.0
2016-01-08 99.88 731.45 52.37 100.50 733.23 53.28 97.03 713.00 52.15 97.33 714.47 52.33 35402298.0 2450857.0 48753969.0
2016-01-11 97.91 716.61 52.51 98.60 718.86 52.85 95.39 703.54 51.46 97.51 716.03 52.30 29932385.0 2090621.0 36943756.0
2016-01-12 99.00 721.68 52.76 99.96 728.75 53.10 97.55 717.32 52.06 99.37 726.07 52.78 28395390.0 2024509.0 36095539.0
2016-01-13 100.58 730.85 53.80 100.58 734.74 54.07 95.21 698.61 51.30 95.44 700.56 51.64 33091576.0 2468295.0 66119018.0
2016-01-14 95.85 705.38 52.00 98.87 721.92 53.42 92.45 689.10 51.57 98.37 714.72 53.11 48488509.0 2211853.0 51429807.0
2016-01-15 93.98 692.29 51.31 96.38 706.74 51.97 93.54 685.37 50.34 94.97 694.45 50.99 45935550.0 3592449.0 70739137.0
2016-01-19 96.53 703.30 51.48 97.42 709.98 51.68 93.92 693.41 50.06 95.26 701.79 50.56 30620565.0 2258479.0 43128276.0
2016-01-20 92.83 688.61 49.98 95.00 706.85 51.38 89.37 673.26 49.10 94.35 698.45 50.79 58982367.0 3439386.0 63162940.0
2016-01-21 94.91 702.18 51.00 95.99 719.19 51.58 92.62 694.46 50.30 94.16 706.59 50.48 30468830.0 2410263.0 40169367.0
2016-01-22 96.41 723.60 51.41 98.07 728.13 52.33 95.49 720.12 51.26 97.94 725.25 52.29 30331442.0 2006528.0 36878402.0
2016-01-25 98.72 723.58 51.94 99.48 729.68 52.65 96.93 710.01 51.65 97.01 711.67 51.79 32319141.0 1704641.0 34497323.0
2016-01-26 97.76 713.85 51.79 97.88 718.28 52.44 95.66 706.48 51.55 97.34 713.04 52.17 26637412.0 1324300.0 28699484.0
2016-01-27 97.79 713.67 52.01 97.85 718.24 52.20 94.23 694.39 51.02 94.45 699.99 51.22 48311872.0 2139970.0 36266102.0
2016-01-28 107.20 722.22 51.86 110.34 733.69 52.21 104.81 712.35 51.25 109.11 730.96 52.06 107084072.0 2658016.0 58489190.0
2016-01-29 108.99 731.53 54.73 112.84 744.99 55.09 108.84 726.80 54.00 112.21 742.95 55.09 62357459.0 3394935.0 83137039.0
2016-02-01 112.27 750.46 54.88 115.72 757.86 55.09 112.01 743.27 54.50 115.09 752.00 54.71 45840867.0 4801816.0 43587105.0
2016-02-02 114.80 784.50 54.17 117.59 789.87 54.26 113.20 764.65 52.65 114.61 764.65 53.00 59644925.0 6332431.0 54453559.0
2016-02-03 115.27 770.22 53.25 115.34 774.50 53.39 109.75 720.50 51.26 112.69 726.95 52.16 56848084.0 6162333.0 57347519.0
2016-02-04 111.80 722.81 52.10 111.94 727.00 52.81 109.25 701.86 51.37 110.49 708.01 52.00 38648485.0 5145855.0 46803371.0
2016-02-05 109.51 703.87 51.94 109.58 703.99 52.00 103.18 680.15 49.56 104.07 683.57 50.16 76522716.0 5069985.0 60651115.0
2016-02-08 100.41 667.85 49.55 102.68 684.03 49.57 97.46 663.06 48.19 99.75 682.74 49.41 71016537.0 4212541.0 57006105.0
2016-02-09 97.14 672.32 49.02 102.40 699.90 50.24 96.82 668.77 48.67 99.54 678.11 49.28 62580050.0 3604335.0 45822192.0
2016-02-10 101.55 686.86 49.89 103.25 701.31 50.39 100.24 682.13 49.52 101.00 684.12 49.71 44943948.0 2627379.0 37899918.0
2016-02-11 99.60 675.00 48.68 105.11 689.35 50.11 98.88 668.87 48.51 101.91 683.11 49.69 43493102.0 3007223.0 48448154.0
2016-02-12 103.74 690.26 50.25 104.24 693.75 50.68 101.09 678.60 49.75 102.01 682.40 50.50 36038104.0 2129831.0 33292336.0
2016-02-16 103.80 692.98 50.90 103.93 698.00 51.09 100.24 685.05 50.13 101.61 691.00 51.09 45438324.0 2497024.0 35937137.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2017-06-13 150.15 951.91 70.02 151.18 959.98 70.82 148.90 944.09 69.96 150.68 953.40 70.65 20483350.0 2013337.0 25258614.0
2017-06-14 151.26 959.92 70.91 152.40 961.15 71.10 149.05 942.25 69.43 150.25 950.76 70.27 20808759.0 1489715.0 25510685.0
2017-06-15 147.67 933.97 69.27 150.04 943.34 70.21 146.37 924.44 68.80 149.80 942.31 69.90 18994152.0 2133050.0 26068709.0
2017-06-16 149.59 940.00 69.73 150.83 942.04 70.03 148.60 931.60 69.22 150.64 939.78 70.00 22882417.0 3094711.0 48345085.0
2017-06-19 151.71 949.96 70.50 153.57 959.99 70.94 151.71 949.05 70.35 152.87 957.37 70.87 19026462.0 1533336.0 23798268.0
2017-06-20 152.88 957.52 70.82 153.84 961.62 70.87 152.21 950.01 69.87 152.25 950.63 69.91 14743853.0 1125990.0 21512231.0
2017-06-21 152.36 953.64 70.21 154.08 960.10 70.62 151.88 950.76 69.94 153.91 959.45 70.27 15039834.0 1202233.0 19891051.0
2017-06-22 153.01 958.70 70.54 154.55 960.72 70.59 152.91 954.55 69.71 153.40 957.09 70.26 13005535.0 941958.0 22965738.0
2017-06-23 152.72 956.83 70.09 155.20 966.00 71.25 152.65 954.20 69.92 155.07 965.59 71.21 18026977.0 1527856.0 27617291.0
2017-06-26 156.25 969.90 71.40 156.50 973.31 71.71 153.20 950.79 70.44 153.59 952.27 70.53 18049651.0 1598355.0 19606974.0
2017-06-27 152.84 942.46 70.11 153.31 948.29 70.18 150.39 926.85 69.18 150.58 927.33 69.21 19360766.0 2579930.0 25215128.0
2017-06-28 150.92 929.00 69.21 153.47 942.75 69.84 149.86 916.00 68.79 153.24 940.49 69.80 16719104.0 2721406.0 25806200.0
2017-06-29 152.28 929.92 69.38 152.50 931.26 69.49 148.92 910.62 68.09 151.04 917.79 68.49 23973482.0 3299176.0 28918715.0
2017-06-30 151.90 926.05 68.78 151.92 926.05 69.38 150.06 908.31 68.74 150.98 908.73 68.93 17666022.0 2090226.0 24161068.0
2017-07-03 151.72 912.18 69.33 152.15 913.94 69.60 147.80 894.79 68.02 148.43 898.70 68.17 13862735.0 1710373.0 16165538.0
2017-07-05 149.00 901.76 68.26 150.85 914.51 69.44 148.13 898.50 68.22 150.34 911.71 69.08 14334290.0 1813884.0 21176272.0
2017-07-06 149.03 904.12 68.27 150.04 914.94 68.78 148.00 899.70 68.12 148.82 906.69 68.57 14951802.0 1424503.0 21117572.0
2017-07-07 149.25 908.85 68.70 151.99 921.54 69.84 149.19 908.85 68.70 151.44 918.59 69.46 13615931.0 1637785.0 16878317.0
2017-07-10 151.69 921.77 69.46 153.98 930.38 70.25 151.51 919.59 69.20 153.50 928.80 69.98 13378672.0 1192825.0 15014503.0
2017-07-11 153.37 929.54 70.11 155.42 931.43 70.68 152.91 922.00 69.75 155.27 930.09 69.99 13667744.0 1113235.0 17460022.0
2017-07-12 156.49 938.68 70.69 159.16 946.30 71.28 156.20 934.47 70.55 158.90 943.83 71.15 22739388.0 1532144.0 17750925.0
2017-07-13 158.74 946.29 71.50 159.78 954.45 72.04 158.42 943.01 71.31 159.26 947.16 71.77 13929236.0 1294687.0 20269816.0
2017-07-14 160.13 952.00 72.24 160.32 956.91 73.27 159.33 948.00 71.96 159.97 955.99 72.78 16371207.0 1053774.0 25868086.0
2017-07-17 160.25 957.00 72.80 160.78 960.74 73.45 158.81 949.24 72.72 159.73 953.42 73.35 12858836.0 1165537.0 21803907.0
2017-07-18 159.66 953.00 73.09 163.73 968.04 73.39 159.42 950.60 72.66 162.86 965.40 73.30 23722910.0 1153964.0 26435296.0
2017-07-19 163.59 967.84 73.50 165.70 973.04 74.04 163.17 964.03 73.45 164.14 970.89 73.86 26273642.0 1224540.0 22416222.0
2017-07-20 164.80 975.00 74.18 165.00 975.90 74.30 162.81 961.51 73.28 164.53 968.15 74.22 18523107.0 1624463.0 42361025.0
2017-07-21 164.16 962.25 73.45 165.05 973.23 74.29 163.75 960.15 73.17 164.43 972.92 73.79 14744002.0 1711000.0 46717082.0
2017-07-24 164.64 972.22 73.53 166.17 986.20 73.75 164.31 970.77 73.13 166.00 980.34 73.60 17366971.0 3248347.0 21394826.0
2017-07-25 165.01 953.81 73.80 165.54 959.70 74.31 163.86 945.40 73.50 165.28 950.70 74.19 15383241.0 4660979.0 22018681.0

393 rows × 15 columns

3 - Create a Data Frame called stock_open for the open prices of the stocks you retrieved in part (2). Print the first few rows.


In [53]:
stock_open = stock_prices['Open']
stock_open.head(10)


Out[53]:
minor FB GOOG MSFT
Date
2016-01-04 101.95 743.00 54.32
2016-01-05 102.89 746.45 54.93
2016-01-06 101.13 730.00 54.32
2016-01-07 100.50 730.31 52.70
2016-01-08 99.88 731.45 52.37
2016-01-11 97.91 716.61 52.51
2016-01-12 99.00 721.68 52.76
2016-01-13 100.58 730.85 53.80
2016-01-14 95.85 705.38 52.00
2016-01-15 93.98 692.29 51.31

4 - Compute the total, average, and maximum price for each stock weekly.

If by total we mean the sum (which admittedly does not make much sense for a price):


In [54]:
stock_open.resample('W').aggregate([sum,np.mean, max])


Out[54]:
FB GOOG MSFT
sum mean max sum mean max sum mean max
Date
2016-01-10 506.35 101.2700 102.89 3681.21 736.2420 746.45 268.64 53.7280 54.93
2016-01-17 487.32 97.4640 100.58 3566.81 713.3620 730.85 262.38 52.4760 53.80
2016-01-24 380.68 95.1700 96.53 2817.69 704.4225 723.60 203.87 50.9675 51.48
2016-01-31 510.46 102.0920 108.99 3604.85 720.9700 731.53 262.33 52.4660 54.73
2016-02-07 563.65 112.7300 115.27 3731.86 746.3720 784.50 266.34 53.2680 54.88
2016-02-14 502.44 100.4880 103.74 3392.29 678.4580 690.26 247.39 49.4780 50.25
2016-02-21 414.09 103.5225 105.80 2797.01 699.2525 710.00 206.69 51.6725 52.33
2016-02-28 532.36 106.4720 108.70 3506.41 701.2820 708.58 259.64 51.9280 52.60
2016-03-06 545.41 109.0820 110.25 3556.61 711.3220 719.00 260.10 52.0200 52.97
2016-03-13 535.94 107.1880 108.48 3522.08 704.4160 720.00 260.18 52.0360 53.00
2016-03-20 552.59 110.5180 112.15 3658.41 731.6820 741.86 268.04 53.6080 54.92
2016-03-27 447.16 111.7900 112.13 2948.33 737.0825 742.36 214.81 53.7025 54.11
2016-04-03 572.33 114.4660 116.73 3709.33 741.8660 750.10 272.80 54.5600 55.05
2016-04-10 566.69 113.3380 114.25 3713.17 742.6340 750.06 274.52 54.9040 55.43
2016-04-17 553.68 110.7360 112.23 3738.17 747.6340 754.01 274.50 54.9000 55.30
2016-04-24 557.03 111.4060 112.55 3769.65 753.9300 769.51 276.12 55.2240 56.63
2016-05-01 564.70 112.9400 119.58 3547.77 709.5540 725.42 255.49 51.0980 52.26
2016-05-08 587.16 117.4320 118.04 3481.07 696.2140 698.38 249.97 49.9940 50.34
2016-05-15 599.93 119.9860 120.41 3581.15 716.2300 723.41 254.59 50.9180 51.44
2016-05-22 589.01 117.8020 119.38 3532.77 706.5540 715.99 253.95 50.7900 51.72
2016-05-29 589.59 117.9180 119.56 3581.03 716.2060 724.01 257.07 51.4140 51.93
2016-06-05 475.63 118.9075 119.46 2928.04 732.0100 734.53 209.72 52.4300 52.64
2016-06-12 591.29 118.2580 119.24 3611.05 722.2100 724.91 259.30 51.8600 52.24
2016-06-19 572.66 114.5320 115.30 3575.55 715.1100 719.00 249.19 49.8380 50.41
2016-06-26 567.92 113.5840 114.65 3468.85 693.7700 699.06 253.01 50.6020 51.28
2016-07-03 564.44 112.8880 114.67 3410.64 682.1280 692.20 249.78 49.9560 51.13
2016-07-10 460.36 115.0900 116.63 2783.62 695.9050 699.50 204.76 51.1900 51.73
2016-07-17 589.97 117.9940 118.63 3598.10 719.6200 725.73 266.79 53.3580 53.95
2016-07-24 599.38 119.8760 121.92 3672.15 734.4300 741.86 275.62 55.1240 56.15
2016-07-31 617.98 123.5960 127.52 3737.74 747.5480 772.71 281.86 56.3720 56.61
... ... ... ... ... ... ... ... ... ...
2017-01-08 473.42 118.3550 120.98 3148.51 787.1275 795.26 249.76 62.4400 62.79
2017-01-15 625.82 125.1640 127.49 4033.88 806.7760 807.86 313.78 62.7560 63.06
2017-01-22 512.78 128.1950 128.41 3224.92 806.2300 807.08 250.26 62.5650 62.68
2017-01-29 651.00 130.2000 132.68 4131.69 826.3380 837.81 319.36 63.8720 65.39
2017-02-05 658.46 131.6920 133.22 4007.99 801.5980 814.66 321.66 64.3320 65.69
2017-02-12 664.41 132.8820 134.49 4031.90 806.3800 811.70 318.58 63.7160 64.25
2017-02-19 668.82 133.7640 134.70 4097.31 819.4620 823.02 322.36 64.4720 64.74
2017-02-26 537.15 134.2875 135.89 3315.17 828.7925 830.12 257.89 64.4725 64.61
2017-03-05 682.24 136.4480 137.09 4143.42 828.6840 833.85 321.43 64.2860 64.69
2017-03-12 687.69 137.5380 138.91 4167.14 833.4280 843.28 322.72 64.5440 65.19
2017-03-19 698.00 139.6000 140.34 4235.87 847.1740 851.61 323.75 64.7500 65.01
2017-03-26 698.35 139.6700 141.15 4174.40 834.8800 851.40 324.52 64.9040 65.36
2017-04-02 706.11 141.2220 142.40 4114.83 822.9660 833.50 325.78 65.1560 65.65
2017-04-09 709.36 141.8720 142.26 4156.45 831.2900 835.51 328.95 65.7900 66.30
2017-04-16 561.14 140.2850 141.00 3294.17 823.5425 825.39 261.92 65.4800 65.61
2017-04-23 709.23 141.8460 143.90 4183.34 836.6680 842.88 327.15 65.4300 65.67
2017-04-30 734.01 146.8020 149.50 4374.69 874.9380 910.66 340.52 68.1040 68.91
2017-05-07 760.30 152.0600 153.60 4586.03 917.2060 933.54 345.70 69.1400 69.71
2017-05-14 753.14 150.6280 151.49 4651.90 930.3800 936.95 343.79 68.7580 68.99
2017-05-21 741.44 148.2880 150.17 4661.09 932.2180 940.00 340.16 68.0320 68.89
2017-05-28 747.64 149.5280 152.23 4762.93 952.5860 969.70 344.25 68.8500 69.80
2017-06-04 608.27 152.0675 152.70 3883.74 970.9350 975.02 281.00 70.2500 70.53
2017-06-11 769.17 153.8340 154.77 4906.21 981.2420 984.50 361.46 72.2920 72.64
2017-06-18 746.84 149.3680 151.26 4725.36 945.0720 959.92 349.18 69.8360 70.91
2017-06-25 762.68 152.5360 153.01 4776.65 955.3300 958.70 352.16 70.4320 70.82
2017-07-02 764.19 152.8380 156.25 4697.33 939.4660 969.90 348.88 69.7760 71.40
2017-07-09 599.00 149.7500 151.72 3626.91 906.7275 912.18 274.56 68.6400 69.33
2017-07-16 780.42 156.0840 160.13 4688.28 937.6560 952.00 354.00 70.8000 72.24
2017-07-23 812.46 162.4920 164.80 4815.09 963.0180 975.00 367.02 73.4040 74.18
2017-07-30 329.65 164.8250 165.01 1926.03 963.0150 972.22 147.33 73.6650 73.80

82 rows × 9 columns
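
As a side note, the list aggregation produces two-level columns; if single labels such as FB_mean are more convenient downstream, they can be flattened. A small sketch (weekly_stats is just an illustrative name):

# flatten the (stock, statistic) column pairs into single strings such as 'FB_mean'
weekly_stats = stock_open.resample('W').aggregate(['sum', 'mean', 'max'])
weekly_stats.columns = ['{}_{}'.format(stock, stat) for stock, stat in weekly_stats.columns]
weekly_stats.head()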

Alternatively, if by total we mean the last opening price of each week:


In [55]:
# the last opening price in each week
last = stock_open.resample('W').last()
# add a level to the column index so this frame can be merged with the mean/max frame built below
last.columns = pd.MultiIndex.from_product([stock_open.columns, ['last']])
# merge the mean/max frame with the last-value frame on the index,
# then stack/unstack so each stock's columns end up grouped together
pd.merge(stock_open.resample('W').aggregate(['mean', 'max']),
         last,
         left_index=True,
         right_index=True).stack().unstack()


Out[55]:
FB GOOG MSFT
last max mean last max mean last max mean
Date
2016-01-10 99.88 102.89 101.2700 731.45 746.45 736.2420 52.37 54.93 53.7280
2016-01-17 93.98 100.58 97.4640 692.29 730.85 713.3620 51.31 53.80 52.4760
2016-01-24 96.41 96.53 95.1700 723.60 723.60 704.4225 51.41 51.48 50.9675
2016-01-31 108.99 108.99 102.0920 731.53 731.53 720.9700 54.73 54.73 52.4660
2016-02-07 109.51 115.27 112.7300 703.87 784.50 746.3720 51.94 54.88 53.2680
2016-02-14 103.74 103.74 100.4880 690.26 690.26 678.4580 50.25 50.25 49.4780
2016-02-21 102.55 105.80 103.5225 695.03 710.00 699.2525 51.97 52.33 51.6725
2016-02-28 108.70 108.70 106.4720 708.58 708.58 701.2820 52.60 52.60 51.9280
2016-03-06 110.05 110.25 109.0820 714.99 719.00 711.3220 52.40 52.97 52.0200
2016-03-13 108.48 108.48 107.1880 720.00 720.00 704.4160 53.00 53.00 52.0360
2016-03-20 111.56 112.15 110.5180 741.86 741.86 731.6820 54.92 54.92 53.6080
2016-03-27 112.13 112.13 111.7900 732.01 742.36 737.0825 53.84 54.11 53.7025
2016-04-03 113.75 116.73 114.4660 738.60 750.10 741.8660 55.05 55.05 54.5600
2016-04-10 114.25 114.25 113.3380 743.97 750.06 742.6340 54.67 55.43 54.9040
2016-04-17 110.79 112.23 110.7360 753.98 754.01 747.6340 55.30 55.30 54.9000
2016-04-24 111.21 112.55 111.4060 726.30 769.51 753.9300 51.91 56.63 55.2240
2016-05-01 116.82 119.58 112.9400 690.70 725.42 709.5540 49.35 52.26 51.0980
2016-05-08 117.16 118.04 117.4320 698.38 698.38 696.2140 49.92 50.34 49.9940
2016-05-15 120.38 120.41 119.9860 711.93 723.41 716.2300 51.44 51.44 50.9180
2016-05-22 116.96 119.38 117.8020 701.62 715.99 706.5540 50.48 51.72 50.7900
2016-05-29 119.56 119.56 117.9180 724.01 724.01 716.2060 51.92 51.93 51.4140
2016-06-05 118.98 119.46 118.9075 729.27 734.53 732.0100 52.38 52.64 52.4300
2016-06-12 117.54 119.24 118.2580 719.47 724.91 722.2100 51.05 52.24 51.8600
2016-06-19 114.42 115.30 114.5320 708.65 719.00 715.1100 50.41 50.41 49.8380
2016-06-26 111.01 114.65 113.5840 675.17 699.06 693.7700 49.81 51.28 50.6020
2016-07-03 114.20 114.67 112.8880 692.20 692.20 682.1280 51.13 51.13 49.9560
2016-07-10 116.43 116.63 115.0900 699.50 699.50 695.9050 51.73 51.73 51.1900
2016-07-17 117.74 118.63 117.9940 725.73 725.73 719.6200 53.95 53.95 53.3580
2016-07-24 119.90 121.92 119.8760 741.86 741.86 734.4300 56.08 56.15 55.1240
2016-07-31 124.65 127.52 123.5960 772.71 772.71 747.5480 56.26 56.61 56.3720
... ... ... ... ... ... ... ... ... ...
2017-01-08 120.98 120.98 118.3550 795.26 795.26 787.1275 62.30 62.79 62.4400
2017-01-15 127.49 127.49 125.1640 807.48 807.86 806.7760 62.62 63.06 62.7560
2017-01-22 128.10 128.41 128.1950 806.91 807.08 806.2300 62.67 62.68 62.5650
2017-01-29 132.68 132.68 130.2000 834.71 837.81 826.3380 65.39 65.39 63.8720
2017-02-05 131.24 133.22 131.6920 802.99 814.66 801.5980 63.50 65.69 64.3320
2017-02-12 134.10 134.49 132.8820 811.70 811.70 806.3800 64.25 64.25 63.7160
2017-02-19 133.50 134.70 133.7640 823.02 823.02 819.4620 64.47 64.74 64.4720
2017-02-26 134.16 135.89 134.2875 827.73 830.12 828.7925 64.53 64.61 64.4725
2017-03-05 136.63 137.09 136.4480 830.56 833.85 828.6840 63.99 64.69 64.2860
2017-03-12 138.91 138.91 137.5380 843.28 843.28 833.4280 65.11 65.19 64.5440
2017-03-19 140.34 140.34 139.6000 851.61 851.61 847.1740 64.91 65.01 64.7500
2017-03-26 140.08 141.15 139.6700 820.08 851.40 834.8800 65.36 65.36 64.9040
2017-04-02 142.31 142.40 141.2220 828.97 833.50 822.9660 65.65 65.65 65.1560
2017-04-09 141.20 142.26 141.8720 827.96 835.51 831.2900 65.85 66.30 65.7900
2017-04-16 139.62 141.00 140.2850 822.14 825.39 823.5425 65.29 65.61 65.4800
2017-04-23 143.90 143.90 141.8460 842.88 842.88 836.6680 65.67 65.67 65.4300
2017-04-30 149.50 149.50 146.8020 910.66 910.66 874.9380 68.91 68.91 68.1040
2017-05-07 151.45 153.60 152.0600 933.54 933.54 917.2060 68.90 69.71 69.1400
2017-05-14 150.40 151.49 150.6280 931.53 936.95 930.3800 68.61 68.99 68.7580
2017-05-21 148.44 150.17 148.2880 931.47 940.00 932.2180 67.50 68.89 68.0320
2017-05-28 152.23 152.23 149.5280 969.70 969.70 952.5860 69.80 69.80 68.8500
2017-06-04 151.85 152.70 152.0675 969.46 975.02 970.9350 70.44 70.53 70.2500
2017-06-11 154.77 154.77 153.8340 984.50 984.50 981.2420 72.04 72.64 72.2920
2017-06-18 149.59 151.26 149.3680 940.00 959.92 945.0720 69.73 70.91 69.8360
2017-06-25 152.72 153.01 152.5360 956.83 958.70 955.3300 70.09 70.82 70.4320
2017-07-02 151.90 156.25 152.8380 926.05 969.90 939.4660 68.78 71.40 69.7760
2017-07-09 149.25 151.72 149.7500 908.85 912.18 906.7275 68.70 69.33 68.6400
2017-07-16 160.13 160.13 156.0840 952.00 952.00 937.6560 72.24 72.24 70.8000
2017-07-23 164.16 164.80 162.4920 962.25 975.00 963.0180 73.45 74.18 73.4040
2017-07-30 165.01 165.01 164.8250 953.81 972.22 963.0150 73.80 73.80 73.6650

82 rows × 9 columns
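
Since 'last' is itself a recognised aggregation name, the merge/stack/unstack above can most likely be avoided altogether. A minimal sketch of that alternative (the per-stock column order will differ slightly from Out[55]):

# 'last', 'mean' and 'max' can all be passed to a single aggregate call
stock_open.resample('W').aggregate(['last', 'mean', 'max'])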

5 - For each stock, return the weeks for which the opening stock price was greater than that year's average daily opening price.


In [56]:
# weekly opening price minus the mean opening price over the corresponding year
# appending the start-of-year ('AS') resample to the end-of-year ('A') resample ensures
# that the forward fill covers every week of both years
stock_difference = (stock_open.resample('W').first()
                    - stock_open.resample('AS').mean().append(stock_open.resample('A').mean()).resample('W').ffill())
stock_difference.head()


Out[56]:
minor FB GOOG MSFT
Date
2016-01-03 NaN NaN NaN
2016-01-10 -15.130754 -0.738532 -0.910476
2016-01-17 -19.170754 -27.128532 -2.720476
2016-01-24 -20.550754 -40.438532 -3.750476
2016-01-31 -18.360754 -20.158532 -3.290476
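
An alternative that avoids appending two yearly resamples is to broadcast each year's mean back onto the daily index with groupby/transform and then compare weekly values. A sketch, assuming the same stock_open as above (weeks straddling a year boundary may be handled slightly differently than with the forward fill):

# replace each day's value with the mean of its calendar year
yearly_mean = stock_open.groupby(stock_open.index.year).transform('mean')
# compare the first opening price of every week against that year's mean
weekly_diff = stock_open.resample('W').first() - yearly_mean.resample('W').first()
weekly_diff.head()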

In [57]:
# for each stock, print the weeks whose opening price exceeded that year's average
for stock in stock_code:
    print('Weeks with opening price above yearly average for {}:\n{}\n'.format(
        stock,
        # select the index where the difference is positive, format it, and join with newlines
        '\n'.join(stock_difference[stock_difference[stock] > 0].index.strftime('Week %W of %Y'))))


Weeks with opening price above yearly average for GOOG:
Week 05 of 2016
Week 14 of 2016
Week 16 of 2016
Week 31 of 2016
Week 32 of 2016
Week 33 of 2016
Week 34 of 2016
Week 35 of 2016
Week 36 of 2016
Week 37 of 2016
Week 38 of 2016
Week 39 of 2016
Week 40 of 2016
Week 41 of 2016
Week 42 of 2016
Week 43 of 2016
Week 44 of 2016
Week 45 of 2016
Week 46 of 2016
Week 47 of 2016
Week 48 of 2016
Week 49 of 2016
Week 50 of 2016
Week 51 of 2016
Week 18 of 2017
Week 19 of 2017
Week 20 of 2017
Week 21 of 2017
Week 22 of 2017
Week 23 of 2017
Week 24 of 2017
Week 25 of 2017
Week 26 of 2017
Week 27 of 2017
Week 28 of 2017
Week 29 of 2017
Week 30 of 2017

Weeks with opening price above yearly average for FB:
Week 18 of 2016
Week 19 of 2016
Week 20 of 2016
Week 21 of 2016
Week 22 of 2016
Week 23 of 2016
Week 28 of 2016
Week 29 of 2016
Week 30 of 2016
Week 31 of 2016
Week 32 of 2016
Week 33 of 2016
Week 34 of 2016
Week 35 of 2016
Week 36 of 2016
Week 37 of 2016
Week 38 of 2016
Week 39 of 2016
Week 40 of 2016
Week 41 of 2016
Week 42 of 2016
Week 43 of 2016
Week 44 of 2016
Week 45 of 2016
Week 46 of 2016
Week 47 of 2016
Week 48 of 2016
Week 50 of 2016
Week 51 of 2016
Week 17 of 2017
Week 18 of 2017
Week 19 of 2017
Week 20 of 2017
Week 21 of 2017
Week 22 of 2017
Week 23 of 2017
Week 24 of 2017
Week 25 of 2017
Week 26 of 2017
Week 27 of 2017
Week 28 of 2017
Week 29 of 2017
Week 30 of 2017

Weeks with opening price above yearly average for MSFT:
Week 14 of 2016
Week 16 of 2016
Week 30 of 2016
Week 31 of 2016
Week 32 of 2016
Week 33 of 2016
Week 34 of 2016
Week 35 of 2016
Week 36 of 2016
Week 37 of 2016
Week 38 of 2016
Week 39 of 2016
Week 40 of 2016
Week 41 of 2016
Week 42 of 2016
Week 43 of 2016
Week 44 of 2016
Week 45 of 2016
Week 46 of 2016
Week 47 of 2016
Week 48 of 2016
Week 49 of 2016
Week 50 of 2016
Week 51 of 2016
Week 17 of 2017
Week 18 of 2017
Week 19 of 2017
Week 20 of 2017
Week 21 of 2017
Week 22 of 2017
Week 23 of 2017
Week 24 of 2017
Week 25 of 2017
Week 26 of 2017
Week 27 of 2017
Week 28 of 2017
Week 29 of 2017
Week 30 of 2017
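
The loop in In [57] relies on the stock_code list defined earlier in the exercise; iterating over the columns of stock_difference directly keeps it self-contained. A sketch of that variant:

# iterate over (column name, Series) pairs instead of the external stock_code list
for stock, diffs in stock_difference.items():
    weeks = diffs[diffs > 0].index.strftime('Week %W of %Y')
    print('Weeks with opening price above yearly average for {}:'.format(stock))
    print('\n'.join(weeks), end='\n\n')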