PyShop Session 4

Exercises


These exercises will give you the chance to practice downloading CSV files using requests and handling the data in those files using pandas.

Should you struggle with downloading the data, the data files are included in the folder for your convenience.

The questions are in order of increasing difficulty: the first should take you less than a minute, while the last you might not be able to figure out. Good luck!

Note: I apologize that the solutions and questions are not next to each other; there is a numbering issue in the Markdown that generates this text. It is a known bug that has yet to be fixed!

  1. Use the requests package to download the two data files at the following URLs: http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF1-1.csv and http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF3-1.csv. You should end up with two requests objects containing the pages.
  2. Use BytesIO and pandas to read the CSV data into two separate data frames.
  3. Use head, tail, describe, shape, and any other methods you find interesting to learn more about your data sets.
  4. Use join and merge to do an inner join of the two data frames, keeping only years that appear in both data sets. NOTE: These two will do the same thing; it's just for practice. If you struggle with join, don't forget about the index!
  5. Check the dimensions of all four data frames to ensure the proper shape.
  6. Use the map function to make all the column headers lower case and to strip whitespace from the left and right. NOTE: To see why this is necessary, print the column names. There are extra spaces!
  7. Use the shift method to create a lagged value of the capital-to-labor ratio k. Given this, generate a column of first differences in k.
  8. Use groupby to generate average values of all variables during and outside of war (i.e. when w is 1 or 0, respectively).
  9. Use merge to add a column for the average technology level during wartime and during peacetime to your data frame, then create a column for the deviation of technology from that mean.
  10. Install seaborn using either conda install seaborn or pip install seaborn. NOTE: Apparently the pip version is more up to date.
  11. Use seaborn to generate a pairplot of your data, just to take a look. With so few observations, we can't say much... NOTE: If you get an error like AttributeError: max must be larger than min in range parameter., it is probably caused by the NaN in the lag. Come up with some way to deal with this!
  12. If you complete this and still want more, use statsmodels to generate regression coefficients of Q (output) on all of the variables (or just the ones you want). Clearly the sample is too small, but it's good practice!

In [89]:
import requests
import numpy as np
import pandas as pd
from io import BytesIO
import seaborn as sns

%matplotlib inline

In [32]:
# One session is enough for both downloads
c = requests.Session()
URL = 'http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF2-1.csv'
page = c.get(URL)
URL = 'http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF6-4.csv'
page2 = c.get(URL)

In [33]:
data1 = pd.read_csv(BytesIO(page.content))
data2 = pd.read_csv(BytesIO(page2.content))

In [34]:
data1.head()


Out[34]:
YEAR X C W
0 1940 241 226 0
1 1941 280 240 0
2 1942 319 235 1
3 1943 331 245 1
4 1944 345 255 1

In [35]:
data2.head()


Out[35]:
YEAR Q K A
0 1909 0.623 2.06 1.000
1 1910 0.616 2.10 0.983
2 1911 0.647 2.17 1.021
3 1912 0.652 2.21 1.023
4 1913 0.680 2.23 1.064

In [36]:
data1.tail()


Out[36]:
YEAR X C W
6 1946 332 295 0
7 1947 320 300 0
8 1948 339 305 0
9 1949 338 315 0
10 1950 371 325 0

In [37]:
data2.tail()


Out[37]:
YEAR Q K A
36 1945 1.296 2.66 1.850
37 1946 1.215 2.50 1.769
38 1947 1.194 2.50 1.739
39 1948 1.221 2.55 1.767
40 1949 1.275 2.70 1.809

In [38]:
data1.describe()


Out[38]:
YEAR X C W
count 11.000000 11.000000 11.000000 11.000000
mean 1945.000000 323.272727 273.272727 0.363636
std 3.316625 35.071615 35.522080 0.504525
min 1940.000000 241.000000 226.000000 0.000000
25% 1942.500000 319.500000 242.500000 0.000000
50% 1945.000000 332.000000 265.000000 0.000000
75% 1947.500000 339.500000 302.500000 1.000000
max 1950.000000 371.000000 325.000000 1.000000

In [39]:
data2.describe()


Out[39]:
YEAR Q K A
count 41.000000 41.000000 41.000000 41.000000
mean 1929.000000 0.905317 2.631463 1.324244
std 11.979149 0.201472 0.329815 0.267492
min 1909.000000 0.616000 2.060000 0.983000
25% 1919.000000 0.729000 2.470000 1.142000
50% 1929.000000 0.874000 2.630000 1.226000
75% 1939.000000 1.034000 2.810000 1.514000
max 1949.000000 1.296000 3.330000 1.850000

In [40]:
merged = pd.merge(data1, data2, how="inner", on='YEAR')
print(data1.shape, data2.shape, merged.shape)
merged.head()


(11, 4) (41, 4) (10, 7)
Out[40]:
YEAR X C W Q K A
0 1940 241 226 0 1.082 2.63 1.590
1 1941 280 240 0 1.122 2.58 1.660
2 1942 319 235 1 1.136 2.64 1.665
3 1943 331 245 1 1.180 2.62 1.692
4 1944 345 255 1 1.265 2.63 1.812

In [46]:
# NOTE: Join requires the index to be defined, so set the index for both frames
joined = data1.set_index('YEAR').join(data2.set_index('YEAR'), how="inner")
print(data1.shape, data2.shape, joined.shape)
joined.head()


(11, 4) (41, 4) (10, 6)
Out[46]:
X C W Q K A
YEAR
1940 241 226 0 1.082 2.63 1.590
1941 280 240 0 1.122 2.58 1.660
1942 319 235 1 1.136 2.64 1.665
1943 331 245 1 1.180 2.62 1.692
1944 345 255 1 1.265 2.63 1.812

In [58]:
merged.columns = map(str.lower, merged.columns)
merged.columns = map(str.lstrip, merged.columns)
merged.columns = map(str.rstrip, merged.columns)
print(merged.columns)


Index(['year', 'x', 'c', 'w', 'q', 'k', 'a'], dtype='object')
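The three map calls above can be collapsed into a single pass that strips and lowercases each header at once; a quick sketch on dummy column names with the same kind of stray whitespace:

```python
# Column names with extra whitespace, as in the downloaded CSVs
cols = [' YEAR', 'X ', ' C ', 'W']

# strip() removes whitespace on both sides; lower() handles the case
cleaned = [c.strip().lower() for c in cols]
print(cleaned)  # ['year', 'x', 'c', 'w']
```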

In [51]:
merged['lagged_k'] = merged['k'].shift()
merged.head()


Out[51]:
year x c w q k a lagged_k
0 1940 241 226 0 1.082 2.63 1.590 NaN
1 1941 280 240 0 1.122 2.58 1.660 2.63
2 1942 319 235 1 1.136 2.64 1.665 2.58
3 1943 331 245 1 1.180 2.62 1.692 2.64
4 1944 345 255 1 1.265 2.63 1.812 2.62

In [54]:
merged['diff_k'] = merged['k'] - merged['lagged_k']
merged.head()


Out[54]:
year x c w q k a lagged_k diff_k
0 1940 241 226 0 1.082 2.63 1.590 NaN NaN
1 1941 280 240 0 1.122 2.58 1.660 2.63 -0.05
2 1942 319 235 1 1.136 2.64 1.665 2.58 0.06
3 1943 331 245 1 1.180 2.62 1.692 2.64 -0.02
4 1944 345 255 1 1.265 2.63 1.812 2.62 0.01
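As an aside, pandas also provides a diff method that computes the same first difference in one step; a small check that the two approaches agree, using the first few k values from the table above:

```python
import pandas as pd

k = pd.Series([2.63, 2.58, 2.64, 2.62, 2.63])

# Lag-and-subtract, as in the cells above
manual = k - k.shift()
# Built-in first difference
direct = k.diff()

# Both produce NaN in the first slot and identical values after
assert manual.equals(direct)
```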

In [60]:
grouped = merged.groupby('w')

In [64]:
grouped.mean()


Out[64]:
year x c q k a lagged_k diff_k
w
0 1945.166667 308.333333 280.166667 1.184833 2.576667 1.722333 2.5680 -0.002
1 1943.500000 333.750000 250.000000 1.219250 2.637500 1.754750 2.6175 0.020

In [78]:
means = grouped['a'].mean().reset_index()
means.columns = ['w', 'a_w']

In [80]:
merged1 = pd.merge(merged, means, how="left", on="w")
merged1


Out[80]:
year x c w q k a lagged_k diff_k a_w
0 1940 241 226 0 1.082 2.63 1.590 NaN NaN 1.722333
1 1941 280 240 0 1.122 2.58 1.660 2.63 -0.05 1.722333
2 1942 319 235 1 1.136 2.64 1.665 2.58 0.06 1.754750
3 1943 331 245 1 1.180 2.62 1.692 2.64 -0.02 1.754750
4 1944 345 255 1 1.265 2.63 1.812 2.62 0.01 1.754750
5 1945 340 265 1 1.296 2.66 1.850 2.63 0.03 1.754750
6 1946 332 295 0 1.215 2.50 1.769 2.66 -0.16 1.722333
7 1947 320 300 0 1.194 2.50 1.739 2.50 0.00 1.722333
8 1948 339 305 0 1.221 2.55 1.767 2.50 0.05 1.722333
9 1949 338 315 0 1.275 2.70 1.809 2.55 0.15 1.722333
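The merge in the cell above is one way to broadcast the group means back onto the rows; groupby together with transform does the same thing without building an intermediate frame. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'w': [0, 0, 1, 1],
                   'a': [1.0, 2.0, 3.0, 5.0]})

# transform('mean') returns a Series aligned with df's index,
# so each row receives its own group's mean directly
df['a_w'] = df.groupby('w')['a'].transform('mean')
df['a_dev'] = df['a'] - df['a_w']
print(df['a_w'].tolist())  # [1.5, 1.5, 4.0, 4.0]
```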

In [84]:
merged1['a_dev'] = merged1['a'] - merged1['a_w']
merged1


Out[84]:
year x c w q k a lagged_k diff_k a_w a_dev
0 1940 241 226 0 1.082 2.63 1.590 NaN NaN 1.722333 -0.132333
1 1941 280 240 0 1.122 2.58 1.660 2.63 -0.05 1.722333 -0.062333
2 1942 319 235 1 1.136 2.64 1.665 2.58 0.06 1.754750 -0.089750
3 1943 331 245 1 1.180 2.62 1.692 2.64 -0.02 1.754750 -0.062750
4 1944 345 255 1 1.265 2.63 1.812 2.62 0.01 1.754750 0.057250
5 1945 340 265 1 1.296 2.66 1.850 2.63 0.03 1.754750 0.095250
6 1946 332 295 0 1.215 2.50 1.769 2.66 -0.16 1.722333 0.046667
7 1947 320 300 0 1.194 2.50 1.739 2.50 0.00 1.722333 0.016667
8 1948 339 305 0 1.221 2.55 1.767 2.50 0.05 1.722333 0.044667
9 1949 338 315 0 1.275 2.70 1.809 2.55 0.15 1.722333 0.086667

In [90]:
sns.pairplot(merged1.set_index('year').dropna())


Out[90]:
<seaborn.axisgrid.PairGrid at 0x7f7e04cb4e50>