PyShop Session 4

Exercises

These exercises give you the chance to practice downloading a CSV file using requests and working with the data in that file using Pandas.

Should you struggle with the download of the data, the data file is included in the folder for your convenience.

The questions increase in difficulty: the first should take you less than a minute, and you might not be able to figure out the last one. Good luck!

Note: I apologize for the solutions and questions not being next to each other, but there is a numbering issue in the Markdown that generates this text. Sorry, but it is a known bug that has yet to be fixed!

Use the requests package to download the two data files at the following URLs: http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF1-1.csv and http://www.stern.nyu.edu/~wgreene/Text/Edition7/TableF3-1.csv. You should end up with two requests Response objects containing the pages.
Use BytesIO and Pandas to read the CSV data into two separate data frames.
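To illustrate the BytesIO step without touching the network, here is a minimal sketch. The bytes in `raw` are made up and only stand in for what a response's `content` attribute would hold; the column names `year`, `x`, and `y` are hypothetical and are not the headers of the real tables.

```python
from io import BytesIO

import pandas as pd

# Stand-in for `response.content`: made-up CSV bytes, not the real table.
raw = b"year,x,y\n1940,241,226\n1941,280,240\n"

# BytesIO wraps the bytes in a file-like object that read_csv can consume.
df = pd.read_csv(BytesIO(raw))

print(df.shape)
print(df.head())
```

With a real download you would pass the response's `content` to BytesIO in place of `raw`.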
Use head, tail, describe, shape, and any other methods you find interesting to learn more about your data sets.
Use both join and merge to do an inner join of the two data frames, keeping only years that appear in both data sets. NOTE: The two methods give the same result; doing it both ways is just for practice. If you struggle with join, don't forget about the index!
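As a rough illustration of how the two approaches line up, the sketch below uses two tiny made-up frames. The column names `year`, `x`, and `y` are assumptions for the example, not the real headers. The key difference is that merge matches on a column, while join matches on the index.

```python
import pandas as pd

# Two small made-up frames with partially overlapping years.
left = pd.DataFrame({"year": [1940, 1941, 1942], "x": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"year": [1941, 1942, 1943], "y": [10.0, 20.0, 30.0]})

# merge: match rows on the shared "year" column.
merged = pd.merge(left, right, on="year", how="inner")

# join: match rows on the index, so move "year" into the index first.
joined = left.set_index("year").join(right.set_index("year"), how="inner")

print(merged.shape)
print(joined.shape)
```

Note that the joined frame has one fewer column than the merged one, because `year` lives in the index rather than in a column.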
Check the dimensions of all four data frames to ensure the proper shape.
Use the map function to make all the column headers lower case and to strip white space from the left and right. NOTE: To see why this is necessary, print the column names. There are extra spaces!
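One possible way to do the clean-up with map is sketched below. The messy headers here are invented for the example; the real files may have different names and spacing.

```python
import pandas as pd

# Made-up frame whose headers carry stray spaces and mixed case.
df = pd.DataFrame({" Year ": [1940, 1941], "  K": [1.5, 1.6], "W  ": [0, 1]})

# Apply strip() and lower() to every header, then assign the result back.
df.columns = list(map(lambda name: name.strip().lower(), df.columns))

print(list(df.columns))
```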
Use the shift method to create a lagged value for the capital-to-labor ratio k. Given this, generate a column of first differences in k.
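A minimal sketch of the lag and first difference, on invented values. It assumes the rows are already sorted by year; the new column names `k_lag` and `dk` are just placeholders.

```python
import pandas as pd

# Made-up values for k, assumed to be sorted by year.
df = pd.DataFrame({"year": [1940, 1941, 1942, 1943], "k": [2.0, 2.5, 2.4, 3.0]})

# shift(1) moves every value down one row, so each row sees last year's k.
df["k_lag"] = df["k"].shift(1)

# First difference: this year's k minus last year's k.
df["dk"] = df["k"] - df["k_lag"]

print(df)
```

The first row has no previous year, so both new columns start with NaN.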
Use groupby to generate average values of all variables during and outside of war (i.e., when w is 1 or 0 respectively).
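The groupby step might look something like this on made-up data. The column `a` is a hypothetical stand-in for a technology variable; only `w` and `k` are names taken from the exercises.

```python
import pandas as pd

# Made-up data: w is the war dummy, the other columns are arbitrary values.
df = pd.DataFrame({
    "w": [0, 0, 1, 1],
    "k": [2.0, 4.0, 1.0, 3.0],
    "a": [1.0, 2.0, 3.0, 5.0],
})

# One row per value of w, holding the mean of every other column.
means = df.groupby("w").mean()

print(means)
```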
Use merge to add a column for average technology level during war time and during peace time to your data frame, then create a column for technology deviation from the mean.
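Here is one way the merge could be set up, again on invented numbers. The names `a`, `a_mean`, and `a_dev` are hypothetical; substitute whatever the technology column is called in your data.

```python
import pandas as pd

# Made-up data: w is the war dummy, a stands in for the technology level.
df = pd.DataFrame({
    "year": [1940, 1941, 1942, 1943],
    "w": [0, 1, 1, 0],
    "a": [1.0, 3.0, 5.0, 2.0],
})

# Mean of a for each value of w, as a two-column frame (w, a_mean).
a_mean = df.groupby("w")["a"].mean().rename("a_mean").reset_index()

# A left merge on w attaches the matching group mean to every row.
df = df.merge(a_mean, on="w", how="left")

# Deviation of each observation from its own group's mean.
df["a_dev"] = df["a"] - df["a_mean"]

print(df)
```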
Install seaborn using either conda install seaborn or pip install seaborn. NOTE: Apparently the pip version is more up to date.
Use seaborn to generate a pairplot of your data, just to take a look. With so few observations, we can't say much... NOTE: If you get an error like AttributeError: max must be larger than min in range parameter., it is probably caused by the NaN in the lag. Come up with some way to deal with this!
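One simple option for the NaN problem, among several, is to drop the incomplete rows before plotting. The sketch below shows only the pandas side on made-up values and leaves the seaborn call as a comment so it runs without seaborn installed.

```python
import pandas as pd

# Made-up k values; the lag leaves a NaN in the first row.
df = pd.DataFrame({"k": [2.0, 2.5, 3.0]})
df["k_lag"] = df["k"].shift(1)

# Drop every row that contains at least one NaN.
clean = df.dropna()

print(clean)

# With seaborn installed, the plot would then be:
# import seaborn as sns
# sns.pairplot(clean)
```

Dropping rows costs you an observation, which matters in a sample this small, so you may prefer to plot only the columns without NaN instead.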
If you complete this and still want some more, use statsmodels to generate regression coefficients of Q (output) on all of the variables (or just the ones you want). Clearly the sample is too small, but it's good practice!