NYU Stern School of Business | March 2016
Please answer the questions below in this IPython notebook. Add cells as needed. When you're done, save it and email it to Dave Backus (db3@nyu.edu). Use the subject line: "bootcamp exam" plus "UG" or "MBA", as appropriate. Make sure you have the correct email address and the correct file. Doing this correctly is worth 10 points.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [1]:
# import packages
import pandas as pd # data management
import matplotlib.pyplot as plt # graphics
import datetime as dt # check today's date
import sys # check Python version
# IPython command, puts plots in notebook
%matplotlib inline
print('Today is', dt.date.today())
print('Python version:\n', sys.version, sep='')
In [2]:
# experiment in this box
a = 2*3
b = 2.0*3
c = 'abc'
d = ['This', "is", 'not', "a", 'string']
e = d[3]
In [3]:
# do this with a function because we're lazy and value our time
def valuetype(x):
    """
    print value and type of input x
    """
    print('Value and type: ', x, ', ', type(x), sep='')
In [4]:
# (a)
valuetype(a)
In [5]:
# (b)
valuetype(b)
In [6]:
# (c)
valuetype(c)
In [7]:
# (d)
valuetype(d)
In [8]:
# (e)
valuetype(e)
In [ ]:
In [9]:
f = (1, 2, 3)
g = {1: 'Chase', 2: 'Dave', 3: 'Spencer'}
h = 'foo' + 'bar'
i = (1 != 0)
In [10]:
# (f)
valuetype(f)
In [11]:
# (g)
valuetype(g)
In [12]:
# (h)
valuetype(h)
In [13]:
# (i)
valuetype(i)
In [ ]:
In [14]:
torf = True
if torf:
    x = 1
else:
    x = 2
print('x =', x)
Changed cell to Markdown with menu at top
The code from if on:
At the top, torf is True, so the first branch runs (x = 1). If we change torf to False, the second branch runs instead (x = 2).
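To see both branches without editing the cell by hand, the same logic can be wrapped in a small function. This is only a sketch: the name pick_x is hypothetical and not part of the exam.

```python
def pick_x(torf):
    """Return 1 when torf is True, otherwise 2 (mirrors the if/else above)."""
    if torf:
        x = 1
    else:
        x = 2
    return x

print('x =', pick_x(True))    # takes the first branch
print('x =', pick_x(False))   # takes the second branch
```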
In [ ]:
Take the first and last variables defined in the cell below and do the following with them:
(a) Extract the first letter of last.
(b) Find a method to split last into two components at the hyphen.
(c) Define a new string variable named combo consisting of first (the first name), a space, the first letter of last, and a period.
(d) Define a function that takes as inputs first and last names (both strings) and returns combo (also a string, consisting of the first name plus the first letter of the last name and a period). Apply it to the variables first and last and to your own first and last names.
(20 points)
In [1]:
first = 'Sarah'
last = 'Beckett-Hile'
In [2]:
# (a)
firstoflast = last[0]
firstoflast
Out[2]:
In [3]:
# (b)
last.split('-')
Out[3]:
In [4]:
# (c)
combo = first + ' ' + last[0] + '.'
combo
Out[4]:
In [19]:
# (d)
def lastinitial(name1, name2):
    combo = name1 + ' ' + name2[0] + '.'
    return combo
lastinitial(first, last)
Out[19]:
In [20]:
lastinitial('Chase', 'Coleman')
Out[20]:
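Because the outputs are not shown above, here is a self-contained sketch of what parts (a) through (d) should produce. It repeats first, last, and lastinitial so that it runs by itself. The docstring is an addition for illustration.

```python
first = 'Sarah'
last = 'Beckett-Hile'

def lastinitial(name1, name2):
    """Return the first name, a space, the last-name initial, and a period."""
    return name1 + ' ' + name2[0] + '.'

print(last[0])                    # (a) first letter of last
print(last.split('-'))            # (b) list of the parts around the hyphen
print(lastinitial(first, last))   # (c) and (d) applied to first and last
print(lastinitial('Chase', 'Coleman'))
```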
In [ ]:
Consider the variable things = [1, '2', 3.0, 'four'].
(a) Write a loop that goes through the elements of things and prints them and their type.
(b) Modify the loop to print only those elements that are integers.
(10 points)
(c) Bonus (not graded): Can you do parts (a) and (b) with a list comprehension?
In [21]:
things = [1, '2', 3.0, 'four']
In [22]:
# (a)
for thing in things:
    print('Value and type: ', thing, ', ', type(thing), sep='')
In [23]:
# (b)
for thing in things:
    if type(thing) == int:
        print('Value and type: ', thing, ', ', type(thing), sep='')
In [24]:
# (c)
[print('Value and type: ', thing, ', ', type(thing), sep='') for thing in things]
Out[24]:
In [25]:
[print('Value and type: ', thing, ', ', type(thing), sep='') for thing in things
if type(thing) == int]
Out[25]:
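One caveat about the comprehensions above: print returns None, so each of them builds a list of None values as a side effect. A possible alternative, sketched here under that observation, is to collect the values in the comprehension and print afterwards. The names pairs and ints are hypothetical.

```python
things = [1, '2', 3.0, 'four']

# (a) collect (value, type) pairs instead of printing inside the comprehension
pairs = [(thing, type(thing)) for thing in things]

# (b) keep only the integers; an exact type check matches the loop in part (b)
ints = [thing for thing in things if type(thing) is int]

for value, kind in pairs:
    print('Value and type: ', value, ', ', kind, sep='')
print('Integers only:', ints)
```

The exact type check is a deliberate choice: isinstance(x, int) would also accept True and False, since bool is a subclass of int.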
In [ ]:
Next up: We explore the Census's Business Dynamics Statistics, a huge collection of data about firms. We've extracted a small piece of one of their databases that includes these variables for 2013:
Size: size category of firms based on number of employees
Firms: number of firms in this size category
Emp: number of employees in this size category
Run the code cell below to load the data and use the result to answer these questions:
(a) What type of object is bds?
(b) What are its dimensions?
(c) What are its column labels? Row labels?
(d) What dtypes are the columns?
(20 points)
In [27]:
data = {'Size': ['1 to 4', '5 to 9', '10 to 19', '20 to 49', '50 to 99',
                 '100 to 249', '250 to 499', '500 to 999', '1000 to 2499',
                 '2500 to 4999', '5000 to 9999', '10000+'],
        'Firms': [2846416, 1020772, 598153, 373345, 115544, 63845,
                  19389, 9588, 6088, 2287, 1250, 1357],
        'Emp': [5998912, 6714924, 8151891, 11425545, 8055535, 9788341,
                6611734, 6340775, 8321486, 6738218, 6559020, 32556671]}
bds = pd.DataFrame(data)
bds = bds.set_index('Size')
In [28]:
# (a)
type(bds)
Out[28]:
In [29]:
# (b)
bds.shape
Out[29]:
In [30]:
# (c)
list(bds)    # or bds.columns
Out[30]:
In [31]:
bds.index
Out[31]:
In [32]:
# (d)
bds.dtypes
Out[32]:
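As a rough check on parts (a) through (d), the sketch below rebuilds a shortened, three-row version of the data (the name small is hypothetical) and inspects it the same way. Because Size becomes the index, it is not counted as a column. The full 12-row frame therefore has shape (12, 2).

```python
import pandas as pd

# shortened three-row version of the exam data, for illustration only
data = {'Size': ['1 to 4', '5 to 9', '10 to 19'],
        'Firms': [2846416, 1020772, 598153],
        'Emp': [5998912, 6714924, 8151891]}
small = pd.DataFrame(data).set_index('Size')

print(type(small))            # (a) the object type
print(small.shape)            # (b) (rows, columns)
print(list(small.columns))    # (c) column labels
print(list(small.index))      # (c) row labels
print(small.dtypes)           # (d) one dtype per column
```

The order of the column labels can depend on the pandas version, so it is safer not to rely on it.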
In [ ]:
In [33]:
# (a)
bds['AvgEmp'] = bds['Emp']/bds['Firms']
bds.head(3)
Out[33]:
In [34]:
# (b)
bds = bds.rename(columns={'Emp': 'Employment'})
bds.head(3)
Out[34]:
In [35]:
# (c)
bds['Employment'].plot.bar()
Out[35]:
In [ ]:
In [36]:
# everything has to be in same cell to apply to the same figure
plt.style.use('fivethirtyeight') # (e)
fig, ax = plt.subplots() # (a)
bds['Firms'].plot.barh(ax=ax, color='red') # (b,c)
ax.set_title('Numbers of firms by employment category') # (d)
Out[36]:
Comment. Evidently there are lots of small firms.
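One way to put a number on that comment is to compute each size category's share of all firms. The sketch below uses plain Python lists copied from the data cell so that it runs without the DataFrame. The names sizes, firms, and shares are hypothetical.

```python
sizes = ['1 to 4', '5 to 9', '10 to 19', '20 to 49', '50 to 99',
         '100 to 249', '250 to 499', '500 to 999', '1000 to 2499',
         '2500 to 4999', '5000 to 9999', '10000+']
firms = [2846416, 1020772, 598153, 373345, 115544, 63845,
         19389, 9588, 6088, 2287, 1250, 1357]

total = sum(firms)

# share of all firms that falls in each size category
shares = {size: count / total for size, count in zip(sizes, firms)}

for size, share in shares.items():
    print(size, round(100 * share, 1), sep=': ')
```

In this extract, firms with 1 to 4 employees account for more than half of all firms.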
In [ ]: