Data Bootcamp "Learning Experience"

NYU Stern School of Business | March 2016

Please answer the questions below in this IPython notebook. Add cells as needed. When you're done, save it and email it to Dave Backus (db3@nyu.edu). Use the subject line: "bootcamp exam" plus "UG" or "MBA", as appropriate. Make sure you have the correct email address and the correct file. Doing this correctly is worth 10 points.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Import packages

Run this code. Really.


In [1]:
# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 
import datetime as dt                 # check today's date 
import sys                            # check Python version 

# IPython command, puts plots in notebook 
%matplotlib inline

print('Today is', dt.date.today())
print('Python version:\n', sys.version, sep='')


Today is 2016-04-02
Python version:
3.5.1 |Anaconda 2.5.0 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]

Question 0

  • Change the file name by adding _YourLastName to it in the textbox at the top.
  • Add a markdown cell directly above this one that includes your name in bold, your student number, and your email address.

(10 points)

Question 1

For each part (a)-(e), describe the type and value of the variable with the corresponding name:

(a) a = 2*3

(b) b = 2.0*3

(c) c = 'abc'

(d) d = ['This', "is", 'not', "a", 'string']

(e) e = d[3]

(25 points)


In [2]:
# experiment in this box 

a = 2*3
b = 2.0*3
c = 'abc'
d = ['This', "is", 'not', "a", 'string']
e = d[3]

In [3]:
# do this with a function because we're lazy and value our time
def valuetype(x):
    """
    print value and type of input x
    """
    print('Value and type: ', x, ', ', type(x), sep='')

In [4]:
# (a)
valuetype(a)


Value and type: 6, <class 'int'>

In [5]:
# (b)
valuetype(b)


Value and type: 6.0, <class 'float'>

In [6]:
# (c)
valuetype(c)


Value and type: abc, <class 'str'>

In [7]:
# (d)
valuetype(d)


Value and type: ['This', 'is', 'not', 'a', 'string'], <class 'list'>

In [8]:
# (e)
valuetype(e)


Value and type: a, <class 'str'>
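
The answers above can also be checked without reading printed output. What follows is a minimal sketch, assuming the same five assignments as in the experiment cell; the helper name describe is hypothetical and not part of the exam. It returns the value and the type name instead of printing them, so the results can be compared directly.

```python
# Same assignments as in Question 1
a = 2*3
b = 2.0*3
c = 'abc'
d = ['This', "is", 'not', "a", 'string']
e = d[3]

def describe(x):
    """Return (value, type name) rather than printing, so results can be compared."""
    return x, type(x).__name__

# collect the answers in one dictionary keyed by variable name
answers = {name: describe(value)
           for name, value in [('a', a), ('b', b), ('c', c), ('d', d), ('e', e)]}

for name, (value, typename) in answers.items():
    print(name, '->', repr(value), typename)
```

Using repr when printing keeps the quotes around strings, which makes it easier to tell the string 'a' from a variable named a.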

Question 2

As above, describe the value and type of each variable. (These are more challenging.)

(f) f = (1, 2, 3)

(g) g = {1: 'Chase', 2: 'Dave', 3: 'Spencer'}

(h) h = 'foo' + 'bar'

(i) i = (1 != 0) # parens not needed, but they make code more understandable

(20 points)


In [9]:
f = (1, 2, 3)
g = {1: 'Chase', 2: 'Dave', 3: 'Spencer'}
h = 'foo' + 'bar'
i = (1 != 0)

In [10]:
# (f)
valuetype(f)


Value and type: (1, 2, 3), <class 'tuple'>

In [11]:
# (g)
valuetype(g)


Value and type: {1: 'Chase', 2: 'Dave', 3: 'Spencer'}, <class 'dict'>

In [12]:
# (h)
valuetype(h)


Value and type: foobar, <class 'str'>

In [13]:
# (i)
valuetype(i)


Value and type: True, <class 'bool'>
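
Knowing the type tells us what we can do with each object. The sketch below reuses f, g, and i from this question to illustrate three points: tuples can be indexed but not changed, dictionaries are looked up by key, and comparisons produce bool values. The extra variable names are chosen here for illustration only.

```python
f = (1, 2, 3)
g = {1: 'Chase', 2: 'Dave', 3: 'Spencer'}
i = (1 != 0)

# tuples are indexed like lists ...
first_of_f = f[0]

# ... but cannot be changed in place
try:
    f[0] = 10
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# a dict is looked up by key, not by position; here the keys happen to be integers
name_for_2 = g[2]

# comparisons produce bool values, which can drive an if statement directly
print(first_of_f, tuple_is_mutable, name_for_2, i)
```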

Question 3

Explain the code below -- briefly -- in a Markdown cell. What happens if we change the first line to torf = False?

(10 points)


In [14]:
torf = True

if torf: 
    x = 1
else:
    x = 2
    
print('x =', x)


x = 1

Changed cell to Markdown with menu at top

The code from the if statement on:

  • If torf is True, it sets x = 1.
  • If torf is False, it sets x = 2.

At the top, torf is True, so the first branch runs (x = 1) and the cell prints x = 1. If we change it to False, the second branch runs instead and the cell prints x = 2.
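
One way to see both cases without editing the first line is to wrap the same logic in a function and call it twice. This is a sketch only; the function name choose_x is hypothetical and not part of the exam.

```python
def choose_x(torf):
    """Return 1 if torf is true, 2 otherwise (same logic as the exam cell)."""
    if torf:
        x = 1
    else:
        x = 2
    return x

for torf in [True, False]:
    print('torf =', torf, '-> x =', choose_x(torf))

# the same choice written as a one-line conditional expression
x = 1 if False else 2
print('x =', x)
```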


Question 4

Take the first and last variables defined in the cell below and do the following with them:

(a) Extract the first letter of last.

(b) Find a method to split last into two components at the hyphen.

(c) Define a new string variable named combo consisting of first (the first name), a space, the first letter of last, and a period.

(d) Define a function that takes as inputs first and last names (both strings) and returns combo (also a string, consisting of the first name plus the first letter of the last name and a period). Apply it to the variables first and last and to your own first and last names.

(20 points)


In [1]:
first = 'Sarah'
last  = 'Beckett-Hile'

In [2]:
# (a)
firstoflast = last[0]
firstoflast


Out[2]:
'B'

In [3]:
# (b) 
last.split('-')


Out[3]:
['Beckett', 'Hile']

In [4]:
# (c) 
combo = first + ' ' + last[0] + '.'
combo


Out[4]:
'Sarah B.'

In [19]:
# (d) 
def lastinitial(name1, name2):
    combo = name1 + ' ' + name2[0] + '.'
    return combo

lastinitial(first, last)


Out[19]:
'Sarah B.'

In [20]:
lastinitial('Chase', 'Coleman')


Out[20]:
'Chase C.'
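
Once the logic lives in a function, it is easy to apply it to many names at once. The sketch below rewrites lastinitial from part (d) with str.format, which is a stylistic choice rather than a requirement, and applies it to a list of name pairs. It also shows that the two pieces produced by split in part (b) can be unpacked into separate variables.

```python
def lastinitial(name1, name2):
    """First name, a space, the first letter of the last name, and a period."""
    return '{} {}.'.format(name1, name2[0])

# the first pair is from the exam; the others are the notebook's coauthors
names = [('Sarah', 'Beckett-Hile'), ('Chase', 'Coleman'), ('Spencer', 'Lyon')]
combos = [lastinitial(first, last) for first, last in names]
print(combos)

# splitting at the hyphen and unpacking the two pieces into separate variables
part1, part2 = 'Beckett-Hile'.split('-')
print(part1, part2)
```

Note that name2[0] raises an IndexError if the last name is an empty string, so a version meant for messy data would need to check for that first.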

Question 5

Consider the variable things = [1, '2', 3.0, 'four'].

(a) Write a loop that goes through the elements of things and prints them and their type.

(b) Modify the loop to print only those elements that are integers.

(10 points)

(c) Bonus (not graded): Can you do parts (a) and (b) with a list comprehension?


In [21]:
things = [1, '2', 3.0, 'four']

In [22]:
# (a) 
for thing in things:
    print('Value and type: ', thing, ', ', type(thing), sep='')


Value and type: 1, <class 'int'>
Value and type: 2, <class 'str'>
Value and type: 3.0, <class 'float'>
Value and type: four, <class 'str'>

In [23]:
# (b) 
for thing in things:
    if type(thing) == int:
        print('Value and type: ', thing, ', ', type(thing), sep='')


Value and type: 1, <class 'int'>

In [24]:
# (c) 
[print('Value and type: ', thing, ', ', type(thing), sep='') for thing in things]


Value and type: 1, <class 'int'>
Value and type: 2, <class 'str'>
Value and type: 3.0, <class 'float'>
Value and type: four, <class 'str'>
Out[24]:
[None, None, None, None]

In [25]:
[print('Value and type: ', thing, ', ', type(thing), sep='') for thing in things
    if type(thing) == int]


Value and type: 1, <class 'int'>
Out[25]:
[None]
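
The list comprehensions above work, but they call print for its side effect and leave behind a list of None values. A more common pattern, sketched here as one alternative, is to have the comprehension build the values we care about and to print them separately. The names pairs and integers are chosen for illustration.

```python
things = [1, '2', 3.0, 'four']

# (a) build a list of (value, type name) pairs, then print it in a normal loop
pairs = [(thing, type(thing).__name__) for thing in things]
for value, typename in pairs:
    print('Value and type: ', value, ', ', typename, sep='')

# (b) keep only the integers; isinstance is the usual test,
# though unlike type(thing) == int it also accepts bool values
integers = [thing for thing in things if isinstance(thing, int)]
print(integers)
```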

Question 6

Next up: We explore the Census's Business Dynamics Statistics, a huge collection of data about firms. We've extracted a small piece of one of their databases that includes these variables for 2013:

  • Size: size category of firms based on number of employees
  • Firms: number of firms in this size category
  • Emp: number of employees in this size category

Run the code cell below to load the data and use the result to answer these questions:

(a) What type of object is bds?

(b) What are its dimensions?

(c) What are its column labels? Row labels?

(d) What dtypes are the columns?

(20 points)


In [27]:
data = {'Size': ['1 to 4', '5 to 9', '10 to 19', '20 to 49', '50 to 99',
                 '100 to 249', '250 to 499', '500 to 999', '1000 to 2499',
                 '2500 to 4999', '5000 to 9999', '10000+'], 
        'Firms': [2846416, 1020772, 598153, 373345, 115544, 63845,
                  19389, 9588, 6088, 2287, 1250, 1357], 
        'Emp': [5998912, 6714924, 8151891, 11425545, 8055535, 9788341, 
                6611734, 6340775, 8321486, 6738218, 6559020, 32556671]}
bds = pd.DataFrame(data) 
bds = bds.set_index('Size')

In [28]:
# (a)
type(bds)


Out[28]:
pandas.core.frame.DataFrame

In [29]:
# (b)
bds.shape


Out[29]:
(12, 2)

In [30]:
# (c)
list(bds)  # or bds.columns


Out[30]:
['Emp', 'Firms']

In [31]:
bds.index


Out[31]:
Index(['1 to 4', '5 to 9', '10 to 19', '20 to 49', '50 to 99', '100 to 249',
       '250 to 499', '500 to 999', '1000 to 2499', '2500 to 4999',
       '5000 to 9999', '10000+'],
      dtype='object', name='Size')

In [32]:
# (d)
bds.dtypes


Out[32]:
Emp      int64
Firms    int64
dtype: object
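
The shape in part (b) has two columns rather than three because set_index moves Size out of the columns and into the row labels. The sketch below illustrates this with only the first three rows of the exam data; the name raw is introduced here for the dataframe before the index is set.

```python
import pandas as pd

# first three rows of the exam data, enough to show the idea
data = {'Size': ['1 to 4', '5 to 9', '10 to 19'],
        'Firms': [2846416, 1020772, 598153],
        'Emp': [5998912, 6714924, 8151891]}

raw = pd.DataFrame(data)          # Size is still an ordinary column
bds = raw.set_index('Size')       # Size becomes the row labels

print(raw.shape)                  # three columns
print(bds.shape)                  # two columns: the index is not counted
print(sorted(bds.columns))        # sorted, because column order can depend on the pandas version
print(list(bds.index))
```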

Question 7

Continuing with the same data:

(a) Create a new variable AvgEmp equal to the ratio of Emp to Firms and add it as a new column in bds.

(b) Use a dataframe method to change the name of Emp to Employment.

(c) Create a bar chart of the number of employees in each size category.

(15 points)


In [33]:
# (a) 
bds['AvgEmp'] = bds['Emp']/bds['Firms']
bds.head(3)


Out[33]:
              Emp    Firms     AvgEmp
Size
1 to 4    5998912  2846416   2.107532
5 to 9    6714924  1020772   6.578280
10 to 19  8151891   598153  13.628438

In [34]:
# (b) 
bds = bds.rename(columns={'Emp': 'Employment'})
bds.head(3)


Out[34]:
          Employment    Firms     AvgEmp
Size
1 to 4       5998912  2846416   2.107532
5 to 9       6714924  1020772   6.578280
10 to 19     8151891   598153  13.628438
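
The assignment bds = bds.rename(...) in part (b) matters: by default, rename returns a new dataframe and leaves the original untouched. A small sketch, assuming a two-row version of the data; the names small and renamed are introduced here for illustration.

```python
import pandas as pd

small = pd.DataFrame({'Emp': [5998912, 6714924], 'Firms': [2846416, 1020772]},
                     index=['1 to 4', '5 to 9'])

# rename returns a new dataframe; small itself is not modified
renamed = small.rename(columns={'Emp': 'Employment'})

print(sorted(small.columns))     # the original still has Emp
print(sorted(renamed.columns))   # the copy has Employment
```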

In [35]:
# (c)
bds['Employment'].plot.bar()


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x8a7f860>

Question 8

Still continuing with the same data:

(a) Create figure and axis objects.

(b) Add a horizontal bar chart of the number of firms in each category to the axis object you created.

(c) Make the bars red.

(d) Add a title.

(e) Change the style to fivethirtyeight.

(25 points)


In [36]:
# everything has to be in same cell to apply to the same figure
plt.style.use('fivethirtyeight')  # (e) 
fig, ax = plt.subplots()  # (a) 
bds['Firms'].plot.barh(ax=ax, color='red')  # (b,c) 
ax.set_title('Numbers of firms by employment category')  # (d)


Out[36]:
<matplotlib.text.Text at 0x90ad828>

Comment. Evidently there are lots of small firms.
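
To put a number on that comment, we can compute the share of firms in the smallest categories. This is a sketch using the Firms numbers from Question 6; treating "small" as fewer than 20 employees (the first three categories) is an assumption made here, not something the exam defines.

```python
import pandas as pd

sizes = ['1 to 4', '5 to 9', '10 to 19', '20 to 49', '50 to 99',
         '100 to 249', '250 to 499', '500 to 999', '1000 to 2499',
         '2500 to 4999', '5000 to 9999', '10000+']
firms = pd.Series([2846416, 1020772, 598153, 373345, 115544, 63845,
                   19389, 9588, 6088, 2287, 1250, 1357], index=sizes)

# share of all firms in each size category
share = firms / firms.sum()

# assumption: "small" means the first three categories (fewer than 20 employees)
small_share = share.loc[['1 to 4', '5 to 9', '10 to 19']].sum()
print(round(small_share, 3))
```

By this measure, firms with fewer than 20 employees make up the large majority of all firms in the data.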

