Working with Open Data Midterm (March 18, 2014)

There are 94 points in this exam: 2 each for the 47 questions. The questions are either multiple choice or short answers. For multiple choice, just write the number of the choice selected.

Name: ______________________________________

`

World Population

Consider this code to construct a DataFrame of populations of countries.


In [1]:
import json
import requests
from pandas import DataFrame

# read population in from JSON-formatted data derived from the Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
     "raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list= requests.get(pop_json_url).json()

df = DataFrame(pop_list)
df[:5]


Out[1]:
0 1 2
0 1 China 1385566537
1 2 India 1252139596
2 3 United States 320050716
3 4 Indonesia 249865631
4 5 Brazil 200361925

5 rows × 3 columns

Note the dtypes of the columns.


In [2]:
df.dtypes


Out[2]:
0    float64
1     object
2      int64
dtype: object

Q1: What is the relationship between s and the population of China, where s is defined as

s = sum(df[df[1].str.startswith('C')][2])

  1. s is greater than the population of China
  2. s is the same as the population of China
  3. s is less than the population of China
  4. s is not a number.

A1:


Q2: What is the relationship between s2 and the population of China, where s2 is defined by:

s2 = sum(df[df[1].str.startswith('X')][2])

  1. s2 is greater than the population of China
  2. s2 is the same as the population of China
  3. s2 is less than the population of China
  4. s2 is not a number.

A2:


Q3: What happens when the following statement is run?

df.columns = ['Number','Country','Population']

  1. Nothing
  2. df gets a new attribute called columns
  3. df's columns are renamed based on the list
  4. Throws an exception

A3:


Q4: This statement does the following

df.columns = ['Number','Country']

  1. Nothing
  2. df gets a new attribute called columns
  3. df's columns are renamed based on the list
  4. Throws an exception

A4:



Q5: How would you rewrite the following statement to get the same result as

s = sum(df[df[1].str.startswith('C')][2])

after running:

df.columns = ['Number','Country','Population']

A5:






Q6. What is

len(df[df["Population"] > 1000000000])

A6:



Q7. What is

";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))

A7:



Q8. What is

len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))

A8:



Pandas Series


In [11]:
from pandas import DataFrame, Series
import numpy as np

s1 = Series(np.arange(-1,4))
s1


Out[11]:
0   -1
1    0
2    1
3    2
4    3
dtype: int64

Q9: What is

s1 + 1

A9:






Q10: What is

s1.apply(lambda k: 2*k).sum()

A10:



Q11: What is

s1.cumsum()[3]

A11:



Q12: What is

s1.cumsum() - s1.cumsum()

A12:









Q13. What is

len(s1.cumsum() - s1.cumsum())

A13:






Q14: What is

np.any(s1 > 2)

A14:



Q15. What is

np.all(s1<3)

A15:



Census API

Consider the following code to load population(s) from the Census API.


In [19]:
from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})


Out[19]:
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

Q16: What is the purpose of settings.CENSUS_KEY?

  1. It is the password for the Census Python package
  2. It is an API Access key for authentication with the Census API
  3. It is an API Access key for authentication with Github
  4. It is key shared by all users of the Census API

A16:


Q17. When we run

pip install census

we are:

  1. installing a Python module from PyPI
  2. installing the Python module census from continuum.io's repository
  3. signing ourselves up for a census API key
  4. None of the above

A17:


Consider r1 and r2:

Q18: What is the difference between r1 and r2?

r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
    r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })

A18:











Q19. What's the relationship between len(r1) and len(r2)?

  1. len(r1) is less than len(r2)
  2. len(r1) equals len(r2)
  3. len(r1) is greater than len(r2)
  4. None of the above

A19:


Q20: Which is a correct geographic hierarchy?

Nation > States = Nation is subdivided into States

  1. Counties > States
  2. Counties > Block Groups > Census Tracks
  3. Census Tracts > Block Groups > Census Blocks
  4. Places > Counties

A20:



In [21]:
from pandas import DataFrame
import numpy as np
from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)

r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)

df1.head()


Out[21]:
NAME P0010001 state
0 Alabama 4779736 01
1 Alaska 710231 02
2 Arizona 6392017 04
3 Arkansas 2915918 05
4 California 37253956 06

5 rows × 3 columns


In [22]:
len(df1)


Out[22]:
52

Q21: Why does df have 52 items? Please explain

A21:









Consider the two following expressions:


In [23]:
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()


477973671023163920172915918372539565029196357409789793460172318801310968765313603011567582128306326483802304635528531184339367453337213283615773552654762998836405303925296729759889279894151826341270055113164708791894205917919378102953548367259111536504375135138310741270237910525674625364814180634610525145561276388562574180010246724540185299456869865636263725789

312471327

Q22: Why is df1.P0010001.sum() different from df1.P0010001.astype(int).sum()?

A22:











Q23: Describe the output of the following:

df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()

A23:









Q24: After running:

df1.set_index('NAME', inplace=True)

how would you access the Series for the state of Nebraska?

  1. df1['Nebraska']
  2. df1[1]
  3. df1.ix['Nebraska']
  4. df1[df1['NAME'] == 'Nebraska']

A24:


Q25. What is len(states.STATES)?

A25:



Q26. What is

len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])

A26:



In the next question, we will make use of the negation operator ~. Take a look at a specific example


In [28]:
~Series([True, True, False, True])


Out[28]:
0    False
1    False
2     True
3    False
dtype: bool

Q27. What is

list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]

A27:



Consider pop1 and pop2:


In [30]:
pop1 = df1['P0010001'].astype('int').sum() 
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()

pop1-pop2


Out[30]:
3725789

Q28. What does pop11 - pop2 represent?

A28:






Generator and range

Q29. Given that

range(10)

is

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

How to get the total of every integer from 1 to 100?

  1. sum(range(1, 101))
  2. sum(range(100))
  3. sum(range(1, 100))
  4. None of the above

A29:


Q30. What output is produced from

# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."

from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()

A30:






Q31. Recalling that

1+2+3+...+100 = 5050

what is:

(2*Series(np.arange(101))).sum()

A31:



Census Places

Consider the follow generator that we used to query for census places.


In [34]:
import pandas as pd
from pandas import DataFrame

import census
import settings
import us

from itertools import islice

c=census.Census(settings.CENSUS_KEY)

def places(variables="NAME"):
    
    for state in us.states.STATES:
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place

Now we compute a DataFrame for the places: places_df


In [35]:
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')

print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()


number of places 29261
total pop 228457238
Out[35]:
NAME P0010001 place state
0 Abanda CDP 192 00100 01
1 Abbeville city 2688 00124 01
2 Adamsville city 4522 00460 01
3 Addison town 758 00484 01
4 Akron town 356 00676 01

5 rows × 4 columns

We display the most populous places from California


In [36]:
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()


Out[36]:
NAME P0010001 place state
2714 Los Angeles city 3792621 44000 06
3112 San Diego city 1307402 66000 06
3122 San Jose city 945942 68000 06
3116 San Francisco city 805235 67000 06
2425 Fresno city 494665 27000 06

5 rows × 4 columns

Q32. Given

places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()

is

NAME P0010001 place state
2714 Los Angeles city 3792621 44000 06
3112 San Diego city 1307402 66000 06
3122 San Jose city 945942 68000 06
3116 San Francisco city 805235 67000 06
2425 Fresno city 494665 27000 06

5 rows × 4 columns

</div>
what is

places_df.ix[3122]['label']

after we add the label column with:

places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)

A32:






Q33. What is

places_df["NAME"][3122]

A33:



Alphabet and apply

Now let's set up a DataFrame with some letters and properties of letters.


In [39]:
# numpy and pandas related imports 

import numpy as np
from pandas import Series, DataFrame
import pandas as pd

# for example, using lower and uppercase English letters

import string

lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')

df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()


Out[39]:
lower upper ord
0 a A 97
1 b B 98
2 c C 99
3 d D 100
4 e E 101

5 rows × 3 columns

Note that string.upper takes a letter and returns its uppercase version. For example:


In [40]:
string.upper('b')


Out[40]:
'B'

Q34. What is

np.all(df2['lower'].apply(string.upper) == df2['upper'])

A34:




Q35. What is

df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]

A35:





Berkeley I School generator

Please remind yourself what enumerate does.


In [43]:
words = ['Berkeley', 'I', 'School']

for (i, word) in islice(enumerate(words),1):
    print (i, word)


(0, 'Berkeley')

Q36. What is

list(enumerate(words))[2][1]

A36:






Now consider the generator g2


In [45]:
def g2():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        if word != 'I':
            for letter in list(word):
                yield letter
            
my_g2 = g2()

Q37. What is

len(list(my_g2))

A37:




In [47]:
def g3():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        yield words

Q38. What is

len(list(g3()))

A38:



Groupby

Consider using groupby with a DataFrame with states.


In [49]:
import us
import census
import settings

import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice

c = census.Census(settings.CENSUS_KEY)

def states(variables='NAME'):
    geo={'for':'state:*'}
    
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r
            
# make a dataframe from the total populations of states in the 2010 Census

df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s:s[0])

df.head()


Out[49]:
NAME P0010001 state first_letter
0 Alabama 4779736 01 A
1 Alaska 710231 02 A
2 Arizona 6392017 04 A
3 Arkansas 2915918 05 A
4 California 37253956 06 C

5 rows × 4 columns

For reference, here's a list of all the states


In [50]:
print list(df.NAME)


[u'Alabama', u'Alaska', u'Arizona', u'Arkansas', u'California', u'Colorado', u'Connecticut', u'Delaware', u'District of Columbia', u'Florida', u'Georgia', u'Hawaii', u'Idaho', u'Illinois', u'Indiana', u'Iowa', u'Kansas', u'Kentucky', u'Louisiana', u'Maine', u'Maryland', u'Massachusetts', u'Michigan', u'Minnesota', u'Mississippi', u'Missouri', u'Montana', u'Nebraska', u'Nevada', u'New Hampshire', u'New Jersey', u'New Mexico', u'New York', u'North Carolina', u'North Dakota', u'Ohio', u'Oklahoma', u'Oregon', u'Pennsylvania', u'Rhode Island', u'South Carolina', u'South Dakota', u'Tennessee', u'Texas', u'Utah', u'Vermont', u'Virginia', u'Washington', u'West Virginia', u'Wisconsin', u'Wyoming']

Q39. What is

df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']

A39:






Q40. What is

df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']

A40:



Q41. What is

df.groupby('first_letter').agg('count')['first_letter']['P']

A41:



Q42. What is

len(df.groupby('NAME'))

A42:



Diversity Index

Recall the code from the diversity calculations


In [55]:
def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total

def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    
    # eliminate 0s
    series1 = series[series!=0]

    # if len(series) < 2 (i.e., 0 or 1) then return 0
    
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
    
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0

def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1-np.sum(s1*s1)

Q43. Suppose you have 10 people and 5 categories, how you would you maximize the Shannon entropy?

  1. Regardless of how you distribute the people, you'll get the same entropy.
  2. Put 10 people in any single category, and then 0 in the rest.
  3. Distribute the people evenly over all the categories.
  4. Put 5 people in each category.

A43:


Q44. What is

entropy(Series([0,0,10,0,0]))

A44:



Q45. What is

entropy(Series([10,0,0,0,0]))

A45:



Q46. What is

entropy(Series([1,1,1,1,1]))

A46:



Q47. What is

gini_simpson(Series([2,2,2,2,2]))

A47: