Working with Open Data Midterm (March 18, 2014)
There are 94 points in this exam: 2 each for the 47 questions. The questions are either multiple choice or short answers. For multiple choice, just write the number of the choice selected.
Name: ______________________________________
Consider this code to construct a DataFrame of populations of countries.
In [49]:
import json
import requests
from pandas import DataFrame
# read population in from JSON-formatted data derived from Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
    "raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list = requests.get(pop_json_url).json()
df = DataFrame(pop_list)
df[:5]
Note the dtypes of the columns.
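As a quick illustration (with made-up numbers, not the Wikipedia data), a DataFrame built from mixed lists keeps strings as `object` dtype while numeric columns come out as integers:

```python
import pandas as pd

# Hypothetical miniature of the populations DataFrame: rank, name, population
demo = pd.DataFrame([[1, 'China', 1339724852],
                     [2, 'India', 1210193422]])
print(demo.dtypes)  # columns 0 and 2 are numeric; column 1 (names) is object
```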
In [ ]:
df.dtypes
In [ ]:
s = sum(df[df[1].str.startswith('C')][2])
s
Q1: What is the relationship between s and the population of China, where s is defined as
s = sum(df[df[1].str.startswith('C')][2])
1. s is greater than the population of China
2. s is the same as the population of China
3. s is less than the population of China
4. s is not a number.
A1:
1
In [ ]:
s2 = sum(df[df[1].str.startswith('X')][2])
s2
Q2: What is the relationship between s2 and the population of China, where s2 is defined by:
s2 = sum(df[df[1].str.startswith('X')][2])
1. s2 is greater than the population of China
2. s2 is the same as the population of China
3. s2 is less than the population of China
4. s2 is not a number.
A2:
3
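For intuition (with hypothetical numbers), filtering by first letter and summing behaves as follows; a pattern that matches no rows sums to 0:

```python
import pandas as pd

# Hypothetical data: no country name starts with 'X', so that sum is 0
countries = pd.DataFrame({'name': ['China', 'Canada', 'India'],
                          'pop': [1339724852, 33476688, 1210193422]})
c_total = countries[countries['name'].str.startswith('C')]['pop'].sum()
x_total = countries[countries['name'].str.startswith('X')]['pop'].sum()
```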
In [ ]:
df.columns = ['Number','Country','Population']
Q3: What happens when the following statement is run?
df.columns = ['Number','Country','Population']
- df gets a new attribute called columns
- df's columns are renamed based on the list
A3:
3
In [ ]:
try:
    df.columns = ['Number','Country']
except Exception as e:
    print e
Q4: This statement does the following:
df.columns = ['Number','Country']
- df gets a new attribute called columns
- df's columns are renamed based on the list
A4:
4
In [ ]:
df.columns = ['Number','Country','Population']
s=sum(df[df['Country'].str.startswith('C')]['Population'])
s
Q5: How would you rewrite the following statement to get the same result as
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
A5:
s=sum(df[df['Country'].str.startswith('C')]['Population'])
In [ ]:
len(df[df["Population"]>1000000000])
Q6. What is
len(df[df["Population"] > 1000000000])
A6:
2
In [ ]:
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
Q7. What is
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
A7:
C;I
In [ ]:
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
Q8. What is
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
A8:
3
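The same chain can be replayed on a toy DataFrame (hypothetical values), where only the billion-plus entries survive the filter:

```python
import pandas as pd

# Hypothetical populations; only China and India exceed one billion
demo = pd.DataFrame({'Country': ['China', 'India', 'United States'],
                     'Population': [1339724852, 1210193422, 308745538]})
firsts = ";".join(demo[demo['Population'] > 1000000000]['Country']
                  .apply(lambda s: s[0]))
```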
In [ ]:
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(-1,4))
s1
In [ ]:
s1 + 1
Q9: What is
s1 + 1
A9:
0    0
1    1
2    2
3    3
4    4
In [ ]:
s1.apply(lambda k: 2*k).sum()
Q10: What is
s1.apply(lambda k: 2*k).sum()
A10:
10
In [ ]:
s1.cumsum()[3]
Q11: What is
s1.cumsum()[3]
A11:
2
In [ ]:
s1.cumsum() - s1.cumsum()
Q12: What is
s1.cumsum() - s1.cumsum()
A12:
0    0
1    0
2    0
3    0
4    0
In [ ]:
len(s1.cumsum() - s1.cumsum())
Q13. What is
len(s1.cumsum() - s1.cumsum())
A13:
5
In [ ]:
np.any(s1 > 2)
Q14: What is
np.any(s1 > 2)
A14:
True
In [ ]:
np.all(s1<3)
Q15. What is
np.all(s1<3)
A15:
False
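The answers above can be checked directly; this sketch replays the same Series operations:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(-1, 4))              # values -1, 0, 1, 2, 3
shifted = list(s1 + 1)                        # element-wise: [0, 1, 2, 3, 4]
doubled_sum = s1.apply(lambda k: 2*k).sum()   # 2 * (-1+0+1+2+3) = 10
running = list(s1.cumsum())                   # [-1, -1, 0, 2, 5]
any_big = bool(np.any(s1 > 2))                # True: 3 > 2
all_small = bool(np.all(s1 < 3))              # False: 3 is not < 3
```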
Consider the following code to load population(s) from the Census API.
In [ ]:
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
Q16: What is the purpose of settings.CENSUS_KEY?
A16:
2
Q17. When we run pip install census we are:
A17:
1
Consider r1 and r2:
In [ ]:
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
len(r1), len(r2)
Q18: What is the difference between r1 and r2?
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
A18:
r1 is a list holding the name and total population from the 2010 US Census for every county in California.
r2 holds the name and total population from the 2010 US Census for every county in all US states, DC, and Puerto Rico.
Q19. What's the relationship between len(r1) and len(r2)?
1. len(r1) is less than len(r2)
2. len(r1) equals len(r2)
3. len(r1) is greater than len(r2)
A19:
1
Q20: Which is a correct geographic hierarchy?
(Notation: Nation > States means the Nation is subdivided into States.)
A20:
3
In [87]:
from pandas import DataFrame
import numpy as np
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)
df1.head()
Out[87]:
In [88]:
len(df1)
Out[88]:
Q21: Why does df1 have 52 items? Please explain.
A21:
When queried for "states", the US Census API returns data for the 50 states, the District of Columbia, and Puerto Rico: (50+1+1 = 52 entities).
Consider the two following expressions:
In [89]:
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()
Q22: Why is df1.P0010001.sum() different from df1.P0010001.astype(int).sum()?
A22:
The data type of df1.P0010001 is a string. Hence, performing sum on it concatenates the string representation of populations into a longer string. In contrast, once df1.P0010001 is converted into integers via df1.P0010001.astype(int), a sum operation adds up all the populations into a single integer.
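A two-line sketch of the same pitfall, with made-up values:

```python
import pandas as pd

s = pd.Series(['12', '34'])      # string column, as returned by the API
concatenated = s.sum()           # '+' on strings concatenates: '1234'
numeric = s.astype(int).sum()    # cast first, then add: 46
```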
In [90]:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
Out[90]:
Q23: Describe the output of the following:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
A23: A DataFrame (with 5 rows and 2 columns (NAME, P0010001)) listing the 5 least populous states in ascending order by population.
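As a sketch with hypothetical state rows (modern pandas spells the deprecated .sort(...) as .sort_values(...)):

```python
import pandas as pd

demo = pd.DataFrame({'NAME': ['Alaska', 'Wyoming', 'Vermont'],
                     'P0010001': [710231, 563626, 625741]})
# ascending sort puts the least populous rows first
smallest = demo.sort_values('P0010001', ascending=True).head(2)
```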
In [91]:
df1.set_index('NAME', inplace=True)
df1.ix['Nebraska']
Out[91]:
Q24: After running:
df1.set_index('NAME', inplace=True)
how would you access the Series for the state of Nebraska?
1. df1['Nebraska']
2. df1[1]
3. df1.ix['Nebraska']
4. df1[df1['NAME'] == 'Nebraska']
A24:
3
In [ ]:
len(states.STATES)
Q25. What is len(states.STATES)?
A25:
51
In [ ]:
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
Q26. What is
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
A26:
51
In the next question, we will make use of the negation operator ~. Take a look at a specific example:
In [ ]:
~Series([True, True, False, True])
In [ ]:
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
Q27. What is
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
A27:
Puerto Rico
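The membership test can be sketched with hypothetical FIPS codes (np.isin is the modern spelling of np.in1d):

```python
import numpy as np
import pandas as pd

fips = pd.Series(['06', '36', '72'])   # hypothetical codes: CA, NY, PR
state_fips = ['06', '36']              # Puerto Rico ('72') is not a state
mask = np.isin(fips, state_fips)       # element-wise membership test
non_states = list(fips[~mask])         # ~ negates the boolean mask
```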
Consider pop1 and pop2:
In [56]:
pop1 = df1['P0010001'].astype('int').sum()
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()
pop1-pop2
Q28. What does pop1 - pop2 represent?
A28: The population of Puerto Rico in the 2010 Census.
In [57]:
sum(range(1, 101))
Out[57]:
Q29. Given that range(10) is
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
how do you get the total of every integer from 1 to 100?
1. sum(range(1, 101))
2. sum(range(100))
3. sum(range(1, 100))
A29:
1
In [58]:
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
Q30. What output is produced from
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
A30:
0
1
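In Python 3 the same idea is written with next(c); islice is the usual way to cap the infinite iterator:

```python
from itertools import count, islice

c = count(0, 1)                   # yields 0, 1, 2, ... forever
first_two = [next(c), next(c)]    # Python 3 spelling of c.next()
next_three = list(islice(c, 3))   # take the next 3 values: [2, 3, 4]
```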
In [59]:
(2*Series(np.arange(101))).sum()
Out[59]:
Q31. Recalling that
1+2+3+...+100 = 5050
what is:
(2*Series(np.arange(101))).sum()
A31:
10100
Consider the following generator that we used to query for census places.
In [60]:
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
    for state in us.states.STATES:
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place
Now we compute a DataFrame for the places: places_df
In [61]:
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
Out[61]:
We display the most populous places from California
In [62]:
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
Out[62]:
In [63]:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
places_df.ix[3122]['label']
Out[63]:
Q32. Given that
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
is

|      | NAME               | P0010001 | place | state |
|------|--------------------|----------|-------|-------|
| 2714 | Los Angeles city   | 3792621  | 44000 | 06    |
| 3112 | San Diego city     | 1307402  | 66000 | 06    |
| 3122 | San Jose city      | 945942   | 68000 | 06    |
| 3116 | San Francisco city | 805235   | 67000 | 06    |
| 2425 | Fresno city        | 494665   | 27000 | 06    |

what is
places_df.ix[3122]['label']
after we add the label column with:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
A32:
0668000
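Row-wise apply (axis=1) is what lets the label combine two columns; a one-row sketch:

```python
import pandas as pd

demo = pd.DataFrame({'state': ['06'], 'place': ['68000']})
# axis=1 passes each row to the lambda, so both columns are visible at once
demo['label'] = demo.apply(lambda row: row['state'] + row['place'], axis=1)
```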
In [64]:
places_df["NAME"][3122]
Out[64]:
Q33. What is
places_df["NAME"][3122]
A33:
San Jose city
Now let's set up a DataFrame with some letters and properties of letters.
In [65]:
# numpy and pandas related imports
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# for example, using lower and uppercase English letters
import string
lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')
df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()
Out[65]:
Note that string.upper takes a letter and returns its uppercase version. For example:
In [66]:
string.upper('b')
Out[66]:
In [67]:
np.all(df2['lower'].apply(string.upper) == df2['upper'])
Out[67]:
Q34. What is
np.all(df2['lower'].apply(string.upper) == df2['upper'])
A34:
True
In [68]:
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
Out[68]:
Q35. What is
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
A35:
gG
Please remind yourself what enumerate does.
In [69]:
words = ['Berkeley', 'I', 'School']
for (i, word) in islice(enumerate(words), 1):
    print (i, word)
In [70]:
list(enumerate(words))[2][1]
Out[70]:
Q36. What is
list(enumerate(words))[2][1]
A36:
School
Now consider the generator g2
In [71]:
def g2():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        if word != 'I':
            for letter in list(word):
                yield letter
my_g2 = g2()
In [72]:
len(list(my_g2))
Out[72]:
Q37. What is
len(list(my_g2))
A37:
14
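To see why the count is 14: the generator skips 'I' and yields one letter at a time, so it produces len('Berkeley') + len('School') = 8 + 6 items. A Python 3 sketch:

```python
def letters(words):
    # yield individual letters, skipping the word 'I'
    for word in words:
        if word != 'I':
            for letter in word:
                yield letter

n = len(list(letters(['Berkeley', 'I', 'School'])))   # 8 + 6 = 14
```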
In [73]:
def g3():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        yield words
In [74]:
len(list(g3()))
Out[74]:
Q38. What is
len(list(g3()))
A38:
3
Consider using groupby with a DataFrame of states.
In [75]:
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME'):
    geo = {'for':'state:*'}
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r

# make a dataframe from the total populations of states in the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s: s[0])
df.head()
Out[75]:
For reference, here's a list of all the states
In [76]:
print list(df.NAME)
In [77]:
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
Out[77]:
Q39. What is
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
A39:
[u'California', u'Colorado', u'Connecticut']
In [78]:
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
Out[78]:
Q40. What is
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
A40:
4
In [79]:
df.groupby('first_letter').agg('count')['first_letter']['P']
Out[79]:
Q41. What is
df.groupby('first_letter').agg('count')['first_letter']['P']
A41:
1
In [80]:
len(df.groupby('NAME'))
Out[80]:
Q42. What is
len(df.groupby('NAME'))
A42:
51
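The group counts can be sketched with a hypothetical frame: grouping on an all-unique column gives one group per row, while grouping on a shared key collapses them:

```python
import pandas as pd

demo = pd.DataFrame({'NAME': ['Alabama', 'Alaska', 'Arizona'],
                     'first_letter': ['A', 'A', 'A']})
n_name_groups = len(demo.groupby('NAME'))            # 3: every NAME unique
n_letter_groups = len(demo.groupby('first_letter'))  # 1: all start with 'A'
```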
Recall the code from the diversity calculations
In [81]:
def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total

def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    # eliminate 0s
    series1 = series[series != 0]
    # if len(series1) < 2 (i.e., 0 or 1) then return 0
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0

def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1 - np.sum(s1*s1)
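A plain-Python restatement of the normalized Shannon index above (no pandas), useful for checking the answers that follow:

```python
import math

def normalized_entropy(counts):
    nonzero = [c for c in counts if c != 0]
    if len(nonzero) < 2:           # 0 or 1 occupied categories: entropy 0
        return 0.0
    total = float(sum(nonzero))
    max_s = math.log(len(counts))  # maximum entropy for this many categories
    return sum(-(c/total) * math.log(c/total) for c in nonzero) / max_s

even = normalized_entropy([1, 1, 1, 1, 1])           # evenly spread -> 1.0
concentrated = normalized_entropy([0, 0, 10, 0, 0])  # one category -> 0.0
```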
Q43. Suppose you have 10 people and 5 categories, how you would you maximize the Shannon entropy?
A43:
3
In [82]:
entropy(Series([0,0,10,0,0]))
Out[82]:
Q44. What is
entropy(Series([0,0,10,0,0]))
A44:
0
In [83]:
entropy(Series([10,0,0,0,0]))
Out[83]:
Q45. What is
entropy(Series([10,0,0,0,0]))
A45:
0
In [84]:
entropy(Series([1,1,1,1,1]))
Out[84]:
Q46. What is
entropy(Series([1,1,1,1,1]))
A46:
1
In [85]:
gini_simpson(Series([2,2,2,2,2]))
Out[85]:
Q47. What is
gini_simpson(Series([2,2,2,2,2]))
A47:
0.8