Working with Open Data Midterm (March 18, 2014)
There are 94 points in this exam: 2 each for the 47 questions. The questions are either multiple choice or short answer. For multiple choice, just write the number of the choice selected.
Name: ______________________________________
Consider this code to construct a DataFrame of populations of countries.
In [1]:
import json
import requests
from pandas import DataFrame
# read population in from JSON-formatted data derived from Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
"raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list= requests.get(pop_json_url).json()
df = DataFrame(pop_list)
df[:5]
Out[1]:
Note the dtypes of the columns.
In [2]:
df.dtypes
Out[2]:
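As a reminder of how a DataFrame built from a list of lists is labeled, here is a minimal sketch. The rows are made-up values in the same [rank, name, count] shape, not the population data above, and the code is written for Python 3 rather than the Python 2 used in this exam.

```python
import pandas as pd

# Hypothetical rows in the same [rank, name, count] shape as the exam data.
rows = [[1, "Apple", 30], [2, "Banana", 20], [3, "Cherry", 10]]
toy = pd.DataFrame(rows)

print(list(toy.columns))   # integer labels, since no column names were given
print(toy.dtypes)          # one dtype per column
print(toy[2].sum())        # select a column by its integer label, then sum it
```

Because no column names are passed, the columns are addressed by position-like integer labels such as toy[2].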
Q1: What is the relationship between s and the population of China, where s is defined as
s = sum(df[df[1].str.startswith('C')][2])
s is greater than the population of China
s is the same as the population of China
s is less than the population of China
s is not a number.
A1:
Q2: What is the relationship between s2 and the population of China, where s2 is defined by:
s2 = sum(df[df[1].str.startswith('X')][2])
s2 is greater than the population of China
s2 is the same as the population of China
s2 is less than the population of China
s2 is not a number.
A2:
Q3: What happens when the following statement is run?
df.columns = ['Number','Country','Population']
df gets a new attribute called columns
df's columns are renamed based on the list
A3:
Q4: What does the following statement do?
df.columns = ['Number','Country']
df gets a new attribute called columns
df's columns are renamed based on the list
A4:
Q5: How would you rewrite the following statement to get the same result as
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
A5:
Q6. What is
len(df[df["Population"] > 1000000000])
A6:
Q7. What is
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
A7:
Q8. What is
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
A8:
In [11]:
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(-1,4))
s1
Out[11]:
Q9: What is
s1 + 1
A9:
Q10: What is
s1.apply(lambda k: 2*k).sum()
A10:
Q11: What is
s1.cumsum()[3]
A11:
Q12: What is
s1.cumsum() - s1.cumsum()
A12:
Q13. What is
len(s1.cumsum() - s1.cumsum())
A13:
Q14: What is
np.any(s1 > 2)
A14:
Q15. What is
np.all(s1<3)
A15:
Consider the following code to load population(s) from the Census API.
In [19]:
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
Out[19]:
Q16: What is the purpose of settings.CENSUS_KEY?
A16:
Q17. When we run
pip install census
we are:
A17:
Consider r1 and r2:
Q18: What is the difference between r1 and r2?
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
A18:
Q19. What's the relationship between len(r1) and len(r2)?
len(r1) is less than len(r2)
len(r1) equals len(r2)
len(r1) is greater than len(r2)
A19:
Q20: Which is a correct geographic hierarchy?
Nation > States = Nation is subdivided into States
A20:
In [21]:
from pandas import DataFrame
import numpy as np
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)
df1.head()
Out[21]:
In [22]:
len(df1)
Out[22]:
Q21: Why does df1 have 52 items? Please explain.
A21:
Consider the two following expressions:
In [23]:
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()
Q22: Why is df1.P0010001.sum() different from df1.P0010001.astype(int).sum()?
A22:
Q23: Describe the output of the following:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
A23:
Q24: After running:
df1.set_index('NAME', inplace=True)
how would you access the Series for the state of Nebraska?
df1['Nebraska']
df1[1]
df1.ix['Nebraska']
df1[df1['NAME'] == 'Nebraska']
A24:
Q25. What is len(states.STATES)?
A25:
Q26. What is
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
A26:
In the next question, we will make use of the negation operator ~. Take a look at a specific example:
In [28]:
~Series([True, True, False, True])
Out[28]:
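To make the role of ~ in filtering concrete, here is a minimal sketch on made-up codes (the values and the names codes and known are hypothetical, not the census data). It uses Series.isin to build the boolean mask, which plays the same role as np.in1d in the questions; it is written for Python 3.

```python
import pandas as pd

# Hypothetical toy data, not the census data used in the exam.
codes = pd.Series(["01", "02", "99", "04"])
known = ["01", "02", "04"]

mask = codes.isin(known)   # True where the code appears in the known list
unknown = codes[~mask]     # ~ flips every boolean, keeping the rows NOT in the list

print(list(mask))
print(list(unknown))
```

Indexing with mask keeps the matching rows, while indexing with ~mask keeps exactly the rows that the mask rejects.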
Q27. What is
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
A27:
Consider pop1 and pop2:
In [30]:
pop1 = df1['P0010001'].astype('int').sum()
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()
pop1-pop2
Out[30]:
Q28. What does pop1 - pop2 represent?
A28:
Q29. Given that
range(10)
is
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
How would you get the total of every integer from 1 to 100?
sum(range(1, 101))
sum(range(100))
sum(range(1, 100))
A29:
Q30. What output is produced from
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# Makes an iterator that returns evenly spaced values, starting with start.
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
A30:
Q31. Recalling that
1+2+3+...+100 = 5050
what is:
(2*Series(np.arange(101))).sum()
A31:
Consider the following generator that we used to query for census places.
In [34]:
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
    for state in us.states.STATES:
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place
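The nested-loop generator pattern above can be tried without an API key. The sketch below is an illustration only: FAKE_RESULTS and fake_places are hypothetical stand-ins for the per-state API responses, and the code is written for Python 3. It shows how islice limits how much of a generator is consumed.

```python
from itertools import islice

# Hypothetical stand-in for the per-state API results (not real census data).
FAKE_RESULTS = {
    "state_a": ["place_1", "place_2"],
    "state_b": ["place_3"],
    "state_c": ["place_4", "place_5", "place_6"],
}

def fake_places():
    """Yield (state, place) pairs one at a time, like the nested loops in places()."""
    for state, place_names in FAKE_RESULTS.items():
        for name in place_names:
            yield (state, name)

# islice(gen, 2) stops after two items; the rest are never produced.
first_two = list(islice(fake_places(), 2))

# islice(gen, None) applies no limit, so it behaves like list(gen).
everything = list(islice(fake_places(), None))

print(first_two)
print(len(everything))
```

Passing a small number to islice is a convenient way to test a slow generator before pulling every item.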
Now we compute a DataFrame for the places: places_df
In [35]:
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
Out[35]:
We display the most populous places from California
In [36]:
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
Out[36]:
Q32. Given that
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
is
| | NAME | P0010001 | place | state |
|---|---|---|---|---|
| 2714 | Los Angeles city | 3792621 | 44000 | 06 |
| 3112 | San Diego city | 1307402 | 66000 | 06 |
| 3122 | San Jose city | 945942 | 68000 | 06 |
| 3116 | San Francisco city | 805235 | 67000 | 06 |
| 2425 | Fresno city | 494665 | 27000 | 06 |
5 rows × 4 columns
what is
places_df.ix[3122]['label']
after we add the label column with:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
A32:
Q33. What is
places_df["NAME"][3122]
A33:
Now let's set up a DataFrame with some letters and properties of letters.
In [39]:
# numpy and pandas related imports
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# for example, using lower and uppercase English letters
import string
lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')
df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()
Out[39]:
Note that string.upper takes a letter and returns its uppercase version. For example:
In [40]:
string.upper('b')
Out[40]:
Q34. What is
np.all(df2['lower'].apply(string.upper) == df2['upper'])
A34:
Q35. What is
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
A35:
Please remind yourself what enumerate does.
In [43]:
words = ['Berkeley', 'I', 'School']
for (i, word) in islice(enumerate(words),1):
    print (i, word)
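As a refresher on a different list, the following minimal sketch shows what enumerate produces and how islice truncates it. The list colors is a made-up example, and the code is written for Python 3.

```python
from itertools import islice

colors = ["red", "green", "blue"]  # hypothetical sample list

# enumerate pairs each item with its position, counting from 0.
pairs = list(enumerate(colors))

# islice keeps only the first two (index, item) pairs.
first_two = list(islice(enumerate(colors), 2))

print(pairs)
print(first_two)
```

Each element of pairs is a tuple, so an index such as pairs[1][0] picks the position and pairs[1][1] picks the item.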
Q36. What is
list(enumerate(words))[2][1]
A36:
Now consider the generator g2
In [45]:
def g2():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        if word != 'I':
            for letter in list(word):
                yield letter
my_g2 = g2()
Q37. What is
len(list(my_g2))
A37:
In [47]:
def g3():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        yield words
Q38. What is
len(list(g3()))
A38:
Consider using groupby with a DataFrame with states.
In [49]:
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME'):
    geo={'for':'state:*'}
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r
# make a dataframe from the total populations of states in the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s:s[0])
df.head()
Out[49]:
For reference, here's a list of all the states
In [50]:
print list(df.NAME)
Q39. What is
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
A39:
Q40. What is
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
A40:
Q41. What is
df.groupby('first_letter').agg('count')['first_letter']['P']
A41:
Q42. What is
len(df.groupby('NAME'))
A42:
Recall the code from the diversity calculations
In [55]:
def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total
def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    # eliminate 0s
    series1 = series[series!=0]
    # if len(series) < 2 (i.e., 0 or 1) then return 0
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0
def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1-np.sum(s1*s1)
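Read as formulas, the functions above compute the following. The symbols are labels introduced here for illustration, not taken from the original notebook: x_i are the entries of the input Series, n is its length (zeros included), and p_i is each entry's share of the total.

```latex
p_i = \frac{x_i}{\sum_{j} x_j}, \qquad
H_{\text{norm}} = \frac{-\sum_{i:\,x_i \neq 0} p_i \ln p_i}{\ln n}, \qquad
D_{\text{GS}} = 1 - \sum_{i} p_i^{2}
```

The expression for H_norm applies only when at least two entries are nonzero; the else branch of entropy handles the remaining case separately.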
Q43. Suppose you have 10 people and 5 categories. How would you maximize the Shannon entropy?
A43:
Q44. What is
entropy(Series([0,0,10,0,0]))
A44:
Q45. What is
entropy(Series([10,0,0,0,0]))
A45:
Q46. What is
entropy(Series([1,1,1,1,1]))
A46:
Q47. What is
gini_simpson(Series([2,2,2,2,2]))
A47: