Working with Open Data Midterm (March 18, 2014)
There are 94 points in this exam: 2 points for each of the 47 questions. Questions are either multiple choice or short answer. For multiple-choice questions, just write the number of the choice you select.
Name: ______________________________________
Consider this code to construct a DataFrame of populations of countries.
In [1]:
    
import json
import requests
from pandas import DataFrame
# read population in from JSON-formatted data derived from Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
     "raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list= requests.get(pop_json_url).json()
df = DataFrame(pop_list)
df[:5]
    
    Out[1]:
Note the dtypes of the columns.
In [2]:
    
df.dtypes
    
    Out[2]:
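The first group of questions relies on boolean filtering with .str.startswith. Here is a minimal sketch on a toy DataFrame (made-up rows, not the Wikipedia data), just to illustrate the pattern:
from pandas import DataFrame
# toy frame with the same integer column labels (0, 1, 2) as df above
toy = DataFrame([[1, 'Canada', 35], [2, 'Chile', 17], [3, 'Brazil', 200]])
toy[1].str.startswith('C')               # boolean Series: True, True, False
toy[toy[1].str.startswith('C')]          # keeps only the rows whose name starts with 'C'
sum(toy[toy[1].str.startswith('C')][2])  # 35 + 17 == 52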
Q1: What is the relationship between s and the population of China, where s is defined as
s = sum(df[df[1].str.startswith('C')][2])
1. s is greater than the population of China
2. s is the same as the population of China
3. s is less than the population of China
4. s is not a number
A1:
Q2: What is the relationship between s2 and the population of China, where s2 is defined by:
s2 = sum(df[df[1].str.startswith('X')][2])
1. s2 is greater than the population of China
2. s2 is the same as the population of China
3. s2 is less than the population of China
4. s2 is not a number
A2:
Q3: What happens when the following statement is run?
df.columns = ['Number','Country','Population']
1. df gets a new attribute called columns
2. df's columns are renamed based on the list
A3:
Q4: This statement does the following
df.columns = ['Number','Country']
1. df gets a new attribute called columns
2. df's columns are renamed based on the list
A4:
Q5: How would you rewrite the following statement to get the same result as
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
A5:
Q6. What is
len(df[df["Population"] > 1000000000])
A6:
Q7. What is
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
A7:
Q8. What is
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
A8:
In [11]:
    
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(-1,4))
s1
    
    Out[11]:
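As a refresher before the next questions, elementwise Series operations behave as follows on a toy series (made-up values, not s1):
from pandas import Series
import numpy as np
t = Series([2, 5, 7])
t + 1                          # elementwise addition: 3, 6, 8
t.apply(lambda k: 2*k).sum()   # double each entry, then sum: 28
t.cumsum()                     # running totals: 2, 7, 14
np.any(t > 6)                  # True -- at least one entry exceeds 6
np.all(t > 1)                  # True -- every entry exceeds 1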
Q9: What is
s1 + 1
A9:
Q10: What is
s1.apply(lambda k: 2*k).sum()
A10:
Q11: What is
s1.cumsum()[3]
A11:
Q12: What is
s1.cumsum() - s1.cumsum()
A12:
Q13. What is
len(s1.cumsum() - s1.cumsum())
A13:
Q14: What is
np.any(s1 > 2)
A14:
Q15. What is
np.all(s1<3)
A15:
Consider the following code to load population(s) from the Census API.
In [19]:
    
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
    
    Out[19]:
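The settings module imported above is assumed to be a small local file along these lines (a sketch, not the actual file used in the course):
# settings.py -- hypothetical sketch
# CENSUS_KEY holds the personal API key issued by the Census Bureau;
# the census library sends it along with every request.
CENSUS_KEY = "your-census-api-key"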
Q16: What is the purpose of settings.CENSUS_KEY?
A16:
Q17. When we run
pip install census
we are:
A17:
Consider r1 and r2:
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*'})
Q18: What is the difference between r1 and r2?
A18:
Q19. What's the relationship between len(r1) and len(r2)?
1. len(r1) is less than len(r2)
2. len(r1) equals len(r2)
3. len(r1) is greater than len(r2)
A19:
Q20: Which is a correct geographic hierarchy?
(Here, "Nation > States" means that Nation is subdivided into States.)
A20:
In [21]:
    
from pandas import DataFrame
import numpy as np
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)
df1.head()
    
    Out[21]:
In [22]:
    
len(df1)
    
    Out[22]:
Q21: Why does df1 have 52 items? Please explain.
A21:
Consider the following two expressions:
In [23]:
    
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()
    
    
Q22: Why is df1.P0010001.sum() different from df1.P0010001.astype(int).sum()?
A22:
Q23: Describe the output of the following:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
A23:
Q24: After running:
df1.set_index('NAME', inplace=True)
how would you access the Series for the state of Nebraska?
df1['Nebraska']df1[1]df1.ix['Nebraska']df1[df1['NAME'] == 'Nebraska']A24:
Q25. What is len(states.STATES)?
A25:
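The next questions use np.in1d, which tests elementwise membership of one array in another. A minimal sketch with made-up codes:
import numpy as np
# True wherever the element on the left appears in the list on the right
np.in1d(['06', '11', '99'], ['06', '11'])   # array([ True,  True, False])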
Q26. What is
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
A26:
In the next question, we will make use of the negation operator ~. Take a look at a specific example:
In [28]:
    
~Series([True, True, False, True])
    
    Out[28]:
Q27. What is
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
A27:
Consider pop1 and pop2:
In [30]:
    
pop1 = df1['P0010001'].astype('int').sum() 
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()
pop1-pop2
    
    Out[30]:
Q28. What does pop1 - pop2 represent?
A28:
Q29. Given that
range(10)
is
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
how would you get the total of every integer from 1 to 100?
sum(range(1, 101))sum(range(100))sum(range(1, 100))A29:
Q30. What output is produced from
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
A30:
Q31. Recalling that
1+2+3+...+100 = 5050
what is:
(2*Series(np.arange(101))).sum()
A31:
Consider the following generator that we used to query for census places.
In [34]:
    
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
    
    for state in us.states.STATES:
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place
    
Now we compute a DataFrame for the places: places_df
In [35]:
    
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
    
    
    Out[35]:
We display the most populous places in California.
In [36]:
    
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
    
    Out[36]:
Q32. Given
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
is
| | NAME | P0010001 | place | state |
|---|---|---|---|---|
| 2714 | Los Angeles city | 3792621 | 44000 | 06 | 
| 3112 | San Diego city | 1307402 | 66000 | 06 | 
| 3122 | San Jose city | 945942 | 68000 | 06 | 
| 3116 | San Francisco city | 805235 | 67000 | 06 | 
| 2425 | Fresno city | 494665 | 27000 | 06 | 
5 rows × 4 columns
what is
places_df.ix[3122]['label']
after we add the label column with:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
A32:
Q33. What is
places_df["NAME"][3122]
A33:
Now let's set up a DataFrame with some letters and properties of letters.
In [39]:
    
# numpy and pandas related imports 
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# for example, using lower and uppercase English letters
import string
lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')
df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()
    
    Out[39]:
Note that string.upper takes a letter and returns its uppercase version.  For example:
In [40]:
    
string.upper('b')
    
    Out[40]:
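The next two questions apply functions column-wise and row-wise. A minimal sketch on a toy DataFrame (made-up values, not df2):
import string
from pandas import DataFrame
toy = DataFrame({'a': ['x', 'y'], 'b': ['1', '2']})
toy['a'].apply(string.upper)                  # column-wise: 'X', 'Y'
toy.apply(lambda s: s['a'] + s['b'], axis=1)  # row-wise: 'x1', 'y2'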
Q34. What is
np.all(df2['lower'].apply(string.upper) == df2['upper'])
A34:
Q35. What is
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
A35:
Please remind yourself what enumerate does.
In [43]:
    
words = ['Berkeley', 'I', 'School']
for (i, word) in islice(enumerate(words),1):
    print (i, word)
    
    
Q36. What is
list(enumerate(words))[2][1]
A36:
Now consider the generator g2
In [45]:
    
def g2():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        if word != 'I':
            for letter in list(word):
                yield letter
            
my_g2 = g2()
    
Q37. What is
len(list(my_g2))
A37:
In [47]:
    
def g3():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        yield words
    
Q38. What is
len(list(g3()))
A38:
Consider using groupby with a DataFrame with states.
In [49]:
    
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME'):
    geo={'for':'state:*'}
    
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r
            
# make a dataframe from the total populations of states in the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s:s[0])
df.head()
    
    Out[49]:
For reference, here's a list of all the states
In [50]:
    
print list(df.NAME)
    
    
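The next questions use groupby. A minimal sketch on made-up fruit names (not the states DataFrame), showing the two idioms involved:
from pandas import DataFrame
toy = DataFrame({'NAME': ['apple', 'apricot', 'banana'],
                 'first_letter': ['a', 'a', 'b']})
toy.groupby('first_letter').apply(lambda g: list(g.NAME))['a']   # ['apple', 'apricot']
toy.groupby('first_letter').agg('count')['NAME']['b']            # 1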
Q39. What is
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
A39:
Q40. What is
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
A40:
Q41. What is
df.groupby('first_letter').agg('count')['first_letter']['P']
A41:
Q42. What is
len(df.groupby('NAME'))
A42:
Recall the code from the diversity calculations
In [55]:
    
def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total
def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    
    # eliminate 0s
    series1 = series[series!=0]
    # if len(series) < 2 (i.e., 0 or 1) then return 0
    
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
    
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0
def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1-np.sum(s1*s1)
    
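A toy usage sketch with made-up counts (not the inputs in the questions below), relying on the functions defined above:
from pandas import Series
counts = Series([6, 3, 1])    # made-up counts for three categories
normalize(counts)             # 0.6, 0.3, 0.1 -- proportions that sum to 1.0
entropy(counts)               # Shannon index of the nonzero counts, scaled by the
                              # maximum possible entropy for a series of this length
gini_simpson(counts)          # 1 - (0.6**2 + 0.3**2 + 0.1**2)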
Q43. Suppose you have 10 people and 5 categories; how would you maximize the Shannon entropy?
A43:
Q44. What is
entropy(Series([0,0,10,0,0]))
A44:
Q45. What is
entropy(Series([10,0,0,0,0]))
A45:
Q46. What is
entropy(Series([1,1,1,1,1]))
A46:
Q47. What is
gini_simpson(Series([2,2,2,2,2]))
A47: