Working with Open Data Midterm (March 18, 2014)
There are 94 points in this exam: 2 each for the 47 questions. The questions are either multiple choice or short answers. For multiple choice, just write the number of the choice selected.
Name: ______________________________________
Consider this code to construct a DataFrame of populations of countries.
In [49]:
import json
import requests
from pandas import DataFrame
# read population in from JSON-formatted data derived from Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
    "raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list = requests.get(pop_json_url).json()
df = DataFrame(pop_list)
df[:5]
Note the dtypes of the columns.
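As a quick illustration (with made-up numbers, not the Wikipedia data), a DataFrame built from mixed lists keeps strings as `object` dtype while numeric columns come out as integers:

```python
import pandas as pd

# Hypothetical miniature of the populations DataFrame: rank, name, population
demo = pd.DataFrame([[1, 'China', 1339724852],
                     [2, 'India', 1210193422]])
print(demo.dtypes)  # columns 0 and 2 are numeric; column 1 (names) is object
```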
In [ ]:
df.dtypes
In [ ]:
s = sum(df[df[1].str.startswith('C')][2])
s
Q1: What is the relationship between s and the population of China, where s is defined as
s = sum(df[df[1].str.startswith('C')][2])
1. s is greater than the population of China
2. s is the same as the population of China
3. s is less than the population of China
4. s is not a number.
A1:
1
In [ ]:
s2 = sum(df[df[1].str.startswith('X')][2])
s2
Q2: What is the relationship between s2 and the population of China, where s2 is defined by:
s2 = sum(df[df[1].str.startswith('X')][2])
1. s2 is greater than the population of China
2. s2 is the same as the population of China
3. s2 is less than the population of China
4. s2 is not a number.
A2:
3
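For intuition (with hypothetical numbers), filtering by first letter and summing behaves as follows; a pattern that matches no rows sums to 0:

```python
import pandas as pd

# Hypothetical data: no country name starts with 'X', so that sum is 0
countries = pd.DataFrame({'name': ['China', 'Canada', 'India'],
                          'pop': [1339724852, 33476688, 1210193422]})
c_total = countries[countries['name'].str.startswith('C')]['pop'].sum()
x_total = countries[countries['name'].str.startswith('X')]['pop'].sum()
```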
In [ ]:
df.columns = ['Number','Country','Population']
Q3: What happens when the following statement is run?
df.columns = ['Number','Country','Population']
- df gets a new attribute called columns
- df's columns are renamed based on the list
A3:
3
In [ ]:
try:
    df.columns = ['Number','Country']
except Exception as e:
    print e
Q4: This statement does the following:
df.columns = ['Number','Country']
- df gets a new attribute called columns
- df's columns are renamed based on the list
A4:
4
In [ ]:
df.columns = ['Number','Country','Population']
s=sum(df[df['Country'].str.startswith('C')]['Population'])
s
Q5: How would you rewrite the following statement to get the same result as
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
A5:
s=sum(df[df['Country'].str.startswith('C')]['Population'])
In [ ]:
len(df[df["Population"]>1000000000])
Q6. What is
len(df[df["Population"] > 1000000000])
A6:
2
In [ ]:
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
Q7. What is
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
A7:
C;I
In [ ]:
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
Q8. What is
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
A8:
3
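The same chain can be replayed on a toy DataFrame (hypothetical values), where only the billion-plus entries survive the filter:

```python
import pandas as pd

# Hypothetical populations; only China and India exceed one billion
demo = pd.DataFrame({'Country': ['China', 'India', 'United States'],
                     'Population': [1339724852, 1210193422, 308745538]})
firsts = ";".join(demo[demo['Population'] > 1000000000]['Country']
                  .apply(lambda s: s[0]))
```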
In [ ]:
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(-1,4))
s1
In [ ]:
s1 + 1
Q9: What is
s1 + 1
A9:
0    0
1    1
2    2
3    3
4    4
In [ ]:
s1.apply(lambda k: 2*k).sum()
Q10: What is
s1.apply(lambda k: 2*k).sum()
A10:
10
In [ ]:
s1.cumsum()[3]
Q11: What is
s1.cumsum()[3]
A11:
2
In [ ]:
s1.cumsum() - s1.cumsum()
Q12: What is
s1.cumsum() - s1.cumsum()
A12:
0    0
1    0
2    0
3    0
4    0
In [ ]:
len(s1.cumsum() - s1.cumsum())
Q13. What is
len(s1.cumsum() - s1.cumsum())
A13:
5
In [ ]:
np.any(s1 > 2)
Q14: What is
np.any(s1 > 2)
A14:
True
In [ ]:
np.all(s1<3)
Q15. What is
np.all(s1<3)
A15:
False
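The answers above can be checked directly; this sketch replays the same Series operations:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(-1, 4))              # values -1, 0, 1, 2, 3
shifted = list(s1 + 1)                        # element-wise: [0, 1, 2, 3, 4]
doubled_sum = s1.apply(lambda k: 2*k).sum()   # 2 * (-1+0+1+2+3) = 10
running = list(s1.cumsum())                   # [-1, -1, 0, 2, 5]
any_big = bool(np.any(s1 > 2))                # True: 3 > 2
all_small = bool(np.all(s1 < 3))              # False: 3 is not < 3
```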
Consider the following code to load population(s) from the Census API.
In [ ]:
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
Q16: What is the purpose of settings.CENSUS_KEY?
A16:
2
Q17. When we run pip install census we are:
A17:
1
Consider r1 and r2:
In [ ]:
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
len(r1), len(r2)
Q18: What is the difference between r1 and r2?
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
A18:
r1 is a list holding the name and total population from the 2010 US Census for every county in California.
r2 holds the name and total population from the 2010 US Census for every county in all US states, DC, and Puerto Rico.
Q19. What's the relationship between len(r1) and len(r2)?
1. len(r1) is less than len(r2)
2. len(r1) equals len(r2)
3. len(r1) is greater than len(r2)
A19:
1
Q20: Which is a correct geographic hierarchy?
(Notation: Nation > States means the Nation is subdivided into States.)
A20:
3
In [87]:
from pandas import DataFrame
import numpy as np
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)
df1.head()
Out[87]:
In [88]:
len(df1)
Out[88]:
Q21: Why does df1 have 52 items? Please explain.
A21:
When queried for "states", the US Census API returns data for the 50 states, the District of Columbia, and Puerto Rico: (50+1+1 = 52 entities).
Consider the two following expressions:
In [89]:
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()
Q22: Why is df1.P0010001.sum() different from df1.P0010001.astype(int).sum()?
A22:
The data type of df1.P0010001 is a string. Hence, performing sum on it concatenates the string representation of populations into a longer string. In contrast, once df1.P0010001 is converted into integers via df1.P0010001.astype(int), a sum operation adds up all the populations into a single integer.
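A two-line sketch of the same pitfall, with made-up values:

```python
import pandas as pd

s = pd.Series(['12', '34'])      # string column, as returned by the API
concatenated = s.sum()           # '+' on strings concatenates: '1234'
numeric = s.astype(int).sum()    # cast first, then add: 46
```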
In [90]:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
Out[90]:
Q23: Describe the output of the following:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
A23: A DataFrame (with 5 rows and 2 columns (NAME, P0010001)) listing the 5 least populous states in ascending order by population.
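As a sketch with hypothetical state rows (modern pandas spells the deprecated .sort(...) as .sort_values(...)):

```python
import pandas as pd

demo = pd.DataFrame({'NAME': ['Alaska', 'Wyoming', 'Vermont'],
                     'P0010001': [710231, 563626, 625741]})
# ascending sort puts the least populous rows first
smallest = demo.sort_values('P0010001', ascending=True).head(2)
```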
In [91]:
df1.set_index('NAME', inplace=True)
df1.ix['Nebraska']
Out[91]:
Q24: After running:
df1.set_index('NAME', inplace=True)
how would you access the Series for the state of Nebraska?
1. df1['Nebraska']
2. df1[1]
3. df1.ix['Nebraska']
4. df1[df1['NAME'] == 'Nebraska']
A24:
3
In [ ]:
len(states.STATES)
Q25. What is len(states.STATES)?
A25:
51
In [ ]:
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
Q26. What is
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
A26:
51
In the next question, we will make use of the negation operator ~. Take a look at a specific example:
In [ ]:
~Series([True, True, False, True])
In [ ]:
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
Q27. What is
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
A27:
Puerto Rico
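The membership test can be sketched with hypothetical FIPS codes (np.isin is the modern spelling of np.in1d):

```python
import numpy as np
import pandas as pd

fips = pd.Series(['06', '36', '72'])   # hypothetical codes: CA, NY, PR
state_fips = ['06', '36']              # Puerto Rico ('72') is not a state
mask = np.isin(fips, state_fips)       # element-wise membership test
non_states = list(fips[~mask])         # ~ negates the boolean mask
```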
Consider pop1 and pop2:
In [56]:
pop1 = df1['P0010001'].astype('int').sum()
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()
pop1-pop2
Q28. What does pop1 - pop2 represent?
A28: The population of Puerto Rico in the 2010 Census.
In [57]:
sum(range(1, 101))
Out[57]:
Q29. Given that range(10) is
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
how do you get the total of every integer from 1 to 100?
1. sum(range(1, 101))
2. sum(range(100))
3. sum(range(1, 100))
A29:
1
In [58]:
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
Q30. What output is produced from
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with step."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
A30:
0
1
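In Python 3 the same idea is written with next(c); islice is the usual way to cap the infinite iterator:

```python
from itertools import count, islice

c = count(0, 1)                   # yields 0, 1, 2, ... forever
first_two = [next(c), next(c)]    # Python 3 spelling of c.next()
next_three = list(islice(c, 3))   # take the next 3 values: [2, 3, 4]
```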
In [59]:
(2*Series(np.arange(101))).sum()
Out[59]:
Q31. Recalling that
1+2+3+...+100 = 5050
what is:
(2*Series(np.arange(101))).sum()
A31:
10100
Consider the following generator that we used to query for census places.
In [60]:
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
    for state in us.states.STATES:
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place
Now we compute a DataFrame for the places: places_df
In [61]:
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
Out[61]:
We display the most populous places from California
In [62]:
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
Out[62]:
In [63]:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
places_df.ix[3122]['label']
Out[63]:
Q32. Given that
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
is

|      | NAME               | P0010001 | place | state |
|------|--------------------|----------|-------|-------|
| 2714 | Los Angeles city   | 3792621  | 44000 | 06    |
| 3112 | San Diego city     | 1307402  | 66000 | 06    |
| 3122 | San Jose city      | 945942   | 68000 | 06    |
| 3116 | San Francisco city | 805235   | 67000 | 06    |
| 2425 | Fresno city        | 494665   | 27000 | 06    |

what is
places_df.ix[3122]['label']
after we add the label column with:
places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
A32:
0668000
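Row-wise apply (axis=1) is what lets the label combine two columns; a one-row sketch:

```python
import pandas as pd

demo = pd.DataFrame({'state': ['06'], 'place': ['68000']})
# axis=1 passes each row to the lambda, so both columns are visible at once
demo['label'] = demo.apply(lambda row: row['state'] + row['place'], axis=1)
```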
In [64]:
places_df["NAME"][3122]
Out[64]:
Q33. What is
places_df["NAME"][3122]
A33:
San Jose city
Now let's set up a DataFrame with some letters and properties of letters.
In [65]:
# numpy and pandas related imports
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# for example, using lower and uppercase English letters
import string
lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')
df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()
Out[65]:
Note that string.upper takes a letter and returns its uppercase version. For example:
In [66]:
string.upper('b')
Out[66]:
In [67]:
np.all(df2['lower'].apply(string.upper) == df2['upper'])
Out[67]:
Q34. What is
np.all(df2['lower'].apply(string.upper) == df2['upper'])
A34:
True
In [68]:
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
Out[68]:
Q35. What is
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
A35:
gG
Please remind yourself what enumerate does.
In [69]:
words = ['Berkeley', 'I', 'School']
for (i, word) in islice(enumerate(words), 1):
    print (i, word)
In [70]:
list(enumerate(words))[2][1]
Out[70]:
Q36. What is
list(enumerate(words))[2][1]
A36:
School
Now consider the generator g2
In [71]:
def g2():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        if word != 'I':
            for letter in list(word):
                yield letter
my_g2 = g2()
In [72]:
len(list(my_g2))
Out[72]:
Q37. What is
len(list(my_g2))
A37:
14
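To see why the count is 14: the generator skips 'I' and yields one letter at a time, so it produces len('Berkeley') + len('School') = 8 + 6 items. A Python 3 sketch:

```python
def letters(words):
    # yield individual letters, skipping the word 'I'
    for word in words:
        if word != 'I':
            for letter in word:
                yield letter

n = len(list(letters(['Berkeley', 'I', 'School'])))   # 8 + 6 = 14
```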
In [73]:
def g3():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        yield words
In [74]:
len(list(g3()))
Out[74]:
Q38. What is
len(list(g3()))
A38:
3
Consider using groupby with a DataFrame of states.
In [75]:
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME'):
    geo = {'for':'state:*'}
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r

# make a dataframe from the total populations of states in the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s: s[0])
df.head()
Out[75]:
For reference, here's a list of all the states
In [76]:
print list(df.NAME)
In [77]:
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
Out[77]:
Q39. What is
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
A39:
[u'California', u'Colorado', u'Connecticut']
In [78]:
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
Out[78]:
Q40. What is
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
A40:
4
In [79]:
df.groupby('first_letter').agg('count')['first_letter']['P']
Out[79]:
Q41. What is
df.groupby('first_letter').agg('count')['first_letter']['P']
A41:
1
In [80]:
len(df.groupby('NAME'))
Out[80]:
Q42. What is
len(df.groupby('NAME'))
A42:
51
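The group counts can be sketched with a hypothetical frame: grouping on an all-unique column gives one group per row, while grouping on a shared key collapses them:

```python
import pandas as pd

demo = pd.DataFrame({'NAME': ['Alabama', 'Alaska', 'Arizona'],
                     'first_letter': ['A', 'A', 'A']})
n_name_groups = len(demo.groupby('NAME'))            # 3: every NAME unique
n_letter_groups = len(demo.groupby('first_letter'))  # 1: all start with 'A'
```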
Recall the code from the diversity calculations
In [81]:
def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total

def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    # eliminate 0s
    series1 = series[series != 0]
    # if len(series1) < 2 (i.e., 0 or 1) then return 0
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0

def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1 - np.sum(s1*s1)
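A plain-Python restatement of the normalized Shannon index above (no pandas), useful for checking the answers that follow:

```python
import math

def normalized_entropy(counts):
    nonzero = [c for c in counts if c != 0]
    if len(nonzero) < 2:           # 0 or 1 occupied categories: entropy 0
        return 0.0
    total = float(sum(nonzero))
    max_s = math.log(len(counts))  # maximum entropy for this many categories
    return sum(-(c/total) * math.log(c/total) for c in nonzero) / max_s

even = normalized_entropy([1, 1, 1, 1, 1])           # evenly spread -> 1.0
concentrated = normalized_entropy([0, 0, 10, 0, 0])  # one category -> 0.0
```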
Q43. Suppose you have 10 people and 5 categories, how you would you maximize the Shannon entropy?
A43:
3
In [82]:
entropy(Series([0,0,10,0,0]))
Out[82]:
Q44. What is
entropy(Series([0,0,10,0,0]))
A44:
0
In [83]:
entropy(Series([10,0,0,0,0]))
Out[83]:
Q45. What is
entropy(Series([10,0,0,0,0]))
A45:
0
In [84]:
entropy(Series([1,1,1,1,1]))
Out[84]:
Q46. What is
entropy(Series([1,1,1,1,1]))
A46:
1
In [85]:
gini_simpson(Series([2,2,2,2,2]))
Out[85]:
Q47. What is
gini_simpson(Series([2,2,2,2,2]))
A47:
0.8