Read in JSON and DataFrame Basics


In [20]:
# read population in
import json
import requests
from pandas import DataFrame

# pop_json_url holds the URL of a JSON file listing country populations
pop_json_url = "https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list = requests.get(pop_json_url, verify=False).json()

df = DataFrame(pop_list)
df[:5]


Out[20]:
0 1 2
0 1 China 1385566537
1 2 India 1252139596
2 3 United States 320050716
3 4 Indonesia 249865631
4 5 Brazil 200361925

5 rows × 3 columns


In [21]:
df.ix[2]


Out[21]:
0                3
1    United States
2        320050716
Name: 2, dtype: object

In [22]:
df.dtypes


Out[22]:
0    float64
1     object
2      int64
dtype: object

Q: Based on the above statement, which of these would you expect to see in pop_list?

  1. ['1', 'United States', '320050716']
  2. [1, 'United States', 320050716]
  3. ['United States', 320050716]
  4. [1, 'United States', '320050716']
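One way to reason about the question above: the json module maps JSON numbers and JSON strings to different Python types, and pandas infers column dtypes from those types. The sketch below uses a hypothetical one-row payload built from the United States row in Out[20]; it is not the actual content of the gist.

```python
import json

# Hypothetical payload: one row copied from Out[20], numbers left unquoted.
payload = '[[3, "United States", 320050716]]'
row = json.loads(payload)[0]

# JSON numbers become int, JSON strings become str.
types = [type(value).__name__ for value in row]
print(types)

# Quoting the numbers in the JSON text would turn them into strings instead.
quoted_row = json.loads('[["3", "United States", "320050716"]]')[0]
quoted_types = [type(value).__name__ for value in quoted_row]
print(quoted_types)
```

A column built from str values would show up as object in df.dtypes, while a column of int values would show up as a numeric dtype.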

Q: What is the relationship between s and the population of China?

s = sum(df[df[1].str.startswith('C')][2])

  1. s is greater than the population of China
  2. s is the same as the population of China
  3. s is less than the population of China
  4. s is not a number.

In [23]:
# df.columns = ['Number','Country','Population']
print df[df[1].str.startswith('C')][2].sum()
sum(df[df[1].str.startswith('C')][2])


1667559248
Out[23]:
1667559248

Q: What does this statement do?

df.columns = ['Number','Country','Population']

  1. Nothing
  2. df gets a new attribute called columns
  3. df's columns are renamed based on the list
  4. Throws an exception

Q: How would you rewrite this statement to get the same result

s = sum(df[df[1].str.startswith('C')][2])

after running:

df.columns = ['Number','Country','Population']
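
A minimal sketch of selecting by column name after the rename, assuming a recent pandas and Python 3. The five rows are copied from Out[20], so the sum here covers only that sample and will not match the full-list total in Out[23].

```python
import pandas as pd

# Sample rows copied from Out[20]; the real pop_list is longer.
rows = [
    [1, "China", 1385566537],
    [2, "India", 1252139596],
    [3, "United States", 320050716],
    [4, "Indonesia", 249865631],
    [5, "Brazil", 200361925],
]
df = pd.DataFrame(rows)

# Before the rename, columns are addressed by their integer labels.
by_position = df[df[1].str.startswith("C")][2].sum()

# After the rename, the same selection uses the new names.
df.columns = ["Number", "Country", "Population"]
by_name = df.loc[df["Country"].str.startswith("C"), "Population"].sum()

print(by_position, by_name)
```

Once the columns are renamed, the old integer labels 1 and 2 no longer exist, so the original expression would raise a KeyError.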

Series Examples


In [24]:
from pandas import DataFrame, Series
import numpy as np

s1 = Series(np.arange(1,4))
s1


Out[24]:
0    1
1    2
2    3
dtype: int32

In [25]:
s1 + 1


Out[25]:
0    2
1    3
2    4
dtype: int32

Q: What is

s1 + 1

Q: What is

s1.apply(lambda k: 2*k).sum()

In [26]:
sum(s1.apply(lambda k: 2*k))


Out[26]:
12

Q: What is

s1.cumsum()[1]

In [27]:
s1.cumsum()[1]


Out[27]:
3

Q: What is

s1.cumsum() + s1.cumsum()

In [28]:
s1.cumsum()+s1.cumsum()


Out[28]:
0     2
1     6
2    12
dtype: int32

Q: Describe what is happening in these statements:

s1 + 1

and

s1.cumsum() + s1.cumsum()
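
The two statements can be contrasted in a short sketch, assuming a recent pandas: adding a scalar applies it to every element, while adding two Series pairs values by index label. The Series named shifted is a hypothetical addition, included only to show what happens when labels do not line up.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 4))

# Scalar on the right: 1 is added to each element.
plus_one = s1 + 1

# Series on both sides: values are paired by index label, then added.
doubled = s1.cumsum() + s1.cumsum()

# Hypothetical Series with labels 1, 2, 3 instead of 0, 1, 2.
shifted = pd.Series([10, 20, 30], index=[1, 2, 3])

# Labels present on only one side (0 and 3) produce NaN.
aligned = s1 + shifted

print(plus_one.tolist())
print(doubled.tolist())
print(aligned)
```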

Q: What is

np.any(s1 > 2)

In [29]:
s1 > 2


Out[29]:
0    False
1    False
2     True
dtype: bool
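
As a sketch of how the boolean Series above reduces to a single answer, assuming a recent pandas and NumPy: np.any asks whether at least one element is True, np.all asks whether every element is True, and summing the mask counts the True values.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(1, 4))
mask = s1 > 2  # element-wise comparison, one bool per row

any_above = bool(np.any(mask))   # is at least one value greater than 2?
all_above = bool(np.all(mask))   # is every value greater than 2?
count_above = int(mask.sum())    # True counts as 1 when summed

print(any_above, all_above, count_above)
```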

Census API Examples


In [30]:
from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})


Out[30]:
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

Q: What is the purpose of settings.CENSUS_KEY?

  1. It is the password for the Census Python package
  2. It is an API Access key for authentication with the Census API
  3. It is an API Access key for authentication with Github
  4. It is a key shared by all users of the Census API
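
The notebook does not show the contents of the settings module. One common pattern, sketched here as an assumption only, is to keep the key out of the notebook and read it from an environment variable. The function name load_census_key and the variable name CENSUS_KEY in the environment are both hypothetical.

```python
import os


def load_census_key(env=os.environ):
    """Return the API key from the environment (hypothetical helper)."""
    key = env.get("CENSUS_KEY", "").strip()
    if not key:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError("CENSUS_KEY is not set")
    return key


# Usage with an explicit mapping, so the sketch runs without a real key.
demo_key = load_census_key({"CENSUS_KEY": "  example-key  "})
print(demo_key)
```

Keeping the key in a separate module or in the environment means the notebook can be shared without exposing it.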

Q: What is the difference between r1 and r2?

r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })

In [31]:
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })

In [32]:
r1[0]


Out[32]:
{u'NAME': u'Alameda County',
 u'P0010001': u'1510271',
 u'county': u'001',
 u'state': u'06'}

In [33]:
len(r2)


Out[33]:
3221

Q: Which is the correct geographic hierarchy?

Nation > States = Nation is subdivided into States

  1. Counties > States
  2. Counties > Census Blocks > Census Tracts
  3. Places > Counties
  4. Census Tracts > Block Groups > Census Blocks

In [34]:
from pandas import DataFrame

r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df = DataFrame(r)

df.head()


Out[34]:
NAME P0010001 state
0 Alabama 4779736 01
1 Alaska 710231 02
2 Arizona 6392017 04
3 Arkansas 2915918 05
4 California 37253956 06

5 rows × 3 columns


In [54]:
df1 = DataFrame(r1)
df1.head()


Out[54]:
NAME P0010001 county state
0 Alameda County 1510271 001 06
1 Alpine County 1175 003 06
2 Amador County 38091 005 06
3 Butte County 220000 007 06
4 Calaveras County 45578 009 06

5 rows × 4 columns


In [55]:
r3 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*' })
len(r3)


Out[55]:
3221

Q: Why does df have 52 items? Please explain.


In [56]:
len(df)


Out[56]:
52

Q: Why are the results below different? Please explain.


In [57]:
print df.P0010001.sum()
print
print df.P0010001.astype(int).sum()


312471327

312471327
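
A hedged sketch of the difference the question is after, assuming the column still holds the strings returned by the API (as in Out[30]) and a recent pandas. Summing an object column of strings concatenates them, while converting to int first gives an arithmetic total. The two values are the Alabama and Alaska populations from Out[34].

```python
import pandas as pd

# Populations as strings, the way the Census API returns them.
# dtype=object is set explicitly so the values stay plain Python strings.
pop_text = pd.Series(["4779736", "710231"], dtype=object)

as_text = pop_text.sum()              # string concatenation
as_int = pop_text.astype(int).sum()   # numeric addition

print(as_text)
print(as_int)
```

If the column has already been converted by an earlier cell, both expressions return the same number.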

Q: Describe the output of the following:

df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort('P0010001', ascending=False).head()

In [58]:
df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort('P0010001', ascending=False).head()


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-58-0f3542cb237a> in <module>()
      1 df.P0010001 = df.P0010001.astype(int)
----> 2 df[['NAME','P0010001']].sort('P0010001', ascending=False).head()

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1627         if isinstance(key, (Series, np.ndarray, list)):
   1628             # either boolean or fancy integer index
-> 1629             return self._getitem_array(key)
   1630         elif isinstance(key, DataFrame):
   1631             return self._getitem_frame(key)

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   1671             return self.take(indexer, axis=0, convert=False)
   1672         else:
-> 1673             indexer = self.ix._convert_to_indexer(key, axis=1)
   1674             return self.take(indexer, axis=1, convert=True)
   1675 

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
    957                     if isinstance(obj, tuple) and is_setter:
    958                         return {'key': obj}
--> 959                     raise KeyError('%s not in index' % objarr[mask])
    960 
    961                 return indexer

KeyError: "['NAME'] not in index"
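
The KeyError above most likely means NAME had already been moved into the index by an earlier run of set_index (the cell numbered In [53]), so it was no longer a column. The sketch below, assuming a recent pandas, rebuilds a small frame from the five rows in Out[34] and uses sort_values, which replaced DataFrame.sort in later pandas releases.

```python
import pandas as pd

# Rows copied from Out[34]; populations are strings, as returned by the API.
records = [
    {"NAME": "Alabama", "P0010001": "4779736", "state": "01"},
    {"NAME": "Alaska", "P0010001": "710231", "state": "02"},
    {"NAME": "Arizona", "P0010001": "6392017", "state": "04"},
    {"NAME": "Arkansas", "P0010001": "2915918", "state": "05"},
    {"NAME": "California", "P0010001": "37253956", "state": "06"},
]
df = pd.DataFrame(records)
df["P0010001"] = df["P0010001"].astype(int)

# Largest populations first.
top = df[["NAME", "P0010001"]].sort_values("P0010001", ascending=False).head()
print(top)

# After set_index, NAME is an index rather than a column.
indexed = df.set_index("NAME")
print("NAME" in indexed.columns)

# reset_index turns it back into a regular column.
restored = indexed.reset_index()
print("NAME" in restored.columns)
```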

Q: After running:

df.set_index('NAME', inplace=True)

how would you access the Series for the state of Alaska?

  1. df['Alaska']
  2. df[1]
  3. df.ix['Alaska']
  4. df[df['NAME'] == 'Alaska']
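
For readers on a current pandas, where the .ix indexer has been removed, label-based lookup is written with .loc and position-based lookup with .iloc. A minimal sketch, using the Alabama and Alaska rows from Out[34]:

```python
import pandas as pd

records = [
    {"NAME": "Alabama", "P0010001": 4779736, "state": "01"},
    {"NAME": "Alaska", "P0010001": 710231, "state": "02"},
]
df = pd.DataFrame(records).set_index("NAME")

alaska = df.loc["Alaska"]  # row selected by index label, returned as a Series
second = df.iloc[1]        # the same row selected by position

print(alaska)
```

Plain df['Alaska'] looks for a column named Alaska, not a row, which is why it fails after NAME becomes the index.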

In [53]:
df.set_index('NAME', inplace=True)


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-53-8ff8104008d3> in <module>()
----> 1 df.set_index('NAME', inplace=True)

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2343                 names.append(None)
   2344             else:
-> 2345                 level = frame[col].values
   2346                 names.append(col)
   2347                 if drop:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1633             return self._getitem_multilevel(key)
   1634         else:
-> 1635             return self._getitem_column(key)
   1636 
   1637     def _getitem_column(self, key):

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1640         # get column
   1641         if self.columns.is_unique:
-> 1642             return self._get_item_cache(key)
   1643 
   1644         # duplicate columns & possible reduce dimensionaility

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    981         res = cache.get(item)
    982         if res is None:
--> 983             values = self._data.get(item)
    984             res = self._box_item_values(item, values)
    985             cache[item] = res

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
   2752                 return self.get_for_nan_indexer(indexer)
   2753 
-> 2754             _, block = self._find_block(item)
   2755             return block.get(item)
   2756         else:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item)
   3063 
   3064     def _find_block(self, item):
-> 3065         self._check_have(item)
   3066         for i, block in enumerate(self.blocks):
   3067             if item in block:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item)
   3070     def _check_have(self, item):
   3071         if item not in self.items:
-> 3072             raise KeyError('no item named %s' % com.pprint_thing(item))
   3073 
   3074     def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,

KeyError: u'no item named NAME'

In [41]:
df.head()


Out[41]:
P0010001 state
NAME
Alabama 4779736 01
Alaska 710231 02
Arizona 6392017 04
Arkansas 2915918 05
California 37253956 06

5 rows × 2 columns


In [42]:
len([ s.fips for s in states.STATES])


Out[42]:
51

In [43]:
len(df)


Out[43]:
52

In [44]:
np.in1d([ s.fips for s in states.STATES], df.state)


Out[44]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)

In [45]:
len(np.in1d(df.state, [ s.fips for s in states.STATES]))


Out[45]:
52

In [46]:
len(df[np.in1d(df.state, [ s.fips for s in states.STATES])])


Out[46]:
51

In [47]:
df[np.in1d(df.state, [ s.fips for s in states.STATES])][:]


Out[47]:
P0010001 state
NAME
Alabama 4779736 01
Alaska 710231 02
Arizona 6392017 04
Arkansas 2915918 05
California 37253956 06
Colorado 5029196 08
Connecticut 3574097 09
Delaware 897934 10
District of Columbia 601723 11
Florida 18801310 12
Georgia 9687653 13
Hawaii 1360301 15
Idaho 1567582 16
Illinois 12830632 17
Indiana 6483802 18
Iowa 3046355 19
Kansas 2853118 20
Kentucky 4339367 21
Louisiana 4533372 22
Maine 1328361 23
Maryland 5773552 24
Massachusetts 6547629 25
Michigan 9883640 26
Minnesota 5303925 27
Mississippi 2967297 28
Missouri 5988927 29
Montana 989415 30
Nebraska 1826341 31
Nevada 2700551 32
New Hampshire 1316470 33
New Jersey 8791894 34
New Mexico 2059179 35
New York 19378102 36
North Carolina 9535483 37
North Dakota 672591 38
Ohio 11536504 39
Oklahoma 3751351 40
Oregon 3831074 41
Pennsylvania 12702379 42
Rhode Island 1052567 44
South Carolina 4625364 45
South Dakota 814180 46
Tennessee 6346105 47
Texas 25145561 48
Utah 2763885 49
Vermont 625741 50
Virginia 8001024 51
Washington 6724540 53
West Virginia 1852994 54
Wisconsin 5686986 55
Wyoming 563626 56

51 rows × 2 columns
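
The filtering step can also be written with Series.isin, and negating the mask shows which rows were left out. This is a sketch under stated assumptions: the frame holds three rows copied from the output above, and known_fips is a hypothetical reference list that deliberately omits Arizona for illustration. It says nothing about which row is the extra one in the real data.

```python
import pandas as pd

df = pd.DataFrame(
    {"P0010001": [4779736, 710231, 6392017], "state": ["01", "02", "04"]},
    index=pd.Index(["Alabama", "Alaska", "Arizona"], name="NAME"),
)

# Hypothetical reference list standing in for [s.fips for s in states.STATES].
known_fips = ["01", "02"]

mask = df["state"].isin(known_fips)  # True where the FIPS code is in the list
matched = df[mask]
unmatched = df[~mask]                # rows whose code is not in the list

print(unmatched)
```

Applying the same negated mask to the full 52-row frame would show the single row that is not in states.STATES.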