Read in JSON and DataFrame Basics



In [20]:

    
# read population in
import json
import requests
from pandas import DataFrame

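# pop_json_url holds the URL of a JSON file listing country populations
pop_json_url = "https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
# verify=False skips TLS certificate verification for this request
pop_list = requests.get(pop_json_url, verify=False).json()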

df = DataFrame(pop_list)
df[:5]









    Out[20]:






  
    
      
   0              1           2
0  1          China  1385566537
1  2          India  1252139596
2  3  United States   320050716
3  4      Indonesia   249865631
4  5         Brazil   200361925
    
  

5 rows × 3 columns



In [21]:

    
df.ix[2]









    Out[21]:





0                3
1    United States
2        320050716
Name: 2, dtype: object



In [22]:

    
df.dtypes









    Out[22]:





0    float64
1     object
2      int64
dtype: object

Q: Based on the dtypes output above, which of these would you expect to see in pop_list?

['1', 'United States', '320050716']
[1, 'United States', 320050716]
['United States', 320050716]
[1, 'United States', '320050716']
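
One way to reason about this question is to build a DataFrame from a few hand-written rows and inspect the inferred dtypes. The sketch below is illustrative only: the rows are typed in by hand rather than fetched from pop_json_url, and it assumes a current pandas installation.

```python
from pandas import DataFrame

# Rows typed by hand for illustration; not fetched from pop_json_url.
rows_numeric = [[1, 'China', 1385566537], [2, 'India', 1252139596]]
rows_strings = [['1', 'China', '1385566537'], ['2', 'India', '1252139596']]

# Integers in the source list produce an integer column.
numeric_population_dtype = DataFrame(rows_numeric)[2].dtype

# Quoted numbers are not parsed: the column holds strings, not integers.
string_population_dtype = DataFrame(rows_strings)[2].dtype

print(numeric_population_dtype)
print(string_population_dtype)
```

Comparing the two printed dtypes with Out[22] suggests which kind of row the JSON list contains.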

Q: What is the relationship between s and the population of China?

s = sum(df[df[1].str.startswith('C')][2])

s is greater than the population of China
s is the same as the population of China
s is less than the population of China
s is not a number.



In [23]:

    
# df.columns = ['Number','Country','Population']
print df[df[1].str.startswith('C')][2].sum()
sum(df[df[1].str.startswith('C')][2])

Q: What does this statement do?

df.columns = ['Number','Country','Population']

Nothing
df gets a new attribute called columns
df's columns are renamed based on the list
Throws an exception

Q: How would you rewrite this statement to get the same result

s = sum(df[df[1].str.startswith('C')][2])

after running:

df.columns = ['Number','Country','Population']
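
A minimal sketch of the renaming and the rewritten selection, assuming a current pandas installation. It uses only the five rows shown in Out[20], so the result here differs from what the full list would give.

```python
from pandas import DataFrame

# The five rows shown in Out[20]; the full list has many more countries.
df = DataFrame([
    [1, 'China', 1385566537],
    [2, 'India', 1252139596],
    [3, 'United States', 320050716],
    [4, 'Indonesia', 249865631],
    [5, 'Brazil', 200361925],
])

# Before renaming, columns are addressed by their default labels 0, 1, 2.
s_before = df[df[1].str.startswith('C')][2].sum()

# Assigning a list to df.columns renames the columns in place.
df.columns = ['Number', 'Country', 'Population']

# Same selection by name; .loc takes the row mask and the column together.
starts_with_c = df['Country'].str.startswith('C')
s_after = df.loc[starts_with_c, 'Population'].sum()

print(s_before, s_after)
```

With only these five rows, China is the sole match. On the full list, every country whose name begins with 'C' contributes to the sum.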

Series Examples



In [24]:

    
from pandas import DataFrame, Series
import numpy as np

s1 = Series(np.arange(1,4))
s1









    Out[24]:





0    1
1    2
2    3
dtype: int32



In [25]:

    
s1 + 1









    Out[25]:





0    2
1    3
2    4
dtype: int32

Q: What is

s1 + 1

Q: What is

s1.apply(lambda k: 2*k).sum()



In [26]:

    
sum(s1.apply(lambda k: 2*k))









    Out[26]:





12

Q: What is

s1.cumsum()[1]



In [27]:

    
s1.cumsum()[1]









    Out[27]:





3

Q: What is

s1.cumsum() + s1.cumsum()



In [28]:

    
s1.cumsum()+s1.cumsum()









    Out[28]:





0     2
1     6
2    12
dtype: int32

Q: Describe what is happening in these statements:

s1 + 1

and

s1.cumsum() + s1.cumsum()
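
The two statements can be compared side by side. The sketch below assumes a current pandas and NumPy; the Series a and b with labels 'x', 'y', 'z' are hypothetical and only there to show that addition pairs elements by index label rather than by position.

```python
import numpy as np
from pandas import Series

s1 = Series(np.arange(1, 4))      # values 1, 2, 3 with index 0, 1, 2

# Scalar case: the single value 1 is broadcast to every element.
plus_one = s1 + 1                 # values 2, 3, 4

# Series case: elements are paired up by index label, then added.
running = s1.cumsum()             # values 1, 3, 6
doubled = running + running       # values 2, 6, 12

# Alignment is by label, not position (hypothetical labels).
a = Series([1, 2, 3], index=['x', 'y', 'z'])
b = Series([10, 20, 30], index=['z', 'y', 'x'])
aligned = a + b                   # x: 1 + 30, y: 2 + 20, z: 3 + 10
```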

Q: What is

np.any(s1 > 2)



In [29]:

    
s1 > 2









    Out[29]:





0    False
1    False
2     True
dtype: bool
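
The boolean Series above is what np.any reduces. A short sketch, assuming a current pandas and NumPy, of the reduction and of using the same mask for selection:

```python
import numpy as np
from pandas import Series

s1 = Series(np.arange(1, 4))

mask = s1 > 2                     # element-wise comparison gives a boolean Series
any_above = bool(np.any(mask))    # True if at least one element is True
all_above = bool(np.all(mask))    # True only if every element is True
selected = s1[mask]               # boolean indexing keeps rows where mask is True

print(any_above, all_above, list(selected))
```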

Census API Examples



In [30]:

    
from census import Census
from us import states

import settings

c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})









    Out[30]:





[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

Q: What is the purpose of settings.CENSUS_KEY?

It is the password for the Census Python package
It is an API Access key for authentication with the Census API
It is an API Access key for authentication with Github
It is a key shared by all users of the Census API
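
The notebook does not show what the settings module contains. One common arrangement, sketched here as an assumption, is to keep the key out of the notebook and read it from an environment variable. The helper name load_census_key and the variable name CENSUS_API_KEY are hypothetical.

```python
import os


def load_census_key(environ=os.environ, var_name="CENSUS_API_KEY"):
    """Return the Census API key stored in the given environment mapping."""
    key = environ.get(var_name)
    if not key:
        raise RuntimeError("Set %s to your Census API key" % var_name)
    return key


# In a settings.py module one might then write:
# CENSUS_KEY = load_census_key()
```

Keeping the key in a separate module or environment variable means the notebook can be shared without exposing a personal credential.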

Q: What is the difference between r1 and r2?

r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
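
The only difference between the two calls is the geography dictionary passed as the second argument. The sketch below builds that dictionary with a hypothetical helper, county_geo, without contacting the API, so the two predicates can be compared directly.

```python
def county_geo(state_fips='*'):
    """Build the geography dict passed as the second argument to c.sf1.get."""
    return {'for': 'county:*', 'in': 'state:%s' % state_fips}


geo_r1 = county_geo('06')   # counties within state 06 (California) only
geo_r2 = county_geo()       # counties within every state (wildcard)

print(geo_r1)
print(geo_r2)
```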



In [31]:

    
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })



In [32]:

    
r1[0]









    Out[32]:





{u'NAME': u'Alameda County',
 u'P0010001': u'1510271',
 u'county': u'001',
 u'state': u'06'}



In [33]:

    
len(r2)









    Out[33]:





3221

Q: Which is the correct geographic hierarchy?

Nation > States = Nation is subdivided into States

Counties > States
Counties > Census Blocks > Census Tracts
Places > Counties
Census Tracts > Block Groups > Census Blocks
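
The nesting of census geographies is also visible in the 15-digit identifier of a census block, which concatenates state (2 digits), county (3), tract (6) and block (4), with the block group given by the first digit of the block. The sketch below splits such an identifier. The helper name is hypothetical and the example value is made up for illustration: state 06 and county 001 match California and Alameda County in the outputs above, while the tract and block digits are arbitrary.

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block GEOID into its nested parts."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block GEOID")
    return {
        'state': geoid[0:2],
        'county': geoid[2:5],
        'tract': geoid[5:11],
        'block_group': geoid[11],     # first digit of the block number
        'block': geoid[11:15],
    }


# Illustrative identifier, not taken from the notebook.
parts = split_block_geoid('060014001001000')
print(parts)
```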



In [34]:

    
from pandas import DataFrame

r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df = DataFrame(r)

df.head()









    Out[34]:






  
    
      
         NAME  P0010001  state
0     Alabama   4779736     01
1      Alaska    710231     02
2     Arizona   6392017     04
3    Arkansas   2915918     05
4  California  37253956     06
    
  

5 rows × 3 columns



In [54]:

    
df1 = DataFrame(r1)
df1.head()









    Out[54]:






  
    
      
               NAME  P0010001  county  state
0    Alameda County   1510271     001     06
1     Alpine County      1175     003     06
2     Amador County     38091     005     06
3      Butte County    220000     007     06
4  Calaveras County     45578     009     06
    
  

5 rows × 4 columns



In [55]:

    
r3 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*' })
len(r3)









    Out[55]:





3221

Q: Why does df have 52 items? Please explain



In [56]:

    
len(df)









    Out[56]:





52

Q: Why are the results below different? Please explain



In [57]:

    
print df.P0010001.sum()
print
print df.P0010001.astype(int).sum()
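
The API returns populations as strings (see Out[30]), which is what makes the two sums differ. A small sketch using the five values shown in Out[34], assuming a current pandas; dtype=object is set explicitly so the column holds plain Python strings.

```python
from pandas import Series

# Populations of the five states shown in Out[34], as strings.
pop = Series(['4779736', '710231', '6392017', '2915918', '37253956'],
             dtype=object)

as_text = pop.sum()                 # "adds" strings, i.e. concatenates them
as_number = pop.astype(int).sum()   # converts first, then adds integers

print(as_text)
print(as_number)   # 52051858
```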

Q: Describe the output of the following:

df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort('P0010001', ascending=False).head()
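
In current pandas the DataFrame.sort method no longer exists; sort_values is its replacement. The sketch below shows the equivalent statement on the five rows from Out[34], so the ordering applies to those rows only.

```python
from pandas import DataFrame

df = DataFrame({
    'NAME': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'P0010001': ['4779736', '710231', '6392017', '2915918', '37253956'],
})

# Convert the string column to integers so sorting is numeric, not lexical.
df['P0010001'] = df['P0010001'].astype(int)

# Largest population first; head() keeps the top five rows.
top = df[['NAME', 'P0010001']].sort_values('P0010001', ascending=False).head()
print(top)
```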



In [58]:

    
df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort('P0010001', ascending=False).head()









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-58-0f3542cb237a> in <module>()
      1 df.P0010001 = df.P0010001.astype(int)
----> 2 df[['NAME','P0010001']].sort('P0010001', ascending=False).head()

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1627         if isinstance(key, (Series, np.ndarray, list)):
   1628             # either boolean or fancy integer index
-> 1629             return self._getitem_array(key)
   1630         elif isinstance(key, DataFrame):
   1631             return self._getitem_frame(key)

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   1671             return self.take(indexer, axis=0, convert=False)
   1672         else:
-> 1673             indexer = self.ix._convert_to_indexer(key, axis=1)
   1674             return self.take(indexer, axis=1, convert=True)
   1675 

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
    957                     if isinstance(obj, tuple) and is_setter:
    958                         return {'key': obj}
--> 959                     raise KeyError('%s not in index' % objarr[mask])
    960 
    961                 return indexer

KeyError: "['NAME'] not in index"

Q: After running:

df.set_index('NAME', inplace=True)

how would you access the Series for the state of Alaska?

df['Alaska']
df[1]
df.ix['Alaska']
df[df['NAME'] == 'Alaska']
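
A sketch of label-based row access after set_index, using the five rows from Out[34]. It assumes a current pandas, where the .ix indexer has been removed and .loc is the label-based equivalent.

```python
from pandas import DataFrame

df = DataFrame({
    'NAME': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'P0010001': [4779736, 710231, 6392017, 2915918, 37253956],
    'state': ['01', '02', '04', '05', '06'],
})
df.set_index('NAME', inplace=True)

# Row lookup by index label; the result is a Series of that row's columns.
alaska = df.loc['Alaska']

# df['Alaska'] would look for a column named 'Alaska' and raise KeyError.
# df['NAME'] also fails now, because NAME is the index and not a column.
print(alaska)
```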



In [53]:

    
df.set_index('NAME', inplace=True)









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-53-8ff8104008d3> in <module>()
----> 1 df.set_index('NAME', inplace=True)

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2343                 names.append(None)
   2344             else:
-> 2345                 level = frame[col].values
   2346                 names.append(col)
   2347                 if drop:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1633             return self._getitem_multilevel(key)
   1634         else:
-> 1635             return self._getitem_column(key)
   1636 
   1637     def _getitem_column(self, key):

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1640         # get column
   1641         if self.columns.is_unique:
-> 1642             return self._get_item_cache(key)
   1643 
   1644         # duplicate columns & possible reduce dimensionaility

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    981         res = cache.get(item)
    982         if res is None:
--> 983             values = self._data.get(item)
    984             res = self._box_item_values(item, values)
    985             cache[item] = res

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
   2752                 return self.get_for_nan_indexer(indexer)
   2753 
-> 2754             _, block = self._find_block(item)
   2755             return block.get(item)
   2756         else:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item)
   3063 
   3064     def _find_block(self, item):
-> 3065         self._check_have(item)
   3066         for i, block in enumerate(self.blocks):
   3067             if item in block:

/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item)
   3070     def _check_have(self, item):
   3071         if item not in self.items:
-> 3072             raise KeyError('no item named %s' % com.pprint_thing(item))
   3073 
   3074     def reindex_axis(self, new_axis, indexer=None, method=None, axis=0,

KeyError: u'no item named NAME'




In [41]:

    
df.head()









    Out[41]:






  
    
      
            P0010001  state
NAME
Alabama      4779736     01
Alaska        710231     02
Arizona      6392017     04
Arkansas     2915918     05
California  37253956     06
    
  

5 rows × 2 columns



In [42]:

    
len([ s.fips for s in states.STATES])









    Out[42]:





51



In [43]:

    
len(df)









    Out[43]:





52



In [44]:

    
np.in1d([ s.fips for s in states.STATES], df.state)









    Out[44]:





array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True], dtype=bool)



In [45]:

    
len(np.in1d(df.state, [ s.fips for s in states.STATES]))









    Out[45]:





52



In [46]:

    
len(df[np.in1d(df.state, [ s.fips for s in states.STATES])])









    Out[46]:





51
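
The same membership filter can be written with Series.isin, and negating the mask shows which row has no match. The sketch below is a stand-in that does not use the us package: known_fips is a hypothetical short list in place of [s.fips for s in states.STATES], and the Puerto Rico row (FIPS 72) is an assumption about what the unmatched row is, since the original output does not show it.

```python
from pandas import DataFrame

df = DataFrame({
    'NAME': ['Alabama', 'Alaska', 'California', 'Puerto Rico'],
    'state': ['01', '02', '06', '72'],
})

# Hypothetical stand-in for [s.fips for s in states.STATES].
known_fips = ['01', '02', '06']

in_list = df['state'].isin(known_fips)   # one boolean per row of df
matched = df[in_list]                    # rows whose FIPS code is known
unmatched = df[~in_list]                 # rows left over after the filter

print(len(matched))
print(unmatched)
```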



In [47]:

    
df[np.in1d(df.state, [ s.fips for s in states.STATES])][:]









    Out[47]:






  
    
      
                      P0010001  state
NAME
Alabama                4779736     01
Alaska                  710231     02
Arizona                6392017     04
Arkansas               2915918     05
California            37253956     06
Colorado               5029196     08
Connecticut            3574097     09
Delaware                897934     10
District of Columbia    601723     11
Florida               18801310     12
Georgia                9687653     13
Hawaii                 1360301     15
Idaho                  1567582     16
Illinois              12830632     17
Indiana                6483802     18
Iowa                   3046355     19
Kansas                 2853118     20
Kentucky               4339367     21
Louisiana              4533372     22
Maine                  1328361     23
Maryland               5773552     24
Massachusetts          6547629     25
Michigan               9883640     26
Minnesota              5303925     27
Mississippi            2967297     28
Missouri               5988927     29
Montana                 989415     30
Nebraska               1826341     31
Nevada                 2700551     32
New Hampshire          1316470     33
New Jersey             8791894     34
New Mexico             2059179     35
New York              19378102     36
North Carolina         9535483     37
North Dakota            672591     38
Ohio                  11536504     39
Oklahoma               3751351     40
Oregon                 3831074     41
Pennsylvania          12702379     42
Rhode Island           1052567     44
South Carolina         4625364     45
South Dakota            814180     46
Tennessee              6346105     47
Texas                 25145561     48
Utah                   2763885     49
Vermont                 625741     50
Virginia               8001024     51
Washington             6724540     53
West Virginia          1852994     54
Wisconsin              5686986     55
Wyoming                 563626     56
    
  

51 rows × 2 columns
