Read in JSON and DataFrame Basics
In [20]:
# read population in
import json
import requests
from pandas import DataFrame
# pop_json_url holds the URL of a JSON file listing countries by population
pop_json_url = "https://gist.github.com/rdhyee/8511607/raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list = requests.get(pop_json_url, verify=False).json()
df = DataFrame(pop_list)
df[:5]
Out[20]:
In [21]:
df.loc[2]
Out[21]:
In [22]:
df.dtypes
Out[22]:
Q: Based on the above statement, which of these would you expect to see in pop_list?
['1', 'United States', '320050716']
[1, 'United States', 320050716]
['United States', 320050716]
[1, 'United States', '320050716']
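To see how the shape of the input list drives the resulting frame, here is a minimal sketch built from a small hand-made list of rows. The name sample_rows and every value in it are placeholders, not the contents of the real JSON file.

```python
from pandas import DataFrame

# Hypothetical rows standing in for pop_list (placeholder values only).
sample_rows = [
    [1, "Country A", 300],
    [2, "Country B", 200],
    [3, "Country C", 100],
]

sample_df = DataFrame(sample_rows)

# With no column names supplied, pandas labels the columns 0, 1, 2.
column_labels = list(sample_df.columns)

# Columns built from Python ints get an integer dtype;
# a column built from strings would not.
rank_is_integer = sample_df[0].dtype.kind == "i"
population_is_integer = sample_df[2].dtype.kind == "i"

print(column_labels)
print(sample_df.dtypes)
```

Comparing df.dtypes for the real data against this output is one way to reason about whether each value in pop_list arrived as a number or as a string.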
Q: What is the relationship between s and the population of China?
s = sum(df[df[1].str.startswith('C')][2])
s is greater than the population of China
s is the same as the population of China
s is less than the population of China
s is not a number.
In [23]:
# df.columns = ['Number','Country','Population']
print(df[df[1].str.startswith('C')][2].sum())
sum(df[df[1].str.startswith('C')][2])
Out[23]:
Q: What does this statement do?
df.columns = ['Number','Country','Population']
columns
Q: How would you rewrite this statement to get the same result
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
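The effect of renaming the columns can be sketched on a tiny stand-in frame. The row values and the names sample_df, s_before and s_after are made up for illustration; only the column names come from the statement above.

```python
from pandas import DataFrame

# Placeholder data; the names and numbers are illustrative only.
sample_df = DataFrame([
    [1, "C-one", 10],
    [2, "C-two", 20],
    [3, "B-one", 30],
])

# Before renaming: columns are addressed by their integer labels.
s_before = sum(sample_df[sample_df[1].str.startswith("C")][2])

# After renaming: the same selection uses the new column names.
sample_df.columns = ["Number", "Country", "Population"]
starts_with_c = sample_df["Country"].str.startswith("C")
s_after = sample_df[starts_with_c]["Population"].sum()

print(s_before, s_after)
```

Once the columns are renamed, the old integer labels no longer exist, so an expression such as sample_df[1] would raise a KeyError.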
Series Examples
In [24]:
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(1,4))
s1
Out[24]:
In [25]:
s1 + 1
Out[25]:
Q: What is
s1 + 1
Q: What is
s1.apply(lambda k: 2*k).sum()
In [26]:
sum(s1.apply(lambda k: 2*k))
Out[26]:
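As a cross-check on the apply-based expression, the same total can be computed with plain vectorized arithmetic. This is a small sketch; the names via_apply and via_vectorized are introduced here for illustration.

```python
import numpy as np
from pandas import Series

s1 = Series(np.arange(1, 4))  # values 1, 2, 3

# Element-wise function application, then a reduction.
via_apply = s1.apply(lambda k: 2 * k).sum()

# The same computation written as vectorized arithmetic.
via_vectorized = (2 * s1).sum()

print(via_apply, via_vectorized)  # both are 12
```

The vectorized form avoids calling a Python function once per element, which tends to matter on larger Series.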
Q: What is
s1.cumsum()[1]
In [27]:
s1.cumsum()[1]
Out[27]:
Q: What is
s1.cumsum() + s1.cumsum()
In [28]:
s1.cumsum()+s1.cumsum()
Out[28]:
Q: Describe what is happening in these statements:
s1 + 1
and
s1.cumsum() + s1.cumsum()
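The two statements exercise different rules: adding a scalar applies it to every element, while adding two Series pairs values up by index label. The sketch below shows both; the reversed copy is an extra step added here to make the label alignment visible.

```python
import numpy as np
from pandas import Series

s1 = Series(np.arange(1, 4))  # values 1, 2, 3

# Series + scalar: the scalar is applied to every element.
plus_one = s1 + 1  # 2, 3, 4

# cumsum() gives running totals: 1, 3, 6.
running = s1.cumsum()

# Series + Series: values are added pairwise, matched by index label.
doubled_running = running + running  # 2, 6, 12

# Alignment is by label, not by position: reversing one operand
# does not change which values are paired.
reversed_running = running[::-1]
aligned = running + reversed_running  # label 0 -> 2, label 1 -> 6, label 2 -> 12

print(plus_one.tolist(), doubled_running.tolist())
```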
Q: What is
np.any(s1 > 2)
In [29]:
s1 > 2
Out[29]:
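The comparison produces a boolean Series, which np.any then reduces to a single value. A short sketch, with np.all and mask-based selection added for contrast:

```python
import numpy as np
from pandas import Series

s1 = Series(np.arange(1, 4))  # values 1, 2, 3

mask = s1 > 2  # False, False, True

any_above_two = bool(np.any(mask))  # True: at least one element exceeds 2
all_above_two = bool(np.all(mask))  # False: not every element does

# The same mask can select the matching elements.
selected = s1[mask]

print(any_above_two, all_above_two, selected.tolist())
```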
Census API Examples
In [30]:
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
Out[30]:
Q: What is the purpose of settings.CENSUS_KEY?
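The notebook imports a local settings module without showing its contents. One common arrangement, assumed here rather than taken from the notebook, is to keep the API key out of the shared code and read it from an environment variable. The function name load_census_key and the variable name CENSUS_KEY are hypothetical.

```python
import os


def load_census_key(environ=os.environ, variable="CENSUS_KEY"):
    """Return the API key stored in an environment variable.

    Both the function name and the variable name are hypothetical.
    """
    key = environ.get(variable)
    if not key:
        raise RuntimeError("Set %s before calling the Census API" % variable)
    return key


# Example with an explicit mapping so no real key appears in the code.
demo_key = load_census_key({"CENSUS_KEY": "not-a-real-key"})
print(demo_key)
```

Keeping the key in a separate module or in the environment lets the notebook be shared without exposing a personal credential.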
Q: What is the difference between r1 and r2?
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
In [31]:
r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*' })
In [32]:
r1[0]
Out[32]:
In [33]:
len(r2)
Out[33]:
Q: Which is the correct geographic hierarchy?
Nation > States = Nation is subdivided into States
In [34]:
from pandas import DataFrame
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df = DataFrame(r)
df.head()
Out[34]:
In [54]:
df1 = DataFrame(r1)
df1.head()
Out[54]:
In [55]:
r3 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*' })
len(r3)
Out[55]:
Q: Why does df have 52 items? Please explain.
In [56]:
len(df)
Out[56]:
Q: Why are the results below different? Please explain
In [57]:
print(df.P0010001.sum())
print("")
print(df.P0010001.astype(int).sum())
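One way two such sums can differ is if the column holds strings rather than numbers. The sketch below assumes that situation and uses placeholder values, not real census counts.

```python
from pandas import Series

# Placeholder values stored as strings, as an API response might deliver them.
as_text = Series(["10", "20", "30"], dtype=object)

# Summing strings concatenates them.
text_total = as_text.sum()  # "102030"

# Converting to integers first gives an arithmetic sum.
numeric_total = as_text.astype(int).sum()  # 60

print(text_total)
print(numeric_total)
```

Checking df.dtypes before aggregating is a quick way to tell which of the two behaviors to expect.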
Q: Describe the output of the following:
df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort_values('P0010001', ascending=False).head()
In [58]:
df.P0010001 = df.P0010001.astype(int)
df[['NAME','P0010001']].sort_values('P0010001', ascending=False).head()
Q: After running:
df.set_index('NAME', inplace=True)
how would you access the Series for the state of Alaska?
In [53]:
df.set_index('NAME', inplace=True)
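After set_index, rows can be looked up by label. The sketch below uses a three-row stand-in frame; the P0010001 numbers are placeholders rather than real counts, while the state codes are the FIPS codes for those states.

```python
from pandas import DataFrame

# Placeholder rows; the P0010001 numbers are not real census counts.
sample_df = DataFrame({
    "NAME": ["Alabama", "Alaska", "Arizona"],
    "P0010001": [3, 1, 2],
    "state": ["01", "02", "04"],
})

sample_df.set_index("NAME", inplace=True)

# .loc selects a row by its index label and returns it as a Series.
alaska = sample_df.loc["Alaska"]

print(alaska["state"])  # 02
```

Because NAME is now the index, it is no longer available as a regular column; sample_df.index holds the state names instead.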
In [41]:
df.head()
Out[41]:
In [42]:
len([ s.fips for s in states.STATES])
Out[42]:
In [43]:
len(df)
Out[43]:
In [44]:
np.in1d([ s.fips for s in states.STATES], df.state)
Out[44]:
In [45]:
len(np.in1d(df.state, [ s.fips for s in states.STATES]))
Out[45]:
In [46]:
len(df[np.in1d(df.state, [ s.fips for s in states.STATES])])
Out[46]:
In [47]:
df[np.in1d(df.state, [ s.fips for s in states.STATES])][:]
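The membership tests above can be illustrated on a small frame. This sketch uses np.isin, the newer NumPy counterpart of np.in1d, and a hypothetical two-item list, known_fips, in place of the codes taken from states.STATES.

```python
import numpy as np
from pandas import DataFrame

# FIPS codes as strings. "11" (District of Columbia) and "72" (Puerto Rico)
# are deliberately left out of the hypothetical list below.
sample_df = DataFrame({
    "NAME": ["Alaska", "California", "District of Columbia", "Puerto Rico"],
    "state": ["02", "06", "11", "72"],
})
known_fips = ["02", "06"]  # stand-in for [s.fips for s in states.STATES]

# One boolean per row: True where the row's code appears in known_fips.
in_list = np.isin(sample_df["state"], known_fips)

matched = sample_df[in_list]
unmatched = sample_df[~in_list]

print(matched["NAME"].tolist())
print(unmatched["NAME"].tolist())
```

The result has one entry per element of the first argument, which is why swapping the two arguments changes the length of the boolean array.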
Out[47]: