Merge craiglist data with synthetic census data

Steps:

Get craigslist data
- Query from remote database in batches based on FIPS
- Joing to census via FIPS code
Get census data
- Aggregate to BG
Merge together

Can process in batches, one state at a time, or get data by region.

TODO:

The census data for DC are missing. The files are there but they contain no data.
If need to look at education or other person data, aggregate person data to household and then block group



In [4]:

    
import pandas as pd
import psycopg2
import paramiko
import os
import numpy as np
import json
import zipfile

Craigslist data table columns

FIPS code format

53-----033---001701--1--015

[state][county][tract][bg][block]

Note: for DC, county='001'

Synthetic census data - variables

Household data: household_id,serialno,persons,cars,income,race_of_head,age_of_head,workers,state,county,tract,block group,children,tenure,recent_mover

Person data: person_id,member_id,age,relate,edu,sex,hours,earning,race_id,household_id,student,work_at_home,worker

Configure connection parameters

Make sure you have two files saved in the same directory, postgres_settings.json and ssh_settings.json.



In [82]:

    
DATA_DIR=os.path.join('..','data') 
"""Path to local data directory"""

#read postgres connection parameters
with open('postgres_settings.json') as settings_file:    
    settings = json.load(settings_file)

DBNAME = settings['dbname']
USER = settings['user']
HOST = settings['host']
PASSWORD = settings['password']

conn_str = "dbname = {0} user = {1} host = {2} password = {3}".format(DBNAME, USER, HOST, PASSWORD)

try:
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
except:
    print ("Cannot connection. Check settings")



In [13]:

    
# TODO: add putty connection too. 
#read SSH connection parameters
with open('ssh_settings.json') as settings_file:    
    settings = json.load(settings_file)

HOSTNAME = settings['hostname']
USERNAME = settings['username']
PASSWORD = settings['password']
LOCAL_KEY_DIR = settings['local_key_dir']

CENSUS_DIR = 'synthetic_population'
"""Remote directory with census data"""

RESULTS_DIR = 'craigslist_census'
"""Remote directory for results"""

# estbalish SSH connection
ssh = paramiko.SSHClient() 
ssh.load_host_keys(LOCAL_KEY_DIR)
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOSTNAME,username=USERNAME, password=PASSWORD)
sftp = ssh.open_sftp()

Local data files

Local files with with block-level data from urbansim



In [14]:

    
BLOCK_DIR = os.path.join('..','data','urbansim')
BLOCK_ZFILE = 'ba_block_variables.csv.zip'
BLOCK_FILE = 'ba_block_variables.csv'

Create FIPS look-up tables



In [15]:

    
# make dictionary of states and fips codes. 
fips_state = pd.read_csv(os.path.join(DATA_DIR,'state_fips_codes.csv'),dtype=str)
fips2state=dict(zip(fips_state['FIPS'],fips_state['USPS']))
state2fips=dict(zip(fips_state['USPS'],fips_state['FIPS']))

# Make lookup for county to MPO code 
mpo_counties = pd.read_csv(os.path.join(DATA_DIR,'us_2015_mpo_regions_counties_v1.csv'), encoding='latin1', dtype={'MPO_ID':str,'COUNTYFP':str,'STFIPS':str})
mpo_counties['COUNTYFP'] = mpo_counties['COUNTYFP'].str.zfill(2)  
mpo_counties['st_co_fips'] = mpo_counties['STFIPS']+mpo_counties['COUNTYFP']  # we will want to join on 2-char state + 3-char county fips
county2mpo=dict(zip(mpo_counties['st_co_fips'],mpo_counties['MPO_ID']))  # do we want MPO_ID or do we want GEOID? 
mpo_counties.head()









    Out[15]:






  
    
      
      ID
      MPO_ID
      MPONAME
      STATE
      STFIPS
      STATEFP
      COUNTYFP
      COUNTYNS
      GEOID
      NAME
      ...
      CBSAFP
      METDIVFP
      FUNCSTAT
      ALAND
      AWATER
      INTPTLAT
      INTPTLON
      PERIMETER
      area_sqmi
      st_co_fips
    
  
  
    
      0
      1826900.0
      47198201
      Johnson City Metropolitan Transportation Plann...
      TN
      47
      47
      163
      1639793
      47163
      Sullivan
      ...
      28700.0
      NaN
      A
      1.070725e+09
      4.220920e+07
      36.510213
      -82.299396
      71837.801334
      0.756540
      47163
    
    
      1
      1541707.0
      29201300
      Southeast Metropolitan Planning Organization (...
      MO
      29
      17
      003
      424203
      17003
      Alexander
      ...
      16020.0
      NaN
      A
      6.099969e+08
      4.423719e+07
      37.183657
      -89.349516
      6972.897419
      0.591829
      29003
    
    
      2
      2101128.0
      51197401
      Richmond Area MPO
      VA
      51
      51
      670
      1498428
      51670
      Hopewell
      ...
      40060.0
      NaN
      F
      2.665508e+07
      1.324078e+06
      37.291010
      -77.298944
      6041.591069
      0.257294
      51670
    
    
      3
      220693.0
      15201300
      Maui MPO
      HI
      15
      15
      009
      365283
      15009
      Maui
      ...
      27980.0
      NaN
      A
      -1.286608e+09
      -1.091674e+09
      20.855931
      -156.601550
      299148.486465
      728.079214
      15009
    
    
      4
      220707.0
      15197500
      Oahu MPO
      HI
      15
      15
      003
      365281
      15003
      Honolulu
      ...
      46520.0
      NaN
      A
      1.555505e+09
      -2.400273e+08
      21.461365
      -158.201974
      329335.691836
      610.910815
      15003
    
  

5 rows × 25 columns



In [16]:

    
def run_query(q):
    """ Get results given SQL query"""
    cur.execute(q)
    return(cur.fetchall())

def get_craiglist(filters, split_fips=True):
    """Get craiglist data from database.
    Args: 
        filters (list): list of strings containing filter criteria. Format as individual SQL WHERE statements. E.g., ["region='sandiego'","rent>100"]
        split_fips (bool): if True, split fips code into block and fips12 (useful if merging wtih blockgroup)
    Returns: 
        DataFrame: listings data. 
    """
    #q="SELECT pid,date,rent,bedrooms,bathrooms,sqft,rent_sqft,fips_block,state,region,longitude,latitude FROM rental_listings WHERE state='{}';".format(state)
    filters_str = ' AND '.join([x for x in filters])
    q="SELECT pid,date,rent,bedrooms,bathrooms,sqft,rent_sqft,fips_block,state,region,longitude,latitude FROM rental_listings WHERE {};".format(filters_str)
    results=run_query(q)
    df=pd.DataFrame(results,columns=['listing_id', 'date','rent','bedrooms','bathrooms','sqft','rent_sqft','fips_block','state','region','lng','lat'] )  # put it all into a dataframe
    if split_fips==True:
        # split FIPS into different columns - split off the last 3 chars
        df['block']=df.fips_block.str[-4:]
        df['fips12']=df.fips_block.str[:-3]
    return(df)

def read_census_file(fname):
    """Read census csv file via SFTP and return as dataframe."""
    with sftp.open(os.path.join(CENSUS_DIR,fname)) as f:
        df = pd.read_csv(f, delimiter=',',dtype={'age_of_head':float, 'block group':str, 'cars':float, 'children':float, 'county':str,
           'household_id':str, 'income':float, 'persons':float, 'race_of_head':str, 'recent_mover':str,
           'serialno':str, 'state':str, 'tenure':str, 'tract':str, 'workers':float})
    return df

def write_results_file(data,fname):
    """Write merged data to csv file via SFTP."""
    with sftp.open(os.path.join(RESULTS_DIR,fname),'w') as f:
        data.to_csv(f,index=True)
    return

def get_census_by_state(state, table='households'): 
    """Return all census data for state given two-char abbreviation. Can be 'households' or 'persons' data. """ 
    filelist=sftp.listdir(CENSUS_DIR)
    if table=='households':
        files = [f for f in filelist if f[:5]=='hh_{}'.format(state)]
    elif table=='persons':
        files = [f for f in filelist if f[:4]=='p_{}'.format(state)]
    #files = files[:3]  # uncomment this line for testing.
    new_df = pd.DataFrame()
    for f in files:
        df = read_census_file(f)
        new_df = pd.concat([new_df,df])
    return(new_df)

def strip_zeros(s):
    """Remove '.0 from end of string"""
    if s.endswith('.0'):
        return(s[:-2])
    else:
        return(s)

def format_hh_data(df):
    """Fix formatting for hhs census data. Replace '' strings with zero. Format other strings."""

    df['county'] = df['county'].str.zfill(2)  # make county 3-char string.
    
    for col in ['children','workers']:
        df[col] = df[col].replace('','0')

    for col in ['race_of_head','recent_mover','tenure']:
        df[col] = df[col].astype(str)
        df[col] = df[col].map(strip_zeros)  # make sure strings are formatted. 
    return(df)

def aggregate_census(df, groupby_cols=['county','tract','block group'],cols_to_sum=['cars','children','persons','workers'], cols_to_median=['age_of_head','income'],categ_cols=['race_of_head','recent_mover','tenure'],id_col='serialno',table='hhs'):
    """Aggregate census table to block group. Made this for hh data, may need to revised for persons data.
    Args: 
        groupby_cols (list): names of columns to group by (default=['county','tract','block group'])
        cols_to_sum (list): names of columns for which to compute totals. 
        cols_to_median (list): names of columns for which to compute medians
        categ_cols (list): names of categorical columns
        id_col (str): name of column that serves as the id column, to use in counting rows. 
        table (str): 'hhs' (default) or 'per'
    Returns: 
        DataFrame: aggregated data. 
        """
    # For some columns we'll want to find the sum or average/median. These will need only a simple groupby 
    sums = df.groupby(by=groupby_cols).sum()[cols_to_sum]
    sums.columns = [x+'_tot' for x in cols_to_sum]
    
    medians = df.groupby(by=groupby_cols).median()[cols_to_median]
    medians.columns = [x+'_med' for x in cols_to_median]
    
    counts = pd.DataFrame(df.groupby(by=groupby_cols).count()[id_col])
    counts.columns=[table+'_tot']

    # Categorical columns will need pivot tables. 
    categoricals = pd.DataFrame(index=counts.index)
    for col in categ_cols:
        pivoted=df.pivot_table(index = groupby_cols, columns = col, aggfunc='count')[id_col]
        pivoted.columns = [col+'_'+x for x in pivoted.columns]
        pivoted.columns = pivoted.columns.map(strip_zeros)
        # merge back together
        categoricals = pd.merge(categoricals, pivoted, left_index=True, right_index=True)

    # put all back together in one table
    merged = pd.merge(sums, medians, left_index=True, right_index=True)
    merged = pd.merge(merged, counts, left_index=True, right_index=True)
    merged = pd.merge(merged, categoricals, left_index=True, right_index=True)

    # check lengths of dataframes to detect any problems in grouping or merging
    lengths = [len(sums),len(medians),len(counts),len(categoricals),len(merged)]
    if len(set(lengths))>1:
        print('Warning: Aggregated tables have different lengths.',lengths,'for sums, medians, counts, categoricals, and merged.')
    
    return(merged)

def match_mpo(s, mpo_dict=county2mpo):
    """Match a 5-char state-county FIPS code to an MPO code
    Args: 
        s (str): 5-char state-county string
        mpo_dict (dict): county2mpo dictionary
    Returns: 
        str: MPO code
    """
    try: 
        return mpo_dict[s]
    except KeyError: # in this case, the county is not in an MPO
        return ''



In [10]:

    
def run_all(state, filters=None):
    """Get craigslist data and merge with census data, by state, and save.  with additional filters if needed. 
    Args: 
        state (str): 2-char state abbreviation
        filters (list): additional filters. Do not need to include state in filter list
    """
    
    # load and prepare craiglist data
    
    # If filters are provided, use them to filter data
    if filters:
        filters.append("state='{}'".format(state))
        print(filters)
        df_cl=get_craiglist(filters)
    # If no filters provided, get all data for the specified state. 
    else:
        df_cl=get_craiglist(["state='{}'".format(state)])
    
    df_cl['st_co_fps'] = df_cl.fips_block.map(lambda x: x[:5])
    df_cl['mpo_id'] = df_cl.st_co_fps.map(match_mpo)

    # load and prepare census data for households
    hhs = get_census_by_state(state, table='households')
    hhs = format_hh_data(hhs)
    hhs_bg = aggregate_census(hhs)
    hhs_bg=hhs_bg.reset_index()
    hhs_bg['fips12']=state2fips[state]+hhs_bg['county']+hhs_bg['tract']+hhs_bg['block group'] # create 12-digit FIPS code for merging. 

    # merge with craigslist data. 
    merged = pd.merge(df_cl, hhs_bg, on='fips12',how='left')
    merged = merged.set_index('listing_id')

    #TODO: add persons data here, if needed. 

    # Keep only columns we'll need.
    cols_to_keep=['date','rent','bedrooms','bathrooms','sqft','rent_sqft','fips_block','state','region','mpo_id','lng','lat','cars_tot','children_tot','persons_tot','workers_tot','age_of_head_med','income_med','hhs_tot','race_of_head_1','race_of_head_2','race_of_head_3','race_of_head_4','race_of_head_5','race_of_head_6','race_of_head_7','race_of_head_8','race_of_head_9','recent_mover_0','recent_mover_1','tenure_1','tenure_2']
    
    # This is a bit of a hack in case some columns are missing in some states. 
    for col in cols_to_keep: 
        if col not in merged.columns:
            merged[col] = np.nan

    # save file either locally or remotely. 
    print('Saving data for {s}: {m} rows'.format(s=state,m=len(merged)))
    outfile = 'cl_census_{}.csv'.format(state)
    #merged[cols_to_keep].to_csv(os.path.join(DATA_DIR,outfile), index=True)  # uncomment to save locally

    #write_results_file(merged[cols_to_keep], outfile)  # uncomment to save remotely. 
    return merged[cols_to_keep]

Get data for a single region

Use this to get data for a single region, for use in our preliminary analysis



In [11]:

    
df_bayarea = run_all(state='CA',filters=["region = 'sfbay'","rent>0"])   # define whatever filters you want here.
df_bayarea.head()









    



["region = 'sfbay'", 'rent>0', "state='CA'"]






    



---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-11-3f9d58715a9e> in <module>()
----> 1 df_bayarea = run_all(state='CA',filters=["region = 'sfbay'","rent>0"])   # define whatever filters you want here.
      2 df_bayarea.head()

<ipython-input-10-f4f50040be1e> in run_all(state, filters)
     21 
     22     # load and prepare census data for households
---> 23     hhs = get_census_by_state(state, table='households')
     24     hhs = format_hh_data(hhs)
     25     hhs_bg = aggregate_census(hhs)

<ipython-input-9-d0074092a625> in get_census_by_state(state, table)
     39 def get_census_by_state(state, table='households'):
     40     """Return all census data for state given two-char abbreviation. Can be 'households' or 'persons' data. """
---> 41     filelist=sftp.listdir(CENSUS_DIR)
     42     if table=='households':
     43         files = [f for f in filelist if f[:5]=='hh_{}'.format(state)]

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in listdir(self, path)
    170         :param str path: path to list (defaults to ``'.'``)
    171         """
--> 172         return [f.filename for f in self.listdir_attr(path)]
    173 
    174     def listdir_attr(self, path='.'):

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in listdir_attr(self, path)
    198         while True:
    199             try:
--> 200                 t, msg = self._request(CMD_READDIR, handle)
    201             except EOFError:
    202                 # done with handle

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in _request(self, t, *arg)
    728     def _request(self, t, *arg):
    729         num = self._async_request(type(None), t, *arg)
--> 730         return self._read_response(num)
    731 
    732     def _async_request(self, fileobj, t, *arg):

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in _read_response(self, waitfor)
    758         while True:
    759             try:
--> 760                 t, data = self._read_packet()
    761             except EOFError as e:
    762                 raise SSHException('Server connection dropped: %s' % str(e))

//anaconda/lib/python3.5/site-packages/paramiko/sftp.py in _read_packet(self)
    171 
    172     def _read_packet(self):
--> 173         x = self._read_all(4)
    174         # most sftp servers won't accept packets larger than about 32k, so
    175         # anything with the high byte set (> 16MB) is just garbage.

//anaconda/lib/python3.5/site-packages/paramiko/sftp.py in _read_all(self, n)
    154                         break
    155             else:
--> 156                 x = self.sock.recv(n)
    157 
    158             if len(x) == 0:

//anaconda/lib/python3.5/site-packages/paramiko/channel.py in recv(self, nbytes)
    663         """
    664         try:
--> 665             out = self.in_buffer.read(nbytes, self.timeout)
    666         except PipeTimeout:
    667             raise socket.timeout()

//anaconda/lib/python3.5/site-packages/paramiko/buffered_pipe.py in read(self, nbytes, timeout)
    154                 while (len(self._buffer) == 0) and not self._closed:
    155                     then = time.time()
--> 156                     self._cv.wait(timeout)
    157                     if timeout is not None:
    158                         timeout -= time.time() - then

//anaconda/lib/python3.5/threading.py in wait(self, timeout)
    291         try:    # restore state no matter what (e.g., KeyboardInterrupt)
    292             if timeout is None:
--> 293                 waiter.acquire()
    294                 gotit = True
    295             else:

KeyboardInterrupt:



In [ ]:

    
# save locally
outfile = 'sfbay_listings_04282017.csv'
df_bayarea.to_csv(os.path.join(DATA_DIR,outfile), index=True)

Process all data by state

Use this to merge all the data by state



In [ ]:

    
for state in fips_state['USPS']:# uncomment when done with testing. 
    if state != 'DC':   # the DC census data is missing. 
        print('\n Working on',state)
        df_state = run_all(state)



In [ ]:

    
df_state.head()



In [ ]:

    
ssh.close()

Merge listings data with urbansim block-level data



In [17]:

    
# first unzip csv file into temp dir
os.mkdir('temp') # make temp dir for unzipped files
zip_ref = zipfile.ZipFile(os.path.join(BLOCK_DIR,BLOCK_ZFILE), 'r')
zip_ref.extractall('temp')
zip_ref.close()



In [18]:

    
# temporarily read first 100 lines just to see header names
df_temp = pd.read_csv(os.path.join('temp',BLOCK_FILE), nrows=100, dtype={'block_id':str})
print(df_temp.shape)
df_temp.head()









    



(100, 1015)






    Out[18]:






  
    
      
      block_id
      x
      y
      res_rents
      res_values
      square_meters_land
      puma10_id
      residential_unit_capacity
      employment_capacity
      rent_impute
      ...
      tracts_prop_persons_8
      tracts_prop_persons_9
      nodes_du_1500m
      nodes_ave_year_built_3000m
      tracts_prop_persons_5
      tracts_prop_persons_6
      tracts_prop_persons_7
      tracts_prop_persons_1
      tracts_prop_persons_2
      tracts_prop_persons_3
    
  
  
    
      0
      060014271001000
      -122.233867
      37.770270
      1475.0
      777700.0
      0
      600105
      0
      0
      1
      ...
      0.0
      0.0
      7.241375
      1949.245239
      0.033789
      0.014378
      0.007189
      0.244428
      0.35514
      0.180446
    
    
      1
      060014271001001
      -122.233991
      37.769464
      1253.0
      567900.0
      79696
      600105
      105
      367
      0
      ...
      0.0
      0.0
      7.241375
      1949.245239
      0.033789
      0.014378
      0.007189
      0.244428
      0.35514
      0.180446
    
    
      2
      060014271001002
      -122.234301
      37.768350
      1475.0
      777700.0
      739
      600105
      0
      23
      1
      ...
      0.0
      0.0
      7.241375
      1949.245239
      0.033789
      0.014378
      0.007189
      0.244428
      0.35514
      0.180446
    
    
      3
      060014271001003
      -122.235213
      37.768827
      1253.0
      567900.0
      19546
      600105
      48
      108
      0
      ...
      0.0
      0.0
      7.475714
      1949.422974
      0.033789
      0.014378
      0.007189
      0.244428
      0.35514
      0.180446
    
    
      4
      060014271001004
      -122.236751
      37.769734
      1253.0
      567900.0
      14364
      600105
      35
      105
      0
      ...
      0.0
      0.0
      7.157094
      1950.110962
      0.033789
      0.014378
      0.007189
      0.244428
      0.35514
      0.180446
    
  

5 rows × 1015 columns



In [19]:

    
# define the columns we need 
block_cols = df_temp.columns



In [26]:

    
#df_temp.columns[:100]

Certain columns are definitely not useful. E.g, 'puma10_id_is_0609502', 'puma10_id_is_0609503' 'tracts_mean_y', 'tracts_mean_x'

TODO: spend more time choosing features



In [21]:

    
# make sure to now include unneeded columns like these:
unneeded_cols = [x for x in block_cols if x.startswith('puma10_id_is_')]+
[x for x in block_cols if (x.endswith('mean_y'))|(x.endswith('mean_x'))]+
[x for x in block_cols if (x.endswith('std_y'))|(x.endswith('std_x'))]+
[x for x in block_cols if x.startswith('pumas_prop_sector_id')]+
[x for x in block_cols if x.startswith('county_id_is_')]+
[x for x in block_cols if x.startswith('tracts_prop_sector_id')]+
[x for x in block_cols if x.startswith('counties_prop_sector_id')]









    



  File "<ipython-input-21-c056f67af7df>", line 2
    unneeded_cols = [x for x in block_cols if x.startswith('puma10_id_is_')]+
                                                                             ^
SyntaxError: invalid syntax



In [ ]:

    
len(unneeded_cols)



In [33]:

    
cols_to_use = ['block_id','nodes_population_1500m','block_groups_total_jobs', 'block_groups_median_children', 
       'block_groups_median_income', 'prop_tenure_1', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m', 
       'nodes_jobs_3000m','nodes_jobs_20km', 'nodes_population_400m', 'nodes_population_800m', 
       'block_groups_prop_race_of_head_1','block_groups_prop_race_of_head_2', 'block_groups_prop_race_of_head_3', 
       'block_groups_prop_race_of_head_7','block_groups_prop_race_of_head_8','block_groups_prop_race_of_head_6',
       'pumas_density_residential_units','block_groups_density_jobs', 
       'nodes_jobs_1500m_4445','nodes_jobs_3000m_4445', 
       'nodes_du_5000m','nodes_du_800m','block_groups_median_rent',
       'block_groups_median_persons', 'block_groups_median_age_of_head', 'nodes_ave_year_built_800m']



In [34]:

    
for col in cols_to_use: 
    if col not in block_cols:
        print(col)



In [35]:

    
# Read all rows, using only the columns we want
df_blocks = pd.read_csv(os.path.join('temp',BLOCK_FILE),dtype={'block_id':str}, usecols = cols_to_use)
print(df_blocks.shape)
df_blocks.head()









    



(109228, 28)






    Out[35]:






  
    
      
      block_id
      nodes_low_income_hh_1500m
      block_groups_median_children
      nodes_jobs_3000m_4445
      block_groups_total_jobs
      nodes_population_1500m
      prop_tenure_1
      block_groups_median_persons
      block_groups_median_age_of_head
      nodes_population_800m
      ...
      nodes_du_5000m
      nodes_jobs_1500m_4445
      pumas_density_residential_units
      nodes_jobs_3000m
      block_groups_median_rent
      nodes_jobs_20km
      block_groups_density_jobs
      block_groups_median_income
      nodes_population_400m
      nodes_du_800m
    
  
  
    
      0
      060014271001000
      5.991451
      0.0
      6.705032
      465.0
      9.113499
      0.000
      2.0
      57.0
      7.809947
      ...
      9.915149
      5.062971
      3.339906
      8.821410
      1253.0
      12.139385
      5.103936
      80000.0
      6.373320
      5.841072
    
    
      1
      060014271001001
      5.991451
      0.0
      6.705032
      465.0
      9.113499
      0.200
      2.0
      57.0
      7.809947
      ...
      9.915149
      5.062971
      3.339906
      8.821410
      1253.0
      12.139385
      5.103936
      80000.0
      6.373320
      5.841072
    
    
      2
      060014271001002
      5.991451
      0.0
      6.705032
      465.0
      9.113499
      0.000
      2.0
      57.0
      7.809947
      ...
      9.915149
      5.062971
      3.339906
      8.821410
      1253.0
      12.139385
      5.103936
      80000.0
      6.373320
      5.841072
    
    
      3
      060014271001003
      6.231131
      0.0
      6.814257
      465.0
      9.536473
      0.300
      2.0
      57.0
      7.960672
      ...
      9.982758
      5.257902
      3.339906
      8.925516
      1253.0
      12.150690
      5.103936
      80000.0
      6.703188
      6.253922
    
    
      4
      060014271001004
      6.107350
      0.0
      6.982764
      465.0
      9.355220
      0.375
      2.0
      57.0
      7.547502
      ...
      10.062321
      5.541484
      3.339906
      9.048043
      1253.0
      12.218818
      5.103936
      80000.0
      6.396930
      5.760618
    
  

5 rows × 28 columns



In [36]:

    
df_blocks['block_id'].head()









    Out[36]:





0    060014271001000
1    060014271001001
2    060014271001002
3    060014271001003
4    060014271001004
Name: block_id, dtype: object



In [37]:

    
df_listings = get_craiglist(filters = ["region='sfbay'","rent>100"])



In [38]:

    
# merge listings with vars on block_id
df_listings.fips_block.head()

print(len(df_listings), len(df_blocks))
df_merged = pd.merge(df_listings, df_blocks, left_on='fips_block', right_on='block_id', how='inner')
if len(df_merged)<len(df_listings):
    print('Warning: only {0} of {1} rows matched'.format(len(df_merged), len(df_listings)))









    



125560 109228
Warning: only 123463 of 125560 rows matched



In [67]:

    
# save to hdf
outfile = 'ba_listings.h5'
df_merged.to_hdf(os.path.join(DATA_DIR,outfile),'merged')


outfile = 'ba_listings.csv'
df_merged.to_csv(os.path.join(DATA_DIR,outfile),index=False)









    



//anaconda/lib/python3.5/site-packages/pandas/core/generic.py:1138: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['listing_id', 'date', 'fips_block', 'state', 'region', 'block', 'fips12', 'block_id']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

Store data to database

We need to create a database table to hold the block-level variables for the rent-predictor app.

Steps:

Create table schema

Assume all features are floats unless otherwise specified.

(If have a lot that aren't floats, might want to store the variable names and their types as a json so we can refer back to it when we change the variables.)

Copy data into table

Copy data into table. COPY is the fastest way to insert a large amount of data (https://www.postgresql.org/docs/current/static/populate.html)



In [90]:

    
# connect to local databse

DBNAME = settings['DBNAME_RENT']
# USER = settings['USER_RENT']
# PASSWORD = settings['PASSWORD_RENT']

conn_str = "dbname = {0}".format(DBNAME)

try:
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
except:
    print ("Cannot connection. Check settings")



In [54]:

    
# first save data as csv. 
# let's use that temp dir again
FULL_PATH = '/Users/lisarayle/Dropbox/craigslist/src/'  # can't use relative path in postgres, I guess
csvfile = 'blocks_temp.csv'
df_blocks.to_csv(os.path.join('temp',csvfile), index=False)

table_name = 'block_vars'



In [91]:

    
def create_db_table(col_names,t_name,id_var='block_id'):
    """Create a new table with schema to hold the block-level data. 
    Args: 
        col_names (list): list of names of columns to use. First one can be 'block id'
        t_name (str): name of database table
        id_var (str): name of id variable (default: 'block_id')
    """
    # drop table if already exists
    q = "DROP TABLE IF EXISTS {}".format(t_name)
    cur.execute(q)
    conn.commit()

    # build the SQL string
    sql_begin = "CREATE TABLE {0} (id BIGSERIAL PRIMARY KEY, {1} varchar(15) not null, ".format(t_name, id_var)
    if col_names[0]==id_var:
        sql_middle = " real,".join([c for c in col_names[1:]]) # leave off block id if it's there. 
    else: 
        sql_middle = " real,".join([c for c in col_names])

    sql_end = " real);"
    q = sql_begin+sql_middle+sql_end
    
    cur.execute(q)
    conn.commit()
    return 

def copy_block_data(col_names,t_name,fname):
    """Copy data from csv file into block variables table. 
    Args: 
        col_names (list): list of names of columns to use. First one can be 'block id'
        t_name (str): name of database table
        fname (str): name of csv file with data
    """
    var_string = ','.join([c for c in col_names])
    q="COPY {t}({v}) FROM '{f}' DELIMITERS ',' CSV HEADER;".format(t=t_name,v=var_string, f=os.path.join(FULL_PATH,'temp',fname))
    print(q)
    cur.execute(q)
    conn.commit()
    return



In [92]:

    
create_db_table(cols_to_use, table_name)
copy_block_data(cols_to_use, table_name, csvfile)









    



COPY block_vars(block_id,nodes_population_1500m,block_groups_total_jobs,block_groups_median_children,block_groups_median_income,prop_tenure_1,nodes_low_income_hh_1500m,nodes_high_income_hh_1500m,nodes_jobs_3000m,nodes_jobs_20km,nodes_population_400m,nodes_population_800m,block_groups_prop_race_of_head_1,block_groups_prop_race_of_head_2,block_groups_prop_race_of_head_3,block_groups_prop_race_of_head_7,block_groups_prop_race_of_head_8,block_groups_prop_race_of_head_6,pumas_density_residential_units,block_groups_density_jobs,nodes_jobs_1500m_4445,nodes_jobs_3000m_4445,nodes_du_5000m,nodes_du_800m,block_groups_median_rent,block_groups_median_persons,block_groups_median_age_of_head,nodes_ave_year_built_800m) FROM '/Users/lisarayle/Dropbox/craigslist/src/temp/blocks_temp.csv' DELIMITERS ',' CSV HEADER;



In [57]:

    
# test queries

q = "select count(*) from block_vars;"
run_query(q)

q = "select column_name from information_schema.columns where table_name='block_vars';"
run_query(q)

q = "select * from block_vars limit 10;"
run_query(q)









    Out[57]:





[(1,
  '060014271001000',
  5.99145,
  0.0,
  6.70503,
  465.0,
  9.1135,
  0.0,
  2.0,
  57.0,
  7.80995,
  6.1145,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.73,
  9.91515,
  5.06297,
  3.33991,
  8.82141,
  1253.0,
  12.1394,
  5.10394,
  80000.0,
  6.37332,
  5.84107),
 (2,
  '060014271001001',
  5.99145,
  0.0,
  6.70503,
  465.0,
  9.1135,
  0.2,
  2.0,
  57.0,
  7.80995,
  6.1145,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.73,
  9.91515,
  5.06297,
  3.33991,
  8.82141,
  1253.0,
  12.1394,
  5.10394,
  80000.0,
  6.37332,
  5.84107),
 (3,
  '060014271001002',
  5.99145,
  0.0,
  6.70503,
  465.0,
  9.1135,
  0.0,
  2.0,
  57.0,
  7.80995,
  6.1145,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.73,
  9.91515,
  5.06297,
  3.33991,
  8.82141,
  1253.0,
  12.1394,
  5.10394,
  80000.0,
  6.37332,
  5.84107),
 (4,
  '060014271001003',
  6.23113,
  0.0,
  6.81426,
  465.0,
  9.53647,
  0.3,
  2.0,
  57.0,
  7.96067,
  6.33903,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.53,
  9.98276,
  5.2579,
  3.33991,
  8.92552,
  1253.0,
  12.1507,
  5.10394,
  80000.0,
  6.70319,
  6.25392),
 (5,
  '060014271001004',
  6.10735,
  0.0,
  6.98276,
  465.0,
  9.35522,
  0.375,
  2.0,
  57.0,
  7.5475,
  5.76372,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1956.93,
  10.0623,
  5.54148,
  3.33991,
  9.04804,
  1253.0,
  12.2188,
  5.10394,
  80000.0,
  6.39693,
  5.76062),
 (6,
  '060014271001005',
  6.18346,
  0.0,
  6.97409,
  465.0,
  9.36529,
  0.0,
  2.0,
  57.0,
  7.55433,
  5.85112,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1948.95,
  10.0493,
  5.5048,
  3.33991,
  9.04969,
  1253.0,
  12.2108,
  5.10394,
  80000.0,
  6.36819,
  5.80167),
 (7,
  '060014271001006',
  6.18346,
  0.0,
  6.97409,
  465.0,
  9.36529,
  0.0,
  2.0,
  57.0,
  7.55433,
  5.85112,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1948.95,
  10.0493,
  5.5048,
  3.33991,
  9.04969,
  1253.0,
  12.2108,
  5.10394,
  80000.0,
  6.36819,
  5.80167),
 (8,
  '060014271001007',
  6.23485,
  0.0,
  6.96965,
  465.0,
  9.42173,
  0.666667,
  2.0,
  57.0,
  7.50163,
  5.90052,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1948.78,
  10.0434,
  5.48594,
  3.33991,
  9.05139,
  1253.0,
  12.2067,
  5.10394,
  80000.0,
  5.80212,
  5.85131),
 (9,
  '060014271001008',
  5.98591,
  0.0,
  6.88824,
  465.0,
  9.30356,
  0.0,
  2.0,
  57.0,
  7.96763,
  6.2831,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.3,
  10.0533,
  5.32579,
  3.33991,
  8.96883,
  1253.0,
  12.1763,
  5.10394,
  80000.0,
  6.56527,
  5.99314),
 (10,
  '060014271001009',
  5.90645,
  0.0,
  6.89457,
  465.0,
  9.34592,
  0.0,
  2.0,
  57.0,
  7.86864,
  6.17115,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.38,
  10.0666,
  5.33424,
  3.33991,
  8.96128,
  1253.0,
  12.1815,
  5.10394,
  80000.0,
  5.98645,
  5.66477)]



In [72]:

    
cols_to_use









    Out[72]:





['block_id',
 'nodes_population_1500m',
 'block_groups_total_jobs',
 'block_groups_median_children',
 'block_groups_median_income',
 'prop_tenure_1',
 'nodes_low_income_hh_1500m',
 'nodes_high_income_hh_1500m',
 'nodes_jobs_3000m',
 'nodes_jobs_20km',
 'nodes_population_400m',
 'nodes_population_800m',
 'block_groups_prop_race_of_head_1',
 'block_groups_prop_race_of_head_2',
 'block_groups_prop_race_of_head_3',
 'block_groups_prop_race_of_head_7',
 'block_groups_prop_race_of_head_8',
 'block_groups_prop_race_of_head_6',
 'pumas_density_residential_units',
 'block_groups_density_jobs',
 'nodes_jobs_1500m_4445',
 'nodes_jobs_3000m_4445',
 'nodes_du_5000m',
 'nodes_du_800m',
 'block_groups_median_rent',
 'block_groups_median_persons',
 'block_groups_median_age_of_head',
 'nodes_ave_year_built_800m']



In [78]:

    
DATA_DIR='/Users/lisarayle/rent_predictor/data/'
"""Path to data directory"""

# read file with variable names 
infile = 'variables.txt'
with open(os.path.join(DATA_DIR, infile), 'r') as f:
    VARLIST = f.read().split(',')
print(VARLIST)
len(VARLIST)









    



['nodes_population_1500m', 'block_groups_total_jobs', 'block_groups_median_children', 'block_groups_median_income', 'prop_tenure_1', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m', 'nodes_jobs_3000m', 'nodes_jobs_20km', 'nodes_population_400m', 'nodes_population_800m', 'block_groups_prop_race_of_head_1', 'block_groups_prop_race_of_head_2', 'block_groups_prop_race_of_head_3', 'block_groups_prop_race_of_head_7', 'block_groups_prop_race_of_head_8', 'block_groups_prop_race_of_head_6', 'pumas_density_residential_units', 'block_groups_density_jobs', 'nodes_jobs_1500m_4445', 'nodes_jobs_3000m_4445', 'nodes_du_5000m', 'nodes_du_800m', 'block_groups_median_rent', 'block_groups_median_persons', 'block_groups_median_age_of_head', 'nodes_ave_year_built_800m']






    Out[78]:





27



In [ ]:

	ID	MPO_ID	MPONAME	STATE	STFIPS	STATEFP	COUNTYFP	COUNTYNS	GEOID	NAME	...	CBSAFP	METDIVFP	FUNCSTAT	ALAND	AWATER	INTPTLAT	INTPTLON	PERIMETER	area_sqmi	st_co_fips
0	1826900.0	47198201	Johnson City Metropolitan Transportation Plann...	TN	47	47	163	1639793	47163	Sullivan	...	28700.0	NaN	A	1.070725e+09	4.220920e+07	36.510213	-82.299396	71837.801334	0.756540	47163
1	1541707.0	29201300	Southeast Metropolitan Planning Organization (...	MO	29	17	003	424203	17003	Alexander	...	16020.0	NaN	A	6.099969e+08	4.423719e+07	37.183657	-89.349516	6972.897419	0.591829	29003
2	2101128.0	51197401	Richmond Area MPO	VA	51	51	670	1498428	51670	Hopewell	...	40060.0	NaN	F	2.665508e+07	1.324078e+06	37.291010	-77.298944	6041.591069	0.257294	51670
3	220693.0	15201300	Maui MPO	HI	15	15	009	365283	15009	Maui	...	27980.0	NaN	A	-1.286608e+09	-1.091674e+09	20.855931	-156.601550	299148.486465	728.079214	15009
4	220707.0	15197500	Oahu MPO	HI	15	15	003	365281	15003	Honolulu	...	46520.0	NaN	A	1.555505e+09	-2.400273e+08	21.461365	-158.201974	329335.691836	610.910815	15003

	block_id	x	y	res_rents	res_values	square_meters_land	puma10_id	residential_unit_capacity	employment_capacity	rent_impute	...	nodes_du_1500m	nodes_ave_year_built_3000m	tracts_prop_persons_5	tracts_prop_persons_6	tracts_prop_persons_7	tracts_prop_persons_1	tracts_prop_persons_2	tracts_prop_persons_3
0	060014271001000	-122.233867	37.770270	1475.0	777700.0	0	600105	0	0	1	...	7.241375	1949.245239	0.033789	0.014378	0.007189	0.244428	0.35514	0.180446
1	060014271001001	-122.233991	37.769464	1253.0	567900.0	79696	600105	105	367	0	...	7.241375	1949.245239	0.033789	0.014378	0.007189	0.244428	0.35514	0.180446
2	060014271001002	-122.234301	37.768350	1475.0	777700.0	739	600105	0	23	1	...	7.241375	1949.245239	0.033789	0.014378	0.007189	0.244428	0.35514	0.180446
3	060014271001003	-122.235213	37.768827	1253.0	567900.0	19546	600105	48	108	0	...	7.475714	1949.422974	0.033789	0.014378	0.007189	0.244428	0.35514	0.180446
4	060014271001004	-122.236751	37.769734	1253.0	567900.0	14364	600105	35	105	0	...	7.157094	1950.110962	0.033789	0.014378	0.007189	0.244428	0.35514	0.180446

	block_id	nodes_low_income_hh_1500m	nodes_jobs_3000m_4445	block_groups_total_jobs	nodes_population_1500m	prop_tenure_1	block_groups_median_persons	block_groups_median_age_of_head	nodes_population_800m	...	nodes_du_5000m	nodes_jobs_1500m_4445	pumas_density_residential_units	nodes_jobs_3000m	block_groups_median_rent	nodes_jobs_20km	block_groups_density_jobs	block_groups_median_income	nodes_population_400m	nodes_du_800m
0	060014271001000	5.991451	6.705032	465.0	9.113499	0.000	2.0	57.0	7.809947	...	9.915149	5.062971	3.339906	8.821410	1253.0	12.139385	5.103936	80000.0	6.373320	5.841072
1	060014271001001	5.991451	6.705032	465.0	9.113499	0.200	2.0	57.0	7.809947	...	9.915149	5.062971	3.339906	8.821410	1253.0	12.139385	5.103936	80000.0	6.373320	5.841072
2	060014271001002	5.991451	6.705032	465.0	9.113499	0.000	2.0	57.0	7.809947	...	9.915149	5.062971	3.339906	8.821410	1253.0	12.139385	5.103936	80000.0	6.373320	5.841072
3	060014271001003	6.231131	6.814257	465.0	9.536473	0.300	2.0	57.0	7.960672	...	9.982758	5.257902	3.339906	8.925516	1253.0	12.150690	5.103936	80000.0	6.703188	6.253922
4	060014271001004	6.107350	6.982764	465.0	9.355220	0.375	2.0	57.0	7.547502	...	10.062321	5.541484	3.339906	9.048043	1253.0	12.218818	5.103936	80000.0	6.396930	5.760618