Merge craiglist data with synthetic census data

Steps:

  1. Get craigslist data
    • Query from remote database in batches based on FIPS
    • Joing to census via FIPS code
  2. Get census data
    • Aggregate to BG
  3. Merge together

Can process in batches, one state at a time, or get data by region.

TODO:

  • The census data for DC are missing. The files are there but they contain no data.
  • If need to look at education or other person data, aggregate person data to household and then block group

In [4]:
import pandas as pd
import psycopg2
import paramiko
import os
import numpy as np
import json
import zipfile

Craigslist data table columns

pid | character varying(25) | date | date | region | character varying(50) | neighborhood | character varying(200) | rent | double precision | bedrooms | double precision | sqft | double precision | rent_sqft | double precision | longitude | double precision | latitude | double precision | county | character varying(20) | fips_block | character varying(20) | state | character varying(20) | bathrooms

FIPS code format

53-----033---001701--1--015

[state][county][tract][bg][block]

Note: for DC, county='001'

Synthetic census data - variables

Household data: household_id,serialno,persons,cars,income,race_of_head,age_of_head,workers,state,county,tract,block group,children,tenure,recent_mover

Person data: person_id,member_id,age,relate,edu,sex,hours,earning,race_id,household_id,student,work_at_home,worker

Configure connection parameters

Make sure you have two files saved in the same directory, postgres_settings.json and ssh_settings.json.


In [82]:
DATA_DIR=os.path.join('..','data') 
"""Path to local data directory"""

#read postgres connection parameters
with open('postgres_settings.json') as settings_file:    
    settings = json.load(settings_file)

DBNAME = settings['dbname']
USER = settings['user']
HOST = settings['host']
PASSWORD = settings['password']

conn_str = "dbname = {0} user = {1} host = {2} password = {3}".format(DBNAME, USER, HOST, PASSWORD)

try:
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
except:
    print ("Cannot connection. Check settings")

In [13]:
# TODO: add putty connection too. 
#read SSH connection parameters
with open('ssh_settings.json') as settings_file:    
    settings = json.load(settings_file)

HOSTNAME = settings['hostname']
USERNAME = settings['username']
PASSWORD = settings['password']
LOCAL_KEY_DIR = settings['local_key_dir']

CENSUS_DIR = 'synthetic_population'
"""Remote directory with census data"""

RESULTS_DIR = 'craigslist_census'
"""Remote directory for results"""

# estbalish SSH connection
ssh = paramiko.SSHClient() 
ssh.load_host_keys(LOCAL_KEY_DIR)
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOSTNAME,username=USERNAME, password=PASSWORD)
sftp = ssh.open_sftp()

Local data files

Local files with with block-level data from urbansim


In [14]:
BLOCK_DIR = os.path.join('..','data','urbansim')
BLOCK_ZFILE = 'ba_block_variables.csv.zip'
BLOCK_FILE = 'ba_block_variables.csv'

Create FIPS look-up tables


In [15]:
# make dictionary of states and fips codes. 
fips_state = pd.read_csv(os.path.join(DATA_DIR,'state_fips_codes.csv'),dtype=str)
fips2state=dict(zip(fips_state['FIPS'],fips_state['USPS']))
state2fips=dict(zip(fips_state['USPS'],fips_state['FIPS']))

# Make lookup for county to MPO code 
mpo_counties = pd.read_csv(os.path.join(DATA_DIR,'us_2015_mpo_regions_counties_v1.csv'), encoding='latin1', dtype={'MPO_ID':str,'COUNTYFP':str,'STFIPS':str})
mpo_counties['COUNTYFP'] = mpo_counties['COUNTYFP'].str.zfill(2)  
mpo_counties['st_co_fips'] = mpo_counties['STFIPS']+mpo_counties['COUNTYFP']  # we will want to join on 2-char state + 3-char county fips
county2mpo=dict(zip(mpo_counties['st_co_fips'],mpo_counties['MPO_ID']))  # do we want MPO_ID or do we want GEOID? 
mpo_counties.head()


Out[15]:
ID MPO_ID MPONAME STATE STFIPS STATEFP COUNTYFP COUNTYNS GEOID NAME ... CBSAFP METDIVFP FUNCSTAT ALAND AWATER INTPTLAT INTPTLON PERIMETER area_sqmi st_co_fips
0 1826900.0 47198201 Johnson City Metropolitan Transportation Plann... TN 47 47 163 1639793 47163 Sullivan ... 28700.0 NaN A 1.070725e+09 4.220920e+07 36.510213 -82.299396 71837.801334 0.756540 47163
1 1541707.0 29201300 Southeast Metropolitan Planning Organization (... MO 29 17 003 424203 17003 Alexander ... 16020.0 NaN A 6.099969e+08 4.423719e+07 37.183657 -89.349516 6972.897419 0.591829 29003
2 2101128.0 51197401 Richmond Area MPO VA 51 51 670 1498428 51670 Hopewell ... 40060.0 NaN F 2.665508e+07 1.324078e+06 37.291010 -77.298944 6041.591069 0.257294 51670
3 220693.0 15201300 Maui MPO HI 15 15 009 365283 15009 Maui ... 27980.0 NaN A -1.286608e+09 -1.091674e+09 20.855931 -156.601550 299148.486465 728.079214 15009
4 220707.0 15197500 Oahu MPO HI 15 15 003 365281 15003 Honolulu ... 46520.0 NaN A 1.555505e+09 -2.400273e+08 21.461365 -158.201974 329335.691836 610.910815 15003

5 rows × 25 columns


In [16]:
def run_query(q):
    """ Get results given SQL query"""
    cur.execute(q)
    return(cur.fetchall())

def get_craiglist(filters, split_fips=True):
    """Get craiglist data from database.
    Args: 
        filters (list): list of strings containing filter criteria. Format as individual SQL WHERE statements. E.g., ["region='sandiego'","rent>100"]
        split_fips (bool): if True, split fips code into block and fips12 (useful if merging wtih blockgroup)
    Returns: 
        DataFrame: listings data. 
    """
    #q="SELECT pid,date,rent,bedrooms,bathrooms,sqft,rent_sqft,fips_block,state,region,longitude,latitude FROM rental_listings WHERE state='{}';".format(state)
    filters_str = ' AND '.join([x for x in filters])
    q="SELECT pid,date,rent,bedrooms,bathrooms,sqft,rent_sqft,fips_block,state,region,longitude,latitude FROM rental_listings WHERE {};".format(filters_str)
    results=run_query(q)
    df=pd.DataFrame(results,columns=['listing_id', 'date','rent','bedrooms','bathrooms','sqft','rent_sqft','fips_block','state','region','lng','lat'] )  # put it all into a dataframe
    if split_fips==True:
        # split FIPS into different columns - split off the last 3 chars
        df['block']=df.fips_block.str[-4:]
        df['fips12']=df.fips_block.str[:-3]
    return(df)

def read_census_file(fname):
    """Read census csv file via SFTP and return as dataframe."""
    with sftp.open(os.path.join(CENSUS_DIR,fname)) as f:
        df = pd.read_csv(f, delimiter=',',dtype={'age_of_head':float, 'block group':str, 'cars':float, 'children':float, 'county':str,
           'household_id':str, 'income':float, 'persons':float, 'race_of_head':str, 'recent_mover':str,
           'serialno':str, 'state':str, 'tenure':str, 'tract':str, 'workers':float})
    return df

def write_results_file(data,fname):
    """Write merged data to csv file via SFTP."""
    with sftp.open(os.path.join(RESULTS_DIR,fname),'w') as f:
        data.to_csv(f,index=True)
    return

def get_census_by_state(state, table='households'): 
    """Return all census data for state given two-char abbreviation. Can be 'households' or 'persons' data. """ 
    filelist=sftp.listdir(CENSUS_DIR)
    if table=='households':
        files = [f for f in filelist if f[:5]=='hh_{}'.format(state)]
    elif table=='persons':
        files = [f for f in filelist if f[:4]=='p_{}'.format(state)]
    #files = files[:3]  # uncomment this line for testing.
    new_df = pd.DataFrame()
    for f in files:
        df = read_census_file(f)
        new_df = pd.concat([new_df,df])
    return(new_df)

def strip_zeros(s):
    """Remove '.0 from end of string"""
    if s.endswith('.0'):
        return(s[:-2])
    else:
        return(s)

def format_hh_data(df):
    """Fix formatting for hhs census data. Replace '' strings with zero. Format other strings."""

    df['county'] = df['county'].str.zfill(2)  # make county 3-char string.
    
    for col in ['children','workers']:
        df[col] = df[col].replace('','0')

    for col in ['race_of_head','recent_mover','tenure']:
        df[col] = df[col].astype(str)
        df[col] = df[col].map(strip_zeros)  # make sure strings are formatted. 
    return(df)

def aggregate_census(df, groupby_cols=['county','tract','block group'],cols_to_sum=['cars','children','persons','workers'], cols_to_median=['age_of_head','income'],categ_cols=['race_of_head','recent_mover','tenure'],id_col='serialno',table='hhs'):
    """Aggregate census table to block group. Made this for hh data, may need to revised for persons data.
    Args: 
        groupby_cols (list): names of columns to group by (default=['county','tract','block group'])
        cols_to_sum (list): names of columns for which to compute totals. 
        cols_to_median (list): names of columns for which to compute medians
        categ_cols (list): names of categorical columns
        id_col (str): name of column that serves as the id column, to use in counting rows. 
        table (str): 'hhs' (default) or 'per'
    Returns: 
        DataFrame: aggregated data. 
        """
    # For some columns we'll want to find the sum or average/median. These will need only a simple groupby 
    sums = df.groupby(by=groupby_cols).sum()[cols_to_sum]
    sums.columns = [x+'_tot' for x in cols_to_sum]
    
    medians = df.groupby(by=groupby_cols).median()[cols_to_median]
    medians.columns = [x+'_med' for x in cols_to_median]
    
    counts = pd.DataFrame(df.groupby(by=groupby_cols).count()[id_col])
    counts.columns=[table+'_tot']

    # Categorical columns will need pivot tables. 
    categoricals = pd.DataFrame(index=counts.index)
    for col in categ_cols:
        pivoted=df.pivot_table(index = groupby_cols, columns = col, aggfunc='count')[id_col]
        pivoted.columns = [col+'_'+x for x in pivoted.columns]
        pivoted.columns = pivoted.columns.map(strip_zeros)
        # merge back together
        categoricals = pd.merge(categoricals, pivoted, left_index=True, right_index=True)

    # put all back together in one table
    merged = pd.merge(sums, medians, left_index=True, right_index=True)
    merged = pd.merge(merged, counts, left_index=True, right_index=True)
    merged = pd.merge(merged, categoricals, left_index=True, right_index=True)

    # check lengths of dataframes to detect any problems in grouping or merging
    lengths = [len(sums),len(medians),len(counts),len(categoricals),len(merged)]
    if len(set(lengths))>1:
        print('Warning: Aggregated tables have different lengths.',lengths,'for sums, medians, counts, categoricals, and merged.')
    
    return(merged)

def match_mpo(s, mpo_dict=county2mpo):
    """Match a 5-char state-county FIPS code to an MPO code
    Args: 
        s (str): 5-char state-county string
        mpo_dict (dict): county2mpo dictionary
    Returns: 
        str: MPO code
    """
    try: 
        return mpo_dict[s]
    except KeyError: # in this case, the county is not in an MPO
        return ''

In [10]:
def run_all(state, filters=None):
    """Get craigslist data and merge with census data, by state, and save.  with additional filters if needed. 
    Args: 
        state (str): 2-char state abbreviation
        filters (list): additional filters. Do not need to include state in filter list
    """
    
    # load and prepare craiglist data
    
    # If filters are provided, use them to filter data
    if filters:
        filters.append("state='{}'".format(state))
        print(filters)
        df_cl=get_craiglist(filters)
    # If no filters provided, get all data for the specified state. 
    else:
        df_cl=get_craiglist(["state='{}'".format(state)])
    
    df_cl['st_co_fps'] = df_cl.fips_block.map(lambda x: x[:5])
    df_cl['mpo_id'] = df_cl.st_co_fps.map(match_mpo)

    # load and prepare census data for households
    hhs = get_census_by_state(state, table='households')
    hhs = format_hh_data(hhs)
    hhs_bg = aggregate_census(hhs)
    hhs_bg=hhs_bg.reset_index()
    hhs_bg['fips12']=state2fips[state]+hhs_bg['county']+hhs_bg['tract']+hhs_bg['block group'] # create 12-digit FIPS code for merging. 

    # merge with craigslist data. 
    merged = pd.merge(df_cl, hhs_bg, on='fips12',how='left')
    merged = merged.set_index('listing_id')

    #TODO: add persons data here, if needed. 

    # Keep only columns we'll need.
    cols_to_keep=['date','rent','bedrooms','bathrooms','sqft','rent_sqft','fips_block','state','region','mpo_id','lng','lat','cars_tot','children_tot','persons_tot','workers_tot','age_of_head_med','income_med','hhs_tot','race_of_head_1','race_of_head_2','race_of_head_3','race_of_head_4','race_of_head_5','race_of_head_6','race_of_head_7','race_of_head_8','race_of_head_9','recent_mover_0','recent_mover_1','tenure_1','tenure_2']
    
    # This is a bit of a hack in case some columns are missing in some states. 
    for col in cols_to_keep: 
        if col not in merged.columns:
            merged[col] = np.nan

    # save file either locally or remotely. 
    print('Saving data for {s}: {m} rows'.format(s=state,m=len(merged)))
    outfile = 'cl_census_{}.csv'.format(state)
    #merged[cols_to_keep].to_csv(os.path.join(DATA_DIR,outfile), index=True)  # uncomment to save locally

    #write_results_file(merged[cols_to_keep], outfile)  # uncomment to save remotely. 
    return merged[cols_to_keep]

Get data for a single region

Use this to get data for a single region, for use in our preliminary analysis


In [11]:
df_bayarea = run_all(state='CA',filters=["region = 'sfbay'","rent>0"])   # define whatever filters you want here.
df_bayarea.head()


["region = 'sfbay'", 'rent>0', "state='CA'"]
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-11-3f9d58715a9e> in <module>()
----> 1 df_bayarea = run_all(state='CA',filters=["region = 'sfbay'","rent>0"])   # define whatever filters you want here.
      2 df_bayarea.head()

<ipython-input-10-f4f50040be1e> in run_all(state, filters)
     21 
     22     # load and prepare census data for households
---> 23     hhs = get_census_by_state(state, table='households')
     24     hhs = format_hh_data(hhs)
     25     hhs_bg = aggregate_census(hhs)

<ipython-input-9-d0074092a625> in get_census_by_state(state, table)
     39 def get_census_by_state(state, table='households'):
     40     """Return all census data for state given two-char abbreviation. Can be 'households' or 'persons' data. """
---> 41     filelist=sftp.listdir(CENSUS_DIR)
     42     if table=='households':
     43         files = [f for f in filelist if f[:5]=='hh_{}'.format(state)]

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in listdir(self, path)
    170         :param str path: path to list (defaults to ``'.'``)
    171         """
--> 172         return [f.filename for f in self.listdir_attr(path)]
    173 
    174     def listdir_attr(self, path='.'):

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in listdir_attr(self, path)
    198         while True:
    199             try:
--> 200                 t, msg = self._request(CMD_READDIR, handle)
    201             except EOFError:
    202                 # done with handle

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in _request(self, t, *arg)
    728     def _request(self, t, *arg):
    729         num = self._async_request(type(None), t, *arg)
--> 730         return self._read_response(num)
    731 
    732     def _async_request(self, fileobj, t, *arg):

//anaconda/lib/python3.5/site-packages/paramiko/sftp_client.py in _read_response(self, waitfor)
    758         while True:
    759             try:
--> 760                 t, data = self._read_packet()
    761             except EOFError as e:
    762                 raise SSHException('Server connection dropped: %s' % str(e))

//anaconda/lib/python3.5/site-packages/paramiko/sftp.py in _read_packet(self)
    171 
    172     def _read_packet(self):
--> 173         x = self._read_all(4)
    174         # most sftp servers won't accept packets larger than about 32k, so
    175         # anything with the high byte set (> 16MB) is just garbage.

//anaconda/lib/python3.5/site-packages/paramiko/sftp.py in _read_all(self, n)
    154                         break
    155             else:
--> 156                 x = self.sock.recv(n)
    157 
    158             if len(x) == 0:

//anaconda/lib/python3.5/site-packages/paramiko/channel.py in recv(self, nbytes)
    663         """
    664         try:
--> 665             out = self.in_buffer.read(nbytes, self.timeout)
    666         except PipeTimeout:
    667             raise socket.timeout()

//anaconda/lib/python3.5/site-packages/paramiko/buffered_pipe.py in read(self, nbytes, timeout)
    154                 while (len(self._buffer) == 0) and not self._closed:
    155                     then = time.time()
--> 156                     self._cv.wait(timeout)
    157                     if timeout is not None:
    158                         timeout -= time.time() - then

//anaconda/lib/python3.5/threading.py in wait(self, timeout)
    291         try:    # restore state no matter what (e.g., KeyboardInterrupt)
    292             if timeout is None:
--> 293                 waiter.acquire()
    294                 gotit = True
    295             else:

KeyboardInterrupt: 

In [ ]:
# save locally
outfile = 'sfbay_listings_04282017.csv'
df_bayarea.to_csv(os.path.join(DATA_DIR,outfile), index=True)

Process all data by state

Use this to merge all the data by state


In [ ]:
for state in fips_state['USPS']:# uncomment when done with testing. 
    if state != 'DC':   # the DC census data is missing. 
        print('\n Working on',state)
        df_state = run_all(state)

In [ ]:
df_state.head()

In [ ]:
ssh.close()

Merge listings data with urbansim block-level data


In [17]:
# first unzip csv file into temp dir
os.mkdir('temp') # make temp dir for unzipped files
zip_ref = zipfile.ZipFile(os.path.join(BLOCK_DIR,BLOCK_ZFILE), 'r')
zip_ref.extractall('temp')
zip_ref.close()

In [18]:
# temporarily read first 100 lines just to see header names
df_temp = pd.read_csv(os.path.join('temp',BLOCK_FILE), nrows=100, dtype={'block_id':str})
print(df_temp.shape)
df_temp.head()


(100, 1015)
Out[18]:
block_id x y res_rents res_values square_meters_land puma10_id residential_unit_capacity employment_capacity rent_impute ... tracts_prop_persons_8 tracts_prop_persons_9 nodes_du_1500m nodes_ave_year_built_3000m tracts_prop_persons_5 tracts_prop_persons_6 tracts_prop_persons_7 tracts_prop_persons_1 tracts_prop_persons_2 tracts_prop_persons_3
0 060014271001000 -122.233867 37.770270 1475.0 777700.0 0 600105 0 0 1 ... 0.0 0.0 7.241375 1949.245239 0.033789 0.014378 0.007189 0.244428 0.35514 0.180446
1 060014271001001 -122.233991 37.769464 1253.0 567900.0 79696 600105 105 367 0 ... 0.0 0.0 7.241375 1949.245239 0.033789 0.014378 0.007189 0.244428 0.35514 0.180446
2 060014271001002 -122.234301 37.768350 1475.0 777700.0 739 600105 0 23 1 ... 0.0 0.0 7.241375 1949.245239 0.033789 0.014378 0.007189 0.244428 0.35514 0.180446
3 060014271001003 -122.235213 37.768827 1253.0 567900.0 19546 600105 48 108 0 ... 0.0 0.0 7.475714 1949.422974 0.033789 0.014378 0.007189 0.244428 0.35514 0.180446
4 060014271001004 -122.236751 37.769734 1253.0 567900.0 14364 600105 35 105 0 ... 0.0 0.0 7.157094 1950.110962 0.033789 0.014378 0.007189 0.244428 0.35514 0.180446

5 rows × 1015 columns


In [19]:
# define the columns we need 
block_cols = df_temp.columns

In [26]:
#df_temp.columns[:100]

Certain columns are definitely not useful. E.g, 'puma10_id_is_0609502', 'puma10_id_is_0609503' 'tracts_mean_y', 'tracts_mean_x'

TODO: spend more time choosing features


In [21]:
# make sure to now include unneeded columns like these:
unneeded_cols = [x for x in block_cols if x.startswith('puma10_id_is_')]+
[x for x in block_cols if (x.endswith('mean_y'))|(x.endswith('mean_x'))]+
[x for x in block_cols if (x.endswith('std_y'))|(x.endswith('std_x'))]+
[x for x in block_cols if x.startswith('pumas_prop_sector_id')]+
[x for x in block_cols if x.startswith('county_id_is_')]+
[x for x in block_cols if x.startswith('tracts_prop_sector_id')]+
[x for x in block_cols if x.startswith('counties_prop_sector_id')]


  File "<ipython-input-21-c056f67af7df>", line 2
    unneeded_cols = [x for x in block_cols if x.startswith('puma10_id_is_')]+
                                                                             ^
SyntaxError: invalid syntax

In [ ]:
len(unneeded_cols)

In [33]:
cols_to_use = ['block_id','nodes_population_1500m','block_groups_total_jobs', 'block_groups_median_children', 
       'block_groups_median_income', 'prop_tenure_1', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m', 
       'nodes_jobs_3000m','nodes_jobs_20km', 'nodes_population_400m', 'nodes_population_800m', 
       'block_groups_prop_race_of_head_1','block_groups_prop_race_of_head_2', 'block_groups_prop_race_of_head_3', 
       'block_groups_prop_race_of_head_7','block_groups_prop_race_of_head_8','block_groups_prop_race_of_head_6',
       'pumas_density_residential_units','block_groups_density_jobs', 
       'nodes_jobs_1500m_4445','nodes_jobs_3000m_4445', 
       'nodes_du_5000m','nodes_du_800m','block_groups_median_rent',
       'block_groups_median_persons', 'block_groups_median_age_of_head', 'nodes_ave_year_built_800m']

In [34]:
for col in cols_to_use: 
    if col not in block_cols:
        print(col)

In [35]:
# Read all rows, using only the columns we want
df_blocks = pd.read_csv(os.path.join('temp',BLOCK_FILE),dtype={'block_id':str}, usecols = cols_to_use)
print(df_blocks.shape)
df_blocks.head()


(109228, 28)
Out[35]:
block_id nodes_low_income_hh_1500m block_groups_median_children nodes_jobs_3000m_4445 block_groups_total_jobs nodes_population_1500m prop_tenure_1 block_groups_median_persons block_groups_median_age_of_head nodes_population_800m ... nodes_du_5000m nodes_jobs_1500m_4445 pumas_density_residential_units nodes_jobs_3000m block_groups_median_rent nodes_jobs_20km block_groups_density_jobs block_groups_median_income nodes_population_400m nodes_du_800m
0 060014271001000 5.991451 0.0 6.705032 465.0 9.113499 0.000 2.0 57.0 7.809947 ... 9.915149 5.062971 3.339906 8.821410 1253.0 12.139385 5.103936 80000.0 6.373320 5.841072
1 060014271001001 5.991451 0.0 6.705032 465.0 9.113499 0.200 2.0 57.0 7.809947 ... 9.915149 5.062971 3.339906 8.821410 1253.0 12.139385 5.103936 80000.0 6.373320 5.841072
2 060014271001002 5.991451 0.0 6.705032 465.0 9.113499 0.000 2.0 57.0 7.809947 ... 9.915149 5.062971 3.339906 8.821410 1253.0 12.139385 5.103936 80000.0 6.373320 5.841072
3 060014271001003 6.231131 0.0 6.814257 465.0 9.536473 0.300 2.0 57.0 7.960672 ... 9.982758 5.257902 3.339906 8.925516 1253.0 12.150690 5.103936 80000.0 6.703188 6.253922
4 060014271001004 6.107350 0.0 6.982764 465.0 9.355220 0.375 2.0 57.0 7.547502 ... 10.062321 5.541484 3.339906 9.048043 1253.0 12.218818 5.103936 80000.0 6.396930 5.760618

5 rows × 28 columns


In [36]:
df_blocks['block_id'].head()


Out[36]:
0    060014271001000
1    060014271001001
2    060014271001002
3    060014271001003
4    060014271001004
Name: block_id, dtype: object

In [37]:
df_listings = get_craiglist(filters = ["region='sfbay'","rent>100"])

In [38]:
# merge listings with vars on block_id
df_listings.fips_block.head()

print(len(df_listings), len(df_blocks))
df_merged = pd.merge(df_listings, df_blocks, left_on='fips_block', right_on='block_id', how='inner')
if len(df_merged)<len(df_listings):
    print('Warning: only {0} of {1} rows matched'.format(len(df_merged), len(df_listings)))


125560 109228
Warning: only 123463 of 125560 rows matched

In [67]:
# save to hdf
outfile = 'ba_listings.h5'
df_merged.to_hdf(os.path.join(DATA_DIR,outfile),'merged')


outfile = 'ba_listings.csv'
df_merged.to_csv(os.path.join(DATA_DIR,outfile),index=False)


//anaconda/lib/python3.5/site-packages/pandas/core/generic.py:1138: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['listing_id', 'date', 'fips_block', 'state', 'region', 'block', 'fips12', 'block_id']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

Store data to database

We need to create a database table to hold the block-level variables for the rent-predictor app.

Steps:

  1. Create table schema

Assume all features are floats unless otherwise specified.

(If have a lot that aren't floats, might want to store the variable names and their types as a json so we can refer back to it when we change the variables.)

  1. Copy data into table

Copy data into table. COPY is the fastest way to insert a large amount of data (https://www.postgresql.org/docs/current/static/populate.html)


In [90]:
# connect to local databse

DBNAME = settings['DBNAME_RENT']
# USER = settings['USER_RENT']
# PASSWORD = settings['PASSWORD_RENT']

conn_str = "dbname = {0}".format(DBNAME)

try:
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
except:
    print ("Cannot connection. Check settings")

In [54]:
# first save data as csv. 
# let's use that temp dir again
FULL_PATH = '/Users/lisarayle/Dropbox/craigslist/src/'  # can't use relative path in postgres, I guess
csvfile = 'blocks_temp.csv'
df_blocks.to_csv(os.path.join('temp',csvfile), index=False)

table_name = 'block_vars'

In [91]:
def create_db_table(col_names,t_name,id_var='block_id'):
    """Create a new table with schema to hold the block-level data. 
    Args: 
        col_names (list): list of names of columns to use. First one can be 'block id'
        t_name (str): name of database table
        id_var (str): name of id variable (default: 'block_id')
    """
    # drop table if already exists
    q = "DROP TABLE IF EXISTS {}".format(t_name)
    cur.execute(q)
    conn.commit()

    # build the SQL string
    sql_begin = "CREATE TABLE {0} (id BIGSERIAL PRIMARY KEY, {1} varchar(15) not null, ".format(t_name, id_var)
    if col_names[0]==id_var:
        sql_middle = " real,".join([c for c in col_names[1:]]) # leave off block id if it's there. 
    else: 
        sql_middle = " real,".join([c for c in col_names])

    sql_end = " real);"
    q = sql_begin+sql_middle+sql_end
    
    cur.execute(q)
    conn.commit()
    return 

def copy_block_data(col_names,t_name,fname):
    """Copy data from csv file into block variables table. 
    Args: 
        col_names (list): list of names of columns to use. First one can be 'block id'
        t_name (str): name of database table
        fname (str): name of csv file with data
    """
    var_string = ','.join([c for c in col_names])
    q="COPY {t}({v}) FROM '{f}' DELIMITERS ',' CSV HEADER;".format(t=t_name,v=var_string, f=os.path.join(FULL_PATH,'temp',fname))
    print(q)
    cur.execute(q)
    conn.commit()
    return

In [92]:
create_db_table(cols_to_use, table_name)
copy_block_data(cols_to_use, table_name, csvfile)


COPY block_vars(block_id,nodes_population_1500m,block_groups_total_jobs,block_groups_median_children,block_groups_median_income,prop_tenure_1,nodes_low_income_hh_1500m,nodes_high_income_hh_1500m,nodes_jobs_3000m,nodes_jobs_20km,nodes_population_400m,nodes_population_800m,block_groups_prop_race_of_head_1,block_groups_prop_race_of_head_2,block_groups_prop_race_of_head_3,block_groups_prop_race_of_head_7,block_groups_prop_race_of_head_8,block_groups_prop_race_of_head_6,pumas_density_residential_units,block_groups_density_jobs,nodes_jobs_1500m_4445,nodes_jobs_3000m_4445,nodes_du_5000m,nodes_du_800m,block_groups_median_rent,block_groups_median_persons,block_groups_median_age_of_head,nodes_ave_year_built_800m) FROM '/Users/lisarayle/Dropbox/craigslist/src/temp/blocks_temp.csv' DELIMITERS ',' CSV HEADER;

In [57]:
# test queries

q = "select count(*) from block_vars;"
run_query(q)

q = "select column_name from information_schema.columns where table_name='block_vars';"
run_query(q)

q = "select * from block_vars limit 10;"
run_query(q)


Out[57]:
[(1,
  '060014271001000',
  5.99145,
  0.0,
  6.70503,
  465.0,
  9.1135,
  0.0,
  2.0,
  57.0,
  7.80995,
  6.1145,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.73,
  9.91515,
  5.06297,
  3.33991,
  8.82141,
  1253.0,
  12.1394,
  5.10394,
  80000.0,
  6.37332,
  5.84107),
 (2,
  '060014271001001',
  5.99145,
  0.0,
  6.70503,
  465.0,
  9.1135,
  0.2,
  2.0,
  57.0,
  7.80995,
  6.1145,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.73,
  9.91515,
  5.06297,
  3.33991,
  8.82141,
  1253.0,
  12.1394,
  5.10394,
  80000.0,
  6.37332,
  5.84107),
 (3,
  '060014271001002',
  5.99145,
  0.0,
  6.70503,
  465.0,
  9.1135,
  0.0,
  2.0,
  57.0,
  7.80995,
  6.1145,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.73,
  9.91515,
  5.06297,
  3.33991,
  8.82141,
  1253.0,
  12.1394,
  5.10394,
  80000.0,
  6.37332,
  5.84107),
 (4,
  '060014271001003',
  6.23113,
  0.0,
  6.81426,
  465.0,
  9.53647,
  0.3,
  2.0,
  57.0,
  7.96067,
  6.33903,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.53,
  9.98276,
  5.2579,
  3.33991,
  8.92552,
  1253.0,
  12.1507,
  5.10394,
  80000.0,
  6.70319,
  6.25392),
 (5,
  '060014271001004',
  6.10735,
  0.0,
  6.98276,
  465.0,
  9.35522,
  0.375,
  2.0,
  57.0,
  7.5475,
  5.76372,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1956.93,
  10.0623,
  5.54148,
  3.33991,
  9.04804,
  1253.0,
  12.2188,
  5.10394,
  80000.0,
  6.39693,
  5.76062),
 (6,
  '060014271001005',
  6.18346,
  0.0,
  6.97409,
  465.0,
  9.36529,
  0.0,
  2.0,
  57.0,
  7.55433,
  5.85112,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1948.95,
  10.0493,
  5.5048,
  3.33991,
  9.04969,
  1253.0,
  12.2108,
  5.10394,
  80000.0,
  6.36819,
  5.80167),
 (7,
  '060014271001006',
  6.18346,
  0.0,
  6.97409,
  465.0,
  9.36529,
  0.0,
  2.0,
  57.0,
  7.55433,
  5.85112,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1948.95,
  10.0493,
  5.5048,
  3.33991,
  9.04969,
  1253.0,
  12.2108,
  5.10394,
  80000.0,
  6.36819,
  5.80167),
 (8,
  '060014271001007',
  6.23485,
  0.0,
  6.96965,
  465.0,
  9.42173,
  0.666667,
  2.0,
  57.0,
  7.50163,
  5.90052,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1948.78,
  10.0434,
  5.48594,
  3.33991,
  9.05139,
  1253.0,
  12.2067,
  5.10394,
  80000.0,
  5.80212,
  5.85131),
 (9,
  '060014271001008',
  5.98591,
  0.0,
  6.88824,
  465.0,
  9.30356,
  0.0,
  2.0,
  57.0,
  7.96763,
  6.2831,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.3,
  10.0533,
  5.32579,
  3.33991,
  8.96883,
  1253.0,
  12.1763,
  5.10394,
  80000.0,
  6.56527,
  5.99314),
 (10,
  '060014271001009',
  5.90645,
  0.0,
  6.89457,
  465.0,
  9.34592,
  0.0,
  2.0,
  57.0,
  7.86864,
  6.17115,
  0.109726,
  0.0,
  0.00498753,
  0.0,
  0.885287,
  0.0,
  1940.38,
  10.0666,
  5.33424,
  3.33991,
  8.96128,
  1253.0,
  12.1815,
  5.10394,
  80000.0,
  5.98645,
  5.66477)]

In [72]:
cols_to_use


Out[72]:
['block_id',
 'nodes_population_1500m',
 'block_groups_total_jobs',
 'block_groups_median_children',
 'block_groups_median_income',
 'prop_tenure_1',
 'nodes_low_income_hh_1500m',
 'nodes_high_income_hh_1500m',
 'nodes_jobs_3000m',
 'nodes_jobs_20km',
 'nodes_population_400m',
 'nodes_population_800m',
 'block_groups_prop_race_of_head_1',
 'block_groups_prop_race_of_head_2',
 'block_groups_prop_race_of_head_3',
 'block_groups_prop_race_of_head_7',
 'block_groups_prop_race_of_head_8',
 'block_groups_prop_race_of_head_6',
 'pumas_density_residential_units',
 'block_groups_density_jobs',
 'nodes_jobs_1500m_4445',
 'nodes_jobs_3000m_4445',
 'nodes_du_5000m',
 'nodes_du_800m',
 'block_groups_median_rent',
 'block_groups_median_persons',
 'block_groups_median_age_of_head',
 'nodes_ave_year_built_800m']

In [78]:
DATA_DIR='/Users/lisarayle/rent_predictor/data/'
"""Path to data directory"""

# read file with variable names 
infile = 'variables.txt'
with open(os.path.join(DATA_DIR, infile), 'r') as f:
    VARLIST = f.read().split(',')
print(VARLIST)
len(VARLIST)


['nodes_population_1500m', 'block_groups_total_jobs', 'block_groups_median_children', 'block_groups_median_income', 'prop_tenure_1', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m', 'nodes_jobs_3000m', 'nodes_jobs_20km', 'nodes_population_400m', 'nodes_population_800m', 'block_groups_prop_race_of_head_1', 'block_groups_prop_race_of_head_2', 'block_groups_prop_race_of_head_3', 'block_groups_prop_race_of_head_7', 'block_groups_prop_race_of_head_8', 'block_groups_prop_race_of_head_6', 'pumas_density_residential_units', 'block_groups_density_jobs', 'nodes_jobs_1500m_4445', 'nodes_jobs_3000m_4445', 'nodes_du_5000m', 'nodes_du_800m', 'block_groups_median_rent', 'block_groups_median_persons', 'block_groups_median_age_of_head', 'nodes_ave_year_built_800m']
Out[78]:
27

In [ ]: