Webscraping Craigslist for House Prices in the East Bay

Jennifer Jones, PhD

jennifer.jones@cal.berkeley.edu



In [1]:

    
# Python 3.4
%pylab inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup as bs4









    



Populating the interactive namespace from numpy and matplotlib

Craigslist houses for sale

Look on the Craigslist website, select relevant search criteria, and then take a look at the web address:

Houses for sale in the East Bay:
http://sfbay.craigslist.org/search/eby/rea?housing_type=6

Houses for sale in selected neighborhoods in the East Bay:
http://sfbay.craigslist.org/search/eby/rea?nh=46&nh=47&nh=48&nh=49&nh=112&nh=54&nh=55&nh=60&nh=62&nh=63&nh=66&housing_type=6



In [2]:

    
# Get the data: Houses posted for sale on Craigslist in the Eastbay
url_base = 'http://sfbay.craigslist.org/search/eby/rea?housing_type=6' 
data = requests.get(url_base)
print(data.url)









    



http://sfbay.craigslist.org/search/eby/rea?housing_type=6



In [3]:

    
# BeautifulSoup can quickly parse the text, need to tell bs4 that the text is html
html = bs4(data.text, 'html.parser')



In [9]:

    
# Display the html in a somewhat readable way, to note the structure of housing listings
# then comment it out because it prints out a large amount to the screen
# print(html.prettify())

House entries



In [4]:

    
# Looked through above output and saw housing entries contained in <p class="row">

# Get a list of housing data and store the results
houses = html.find_all('p', attrs={'class': 'row'}) # html.findAll(attrs={'class': "row"})
print(len(houses))



In [7]:

    
# List neighborhoods of the houses in the list
neighborhoods = pd.DataFrame(data = ones(len(houses)), columns = {'Neighborhoods'})
n = 0
for row in range(len(houses)-1):
    one_neighborhood = houses[n].findAll(attrs={'class': 'pnr'})[0].text
    neighborhoods.iloc[n] = one_neighborhood
    n += 1
#print(neighborhoods)

Look at a single entry - for one house

To explore the data before working with the whole dataset.



In [11]:

    
# There's a consistent structure to each housing listing:
# There is a 'time', 
# a <span class="price">, 
# a 'housing', 
# a <span class="pnr"> neighborhood field

# Look at info for a single house
one_house = houses[11] # 11, 19, 28 is the selected row number for a housing listing

# Print out and view a single house entry, use prettify to make more legible
print(one_house.prettify())









    



<p class="row" data-pid="5435512814">
 <a class="i" data-ids="0:00i0i_5wOYOWgpigj,0:01313_kVRkR5KuWKy,0:00V0V_itCBgm2OaMO,0:00404_kLqJxhlIffA,0:00r0r_eJPbw1PURsK,0:00D0D_toh93xIhAv,0:00o0o_hZrlOFHmOtq,0:00A0A_1rbMRl7UegY,0:00R0R_gEG5hSz70ay,0:00E0E_4qXeNIMmPXZ,0:01717_jkC1pF5Ng8T,0:00q0q_keqedGCLN3Y,0:01313_3n33XE8aWCo" href="/eby/reb/5435512814.html">
 </a>
 <span class="txt">
  <span class="star">
  </span>
  <span class="pl">
   <time datetime="2016-02-05 16:26" title="Fri 05 Feb 04:26:09 PM">
    Feb 5
   </time>
   <a class="hdrlnk" data-id="5435512814" href="/eby/reb/5435512814.html">
    RICHMOND ANNEX - Adorable Updated Bungalow. OPEN SUN 1-3
   </a>
  </span>
  <span class="l2">
   <span class="price">
    $469000
   </span>
   <span class="housing">
    / 2br - 817ft
    <sup>
     2
    </sup>
    -
   </span>
   <span class="pnr">
    <small>
     (richmond / point / annex)
    </small>
    <span class="px">
     <span class="p">
      pic
      <span class="maptag" data-pid="5435512814">
       map
      </span>
     </span>
    </span>
   </span>
  </span>
  <span class="js-only banish-unbanish">
   <span class="banish" title="hide">
    <span class="trash">
    </span>
   </span>
   <span class="unbanish" title="restore">
    <span class="trash red">
    </span>
   </span>
  </span>
 </span>
</p>

A single housing entry looks like this:

Feb 5 OPEN House Sunday 2-4pm, For SALE Spacious 2 Bedroom, 1 Bathroom Home $440000 / 2br - 1156ft ² - (Oakland) pic map

Feb 4 OPEN HOUSE SAT 1-4 Sunny Richmond Home 3br - 1330ft ² - (richmond / point / annex) pic map

Feb 4 Millsmont House For Sale $579950 / 3br - 1912ft ² - (oakland hills / mills) pic map

Feb 3 Excellent Home in Berkeley $450000 (berkeley) pic

Feb 3 Conveniently located in Albany $600000 (albany / el cerrito) pic



In [14]:

    
# For one housing entry look at fields of interest: Price, Neighborhood, Size, Date Posted 
# Clean up values manually, to figure out how to automate

# Listing
allspecs = one_house.findAll(attrs={'class': 'l2'})[0].text # `findAll` returns a list, and there's only one entry in this html
print('Listing: \n', allspecs, '\n')

# Price
print('Price:')

price = one_house.findAll(attrs={'class': 'price'})[0].text
print(price)

price = float(one_house.find('span', {'class': 'price'}).text.strip('$'))
print(price, '\t', type(price), '\n')

# Neighborhood
print('Neighborhood:')

neighborhood = one_house.findAll(attrs={'class': 'pnr'})[0].text
print(neighborhood)

# Keep the neighborhood, remove leading spaces and parentheses. 
# Then split at the closing parentheses and only take the neighborhood part
# example: '   (vallejo / benicia)   pic map  '
neighborhood = one_house.findAll(attrs={'class': 'pnr'})[0].text.strip(' (').split(')')[0]
print(neighborhood, '\t', type(neighborhood), '\n')

#print(len([rw.findAll(attrs={'class': 'pnr'})[0].text.strip(' (').split(')')[0] for rw in houses]))

# Size
print('Size: bedrooms and sq ft: ')

size = one_house.findAll(attrs={'class': 'housing'})[0].text
print(size)

# Strip text of leading and trailing characters: /, dashes, and spaces
# Split number of bedrooms and square footage into 2 fields in list
size = one_house.findAll(attrs={'class': 'housing'})[0].text.strip('/- ').split(' - ')
print(size) 

# Delete suffixes and just keep the numbers 
size[0] = float(size[0].replace('br', '')) # number of bedrooms
size[1] = float(size[1].replace('ft2', '')) # square footage

print(size, '\t', type(size[0]), '\n') 

# Address/Posting Title
address = one_house.findAll(attrs={'class': 'hdrlnk'})[0].text
print(address,  '\n')
#link = 'http://sfbay.craigslist.org/search' + one_house.findAll(attrs={'class': 'hdrlnk'})[0]['href']
#print(link, '\n')

# Date posted
dateposted = one_house.findAll(attrs={'class': 'pl'})[0].time['datetime']
print(dateposted, '\t', type(dateposted))

# Convert to datetime type so can extract date
date = pd.to_datetime(one_house.find('time')['datetime']).date()
print(date, '\t', type(date))









    



Listing: 
  $469000 / 2br - 817ft2 -    (richmond / point / annex)   pic map   

Price:
$469000
469000.0 	 <class 'float'> 

Neighborhood:
  (richmond / point / annex)   pic map 
richmond / point / annex 	 <class 'str'> 

Size: bedrooms and sq ft: 
/ 2br - 817ft2 - 
['2br', '817ft2']
[2.0, 817.0] 	 <class 'float'> 

RICHMOND ANNEX - Adorable Updated Bungalow. OPEN SUN 1-3 

2016-02-05 16:26 	 <class 'str'>
2016-02-05 	 <class 'datetime.date'>

All rows, all housing entries

Now that I've figured out how to extract data for 1 house, do for the list of houses



In [15]:

    
# Define 4 functions for the price, neighborhood, sq footage & # bedrooms, and time
# that can deal with missing values (to prevent errors from showing up when running the code)

# Prices
def find_prices(results):
    prices = []
    for rw in results:
        price = rw.find('span', {'class': 'price'})
        if price is not None:
            price = float(price.text.strip('$'))
        else:
            price = np.nan
        prices.append(price)
    return prices

# Neighborhoods
# Example: '  (oakland hills / mills)   pic map  '
# Define a function for neighborhood in case a field is missing in 'class': 'pnr'
def find_neighborhood(results):
    neighborhoods = []
    for rw in results:
        split = rw.find('span', {'class': 'pnr'}).text.strip(' (').split(')')
        #split = rw.find(attrs={'class': 'pnr'}).text.strip(' (').split(')')
        if len(split) == 2:
            neighborhood = split[0]
        elif 'pic map' or 'pic' or 'map' in split[0]:
            neighborhood = np.nan
        neighborhoods.append(neighborhood)
    return neighborhoods

# Size
# Make a function to deal with size in case #br or ft2 is missing
def find_size_and_brs(results):
    sqft = []
    bedrooms = []
    for rw in results:
        split = rw.find('span', attrs={'class': 'housing'})
        # If the field doesn't exist altogether in a housing entry
        if split is not None:
        #if rw.find('span', {'class': 'housing'}) is not None:
            # Removes leading and trailing spaces and dashes, splits br & ft
            #split = rw.find('span', attrs={'class': 'housing'}).text.strip('/- ').split(' - ')
            split = split.text.strip('/- ').split(' - ')
            if len(split) == 2:
                n_brs = split[0].replace('br', '')
                size = split[1].replace('ft2', '')
            elif 'br' in split[0]: # in case 'size' field is missing
                n_brs = split[0].replace('br', '')
                size = np.nan
            elif 'ft2' in split[0]: # in case 'br' field is missing
                size = split[0].replace('ft2', '')
                n_brs = np.nan
        else:
            size = np.nan
            n_brs = np.nan
        sqft.append(float(size))
        bedrooms.append(float(n_brs))
    return sqft, bedrooms

# Time posted
def find_times(results):
    times = []
    for rw in results:
        time = rw.findAll(attrs={'class': 'pl'})[0].time['datetime']
        if time is not None:
            time# = time
        else:
            time = np.nan
        times.append(time)
    return pd.to_datetime(times)



In [16]:

    
prices = find_prices(houses)
neighborhoods = find_neighborhood(houses) 
sqft, bedrooms = find_size_and_brs(houses)
times = find_times(houses)

# Check
print(len(prices))
print(len(neighborhoods))
print(len(sqft))
print(len(bedrooms))
print(len(times))



In [18]:

    
# Add the data to a dataframe so I can work with it

housesdata = np.array([prices, sqft, bedrooms]).T
#print(housesdata)

# Add the array to the dataframe, then the dates column and the neighborhoods column
housesdf = pd.DataFrame(data = housesdata, columns = ['Price', 'SqFeet', 'nBedrooms'])
housesdf['DatePosted'] = times
housesdf['Neighborhood'] = neighborhoods

print(housesdf.tail(5))









    



      Price  SqFeet  nBedrooms          DatePosted  \
95   280000    1379          3 2016-02-05 10:32:00   
96   650000    1320          3 2016-02-05 10:27:00   
97  3699000    6641          5 2016-02-05 10:25:00   
98   660000    1300          3 2016-02-05 10:25:00   
99   625000    1350          3 2016-02-05 10:24:00   

                       Neighborhood  
95              pittsburg / antioch  
96             danville / san ramon  
97             danville / san ramon  
98  dublin / pleasanton / livermore  
99  dublin / pleasanton / livermore



In [19]:

    
print(housesdf.dtypes)









    



Price                  float64
SqFeet                 float64
nBedrooms              float64
DatePosted      datetime64[ns]
Neighborhood            object
dtype: object



In [22]:

    
# Quick plot to look at the data

fig = plt.figure() 
fig.set_figheight(6.0)
fig.set_figwidth(10.0)

ax = fig.add_subplot(111) # row column position 
ax.plot(housesdf.SqFeet, housesdf.Price, 'bo')
ax.set_xlim(0,5000)
ax.set_ylim(0,3000000)
ax.set_xlabel('$\mathrm{Square \; feet}$',fontsize=18)
ax.set_ylabel('$\mathrm{Price \; (in \; \$)}$',fontsize=18)

len(housesdf.SqFeet)









    Out[22]:





100



In [23]:

    
# Quick plot to look at the data

fig = plt.figure() 
fig.set_figheight(6.0)
fig.set_figwidth(10.0)

ax = fig.add_subplot(111) # row column position 
ax.plot(housesdf.nBedrooms, housesdf.Price, 'bo')
ax.set_xlim(1.5, 5.5)
ax.set_ylim(0,3000000)
ax.set_xlabel('$\mathrm{Number \; of \; Bedrooms}$',fontsize=18)
ax.set_ylabel('$\mathrm{Price \; (in \; \$)}$',fontsize=18)

len(housesdf.nBedrooms)









    Out[23]:





100



In [29]:

    
# Get houses listed in Berkeley
#housesdf[housesdf['Neighborhood'] == 'berkeley']
housesdf[housesdf['Neighborhood'] == 'berkeley north / hills']
#housesdf[housesdf['Neighborhood'] == 'oakland rockridge / claremont']
#housesdf[housesdf['Neighborhood'] == 'albany / el cerrito']
#housesdf[housesdf['Neighborhood'] == 'richmond / point / annex']









    Out[29]:






  
    
      
      Price
      SqFeet
      nBedrooms
      DatePosted
      Neighborhood
    
  
  
    
      28
      795000
      NaN
      2
      2016-02-05 15:20:00
      berkeley north / hills
    
    
      30
      879000
      NaN
      5
      2016-02-05 15:11:00
      berkeley north / hills



In [46]:

    
# How many houses for sale are under $700k?
print(housesdf[(housesdf.Price < 700000)].count(), '\n') # nulls aren't counted in count

# In which neighborhoods are these houses located?
print(set(housesdf[(housesdf.Price < 700000)].Neighborhood))

# Return entries for houses under $700k, sorted by price from least expensive to most
housesdf[(housesdf.Price < 700000)].sort_values(['Price'], ascending = [True])









    



Price           53
SqFeet          43
nBedrooms       51
DatePosted      53
Neighborhood    52
dtype: int64 

{nan, 'brentwood / oakley', 'vallejo / benicia', 'oakland west', 'hayward / castro valley', 'dublin / pleasanton / livermore', 'concord / pleasant hill / martinez', 'Oakland', 'Tracy', 'walnut creek', 'oakland hills / mills', 'oakland east', 'pittsburg / antioch', 'richmond / point / annex', 'hercules, pinole, san pablo, el sob', 'fairfield / vacaville', 'san leandro', 'danville / san ramon', 'fremont / union city / newark', 'CONCORD', 'albany / el cerrito', 'Fair Oaks'}






    Out[46]:






  
    
      
      Price
      SqFeet
      nBedrooms
      DatePosted
      Neighborhood
    
  
  
    
      45
      1300
      3950
      4
      2016-02-05 14:22:00
      hercules, pinole, san pablo, el sob
    
    
      24
      20000
      1900
      4
      2016-02-05 15:38:00
      vallejo / benicia
    
    
      39
      145000
      1872
      NaN
      2016-02-05 14:45:00
      dublin / pleasanton / livermore
    
    
      46
      270000
      1511
      3
      2016-02-05 14:20:00
      pittsburg / antioch
    
    
      95
      280000
      1379
      3
      2016-02-05 10:32:00
      pittsburg / antioch
    
    
      92
      319000
      NaN
      2
      2016-02-05 10:41:00
      Oakland
    
    
      31
      329000
      NaN
      3
      2016-02-05 15:07:00
      brentwood / oakley
    
    
      60
      350000
      1176
      3
      2016-02-05 13:06:00
      CONCORD
    
    
      55
      360000
      1905
      4
      2016-02-05 13:52:00
      fairfield / vacaville
    
    
      86
      373000
      1552
      3
      2016-02-05 10:59:00
      fairfield / vacaville
    
    
      87
      379000
      NaN
      2
      2016-02-05 10:57:00
      oakland hills / mills
    
    
      69
      399000
      1377
      4
      2016-02-05 12:09:00
      oakland east
    
    
      1
      410000
      1144
      2
      2016-02-05 17:31:00
      oakland east
    
    
      80
      418000
      11500
      3
      2016-02-05 11:13:00
      concord / pleasant hill / martinez
    
    
      82
      435000
      2506
      4
      2016-02-05 11:03:00
      pittsburg / antioch
    
    
      37
      439000
      2260
      3
      2016-02-05 14:57:00
      pittsburg / antioch
    
    
      48
      442500
      2255
      4
      2016-02-05 14:17:00
      vallejo / benicia
    
    
      53
      444000
      2020
      4
      2016-02-05 14:03:00
      vallejo / benicia
    
    
      83
      450000
      920
      2
      2016-02-05 11:01:00
      walnut creek
    
    
      11
      469000
      817
      2
      2016-02-05 16:26:00
      richmond / point / annex
    
    
      66
      479900
      2293
      5
      2016-02-05 12:40:00
      Fair Oaks
    
    
      36
      480000
      2580
      3
      2016-02-05 14:59:00
      vallejo / benicia
    
    
      56
      485000
      NaN
      2
      2016-02-05 13:47:00
      fremont / union city / newark
    
    
      54
      489000
      1945
      4
      2016-02-05 13:52:00
      fairfield / vacaville
    
    
      34
      489000
      2752
      5
      2016-02-05 15:05:00
      pittsburg / antioch
    
    
      27
      489900
      2451
      4
      2016-02-05 15:23:00
      dublin / pleasanton / livermore
    
    
      72
      490000
      1952
      4
      2016-02-05 11:47:00
      brentwood / oakley
    
    
      26
      499000
      1473
      3
      2016-02-05 15:24:00
      san leandro
    
    
      25
      499888
      1119
      3
      2016-02-05 15:36:00
      hayward / castro valley
    
    
      52
      499888
      1119
      3
      2016-02-05 14:07:00
      hayward / castro valley
    
    
      77
      529000
      NaN
      4
      2016-02-05 11:31:00
      concord / pleasant hill / martinez
    
    
      9
      535000
      3635
      7
      2016-02-05 16:32:00
      Tracy
    
    
      14
      535000
      3635
      7
      2016-02-05 16:23:00
      Tracy
    
    
      15
      535000
      3635
      7
      2016-02-05 16:22:00
      Tracy
    
    
      23
      539000
      3394
      5
      2016-02-05 15:40:00
      NaN
    
    
      67
      548000
      1244
      3
      2016-02-05 12:26:00
      richmond / point / annex
    
    
      19
      549000
      NaN
      2
      2016-02-05 16:04:00
      albany / el cerrito
    
    
      43
      550000
      NaN
      NaN
      2016-02-05 14:24:00
      fremont / union city / newark
    
    
      32
      550000
      1474
      4
      2016-02-05 15:06:00
      oakland west
    
    
      42
      564900
      1741
      4
      2016-02-05 14:25:00
      vallejo / benicia
    
    
      81
      569000
      NaN
      2
      2016-02-05 11:11:00
      richmond / point / annex
    
    
      38
      585000
      1750
      3
      2016-02-05 14:48:00
      vallejo / benicia
    
    
      94
      585000
      1350
      3
      2016-02-05 10:33:00
      hayward / castro valley
    
    
      99
      625000
      1350
      3
      2016-02-05 10:24:00
      dublin / pleasanton / livermore
    
    
      73
      625000
      1668
      3
      2016-02-05 11:39:00
      brentwood / oakley
    
    
      76
      629000
      NaN
      5
      2016-02-05 11:35:00
      fairfield / vacaville
    
    
      75
      629000
      4085
      5
      2016-02-05 11:36:00
      fairfield / vacaville
    
    
      4
      639000
      NaN
      3
      2016-02-05 17:21:00
      fremont / union city / newark
    
    
      78
      649900
      2903
      4
      2016-02-05 11:14:00
      brentwood / oakley
    
    
      90
      650000
      1623
      4
      2016-02-05 10:51:00
      dublin / pleasanton / livermore
    
    
      96
      650000
      1320
      3
      2016-02-05 10:27:00
      danville / san ramon
    
    
      79
      650000
      2091
      4
      2016-02-05 11:13:00
      concord / pleasant hill / martinez
    
    
      98
      660000
      1300
      3
      2016-02-05 10:25:00
      dublin / pleasanton / livermore

Group results by neighborhood and plot



In [47]:

    
by_neighborhood = housesdf.groupby('Neighborhood')
print(by_neighborhood.count())#.head()) # NOT NULL records within each column
#print('\n')

#print(by_neighborhood.size())#.head()) # total records for each neighborhood
#by_neighborhood.Neighborhood.nunique()









    



                                     Price  SqFeet  nBedrooms  DatePosted
Neighborhood                                                             
CONCORD                                  1       1          1           1
El Sobrante 7.5 miles from Orinda        0       1          1           1
Fair Oaks                                1       1          1           1
HAYWARD                                  0       0          0           1
Oakland                                  1       0          1           1
PLEASANTON-DUBLIN-LIVERMORE              0       0          0           1
San Diego                                0       2          0           2
Tracy                                    3       3          3           3
albany / el cerrito                      3       0          3           3
berkeley north / hills                   2       0          2           2
brentwood / oakley                       5       4          5           6
concord / pleasant hill / martinez       4       3          4           4
danville / san ramon                     3       3          3           3
dublin / pleasanton / livermore         10      10          9          11
fairfield / vacaville                    6       5          7           7
fremont / union city / newark           10       7         10          11
hayward / castro valley                  3       3          3           3
hercules, pinole, san pablo, el sob      1       1          1           1
oakland east                             2       2          2           2
oakland hills / mills                    1       0          1           1
oakland rockridge / claremont            1       0          1           1
oakland west                             1       1          1           1
pittsburg / antioch                      5       6          6           6
richmond / point / annex                 3       2          3           3
san leandro                              2       2          2           2
vallejo / benicia                        8       8          8           8
walnut creek                             1       1          2           3



In [48]:

    
print(len(housesdf.index)) # total #rows
print(len(set(housesdf.Neighborhood))) # #unique neighborhoods
set(housesdf.Neighborhood) # list the #unique neighborhoods









    



100
28






    Out[48]:





{'brentwood / oakley',
 nan,
 'vallejo / benicia',
 'oakland west',
 'hayward / castro valley',
 'dublin / pleasanton / livermore',
 'PLEASANTON-DUBLIN-LIVERMORE',
 'concord / pleasant hill / martinez',
 'Oakland',
 'Tracy',
 'walnut creek',
 'oakland hills / mills',
 'oakland east',
 'berkeley north / hills',
 'pittsburg / antioch',
 'San Diego',
 'richmond / point / annex',
 'hercules, pinole, san pablo, el sob',
 'El Sobrante 7.5 miles from Orinda',
 'fairfield / vacaville',
 'san leandro',
 'danville / san ramon',
 'HAYWARD',
 'fremont / union city / newark',
 'CONCORD',
 'albany / el cerrito',
 'oakland rockridge / claremont',
 'Fair Oaks'}



In [50]:

    
# Group the results by neighborhood, and then take the average home price in each neighborhood
by_neighborhood = housesdf.groupby('Neighborhood').mean().Price # by_neighborhood_mean_price
print(by_neighborhood.head(5), '\n')
print(by_neighborhood['berkeley north / hills'], '\n') 
#print(by_neighborhood.index, '\n') 

by_neighborhood_sort_price = by_neighborhood.sort_values(ascending = True)
#print(by_neighborhood_sort_price.index) # a list of the neighborhoods sorted by price
print(by_neighborhood_sort_price)









    



Neighborhood
CONCORD                              350000
El Sobrante 7.5 miles from Orinda       NaN
Fair Oaks                            479900
HAYWARD                                 NaN
Oakland                              319000
Name: Price, dtype: float64 

837000.0 

Neighborhood
hercules, pinole, san pablo, el sob       1300.000000
Oakland                                 319000.000000
CONCORD                                 350000.000000
oakland hills / mills                   379000.000000
pittsburg / antioch                     382600.000000
oakland east                            404500.000000
walnut creek                            450000.000000
Fair Oaks                               479900.000000
hayward / castro valley                 528258.666667
richmond / point / annex                528666.666667
Tracy                                   535000.000000
oakland west                            550000.000000
concord / pleasant hill / martinez      578000.000000
brentwood / oakley                      595180.000000
vallejo / benicia                       662925.000000
albany / el cerrito                     675666.666667
san leandro                             687000.000000
fairfield / vacaville                   746666.666667
dublin / pleasanton / livermore         785278.800000
fremont / union city / newark           792772.400000
berkeley north / hills                  837000.000000
oakland rockridge / claremont          1200000.000000
danville / san ramon                   1814666.666667
El Sobrante 7.5 miles from Orinda                 NaN
HAYWARD                                           NaN
PLEASANTON-DUBLIN-LIVERMORE                       NaN
San Diego                                         NaN
Name: Price, dtype: float64



In [56]:

    
# Plot average home price for each neighborhood in the East Bay
# dropna()

fig = plt.figure() # or fig = plt.figure(figsize=(15,8)) # width, height
fig.set_figheight(8.0)
fig.set_figwidth(13.0)
ax = fig.add_subplot(111) # row column position 

fntsz=20
titlefntsz=25
lablsz=20
mrkrsz=8

matplotlib.rc('xtick', labelsize = lablsz); matplotlib.rc('ytick', labelsize = lablsz)

# Choose a baseline, based on proximity to current location
# 'berkeley', 'berkeley north / hills', 'albany / el cerrito'
neighborhood_name = 'berkeley north / hills'

# Plot a bar chart
ax.bar(range(len(by_neighborhood_sort_price.dropna())), by_neighborhood_sort_price.dropna(), align='center')

# Add a horizontal line for Berkeley's (or the baseline's) average home price, corresponds with Berkeley bar
ax.axhline(y=housesdf.groupby('Neighborhood').mean().Price.ix[neighborhood_name], linestyle='--')

# Add a grid
ax.grid(b = True, which='major', axis='y') # which='major','both'; options/kwargs: color='r', linestyle='-', linewidth=2)

# Format x axis
ax.set_xticks(range(1,len(housesdf.groupby('Neighborhood').mean().Price.dropna()))); # 0 if first row is at least 100,000
ax.set_xticklabels(by_neighborhood_sort_price.dropna().index[1:], rotation='vertical', fontsize=fntsz) # remove [1:], 90, 45, 'vertical'
ax.set_xlim(0, len(by_neighborhood_sort_price.dropna().index)) # -1 if first row is at least 100,000

# Format y axis
minor_yticks  = np.arange(0, 2000000, 100000)
ax.set_yticks(minor_yticks, minor = True) 
ax.tick_params(axis='y', labelsize=fntsz)
ax.set_ylabel('$\mathrm{Price \; (Dollars)}$', fontsize = titlefntsz)

# Set figure title
ax.set_title('$\mathrm{Average \; Home \; Prices \; in \; the \; East \; Bay \; (Source: Craigslist)}$', fontsize = titlefntsz)

# Save figure
#plt.savefig("home_prices.pdf", bbox_inches='tight')

# Home prices in Berkeley (or the baseline)
print('The average home price in %s is: $' %neighborhood_name, '{0:8,.0f}'.format(housesdf.groupby('Neighborhood').mean().Price.ix[neighborhood_name]), '\n')
print('The most expensive home price in %s is:  $' %neighborhood_name, '{0:8,.0f}'.format(housesdf.groupby('Neighborhood').max().Price.ix[neighborhood_name]), '\n')
print('The least expensive home price in %s is: $' %neighborhood_name, '{0:9,.0f}'.format(housesdf.groupby('Neighborhood').min().Price.ix[neighborhood_name]), '\n')









    



The average home price in berkeley north / hills is: $  837,000 

The most expensive home price in berkeley north / hills is:  $  879,000 

The least expensive home price in berkeley north / hills is: $   795,000



In [ ]:

	Price	SqFeet	nBedrooms	DatePosted	Neighborhood
28	795000	NaN	2	2016-02-05 15:20:00	berkeley north / hills
30	879000	NaN	5	2016-02-05 15:11:00	berkeley north / hills

	Price	SqFeet	nBedrooms	DatePosted	Neighborhood
45	1300	3950	4	2016-02-05 14:22:00	hercules, pinole, san pablo, el sob
24	20000	1900	4	2016-02-05 15:38:00	vallejo / benicia
39	145000	1872	NaN	2016-02-05 14:45:00	dublin / pleasanton / livermore
46	270000	1511	3	2016-02-05 14:20:00	pittsburg / antioch
95	280000	1379	3	2016-02-05 10:32:00	pittsburg / antioch
92	319000	NaN	2	2016-02-05 10:41:00	Oakland
31	329000	NaN	3	2016-02-05 15:07:00	brentwood / oakley
60	350000	1176	3	2016-02-05 13:06:00	CONCORD
55	360000	1905	4	2016-02-05 13:52:00	fairfield / vacaville
86	373000	1552	3	2016-02-05 10:59:00	fairfield / vacaville
87	379000	NaN	2	2016-02-05 10:57:00	oakland hills / mills
69	399000	1377	4	2016-02-05 12:09:00	oakland east
1	410000	1144	2	2016-02-05 17:31:00	oakland east
80	418000	11500	3	2016-02-05 11:13:00	concord / pleasant hill / martinez
82	435000	2506	4	2016-02-05 11:03:00	pittsburg / antioch
37	439000	2260	3	2016-02-05 14:57:00	pittsburg / antioch
48	442500	2255	4	2016-02-05 14:17:00	vallejo / benicia
53	444000	2020	4	2016-02-05 14:03:00	vallejo / benicia
83	450000	920	2	2016-02-05 11:01:00	walnut creek
11	469000	817	2	2016-02-05 16:26:00	richmond / point / annex
66	479900	2293	5	2016-02-05 12:40:00	Fair Oaks
36	480000	2580	3	2016-02-05 14:59:00	vallejo / benicia
56	485000	NaN	2	2016-02-05 13:47:00	fremont / union city / newark
54	489000	1945	4	2016-02-05 13:52:00	fairfield / vacaville
34	489000	2752	5	2016-02-05 15:05:00	pittsburg / antioch
27	489900	2451	4	2016-02-05 15:23:00	dublin / pleasanton / livermore
72	490000	1952	4	2016-02-05 11:47:00	brentwood / oakley
26	499000	1473	3	2016-02-05 15:24:00	san leandro
25	499888	1119	3	2016-02-05 15:36:00	hayward / castro valley
52	499888	1119	3	2016-02-05 14:07:00	hayward / castro valley
77	529000	NaN	4	2016-02-05 11:31:00	concord / pleasant hill / martinez
9	535000	3635	7	2016-02-05 16:32:00	Tracy
14	535000	3635	7	2016-02-05 16:23:00	Tracy
15	535000	3635	7	2016-02-05 16:22:00	Tracy
23	539000	3394	5	2016-02-05 15:40:00	NaN
67	548000	1244	3	2016-02-05 12:26:00	richmond / point / annex
19	549000	NaN	2	2016-02-05 16:04:00	albany / el cerrito
43	550000	NaN	NaN	2016-02-05 14:24:00	fremont / union city / newark
32	550000	1474	4	2016-02-05 15:06:00	oakland west
42	564900	1741	4	2016-02-05 14:25:00	vallejo / benicia
81	569000	NaN	2	2016-02-05 11:11:00	richmond / point / annex
38	585000	1750	3	2016-02-05 14:48:00	vallejo / benicia
94	585000	1350	3	2016-02-05 10:33:00	hayward / castro valley
99	625000	1350	3	2016-02-05 10:24:00	dublin / pleasanton / livermore
73	625000	1668	3	2016-02-05 11:39:00	brentwood / oakley
76	629000	NaN	5	2016-02-05 11:35:00	fairfield / vacaville
75	629000	4085	5	2016-02-05 11:36:00	fairfield / vacaville
4	639000	NaN	3	2016-02-05 17:21:00	fremont / union city / newark
78	649900	2903	4	2016-02-05 11:14:00	brentwood / oakley
90	650000	1623	4	2016-02-05 10:51:00	dublin / pleasanton / livermore
96	650000	1320	3	2016-02-05 10:27:00	danville / san ramon
79	650000	2091	4	2016-02-05 11:13:00	concord / pleasant hill / martinez
98	660000	1300	3	2016-02-05 10:25:00	dublin / pleasanton / livermore