This is a running log of some work on used-car sales. I have no intention of using this information for financial purposes; rather, I'd like to ask the question "is there regional variation in used-car prices?". To do this I will use the BeautifulSoup package to scrape www.pistonheads.com, my preferred way to search for used cars.
In [2]:
from BeautifulSoup import BeautifulSoup
import urllib
import pandas as pd
import seaborn
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as so
%matplotlib inline
import seaborn as sns
sns.set_style(rc={'font.family': ['sans-serif'], 'axes.labelsize': 25})
sns.set_context("notebook")
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['axes.labelsize'] = 18
After some fiddling I have wrapped up the method to get a page of results (100 at a time) in a small set of functions:
In [3]:
def strip_results_from_ad(ad):
    """Strip the information out of a single advert and return it as a tuple"""
    desc = ad.find("div", attrs={"class": "listing-headline"}).find("h3").text
    loc = ad.findAll("p", attrs={"class": "location"})[0].text
    price = int(ad.find("div", attrs={"class": "price"}).text.replace(u"£", "").replace(",", ""))
    specs = ad.find("ul", attrs={"class": "specs"}).findAll("li")
    if len(specs) == 4:
        miles = int(specs[0].text.rstrip(" miles").replace(",", ""))
        fuel = specs[1].text
        bhp = int(specs[2].text.rstrip(" bhp"))
        transmission = specs[3].text
    else:
        fuel = "NA"
        bhp = np.nan
        transmission = "NA"
        try:
            miles = int(specs[0].text.rstrip(" miles").replace(",", ""))
        except:
            # Catch any error at all (lazy, but fine for a quick scrape)
            miles = np.nan
    return desc, loc, price, miles, fuel, bhp, transmission
def create_url(page=1, M=269, rpp=100):
    base = ("http://www.pistonheads.com/classifieds?Category=used-cars"
            "&M={M}&ResultsPerPage={rpp}&Page={page}")
    return base.format(page=page, rpp=rpp, M=M)
def get_results(*args, **kwargs):
    url = create_url(*args, **kwargs)
    f = urllib.urlopen(url).read()
    soup = BeautifulSoup(f)
    ads = soup.findAll("div", attrs={"class": "ad-listing"})
    results = {"desc": [], "loc": [], "price": [], "miles": [],
               "fuel": [], "bhp": [], "transmission": []}
    for ad in ads:
        try:
            desc, loc, price, miles, fuel, bhp, transmission = strip_results_from_ad(ad)
        except:
            # Stop at the first advert that fails to parse
            break
        results["desc"].append(desc)
        results["loc"].append(loc)
        results["price"].append(price)
        results["miles"].append(miles)
        results["fuel"].append(fuel)
        results["bhp"].append(bhp)
        results["transmission"].append(transmission)
    return results
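As a quick sanity check, the URL builder fills in the template as expected; this snippet is my own addition, and the commented line simply shows the string the format produces:
In [ ]:
print(create_url(page=2))
# http://www.pistonheads.com/classifieds?Category=used-cars&M=269&ResultsPerPage=100&Page=2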
Here we get 20 results from the URL with "M" 269; on the PistonHeads website this corresponds to a BMW 1 Series. I chose this search in particular because there isn't huge variation in the specs: most have a similar engine size and there are few special runs. Let's have a look at the first few results:
In [4]:
r = get_results(M=269, rpp=20)
df_small = pd.DataFrame(r)
df_small
Out[4]:
This looks good so far. The source website shows there are 2,580 entries, so let's pull down a good chunk of them and see what we can do with the data. The best way to do this is to iterate through the pages and concatenate the resulting data frames:
In [6]:
dfs = []
for page in xrange(1, 25):
    r = get_results(M=269, rpp=100, page=page)
    dfs.append(pd.DataFrame(r))
In [7]:
df = pd.concat(dfs)
len(df)
Out[7]:
In [8]:
ax = df.sort("price").plot("miles", "price", kind="scatter")
In [9]:
df.query("miles < 50000 and price > 30000")
Out[9]:
As we can see, these are all "1M" models, a much faster version. Their 335 bhp is significantly greater than the average:
In [10]:
df.bhp.mean()
Out[10]:
In [11]:
df_clean = df[~np.isnan(df.price)]
df_clean = df_clean[~np.isnan(df_clean.miles)]
df_clean = df_clean[df_clean.price < 30000]
df_clean = df_clean.sort("miles")
Now that we have a clean collection, we can apply a rolling mean to the price and mileage:
In [12]:
ax = plt.subplot(111)
window = 100
mean_miles = pd.rolling_mean(df_clean.miles, window, center=True)
mean_price = pd.rolling_mean(df_clean.price, window, center=True)
# Drop the NaNs created at the edges by the rolling mean
mean_miles = mean_miles[~np.isnan(mean_miles)]
mean_price = mean_price[~np.isnan(mean_price)]
ax.plot(df_clean.miles, df_clean.price, "o", alpha=0.3, markersize=5)
ax.plot(mean_miles, mean_price, lw=3, color=seaborn.xkcd_rgb["pale red"])
ax.set_xlabel("Mileage")
ax.set_ylabel(u"Price (£)")
plt.show()
We may now imagine coming up with a model for the price. The obvious starting point is to assume that the rate of depreciation with mileage is proportional to the price itself, therefore:
$$ \frac{dP}{dm} = -k P $$
where $P$ is the price, $m$ is the mileage and $k$ is a constant of proportionality.
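Solving is a short separation of variables:
$$ \int \frac{dP}{P} = -k \int dm \quad \Rightarrow \quad \ln P = -km + C, $$
and exponentiating, writing $P_{0} \equiv P(0)$ for the integration constant, yields: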
$$ P(m) = P_{0} e^{-km} $$
Let's try fitting this:
In [13]:
import scipy.optimize as so
ax = plt.subplot(111)
def P(m, P0, k):
    return P0 * np.exp(-k * m)
popt, pcov = so.curve_fit(P, mean_miles, mean_price, p0=[20000, 1e-10])
ax.plot(df_clean.miles, df_clean.price, "o", alpha=0.3, markersize=5)
ax.plot(mean_miles, mean_price, lw=3, color=seaborn.xkcd_rgb["pale red"])
fit_miles = np.linspace(df_clean.miles.min(), df_clean.miles.max(), 1000)
ax.plot(fit_miles, P(fit_miles, *popt), zorder=10)
ax.annotate(xy=(0.6, 0.7), xycoords="figure fraction",
            s=("$P_0={:5.0f} \pm {:5.0f}$\n$k={:1.2e} \pm {:1.2e}$".format(
                popt[0], np.sqrt(pcov[0, 0]), popt[1], np.sqrt(pcov[1, 1]))),
            fontsize=20)
ax.set_xlabel("Mileage")
ax.set_ylabel(u"Price (£)")
plt.show()
This model seems fairly able to predict the average used-car sale price. The parameter $P_0$ measures the value at 0 miles; of course, all used cars should have at least some mileage to qualify as used, but a surprising number have very low mileage:
In [14]:
len(df.query("miles < 500"))
Out[14]:
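Rearranging the fitted model gives each advert's implied price at zero miles,
$$ P_{0} = P \, e^{km}, $$
which we can compute per car using the fitted $k$: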
In [40]:
df['P0'] = df.price*np.exp(popt[1] * df.miles)
df.head()
Out[40]:
In [125]:
grouped = df.groupby(['loc',])
In [124]:
P0_mean = grouped['P0'].mean().values
counts = grouped['P0'].count().values
locs = grouped.all().index
df_mean = pd.DataFrame(dict(locs=locs,
                            P0_mean=P0_mean,
                            counts=counts))
df_mean.head()
Out[124]:
In [127]:
df_mean_clean = df_mean[df_mean.counts > 1]
df_mean_clean.head()
Out[127]:
In [128]:
from geopy.geocoders import Nominatim
N = Nominatim()
location = [N.geocode(loc) for loc in df_mean_clean.locs]
df_mean_clean['location'] = location
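Geocoding one query per location works, but the public Nominatim service expects at most about one request per second, and some place names fail to resolve. A more defensive variant might look like this (a sketch; `safe_geocode` and the pause length are my own choices, not part of the original run):
In [ ]:
import time

def safe_geocode(geocoder, name, pause=1.0):
    """Geocode a place name, returning None on any failure."""
    try:
        result = geocoder.geocode(name)
    except Exception:
        result = None
    time.sleep(pause)  # be polite to the free Nominatim service
    return result

location = [safe_geocode(N, loc) for loc in df_mean_clean.locs]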
In [141]:
df_mean_clean = df_mean_clean.dropna()
In [147]:
df_mean_clean['lat'] = [l.latitude for l in df_mean_clean.location.values]
df_mean_clean['lon'] = [l.longitude for l in df_mean_clean.location.values]
In [149]:
df_mean_clean.head()
Out[149]:
As a result we have predicted the initial price to be $20259 \pm 23$ (in £), which isn't all that far off the actual range of new-car prices, £21,180 to £21,710.
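A handy way to interpret the fitted $k$ is as a half-life in miles: setting $e^{-km} = 1/2$ gives $m_{1/2} = \ln 2 / k$. A quick computation (a sketch of my own, reusing `popt` from the fit above):
In [ ]:
# Mileage over which the model predicts the price to halve
half_life_miles = np.log(2) / popt[1]
print("The model predicts the price halves every {:.0f} miles".format(half_life_miles))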
In [153]:
fig = plt.figure()
from mpl_toolkits.basemap import Basemap
# Set the map centre and corner coordinates to frame the UK
lat_0 = 0
lon_0 = 1
llcrnrlon, urcrnrlon = -7, 2
llcrnrlat, urcrnrlat = 49, 60
# Create the map, using the Gall stereographic projection
m = Basemap(projection='gall',
            resolution='l',
            area_thresh=10000.0,
            lat_0=lat_0, lon_0=lon_0,
            llcrnrlon=llcrnrlon,
            urcrnrlon=urcrnrlon,
            llcrnrlat=llcrnrlat,
            urcrnrlat=urcrnrlat,
            ax=fig.gca())
m.drawcountries()
m.drawcoastlines()
m.fillcontinents(color='#996633')
m.drawmapboundary(fill_color='#0099FF')
lons = df_mean_clean.lon.values
lats = df_mean_clean.lat.values
z = df_mean_clean.P0_mean.values
# The geocoded latitudes and longitudes are already in degrees, so they
# can be passed straight to the map projection (no radian conversion)
x, y = m(lons, lats)
# pcolormesh needs a regular 2D grid; for scattered points a coloured
# scatter plot is the right tool
m.scatter(x, y, c=z, s=80, zorder=10)
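To make the shading readable we can attach a colorbar using Basemap's built-in helper (a minimal follow-up sketch; the label wording is mine):
In [ ]:
cbar = m.colorbar(location="right")  # colorbar for the last plotted layer
cbar.set_label(u"Mean implied new price (£)")
plt.show()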