Análise exploratória de preços de veículos novos e usados

Analisamos a tabela fipe de preço médio de veículos.


In [1]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
import numpy
import pandas
import seaborn

from matplotlib import pyplot
%matplotlib inline

from scipy import stats

pyplot.style.use(['ggplot', 'seaborn-colorblind', 'seaborn-talk'])
pyplot.xkcd()


Out[1]:
<matplotlib.rc_context at 0x7f239c3b4630>

Carrega os dados da tabela de preços mais recente. Separa modelos e versões de modelos.


In [2]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
prices = pandas.read_csv('../dat/fipe_price_list_201710.csv.gz', index_col=[0])

# Major maker
major_makers = ['Mercedes-Benz', 'BMW', 'Audi', 'VW - Volkswagen', 'Fiat', 
                'GM - Chevrolet', 'Kia Motors', 'Hyundai', 'Ford', 'Citroën', 
                'Renault']
prices['maker_name_major'] = prices['maker_name'].map(lambda x: x 
                                                      if x in major_makers else
                                                      '_outros_')

# Model and version
model_version = prices.model_name.str.split(' ', 1)
prices['model_name'] = model_version.str[0]
prices['model_version'] = model_version.str[1]

In [3]:
prices.head()


Out[3]:
date maker_name model_name fuel_type build_year price fipe_code maker_name_major model_version
0 2017-10-15 Acura Integra 1 1992 12614.0 038003-2 _outros_ GS 1.8
1 2017-10-15 Acura Integra 1 1991 11601.0 038003-2 _outros_ GS 1.8
2 2017-10-15 Acura Legend 1 1998 28761.0 038002-4 _outros_ 3.2/3.5
3 2017-10-15 Acura Legend 1 1997 24217.0 038002-4 _outros_ 3.2/3.5
4 2017-10-15 Acura Legend 1 1996 23302.0 038002-4 _outros_ 3.2/3.5

In [4]:
prices.describe()


Out[4]:
fuel_type build_year price
count 22934.000000 22934.000000 2.293400e+04
mean 1.336444 3442.571379 7.574337e+04
std 0.735149 6406.947379 2.016749e+05
min 1.000000 1985.000000 1.902000e+03
25% 1.000000 1999.000000 1.590050e+04
50% 1.000000 2007.000000 3.120300e+04
75% 1.000000 2013.000000 6.511175e+04
max 3.000000 32000.000000 5.500000e+06

Modelos por marca


In [5]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
prices_by_models = prices.groupby(['maker_name', 'model_name'])\
    ['price'].agg(['min', 'mean', 'max', 'std'])
prices_by_models.reset_index(level=1, inplace=True)
prices_by_models.head()


Out[5]:
model_name min mean max std
maker_name
AM Gen Hummer 170643.0 2.219286e+05 290263.0 34972.729108
ASTON MARTIN DB9 554915.0 1.009014e+06 1436232.0 441272.962986
ASTON MARTIN Rapide 643655.0 1.210589e+06 1840884.0 500723.204692
ASTON MARTIN Vanquish 1542242.0 2.086130e+06 2688743.0 575502.046016
ASTON MARTIN Vantage 316137.0 7.688905e+05 1712739.0 425884.471066

In [6]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
models_by_maker = prices_by_models.pivot_table(index=['maker_name'], 
                                               values='model_name', 
                                               aggfunc='count')
models_by_maker.sort_values(by='model_name', ascending=False, inplace=True)
models_by_maker.head()


Out[6]:
model_name
maker_name
Mercedes-Benz 121
BMW 100
GM - Chevrolet 58
Ford 38
Fiat 35

In [7]:
ax = models_by_maker.plot.bar(legend=False, figsize=[11, 5])
ax.figure.tight_layout()



In [8]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
prices_by_models = prices.groupby(['maker_name', 'model_name', 'build_year'])\
    ['price'].agg(['min', 'mean', 'max', 'std'])
prices_by_models.reset_index(level=[2, 1], inplace=True)
prices_by_models.head()


Out[8]:
model_name build_year min mean max std
maker_name
AM Gen Hummer 1998 170643.0 1.926247e+05 209401.0 19896.413303
AM Gen Hummer 1999 199707.0 2.143920e+05 234803.0 18235.202521
AM Gen Hummer 2000 239710.0 2.587690e+05 290263.0 27474.958981
ASTON MARTIN DB9 2011 554915.0 5.549150e+05 554915.0 NaN
ASTON MARTIN DB9 2014 1035894.0 1.035894e+06 1035894.0 NaN

In [9]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
models_by_maker = prices_by_models.pivot_table(index=['maker_name'], 
                                               values='model_name',
                                               columns=['build_year'],
                                               aggfunc='count')
models_by_maker.head()


Out[9]:
build_year 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 32000
maker_name
AM Gen NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ASTON MARTIN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 3.0 3.0 NaN 4.0 1.0 4.0 NaN NaN 3.0
Acura NaN NaN NaN NaN NaN NaN 3.0 3.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Agrale NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
Alfa Romeo 1.0 1.0 NaN NaN NaN NaN 2.0 2.0 2.0 1.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 35 columns


In [10]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
models_by_maker_today = models_by_maker[32000].sort_values(ascending=False)
ax = models_by_maker_today[models_by_maker_today>0].plot.bar(legend=False,
                                                             figsize=[11, 5])
ax.figure.tight_layout()



In [11]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
prices_by_models_major = prices.groupby(['maker_name_major', 'model_name', 
                                         'build_year'])['price'].agg(['min', 'mean',
                                                                      'max', 'std'])
prices_by_models_major.reset_index(level=2, inplace=True)
prices_by_models_major.head()


Out[11]:
build_year min mean max std
maker_name_major model_name
Audi 100 1993 11553.0 11553.000000 11553.0 NaN
100 1994 12118.0 14047.666667 16692.0 2369.267890
100 1995 14134.0 16509.666667 18801.0 2334.642656
80 1994 11082.0 17395.833333 33813.0 8336.167883
80 1995 13895.0 19689.000000 38381.0 9359.300893

In [12]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
models_by_maker_major = prices_by_models_major\
    [prices_by_models_major.build_year < 32000].pivot_table(
        index='build_year', columns=['maker_name_major'],  values='mean',
        aggfunc='count')
models_by_maker_major.head()


Out[12]:
maker_name_major Audi BMW Citroën Fiat Ford GM - Chevrolet Hyundai Kia Motors Mercedes-Benz Renault _outros_
build_year
1985 NaN NaN NaN 6.0 8.0 13.0 NaN NaN 2.0 NaN 17.0
1986 NaN 2.0 NaN 6.0 9.0 13.0 NaN NaN 5.0 NaN 17.0
1987 NaN 3.0 NaN 5.0 6.0 13.0 NaN NaN 4.0 NaN 15.0
1988 NaN 2.0 NaN 5.0 6.0 14.0 NaN NaN 5.0 NaN 17.0
1989 NaN 3.0 NaN 4.0 6.0 16.0 NaN NaN 4.0 NaN 16.0

In [13]:
colors = pyplot.cm.tab20c(numpy.linspace(0, 1, 11))
models_by_maker_major.plot.area(figsize=[11, 5], color=colors)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2365513dd8>

Versões por marca

Desde 1985:


In [14]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
ptable_versions = prices.pivot_table(index=['maker_name'], values='model_name', 
                                     aggfunc=len)
ptable_versions.sort_values(by='model_name', ascending=False, inplace=True)

In [15]:
ax = ptable_versions.plot.bar(legend=False, figsize=[11, 5])
ax.figure.tight_layout()


Hoje, considerando-se apenas carros novos (0 km):


In [16]:
ptable_versions_new = prices[prices.build_year==32000].pivot_table(
    index=['maker_name'], values='model_name', aggfunc=len)
ptable_versions_new.sort_values(by='model_name', ascending=False, inplace=True)

In [17]:
ax = ptable_versions_new.plot.bar(legend=False, figsize=[11, 5])
ax.figure.tight_layout()


Ao longo do tempo, quantas versões de carros há no mercado?


In [18]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
ptable_model_version = prices.pivot_table(index=['maker_name', 'model_name'],
                                          columns='build_year',
                                          values='model_version', aggfunc=len)

In [19]:
ptable_model_version.head()


Out[19]:
build_year 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 32000
maker_name model_name
AM Gen Hummer NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ASTON MARTIN DB9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1.0 NaN NaN 1.0 NaN 1.0 NaN NaN NaN
Rapide NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 1.0 1.0 NaN 1.0 NaN 1.0 NaN NaN 1.0
Vanquish NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN 1.0
Vantage NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 2.0 3.0 NaN 1.0 1.0 2.0 NaN NaN 2.0

5 rows × 35 columns


In [20]:
version_timeseries = ptable_model_version.sum()
version_timeseries[version_timeseries.index < 32000].plot.line(figsize=[11, 5])


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f235bd8cef0>

Distribuição de preços


In [21]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
bins = numpy.arange(0, 300e3, 1e3)
ax = prices['price'].plot.hist(bins=bins, legend=False, figsize=[11, 5],
                               linewidth=0, normed=True, histtype='stepfilled')


Inverse gaussian distribution

https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution

\begin{equation} f(x\;|\;\mu, \lambda) = \sqrt{\frac{\lambda}{2\pi x^3}} e^{-\frac{\lambda(x - \mu)^2}{2 \mu^2 x}} \end{equation}

In [22]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|

ax = prices['price'].plot.hist(bins=bins, legend=False, figsize=[11, 5],
                               linewidth=0, normed=True, histtype='stepfilled')

mean, median = prices.price.mean(), prices.price.median()
mode = prices.price.mode().values[0]
ax.axvline(mean)
ax.text(mean+2500, 2e-5, 'Media={:.0f}'.format(mean))
ax.axvline(median)
ax.text(median+2500, 1.8e-5, 'Mediana={:.0f}'.format(median))
ax.axvline(mode)
ax.text(mode+2500, 1.6e-5, 'Moda={:.0f}'.format(mode))

mu, loc, scale = stats.invgauss.fit(prices['price'])
x = 0.5 * (bins[1:] + bins[:-1])  # Center of the bins
pdf = stats.invgauss.pdf(x, mu, loc, scale)
ax.plot(x, pdf, 'r')
ax.annotate('{:.0f}'.format(scale/mu), xy=(scale/mu, 2.45e-5), xycoords='data',
                xytext=(30, 10), textcoords='offset points',
                fontsize='12', fontstretch='ultra-condensed',
                ha='center', va='bottom',
                arrowprops=dict(arrowstyle="-",
                                connectionstyle='arc3,rad=0.3',
                                color='#999999')
           )


Out[22]:
<matplotlib.text.Annotation at 0x7f2365682710>

In [23]:
#...:....|....:....|....:....|....:....|....:....|....:....|....:....|....:....|
ax = prices['price'][prices.build_year == 32000].plot.hist(
    bins=bins, legend=False, figsize=[11, 5], linewidth=0,
    histtype='stepfilled')