The goal of this project is to build a predictive model that estimates the price of a property in the Vila Nova Conceição neighborhood of São Paulo.
To complete the project, each student should follow the instructions in this notebook and fill in the empty cells.
In [1]:
import pandas as pd
import io
import requests
url="https://media.githubusercontent.com/media/fbarth/ml-espm/master/data/20140917_imoveis_filtrados_final.csv_shaped.csv"
s=requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')), sep=",")
In [2]:
df.head()
Out[2]:
In [3]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(10, 8))
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.scatterplot(x="area", y="preco",
                     hue="suites", size="vagas",
                     palette=cmap, sizes=(10, 200),
                     data=df)
In [4]:
df['bairro'].value_counts()
Out[4]:
In [5]:
# @hidden_cell
df = df[df['bairro'] == 'vila-nova-conceicao']
In [6]:
df.shape
Out[6]:
In [7]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(10, 8))
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.scatterplot(x="area", y="preco",
                     hue="suites", size="vagas",
                     palette=cmap, sizes=(10, 200),
                     data=df)
In [8]:
df.head()
Out[8]:
In [9]:
# @hidden_cell
df = df[df['suites'] <= df['dormitorios']]
df = df[df['suites'] <= df['banheiros']]
In [10]:
df.shape
Out[10]:
In [11]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(10, 8))
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.scatterplot(x="area", y="preco",
                     hue="suites", size="vagas",
                     palette=cmap, sizes=(10, 200),
                     data=df)
In [12]:
# @hidden_cell
df = df.drop(columns=['bairro'])
In [13]:
df.head()
Out[13]:
In [14]:
# @hidden_cell
x = df['preco'].describe()
x
Out[14]:
In [15]:
df.head()
Out[15]:
In [16]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:,1:12], df['preco'], test_size=0.2, random_state=4)
In [17]:
x_train.head()
Out[17]:
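The split above selects the feature columns by position with `df.iloc[:,1:12]`, which depends on the column order of the CSV. As a hedged alternative, the sketch below selects features by dropping the target column by name. The toy frame and its values are made up for illustration; only `preco`, `area` and `vagas` are column names that appear in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "preco": [900_000, 1_500_000, 2_100_000, 3_000_000, 4_200_000],
    "area": [60, 95, 140, 210, 300],
    "vagas": [1, 2, 2, 3, 4],
})

# Select features by name so the target can never leak into the inputs
features = toy.drop(columns=["preco"])
target = toy["preco"]

x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=4
)
print(x_tr.shape, x_te.shape)
```

Selecting by name makes it explicit that `preco` is excluded from the inputs, regardless of where it sits in the file.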
Build a regression model using the algorithms and feature transformations you consider most appropriate.
Validate the model on the x_test and y_test datasets. The mean absolute error is expected to be below two hundred thousand reais (R$ 200,000.00).
In the cells below, describe every step needed to develop and validate the model.
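Because the same three metrics are computed for every model in the cells that follow, one option is to wrap them in a small helper. The sketch below is an assumption, not part of the original notebook: `report_metrics` and `MAE_TARGET` are hypothetical names, and the prices are made up. The threshold of 200,000 comes from the task description above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

MAE_TARGET = 200_000  # target from the assignment text, in reais


def report_metrics(y_true, y_pred, mae_target=MAE_TARGET):
    """Hypothetical helper: RMSE, R2, MAE and whether the MAE meets the target."""
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),  # square root of the MSE
        "r2": float(r2_score(y_true, y_pred)),
        "mae": float(mae),
        "meets_target": bool(mae < mae_target),
    }


# Made-up prices, for illustration only
y_true = [1_000_000, 2_000_000, 3_000_000]
y_pred = [1_100_000, 1_900_000, 3_100_000]
metrics = report_metrics(y_true, y_pred)
print(metrics)
```

With real data, the call would be `report_metrics(y_test, predicted)` after each model's prediction step.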
In [20]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
In [21]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x_train, y_train)
In [22]:
predicted = model.predict(x_test)
In [23]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, predicted))
r2 = r2_score(y_test, predicted)
mae = mean_absolute_error(y_test, predicted)
print('rmse =', rmse)
print('r2 =', r2)
print('mae =', mae)
In [24]:
from sklearn.ensemble import RandomForestRegressor
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = []
for i in range(100, 2000, 100):
    clf = RandomForestRegressor(n_estimators=i, max_depth=None, random_state=4, oob_score=True)
    clf.fit(x_train, y_train)
    rows.append({'estimators': i, 'r2': clf.oob_score_})
results = pd.DataFrame(rows, columns=['estimators', 'r2'])
In [25]:
results.plot(x='estimators', y='r2')
Out[25]:
In [26]:
clf = RandomForestRegressor(n_estimators=400, max_depth=None, random_state=4, oob_score=True)
clf.fit(x_train, y_train)
Out[26]:
In [27]:
print(clf.oob_score_)
print(clf.feature_importances_)
print(x_train.columns)
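Printing the importances and the column names as two separate arrays makes them hard to match by eye. A minimal sketch of pairing them in a sorted pandas Series follows. The data here is synthetic and only stands in for `x_train` and `y_train`: the column names are borrowed from the notebook, but the values and the price formula are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Synthetic stand-in for x_train / y_train (not the real dataset)
X = pd.DataFrame({
    "area": rng.uniform(40, 400, 200),
    "vagas": rng.integers(0, 5, 200),
    "suites": rng.integers(0, 4, 200),
})
# In this toy setup the price depends on area only, plus noise
y = 10_000 * X["area"] + rng.normal(0, 50_000, 200)

forest = RandomForestRegressor(n_estimators=50, random_state=4).fit(X, y)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Applied to the notebook's own objects, the same idea would read `pd.Series(clf.feature_importances_, index=x_train.columns)`.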
In [28]:
predicted = clf.predict(x_test)
In [29]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, predicted))
r2 = r2_score(y_test, predicted)
mae = mean_absolute_error(y_test, predicted)
print('rmse =', rmse)
print('r2 =', r2)
print('mae =', mae)
In [30]:
#Generate a new feature matrix consisting of all polynomial combinations
#of the features with degree less than or equal to the specified degree.
#For example, if an input sample is two dimensional and of the form [a, b],
#the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
from sklearn.preprocessing import PolynomialFeatures
transformer = PolynomialFeatures(degree=2, include_bias=False)
transformer.fit(x_train)
train_ = transformer.transform(x_train)
In [31]:
print(x_train.shape)
print(train_.shape)
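To make the expansion concrete, the sketch below applies the same transformer settings (`degree=2`, `include_bias=False`) to a single made-up row with a=2 and b=3. Because the bias term is excluded, the leading 1 from the `[1, a, b, a^2, ab, b^2]` description is dropped. For n input columns, this setting produces n(n+3)/2 output columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2.0, 3.0]])  # one row with a=2, b=3

poly = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly.fit_transform(sample)

# Output columns are ordered as a, b, a^2, a*b, b^2
print(expanded)  # [[2. 3. 4. 6. 9.]]

# Column count for degree 2 without bias: n * (n + 3) / 2
n = sample.shape[1]
print(expanded.shape[1], n * (n + 3) // 2)  # 5 5
```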
In [32]:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(train_, y_train)
In [33]:
# Reuse the transformer fitted on x_train; do not refit it on the test set
test_ = transformer.transform(x_test)
predicted = model.predict(test_)
In [34]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, predicted))
r2 = r2_score(y_test, predicted)
mae = mean_absolute_error(y_test, predicted)
print('rmse =', rmse)
print('r2 =', r2)
print('mae =', mae)
In [35]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'max_features': ['sqrt', 'log2', 3, 4, 5, 6],
    'max_depth': [2, 10, 20, 100, None]
}
rfc = RandomForestRegressor(random_state=4)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3, verbose=1, n_jobs=-1)
CV_rfc.fit(train_, y_train)
CV_rfc.best_params_
Out[35]:
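By default, `GridSearchCV` ranks candidates with the estimator's own score, which for a regressor is R2, while the stated goal of this project is a low mean absolute error. One possible variation, sketched below under assumptions, is to put the polynomial expansion and the forest in a single `Pipeline` and search with `scoring="neg_mean_absolute_error"`. The data is synthetic, the grid is deliberately tiny so that it runs quickly, and the step names `poly` and `rf` are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)

# Synthetic regression data, for illustration only
X = rng.uniform(0, 1, size=(120, 3))
y = 3 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0, 0.05, 120)

# The transformer is refit inside each CV fold, on that fold's training part only
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("rf", RandomForestRegressor(random_state=4)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
small_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [5, None],
}

search = GridSearchCV(pipe, small_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)

# scikit-learn maximizes scores, so the MAE is reported negated
best_mae = -search.best_score_
print(search.best_params_, best_mae)
```

The fitted `search` object can then predict directly on untransformed inputs, since the pipeline applies the polynomial step itself.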
In [36]:
clf3 = RandomForestRegressor(n_estimators=100, max_features = 6, max_depth=20, random_state=4, oob_score=True)
clf3.fit(train_, y_train)
Out[36]:
In [37]:
print(clf3.oob_score_)
In [38]:
predicted = clf3.predict(test_)
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, predicted))
r2 = r2_score(y_test, predicted)
mae = mean_absolute_error(y_test, predicted)
print('rmse =', rmse)
print('r2 =', r2)
print('mae =', mae)
In [40]:
def diff(x, y):
    return x - y
resultados = pd.DataFrame({'real': y_test,'rf_grid_polinomial': predicted})
resultados['rf_grid_polinomial_dif'] = resultados.apply(lambda x: diff(x['real'], x['rf_grid_polinomial']), axis=1)
resultados = resultados.join(x_test)
resultados.head()
Out[40]:
In [41]:
ax = resultados.plot(x='area', y='rf_grid_polinomial_dif', kind='scatter')
ax.set_title('Error analysis')
Out[41]:
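A scatter plot of the residuals shows their spread, but a per-range summary can make it easier to see where the model is weakest. The sketch below groups the absolute error by area range. The `demo` frame and its values are made up to stand in for `resultados`, and the bin edges are arbitrary assumptions.

```python
import pandas as pd

# Made-up values standing in for the `resultados` frame
demo = pd.DataFrame({
    "area": [50, 80, 150, 220, 300, 380],
    "rf_grid_polinomial_dif": [-20_000, 30_000, -100_000, 150_000, -400_000, 500_000],
})

# Absolute error per row, then its mean within each area range
demo["abs_dif"] = demo["rf_grid_polinomial_dif"].abs()
demo["area_bin"] = pd.cut(demo["area"], bins=[0, 100, 250, 400])
mae_by_bin = demo.groupby("area_bin", observed=True)["abs_dif"].mean()
print(mae_by_bin)
```

In this toy data the mean absolute error grows with the area; whether the real residuals follow the same pattern has to be checked against the actual `resultados` frame.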