In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)
from scipy import stats
In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [4]:
train.head()
Out[4]:
In [5]:
test.head()
Out[5]:
In [6]:
train.shape, test.shape
Out[6]:
The target variable SalePrice is present in the train dataset but not in the test dataset. We have to predict SalePrice for the test dataset.
In [7]:
train.columns
Out[7]:
In [142]:
# Description of all the features and their values
with open('data_description.txt') as desc_file:
    print(desc_file.read())
In [8]:
train.dtypes.value_counts()  # number of columns of each dtype
Out[8]:
In [9]:
train.describe()
Out[9]:
In [10]:
train.info()
In [11]:
corr = train.corr()["SalePrice"]
corr.sort_values(ascending=False)
Out[11]:
In [12]:
plt.figure(figsize=(20,20))
corr = corr[1:-1] # dropping the first (Id) and last (SalePrice) entries of the Series
corr.plot(kind='barh') # using pandas plot
plt.title('Correlation coefficients w.r.t. Sale Price')
Out[12]:
In [13]:
# taking highly correlated variables with a positive correlation of 0.45 and above
high_positive_correlated_variables = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', \
'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', \
'YearRemodAdd', 'GarageYrBlt', 'MasVnrArea', 'Fireplaces']
corrMatrix = train[high_positive_correlated_variables].corr()
sns.set(font_scale=1.10)
plt.figure(figsize=(15, 15))
sns.heatmap(corrMatrix, vmax=.8, linewidths=0.01,
square=True, annot=True, cmap='viridis', linecolor="white")
plt.title('Correlation between features');
From the above heatmap, we can see that some features (other than our target variable SalePrice) are highly correlated among themselves. Note the yellow blocks in the above heatmap. The following features are intercorrelated:
TotRmsAbvGrd <> GrLivArea = 0.83
GarageYrBlt <> YearBuilt = 0.83
1stFlrSF <> TotalBsmtSF = 0.82
GarageArea <> GarageCars = 0.88
OverallQual is the other feature which is highly correlated with our target variable SalePrice.
SalePrice <> OverallQual = 0.79
This type of scenario results in multicollinearity. Multicollinearity occurs when there is moderate or high intercorrelation between independent variables, and it can inflate the standard errors of the estimated coefficients.
There are different ways to reduce multicollinearity, such as dropping one feature from each highly correlated pair, combining correlated features into a single feature, or using regularized models (e.g. Ridge or Lasso).
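As an illustrative aside (not part of the original workflow), multicollinearity can also be quantified with the variance inflation factor (VIF); values well above 10 are usually read as problematic. A minimal sketch, assuming statsmodels is installed and using a few of the intercorrelated features noted above:
# Sketch: variance inflation factor (VIF) for some of the intercorrelated features above.
# Assumes statsmodels is installed; a VIF much larger than 10 signals strong multicollinearity.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_features = ['GrLivArea', 'TotRmsAbvGrd', 'TotalBsmtSF', '1stFlrSF', 'GarageCars', 'GarageArea']
X_vif = sm.add_constant(train[vif_features].dropna())  # add an intercept column
vif = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
                index=X_vif.columns)
print(vif.drop('const'))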
Let's see how these features relate to SalePrice across the overall data:
In [14]:
feature_variable = 'OverallQual'
target_variable = 'SalePrice'
train[[feature_variable, target_variable]].groupby([feature_variable], as_index=False).mean().sort_values(by=feature_variable, ascending=False)
Out[14]:
In [15]:
feature_variable = 'GarageCars'
target_variable = 'SalePrice'
train[[feature_variable, target_variable]].groupby([feature_variable], as_index=False).mean().sort_values(by=feature_variable, ascending=False)
Out[15]:
The multicollinear pairs of independent variables, as noted above, are:
TotRmsAbvGrd <> GrLivArea = 0.83
GarageYrBlt <> YearBuilt = 0.83
1stFlrSF <> TotalBsmtSF = 0.82
GarageArea <> GarageCars = 0.88
Let's draw scatter plots between SalePrice and some of the variables that are highly positively correlated with it. We take the following independent variables:
In [16]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'YearBuilt']
sns.pairplot(train[cols], size = 2.5)
Out[16]:
From the above scatter plots, we can see strong positive relationships between SalePrice and these variables.
Let's draw a box plot of OverallQual with respect to SalePrice.
In [17]:
# box plot overallqual/saleprice
plt.figure(figsize=[10,5])
sns.boxplot(x='OverallQual', y="SalePrice", data=train)
Out[17]:
Let's analyze the distribution of SalePrice across our train dataset.
Here, we do UNIVARIATE ANALYSIS, i.e. analysis that looks at only one variable at a time.
We analyze Skewness and Kurtosis of SalePrice.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
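To make these measures concrete: with the central moments $m_k = \frac{1}{n}\sum_i (x_i - \bar{x})^k$, skewness is $g_1 = m_3 / m_2^{3/2}$ and excess kurtosis is $g_2 = m_4 / m_2^{2} - 3$ (so a normal distribution has excess kurtosis 0). The small sketch below (not part of the original analysis) compares these moment-based values from scipy with the bias-corrected estimates that pandas' .skew() and .kurt() report; the numbers differ slightly but tell the same story.
# Sketch: moment-based skewness / excess kurtosis (scipy) vs pandas' bias-corrected estimates.
# Both kurtosis values are *excess* kurtosis, i.e. a normal distribution scores 0.
from scipy.stats import skew, kurtosis

prices = train['SalePrice']
print("scipy  -> skewness: %.4f, excess kurtosis: %.4f" % (skew(prices), kurtosis(prices)))
print("pandas -> skewness: %.4f, excess kurtosis: %.4f" % (prices.skew(), prices.kurt()))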
Graphical representation of data distribution for SalePrice:
In [18]:
train['SalePrice'].describe()
Out[18]:
In [19]:
# histogram to graphically show skewness and kurtosis
plt.figure(figsize=[15,5])
sns.distplot(train['SalePrice'])
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Number of Occurrences')
Out[19]:
In [20]:
# normal probability plot
plt.figure(figsize=[8,6])
stats.probplot(train['SalePrice'], plot=plt)
Out[20]:
In [21]:
# skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
From the above computation and also from the above histogram, we can say that SalePrice:
deviates from the normal distribution
has positive (right) skewness
shows peakedness (high kurtosis)
High kurtosis means that SalePrice has some outliers. We need to handle them so that they don't affect our prediction result.
In [22]:
plt.figure(figsize=[8,6])
plt.scatter(train["SalePrice"].values, range(train.shape[0]))
plt.title("Distribution of Sale Price")
plt.xlabel("Sale Price");
plt.ylabel("Number of Occurences")
Out[22]:
Let's cap the extreme outliers seen in the above figure at the 99.5th percentile.
In [23]:
# capping outliers at the 99.5th percentile
upperlimit = np.percentile(train.SalePrice.values, 99.5)
train.loc[train['SalePrice'] > upperlimit, 'SalePrice'] = upperlimit # cap values above the upper limit
In [24]:
# plotting the graph again after capping outliers
plt.figure(figsize=[8,6])
plt.scatter(train["SalePrice"].values, range(train.shape[0]))
plt.title("Distribution of Sale Price")
plt.xlabel("Sale Price");
plt.ylabel("Number of Occurences")
Out[24]:
Another way of reducing skewness is the log transformation, which makes the data distribution closer to normal. The logarithm function squeezes the larger values in the dataset and stretches out the smaller values.
Original value = $x$
New value after log-transformation = $\log_{10}(x)$ = $x'$
$x = 1$ then $\log_{10}(1) = 0$
$x = 10$ then $\log_{10}(10) = 1$
$x = 100$ then $\log_{10}(100) = 2$
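A tiny illustrative sketch (made-up data, not part of the original notebook) of how a log transform pulls in a right-skewed sample. It uses the natural log via np.log, which is what the cell below applies to SalePrice; the base only changes the scale, not the shape of the distribution.
# Sketch: log-transforming a synthetic right-skewed sample reduces its skewness.
rng = np.random.RandomState(0)
skewed_sample = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))  # fake right-skewed "prices"
print("skewness before log: %.3f" % skewed_sample.skew())
print("skewness after  log: %.3f" % np.log(skewed_sample).skew())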
Let's log transform our target variable SalePrice values:
In [25]:
# applying log transformation (np.log is the natural log; the base only changes the scale, not the shape)
train['SalePrice'] = np.log(train['SalePrice'])
After applying the log transformation, let's look at the histogram and normal probability plot again to see how this has affected the skewness and kurtosis, and how much closer to normal the distribution has become.
In [26]:
# histogram to graphically show skewness and kurtosis
plt.figure(figsize=[15,5])
sns.distplot(train['SalePrice'])
plt.title('Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Number of Occurrences')
# normal probability plot
plt.figure(figsize=[8,6])
stats.probplot(train['SalePrice'], plot=plt)
Out[26]:
Great! We can see that the log transformation has worked well: the distribution of SalePrice has changed from right-skewed to approximately normal.
In [27]:
# skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
In [28]:
sns.factorplot(x="PoolArea",y="SalePrice",data=train,hue="PoolQC",kind='bar')
plt.title("Pool Area , Pool quality and SalePrice ")
plt.ylabel("SalePrice")
plt.xlabel("Pool Area in sq feet");
Let's analyze the relationship between the number of fireplaces, fireplace quality, and SalePrice.
Note: SalePrice is no longer displayed in dollar values because it was log-transformed above.
The figure below shows that having two fireplaces is associated with a higher sale price, and an excellent-quality fireplace raises it significantly.
In [29]:
sns.factorplot("Fireplaces","SalePrice",data=train,hue="FireplaceQu");
In [30]:
pd.crosstab(train.Fireplaces, train.FireplaceQu)
Out[30]:
In [31]:
# scatter plot grlivarea/saleprice
plt.figure(figsize=[8,6])
plt.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.xlabel('GrLivArea', fontsize=13)
plt.ylabel('SalePrice', fontsize=13)
Out[31]:
Note the bottom right of the above plot: two houses with very large GrLivArea have a low SalePrice. These points are outliers for GrLivArea.
Let's remove these outliers.
In [32]:
# Deleting outliers (note: SalePrice is already log-transformed, so the < 300000
# condition is always true and every house with GrLivArea > 4000 gets dropped)
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
In [33]:
# Plot the graph again
# scatter plot grlivarea/saleprice
plt.figure(figsize=[8,6])
plt.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.xlabel('GrLivArea', fontsize=13)
plt.ylabel('SalePrice', fontsize=13)
Out[33]:
We have removed the extreme outliers from the GrLivArea variable. Outliers can be present in other variables as well, but removing outliers from every variable may adversely affect our model because there can be outliers in the test dataset too. The better solution is to make the model more robust to outliers.
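One concrete sense of "more robust": the modelling section below wraps the linear models in RobustScaler and uses Huber loss for gradient boosting. RobustScaler centres each feature on its median and scales by the interquartile range, so a handful of extreme values barely move the scaled features. A minimal sketch:
# Sketch: RobustScaler uses the median and IQR, so extreme values have little influence on the scaling.
from sklearn.preprocessing import RobustScaler

scaled = RobustScaler().fit_transform(train[['GrLivArea', 'TotalBsmtSF']])
print(scaled[:5])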
In [34]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
all_data.shape
Out[34]:
List the variables that have missing data, along with the total number of missing rows and the percentage of missing values.
In [35]:
null_columns = all_data.columns[all_data.isnull().any()]
total_null_columns = all_data[null_columns].isnull().sum()
percent_null_columns = all_data[null_columns].isnull().sum() / len(all_data) * 100  # percentage of missing rows
missing_data = pd.concat([total_null_columns, percent_null_columns], axis=1, keys=['Total', 'Percent']).sort_values(by=['Percent'], ascending=False)
#missing_data.head()
missing_data
Out[35]:
In [36]:
plt.figure(figsize=[20,5])
plt.xticks(rotation=90, fontsize=14)
sns.barplot(x=missing_data.index, y=missing_data.Percent)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Out[36]:
In [37]:
# get unique values of the column data
all_data['PoolQC'].unique()
Out[37]:
In [38]:
# replace null values with 'None'
all_data['PoolQC'].fillna('None', inplace=True)
In [39]:
# get unique values of the column data
all_data['PoolQC'].unique()
Out[39]:
In [40]:
# get unique values of the column data
all_data['MiscFeature'].unique()
Out[40]:
In [41]:
# replace null values with 'None'
all_data['MiscFeature'].fillna('None', inplace=True)
In [42]:
# get unique values of the column data
all_data['Alley'].unique()
Out[42]:
In [43]:
# replace null values with 'None'
all_data['Alley'].fillna('None', inplace=True)
In [44]:
# get unique values of the column data
all_data['Fence'].unique()
Out[44]:
In [45]:
# replace null values with 'None'
all_data['Fence'].fillna('None', inplace=True)
In [46]:
# get unique values of the column data
all_data['FireplaceQu'].unique()
Out[46]:
In [47]:
# replace null values with 'None'
all_data['FireplaceQu'].fillna('None', inplace=True)
LotFrontage: Linear feet of street connected to the property.
16.67% of LotFrontage values are missing. We can assume that a property's LotFrontage is similar to that of the other properties in its Neighborhood.
So, we fill the missing values with the median LotFrontage of the property's Neighborhood.
In [48]:
# barplot of median of LotFrontage with respect to Neighborhood
sns.barplot(data=train,x='Neighborhood',y='LotFrontage', estimator=np.median)
plt.xticks(rotation=90)
Out[48]:
In [49]:
# get unique values of the column data
all_data['LotFrontage'].unique()
Out[49]:
In [50]:
# replace null values with the median LotFrontage of the corresponding Neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
In [51]:
all_data['LotFrontage'].unique()
Out[51]:
In [52]:
# get unique values of the column data
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    print (all_data[col].unique())
In [53]:
# replace null values with 'None'
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col].fillna('None', inplace=True)
In [54]:
# get unique values of the column data
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    print (all_data[col].unique())
In [55]:
# replace null values with 0
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col].fillna(0, inplace=True)
In [56]:
# get unique values of the column data
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    print (all_data[col].unique())
In [57]:
# replace null values with 'None'
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col].fillna('None', inplace=True)
In [58]:
# replace null values with 0
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col].fillna(0, inplace=True)
In [59]:
all_data["MasVnrType"].fillna("None", inplace=True)
all_data["MasVnrArea"].fillna(0, inplace=True)
In [60]:
for col in ('MSZoning', 'Utilities', 'Functional', 'Exterior2nd', 'Exterior1st', 'KitchenQual', 'Electrical', 'SaleType'):
    all_data[col].fillna(all_data[col].mode()[0], inplace=True)
In [61]:
null_columns = all_data.columns[all_data.isnull().any()]
print (null_columns)
Earlier in this notebook, we reduced the skewness of our target variable SalePrice using a log transformation. We will now apply the same to the other numeric features (the independent variables) that have high skewness.
Let's check the skewness of the numeric features:
In [62]:
numeric_features = all_data.dtypes[all_data.dtypes != 'object'].index
#print (numeric_features)
skewness = []
for col in numeric_features:
    skewness.append( (col, all_data[col].skew()) )
pd.DataFrame(skewness, columns=('Feature', 'Skewness')).sort_values(by='Skewness', ascending=False)
Out[62]:
In [63]:
all_data.head()
Out[63]:
In [64]:
highly_skewed_features = all_data[numeric_features].columns[abs(all_data[numeric_features].skew()) > 1]
#print (highly_skewed_features)
# applying log transformation to features with absolute skewness above 1
for col in highly_skewed_features:
    # masked array skips values <= 0, since log(0) is undefined
    all_data[col] = np.log(np.ma.array(all_data[col], mask=(all_data[col]<=0)))
In [65]:
all_data.head()
Out[65]:
In [66]:
%%HTML
<style>
table {margin-left: 0 !important;}
</style>
Dummy variables are used to convert categorical/nominal features into quantitative ones. A new column is created for each unique category of a nominal/categorical column, and the values in the newly created columns are either 1 or 0.
Let's take an example of a column named "Sex" which has two values, "male" and "female". If we create dummy variables for this column, two new columns named "male" and "female" are added. For any row, if the "Sex" value is 'male' then the "male" column gets 1 and the "female" column gets 0. Similarly, if the "Sex" value is 'female' then the "male" column gets 0 and the "female" column gets 1. (A short pd.get_dummies sketch follows the tables below.)
BEFORE
Row | Sex |
---|---|
1 | male |
2 | female |
3 | female |
4 | male |
AFTER CREATING DUMMY VARIABLES
Row | Sex | male | female |
---|---|---|---|
1 | male | 1 | 0 |
2 | female | 0 | 1 |
3 | female | 0 | 1 |
4 | male | 1 | 0 |
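Here is that toy Sex example as a quick sketch with pd.get_dummies (illustrative only; note that pandas prefixes the new columns with the original column name, giving Sex_male and Sex_female rather than plain male and female):
# Sketch: dummy-encoding the toy "Sex" column described above.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})
print(pd.get_dummies(toy))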
We will now create dummy variables for all our categorical/nominal features.
In [67]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)
In [68]:
train = all_data[:ntrain]
test = all_data[ntrain:]
In [69]:
train.head()
Out[69]:
In [70]:
test.head()
Out[70]:
Here, we create different regression models and evaluate the Root Mean Square Error (RMSE) of predictions done by those models. The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed.
Note:
Scikit-learn's cross-validation utilities expect a utility function (greater is better) rather than a cost function (lower is better).
Mean Squared Error (MSE) is a cost function: it is non-negative and lower values mean a better model. To fit scikit-learn's convention, the scorer returns the negated MSE, so a score of -0.2 indicates a better model than a score of -0.9.
To get this behaviour, we pass the "scoring" parameter to the "cross_val_score" function like this:
cv_score = cross_val_score(lasso, train.drop(['Id'], axis=1), y_train, scoring="neg_mean_squared_error", cv=5)
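A small helper (a sketch, not one of the original cells) that bundles the sign flip and square root described above; it assumes the X_train and y_train arrays defined in the cells below. Each model cell below does the same thing inline.
# Sketch: turn scikit-learn's negated MSE scores into RMSE values.
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y, cv=5):
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-neg_mse)  # flip the sign back and take the square root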
We will be testing the following regression models for this house price problem: Lasso, Elastic Net, Kernel Ridge, Gradient Boosting, XGBoost and LightGBM.
Let's first import the model libraries.
In [90]:
# importing model libraries
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
In [78]:
X_train = train.drop(['Id'], axis=1)
# y_train has been defined above where we combined train and test data to create all_data
X_test = test.drop(['Id'], axis=1)
In [91]:
#lasso = Lasso(alpha =0.0005, random_state=1)
#lasso = Lasso()
model_lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005))
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_lasso, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
In [92]:
model_elastic_net = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005))
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_elastic_net, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
In [94]:
model_kernel_ridge = KernelRidge(alpha=0.6)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_kernel_ridge, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
In [95]:
model_gboost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state=5)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_gboost, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
In [96]:
model_xgboost = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=True, nthread = -1)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_xgboost, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision trees, used for ranking, classification and many other machine learning tasks.
Because it is based on decision trees, it grows each tree leaf-wise, splitting the leaf with the best fit, whereas most other boosting implementations grow trees depth-wise (level by level). For the same number of leaves, leaf-wise growth can reduce the loss more and often gives better accuracy. It is also very fast, hence the name 'Light'.
In [97]:
model_lgbm = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
# y_train is defined above where we combined train and test data to create all_data
# np.sqrt() function is used to create square root of MSE returned by cross_val_score function
cv_score = np.sqrt( -cross_val_score(model_lgbm, X_train, y_train, scoring="neg_mean_squared_error", cv=5) )
print (cv_score)
print ("SCORE (mean: %f , std: %f)" % (np.mean(cv_score), np.std(cv_score)))
We have already done cross-validation above, but cross-validation fits the model on different subsets of the dataset and then averages the scores. It is common practice to fit the model on the full training dataset once it has shown a sufficient cross-validation score.
Hence, here we train our models with the fit method, i.e. we fit them on the predictors (X_train) and the outcome (y_train) so that they can be used to predict unseen data.
In [104]:
model_lasso.fit(X_train, y_train)
model_elastic_net.fit(X_train, y_train)
model_kernel_ridge.fit(X_train, y_train)
model_gboost.fit(X_train, y_train)
model_xgboost.fit(X_train, y_train)
model_lgbm.fit(X_train, y_train)
Out[104]:
Above, we have trained our models on the training dataset. Here, we use those trained models to generate predictions on the training data itself, and then calculate the Root Mean Square Error (RMSE) of those predictions.
This shows how accurately each model predicts data it has already seen. The result below shows that the Gradient Boosting model has the most accurate predictions on the already-seen training data.
In [122]:
dict_models = {'lasso':model_lasso, 'elastic_net':model_elastic_net, 'kernel_ridge':model_kernel_ridge,
'gboost':model_gboost, 'xgboost':model_xgboost, 'lgbm':model_lgbm}
for key, value in dict_models.items():
    pred_train = value.predict(X_train)
    rmse = np.sqrt(mean_squared_error(y_train, pred_train))
    print ("%s: %f" % (key, rmse))
In [128]:
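# SalePrice was transformed with np.log, so np.exp is its exact inverse;
# np.expm1 (= exp(x) - 1) differs from it by only 1, which is negligible at this price scale.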
prediction_lasso = np.expm1(model_lasso.predict(X_test))
prediction_elastic_net = np.expm1(model_elastic_net.predict(X_test))
prediction_kernel_ridge = np.expm1(model_kernel_ridge.predict(X_test))
prediction_gboost = np.expm1(model_gboost.predict(X_test))
prediction_xgboost = np.expm1(model_xgboost.predict(X_test))
prediction_lgbm = np.expm1(model_lgbm.predict(X_test))
We can try different prediction combinations before generating the Kaggle submission file: a single model's predictions, or the average of two or more models' predictions.
I got the best Kaggle score by averaging the predictions of the Lasso and Elastic Net models.
In [131]:
# kaggle score: 0.12346
#prediction = prediction_gboost
# kaggle score: 0.12053
#prediction = (prediction_lasso + prediction_xgboost) / float(2)
# kaggle score: 0.11960
#prediction = prediction_lasso
# kaggle score: 0.11937
prediction = (prediction_lasso + prediction_elastic_net) / float(2)
#print prediction
In [132]:
submission = pd.DataFrame({
"Id": test["Id"],
"SalePrice": prediction
})
#submission.to_csv('submission.csv', index=False)