In [4]:
import pandas as pd
housing = pd.read_csv(r"E:\GIS_Data\file_formats\CSV\housing.csv")
housing.head()
Out[4]:
In [5]:
housing.info()
In [6]:
# count occurrences of each unique value in the ocean_proximity column
housing.ocean_proximity.value_counts()
Out[6]:
In [7]:
#describe all numerical columns - basic stats
housing.describe()
Out[7]:
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
Out[9]:
In [10]:
from sklearn.model_selection import train_test_split
In [11]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(train_set.shape)
print(test_set.shape)
Assuming median_income is an important predictor, we need to categorize it for stratified sampling. Categories should be built such that each stratum contains a sufficient number of data points, otherwise the estimate of a stratum's importance will be biased. This means we should not have too many strata (as we would if we stratified on the raw median income values), and each stratum should be relatively wide.
In [14]:
# scale the median income down by dividing it by 1.5, then cap categories above 5 at 5.0
import numpy as np
housing['income_cat'] = np.ceil(housing['median_income'] / 1.5) # round up to integers
# where() keeps values < 5 as-is and replaces values >= 5 with 5.0
housing['income_cat'] = housing['income_cat'].where(housing['income_cat'] < 5, 5.0)
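An equivalent way to build these five categories, sketched here with pandas' pd.cut (not what this notebook ran, but the bin edges reproduce ceil(median_income / 1.5) capped at 5):
In [ ]:
# bins (0,1.5], (1.5,3], (3,4.5], (4.5,6], (6,inf) map to categories 1-5
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])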
In [15]:
housing['income_cat'].hist()
Out[15]:
In [16]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
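As a quick check (not part of the original run), the income-category proportions in the stratified test set should closely match those in the full dataset:
In [ ]:
# proportions per income category; stratification should make these nearly
# identical to housing['income_cat'].value_counts() / len(housing)
strat_test_set['income_cat'].value_counts() / len(strat_test_set)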
Now remove the income_cat field, which was used only for this stratified sampling. We will learn from the median_income data instead.
In [27]:
for _temp in (strat_test_set, strat_train_set):
    _temp.drop("income_cat", axis=1, inplace=True)
In [38]:
# Write the train and test data to disk
strat_test_set.to_csv('./housing_strat_test.csv')
strat_train_set.to_csv('./housing_strat_train.csv')
In [37]:
strat_train_set.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4, s=strat_train_set['population']/100,
                     label='population', figsize=(10,7), c=strat_train_set['median_house_value'],
                     cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()
Out[37]:
Do a pairwise plot to understand how the features correlate with each other.
In [50]:
import seaborn as sns
sns.pairplot(data=strat_train_set[['median_house_value','median_income','total_rooms','housing_median_age']])
Out[50]:
Focusing on the relationship between income and house value.
In [51]:
strat_train_set.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)
Out[51]:
In [52]:
housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_household'] = housing['total_bedrooms'] / housing['households']
housing['bedrooms_per_rooms'] = housing['total_bedrooms'] / housing['total_rooms']
In [54]:
corr_matrix = housing.corr(numeric_only=True) # exclude the ocean_proximity text column
corr_matrix['median_house_value'].sort_values(ascending=False)
Out[54]:
In [59]:
housing.plot(kind='scatter', x='bedrooms_per_household',y='median_house_value', alpha=0.5)
Out[59]:
In [60]:
#create a copy without the house value column
housing = strat_train_set.drop('median_house_value', axis=1)
#copy the house value column of the same train set into a new series, which will be the labels
housing_labels = strat_train_set['median_house_value'].copy()
In [61]:
from sklearn.impute import SimpleImputer # Imputer was removed from sklearn.preprocessing
housing_imputer = SimpleImputer(strategy='median')
In [62]:
#drop the text column so the imputer learns only from numeric data
housing_numeric = housing.drop('ocean_proximity', axis=1)
housing_imputer.fit(housing_numeric)
housing_imputer.statistics_
Out[62]:
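As a sanity check (a sketch, not in the original notebook), the imputer's learned statistics should equal the column medians computed directly:
In [ ]:
# each entry should match the corresponding value in housing_imputer.statistics_
housing_numeric.median().values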
In [64]:
_x = housing_imputer.transform(housing_numeric)
# keep the original index so ocean_proximity re-aligns correctly in the next step
housing_filled = pd.DataFrame(_x, columns=housing_numeric.columns, index=housing_numeric.index)
In [66]:
housing_filled['ocean_proximity'] = housing['ocean_proximity']
housing_filled.head()
Out[66]:
We can use LabelEncoder from scikit-learn to enumerate the ocean_proximity category. However, an ML algorithm should not treat higher values as more desirable, nor assume that values 1 and 2 are closer to each other than values 1 and 4. So we transform the categories into new boolean columns using OneHotEncoder. We can do both steps, enumeration and then booleans, in one shot using LabelBinarizer.
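A minimal sketch of that last option (note LabelBinarizer is primarily meant for target labels; newer scikit-learn versions can instead apply OneHotEncoder to the string column directly):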
In [ ]:
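from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
# fit_transform maps each category string to a one-hot row of a dense array
housing_cat_1hot = encoder.fit_transform(housing_filled['ocean_proximity'])
encoder.classes_ # the category corresponding to each one-hot column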