In the last mission, we explored how to use a simple k-nearest neighbors machine learning model that used just one feature, or attribute, of the listing to predict the rent price. We first relied on the accommodates column, which describes the number of people a living space can comfortably accommodate. Then, we switched to the bathrooms column and observed an improvement in accuracy. While these were good features for becoming familiar with the basics of machine learning, it's clear that using just a single feature to compare listings doesn't reflect the reality of the market. An apartment that can accommodate 4 guests in a popular part of Washington D.C. will rent for much more than one that can accommodate 4 guests in a crime-ridden area.
There are 2 ways we can tweak the model to try to improve the accuracy (decrease the RMSE during validation):

- increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors
- increase k, the number of nearby neighbors the model uses to make predictions
In this mission, we'll focus on increasing the number of attributes the model uses. When selecting more attributes to use in the model, we need to watch out for columns that don't work well with the distance equation. This includes columns containing:

- non-numerical values (e.g. a city or state name)
- missing values
- non-ordinal numerical values (e.g. latitude or longitude)
In the following code screen, we've read the dc_airbnb.csv dataset from the last mission into pandas and brought over the data cleaning changes we made. Let's first look at the first row's values to identify any columns containing non-numerical or non-ordinal values. In the next screen, we'll drop those columns and then look for missing values in each of the remaining columns.
In [8]:
import pandas as pd
import numpy as np
In [9]:
np.random.seed(1)
In [10]:
dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
stripped_commas = dc_listings['price'].str.replace(',', '')
stripped_dollars = stripped_commas.str.replace('$', '')
dc_listings['price'] = stripped_dollars.astype('float')
In [11]:
dc_listings.info()
The following columns contain non-numerical values:

- room_type
- city
- state

while these columns contain numerical but non-ordinal values:

- latitude
- longitude
- zipcode
Geographic values like these aren't ordinal, because a smaller numerical value doesn't correspond to a smaller quantity in any meaningful way. For example, the zip code 20009 isn't smaller or larger than the zip code 75023; both are simply unique identifiers. Latitude and longitude pairs describe a point on a geographic coordinate system, and different distance equations (e.g. the haversine formula) are used in those cases.
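For reference only (we won't use it in this mission), here's a minimal sketch of the haversine formula, which approximates the great-circle distance in kilometers between two latitude/longitude points; the function name and the sample coordinates are purely illustrative:

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Convert degrees to radians before applying the trigonometric identities.
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # 6371 km is an approximate radius of the Earth.
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Approximate distance between two points in Washington D.C. (illustrative coordinates).
print(haversine_km(38.9072, -77.0369, 38.8977, -77.0365))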
While we could convert the host_response_rate and host_acceptance_rate columns to be numerical (right now they're object data types and contain the % sign), these columns describe the host and not the living space itself. Since a host could have many living spaces and we don't have enough information to uniquely group living spaces to the hosts themselves, let's avoid using any columns that don't directly describe the living space or the listing itself:

- host_response_rate
- host_acceptance_rate
- host_listings_count

Let's remove these 9 columns from the Dataframe.
In [12]:
dc_listings.drop(labels=['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count'], axis=1, inplace=True)
In [13]:
dc_listings.info()
Of the remaining columns, 3 columns have a few missing values (less than 1% of the total number of rows):

- bedrooms
- bathrooms
- beds

Since the number of rows containing missing values for one of these 3 columns is low, we can select and remove those rows without losing much information. There are also 2 columns that have a large number of missing values:

- cleaning_fee
- security_deposit
and we can't handle these easily. We can't just remove the rows containing missing values for these 2 columns because we'd miss out on the majority of the observations in the dataset. Instead, let's remove these 2 columns entirely from consideration.
In [14]:
# Drop the 2 columns with a large number of missing values.
dc_listings.drop(labels=['cleaning_fee', 'security_deposit'], axis=1, inplace=True)
# Drop any rows with missing values in the bedrooms, bathrooms, or beds columns.
dc_listings.dropna(subset=['bedrooms', 'bathrooms', 'beds'], axis=0, how='any', inplace=True)
In [15]:
dc_listings.info()
Here's how the dc_listings Dataframe looks after all the changes we made:
| accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 1.0 | 1.0 | 125.0 | 1 | 4 | 149 |
| 2 | 1.0 | 1.5 | 1.0 | 85.0 | 1 | 30 | 49 |
| 1 | 1.0 | 0.5 | 1.0 | 50.0 | 1 | 1125 | 1 |
| 2 | 1.0 | 1.0 | 1.0 | 209.0 | 4 | 730 | 2 |
| 12 | 5.0 | 2.0 | 5.0 | 215.0 | 2 | 1825 | 34 |
You may have noticed that while the accommodates, bedrooms, bathrooms, beds, and minimum_nights columns hover between 0 and 12 (at least in the first few rows), the values in the maximum_nights and number_of_reviews columns span much larger ranges. For example, the maximum_nights column has values as low as 4 and as high as 1825 in the first few rows alone. If we use these 2 columns as part of a k-nearest neighbors model, these attributes could end up having an outsized effect on the distance calculations simply because their values are so much larger.

For example, 2 living spaces could be identical across every attribute but differ drastically on the maximum_nights column alone. If one listing had a maximum_nights value of 1825 and the other a maximum_nights value of 4, the way Euclidean distance is calculated would make these listings appear very far apart, because that single large difference dominates the overall distance. To prevent any single column from having too much of an impact on the distance, we can normalize all of the columns to have a mean of 0 and a standard deviation of 1.
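To make this concrete, here's a minimal sketch using made-up values for two listings that are identical except for maximum_nights:

import numpy as np

# Two hypothetical listings, identical except for maximum_nights (4 vs 1825).
# Columns: accommodates, bedrooms, bathrooms, beds, minimum_nights, maximum_nights, number_of_reviews
first_listing = np.array([2, 1.0, 1.0, 1.0, 1, 4, 10])
second_listing = np.array([2, 1.0, 1.0, 1.0, 1, 1825, 10])

distance = np.sqrt(np.sum((first_listing - second_listing) ** 2))
print(distance)  # 1821.0 -- the distance is driven entirely by the maximum_nights difference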
Normalizing the values in each column to the standard normal distribution (mean of 0, standard deviation of 1) preserves the shape of each column's distribution while aligning the scales. To normalize the values in a column to the standard normal distribution, you need to:

- subtract the mean of the column from each value
- divide each of the resulting differences by the standard deviation of the column
Here's the mathematical formula describing the transformation that needs to be applied for all values in a column:
$\displaystyle z= \frac{x − \mu}{\sigma}$
where x is a value in a specific column, $\mu$ is the mean of all the values in the column, and $\sigma$ is the standard deviation of all the values in the column. Here's what the corresponding code, using pandas, looks like:
# Subtract the mean of the column from each value.
first_transform = dc_listings['maximum_nights'] - dc_listings['maximum_nights'].mean()
# Divide each value in the column by the standard deviation.
normalized_col = first_transform / dc_listings['maximum_nights'].std()
To apply this transformation across all of the columns in a Dataframe, you can use the corresponding Dataframe methods mean() and std():
normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())
These methods operate column-wise: when you call mean() or std() on a Dataframe, pandas computes the mean and standard deviation of each column and uses the appropriate one for every value in that column. Let's now normalize all of the feature columns in dc_listings.
In [16]:
normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())
normalized_listings['price'] = dc_listings['price']
In [17]:
normalized_listings.head()
Out[17]:
In the last mission, we trained 2 univariate k-nearest neighbors models. The first one used the accommodates attribute while the second one used the bathrooms attribute. Let's now train a model that uses both attributes when determining how similar 2 living spaces are. Let's refer to the Euclidean distance equation again to see what the distance calculation using 2 attributes would look like:
$\displaystyle d = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \ldots + (q_n - p_n)^2}$
Since we're using 2 attributes, the distance calculation would look like:
$\displaystyle d = \sqrt{(\text{accommodates}_1 - \text{accommodates}_2)^2 + (\text{bathrooms}_1 - \text{bathrooms}_2)^2}$
To find the distance between 2 living spaces, we need to calculate the squared difference between both accommodates values, the squared difference between both bathrooms values, add them together, and then take the square root of the resulting sum. Here's what the Euclidean distance between the first 2 rows in normalized_listings looks like:
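As a quick sketch (assuming the normalized_listings Dataframe from the cells above), the manual calculation for the normalized accommodates and bathrooms columns looks roughly like this:

import numpy as np

# Manual Euclidean distance between the first two rows of normalized_listings,
# using only the normalized accommodates and bathrooms columns.
first_row = normalized_listings.iloc[0]
second_row = normalized_listings.iloc[1]

manual_distance = np.sqrt(
    (first_row['accommodates'] - second_row['accommodates']) ** 2
    + (first_row['bathrooms'] - second_row['bathrooms']) ** 2
)
print(manual_distance)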
So far, we've been calculating Euclidean distance by writing out the logic for the equation ourselves. We can instead use the distance.euclidean() function from scipy.spatial, which takes in 2 vectors as the parameters and calculates the Euclidean distance between them. The euclidean() function expects:

- both of the vectors to be represented using a list-like object (a Python list, NumPy array, or pandas Series)
- both of the vectors to be 1-dimensional and to have the same number of elements
Here's a simple example:
from scipy.spatial import distance
first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)
Let's use the euclidean() function to calculate the Euclidean distance between 2 rows in our dataset to practice.
In [18]:
from scipy.spatial import distance
features = ['accommodates', 'bathrooms']
first = normalized_listings[features].iloc[0]
fifth = normalized_listings[features].iloc[4]
first_fifth_distance = distance.euclidean(first, fifth)
print(first_fifth_distance)
So far, we've been writing functions from scratch to train the k-nearest neighbors models. While this is helpful deliberate practice for understanding how the mechanics work, you can be more productive and iterate more quickly by using a library that handles most of the implementation. In this screen, we'll learn about the scikit-learn library, which is the most popular machine learning library in Python. Scikit-learn contains functions for all of the major machine learning algorithms as well as a simple, unified workflow. Both of these properties allow data scientists to be incredibly productive when training and testing different models on a new dataset.
The scikit-learn workflow consists of 4 main steps:

- instantiate the specific machine learning model you want to use
- fit the model to the training data
- use the model to make predictions
- evaluate the accuracy of the predictions
We'll focus on the first 3 steps in this screen and the next screen. Each model in scikit-learn is implemented as a separate class and the first step is to identify the class we want to create an instance of. In our case, we want to use the KNeighborsRegressor class. Any model that helps us predict numerical values, like listing price in our case, is known as a regression model. The other main class of machine learning models is called classification, where we're trying to predict a label from a fixed set of labels (e.g. blood type or gender). The word regressor from the class name KNeighborsRegressor refers to the regression model class that we just discussed.
Scikit-learn uses a similar object-oriented style to Matplotlib and you need to instantiate an empty model first by calling the constructor:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
If you refer to the documentation, you'll notice that by default:

- n_neighbors, the number of neighbors, is set to 5
- algorithm, the method used for computing nearest neighbors, is set to auto
- p, the power parameter for the Minkowski metric, is set to 2, which corresponds to Euclidean distance
Let's set the algorithm parameter to brute and leave the n_neighbors value as 5, which matches the implementation we wrote in the last mission. If we leave the algorithm parameter set to the default value of auto, scikit-learn will try to use tree-based optimizations to improve performance (which are outside of the scope of this mission):
knn = KNeighborsRegressor(algorithm='brute')
Now, we can fit the model to the data using the fit method. For all models, the fit method takes in 2 required parameters:

- a matrix-like object, containing the feature columns we want to use from the training set
- a list-like object, containing the correct target values

Matrix-like object means that the method is flexible in the input it accepts: either a Dataframe or a NumPy 2D array of values works. This means you can select the columns you want to use from the Dataframe and use that as the first parameter to the fit method.

If you recall from earlier in the mission, all of the following are acceptable list-like objects:

- a NumPy array
- a Python list
- a pandas Series object (e.g. a single column selected from a Dataframe)
You can select the target column from the Dataframe and use that as the second parameter to the fit method:
# Split the full dataset into train and test sets.
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
# Matrix-like object, containing just the 2 columns of interest from the training set.
train_features = train_df[['accommodates', 'bathrooms']]
# List-like object, containing just the target column, `price`, from the training set.
train_target = train_df['price']
# Pass everything into the fit method.
knn.fit(train_features, train_target)
When the fit method is called, scikit-learn stores the training data we specified within the KNeighborsRegressor instance (knn). If you try passing in data containing missing values or non-numerical values into the fit method, scikit-learn will return an error. Scikit-learn contains many such features that help prevent us from making common mistakes.
Now that we've specified the training data we want used to make predictions, we can use the predict method to make predictions on the test set. The predict method has only one required parameter:

- a matrix-like object, containing the feature columns from the dataset we want to make predictions on

The number of feature columns you use during both training and testing needs to match, or scikit-learn will return an error:
predictions = knn.predict(test_df[['accommodates', 'bathrooms']])
The predict() method returns a NumPy array containing the predicted price values for the test set. You now have everything you need to practice the entire scikit-learn workflow.
In [19]:
from sklearn.neighbors import KNeighborsRegressor
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]
In [30]:
knn = KNeighborsRegressor(algorithm='brute')
train_features = train_df[features]
train_target = train_df['price']
knn.fit(train_features, train_target)
predictions = knn.predict(test_df[features])
predictions[:5]
Out[30]:
In the last mission, we calculated the MSE and RMSE values using pandas arithmetic operators to compare each predicted value with the actual value from the price column of our test set. Alternatively, we can use the sklearn.metrics.mean_squared_error() function. Once you become familiar with the different machine learning concepts, unifying your workflow using scikit-learn helps save you a lot of time and avoid mistakes.

The mean_squared_error() function takes in 2 inputs:

- a list-like object representing the true values (y_true), which in our case is the price column from test_df
- a second list-like object representing the predicted values (y_pred), which in our case is the array returned by predict()

For this function, we won't show any sample code; instead, we'll leave it to you to read the documentation and use the function to calculate the MSE and RMSE values for the predictions we just made.
In [32]:
from sklearn.metrics import mean_squared_error
features_train_columns = ['accommodates', 'bathrooms']
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute', metric='euclidean')
knn.fit(train_df[features_train_columns], train_df['price'])
predictions = knn.predict(test_df[features_train_columns])
two_features_mse = mean_squared_error(y_true = test_df['price'], y_pred = predictions)
two_features_rmse = np.sqrt(two_features_mse)
print('MSE: %.2f' % two_features_mse)
print('RMSE: %.2f' % two_features_rmse)
Here's a table comparing the MSE and RMSE values for the 2 univariate models from the last mission and the multivariate model we just trained:
| feature(s) | MSE | RMSE |
|---|---|---|
| accommodates | 18646.5 | 136.6 |
| bathrooms | 17333.4 | 131.7 |
| accommodates, bathrooms | 15660.4 | 125.1 |
As you can tell, the model we trained using both features ended up performing better (lower error score) than either of the univariate models from the last mission. Let's now train a model using the following 4 features:

- accommodates
- bedrooms
- bathrooms
- number_of_reviews
Scikit-learn makes it incredibly easy to swap the columns used during training and testing. We're going to leave this for you as a challenge to train and test a k-nearest neighbors model using these columns instead. Use the code you wrote in the last screen as a guide.
In [40]:
features = ['bedrooms', 'accommodates', 'bathrooms', 'number_of_reviews']
k=5
knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute', metric='euclidean')
knn.fit(train_df[features], train_df['price'])
four_predictions = knn.predict(test_df[features])
four_features_mse = mean_squared_error(y_true = test_df['price'], y_pred = four_predictions)
four_features_rmse = np.sqrt(four_features_mse)
print('MSE: %.2f' % four_features_mse)
print('RMSE: %.2f' % four_features_rmse)
So far so good! As we increased the features the model used, we observed lower MSE and RMSE values:
| feature(s) | MSE | RMSE |
|---|---|---|
| accommodates | 18646.5 | 136.6 |
| bathrooms | 17333.4 | 131.7 |
| accommodates, bathrooms | 15660.4 | 125.1 |
| accommodates, bathrooms, bedrooms, number_of_reviews | 13320.2 | 115.4 |
Let's take this to the extreme and use all of the potential features. We should expect the error scores to decrease since so far adding more features has helped do so.
In [41]:
features = train_df.columns.tolist()
features.remove('price')
k=5
knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute', metric='euclidean')
knn.fit(train_df[features], train_df['price'])
all_predictions = knn.predict(test_df[features])
all_features_mse = mean_squared_error(y_true = test_df['price'], y_pred = all_predictions)
all_features_rmse = np.sqrt(all_features_mse)
print('MSE: %.2f' % all_features_mse)
print('RMSE: %.2f' % all_features_rmse)
Interestingly enough, the RMSE value actually increased to 125.1 when we used all of the features available to us. This means that selecting the right features is important and that using more features doesn't automatically improve prediction accuracy. We should re-phrase the lever we mentioned earlier from:

"increase the number of attributes the model uses to calculate similarity when ranking the closest neighbors"

to:

"select the relevant attributes the model uses to calculate similarity when ranking the closest neighbors"
The process of selecting features to use in a model is known as feature selection.
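As a rough illustration of feature selection (a sketch, not part of the original exercise, assuming the train_df and test_df Dataframes from the earlier cells are available), you could loop over a few candidate feature sets and compare their validation RMSE values:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# A few candidate feature sets to compare; these particular combinations are just examples.
candidate_feature_sets = [
    ['accommodates'],
    ['accommodates', 'bathrooms'],
    ['accommodates', 'bathrooms', 'bedrooms', 'number_of_reviews'],
]

for feature_set in candidate_feature_sets:
    knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
    knn.fit(train_df[feature_set], train_df['price'])
    predictions = knn.predict(test_df[feature_set])
    rmse = np.sqrt(mean_squared_error(test_df['price'], predictions))
    print(feature_set, round(rmse, 1))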
In this mission, we prepared the data so we could use more features, trained a few models using multiple features, and evaluated the different performance tradeoffs. We explored how using more features doesn't always improve the accuracy of a k-nearest neighbors model. In the next mission, we'll explore another knob for tuning k-nearest neighbors models: the k value.