In this notebook we show some code snippets. These examples are by no means exhaustive; they are only some common code templates for machine learning tasks.
Almost all ML algorithms presented here are implemented by the scikit-learn library.
Author: Flávio Clésio (flavio.clésio@movile.com)
The most used library for manipulating data in Python is Pandas. We use Pandas to read the data and to do some basic filtering and data transformation. Along with Pandas, we use NumPy, a package for scientific computing focused on arrays and matrices, and Matplotlib (through the Seaborn wrapper) to plot some useful graphs and get some insights about the data.
In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np # Linear algebra
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #Data Visualization
import matplotlib.pyplot as plt #Data visualization
from sklearn.model_selection import train_test_split
from sklearn import linear_model
%matplotlib inline
In [2]:
# Path where the files are stored
PATH = '/Volumes/PANZER/Github/ml-lab/ICPC-UNICAMP-2018/Notebooks/' #TODO: Change it to your path
In [3]:
# Now we'll load the data (train and test) using Pandas' read_csv() method, parsing the dates as timestamps
train = pd.read_csv(PATH + "train.csv", parse_dates = ['timestamp'])
test = pd.read_csv(PATH + "test.csv", parse_dates=['timestamp'])
In [4]:
# As we need to transform all our data for feature engineering,
# let's join the train and test data so that the same transformations are applied to both
num_train = len(train)
num_test = len(test)
print('# Samples in Train Dataframe:', num_train)
print('# Samples in Test Dataframe:', num_test)
In [5]:
# Now we'll store the target Y (the variable that we need to predict) and the test ids, so that we can
# check our predictions later
# Store the dependent variable in an object called Y (we'll log-transform it later)
Y = train['price_doc'].values
# Store the id of test dataframe
id_test = test['id']
# Remove the ids (no predictive power) from both datasets and the price_doc variable from the train dataset
train.drop(['id','price_doc' ], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)
In [6]:
# We concatenate our dataframes into a single dataframe
df_all = pd.concat([train, test])
num_df_all = len(df_all)
print('# Samples in the combined Dataframe:', num_df_all)
print('Shape of our dataset (#Records, #Columns):', df_all.shape)
In [7]:
# To see the columns of our dataset, let's use the function below
list(df_all)
Out[7]:
In [8]:
# Now, let's take a look at the dataset
df_all.head()
Out[8]:
In [9]:
# To see some descriptive statistics, let's use the describe() function
df_all.describe()
Out[9]:
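describe() summarizes only the numeric columns by default. If you also want to see the column dtypes and non-null counts, Pandas' info() method is a handy complement; a small optional check:
# Optional: show dtypes and non-null counts for every column
df_all.info()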
We show two basic plots below. Plotting the data is a very important step in machine learning; you can get some really useful insights about the data.
You can find many more graph examples here: https://seaborn.pydata.org/examples/index.html
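As one extra illustration (not in the original notebook), a correlation heatmap is another plot that often reveals structure; a minimal sketch, assuming the columns full_sq, kitch_sq and metro_km_walk are present in df_all:
# Hypothetical extra plot: correlation heatmap of a few numeric columns
cols = ['full_sq', 'kitch_sq', 'metro_km_walk']  # column choice is purely illustrative
plt.figure(figsize=(6, 5))
sns.heatmap(df_all[cols].corr(), annot=True)  # annotate each cell with the correlation value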
References:
In [10]:
# For plotting we'll use the library called Seaborn
plt.figure(figsize=(10, 5)) # The size of the plot
sns.distplot(Y, kde = False) # Distribution plot of the target Y (the price_doc values)
Out[10]:
In [11]:
# As we can see there are a lot of outliers; to smooth these values, let's apply the np.log() function and look at the distribution again
plt.figure(figsize=(10, 5)) # The size of the plot
sns.distplot(np.log(Y), kde = False) # Distribution plot of the log-transformed target
plt.xlabel('np.log(price_doc)', fontsize=12)
Out[11]:
In [12]:
# One thing that can help the convergence of ML algorithms is removing outliers or smoothing the data. To do that,
# let's convert our Y values to the log scale
Y = np.log(Y)
In [13]:
# Let's put another variable into the dataset: the proportion of kitchen area to full area, called kitch_proportions
df_all["kitch_proportions"] = df_all["kitch_sq"]/df_all["full_sq"]
In [14]:
# When we deal with machine learning algorithms, null values (NaN) are trouble. You can either
# 1) remove them from the dataset, or 2) replace their values.
# It's usually better to replace these values with the mean, the median or some sentinel value (e.g. -99999) to avoid
# losing predictive power. Let's check which columns contain nulls.
for col in df_all.columns.values:  # For each column
    if len(df_all[df_all[col].isnull()][col]) > 0:  # If the column contains null values
        print("{0}: {1}".format(col, len(df_all[df_all[col].isnull()][col])))  # print the column name and the number of nulls
In [15]:
# To delete a column, just use the del statement
del df_all['timestamp']
del df_all['life_sq']
del df_all['floor']
del df_all['max_floor']
del df_all['material']
del df_all['build_year']
del df_all['num_room']
del df_all['kitch_sq']
del df_all['state']
del df_all['preschool_quota']
del df_all['school_quota']
del df_all['hospital_beds_raion']
del df_all['raion_build_count_with_material_info']
del df_all['build_count_block']
del df_all['build_count_wood']
del df_all['build_count_frame']
del df_all['build_count_brick']
del df_all['build_count_monolith']
del df_all['build_count_panel']
del df_all['build_count_foam']
del df_all['build_count_slag']
del df_all['build_count_mix']
del df_all['raion_build_count_with_builddate_info']
del df_all['build_count_before_1920']
del df_all['build_count_1921-1945']
del df_all['build_count_1946-1970']
del df_all['build_count_1971-1995']
del df_all['build_count_after_1995']
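The same cleanup can be written more compactly with DataFrame.drop; a minimal equivalent sketch (the column list is shortened for illustration, and errors='ignore' keeps it safe to re-run):
# Alternative to the del statements above: drop several columns in one call
df_all = df_all.drop(['timestamp', 'life_sq', 'floor'], axis=1, errors='ignore')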
In [16]:
# To fill NaN values with a specific number, use the .fillna() method
df_all['cafe_sum_500_min_price_avg'].fillna(-99, inplace=True)
df_all['cafe_sum_500_max_price_avg'].fillna(-99, inplace=True)
df_all['cafe_avg_price_500'].fillna(-99, inplace=True)
df_all['cafe_sum_1000_min_price_avg'].fillna(-99, inplace=True)
df_all['cafe_sum_1000_max_price_avg'].fillna(-99, inplace=True)
df_all['metro_min_walk'].fillna(-99, inplace=True)
df_all['metro_km_walk'].fillna(-99, inplace=True)
df_all['railroad_station_walk_km'].fillna(-99, inplace=True)
df_all['railroad_station_walk_min'].fillna(-99, inplace=True)
df_all['ID_railroad_station_walk'].fillna(-99, inplace=True)
df_all['prom_part_5000'].fillna(-99, inplace=True)
df_all['cafe_sum_5000_min_price_avg'].fillna(-99, inplace=True)
df_all['cafe_sum_5000_max_price_avg'].fillna(-99, inplace=True)
df_all['cafe_avg_price_5000'].fillna(-99, inplace=True)
df_all['product_type'].fillna(-99, inplace=True)
df_all['green_part_2000'].fillna(-99, inplace=True)
df_all['kitch_proportions'].fillna(-99, inplace=True)
In [17]:
# To fill with the mean, use .mean()
df_all['cafe_avg_price_1000'].fillna(train['cafe_avg_price_1000'].mean(), inplace=True)
df_all['cafe_sum_1500_min_price_avg'].fillna(train['cafe_sum_1500_min_price_avg'].mean(), inplace=True)
df_all['cafe_sum_1500_max_price_avg'].fillna(train['cafe_sum_1500_max_price_avg'].mean(), inplace=True)
df_all['cafe_avg_price_1500'].fillna(train['cafe_avg_price_1500'].mean(), inplace=True)
df_all['cafe_sum_2000_min_price_avg'].fillna(train['cafe_sum_2000_min_price_avg'].mean(), inplace=True)
In [18]:
# To fill with the median, use .median()
df_all['cafe_sum_2000_max_price_avg'].fillna(train['cafe_sum_2000_max_price_avg'].median(), inplace=True)
df_all['cafe_avg_price_2000'].fillna(train['cafe_avg_price_2000'].median(), inplace=True)
df_all['cafe_sum_3000_min_price_avg'].fillna(train['cafe_sum_3000_min_price_avg'].median(), inplace=True)
df_all['cafe_sum_3000_max_price_avg'].fillna(train['cafe_sum_3000_max_price_avg'].median(), inplace=True)
df_all['cafe_avg_price_3000'].fillna(train['cafe_avg_price_3000'].median(), inplace=True)
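Listing columns one by one quickly gets verbose. If you are happy with a single strategy, all remaining numeric NaNs can be filled in one pass; a minimal sketch using the column medians (here computed over df_all for brevity, whereas the cells above use statistics from the train dataframe):
# Fill any remaining NaNs in the numeric columns with each column's median (alternative to the per-column calls above)
num_cols = df_all.select_dtypes(exclude=['object']).columns
df_all[num_cols] = df_all[num_cols].fillna(df_all[num_cols].median())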
In [19]:
# Another big problem for ML algorithms is the representation of categorical variables,
# because most of these algorithms deal only with numeric representations.
# Deal with categorical values
df_numeric = df_all.select_dtypes(exclude=['object']) # Select columns with numerical variables
df_obj = df_all.select_dtypes(include=['object']).copy() # Select columns with non numerical variables
In [20]:
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]  # Replace each categorical column with integer codes
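Factorizing maps each category to an arbitrary integer code, which implicitly imposes an ordering on the categories. An alternative worth knowing (not used in this notebook) is one-hot encoding via pd.get_dummies; a minimal sketch, assuming it runs instead of the loop above, while df_obj still holds the raw string categories:
# Alternative to the factorize loop: one indicator (0/1) column per category level
df_obj_onehot = pd.get_dummies(df_obj)
df_all_onehot = pd.concat([df_numeric, df_obj_onehot], axis=1)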
In [21]:
df_obj.head()
Out[21]:
In [22]:
# Now we can join these two data frames using the concat() function
df_all = pd.concat([df_numeric, df_obj], axis=1)
In [23]:
# Create a validation set
#num_val = int(num_train * 0.2)
In [24]:
# After cleansing our data in Pandas, we need to transform it into a NumPy array,
# because most popular machine learning packages do their computation in this format.
# To do that, let's convert our dataframe
X_all = df_all.values
X = X_all[:num_train]  # the first num_train rows are the training samples
In [25]:
# In this step we'll split the training data into train and test sets, so we can check that our algorithm
# is actually learning and evaluate its quality via the RMSE
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # We'll use 20% of the data for test
print('X Train shape is', X_train.shape)
print('Y Train shape is', y_train.shape)
print('X Test shape is', X_test.shape)
print('Y Test shape is', y_test.shape)
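A related option (not used in this notebook) is k-fold cross-validation, which scores the model on several different splits instead of a single hold-out set; a minimal sketch with scikit-learn's cross_val_score, assuming the features have no remaining NaNs:
from sklearn.model_selection import cross_val_score
# 3-fold cross-validated MSE for a plain linear regression (scikit-learn returns negated MSE scores)
scores = cross_val_score(linear_model.LinearRegression(), X_train, y_train,
                         cv=3, scoring='neg_mean_squared_error')
print('Cross-validated RMSE per fold:', np.sqrt(-scores))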
Linear regression models fit the data with a linear equation. They are really useful in many cases and are, almost always, the first choice when tackling an ML problem.
There are some variants of the basic linear regression model. The two most common are Ridge regression and Lasso regression. The idea behind these two models is to regularize the model, shrinking (or, in the Lasso case, eliminating) the weights of useless features.
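To make the regularization idea concrete, here is a small self-contained sketch (on synthetic data, not the housing data) showing that Lasso drives the coefficients of irrelevant features to exactly zero, while Ridge only shrinks them:
# Synthetic example: only the first of five features actually matters
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 5)
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.randn(200)  # target depends on feature 0 only
ridge_demo = linear_model.Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = linear_model.Lasso(alpha=0.1).fit(X_demo, y_demo)
print('Ridge coefficients:', ridge_demo.coef_)  # noise features get small but non-zero weights
print('Lasso coefficients:', lasso_demo.coef_)  # noise features are driven to exactly 0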
References:
In [26]:
# Now we'll use a linear model to fit this data
# First we call the LinearRegression() constructor from the linear_model module and create the object lm
lm = linear_model.LinearRegression()
# Then we call the .fit() method, passing the X_train and y_train datasets, to fit the model
model = lm.fit(X_train, y_train)
# Next we build the predictions object by calling the predict() method with the X_test data as parameter
predictions = lm.predict(X_test)
# Finally we invert the log transformation using the np.exp() function
y_pred = np.exp(predictions)
In [27]:
# To assess the quality of the model, we'll use the RMSE as the main performance metric
from sklearn.metrics import mean_squared_error
from math import sqrt
# y_test is on the log scale while y_pred was transformed back to prices, so we compare both on the price scale
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))
print("rms error is: " + str(rmse))
In [28]:
# There's another way to do that: define the RMSE by hand
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
rmse_val = rmse(y_pred, np.exp(y_test))  # again comparing on the original price scale
print("rms error is: " + str(rmse_val))
In [29]:
# Now let's build a submission
# First we take the rows that belong to the test dataset: they are the last num_test rows of X_all
test_sub = X_all[num_train:]
print('Test Sub shape is', test_sub.shape)
In [30]:
# We'll make the predictions on the test dataset and store them in the y_pred object
predictions = lm.predict(test_sub)
y_pred = np.exp(predictions)
In [31]:
# We'll join the ids and the predictions and store them in an object called df_submission
df_submission = pd.DataFrame({'id': id_test, 'price_doc': y_pred.astype('int64')})
In [32]:
# Let's see what we got
df_submission.head(10)
Out[32]:
In [33]:
# To generate the file, we'll use the to_csv() function, with submission.csv as the name of the file
df_submission.to_csv('submission.csv', index=False)
In [34]:
# Check the submission format
! head -n15 submission.csv
In [35]:
from sklearn import linear_model
reg = linear_model.Ridge(alpha=0.5)  # Ridge regression (L2-regularized linear regression)
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)
y_pred = np.exp(predictions)
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))  # compare on the original price scale
print("rms error is: " + str(rmse))
In [36]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)  # Lasso regression (L1-regularized linear regression)
reg.fit(X_train, y_train)
predictions = reg.predict(X_test)
y_pred = np.exp(predictions)
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))  # compare on the original price scale
print("rms error is: " + str(rmse))
In [37]:
from sklearn import tree
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
y_pred = np.exp(predictions)
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))  # compare on the original price scale
print("rms error is: " + str(rmse))
In [39]:
import xgboost as xgb
d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)
params = {}
params['objective'] = 'reg:linear'  # squared-error regression objective
params['eta'] = 0.02                # learning rate
params['silent'] = 1                # suppress per-iteration messages
watchlist = [(d_train, 'train'), (d_valid, 'valid')]
clf = xgb.train(params, d_train, 50, watchlist, early_stopping_rounds=100)
predictions = clf.predict(xgb.DMatrix(X_test))
y_pred = np.exp(predictions)
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))  # compare on the original price scale
print("rms error is: " + str(rmse))
In [42]:
from sklearn.ensemble import GradientBoostingRegressor
alpha = 0.95
# Note: loss='quantile' with alpha=0.95 fits the conditional 95th percentile rather than the mean
clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                n_estimators=10, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
y_pred = np.exp(predictions)
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))  # compare on the original price scale
print("rms error is: " + str(rmse))
In [40]:
from sklearn.preprocessing import StandardScaler
# Neural networks are sensitive to feature scale, so we standardize the features (zero mean, unit variance)
scaler = StandardScaler()
scaler.fit(X_train)  # learn the mean and std on the training set only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
In [41]:
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(solver='lbfgs', alpha=1e-5,
                   hidden_layer_sizes=(7, ), random_state=1)  # a single hidden layer with 7 units
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
y_pred = np.exp(predictions)
rmse = sqrt(mean_squared_error(np.exp(y_test), y_pred))  # compare on the original price scale
print("rms error is: " + str(rmse))
In [43]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=0).fit(X_train)  # fit 10 clusters on the (scaled) training features
kmeans.predict(X_test)  # assign each test sample to its nearest cluster
Out[43]:
In [44]:
kmeans.labels_
Out[44]:
In [45]:
kmeans.cluster_centers_
Out[45]:
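A common question with K-Means is how to choose n_clusters. One simple heuristic (not shown above) is the elbow method: plot the inertia (within-cluster sum of squared distances) for several values of k and look for the point where the improvement flattens out; a minimal sketch:
# Elbow method: record the inertia for a range of cluster counts
inertias = []
for k in range(2, 12):
    inertias.append(KMeans(n_clusters=k, random_state=0).fit(X_train).inertia_)
plt.figure(figsize=(8, 4))
plt.plot(range(2, 12), inertias, marker='o')
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')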