Basic Examples On ML

In this notebook we show you some code snippets. These examples are not by any means exaustive, they are only some commons code templates for Machine Learning tasks.

Almost all ML algorithms presented here are implemented by SciKitLearn library.

Author: Flávio Clésio (flavio.clésio@movile.com)

Basic Input/Output and Data Manipulation

The most used library to manipulate data in Python is Pandas. We use Pandas to read the data and to do some basic filter/data transformation. Along with Pandas, we use Numpy, a package for scientific computation focused in matrices, and we use MatPlot (with the seaborn wrapper) lib to plot some useful graphs and give us some insights about the data.


In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np # Linear algebra
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns #Data Visualization
import matplotlib.pyplot as plt #Data visualization
from sklearn.model_selection import train_test_split
from sklearn import linear_model

%matplotlib inline

In [2]:
# Path where the files are stored
PATH = '/Volumes/PANZER/Github/ml-lab/ICPC-UNICAMP-2018/Notebooks/' #TODO: Change it to your path

In [3]:
# Now we'll load the data (train and test data) using the read_csv() method of Pandas (pd) and parsing the dates to timestamp
train = pd.read_csv(PATH + "train.csv", parse_dates = ['timestamp'])
test = pd.read_csv(PATH + "test.csv", parse_dates=['timestamp'])

In [4]:
# As we need to transform all our data for feature engineering, 
# let's join our data to perform the same transformations for all

num_train = len(train)
num_test = len(test)
print '# Samples in Train Dataframe:', num_train
print '# Samples in Test Dataframe:', num_test


# Samples in Train Dataframe: 30471
# Samples in Test Dataframe: 7662

In [5]:
# Now we'll store the information about the Y (the variable that we need to predict) and the id's to check
# if our prediction are correct

# Store the dependent variable to a object called y_price_doc_log_train and transform to log
Y = train['price_doc'].values

# Store the id of test dataframe
id_test = test['id']

# Remove the ids (no predictive power) in both datasets and the price_doc variable from X dataset
train.drop(['id','price_doc' ], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)

In [6]:
# We bind our dataframes in a single dataframe
df_all = pd.concat([train, test])

num_df_all = len(df_all)

print '# Samples in Test Dataframe:', num_df_all

print 'Shape of our dataset (#Records, #Columns)', (df_all.shape)


# Samples in Test Dataframe: 38133
Shape of our dataset (#Records, #Columns) (38133, 290)

In [7]:
# To see the columns of our dataset lets use the function below
list(df_all)


Out[7]:
['timestamp',
 'full_sq',
 'life_sq',
 'floor',
 'max_floor',
 'material',
 'build_year',
 'num_room',
 'kitch_sq',
 'state',
 'product_type',
 'sub_area',
 'area_m',
 'raion_popul',
 'green_zone_part',
 'indust_part',
 'children_preschool',
 'preschool_quota',
 'preschool_education_centers_raion',
 'children_school',
 'school_quota',
 'school_education_centers_raion',
 'school_education_centers_top_20_raion',
 'hospital_beds_raion',
 'healthcare_centers_raion',
 'university_top_20_raion',
 'sport_objects_raion',
 'additional_education_raion',
 'culture_objects_top_25',
 'culture_objects_top_25_raion',
 'shopping_centers_raion',
 'office_raion',
 'thermal_power_plant_raion',
 'incineration_raion',
 'oil_chemistry_raion',
 'radiation_raion',
 'railroad_terminal_raion',
 'big_market_raion',
 'nuclear_reactor_raion',
 'detention_facility_raion',
 'full_all',
 'male_f',
 'female_f',
 'young_all',
 'young_male',
 'young_female',
 'work_all',
 'work_male',
 'work_female',
 'ekder_all',
 'ekder_male',
 'ekder_female',
 '0_6_all',
 '0_6_male',
 '0_6_female',
 '7_14_all',
 '7_14_male',
 '7_14_female',
 '0_17_all',
 '0_17_male',
 '0_17_female',
 '16_29_all',
 '16_29_male',
 '16_29_female',
 '0_13_all',
 '0_13_male',
 '0_13_female',
 'raion_build_count_with_material_info',
 'build_count_block',
 'build_count_wood',
 'build_count_frame',
 'build_count_brick',
 'build_count_monolith',
 'build_count_panel',
 'build_count_foam',
 'build_count_slag',
 'build_count_mix',
 'raion_build_count_with_builddate_info',
 'build_count_before_1920',
 'build_count_1921-1945',
 'build_count_1946-1970',
 'build_count_1971-1995',
 'build_count_after_1995',
 'ID_metro',
 'metro_min_avto',
 'metro_km_avto',
 'metro_min_walk',
 'metro_km_walk',
 'kindergarten_km',
 'school_km',
 'park_km',
 'green_zone_km',
 'industrial_km',
 'water_treatment_km',
 'cemetery_km',
 'incineration_km',
 'railroad_station_walk_km',
 'railroad_station_walk_min',
 'ID_railroad_station_walk',
 'railroad_station_avto_km',
 'railroad_station_avto_min',
 'ID_railroad_station_avto',
 'public_transport_station_km',
 'public_transport_station_min_walk',
 'water_km',
 'water_1line',
 'mkad_km',
 'ttk_km',
 'sadovoe_km',
 'bulvar_ring_km',
 'kremlin_km',
 'big_road1_km',
 'ID_big_road1',
 'big_road1_1line',
 'big_road2_km',
 'ID_big_road2',
 'railroad_km',
 'railroad_1line',
 'zd_vokzaly_avto_km',
 'ID_railroad_terminal',
 'bus_terminal_avto_km',
 'ID_bus_terminal',
 'oil_chemistry_km',
 'nuclear_reactor_km',
 'radiation_km',
 'power_transmission_line_km',
 'thermal_power_plant_km',
 'ts_km',
 'big_market_km',
 'market_shop_km',
 'fitness_km',
 'swim_pool_km',
 'ice_rink_km',
 'stadium_km',
 'basketball_km',
 'hospice_morgue_km',
 'detention_facility_km',
 'public_healthcare_km',
 'university_km',
 'workplaces_km',
 'shopping_centers_km',
 'office_km',
 'additional_education_km',
 'preschool_km',
 'big_church_km',
 'church_synagogue_km',
 'mosque_km',
 'theater_km',
 'museum_km',
 'exhibition_km',
 'catering_km',
 'ecology',
 'green_part_500',
 'prom_part_500',
 'office_count_500',
 'office_sqm_500',
 'trc_count_500',
 'trc_sqm_500',
 'cafe_count_500',
 'cafe_sum_500_min_price_avg',
 'cafe_sum_500_max_price_avg',
 'cafe_avg_price_500',
 'cafe_count_500_na_price',
 'cafe_count_500_price_500',
 'cafe_count_500_price_1000',
 'cafe_count_500_price_1500',
 'cafe_count_500_price_2500',
 'cafe_count_500_price_4000',
 'cafe_count_500_price_high',
 'big_church_count_500',
 'church_count_500',
 'mosque_count_500',
 'leisure_count_500',
 'sport_count_500',
 'market_count_500',
 'green_part_1000',
 'prom_part_1000',
 'office_count_1000',
 'office_sqm_1000',
 'trc_count_1000',
 'trc_sqm_1000',
 'cafe_count_1000',
 'cafe_sum_1000_min_price_avg',
 'cafe_sum_1000_max_price_avg',
 'cafe_avg_price_1000',
 'cafe_count_1000_na_price',
 'cafe_count_1000_price_500',
 'cafe_count_1000_price_1000',
 'cafe_count_1000_price_1500',
 'cafe_count_1000_price_2500',
 'cafe_count_1000_price_4000',
 'cafe_count_1000_price_high',
 'big_church_count_1000',
 'church_count_1000',
 'mosque_count_1000',
 'leisure_count_1000',
 'sport_count_1000',
 'market_count_1000',
 'green_part_1500',
 'prom_part_1500',
 'office_count_1500',
 'office_sqm_1500',
 'trc_count_1500',
 'trc_sqm_1500',
 'cafe_count_1500',
 'cafe_sum_1500_min_price_avg',
 'cafe_sum_1500_max_price_avg',
 'cafe_avg_price_1500',
 'cafe_count_1500_na_price',
 'cafe_count_1500_price_500',
 'cafe_count_1500_price_1000',
 'cafe_count_1500_price_1500',
 'cafe_count_1500_price_2500',
 'cafe_count_1500_price_4000',
 'cafe_count_1500_price_high',
 'big_church_count_1500',
 'church_count_1500',
 'mosque_count_1500',
 'leisure_count_1500',
 'sport_count_1500',
 'market_count_1500',
 'green_part_2000',
 'prom_part_2000',
 'office_count_2000',
 'office_sqm_2000',
 'trc_count_2000',
 'trc_sqm_2000',
 'cafe_count_2000',
 'cafe_sum_2000_min_price_avg',
 'cafe_sum_2000_max_price_avg',
 'cafe_avg_price_2000',
 'cafe_count_2000_na_price',
 'cafe_count_2000_price_500',
 'cafe_count_2000_price_1000',
 'cafe_count_2000_price_1500',
 'cafe_count_2000_price_2500',
 'cafe_count_2000_price_4000',
 'cafe_count_2000_price_high',
 'big_church_count_2000',
 'church_count_2000',
 'mosque_count_2000',
 'leisure_count_2000',
 'sport_count_2000',
 'market_count_2000',
 'green_part_3000',
 'prom_part_3000',
 'office_count_3000',
 'office_sqm_3000',
 'trc_count_3000',
 'trc_sqm_3000',
 'cafe_count_3000',
 'cafe_sum_3000_min_price_avg',
 'cafe_sum_3000_max_price_avg',
 'cafe_avg_price_3000',
 'cafe_count_3000_na_price',
 'cafe_count_3000_price_500',
 'cafe_count_3000_price_1000',
 'cafe_count_3000_price_1500',
 'cafe_count_3000_price_2500',
 'cafe_count_3000_price_4000',
 'cafe_count_3000_price_high',
 'big_church_count_3000',
 'church_count_3000',
 'mosque_count_3000',
 'leisure_count_3000',
 'sport_count_3000',
 'market_count_3000',
 'green_part_5000',
 'prom_part_5000',
 'office_count_5000',
 'office_sqm_5000',
 'trc_count_5000',
 'trc_sqm_5000',
 'cafe_count_5000',
 'cafe_sum_5000_min_price_avg',
 'cafe_sum_5000_max_price_avg',
 'cafe_avg_price_5000',
 'cafe_count_5000_na_price',
 'cafe_count_5000_price_500',
 'cafe_count_5000_price_1000',
 'cafe_count_5000_price_1500',
 'cafe_count_5000_price_2500',
 'cafe_count_5000_price_4000',
 'cafe_count_5000_price_high',
 'big_church_count_5000',
 'church_count_5000',
 'mosque_count_5000',
 'leisure_count_5000',
 'sport_count_5000',
 'market_count_5000']

In [8]:
# Now, let's take a look over the dataset 
df_all.head()


Out[8]:
timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq state ... cafe_count_5000_price_1500 cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000
0 2011-08-20 43.0 27.0 4.0 NaN NaN NaN NaN NaN NaN ... 40 9 4 0 13 22 1 0 52 4
1 2011-08-23 34.0 19.0 3.0 NaN NaN NaN NaN NaN NaN ... 36 15 3 0 15 29 1 10 66 14
2 2011-08-27 43.0 29.0 2.0 NaN NaN NaN NaN NaN NaN ... 25 10 3 0 11 27 0 4 67 10
3 2011-09-01 89.0 50.0 9.0 NaN NaN NaN NaN NaN NaN ... 15 11 2 1 4 4 0 0 26 3
4 2011-09-05 77.0 77.0 4.0 NaN NaN NaN NaN NaN NaN ... 552 319 108 17 135 236 2 91 195 14

5 rows × 290 columns


In [9]:
# To see some descriptive statistics lets use the function describe()
df_all.describe()


Out[9]:
full_sq life_sq floor max_floor material build_year num_room kitch_sq state area_m ... cafe_count_5000_price_1500 cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000
count 38133.000000 30574.000000 37966.000000 28561.000000 28561.000000 2.347900e+04 28561.000000 28561.000000 23880.000000 3.813300e+04 ... 38133.000000 38133.000000 38133.000000 38133.000000 38133.000000 38133.000000 38133.000000 38133.000000 38133.000000 38133.000000
mean 54.111172 34.033460 7.667123 12.567592 1.834390 2.716785e+03 1.900844 6.543995 2.071650 1.766282e+07 ... 64.687934 32.805680 11.058820 1.819133 15.387853 30.825741 0.436394 8.847901 53.487635 6.056119
std 35.171162 47.581529 5.276156 6.730496 1.490923 1.308521e+05 0.847620 27.571630 0.864795 2.095034e+07 ... 125.214092 74.104439 28.636604 5.469808 29.452128 47.850168 0.609313 20.772155 46.584733 4.904623
min 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000e+00 0.000000 0.000000 1.000000 2.081628e+06 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 38.900000 20.000000 3.000000 9.000000 1.000000 1.966000e+03 1.000000 1.000000 1.000000 7.307411e+06 ... 6.000000 2.000000 1.000000 0.000000 2.000000 9.000000 0.000000 0.000000 11.000000 1.000000
50% 50.000000 30.000000 7.000000 12.000000 1.000000 1.980000e+03 2.000000 6.000000 2.000000 1.020722e+07 ... 25.000000 9.000000 2.000000 0.000000 7.000000 16.000000 0.000000 2.000000 48.000000 5.000000
75% 63.000000 43.000000 11.000000 17.000000 2.000000 2.006000e+03 2.000000 9.000000 3.000000 1.803644e+07 ... 51.000000 22.000000 5.000000 1.000000 12.000000 28.000000 1.000000 7.000000 76.000000 11.000000
max 5326.000000 7478.000000 77.000000 117.000000 6.000000 2.005201e+07 19.000000 2014.000000 33.000000 2.060718e+08 ... 643.000000 377.000000 147.000000 30.000000 151.000000 250.000000 2.000000 106.000000 218.000000 21.000000

8 rows × 274 columns

Plots

We show you two basic plot. Plot the data is a very important step in Machine learning, you can get some really useful insights about the data.

You can find a much more graphs examples here: https://seaborn.pydata.org/examples/index.html

References:


In [10]:
# For plotting we'll use the library called Seaborn 
plt.figure(figsize=(10, 5)) # The size of the plot
sns.distplot(Y, kde = False) # We'll use distribution plot and as first arg we use the column .price_doc


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1123cbb50>

In [11]:
# As we can see there's a lot of outliers and to smooth this records, let's use np.log() function and see the distribution
plt.figure(figsize=(10, 5)) # The size of the plot
sns.distplot(np.log(Y), kde = False) # We'll use distribution plot and as first arg we use the column .price_doc
plt.xlabel('np.log(price_doc)', fontsize=12)


Out[11]:
<matplotlib.text.Text at 0x10c9db950>

In [12]:
# One thing can help the convergence of ML algorithms it's to remove outliers or smooth or data, to do that
# Let's convert out Y dataset in .log

Y = np.log(Y)

Data Manipulation

As explained in the comments bellow, we need be carefull about NaN values. Also, for some problems, it is a good ideia combine some features to create new ones. Therefore, you should know do some basic data manipulations.Here are some examples:


In [13]:
# To put another variable inside the dataset. This variable will call price_sq
df_all["kitch_proportions"] = df_all["kitch_sq"]/df_all["full_sq"]

In [14]:
# When we deal with machine learning algorithms, Null values (or NaN) are trouble. You can choose 
# 1) remove then from dataset, 2) replace their values.
# It's much better make replacemens of this values using the mean, or median or some spare value (e.g. -99999) to avoid
# lost some predictive power. Let's check the columns with null.

for col in df_all.columns.values: # For each col the loop will take the value
    if len(df_all[df_all[col].isnull()][col]) > 0: # If the value are null, count each unit of the column
        print("{0}: {1}".format(col, len(df_all[df_all[col].isnull()][col]))) # and print the name of the column and the number of null values


life_sq: 7559
floor: 167
max_floor: 9572
material: 9572
build_year: 14654
num_room: 9572
kitch_sq: 9572
state: 14253
product_type: 33
preschool_quota: 8284
school_quota: 8280
hospital_beds_raion: 17859
raion_build_count_with_material_info: 6209
build_count_block: 6209
build_count_wood: 6209
build_count_frame: 6209
build_count_brick: 6209
build_count_monolith: 6209
build_count_panel: 6209
build_count_foam: 6209
build_count_slag: 6209
build_count_mix: 6209
raion_build_count_with_builddate_info: 6209
build_count_before_1920: 6209
build_count_1921-1945: 6209
build_count_1946-1970: 6209
build_count_1971-1995: 6209
build_count_after_1995: 6209
metro_min_walk: 59
metro_km_walk: 59
railroad_station_walk_km: 59
railroad_station_walk_min: 59
ID_railroad_station_walk: 59
cafe_sum_500_min_price_avg: 16440
cafe_sum_500_max_price_avg: 16440
cafe_avg_price_500: 16440
cafe_sum_1000_min_price_avg: 7746
cafe_sum_1000_max_price_avg: 7746
cafe_avg_price_1000: 7746
cafe_sum_1500_min_price_avg: 5020
cafe_sum_1500_max_price_avg: 5020
cafe_avg_price_1500: 5020
green_part_2000: 19
cafe_sum_2000_min_price_avg: 2149
cafe_sum_2000_max_price_avg: 2149
cafe_avg_price_2000: 2149
cafe_sum_3000_min_price_avg: 1173
cafe_sum_3000_max_price_avg: 1173
cafe_avg_price_3000: 1173
prom_part_5000: 270
cafe_sum_5000_min_price_avg: 425
cafe_sum_5000_max_price_avg: 425
cafe_avg_price_5000: 425
kitch_proportions: 9575

In [15]:
# To delete some column, just use the del function
del df_all['timestamp']
del df_all['life_sq']
del df_all['floor']
del df_all['max_floor']
del df_all['material']
del df_all['build_year']
del df_all['num_room']
del df_all['kitch_sq']
del df_all['state']
del df_all['preschool_quota']
del df_all['school_quota']
del df_all['hospital_beds_raion']
del df_all['raion_build_count_with_material_info']
del df_all['build_count_block']
del df_all['build_count_wood']
del df_all['build_count_frame']
del df_all['build_count_brick']
del df_all['build_count_monolith']
del df_all['build_count_panel']
del df_all['build_count_foam']
del df_all['build_count_slag']
del df_all['build_count_mix']
del df_all['raion_build_count_with_builddate_info']
del df_all['build_count_before_1920']
del df_all['build_count_1921-1945']
del df_all['build_count_1946-1970']
del df_all['build_count_1971-1995']
del df_all['build_count_after_1995']

In [16]:
# To fill NaN values with a specific number apply the function .fillna
df_all['cafe_sum_500_min_price_avg'].fillna(-99, inplace=True)
df_all['cafe_sum_500_max_price_avg'].fillna(-99, inplace=True)
df_all['cafe_avg_price_500'].fillna(-99, inplace=True)
df_all['cafe_sum_1000_min_price_avg'].fillna(-99, inplace=True)
df_all['cafe_sum_1000_max_price_avg'].fillna(-99, inplace=True)
df_all['metro_min_walk'].fillna(-99, inplace=True)
df_all['metro_km_walk'].fillna(-99, inplace=True)
df_all['railroad_station_walk_km'].fillna(-99, inplace=True)
df_all['railroad_station_walk_min'].fillna(-99, inplace=True)
df_all['ID_railroad_station_walk'].fillna(-99, inplace=True)
df_all['prom_part_5000'].fillna(-99, inplace=True)
df_all['cafe_sum_5000_min_price_avg'].fillna(-99, inplace=True)
df_all['cafe_sum_5000_max_price_avg'].fillna(-99, inplace=True)
df_all['cafe_avg_price_5000'].fillna(-99, inplace=True)
df_all['product_type'].fillna(-99, inplace=True)
df_all['green_part_2000'].fillna(-99, inplace=True)
df_all['kitch_proportions'].fillna(-99, inplace=True)

In [17]:
# To fill the the mean use the .mean()
df_all['cafe_avg_price_1000'].fillna(train['cafe_avg_price_1000'].mean(), inplace=True)
df_all['cafe_sum_1500_min_price_avg'].fillna(train['cafe_sum_1500_min_price_avg'].mean(), inplace=True)
df_all['cafe_sum_1500_max_price_avg'].fillna(train['cafe_sum_1500_max_price_avg'].mean(), inplace=True)
df_all['cafe_avg_price_1500'].fillna(train['cafe_avg_price_1500'].mean(), inplace=True)
df_all['cafe_sum_2000_min_price_avg'].fillna(train['cafe_sum_2000_min_price_avg'].mean(), inplace=True)

In [18]:
# To fill the the median use the .median()
df_all['cafe_sum_2000_max_price_avg'].fillna(train['cafe_sum_2000_max_price_avg'].median(), inplace=True)
df_all['cafe_avg_price_2000'].fillna(train['cafe_avg_price_2000'].median(), inplace=True)
df_all['cafe_sum_3000_min_price_avg'].fillna(train['cafe_sum_3000_min_price_avg'].median(), inplace=True)
df_all['cafe_sum_3000_max_price_avg'].fillna(train['cafe_sum_3000_max_price_avg'].median(), inplace=True)
df_all['cafe_avg_price_3000'].fillna(train['cafe_avg_price_3000'].median(), inplace=True)

In [19]:
# Another big problem for ML Algorithms it's the representation of categorical variables.
# this is because, most of this algoritms deals only with numeric representations. 

# Deal with categorical values
df_numeric = df_all.select_dtypes(exclude=['object']) # Select columns with numerical variables

df_obj = df_all.select_dtypes(include=['object']).copy() # Select columns with non numerical variables

In [20]:
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]

In [21]:
df_obj.head()


Out[21]:
product_type sub_area culture_objects_top_25 thermal_power_plant_raion incineration_raion oil_chemistry_raion radiation_raion railroad_terminal_raion big_market_raion nuclear_reactor_raion detention_facility_raion water_1line big_road1_1line railroad_1line ecology
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1
2 0 2 0 0 0 0 1 0 0 0 0 0 0 0 2
3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 4 0 0 0 0 1 1 0 0 0 0 0 1 1

In [22]:
# Now we can join this two data frames using the function concat() 
df_all = pd.concat([df_numeric, df_obj], axis=1)

In [23]:
# Create a validation set
#num_val = int(num_train * 0.2)

In [24]:
# After we cleasing our data in pandas, we need to transform this data in a numpy array
# because the most popular machine learning packages only make the computation using this format
# To to that, let's convert our dataframe 
X_all = df_all.values
X = X_all[:num_train]

In [25]:
# In this step we'll use the train dataset to split in training and test to ensure that our algorithm 
# is learning and to perform some quality check over the rmse

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # We'll use 20% of the data for test

print('X Train shape is', X_train.shape)
print('Y Train shape is', y_train.shape)
print('X Test shape is', X_test.shape)
print('Y Test shape is', y_test.shape)


('X Train shape is', (24376, 263))
('Y Train shape is', (24376,))
('X Test shape is', (6095, 263))
('Y Test shape is', (6095,))

Linear Regression

Linear Regression models are models which fits the data in linear equations. Linear Regression models are really useful in many cases and it is, almost always, the first choice when dealing with a ML problem.

There are some variants from the basic Linear Regression model. The two most common are Ridge Regression and Lasso Regression. The ideia behind these two models is regularize the model, eliminating useless features.

References:


In [26]:
# Now we'll use the a Linear model to fit this data
# First we use .LinearRegression() function of the linear_model library and make the object lm
lm = linear_model.LinearRegression()

# After this, we make a object called model where we'll call the function .fit where we use as argument 
# X_train and Y_train datasets to perform the fit
model = lm.fit(X_train, y_train)

# In this step we'll make the predictions objects after we call the predict function using a 
# X_test dataframe as parameter 
predictions = lm.predict(X_test)

# And now we perform a inverse transformation of the log using np.exp() function
y_pred = np.exp(predictions)

In [27]:
# To make an assessment of the quality of the model, we'll use the RMSE as a main metric to assess the performance of
# the model
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


rms error is: 6590782.58109

In [28]:
# There's another way to do that
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

rmse_val = rmse(y_pred.astype('int64'),y_test)
print("rms error is: " + str(rmse_val))


rms error is: 6590782.58109

In [29]:
# Now let's do some submission
# First we'll get the records of test database and store in some object
test_sub = X_all[:-num_train]
print('Test Sub shape is', test_sub.shape)


('Test Sub shape is', (7662, 263))

In [30]:
# We'll made the prediction using our test dataset and store inside y_pred object
predictions = lm.predict(test_sub)

y_pred = np.exp(predictions)

Submision

You need to create your submission when you want to test your model using the test data. These next cells creates a file with the submission format.


In [31]:
# We'll join the id and the predictions and store in an object called df_submission
df_submission = pd.DataFrame({'id': id_test, 'price_doc': y_pred.astype('int64')})

In [32]:
# Let's see what we got
df_submission.head(10)


Out[32]:
id price_doc
0 30474 5483799
1 30475 6441868
2 30476 5433498
3 30477 6981448
4 30478 9728970
5 30479 5539864
6 30480 6462102
7 30481 6528096
8 30482 5750421
9 30483 5229801

In [33]:
# To generate the file, we'll use to_csv function and we'll use the submission.csv ans the name of the file
df_submission.to_csv('submission.csv', index=False)

In [34]:
# Check the submission format
! head -n15 submission.csv


id,price_doc
30474,5483799
30475,6441868
30476,5433498
30477,6981448
30478,9728970
30479,5539864
30480,6462102
30481,6528096
30482,5750421
30483,5229801
30484,4725018
30485,4054000
30486,5464960
30487,4641812

In [35]:
from sklearn import linear_model
reg = linear_model.Ridge (alpha = .5)
reg.fit(X_train, y_train)

predictions = reg.predict(X_test)
y_pred = np.exp(predictions)

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


rms error is: 6588663.29143

In [36]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha = 0.1)
reg.fit(X_train, y_train)
reg.predict(X_test)

predictions = reg.predict(X_test)
y_pred = np.exp(predictions)

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


rms error is: 6325898.69787

Decision Tree Model

This type of model creates a decision tree based on the parameters. Each node will have some specific constant for each one of the parameters. It is a very powerful model, combining the versatile of linear models with a more flexible from decision trees.

References:


In [37]:
from sklearn import tree
clf = tree.DecisionTreeRegressor()
clf = clf.fit(X_train, y_train)
clf.predict(X_test)

predictions = clf.predict(X_test)
y_pred = np.exp(predictions)

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


rms error is: 8537604.49169

Gradient Boosting

Gradient boosting is a technique which combines several weak models to predict the result. This kind of approach is named as ensamble. XGBoost is a library which implements the gradient boosting method. XGBoost is being used with very good results for Kaggle problems.

References:


In [39]:
import xgboost as xgb

d_train = xgb.DMatrix(X_train, label=y_train)
d_valid = xgb.DMatrix(X_test, label=y_test)

params = {}
params['objective'] = 'reg:linear'
params['eta'] = 0.02
params['silent'] = 1

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

clf = xgb.train(params, d_train, 50, watchlist, early_stopping_rounds=100)

predictions = clf.predict(xgb.DMatrix(X_test))

y_pred = np.exp(predictions)

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
[0]	train-rmse:14.8164	valid-rmse:14.8324
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 100 rounds.
[1]	train-rmse:14.5206	valid-rmse:14.5366
[2]	train-rmse:14.2306	valid-rmse:14.2466
[3]	train-rmse:13.9464	valid-rmse:13.9625
[4]	train-rmse:13.6679	valid-rmse:13.6841
[5]	train-rmse:13.395	valid-rmse:13.4112
[6]	train-rmse:13.1276	valid-rmse:13.1438
[7]	train-rmse:12.8655	valid-rmse:12.8817
[8]	train-rmse:12.6087	valid-rmse:12.625
[9]	train-rmse:12.357	valid-rmse:12.3733
[10]	train-rmse:12.1103	valid-rmse:12.1266
[11]	train-rmse:11.8686	valid-rmse:11.8849
[12]	train-rmse:11.6318	valid-rmse:11.6481
[13]	train-rmse:11.3996	valid-rmse:11.416
[14]	train-rmse:11.1722	valid-rmse:11.1886
[15]	train-rmse:10.9493	valid-rmse:10.9657
[16]	train-rmse:10.7308	valid-rmse:10.7473
[17]	train-rmse:10.5167	valid-rmse:10.5332
[18]	train-rmse:10.3069	valid-rmse:10.3234
[19]	train-rmse:10.1014	valid-rmse:10.1178
[20]	train-rmse:9.89991	valid-rmse:9.91639
[21]	train-rmse:9.70249	valid-rmse:9.71898
[22]	train-rmse:9.50903	valid-rmse:9.52555
[23]	train-rmse:9.31944	valid-rmse:9.33593
[24]	train-rmse:9.13366	valid-rmse:9.15016
[25]	train-rmse:8.9516	valid-rmse:8.96811
[26]	train-rmse:8.77319	valid-rmse:8.78967
[27]	train-rmse:8.59836	valid-rmse:8.61484
[28]	train-rmse:8.42704	valid-rmse:8.44351
[29]	train-rmse:8.25915	valid-rmse:8.27562
[30]	train-rmse:8.09464	valid-rmse:8.11117
[31]	train-rmse:7.93342	valid-rmse:7.94993
[32]	train-rmse:7.77544	valid-rmse:7.79196
[33]	train-rmse:7.62063	valid-rmse:7.63713
[34]	train-rmse:7.46893	valid-rmse:7.48545
[35]	train-rmse:7.32027	valid-rmse:7.3368
[36]	train-rmse:7.1746	valid-rmse:7.19119
[37]	train-rmse:7.03186	valid-rmse:7.04846
[38]	train-rmse:6.89198	valid-rmse:6.90857
[39]	train-rmse:6.75491	valid-rmse:6.77148
[40]	train-rmse:6.6206	valid-rmse:6.63717
[41]	train-rmse:6.48899	valid-rmse:6.50559
[42]	train-rmse:6.36002	valid-rmse:6.37662
[43]	train-rmse:6.23365	valid-rmse:6.25027
[44]	train-rmse:6.10981	valid-rmse:6.12645
[45]	train-rmse:5.98847	valid-rmse:6.00507
[46]	train-rmse:5.86957	valid-rmse:5.88617
[47]	train-rmse:5.75306	valid-rmse:5.76964
[48]	train-rmse:5.6389	valid-rmse:5.65547
[49]	train-rmse:5.52703	valid-rmse:5.5436
rms error is: 25467.814114

In [42]:
from sklearn.ensemble import GradientBoostingRegressor
    
alpha = 0.95

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                n_estimators=10, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)    

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
y_pred = np.exp(predictions)

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


rms error is: 12778870.6031

Scalers

It is very common to have very different scales for the features. This is, in general, very bad for almost all machine learning algorithms, it might cause, for example, numerical stability problems. In order to avoid these problems, SciKit Learn implements some scalaers.

Refereces:


In [40]:
from sklearn.preprocessing import StandardScaler  
scaler = StandardScaler()  
scaler.fit(X_train)  
X_train = scaler.transform(X_train) 
X_test = scaler.transform(X_test)

Neural Networks

Today's hottest topic in ML world is Deep Learning. The basic idea of DL is use deep Neural Networks to learn a the data's pattern. The next cell shows how to create a very simple(not deep) in SciKitLearn.

References:


In [41]:
from sklearn.neural_network import MLPRegressor
clf = MLPRegressor(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(7, ), random_state=1)

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
y_pred = np.exp(predictions)

rmse = sqrt(mean_squared_error(y_test, y_pred.astype('int64')))
print("rms error is: " + str(rmse))


rms error is: 6846707.02397

Clustering Algorithms

Clustering algorithms cannot be used to solve this kind of problem, but they are very useful to discover more informations about the features and data. Here is a example on how to create some clustes for some points.

References:


In [43]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10, random_state=0).fit(X_train)
kmeans.predict(X_test)


Out[43]:
array([0, 3, 3, ..., 3, 7, 7], dtype=int32)

In [44]:
kmeans.labels_


Out[44]:
array([1, 0, 1, ..., 3, 7, 4], dtype=int32)

In [45]:
kmeans.cluster_centers_


Out[45]:
array([[ 0.07270668, -0.37032846,  0.7693542 , ...,  0.03340506,
        -0.03125792, -0.13485907],
       [ 0.16370134, -0.50624687, -0.16232737, ..., -0.161818  ,
        -0.17308184, -0.68907097],
       [ 0.00449375,  1.58130596, -1.26168349, ..., -0.15495479,
        -0.17308184,  1.33789532],
       ..., 
       [-0.06055566, -0.45718753,  0.24869457, ...,  0.05275176,
         0.07252469, -0.28958602],
       [-0.00884478, -0.02271096,  1.59833822, ...,  0.00861252,
        -0.02617405, -0.70976292],
       [ 0.08804819, -0.47867875, -0.05771808, ...,  0.34836266,
         0.24902631, -0.3424761 ]])