Home Assignment No. 2: Part 2 (Practice)

To solve this task, you will write a lot of code to try several machine learning methods for classification and regression.

  • You are HIGHLY RECOMMENDED to read relevant documentation, e.g. for python, numpy, matlpotlib and sklearn. Also remember that seminars, lecture slides, Google and StackOverflow are your close friends during this course (and, probably, whole life?).

  • If you want an easy life, you have to use BUILT-IN METHODS of sklearn library instead of writing tons of our yown code. There exists a class/method for almost everything you can imagine (related to this homework).

  • To do this part of homework, you have to write CODE directly inside specified places inside notebook CELLS.

  • In some problems you may be asked to provide short discussion of the results. In this cases you have to create MARKDOWN cell with your comments right after the your code cell.

  • For every separate problem you can get only 0 points or maximal points for this problem. There are NO INTERMEDIATE scores. So make sure that you did everything required in the task

  • Your SOLUTION notebook MUST BE REPRODUCIBLE, i.e. if the reviewer decides to execute Kernel -> Restart Kernel and Run All Cells, after all the computation he will obtain exactly the same solution (with all the corresponding plots) as in your uploaded notebook. For this purpose, we suggest to fix random seed or (better) define random_state= inside every algorithm that uses some pseudorandomness.

  • Your code must be clear to the reviewer. For this purpose, try to include neccessary comments inside the code. But remember: GOOD CODE MUST BE SELF-EXPLANATORY without any additional comments.

Before the start, read several additional recommendations.

  • Probably you lauch jupyter notebook or ipython notebook from linux console. Try jupyter lab instead - it is a more convenient environment to work with notebooks.
  • Probably the PC on which you are going to evaluate models has limited CPU/RAM Memory. In this case, we recommend to monitor the CPU and Memory Usage. To do this, you can execute htop (for CPU/RAM) or free -s 0.2 (for RAM) in terminal.
  • Probably tou have multiple Cores (CPU) on your PC. Many sklearn algorithms support multithreading (Ensemble Methods, Cross-Validation, etc.). Check if the particular algorithm has n_jobs parameters and set it to -1 to use all the cores.

Please, write your implementation within the designated blocks:

...
### BEGIN Solution

# >>> your solution here <<<

### END Solution
...

Model and feature selection

Let's load the dataset for this task.


In [1]:
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
%matplotlib inline

In [136]:
data_fs = pd.read_csv(r'data/data_fs.csv', low_memory=False)

Look at the first 10 rows of this dataset.


In [137]:
data_fs.head(10)


Out[137]:
timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq state ... provision_retail_space_modern_sqm turnover_catering_per_cap theaters_viewers_per_1000_cap seats_theather_rfmin_per_100000_cap museum_visitis_per_100_cap bandwidth_sports population_reg_sports_share students_reg_sports_share apartment_build apartment_fund_sqm
0 2011-08-20 43 27.0 4.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
1 2011-08-23 34 19.0 3.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
2 2011-08-27 43 29.0 2.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
3 2011-09-01 89 50.0 9.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
4 2011-09-05 77 77.0 4.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
5 2011-09-06 67 46.0 14.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
6 2011-09-08 25 14.0 10.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
7 2011-09-09 44 44.0 5.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
8 2011-09-10 42 27.0 5.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0
9 2011-09-13 36 21.0 9.0 NaN NaN NaN NaN NaN NaN ... 271.0 6943.0 565.0 0.45356 1240.0 269768.0 22.37 64.12 23587.0 230310.0

10 rows × 390 columns

The dataset has many NaN's and also a lot of categorical features. So at first, you should preprocess the data. We can deal with categorical features by using one-hot encoding. To do that we can use pandas.get_dummies.


In [138]:
# fill nan with 0
data_fs = data_fs.fillna(0)

# our goal is to predict the "price_doc" feature.
y = data_fs[["price_doc"]]
X = data_fs.drop("price_doc", axis=1)
X = X.drop("timestamp", axis=1)

# one-hot encoding
X = pd.get_dummies(X, sparse=True)

In [139]:
# Let's split our dataset into train 70 % and test 30% by using sklearn. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Look at first 10 rows what you get.
X_train.head(10)


Out[139]:
full_sq life_sq floor max_floor material build_year num_room kitch_sq state area_m ... child_on_acc_pre_school_3,013 child_on_acc_pre_school_7,311 modern_education_share_0 modern_education_share_90,92 modern_education_share_93,08 modern_education_share_95,4918 old_education_build_share_0 old_education_build_share_23,14 old_education_build_share_25,47 old_education_build_share_8,2517
14065 46 44.0 7.0 25.0 1.0 2015.0 1.0 1.0 1.0 1.139168e+07 ... 0 0 0 0 1 0 0 0 1 0
12978 77 48.0 17.0 17.0 4.0 2009.0 3.0 9.0 3.0 1.631523e+07 ... 1 0 0 1 0 0 0 1 0 0
18695 39 18.0 7.0 17.0 1.0 0.0 1.0 9.0 0.0 5.293465e+06 ... 0 0 0 0 1 0 0 0 1 0
26411 52 52.0 9.0 17.0 1.0 0.0 2.0 1.0 1.0 2.553630e+07 ... 0 0 0 0 1 0 0 0 1 0
1419 30 18.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 2.641243e+06 ... 0 1 1 0 0 0 1 0 0 0
29787 99 0.0 12.0 0.0 1.0 2015.0 4.0 1.0 1.0 4.441296e+06 ... 0 0 0 0 0 1 0 0 0 1
18411 40 0.0 17.0 17.0 1.0 0.0 1.0 1.0 1.0 1.139168e+07 ... 0 0 0 0 1 0 0 0 1 0
11541 31 17.0 1.0 9.0 2.0 1964.0 1.0 6.0 2.0 4.662813e+06 ... 1 0 0 1 0 0 0 1 0 0
20741 55 0.0 6.0 0.0 1.0 0.0 2.0 12.0 1.0 6.677245e+07 ... 0 0 0 0 1 0 0 0 1 0
13103 58 42.0 7.0 9.0 1.0 1974.0 3.0 6.0 2.0 4.389199e+06 ... 1 0 0 1 0 0 0 1 0 0

10 rows × 560 columns

Okay, now let's see how much data we have.


In [7]:
print("Train size =", X_train.shape)
print("Test size =", X_test.shape)


Train size = (21329, 560)
Test size = (9142, 560)

There are too many features in this dataset and not all of them are equally important for our problem. Besides, using the whole dataset as-is to train a linear model will, for sure, lead to overfitting. Instead of painful and time consuming manual selection of the most relevant data, we will use the methods of automatic feature selection.


But at first, we almost forgot to take a look at our targets. Let's plot y_train histogram.


In [8]:
y_train.hist(bins=100)


Out[8]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f499c1650f0>]],
      dtype=object)

There is a big variance in it and it's far from being a normal distribution. In the real-world problems it happens all the time: the data can be far from perfect. We can use some tricks to make it more like what we want. In this particular case we can predict $\log y$ instead of $y$. This transformation is invertible, so we will be able to get our $y$ back.


In [9]:
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)
y_train_log.hist(bins=100)


Out[9]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f4997428ba8>]],
      dtype=object)

Now it looks more like the data we want to deal with.

The preprocessing is finally over, so now we are ready for the actual task.

**IMPORTANT NOTICE**

If you have difficulties with solving the below problems take a look at seminar $7$ on feature and model selection.


Task 1 (1 pt.): Random forest feature importances

Use random forest to find the imortance of features. Plot the histogram.


In [10]:
from sklearn.ensemble import RandomForestRegressor 

### BEGIN Solution

random_forest = RandomForestRegressor(n_estimators=250, random_state=101, n_jobs=4)
random_forest.fit(X_train, y_train_log.values.ravel())

std = np.std([random_forest.feature_importances_ for random_forest in random_forest.estimators_], axis=0)
importances = random_forest.feature_importances_
indices = np.argsort(importances)[::-1]

In [11]:
FEAT_NUM = 20

plt.figure(figsize=(15,10))
plt.title("Top %d important features" % (FEAT_NUM), size=16)
plt.bar(range(FEAT_NUM), importances[indices][:FEAT_NUM],
       color="r", yerr=std[indices[:FEAT_NUM]], align="center")
plt.xticks(range(FEAT_NUM), [X_train.columns[indices[f]] for f in range(FEAT_NUM)], 
           rotation='vertical', size=16)
plt.yticks(size=16)
plt.xlim([-1, FEAT_NUM])
plt.show()

### END Solution


Print the 20 most important features and their values.


In [12]:
### BEGIN Solution

# Print the feature ranking
print("Feature ranking:")
for f in range(FEAT_NUM):
    print("%d. %s (%f)" % (f + 1, (X_train.columns[indices[f]]), importances[indices[f]]))

### END Solution


Feature ranking:
1. full_sq (0.243196)
2. sport_count_3000 (0.025632)
3. cafe_count_3000 (0.021454)
4. cafe_count_5000_price_2500 (0.019507)
5. cafe_count_2000 (0.018246)
6. micex_cbi_tr (0.009018)
7. num_room (0.008832)
8. brent (0.007926)
9. exhibition_km (0.007346)
10. swim_pool_km (0.007321)
11. kindergarten_km (0.007186)
12. ttk_km (0.007076)
13. metro_km_avto (0.006917)
14. eurrub (0.006910)
15. micex (0.006823)
16. cafe_count_5000 (0.006613)
17. floor (0.006505)
18. public_healthcare_km (0.006380)
19. usdrub (0.006318)
20. additional_education_km (0.006261)

In [13]:
X_train_cut = X_train.filter([X_train.columns[x] for x in indices[:20]], axis=1)
X_test_cut = X_test.filter([X_test.columns[x] for x in indices[:20]], axis=1)
print("New shape of training samples: ", X_train_cut.shape)
print("New shape of testing samples: ", X_test_cut.shape)


New shape of training samples:  (21329, 20)
New shape of testing samples:  (9142, 20)


Task 2 (1 pt.)

On these 20 features train each of the following models

  • Linear Regression
  • Ridge regression
  • Random forest
  • DecisionTree

and test its performance using the Root Mean Squared Logarithmic Error (RMSLE).


In [14]:
from sklearn.metrics import mean_squared_log_error

You will need to do it for the next tasks too, so we recommend you to implement a dedicated function for comparisons, which

  1. on input the function takes a training dataset (X_train, y_train) and a test sample (X_test, y_test)
  2. it trains all of the listed models on the (X_train, y_train) sample
  3. it computes and returns a table the RMSLE score of each fitted model on the test dataset(X_test, y_test)

In [15]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_log_error

def comparator(X_train, y_train, X_test, y_test):
    """
    Parameters
    ==========
        X_train: ndarray - training inputs
        y_train: ndarray - training targets
        X_test: ndarray - test inputs
        y_test: ndarray - test targets
        
    Returns
    =======
        pd.DataFrame - table of RMSLE scores of each model on test and train datasets
    """
    methods = {
        "Linear Regression": sklearn.linear_model.LinearRegression(n_jobs=4), 
        "Lasso": linear_model.Lasso(random_state=101), 
        "Ridge": linear_model.Ridge(random_state=101),
        "Dtree": sklearn.tree.DecisionTreeRegressor(random_state=101),
        "RFR": sklearn.ensemble.RandomForestRegressor(random_state=101, n_estimators =100, n_jobs=4)
    }
    error_train = []
    error_test = []
    
### BEGIN Solution
    for model in methods.values():
        model.fit(X_train, y_train.values.ravel())
        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)
        error_train.append(mean_squared_log_error(y_train_pred, y_train))
        error_test.append(mean_squared_log_error(y_test_pred, y_test))

### END Solution
    return pd.DataFrame({
        "Methods": list(methods.keys()),
        "Train loss": error_train,
        "Test loss": error_test
    })

Now apply this function


In [18]:
### BEGIN Solution
result = comparator(X_train_cut, y_train_log, X_test_cut, y_test_log)
print(result)
### END Solution


             Methods  Test loss    Train loss
0              Ridge   0.001054  1.044497e-03
1              Dtree   0.001756  1.159749e-09
2              Lasso   0.001197  1.217445e-03
3  Linear Regression   0.001054  1.044498e-03
4                RFR   0.000871  1.237115e-04


Forward-backward methods

The idea is to add or remove features and look how it influences the value of the loss function or some other criteria.

Decision about adding or deleting a feature may be made based on:

  • AIC
  • BIC
  • validation error
  • Mallows $C_p$
  • sklearn's estimator.score()

Task 3 (2 pt.): Implement forward method with early stopping

Implement the following greedy feature selection algorithm:

# Initialize with an empty list of features.
list_of_best_features = []

while round < n_rounds:
    round = round + 1

    if no_more_features:
        # end loop

    # Iterate over currently *unsued* features and use $k$-fold 
    # . `cross_val_score` to measure model "quality".
    compute_quality_with_each_new_unused_feature(...)

    # **Add** the feature that gives the highest "quality" of the model.
    pick_and_add_the_best_feature(...)

    if model_quality_has_increased_since_last_round:
        round = 0

return list_of_best_features

ATTN

Use $k=3$ for the $k$-fold cv, because higher values could take a lo-o-o-o-o-o-o-o-ong time.

Please bear in mind that the lower RMSLE (mean_squared_log_error) is, the higher the model "quality" is.

Please look up cross_val_score(...) peculiarities in scikit's manual.

In the cell below implement a function that would iterate over a list of features and use $k$-fold cross_val_score to measure model "quality".


In [39]:
from sklearn.metrics import make_scorer
import warnings
warnings.filterwarnings("ignore")

def selection_step(model, X, y, used_features=(), cv=3):
    """
    Parameters
    ==========
        X: ndarray - training inputs
        y: ndarray - training targets
        used_features: - list of features
        cv: int - number of folds

    Returns
    =======
        scores - dictionary of scores
    """
    
    scores = {}
    
    ### BEGIN Solution

    for feature in X.columns:
        if feature not in used_features:
            feat_set = list(used_features).copy()
            feat_set.append(feature)
            rmsle = abs(cross_val_score(model, X[feat_set], y.values.ravel(),
                                        scoring=make_scorer(mean_squared_log_error),
                                        error_score=np.nan, cv=cv, n_jobs=4).mean())
            scores[feature] = rmsle
            
    ### END Solution
    return scores

In [47]:
def forward_steps(X, y, n_rounds, method):
    """
    Parameters
    ==========
        X: ndarray - training inputs
        y: ndarray - training targets
        n_rounds: int - early stop when score doesn't increase n_rounds
        method: sklearn model

    Returns
    =======
        feat_best_list - list of features
    """
    
    feat_best_list = []
    last_score = np.inf

    ### BEGIN Solution
    round = 0
    count = 0
    
    while (round < n_rounds):
        round = round + 1
        count = count + 1
        
        if (len(feat_best_list) == X.shape[1]):
            break
            
        scores = selection_step(method, X, y, feat_best_list)
        best_feat = min(scores, key=scores.get)
        feat_best_list.append(best_feat)
        print(round, best_feat)
        
        if (scores[best_feat] < last_score):
            last_score = scores[best_feat]
            round = 0

    ### END Solution
    
    return feat_best_list

Use the function implemented above and use DecisionTreeRegressor to get the best features according to this algorithm and print them.


In [48]:
### BEGIN Solution
from sklearn import tree

# DecisionTreeRegressor
print("Decision Tree Regressor feature ranking")
clf = sklearn.tree.DecisionTreeRegressor(random_state=101)
best_features = forward_steps(X_train, y_train_log, 3, clf)

### END Solution


Decision Tree Regressor feature ranking
1 full_sq
1 ecology_no data
1 sub_area_Nekrasovka
1 sub_area_Poselenie Vnukovskoe
1 sub_area_Poselenie Novofedorovskoe
1 sub_area_Poselenie Filimonkovskoe
1 sub_area_Zapadnoe Degunino
1 sub_area_Krylatskoe
1 sub_area_Hamovniki
1 sub_area_Poselenie Krasnopahorskoe
1 sub_area_Zamoskvorech'e
1 sub_area_Troickij okrug
1 sub_area_Poselenie Moskovskij
1 sub_area_Sokol'niki
1 sub_area_Birjulevo Zapadnoe
1 sub_area_Poselenie Kokoshkino
1 sub_area_Arbat
1 sub_area_Begovoe
1 sub_area_Poselenie Shherbinka
1 sub_area_Vostochnoe
2 sub_area_Poselenie Voskresenskoe
1 sub_area_Babushkinskoe
2 sub_area_Poselenie Rogovskoe
1 sub_area_Poselenie Klenovskoe
1 sub_area_Ostankinskoe
2 sub_area_Poselenie Voronovskoe
3 sub_area_Poselenie Shhapovskoe

Use Linear Regression, Ridge regression, Random forest and DecisionTree to get the RMSLE score using these features. Remember the function you wrote earlier.


In [49]:
### BEGIN Solution

result = comparator(X_train[best_features], y_train_log, X_test[best_features], y_test_log)
print(result)

### END Solution


             Methods  Test loss  Train loss
0              Ridge   0.001119    0.001154
1              Dtree   0.000940    0.000879
2              Lasso   0.001252    0.001277
3  Linear Regression   0.001119    0.001154
4                RFR   0.000928    0.000882


Boosting: gradient boosting, adaboost

Practical Boosting

In this task you are asked to implement a boosting algorithm, and compare speed of different popular boosting libraries.

Task 4 (2 pt.): Boosting Classification on a toy dataset

Let's generate a toy dataset for classification.


In [97]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, shuffle=True, noise=0.05, random_state=1011)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1011)

In [98]:
y_test[y_test == 0] = -1
y_train[y_train == 0] = -1

Your task is:

  1. Implement gradient boosting algorithms with logistic loss and labels $y\in \{-1, +1\}$;
  2. Plot the decision boundary on a $2$-d grid;
  3. Estimate the accuracy score on the test dataset, as well as other classification metrics, that you can think of;

For basic implementation please refer to seminars $8-9$.


In [99]:
from sklearn.tree import DecisionTreeRegressor
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score, precision_score, f1_score

In [130]:
class FuncSeries:
    def __init__(self):
        self.func_series = []
    
    def __call__(self, X):
        sum = self.func_series[0](X)
        for f in self.func_series[1:]:
            sum += f(X)
            
        return sum
    
    def append(self, func):
        self.func_series.append(func)

class GradientBoostingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, estimators=5):
        self.estimators = estimators
        self.func_series = FuncSeries()
         
        
    def fit(self, X, y):
        self.func_series.append(lambda X: np.zeros(X.shape[0]))

        for i in range(self.estimators):
            residuals =  2 * y / (1 + np.exp(2 * y * self.func_series(X)))
            
            clf = DecisionTreeRegressor(max_depth=3)
            clf.fit(X, residuals)
            
            self.func_series.append(lambda X: clf.predict(X))
            
        return self
    
    def predict(self, X):
        predicted = np.sign(self.func_series(X)).astype(np.int)
        predicted[predicted == 0] = -1
        
        return predicted

In [131]:
class GBM:
    
    def __init__(self, estimator, estimator_params, n_estimators):
        self.base_estimator = estimator
        self.params = estimator_params
        self.n_estimators = n_estimators
        self.cascade = []
    
    def fit(self, X, y):
        
        for i in range(self.n_estimators):
            
            s = y / (1.0 + np.exp(y * self._output(X)))
            new_estimator = self.base_estimator(**self.params)
            new_estimator.fit(X, s)
            self.cascade.append(new_estimator)
    
    def _output(self, X):
        res = np.zeros(X.shape[0])
        
        for i in range(len(self.cascade)):
            res += self.cascade[i].predict(X)
        
        return res
        
    
    def predict_proba(self, X):
        return 1.0 / (1.0 + np.exp(-self._output(X)))
    
    def predict(self, X):
        res = np.sign(self._output(X))
        res[res == 0] = -1
        
        return res

In [133]:
### BEGIN Solution

model = GradientBoostingClassifier(estimators=6)
params = {
    'max_depth' : 2
}
# model = GBM(DecisionTreeRegressor, params, 100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

results = {'Accuracy' : accuracy_score(y_test, y_pred),
           'Precision': precision_score(y_test, y_pred),
           'F1_score' : f1_score(y_test, y_pred)}

print ("Results:")
for key, value in results.items():
    print ("%s %.3f" % (key, value))

### END Solution


Results:
Accuracy 0.975
Precision 0.982
F1_score 0.974

In [134]:
from mlxtend.plotting import plot_decision_regions

plt.figure(figsize=(10,7))
plt.title("Decision boundary", size=16)
plot_decision_regions(X=X_train, y=y_train, clf=model, legend=2)
    
plt.tight_layout()
plt.show()




Task 5 (1 pt.): Measuring the Speed and Performance

Please make sure to install the following powerful packages for boosting:

In this task you are asked to compare the training time of the GBDT, the Gradient Boosted Decision Trees, as implemeted by different popular ML libraries. The dataset you shall use is the UCI Breast Cancer dataset. You should study the parameters of each library and establish the correspondence between them.

The plan is as follows:

  1. Take the default parameter settings, measure the training time, and plot the ROC curves;
  2. Use grid search with the $3$-fold cross valiadation to choose the best model. Then measure the training time as a function of (separately) tree depth and the number of estimators in the ensemble, finally plot the ROC curves of the best models.

You need to make sure that you are comparing comparable classifiers, i.e. with the same tree and ensemble hyperparameters.

**NOTE** You need figure out how to make parameter settings compatible. One possible way to understand the correspondence is to study the docs. You may choose the default parameters from any library.

Please plot three ROC curves, one per library, on the same one plot with a comprehensible legend.

A useful command for timing is IPython's timeit cell magic.


In [141]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import catboost as ctb
import lightgbm as lgb

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0x0BADBEEF)

In [142]:
### BEGIN Solution

default_depth = 6 
default_estimators = 100

clfs = {
    "CatBoost" : ctb.CatBoostClassifier(logging_level='Silent', 
                                        max_depth=default_depth, 
                                        n_estimators=default_estimators),

    "LGBM" : lgb.LGBMClassifier(max_depth=default_depth, 
                                n_estimators=default_estimators),
    
    "XGBC" : xgb.XGBClassifier(max_depth=default_depth, 
                               n_estimators=default_estimators)
}

plt.figure(figsize=(10,8))

for key, clf in clfs.items():
    probas = clf.fit(X_train, y_train).predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, alpha=0.8,
             label='ROC %s (AUC = %0.2f)' % (key, roc_auc))

plt.xlabel('False Positive Rate', size=14)
plt.ylabel('True Positive Rate', size=14)
plt.title('Receiver operating characteristic example', size=14)
plt.legend(loc="lower right")
plt.show()

### END Solution



In [143]:
tunned_params = [{
    'max_depth' : range(2, 8, 2),
    'n_estimators' : range(40, 160, 20)
}]

cv = StratifiedKFold(n_splits=3)

plt.figure(figsize=(10,8))

for key, clf in models.items():
    gs = GridSearchCV(clf, tunned_params, cv=3, iid=True,
                       scoring='roc_auc', n_jobs=4, return_train_score=True)
    gs.fit(X_train, y_train)
    clf = gs.best_estimator_
    probas = clf.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probas[:, 1])
    plt.plot(fpr, tpr, lw=2, alpha=0.6,
             label='%s | %s = %d; %s = %d' 
             % (key, 'max_depth', gs.best_params_['max_depth'],
                'n_estimators', gs.best_params_['n_estimators']))
    
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()



In [350]:
estimator_params = [{
    'n_estimators' : range(40, 160, 20)
}]

depth_params = [{
    'max_depth' : range(2, 10, 2),
}]

f, axes = plt.subplots(1, 2, sharex=True, figsize=(15, 7))
axes[0].set_title('Relative time')
axes[1].set_title('Absolute time')

def plot_time_chart(axes, params, key):
    grid_cv = GridSearchCV(model, params, cv=3, iid=True,
                           scoring='roc_auc', n_jobs=4, return_train_score=True)
    grid_cv.fit(X_train, y_train)
    axes[0].plot(params[0][key], 
            np.array(grid_cv.cv_results_['mean_fit_time']) / grid_cv.cv_results_['mean_fit_time'][0],
            label=grid_cv.best_estimator_.__class__.__name__)
    
    axes[1].plot(params[0][key], 
            grid_cv.cv_results_['mean_fit_time'],
            label=grid_cv.best_estimator_.__class__.__name__)



for model in models.values():
    plot_time_chart(axes, estimator_params, 'n_estimators')

axes[0].legend()
axes[1].legend()
plt.tight_layout()
plt.show()



In [351]:
i = 0
f, axes = plt.subplots(1, 2, sharex=True, figsize=(15, 7))
axes[0].set_title('Relative time')
axes[1].set_title('Absolute time')

for model in models.values():
    plot_time_chart(axes, depth_params, 'max_depth')

axes[0].legend()
axes[1].legend()
plt.tight_layout()
plt.show()



NNs

Task 6 (1 pt.): Activation functions

Plot the following activation functions using their PyTorch realizations and their derivatives using autograd functionality:

  • ReLU, ELU ($\alpha = 1$), Softplus ($\beta = 1$);
  • Sign, Sigmoid, Softsign, Tanh.

In [33]:
import torch.nn.functional as F
import matplotlib.pyplot as plt
import torch

x = torch.arange(-2, 2, .01, requires_grad=True)
x.sum().backward() # to create x.grad

f, axes = plt.subplots(2, 2, sharex=True, figsize=(15, 7))
axes[0, 0].set_title('Values')
axes[0, 1].set_title('Derivatives')

for i, function_set in (0, (('ReLU', F.relu), ('ELU', F.elu), ('Softplus', F.softplus))), \
                       (1, (('Sign', torch.sign), ('Sigmoid', torch.sigmoid), ('Softsign', F.softsign), ('Tanh', torch.tanh))):
    for function_name, activation in function_set:
        ### BEGIN Solution
        
        x.grad.data.zero_()
        y = activation(x)
        axes[i, 0].plot(x.data.numpy(), y.data.numpy(), label=function_name)
        y.sum().backward()
        axes[i, 1].plot(x.data.numpy(), x.grad.data.numpy(), label=function_name)
        
        ### END Solution

    axes[i, 0].legend()
    axes[i, 1].legend()

plt.tight_layout()
plt.show()


Answer the following questions. Which of these functions may be, and which -- definitely are a poor choise as an activation function in a neural network? Why?

The main requirement for backprop is activation function being differentiable. However, the sign function is non-differentiable at x = 0 and it has 0 derivative elsewhere. Therefore the gradient descent will not be able to update the weights.

Another problem may be with ReLU activation function which may cause lot of redundant or dead nodes in a net. Thus neurons do not contribute to the fial result, and do not have a derivative.

Task 7 (3 pt.): Backpropagation

At the seminar 10 on neural networks, we built an MLP with one hidden layer using our numpy implementations of linear layer and logistic and softmax activation functions. Your task is to

  1. implement backpropagation for these modules,
  2. train our numpy realization of MLP to classify the toy MNIST from sklearn.datasets.

In [34]:
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_digits
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Prepare the dataset.


In [233]:
digits, targets = load_digits(return_X_y=True)
digits = digits.astype(np.float32) / 255

digits_train, digits_test, targets_train, targets_test = train_test_split(digits, targets, random_state=0)

train_size = digits_train.shape[0]
test_size = digits_test.shape[0]


input_size = 8*8
classes_n = 10

Implement the MLP with backprop.


In [249]:
class Linear:
    def __init__(self, input_size, output_size):
        self.thetas = np.random.randn(input_size, output_size)
        self.thetas_grads = np.empty_like(self.thetas)
        self.bias = np.random.randn(output_size)
        self.bias_grads = np.empty_like(self.bias)
        self.input = None
        self.out = None

    def forward(self, x): 
        self.input = x
        output = np.matmul(x, self.thetas) + self.bias
        self.out = output
        return output
    

    def backward(self, x, output_grad):
        ### BEGIN Solution
        self.input = self.input.reshape(-1,1)
        input_grad = np.matmul(self.thetas, output_grad)
        self.thetas_grads += self.input @ output_grad.T
        self.bias_grads += output_grad.sum(axis=1)
        
        assert self.thetas_grads.shape == self.thetas.shape 
        assert self.bias_grads.shape == self.bias.shape
        ### END Solution
        return input_grad


class LogisticActivation:
    def __init__(self):
        self.input = None
        self.out = None    
    
    def forward(self, x):
        self.input = x
        output = 1/(1 + np.exp(-x))
        self.out = output
        return output


    def backward(self, x, output_grad):
        ### BEGIN Solution
        self.out = self.out.reshape(-1,1)
        input_grad = output_grad * self.out * (1. - self.out)
        ### END Solution
        return input_grad
    

class SoftMaxActivation:
    def __init__(self):
        self.input = None
        self.out = None
        
    def forward(self, x):
        self.input = x
        output = np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)
        self.out = output
        return output

    def backward(self, x, output_grad):
        ### BEGIN Solution
        self.out = self.out.reshape(-1,1)
        input_grad = output_grad * self.out * (1. - self.out)
        ### END Solution
        return input_grad
    

class MLP:
    def __init__(self, input_size, hidden_layer_size, output_size):
        self.linear1 = Linear(input_size, hidden_layer_size)
        self.activation1 = LogisticActivation()
        self.linear2 = Linear(hidden_layer_size, output_size)
        self.softmax = SoftMaxActivation()
        
    
    def forward(self, x):
        return self.softmax.forward((self.linear2.forward(self.activation1.forward(self.linear1.forward(x)))))


    def backward(self, x, output_grad):
    
        ### BEGIN Solution
        output_grad = self.linear2.backward(x, output_grad)
        output_grad = self.activation1.backward(x, output_grad)
        output_grad = self.linear1.backward(x, output_grad)
        ### END Solution

In [353]:
### BEGIN Solution
def cross_entropy_loss(predicted, target):
    target_vector = np.zeros_like(predicted)
    
    if (predicted.ndim != 1):
        target_vector[np.arange(len(target)), target] = 1
        cost = -np.sum(target_vector * np.log2(predicted), axis=1)
    else:
        target_vector[target] = 1
        cost = -np.sum(target_vector * np.log2(predicted))
    
    return cost

def grad_cross_entropy_loss(predicted, target):
    target_vector = np.zeros_like(predicted)
    target_vector[target] = 1    
    return (predicted - target_vector).reshape(-1, 1)
### END Solution

In [354]:
np.random.seed(0)

mlp = MLP(input_size=input_size, hidden_layer_size=100, output_size=classes_n)

epochs_n = 250
learning_curve = [0] * epochs_n
test_curve = [0] * epochs_n

x_train = digits_train
x_test = digits_test
y_train = targets_train
y_test = targets_test

learning_rate = 1e-2

for epoch in range(epochs_n):
    if epoch % 10 == 0:
        print('Starting epoch', epoch)
    for sample_i in range(train_size):
        x = x_train[sample_i]
        target = y_train[sample_i]

        ### BEGIN Solution
        # ... zero the gradients
        mlp.linear1.thetas_grads = np.zeros_like(mlp.linear1.thetas_grads)
        mlp.linear1.bias_grads = np.zeros_like(mlp.linear1.bias_grads)

        mlp.linear2.thetas_grads = np.zeros_like(mlp.linear2.thetas_grads)
        mlp.linear2.bias_grads = np.zeros_like(mlp.linear2.bias_grads)
        
        # prediction = mlp.forward(x)
        predicted_value = mlp.forward(x)
        loss = cross_entropy_loss(predicted_value, target) # use cross entropy loss
        loss_grad = grad_cross_entropy_loss(predicted_value, target)
        learning_curve[epoch] += loss
        grad = mlp.backward(x, loss_grad)
        # ... perform backward pass
        # ... update the weights simply with weight -= grad * learning_rate
        mlp.linear1.thetas -= learning_rate * mlp.linear1.thetas_grads
        mlp.linear1.bias -= learning_rate * mlp.linear1.bias_grads
        
        mlp.linear2.thetas -= learning_rate * mlp.linear2.thetas_grads
        mlp.linear2.bias -= learning_rate * mlp.linear2.bias_grads
    
    learning_curve[epoch] /= train_size
    prediction = mlp.forward(x_test)
    loss = cross_entropy_loss(prediction, y_test).mean()
    test_curve[epoch] = loss
    ### END Solution

plt.plot(learning_curve)
plt.plot(test_curve)


Starting epoch 0
Starting epoch 10
Starting epoch 20
Starting epoch 30
Starting epoch 40
Starting epoch 50
Starting epoch 60
Starting epoch 70
Starting epoch 80
Starting epoch 90
Starting epoch 100
Starting epoch 110
Starting epoch 120
Starting epoch 130
Starting epoch 140
Starting epoch 150
Starting epoch 160
Starting epoch 170
Starting epoch 180
Starting epoch 190
Starting epoch 200
Starting epoch 210
Starting epoch 220
Starting epoch 230
Starting epoch 240
Out[354]:
[<matplotlib.lines.Line2D at 0x7fe9ad384a90>]

In [355]:
predictions = mlp.forward(digits).argmax(axis=1)
pd.DataFrame(confusion_matrix(targets, predictions))


Out[355]:
0 1 2 3 4 5 6 7 8 9
0 177 0 0 0 1 0 0 0 0 0
1 0 174 0 1 1 0 2 0 1 3
2 0 1 175 0 0 0 0 1 0 0
3 0 0 1 180 0 0 0 1 1 0
4 0 1 0 0 177 0 0 1 0 2
5 0 0 0 1 1 177 1 0 0 2
6 1 1 0 0 0 0 178 0 1 0
7 0 0 0 0 1 0 0 176 0 2
8 0 10 1 1 0 2 1 2 157 0
9 0 1 0 2 0 2 0 0 0 175

Task 8 (3 pt.): Modelling real-life DL

In this task you will train your own CNN for dogs vs cats classification task. The goal of this task is not to get the highest accuracy possible (try getting the highest accuracy possible though) but to model the real-life process of training a deep neural network.

**IMPORTANT NOTICE**

Training neural networks is a time consuming task and it can take days or even weeks. Try not to leave this task to the last day. It is not necessary for you to use GPU for this task, but using it may drastically reduce the time required for you to complete this task.

There is a good amount of datasets in torchvision, but in practice, chances are that you wouldn't find the dataset for your particular problem, so you should be capable of writing DataLoader for your own dataset.


In [19]:
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
import PIL.Image as Image
from torch import nn
import numpy as np
import torch.optim as optim
import matplotlib.pyplot as plt
import pandas as pd
import torch
from torchvision import transforms, utils
from PIL import Image
import os
import os.path
import sys
import progressbar

Make sure you are using the right device.


In [20]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)


cuda:0

First take a look at the data.


In [21]:
dt = pd.read_csv(r'data/cats_dogs/train.csv')
dt.head()


Out[21]:
path y
0 cats_dogs/train/dogs/dog.342.jpg 1
1 cats_dogs/train/cats/cat.661.jpg 0
2 cats_dogs/train/cats/cat.516.jpg 0
3 cats_dogs/train/dogs/dog.938.jpg 1
4 cats_dogs/train/cats/cat.224.jpg 0

In [22]:
Image.open('data/' + dt['path'].iloc[1])


Out[22]:

Implement your Dataset class.


In [23]:
#Change class name 
class ImageFolder(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to csv file
            root_dir (string): Root directory path.
        """
        
        self.dt = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform
        

    def __getitem__(self, idx):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (sample, target) where target is class_index of the target class.
        """

        path = self.root_dir + '/' + self.dt.iloc[idx]['path']
        target = self.dt.iloc[idx]['y']
        
        with open(path, 'rb') as f:
            sample= Image.open(f).convert('RGB')
        
        sample = self.transform(sample)
        
        return sample, target
    
    
    def __len__(self):
        return self.dt.shape[0]

In [24]:
root_dir = './data'

image_size = 224

batch_size = 8

workers = 2

ngpu = 2

In [25]:
dataset = ImageFolder('data/cats_dogs/train.csv', root_dir)
len(dataset)


Out[25]:
2000

Define the augmentation tranform and instantiate training and validation subsets of your Dataset and the correpsonding DataLoaders.


In [26]:
data_transform_train = transforms.Compose([
    transforms.RandomResizedCrop(image_size),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])
    
data_transform_test = transforms.Compose([
    transforms.Resize(image_size),
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

### BEGIN Solution
dataset_train = ImageFolder('data/cats_dogs/train.csv', root_dir, transform=data_transform_train)
dataset_val = ImageFolder('data/cats_dogs/validation.csv', root_dir, transform=data_transform_test)

train_loader = torch.utils.data.DataLoader(dataset_train, batch_size=batch_size,
                                           shuffle=True, num_workers=workers)
val_loader = torch.utils.data.DataLoader(dataset_val, batch_size=batch_size,
                                         shuffle=False, num_workers=workers)
### END Solution

Make sure that dataloader works as expected by observing one sample from it.


In [27]:
for X,y in train_loader:
    print(X[0])
    print(y[0])
    plt.imshow(np.array(X[0,0,:,:]))
    break


tensor([[[-0.2856, -0.2684, -0.2684,  ..., -0.1143, -0.1486, -0.1828],
         [-0.2684, -0.2684, -0.2684,  ..., -0.1143, -0.1143, -0.1486],
         [-0.2684, -0.2684, -0.2684,  ..., -0.1314, -0.1143, -0.1314],
         ...,
         [ 0.2967,  0.3823,  0.2796,  ...,  0.1254,  0.1597,  0.1083],
         [ 0.2796,  0.2796,  0.2453,  ...,  0.0398,  0.2282,  0.1597],
         [ 0.3652,  0.3481,  0.2111,  ..., -0.1657,  0.1083,  0.1939]],

        [[-0.0749, -0.0574, -0.0574,  ...,  0.2227,  0.1877,  0.1527],
         [-0.0924, -0.0749, -0.0749,  ...,  0.1702,  0.1702,  0.1527],
         [-0.0924, -0.0924, -0.0924,  ...,  0.1352,  0.1527,  0.1352],
         ...,
         [ 0.0826,  0.1702,  0.0651,  ...,  0.0476,  0.0826, -0.0049],
         [ 0.0826,  0.0826,  0.0476,  ..., -0.0574,  0.0826, -0.0399],
         [ 0.1527,  0.1527, -0.0049,  ..., -0.2850, -0.0574, -0.0224]],

        [[ 0.2173,  0.2348,  0.2348,  ...,  0.6879,  0.6531,  0.6182],
         [ 0.2173,  0.2173,  0.2173,  ...,  0.7228,  0.7054,  0.6705],
         [ 0.2173,  0.2173,  0.2173,  ...,  0.7228,  0.7402,  0.7054],
         ...,
         [-0.7936, -0.6890, -0.7936,  ..., -0.5147, -0.4798, -0.5147],
         [-0.8458, -0.8284, -0.8807,  ..., -0.7238, -0.5321, -0.5844],
         [-0.7413, -0.7413, -0.8981,  ..., -0.9678, -0.6890, -0.5670]]])
tensor(0)

Implement your model below. You can use any layers that you want, but in general the structure of your model should be

  1. convolutional feature extractor, followed by
  2. fully-connected classifier.

In [28]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.convol = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            nn.Dropout(),
            nn.ReLU()
        )
        
        self.linear = nn.Sequential(
            nn.Linear(64 * 28 * 28, 256),
            nn.ReLU(),
            nn.Linear(256, 84),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(84, 2),
            nn.LogSoftmax(dim=1)
        )

    def forward(self, input):
        x = self.convol(input)
        x = x.view(x.size(0), -1)
        x = self.linear(x)
        return x

Send your model to GPU, if you have it.


In [29]:
def create_model(net, device):
    model = net.to(device)

    if (device.type == 'cuda') and (ngpu > 1):
        model = nn.DataParallel(model, list(range(ngpu)))
        
    return model

Implement your loss function below, or use the predefined loss, suitable for this task.


In [30]:
### BEGIN Solution
criterion = nn.CrossEntropyLoss().cuda()
### END Solution

Try two different optimizers and choose one. For the optimizer of your choice, try two different sets of parameters (e.g learning rate). Explain both of your choices and back them with the learning performance of the network (see the rest of the task).

In this parts of the task you may try more than two options, but, please, leave in your solution only the results for two different optimizers and two different sets of parameters.

You may finally train you model. Don't forget to:

  1. monitor its training and validation performance during training, i.e plot the loss functions and prediction accuracy for train and validation sets, to make sure that your model doesn't learn complete nonsense; do not include tons of learning curves in your homework solution; (in real-life, you may find tensorboardX extremely useful for this task);
  2. visualize its training and validation performance after training, to demonstrate that you have accomplished the task;
  3. save the state of your model during the training, to use the best one at the end; you may find useful this tutorial on saving and loading models;
  4. send the input and target data to the same device as your model.

Your model should be able to show at least 75% validation accuracy.

You may also find useful the following parts of documentation: Module.train, Module.eval, Module.state_dict, Module.load_state_dict.


In [31]:
def save_checkpoint(model, path):
    torch.save({
            'model_state_dict': model.state_dict(),
            }, path)
    
def load_checkpoint(model, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    
def accuracy_score(model):
    correct = 0
    total = 0

    model.eval()
    with torch.no_grad():
        for data in val_loader:
            samples = data[0].to(device)
            labels = data[1].to(device)
            outputs = model(samples)
            _, predicted = torch.max(outputs.data, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print('Accuracy of the network: %d %%' % (100 * correct / total))


    model.train()

In [32]:
### BEGIN Solution
def train(model, optimizer):
    model_losses_train = []
    model_losses_val = []
    model_accuracy_val = []

    num_epochs = 100

    it = 0

    with progressbar.ProgressBar(max_value = num_epochs * len(train_loader)) as bar:
        for epoch in range(num_epochs):
            train_loss = 0
            for i, data in enumerate(train_loader, 0):
                model.zero_grad()

                labels = data[1].to(device)
                samples = data[0].type(torch.FloatTensor).to(device)

                output = model(samples)
                err = criterion(output, labels)

                err.backward()        
                optimizer.step()

                train_loss += err.item()

                bar.update(it)

                it += 1

            if epoch % 5 == 0:
                model_losses_train.append(train_loss / len(train_loader))

                model.eval()

                with torch.no_grad():
                    test_loss = 0

                    for data_val in val_loader:
                        labels_val = data_val[1].to(device)
                        samples_val = data_val[0].type(torch.FloatTensor).to(device)

                        output_val = model(samples_val)
                        err_val = criterion(output_val, labels_val)

                        test_loss += err_val.item()

                    model_losses_val.append(test_loss / len(val_loader))

                model.train()
                
    plt.figure(figsize=(10,5))
    plt.title("CE Loss During Training")

    plt.plot(model_losses_train, label="Train")
    plt.plot(model_losses_val, label="Validation")

    plt.xlabel("iterations")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()
### END Solution

In [33]:
# Model = Net(), optimizer = Adam, lr = 1e-5
model = create_model(Net(), device)
lr = 1e-5
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-7)

train(model, optimizer)
save_checkpoint(model, './adam_relu_lr=1e-5')
accuracy_score(model)


100% (25000 of 25000) |##################| Elapsed Time: 0:10:40 Time:  0:10:40
Accuracy of the network: 75 %

In [34]:
# Model = Net(), optimizer = Adam, lr = 1e-7
model = create_model(Net(), device)
lr = 1e-5
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-7)

train(model, optimizer)
save_checkpoint(model, './adam_relu_lr=1e-7')
accuracy_score(model)


100% (25000 of 25000) |##################| Elapsed Time: 0:10:40 Time:  0:10:40
Accuracy of the network: 75 %

In [35]:
# Model = Net(), optimizer = SGD, lr = 1e-5, momentum=0.9
model = create_model(Net(), device)
lr = 1e-5
optimizer = optim.SGD(model.parameters(), lr = 1e-5, momentum = 0.9)

train(model, optimizer)
save_checkpoint(model, './sgd_relu_lr=1e-5')
accuracy_score(model)


100% (25000 of 25000) |##################| Elapsed Time: 0:10:35 Time:  0:10:35
Accuracy of the network: 61 %

Load Checkpoints and check accuracy score


In [39]:
model = create_model(Net(), device)
load_checkpoint(model, './adam_relu_lr=1e-5')
accuracy_score(model)


Accuracy of the network: 75 %

In [ ]:

Task 9 (1 pt.): Bad activation function

Using your conclusions from the Task 6, choose the worst activation function and replace all activations in your model from the previous Task 8 with this one. Demonstrate the training and validation performance of this version of the model.


In [40]:
class Sign(nn.Module):
    def __init__(self):
        super(Sign, self).__init__()
        
    def forward(self, input):
        return torch.sign(input)
    
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.convol = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            Sign(),
            nn.MaxPool2d(2, 2),
            
            nn.Conv2d(16, 32, 3, padding=1),
            Sign(),
            nn.MaxPool2d(2, 2),
            
            nn.Conv2d(32, 64, 3, padding=1),
            Sign(),
            nn.MaxPool2d(2, 2),
            
            nn.Dropout(),
            Sign()
        )
        
        self.linear = nn.Sequential(
            nn.Linear(64 * 28 * 28, 256),
            Sign(),
            nn.Linear(256, 84),
            Sign(),
            nn.Dropout(),
            nn.Linear(84, 2),
            nn.LogSoftmax(dim=1)
        )

    def forward(self, input):
        x = self.convol(input)
        x = x.view(x.size(0), -1)
        x = self.linear(x)
        return x

In [42]:
model = create_model(Net(), device)
lr = 1e-5
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-7)

In [43]:
train(model, optimizer)
accuracy_score(model)


100% (25000 of 25000) |##################| Elapsed Time: 0:14:25 Time:  0:14:25
Accuracy of the network: 51 %