1. Classification Modelling

Overview

In this phase, our dataset is already fully prepared and normalized, ready to be fed into classification algorithms. We'll first split the dataset, setting aside 20% of the data to be used only at the end of this process, after training a few different classification models. The model with the best performance on this final test dataset will be our best model.

Load pre-processed dataset


In [106]:
#All imports here
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn import grid_search  # legacy grid search module (removed in sklearn 0.20)


%matplotlib inline

In [3]:
startups = pd.read_csv('data/startups_pre_processed.csv', index_col=0)
startups[:3]


Out[3]:
funding_total_usd funding_rounds founded_at first_funding_at last_funding_at Category_Software Category_Biotechnology Category_Mobile Category_Enterprise Software Category_E-Commerce ... State_IL State_PA State_CO State_VA State_NJ State_GA State_OH State_NC State_MD status
permalink
/organization/-qounter 0.000023 0.055556 0.005917 0.002825 0.027344 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 operating
/organization/004-technologies 0.000000 0.000000 0.018409 0.002410 0.033203 1 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 operating
/organization/0xdata 0.001117 0.166667 0.015779 0.003906 0.001953 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 operating

3 rows × 104 columns

Define acquired status as our target variable

We are trying to forecast which startups are most likely to be acquired, so we define 'acquired' startups as 1 and the rest as 0.

Also, in this phase, we remove startups with status 'ipo' from the dataset: they could also be considered successful startups, but they are not the focus of this project.


In [4]:
startups = startups[startups['status'] != 'ipo']

In [5]:
startups['acquired'] = startups['status'].replace({'acquired':1, 'operating':0, 'closed':0})
startups = startups.drop('status', axis=1)

In [6]:
ax = startups['acquired'].replace({0:'not acquired', 1:'acquired'}).value_counts().plot(kind='bar', title="Acquired startups", rot=0)


Train/Test Split

We'll now split the dataset, keeping the test set imbalanced, as it would be in a real-world setting.


In [7]:
dtrain, dtest = train_test_split(startups, test_size = 0.2, random_state=42, stratify=startups['acquired'])
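
As a quick sanity check (assuming the dtrain/dtest split above), the stratified split should preserve roughly the same share of acquired startups in both subsets:

# The proportion of acquired startups should be nearly identical in both subsets
print('Train: {0:0.3f}'.format(dtrain['acquired'].mean()))
print('Test:  {0:0.3f}'.format(dtest['acquired'].mean()))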

General model evaluator

Let's define a helper function to train and evaluate models using GridSearchCV.


In [8]:
def run_classifier(parameters, classifier, df):
    seed = 42
    np.random.seed(seed)
    X = df.iloc[:, :-1]
    y = df['acquired']
    # Hold out a stratified validation split from the training data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)
    # 5-fold cross-validated grid search, optimizing ROC AUC
    clf = grid_search.GridSearchCV(classifier, parameters, n_jobs=-1, scoring='roc_auc', cv=5)
    clf.fit(X=X_train, y=y_train)
    model = clf.best_estimator_
    print ('Avg auc score: '+str(clf.best_score_), 'Best params: '+str(clf.best_params_))
    print 'Auc score on Train       set: {0:0.3f}'.format(roc_auc_score(y_train, model.predict(X_train)))
    print 'Auc score on Validation  set: {0:0.3f}'.format(roc_auc_score(y_test, model.predict(X_test)))
    # Confusion matrix with 'All' margins holding the row/column totals
    cm = pd.crosstab(y_test, model.predict(X_test), rownames=['True'], colnames=['Predicted'], margins=True)
    # FP rate: share of non-acquired startups wrongly flagged as acquired
    print 'FP rate: {0:0.1f}%'.format(cm[1][0]/float(cm['All'][0])*100)
    # TP rate: share of acquired startups correctly flagged
    print 'TP rate: {0:0.1f}%'.format(cm[1][1]/float(cm['All'][1])*100)

    return model
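
Note that the AUC scores here are computed from hard class predictions rather than probabilities, which understates the achievable AUC. A hedged variation (assuming the fitted model exposes predict_proba; for SVC this requires setting probability=True) could be added to the helper just before the confusion matrix:

# Probability-based AUC: rank startups by P(acquired) instead of hard 0/1 labels
proba = model.predict_proba(X_test)[:, 1]
print 'Auc score (probabilities):   {0:0.3f}'.format(roc_auc_score(y_test, proba))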

Random Forest


In [102]:
rf_parameters = {'max_depth':range(5,12), 'n_estimators': [50], 'class_weight':['balanced']}
rf_clf = run_classifier((rf_parameters), RandomForestClassifier(random_state=0), dtrain)


('Avg auc score: 0.858577560978', "Best params: {'n_estimators': 50, 'max_depth': 10, 'class_weight': 'balanced'}")
Auc score on Train       set: 0.827
Auc score on Validation  set: 0.777
FP rate: 18.1%
TP rate: 73.4%
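
Random forests also expose per-feature importances, which show which variables drive these predictions. A minimal sketch, assuming the rf_clf and dtrain objects from the cells above:

# Top 10 features by impurity-based importance (feature columns exclude 'acquired')
importances = pd.Series(rf_clf.feature_importances_, index=dtrain.columns[:-1])
print(importances.nlargest(10))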

Principal Component Analysis


In [101]:
dtrain_numeric = dtrain.iloc[:, :-1].filter(regex=('(number_of|avg_).*|.*(funding_total_usd|funding_rounds|_at)'))
pca = PCA(n_components=30)

pca.fit(dtrain_numeric)

fig, ax = plt.subplots()
plt.plot(pca.explained_variance_ratio_, marker='o', color='b', linestyle='dashed')
ax.annotate('6 dimensions', xy=(6.2, pca.explained_variance_ratio_[6]+0.001), xytext=(8, 0.04),arrowprops=dict(facecolor='black', shrink=0.01),)
plt.title('Total variance explained')
plt.xlabel('Number of dimensions')
plt.ylabel('Variance explained by dimension')



We can notice that after the 6th dimension, the variance explained by each additional dimension becomes very low.
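
As a quick check (assuming the pca object fit above with n_components=30), we can sum how much of the total variance the first six dimensions capture:

# Cumulative share of variance explained by the first 6 principal components
print('First 6 dimensions explain {0:0.1f}% of the variance'.format(
    pca.explained_variance_ratio_[:6].sum() * 100))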


In [97]:
import sys
sys.path.insert(0, './exploratory_code')
import visuals as vs  # PCA plotting helper from the exploratory phase


pca = PCA(n_components=6)
pca.fit(dtrain_numeric)

# Generate PCA results plot
vs.pca_results(dtrain_numeric, pca)


Out[97]:
Explained Variance funding_total_usd funding_rounds founded_at first_funding_at last_funding_at angel_funding_total_usd angel_funding_rounds convertible_note_funding_total_usd convertible_note_funding_rounds ... seed_funding_rounds undisclosed_funding_total_usd undisclosed_funding_rounds venture_funding_total_usd venture_funding_rounds number_of_acquisitions number_of_investments number_of_unique_investments number_of_investors_per_round avg_amount_invested_per_round
Dimension 1 0.3086 0.0057 0.6335 0.0219 0.0100 -0.0785 0.0027 0.0048 0.0018 0.0124 ... -0.0182 0.0000 0.0040 0.0081 0.7098 0.0035 0.0002 0.0002 0.2492 0.0018
Dimension 2 0.1926 -0.0010 0.1700 -0.0511 -0.0169 -0.4192 0.0021 0.0325 0.0005 0.0177 ... 0.8514 -0.0001 0.0023 -0.0019 -0.2211 -0.0016 -0.0001 -0.0001 0.1250 -0.0011
Dimension 3 0.1344 0.0000 -0.0982 0.0242 0.0264 0.6598 -0.0003 -0.0008 -0.0010 -0.0230 ... 0.2412 0.0001 0.0048 0.0009 -0.0465 0.0013 0.0007 0.0007 0.5123 0.0014
Dimension 4 0.0944 0.0006 -0.0037 -0.0052 0.0095 0.2188 -0.0027 -0.0158 -0.0006 -0.0172 ... 0.0738 0.0000 -0.0018 0.0017 -0.0065 0.0007 0.0008 0.0007 0.4422 0.0011
Dimension 5 0.0808 0.0001 -0.2442 -0.0470 -0.0254 -0.5645 -0.0013 -0.0182 0.0003 0.0105 ... -0.3600 0.0001 -0.0097 0.0004 -0.0895 -0.0010 0.0024 0.0023 0.6642 0.0016
Dimension 6 0.0473 0.0022 0.3201 0.0324 0.0064 0.0911 0.0293 0.1689 0.0029 0.0531 ... -0.1645 0.0008 0.0745 -0.0026 -0.4793 0.0005 0.0003 0.0003 0.1483 0.0010

6 rows × 39 columns

With PCA, we can visualize the explained variance ratio of each generated dimension. The output above shows, for example, a PCA transformation using only the numerical variables of our dataset. The first dimension explains around 30% of the variance in the data, most of which is driven by the funding_rounds and venture_funding_rounds variables. For the second dimension, which explains 19% of the variance in the data, the most influential features are last_funding_at and seed_funding_rounds.
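
The same loadings can also be inspected directly, without the visuals helper; a minimal sketch, assuming the pca and dtrain_numeric objects from the cell above:

# Loadings: contribution of each original feature to each of the 6 dimensions
loadings = pd.DataFrame(pca.components_,
                        columns=dtrain_numeric.columns,
                        index=['Dimension {0}'.format(i + 1) for i in range(6)])
print(loadings.round(4))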

Principal Component Analysis + Support Vector Machine


In [9]:
pca = PCA(n_components=6)
pca.fit(dtrain.iloc[:, :-1])
dtrain_pca =  pca.transform(dtrain.iloc[:, :-1])
dtrain_pca = pd.DataFrame(dtrain_pca)
dtrain_pca['acquired'] = list(dtrain['acquired'])

In [10]:
svm_parameters = [
  #{'C': [1, 10, 100], 'kernel': ['linear'], 'class_weight':['balanced']},
  {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'class_weight':['balanced']},
  #{'C': [1, 10, 100, 1000], 'kernel': ['poly'], 'degree': [2, 5], 'coef0':[0,1], 'class_weight':['balanced']},
 ]
svm_clf = run_classifier(svm_parameters, SVC(random_state=0), dtrain_pca)


('Avg auc score: 0.612707153638', "Best params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001, 'class_weight': 'balanced'}")
Auc score on Train       set: 0.576
Auc score on Validation  set: 0.567
FP rate: 39.1%
TP rate: 52.6%
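
One caveat: pca above was fit on all of dtrain before run_classifier made its internal validation split, so the projection has already seen the validation rows. A leak-free variant (a sketch, not run here) wraps PCA and SVC in a Pipeline, so that the projection is refit on each cross-validation fold:

from sklearn.pipeline import Pipeline

# PCA is refit inside every CV fold; validation rows never influence the projection
pipe = Pipeline([('pca', PCA(n_components=6)),
                 ('svm', SVC(random_state=0))])
pipe_parameters = {'svm__C': [1, 10, 100], 'svm__gamma': [0.001, 0.0001],
                   'svm__kernel': ['rbf'], 'svm__class_weight': ['balanced']}
svm_pipe_clf = run_classifier(pipe_parameters, pipe, dtrain)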

Support Vector Machine (without PCA)


In [11]:
svm_parameters = [
  #{'C': [1, 10, 100], 'kernel': ['linear'], 'class_weight':['balanced']},
  {'C': [100], 'gamma': [0.001], 'kernel': ['rbf'], 'class_weight':['balanced']},
  #{'C': [1, 10, 100, 1000], 'kernel': ['poly'], 'degree': [2, 5], 'coef0':[0,1], 'class_weight':['balanced']},
 ]
svm_clf = run_classifier(svm_parameters, SVC(random_state=0), dtrain)


('Avg auc score: 0.830874595566', "Best params: {'kernel': 'rbf', 'C': 100, 'gamma': 0.001, 'class_weight': 'balanced'}")
Auc score on Train       set: 0.763
Auc score on Validation  set: 0.766
FP rate: 23.1%
TP rate: 76.4%

k-Nearest Neighbors


In [12]:
knn_parameters = {'n_neighbors':[3, 5], 'n_jobs':[-1], 'weights':['distance', 'uniform']}
knn_clf = run_classifier(knn_parameters, KNeighborsClassifier(), dtrain)


('Avg auc score: 0.691067808096', "Best params: {'n_neighbors': 5, 'n_jobs': -1, 'weights': 'uniform'}")
Auc score on Train       set: 0.629
Auc score on Validation  set: 0.548
FP rate: 2.7%
TP rate: 12.3%

k-Nearest Neighbors (subsampled)


In [13]:
from imblearn.under_sampling import RandomUnderSampler

# Undersample the majority class so both classes end up with equal counts
rus = RandomUnderSampler(random_state=42, return_indices=True)
X_undersampled, y_undersampled, indices = rus.fit_sample(dtrain.iloc[:, :-1], dtrain['acquired'])
dtrain_subsampled = pd.DataFrame(X_undersampled)
dtrain_subsampled['acquired'] = y_undersampled

In [14]:
knn_parameters = {'n_neighbors':[1, 10], 'n_jobs':[-1], 'weights':['distance', 'uniform']}
knn_subsampled_clf = run_classifier(knn_parameters, KNeighborsClassifier(), dtrain_subsampled)


('Avg auc score: 0.757172183278', "Best params: {'n_neighbors': 10, 'n_jobs': -1, 'weights': 'distance'}")
Auc score on Train       set: 1.000
Auc score on Validation  set: 0.701
FP rate: 30.1%
TP rate: 70.2%

XGBoost


In [15]:
import xgboost as xgb

def model_xgboost_classifier(classifier, df, sample_weight={1:4,0:1}, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    seed = 42
    np.random.seed(seed)
    X = df.iloc[:, :-1]
    y = df['acquired']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

    if useTrainCV:
        # Let xgboost's built-in CV pick the number of boosting rounds via early stopping
        xgb_param = classifier.get_xgb_params()
        xgtrain = xgb.DMatrix(X_train.values, label=y_train.values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=classifier.get_params()['n_estimators'], nfold=cv_folds, \
                          metrics='auc', early_stopping_rounds=early_stopping_rounds)
        classifier.set_params(n_estimators=cvresult.shape[0])
        print cvresult

    # Weight the minority (acquired) class more heavily via per-sample weights
    classifier.fit(X_train, y_train, eval_metric='auc', sample_weight=y_train.replace(sample_weight))

    print 'Auc score on Train       set: {0:0.3f}'.format(roc_auc_score(y_train, classifier.predict(X_train)))
    print 'Auc score on Validation  set: {0:0.3f}'.format(roc_auc_score(y_test, classifier.predict(X_test)))
    cm = pd.crosstab(y_test, classifier.predict(X_test), rownames=['True'], colnames=['Predicted'], margins=True)
    print cm
    # FP rate: non-acquired predicted as acquired; TP rate: acquired correctly flagged
    print 'FP rate: {0:0.1f}%'.format(cm[1][0]/float(cm['All'][0])*100)
    print 'TP rate: {0:0.1f}%'.format(cm[1][1]/float(cm['All'][1])*100)

    return classifier

In [16]:
from xgboost import XGBClassifier

xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=10,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.9,
 colsample_bytree=0.9,
 objective= 'binary:logistic',
 n_jobs=4,
 scale_pos_weight=1,
 random_state=0)

xgb_clf = model_xgboost_classifier(xgb1, dtrain, useTrainCV=False, cv_folds=5, early_stopping_rounds=1000)


Auc score on Train       set: 0.779
Auc score on Validation  set: 0.768
Predicted     0     1   All
True                       
0          4442   732  5174
1           214   451   665
All        4656  1183  5839
FP rate: 14.1%
TP rate: 67.8%

Final test on unseen data


In [17]:
def compare_models(models, df):
    # Evaluate each fitted model (looked up by variable name) on the held-out test data
    for var_model in models:
        model = eval(var_model)
        X = df.iloc[:, :-1]
        y = df['acquired']
        print '--------------'+var_model + '---------------'
        print 'Auc score on Test set: {0:0.3f}'.format(roc_auc_score(y, model.predict(X)))
        cm = pd.crosstab(y, model.predict(X), rownames=['True'], colnames=['Predicted'], margins=True)
        print 'FP rate: {0:0.1f}%'.format(cm[1][0]/float(cm['All'][0])*100)
        print 'TP rate: {0:0.1f}%'.format(cm[1][1]/float(cm['All'][1])*100)
        

compare_models(['rf_clf', 'svm_clf', 'knn_clf', 'knn_subsampled_clf', 'xgb_clf'], dtest)


--------------rf_clf---------------
Auc score on Test set: 0.773
FP rate: 18.8%
TP rate: 73.4%
--------------svm_clf---------------
Auc score on Test set: 0.757
FP rate: 23.5%
TP rate: 74.8%
--------------knn_clf---------------
Auc score on Test set: 0.564
FP rate: 2.9%
TP rate: 15.6%
--------------knn_subsampled_clf---------------
Auc score on Test set: 0.665
FP rate: 32.8%
TP rate: 65.8%
--------------xgb_clf---------------
Auc score on Test set: 0.761
FP rate: 15.1%
TP rate: 67.4%

Confusion Matrix visualization - Random Forest


In [109]:
X = dtest.iloc[:, :-1]
y = dtest['acquired']

cm = confusion_matrix(y, rf_clf.predict(X))

sns.heatmap(cm, annot=True, cmap='Blues', xticklabels=['no', 'yes'], yticklabels=['no', 'yes'], fmt='g')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion matrix for:\n{}'.format(rf_clf.__class__.__name__));


The chart above illustrates a concrete use case of our model and also shows the class imbalance in the test dataset. When classifying a set of 7299 companies, the model wrongly flagged 1219 companies as acquisitions that did not happen, while correctly identifying 610 companies that were in fact acquired. The results obtained in this project varied around this trade-off: some models caught more of the true acquisitions, but at the same time flagged more companies that should not have been tagged positively.
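
To make this trade-off explicit, precision and recall can be computed directly from the test predictions; a minimal sketch, assuming rf_clf and the X, y defined in the previous cell:

from sklearn.metrics import precision_score, recall_score

# Precision: share of predicted acquisitions that were actually acquired
# Recall: share of actual acquisitions the model caught (the TP rate above)
y_pred = rf_clf.predict(X)
print('Precision: {0:0.3f}'.format(precision_score(y, y_pred)))
print('Recall:    {0:0.3f}'.format(recall_score(y, y_pred)))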