Dataset Exploration

Here we explore how each of the features we have so far relates to the target variable "status".

Importing the dataset



In [1]:

    
import pandas as pd
startups = pd.read_csv('data/startups_2.csv', index_col=0)
startups[:3]









    Out[1]:

	funding_total_usd	state_code	funding_rounds	founded_at	first_funding_at	last_funding_at	Category_Software	Category_Biotechnology	Category_Mobile	Category_Enterprise Software	...	undisclosed_funding_total_usd	undisclosed_funding_rounds	venture_funding_total_usd	venture_funding_rounds	number_of_acquisitions	number_of_investments	number_of_unique_investments	number_of_investors_per_round	avg_amount_invested_per_round	status
/organization/-qounter	0.000023	DE	0.055556	0.005917	0.002825	0.027344	0	0	0	0	...	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.000000	0.000000	operating
/organization/004-technologies	0.000000	IL	0.000000	0.018409	0.002410	0.033203	1	0	0	0	...	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.000000	0.000000	operating
/organization/0xdata	0.001117	CA	0.166667	0.015779	0.003906	0.001953	0	0	0	0	...	0.0	0.0	0.001739	0.214286	0.0	0.0	0.0	0.104167	0.000477	operating

3 rows × 90 columns

Let's start exploring the numerical features

Let's plot a heatmap showing how far the average of each feature for 'acquired' startups lies from the average over the complete set of startups, measured in standard deviations



In [2]:

    
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

def plot_avg_status_against_avg_total(df, status):
    """Heatmap of the z-score of each feature mean for one status against all startups."""
    startups_numeric = df.filter(regex=('(number_of|avg_).*|.*(funding_total_usd|funding_rounds|_at|status)'))
    startups_status = startups_numeric[startups_numeric['status'] == status]

    startups_numeric = startups_numeric.drop('status', axis=1)
    startups_status = startups_status.drop('status', axis=1)

    # (mean of the status group - overall mean) / overall standard deviation
    z_scores = (pd.DataFrame(startups_status.mean()).transpose() - startups_numeric.mean()) / startups_numeric.std(ddof=0)

    fig, ax = plt.subplots(figsize=(20, 20))
    ax.set_title(status + ' startups heatmap')
    sns.heatmap(z_scores, annot=True, cbar=False, square=True, ax=ax)

plot_avg_status_against_avg_total(startups, 'acquired')

The same for 'closed', 'ipo' and 'operating':



In [3]:

    
plot_avg_status_against_avg_total(startups, 'closed')



In [4]:

    
plot_avg_status_against_avg_total(startups, 'ipo')



In [5]:

    
plot_avg_status_against_avg_total(startups, 'operating')

The results are consistent with what we would expect. Acquired startups tend to have high venture_funding_rounds and low seed_funding_rounds, while closed startups have few funding_rounds in general and relatively high angel_funding_rounds.

The date variables also behave logically. Acquired and closed startups received their last funding longer ago than the rest of the startups, while operating startups were funded more recently.
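To make the numbers in the heatmaps easier to read, here is a minimal sketch of the quantity they show: the mean of a status group minus the overall mean, divided by the overall standard deviation. The toy values below are hypothetical and are not taken from the startups dataset.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the startups dataset
toy = pd.DataFrame({
    'funding_rounds': [1.0, 2.0, 3.0, 6.0],
    'status': ['closed', 'operating', 'acquired', 'acquired'],
})

numeric = toy.drop('status', axis=1)
acquired = numeric[toy['status'] == 'acquired']

# Standardized difference between the group mean and the overall mean
z = (acquired.mean() - numeric.mean()) / numeric.std(ddof=0)
print(z)
```

A positive value means the group average sits above the overall average for that feature, a negative value means it sits below.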



In [7]:

    
# Produce a scatter matrix for each pair of features in the data
#startups_funding_rounds = startups_numeric.filter(regex=('.*funding_total_usd'))
#pd.scatter_matrix(startups_funding_rounds, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Applying PCA to discover which features best explain the variance in the dataset



In [6]:

    
from sklearn.decomposition import PCA
import visuals as vs


startups_numeric = startups.filter(regex=('(number_of|avg_).*|.*(funding_total_usd|funding_rounds|_at)'))

# Apply PCA, keeping only the first 4 principal components
pca = PCA(n_components=4)
pca.fit(startups_numeric)


# Generate PCA results plot
pca_results = vs.pca_results(startups_numeric, pca)
startups_numeric[:3]









    Out[6]:

	funding_total_usd	funding_rounds	founded_at	first_funding_at	last_funding_at	angel_funding_total_usd	angel_funding_rounds	convertible_note_funding_total_usd	convertible_note_funding_rounds	debt_financing_funding_total_usd	...	seed_funding_rounds	undisclosed_funding_total_usd	undisclosed_funding_rounds	venture_funding_total_usd	venture_funding_rounds	number_of_acquisitions	number_of_investments	number_of_unique_investments	number_of_investors_per_round	avg_amount_invested_per_round
/organization/-qounter	0.000023	0.055556	0.005917	0.002825	0.027344	0.0	0.0	0.0	0.0	0.0	...	0.142857	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.000000	0.000000
/organization/004-technologies	0.000000	0.000000	0.018409	0.002410	0.033203	0.0	0.0	0.0	0.0	0.0	...	0.000000	0.0	0.0	0.000000	0.000000	0.0	0.0	0.0	0.000000	0.000000
/organization/0xdata	0.001117	0.166667	0.015779	0.003906	0.001953	0.0	0.0	0.0	0.0	0.0	...	0.142857	0.0	0.0	0.001739	0.214286	0.0	0.0	0.0	0.104167	0.000477

3 rows × 38 columns



In [7]:

    
import numpy as np

good_data = startups_numeric
dimensions = ['Dimension {}'.format(i) for i in range(1, len(pca.components_) + 1)]

# One row per principal component, one column per original feature
components = pd.DataFrame(np.round(pca.components_, 4), columns=good_data.keys())
components.index = dimensions
components









    Out[7]:

	funding_total_usd	funding_rounds	founded_at	first_funding_at	last_funding_at	angel_funding_total_usd	angel_funding_rounds	convertible_note_funding_total_usd	convertible_note_funding_rounds	debt_financing_funding_total_usd	...	seed_funding_rounds	undisclosed_funding_total_usd	undisclosed_funding_rounds	venture_funding_total_usd	venture_funding_rounds	number_of_acquisitions	number_of_investments	number_of_unique_investments	number_of_investors_per_round	avg_amount_invested_per_round
Dimension 1	0.0062	0.6377	0.0210	0.0094	-0.0906	0.0022	0.0031	0.0026	0.0109	0.0007	...	-0.0415	-0.0001	0.0041	0.0084	0.7168	0.0046	0.0002	0.0002	0.2107	0.0018
Dimension 2	-0.0013	0.1952	-0.0555	-0.0190	-0.4816	0.0027	0.0332	0.0006	0.0196	0.0000	...	0.8273	-0.0006	0.0018	-0.0023	-0.1967	-0.0053	-0.0006	-0.0006	0.0325	-0.0016
Dimension 3	-0.0001	-0.0040	0.0303	0.0295	0.7086	-0.0002	0.0031	-0.0007	-0.0199	-0.0007	...	0.3991	0.0009	0.0061	0.0016	-0.0039	0.0074	0.0009	0.0009	0.3209	0.0015
Dimension 4	0.0001	0.0752	0.0059	0.0162	0.3660	-0.0027	-0.0125	-0.0005	-0.0178	-0.0006	...	0.2052	0.0005	0.0002	0.0019	0.0286	0.0051	0.0004	0.0004	0.2016	0.0008

4 rows × 38 columns

The most important variables in each dimension are:

Dimension 1: funding_rounds, -last_funding_at, debt_financing_funding_rounds, venture_funding_rounds

Dimension 2: -funding_rounds, -last_funding_at, -seed_funding_rounds, venture_funding_rounds

Dimension 3: -last_funding_at, equity_crowdfunding_funding_rounds, -seed_funding_rounds

Dimension 4: last_funding_at, equity_crowdfunding_funding_rounds, seed_funding_rounds
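One way to pick out the dominant variables of a dimension is to rank its loadings by absolute value. The sketch below does this for Dimension 1, using only the columns visible in the components table above (the columns hidden behind "..." are left out, so the ranking is partial). The helper name `top_loadings` is my own and is not part of the notebook.

```python
# Loadings of Dimension 1 for some of the columns visible in the components
# table above; columns hidden behind "..." in the output are not included
dimension_1 = {
    'funding_rounds': 0.6377,
    'last_funding_at': -0.0906,
    'seed_funding_rounds': -0.0415,
    'venture_funding_rounds': 0.7168,
    'number_of_investors_per_round': 0.2107,
}

def top_loadings(loadings, n=3):
    """Hypothetical helper: the n loadings with the largest absolute value, sign kept."""
    ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]

for name, weight in top_loadings(dimension_1):
    print(name, weight)
```

Ranking by absolute value keeps strongly negative loadings in the list, which matters because a large negative weight is just as informative as a large positive one.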

Now I'll apply the same PCA algorithm, but just for startups with acquired status



In [9]:

    
startups_numeric_acquired = startups.filter(regex=('(number_of|avg_).*|.*(funding_total_usd|funding_rounds|_at|status)'))
startups_numeric_acquired = startups_numeric_acquired[startups_numeric_acquired['status'] == 'acquired']
startups_numeric_acquired = startups_numeric_acquired.drop('status', axis=1)

pca = PCA(n_components=4)
pca.fit(startups_numeric_acquired)

# Generate PCA results plot
pca_results = vs.pca_results(startups_numeric_acquired, pca)

Okay. We can see now that some features account for more of the variance than others.

We also see that the funding_rounds variables tend to dominate over the funding_total_usd values, and that last_funding_at carries a lot of the variance.
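As a rough illustration of why some features end up carrying most of the variance, here is a small sketch on synthetic data (not the startups dataset) that computes the share of variance explained by each principal component with a plain NumPy SVD. This is the same quantity scikit-learn exposes as `explained_variance_ratio_` on a fitted PCA; the column scales below are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic matrix: the first column has a much larger spread than the others
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# PCA works on centered data; the singular values give the component variances
X_centered = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
print(np.round(explained_variance_ratio, 3))
```

The column with the widest spread takes almost all of the first component, which is one possible reason a feature can dominate the PCA results when the columns do not share a comparable spread.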

Let's start playing with non-numerical variables: states and categories



In [11]:

    
#startups_numeric = df.filter(regex=('.*(funding_total_usd|funding_rounds|status)'))
startups_non_numeric = startups.filter(regex=('^((?!(_acquisitions|_investments|_per_round|funding_total_usd|funding_rounds|_at)).)*$'))
startups_non_numeric[:3]









    Out[11]:

	state_code	Category_Software	Category_Biotechnology	Category_Mobile	Category_Enterprise Software	Category_E-Commerce	Category_Curated Web	Category_Health Care	Category_Social Media	Category_Advertising	...	Category_Search	Category_Semiconductors	Category_Information Technology	Category_FinTech	Category_Retail	Category_Financial Services	Category_Messaging	Category_Social Network Media	Category_Startups	status
/organization/-qounter	DE	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	operating
/organization/004-technologies	IL	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	operating
/organization/0xdata	CA	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	operating

3 rows × 52 columns

Let's try a decision tree on the categories and see what performance we get.



In [12]:

    
startups_non_numeric['status'].value_counts()
# Binary target: 1 for acquired startups, 0 for every other status
startups_non_numeric['acquired'] = startups_non_numeric['status'].map({'operating': 0, 'acquired':1, 'closed':0, 'ipo':0})
startups_non_numeric = startups_non_numeric.drop('status', axis=1)
startups_non_numeric[:3]









    Out[12]:

	state_code	Category_Software	Category_Biotechnology	Category_Mobile	Category_Enterprise Software	Category_E-Commerce	Category_Curated Web	Category_Health Care	Category_Social Media	Category_Advertising	...	Category_Search	Category_Semiconductors	Category_Information Technology	Category_FinTech	Category_Retail	Category_Financial Services	Category_Messaging	Category_Social Network Media	Category_Startups	acquired
/organization/-qounter	DE	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
/organization/004-technologies	IL	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
/organization/0xdata	CA	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

3 rows × 52 columns



In [14]:

    
import subprocess
import sys

from sklearn import tree

def visualize_tree(tree_model, feature_names):
    """Create tree png using graphviz.

    Args
    ----
    tree_model -- scikit-learn DecisionTree.
    feature_names -- list of feature names.
    """
    with open("dt.dot", 'w') as f:
        tree.export_graphviz(tree_model, out_file=f,
                        feature_names=feature_names)

    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    try:
        subprocess.check_call(command)
    except (OSError, subprocess.CalledProcessError):
        # dot is missing from the PATH or returned a non-zero exit code
        sys.exit("Could not run dot, i.e. graphviz, to "
                 "produce visualization")




In [15]:

    
#import visuals_tree as vs_tree
#vs_tree.ModelLearning(startups_non_numeric.drop(['acquired','state_code'], axis=1), startups_non_numeric['acquired'])
from sklearn import tree
from sklearn import preprocessing
# GridSearchCV and cross_val_score live in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import GridSearchCV, cross_val_score


#clf = tree.DecisionTreeClassifier(random_state=0)
#cross_val_score(clf, startups_non_numeric.drop(['acquired','state_code'], axis=1),  startups_non_numeric['acquired'], cv=10)


#Drop state_code feature
features = startups_non_numeric.drop(['acquired','state_code'], axis=1)

#Convert state_code feature to number
#features = startups_non_numeric.drop(['acquired'], axis=1)
#features['state_code'] = preprocessing.LabelEncoder().fit_transform(features['state_code'])

#Convert state_code to dummy variables
features = pd.get_dummies(startups_non_numeric.drop(['acquired'], axis=1), prefix='state', columns=['state_code'])


#Merge numeric_features to non-numeric-features
features_all = pd.concat([features, startups_numeric], axis=1, ignore_index=False)
#features = features_all

#Use only the numeric features for this run
features = startups_numeric


parameters = {'max_depth': list(range(5, 20))}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=5, scoring='roc_auc')
clf.fit(X=features, y=startups_non_numeric['acquired'])
tree_model = clf.best_estimator_
print((clf.best_score_, clf.best_params_))
print(tree.export_graphviz(clf.best_estimator_, feature_names=list(features.columns)))









    



(0.83932271629844135, {'max_depth': 5})
None



In [16]:

    
from importlib import reload  # reload is a builtin only on Python 2

import visuals_tree as vs_tree
vs_tree = reload(vs_tree)
vs_tree.ModelComplexity(features_all, startups_non_numeric['acquired'])

Categories and states alone are not enough to make a good prediction: with them, a maximum roc_auc of 0.64 was achieved. With the numeric attributes, a simple DecisionTreeClassifier achieved a roc_auc of 0.84.
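For readers less familiar with the roc_auc score used above, here is a minimal sketch of what it measures: the probability that a randomly chosen positive example (an acquired startup) gets a higher score than a randomly chosen negative one, with ties counted as half. The function `roc_auc` and the toy labels and scores are my own illustration, not the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """Pairwise definition of ROC AUC (illustrative, quadratic in the number of samples)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half

    return wins / (len(positives) * len(negatives))

# Toy labels and scores: hypothetical values, not model output from this notebook
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

A score of 0.5 corresponds to random ranking, so 0.64 is only a modest improvement over chance while 0.84 shows a much better separation of acquired startups from the rest.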

Saving the dataset ready to be tested by different learning algorithms



In [17]:

    
# Named all_features to avoid shadowing the builtin all()
all_features = pd.concat([features_all, startups_non_numeric['acquired']], axis=1, ignore_index=False)
all_features.to_csv('data/startups_3.csv')



In [26]:

    
all_with_status = all_features.join(startups['status'])
all_with_status_without_operating = all_with_status[all_with_status['status'] != 'operating']
all_with_status_without_operating.shape
all_without_operating = all_with_status_without_operating.drop('status', axis=1)
all_without_operating.to_csv('data/startups_not_operating_3.csv')
