Working with data 2017. Class 8

Contact

Javier Garcia-Bernardo garcia@uva.nl

1. Clustering

2. Data imputation

3. Dimensionality reduction


In [2]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))

import seaborn as sns
import pylab as plt
import pandas as pd
import numpy as np
import scipy.stats

import statsmodels.formula.api as smf
import sklearn
from sklearn.model_selection import train_test_split


3. Dimensionality reduction

  • Many times we want to combine variables (for linear regression to avoid multicollinearity, to create indexes, etc)

Our data


In [3]:
#Read data
df_companies = pd.read_csv("data/big3_position.csv",sep="\t")
df_companies["log_revenue"] = np.log10(df_companies["Revenue"])
df_companies["log_assets"] = np.log10(df_companies["Assets"])
df_companies["log_employees"] = np.log10(df_companies["Employees"])
df_companies["log_marketcap"] = np.log10(df_companies["MarketCap"])

#Keep only industrial companies
df_companies = df_companies.loc[:,["log_revenue","log_assets","log_employees","log_marketcap","Company_name","TypeEnt"]]
df_companies = df_companies.loc[df_companies["TypeEnt"]=="Industrial company"]

#Dropnans
df_companies = df_companies.replace([np.inf,-np.inf],np.nan)
df_companies = df_companies.dropna()


df_companies.head()


/home/javiergb/Programs/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:3: RuntimeWarning: divide by zero encountered in log10
  app.launch_new_instance()
/home/javiergb/Programs/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:3: RuntimeWarning: invalid value encountered in log10
  app.launch_new_instance()
/home/javiergb/Programs/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:6: RuntimeWarning: divide by zero encountered in log10
Out[3]:
log_revenue log_assets log_employees log_marketcap Company_name TypeEnt
23 3.634679 4.271609 1.633468 5.483054 PROTEOSTASIS THERAPEUTICS, INC. Industrial company
24 4.645913 5.266613 2.029384 5.488190 TELIGENT, INC. Industrial company
27 6.204633 6.168790 3.832509 6.906554 IDEXX LABORATORIES INC Industrial company
32 7.469174 7.988933 4.201397 7.739554 CONOCOPHILLIPS Industrial company
33 6.080771 6.175744 3.732394 6.037998 ENPRO INDUSTRIES, INC. Industrial company

Correlation between variables


In [4]:
# Compute the correlation matrix
corr = df_companies.corr()

# Generate a mask for the upper triangle (hide the upper triangle)
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, square=True,linewidths=.5,cmap="YlOrRd",vmin=0,vmax=1)

plt.show()


Revenue, employees and assets are highly correlated.

Let's imagine wwe want to explain the market capitalization in terms of the other variables.


In [5]:
mod = smf.ols(formula='log_marketcap ~ log_revenue + log_employees  + log_assets', data=df_companies)
res = mod.fit()
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          log_marketcap   R-squared:                       0.770
Model:                            OLS   Adj. R-squared:                  0.770
Method:                 Least Squares   F-statistic:                     2773.
Date:                Thu, 02 Feb 2017   Prob (F-statistic):               0.00
Time:                        08:26:56   Log-Likelihood:                -1431.3
No. Observations:                2484   AIC:                             2871.
Df Residuals:                    2480   BIC:                             2894.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept         0.7175      0.067     10.673      0.000         0.586     0.849
log_revenue      -0.0524      0.022     -2.424      0.015        -0.095    -0.010
log_employees     0.0912      0.020      4.585      0.000         0.052     0.130
log_assets        0.8800      0.021     42.562      0.000         0.839     0.921
==============================================================================
Omnibus:                      172.670   Durbin-Watson:                   2.016
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              285.008
Skew:                          -0.531   Prob(JB):                     1.29e-62
Kurtosis:                       4.275   Cond. No.                         72.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [6]:
#The residuals are fine
plt.figure(figsize=(4,3))
sns.regplot(res.predict(),df_companies["log_marketcap"] -res.predict())


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1cd497f5c0>

In [7]:
#Get many models to see hwo coefficient changes
from statsmodels.iolib.summary2 import summary_col

mod1 = smf.ols(formula='log_marketcap ~ log_revenue + log_employees  + log_assets', data=df_companies).fit()
mod2 = smf.ols(formula='log_marketcap ~ log_revenue + log_assets', data=df_companies).fit()
mod3 = smf.ols(formula='log_marketcap ~ log_employees  + log_assets', data=df_companies).fit()
mod4 = smf.ols(formula='log_marketcap ~ log_assets', data=df_companies).fit()

mod5 = smf.ols(formula='log_marketcap ~ log_revenue + log_employees ', data=df_companies).fit()
mod6 = smf.ols(formula='log_marketcap ~ log_revenue ', data=df_companies).fit()

mod7 = smf.ols(formula='log_marketcap ~ log_employees  ', data=df_companies).fit()

output = summary_col([mod1,mod2,mod3,mod4,mod5,mod6,mod7],stars=True)
print(mod1.rsquared_adj,mod2.rsquared_adj,mod3.rsquared_adj,mod4.rsquared_adj,mod5.rsquared_adj,mod6.rsquared_adj,mod7.rsquared_adj)
output


/home/javiergb/Programs/anaconda3/lib/python3.5/site-packages/statsmodels/iolib/summary2.py:372: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  res.ix[:, 0][idx] = res.ix[:, 0][idx] + '*'
/home/javiergb/Programs/anaconda3/lib/python3.5/site-packages/statsmodels/iolib/summary2.py:374: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  res.ix[:, 0][idx] = res.ix[:, 0][idx] + '*'
/home/javiergb/Programs/anaconda3/lib/python3.5/site-packages/statsmodels/iolib/summary2.py:376: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  res.ix[:, 0][idx] = res.ix[:, 0][idx] + '*'
0.770085449982 0.768229760717 0.769633653357 0.768295907742 0.602303466161 0.591747385198 0.529313920375
Out[7]:
log_marketcap I log_marketcap II log_marketcap III log_marketcap IIII log_marketcap IIIII log_marketcap IIIIII log_marketcap IIIIIII
Intercept 0.7175*** 0.5808*** 0.6901*** 0.5758*** 2.4392*** 2.1779*** 3.6642***
(0.0672) (0.0605) (0.0663) (0.0598) (0.0706) (0.0638) (0.0449)
log_assets 0.8800*** 0.8935*** 0.8505*** 0.9032***
(0.0207) (0.0205) (0.0167) (0.0100)
log_employees 0.0912*** 0.0612*** 0.2117*** 0.7000***
(0.0199) (0.0156) (0.0259) (0.0132)
log_revenue -0.0524** 0.0092 0.4906*** 0.6563***
(0.0216) (0.0170) (0.0230) (0.0109)

3.1 Combining variables

  • Multiplying/summing variables
  • It's easy
  • It's arbitrary

In [8]:
X = df_companies.loc[:,["log_revenue","log_employees","log_assets"]]
X.head(2)


Out[8]:
log_revenue log_employees log_assets
23 3.634679 1.633468 4.271609
24 4.645913 2.029384 5.266613

In [9]:
#Let's scale all the columns to have mean 0 and std 1
from sklearn.preprocessing import scale
X_to_combine = scale(X)
X_to_combine


Out[9]:
array([[-1.99571299, -1.73647635, -1.91571154],
       [-1.03556037, -1.31257881, -0.77455883],
       [ 0.44442179,  0.61798627,  0.26013323],
       ..., 
       [ 0.93584733,  1.17438899,  0.59298665],
       [-0.87417016, -0.80028962, -1.19760982],
       [-0.07417925,  0.82440959, -0.64435546]])

In [10]:
#In this case we sum them together
X_combined = np.sum(X_to_combine,axis=1)
X_combined


Out[10]:
array([-5.64790088, -3.12269802,  1.32254129, ...,  2.70322297,
       -2.8720696 ,  0.10587488])

In [11]:
#Add a new column with our combined variable and run regression
df_companies["combined"] = X_combined
print(smf.ols(formula='log_marketcap ~ combined  ', data=df_companies).fit().summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          log_marketcap   R-squared:                       0.694
Model:                            OLS   Adj. R-squared:                  0.693
Method:                 Least Squares   F-statistic:                     5617.
Date:                Thu, 02 Feb 2017   Prob (F-statistic):               0.00
Time:                        08:26:58   Log-Likelihood:                -1789.8
No. Observations:                2484   AIC:                             3584.
Df Residuals:                    2482   BIC:                             3595.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.9428      0.010    595.258      0.000         5.923     5.962
combined       0.2625      0.004     74.944      0.000         0.256     0.269
==============================================================================
Omnibus:                       30.536   Durbin-Watson:                   2.021
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               38.383
Skew:                          -0.182   Prob(JB):                     4.63e-09
Kurtosis:                       3.488   Cond. No.                         2.85
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3.2 PCA

  • Keep all the info
  • The resulting variables do not actually mean much

In [12]:
#Do the fitting
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
new_X = pca.fit_transform(X)

print("Explained variance")
print(pca.explained_variance_ratio_)
print()

print("Weight of components")
print(["log_revenue","log_employees","log_assets"])
print(pca.components_)
print()

new_X


Explained variance
[ 0.90754131  0.05796448]

Weight of components
['log_revenue', 'log_employees', 'log_assets']
[[-0.65102557 -0.55826228 -0.51430431]
 [ 0.04059483 -0.70219538  0.71082608]]

Out[12]:
array([[ 3.13287507, -0.13380895],
       [ 1.74177643,  0.33650637],
       [-0.74360056, -0.22507263],
       ..., 
       [-1.51992984, -0.36267534],
       [ 1.55371598, -0.25477687],
       [-0.0900414 , -0.9432188 ]])

In [13]:
#Create our new variables (2 components, so 2 variables)
df_companies["pca_x1"] = new_X[:,0]
df_companies["pca_x2"] = new_X[:,1]

In [14]:
print(smf.ols(formula='log_marketcap ~ pca_x1 + pca_x2 ', data=df_companies).fit().summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          log_marketcap   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.741
Method:                 Least Squares   F-statistic:                     3545.
Date:                Thu, 02 Feb 2017   Prob (F-statistic):               0.00
Time:                        08:26:59   Log-Likelihood:                -1581.7
No. Observations:                2484   AIC:                             3169.
Df Residuals:                    2481   BIC:                             3187.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.9428      0.009    647.146      0.000         5.925     5.961
pca_x1        -0.4694      0.006    -80.630      0.000        -0.481    -0.458
pca_x2         0.5594      0.023     24.284      0.000         0.514     0.605
==============================================================================
Omnibus:                       95.394   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              148.223
Skew:                          -0.347   Prob(JB):                     6.51e-33
Kurtosis:                       3.975   Cond. No.                         3.96
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [15]:
print("Before")
sns.lmplot("log_revenue","log_assets",data=df_companies,fit_reg=False)


Before
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x7f1cd7f6dfd0>

In [16]:
print("After")
sns.lmplot("pca_x1","pca_x2",data=df_companies,fit_reg=False)


After
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x7f1cce309128>

3.3 Factor analysis

  • The observations are assumed to be caused by a linear transformation of lower dimensional latent factors and added Gaussian noise. Without loss of generality the factors are distributed according to a Gaussian with zero mean and unit covariance. The noise is also zero mean and has an arbitrary diagonal covariance matrix.

In [43]:
from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=2)
new_X = fa.fit_transform(X)

print("Weight of components")
print(["log_revenue","log_employees","log_assets"])
print(fa.components_)
print()

new_X


Weight of components
['log_revenue', 'log_employees', 'log_assets']
[[-1.01579117 -0.85478902 -0.79054836]
 [ 0.00155853 -0.143005    0.14338296]]

Out[43]:
array([[ 1.95803243, -0.16050885],
       [ 1.06995109,  0.43027854],
       [-0.45630827, -0.28772932],
       ..., 
       [-0.93899904, -0.4656501 ],
       [ 0.9512708 , -0.32753938],
       [-0.0206818 , -1.19533099]])

In [44]:
#New variables
df_companies["fa_x1"] = new_X[:,0]
df_companies["fa_x2"] = new_X[:,1]

In [45]:
print(smf.ols(formula='log_marketcap ~ fa_x1  + fa_x2 ', data=df_companies).fit().summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          log_marketcap   R-squared:                       0.730
Model:                            OLS   Adj. R-squared:                  0.730
Method:                 Least Squares   F-statistic:                     3359.
Date:                Thu, 02 Feb 2017   Prob (F-statistic):               0.00
Time:                        08:21:02   Log-Likelihood:                -1631.1
No. Observations:                2484   AIC:                             3268.
Df Residuals:                    2481   BIC:                             3286.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      5.9428      0.009    634.407      0.000         5.924     5.961
fa_x1         -0.7495      0.010    -78.442      0.000        -0.768    -0.731
fa_x2          0.4382      0.018     23.762      0.000         0.402     0.474
==============================================================================
Omnibus:                       85.114   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              140.476
Skew:                          -0.297   Prob(JB):                     3.13e-31
Kurtosis:                       4.002   Cond. No.                         1.97
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [46]:
print("After")
sns.lmplot("fa_x1","fa_x2",data=df_companies,fit_reg=False)


After
Out[46]:
<seaborn.axisgrid.FacetGrid at 0x7f8b480a6470>

Difference between FA and PCA

http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi

  • Principal component analysis involves extracting linear composites of observed variables.
  • Factor analysis is based on a formal model predicting observed variables from theoretical latent factors.

3.4 Methods to avoid overfitting (Machine learning with regularization)

Both have a regularization parameter, that penalizes having many terms. How to choose the best value of this parameter?


In [47]:
Image(url="http://www.holehouse.org/mlclass/07_Regularization_files/Image.png")


Out[47]:

In [19]:
from sklearn.model_selection import train_test_split
y = df_companies["log_marketcap"]
X = df_companies.loc[:,["log_revenue","log_employees","log_assets"]]
X.head(2)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
X_train.head()


Out[19]:
log_revenue log_employees log_assets
836 6.455242 4.285557 6.615582
189 5.075985 2.655138 5.081912
1911 7.286230 4.431364 6.876073
3858 5.644867 2.518514 6.168640
1809 6.883773 4.181844 6.284532

Linear regression (to compare)


In [25]:
df_train = X_train.copy()
df_train["log_marketcap"] = y_train
df_train.head()

mod = smf.ols(formula='log_marketcap ~ log_revenue + log_employees  + log_assets', data=df_train).fit()
print("log_revenue 	log_employees 	log_assets ")
print(mod.params.values[1:])


log_revenue 	log_employees 	log_assets 
[-0.0745411   0.09301402  0.90935025]

SVR

  • Gives balanced weights (the most correlated independent variable (with the dependent) doesn't take all the weight).
  • Very good when you have hundreds of variables. You can iteratively drop the worst predictor.
  • It allow for more than linear "regression". The default kernel is "rbf", which fits curves. The problem is that interpreting it is hard.

In [56]:
from sklearn.svm import SVR
clf = SVR(C=0.1, epsilon=0.2,kernel="linear")
clf.fit(X_train, y_train) 

print("log_revenue 	log_employees 	log_assets ")
print(clf.coef_)


log_revenue 	log_employees 	log_assets 
[[-0.09292941  0.11011615  0.91769297]]

Lasso

  • Have a penalty
  • Discards the variables with low weights.

In [53]:
from sklearn import linear_model
reg = linear_model.Lasso(alpha = 0.01)
reg.fit(X_train,y_train)

print("log_revenue 	log_employees 	log_assets ")
print(reg.coef_)


log_revenue 	log_employees 	log_assets 
[-0.          0.04451499  0.85993465]

Summary


In [57]:
print(["SVR","Lasso","Linear regression"])
err1,err2,err3 = sklearn.metrics.mean_squared_error(clf.predict(X_test),y_test),sklearn.metrics.mean_squared_error(reg.predict(X_test),y_test),sklearn.metrics.mean_squared_error(mod.predict(X_test),y_test)
print(err1,err2,err3)


['SVR', 'Lasso', 'Linear regression']
0.19001609378 0.18711028571 0.188061519375

In [58]:
print(["SVR","Lasso","Linear regression"])
err1,err2,err3 = sklearn.metrics.r2_score(clf.predict(X_test),y_test),sklearn.metrics.r2_score(reg.predict(X_test),y_test),sklearn.metrics.r2_score(mod.predict(X_test),y_test)
print(err1,err2,err3)


['SVR', 'Lasso', 'Linear regression']
0.694444205162 0.688371093093 0.695033719486