Working with data 2017. Class 8


Javier Garcia-Bernardo

1. Clustering

2. Data imputation

3. Dimensionality reduction

1. Clustering

#Some libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import DBSCAN, KMeans

#Read the data, drop rows with missing values, take a sample of 300 rows
df = pd.read_csv("data/big3_position.csv",sep="\t").dropna()
df["Revenue"] = np.log10(df["Revenue"])
df["Assets"] = np.log10(df["Assets"])
df["Employees"] = np.log10(df["Employees"])
df["MarketCap"] = np.log10(df["MarketCap"])
df = df.replace([np.inf,-np.inf],np.nan).dropna().sample(300)

      Company_name                    Company_ID   Big3Share  Position  Revenue   Assets    Employees  MarketCap  Exchange                TypeEnt
3130  APPLIED OPTOELECTRONICS, INC.   US760533927  9.76       2         5.278532  5.436918  3.400192   5.240180   NASDAQ National Market  Industrial company
755   MONOTYPE IMAGING HOLDINGS INC.  US203289482  19.78      1         5.284248  5.593050  2.693727   5.982111   NASDAQ National Market  Industrial company
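
The log transform above returns -inf for entries equal to zero, which is why infinite values are replaced by NaN before dropna() is called. A minimal sketch with made-up numbers (the toy frame is hypothetical and not taken from big3_position.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one company reports zero revenue
toy = pd.DataFrame({"Revenue": [100.0, 0.0, 1000.0]})

# log10(0) is -inf; silence the divide-by-zero warning for this demo
with np.errstate(divide="ignore"):
    toy["Revenue"] = np.log10(toy["Revenue"])

# Turn infinities into NaN so that dropna() can remove those rows
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()
print(len(toy), len(clean))
```

Without the replace step, dropna() would keep the -inf row and the scaling in the next cell would break.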

#Scale variables to give all of them the same weight
X = df.loc[:,["Revenue","Assets","Employees","MarketCap"]]
X = preprocessing.scale(X)
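
preprocessing.scale standardizes each column to zero mean and unit variance, so a variable measured in large units does not dominate the distances used by the clustering methods. A small sketch of the same computation in plain NumPy, using made-up values (X_toy is hypothetical):

```python
import numpy as np

# Hypothetical matrix: 3 observations, 2 variables on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

# Subtract the column mean and divide by the column standard deviation
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```

After scaling, both toy columns hold the same values even though the raw numbers differ by a factor of 1000.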

1a. Clustering with K-means

  • k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
  • Other methods:

#Get labels of each row and add a new column with the labels
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
labels = kmeans.labels_
df["kmeans_labels"] = labels
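
To make the "nearest mean" idea concrete, here is a minimal sketch of the classic alternating procedure: assign each observation to its nearest center, then move each center to the mean of its members. The function kmeans_sketch and the toy data are hypothetical, written for illustration only; this is not how scikit-learn implements KMeans internally.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Plain assign/update iterations; a teaching sketch only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every observation
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each center to the mean of its members
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels_toy, centers_toy = kmeans_sketch(X_toy, k=2)
print(labels_toy)
```

The numeric label of each group depends on the random start, so only the grouping itself is meaningful, not whether a cluster is called 0 or 1.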

1b. Clustering with DBSCAN

  • The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as oppos

#Get labels of each row and add a new column with the labels
db = DBSCAN(eps=1, min_samples=10).fit(X)
labels = db.labels_
df["dbscan_labels"] = labels
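
Unlike k-means, DBSCAN does not force every observation into a cluster: points in low-density regions receive the label -1 (noise). A small sketch on made-up data (X_toy and the parameter values are hypothetical) that counts clusters and noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups and one isolated point
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                  [10.0, -10.0]])

toy_labels = DBSCAN(eps=0.5, min_samples=3).fit(X_toy).labels_

# Noise points get the label -1, so exclude it from the cluster count
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
n_noise = int(np.sum(toy_labels == -1))
print(n_clusters, n_noise)  # prints: 2 1
```

The same check applied to df["dbscan_labels"] shows how many companies were left unassigned for a given eps and min_samples.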

1c. Hierarchical clustering

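  • Agglomerative approach: every element starts as its own cluster, and the two closest clusters are merged repeatedly until a single tree (dendrogram) remains.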

import scipy
import pylab
import scipy.cluster.hierarchy as sch

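from scipy.spatial.distance import squareform

# Generate a distance matrix based on the differences between columns (variables)
D = np.zeros([4,4])
for i in range(4):
    for j in range(4):
        D[i,j] = np.sum(np.abs(X[:,i]-X[:,j])) #Manhattan distance; Euclidean distance or mutual information are also common

#Create the linkage from the condensed distance matrix and plot
#(a square matrix passed directly would be treated as raw observations)
#Note: 'centroid' assumes Euclidean distances; 'average' or 'complete' accept any metric
Y = sch.linkage(squareform(D), method='centroid') #many methods, single, complete...
Z1 = sch.dendrogram(Y, orientation='right',labels=["Revenue","Assets","Employees","MarketCap"])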
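
The dendrogram shows the full merge tree; to obtain flat groups, the tree can be cut at a chosen number of clusters. A minimal sketch, assuming made-up data (X_toy is hypothetical) and swapping in average linkage, which is valid for any distance:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical toy matrix: 5 observations (rows) of 4 variables (columns);
# columns 0 and 1 move together, and so do columns 2 and 3
X_toy = np.array([[1.0, 1.1, 9.0, 9.2],
                  [2.0, 2.1, 7.0, 7.1],
                  [3.0, 2.9, 5.0, 5.2],
                  [4.0, 4.2, 3.0, 2.9],
                  [5.0, 5.1, 1.0, 1.1]])

# pdist compares rows, so transpose to compare variables;
# "cityblock" is the sum of absolute differences
d = pdist(X_toy.T, metric="cityblock")

# Build the tree with average linkage, then cut it into 2 groups
Y_toy = sch.linkage(d, method="average")
groups = sch.fcluster(Y_toy, t=2, criterion="maxclust")
print(groups)
```

pdist returns the condensed form that linkage expects, so no explicit double loop or conversion is needed.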

2. Imputation of missing data (fancy)

#Required libraries
!conda install tensorflow -y
!pip install fancyimpute
!pip install pydot_ng

import sklearn.preprocessing
import sklearn

#Read the data again, but do not drop the rows with missing values
df = pd.read_csv("data/big3_position.csv",sep="\t")
df["Revenue"] = np.log10(df["Revenue"])
df["Assets"] = np.log10(df["Assets"])
df["Employees"] = np.log10(df["Employees"])
df["MarketCap"] = np.log10(df["MarketCap"])

le = sklearn.preprocessing.LabelEncoder()
labels = le.fit_transform(df["TypeEnt"])
df["TypeEnt_int"] = labels


df = df.replace([np.inf,-np.inf],np.nan).sample(300)

['Bank' 'Financial company' 'Foundation/Research institute'
 'Industrial company' 'Insurance company' 'Venture capital']
      Company_name             Company_ID   Big3Share  Position  Revenue   Assets    Employees  MarketCap  Exchange                TypeEnt             TypeEnt_int
1142  MAGNEGAS CORP            US260250418  0.18       8         3.385785  4.249467  NaN        4.575361   NASDAQ National Market  Industrial company  3
878   DS HEALTHCARE GROUP INC  US208380461  1.37       5         4.243038  NaN       1.544068   4.179293   NASDAQ National Market  Industrial company  3
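
The TypeEnt_int column comes from LabelEncoder, which sorts the distinct categories and replaces each one by its position in that sorted list. A short sketch on a made-up column (types and le_toy are hypothetical names):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with three entity types
types = ["Industrial company", "Bank", "Industrial company", "Venture capital"]

le_toy = LabelEncoder()
codes = le_toy.fit_transform(types)

# classes_ is sorted alphabetically; a code is the position in that array
print(le_toy.classes_.tolist())  # ['Bank', 'Industrial company', 'Venture capital']
print(codes.tolist())            # [1, 0, 1, 2]

# inverse_transform maps the integers back to the original strings
decoded = le_toy.inverse_transform(codes).tolist()
print(decoded == types)          # True
```

The integer codes carry no real ordering, so a distance-based method will treat categories 0 and 1 as closer than 0 and 5. One-hot encoding is a common alternative when that matters.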

X = df.loc[:,["Revenue","Assets","Employees","MarketCap","TypeEnt_int"]].values

from fancyimpute import KNN

# X is the incomplete data matrix: missing values are marked with NaN

# Use 10 nearest rows which have a feature to fill in each row's missing features
# (in newer fancyimpute releases this method is called fit_transform)
cols = ["Revenue","Assets","Employees","MarketCap","TypeEnt_int"]
X_filled_knn = KNN(k=10).complete(X)
df.loc[:,cols] = X_filled_knn
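
If fancyimpute is not available, the same idea can be tried with scikit-learn's KNNImputer. This is a related but separate implementation swapped in for illustration, so its results are not guaranteed to match fancyimpute's KNN. The toy matrix below is hypothetical:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with two missing entries
X_toy = np.array([[1.0, 2.0, 3.0],
                  [1.1, np.nan, 3.1],
                  [0.9, 2.1, 2.9],
                  [8.0, 9.0, np.nan],
                  [8.1, 9.1, 10.0]])

# Each gap is filled with the mean of that column over the 2 nearest rows
# that have a value there (distances use the columns both rows observe)
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)
print(np.round(X_filled, 2))
```

Observed values are left untouched; only the NaN entries change. As with clustering, scaling the columns first keeps one variable from dominating the neighbor search.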


