ECSI, the European Customer Satisfaction Index, is a statistical methodology for identifying which factors contribute most to the creation of customer satisfaction and loyalty.
ECSI was inspired by the work of the Swedish professor Claes Fornell, one of the most influential scholars in marketing science today. His name appears on some of the most highly cited academic papers in the leading journals of the field, and he is regarded as an outstanding authority on the measurement of customer satisfaction and customer assets. In 1989 he authored the model for the Swedish economy, the Swedish Customer Satisfaction Barometer (SCSB). Five years later, in October 1994, Fornell and his colleagues at the University of Michigan, in conjunction with the American Society for Quality and the CFI Group, developed the American Customer Satisfaction Index (ACSI), an economic indicator that measures the satisfaction of consumers across the U.S. economy.
A few years later, ECSI was initiated by the European Commission in collaboration with the European Foundation for Quality Management (EFQM) and the European Organization for Quality (EOQ), along with a network of universities and business schools. The ECSI Technical Committee developed the method of analysis, the econometric model and the causality analysis, derived from the ACSI.
The data are collected from surveys: respondents are asked to rate their experience of the individual organisations they have dealt with in the previous six months, on a scale of 1 to 10, across a series of metrics covering perceptions of image, quality and price, complaint handling, and attitudes towards loyalty and trust. The metrics reflect the priorities that customers have identified, according to the research, as the most important attributes of the customer experience. Overall scores for each sector are the means of all responses for that sector.
The ECSI score for each organisation, sector or country is the average of all of its customers’ satisfaction scores. These scores are then multiplied by ten so that the index scores are expressed as a number out of 100.
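To make that arithmetic concrete, here is a minimal sketch, assuming a small hypothetical `scores` Series of per-customer ratings on the 1-10 scale (illustrative only, not data from the study):

In [ ]:
# Hypothetical per-customer satisfaction ratings on the 1-10 scale
import pandas as pd
scores = pd.Series([8, 7, 9, 6, 10, 7])
ecsi_score = scores.mean() * 10   # average the ratings, then rescale to an index out of 100
print(round(ecsi_score, 1))       # 78.3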
This methodical and reliable approach offers a unique benchmarking capability across companies, industries, countries and time periods, with the confidence that the results are consistent and have been validated over time in the scientific research literature.
Structural equation modeling (SEM) was applied to assess latent variables at the observation level and test relationships between latent variables on the theoretical level.
We have considered two methods:
- the covariance-based technique (CB-SEM; Jöreskog 1978, 1993);
- variance-based partial least squares (PLS-SEM; Lohmöller 1989; Wold 1982, 1985).
Both methods share the same roots (Jöreskog and Wold 1982), but previous marketing research has focused primarily on CB-SEM (e.g., Bagozzi 1994; Baumgartner and Homburg 1996; Steenkamp and Baumgartner 2000).
Covariance-based SEM (CB-SEM), originally developed by Jöreskog (1978), is primarily used to confirm theories and to validate relationships between variables that can be measured directly or indirectly.
PLS-SEM, or PLS path modeling, was developed by Wold (1975, 1982) as a second-generation technique to overcome the weaknesses of CB-SEM, and was later extended by Lohmöller (1989). It is a very useful method for developing theories in exploratory research, focusing on explaining the variance in the dependent variables when examining the model.
In [1]:
# Plotting libraries and style
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("seaborn")
# %pylab pulls numpy and matplotlib.pyplot names (figure, subplot, scatter, ...) into the namespace
%pylab inline
#%matplotlib inline
# Scientific libraries
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from warnings import filterwarnings
filterwarnings("ignore")  # silence library warnings
In [2]:
# Load the survey responses; the codes 9999 and 99 mark missing answers (NS/NR)
mqm = pd.read_csv('../BD_MQM.csv', index_col = "Respondent", sep=";",
                  encoding="utf8", skipinitialspace=True, na_values=['9999','99'])
The variate: each construct is a linear combination of several variables, chosen on the basis of the questionnaire items. The responses from 1758 respondents were arranged in a data matrix. All factor loadings were significant and greater than 0.50, as suggested by Hair, Black, Babin and Anderson (2010).
In [3]:
image = ['Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E']
expectations = ['Q5A', 'Q5B','Q5C']
perceivedQuality = ['Q6', 'Q7A', 'Q7B', 'Q7C', 'Q7D', 'Q7E', 'Q7F', 'Q7G', 'Q7H']
perceivedValue = ['Q10', 'Q11']
satisfaction = ['Q3', 'Q9', 'Q18']
complaints = ['Q1516']
loyalty = ['Q12', 'Q17']
socio = ['Bank', 'B1', 'B5', 'B6', 'B8']
The measurement: the phenomenon being measured is abstract, complex, and not directly observable. Satisfaction, loyalty, and perceived image, value and quality are latent variables (constructs) and are unobservable. However, the indicators, or manifest items, each represent a single, separate aspect (a proxy) of those larger abstract concepts.
The combination of several items in a scale, a multi-item scale, was used to indirectly measure each of those concepts forming a single accurate composite score - the variate score.
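As an informal sketch of that idea (assuming equal item weights, whereas the estimated model uses weighted variates), each composite below is the plain mean of its items; the correlation of each image item with its composite is only a rough, unofficial stand-in for the factor loadings mentioned above:

In [ ]:
# Unweighted composite (variate) scores: the mean of each construct's items,
# ignoring missing answers (assumption: equal weights, unlike the estimated ECSI variates)
constructs = {'Image': image, 'Expectations': expectations,
              'PerceivedQuality': perceivedQuality, 'PerceivedValue': perceivedValue,
              'Satisfaction': satisfaction, 'Complaints': complaints, 'Loyalty': loyalty}
composites = pd.DataFrame({name: mqm[items].mean(axis=1) for name, items in constructs.items()})
# Rough check: correlation of each image item with its composite
print(mqm[image].corrwith(composites['Image']))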
In [4]:
# Image and Expectations constructs: response patterns
fig, ax = plt.subplots(1,2, figsize=(14, 6))
mqm[image].sort_values(by=('Q4A'), ascending=False).T.plot(legend=False, alpha=0.01, ax=ax[0])
mqm[expectations].sort_values(by=('Q5A'), ascending=False).T.plot(legend=False, alpha=0.01, ax=ax[1])
ax[0].set_title("Image")
ax[1].set_title("Expectations")
Out[4]:
In [5]:
# Perceived Quality and Value constructs response patterns
fig, ax = plt.subplots(1,2, figsize=(14, 6))
mqm[perceivedQuality].sort_values(by=('Q6'), ascending=False).T.plot(legend=False, alpha=0.01, ax=ax[0])
mqm[perceivedValue].sort_values(by=('Q10'), ascending=False).T.plot(legend=False, alpha=0.01, ax=ax[1])
ax[0].set_title("Perceived Quality")
ax[1].set_title("Perceived Value")
Out[5]:
In [6]:
# Satisfaction and Loyalty constructs response patterns
fig, ax = plt.subplots(1,2, figsize=(14, 6))
mqm[satisfaction].sort_values(by=('Q3'), ascending=False).T.plot(legend=False, alpha=0.01, ax=ax[0])
mqm[loyalty].sort_values(by=('Q12'), ascending=False).T.plot(legend=False, alpha=0.01, ax=ax[1])
ax[0].set_title("Satisfaction")
ax[1].set_title("Loyalty")
Out[6]:
The measurement scales: for each question, an interval scale from 1 to 10 was used.
The coding: numbers were assigned to each response point on the scale in a manner that facilitates measurement, and the equidistant property of interval-level measurement was preserved.
The data distributions: the answers to the questions were measured, and the summary statistics of each corresponding variable are presented in the following table.
In [7]:
mqm.describe()
Out[7]:
When working with CB-SEM, normally distributed variables are highly desirable, and the Shapiro-Wilk test statistic is used in this experiment to assess normality.
Nevertheless, PLS-SEM is a nonparametric, distribution-free technique, so methodologically the assumption of normality is not required when working with PLS-SEM. However, knowing the distribution of the variables under analysis gives the researcher better insight into the respondents' behavior.
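As a sketch of that normality check with scipy.stats.shapiro (the per-item loop and the 0.05 significance level are choices made here, not taken from the original analysis):

In [ ]:
# Shapiro-Wilk normality test for each image item, dropping missing answers first
for item in image:
    x = mqm[item].dropna()
    W, p = stats.shapiro(x)
    verdict = "reject normality" if p < 0.05 else "cannot reject normality"
    print("%s: W = %.3f, p = %.4f -> %s" % (item, W, p, verdict))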
In [8]:
# Plot the per-item column sums of the image construct (image = ['Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E'])
mqm[image].sum().plot()
Out[8]:
In [9]:
for item in image:
    mqm[item].hist(histtype='step', stacked=True, fill=False)
In [10]:
def PCAGM(data):
    # PCA: project the (imputed) items onto their first three principal components
    pcaN = PCA(n_components=3, svd_solver='full').fit_transform(data.fillna(0))
    # Gaussian mixture: cluster the responses into three components, used to colour the points
    gmN = GaussianMixture(3).fit(data.fillna(0))
    labelsN = gmN.predict(data.fillna(0))
    # Scatter plots of PC1 vs PC2 and PC1 vs PC3, coloured by mixture component
    fig, ax = plt.subplots(1, 2, figsize=(15, 5))
    ax[0].set_title("PC1 - PC2")
    ax[0].scatter(pcaN[:, 0], pcaN[:, 1], c=labelsN, cmap='rainbow', alpha=0.5)
    ax[1].set_title("PC1 - PC3")
    ax[1].scatter(pcaN[:, 0], pcaN[:, 2], c=labelsN, cmap='rainbow', alpha=0.5)
    return
In [11]:
# Generate a random normal sample with the same shape as the image items, as a baseline
mu, sigma, dimensao = (7-0.7)/2, 1, len(mqm)  # mean, standard deviation and sample size
mqmNormal = pd.DataFrame(np.random.normal(mu, sigma, dimensao * 5).reshape((dimensao, 5)),
                         index=np.arange(len(mqm)), columns=image)
PCAGM(mqmNormal)
In [12]:
PCAGM(mqm[image]) # image = ['Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E']
Since the Image construct is a linear combination of five variables, we will use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of the data to 2 or 3 dimensions so that we can plot our 1758 data points. In its parametric form, t-SNE is an unsupervised dimensionality reduction technique that learns a parametric mapping between the high-dimensional data space and the low-dimensional latent space, in such a way that the local structure of the data is preserved as well as possible in the latent space.
The similarities between data points are converted to joint probabilities, and the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and of the high-dimensional data is minimized. However, the t-SNE cost function is not convex, i.e. different initializations can give different results.
First, we compute the conditional probabilities

$$p_{j|i} = \frac{\exp\left(-d(\boldsymbol{x}_i, \boldsymbol{x}_j) / (2 \sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-d(\boldsymbol{x}_i, \boldsymbol{x}_k) / (2 \sigma_i^2)\right)}, \quad p_{i|i} = 0,$$

and from them the joint probabilities

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}.$$

A heavy-tailed distribution is used to measure the similarities in the embedded space:

$$q_{ij} = \frac{(1 + \lVert \boldsymbol{y}_i - \boldsymbol{y}_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert \boldsymbol{y}_k - \boldsymbol{y}_l \rVert^2)^{-1}}.$$

More details can be found in Laurens van der Maaten, 2009, Learning a Parametric Embedding by Preserving Local Structure.
Many parametric dimensionality reduction techniques, such as PCA and NCA (Goldberger et al., 2005), are hampered by their linear nature, which makes it difficult to successfully embed highly non-linear real-world data in the latent space. This new unsupervised parametric dimensionality reduction technique attempts to retain the local data structure in the latent space, parametrizing the non-linear mapping between the data space and the latent space by means of a feed-forward neural network.
In [13]:
from sklearn.manifold import TSNE
def ecsiTSNE(data):
    # Two-dimensional embeddings of the (imputed) items: non-linear (t-SNE) and linear (PCA)
    X_tsne = TSNE(learning_rate=100).fit_transform(data.fillna(0))
    X_pca = PCA().fit_transform(data.fillna(0))
    # Gaussian mixture clustering, used only to colour the points
    gm = GaussianMixture(3).fit(data.fillna(0))
    labelsGM = gm.predict(data.fillna(0))
    # figure/subplot/scatter are available via %pylab inline
    figure(figsize=(10, 5))
    subplot(121)
    scatter(X_tsne[:, 0], X_tsne[:, 1], c=labelsGM, cmap='rainbow', alpha=0.5)
    subplot(122)
    scatter(X_pca[:, 0], X_pca[:, 1], c=labelsGM, cmap='rainbow', alpha=0.5)
    return
In [14]:
ecsiTSNE(mqmNormal)
In [15]:
ecsiTSNE(mqm[image]) # image = ['Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E']
In [16]:
ecsiTSNE(mqm[expectations]) # expectations = ['Q5A', 'Q5B','Q5C']
In [17]:
ecsiTSNE(mqm[perceivedQuality]) # perceivedQuality = ['Q6', 'Q7A', 'Q7B', 'Q7C', 'Q7D', 'Q7E', 'Q7F', 'Q7G', 'Q7H']
In [18]:
ecsiTSNE(mqm[socio]) # socio = ['Bank', 'B1', 'B5', 'B6', 'B8']
The general form of the ECSI structural model is:
$$\eta = \beta \eta + \gamma \xi + \nu, \qquad E(\nu \mid \xi) = 0,$$

where $\eta = (\eta_1, \eta_2, \dots, \eta_6)$ is the vector of endogenous latent variables, $\xi$ is the exogenous latent variable (image), $\beta$ is the matrix of coefficients of $\eta$, $\gamma$ is the vector of coefficients of $\xi$, and $\nu$ is the vector of errors:

$$\begin{bmatrix}\eta_1\\ \eta_2\\ \eta_3\\ \eta_4\\ \eta_5\\ \eta_6 \end{bmatrix} = \begin{bmatrix} 0&0&0&0&0&0\\ \beta_{21}&0&0&0&0&0 \\ \beta_{31}&\beta_{32}&0&0&0&0 \\ \beta_{41}&\beta_{42}&\beta_{43}&0&0&0 \\ 0&0&0&\beta_{54}&0&0 \\ 0&0&0&\beta_{64}&\beta_{65}&0 \end{bmatrix} \begin{bmatrix}\eta_1\\ \eta_2\\ \eta_3\\ \eta_4\\ \eta_5\\ \eta_6 \end{bmatrix} + \begin{bmatrix} \gamma_1\\ 0 \\ 0 \\ \gamma_4\\ 0 \\ \gamma_6 \end{bmatrix} \xi + \begin{bmatrix} \nu_1\\ \nu_2 \\ \nu_3 \\ \nu_4 \\ \nu_5 \\ \nu_6 \end{bmatrix}$$
In [19]:
import networkx as nx
import pylab
G = nx.DiGraph()
G.add_node(1,pos="100,100")
G.add_node(2,pos="0,0")
G.add_node(3,pos="200,0")
G.add_edge(1,2)
G.add_edge(1,3)
G.add_edge(1,1)
print(G.edges(data=True))
# [(1, 1, {}), (1, 2, {}), (1, 3, {})]
nx.drawing.nx_pydot.write_dot(G,'graph.dot')
# use -n to suppress node positioning (routes edges)
# run dot -n -Tpng graph.dot >graph.png
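Building on the same idea, the sketch below draws the ECSI structural paths as a directed graph. The edges follow the nonzero entries of the $\beta$ and $\gamma$ matrices above; reading $\eta_1, \dots, \eta_6$ as Expectations, Perceived Quality, Perceived Value, Satisfaction, Complaints and Loyalty is an assumption about the labelling, and the graph is an illustration rather than output of the estimated model:

In [ ]:
# ECSI structural paths as a directed graph (illustrative sketch)
ecsi = nx.DiGraph()
paths = [("Image", "Expectations"),              # gamma_1
         ("Image", "Satisfaction"),              # gamma_4
         ("Image", "Loyalty"),                   # gamma_6
         ("Expectations", "PerceivedQuality"),   # beta_21
         ("Expectations", "PerceivedValue"),     # beta_31
         ("PerceivedQuality", "PerceivedValue"), # beta_32
         ("Expectations", "Satisfaction"),       # beta_41
         ("PerceivedQuality", "Satisfaction"),   # beta_42
         ("PerceivedValue", "Satisfaction"),     # beta_43
         ("Satisfaction", "Complaints"),         # beta_54
         ("Satisfaction", "Loyalty"),            # beta_64
         ("Complaints", "Loyalty")]              # beta_65
ecsi.add_edges_from(paths)
nx.drawing.nx_pydot.write_dot(ecsi, 'ecsi_graph.dot')
# render with: dot -Tpng ecsi_graph.dot > ecsi_graph.png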
The responses are recorded on a 10-point scale, so the distribution of the answers across the possible response categories (1, 2, 3, ..., 10) can be calculated and displayed in a table or chart.
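A minimal sketch of that tabulation with pandas value_counts, shown here for the overall-satisfaction item Q3 (the choice of item is only an example):

In [ ]:
# Frequency of each response category (1-10) for the overall-satisfaction item Q3
freqQ3 = mqm['Q3'].value_counts().sort_index()
print(freqQ3)
freqQ3.plot(kind='bar', title='Q3 response frequencies')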
The questionnaire items and their coding are as follows:

| Question | Description | Values |
|---|---|---|
| Q2 | Year since which the respondent has been a customer of the bank | 9999: NS/NR |
| Q3 | Overall satisfaction with the bank | 1-10; 99: NS/NR |
| Q4A | A bank that can be trusted in what it says and does | 1-10; 99: NS/NR |
| Q4B | A bank that is firmly established in the market | 1-10; 99: NS/NR |
| Q4C | Contributes positively to society | 1-10; 99: NS/NR |
| Q4D | Cares about its customers | 1-10; 99: NS/NR |
| Q4E | An innovative, forward-looking bank | 1-10; 99: NS/NR |
| Q5A | Expectations held six months ago, or when becoming a customer, regarding the overall quality of the bank | 1-10; 99: NS/NR |
| Q5B | Expectations held six months ago, or when becoming a customer, regarding the bank's ability to offer products and services that meet the respondent's personal needs | 1-10; 99: NS/NR |
| Q5C | Expectations held six months ago, or when becoming a customer, regarding the bank's ability to avoid failures or errors | 1-10; 99: NS/NR |
| Q6 | Perceived overall quality of the bank | 1-10; 99: NS/NR |
| Q7A | Quality of the banking products and services offered | 1-10; 99: NS/NR |
| Q7B | Customer service and advisory capability | 1-10; 99: NS/NR |
| Q7C | Accessibility of products and services through new technologies | 1-10; 99: NS/NR |
| Q7D | Reliability of the products and services offered | 1-10; 99: NS/NR |
| Q7E | Diversity of products and services | 1-10; 99: NS/NR |
| Q7F | Clarity and transparency of the information provided | 1-10; 99: NS/NR |
| Q7G | Availability of branches | 1-10; 99: NS/NR |
| Q7H | Quality of branches | 1-10; 99: NS/NR |
| Q9 | Fulfilment of perceived expectations | 1-10; 99: NS/NR |
| Q10 | Rating of the prices and fees of the bank's products and services, given their quality | 1-10; 99: NS/NR |
| Q11 | Rating of the quality of the bank's products and services, given their prices and fees | 1-10; 99: NS/NR |
| Q12 | Likelihood of choosing the bank again when acquiring a banking product or service | 1-10; 99: NS/NR |
| Q13 | Reduction in commissions, interest and other fees at other banks above which the respondent would switch banks | 0-100 (%); 222: would always stay with the bank; 999: NS/NR |
| Q14 | Whether a complaint has been filed | 1: Yes; 2: No |
| Q15 | (If Q14=1) Assessment of how the complaint was handled | 1-10; 99: NS/NR |
| Q16 | (If Q14=2) Expectation of how a possible complaint would be resolved | 1-10; 99: NS/NR |
| Q17 | Likelihood of recommending the bank to other people | 1-10; 99: NS/NR |
| Q18 | How close the customer's bank is to a bank the customer would consider ideal | 1-10; 99: NS/NR |

Note: for responses on the 1-10 scale, values 1 to 5 correspond to negative evaluations and values 6 to 10 correspond to positive evaluations. NS/NR denotes "does not know / no response".
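Following that note, a small sketch (illustrative, not part of the original pipeline) that recodes a 1-10 item into negative/positive evaluations:

In [ ]:
# Recode a 1-10 item: values 1-5 count as negative evaluations, 6-10 as positive
q3_band = pd.cut(mqm['Q3'], bins=[0, 5, 10], labels=['negative', 'positive'])
print(q3_band.value_counts())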
The socio-demographic variables and their coding are:

| Question | Description | Values |
|---|---|---|
| B1 | Respondent's gender | 1: Female; 2: Male |
| B2 | Respondent's year of birth | (last two digits); 99: refused to disclose |
| B5 | Employment status | 1: Employed; 2: Unemployed; 3: Student; 4: Homemaker; 5: Retired; 6: Other; 9: NS/NR |
| B6 | (If B5=1 or B5=6) Position in professional activity | 1: Employer; 2: Self-employed; 3: Employee; 4: Other; 9: NS/NR |
| B8 | Level of education | 1: Cannot read or write; 2: Can read and write, no formal qualification; 3: Elementary basic education; 4: Preparatory basic education; 5: Unified secondary education; 6: Complementary secondary education; 7: Intermediate courses; 8: Incomplete higher education; 9: Completed higher education or more; 99: NS/NR |
In [20]:
# One-hot encode the socio-demographic variables as prediction targets
labels = pd.get_dummies(mqm[socio], columns=['Bank', 'B1', 'B5', 'B6', 'B8'])
# Keep the questionnaire items as features
features = mqm[['Q2', 'Q3', 'Q4A', 'Q4B', 'Q4C', 'Q4D', 'Q4E', 'Q5A', 'Q5B', 'Q5C', 'Q6', 'Q7A', 'Q7B',
'Q7C', 'Q7D', 'Q7E', 'Q7F', 'Q7G', 'Q7H', 'Q9', 'Q10', 'Q11', 'Q12', 'Q13', 'Q14',
'Q1516', 'Q17', 'Q18', 'B2']]
In [21]:
labels.columns = ['Bank_1', 'Bank_2', 'Bank_3', 'Bank_4', 'Bank_5', 'Bank_6', 'Bank_7',
'B1_1Feminino', 'B1_2Masculino ', 'B1_9NR',
'B5_1Empregado', 'B5_2Desempregado', 'B5_3Estudante',
'B5_4Domestico', 'B5_5Reformado', 'B5_6Outra', 'B5_9NS/NR',
'B6_1Patrão', 'B6_2TrabalhadorContaPropria', 'B6_3TrabalhadorContaOutrem',
'B6_4Outra', 'B6_9NS_NR ',
'B8_2SabeLerEscrever', 'B8_3EnsinoBasicoElementar', 'B8_4EnsinoBasicoPreparatorio',
'B8_5EnsinoSecundarioUnificado', 'B8_6EnsinoSecundarioComplementar',
'B8_7CursoMedio', 'B8_8EnsinoSuperiorIncompleto', 'B8_9EnsinoSuperiorCompleto']
In [22]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
logitScores = {}
logitCoefs = pd.DataFrame(index=features.columns, columns = labels.columns)
for label in labels:
    # Hold out a test split for each socio-demographic dummy variable
    train_data, test_data, train_labels, test_labels = train_test_split(features, labels[[label]],
                                                                        random_state=123)
    # Impute missing answers with column means learned on the training split
    imp = Imputer()
    imp.fit(train_data)
    train_data = imp.transform(train_data)
    test_data = imp.transform(test_data)
    # Fit a logistic regression and store its test accuracy and coefficients
    modelLogit = LogisticRegression().fit(train_data, train_labels.values.ravel())
    score = modelLogit.score(test_data, test_labels.values.ravel())
    #print('#', list(labels[[label]]), "Logit Score: %f" % score)
    #print(confusion_matrix(test_labels, modelLogit.predict(test_data)))
    logitScores[label] = score
    logitCoefs[label] = modelLogit.coef_.T
    #print(" ")
logitScores = pd.DataFrame.from_dict([logitScores]).reindex(columns=logitCoefs.columns)
In [23]:
logitScores.T.plot(legend=False)
Out[23]:
In [24]:
logitCoefs.T.plot(legend=False, alpha=0.25)
Out[24]:
In [25]:
logitCoefs.T.head()
Out[25]:
In [26]:
from sklearn.decomposition import PCA
pca = PCA().fit_transform(logitCoefs)
plt.plot(pca[:,0],pca[:,1], 'ro')
plt.axis([-1, 1, -1, 1])
plt.show()
In [27]:
#sns.set(style="white")
sns.set_context("paper")
# Compute the correlation matrix of the logit coefficients and
# saturate strong correlations (|r| > 0.25) to sharpen the heatmap
corr = logitCoefs.corr()
corr[corr > 0.25] = 1
corr[corr < -0.25] = -1
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(200, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
square=True, xticklabels=True, yticklabels=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[27]: