Principal component analysis (PCA) on cancer data

This notebook illustrates how to use the hoggorm package to carry out principal component analysis (PCA) on a multivariate data set on cancer in men across OECD countries. Furthermore, we will learn how to visualise the results of the PCA using the hoggormPlot package.


Import packages and prepare data

First import hoggorm for analysis of the data and hoggormPlot for plotting the analysis results. We'll also import pandas so that we can read the data into a data frame. numpy is needed for checking the dimensions of the data.


In [1]:
import hoggorm as ho
import hoggormplot as hop
import pandas as pd
import numpy as np

Next, load the cancer data that we are going to analyse using hoggorm. The data can be acquired from the OECD (the Organisation for Economic Co-operation and Development) and holds the percentages of various cancer types in men. After the data has been loaded into the pandas data frame, we'll display it in the notebook.


In [2]:
# Load OECD data for cancer in men

# Insert code for reading data from other folder in repository instead of directly from same repository.
data_df = pd.read_csv('Cancer_men_perc.txt', index_col=0, sep='\t')
data_df


Out[2]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
MEN
Australia 20.493429 9.290024 2.954790 5.033474 13.612695 4.049921 0.157038 3.855691 3.107695 4.425986
Austria 22.286929 10.944722 4.584951 6.627842 10.574968 5.666482 0.147902 3.780736 3.031984 1.941209
Belgium 30.090706 10.450900 3.253582 5.179440 9.162613 3.450769 0.144604 3.569081 4.719337 1.018798
Canada 27.731624 11.443416 3.011842 5.419733 9.739695 3.863702 0.195163 3.789857 3.534035 1.537569
Chile 13.304823 8.022491 17.272511 4.165677 16.195454 5.044745 0.221747 3.009424 2.162034 0.554368
Czech Rep. 24.875324 13.970344 4.455083 6.742470 9.455416 3.464326 0.279274 3.371235 3.557417 1.522708
Denmark 24.311585 12.168196 3.212602 6.375589 14.301662 3.162987 0.210866 3.100968 4.366162 2.245100
Estonia 26.371951 10.264228 8.333333 5.436992 13.008130 2.388211 0.101626 3.861789 3.506098 1.016260
Finland 23.917887 9.818587 4.169319 7.940802 13.574157 4.232973 0.159134 2.737110 2.928071 2.418842
France 25.125758 10.161689 3.360656 5.335729 9.885470 6.466427 0.143723 3.535819 4.350999 1.112733
Germany 24.401222 11.177186 4.592273 6.795183 11.012912 4.106844 0.151953 3.609916 3.198410 1.467786
Greece 31.667245 8.503992 4.830499 5.229666 9.013074 1.237996 0.792549 3.823904 5.582552 0.769409
Hungary 30.412574 16.081953 5.287679 5.293292 6.797642 3.104126 0.117878 2.559641 3.575639 1.044064
Iceland 20.322581 13.225806 4.193548 3.870968 17.096774 0.967742 0.645161 2.903226 5.806452 0.967742
Ireland 22.957477 12.326804 4.395604 5.805065 12.446249 3.511706 0.238892 3.057812 2.866699 1.887243
Israel 21.822897 12.671360 5.446462 8.725454 7.725083 3.501297 0.351982 5.094479 4.890700 2.278622
Italy 26.067418 10.900444 6.087111 5.398894 7.628006 6.953407 0.250356 3.660018 4.687631 1.131317
Japan 23.990561 11.971776 14.737968 7.315522 5.327754 9.132765 0.051157 2.214981 2.426985 0.151168
Korea 26.204210 10.098575 13.107486 5.630408 3.142353 18.281606 0.094701 1.977960 1.975808 0.271189
Luxembourg 25.183824 11.948529 4.044118 6.066176 7.720588 6.066176 0.183824 4.779412 4.044118 1.838235
Mexico 11.495974 6.910581 8.227151 5.088289 16.321514 7.591468 0.740218 6.260771 1.873146 0.802373
Netherlands 27.155154 11.469393 3.543496 5.334906 11.076157 2.127846 0.157295 3.062874 3.700791 2.154061
New Zealand 19.773765 13.095497 4.198390 4.763977 12.725691 3.567544 0.195780 4.154884 2.740918 5.286056
Norway 21.398230 13.451327 3.203540 6.194690 17.486726 2.566372 0.176991 3.221239 4.247788 3.309735
Poland 30.656934 11.920225 6.439067 4.550070 8.201621 2.086327 0.195414 2.927371 5.145890 1.415790
Portugal 20.390610 14.466792 8.730518 4.733881 11.103926 4.695079 0.155209 3.000711 4.416995 0.821315
Slovak Rep. 22.663051 15.099187 6.108178 5.651491 7.606679 3.111175 0.285429 2.554588 3.125446 1.541316
Slovenia 24.945770 13.727921 7.468237 5.113108 11.093895 3.997521 0.154943 2.479083 4.090487 2.169197
Spain 26.777063 14.070260 5.245117 4.827702 8.816002 5.138478 0.196521 2.953901 6.384632 0.885104
Sweden 16.053045 12.345140 3.350201 6.962136 20.441459 3.577037 0.157041 3.568313 4.240098 2.748211
Switzerland 21.607539 10.188470 3.680710 6.241685 14.168514 5.365854 0.000000 3.359202 4.257206 2.372506
Turkey 38.971658 7.623419 8.846855 5.211386 7.225854 3.723590 0.219276 3.612927 3.397750 0.571756
United Kingdom 22.803081 10.262614 3.447752 4.946063 12.707596 3.427884 0.191672 3.223356 4.037960 1.521686
United States 29.141246 9.063132 2.226842 6.211578 9.487245 4.537710 0.237131 4.268031 3.463645 1.993364
OECD 25.991834 10.696822 6.270034 5.933306 9.235451 5.686093 0.194011 3.432922 3.617678 1.304311

Let's have a look at the dimensions of the data frame.


In [3]:
np.shape(data_df)


Out[3]:
(35, 10)

There are observations for 34 countries as well as all OECD countries together, which results in 35 rows. Furthermore, there are 10 columns where each column represents one type of cancer in men.

The nipalsPCA class in hoggorm accepts only numpy arrays with numerical values and not pandas data frames. Therefore, the pandas data frame holding the imported data needs to be "taken apart" into three parts:

  • a numpy array holding the numeric values
  • a Python list holding variable (column) names
  • a Python list holding object (row) names.

The array with values will be used as input for the nipalsPCA class for analysis. The Python lists holding the variable and row names will be used later in the plotting function from the hoggormPlot package when visualising the results of the analysis. Below is the code needed to access the data, the variable names and the object names.


In [4]:
# Get the values from the data frame
data = data_df.values

# Get the variable or columns names
data_varNames = list(data_df.columns)

# Get the object or row names
data_objNames = list(data_df.index)
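
The split into values, variable names and object names loses no information. As a minimal sketch (using a tiny stand-in frame built from the first two rows and columns of the table above; the names `toy_df` and `rebuilt_df` are illustrative only), the three parts can be put back together into an equivalent data frame:

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame: first two rows and columns of the table shown above
toy_df = pd.DataFrame(
    [[20.493429, 9.290024], [22.286929, 10.944722]],
    index=["Australia", "Austria"],
    columns=["Trachea-bronchus-lung", "Colon, rectum and anus"],
)

# Take the frame apart, exactly as done for data_df
values = toy_df.values              # numpy array with the numbers
var_names = list(toy_df.columns)    # column (variable) names
obj_names = list(toy_df.index)      # row (object) names

# Put the three parts back together
rebuilt_df = pd.DataFrame(values, index=obj_names, columns=var_names)
```

The same pattern is what the later cells rely on when they wrap numpy results in a data frame with readable row and column labels.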

Let's have a quick look at the column or variable names.


In [5]:
data_varNames


Out[5]:
['Trachea-bronchus-lung',
 'Colon, rectum and anus',
 'Stomach',
 'Pancreas',
 'Prostate',
 'Liver',
 'Hodgkins disease',
 'Leukemia',
 'Bladder',
 'Skin']

Now show the object or row names.


In [6]:
data_objNames


Out[6]:
['Australia',
 'Austria',
 'Belgium',
 'Canada',
 'Chile',
 'Czech Rep.',
 'Denmark',
 'Estonia',
 'Finland',
 'France',
 'Germany',
 'Greece',
 'Hungary',
 'Iceland',
 'Ireland',
 'Israel',
 'Italy',
 'Japan',
 'Korea',
 'Luxembourg',
 'Mexico',
 'Netherlands',
 'New Zealand',
 'Norway',
 'Poland',
 'Portugal',
 'Slovak Rep.',
 'Slovenia',
 'Spain',
 'Sweden',
 'Switzerland',
 'Turkey',
 'United Kingdom',
 'United States',
 'OECD']

Apply PCA to our data

Now, let's run PCA on the data using the nipalsPCA class. The documentation provides a description of the input parameters. The input parameter arrX defines which numpy array we would like to analyse. Setting Xstand=False ensures that the variables are only mean centered, not scaled to unit variance. This is the default setting and doesn't need to be stated explicitly. Setting cvType=["loo"] computes the PCA model with full cross validation; "loo" means "Leave One Out". Setting numComp=4 asks for four principal components (PC) to be computed.
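
To make the difference between mean centering and standardisation concrete, here is a small numpy sketch on made-up numbers. It only illustrates the two kinds of preprocessing; it is not hoggorm's internal code, and the use of `ddof=1` for the standard deviation is an assumption.

```python
import numpy as np

# Made-up data: 4 objects, 2 variables on very different scales
X = np.array([[20.5, 0.16],
              [22.3, 0.15],
              [30.1, 0.14],
              [27.7, 0.20]])

# Mean centering only (the idea behind Xstand=False):
# subtract each column's mean, keep the original spread
X_centered = X - X.mean(axis=0)

# Standardisation (the idea behind Xstand=True):
# additionally divide each column by its standard deviation
# (ddof=1 is an assumption made for this sketch)
X_standardised = X_centered / X.std(axis=0, ddof=1)
```

With centering only, variables with a large spread (here the first column) dominate the model; after standardisation every variable contributes with unit variance.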


In [7]:
model = ho.nipalsPCA(arrX=data, Xstand=False, cvType=["loo"], numComp=4)


loo
loo

That's it, the PCA model has been computed. Now we would like to inspect the results by visualising them. We can do this using the tailor-made plotting function for PCA from the separate hoggormPlot package. To plot the results for component 1 and component 2, set the input argument comp=[1, 2]. The input argument plots=[1, 2, 3, 4, 6] defines which plots are generated: value 1 gives the scores plot, value 2 the loadings plot, value 3 the correlation loadings plot, value 4 the bi-plot and value 6 the explained variance plot. The hoggormPlot documentation provides a description of the input parameters.


In [8]:
hop.plot(model, comp=[1, 2], 
         plots=[1, 2, 3, 4, 6], 
         objNames=data_objNames, 
         XvarNames=data_varNames)



Accessing numerical results

Now that we have visualised the PCA results, we may also want to access the numerical results. Below are some examples. For a complete list of accessible results, please see this part of the documentation.


In [9]:
# Get scores and store in numpy array
scores = model.X_scores()

# Get scores and store in pandas dataframe with row and column names
scores_df = pd.DataFrame(model.X_scores())
scores_df.index = data_objNames
scores_df.columns = ['PC{0}'.format(x+1) for x in range(model.X_scores().shape[1])]
scores_df


Out[9]:
PC1 PC2 PC3 PC4
Australia -4.804277 2.310263 -0.320919 2.835885
Austria -1.491013 -0.521756 -1.669103 1.558577
Belgium 5.930459 2.930098 0.758349 1.199125
Canada 3.580658 2.560206 -0.593118 1.027471
Chile -12.382151 -9.441836 6.727705 -3.070356
Czech Rep. 1.319773 1.315515 -2.350299 -1.668611
Denmark -1.711772 3.914209 0.076722 0.251714
Estonia 0.524115 0.435027 4.308677 -1.262844
Finland -1.741210 1.886170 0.830626 2.085349
France 1.435297 0.232770 -1.275385 2.792210
Germany 0.057056 0.964567 -0.388893 0.658498
Greece 7.226501 3.290313 3.784990 0.607414
Hungary 7.452514 1.392996 -2.187983 -3.928233
Iceland -6.500124 4.719514 1.199494 -2.495195
Ireland -1.937233 1.729734 -0.639801 -0.376054
Israel -0.450615 -0.495134 -3.286539 -1.172208
Italy 3.378048 -2.357099 -1.225306 1.105300
Japan 2.711983 -11.101208 -0.021016 -2.765627
Korea 6.055118 -15.670778 -2.856998 3.790745
Luxembourg 2.513495 -0.481077 -2.925284 1.238478
Mexico -13.796612 -5.092872 1.371124 3.851659
Netherlands 2.307706 3.593764 0.571465 -0.149753
New Zealand -5.020797 1.731151 -2.325881 -0.523972
Norway -5.926939 4.986128 -0.018785 -0.672933
Poland 6.812707 1.554399 1.749005 -2.223861
Portugal -3.313726 -2.641853 -1.211839 -3.820953
Slovak Rep. 0.388879 -0.724607 -3.411846 -3.827329
Slovenia 0.459487 -0.347168 0.000022 -2.798297
Spain 3.471632 0.319985 -1.971708 -1.789486
Sweden -11.957737 3.990669 -0.043580 0.608510
Switzerland -3.913395 1.499904 -0.226503 2.412778
Turkey 14.236427 -0.244792 7.015958 0.957182
United Kingdom -2.149667 2.373420 0.262543 1.224744
United States 4.865291 2.751178 0.272190 3.571268
OECD 2.370123 -1.361798 0.021916 0.768806

In [10]:
help(ho.nipalsPCA.X_scores)


Help on function X_scores in module hoggorm.pca:

X_scores(self)
    Returns array holding scores T. First column holds scores for
    component 1, second column holds scores for component 2, etc.


In [11]:
# Dimension of the scores
np.shape(model.X_scores())


Out[11]:
(35, 4)

We see that the numpy array holds the scores for all countries and the OECD (35 rows in total) for four components, as requested when computing the PCA model.


In [12]:
# Get loadings and store in numpy array
loadings = model.X_loadings()

# Get loadings and store in pandas dataframe with row and column names
loadings_df = pd.DataFrame(model.X_loadings())
loadings_df.index = data_varNames
loadings_df.columns = ['PC{0}'.format(x+1) for x in range(model.X_loadings().shape[1])]
loadings_df


Out[12]:
PC1 PC2 PC3 PC4
Trachea-bronchus-lung 0.845199 0.224593 0.314566 0.046389
Colon, rectum and anus 0.015660 0.106296 -0.550679 -0.668953
Stomach -0.027941 -0.680376 0.498526 -0.471499
Pancreas 0.001799 -0.000626 -0.141297 0.084208
Prostate -0.525513 0.354902 0.453070 0.086309
Liver 0.035896 -0.562988 -0.327117 0.518504
Hodgkins disease -0.005208 0.005020 0.015071 0.003083
Leukemia -0.036068 0.034901 0.010028 0.165016
Bladder 0.040952 0.128914 -0.023781 -0.107264
Skin -0.064443 0.120724 -0.127032 0.076852

In [13]:
help(ho.nipalsPCA.X_loadings)


Help on function X_loadings in module hoggorm.pca:

X_loadings(self)
    Returns array holding loadings P of array X. Rows represent variables
    and columns represent components. First column holds loadings for
    component 1, second column holds scores for component 2, etc.


In [14]:
np.shape(model.X_loadings())


Out[14]:
(10, 4)

Here we see that the array holds the loadings for the 10 variables in the data across four components.
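
The relation between scores and loadings can be sketched with plain numpy. The example below uses random stand-in data with the same shape as the cancer data and a singular value decomposition instead of the NIPALS algorithm, so it illustrates the general PCA structure rather than hoggorm's implementation; component signs may differ between the two approaches.

```python
import numpy as np

# Random stand-in data with the same shape as the cancer data (35 x 10)
rng = np.random.default_rng(0)
X = rng.normal(size=(35, 10))

# Mean center the columns
Xc = X - X.mean(axis=0)

# PCA via SVD (a stand-in for NIPALS in this sketch)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

num_comp = 4
P = Vt[:num_comp].T   # loadings: one row per variable, one column per component
T = Xc @ P            # scores: centered data projected onto the loadings
```

Here `T` has one row per object and `P` one row per variable, which mirrors the (35, 4) and (10, 4) shapes reported above. The loading vectors are orthonormal, so `P.T @ P` is the identity matrix.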


In [15]:
# Get correlation loadings and store in numpy array
corrLoadings = model.X_corrLoadings()

# Get correlation loadings and store in pandas dataframe with row and column names
corrLoadings_df = pd.DataFrame(model.X_corrLoadings())
corrLoadings_df.index = data_varNames
corrLoadings_df.columns = ['PC{0}'.format(x+1) for x in range(model.X_corrLoadings().shape[1])]
corrLoadings_df


Out[15]:
PC1 PC2 PC3 PC4
Trachea-bronchus-lung 0.965908 0.191746 0.149441 0.020209
Colon, rectum and anus 0.043048 0.218298 -0.629289 -0.701062
Stomach -0.048240 -0.877575 0.357808 -0.310349
Pancreas 0.010313 -0.002680 -0.336703 0.184027
Prostate -0.829307 0.418408 0.297220 0.051922
Liver 0.071639 -0.839369 -0.271383 0.394491
Hodgkins disease -0.180713 0.130129 0.217384 0.040783
Leukemia -0.258370 0.186778 0.029859 0.450652
Bladder 0.230094 0.541118 -0.055545 -0.229764
Skin -0.352154 0.492845 -0.288573 0.160107

In [16]:
help(ho.nipalsPCA.X_corrLoadings)


Help on function X_corrLoadings in module hoggorm.pca:

X_corrLoadings(self)
    Returns array holding correlation loadings of array X. First column
    holds correlation loadings for component 1, second column holds
    correlation loadings for component 2, etc.
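
Correlation loadings can be read as the correlation between each original variable and each score vector. The sketch below computes them with numpy on random stand-in data, again using SVD in place of NIPALS; it shows the concept and is not taken from hoggorm's source.

```python
import numpy as np

# Random stand-in data with the same shape as the cancer data
rng = np.random.default_rng(1)
X = rng.normal(size=(35, 10))
Xc = X - X.mean(axis=0)

# PCA via SVD (stand-in for NIPALS)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:4].T
T = Xc @ P

# Correlation between every variable (rows) and every score vector (columns)
n_vars, n_comp = P.shape
corr_loadings = np.empty((n_vars, n_comp))
for j in range(n_vars):
    for a in range(n_comp):
        corr_loadings[j, a] = np.corrcoef(Xc[:, j], T[:, a])[0, 1]

# Percentage of each variable's variance captured by component 1 alone
resid_1 = Xc - np.outer(T[:, 0], P[:, 0])
expl_var_pc1 = 100.0 * (1.0 - (resid_1 ** 2).sum(axis=0) / (Xc ** 2).sum(axis=0))
```

In this sketch the squared correlation loading of a variable on component 1, times 100, equals the variance of that variable explained by component 1. The numbers printed in this notebook show the same pattern: for Trachea-bronchus-lung, 0.965908 squared is about 0.9330, in line with the 93.30 % reported for PC1 in the per-variable calibrated explained variance.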


In [17]:
# Get calibrated explained variance of each component
calExplVar = model.X_calExplVar()

# Get calibrated explained variance and store in pandas dataframe with row and column names
calExplVar_df = pd.DataFrame(model.X_calExplVar())
calExplVar_df.columns = ['calibrated explained variance']
calExplVar_df.index = ['PC{0}'.format(x+1) for x in range(model.X_loadings().shape[1])]
calExplVar_df


Out[17]:
calibrated explained variance
PC1 49.949783
PC2 27.877279
PC3 8.631727
PC4 7.259616

In [18]:
help(ho.nipalsPCA.X_calExplVar)


Help on function X_calExplVar in module hoggorm.pca:

X_calExplVar(self)
    Returns a list holding the calibrated explained variance for
    each component. First number in list is for component 1, second number
    for component 2, etc.
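
As a rough illustration of what "calibrated explained variance" measures, the sketch below computes, for random stand-in data, the share of the total sum of squares of the centered data that each component accounts for. It uses the singular values from an SVD, which is an assumption of this sketch and not necessarily how hoggorm computes the numbers.

```python
import numpy as np

# Random stand-in data with the same shape as the cancer data
rng = np.random.default_rng(2)
X = rng.normal(size=(35, 10))
Xc = X - X.mean(axis=0)

# Singular values of the centered data
s = np.linalg.svd(Xc, compute_uv=False)

# Total variation in the centered data
total_ss = (Xc ** 2).sum()

# Percentage of the total variation captured by each of the first 4 components
cal_expl_var = 100.0 * s[:4] ** 2 / total_ss
```

The values come out in decreasing order and add up to less than 100 % as long as fewer components are kept than the data could support, just as in the output above.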


In [19]:
# Get cumulative calibrated explained variance
cumCalExplVar = model.X_cumCalExplVar()

# Get cumulative calibrated explained variance and store in pandas dataframe with row and column names
cumCalExplVar_df = pd.DataFrame(model.X_cumCalExplVar())
cumCalExplVar_df.columns = ['cumulative calibrated explained variance']
cumCalExplVar_df.index = ['PC{0}'.format(x) for x in range(model.X_loadings().shape[1] + 1)]
cumCalExplVar_df


Out[19]:
cumulative calibrated explained variance
PC0 0.000000
PC1 49.949783
PC2 77.827062
PC3 86.458789
PC4 93.718405

In [20]:
help(ho.nipalsPCA.X_cumCalExplVar)


Help on function X_cumCalExplVar in module hoggorm.pca:

X_cumCalExplVar(self)
    Returns a list holding the cumulative validated explained variance
    for array X after each component. First number represents zero
    components, second number represents component 1, etc.
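
The cumulative values are running sums of the per-component values, with a leading zero for the model with no components. That is why the index above runs from PC0 to PC4. A minimal check using the four numbers printed in Out[17]:

```python
import numpy as np

# Calibrated explained variance per component, copied from Out[17]
cal_expl_var = [49.949783, 27.877279, 8.631727, 7.259616]

# Running sum with a leading zero for "zero components"
cum_cal_expl_var = np.concatenate(([0.0], np.cumsum(cal_expl_var)))

# Labels PC0..PC4, one more than the number of components
labels = ['PC{0}'.format(x) for x in range(len(cal_expl_var) + 1)]
```

The resulting five values agree with the table in Out[19] up to rounding.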


In [21]:
# Get cumulative calibrated explained variance for each variable
cumCalExplVar_ind = model.X_cumCalExplVar_indVar()

# Get cumulative calibrated explained variance for each variable and store in pandas dataframe with row and column names
cumCalExplVar_ind_df = pd.DataFrame(model.X_cumCalExplVar_indVar())
cumCalExplVar_ind_df.columns = data_varNames
cumCalExplVar_ind_df.index = ['PC{0}'.format(x) for x in range(model.X_loadings().shape[1] + 1)]
cumCalExplVar_ind_df


Out[21]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
PC0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
PC1 93.297874 0.185312 0.232708 0.010637 68.775042 0.513211 3.265701 6.675528 5.294306 12.401232
PC2 96.974604 4.950729 77.246454 0.011355 86.281381 70.967161 4.959056 10.164105 34.575287 36.690812
PC3 99.207876 44.551154 90.049020 11.348255 95.115368 78.332080 9.684653 10.253262 34.883801 45.018208
PC4 99.248721 93.700489 99.680558 14.734767 95.384980 93.894274 9.850987 30.561992 40.162969 47.581571

In [22]:
help(ho.nipalsPCA.X_cumCalExplVar_indVar)


Help on function X_cumCalExplVar_indVar in module hoggorm.pca:

X_cumCalExplVar_indVar(self)
    Returns an array holding the cumulative calibrated explained variance
    for each variable in X after each component. First row represents zero
    components, second row represents one component, third row represents
    two components, etc. Columns represent variables.


In [23]:
# Get calibrated predicted X for a given number of components

# Predicted X from calibration using 1 component
X_from_1_component = model.X_predCal()[1]

# Predicted X from calibration using 1 component stored in pandas data frame with row and columns names
X_from_1_component_df = pd.DataFrame(model.X_predCal()[1])
X_from_1_component_df.index = data_objNames
X_from_1_component_df.columns = data_varNames
X_from_1_component_df


Out[23]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
Australia 20.264145 11.348532 5.914738 5.710003 13.528109 4.394810 0.250634 3.612515 3.604950 1.980947
Austria 23.064511 11.400417 5.822161 5.715964 11.786947 4.513742 0.233378 3.493013 3.740634 1.767432
Belgium 29.337131 11.516635 5.614795 5.729317 7.886869 4.780140 0.194727 3.225336 4.044555 1.289172
Canada 27.351081 11.479838 5.680452 5.725089 9.121720 4.695792 0.206965 3.310089 3.948327 1.440599
Chile 13.859335 11.229864 6.126475 5.696369 17.510377 4.122798 0.290101 3.885833 3.294624 2.469286
Czech Rep. 25.440185 11.444433 5.743624 5.721021 10.309843 4.614636 0.218740 3.391634 3.855740 1.586297
Denmark 22.877926 11.396960 5.828329 5.715567 11.902959 4.505817 0.234528 3.500975 3.731593 1.781658
Estonia 24.767695 11.431973 5.765856 5.719590 10.727972 4.586076 0.222883 3.420331 3.823156 1.637571
Finland 22.853045 11.396499 5.829152 5.715514 11.918429 4.504761 0.234682 3.502037 3.730388 1.783555
France 25.537825 11.446242 5.740396 5.721229 10.249134 4.618783 0.218138 3.387467 3.860471 1.578852
Germany 24.372937 11.424659 5.778906 5.718750 10.973417 4.569310 0.225316 3.437177 3.804030 1.667670
Greece 30.432544 11.536931 5.578582 5.731649 7.205783 4.826662 0.187977 3.178591 4.097630 1.205651
Hungary 30.623569 11.540470 5.572266 5.732055 7.087011 4.834775 0.186800 3.170439 4.106886 1.191086
Iceland 18.830817 11.321975 5.962123 5.706952 14.419298 4.333937 0.259466 3.673681 3.535502 2.090232
Ireland 22.687367 11.393429 5.834629 5.715162 12.021441 4.497724 0.235702 3.509107 3.722360 1.796187
Israel 23.943854 11.416709 5.793091 5.717836 11.240205 4.551087 0.227960 3.455488 3.783240 1.700385
Italy 27.179835 11.476665 5.686113 5.724725 9.228194 4.688519 0.208020 3.317396 3.940030 1.453656
Japan 26.616879 11.466235 5.704724 5.723526 9.578219 4.664611 0.211489 3.341420 3.912753 1.496579
Korea 29.442492 11.518587 5.611312 5.729541 7.821360 4.784614 0.194077 3.220840 4.049660 1.281138
Luxembourg 26.449117 11.463126 5.710270 5.723169 9.682527 4.657486 0.212523 3.348579 3.904625 1.509370
Mexico 12.663835 11.207714 6.165997 5.693825 18.253694 4.072025 0.297467 3.936849 3.236699 2.560438
Netherlands 26.275184 11.459904 5.716020 5.722799 9.790672 4.650099 0.213594 3.356001 3.896197 1.522632
New Zealand 20.081142 11.345141 5.920788 5.709614 13.641893 4.387038 0.251762 3.620325 3.596083 1.994900
Norway 19.315272 11.330951 5.946107 5.707983 14.118082 4.354512 0.256481 3.653007 3.558975 2.053295
Poland 30.082805 11.530451 5.590144 5.730904 7.423238 4.811808 0.190132 3.193516 4.080684 1.232317
Portugal 21.523957 11.371874 5.873090 5.712685 12.744805 4.448314 0.242871 3.558754 3.665991 1.884892
Slovak Rep. 24.653394 11.429856 5.769634 5.719347 10.799040 4.581221 0.223588 3.425209 3.817618 1.646286
Slovenia 24.713071 11.430961 5.767661 5.719474 10.761935 4.583756 0.223220 3.422662 3.820510 1.641736
Spain 27.258933 11.478131 5.683498 5.724893 9.179014 4.691879 0.207532 3.314021 3.943862 1.447625
Sweden 14.218050 11.236511 6.114616 5.697133 17.287342 4.138033 0.287890 3.870525 3.312004 2.441935
Switzerland 21.017117 11.362483 5.889846 5.711606 13.059939 4.426789 0.245994 3.580383 3.641433 1.923536
Turkey 36.357323 11.646705 5.382714 5.744261 3.521979 5.078287 0.151468 2.925758 4.384698 0.753913
United Kingdom 22.507818 11.390103 5.840565 5.714779 12.133078 4.490099 0.236809 3.516769 3.713661 1.809877
United States 28.436852 11.499955 5.644557 5.727400 8.446629 4.741905 0.200274 3.263755 4.000935 1.357814
OECD 26.327938 11.460881 5.714276 5.722911 9.757871 4.652339 0.213269 3.353750 3.898753 1.518609

In [24]:
# Get calibrated predicted X for a given number of components

# Predicted X from calibration using 4 components
X_from_4_component = model.X_predCal()[4]

# Predicted X from calibration using 4 components stored in pandas data frame with row and columns names
X_from_4_component_df = pd.DataFrame(model.X_predCal()[4])
X_from_4_component_df.index = data_objNames
X_from_4_component_df.columns = data_varNames
X_from_4_component_df


Out[24]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
Australia 20.813615 9.873754 2.845786 5.992706 14.447389 4.669554 0.266138 4.157895 3.606217 2.518563
Austria 22.494585 11.221481 4.610192 6.083374 10.980075 6.161605 0.210409 3.715255 3.545885 2.036253
Belgium 30.289386 10.608328 3.433897 5.721306 9.373847 3.504211 0.224562 3.533081 4.275626 1.638726
Canada 27.787173 11.391266 3.158412 5.893813 9.850298 3.981193 0.214046 3.563045 4.182266 1.903987
Chile 13.712647 8.575350 17.352077 4.493128 16.942574 5.645705 0.334629 3.117106 2.246791 0.238829
Czech Rep. 24.918911 13.994751 4.463644 5.911778 9.567855 3.777658 0.184778 3.138631 4.260203 1.915438
Denmark 23.792840 11.602391 3.084761 5.723473 13.348603 2.407581 0.256110 3.679893 4.207363 2.263796
Estonia 26.162181 9.950300 8.213290 5.004174 12.725499 2.276930 0.286110 3.270333 3.912231 1.045696
Finland 23.634688 9.744582 3.976695 5.772572 13.144150 4.252419 0.263098 3.920313 3.730104 2.066009
France 25.318437 10.305455 3.629688 6.136417 9.994900 6.352708 0.208694 3.843561 3.621303 1.983556
Germany 24.497786 11.300840 4.618283 5.828546 11.196383 4.494917 0.226327 3.575605 3.866990 1.884125
Greece 32.390331 9.396033 4.940451 5.245931 10.140811 2.051066 0.263410 3.431617 4.366631 1.168737
Hungary 30.065936 15.521221 5.385899 5.709550 6.251035 2.729456 0.148707 2.548894 4.759854 1.335305
Iceland 20.152357 12.832275 4.525541 5.324398 16.422357 -0.009238 0.293543 3.438680 4.383032 2.315855
Ireland 22.857148 12.181181 4.516112 5.772814 12.312995 3.538208 0.233584 3.501006 4.000899 2.057382
Israel 22.744440 13.958059 5.044240 6.083814 9.474277 5.297130 0.172329 3.211816 3.923304 1.968020
Italy 26.316282 11.161471 6.157831 5.992407 7.931905 6.989459 0.181128 3.405235 3.546747 1.409696
Japan 23.988724 12.147866 14.551232 5.500557 5.390162 9.487348 0.146917 2.497388 2.778810 -0.053480
Korea 25.200082 8.890300 13.061711 6.462246 1.292533 16.507169 0.084039 3.270790 1.690815 0.043556
Luxembourg 25.478326 12.194399 3.995312 6.241093 8.293326 6.527392 0.169839 3.506822 3.779329 1.918078
Mexico 12.129994 7.334733 8.498550 5.827617 17.399875 8.487836 0.304439 4.408436 2.134407 2.067437
Netherlands 27.255134 11.627392 3.626408 5.627193 11.312093 2.362268 0.239786 3.462448 4.361955 1.872382
New Zealand 19.713997 13.160483 3.830496 5.993047 13.157271 3.901573 0.223784 3.570956 3.930767 2.459085
Norway 20.397995 12.321463 2.861589 5.650850 15.821075 1.204606 0.279154 3.715797 4.274383 2.605910
Poland 30.878927 12.220196 6.453041 5.295536 8.575377 2.211490 0.217438 2.898333 4.478015 1.026882
Portugal 20.372164 14.314427 8.867988 5.563813 10.928376 4.350881 0.199566 2.823879 3.764091 1.426250
Slovak Rep. 23.239857 15.791968 6.366330 5.879592 8.665737 4.120755 0.156731 2.734135 4.215880 1.698084
Slovenia 24.505298 13.265975 7.323272 5.484049 10.397215 3.328272 0.212850 2.948782 4.075912 1.384766
Spain 26.627555 13.795004 5.326583 5.852600 8.244807 4.228854 0.173906 3.010123 4.223950 1.599199
Sweden 15.128844 11.277638 3.090824 5.752034 18.736412 2.221103 0.309143 4.109782 3.762220 2.976007
Switzerland 21.394659 10.032613 3.618807 5.945846 13.697881 4.907488 0.257549 4.028607 3.581372 2.318812
Turkey 38.553730 7.116832 8.595589 4.833684 6.696434 3.417367 0.258927 3.145521 4.083622 -0.093331
United Kingdom 23.180272 10.678515 3.779165 5.779330 13.200065 3.703044 0.256456 3.804340 3.882012 2.157179
United States 29.306034 9.253495 2.224565 5.987948 9.854582 4.955702 0.229198 3.951821 3.966056 1.929831
OECD 26.064646 10.789763 6.289244 5.785407 9.350852 5.810476 0.209133 3.433306 3.640212 1.410508

In [25]:
help(ho.nipalsPCA.X_predCal)


Help on function X_predCal in module hoggorm.pca:

X_predCal(self)
    Returns a dictionary holding the predicted arrays Xhat from
    calibration after each computed component. Dictionary key represents
    order of component.
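
The predicted arrays can be understood as the column means plus the product of the first k score and loading vectors. The sketch below shows this on random stand-in data (SVD in place of NIPALS, helper name `predict_cal` is hypothetical) and then checks the idea against numbers printed earlier in this notebook.

```python
import numpy as np

# Random stand-in data with the same shape as the cancer data
rng = np.random.default_rng(3)
X = rng.normal(size=(35, 10))
mean = X.mean(axis=0)
Xc = X - mean

# PCA via SVD (stand-in for NIPALS)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:4].T
T = Xc @ P

def predict_cal(k):
    """Reconstruct X from the column means and the first k components."""
    return mean + T[:, :k] @ P[:, :k].T

# Squared reconstruction error for 1, 2, 3 and 4 components
errors = [np.sum((X - predict_cal(k)) ** 2) for k in range(1, 5)]

# Check against the notebook's own numbers (PC1 scores, lung loading,
# and the one-component predictions for Australia and Austria)
t_australia, t_austria = -4.804277, -1.491013
p_lung = 0.845199
diff_from_model = (t_australia - t_austria) * p_lung
diff_from_table = 20.264145 - 23.064511
```

The error shrinks as components are added. The difference between the one-component lung predictions for Australia and Austria matches the score difference times the loading up to rounding, which is what the reconstruction formula implies since the column mean cancels out.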


In [26]:
# Get validated explained variance of each component
valExplVar = model.X_valExplVar()

# Get validated explained variance and store in pandas dataframe with row and column names
valExplVar_df = pd.DataFrame(model.X_valExplVar())
valExplVar_df.columns = ['validated explained variance']
valExplVar_df.index = ['PC{0}'.format(x+1) for x in range(model.X_loadings().shape[1])]
valExplVar_df


Out[26]:
validated explained variance
PC1 41.228104
PC2 27.515848
PC3 3.577173
PC4 16.787521

In [27]:
help(ho.nipalsPCA.X_valExplVar)


Help on function X_valExplVar in module hoggorm.pca:

X_valExplVar(self)
    Returns a list holding the validated explained variance for X after
    each component. First number in list is for component 1, second number
    for component 2, third number for component 3, etc.
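
The idea behind validated explained variance with leave-one-out cross validation can be sketched as follows. This is a simplified illustration under stated assumptions (random stand-in data, SVD instead of NIPALS, the left-out row is projected onto the loadings of the model fitted without it); hoggorm's exact computation may differ, and the function name is hypothetical.

```python
import numpy as np

def loo_cum_validated_expl_var(X, num_comp):
    """Cumulative validated explained variance (percent) after 0..num_comp components."""
    press = np.zeros(num_comp + 1)      # prediction error sum of squares per model size
    for i in range(X.shape[0]):
        train = np.delete(X, i, axis=0)             # leave object i out
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        x_c = X[i] - mean                           # left-out row, centered with training mean
        for k in range(num_comp + 1):
            P = Vt[:k].T                            # loadings of the k-component model
            x_hat = (x_c @ P) @ P.T                 # prediction of the left-out row
            press[k] += np.sum((x_c - x_hat) ** 2)
    # Compare each model's error with the error of the zero-component model
    return 100.0 * (1.0 - press / press[0])

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(35, 10))
cum_val = loo_cum_validated_expl_var(X_demo, num_comp=4)
per_component_val = np.diff(cum_val)    # contribution of each component
```

Because every model is tested on an object it has not seen, validated values are typically lower than the calibrated ones, as is the case for the numbers in this notebook.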


In [28]:
# Get cumulative validated explained variance
cumValExplVar = model.X_cumValExplVar()

# Get cumulative validated explained variance and store in pandas dataframe with row and column names
cumValExplVar_df = pd.DataFrame(model.X_cumValExplVar())
cumValExplVar_df.columns = ['cumulative validated explained variance']
cumValExplVar_df.index = ['PC{0}'.format(x) for x in range(model.X_loadings().shape[1] + 1)]
cumValExplVar_df


Out[28]:
cumulative validated explained variance
PC0 0.000000
PC1 41.228104
PC2 68.743952
PC3 72.321125
PC4 89.108645

In [29]:
help(ho.nipalsPCA.X_cumValExplVar)


Help on function X_cumValExplVar in module hoggorm.pca:

X_cumValExplVar(self)
    Returns a list holding the cumulative validated explained variance
    for array X after each component.


In [30]:
# Get cumulative validated explained variance for each variable
cumValExplVar_ind = model.X_cumValExplVar_indVar()

# Get cumulative validated explained variance for each variable and store in pandas dataframe with row and column names
cumValExplVar_ind_df = pd.DataFrame(model.X_cumValExplVar_indVar())
cumValExplVar_ind_df.columns = data_varNames
cumValExplVar_ind_df.index = ['PC{0}'.format(x) for x in range(model.X_loadings().shape[1] + 1)]
cumValExplVar_ind_df


Out[30]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
PC0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
PC1 90.824851 -17.145967 -19.815931 -5.240408 62.379405 -10.961539 -9.643134 -5.138706 -6.705916 2.898408
PC2 95.904674 -17.188115 63.295538 -15.408414 84.244357 41.440343 -17.843876 -15.248051 28.054297 26.713650
PC3 97.104389 1.232204 67.630816 -10.748362 90.128127 40.872884 -17.770772 -23.139446 24.826724 34.032033
PC4 98.475855 88.493629 99.498470 -12.666695 91.756606 79.683951 -33.877348 -18.734551 24.589304 31.520312

In [31]:
help(ho.nipalsPCA.X_cumValExplVar_indVar)


Help on function X_cumValExplVar_indVar in module hoggorm.pca:

X_cumValExplVar_indVar(self)
    Returns an array holding the cumulative validated explained variance
    for each variable in X after each component. First row represents
    zero components, second row represents component 1, third row for
    compnent 2, etc. Columns represent variables.
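
Several of the validated values above are negative. Assuming the common definition of explained variance as 100 * (1 - PRESS / PRESS0), where PRESS0 is the error made when predicting with the mean only (the notebook does not state the formula, so this is an assumption), a negative value simply means that the model predicts that variable worse than the mean does. A toy example with made-up numbers:

```python
import numpy as np

# Made-up values for one variable
x_true = np.array([1.0, 2.0, 3.0])
x_mean_pred = np.array([2.0, 2.0, 2.0])    # "zero component" prediction: the mean
x_model_pred = np.array([3.0, 2.0, 1.0])   # a model that predicts poorly

press0 = np.sum((x_true - x_mean_pred) ** 2)    # error of the mean-only prediction
press = np.sum((x_true - x_model_pred) ** 2)    # error of the model prediction

val_expl_var = 100.0 * (1.0 - press / press0)   # negative when press > press0
```

For variables such as Pancreas, Hodgkins disease and Leukemia, the negative validated values therefore suggest that the components do not describe them in a way that generalises to left-out countries.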


In [32]:
# Get validated predicted X for a given number of components

# Predicted X from validation using 1 component
X_from_1_component_val = model.X_predVal()[1]

# Predicted X from validation using 1 component stored in pandas data frame with row and columns names
X_from_1_component_val_df = pd.DataFrame(model.X_predVal()[1])
X_from_1_component_val_df.index = data_objNames
X_from_1_component_val_df.columns = data_varNames
X_from_1_component_val_df


Out[32]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
Australia 20.346812 11.452455 6.105403 5.743157 13.435601 4.446251 0.254297 3.591813 3.628411 1.842250
Austria 23.098502 11.415033 5.860404 5.687173 11.820143 4.476746 0.236020 3.483532 3.763559 1.761308
Belgium 29.146256 11.573484 5.847552 5.763295 7.834264 4.924500 0.197677 3.203712 3.982234 1.300025
Canada 27.278466 11.477010 5.818147 5.737198 9.114227 4.747381 0.207564 3.290940 3.958260 1.435539
Chile 17.147842 11.842768 3.732281 5.903233 16.242627 3.460482 0.284343 3.913106 3.740709 2.648065
Czech Rep. 25.441970 11.365015 5.785915 5.689270 10.345341 4.651024 0.216961 3.392898 3.863856 1.589068
Denmark 22.877615 11.371955 5.918591 5.694592 11.791892 4.557505 0.234934 3.511302 3.712101 1.761700
Estonia 24.709957 11.466478 5.690023 5.727979 10.666493 4.650982 0.226537 3.407674 3.832053 1.656775
Finland 22.845354 11.447777 5.885934 5.644067 11.845279 4.517876 0.236904 3.525032 3.756839 1.760221
France 25.536053 11.486346 5.816234 5.733147 10.269548 4.560135 0.220550 3.383306 3.844408 1.594376
Germany 24.371904 11.431935 5.813820 5.687086 10.972379 4.582907 0.227475 3.432105 3.821834 1.673564
Greece 29.988759 11.760878 5.769006 5.773468 7.166377 5.209923 0.142857 3.133014 3.946455 1.240776
Hungary 30.432727 11.129622 5.663199 5.766419 7.208318 5.027657 0.193566 3.230453 4.123417 1.205169
Iceland 19.045072 11.176274 6.205299 5.830261 13.937207 4.701215 0.231198 3.703216 3.375083 2.111165
Ireland 22.690428 11.361906 5.886459 5.712189 11.996754 4.534695 0.235493 3.523132 3.750129 1.791230
Israel 24.018899 11.379755 5.802910 5.628857 11.336708 4.582601 0.224217 3.406475 3.751077 1.682347
Italy 27.184279 11.501753 5.653202 5.737154 9.332802 4.577990 0.206790 3.306558 3.912044 1.473370
Japan 26.228981 11.445194 5.375725 5.668874 10.046941 4.461135 0.220210 3.402577 3.946523 1.586448
Korea 27.730339 11.575703 5.162100 5.728087 9.321042 3.946935 0.211874 3.366191 4.084304 1.504239
Luxembourg 26.466206 11.445000 5.769489 5.710523 9.772187 4.603458 0.213747 3.299525 3.898795 1.499993
Mexico 14.823905 12.330597 4.708767 5.809731 18.140365 2.574636 0.209335 3.454833 3.783557 2.955884
Netherlands 26.186871 11.456603 5.805600 5.735989 9.772606 4.745666 0.215727 3.367532 3.897608 1.502747
New Zealand 20.191262 11.244010 6.045521 5.757741 13.613356 4.460347 0.253996 3.586785 3.639054 1.808438
Norway 19.463314 11.188321 6.226537 5.680406 13.654525 4.581480 0.258470 3.660036 3.508576 1.932281
Poland 29.921540 11.489156 5.578672 5.820119 7.389165 5.069093 0.189600 3.215776 3.985895 1.218740
Portugal 21.668124 11.251377 5.735740 5.751282 12.764150 4.426888 0.245906 3.578680 3.644499 1.923347
Slovak Rep. 24.701635 11.321178 5.759912 5.721322 10.900040 4.624149 0.221828 3.451396 3.837558 1.650205
Slovenia 24.702075 11.362842 5.717365 5.737422 10.754704 4.600925 0.225268 3.450790 3.812320 1.626453
Spain 27.234718 11.368016 5.706269 5.760888 9.221238 4.674359 0.208281 3.330966 3.842222 1.473031
Sweden 14.401049 10.978468 7.150693 5.492561 16.026316 4.742420 0.301095 3.862850 3.081189 2.231894
Switzerland 21.027227 11.414215 6.004838 5.688268 12.976067 4.399588 0.256229 3.586549 3.613891 1.897293
Turkey 34.708058 12.730771 4.438275 5.901250 3.193107 5.370252 0.138639 2.812942 4.581269 0.917090
United Kingdom 22.522778 11.427796 5.928451 5.740641 12.092165 4.533439 0.238070 3.524824 3.702613 1.815952
United States 28.256565 11.620894 5.875856 5.701849 8.446227 4.783065 0.198763 3.213984 4.013630 1.325812
OECD 26.332338 11.488461 5.689744 5.715559 9.784644 4.611067 0.214029 3.351589 3.909208 1.527713

In [33]:
# Get validated predicted X for a given number of components

# Predicted X from validation using 3 components
X_from_3_component_val = model.X_predVal()[3]

# Predicted X from validation using 3 components stored in pandas data frame with row and column names
X_from_3_component_val_df = pd.DataFrame(model.X_predVal()[3])
X_from_3_component_val_df.index = data_objNames
X_from_3_component_val_df.columns = data_varNames
X_from_3_component_val_df


Out[33]:
Trachea-bronchus-lung Colon, rectum and anus Stomach Pancreas Prostate Liver Hodgkins disease Leukemia Bladder Skin
Australia 20.764530 11.883816 4.351302 5.783851 14.210242 3.152660 0.263723 3.671416 3.953540 2.148731
Austria 22.479761 12.312983 5.506251 5.888952 10.919635 5.230857 0.210053 3.425437 3.753513 1.886682
Belgium 30.212489 11.532262 4.064188 5.660435 9.259651 2.843325 0.225970 3.310212 4.379648 1.592228
Canada 27.738536 12.108949 3.703828 5.823744 9.773033 3.415657 0.211989 3.369483 4.333096 1.835485
Chile 13.075733 9.646726 8.913721 6.307958 13.900808 10.670921 0.264012 4.068782 2.150252 1.824376
Czech Rep. 25.098286 12.488010 3.669759 5.987947 9.881577 4.755909 0.189882 3.458486 4.077980 2.054321
Denmark 23.748048 11.746771 3.217476 5.663136 13.257003 2.235640 0.257956 3.669418 4.224329 2.241527
Estonia 25.977504 8.716484 6.600062 5.281250 12.517250 3.840790 0.301266 3.609970 3.712430 1.346756
Finland 23.484804 11.287266 4.966832 5.517355 12.847988 3.179513 0.258285 3.601377 3.998405 1.893054
France 25.264965 12.140194 5.192369 5.874895 9.880548 4.711927 0.206250 3.367113 3.904670 1.750883
Germany 24.471358 11.755519 4.948300 5.738860 11.146549 4.152540 0.226706 3.461806 3.960749 1.843174
Greece 32.315134 10.583494 5.399023 5.196190 10.136267 1.745528 0.155814 3.129398 4.267369 1.200123
Hungary 30.378073 11.861468 3.760067 6.000479 7.095720 4.787683 0.181130 3.320748 4.341984 1.596543
Iceland 20.227861 10.997202 3.146196 5.781847 16.337650 1.570692 0.257341 3.968692 3.898049 2.707126
Ireland 22.875723 11.904691 4.342052 5.803498 12.342237 3.744655 0.234745 3.584067 4.003327 2.092452
Israel 23.125626 12.641666 4.580049 5.917360 10.074275 6.055285 0.172077 3.325237 3.675352 1.985619
Italy 26.290041 11.955581 6.761001 5.914960 7.888150 6.328044 0.174375 3.190602 3.611133 1.326765
Japan 24.221279 9.685499 12.405529 5.304529 5.927250 11.321808 0.188055 3.218558 2.527670 0.234515
Korea 25.402589 11.882237 16.291011 5.836304 2.376307 8.385048 0.137746 2.959834 2.654578 -0.690457
Luxembourg 25.511190 13.184020 4.926473 6.071516 8.345538 5.603757 0.166709 3.127996 3.931822 1.760704
Mexico 12.391547 11.509568 10.399259 5.705564 16.763155 6.062712 0.149478 2.977442 2.868827 2.136482
Netherlands 27.264210 11.530376 3.557542 5.658308 11.337331 2.464355 0.244981 3.511832 4.382769 1.867873
New Zealand 19.847536 12.615937 3.612401 6.126305 13.298844 4.234944 0.231075 3.626668 3.960792 2.228559
Norway 20.350495 11.679687 2.525859 5.656758 15.641944 1.497086 0.292522 3.893807 4.186937 2.574396
Poland 30.863768 10.683579 5.082660 5.630259 8.707421 3.693953 0.222339 3.322107 4.139811 1.234604
Portugal 20.773941 11.319348 7.082734 5.894732 11.475380 6.271428 0.221338 3.500289 3.290746 1.728424
Slovak Rep. 23.984321 11.900879 4.980590 6.053141 9.861055 6.019048 0.187696 3.512518 3.751198 1.862199
Slovenia 24.623444 11.325025 5.955342 5.737772 10.630254 4.798362 0.223524 3.438775 3.767088 1.584056
Spain 26.775612 12.226788 4.524750 6.043729 8.546323 5.138220 0.183747 3.359587 3.866305 1.762706
Sweden 14.995524 11.542715 3.453728 5.431998 18.189938 1.583005 0.338882 4.099645 3.736133 2.945017
Switzerland 21.299826 11.698535 4.847546 5.712640 13.449388 3.570984 0.262217 3.640261 3.818733 2.111408
Turkey 34.685849 13.269931 5.350938 5.761327 3.089725 4.889617 0.138391 2.615668 4.655886 0.742734
United Kingdom 23.135092 11.563119 4.402602 5.709889 13.093267 3.064259 0.254849 3.616245 4.011357 2.085617
United States 29.030028 11.865958 4.045997 5.663223 9.517251 3.071514 0.215477 3.298146 4.397172 1.636963
OECD 26.029230 11.328335 6.665261 5.712566 9.287666 5.399807 0.207244 3.301529 3.726988 1.353495

In [34]:
help(ho.nipalsPCA.X_predVal)


Help on function X_predVal in module hoggorm.pca:

X_predVal(self)
    Returns a dictionary holding the predicted arrays Xhat from
    validation after each computed component. Dictionary key represents
    order of component.
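
To make the idea of a validated prediction of X more concrete, the cell below is a minimal sketch of leave-one-out reconstruction written with plain numpy. It is not hoggorm's implementation: hoggorm computes its components with NIPALS and uses whatever cross-validation and standardisation settings the model was created with, whereas this sketch uses an SVD, mean centring only, and a simple row-wise scheme. The function name loo_reconstruct and the small data set X_line are hypothetical and only serve the illustration.

```python
import numpy as np


def loo_reconstruct(X, num_comp):
    """Leave-one-out reconstruction of every row of X (hypothetical helper).

    For each row: fit a PCA on all other rows, project the held-out row
    onto the first num_comp loadings and map it back to variable space.
    """
    X = np.asarray(X, dtype=float)
    X_hat = np.empty_like(X)
    for i in range(X.shape[0]):
        # Training data without the held-out row
        train = np.delete(X, i, axis=0)
        mean = train.mean(axis=0)
        # Loadings from an SVD of the centred training data
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        P = Vt[:num_comp].T                  # shape (variables, components)
        # Scores of the held-out row, then reconstruction in original units
        scores = (X[i] - mean) @ P
        X_hat[i] = scores @ P.T + mean
    return X_hat


# Hypothetical data: 5 objects on a straight line in 3-dimensional space
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X_line = np.outer(t, [1.0, 2.0, 3.0]) + 1.0

# One component is enough to reconstruct data with one underlying direction
X_line_hat = loo_reconstruct(X_line, num_comp=1)
```

Because every object is predicted by a model that never saw it, values reconstructed this way usually deviate more from the raw data than calibrated predictions do, which is what makes them useful for judging how many components are worth keeping.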


In [35]:
# Get predicted scores for new measurements (objects) of X

# First pretend that we acquired new X data by taking the first four rows of the existing data and adding uniform random noise in [0, 1)
import numpy.random as npr
new_data = data[0:4, :] + npr.rand(4, np.shape(data)[1])
np.shape(new_data)

# Now insert the new data into the existing model and compute scores for two components (numComp=2)
pred_scores = model.X_scores_predict(new_data, numComp=2)

# Same as above, but results stored in a pandas dataframe with row names and column names
pred_scores_df = pd.DataFrame(model.X_scores_predict(new_data, numComp=2))
pred_scores_df.columns = ['PC{0}'.format(x) for x in range(2)]
pred_scores_df.index = ['new object {0}'.format(x) for x in range(np.shape(new_data)[0])]
pred_scores_df


Out[35]:
PC0 PC1
new object 0 -4.744123 2.509074
new object 1 -1.288585 -0.648013
new object 2 6.164501 2.848108
new object 3 3.978922 2.262806

In [36]:
help(ho.nipalsPCA.X_scores_predict)


Help on function X_scores_predict in module hoggorm.pca:

X_scores_predict(self, Xnew, numComp=None)
    Returns array of X scores from new X data using the exsisting model.
    Rows represent objects and columns represent components.
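
As a rough illustration of what predicting scores for new objects involves, the sketch below projects new rows onto the loadings of an existing PCA model using plain numpy. It assumes mean centring only and obtains the loadings from an SVD rather than from NIPALS, so it is a conceptual sketch and not hoggorm's own code. All variable names and the small data arrays are hypothetical.

```python
import numpy as np

# Hypothetical calibration data: 6 objects (rows) x 3 variables (columns)
X = np.array([
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.5],
    [4.0, 7.9, 1.5],
    [5.0, 10.1, 1.0],
    [6.0, 12.0, 0.8],
    [7.0, 13.8, 1.2],
])

# Centre the calibration data and keep the column means for later use
x_mean = X.mean(axis=0)
Xc = X - x_mean

# Loadings from an SVD of the centred data (columns of P are components)
num_comp = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:num_comp].T            # shape (variables, components)
T = Xc @ P                     # calibration scores

# New objects are centred with the calibration means, not their own means,
# and then multiplied by the loadings of the existing model
X_new = np.array([
    [4.5, 9.0, 1.1],
    [8.0, 16.0, 0.9],
])
T_new = (X_new - x_mean) @ P   # shape (new objects, components)
```

The important design point is that the new data never influence the model: both the centring vector and the loadings come from the calibration data, so the new scores land in the same coordinate system as the original scores plot.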