RV and RV2 coefficient on Sensory and Fluorescence data

This notebook illustrates how to use the hoggorm package to compute the RV and RV2 matrix correlation coefficients between two blocks of multivariate data, here fluorescence spectra and sensory profiles measured on the same cheese samples.


Import packages and prepare data

First import hoggorm for analysis of the data and hoggormPlot for plotting of the analysis results. We'll also import pandas so that we can read the data into a data frame. numpy is needed for checking the dimensions of the data.


In [2]:
import hoggorm as ho
import hoggormplot as hop
import pandas as pd
import numpy as np

Next, load the data that we are going to analyse using hoggorm. After the data has been loaded into the pandas data frame, we'll display it in the notebook.


In [3]:
# Load fluorescence data
X_df = pd.read_csv('cheese_fluorescence.txt', index_col=0, sep='\t')
X_df


Out[3]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V283 V284 V285 V286 V287 V288 V289 V290 V291 V292
Pr 1 19222.109 19937.834 20491.777 20994.000 21427.500 21915.891 22273.834 22750.279 23215.609 23497.221 ... 1338.0557 1311.9445 1275.1666 1235.7777 1204.6666 1184.944500 1140.500000 1109.888800 1099.666600 1070.500000
Pr 2 18965.945 19613.334 20157.277 20661.557 21167.334 21554.057 22031.391 22451.889 22915.334 23311.611 ... 1244.5555 1217.1666 1183.9445 1156.5000 1130.0555 1084.000000 1066.500000 1039.944500 1018.500000 992.083313
Pr 3 19698.221 20438.279 21124.721 21740.666 22200.445 22709.725 23222.111 23646.225 24047.389 24519.111 ... 1409.5000 1366.9445 1319.8888 1289.7778 1258.2223 1235.166600 1200.611000 1173.277800 1126.555700 1097.250000
Pr 4 20037.334 20841.779 21510.889 22096.443 22605.889 23077.834 23547.725 23974.445 24490.889 24896.945 ... 1374.5000 1332.3334 1287.5000 1252.9445 1228.8334 1195.944300 1159.166600 1153.611200 1117.222300 1088.333400
Pr 5 19874.889 20561.834 21248.500 21780.889 22328.834 22812.057 23266.111 23723.334 24171.221 24601.943 ... 1329.0000 1291.9445 1256.7778 1226.6110 1209.7777 1169.888800 1144.555500 1123.333400 1084.888800 1081.500000
Pr 6 19529.391 20157.834 20847.500 21308.111 21716.443 22165.775 22583.166 22993.779 23520.779 24015.221 ... 1737.3888 1696.5000 1635.5000 1580.3334 1556.8334 1501.222200 1463.555500 1419.277800 1365.388800 1343.416600
Pr 7 18795.582 19485.582 20139.584 20644.668 21013.668 21480.668 21873.666 22302.418 22662.500 23097.000 ... 1323.3333 1286.9167 1261.0000 1235.0833 1190.0833 1174.666700 1129.166700 1095.416600 1070.416600 1049.500000
Pr 8 20052.943 20839.445 21569.221 22150.221 22662.389 23160.389 23589.943 24117.500 24484.334 24971.666 ... 1140.2778 1113.1112 1075.8334 1055.7778 1037.1112 1025.777800 986.277832 969.388855 944.944397 936.083313
Pr 9 19001.391 19709.943 20368.443 20939.111 21383.111 21879.111 22335.221 22758.834 23213.443 23688.891 ... 1119.1666 1076.7777 1045.3888 1033.1112 1021.3333 994.222229 962.111084 943.000000 920.166687 899.083313
Pr 10 20602.834 21406.389 22144.611 22775.000 23407.443 23940.609 24486.111 24976.275 25480.779 25966.279 ... 1248.2777 1226.7778 1195.0000 1169.5000 1135.9445 1120.888800 1069.555500 1062.833400 1034.722200 1016.750000
Pr 11 20116.443 20880.611 21584.834 22137.775 22667.166 23144.557 23592.889 24122.225 24518.221 25007.000 ... 1237.1112 1196.5000 1164.8334 1152.7223 1118.1666 1104.277800 1057.555700 1046.666600 1021.611100 1007.166700
Pr 12 20282.721 21016.500 21678.279 22241.555 22751.779 23257.945 23730.000 24221.221 24638.834 25100.055 ... 1192.2778 1177.8334 1130.8889 1121.0555 1099.6112 1068.722200 1053.277700 1034.388900 993.444397 992.583313
Pr 13 19508.000 20124.445 20701.057 21145.500 21529.389 21974.389 22338.834 22726.611 23156.000 23600.000 ... 1710.4445 1675.3334 1589.2778 1568.9445 1515.2778 1480.611200 1424.500000 1404.777800 1358.333400 1334.250000
Pr 14 18739.391 19444.275 20072.555 20603.500 21035.389 21470.834 21912.889 22356.279 22747.225 23205.889 ... 1158.2778 1155.4445 1102.9443 1081.4445 1060.2778 1044.388800 999.722229 985.222168 954.722229 935.083313

14 rows × 292 columns


In [4]:
# Load sensory data
Y_df = pd.read_csv('cheese_sensory.txt', index_col=0, sep='\t')
Y_df


Out[4]:
Att 01 Att 02 Att 03 Att 04 Att 05 Att 06 Att 07 Att 08 Att 09 Att 10 Att 11 Att 12 Att 13 Att 14 Att 15 Att 16 Att 17
Product
Pr 01 6.19 3.33 3.43 2.14 1.29 3.11 6.70 3.22 2.66 5.10 4.57 3.34 2.93 1.89 1.23 3.15 4.07
Pr 02 6.55 2.50 4.32 2.52 1.24 3.91 6.68 2.57 2.42 4.87 4.75 4.13 3.09 2.29 1.51 3.93 4.07
Pr 03 6.23 3.43 3.42 2.03 1.28 2.93 6.61 3.39 2.56 5.00 4.73 3.44 3.08 1.81 1.37 3.19 4.16
Pr 04 6.14 2.93 3.96 2.13 1.08 3.12 6.51 2.98 2.50 4.66 4.68 3.92 2.93 1.99 1.19 3.13 4.29
Pr 05 6.70 1.97 4.72 2.43 1.13 4.60 7.01 2.07 2.32 5.29 5.19 4.52 3.14 2.47 1.34 4.67 4.03
Pr 06 6.19 5.28 1.59 1.07 1.00 1.13 6.42 5.18 2.82 5.02 4.49 2.05 2.54 1.18 1.18 1.29 4.11
Pr 07 6.17 3.45 3.32 2.04 1.47 2.69 6.39 3.81 2.76 4.58 4.32 3.22 2.72 1.81 1.33 2.52 4.26
Pr 08 6.90 2.58 4.24 2.58 1.70 4.19 7.11 2.06 2.47 4.58 5.09 4.44 3.25 2.62 1.73 4.87 3.98
Pr 09 6.70 2.53 4.53 2.32 1.22 4.16 6.91 2.42 2.41 4.52 4.96 4.49 3.37 2.47 1.64 4.54 4.01
Pr 10 6.35 3.14 3.64 2.17 1.17 2.57 6.50 2.77 2.66 4.76 4.64 4.06 3.11 2.21 1.46 3.35 3.93
Pr 11 5.97 3.34 3.46 1.67 1.15 1.43 6.31 3.15 2.56 4.57 4.36 3.65 2.66 1.56 1.19 2.23 4.01
Pr 12 6.29 2.99 4.03 2.06 1.17 3.06 6.76 2.37 2.44 4.69 4.97 4.28 3.16 2.56 1.53 4.23 4.03
Pr 13 5.91 4.88 2.04 1.00 1.00 1.08 6.34 4.79 2.44 5.48 4.54 1.98 2.57 1.00 1.03 1.03 4.16
Pr 14 6.75 1.91 4.36 2.95 1.43 4.83 7.14 1.53 2.47 4.72 5.06 4.54 3.43 2.80 1.87 5.65 3.98

The RVcoeff and RV2coeff functions in hoggorm accept only numpy arrays with numerical values and not pandas data frames. Therefore, each of the two pandas data frames holding the imported data needs to be "taken apart" into three parts, which gives:

  • two numpy arrays holding the numeric values
  • two Python lists holding variable (column) names
  • two Python lists holding object (row) names.

In [5]:
# Get the values from the data frame
X = X_df.values
Y = Y_df.values

# Get the variable (column) names
X_varNames = list(X_df.columns)
Y_varNames = list(Y_df.columns)

# Get the object or row names
X_objNames = list(X_df.index)
Y_objNames = list(Y_df.index)
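
Because RV and RV2 compare arrays object by object, it can be worth confirming that both arrays have the same number of rows before going further. The following is a minimal sketch on two small made-up data frames; the names `X_demo_df` and `Y_demo_df` and all values in them are hypothetical stand-ins, not the cheese data. The same checks could be run on `X`, `Y`, `X_objNames` and `Y_objNames` above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 4 objects, measured with 3 and 2 variables
X_demo_df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [0.5, 0.2, 0.1], [3.0, 3.5, 4.0]],
    index=['Pr 1', 'Pr 2', 'Pr 3', 'Pr 4'],
    columns=['V1', 'V2', 'V3'])
Y_demo_df = pd.DataFrame(
    [[6.1, 3.3], [6.5, 2.5], [6.2, 3.4], [6.1, 2.9]],
    index=['Pr 1', 'Pr 2', 'Pr 3', 'Pr 4'],
    columns=['Att 01', 'Att 02'])

X_demo = X_demo_df.values
Y_demo = Y_demo_df.values

# The number of objects (rows) must agree; the number of variables may differ
n_obj_X, n_var_X = np.shape(X_demo)
n_obj_Y, n_var_Y = np.shape(Y_demo)
same_n_objects = n_obj_X == n_obj_Y

# Row names can be compared as well, provided both sources label objects identically
same_obj_names = list(X_demo_df.index) == list(Y_demo_df.index)

print(np.shape(X_demo), np.shape(Y_demo), same_n_objects, same_obj_names)
```

Note that in the cheese data the two files label the products differently ("Pr 1" in the fluorescence table, "Pr 01" in the sensory table), so a strict comparison of row names would report a mismatch there even though the rows appear to list the products in the same order.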

Apply RV and RV2 to our data

Now, let's apply the RV and RV2 matrix correlation coefficient methods to the data. First, a description of the input parameters: the functions take a Python list as input, which may contain two or more arrays measured on the same objects, and compute the RV or RV2 matrix correlation coefficient between each pair of arrays. The number and order of objects (rows) must match across all arrays, whereas the number of variables (columns) may differ from array to array. The RV coefficient takes values 0 <= RV <= 1. The RV2 coefficient is a modified version of the RV coefficient with values -1 <= RV2 <= 1. RV2 is independent of object and variable size.
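
To make the two coefficients concrete, here is a minimal numpy sketch based on the commonly cited definitions, not on hoggorm's own source code. For column-centred arrays $X$ and $Y$ with matching rows, $RV(X, Y) = \mathrm{tr}(XX^T YY^T) / \sqrt{\mathrm{tr}((XX^T)^2)\,\mathrm{tr}((YY^T)^2)}$, and RV2 uses the same expression after the diagonals of $XX^T$ and $YY^T$ have been set to zero. The function names `rv_sketch`, `rv2_sketch` and `pairwise` are hypothetical, and the input arrays are random numbers used purely for illustration.

```python
import numpy as np

def rv_sketch(A, B):
    """RV coefficient between two column-centred arrays with matching rows."""
    AA = A @ A.T
    BB = B @ B.T
    return np.trace(AA @ BB) / np.sqrt(np.trace(AA @ AA) * np.trace(BB @ BB))

def rv2_sketch(A, B):
    """Modified RV: same expression after zeroing the diagonals."""
    AA = A @ A.T
    BB = B @ B.T
    np.fill_diagonal(AA, 0.0)
    np.fill_diagonal(BB, 0.0)
    return np.trace(AA @ BB) / np.sqrt(np.trace(AA @ AA) * np.trace(BB @ BB))

def pairwise(arrays, coeff):
    """Symmetric matrix of coefficients for every pair of arrays in a list."""
    n = len(arrays)
    result = np.ones((n, n))  # an array compared with itself gives 1
    for i in range(n):
        for j in range(i + 1, n):
            result[i, j] = result[j, i] = coeff(arrays[i], arrays[j])
    return result

# Three random arrays with 14 rows each and different numbers of columns
rng = np.random.default_rng(0)
arrays = [rng.normal(size=(14, p)) for p in (5, 3, 8)]
arrays = [a - a.mean(axis=0) for a in arrays]  # centre the columns first

rv_matrix = pairwise(arrays, rv_sketch)
rv2_matrix = pairwise(arrays, rv2_sketch)
print(rv_matrix)
print(rv2_matrix)
```

With three arrays in the list the result is a 3 x 3 symmetric matrix, which mirrors the "two or more arrays" behaviour described above.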

Preprocessing the data

Arrays need to be preprocessed before computing RV and RV2. More precisely, the arrays need to be either centred or standardised/scaled.


In [9]:
# Center data first
X_cent = ho.center(X_df.values, axis=0)
Y_cent = ho.center(Y_df.values, axis=0)

In [8]:
X_cent


Out[8]:
array([[-379.83342857, -380.60057143, -482.455     , ...,   -1.61318736,
          20.33922786,   10.2440525 ],
       [-635.99742857, -705.10057143, -816.955     , ...,  -71.55748736,
         -60.82737214,  -68.1726345 ],
       [  96.27857143,  119.84442857,  150.489     , ...,   61.77581264,
          47.22832786,   36.9940525 ],
       ...,
       [ 680.77857143,  698.06542857,  704.047     , ...,  -77.11308736,
         -85.88297514,  -67.6726345 ],
       [ -93.94242857, -193.98957143, -273.175     , ...,  293.27581264,
         279.00602786,  273.9940525 ],
       [-862.55142857, -874.15957143, -901.677     , ..., -126.27981936,
        -124.60514314, -125.1726345 ]])

In [10]:
Y_cent


Out[10]:
array([[-1.70000000e-01,  1.68571429e-01, -2.17142857e-01,
         6.07142857e-02,  5.21428571e-02,  5.21428571e-02,
         2.92857143e-02,  1.97857143e-01,  1.25000000e-01,
         2.54285714e-01, -1.69285714e-01, -3.78571429e-01,
        -6.85714286e-02, -1.57142857e-01, -1.70000000e-01,
        -2.62857143e-01, -7.85714286e-03],
       [ 1.90000000e-01, -6.61428571e-01,  6.72857143e-01,
         4.40714286e-01,  2.14285714e-03,  8.52142857e-01,
         9.28571429e-03, -4.52142857e-01, -1.15000000e-01,
         2.42857143e-02,  1.07142857e-02,  4.11428571e-01,
         9.14285714e-02,  2.42857143e-01,  1.10000000e-01,
         5.17142857e-01, -7.85714286e-03],
       [-1.30000000e-01,  2.68571429e-01, -2.27142857e-01,
        -4.92857143e-02,  4.21428571e-02, -1.27857143e-01,
        -6.07142857e-02,  3.67857143e-01,  2.50000000e-02,
         1.54285714e-01, -9.28571429e-03, -2.78571429e-01,
         8.14285714e-02, -2.37142857e-01, -3.00000000e-02,
        -2.22857143e-01,  8.21428571e-02],
       [-2.20000000e-01, -2.31428571e-01,  3.12857143e-01,
         5.07142857e-02, -1.57857143e-01,  6.21428571e-02,
        -1.60714286e-01, -4.21428571e-02, -3.50000000e-02,
        -1.85714286e-01, -5.92857143e-02,  2.01428571e-01,
        -6.85714286e-02, -5.71428571e-02, -2.10000000e-01,
        -2.82857143e-01,  2.12142857e-01],
       [ 3.40000000e-01, -1.19142857e+00,  1.07285714e+00,
         3.50714286e-01, -1.07857143e-01,  1.54214286e+00,
         3.39285714e-01, -9.52142857e-01, -2.15000000e-01,
         4.44285714e-01,  4.50714286e-01,  8.01428571e-01,
         1.41428571e-01,  4.22857143e-01, -6.00000000e-02,
         1.25714286e+00, -4.78571429e-02],
       [-1.70000000e-01,  2.11857143e+00, -2.05714286e+00,
        -1.00928571e+00, -2.37857143e-01, -1.92785714e+00,
        -2.50714286e-01,  2.15785714e+00,  2.85000000e-01,
         1.74285714e-01, -2.49285714e-01, -1.66857143e+00,
        -4.58571429e-01, -8.67142857e-01, -2.20000000e-01,
        -2.12285714e+00,  3.21428571e-02],
       [-1.90000000e-01,  2.88571429e-01, -3.27142857e-01,
        -3.92857143e-02,  2.32142857e-01, -3.67857143e-01,
        -2.80714286e-01,  7.87857143e-01,  2.25000000e-01,
        -2.65714286e-01, -4.19285714e-01, -4.98571429e-01,
        -2.78571429e-01, -2.37142857e-01, -7.00000000e-02,
        -8.92857143e-01,  1.82142857e-01],
       [ 5.40000000e-01, -5.81428571e-01,  5.92857143e-01,
         5.00714286e-01,  4.62142857e-01,  1.13214286e+00,
         4.39285714e-01, -9.62142857e-01, -6.50000000e-02,
        -2.65714286e-01,  3.50714286e-01,  7.21428571e-01,
         2.51428571e-01,  5.72857143e-01,  3.30000000e-01,
         1.45714286e+00, -9.78571429e-02],
       [ 3.40000000e-01, -6.31428571e-01,  8.82857143e-01,
         2.40714286e-01, -1.78571429e-02,  1.10214286e+00,
         2.39285714e-01, -6.02142857e-01, -1.25000000e-01,
        -3.25714286e-01,  2.20714286e-01,  7.71428571e-01,
         3.71428571e-01,  4.22857143e-01,  2.40000000e-01,
         1.12714286e+00, -6.78571429e-02],
       [-1.00000000e-02, -2.14285714e-02, -7.14285714e-03,
         9.07142857e-02, -6.78571429e-02, -4.87857143e-01,
        -1.70714286e-01, -2.52142857e-01,  1.25000000e-01,
        -8.57142857e-02, -9.92857143e-02,  3.41428571e-01,
         1.11428571e-01,  1.62857143e-01,  6.00000000e-02,
        -6.28571429e-02, -1.47857143e-01],
       [-3.90000000e-01,  1.78571429e-01, -1.87142857e-01,
        -4.09285714e-01, -8.78571429e-02, -1.62785714e+00,
        -3.60714286e-01,  1.27857143e-01,  2.50000000e-02,
        -2.75714286e-01, -3.79285714e-01, -6.85714286e-02,
        -3.38571429e-01, -4.87142857e-01, -2.10000000e-01,
        -1.18285714e+00, -6.78571429e-02],
       [-7.00000000e-02, -1.71428571e-01,  3.82857143e-01,
        -1.92857143e-02, -6.78571429e-02,  2.14285714e-03,
         8.92857143e-02, -6.52142857e-01, -9.50000000e-02,
        -1.55714286e-01,  2.30714286e-01,  5.61428571e-01,
         1.61428571e-01,  5.12857143e-01,  1.30000000e-01,
         8.17142857e-01, -4.78571429e-02],
       [-4.50000000e-01,  1.71857143e+00, -1.60714286e+00,
        -1.07928571e+00, -2.37857143e-01, -1.97785714e+00,
        -3.30714286e-01,  1.76785714e+00, -9.50000000e-02,
         6.34285714e-01, -1.99285714e-01, -1.73857143e+00,
        -4.28571429e-01, -1.04714286e+00, -3.70000000e-01,
        -2.38285714e+00,  8.21428571e-02],
       [ 3.90000000e-01, -1.25142857e+00,  7.12857143e-01,
         8.70714286e-01,  1.92142857e-01,  1.77214286e+00,
         4.69285714e-01, -1.49214286e+00, -6.50000000e-02,
        -1.25714286e-01,  3.20714286e-01,  8.21428571e-01,
         4.31428571e-01,  7.52857143e-01,  4.70000000e-01,
         2.23714286e+00, -9.78571429e-02]])
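
As a sanity check on what column-wise centring does, the first column of the centred sensory array can be reproduced by hand. The sketch below assumes that centring with `axis=0` amounts to subtracting each column's mean; it uses the Att 01 values from the sensory table shown earlier, and the variable names are hypothetical.

```python
import numpy as np

# Att 01 values for Pr 01 to Pr 14, copied from the sensory table
att01 = np.array([6.19, 6.55, 6.23, 6.14, 6.70, 6.19, 6.17,
                  6.90, 6.70, 6.35, 5.97, 6.29, 5.91, 6.75]).reshape(-1, 1)

# Column-wise centring: subtract the mean of each column (axis=0)
att01_cent = att01 - att01.mean(axis=0)

# After centring, every column mean should be zero up to rounding error
col_mean_after = att01_cent.mean(axis=0)

print(att01_cent[:2, 0])
print(col_mean_after)
```

The first two centred values come out at approximately -0.17 and 0.19, in agreement with the first entries of rows one and two of `Y_cent`.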

Now that both arrays are centred, we store them in a list and pass that list to the RV or RV2 matrix correlation coefficient function, as shown below. Note that the list can contain two or more arrays. The function returns an array holding the RV coefficients for all pair-wise combinations of arrays.


In [19]:
rv_results_cent = ho.RVcoeff([X_cent, Y_cent])

In [23]:
rv_results_cent


Out[23]:
array([[1.        , 0.24142324],
       [0.24142324, 1.        ]])

The RV results are returned in a new array, as seen above. On the diagonal the RV is 1, since there we compute $RV(X_{cent}, X_{cent}) = 1$ and $RV(Y_{cent}, Y_{cent}) = 1$: each array is compared with itself, so the information in the two matrices is identical. Correspondingly, $RV(X_{cent}, Y_{cent}) = 0.24142324$ at index [0, 1] and $RV(Y_{cent}, X_{cent}) = 0.24142324$ at index [1, 0].
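
The layout of the result can be checked with ordinary numpy indexing. The short sketch below rebuilds the array from the values printed in Out[23] rather than calling hoggorm, so it runs on its own; with hoggorm available, the same lines should work directly on the returned array.

```python
import numpy as np

# Values copied from the output above
rv_results_cent = np.array([[1.0, 0.24142324],
                            [0.24142324, 1.0]])

# Position [i, j] holds the coefficient between array i and array j of the input list
rv_xy = rv_results_cent[0, 1]  # first array (X) against second array (Y)
rv_yx = rv_results_cent[1, 0]  # second array (Y) against first array (X)

# The matrix is symmetric and has ones on its diagonal
is_symmetric = np.allclose(rv_results_cent, rv_results_cent.T)
diag_is_one = np.allclose(np.diag(rv_results_cent), 1.0)

print(rv_xy, rv_yx, is_symmetric, diag_is_one)
```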

Now the corresponding computation using the RV2 coefficient.


In [24]:
rv2_results_cent = ho.RV2coeff([X_cent, Y_cent])

In [25]:
rv2_results_cent


Out[25]:
array([[1.       , 0.1855865],
       [0.1855865, 1.       ]])

Now repeat the same computations with standardised arrays, in which every variable is given the same weight.


In [30]:
# Standardise data first
X_stand = ho.standardise(X_df.values, mode=0)
Y_stand = ho.standardise(Y_df.values, mode=0)

In [26]:
rv_results_stand = ho.RVcoeff([X_stand, Y_stand])

In [27]:
rv_results_stand


Out[27]:
array([[1.        , 0.53160759],
       [0.53160759, 1.        ]])

In [28]:
rv2_results_stand = ho.RV2coeff([X_stand, Y_stand])

In [29]:
rv2_results_stand


Out[29]:
array([[1.        , 0.43897699],
       [0.43897699, 1.        ]])
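
Both coefficients change noticeably once the arrays are standardised. One plausible reason is that, with centring alone, variables with a large variance carry more weight in the comparison, whereas standardising puts all variables on the same scale. The sketch below illustrates this effect on random numbers, not on the cheese data. The helpers `rv_sketch`, `center_cols` and `standardise_cols` are hypothetical numpy stand-ins, and dividing by the sample standard deviation (`ddof=1`) is an assumption; the choice of `ddof` does not affect RV because the coefficient is unchanged when a whole array is multiplied by a constant.

```python
import numpy as np

def rv_sketch(A, B):
    """RV coefficient between two preprocessed arrays with matching rows."""
    AA = A @ A.T
    BB = B @ B.T
    return np.trace(AA @ BB) / np.sqrt(np.trace(AA @ AA) * np.trace(BB @ BB))

def center_cols(arr):
    """Subtract the column means."""
    return arr - arr.mean(axis=0)

def standardise_cols(arr):
    """Centre the columns, then divide by the column standard deviations."""
    return center_cols(arr) / arr.std(axis=0, ddof=1)

rng = np.random.default_rng(1)
A = rng.normal(size=(14, 4))
B = rng.normal(size=(14, 3))

# Same data, but with the first column of A expressed in much larger units
A_big = A.copy()
A_big[:, 0] *= 1000.0

rv_cent = rv_sketch(center_cols(A), center_cols(B))
rv_cent_big = rv_sketch(center_cols(A_big), center_cols(B))
rv_stand = rv_sketch(standardise_cols(A), standardise_cols(B))
rv_stand_big = rv_sketch(standardise_cols(A_big), standardise_cols(B))

print(rv_cent, rv_cent_big)    # centred only: rescaling a column changes RV
print(rv_stand, rv_stand_big)  # standardised: rescaling a column leaves RV unchanged
```

Which preprocessing is appropriate depends on whether differences in variance between variables are considered meaningful for the comparison.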