Clustering

In which we explore segmenting the frequent fliers of the GallacticHoppers program

Business Problem

  • InterGallactic Airlines runs the GallacticHoppers frequent flyer program and has data about the customers who participate in it.
  • The airline's execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy.
  • So the business wants to customize promotions for its frequent flyer program.
  • Can they just have one type of promotion?
  • Should they have different types of incentives?
  • Who exactly are the customers in the GallacticHoppers program?
  • Recently the airline deployed an infrastructure with Spark.
  • Can Spark help with this business problem?

In [1]:
import datetime
from pytz import timezone
print "Last run @%s" % (datetime.datetime.now(timezone('US/Pacific')))
#
from pyspark.context import SparkContext
print "Running Spark Version %s" % (sc.version)
#
from pyspark.conf import SparkConf
conf = SparkConf()
print conf.toDebugString()


Last run @2015-07-15 15:18:08.185957-07:00
Running Spark Version 1.4.1
spark.app.name=pyspark-shell
spark.files=file:/Users/ksankar/.ivy2/jars/com.databricks_spark-csv_2.11-1.0.3.jar,file:/Users/ksankar/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar
spark.jars=file:/Users/ksankar/.ivy2/jars/com.databricks_spark-csv_2.11-1.0.3.jar,file:/Users/ksankar/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar
spark.master=local[*]
spark.submit.pyFiles=/Users/ksankar/.ivy2/jars/com.databricks_spark-csv_2.11-1.0.3.jar,/Users/ksankar/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar

In [5]:
# Read Dataset
freq_df = sqlContext.read.format('com.databricks.spark.csv')\
            .options(header='true')\
            .load('freq-flyer/AirlinesCluster.csv')

In [7]:
freq_df.show(5)


+-------+---------+----------+----------+-----------+-----------+---------------+
|Balance|QualMiles|BonusMiles|BonusTrans|FlightMiles|FlightTrans|DaysSinceEnroll|
+-------+---------+----------+----------+-----------+-----------+---------------+
|  28143|        0|       174|         1|          0|          0|           7000|
|  19244|        0|       215|         2|          0|          0|           6968|
|  41354|        0|      4123|         4|          0|          0|           7034|
|  14776|        0|       500|         1|          0|          0|           6952|
|  97752|        0|     43300|        26|       2077|          4|           6935|
+-------+---------+----------+----------+-----------+-----------+---------------+


In [8]:
freq_df.count()


Out[8]:
3999

In [9]:
freq_df.dtypes


Out[9]:
[('Balance', 'string'),
 ('QualMiles', 'string'),
 ('BonusMiles', 'string'),
 ('BonusTrans', 'string'),
 ('FlightMiles', 'string'),
 ('FlightTrans', 'string'),
 ('DaysSinceEnroll', 'string')]
But the CSV reader has left every column as a string; we need a table of numbers.

In [16]:
from numpy import array
freq_rdd = freq_df.map(lambda row: array([float(x) for x in row]))

In [17]:
freq_rdd.take(3)


Out[17]:
[array([  2.81430000e+04,   0.00000000e+00,   1.74000000e+02,
          1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          7.00000000e+03]),
 array([  1.92440000e+04,   0.00000000e+00,   2.15000000e+02,
          2.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          6.96800000e+03]),
 array([  4.13540000e+04,   0.00000000e+00,   4.12300000e+03,
          4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          7.03400000e+03])]

In [18]:
from pyspark.mllib.clustering import KMeans
from math import sqrt

In [19]:
freq_rdd.first()
# Balance, TopStatusQualMiles, NonFlightMiles, NonFlightTrans, FlightMiles, FlightTrans, DaysSinceEnroll


Out[19]:
array([  2.81430000e+04,   0.00000000e+00,   1.74000000e+02,
         1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         7.00000000e+03])

In [20]:
help(KMeans.train)


Help on method train in module pyspark.mllib.clustering:

train(cls, rdd, k, maxIterations=100, runs=1, initializationMode='k-means||', seed=None, initializationSteps=5, epsilon=0.0001) method of __builtin__.type instance
    Train a k-means clustering model.


In [21]:
km_mdl_1 = KMeans.train(freq_rdd, 2, maxIterations=10,runs=10, initializationMode="random")

In [22]:
for x in km_mdl_1.clusterCenters:
        print "%10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (x[0],x[1],x[2],x[3],x[4],x[5],x[6])
# Balance, TopStatusQualMiles, NonFlightMiles, NonFlightTrans, FlightMiles, FlightTrans, DaysSinceEnroll


 51356.287    126.293  14976.641     10.936    389.559      1.151   4001.617
320111.326    341.607  41171.964     18.976   1241.275      3.843   5414.462

In [23]:
for x in freq_rdd.take(10):
    print x,km_mdl_1.predict(x)


[  2.81430000e+04   0.00000000e+00   1.74000000e+02   1.00000000e+00
   0.00000000e+00   0.00000000e+00   7.00000000e+03] 0
[  1.92440000e+04   0.00000000e+00   2.15000000e+02   2.00000000e+00
   0.00000000e+00   0.00000000e+00   6.96800000e+03] 0
[  4.13540000e+04   0.00000000e+00   4.12300000e+03   4.00000000e+00
   0.00000000e+00   0.00000000e+00   7.03400000e+03] 0
[  1.47760000e+04   0.00000000e+00   5.00000000e+02   1.00000000e+00
   0.00000000e+00   0.00000000e+00   6.95200000e+03] 0
[  9.77520000e+04   0.00000000e+00   4.33000000e+04   2.60000000e+01
   2.07700000e+03   4.00000000e+00   6.93500000e+03] 0
[ 16420.      0.      0.      0.      0.      0.   6942.] 0
[  8.49140000e+04   0.00000000e+00   2.74820000e+04   2.50000000e+01
   0.00000000e+00   0.00000000e+00   6.99400000e+03] 0
[  2.08560000e+04   0.00000000e+00   5.25000000e+03   4.00000000e+00
   2.50000000e+02   1.00000000e+00   6.93800000e+03] 0
[  4.43003000e+05   0.00000000e+00   1.75300000e+03   4.30000000e+01
   3.85000000e+03   1.20000000e+01   6.94800000e+03] 1
[  1.04860000e+05   0.00000000e+00   2.84260000e+04   2.80000000e+01
   1.15000000e+03   3.00000000e+00   6.93100000e+03] 0

In [24]:
def squared_error(mdl, point):
    # distance from a point to its closest cluster center
    # (note: this is the Euclidean distance, not the squared distance,
    # even though the sum below is labeled WSSSE)
    center = mdl.centers[mdl.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

In [25]:
WSSSE = freq_rdd.map(lambda point: squared_error(km_mdl_1,point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))


Within Set Sum of Squared Error = 191906529.055
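The WSSSE reduction above just sums, over all points, the distance to the nearest center. The same logic can be sketched without Spark in plain Python; the centers and points below are hypothetical toy values, not the notebook's data:

```python
from math import sqrt

# toy centers and points, for illustration only
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 11.0), (0.0, 2.0)]

def nearest_center(point):
    # index of the closest center, by Euclidean distance
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centers[i])))

def error(point):
    # distance from the point to its nearest center
    # (like squared_error above, this is the Euclidean distance,
    # not the squared distance, despite the WSSSE label)
    c = centers[nearest_center(point)]
    return sqrt(sum((p, ci) == () or (p - ci) ** 2 for p, ci in zip(point, c)))

def error(point):
    c = centers[nearest_center(point)]
    return sqrt(sum((p - ci) ** 2 for p, ci in zip(point, c)))

wssse = sum(error(p) for p in points)
```

This is what `freq_rdd.map(...).reduce(...)` computes in a distributed fashion.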

In [26]:
from pyspark.mllib.stat import Statistics
summary = Statistics.colStats(freq_rdd)
print summary.mean()


[  7.36013276e+04   1.44114529e+02   1.71448462e+04   1.16019005e+01
   4.60055764e+02   1.37359340e+00   4.11855939e+03]

In [27]:
print "Mean : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (summary.mean()[0],summary.mean()[1],summary.mean()[2],
                                                            summary.mean()[3],summary.mean()[4],summary.mean()[5],
                                                            summary.mean()[6])
print "Max  : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (summary.max()[0],summary.max()[1],
                                                                       summary.max()[2],
                                                            summary.max()[3],summary.max()[4],summary.max()[5],
                                                            summary.max()[6])
print "Min  : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (summary.min()[0],summary.min()[1],
                                                                       summary.min()[2],
                                                            summary.min()[3],summary.min()[4],summary.min()[5],
                                                            summary.min()[6])
print "Variance : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (summary.variance()[0],summary.variance()[1],
                                                                       summary.variance()[2],
                                                            summary.variance()[3],summary.variance()[4],summary.variance()[5],
                                                            summary.variance()[6])
# Balance, TopStatusQualMiles, NonFlightMiles, NonFlightTrans, FlightMiles, FlightTrans, DaysSinceEnroll


Mean :  73601.328    144.115  17144.846     11.602    460.056      1.374   4118.559
Max  : 1704838.000  11148.000 263685.000     86.000  30817.000     53.000   8296.000
Min  :      0.000      0.000      0.000      0.000      0.000      0.000      2.000
Variance : 10155734647.781 598555.682 583269246.943     92.233 1960585.724     14.388 4264780.669

In [28]:
# You see, K-means clustering is "isotropic" in all directions of space and therefore tends to produce 
# more or less round (rather than elongated) clusters. [Ref 2]
# In this situation leaving variances unequal is equivalent to putting more weight on variables with smaller variance, 
# so clusters will tend to be separated along variables with greater variance. [Ref 3]
#
# (R's caret package does the same via preProcess: center, scale, Box-Cox)
# zero mean and unit variance:
#
# (x - mu)/sigma
# org.apache.spark.mllib.feature.StandardScaler does this, but to the best of my knowledge 
#            as of now (9/28/14) it is not available from Python 
# So we do it manually, gives us a chance to do some functional programming !
#
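The (x - mu)/sigma transform described above can be sketched in pure Python for a single column. This toy version uses the first five Balance values from the data and the sample variance (n - 1 denominator), which is what `Statistics.colStats` reports:

```python
from math import sqrt

def standardize(column):
    # z-score: (x - mean) / sd, using the sample variance (n - 1)
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / (n - 1)
    sd = sqrt(var)
    return [(x - mean) / sd for x in column]

# the first five Balance values from the dataset
scaled = standardize([28143.0, 19244.0, 41354.0, 14776.0, 97752.0])
```

The result has mean 0 and unit variance, which is exactly what `center_and_scale` below does column by column.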

In [29]:
data_mean = summary.mean()
data_sigma = summary.variance()

In [30]:
for x in data_sigma:
    print x,sqrt(x)


10155734647.8 100775.664958
598555.682228 773.663804393
583269246.943 24150.9678262
92.2331729756 9.6038103363
1960585.7235 1400.20917134
14.3881569442 3.79317241161
4264780.66925 2065.13454023

In [31]:
def center_and_scale(a_record):
    # z-score each column: (x - mean)/sd
    # note: this mutates the input array in place, which is fine here
    # because freq_rdd rebuilds its arrays on each recomputation
    for i in range(len(a_record)):
        a_record[i] = (a_record[i] - data_mean[i])/sqrt(data_sigma[i])
    return a_record

In [32]:
freq_norm_rdd = freq_rdd.map(lambda x: center_and_scale(x))

In [33]:
freq_norm_rdd.first()


Out[33]:
array([-0.45108437, -0.18627539, -0.70269839, -1.10392647, -0.32856217,
       -0.36212258,  1.39527985])

In [34]:
# now let us try with the standardized data
km_mdl_std = KMeans.train(freq_norm_rdd, 2, maxIterations=10,runs=10, initializationMode="random")

In [35]:
for x in km_mdl_std.clusterCenters:
        print "%10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (x[0],x[1],x[2],x[3],x[4],x[5],x[6])
# Balance, TopStatusQualMiles, NonFlightMiles, NonFlightTrans, FlightMiles, FlightTrans, DaysSinceEnroll


     1.422      0.733      1.363      1.461      1.761      1.918      0.460
    -0.158     -0.081     -0.151     -0.162     -0.195     -0.213     -0.051

In [36]:
WSSSE = freq_norm_rdd.map(lambda point: squared_error(km_mdl_std,point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))


Within Set Sum of Squared Error = 7575.5729814

In [37]:
# Let us try with k= 5 clusters instead of k=2
km_mdl_std_5 = KMeans.train(freq_norm_rdd, 5, maxIterations=10,runs=10, initializationMode="random")

In [38]:
for x in km_mdl_std_5.clusterCenters:
        print "%10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (x[0],x[1],x[2],x[3],x[4],x[5],x[6])
# Balance, TopStatusQualMiles, NonFlightMiles, NonFlightTrans, FlightMiles, FlightTrans, DaysSinceEnroll


     0.425      6.683      0.120      0.115      0.370      0.416     -0.087
    -0.381     -0.142     -0.488     -0.510     -0.204     -0.220     -0.840
     0.581     -0.112      1.188      0.940     -0.060     -0.071      0.212
    -0.117     -0.116     -0.354     -0.266     -0.184     -0.195      0.921
     1.229      0.358      0.869      1.738      3.407      3.679      0.289

In [39]:
WSSSE = freq_norm_rdd.map(lambda point: squared_error(km_mdl_std_5,point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))


Within Set Sum of Squared Error = 5788.536169

In [40]:
km_mdl_std_10 = KMeans.train(freq_norm_rdd, 10, maxIterations=10,runs=10, initializationMode="random")
for x in km_mdl_std_10.clusterCenters:
        print "%10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (x[0],x[1],x[2],x[3],x[4],x[5],x[6])
#
WSSSE = freq_norm_rdd.map(lambda point: squared_error(km_mdl_std_10,point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))


     0.882     -0.037      2.737      1.536      0.150      0.200      0.382
     5.178      0.185      1.160      0.748      0.493      0.772      1.035
    -0.332     -0.156     -0.602     -0.843     -0.234     -0.234     -0.172
    -0.120     -0.139      0.126      0.575     -0.241     -0.258     -0.610
    -0.273     -0.098     -0.536     -0.587     -0.228     -0.237      1.039
    -0.475     -0.112     -0.596     -0.804     -0.245     -0.256     -1.276
     0.193      0.055      0.064      0.697      1.863      1.837      0.010
     0.317     -0.132      0.587      0.588     -0.188     -0.210      1.003
     0.413      7.036      0.099      0.094      0.353      0.391     -0.092
     0.759      0.480      0.973      2.533      5.742      6.047      0.165
Within Set Sum of Squared Error = 4709.48068684
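WSSSE always falls as k grows (7575 at k=2, 5788 at k=5, 4709 at k=10), so the usual guide is the "elbow" where the improvement flattens out. Under the hood, `KMeans.train` runs an assign/update loop (Lloyd's algorithm); a minimal one-dimensional sketch on toy data, with fixed initial centers instead of random or k-means|| initialization, looks like this:

```python
def kmeans_1d(points, centers, iterations=10):
    # Lloyd's algorithm on scalars: assign each point to its nearest
    # center, then move each center to the mean of its points
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # an empty cluster keeps its old center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# two well-separated toy groups converge to their group means
final_centers = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

Spark parallelizes the assignment step across partitions and aggregates the per-cluster sums, but the iteration structure is the same.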

In [41]:
cluster_rdd = freq_norm_rdd.map(lambda x: km_mdl_std_5.predict(x))

In [42]:
cluster_rdd.take(10)


Out[42]:
[3, 3, 3, 3, 2, 3, 2, 3, 4, 2]

In [43]:
freq_rdd_1 = inp_file.map(lambda line: array([int(x) for x in line.split(',')]))  # inp_file: the raw text RDD of the same CSV
freq_cluster_map = freq_rdd_1.zip(cluster_rdd)
freq_cluster_map.take(5) 
# Can fail with org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition


Out[43]:
[(array([28143,     0,   174,     1,     0,     0,  7000]), 3),
 (array([19244,     0,   215,     2,     0,     0,  6968]), 3),
 (array([41354,     0,  4123,     4,     0,     0,  7034]), 3),
 (array([14776,     0,   500,     1,     0,     0,  6952]), 3),
 (array([97752,     0, 43300,    26,  2077,     4,  6935]), 2)]

In [44]:
freq_cluster_map = inp_file.map(lambda line: array([int(x) for x in line.split(',')])).zip(cluster_rdd)
freq_cluster_map.take(5) 
# Can fail with org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition


Out[44]:
[(array([28143,     0,   174,     1,     0,     0,  7000]), 3),
 (array([19244,     0,   215,     2,     0,     0,  6968]), 3),
 (array([41354,     0,  4123,     4,     0,     0,  7034]), 3),
 (array([14776,     0,   500,     1,     0,     0,  6952]), 3),
 (array([97752,     0, 43300,    26,  2077,     4,  6935]), 2)]

In [45]:
freq_cluster_map = freq_rdd.zip(cluster_rdd)
freq_cluster_map.take(5) # This works reliably: freq_rdd and cluster_rdd share the same lineage and partitioning


Out[45]:
[(array([  2.81430000e+04,   0.00000000e+00,   1.74000000e+02,
           1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           7.00000000e+03]), 3),
 (array([  1.92440000e+04,   0.00000000e+00,   2.15000000e+02,
           2.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           6.96800000e+03]), 3),
 (array([  4.13540000e+04,   0.00000000e+00,   4.12300000e+03,
           4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           7.03400000e+03]), 3),
 (array([  1.47760000e+04,   0.00000000e+00,   5.00000000e+02,
           1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           6.95200000e+03]), 3),
 (array([  9.77520000e+04,   0.00000000e+00,   4.33000000e+04,
           2.60000000e+01,   2.07700000e+03,   4.00000000e+00,
           6.93500000e+03]), 2)]

In [46]:
cluster_0 = freq_cluster_map.filter(lambda x: x[1] == 0)
cluster_1 = freq_cluster_map.filter(lambda x: x[1] == 1)
cluster_2 = freq_cluster_map.filter(lambda x: x[1] == 2)
cluster_3 = freq_cluster_map.filter(lambda x: x[1] == 3)
cluster_4 = freq_cluster_map.filter(lambda x: x[1] == 4)

In [47]:
print cluster_0.count()
print cluster_1.count()
print cluster_2.count()
print cluster_3.count()
print cluster_4.count()


61
1639
877
1263
159
Counts from an earlier run: 143 1372 768 59 1657

In [48]:
cluster_0.count()+cluster_1.count()+cluster_2.count()+cluster_3.count()+cluster_4.count()


Out[48]:
3999

In [49]:
freq_rdd_1.count()


Out[49]:
3999

In [50]:
freq_cluster_map.count()


Out[50]:
3999

In [51]:
cluster_0.take(5)


Out[51]:
[(array([  8.44090000e+04,   5.03100000e+03,   1.54360000e+04,
           1.60000000e+01,   1.15000000e+03,   4.00000000e+00,
           7.76600000e+03]), 0),
 (array([  2.78457000e+05,   6.72700000e+03,   5.73130000e+04,
           2.70000000e+01,   1.00000000e+03,   2.00000000e+00,
           7.10100000e+03]), 0),
 (array([  5.29886000e+05,   7.21000000e+03,   2.38660000e+04,
           2.60000000e+01,   7.74100000e+03,   1.50000000e+01,
           8.29600000e+03]), 0),
 (array([  8.65200000e+04,   3.44500000e+03,   6.44500000e+04,
           2.00000000e+01,   1.00000000e+03,   2.00000000e+00,
           6.59200000e+03]), 0),
 (array([  1.33445000e+05,   8.26400000e+03,   3.37500000e+03,
           1.30000000e+01,   0.00000000e+00,   0.00000000e+00,
           6.49200000e+03]), 0)]

In [52]:
cluster_1.take(5)


Out[52]:
[(array([ 1625.,     0.,  1375.,     4.,     0.,     0.,  1547.]), 1),
 (array([  2.46980000e+04,   0.00000000e+00,   1.32900000e+03,
           5.00000000e+00,   5.00000000e+02,   1.00000000e+00,
           4.26700000e+03]), 1),
 (array([  3.14840000e+04,   0.00000000e+00,   3.12500000e+03,
           1.10000000e+01,   0.00000000e+00,   0.00000000e+00,
           4.11700000e+03]), 1),
 (array([  2.20930000e+04,   0.00000000e+00,   1.48570000e+04,
           1.10000000e+01,   2.00000000e+02,   1.00000000e+00,
           2.58700000e+03]), 1),
 (array([  4.46650000e+04,   0.00000000e+00,   3.33000000e+02,
           2.00000000e+00,   3.33000000e+02,   2.00000000e+00,
           3.60100000e+03]), 1)]

In [53]:
cluster_2.take(5)


Out[53]:
[(array([  9.77520000e+04,   0.00000000e+00,   4.33000000e+04,
           2.60000000e+01,   2.07700000e+03,   4.00000000e+00,
           6.93500000e+03]), 2),
 (array([  8.49140000e+04,   0.00000000e+00,   2.74820000e+04,
           2.50000000e+01,   0.00000000e+00,   0.00000000e+00,
           6.99400000e+03]), 2),
 (array([  1.04860000e+05,   0.00000000e+00,   2.84260000e+04,
           2.80000000e+01,   1.15000000e+03,   3.00000000e+00,
           6.93100000e+03]), 2),
 (array([  9.65220000e+04,   0.00000000e+00,   6.11050000e+04,
           1.90000000e+01,   0.00000000e+00,   0.00000000e+00,
           6.92400000e+03]), 2),
 (array([  2.84950000e+04,   0.00000000e+00,   4.94420000e+04,
           1.50000000e+01,   0.00000000e+00,   0.00000000e+00,
           6.91200000e+03]), 2)]

In [54]:
cluster_3.take(5)


Out[54]:
[(array([  2.81430000e+04,   0.00000000e+00,   1.74000000e+02,
           1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           7.00000000e+03]), 3),
 (array([  1.92440000e+04,   0.00000000e+00,   2.15000000e+02,
           2.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           6.96800000e+03]), 3),
 (array([  4.13540000e+04,   0.00000000e+00,   4.12300000e+03,
           4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           7.03400000e+03]), 3),
 (array([  1.47760000e+04,   0.00000000e+00,   5.00000000e+02,
           1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
           6.95200000e+03]), 3),
 (array([ 16420.,      0.,      0.,      0.,      0.,      0.,   6942.]), 3)]

In [55]:
cluster_4.take(5)


Out[55]:
[(array([  4.43003000e+05,   0.00000000e+00,   1.75300000e+03,
           4.30000000e+01,   3.85000000e+03,   1.20000000e+01,
           6.94800000e+03]), 4),
 (array([  2.05840000e+04,   0.00000000e+00,   3.45000000e+03,
           1.10000000e+01,   3.45000000e+03,   1.10000000e+01,
           6.88400000e+03]), 4),
 (array([  6.03130000e+04,   0.00000000e+00,   1.00000000e+04,
           2.60000000e+01,   3.25000000e+03,   9.00000000e+00,
           7.82900000e+03]), 4),
 (array([  1.08137000e+05,   0.00000000e+00,   6.36800000e+03,
           5.00000000e+00,   6.36800000e+03,   5.00000000e+00,
           6.84400000e+03]), 4),
 (array([  5.39140000e+04,   0.00000000e+00,   3.37670000e+04,
           4.50000000e+01,   5.55000000e+03,   2.90000000e+01,
           6.82600000e+03]), 4)]

In [56]:
stat_0 = Statistics.colStats(cluster_0.map(lambda x: x[0]))
stat_1 = Statistics.colStats(cluster_1.map(lambda x: x[0]))
stat_2 = Statistics.colStats(cluster_2.map(lambda x: x[0]))
stat_3 = Statistics.colStats(cluster_3.map(lambda x: x[0]))
stat_4 = Statistics.colStats(cluster_4.map(lambda x: x[0]))
print "0 : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (stat_0.mean()[0],stat_0.mean()[1],stat_0.mean()[2],
                                                            stat_0.mean()[3],stat_0.mean()[4],stat_0.mean()[5],
                                                            stat_0.mean()[6])
print "1 : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (stat_1.mean()[0],stat_1.mean()[1],stat_1.mean()[2],
                                                            stat_1.mean()[3],stat_1.mean()[4],stat_1.mean()[5],
                                                            stat_1.mean()[6])
print "2 : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (stat_2.mean()[0],stat_2.mean()[1],stat_2.mean()[2],
                                                            stat_2.mean()[3],stat_2.mean()[4],stat_2.mean()[5],
                                                            stat_2.mean()[6])
print "3 : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (stat_3.mean()[0],stat_3.mean()[1],stat_3.mean()[2],
                                                            stat_3.mean()[3],stat_3.mean()[4],stat_3.mean()[5],
                                                            stat_3.mean()[6])
print "4 : %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f %10.3f" % (stat_4.mean()[0],stat_4.mean()[1],stat_4.mean()[2],
                                                            stat_4.mean()[3],stat_4.mean()[4],stat_4.mean()[5],
                                                            stat_4.mean()[6])
# Balance, TopStatusQualMiles, NonFlightMiles, NonFlightTrans, FlightMiles, FlightTrans, DaysSinceEnroll


0 : 117260.148   5354.426  19189.607     12.426    952.164      2.885   3893.672
1 :  36118.613     33.151   5783.135      7.065    179.646      0.553   2345.847
2 : 139958.892     63.229  47719.475     20.822    411.513      1.235   4713.800
3 :  59099.428     54.033   8329.658      8.975    202.061      0.626   5949.587
4 : 192414.516    450.717  34860.233     28.069   5478.874     15.956   4650.497
One run on Sep 27:
0 :  37950.925     33.352   6660.214      7.644    183.511      0.567   2220.540 # Relatively new, not active
1 :  56183.841     54.051   8370.021      8.902    205.035      0.620   5748.698
2 : 117326.186   5445.305  19059.610     12.305    965.797      2.881   3874.831 # Top Status Qual Miles
3 : 191736.336    471.566  33093.336     28.357   5763.133     16.769   4666.413 # Most Active
4 : 150843.700     73.158  50474.264     21.183    473.292      1.441   4938.489 # non-flight but active customers
Run 10/28/14
0 :  38091.905     32.784   6731.402      7.630    178.718      0.555   2281.777
1 :  57441.909     55.024   8758.131      9.104    213.633      0.646   5823.841
2 : 191736.336    471.566  33093.336     28.357   5763.133     16.769   4666.413
3 : 117326.186   5445.305  19059.610     12.305    965.797      2.881   3874.831
4 : 152607.968     74.778  51066.228     21.329    478.139      1.449   4913.985
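The per-cluster `colStats` calls above boil down to grouping the (vector, label) pairs by label and averaging each column within a group. A pure-Python sketch of that aggregation, on toy pairs rather than the notebook's data:

```python
def cluster_means(pairs):
    # pairs of (feature_vector, cluster_label) -> per-cluster column means
    sums, counts = {}, {}
    for vec, label in pairs:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        for j, v in enumerate(vec):
            sums[label][j] += v
        counts[label] += 1
    return {label: [s / counts[label] for s in vec_sum]
            for label, vec_sum in sums.items()}

# toy pairs: two points in cluster 0, one in cluster 1
means = cluster_means([((2.0, 4.0), 0), ((4.0, 8.0), 0), ((10.0, 0.0), 1)])
```

Spark does the same with a distributed aggregate per filtered RDD; a single `reduceByKey` over (label, vector) pairs would avoid the five separate `filter`/`colStats` passes.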

In [57]:
# Different runs will produce different clusters (k-means starts from random initial centers)
# Once the model is trained, the cluster characteristics can be interpreted & used by the business
