Welcome to Week 3

This document provides a running example of completing the Week 3 assignment :

A shorter version with fewer comments is available as script: sparkMLlibClustering.py
To run these commands in Cloudera VM: first run the setup script: setupWeek3.sh
You can then copy paste these commands in pySpark.
To open pySpark, refer to : Week 2 and Week 4 of the Machine Learning course
Note that your dataset may be different from what is used here, so your results may not match with those shown here



In [1]:

    
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

Step 1: Attribute Selection

Import Data

First let us read the contents of the file ad-clicks.csv. The following commands read in the CSV file in a table format and removes any extra whitespaces. So, if the CSV contained ' userid ' it becomes 'userid'.

Note that you must change the path to ad-clicks.csv to the location on your machine, if you want to run this command on your machine.



In [2]:

    
adclicksDF = pd.read_csv('./ad-clicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip()) #remove whitespaces from headers

Let us display the first 5 lines of adclicksDF:



In [3]:

    
adclicksDF.head(n=5)









    Out[3]:






  
    
      
      timestamp
      txId
      userSessionId
      teamId
      userId
      adId
      adCategory
    
  
  
    
      0
      2016-05-30 14:24:03
      6616
      6289
      24
      876
      29
      movies
    
    
      1
      2016-05-30 14:24:47
      6624
      6144
      29
      1935
      8
      games
    
    
      2
      2016-05-30 14:26:34
      6628
      6536
      20
      1588
      25
      movies
    
    
      3
      2016-05-30 14:26:50
      6618
      6518
      21
      1195
      19
      fashion
    
    
      4
      2016-05-30 14:27:06
      6629
      6072
      146
      1685
      20
      games

Next, We are going to add an extra column to the ad-clicks table and make it equal to 1. We do so to record the fact that each ROW is 1 ad-click. You will see how this will become useful when we sum up this column to find how many ads did a user click.



In [4]:

    
adclicksDF['adCount'] = 1

Let us display the first 5 lines of adclicksDF and see if a new column has been added:



In [5]:

    
adclicksDF.head(n=5)









    Out[5]:






  
    
      
      timestamp
      txId
      userSessionId
      teamId
      userId
      adId
      adCategory
      adCount
    
  
  
    
      0
      2016-05-30 14:24:03
      6616
      6289
      24
      876
      29
      movies
      1
    
    
      1
      2016-05-30 14:24:47
      6624
      6144
      29
      1935
      8
      games
      1
    
    
      2
      2016-05-30 14:26:34
      6628
      6536
      20
      1588
      25
      movies
      1
    
    
      3
      2016-05-30 14:26:50
      6618
      6518
      21
      1195
      19
      fashion
      1
    
    
      4
      2016-05-30 14:27:06
      6629
      6072
      146
      1685
      20
      games
      1

Next, let us read the contents of the file buy-clicks.csv. As before, the following commands read in the CSV file in a table format and removes any extra whitespaces. So, if the CSV contained ' userid ' it becomes 'userid'.

Note that you must change the path to buy-clicks.csv to the location on your machine, if you want to run this command on your machine.



In [6]:

    
buyclicksDF = pd.read_csv('./buy-clicks.csv')
buyclicksDF = buyclicksDF.rename(columns=lambda x: x.strip()) #removes whitespaces from headers

Let us display the first 5 lines of buyclicksDF:



In [7]:

    
buyclicksDF.head(n=5)









    Out[7]:






  
    
      
      timestamp
      txId
      userSessionId
      team
      userId
      buyId
      price
    
  
  
    
      0
      2016-05-30 13:51:50
      6587
      6368
      29
      1422
      2
      3
    
    
      1
      2016-05-30 13:51:50
      6588
      6440
      112
      2253
      0
      1
    
    
      2
      2016-05-30 13:51:50
      6589
      6420
      48
      1393
      2
      3
    
    
      3
      2016-05-30 14:21:50
      6631
      6495
      141
      2295
      2
      3
    
    
      4
      2016-05-30 14:21:50
      6632
      6111
      119
      1560
      0
      1

Feature Selection

For this exercise, we can choose from buyclicksDF, the 'price' of each app that a user purchases as an attribute that captures user's purchasing behavior. The following command selects 'userid' and 'price' and drops all other columns that we do not want to use at this stage.



In [8]:

    
userPurchases = buyclicksDF[['userId','price']] #select only userid and price
userPurchases.head(n=5)

Similarly, from the adclicksDF, we will use the 'adCount' as an attribute that captures user's inclination to click on ads. The following command selects 'userid' and 'adCount' and drops all other columns that we do not want to use at this stage.



In [9]:

    
useradClicks = adclicksDF[['userId','adCount']]



In [10]:

    
useradClicks.head(n=5) #as we saw before, this line displays first five lines

Step 2: Training Data Set Creation

Create the first aggregate feature for clustering

From each of these single ad-clicks per row, we can now generate total ad clicks per user. Let's pick a user with userid = 3. To find out how many ads this user has clicked overall, we have to find each row that contains userid = 3, and report the total number of such rows. The following commands sum the total number of ads per user and rename the columns to be called 'userid' and 'totalAdClicks'. Note that you may not need to aggregate (e.g. sum over many rows) if you choose a different feature and your data set already provides the necessary information. In the end, we want to get one row per user, if we are performing clustering over users.



In [11]:

    
adsPerUser = useradClicks.groupby('userId').sum()
adsPerUser = adsPerUser.reset_index()
adsPerUser.columns = ['userId', 'totalAdClicks'] #rename the columns

Let us display the first 5 lines of 'adsPerUser' to see if there is a column named 'totalAdClicks' containing total adclicks per user.



In [12]:

    
adsPerUser.head(n=5)









    Out[12]:






  
    
      
      userId
      totalAdClicks
    
  
  
    
      0
      1
      42
    
    
      1
      5
      4
    
    
      2
      9
      17
    
    
      3
      11
      4
    
    
      4
      14
      40

Create the second aggregate feature for clustering

Similar to what we did for adclicks, here we find out how much money in total did each user spend on buying in-app purchases. As an example, let's pick a user with userid = 9. To find out the total money spent by this user, we have to find each row that contains userid = 9, and report the sum of the column'price' of each product they purchased. The following commands sum the total money spent by each user and rename the columns to be called 'userid' and 'revenue'.

Note: that you can also use other aggregates, such as sum of money spent on a specific ad category by a user or on a set of ad categories by each user, game clicks per hour by each user etc. You are free to use any mathematical operations on the fields provided in the CSV files when creating features.



In [13]:

    
revenuePerUser = userPurchases.groupby('userId').sum()
revenuePerUser = revenuePerUser.reset_index()
revenuePerUser.columns = ['userId', 'revenue'] #rename the columns



In [14]:

    
revenuePerUser.head(n=5)

Merge the two tables

Lets see what we have so far. We have a table called revenuePerUser, where each row contains total money a user (with that 'userid') has spent. We also have another table called adsPerUser where each row contains total number of ads a user has clicked. We will use revenuePerUser and adsPerUser as features / attributes to capture our users' behavior.

Let us combine these two attributes (features) so that each row contains both attributes per user. Let's merge these two tables to get one single table we can use for K-Means clustering.



In [15]:

    
combinedDF = adsPerUser.merge(revenuePerUser, on='userId') #userId, adCount, price

Let us display the first 5 lines of the merged table. Note: Depending on what attributes you choose, you may not need to merge tables. You may get all your attributes from a single table.



In [16]:

    
combinedDF.head(n=5) #display how the merged table looks









    Out[16]:






  
    
      
      userId
      totalAdClicks
      revenue
    
  
  
    
      0
      1
      42
      32
    
    
      1
      5
      4
      2
    
    
      2
      9
      17
      10
    
    
      3
      14
      40
      26
    
    
      4
      17
      50
      25

Create the final training dataset

Our training data set is almost ready. At this stage we can remove the 'userid' from each row, since 'userid' is a computer generated random number assigned to each user. It does not capture any behavioral aspect of a user. One way to drop the 'userid', is to select the other two columns.



In [17]:

    
trainingDF = combinedDF[['totalAdClicks','revenue']]
trainingDF.head(n=5)









    Out[17]:






  
    
      
      totalAdClicks
      revenue
    
  
  
    
      0
      42
      32
    
    
      1
      4
      2
    
    
      2
      17
      10
    
    
      3
      40
      26
    
    
      4
      50
      25

Display the dimensions of the training dataset

Display the dimension of the training data set. To display the dimensions of the trainingDF, simply add .shape as a suffix and hit enter.



In [18]:

    
trainingDF.shape









    Out[18]:





(832, 2)

The following two commands convert the tables we created into a format that can be understood by the KMeans.train function.

line[0] refers to the first column. line[1] refers to the second column. If you have more than 2 columns in your training table, modify this command by adding line[2], line[3], line[4] ...



In [19]:

    
sqlContext = SQLContext(sc)
pDF = sqlContext.createDataFrame(trainingDF)
parsedData = pDF.rdd.map(lambda line: array([line[0], line[1]])) #totalAdClicks, revenue

Step 3: Train to Create Cluster Centers

Train KMeans model

Here we are creating two clusters as denoted in the second argument.



In [20]:

    
my_kmmodel = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")









    



/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/mllib/clustering.py:176: UserWarning: Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.
  "Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.")

Display the centers of two clusters formed



In [21]:

    
print(my_kmmodel.centers)









    



[array([  42.05442177,  113.02040816]), array([ 29.43211679,  24.21021898])]

Step 4: Recommend Actions

Analyze the cluster centers

Each array denotes the center for a cluster:

One Cluster is centered at ... array([ 29.43211679, 24.21021898])
Other Cluster is centered at ... array([ 42.05442177, 113.02040816])

First number (field1) in each array refers to number of ad-clicks and the second number (field2) is the revenue per user. Compare the 1st number of each cluster to see how differently users in each cluster behave when it comes to clicking ads. Compare the 2nd number of each cluster to see how differently users in each cluster behave when it comes to buying stuff.

In one cluster, in general, players click on ads much more often (~1.4 times) and spend more money (~4.7 times) on in-app purchases. Assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items, we can use this information to increase game's revenue by increasing the prices for ads we show to the frequent-clickers, and charge higher fees for hosting the in-app purchase items shown to the higher revenue generating buyers.

Note: This analysis requires you to compare the cluster centers and find any ‘significant’ differences in the corresponding feature values of the centers. The answer to this question will depend on the features you have chosen.

Some features help distinguish the clusters remarkably while others may not tell you much. At this point, if you don’t find clear distinguishing patterns, perhaps re-running the clustering model with different numbers of clusters and revising the features you picked would be a good idea.



In [ ]:

	timestamp	txId	userSessionId	teamId	userId	adId	adCategory
0	2016-05-30 14:24:03	6616	6289	24	876	29	movies
1	2016-05-30 14:24:47	6624	6144	29	1935	8	games
2	2016-05-30 14:26:34	6628	6536	20	1588	25	movies
3	2016-05-30 14:26:50	6618	6518	21	1195	19	fashion
4	2016-05-30 14:27:06	6629	6072	146	1685	20	games

	timestamp	txId	userSessionId	team	userId	buyId	price
0	2016-05-30 13:51:50	6587	6368	29	1422	2	3
1	2016-05-30 13:51:50	6588	6440	112	2253	0	1
2	2016-05-30 13:51:50	6589	6420	48	1393	2	3
3	2016-05-30 14:21:50	6631	6495	141	2295	2	3
4	2016-05-30 14:21:50	6632	6111	119	1560	0	1