Welcome to Week 3




This document provides a running example of completing the Week 3 assignment:

  • A shorter version with fewer comments is available as the script sparkMLlibClustering.py
  • To run these commands in the Cloudera VM, first run the setup script setupWeek3.sh
  • You can then copy and paste these commands into pySpark.
  • To open pySpark, refer to Week 2 and Week 4 of the Machine Learning course
  • Note that your dataset may be different from the one used here, so your results may not match those shown here

In [1]:
import pandas as pd
from pyspark.sql import SQLContext #used later to build a Spark DataFrame
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array



Step 1: Attribute Selection



Import Data



First, let us read the contents of the file ad-clicks.csv. The following commands read the CSV file into a table (a pandas DataFrame) and remove any extra whitespace from the column headers. So, if the CSV header contained ' userId ', it becomes 'userId'.

Note that if you want to run this command on your own machine, you must change the path to ad-clicks.csv to the file's location on that machine.






In [2]:
adclicksDF = pd.read_csv('./ad-clicks.csv')
adclicksDF = adclicksDF.rename(columns=lambda x: x.strip()) #remove whitespaces from headers
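
As a small sketch of what the header cleanup does, the snippet below uses an in-memory CSV instead of ad-clicks.csv. The sample rows and the names sampleCsv and sampleDF are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose header names carry stray spaces.
sampleCsv = io.StringIO(u" userId , adId \n876,29\n1935,8\n")

sampleDF = pd.read_csv(sampleCsv)
print(list(sampleDF.columns))   # header names still contain the spaces

# Same cleanup as above: strip whitespace from every column name.
sampleDF = sampleDF.rename(columns=lambda x: x.strip())
print(list(sampleDF.columns))   # header names are now clean
```

Only the column names are touched here; the values in the rows are left exactly as they were read.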





Let us display the first 5 lines of adclicksDF:






In [3]:
adclicksDF.head(n=5)


Out[3]:
timestamp txId userSessionId teamId userId adId adCategory
0 2016-05-30 14:24:03 6616 6289 24 876 29 movies
1 2016-05-30 14:24:47 6624 6144 29 1935 8 games
2 2016-05-30 14:26:34 6628 6536 20 1588 25 movies
3 2016-05-30 14:26:50 6618 6518 21 1195 19 fashion
4 2016-05-30 14:27:06 6629 6072 146 1685 20 games





Next, we are going to add an extra column to the ad-clicks table and set it to 1. We do so to record the fact that each row represents one ad click. You will see how this becomes useful when we sum up this column to find how many ads each user clicked.






In [4]:
adclicksDF['adCount'] = 1





Let us display the first 5 lines of adclicksDF and see if a new column has been added:






In [5]:
adclicksDF.head(n=5)


Out[5]:
timestamp txId userSessionId teamId userId adId adCategory adCount
0 2016-05-30 14:24:03 6616 6289 24 876 29 movies 1
1 2016-05-30 14:24:47 6624 6144 29 1935 8 games 1
2 2016-05-30 14:26:34 6628 6536 20 1588 25 movies 1
3 2016-05-30 14:26:50 6618 6518 21 1195 19 fashion 1
4 2016-05-30 14:27:06 6629 6072 146 1685 20 games 1





Next, let us read the contents of the file buy-clicks.csv. As before, the following commands read the CSV file into a table and remove any extra whitespace from the column headers. So, if the CSV header contained ' userId ', it becomes 'userId'.

Note that if you want to run this command on your own machine, you must change the path to buy-clicks.csv to the file's location on that machine.






In [6]:
buyclicksDF = pd.read_csv('./buy-clicks.csv')
buyclicksDF = buyclicksDF.rename(columns=lambda x: x.strip()) #removes whitespaces from headers





Let us display the first 5 lines of buyclicksDF:






In [7]:
buyclicksDF.head(n=5)


Out[7]:
timestamp txId userSessionId team userId buyId price
0 2016-05-30 13:51:50 6587 6368 29 1422 2 3
1 2016-05-30 13:51:50 6588 6440 112 2253 0 1
2 2016-05-30 13:51:50 6589 6420 48 1393 2 3
3 2016-05-30 14:21:50 6631 6495 141 2295 2 3
4 2016-05-30 14:21:50 6632 6111 119 1560 0 1





Feature Selection



For this exercise, we choose from buyclicksDF the 'price' of each item that a user purchases as an attribute that captures the user's purchasing behavior. The following command selects 'userId' and 'price' and drops all other columns that we do not want to use at this stage.






In [8]:
userPurchases = buyclicksDF[['userId','price']] #select only userid and price
userPurchases.head(n=5)


Out[8]:
userId price
0 1422 3
1 2253 1
2 1393 3
3 2295 3
4 1560 1





Similarly, from adclicksDF, we will use 'adCount' as an attribute that captures the user's inclination to click on ads. The following command selects 'userId' and 'adCount' and drops all other columns that we do not want to use at this stage.






In [9]:
useradClicks = adclicksDF[['userId','adCount']]

In [10]:
useradClicks.head(n=5) #as we saw before, this line displays first five lines


Out[10]:
userId adCount
0 876 1
1 1935 1
2 1588 1
3 1195 1
4 1685 1



Step 2: Training Data Set Creation



Create the first aggregate feature for clustering



From these rows, each of which records a single ad click, we can now generate the total ad clicks per user. Let's pick a user with userId = 3. To find out how many ads this user has clicked overall, we have to find each row that contains userId = 3 and report the total number of such rows. The following commands sum the number of ad clicks per user and rename the columns to 'userId' and 'totalAdClicks'. Note that you may not need to aggregate (e.g. sum over many rows) if you choose a different feature and your data set already provides the necessary information. In the end, we want one row per user, since we are clustering over users.






In [11]:
adsPerUser = useradClicks.groupby('userId').sum()
adsPerUser = adsPerUser.reset_index()
adsPerUser.columns = ['userId', 'totalAdClicks'] #rename the columns
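
To see why summing the 'adCount' column counts clicks, here is a minimal sketch on a made-up click log. The table toyClicks and its values are hypothetical and only stand in for useradClicks.

```python
import pandas as pd

# Made-up click log: one row per ad click, as in useradClicks.
toyClicks = pd.DataFrame({'userId': [3, 7, 3, 3, 7],
                          'adCount': [1, 1, 1, 1, 1]})

# Summing the constant 1 per user is the same as counting that user's rows.
toyAdsPerUser = toyClicks.groupby('userId').sum().reset_index()
toyAdsPerUser.columns = ['userId', 'totalAdClicks']
print(toyAdsPerUser)
```

User 3 appears in three rows and user 7 in two, so their totals come out as 3 and 2.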





Let us display the first 5 lines of 'adsPerUser' to see if there is a column named 'totalAdClicks' containing total adclicks per user.






In [12]:
adsPerUser.head(n=5)


Out[12]:
userId totalAdClicks
0 1 42
1 5 4
2 9 17
3 11 4
4 14 40





Create the second aggregate feature for clustering



Similar to what we did for ad clicks, here we find out how much money in total each user spent on in-app purchases. As an example, let's pick a user with userId = 9. To find out the total money spent by this user, we have to find each row that contains userId = 9 and report the sum of the column 'price' over the products they purchased. The following commands sum the total money spent by each user and rename the columns to 'userId' and 'revenue'.

Note that you can also use other aggregates, such as the sum of money spent by a user on a specific ad category or on a set of ad categories, game clicks per hour by each user, etc. You are free to use any mathematical operations on the fields provided in the CSV files when creating features.






In [13]:
revenuePerUser = userPurchases.groupby('userId').sum()
revenuePerUser = revenuePerUser.reset_index()
revenuePerUser.columns = ['userId', 'revenue'] #rename the columns

In [14]:
revenuePerUser.head(n=5)


Out[14]:
userId revenue
0 1 32
1 5 2
2 9 10
3 14 26
4 17 25
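
As one possible illustration of the note on other aggregates, the sketch below computes several aggregates of 'price' per user at once. The table toyPurchases, its values, and the column names avgPrice and numPurchases are made up for this example and are not part of the assignment data.

```python
import pandas as pd

# Made-up purchases: one row per purchase, as in userPurchases.
toyPurchases = pd.DataFrame({'userId': [9, 9, 12, 9],
                             'price': [3, 1, 20, 5]})

# Several aggregates of 'price' per user in one pass.
toyStats = toyPurchases.groupby('userId')['price'].agg(['sum', 'mean', 'count'])
toyStats = toyStats.reset_index()
toyStats.columns = ['userId', 'revenue', 'avgPrice', 'numPurchases']
print(toyStats)
```

Any of these columns could serve as a clustering feature; which ones are useful depends on the behavior you want to capture.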





Merge the two tables



Let's see what we have so far. We have a table called revenuePerUser, where each row contains the total money a user (with that 'userId') has spent. We also have another table called adsPerUser, where each row contains the total number of ads a user has clicked. We will use revenuePerUser and adsPerUser as features (attributes) that capture our users' behavior.

Let us combine these two features so that each row contains both of them for one user. We merge the two tables to get a single table that we can use for K-Means clustering.






In [15]:
combinedDF = adsPerUser.merge(revenuePerUser, on='userId') #userId, totalAdClicks, revenue





Let us display the first 5 lines of the merged table. Note: Depending on what attributes you choose, you may not need to merge tables. You may get all your attributes from a single table.






In [16]:
combinedDF.head(n=5) #display how the merged table looks


Out[16]:
userId totalAdClicks revenue
0 1 42 32
1 5 4 2
2 9 17 10
3 14 40 26
4 17 50 25
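
Notice that userId 11 appears in adsPerUser but not in the merged table. This is because merge performs an inner join by default, so only users present in both tables are kept. The sketch below shows the difference on toy tables; the values are made up, and the outer join is shown only for comparison, not as a step of this walkthrough.

```python
import pandas as pd

# Toy versions of the two per-user tables (values are made up).
toyAds = pd.DataFrame({'userId': [1, 5, 11], 'totalAdClicks': [42, 4, 4]})
toyRevenue = pd.DataFrame({'userId': [1, 5, 14], 'revenue': [32, 2, 26]})

# Default merge is an inner join: only userIds present in both tables survive.
innerDF = toyAds.merge(toyRevenue, on='userId')
print(innerDF)

# An outer join keeps every user and fills the missing feature with NaN.
outerDF = toyAds.merge(toyRevenue, on='userId', how='outer')
print(outerDF)
```

If you chose an outer join, the rows containing NaN would have to be filled in or dropped before training.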





Create the final training dataset



Our training data set is almost ready. At this stage we can remove the 'userId' from each row, since 'userId' is a computer-generated random number assigned to each user and does not capture any behavioral aspect of a user. One way to drop the 'userId' is to select the other two columns.






In [17]:
trainingDF = combinedDF[['totalAdClicks','revenue']]
trainingDF.head(n=5)


Out[17]:
totalAdClicks revenue
0 42 32
1 4 2
2 17 10
3 40 26
4 50 25
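
Selecting the columns to keep is one way to remove 'userId'. Another, sketched below on a made-up table, is to drop the column by name; both approaches give the same result. The names toyCombined, bySelect and byDrop are hypothetical.

```python
import pandas as pd

# Toy stand-in for combinedDF (values are made up).
toyCombined = pd.DataFrame({'userId': [1, 5, 9],
                            'totalAdClicks': [42, 4, 17],
                            'revenue': [32, 2, 10]})

# Option 1: keep only the feature columns (the approach used above).
bySelect = toyCombined[['totalAdClicks', 'revenue']]

# Option 2: drop the identifier column by name.
byDrop = toyCombined.drop('userId', axis=1)[['totalAdClicks', 'revenue']]

print(bySelect.equals(byDrop))
```

Dropping by name can be more convenient when the table has many feature columns and only one column to remove.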





Display the dimensions of the training dataset



To display the dimensions of trainingDF (number of rows, number of columns), simply add .shape as a suffix and hit enter.






In [18]:
trainingDF.shape


Out[18]:
(832, 2)





The following commands convert the table we created into a format that the KMeans.train function understands: the pandas table is first turned into a Spark DataFrame, and each of its rows is then mapped to a numeric array.

line[0] refers to the first column and line[1] to the second column. If you have more than 2 columns in your training table, modify the last command by adding line[2], line[3], line[4] ...






In [19]:
sqlContext = SQLContext(sc)
pDF = sqlContext.createDataFrame(trainingDF)
parsedData = pDF.rdd.map(lambda line: array([line[0], line[1]])) #totalAdClicks, revenue
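
The sketch below illustrates the row-to-array mapping without Spark, using plain tuples as stand-ins for the rows of pDF.rdd. The values and the third column are made up. It also shows a form that does not depend on the number of columns; with an RDD, the assumption is that this would be written as pDF.rdd.map(lambda line: array(list(line))), since each row behaves like a tuple.

```python
from numpy import array

# Stand-ins for the rows of pDF.rdd; the values are made up, and the third
# column is a hypothetical extra feature.
rows = [(42, 32, 7), (4, 2, 1), (17, 10, 3)]

# Explicit indexing, as in the command above, extended by one column.
explicit = [array([line[0], line[1], line[2]]) for line in rows]

# Equivalent form that does not depend on the number of columns.
generic = [array(list(line)) for line in rows]

print(generic[0])
```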


Step 3: Train to Create Cluster Centers



Train KMeans model







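Here we create two clusters, as specified by the second argument (k = 2). Note that the runs argument is deprecated as of Spark 1.6.0, which is why the call prints the warning shown after it.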






In [20]:
my_kmmodel = KMeans.train(parsedData, 2, maxIterations=10, runs=10, initializationMode="random")


/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/mllib/clustering.py:176: UserWarning: Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.
  "Support for runs is deprecated in 1.6.0. This param will have no effect in 1.7.0.")



Display the centers of the two clusters formed




In [21]:
print(my_kmmodel.centers)


[array([  42.05442177,  113.02040816]), array([ 29.43211679,  24.21021898])]
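
To make the idea of a cluster center concrete, here is a minimal pure-Python sketch of the basic K-Means iteration: assign each point to its nearest center, then move each center to the mean of its points. This is an illustration only and is not how MLlib implements KMeans.train. The function names simpleKMeans and squaredDistance and the toy points are made up.

```python
import random

def squaredDistance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def simpleKMeans(points, k, maxIterations=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    for _ in range(maxIterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squaredDistance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        newCenters = []
        for i, members in enumerate(clusters):
            if members:
                newCenters.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                newCenters.append(centers[i])  # keep the center of an empty cluster
        if newCenters == centers:
            break                              # centers stopped moving
        centers = newCenters
    return centers

# Two well-separated groups of made-up points.
toyPoints = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0),
             (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
toyCenters = sorted(simpleKMeans(toyPoints, 2))
print(toyCenters)
```

Each returned center is simply the average of the points in its cluster, which is also how the two arrays printed by my_kmmodel.centers should be read.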


Step 4: Recommend Actions


Analyze the cluster centers



Each array denotes the center for a cluster:

One Cluster is centered at ... array([ 29.43211679, 24.21021898])
Other Cluster is centered at ... array([ 42.05442177, 113.02040816])



The first number (field1) in each array is the total number of ad clicks per user, and the second number (field2) is the revenue per user. Compare the first number of each cluster to see how differently users in each cluster behave when it comes to clicking ads. Compare the second number of each cluster to see how differently users in each cluster behave when it comes to making purchases.



In one cluster, players in general click on ads more often (~1.4 times) and spend more money (~4.7 times) on in-app purchases than players in the other cluster. Assuming that Eglence Inc. gets paid for showing ads and for hosting in-app purchase items, we can use this information to increase the game's revenue: raise the prices for ads shown to the frequent clickers, and charge higher fees for hosting the in-app purchase items shown to the higher-revenue buyers.
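
The two ratios quoted above can be checked directly from the cluster centers. The variable names below are made up; the numbers are the centers printed in Step 3.

```python
# Cluster centers printed above: (totalAdClicks, revenue).
centerHigh = (42.05442177, 113.02040816)
centerLow = (29.43211679, 24.21021898)

clickRatio = centerHigh[0] / centerLow[0]      # ad clicks, high vs. low cluster
revenueRatio = centerHigh[1] / centerLow[1]    # revenue, high vs. low cluster

print("%.1f %.1f" % (clickRatio, revenueRatio))   # prints: 1.4 4.7
```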



Note: This analysis requires you to compare the cluster centers and find any ‘significant’ differences in the corresponding feature values of the centers. The answer to this question will depend on the features you have chosen.

Some features help distinguish the clusters remarkably while others may not tell you much. At this point, if you don’t find clear distinguishing patterns, perhaps re-running the clustering model with different numbers of clusters and revising the features you picked would be a good idea.
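
One common way to compare runs with different numbers of clusters is the sum of squared distances from each point to its nearest center: the lower the value, the tighter the clusters. The sketch below computes it in plain Python on made-up points and hand-picked centers; the function name withinClusterCost is hypothetical. To the best of our knowledge, MLlib's KMeansModel provides a computeCost method for the same quantity, but check the documentation of your Spark version before relying on it.

```python
def withinClusterCost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

# Made-up points forming two groups, with hand-picked centers.
toyPoints = [(1.0, 1.0), (3.0, 1.0), (10.0, 10.0), (12.0, 10.0)]
costOneCenter = withinClusterCost(toyPoints, [(6.5, 5.5)])
costTwoCenters = withinClusterCost(toyPoints, [(2.0, 1.0), (11.0, 10.0)])

print("%.1f %.1f" % (costOneCenter, costTwoCenters))
```

The cost always tends to fall as the number of clusters grows, so look for the point where adding another cluster no longer reduces it by much.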

