Data Science using Telco data

Load the data

We need to load data from a file into a Spark DataFrame. Each row is an observed customer, and each column contains attributes of that customer.

The data comes from the UCI machine learning repository, hosted by SGI.

Fields:

state: discrete.
account length: numeric.
area code: numeric.
phone number: discrete.
international plan: discrete.
voice mail plan: discrete.
number vmail messages: numeric.
total day minutes: numeric.
total day calls: numeric.
total day charge: numeric.
total eve minutes: numeric.
total eve calls: numeric.
total eve charge: numeric.
total night minutes: numeric.
total night calls: numeric.
total night charge: numeric.
total intl minutes: numeric.
total intl calls: numeric.
total intl charge: numeric.
number customer service calls: numeric.
churned: discrete.

'Numeric' and 'discrete' do not adequately describe the fundamental differences between the attributes.

Area codes are considered numeric, but they are better thought of as a categorical variable. Attributes that are genuinely numeric have a reasonable concept of distance between points; area codes do not fall into this category. Two area codes can have a small distance |area_code_1 - area_code_2|, but that distance does not correspond to any similarity between them.
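
For illustration, here is a quick sketch (not part of the original notebook) of why arithmetic distance between area codes is meaningless. The three area codes in this data set are 408, 415, and 510:

In [ ]:
# |415 - 408| = 7 and |510 - 415| = 95, but neither difference says
# anything about how "similar" two area codes are. That is why the
# schema below loads area_code as a string (categorical) column.
print abs(415 - 408), abs(510 - 415)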


In [3]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
schema = StructType([
    StructField("state", StringType(), True),
    StructField("account_length", DoubleType(), True),
    StructField("area_code", StringType(), True),
    StructField("phone_number", StringType(), True),
    StructField("intl_plan", StringType(), True),
    StructField("voice_mail_plan", StringType(), True),
    StructField("number_vmail_messages", DoubleType(), True),
    StructField("total_day_minutes", DoubleType(), True),
    StructField("total_day_calls", DoubleType(), True),
    StructField("total_day_charge", DoubleType(), True),
    StructField("total_eve_minutes", DoubleType(), True),
    StructField("total_eve_calls", DoubleType(), True),
    StructField("total_eve_charge", DoubleType(), True),
    StructField("total_night_minutes", DoubleType(), True),
    StructField("total_night_calls", DoubleType(), True),
    StructField("total_night_charge", DoubleType(), True),
    StructField("total_intl_minutes", DoubleType(), True),
    StructField("total_intl_calls", DoubleType(), True),
    StructField("total_intl_charge", DoubleType(), True),
    StructField("number_customer_service_calls", DoubleType(), True),
    StructField("churned", StringType(), True)])

churn_data = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .load('data/churn.all', schema = schema)
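
To verify that the load worked as expected, a quick sanity check (a minimal sketch, not part of the original notebook):

In [ ]:
# Confirm the schema was applied and peek at the first few rows.
churn_data.printSchema()
churn_data.show(5)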

Basic DataFrame operations

DataFrames essentially allow you to express SQL-like statements: we can filter, count, aggregate, and so on. See the DataFrame operations documentation.


In [4]:
count = churn_data.count()
# Note the leading space in " yes": the raw CSV values are not trimmed.
voice_mail_plans = churn_data.filter(churn_data.voice_mail_plan == " yes").count()

print "Total accts: %d\nAccts with voicemail: %d" % (count, voice_mail_plans)


Total accts: 5000
Accts with voicemail: 1323
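
DataFrames also support SQL-like aggregations. For example, a minimal sketch (not part of the original notebook) counting customers by churn label:

In [ ]:
# Group by the churn label and count each group.
churn_data.groupBy('churned').count().show()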

In [5]:
churn_data.toPandas()


Out[5]:
state account_length area_code phone_number intl_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge ... total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churned
0 KS 128.0 415 382-4657 no yes 25.0 265.1 110.0 45.07 ... 99.0 16.78 244.7 91.0 11.01 10.0 3.0 2.70 1.0 False.
1 OH 107.0 415 371-7191 no yes 26.0 161.6 123.0 27.47 ... 103.0 16.62 254.4 103.0 11.45 13.7 3.0 3.70 1.0 False.
2 NJ 137.0 415 358-1921 no no 0.0 243.4 114.0 41.38 ... 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0 False.
3 OH 84.0 408 375-9999 yes no 0.0 299.4 71.0 50.90 ... 88.0 5.26 196.9 89.0 8.86 6.6 7.0 1.78 2.0 False.
4 OK 75.0 415 330-6626 yes no 0.0 166.7 113.0 28.34 ... 122.0 12.61 186.9 121.0 8.41 10.1 3.0 2.73 3.0 False.
5 AL 118.0 510 391-8027 yes no 0.0 223.4 98.0 37.98 ... 101.0 18.75 203.9 118.0 9.18 6.3 6.0 1.70 0.0 False.
6 MA 121.0 510 355-9993 no yes 24.0 218.2 88.0 37.09 ... 108.0 29.62 212.6 118.0 9.57 7.5 7.0 2.03 3.0 False.
7 MO 147.0 415 329-9001 yes no 0.0 157.0 79.0 26.69 ... 94.0 8.76 211.8 96.0 9.53 7.1 6.0 1.92 0.0 False.
8 LA 117.0 408 335-4719 no no 0.0 184.5 97.0 31.37 ... 80.0 29.89 215.8 90.0 9.71 8.7 4.0 2.35 1.0 False.
9 WV 141.0 415 330-8173 yes yes 37.0 258.6 84.0 43.96 ... 111.0 18.87 326.4 97.0 14.69 11.2 5.0 3.02 0.0 False.
10 IN 65.0 415 329-6603 no no 0.0 129.1 137.0 21.95 ... 83.0 19.42 208.8 111.0 9.40 12.7 6.0 3.43 4.0 True.
11 RI 74.0 415 344-9403 no no 0.0 187.7 127.0 31.91 ... 148.0 13.89 196.0 94.0 8.82 9.1 5.0 2.46 0.0 False.
12 IA 168.0 408 363-1107 no no 0.0 128.8 96.0 21.90 ... 71.0 8.92 141.1 128.0 6.35 11.2 2.0 3.02 1.0 False.
13 MT 95.0 510 394-8006 no no 0.0 156.6 88.0 26.62 ... 75.0 21.05 192.3 115.0 8.65 12.3 5.0 3.32 3.0 False.
14 IA 62.0 415 366-9238 no no 0.0 120.7 70.0 20.52 ... 76.0 26.11 203.0 99.0 9.14 13.1 6.0 3.54 4.0 False.
15 NY 161.0 415 351-7269 no no 0.0 332.9 67.0 56.59 ... 97.0 27.01 160.6 128.0 7.23 5.4 9.0 1.46 4.0 True.
16 ID 85.0 408 350-8884 no yes 27.0 196.4 139.0 33.39 ... 90.0 23.88 89.3 75.0 4.02 13.8 4.0 3.73 1.0 False.
17 VT 93.0 510 386-2923 no no 0.0 190.7 114.0 32.42 ... 111.0 18.55 129.6 121.0 5.83 8.1 3.0 2.19 3.0 False.
18 VA 76.0 510 356-2992 no yes 33.0 189.7 66.0 32.25 ... 65.0 18.09 165.7 108.0 7.46 10.0 5.0 2.70 1.0 False.
19 TX 73.0 415 373-2782 no no 0.0 224.4 90.0 38.15 ... 88.0 13.56 192.8 74.0 8.68 13.0 2.0 3.51 1.0 False.
20 FL 147.0 415 396-5800 no no 0.0 155.1 117.0 26.37 ... 93.0 20.37 208.8 133.0 9.40 10.6 4.0 2.86 0.0 False.
21 CO 77.0 408 393-7984 no no 0.0 62.4 89.0 10.61 ... 121.0 14.44 209.6 64.0 9.43 5.7 6.0 1.54 5.0 True.
22 AZ 130.0 415 358-1958 no no 0.0 183.0 112.0 31.11 ... 99.0 6.20 181.8 78.0 8.18 9.5 19.0 2.57 0.0 False.
23 SC 111.0 415 350-2565 no no 0.0 110.4 103.0 18.77 ... 102.0 11.67 189.6 105.0 8.53 7.7 6.0 2.08 2.0 False.
24 VA 132.0 510 343-4696 no no 0.0 81.1 86.0 13.79 ... 72.0 20.84 237.0 115.0 10.67 10.3 2.0 2.78 0.0 False.
25 NE 174.0 415 331-3698 no no 0.0 124.3 76.0 21.13 ... 112.0 23.55 250.7 115.0 11.28 15.5 5.0 4.19 3.0 False.
26 WY 57.0 408 357-3817 no yes 39.0 213.0 115.0 36.21 ... 112.0 16.24 182.7 115.0 8.22 9.5 3.0 2.57 0.0 False.
27 MT 54.0 408 418-6412 no no 0.0 134.3 73.0 22.83 ... 100.0 13.22 102.1 68.0 4.59 14.7 4.0 3.97 3.0 False.
28 MO 20.0 415 353-2630 no no 0.0 190.0 109.0 32.30 ... 84.0 21.95 181.5 102.0 8.17 6.3 6.0 1.70 0.0 False.
29 HI 49.0 510 410-7789 no no 0.0 119.3 117.0 20.28 ... 109.0 18.28 178.7 90.0 8.04 11.1 1.0 3.00 1.0 False.
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4970 MD 133.0 415 363-2715 no no 0.0 173.1 126.0 29.43 ... 98.0 17.00 231.0 103.0 10.40 10.8 1.0 2.92 1.0 False.
4971 HI 128.0 510 409-8074 no yes 28.0 115.4 96.0 19.62 ... 94.0 15.76 209.3 69.0 9.42 17.7 4.0 4.78 0.0 False.
4972 FL 81.0 408 357-1624 no no 0.0 190.9 92.0 32.45 ... 132.0 17.74 120.8 98.0 5.44 11.4 6.0 3.08 1.0 False.
4973 ND 81.0 415 342-2543 no no 0.0 234.1 105.0 39.80 ... 88.0 20.18 240.6 85.0 10.83 11.4 4.0 3.08 1.0 False.
4974 LA 96.0 408 347-4427 no no 0.0 102.1 74.0 17.36 ... 88.0 7.96 236.2 125.0 10.63 14.0 2.0 3.78 1.0 False.
4975 AR 93.0 415 411-9580 no no 0.0 131.9 91.0 22.42 ... 92.0 17.32 205.6 121.0 9.25 6.8 3.0 1.84 1.0 False.
4976 KY 145.0 408 338-1170 no no 0.0 135.0 122.0 22.95 ... 88.0 17.54 210.4 90.0 9.47 19.7 4.0 5.32 0.0 False.
4977 HI 53.0 510 356-5217 yes no 0.0 145.0 80.0 24.65 ... 82.0 19.41 169.4 109.0 7.62 8.9 5.0 2.40 0.0 False.
4978 KS 108.0 408 389-4921 no no 0.0 170.3 48.0 28.95 ... 146.0 15.95 204.0 84.0 9.18 16.1 4.0 4.35 3.0 False.
4979 TX 52.0 415 330-3162 no no 0.0 188.7 99.0 32.08 ... 89.0 17.88 191.4 103.0 8.61 10.8 2.0 2.92 1.0 False.
4980 WA 73.0 408 408-8236 no no 0.0 177.2 118.0 30.12 ... 84.0 22.99 241.8 112.0 10.88 12.3 2.0 3.32 3.0 True.
4981 NC 61.0 510 408-3742 no no 0.0 7.2 97.0 1.22 ... 78.0 18.85 174.5 94.0 7.85 12.0 4.0 3.24 1.0 False.
4982 TX 86.0 408 360-8716 no no 0.0 181.1 124.0 30.79 ... 105.0 19.51 122.2 91.0 5.50 5.6 6.0 1.51 5.0 False.
4983 LA 137.0 408 400-1731 no no 0.0 151.8 90.0 25.81 ... 68.0 19.52 171.8 122.0 7.73 10.8 5.0 2.92 2.0 False.
4984 TN 74.0 510 341-2423 no no 0.0 159.4 86.0 27.10 ... 122.0 17.85 172.4 99.0 7.76 8.5 2.0 2.30 1.0 False.
4985 NC 118.0 510 344-6099 no yes 31.0 226.6 99.0 38.52 ... 136.0 14.02 189.3 120.0 8.52 10.2 5.0 2.75 3.0 False.
4986 OH 125.0 415 364-2584 no no 0.0 214.7 115.0 36.50 ... 94.0 22.15 266.3 123.0 11.98 7.4 6.0 2.00 3.0 False.
4987 AR 127.0 415 397-4141 no yes 27.0 157.6 107.0 26.79 ... 49.0 23.85 75.1 77.0 3.38 8.0 4.0 2.16 1.0 False.
4988 WA 80.0 510 397-4475 no no 0.0 157.0 101.0 26.69 ... 127.0 17.75 113.3 109.0 5.10 16.2 2.0 4.37 2.0 False.
4989 MN 150.0 408 407-6315 no no 0.0 170.0 115.0 28.90 ... 138.0 13.83 267.2 77.0 12.02 8.3 2.0 2.24 0.0 False.
4990 ND 140.0 510 364-8203 no no 0.0 244.7 115.0 41.60 ... 101.0 21.98 231.3 112.0 10.41 7.5 6.0 2.03 1.0 True.
4991 AZ 97.0 510 410-3888 no no 0.0 252.6 89.0 42.94 ... 91.0 28.93 256.5 67.0 11.54 8.8 5.0 2.38 1.0 True.
4992 MT 83.0 415 355-3630 no no 0.0 188.3 70.0 32.01 ... 88.0 20.72 213.7 79.0 9.62 10.3 6.0 2.78 0.0 False.
4993 WV 73.0 408 411-9655 no no 0.0 177.9 89.0 30.24 ... 82.0 11.15 186.2 89.0 8.38 11.5 6.0 3.11 3.0 False.
4994 NC 75.0 408 417-1886 no no 0.0 170.7 101.0 29.02 ... 126.0 16.41 129.1 104.0 5.81 6.9 7.0 1.86 1.0 False.
4995 HI 50.0 408 365-8751 no yes 40.0 235.7 127.0 40.07 ... 126.0 18.96 297.5 116.0 13.39 9.9 5.0 2.67 2.0 False.
4996 WV 152.0 415 334-9736 no no 0.0 184.2 90.0 31.31 ... 73.0 21.83 213.6 113.0 9.61 14.7 2.0 3.97 3.0 True.
4997 DC 61.0 415 333-6861 no no 0.0 140.6 89.0 23.90 ... 128.0 14.69 212.4 97.0 9.56 13.6 4.0 3.67 1.0 False.
4998 DC 109.0 510 394-2206 no no 0.0 188.8 67.0 32.10 ... 92.0 14.59 224.4 89.0 10.10 8.5 6.0 2.30 0.0 False.
4999 VT 86.0 415 373-8058 no yes 34.0 129.4 102.0 22.00 ... 104.0 22.70 154.8 100.0 6.97 9.3 16.0 2.51 0.0 False.

5000 rows × 21 columns

Exercise: Basic DataFrame operations

How many customers have one or more customer service calls? How many have more than one?


In [6]:
print "Customers with at least one service call: %d\nCustomers with more than one service call: %d" % (churn_data.filter(churn_data.number_customer_service_calls >= 1.0).count(), churn_data.filter(churn_data.number_customer_service_calls > 1.0).count())


Customers with at least one service call: 3977
Customers with more than one service call: 2191

Feature Visualization

The data visualization workflow for large data sets is usually:

  • Sample data so it fits in memory on a single machine.
  • Examine single variable distributions.
  • Examine joint distributions and correlations.
  • Look for other types of relationships.

DataFrame#sample() documentation


In [7]:
sample_data = churn_data.sample(False, 0.5, 83).toPandas()
sample_data.head()


Out[7]:
state account_length area_code phone_number intl_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge ... total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churned
0 KS 128.0 415 382-4657 no yes 25.0 265.1 110.0 45.07 ... 99.0 16.78 244.7 91.0 11.01 10.0 3.0 2.70 1.0 False.
1 OH 107.0 415 371-7191 no yes 26.0 161.6 123.0 27.47 ... 103.0 16.62 254.4 103.0 11.45 13.7 3.0 3.70 1.0 False.
2 NJ 137.0 415 358-1921 no no 0.0 243.4 114.0 41.38 ... 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0 False.
3 OK 75.0 415 330-6626 yes no 0.0 166.7 113.0 28.34 ... 122.0 12.61 186.9 121.0 8.41 10.1 3.0 2.73 3.0 False.
4 AL 118.0 510 391-8027 yes no 0.0 223.4 98.0 37.98 ... 101.0 18.75 203.9 118.0 9.18 6.3 6.0 1.70 0.0 False.

5 rows × 21 columns

DataTypes

The type of visualization we do depends on the data type, so let's first define which columns are numeric and which are categorical:


In [8]:
numeric_cols = ["account_length", "number_vmail_messages", "total_day_minutes",
                "total_day_calls", "total_day_charge", "total_eve_minutes",
                "total_eve_calls", "total_eve_charge", "total_night_minutes",
                "total_night_calls", "total_intl_minutes", "total_intl_calls",
                "total_intl_charge"]

categorical_cols = ["state", "intl_plan", "voice_mail_plan", "area_code"]
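
Before plotting, it can help to look at simple summary statistics. A minimal sketch (not part of the original notebook), using the pandas sample from above:

In [ ]:
# Count, mean, std, min, quartiles, and max for each numeric column.
sample_data[numeric_cols].describe()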

Seaborn

Seaborn is a library for statistical visualization that is built on matplotlib.

Great support for:

  • plotting distributions
  • regression analyses
  • plotting with categorical splitting

Feature Distributions

We want to examine the distribution of our features, so let's start by looking at them one at a time.

Seaborn has a standard function called distplot() that allows us to easily examine the distribution of a column of a pandas DataFrame or a numpy array.


In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

sb.distplot(sample_data['number_customer_service_calls'], kde=False);


We can also examine how the distribution of a feature differs when we condition (split) the data on whether or not the customer churned.

BoxPlot docs


In [10]:
sb.boxplot(x="churned", y="number_customer_service_calls", data=sample_data);


Joint Distributions

Looking at joint distributions of data can also tell us a lot, particularly about redundant features. Seaborn's pairplot lets us look at joint distributions for many variables at once.


In [ ]:
example_numeric_data = sample_data[["total_intl_minutes", "total_intl_calls",
                                       "total_intl_charge", "churned"]]
sb.pairplot(example_numeric_data, hue="churned");

Clearly, there are strong linear relationships between some variables. Let's get a general impression of the correlations between variables by using Seaborn's heatmap functionality.


In [ ]:
corr = sample_data[["account_length", "number_vmail_messages", "total_day_minutes",
                    "total_day_calls", "total_day_charge", "total_eve_minutes",
                    "total_eve_calls", "total_eve_charge", "total_night_minutes",
                    "total_night_calls", "total_intl_minutes", "total_intl_calls",
                    "total_intl_charge"]].corr()

sb.heatmap(corr);

The charge columns are nearly linear functions of the corresponding minutes columns, so we can treat the minutes columns as redundant. Let's generate a pair plot for the remaining numeric variables.


In [ ]:
reduced_numeric_cols = ["account_length", "number_vmail_messages", "total_day_calls",
                        "total_day_charge", "total_eve_calls", "total_eve_charge",
                        "total_night_calls", "total_intl_calls", "total_intl_charge"]

sb.pairplot(sample_data[reduced_numeric_cols + ['churned']], hue="churned", palette='Paired');

Build a classification model using MLlib

We want to build a predictive model.

The overall process: feature extraction, then model training, then model evaluation.

Feature Extraction and Model Training

We need to:

  • Code features that are not already numeric
  • Gather all features we need into a single column in the DataFrame.
  • Split labeled data into training and testing set
  • Fit the model to the training data.

Feature Extraction

We need to define our input features.

PySpark Pipeline Docs


In [ ]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'intl_plan', outputCol = 'intl_plan_indexed')

assembler = VectorAssembler(
    inputCols = ['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol = 'features')
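
StringIndexer maps each distinct string to a numeric index, assigning 0 to the most frequent value. As a quick sanity check (a minimal sketch, not part of the original notebook), we can fit the label indexer on its own and inspect the mapping it learns:

In [ ]:
# Fit the label indexer alone and look at the learned mapping.
# The majority label (non-churners) should map to 0.0.
label_indexer.fit(churn_data).transform(churn_data) \
    .select('churned', 'label').distinct().show()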

Model Training

We can now define our classifier and pipeline. With this done, we can split our labeled data in train and test sets and fit a model.


In [ ]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

classifier = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

(train, test) = churn_data.randomSplit([0.7, 0.3])
model = pipeline.fit(train)
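
As a quick check (a sketch, not part of the original notebook), confirm that the random split is roughly 70/30:

In [ ]:
# randomSplit is probabilistic, so the counts will only be
# approximately in a 70/30 ratio.
print "Train: %d, Test: %d" % (train.count(), test.count())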

Model Evaluation

The most important question to ask:

Is my predictor better than random guessing?

How do we quantify that?


In [ ]:
predictions = model.transform(churn_data)
predictions.toPandas()

We measure the area under the receiver operating characteristic curve, abbreviated AUROC, and the area under the precision-recall curve, AUPR. Both are good measurements of success for machine learning classifiers: a random classifier has an AUROC of 0.5, so anything meaningfully above that indicates the model has learned real signal.


In [ ]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print "The AUROC is %s and the AUPR is %s." % (auroc, aupr)

In [ ]:
# Count the true negatives: customers who did not churn and were also
# predicted not to churn. Note the leading space and trailing period in
# the raw label value ' False.', which come straight from the source file.
predictions.filter((predictions.churned == ' False.') & (predictions.prediction == 0.0)).count()
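
To see all four cells of the confusion matrix at once, a minimal sketch (not part of the original notebook) using the DataFrame crosstab method:

In [ ]:
# Cross-tabulate the true label against the predicted label.
predictions.crosstab('churned', 'prediction').show()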

Exercise: Fit a RandomForestClassifier

Fit a random forest classifier to the data. Try experimenting with different values of the maxDepth, numTrees, and impurity parameters to see which gives the best classification performance. Do the settings that give the best classification performance on the training set also give the best classification performance on the test set?

Have a look at the documentation.


In [ ]:
from pyspark.ml.classification import RandomForestClassifier
 
# Your code here
classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

# (train, test) = churn_data.randomSplit([0.7, 0.3])
model = pipeline.fit(train)

In [ ]:
predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print "The AUROC is %s and the AUPR is %s." % (auroc, aupr)

In [ ]:
classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features',
                                    numTrees=30, maxDepth=5, seed=42, impurity='entropy')
pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

# (train, test) = churn_data.randomSplit([0.7, 0.3])
model = pipeline.fit(train)

predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print "The AUROC is %s and the AUPR is %s." % (auroc, aupr)
