Data Science using Telco data

Load the data

We need to load data from a file into a Spark DataFrame. Each row is an observed customer, and each column contains attributes of that customer.

The data comes from the UCI machine learning repository, hosted by SGI.

Fields:

state: discrete.
account length: numeric.
area code: numeric.
phone number: discrete.
international plan: discrete.
voice mail plan: discrete.
number vmail messages: numeric.
total day minutes: numeric.
total day calls: numeric.
total day charge: numeric.
total eve minutes: numeric.
total eve calls: numeric.
total eve charge: numeric.
total night minutes: numeric.
total night calls: numeric.
total night charge: numeric.
total intl minutes: numeric.
total intl calls: numeric.
total intl charge: numeric.
number customer service calls: numeric.
churned: discrete.

'Numeric' and 'discrete' do not adequately describe the fundamental differences between the attributes.

Area codes are considered numeric, but they are better thought of as a categorical variable. Attributes that are genuinely numeric have a reasonable concept of distance between points; area codes do not fall into this category. Two area codes can have a small distance |area_code_1 - area_code_2|, but that distance does not correspond to any similarity between them.
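
For illustration, here is a quick sketch (not part of the original notebook) of why arithmetic distance between area codes is meaningless. The three area codes in this data set are 408, 415, and 510:

In [ ]:
# |415 - 408| = 7 and |510 - 415| = 95, but neither difference says
# anything about how "similar" two area codes are. That is why the
# schema below loads area_code as a string (categorical) column.
print abs(415 - 408), abs(510 - 415)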


In [3]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)
schema = StructType([
    StructField("state", StringType(), True),
    StructField("account_length", DoubleType(), True),
    StructField("area_code", StringType(), True),
    StructField("phone_number", StringType(), True),
    StructField("intl_plan", StringType(), True),
    StructField("voice_mail_plan", StringType(), True),
    StructField("number_vmail_messages", DoubleType(), True),
    StructField("total_day_minutes", DoubleType(), True),
    StructField("total_day_calls", DoubleType(), True),
    StructField("total_day_charge", DoubleType(), True),
    StructField("total_eve_minutes", DoubleType(), True),
    StructField("total_eve_calls", DoubleType(), True),
    StructField("total_eve_charge", DoubleType(), True),
    StructField("total_night_minutes", DoubleType(), True),
    StructField("total_night_calls", DoubleType(), True),
    StructField("total_night_charge", DoubleType(), True),
    StructField("total_intl_minutes", DoubleType(), True),
    StructField("total_intl_calls", DoubleType(), True),
    StructField("total_intl_charge", DoubleType(), True),
    StructField("number_customer_service_calls", DoubleType(), True),
    StructField("churned", StringType(), True)])

churn_data = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .load('data/churn.all', schema = schema)
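
To verify that the load worked as expected, a quick sanity check (a minimal sketch, not part of the original notebook):

In [ ]:
# Confirm the schema was applied and peek at the first few rows.
churn_data.printSchema()
churn_data.show(5)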

Basic DataFrame operations

DataFrames essentially allow you to express SQL-like statements: we can filter, count, aggregate, and so on. See the DataFrame operations documentation.


In [4]:
count = churn_data.count()
# Note the leading space in " yes": the raw CSV values are not trimmed.
voice_mail_plans = churn_data.filter(churn_data.voice_mail_plan == " yes").count()

print "Total accts: %d\nAccts with voicemail: %d" % (count, voice_mail_plans)


Total accts: 5000
Accts with voicemail: 1323
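
DataFrames also support SQL-like aggregations. For example, a minimal sketch (not part of the original notebook) counting customers by churn label:

In [ ]:
# Group by the churn label and count each group.
churn_data.groupBy('churned').count().show()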

In [5]:
churn_data.toPandas()


Out[5]:
state account_length area_code phone_number intl_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge ... total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churned
0 KS 128.0 415 382-4657 no yes 25.0 265.1 110.0 45.07 ... 99.0 16.78 244.7 91.0 11.01 10.0 3.0 2.70 1.0 False.
1 OH 107.0 415 371-7191 no yes 26.0 161.6 123.0 27.47 ... 103.0 16.62 254.4 103.0 11.45 13.7 3.0 3.70 1.0 False.
2 NJ 137.0 415 358-1921 no no 0.0 243.4 114.0 41.38 ... 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0 False.
3 OH 84.0 408 375-9999 yes no 0.0 299.4 71.0 50.90 ... 88.0 5.26 196.9 89.0 8.86 6.6 7.0 1.78 2.0 False.
4 OK 75.0 415 330-6626 yes no 0.0 166.7 113.0 28.34 ... 122.0 12.61 186.9 121.0 8.41 10.1 3.0 2.73 3.0 False.
5 AL 118.0 510 391-8027 yes no 0.0 223.4 98.0 37.98 ... 101.0 18.75 203.9 118.0 9.18 6.3 6.0 1.70 0.0 False.
6 MA 121.0 510 355-9993 no yes 24.0 218.2 88.0 37.09 ... 108.0 29.62 212.6 118.0 9.57 7.5 7.0 2.03 3.0 False.
7 MO 147.0 415 329-9001 yes no 0.0 157.0 79.0 26.69 ... 94.0 8.76 211.8 96.0 9.53 7.1 6.0 1.92 0.0 False.
8 LA 117.0 408 335-4719 no no 0.0 184.5 97.0 31.37 ... 80.0 29.89 215.8 90.0 9.71 8.7 4.0 2.35 1.0 False.
9 WV 141.0 415 330-8173 yes yes 37.0 258.6 84.0 43.96 ... 111.0 18.87 326.4 97.0 14.69 11.2 5.0 3.02 0.0 False.
10 IN 65.0 415 329-6603 no no 0.0 129.1 137.0 21.95 ... 83.0 19.42 208.8 111.0 9.40 12.7 6.0 3.43 4.0 True.
11 RI 74.0 415 344-9403 no no 0.0 187.7 127.0 31.91 ... 148.0 13.89 196.0 94.0 8.82 9.1 5.0 2.46 0.0 False.
12 IA 168.0 408 363-1107 no no 0.0 128.8 96.0 21.90 ... 71.0 8.92 141.1 128.0 6.35 11.2 2.0 3.02 1.0 False.
13 MT 95.0 510 394-8006 no no 0.0 156.6 88.0 26.62 ... 75.0 21.05 192.3 115.0 8.65 12.3 5.0 3.32 3.0 False.
14 IA 62.0 415 366-9238 no no 0.0 120.7 70.0 20.52 ... 76.0 26.11 203.0 99.0 9.14 13.1 6.0 3.54 4.0 False.
15 NY 161.0 415 351-7269 no no 0.0 332.9 67.0 56.59 ... 97.0 27.01 160.6 128.0 7.23 5.4 9.0 1.46 4.0 True.
16 ID 85.0 408 350-8884 no yes 27.0 196.4 139.0 33.39 ... 90.0 23.88 89.3 75.0 4.02 13.8 4.0 3.73 1.0 False.
17 VT 93.0 510 386-2923 no no 0.0 190.7 114.0 32.42 ... 111.0 18.55 129.6 121.0 5.83 8.1 3.0 2.19 3.0 False.
18 VA 76.0 510 356-2992 no yes 33.0 189.7 66.0 32.25 ... 65.0 18.09 165.7 108.0 7.46 10.0 5.0 2.70 1.0 False.
19 TX 73.0 415 373-2782 no no 0.0 224.4 90.0 38.15 ... 88.0 13.56 192.8 74.0 8.68 13.0 2.0 3.51 1.0 False.
20 FL 147.0 415 396-5800 no no 0.0 155.1 117.0 26.37 ... 93.0 20.37 208.8 133.0 9.40 10.6 4.0 2.86 0.0 False.
21 CO 77.0 408 393-7984 no no 0.0 62.4 89.0 10.61 ... 121.0 14.44 209.6 64.0 9.43 5.7 6.0 1.54 5.0 True.
22 AZ 130.0 415 358-1958 no no 0.0 183.0 112.0 31.11 ... 99.0 6.20 181.8 78.0 8.18 9.5 19.0 2.57 0.0 False.
23 SC 111.0 415 350-2565 no no 0.0 110.4 103.0 18.77 ... 102.0 11.67 189.6 105.0 8.53 7.7 6.0 2.08 2.0 False.
24 VA 132.0 510 343-4696 no no 0.0 81.1 86.0 13.79 ... 72.0 20.84 237.0 115.0 10.67 10.3 2.0 2.78 0.0 False.
25 NE 174.0 415 331-3698 no no 0.0 124.3 76.0 21.13 ... 112.0 23.55 250.7 115.0 11.28 15.5 5.0 4.19 3.0 False.
26 WY 57.0 408 357-3817 no yes 39.0 213.0 115.0 36.21 ... 112.0 16.24 182.7 115.0 8.22 9.5 3.0 2.57 0.0 False.
27 MT 54.0 408 418-6412 no no 0.0 134.3 73.0 22.83 ... 100.0 13.22 102.1 68.0 4.59 14.7 4.0 3.97 3.0 False.
28 MO 20.0 415 353-2630 no no 0.0 190.0 109.0 32.30 ... 84.0 21.95 181.5 102.0 8.17 6.3 6.0 1.70 0.0 False.
29 HI 49.0 510 410-7789 no no 0.0 119.3 117.0 20.28 ... 109.0 18.28 178.7 90.0 8.04 11.1 1.0 3.00 1.0 False.
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4970 MD 133.0 415 363-2715 no no 0.0 173.1 126.0 29.43 ... 98.0 17.00 231.0 103.0 10.40 10.8 1.0 2.92 1.0 False.
4971 HI 128.0 510 409-8074 no yes 28.0 115.4 96.0 19.62 ... 94.0 15.76 209.3 69.0 9.42 17.7 4.0 4.78 0.0 False.
4972 FL 81.0 408 357-1624 no no 0.0 190.9 92.0 32.45 ... 132.0 17.74 120.8 98.0 5.44 11.4 6.0 3.08 1.0 False.
4973 ND 81.0 415 342-2543 no no 0.0 234.1 105.0 39.80 ... 88.0 20.18 240.6 85.0 10.83 11.4 4.0 3.08 1.0 False.
4974 LA 96.0 408 347-4427 no no 0.0 102.1 74.0 17.36 ... 88.0 7.96 236.2 125.0 10.63 14.0 2.0 3.78 1.0 False.
4975 AR 93.0 415 411-9580 no no 0.0 131.9 91.0 22.42 ... 92.0 17.32 205.6 121.0 9.25 6.8 3.0 1.84 1.0 False.
4976 KY 145.0 408 338-1170 no no 0.0 135.0 122.0 22.95 ... 88.0 17.54 210.4 90.0 9.47 19.7 4.0 5.32 0.0 False.
4977 HI 53.0 510 356-5217 yes no 0.0 145.0 80.0 24.65 ... 82.0 19.41 169.4 109.0 7.62 8.9 5.0 2.40 0.0 False.
4978 KS 108.0 408 389-4921 no no 0.0 170.3 48.0 28.95 ... 146.0 15.95 204.0 84.0 9.18 16.1 4.0 4.35 3.0 False.
4979 TX 52.0 415 330-3162 no no 0.0 188.7 99.0 32.08 ... 89.0 17.88 191.4 103.0 8.61 10.8 2.0 2.92 1.0 False.
4980 WA 73.0 408 408-8236 no no 0.0 177.2 118.0 30.12 ... 84.0 22.99 241.8 112.0 10.88 12.3 2.0 3.32 3.0 True.
4981 NC 61.0 510 408-3742 no no 0.0 7.2 97.0 1.22 ... 78.0 18.85 174.5 94.0 7.85 12.0 4.0 3.24 1.0 False.
4982 TX 86.0 408 360-8716 no no 0.0 181.1 124.0 30.79 ... 105.0 19.51 122.2 91.0 5.50 5.6 6.0 1.51 5.0 False.
4983 LA 137.0 408 400-1731 no no 0.0 151.8 90.0 25.81 ... 68.0 19.52 171.8 122.0 7.73 10.8 5.0 2.92 2.0 False.
4984 TN 74.0 510 341-2423 no no 0.0 159.4 86.0 27.10 ... 122.0 17.85 172.4 99.0 7.76 8.5 2.0 2.30 1.0 False.
4985 NC 118.0 510 344-6099 no yes 31.0 226.6 99.0 38.52 ... 136.0 14.02 189.3 120.0 8.52 10.2 5.0 2.75 3.0 False.
4986 OH 125.0 415 364-2584 no no 0.0 214.7 115.0 36.50 ... 94.0 22.15 266.3 123.0 11.98 7.4 6.0 2.00 3.0 False.
4987 AR 127.0 415 397-4141 no yes 27.0 157.6 107.0 26.79 ... 49.0 23.85 75.1 77.0 3.38 8.0 4.0 2.16 1.0 False.
4988 WA 80.0 510 397-4475 no no 0.0 157.0 101.0 26.69 ... 127.0 17.75 113.3 109.0 5.10 16.2 2.0 4.37 2.0 False.
4989 MN 150.0 408 407-6315 no no 0.0 170.0 115.0 28.90 ... 138.0 13.83 267.2 77.0 12.02 8.3 2.0 2.24 0.0 False.
4990 ND 140.0 510 364-8203 no no 0.0 244.7 115.0 41.60 ... 101.0 21.98 231.3 112.0 10.41 7.5 6.0 2.03 1.0 True.
4991 AZ 97.0 510 410-3888 no no 0.0 252.6 89.0 42.94 ... 91.0 28.93 256.5 67.0 11.54 8.8 5.0 2.38 1.0 True.
4992 MT 83.0 415 355-3630 no no 0.0 188.3 70.0 32.01 ... 88.0 20.72 213.7 79.0 9.62 10.3 6.0 2.78 0.0 False.
4993 WV 73.0 408 411-9655 no no 0.0 177.9 89.0 30.24 ... 82.0 11.15 186.2 89.0 8.38 11.5 6.0 3.11 3.0 False.
4994 NC 75.0 408 417-1886 no no 0.0 170.7 101.0 29.02 ... 126.0 16.41 129.1 104.0 5.81 6.9 7.0 1.86 1.0 False.
4995 HI 50.0 408 365-8751 no yes 40.0 235.7 127.0 40.07 ... 126.0 18.96 297.5 116.0 13.39 9.9 5.0 2.67 2.0 False.
4996 WV 152.0 415 334-9736 no no 0.0 184.2 90.0 31.31 ... 73.0 21.83 213.6 113.0 9.61 14.7 2.0 3.97 3.0 True.
4997 DC 61.0 415 333-6861 no no 0.0 140.6 89.0 23.90 ... 128.0 14.69 212.4 97.0 9.56 13.6 4.0 3.67 1.0 False.
4998 DC 109.0 510 394-2206 no no 0.0 188.8 67.0 32.10 ... 92.0 14.59 224.4 89.0 10.10 8.5 6.0 2.30 0.0 False.
4999 VT 86.0 415 373-8058 no yes 34.0 129.4 102.0 22.00 ... 104.0 22.70 154.8 100.0 6.97 9.3 16.0 2.51 0.0 False.

5000 rows × 21 columns

Exercise: Basic DataFrame operations

How many customers have one or more customer service calls? How many have more than one?


In [6]:
print "Customers with at least one service call: %d\nCustomers with more than one service call: %d" % (churn_data.filter(churn_data.number_customer_service_calls >= 1.0).count(), churn_data.filter(churn_data.number_customer_service_calls > 1.0).count())


Customers with at least one service call: 3977
Customers with more than one service call: 2191

Feature Visualization

The data visualization workflow for large data sets is usually:

  • Sample data so it fits in memory on a single machine.
  • Examine single variable distributions.
  • Examine joint distributions and correlations.
  • Look for other types of relationships.

DataFrame#sample() documentation


In [7]:
sample_data = churn_data.sample(False, 0.5, 83).toPandas()
sample_data.head()


Out[7]:
state account_length area_code phone_number intl_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge ... total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churned
0 KS 128.0 415 382-4657 no yes 25.0 265.1 110.0 45.07 ... 99.0 16.78 244.7 91.0 11.01 10.0 3.0 2.70 1.0 False.
1 OH 107.0 415 371-7191 no yes 26.0 161.6 123.0 27.47 ... 103.0 16.62 254.4 103.0 11.45 13.7 3.0 3.70 1.0 False.
2 NJ 137.0 415 358-1921 no no 0.0 243.4 114.0 41.38 ... 110.0 10.30 162.6 104.0 7.32 12.2 5.0 3.29 0.0 False.
3 OK 75.0 415 330-6626 yes no 0.0 166.7 113.0 28.34 ... 122.0 12.61 186.9 121.0 8.41 10.1 3.0 2.73 3.0 False.
4 AL 118.0 510 391-8027 yes no 0.0 223.4 98.0 37.98 ... 101.0 18.75 203.9 118.0 9.18 6.3 6.0 1.70 0.0 False.

5 rows × 21 columns

DataTypes

The type of visualization we do depends on the data type, so let's first define which columns are numeric and which are categorical:


In [8]:
numeric_cols = ["account_length", "number_vmail_messages", "total_day_minutes",
                "total_day_calls", "total_day_charge", "total_eve_minutes",
                "total_eve_calls", "total_eve_charge", "total_night_minutes",
                "total_night_calls", "total_intl_minutes", "total_intl_calls",
                "total_intl_charge"]

categorical_cols = ["state", "intl_plan", "voice_mail_plan", "area_code"]
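
Before plotting, it can help to look at simple summary statistics. A minimal sketch (not part of the original notebook), using the pandas sample from above:

In [ ]:
# Count, mean, std, min, quartiles, and max for each numeric column.
sample_data[numeric_cols].describe()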

Seaborn

Seaborn is a library for statistical visualization that is built on matplotlib.

Great support for:

  • plotting distributions
  • regression analyses
  • plotting with categorical splitting

Feature Distributions

We want to examine the distribution of our features, so let's start by looking at them one at a time.

Seaborn has a standard function called distplot() that allows us to easily examine the distribution of a column of a pandas DataFrame or a numpy array.


In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

sb.distplot(sample_data['number_customer_service_calls'], kde=False);


We can also examine how the distribution of a feature differs when we condition (split) the data on whether or not the customer churned.

BoxPlot docs


In [10]:
sb.boxplot(x="churned", y="number_customer_service_calls", data=sample_data);


Joint Distributions

Looking at joint distributions of data can also tell us a lot, particularly about redundant features. Seaborn's pairplot lets us look at joint distributions for many variables at once.


In [ ]:
example_numeric_data = sample_data[["total_intl_minutes", "total_intl_calls",
                                       "total_intl_charge", "churned"]]
sb.pairplot(example_numeric_data, hue="churned");

Clearly, there are strong linear relationships between some variables. Let's get a general impression of the correlations between variables by using Seaborn's heatmap functionality.


In [ ]:
corr = sample_data[["account_length", "number_vmail_messages", "total_day_minutes",
                    "total_day_calls", "total_day_charge", "total_eve_minutes",
                    "total_eve_calls", "total_eve_charge", "total_night_minutes",
                    "total_night_calls", "total_intl_minutes", "total_intl_calls",
                    "total_intl_charge"]].corr()

sb.heatmap(corr);

The charge columns are nearly linear functions of the corresponding minutes columns, so we can treat the minutes columns as redundant. Let's generate a pair plot for the remaining numeric variables.


In [ ]:
reduced_numeric_cols = ["account_length", "number_vmail_messages", "total_day_calls",
                        "total_day_charge", "total_eve_calls", "total_eve_charge",
                        "total_night_calls", "total_intl_calls", "total_intl_charge"]

sb.pairplot(sample_data[reduced_numeric_cols + ['churned']], hue="churned", palette='Paired');

Build a classification model using MLlib

We want to build a predictive model.

The overall process: feature extraction, then model training, then model evaluation.

Feature Extraction and Model Training

We need to:

  • Code features that are not already numeric
  • Gather all features we need into a single column in the DataFrame.
  • Split labeled data into training and testing set
  • Fit the model to the training data.

Feature Extraction

We need to define our input features.

PySpark Pipeline Docs


In [ ]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler

label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'intl_plan', outputCol = 'intl_plan_indexed')

assembler = VectorAssembler(
    inputCols = ['intl_plan_indexed'] + reduced_numeric_cols,
    outputCol = 'features')
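
StringIndexer maps each distinct string to a numeric index, assigning 0 to the most frequent value. As a quick sanity check (a minimal sketch, not part of the original notebook), we can fit the label indexer on its own and inspect the mapping it learns:

In [ ]:
# Fit the label indexer alone and look at the learned mapping.
# The majority label (non-churners) should map to 0.0.
label_indexer.fit(churn_data).transform(churn_data) \
    .select('churned', 'label').distinct().show()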

Model Training

We can now define our classifier and pipeline. With this done, we can split our labeled data in train and test sets and fit a model.


In [ ]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

classifier = DecisionTreeClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

(train, test) = churn_data.randomSplit([0.7, 0.3])
model = pipeline.fit(train)
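
As a quick check (a sketch, not part of the original notebook), confirm that the random split is roughly 70/30:

In [ ]:
# randomSplit is probabilistic, so the counts will only be
# approximately in a 70/30 ratio.
print "Train: %d, Test: %d" % (train.count(), test.count())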

Model Evaluation

The most important question to ask:

Is my predictor better than random guessing?

How do we quantify that?


In [ ]:
predictions = model.transform(churn_data)
predictions.toPandas()

We measure the area under the receiver operating characteristic curve, abbreviated AUROC, and the area under the precision-recall curve, AUPR. Both are good measurements of success for machine learning classifiers: a random classifier has an AUROC of 0.5, so anything meaningfully above that indicates the model has learned real signal.


In [ ]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print "The AUROC is %s and the AUPR is %s." % (auroc, aupr)

In [ ]:
# Count the true negatives: customers who did not churn and were also
# predicted not to churn. Note the leading space and trailing period in
# the raw label value ' False.', which come straight from the source file.
predictions.filter((predictions.churned == ' False.') & (predictions.prediction == 0.0)).count()
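
To see all four cells of the confusion matrix at once, a minimal sketch (not part of the original notebook) using the DataFrame crosstab method:

In [ ]:
# Cross-tabulate the true label against the predicted label.
predictions.crosstab('churned', 'prediction').show()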

Exercise: Fit a RandomForestClassifier

Fit a random forest classifier to the data. Try experimenting with different values of the maxDepth, numTrees, and impurity parameters to see which gives the best classification performance. Do the settings that give the best classification performance on the training set also give the best classification performance on the test set?

Have a look at the documentation.


In [ ]:
from pyspark.ml.classification import RandomForestClassifier
 
# Your code here
classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')

pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

# (train, test) = churn_data.randomSplit([0.7, 0.3])
model = pipeline.fit(train)

In [ ]:
predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print "The AUROC is %s and the AUPR is %s." % (auroc, aupr)

In [ ]:
classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features',
                                    numTrees=30, maxDepth=5, seed=42, impurity='entropy')
pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])

# (train, test) = churn_data.randomSplit([0.7, 0.3])
model = pipeline.fit(train)

predictions = model.transform(test)
evaluator = BinaryClassificationEvaluator()
auroc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})
print "The AUROC is %s and the AUPR is %s." % (auroc, aupr)
