Exercise 04

Part 1 - Linear Regression

Estimate a regression using the Income data

Forecast of income

We'll be working with a US Census income dataset (data dictionary).

Many businesses would like to personalize their offers based on a customer's income. High-income customers could, for instance, be shown premium products. Because a customer's income is not always explicitly known, a predictive model could estimate a person's income from other information.

Our goal is to create a predictive model that outputs an estimate of a person's income.



In [2]:

    
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

# read the data and use the first column as the index
income = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/income.csv.zip', index_col=0)

income.head()









    Out[2]:







  
    
      
	Age	Workclass	fnlwgt	Education	Education-Num	Martial Status	Occupation	Relationship	Race	Sex	Capital Gain	Capital Loss	Hours per week	Country	Income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	51806.0
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	68719.0
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	51255.0
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	47398.0
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	30493.0



In [3]:

    
income.shape









    Out[3]:





(32561, 15)

Exercise 4.1

What is the relationship between Age and Income?

For a one percent increase in Age, by how much does Income increase?

Using sklearn, estimate a linear regression and predict the income when Age is 30 and 40 years.
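
One possible approach is sketched below. It assumes only the `Age` and `Income` columns and, to stay self-contained, fits on the five rows shown by `income.head()` rather than the full dataset; the helper name `fit_age_model` is hypothetical. In the notebook, the full `income` frame would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_age_model(df):
    """Fit Income ~ Age by ordinary least squares and return the fitted model."""
    X = df[['Age']]      # 2-D feature matrix, as sklearn expects
    y = df['Income']
    return LinearRegression().fit(X, y)

# Only the five preview rows from income.head(); not the full dataset.
sample = pd.DataFrame({
    'Age': [39, 50, 38, 53, 28],
    'Income': [51806.0, 68719.0, 51255.0, 47398.0, 30493.0],
})

model = fit_age_model(sample)
slope = model.coef_[0]   # change in predicted Income per extra year of Age
preds = model.predict(pd.DataFrame({'Age': [30, 40]}))

print(f"slope per year of Age: {slope:.2f}")
print(f"predicted Income at Age 30 and 40: {preds.round(2)}")
```

Because this model is linear in Age, the slope is the change in predicted Income per additional year. A one percent increase starting from age a therefore changes the prediction by slope × 0.01 × a, so the answer depends on the starting age.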



In [4]:

    
income.plot(x='Age', y='Income', kind='scatter')









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0x1b127d8e630>



In [ ]:

Exercise 4.2

Evaluate the model using the MSE
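
As a rough sketch, the MSE can be computed with `sklearn.metrics.mean_squared_error` and cross-checked by hand. The example again assumes only the five preview rows from `income.head()` and evaluates in-sample; with the full data, a held-out set would give a more honest estimate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Only the five preview rows from income.head(); not the full dataset.
sample = pd.DataFrame({
    'Age': [39, 50, 38, 53, 28],
    'Income': [51806.0, 68719.0, 51255.0, 47398.0, 30493.0],
})
X, y = sample[['Age']], sample['Income']

y_pred = LinearRegression().fit(X, y).predict(X)

mse = mean_squared_error(y, y_pred)      # library implementation
mse_manual = np.mean((y - y_pred) ** 2)  # mean of squared residuals
rmse = np.sqrt(mse)                      # same units as Income

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}")
```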



In [ ]:

Exercise 4.3

Run a regression model using Age and Age$^2$ as features, estimating the coefficients with the OLS equations.
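
A minimal sketch of the OLS normal equations, $\hat{\beta} = (X^T X)^{-1} X^T y$, with an intercept, Age and Age$^2$ as columns of $X$. It assumes the five preview rows from `income.head()` as stand-in data; the `predict` helper is hypothetical.

```python
import numpy as np

# Only the five preview rows from income.head(); not the full dataset.
age = np.array([39, 50, 38, 53, 28], dtype=float)
y = np.array([51806.0, 68719.0, 51255.0, 47398.0, 30493.0])

# Design matrix with columns [1, Age, Age^2]
X = np.column_stack([np.ones_like(age), age, age ** 2])

# Normal equations: solve (X^T X) beta = X^T y for beta
beta = np.linalg.solve(X.T @ X, X.T @ y)

def predict(ages):
    """Predicted Income for one or more ages under the quadratic model."""
    ages = np.asarray(ages, dtype=float)
    return beta[0] + beta[1] * ages + beta[2] * ages ** 2

print("beta (intercept, Age, Age^2):", beta)
print("predictions at 30 and 40:", predict([30, 40]))
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which tends to be less accurate numerically.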



In [ ]:

Exercise 4.4

Estimate a regression using more features.

How does the performance compare with using only Age?
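
One way to set up the comparison is sketched below, assuming the numeric columns `Education-Num` and `Hours per week` as the extra features and the five preview rows from `income.head()` as stand-in data. The helper `in_sample_mse` is hypothetical. Categorical columns such as `Workclass` would first need encoding, for example with `pd.get_dummies`.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def in_sample_mse(df, features, target='Income'):
    """Fit a linear regression on `features` and return its training MSE."""
    X, y = df[features], df[target]
    model = LinearRegression().fit(X, y)
    return mean_squared_error(y, model.predict(X))

# Only the five preview rows from income.head(); not the full dataset.
sample = pd.DataFrame({
    'Age': [39, 50, 38, 53, 28],
    'Education-Num': [13, 13, 9, 7, 13],
    'Hours per week': [40, 13, 40, 40, 40],
    'Income': [51806.0, 68719.0, 51255.0, 47398.0, 30493.0],
})

mse_age = in_sample_mse(sample, ['Age'])
mse_more = in_sample_mse(sample, ['Age', 'Education-Num', 'Hours per week'])

print(f"MSE with Age only:      {mse_age:.2f}")
print(f"MSE with more features: {mse_more:.2f}")
```

Training MSE never increases when features are added to an OLS model, so on the full dataset a fair comparison would use a held-out test set.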



In [ ]:

Part 2: Logistic Regression

Customer Churn:

Customer churn is the loss (attrition) of customers from a company. Especially in industries where user acquisition is costly, it is crucial for a company to reduce churn, ideally to zero, in order to sustain its recurring revenue. Given that customer retention is cheaper than customer acquisition, and that churn generally depends on user data (usage of the service or product), it poses an exciting and hard problem for machine learning.

Data

The dataset is from a telecom service provider. It contains service usage (international plan, voicemail plan, daytime usage, evening and night usage, and so on) and basic demographic information (state and area code) for each user. For labels, I have a single field per customer indicating whether the customer churned or not.



In [6]:

    
# Download the dataset
data = pd.read_csv('https://github.com/ghuiber/churn/raw/master/data/churn.csv')



In [7]:

    
data.head()









    Out[7]:







  
    
      
	State	Account Length	Area Code	Phone	Int'l Plan	VMail Plan	VMail Message	Day Mins	Day Calls	Day Charge	...	Eve Calls	Eve Charge	Night Mins	Night Calls	Night Charge	Intl Mins	Intl Calls	Intl Charge	CustServ Calls	Churn?
0	KS	128	415	382-4657	no	yes	25	265.1	110	45.07	...	99	16.78	244.7	91	11.01	10.0	3	2.70	1	False.
1	OH	107	415	371-7191	no	yes	26	161.6	123	27.47	...	103	16.62	254.4	103	11.45	13.7	3	3.70	1	False.
2	NJ	137	415	358-1921	no	no	0	243.4	114	41.38	...	110	10.30	162.6	104	7.32	12.2	5	3.29	0	False.
3	OH	84	408	375-9999	yes	no	0	299.4	71	50.90	...	88	5.26	196.9	89	8.86	6.6	7	1.78	2	False.
4	OK	75	415	330-6626	yes	no	0	166.7	113	28.34	...	122	12.61	186.9	121	8.41	10.1	3	2.73	3	False.
    
  

5 rows × 21 columns

Exercise 4.5

Create Y and X

What is the distribution of the churners?

Split the data into train (70%) and test (30%) sets.
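
The three steps could look roughly like the sketch below. Because the preview rows all carry the label `False.`, the example runs on a small hypothetical toy frame that borrows a few of the real column names; its values are made up for illustration and are not the telecom data. The helper `make_xy` is also hypothetical, and dropping `State` and `Phone` is a simplifying choice. In the notebook, `make_xy(data)` would be called on the downloaded frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def make_xy(df):
    """Split a churn frame into features X and a 0/1 target y (1 = churned)."""
    # The preview only shows 'False.', so every other label is treated as churn.
    y = (df['Churn?'] != 'False.').astype(int)
    X = df.drop(columns=['Churn?', 'State', 'Phone'], errors='ignore').copy()
    for col in ("Int'l Plan", 'VMail Plan'):
        if col in X.columns:
            X[col] = (X[col] == 'yes').astype(int)  # yes/no -> 1/0
    return X, y

# Hypothetical toy frame (made-up values), used only so the sketch runs offline.
i = np.arange(200)
toy = pd.DataFrame({
    "Int'l Plan": np.where(i % 7 == 0, 'yes', 'no'),
    'Day Mins': 100.0 + i,
    'CustServ Calls': i % 6,
})
toy['Churn?'] = np.where(
    (toy['CustServ Calls'] >= 4) | (toy['Day Mins'] > 270), 'True.', 'False.')

X, y = make_xy(toy)

# Distribution of churners (share of each class)
print(y.value_counts(normalize=True))

# 70/30 split; stratify keeps the churn rate similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```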



In [ ]:

Exercise 4.6

Train a Logistic Regression using the training set and apply the algorithm to the testing set.



In [ ]:

Exercise 4.7

a) Create a confusion matrix using the predictions on the 30% set.

b) Estimate the accuracy of the model on the 30% set.
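
A hedged end-to-end sketch of training, the confusion matrix and accuracy follows. It runs on hypothetical toy data with made-up values (two columns named after real ones), not the telecom dataset. Standardizing the features before `LogisticRegression` is an added choice to help the solver converge, not something the exercise requires.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (made-up values), used only so the sketch runs offline.
i = np.arange(200)
X = pd.DataFrame({'Day Mins': 100.0 + i, 'CustServ Calls': i % 6})
y = ((X['CustServ Calls'] >= 4) | (X['Day Mins'] > 270)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Train on the 70% set, predict on the 30% set
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)

print(cm)
print(f"accuracy: {acc:.3f}")
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test rows, so the two outputs can be checked against each other.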



In [ ]:
