acctno – account number
actdt – account active date
deactdt – account deactivate date
deactreason – reason for deactivation
goodcredit – credit is good or not
rateplan – rate plan
dealertype – dealer type
Clean the provided data (duplicates / missing values) and obtain the earliest and latest active/deactivate dates available.
Calculate tenure days for each account and obtain the statistics.
Segment the account base by account status and tenure group; then test the general association between different factors and the tenure segments and account status.
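The cleaning and tenure steps above can be sketched on a tiny toy frame (hypothetical values, not the wireless dataset), assuming the same approach used below: drop exact duplicates, parse the dates, and fill a missing `deactdt` with the last observed date so still-active accounts get a well-defined tenure.

```python
import pandas as pd

# toy frame with one duplicate row and two still-active accounts (missing deactdt)
toy = pd.DataFrame({
    'acctno':  [1, 1, 2, 3],
    'actdt':   ['1999-01-10', '1999-01-10', '1999-03-01', '2000-06-15'],
    'deactdt': ['1999-02-09', '1999-02-09', None, None],
})
toy = toy.drop_duplicates()                       # 4 rows -> 3 rows
toy['actdt'] = pd.to_datetime(toy['actdt'])
toy['deactdt'] = pd.to_datetime(toy['deactdt'])
last_day = pd.to_datetime('2001-01-21')           # observation cutoff
toy['deactdt'] = toy['deactdt'].fillna(last_day)  # still-active accounts get the cutoff
toy['Tenure Days'] = toy['deactdt'] - toy['actdt']
print(toy['Tenure Days'].dt.days.tolist())        # [30, 692, 220]
```

With this convention, tenure for a still-active account is simply "days observed so far", which is what the statistics below describe.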
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('wireless_data.csv')
df.head(3)
Out[2]:
In [3]:
# check duplicate rows
total_users = df.shape[0]
total_uni_users = df.drop_duplicates().shape[0]
print('The original data has {} users.'.format(total_users))
print('After dropping duplicates, we found {} duplicate rows in total.'.format(total_users - total_uni_users))
In [4]:
# check missing values
df.isnull().sum()
Out[4]:
In [5]:
# count active vs. deactivated accounts via the missing deactdt values
n_active = df['deactdt'].isnull().sum()
print("{} users have deactivated the service.".format(df.shape[0] - n_active))
print("{} users have not deactivated the service yet.".format(n_active))
# convert the date columns to datetime
df['actdt'] = pd.to_datetime(df['actdt'])
df['deactdt'] = pd.to_datetime(df['deactdt'])
act_early = df['actdt'].min().strftime('%Y-%m-%d')  # strftime formats the Timestamp as a string
act_late = df['actdt'].max().strftime('%Y-%m-%d')
deact_early = df['deactdt'].min().strftime('%Y-%m-%d')
deact_late = df['deactdt'].max().strftime('%Y-%m-%d')
print("The earliest activation date is {} and the latest activation date is {}.".format(act_early, act_late))
print("The earliest deactivation date is {} and the latest deactivation date is {}.".format(deact_early, deact_late))
In [6]:
# Deal with missing values: use the last observed date as a placeholder deactivation date for still-active accounts
last_day = pd.to_datetime('2001-01-21')
df['deactdt'].fillna(last_day, inplace = True)
df['deactreason'].fillna('Unknown', inplace = True)
df.head()
Out[6]:
In [7]:
df['Tenure Days'] = df['deactdt'] - df['actdt']
# divide into groups: still-active vs. deactivated accounts
act_gp = df[df['deactdt'] == last_day]   # last_day marks still-active accounts
deact_gp = df[df['deactdt'] != last_day]
# get statistics
df_ss = df['Tenure Days'].describe().rename('All Users')
act_ss = act_gp['Tenure Days'].describe().rename('Still Alive Users')
deact_ss = deact_gp['Tenure Days'].describe().rename('Deactivated Users')
# show the results
pd.concat([df_ss, act_ss, deact_ss], axis = 1)
Out[7]:
In [8]:
# create helper functions to format timestamps and group users by month
def datetime_switch(timestamp):
    '''For a given pd.Timestamp, return a string formatted as year-month.'''
    return timestamp.strftime('%Y-%m')

def user_group(dataframe, by):
    '''Return a dataframe of account counts grouped by month of the specified column.
    dataframe: data set, pd.DataFrame
    by: column name, str
    output: pd.DataFrame'''
    tmp_df = dataframe[['acctno', by]].copy()
    tmp_df['Date-Month'] = dataframe[by].apply(datetime_switch)
    return tmp_df.groupby('Date-Month').count()[['acctno']]
In [9]:
deact_gb = user_group(deact_gp, 'deactdt')
act_gb = user_group(act_gp, 'actdt')
# merge the data
date_count = pd.concat([deact_gb, act_gb], axis = 1)
date_count.columns = ['Deactivated Users','New Users']
date_count.head(3)
Out[9]:
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
In [11]:
date_count.plot(color = ['r','b'], figsize = (10,5), title = 'Fig.1 User numbers from 1999 to 2001')
Out[11]:
In [12]:
deact_gb.plot(kind = 'bar', figsize = (10,5), title = 'Fig.2 Deactivated User numbers by month')
Out[12]:
Segment the account base by account status – ‘deactivated’ and ‘still alive’ – and by tenure: less than 30 days, between 31 and 60 days, between 61 days and one year, and over a year.
1. Test the general association between the tenure segments and ‘goodcredit’, ‘rateplan’, and ‘dealertype’.
2. Test the general association between the account status and ‘goodcredit’, ‘rateplan’, and ‘dealertype’.
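The association tests that follow can be sketched compactly on toy data. This is a minimal illustration, not the notebook's pipeline: it uses `pd.crosstab` (a built-in alternative to the hand-rolled contingency-table helpers below) together with `scipy.stats.chi2_contingency`, and the column values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats as ss

# hypothetical labels standing in for 'User Status' crossed with 'goodcredit'
toy = pd.DataFrame({
    'status': ['Still alive'] * 6 + ['Deactivated'] * 6,
    'goodcredit': [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1],
})
table = pd.crosstab(toy['status'], toy['goodcredit'])   # 2x2 contingency table
chi2, p, dof, expected = ss.chi2_contingency(table)     # chi-square test of independence
n = table.values.sum()
cramer_v = np.sqrt(chi2 / (n * (min(table.shape) - 1))) # Cramer's V effect size
print(round(p, 3), round(cramer_v, 3))
```

A small p value would suggest the two factors are associated; Cramér's V (between 0 and 1) indicates how strong that association is, which is exactly what the helper functions below compute for the real data.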
In [13]:
# Get user status (a lambda is more concise here than defining a separate if/else function)
df['User Status'] = df['deactdt'].apply(lambda x : 'Still alive' if x == last_day else 'Deactivated')
# the cut function excludes the left edge of each bin (the right edge is included by default)
tenure_bins = pd.to_timedelta(['-1 days', '30 days', '60 days', '365 days', '732 days'])
tenure_labels = ['0 to 30 days','31 to 60 days', '61 days to 1 year','over 1 year']
df['Tenure Group'] = pd.cut(df['Tenure Days'], bins = tenure_bins, labels = tenure_labels)
df.head(3)
Out[13]:
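The bin edges above rely on `pd.cut`'s default right-closed intervals. A quick check on a few illustrative tenure values (same bins and labels as the cell above) confirms where the boundary cases land:

```python
import pandas as pd

# right=True (default): bins are left-open, right-closed, so a 30-day
# tenure falls in '0 to 30 days' and a 31-day tenure in '31 to 60 days'
bins = pd.to_timedelta(['-1 days', '30 days', '60 days', '365 days', '732 days'])
labels = ['0 to 30 days', '31 to 60 days', '61 days to 1 year', 'over 1 year']
sample = pd.to_timedelta(['30 days', '31 days', '365 days', '366 days']).to_series()
cats = pd.cut(sample, bins = bins, labels = labels)
print(cats.tolist())
# ['0 to 30 days', '31 to 60 days', '61 days to 1 year', 'over 1 year']
```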
In [14]:
# define helper functions for the groupby breakdowns
def percent_of_all(series):
    '''Return percent of total for each element in a series.
    input: a numeric series
    output: series of strings showing percentages'''
    # calculate the percentage
    tmp = series / series.sum()
    # format the answer
    ans = tmp.apply(lambda x : '{:.2%}'.format(x))
    return ans

def df_group_per(dataframe, by):
    '''For a given dataframe, return a new dataframe grouped by the "by" column(s), showing account counts and percentages.
    input: dataframe: clean pandas DataFrame
           by: string (or list of strings) of column name(s) to group by
    output: a new dataframe with acctno and acctno_per columns only'''
    dataframe_gb = dataframe.groupby(by).count()[['acctno']]
    dataframe_gb['acctno_per'] = percent_of_all(dataframe_gb['acctno'])
    return dataframe_gb
In [15]:
user_status_pie = df_group_per(df, 'User Status')
user_status_pie
Out[15]:
In [16]:
user_status_pie[['acctno']].plot(kind = 'pie', explode = (0.03, 0.03), autopct = '%.2f%%', figsize = (6,6), subplots = True,
legend = False, title = 'Fig.3 User Status in Percent')
Out[16]:
In [17]:
tenure_group_pie =df_group_per(df, 'Tenure Group')
tenure_group_pie
Out[17]:
In [18]:
tenure_group_pie['acctno'].plot(kind = 'pie', explode = (0.02, 0.02, 0.01, 0.01), autopct = '%.2f%%', figsize = (6,6),
                                subplots = True, legend = False, title = 'Fig.4 Tenure Group in Percent')
Out[18]:
In [19]:
# build helper functions to transform data into a contingency table
def get_unique_labels(oneDarray):
    '''Separate the unique row and column labels from a given 1-D array of pairs and return them in a dict.
    for example:
    input: array([('a',1), ('a',2), ('a',3), ('b',1), ('b',2), ('b',3)])
    output: {'row':['a','b'], 'column':[1,2,3]}'''
    temp_dict = {'row': [], 'column': []}
    for item in oneDarray:
        if item[0] not in temp_dict['row']:
            temp_dict['row'].append(item[0])
        if item[1] not in temp_dict['column']:
            temp_dict['column'].append(item[1])
    return temp_dict

def to_contigency_table(dataframe, row, column):
    '''For a given dataframe, return a new dataframe with new_index = the row groupby values and new_columns = the column groupby values.
    The values in the table are counts of account numbers.
    Note: the reshape below assumes every (row, column) combination occurs at least once in the data.
    row: row name in string (one column of the dataframe)
    column: column name in string (one column of the dataframe)'''
    # get a new dataframe using groupby([row, column])
    grouped_df = dataframe.groupby([row, column]).count()[['acctno']]
    # get the new row and column values from the dataframe
    labels_dict = get_unique_labels(grouped_df.index.values)
    # get the np.reshape size tuple, e.g. (4, 2), based on the row and column counts
    row_column_tuple = (len(labels_dict['row']), len(labels_dict['column']))
    # reshape the np array to be ready to build the new dataframe
    array_data = np.reshape(grouped_df.values, row_column_tuple)
    # build the target df
    target_df = pd.DataFrame(data = array_data, index = labels_dict['row'], columns = labels_dict['column'])
    return target_df

# create a helper function to get merged output columns
def get_merged_df(df, row, column_list):
    '''Return a merged dataframe of the contingency tables for each (row, column) pair.'''
    # collect the individual contingency tables
    unmerged_list = []
    for column in column_list:
        unmerged_list.append(to_contigency_table(df, row, column))
    # merge everything
    merged_df = pd.concat(unmerged_list, axis = 1)
    return merged_df
In [20]:
Tenure_Group_corr = get_merged_df(df, 'Tenure Group', ['dealertype', 'rateplan', 'goodcredit', 'User Status'])
Tenure_Group_corr
Out[20]:
In [21]:
User_Status_corr = get_merged_df(df,'User Status', ['dealertype', 'rateplan', 'goodcredit'])
User_Status_corr
Out[21]:
In [22]:
# import stats to do Chi square test
from scipy import stats as ss
In [23]:
def get_CramerV_chi2p(contigency_dataframe):
    '''Return a tuple (Cramer's V value, chi-square p value) for a given contingency-table dataframe.'''
    # get the array values
    df_array = contigency_dataframe.values
    # get the chi-square statistic and p value
    chi2_p = ss.chi2_contingency(df_array)[:2]
    # calculate Cramer's V: sqrt(chi2 / (n * (min(rows, columns) - 1)))
    sample_total = df_array.sum()
    shape_sub1 = min(df_array.shape) - 1
    cramer_v = np.sqrt(chi2_p[0] / (sample_total * shape_sub1))
    return (cramer_v, chi2_p[1])

# print out the V value and p value results
def printout_vp(dataframe, main_factor, other_factor):
    '''For a given dataframe, main factor, and other factor(s), print the Cramer's V value and p value of the chi-square test
    for each (main factor, other factor) pair.
    dataframe: the dataframe holding all the data
    main_factor: string, the main factor to use, e.g. User Status or Tenure Group
    other_factor: string or a list of strings, the other factor(s) to pair with the main factor
    no output'''
    if type(other_factor) == str:
        chi2_str = to_contigency_table(dataframe, main_factor, other_factor)
        chi2_values = get_CramerV_chi2p(chi2_str)
        print('Between {} and {} factors:\n The P value of the Chi Square test is: {};\n\
 The Cramer V value is: {}.\n-----------'.format(main_factor, other_factor, chi2_values[1], chi2_values[0]))
    elif type(other_factor) == list:
        for factor in other_factor:
            chi2_list = to_contigency_table(dataframe, main_factor, factor)
            chi2_values = get_CramerV_chi2p(chi2_list)
            print('Between {} and {} factors:\n The P value of the Chi Square test is: {};\n\
 The Cramer V value is: {}.\n-----------\n'.format(main_factor, factor, chi2_values[1], chi2_values[0]))
    else:
        raise ValueError('Please enter the correct factor name(s)')
In [24]:
printout_vp(df,'Tenure Group', ['goodcredit', 'rateplan','dealertype','User Status'])
In [25]:
printout_vp(df,'User Status', ['goodcredit','rateplan','dealertype'])