acctno – account number
actdt – account active date
deactdt – account deactivate date
deactreason – reason for deactivation
goodcredit – credit is good or not
rateplan – rate plan
dealertype – dealer type
Clean the provided data (duplicates / missing values) and obtain the earliest and latest active/deactivate dates available.
Calculate tenure days for each account and obtain the statistics.
Segment the account base by account status and tenure group; then test the general association between different factors and the tenure segments and account status.
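The cleaning and tenure steps above can be sketched on a tiny toy frame (hypothetical values, not the wireless dataset), assuming the same approach used below: drop exact duplicates, parse the dates, and fill a missing `deactdt` with the last observed date so still-active accounts get a well-defined tenure.

```python
import pandas as pd

# toy frame with one duplicate row and two still-active accounts (missing deactdt)
toy = pd.DataFrame({
    'acctno':  [1, 1, 2, 3],
    'actdt':   ['1999-01-10', '1999-01-10', '1999-03-01', '2000-06-15'],
    'deactdt': ['1999-02-09', '1999-02-09', None, None],
})
toy = toy.drop_duplicates()                       # 4 rows -> 3 rows
toy['actdt'] = pd.to_datetime(toy['actdt'])
toy['deactdt'] = pd.to_datetime(toy['deactdt'])
last_day = pd.to_datetime('2001-01-21')           # observation cutoff
toy['deactdt'] = toy['deactdt'].fillna(last_day)  # still-active accounts get the cutoff
toy['Tenure Days'] = toy['deactdt'] - toy['actdt']
print(toy['Tenure Days'].dt.days.tolist())        # [30, 692, 220]
```

With this convention, tenure for a still-active account is simply "days observed so far", which is what the statistics below describe.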
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.read_csv('wireless_data.csv')
df.head(3)
Out[2]:
In [3]:
# check duplicate rows
total_users = df.shape[0]
total_uni_users = df.drop_duplicates().shape[0]
print('The original data has {} users.'.format(total_users))
print('After dropping duplicates, we found {} duplicate rows in total.'.format(total_users - total_uni_users))
In [4]:
# check missing values
df.isnull().sum()
Out[4]:
In [5]:
# count active vs. deactivated accounts via the missing deactdt values
n_active = df['deactdt'].isnull().sum()
print("{} users have deactivated the service.".format(df.shape[0] - n_active))
print("{} users have not deactivated the service yet.".format(n_active))
# convert the date columns to datetime
df['actdt'] = pd.to_datetime(df['actdt'])
df['deactdt'] = pd.to_datetime(df['deactdt'])
act_early = df['actdt'].min().strftime('%Y-%m-%d')  # strftime formats the Timestamp as a string
act_late = df['actdt'].max().strftime('%Y-%m-%d')
deact_early = df['deactdt'].min().strftime('%Y-%m-%d')
deact_late = df['deactdt'].max().strftime('%Y-%m-%d')
print("The earliest activation date is {} and the latest activation date is {}.".format(act_early, act_late))
print("The earliest deactivation date is {} and the latest deactivation date is {}.".format(deact_early, deact_late))
In [6]:
# Deal with missing values: use the last observed date as a placeholder deactivation date for still-active accounts
last_day = pd.to_datetime('2001-01-21')
df['deactdt'].fillna(last_day, inplace = True)
df['deactreason'].fillna('Unknown', inplace = True)
df.head()
Out[6]:
In [7]:
df['Tenure Days'] = df['deactdt'] - df['actdt']
# divide into groups: still-active vs. deactivated accounts
act_gp = df[df['deactdt'] == last_day]   # last_day marks still-active accounts
deact_gp = df[df['deactdt'] != last_day]
# get statistics
df_ss = df['Tenure Days'].describe().rename('All Users')
act_ss = act_gp['Tenure Days'].describe().rename('Still Alive Users')
deact_ss = deact_gp['Tenure Days'].describe().rename('Deactivated Users')
# show the results
pd.concat([df_ss, act_ss, deact_ss], axis = 1)
Out[7]:
In [8]:
# create helper functions to format timestamps and group users by month
def datetime_switch(timestamp):
    '''For a given pd.Timestamp, return a string formatted as year-month.'''
    return timestamp.strftime('%Y-%m')

def user_group(dataframe, by):
    '''Return a dataframe of account counts grouped by month of the specified column.
    dataframe: data set, pd.DataFrame
    by: column name, str
    output: pd.DataFrame'''
    tmp_df = dataframe[['acctno', by]].copy()
    tmp_df['Date-Month'] = dataframe[by].apply(datetime_switch)
    return tmp_df.groupby('Date-Month').count()[['acctno']]
In [9]:
deact_gb = user_group(deact_gp, 'deactdt')
act_gb = user_group(act_gp, 'actdt')
# merge the data
date_count = pd.concat([deact_gb, act_gb], axis = 1)
date_count.columns = ['Deactivated Users','New Users']
date_count.head(3)
Out[9]:
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
In [11]:
date_count.plot(color = ['r','b'], figsize = (10,5), title = 'Fig.1 User numbers from 1999 to 2001')
Out[11]:
In [12]:
deact_gb.plot(kind = 'bar', figsize = (10,5), title = 'Fig.2 Deactivated User numbers by month')
Out[12]:
Segment the account base by account status – ‘deactivated’ and ‘still alive’ – and by tenure: less than 30 days, between 31 and 60 days, between 61 days and one year, and over a year.
1. Test the general association between the tenure segments and ‘goodcredit’, ‘rateplan’, and ‘dealertype’.
2. Test the general association between the account status and ‘goodcredit’, ‘rateplan’, and ‘dealertype’.
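The association tests that follow can be sketched compactly on toy data. This is a minimal illustration, not the notebook's pipeline: it uses `pd.crosstab` (a built-in alternative to the hand-rolled contingency-table helpers below) together with `scipy.stats.chi2_contingency`, and the column values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats as ss

# hypothetical labels standing in for 'User Status' crossed with 'goodcredit'
toy = pd.DataFrame({
    'status': ['Still alive'] * 6 + ['Deactivated'] * 6,
    'goodcredit': [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1],
})
table = pd.crosstab(toy['status'], toy['goodcredit'])   # 2x2 contingency table
chi2, p, dof, expected = ss.chi2_contingency(table)     # chi-square test of independence
n = table.values.sum()
cramer_v = np.sqrt(chi2 / (n * (min(table.shape) - 1))) # Cramer's V effect size
print(round(p, 3), round(cramer_v, 3))
```

A small p value would suggest the two factors are associated; Cramér's V (between 0 and 1) indicates how strong that association is, which is exactly what the helper functions below compute for the real data.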
In [13]:
# Get user status (a lambda is more concise here than defining a separate if/else function)
df['User Status'] = df['deactdt'].apply(lambda x : 'Still alive' if x == last_day else 'Deactivated')
# the cut function excludes the left edge of each bin (the right edge is included by default)
tenure_bins = pd.to_timedelta(['-1 days', '30 days', '60 days', '365 days', '732 days'])
tenure_labels = ['0 to 30 days','31 to 60 days', '61 days to 1 year','over 1 year']
df['Tenure Group'] = pd.cut(df['Tenure Days'], bins = tenure_bins, labels = tenure_labels)
df.head(3)
Out[13]:
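The bin edges above rely on `pd.cut`'s default right-closed intervals. A quick check on a few illustrative tenure values (same bins and labels as the cell above) confirms where the boundary cases land:

```python
import pandas as pd

# right=True (default): bins are left-open, right-closed, so a 30-day
# tenure falls in '0 to 30 days' and a 31-day tenure in '31 to 60 days'
bins = pd.to_timedelta(['-1 days', '30 days', '60 days', '365 days', '732 days'])
labels = ['0 to 30 days', '31 to 60 days', '61 days to 1 year', 'over 1 year']
sample = pd.to_timedelta(['30 days', '31 days', '365 days', '366 days']).to_series()
cats = pd.cut(sample, bins = bins, labels = labels)
print(cats.tolist())
# ['0 to 30 days', '31 to 60 days', '61 days to 1 year', 'over 1 year']
```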
In [14]:
# define helper functions for the groupby breakdowns
def percent_of_all(series):
    '''Return percent of total for each element in a series.
    input: a numeric series
    output: series of strings showing percentages'''
    # calculate the percentage
    tmp = series / series.sum()
    # format the answer
    ans = tmp.apply(lambda x : '{:.2%}'.format(x))
    return ans

def df_group_per(dataframe, by):
    '''For a given dataframe, return a new dataframe grouped by the "by" column(s), showing account counts and percentages.
    input: dataframe: clean pandas DataFrame
           by: string (or list of strings) of column name(s) to group by
    output: a new dataframe with acctno and acctno_per columns only'''
    dataframe_gb = dataframe.groupby(by).count()[['acctno']]
    dataframe_gb['acctno_per'] = percent_of_all(dataframe_gb['acctno'])
    return dataframe_gb
In [15]:
user_status_pie = df_group_per(df, 'User Status')
user_status_pie
Out[15]:
In [16]:
user_status_pie[['acctno']].plot(kind = 'pie', explode = (0.03, 0.03), autopct = '%.2f%%', figsize = (6,6), subplots = True,
legend = False, title = 'Fig.3 User Status in Percent')
Out[16]:
In [17]:
tenure_group_pie =df_group_per(df, 'Tenure Group')
tenure_group_pie
Out[17]:
In [18]:
tenure_group_pie['acctno'].plot(kind = 'pie', explode = (0.02, 0.02, 0.01, 0.01), autopct = '%.2f%%', figsize = (6,6),
                                subplots = True, legend = False, title = 'Fig.4 Tenure Group in Percent')
Out[18]:
In [19]:
# build helper functions to transform data into a contingency table
def get_unique_labels(oneDarray):
    '''Separate the unique row and column labels from a given 1-D array of pairs and return them in a dict.
    for example:
    input: array([('a',1), ('a',2), ('a',3), ('b',1), ('b',2), ('b',3)])
    output: {'row':['a','b'], 'column':[1,2,3]}'''
    temp_dict = {'row': [], 'column': []}
    for item in oneDarray:
        if item[0] not in temp_dict['row']:
            temp_dict['row'].append(item[0])
        if item[1] not in temp_dict['column']:
            temp_dict['column'].append(item[1])
    return temp_dict

def to_contigency_table(dataframe, row, column):
    '''For a given dataframe, return a new dataframe with new_index = the row groupby values and new_columns = the column groupby values.
    The values in the table are counts of account numbers.
    Note: the reshape below assumes every (row, column) combination occurs at least once in the data.
    row: row name in string (one column of the dataframe)
    column: column name in string (one column of the dataframe)'''
    # get a new dataframe using groupby([row, column])
    grouped_df = dataframe.groupby([row, column]).count()[['acctno']]
    # get the new row and column values from the dataframe
    labels_dict = get_unique_labels(grouped_df.index.values)
    # get the np.reshape size tuple, e.g. (4, 2), based on the row and column counts
    row_column_tuple = (len(labels_dict['row']), len(labels_dict['column']))
    # reshape the np array to be ready to build the new dataframe
    array_data = np.reshape(grouped_df.values, row_column_tuple)
    # build the target df
    target_df = pd.DataFrame(data = array_data, index = labels_dict['row'], columns = labels_dict['column'])
    return target_df

# create a helper function to get merged output columns
def get_merged_df(df, row, column_list):
    '''Return a merged dataframe of the contingency tables for each (row, column) pair.'''
    # collect the individual contingency tables
    unmerged_list = []
    for column in column_list:
        unmerged_list.append(to_contigency_table(df, row, column))
    # merge everything
    merged_df = pd.concat(unmerged_list, axis = 1)
    return merged_df
In [20]:
Tenure_Group_corr = get_merged_df(df, 'Tenure Group', ['dealertype', 'rateplan', 'goodcredit', 'User Status'])
Tenure_Group_corr
Out[20]:
In [21]:
User_Status_corr = get_merged_df(df,'User Status', ['dealertype', 'rateplan', 'goodcredit'])
User_Status_corr
Out[21]:
In [22]:
# import stats to do Chi square test
from scipy import stats as ss
In [23]:
def get_CramerV_chi2p(contigency_dataframe):
    '''Return a tuple (Cramer's V value, chi-square p value) for a given contingency-table dataframe.'''
    # get the array values
    df_array = contigency_dataframe.values
    # get the chi-square statistic and p value
    chi2_p = ss.chi2_contingency(df_array)[:2]
    # calculate Cramer's V: sqrt(chi2 / (n * (min(rows, columns) - 1)))
    sample_total = df_array.sum()
    shape_sub1 = min(df_array.shape) - 1
    cramer_v = np.sqrt(chi2_p[0] / (sample_total * shape_sub1))
    return (cramer_v, chi2_p[1])

# print out the V value and p value results
def printout_vp(dataframe, main_factor, other_factor):
    '''For a given dataframe, main factor, and other factor(s), print the Cramer's V value and p value of the chi-square test
    for each (main factor, other factor) pair.
    dataframe: the dataframe holding all the data
    main_factor: string, the main factor to use, e.g. User Status or Tenure Group
    other_factor: string or a list of strings, the other factor(s) to pair with the main factor
    no output'''
    if type(other_factor) == str:
        chi2_str = to_contigency_table(dataframe, main_factor, other_factor)
        chi2_values = get_CramerV_chi2p(chi2_str)
        print('Between {} and {} factors:\n The P value of the Chi Square test is: {};\n\
 The Cramer V value is: {}.\n-----------'.format(main_factor, other_factor, chi2_values[1], chi2_values[0]))
    elif type(other_factor) == list:
        for factor in other_factor:
            chi2_list = to_contigency_table(dataframe, main_factor, factor)
            chi2_values = get_CramerV_chi2p(chi2_list)
            print('Between {} and {} factors:\n The P value of the Chi Square test is: {};\n\
 The Cramer V value is: {}.\n-----------\n'.format(main_factor, factor, chi2_values[1], chi2_values[0]))
    else:
        raise ValueError('Please enter the correct factor name(s)')
In [24]:
printout_vp(df,'Tenure Group', ['goodcredit', 'rateplan','dealertype','User Status'])
In [25]:
printout_vp(df,'User Status', ['goodcredit','rateplan','dealertype'])