Student Name: Shaival Dalal
Student Netid: sd3462
Problem: Consumer shopping habits are incredibly difficult to change once formed, no matter how ingenious an ad campaign is. Only a handful of events in a consumer's life cause a change in habits, and it is difficult to gain a consumer's attention during these events due to competition from other companies.
Goal: To identify and target pregnant consumers or expectant parents before other marketers do.
Data: Data comes from two major sources: internal and external. Internal data about consumers contains unique guest identifiers, historical product purchases, age, marital status, timing of purchase, coupon delivery preference, preferred in-store brands, baby shower registry, etc. External data contains data points generated by consumers outside Target, such as preferred brands of coffee and preferred brands of everyday items (paper towels, applesauce, cereal).
Impact: The "Pregnancy Prediction" score model developed by Andrew Pole between 2002 and 2010 enabled Target to radically increase its sales in the Mom and Baby segment. Target's revenues grew from $44 billion to $67 billion between 2002 and 2010.
For this part we will be using the data file located in "data/advertising_events.csv". This file consists of records pertaining to online advertising events on a given day. There are 4 comma-separated columns in this order: userid, timestamp, domain, and action. These fields are of type int (continuous), int (continuous), string, and int (category), respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes with pipes)! Some questions have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. "!command arguments".
[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later. You can also use ssh to access an actual bash shell on EC2 and then paste your answers here. Recall that once you enter the "!", filename completion should work.]
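Before answering, it can help to peek at the raw file. A minimal added check (the "Datasets/" path is assumed, matching the answers below):
!head -5 Datasets/advertising_events.csv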
1. How many records (lines) are in this file?
In [1]:
!wc -l < Datasets/advertising_events.csv
2. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')
In [2]:
!cut -d ',' -f 1 Datasets/advertising_events.csv | sort -u | wc -l
3. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).
In [3]:
!cut -d ',' -f 3 Datasets/advertising_events.csv | sort | uniq -c | sort -nr
4. List all records for the user with user id 37. (hint: this can be done using 'grep')
In [4]:
!awk -F ',' '$1 == 37' Datasets/advertising_events.csv
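The grep solution suggested by the hint is equivalent, anchoring on the userid at the start of each line:
!grep '^37,' Datasets/advertising_events.csv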
In [5]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np
1. Load the data set "data/ads_dataset.tsv" into a Python Pandas data frame called ads.
In [6]:
ads = pd.read_table("Datasets/ads_dataset.tsv", sep='\t')
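A quick added sanity check that the file parsed as expected (not part of the required answer):
ads.head()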
2. Write a Python function called getDfSummary() that does the following:
- number_nan to count the number of missing not-a-number values
- number_distinct to count the number of distinct values a variable can take on
- mean, max, min, std (standard deviation), and 25%, 50%, 75% to correspond to the appropriate percentiles

Hint: The pandas describe() (manual page) method returns a useful series of values that can be used here.
In [7]:
def getDfSummary(input_data):
    # Build one summary row per column of the input frame.
    lstmain = []
    for feature in input_data:
        number_nan = np.count_nonzero(input_data[feature].isnull())   # count of missing (NaN) values
        number_distinct = input_data[feature].nunique()               # count of distinct values
        describe = input_data[feature].describe()[['mean', 'max', 'min', 'std', '25%', '50%', '75%']].values.tolist()
        lstmain.append([number_nan, number_distinct] + describe)
    output_data = pd.DataFrame(lstmain)
    output_data.columns = ['Number_NaN', 'Number_Distinct', 'Mean', 'Max', 'Min', 'Std', '25%', '50%', '75%']
    output_data.index = input_data.columns   # index rows by the original column names
    return output_data
3. How long does it take for your getDfSummary() function to work on your ads data frame? Show us the results below.
Hint: %timeit getDfSummary(ads)
In [8]:
%timeit getDfSummary(ads)
4. Using the results returned from getDfSummary(), which fields, if any, contain missing NaN values?
In [9]:
result = getDfSummary(ads)
result[result.Number_NaN > 0]   # fields with at least one missing value
Out[9]:
5. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? What would be an appropriate method for filling in missing values?
Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using getDfSummary() and compare the differences. Do some feature distributions change dramatically?
In [10]:
# Identify the columns containing NaN values from the getDfSummary() results.
result[result.Number_NaN > 0]
# Only one column, 'buy_freq', has NaN values.
# Drop the rows with NaN values and re-run getDfSummary() to see how the
# feature distributions change when 'buy_freq' is present.
new = ads.dropna()
getDfSummary(new)
# In the summary of the non-missing rows, 'isbuyer' has only one distinct value, 1,
# so whenever 'buy_freq' is not null, 'isbuyer' is always 1.
# We can check how strongly these columns correlate using the corr() function.
new['isbuyer'].corr(new['buy_freq'])
# The command above returns NaN. Since all NaN values have already been removed,
# the cause lies elsewhere: over the non-missing rows, 'isbuyer' has zero standard
# deviation, so the denominator of the correlation formula is 0 and the result is NaN.
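# Added verification (not part of the original analysis): the standard deviation
# of 'isbuyer' over these rows is indeed 0, confirming the explanation above.
new['isbuyer'].std()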
# We can check correlations between all column pairs as well:
new.corr()
# Apart from 'isbuyer', 'buy_freq' also correlates strongly with 'multiple_buy'.
new[['buy_freq', 'multiple_buy']]
# This shows that 'multiple_buy' is 1 whenever 'buy_freq' > 1.
# We can therefore safely assume that when 'multiple_buy' is 0 and 'isbuyer' is 0,
# 'buy_freq' should be 0: the values are not missing at random, they are missing
# exactly for non-buyers, so filling them with 0 is appropriate.
ads.loc[(ads['multiple_buy'] < 1) & (ads['isbuyer'] == 0), ['buy_freq']] = 0
ads.isnull().sum()
# All NaN values have been eliminated using the observed correlations.
Out[10]:
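Following the hint more directly, one could also summarize just the records with a missing value and compare against the full-data summary. A minimal sketch (re-reading the file, since the imputation above already filled the NaNs in ads):
missing = pd.read_table("Datasets/ads_dataset.tsv", sep='\t')
missing = missing[missing['buy_freq'].isnull()]   # keep only the records with a missing value
getDfSummary(missing)   # compare distributions, e.g. whether 'isbuyer' collapses to a single value here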
6. Which variables are binary?
In [11]:
result = getDfSummary(ads)
result.loc[result.Number_Distinct == 2]
# Number_Distinct serves as an indicator of binary variables: the four variables
# listed in the output below each take only two distinct values, with Min = 0 and Max = 1.
Out[11]: