Foundations of Data Science

Homework 2

Student Name: Shaival Dalal

Student Netid: sd3462


Part 1: Case study

  • Read this article in the New York Times.
  • Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did. Formulate your solution as a proposed plan using our data science terminology. Include aspects of the Data Science Workflow that you see as relevant to solving the problem. Be precise but concise.

Problem: Consumer shopping habits, once formed, are incredibly difficult to change, no matter how ingenious the ad campaigns are. Only a handful of events in a consumer's life cause a change in habits, and it is difficult to win a consumer's attention during these events because of competition from other companies.

Goal: To identify and target pregnant consumers or expecting parents before other marketers do.

Data: Data comes from two major sources: internal and external. Internal data about consumers includes a unique guest identifier, historical product purchases, age, marital status, timing of purchases, coupon-delivery preferences, preferred in-store brands, baby-shower registry information, etc. External data consists of data points generated by consumers outside Target, such as preferred brands of coffee and of everyday items (paper towels, applesauce, cereal).

Impact: The "pregnancy prediction" score model that Andrew Pole developed enabled Target to radically increase sales in its Mom and Baby segment, and Target's revenues grew from $44 billion in 2002 to $67 billion in 2010.
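This can be set up as a supervised classification task: the target variable is whether a guest is pregnant (training labels can be obtained from the baby-shower registry), the features are the purchase-history and demographic attributes listed above, and the model's predicted probability becomes each guest's pregnancy-prediction score for the marketing team to act on. Below is a minimal sketch of that formulation; the file name, column names, and logistic regression model are all hypothetical stand-ins, since the article does not disclose Target's actual pipeline.

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical training table: one row per guest, with purchase-history features
# and an 'is_pregnant' label derived from the baby-shower registry.
guests = pd.read_csv("guests.csv")
X = guests.drop(columns=["guest_id", "is_pregnant"])
y = guests["is_pregnant"]

# Hold out a test set to estimate generalization performance before deployment.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each guest's predicted probability is their "pregnancy prediction" score;
# marketing would target guests whose score exceeds a chosen threshold.
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))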

Part 2: Exploring data in the command line

For this part we will be using the data file located in "data/advertising_events.csv". This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: userid, timestamp, domain, and action. These fields are of type int (continuous), int (continuous), string, and int (category) respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. "!command arguments".

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later. You can also use ssh to use the actual bash shell on EC2 and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]
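Before answering, it can help to preview the file and confirm the column layout (assuming the same local Datasets/ path used in the answers below):

In [ ]:
!head -5 Datasets/advertising_events.csv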

1. How many records (lines) are in this file?


In [1]:
!wc -l Datasets/advertising_events.csv | cut -d ' ' -f 1


10341

2. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')


In [2]:
!cut -d ',' -f 1 Datasets/advertising_events.csv | sort | uniq | wc -l


732

3. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).


In [3]:
!cut -d ',' -f 3 Datasets/advertising_events.csv | sort | uniq -c | sort -nr


   3114 google.com
   2092 facebook.com
   1036 youtube.com
   1034 yahoo.com
   1022 baidu.com
    513 wikipedia.org
    511 amazon.com
    382 qq.com
    321 twitter.com
    316 taobao.com

4. List all records for the user with user id 37. (hint: this can be done using 'grep')


In [4]:
!awk -F "," '{if($1==37) {print}}' Datasets/advertising_events.csv


37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
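The hinted grep approach gives an equivalent one-liner; anchoring the pattern to the start of the line and including the comma delimiter prevents user ids such as 370 or 371 from matching:

In [ ]:
!grep '^37,' Datasets/advertising_events.csv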

Part 3: Dealing with data Pythonically


In [5]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np

1. Load the data set "data/ads_dataset.tsv" into a Python Pandas data frame called ads.


In [6]:
ads = pd.read_table("Datasets/ads_dataset.tsv", sep='\t')

2. Write a Python function called getDfSummary() that does the following:

  • Takes as input a data frame
  • For each variable in the data frame calculates the following features:
    • number_nan to count the number of missing not-a-number values
    • Ignoring missing, NA, and Null values:
      • number_distinct to count the number of distinct values a variable can take on
      • mean, max, min, std (standard deviation), and 25%, 50%, 75% to correspond to the appropriate percentiles
  • All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
  • Returns this new data frame containing all of the summary information

Hint: The pandas describe() (manual page) method returns a useful series of values that can be used here.
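For reference, describe() on a numeric column returns a Series indexed by count, mean, std, min, 25%, 50%, 75%, and max; the function below slices out the entries it needs:

In [ ]:
ads['visit_freq'].describe()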


In [7]:
def getDfSummary(input_data):
    rows = []
    for col in input_data.columns:
        # Count the missing values first; the remaining statistics ignore them.
        number_nan = input_data[col].isnull().sum()
        # nunique() excludes NaN by default
        number_distinct = input_data[col].nunique()
        stats = input_data[col].describe()[['mean', 'max', 'min', 'std', '25%', '50%', '75%']].tolist()
        rows.append([number_nan, number_distinct] + stats)
    # One row per input variable, one column per summary feature
    output_data = pd.DataFrame(rows, index=input_data.columns,
                               columns=['Number_NaN', 'Number_Distinct', 'Mean', 'Max',
                                        'Min', 'Std', '25%', '50%', '75%'])
    return output_data

3. How long does it take for your getDfSummary() function to work on your ads data frame? Show us the results below.

Hint: %timeit getDfSummary(ads)


In [8]:
%timeit getDfSummary(ads)


10 loops, best of 3: 67.3 ms per loop
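For comparison, here is a mostly vectorized sketch that lets pandas compute each statistic column-wise (it assumes, as holds for ads, that every column is numeric); it should typically be faster than the per-column loop:

In [ ]:
def getDfSummaryFast(input_data):
    # describe() computes all the percentile statistics in one pass;
    # transpose so that variables become rows.
    summary = input_data.describe().T[['mean', 'max', 'min', 'std', '25%', '50%', '75%']]
    summary.insert(0, 'Number_Distinct', input_data.nunique())
    summary.insert(0, 'Number_NaN', input_data.isnull().sum())
    summary.columns = ['Number_NaN', 'Number_Distinct', 'Mean', 'Max', 'Min', 'Std', '25%', '50%', '75%']
    return summary

%timeit getDfSummaryFast(ads)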

4. Using the results returned from getDfSummary(), which fields, if any, contain missing NaN values?


In [9]:
result = getDfSummary(ads)
result[result.Number_NaN > 0]


Out[9]:
Number_NaN Number_Distinct Mean Max Min Std 25% 50% 75%
buy_freq 52257 10 1.240653 15.0 1.0 0.782228 1.0 1.0 1.0

5. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? What would be an appropriate method for filling in missing values?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using getDfSummary() and compare the differences. Do some feature distributions change dramatically?


In [10]:
# Use the summary returned by getDfSummary() to identify the columns containing NaN values.
result[result.Number_NaN > 0]
# Only one column, 'buy_freq', has NaN values.

# Drop the rows with NaN and summarize the remainder to see which feature
# distributions change.
new = ads.dropna()
getDfSummary(new)

# In this subset, 'isbuyer' takes only one distinct value: 1. So 'buy_freq' is missing
# exactly when 'isbuyer' is 0 -- the data is not missing at random; non-buyers simply
# have no buying frequency recorded.

# Check the correlation between these two columns with corr():
new.isbuyer.corr(new['buy_freq'])

# This returns NaN even though all NaN values were removed: within the subset,
# 'isbuyer' has zero standard deviation, so the denominator of the correlation
# formula is 0. Check the correlations between the remaining columns:
new.corr()

# Apart from 'isbuyer', 'buy_freq' correlates strongly with 'multiple_buy'.
new[['buy_freq', 'multiple_buy']]
# Whenever 'buy_freq' > 1, 'multiple_buy' is 1.

# Since 'buy_freq' is missing only for non-buyers, filling the missing values with 0
# is appropriate: when 'multiple_buy' is 0 and 'isbuyer' is 0, 'buy_freq' should be 0.
ads.loc[(ads['multiple_buy'] < 1) & (ads['isbuyer'] == 0), ['buy_freq']] = 0
ads.isnull().sum()
# All NaN values have been eliminated based on the observed correlations.


Out[10]:
isbuyer                0
buy_freq               0
visit_freq             0
buy_interval           0
sv_interval            0
expected_time_buy      0
expected_time_visit    0
last_buy               0
last_visit             0
multiple_buy           0
multiple_visit         0
uniq_urls              0
num_checkins           0
y_buy                  0
dtype: int64

6. Which variables are binary?


In [11]:
result = getDfSummary(ads)
result.loc[result.Number_Distinct == 2]

# Number_Distinct identifies the binary variables: each of the four variables below
# takes on exactly two distinct values, with a max of 1 and a min of 0.


Out[11]:
Number_NaN Number_Distinct Mean Max Min Std 25% 50% 75%
isbuyer 0 2 0.042632 1.0 0.0 0.202027 0.0 0.0 0.0
multiple_buy 0 2 0.006357 1.0 0.0 0.079479 0.0 0.0 0.0
multiple_visit 0 2 0.277444 1.0 0.0 0.447742 0.0 0.0 1.0
y_buy 0 2 0.004635 1.0 0.0 0.067924 0.0 0.0 0.0
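A stricter filter would also require the min and max to be exactly 0 and 1, guarding against variables that happen to take two distinct non-binary values:

In [ ]:
result.loc[(result.Number_Distinct == 2) & (result.Min == 0) & (result.Max == 1)]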