Homework 3: Data Wrangling and Exploratory Data Analysis

Ta Feng is a retail warehouse in Taiwan that mainly sells food products, though it also stocks some office and home products. Like Costco and Sam's Club, it operates on a club structure wherein people buy a membership to gain access to the store. As a result, purchases can be traced to specific customers. In this homework, we will use this database to practice data wrangling, exploratory data analysis, and visualization.

Throughout the assignment, make sure that each plot has meaningful labels that characterize the variable represented and its units.

Setup


In [ ]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the NumPy, pandas, and seaborn modules.
import numpy as np
import pandas as pd
import seaborn as sns

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines load the tests.
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('hw3.ok')

Importing Data

In this section, we run through some problems that you might face as you go through the innocuous-sounding process of "importing data." We also have you process some variables that aren't in the most user-friendly form.

Age Classes

We'll start by looking at age_classes.txt, which is a lookup table for our main Ta Feng table.


In [ ]:
!head "age_classes.txt"

Question 1

What is the delimiter used in this file?

SOLUTION: Space
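
If the delimiter isn't obvious from the head output, you can verify it programmatically. Here is a minimal sketch using the standard library's csv.Sniffer (an optional sanity check, not part of the required solution):

In [ ]:
# Optional sanity check: peek at the first line and let csv.Sniffer guess
# the delimiter. Sniffer raises csv.Error if it can't decide.
import csv

with open('age_classes.txt') as f:
    sample = f.readline()

print(repr(sample))  # the raw first line, with the delimiter visible
print('Guessed delimiter:', repr(csv.Sniffer().sniff(sample).delimiter))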

Let's read in age_classes.txt as a data frame called age_classes. Use the column names provided below.


In [ ]:
age_class_columns = ['code', 'age_range']
age_classes = pd.read_csv('age_classes.txt', sep=' ', 
                          header=None, names=age_class_columns)
age_classes

Question 2

What type of variable is age_range? Continuous, discrete, nominal, ordinal, or other? Make sure the type of age_range in the data frame age_classes matches your answer.

SOLUTION: Ordinal


In [ ]:
age_classes['age_range'] = age_classes['age_range'].astype(pd.CategoricalDtype(ordered=True)) # SOLUTION
age_classes.dtypes

In [ ]:
_ = ok.grade('q02')
_ = ok.backup()

Question 3

Perhaps at some point we would like to work with the age data numerically. Derive two numerical columns from age_range called age_min and age_max, which contain the left and right endpoints of age_range, respectively.


In [ ]:
age_classes[['age_min', 'age_max']] = (
    age_classes['age_range']
    .str.split("-", expand=True)
    .astype('int')
) #SOLUTION

In [ ]:
_ = ok.grade('q03')
_ = ok.backup()

Question 4

To have a single column representing age, we could reduce the two endpoints to a single number. Compute an additional age_center attribute that is equal to the midpoint of age_range.


In [ ]:
age_classes['age_center'] = (age_classes['age_max'] + age_classes['age_min']) / 2. # SOLUTION
age_classes

In [ ]:
_ = ok.grade('q04')
_ = ok.backup()

Residence Area

Now inspect residence_area.txt in your terminal window. We'll read in the data in the next cell.


In [ ]:
!head "residence_area.txt"

In [ ]:
# A single-character delimiter lets pandas use the faster C parsing engine.
residence_areas = pd.read_csv('residence_area.txt', 
                              delimiter=':',
                              header=None, 
                              names=['code','area'])
# Need to remove the leading whitespace
residence_areas['area'] = residence_areas['area'].str.strip()

Question 5

The README.TXT file contains more information about the postal codes that is not immediately obvious if you just look at residence_area.txt. Now that we have information that adds context to our data, we might want to add it to the lookup table. Create a new column dist_rank that encodes this information as ranks. Code missing values as np.nan.


In [ ]:
residence_areas['dist_rank'] = [4, 5, 6, 3, 1, 2, np.nan, np.nan] # SOLUTION
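
Hardcoding the ranks works because this table is tiny, but it silently depends on the row order of residence_area.txt. As an alternative, here is a sketch that maps ranks by code instead; the code-to-rank pairs are read off the hardcoded solution above, assuming the rows appear in code order:

In [ ]:
# Alternative sketch: assign ranks by postal code rather than by row position.
# The code -> rank pairs mirror the hardcoded list above (an assumption based
# on the rows appearing in code order).
rank_by_code = {'A': 4, 'B': 5, 'C': 6, 'D': 3, 'E': 1, 'F': 2}
# .map leaves codes missing from the dict as NaN automatically.
residence_areas['dist_rank'] = residence_areas['code'].map(rank_by_code)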

In [ ]:
_ = ok.grade('q05')
_ = ok.backup()

Run this cell: Reading in Ta Feng

The cell below reads in the Ta Feng data and concatenates the four months into a single data frame. The gzip option deals with the fact that the files are compressed (hence the .gz at the ends of their names), and Big5 refers to a character encoding for Chinese characters.


In [ ]:
data_column_names = [
    'transaction_time',
    'customer_id',
    'age_class', 
    'residence_area', 
    'product_subclass', 
    'product_id', 
    'amount', 
    'asset', 
    'sales_price'
]

files = [
    'D01.gz',
    'D02.gz',
    'D11.gz',
    'D12.gz'
]

frames = []

for f in files:
    df = pd.read_csv(f, compression='gzip', encoding='Big5', delimiter=';', 
                     skipinitialspace=True, header=0, names=data_column_names)
    frames.append(df)

taFeng = pd.concat(frames)

# These two columns have hard-to-detect whitespace that causes issues
taFeng['age_class'] = taFeng['age_class'].str.strip()
taFeng['residence_area'] = taFeng['residence_area'].str.strip()
taFeng.head()

Run this cell: Joining the tables

The code below combines the data in the taFeng table with the information found in the lookup tables age_classes and residence_areas.


In [ ]:
taFengFull = pd.merge(taFeng, age_classes,
                      how='left', left_on='age_class', right_on='code')
taFengFull = pd.merge(taFengFull, residence_areas, suffixes=['_age', '_residence'], 
                      how='left', left_on = 'residence_area', right_on = 'code')

In [ ]:
taFengFull.head()

Question 6

Since we performed left joins and the code columns were keys in both age_classes.txt and residence_area.txt, we expect the number of rows in taFengFull to be the same as in taFeng. Write a function have_same_nrows that takes two data frames as its arguments and returns True if they have the same number of rows.


In [ ]:
def have_same_nrows(df1, df2):
    return df1.shape[0] == df2.shape[0] # SOLUTION

In [ ]:
...

In [ ]:
same_after_join = have_same_nrows(taFeng, taFengFull)
same_after_join

In [ ]:
_ = ok.grade('q06')
_ = ok.backup()

Question 7

It's a good habit to keep track of missing values. Notice that K appears in the column age_class, but not in the age lookup table age_classes. What happened when we joined the two tables? That is, do the corresponding rows appear in our taFengFull table, and if so, how does the join deal with the fact that K doesn't exist in the lookup table?

SOLUTION: The columns derived from the lookup table are filled with missing values (np.nan) for all the rows associated with the age class K.
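
To see this behavior in isolation, here is a toy example (made-up data, not from Ta Feng) of a left join where one key is missing from the lookup table:

In [ ]:
# Toy illustration: a left join keeps every row of the left frame and
# fills the lookup columns with NaN when the key has no match.
left = pd.DataFrame({'age_class': ['A', 'K']})
lookup = pd.DataFrame({'code': ['A'], 'age_range': ['25-29']})
pd.merge(left, lookup, how='left', left_on='age_class', right_on='code')
# The 'K' row survives, with NaN in both 'code' and 'age_range'.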

Guided EDA

As business analysts, we may be interested in what the purchase data can tell us about Ta Feng's business and the behavior of its customers. While there are many aspects of this data set that we could look into, we will focus on just two ideas in the interest of keeping this assignment reasonable. The questions will guide you through the iterative process of data exploration, where one question flows into another.

  • Does the relationship between basket size and amount spent match our intuition?
  • Are there temporal trends in customer behavior?

We start by exploring customer behavior at the product level to get a feel for the variables, data types, cleanliness, missingness, and distributions. For the moment, assume that the data consist of all transactions in the four-month period (i.e., no transactions are missing from this data set).

Note

For each visualization we display, please reproduce that plot as faithfully as you can. These plots represent our expectations of exemplary work.

Question 8

Calculate the number of shopping trips each customer made in the four-month period. Then make a histogram that summarizes the information. Assume that transactions occurring on the same day can be grouped together into one trip (e.g., all purchases on January 1 by the same customer should be lumped together). Hint: You will have to use a groupby to change the granularity of the data to the level of customers.


In [ ]:
# One trip per customer per day: count the distinct transaction dates for each customer.
rpt_business = taFeng.groupby('customer_id')['transaction_time'].nunique()
rpt_business.hist(bins=rpt_business.max())

plt.xlabel('Number of Transactions in 4 Months')
plt.ylabel('Number of Customers')
plt.title('Repeat Business')

Question 9

A customer on the left side of the distribution shops only sporadically at Ta Feng, and a customer on the right side shops at Ta Feng very frequently. The distribution is skewed right, with nearly all the customers shopping fewer than 20 times in the four-month period. Let's zoom in on the x-axis to better view the shape of the distribution for the bulk of the customers. To do this, remake the plot above, but set the limits of the x-axis to 0 and 30 to zoom in on the range where we find most of the customers.


In [ ]:
rpt_business.hist(bins=rpt_business.max())
plt.xlabel('Number of Transactions in 4 Months')
plt.ylabel('Number of Customers')
plt.title('Repeat Business')
plt.xlim(0, 30)

In [ ]:
...

Question 10

We have successfully filled the plotting region with our histogram, and we can more easily see how rapidly the number of shopping trips trails off. We wonder just how much of our data is concentrated on the left side. Use the quantile method to calculate the deciles (i.e., the 0th, 10th, 20th, ..., 90th, and 100th percentiles) of rpt_business.


In [ ]:
rpt_deciles = rpt_business.quantile(np.arange(0, 1.1, 0.1)) # SOLUTION
rpt_deciles

Question 11

The percentiles tell us that 90% of customers shopped at Ta Feng 8 times or fewer in the four months spanning our data set, and more than half went only 1 or 2 times! Perhaps we can bring in the large values and stretch out the small values with a log transform. Remake the histogram as before, except this time, rescale the x-axis with a log transformation. Hint: To make the tick marks appear, you may want to look into pyplot's tick_params.


In [ ]:
rpt_business.hist(bins=np.logspace(0, 4, 41))
plt.xlabel('Number of Transactions in 4 Months')
plt.ylabel('Number of Customers')
plt.title('Repeat Business')
plt.xscale('log')
plt.xlim(7e-1, 2.5e1)
plt.tick_params(axis='x', which='minor', length=3, top=False)
plt.tick_params(axis='x', which='major', length=6, top=False)

In [ ]:
...

In this case, the log transformation does not help symmetrize the distribution. Since the vast bulk of the data take on only 8 distinct values with over half the data being either a 1 or a 2, the log transformation has little impact.
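
We can check this concentration directly. A quick sketch using the rpt_business series from above:

In [ ]:
# Quantify how concentrated the distribution is.
print(rpt_business.value_counts().head())  # the most common trip counts
print('Share of customers with 1 or 2 trips:', (rpt_business <= 2).mean())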

Taking the data to another level

Next, we investigate the shopping trips in greater detail. For example, we might be interested in the total number of items purchased, the total amount spent, and the number of unique products purchased on the trip. Since our data are at the level of items purchased, we need to aggregate our data to the level of the shopping cart. Fortunately, we have the tools to do this; that is, we will change the granularity. For each shopping cart, we find:

  • num_items - the total number of items purchased
  • total - the total value of the cart
  • num_unique - the number of unique products purchased

Question 12

Create a data frame called carts that contains the variables above, along with customer_id and transaction_time (these two will form the index after grouping). Make sure you use the names specified.


In [ ]:
transactions = taFengFull.groupby(['customer_id', 'transaction_time'])
carts = pd.DataFrame()
carts['num_items'] = transactions['amount'].sum()      # total number of items purchased
carts['num_unique'] = transactions['amount'].count()   # number of unique products (one row per product)
carts['total'] = transactions['sales_price'].sum()     # total value of the cart
carts.head()

Question 13

Let's take a look at the relationship between the number of items in a cart and the cart total. Intuitively the two should be positively correlated, but why assume when we can verify? Make a scatter plot that will help us inspect the relationship between these two variables.


In [ ]:
carts.plot(kind='scatter', x='num_items', y='total')
plt.xlim(-50, 1250)
plt.ylim(-25000, 475000)
plt.xlabel('Number of Items Purchased')
plt.ylabel('Total Amount Spent')
plt.title('Quantity Purchased vs Value')

Question 14

We might expect from the distribution of the number of trips that there would be a few very large values for the number of items and the total amount spent. Indeed, a handful of observations make it difficult to see the shape of the bulk of the data. As before, take a log transformation of these two variables: make another scatter plot, but this time, log-transform both the x and y axes.


In [ ]:
carts.plot(kind='scatter', x='num_items', y='total')
plt.title('Quantity Purchased vs Value')
plt.xlabel('Number of Items Purchased')
plt.ylabel('Total Amount Spent')
plt.xscale('log')
plt.yscale('log')
plt.xlim(7e-1, 2.5e3)
plt.ylim(7e-1, 1e6)
plt.tick_params(axis='x', which='minor', length=3, top=False)
plt.tick_params(axis='x', which='major', length=6, top=False)
plt.tick_params(axis='y', which='minor', length=3, right=False)
plt.tick_params(axis='y', which='major', length=6, right=False)

Question 15

Notice the vertical bands in the scatter plot. These come from the discreteness of the number of items purchased; the bands are easier to see in the log-transformed data.

While there seems to be a positive correlation in the point cloud, the overplotting may be deceiving us. We will make a few changes to the scatter plot to address the problem. First, we lower the opacity (the alpha value) of the plotting symbols, and we shrink the symbol size; both reduce the effect of overplotting. Additionally, we add a line-of-best-fit through the data. Hint: Use seaborn's lmplot to simultaneously plot the points and the line-of-best-fit.


In [ ]:
sns.lmplot(x='num_items', y='total', data=carts, 
           scatter_kws={'s': 1, 'alpha': 0.1}, 
           line_kws={'lw': 2, 'color': '#4682b4'})
plt.title('Quantity Purchased vs Value')
plt.xlabel('Number of Items Purchased')
plt.ylabel('Total Amount Spent')
plt.xscale('log')
plt.yscale('log')
plt.xlim(7e-1, 2.5e3)
plt.ylim(7e-1, 1e6)
plt.tick_params(axis='x', which='minor', length=3, top=False)
plt.tick_params(axis='x', which='major', length=6, top=False)
plt.tick_params(axis='y', which='minor', length=3, right=False)
plt.tick_params(axis='y', which='major', length=6, right=False)

From this scatter plot, we can make several observations about the relationship between the number of items purchased and the total amount spent. There appears to be a linear association between the amount spent and the number of items purchased on a log-log scale. (A log-log linear relationship with slope b implies that, say, a 10% increase in the number of items purchased is associated with roughly a b*10% increase in the amount spent, on average.) We also note that the spread in amount spent varies with the number of items purchased; in particular, the spread is greater for smaller shopping baskets. This might be due to shopping trips where a single expensive item is purchased.
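
To estimate the slope b of that log-log line, we can fit it directly. A minimal sketch with np.polyfit, dropping any non-positive values that a log transform cannot handle:

In [ ]:
# Fit log(total) against log(num_items); the slope b is (approximately)
# the percent change in spending associated with a 1% change in items.
valid = (carts['num_items'] > 0) & (carts['total'] > 0)
b, intercept = np.polyfit(np.log10(carts.loc[valid, 'num_items']),
                          np.log10(carts.loc[valid, 'total']), 1)
print('Estimated log-log slope b:', b)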

Time

In this question, we will investigate temporal trends in shopping behavior. We begin with the technical details of date-time data.

Run this cell: Parsing Time

Currently, your computer understands the transaction times as strings. The code below typecasts the transaction_time column to datetime64.


In [ ]:
time_format = "%Y-%m-%d %H:%M:%S"
taFengFull['transaction_time'] = pd.to_datetime(taFengFull['transaction_time'], 
                                                format = time_format)

Question 16

Now that we have the time data in a usable format, we can start looking into customer trends over time. Make a line plot of the number of unique customers per day against day. Note that we have once again changed the granularity at which we are examining our data.


In [ ]:
taFengFull.groupby('transaction_time')['customer_id'].nunique().plot(color='black', lw=1.5)
plt.xticks(rotation='vertical')
plt.xlabel('Day')
plt.ylabel('Unique Customers')
plt.title('Unique Customers Per Day')

Question 17

The data seems to exhibit some patterns. It would seem plausible that people are more likely to shop on certain days of the week. Make a box-and-whiskers plot that summarizes customer counts by day of the week. Hint: For our solution, we created a new column called day_of_week that contains the name of, well, the day of the week.


In [ ]:
day_of_week_ord = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_type = pd.CategoricalDtype(categories=day_of_week_ord, ordered=True)
taFengFull['day_of_week'] = (
    taFengFull['transaction_time']
    .dt.day_name()
    .astype(day_type)
)
taFengFull['week_of_year'] = taFengFull['transaction_time'].dt.isocalendar().week

count_by_weekday = (
    taFengFull[['week_of_year', 'day_of_week', 'customer_id', 'transaction_time']]
    .drop_duplicates()
    .groupby(['day_of_week', 'week_of_year'])
    .size()
    .reset_index(name='count')
)

sns.boxplot(x='day_of_week', y='count', data=count_by_weekday)
plt.xlabel('Day of Week')
plt.ylabel('Count')
plt.title('Number of Customers by Day of Week')

The boxplots confirm that typical Sundays draw many more (~500) customers than the weekdays. Many customers also visit Ta Feng on Saturdays.
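
To put numbers on that comparison, we can compute the median daily customer count for each day of the week; a quick sketch using count_by_weekday from above:

In [ ]:
# Median distinct customers per day, by day of week, to quantify the weekend bump.
count_by_weekday.groupby('day_of_week')['count'].median()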

Question 18

There seem to be three distinct times when the number of customers drops below "normal" counts. Could these days be holidays? Name three holidays that would most influence shopping behavior in this time period. Hint: To answer this question, you will have to understand the temporal and geographical context of the data.

SOLUTION: Christmas, New Year's, Lunar New Year
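
One way to pin down the candidate dates before naming holidays is to list the days with the fewest distinct customers; a small sketch:

In [ ]:
# The quietest days in the data; holiday closures should surface here.
daily_customers = taFengFull.groupby('transaction_time')['customer_id'].nunique()
daily_customers.nsmallest(10)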

Question 19

Reproduce the line graph from above, but this time, add reference lines so that you can easily pick out which days are holidays.


In [ ]:
taFengFull.groupby('transaction_time')['customer_id'].nunique().plot(color='black', lw=1.5)
holidays = ['2000-12-25', '2001-01-01', '2001-01-24']
for holiday in holidays:
    plt.axvline(pd.Timestamp(holiday), lw=1.5, color='#ef8a62', ls='dashed')
plt.xticks(rotation='vertical')
plt.xlabel('Day')
plt.ylabel('Unique Customers')
plt.title('Unique Customers Per Day')

It seems like holidays are indeed a plausible explanation for the lows and highs we see on our plot! It looks like Ta Feng has a week of really slow business right before Christmas and after both New Year celebrations. The most pronounced peak comes right before the celebration of the Lunar New Year.

Self-Directed EDA

The last two questions are intentionally more open-ended and will be graded on the completeness of the plot(s) produced and the insights you gain from them. Be sure to consider transformations, subsets, correlations, reference markers, and lines/curves-of-best-fit to reveal the relationship that you want to learn more about. Also be sure to make plots that are appropriate for the variable types. For completeness, be explicit about any assumptions you make in your analysis. An exemplary plot will have:

  • A title
  • Labelled and appropriately scaled axes
  • A legend, if applicable
  • A carefully selected color scheme
  • A main point, accentuated through design choices

Question 20

Make a visualization of and interpret the distribution of one of the following quantities:

  • Age distribution of the shoppers
  • Distribution of the number of trips each customer makes

In [ ]:
...

Your Observations

Question 21

Make a visualization of and interpret the relationship between one of the following pairs:

  • Amount spent on a shopping trip and the number of trips
  • Amount spent on a shopping trip and the customer's age
  • Frequency of shopping and the customer's age
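
In [ ]:
...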

Your Observations

Submitting your assignment

First, run the next cell to run all the tests at once.


In [ ]:
_ = ok.grade_all()

Then, we'll submit the assignment to OkPy so that the staff will know to grade it. You can submit as many times as you want, and you can choose which submission you want us to grade by going to https://okpy.org/cal/data100/sp17/. After you've done that, make sure you've pushed your changes to Github as well!


In [ ]:
# Now, we'll submit to okpy
_ = ok.submit()