You are already using jupyter!
jupyter is a web-based IDE, a development environment with a REPL where you can execute arbitrary code and see the results instantly. Beneath the web interface there is a server that executes the code. The coding language is usually Python 3, but there are server kernels for other languages.
Thanks to its REPL powers, jupyter has become the de facto standard format for sharing data analysis, research papers and all kinds of data-related presentations. As developers we might find jupyter a too-simple IDE, but this simplicity is actually what helps researchers explore datasets and try things out quickly.
A cell usually contains code or markdown. Single-click here to see it ;-)
Here are some ways to move around and execute a cell:
You can use the mouse and the top menu to do all kinds of things, or learn the keyboard shortcuts (in the Help menu).
In [1]:
# This is a cell with code. Try to execute it: use the mouse to focus it and press control+Enter.
' '.join(['Hello', 'world'])
Out[1]:
pandas is a library for manipulating data frames. It is built on top of a smaller library, NumPy, which operates on matrices. pandas leverages NumPy and adds an incredible collection of functions to play with the data.
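If you want to see that relationship for yourself, here is a minimal sketch (using a tiny toy dataframe, not the ulabox data): every DataFrame wraps a NumPy array that you can access with .to_numpy() (or .values in older pandas versions).
In [ ]:
# A tiny toy dataframe, just to peek at the NumPy array underneath
import pandas as pd
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(type(toy.to_numpy()))
toy.to_numpy()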
Let's load ulabox's dataset and play a bit with the data. If the .csv file is not in the current directory, this code will download it.
In order to know the meaning of each column of the dataset, please have a look at its data dictionary.
In [2]:
# By convention, 'pandas' is imported as 'pd'.
# Please execute this cell with control+Enter (and the following ones, while you read them).
import pandas as pd
import os.path
filename = 'ulabox_orders_with_categories_partials_2017.csv'
if not os.path.isfile(filename):
    import urllib.request
    urllib.request.urlretrieve('https://raw.githubusercontent.com/ulabox/datasets/master/data/ulabox_orders_with_categories_partials_2017.csv', filename)
raw_df = pd.read_csv(filename)
# head() shows first 5 rows
raw_df.head()
Out[2]:
If you have a look at the raw data, each row has an index (left, bold) with its corresponding column data. In data analysis, rows are usually called "samples" while columns are called "features".
Actually, in this case the "order" feature (the order number) can be used directly as the index of the dataframe. So let's set it as the index; this also removes it as a regular column.
In [3]:
# set_index uses the order number as the index (and removes it as a regular column)
df = raw_df.set_index('order')
df.head()
Out[3]:
Notice that the pandas library is really powerful. It can manipulate data in different ways and allows all kinds of dataframe operations (at cell level, at row and column level, and even between dataframes).
For example, it can use multi-indexes as follows.
In [4]:
# Group the rows by customer and order, summing the numeric columns
multi_indexed_df = raw_df.groupby(by=['customer','order']).sum()
multi_indexed_df.head()
Out[4]:
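As a quick illustration of column-level operations (a small sketch using two columns that exist in this dataset), arithmetic between columns works element-wise and returns a new Series:
In [ ]:
# Element-wise sum of two percentage columns (returns a new Series)
(raw_df['Food%'] + raw_df['Fresh%']).head()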
Filtering, sampling and indexing by sample or feature is really easy too.
In [5]:
# Get 50 random rows
sample = df.sample(50, random_state = 1)
# random_state fixes the random seed, so we always get the same results
sample.head()
Out[5]:
In [6]:
# Getting a column (feature)
column = sample['total_items']
column.head()
Out[6]:
In [7]:
# Getting a row (sample)
sample.loc[8856]
Out[7]:
In [8]:
# Getting an individual value
sample.loc[8856, 'hour']
Out[8]:
In [9]:
# Filtering with a boolean condition (mask)
orders_with_more_than_50_items = sample.loc[sample['total_items'] > 50]
orders_with_more_than_50_items.head()
Out[9]:
In [10]:
# Notice the content of the previous comparison
(sample['total_items'] > 50).head()
Out[10]:
In [11]:
# [WORKSHOP] Can you find any order with only Drink products bought?
sample.loc[sample['Drinks%'] == 100]
Out[11]:
You should have found more than one order with just drinks. Somebody was really thirsty!
Moreover, pandas has several functions for doing statistics, helpful when exploring the data.
In [12]:
df.describe()
Out[12]:
As you can see, the dataset contains 30k rows (samples). We get the mean, standard deviation, min, max and other statistics for each feature.
Wait a moment! Look at the discount% column. It seems the maximum discount% is 100%, and the minimum discount% is -65%!! This looks weird...
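A quick sketch to double-check those extremes explicitly (same column, just direct calls):
In [ ]:
# Check the extreme values of discount% and count the negative ones
print(df['discount%'].min(), df['discount%'].max())
(df['discount%'] < 0).sum()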
If you want to have a look at other pandas features, I recommend its documentation and this cheatsheet.
matplotlib is a plotting library. It can use numpy arrays and pandas dataframes as input.
Actually, pandas comes with direct helpers to display data through matplotlib! If you want to learn more about the matplotlib integration in pandas, check the pandas visualization documentation and the matplotlib documentation.
For instance, let's explore the discount% feature using a histogram.
In [13]:
import matplotlib.pyplot as plt
# This is a jupyter magic command: matplotlib output is rendered inline, below the cell
%matplotlib inline
# Use 15 bins
df['discount%'].hist(bins=15)
Out[13]:
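If you want another view of the same feature, here is a small sketch using another pandas plotting helper (a box plot also highlights the extreme values):
In [ ]:
# Box plot of the same feature, also rendered through matplotlib
df['discount%'].plot(kind='box')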
After seeing the discount% feature displayed, it's clear that when the discount% is 100% it must be some kind of free order (like a gift for a VIP).
On the other hand, why are there some negative discount% values? This is not easy to understand until you ask the domain expert: some drinks have a surcharge (a negative discount) due to a law that taxes drinks with added sugar.
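If you want to check that explanation against the data, a small exploratory sketch (not a proof) is to peek at the orders with a negative discount% and their Drinks% share:
In [ ]:
# Orders with a surcharge: do they actually contain drinks?
df.loc[df['discount%'] < 0, ['discount%', 'Drinks%']].head()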
In [14]:
# [WORKSHOP] Plot a histogram to display the most common hours of the day orders are purchased.
df['hour'].hist(bins=24)
Out[14]:
The results should show a peak at 12~13h and one at 22h.
scikit-learn is a library with machine learning algorithms and helpers.
Before starting to use ML algorithms, let's prepare a smaller dataset without free orders and with just 100 rows, so the algorithms will run really fast. We will also keep only the 8 category partials.
Notice that many ML algorithms (especially distance-based ones like K-Means) work better when all features are on a similar scale, roughly between 0 and 1, so it's important to keep the values at this order of magnitude. An easy way to normalize the samples' data is to divide the percentage values by 100. Another option could be standardization (see documentation; a small sketch is shown after the next cell).
In ML jargon, the algorithm input is called X (a matrix with samples as rows and features as columns).
In [15]:
no_free_orders = df.loc[df['discount%'] < 100]
one_hundred = no_free_orders.sample(100, random_state = 1)
X = one_hundred[['Food%', 'Fresh%', 'Drinks%', 'Home%', 'Beauty%', 'Health%', 'Baby%', 'Pets%']].divide(100)
X.head()
Out[15]:
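As mentioned above, standardization is another option. Here is a minimal sketch using scikit-learn's StandardScaler (shown only as the alternative; the rest of the workshop keeps the divide-by-100 version of X):
In [ ]:
from sklearn.preprocessing import StandardScaler
# Rescale each feature to zero mean and unit variance (alternative to dividing by 100)
X_standardized = StandardScaler().fit_transform(X)
X_standardized[:5]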
In [16]:
from sklearn.cluster import KMeans
# Run K-Means asking for 7 clusters (random_state fixes the initialization, for reproducible results)
seven_clusters_alg = KMeans(n_clusters = 7, random_state = 1)
cluster_labels = seven_clusters_alg.fit_predict(X)
cluster_labels
Out[16]:
Each one of the 100 samples is assigned to a cluster (from 0 to 6). For instance, the first sample fell in cluster #3, the second in #4, etc.
Let's see how many samples fell in each cluster.
In [17]:
# cluster_labels is a numpy array, so first we wrap it in a dataframe and then plot a histogram of the cluster counts
pd.DataFrame(cluster_labels).hist(bins = 7)
Out[17]:
As you can see, clusters #1 and #3 have quite a lot of samples, while cluster #5 has just 5 samples.
Was choosing 7 clusters a good idea? If a cluster has a lot of samples, is it a correct clustering? How can we find the most "correct" number of clusters? Silhouette score to the rescue!
scikit-learn comes with a silhouette_score function that evaluates how well the samples fit in their clusters. Its result is a score: the higher the value, the better. If you want to understand this function better, have a look at the visual example that comes with its documentation.
In our case, let's write a small script that tries different numbers of clusters, from 2 to 19, and prints the score for each case.
In [18]:
from sklearn.metrics import silhouette_score
range_n_clusters = range(2,20)
for n_clusters in range_n_clusters:
    cluster_alg = KMeans(n_clusters = n_clusters, random_state = 1)
    cluster_labels = cluster_alg.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "the average silhouette_score is :", silhouette_avg)
Apparently, choosing 7 clusters was quite a good guess, but the best score goes to 6 clusters.
In [19]:
# [WORKSHOP] Use KMeans algorithm with 6 clusters (and random_state = 1), and plot clusters' histogram (with 6 bins)
six_clusters_alg = KMeans(n_clusters = 6, random_state = 1)
cluster_labels = six_clusters_alg.fit_predict(X)
pd.DataFrame(cluster_labels).hist(bins = 6)
Out[19]:
Changing from 7 clusters to 6 doesn't look like a big improvement, but the fact that some clusters have more samples than others doesn't mean they were badly chosen. It's normal that some kinds of customers are more common than others.
A good way to understand those 6 clusters better is to look at their centers.
In [20]:
six_clusters_alg = KMeans(n_clusters = 6, random_state = 1)
cluster_labels = six_clusters_alg.fit_predict(X)
centers = pd.DataFrame(six_clusters_alg.cluster_centers_, columns=X.columns)
centers.multiply(100).round(0)
Out[20]:
Looking at the 6 clusters' centers, everything makes more sense. Let's see what is bought in each one:
Notice too that some features are totally irrelevant: Health% and Pets%. So we can consider ignoring them in the future.
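One quick way to back up that claim (a small sketch over the same 100-row sample) is to look at the basic statistics of those two columns:
In [ ]:
# Health% and Pets% barely contribute in this sample
X[['Health%', 'Pets%']].describe()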
In [21]:
from sklearn.cluster import DBSCAN
# DBSCAN is a density-based algorithm: it doesn't need the number of clusters up front,
# and it marks low-density samples as noise (label -1).
# min_samples=3: at least 3 samples within eps=0.3 distance are needed to form a dense region
dbscan_alg = DBSCAN(eps = 0.3, min_samples = 3)
cluster_labels = dbscan_alg.fit_predict(X)
cluster_labels
Out[21]:
Ouch! This algorithm only found 3 clusters (#0, #1 and #2), and some samples are marked as outliers (#-1).
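To see exactly how many samples landed in each group, including the -1 outliers, one option (a small sketch) is value_counts:
In [ ]:
# Count samples per DBSCAN label (-1 means noise/outlier)
pd.Series(cluster_labels).value_counts()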
Let's try to remove Health% and Pets% as we have seen those features are irrelevant.
In [22]:
X2 = X[['Food%', 'Fresh%', 'Drinks%', 'Home%', 'Baby%']]
cluster_labels = dbscan_alg.fit_predict(X2)
cluster_labels
Out[22]:
Well, the result has improved a bit, but there is still room for improvement...
In [23]:
# [WORKSHOP] Feel free to try alternative configurations for this algorithm...
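# One possible direction (just a sketch, not the only answer): sweep eps and
# count how many clusters and outliers DBSCAN finds for each value.
for eps in [0.2, 0.3, 0.4, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X2)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = (labels == -1).sum()
    print("eps =", eps, "-> clusters:", n_clusters, ", outliers:", n_outliers)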
As you have experienced, playing with Machine Learning algorithms is a matter of spending time learning about the data and finding the right parameters. Also notice that we were working with just 100 samples... things can get slow with really big data.
I hope you liked this workshop!