You are already using jupyter!
jupyter is a web-based IDE, a development environment with a REPL where you can execute arbitrary code and see the results instantly. Beneath the web interface there is a server that executes the code. The coding language is usually Python 3, but there are server kernels for other languages.
Thanks to its REPL powers, jupyter has become the de facto standard format for sharing data analysis, research papers and all kinds of data-related presentations. As developers we might find jupyter a too-simple IDE, but this simplicity is actually what helps researchers explore datasets and try things out quickly.
A cell usually contains code or markdown. Single-click here to see it ;-)
Here are some ways to move around and execute a cell:
You can use the mouse and the top menu to do all kinds of things, or learn the keyboard shortcuts (in the Help menu).
In [1]:
# This is a cell with code. Try to execute it: use the mouse to focus it and press control+Enter.
' '.join(['Hello', 'world'])
Out[1]:
pandas is a library for manipulating data frames. It is built on top of a smaller library, NumPy, which operates on matrices. pandas leverages NumPy and adds an incredible collection of functions to play with the data.
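If you want to see that relationship for yourself, here is a minimal sketch (using a tiny toy dataframe, not the ulabox data): every DataFrame wraps a NumPy array that you can access with .to_numpy() (or .values in older pandas versions).
In [ ]:
# A tiny toy dataframe, just to peek at the NumPy array underneath
import pandas as pd
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(type(toy.to_numpy()))
toy.to_numpy()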
Let's load ulabox's dataset and play a bit with the data. If the .csv file is not in the current directory, this code will download it.
In order to know the meaning of each column of the dataset, please have a look at its data dictionary.
In [2]:
# By convention, 'pandas' is imported as 'pd'.
# Please execute this cell with control+Enter (and the following ones, while you read them).
import pandas as pd
import os.path
filename = 'ulabox_orders_with_categories_partials_2017.csv'
if not os.path.isfile(filename):
    import urllib.request
    urllib.request.urlretrieve('https://raw.githubusercontent.com/ulabox/datasets/master/data/ulabox_orders_with_categories_partials_2017.csv', filename)
raw_df = pd.read_csv(filename)
# head() shows first 5 rows
raw_df.head()
Out[2]:
If you have a look at the raw data, each row has an index (left, bold) with its corresponding column data. In data analysis, rows are usually called "samples" while columns are called "features".
Actually, in this case the "order" feature (the order number) can be used directly as the index of the dataframe. So let's set it as the index; this also removes it as a regular column.
In [3]:
# set_index uses the order number as the index (and removes it as a regular column)
df = raw_df.set_index('order')
df.head()
Out[3]:
Notice that the pandas library is really powerful. It can manipulate data in different ways and allows all kinds of dataframe operations (at cell level, at row and column level, and even between dataframes).
For example, it can use multi-indexes as follows.
In [4]:
# Group the rows by customer and order, summing the numeric columns
multi_indexed_df = raw_df.groupby(by=['customer','order']).sum()
multi_indexed_df.head()
Out[4]:
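As a quick illustration of column-level operations (a small sketch using two columns that exist in this dataset), arithmetic between columns works element-wise and returns a new Series:
In [ ]:
# Element-wise sum of two percentage columns (returns a new Series)
(raw_df['Food%'] + raw_df['Fresh%']).head()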
Filtering, sampling and indexing by sample or feature is really easy too.
In [5]:
# Get 50 random rows
sample = df.sample(50, random_state = 1)
# random_state fixes the random seed, so we always get the same results
sample.head()
Out[5]:
In [6]:
# Getting a column (feature)
column = sample['total_items']
column.head()
Out[6]:
In [7]:
# Getting a row (sample)
sample.loc[8856]
Out[7]:
In [8]:
# Getting an individual value
sample.loc[8856, 'hour']
Out[8]:
In [9]:
# Filtering with a boolean condition (mask)
orders_with_more_than_50_items = sample.loc[sample['total_items'] > 50]
orders_with_more_than_50_items.head()
Out[9]:
In [10]:
# Notice the content of the previous comparison
(sample['total_items'] > 50).head()
Out[10]:
In [11]:
# [WORKSHOP] Can you find any order with only Drink products bought?
sample.loc[sample['Drinks%'] == 100]
Out[11]:
You should have found more than one order with just drinks. Somebody was really thirsty!
Moreover, pandas has several functions for doing statistics, helpful when exploring the data.
In [12]:
df.describe()
Out[12]:
As you can see, the dataset contains 30k rows (samples). We get the mean, standard deviation, min, max and other statistics for each feature.
Wait a moment! Look at the discount% column. It seems the maximum discount% is 100%, and the minimum discount% is -65%!! This looks weird...
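A quick sketch to double-check those extremes explicitly (same column, just direct calls):
In [ ]:
# Check the extreme values of discount% and count the negative ones
print(df['discount%'].min(), df['discount%'].max())
(df['discount%'] < 0).sum()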
If you want to have a look at other pandas features, I recommend its documentation and this cheatsheet.
matplotlib is a plotting library. It can use numpy arrays and pandas dataframes as input.
Actually, pandas comes with direct helpers to display data through matplotlib! If you want to learn more about the matplotlib integration in pandas, check the pandas visualization documentation and the matplotlib documentation.
For instance, let's explore the discount% feature using a histogram.
In [13]:
import matplotlib.pyplot as plt
# This is a jupyter magic command: matplotlib output is rendered inline, below the cell
%matplotlib inline
# Use 15 bins
df['discount%'].hist(bins=15)
Out[13]:
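If you want another view of the same feature, here is a small sketch using another pandas plotting helper (a box plot also highlights the extreme values):
In [ ]:
# Box plot of the same feature, also rendered through matplotlib
df['discount%'].plot(kind='box')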
After seeing the discount% feature displayed, it's clear that when the discount% is 100% it must be some kind of free order (like a gift for a VIP).
On the other hand, why are there some negative discount% values? This is not easy to understand until you ask the domain expert: some drinks have a surcharge (a negative discount) due to a law that taxes drinks with added sugar.
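If you want to check that explanation against the data, a small exploratory sketch (not a proof) is to peek at the orders with a negative discount% and their Drinks% share:
In [ ]:
# Orders with a surcharge: do they actually contain drinks?
df.loc[df['discount%'] < 0, ['discount%', 'Drinks%']].head()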
In [14]:
# [WORKSHOP] Plot a histogram to display the most common hours of the day orders are purchased.
df['hour'].hist(bins=24)
Out[14]:
The results should show a peak at 12~13h and one at 22h.
scikit-learn is a library with machine learning algorithms and helpers.
Before starting to use ML algorithms, let's prepare a smaller dataset without free orders and with just 100 rows, so the algorithms will run really fast. We will also keep only the 8 category partials.
Notice that many ML algorithms (especially distance-based ones like K-Means) work better when all features are on a similar scale, roughly between 0 and 1, so it's important to keep the values at this order of magnitude. An easy way to normalize the samples' data is to divide the percentage values by 100. Another option could be standardization (see documentation; a small sketch is shown after the next cell).
In ML jargon, the algorithm input is called X (a matrix with samples as rows and features as columns).
In [15]:
no_free_orders = df.loc[df['discount%'] < 100]
one_hundred = no_free_orders.sample(100, random_state = 1)
X = one_hundred[['Food%', 'Fresh%', 'Drinks%', 'Home%', 'Beauty%', 'Health%', 'Baby%', 'Pets%']].divide(100)
X.head()
Out[15]:
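As mentioned above, standardization is another option. Here is a minimal sketch using scikit-learn's StandardScaler (shown only as the alternative; the rest of the workshop keeps the divide-by-100 version of X):
In [ ]:
from sklearn.preprocessing import StandardScaler
# Rescale each feature to zero mean and unit variance (alternative to dividing by 100)
X_standardized = StandardScaler().fit_transform(X)
X_standardized[:5]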
In [16]:
from sklearn.cluster import KMeans
# Run K-Means asking for 7 clusters (random_state fixes the initialization, for reproducible results)
seven_clusters_alg = KMeans(n_clusters = 7, random_state = 1)
cluster_labels = seven_clusters_alg.fit_predict(X)
cluster_labels
Out[16]:
Each one of the 100 samples is assigned to a cluster (from 0 to 6). For instance, the first sample fell in cluster #3, the second in #4, etc.
Let's see how many samples fell in each cluster.
In [17]:
# cluster_labels is a numpy array, so first we wrap it in a dataframe and then plot a histogram of the cluster counts
pd.DataFrame(cluster_labels).hist(bins = 7)
Out[17]:
As you can see, clusters #1 and #3 have quite a lot of samples, while cluster #5 has just 5 samples.
Was choosing 7 clusters a good idea? If a cluster has a lot of samples, is it a correct clustering? How can we find the most "correct" number of clusters? Silhouette score to the rescue!
scikit-learn comes with a silhouette_score function that evaluates how well the samples fit in their clusters. Its result is a score: the higher the value, the better. If you want to understand this function better, have a look at the visual example that comes with its documentation.
In our case, let's write a small script that tries different numbers of clusters, from 2 to 19, and prints the score for each case.
In [18]:
from sklearn.metrics import silhouette_score
range_n_clusters = range(2,20)
for n_clusters in range_n_clusters:
    cluster_alg = KMeans(n_clusters = n_clusters, random_state = 1)
    cluster_labels = cluster_alg.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "the average silhouette_score is :", silhouette_avg)
Apparently, choosing 7 clusters was quite a good guess, but the best score goes to 6 clusters.
In [19]:
# [WORKSHOP] Use KMeans algorithm with 6 clusters (and random_state = 1), and plot clusters' histogram (with 6 bins)
six_clusters_alg = KMeans(n_clusters = 6, random_state = 1)
cluster_labels = six_clusters_alg.fit_predict(X)
pd.DataFrame(cluster_labels).hist(bins = 6)
Out[19]:
Changing from 7 clusters to 6 doesn't look like a big improvement, but the fact that some clusters have more samples than others doesn't mean they were badly chosen. It's normal that some kinds of customers are more common than others.
A good way to understand those 6 clusters better is to look at their centers.
In [20]:
six_clusters_alg = KMeans(n_clusters = 6, random_state = 1)
cluster_labels = six_clusters_alg.fit_predict(X)
centers = pd.DataFrame(six_clusters_alg.cluster_centers_, columns=X.columns)
centers.multiply(100).round(0)
Out[20]:
Looking at the 6 clusters' centers, everything makes more sense. Let's see what is bought in each one:
Notice too that some features are totally irrelevant: Health% and Pets%. So we can consider ignoring them in the future.
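One quick way to back up that claim (a small sketch over the same 100-row sample) is to look at the basic statistics of those two columns:
In [ ]:
# Health% and Pets% barely contribute in this sample
X[['Health%', 'Pets%']].describe()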
In [21]:
from sklearn.cluster import DBSCAN
# DBSCAN is a density-based algorithm: it doesn't need the number of clusters up front,
# and it marks low-density samples as noise (label -1).
# min_samples=3: at least 3 samples within eps=0.3 distance are needed to form a dense region
dbscan_alg = DBSCAN(eps = 0.3, min_samples = 3)
cluster_labels = dbscan_alg.fit_predict(X)
cluster_labels
Out[21]:
Ouch! This algorithm only found 3 clusters (#0, #1 and #2), and some samples are marked as outliers (#-1).
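To see exactly how many samples landed in each group, including the -1 outliers, one option (a small sketch) is value_counts:
In [ ]:
# Count samples per DBSCAN label (-1 means noise/outlier)
pd.Series(cluster_labels).value_counts()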
Let's try to remove Health% and Pets% as we have seen those features are irrelevant.
In [22]:
X2 = X[['Food%', 'Fresh%', 'Drinks%', 'Home%', 'Baby%']]
cluster_labels = dbscan_alg.fit_predict(X2)
cluster_labels
Out[22]:
Well, the result has improved a bit, but there is still room for improvement...
In [23]:
# [WORKSHOP] Feel free to try alternative configurations for this algorithm...
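# One possible direction (just a sketch, not the only answer): sweep eps and
# count how many clusters and outliers DBSCAN finds for each value.
for eps in [0.2, 0.3, 0.4, 0.5]:
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X2)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_outliers = (labels == -1).sum()
    print("eps =", eps, "-> clusters:", n_clusters, ", outliers:", n_outliers)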
As you have experienced, playing with Machine Learning algorithms is a matter of spending time learning about the data and finding the right parameters. Also notice that we were working with just 100 samples... things can get slow with really big data.
I hope you liked this workshop!