In this tutorial we briefly show how to obtain a similarity map of the users in your WhatsApp group. We will use the whatstk library included in this project.
Below we provide some theoretical background. For further and more detailed documentation, you can refer to references such as Kohonen 1982, Kohonen 1998, Rojas 1996, Joschka Boedecker 2015, or simply take a look at the Wikipedia page.
A self-organizing map (SOM) is an unsupervised learning method that performs dimensionality reduction onto a predefined topological output space.
The picture below (from this course) illustrates the main idea of a SOM.
In this picture, $\boldsymbol{x} = [x_1, \dots, x_n]$ denotes the input vector of features. Note that the network contains only a single layer.
The network is divided into two stages: (1) competitive learning and (2) the topological output space.
Each neuron (unit) $k$ is represented by a prototype vector $\boldsymbol{w}_k$. When feeding the network with the input vector $\boldsymbol{x}$, the closest unit (the unit minimizing $||\boldsymbol{x}-\boldsymbol{w}_k||$) is known as the winning unit. In a Winner-Takes-All (WTA) approach, only the prototype vector of the winning unit is updated, i.e.
$$ \Delta \boldsymbol{w}_{win} = \eta(\boldsymbol{x}-\boldsymbol{w}_{win}).$$Note that in the WTA approach, dead units (units that never win and are therefore never updated) might easily appear. Thus, at some point we need to allow dead units to learn a bit so that they can start claiming their territory!
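As a minimal sketch of this step (plain NumPy, not whatstk's implementation; the names W, x and eta are ours), a single WTA update could look like this:

import numpy as np

def wta_update(W, x, eta):
    """Move only the winning prototype towards the input sample x."""
    # W: (num_units, num_features) prototype vectors; x: (num_features,) input vector
    distances = np.linalg.norm(W - x, axis=1)  # distance from x to every prototype
    win = np.argmin(distances)                 # index of the winning unit
    W[win] += eta * (x - W[win])               # update rule for the winner only
    return win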
An alternative to the WTA approach consists in allowing other units to be updated as well. In particular, all units are updated according to their proximity to the winning unit in the output space. The update rule now reads
$$ \Delta \boldsymbol{w}_{k} = \eta h_k(\boldsymbol{x}-\boldsymbol{w}_{k}),$$where the term $h_k$ quantifies the proximity of the unit $\boldsymbol{w}_k$ to the winning unit in the output space (high if they are close) and $\eta$ represents the learning rate.
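To make $h_k$ concrete, here is a hedged sketch of the full neighbourhood update, assuming a Gaussian neighbourhood function of the output-space distance (a common choice, not necessarily the one used by whatstk); positions holds the coordinates of each unit in the output space:

def som_update(W, positions, x, eta, sigma):
    """Update all prototypes, weighted by their output-space proximity to the winner."""
    # positions: (num_units, output_dim) coordinates of the units in the output space
    win = np.argmin(np.linalg.norm(W - x, axis=1))              # winning unit
    d_out = np.linalg.norm(positions - positions[win], axis=1)  # distances in the output space
    h = np.exp(-d_out**2 / (2 * sigma**2))                      # Gaussian neighbourhood h_k
    W += eta * h[:, None] * (x - W)                             # every unit moves, the winner the most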
This simple but powerful idea allows for an easy visualization of high-dimensional data in an eye-friendly format. Typical output spaces are lines, circles or 2D grids (like in the picture).
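Before turning to whatstk, the following toy run (illustrative only: random data, a line topology, and a linear decay of the learning rate and neighbourhood width) ties the two ideas together; it reuses the som_update sketch above:

rng = np.random.RandomState(0)
X = rng.rand(100, 10)                                    # 100 random 10-dimensional samples
num_units, num_epochs = 5, 1000
W = rng.rand(num_units, X.shape[1])                      # prototype vectors
positions = np.arange(num_units, dtype=float)[:, None]   # units arranged on a line

for epoch in range(num_epochs):
    frac = 1.0 - epoch / float(num_epochs)
    eta = 1.0 * frac                                     # learning rate decays to 0
    sigma = (num_units / 2.0) * frac + 1e-3              # neighbourhood width shrinks
    for x in X[rng.permutation(len(X))]:
        som_update(W, positions, x, eta, sigma)

# Each sample is mapped to the unit with the closest prototype
assignments = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)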
In [1]:
from __future__ import print_function
In [2]:
import sys
sys.path.append('../')
In [3]:
import whatstk.parser as wp
In [4]:
chat = wp.WhatsAppChat("../chats/samplechat.txt")
In [5]:
# Obtain the names of the users from the chat
users = chat.usernames
# Obtain list of days with interventions
days = chat.days
# Obtain number of interventions in the chat
num_interventions = chat.num_interventions
and print it
In [6]:
print("Brief summary")
# Print name of users
print("\n *", len(users),"users found: ")
[print("\t", user) for user in users]
# Number of days the chat has been active
print("\n * Chat was active", len(days), "days")
# Number of interventions
print(" * Chat had", num_interventions, "interventions")
# Average number of interventions per day
int_day = num_interventions / float(len(days))  # float() avoids integer division on Python 2
print(" * Chat had an average of %.2f" % int_day, "interventions/day")
# Average number of interventions per day per user
int_day_pers = int_day / len(users)
print(" * Chat had an average of %.2f" % int_day_pers, "interventions/day/person")
In [7]:
# Dataframe containing the number of user interventions per day (only days with chat activity are considered)
interventions_per_day = wp.user_interventions(chat, 'days')
In [8]:
# Show dataframe
interventions_per_day
Out[8]:
To ease the learning, we center and normalize each dimension.
In [9]:
# Center each dimension
interventions_per_day = interventions_per_day.sub(interventions_per_day.mean(axis=1), axis=0)
# Normalize each dimension
interventions_per_day = interventions_per_day.divide(interventions_per_day.max(axis=1)-interventions_per_day.min(axis=1), axis=0)
# Show dataframe
interventions_per_day
Out[9]:
In [10]:
from whatstk.learn.som import SelfOrganizingMap
In [11]:
# We define the number of units that we will be using. A large number leads to a good global
# fit but a poor local fit (a low number leads to the opposite)
num_units = 5
# We choose an output space defined by an array of neurons arranged along a line
topology = 'line'
In [12]:
# Initialize our SOM
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
learning_rate_initial=1, topology=topology)
In [13]:
# Train our SOM
som.train()
Finally, we can print the results of the similarity map.
In [14]:
som.print_results()
There are other topologies available. Let us try them.
In [15]:
# Circle topology is the same as line, except that the first and last units are connected
topology = 'circle'
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
learning_rate_initial=1, topology=topology)
som.train()
som.print_results()
In [16]:
# We now try a 2D-grid
# Number of units now denotes the number of units per side. In total, we have
# num_units*num_units units
num_units = 5
topology = '2dgrid'
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
learning_rate_initial=1, topology=topology)
som.train()
som.print_results()