Similarity Map of WhatsApp Users based

In this tutorial we briefly show how to obtain a similarity map of the users in your WhatsApp group. We will use the library whatstk included in this project.

Below we provide some theoretical background. For more detailed documentation, you can refer to Kohonen 1982, Kohonen 1998, Rojas 1996, Joschka Boedecker 2015, or simply take a look at the Wikipedia page.

1. Self-Organizing-Maps

A self-organizing map (SOM) is an unsupervised learning method that performs dimensionality reduction onto an output space with a predefined topology.

The picture below (from this course) illustrates the main idea of SOM.

In this picture, $\boldsymbol{x} = [x_1, \dots, x_n]$ denotes the input vector of features. Note that the network contains only a single layer.

The network is divided into two stages: (1) Competitive Learning and (2) Topological output space.

1.1 Competitive Learning

Each neuron (unit) $k$ is represented by a prototype vector $\boldsymbol{w}_k$. When feeding the network with the input vector $\boldsymbol{x}$, the closest unit (unit minimizing $||\boldsymbol{x}-\boldsymbol{w}_k||$) is known as the winning unit. In a Winner-Takes-All (WTA) approach, only the winning unit prototype vector is updated, i.e.

$$ \Delta \boldsymbol{w}_{win} = \eta(\boldsymbol{x}-\boldsymbol{w}_{win}).$$

Note that in the WTA approach, dead units (units that never win and therefore never learn) might easily appear. Thus, at some point we need to allow dead units to learn a bit so that they can start claiming their territory!
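
The WTA update rule can be sketched in a few lines of NumPy. This is only an illustration of the formula, not code from whatstk; the function name `wta_update` and the toy prototypes are assumptions made for this example.

```python
import numpy as np

def wta_update(prototypes, x, eta=0.1):
    """Move only the winning prototype towards x and return the winner's index."""
    # Distance from the input x to every prototype w_k
    distances = np.linalg.norm(prototypes - x, axis=1)
    win = int(np.argmin(distances))
    # Delta w_win = eta * (x - w_win); all other prototypes stay untouched
    prototypes[win] += eta * (x - prototypes[win])
    return win

# Toy example (hypothetical values): three units in a 2D feature space
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
x = np.array([0.8, 1.2])
winner = wta_update(prototypes, x, eta=0.5)
print(winner, prototypes[winner])
```

In this toy run the unit at (5, 5) is far from the input and is not updated; if no input ever falls near it, it remains a dead unit.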

1.2 Kohonen Map

An alternative to the WTA approach relies on allowing other units to be updated as well. In particular, all units are updated according to their proximity to the winning unit in the output space. Now the update rule is

$$ \Delta \boldsymbol{w}_{k} = \eta h_k(\boldsymbol{x}-\boldsymbol{w}_{k}),$$

where the term $h_k$ quantifies the proximity of the unit $\boldsymbol{w}_k$ to the winning unit in the output space (high if they are close) and $\eta$ represents the learning rate.

This simple but powerful idea allows for easy visualization of high-dimensional data in an eye-friendly format. Typical output spaces are lines, circles or 2D grids (like in the picture).
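
A minimal sketch of the Kohonen update on a line-shaped output space follows. It assumes a Gaussian neighbourhood for $h_k$ with width `sigma`, which is a common choice but is not stated in this tutorial; the function name `kohonen_update` and the toy data are likewise assumptions for illustration, not the whatstk implementation.

```python
import numpy as np

def kohonen_update(prototypes, x, eta=0.1, sigma=1.0):
    """Update every prototype, weighted by its output-space proximity to the winner."""
    win = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    # Units sit at positions 0, 1, ..., K-1 on a line (the output space)
    positions = np.arange(len(prototypes))
    # Assumed Gaussian neighbourhood: h_k = 1 for the winner, decaying with distance
    h = np.exp(-((positions - win) ** 2) / (2 * sigma ** 2))
    # Delta w_k = eta * h_k * (x - w_k)
    prototypes += eta * h[:, None] * (x - prototypes)
    return win, h

# Toy example (hypothetical values): five units at (0,0), (1,0), ..., (4,0)
prototypes = np.zeros((5, 2))
prototypes[:, 0] = np.arange(5)
x = np.array([2.0, 1.0])
win, h = kohonen_update(prototypes, x, eta=0.5, sigma=1.0)
print(win, h.round(3))
```

Every unit moves towards the input, but units further from the winner in the output space move less, which is what keeps neighbouring units similar.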

2. Code

Let us now begin this brief tutorial. Let us first import the basic libraries that we will be using.

2.1 Initialization


In [1]:
from __future__ import print_function

In [2]:
import sys
sys.path.append('../')

In [3]:
import whatstk.parser as wp

Create chat object

We first create a WhatsAppChat object from our chat log file. For testing purposes, we provide a sample chat log file. However, please feel free to try with your own chats.


In [4]:
chat = wp.WhatsAppChat("../chats/samplechat.txt")

Obtain some basic data

Let us now obtain some basic data from the chat.


In [5]:
# Obtain the names of the users from the chat
users = chat.usernames
# Obtain list of days with interventions
days = chat.days
# Obtain number of interventions in the chat
num_interventions = chat.num_interventions

and print it


In [6]:
print("Brief summary")
# Print name of users
print("\n *", len(users),"users found: ")
for user in users:
    print("\t", user)
# Number of days the chat has been active
print("\n * Chat was active", len(days), "days")
# Number of interventions
print(" * Chat had", num_interventions, "interventions")
# Average number of interventions per day
int_day = num_interventions / len(days)
print(" * Chat had an average of %.2f" % int_day, "interventions/day")
# Average number of interventions per day per user
int_day_pers = int_day/len(users)
print(" * Chat had an average of %.2f" % int_day_pers, "interventions/day/person")


Brief summary

 * 8 users found: 
	 Ash Ketchum
	 Brock
	 Jessie & James
	 Meowth
	 Misty
	 Prof. Oak
	 Raichu
	 Wobbuffet

 * Chat was active 6 days
 * Chat had 18 interventions
 * Chat had an average of 3.00 interventions/day
 * Chat had an average of 0.38 interventions/day/person

2.2 Obtain input data

We start by obtaining the number of interventions of each user per day of chat activity. We can do this by calling the function user_interventions from whatstk.parser, which returns a DataFrame with one column per username.


In [7]:
# Dataframe containing the number of user interventions per day (only days of chat activity are considered)
interventions_per_day = wp.user_interventions(chat, 'days')

In [8]:
# Show dataframe
interventions_per_day


Out[8]:
Ash Ketchum Brock Misty Prof. Oak Jessie & James Raichu Meowth Wobbuffet
2016-08-06 2.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0
2016-08-07 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
2016-08-10 1.0 0.0 0.0 0.0 1.0 2.0 0.0 0.0
2016-08-11 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2016-09-11 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2016-10-31 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0

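To ease the learning, we center and normalize each dimension (here, each day): we subtract the mean of the row and then divide by the range of the row (maximum minus minimum).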


In [9]:
# Center each dimension
interventions_per_day = interventions_per_day.sub(interventions_per_day.mean(axis=1), axis=0)
# Normalize each dimension
interventions_per_day = interventions_per_day.divide(interventions_per_day.max(axis=1)-interventions_per_day.min(axis=1), axis=0)
# Show dataframe
interventions_per_day


Out[9]:
Ash Ketchum Brock Misty Prof. Oak Jessie & James Raichu Meowth Wobbuffet
2016-08-06 0.625 0.625 0.625 -0.375 -0.375 -0.375 -0.375 -0.375
2016-08-07 0.500 0.500 0.500 0.500 -0.500 -0.500 -0.500 -0.500
2016-08-10 0.250 -0.250 -0.250 -0.250 0.250 0.750 -0.250 -0.250
2016-08-11 -0.125 -0.125 0.875 -0.125 -0.125 -0.125 -0.125 -0.125
2016-09-11 -0.125 -0.125 -0.125 -0.125 -0.125 -0.125 0.875 -0.125
2016-10-31 -0.250 -0.250 -0.250 0.750 -0.250 -0.250 -0.250 0.750
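
As a quick sanity check, the first row of the normalized table can be reproduced by hand. The sketch below uses plain NumPy purely for illustration; the input values are copied from the 2016-08-06 row of Out[8].

```python
import numpy as np

# Interventions on 2016-08-06, one value per user (first row of Out[8])
row = np.array([2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Center: subtract the mean of the row (0.75)
centered = row - row.mean()
# Normalize: divide by the range of the centered row (1.25 - (-0.75) = 2.0)
scaled = centered / (centered.max() - centered.min())
print(scaled)
```

The three active users end up at 0.625 and the remaining five at -0.375, matching the first row of Out[9].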

2.3 Self-Organizing Map

Once we have our WhatsAppChat object and the input data, we are ready to have some fun.


In [10]:
from whatstk.learn.som import SelfOrganizingMap

In [11]:
# We define the number of units that we will be using. A large number leads to a good global
# fit but a poor local fit (a low number leads to the opposite)
num_units = 5
# We choose an output space defined by an array of neurons arranged in a line
topology = 'line'

In [12]:
# Initialize our SOM
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
    learning_rate_initial=1, topology=topology)

In [13]:
# Train our SOM
som.train()


* Training *
- Starting parameters: 
	 learning rate = 1
	 sigma = 2.5
- Ending parameters: 
	 learning rate = 0.00034
	 sigma = 0.25058
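
The training log shows both the learning rate and sigma shrinking between the first and the last epoch. The tutorial does not say which schedule whatstk uses, so the following is only a sketch of one common option, a geometric (exponential) decay between a start and an end value. The helper `exponential_decay` is hypothetical, and the end values are simply taken from the log above.

```python
def exponential_decay(start, end, num_epochs):
    """Values decaying geometrically from start (epoch 0) to end (last epoch)."""
    ratio = end / start
    # Assumes num_epochs > 1; the exponent runs from 0 to 1 across the epochs
    return [start * ratio ** (t / (num_epochs - 1)) for t in range(num_epochs)]

# Start and end values taken from the training log (assumed schedule, see above)
learning_rates = exponential_decay(1.0, 0.00034, 1000)
sigmas = exponential_decay(2.5, 0.25058, 1000)
print(learning_rates[0], learning_rates[-1])
print(sigmas[0], sigmas[-1])
```

A large initial sigma lets the whole map organize itself early on, while the small final values only fine-tune the units closest to each input.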

Finally, we can print the results of the similarity map. Each line shows the index of a unit followed by the users mapped to it (a unit may be empty).


In [14]:
som.print_results()


* Results Self Organizing Map *

0 - Jessie & James, Raichu, Meowth
1 - Wobbuffet
2 - Prof. Oak
3
4 - Ash Ketchum, Brock, Misty

There are other topologies available. Let us try them.


In [15]:
# Circle topology is the same as line, except that the first and last units are connected
topology = 'circle'
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
    learning_rate_initial=1, topology=topology)
som.train()
som.print_results()


* Training *
- Starting parameters: 
	 learning rate = 1
	 sigma = 2.5
- Ending parameters: 
	 learning rate = 0.00034
	 sigma = 0.25058

* Results Self Organizing Map *

0 - Ash Ketchum, Brock, Misty
1
2 - Jessie & James, Raichu, Meowth
3 - Wobbuffet
4 - Prof. Oak
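
The difference between the line and circle topologies comes down to how the distance between two units is measured in the output space. The sketch below is an illustration under that assumption and is not taken from whatstk; the helper `unit_distances` is hypothetical.

```python
import numpy as np

def unit_distances(num_units, topology="line"):
    """Pairwise output-space distances between units on a line or a circle."""
    idx = np.arange(num_units)
    # On a line, the distance is the absolute difference of the unit indices
    d = np.abs(idx[:, None] - idx[None, :])
    if topology == "circle":
        # On a circle, we may also go around the other way
        d = np.minimum(d, num_units - d)
    return d

line = unit_distances(5, "line")
circle = unit_distances(5, "circle")
print(line[0, 4], circle[0, 4])  # 4 1
```

With five units, units 0 and 4 are the two ends of the line but direct neighbours on the circle, so in the circle results above the users at unit 0 and unit 4 sit next to each other.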

In [16]:
# We now try a 2D-grid
# Number of units now denotes the number of units per side. In total, we have 
# num_units*num_units units

num_units = 5 
topology = '2dgrid'
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
    learning_rate_initial=1, topology=topology)
som.train()
som.print_results()


* Training *
- Starting parameters: 
	 learning rate = 1
	 sigma = 2.5
- Ending parameters: 
	 learning rate = 0.00034
	 sigma = 0.25058

* Results Self Organizing Map *

             0      1          2       3               4
0  Ash Ketchum                    Raichu  Jessie & James
1               Brock                                   
2                                                 Meowth
3        Misty                                          
4                      Prof. Oak               Wobbuffet