For a full experience please view the notebook on Jupyter Notebook Viewer

0. Prerequisites

0.0 Dependencies

  • numpy
  • pandas
  • networkx
  • seaborn
  • scikit-learn (sklearn)
  • matplotlib
  • tqdm
  • plotly
  • python-louvain (community)

0.1 Local Files Imports


In [150]:
import spectral
import plots
import learning
import models
import graph
import preprocess
import utils
%reload_ext autoreload
%autoreload 2

0.2 Libraries Imports


In [151]:
import scipy
import matplotlib

import numpy as np
import pandas as pd
import networkx as nx
import seaborn as sns

from collections import Counter
from sklearn import preprocessing, model_selection
from plotly.offline import init_notebook_mode

from tqdm import tqdm_notebook as tqdm
from matplotlib import pyplot as plt

In [152]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn import metrics

In [153]:
init_notebook_mode(connected=True)



In [154]:
%matplotlib inline
sns.set(rc={"figure.figsize": (15, 6)})
sns.set_palette(sns.color_palette("Set2", 10))

In [155]:
np.random.seed(42)

1. Introduction

Reading is too mainstream. What if you could get the important ideas from a text without even reading it? What about comparing several documents based on their textual content? Or maybe you just want to visualize the concepts present in a book and their interaction? GraphLang is the tool you need to boost your texts using graphs.

How?

This project is all about analysing textual resources using graphs and extracting useful insights. The main idea is to represent a document through the co-occurrences of its words, turn that into a graph, and leverage the power of graph analysis tools to better understand the document.

At first the graph could be built by only considering words directly adjacent to each other and representing this proximity with a link in the graph, where the nodes are the words themselves. The recipe could then be refined by also considering words at distance N from each other (N would have to be defined) and defining edge weights as a function of N. Punctuation could also be taken into account and would influence the weight of edges (two words, one at the end of a sentence and the other at the beginning of the next one, should (maybe) not have a strong edge between them). This graph could be extended to take multiple documents into account at once using signals on the edges.
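To make the idea concrete, here is a minimal sketch of such a co-occurrence graph construction. This is an illustration only: the project's actual implementation lives in the local graph module, and the function and parameter names below are hypothetical.


In [ ]:
import networkx as nx

def cooccurrence_graph(tokens, window=4):
    """Link two words with an edge if they appear within `window` positions
    of each other; the edge weight counts how often this happens."""
    G = nx.Graph()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            u, v = w, tokens[j]
            if u == v:
                continue  # skip self-links
            if G.has_edge(u, v):
                G[u][v]["weight"] += 1
            else:
                G.add_edge(u, v, weight=1)
    return G

# toy usage on an already tokenized sentence
G_demo = cooccurrence_graph("the quick brown fox jumps over the lazy dog".split(), window=2)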

On the other hand, we could take a set of documents and create for each one a set of features characterizing it. Inspired by homework 03, we could then build a graph from those features and, using spectral decomposition, represent each document in a reduced space.

2. GraphLang v1

Our first approach consists of analyzing a single textual resource to extract the important concepts out of it and see how these concepts are linked.

The approach is simple: we start by building an undirected co-occurrence graph using a custom window size N (two words are connected by an edge if they appear at a distance of at most N). We then compute for each node a metric called the betweenness (more on this later). This metric helps us identify several communities of words inside the graph. Finally, we construct an induced graph where each community is reduced to one node, represented by its most important word (importance being defined by the degree). This final graph is the one that displays the concepts of the text.

2.1 Data Acquisition

We will use three textual resources.

Scikit-Learn provides a great collection of textual resources: a corpus called 20NewsGroups, containing a large variety of labelled news items. These have the advantage of being concise and quick to analyze individually.

The other texts we consider are two of the most popular books in the world: the Bible and the Quran.


In [156]:
bible = utils.load_text("books/king-james-bible-processed.txt")
quran = utils.load_text("books/quran-shakir.txt")
news_chunk = fetch_20newsgroups(subset='all')

2.2 Graph construction

We have considered several ways of building the graph; here we keep only the one that gave us good results.

We use a window size of 4 for the co-occurrences, ignore all stopwords and punctuation (note that they are still taken into account when computing the co-occurrences: two words separated by a stopword will have a lower co-occurrence value than if they were directly adjacent), and finally build an undirected graph.


In [157]:
nlinks = 4 # Window size to build links in the graph

# The words_map variables are dictionaries from the actual words to the nodes (labeled as integers)
occs_bible, words_map_bible = graph.text_to_graph(bible, undirected=True, subsample=0.03, ignore_punct=True, ignore_stopwords=True, self_links=False, nlinks=nlinks, return_words_map=True)
occs_quran, words_map_quran = graph.text_to_graph(quran, undirected=True, subsample=0.15, ignore_punct=True, ignore_stopwords=True, self_links=False, nlinks=nlinks, return_words_map=True)
occs_news_1, words_map_news_1 = graph.text_to_graph(news_chunk.data[1], undirected=True, ignore_punct=True, ignore_stopwords=True, self_links=False, nlinks=nlinks, return_words_map=True)
occs_news_2, words_map_news_2 = graph.text_to_graph(news_chunk.data[2], undirected=True, ignore_punct=True, ignore_stopwords=True, self_links=False, nlinks=nlinks, return_words_map=True)

# Those will be useful later, when plotting the graphs
words_map_inv_bible = utils.inverse_dict(words_map_bible)
words_map_inv_quran = utils.inverse_dict(words_map_quran)
words_map_inv_news_1 = utils.inverse_dict(words_map_news_1)
words_map_inv_news_2 = utils.inverse_dict(words_map_news_2)

G_bible = graph.np_to_nx(occs_bible, words_map_bible)
G_quran = graph.np_to_nx(occs_quran, words_map_quran)
G_news_1 = graph.np_to_nx(occs_news_1, words_map_news_1)
G_news_2 = graph.np_to_nx(occs_news_2, words_map_news_2)

We now build a colormap that will be useful when we draw the communities.


In [158]:
colors = list(plt.cm.Set1.colors)
colors.extend(list(plt.cm.Set3.colors))
cmap = {i: matplotlib.colors.to_hex(colors[i]) for i in range(len(colors))}

2.3 Extracting insights

Let's get straight to the exciting part! In this section, we will analyze the various graphs and extract the important concepts in each of them.

2.3.1 20 news group

The first step of our analysis consists of computing the betweenness values for all nodes of our graph.

Betweenness centrality can be thought of as a measure of how often a node appears on the shortest path between any two randomly chosen nodes in the network. As such, nodes (words) with high betweenness often sit at the intersection of different topics and have high importance in the considered text.

Formally it is defined in the following way: $$c_B(v) = \sum_{s,t \in V} \frac{\sigma(s, t|v)}{\sigma(s, t)}$$ where $V$ is the set of nodes, $\sigma(s, t)$ is the number of shortest $(s,t)$-paths, and $\sigma(s, t|v)$ is the number of those paths passing through some node $v$ other than $s,t$. If $s=t$, then $\sigma(s, t) = 1$, and if $v \in \{s, t\}$, then $\sigma(s, t|v) = 0$.
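For reference, here is a minimal sketch of this computation with networkx; graph.compute_betweenness in our local module presumably wraps a similar call and stores the values on the nodes.


In [ ]:
import networkx as nx

G_toy = nx.karate_club_graph()
# betweenness_centrality returns a dict {node: centrality in [0, 1]}
betweenness_toy = nx.betweenness_centrality(G_toy, normalized=True)
# store the values as a node attribute, as we do in our own graphs
nx.set_node_attributes(G_toy, betweenness_toy, name="betweenness")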


In [159]:
betweenness_news_1 = graph.compute_betweenness(G_news_1)
betweenness_news_2 = graph.compute_betweenness(G_news_2)

The betweenness value has now been stored in the graphs! Let's compute the communities.

Our approach computes the partition of the graph nodes that maximizes the modularity, using the Louvain heuristic. This is the partition of highest modularity, i.e. the top level of the dendrogram generated by the Louvain algorithm.
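This is what the python-louvain package (listed in the dependencies) provides; a minimal, self-contained sketch of the kind of call graph.community_partition presumably makes:


In [ ]:
import networkx as nx
import community  # python-louvain

G_toy = nx.karate_club_graph()
# best_partition returns a dict {node: community_id} that maximizes the modularity
partition_toy = community.best_partition(G_toy)
print(len(set(partition_toy.values())), "communities found")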


In [160]:
partition_news_1 = graph.community_partition(G_news_1, weight="betweenness")
partition_news_2 = graph.community_partition(G_news_2, weight="betweenness")

We'll later scale the nodes' sizes in the graph using the betweenness values. For this, we rescale the range of values to obtain proper node sizes.
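A minimal rescaling sketch (graph.scale_betweenness presumably does something along these lines; the max_ bound here is an assumption made for the illustration):


In [ ]:
def scale_values(values, min_=25, max_=1200):
    """Linearly rescale a dict of values into [min_, max_] to use as node sizes."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return {k: min_ + (v - lo) / span * (max_ - min_) for k, v in values.items()}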


In [161]:
betweenness_scaled_news_1 = graph.scale_betweenness(betweenness_news_1, min_=25)
betweenness_scaled_news_2 = graph.scale_betweenness(betweenness_news_2, min_=25)

Now we can visualize the graph, where nodes sharing the same community have the same color. The size of the nodes is also a function of the betweenness.


In [162]:
graph.communities(G_news_1, draw=True, cmap=cmap, partition=partition_news_1, betweenness_scaled=betweenness_scaled_news_1);


Take a moment to appreciate how well the graph was clustered using this approach.

We are ready to compute the induced graph! For this, we reduce each community to a single node and label it with its highest-degree node (degrees taken on the weighted graph), considered as its representative.
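python-louvain also ships an induced_graph helper that collapses each community into one node; here is a hedged sketch of what our graph.induced_graph roughly does (the relabeling by the highest-degree member is our own illustration, not necessarily the exact implementation):


In [ ]:
import networkx as nx
import community  # python-louvain

def concept_graph(G, partition):
    """Collapse each community to a single node, labeled by its highest-degree member."""
    induced = community.induced_graph(partition, G, weight="weight")
    degrees = dict(G.degree(weight="weight"))
    representative = {}
    for com in set(partition.values()):
        members = [n for n, c in partition.items() if c == com]
        representative[com] = max(members, key=degrees.get)
    return nx.relabel_nodes(induced, representative)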


In [247]:
np.random.seed(5)
graph.induced_graph(G_news_1, partition_news_1, draw=True, cmap=cmap, words_map_inv=words_map_inv_news_1);


This is the graph of concepts we were looking for! Compare this with the actual content of the news group item here (or in the /news folder) and note that two of the three keywords were captured.

Let's now do the same for the second item of the news group.


In [165]:
graph.communities(G_news_2, draw=True, cmap=cmap, partition=partition_news_2, betweenness_scaled=betweenness_scaled_news_2);



In [166]:
graph.induced_graph(G_news_2, partition_news_2, draw=True, cmap=cmap, words_map_inv=words_map_inv_news_2);


A quick look at this graph already gives us a precise idea of the content of the item, awesome! Again, for your convenience the text is available here (or in the /news folder) if you want to compare.

2.3.2 The Bible

We have warmed up using short texts. Let's now test our approach more seriously (of course seriously here means bigger texts).

Now that you've understood the process, this time we'll use "shortcut" methods to visualize the clustered words and the induced graph of concepts with fewer lines of code :)


In [170]:
pos_bible, partition_bible, betweenness_scaled_bible = graph.communities(G_bible, draw=True, cmap=cmap)



In [242]:
np.random.seed(5)
G_induced_bible = graph.induced_graph(G_bible, partition_bible, rescale_node_size=0.02, draw=True, cmap=cmap, words_map_inv=words_map_inv_bible)


Impressive! The extracted concepts are those you would expect to be among the most important ones in the Bible.

2.3.3 The Quran


In [172]:
pos_quran, partition_quran, betweenness_scaled_quran = graph.communities(G_quran, draw=True, cmap=cmap)



In [248]:
np.random.seed(7)
G_induced_quran = graph.induced_graph(G_quran, partition_quran, rescale_node_size=0.03, draw=True, cmap=cmap, words_map_inv=words_map_inv_quran)


Again, the results are very conclusive. While we have only considered a small subset of the book, the extracted concepts seem very pertinent.

2.3.4 Bonus text :)

We have decided to analyze the description of your (our?) favorite library. What do you think of the result?


In [249]:
pygsp_text = utils.load_text("books/pygsp.txt")

occs_pygsp, words_map_pygsp = graph.text_to_graph(pygsp_text, undirected=True, ignore_punct=True, ignore_stopwords=True, self_links=False, nlinks=nlinks, return_words_map=True)
words_map_inv_pygsp = utils.inverse_dict(words_map_pygsp)

G_pygsp = graph.np_to_nx(occs_pygsp, words_map_pygsp)

pos_pygsp, partition_pygsp, betweenness_scaled_pygsp = graph.communities(G_pygsp, draw=True, cmap=cmap)



In [254]:
np.random.seed(7)
G_induced_pygsp = graph.induced_graph(G_pygsp, partition_pygsp, rescale_node_size=0.8, draw=True, cmap=cmap, words_map_inv=words_map_inv_pygsp)


2.4 Spectral analysis ?

We wanted to go one step further and tried to use spectral analysis to visualize our graph. Our hope was that this approach would highlight various clusters of words.

We tried various alternatives for the graph construction before running the graph analysis:

  • simple graph of co-occurrences (with various parameters)
  • graph of distances between words, where the distances are deduced from the co-occurrence values (nodes with a high co-occurrence value are at a low distance)

Sadly, the results were inconclusive. In fact, all the words were always projected very close to one another, forming one big ugly mass (with the exception of a few outliers, mostly very low-degree nodes).

Fortunately, the basic approach gives very good results. And as you will see in the next section, we still satisfy our desire for spectral analysis!

3. GraphLang v2

In the previous section, we have conducted an analysis on individual texts and extracted the important concepts from them.

We will now follow a different approach: we will compare a document against other documents in the same corpus.

3.1 Data Acquisition


In [174]:
news_target = news_chunk.target
news_target_names = news_chunk.target_names

In [175]:
y_all = news_target

In [176]:
df_y = pd.DataFrame(pd.DataFrame(y_all)[0].value_counts())
df_y.reset_index(inplace=True)
df_y.columns = ['label', 'counts']

In [177]:
parent_cat_names = ['Computer', 'Recreational', 'Religion', 'Politics', 'Science', 'Sale']
parent_cat_keyw = ['comp.', 'rec.', 'religion', '.politics.', 'sci.', 'misc.forsale']

def sub_to_parent(name):
    if 'atheism' in name:
        return 'Religion'
    for p, kw in zip(parent_cat_names, parent_cat_keyw):
        if kw in name:
            return p
    raise ValueError('Keyword not found: ' + str(name))

In [178]:
df_y_names = pd.DataFrame(news_target_names, columns=['cat'])
df_y_names['parent_cat'] = df_y_names['cat'].apply(sub_to_parent)
df_y_names['parent_label'] = df_y_names['parent_cat'].apply(lambda x: parent_cat_names.index(x))

In [179]:
df_y_names = df_y_names.reset_index().set_index(['parent_cat', 'cat'])

In [180]:
df_y_names.columns = ['label', 'parent_label']

In [181]:
df_y_merged = pd.merge(df_y_names.reset_index(), df_y, on='label').set_index(['parent_cat', 'cat'])
df_y_merged.sort_index(level=['parent_cat','cat'], ascending=[1, 1], inplace=True)

In [182]:
df_y_merged


Out[182]:
                                        label  parent_label  counts
parent_cat   cat
Computer     comp.graphics                  1             0     973
             comp.os.ms-windows.misc        2             0     985
             comp.sys.ibm.pc.hardware       3             0     982
             comp.sys.mac.hardware          4             0     963
             comp.windows.x                 5             0     988
Politics     talk.politics.guns            16             3     910
             talk.politics.mideast         17             3     940
             talk.politics.misc            18             3     775
Recreational rec.autos                      7             1     990
             rec.motorcycles                8             1     996
             rec.sport.baseball             9             1     994
             rec.sport.hockey              10             1     999
Religion     alt.atheism                    0             2     799
             soc.religion.christian        15             2     997
             talk.religion.misc            19             2     628
Sale         misc.forsale                   6             5     975
Science      sci.crypt                     11             4     991
             sci.electronics               12             4     984
             sci.med                       13             4     990
             sci.space                     14             4     987

In [183]:
df_y_merged.reset_index().groupby('parent_cat').agg(sum)['counts'].plot(kind='bar');


To keep it simple, we choose a subset of the first 4 parent categories: Computer, Politics, Recreational, and Religion. Each parent category is itself a merge of finer 20NewsGroups categories. These four also seem quite different from one another in a lexical sense.


In [184]:
selected_cat = ['Computer', 'Politics', 'Recreational', 'Religion']
selected_labels = set(df_y_merged.loc[selected_cat]['label'].values)

In [185]:
#Select news of interest
mask_selected = np.vectorize(lambda x: x in selected_labels)(news_target)
y = y_all[mask_selected]

#Dict mapping news label to parent (super) label 
label_to_p_label = dict(df_y_merged[['label', 'parent_label']].values)

#Super label mapped between 0 and len(set(y))
y_parent = np.vectorize(lambda x: label_to_p_label[x])(y)

3.2 Features Engineering

To process text we need a fixed-size vector of features representing it. One of the simplest and most effective ways to obtain one is known as tf-idf: the frequency of a word in a document is weighted down according to how many documents of the whole collection contain it. This approach gives less importance to frequent words, which carry little information, and more importance to rarer words appearing in only a few documents.
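A tiny toy example with the same TfidfVectorizer we use below illustrates the effect: a word present in every document gets a low idf (hence a low weight), while a word specific to one document gets a high one.


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the log",
]
toy_vec = TfidfVectorizer()
toy_vec.fit(toy_corpus)
# 'the' appears in every document -> lowest idf; 'cat' or 'bird' appear in one -> highest idf
print(dict(zip(toy_vec.get_feature_names(), toy_vec.idf_.round(2))))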


In [186]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, sublinear_tf=True, max_features=1000)

In [187]:
selected_data = [d for d, b in zip(news_chunk.data, mask_selected) if b]
news_features = vectorizer.fit_transform(selected_data)

feature_names = vectorizer.get_feature_names()
X = scipy.sparse.csr_matrix.todense(news_features)
X.shape, y.shape, y_parent.shape


Out[187]:
((13919, 1000), (13919,), (13919,))

We continue with a subset of the data.


In [188]:
subset_size = 2000

In [189]:
X = X[:subset_size]
y = y[:subset_size]
y_parent = y_parent[:subset_size]

4. Graph Construction

4.1 Distance matrix


In [190]:
distances = spectral.features_to_dist_matrix(X, metric='cosine')

In [191]:
plt.hist(np.nan_to_num(distances.flatten()), bins=100);


4.2 Weight Matrix


In [192]:
all_weights = spectral.dist_to_adj_matrix(distances, 'gaussian')

In [193]:
weights = spectral.filter_neighbors(all_weights, 100)

In [194]:
def plot(weights, axes):
    axes[0].spy(weights)
    axes[1].hist(weights[weights > 0].reshape(-1), bins=50);

In [195]:
fig, axes = plt.subplots(2, 2, figsize=(17, 8))

plot(all_weights, axes[:, 0])
plot(weights, axes[:, 1])


The horizontal and vertical white lines are a rendering artifact of pyplot.spy, not empty rows/columns (proof below).


In [196]:
np.nonzero(all_weights.sum(axis = 0) == 0)


Out[196]:
(array([], dtype=int64),)

4.3 Degree matrix


In [197]:
degrees = np.sum(weights, axis=0)

In [198]:
plt.hist(np.nan_to_num(degrees), bins=50, log=True);


5. Unsupervised Clustering

In this section, we will try to find clusters in a 100% unsupervised manner. To do that, we will use graph techniques such as spectral clustering, seen in the lecture and more precisely in homework 3. To extend what we saw in homework 3 to clustering with multiple classes, we will use a Gaussian Mixture Model (GMM), which can be seen as a soft-clustering generalization of the well-known k-means algorithm. In order to stay unsupervised even in the choice of the number of clusters, we perform a silhouette analysis.

5.1 Spectral Decomposition


In [199]:
D = np.diag(degrees)
W = weights
L = D - W  # combinatorial Laplacian

# Symmetric normalized Laplacian: L_norm = D^(-1/2) (D - W) D^(-1/2)
inv_sqrt_D = np.diag(1 / np.diag(D**(0.5)))

normalized_laplacian = inv_sqrt_D @ L @ inv_sqrt_D

In [200]:
plt.spy(normalized_laplacian);



In [201]:
eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(normalized_laplacian, k=20, which='SM')

In [202]:
plt.plot(eigenvalues, '.-', markersize=20);



In [203]:
label_to_name = dict(df_y_merged.reset_index()[['parent_label', 'parent_cat']].values)

Now, using the 2nd, 3rd and 4th eigenvectors and the true labels, we will check whether the spectral decomposition is useful.


In [204]:
plots.plot3D(eigenvectors, y_parent, y_parent, label_to_name, node_size=2, opacity=1)


As we can see on the plot above, the different categories can clearly be separated when using 3 eigenvectors.

5.2 Silhouette Score

Once we have the eigenvectors, we would like to use an unsupervised clustering algorithm. To make sure that we are selecting the number of clusters independently of the number of parent labels, we will compute the silhouette score for different numbers of clusters and compare them.

$$\text{silhouette} = \frac{1}{N}\sum_{i=1}^{N} \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where a(i) is the average distance of point i to the other points in the same cluster, and b(i) is the lowest average distance of point i to the points of any cluster that does not contain i.

Note: we will later use a GMM to create the clusters, and for this reason we also use a GMM when computing the silhouette scores.


In [205]:
def get_silhouette_GMM(X, i):
    clusters = GaussianMixture(n_components=i, covariance_type='full', max_iter=500)
    clusters.fit(X)
    labels = np.argmax(clusters.predict_proba(X), axis=1)
    return metrics.silhouette_score(X, labels, metric='euclidean')

n_eigen=3
interval_of_interest = range(2, 11)
scores = [np.mean([get_silhouette_GMM(eigenvectors[:, 1:n_eigen+1], i) for _ in range(10)]) for i in interval_of_interest]

plt.plot(interval_of_interest, scores, '.-', markersize=20);


Above we can see the silhouette score for each number of clusters; note that the highest score is obtained with 4 clusters. Since a higher score is better, we will continue the rest of the project with 4 clusters.

Given that the number of parent labels is equal to 4, it is not surprising that the optimal number of clusters is also equal to 4.


In [206]:
n_classes = interval_of_interest[np.argmax(scores)]

5.3 Gaussian Mixture Model


In [207]:
y_true = preprocessing.LabelEncoder().fit_transform(y_parent)

new_label = sorted(set(y_true))
original_label = sorted(set(y_parent))

new_to_ori_dict = {n:o for n, o in zip(new_label, original_label)}
to_original_label = np.vectorize(lambda l: new_to_ori_dict[l])

In [208]:
#Create and train the GMM
gmm_clf = GaussianMixture(n_components=n_classes, covariance_type='full', max_iter=500, random_state=42)
gmm_clf.fit(eigenvectors[:, 1:n_eigen+1]);
y_pred_brute = gmm_clf.predict(eigenvectors[:, 1:n_eigen+1])
y_pred_proba_brute = gmm_clf.predict_proba(eigenvectors[:, 1:n_eigen+1])

In [209]:
cluster_names = {i:'Cluster ' + str(i) for i in range(n_classes)}
infos = np.array([plots.proba_to_infos(arr, cluster_names) for arr in y_pred_proba_brute])
plots.plot3D(eigenvectors, y_pred_brute, infos, cluster_names, node_size=2, opacity=1)


Apart from the colors, which might not match the labels (a permutation will solve this, see the evaluation below), the results seem promising: each branch has a different label.

6. Evaluation

Based on the results obtained, we will try to assign a label to each cluster, using the labels from the 20 News Groups corpus.

Note that the labeled data is only used to put a name on each cluster and to evaluate the performance.
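models.find_best_perm belongs to our local code; conceptually, with only 4 clusters, a brute-force search over label permutations is enough, as in this sketch (not necessarily the exact implementation):


In [ ]:
from itertools import permutations

import numpy as np
from sklearn.metrics import accuracy_score

def find_best_permutation(y_true, y_pred, n_classes):
    """Return the permutation of cluster ids that best matches the true labels."""
    best_perm, best_acc = None, -1.0
    for perm in permutations(range(n_classes)):
        remapped = np.vectorize(lambda c: perm[c])(y_pred)
        acc = accuracy_score(y_true, remapped)
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return best_perm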


In [210]:
best_perm = models.find_best_perm(y_true, y_pred_brute)
best_perm


Out[210]:
(3, 0, 2, 1)

In [211]:
# apply best permutation
y_pred = np.vectorize(lambda x: best_perm[x])(y_pred_brute)
y_pred_proba = y_pred_proba_brute[:, best_perm]

In [212]:
f1_score(y_true, y_pred, average='weighted')


Out[212]:
0.75534908273156454

In [213]:
accuracy_score(y_true, y_pred)


Out[213]:
0.76649999999999996

In [214]:
confusion_mat = confusion_matrix(y_pred=y_pred, y_true=y_true)
permuted_cluster_names = np.array(list(cluster_names.values()))[list(best_perm)]
plots.plot_confusion_matrix(confusion_mat, permuted_cluster_names, selected_cat, normalize=True)


Above, we plot the confusion matrix of the 4 topics against the 4 clusters computed with the GMM. As we can see, the topic "Computer" is detected by cluster 3 with high accuracy (95%), the topic "Politics" is detected with less accuracy and falls mostly into cluster 0, the topic "Recreational" falls into cluster 2 about 3 times out of 4, and the topic "Religion" is also detected with high accuracy (87%) in the last cluster.

7. Inference

The idea is to introduce new documents into the graph representation in order to predict their cluster assignment (soft clustering) with respect to the parent categories.
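The spectral.fast_spectral_decomposition helper used below is part of our local code. As a rough, hedged sketch of the idea (reusing the cosine distance / Gaussian kernel / normalized Laplacian steps from sections 4 and 5, and omitting the nearest-neighbour filtering for brevity), embedding new documents alongside the existing ones could look like this:


In [ ]:
import numpy as np
import scipy.sparse.linalg
from sklearn.metrics.pairwise import cosine_distances

def embed_with_new_docs(X, vectorizer, new_texts, k=20, sigma=None):
    """Append the tf-idf vectors of new_texts to X, rebuild the graph and
    return the first k eigenvectors of its normalized Laplacian."""
    X_new = np.asarray(vectorizer.transform(new_texts).todense())
    X_all = np.vstack([np.asarray(X), X_new])
    dist = cosine_distances(X_all)
    sigma = sigma or dist.mean()
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))  # Gaussian kernel weights
    np.fill_diagonal(W, 0)                     # no self-loops
    degrees = W.sum(axis=0)
    inv_sqrt_D = np.diag(1 / np.sqrt(degrees))
    L_norm = inv_sqrt_D @ (np.diag(degrees) - W) @ inv_sqrt_D
    _, eigvecs = scipy.sparse.linalg.eigsh(L_norm, k=k, which='SM')
    return eigvecs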


In [215]:
new_texts = {
    "Mix" : "Maxime is our lord in the sky and on earth because he has an awesome macbook pro. Its processor is a dualcore with 16Gb of RAM, no joke :O. But, what if his computer is like the apple for Adam ?",
    
    "Jesus" : "Jesus is our lord in the sky and on earth because he has an awesome beard.",
    
    "Macbook" : "Maxime has an awesome macbook pro. Its processor is a dualcore with 16Gb of RAM, no joke.",
    
    "Tchoukball" : "Tchoukball /ˈtʃuːkbɔːl/ is an indoor team sport developed in the 1970s by Swiss biologist Dr Hermann Brandt. Dr Brandt was concerned about the number of injuries in sport at the time and as part of an educational study he wanted to create a sport that reduced injuries, was not aggressive between players and enabled people of all shapes, sizes, genders, cultures, and backgrounds to play together. The sport is usually played on an indoor court measuring 27 metres by 16 metres. At each end there is a 'frame' (a device similar to a trampoline off which the ball bounces) which measures one square metre and a semicircular D-shaped forbidden zone measuring three metres in radius. Each team can score on both ends of the court, and comprises 12 players, of which 7 may be on the court at any one time. In order to score a point, the ball must be thrown by an attacking player, hit the frame and bounce outside the 'D' without being caught by the defending team. Physical contact is prohibited, and defenders may not attempt to intercept the attacking team's passes. Players may take three steps with the ball, hold the ball for a maximum of three seconds, and teams may not pass the ball more than three times before shooting at the frame. Tchoukball has become an international sport, played in Brazil, Canada, China, the Czech Republic, Great Britain, India, Italy, Japan, Macau, Philippines, Poland, Singapore, Switzerland, Taiwan, and the United States. It is governed by the Féderation Internationale de Tchoukball (FITB, founded in 1971). Taiwan hosted the 2004 World Championships and won both the women's and junior championships, with the Swiss men winning the men's championship. The 2006 European Championships were held in Switzerland, with Great Britain taking both the Men's and Under-18's titles, while the hosts won the Women's event..",
    "US OPEN" : "The 1999 US Open – Women's Singles was the women's singles event of the hundred-and-ninth edition of the US Open, the fourth and last Grand Slam of the year, and the most prestigious tournament in the Americas. Lindsay Davenport was the defending champion, but she was defeated in the semifinals by Serena Williams. Williams then won in the final, defeating World No. 1 Martina Hingis. This was Williams' first Grand Slam title, and she became the first African American woman to win a Grand Slam in the Open Era. She won five more titles in 2002, 2008, 2012, 2013 and 2014.",
    "2nd World War" : "World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. It involved the vast majority of the world's countries—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis. It was the most widespread war in history, and directly involved more than 100 million people from over 30 countries. In a state of total war, the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, erasing the distinction between civilian and military resources. World War II was the deadliest conflict in human history, marked by 50 million to 85 million fatalities, most of which were civilians in the Soviet Union and China. It included massacres, the deliberate genocide of the Holocaust, strategic bombing, starvation, disease and the first use of nuclear weapons in history.[1][2][3][4] The Empire of Japan aimed to dominate Asia and the Pacific and was already at war with the Republic of China in 1937,[5] but the world war is generally said to have begun on 1 September 1939[6] with the invasion of Poland by Nazi Germany and subsequent declarations of war on Germany by France and the United Kingdom. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. The war continued primarily between the European Axis powers and the coalition of the United Kingdom and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, the aerial Battle of Britain, the Blitz bombing campaign, and the Balkan Campaign, as well as the long-running Battle of the Atlantic. On 22 June 1941, the European Axis powers launched an invasion of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part of the Axis military forces into a war of attrition. In December 1941, Japan attacked the United States and European colonies in the Pacific Ocean, and quickly conquered much of the Western Pacific."
}

In [216]:
new_labels2 = []
new_items = []
for key, item in new_texts.items():
    new_labels2.append(key)
    new_items.append(item)
new_labels2 = np.array(new_labels2)
new_items = np.array(new_items)

In [217]:
new_eigenvectors = spectral.fast_spectral_decomposition(X, vectorizer, new_items)

In [218]:
new_labels = [990+i for i, _ in enumerate(new_items)]

In [219]:
new_y = np.append(y_parent, new_labels)
new_y_pred, new_y_pred_proba = models.fast_gmm(new_y * (new_y < 20), n_classes, new_eigenvectors)

In [220]:
# Assign a different label than the one predicted to
# the new points so that we can differentiate them in the scatter plot

for i, l in enumerate(new_labels):
    new_y_pred[i - len(new_labels)] = l

label_to_name_with_new = label_to_name.copy()
for i, l in enumerate(new_labels):
    label_to_name_with_new[l] = new_labels2[i]

In [221]:
infos2 = np.array([plots.proba_to_infos(arr, label_to_name_with_new) for arr in new_y_pred_proba])
plots.plot3D(new_eigenvectors, new_y_pred, infos2, label_to_name_with_new, node_size=2, opacity=1)


We can clearly see that the text containing both kinds of keywords lies in between the two expected clusters, while the two texts containing topic-specific keywords are close to their respective clusters.

Note: the labels here are the ones we predicted with the GMM, not the true ones (this is in fact what makes the result interesting!).

8. Results

GraphLang V1

Being fully unsupervised and having no ground truth, our measure of correctness for this part is somewhat subjective. For the short texts (20 news groups), we can read them in full and see that the extracted concepts are indeed rather representative of the texts.

For the longer texts (the Quran and the Bible), it seems quite obvious that the concepts we get from our analysis are, if not the most important ones, at least of great importance in the texts. Note that we only use a subset of those texts to run our analysis (running on the full texts takes too much time). We would likely get even better results with the full texts!

GraphLang V2

We use the accuracy and the f1-score to measure the performance of our unsupervised clustering. Overall we obtain an accuracy of ~77% (and a weighted f1-score of ~76%), which is quite a satisfying result with 4 clusters/categories.

Limitations

We tried with more categories, but the results were less conclusive, even though they were still interesting.

9. Conclusion

Working with text is quite difficult: a lot of preprocessing/cleaning is required and there is no clear consensus on this process.

The evaluation is also a challenging part because, even when ground-truth labels exist, the results are always subject to interpretation.

Regarding the performance of GraphLang v2, the results are not comparable to state-of-the-art supervised techniques, but our approach is entirely unsupervised.

10. Future Work

It would be interesting to find a way to achieve spectral analysis in GraphLang v1.

It could also be very interesting for the user to be able to click on a point in order to see a preview of the text. Such a feature would allow us to investigate "outlier" points and see whether they really are badly classified by us, badly labeled, or simply a mixture of multiple categories.

11. Authors

Grégoire CLEMENT, Maxime DELISLE, Charles GALLAY and Ali HOSSEINY

The whole code is available on GitHub.

