Modeling

ML Tasks


In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

Input


In [2]:
from sklearn.datasets import load_files

corpus = load_files("../data/")

doc_count = len(corpus.data)
print("Doc count:", doc_count)
assert doc_count == 56, "Wrong number of documents loaded, should be 56 (56 stories)"


Doc count: 56

Vectorizer


In [3]:
from helpers.tokenizer import TextWrangler
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(strip_accents="ascii", tokenizer=TextWrangler(kind="lemma"))
X_bow = bow.fit_transform(corpus.data)


[nltk_data] Downloading package punkt to ../nltk/...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to ../nltk/...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to ../nltk/...
[nltk_data]   Package wordnet is already up-to-date!

We chose BOW vectors containing lemmatized words. In this case, BOW yields better cluster performance than tf-idf vectors, and lemmatization worked slightly better than stemming (see the KElbow plots in the plots/ dir).
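To make the difference between the two weightings concrete, here is a minimal pure-Python sketch of raw counts versus tf-idf. It is not the notebook's pipeline: the whitespace tokenizer stands in for TextWrangler, the toy documents are invented, and the smoothed idf formula ln((1 + n) / (1 + df)) + 1 is one common variant chosen for illustration.

```python
import math
from collections import Counter

def bow_vectors(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tfidf_vectors(docs):
    """Reweight raw counts by a smoothed idf and L2-normalize each row."""
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of docs in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    # Smoothed idf variant (an assumption for this sketch).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    vectors = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        vectors.append([v / norm for v in weighted])
    return vocab, vectors

# Invented toy documents, not taken from the corpus.
toy_docs = ["holmes watson holmes", "holmes moriarty", "watson lestrade holmes"]
vocab, X_counts = bow_vectors(toy_docs)
_, X_tfidf = tfidf_vectors(toy_docs)
print(vocab)
print(X_counts)
print(X_tfidf)
```

A term that occurs in every document ("holmes" here) keeps its full weight under BOW but is down-weighted relative to rarer terms under tf-idf, which is the kind of shift that can change cluster shapes.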

Models


In [4]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_jobs=-1, random_state=23)

In [5]:
from yellowbrick.cluster import KElbowVisualizer

viz = KElbowVisualizer(kmeans, k=(2, 28), metric="silhouette")
viz.fit(X_bow)
#viz.poof(outpath="plots/KElbow_bow_lemma_silhoutte.png")
viz.poof()



In [6]:
from yellowbrick.cluster import SilhouetteVisualizer

def plot_silhoutte_plots(max_n):
    for i in range(2, max_n + 1):
        plt.clf()
        n_cluster = i

        viz = SilhouetteVisualizer(KMeans(n_clusters=n_cluster, random_state=23))

        viz.fit(X_bow)
        path = "plots/SilhouetteViz" + str(n_cluster)
        viz.poof(outpath=path)

#plot_silhoutte_plots(28)

We settle on 3 clusters because this yields the highest average Silhouette score among the cluster counts tried.
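The decision rule itself is just an argmax over the candidate cluster counts. The sketch below shows it with a hypothetical helper `best_k` and made-up scores; the numbers are placeholders, not the values read off the plots.

```python
def best_k(candidate_ks, score_fn):
    """Return the (k, score) pair with the highest score among candidate_ks."""
    scored = [(k, score_fn(k)) for k in candidate_ks]
    return max(scored, key=lambda pair: pair[1])

# Made-up illustrative scores, NOT the values from this notebook's plots.
toy_scores = {2: 0.05, 3: 0.08, 4: 0.06, 5: 0.03}

k, score = best_k(sorted(toy_scores), toy_scores.get)
print(k, score)
```

In the notebook's setting, `score_fn` would presumably fit `KMeans(n_clusters=k, random_state=23)` on `X_bow` and return `silhouette_score(X_bow, labels)`.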


In [7]:
from yellowbrick.cluster import SilhouetteVisualizer

n_clusters = 3
model = KMeans(n_clusters=n_clusters, n_jobs=-1, random_state=23)

viz = SilhouetteVisualizer(model)

viz.fit(X_bow)
viz.poof()


Nonetheless, the assignment isn't perfect. Cluster #1 looks good, but the many negative values in clusters #0 and #1 suggest that, for those documents, there exists another cluster with more similar docs than the one actually assigned. As a cluster count of 2 also leads to an inhomogeneous cluster and has a lower average Silhouette score, we go with 3. In general, though, these findings suggest that the Sherlock Holmes stories are best represented as a single collection.

Training


In [8]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("bow", bow),
                 ("kmeans", model)])
pipe.fit(corpus.data)

pred = pipe.predict(corpus.data)

Evaluation

Cluster density

Silhouette coefficient: ranges over [-1, 1], where 1 indicates the densest, best-separated clusters and negative values indicate poor separation.
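For intuition, the coefficient can be computed by hand. For each sample, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and s = (b - a) / max(a, b). The following is a minimal pure-Python sketch using Euclidean distance on invented 2-D points; the function names are hypothetical and the notebook itself relies on sklearn's `silhouette_score`.

```python
import math

def silhouette_samples(points, labels):
    """Per-sample silhouette values s = (b - a) / max(a, b), Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean_dist(i, own)  # cohesion within the assigned cluster
        b = min(mean_dist(i, members)  # distance to the nearest other cluster
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

def silhouette_mean(points, labels):
    """Average of the per-sample silhouette values."""
    values = silhouette_samples(points, labels)
    return sum(values) / len(values)

# Invented toy data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]  # groups the nearby points together
bad_labels = [0, 1, 0, 1]   # pairs each point with a distant one

print("good:", silhouette_mean(points, good_labels))
print("bad: ", silhouette_mean(points, bad_labels))
```

The same points score clearly positive under the sensible labeling and negative under the mismatched one, which is the comparison made below between the clustered and the original collections.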


In [9]:
from sklearn.metrics import silhouette_score

print("Avg Silhouette score:", silhouette_score(X_bow, pred), "(novel collections)")


Avg Silhouette score: 0.08043954719290361 (novel collections)

Compared to original collections by Sir Arthur Conan Doyle:


In [10]:
print("Avg Silhouette score:", silhouette_score(X_bow, corpus.target), "(original collections)")


Avg Silhouette score: -0.01820726730325904 (original collections)

The average Silhouette coefficient is at least slightly positive and much better than the score of the original assignment (which is even negative). Success.

Visual Inspection

We start from the original assignment by Sir Arthur Conan Doyle...


In [11]:
from yellowbrick.text import TSNEVisualizer

# Map target names of original collections to target vals
collections_map = {}
for i, collection_name in enumerate(corpus.target_names):
    collections_map[i] = collection_name

# Plot
tsne_original = TSNEVisualizer()
labels = [collections_map[c] for c in corpus.target]
tsne_original.fit(X_bow, labels)
tsne_original.poof()


'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

... to the novel collection assignment:


In [12]:
# Plot
tsne_novel = TSNEVisualizer()
labels = ["c{}".format(c) for c in pipe.named_steps.kmeans.labels_]
tsne_novel.fit(X_bow, labels)
tsne_novel.poof()


'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

This confirms the findings from the Silhouette plot above (in the Models section): cluster #1 looks very coherent, cluster #2 is separated, and the few documents of cluster #0 are scattered around. Nonetheless, compared to the original collections, this looks far better. Success.

Document-Cluster Assignment

Finally, we want to assign the Sherlock Holmes stories to the novel collection created by clustering, right?

Create artificial titles for the collections created from clusters.


In [13]:
# Novel titles, can be more creative ;>
novel_collections_map = {0: "The Unassignable Adventures of Cluster 0", 
                         1: "The Adventures of Sherlock Holmes in Cluster 1",
                         2: "The Case-Book of Cluster 2"}

Let's see how differently the stories are assigned to collections by Sir Arthur Conan Doyle (Original Collection) and by the clustering algorithm (Novel Collection).
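The comparison needs a readable title per story, derived from each file name. As a sketch of that step, the hypothetical helper below uses `pathlib` instead of chained `split` calls. The example path, including its `.txt` extension and directory layout, is an assumption made for illustration.

```python
from pathlib import PurePosixPath

def title_from_path(path):
    """Turn '.../SOME_STORY_NAME.ext' into 'SOME STORY NAME'."""
    # .stem drops the directory part and the final extension only.
    return PurePosixPath(path).stem.replace("_", " ")

# Hypothetical path; the real layout under ../data/ may differ.
example = "../data/The_Memoirs_of_Sherlock_Holmes/SILVER_BLAZE.txt"
print(title_from_path(example))
```

One difference to keep in mind: `.stem` removes only the last suffix, whereas `split(".")[0]` cuts at the first dot, so the two disagree for file names that contain additional dots.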


In [14]:
orig_assignment = [collections_map[c] for c in corpus.target]
novel_assignment = [novel_collections_map[p] for p in pred]

titles = [" ".join(f_name.split("/")[-1].split(".")[0].split("_")) 
          for f_name in corpus.filenames]

# Final df, compares original with new assignment
df_documents = pd.DataFrame([orig_assignment, novel_assignment], 
                            columns=titles, index=["Original Collection", "Novel Collection"]).T
df_documents.to_csv("collections.csv")
df_documents


Out[14]:
Title | Original Collection | Novel Collection
THE ADVENTURE OF THE ABBEY GRANGE | The_Return_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE CROOKED MAN | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE RESIDENT PATIENT | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE THREE GABLES | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE BLUE CARBUNCLE | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE CARDBOARD BOX | His_Last_Bow | The Adventures of Sherlock Holmes in Cluster 1
SILVER BLAZE | The_Memoirs_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ADVENTURE OF BLACK PETER | The_Return_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE DANCING MEN | The_Return_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ILLUSTRIOUS CLIENT | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE DYING DETECTIVE | His_Last_Bow | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE BERYL CORONET | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE VEILED LODGER | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE LION'S MANE | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE PROBLEM OF THOR BRIDGE | The_Case-Book_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ADVENTURE OF THE MISSING THREE-QUARTER | The_Return_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE MUSGRAVE RITUAL | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE DISAPPEARANCE OF LADY FRANCES CARFAX | His_Last_Bow | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE NOBLE BACHELOR | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE THREE GARRIDEBS | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE GLORIA SCOTT | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE NORWOOD BUILDER | The_Return_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE MAN WITH THE TWISTED LIP | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE SECOND STAIN | The_Return_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE FINAL PROBLEM | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE SPECKLED BAND | The_Adventures_of_Sherlock_Holmes | The Unassignable Adventures of Cluster 0
THE ADVENTURE OF THE RED CIRCLE | His_Last_Bow | The Adventures of Sherlock Holmes in Cluster 1
THE BLANCHED SOLDIER | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE BOSCOMBE VALLEY MYSTERY | The_Adventures_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ADVENTURE OF THE SOLITARY CYCLIST | The_Return_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF SHOSCOMBE OLD PLACE | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE GOLDEN PINCE-NEZ | The_Return_of_Sherlock_Holmes | The Case-Book of Cluster 2
A SCANDAL IN BOHEMIA | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE ENGINEER'S THUMB | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE SUSSEX VAMPIRE | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE FIVE ORANGE PIPS | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE STOCK-BROKER'S CLERK | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE THREE STUDENTS | The_Return_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE NAVAL TREATY | The_Memoirs_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ADVENTURE OF THE BRUCE-PARTINGTON PLANS | His_Last_Bow | The Case-Book of Cluster 2
THE ADVENTURE OF THE COPPER BEECHES | The_Adventures_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE YELLOW FACE | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE RETIRED COLOURMAN | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF WISTERIA LODGE | His_Last_Bow | The Case-Book of Cluster 2
THE ADVENTURE OF THE MAZARIN STONE | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE GREEK INTERPRETER | The_Memoirs_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE CREEPING MAN | The_Case-Book_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE RED-HEADED LEAGUE | The_Adventures_of_Sherlock_Holmes | The Case-Book of Cluster 2
A CASE OF IDENTITY | The_Adventures_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE REIGATE SQUIRES | The_Memoirs_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ADVENTURE OF THE PRIORY SCHOOL | The_Return_of_Sherlock_Holmes | The Unassignable Adventures of Cluster 0
THE ADVENTURE OF THE DEVIL'S FOOT | His_Last_Bow | The Unassignable Adventures of Cluster 0
HIS LAST BOW | His_Last_Bow | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF THE SIX NAPOLEONS | The_Return_of_Sherlock_Holmes | The Case-Book of Cluster 2
THE ADVENTURE OF THE EMPTY HOUSE | The_Return_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1
THE ADVENTURE OF CHARLES AUGUSTUS MILVERTON | The_Return_of_Sherlock_Holmes | The Adventures of Sherlock Holmes in Cluster 1

In [15]:
df_documents["Novel Collection"].value_counts()


Out[15]:
The Adventures of Sherlock Holmes in Cluster 1    38
The Case-Book of Cluster 2                        15
The Unassignable Adventures of Cluster 0           3
Name: Novel Collection, dtype: int64

The stories are unevenly distributed across the collections. Cluster #1 is the predominant one. It looks like cluster #0 subsumes the stories that cannot be (rationally) assigned elsewhere.
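To see how the original collections spread over the clusters, the two assignments can be cross-tabulated. The sketch below does this with `collections.Counter` on six rows copied from the table above; the variable names are hypothetical, and the subset is only for illustration, so its counts do not match the full result.

```python
from collections import Counter

# (title, original collection, cluster id): a small subset of the table above.
rows = [
    ("SILVER BLAZE", "The_Memoirs_of_Sherlock_Holmes", 2),
    ("THE CROOKED MAN", "The_Memoirs_of_Sherlock_Holmes", 1),
    ("THE NAVAL TREATY", "The_Memoirs_of_Sherlock_Holmes", 2),
    ("THE ADVENTURE OF THE SPECKLED BAND", "The_Adventures_of_Sherlock_Holmes", 0),
    ("A SCANDAL IN BOHEMIA", "The_Adventures_of_Sherlock_Holmes", 1),
    ("THE ADVENTURE OF THE DEVIL'S FOOT", "His_Last_Bow", 0),
]

# Count stories per (original collection, cluster) pair and per cluster.
pair_counts = Counter((orig, cluster) for _, orig, cluster in rows)
cluster_sizes = Counter(cluster for _, _, cluster in rows)

for (orig, cluster), n in sorted(pair_counts.items()):
    print(f"{orig:35s} cluster {cluster}: {n}")
print(dict(cluster_sizes))
```

On the full frame, `pd.crosstab(df_documents["Original Collection"], df_documents["Novel Collection"])` should give the same kind of table in one call.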

The final t-SNE plot looks like this:


In [16]:
tsne_novel_named = TSNEVisualizer(colormap="Accent")
tsne_novel_named.fit(X_bow, novel_assignment)
tsne_novel_named.poof(outpath="plots/Novel_Sherlock_Holmes_Collections.png")


'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

Conclusion

A more rational assignment of Sherlock Holmes stories to collections is possible, and it can be achieved with NLP and clustering, as shown above.

Nonetheless, the results (also taking those from the HolmesTopicModels into account) suggest that there is no perfect way to separate the stories. So we'd prefer to treat the whole canon as a single compilation of Sherlock Holmes stories.

But if separate collections are to be derived from the canon, the most promising solution is the clustering approach presented above.