Darwin's bibliography

Charles Darwin is one of the few universal figures of science. His most renowned work is without a doubt his "On the Origin of Species" published in 1859 which introduced the concept of natural selection. But Darwin wrote many other books on a wide range of topics, including geology, plants or his personal life. In this notebook, we will automatically detect how closely related his books are to each other.

To this purpose, we will develop the bases of a content-based book recommendation system, which will determine which books are close to each other based on how similar the discussed topics are. The methods we will use are commonly used in text- or documents-heavy industries such as legal, tech or customer support to perform some common task such as text classification or handling search engine queries.

Let's take a look at the books we'll use in our recommendation system.

Imports

Dependencies


In [8]:
import glob
import re, os

from tqdm import tqdm_notebook

import pickle
import pandas as pd

from nltk.stem import PorterStemmer
from gensim import corpora

from gensim.models import TfidfModel
from gensim import similarities

import matplotlib.pyplot as plt
%matplotlib inline

from scipy.cluster import hierarchy

In [7]:
ps = PorterStemmer()

In [ ]:

Data


In [2]:
folder = "datasets/"

files = glob.glob(folder + '*.txt')
files.sort()

In [3]:
txts = []
titles = []

for n in files:
    f = open(n, encoding='utf-8-sig')
    # Remove all non-alpha-numeric characters
    txts.append(re.sub('[\W_]+', ' ', f.read()))
    titles.append(os.path.basename(n).replace(".txt", ""))

# ['{} - {:,}'.format(title, len(txt)) for title, txt in zip(titles, txts)]

pd.DataFrame(data = [
    (title, len(txt)) for title, txt in zip(titles, txts)
], columns=['Title', '#characters']).sort_values('#characters', ascending=False)


Out[3]:
Title #characters
2 DescentofMan 1776539
13 MonographCirripediaVol2 1660866
19 VoyageBeagle 1149574
16 PowerMovementPlants 1093567
10 LifeandLettersVol1 1047518
17 VariationPlantsAnimalsDomestication 1043499
11 LifeandLettersVol2 1010643
15 OriginofSpecies 916267
4 EffectsCrossSelfFertilization 913713
9 InsectivorousPlants 901406
8 GeologicalObservationsSouthAmerica 797401
12 MonographCirripedia 767492
5 ExpressionofEmotionManAnimals 624232
3 DifferentFormsofFlowers 617088
7 FoundationsOriginofSpecies 523021
1 CoralReefs 496068
18 VolcanicIslands 341447
6 FormationVegetableMould 335920
14 MovementClimbingPlants 298319
0 Autobiography 123231

In [4]:
# for i in range(len(titles)):
#     if titles[i] == 'OriginofSpecies':
#         ori = i

book_index = titles.index('OriginofSpecies')
book_index


Out[4]:
15

Tokenize


In [19]:
%%time

# stop words
stoplist = set('for a of the and to in to be which some is at that we i who whom show via may my our might as well'.split())

txts_lower_case = [txt.lower() for txt in txts]
txts_split = [txt.split() for txt in txts_lower_case]
texts = [[word for word in txt if word not in stoplist] for txt in txts_split]

print(texts[book_index][:20])


['on', 'origin', 'species', 'but', 'with', 'regard', 'material', 'world', 'can', 'least', 'go', 'so', 'far', 'this', 'can', 'perceive', 'events', 'are', 'brought', 'about']
CPU times: user 559 ms, sys: 71.8 ms, total: 631 ms
Wall time: 639 ms

Stemming

As we are analysing 20 full books, the stemming algorithm can take several minutes to run and, in order to make the process faster, we will directly load the final results from a pickle file and review the method used to generate it.


In [21]:
# # Load the stemmed tokens list from the pregenerated pickle file
# texts_stem = pickle.load( open( 'datasets/texts_stem.p', 'rb' ) )

In [22]:
%%time

# texts_stem = [[ps.stem(word) for word in text] for text in texts]

texts_stem = []
for i in tqdm_notebook(range(len(texts))):
    book_stemmed = []
    
    for word in texts[i]:
        book_stemmed.append( ps.stem(word) )
    
    texts_stem.append(book_stemmed)

print(texts_stem[book_index][:20])


/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:5: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0
Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  """
['on', 'origin', 'speci', 'but', 'with', 'regard', 'materi', 'world', 'can', 'least', 'go', 'so', 'far', 'thi', 'can', 'perceiv', 'event', 'are', 'brought', 'about']
CPU times: user 55.4 s, sys: 28.9 ms, total: 55.4 s
Wall time: 55.6 s

In [ ]:

Modelling

Building a bag-of-words model

Now that we have transformed the texts into stemmed tokens, we need to build models that will be useable by downstream algorithms.

First, we need to will create a universe of all words contained in our corpus of Charles Darwin's books, which we call a dictionary. Then, using the stemmed tokens and the dictionary, we will create bag-of-words models (BoW) of each of our texts. The BoW models will represent our books as a list of all uniques tokens they contain associated with their respective number of occurrences.

To better understand the structure of such a model, we will print the five first elements of one of the "On the Origin of Species" BoW model.


In [12]:
dictionary = corpora.Dictionary(texts_stem)

# Create a bag-of-words model for each book, using the previously generated dictionary
bows = [dictionary.doc2bow(txt) for txt in texts_stem]

print(bows[book_index][:5])


[(0, 11), (5, 51), (6, 1), (8, 2), (21, 1)]

The most common words of a given book

The results returned by the bag-of-words model is certainly easy to use for a computer but hard to interpret for a human. It is not straightforward to understand which stemmed tokens are present in a given book from Charles Darwin, and how many occurrences we can find.

In order to better understand how the model has been generated and visualize its content, we will transform it into a DataFrame and display the 10 most common stems for the book "On the Origin of Species".


In [13]:
# Convert the BoW model for "On the Origin of Species" into a DataFrame
df_bow_origin = pd.DataFrame(bows[book_index], columns=['index', 'occurrences'])

# Add a column containing the token corresponding to the dictionary index
df_bow_origin['token'] = df_bow_origin['index'].apply(lambda i: texts_stem[book_index][i])

df_bow_origin.sort_values('occurrences', ascending=False).head(10)


Out[13]:
index occurrences token
748 1168 2023 histori
1119 1736 1558 by
1489 2288 1543 somewhat
892 1366 1480 have
239 393 1362 intercross
1128 1747 1201 speci
125 218 1140 domest
665 1043 1137 mysteri
1774 2703 1000 perfectli
1609 2452 962 effect

Build a tf-idf model

If it wasn't for the presence of the stem "speci", we would have a hard time to guess this BoW model comes from the On the Origin of Species book. The most recurring words are, apart from few exceptions, very common and unlikely to carry any information peculiar to the given book. We need to use an additional step in order to determine which tokens are the most specific to a book.

To do so, we will use a tf-idf model (term frequency–inverse document frequency). This model defines the importance of each word depending on how frequent it is in this text and how infrequent it is in all the other documents. As a result, a high tf-idf score for a word will indicate that this word is specific to this text.

After computing those scores, we will print the 10 words most specific to the "On the Origin of Species" book (i.e., the 10 words with the highest tf-idf score).


In [20]:
model = TfidfModel(bows)

# Print the model for "On the Origin of Species"
print(len(model[bows[book_index]]))


4405

The results of the tf-idf model

Once again, the format of those results is hard to interpret for a human. Therefore, we will transform it into a more readable version and display the 10 most specific words for the "On the Origin of Species" book.


In [15]:
# Convert the tf-idf model for "On the Origin of Species" into a DataFrame
df_tfidf = pd.DataFrame(model[bows[book_index]], columns=['id', 'score'])

# Add the tokens corresponding to the numerical indices for better readability
df_tfidf['token'] = df_tfidf['id'].apply(lambda i: texts_stem[book_index][i])

df_tfidf.sort_values('score', ascending=False).head(10)


Out[15]:
id score token
880 2164 0.327823 variat
3103 10108 0.204162 alway
128 369 0.197968 it
2985 9395 0.167705 frequent
947 2325 0.148371 gener
285 752 0.146172 slow
504 1255 0.128433 reader
371 966 0.127694 458
3840 16046 0.124547 epidem
3536 12729 0.121348 are

Compute distance between texts

The results of the tf-idf algorithm now return stemmed tokens which are specific to each book. We can, for example, see that topics such as selection, breeding or domestication are defining "On the Origin of Species" (and yes, in this book, Charles Darwin talks quite a lot about pigeons too). Now that we have a model associating tokens to how specific they are to each book, we can measure how related to books are between each other.

To this purpose, we will use a measure of similarity called cosine similarity and we will visualize the results as a distance matrix, i.e., a matrix showing all pairwise distances between Darwin's books.


In [16]:
sims = similarities.MatrixSimilarity(model[bows])

sim_df = pd.DataFrame(list(sims))
sim_df.columns = titles
sim_df.index = titles

print(sim_df)


/usr/local/lib/python3.5/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if np.issubdtype(vec.dtype, np.int):
                                     Autobiography  CoralReefs  DescentofMan  \
Autobiography                             1.000000    0.049722      0.080789   
CoralReefs                                0.049722    1.000000      0.009516   
DescentofMan                              0.080789    0.009516      1.000000   
DifferentFormsofFlowers                   0.066615    0.001980      0.072792   
EffectsCrossSelfFertilization             0.077006    0.001936      0.029997   
ExpressionofEmotionManAnimals             0.089345    0.005062      0.148642   
FormationVegetableMould                   0.041182    0.029445      0.027106   
FoundationsOriginofSpecies                0.058990    0.022066      0.135001   
GeologicalObservationsSouthAmerica        0.030679    0.060744      0.009628   
InsectivorousPlants                       0.014945    0.002284      0.009468   
LifeandLettersVol1                        0.399534    0.031211      0.060040   
LifeandLettersVol2                        0.220023    0.017772      0.080569   
MonographCirripedia                       0.005854    0.006321      0.053426   
MonographCirripediaVol2                   0.008456    0.010497      0.042937   
MovementClimbingPlants                    0.022999    0.001534      0.005157   
OriginofSpecies                           0.101199    0.039200      0.267734   
PowerMovementPlants                       0.016059    0.002686      0.011267   
VariationPlantsAnimalsDomestication       0.048989    0.011383      0.228437   
VolcanicIslands                           0.038630    0.057402      0.007885   
VoyageBeagle                              0.184288    0.267414      0.123902   

                                     DifferentFormsofFlowers  \
Autobiography                                       0.066615   
CoralReefs                                          0.001980   
DescentofMan                                        0.072792   
DifferentFormsofFlowers                             1.000000   
EffectsCrossSelfFertilization                       0.391788   
ExpressionofEmotionManAnimals                       0.006545   
FormationVegetableMould                             0.010621   
FoundationsOriginofSpecies                          0.039993   
GeologicalObservationsSouthAmerica                  0.002855   
InsectivorousPlants                                 0.007487   
LifeandLettersVol1                                  0.016188   
LifeandLettersVol2                                  0.046692   
MonographCirripedia                                 0.009403   
MonographCirripediaVol2                             0.005451   
MovementClimbingPlants                              0.008165   
OriginofSpecies                                     0.129152   
PowerMovementPlants                                 0.018831   
VariationPlantsAnimalsDomestication                 0.049405   
VolcanicIslands                                     0.002624   
VoyageBeagle                                        0.013213   

                                     EffectsCrossSelfFertilization  \
Autobiography                                             0.077006   
CoralReefs                                                0.001936   
DescentofMan                                              0.029997   
DifferentFormsofFlowers                                   0.391788   
EffectsCrossSelfFertilization                             1.000000   
ExpressionofEmotionManAnimals                             0.006871   
FormationVegetableMould                                   0.032270   
FoundationsOriginofSpecies                                0.040248   
GeologicalObservationsSouthAmerica                        0.002247   
InsectivorousPlants                                       0.006763   
LifeandLettersVol1                                        0.019609   
LifeandLettersVol2                                        0.046567   
MonographCirripedia                                       0.003218   
MonographCirripediaVol2                                   0.002957   
MovementClimbingPlants                                    0.014939   
OriginofSpecies                                           0.146700   
PowerMovementPlants                                       0.039520   
VariationPlantsAnimalsDomestication                       0.054458   
VolcanicIslands                                           0.002183   
VoyageBeagle                                              0.017183   

                                     ExpressionofEmotionManAnimals  \
Autobiography                                             0.089345   
CoralReefs                                                0.005062   
DescentofMan                                              0.148642   
DifferentFormsofFlowers                                   0.006545   
EffectsCrossSelfFertilization                             0.006871   
ExpressionofEmotionManAnimals                             1.000000   
FormationVegetableMould                                   0.021066   
FoundationsOriginofSpecies                                0.047103   
GeologicalObservationsSouthAmerica                        0.005246   
InsectivorousPlants                                       0.011461   
LifeandLettersVol1                                        0.065391   
LifeandLettersVol2                                        0.049333   
MonographCirripedia                                       0.016802   
MonographCirripediaVol2                                   0.029644   
MovementClimbingPlants                                    0.005942   
OriginofSpecies                                           0.063242   
PowerMovementPlants                                       0.011234   
VariationPlantsAnimalsDomestication                       0.082567   
VolcanicIslands                                           0.005595   
VoyageBeagle                                              0.099124   

                                     FormationVegetableMould  \
Autobiography                                       0.041182   
CoralReefs                                          0.029445   
DescentofMan                                        0.027106   
DifferentFormsofFlowers                             0.010621   
EffectsCrossSelfFertilization                       0.032270   
ExpressionofEmotionManAnimals                       0.021066   
FormationVegetableMould                             1.000000   
FoundationsOriginofSpecies                          0.021468   
GeologicalObservationsSouthAmerica                  0.067712   
InsectivorousPlants                                 0.035498   
LifeandLettersVol1                                  0.028357   
LifeandLettersVol2                                  0.023943   
MonographCirripedia                                 0.019864   
MonographCirripediaVol2                             0.023915   
MovementClimbingPlants                              0.038823   
OriginofSpecies                                     0.049519   
PowerMovementPlants                                 0.039911   
VariationPlantsAnimalsDomestication                 0.032647   
VolcanicIslands                                     0.059299   
VoyageBeagle                                        0.098331   

                                     FoundationsOriginofSpecies  \
Autobiography                                          0.058990   
CoralReefs                                             0.022066   
DescentofMan                                           0.135001   
DifferentFormsofFlowers                                0.039993   
EffectsCrossSelfFertilization                          0.040248   
ExpressionofEmotionManAnimals                          0.047103   
FormationVegetableMould                                0.021468   
FoundationsOriginofSpecies                             1.000000   
GeologicalObservationsSouthAmerica                     0.027300   
InsectivorousPlants                                    0.005995   
LifeandLettersVol1                                     0.057749   
LifeandLettersVol2                                     0.054703   
MonographCirripedia                                    0.007650   
MonographCirripediaVol2                                0.010762   
MovementClimbingPlants                                 0.003971   
OriginofSpecies                                        0.322736   
PowerMovementPlants                                    0.008712   
VariationPlantsAnimalsDomestication                    0.196578   
VolcanicIslands                                        0.017528   
VoyageBeagle                                           0.089075   

                                     GeologicalObservationsSouthAmerica  \
Autobiography                                                  0.030679   
CoralReefs                                                     0.060744   
DescentofMan                                                   0.009628   
DifferentFormsofFlowers                                        0.002855   
EffectsCrossSelfFertilization                                  0.002247   
ExpressionofEmotionManAnimals                                  0.005246   
FormationVegetableMould                                        0.067712   
FoundationsOriginofSpecies                                     0.027300   
GeologicalObservationsSouthAmerica                             1.000000   
InsectivorousPlants                                            0.006844   
LifeandLettersVol1                                             0.028691   
LifeandLettersVol2                                             0.012241   
MonographCirripedia                                            0.009260   
MonographCirripediaVol2                                        0.023486   
MovementClimbingPlants                                         0.002046   
OriginofSpecies                                                0.052878   
PowerMovementPlants                                            0.003450   
VariationPlantsAnimalsDomestication                            0.013737   
VolcanicIslands                                                0.372272   
VoyageBeagle                                                   0.259514   

                                     InsectivorousPlants  LifeandLettersVol1  \
Autobiography                                   0.014945            0.399534   
CoralReefs                                      0.002284            0.031211   
DescentofMan                                    0.009468            0.060040   
DifferentFormsofFlowers                         0.007487            0.016188   
EffectsCrossSelfFertilization                   0.006763            0.019609   
ExpressionofEmotionManAnimals                   0.011461            0.065391   
FormationVegetableMould                         0.035498            0.028357   
FoundationsOriginofSpecies                      0.005995            0.057749   
GeologicalObservationsSouthAmerica              0.006844            0.028691   
InsectivorousPlants                             1.000001            0.006062   
LifeandLettersVol1                              0.006062            1.000000   
LifeandLettersVol2                              0.016549            0.885953   
MonographCirripedia                             0.019091            0.005839   
MonographCirripediaVol2                         0.019657            0.012716   
MovementClimbingPlants                          0.249011            0.005533   
OriginofSpecies                                 0.014982            0.098352   
PowerMovementPlants                             0.022841            0.009439   
VariationPlantsAnimalsDomestication             0.010321            0.054805   
VolcanicIslands                                 0.008526            0.026517   
VoyageBeagle                                    0.014758            0.172502   

                                     LifeandLettersVol2  MonographCirripedia  \
Autobiography                                  0.220023             0.005854   
CoralReefs                                     0.017772             0.006321   
DescentofMan                                   0.080569             0.053426   
DifferentFormsofFlowers                        0.046692             0.009403   
EffectsCrossSelfFertilization                  0.046567             0.003218   
ExpressionofEmotionManAnimals                  0.049333             0.016802   
FormationVegetableMould                        0.023943             0.019864   
FoundationsOriginofSpecies                     0.054703             0.007650   
GeologicalObservationsSouthAmerica             0.012241             0.009260   
InsectivorousPlants                            0.016549             0.019091   
LifeandLettersVol1                             0.885953             0.005839   
LifeandLettersVol2                             1.000000             0.004961   
MonographCirripedia                            0.004961             1.000000   
MonographCirripediaVol2                        0.010570             0.516038   
MovementClimbingPlants                         0.017673             0.012450   
OriginofSpecies                                0.097736             0.031278   
PowerMovementPlants                            0.012017             0.018565   
VariationPlantsAnimalsDomestication            0.050260             0.022999   
VolcanicIslands                                0.011930             0.010054   
VoyageBeagle                                   0.090674             0.014269   

                                     MonographCirripediaVol2  \
Autobiography                                       0.008456   
CoralReefs                                          0.010497   
DescentofMan                                        0.042937   
DifferentFormsofFlowers                             0.005451   
EffectsCrossSelfFertilization                       0.002957   
ExpressionofEmotionManAnimals                       0.029644   
FormationVegetableMould                             0.023915   
FoundationsOriginofSpecies                          0.010762   
GeologicalObservationsSouthAmerica                  0.023486   
InsectivorousPlants                                 0.019657   
LifeandLettersVol1                                  0.012716   
LifeandLettersVol2                                  0.010570   
MonographCirripedia                                 0.516038   
MonographCirripediaVol2                             1.000000   
MovementClimbingPlants                              0.006762   
OriginofSpecies                                     0.037865   
PowerMovementPlants                                 0.022154   
VariationPlantsAnimalsDomestication                 0.029694   
VolcanicIslands                                     0.016404   
VoyageBeagle                                        0.024734   

                                     MovementClimbingPlants  OriginofSpecies  \
Autobiography                                      0.022999         0.101199   
CoralReefs                                         0.001534         0.039200   
DescentofMan                                       0.005157         0.267734   
DifferentFormsofFlowers                            0.008165         0.129152   
EffectsCrossSelfFertilization                      0.014939         0.146700   
ExpressionofEmotionManAnimals                      0.005942         0.063242   
FormationVegetableMould                            0.038823         0.049519   
FoundationsOriginofSpecies                         0.003971         0.322736   
GeologicalObservationsSouthAmerica                 0.002046         0.052878   
InsectivorousPlants                                0.249011         0.014982   
LifeandLettersVol1                                 0.005533         0.098352   
LifeandLettersVol2                                 0.017673         0.097736   
MonographCirripedia                                0.012450         0.031278   
MonographCirripediaVol2                            0.006762         0.037865   
MovementClimbingPlants                             1.000000         0.008858   
OriginofSpecies                                    0.008858         1.000000   
PowerMovementPlants                                0.104310         0.018170   
VariationPlantsAnimalsDomestication                0.011361         0.404093   
VolcanicIslands                                    0.002842         0.035924   
VoyageBeagle                                       0.012328         0.165247   

                                     PowerMovementPlants  \
Autobiography                                   0.016059   
CoralReefs                                      0.002686   
DescentofMan                                    0.011267   
DifferentFormsofFlowers                         0.018831   
EffectsCrossSelfFertilization                   0.039520   
ExpressionofEmotionManAnimals                   0.011234   
FormationVegetableMould                         0.039911   
FoundationsOriginofSpecies                      0.008712   
GeologicalObservationsSouthAmerica              0.003450   
InsectivorousPlants                             0.022841   
LifeandLettersVol1                              0.009439   
LifeandLettersVol2                              0.012017   
MonographCirripedia                             0.018565   
MonographCirripediaVol2                         0.022154   
MovementClimbingPlants                          0.104310   
OriginofSpecies                                 0.018170   
PowerMovementPlants                             1.000000   
VariationPlantsAnimalsDomestication             0.020184   
VolcanicIslands                                 0.003787   
VoyageBeagle                                    0.023967   

                                     VariationPlantsAnimalsDomestication  \
Autobiography                                                   0.048989   
CoralReefs                                                      0.011383   
DescentofMan                                                    0.228437   
DifferentFormsofFlowers                                         0.049405   
EffectsCrossSelfFertilization                                   0.054458   
ExpressionofEmotionManAnimals                                   0.082567   
FormationVegetableMould                                         0.032647   
FoundationsOriginofSpecies                                      0.196578   
GeologicalObservationsSouthAmerica                              0.013737   
InsectivorousPlants                                             0.010321   
LifeandLettersVol1                                              0.054805   
LifeandLettersVol2                                              0.050260   
MonographCirripedia                                             0.022999   
MonographCirripediaVol2                                         0.029694   
MovementClimbingPlants                                          0.011361   
OriginofSpecies                                                 0.404093   
PowerMovementPlants                                             0.020184   
VariationPlantsAnimalsDomestication                             1.000000   
VolcanicIslands                                                 0.012210   
VoyageBeagle                                                    0.112421   

                                     VolcanicIslands  VoyageBeagle  
Autobiography                               0.038630      0.184288  
CoralReefs                                  0.057402      0.267414  
DescentofMan                                0.007885      0.123902  
DifferentFormsofFlowers                     0.002624      0.013213  
EffectsCrossSelfFertilization               0.002183      0.017183  
ExpressionofEmotionManAnimals               0.005595      0.099124  
FormationVegetableMould                     0.059299      0.098331  
FoundationsOriginofSpecies                  0.017528      0.089075  
GeologicalObservationsSouthAmerica          0.372272      0.259514  
InsectivorousPlants                         0.008526      0.014758  
LifeandLettersVol1                          0.026517      0.172502  
LifeandLettersVol2                          0.011930      0.090674  
MonographCirripedia                         0.010054      0.014269  
MonographCirripediaVol2                     0.016404      0.024734  
MovementClimbingPlants                      0.002842      0.012328  
OriginofSpecies                             0.035924      0.165247  
PowerMovementPlants                         0.003787      0.023967  
VariationPlantsAnimalsDomestication         0.012210      0.112421  
VolcanicIslands                             1.000000      0.138034  
VoyageBeagle                                0.138034      1.000000  

The book most similar to "On the Origin of Species"

We now have a matrix containing all the similarity measures between any pair of books from Charles Darwin! We can now use this matrix to quickly extract the information we need, i.e., the distance between one book and one or several others.

As a first step, we will display which books are the most similar to "On the Origin of Species," more specifically we will produce a bar chart showing all books ranked by how similar they are to Darwin's landmark work.


In [17]:
v = sim_df.OriginofSpecies
v_sorted = v.sort_values()
# v_sorted = v_sorted[:-1]

plt.barh(range(len(v_sorted)), v_sorted.values)

plt.xlabel('Similarity')
plt.ylabel('Books')
plt.yticks(range(len(v_sorted)), v_sorted.index)
plt.xlim((0, 1))
plt.title('Books most similar to the "Origin of Species"')

plt.show()


Which books have similar content?

This turns out to be extremely useful if we want to determine a given book's most similar work. For example, we have just seen that if you enjoyed "On the Origin of Species," you can read books discussing similar concepts such as "The Variation of Animals and Plants under Domestication" or "The Descent of Man, and Selection in Relation to Sex." If you are familiar with Darwin's work, these suggestions will likely seem natural to you. Indeed, On the Origin of Species has a whole chapter about domestication and The Descent of Man, and Selection in Relation to Sex applies the theory of natural selection to human evolution. Hence, the results make sense.

However, we now want to have a better understanding of the big picture and see how Darwin's books are generally related to each other (in terms of topics discussed). To this purpose, we will represent the whole similarity matrix as a dendrogram, which is a standard tool to display such data. This last approach will display all the information about book similarities at once. For example, we can find a book's closest relative but, also, we can visualize which groups of books have similar topics (e.g., the cluster about Charles Darwin personal life with his autobiography and letters). If you are familiar with Darwin's bibliography, the results should not surprise you too much, which indicates the method gives good results. Otherwise, next time you read one of the author's book, you will know which other books to read next in order to learn more about the topics it addressed.


In [18]:
Z = hierarchy.linkage(sim_df, method='ward')

a = hierarchy.dendrogram(
    Z,
    leaf_font_size=8,
    labels=sim_df.index,
    orientation="left"
)