There are 7647 distinct YouTube video caption texts in English, plus a directory containing 24 English books, some of which come from www.gutenberg.org and are proofread. The following scripts compute similarity measures between the captions alone, between the books alone, or between both together. Initially, get it to work on a subset of the 7647...


In [1]:
from collections import OrderedDict
from os import listdir
from os.path import isfile, join
import sys
sys.path.append('../')
import config
import pymysql.cursors
import pandas as pd
import spacy

nlp = spacy.load('en')

connection = pymysql.connect(host='localhost',
                             user='root',
                             password=config.MYSQL_SERVER_PASSWORD,
                             db='youtubeProjectDB',
                             charset='utf8mb4', 
                             cursorclass=pymysql.cursors.DictCursor)


mypath = '../textbooks'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    
with connection.cursor() as cursor:
    sql = """
    SELECT search_api.videoId, videoTitle, captionsText, wordCount, captions.id
    FROM search_api
    INNER JOIN captions
    ON search_api.videoId = captions.videoId
    WHERE captions.id
    IN (5830, 45, 52, 54, 6195, 6198, 6203, 6208, 14525, 14523, 14518);"""
    cursor.execute(sql)
    manyCaptions = cursor.fetchall()          # list of dict rows (DictCursor)
    videos_df = pd.read_sql(sql, connection)  # same query again, as a DataFrame

connection.close()

Place the captions and books into ordered dictionaries with keys which identify their contents


In [2]:
L1 = []
L2 = []
for file in onlyfiles:
    L1.append((file ,  (open(mypath + '/' + file, 'r').read()) ))
TextBooksDict = OrderedDict(L1)

for item in manyCaptions:
    #  L2.append((item.get('id')  ,  item.get('captionsText')))  # 'id' key is lower case!!!
    L2.append((item.get('videoTitle')  ,  item.get('captionsText')))
ManyCaptionsDict = OrderedDict(L2)   

# Merge the OrderedDicts (captions first, then books)
L3 = []
for k, v in ManyCaptionsDict.items():
    L3.append((k, v))
for k, v in TextBooksDict.items():
    L3.append((k, v))
UnitedOrderedDict = OrderedDict(L3)

videos_df['characterCount'] = videos_df['captionsText'].map(len)
videos_df['charPerWord'] = videos_df.characterCount / videos_df.wordCount
# reorder the columns
videos_df = videos_df.reindex(columns=['videoTitle','characterCount','wordCount', 'charPerWord','captionsText','id', 'videoId'])

Use pickles to avoid rerunning the word-count cell unnecessarily


In [6]:
textbooks_df = pd.read_pickle('textbooksDF.pickle')

In [32]:
# NB - this cell can take minutes to run; load from the pickle instead if nothing has been added.
# https://chrisalbon.com/python/pandas_create_column_with_loop.html
fileName = [k for k in TextBooksDict.keys()]
characterCount = [len(TextBooksDict.get(k)) for k in TextBooksDict.keys()]
wordCount = [len(nlp(TextBooksDict.get(k))) for k in TextBooksDict.keys()]
raw_data = {'fileName' : fileName,
            'characterCount': characterCount,
            'wordCount':wordCount}
textbooks_df = pd.DataFrame(raw_data, columns = ['fileName', 'characterCount', 'wordCount'])
textbooks_df['charPerWord'] = textbooks_df.characterCount / textbooks_df.wordCount
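
If the full spaCy parse proves too slow just for counting tokens, a rough whitespace-based count is one possible fallback (a sketch only; counts will differ slightly from spaCy's tokenization, and it is not what the cell above uses):


In [ ]:
# Hypothetical fallback: approximate word counts with a plain whitespace split
# instead of a full spaCy parse; much faster, but tokenization differs slightly.
approxWordCount = [len(TextBooksDict[k].split()) for k in TextBooksDict]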

In [42]:
textbooks_df.to_pickle('textbooksDF.pickle')
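
A minimal sketch of a guard that loads the cached dataframe when the pickle already exists and only recomputes otherwise (assuming the same 'textbooksDF.pickle' file name):


In [ ]:
from os.path import isfile

if isfile('textbooksDF.pickle'):
    # cheap path: load the cached word counts
    textbooks_df = pd.read_pickle('textbooksDF.pickle')
else:
    # no cache yet: run the slow word-count cell above first, then save
    textbooks_df.to_pickle('textbooksDF.pickle')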

Samples of both dataframes


In [4]:
videos_df[['videoTitle', 'characterCount', 'wordCount', 'charPerWord']].head(5)


Out[4]:
videoTitle characterCount wordCount charPerWord
0 ALL ABOUT LIVING WITH BOXER DOGS 16843 3862 4.361212
1 Lulu the Lab - Basic dog training in Austin (5... 485 100 4.850000
2 PetSmart Puppy Training: Feeding a Puppy 2953 677 4.361891
3 6. Reconstruction from Compressed Representation 3448 735 4.691156
4 Unit 6 8 Supervised vs Unsupervised Learning 2054 395 5.200000

In [4]:
textbooks_df.head(5)


Out[4]:
fileName characterCount wordCount charPerWord
0 sheep.txt 427001 92664 4.608057
1 Corporate Finance.txt 3141127 691207 4.544409
2 Excel2010Advanced.txt 287867 61451 4.684497
3 distributedAI.txt 175861 42156 4.171672
4 BrianYarvinPloughmansLunch.txt 287345 65212 4.406321

In [7]:
print (textbooks_df.wordCount.mean())
print (videos_df.wordCount.mean())


226327.125
835.727272727

Execute one of the following three cells to run similarity tests between:

i) Just the textbooks

ii) Just the video captions

iii) textbooks and video captions


In [5]:
documents = [TextBooksDict.get(key) for key in list(TextBooksDict)]
# the following two lists are used to pretty-print results at the bottom of the
# notebook, putting labels back onto an otherwise unlabeled NumPy array
row_labels = list(TextBooksDict)
column_labels = list(TextBooksDict)

In [180]:
documents = [ManyCaptionsDict.get(key) for key in list(ManyCaptionsDict)]

row_labels = list(ManyCaptionsDict)
column_labels = list(ManyCaptionsDict)

In [3]:
documents = [UnitedOrderedDict.get(key) for key in list(UnitedOrderedDict)]

row_labels = list(UnitedOrderedDict)
column_labels = list(UnitedOrderedDict)

Adapt the script from https://nicschrading.com/project/Intro-to-NLP-with-spaCy/, which uses the NLP library spaCy, to create a text cleaner and text tokenizer for the pipeline.


In [4]:
from spacy.en import English
parser = English()

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re

# A custom stoplist
STOPLIST = set(stopwords.words('english') + ["n't", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS))
# List of symbols we don't care about
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-----", "---", "...", "“", "”", "'ve"]

# Every step in a pipeline needs to be a "transformer". 
# Define a custom transformer to clean text using spaCy
class CleanTextTransformer(TransformerMixin):
    """
    Convert text to cleaned text
    """

    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
    
# A custom function to clean the text before sending it into the vectorizer
def cleanText(text):
    # get rid of newlines
    text = text.strip().replace("\n", " ").replace("\r", " ")
    
    # replace twitter @mentions
    mentionFinder = re.compile(r"@[a-z0-9_]{1,15}", re.IGNORECASE)
    text = mentionFinder.sub("@MENTION", text)
    
    # replace HTML symbols
    text = text.replace("&amp;", "and").replace("&gt;", ">").replace("&lt;", "<")
    
    # lowercase
    text = text.lower()

    return text

# A custom function to tokenize the text using spaCy
# and convert to lemmas
def tokenizeText(sample):

    # get the tokens using spaCy
    tokens = parser(sample)

    # lemmatize
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas

    # stoplist the tokens
    tokens = [tok for tok in tokens if tok not in STOPLIST]

    # stoplist symbols
    tokens = [tok for tok in tokens if tok not in SYMBOLS]

    # remove large strings of whitespace
    while "" in tokens:
        tokens.remove("")
    while " " in tokens:
        tokens.remove(" ")
    while "\n" in tokens:
        tokens.remove("\n")
    while "\n\n" in tokens:
        tokens.remove("\n\n")

    return tokens

def printNMostInformative(vectorizer, clf, N):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)
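
A quick, optional sanity check of the cleaner and tokenizer on a toy string (the exact lemmas depend on the spaCy model version installed):


In [ ]:
sample = "The dogs WERE barking at the cats &amp; running around.\n"
print(cleanText(sample))
print(tokenizeText(cleanText(sample)))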

Choose the appropriate scikit-learn vectorizer to create a term-document matrix

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents.

feature_extraction.text.CountVectorizer([...]) - Convert a collection of text documents to a matrix of token counts

feature_extraction.text.HashingVectorizer([...]) - Convert a collection of text documents to a matrix of token occurrences

feature_extraction.text.TfidfTransformer([...]) - Transform a count matrix to a normalized tf or tf-idf representation

feature_extraction.text.TfidfVectorizer([...]) - Convert a collection of raw documents to a matrix of TF-IDF features

CountVectorizer is not the right choice here; go with TfidfVectorizer. If this needs to scale to something much bigger one day, test out HashingVectorizer (see the sketch below).
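
If the corpus ever grows far beyond a few dozen documents, a possible swap (a sketch only, not used below) is HashingVectorizer followed by TfidfTransformer, which avoids holding the full vocabulary in memory:


In [ ]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical scaled-up pipeline: hashing avoids keeping an explicit vocabulary,
# and TfidfTransformer then re-weights the hashed term counts. Depending on the
# scikit-learn version, pass non_negative=True (older) or alternate_sign=False
# (newer) to HashingVectorizer so the hashed counts stay non-negative.
big_pipe = Pipeline([
    ('cleanText', CleanTextTransformer()),
    ('hasher', HashingVectorizer(tokenizer=tokenizeText, ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer())
])
# p_big = big_pipe.fit_transform(documents)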


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
# (uses the spaCy-based tokenizer tokenizeText defined above)

Put the cleaner and vectorizer in a pipeline and fit


In [6]:
# the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])

In [ ]:
p = pipe.fit_transform(documents)   # takes 20 min with ~10 textbooks!

Multiply the term-document matrix by its transpose to compute the similarity between all pairs of documents.


In [11]:
pairwise_similarity = (p * p.T).A #  In Scipy, .A transforms a sparse matrix to a dense one
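
Because TfidfVectorizer L2-normalizes each row by default, these dot products are cosine similarities; an optional cross-check against scikit-learn's own implementation:


In [ ]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# should match (p * p.T).A to within floating-point error
assert np.allclose(pairwise_similarity, cosine_similarity(p))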

Captions and books together (the video caption titles come first; if a name ends with .txt it is a book)


In [47]:
df9 = pd.DataFrame(pairwise_similarity, columns=column_labels, index=row_labels)
df9.head(3)


Out[47]:
ALL ABOUT LIVING WITH BOXER DOGS Lulu the Lab - Basic dog training in Austin (512) 927-9443 PetSmart Puppy Training: Feeding a Puppy 6. Reconstruction from Compressed Representation Unit 6 8 Supervised vs Unsupervised Learning How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION! Smart and Funny Maine Coon Cat Leo Patents His Invention - Cat Toy using Drill Tutorial My Cats Review the Licki Brush Crispy Spicy Fried Cauliflower - Frittierte knusprige Blumenkohl - vegetarisch - Arabisch Kochen Feuerwehrfrau kocht kinderleichten Nudelauflauf ... Excel2007VBA.txt chp_handbook.txt ElectricPowerGeneration.txt ChemicalProcessDesign.txt EnergyOptimization.txt InternalCombustionEngines.txt ISLR Sixth Printing.txt machineLearning_chapmanHall.txt Ensemble methods - Zhou.txt huntingDogs.txt
ALL ABOUT LIVING WITH BOXER DOGS 1.000000 0.162420 0.111564 0.023014 0.021225 0.090717 0.028510 0.068924 0.017389 0.061817 ... 0.027938 0.022023 0.027553 0.028639 0.024127 0.017754 0.028871 0.023511 0.014963 0.399497
Lulu the Lab - Basic dog training in Austin (512) 927-9443 0.162420 1.000000 0.060075 0.009552 0.005042 0.025828 0.002559 0.020363 0.005485 0.013808 ... 0.009829 0.003914 0.006152 0.002833 0.005993 0.005047 0.013376 0.007113 0.010723 0.181076
PetSmart Puppy Training: Feeding a Puppy 0.111564 0.060075 1.000000 0.028677 0.024602 0.038684 0.017653 0.094051 0.029028 0.095519 ... 0.059873 0.028281 0.026857 0.039062 0.031976 0.016927 0.031147 0.027284 0.024467 0.062205

3 rows × 35 columns


In [13]:
import numpy as np  # save the ~15 minutes from last time by caching the results
#np.save('pairwise_similarity_35textsAndCaptions', pairwise_similarity)   
#np.save('35textLabels', row_labels)
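
On a later run the cached arrays can be reloaded instead of refitting the pipeline (NumPy appends .npy to the file names used above):


In [ ]:
from os.path import isfile

# reload the cached similarity matrix and labels if they were saved earlier
if isfile('pairwise_similarity_35textsAndCaptions.npy'):
    pairwise_similarity = np.load('pairwise_similarity_35textsAndCaptions.npy')
    row_labels = list(np.load('35textLabels.npy'))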

In [26]:
pairwise_similarity[0]


Out[26]:
array([ 1.        ,  0.1624195 ,  0.11156364,  0.02301434,  0.02122543,
        0.09071699,  0.02850971,  0.06892372,  0.01738899,  0.06181667,
        0.0854467 ,  0.1053678 ,  0.03638041,  0.02099344,  0.01741988,
        0.04163213,  0.02922342,  0.03679229,  0.04515691,  0.17137324,
        0.01482046,  0.02065474,  0.07567835,  0.04460489,  0.0435978 ,
        0.02793786,  0.02202279,  0.02755284,  0.02863862,  0.0241265 ,
        0.01775361,  0.02887078,  0.0235112 ,  0.01496311,  0.39949674])

In [28]:
(-pairwise_similarity[0]).argsort()


Out[28]:
array([ 0, 34, 19,  1,  2, 11,  5, 10, 22,  7,  9, 18, 23, 24, 15, 17, 12,
       16, 31, 28,  6, 25, 27, 29, 32,  3, 26,  4, 13, 21, 30, 14,  8, 33,
       20])

In [29]:
(-pairwise_similarity[0]).argsort()[1:4]


Out[29]:
array([34, 19,  1])

In [31]:
row_labels[0]
# http://stackoverflow.com/questions/18272160/access-multiple-elements-of-list-knowing-their-index


Out[31]:
'ALL ABOUT LIVING WITH BOXER DOGS'

In [37]:
(np.array(row_labels))[((-pairwise_similarity[0]).argsort()[1:4])]


Out[37]:
array(['huntingDogs.txt', 'TheDomesticCat.txt',
       'Lulu the Lab - Basic dog training in Austin (512) 927-9443'], 
      dtype='<U96')

In [45]:
for i in range(len(row_labels)):
    print (row_labels[i], '\n', (np.array(row_labels))[((-pairwise_similarity[i]).argsort()[1:4])], '\n')


ALL ABOUT LIVING WITH BOXER DOGS 
 ['huntingDogs.txt' 'TheDomesticCat.txt'
 'Lulu the Lab - Basic dog training in Austin (512) 927-9443'] 

Lulu the Lab - Basic dog training in Austin (512) 927-9443 
 ['huntingDogs.txt' 'ALL ABOUT LIVING WITH BOXER DOGS'
 'PetSmart Puppy Training: Feeding a Puppy'] 

PetSmart Puppy Training: Feeding a Puppy 
 ['ALL ABOUT LIVING WITH BOXER DOGS'
 '• Kochen bei Freunden (Vlog) • (Ep. 16)'
 'Feuerwehrfrau kocht kinderleichten Nudelauflauf'] 

6. Reconstruction from Compressed Representation 
 ['machineLearning_chapmanHall.txt' 'DataScienceBookV3.txt' 'datastyle.txt'] 

Unit 6 8 Supervised vs Unsupervised Learning 
 ['Ensemble methods - Zhou.txt' 'datastyle.txt' 'DataScienceBookV3.txt'] 

How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION! 
 ['TheDomesticCat.txt' 'My Cats Review the Licki Brush'
 'DataScienceBookV3.txt'] 

Smart and Funny Maine Coon Cat Leo Patents His Invention - Cat Toy using Drill Tutorial 
 ['How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION!'
 'TheDomesticCat.txt' 'My Cats Review the Licki Brush'] 

My Cats Review the Licki Brush 
 ['TheDomesticCat.txt'
 'How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION!'
 'PetSmart Puppy Training: Feeding a Puppy'] 

Crispy Spicy Fried Cauliflower - Frittierte knusprige Blumenkohl - vegetarisch - Arabisch Kochen 
 ['BrianYarvinPloughmansLunch.txt' '• Kochen bei Freunden (Vlog) • (Ep. 16)'
 'Feuerwehrfrau kocht kinderleichten Nudelauflauf'] 

Feuerwehrfrau kocht kinderleichten Nudelauflauf 
 ['• Kochen bei Freunden (Vlog) • (Ep. 16)' 'BrianYarvinPloughmansLunch.txt'
 'PetSmart Puppy Training: Feeding a Puppy'] 

• Kochen bei Freunden (Vlog) • (Ep. 16) 
 ['Feuerwehrfrau kocht kinderleichten Nudelauflauf'
 'BrianYarvinPloughmansLunch.txt'
 'PetSmart Puppy Training: Feeding a Puppy'] 

sheep.txt 
 ['louisianaBeefCattle.txt' 'TheDomesticCat.txt' 'huntingDogs.txt'] 

Corporate Finance.txt 
 ['Contemporary Engineering Economics.txt' 'HeatingCoolingPower.txt'
 'DataAnalysisSQL.txt'] 

Excel2010Advanced.txt 
 ['Excel2007VBA.txt' 'DataScienceBookV3.txt' 'DataAnalysisSQL.txt'] 

distributedAI.txt 
 ['machineLearning_chapmanHall.txt' 'EnergyOptimization.txt'
 'DataScienceBookV3.txt'] 

BrianYarvinPloughmansLunch.txt 
 [ 'Crispy Spicy Fried Cauliflower - Frittierte knusprige Blumenkohl - vegetarisch - Arabisch Kochen'
 'ChemicalProcessDesign.txt' 'machineLearning_chapmanHall.txt'] 

HeatingCoolingPower.txt 
 ['catalog_chptech_full.txt' 'ChemicalProcessDesign.txt'
 'InternalCombustionEngines.txt'] 

louisianaBeefCattle.txt 
 ['sheep.txt' 'TheDomesticCat.txt' 'Contemporary Engineering Economics.txt'] 

datastyle.txt 
 ['DataScienceBookV3.txt' 'ISLR Sixth Printing.txt'
 'machineLearning_chapmanHall.txt'] 

TheDomesticCat.txt 
 ['How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION!'
 'My Cats Review the Licki Brush' 'sheep.txt'] 

catalog_chptech_full.txt 
 ['HeatingCoolingPower.txt' 'InternalCombustionEngines.txt'
 'chp_handbook.txt'] 

appliedStatistics.txt 
 ['machineLearning_chapmanHall.txt' 'ISLR Sixth Printing.txt'
 'EnergyOptimization.txt'] 

DataScienceBookV3.txt 
 ['datastyle.txt' 'ISLR Sixth Printing.txt'
 'machineLearning_chapmanHall.txt'] 

DataAnalysisSQL.txt 
 ['DataScienceBookV3.txt' 'datastyle.txt' 'ISLR Sixth Printing.txt'] 

Contemporary Engineering Economics.txt 
 ['Corporate Finance.txt' 'HeatingCoolingPower.txt' 'chp_handbook.txt'] 

Excel2007VBA.txt 
 ['Excel2010Advanced.txt' 'DataScienceBookV3.txt' 'DataAnalysisSQL.txt'] 

chp_handbook.txt 
 ['catalog_chptech_full.txt' 'HeatingCoolingPower.txt'
 'Contemporary Engineering Economics.txt'] 

ElectricPowerGeneration.txt 
 ['HeatingCoolingPower.txt' 'ChemicalProcessDesign.txt'
 'EnergyOptimization.txt'] 

ChemicalProcessDesign.txt 
 ['EnergyOptimization.txt' 'HeatingCoolingPower.txt'
 'catalog_chptech_full.txt'] 

EnergyOptimization.txt 
 ['ChemicalProcessDesign.txt' 'HeatingCoolingPower.txt'
 'machineLearning_chapmanHall.txt'] 

InternalCombustionEngines.txt 
 ['catalog_chptech_full.txt' 'HeatingCoolingPower.txt'
 'ChemicalProcessDesign.txt'] 

ISLR Sixth Printing.txt 
 ['machineLearning_chapmanHall.txt' 'datastyle.txt' 'DataScienceBookV3.txt'] 

machineLearning_chapmanHall.txt 
 ['ISLR Sixth Printing.txt' 'appliedStatistics.txt' 'DataScienceBookV3.txt'] 

Ensemble methods - Zhou.txt 
 ['machineLearning_chapmanHall.txt' 'ISLR Sixth Printing.txt'
 'EnergyOptimization.txt'] 

huntingDogs.txt 
 ['ALL ABOUT LIVING WITH BOXER DOGS' 'TheDomesticCat.txt'
 'Lulu the Lab - Basic dog training in Austin (512) 927-9443'] 

Fantastic, it's working! Similar YouTube captions and book texts seem, in most cases, to have found each other :)
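
For reuse, the argsort lookups above can be wrapped in a small helper (a sketch; top_matches is a hypothetical name not used elsewhere in the notebook):


In [ ]:
def top_matches(label, n=3):
    """Return the n most similar documents to the one named by label."""
    i = row_labels.index(label)
    nearest = (-pairwise_similarity[i]).argsort()[1:n + 1]   # position 0 is the document itself
    return pd.DataFrame({'match': np.array(row_labels)[nearest],
                         'similarity': pairwise_similarity[i][nearest]})

top_matches('ALL ABOUT LIVING WITH BOXER DOGS')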

If needed, examine the contents of the transformed term-document matrix (the output of fit_transform) with the following cell


In [ ]:
print("----------------------------------------------------------------------------------------------")
print("The original data as it appeared to the classifier after tokenizing, lemmatizing, stoplisting, etc")

transform = p 

# get the features that the vectorizer learned (its vocabulary)
vocab = vectorizer.get_feature_names()

# the values from the vectorizer-transformed data (each item is a (row, column) index with the tf-idf weight of that term in the sample, stored as a sparse matrix)
for i in range(len(documents)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx, num in zip(indexIntoVocab, numOccurences):
        s += str((vocab[idx], num))
    print("Sample {}: {}".format(i, s))