There are 7647 distinct YouTube video caption texts in English, plus a directory containing 24 English books, some of which come from www.gutenberg.org and are proofread. The following scripts compute similarity measures between the captions alone, between the books alone, or between both together. Initially, get it to work on a subset of the 7647...


In [1]:
from collections import OrderedDict
from os import listdir
from os.path import isfile, join
import sys
sys.path.append('../')
import config
import pymysql.cursors
import pandas as pd
import spacy

nlp = spacy.load('en')

connection = pymysql.connect(host='localhost',
                             user='root',
                             password=config.MYSQL_SERVER_PASSWORD,
                             db='youtubeProjectDB',
                             charset='utf8mb4', 
                             cursorclass=pymysql.cursors.DictCursor)


mypath = '../textbooks'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
    
with connection.cursor() as cursor:
    sql = """
    SELECT search_api.videoId, videoTitle, captionsText, wordCount, captions.id
    FROM search_api
    INNER JOIN captions
    ON search_api.videoId = captions.videoId
    WHERE captions.id
    IN (5830, 45, 52, 54, 6195, 6198, 6203, 6208, 14525, 14523, 14518);"""
    cursor.execute(sql)
    manyCaptions = cursor.fetchall()          # list of dict rows (DictCursor)
    videos_df = pd.read_sql(sql, connection)  # same query again, as a DataFrame

connection.close()

Place the captions and books into ordered dictionaries with keys which identify their contents


In [2]:
L1 = []
L2 = []
for file in onlyfiles:
    L1.append((file ,  (open(mypath + '/' + file, 'r').read()) ))
TextBooksDict = OrderedDict(L1)

for item in manyCaptions:
    #  L2.append((item.get('id')  ,  item.get('captionsText')))  # 'id' key is lower case!!!
    L2.append((item.get('videoTitle')  ,  item.get('captionsText')))
ManyCaptionsDict = OrderedDict(L2)   

# Merge the OrderedDicts (captions first, then books)
L3 = []
for k, v in ManyCaptionsDict.items():
    L3.append((k, v))
for k, v in TextBooksDict.items():
    L3.append((k, v))
UnitedOrderedDict = OrderedDict(L3)

videos_df['characterCount'] = videos_df['captionsText'].map(len)
videos_df['charPerWord'] = videos_df.characterCount / videos_df.wordCount
# reorder the columns
videos_df = videos_df.reindex(columns=['videoTitle','characterCount','wordCount', 'charPerWord','captionsText','id', 'videoId'])

Use pickles to avoid rerunning the word-count cell unnecessarily


In [6]:
textbooks_df = pd.read_pickle('textbooksDF.pickle')

In [32]:
# NB - this cell can take minutes to run; load from the pickle instead if nothing has been added.
# https://chrisalbon.com/python/pandas_create_column_with_loop.html
fileName = [k for k in TextBooksDict.keys()]
characterCount = [len(TextBooksDict.get(k)) for k in TextBooksDict.keys()]
wordCount = [len(nlp(TextBooksDict.get(k))) for k in TextBooksDict.keys()]
raw_data = {'fileName' : fileName,
            'characterCount': characterCount,
            'wordCount':wordCount}
textbooks_df = pd.DataFrame(raw_data, columns = ['fileName', 'characterCount', 'wordCount'])
textbooks_df['charPerWord'] = textbooks_df.characterCount / textbooks_df.wordCount
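
If the full spaCy parse proves too slow just for counting tokens, a rough whitespace-based count is one possible fallback (a sketch only; counts will differ slightly from spaCy's tokenization, and it is not what the cell above uses):


In [ ]:
# Hypothetical fallback: approximate word counts with a plain whitespace split
# instead of a full spaCy parse; much faster, but tokenization differs slightly.
approxWordCount = [len(TextBooksDict[k].split()) for k in TextBooksDict]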

In [42]:
textbooks_df.to_pickle('textbooksDF.pickle')
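
A minimal sketch of a guard that loads the cached dataframe when the pickle already exists and only recomputes otherwise (assuming the same 'textbooksDF.pickle' file name):


In [ ]:
from os.path import isfile

if isfile('textbooksDF.pickle'):
    # cheap path: load the cached word counts
    textbooks_df = pd.read_pickle('textbooksDF.pickle')
else:
    # no cache yet: run the slow word-count cell above first, then save
    textbooks_df.to_pickle('textbooksDF.pickle')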

Samples of both dataframes


In [4]:
videos_df[['videoTitle', 'characterCount', 'wordCount', 'charPerWord']].head(5)


Out[4]:
videoTitle characterCount wordCount charPerWord
0 ALL ABOUT LIVING WITH BOXER DOGS 16843 3862 4.361212
1 Lulu the Lab - Basic dog training in Austin (5... 485 100 4.850000
2 PetSmart Puppy Training: Feeding a Puppy 2953 677 4.361891
3 6. Reconstruction from Compressed Representation 3448 735 4.691156
4 Unit 6 8 Supervised vs Unsupervised Learning 2054 395 5.200000

In [4]:
textbooks_df.head(5)


Out[4]:
fileName characterCount wordCount charPerWord
0 sheep.txt 427001 92664 4.608057
1 Corporate Finance.txt 3141127 691207 4.544409
2 Excel2010Advanced.txt 287867 61451 4.684497
3 distributedAI.txt 175861 42156 4.171672
4 BrianYarvinPloughmansLunch.txt 287345 65212 4.406321

In [7]:
print (textbooks_df.wordCount.mean())
print (videos_df.wordCount.mean())


226327.125
835.727272727

Execute one of the following three cells to run similarity tests between:

i) Just the textbooks

ii) Just the video captions

iii) textbooks and video captions


In [5]:
documents = [TextBooksDict.get(key) for key in list(TextBooksDict)]
# the following two lists are used to pretty-print results at the bottom of the
# notebook, putting labels back onto an otherwise unlabeled NumPy array
row_labels = list(TextBooksDict)
column_labels = list(TextBooksDict)

In [180]:
documents = [ManyCaptionsDict.get(key) for key in list(ManyCaptionsDict)]

row_labels = list(ManyCaptionsDict)
column_labels = list(ManyCaptionsDict)

In [3]:
documents = [UnitedOrderedDict.get(key) for key in list(UnitedOrderedDict)]

row_labels = list(UnitedOrderedDict)
column_labels = list(UnitedOrderedDict)

Adapt the script from https://nicschrading.com/project/Intro-to-NLP-with-spaCy/, which uses the NLP library spaCy, to create a text cleaner and text tokenizer for the pipeline.


In [4]:
from spacy.en import English
parser = English()

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re

# A custom stoplist
STOPLIST = set(stopwords.words('english') + ["n't", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS))
# List of symbols we don't care about
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-----", "---", "...", "“", "”", "'ve"]

# Every step in a pipeline needs to be a "transformer". 
# Define a custom transformer to clean text using spaCy
class CleanTextTransformer(TransformerMixin):
    """
    Convert text to cleaned text
    """

    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}
    
# A custom function to clean the text before sending it into the vectorizer
def cleanText(text):
    # get rid of newlines
    text = text.strip().replace("\n", " ").replace("\r", " ")
    
    # replace twitter @mentions
    mentionFinder = re.compile(r"@[a-z0-9_]{1,15}", re.IGNORECASE)
    text = mentionFinder.sub("@MENTION", text)
    
    # replace HTML symbols
    text = text.replace("&amp;", "and").replace("&gt;", ">").replace("&lt;", "<")
    
    # lowercase
    text = text.lower()

    return text

# A custom function to tokenize the text using spaCy
# and convert to lemmas
def tokenizeText(sample):

    # get the tokens using spaCy
    tokens = parser(sample)

    # lemmatize
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas

    # stoplist the tokens
    tokens = [tok for tok in tokens if tok not in STOPLIST]

    # stoplist symbols
    tokens = [tok for tok in tokens if tok not in SYMBOLS]

    # remove large strings of whitespace
    while "" in tokens:
        tokens.remove("")
    while " " in tokens:
        tokens.remove(" ")
    while "\n" in tokens:
        tokens.remove("\n")
    while "\n\n" in tokens:
        tokens.remove("\n\n")

    return tokens

def printNMostInformative(vectorizer, clf, N):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)
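
A quick, optional sanity check of the cleaner and tokenizer on a toy string (the exact lemmas depend on the spaCy model version installed):


In [ ]:
sample = "The dogs WERE barking at the cats &amp; running around.\n"
print(cleanText(sample))
print(tokenizeText(cleanText(sample)))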

Choose the appropriate scikit-learn vectorizer to create a term-document matrix

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents.

feature_extraction.text.CountVectorizer([...]) - Convert a collection of text documents to a matrix of token counts

feature_extraction.text.HashingVectorizer([...]) - Convert a collection of text documents to a matrix of token occurrences

feature_extraction.text.TfidfTransformer([...]) - Transform a count matrix to a normalized tf or tf-idf representation

feature_extraction.text.TfidfVectorizer([...]) - Convert a collection of raw documents to a matrix of TF-IDF features

CountVectorizer is not the right choice here; go with TfidfVectorizer. If this needs to scale to something much bigger one day, test out HashingVectorizer (see the sketch below).
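
If the corpus ever grows far beyond a few dozen documents, a possible swap (a sketch only, not used below) is HashingVectorizer followed by TfidfTransformer, which avoids holding the full vocabulary in memory:


In [ ]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical scaled-up pipeline: hashing avoids keeping an explicit vocabulary,
# and TfidfTransformer then re-weights the hashed term counts. Depending on the
# scikit-learn version, pass non_negative=True (older) or alternate_sign=False
# (newer) to HashingVectorizer so the hashed counts stay non-negative.
big_pipe = Pipeline([
    ('cleanText', CleanTextTransformer()),
    ('hasher', HashingVectorizer(tokenizer=tokenizeText, ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer())
])
# p_big = big_pipe.fit_transform(documents)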


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
# (uses the spaCy-based tokenizer tokenizeText defined above)

Put the cleaner and vectorizer in a pipeline and fit


In [6]:
# the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])

In [ ]:
p = pipe.fit_transform(documents)   # takes 20 min with ~10 textbooks!

Multiply the term-document matrix by its transpose to compute the similarity between all pairs of documents.


In [11]:
pairwise_similarity = (p * p.T).A #  In Scipy, .A transforms a sparse matrix to a dense one
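
Because TfidfVectorizer L2-normalizes each row by default, these dot products are cosine similarities; an optional cross-check against scikit-learn's own implementation:


In [ ]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# should match (p * p.T).A to within floating-point error
assert np.allclose(pairwise_similarity, cosine_similarity(p))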

Captions and books together (the video caption titles come first; if a name ends with .txt it is a book)


In [47]:
df9 = pd.DataFrame(pairwise_similarity, columns=column_labels, index=row_labels)
df9.head(3)


Out[47]:
ALL ABOUT LIVING WITH BOXER DOGS Lulu the Lab - Basic dog training in Austin (512) 927-9443 PetSmart Puppy Training: Feeding a Puppy 6. Reconstruction from Compressed Representation Unit 6 8 Supervised vs Unsupervised Learning How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION! Smart and Funny Maine Coon Cat Leo Patents His Invention - Cat Toy using Drill Tutorial My Cats Review the Licki Brush Crispy Spicy Fried Cauliflower - Frittierte knusprige Blumenkohl - vegetarisch - Arabisch Kochen Feuerwehrfrau kocht kinderleichten Nudelauflauf ... Excel2007VBA.txt chp_handbook.txt ElectricPowerGeneration.txt ChemicalProcessDesign.txt EnergyOptimization.txt InternalCombustionEngines.txt ISLR Sixth Printing.txt machineLearning_chapmanHall.txt Ensemble methods - Zhou.txt huntingDogs.txt
ALL ABOUT LIVING WITH BOXER DOGS 1.000000 0.162420 0.111564 0.023014 0.021225 0.090717 0.028510 0.068924 0.017389 0.061817 ... 0.027938 0.022023 0.027553 0.028639 0.024127 0.017754 0.028871 0.023511 0.014963 0.399497
Lulu the Lab - Basic dog training in Austin (512) 927-9443 0.162420 1.000000 0.060075 0.009552 0.005042 0.025828 0.002559 0.020363 0.005485 0.013808 ... 0.009829 0.003914 0.006152 0.002833 0.005993 0.005047 0.013376 0.007113 0.010723 0.181076
PetSmart Puppy Training: Feeding a Puppy 0.111564 0.060075 1.000000 0.028677 0.024602 0.038684 0.017653 0.094051 0.029028 0.095519 ... 0.059873 0.028281 0.026857 0.039062 0.031976 0.016927 0.031147 0.027284 0.024467 0.062205

3 rows × 35 columns


In [13]:
import numpy as np  # save the ~15 minutes from last time by caching the results
#np.save('pairwise_similarity_35textsAndCaptions', pairwise_similarity)   
#np.save('35textLabels', row_labels)
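
On a later run the cached arrays can be reloaded instead of refitting the pipeline (NumPy appends .npy to the file names used above):


In [ ]:
from os.path import isfile

# reload the cached similarity matrix and labels if they were saved earlier
if isfile('pairwise_similarity_35textsAndCaptions.npy'):
    pairwise_similarity = np.load('pairwise_similarity_35textsAndCaptions.npy')
    row_labels = list(np.load('35textLabels.npy'))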

In [26]:
pairwise_similarity[0]


Out[26]:
array([ 1.        ,  0.1624195 ,  0.11156364,  0.02301434,  0.02122543,
        0.09071699,  0.02850971,  0.06892372,  0.01738899,  0.06181667,
        0.0854467 ,  0.1053678 ,  0.03638041,  0.02099344,  0.01741988,
        0.04163213,  0.02922342,  0.03679229,  0.04515691,  0.17137324,
        0.01482046,  0.02065474,  0.07567835,  0.04460489,  0.0435978 ,
        0.02793786,  0.02202279,  0.02755284,  0.02863862,  0.0241265 ,
        0.01775361,  0.02887078,  0.0235112 ,  0.01496311,  0.39949674])

In [28]:
(-pairwise_similarity[0]).argsort()


Out[28]:
array([ 0, 34, 19,  1,  2, 11,  5, 10, 22,  7,  9, 18, 23, 24, 15, 17, 12,
       16, 31, 28,  6, 25, 27, 29, 32,  3, 26,  4, 13, 21, 30, 14,  8, 33,
       20])

In [29]:
(-pairwise_similarity[0]).argsort()[1:4]


Out[29]:
array([34, 19,  1])

In [31]:
row_labels[0]
# http://stackoverflow.com/questions/18272160/access-multiple-elements-of-list-knowing-their-index


Out[31]:
'ALL ABOUT LIVING WITH BOXER DOGS'

In [37]:
(np.array(row_labels))[((-pairwise_similarity[0]).argsort()[1:4])]


Out[37]:
array(['huntingDogs.txt', 'TheDomesticCat.txt',
       'Lulu the Lab - Basic dog training in Austin (512) 927-9443'], 
      dtype='<U96')

In [45]:
for i in range(len(row_labels)):
    print (row_labels[i], '\n', (np.array(row_labels))[((-pairwise_similarity[i]).argsort()[1:4])], '\n')


ALL ABOUT LIVING WITH BOXER DOGS 
 ['huntingDogs.txt' 'TheDomesticCat.txt'
 'Lulu the Lab - Basic dog training in Austin (512) 927-9443'] 

Lulu the Lab - Basic dog training in Austin (512) 927-9443 
 ['huntingDogs.txt' 'ALL ABOUT LIVING WITH BOXER DOGS'
 'PetSmart Puppy Training: Feeding a Puppy'] 

PetSmart Puppy Training: Feeding a Puppy 
 ['ALL ABOUT LIVING WITH BOXER DOGS'
 '• Kochen bei Freunden (Vlog) • (Ep. 16)'
 'Feuerwehrfrau kocht kinderleichten Nudelauflauf'] 

6. Reconstruction from Compressed Representation 
 ['machineLearning_chapmanHall.txt' 'DataScienceBookV3.txt' 'datastyle.txt'] 

Unit 6 8 Supervised vs Unsupervised Learning 
 ['Ensemble methods - Zhou.txt' 'datastyle.txt' 'DataScienceBookV3.txt'] 

How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION! 
 ['TheDomesticCat.txt' 'My Cats Review the Licki Brush'
 'DataScienceBookV3.txt'] 

Smart and Funny Maine Coon Cat Leo Patents His Invention - Cat Toy using Drill Tutorial 
 ['How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION!'
 'TheDomesticCat.txt' 'My Cats Review the Licki Brush'] 

My Cats Review the Licki Brush 
 ['TheDomesticCat.txt'
 'How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION!'
 'PetSmart Puppy Training: Feeding a Puppy'] 

Crispy Spicy Fried Cauliflower - Frittierte knusprige Blumenkohl - vegetarisch - Arabisch Kochen 
 ['BrianYarvinPloughmansLunch.txt' '• Kochen bei Freunden (Vlog) • (Ep. 16)'
 'Feuerwehrfrau kocht kinderleichten Nudelauflauf'] 

Feuerwehrfrau kocht kinderleichten Nudelauflauf 
 ['• Kochen bei Freunden (Vlog) • (Ep. 16)' 'BrianYarvinPloughmansLunch.txt'
 'PetSmart Puppy Training: Feeding a Puppy'] 

• Kochen bei Freunden (Vlog) • (Ep. 16) 
 ['Feuerwehrfrau kocht kinderleichten Nudelauflauf'
 'BrianYarvinPloughmansLunch.txt'
 'PetSmart Puppy Training: Feeding a Puppy'] 

sheep.txt 
 ['louisianaBeefCattle.txt' 'TheDomesticCat.txt' 'huntingDogs.txt'] 

Corporate Finance.txt 
 ['Contemporary Engineering Economics.txt' 'HeatingCoolingPower.txt'
 'DataAnalysisSQL.txt'] 

Excel2010Advanced.txt 
 ['Excel2007VBA.txt' 'DataScienceBookV3.txt' 'DataAnalysisSQL.txt'] 

distributedAI.txt 
 ['machineLearning_chapmanHall.txt' 'EnergyOptimization.txt'
 'DataScienceBookV3.txt'] 

BrianYarvinPloughmansLunch.txt 
 [ 'Crispy Spicy Fried Cauliflower - Frittierte knusprige Blumenkohl - vegetarisch - Arabisch Kochen'
 'ChemicalProcessDesign.txt' 'machineLearning_chapmanHall.txt'] 

HeatingCoolingPower.txt 
 ['catalog_chptech_full.txt' 'ChemicalProcessDesign.txt'
 'InternalCombustionEngines.txt'] 

louisianaBeefCattle.txt 
 ['sheep.txt' 'TheDomesticCat.txt' 'Contemporary Engineering Economics.txt'] 

datastyle.txt 
 ['DataScienceBookV3.txt' 'ISLR Sixth Printing.txt'
 'machineLearning_chapmanHall.txt'] 

TheDomesticCat.txt 
 ['How To Socialise An Older Cat, 5 tips! PLUS: COMPETITION!'
 'My Cats Review the Licki Brush' 'sheep.txt'] 

catalog_chptech_full.txt 
 ['HeatingCoolingPower.txt' 'InternalCombustionEngines.txt'
 'chp_handbook.txt'] 

appliedStatistics.txt 
 ['machineLearning_chapmanHall.txt' 'ISLR Sixth Printing.txt'
 'EnergyOptimization.txt'] 

DataScienceBookV3.txt 
 ['datastyle.txt' 'ISLR Sixth Printing.txt'
 'machineLearning_chapmanHall.txt'] 

DataAnalysisSQL.txt 
 ['DataScienceBookV3.txt' 'datastyle.txt' 'ISLR Sixth Printing.txt'] 

Contemporary Engineering Economics.txt 
 ['Corporate Finance.txt' 'HeatingCoolingPower.txt' 'chp_handbook.txt'] 

Excel2007VBA.txt 
 ['Excel2010Advanced.txt' 'DataScienceBookV3.txt' 'DataAnalysisSQL.txt'] 

chp_handbook.txt 
 ['catalog_chptech_full.txt' 'HeatingCoolingPower.txt'
 'Contemporary Engineering Economics.txt'] 

ElectricPowerGeneration.txt 
 ['HeatingCoolingPower.txt' 'ChemicalProcessDesign.txt'
 'EnergyOptimization.txt'] 

ChemicalProcessDesign.txt 
 ['EnergyOptimization.txt' 'HeatingCoolingPower.txt'
 'catalog_chptech_full.txt'] 

EnergyOptimization.txt 
 ['ChemicalProcessDesign.txt' 'HeatingCoolingPower.txt'
 'machineLearning_chapmanHall.txt'] 

InternalCombustionEngines.txt 
 ['catalog_chptech_full.txt' 'HeatingCoolingPower.txt'
 'ChemicalProcessDesign.txt'] 

ISLR Sixth Printing.txt 
 ['machineLearning_chapmanHall.txt' 'datastyle.txt' 'DataScienceBookV3.txt'] 

machineLearning_chapmanHall.txt 
 ['ISLR Sixth Printing.txt' 'appliedStatistics.txt' 'DataScienceBookV3.txt'] 

Ensemble methods - Zhou.txt 
 ['machineLearning_chapmanHall.txt' 'ISLR Sixth Printing.txt'
 'EnergyOptimization.txt'] 

huntingDogs.txt 
 ['ALL ABOUT LIVING WITH BOXER DOGS' 'TheDomesticCat.txt'
 'Lulu the Lab - Basic dog training in Austin (512) 927-9443'] 

Fantastic, it's working! Similar YouTube captions and book texts seem, in most cases, to have found each other :)
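
For reuse, the argsort lookups above can be wrapped in a small helper (a sketch; top_matches is a hypothetical name not used elsewhere in the notebook):


In [ ]:
def top_matches(label, n=3):
    """Return the n most similar documents to the one named by label."""
    i = row_labels.index(label)
    nearest = (-pairwise_similarity[i]).argsort()[1:n + 1]   # position 0 is the document itself
    return pd.DataFrame({'match': np.array(row_labels)[nearest],
                         'similarity': pairwise_similarity[i][nearest]})

top_matches('ALL ABOUT LIVING WITH BOXER DOGS')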

If needed, examine the contents of the transformed term-document matrix (the output of fit_transform) with the following cell


In [ ]:
print("----------------------------------------------------------------------------------------------")
print("The original data as it appeared to the classifier after tokenizing, lemmatizing, stoplisting, etc")

transform = p 

# get the features that the vectorizer learned (its vocabulary)
vocab = vectorizer.get_feature_names()

# the values from the vectorizer-transformed data (each item is a (row, column) index with the tf-idf weight of that term in the sample, stored as a sparse matrix)
for i in range(len(documents)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx, num in zip(indexIntoVocab, numOccurences):
        s += str((vocab[idx], num))
    print("Sample {}: {}".format(i, s))