TF-IDF demo



In [ ]:

    
import h2o

h2o.init()

Data

Data sources:



In [3]:

    
from collections import OrderedDict

documents = [
    'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',
    'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\'s net to score goals. The sport is known to be fast-paced and physical.',
    'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'
]
doc_ids = list(range(len(documents)))

input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),
                            column_types=['numeric', 'string'])
input_frame.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%

Out[3]:

  DocID Document
      0 H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.
      1 Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. The sport is known to be fast-paced and physical.
      2 An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.

TF-IDF with pre-processing



In [4]:

    
from h2o.information_retrieval.tf_idf import tf_idf

# With preprocess=False the Document column is not tokenized, so each document
# is treated as a single token; case_sensitive=False lowercases it.
tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', preprocess=False, case_sensitive=False)
tf_idf_out.head()

Out[4]:

  DocID Word    TF      IDF   TF-IDF
      2 an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.    1 0.693147 0.693147
      0 h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.    1 0.693147 0.693147
      1 ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.    1 0.693147 0.693147



In [5]:

    
from IPython.display import display

VALUES_CNT_TO_SHOW = 3

def tf_idf_output_summary(tf_idf_out):
    """Show the highest and lowest TF-IDF rows for each document in doc_ids."""
    for doc_id in doc_ids:
        # Sort ascending by TF-IDF: lowest scores come first, highest last.
        sorted_doc_tf_idfs = tf_idf_out[tf_idf_out['DocID'] == doc_id].sort(by='TF-IDF')
        print('The highest TF-IDF values for document ' + str(doc_id) + ':')
        display(sorted_doc_tf_idfs.tail(VALUES_CNT_TO_SHOW))
        print('The lowest TF-IDF values for document ' + str(doc_id) + ':')
        display(sorted_doc_tf_idfs.head(VALUES_CNT_TO_SHOW))
        print('\n')



In [6]:

    
tf_idf_output_summary(tf_idf_out)

The highest TF-IDF values for document 0:

  DocID Word    TF      IDF   TF-IDF
      0 h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.    1 0.693147 0.693147

The lowest TF-IDF values for document 0:

  DocID Word    TF      IDF   TF-IDF
      0 h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.    1 0.693147 0.693147


The highest TF-IDF values for document 1:

  DocID Word    TF      IDF   TF-IDF
      1 ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.    1 0.693147 0.693147

The lowest TF-IDF values for document 1:

  DocID Word    TF      IDF   TF-IDF
      1 ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.    1 0.693147 0.693147


The highest TF-IDF values for document 2:

  DocID Word    TF      IDF   TF-IDF
      2 an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.    1 0.693147 0.693147

The lowest TF-IDF values for document 2:

  DocID Word    TF      IDF   TF-IDF
      2 an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.    1 0.693147 0.693147

TF-IDF without pre-processing



In [7]:

    
# Tokenize manually: one (DocID, word) row per whitespace-separated token.
preprocessed_data = [(doc_id, word) for doc_id, document in enumerate(documents) for word in document.split()]

preprocessed_input_frame = h2o.H2OFrame(preprocessed_data,
                                        column_names=['DocID', 'Document'],
                                        column_types=['numeric', 'string'])
preprocessed_input_frame.head()

Parse progress: |█████████████████████████████████████████████████████████| 100%

Out[7]:

  DocID Document
      0 H2O
      0 is
      0 an
      0 in-memory
      0 platform
      0 for
      0 distributed,
      0 scalable
      0 machine
      0 learning.
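
The whitespace split above keeps punctuation attached to tokens (for example `distributed,` and `learning.` in the output), so `learning.` and `learning` would be counted as different words. As a minimal sketch of one possible extra cleaning step, which is not part of the original notebook, the hypothetical helper `clean_token` below strips leading and trailing punctuation using only the standard library. The sample text is the first sentence of document 0.

```python
import string

def clean_token(token):
    """Hypothetical helper: strip leading/trailing punctuation, keep inner characters."""
    return token.strip(string.punctuation)

sample = 'H2O is an in-memory platform for distributed, scalable machine learning.'

# One (DocID, word) row per non-empty cleaned token, mirroring preprocessed_data.
cleaned_rows = [(0, clean_token(word)) for word in sample.split() if clean_token(word)]
print(cleaned_rows)
```

Rows built this way could be passed to `h2o.H2OFrame` exactly like `preprocessed_data`; whether stripping punctuation is desirable depends on the corpus.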



In [8]:

    
tf_idf_out = tf_idf(preprocessed_input_frame, 'DocID', 'Document', preprocess=False)
tf_idf_out.head()

Out[8]:

  DocID Word      TF      IDF   TF-IDF
      2 (Ab),      1 0.693147 0.693147
      2 (Ig),      1 0.693147 0.693147
      2 An         1 0.693147 0.693147
      0 Flow       1 0.693147 0.693147
      0 H2O        2 0.693147 1.38629
      0 Hadoop     1 0.693147 0.693147
      1 Ice        1 0.693147 0.693147
      0 JSON       1 0.693147 0.693147
      0 Java,      1 0.693147 0.693147
      0 Python,    1 0.693147 0.693147
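
In the table above, the TF-IDF column is the product of the TF and IDF columns (for `H2O`, 2 × 0.693147 ≈ 1.38629). The IDF value 0.693147 equals ln 2, which is what a smoothed inverse document frequency ln((N + 1) / (df + 1)) gives for N = 3 documents and a word found in df = 1 document. That formula is inferred from the numbers shown here, not taken from the H2O documentation, so the sketch below is an assumption that happens to reproduce them.

```python
import math

N_DOCS = 3  # number of documents in this demo

def smoothed_idf(doc_freq, n_docs=N_DOCS):
    """Assumed formula: ln((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (doc_freq + 1))

# IDF for a word that occurs in 1, 2 or all 3 documents.
for df in (1, 2, 3):
    print(df, round(smoothed_idf(df), 6))

# TF-IDF is TF * IDF: 'H2O' occurs twice, and only in document 0.
h2o_tf_idf = 2 * smoothed_idf(1)
print(round(h2o_tf_idf, 5))
```

Under this assumption a word that occurs in every document gets an IDF of 0, so its TF-IDF is 0 regardless of how often it appears.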



In [9]:

    
tf_idf_output_summary(tf_idf_out)

The highest TF-IDF values for document 0:

  DocID Word    TF      IDF   TF-IDF
      0 works    1 0.693147 0.693147
      0 H2O      2 0.693147 1.38629
      0 like     2 0.693147 1.38629

The lowest TF-IDF values for document 0:

  DocID Word    TF      IDF   TF-IDF
      0 and      3 0       0
      0 is       1 0       0
      0 an       1 0.287682 0.287682


The highest TF-IDF values for document 1:

  DocID Word    TF      IDF   TF-IDF
      1 in       2 0.693147  1.38629
      1 sport    2 0.693147  1.38629
      1 their    2 0.693147  1.38629

The lowest TF-IDF values for document 1:

  DocID Word    TF      IDF   TF-IDF
      1 and      1 0       0
      1 is       2 0       0
      1 known    1 0.287682 0.287682


The highest TF-IDF values for document 2:

  DocID Word      TF      IDF   TF-IDF
      2 viruses.    1 0.693147 0.693147
      2 as         2 0.693147 1.38629
      2 by         2 0.693147 1.38629

The lowest TF-IDF values for document 2:

  DocID Word    TF      IDF   TF-IDF
      2 and      1 0       0
      2 is       2 0       0
      2 a        1 0.287682 0.287682
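
As a cross-check of the tables above, the following pure-Python sketch recomputes TF-IDF from the whitespace-split tokens without H2O. It assumes raw term counts for TF and a smoothed IDF of ln((N + 1) / (df + 1)); both are inferred from the numbers above rather than taken from H2O's implementation, and `tf_idf_table` is a hypothetical helper name.

```python
import math
from collections import Counter

documents = [
    'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',
    'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\'s net to score goals. The sport is known to be fast-paced and physical.',
    'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'
]

def tf_idf_table(docs):
    """Return {(doc_id, word): (tf, idf, tf_idf)} for whitespace-split tokens."""
    term_counts = [Counter(doc.split()) for doc in docs]
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for counts in term_counts for word in counts)
    n_docs = len(docs)
    table = {}
    for doc_id, counts in enumerate(term_counts):
        for word, tf in counts.items():
            idf = math.log((n_docs + 1) / (doc_freq[word] + 1))  # assumed smoothing
            table[(doc_id, word)] = (tf, idf, tf * idf)
    return table

table = tf_idf_table(documents)
for key in [(0, 'H2O'), (0, 'an'), (0, 'and')]:
    tf, idf, score = table[key]
    print(key, tf, round(idf, 6), round(score, 6))
```

Common words such as `and` and `is` occur in all three documents, so their score is 0 however often they appear, while a word repeated within a single document (such as `H2O`) scores highest.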

Case-insensitive TF-IDF



In [10]:

    
input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),
                            column_types=['numeric', 'string'])

Parse progress: |█████████████████████████████████████████████████████████| 100%



In [11]:

    
tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', case_sensitive=False)
tf_idf_out.head()

Out[11]:

  DocID Word    TF      IDF   TF-IDF
      2 (ab),    1 0.693147 0.693147
      2 (ig),    1 0.693147 0.693147
      1 a        3 0.287682 0.863046
      2 a        1 0.287682 0.287682
      2 also     1 0.693147 0.693147
      0 an       1 0.287682 0.287682
      2 an       2 0.287682 0.575364
      0 and      3 0       0
      1 and      1 0       0
      2 and      1 0       0
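
Lowercasing merges tokens that differ only in case. In the case-sensitive run, `An` appears as its own word in document 2 with TF = 1, whereas the table above shows `an` with TF = 2 for the same document. A minimal sketch of that effect, using plain Python counters on whitespace-split tokens (an illustration, not H2O's implementation):

```python
from collections import Counter

doc_2 = ('An antibody (Ab), also known as an immunoglobulin (Ig), is a large, '
         'Y-shaped protein produced mainly by plasma cells that is used by the '
         'immune system to neutralize pathogens such as pathogenic bacteria and viruses.')

# Case-sensitive: 'An' and 'an' are counted as different words.
case_sensitive_counts = Counter(doc_2.split())
# Case-insensitive: lowercase first, so both forms collapse into 'an'.
case_insensitive_counts = Counter(doc_2.lower().split())

print(case_sensitive_counts['An'], case_sensitive_counts['an'])
print(case_insensitive_counts['an'])
```

Merging case variants also changes document frequencies, which is why some IDF values differ between the case-sensitive and case-insensitive runs.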



In [12]:

    
tf_idf_output_summary(tf_idf_out)

The highest TF-IDF values for document 0:

  DocID Word    TF      IDF   TF-IDF
      0 works    1 0.693147 0.693147
      0 h2o      2 0.693147 1.38629
      0 like     2 0.693147 1.38629

The lowest TF-IDF values for document 0:

  DocID Word    TF   IDF   TF-IDF
      0 and      3     0        0
      0 is       1     0        0
      0 the      1     0        0


The highest TF-IDF values for document 1:

  DocID Word    TF      IDF   TF-IDF
      1 in       2 0.693147  1.38629
      1 sport    2 0.693147  1.38629
      1 their    2 0.693147  1.38629

The lowest TF-IDF values for document 1:

  DocID Word    TF   IDF   TF-IDF
      1 and      1     0        0
      1 is       2     0        0
      1 the      1     0        0


The highest TF-IDF values for document 2:

  DocID Word      TF      IDF   TF-IDF
      2 y-shaped    1 0.693147 0.693147
      2 as         2 0.693147 1.38629
      2 by         2 0.693147 1.38629

The lowest TF-IDF values for document 2:

  DocID Word    TF   IDF   TF-IDF
      2 and      1     0        0
      2 is       2     0        0
      2 the      1     0        0
