TF-IDF demo


In [ ]:
import h2o

h2o.init()

Data


In [3]:
from collections import OrderedDict

documents = [
    'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',
    'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\'s net to score goals. The sport is known to be fast-paced and physical.',
    'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'
]
doc_ids = list(range(len(documents)))

input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),
                            column_types=['numeric', 'string'])
input_frame.head()


Parse progress: |█████████████████████████████████████████████████████████| 100%
Out[3]:
DocID  Document
0      H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.
1      Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. The sport is known to be fast-paced and physical.
2      An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.

TF-IDF with pre-processing


In [4]:
from h2o.information_retrieval.tf_idf import tf_idf

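# preprocess=False: the text column is taken as already tokenized, so each whole
# document is counted as a single word (see the output below).
# case_sensitive=False: the text is lowercased before counting.
tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', preprocess=False, case_sensitive=False)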
tf_idf_out.head()


Out[4]:
DocID  Word  TF  IDF  TF-IDF
2      an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.  1  0.693147  0.693147
0      h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.  1  0.693147  0.693147
1      ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.  1  0.693147  0.693147


In [5]:
from IPython.display import display

VALUES_CNT_TO_SHOW = 3

def tf_idf_output_summary(tf_idf_out):
    """Show the highest and lowest TF-IDF rows for each document in doc_ids."""
    for doc_id in doc_ids:
        # Ascending sort: the lowest scores come first, the highest last.
        sorted_doc_tf_idfs = tf_idf_out[tf_idf_out['DocID'] == doc_id].sort(by='TF-IDF')
        print('The highest TF-IDF values for document ' + str(doc_id) + ':')
        display(sorted_doc_tf_idfs.tail(VALUES_CNT_TO_SHOW))
        print('The lowest TF-IDF values for document ' + str(doc_id) + ':')
        display(sorted_doc_tf_idfs.head(VALUES_CNT_TO_SHOW))
        print('\n')

In [6]:
tf_idf_output_summary(tf_idf_out)


The highest TF-IDF values for document 0:
DocID  Word  TF  IDF  TF-IDF
0      h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.  1  0.693147  0.693147

The lowest TF-IDF values for document 0:
DocID  Word  TF  IDF  TF-IDF
0      h2o is an in-memory platform for distributed, scalable machine learning. h2o uses familiar interfaces like r, python, scala, java, json and the flow notebook/web interface, and works seamlessly with big data technologies like hadoop and spark.  1  0.693147  0.693147


The highest TF-IDF values for document 1:
DocID  Word  TF  IDF  TF-IDF
1      ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.  1  0.693147  0.693147

The lowest TF-IDF values for document 1:
DocID  Word  TF  IDF  TF-IDF
1      ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent's net to score goals. the sport is known to be fast-paced and physical.  1  0.693147  0.693147


The highest TF-IDF values for document 2:
DocID  Word  TF  IDF  TF-IDF
2      an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.  1  0.693147  0.693147

The lowest TF-IDF values for document 2:
DocID  Word  TF  IDF  TF-IDF
2      an antibody (ab), also known as an immunoglobulin (ig), is a large, y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.  1  0.693147  0.693147


TF-IDF without pre-processing


In [7]:
# Tokenize manually: one (DocID, word) row per whitespace-separated token.
preprocessed_data = [(doc_id, word)
                     for doc_id, document in enumerate(documents)
                     for word in document.split()]

preprocessed_input_frame = h2o.H2OFrame(preprocessed_data,
                                        column_names=['DocID', 'Document'],
                                        column_types=['numeric', 'string'])
preprocessed_input_frame.head()


Parse progress: |█████████████████████████████████████████████████████████| 100%
Out[7]:
DocID  Document
0      H2O
0      is
0      an
0      in-memory
0      platform
0      for
0      distributed,
0      scalable
0      machine
0      learning.


In [8]:
tf_idf_out = tf_idf(preprocessed_input_frame, 'DocID', 'Document', preprocess=False)
tf_idf_out.head()


Out[8]:
DocID  Word     TF  IDF       TF-IDF
2      (Ab),    1   0.693147  0.693147
2      (Ig),    1   0.693147  0.693147
2      An       1   0.693147  0.693147
0      Flow     1   0.693147  0.693147
0      H2O      2   0.693147  1.38629
0      Hadoop   1   0.693147  0.693147
1      Ice      1   0.693147  0.693147
0      JSON     1   0.693147  0.693147
0      Java,    1   0.693147  0.693147
0      Python,  1   0.693147  0.693147


In [9]:
tf_idf_output_summary(tf_idf_out)


The highest TF-IDF values for document 0:
DocID  Word   TF  IDF       TF-IDF
0      works  1   0.693147  0.693147
0      H2O    2   0.693147  1.38629
0      like   2   0.693147  1.38629

The lowest TF-IDF values for document 0:
DocID  Word  TF  IDF       TF-IDF
0      and   3   0         0
0      is    1   0         0
0      an    1   0.287682  0.287682


The highest TF-IDF values for document 1:
DocID  Word   TF  IDF       TF-IDF
1      in     2   0.693147  1.38629
1      sport  2   0.693147  1.38629
1      their  2   0.693147  1.38629

The lowest TF-IDF values for document 1:
DocID  Word   TF  IDF       TF-IDF
1      and    1   0         0
1      is     2   0         0
1      known  1   0.287682  0.287682


The highest TF-IDF values for document 2:
DocID  Word      TF  IDF       TF-IDF
2      viruses.  1   0.693147  0.693147
2      as        2   0.693147  1.38629
2      by        2   0.693147  1.38629

The lowest TF-IDF values for document 2:
DocID  Word  TF  IDF       TF-IDF
2      and   1   0         0
2      is    2   0         0
2      a     1   0.287682  0.287682
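
The IDF values in the tables above (0.693147, 0.287682 and 0) are consistent with a smoothed formula, IDF = ln((N + 1) / (df + 1)), where N is the number of documents and df is the number of documents containing the word. That formula is an inference from the displayed numbers, not something the notebook states. The following pure-Python sketch recomputes a few rows under that assumption; `reference_tf_idf` is a hypothetical helper and is not part of H2O.

```python
import math
from collections import Counter

documents = [
    'H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.',
    'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponent\'s net to score goals. The sport is known to be fast-paced and physical.',
    'An antibody (Ab), also known as an immunoglobulin (Ig), is a large, Y-shaped protein produced mainly by plasma cells that is used by the immune system to neutralize pathogens such as pathogenic bacteria and viruses.'
]

def reference_tf_idf(docs):
    """Return {(doc_id, word): (tf, idf, tf_idf)} for whitespace-split tokens.

    Assumes IDF = ln((N + 1) / (df + 1)); this is inferred from the demo
    output, not taken from the H2O documentation.
    """
    # Term frequency: raw count of each token within one document.
    term_counts = [Counter(doc.split()) for doc in docs]
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(word for counts in term_counts for word in counts)
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in enumerate(term_counts):
        for word, count in counts.items():
            idf = math.log((n_docs + 1) / (doc_freq[word] + 1))
            scores[(doc_id, word)] = (count, idf, count * idf)
    return scores

scores = reference_tf_idf(documents)
for key in [(0, 'H2O'), (0, 'an'), (0, 'and'), (1, 'known')]:
    tf, idf, tf_idf_value = scores[key]
    print(key, tf, round(idf, 6), round(tf_idf_value, 6))
```

A word that occurs in all three documents, such as "and", gets IDF ln(4/4) = 0, which is why it sits at the bottom of every ranking regardless of how often it appears.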


Case insensitive TF-IDF


In [10]:
input_frame = h2o.H2OFrame(OrderedDict([('DocID', doc_ids), ('Document', documents)]),
                            column_types=['numeric', 'string'])


Parse progress: |█████████████████████████████████████████████████████████| 100%

In [11]:
tf_idf_out = tf_idf(input_frame, 'DocID', 'Document', case_sensitive=False)
tf_idf_out.head()


Out[11]:
DocID  Word   TF  IDF       TF-IDF
2      (ab),  1   0.693147  0.693147
2      (ig),  1   0.693147  0.693147
1      a      3   0.287682  0.863046
2      a      1   0.287682  0.287682
2      also   1   0.693147  0.693147
0      an     1   0.287682  0.287682
2      an     2   0.287682  0.575364
0      and    3   0         0
1      and    1   0         0
2      and    1   0         0


In [12]:
tf_idf_output_summary(tf_idf_out)


The highest TF-IDF values for document 0:
DocID  Word   TF  IDF       TF-IDF
0      works  1   0.693147  0.693147
0      h2o    2   0.693147  1.38629
0      like   2   0.693147  1.38629

The lowest TF-IDF values for document 0:
DocID  Word  TF  IDF  TF-IDF
0      and   3   0    0
0      is    1   0    0
0      the   1   0    0


The highest TF-IDF values for document 1:
DocID  Word   TF  IDF       TF-IDF
1      in     2   0.693147  1.38629
1      sport  2   0.693147  1.38629
1      their  2   0.693147  1.38629

The lowest TF-IDF values for document 1:
DocID  Word  TF  IDF  TF-IDF
1      and   1   0    0
1      is    2   0    0
1      the   1   0    0


The highest TF-IDF values for document 2:
DocID  Word      TF  IDF       TF-IDF
2      y-shaped  1   0.693147  0.693147
2      as        2   0.693147  1.38629
2      by        2   0.693147  1.38629

The lowest TF-IDF values for document 2:
DocID  Word  TF  IDF  TF-IDF
2      and   1   0    0
2      is    2   0    0
2      the   1   0    0
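
In the case-insensitive run, "the" has IDF 0 in every document. A likely explanation is that lowercasing merges "The" (document 1) with "the" (documents 0 and 2), so the word then occurs in all three documents. The sketch below illustrates this on one verbatim excerpt per document, again assuming the smoothed formula IDF = ln((N + 1) / (df + 1)) inferred from the output; `smoothed_idf` is a hypothetical helper, not an H2O function.

```python
import math

# One verbatim excerpt from each of the three demo documents.
excerpts = [
    'and the Flow notebook/web interface',
    'The sport is known to be fast-paced and physical.',
    'that is used by the immune system',
]

def smoothed_idf(word, token_sets):
    """IDF under the assumed formula ln((N + 1) / (df + 1))."""
    # df: number of documents whose token set contains the word.
    df = sum(word in tokens for tokens in token_sets)
    return math.log((len(token_sets) + 1) / (df + 1))

case_sensitive_sets = [set(text.split()) for text in excerpts]
lowercased_sets = [set(text.lower().split()) for text in excerpts]

# Case-sensitive: "The" and "the" are different tokens, so df("the") is 2.
print(round(smoothed_idf('the', case_sensitive_sets), 6))
# Case-insensitive: every excerpt contains "the", so df("the") is 3.
print(round(smoothed_idf('the', lowercased_sets), 6))
```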