Course Lead: Dr James G. Shanahan (email Jimi via James.Shanahan AT gmail.com)
Name: Jason Sanchez
Class: MIDS w261 (Section Fall 2016 Group 2)
Email: jason.sanchez@iSchool.Berkeley.edu
Week: 5
Due Time: 2 phases, as described below.
HW5 Phase 1: This can be done on a local machine (with a unit test on the cloud, such as AltaScale's PaaS or AWS) and is due Tuesday of Week 6 by 8 AM (West Coast time). It primarily focuses on building unit/systems tests for a pairwise similarity calculation pipeline (over striped documents).
HW5 Phase 2: This will require the AltaScale cluster and is due Tuesday of Week 7 by 8 AM (West Coast time). The focus of Phase 2 is to scale the unit/systems tests up to the Google 5-gram corpus. This will be a group exercise.
MIDS UC Berkeley, Machine Learning at Scale DATSCIW261 ASSIGNMENT #5
Version 2016-09-25
=== INSTRUCTIONS for SUBMISSIONS === Follow the instructions for submissions carefully.
https://docs.google.com/forms/d/1ZOr9RnIe_A06AcZDB6K1mJN4vrLeSmS2PD6Xm3eOiis/viewform?usp=send_form
HW5 Phase 1 can be completed locally on your computer
Data warehouse: Stores a large amount of relational, semi-structured, and unstructured data. It is used for business intelligence and data science.
A star schema has fact tables and many dimension tables that connect to the fact tables. Fact tables record events such as sales or website visits and encode the details of each event as keys (user_id, product_id, store_id, ad_id). The dimension tables store the detailed information behind each of these keys.
Star schemas provide a simple approach to structuring a data warehouse relationally.
3NF means third normal form. It is used to transform large flat files that have repeated data into a linked collection of smaller tables that can be joined on a set of common keys.
Machine learning typically does not use data in 3NF. Instead, it uses large, denormalized flat files so that the details hidden behind the keys are directly available to the algorithms.
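As an illustration (a minimal sketch with hypothetical tables and fields, not part of the assignment data), denormalizing a star schema means joining each fact record against its dimension tables to produce one flat record per event:
In [ ]:
# Minimal sketch: flatten a star schema into ML-ready records.
# The tables and fields below are hypothetical examples.
users = {1: {"age": 34, "state": "CA"}}                  # dimension table
products = {7: {"category": "books", "price": 12.99}}    # dimension table
sales = [{"user_id": 1, "product_id": 7, "qty": 2}]      # fact table

flat = []
for event in sales:
    row = dict(event)                          # start from the fact record
    row.update(users[event["user_id"]])        # pull in user attributes
    row.update(products[event["product_id"]])  # pull in product attributes
    flat.append(row)

print(flat[0])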
Log files can track specific events of interest. A denormalized log file allows a company to track these events in real time conditioned on specific customer features. Alternatively, a model can be running that triggers appropriate responses based on the next predicted action of a user given the user's latest action.
Using MRJob, implement a hash-side join (memory-backed map-side) for left, right, and inner joins. Run your code on the data used in HW 4.4. (Recall HW 4.4: find the most frequent visitor of each page using mrjob and the output of 4.2, i.e., the transformed log file. In this output please include the webpage URL, webpage ID, and visitor ID.)
Justify which table you chose as the left table in this hash-side join.
Please report the number of rows resulting from:
List files in directory
In [1]:
!ls
Count lines in the dataset, view the first 10 lines, and copy the data to log.txt.
In [25]:
!wc -l anonymous-msweb-preprocessed.data && echo
!head anonymous-msweb-preprocessed.data
!cp anonymous-msweb-preprocessed.data log.txt
Convert the output of 4.4 to be just url and url_id
In [22]:
!cat mostFrequentVisitors.txt | cut -f 1,2 -d',' > urls.txt
!wc -l urls.txt && echo
!head urls.txt
The urls.txt file is much smaller than the log.txt data and should be what is loaded into memory. This means it would be the right-side table in a left join.
In [1]:
%%writefile lj.py
from mrjob.job import MRJob
# Avoid broken pipe error
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
class LJ(MRJob):
def mapper_init(self):
self.urls = {}
with open("urls.txt") as urls:
for line in urls:
url, key = line.strip().replace('"',"").split(",")
self.urls[key] = url
def mapper(self, _, lines):
try:
yield (lines, self.urls[lines[2:6]])
        except KeyError:  # page id not in urls.txt: keep the unmatched log line (left join)
yield (lines, "")
if __name__ == "__main__":
LJ.run()
In [3]:
!cat log.txt | python lj.py --file urls.txt -r local -q | head
In [56]:
%%writefile ij.py
from mrjob.job import MRJob
# Avoid broken pipe error
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
class IJ(MRJob):
def mapper_init(self):
self.urls = {}
with open("urls.txt") as urls:
for line in urls:
url, key = line.strip().replace('"',"").split(",")
self.urls[key] = url
def mapper(self, _, lines):
try:
yield (lines, self.urls[lines[2:6]])
        except KeyError:  # page id not in urls.txt: drop the unmatched log line (inner join)
pass
if __name__ == "__main__":
IJ.run()
In [57]:
!cat log.txt | python ij.py --file urls.txt -q | head
In [73]:
%%writefile rj.py
from mrjob.job import MRJob
# Avoid broken pipe error
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
class RJ(MRJob):
def mapper_init(self):
self.urls_used = set()
self.urls = {}
with open("urls.txt") as urls:
for line in urls:
url, key = line.strip().replace('"',"").split(",")
self.urls[key] = url
def mapper(self, _, lines):
try:
url = lines[2:6]
yield (self.urls[url], lines)
self.urls_used.add(url)
        except KeyError:  # page id not in urls.txt: drop the log line (right join keeps all urls)
pass
def mapper_final(self):
for key, value in self.urls.items():
if key not in self.urls_used:
yield (self.urls[key], "*")
def reducer(self, url, values):
quick_stash = 0
for val in values:
if val != "*":
quick_stash += 1
yield (url, val)
if quick_stash == 0:
yield (url, "None")
if __name__ == "__main__":
RJ.run()
To show that it works, we use only the first 100 log entries. We see that urls without corresponding log entries are listed as "None".
In [77]:
!head -n 100 log.txt | python rj.py --file urls.txt -q | head -n 15
A large subset of the Google n-grams dataset
https://aws.amazon.com/datasets/google-books-ngrams/
which we have placed in a bucket/folder on Dropbox and on s3:
https://www.dropbox.com/sh/tmqpc4o0xswhkvz/AACUifrl6wrMrlK6a3X3lZ9Ea?dl=0
s3://filtered-5grams/
In particular, this bucket contains ~200 files (about 10 MB each) in the format:
(ngram) \t (count) \t (pages_count) \t (books_count)
The next cell shows the first 10 lines of the googlebooks-eng-all-5gram-20090715-0-filtered.txt file.
DISCLAIMER: Each record is already a 5-gram. Ideally we would calculate the stripes/co-occurrence data from the raw text and not from the 5-gram preprocessed data; calculating pairs on these 5-grams is somewhat flawed, as we will be double counting co-occurrences. Having said that, this exercise can still pull out some similar terms.
In [16]:
%%writefile 5gram_test.txt
A BILL FOR ESTABLISHING RELIGIOUS 59 59 54
A Biography of General George 92 90 74
A Case Study in Government 102 102 78
A Case Study of Female 447 447 327
A Case Study of Limited 55 55 43
A Child's Christmas in Wales 1099 1061 866
A Circumstantial Narrative of the 62 62 50
A City by the Sea 62 60 49
A Collection of Fairy Tales 123 117 80
A Collection of Forms of 116 103 82
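As a quick sanity check (a minimal sketch, assuming the tab-separated record format described above), one record splits into its four fields like this:
In [ ]:
# Minimal sketch: parse one n-gram record into its four fields.
# Assumes the tab-separated format described above.
line = "A BILL FOR ESTABLISHING RELIGIOUS\t59\t59\t54"
ngram, count, pages_count, books_count = line.split("\t")
print(ngram.split(), int(count), int(pages_count), int(books_count))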
In [114]:
%%writefile GetIndexandOtherWords.py
import heapq
from re import findall
from mrjob.job import MRJob
from mrjob.step import MRStep
class TopList(list):
def __init__(self, max_size, num_position=0):
"""
Just like a list, except the append method adds the new value to the
list only if it is larger than the smallest value (or if the size of
the list is less than max_size).
If each element of the list is an int or float, uses that value for
comparison. If the elements in the list are lists or tuples, uses the
list_position element of the list or tuple for the comparison.
"""
self.max_size = max_size
self.pos = num_position
def _get_key(self, x):
return x[self.pos] if isinstance(x, (list, tuple)) else x
def append(self, val):
if len(self) < self.max_size:
heapq.heappush(self, val)
elif self._get_key(self[0]) < self._get_key(val):
heapq.heapreplace(self, val)
def final_sort(self):
return sorted(self, key=self._get_key, reverse=True)
class GetIndexandOtherWords(MRJob):
"""
Usage: python GetIndexandOtherWords.py --index-range 9000-10000 --top-n-words 10000 --use-term-counts True
Given n-gram formatted data, outputs a file of the form:
index term
index term
...
word term
word term
...
Where there would be 1001 index words and 10000 total words. Each word would be ranked based
on either the term count listed in the Google n-gram data (i.e. the counts found in the
underlying books) or the ranks would be based on the word count of the n-grams in the actual
dataset (i.e. ignore the numbers/counts associated with each n-gram and count each n-gram
exactly once).
"""
def configure_options(self):
super(GetIndexandOtherWords, self).configure_options()
self.add_passthrough_option(
'--index-range',
default="9-10",
help="Specify the range of the index words. ex. 9-10 means the ninth and " +
"tenth most popular words will serve as the index")
self.add_passthrough_option(
'--top-n-words',
default="10",
help="Specify the number of words to output in all")
self.add_passthrough_option(
'--use-term-counts',
default="True",
choices=["True","False"],
help="When calculating the most frequent words, choose whether to count " +
"each word based on the term counts reported by Google or just based on " +
"the number of times the word appears in an n-gram")
self.add_passthrough_option(
'--return-counts',
default="False",
choices=["True","False"],
help="The final output includes the counts of each word")
def mapper_init(self):
# Ensure command line options are sane
top_n_words = int(self.options.top_n_words)
last_index_word = int(self.options.index_range.split("-")[1])
if top_n_words < last_index_word:
raise ValueError("""--top-n-words value (currently %d) must be equal to or greater than
--index-range value (currently %d).""" % (top_n_words, last_index_word))
self.stop_words = {'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him',
'his', 'himself', 'she', 'her', 'hers', 'herself', 'it',
'its', 'itself', 'they', 'them', 'their', 'theirs',
'themselves', 'what', 'which', 'who', 'whom', 'this', 'that',
'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through',
'during', 'before', 'after', 'above', 'below', 'to', 'from',
'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few',
'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not',
'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't',
'can', 'will', 'just', 'don', 'should', 'now'}
def mapper(self, _, lines):
terms, term_count, page_count, book_count = lines.split("\t")
# Either use the ngram term count for the count or count each word just once
if self.options.use_term_counts == "True":
term_count = int(term_count)
else:
term_count = 1
# Iterate through each term. Skip stop words
for term in findall(r'[a-z]+', terms.lower()):
if term in self.stop_words:
pass
else:
yield (term, term_count)
def combiner(self, term, counts):
yield (term, sum(counts))
def reducer_init(self):
"""
Accumulates the top X words and yields them. Note: should only use if
you want to emit a reasonable amount of top words (i.e. an amount that
could fit on a single computer.)
"""
self.top_n_words = int(self.options.top_n_words)
self.TopTerms = TopList(self.top_n_words, num_position=1)
def reducer(self, term, counts):
self.TopTerms.append((term, sum(counts)))
def reducer_final(self):
for pair in self.TopTerms:
yield pair
def mapper_single_key(self, term, count):
"""
Send all the data to a single reducer
"""
yield (1, (term, count))
def reducer_init_top_vals(self):
# Collect top words
self.top_n_words = int(self.options.top_n_words)
self.TopTerms = TopList(self.top_n_words, num_position=1)
# Collect index words
self.index_range = [int(num) for num in self.options.index_range.split("-")]
self.index_low, self.index_high = self.index_range
# Control if output shows counts or just words
self.return_counts = self.options.return_counts == "True"
def reducer_top_vals(self, _, terms):
for term in terms:
self.TopTerms.append(term)
def reducer_final_top_vals(self):
TopTerms = self.TopTerms.final_sort()
if self.return_counts:
# Yield index words
for term in TopTerms[self.index_low-1:self.index_high]:
yield ("index", term)
# Yield all words
for term in TopTerms:
yield ("words", term)
else:
# Yield index words
for term in TopTerms[self.index_low-1:self.index_high]:
yield ("index", term[0])
# Yield all words
for term in TopTerms:
yield ("words", term[0])
def steps(self):
"""
Step one: Yield top n-words from each reducer. Means dataset size is
n-words * num_reducers. Guarantees overall top n-words are
sent to the next step.
"""
mr_steps = [MRStep(mapper_init=self.mapper_init,
mapper=self.mapper,
combiner=self.combiner,
reducer_init=self.reducer_init,
reducer_final=self.reducer_final,
reducer=self.reducer),
MRStep(mapper=self.mapper_single_key,
reducer_init=self.reducer_init_top_vals,
reducer=self.reducer_top_vals,
reducer_final=self.reducer_final_top_vals)]
return mr_steps
if __name__ == "__main__":
GetIndexandOtherWords.run()
In [120]:
!cat 5gram_test.txt | python GetIndexandOtherWords.py --index-range 9-10 \
--top-n-words 10 \
--return-counts False \
-q
For HW 5.4-5.5, unit test and regression test your code using the following small test datasets:
In [2]:
%%writefile atlas-boon-systems-test.txt
atlas boon 50 50 50
boon cava dipped 10 10 10
atlas dipped 15 15 15
In [3]:
############################################
# Stripes for systems test 1 (predefined)
############################################
with open("mini_stripes.txt", "w") as f:
f.writelines([
'"DocA"\t{"X":20, "Y":30, "Z":5}\n',
'"DocB"\t{"X":100, "Y":20}\n',
'"DocC"\t{"M":5, "N":20, "Z":5, "Y":1}\n'
])
!cat mini_stripes.txt
In [4]:
%%writefile MakeStripes.py
from mrjob.job import MRJob
from collections import Counter
class MakeStripes(MRJob):
def mapper(self, _, lines):
terms, term_count, page_count, book_count = lines.split("\t")
terms = terms.split()
term_count = int(term_count)
for item in terms:
yield (item, {term:term_count for term in terms if term != item})
def combiner(self, keys, values):
values_sum = Counter()
for val in values:
values_sum += Counter(val)
yield keys, dict(values_sum)
def reducer(self, keys, values):
values_sum = Counter()
for val in values:
values_sum += Counter(val)
yield keys, dict(values_sum)
if __name__ == "__main__":
MakeStripes.run()
In [7]:
!cat atlas-boon-systems-test.txt | python MakeStripes.py -r local -q > atlas_stripes.txt
!cat atlas_stripes.txt
In [ ]:
"atlas" {"dipped": 15, "boon": 50}
"boon" {"atlas": 50, "dipped": 10, "cava": 10}
"cava" {"dipped": 10, "boon": 10}
"dipped" {"atlas": 15, "boon": 10, "cava": 10}
The calculated stripes match the systems test.
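As a quick hand check against the three test lines: "atlas" co-occurs with "boon" in the first line (count 50) and with "dipped" in the third line (count 15), giving the stripe {"boon": 50, "dipped": 15} shown above. The full tally can be reproduced in plain Python:
In [ ]:
# Hand check of the stripes, computed directly from the atlas-boon test lines.
from collections import Counter, defaultdict

records = [("atlas boon", 50), ("boon cava dipped", 10), ("atlas dipped", 15)]
stripes = defaultdict(Counter)
for terms, count in records:
    words = terms.split()
    for word in words:
        for other in words:
            if other != word:
                stripes[word][other] += count

for word in sorted(stripes):
    print(word, dict(stripes[word]))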
In [8]:
%%writefile InvertIndex.py
from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol
from collections import Counter
class InvertIndex(MRJob):
    INPUT_PROTOCOL = JSONProtocol  # read the JSON stripes produced by MakeStripes
def mapper(self, key, words):
n_words = len(words)
for word in words:
yield (word, {key:n_words})
def combiner(self, keys, values):
values_sum = Counter()
for val in values:
values_sum += Counter(val)
yield keys, dict(values_sum)
def reducer(self, keys, values):
values_sum = Counter()
for val in values:
values_sum += Counter(val)
yield keys, dict(values_sum)
if __name__ == "__main__":
InvertIndex.run()
In [9]:
!cat atlas_stripes.txt | python InvertIndex.py -q > atlas_inverted.txt
!cat atlas_inverted.txt
In [10]:
!cat mini_stripes.txt | python InvertIndex.py -q > mini_stripes_inverted.txt
!cat mini_stripes_inverted.txt
In [ ]:
Systems test mini_stripes - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
"M" | DocC 4 | |
"N" | DocC 4 | |
"X" | DocA 3 | DocB 2 |
"Y" | DocA 3 | DocB 2 | DocC 4
"Z" | DocA 3 | DocC 4 |
systems test atlas-boon - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
"atlas" | boon 3 | dipped 3 |
"dipped" | atlas 2 | boon 3 | cava 2
"boon" | atlas 2 | cava 2 | dipped 3
"cava" | boon 3 | dipped 3 |
Tests pass
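As a quick hand check of the mini_stripes index: DocA's stripe contains 3 words, DocB's 2, and DocC's 4, so the posting for "Y" should be DocA 3, DocB 2, DocC 4, which matches the table above. The inversion can be reproduced in plain Python:
In [ ]:
# Hand check of the inverted index for the mini_stripes systems test.
stripes = {"DocA": {"X": 20, "Y": 30, "Z": 5},
           "DocB": {"X": 100, "Y": 20},
           "DocC": {"M": 5, "N": 20, "Z": 5, "Y": 1}}

inverted = {}
for doc, stripe in stripes.items():
    n_words = len(stripe)  # postings carry the stripe length, not the raw counts
    for word in stripe:
        inverted.setdefault(word, {})[doc] = n_words

for word in sorted(inverted):
    print(word, inverted[word])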
In [216]:
%%writefile Similarity.py
from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol
from itertools import combinations
from statistics import mean
class Similarity(MRJob):
    INPUT_PROTOCOL = JSONProtocol  # read the JSON inverted index produced by InvertIndex
def mapper(self, key_term, docs):
doc_names = docs.keys()
for doc_pairs in combinations(sorted(list(doc_names)), 2):
yield (doc_pairs, 1)
for name in doc_names:
yield (name, 1)
def combiner(self, key, value):
yield (key, sum(value))
def reducer_init(self):
self.words = {}
self.results = []
def reducer(self, doc_or_docs, count):
if isinstance(doc_or_docs, str):
self.words[doc_or_docs] = sum(count)
else:
d1, d2 = doc_or_docs
d1_n_words, d2_n_words = self.words[d1], self.words[d2]
intersection = sum(count)
jaccard = round(intersection/(d1_n_words + d2_n_words - intersection), 3)
cosine = round(intersection/(d1_n_words**.5 * d2_n_words**.5), 3)
dice = round(2*intersection/(d1_n_words + d2_n_words), 3)
overlap = round(intersection/min(d1_n_words, d2_n_words), 3)
average = round(mean([jaccard, cosine, dice, overlap]), 3)
self.results.append([doc_or_docs, {"jacc":jaccard, "cos":cosine,
"dice":dice, "ol":overlap, "ave":average}])
def reducer_final(self):
for doc, result in sorted(self.results, key=lambda x: x[1]["ave"], reverse=True):
yield (doc, result)
if __name__ == "__main__":
Similarity.run()
In [11]:
!cat atlas_inverted.txt | python Similarity.py -q --jobconf mapred.reduce.tasks=1
In [12]:
!cat mini_stripes_inverted.txt | python Similarity.py -q --jobconf mapred.reduce.tasks=1
Systems test mini_stripes - Similarity measures
average | pair | cosine | jaccard | overlap | dice |
---|---|---|---|---|---|
0.741582 | DocA - DocB | 0.816497 | 0.666667 | 1.000000 | 0.800000 |
0.488675 | DocA - DocC | 0.577350 | 0.400000 | 0.666667 | 0.571429 |
0.276777 | DocB - DocC | 0.353553 | 0.200000 | 0.500000 | 0.333333 |
Systems test atlas-boon 2 - Similarity measures
average | pair | cosine | jaccard | overlap | dice |
---|---|---|---|---|---|
1.000000 | atlas - cava | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
0.625000 | boon - dipped | 0.666667 | 0.500000 | 0.666667 | 0.666667 |
0.389562 | cava - dipped | 0.408248 | 0.250000 | 0.500000 | 0.400000 |
0.389562 | boon - cava | 0.408248 | 0.250000 | 0.500000 | 0.400000 |
0.389562 | atlas - dipped | 0.408248 | 0.250000 | 0.500000 | 0.400000 |
0.389562 | atlas - boon | 0.408248 | 0.250000 | 0.500000 | 0.400000 |
The numbers I calculated exactly match the systems test except for the average calculations of the mini_stripes set. In this instance, the systems test calculations are not correct.
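To make the discrepancy concrete, a hand calculation for DocA-DocB from the inverted index above: DocA's stripe has 3 words, DocB's has 2, and they share 2 (X and Y), so cosine = 2/sqrt(3*2) ≈ 0.8165, Jaccard = 2/(3+2-2) ≈ 0.6667, overlap = 2/min(3,2) = 1.0, and Dice = 2*2/(3+2) = 0.8. These match the systems test row, but their mean is ≈ 0.8208, not the 0.741582 listed as the average:
In [ ]:
# Hand check of the DocA-DocB measures and their average.
inter, a, b = 2, 3, 2  # shared words, |DocA stripe|, |DocB stripe|
cosine = inter / (a * b) ** 0.5
jaccard = inter / (a + b - inter)
overlap = inter / min(a, b)
dice = 2 * inter / (a + b)
print(cosine, jaccard, overlap, dice, (cosine + jaccard + overlap + dice) / 4)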
In [274]:
%%time
!source ../private/aws_creds.sh && python MakeStripes.py -r emr atlas-boon-systems-test.txt
Once you are happy with your test results, proceed to generating your results on the Google n-grams dataset.
Do some EDA on this dataset using mrjob, e.g.,
In [ ]:
In [ ]:
In [ ]:
There is also a corpus of stopwords, that is, high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts. Python's nltk comes with a prebuilt list of stopwords (see below). Using this stopword list, filter out these tokens from your analysis, rerun the experiments in 5.5, and discuss the results with and without the stopword list.
from nltk.corpus import stopwords
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
For each of HW 5.4-5.5.1, please unit test and system test your code with respect to the SYSTEMS TEST DATASET and show the results. Please compute the expected answer by hand and show your hand calculations for the SYSTEMS TEST DATASET. Then show the results you get with your system.
In this part of the assignment we will focus on developing methods for detecting synonyms, using the Google 5-grams dataset. At a high level:
To accomplish this you must script two main tasks using MRJob:
TASK (1) Build stripes for the most frequent 10,000 words using co-occurrence information based on the words ranked 9,001-10,000 as a basis/vocabulary (drop stopword-like terms), and output to a file in your bucket on s3 (bigram analysis, though the words are non-contiguous).
TASK (2) Using two (symmetric) comparison methods of your choice (e.g., correlations, distances, similarities), pairwise compare all stripes (vectors), and output to a file in your bucket on s3.
For this task you will be able to modify the pattern we used in HW 3.2 (feel free to use the solution as a reference). To total the word counts across the 5-grams, output the support from the mappers using the total order inversion pattern:
<*word,count>
to ensure that the support arrives before the co-occurrences.
In addition to determining the total word counts, the mapper must also output co-occurrence counts for the pairs of words inside each 5-gram. Treat these words as a basket, as we did in HW 3, but count all stripes or pairs in both orders, i.e., count both orderings, (word1, word2) and (word2, word1), to preserve symmetry in the output for TASK (2).
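A minimal sketch of such a mapper (an illustration, not the graded solution; partitioning so that "*word" totals and the corresponding pair keys reach the same reducer is left out):
In [ ]:
# Minimal sketch of a mapper using the order-inversion pattern described above.
# Illustration only; assumes tab-separated 5-gram records as in HW 5.4.
from itertools import permutations
from mrjob.job import MRJob

class CooccurrenceMapperSketch(MRJob):
    def mapper(self, _, line):
        terms, term_count, page_count, book_count = line.split("\t")
        words = set(terms.lower().split())   # each distinct word counted once per 5-gram
        count = int(term_count)
        for word in words:
            # The "*" prefix makes the word's total support sort ahead of its pair keys.
            yield ("*" + word, count)
        for w1, w2 in permutations(words, 2):  # both orderings, to preserve symmetry
            yield ((w1, w2), count)

if __name__ == "__main__":
    CooccurrenceMapperSketch.run()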
For this task you will have to determine a method of comparison. Here are a few that you might consider:
However, be cautioned that some comparison methods are more difficult to parallelize than others; do not perform more associations than necessary, since your choice of association will be symmetric.
Please use the inverted index (discussed in live session #5) based pattern to compute the pairwise (term-by-term) similarity matrix.
Please report the size of the cluster used and the amount of time it takes to run for the index construction task and for the synonym calculation task. How many pairs need to be processed (HINT: use the posting list length to calculate directly)? Report your Cluster configuration!
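The hint can be applied directly: a posting list of length n generates n*(n-1)/2 candidate pairs, so the total number of pairs processed is the sum of that quantity over all posting lists (a minimal sketch, assuming the JSON-per-line output format of InvertIndex.py above):
In [ ]:
# Minimal sketch: count candidate pairs directly from posting-list lengths.
# Assumes one JSON-encoded "term \t postings" line per term, as produced by InvertIndex.py.
import json

total_pairs = 0
with open("atlas_inverted.txt") as f:   # swap in the full inverted index for HW 5.5
    for line in f:
        term, postings = line.strip().split("\t")
        n = len(json.loads(postings))
        total_pairs += n * (n - 1) // 2  # each pair of documents in a posting list is compared once
print(total_pairs)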
In [ ]:
print "\nTop/Bottom 20 results - Similarity measures - sorted by cosine"
print "(From the entire data set)"
print '—'*117
print "{0:>30} |{1:>15} |{2:>15} |{3:>15} |{4:>15} |{5:>15}".format(
"pair", "cosine", "jaccard", "overlap", "dice", "average")
print '-'*117
for stripe in sortedSims[:20]:
print "{0:>30} |{1:>15f} |{2:>15f} |{3:>15f} |{4:>15f} |{5:>15f}".format(
stripe[0], float(stripe[1]), float(stripe[2]), float(stripe[3]), float(stripe[4]), float(stripe[5]) )
print '—'*117
for stripe in sortedSims[-20:]:
print "{0:>30} |{1:>15f} |{2:>15f} |{3:>15f} |{4:>15f} |{5:>15f}".format(
stripe[0], float(stripe[1]), float(stripe[2]), float(stripe[3]), float(stripe[4]), float(stripe[5]) )
In [ ]:
Top/Bottom 20 results - Similarity measures - sorted by cosine
(From the entire data set)
—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
pair | cosine | jaccard | overlap | dice | average
---------------------------------------------------------------------------------------------------------------------
cons - pros | 0.894427 | 0.800000 | 1.000000 | 0.888889 | 0.895829
forties - twenties | 0.816497 | 0.666667 | 1.000000 | 0.800000 | 0.820791
own - time | 0.809510 | 0.670563 | 0.921168 | 0.802799 | 0.801010
little - time | 0.784197 | 0.630621 | 0.926101 | 0.773473 | 0.778598
found - time | 0.783434 | 0.636364 | 0.883788 | 0.777778 | 0.770341
nova - scotia | 0.774597 | 0.600000 | 1.000000 | 0.750000 | 0.781149
hong - kong | 0.769800 | 0.615385 | 0.888889 | 0.761905 | 0.758995
life - time | 0.769666 | 0.608789 | 0.925081 | 0.756829 | 0.765091
time - world | 0.755476 | 0.585049 | 0.937500 | 0.738209 | 0.754058
means - time | 0.752181 | 0.587117 | 0.902597 | 0.739854 | 0.745437
form - time | 0.749943 | 0.588418 | 0.876733 | 0.740885 | 0.738995
infarction - myocardial | 0.748331 | 0.560000 | 1.000000 | 0.717949 | 0.756570
people - time | 0.745788 | 0.573577 | 0.923875 | 0.729010 | 0.743063
angeles - los | 0.745499 | 0.586207 | 0.850000 | 0.739130 | 0.730209
little - own | 0.739343 | 0.585834 | 0.767296 | 0.738834 | 0.707827
life - own | 0.737053 | 0.582217 | 0.778502 | 0.735951 | 0.708430
anterior - posterior | 0.733388 | 0.576471 | 0.790323 | 0.731343 | 0.707881
power - time | 0.719611 | 0.533623 | 0.933586 | 0.695898 | 0.720680
dearly - install | 0.707107 | 0.500000 | 1.000000 | 0.666667 | 0.718443
found - own | 0.704802 | 0.544134 | 0.710949 | 0.704776 | 0.666165
—————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
arrival - essential | 0.008258 | 0.004098 | 0.009615 | 0.008163 | 0.007534
governments - surface | 0.008251 | 0.003534 | 0.014706 | 0.007042 | 0.008383
king - lesions | 0.008178 | 0.003106 | 0.017857 | 0.006192 | 0.008833
clinical - stood | 0.008178 | 0.003831 | 0.011905 | 0.007634 | 0.007887
till - validity | 0.008172 | 0.003367 | 0.015625 | 0.006711 | 0.008469
evidence - started | 0.008159 | 0.003802 | 0.012048 | 0.007576 | 0.007896
forces - record | 0.008152 | 0.003876 | 0.011364 | 0.007722 | 0.007778
primary - stone | 0.008146 | 0.004065 | 0.009091 | 0.008097 | 0.007350
beneath - federal | 0.008134 | 0.004082 | 0.008403 | 0.008130 | 0.007187
factors - rose | 0.008113 | 0.004032 | 0.009346 | 0.008032 | 0.007381
evening - functions | 0.008069 | 0.004049 | 0.008333 | 0.008065 | 0.007129
bone - told | 0.008061 | 0.003704 | 0.012346 | 0.007380 | 0.007873
building - occurs | 0.008002 | 0.003891 | 0.010309 | 0.007752 | 0.007489
company - fig | 0.007913 | 0.003257 | 0.015152 | 0.006494 | 0.008204
chronic - north | 0.007803 | 0.003268 | 0.014493 | 0.006515 | 0.008020
evaluation - king | 0.007650 | 0.003030 | 0.015625 | 0.006042 | 0.008087
resulting - stood | 0.007650 | 0.003663 | 0.010417 | 0.007299 | 0.007257
agent - round | 0.007515 | 0.003289 | 0.012821 | 0.006557 | 0.007546
afterwards - analysis | 0.007387 | 0.003521 | 0.010204 | 0.007018 | 0.007032
posterior - spirit | 0.007156 | 0.002660 | 0.016129 | 0.005305 | 0.007812
In this part of the assignment you will evaluate the success of your synonym detector (developed in response to HW 5.4). Take the top 1,000 closest/most similar/correlative pairs of words as determined by your measure in HW 5.4, and use the synonyms function in the accompanying python code:
nltk_synonyms.py
Note: This will require installing the python nltk package:
http://www.nltk.org/install.html
and downloading its data with nltk.download().
For each (word1,word2) pair, check to see if word1 is in the list, synonyms(word2), and vice-versa. If one of the two is a synonym of the other, then consider this pair a 'hit', and then report the precision, recall, and F1 measure of your detector across your 1,000 best guesses. Report the macro averages of these measures.
$$Precision (P) = \frac{TP}{TP + FP} $$
$$Recall (R) = \frac{TP}{TP + FN} $$
$$F1 = \frac{2 * ( precision * recall )}{precision + recall}$$
We calculate Precision by counting the number of hits and dividing by the number of occurrences in our top 1,000 (opportunities)
We calculate Recall by counting the number of hits, and dividing by the number of synonyms in wordnet (syns)
Other diagnostic measures not implemented here: https://en.wikipedia.org/wiki/F1_score#Diagnostic_Testing
In [ ]:
''' Performance measures '''
from __future__ import division
import numpy as np
import json
import nltk
from nltk.corpus import wordnet as wn
import sys
#print all the synset element of an element
def synonyms(string):
syndict = {}
for i,j in enumerate(wn.synsets(string)):
syns = j.lemma_names()
for syn in syns:
syndict.setdefault(syn,1)
return syndict.keys()
hits = []
TP = 0
FP = 0
TOTAL = 0
flag = False # so we don't double count, but at the same time don't miss hits
## For this part we can use one of three outputs. They are all the same, but were generated differently
# 1. the top 1000 from the full sorted dataset -> sortedSims[:1000]
# 2. the top 1000 from the partial sort aggregate file -> sims2/top1000sims
# 3. the top 1000 from the total order sort file -> head -1000 sims_parts/part-00004
top1000sims = []
with open("sims2/top1000sims","r") as f:
for line in f.readlines():
line = line.strip()
avg,lisst = line.split("\t")
lisst = json.loads(lisst)
lisst.append(avg)
top1000sims.append(lisst)
measures = {}
not_in_wordnet = []
for line in top1000sims:
TOTAL += 1
pair = line[0]
words = pair.split(" - ")
for word in words:
if word not in measures:
measures[word] = {"syns":0,"opps": 0,"hits":0}
measures[word]["opps"] += 1
syns0 = synonyms(words[0])
measures[words[1]]["syns"] = len(syns0)
if len(syns0) == 0:
not_in_wordnet.append(words[0])
if words[1] in syns0:
TP += 1
hits.append(line)
flag = True
measures[words[1]]["hits"] += 1
syns1 = synonyms(words[1])
measures[words[0]]["syns"] = len(syns1)
if len(syns1) == 0:
not_in_wordnet.append(words[1])
if words[0] in syns1:
if flag == False:
TP += 1
hits.append(line)
measures[words[0]]["hits"] += 1
flag = False
precision = []
recall = []
f1 = []
for key in measures:
p,r,f = 0,0,0
if measures[key]["hits"] > 0 and measures[key]["syns"] > 0:
p = measures[key]["hits"]/measures[key]["opps"]
r = measures[key]["hits"]/measures[key]["syns"]
f = 2 * (p*r)/(p+r)
# For calculating measures, only take into account words that have synonyms in wordnet
if measures[key]["syns"] > 0:
precision.append(p)
recall.append(r)
f1.append(f)
# Take the mean of each measure
print "—"*110
print "Number of Hits:",TP, "out of top",TOTAL
print "Number of words without synonyms:",len(not_in_wordnet)
print "—"*110
print "Precision\t", np.mean(precision)
print "Recall\t\t", np.mean(recall)
print "F1\t\t", np.mean(f1)
print "—"*110
print "Words without synonyms:"
print "-"*100
for word in not_in_wordnet:
print synonyms(word),word
In [ ]:
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Number of Hits: 31 out of top 1000
Number of words without synonyms: 67
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Precision 0.0280214404967
Recall 0.0178598869579
F1 0.013965517619
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Words without synonyms:
----------------------------------------------------------------------------------------------------
[] scotia
[] hong
[] kong
[] angeles
[] los
[] nor
[] themselves
[]
.......
Repeat HW5 using vocabulary words ranked 8,001-10,000; 7,001-10,000; 6,001-10,000; 5,001-10,000; 3,001-10,000; and 1,001-10,000. Don't forget to report your cluster configuration.
Generate the following graphs:
- vocabulary size (x-axis) versus CPU time for indexing
- vocabulary size (x-axis) versus number of pairs processed
- vocabulary size (x-axis) versus F1 measure, Precision, and Recall
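A minimal plotting sketch for these graphs (matplotlib; the vocabulary sizes follow from the ranges above, while the timing, pair-count, and score lists are placeholders to be filled in from the cluster runs):
In [ ]:
# Minimal sketch for the three requested graphs.
# The y-value lists are placeholders to be filled in from the cluster runs.
import matplotlib.pyplot as plt

vocab_sizes = [1000, 2000, 3000, 4000, 5000, 7000, 9000]  # words ranked X,001-10,000
cpu_times = []   # CPU time for indexing, one value per vocabulary size
n_pairs = []     # number of pairs processed, one value per vocabulary size
precision, recall, f1 = [], [], []

def plot_vs_vocab(ys, ylabel):
    plt.figure()
    plt.plot(vocab_sizes[:len(ys)], ys, marker="o")
    plt.xlabel("vocabulary size")
    plt.ylabel(ylabel)

plot_vs_vocab(cpu_times, "CPU time for indexing (s)")
plot_vs_vocab(n_pairs, "number of pairs processed")
plot_vs_vocab(f1, "F1")   # repeat for precision and recall
plt.show()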
In [ ]:
There is also a corpus of stopwords, that is, high-frequency words like "the", "to" and "also" that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts. Python's nltk comes with a prebuilt list of stopwords (see below). Using this stopword list, filter out these tokens from your analysis, rerun the experiments in 5.5, and discuss the results with and without the stopword list.
from nltk.corpus import stopwords
stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
In [ ]:
There are many good ways to build synonym detectors, so for this optional homework, measure co-occurrence using only (left/right/all) consecutive words, or make stripes according to word co-occurrences with the accompanying 2-, 3-, or 4-grams inside the 5-grams (note that your output will no longer be interpretable as a network).
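A minimal sketch of the "consecutive words only" variant (an illustration under the same 5-gram input format, not a required implementation): for each word, only its immediate left and right neighbours contribute to its stripe.
In [ ]:
# Minimal sketch: stripes built from immediate left/right neighbours only.
# Illustration of the "consecutive words" variant; not a required implementation.
from collections import Counter
from mrjob.job import MRJob

class ConsecutiveStripes(MRJob):
    def mapper(self, _, line):
        terms, term_count, page_count, book_count = line.split("\t")
        words = terms.lower().split()
        count = int(term_count)
        for i, word in enumerate(words):
            neighbours = words[max(i - 1, 0):i] + words[i + 1:i + 2]
            yield (word, {n: count for n in neighbours if n != word})

    def reducer(self, word, stripes):
        total = Counter()
        for stripe in stripes:
            total.update(stripe)   # Counter.update adds counts
        yield (word, dict(total))

if __name__ == "__main__":
    ConsecutiveStripes.run()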
In [ ]:
In [ ]: