Word-overlap baseline

For Anonymous (2017), 'Anonymous'

You first need to download the test data from the web. This implementation assumes each data file is formatted like the CNN/Daily Mail dataset. Since this method involves no learning process, the test set is the only data we need.
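Based on how the extract function below indexes the lines, each .question file is assumed to follow this layout (fields separated by blank lines):

    line 0:  source URL
    line 2:  passage text
    line 4:  query containing @placeholder
    line 6:  answer
    line 8+: answer candidates, one per line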


In [1]:
# Read the list of test file names (one per line) and strip the newlines.
with open('testfile.txt','r') as f:
    testlist = f.readlines()
    testlist = map(lambda x: x.strip(), testlist)
print testlist[:10]


['0005abecc02cef5369827ba5dec2b246d6c04b0f.question', '0015c2f95d940cb547054060cab362058c25e03f.question', '004c8f8878b646dd9d2d6cba2d624ae16abbc14f.question', '005f6d2f969d0f925fd0124cf91b40d6cffc75d2.question', '0066b726c91ead4a05d3dcbed9fdecdc9c7bc4af.question', '0073d51f1c1bbd23985566d9b0a286129a1b040a.question', '0090affc0127eecff757c1600f98587b2688263e.question', '009cc72d6c79b7bbbf3f76513cf4e47ee28a3e4c.question', '009f7eba61e3d0827dea6e39ef86e117ec35da7a.question', '00ab543b1bd24482f27799b42ab5955e3c48e908.question']

In [2]:
'''
This turns each data file into one tuple of 5 elements:
filename, filetext, query, answer, and answer candidates.
'''
def extract(datafile):
    filename = datafile[0].strip()
    filetext = datafile[2].strip()
    query = datafile[4].strip()
    answer = datafile[6].strip()
    answercand = [ item.strip() for item in datafile[8:] if item != '\n']
    
    return filename,filetext,query,answer,answercand

In [3]:
'''
This turns a sentence containing '@placeholder' (the blank part)
into one with the placeholder filled in by the answer.
'''

def make_query(sentence_with_placeholder, answer):
    # str.replace simply returns the sentence unchanged when the placeholder
    # is absent, so we check for it explicitly instead of using try/except
    # (which would never fire here).
    if "@placeholder" not in sentence_with_placeholder:
        print "no placeholder in the sentence..."
    lst = [w.replace("@placeholder", answer) for w in sentence_with_placeholder.split()]
    return " ".join(lst)

In [4]:
'''
We use the stopword list from the Natural Language Toolkit (NLTK),
along with its Punkt sentence tokenizer.
'''
import re
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
from nltk.corpus import stopwords
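If the Punkt model or the stopwords corpus is not installed yet, a one-time download through NLTK's standard downloader is enough:

import nltk
nltk.download('punkt')      # Punkt sentence tokenizer loaded above
nltk.download('stopwords')  # stopword corpus used below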

In [5]:
'''
This splits the text of a file into the list of words that compose it.
'''
def text_to_wordlist(text, remove_stopwords=True):
    # Keep letters only, lowercase, and optionally drop stopwords.
    text = re.sub("[^a-zA-Z]", " ", text)
    words = text.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words('english'))
        words = [w for w in words if w not in stops]
    
    return words
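For example, non-letter characters are stripped and stopwords removed:

print text_to_wordlist("The investors are searching for something special.")
# -> ['investors', 'searching', 'something', 'special']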

In [6]:
def text_to_sentences(text, tokenizer, remove_stopwords=True):
    # Split the text into sentences with the Punkt tokenizer,
    # then clean each one and re-join its remaining words.
    raw_sentences = tokenizer.tokenize(text)
    
    sentences = []
    
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            lst = text_to_wordlist(raw_sentence, remove_stopwords)
            joint_sentence = " ".join(lst)
            sentences.append(joint_sentence)
    
    return sentences

In [7]:
def clean_text(text):
    # Same cleaning as text_to_wordlist, but returns a single joined string.
    text = re.sub("[^a-zA-Z]", " ", text)
    words = text.lower().split()
    stops = set(stopwords.words('english'))
    words = [w for w in words if w not in stops]
    
    return " ".join(words)

In [8]:
len(testlist)


Out[8]:
3198

In [9]:
'''
Check whether the data is read in correctly.
'''
TEST_PATH = 'test/'
sampleTestFilePath = TEST_PATH+testlist[0]
with open(sampleTestFilePath,'r') as f:
    print f.read()


http://web.archive.org/web/20150426033749id_/http://edition.cnn.com/2015/04/22/business/african-luxury-houses-london/index.html

( @entity0 ) sophisticated , glamorous and spacious -- when the super-rich go house - hunting they are searching for something special . real estate in @entity5 's swankier suburbs can catch a buyers ' eye . @entity8 , @entity9 and @entity10 have long been the stomping ground of the elite -- and are now welcoming a new wave of @entity15 investors . " the @entity19 who are coming into @entity5 now are @entity19 who themselves have worked for their money , " explains @entity16 , a @entity18 - @entity17 wealth manager based in @entity5 . " they have grown in industry and are actually part of the exciting story of the @entity15 renaissance , " she continues . " it 's bringing to @entity5 the best of the continent . " these investors are having a considerable impact on @entity5 's property market and they mainly come from just six countries : @entity17 , @entity32 , @entity33 , @entity34 , @entity35 and @entity36 . of these , @entity17 are splashing out the most cash when it comes to bricks and mortar in the @entity18 capital -- typically spending between $ 22 and $ 37 million on securing a property , according to luxury property agents @entity44 . their research shows that over the past three years @entity19 have spent over $ 900 million on luxury residential property in @entity5 . " the new international @entity15 is very well - traveled , " explains @entity16 . " educated in the @entity47 , @entity18 and different parts of @entity49 their taste is definitely more modern and clean . " ' @entity52 ' owning a home in post codes like @entity55 or @entity55 -- around the corner from @entity57 -- means more than having a place to lay your head . these buildings are investments which are expected to gain even bigger value in the coming years . high - end auction house @entity64 says that foreign investors see @entity5 as a " safe haven " for prime property investments , and ranks the city as the second most important hub for ultra high - net - worth homes . the only spot more important on the planet is @entity74 . for evidence that @entity5 still attracts high - end buyers , look no further than the sale of a penthouse in @entity8 which fetched $ 40 million earlier this year . educated thinking as well as an intelligent investment , many of the @entity15 buyers see these houses as a way of maintaining long standing cultural ties with @entity5 -- and it 's here they want to send their children to school . @entity87 , @entity88 , @entity89 are all among the list of respected institutions that teach the offspring of wealthy @entity19 . the @entity17 @entity93 in @entity5 calculates that @entity17 nationals now spend over $ 446 million per year on fees , tutoring and accommodation at @entity18 schools and university . " @entity15 clients are very much driven by the need to educate their children , " says @entity16 . " education usually means putting the children on an international stage , and that 's one reason why this is feeding into the demand for property in @entity5 . " indeed , education industry experts @entity111 say there were over 17,500 @entity17 studying in @entity18 universities in 2012 -- about 1,000 more than the 2009/10 academic session . and experts are expecting this trend to continue . " virtually all the transactions are for end use , not rental investment , which indicates that the @entity15 buyer market in @entity5 has significant room for growth , " says @entity117 , director at @entity44 . 
" african buyers or luxury tenants in @entity5 are currently where the @entity126 and @entity127 were five years ago . they have the resources and desire to purchase or rental luxury homes in @entity5 , " he adds . " it is going to be the @entity15 century . " more from @entity133 read this : @entity133 's green lean speed machines read this : @entity15 designs rocking art world editor 's note : @entity141 covers the macro trends impacting the region and also focuses on the continent 's key industries and corporations

property experts say @placeholder investment in @entity5 is set to grow

@entity15

@entity15:African
@entity117:Gary Hersham
@entity111:ICEF Monitor
@entity87:Harrow
@entity88:Eton
@entity89:Cheltenham Ladies College
@entity133:Africa
@entity0:CNN
@entity5:London
@entity9:Kensington
@entity8:Mayfair
@entity52:Safe-Haven
@entity10:Chelsea
@entity57:Kensington Palace
@entity55:W8
@entity74:New York City
@entity17:Nigerians
@entity16:Nkontchou
@entity33:Congo
@entity32:Ghana
@entity35:Cameroon
@entity34:Gabon
@entity36:Senegal
@entity19:Africans
@entity18:British
@entity126:Russians
@entity127:Ukrainians
@entity93:Embassy
@entity141:CNN Marketplace Africa
@entity44:Beauchamp Estates
@entity47:U.S.
@entity64:Sotheby 's
@entity49:Europe

In [10]:
'''
Put the file paths into a list.
'''
testFilePaths = [TEST_PATH + testfileName for testfileName in testlist]
print testFilePaths[3]
print len(testFilePaths)


process_data/cnn/test/005f6d2f969d0f925fd0124cf91b40d6cffc75d2.question
3198

In [13]:
'''
Make a list of tuples containing the test sets.
'''
testSets = []
for testPath in testFilePaths:
    with open(testPath,'r') as f:
        data_tuple = extract(f.readlines())
        testSets.append(data_tuple)
print len(testSets)
print testSets[0]


3198
('http://web.archive.org/web/20150426033749id_/http://edition.cnn.com/2015/04/22/business/african-luxury-houses-london/index.html', '( @entity0 ) sophisticated , glamorous and spacious -- when the super-rich go house - hunting they are searching for something special . real estate in @entity5 \'s swankier suburbs can catch a buyers \' eye . @entity8 , @entity9 and @entity10 have long been the stomping ground of the elite -- and are now welcoming a new wave of @entity15 investors . " the @entity19 who are coming into @entity5 now are @entity19 who themselves have worked for their money , " explains @entity16 , a @entity18 - @entity17 wealth manager based in @entity5 . " they have grown in industry and are actually part of the exciting story of the @entity15 renaissance , " she continues . " it \'s bringing to @entity5 the best of the continent . " these investors are having a considerable impact on @entity5 \'s property market and they mainly come from just six countries : @entity17 , @entity32 , @entity33 , @entity34 , @entity35 and @entity36 . of these , @entity17 are splashing out the most cash when it comes to bricks and mortar in the @entity18 capital -- typically spending between $ 22 and $ 37 million on securing a property , according to luxury property agents @entity44 . their research shows that over the past three years @entity19 have spent over $ 900 million on luxury residential property in @entity5 . " the new international @entity15 is very well - traveled , " explains @entity16 . " educated in the @entity47 , @entity18 and different parts of @entity49 their taste is definitely more modern and clean . " \' @entity52 \' owning a home in post codes like @entity55 or @entity55 -- around the corner from @entity57 -- means more than having a place to lay your head . these buildings are investments which are expected to gain even bigger value in the coming years . high - end auction house @entity64 says that foreign investors see @entity5 as a " safe haven " for prime property investments , and ranks the city as the second most important hub for ultra high - net - worth homes . the only spot more important on the planet is @entity74 . for evidence that @entity5 still attracts high - end buyers , look no further than the sale of a penthouse in @entity8 which fetched $ 40 million earlier this year . educated thinking as well as an intelligent investment , many of the @entity15 buyers see these houses as a way of maintaining long standing cultural ties with @entity5 -- and it \'s here they want to send their children to school . @entity87 , @entity88 , @entity89 are all among the list of respected institutions that teach the offspring of wealthy @entity19 . the @entity17 @entity93 in @entity5 calculates that @entity17 nationals now spend over $ 446 million per year on fees , tutoring and accommodation at @entity18 schools and university . " @entity15 clients are very much driven by the need to educate their children , " says @entity16 . " education usually means putting the children on an international stage , and that \'s one reason why this is feeding into the demand for property in @entity5 . " indeed , education industry experts @entity111 say there were over 17,500 @entity17 studying in @entity18 universities in 2012 -- about 1,000 more than the 2009/10 academic session . and experts are expecting this trend to continue . 
" virtually all the transactions are for end use , not rental investment , which indicates that the @entity15 buyer market in @entity5 has significant room for growth , " says @entity117 , director at @entity44 . " african buyers or luxury tenants in @entity5 are currently where the @entity126 and @entity127 were five years ago . they have the resources and desire to purchase or rental luxury homes in @entity5 , " he adds . " it is going to be the @entity15 century . " more from @entity133 read this : @entity133 \'s green lean speed machines read this : @entity15 designs rocking art world editor \'s note : @entity141 covers the macro trends impacting the region and also focuses on the continent \'s key industries and corporations', 'property experts say @placeholder investment in @entity5 is set to grow', '@entity15', ['@entity15', '@entity117', '@entity111', '@entity87', '@entity88', '@entity89', '@entity133', '@entity0', '@entity5', '@entity9', '@entity8', '@entity52', '@entity10', '@entity57', '@entity55', '@entity74', '@entity17', '@entity16', '@entity33', '@entity32', '@entity35', '@entity34', '@entity36', '@entity19', '@entity18', '@entity126', '@entity127', '@entity93', '@entity141', '@entity44', '@entity47', '@entity64', '@entity49'])

Overall pipeline of the function 'predict'

This function computes the cosine similarity between 'query(answer_cand)' (the query sentence with @placeholder replaced by answer_cand) and each sentence in the text of the file. One filled-in query is compared against every sentence in the text, and the highest similarity score found is recorded for that answer_cand. After looping over all candidates, the model picks the highest-scoring candidate as the answer. As you can see, the function returns the predicted answer together with the highest similarity it computed.

num_features means 'the number of words to consider': the cosine similarity of two sentences is computed over only the top N most frequent words, and you can set N through this parameter.
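As a quick illustration of one scoring step, here is a minimal sketch on two made-up sentences (not from the dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = CountVectorizer(max_features=100)
features = vec.fit_transform(["luxury property investors london",
                              "property investors buy luxury homes"]).toarray()
# the two sentences share three words -> cosine similarity of about 0.67
print cosine_similarity(features[0].reshape(1, -1), features[1].reshape(1, -1))[0, 0]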


In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [15]:
def predict(testset, num_features=100):
    abstract = testset[1]
    query = testset[2]
    answer = testset[3]
    answer_cand = testset[4]
    # One filled-in query per candidate, cleaned the same way as the text.
    query_list_filled = [clean_text(make_query(query, cand)) for cand in answer_cand]
    
    vectorizer = CountVectorizer(analyzer='word', tokenizer=None, preprocessor=None,
                                 stop_words=None, max_features=num_features)
    sentence_list = text_to_sentences(abstract, tokenizer)
    sentence_list += query_list_filled
    
    train_data_features = vectorizer.fit_transform(sentence_list)
    train_data_features = train_data_features.toarray()
    
    ratio = 0.0
    index = -1

    # The filled-in queries sit at the end of the feature matrix, so the
    # negative index i picks out a candidate while j runs over the
    # abstract's sentences.
    for i in range(-len(query_list_filled), 0):
        cand_sentence = train_data_features[i].reshape(1, -1)
        for j in range(len(sentence_list) - len(query_list_filled)):
            abst_sentence = train_data_features[j].reshape(1, -1)
            temp = cosine_similarity(cand_sentence, abst_sentence)[0, 0]
            if temp > ratio:
                ratio = temp
                index = i
    
    # index is negative, so it also indexes answer_cand from the end.
    prediction = answer_cand[index]
    
    return prediction, ratio

In [17]:
predict(testSets[1])


Out[17]:
('@entity22', 0.89766562)

In [23]:
def get_accuracy(someSets, num_features=100):
    length = len(someSets)
    cnt = 0            # number of correct predictions
    cnt_for_print = 0  # number of examples processed so far
    percent = 0
    num_err = 0        # number of files predict() failed on
    for a_set in someSets:
        cnt_for_print += 1
        answer = a_set[3].lower().strip()
        try:
            prediction = predict(a_set, num_features)[0].lower().strip()
            if answer == prediction:
                cnt += 1
        except:
            num_err += 1
        # Report progress after every 1% of the test set.
        if cnt_for_print % (length / 100) == 0:
            percent += 1
            print "%d%% processed..." % (percent)
    
    return cnt * 100.0 / (length - num_err)

In [ ]:
get_accuracy(testSets)


1% processed...
2% processed...
3% processed...
4% processed...
5% processed...
6% processed...
7% processed...
8% processed...
9% processed...
10% processed...
11% processed...
12% processed...
13% processed...
14% processed...

In [12]:
'''
This extract function is for the CNN data type, where answer candidates appear anonymized in the text.
Answer candidates are formatted like "@entity123:King", so we need to drop the ':King' part before computing cosine similarity.
In that case, use this 'extract' function instead of the 'extract' above.
'''

def extract(listdata):
    title = listdata[0].strip()
    passage = listdata[2].strip()
    query = listdata[4].strip()
    ans = listdata[6].strip()
    cand = listdata[8:]
    # Drop a stray blank line among the candidates, if any.
    try:
        cand.remove('\n')
    except ValueError:
        pass
    # Keep only the '@entityN' id, discarding the ':Name' suffix.
    cand = [c.strip().split(':')[0] for c in cand]
    
    return title, passage, query, ans, cand
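A quick check of the candidate cleanup, using two candidate lines from the sample file above:

print [c.strip().split(':')[0] for c in ['@entity15:African\n', '@entity117:Gary Hersham\n']]
# -> ['@entity15', '@entity117']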