NLTK WordNet exploration notebook

Just exploring WordNet


In [1]:
from nltk.corpus import wordnet as wn
import re
wnword = wn.synsets('computer')[0]
print wnword

In [3]:
wndef = wnword.definition()
print(wndef)


a machine for performing calculations automatically

In [4]:
wndefList = re.sub(r"[^\w]", " ", wndef).split()  #replace every non-word character with a space, then split on whitespace
print wndefList


[u'a', u'machine', u'for', u'performing', u'calculations', u'automatically']

In [5]:
machineWord = wn.synsets(wndefList[1])[0].definition()
print machineWord


any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks

In [6]:
synOffset = wn.synset('dogshit.n.01').offset()
#synOffsetFilled = str(synOffset).zfill(8)
print synOffset


6611376

In [7]:
syns = list(wn.all_synsets())
offsets_list = [(s.offset(), s) for s in syns]
offsets_dict = dict(offsets_list)

In [8]:
print offsets_dict[6611376]


Synset('bullshit.n.01')
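
The ID numbers printed later in this notebook are zero-padded to 8 digits (for example 00458890), while `offset()` returns a plain integer. As a minimal sketch, the two helpers below convert between the two forms; the function names are made up for this illustration and are not part of NLTK.

```python
def offset_to_id(offset):
    """Format an integer synset offset as an 8-digit, zero-padded ID string."""
    return str(offset).zfill(8)


def id_to_offset(id_string):
    """Convert a zero-padded ID string back to the integer used as a dict key."""
    return int(id_string)


print(offset_to_id(6611376))      # 06611376
print(id_to_offset('00458890'))   # 458890
```

The integer form is what `offsets_dict` uses as its key, so an ID read from a text file has to go through `int()` before the lookup.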

Next, we read the computer science keywords from a file, then convert each old ID (WordNet 1.6) to its new ID (WordNet 3.0) and look up the matching synset. The cell below handles a single line of the file, for experimenting in the notebook.


In [9]:
csWordsIndex = 1   #line number to read
csWords = open('decode_wordnet/csWordnetWordsNouns.txt')
csWordsLineData = csWords.readlines()
print len(csWordsLineData)
idIndex = csWordsLineData[csWordsIndex].find(':')
idToConvert = csWordsLineData[csWordsIndex][idIndex+2:idIndex+10]
print 'old ID number: ' + idToConvert
convertIdHash = open('decode_wordnet/wn16-30noun.txt')
convertIdHashData = convertIdHash.read()
newIdIndex = convertIdHashData.find(idToConvert)
newId = convertIdHashData[newIdIndex+9:newIdIndex+17]
print 'new ID number: ' + newId
newCsWord = offsets_dict[int(newId)]
print newCsWord


450
old ID number: 00292171
new ID number: 00458890
Synset('computer_game.n.01')
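
The lookup above searches the whole conversion file as one string. A possible alternative, sketched here under the assumption that each line of the conversion file starts with the old ID followed by whitespace and the new ID, is to parse the file once into a dictionary. The function name `parse_id_map` and the sample lines are made up for illustration; only the pair 00292171 / 00458890 comes from the output above.

```python
def parse_id_map(lines):
    """Build {old_id: new_id} from lines that start with 'OLDID NEWID'.

    Assumed format: whitespace-separated fields, old ID first, new ID second.
    Lines with fewer than two fields (such as blank lines) are skipped.
    """
    id_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            id_map[fields[0]] = fields[1]
    return id_map


# Illustrative sample; a real run would pass the lines of the conversion file.
sample = ["00292171 00458890 1\n", "00000001 00000002 1\n", "\n"]
id_map = parse_id_map(sample)
print(id_map.get('00292171'))   # 00458890
print(id_map.get('99999999'))   # None
```

With a dictionary, an unknown ID yields `None`. By contrast, `str.find` returns -1 for a missing ID, and the slice that follows then reads unrelated characters from the file.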

This code looks up the old ID (WordNet 1.6), the new ID (WordNet 3.0), the synset, and its definition for every keyword, then writes them to a tab-separated file that Excel can read.


In [10]:
from nltk.corpus import wordnet as wn
import re

#open files and read into variables
csWordnetWordsNounsOutput = open('decode_wordnet/csWordnetWordsNounsOutput.TXT','w') #the output file
csWords = open('decode_wordnet/csWordnetWordsNouns.txt') #the input file
csWordsLineData = csWords.readlines()
convertIdHash = open('decode_wordnet/wn16-30noun.txt') #the hash table between wordnet versions
convertIdHashData = convertIdHash.read()

#hash index and lookup word and write into file
for csWordsIndex in range(len(csWordsLineData)): #read every line in the file, including the last one
    idIndex = csWordsLineData[csWordsIndex].find(':')
    idToConvert = csWordsLineData[csWordsIndex][idIndex+2:idIndex+10] #find id to convert
    newIdIndex = convertIdHashData.find(idToConvert)
    if newIdIndex == -1: #old id is not in the conversion table, so skip this line
        continue
    newId = convertIdHashData[newIdIndex+9:newIdIndex+17]
    try: #newId may not be a number, or may be missing from offsets_dict
        newCsWord = offsets_dict[int(newId)]
    except (ValueError, KeyError):
        continue
    newCsWordDef = newCsWord.definition()
    #write to file:
    csWordnetWordsNounsOutput.write(str(idToConvert) + '\t' + str(newId) + '\t' + str(newCsWord) + '\t' + newCsWordDef + '\n')
csWordnetWordsNounsOutput.close()
csWords.close()
convertIdHash.close()

Requests, wikipedia, data munging

Exploring requests module


In [11]:
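import os
import requests
 
#credentials are read from the environment rather than hardcoded in the notebook
#AIONE_USER and AIONE_PASSWORD are placeholder variable names
r = requests.get('http://aione.tritera.com', auth=(os.environ['AIONE_USER'], os.environ['AIONE_PASSWORD']))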
 
print r.status_code
print r.headers['content-type']
#r.text


200
text/html

Wikipedia API: getting data from Wikipedia with the "wikipedia" module is really easy!


In [14]:
import wikipedia
wikiReturnText = wikipedia.summary("Virtual reality", sentences=1)
print wikiReturnText


Virtual reality (VR), sometimes referred to as immersive multimedia, is a computer-simulated environment that can simulate physical presence in places in the real world or imagined worlds.

Playing with Mechanize and Selenium:

Exploring mechanize module


In [80]:
import re
import mechanize

URL = "http://aione.tritera.com/gui/goto.main.php#"
br = mechanize.Browser()
br.open(URL)
request = mechanize.Request(URL)

print br.title()
print request


ai-one Integrated Brain Demo
<Request for http://aione.tritera.com/gui/goto.main.php#>

This code grabs a definition from Wikipedia and enters it into ai-one using the Selenium module:


In [17]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import wikipedia

wikiReturnText = wikipedia.summary("computer", sentences=1) #input keyword
driver = webdriver.Firefox()  #use Firefox
driver.get("http://aione.tritera.com/gui/goto.main.php#")  #go to website
elem = driver.find_element_by_link_text('Source')  #find Source tab and store to element elem
elem.click()  #click Source tab
elem2 = driver.find_element_by_id('sourceText')  #find text element
elem2.clear()  #clear the contents of the textbox
elem2.send_keys(wikiReturnText) #input definition
driver.find_element_by_id("btn_find_all").click()  #click
driver.implicitly_wait(10) #let each element lookup retry for up to 10 sec while results load
aiOneKeywordList = [] #initialize/clear keyword list
for i in range(0,20):  #get up to 20 keywords (expect less)
    try:
        value1 = driver.find_element_by_id(".word"+str(i)+". ").get_attribute("value")
        #print value1
        aiOneKeywordList.append(str(value1))
    except Exception:  #element doesn't exist, so there are no more keywords
        break #stop here instead of trying the remaining indices
print aiOneKeywordList
driver.close()


['purpose', 'device', 'arithmetic', 'operations', 'computer']

This code reads the WordNet definitions written to file above and enters each one into ai-one:


In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import re

aiOneKeywords = open('decode_wordnet/aiOneKeywords2.TXT','w')
wordnetDefs = open('decode_wordnet/csWordnetWordsNounsOutput.TXT','r')
wordnetDefsLineData = wordnetDefs.readlines()
wordnetDefsList = []
for i in range(len(wordnetDefsLineData)): #include the last line of the file
    
    wordnetDefString = wordnetDefsLineData[i].split('\t')[3]
    if wordnetDefString.startswith( '(computer science)' ):
        wordnetDefString = wordnetDefString[19:] #drop the label and the space after it
        
    wordnetDefsList.append(wordnetDefString)

driver = webdriver.Firefox()  #use Firefox
driver.get("http://aione.tritera.com/gui/goto.main.php#")  #go to website

for k in range(len(wordnetDefsList)): #one pass per definition
    elem = driver.find_element_by_link_text('Source')  #find Source tab and store to element elem
    elem.click()  #click Source tab
    elem2 = driver.find_element_by_id('sourceText')  #find text element
    elem2.clear()  #clear the contents of the textbox
    elem2.send_keys(wordnetDefsList[k]) #input definition
    driver.find_element_by_id("btn_find_all").click()  #click
    driver.implicitly_wait(10) #let each element lookup retry for up to 10 sec while results load
    
    aiOneKeywordList = [] #initialize/clear keyword list
    for i in range(0,20):  #get up to 20 keywords (expect less)
        try:
            value1 = driver.find_element_by_id(".word"+str(i)+". ").get_attribute("value")
            #print value1
            aiOneKeywordList.append(str(value1))
        except Exception:  #element doesn't exist, so there are no more keywords
            break #stop here instead of trying the remaining indices
            
    aiOneKeywords.write(wordnetDefsLineData[k].split('\t')[2] + '\t' + str(aiOneKeywordList) + '\n')
    
aiOneKeywords.close() #make sure the keyword file is flushed to disk
wordnetDefs.close()
driver.close()


---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-2-0cc2bfbcdca5> in <module>()
     19 
     20 for k in range(0,len(wordnetDefsLineData)-1):
---> 21     elem = driver.find_element_by_link_text('Source')  #find Source tab and store to element elem
     22     elem.click()  #click Source tab
     23     elem2 = driver.find_element_by_id('sourceText')  #find text element

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in find_element_by_link_text(self, link_text)
    252             driver.find_element_by_link_text('Sign In')
    253         """
--> 254         return self.find_element(by=By.LINK_TEXT, value=link_text)
    255 
    256     def find_elements_by_link_text(self, text):

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in find_element(self, by, value)
    660 
    661         return self.execute(Command.FIND_ELEMENT,
--> 662                              {'using': by, 'value': value})['value']
    663 
    664     def find_elements(self, by=By.ID, value=None):

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in execute(self, driver_command, params)
    169 
    170         params = self._wrap_value(params)
--> 171         response = self.command_executor.execute(driver_command, params)
    172         if response:
    173             self.error_handler.check_response(response)

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.pyc in execute(self, command, params)
    347         path = string.Template(command_info[1]).substitute(params)
    348         url = '%s%s' % (self._url, path)
--> 349         return self._request(command_info[0], url, body=data)
    350 
    351     def _request(self, method, url, body=None):

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.pyc in _request(self, method, url, body)
    377                 body = None
    378             try:
--> 379                 self._conn.request(method, parsed_url.path, body, headers)
    380                 resp = self._conn.getresponse()
    381             except httplib.HTTPException:

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/httplib.pyc in request(self, method, url, body, headers)
    971     def request(self, method, url, body=None, headers={}):
    972         """Send a complete request to the server."""
--> 973         self._send_request(method, url, body, headers)
    974 
    975     def _set_content_length(self, body):

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/httplib.pyc in _send_request(self, method, url, body, headers)
   1005         for hdr, value in headers.iteritems():
   1006             self.putheader(hdr, value)
-> 1007         self.endheaders(body)
   1008 
   1009     def getresponse(self, buffering=False):

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/httplib.pyc in endheaders(self, message_body)
    967         else:
    968             raise CannotSendHeader()
--> 969         self._send_output(message_body)
    970 
    971     def request(self, method, url, body=None, headers={}):

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/httplib.pyc in _send_output(self, message_body)
    827             msg += message_body
    828             message_body = None
--> 829         self.send(msg)
    830         if message_body is not None:
    831             #message_body was not a string (i.e. it is a file) and

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/httplib.pyc in send(self, data)
    789         if self.sock is None:
    790             if self.auto_open:
--> 791                 self.connect()
    792             else:
    793                 raise NotConnected()

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/httplib.pyc in connect(self)
    770         """Connect to the host and port specified in __init__."""
    771         self.sock = socket.create_connection((self.host,self.port),
--> 772                                              self.timeout, self.source_address)
    773 
    774         if self._tunnel_host:

/Applications/Canopy.app/appdata/canopy-1.4.1.1975.macosx-x86_64/Canopy.app/Contents/lib/python2.7/socket.pyc in create_connection(address, timeout, source_address)
    569 
    570     if err is not None:
--> 571         raise err
    572     else:
    573         raise error("getaddrinfo returns an empty list")

error: [Errno 61] Connection refused

Going deep with the keyword 'allocation'


In [28]:
from nltk.corpus import wordnet as wn
import re

keyword = 'allocation'
keyword = keyword.replace(' ','_')  #wordnet needs underscores not spaces
wnword = [synset.definition() for synset in wn.synsets(keyword)]  #every definition of the word

print wnword


[u'a share set aside for a specific purpose', u'the act of distributing by allotting or apportioning; distribution according to a plan', u'(computer science) the assignment of particular areas of a magnetic disk to particular data or instructions']
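
The third definition above starts with a domain label in parentheses. Other cells in this notebook drop it by slicing off a fixed 19 characters, which only fits the exact label "(computer science) ". As a hedged alternative, the sketch below removes any single parenthesised label, but only when it sits at the very start of the string. The names `LEADING_LABEL` and `strip_leading_label` are made up for this example.

```python
import re

# Matches one leading parenthesised label, such as "(computer science) ".
LEADING_LABEL = re.compile(r'^\([^)]*\)\s*')


def strip_leading_label(definition):
    """Remove a parenthesised label only when it starts the definition."""
    return LEADING_LABEL.sub('', definition)


print(strip_leading_label('(computer science) the assignment of particular areas'))
print(strip_leading_label('a share set aside for a specific purpose'))
```

Definitions without a leading label pass through unchanged, and parentheses that appear later in a definition are left alone.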

Pulling everything together: starting from a seed word, generate its associated keywords and definitions.
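
In outline, the expansion is a breadth-first walk: each keyword is looked up, any keywords returned that have not been seen are appended to the master list, and the walk continues down that list. The sketch below shows only that control flow. The `extract_keywords` argument is a hypothetical callable standing in for the WordNet lookup plus the ai-one round trip, and the toy table exists purely to exercise the function.

```python
from collections import deque


def expand_keywords(seed, extract_keywords, limit=200):
    """Breadth-first expansion: return (keywords, edges) starting from seed.

    extract_keywords is a hypothetical callable that maps a word to a list
    of related words. limit caps the total number of keywords collected.
    """
    seen = {seed}
    order = [seed]          # keywords in the order they were discovered
    edges = []              # (parent keyword, generated keyword) pairs
    queue = deque([seed])
    while queue and len(order) < limit:
        word = queue.popleft()
        for related in extract_keywords(word):
            if related not in seen and len(order) < limit:
                seen.add(related)
                order.append(related)
                edges.append((word, related))
                queue.append(related)
    return order, edges


# Toy lookup table used only to exercise the function.
toy = {'allocation': ['share', 'disk'], 'share': ['allocation', 'part'], 'disk': []}
keywords, edges = expand_keywords('allocation', lambda w: toy.get(w, []))
print(keywords)   # ['allocation', 'share', 'disk', 'part']
print(edges)
```

Keeping a separate `seen` set makes the membership test cheap, and the `edges` list holds the same parent/child pairs that a keyword tree file would record.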


In [6]:
#takes a keyword in and generates two files: a keyword and generated keywords; and a keyword definition file

import traceback
import sys
import re
import time
from selenium import webdriver  
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from nltk.corpus import wordnet as wn

masterKeywordFile = open('decode_wordnet/masterKeywordFile.TXT','w') #keyword definition file
keywordTreeFile = open('decode_wordnet/keywordTreeFile.TXT','w') #keyword generation file

masterKeywordList = []
masterKeywordList = masterKeywordList + ['allocation']   #initial word, in this case 'allocation'

driver = webdriver.Firefox()  #use Firefox
driver.get("http://aione.tritera.com/gui/goto.main.php#")  #go to website
time.sleep(10) #wait 10 sec for browser to load

for masterKeywordListIndex in range(0,200):  #do this for up to 200 words in masterKeywordList
    
    wordnetDefs = []
    for i in range(0,20):  #print every definition possible for the word in master keyword file (expect less than 20 definitions)
        try: #replaces spaces with underscores
            keyword = masterKeywordList[masterKeywordListIndex]
            keyword = keyword.replace(' ','_')  #wordnet needs underscores not spaces
            wordnetDefs = wordnetDefs + [wn.synsets(keyword)[i].definition()] #wordnet lookup
            masterKeywordFile.write( keyword + '\t' + str(wordnetDefs[i]) +'\n')
        except:
            pass

    for k in range(0,len(wordnetDefs)):  #do this for each wordnet definition of the keyword
        
        if wordnetDefs[k].startswith('(computer science) '): #takes the (computer science) label off the front of the string
            wordnetDefs[k] = wordnetDefs[k][19:] #the label plus its trailing space is 19 characters
            
        elem = driver.find_element_by_link_text('Source')  #find Source tab and store to element elem
        elem.click()  #click Source tab
        time.sleep(1) #give some time after click
        elem2 = driver.find_element_by_id('sourceText')  #find text element
        elem2.clear()  #clear the contents of the textbox
        elem2.send_keys(wordnetDefs[k]) #input definition
        driver.find_element_by_id("btn_find_all").click()  #click
        time.sleep(10) #wait 10 sec for page to load results
        
        for i in range(0,20):  #get up to 20 keywords (expect less)
            try:
                value1 = driver.find_element_by_id('.word'+str(i)+'. ').get_attribute("value")
                time.sleep(5)
                if str(value1) in masterKeywordList:
                    pass
                else:  #store keyword if returned keyword not in master keyword list
                    masterKeywordList = masterKeywordList + [str(value1)]
                    keywordTreeFile.write(keyword + '\t' + str(value1) + '\n')
            except Exception, err:
                #print traceback.format_exc()
                break #it was hanging here, so break was necessary        

masterKeywordFile.close()
keywordTreeFile.close()
driver.close()


---------------------------------------------------------------------------
ElementNotVisibleException                Traceback (most recent call last)
<ipython-input-6-551a3e8b3a27> in <module>()
     39         time.sleep(1)
     40         elem2 = driver.find_element_by_id('sourceText')  #find text element
---> 41         elem2.clear()  #clear the contents of the textbox
     42         elem2.send_keys(wordnetDefs[k]) #input definition
     43         driver.find_element_by_id("btn_find_all").click()  #click

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.pyc in clear(self)
     71     def clear(self):
     72         """Clears the text if it's a text entry element."""
---> 73         self._execute(Command.CLEAR_ELEMENT)
     74 
     75     def get_attribute(self, name):

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.pyc in _execute(self, command, params)
    383             params = {}
    384         params['id'] = self._id
--> 385         return self._parent.execute(command, params)
    386 
    387     def find_element(self, by=By.ID, value=None):

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.pyc in execute(self, driver_command, params)
    171         response = self.command_executor.execute(driver_command, params)
    172         if response:
--> 173             self.error_handler.check_response(response)
    174             response['value'] = self._unwrap_value(
    175                 response.get('value', None))

/Users/scott/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.pyc in check_response(self, response)
    164         elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
    165             raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 166         raise exception_class(message, screen, stacktrace)
    167 
    168     def _value_or_default(self, obj, key, default):

ElementNotVisibleException: Message: u'Element is not currently visible and so may not be interacted with' ; Stacktrace: 
    at fxdriver.preconditions.visible (file:///var/folders/m0/zs_gtkld71bb06pqswcsndv00000gn/T/tmphhY4aL/extensions/fxdriver@googlecode.com/components/command-processor.js:8936:5)
    at DelayedCommand.prototype.checkPreconditions_ (file:///var/folders/m0/zs_gtkld71bb06pqswcsndv00000gn/T/tmphhY4aL/extensions/fxdriver@googlecode.com/components/command-processor.js:11595:1)
    at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/m0/zs_gtkld71bb06pqswcsndv00000gn/T/tmphhY4aL/extensions/fxdriver@googlecode.com/components/command-processor.js:11612:11)
    at DelayedCommand.prototype.executeInternal_ (file:///var/folders/m0/zs_gtkld71bb06pqswcsndv00000gn/T/tmphhY4aL/extensions/fxdriver@googlecode.com/components/command-processor.js:11617:7)
    at DelayedCommand.prototype.execute/< (file:///var/folders/m0/zs_gtkld71bb06pqswcsndv00000gn/T/tmphhY4aL/extensions/fxdriver@googlecode.com/components/command-processor.js:11559:5) 
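
The exception above means the text box existed in the page but was not yet visible when `clear()` was called, so a fixed `time.sleep(1)` after the click was not always long enough. One common remedy is to poll until the element is ready. Selenium ships `WebDriverWait` for this; the sketch below is a plain-Python stand-in that shows the same idea without a browser. The name `wait_until` and the fake visibility check are assumptions made for the example.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or raises RuntimeError on timeout.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError('condition not met within %.1f seconds' % timeout)
        time.sleep(interval)


# Stand-in for "is the text box visible yet?": becomes true on the third check.
calls = {'count': 0}


def visible_after_three_checks():
    calls['count'] += 1
    return calls['count'] >= 3


print(wait_until(visible_after_three_checks, timeout=2.0, interval=0.01))   # True
```

In the Selenium cell, the condition could be something like `lambda: driver.find_element_by_id('sourceText').is_displayed()`, called before `elem2.clear()`. That usage is untested here because the ai-one site is needed to run it.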
