Mems that get sent to Proxi should have an image associated with them. Human players should have an easy enough time finding images that represent their mems, but when we have nothing but a chunk of text to go on, this is more difficult. What words are important? Do the important words build clear, unambiguous queries?
First things first: I need a way to search the internet for images. I'll be using a combination of the Noun Project API, along with the Bing Image Search API. I considered using Google's Custom Image Search, but the daily limits (100 searches/day) were a little low for my use.
Here are some quick functions to format the GET requests for Bing and the Noun Project.
In [142]:
import os
import json
import requests
from requests_oauthlib import OAuth1

def get_secret(service):
    """Build the path to a service's secret file in the local store."""
    local = os.getcwd()
    root = os.path.sep.join(local.split(os.path.sep)[:3])
    secret_pth = os.path.join(root, '.ssh', '{}.json'.format(service))
    return secret_pth

def load_secret(service):
    """Load secrets from a local store.

    Args:
        service: str naming the service

    Returns:
        dict: storing key: value secrets
    """
    pth = get_secret(service)
    with open(pth) as f:
        secret = json.load(f)
    return secret

BING_API_KEY = load_secret('bing')
NP_API_KEY, NP_API_SECRET = load_secret('noun_project')
def search_bing_for_image(query):
    """
    Perform a Bing image search.

    Args:
        query: Image search query

    Returns:
        urls: List of urls from results
    """
    search_params = {'q': query,
                     'mkt': 'en-us',
                     'safeSearch': 'strict'}
    auth = {'Ocp-Apim-Subscription-Key': BING_API_KEY}
    url = 'https://api.cognitive.microsoft.com/bing/v5.0/images/search'
    r = requests.get(url, params=search_params, headers=auth)
    results = r.json()['value']
    urls = [result['contentUrl'] for result in results]
    return urls
def search_np_for_image(query):
    """
    Perform a Noun Project image search.

    Args:
        query: Image search query

    Returns:
        urls: List of preview urls from results
    """
    auth = OAuth1(NP_API_KEY, NP_API_SECRET)
    endpoint = 'http://api.thenounproject.com/icons/{}'.format(query)
    params = {'limit_to_public_domain': 1,
              'limit': 5}
    response = requests.get(endpoint, params=params, auth=auth)
    urls = [icon['preview_url'] for icon in response.json()['icons']]
    return urls
In [35]:
print(search_np_for_image('magic')[:3])
In [36]:
print(search_bing_for_image('magic')[:3])
For quick prototyping, I wrote another function to display an image from its URL.
In [28]:
from io import BytesIO
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

def view_urls(urls):
    """Display the images found at the provided urls."""
    for url in urls:
        resp = requests.get(url)  # follows redirects to the final image
        img = Image.open(BytesIO(resp.content))
        plt.imshow(img)
        plt.axis('off')
        plt.show()
In [57]:
view_urls(search_bing_for_image('magic')[:3])
In [61]:
view_urls(search_np_for_image('magic')[:3])
My teammates and I have done a lot of work these past few weeks extracting the names of people, places, things, activities, and moods from text. This work has been incorporated into a Python package, Pensieve. I'll load up the package and take a look at the words that were extracted from the paragraphs in Harry Potter and the Sorcerer's Stone.
In [1]:
import pensieve
In [2]:
book1 = pensieve.Doc('../../corpus/book1.txt', doc_id=1)
In [9]:
from pprint import pprint
from numpy.random import randint
In [11]:
rand = randint(len(book1.paragraphs))
print(book1.paragraphs[rand].text)
pprint(book1.paragraphs[rand].words)
It is nice that the objects and verbs are extracted, but there are too many to just throw into one search query. To find the most important ones, let's play around with some features in textacy that can impose an importance ordering on the words in a paragraph. Most of these features effectively implement different vertex importance sorting algorithms on a semantic network, so a nice place to start is the semantic network itself.
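Before reaching for textacy, it helps to see what such a network actually is. Here is a minimal hand-rolled sketch with networkx: terms become nodes, and terms that co-occur within a small sliding window get an edge. The toy token list and window size are my own choices for illustration, not pensieve output.

```python
import networkx as nx

def cooccurrence_graph(tokens, window=2):
    """Connect each token to the tokens within `window` positions after it."""
    graph = nx.Graph()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if tok != other:  # skip self-loops from repeated words
                graph.add_edge(tok, other)
    return graph

# Toy token list standing in for a paragraph's extracted nouns and verbs.
tokens = ['plant', 'kept', 'growing', 'harry', 'watered', 'plant']
graph = cooccurrence_graph(tokens)
print(graph.number_of_nodes(), graph.number_of_edges())  # 5 9
```

Terms that show up in many contexts ("plant" above) end up with many neighbors, which is exactly the structure the ranking algorithms below exploit.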
In [37]:
import textacy
import networkx as nx
In [74]:
graph = textacy.network.terms_to_semantic_network(book1.paragraphs[400].spacy_doc)
print(book1.paragraphs[400].text)
textacy.viz.draw_semantic_network(graph);
First, we'll look at textacy.keyterms.textrank. This implements the TextRank algorithm, which iteratively computes a score for each vertex in the graph that roughly corresponds to the number of vertices connected to that vertex.
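The core of that iteration is PageRank run over the term graph. A hedged sketch with networkx (the toy graph is mine, and textacy's implementation adds candidate filtering and normalization on top of this step):

```python
import networkx as nx

# Toy term graph standing in for a paragraph's semantic network.
graph = nx.Graph()
graph.add_edges_from([
    ('plant', 'grow'), ('plant', 'water'), ('plant', 'window'),
    ('grow', 'water'), ('window', 'sill'),
])

# TextRank's core step: PageRank lets score accumulate on well-connected terms.
scores = nx.pagerank(graph)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # 'plant' sits at the center of the graph
```

Because "plant" touches the most neighbors, it collects the most score, which matches the behavior we see from textacy below.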
In [75]:
print(book1.paragraphs[400].text)
textacy.keyterms.textrank(book1.paragraphs[400].spacy_doc)
Out[75]:
This seems to work pretty well! I suppose in this example, I may have chosen "study", but "plant" still makes sense. Let's look at another paragraph to see what happens.
In [80]:
print(book1.paragraphs[654].text)
textacy.keyterms.textrank(book1.paragraphs[654].spacy_doc)
Out[80]:
We may run into trouble when the most important nodes are character names. An image search with the query "ron" or even "ron harry potter" is unlikely to give us good results. More on this later...
What about other algorithms? Let's try SGRank. This algorithm improves upon TextRank by discarding unlikely keyword candidates and performing multiple rounds of ranking. It is also capable of outputting multi-word phrases.
In [81]:
print(book1.paragraphs[400].text)
textacy.keyterms.sgrank(book1.paragraphs[400].spacy_doc)
Out[81]:
In [83]:
print(book1.paragraphs[654].text)
textacy.keyterms.sgrank(book1.paragraphs[654].spacy_doc)
Out[83]:
Now this may end up being a little too specific for our purposes. "gryffindor common room" would be a great result for this image, but "dumpy little witch" is not as good...
DivRank attempts to provide a ranking that balances node centrality with node diversity. Let's see how that fares.
In [90]:
print(book1.paragraphs[400].text)
textacy.keyterms.key_terms_from_semantic_network(book1.paragraphs[400].spacy_doc, ranking_algo='divrank')
Out[90]:
In [91]:
print(book1.paragraphs[654].text)
textacy.keyterms.key_terms_from_semantic_network(book1.paragraphs[654].spacy_doc, ranking_algo='divrank')
Out[91]:
We seem to be getting pretty consistent results between DivRank and TextRank. For simplicity, I'm settling on TextRank.
If the most important node in the semantic network is a character's name, we are unlikely to get decent image search results. The quickest way around this is to move down the ranking until we find a term that isn't a named character. Pensieve collects a list of all of the people named in a document, so this is simple to implement.
In [145]:
def build_query(par):
    """
    Use TextRank to find the most important term that isn't a character name.
    """
    keyterms = textacy.keyterms.textrank(par.spacy_doc)
    for keyterm, rank in keyterms:
        if keyterm.title() not in par.doc.words['people']:
            return keyterm
    return None
In [146]:
par = book1.paragraphs[randint(len(book1.paragraphs))]
print(par.text)
build_query(par)
Out[146]:
In [166]:
def submit_query(query):
    """
    Try the Noun Project first; fall back to Bing if the icon search fails.
    """
    try:
        urls = search_np_for_image(query)
    except Exception:
        urls = search_bing_for_image(query)
    return urls
In [173]:
par = book1.paragraphs[400]
print(par.text)
query = build_query(par)
print(query)
urls = submit_query(query)
view_urls(urls[:1])