SearchBetter demos

Getting started

Before you run this demo, you'll need to do a few things:

  • Create secure.py in the src/ directory; we've provided secure.py.example as a starting point. Also make sure that every folder you reference in secure.py exists.
  • Download the Udacity course listings and put them in the folder you defined as DATASET_PATH_BASE in secure.py.
  • Download and clean the Wikipedia dump as described in the README.
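
As a rough sketch of the first step, a secure.py might look something like the following. Only the names DATASET_PATH_BASE, INDEX_PATH_BASE, and MODEL_PATH_BASE come from this demo; the paths and the ensure_folders helper are hypothetical, so defer to secure.py.example for the real layout.

```python
import os

# Hypothetical locations; replace them with your own. Note the trailing
# slashes: the demo builds file paths by plain string concatenation.
DATASET_PATH_BASE = "data/datasets/"
INDEX_PATH_BASE = "data/indices/"
MODEL_PATH_BASE = "data/models/"


def ensure_folders(paths):
    """Create each folder in `paths` if it does not exist yet (hypothetical helper)."""
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path)
```

Calling ensure_folders([DATASET_PATH_BASE, INDEX_PATH_BASE, MODEL_PATH_BASE]) once before running the demo is one way to satisfy the "folders exist" requirement.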

In [1]:
# First, let's get all the imports out of the way...

import gensim.models.word2vec as word2vec
from pprint import pprint

import sys
sys.path.append('../')
sys.path.append('../src/')

import searchbetter.search as search
reload(search)
import searchbetter.rewriter as rewriter
reload(rewriter)

import secure

Making a search engine

SearchBetter lets you make custom, batteries-included search engines out of any dataset, large or small. We include some examples in search.py. As an example, consider the pre-built Udacity search engine, which searches over a dump of Udacity's course listings.


In [2]:
# Create a search engine that searches over all Udacity courses.
# Under the hood, this uses the Whoosh library for Python to index
# the course data stored in a JSON file and then run searches against it.
dataset_path = secure.DATASET_PATH_BASE+'udacity-api.json'
index_path = secure.INDEX_PATH_BASE+'udacity'

# Use `create=False` if you've already made the search engine, `create=True` if this is
# your first time making it. We cache the search indices behind search engines on disk.
### UNCOMMENT THE BELOW IF YOU'RE RUNNING THIS FOR THE FIRST TIME
# udacity_engine = search.UdacitySearchEngine(dataset_path, index_path, create=True)
### COMMENT THE BELOW IF YOU'RE RUNNING THIS FOR THE FIRST TIME
udacity_engine = search.UdacitySearchEngine(dataset_path, index_path, create=False)

# We expose a simple searching API
search_term = "android"
udacity_results = udacity_engine.search(search_term)
print "%d Udacity search results for '%s':" % (len(udacity_results), search_term)
pprint(udacity_results)

print "\n"

# Searching works on bigrams (two-word queries) too!
search_term = "machine learning"
udacity_results = udacity_engine.search(search_term)
print "%d Udacity search results for '%s':" % (len(udacity_results), search_term)
pprint(udacity_results)


32 Udacity search results for 'android':
[({'slug': u'developing-android-apps--ud853', 'title': u'Developing Android Apps'}, 27.891231793004),
 ({'slug': u'new-android-fundamentals--ud851', 'title': u'New Android Fundamentals'}, 26.929221353569236),
 ({'slug': u'android-for-beginners--ud834', 'title': u'Android for Beginners'}, 26.75707216368228),
 ({'slug': u'android-for-beginners--ud834', 'title': u'Android for Beginners'}, 26.75707216368228),
 ({'slug': u'android-tv-and-google-cast-development--ud875B', 'title': u'Android TV and Google Cast Development'}, 21.311355239018084),
 ({'slug': u'gradle-for-android-and-java--ud867', 'title': u'Gradle for Android and Java'}, 21.10007543619399),
 ({'slug': u'android-wear-development--ud875A', 'title': u'Android Wear Development'}, 19.264404039503354),
 ({'slug': u'material-design-for-android-developers--ud862', 'title': u'Material Design for Android Developers'}, 19.188179664578513),
 ({'slug': u'android-basics-user-input--ud836', 'title': u'Android Basics: User Input'}, 19.11634972679303),
 ({'slug': u'android-basics-user-input--ud836', 'title': u'Android Basics: User Input'}, 19.11634972679303),
 ({'slug': u'android-basics-networking--ud843', 'title': u'Android Basics: Networking'}, 17.60295037819826),
 ({'slug': u'android-basics-networking--ud843', 'title': u'Android Basics: Networking'}, 17.60295037819826),
 ({'slug': u'android-basics-multi-screen-apps--ud839', 'title': u'Android Basics: Multi-screen Apps'}, 16.58196369949852),
 ({'slug': u'android-basics-data-storage--ud845', 'title': u'Android Basics: Data Storage'}, 16.396480537288635),
 ({'slug': u'monetize-your-android-app-with-ads--ud876-3', 'title': u'Monetize Your Android App with Ads'}, 16.20912394544632),
 ({'slug': u'android-auto-development--ud875C', 'title': u'Android Auto Development'}, 15.777494182896843),
 ({'slug': u'google-location-services-on-android--ud876-1', 'title': u'Google Location Services on Android'}, 15.218265776752935),
 ({'slug': u'google-analytics-for-android--ud876-2', 'title': u'Google Analytics for Android'}, 14.61065347503726),
 ({'slug': u'add-google-maps-to-your-android-app--ud876-4', 'title': u'Add Google Maps to your Android App'}, 14.463698150368904),
 ({'slug': u'advanced-android-app-development--ud855', 'title': u'Advanced Android App Development'}, 13.960184800008111),
 ({'slug': u'android-performance--ud825', 'title': u'Android Performance'}, 12.713908308302491),
 ({'slug': u'ux-design-for-mobile-developers--ud849', 'title': u'UX Design for Mobile Developers'}, 12.282743978049073),
 ({'slug': u'firebase-essentials-for-android--ud009', 'title': u'Firebase Essentials For Android'}, 11.668952369040962),
 ({'slug': u'firebase-in-a-weekend-by-google-android--ud0352', 'title': u'Firebase in a Weekend by Google: Android'}, 11.038456872373068),
 ({'slug': u'developing-scalable-apps-in-java--ud859', 'title': u'Developing Scalable Apps in Java'}, 7.273413796900539),
 ({'slug': u'2d-game-development-with-libgdx--ud405', 'title': u'2D Game Development with libGDX'}, 4.331130843986378),
 ({'slug': u'java-programming-basics--ud282', 'title': u'Java Programming Basics'}, 4.042153082512434),
 ({'slug': u'developing-scalable-apps-in-python--ud858', 'title': u'Developing Scalable Apps in Python'}, 3.7690489731826435),
 ({'slug': u'mobile-web-development--cs256', 'title': u'Mobile Web Development'}, 3.032391199783027),
 ({'slug': u'intro-to-java-programming--cs046', 'title': u'Intro to Java Programming'}, 2.5291822941828452),
 ({'slug': u'software-development-process--ud805', 'title': u'Software Development Process'}, 2.185297310511487),
 ({'slug': u'firebase-in-a-weekend-by-google-ios--ud0351', 'title': u'Firebase in a Weekend by Google: iOS'}, 1.8700573294592286)]


15 Udacity search results for 'machine learning':
[({'slug': u'deep-learning--ud730', 'title': u'Deep Learning'}, 47.550493539028466),
 ({'slug': u'intro-to-machine-learning--ud120', 'title': u'Intro to Machine Learning'}, 47.366745059644586),
 ({'slug': u'machine-learning-for-trading--ud501', 'title': u'Machine Learning for Trading'}, 46.211637044152376),
 ({'slug': u'machine-learning--ud262', 'title': u'Machine Learning'}, 41.25203270672728),
 ({'slug': u'intro-to-artificial-intelligence--cs271', 'title': u'Intro to Artificial Intelligence'}, 33.710023338914375),
 ({'slug': u'reinforcement-learning--ud600', 'title': u'Reinforcement Learning'}, 32.032009425495815),
 ({'slug': u'model-building-and-validation--ud919', 'title': u'Model Building and Validation'}, 23.884872987325835),
 ({'slug': u'intro-to-data-science--ud359', 'title': u'Intro to Data Science'}, 15.197983858488389),
 ({'slug': u'cse-8803-special-topics-big-data--ud758', 'title': u'CSE 8803 Special Topics: Big Data'}, 13.473366651333809),
 ({'slug': u'introduction-to-computer-vision--ud810', 'title': u'Introduction to Computer Vision'}, 13.053622566247574),
 ({'slug': u'configuring-linux-web-servers--ud299', 'title': u'Configuring Linux Web Servers'}, 11.736469526891145),
 ({'slug': u'segmentation-and-clustering--ud981', 'title': u'Segmentation and Clustering'}, 8.541501498264628),
 ({'slug': u'intro-to-descriptive-statistics--ud827', 'title': u'Intro to Descriptive Statistics'}, 7.634297592944632),
 ({'slug': u'scalable-microservices-with-kubernetes--ud615', 'title': u'Scalable Microservices with Kubernetes'}, 6.5650180277851025),
 ({'slug': u'intro-to-relational-databases--ud197', 'title': u'Intro to Relational Databases'}, 5.6437400637826505)]
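
The 'android' listing above contains a few courses twice (for example, Android for Beginners). If that matters for your use case, one option is to filter the results after the fact. This is a minimal sketch, assuming results are a list of (document, score) tuples as printed above; dedupe_by_slug is a hypothetical helper and not part of SearchBetter.

```python
def dedupe_by_slug(results):
    """Keep the first (document, score) pair seen for each slug, preserving order."""
    seen = set()
    unique = []
    for doc, score in results:
        if doc['slug'] not in seen:
            seen.add(doc['slug'])
            unique.append((doc, score))
    return unique


# Hand-copied sample in the same shape as the output above (scores rounded).
sample = [
    ({'slug': 'developing-android-apps--ud853', 'title': 'Developing Android Apps'}, 27.89),
    ({'slug': 'android-for-beginners--ud834', 'title': 'Android for Beginners'}, 26.76),
    ({'slug': 'android-for-beginners--ud834', 'title': 'Android for Beginners'}, 26.76),
]
unique = dedupe_by_slug(sample)
print("%d results before, %d after" % (len(sample), len(unique)))
```

Because the results arrive sorted by score, keeping the first occurrence also keeps the ranking intact.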

Query rewriting

Sometimes the plain-vanilla search engine just doesn't cut it: a query returns too few results, or none at all. With query rewriting, the search engine looks for terms semantically related to the user's query in addition to the query itself. This helps find more search results, which is particularly useful when the bare query doesn't get any hits.

SearchBetter has two built-in query rewriters: a simple one that uses Wikipedia's Categories API to find similar terms, and a more complex one that uses Word2Vec (a machine learning algorithm from Google for finding similar words, trained here on a Wikipedia article dump) to find similar phrases.
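
To make the idea concrete before bringing in the real rewriters, here is a toy sketch of the rewriting step. In this demo a rewriter exposes a rewrite(term) method that returns a list of terms; the LookupRewriter class and its hand-written table are hypothetical stand-ins, not part of SearchBetter.

```python
class LookupRewriter(object):
    """Toy rewriter backed by a hand-written table of related terms."""

    def __init__(self, related_terms):
        self.related_terms = related_terms

    def rewrite(self, term):
        # Return the related terms plus the original term, so that the
        # bare query is still searched.
        rewrites = list(self.related_terms.get(term, []))
        if term not in rewrites:
            rewrites.append(term)
        return rewrites


toy_rewriter = LookupRewriter({"socialism": ["communism", "capitalism"]})
print(toy_rewriter.rewrite("socialism"))
print(toy_rewriter.rewrite("android"))
```

A term with no entry in the table comes back as a one-element list, so searching with the rewritten terms never does worse than searching with the bare query.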


In [3]:
# Query rewriting lets you turn a single search query into
# multiple related queries. You can then search for *all*
# of these queries, which can give you more (and more useful)
# results than the original query alone would.

# First, a rewriter that uses the Wikipedia category API
# to find terms related to the original term
wiki_rewriter = rewriter.WikipediaRewriter()
term = "socialism"
wiki_rewritten_queries = wiki_rewriter.rewrite(term)
print "Rewrites of '%s' using Wikipedia Categories:" % term
pprint(wiki_rewritten_queries)


print "\n"


# Second, a rewriter that uses Word2Vec to find similar
# words to the entered term. This is a machine learning
# algorithm trained on a large text corpus.
# Prepare the corpus (from Wikipedia) to use for the Word2Vec Rewriter.
corpus = word2vec.LineSentence(secure.DATASET_PATH_BASE + 'wikiclean8')

# Now make the rewriter...
model_path = secure.MODEL_PATH_BASE+'word2vec/word2vec'

## UNCOMMENT the below line if it's your first time making this rewriter
# w2v_rewriter = rewriter.Word2VecRewriter(model_path, create=True, corpus=corpus, bigrams=True)
## COMMENT OUT the below line if it's your first time making this rewriter
w2v_rewriter = rewriter.Word2VecRewriter(model_path, create=False)

w2v_rewritten_queries = w2v_rewriter.rewrite(term)
print "Rewrites of '%s' using Word2Vec:" % term
pprint(w2v_rewritten_queries)


Rewrites of 'socialism' using Wikipedia Categories:
['socialism']


Rewrites of 'socialism' using Word2Vec:
[u'communism',
 u'capitalism',
 u'ideology',
 u'fascism',
 u'liberalism',
 u'marxism',
 u'marxist',
 u'laissez faire',
 u'imperialism',
 u'nationalism',
 u'socialism']
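
For intuition about where lists like the one above come from: Word2Vec represents each word as a vector, and similar-word lookups typically rank candidates by the cosine similarity between vectors. The sketch below illustrates that ranking with made-up three-dimensional vectors. It is not how Word2VecRewriter is implemented (its internals aren't shown in this demo), and the names cosine_similarity, most_similar, and toy_vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(term, vectors, topn=2):
    """Rank every other word by cosine similarity to `term`'s vector."""
    target = vectors[term]
    scored = [(cosine_similarity(target, vec), word)
              for word, vec in vectors.items() if word != term]
    scored.sort(reverse=True)
    return [word for _, word in scored[:topn]]


# Made-up vectors for illustration; a trained model learns these from
# text and typically uses hundreds of dimensions.
toy_vectors = {
    "socialism": [0.9, 0.8, 0.1],
    "communism": [0.85, 0.9, 0.05],
    "capitalism": [0.8, 0.5, 0.3],
    "llama": [0.0, 0.1, 0.95],
}
print(most_similar("socialism", toy_vectors))
```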

Putting it together: Query-Rewriting Search Engines

As we've seen, query rewriters convert one search term into a set of semantically similar ones. By searching for the whole set of terms instead of just one, we hope to get more (and more useful) results out of a search engine.

With SearchBetter, you can connect any query rewriter to any search engine and automatically start getting more results.
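
Conceptually, the connection could work along the following lines. The method names set_rewriter, search, and single_search mirror the ones used in this demo, but ToyEngine and TableRewriter are hypothetical stand-ins, and the way results are merged here (concatenate, skip repeats) is an assumption: SearchBetter's actual merging and scoring logic isn't shown in this notebook.

```python
class ToyEngine(object):
    """Stand-in engine: substring search with optional query rewriting."""

    def __init__(self, documents):
        self.documents = documents
        self.rewriter = None

    def set_rewriter(self, rewriter):
        self.rewriter = rewriter

    def single_search(self, term):
        # Search for exactly one term, with no rewriting.
        return [doc for doc in self.documents if term in doc]

    def search(self, term):
        # Without a rewriter, search for the bare term only.
        terms = [term] if self.rewriter is None else self.rewriter.rewrite(term)
        results = []
        for rewritten in terms:
            for doc in self.single_search(rewritten):
                if doc not in results:  # skip documents already found
                    results.append(doc)
        return results


class TableRewriter(object):
    """Hypothetical rewriter that looks up related terms in a fixed table."""

    def rewrite(self, term):
        table = {"artificial intelligence": ["machine learning"]}
        return table.get(term, []) + [term]


toy_engine = ToyEngine([
    "intro to artificial intelligence",
    "machine learning",
    "android basics",
])
bare = toy_engine.search("artificial intelligence")
toy_engine.set_rewriter(TableRewriter())
rewritten = toy_engine.search("artificial intelligence")
print("%d result(s) without rewriting, %d with" % (len(bare), len(rewritten)))
```

Keeping single_search separate from search is what lets any rewriter pair with any engine: the engine only needs to know how to answer one term at a time.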


In [4]:
# Let's plug our two rewriters into the search engine
# to compare the results

# Suppose this is the user's search term
search_term = 'artificial intelligence'

# First, what do we get without rewriting?
udacity_engine.set_rewriter(None)
bare_results = udacity_engine.search(search_term)
print "Without rewriting, %d results for '%s':" % (len(bare_results), search_term)
pprint(bare_results)

print "\n"

# Next, try the Wikipedia rewriter
udacity_engine.set_rewriter(wiki_rewriter)
wiki_rewritten_results = udacity_engine.search(search_term)
print "With Wikipedia Categories rewriting, %d results for '%s':" % (len(wiki_rewritten_results), search_term)
pprint(wiki_rewritten_results)

print "\n"

# Last, try the Word2Vec rewriter
udacity_engine.set_rewriter(w2v_rewriter)
w2v_rewritten_results = udacity_engine.search(search_term)
print "With Word2Vec rewriting, %d results for '%s':" % (len(w2v_rewritten_results), search_term)
pprint(w2v_rewritten_results)


Without rewriting, 5 results for 'artificial intelligence':
[({'slug': u'knowledge-based-ai-cognitive-systems--ud409', 'title': u'Knowledge-Based AI: Cognitive Systems'}, 46.24125226687919),
 ({'slug': u'intro-to-artificial-intelligence--cs271', 'title': u'Intro to Artificial Intelligence'}, 28.374782225574677),
 ({'slug': u'artificial-intelligence-for-robotics--cs373', 'title': u'Artificial Intelligence for Robotics'}, 19.221448099586404),
 ({'slug': u'deep-learning--ud730', 'title': u'Deep Learning'}, 10.734610989407734),
 ({'slug': u'machine-learning--ud262', 'title': u'Machine Learning'}, 7.2160323163384295)]


With Wikipedia Categories rewriting, 5 results for 'artificial intelligence':
[({'slug': u'knowledge-based-ai-cognitive-systems--ud409', 'title': u'Knowledge-Based AI: Cognitive Systems'}, 46.24125226687919),
 ({'slug': u'intro-to-artificial-intelligence--cs271', 'title': u'Intro to Artificial Intelligence'}, 28.374782225574677),
 ({'slug': u'artificial-intelligence-for-robotics--cs373', 'title': u'Artificial Intelligence for Robotics'}, 19.221448099586404),
 ({'slug': u'deep-learning--ud730', 'title': u'Deep Learning'}, 10.734610989407734),
 ({'slug': u'machine-learning--ud262', 'title': u'Machine Learning'}, 7.2160323163384295)]


With Word2Vec rewriting, 19 results for 'artificial intelligence':
[({'slug': u'software-development-process--ud805', 'title': u'Software Development Process'}, 57.44743951652065),
 ({'slug': u'knowledge-based-ai-cognitive-systems--ud409', 'title': u'Knowledge-Based AI: Cognitive Systems'}, 46.24125226687919),
 ({'slug': u'intro-to-computer-science--cs101', 'title': u'Intro to Computer Science'}, 44.07395709156124),
 ({'slug': u'intro-to-theoretical-computer-science--cs313', 'title': u'Intro to Theoretical Computer Science'}, 39.97226939999818),
 ({'slug': u'computer-networking--ud436', 'title': u'Computer Networking'}, 29.08133009497014),
 ({'slug': u'intro-to-java-programming--cs046', 'title': u'Intro to Java Programming'}, 23.91325075613908),
 ({'slug': u'intro-to-artificial-intelligence--cs271', 'title': u'Intro to Artificial Intelligence'}, 16.32023060034226),
 ({'slug': u'introduction-to-computer-vision--ud810', 'title': u'Introduction to Computer Vision'}, 14.949360194906026),
 ({'slug': u'differential-equations-in-action--cs222', 'title': u'Differential Equations in Action'}, 12.321830208889988),
 ({'slug': u'deep-learning--ud730', 'title': u'Deep Learning'}, 10.734610989407734),
 ({'slug': u'technical-interview--ud513', 'title': u'Technical Interview'}, 10.283877682449393),
 ({'slug': u'product-design--ud509', 'title': u'Product Design'}, 9.348356182226443),
 ({'slug': u'machine-learning--ud262', 'title': u'Machine Learning'}, 9.348127550054356),
 ({'slug': u'software-debugging--cs259', 'title': u'Software Debugging'}, 7.909412457685507),
 ({'slug': u'reinforcement-learning--ud600', 'title': u'Reinforcement Learning'}, 7.75350157277364),
 ({'slug': u'intro-to-machine-learning--ud120', 'title': u'Intro to Machine Learning'}, 7.664740407368823),
 ({'slug': u'artificial-intelligence-for-robotics--cs373', 'title': u'Artificial Intelligence for Robotics'}, 7.222525679923923),
 ({'slug': u'developing-android-apps--ud853', 'title': u'Developing Android Apps'}, 5.529735822881141),
 ({'slug': u'data-wrangling-with-mongodb--ud032', 'title': u'Data Wrangling with MongoDB'}, 5.367658549871007)]

Using SearchBetter Yourself

If you want to use SearchBetter's query rewriting and search engine capabilities in your own project, you have two easy options:

  • If you have a raw dataset you want to search (e.g. a CSV or JSON file), make a subclass of search.WhooshSearchEngine. You can find examples in search.py. All you have to specify is how to iterate over the dataset and add its records to the search engine index.
  • If you've already made a search engine or are working with some external API, you can wrap it in a subclass of search.GenericSearchEngine and get access to SearchBetter's query rewriting power with no additional work.

Here's an example of how you'd wrap a custom search engine in a search.GenericSearchEngine to start taking advantage of our query rewriting functionality.


In [5]:
def prebuilt_black_box_search(term):
    # this is an example of a custom, pre-built search engine
    # that you can't change or look inside (i.e. it's a black box)
    strings = [
        "let sleeping dogs lie",
        "raining cats and dogs",
        "bay of pigs",
        "duck duck goose",
        "sacred cow",
        "llama llama llama",
        "poison dart frog"
    ]
    
    # do a simple text search for the term within the corpus
    # of strings
    matching_strings = [s for s in strings if term in s]
    return matching_strings


class SampleSearchEngine(search.GenericSearchEngine):
    """
    A sample search engine that wraps the
    `prebuilt_black_box_search` search function.
    """
    def __init__(self):
        super(SampleSearchEngine, self).__init__()
    
    def single_search(self, term):
        return prebuilt_black_box_search(term)
    

sample_engine = SampleSearchEngine()

# first do a search without rewriting
term = "dog"
print "Before rewriting:"
pprint(sample_engine.search(term))

print "\n"

# now add a rewriter and search again
# you'll find that we get more results!
sample_engine.set_rewriter(w2v_rewriter)
print "After Word2Vec rewriting:"
pprint(sample_engine.search(term))


Before rewriting:
['let sleeping dogs lie', 'raining cats and dogs']


After Word2Vec rewriting:
['raining cats and dogs',
 'sacred cow',
 'bay of pigs',
 'duck duck goose',
 'let sleeping dogs lie',
 'raining cats and dogs']