Text Processing with Python

Packages Discussed:

Other packages:

NLP in Context

The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.

— Ferdinand de Saussure

The State of the Art

  • Academic design for use alongside intelligent agents (AI discipline)
  • Relies on formal models or representations of knowledge & language
  • Models are adapted and augmented through probabilistic methods and machine learning.
  • A small number of algorithms comprise the standard framework.

Required:

  • Domain Knowledge
  • A Corpus in the Domain
  • Methods

The Data Science Pipeline

The NLP Pipeline

Morphology

The study of the forms of things, words in particular.

Consider pluralization for English:

  • Orthographic Rules: puppy → puppies
  • Morphological Rules: goose → geese, fish → fish
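
The difference between the two kinds of rules can be sketched in a few lines of Python. This is a toy illustration rather than a complete pluralizer: the `IRREGULAR` table and the `pluralize` function are hypothetical names introduced here, and the spelling rules cover only a couple of common patterns.

```python
import re

# Morphological (irregular) plurals follow no spelling rule and must be looked up.
# This table is a tiny illustrative sample, not a complete list.
IRREGULAR = {"goose": "geese", "fish": "fish", "child": "children"}

def pluralize(noun):
    """Return a plural form of an English noun using a few toy rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # Orthographic rule: consonant + y becomes -ies (puppy -> puppies)
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    # Orthographic rule: sibilant endings take -es (box -> boxes)
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"
    # Default rule: append -s
    return noun + "s"

for word in ("puppy", "goose", "fish", "box", "day"):
    print(word, "->", pluralize(word))
```

Orthographic rules can be written as patterns over spelling, while morphological exceptions need a lexicon, which is one reason lemmatizers usually depend on a dictionary.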

Major parsing tasks:

  • stemming
  • lemmatization
  • tokenization

Syntax

The study of the rules for the formation of sentences.

Major tasks:

  • chunking
  • parsing
  • feature parsing
  • grammars
  • NGram Models (perplexity)
  • Language generation
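
As a small illustration of the n-gram item above, the sketch below counts bigrams in a sentence. The `tokenize` and `ngrams` helpers are hypothetical names introduced here, the tokenizer is deliberately naive, and the sample sentence is made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n=2):
    """Yield each run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = tokenize("The cat sat on the mat and the cat slept.")
bigram_counts = Counter(ngrams(tokens, 2))

# Show the most frequent bigrams and their counts
print(bigram_counts.most_common(3))
```

Counts like these are the raw material for an n-gram language model; turning them into probabilities, with smoothing, is what makes measures such as perplexity computable.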

Semantics

The study of meaning.

I see what I eat.
I eat what I see.
He poached salmon.

Major tasks:

  • Frame extraction
  • creation of TMRs
  • Question and answer systems

Machine Learning

Solve Clustering Problems:

  • Topic Modeling
  • Language Similarity
  • Document Association (authorship)

Solve Classification Problems:

  • Language Detection
  • Sentiment Analysis
  • Part of Speech Tagging
  • Statistical Parsing
  • Much more

Use word vectors to implement distance-based metrics.
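
A minimal sketch of a distance-based metric, assuming that simple bag-of-words count vectors stand in for whatever vector representation is actually used. The `vectorize` and `cosine_distance` functions and the sample texts are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Map a text to a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # Treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

d1 = vectorize("fried chicken with hot sauce")
d2 = vectorize("fried chicken with honey")
d3 = vectorize("statistical parsing of sentences")

# Documents sharing vocabulary are closer than documents sharing none
print(cosine_distance(d1, d2), cosine_distance(d1, d3))
```

Once documents are vectors, clustering tasks such as topic modeling reduce to grouping points that lie close together under a metric like this one.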

Setup and Dataset

To install the required packages (ideally into a virtual environment), download requirements.txt and run:

$ pip install -r requirements.txt

Alternatively, pip install each dependency as you need it.

Corpus Organization

Preprocessing HTML and XML Documents to Text

Much of the text that we're interested in is available on the web and formatted as either HTML or XML. It's not just web pages, however. Some eReader formats, notably ePub, are actually zip archives containing XHTML. These semi-structured documents contain a lot of information, usually structural in nature. However, we want to get to the main body of the content we're looking for, disregarding other content that might be included, such as navigation headers, sidebars, ads and other extraneous material.
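
To see the "zip file containing XHTML" structure for yourself, the standard library's zipfile module is enough. The sketch below is illustrative: `list_xhtml_members` is a hypothetical helper, and the archive is a tiny stand-in built in memory rather than a real eBook, so the member names are assumptions.

```python
import io
import zipfile

def list_xhtml_members(archive):
    """Return the names of the (X)HTML files inside a zip-based eBook."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist()
                if name.lower().endswith((".xhtml", ".html", ".htm"))]

# Build a tiny stand-in archive in memory so the sketch runs without a real eBook.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
    zf.writestr("OEBPS/style.css", "p { margin: 0 }")

buffer.seek(0)
print(list_xhtml_members(buffer))
```

Each member returned this way can be read with `ZipFile.read` and handed to the same HTML tools used for web pages.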

On the web, there are several services that provide web pages in a "readable" fashion, like Instapaper and Clearly. Some browsers even come with a clutter- and distraction-free "reading mode" that seems to give us exactly the content we're looking for. An option that I've used in the past is to programmatically access these renderers; Instapaper even provides an API. However, for large corpora, we need to perform extraction quickly and repeatably, while maintaining the original documents.

Corpus management requires that the original documents be stored alongside the preprocessed documents; do not make changes to the originals in place! See discussions of data lakes and data pipelines for more on ingesting to WORM (write once, read many) storage.
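
One way to keep originals and preprocessed documents side by side is sketched below. The `raw/` and `processed/` directory names and the `ingest` function are assumptions made for illustration, not a layout prescribed by any library.

```python
import os
import shutil
import tempfile

def ingest(source_path, corpus_root):
    """Copy a fetched document into raw/ and return (raw_path, text_path),
    where text_path is reserved for its preprocessed counterpart in processed/.
    The original is copied, never modified, and never overwritten."""
    raw_dir = os.path.join(corpus_root, "raw")
    processed_dir = os.path.join(corpus_root, "processed")
    for directory in (raw_dir, processed_dir):
        os.makedirs(directory, exist_ok=True)

    name = os.path.basename(source_path)
    raw_path = os.path.join(raw_dir, name)
    if os.path.exists(raw_path):
        raise IOError("refusing to overwrite original: %s" % raw_path)
    shutil.copy2(source_path, raw_path)

    stem, _ = os.path.splitext(name)
    return raw_path, os.path.join(processed_dir, stem + ".txt")

# Demonstrate with a throwaway file in a temporary directory
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "article.html")
with open(source, "w") as f:
    f.write("<html><body><p>Example</p></body></html>")

raw_path, text_path = ingest(source, os.path.join(workdir, "corpus"))
print(raw_path)
print(text_path)
```

Refusing to overwrite an existing original is a small, software-level approximation of write-once storage; real WORM guarantees come from the storage layer itself.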

In Python, the fastest way to process HTML and XML text is with the lxml library, a very fast XML parser that binds the C libraries libxml2 and libxslt. However, the lxml API is a bit tricky, so we will instead use friendlier wrappers: readability-lxml and BeautifulSoup.

For example, consider the following code to fetch an HTML web article from The Washington Post:


In [11]:
import codecs
import requests

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2
from contextlib import closing

chunk_size = 10**6  # Download 1 MB at a time.
wpurl = "http://wpo.st/"  # Washington Post provides short links

def fetch_webpage(url, path):
    # Open up a stream request (to download large documents)
    # Ensure that we will close when complete using contextlib
    with closing(requests.get(url, stream=True)) as response:

        # Check that the response was successful
        if response.status_code == 200:

            # Fall back to UTF-8 if the server did not declare an encoding
            response.encoding = response.encoding or 'utf-8'

            # Write each chunk to disk with the correct encoding
            with codecs.open(path, 'w', response.encoding) as f:
                for chunk in response.iter_content(chunk_size, decode_unicode=True):
                    f.write(chunk)

def fetch_wp_article(article_id):
    path = "%s.html" % article_id
    url  = urljoin(wpurl, article_id)
    return fetch_webpage(url, path)

In [ ]:
fetch_webpage("http://www.koreadaily.com/news/read.asp?art_id=3283896", "korean.html")

In [15]:
fetch_wp_article("nrRB0")

In [ ]:
fetch_wp_article("uyRB0")

BeautifulSoup allows us to search the DOM to extract particular elements. For example, to load our document and find all the <p> tags, we would do the following:


In [20]:
import bs4

def get_soup(path):
    with open(path, 'r') as f:
        return bs4.BeautifulSoup(f, "lxml") # Note the use of the lxml parser

for p in get_soup("nrRB0.html").find_all('p'):
    print(p)


<p class="site-attribution"> <a href="//www.washingtonpost.com"><strong>washingtonpost.com</strong></a> <span class="copyright">© 1996-2015 The Washington Post</span> </p>
<p><a href="//www.washingtonpost.com/actmgmt/help/">Help and Contact Us</a></p>
<p><a href="//www.washingtonpost.com/terms-of-service/2011/11/18/gIQAldiYiN_story.html">Terms of Service</a></p>
<p><a href="//www.washingtonpost.com/privacy-policy/2011/11/18/gIQASIiaiN_story.html">Privacy Policy</a></p>
<p><a href="//www.washingtonpost.com/discussion-and-submission-guidelines/2011/11/21/gIQAuvIbiN_story.html">Submissions and Discussion Policy</a></p>
<p><a href="//www.washingtonpost.com/rss-terms-of-service/2012/01/16/gIQAadFYAQ_story.html">RSS Terms of Service</a></p>
<p><a href="//www.washingtonpost.com/how-can-i-opt-out-of-online-advertising-cookies/2011/11/18/gIQABECbiN_story.html">Ad Choices</a></p>
<p id="U9001680477708UJG"> <i id="U9001274173114UPG">It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible. </i> </p>
<p id="U9001274173114EdC"></p>
<p id="U9001274173114AOB">Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at <b>the Partisan</b>. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes. Why both? “Everything is better once it’s fried in beef fat,” Anda said. We have to agree. Whether white or dark, the meat is succulent throughout. The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially. The sound of it shattering under the knife was music to our ears. And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce. The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations.</p>
<p id="U9001274173114XlF"> <i>The Partisan, 709 D St. NW. 202-524-5322. <a class="showlink" href="http://www.thepartisandc.com">www.thepartisandc.com</a>. </i> </p>
<p><strong>— Becky Krystal</strong></p>
<p id="U90016804777081aH"> <i>[<a href="http://www.washingtonpost.com/goingoutguide/guilt-free-fried-chicken-sandwiches-any-day-of-the-week/2015/04/01/578efefe-d4a2-11e4-ab77-9646eea6a4c7_story.html" title="www.washingtonpost.com">In a love/hate relationship with Chick-fil-A? Here are some alternatives</a>]</i> </p>
<p id="U9001274173114AdG">When <a href="http://www.washingtonpost.com/lifestyle/food/bryan-voltaggio-from-a-teenager-amok-to-top-chef-masters/2013/07/22/682d35f8-ef04-11e2-bed3-b9b6fe264871_story.html">Bryan Voltaggio</a> started planning the menu for <b>Family Meal</b>, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken. “It was one of our favorite things,” he says. “It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers. That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu. The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch. After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh. You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist? </p>
<p id="U9001274173114THE"> <i>Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. <a class="showlink" href="http://www.voltfamilymeal.com">www.voltfamilymeal.com</a>. </i> </p>
<p><strong>— John Taylor</strong></p>
<p id="U9001680477708LnG"> <i></i> </p>
<p id="U90016804777082MC"> <i>[<a href="http://www.washingtonpost.com/sf/style/2015/03/12/d-c-s-most-essential-dishes-of-2015/">40 Eats: D.C.’s most essential dishes of 2015</a>]</i> </p>
<p id="U90012741731140AH">Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, karaage chicken — like most of the country’s food — is held to an extremely high standard. “It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns <b>Izakaya Seki</b> on V Street NW. “Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken. Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil. Izakaya Seki’s version sticks closely to the formula. Probably. “I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved. The result is a thin, tender coating that’s slightly softer than tempura. The accompanying ponzu sauce lends a tartness to the nubs.</p>
<p id="U900127417311410F"> <i>Izakaya Seki, 1117 V St. NW. 202-588-5841. <a class="showlink" href="http://www.sekidc.com">www.sekidc.com</a>. </i> </p>
<p><strong>— Holley Simmons</strong></p>
<p id="U9001274173114ewH">Don’t waste your kimchi-stinking breath asking for more sauce at <b>BonChon</b>. The <a href="http://www.washingtonpost.com/goingoutguide/the-20-diner-the-zen-of-bonchon-chicken/2013/07/31/b879060c-f582-11e2-aa2e-4088616498b4_story.html">South Korean fried chicken chain</a>, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications. And why would you want to change anything, really? The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon. Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece. True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines. Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer. Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard.</p>
<p id="U9001274173114bBI"> <i>BonChon, 1015 Half St. SE and nine other locations in Maryland and Virginia. <a class="showlink" href="http://www.bonchon.com">www.bonchon.com</a>. </i> </p>
<p><strong>— Holley Simmons</strong></p>
<p id="U9001274173114NgD">There’s not much agreement on what constitutes Maryland fried chicken. Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak. The pan-fried chicken platter at <b> <br/>Crisfield Seafood</b> is a perfect example of the former style. Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan. This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on. (The chicken is available only Friday through Sunday, and frequently sells out.) The Chesapeake fried chicken at <b>Hank’s Oyster Bar</b> in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy. It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday. </p>
<p id="U9001274173114wEE"> <i>Crisfield Seafood, 8012 Georgia Ave., Silver Spring. 301-589-1306. <a class="showlink" href="http://www.crisfieldseafood.com">www.crisfieldseafood.com</a>. Hank’s Oyster Bar, 1624 Q St. NW. 202-462-4265; 633 Pennsylvania Ave. SE. 202-733-1971. <a class="showlink" href="http://www.hanksoysterbar.com">www.hanksoysterbar.com</a>.</i> </p>
<p><strong>— Fritz Hahn</strong></p>
<p id="U9001680477708jhF"></p>
<p id="U9001680477708ZdH"> <i>[<a href="http://www.washingtonpost.com/goingoutguide/pizza-in-washington-an-upper-crust-tour-of-every-dc-style/2014/09/04/f0482f42-2ef5-11e4-994d-202962a9150c_story.html">Pizza in Washington: An upper-crust tour of every D.C. style</a> <i>]</i></i></p>
<p id="U9001680477708VFH"></p>
<p id="U9001274173114bsG">Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant. It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam. But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on <b>Central Michel Richard</b>’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever. Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste. It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!) French. Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken. It is, after all, that kind of a place. </p>
<p id="U9001274173114p0D"> <i>Central Michel Richard, 1001 Pennsylvania Ave. NW. 202-626-0015. <a class="showlink" href="http://www.centralmichelrichard.com">www.centralmichelrichard.com</a> <i>. </i> </i> </p>
<p> </p>
<p><strong>— Maura Judkis</strong></p>
<p id="U900127417311467G"></p>
<p id="U9001274173114o0G">If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women. Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot. Decades later, chefs are latching onto this addictive form of punishment. Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of <b>Reserve 2216</b> in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville. He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it. Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles. He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back. But not too hard. This is Alexandria, after all. </p>
<p id="U900127417311481"> <i>Reserve 2216, 2216 Mount Vernon Ave., Alexandria. 703-549-2889. <a class="showlink" href="http://www.drpreserve.com">www.drpreserve.com</a>.</i> </p>
<p><strong>— Tim Carman</strong></p>
<p id="U9001274173114HvD"></p>
<p id="U90012741731145tH">The sole virtue of most fast-food operations is consistency. Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same. The menu at <b>Popeyes</b> follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal). Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion. No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice. The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue. No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain. Once, I got home to discover a clerk had forgotten to pack bread in my bag. I almost cried. Instead, I consoled myself with another piece of chicken. </p>
<p id="U9001274173114EOF"> <i>Popeyes has locations throughout the D.C. metro area. <a class="showlink" href="http://www.popeyes.com">www.popeyes.com</a>.</i> </p>
<p><strong>— Tom Sietsema</strong></p>
<p id="U9001274173114JiB"></p>
<p id="U9001274173114mFG">The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity. Leave it to Rob Sonderman, pitmaster and co-owner of <b>DCity Smokehouse</b>, to bring dignity back to the bite. His Den-Den — named for co-creator and pitmaster-in-training <br/>Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey. Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer. Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch. Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce). No matter. You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat.</p>
<p id="U9001274173114vPC"> <i>DCity Smokehouse, 8 Florida Ave. NW. 202-733-1919. <a class="showlink" href="http://www.dcitysmokehouse.com">www.dcitysmokehouse.com</a>. </i> </p>
<p><strong>— Tim Carman</strong></p>
<p id="U9001274173114L3B"></p>
<p id="U9001274173114yyH">Hearty is the appetite that can handle <b>Oohh’s and Aahh’s</b> chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers. He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating. Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother.</p>
<p id="U9001274173114gCF"> <i>Oohh’s and Aahh’s, 1005 U St. NW. 202-667-7142. <a class="showlink" href="http://www.oohhsnaahhs.com">www.oohhsnaahhs.com</a>.</i> </p>
<p><strong>— Bonnie S. Benwick</strong></p>
<p id="U9001274173114ZvF"></p>
<p id="U9001274173114wdH">It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy <b>Pop’s Sea Bar</b> in Adams Morgan. The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order. A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99). Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite.</p>
<p id="U9001274173114ycC"> <i>Pop’s Sea Bar, 1817 Columbia Rd. NW. 202-534-3933. <a class="showlink" href="http://www.popsseabar.com">www.popsseabar.com</a>.</i> </p>
<p><strong>— Bonnie S. Benwick</strong></p>
<p id="U9001274173114ZfD"></p>
<p id="U9001274173114RpD">Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at <b>GBD</b> are a fine meal for adults and children alike. Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice. But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on <a href="http://www.washingtonpost.com/lifestyle/food/mumbo-sauce-gets-gentrified/2013/07/08/b1011ade-cc67-11e2-9f1a-1a7cdee20287_story.html">D.C.’s own mumbo sauce</a>, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter. Ask for the $5.50 Saucetown option to try all nine. </p>
<p id="U9001274173114bCE"> <i>GBD, 1323 Connecticut Ave. NW. 202-524-5210. <a class="showlink" href="http://www.gbdchickendoughnuts.com/">www.gbdchickendoughnuts.com</a>.</i> </p>
<p><strong>— Margaret Ely</strong></p>
<p id="U9001274173114xGE"></p>
<p id="U90012741731149NI">Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast. But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken. And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones. And that is when you grab one of the bar stools at R.J. Cooper’s <b>Gypsy Soul</b> in Fairfax’s Mosaic District. Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic. The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own. </p>
<p id="U9001274173114vUB"> <i>Gypsy Soul, 8296 Glass Alley, Fairfax. 703-992-0933. <a class="showlink" href="http://www.gypsysoul-va.com">www.gypsysoul-va.com</a>.</i> </p>
<p><strong>— Fritz Hahn</strong></p>
<p id="U9001680477708VV"> <b>Related items:</b> </p>
<p id="U90016804777085vD"> <a href="http://www.washingtonpost.com/sf/style/2015/03/12/d-c-s-most-essential-dishes-of-2015/" title="www.washingtonpost.com">D.C.’s most essential dishes of 2015</a> </p>
<p id="U9001680477708DRG"> <a href="http://www.washingtonpost.com/goingoutguide/pizza-in-washington-an-upper-crust-tour-of-every-dc-style/2014/09/04/f0482f42-2ef5-11e4-994d-202962a9150c_story.html" title="www.washingtonpost.com">Pizza in Washington: An upper crust tour of every D.C. style</a> </p>
<p id="U9001680477708cFG"></p>
<p id="U9001680477708O2C"></p>
<p id="section-instream">goingoutguide</p>
<p id="subsection-instream"></p>
<p id="blogname-instream"></p>
<p class="headline" id="headline-instream"><i class="fa fa-check checked-icon"></i></p>
<p class="title" id="tagline-instream"></p>
<p class="title" id="confirmation-instream"> <span>Success!</span> Check your inbox for details. <span class="might-like"> You might also like:</span> </p>
<p class="error-msg-inStream">Please enter a valid email address</p>
<p class="suggestion-title"></p>
<p class="suggestion-title"></p>
<p class="suggestion-title"></p>
<p class="title" id="all-newsletters-inStream"> <a href="https://subscribe.washingtonpost.com/newsletters">See all newsletters</a> </p>
<p class="title">SuperFan Badge</p>
<p>SuperFan badge holders consistently post smart, timely comments about Washington area sports and teams.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Culture Connoisseur Badge</p>
<p>Culture Connoisseurs consistently offer thought-provoking, timely comments on the arts, lifestyle and entertainment.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Fact Checker Badge</p>
<p>Fact Checkers contribute questions, information and facts to <a href="//www.washingtonpost.com/blogs/fact-checker" target="_badgeinfo">The Fact Checker</a>.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Washingtologist Badge</p>
<p>Washingtologists consistently post thought-provoking, timely comments on events, communities, and trends in the Washington area.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Post Writer Badge</p>
<p>This commenter is a Washington Post editor, reporter or producer.</p>
<p class="title">Post Forum Badge</p>
<p>Post Forum members consistently offer thought-provoking, timely comments on politics, national and international affairs.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Weather Watcher Badge</p>
<p>Weather Watchers consistently offer thought-provoking, timely comments on climates and forecasts.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">World Watcher Badge</p>
<p>World Watchers consistently offer thought-provoking, timely comments on international affairs.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Post Contributor Badge</p>
<p>This commenter is a Washington Post contributor. Post contributors aren’t staff, but may write articles or columns. In some cases, contributors are sources or experts quoted in a story.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Post Recommended</p>
<p>Washington Post reporters or editors recommend this comment or reader post.</p>
<p>You must be logged in to report a comment.</p>
<p>You must be logged in to recommend a comment.</p>
<p>Comments our editors find particularly useful or relevant are displayed in <strong>Top Comments</strong>, as are comments by users with these badges: <strong><span class="badge-list"></span></strong>. Replies to those posts appear here, as well as posts by staff writers.</p>
<p>All comments are posted in the <strong>All Comments</strong> tab.</p>
<p>To pause and restart automatic updates, click "Live" or "Paused". If paused, you'll be notified of the number of additional comments that have come in.</p>
<p id="newsletter-section">goingoutguide</p>
<p id="newsletter-subsection"></p>
<p id="newsletter-blogname"></p>
<p class="headline" id="newsletter-headline"><i class="fa fa-check" id="headline-checked"></i></p>
<p class="title" id="newsletter-tagline"></p>
<p class="title" id="subscribed-confirmation"><span>Success!</span> Check your inbox for details.</p>
<p class="newsLetter-error-msg"> Please enter a valid email address </p>
<p class="title">You might also like: </p>
<p class="title "></p>
<p class="title \"></p>
<p class="title"></p>
<p class="title" id="all-newsletters-lbl"><a href="https://subscribe.washingtonpost.com/newsletters">See all newsletters</a></p>
<p class="subscribe-headline">Every story. Every feature. Every insight.</p>
<p class="subscribe-tagline">Yours for as low as JUST 99¢!</p>
<p class="label">Not Now</p>
<p id="newsletter-banner-section">goingoutguide</p>
<p id="newsletter-banner-subsection"></p>
<p id="newsletter-banner-blogname"></p>
<p class="signup-headline" id="newsletter-headline-banner"> <i class="fa fa-check confirmation"></i> </p>
<p class="signup-tagline" id="newsletter-tagline-banner"></p>
<p class="title confirmation subscribed-confirmation"><span>Success!</span> Check your inbox for details.</p>
<p class="title confirmation all-newsletters"> <a href="https://subscribe.washingtonpost.com/newsletters">See all newsletters</a> </p>
<p class="newsLetter-error-msg-banner">Incorrect email</p>
<p class="label">Not Now</p>

To print only the text of each paragraph, without the surrounding tags, do the following:


In [16]:
for p in get_soup("nrRB0.html").find_all('p'):
    print(p.text + "\n")



While this allows us to easily traverse the DOM and find specific elements by their id, class, or element type, we still have a lot of cruft in the document. This is where readability-lxml comes in. This library is a Python port of a Ruby port of the Readability project, which was inspired by Instapaper. It follows the approach of readability.js, along with some other helper functions, to extract the main body and even the title of the document you're working with.


In [17]:
from readability.readability import Document

def get_paper(path):
    with codecs.open(path, 'r', encoding='utf-8') as f:
        return Document(f.read())

paper = get_paper("nrRB0.html")
print(paper.title())


A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post

In [13]:
with codecs.open("nrRB0-clean.html", "w", encoding='utf-8') as f:
    f.write(paper.summary())



Combine readability and BeautifulSoup as follows:


In [7]:
import bs4

def get_text(path):
    # Extract the main article with readability, then strip the tags
    with open(path, 'r') as f:
        paper = Document(f.read())
    soup = bs4.BeautifulSoup(paper.summary(), 'lxml')

    # Put the title first, followed by the text of each paragraph
    output = [paper.title()]
    for p in soup.find_all('p'):
        output.append(p.text)
    return "\n\n".join(output)

In [9]:
print get_text("nrRB0.html")


A note on binary formats

In order to transform PDF documents to XML, the best solution is currently PDFMiner, specifically its pdf2txt.py tool. Note that this tool can output multiple formats, such as XML or HTML, which is often better than the direct text export. Because of this, it's often useful to convert PDF to XHTML and then use Readability or BeautifulSoup to extract the text from the document.
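
As a rough sketch of the conversion step, the helper below builds and runs a pdf2txt.py command from Python. The function names are my own, and it assumes that pdf2txt.py is on your PATH and that `-t` selects the output type and `-o` the output file, so check the options of your installed version before relying on it.

```python
import subprocess

# Output types that pdf2txt.py is assumed to accept for the -t option
FORMATS = ("text", "html", "xml", "tag")

def pdf2txt_command(pdf_path, out_path, fmt="html"):
    """Builds the argument list for PDFMiner's pdf2txt.py script."""
    if fmt not in FORMATS:
        raise ValueError("unsupported output format: %s" % fmt)
    return ["pdf2txt.py", "-t", fmt, "-o", out_path, pdf_path]

def convert_pdf(pdf_path, out_path, fmt="html"):
    """Runs the conversion and raises CalledProcessError if the tool fails."""
    subprocess.check_call(pdf2txt_command(pdf_path, out_path, fmt))
```

The resulting HTML file can then be passed to a function like get_text above.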

Unfortunately, the conversion from PDF to text is often not great, though statistical methodologies can help ease some of the errors in transformation. If PDFMiner is not sufficient, you can use tools like PyPDF2 to work directly on the PDF file, or write Python code to wrap tools written in Java or C, such as PDFBox.

Older binary formats, such as pre-2007 Microsoft Word documents (.doc), require special tools. Again, the best bet is to use Python to call another command-line tool like antiword. Newer Microsoft formats (.docx) are actually zipped XML files; they can either be unzipped and handled with the XML tools mentioned above or read with Python packages like python-docx and python-excel.
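
To make the "zipped XML" point concrete, here is a minimal sketch that pulls paragraph text out of a .docx file using only the standard library. It assumes the main document part lives at word/document.xml and uses the usual WordprocessingML namespace; the function name is my own, and a package like python-docx handles far more cases (tables, headers, styles).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace assumed for WordprocessingML paragraph (w:p) and text (w:t) elements
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """
    Returns the text of each non-empty paragraph in a .docx file. The
    source can be a path or a file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of a paragraph is spread over one or more w:t runs
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Joining the result with blank lines gives the same kind of plain text that get_text produces for HTML.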

Pattern

The pattern library by the CLiPS lab at the University of Antwerp is designed specifically for language processing of web data. It contains a toolkit for fetching data via web APIs (Google, Gmail, Bing, Twitter, Facebook, Wikipedia, and more), supports HTML DOM parsing, and even includes a web crawler!

For example, to ingest Twitter data:


In [1]:
from pattern.web import Twitter, plaintext

In [5]:
twitter = Twitter(language='en')
for tweet in twitter.search("#DataDC", cached=False):
    print tweet.text


Want to hack data with others on July 30th? Go to Data Owls! http://t.co/MDnmaan5Kh (Special project on energy data, too!) #datadc
RT @DataCommunityDC: Tonight's meetup: Using PMML, Python, R, and SAS for Production Analytics http://t.co/Qxi57pCLsf #DataDC #meetup #data…
RT @DataCommunityDC: Tonight's meetup: Using PMML, Python, R, and SAS for Production Analytics http://t.co/Qxi57pCLsf #DataDC #meetup #data…
RT @DataCommunityDC: Tonight's meetup: Using PMML, Python, R, and SAS for Production Analytics http://t.co/Qxi57pCLsf #DataDC #meetup #data…
Tonight's meetup: Using PMML, Python, R, and SAS for Production Analytics http://t.co/Qxi57pCLsf #DataDC #meetup #datascience
RT @HarlanH: DC Business Data Demonstration Project — Medium @katmeresin http://t.co/zYzBSYueqn #datadc
DC Business Data Demonstration Project — Medium @katmeresin http://t.co/zYzBSYueqn #datadc
RT @DistrictDataLab: Still some seats left for our NLP with NLTK workshop this Saturday! http://t.co/KlW3wzjpT8 #DataScience #NLProc #DataD…
RT @DistrictDataLab: Still some seats left for our NLP with NLTK workshop this Saturday! http://t.co/KlW3wzjpT8 #DataScience #NLProc #DataD…
RT @DistrictDataLab: Still some seats left for our NLP with NLTK workshop this Saturday! http://t.co/KlW3wzjpT8 #DataScience #NLProc #DataD…

Pattern also contains an NLP toolkit for English in the pattern.en module that utilizes statistical approaches and regular expressions. Other supported languages include Spanish, French, Italian, German, and Dutch.

The pattern parser identifies word classes (part-of-speech tagging) and performs morphological inflection analysis, and the module includes a WordNet API for lemmatization.


In [68]:
from pattern.en import parse, parsetree

s = "The man hit the building with a baseball bat."
print parse(s, relations=True, lemmata=True)
print
for clause in parsetree(s):
    for chunk in clause.chunks:
        for word in chunk.words:
            print word,
        print


The/DT/B-NP/O/NP-SBJ-1/the man/NN/I-NP/O/NP-SBJ-1/man hit/VBD/B-VP/O/VP-1/hit the/DT/O/O/O/the building/VBG/B-VP/O/O/build with/IN/B-PP/B-PNP/O/with a/DT/B-NP/I-PNP/O/a baseball/NN/I-NP/I-PNP/O/baseball bat/NN/I-NP/I-PNP/O/bat ././O/O/O/.

Word(u'The/DT') Word(u'man/NN')
Word(u'hit/VBD')
Word(u'building/VBG')
Word(u'with/IN')
Word(u'a/DT') Word(u'baseball/NN') Word(u'bat/NN')
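
The slash-delimited string printed by parse can be hard to read. As a small sketch, and assuming the field order is word, part-of-speech tag, chunk tag, PNP tag, relation, and lemma (my reading of the output above), it can be split into named fields with the standard library. The Token tuple and read_tagged function are illustrative names, not part of pattern.

```python
from collections import namedtuple

# Field order assumed for parse(s, relations=True, lemmata=True)
Token = namedtuple("Token", "word pos chunk pnp relation lemma")

def read_tagged(tagged):
    """Splits a slash-delimited tagged string into Token tuples."""
    tokens = []
    for item in tagged.split():
        # Split from the right so that exactly six fields are produced
        fields = item.rsplit("/", 5)
        if len(fields) == 6:
            tokens.append(Token(*fields))
    return tokens

tagged = ("The/DT/B-NP/O/NP-SBJ-1/the man/NN/I-NP/O/NP-SBJ-1/man "
          "hit/VBD/B-VP/O/VP-1/hit")
for token in read_tagged(tagged):
    print("%-4s %-4s %-5s %s" % (token.word, token.pos, token.chunk, token.lemma))
```

Reading the fields this way also makes tagging mistakes easier to spot: in the output above, "building" is tagged VBG and lemmatized to "build", so the parser treated it as a verb.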

The pattern.search module allows you to retrieve N-grams from text based on phrasal patterns and can be used to mine dependencies from text, e.g.:


In [79]:
from pattern.search import search

s = "The man hit the building with a baseball bat."
pt = parsetree(s, relations=True, lemmata=True)
for match in search('NP VP', pt):
    print match


Match(words=[Word(u'The/DT'), Word(u'man/NN'), Word(u'hit/VBD')])

Lastly, the pattern.vector module has a toolkit for distance-based machine learning on bag-of-words models, including clustering (K-Means, hierarchical clustering) and classification.
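
To illustrate what "distance-based" means for a bag-of-words model, here is a minimal sketch of cosine similarity between two documents using only the standard library. It does not use pattern.vector, and the whitespace tokenization and function names are simplifying assumptions of mine.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Counts lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = bag_of_words("fried chicken with hot sauce")
doc2 = bag_of_words("grilled chicken with garlic sauce")
doc3 = bag_of_words("python text processing")
print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))
```

Clustering and nearest-neighbor classification can both be built on a similarity (or distance) function of this kind.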

NLTK

Suite of libraries for a variety of academic text processing tasks:

tokenization, stemming, tagging,
chunking, parsing, classification,
language modeling, logical semantics

Pedagogical resources for teaching NLP theory in Python ...

  • Python interface to over 50 corpora and lexical resources
  • Focus on Machine Learning with specific domain knowledge
  • Free and Open Source
  • Numpy and Scipy under the hood
  • Fast and Formal

What is NLTK not?

  • Production ready out of the box*
  • Lightweight
  • Generally applicable
  • Magic

* There are actually a few things that are production ready right out of the box.

The Good Parts:

  • Preprocessing
    • segmentation
    • tokenization
    • PoS tagging
  • Word level processing
    • WordNet
    • Lemmatization
    • Stemming
    • NGram
  • Utilities
    • Tree
    • FreqDist
    • ConditionalFreqDist
  • Streaming CorpusReader objects
  • Classification
    • Maximum Entropy (Megam Algorithm)
    • Naive Bayes
    • Decision Tree
  • Chunking, Named Entity Recognition
  • Parsers Galore!

The Bad Parts:

  • Syntactic Parsing
    • No included grammar (not a black box)
  • Feature/Dependency Parsing
    • No included feature grammar
  • The sem package
    • Toy only (lambda-calculus & first order logic)
  • Lots of extra stuff
    • papers, chat programs, alignments, etc.

In [87]:
import nltk

text = get_text("nrRB0.html")
for idx, s in enumerate(nltk.sent_tokenize(text)): # Segmentation
    words = nltk.wordpunct_tokenize(s)  # Tokenization
    tags  = nltk.pos_tag(words)    # Part of Speech tagging
    print tags
    print
    if idx > 5:
        break


[(u'A', 'DT'), (u'crisp', 'NN'), (u'and', 'CC'), (u'juicy', 'NN'), (u'bucket', 'NN'), (u'list', 'NN'), (u'of', 'IN'), (u'D', 'NNP'), (u'.', '.'), (u'C', 'NNP'), (u'.\u2019', 'NNP'), (u's', 'VBZ'), (u'best', 'JJS'), (u'fried', 'VBN'), (u'chicken', 'NN'), (u'-', ':'), (u'The', 'DT'), (u'Washington', 'NNP'), (u'Post', 'NNP'), (u'It', 'NNP'), (u'\u2019', 'NNP'), (u's', 'VBZ'), (u'lowbrow', 'NN'), (u'.', '.')]

[(u'It', 'PRP'), (u'\u2019', 'VBP'), (u's', 'NNS'), (u'messy', 'JJ'), (u'.', '.')]

[(u'It', 'PRP'), (u'could', 'MD'), (u'never', 'RB'), (u'be', 'VB'), (u'accused', 'VBN'), (u'of', 'IN'), (u'being', 'VBG'), (u'healthful', 'JJ'), (u'.', '.')]

[(u'But', 'CC'), (u'we', 'PRP'), (u'\u2019', 'VBP'), (u'd', 'VBN'), (u'never', 'RB'), (u'let', 'VB'), (u'those', 'DT'), (u'formalities', 'NNS'), (u'get', 'VBP'), (u'between', 'IN'), (u'us', 'PRP'), (u'and', 'CC'), (u'an', 'DT'), (u'order', 'NN'), (u'of', 'IN'), (u'crispy', 'NN'), (u',', ','), (u'crackly', 'RB'), (u',', ','), (u'delicious', 'JJ'), (u'fried', 'JJ'), (u'chicken', 'NN'), (u'.', '.')]

[(u'Whether', 'IN'), (u'it', 'PRP'), (u'comes', 'VBZ'), (u'in', 'IN'), (u'a', 'DT'), (u'bucket', 'NN'), (u'or', 'CC'), (u'on', 'IN'), (u'a', 'DT'), (u'bun', 'NN'), (u',', ','), (u'or', 'CC'), (u'you', 'PRP'), (u'eat', 'VBP'), (u'it', 'PRP'), (u'with', 'IN'), (u'your', 'PRP$'), (u'fingers', 'NNS'), (u'or', 'CC'), (u'chopsticks', 'NNS'), (u',', ','), (u'there', 'EX'), (u'\u2019', ':'), (u's', 'NNS'), (u'a', 'DT'), (u'surprising', 'JJ'), (u'variety', 'NN'), (u'to', 'TO'), (u'the', 'DT'), (u'Washington', 'NNP'), (u'area', 'NN'), (u'\u2019', ':'), (u's', 'NNS'), (u'fried', 'VBD'), (u'chicken', 'VBN'), (u'offerings', 'NNS'), (u'.', '.')]

[(u'Here', 'RB'), (u'are', 'VBP'), (u'some', 'DT'), (u'of', 'IN'), (u'the', 'DT'), (u'most', 'RBS'), (u'irresistible', 'JJ'), (u'.', '.')]

[(u'\u2018', 'NN'), (u'Rotissi', 'NNP'), (u'-', ':'), (u'fried', 'VBD'), (u'\u2019', 'CD'), (u'chicken', 'VBN'), (u'at', 'IN'), (u'the', 'DT'), (u'Partisan', 'NNP'), (u'Forget', 'NNP'), (u'the', 'DT'), (u'cronut', 'NN'), (u'.', '.')]
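
Notice how contractions such as "It’s" come out as three tokens in the tagged output above, which then confuses the tagger. As a sketch of why, the pattern below is the regular expression I understand wordpunct_tokenize to be built on (runs of word characters, or runs of other non-space characters); treat that as an assumption. The second pattern is a hypothetical variant of my own that keeps apostrophes inside words.

```python
import re

# Pattern assumed to match nltk.wordpunct_tokenize
WORDPUNCT = re.compile(u"\\w+|[^\\w\\s]+", re.UNICODE)

# Hypothetical variant: allow ' or \u2019 between word characters
CONTRACTIONS = re.compile(u"\\w+(?:['\u2019]\\w+)*|[^\\w\\s]+", re.UNICODE)

def wordpunct(text):
    """Splits on word/punctuation boundaries, breaking contractions apart."""
    return WORDPUNCT.findall(text)

def keep_contractions(text):
    """Same idea, but keeps contractions such as It\u2019s as one token."""
    return CONTRACTIONS.findall(text)

sentence = u"It\u2019s messy."
print(len(wordpunct(sentence)))
print(len(keep_contractions(sentence)))
```

Whether keeping contractions whole helps depends on the tagger, since many taggers expect them to be split in a specific way (for example "It" and "’s").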


In [90]:
from nltk import FreqDist
from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text  = get_text("nrRB0.html")
vocab = FreqDist()
words = FreqDist()
for s in nltk.sent_tokenize(text): 
    for word in nltk.wordpunct_tokenize(s):
        words[word] += 1
        lemma = lemmatizer.lemmatize(word)
        vocab[lemma] += 1

print words
print vocab


<FreqDist with 1072 samples and 3084 outcomes>
<FreqDist with 1032 samples and 3084 outcomes>

The first thing you needed to do was create a corpus reader that could read the RSS feeds and their topics by extending one of the built-in corpus readers:


In [94]:
import os
import nltk
import time
import random
import pickle
import string

from bs4 import BeautifulSoup
from nltk.corpus import CategorizedPlaintextCorpusReader

# The first group captures the category folder, docs are any HTML file.
CORPUS_ROOT = './corpus'
DOC_PATTERN = r'(?!\.).*\.html'
CAT_PATTERN = r'([a-z_]+)/.*'

# Specialized Corpus Reader for HTML documents
class CategorizedHTMLCorpusreader(CategorizedPlaintextCorpusReader):
    """
    Reads only the HTML body for the words and strips any tags.
    """

    def _read_word_block(self, stream):
        soup = BeautifulSoup(stream, 'lxml')
        return self._word_tokenizer.tokenize(soup.get_text())

    def _read_para_block(self, stream):
        soup  = BeautifulSoup(stream, 'lxml')
        paras = []

        # Use the text of each <p> element; fall back to plain-text blocks
        if soup.find('p'):
            piter = (p.get_text() for p in soup.find_all('p'))
        else:
            piter = self._para_block_reader(stream)

        for para in piter:
            paras.append([self._word_tokenizer.tokenize(sent)
                          for sent in self._sent_tokenizer.tokenize(para)])

        return paras

# Create our corpus reader
rss_corpus = CategorizedHTMLCorpusreader(CORPUS_ROOT, DOC_PATTERN,
                    cat_pattern=CAT_PATTERN, encoding='utf-8')

Just to make things easy, I've also included all of the imports at the top of this snippet in case you're just copying and pasting. This should give you a corpus that is easily readable with the following properties:

RSS Corpus contains 5506 files in 11 categories
Vocab: 69642 in 1920455 words for a lexical diversity of 27.576
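
The lexical diversity figure here is the total number of words divided by the size of the vocabulary (1920455 / 69642 ≈ 27.576), i.e. the average number of times each distinct word is used. A minimal sketch of that calculation on a toy token list, with a function name of my own choosing:

```python
from collections import Counter

def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    counts = Counter(tokens)
    return sum(counts.values()) / float(len(counts))

tokens = "the cat sat on the mat near the door".split()
print("%0.3f" % lexical_diversity(tokens))
```

With the corpus reader, the same numbers could be gathered by counting the tokens returned by rss_corpus.words().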

This snippet demonstrates a choice I made: to override the _read_word_block and _read_para_block methods of the CategorizedPlaintextCorpusReader. Of course, you could instead have created your own HTMLCorpusReader class that implemented the categorization features.

The next thing to do is to figure out how you will generate your featuresets; I hope that you used unigrams, bigrams, TF-IDF, and others. The simplest approach is a bag of words. Here, I have ensured that the bag of words does not contain punctuation or stopwords, has been normalized to all lowercase, and has been lemmatized to reduce the number of word forms:


In [95]:
# Create feature extractor methodology
def normalize_words(document):
    """
    Expects as input a list of words that make up a document. This will
    yield only lowercase significant words (excluding stopwords and
    punctuation) and will lemmatize all words to ensure that we have word
    forms that are standardized.
    """
    stopwords  = set(nltk.corpus.stopwords.words('english'))
    lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    for token in document:
        token = token.lower()
        if token in string.punctuation: continue
        if token in stopwords: continue
        yield lemmatizer.lemmatize(token)

def document_features(document):
    words = nltk.FreqDist(normalize_words(document))
    feats = {}
    for word in words.keys():
        feats['contains(%s)' % word] = True
    return feats
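
One caveat with the punctuation check above: `token in string.punctuation` is a substring test against ASCII punctuation, so multi-character tokens such as "..." and typographic quotes such as "’" are not filtered out. A stricter check, sketched below with a helper name of my own, tests whether every character falls in a Unicode punctuation category. Note that it treats symbols such as "$" or "+" as non-punctuation, which may or may not be what you want.

```python
import string
import unicodedata

def is_punctuation(token):
    """True if every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(char).startswith("P") for char in token
    )

for token in (u".", u"...", u"\u2019", u".\u201d", u"e"):
    substring = token in string.punctuation
    print("%s %s" % (substring, is_punctuation(token)))
```

Swapping this in for the substring test in normalize_words would change the feature set, so the classifier would need to be retrained.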

You should save the training, devtest, and test sets as pickles to disk so that you can easily work on your classifier without having to worry about the overhead of randomization. I went ahead and saved the features to disk, but if you're still developing features then you'll only want to save the word lists to disk. Here are the functions for both generating and loading the data sets:


In [98]:
def timeit(func):
    def wrapper(*args, **kwargs):
        start  = time.time()
        result = func(*args, **kwargs)
        delta  = time.time() - start
        return result, delta
    return wrapper

@timeit
def generate_datasets(test_size=550, pickle_dir="."):
    """
    Creates three data sets; a test set and dev test set of 550 documents
    then a training set with the rest of the documents in the corpus. It
    will then write the data sets to disk at the pickle_dir.
    """
    documents = [(document_features(rss_corpus.words(fileid)), category)
                    for category in rss_corpus.categories()
                    for fileid in rss_corpus.fileids(category)]

    random.shuffle(documents)

    datasets = {
        'test':     documents[0:test_size],
        'devtest':  documents[test_size:test_size*2],
        'training': documents[test_size*2:],
    }

    for name, data in datasets.items():
        with open(os.path.join(pickle_dir, name+".pickle"), 'wb') as out:
            pickle.dump(data, out)

def load_datasets(pickle_dir="."):
    """
    Loads the randomly shuffled data sets from their pickles on disk.
    """

    def loader(name):
        path = os.path.join(pickle_dir, name+".pickle")
        with open(path, 'rb') as f:
            data = pickle.load(f)

        return name, data

    return dict(loader(name) for name in ('test', 'devtest', 'training'))

# The timeit decorator shows how long generation takes, which is the time you save by loading the pickles instead:

_, delta = generate_datasets(pickle_dir='datasets')
print "Took %0.3f seconds to generate datasets" % delta


Took 26.522 seconds to generate datasets
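
The slicing in generate_datasets can be checked on a toy list. The sketch below mirrors it with a hypothetical split_datasets helper; the seed parameter is my own addition so that the shuffle is repeatable, which the original function does not do.

```python
import random

def split_datasets(documents, test_size, seed=None):
    """
    Shuffles a copy of the documents and slices it into test, devtest
    and training sets, mirroring the slicing in generate_datasets.
    """
    documents = list(documents)
    random.Random(seed).shuffle(documents)
    return {
        'test':     documents[:test_size],
        'devtest':  documents[test_size:test_size * 2],
        'training': documents[test_size * 2:],
    }

sets = split_datasets(range(20), test_size=3, seed=42)
print(sorted((name, len(data)) for name, data in sets.items()))
```

With the 5506 files in the RSS corpus and a test_size of 550, this slicing leaves 5506 - 1100 = 4406 documents for training.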

Last up is building the classifier. I used a maximum entropy classifier with the lemmatized word-level features. Also note that I used the MEGAM algorithm to significantly speed up my training time:


In [102]:
@timeit
def train_classifier(training, path='classifier.pickle'):
    """
    Trains the classifier and saves it to disk.
    """
    classifier = nltk.MaxentClassifier.train(training,
                algorithm='megam', trace=2, gaussian_prior_sigma=1)

    with open(path, 'wb') as out:
        pickle.dump(classifier, out)

    return classifier

datasets = load_datasets(pickle_dir='datasets')
classifier, delta = train_classifier(datasets['training'])
print "trained in %0.3f seconds" % delta

testacc    = nltk.classify.accuracy(classifier, datasets['test']) * 100
print "test accuracy %0.2f%%" % testacc

classifier.show_most_informative_features(30)


[Found megam: /Users/benjamin/bin/megam]
[Found megam: /Users/benjamin/bin/megam]
trained in 136.805 seconds
test accuracy 82.00%
   4.189 contains(comment)==True and label is 'data_science'
   3.581 contains(...)==True and label is 'gaming'
   3.533 contains(data)==True and label is 'data_science'
   3.237 contains(book)==True and label is 'books'
   2.980 label is 'business'
   2.952 contains(wired)==True and label is 'tech'
   2.815 contains(game)==True and label is 'gaming'
   2.629 contains(»)==True and label is 'business'
   2.533 contains(read)==True and label is 'tech'
  -2.463 contains(read)==True and label is 'business'
  -2.452 label is 'essays'
   2.255 contains(entrepreneur)==True and label is 'business'
   2.214 contains(business)==True and label is 'business'
   2.193 contains(facebook)==True and label is 'tech'
   2.159 contains(film)==True and label is 'cinema'
   2.085 contains(adafruit)==True and label is 'do_it_yourself'
   2.006 contains(recipe)==True and label is 'cooking'
   1.990 contains(...)==True and label is 'cinema'
   1.988 contains(...))==True and label is 'design'
  -1.977 contains(’)==True and label is 'sports'
   1.915 contains(sweet)==True and label is 'cooking'
  -1.890 label is 'do_it_yourself'
   1.879 contains(design)==True and label is 'design'
   1.869 contains(mail)==True and label is 'books'
   1.793 contains(e)==True and label is 'books'
   1.787 contains(analytics)==True and label is 'data_science'
  -1.781 contains(comment)==True and label is 'business'
   1.781 contains(appeared)==True and label is 'tech'
   1.779 contains(steam)==True and label is 'gaming'
   1.773 contains(sport)==True and label is 'sports'
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object find_file_iter at 0x118f564b0> ignored
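
To get a feel for how the weights above turn into a prediction, here is a minimal sketch of log-linear scoring: the weights of the features that fire for a label are summed, and the totals are exponentiated and normalized into probabilities. The function name and the example weights are hypothetical, and the base of 2 is my assumption about how NLTK stores its log weights, so treat the exact numbers as illustrative only.

```python
def label_probabilities(weights, base=2.0):
    """
    weights maps each label to the list of weights of the features that
    fired for it. Returns a normalized probability for each label.
    """
    totals = dict((label, sum(ws)) for label, ws in weights.items())
    # Subtract the largest total before exponentiating, for stability
    top = max(totals.values())
    scores = dict((label, base ** (total - top))
                  for label, total in totals.items())
    norm = sum(scores.values())
    return dict((label, score / norm) for label, score in scores.items())

# Hypothetical weights, not taken from the trained model
example = {
    'cooking': [2.0, 1.5, 1.0],
    'tech':    [0.5, -0.5],
    'sports':  [-1.0],
}
probs = label_probabilities(example)
print(max(probs, key=probs.get))
```

Because the totals are exponentiated, a label with many positively weighted features can end up with nearly all of the probability mass.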

In [167]:
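from operator import itemgetter

def classify(text, explain=False):
    """
    Prints the probability of each category for the text, most likely
    first, and optionally explains which features drove the decision.
    """
    # Load the trained classifier that train_classifier saved to disk
    with open('classifier.pickle', 'rb') as f:
        classifier = pickle.load(f)

    # Build the same bag-of-words features that were used in training
    document = nltk.wordpunct_tokenize(text)
    features = document_features(document)

    # Rank the labels by their probability under the model
    pdist  = classifier.prob_classify(features)
    ranked = sorted([(s, pdist.prob(s)) for s in pdist.samples()],
                    key=itemgetter(1), reverse=True)
    for result in ranked:
        print "%s: %0.4f" % result

    print
    if explain:
        classifier.explain(features)

classify(get_text("nrRB0.html"), True)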


cooking: 1.0000
essays: 0.0000
do_it_yourself: 0.0000
books: 0.0000
design: 0.0000
gaming: 0.0000
cinema: 0.0000
tech: 0.0000
data_science: 0.0000
sports: 0.0000
business: 0.0000

  Feature                                          cooking  essays do_it_y   books
  --------------------------------------------------------------------------------
  contains(recipe)==True (1)                         2.006
  contains(dish)==True (1)                           1.737
  contains(food)==True (1)                           1.144
  contains(tomato)==True (1)                         0.992
  contains(served)==True (1)                         0.868
  contains(stuffed)==True (1)                        0.832
  contains(spicy)==True (1)                          0.790
  contains(chef)==True (1)                           0.775
  contains(bar)==True (1)                            0.747
  contains(cooking)==True (1)                        0.716
  contains(sauce)==True (1)                          0.699
  contains(friday)==True (1)                         0.697
  contains(fresh)==True (1)                          0.667
  contains(—)==True (1)                              0.661
  contains(version)==True (1)                        0.651
  contains(oil)==True (1)                            0.626
  contains(topped)==True (1)                         0.621
  contains(grilled)==True (1)                        0.603
  contains(delicious)==True (1)                      0.588
  contains(classic)==True (1)                        0.563
  contains().)==True (1)                             0.555
  contains(flavor)==True (1)                         0.515
  contains(new)==True (1)                           -0.506
  contains(filling)==True (1)                        0.495
  contains(kitchen)==True (1)                        0.492
  contains(meat)==True (1)                           0.489
  contains(crisp)==True (1)                          0.466
  contains(made)==True (1)                           0.464
  contains(perfect)==True (1)                        0.450
  contains(butter)==True (1)                         0.446
  contains(throughout)==True (1)                     0.439
  contains(meal)==True (1)                           0.436
  contains(seasoning)==True (1)                      0.431
  contains(popular)==True (1)                        0.428
  contains(bite)==True (1)                           0.421
  contains(kimchi)==True (1)                         0.414
  contains(roasted)==True (1)                        0.413
  contains(pickle)==True (1)                         0.411
  contains(dinner)==True (1)                         0.380
  contains(home)==True (1)                           0.380
  contains(favorite)==True (1)                       0.377
  contains(deep)==True (1)                           0.377
  contains(’)==True (1)                             -0.372
  contains(bread)==True (1)                          0.359
  contains(say)==True (1)                           -0.358
  contains(restaurant)==True (1)                     0.344
  contains(garlic)==True (1)                         0.342
  contains(fried)==True (1)                          0.341
  contains(best)==True (1)                           0.340
  contains(potato)==True (1)                         0.338
  contains(year)==True (1)                          -0.335
  contains(juice)==True (1)                          0.325
  contains(time)==True (1)                          -0.325
  contains(side)==True (1)                           0.321
  contains(fish)==True (1)                           0.321
  contains(east)==True (1)                           0.314
  contains(come)==True (1)                           0.313
  contains(salt)==True (1)                           0.311
  contains(seafood)==True (1)                        0.306
  contains(almost)==True (1)                         0.306
  contains(onion)==True (1)                          0.302
  label is 'cooking' (1)                            -0.301
  contains(country)==True (1)                        0.300
  contains(traditional)==True (1)                    0.297
  contains(spring)==True (1)                         0.294
  contains(pepper)==True (1)                         0.290
  contains(combination)==True (1)                    0.288
  contains(warm)==True (1)                           0.284
  contains(variation)==True (1)                      0.282
  contains(change)==True (1)                        -0.277
  contains(pizza)==True (1)                          0.275
  contains(house)==True (1)                          0.274
  contains(though)==True (1)                         0.271
  contains(mixture)==True (1)                        0.262
  contains(hot)==True (1)                            0.258
  contains(without)==True (1)                        0.256
  contains(pack)==True (1)                           0.253
  contains(satisfying)==True (1)                     0.252
  contains(street)==True (1)                         0.252
  contains(roll)==True (1)                           0.251
  contains(one)==True (1)                            0.251
  contains(snack)==True (1)                          0.250
  contains(beauty)==True (1)                         0.247
  contains(‘)==True (1)                             -0.240
  contains(two)==True (1)                           -0.240
  contains(could)==True (1)                         -0.240
  contains(variety)==True (1)                        0.238
  contains(taste)==True (1)                          0.237
  contains(head)==True (1)                           0.234
  contains(often)==True (1)                          0.233
  contains(corn)==True (1)                           0.230
  contains(menu)==True (1)                           0.228
  contains(much)==True (1)                           0.224
  contains(temperature)==True (1)                    0.222
  contains(chicken)==True (1)                        0.222
  contains(may)==True (1)                            0.221
  contains(lettuce)==True (1)                        0.220
  contains(pan)==True (1)                            0.219
  contains(used)==True (1)                           0.214
  contains(plate)==True (1)                          0.212
  contains(flour)==True (1)                          0.210
  contains(steak)==True (1)                          0.206
  contains(john)==True (1)                          -0.204
  contains(lunch)==True (1)                          0.204
  contains(free)==True (1)                           0.200
  contains(better)==True (1)                        -0.200
  contains(batter)==True (1)                         0.197
  contains(white)==True (1)                          0.196
  contains(true)==True (1)                          -0.194
  contains(5)==True (1)                             -0.186
  contains(want)==True (1)                          -0.183
  contains(”)==True (1)                             -0.182
  contains(open)==True (1)                          -0.178
  contains(kind)==True (1)                           0.178
  contains(e)==True (1)                             -0.177
  contains(risotto)==True (1)                        0.176
  contains(good)==True (1)                           0.175
  contains(sea)==True (1)                            0.175
  contains(piece)==True (1)                         -0.175
  contains(name)==True (1)                          -0.175
  contains(run)==True (1)                           -0.174
  contains(month)==True (1)                         -0.174
  contains(style)==True (1)                          0.174
  contains(thing)==True (1)                         -0.173
  contains(created)==True (1)                       -0.171
  contains(spice)==True (1)                          0.168
  contains(probably)==True (1)                       0.168
  contains(blend)==True (1)                          0.168
  contains(8)==True (1)                             -0.166
  contains(com)==True (1)                           -0.165
  contains(beef)==True (1)                           0.165
  contains(fall)==True (1)                           0.164
  contains(crispy)==True (1)                         0.164
  contains(le)==True (1)                            -0.161
  contains(“)==True (1)                             -0.161
  contains(woman)==True (1)                         -0.161
  contains(hour)==True (1)                          -0.160
  contains(right)==True (1)                          0.155
  contains(child)==True (1)                         -0.154
  contains(layer)==True (1)                          0.153
  contains(spiced)==True (1)                         0.152
  contains(u)==True (1)                              0.152
  contains(account)==True (1)                       -0.151
  contains(longer)==True (1)                        -0.150
  contains(leave)==True (1)                          0.150
  contains(pas)==True (1)                            0.148
  contains(crust)==True (1)                          0.148
  contains(.”)==True (1)                             0.147
  contains(found)==True (1)                          0.147
  contains(d)==True (1)                              0.147
  contains(modern)==True (1)                        -0.147
  contains(tender)==True (1)                         0.147
  contains(important)==True (1)                     -0.146
  contains(),)==True (1)                             0.145
  contains(held)==True (1)                           0.143
  contains(memorable)==True (1)                      0.143
  contains(others)==True (1)                        -0.143
  contains(part)==True (1)                           0.143
  contains(bag)==True (1)                            0.143
  contains(beer)==True (1)                           0.140
  contains(line)==True (1)                           0.140
  contains(former)==True (1)                        -0.140
  contains(hand)==True (1)                           0.138
  contains(including)==True (1)                      0.137
  contains(dose)==True (1)                           0.136
  contains(spoon)==True (1)                          0.135
  contains(quite)==True (1)                          0.134
  contains(won)==True (1)                           -0.133
  contains(standard)==True (1)                       0.132
  contains(miss)==True (1)                          -0.129
  contains(happy)==True (1)                         -0.129
  contains(eat)==True (1)                            0.129
  contains(credit)==True (1)                         0.128
  contains(leftover)==True (1)                       0.128
  contains(ask)==True (1)                           -0.127
  contains(another)==True (1)                       -0.126
  contains(slightly)==True (1)                       0.125
  contains(grab)==True (1)                           0.125
  contains(dark)==True (1)                           0.125
  contains(offer)==True (1)                          0.123
  contains(leaf)==True (1)                           0.123
  contains(cayenne)==True (1)                        0.121
  contains(called)==True (1)                        -0.121
  contains(doesn)==True (1)                          0.121
  contains(re)==True (1)                             0.120
  contains(process)==True (1)                       -0.120
  contains(need)==True (1)                          -0.120
  contains(.))==True (1)                             0.119
  contains(long)==True (1)                           0.118
  contains(whether)==True (1)                       -0.118
  contains(really)==True (1)                         0.117
  contains(fat)==True (1)                            0.116
  contains(area)==True (1)                          -0.115
  contains(tour)==True (1)                          -0.115
  contains(bring)==True (1)                         -0.114
  contains(rather)==True (1)                        -0.114
  contains(!))==True (1)                             0.114
  contains(get)==True (1)                            0.113
  contains(eats)==True (1)                           0.112
  contains(v)==True (1)                             -0.112
  contains(serving)==True (1)                        0.111
  contains(using)==True (1)                          0.107
  contains(review)==True (1)                        -0.107
  contains(self)==True (1)                          -0.106
  contains(,”)==True (1)                             0.105
  contains(spent)==True (1)                          0.105
  contains(family)==True (1)                         0.105
  contains(thin)==True (1)                           0.104
  contains(taking)==True (1)                        -0.104
  contains(well)==True (1)                           0.103
  contains(j)==True (1)                             -0.103
  contains(sandwich)==True (1)                       0.103
  contains(clean)==True (1)                         -0.102
  contains(keep)==True (1)                          -0.102
  contains(build)==True (1)                         -0.101
  contains(suit)==True (1)                           0.101
  contains(offering)==True (1)                      -0.101
  contains(dedicated)==True (1)                      0.100
  contains(hard)==True (1)                          -0.100
  contains(easy)==True (1)                           0.099
  contains(wanted)==True (1)                         0.099
  contains(trick)==True (1)                         -0.099
  contains(put)==True (1)                            0.098
  contains(spend)==True (1)                         -0.098
  contains(person)==True (1)                        -0.097
  contains(competition)==True (1)                    0.097
  contains(related)==True (1)                        0.097
  contains(rendered)==True (1)                       0.096
  contains(discover)==True (1)                      -0.095
  contains(24)==True (1)                            -0.094
  contains(big)==True (1)                            0.094
  contains(diner)==True (1)                          0.093
  contains(american)==True (1)                      -0.092
  contains(paper)==True (1)                         -0.092
  contains(option)==True (1)                        -0.092
  contains(cilantro)==True (1)                       0.092
  contains(age)==True (1)                           -0.092
  contains(y)==True (1)                              0.089
  contains(fry)==True (1)                            0.088
  contains(later)==True (1)                         -0.087
  contains(combine)==True (1)                        0.087
  contains(basically)==True (1)                      0.087
  contains(plus)==True (1)                          -0.087
  contains(rose)==True (1)                           0.087
  contains(french)==True (1)                        -0.086
  contains(form)==True (1)                           0.085
  contains(ever)==True (1)                           0.085
  contains(would)==True (1)                          0.083
  contains(simply)==True (1)                         0.083
  contains(hearty)==True (1)                         0.081
  contains(marinated)==True (1)                      0.080
  contains(even)==True (1)                           0.080
  contains(whole)==True (1)                          0.079
  contains(pain)==True (1)                           0.079
  contains(like)==True (1)                          -0.079
  contains(founder)==True (1)                       -0.079
  contains(common)==True (1)                        -0.078
  contains(heavy)==True (1)                          0.077
  contains(instead)==True (1)                        0.076
  contains(mousse)==True (1)                         0.076
  contains(co)==True (1)                            -0.075
  contains(passion)==True (1)                        0.075
  contains(certain)==True (1)                       -0.074
  contains(taken)==True (1)                          0.074
  contains(salty)==True (1)                          0.074
  contains(set)==True (1)                           -0.072
  contains(begin)==True (1)                         -0.071
  contains(enough)==True (1)                         0.071
  contains(50)==True (1)                            -0.070
  contains(2002)==True (1)                           0.070
  contains(item)==True (1)                          -0.070
  contains(refined)==True (1)                        0.069
  contains(matter)==True (1)                         0.069
  contains(music)==True (1)                         -0.069
  contains(deal)==True (1)                          -0.068
  contains(stock)==True (1)                          0.068
  contains(let)==True (1)                           -0.067
  contains(dipping)==True (1)                        0.067
  contains(inspired)==True (1)                       0.067
  contains(post)==True (1)                          -0.066
  contains(involved)==True (1)                      -0.066
  contains(dusted)==True (1)                         0.065
  contains(know)==True (1)                          -0.064
  contains(element)==True (1)                       -0.064
  contains(tom)==True (1)                            0.063
  contains(wonderfully)==True (1)                    0.063
  contains(golden)==True (1)                         0.063
  contains(care)==True (1)                          -0.062
  contains(stick)==True (1)                          0.061
  contains(frequently)==True (1)                     0.061
  contains(picnic)==True (1)                         0.061
  contains(term)==True (1)                          -0.060
  contains(become)==True (1)                        -0.060
  contains(size)==True (1)                           0.060
  contains(pleasure)==True (1)                      -0.059
  contains(although)==True (1)                      -0.059
  contains(exclusively)==True (1)                    0.059
  contains(flash)==True (1)                         -0.058
  contains(newest)==True (1)                        -0.058
  contains(degree)==True (1)                         0.058
  contains(order)==True (1)                          0.058
  contains(kid)==True (1)                           -0.058
  contains(basket)==True (1)                         0.058
  contains(top)==True (1)                           -0.057
  contains(strip)==True (1)                          0.057
  contains(effect)==True (1)                         0.056
  contains(ray)==True (1)                           -0.056
  contains(alike)==True (1)                          0.055
  contains(crumb)==True (1)                          0.055
  contains(fork)==True (1)                           0.055
  contains(lends)==True (1)                          0.054
  contains(real)==True (1)                          -0.053
  contains(location)==True (1)                      -0.053
  contains(soul)==True (1)                          -0.053
  contains(signature)==True (1)                      0.053
  contains(pool)==True (1)                          -0.052
  contains(take)==True (1)                           0.052
  contains(ounce)==True (1)                          0.051
  contains(sunday)==True (1)                        -0.051
  contains(knife)==True (1)                          0.051
  contains(try)==True (1)                            0.051
  contains(horseradish)==True (1)                    0.050
  contains(crème)==True (1)                          0.050
  contains(actually)==True (1)                      -0.050
  contains(frank)==True (1)                         -0.050
  contains(sold)==True (1)                          -0.050
  contains(asking)==True (1)                        -0.050
  contains(iron)==True (1)                          -0.049
  contains(luxury)==True (1)                         0.049
  contains(fancy)==True (1)                          0.048
  contains(single)==True (1)                        -0.048
  contains(generous)==True (1)                       0.048
  contains(protein)==True (1)                        0.048
  contains(follows)==True (1)                       -0.048
  contains(st)==True (1)                             0.048
  contains(toward)==True (1)                        -0.047
  contains(remains)==True (1)                        0.047
  contains(decade)==True (1)                        -0.047
  contains(juicy)==True (1)                          0.047
  contains(detail)==True (1)                         0.047
  contains(fully)==True (1)                         -0.046
  contains(nashville)==True (1)                      0.046
  contains(seems)==True (1)                          0.046
  contains(something)==True (1)                     -0.046
  contains(believe)==True (1)                        0.046
  contains(nine)==True (1)                           0.046
  contains(dip)==True (1)                            0.046
  contains(cast)==True (1)                           0.045
  contains(since)==True (1)                         -0.045
  contains(c)==True (1)                              0.044
  contains(shore)==True (1)                          0.044
  contains(count)==True (1)                         -0.044
  contains(low)==True (1)                           -0.044
  contains(ll)==True (1)                            -0.044
  contains(florida)==True (1)                       -0.043
  contains(reveal)==True (1)                        -0.043
  contains(forget)==True (1)                        -0.042
  contains(never)==True (1)                         -0.042
  contains(seasoned)==True (1)                       0.042
  contains(developed)==True (1)                      0.042
  contains(southern)==True (1)                       0.042
  contains(anything)==True (1)                       0.042
  contains(georgia)==True (1)                        0.042
  contains(glass)==True (1)                         -0.041
  contains(appreciate)==True (1)                     0.041
  contains(bone)==True (1)                          -0.040
  contains(perfectly)==True (1)                     -0.040
  contains(moist)==True (1)                          0.040
  contains(adult)==True (1)                         -0.039
  contains(yield)==True (1)                          0.039
  contains(wheat)==True (1)                         -0.039
  contains(within)==True (1)                        -0.039
  contains(ordering)==True (1)                      -0.039
  contains(buy)==True (1)                           -0.039
  contains(testing)==True (1)                       -0.039
  contains(store)==True (1)                         -0.038
  contains(joint)==True (1)                         -0.037
  contains(creating)==True (1)                      -0.037
  contains(skin)==True (1)                          -0.037
  contains(half)==True (1)                          -0.037
  contains(bay)==True (1)                           -0.036
  contains(40)==True (1)                            -0.036
  contains(sound)==True (1)                         -0.036
  contains(popcorn)==True (1)                        0.036
  contains(chili)==True (1)                          0.035
  contains(teeth)==True (1)                          0.035
  contains(resist)==True (1)                        -0.035
  contains(prevent)==True (1)                       -0.034
  contains(lot)==True (1)                            0.034
  contains(presentation)==True (1)                  -0.034
  contains(tongue)==True (1)                         0.033
  contains(training)==True (1)                      -0.033
  contains(fast)==True (1)                          -0.033
  contains(owner)==True (1)                          0.032
  contains(waste)==True (1)                          0.032
  contains(south)==True (1)                          0.032
  contains(exact)==True (1)                          0.032
  contains(marriage)==True (1)                      -0.032
  contains(silver)==True (1)                         0.032
  contains(crushed)==True (1)                        0.032
  contains(exterior)==True (1)                      -0.031
  contains(bucket)==True (1)                        -0.031
  contains(pour)==True (1)                          -0.030
  contains(back)==True (1)                          -0.030
  contains(cut)==True (1)                           -0.030
  contains(honey)==True (1)                          0.030
  contains(double)==True (1)                         0.030
  contains(subtle)==True (1)                        -0.029
  contains(needed)==True (1)                        -0.029
  contains(call)==True (1)                           0.029
  contains(minute)==True (1)                         0.029
  contains(snap)==True (1)                           0.029
  contains(bun)==True (1)                            0.028
  contains(fine)==True (1)                          -0.028
  contains(finished)==True (1)                       0.028
  contains(twice)==True (1)                          0.028
  contains(thought)==True (1)                       -0.028
  contains(virginia)==True (1)                       0.028
  contains(couldn)==True (1)                        -0.027
  contains(.’)==True (1)                             0.027
  contains(irresistible)==True (1)                   0.027
  contains(karaoke)==True (1)                        0.027
  contains(oyster)==True (1)                         0.027
  contains(onto)==True (1)                           0.026
  contains(handle)==True (1)                        -0.026
  contains(go)==True (1)                            -0.026
  contains(chain)==True (1)                         -0.025
  contains(stacked)==True (1)                       -0.025
  contains(yard)==True (1)                          -0.025
  contains(creator)==True (1)                       -0.025
  contains(black)==True (1)                          0.025
  contains(n)==True (1)                              0.025
  contains(describe)==True (1)                      -0.024
  contains(del)==True (1)                           -0.024
  contains(started)==True (1)                        0.024
  contains(brand)==True (1)                          0.024
  contains(every)==True (1)                          0.024
  contains(crunch)==True (1)                         0.024
  contains(hold)==True (1)                          -0.023
  contains(proper)==True (1)                        -0.023
  contains(central)==True (1)                       -0.023
  contains(learned)==True (1)                        0.022
  contains(mount)==True (1)                         -0.022
  contains(everywhere)==True (1)                    -0.022
  contains(finger)==True (1)                        -0.022
  contains(place)==True (1)                         -0.022
  contains(result)==True (1)                        -0.022
  contains(district)==True (1)                      -0.021
  contains(sell)==True (1)                           0.021
  contains(maybe)==True (1)                         -0.020
  contains(base)==True (1)                           0.020
  contains(spin)==True (1)                          -0.020
  contains(14)==True (1)                            -0.020
  contains(secret)==True (1)                         0.020
  contains(obsession)==True (1)                      0.020
  contains(reserve)==True (1)                       -0.020
  contains(brushed)==True (1)                       -0.019
  contains(named)==True (1)                         -0.019
  contains(cart)==True (1)                           0.018
  contains(breath)==True (1)                        -0.018
  contains(hybrid)==True (1)                        -0.018
  contains(high)==True (1)                           0.018
  contains(crack)==True (1)                         -0.018
  contains(relegated)==True (1)                      0.018
  contains(enjoyed)==True (1)                       -0.018
  contains(ultra)==True (1)                         -0.018
  contains(resulting)==True (1)                      0.018
  contains(deeply)==True (1)                        -0.018
  contains(cajun)==True (1)                          0.017
  contains(pressure)==True (1)                       0.017
  contains(roof)==True (1)                          -0.017
  contains(25)==True (1)                            -0.017
  contains(generously)==True (1)                    -0.017
  contains(washington)==True (1)                     0.016
  contains(fillet)==True (1)                         0.016
  contains(everything)==True (1)                    -0.016
  contains(attached)==True (1)                      -0.015
  contains(extremely)==True (1)                     -0.015
  contains(coated)==True (1)                         0.014
  contains(sure)==True (1)                           0.014
  contains(stuff)==True (1)                          0.014
  contains(distributed)==True (1)                   -0.014
  contains(barbecue)==True (1)                       0.014
  contains(tartness)==True (1)                      -0.014
  contains(consistency)==True (1)                    0.013
  contains(unless)==True (1)                        -0.013
  contains(exchange)==True (1)                      -0.013
  contains(favor)==True (1)                         -0.013
  contains(northeast)==True (1)                      0.013
  contains(dig)==True (1)                           -0.013
  contains(fan)==True (1)                           -0.012
  contains(paid)==True (1)                           0.012
  contains(allow)==True (1)                         -0.012
  contains(however)==True (1)                        0.012
  contains(12)==True (1)                            -0.012
  contains(gravy)==True (1)                          0.012
  contains(shine)==True (1)                         -0.011
  contains(dad)==True (1)                            0.011
  contains(essentially)==True (1)                    0.011
  contains(q)==True (1)                              0.011
  contains(accessible)==True (1)                     0.011
  contains(example)==True (1)                       -0.011
  contains(grandmother)==True (1)                    0.011
  contains(.,)==True (1)                             0.011
  contains(finish)==True (1)                         0.010
  contains(turned)==True (1)                         0.010
  contains(taylor)==True (1)                         0.010
  contains(addictive)==True (1)                      0.010
  contains(surprising)==True (1)                    -0.010
  contains(also)==True (1)                           0.010
  contains(three)==True (1)                         -0.010
  contains(available)==True (1)                      0.010
  contains(mashed)==True (1)                         0.009
  contains(famed)==True (1)                         -0.009
  contains(bath)==True (1)                          -0.009
  contains(feel)==True (1)                          -0.009
  contains(ear)==True (1)                           -0.009
  contains(preparation)==True (1)                   -0.009
  contains(planning)==True (1)                      -0.009
  contains(agreement)==True (1)                     -0.008
  contains(got)==True (1)                           -0.008
  contains(foam)==True (1)                          -0.008
  contains(circle)==True (1)                        -0.008
  contains(essential)==True (1)                      0.008
  contains(appetite)==True (1)                      -0.008
  contains(messy)==True (1)                         -0.008
  contains(($)==True (1)                            -0.008
  contains(9)==True (1)                              0.008
  contains(alley)==True (1)                         -0.007
  contains(occasional)==True (1)                    -0.007
  contains(10)==True (1)                             0.007
  contains(upscale)==True (1)                       -0.007
  contains(atop)==True (1)                          -0.007
  contains(99)==True (1)                             0.007
  contains(formula)==True (1)                       -0.007
  contains(pop)==True (1)                            0.007
  contains(plain)==True (1)                         -0.007
  contains(argue)==True (1)                          0.007
  contains(mean)==True (1)                           0.006
  contains(platter)==True (1)                       -0.006
  contains(ala)==True (1)                           -0.006
  contains(end)==True (1)                           -0.005
  contains(empty)==True (1)                          0.005
  contains(technically)==True (1)                   -0.005
  contains(hill)==True (1)                          -0.005
  contains(healthful)==True (1)                      0.004
  contains(m)==True (1)                              0.004
  contains(softer)==True (1)                        -0.004
  contains(mosaic)==True (1)                        -0.004
  contains(near)==True (1)                          -0.004
  contains(excuse)==True (1)                        -0.004
  contains(list)==True (1)                           0.004
  contains(local)==True (1)                          0.004
  contains(mac)==True (1)                            0.004
  contains(jersey)==True (1)                        -0.004
  contains(old)==True (1)                            0.004
  contains(margaret)==True (1)                      -0.004
  contains(owns)==True (1)                          -0.004
  contains(korean)==True (1)                        -0.004
  contains(coating)==True (1)                       -0.004
  contains(thigh)==True (1)                          0.003
  contains(tablecloth)==True (1)                    -0.003
  contains(ubiquitous)==True (1)                     0.003
  contains(waffle)==True (1)                         0.003
  contains(biscuit)==True (1)                       -0.003
  contains(outright)==True (1)                       0.003
  contains(flesh)==True (1)                         -0.002
  contains(chipotle)==True (1)                       0.002
  contains(fare)==True (1)                          -0.002
  contains(portion)==True (1)                       -0.002
  contains(sometimes)==True (1)                      0.002
  contains(inevitably)==True (1)                    -0.002
  contains(soy)==True (1)                            0.002
  contains(japanese)==True (1)                      -0.002
  contains(sliver)==True (1)                        -0.002
  contains(dunk)==True (1)                          -0.002
  contains(ave)==True (1)                           -0.002
  contains(buttermilk)==True (1)                     0.001
  contains(se)==True (1)                            -0.001
  contains(sink)==True (1)                          -0.001
  contains(cornmeal)==True (1)                       0.001
  contains(getting)==True (1)                        0.001
  contains(method)==True (1)                        -0.001
  contains(wing)==True (1)                          -0.001
  contains(convenience)==True (1)                   -0.001
  contains(frying)==True (1)                         0.001
  contains(word)==True (1)                          -0.001
  contains(bum)==True (1)                           -0.001
  contains(dredged)==True (1)                       -0.001
  contains(said)==True (1)                           0.000
  contains(accent)==True (1)                         0.000
  contains(commonly)==True (1)                      -0.000
  contains(ranch)==True (1)                         -0.000
  contains(rd)==True (1)                             0.000
  contains(translucent)==True (1)                   -0.000
  contains(starch)==True (1)                        -0.000
  contains(perfecting)==True (1)                    -0.000
  contains(paprika)==True (1)                        0.000
  contains(nub)==True (1)                           -0.000
  contains(dreamed)==True (1)                       -0.000
  label is 'essays' (1)                                     -2.452
  contains(’)==True (1)                                      0.739
  contains(u)==True (1)                                      0.719
  contains(“)==True (1)                                      0.619
  contains(.”)==True (1)                                     0.456
  contains(re)==True (1)                                     0.366
  contains(fat)==True (1)                                    0.332
  contains(change)==True (1)                                 0.322
  contains(perfect)==True (1)                                0.311
  contains(toward)==True (1)                                 0.298
  contains(year)==True (1)                                  -0.292
  contains(5)==True (1)                                      0.289
  contains(never)==True (1)                                  0.280
  contains(american)==True (1)                               0.273
  contains(want)==True (1)                                  -0.262
  contains(best)==True (1)                                   0.261
  contains(term)==True (1)                                   0.261
  contains(post)==True (1)                                  -0.251
  contains(go)==True (1)                                     0.250
  contains(like)==True (1)                                   0.248
  contains(word)==True (1)                                   0.246
  contains(almost)==True (1)                                 0.245
  contains(said)==True (1)                                  -0.245
  contains(set)==True (1)                                   -0.243
  contains(half)==True (1)                                   0.242
  contains(know)==True (1)                                   0.240
  contains(name)==True (1)                                   0.238
  contains(joint)==True (1)                                  0.238
  contains(old)==True (1)                                    0.227
  contains(list)==True (1)                                   0.217
  contains(three)==True (1)                                  0.209
  contains(effect)==True (1)                                 0.204
  contains(m)==True (1)                                      0.198
  contains(home)==True (1)                                  -0.197
  contains(something)==True (1)                             -0.196
  contains(—)==True (1)                                      0.194
  contains(ever)==True (1)                                  -0.193
  contains(created)==True (1)                                0.189
  contains(top)==True (1)                                   -0.185
  contains(part)==True (1)                                  -0.184
  contains(kid)==True (1)                                    0.181
  contains(butter)==True (1)                                 0.171
  contains(could)==True (1)                                  0.167
  contains(hour)==True (1)                                   0.166
  contains(crime)==True (1)                                  0.160
  contains(enough)==True (1)                                -0.157
  contains(everything)==True (1)                             0.154
  contains(also)==True (1)                                   0.151
  contains(get)==True (1)                                   -0.150
  contains(using)==True (1)                                 -0.149
  contains(asking)==True (1)                                 0.149
  contains(street)==True (1)                                 0.148
  contains(unless)==True (1)                                 0.147
  contains(”)==True (1)                                     -0.147
  contains(sure)==True (1)                                  -0.147
  contains(paper)==True (1)                                  0.146
  contains(oil)==True (1)                                    0.143
  contains(big)==True (1)                                   -0.143
  contains(house)==True (1)                                  0.136
  contains(bird)==True (1)                                   0.134
  contains(even)==True (1)                                  -0.132
  contains(though)==True (1)                                -0.129
  contains(boardwalk)==True (1)                              0.127
  contains(y)==True (1)                                      0.126
  contains(glass)==True (1)                                  0.126
  contains(got)==True (1)                                    0.126
  contains(central)==True (1)                                0.126
  contains(doesn)==True (1)                                 -0.125
  contains(found)==True (1)                                 -0.125
  contains(new)==True (1)                                   -0.121
  contains(silver)==True (1)                                 0.118
  contains(knife)==True (1)                                  0.116
  contains(actually)==True (1)                              -0.116
  contains(two)==True (1)                                   -0.112
  contains(spend)==True (1)                                  0.112
  contains(month)==True (1)                                  0.112
  contains(free)==True (1)                                   0.110
  contains(le)==True (1)                                    -0.110
  contains(form)==True (1)                                   0.110
  contains(become)==True (1)                                -0.110
  contains(let)==True (1)                                   -0.109
  contains(need)==True (1)                                  -0.106
  contains(pennsylvania)==True (1)                           0.105
  contains(back)==True (1)                                  -0.104
  contains(every)==True (1)                                 -0.103
  contains(describe)==True (1)                               0.102
  contains(low)==True (1)                                    0.102
  contains(sound)==True (1)                                  0.101
  contains(thought)==True (1)                               -0.101
  contains(better)==True (1)                                -0.100
  contains(paid)==True (1)                                   0.099
  contains(another)==True (1)                               -0.097
  contains(whole)==True (1)                                  0.096
  contains(columbia)==True (1)                               0.094
  contains(favorite)==True (1)                              -0.093
  contains(‘)==True (1)                                      0.093
  contains(deal)==True (1)                                   0.092
  contains(take)==True (1)                                  -0.091
  contains(white)==True (1)                                  0.089
  contains(follows)==True (1)                                0.088
  contains(roof)==True (1)                                   0.088
  contains(wolf)==True (1)                                   0.088
  contains(really)==True (1)                                 0.087
  contains(probably)==True (1)                              -0.087
  contains(child)==True (1)                                  0.086
  contains(virginia)==True (1)                               0.085
  contains(cornflakes)==True (1)                             0.082
  contains(ll)==True (1)                                    -0.082
  contains(deep)==True (1)                                   0.082
  contains(8)==True (1)                                     -0.082
  contains(kitchen)==True (1)                                0.082
  contains(come)==True (1)                                  -0.079
  contains(40)==True (1)                                     0.079
  contains(adam)==True (1)                                   0.078
  contains(1971)==True (1)                                   0.078
  contains(99)==True (1)                                     0.078
  contains(.’)==True (1)                                     0.077
  contains(say)==True (1)                                    0.077
  contains(agree)==True (1)                                  0.076
  contains(style)==True (1)                                 -0.076
  contains(waffle)==True (1)                                 0.076
  contains(pour)==True (1)                                   0.074
  contains(resulting)==True (1)                              0.073
  contains(leg)==True (1)                                    0.071
  contains(formula)==True (1)                                0.071
  contains(important)==True (1)                             -0.071
  contains(fall)==True (1)                                  -0.070
  contains(line)==True (1)                                   0.069
  contains(alley)==True (1)                                  0.069
  contains(hearty)==True (1)                                 0.069
  contains(den)==True (1)                                    0.069
  contains(stuff)==True (1)                                  0.068
  contains(secret)==True (1)                                 0.068
  contains(empty)==True (1)                                  0.067
  contains(miss)==True (1)                                   0.067
  contains(,”)==True (1)                                     0.066
  contains(couldn)==True (1)                                 0.065
  contains(call)==True (1)                                  -0.065
  contains(philippine)==True (1)                             0.065
  contains(d)==True (1)                                     -0.065
  contains(age)==True (1)                                   -0.065
  contains(degree)==True (1)                                 0.065
  contains(called)==True (1)                                 0.064
  contains(right)==True (1)                                 -0.063
  contains(se)==True (1)                                     0.062
  contains(begin)==True (1)                                  0.062
  contains(richard)==True (1)                                0.061
  contains(true)==True (1)                                  -0.060
  contains(anything)==True (1)                              -0.060
  contains(good)==True (1)                                  -0.060
  contains(easy)==True (1)                                  -0.059
  contains(clean)==True (1)                                  0.059
  contains(maybe)==True (1)                                 -0.059
  contains(luxury)==True (1)                                 0.058
  contains(j)==True (1)                                      0.058
  contains(available)==True (1)                             -0.058
  contains(getting)==True (1)                                0.058
  contains(lot)==True (1)                                   -0.058
  contains(salt)==True (1)                                   0.057
  contains(outer)==True (1)                                  0.056
  contains(brand)==True (1)                                  0.056
  contains(teeth)==True (1)                                  0.056
  contains(chicken)==True (1)                                0.056
  contains(others)==True (1)                                -0.055
  contains(famed)==True (1)                                  0.054
  contains(used)==True (1)                                  -0.054
  contains(bring)==True (1)                                 -0.054
  contains(well)==True (1)                                  -0.054
  contains(high)==True (1)                                   0.053
  contains(basically)==True (1)                              0.053
  contains(japanese)==True (1)                               0.053
  contains(pool)==True (1)                                   0.052
  contains(sell)==True (1)                                   0.051
  contains(woman)==True (1)                                 -0.051
  contains(modern)==True (1)                                 0.050
  contains(forget)==True (1)                                -0.050
  contains(wanted)==True (1)                                 0.050
  contains(dedicated)==True (1)                              0.049
  contains(plate)==True (1)                                  0.049
  contains(since)==True (1)                                 -0.049
  contains(feel)==True (1)                                   0.048
  contains(mac)==True (1)                                    0.048
  contains(keep)==True (1)                                   0.046
  contains(bay)==True (1)                                    0.045
  contains(common)==True (1)                                 0.045
  contains(soul)==True (1)                                   0.045
  contains(won)==True (1)                                   -0.045
  contains(location)==True (1)                               0.045
  contains(.))==True (1)                                     0.044
  contains(excuse)==True (1)                                 0.044
  contains(including)==True (1)                              0.043
  contains(may)==True (1)                                    0.043
  contains(pas)==True (1)                                    0.043
  contains(),)==True (1)                                    -0.043
  contains(n)==True (1)                                      0.042
  contains(twice)==True (1)                                  0.042
  contains(often)==True (1)                                 -0.041
  contains(review)==True (1)                                 0.041
  contains(buy)==True (1)                                   -0.041
  contains(c)==True (1)                                      0.040
  contains(fish)==True (1)                                   0.039
  contains(kind)==True (1)                                   0.038
  contains(florida)==True (1)                                0.038
  contains(14)==True (1)                                     0.038
  contains(later)==True (1)                                 -0.037
  contains(tim)==True (1)                                    0.037
  contains(hard)==True (1)                                   0.036
  contains(nine)==True (1)                                   0.036
  contains(one)==True (1)                                   -0.036
  contains(pop)==True (1)                                    0.035
  contains(classic)==True (1)                                0.034
  contains(dark)==True (1)                                   0.034
  contains(east)==True (1)                                   0.034
  contains(10)==True (1)                                     0.034
  contains().)==True (1)                                     0.034
  contains(rather)==True (1)                                -0.033
  contains(sometimes)==True (1)                             -0.033
  contains(eat)==True (1)                                    0.033
  contains(near)==True (1)                                  -0.031
  contains(quite)==True (1)                                 -0.030
  contains(competition)==True (1)                            0.030
  contains(although)==True (1)                              -0.030
  contains(real)==True (1)                                   0.030
  contains(place)==True (1)                                  0.030
  contains(offer)==True (1)                                 -0.028
  contains(whether)==True (1)                                0.027
  contains(offering)==True (1)                              -0.027
  contains(area)==True (1)                                  -0.026
  contains(sea)==True (1)                                    0.025
  contains(hill)==True (1)                                   0.025
  contains(french)==True (1)                                 0.024
  contains(essential)==True (1)                              0.024
  contains(without)==True (1)                                0.024
  contains(single)==True (1)                                -0.023
  contains(thing)==True (1)                                 -0.023
  contains(music)==True (1)                                  0.023
  contains(long)==True (1)                                   0.022
  contains(happy)==True (1)                                 -0.021
  contains(taken)==True (1)                                 -0.021
  contains(much)==True (1)                                  -0.020
  contains(minute)==True (1)                                 0.020
  contains(side)==True (1)                                   0.019
  contains(everywhere)==True (1)                             0.019
  contains(self)==True (1)                                  -0.018
  contains(end)==True (1)                                   -0.017
  contains(hand)==True (1)                                   0.016
  contains(within)==True (1)                                -0.015
  contains(consistently)==True (1)                           0.014
  contains(store)==True (1)                                 -0.014
  contains(made)==True (1)                                  -0.014
  contains(beauty)==True (1)                                -0.013
  contains(order)==True (1)                                  0.012
  contains(adult)==True (1)                                 -0.012
  contains(john)==True (1)                                   0.011
  contains(black)==True (1)                                  0.011
  contains(dad)==True (1)                                    0.011
  contains(south)==True (1)                                 -0.010
  contains(owner)==True (1)                                  0.010
  contains(food)==True (1)                                  -0.009
  contains(former)==True (1)                                -0.008
  contains(mean)==True (1)                                   0.008
  contains(time)==True (1)                                  -0.008
  contains(believe)==True (1)                               -0.008
  contains(run)==True (1)                                    0.007
  contains(driving)==True (1)                               -0.007
  contains(voice)==True (1)                                 -0.006
  contains(would)==True (1)                                  0.006
  contains(!))==True (1)                                    -0.005
  contains(fine)==True (1)                                  -0.005
  contains(r)==True (1)                                      0.005
  contains(care)==True (1)                                  -0.005
  contains(tom)==True (1)                                    0.005
  contains(learned)==True (1)                               -0.004
  contains(started)==True (1)                                0.003
  contains(country)==True (1)                               -0.003
  contains(certain)==True (1)                               -0.002
  contains(meat)==True (1)                                  -0.002
  contains(size)==True (1)                                  -0.000
  label is 'do_it_yourself' (1)                                     -1.890
  contains().)==True (1)                                             0.854
  contains(’)==True (1)                                              0.789
  contains(need)==True (1)                                           0.748
  contains(using)==True (1)                                          0.611
  contains(related)==True (1)                                        0.539
  contains(started)==True (1)                                        0.501
  contains(like)==True (1)                                           0.496
  contains(hand)==True (1)                                           0.484
  contains(sure)==True (1)                                           0.483
  contains(ask)==True (1)                                            0.458
  contains(piece)==True (1)                                          0.453
  contains(build)==True (1)                                          0.431
  contains(year)==True (1)                                          -0.430
  contains(wanted)==True (1)                                         0.398
  contains(spent)==True (1)                                          0.375
  contains(ll)==True (1)                                             0.367
  contains(although)==True (1)                                       0.361
  contains(every)==True (1)                                          0.333
  contains(9)==True (1)                                             -0.323
  contains(account)==True (1)                                        0.321
  contains(finger)==True (1)                                         0.319
  contains(,”)==True (1)                                             0.315
  contains(near)==True (1)                                           0.301
  contains(mean)==True (1)                                          -0.299
  contains(place)==True (1)                                          0.296
  contains(size)==True (1)                                           0.288
  contains(also)==True (1)                                           0.287
  contains(”)==True (1)                                             -0.277
  contains(thing)==True (1)                                          0.271
  contains(well)==True (1)                                           0.265
  contains(rather)==True (1)                                         0.264
  contains(base)==True (1)                                           0.263
  contains(order)==True (1)                                          0.263
  contains(best)==True (1)                                          -0.258
  contains(lot)==True (1)                                           -0.257
  contains(though)==True (1)                                        -0.257
  contains(le)==True (1)                                            -0.252
  contains(top)==True (1)                                            0.249
  contains(foam)==True (1)                                           0.247
  contains(know)==True (1)                                          -0.246
  contains(.))==True (1)                                            -0.246
  contains(say)==True (1)                                           -0.245
  contains(probably)==True (1)                                      -0.242
  contains(allow)==True (1)                                          0.240
  contains(part)==True (1)                                           0.239
  contains(r)==True (1)                                             -0.236
  contains(food)==True (1)                                          -0.235
  contains(menu)==True (1)                                           0.231
  contains(stock)==True (1)                                          0.229
  contains(really)==True (1)                                        -0.224
  contains(u)==True (1)                                             -0.223
  contains(may)==True (1)                                            0.221
  contains(brand)==True (1)                                         -0.220
  contains(right)==True (1)                                         -0.216
  contains(get)==True (1)                                            0.215
  contains(month)==True (1)                                         -0.213
  contains(fresh)==True (1)                                          0.212
  contains(single)==True (1)                                         0.210
  contains(john)==True (1)                                          -0.209
  contains(long)==True (1)                                          -0.209
  contains(take)==True (1)                                          -0.207
  contains(method)==True (1)                                         0.207
  contains(inspired)==True (1)                                       0.204
  contains(item)==True (1)                                           0.203
  contains(butterfly)==True (1)                                      0.203
  contains(review)==True (1)                                        -0.199
  contains(would)==True (1)                                          0.198
  contains(option)==True (1)                                         0.197
  contains(try)==True (1)                                            0.197
  contains(perfect)==True (1)                                        0.193
  contains(never)==True (1)                                         -0.192
  contains(creating)==True (1)                                       0.192
  contains(‘)==True (1)                                             -0.187
  contains(country)==True (1)                                       -0.187
  contains(stuff)==True (1)                                         -0.186
  contains(said)==True (1)                                          -0.184
  contains(“)==True (1)                                              0.181
  contains(taken)==True (1)                                          0.181
  contains(instead)==True (1)                                        0.180
  contains(trick)==True (1)                                         -0.179
  contains(age)==True (1)                                            0.178
  contains(example)==True (1)                                        0.178
  contains(seems)==True (1)                                         -0.178
  contains(recipe)==True (1)                                        -0.177
  contains(become)==True (1)                                        -0.175
  contains(hard)==True (1)                                          -0.173
  contains(result)==True (1)                                        -0.173
  contains(thought)==True (1)                                        0.173
  contains(cut)==True (1)                                            0.164
  contains(black)==True (1)                                         -0.161
  contains(white)==True (1)                                         -0.161
  contains(skin)==True (1)                                           0.160
  contains(deep)==True (1)                                          -0.160
  contains(made)==True (1)                                           0.159
  contains(v)==True (1)                                             -0.159
  contains(something)==True (1)                                      0.158
  contains(found)==True (1)                                          0.157
  contains(central)==True (1)                                        0.156
  contains(extremely)==True (1)                                      0.156
  contains(creator)==True (1)                                       -0.153
  contains(getting)==True (1)                                        0.153
  contains(music)==True (1)                                          0.150
  contains(actually)==True (1)                                       0.149
  contains(roll)==True (1)                                          -0.149
  contains(could)==True (1)                                         -0.146
  contains(woman)==True (1)                                         -0.145
  contains(used)==True (1)                                           0.142
  contains(driving)==True (1)                                       -0.140
  contains(taking)==True (1)                                         0.139
  contains(without)==True (1)                                        0.138
  contains(available)==True (1)                                     -0.138
  contains(miss)==True (1)                                          -0.137
  contains(mount)==True (1)                                          0.137
  contains(local)==True (1)                                          0.137
  contains(double)==True (1)                                        -0.136
  contains(new)==True (1)                                            0.136
  contains(paper)==True (1)                                          0.136
  contains(!))==True (1)                                            -0.133
  contains(5)==True (1)                                              0.133
  contains(pop)==True (1)                                            0.133
  contains(two)==True (1)                                           -0.132
  contains(fan)==True (1)                                           -0.132
  contains(consistently)==True (1)                                   0.130
  contains(sound)==True (1)                                          0.128
  contains(low)==True (1)                                            0.128
  contains(favorite)==True (1)                                      -0.128
  contains(pour)==True (1)                                           0.127
  contains(offer)==True (1)                                          0.127
  contains(thin)==True (1)                                           0.127
  contains(roof)==True (1)                                          -0.127
  contains(handle)==True (1)                                         0.122
  contains(st)==True (1)                                             0.122
  contains(location)==True (1)                                       0.121
  contains(almost)==True (1)                                        -0.121
  contains(fish)==True (1)                                           0.120
  contains(others)==True (1)                                        -0.119
  contains(c)==True (1)                                              0.119
  contains(strip)==True (1)                                          0.119
  contains(discover)==True (1)                                       0.119
  contains(south)==True (1)                                         -0.117
  contains(www)==True (1)                                            0.117
  contains(another)==True (1)                                        0.117
  contains(resulting)==True (1)                                      0.116
  contains(q)==True (1)                                             -0.115
  contains(one)==True (1)                                            0.115
  contains(10)==True (1)                                            -0.115
  contains(flash)==True (1)                                          0.114
  contains(fast)==True (1)                                           0.114
  contains(hot)==True (1)                                           -0.114
  contains(fine)==True (1)                                           0.112
  contains(outer)==True (1)                                          0.111
  contains(e)==True (1)                                             -0.111
  contains(2015)==True (1)                                          -0.111
  contains(frank)==True (1)                                         -0.110
  contains(bite)==True (1)                                          -0.110
  contains(crack)==True (1)                                         -0.110
  contains(former)==True (1)                                        -0.109
  contains(southern)==True (1)                                       0.109
  contains(appreciate)==True (1)                                     0.108
  contains(asking)==True (1)                                        -0.108
  contains(essential)==True (1)                                      0.107
  contains(temperature)==True (1)                                    0.107
  contains(post)==True (1)                                           0.106
  contains(forget)==True (1)                                         0.106
  contains(finished)==True (1)                                       0.105
  contains(old)==True (1)                                            0.105
  contains(needed)==True (1)                                        -0.105
  contains(even)==True (1)                                          -0.105
  contains(run)==True (1)                                            0.103
  contains(got)==True (1)                                           -0.103
  contains(training)==True (1)                                      -0.103
  contains(testing)==True (1)                                       -0.102
  contains(tour)==True (1)                                          -0.102
  contains(back)==True (1)                                           0.101
  contains(variety)==True (1)                                        0.100
  contains(detail)==True (1)                                         0.100
  contains(quite)==True (1)                                         -0.100
  contains(99)==True (1)                                            -0.100
  contains(exchange)==True (1)                                       0.099
  contains(called)==True (1)                                        -0.099
  contains(important)==True (1)                                      0.099
  contains(competition)==True (1)                                   -0.098
  contains(within)==True (1)                                        -0.098
  contains(later)==True (1)                                         -0.098
  contains(child)==True (1)                                          0.097
  contains(leg)==True (1)                                            0.097
  contains(25)==True (1)                                            -0.096
  contains(co)==True (1)                                             0.095
  contains(founder)==True (1)                                       -0.095
  contains(often)==True (1)                                          0.095
  contains(bucket)==True (1)                                         0.095
  contains(standard)==True (1)                                      -0.094
  contains(put)==True (1)                                            0.094
  contains(m)==True (1)                                             -0.093
  contains(popular)==True (1)                                       -0.093
  contains(east)==True (1)                                           0.093
  contains(three)==True (1)                                         -0.093
  contains(credit)==True (1)                                        -0.092
  contains(served)==True (1)                                        -0.092
  contains(easy)==True (1)                                          -0.090
  contains(beer)==True (1)                                          -0.088
  contains(held)==True (1)                                          -0.088
  contains(deeply)==True (1)                                         0.087
  contains(grab)==True (1)                                           0.086
  contains(bird)==True (1)                                           0.086
  contains(oil)==True (1)                                            0.086
  contains(beauty)==True (1)                                         0.085
  contains(dinner)==True (1)                                         0.085
  contains(restaurant)==True (1)                                    -0.085
  contains(effect)==True (1)                                        -0.085
  contains(blend)==True (1)                                         -0.085
  contains(plus)==True (1)                                          -0.084
  contains(bring)==True (1)                                         -0.084
  contains(change)==True (1)                                        -0.084
  contains(40)==True (1)                                            -0.084
  contains(flavor)==True (1)                                        -0.083
  contains(pack)==True (1)                                           0.083
  contains(form)==True (1)                                          -0.083
  contains(hill)==True (1)                                          -0.082
  contains(term)==True (1)                                          -0.082
  contains(hour)==True (1)                                           0.082
  contains(learned)==True (1)                                       -0.081
  contains(district)==True (1)                                       0.081
  contains(),)==True (1)                                             0.081
  contains(won)==True (1)                                            0.079
  contains(created)==True (1)                                        0.079
  contains(fancy)==True (1)                                         -0.078
  contains(area)==True (1)                                          -0.078
  contains(feel)==True (1)                                           0.077
  contains(sell)==True (1)                                           0.077
  contains(store)==True (1)                                          0.077
  contains(family)==True (1)                                        -0.077
  contains(pressure)==True (1)                                       0.075
  contains(cooking)==True (1)                                       -0.075
  contains(hasn)==True (1)                                          -0.074
  contains(care)==True (1)                                          -0.074
  contains(time)==True (1)                                          -0.074
  contains(japanese)==True (1)                                      -0.074
  contains(basically)==True (1)                                     -0.072
  contains(japan)==True (1)                                         -0.072
  contains(classic)==True (1)                                       -0.072
  contains(friday)==True (1)                                         0.072
  contains(including)==True (1)                                      0.072
  contains(voice)==True (1)                                         -0.071
  contains(cast)==True (1)                                           0.071
  contains(planning)==True (1)                                      -0.071
  contains(—)==True (1)                                             -0.070
  contains(list)==True (1)                                          -0.070
  contains(tomato)==True (1)                                        -0.070
  contains(secret)==True (1)                                        -0.070
  contains(empty)==True (1)                                          0.070
  contains(anything)==True (1)                                      -0.069
  contains(adam)==True (1)                                          -0.069
  contains(commitment)==True (1)                                     0.069
  contains(accessible)==True (1)                                    -0.069
  contains(call)==True (1)                                          -0.069
  contains(offering)==True (1)                                       0.069
  contains(suit)==True (1)                                          -0.068
  contains(8)==True (1)                                              0.068
  contains(fly)==True (1)                                            0.066
  contains(kind)==True (1)                                          -0.066
  contains(operation)==True (1)                                      0.066
  contains(since)==True (1)                                         -0.066
  contains(butter)==True (1)                                        -0.066
  contains(good)==True (1)                                          -0.066
  contains(distance)==True (1)                                       0.065
  contains(matter)==True (1)                                         0.065
  contains(buy)==True (1)                                           -0.064
  contains(open)==True (1)                                           0.064
  contains(degree)==True (1)                                         0.064
  contains(rose)==True (1)                                          -0.063
  contains(onto)==True (1)                                           0.063
  contains(throughout)==True (1)                                    -0.063
  contains(d)==True (1)                                             -0.063
  contains(glass)==True (1)                                          0.062
  contains(word)==True (1)                                           0.062
  contains(big)==True (1)                                            0.062
  contains(head)==True (1)                                           0.061
  contains(.’)==True (1)                                            -0.060
  contains(hold)==True (1)                                          -0.060
  contains(subtle)==True (1)                                        -0.060
  contains(bolster)==True (1)                                        0.060
  contains(everything)==True (1)                                    -0.060
  contains(dark)==True (1)                                           0.059
  contains(real)==True (1)                                           0.059
  contains(com)==True (1)                                           -0.059
  contains(soul)==True (1)                                           0.059
  contains(pleasure)==True (1)                                      -0.059
  contains(arrives)==True (1)                                       -0.058
  contains(stool)==True (1)                                         -0.058
  contains(meat)==True (1)                                          -0.057
  contains(leaf)==True (1)                                           0.057
  contains(layer)==True (1)                                          0.057
  contains(fully)==True (1)                                          0.056
  contains(set)==True (1)                                            0.055
  contains(plain)==True (1)                                         -0.055
  contains(bone)==True (1)                                          -0.055
  contains(.,)==True (1)                                            -0.054
  contains(father)==True (1)                                        -0.054
  contains(kid)==True (1)                                           -0.054
  contains(joint)==True (1)                                         -0.053
  contains(combination)==True (1)                                   -0.053
  contains(chef)==True (1)                                          -0.053
  contains(count)==True (1)                                         -0.052
  contains(signature)==True (1)                                     -0.052
  contains(element)==True (1)                                        0.052
  contains(ear)==True (1)                                            0.052
  contains(alike)==True (1)                                         -0.051
  contains(street)==True (1)                                        -0.051
  contains(washington)==True (1)                                    -0.051
  contains(golden)==True (1)                                         0.050
  contains(doesn)==True (1)                                          0.050
  contains(coated)==True (1)                                         0.050
  contains(enough)==True (1)                                        -0.049
  contains(chow)==True (1)                                           0.049
  contains(ultra)==True (1)                                         -0.048
  contains(breath)==True (1)                                        -0.048
  contains(richard)==True (1)                                       -0.048
  contains(describe)==True (1)                                      -0.048
  contains(combine)==True (1)                                        0.048
  contains(finish)==True (1)                                        -0.047
  contains(exact)==True (1)                                          0.047
  contains(upper)==True (1)                                          0.047
  contains(virtue)==True (1)                                         0.047
  contains(moist)==True (1)                                          0.047
  contains(presentation)==True (1)                                  -0.046
  contains(slightly)==True (1)                                       0.046
  contains(spring)==True (1)                                        -0.046
  contains(dig)==True (1)                                           -0.046
  contains(sunday)==True (1)                                        -0.046
  contains(stuffed)==True (1)                                       -0.045
  contains(ounce)==True (1)                                         -0.045
  contains(closely)==True (1)                                        0.045
  contains(style)==True (1)                                          0.045
  contains(.”)==True (1)                                            -0.044
  contains(believe)==True (1)                                       -0.044
  contains(mac)==True (1)                                           -0.044
  contains(dunk)==True (1)                                           0.044
  contains(true)==True (1)                                          -0.044
  contains(house)==True (1)                                         -0.044
  contains(come)==True (1)                                           0.044
  contains(passion)==True (1)                                        0.044
  contains(bonnie)==True (1)                                         0.043
  contains(nine)==True (1)                                          -0.043
  contains(clean)==True (1)                                          0.043
  contains(reveal)==True (1)                                         0.042
  contains(remains)==True (1)                                        0.042
  contains(accent)==True (1)                                         0.042
  contains(essentially)==True (1)                                   -0.041
  contains(spend)==True (1)                                         -0.041
  contains(happy)==True (1)                                          0.041
  contains(proper)==True (1)                                        -0.040
  contains(begin)==True (1)                                          0.040
  contains(involved)==True (1)                                      -0.040
  contains(becky)==True (1)                                          0.040
  contains(snack)==True (1)                                          0.040
  contains(process)==True (1)                                        0.040
  contains(paid)==True (1)                                           0.040
  contains(excuse)==True (1)                                        -0.040
  contains(tom)==True (1)                                           -0.039
  contains(forgotten)==True (1)                                     -0.039
  contains(free)==True (1)                                          -0.039
  contains(ave)==True (1)                                            0.038
  contains(maybe)==True (1)                                          0.038
  contains(prevent)==True (1)                                        0.038
  contains(basket)==True (1)                                         0.038
  contains(deal)==True (1)                                          -0.038
  contains(wing)==True (1)                                           0.037
  contains(unless)==True (1)                                         0.037
  contains(y)==True (1)                                              0.037
  contains(n)==True (1)                                             -0.037
  contains(marriage)==True (1)                                      -0.037
  contains(filling)==True (1)                                       -0.037
  contains(enjoyed)==True (1)                                       -0.037
  contains(chalk)==True (1)                                         -0.037
  contains(sometimes)==True (1)                                     -0.036
  contains(however)==True (1)                                        0.036
  contains(24)==True (1)                                            -0.036
  contains(modern)==True (1)                                        -0.036
  contains(salt)==True (1)                                          -0.036
  contains(spice)==True (1)                                         -0.036
  contains(spin)==True (1)                                           0.035
  contains(lunch)==True (1)                                          0.035
  contains(luxury)==True (1)                                        -0.035
  contains(agree)==True (1)                                          0.034
  contains(pizza)==True (1)                                         -0.034
  contains(meal)==True (1)                                          -0.033
  contains(decade)==True (1)                                        -0.033
  contains(brushed)==True (1)                                       -0.033
  contains(dropping)==True (1)                                      -0.033
  contains(stacked)==True (1)                                        0.033
  contains(stick)==True (1)                                         -0.032
  contains(sandwich)==True (1)                                      -0.032
  contains(let)==True (1)                                            0.032
  contains(95)==True (1)                                            -0.032
  contains(owner)==True (1)                                         -0.032
  contains(ray)==True (1)                                            0.031
  contains(newest)==True (1)                                        -0.031
  contains(self)==True (1)                                           0.031
  contains(merit)==True (1)                                         -0.031
  contains(whether)==True (1)                                        0.031
  contains(fat)==True (1)                                           -0.031
  contains(modification)==True (1)                                  -0.031
  contains(honey)==True (1)                                          0.030
  contains(serving)==True (1)                                        0.030
  contains(sea)==True (1)                                           -0.030
  contains(coating)==True (1)                                        0.030
  contains(consistency)==True (1)                                   -0.030
  contains(french)==True (1)                                        -0.029
  contains(knife)==True (1)                                         -0.029
  contains(resist)==True (1)                                        -0.028
  contains(keep)==True (1)                                           0.028
  contains(convenience)==True (1)                                   -0.028
  contains(longer)==True (1)                                        -0.027
  contains(respect)==True (1)                                        0.027
  contains(obsession)==True (1)                                     -0.027
  contains(thigh)==True (1)                                          0.027
  contains(waste)==True (1)                                         -0.027
  contains(simply)==True (1)                                        -0.026
  contains(potato)==True (1)                                         0.026
  contains(taste)==True (1)                                         -0.026
  contains(estate)==True (1)                                         0.025
  contains(crushed)==True (1)                                       -0.025
  contains(atop)==True (1)                                          -0.025
  contains(everywhere)==True (1)                                     0.025
  contains(navy)==True (1)                                           0.025
  contains(re)==True (1)                                            -0.025
  contains(ditch)==True (1)                                         -0.025
  contains(fall)==True (1)                                          -0.025
  contains(half)==True (1)                                          -0.025
  contains(bath)==True (1)                                          -0.025
  contains(florida)==True (1)                                       -0.025
  contains(primary)==True (1)                                        0.025
  contains(favor)==True (1)                                         -0.025
  contains(juice)==True (1)                                         -0.025
  contains(version)==True (1)                                        0.024
  contains(preparation)==True (1)                                   -0.024
  contains(12)==True (1)                                             0.024
  contains(chain)==True (1)                                          0.024
  contains(mixture)==True (1)                                        0.023
  contains(chicken)==True (1)                                        0.023
  contains(sole)==True (1)                                           0.023
  contains(inevitably)==True (1)                                    -0.022
  contains(kitchen)==True (1)                                       -0.022
  contains(want)==True (1)                                          -0.022
  contains(pepper)==True (1)                                        -0.022
  contains(much)==True (1)                                           0.021
  contains(crime)==True (1)                                         -0.021
  contains(50)==True (1)                                            -0.021
  contains(connecticut)==True (1)                                    0.021
  contains(person)==True (1)                                        -0.020
  contains(dish)==True (1)                                          -0.020
  contains(toward)==True (1)                                        -0.020
  contains(certain)==True (1)                                        0.020
  contains(rendered)==True (1)                                       0.020
  contains(side)==True (1)                                          -0.019
  contains(fried)==True (1)                                          0.019
  contains(end)==True (1)                                            0.019
  contains(go)==True (1)                                            -0.019
  contains(better)==True (1)                                        -0.019
  contains(columbia)==True (1)                                       0.018
  contains(ever)==True (1)                                           0.018
  contains(baltimore)==True (1)                                     -0.018
  contains(snap)==True (1)                                          -0.018
  contains(northeast)==True (1)                                      0.017
  contains(follows)==True (1)                                       -0.017
  contains(satisfying)==True (1)                                    -0.017
  contains(spoon)==True (1)                                          0.016
  contains(steer)==True (1)                                         -0.016
  contains(upscale)==True (1)                                       -0.016
  contains(($)==True (1)                                             0.016
  contains(ranch)==True (1)                                          0.016
  contains(pennsylvania)==True (1)                                   0.016
  contains(silver)==True (1)                                         0.016
  contains(founded)==True (1)                                       -0.016
  contains(shine)==True (1)                                          0.016
  contains(fry)==True (1)                                           -0.015
  contains(lends)==True (1)                                          0.015
  contains(heavy)==True (1)                                         -0.015
  contains(name)==True (1)                                          -0.015
  contains(adult)==True (1)                                          0.015
  contains(surprising)==True (1)                                    -0.015
  contains(treated)==True (1)                                        0.015
  contains(dose)==True (1)                                           0.015
  contains(common)==True (1)                                        -0.014
  contains(distributed)==True (1)                                    0.014
  contains(perfectly)==True (1)                                      0.014
  contains(formula)==True (1)                                        0.013
  contains(2002)==True (1)                                          -0.013
  contains(messy)==True (1)                                          0.013
  contains(emerge)==True (1)                                        -0.012
  contains(pain)==True (1)                                          -0.012
  contains(deviate)==True (1)                                        0.012
  contains(dignity)==True (1)                                        0.012
  contains(bar)==True (1)                                            0.011
  contains(technically)==True (1)                                   -0.011
  contains(tongue)==True (1)                                        -0.011
  contains(leave)==True (1)                                         -0.011
  contains(warm)==True (1)                                           0.011
  contains(eric)==True (1)                                           0.011
  contains(excursion)==True (1)                                      0.011
  contains(pool)==True (1)                                           0.010
  contains(consciousness)==True (1)                                 -0.010
  contains(sold)==True (1)                                          -0.010
  contains(occasional)==True (1)                                     0.010
  contains(snake)==True (1)                                         -0.010
  contains(se)==True (1)                                            -0.010
  contains(teeth)==True (1)                                         -0.010
  contains(portion)==True (1)                                       -0.010
  contains(american)==True (1)                                      -0.010
  contains(softer)==True (1)                                         0.010
  contains(virginia)==True (1)                                       0.009
  contains(eat)==True (1)                                           -0.009
  contains(jersey)==True (1)                                        -0.009
  contains(crunch)==True (1)                                         0.008
  contains(dip)==True (1)                                            0.008
  contains(14)==True (1)                                             0.008
  contains(dad)==True (1)                                           -0.008
  contains(attached)==True (1)                                       0.008
  contains(sink)==True (1)                                          -0.008
  contains(nashville)==True (1)                                     -0.007
  contains(translucent)==True (1)                                    0.007
  contains(tender)==True (1)                                        -0.007
  contains(corn)==True (1)                                          -0.007
  contains(circle)==True (1)                                         0.007
  contains(couldn)==True (1)                                        -0.006
  contains(fritz)==True (1)                                         -0.006
  contains(j)==True (1)                                              0.006
  contains(appetite)==True (1)                                      -0.006
  contains(taylor)==True (1)                                        -0.005
  contains(georgia)==True (1)                                       -0.005
  contains(twice)==True (1)                                          0.005
  contains(generously)==True (1)                                    -0.005
  contains(yard)==True (1)                                           0.005
  contains(protein)==True (1)                                       -0.005
  contains(commonly)==True (1)                                      -0.004
  contains(shattering)==True (1)                                    -0.004
  contains(iron)==True (1)                                          -0.004
  contains(bag)==True (1)                                           -0.004
  contains(fairfax)==True (1)                                        0.004
  contains(named)==True (1)                                          0.004
  contains(wheat)==True (1)                                         -0.004
  contains(frequently)==True (1)                                     0.004
  contains(turned)==True (1)                                         0.004
  contains(plate)==True (1)                                          0.003
  contains(eats)==True (1)                                           0.003
  contains(pas)==True (1)                                           -0.003
  contains(line)==True (1)                                          -0.003
  contains(popcorn)==True (1)                                        0.002
  contains(traditional)==True (1)                                   -0.002
  contains(high)==True (1)                                          -0.002
  contains(whole)==True (1)                                         -0.002
  contains(karaoke)==True (1)                                       -0.002
  contains(dedicated)==True (1)                                     -0.002
  contains(frying)==True (1)                                         0.002
  contains(picnic)==True (1)                                        -0.002
  contains(ordering)==True (1)                                       0.001
  contains(flesh)==True (1)                                          0.001
  contains(yield)==True (1)                                          0.001
  contains(vernon)==True (1)                                         0.001
  contains(crystal)==True (1)                                        0.001
  contains(minute)==True (1)                                         0.001
  contains(developed)==True (1)                                     -0.001
  contains(memorable)==True (1)                                     -0.001
  contains(pan)==True (1)                                           -0.000
  contains(latching)==True (1)                                       0.000
  contains(home)==True (1)                                          -0.000
  contains(e)==True (1)                                                      1.793
  contains(“)==True (1)                                                      1.469
  contains(‘)==True (1)                                                      1.215
  contains(’)==True (1)                                                      0.919
  contains(9)==True (1)                                                      0.830
  contains(,”)==True (1)                                                     0.770
  contains(.”)==True (1)                                                     0.688
  contains(tour)==True (1)                                                   0.599
  contains(child)==True (1)                                                  0.589
  contains(best)==True (1)                                                  -0.549
  contains(finished)==True (1)                                               0.548
  contains(margaret)==True (1)                                               0.538
  contains(get)==True (1)                                                   -0.483
  contains(5)==True (1)                                                     -0.457
  contains(bone)==True (1)                                                   0.446
  contains(like)==True (1)                                                  -0.446
  contains(review)==True (1)                                                 0.444
  contains(”)==True (1)                                                      0.425
  contains(14)==True (1)                                                     0.416
  contains(u)==True (1)                                                     -0.411
  contains(post)==True (1)                                                   0.408
  contains(old)==True (1)                                                    0.399
  contains(crime)==True (1)                                                  0.390
  contains(real)==True (1)                                                  -0.389
  contains(family)==True (1)                                                 0.385
  contains(age)==True (1)                                                    0.374
  contains(believe)==True (1)                                                0.368
  contains(friday)==True (1)                                                 0.355
  contains(well)==True (1)                                                   0.340
  contains(top)==True (1)                                                   -0.338
  contains(.))==True (1)                                                     0.337
  contains(bar)==True (1)                                                    0.334
  contains(home)==True (1)                                                   0.333
  contains(time)==True (1)                                                   0.325
  contains(found)==True (1)                                                  0.320
  contains(thing)==True (1)                                                  0.315
  contains(something)==True (1)                                              0.290
  contains(big)==True (1)                                                   -0.289
  contains(change)==True (1)                                                 0.287
  contains(result)==True (1)                                                -0.277
  contains(called)==True (1)                                                -0.277
  contains(modern)==True (1)                                                -0.272
  contains(said)==True (1)                                                   0.272
  contains(double)==True (1)                                                 0.261
  contains(deal)==True (1)                                                  -0.252
  contains(side)==True (1)                                                  -0.250
  contains(since)==True (1)                                                 -0.245
  contains(free)==True (1)                                                  -0.243
  label is 'books' (1)                                                      -0.240
  contains(american)==True (1)                                               0.238
  contains(made)==True (1)                                                  -0.237
  contains(could)==True (1)                                                 -0.235
  contains(bring)==True (1)                                                  0.233
  contains(often)==True (1)                                                 -0.232
  contains(call)==True (1)                                                  -0.231
  contains(woman)==True (1)                                                  0.229
  contains(taking)==True (1)                                                -0.227
  contains(8)==True (1)                                                     -0.224
  contains(doesn)==True (1)                                                 -0.223
  contains(offer)==True (1)                                                 -0.217
  contains(voice)==True (1)                                                  0.213
  contains(much)==True (1)                                                  -0.212
  contains(want)==True (1)                                                  -0.208
  contains(store)==True (1)                                                 -0.205
  contains(m)==True (1)                                                      0.204
  contains(deep)==True (1)                                                  -0.202
  contains(minute)==True (1)                                                -0.201
  contains(leave)==True (1)                                                  0.199
  contains(le)==True (1)                                                    -0.195
  contains(take)==True (1)                                                  -0.194
  contains(cast)==True (1)                                                  -0.192
  contains(recipe)==True (1)                                                -0.189
  contains(account)==True (1)                                               -0.189
  contains(father)==True (1)                                                 0.189
  contains(finish)==True (1)                                                -0.184
  contains(list)==True (1)                                                   0.182
  contains(come)==True (1)                                                   0.180
  contains(won)==True (1)                                                   -0.176
  contains(popular)==True (1)                                               -0.176
  contains(become)==True (1)                                                -0.175
  contains(others)==True (1)                                                 0.175
  contains(item)==True (1)                                                   0.175
  contains(punishment)==True (1)                                             0.174
  contains(sure)==True (1)                                                  -0.173
  contains(obsession)==True (1)                                              0.172
  contains(know)==True (1)                                                   0.171
  contains(black)==True (1)                                                 -0.171
  contains(tender)==True (1)                                                 0.171
  contains(available)==True (1)                                              0.170
  contains(name)==True (1)                                                  -0.169
  contains(hot)==True (1)                                                   -0.169
  contains(brand)==True (1)                                                 -0.169
  contains(half)==True (1)                                                   0.169
  contains(created)==True (1)                                               -0.167
  contains(let)==True (1)                                                    0.165
  contains(co)==True (1)                                                    -0.163
  contains(fan)==True (1)                                                   -0.162
  contains(matter)==True (1)                                                 0.160
  contains(maybe)==True (1)                                                  0.159
  contains(street)==True (1)                                                 0.158
  contains(using)==True (1)                                                 -0.157
  contains(adult)==True (1)                                                  0.157
  contains(fully)==True (1)                                                  0.156
  contains(10)==True (1)                                                    -0.156
  contains(got)==True (1)                                                   -0.153
  contains(former)==True (1)                                                -0.153
  contains(whole)==True (1)                                                 -0.152
  contains(three)==True (1)                                                  0.152
  contains(though)==True (1)                                                 0.151
  contains(wanted)==True (1)                                                 0.149
  contains(25)==True (1)                                                    -0.149
  contains(crack)==True (1)                                                  0.149
  contains(end)==True (1)                                                    0.147
  contains(whether)==True (1)                                                0.147
  contains(founder)==True (1)                                               -0.145
  contains(back)==True (1)                                                  -0.138
  contains(single)==True (1)                                                -0.138
  contains(place)==True (1)                                                 -0.136
  contains(surprising)==True (1)                                             0.136
  contains(started)==True (1)                                               -0.135
  contains(decade)==True (1)                                                 0.135
  contains(d)==True (1)                                                     -0.134
  contains(begin)==True (1)                                                 -0.134
  contains(instead)==True (1)                                                0.134
  contains(anything)==True (1)                                               0.134
  contains(longer)==True (1)                                                 0.133
  contains(related)==True (1)                                               -0.133
  contains(sunday)==True (1)                                                -0.131
  contains(including)==True (1)                                              0.130
  contains(.’)==True (1)                                                     0.130
  contains(paid)==True (1)                                                  -0.128
  contains(wing)==True (1)                                                  -0.127
  contains(named)==True (1)                                                 -0.123
  contains(keep)==True (1)                                                   0.122
  contains(favorite)==True (1)                                              -0.122
  contains(developed)==True (1)                                             -0.122
  contains(perfectly)==True (1)                                              0.119
  contains(seems)==True (1)                                                  0.117
  contains(fine)==True (1)                                                  -0.117
  contains(although)==True (1)                                               0.116
  contains(lot)==True (1)                                                   -0.116
  contains(yard)==True (1)                                                   0.115
  contains(area)==True (1)                                                  -0.115
  contains(part)==True (1)                                                   0.114
  contains(washington)==True (1)                                             0.114
  contains(consciousness)==True (1)                                          0.113
  contains(tim)==True (1)                                                   -0.112
  contains(within)==True (1)                                                -0.112
  contains(re)==True (1)                                                    -0.112
  contains(pleasure)==True (1)                                              -0.112
  contains(com)==True (1)                                                   -0.111
  contains(kitchen)==True (1)                                               -0.110
  contains(also)==True (1)                                                   0.110
  contains(feel)==True (1)                                                  -0.109
  contains(2015)==True (1)                                                  -0.109
  contains(v)==True (1)                                                     -0.108
  contains(buy)==True (1)                                                    0.108
  contains(line)==True (1)                                                   0.108
  contains(example)==True (1)                                               -0.107
  contains(set)==True (1)                                                   -0.107
  contains(ray)==True (1)                                                   -0.107
  contains(sink)==True (1)                                                   0.105
  contains(say)==True (1)                                                   -0.105
  contains(everything)==True (1)                                            -0.105
  contains(),)==True (1)                                                    -0.103
  contains(competition)==True (1)                                           -0.102
  contains(offering)==True (1)                                              -0.102
  contains(r)==True (1)                                                     -0.101
  contains(paper)==True (1)                                                 -0.101
  contains(self)==True (1)                                                   0.101
  contains(ask)==True (1)                                                   -0.101
  contains(head)==True (1)                                                   0.100
  contains(50)==True (1)                                                     0.099
  contains(owner)==True (1)                                                 -0.095
  contains(stock)==True (1)                                                 -0.095
  contains(trick)==True (1)                                                 -0.095
  contains(one)==True (1)                                                   -0.094
  contains(white)==True (1)                                                 -0.092
  contains(option)==True (1)                                                -0.092
  contains(happy)==True (1)                                                 -0.090
  contains(important)==True (1)                                             -0.090
  contains(near)==True (1)                                                   0.089
  contains(low)==True (1)                                                   -0.089
  contains(good)==True (1)                                                   0.088
  contains(french)==True (1)                                                -0.088
  contains(actually)==True (1)                                              -0.088
  contains(fresh)==True (1)                                                 -0.087
  contains(house)==True (1)                                                 -0.087
  contains(music)==True (1)                                                 -0.086
  contains(shard)==True (1)                                                  0.085
  contains(dig)==True (1)                                                    0.084
  contains(word)==True (1)                                                   0.084
  contains(perfect)==True (1)                                               -0.083
  contains(effect)==True (1)                                                -0.083
  contains(nine)==True (1)                                                   0.083
  contains(creator)==True (1)                                               -0.083
  contains(build)==True (1)                                                 -0.082
  contains(rose)==True (1)                                                  -0.081
  contains(tom)==True (1)                                                   -0.081
  contains(enjoyed)==True (1)                                               -0.081
  contains(butterfly)==True (1)                                              0.081
  contains(roll)==True (1)                                                  -0.081
  contains(adam)==True (1)                                                  -0.081
  contains(leaf)==True (1)                                                   0.080
  contains(operation)==True (1)                                             -0.080
  contains(plate)==True (1)                                                 -0.079
  contains(later)==True (1)                                                  0.079
  contains(turned)==True (1)                                                 0.079
  contains(frank)==True (1)                                                 -0.078
  contains(really)==True (1)                                                 0.078
  contains(stuff)==True (1)                                                 -0.077
  contains(open)==True (1)                                                  -0.077
  contains(form)==True (1)                                                   0.077
  contains(inspired)==True (1)                                              -0.077
  contains(prevent)==True (1)                                               -0.075
  contains(common)==True (1)                                                 0.074
  contains(sea)==True (1)                                                   -0.072
  contains(signature)==True (1)                                              0.072
  contains(variety)==True (1)                                               -0.071
  contains(forget)==True (1)                                                -0.071
  contains(south)==True (1)                                                 -0.071
  contains(oil)==True (1)                                                   -0.070
  contains(fly)==True (1)                                                    0.069
  contains(try)==True (1)                                                    0.069
  contains(bird)==True (1)                                                  -0.069
  contains(long)==True (1)                                                   0.068
  contains(sold)==True (1)                                                   0.068
  contains(fat)==True (1)                                                   -0.068
  contains(distance)==True (1)                                               0.067
  contains(discover)==True (1)                                              -0.067
  contains(teeth)==True (1)                                                 -0.067
  contains(food)==True (1)                                                   0.066
  contains(throughout)==True (1)                                            -0.066
  contains(care)==True (1)                                                  -0.066
  contains(person)==True (1)                                                 0.066
  contains(creating)==True (1)                                              -0.065
  contains(reveal)==True (1)                                                 0.065
  contains(planning)==True (1)                                               0.065
  contains(stool)==True (1)                                                 -0.064
  contains(dark)==True (1)                                                  -0.064
  contains(juice)==True (1)                                                 -0.064
  contains(richard)==True (1)                                               -0.063
  contains(michel)==True (1)                                                 0.063
  contains(24)==True (1)                                                    -0.063
  contains(month)==True (1)                                                  0.063
  contains(chili)==True (1)                                                  0.062
  contains(pool)==True (1)                                                  -0.062
  contains(spent)==True (1)                                                  0.062
  contains(arrives)==True (1)                                               -0.061
  contains(learned)==True (1)                                                0.061
  contains(st)==True (1)                                                     0.061
  contains(den)==True (1)                                                   -0.060
  contains(cut)==True (1)                                                   -0.059
  contains(thought)==True (1)                                                0.059
  contains(john)==True (1)                                                   0.059
  contains(c)==True (1)                                                     -0.058
  contains(would)==True (1)                                                  0.058
  contains(eric)==True (1)                                                  -0.057
  contains(fancy)==True (1)                                                  0.057
  contains(cooking)==True (1)                                               -0.057
  contains(rather)==True (1)                                                 0.057
  contains(hasn)==True (1)                                                   0.057
  contains(pressure)==True (1)                                              -0.057
  contains(essentially)==True (1)                                           -0.056
  contains(chicken)==True (1)                                               -0.056
  contains(go)==True (1)                                                     0.056
  contains(allow)==True (1)                                                 -0.056
  contains(true)==True (1)                                                  -0.056
  contains(snack)==True (1)                                                 -0.055
  contains(secret)==True (1)                                                -0.055
  contains(biscuit)==True (1)                                                0.054
  contains(need)==True (1)                                                  -0.054
  contains(ordering)==True (1)                                               0.054
  contains(crushed)==True (1)                                                0.054
  contains(12)==True (1)                                                    -0.053
  contains(seasoned)==True (1)                                               0.053
  contains(($)==True (1)                                                     0.053
  contains(bag)==True (1)                                                   -0.053
  contains(cart)==True (1)                                                   0.053
  contains(driving)==True (1)                                               -0.053
  contains(florida)==True (1)                                               -0.053
  contains(excuse)==True (1)                                                -0.053
  contains(formula)==True (1)                                               -0.053
  contains(sound)==True (1)                                                  0.052
  contains(dad)==True (1)                                                    0.052
  contains(better)==True (1)                                                -0.052
  contains(shore)==True (1)                                                  0.051
  contains(high)==True (1)                                                   0.051
  contains(even)==True (1)                                                   0.051
  contains(commitment)==True (1)                                             0.051
  contains(follows)==True (1)                                                0.051
  contains(another)==True (1)                                                0.051
  contains(base)==True (1)                                                  -0.050
  contains(degree)==True (1)                                                -0.050
  contains(empty)==True (1)                                                  0.049
  contains(respect)==True (1)                                                0.048
  contains(central)==True (1)                                               -0.047
  contains(couldn)==True (1)                                                 0.046
  contains(blend)==True (1)                                                 -0.046
  contains(may)==True (1)                                                   -0.046
  contains(wolf)==True (1)                                                   0.046
  contains(twice)==True (1)                                                 -0.046
  contains(version)==True (1)                                               -0.045
  contains(put)==True (1)                                                   -0.045
  contains(—)==True (1)                                                     -0.045
  contains(taylor)==True (1)                                                 0.045
  contains(handle)==True (1)                                                 0.045
  contains(audible)==True (1)                                                0.045
  contains(plus)==True (1)                                                  -0.045
  contains(involved)==True (1)                                              -0.044
  contains(leg)==True (1)                                                   -0.043
  contains(east)==True (1)                                                  -0.043
  contains(traditional)==True (1)                                           -0.041
  contains(crystal)==True (1)                                                0.041
  contains(dose)==True (1)                                                   0.041
  contains(warm)==True (1)                                                   0.041
  contains(knife)==True (1)                                                 -0.040
  contains(bay)==True (1)                                                   -0.040
  contains(exterior)==True (1)                                               0.040
  contains(ll)==True (1)                                                    -0.040
  contains(simply)==True (1)                                                -0.039
  contains(hill)==True (1)                                                   0.039
  contains(asking)==True (1)                                                 0.039
  contains(without)==True (1)                                               -0.039
  contains(y)==True (1)                                                     -0.039
  contains(dennis)==True (1)                                                 0.039
  contains(ever)==True (1)                                                   0.038
  contains(clean)==True (1)                                                 -0.038
  contains(passion)==True (1)                                               -0.038
  contains(beauty)==True (1)                                                 0.038
  contains(hold)==True (1)                                                  -0.038
  contains(quite)==True (1)                                                 -0.037
  contains(eat)==True (1)                                                   -0.037
  contains(onto)==True (1)                                                   0.037
  contains(grilled)==True (1)                                               -0.036
  contains(attached)==True (1)                                              -0.036
  contains(miss)==True (1)                                                  -0.036
  contains(order)==True (1)                                                 -0.036
  contains(getting)==True (1)                                               -0.036
  contains(newest)==True (1)                                                -0.035
  contains(generous)==True (1)                                               0.035
  contains(held)==True (1)                                                  -0.035
  contains(kid)==True (1)                                                   -0.035
  contains(40)==True (1)                                                    -0.034
  contains(glass)==True (1)                                                 -0.034
  contains(chef)==True (1)                                                  -0.034
  contains(hand)==True (1)                                                  -0.034
  contains(toward)==True (1)                                                -0.034
  contains(element)==True (1)                                               -0.034
  contains(stick)==True (1)                                                  0.033
  contains(spend)==True (1)                                                 -0.033
  contains(almost)==True (1)                                                -0.033
  contains(resist)==True (1)                                                -0.033
  contains(classic)==True (1)                                                0.033
  contains(stuffed)==True (1)                                               -0.033
  contains(j)==True (1)                                                     -0.033
  contains(occasional)==True (1)                                             0.032
  contains(japan)==True (1)                                                 -0.032
  contains(marriage)==True (1)                                              -0.032
  contains(count)==True (1)                                                 -0.031
  contains(spin)==True (1)                                                  -0.031
  contains(treated)==True (1)                                               -0.031
  contains(picnic)==True (1)                                                -0.030
  contains(right)==True (1)                                                 -0.030
  contains(unless)==True (1)                                                 0.030
  contains(remains)==True (1)                                               -0.030
  contains(enough)==True (1)                                                 0.029
  contains(memorable)==True (1)                                             -0.029
  contains(protein)==True (1)                                                0.029
  contains(exact)==True (1)                                                  0.029
  contains().)==True (1)                                                    -0.029
  contains(iron)==True (1)                                                  -0.029
  contains(roof)==True (1)                                                   0.028
  contains(savor)==True (1)                                                  0.027
  contains(never)==True (1)                                                 -0.026
  contains(everywhere)==True (1)                                            -0.026
  contains(heavy)==True (1)                                                  0.026
  contains(ear)==True (1)                                                   -0.026
  contains(hard)==True (1)                                                  -0.026
  contains(dignity)==True (1)                                                0.025
  contains(topped)==True (1)                                                -0.025
  contains(butter)==True (1)                                                 0.025
  contains(probably)==True (1)                                               0.024
  contains(ala)==True (1)                                                    0.024
  contains(skin)==True (1)                                                   0.024
  contains(baltimore)==True (1)                                             -0.024
  contains(used)==True (1)                                                  -0.024
  contains(serving)==True (1)                                               -0.024
  contains(stacked)==True (1)                                               -0.024
  contains(style)==True (1)                                                  0.024
  contains(circle)==True (1)                                                 0.023
  contains(restaurant)==True (1)                                             0.023
  contains(strip)==True (1)                                                 -0.023
  contains(pain)==True (1)                                                  -0.023
  contains(japanese)==True (1)                                              -0.023
  contains(fish)==True (1)                                                  -0.023
  contains(pizza)==True (1)                                                 -0.023
  contains(pan)==True (1)                                                   -0.023
  contains(local)==True (1)                                                  0.022
  contains(1971)==True (1)                                                  -0.022
  contains(certain)==True (1)                                               -0.022
  contains(tartare)==True (1)                                                0.022
  contains(gauche)==True (1)                                                 0.022
  contains(filling)==True (1)                                               -0.022
  contains(frequently)==True (1)                                             0.022
  contains(reserve)==True (1)                                                0.021
  contains(basket)==True (1)                                                 0.021
  contains(inevitably)==True (1)                                             0.021
  contains(bucket)==True (1)                                                -0.021
  contains(hour)==True (1)                                                   0.021
  contains(variation)==True (1)                                             -0.021
  contains(process)==True (1)                                               -0.021
  contains(thin)==True (1)                                                  -0.021
  contains(merit)==True (1)                                                  0.021
  contains(country)==True (1)                                               -0.020
  contains(2002)==True (1)                                                  -0.020
  contains(grab)==True (1)                                                  -0.020
  contains(fall)==True (1)                                                  -0.019
  contains(accused)==True (1)                                                0.019
  contains(luxury)==True (1)                                                -0.019
  contains(menu)==True (1)                                                   0.019
  contains(famed)==True (1)                                                 -0.019
  contains(!))==True (1)                                                     0.018
  contains(mean)==True (1)                                                   0.018
  contains(convenience)==True (1)                                            0.018
  contains(fried)==True (1)                                                 -0.018
  contains(dipping)==True (1)                                                0.018
  contains(virginia)==True (1)                                               0.017
  contains(korea)==True (1)                                                  0.017
  contains(sandwich)==True (1)                                              -0.017
  contains(easy)==True (1)                                                   0.016
  contains(.,)==True (1)                                                    -0.016
  contains(clerk)==True (1)                                                  0.016
  contains(new)==True (1)                                                   -0.016
  contains(rendered)==True (1)                                               0.016
  contains(fast)==True (1)                                                  -0.016
  contains(meal)==True (1)                                                   0.016
  contains(flash)==True (1)                                                 -0.016
  contains(wonderfully)==True (1)                                            0.016
  contains(however)==True (1)                                               -0.016
  contains(pop)==True (1)                                                   -0.016
  contains(korean)==True (1)                                                -0.016
  contains(run)==True (1)                                                    0.015
  contains(steak)==True (1)                                                  0.014
  contains(dedicated)==True (1)                                             -0.014
  contains(taste)==True (1)                                                 -0.014
  contains(roasted)==True (1)                                               -0.014
  contains(cooper)==True (1)                                                 0.014
  contains(proper)==True (1)                                                 0.013
  contains(silver)==True (1)                                                 0.013
  contains(flesh)==True (1)                                                  0.013
  contains(wheat)==True (1)                                                  0.013
  contains(romeo)==True (1)                                                  0.013
  contains(piece)==True (1)                                                  0.013
  contains(ultra)==True (1)                                                 -0.012
  contains(abbott)==True (1)                                                 0.012
  contains(layer)==True (1)                                                 -0.012
  contains(gasp)==True (1)                                                   0.012
  contains(combine)==True (1)                                               -0.012
  contains(shine)==True (1)                                                 -0.012
  contains(southern)==True (1)                                               0.012
  contains(sole)==True (1)                                                   0.011
  contains(essential)==True (1)                                              0.011
  contains(describe)==True (1)                                               0.011
  contains(upper)==True (1)                                                 -0.011
  contains(primary)==True (1)                                               -0.011
  contains(basically)==True (1)                                              0.011
  contains(morgan)==True (1)                                                 0.010
  contains(suit)==True (1)                                                  -0.010
  contains(q)==True (1)                                                      0.010
  contains(combination)==True (1)                                           -0.010
  contains(sometimes)==True (1)                                             -0.010
  contains(bite)==True (1)                                                  -0.010
  contains(size)==True (1)                                                  -0.009
  contains(every)==True (1)                                                 -0.009
  contains(capitol)==True (1)                                                0.009
  contains(served)==True (1)                                                -0.009
  contains(smack)==True (1)                                                  0.008
  contains(taken)==True (1)                                                 -0.008
  contains(bryan)==True (1)                                                 -0.008
  contains(emerge)==True (1)                                                 0.008
  contains(spring)==True (1)                                                -0.008
  contains(bum)==True (1)                                                    0.008
  contains(sell)==True (1)                                                  -0.008
  contains(owns)==True (1)                                                   0.007
  contains(dinner)==True (1)                                                -0.007
  contains(tongue)==True (1)                                                -0.007
  contains(detail)==True (1)                                                -0.007
  contains(finger)==True (1)                                                 0.007
  contains(accompanying)==True (1)                                           0.007
  contains(ranch)==True (1)                                                  0.006
  contains(lily)==True (1)                                                   0.006
  contains(slightly)==True (1)                                               0.006
  contains(crumb)==True (1)                                                  0.006
  contains(rob)==True (1)                                                   -0.006
  contains(credit)==True (1)                                                 0.006
  contains(pack)==True (1)                                                  -0.006
  contains(forgotten)==True (1)                                              0.006
  contains(district)==True (1)                                               0.005
  contains(breast)==True (1)                                                 0.005
  contains(estate)==True (1)                                                 0.005
  contains(needed)==True (1)                                                -0.005
  contains(kind)==True (1)                                                  -0.005
  contains(agree)==True (1)                                                 -0.005
  contains(connecticut)==True (1)                                            0.005
  contains(golden)==True (1)                                                 0.005
  contains(spoon)==True (1)                                                 -0.005
  contains(argue)==True (1)                                                  0.004
  contains(portion)==True (1)                                                0.004
  contains(deeply)==True (1)                                                 0.004
  contains(upscale)==True (1)                                                0.004
  contains(guilty)==True (1)                                                -0.004
  contains(favor)==True (1)                                                 -0.004
  contains(soul)==True (1)                                                   0.004
  contains(maryland)==True (1)                                               0.004
  contains(lunch)==True (1)                                                  0.004
  contains(pour)==True (1)                                                   0.003
  contains(beer)==True (1)                                                   0.003
  contains(commonly)==True (1)                                               0.003
  contains(satisfying)==True (1)                                             0.003
  contains(pas)==True (1)                                                   -0.003
  contains(breath)==True (1)                                                 0.003
  contains(revenge)==True (1)                                                0.003
  contains(standard)==True (1)                                              -0.003
  contains(training)==True (1)                                              -0.003
  contains(founded)==True (1)                                               -0.003
  contains(accent)==True (1)                                                 0.003
  contains(ditch)==True (1)                                                  0.002
  contains(term)==True (1)                                                   0.002
  contains(year)==True (1)                                                   0.002
  contains(crunch)==True (1)                                                -0.002
  contains(appreciate)==True (1)                                             0.001
  contains(salty)==True (1)                                                 -0.001
  contains(yield)==True (1)                                                 -0.001
  contains(outright)==True (1)                                              -0.001
  contains(extremely)==True (1)                                              0.001
  contains(foam)==True (1)                                                  -0.001
  contains(two)==True (1)                                                   -0.001
  contains(georgia)==True (1)                                                0.001
  contains(exchange)==True (1)                                               0.001
  contains(addictive)==True (1)                                             -0.001
  contains(bath)==True (1)                                                  -0.000
  contains(pony)==True (1)                                                   0.000
  contains(bread)==True (1)                                                 -0.000
  contains(chewing)==True (1)                                                0.000
  ---------------------------------------------------------------------------------
  TOTAL:                                            39.803   8.837   7.184   6.190
  PROBS:                                             1.000   0.000   0.000   0.000

In [144]:
classifier.explain(document_features(get_text("nrRB0.html")))


  Feature                                            books  design data_sc busines
  --------------------------------------------------------------------------------
  contains(e)==True (1)                              1.793
  contains(“)==True (1)                              1.469
  contains(‘)==True (1)                              1.215
  contains(’)==True (1)                              0.919
  contains(9)==True (1)                              0.830
  contains(5)==True (1)                             -0.457
  contains(”)==True (1)                              0.425
  contains(u)==True (1)                             -0.411
  label is 'books' (1)                              -0.240
  contains(6)==True (1)                             -0.226
  contains(8)==True (1)                             -0.224
  contains(3)==True (1)                             -0.222
  contains(m)==True (1)                              0.204
  contains(2)==True (1)                             -0.184
  contains(1)==True (1)                             -0.177
  contains(d)==True (1)                             -0.134
  contains(b)==True (1)                             -0.132
  contains(v)==True (1)                             -0.108
  contains(r)==True (1)                             -0.101
  contains(4)==True (1)                             -0.099
  contains(f)==True (1)                              0.062
  contains(l)==True (1)                             -0.060
  contains(c)==True (1)                             -0.058
  contains(—)==True (1)                             -0.045
  contains(y)==True (1)                             -0.039
  contains(j)==True (1)                             -0.033
  contains(g)==True (1)                             -0.029
  contains(z)==True (1)                             -0.026
  contains(o)==True (1)                              0.018
  contains(k)==True (1)                              0.016
  contains(p)==True (1)                             -0.015
  contains(q)==True (1)                              0.010
  contains(7)==True (1)                              0.007
  contains(’)==True (1)                                      1.598
  contains(2)==True (1)                                      0.619
  contains(x)==True (1)                                      0.481
  contains(h)==True (1)                                      0.464
  contains(r)==True (1)                                      0.459
  contains(4)==True (1)                                      0.434
  contains(0)==True (1)                                     -0.322
  contains(o)==True (1)                                     -0.309
  contains(‘)==True (1)                                     -0.292
  contains(l)==True (1)                                      0.260
  label is 'design' (1)                                      0.248
  contains(c)==True (1)                                     -0.222
  contains(q)==True (1)                                      0.206
  contains(1)==True (1)                                      0.201
  contains(d)==True (1)                                      0.198
  contains(7)==True (1)                                     -0.178
  contains(v)==True (1)                                     -0.176
  contains(—)==True (1)                                     -0.152
  contains(3)==True (1)                                      0.136
  contains(9)==True (1)                                     -0.122
  contains(w)==True (1)                                      0.119
  contains(u)==True (1)                                     -0.104
  contains(b)==True (1)                                     -0.104
  contains(“)==True (1)                                      0.099
  contains(p)==True (1)                                     -0.097
  contains(f)==True (1)                                     -0.074
  contains(5)==True (1)                                      0.065
  contains(z)==True (1)                                     -0.062
  contains(8)==True (1)                                     -0.056
  contains(e)==True (1)                                     -0.042
  contains(y)==True (1)                                     -0.038
  contains(m)==True (1)                                     -0.033
  contains(k)==True (1)                                      0.031
  contains(j)==True (1)                                      0.011
  contains(”)==True (1)                                     -0.006
  contains(6)==True (1)                                     -0.002
  contains(r)==True (1)                                              1.112
  contains(—)==True (1)                                             -0.757
  label is 'data_science' (1)                                        0.713
  contains(m)==True (1)                                              0.641
  contains(’)==True (1)                                              0.493
  contains(”)==True (1)                                              0.412
  contains(4)==True (1)                                             -0.316
  contains(8)==True (1)                                             -0.268
  contains(6)==True (1)                                              0.249
  contains(w)==True (1)                                              0.247
  contains(g)==True (1)                                              0.236
  contains(d)==True (1)                                              0.194
  contains(p)==True (1)                                              0.178
  contains(k)==True (1)                                              0.173
  contains(3)==True (1)                                             -0.163
  contains(n)==True (1)                                              0.138
  contains(5)==True (1)                                             -0.131
  contains(2)==True (1)                                              0.126
  contains(o)==True (1)                                              0.108
  contains(1)==True (1)                                              0.106
  contains(l)==True (1)                                             -0.098
  contains(y)==True (1)                                             -0.070
  contains(9)==True (1)                                             -0.069
  contains(b)==True (1)                                             -0.063
  contains(x)==True (1)                                              0.061
  contains(j)==True (1)                                              0.058
  contains(“)==True (1)                                             -0.055
  contains(f)==True (1)                                             -0.049
  contains(c)==True (1)                                             -0.048
  contains(z)==True (1)                                              0.046
  contains(7)==True (1)                                              0.043
  contains(q)==True (1)                                             -0.043
  contains(v)==True (1)                                             -0.042
  contains(e)==True (1)                                              0.040
  contains(‘)==True (1)                                              0.034
  contains(0)==True (1)                                              0.026
  contains(h)==True (1)                                             -0.017
  contains(u)==True (1)                                             -0.013
  label is 'business' (1)                                                    2.980
  contains(’)==True (1)                                                     -1.420
  contains(5)==True (1)                                                      1.019
  contains(6)==True (1)                                                      0.813
  contains(“)==True (1)                                                     -0.661
  contains(3)==True (1)                                                      0.582
  contains(4)==True (1)                                                      0.557
  contains(2)==True (1)                                                     -0.507
  contains(x)==True (1)                                                     -0.445
  contains(y)==True (1)                                                      0.391
  contains(e)==True (1)                                                     -0.372
  contains(r)==True (1)                                                     -0.368
  contains(7)==True (1)                                                      0.305
  contains(1)==True (1)                                                     -0.290
  contains(v)==True (1)                                                     -0.260
  contains(‘)==True (1)                                                     -0.256
  contains(u)==True (1)                                                     -0.244
  contains(l)==True (1)                                                     -0.166
  contains(8)==True (1)                                                     -0.163
  contains(n)==True (1)                                                      0.149
  contains(h)==True (1)                                                     -0.147
  contains(9)==True (1)                                                     -0.136
  contains(j)==True (1)                                                      0.135
  contains(m)==True (1)                                                     -0.126
  contains(b)==True (1)                                                     -0.117
  contains(d)==True (1)                                                      0.097
  contains(c)==True (1)                                                      0.082
  contains(o)==True (1)                                                     -0.078
  contains(g)==True (1)                                                      0.064
  contains(p)==True (1)                                                     -0.062
  contains(k)==True (1)                                                     -0.049
  contains(q)==True (1)                                                     -0.047
  contains(0)==True (1)                                                      0.043
  contains(f)==True (1)                                                      0.028
  contains(”)==True (1)                                                      0.022
  contains(z)==True (1)                                                     -0.014
  contains(w)==True (1)                                                      0.008
  contains(—)==True (1)                                                     -0.006
  ---------------------------------------------------------------------------------
  TOTAL:                                             3.946   3.240   3.231   1.340
  PROBS:                                             0.354   0.217   0.216   0.058
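
Note that every feature in this second table is a single character (contains(e), contains(9)) rather than a word, as in the table before it. One plausible cause, and this is an assumption because document_features is defined earlier in the notebook, is that the raw string from get_text was passed where a list of tokens was expected, so the extractor iterated over characters. The sketch below reproduces that pitfall with a hypothetical stand-in for document_features and a simple regex tokenizer in place of nltk.word_tokenize.

```python
import re


def document_features(tokens):
    # Hypothetical stand-in for the notebook's extractor: one boolean
    # "contains(...)" feature per distinct item it is given.
    return {"contains({})".format(tok.lower()): True for tok in set(tokens)}


def simple_tokenize(text):
    # Crude word tokenizer used only to keep the sketch dependency-free.
    return re.findall(r"\w+", text.lower())


raw = "Fried chicken"

# Passing the raw string iterates over characters: contains(f), contains(r), ...
char_feats = document_features(raw)

# Tokenizing first yields word-level features: contains(fried), contains(chicken)
word_feats = document_features(simple_tokenize(raw))
```

If that is what happened here, tokenizing the text before extracting features would give word features comparable to the ones the classifier was trained on.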

The classifier did well: it trained in about two minutes and reached an initial accuracy of about 83%, a pretty good start!
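
To read the explain tables above: each column's TOTAL is the sum of the weights of the active features for that label, and PROBS is that total turned into a probability. A minimal sketch, assuming the weights are base-2 log weights (which, as far as I know, is how NLTK's maxent classifier stores them) and normalizing only over the four labels shown:

```python
def probs_from_totals(totals):
    # Subtract the max before exponentiating for numerical stability.
    top = max(totals.values())
    scores = dict((label, 2.0 ** (t - top)) for label, t in totals.items())
    z = sum(scores.values())
    return dict((label, s / z) for label, s in scores.items())


# TOTAL row from the second explain() table above.
totals = {"books": 3.946, "design": 3.240, "data_science": 3.231, "business": 1.340}
probs = probs_from_totals(totals)

# Ratio between the two top labels, to compare with the printed PROBS row.
ratio = probs["books"] / probs["design"]
```

The absolute values come out higher than the printed PROBS row, which sums to less than 1, presumably because the classifier has more labels than the four columns displayed. The ratios between labels do match (0.354 / 0.217 is about 1.63).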

Parsing with Stanford Parser and NLTK

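NLTK's own parsers are notoriously weak, because they were written for teaching rather than production use. However, NLTK ships wrappers around the Stanford Parser and the Stanford NER tagger, so you can call those from Python instead.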


In [99]:
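import os
import nltk  # needed for nltk.word_tokenize in NER.tag below

# NERTagger is the class name in the NLTK release used here;
# later releases call it StanfordNERTagger.
from nltk.tag.stanford import NERTagger
from nltk.parse.stanford import StanfordParser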

## NER JAR and Models
STANFORD_NER_MODEL = os.path.expanduser("~/Development/stanford-ner-2014-01-04/classifiers/english.all.3class.distsim.crf.ser.gz")
STANFORD_NER_JAR   = os.path.expanduser("~/Development/stanford-ner-2014-01-04/stanford-ner-2014-01-04.jar")

## Parser JAR and Models
STANFORD_PARSER_MODELS = os.path.expanduser("~/Development/stanford-parser-full-2014-10-31/stanford-parser-3.5.0-models.jar")
STANFORD_PARSER_JAR    = os.path.expanduser("~/Development/stanford-parser-full-2014-10-31/stanford-parser.jar")

def create_tagger(model=None, jar=None, encoding='ASCII'):
    model = model or STANFORD_NER_MODEL
    jar   = jar or STANFORD_NER_JAR

    return NERTagger(model, jar, encoding)

def create_parser(models=None, jar=None, **kwargs):
    models = models or STANFORD_PARSER_MODELS
    jar   = jar or STANFORD_PARSER_JAR

    return StanfordParser(jar, models, **kwargs)

class NER(object):

    tagger = None

    @classmethod
    def initialize_tagger(klass, model=None, jar=None, encoding='ASCII'):
        klass.tagger = create_tagger(model, jar, encoding)

    @classmethod
    def tag(klass, sent):
        if klass.tagger is None:
            klass.initialize_tagger()

        sent = nltk.word_tokenize(sent)
        return klass.tagger.tag(sent)

class Parser(object):

    parser = None

    @classmethod
    def initialize_parser(klass, models=None, jar=None, **kwargs):
        klass.parser = create_parser(models, jar, **kwargs)

    @classmethod
    def parse(klass, sent):
        if klass.parser is None:
            klass.initialize_parser()

        return klass.parser.raw_parse(sent)

def tag(sent):
    return NER.tag(sent)

def parse(sent):
    return Parser.parse(sent)

In [100]:
tag("The man hit the building with the bat.")


Out[100]:
[[(u'The', u'O'),
  (u'man', u'O'),
  (u'hit', u'O'),
  (u'the', u'O'),
  (u'building', u'O'),
  (u'with', u'O'),
  (u'the', u'O'),
  (u'bat', u'O'),
  (u'.', u'O')]]

In [103]:
for p in parse("The man hit the building with the bat."):
    print p


(ROOT
  (S
    (NP (DT The) (NN man))
    (VP
      (VBD hit)
      (NP (DT the) (NN building))
      (PP (IN with) (NP (DT the) (NN bat))))
    (. .)))
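
The bracketed output above is just a nested structure, so it is easy to pull phrases back out of it. In practice nltk.Tree.fromstring does this for you; the sketch below is a dependency-free illustration, with read_bracketed, leaves and phrases being hypothetical helper names rather than anything from NLTK.

```python
import re


def read_bracketed(text):
    # Turn "(NP (DT the) (NN bat))" into ["NP", ["DT", "the"], ["NN", "bat"]].
    stack = [[]]
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    return stack[0][0]


def leaves(node):
    # Collect the words under a node, left to right.
    if not isinstance(node, list):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label):
    # Yield the text of every subtree whose label matches, in pre-order.
    if isinstance(node, list):
        if node[0] == label:
            yield " ".join(leaves(node))
        for child in node[1:]:
            for phrase in phrases(child, label):
                yield phrase


parse_output = """(ROOT (S (NP (DT The) (NN man))
  (VP (VBD hit) (NP (DT the) (NN building))
    (PP (IN with) (NP (DT the) (NN bat)))) (. .)))"""

tree = read_bracketed(parse_output)
noun_phrases = list(phrases(tree, "NP"))
```

Here noun_phrases holds the three NP subtrees from the parse. Note that the parser attached "with the bat" to the verb phrase, i.e. the reading where the bat is the instrument.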

TextBlob

A lightweight wrapper around NLTK that provides a simple "Blob" interface for working with text.


In [23]:
from textblob import TextBlob
from bs4 import BeautifulSoup

text = TextBlob(get_text("nrRB0.html"))

print text.sentences


[Sentence("A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post

 It’s lowbrow."), Sentence("It’s messy."), Sentence("It could never be accused of being healthful."), Sentence("But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken."), Sentence("Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings."), Sentence("Here are some of the most irresistible."), Sentence("‘Rotissi-fried’ chicken at the Partisan

Forget the cronut."), Sentence("Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan."), Sentence("Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes."), Sentence("Why both?"), Sentence("“Everything is better once it’s fried in beef fat,” Anda said."), Sentence("We have to agree."), Sentence("Whether white or dark, the meat is succulent throughout."), Sentence("The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially."), Sentence("The sound of it shattering under the knife was music to our ears."), Sentence("And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce."), Sentence("The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations."), Sentence("The Partisan, 709 D St. NW."), Sentence("202-524-5322. www.thepartisandc.com."), Sentence("— Becky Krystal

 [In a love/hate relationship with Chick-fil-A?"), Sentence("Here are some alternatives] 

Traditional fried chicken at Family Meal

When Bryan Voltaggio started planning the menu for Family Meal, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken."), Sentence("“It was one of our favorite things,” he says."), Sentence("“It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers."), Sentence("That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu."), Sentence("The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch."), Sentence("After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh."), Sentence("You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist?"), Sentence("Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. www.voltfamilymeal.com."), Sentence("— John Taylor

  

 [40 Eats: D.C.’s most essential dishes of 2015] 



Japanese fried chicken at Izakaya Seki

Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, karaage chicken — like most of the country’s food — is held to an extremely high standard."), Sentence("“It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns Izakaya Seki on V Street NW."), Sentence("“Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken."), Sentence("Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil."), Sentence("Izakaya Seki’s version sticks closely to the formula."), Sentence("Probably."), Sentence("“I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved."), Sentence("The result is a thin, tender coating that’s slightly softer than tempura."), Sentence("The accompanying ponzu sauce lends a tartness to the nubs."), Sentence("Izakaya Seki, 1117 V St. NW."), Sentence("202-588-5841. www.sekidc.com."), Sentence("— Holley Simmons

Korean fried chicken at BonChon

Don’t waste your kimchi-stinking breath asking for more sauce at BonChon."), Sentence("The South Korean fried chicken chain, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications."), Sentence("And why would you want to change anything, really?"), Sentence("The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon."), Sentence("Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece."), Sentence("True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines."), Sentence("Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer."), Sentence("Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard."), Sentence("BonChon, 1015 Half St."), Sentence("SE and nine other locations in Maryland and Virginia."), Sentence("www.bonchon.com."), Sentence("— Holley Simmons

Maryland fried chicken at Crisfield Seafood and Hank’s Oyster Bar

There’s not much agreement on what constitutes Maryland fried chicken."), Sentence("Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak."), Sentence("The pan-fried chicken platter at  Crisfield Seafood is a perfect example of the former style."), Sentence("Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan."), Sentence("This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on."), Sentence("(The chicken is available only Friday through Sunday, and frequently sells out.)"), Sentence("The Chesapeake fried chicken at Hank’s Oyster Bar in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy."), Sentence("It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday."), Sentence("Crisfield Seafood, 8012 Georgia Ave., Silver Spring."), Sentence("301-589-1306. www.crisfieldseafood.com."), Sentence("Hank’s Oyster Bar, 1624 Q St. NW."), Sentence("202-462-4265; 633 Pennsylvania Ave."), Sentence("SE."), Sentence("202-733-1971. www.hanksoysterbar.com."), Sentence("— Fritz Hahn



 [Pizza in Washington: An upper-crust tour of every D.C. style ]



Fancy fried chicken at Central

Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant."), Sentence("It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam."), Sentence("But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on Central Michel Richard’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever."), Sentence("Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste."), Sentence("It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!)"), Sentence("French."), Sentence("Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken."), Sentence("It is, after all, that kind of a place."), Sentence("Central Michel Richard, 1001 Pennsylvania Ave. NW."), Sentence("202-626-0015. www.centralmichelrichard.com ."), Sentence("— Maura Judkis



Nashville hot chicken at Reserve 2216

If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women."), Sentence("Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot."), Sentence("Decades later, chefs are latching onto this addictive form of punishment."), Sentence("Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of Reserve 2216 in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville."), Sentence("He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it."), Sentence("Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles."), Sentence("He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back."), Sentence("But not too hard."), Sentence("This is Alexandria, after all."), Sentence("Reserve 2216, 2216 Mount Vernon Ave., Alexandria."), Sentence("703-549-2889. www.drpreserve.com."), Sentence("— Tim Carman



Fast-food fried chicken at Popeyes

The sole virtue of most fast-food operations is consistency."), Sentence("Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same."), Sentence("The menu at Popeyes follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal)."), Sentence("Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion."), Sentence("No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice."), Sentence("The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue."), Sentence("No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain."), Sentence("Once, I got home to discover a clerk had forgotten to pack bread in my bag."), Sentence("I almost cried."), Sentence("Instead, I consoled myself with another piece of chicken."), Sentence("Popeyes has locations throughout the D.C. metro area."), Sentence("www.popeyes.com."), Sentence("— Tom Sietsema



Fried chicken sandwich at DCity Smokehouse

The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity."), Sentence("Leave it to Rob Sonderman, pitmaster and co-owner of DCity Smokehouse, to bring dignity back to the bite."), Sentence("His Den-Den — named for co-creator and pitmaster-in-training Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey."), Sentence("Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer."), Sentence("Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch."), Sentence("Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce)."), Sentence("No matter."), Sentence("You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat."), Sentence("DCity Smokehouse, 8 Florida Ave. NW."), Sentence("202-733-1919. www.dcitysmokehouse.com."), Sentence("— Tim Carman



Classic D.C. fried chicken at Oohh’s and Aahh’s

Hearty is the appetite that can handle Oohh’s and Aahh’s chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers."), Sentence("He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating."), Sentence("Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother."), Sentence("Oohh’s and Aahh’s, 1005 U St. NW."), Sentence("202-667-7142. www.oohhsnaahhs.com."), Sentence("— Bonnie S. Benwick



Popcorn fried chicken at Pop’s Sea Bar

It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy Pop’s Sea Bar in Adams Morgan."), Sentence("The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order."), Sentence("A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99)."), Sentence("Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite."), Sentence("Pop’s Sea Bar, 1817 Columbia Rd."), Sentence("NW."), Sentence("202-534-3933. www.popsseabar.com."), Sentence("— Bonnie S. Benwick



Fried chicken tenders at GBD

Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at GBD are a fine meal for adults and children alike."), Sentence("Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice."), Sentence("But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on D.C.’s own mumbo sauce, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter."), Sentence("Ask for the $5.50 Saucetown option to try all nine."), Sentence("GBD, 1323 Connecticut Ave. NW."), Sentence("202-524-5210. www.gbdchickendoughnuts.com."), Sentence("— Margaret Ely



Fried chicken skins at Gypsy Soul

Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast."), Sentence("But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken."), Sentence("And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones."), Sentence("And that is when you grab one of the bar stools at R.J. Cooper’s Gypsy Soul in Fairfax’s Mosaic District."), Sentence("Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic."), Sentence("The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own."), Sentence("Gypsy Soul, 8296 Glass Alley, Fairfax."), Sentence("703-992-0933. www.gypsysoul-va.com."), Sentence("— Fritz Hahn

 Related items: 

 D.C.’s most essential dishes of 2015 

 Pizza in Washington: An upper crust tour of every D.C. style 



")]

In [25]:
import nltk

In [26]:
np = nltk.FreqDist(text.noun_phrases)
print np.most_common(10)


[(u'hot sauce', 6), (u'popeyes', 5), (u'washington', 5), (u'it\u2019s', 4), (u'gbd', 4), (u'd.c.\u2019s', 4), (u'st. nw', 4), (u'bonchon', 3), (u'maryland', 3), (u'hank\u2019s oyster', 3)]

In [27]:
print text.sentiment


Sentiment(polarity=-0.0025676717918097208, subjectivity=0.5856343297507093)

In [28]:
review = TextBlob("Harrison Ford would be the most amazing, most wonderful, most handsome actor - the greatest that ever lived, if only he didn't have that silly earing.")
print review.sentiment


Sentiment(polarity=0.4555555555555555, subjectivity=0.8083333333333333)
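
The review scores as clearly positive even though the "if only" clause undercuts the praise. That is what you would expect from any scorer that mostly averages word-level polarities. The sketch below is a toy version of that idea, not TextBlob's actual analyzer, and the lexicon scores are made up for illustration.

```python
import re

# Hypothetical word polarities in [-1, 1]; not taken from any real lexicon.
TOY_LEXICON = {
    "amazing": 0.6,
    "wonderful": 1.0,
    "handsome": 0.5,
    "greatest": 1.0,
    "silly": -0.5,
}


def toy_polarity(text):
    # Average the scores of the words we know; ignore everything else,
    # including structure such as "would be ... if only".
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


review_text = ("Harrison Ford would be the most amazing, most wonderful, "
               "most handsome actor - the greatest that ever lived, "
               "if only he didn't have that silly earing.")
score = toy_polarity(review_text)
```

Four positive words outvote one negative word, so the toy score is positive too. Catching the conditional would need syntax or a trained classifier, not just a bigger lexicon.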

Language Detection using TextBlob


In [29]:
b = TextBlob(u"بسيط هو أفضل من مجمع")
b.detect_language()


Out[29]:
u'ar'
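
For intuition about why detection is easy in this case, the writing system alone already narrows things down a lot. The sketch below is a much cruder technique than detect_language: it only guesses the dominant script from Unicode character names, so it cannot tell English from Spanish, for example. dominant_script is a hypothetical helper, and the strings need to be unicode (the default in Python 3).

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    # Count the first word of each letter's Unicode name,
    # e.g. "ARABIC LETTER BEH" -> "ARABIC".
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


arabic = dominant_script(u"بسيط هو أفضل من مجمع")
chinese = dominant_script(u"美丽优于丑陋")
english = dominant_script(u"Simple is better than complex.")
```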

In [32]:
chinese_blob = TextBlob(u"美丽优于丑陋")
chinese_blob.translate(from_lang="zh-CN", to='en')


Out[32]:
TextBlob("")

In [33]:
en_blob = TextBlob(u"Simple is better than complex.")
en_blob.translate(to="es")


Out[33]:
TextBlob("")

spaCy

Industrial-strength NLP, written in Python with a strong Cython backend, and very fast. Check the license terms before using it in a commercial project, though.


In [34]:
from __future__ import unicode_literals 
from spacy.en import English

nlp = English()

tokens = nlp(u'The man hit the building with the baseball bat.')

baseball = tokens[7]
print (baseball.orth, baseball.orth_, baseball.head.lemma, baseball.head.lemma_)


(2303, u'baseball', 4193, u'bat')
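
The output above shows spaCy's naming convention: an attribute without a trailing underscore (orth, lemma) is an integer ID, and the same attribute with an underscore (orth_, lemma_) is the string it stands for. The head of "baseball" is "bat", the word it modifies. The class below is a toy illustration of the ID/string idea only; it is not spaCy's implementation and its IDs will not match the ones printed above.

```python
class ToyStringStore(object):
    """Maps each distinct string to a small integer and back."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def add(self, string):
        # Assign the next free ID the first time a string is seen.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # Integer key -> string, string key -> integer.
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]


store = ToyStringStore()
ids = [store.add(w) for w in "the man hit the building".split()]
```

Storing integers means repeated words such as "the" share one ID, and comparisons between tokens become integer comparisons.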

In [139]:
tokens = nlp(u'The man hit the building with the baseball bat.', parse=True)
for token in tokens:
    print token.prob


-5.02773189545
-8.16621112823
-8.3605670929
-3.07847452164
-8.67186450958
-5.23164892197
-3.07847452164
-9.61269950867
-10.9683980942
-3.17597317696
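
The numbers above are log probabilities of each word type, one per token, so values closer to zero mean more common words. A small sketch of reading them, assuming they are natural-log probabilities (an assumption; the notebook does not say which base is used) and pairing them by hand with the ten tokens of the sentence:

```python
import math

words = ["The", "man", "hit", "the", "building",
         "with", "the", "baseball", "bat", "."]
log_probs = [-5.02773189545, -8.16621112823, -8.3605670929, -3.07847452164,
             -8.67186450958, -5.23164892197, -3.07847452164, -9.61269950867,
             -10.9683980942, -3.17597317696]

# Distinct (word, log prob) pairs, rarest word first.
ranked = sorted(set(zip(words, log_probs)), key=lambda pair: pair[1])
rarest_word = ranked[0][0]

# Under the natural-log assumption this is the estimated frequency of "the".
p_the = math.exp(-3.07847452164)
```

Under that assumption "the" comes out at roughly 4.6% of tokens, which is in the range usually quoted for English, and "The" gets a different value because the estimate is case-sensitive.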

gensim

A library for unsupervised topic modeling over bag-of-words corpora, including LSA and LDA.

It also implements word2vec, Google's word vectorizer, which was explored in a previous post.
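
To make "bag of words" concrete, here is a minimal sketch in plain Python: each document becomes a table of word counts, and documents are compared by the cosine of the angle between their count vectors. gensim builds a similar sparse representation with corpora.Dictionary and doc2bow before fitting LSA or LDA; the helper names and the three example documents below are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Word order is thrown away; only counts remain.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


chicken = bag_of_words("crispy fried chicken with hot sauce")
wings = bag_of_words("fried chicken wings with soy garlic sauce")
parsing = bag_of_words("the parser builds a syntax tree")

food_similarity = cosine_similarity(chicken, wings)
cross_similarity = cosine_similarity(chicken, parsing)
```

The two food documents share four words and score about 0.62, while the parsing document shares none and scores 0. Topic models go a step further by grouping words that tend to co-occur, so documents can be similar without sharing exact words.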