
Text Processing with Python

Packages Discussed:

Other packages:

NLP in Context

The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited.

— Ferdinand de Saussure

The State of the Art

  • Academic design for use alongside intelligent agents (AI discipline)
  • Relies on formal models or representations of knowledge & language
  • Models are adapted and augmented through probabilistic methods and machine learning.
  • A small number of algorithms comprise the standard framework.

Required:

  • Domain Knowledge
  • A Corpus in the Domain
  • Methods

The Data Science Pipeline

The NLP Pipeline

Morphology

The study of the forms of things, words in particular.

Consider pluralization for English:

  • Orthographic Rules: puppy → puppies
  • Morphological Rules: goose → geese, fish → fish
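
The two kinds of rule above can be sketched in a few lines of Python 3. This is only an illustration: the function name `pluralize` and the tiny `IRREGULAR` table are hypothetical, and the orthographic rules cover just a handful of common cases.

```python
# Hypothetical lookup table for morphological (irregular) plurals.
IRREGULAR = {"goose": "geese", "fish": "fish", "mouse": "mice"}

VOWELS = "aeiou"

def pluralize(noun):
    """Return a plural form of an English noun using a few simple rules."""
    # Morphological rules: irregular forms must simply be looked up.
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # Orthographic rule: consonant + y becomes ies (puppy -> puppies).
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Orthographic rule: sibilant endings take -es (box -> boxes).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Default: append -s.
    return noun + "s"

for word in ["puppy", "goose", "fish", "day", "box"]:
    print(word, "->", pluralize(word))
```

The irregular table is checked first because an orthographic rule would otherwise misfire: "fish" ends in "sh" and would become "fishes".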

Major parsing tasks:

  • stemming
  • lemmatization
  • tokenization
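
As a rough illustration of two of these tasks, here is a minimal Python 3 sketch of a regular-expression tokenizer and a naive suffix-stripping stemmer. The names `tokenize`, `stem` and `SUFFIXES` are hypothetical, and the suffix list is far smaller than what a real stemmer (the Porter stemmer, for example) applies.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def stem(token):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The puppies were chasing geese.")
print(tokens)
print([stem(t) for t in tokens])
```

Stemming only chops characters, so "geese" is left untouched. A lemmatizer would instead look up the dictionary form ("goose"), which requires a lexicon rather than string rules.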

Syntax

The study of the rules for the formation of sentences.

Major tasks:

  • chunking
  • parsing
  • feature parsing
  • grammars
  • NGram Models (perplexity)
  • Language generation
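
To make the n-gram item concrete, the following Python 3 sketch trains a bigram model and scores sentences by perplexity. It assumes add-one (Laplace) smoothing and a toy two-sentence corpus; the function names are hypothetical and not taken from any particular library.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """Perplexity of one sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    # Normalize by the number of predictions made, then exponentiate.
    return math.exp(-log_prob / (len(padded) - 1))

corpus = [["i", "see", "what", "i", "eat"], ["i", "eat", "what", "i", "see"]]
unigrams, bigrams = train_bigram_model(corpus)

seen = perplexity(["i", "see", "what", "i", "eat"], unigrams, bigrams)
scrambled = perplexity(["eat", "see", "i", "what", "i"], unigrams, bigrams)
print(seen < scrambled)
```

Lower perplexity means the model finds the word sequence less surprising, so a sentence drawn from the training corpus scores lower than a scrambled ordering of the same words.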

Semantics

The study of meaning.

I see what I eat.
I eat what I see.
He poached salmon.

Major tasks:

  • Frame extraction
  • Creation of text meaning representations (TMRs)
  • Question and answer systems

Machine Learning

Solve Clustering Problems:

  • Topic Modeling
  • Language Similarity
  • Document Association (authorship)

Solve Classification Problems:

  • Language Detection
  • Sentiment Analysis
  • Part of Speech Tagging
  • Statistical Parsing
  • Much more

Both kinds of problem rely on word vectors to implement distance-based metrics.
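
A minimal Python 3 sketch of a distance-based metric follows. It assumes the simplest possible vectors, bag-of-words counts per document, and uses cosine distance; the function names are hypothetical, and learned word embeddings would be compared in the same way.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # Treat an empty document as maximally distant.
    return 1.0 - dot / (norm_a * norm_b)

d1 = vectorize("I see what I eat")
d2 = vectorize("I eat what I see")
d3 = vectorize("He poached salmon")

print(cosine_distance(d1, d2))
print(cosine_distance(d1, d3))
```

Because a bag of words ignores order, the first two sentences are at (essentially) zero distance even though they mean different things, which is one reason semantics cannot be reduced to word counts.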

Setup and Dataset

To install the required packages (ideally into a virtual environment), download the requirements.txt file and run:

$ pip install -r requirements.txt

Alternatively, pip install each dependency as you need it.

Corpus Organization

Preprocessing HTML and XML Documents to Text

Much of the text that we're interested in is available on the web and formatted as either HTML or XML. It's not just web pages, however. ePub files, for example, are zip archives containing XHTML, and other eReader formats such as Mobi also carry HTML-based markup. These semi-structured documents contain a lot of information, usually structural in nature. However, we want to get to the main body of the content, disregarding anything else that might be included, such as navigation headers, sidebars, ads and other extraneous material.
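
Since an ePub is an ordinary zip archive, the standard library is enough to find the XHTML documents inside one. The Python 3 sketch below uses a hypothetical helper, `list_xhtml_members`, and builds a tiny stand-in archive in memory so that it runs without a real book; the member names are made up for illustration.

```python
import io
import zipfile

def list_xhtml_members(epub_file):
    """Return the names of the (X)HTML documents inside an ePub archive."""
    with zipfile.ZipFile(epub_file) as archive:
        return [
            name for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        ]

# Build a tiny stand-in archive in memory (not a valid, complete ePub).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hello</p></body></html>")
    archive.writestr("OEBPS/style.css", "p { margin: 0 }")

buffer.seek(0)
print(list_xhtml_members(buffer))
```

With a real book, pass the path to the .epub file instead of the in-memory buffer; each listed member can then be read with `archive.read(name)` and parsed like any other HTML document.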

On the web, several services such as Instapaper and Clearly present web pages in a "readable" fashion. Some browsers even come with a clutter- and distraction-free "reading mode" that seems to give us exactly the content that we're looking for. One option that I've used in the past is to access these renderers programmatically; Instapaper even provides an API. For large corpora, however, we need to perform extraction quickly and repeatably while maintaining the original documents.

Corpus management requires that the original documents be stored alongside the preprocessed documents; do not make changes to the originals in place! See discussions of data lakes and data pipelines for more on ingesting to WORM (write once, read many) storage.
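
One way to follow this rule is to keep raw and processed documents in sibling directories and only ever write to the processed side. The Python 3 sketch below assumes a hypothetical `raw/` and `processed/` layout and a hypothetical helper named `preprocess`; the transform passed in is a placeholder for real markup stripping.

```python
import tempfile
from pathlib import Path

def preprocess(raw_path, processed_dir, transform):
    """Write a transformed copy of raw_path into processed_dir; never modify the original."""
    processed_dir = Path(processed_dir)
    processed_dir.mkdir(parents=True, exist_ok=True)
    target = processed_dir / (Path(raw_path).stem + ".txt")
    text = Path(raw_path).read_text(encoding="utf-8")
    target.write_text(transform(text), encoding="utf-8")
    return target

# Demonstrate with a throwaway corpus directory.
root = Path(tempfile.mkdtemp())
raw_dir = root / "raw"
raw_dir.mkdir()
original = raw_dir / "article.html"
original.write_text("<p>Fried chicken</p>", encoding="utf-8")

# The transform here is a placeholder; a real one would strip the markup.
result = preprocess(original, root / "processed", str.upper)
print(result.name, original.read_text(encoding="utf-8"))
```

Because the original file is only ever read, the extraction step can be rerun with a better transform at any time without fetching the documents again.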

In Python, the fastest way to process HTML and XML text is with the lxml library, a very fast XML parser that binds the C libraries libxml2 and libxslt. However, the lxml API is a bit tricky, so we will instead use two friendlier libraries that can build on it: readability-lxml and BeautifulSoup.

For example, consider the following code to fetch an HTML web article from The Washington Post:


In [1]:
import codecs
import requests

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2
from contextlib import closing

chunk_size = 10**6  # Download 1 MB at a time.
wpurl = "http://wpo.st/"  # Washington Post provides short links

def fetch_webpage(url, path):
    # Open up a stream request (to download large documents)
    # Ensure that we will close when complete using contextlib
    with closing(requests.get(url, stream=True)) as response:

        # Check that the response was successful
        if response.status_code == 200:

            # Fall back to UTF-8 if the server did not declare an encoding,
            # so that iter_content always yields decoded text
            response.encoding = response.encoding or 'utf-8'

            # Write each chunk to disk with the correct encoding
            with codecs.open(path, 'w', response.encoding) as f:
                for chunk in response.iter_content(chunk_size, decode_unicode=True):
                    f.write(chunk)

def fetch_wp_article(article_id):
    # Build the local file name and the short-link URL from the article id
    path = "%s.html" % article_id
    url  = urljoin(wpurl, article_id)
    return fetch_webpage(url, path)

In [2]:
fetch_webpage("http://www.koreadaily.com/news/read.asp?art_id=3283896", "korean.html")

In [3]:
fetch_wp_article("nrRB0")

In [4]:
fetch_wp_article("uyRB0")

BeautifulSoup allows us to search the DOM to extract particular elements. For example, to load our document and find all the <p> tags, we would do the following:


In [2]:
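import bs4

def get_soup(path):
    # Open in binary mode so that BeautifulSoup can detect the encoding itself
    with open(path, 'rb') as f:
        return bs4.BeautifulSoup(f, "lxml") # Note the use of the lxml parser

# Print every paragraph element, markup included
for p in get_soup("nrRB0.html").find_all('p'):
    print(p)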


<p class="category-desc"> The inside track on Washington politics. </p>
<p class="invalid-email">*Invalid email address</p>
<p class="category-desc"> The inside track on Washington politics. </p>
<p class="invalid-email">*Invalid email address</p>
<p>Sign in or create an account so we can save this story to your Reading List. You'll be able to access the story from your Reading List on any computer, tablet or smartphone.</p>
<p class="top-header-message">Sign in to your account to save this article.</p>
<p id="U9001274173114EdC"></p>
<p id="U1000696839467p6H"> <i>It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible.</i> </p>
<p id="U1000696839467ntG"></p>
<p id="U9001274173114AOB">Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at <b>the Partisan</b>. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes. Why both? “Everything is better once it’s fried in beef fat,” Anda said. We have to agree. Whether white or dark, the meat is succulent throughout. The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially. The sound of it shattering under the knife was music to our ears. And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce. The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations.</p>
<p id="U9001274173114XlF"> <i>The Partisan, 709 D St. NW. 202-524-5322. <a class="showlink" href="http://www.thepartisandc.com">www.thepartisandc.com</a>. </i> </p>
<p><strong>— Becky Krystal</strong></p>
<p id="U9001274173114AdG">When <a href="http://www.washingtonpost.com/lifestyle/food/bryan-voltaggio-from-a-teenager-amok-to-top-chef-masters/2013/07/22/682d35f8-ef04-11e2-bed3-b9b6fe264871_story.html">Bryan Voltaggio</a> started planning the menu for <b>Family Meal</b>, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken. “It was one of our favorite things,” he says. “It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers. That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu. The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch. After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh. You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist? </p>
<p id="U9001274173114THE"> <i>Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. <a class="showlink" href="http://www.voltfamilymeal.com">www.voltfamilymeal.com</a>. </i> </p>
<p><strong>— John Taylor</strong></p>
<p id="U90012741731140AH">Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, kara age chicken — like most of the country’s food — is held to an extremely high standard. “It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns <b>Izakaya Seki</b> on V Street NW. “Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken. Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil. Izakaya Seki’s version sticks closely to the formula. Probably. “I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved. The result is a thin, tender coating that’s slightly softer than tempura. The accompanying ponzu sauce lends a tartness to the nubs.</p>
<p id="U900127417311410F"> <i>Izakaya Seki, 1117 V St. NW. 202-588-5841. <a class="showlink" href="http://www.sekidc.com">www.sekidc.com</a>. </i> </p>
<p><strong>— Holley Simmons</strong></p>
<p id="U9001274173114ewH">Don’t waste your kimchi-stinking breath asking for more sauce at <b>BonChon</b>. The <a href="http://www.washingtonpost.com/goingoutguide/the-20-diner-the-zen-of-bonchon-chicken/2013/07/31/b879060c-f582-11e2-aa2e-4088616498b4_story.html">South Korean fried chicken chain</a>, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications. And why would you want to change anything, really? The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon. Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece. True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines. Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer. Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard.</p>
<p id="U9001274173114bBI"> <i>BonChon, 1015 Half St. SE and nine other locations in Maryland and Virginia. <a class="showlink" href="http://www.bonchon.com">www.bonchon.com</a>. </i> </p>
<p><strong>— Holley Simmons</strong></p>
<p id="U9001274173114NgD">There’s not much agreement on what constitutes Maryland fried chicken. Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak. The pan-fried chicken platter at <b> <br/>Crisfield Seafood</b> is a perfect example of the former style. Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan. This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on. (The chicken is available only Friday through Sunday, and frequently sells out.) The Chesapeake fried chicken at <b>Hank’s Oyster Bar</b> in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy. It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday. </p>
<p id="U9001274173114wEE"> <i>Crisfield Seafood, 8012 Georgia Ave., Silver Spring. 301-589-1306. <a class="showlink" href="http://www.crisfieldseafood.com">www.crisfieldseafood.com</a>. Hank’s Oyster Bar, 1624 Q St. NW. 202-462-4265; 633 Pennsylvania Ave. SE. 202-733-1971. <a class="showlink" href="http://www.hanksoysterbar.com">www.hanksoysterbar.com</a>.</i> </p>
<p><strong>— Fritz Hahn</strong></p>
<p id="U9001274173114bsG">Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant. It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam. But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on <b>Central Michel Richard</b>’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever. Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste. It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!) French. Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken. It is, after all, that kind of a place. </p>
<p id="U9001274173114p0D"> <i>Central Michel Richard, 1001 Pennsylvania Ave. NW. 202-626-0015. <a class="showlink" href="http://www.centralmichelrichard.com">www.centralmichelrichard.com</a> <i>. </i> </i> </p>
<p></p>
<p><strong>— Maura Judkis</strong></p>
<p id="U900127417311467G"></p>
<p id="U9001274173114o0G">If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women. Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot. Decades later, chefs are latching onto this addictive form of punishment. Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of <b>Reserve 2216</b> in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville. He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it. Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles. He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back. But not too hard. This is Alexandria, after all. </p>
<p id="U900127417311481"> <i>Reserve 2216, 2216 Mount Vernon Ave., Alexandria. 703-549-2889. <a class="showlink" href="http://www.drpreserve.com">www.drpreserve.com</a>.</i> </p>
<p><strong>— Tim Carman</strong></p>
<p id="U9001274173114HvD"></p>
<p id="U90012741731145tH">The sole virtue of most fast-food operations is consistency. Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same. The menu at <b>Popeyes</b> follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal). Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion. No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice. The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue. No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain. Once, I got home to discover a clerk had forgotten to pack bread in my bag. I almost cried. Instead, I consoled myself with another piece of chicken. </p>
<p id="U9001274173114EOF"> <i>Popeyes has locations throughout the D.C. metro area. <a class="showlink" href="http://www.popeyes.com">www.popeyes.com</a>.</i> </p>
<p><strong>— Tom Sietsema</strong></p>
<p id="U9001274173114JiB"></p>
<p id="U9001274173114mFG">The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity. Leave it to Rob Sonderman, pitmaster and co-owner of <b>DCity Smokehouse</b>, to bring dignity back to the bite. His Den-Den — named for co-creator and pitmaster-in-training <br/>Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey. Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer. Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch. Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce). No matter. You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat.</p>
<p id="U9001274173114vPC"> <i>DCity Smokehouse, 8 Florida Ave. NW. 202-733-1919. <a class="showlink" href="http://www.dcitysmokehouse.com">www.dcitysmokehouse.com</a>. </i> </p>
<p><strong>— Tim Carman</strong></p>
<p id="U9001274173114L3B"></p>
<p id="U9001274173114yyH">Hearty is the appetite that can handle <b>Oohh’s and Aahh’s</b> chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers. He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating. Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother.</p>
<p id="U9001274173114gCF"> <i>Oohh’s and Aahh’s, 1005 U St. NW. 202-667-7142. <a class="showlink" href="http://www.oohhsnaahhs.com">www.oohhsnaahhs.com</a>.</i> </p>
<p><strong>— Bonnie S. Benwick</strong></p>
<p id="U9001274173114ZvF"></p>
<p id="U9001274173114wdH">It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy <b>Pop’s Sea Bar</b> in Adams Morgan. The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order. A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99). Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite.</p>
<p id="U9001274173114ycC"> <i>Pop’s Sea Bar, 1817 Columbia Rd. NW. 202-534-3933. <a class="showlink" href="http://www.popsseabar.com">www.popsseabar.com</a>.</i> </p>
<p><strong>— Bonnie S. Benwick</strong></p>
<p id="U9001274173114ZfD"></p>
<p id="U9001274173114RpD">Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at <b>GBD</b> are a fine meal for adults and children alike. Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice. But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on <a href="http://www.washingtonpost.com/lifestyle/food/mumbo-sauce-gets-gentrified/2013/07/08/b1011ade-cc67-11e2-9f1a-1a7cdee20287_story.html">D.C.’s own mumbo sauce</a>, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter. Ask for the $5.50 Saucetown option to try all nine. </p>
<p id="U9001274173114bCE"> <i>GBD, 1323 Connecticut Ave. NW. 202-524-5210. <a class="showlink" href="http://www.gbdchickendoughnuts.com/">www.gbdchickendoughnuts.com</a>.</i> </p>
<p><strong>— Margaret Ely</strong></p>
<p id="U9001274173114xGE"></p>
<p id="U90012741731149NI">Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast. But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken. And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones. And that is when you grab one of the bar stools at R.J. Cooper’s <b>Gypsy Soul</b> in Fairfax’s Mosaic District. Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic. The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own. </p>
<p id="U9001274173114vUB"> <i>Gypsy Soul, 8296 Glass Alley, Fairfax. 703-992-0933. <a class="showlink" href="http://www.gypsysoul-va.com">www.gypsysoul-va.com</a>.</i> </p>
<p><strong>— Fritz Hahn</strong></p>
<p class="section-instream">goingoutguide</p>
<p class="subsection-instream"></p>
<p class="blogname-instream"></p>
<p class="headline"></p>
<p class="tagline"></p>
<p class="error hide">Please provide a valid email address. </p>
<p class="headline">The Freddie Gray case </p>
<p class="follow-tagline follow"> Sign up for email updates on the trials.</p>
<p class="follow-tagline unfollow hide">You’ve signed up for email updates on this story.</p>
<p class="error col-xs-12">Please provide a valid email address. </p>
<p class="headline"> <span class="campaign-txt"> <span class="bold">Campaign 2016 </span> <span class="flower-image"> </span> </span> Email Updates </p>
<p class="follow-tagline follow">Get the best analysis of the presidential race.</p>
<p class="follow-tagline unfollow hide">You’ve signed up for email updates on this story.</p>
<p class="error col-xs-12">Please provide a valid email address. </p>
<p class="headline">Get Zika news by email</p>
<p class="follow-tagline follow"> We will update you when news breaks about the virus.</p>
<p class="follow-tagline unfollow hide">You’ve signed up for email updates on this story.</p>
<p class="error col-xs-12">Please provide a valid email address. </p>
<p class="title">SuperFan Badge</p>
<p>SuperFan badge holders consistently post smart, timely comments about Washington area sports and teams.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Culture Connoisseur Badge</p>
<p>Culture Connoisseurs consistently offer thought-provoking, timely comments on the arts, lifestyle and entertainment.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Fact Checker Badge</p>
<p>Fact Checkers contribute questions, information and facts to <a href="//www.washingtonpost.com/blogs/fact-checker" target="_badgeinfo">The Fact Checker</a>.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Washingtologist Badge</p>
<p>Washingtologists consistently post thought-provoking, timely comments on events, communities, and trends in the Washington area.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Post Writer Badge</p>
<p>This commenter is a Washington Post editor, reporter or producer.</p>
<p class="title">Post Forum Badge</p>
<p>Post Forum members consistently offer thought-provoking, timely comments on politics, national and international affairs.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Weather Watcher Badge</p>
<p>Weather Watchers consistently offer thought-provoking, timely comments on climates and forecasts.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">World Watcher Badge</p>
<p>World Watchers consistently offer thought-provoking, timely comments on international affairs.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Post Contributor Badge</p>
<p>This commenter is a Washington Post contributor. Post contributors aren’t staff, but may write articles or columns. In some cases, contributors are sources or experts quoted in a story.</p>
<p class="echo-badge-info-link"><a href="//www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">More about badges</a> | <a href="http://www.washingtonpost.com/wp-srv/interactivity/get-a-badge.html" target="_badgeinfo">Request a badge</a></p>
<p class="title">Post Recommended</p>
<p>Washington Post reporters or editors recommend this comment or reader post.</p>
<p>You must be logged in to report a comment.</p>
<p>You must be logged in to recommend a comment.</p>
<p>Comments our editors find particularly useful or relevant are displayed in <strong>Top Comments</strong>, as are comments by users with these badges: <strong><span class="badge-list"></span></strong>. Replies to those posts appear here, as well as posts by staff writers.</p>
<p>All comments are posted in the <strong>All Comments</strong> tab.</p>
<p>To pause and restart automatic updates, click "Live" or "Paused". If paused, you'll be notified of the number of additional comments that have come in.</p>
<p>Play right from this page</p>
<p> It's 3D Mahjongg- you don't even need to wear 3D glasses! </p>
<p> Online crossword. </p>
<p> Spider Solitaire is known as the king of all solitaire games! </p>
<p> Challenge your crossword skills everyday with a huge variety of puzzles waiting for you to solve. </p>
<p id="newsletter-section">goingoutguide</p>
<p id="newsletter-subsection"></p>
<p id="newsletter-blogname"></p>
<p class="headline" id="newsletter-headline"><i class="fa fa-check" id="headline-checked"></i></p>
<p class="title" id="newsletter-tagline"></p>
<p class="title" id="subscribed-confirmation"><span>Success!</span> Check your inbox for details.</p>
<p class="newsLetter-error-msg"> Please enter a valid email address </p>
<p class="title">You might also like: </p>
<p class="title "></p>
<p class="title \"></p>
<p class="title"></p>
<p class="title" id="all-newsletters-lbl"><a href="https://subscribe.washingtonpost.com/newsletters">See all newsletters</a></p>

To print only the text of each paragraph, without the surrounding tags, use the text attribute of each element:


In [3]:
for p in get_soup("nrRB0.html").find_all('p'):
    # Print the text of the paragraph followed by a blank line
    print(p.text + "\n")


 The inside track on Washington politics. 

*Invalid email address

 The inside track on Washington politics. 

*Invalid email address

Sign in or create an account so we can save this story to your Reading List. You'll be able to access the story from your Reading List on any computer, tablet or smartphone.

Sign in to your account to save this article.



 It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible. 



Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes. Why both? “Everything is better once it’s fried in beef fat,” Anda said. We have to agree. Whether white or dark, the meat is succulent throughout. The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially. The sound of it shattering under the knife was music to our ears. And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce. The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations.

 The Partisan, 709 D St. NW. 202-524-5322. www.thepartisandc.com.  

— Becky Krystal

When Bryan Voltaggio started planning the menu for Family Meal, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken. “It was one of our favorite things,” he says. “It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers. That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu. The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch. After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh. You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist? 

 Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. www.voltfamilymeal.com.  

— John Taylor

Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, kara age chicken — like most of the country’s food — is held to an extremely high standard. “It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns Izakaya Seki on V Street NW. “Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken. Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil. Izakaya Seki’s version sticks closely to the formula. Probably. “I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved. The result is a thin, tender coating that’s slightly softer than tempura. The accompanying ponzu sauce lends a tartness to the nubs.

 Izakaya Seki, 1117 V St. NW. 202-588-5841. www.sekidc.com.  

— Holley Simmons

Don’t waste your kimchi-stinking breath asking for more sauce at BonChon. The South Korean fried chicken chain, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications. And why would you want to change anything, really? The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon. Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece. True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines. Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer. Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard.

 BonChon, 1015 Half St. SE and nine other locations in Maryland and Virginia. www.bonchon.com.  

— Holley Simmons

There’s not much agreement on what constitutes Maryland fried chicken. Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak. The pan-fried chicken platter at  Crisfield Seafood is a perfect example of the former style. Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan. This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on. (The chicken is available only Friday through Sunday, and frequently sells out.) The Chesapeake fried chicken at Hank’s Oyster Bar in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy. It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday. 

 Crisfield Seafood, 8012 Georgia Ave., Silver Spring. 301-589-1306. www.crisfieldseafood.com. Hank’s Oyster Bar, 1624 Q St. NW. 202-462-4265; 633 Pennsylvania Ave. SE. 202-733-1971. www.hanksoysterbar.com. 

— Fritz Hahn

Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant. It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam. But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on Central Michel Richard’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever. Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste. It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!) French. Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken. It is, after all, that kind of a place. 

 Central Michel Richard, 1001 Pennsylvania Ave. NW. 202-626-0015. www.centralmichelrichard.com .   



— Maura Judkis



If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women. Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot. Decades later, chefs are latching onto this addictive form of punishment. Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of Reserve 2216 in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville. He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it. Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles. He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back. But not too hard. This is Alexandria, after all. 

 Reserve 2216, 2216 Mount Vernon Ave., Alexandria. 703-549-2889. www.drpreserve.com. 

— Tim Carman



The sole virtue of most fast-food operations is consistency. Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same. The menu at Popeyes follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal). Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion. No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice. The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue. No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain. Once, I got home to discover a clerk had forgotten to pack bread in my bag. I almost cried. Instead, I consoled myself with another piece of chicken. 

 Popeyes has locations throughout the D.C. metro area. www.popeyes.com. 

— Tom Sietsema



The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity. Leave it to Rob Sonderman, pitmaster and co-owner of DCity Smokehouse, to bring dignity back to the bite. His Den-Den — named for co-creator and pitmaster-in-training Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey. Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer. Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch. Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce). No matter. You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat.

 DCity Smokehouse, 8 Florida Ave. NW. 202-733-1919. www.dcitysmokehouse.com.  

— Tim Carman



Hearty is the appetite that can handle Oohh’s and Aahh’s chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers. He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating. Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother.

 Oohh’s and Aahh’s, 1005 U St. NW. 202-667-7142. www.oohhsnaahhs.com. 

— Bonnie S. Benwick



It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy Pop’s Sea Bar in Adams Morgan. The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order. A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99). Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite.

 Pop’s Sea Bar, 1817 Columbia Rd. NW. 202-534-3933. www.popsseabar.com. 

— Bonnie S. Benwick



Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at GBD are a fine meal for adults and children alike. Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice. But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on D.C.’s own mumbo sauce, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter. Ask for the $5.50 Saucetown option to try all nine. 

 GBD, 1323 Connecticut Ave. NW. 202-524-5210. www.gbdchickendoughnuts.com. 

— Margaret Ely



Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast. But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken. And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones. And that is when you grab one of the bar stools at R.J. Cooper’s Gypsy Soul in Fairfax’s Mosaic District. Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic. The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own. 

 Gypsy Soul, 8296 Glass Alley, Fairfax. 703-992-0933. www.gypsysoul-va.com. 

— Fritz Hahn

goingoutguide









Please provide a valid email address. 

The Freddie Gray case 

 Sign up for email updates on the trials.

You’ve signed up for email updates on this story.

Please provide a valid email address. 

  Campaign 2016     Email Updates 

Get the best analysis of the presidential race.

You’ve signed up for email updates on this story.

Please provide a valid email address. 

Get Zika news by email

 We will update you when news breaks about the virus.

You’ve signed up for email updates on this story.

Please provide a valid email address. 

SuperFan Badge

SuperFan badge holders consistently post smart, timely comments about Washington area sports and teams.

More about badges | Request a badge

Culture Connoisseur Badge

Culture Connoisseurs consistently offer thought-provoking, timely comments on the arts, lifestyle and entertainment.

More about badges | Request a badge

Fact Checker Badge

Fact Checkers contribute questions, information and facts to The Fact Checker.

More about badges | Request a badge

Washingtologist Badge

Washingtologists consistently post thought-provoking, timely comments on events, communities, and trends in the Washington area.

More about badges | Request a badge

Post Writer Badge

This commenter is a Washington Post editor, reporter or producer.

Post Forum Badge

Post Forum members consistently offer thought-provoking, timely comments on politics, national and international affairs.

More about badges | Request a badge

Weather Watcher Badge

Weather Watchers consistently offer thought-provoking, timely comments on climates and forecasts.

More about badges | Request a badge

World Watcher Badge

World Watchers consistently offer thought-provoking, timely comments on international affairs.

More about badges | Request a badge

Post Contributor Badge

This commenter is a Washington Post contributor. Post contributors aren’t staff, but may write articles or columns. In some cases, contributors are sources or experts quoted in a story.

More about badges | Request a badge

Post Recommended

Washington Post reporters or editors recommend this comment or reader post.

You must be logged in to report a comment.

You must be logged in to recommend a comment.

Comments our editors find particularly useful or relevant are displayed in Top Comments, as are comments by users with these badges: . Replies to those posts appear here, as well as posts by staff writers.

All comments are posted in the All Comments tab.

To pause and restart automatic updates, click "Live" or "Paused". If paused, you'll be notified of the number of additional comments that have come in.

Play right from this page

 It's 3D Mahjongg- you don't even need to wear 3D glasses! 

 Online crossword. 

 Spider Solitaire is known as the king of all solitaire games! 

 Challenge your crossword skills everyday with a huge variety of puzzles waiting for you to solve. 

goingoutguide









Success! Check your inbox for details.

 Please enter a valid email address 

You might also like: 







See all newsletters

While this allows us to easily traverse the DOM and find specific elements by their id, class, or element type, we still have a lot of cruft in the document. This is where readability-lxml comes in. This library is a Python port of the readability project, written in Ruby and inspired by Instapaper. This code uses readability.js and some other helper functions to extract the main body and even the title of the document you're working with.


In [4]:
import codecs
from readability.readability import Document

def get_paper(path):
    with codecs.open(path, 'r', encoding='utf-8') as f:
        return Document(f.read())

paper = get_paper("nrRB0.html")
print paper.title()


A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post

In [5]:
with codecs.open("nrRB0-clean.html", "w", encoding='utf-8') as f:
    f.write(paper.summary())

Combine readability and BeautifulSoup as follows:


In [6]:
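def get_text(path):
    # Read as UTF-8, matching get_paper above
    with codecs.open(path, 'r', encoding='utf-8') as f:
        paper = Document(f.read())
    # Name the parser explicitly so results don't depend on what is installed
    soup = bs4.BeautifulSoup(paper.summary(), 'lxml')
    output = [paper.title()]
    for p in soup.find_all('p'):
        output.append(p.text)
    return "\n\n".join(output)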

In [7]:
print get_text("nrRB0.html")


A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post



 It’s lowbrow. It’s messy. It could never be accused of being healthful. But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken. Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings. Here are some of the most irresistible. 



‘Rotissi-fried’ chicken at the Partisan

Forget the cronut. Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan. Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes. Why both? “Everything is better once it’s fried in beef fat,” Anda said. We have to agree. Whether white or dark, the meat is succulent throughout. The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially. The sound of it shattering under the knife was music to our ears. And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce. The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations.

 The Partisan, 709 D St. NW. 202-524-5322. www.thepartisandc.com.  

— Becky Krystal

Traditional fried chicken at Family Meal

When Bryan Voltaggio started planning the menu for Family Meal, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken. “It was one of our favorite things,” he says. “It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers. That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu. The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch. After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh. You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist? 

 Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. www.voltfamilymeal.com.  

— John Taylor

Japanese fried chicken at Izakaya Seki

Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, kara age chicken — like most of the country’s food — is held to an extremely high standard. “It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns Izakaya Seki on V Street NW. “Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken. Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil. Izakaya Seki’s version sticks closely to the formula. Probably. “I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved. The result is a thin, tender coating that’s slightly softer than tempura. The accompanying ponzu sauce lends a tartness to the nubs.

 Izakaya Seki, 1117 V St. NW. 202-588-5841. www.sekidc.com.  

— Holley Simmons

Korean fried chicken at BonChon

Don’t waste your kimchi-stinking breath asking for more sauce at BonChon. The South Korean fried chicken chain, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications. And why would you want to change anything, really? The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon. Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece. True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines. Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer. Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard.

 BonChon, 1015 Half St. SE and nine other locations in Maryland and Virginia. www.bonchon.com.  

— Holley Simmons

Maryland fried chicken at Crisfield Seafood and Hank’s Oyster Bar

There’s not much agreement on what constitutes Maryland fried chicken. Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak. The pan-fried chicken platter at  Crisfield Seafood is a perfect example of the former style. Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan. This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on. (The chicken is available only Friday through Sunday, and frequently sells out.) The Chesapeake fried chicken at Hank’s Oyster Bar in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy. It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday. 

 Crisfield Seafood, 8012 Georgia Ave., Silver Spring. 301-589-1306. www.crisfieldseafood.com. Hank’s Oyster Bar, 1624 Q St. NW. 202-462-4265; 633 Pennsylvania Ave. SE. 202-733-1971. www.hanksoysterbar.com. 

— Fritz Hahn

Fancy fried chicken at Central

Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant. It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam. But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on Central Michel Richard’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever. Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste. It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!) French. Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken. It is, after all, that kind of a place. 

 Central Michel Richard, 1001 Pennsylvania Ave. NW. 202-626-0015. www.centralmichelrichard.com .   



— Maura Judkis



Nashville hot chicken at Reserve 2216

If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women. Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot. Decades later, chefs are latching onto this addictive form of punishment. Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of Reserve 2216 in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville. He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it. Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles. He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back. But not too hard. This is Alexandria, after all. 

 Reserve 2216, 2216 Mount Vernon Ave., Alexandria. 703-549-2889. www.drpreserve.com. 

— Tim Carman



Fast-food fried chicken at Popeyes

The sole virtue of most fast-food operations is consistency. Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same. The menu at Popeyes follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal). Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion. No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice. The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue. No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain. Once, I got home to discover a clerk had forgotten to pack bread in my bag. I almost cried. Instead, I consoled myself with another piece of chicken. 

 Popeyes has locations throughout the D.C. metro area. www.popeyes.com. 

— Tom Sietsema



Fried chicken sandwich at DCity Smokehouse

The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity. Leave it to Rob Sonderman, pitmaster and co-owner of DCity Smokehouse, to bring dignity back to the bite. His Den-Den — named for co-creator and pitmaster-in-training Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey. Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer. Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch. Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce). No matter. You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat.

 DCity Smokehouse, 8 Florida Ave. NW. 202-733-1919. www.dcitysmokehouse.com.  

— Tim Carman



Classic D.C. fried chicken at Oohh’s and Aahh’s

Hearty is the appetite that can handle Oohh’s and Aahh’s chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers. He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating. Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother.

 Oohh’s and Aahh’s, 1005 U St. NW. 202-667-7142. www.oohhsnaahhs.com. 

— Bonnie S. Benwick



Popcorn fried chicken at Pop’s Sea Bar

It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy Pop’s Sea Bar in Adams Morgan. The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order. A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99). Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite.

 Pop’s Sea Bar, 1817 Columbia Rd. NW. 202-534-3933. www.popsseabar.com. 

— Bonnie S. Benwick



Fried chicken tenders at GBD

Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at GBD are a fine meal for adults and children alike. Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice. But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on D.C.’s own mumbo sauce, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter. Ask for the $5.50 Saucetown option to try all nine. 

 GBD, 1323 Connecticut Ave. NW. 202-524-5210. www.gbdchickendoughnuts.com. 

— Margaret Ely



Fried chicken skins at Gypsy Soul

Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast. But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken. And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones. And that is when you grab one of the bar stools at R.J. Cooper’s Gypsy Soul in Fairfax’s Mosaic District. Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic. The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own. 

 Gypsy Soul, 8296 Glass Alley, Fairfax. 703-992-0933. www.gypsysoul-va.com. 

— Fritz Hahn

A note on binary formats

To transform PDF documents to XML, the best solution is currently PDFMiner, specifically its pdf2txt.py tool. Note that this tool can output multiple formats such as XML or HTML, which is often better than the direct text export. Because of this, it's often useful to convert PDF to XHTML and then use Readability or BeautifulSoup to extract the text from the document.
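As a rough sketch of the PDF-to-HTML step, the snippet below shells out to PDFMiner's pdf2txt.py script. It assumes the script is installed and on the PATH, and relies on its -o (output file) and -t (output type) options. The function names and file paths are placeholders for illustration, not part of PDFMiner.

```python
import subprocess

def pdf2txt_command(pdf_path, out_path, fmt="html"):
    """Build the pdf2txt.py command line (fmt: 'text', 'html', 'xml' or 'tag')."""
    if fmt not in ("text", "html", "xml", "tag"):
        raise ValueError("unsupported output format: %s" % fmt)
    return ["pdf2txt.py", "-o", out_path, "-t", fmt, pdf_path]

def convert_pdf(pdf_path, out_path, fmt="html"):
    """Run the conversion; raises CalledProcessError if pdf2txt.py fails."""
    subprocess.check_call(pdf2txt_command(pdf_path, out_path, fmt))
    return out_path
```

The HTML file written by convert_pdf can then be handed to a function like get_text above.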

Unfortunately, the conversion from PDF to text is often not great, though statistical methodologies can help ease some of the errors in transformation. If PDFMiner is not sufficient, you can use tools like PyPDF2 to work directly on the PDF file, or write Python code to wrap other tools in Java and C like PDFBox.

Older binary formats like pre-2007 Microsoft Word documents (.doc) require special tools. Again, the best bet is to use Python to call another command line tool like antiword. Newer Microsoft formats (.docx) are actually zipped XML files and can be either unzipped and handled using the XML tools mentioned above, or read using Python packages like python-docx and python-excel.
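Because a .docx file is a zip archive, the standard library alone is enough for a first pass. The sketch below assumes the main body lives in word/document.xml and that text sits in w:t elements inside w:p paragraphs. The function name docx_paragraphs is made up for this example, and it ignores headers, footnotes, and formatting.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs
        text = "".join(t.text or "" for t in p.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

For anything beyond plain paragraph text (tables, styles, embedded objects), a dedicated package such as python-docx is likely the better choice.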

Pattern

The pattern library by the CLiPS lab at the University of Antwerp is designed specifically for language processing of web data and contains a toolkit for fetching data via web APIs: Google, Gmail, Bing, Twitter, Facebook, Wikipedia, and more. It supports HTML DOM parsing and even includes a web crawler!

For example, to ingest Twitter data:


In [11]:
from pattern.web import Twitter, plaintext

In [12]:
twitter = Twitter(language='en')
for tweet in twitter.search("#DataDC", cached=False):
    print tweet.text


RT @MicrosoftR: MT @DataEducationDC: Register for our #rstats special event w/ @RevoJoe 4/12 @WeWork in Dupont: https://t.co/2L75cA781v #da…
RT @wahalulu: More #datadc at #rstatsnyc. @robertvesco @HarlanH https://t.co/1PfLS9w351
RT @wahalulu: More #datadc at #rstatsnyc. @robertvesco @HarlanH https://t.co/1PfLS9w351
More #datadc at #rstatsnyc. @robertvesco @HarlanH https://t.co/1PfLS9w351
RT @wahalulu: Getting the #datadc gang together at #rstatsnyc  @robertvesco https://t.co/qebNLGWFFd
Getting the #datadc gang together at #rstatsnyc  @robertvesco https://t.co/qebNLGWFFd
RT @robertvesco: Screw teddy bears and dolls -&gt; Awesome plush statistical distributions via @NausicaaDist https://t.co/zHJiLQMfba #rstatsny…
RT @tonyojeda3: Natural Language Processing with Python Workshop on 4/9 https://t.co/3gzxbXW2Jw #DataScience #BigData #NLProc #DataDC #DCTe…
Screw teddy bears and dolls -&gt; Awesome plush statistical distributions via @NausicaaDist https://t.co/zHJiLQMfba #rstatsnyc #datadc
Natural Language Processing with Python Workshop on 4/9 https://t.co/FcDwuReMoA #DataScience #BigData #NLProc #DataDC #DCTech #NLTK

Pattern also contains an NLP toolkit for English in the pattern.en module that utilizes statistical approaches and regular expressions. Other supported languages include Spanish, French, Italian, German, and Dutch.

The pattern parser identifies word classes (part-of-speech tagging), performs morphological inflection analysis, and includes a WordNet API for lemmatization.


In [13]:
from pattern.en import parse, parsetree

s = "The man hit the building with a baseball bat."
print parse(s, relations=True, lemmata=True)
print
for clause in parsetree(s):
    for chunk in clause.chunks:
        for word in chunk.words:
            print word,
        print


The/DT/B-NP/O/NP-SBJ-1/the man/NN/I-NP/O/NP-SBJ-1/man hit/VBD/B-VP/O/VP-1/hit the/DT/O/O/O/the building/VBG/B-VP/O/O/build with/IN/B-PP/B-PNP/O/with a/DT/B-NP/I-PNP/O/a baseball/NN/I-NP/I-PNP/O/baseball bat/NN/I-NP/I-PNP/O/bat ././O/O/O/.

Word(u'The/DT') Word(u'man/NN')
Word(u'hit/VBD')
Word(u'building/VBG')
Word(u'with/IN')
Word(u'a/DT') Word(u'baseball/NN') Word(u'bat/NN')
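The tagged string printed by parse() holds one whitespace-separated item per token, with slash-separated fields. Judging from the output above (with relations=True and lemmata=True), the six slots look like word, part-of-speech tag, chunk tag, prepositional-phrase tag, relation, and lemma. The sketch below splits such a string in plain Python. The Token field names and the read_tagged helper are labels chosen for this example, not part of pattern.

```python
from collections import namedtuple

# Field labels are illustrative names for the six slash-separated slots
Token = namedtuple("Token", "word pos chunk pnp relation lemma")

def read_tagged(tagged):
    """Split a pattern-style tagged string into Token tuples."""
    return [Token(*item.split("/")) for item in tagged.split()]

# First three tokens copied from the parse() output above
tagged = ("The/DT/B-NP/O/NP-SBJ-1/the man/NN/I-NP/O/NP-SBJ-1/man "
          "hit/VBD/B-VP/O/VP-1/hit")
tokens = read_tagged(tagged)

# Lemmas of every token tagged as a noun
nouns = [t.lemma for t in tokens if t.pos.startswith("NN")]
```

Reading the full output this way also makes tagging mistakes easy to spot: in the run above, "building" is tagged VBG with the lemma "build", so the parser treated it as a verb rather than a noun.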

The pattern.search module allows you to retrieve N-Grams from text based on phrasal patterns, and can be used to mine dependencies from text, e.g.


In [14]:
from pattern.search import search

s = "The man hit the building with a baseball bat."
pt = parsetree(s, relations=True, lemmata=True)
for match in search('NP VP', pt):
    print match


Match(words=[Word(u'The/DT'), Word(u'man/NN'), Word(u'hit/VBD')])

Lastly, the pattern.vector module has a toolkit for distance-based bag-of-words machine learning, including clustering (K-Means, Hierarchical Clustering) and classification.
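
To make "distance-based bag-of-words" concrete, here is a minimal sketch in plain Python rather than Pattern's own API: each document becomes a word-count vector, and cosine similarity compares two vectors. The function names and the sample sentences are illustrative only.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map lowercased whitespace tokens to their counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = bag_of_words("the man hit the building")
d2 = bag_of_words("the man hit the ball")
d3 = bag_of_words("fried chicken with hot sauce")

print("%0.3f" % cosine_similarity(d1, d2))  # documents sharing most words
print("%0.3f" % cosine_similarity(d1, d3))  # documents sharing no words
```

Clustering and nearest-neighbor classification both reduce to comparing documents with a measure like this one.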

NLTK

Suite of libraries for a variety of academic text processing tasks:

tokenization, stemming, tagging,
chunking, parsing, classification,
language modeling, logical semantics

Pedagogical resources for teaching NLP theory in Python ...

  • Python interface to over 50 corpora and lexical resources
  • Focus on Machine Learning with specific domain knowledge
  • Free and Open Source
  • Numpy and Scipy under the hood
  • Fast and Formal

What is NLTK not?

  • Production ready out of the box*
  • Lightweight
  • Generally applicable
  • Magic

* There are actually a few things that are production ready right out of the box.

The Good Parts:

  • Preprocessing
    • segmentation
    • tokenization
    • PoS tagging
  • Word level processing
    • WordNet
    • Lemmatization
    • Stemming
    • NGram
  • Utilities
    • Tree
    • FreqDist
    • ConditionalFreqDist
  • Streaming CorpusReader objects
  • Classification
    • Maximum Entropy (Megam Algorithm)
    • Naive Bayes
    • Decision Tree
  • Chunking, Named Entity Recognition
  • Parsers Galore!

The Bad Parts:

  • Syntactic Parsing
    • No included grammar (not a black box)
  • Feature/Dependency Parsing
    • No included feature grammar
  • The sem package
    • Toy only (lambda-calculus & first order logic)
  • Lots of extra stuff
    • papers, chat programs, alignments, etc.

In [87]:
import nltk

text = get_text("nrRB0.html")
for idx, s in enumerate(nltk.sent_tokenize(text)): # Segmentation
    words = nltk.wordpunct_tokenize(s)  # Tokenization
    tags  = nltk.pos_tag(words)    # Part of Speech tagging
    print tags
    print
    if idx > 5:
        break


[(u'A', 'DT'), (u'crisp', 'NN'), (u'and', 'CC'), (u'juicy', 'NN'), (u'bucket', 'NN'), (u'list', 'NN'), (u'of', 'IN'), (u'D', 'NNP'), (u'.', '.'), (u'C', 'NNP'), (u'.\u2019', 'NNP'), (u's', 'VBZ'), (u'best', 'JJS'), (u'fried', 'VBN'), (u'chicken', 'NN'), (u'-', ':'), (u'The', 'DT'), (u'Washington', 'NNP'), (u'Post', 'NNP'), (u'It', 'NNP'), (u'\u2019', 'NNP'), (u's', 'VBZ'), (u'lowbrow', 'NN'), (u'.', '.')]

[(u'It', 'PRP'), (u'\u2019', 'VBP'), (u's', 'NNS'), (u'messy', 'JJ'), (u'.', '.')]

[(u'It', 'PRP'), (u'could', 'MD'), (u'never', 'RB'), (u'be', 'VB'), (u'accused', 'VBN'), (u'of', 'IN'), (u'being', 'VBG'), (u'healthful', 'JJ'), (u'.', '.')]

[(u'But', 'CC'), (u'we', 'PRP'), (u'\u2019', 'VBP'), (u'd', 'VBN'), (u'never', 'RB'), (u'let', 'VB'), (u'those', 'DT'), (u'formalities', 'NNS'), (u'get', 'VBP'), (u'between', 'IN'), (u'us', 'PRP'), (u'and', 'CC'), (u'an', 'DT'), (u'order', 'NN'), (u'of', 'IN'), (u'crispy', 'NN'), (u',', ','), (u'crackly', 'RB'), (u',', ','), (u'delicious', 'JJ'), (u'fried', 'JJ'), (u'chicken', 'NN'), (u'.', '.')]

[(u'Whether', 'IN'), (u'it', 'PRP'), (u'comes', 'VBZ'), (u'in', 'IN'), (u'a', 'DT'), (u'bucket', 'NN'), (u'or', 'CC'), (u'on', 'IN'), (u'a', 'DT'), (u'bun', 'NN'), (u',', ','), (u'or', 'CC'), (u'you', 'PRP'), (u'eat', 'VBP'), (u'it', 'PRP'), (u'with', 'IN'), (u'your', 'PRP$'), (u'fingers', 'NNS'), (u'or', 'CC'), (u'chopsticks', 'NNS'), (u',', ','), (u'there', 'EX'), (u'\u2019', ':'), (u's', 'NNS'), (u'a', 'DT'), (u'surprising', 'JJ'), (u'variety', 'NN'), (u'to', 'TO'), (u'the', 'DT'), (u'Washington', 'NNP'), (u'area', 'NN'), (u'\u2019', ':'), (u's', 'NNS'), (u'fried', 'VBD'), (u'chicken', 'VBN'), (u'offerings', 'NNS'), (u'.', '.')]

[(u'Here', 'RB'), (u'are', 'VBP'), (u'some', 'DT'), (u'of', 'IN'), (u'the', 'DT'), (u'most', 'RBS'), (u'irresistible', 'JJ'), (u'.', '.')]

[(u'\u2018', 'NN'), (u'Rotissi', 'NNP'), (u'-', ':'), (u'fried', 'VBD'), (u'\u2019', 'CD'), (u'chicken', 'VBN'), (u'at', 'IN'), (u'the', 'DT'), (u'Partisan', 'NNP'), (u'Forget', 'NNP'), (u'the', 'DT'), (u'cronut', 'NN'), (u'.', '.')]


In [90]:
from nltk import FreqDist
from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

text  = get_text("nrRB0.html")
vocab = FreqDist()
words = FreqDist()
for s in nltk.sent_tokenize(text): 
    for word in nltk.wordpunct_tokenize(s):
        words[word] += 1
        lemma = lemmatizer.lemmatize(word)
        vocab[lemma] += 1

print words
print vocab


<FreqDist with 1072 samples and 3084 outcomes>
<FreqDist with 1032 samples and 3084 outcomes>
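
In the output above, "samples" counts distinct keys and "outcomes" counts every token seen, so lemmatization leaves the 3084 outcomes unchanged while merging 40 word forms. The same bookkeeping can be sketched with `collections.Counter`, which `FreqDist` subclasses in NLTK 3. The tiny `LEMMAS` lookup below is a made-up stand-in for `WordNetLemmatizer`, used only so the example runs without NLTK data.

```python
from collections import Counter

# Hypothetical lemma lookup standing in for WordNetLemmatizer.
LEMMAS = {"buckets": "bucket", "chickens": "chicken", "buns": "bun"}

tokens = "buckets of chicken and chickens on buns or a bun in a bucket".split()

words = Counter(tokens)
vocab = Counter(LEMMAS.get(token, token) for token in tokens)

def summary(counts):
    """Return (samples, outcomes): distinct keys and total tokens counted."""
    return len(counts), sum(counts.values())

print("%d samples, %d outcomes" % summary(words))
print("%d samples, %d outcomes" % summary(vocab))
```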

The first thing you needed to do was create a corpus reader that could read the RSS feeds and their topics by extending one of the built-in corpus readers:


In [16]:
import os
import nltk
import time
import random
import pickle
import string

from bs4 import BeautifulSoup
from nltk.corpus import CategorizedPlaintextCorpusReader

# The first group captures the category folder, docs are any HTML file.
CORPUS_ROOT = './corpus'
DOC_PATTERN = r'(?!\.).*\.html'
CAT_PATTERN = r'([a-z_]+)/.*'

# Specialized Corpus Reader for HTML documents
class CategorizedHTMLCorpusreader(CategorizedPlaintextCorpusReader):
    """
    Reads only the HTML body for the words and strips any tags.
    """

    def _read_word_block(self, stream):
        soup = BeautifulSoup(stream, 'lxml')
        return self._word_tokenizer.tokenize(soup.get_text())

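    def _read_para_block(self, stream):
        soup  = BeautifulSoup(stream, 'lxml')
        paras = []
        # Tokenize the text of each <p> tag; the tokenizers expect strings,
        # not Tag objects. The stream is already consumed by BeautifulSoup,
        # so fall back to the full tag-stripped text when there are no <p> tags.
        ptags = soup.find_all('p')
        piter = [p.get_text() for p in ptags] if ptags else [soup.get_text()]

        for para in piter:
            paras.append([self._word_tokenizer.tokenize(sent)
                          for sent in self._sent_tokenizer.tokenize(para)])

        return paras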

# Create our corpus reader
rss_corpus = CategorizedHTMLCorpusreader(CORPUS_ROOT, DOC_PATTERN,
                    cat_pattern=CAT_PATTERN, encoding='utf-8')

Just to make things easy, I've also included all of the imports at the top of this snippet in case you're just copying and pasting. This should give you a corpus that is easily readable with the following properties:

RSS Corpus contains 5506 files in 11 categories
Vocab: 69642 in 1920455 words for a lexical diversity of 27.576
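
The lexical diversity figure reported here appears to be total words divided by vocabulary size (1920455 / 69642 ≈ 27.576), i.e. the average number of times each distinct word is used. A minimal sketch of that calculation follows; the helper name is hypothetical and the toy token list stands in for `rss_corpus.words()`.

```python
def lexical_diversity(tokens):
    """Average number of occurrences per distinct token."""
    tokens = list(tokens)
    vocab = set(tokens)
    return len(tokens) / float(len(vocab)) if vocab else 0.0

sample = "it is lowbrow it is messy it is fried chicken".split()
print("%0.3f" % lexical_diversity(sample))

# The corpus-level figure quoted above follows from the same ratio.
print("%0.3f" % (1920455 / float(69642)))
```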

The corpus reader above demonstrates a choice I made: overriding the _read_word_block and _read_para_block methods of CategorizedPlaintextCorpusReader. Of course, you could instead have created your own HTMLCorpusReader class that implemented the categorization features.

The next thing to do is to figure out how you will generate your featuresets; I hope that you tried unigrams, bigrams, TF-IDF and others. The simplest approach is a bag of words. Here I have ensured that the bag of words contains no punctuation or stopwords, has been normalized to lowercase, and has been lemmatized to reduce the number of word forms:


In [17]:
# Create feature extractor methodology
def normalize_words(document):
    """
    Expects as input a list of words that make up a document. This will
    yield only lowercase significant words (excluding stopwords and
    punctuation) and will lemmatize all words to ensure that we have word
    forms that are standardized.
    """
    stopwords  = set(nltk.corpus.stopwords.words('english'))
    lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    for token in document:
        token = token.lower()
        if token in string.punctuation: continue
        if token in stopwords: continue
        yield lemmatizer.lemmatize(token)

def document_features(document):
    words = nltk.FreqDist(normalize_words(document))
    feats = {}
    for word in words.keys():
        feats['contains(%s)' % word] = True
    return feats
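
One caveat with `token in string.punctuation`: it is a substring test against ASCII punctuation, so multi-character tokens such as `...` and Unicode marks such as `’` pass through (both show up as features in the classifier output later in this notebook). A stricter check, sketched here as an assumption about what the filter is meant to do, treats a token as punctuation when every character falls in a Unicode punctuation category. `is_punctuation` is a hypothetical helper, not an NLTK function.

```python
import string
import unicodedata

def is_punctuation(token):
    """True when every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(char).startswith("P") for char in token
    )

# Compare the substring test with the per-character test.
for token in [".", "...", u"\u2019", ".)", "chicken", "e-mail"]:
    print("%r %s %s" % (token, token in string.punctuation, is_punctuation(token)))
```

Symbols such as `$` or `+` belong to Unicode symbol categories rather than punctuation, so keeping the original test alongside this one covers both cases.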

You should save the training, devtest and test sets as pickles to disk so that you can work on your classifier without the overhead of regenerating and reshuffling them each time. I went ahead and saved the features to disk, but if you're still developing features you would save only the word lists. Here are the functions for both generating and loading the data sets:


In [19]:
def timeit(func):
    def wrapper(*args, **kwargs):
        start  = time.time()
        result = func(*args, **kwargs)
        delta  = time.time() - start
        return result, delta
    return wrapper

@timeit
def generate_datasets(test_size=550, pickle_dir="."):
    """
    Creates three data sets; a test set and dev test set of 550 documents
    then a training set with the rest of the documents in the corpus. It
    will then write the data sets to disk at the pickle_dir.
    """
    documents = [(document_features(rss_corpus.words(fileid)), category)
                    for category in rss_corpus.categories()
                    for fileid in rss_corpus.fileids(category)]

    random.shuffle(documents)

    datasets = {
        'test':     documents[0:test_size],
        'devtest':  documents[test_size:test_size*2],
        'training': documents[test_size*2:],
    }

    for name, data in datasets.items():
        with open(os.path.join(pickle_dir, name+".pickle"), 'wb') as out:
            pickle.dump(data, out)

def load_datasets(pickle_dir="."):
    """
    Loads the randomly shuffled data sets from their pickles on disk.
    """

    def loader(name):
        path = os.path.join(pickle_dir, name+".pickle")
        with open(path, 'rb') as f:
            data = pickle.load(f)

        return name, data

    return dict(loader(name) for name in ('test', 'devtest', 'training'))

# The timeit decorator shows how many seconds loading from pickles saves you:

_, delta = generate_datasets(pickle_dir='datasets')
print "Took %0.3f seconds to generate datasets" % delta


Took 26.951 seconds to generate datasets
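
The slicing in `generate_datasets` can be tried on a toy list. This sketch also seeds the shuffle through a local `random.Random` instance so the split is repeatable even before pickling. The `split_datasets` name and the `seed` parameter are illustrative additions, not part of the code above.

```python
import random

def split_datasets(documents, test_size, seed=42):
    """Shuffle a copy of documents and cut it into test/devtest/training."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    return {
        "test":     shuffled[:test_size],
        "devtest":  shuffled[test_size:test_size * 2],
        "training": shuffled[test_size * 2:],
    }

docs = [("doc%d" % i, "category") for i in range(20)]
splits = split_datasets(docs, test_size=3)
for name in ("test", "devtest", "training"):
    print("%s %d" % (name, len(splits[name])))
```

Because the three slices never overlap, no document seen during training can leak into the devtest or test sets.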

Last up is building the classifier. I used a maximum entropy classifier with the lemmatized word-level features. Note that I used the MEGAM algorithm to significantly speed up training:


In [20]:
@timeit
def train_classifier(training, path='classifier.pickle'):
    """
    Trains the classifier and saves it to disk.
    """
    classifier = nltk.MaxentClassifier.train(training,
                algorithm='megam', trace=2, gaussian_prior_sigma=1)

    with open(path, 'wb') as out:
        pickle.dump(classifier, out)

    return classifier

datasets = load_datasets(pickle_dir='datasets')
classifier, delta = train_classifier(datasets['training'])
print "trained in %0.3f seconds" % delta

testacc    = nltk.classify.accuracy(classifier, datasets['test']) * 100
print "test accuracy %0.2f%%" % testacc

classifier.show_most_informative_features(30)


[Found megam: /Users/benjamin/bin/megam]
[Found megam: /Users/benjamin/bin/megam]
trained in 133.205 seconds
test accuracy 82.36%
   3.917 contains(comment)==True and label is 'data_science'
   3.599 contains(...)==True and label is 'gaming'
   3.573 contains(data)==True and label is 'data_science'
   3.248 contains(book)==True and label is 'books'
   2.984 contains(wired)==True and label is 'tech'
   2.970 label is 'business'
   2.836 contains(»)==True and label is 'business'
   2.667 contains(game)==True and label is 'gaming'
   2.481 contains(entrepreneur)==True and label is 'business'
  -2.418 label is 'essays'
   2.342 contains(facebook)==True and label is 'tech'
   2.259 contains(read)==True and label is 'tech'
   2.255 contains(...)==True and label is 'cinema'
   2.229 contains(adafruit)==True and label is 'do_it_yourself'
   2.186 contains(recipe)==True and label is 'cooking'
   2.166 contains(film)==True and label is 'cinema'
  -2.101 contains(read)==True and label is 'business'
   2.067 contains(business)==True and label is 'business'
  -1.991 contains(’)==True and label is 'sports'
  -1.957 label is 'do_it_yourself'
   1.953 contains(nfl)==True and label is 'sports'
   1.922 contains(sweet)==True and label is 'cooking'
   1.913 contains(...))==True and label is 'design'
   1.860 contains(steam)==True and label is 'gaming'
   1.839 contains(dish)==True and label is 'cooking'
   1.834 contains(mail)==True and label is 'books'
   1.769 contains(cup)==True and label is 'sports'
   1.766 contains(e)==True and label is 'books'
   1.731 contains(appeared)==True and label is 'tech'
   1.728 contains(.»)==True and label is 'books'
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object find_file_iter at 0x1246070a0> ignored
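
A single accuracy number hides which categories get confused. As a sketch of the kind of error analysis the devtest set is meant for, the helper below tallies accuracy per gold label from (gold, predicted) pairs. The function name and the toy pairs are hypothetical; with the classifier above, the pairs would presumably come from something like `(label, classifier.classify(feats))` for each item in `datasets['devtest']`.

```python
from collections import defaultdict

def accuracy_by_label(pairs):
    """Map each gold label to the fraction of its items predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, predicted in pairs:
        total[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    return {label: correct[label] / float(total[label]) for label in total}

# Made-up (gold, predicted) pairs for illustration only.
pairs = [
    ("cooking", "cooking"), ("cooking", "cooking"),
    ("tech", "tech"), ("tech", "business"),
    ("business", "tech"), ("business", "business"),
]
for label, acc in sorted(accuracy_by_label(pairs).items()):
    print("%-10s %0.2f" % (label, acc))
```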

In [21]:
from operator import itemgetter

def classify(text, explain=False):
    
    classifier = None
    with open('classifier.pickle', 'rb') as f:
        classifier = pickle.load(f)
    
    document = nltk.wordpunct_tokenize(text)
    features = document_features(document)
    
    pd = classifier.prob_classify(features)
    for result in sorted([(s,pd.prob(s)) for s in pd.samples()], key=itemgetter(1), reverse=True):
        print "%s: %0.4f" % result

    print
    if explain:
        classifier.explain(features)

classify(get_text("nrRB0.html"), True)


cooking: 1.0000
essays: 0.0000
books: 0.0000
do_it_yourself: 0.0000
gaming: 0.0000
design: 0.0000
tech: 0.0000
cinema: 0.0000
data_science: 0.0000
sports: 0.0000
business: 0.0000

  Feature                                          cooking  essays   books do_it_y
  --------------------------------------------------------------------------------
  contains(recipe)==True (1)                         2.186
  contains(dish)==True (1)                           1.839
  contains(food)==True (1)                           1.073
  contains(chef)==True (1)                           1.026
  contains(classic)==True (1)                        0.980
  contains(spicy)==True (1)                          0.881
  contains(flavor)==True (1)                         0.853
  contains(stuffed)==True (1)                        0.845
  contains(served)==True (1)                         0.841
  contains(bar)==True (1)                            0.823
  contains(fresh)==True (1)                          0.787
  contains(tomato)==True (1)                         0.783
  contains(cooking)==True (1)                        0.729
  contains(friday)==True (1)                         0.717
  contains(delicious)==True (1)                      0.690
  contains(sauce)==True (1)                          0.649
  contains(oil)==True (1)                            0.609
  contains(—)==True (1)                              0.583
  contains(meal)==True (1)                           0.580
  contains(topped)==True (1)                         0.577
  contains(combination)==True (1)                    0.567
  contains(spiced)==True (1)                         0.562
  contains(grilled)==True (1)                        0.557
  contains(without)==True (1)                        0.524
  contains(pickle)==True (1)                         0.522
  contains(made)==True (1)                           0.500
  contains(meat)==True (1)                           0.500
  contains(kitchen)==True (1)                        0.499
  contains().)==True (1)                             0.477
  contains(perfect)==True (1)                        0.476
  contains(filling)==True (1)                        0.458
  contains(pepper)==True (1)                         0.436
  contains(time)==True (1)                          -0.429
  contains(crisp)==True (1)                          0.429
  contains(dinner)==True (1)                         0.424
  contains(kimchi)==True (1)                         0.419
  contains(home)==True (1)                           0.406
  contains(seasoning)==True (1)                      0.387
  contains(favorite)==True (1)                       0.373
  contains(roasted)==True (1)                        0.369
  contains(butter)==True (1)                         0.353
  contains(restaurant)==True (1)                     0.349
  contains(bite)==True (1)                           0.345
  contains(year)==True (1)                          -0.344
  contains(corn)==True (1)                           0.343
  label is 'cooking' (1)                            -0.337
  contains(almost)==True (1)                         0.329
  contains(say)==True (1)                           -0.328
  contains(fish)==True (1)                           0.326
  contains(come)==True (1)                           0.298
  contains(spring)==True (1)                         0.295
  contains(secret)==True (1)                         0.294
  contains(taken)==True (1)                          0.292
  contains(potato)==True (1)                         0.292
  contains(variation)==True (1)                      0.291
  contains(hot)==True (1)                            0.289
  contains(.”)==True (1)                             0.288
  contains(onion)==True (1)                          0.286
  contains(side)==True (1)                           0.286
  contains(satisfying)==True (1)                     0.285
  contains(change)==True (1)                        -0.283
  contains(pack)==True (1)                           0.281
  contains(’)==True (1)                             -0.281
  contains(salt)==True (1)                           0.281
  contains(garlic)==True (1)                         0.280
  contains(chicken)==True (1)                        0.279
  contains(seafood)==True (1)                        0.277
  contains(cayenne)==True (1)                        0.274
  contains(country)==True (1)                        0.270
  contains(version)==True (1)                        0.269
  contains(mixture)==True (1)                        0.267
  contains(juice)==True (1)                          0.267
  contains(used)==True (1)                           0.260
  contains(often)==True (1)                          0.259
  contains(traditional)==True (1)                    0.250
  contains(temperature)==True (1)                    0.249
  contains(house)==True (1)                          0.247
  contains(taste)==True (1)                          0.244
  contains(snack)==True (1)                          0.241
  contains(plate)==True (1)                          0.236
  contains(good)==True (1)                           0.233
  contains(fried)==True (1)                          0.227
  contains(beauty)==True (1)                         0.226
  contains(lettuce)==True (1)                        0.226
  contains(lunch)==True (1)                          0.224
  contains(right)==True (1)                          0.224
  contains(flour)==True (1)                          0.223
  contains(style)==True (1)                          0.222
  contains(kind)==True (1)                           0.221
  contains(‘)==True (1)                             -0.221
  contains(risotto)==True (1)                        0.215
  contains(roll)==True (1)                           0.214
  contains(probably)==True (1)                       0.213
  contains(open)==True (1)                          -0.211
  contains(white)==True (1)                          0.210
  contains(5)==True (1)                             -0.208
  contains(variety)==True (1)                        0.208
  contains(new)==True (1)                           -0.207
  contains(want)==True (1)                          -0.202
  contains(com)==True (1)                           -0.202
  contains(warm)==True (1)                           0.200
  contains(piece)==True (1)                         -0.199
  contains(slathered)==True (1)                      0.199
  contains(crispy)==True (1)                         0.199
  contains(best)==True (1)                           0.197
  contains(street)==True (1)                         0.195
  contains(menu)==True (1)                           0.194
  contains(much)==True (1)                           0.192
  contains(head)==True (1)                           0.192
  contains(e)==True (1)                             -0.190
  contains(.))==True (1)                             0.184
  contains(need)==True (1)                          -0.184
  contains(8)==True (1)                             -0.183
  contains(leave)==True (1)                          0.179
  contains(le)==True (1)                            -0.178
  contains(hour)==True (1)                          -0.176
  contains(could)==True (1)                         -0.176
  contains(child)==True (1)                         -0.174
  contains(woman)==True (1)                         -0.174
  contains(”)==True (1)                             -0.174
  contains(steak)==True (1)                          0.174
  contains(list)==True (1)                           0.171
  contains(slightly)==True (1)                       0.170
  contains(put)==True (1)                            0.169
  contains(option)==True (1)                        -0.166
  contains(former)==True (1)                        -0.163
  contains(may)==True (1)                            0.163
  contains(know)==True (1)                          -0.162
  contains(dark)==True (1)                           0.160
  contains(name)==True (1)                          -0.159
  contains(hand)==True (1)                           0.158
  contains(created)==True (1)                       -0.157
  contains(throughout)==True (1)                     0.157
  contains(month)==True (1)                         -0.155
  contains(one)==True (1)                            0.153
  contains(miss)==True (1)                          -0.152
  contains(including)==True (1)                      0.151
  contains(batter)==True (1)                         0.151
  contains(shore)==True (1)                          0.151
  contains(hard)==True (1)                          -0.150
  contains(credit)==True (1)                         0.150
  contains(really)==True (1)                         0.150
  contains(area)==True (1)                          -0.149
  contains(anything)==True (1)                       0.149
  contains(another)==True (1)                       -0.149
  contains(suit)==True (1)                           0.148
  contains(pas)==True (1)                            0.147
  contains(leaf)==True (1)                           0.146
  contains(tender)==True (1)                         0.145
  contains(honey)==True (1)                          0.145
  contains(crust)==True (1)                          0.145
  contains(oyster)==True (1)                         0.142
  contains(part)==True (1)                           0.142
  contains(begin)==True (1)                         -0.142
  contains(v)==True (1)                             -0.142
  contains(u)==True (1)                              0.142
  contains(quite)==True (1)                          0.141
  contains(deep)==True (1)                           0.138
  contains(discover)==True (1)                      -0.137
  contains(beer)==True (1)                           0.137
  contains(,”)==True (1)                             0.135
  contains(run)==True (1)                           -0.134
  contains(grab)==True (1)                           0.132
  contains(spice)==True (1)                          0.131
  contains(sea)==True (1)                            0.131
  contains(family)==True (1)                         0.130
  contains(review)==True (1)                        -0.129
  contains(blend)==True (1)                          0.129
  contains(easy)==True (1)                           0.129
  contains(whether)==True (1)                       -0.128
  contains(modern)==True (1)                        -0.128
  contains(co)==True (1)                            -0.128
  contains(others)==True (1)                        -0.128
  contains(standard)==True (1)                       0.126
  contains(taking)==True (1)                        -0.125
  contains(like)==True (1)                           0.125
  contains(spend)==True (1)                         -0.124
  contains(free)==True (1)                           0.123
  contains(detail)==True (1)                         0.122
  contains(dose)==True (1)                           0.122
  contains(wanted)==True (1)                         0.121
  contains(pan)==True (1)                            0.119
  contains(thin)==True (1)                           0.119
  contains(maybe)==True (1)                         -0.118
  contains(person)==True (1)                        -0.118
  contains(spent)==True (1)                          0.118
  contains(longer)==True (1)                        -0.118
  contains(line)==True (1)                           0.117
  contains(french)==True (1)                        -0.117
  contains(better)==True (1)                        -0.117
  contains(self)==True (1)                          -0.116
  contains(memorable)==True (1)                      0.116
  contains(clean)==True (1)                         -0.114
  contains(true)==True (1)                          -0.113
  contains(founder)==True (1)                       -0.113
  contains(started)==True (1)                        0.113
  contains(frequently)==True (1)                     0.112
  contains(keep)==True (1)                          -0.112
  contains(age)==True (1)                           -0.111
  contains(picnic)==True (1)                         0.111
  contains(fat)==True (1)                            0.110
  contains(fall)==True (1)                           0.109
  contains(happy)==True (1)                         -0.108
  contains(simply)==True (1)                         0.108
  contains(leftover)==True (1)                       0.107
  contains(getting)==True (1)                       -0.106
  contains(),)==True (1)                             0.106
  contains(fry)==True (1)                            0.104
  contains(offer)==True (1)                          0.104
  contains(j)==True (1)                             -0.102
  contains(diner)==True (1)                          0.102
  contains(competition)==True (1)                    0.102
  contains(plus)==True (1)                          -0.101
  contains(held)==True (1)                           0.100
  contains(important)==True (1)                     -0.100
  contains(spoon)==True (1)                          0.099
  contains(serving)==True (1)                        0.099
  contains(set)==True (1)                           -0.098
  contains(rose)==True (1)                           0.098
  contains(basically)==True (1)                      0.098
  contains(try)==True (1)                            0.098
  contains(bag)==True (1)                            0.097
  contains(although)==True (1)                      -0.096
  contains(degree)==True (1)                         0.095
  contains(half)==True (1)                          -0.094
  contains(thing)==True (1)                         -0.093
  contains(american)==True (1)                      -0.093
  contains(waste)==True (1)                          0.092
  contains(seasoned)==True (1)                       0.091
  contains(eat)==True (1)                            0.090
  contains(place)==True (1)                         -0.090
  contains(pain)==True (1)                           0.090
  contains(sandwich)==True (1)                       0.089
  contains(well)==True (1)                           0.089
  contains(mousse)==True (1)                         0.088
  contains(feel)==True (1)                           0.086
  contains(strip)==True (1)                          0.086
  contains(process)==True (1)                       -0.086
  contains(real)==True (1)                          -0.086
  contains(washington)==True (1)                     0.085
  contains(combine)==True (1)                        0.085
  contains(within)==True (1)                        -0.085
  contains(never)==True (1)                         -0.085
  contains(24)==True (1)                            -0.084
  contains(term)==True (1)                          -0.084
  contains(rendered)==True (1)                       0.083
  contains(also)==True (1)                           0.083
  contains(buy)==True (1)                            0.082
  contains(whole)==True (1)                          0.081
  contains(build)==True (1)                         -0.081
  contains(result)==True (1)                        -0.080
  contains(using)==True (1)                          0.080
  contains(appreciate)==True (1)                     0.080
  contains(thought)==True (1)                        0.080
  contains(store)==True (1)                         -0.079
  contains(50)==True (1)                            -0.077
  contains(base)==True (1)                           0.077
  contains(become)==True (1)                        -0.076
  contains(john)==True (1)                          -0.076
  contains(sometimes)==True (1)                      0.076
  contains(layer)==True (1)                          0.075
  contains(japanese)==True (1)                      -0.074
  contains(beef)==True (1)                           0.073
  contains(snap)==True (1)                           0.073
  contains(would)==True (1)                          0.073
  contains(dipping)==True (1)                        0.072
  contains(trick)==True (1)                         -0.072
  contains(care)==True (1)                          -0.072
  contains(everything)==True (1)                     0.071
  contains(crumb)==True (1)                          0.071
  contains(single)==True (1)                        -0.071
  contains(element)==True (1)                       -0.071
  contains(near)==True (1)                          -0.070
  contains(2002)==True (1)                           0.070
  contains(gravy)==True (1)                          0.070
  contains(found)==True (1)                          0.070
  contains(music)==True (1)                         -0.070
  contains(exclusively)==True (1)                    0.070
  contains(wonderfully)==True (1)                    0.070
  contains(hearty)==True (1)                         0.069
  contains(12)==True (1)                            -0.068
  contains(decade)==True (1)                        -0.068
  contains(ask)==True (1)                           -0.068
  contains(dusted)==True (1)                         0.068
  contains(driving)==True (1)                       -0.067
  contains(passion)==True (1)                        0.066
  contains(!))==True (1)                             0.065
  contains(order)==True (1)                          0.064
  contains(refined)==True (1)                        0.063
  contains(luxury)==True (1)                        -0.063
  contains(example)==True (1)                       -0.062
  contains(take)==True (1)                           0.062
  contains(however)==True (1)                       -0.061
  contains(item)==True (1)                          -0.061
  contains(prevent)==True (1)                       -0.061
  contains(stock)==True (1)                          0.059
  contains(kid)==True (1)                           -0.059
  contains(bring)==True (1)                         -0.059
  contains(25)==True (1)                            -0.059
  contains(black)==True (1)                          0.058
  contains(wing)==True (1)                           0.058
  contains(basket)==True (1)                         0.058
  contains(cilantro)==True (1)                       0.057
  contains(tom)==True (1)                            0.057
  contains(mashed)==True (1)                         0.056
  contains(fork)==True (1)                           0.056
  contains(later)==True (1)                         -0.056
  contains(fan)==True (1)                           -0.056
  contains(lends)==True (1)                          0.055
  contains(iron)==True (1)                          -0.054
  contains(alike)==True (1)                          0.054
  contains(fancy)==True (1)                          0.054
  contains(soul)==True (1)                          -0.054
  contains(signature)==True (1)                      0.053
  contains(needed)==True (1)                        -0.053
  contains(training)==True (1)                      -0.053
  contains(location)==True (1)                      -0.052
  contains(attached)==True (1)                      -0.052
  contains(moist)==True (1)                          0.051
  contains(sold)==True (1)                          -0.051
  contains(frank)==True (1)                         -0.051
  contains(actually)==True (1)                      -0.050
  contains(minute)==True (1)                        -0.050
  contains(subtle)==True (1)                        -0.050
  contains(pleasure)==True (1)                      -0.049
  contains(plain)==True (1)                          0.049
  contains(cut)==True (1)                           -0.049
  contains(believe)==True (1)                        0.049
  contains(mean)==True (1)                           0.049
  contains(enough)==True (1)                         0.048
  contains(bay)==True (1)                           -0.047
  contains(excursion)==True (1)                      0.047
  contains(pool)==True (1)                          -0.046
  contains(heavy)==True (1)                          0.046
  contains(brand)==True (1)                          0.045
  contains(c)==True (1)                              0.045
  contains(south)==True (1)                          0.045
  contains(offering)==True (1)                      -0.045
  contains(crunch)==True (1)                         0.044
  contains(generous)==True (1)                       0.044
  contains(post)==True (1)                          -0.044
  contains(seems)==True (1)                          0.044
  contains(sound)==True (1)                         -0.044
  contains(dip)==True (1)                            0.043
  contains(st)==True (1)                             0.043
  contains(fine)==True (1)                          -0.043
  contains(sell)==True (1)                           0.043
  contains(lot)==True (1)                            0.042
  contains(owner)==True (1)                          0.042
  contains(paid)==True (1)                           0.042
  contains(nine)==True (1)                           0.042
  contains(juicy)==True (1)                          0.042
  contains(matter)==True (1)                         0.041
  contains(technically)==True (1)                   -0.041
  contains(kara)==True (1)                           0.041
  contains(chili)==True (1)                         -0.040
  contains(finished)==True (1)                       0.040
  contains(toward)==True (1)                        -0.040
  contains(georgia)==True (1)                        0.040
  contains(teeth)==True (1)                          0.040
  contains(form)==True (1)                           0.040
  contains(old)==True (1)                           -0.040
  contains(newest)==True (1)                        -0.039
  contains(twice)==True (1)                          0.039
  contains(exterior)==True (1)                      -0.039
  contains(dedicated)==True (1)                      0.039
  contains(even)==True (1)                           0.039
  contains(father)==True (1)                        -0.039
  contains(involved)==True (1)                      -0.038
  contains(district)==True (1)                      -0.038
  contains(adult)==True (1)                         -0.037
  contains(count)==True (1)                         -0.037
  contains(resist)==True (1)                        -0.037
  contains(pour)==True (1)                          -0.037
  contains(sure)==True (1)                           0.036
  contains(available)==True (1)                     -0.036
  contains(bone)==True (1)                          -0.036
  contains(sunday)==True (1)                        -0.036
  contains(joint)==True (1)                         -0.036
  contains(surprising)==True (1)                    -0.035
  contains(frying)==True (1)                        -0.035
  contains(planning)==True (1)                      -0.034
  contains(long)==True (1)                           0.034
  contains(popular)==True (1)                        0.033
  contains(agree)==True (1)                         -0.033
  contains(taylor)==True (1)                         0.033
  contains(florida)==True (1)                       -0.032
  contains(testing)==True (1)                       -0.032
  contains(stacked)==True (1)                       -0.032
  contains(hold)==True (1)                          -0.032
  contains(deal)==True (1)                          -0.032
  contains(10)==True (1)                             0.032
  contains(remains)==True (1)                        0.031
  contains(golden)==True (1)                        -0.031
  contains(obsession)==True (1)                     -0.031
  contains(follows)==True (1)                       -0.031
  contains(wheat)==True (1)                         -0.031
  contains(presentation)==True (1)                  -0.030
  contains(exact)==True (1)                          0.030
  contains(.’)==True (1)                             0.030
  contains(marriage)==True (1)                      -0.030
  contains(let)==True (1)                           -0.029
  contains(crack)==True (1)                         -0.029
  contains(go)==True (1)                            -0.029
  contains(“)==True (1)                              0.029
  contains(back)==True (1)                          -0.029
  contains(salty)==True (1)                          0.028
  contains(ever)==True (1)                          -0.028
  contains(central)==True (1)                       -0.028
  contains(fully)==True (1)                         -0.027
  contains(protein)==True (1)                        0.027
  contains(creator)==True (1)                       -0.027
  contains(perfectly)==True (1)                     -0.026
  contains(q)==True (1)                              0.026
  contains(relegated)==True (1)                      0.026
  contains(developed)==True (1)                      0.026
  contains(fillet)==True (1)                         0.025
  contains(nashville)==True (1)                      0.025
  contains(southern)==True (1)                       0.024
  contains(yard)==True (1)                          -0.024
  contains(proper)==True (1)                        -0.024
  contains(effect)==True (1)                        -0.024
  contains(finger)==True (1)                        -0.023
  contains(brushed)==True (1)                       -0.023
  contains(word)==True (1)                           0.022
  contains(hybrid)==True (1)                        -0.022
  contains(high)==True (1)                          -0.022
  contains(stuff)==True (1)                         -0.021
  contains(circle)==True (1)                        -0.021
  contains(($)==True (1)                            -0.021
  contains(two)==True (1)                           -0.021
  contains(inspired)==True (1)                       0.021
  contains(bucket)==True (1)                        -0.021
  contains(outer)==True (1)                         -0.021
  contains(rather)==True (1)                        -0.021
  contains(unless)==True (1)                         0.020
  contains(essentially)==True (1)                    0.020
  contains(forget)==True (1)                        -0.020
  contains(local)==True (1)                          0.020
  contains(spin)==True (1)                          -0.020
  contains(resulting)==True (1)                      0.020
  contains(ordering)==True (1)                      -0.019
  contains(skin)==True (1)                          -0.019
  contains(pop)==True (1)                            0.019
  contains(chain)==True (1)                          0.019
  contains(allow)==True (1)                          0.019
  contains(knife)==True (1)                          0.019
  contains(everywhere)==True (1)                    -0.018
  contains(roof)==True (1)                          -0.018
  contains(big)==True (1)                           -0.018
  contains(onto)==True (1)                          -0.018
  contains(addictive)==True (1)                      0.018
  contains(describe)==True (1)                      -0.018
  contains(end)==True (1)                           -0.017
  contains(three)==True (1)                         -0.017
  contains(silver)==True (1)                         0.017
  contains(paper)==True (1)                         -0.017
  contains(excuse)==True (1)                        -0.017
  contains(mount)==True (1)                         -0.017
  contains(jersey)==True (1)                        -0.017
  contains(virginia)==True (1)                       0.017
  contains(generously)==True (1)                    -0.017
  contains(exchange)==True (1)                      -0.016
  contains(dunk)==True (1)                          -0.016
  contains(ear)==True (1)                           -0.016
  contains(.,)==True (1)                             0.016
  contains(creating)==True (1)                       0.016
  contains(northeast)==True (1)                      0.016
  contains(cart)==True (1)                           0.015
  contains(bird)==True (1)                          -0.015
  contains(extremely)==True (1)                     -0.015
  contains(14)==True (1)                            -0.015
  contains(reserve)==True (1)                       -0.015
  contains(named)==True (1)                         -0.014
  contains(shine)==True (1)                         -0.014
  contains(cast)==True (1)                          -0.014
  contains(reveal)==True (1)                        -0.014
  contains(asking)==True (1)                        -0.014
  contains(agreement)==True (1)                     -0.014
  contains(barbecue)==True (1)                       0.014
  contains(favor)==True (1)                         -0.013
  contains(glass)==True (1)                         -0.013
  contains(breath)==True (1)                        -0.013
  contains(9)==True (1)                              0.012
  contains(though)==True (1)                        -0.012
  contains(yield)==True (1)                          0.012
  contains(cajun)==True (1)                          0.012
  contains(commitment)==True (1)                    -0.012
  contains(since)==True (1)                          0.012
  contains(something)==True (1)                     -0.011
  contains(accessible)==True (1)                     0.011
  contains(sink)==True (1)                          -0.011
  contains(ultra)==True (1)                         -0.011
  contains(enjoyed)==True (1)                       -0.011
  contains(top)==True (1)                           -0.011
  contains(argue)==True (1)                          0.010
  contains(tartness)==True (1)                      -0.010
  contains(closely)==True (1)                       -0.010
  contains(brine)==True (1)                          0.010
  contains(dig)==True (1)                           -0.009
  contains(korean)==True (1)                        -0.009
  contains(low)==True (1)                           -0.009
  contains(instead)==True (1)                        0.009
  contains(get)==True (1)                           -0.009
  contains(portion)==True (1)                       -0.009
  contains(99)==True (1)                             0.009
  contains(double)==True (1)                        -0.009
  contains(margaret)==True (1)                       0.009
  contains(ala)==True (1)                           -0.008
  contains(consistency)==True (1)                    0.008
  contains(mosaic)==True (1)                        -0.008
  contains(owns)==True (1)                          -0.008
  contains(size)==True (1)                           0.007
  contains(occasional)==True (1)                    -0.007
  contains(said)==True (1)                           0.007
  contains(appetite)==True (1)                      -0.007
  contains(healthful)==True (1)                      0.007
  contains(bath)==True (1)                          -0.007
  contains(alley)==True (1)                         -0.006
  contains(coated)==True (1)                         0.006
  contains(tending)==True (1)                        0.006
  contains(tongue)==True (1)                        -0.006
  contains(dad)==True (1)                           -0.006
  contains(learned)==True (1)                       -0.006
  contains(famed)==True (1)                         -0.006
  contains(fast)==True (1)                           0.005
  contains(messy)==True (1)                         -0.005
  contains(atop)==True (1)                          -0.005
  contains(common)==True (1)                         0.005
  contains(outright)==True (1)                       0.005
  contains(distributed)==True (1)                   -0.005
  contains(ave)==True (1)                            0.005
  contains(finish)==True (1)                        -0.004
  contains(formula)==True (1)                        0.004
  contains(method)==True (1)                         0.004
  contains(foam)==True (1)                          -0.004
  contains(thigh)==True (1)                         -0.004
  contains(mac)==True (1)                           -0.004
  contains(biscuit)==True (1)                        0.004
  contains(ounce)==True (1)                          0.004
  contains(stick)==True (1)                         -0.003
  contains(guilty)==True (1)                        -0.003
  contains(call)==True (1)                           0.003
  contains(bread)==True (1)                          0.003
  contains(crushed)==True (1)                        0.003
  contains(del)==True (1)                           -0.003
  contains(commonly)==True (1)                       0.003
  contains(east)==True (1)                          -0.003
  contains(pressure)==True (1)                      -0.003
  contains(got)==True (1)                            0.003
  contains(accent)==True (1)                         0.003
  contains(ubiquitous)==True (1)                     0.002
  contains(upscale)==True (1)                       -0.002
  contains(deeply)==True (1)                        -0.002
  contains(n)==True (1)                              0.002
  contains(softer)==True (1)                         0.002
  contains(handle)==True (1)                         0.002
  contains(preparation)==True (1)                    0.002
  contains(sliver)==True (1)                        -0.002
  contains(turned)==True (1)                        -0.002
  contains(every)==True (1)                          0.002
  contains(brined)==True (1)                         0.002
  contains(savor)==True (1)                          0.001
  contains(popcorn)==True (1)                       -0.001
  contains(convenience)==True (1)                   -0.001
  contains(fare)==True (1)                          -0.001
  contains(tartare)==True (1)                        0.001
  contains(bum)==True (1)                           -0.001
  contains(flesh)==True (1)                         -0.001
  contains(se)==True (1)                            -0.001
  contains(perfecting)==True (1)                     0.001
  contains(cornmeal)==True (1)                       0.001
  contains(irresistible)==True (1)                   0.001
  contains(called)==True (1)                         0.001
  contains(tablecloth)==True (1)                     0.001
  contains(soy)==True (1)                            0.000
  contains(buttermilk)==True (1)                     0.000
  contains(chipotle)==True (1)                       0.000
  contains(starch)==True (1)                        -0.000
  contains(empty)==True (1)                         -0.000
  contains(translucent)==True (1)                    0.000
  contains(bun)==True (1)                            0.000
  contains(inevitably)==True (1)                    -0.000
  contains(rd)==True (1)                             0.000
  contains(dreamed)==True (1)                        0.000
  contains(platter)==True (1)                        0.000
  contains(ranch)==True (1)                         -0.000
  contains(paprika)==True (1)                        0.000
  contains(grandmother)==True (1)                    0.000
  contains(dredged)==True (1)                        0.000
  label is 'essays' (1)                                     -2.418
  contains(’)==True (1)                                      0.767
  contains(.”)==True (1)                                     0.597
  contains(“)==True (1)                                      0.518
  contains(u)==True (1)                                      0.451
  contains(term)==True (1)                                   0.442
  contains(—)==True (1)                                      0.413
  contains(tom)==True (1)                                    0.385
  contains(toward)==True (1)                                 0.369
  contains(best)==True (1)                                   0.337
  contains(year)==True (1)                                  -0.326
  contains(crime)==True (1)                                  0.285
  contains(almost)==True (1)                                 0.284
  contains(want)==True (1)                                  -0.282
  contains(new)==True (1)                                   -0.281
  contains(change)==True (1)                                 0.270
  contains(5)==True (1)                                      0.269
  contains(old)==True (1)                                    0.268
  contains(child)==True (1)                                  0.262
  contains(good)==True (1)                                  -0.260
  contains(time)==True (1)                                   0.255
  contains(go)==True (1)                                     0.254
  contains(joint)==True (1)                                  0.247
  contains(fat)==True (1)                                    0.244
  contains(effect)==True (1)                                 0.243
  contains(created)==True (1)                                0.243
  contains(said)==True (1)                                  -0.241
  contains(name)==True (1)                                   0.238
  contains(american)==True (1)                               0.236
  contains(one)==True (1)                                    0.222
  contains(lot)==True (1)                                   -0.215
  contains(home)==True (1)                                  -0.214
  contains(soul)==True (1)                                   0.212
  contains(know)==True (1)                                   0.212
  contains(never)==True (1)                                  0.211
  contains(part)==True (1)                                  -0.211
  contains(perfect)==True (1)                                0.206
  contains(unless)==True (1)                                 0.199
  contains(let)==True (1)                                   -0.198
  contains(back)==True (1)                                  -0.191
  contains(three)==True (1)                                  0.189
  contains(10)==True (1)                                     0.187
  contains(black)==True (1)                                  0.175
  contains(kid)==True (1)                                    0.173
  contains(open)==True (1)                                  -0.173
  contains(butter)==True (1)                                 0.172
  contains(hour)==True (1)                                   0.170
  contains(ever)==True (1)                                   0.169
  contains(say)==True (1)                                    0.167
  contains(.’)==True (1)                                     0.165
  contains(free)==True (1)                                   0.163
  contains(made)==True (1)                                  -0.161
  contains(actually)==True (1)                              -0.160
  contains(better)==True (1)                                -0.158
  contains(person)==True (1)                                -0.156
  contains(need)==True (1)                                  -0.155
  contains(much)==True (1)                                  -0.153
  contains(form)==True (1)                                   0.153
  contains(street)==True (1)                                 0.153
  contains(asking)==True (1)                                 0.149
  contains(using)==True (1)                                 -0.148
  contains(whole)==True (1)                                  0.146
  contains(oil)==True (1)                                    0.145
  contains(adult)==True (1)                                  0.142
  contains(list)==True (1)                                   0.141
  contains(subtle)==True (1)                                 0.141
  contains(exchange)==True (1)                               0.140
  contains(take)==True (1)                                  -0.140
  contains(like)==True (1)                                   0.139
  contains(dad)==True (1)                                    0.138
  contains(two)==True (1)                                    0.138
  contains(secret)==True (1)                                 0.136
  contains(boardwalk)==True (1)                              0.136
  contains(spend)==True (1)                                  0.131
  contains(bird)==True (1)                                   0.130
  contains(sometimes)==True (1)                             -0.129
  contains(come)==True (1)                                   0.129
  contains(even)==True (1)                                  -0.126
  contains(outer)==True (1)                                  0.125
  contains(waffle)==True (1)                                 0.123
  contains(thought)==True (1)                                0.121
  contains(father)==True (1)                                 0.121
  contains(house)==True (1)                                  0.121
  contains(knife)==True (1)                                  0.121
  contains(big)==True (1)                                   -0.121
  contains(high)==True (1)                                   0.118
  contains(enough)==True (1)                                -0.118
  contains(”)==True (1)                                     -0.117
  contains(.))==True (1)                                     0.116
  contains(become)==True (1)                                -0.115
  contains(le)==True (1)                                    -0.114
  contains(‘)==True (1)                                      0.109
  contains(pennsylvania)==True (1)                           0.107
  contains(store)==True (1)                                  0.106
  contains(something)==True (1)                             -0.103
  contains().)==True (1)                                    -0.101
  contains(word)==True (1)                                   0.101
  contains(describe)==True (1)                               0.099
  contains(central)==True (1)                                0.098
  contains(columbia)==True (1)                               0.097
  contains(line)==True (1)                                   0.096
  contains(john)==True (1)                                   0.094
  contains(style)==True (1)                                 -0.093
  contains(8)==True (1)                                     -0.093
  contains(roof)==True (1)                                   0.091
  contains(virginia)==True (1)                               0.091
  contains(quite)==True (1)                                 -0.091
  contains(really)==True (1)                                 0.089
  contains(right)==True (1)                                 -0.089
  contains(adam)==True (1)                                   0.089
  contains(month)==True (1)                                 -0.088
  contains(white)==True (1)                                  0.087
  contains(hard)==True (1)                                  -0.087
  contains(important)==True (1)                             -0.086
  contains(true)==True (1)                                  -0.086
  contains(1971)==True (1)                                   0.084
  contains(easy)==True (1)                                  -0.084
  contains(seems)==True (1)                                 -0.083
  contains(follows)==True (1)                                0.082
  contains(thigh)==True (1)                                  0.082
  contains(leave)==True (1)                                  0.081
  contains(cornflakes)==True (1)                             0.080
  contains(kitchen)==True (1)                                0.080
  contains(later)==True (1)                                  0.080
  contains(leg)==True (1)                                    0.080
  contains(location)==True (1)                               0.079
  contains(post)==True (1)                                   0.079
  contains(favorite)==True (1)                              -0.079
  contains(also)==True (1)                                   0.078
  contains(resulting)==True (1)                              0.078
  contains(probably)==True (1)                              -0.077
  contains(rather)==True (1)                                -0.075
  contains(pour)==True (1)                                   0.074
  contains(everything)==True (1)                             0.074
  contains(get)==True (1)                                    0.074
  contains(found)==True (1)                                 -0.073
  contains(anything)==True (1)                              -0.071
  contains(sound)==True (1)                                  0.070
  contains(paper)==True (1)                                  0.070
  contains(agree)==True (1)                                  0.070
  contains(richard)==True (1)                                0.069
  contains(den)==True (1)                                    0.069
  contains(hearty)==True (1)                                 0.069
  contains(keep)==True (1)                                  -0.069
  contains(alley)==True (1)                                  0.069
  contains(age)==True (1)                                   -0.065
  contains(miss)==True (1)                                   0.065
  contains(basically)==True (1)                              0.065
  contains(kind)==True (1)                                   0.064
  contains(($)==True (1)                                     0.062
  contains(believe)==True (1)                               -0.061
  contains(fall)==True (1)                                  -0.060
  contains(philippine)==True (1)                             0.059
  contains(j)==True (1)                                      0.058
  contains(teeth)==True (1)                                  0.057
  contains(brand)==True (1)                                  0.057
  contains(modern)==True (1)                                 0.057
  contains(place)==True (1)                                 -0.056
  contains(deal)==True (1)                                  -0.056
  contains(stuff)==True (1)                                  0.054
  contains(since)==True (1)                                 -0.054
  contains(salt)==True (1)                                   0.054
  contains(mac)==True (1)                                    0.054
  contains(sell)==True (1)                                   0.053
  contains(clean)==True (1)                                  0.053
  contains(order)==True (1)                                 -0.053
  contains(option)==True (1)                                 0.053
  contains(,”)==True (1)                                     0.053
  contains(begin)==True (1)                                  0.052
  contains(long)==True (1)                                  -0.051
  contains(classic)==True (1)                                0.051
  contains(chicken)==True (1)                                0.050
  contains(area)==True (1)                                  -0.050
  contains(popular)==True (1)                               -0.050
  contains(plate)==True (1)                                  0.050
  contains(friday)==True (1)                                -0.048
  contains(would)==True (1)                                  0.048
  contains(pool)==True (1)                                   0.048
  contains(c)==True (1)                                      0.048
  contains(12)==True (1)                                    -0.048
  contains(woman)==True (1)                                 -0.047
  contains(95)==True (1)                                     0.047
  contains(consistently)==True (1)                           0.047
  contains(happy)==True (1)                                 -0.047
  contains(half)==True (1)                                   0.046
  contains(presentation)==True (1)                           0.046
  contains(),)==True (1)                                    -0.045
  contains(luxury)==True (1)                                 0.044
  contains(turned)==True (1)                                -0.044
  contains(excuse)==True (1)                                 0.044
  contains(dedicated)==True (1)                              0.042
  contains(paid)==True (1)                                   0.042
  contains(thin)==True (1)                                   0.042
  contains(wanted)==True (1)                                 0.041
  contains(without)==True (1)                               -0.040
  contains(used)==True (1)                                  -0.040
  contains(another)==True (1)                               -0.040
  contains(including)==True (1)                              0.039
  contains(forget)==True (1)                                -0.039
  contains(got)==True (1)                                    0.039
  contains(bay)==True (1)                                    0.039
  contains(pas)==True (1)                                    0.038
  contains(south)==True (1)                                  0.038
  contains(real)==True (1)                                  -0.037
  contains(others)==True (1)                                -0.037
  contains(side)==True (1)                                  -0.037
  contains(eat)==True (1)                                    0.037
  contains(buy)==True (1)                                    0.037
  contains(maybe)==True (1)                                  0.035
  contains(mean)==True (1)                                   0.035
  contains(99)==True (1)                                     0.034
  contains(offer)==True (1)                                  0.033
  contains(twice)==True (1)                                  0.032
  contains(circle)==True (1)                                 0.031
  contains(nine)==True (1)                                   0.031
  contains(run)==True (1)                                    0.030
  contains(near)==True (1)                                  -0.030
  contains(started)==True (1)                               -0.030
  contains(pop)==True (1)                                    0.029
  contains(pack)==True (1)                                   0.029
  contains(getting)==True (1)                                0.029
  contains(within)==True (1)                                -0.028
  contains(music)==True (1)                                  0.026
  contains(single)==True (1)                                -0.025
  contains(east)==True (1)                                   0.025
  contains(well)==True (1)                                  -0.025
  contains(review)==True (1)                                 0.024
  contains(14)==True (1)                                     0.024
  contains(french)==True (1)                                 0.024
  contains(call)==True (1)                                  -0.024
  contains(whether)==True (1)                                0.023
  contains(reveal)==True (1)                                 0.023
  contains(tim)==True (1)                                    0.023
  contains(self)==True (1)                                   0.023
  contains(although)==True (1)                               0.023
  contains(flash)==True (1)                                  0.022
  contains(beauty)==True (1)                                -0.022
  contains(heavy)==True (1)                                  0.021
  contains(called)==True (1)                                -0.020
  contains(care)==True (1)                                  -0.020
  contains(offering)==True (1)                              -0.019
  contains(account)==True (1)                               -0.017
  contains(feel)==True (1)                                  -0.017
  contains(everywhere)==True (1)                             0.017
  contains(n)==True (1)                                      0.017
  contains(competition)==True (1)                            0.017
  contains(certain)==True (1)                               -0.016
  contains(could)==True (1)                                 -0.016
  contains(hand)==True (1)                                   0.016
  contains(former)==True (1)                                -0.016
  contains(r)==True (1)                                      0.016
  contains(though)==True (1)                                -0.014
  contains(fine)==True (1)                                  -0.012
  contains(dark)==True (1)                                   0.012
  contains(learned)==True (1)                               -0.012
  contains(minute)==True (1)                                -0.011
  contains(food)==True (1)                                   0.011
  contains(owner)==True (1)                                  0.011
  contains(deep)==True (1)                                  -0.010
  contains(may)==True (1)                                   -0.010
  contains(taken)==True (1)                                 -0.010
  contains(glass)==True (1)                                  0.009
  contains(double)==True (1)                                 0.009
  contains(voice)==True (1)                                 -0.009
  contains(common)==True (1)                                -0.008
  contains(meat)==True (1)                                  -0.007
  contains(!))==True (1)                                    -0.006
  contains(country)==True (1)                                0.005
  contains(thing)==True (1)                                  0.004
  contains(every)==True (1)                                 -0.003
  contains(driving)==True (1)                                0.002
  contains(often)==True (1)                                 -0.000
  contains(e)==True (1)                                              1.766
  contains(“)==True (1)                                              1.441
  contains(‘)==True (1)                                              1.095
  contains(9)==True (1)                                              0.865
  contains(child)==True (1)                                          0.863
  contains(,”)==True (1)                                             0.859
  contains(’)==True (1)                                              0.745
  contains(finished)==True (1)                                       0.672
  contains(age)==True (1)                                            0.604
  contains(”)==True (1)                                              0.577
  contains(bone)==True (1)                                           0.549
  contains(u)==True (1)                                             -0.520
  contains(best)==True (1)                                          -0.519
  contains(.”)==True (1)                                             0.503
  contains(margaret)==True (1)                                       0.488
  contains(believe)==True (1)                                        0.477
  contains(family)==True (1)                                         0.477
  contains(get)==True (1)                                           -0.474
  contains(14)==True (1)                                             0.436
  label is 'books' (1)                                              -0.416
  contains(like)==True (1)                                          -0.412
  contains(review)==True (1)                                         0.410
  contains(well)==True (1)                                           0.395
  contains(5)==True (1)                                             -0.388
  contains(crime)==True (1)                                          0.380
  contains(friday)==True (1)                                         0.367
  contains(bring)==True (1)                                          0.355
  contains(old)==True (1)                                            0.335
  contains(list)==True (1)                                           0.323
  contains(american)==True (1)                                       0.317
  contains(also)==True (1)                                           0.315
  contains(post)==True (1)                                           0.314
  contains(result)==True (1)                                        -0.308
  contains(virginia)==True (1)                                       0.305
  contains(store)==True (1)                                         -0.301
  contains(end)==True (1)                                            0.299
  contains(seems)==True (1)                                          0.296
  contains(real)==True (1)                                          -0.292
  contains(found)==True (1)                                          0.279
  contains(created)==True (1)                                       -0.276
  contains(others)==True (1)                                         0.270
  contains(.))==True (1)                                             0.267
  contains(open)==True (1)                                          -0.262
  contains(something)==True (1)                                      0.262
  contains(know)==True (1)                                           0.257
  contains(called)==True (1)                                        -0.257
  contains(change)==True (1)                                         0.250
  contains(free)==True (1)                                          -0.248
  contains(),)==True (1)                                             0.248
  contains(since)==True (1)                                         -0.242
  contains(top)==True (1)                                           -0.236
  contains(time)==True (1)                                           0.235
  contains(lot)==True (1)                                            0.224
  contains(side)==True (1)                                          -0.224
  contains(beer)==True (1)                                           0.223
  contains(item)==True (1)                                           0.223
  contains(big)==True (1)                                           -0.222
  contains(place)==True (1)                                         -0.219
  contains(popular)==True (1)                                       -0.217
  contains(8)==True (1)                                             -0.215
  contains(part)==True (1)                                           0.212
  contains(le)==True (1)                                            -0.209
  contains(bite)==True (1)                                           0.206
  contains(former)==True (1)                                        -0.206
  contains(made)==True (1)                                          -0.206
  contains(form)==True (1)                                           0.206
  contains(sure)==True (1)                                          -0.204
  contains(tender)==True (1)                                         0.203
  contains(matter)==True (1)                                         0.193
  contains(year)==True (1)                                           0.191
  contains(take)==True (1)                                          -0.190
  contains(favorite)==True (1)                                      -0.189
  contains(punishment)==True (1)                                     0.188
  contains(deal)==True (1)                                          -0.185
  contains(50)==True (1)                                             0.184
  contains(instead)==True (1)                                        0.183
  contains(25)==True (1)                                            -0.182
  contains(modern)==True (1)                                        -0.181
  contains(sandwich)==True (1)                                       0.180
  contains(—)==True (1)                                             -0.177
  contains(v)==True (1)                                             -0.176
  contains(fully)==True (1)                                          0.176
  contains(recipe)==True (1)                                        -0.175
  contains(much)==True (1)                                          -0.172
  contains(brand)==True (1)                                         -0.171
  contains(commitment)==True (1)                                     0.170
  contains(one)==True (1)                                           -0.170
  contains(co)==True (1)                                            -0.170
  contains(minute)==True (1)                                        -0.169
  contains(often)==True (1)                                         -0.165
  contains(12)==True (1)                                            -0.165
  contains(started)==True (1)                                       -0.163
  contains(could)==True (1)                                         -0.163
  contains(let)==True (1)                                            0.161
  contains(actually)==True (1)                                      -0.161
  contains(head)==True (1)                                           0.160
  contains(call)==True (1)                                          -0.158
  contains(golden)==True (1)                                         0.158
  contains(deep)==True (1)                                          -0.157
  contains(traditional)==True (1)                                    0.157
  contains(month)==True (1)                                          0.156
  contains(john)==True (1)                                           0.156
  contains(example)==True (1)                                       -0.156
  contains(black)==True (1)                                         -0.155
  contains(obsession)==True (1)                                      0.153
  contains(two)==True (1)                                            0.152
  contains(stock)==True (1)                                         -0.151
  contains(r)==True (1)                                             -0.150
  contains(offer)==True (1)                                          0.150
  contains(cast)==True (1)                                          -0.149
  contains(held)==True (1)                                          -0.148
  contains(music)==True (1)                                         -0.148
  contains(founder)==True (1)                                       -0.147
  contains(fan)==True (1)                                           -0.147
  contains(crack)==True (1)                                          0.146
  contains(washington)==True (1)                                     0.145
  contains(including)==True (1)                                      0.145
  contains(father)==True (1)                                         0.139
  contains(tim)==True (1)                                           -0.137
  contains(three)==True (1)                                          0.137
  contains(hold)==True (1)                                          -0.137
  contains(available)==True (1)                                      0.136
  contains(come)==True (1)                                           0.135
  contains(street)==True (1)                                         0.134
  contains(double)==True (1)                                         0.132
  contains(version)==True (1)                                        0.130
  contains(anything)==True (1)                                       0.128
  contains(yard)==True (1)                                           0.125
  contains(hot)==True (1)                                           -0.123
  contains(try)==True (1)                                            0.123
  contains(wing)==True (1)                                          -0.123
  contains(paper)==True (1)                                         -0.122
  contains(single)==True (1)                                        -0.122
  contains(without)==True (1)                                       -0.120
  contains(new)==True (1)                                            0.118
  contains(go)==True (1)                                             0.118
  contains(name)==True (1)                                          -0.117
  contains(every)==True (1)                                         -0.116
  contains(secret)==True (1)                                        -0.116
  contains(high)==True (1)                                          -0.116
  contains(offering)==True (1)                                      -0.116
  contains(dad)==True (1)                                            0.116
  contains(important)==True (1)                                     -0.115
  contains(got)==True (1)                                           -0.113
  contains(fresh)==True (1)                                         -0.112
  contains(discover)==True (1)                                      -0.112
  contains(white)==True (1)                                         -0.112
  contains(everything)==True (1)                                    -0.110
  contains(decade)==True (1)                                         0.110
  contains(maybe)==True (1)                                          0.109
  contains(finish)==True (1)                                        -0.108
  contains(option)==True (1)                                        -0.108
  contains(using)==True (1)                                         -0.108
  contains(roll)==True (1)                                          -0.105
  contains(owner)==True (1)                                         -0.105
  contains(though)==True (1)                                         0.105
  contains(tom)==True (1)                                           -0.105
  contains(adult)==True (1)                                          0.105
  contains(kid)==True (1)                                           -0.102
  contains(shard)==True (1)                                          0.102
  contains(getting)==True (1)                                       -0.101
  contains(line)==True (1)                                           0.099
  contains(10)==True (1)                                            -0.098
  contains(word)==True (1)                                           0.098
  contains(happy)==True (1)                                         -0.097
  contains(area)==True (1)                                          -0.097
  contains(french)==True (1)                                        -0.097
  contains(creator)==True (1)                                       -0.097
  contains(build)==True (1)                                         -0.096
  contains(sound)==True (1)                                          0.094
  contains(cooking)==True (1)                                       -0.094
  contains(whether)==True (1)                                        0.091
  contains(method)==True (1)                                        -0.091
  contains(house)==True (1)                                         -0.090
  contains(trick)==True (1)                                         -0.090
  contains(fancy)==True (1)                                          0.089
  contains(low)==True (1)                                           -0.088
  contains(sold)==True (1)                                           0.088
  contains(run)==True (1)                                            0.086
  contains(probably)==True (1)                                       0.085
  contains(whole)==True (1)                                         -0.084
  contains(wolf)==True (1)                                           0.084
  contains(even)==True (1)                                          -0.083
  contains(developed)==True (1)                                     -0.083
  contains().)==True (1)                                            -0.083
  contains(newest)==True (1)                                        -0.082
  contains(near)==True (1)                                           0.082
  contains(pop)==True (1)                                           -0.081
  contains(really)==True (1)                                         0.081
  contains(enough)==True (1)                                        -0.081
  contains(com)==True (1)                                           -0.080
  contains(named)==True (1)                                         -0.080
  contains(dark)==True (1)                                          -0.080
  contains(respect)==True (1)                                       -0.080
  contains(voice)==True (1)                                          0.079
  contains(nine)==True (1)                                           0.079
  contains(effect)==True (1)                                        -0.078
  contains(fast)==True (1)                                          -0.078
  contains(taking)==True (1)                                        -0.078
  contains(competition)==True (1)                                   -0.077
  contains(prevent)==True (1)                                       -0.077
  contains(used)==True (1)                                          -0.077
  contains(michel)==True (1)                                         0.077
  contains(taylor)==True (1)                                         0.077
  contains(sea)==True (1)                                           -0.077
  contains(operation)==True (1)                                     -0.076
  contains(teeth)==True (1)                                         -0.076
  contains(fine)==True (1)                                          -0.075
  contains(dig)==True (1)                                            0.075
  contains(butterfly)==True (1)                                      0.075
  contains(keep)==True (1)                                           0.075
  contains(never)==True (1)                                         -0.074
  contains(plate)==True (1)                                         -0.074
  contains(stuff)==True (1)                                         -0.074
  contains(bay)==True (1)                                           -0.073
  contains(style)==True (1)                                          0.073
  contains(follows)==True (1)                                        0.072
  contains(person)==True (1)                                         0.072
  contains(begin)==True (1)                                          0.071
  contains(spend)==True (1)                                         -0.071
  contains(long)==True (1)                                           0.070
  contains(standard)==True (1)                                       0.069
  contains(alike)==True (1)                                         -0.069
  contains(arrives)==True (1)                                       -0.069
  contains(signature)==True (1)                                      0.068
  contains(adam)==True (1)                                          -0.068
  contains(snack)==True (1)                                         -0.068
  contains(juice)==True (1)                                         -0.067
  contains(leave)==True (1)                                          0.066
  contains(half)==True (1)                                           0.066
  contains(variety)==True (1)                                       -0.066
  contains(south)==True (1)                                         -0.065
  contains(chicken)==True (1)                                       -0.065
  contains(blend)==True (1)                                         -0.065
  contains(longer)==True (1)                                        -0.064
  contains(needed)==True (1)                                        -0.063
  contains(set)==True (1)                                            0.063
  contains(fish)==True (1)                                           0.063
  contains(kitchen)==True (1)                                       -0.063
  contains(eric)==True (1)                                          -0.063
  contains(perfectly)==True (1)                                      0.062
  contains(bird)==True (1)                                          -0.062
  contains(leg)==True (1)                                           -0.062
  contains(creating)==True (1)                                      -0.061
  contains(fat)==True (1)                                           -0.061
  contains(ordering)==True (1)                                       0.060
  contains(central)==True (1)                                       -0.060
  contains(thought)==True (1)                                       -0.060
  contains(hill)==True (1)                                           0.059
  contains(japanese)==True (1)                                      -0.058
  contains(biscuit)==True (1)                                        0.058
  contains(woman)==True (1)                                          0.057
  contains(crushed)==True (1)                                        0.057
  contains(enjoyed)==True (1)                                       -0.055
  contains(describe)==True (1)                                      -0.055
  contains(reveal)==True (1)                                         0.055
  contains(dedicated)==True (1)                                     -0.055
  contains(bag)==True (1)                                           -0.055
  contains(unless)==True (1)                                         0.054
  contains(occasional)==True (1)                                     0.054
  contains(beauty)==True (1)                                         0.053
  contains(fly)==True (1)                                            0.053
  contains(richard)==True (1)                                       -0.053
  contains(24)==True (1)                                            -0.053
  contains(say)==True (1)                                            0.053
  contains(simply)==True (1)                                        -0.052
  contains(rose)==True (1)                                          -0.052
  contains(toward)==True (1)                                        -0.052
  contains(cart)==True (1)                                           0.052
  contains(however)==True (1)                                        0.052
  contains(beef)==True (1)                                           0.051
  contains(country)==True (1)                                       -0.051
  contains(home)==True (1)                                           0.051
  contains(n)==True (1)                                             -0.051
  contains(self)==True (1)                                           0.051
  contains(thin)==True (1)                                          -0.051
  contains(den)==True (1)                                           -0.051
  contains(back)==True (1)                                          -0.050
  contains(j)==True (1)                                             -0.050
  contains(pas)==True (1)                                           -0.050
  contains(buy)==True (1)                                            0.050
  contains(grab)==True (1)                                          -0.049
  contains(would)==True (1)                                          0.049
  contains(suit)==True (1)                                          -0.049
  contains(credit)==True (1)                                         0.049
  contains(may)==True (1)                                           -0.049
  contains(piece)==True (1)                                          0.049
  contains(waste)==True (1)                                         -0.049
  contains(dinner)==True (1)                                         0.048
  contains(later)==True (1)                                         -0.048
  contains(throughout)==True (1)                                     0.048
  contains(consciousness)==True (1)                                  0.048
  contains(frequently)==True (1)                                     0.047
  contains(leaf)==True (1)                                          -0.047
  contains(sell)==True (1)                                          -0.047
  contains(allow)==True (1)                                         -0.047
  contains(essentially)==True (1)                                   -0.047
  contains(crystal)==True (1)                                        0.046
  contains(sometimes)==True (1)                                      0.046
  contains(wanted)==True (1)                                        -0.045
  contains(classic)==True (1)                                        0.044
  contains(warm)==True (1)                                           0.044
  contains(luxury)==True (1)                                        -0.043
  contains(attached)==True (1)                                      -0.043
  contains(generous)==True (1)                                       0.043
  contains(paid)==True (1)                                          -0.043
  contains(spin)==True (1)                                          -0.042
  contains(hour)==True (1)                                          -0.042
  contains(accompanying)==True (1)                                   0.042
  contains(passion)==True (1)                                       -0.042
  contains(testing)==True (1)                                       -0.041
  contains(learned)==True (1)                                        0.041
  contains(inspired)==True (1)                                      -0.041
  contains(true)==True (1)                                          -0.041
  contains(ever)==True (1)                                          -0.040
  contains(element)==True (1)                                       -0.040
  contains(spent)==True (1)                                          0.040
  contains(iron)==True (1)                                          -0.040
  contains(local)==True (1)                                          0.039
  contains(twice)==True (1)                                         -0.039
  contains(asking)==True (1)                                         0.039
  contains(stool)==True (1)                                         -0.039
  contains(excuse)==True (1)                                        -0.039
  contains(seasoned)==True (1)                                       0.039
  contains(agree)==True (1)                                          0.039
  contains(formula)==True (1)                                       -0.038
  contains(better)==True (1)                                         0.038
  contains(exchange)==True (1)                                      -0.037
  contains(onto)==True (1)                                           0.037
  contains(east)==True (1)                                          -0.037
  contains(planning)==True (1)                                       0.037
  contains(clean)==True (1)                                         -0.037
  contains(exact)==True (1)                                          0.036
  contains(farce)==True (1)                                          0.036
  contains(want)==True (1)                                          -0.036
  contains(right)==True (1)                                         -0.036
  contains(pain)==True (1)                                          -0.035
  contains(hard)==True (1)                                           0.035
  contains(atop)==True (1)                                           0.035
  contains(become)==True (1)                                        -0.035
  contains(another)==True (1)                                        0.034
  contains(c)==True (1)                                             -0.034
  contains(cut)==True (1)                                           -0.034
  contains(almost)==True (1)                                         0.034
  contains(flash)==True (1)                                          0.033
  contains(bar)==True (1)                                           -0.032
  contains(finger)==True (1)                                        -0.032
  contains(training)==True (1)                                      -0.032
  contains(food)==True (1)                                           0.031
  contains(common)==True (1)                                         0.031
  contains(care)==True (1)                                          -0.030
  contains(pressure)==True (1)                                      -0.030
  contains(restaurant)==True (1)                                    -0.030
  contains(chef)==True (1)                                          -0.030
  contains(dose)==True (1)                                           0.030
  contains(butter)==True (1)                                         0.030
  contains(everywhere)==True (1)                                    -0.030
  contains(variation)==True (1)                                     -0.029
  contains(lily)==True (1)                                           0.029
  contains(ask)==True (1)                                            0.029
  contains(stuffed)==True (1)                                       -0.029
  contains(within)==True (1)                                        -0.029
  contains(forget)==True (1)                                         0.028
  contains(surprising)==True (1)                                     0.028
  contains(combination)==True (1)                                   -0.028
  contains(.’)==True (1)                                             0.028
  contains(taken)==True (1)                                          0.028
  contains(inevitably)==True (1)                                     0.027
  contains(heavy)==True (1)                                          0.027
  contains(slightly)==True (1)                                       0.027
  contains(southern)==True (1)                                       0.027
  contains(topped)==True (1)                                        -0.027
  contains(presentation)==True (1)                                  -0.026
  contains(rather)==True (1)                                        -0.026
  contains(combine)==True (1)                                        0.026
  contains(base)==True (1)                                          -0.026
  contains(frank)==True (1)                                         -0.025
  contains(process)==True (1)                                       -0.025
  contains(chain)==True (1)                                         -0.025
  contains(plain)==True (1)                                         -0.025
  contains(circle)==True (1)                                         0.025
  contains(remains)==True (1)                                       -0.025
  contains(strip)==True (1)                                         -0.024
  contains(spring)==True (1)                                         0.024
  contains(hank)==True (1)                                           0.024
  contains(dennis)==True (1)                                         0.024
  contains(yield)==True (1)                                         -0.024
  contains(although)==True (1)                                      -0.024
  contains(($)==True (1)                                             0.023
  contains(bread)==True (1)                                          0.023
  contains(shore)==True (1)                                          0.023
  contains(bucket)==True (1)                                        -0.023
  contains(protein)==True (1)                                        0.022
  contains(skin)==True (1)                                           0.022
  contains(gauche)==True (1)                                         0.022
  contains(addictive)==True (1)                                     -0.022
  contains(tartare)==True (1)                                        0.022
  contains(plus)==True (1)                                           0.022
  contains(picnic)==True (1)                                        -0.022
  contains(!))==True (1)                                             0.022
  contains(empty)==True (1)                                          0.022
  contains(feel)==True (1)                                          -0.021
  contains(ala)==True (1)                                            0.021
  contains(baltimore)==True (1)                                     -0.021
  contains(ear)==True (1)                                           -0.020
  contains(steak)==True (1)                                          0.020
  contains(stick)==True (1)                                          0.020
  contains(cooper)==True (1)                                         0.020
  contains(turned)==True (1)                                         0.020
  contains(good)==True (1)                                          -0.020
  contains(count)==True (1)                                         -0.020
  contains(satisfying)==True (1)                                     0.019
  contains(hand)==True (1)                                          -0.018
  contains(degree)==True (1)                                        -0.018
  contains(salt)==True (1)                                          -0.018
  contains(perfect)==True (1)                                       -0.018
  contains(rendered)==True (1)                                       0.018
  contains(abbott)==True (1)                                         0.018
  contains(clerk)==True (1)                                          0.017
  contains(certain)==True (1)                                        0.017
  contains(marriage)==True (1)                                       0.017
  contains(need)==True (1)                                          -0.017
  contains(silver)==True (1)                                         0.016
  contains(distance)==True (1)                                       0.015
  contains(resist)==True (1)                                        -0.015
  contains(romeo)==True (1)                                          0.015
  contains(breath)==True (1)                                         0.015
  contains(thing)==True (1)                                          0.015
  contains(japan)==True (1)                                         -0.015
  contains(99)==True (1)                                            -0.015
  contains(meal)==True (1)                                           0.015
  contains(menu)==True (1)                                          -0.014
  contains(basket)==True (1)                                         0.014
  contains(soul)==True (1)                                          -0.014
  contains(pack)==True (1)                                          -0.014
  contains(roof)==True (1)                                           0.014
  contains(spice)==True (1)                                         -0.014
  contains(accused)==True (1)                                        0.013
  contains(put)==True (1)                                            0.013
  contains(ray)==True (1)                                            0.013
  contains(2002)==True (1)                                          -0.013
  contains(knife)==True (1)                                         -0.013
  contains(extremely)==True (1)                                      0.013
  contains(forgotten)==True (1)                                      0.013
  contains(memorable)==True (1)                                     -0.012
  contains(order)==True (1)                                          0.012
  contains(tending)==True (1)                                        0.012
  contains(shine)==True (1)                                         -0.012
  contains(virtue)==True (1)                                        -0.012
  contains(fall)==True (1)                                           0.012
  contains(flesh)==True (1)                                          0.012
  contains(.,)==True (1)                                            -0.012
  contains(miss)==True (1)                                          -0.012
  contains(roasted)==True (1)                                       -0.012
  contains(deeply)==True (1)                                         0.012
  contains(korean)==True (1)                                         0.011
  contains(dignity)==True (1)                                        0.011
  contains(appreciate)==True (1)                                    -0.011
  contains(basically)==True (1)                                      0.011
  contains(stacked)==True (1)                                       -0.011
  contains(taste)==True (1)                                          0.011
  contains(fried)==True (1)                                         -0.011
  contains(layer)==True (1)                                         -0.010
  contains(easy)==True (1)                                           0.010
  contains(size)==True (1)                                           0.010
  contains(wonderfully)==True (1)                                    0.010
  contains(pan)==True (1)                                           -0.010
  contains(kind)==True (1)                                           0.010
  contains(morgan)==True (1)                                         0.010
  contains(merit)==True (1)                                          0.010
  contains(capitol)==True (1)                                        0.009
  contains(treated)==True (1)                                       -0.009
  contains(connecticut)==True (1)                                    0.009
  contains(dipping)==True (1)                                        0.009
  contains(account)==True (1)                                        0.009
  contains(handle)==True (1)                                        -0.009
  contains(detail)==True (1)                                         0.009
  contains(sole)==True (1)                                           0.009
  contains(filling)==True (1)                                       -0.009
  contains(upscale)==True (1)                                        0.008
  contains(bum)==True (1)                                            0.008
  contains(said)==True (1)                                          -0.008
  contains(st)==True (1)                                             0.008
  contains(reserve)==True (1)                                        0.008
  contains(appetite)==True (1)                                       0.008
  contains(ubiquitous)==True (1)                                     0.008
  contains(commonly)==True (1)                                       0.008
  contains(involved)==True (1)                                       0.008
  contains(smack)==True (1)                                          0.007
  contains(district)==True (1)                                      -0.007
  contains(primary)==True (1)                                       -0.007
  contains(crumb)==True (1)                                          0.007
  contains(served)==True (1)                                        -0.007
  contains(exterior)==True (1)                                       0.007
  contains(founded)==True (1)                                       -0.006
  contains(korea)==True (1)                                          0.006
  contains(ultra)==True (1)                                         -0.005
  contains(pleasure)==True (1)                                      -0.005
  contains(rob)==True (1)                                           -0.005
  contains(mean)==True (1)                                           0.005
  contains(estate)==True (1)                                         0.005
  contains(portion)==True (1)                                       -0.004
  contains(ranch)==True (1)                                          0.004
  contains(honey)==True (1)                                         -0.004
  contains(glass)==True (1)                                         -0.004
  contains(pony)==True (1)                                           0.004
  contains(sink)==True (1)                                          -0.004
  contains(lunch)==True (1)                                         -0.004
  contains(accent)==True (1)                                         0.004
  contains(bath)==True (1)                                          -0.004
  contains(argue)==True (1)                                         -0.004
  contains(breast)==True (1)                                         0.004
  contains(favor)==True (1)                                         -0.003
  contains(emerge)==True (1)                                         0.003
  contains(maryland)==True (1)                                       0.003
  contains(pour)==True (1)                                           0.003
  contains(quite)==True (1)                                         -0.003
  contains(term)==True (1)                                           0.003
  contains(ditch)==True (1)                                          0.003
  contains(proper)==True (1)                                         0.002
  contains(eat)==True (1)                                            0.002
  contains(crunch)==True (1)                                         0.001
  contains(technically)==True (1)                                    0.001
  contains(lends)==True (1)                                          0.001
  contains(famed)==True (1)                                          0.001
  contains(chewing)==True (1)                                        0.001
  contains(leftover)==True (1)                                      -0.001
  contains(preparation)==True (1)                                    0.000
  contains(driving)==True (1)                                        0.000
  contains(georgia)==True (1)                                        0.000
  contains(guilty)==True (1)                                         0.000
  label is 'do_it_yourself' (1)                                             -1.957
  contains(’)==True (1)                                                      0.986
  contains().)==True (1)                                                     0.984
  contains(need)==True (1)                                                   0.850
  contains(like)==True (1)                                                   0.654
  contains(using)==True (1)                                                  0.639
  contains(wanted)==True (1)                                                 0.515
  contains(year)==True (1)                                                  -0.476
  contains(started)==True (1)                                                0.466
  contains(sure)==True (1)                                                   0.453
  contains(piece)==True (1)                                                  0.445
  contains(although)==True (1)                                               0.430
  contains(every)==True (1)                                                  0.388
  contains(creating)==True (1)                                               0.364
  contains(account)==True (1)                                                0.355
  contains(know)==True (1)                                                  -0.338
  contains(offer)==True (1)                                                  0.333
  contains(place)==True (1)                                                  0.329
  contains(build)==True (1)                                                  0.326
  contains(r)==True (1)                                                     -0.304
  contains(head)==True (1)                                                   0.302
  contains(thing)==True (1)                                                  0.301
  contains(used)==True (1)                                                   0.297
  contains(9)==True (1)                                                     -0.287
  contains(”)==True (1)                                                     -0.276
  contains(near)==True (1)                                                   0.269
  contains(hand)==True (1)                                                   0.267
  contains(well)==True (1)                                                   0.260
  contains(really)==True (1)                                                -0.256
  contains(right)==True (1)                                                 -0.250
  contains(spent)==True (1)                                                  0.248
  contains(finger)==True (1)                                                 0.248
  contains(option)==True (1)                                                 0.245
  contains(low)==True (1)                                                    0.245
  contains(cut)==True (1)                                                    0.244
  contains(quite)==True (1)                                                 -0.244
  contains(change)==True (1)                                                -0.238
  contains(.))==True (1)                                                    -0.238
  contains(pop)==True (1)                                                    0.238
  contains(hard)==True (1)                                                  -0.236
  contains(try)==True (1)                                                    0.234
  contains(found)==True (1)                                                  0.230
  contains(run)==True (1)                                                    0.229
  contains(also)==True (1)                                                   0.226
  contains(fresh)==True (1)                                                  0.225
  contains(le)==True (1)                                                    -0.224
  contains(stuff)==True (1)                                                 -0.223
  contains(mean)==True (1)                                                  -0.221
  contains(trick)==True (1)                                                 -0.219
  contains(john)==True (1)                                                  -0.218
  contains(seems)==True (1)                                                 -0.217
  contains(probably)==True (1)                                              -0.209
  contains(never)==True (1)                                                 -0.208
  contains(best)==True (1)                                                  -0.207
  contains(v)==True (1)                                                     -0.200
  contains(forget)==True (1)                                                 0.199
  contains(method)==True (1)                                                 0.198
  contains(take)==True (1)                                                  -0.196
  contains(part)==True (1)                                                   0.193
  contains(important)==True (1)                                              0.191
  contains(almost)==True (1)                                                -0.190
  contains(food)==True (1)                                                  -0.188
  contains(lot)==True (1)                                                   -0.187
  contains(minute)==True (1)                                                 0.186
  contains(‘)==True (1)                                                     -0.185
  contains(foam)==True (1)                                                   0.184
  contains(thought)==True (1)                                                0.183
  contains(ask)==True (1)                                                    0.177
  contains(result)==True (1)                                                -0.176
  contains(say)==True (1)                                                   -0.176
  contains(mount)==True (1)                                                  0.176
  contains(5)==True (1)                                                      0.176
  contains(butterfly)==True (1)                                              0.175
  contains(could)==True (1)                                                 -0.173
  contains(classic)==True (1)                                               -0.170
  contains(handle)==True (1)                                                 0.169
  contains(menu)==True (1)                                                   0.168
  contains(former)==True (1)                                                -0.167
  contains(local)==True (1)                                                  0.166
  contains(pour)==True (1)                                                   0.163
  contains(clean)==True (1)                                                  0.163
  contains(top)==True (1)                                                    0.161
  contains(south)==True (1)                                                 -0.161
  contains(review)==True (1)                                                -0.160
  contains(miss)==True (1)                                                  -0.156
  contains(base)==True (1)                                                   0.153
  contains(detail)==True (1)                                                 0.152
  contains(stock)==True (1)                                                  0.151
  contains(finished)==True (1)                                               0.150
  contains(resulting)==True (1)                                              0.149
  contains(single)==True (1)                                                 0.147
  contains(!))==True (1)                                                    -0.146
  contains(recipe)==True (1)                                                -0.145
  contains(woman)==True (1)                                                 -0.144
  contains(free)==True (1)                                                   0.144
  contains(said)==True (1)                                                  -0.144
  contains(outer)==True (1)                                                  0.144
  contains(begin)==True (1)                                                  0.141
  contains(99)==True (1)                                                    -0.140
  contains(taking)==True (1)                                                 0.140
  contains(double)==True (1)                                                -0.138
  contains(8)==True (1)                                                      0.138
  contains(fall)==True (1)                                                   0.138
  contains(may)==True (1)                                                    0.138
  contains(fish)==True (1)                                                   0.137
  contains(feel)==True (1)                                                   0.134
  contains(called)==True (1)                                                -0.133
  contains(everything)==True (1)                                             0.133
  contains(temperature)==True (1)                                            0.133
  contains(co)==True (1)                                                     0.133
  contains(month)==True (1)                                                 -0.133
  contains(hour)==True (1)                                                   0.131
  contains(driving)==True (1)                                               -0.129
  contains(made)==True (1)                                                   0.129
  contains(roll)==True (1)                                                  -0.129
  contains(item)==True (1)                                                   0.128
  contains(instead)==True (1)                                                0.126
  contains(central)==True (1)                                                0.126
  contains(50)==True (1)                                                    -0.125
  contains(creator)==True (1)                                               -0.124
  contains(created)==True (1)                                               -0.123
  contains(put)==True (1)                                                    0.122
  contains(strip)==True (1)                                                  0.122
  contains(black)==True (1)                                                 -0.121
  contains(founder)==True (1)                                               -0.120
  contains(asking)==True (1)                                                -0.118
  contains(become)==True (1)                                                -0.118
  contains(post)==True (1)                                                   0.118
  contains(fine)==True (1)                                                   0.117
  contains(location)==True (1)                                               0.117
  contains(needed)==True (1)                                                -0.116
  contains(fancy)==True (1)                                                 -0.116
  contains(size)==True (1)                                                   0.115
  contains(u)==True (1)                                                     -0.113
  contains(order)==True (1)                                                  0.113
  contains(age)==True (1)                                                    0.112
  contains(simply)==True (1)                                                -0.112
  contains(let)==True (1)                                                   -0.111
  contains(believe)==True (1)                                               -0.111
  contains(learned)==True (1)                                               -0.110
  contains(tim)==True (1)                                                   -0.110
  contains(easy)==True (1)                                                  -0.109
  contains(kind)==True (1)                                                  -0.109
  contains(extremely)==True (1)                                              0.109
  contains(owner)==True (1)                                                 -0.109
  contains(district)==True (1)                                               0.108
  contains(sound)==True (1)                                                  0.108
  contains(attached)==True (1)                                               0.107
  contains(training)==True (1)                                              -0.105
  contains(southern)==True (1)                                               0.105
  contains(thin)==True (1)                                                   0.104
  contains(brand)==True (1)                                                 -0.103
  contains(prevent)==True (1)                                                0.102
  contains(come)==True (1)                                                   0.102
  contains(testing)==True (1)                                               -0.101
  contains(portion)==True (1)                                                0.101
  contains(dark)==True (1)                                                  -0.100
  contains(10)==True (1)                                                    -0.100
  contains(finish)==True (1)                                                -0.100
  contains(care)==True (1)                                                  -0.100
  contains(st)==True (1)                                                     0.100
  contains(later)==True (1)                                                 -0.099
  contains(sometimes)==True (1)                                             -0.099
  contains(onto)==True (1)                                                   0.097
  contains(three)==True (1)                                                 -0.097
  contains(bucket)==True (1)                                                 0.097
  contains(self)==True (1)                                                  -0.095
  contains(want)==True (1)                                                  -0.095
  contains(leg)==True (1)                                                    0.095
  contains(restaurant)==True (1)                                            -0.094
  contains(go)==True (1)                                                     0.094
  contains(friday)==True (1)                                                 0.094
  contains(voice)==True (1)                                                 -0.093
  contains(music)==True (1)                                                  0.093
  contains(plus)==True (1)                                                  -0.093
  contains(degree)==True (1)                                                 0.092
  contains(enjoyed)==True (1)                                               -0.092
  contains(effect)==True (1)                                                -0.092
  contains(leaf)==True (1)                                                   0.092
  contains(wing)==True (1)                                                   0.092
  contains(others)==True (1)                                                -0.091
  contains(,”)==True (1)                                                     0.090
  contains(nine)==True (1)                                                  -0.090
  contains(n)==True (1)                                                     -0.090
  contains(fast)==True (1)                                                   0.089
  contains(grab)==True (1)                                                   0.088
  contains(),)==True (1)                                                     0.088
  contains(q)==True (1)                                                     -0.087
  contains(passion)==True (1)                                                0.087
  contains(pack)==True (1)                                                   0.086
  contains(beauty)==True (1)                                                 0.086
  contains(get)==True (1)                                                    0.084
  contains(rather)==True (1)                                                 0.084
  contains(throughout)==True (1)                                            -0.083
  contains(whole)==True (1)                                                 -0.083
  contains(stick)==True (1)                                                 -0.083
  contains(something)==True (1)                                              0.082
  contains(spin)==True (1)                                                   0.082
  contains(held)==True (1)                                                  -0.082
  contains(com)==True (1)                                                   -0.082
  contains(kitchen)==True (1)                                               -0.082
  contains(tom)==True (1)                                                   -0.081
  contains(maybe)==True (1)                                                 -0.080
  contains(process)==True (1)                                                0.080
  contains(old)==True (1)                                                    0.080
  contains(though)==True (1)                                                -0.080
  contains(cooking)==True (1)                                               -0.080
  contains(much)==True (1)                                                   0.079
  contains(crack)==True (1)                                                 -0.079
  contains(example)==True (1)                                                0.078
  contains(dinner)==True (1)                                                 0.078
  contains(decade)==True (1)                                                -0.077
  contains(roof)==True (1)                                                  -0.077
  contains(paper)==True (1)                                                  0.077
  contains(taken)==True (1)                                                  0.076
  contains(tomato)==True (1)                                                -0.076
  contains(perfectly)==True (1)                                             -0.075
  contains(exchange)==True (1)                                               0.075
  contains(leave)==True (1)                                                 -0.075
  contains(.”)==True (1)                                                    -0.074
  contains(sunday)==True (1)                                                -0.073
  contains(inspired)==True (1)                                               0.073
  contains(developed)==True (1)                                             -0.073
  contains(bite)==True (1)                                                  -0.073
  contains(deal)==True (1)                                                  -0.073
  contains(frank)==True (1)                                                 -0.072
  contains(“)==True (1)                                                      0.072
  contains(eat)==True (1)                                                    0.072
  contains(layer)==True (1)                                                  0.071
  contains(exact)==True (1)                                                  0.070
  contains(oil)==True (1)                                                    0.070
  contains(slightly)==True (1)                                               0.070
  contains(25)==True (1)                                                    -0.069
  contains(circle)==True (1)                                                 0.068
  contains(adam)==True (1)                                                  -0.068
  contains(pan)==True (1)                                                    0.067
  contains(two)==True (1)                                                   -0.067
  contains(kid)==True (1)                                                   -0.067
  contains(alike)==True (1)                                                 -0.067
  contains(www)==True (1)                                                    0.066
  contains(joint)==True (1)                                                 -0.066
  contains(.,)==True (1)                                                    -0.066
  contains(competition)==True (1)                                           -0.066
  contains(set)==True (1)                                                    0.065
  contains(butter)==True (1)                                                -0.065
  contains(since)==True (1)                                                 -0.065
  contains(c)==True (1)                                                      0.065
  contains(involved)==True (1)                                               0.065
  contains(country)==True (1)                                               -0.065
  contains(house)==True (1)                                                 -0.064
  contains(—)==True (1)                                                     -0.064
  contains(pressure)==True (1)                                               0.064
  contains(beer)==True (1)                                                  -0.064
  contains(hot)==True (1)                                                   -0.062
  contains(plain)==True (1)                                                 -0.062
  contains(delicious)==True (1)                                             -0.062
  contains(allow)==True (1)                                                  0.061
  contains(line)==True (1)                                                  -0.061
  contains(ray)==True (1)                                                   -0.061
  contains(enough)==True (1)                                                 0.061
  contains(child)==True (1)                                                  0.060
  contains(father)==True (1)                                                -0.060
  contains(lunch)==True (1)                                                  0.059
  contains(bring)==True (1)                                                  0.059
  contains(12)==True (1)                                                     0.059
  contains(washington)==True (1)                                            -0.059
  contains(fully)==True (1)                                                 -0.058
  contains(east)==True (1)                                                   0.058
  contains(juice)==True (1)                                                  0.058
  contains(family)==True (1)                                                -0.058
  contains(empty)==True (1)                                                  0.057
  contains(honey)==True (1)                                                  0.056
  contains(adult)==True (1)                                                 -0.056
  contains(perfect)==True (1)                                                0.056
  contains(element)==True (1)                                               -0.055
  contains(silver)==True (1)                                                 0.055
  contains(spice)==True (1)                                                 -0.055
  contains(golden)==True (1)                                                 0.055
  contains(hill)==True (1)                                                  -0.054
  contains(french)==True (1)                                                -0.054
  contains(secret)==True (1)                                                -0.053
  contains(big)==True (1)                                                    0.053
  contains(chow)==True (1)                                                   0.053
  contains(cast)==True (1)                                                   0.052
  contains(snack)==True (1)                                                  0.052
  contains(served)==True (1)                                                -0.052
  contains(anything)==True (1)                                              -0.051
  contains(term)==True (1)                                                  -0.051
  contains(planning)==True (1)                                              -0.051
  contains(paid)==True (1)                                                   0.050
  contains(blend)==True (1)                                                 -0.050
  contains(rose)==True (1)                                                  -0.050
  contains(pleasure)==True (1)                                              -0.050
  contains(accessible)==True (1)                                            -0.050
  contains(popular)==True (1)                                               -0.049
  contains(combine)==True (1)                                                0.049
  contains(getting)==True (1)                                                0.049
  contains(spend)==True (1)                                                 -0.049
  contains(basically)==True (1)                                             -0.048
  contains(operation)==True (1)                                              0.048
  contains(stool)==True (1)                                                 -0.048
  contains(becky)==True (1)                                                  0.048
  contains(actually)==True (1)                                               0.048
  contains(another)==True (1)                                                0.048
  contains(bonnie)==True (1)                                                 0.047
  contains(deeply)==True (1)                                                 0.047
  contains(flavor)==True (1)                                                -0.047
  contains(american)==True (1)                                              -0.046
  contains(ever)==True (1)                                                   0.046
  contains(certain)==True (1)                                                0.046
  contains(white)==True (1)                                                 -0.045
  contains(stuffed)==True (1)                                               -0.045
  contains(bolster)==True (1)                                                0.045
  contains(time)==True (1)                                                  -0.045
  contains(moist)==True (1)                                                  0.045
  contains(common)==True (1)                                                 0.044
  contains(distributed)==True (1)                                           -0.043
  contains(navy)==True (1)                                                   0.043
  contains(chicken)==True (1)                                                0.043
  contains(bag)==True (1)                                                   -0.042
  contains(commitment)==True (1)                                             0.042
  contains(protein)==True (1)                                                0.042
  contains(list)==True (1)                                                   0.042
  contains(95)==True (1)                                                    -0.042
  contains(dropping)==True (1)                                              -0.042
  contains(named)==True (1)                                                 -0.041
  contains(luxury)==True (1)                                                -0.041
  contains(offering)==True (1)                                               0.041
  contains(bread)==True (1)                                                 -0.041
  contains(hold)==True (1)                                                   0.041
  contains(fat)==True (1)                                                   -0.041
  contains(arrives)==True (1)                                               -0.040
  contains(standard)==True (1)                                              -0.040
  contains(name)==True (1)                                                  -0.040
  contains(taste)==True (1)                                                  0.040
  contains(good)==True (1)                                                   0.040
  contains(treated)==True (1)                                                0.039
  contains(subtle)==True (1)                                                -0.039
  contains(dunk)==True (1)                                                   0.038
  contains(including)==True (1)                                              0.038
  contains(dig)==True (1)                                                   -0.038
  contains(frying)==True (1)                                                 0.037
  contains(describe)==True (1)                                              -0.037
  contains(closely)==True (1)                                                0.037
  contains(obsession)==True (1)                                             -0.037
  contains(twice)==True (1)                                                  0.037
  contains(merit)==True (1)                                                 -0.037
  contains(long)==True (1)                                                  -0.036
  contains(often)==True (1)                                                  0.036
  contains(chili)==True (1)                                                  0.036
  contains(newest)==True (1)                                                -0.035
  contains(would)==True (1)                                                  0.035
  contains(warm)==True (1)                                                   0.035
  contains(bird)==True (1)                                                   0.035
  contains(agree)==True (1)                                                  0.035
  contains(signature)==True (1)                                             -0.034
  contains(snap)==True (1)                                                  -0.034
  contains(appreciate)==True (1)                                             0.034
  contains(breath)==True (1)                                                -0.034
  contains(sandwich)==True (1)                                              -0.034
  contains(yard)==True (1)                                                   0.034
  contains(count)==True (1)                                                 -0.033
  contains(word)==True (1)                                                   0.033
  contains(glass)==True (1)                                                 -0.033
  contains(suit)==True (1)                                                  -0.033
  contains(pas)==True (1)                                                   -0.033
  contains(excuse)==True (1)                                                -0.033
  contains(style)==True (1)                                                  0.032
  contains(chef)==True (1)                                                  -0.032
  contains(cilantro)==True (1)                                               0.032
  contains(distance)==True (1)                                               0.031
  contains(brushed)==True (1)                                               -0.031
  contains(real)==True (1)                                                   0.031
  contains(2002)==True (1)                                                  -0.031
  contains(combination)==True (1)                                           -0.030
  contains(founded)==True (1)                                               -0.030
  contains(proper)==True (1)                                                -0.030
  contains(form)==True (1)                                                  -0.029
  contains(call)==True (1)                                                  -0.029
  contains(sell)==True (1)                                                   0.029
  contains(virtue)==True (1)                                                 0.029
  contains(stacked)==True (1)                                                0.029
  contains(better)==True (1)                                                -0.028
  contains(coated)==True (1)                                                 0.028
  contains(soul)==True (1)                                                   0.028
  contains(teeth)==True (1)                                                  0.028
  contains(tempura)==True (1)                                                0.028
  contains(back)==True (1)                                                   0.028
  contains(matter)==True (1)                                                -0.028
  contains(one)==True (1)                                                   -0.027
  contains(marriage)==True (1)                                              -0.027
  contains(credit)==True (1)                                                -0.027
  contains(deep)==True (1)                                                  -0.027
  contains(potato)==True (1)                                                 0.027
  contains(atop)==True (1)                                                  -0.026
  contains(bar)==True (1)                                                   -0.026
  contains(keep)==True (1)                                                   0.025
  contains(appetite)==True (1)                                              -0.025
  contains(richard)==True (1)                                               -0.025
  contains(e)==True (1)                                                     -0.025
  contains(ultra)==True (1)                                                 -0.025
  contains(person)==True (1)                                                -0.025
  contains(everywhere)==True (1)                                             0.025
  contains(steer)==True (1)                                                 -0.024
  contains(knife)==True (1)                                                 -0.024
  contains(street)==True (1)                                                 0.024
  contains(seasoning)==True (1)                                              0.024
  contains(bone)==True (1)                                                  -0.023
  contains(ranch)==True (1)                                                  0.023
  contains(shine)==True (1)                                                  0.023
  contains(taylor)==True (1)                                                -0.023
  contains(mac)==True (1)                                                   -0.023
  contains(store)==True (1)                                                 -0.022
  contains(consistency)==True (1)                                           -0.022
  contains(14)==True (1)                                                    -0.022
  contains(24)==True (1)                                                    -0.022
  contains(sea)==True (1)                                                    0.022
  contains(salt)==True (1)                                                  -0.022
  contains(buy)==True (1)                                                    0.022
  contains(ounce)==True (1)                                                 -0.022
  contains(($)==True (1)                                                     0.022
  contains(frequently)==True (1)                                             0.022
  contains(basket)==True (1)                                                -0.021
  contains(steak)==True (1)                                                  0.021
  contains(coating)==True (1)                                                0.021
  contains(florida)==True (1)                                               -0.021
  contains(eric)==True (1)                                                   0.021
  contains(preparation)==True (1)                                           -0.021
  contains(fan)==True (1)                                                   -0.020
  contains(crushed)==True (1)                                                0.020
  contains(surprising)==True (1)                                             0.020
  contains(version)==True (1)                                                0.019
  contains(bath)==True (1)                                                  -0.019
  contains(open)==True (1)                                                   0.019
  contains(reveal)==True (1)                                                 0.019
  contains(technically)==True (1)                                            0.019
  contains(deviate)==True (1)                                                0.019
  contains(skin)==True (1)                                                   0.019
  contains(resist)==True (1)                                                -0.019
  contains(high)==True (1)                                                   0.019
  contains(pennsylvania)==True (1)                                           0.019
  contains(rendered)==True (1)                                               0.019
  contains(convenience)==True (1)                                           -0.018
  contains(.’)==True (1)                                                    -0.018
  contains(got)==True (1)                                                    0.018
  contains(happy)==True (1)                                                 -0.018
  contains(j)==True (1)                                                     -0.018
  contains(columbia)==True (1)                                               0.018
  contains(northeast)==True (1)                                              0.018
  contains(variety)==True (1)                                                0.018
  contains(primary)==True (1)                                                0.017
  contains(mixture)==True (1)                                               -0.017
  contains(modification)==True (1)                                          -0.017
  contains(jersey)==True (1)                                                -0.017
  contains(however)==True (1)                                                0.017
  contains(dose)==True (1)                                                   0.017
  contains(half)==True (1)                                                  -0.017
  contains(fry)==True (1)                                                   -0.016
  contains(japan)==True (1)                                                 -0.016
  contains(snake)==True (1)                                                 -0.016
  contains(true)==True (1)                                                   0.016
  contains(onion)==True (1)                                                  0.015
  contains(garlic)==True (1)                                                 0.015
  contains(sold)==True (1)                                                  -0.015
  contains(pepper)==True (1)                                                 0.015
  contains(turned)==True (1)                                                 0.014
  contains(japanese)==True (1)                                              -0.014
  contains(follows)==True (1)                                               -0.014
  contains(dish)==True (1)                                                   0.014
  contains(punishment)==True (1)                                            -0.014
  contains(dignity)==True (1)                                                0.014
  contains(new)==True (1)                                                   -0.013
  contains(serving)==True (1)                                               -0.013
  contains(iron)==True (1)                                                  -0.013
  contains(formula)==True (1)                                                0.013
  contains(accent)==True (1)                                                 0.013
  contains(presentation)==True (1)                                          -0.013
  contains(inevitably)==True (1)                                            -0.013
  contains(toward)==True (1)                                                 0.012
  contains(fly)==True (1)                                                    0.012
  contains(remains)==True (1)                                               -0.012
  contains(flash)==True (1)                                                  0.012
  contains(sole)==True (1)                                                   0.012
  contains(side)==True (1)                                                  -0.012
  contains(sink)==True (1)                                                  -0.011
  contains(modern)==True (1)                                                -0.011
  contains(generous)==True (1)                                              -0.011
  contains(meat)==True (1)                                                  -0.011
  contains(emerge)==True (1)                                                -0.010
  contains(dad)==True (1)                                                   -0.010
  contains(discover)==True (1)                                              -0.010
  contains(hartman)==True (1)                                                0.010
  contains(chain)==True (1)                                                 -0.010
  contains(unless)==True (1)                                                -0.010
  contains(messy)==True (1)                                                 -0.010
  contains(tender)==True (1)                                                -0.010
  contains(dip)==True (1)                                                   -0.010
  contains(spoon)==True (1)                                                  0.009
  contains(home)==True (1)                                                   0.009
  contains(nashville)==True (1)                                             -0.009
  contains(pain)==True (1)                                                  -0.009
  contains(flesh)==True (1)                                                  0.009
  contains(forgotten)==True (1)                                             -0.009
  contains(estate)==True (1)                                                 0.009
  contains(filling)==True (1)                                                0.008
  contains(longer)==True (1)                                                -0.008
  contains(favorite)==True (1)                                               0.008
  contains(pool)==True (1)                                                   0.008
  contains(spring)==True (1)                                                -0.008
  contains(se)==True (1)                                                     0.008
  contains(lends)==True (1)                                                  0.008
  contains(excursion)==True (1)                                              0.007
  contains(end)==True (1)                                                    0.007
  contains(within)==True (1)                                                 0.007
  contains(crunch)==True (1)                                                 0.007
  contains(shattering)==True (1)                                            -0.007
  contains(waste)==True (1)                                                 -0.007
  contains(fairfax)==True (1)                                                0.006
  contains(corn)==True (1)                                                  -0.006
  contains(heavy)==True (1)                                                 -0.006
  contains(favor)==True (1)                                                 -0.006
  contains(ordering)==True (1)                                              -0.006
  contains(whether)==True (1)                                                0.006
  contains(traditional)==True (1)                                           -0.006
  contains(georgia)==True (1)                                               -0.005
  contains(memorable)==True (1)                                             -0.005
  contains(even)==True (1)                                                   0.005
  contains(thigh)==True (1)                                                 -0.005
  contains(area)==True (1)                                                  -0.005
  contains(plate)==True (1)                                                  0.005
  contains(upscale)==True (1)                                                0.005
  contains(popcorn)==True (1)                                                0.005
  contains(yield)==True (1)                                                  0.005
  contains(tongue)==True (1)                                                 0.005
  contains(generously)==True (1)                                            -0.004
  contains(880)==True (1)                                                    0.004
  contains(unadorned)==True (1)                                              0.004
  contains(starch)==True (1)                                                 0.004
  contains(rested)==True (1)                                                 0.003
  contains(ear)==True (1)                                                   -0.003
  contains(commonly)==True (1)                                               0.003
  contains(without)==True (1)                                                0.003
  contains(meal)==True (1)                                                   0.003
  contains(essentially)==True (1)                                           -0.002
  contains(consciousness)==True (1)                                          0.002
  contains(fritz)==True (1)                                                 -0.002
  contains(available)==True (1)                                             -0.002
  contains(crystal)==True (1)                                                0.002
  contains(consistently)==True (1)                                          -0.002
  contains(wheat)==True (1)                                                 -0.001
  contains(respect)==True (1)                                                0.001
  contains(bun)==True (1)                                                   -0.001
  contains(occasional)==True (1)                                            -0.001
  contains(dedicated)==True (1)                                             -0.001
  contains(satisfying)==True (1)                                            -0.000
  contains(latching)==True (1)                                               0.000
  contains(vernon)==True (1)                                                 0.000
  contains(karaoke)==True (1)                                                0.000
  ---------------------------------------------------------------------------------
  TOTAL:                                            40.166   9.835   9.084   6.412
  PROBS:                                             1.000   0.000   0.000   0.000

In [22]:
classifier.explain(document_features(get_text("nrRB0.html")))
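
Every feature in the explanation that follows is a single character (`contains(x)`, `contains(2)`). The definitions are not shown here, but this suggests the features were built by iterating over a raw string rather than a list of tokens. Here is a minimal sketch of that pitfall, using a hypothetical `contains_features` helper and plain `str.split` as a stand-in tokenizer. Both are assumptions, not the notebook's `document_features` or `get_text`.

```python
def contains_features(items):
    """Build 'contains(...)' style features from an iterable (hypothetical helper)."""
    return {"contains(%s)" % item: True for item in set(items)}


text = "fried chicken"

# Iterating over a str yields characters, so each feature names one character.
char_features = contains_features(text)

# Tokenizing first yields one feature per word; str.split is a stand-in tokenizer.
word_features = contains_features(text.split())

print(sorted(word_features))  # ['contains(chicken)', 'contains(fried)']
```

If this is the cause, tokenizing the text before building features would restore word-level features like those in the earlier explanation.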


  Feature                                           design   books data_sc    tech
  --------------------------------------------------------------------------------
  contains(’)==True (1)                              1.444
  contains(2)==True (1)                              0.709
  contains(x)==True (1)                              0.607
  contains(4)==True (1)                              0.513
  contains(h)==True (1)                              0.475
  contains(“)==True (1)                              0.368
  label is 'design' (1)                              0.362
  contains(r)==True (1)                              0.331
  contains(0)==True (1)                             -0.296
  contains(7)==True (1)                             -0.290
  contains(8)==True (1)                             -0.266
  contains(v)==True (1)                             -0.261
  contains(q)==True (1)                              0.255
  contains(9)==True (1)                             -0.237
  contains(—)==True (1)                             -0.228
  contains(l)==True (1)                              0.210
  contains(1)==True (1)                              0.182
  contains(f)==True (1)                             -0.161
  contains(c)==True (1)                             -0.115
  contains(”)==True (1)                             -0.106
  contains(‘)==True (1)                             -0.093
  contains(w)==True (1)                              0.092
  contains(e)==True (1)                             -0.091
  contains(5)==True (1)                              0.086
  contains(z)==True (1)                             -0.077
  contains(n)==True (1)                             -0.066
  contains(b)==True (1)                              0.050
  contains(j)==True (1)                              0.044
  contains(3)==True (1)                              0.038
  contains(p)==True (1)                             -0.031
  contains(k)==True (1)                              0.026
  contains(6)==True (1)                             -0.004
  contains(u)==True (1)                              0.003
  contains(e)==True (1)                                      1.766
  contains(“)==True (1)                                      1.441
  contains(‘)==True (1)                                      1.095
  contains(9)==True (1)                                      0.865
  contains(’)==True (1)                                      0.745
  contains(”)==True (1)                                      0.577
  contains(u)==True (1)                                     -0.520
  label is 'books' (1)                                      -0.416
  contains(5)==True (1)                                     -0.388
  contains(3)==True (1)                                     -0.358
  contains(6)==True (1)                                     -0.263
  contains(2)==True (1)                                     -0.236
  contains(8)==True (1)                                     -0.215
  contains(1)==True (1)                                     -0.212
  contains(x)==True (1)                                     -0.190
  contains(—)==True (1)                                     -0.177
  contains(v)==True (1)                                     -0.176
  contains(4)==True (1)                                     -0.157
  contains(r)==True (1)                                     -0.150
  contains(b)==True (1)                                     -0.096
  contains(z)==True (1)                                     -0.052
  contains(n)==True (1)                                     -0.051
  contains(j)==True (1)                                     -0.050
  contains(7)==True (1)                                     -0.048
  contains(k)==True (1)                                      0.042
  contains(g)==True (1)                                     -0.036
  contains(c)==True (1)                                     -0.034
  contains(f)==True (1)                                      0.017
  contains(l)==True (1)                                      0.014
  contains(p)==True (1)                                      0.002
  contains(r)==True (1)                                              1.170
  contains(—)==True (1)                                             -0.854
  label is 'data_science' (1)                                        0.606
  contains(”)==True (1)                                              0.393
  contains(4)==True (1)                                             -0.331
  contains(v)==True (1)                                              0.287
  contains(7)==True (1)                                             -0.285
  contains(g)==True (1)                                              0.264
  contains(’)==True (1)                                              0.257
  contains(6)==True (1)                                              0.225
  contains(n)==True (1)                                              0.223
  contains(3)==True (1)                                             -0.221
  contains(“)==True (1)                                             -0.194
  contains(8)==True (1)                                             -0.185
  contains(j)==True (1)                                              0.166
  contains(1)==True (1)                                              0.138
  contains(x)==True (1)                                              0.093
  contains(l)==True (1)                                             -0.091
  contains(q)==True (1)                                             -0.077
  contains(h)==True (1)                                             -0.075
  contains(e)==True (1)                                              0.073
  contains(5)==True (1)                                             -0.067
  contains(f)==True (1)                                             -0.066
  contains(w)==True (1)                                              0.057
  contains(9)==True (1)                                              0.055
  contains(b)==True (1)                                             -0.052
  contains(u)==True (1)                                              0.040
  contains(z)==True (1)                                              0.039
  contains(p)==True (1)                                              0.032
  contains(k)==True (1)                                             -0.032
  contains(‘)==True (1)                                             -0.032
  contains(0)==True (1)                                              0.028
  contains(c)==True (1)                                              0.026
  contains(2)==True (1)                                              0.004
  contains(8)==True (1)                                                      1.563
  contains(—)==True (1)                                                      1.497
  contains(2)==True (1)                                                     -0.904
  contains(r)==True (1)                                                     -0.437
  contains(f)==True (1)                                                      0.431
  contains(4)==True (1)                                                     -0.414
  contains(‘)==True (1)                                                     -0.409
  contains(u)==True (1)                                                      0.389
  contains(5)==True (1)                                                     -0.344
  contains(“)==True (1)                                                     -0.323
  contains(v)==True (1)                                                      0.295
  label is 'tech' (1)                                                        0.278
  contains(0)==True (1)                                                      0.272
  contains(9)==True (1)                                                     -0.243
  contains(7)==True (1)                                                      0.238
  contains(6)==True (1)                                                      0.213
  contains(k)==True (1)                                                     -0.141
  contains(1)==True (1)                                                     -0.134
  contains(3)==True (1)                                                     -0.120
  contains(p)==True (1)                                                     -0.118
  contains(b)==True (1)                                                      0.102
  contains(n)==True (1)                                                     -0.101
  contains(j)==True (1)                                                     -0.091
  contains(q)==True (1)                                                      0.067
  contains(”)==True (1)                                                     -0.058
  contains(h)==True (1)                                                     -0.056
  contains(z)==True (1)                                                      0.052
  contains(c)==True (1)                                                     -0.044
  contains(l)==True (1)                                                     -0.040
  contains(’)==True (1)                                                      0.037
  contains(g)==True (1)                                                     -0.036
  contains(e)==True (1)                                                     -0.035
  contains(w)==True (1)                                                     -0.026
  contains(x)==True (1)                                                     -0.023
  ---------------------------------------------------------------------------------
  TOTAL:                                             3.473   2.739   1.617   1.338
  PROBS:                                             0.378   0.227   0.104   0.086

The classifier did well: it trained in about 2 minutes and reached an initial accuracy of about 83%, a pretty good start!
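The TOTAL and PROBS rows in the explanation above look consistent with the totals being base-2 log scores that are exponentiated and then normalized over all labels. The quick check below is only a sketch: it uses the four displayed columns, and the normalizer `Z` it recovers is an estimate, because the remaining labels are not shown in the output.

```python
# Values copied from the TOTAL and PROBS rows of the explanation above.
totals = [3.473, 2.739, 1.617, 1.338]
probs = [0.378, 0.227, 0.104, 0.086]

# Assumption: each probability equals 2**total / Z, where Z sums 2**total
# over every label (including the labels that are not displayed).
implied_z = [2 ** t / p for t, p in zip(totals, probs)]

for t, p, z in zip(totals, probs, implied_z):
    print("total=%.3f  prob=%.3f  implied Z=%.2f" % (t, p, z))

# If the assumption holds, every column should imply roughly the same Z.
spread = max(implied_z) / min(implied_z)
print("max/min ratio of implied Z: %.4f" % spread)
```

The implied normalizers agree to within rounding error, which is what we would expect if the feature weights are simply added per label before being turned into probabilities.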

Parsing with Stanford Parser and NLTK

NLTK's built-in parsers are notoriously weak because they are designed for teaching rather than production use. However, NLTK can wrap the Stanford parser and named-entity tagger instead.


In [23]:
import os

import nltk

from nltk.tag.stanford import NERTagger
from nltk.parse.stanford import StanfordParser

## NER JAR and Models
STANFORD_NER_MODEL = os.path.expanduser("~/Development/stanford-ner-2014-01-04/classifiers/english.all.3class.distsim.crf.ser.gz")
STANFORD_NER_JAR   = os.path.expanduser("~/Development/stanford-ner-2014-01-04/stanford-ner-2014-01-04.jar")

## Parser JAR and Models
STANFORD_PARSER_MODELS = os.path.expanduser("~/Development/stanford-parser-full-2014-10-31/stanford-parser-3.5.0-models.jar")
STANFORD_PARSER_JAR    = os.path.expanduser("~/Development/stanford-parser-full-2014-10-31/stanford-parser.jar")

def create_tagger(model=None, jar=None, encoding='ASCII'):
    model = model or STANFORD_NER_MODEL
    jar   = jar or STANFORD_NER_JAR

    return NERTagger(model, jar, encoding)

def create_parser(models=None, jar=None, **kwargs):
    models = models or STANFORD_PARSER_MODELS
    jar   = jar or STANFORD_PARSER_JAR

    return StanfordParser(jar, models, **kwargs)

class NER(object):

    tagger = None

    @classmethod
    def initialize_tagger(klass, model=None, jar=None, encoding='ASCII'):
        klass.tagger = create_tagger(model, jar, encoding)

    @classmethod
    def tag(klass, sent):
        if klass.tagger is None:
            klass.initialize_tagger()

        sent = nltk.word_tokenize(sent)
        return klass.tagger.tag(sent)

class Parser(object):

    parser = None

    @classmethod
    def initialize_parser(klass, models=None, jar=None, **kwargs):
        klass.parser = create_parser(models, jar, **kwargs)

    @classmethod
    def parse(klass, sent):
        if klass.parser is None:
            klass.initialize_parser()

        return klass.parser.raw_parse(sent)

def tag(sent):
    return NER.tag(sent)

def parse(sent):
    return Parser.parse(sent)
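The NER and Parser classes above keep a single shared tool at class level and create it only on first use, which avoids paying the startup cost more than once. The sketch below isolates that pattern with a hypothetical stand-in (`SlowTokenizer` is invented for illustration and has nothing to do with the Stanford tools).

```python
class SlowTokenizer(object):
    """Hypothetical stand-in for a tool that is expensive to construct."""

    instances = 0  # counts how many times the constructor actually ran

    def __init__(self):
        SlowTokenizer.instances += 1

    def tokenize(self, sent):
        return sent.split()


class Tokenizer(object):

    tool = None  # shared across all calls; created on first use

    @classmethod
    def tokenize(cls, sent):
        # Build the expensive tool lazily, then reuse it on later calls.
        if cls.tool is None:
            cls.tool = SlowTokenizer()
        return cls.tool.tokenize(sent)


print(Tokenizer.tokenize("The man hit the building"))
print(Tokenizer.tokenize("with the bat"))
print(SlowTokenizer.instances)  # the constructor ran only once
```

The trade-off is that the shared instance is global state: every caller gets the same configuration unless the initializer is called again explicitly.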

In [24]:
tag("The man hit the building with the bat.")


Out[24]:
[(u'The', u'O'),
 (u'man', u'O'),
 (u'hit', u'O'),
 (u'the', u'O'),
 (u'building', u'O'),
 (u'with', u'O'),
 (u'the', u'O'),
 (u'bat', u'O'),
 (u'.', u'O')]

In [25]:
for p in parse("The man hit the building with the bat."):
    print p


(ROOT
  (S
    (NP (DT The) (NN man))
    (VP
      (VBD hit)
      (NP (DT the) (NN building))
      (PP (IN with) (NP (DT the) (NN bat))))
    (. .)))
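The bracketed output above is a nested tree, and the interesting part is where the prepositional phrase attaches. Below is a minimal sketch, using only the standard library, that reads the bracketed string into nested tuples and inspects the VP node. The helper names (`read_tree`, `find`, `leaves`) are made up for this example; NLTK's own `Tree` class provides richer equivalents.

```python
import re

PARSE = """(ROOT
  (S
    (NP (DT The) (NN man))
    (VP
      (VBD hit)
      (NP (DT the) (NN building))
      (PP (IN with) (NP (DT the) (NN bat))))
    (. .)))"""


def read_tree(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = iter(re.findall(r"\(|\)|[^\s()]+", text))

    def read_node():
        label = next(tokens)
        children = []
        for tok in tokens:
            if tok == "(":
                children.append(read_node())
            elif tok == ")":
                return (label, children)
            else:
                children.append(tok)  # a leaf word
        raise ValueError("unbalanced brackets")

    if next(tokens) != "(":
        raise ValueError("parse must start with '('")
    return read_node()


def find(tree, label):
    """Return the first subtree with the given label (depth first)."""
    if tree[0] == label:
        return tree
    for child in tree[1]:
        if isinstance(child, tuple):
            hit = find(child, label)
            if hit is not None:
                return hit
    return None


def leaves(tree):
    """Collect the words at the bottom of the tree, left to right."""
    words = []
    for child in tree[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = read_tree(PARSE)
print(leaves(tree))
print([child[0] for child in find(tree, "VP")[1]])
```

The PP appears as a direct child of the VP rather than inside the object NP, so this parse reads the bat as the instrument of the hitting, not as something the building has.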

TextBlob

A lightweight wrapper around nltk that provides a simple "Blob" interface for working with text.


In [23]:
from textblob import TextBlob
from bs4 import BeautifulSoup

text = TextBlob(get_text("nrRB0.html"))

print text.sentences


[Sentence("A crisp and juicy bucket list of D.C.’s best fried chicken - The Washington Post

 It’s lowbrow."), Sentence("It’s messy."), Sentence("It could never be accused of being healthful."), Sentence("But we’d never let those formalities get between us and an order of crispy, crackly, delicious fried chicken."), Sentence("Whether it comes in a bucket or on a bun, or you eat it with your fingers or chopsticks, there’s a surprising variety to the Washington area’s fried chicken offerings."), Sentence("Here are some of the most irresistible."), Sentence("‘Rotissi-fried’ chicken at the Partisan

Forget the cronut."), Sentence("Our newest favorite food chimera is the “rotissi-fried” chicken at the Partisan."), Sentence("Credit goes to chef Nate Anda, who dreamed up the dish: After a 12-hour brine, the chicken is rotisseried for two hours and then fried for two and a half minutes."), Sentence("Why both?"), Sentence("“Everything is better once it’s fried in beef fat,” Anda said."), Sentence("We have to agree."), Sentence("Whether white or dark, the meat is succulent throughout."), Sentence("The batter-free frying leaves the simply seasoned skin rendered perfectly crisp, golden and translucent — cracklings, essentially."), Sentence("The sound of it shattering under the knife was music to our ears."), Sentence("And as if the lily needed further gilding, the chicken comes with a generous pour of honey hot sauce."), Sentence("The sauce is hard to resist, but try to reserve a few bites of unadorned chicken so you can fully appreciate this happy marriage of classic preparations."), Sentence("The Partisan, 709 D St. NW."), Sentence("202-524-5322. www.thepartisandc.com."), Sentence("— Becky Krystal

 [In a love/hate relationship with Chick-fil-A?"), Sentence("Here are some alternatives] 

Traditional fried chicken at Family Meal

When Bryan Voltaggio started planning the menu for Family Meal, his modern, upscale spin on a diner, his thoughts turned to home and the carryout meal he most enjoyed as a kid: fried chicken."), Sentence("“It was one of our favorite things,” he says."), Sentence("“It just seems like a family dinner.” And if he was creating a restaurant called Family Meal, fried chicken “had to be an important part of it.” But Voltaggio wanted to do it right, and set about testing cooking methods, brines, breadings and fryers."), Sentence("That long process paid off with a home run of a fried chicken dish that’s become the most popular item on Family Meal’s menu."), Sentence("The whole chickens spend 12 hours in a brine of pickle juice and roasted poultry stock before getting dredged, rested and dredged again in a mixture of flour, cornmeal and corn starch."), Sentence("After a dip in the top-of-the-line pressure fryer, the thighs, legs and breasts emerge with a crisp, salty skin that cracks open to reveal wonderfully warm and moist flesh."), Sentence("You don’t even need to dunk it in the house-made hot sauce that accompanies the dish, but really, who can resist?"), Sentence("Family Meal Ashburn, 20462 Exchange St. 703-726-9800; Frederick, 880 N. East St. 301-378-2895; Baltimore, 621 E. Pratt St. 410-601-3242. www.voltfamilymeal.com."), Sentence("— John Taylor

  

 [40 Eats: D.C.’s most essential dishes of 2015] 



Japanese fried chicken at Izakaya Seki

Although it’s commonly served in Japan at karaoke bars, convenience stores and on street carts, karaage chicken — like most of the country’s food — is held to an extremely high standard."), Sentence("“It’s taken it to the nth degree of obsession and detail,” says Cizuka Seki, who, with her father Hiroshi, owns Izakaya Seki on V Street NW."), Sentence("“Kara age” is used to describe the method for deep-frying bite-size pieces of fish and, more commonly, chicken."), Sentence("Though there are subtle variations on the ubiquitous dish, most recipes call for chicken thighs marinated in soy sauce, coated in flour or corn starch and deep-fried in oil."), Sentence("Izakaya Seki’s version sticks closely to the formula."), Sentence("Probably."), Sentence("“I’m not even quite sure what my dad puts into it, because we don’t have recipes,” Seki says, though she’s certain wheat flour is involved."), Sentence("The result is a thin, tender coating that’s slightly softer than tempura."), Sentence("The accompanying ponzu sauce lends a tartness to the nubs."), Sentence("Izakaya Seki, 1117 V St. NW."), Sentence("202-588-5841. www.sekidc.com."), Sentence("— Holley Simmons

Korean fried chicken at BonChon

Don’t waste your kimchi-stinking breath asking for more sauce at BonChon."), Sentence("The South Korean fried chicken chain, founded in 2002, is so dedicated to consistency that it doesn’t allow for any modifications."), Sentence("And why would you want to change anything, really?"), Sentence("The made-to-order wings, drumsticks and strips are fried twice, resulting in a paper-thin crust that yields the same satisfying crack as shattering crème brulee with a spoon."), Sentence("Founder Jinduk Seh spent two years perfecting his secret sauces, which come in three flavors — soy garlic, hot and a blend of the two — and are brushed on by hand post-fry, piece by piece."), Sentence("True to BonChon’s commitment to uniformity, sauces are made exclusively in South Korea and distributed to all 140-plus BonChon locations, which means the wings you’re chewing on in Arlington are slathered with the same exact stuff as those in the Philippines."), Sentence("Joints like these are so common throughout Korea they’re called “chimeks,” which is a hybrid term that combines “chicken” with the Korean word for beer."), Sentence("Washington should be happy to have 10 BonChons within driving distance, plus a brand new Metro-accessible location near the Navy Yard."), Sentence("BonChon, 1015 Half St."), Sentence("SE and nine other locations in Maryland and Virginia."), Sentence("www.bonchon.com."), Sentence("— Holley Simmons

Maryland fried chicken at Crisfield Seafood and Hank’s Oyster Bar

There’s not much agreement on what constitutes Maryland fried chicken."), Sentence("Some say it’s just a fresh Maryland chicken that’s pan-fried; others say it should be topped with white gravy, almost like a chicken-fried steak."), Sentence("The pan-fried chicken platter at  Crisfield Seafood is a perfect example of the former style."), Sentence("Half of a chicken is dredged in flour, dusted with salt and pepper, and fried in a cast-iron pan."), Sentence("This preparation lends a snap and crunch to the exterior, and while the meat falls off the bone, the well-seasoned breading holds on."), Sentence("(The chicken is available only Friday through Sunday, and frequently sells out.)"), Sentence("The Chesapeake fried chicken at Hank’s Oyster Bar in Dupont Circle and Capitol Hill is plumper than Crisfield’s version and seasoned with Old Bay, black pepper and cayenne, but the breading is softer and less crispy."), Sentence("It’s brined for 24 hours and deep-fried, rather than pan-fried, and it’s served only on Sunday."), Sentence("Crisfield Seafood, 8012 Georgia Ave., Silver Spring."), Sentence("301-589-1306. www.crisfieldseafood.com."), Sentence("Hank’s Oyster Bar, 1624 Q St. NW."), Sentence("202-462-4265; 633 Pennsylvania Ave."), Sentence("SE."), Sentence("202-733-1971. www.hanksoysterbar.com."), Sentence("— Fritz Hahn



 [Pizza in Washington: An upper-crust tour of every D.C. style ]



Fancy fried chicken at Central

Self-consciousness may prevent you from ordering fried chicken in a white-tablecloth restaurant."), Sentence("It feels incongruous — gauche, almost — to dig into picnic fare at the kind of place where you should be ordering risotto or tartare or something that comes with mousse, gelee or foam."), Sentence("But you have to override that adult voice in the back of your head, because if you don’t, you’ll miss out on Central Michel Richard’s famed fried chicken plate ($24 at lunch, $25 for dinner), which remains as good as ever."), Sentence("Though it’s no longer sold by the bucket to go, Michel Richard’s KFC-inspired crispy breast and thigh come stacked atop a pool of the butteriest mashed potatoes you’ll ever taste."), Sentence("It’s such a dignified presentation that this most American of dishes almost could pass as (gasp!)"), Sentence("French."), Sentence("Self-consciousness should, however, steer you toward using a knife and fork — not your fingers — to eat the chicken."), Sentence("It is, after all, that kind of a place."), Sentence("Central Michel Richard, 1001 Pennsylvania Ave. NW."), Sentence("202-626-0015. www.centralmichelrichard.com ."), Sentence("— Maura Judkis



Nashville hot chicken at Reserve 2216

If you believe the lore, Nashville hot chicken was basically a crime of passion, created as a blistering rebuke to a no-account Romeo who couldn’t keep his hands off other women."), Sentence("Alas, this wolf was also a chili head who found pleasure, not pain, in this dish of revenge served hot."), Sentence("Decades later, chefs are latching onto this addictive form of punishment."), Sentence("Aaron Silverman served an ultra-refined version at Rose’s Luxury for months, and now Eric Reid, chef and co-owner of Reserve 2216 in Del Ray, has developed his own take on hot chicken, even if he’s never actually enjoyed it in Nashville."), Sentence("He marinates an airline cut (boneless breast with the drumette wing attached) in buttermilk and Crystal hot sauce before dredging the chicken in seasoned flour and frying it."), Sentence("Reid ditches the traditional white bread base in favor of collards and a side of corn bread waffles."), Sentence("He finishes the dish with a combination of Cajun seasonings and more Crystal hot sauce for a moist, crispy bird that bites back."), Sentence("But not too hard."), Sentence("This is Alexandria, after all."), Sentence("Reserve 2216, 2216 Mount Vernon Ave., Alexandria."), Sentence("703-549-2889. www.drpreserve.com."), Sentence("— Tim Carman



Fast-food fried chicken at Popeyes

The sole virtue of most fast-food operations is consistency."), Sentence("Whether you bite into a Big Mac in Bethesda or Beijing, the sandwich should taste the same."), Sentence("The menu at Popeyes follows suit, but it deviates from the brand-name competition in an important respect: The signature at Popeyes could pass for home cooking (well, if your home had a vat of clean, hot oil and a person with a Southern accent tending the meal)."), Sentence("Maybe that accounts for my occasional forays to the chicken fryer after a bum restaurant-review excursion."), Sentence("No matter where I eat my order, inevitably “spicy,” I know I can count on a coating that smacks of cayenne, paprika and even crushed cornflakes, and chicken that spurts with juice."), Sentence("The shatter is audible; the golden crumbs fly everywhere, but end up on my tongue."), Sentence("No one-trick pony, Popeyes has hot and tender buttermilk biscuits that bolster my favorite excuse to snack low on the food chain."), Sentence("Once, I got home to discover a clerk had forgotten to pack bread in my bag."), Sentence("I almost cried."), Sentence("Instead, I consoled myself with another piece of chicken."), Sentence("Popeyes has locations throughout the D.C. metro area."), Sentence("www.popeyes.com."), Sentence("— Tom Sietsema



Fried chicken sandwich at DCity Smokehouse

The fried chicken sandwich hasn’t been the same since KFC’s Double Down turned a guilty pleasure into an outright farce, using crispy white-meat fillets as both the sandwich’s primary protein and the oily handles by which you eat the monstrosity."), Sentence("Leave it to Rob Sonderman, pitmaster and co-owner of DCity Smokehouse, to bring dignity back to the bite."), Sentence("His Den-Den — named for co-creator and pitmaster-in-training Dennis Geddie — begins with boneless thighs marinated in buttermilk, hot sauce and honey."), Sentence("Sonderman then dredges the meat in seasoned flour before dropping the thighs into the fryer."), Sentence("Generously stuffed into a grilled hoagie roll with lettuce, tomato and crispy onions, the chicken is finished with two sauces, including a house-made cilantro ranch."), Sentence("Technically, the Den-Den ($9.25) is one of the few smokeless items on DCity’s menu (unless you count the chipotle peppers in the hot sauce)."), Sentence("No matter."), Sentence("You won’t care the minute you sink your teeth into that heaping hoagie of spicy thigh meat."), Sentence("DCity Smokehouse, 8 Florida Ave. NW."), Sentence("202-733-1919. www.dcitysmokehouse.com."), Sentence("— Tim Carman



Classic D.C. fried chicken at Oohh’s and Aahh’s

Hearty is the appetite that can handle Oohh’s and Aahh’s chef-owner Oji Abbott’s boneless fried chicken breast without taking home leftovers."), Sentence("He buys local — from Hartman Meat Co. in Northeast Washington — and butterflies each 14-ounce portion ($12.95), which results in a lot of real estate for the crispy, well-seasoned coating."), Sentence("Abbott chalks up the consistently moist meat to proper cooking time and temperature, and to the recipe he learned from his grandmother."), Sentence("Oohh’s and Aahh’s, 1005 U St. NW."), Sentence("202-667-7142. www.oohhsnaahhs.com."), Sentence("— Bonnie S. Benwick



Popcorn fried chicken at Pop’s Sea Bar

It’s all too easy to chomp through an order of Boardwalk Chicken at the shore-happy Pop’s Sea Bar in Adams Morgan."), Sentence("The bite-size pieces of dark-meat-only bird are brined for two hours, then treated to a buttermilk bath until they are coated with plain flour and flash-fried to order."), Sentence("A generous hand with salt and pepper just before serving means the effect of that seasoning builds as you empty the single-serving basket ($8.99)."), Sentence("Ask for two portions of the accompanying Jersey sauce so you’ll have enough of its horseradish-y, kitchen-sink blend for every bite."), Sentence("Pop’s Sea Bar, 1817 Columbia Rd."), Sentence("NW."), Sentence("202-534-3933. www.popsseabar.com."), Sentence("— Bonnie S. Benwick



Fried chicken tenders at GBD

Chicken tenders are often relegated to the children’s menu, but the chicken Tendies at GBD are a fine meal for adults and children alike."), Sentence("Each white-meat tender comes with a dark, crispy outer layer with a heavy dose of salt and spice."), Sentence("But the best part about chicken tenders is the dipping, and GBD offers nine sauces, including a take on D.C.’s own mumbo sauce, buttermilk ranch, chipotle barbecue and Frankenbutter, which combines Frank’s RedHot sauce with butter."), Sentence("Ask for the $5.50 Saucetown option to try all nine."), Sentence("GBD, 1323 Connecticut Ave. NW."), Sentence("202-524-5210. www.gbdchickendoughnuts.com."), Sentence("— Margaret Ely



Fried chicken skins at Gypsy Soul

Fried chicken fans can argue the merits of white meat vs. dark meat, or whether it’s better to chow down on a drumstick or the breast."), Sentence("But one thing we all can agree on is that the outer layer — the breading and the skin — is the most important element of a memorable piece of fried chicken."), Sentence("And sometimes you just want to savor the flavor of the skin — deeply spiced, perfect crunch — without filling up on meat or having to deal with bones."), Sentence("And that is when you grab one of the bar stools at R.J. Cooper’s Gypsy Soul in Fairfax’s Mosaic District."), Sentence("Cooper’s chicken skins ($9) are twisted slivers and shards, decadently salty and crackling with paprika, cayenne pepper and garlic."), Sentence("The dish arrives with a house-made “roof top honey-snake oil,” but it’s best to let these beauties shine on their own."), Sentence("Gypsy Soul, 8296 Glass Alley, Fairfax."), Sentence("703-992-0933. www.gypsysoul-va.com."), Sentence("— Fritz Hahn

 Related items: 

 D.C.’s most essential dishes of 2015 

 Pizza in Washington: An upper crust tour of every D.C. style 



")]

In [25]:
import nltk

In [26]:
np = nltk.FreqDist(text.noun_phrases)
print np.most_common(10)


[(u'hot sauce', 6), (u'popeyes', 5), (u'washington', 5), (u'it\u2019s', 4), (u'gbd', 4), (u'd.c.\u2019s', 4), (u'st. nw', 4), (u'bonchon', 3), (u'maryland', 3), (u'hank\u2019s oyster', 3)]

In [27]:
print text.sentiment


Sentiment(polarity=-0.0025676717918097208, subjectivity=0.5856343297507093)

In [28]:
review = TextBlob("Harrison Ford would be the most amazing, most wonderful, most handsome actor - the greatest that ever lived, if only he didn't have that silly earing.")
print review.sentiment


Sentiment(polarity=0.4555555555555555, subjectivity=0.8083333333333333)
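One way to build intuition for these numbers is to think of polarity as an average over the scored words found in the text. The sketch below is a deliberately simplified illustration of that idea, not TextBlob's actual analyzer: the lexicon and its scores are hypothetical, and real analyzers also handle intensifiers and negation.

```python
# Hypothetical scores for illustration only; these are not TextBlob's values.
LEXICON = {
    "amazing": 0.6,
    "wonderful": 1.0,
    "handsome": 0.5,
    "greatest": 1.0,
    "silly": -0.5,
}


def average_polarity(text):
    """Average the scores of the words that appear in the lexicon."""
    words = [w.strip(".,!?-").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


review = ("Harrison Ford would be the most amazing, most wonderful, "
          "most handsome actor - the greatest that ever lived, "
          "if only he didn't have that silly earing.")
print(average_polarity(review))
print(average_polarity("It could never be accused of being healthful."))
```

Under this view, a long article with a mix of positive and negative words drifts toward zero, while a short review packed with superlatives scores strongly positive even if it ends on a complaint.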

Language Detection using TextBlob


In [29]:
b = TextBlob(u"بسيط هو أفضل من مجمع")
b.detect_language()


Out[29]:
u'ar'
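As far as I know, TextBlob hands detection off to an online translation service, so it needs network access (worth verifying for your installed version). For comparison, here is a crude offline heuristic that only guesses the dominant Unicode script. It is a different and much weaker technique: it cannot tell apart languages that share a script, and the function name `dominant_script` is invented for this sketch.

```python
# -*- coding: utf-8 -*-
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most common script from Unicode character names."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Character names start with the script, e.g. "ARABIC LETTER BEH".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script(u"بسيط هو أفضل من مجمع"))
print(dominant_script(u"美丽优于丑陋"))
print(dominant_script(u"Simple is better than complex."))
```

A script guess like this can be a useful sanity check before sending text to a remote service, but it is no substitute for a real language identifier.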

In [32]:
chinese_blob = TextBlob(u"美丽优于丑陋")
chinese_blob.translate(from_lang="zh-CN", to='en')


Out[32]:
TextBlob("")

In [33]:
en_blob = TextBlob(u"Simple is better than complex.")
en_blob.translate(to="es")


Out[33]:
TextBlob("")

spaCy

Industrial-strength NLP in Python with a Cython backend, which makes it very fast. Its licensing may be an issue for some projects, so check the terms before adopting it.


In [34]:
from __future__ import unicode_literals 
from spacy.en import English

nlp = English()

tokens = nlp(u'The man hit the building with the baseball bat.')

baseball = tokens[7]
print (baseball.orth, baseball.orth_, baseball.head.lemma, baseball.head.lemma_)


(2303, u'baseball', 4193, u'bat')
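The output above pairs integer IDs with strings: attributes without a trailing underscore (`orth`, `lemma`) are integers, and the underscore versions (`orth_`, `lemma_`) are the corresponding text. The sketch below shows the general idea of mapping strings to integer IDs and back. `StringStore` here is a hypothetical toy class written for this example, and its IDs will not match spaCy's.

```python
class StringStore(object):
    """Toy two-way mapping between strings and integer IDs."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def add(self, string):
        # Assign the next free ID the first time a string is seen.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, ident):
        return self._strings[ident]


store = StringStore()
ids = [store.add(w) for w in "the man hit the building with the bat".split()]
print(ids)
print([store.lookup(i) for i in ids])
```

Comparing and hashing small integers is cheaper than comparing strings, which is one plausible reason a speed-focused library would expose both forms.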

In [139]:
tokens = nlp(u'The man hit the building with the baseball bat.', parse=True)
for token in tokens:
    print token.prob


-5.02773189545
-8.16621112823
-8.3605670929
-3.07847452164
-8.67186450958
-5.23164892197
-3.07847452164
-9.61269950867
-10.9683980942
-3.17597317696
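These values read most naturally as log probabilities of each word type. Here is a quick sketch that converts them back to plain probabilities, under two assumptions that the notebook does not state: that the logs are natural logs, and that the ten values line up with the ten tokens in the order listed below.

```python
import math

# Assumed token order for the sentence parsed above.
words = ["The", "man", "hit", "the", "building",
         "with", "the", "baseball", "bat", "."]

# Values copied from the output above.
log_probs = [-5.02773189545, -8.16621112823, -8.3605670929, -3.07847452164,
             -8.67186450958, -5.23164892197, -3.07847452164, -9.61269950867,
             -10.9683980942, -3.17597317696]

for word, lp in zip(words, log_probs):
    # exp() undoes a natural log; a different base would need a different inverse.
    print("%-10s %.6f" % (word, math.exp(lp)))
```

Both occurrences of "the" share the same value, which fits a per-word-type estimate rather than one that depends on context, and the rarer words ("baseball", "bat") get the most negative scores.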

gensim

A library for unsupervised modeling of bag-of-words corpora, notably LSA and LDA.

It also implements word2vec, Google's word vectorizer, which was explored in a previous post.
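To make "bag of words" concrete, here is a minimal standard-library sketch that turns documents into sparse lists of (token id, count) pairs and compares them with cosine similarity. It does not use gensim itself; the toy documents and the helper names `to_bow` and `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter

docs = [
    "the man hit the building",
    "the man hit the ball",
    "fried chicken with hot sauce",
]

# Map each distinct token to an integer id, in first-seen order.
vocab = {}
for doc in docs:
    for tok in doc.split():
        vocab.setdefault(tok, len(vocab))


def to_bow(doc):
    """Sparse bag of words: sorted (token id, count) pairs."""
    counts = Counter(vocab[t] for t in doc.split() if t in vocab)
    return sorted(counts.items())


def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    a, b = dict(a), dict(b)
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


bows = [to_bow(d) for d in docs]
print(bows[0])
print(cosine(bows[0], bows[1]))
print(cosine(bows[0], bows[2]))
```

Word order is discarded entirely, so the first two documents look very similar even though they describe different events. Models like LSA and LDA start from this kind of count representation.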