Text Classification

Authors: Victor Zhong, Kelvin Guu

We are going to tackle a relatively straightforward text classification problem with Stanza and TensorFlow.

Dataset

First, we'll grab the 20 Newsgroups data, which sklearn can conveniently download for us.


In [1]:
from sklearn.datasets import fetch_20newsgroups
classes = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=classes)

from collections import Counter
Counter([classes[t] for t in newsgroups_train.target])


Out[1]:
Counter({'alt.atheism': 480, 'soc.religion.christian': 599})
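
The raw documents include newsgroup headers (From, Subject, Lines and so on) in addition to the message text, as the example below shows. If you'd rather classify the message bodies alone, sklearn's loader also accepts a remove argument, and subset='test' fetches the held-out split; a quick sketch (we'll stick with the raw text in this tutorial):

newsgroups_train = fetch_20newsgroups(subset='train', categories=classes,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=classes)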

In [2]:
print newsgroups_train.data[0]


From: nigel.allen@canrem.com (Nigel Allen)
Subject: library of congress to host dead sea scroll symposium april 21-22
Lines: 96


 Library of Congress to Host Dead Sea Scroll Symposium April 21-22
 To: National and Assignment desks, Daybook Editor
 Contact: John Sullivan, 202-707-9216, or Lucy Suddreth, 202-707-9191
          both of the Library of Congress

   WASHINGTON, April 19  -- A symposium on the Dead Sea 
Scrolls will be held at the Library of Congress on Wednesday,
April 21, and Thursday, April 22.  The two-day program, cosponsored
by the library and Baltimore Hebrew University, with additional
support from the Project Judaica Foundation, will be held in the
library's Mumford Room, sixth floor, Madison Building.
   Seating is limited, and admission to any session of the symposium
must be requested in writing (see Note A).
   The symposium will be held one week before the public opening of a
major exhibition, "Scrolls from the Dead Sea: The Ancient Library of
Qumran and Modern Scholarship," that opens at the Library of Congress
on April 29.  On view will be fragmentary scrolls and archaeological
artifacts excavated at Qumran, on loan from the Israel Antiquities
Authority.  Approximately 50 items from Library of Congress special
collections will augment these materials.  The exhibition, on view in
the Madison Gallery, through Aug. 1, is made possible by a generous
gift from the Project Judaica Foundation of Washington, D.C.
   The Dead Sea Scrolls have been the focus of public and scholarly
interest since 1947, when they were discovered in the desert 13 miles
east of Jerusalem.  The symposium will explore the origin and meaning
of the scrolls and current scholarship.  Scholars from diverse
academic backgrounds and religious affiliations, will offer their
disparate views, ensuring a lively discussion.
   The symposium schedule includes opening remarks on April 21, at
2 p.m., by Librarian of Congress James H. Billington, and by
Dr. Norma Furst, president, Baltimore Hebrew University.  Co-chairing
the symposium are Joseph Baumgarten, professor of Rabbinic Literature
and Institutions, Baltimore Hebrew University and Michael Grunberger,
head, Hebraic Section, Library of Congress.
   Geza Vermes, professor emeritus of Jewish studies, Oxford
University, will give the keynote address on the current state of
scroll research, focusing on where we stand today. On the second
day, the closing address will be given by Shmaryahu Talmon, who will
propose a research agenda, picking up the theme of how the Qumran
studies might proceed.
   On Wednesday, April 21, other speakers will include:

   -- Eugene Ulrich, professor of Hebrew Scriptures, University of
Notre Dame and chief editor, Biblical Scrolls from Qumran, on "The
Bible at Qumran;"
   -- Michael Stone, National Endowment for the Humanities
distinguished visiting professor of religious studies, University of
Richmond, on "The Dead Sea Scrolls and the Pseudepigrapha."
   -- From 5 p.m. to 6:30 p.m. a special preview of the exhibition
will be given to symposium participants and guests.

   On Thursday, April 22, beginning at 9 a.m., speakers will include:

   -- Magen Broshi, curator, shrine of the Book, Israel Museum,
Jerusalem, on "Qumran: The Archaeological Evidence;"
   -- P. Kyle McCarter, Albright professor of Biblical and ancient
near Eastern studies, The Johns Hopkins University, on "The Copper
Scroll;"
   -- Lawrence H. Schiffman, professor of Hebrew and Judaic studies,
New York University, on "The Dead Sea Scrolls and the History of
Judaism;" and
   -- James VanderKam, professor of theology, University of Notre
Dame, on "Messianism in the Scrolls and in Early Christianity."

   The Thursday afternoon sessions, at 1:30 p.m., include:

   -- Devorah Dimant, associate professor of Bible and Ancient Jewish
Thought, University of Haifa, on "Qumran Manuscripts: Library of a
Jewish Community;"
   -- Norman Golb, Rosenberger professor of Jewish history and
civilization, Oriental Institute, University of Chicago, on "The
Current Status of the Jerusalem Origin of the Scrolls;"
   -- Shmaryahu Talmon, J.L. Magnas professor emeritus of Biblical
studies, Hebrew University, Jerusalem, on "The Essential 'Commune of
the Renewed Covenant': How Should Qumran Studies Proceed?" will close
the symposium.

   There will be ample time for question and answer periods at the
end of each session.

   Also on Wednesday, April 21, at 11 a.m.:
   The Library of Congress and The Israel Antiquities Authority
will hold a lecture by Esther Boyd-Alkalay, consulting conservator,
Israel Antiquities Authority, on "Preserving the Dead Sea Scrolls"
in the Mumford Room, LM-649, James Madison Memorial Building, The
Library of Congress, 101 Independence Ave., S.E., Washington, D.C.
    ------
   NOTE A: For more information about admission to the symposium,
please contact, in writing, Dr. Michael Grunberger, head, Hebraic
Section, African and Middle Eastern Division, Library of Congress,
Washington, D.C. 20540.
 -30-
--
Canada Remote Systems - Toronto, Ontario
416-629-7000/629-7044

Annotating using CoreNLP

If you do not have CoreNLP, download it from here:

http://stanfordnlp.github.io/CoreNLP/index.html#download

We are going to use the Java server feature of CoreNLP to annotate data from Python. In the CoreNLP directory, run the server:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
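
By default the server listens on port 9000. If that port is taken on your machine, the server CLI accepts a -port flag (you would then also need to point the Stanza client at the new port):

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001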

Next, we'll annotate an example to see how the server works.


In [3]:
from stanza.corenlp.client import Client

client = Client()
annotation = client.annotate(newsgroups_train.data[0], properties={'annotators': 'tokenize,ssplit,pos'})
annotation['sentences'][0]


Out[3]:
{u'index': 0,
 u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE',
 u'tokens': [{u'after': u'',
   u'before': u'',
   u'characterOffsetBegin': 0,
   u'characterOffsetEnd': 4,
   u'index': 1,
   u'originalText': u'From',
   u'pos': u'IN',
   u'word': u'From'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 4,
   u'characterOffsetEnd': 5,
   u'index': 2,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 6,
   u'characterOffsetEnd': 28,
   u'index': 3,
   u'originalText': u'nigel.allen@canrem.com',
   u'pos': u'NNP',
   u'word': u'nigel.allen@canrem.com'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 29,
   u'characterOffsetEnd': 30,
   u'index': 4,
   u'originalText': u'(',
   u'pos': u'-LRB-',
   u'word': u'-LRB-'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 30,
   u'characterOffsetEnd': 35,
   u'index': 5,
   u'originalText': u'Nigel',
   u'pos': u'NNP',
   u'word': u'Nigel'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 36,
   u'characterOffsetEnd': 41,
   u'index': 6,
   u'originalText': u'Allen',
   u'pos': u'NNP',
   u'word': u'Allen'},
  {u'after': u'\n',
   u'before': u'',
   u'characterOffsetBegin': 41,
   u'characterOffsetEnd': 42,
   u'index': 7,
   u'originalText': u')',
   u'pos': u'-RRB-',
   u'word': u'-RRB-'},
  {u'after': u'',
   u'before': u'\n',
   u'characterOffsetBegin': 43,
   u'characterOffsetEnd': 50,
   u'index': 8,
   u'originalText': u'Subject',
   u'pos': u'NNP',
   u'word': u'Subject'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 50,
   u'characterOffsetEnd': 51,
   u'index': 9,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 52,
   u'characterOffsetEnd': 59,
   u'index': 10,
   u'originalText': u'library',
   u'pos': u'NN',
   u'word': u'library'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 60,
   u'characterOffsetEnd': 62,
   u'index': 11,
   u'originalText': u'of',
   u'pos': u'IN',
   u'word': u'of'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 63,
   u'characterOffsetEnd': 71,
   u'index': 12,
   u'originalText': u'congress',
   u'pos': u'NN',
   u'word': u'congress'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 72,
   u'characterOffsetEnd': 74,
   u'index': 13,
   u'originalText': u'to',
   u'pos': u'TO',
   u'word': u'to'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 75,
   u'characterOffsetEnd': 79,
   u'index': 14,
   u'originalText': u'host',
   u'pos': u'NN',
   u'word': u'host'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 80,
   u'characterOffsetEnd': 84,
   u'index': 15,
   u'originalText': u'dead',
   u'pos': u'JJ',
   u'word': u'dead'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 85,
   u'characterOffsetEnd': 88,
   u'index': 16,
   u'originalText': u'sea',
   u'pos': u'NN',
   u'word': u'sea'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 89,
   u'characterOffsetEnd': 95,
   u'index': 17,
   u'originalText': u'scroll',
   u'pos': u'NN',
   u'word': u'scroll'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 96,
   u'characterOffsetEnd': 105,
   u'index': 18,
   u'originalText': u'symposium',
   u'pos': u'NN',
   u'word': u'symposium'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 106,
   u'characterOffsetEnd': 111,
   u'index': 19,
   u'originalText': u'april',
   u'pos': u'NNP',
   u'word': u'april'},
  {u'after': u'\n',
   u'before': u' ',
   u'characterOffsetBegin': 112,
   u'characterOffsetEnd': 117,
   u'index': 20,
   u'originalText': u'21-22',
   u'pos': u'CD',
   u'word': u'21-22'},
  {u'after': u'',
   u'before': u'\n',
   u'characterOffsetBegin': 118,
   u'characterOffsetEnd': 123,
   u'index': 21,
   u'originalText': u'Lines',
   u'pos': u'NNPS',
   u'word': u'Lines'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 123,
   u'characterOffsetEnd': 124,
   u'index': 22,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u'\n\n\n ',
   u'before': u' ',
   u'characterOffsetBegin': 125,
   u'characterOffsetEnd': 127,
   u'index': 23,
   u'originalText': u'96',
   u'pos': u'CD',
   u'word': u'96'},
  {u'after': u' ',
   u'before': u'\n\n\n ',
   u'characterOffsetBegin': 131,
   u'characterOffsetEnd': 138,
   u'index': 24,
   u'originalText': u'Library',
   u'pos': u'NNP',
   u'word': u'Library'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 139,
   u'characterOffsetEnd': 141,
   u'index': 25,
   u'originalText': u'of',
   u'pos': u'IN',
   u'word': u'of'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 142,
   u'characterOffsetEnd': 150,
   u'index': 26,
   u'originalText': u'Congress',
   u'pos': u'NNP',
   u'word': u'Congress'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 151,
   u'characterOffsetEnd': 153,
   u'index': 27,
   u'originalText': u'to',
   u'pos': u'TO',
   u'word': u'to'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 154,
   u'characterOffsetEnd': 158,
   u'index': 28,
   u'originalText': u'Host',
   u'pos': u'NNP',
   u'word': u'Host'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 159,
   u'characterOffsetEnd': 163,
   u'index': 29,
   u'originalText': u'Dead',
   u'pos': u'NNP',
   u'word': u'Dead'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 164,
   u'characterOffsetEnd': 167,
   u'index': 30,
   u'originalText': u'Sea',
   u'pos': u'NNP',
   u'word': u'Sea'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 168,
   u'characterOffsetEnd': 174,
   u'index': 31,
   u'originalText': u'Scroll',
   u'pos': u'NNP',
   u'word': u'Scroll'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 175,
   u'characterOffsetEnd': 184,
   u'index': 32,
   u'originalText': u'Symposium',
   u'pos': u'NNP',
   u'word': u'Symposium'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 185,
   u'characterOffsetEnd': 190,
   u'index': 33,
   u'originalText': u'April',
   u'pos': u'NNP',
   u'word': u'April'},
  {u'after': u'\n ',
   u'before': u' ',
   u'characterOffsetBegin': 191,
   u'characterOffsetEnd': 196,
   u'index': 34,
   u'originalText': u'21-22',
   u'pos': u'CD',
   u'word': u'21-22'},
  {u'after': u'',
   u'before': u'\n ',
   u'characterOffsetBegin': 198,
   u'characterOffsetEnd': 200,
   u'index': 35,
   u'originalText': u'To',
   u'pos': u'TO',
   u'word': u'To'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 200,
   u'characterOffsetEnd': 201,
   u'index': 36,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 202,
   u'characterOffsetEnd': 210,
   u'index': 37,
   u'originalText': u'National',
   u'pos': u'NNP',
   u'word': u'National'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 211,
   u'characterOffsetEnd': 214,
   u'index': 38,
   u'originalText': u'and',
   u'pos': u'CC',
   u'word': u'and'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 215,
   u'characterOffsetEnd': 225,
   u'index': 39,
   u'originalText': u'Assignment',
   u'pos': u'NNP',
   u'word': u'Assignment'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 226,
   u'characterOffsetEnd': 231,
   u'index': 40,
   u'originalText': u'desks',
   u'pos': u'NNS',
   u'word': u'desks'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 231,
   u'characterOffsetEnd': 232,
   u'index': 41,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 233,
   u'characterOffsetEnd': 240,
   u'index': 42,
   u'originalText': u'Daybook',
   u'pos': u'NNP',
   u'word': u'Daybook'},
  {u'after': u'\n ',
   u'before': u' ',
   u'characterOffsetBegin': 241,
   u'characterOffsetEnd': 247,
   u'index': 43,
   u'originalText': u'Editor',
   u'pos': u'NNP',
   u'word': u'Editor'},
  {u'after': u'',
   u'before': u'\n ',
   u'characterOffsetBegin': 249,
   u'characterOffsetEnd': 256,
   u'index': 44,
   u'originalText': u'Contact',
   u'pos': u'NN',
   u'word': u'Contact'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 256,
   u'characterOffsetEnd': 257,
   u'index': 45,
   u'originalText': u':',
   u'pos': u':',
   u'word': u':'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 258,
   u'characterOffsetEnd': 262,
   u'index': 46,
   u'originalText': u'John',
   u'pos': u'NNP',
   u'word': u'John'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 263,
   u'characterOffsetEnd': 271,
   u'index': 47,
   u'originalText': u'Sullivan',
   u'pos': u'NNP',
   u'word': u'Sullivan'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 271,
   u'characterOffsetEnd': 272,
   u'index': 48,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 273,
   u'characterOffsetEnd': 285,
   u'index': 49,
   u'originalText': u'202-707-9216',
   u'pos': u'CD',
   u'word': u'202-707-9216'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 285,
   u'characterOffsetEnd': 286,
   u'index': 50,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 287,
   u'characterOffsetEnd': 289,
   u'index': 51,
   u'originalText': u'or',
   u'pos': u'CC',
   u'word': u'or'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 290,
   u'characterOffsetEnd': 294,
   u'index': 52,
   u'originalText': u'Lucy',
   u'pos': u'NNP',
   u'word': u'Lucy'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 295,
   u'characterOffsetEnd': 303,
   u'index': 53,
   u'originalText': u'Suddreth',
   u'pos': u'NNP',
   u'word': u'Suddreth'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 303,
   u'characterOffsetEnd': 304,
   u'index': 54,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u'\n          ',
   u'before': u' ',
   u'characterOffsetBegin': 305,
   u'characterOffsetEnd': 317,
   u'index': 55,
   u'originalText': u'202-707-9191',
   u'pos': u'CD',
   u'word': u'202-707-9191'},
  {u'after': u' ',
   u'before': u'\n          ',
   u'characterOffsetBegin': 328,
   u'characterOffsetEnd': 332,
   u'index': 56,
   u'originalText': u'both',
   u'pos': u'DT',
   u'word': u'both'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 333,
   u'characterOffsetEnd': 335,
   u'index': 57,
   u'originalText': u'of',
   u'pos': u'IN',
   u'word': u'of'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 336,
   u'characterOffsetEnd': 339,
   u'index': 58,
   u'originalText': u'the',
   u'pos': u'DT',
   u'word': u'the'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 340,
   u'characterOffsetEnd': 347,
   u'index': 59,
   u'originalText': u'Library',
   u'pos': u'NNP',
   u'word': u'Library'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 348,
   u'characterOffsetEnd': 350,
   u'index': 60,
   u'originalText': u'of',
   u'pos': u'IN',
   u'word': u'of'},
  {u'after': u'\n\n   ',
   u'before': u' ',
   u'characterOffsetBegin': 351,
   u'characterOffsetEnd': 359,
   u'index': 61,
   u'originalText': u'Congress',
   u'pos': u'NNP',
   u'word': u'Congress'},
  {u'after': u'',
   u'before': u'\n\n   ',
   u'characterOffsetBegin': 364,
   u'characterOffsetEnd': 374,
   u'index': 62,
   u'originalText': u'WASHINGTON',
   u'pos': u'NNP',
   u'word': u'WASHINGTON'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 374,
   u'characterOffsetEnd': 375,
   u'index': 63,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 376,
   u'characterOffsetEnd': 381,
   u'index': 64,
   u'originalText': u'April',
   u'pos': u'NNP',
   u'word': u'April'},
  {u'after': u'  ',
   u'before': u' ',
   u'characterOffsetBegin': 382,
   u'characterOffsetEnd': 384,
   u'index': 65,
   u'originalText': u'19',
   u'pos': u'CD',
   u'word': u'19'},
  {u'after': u' ',
   u'before': u'  ',
   u'characterOffsetBegin': 386,
   u'characterOffsetEnd': 388,
   u'index': 66,
   u'originalText': u'--',
   u'pos': u':',
   u'word': u'--'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 389,
   u'characterOffsetEnd': 390,
   u'index': 67,
   u'originalText': u'A',
   u'pos': u'DT',
   u'word': u'A'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 391,
   u'characterOffsetEnd': 400,
   u'index': 68,
   u'originalText': u'symposium',
   u'pos': u'NN',
   u'word': u'symposium'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 401,
   u'characterOffsetEnd': 403,
   u'index': 69,
   u'originalText': u'on',
   u'pos': u'IN',
   u'word': u'on'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 404,
   u'characterOffsetEnd': 407,
   u'index': 70,
   u'originalText': u'the',
   u'pos': u'DT',
   u'word': u'the'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 408,
   u'characterOffsetEnd': 412,
   u'index': 71,
   u'originalText': u'Dead',
   u'pos': u'NNP',
   u'word': u'Dead'},
  {u'after': u' \n',
   u'before': u' ',
   u'characterOffsetBegin': 413,
   u'characterOffsetEnd': 416,
   u'index': 72,
   u'originalText': u'Sea',
   u'pos': u'NNP',
   u'word': u'Sea'},
  {u'after': u' ',
   u'before': u' \n',
   u'characterOffsetBegin': 418,
   u'characterOffsetEnd': 425,
   u'index': 73,
   u'originalText': u'Scrolls',
   u'pos': u'NNP',
   u'word': u'Scrolls'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 426,
   u'characterOffsetEnd': 430,
   u'index': 74,
   u'originalText': u'will',
   u'pos': u'MD',
   u'word': u'will'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 431,
   u'characterOffsetEnd': 433,
   u'index': 75,
   u'originalText': u'be',
   u'pos': u'VB',
   u'word': u'be'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 434,
   u'characterOffsetEnd': 438,
   u'index': 76,
   u'originalText': u'held',
   u'pos': u'VBN',
   u'word': u'held'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 439,
   u'characterOffsetEnd': 441,
   u'index': 77,
   u'originalText': u'at',
   u'pos': u'IN',
   u'word': u'at'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 442,
   u'characterOffsetEnd': 445,
   u'index': 78,
   u'originalText': u'the',
   u'pos': u'DT',
   u'word': u'the'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 446,
   u'characterOffsetEnd': 453,
   u'index': 79,
   u'originalText': u'Library',
   u'pos': u'NNP',
   u'word': u'Library'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 454,
   u'characterOffsetEnd': 456,
   u'index': 80,
   u'originalText': u'of',
   u'pos': u'IN',
   u'word': u'of'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 457,
   u'characterOffsetEnd': 465,
   u'index': 81,
   u'originalText': u'Congress',
   u'pos': u'NNP',
   u'word': u'Congress'},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 466,
   u'characterOffsetEnd': 468,
   u'index': 82,
   u'originalText': u'on',
   u'pos': u'IN',
   u'word': u'on'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 469,
   u'characterOffsetEnd': 478,
   u'index': 83,
   u'originalText': u'Wednesday',
   u'pos': u'NNP',
   u'word': u'Wednesday'},
  {u'after': u'\n',
   u'before': u'',
   u'characterOffsetBegin': 478,
   u'characterOffsetEnd': 479,
   u'index': 84,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u' ',
   u'before': u'\n',
   u'characterOffsetBegin': 480,
   u'characterOffsetEnd': 485,
   u'index': 85,
   u'originalText': u'April',
   u'pos': u'NNP',
   u'word': u'April'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 486,
   u'characterOffsetEnd': 488,
   u'index': 86,
   u'originalText': u'21',
   u'pos': u'CD',
   u'word': u'21'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 488,
   u'characterOffsetEnd': 489,
   u'index': 87,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 490,
   u'characterOffsetEnd': 493,
   u'index': 88,
   u'originalText': u'and',
   u'pos': u'CC',
   u'word': u'and'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 494,
   u'characterOffsetEnd': 502,
   u'index': 89,
   u'originalText': u'Thursday',
   u'pos': u'NNP',
   u'word': u'Thursday'},
  {u'after': u' ',
   u'before': u'',
   u'characterOffsetBegin': 502,
   u'characterOffsetEnd': 503,
   u'index': 90,
   u'originalText': u',',
   u'pos': u',',
   u'word': u','},
  {u'after': u' ',
   u'before': u' ',
   u'characterOffsetBegin': 504,
   u'characterOffsetEnd': 509,
   u'index': 91,
   u'originalText': u'April',
   u'pos': u'NNP',
   u'word': u'April'},
  {u'after': u'',
   u'before': u' ',
   u'characterOffsetBegin': 510,
   u'characterOffsetEnd': 512,
   u'index': 92,
   u'originalText': u'22',
   u'pos': u'CD',
   u'word': u'22'},
  {u'after': u'  ',
   u'before': u'',
   u'characterOffsetBegin': 512,
   u'characterOffsetEnd': 513,
   u'index': 93,
   u'originalText': u'.',
   u'pos': u'.',
   u'word': u'.'}]}

That was rather long, but the gist is that the annotation is organized into sentences, which are in turn organized into tokens. Each token carries a number of annotations (here we've only asked for POS tags).


In [4]:
for token in annotation['sentences'][0]['tokens']:
    print token['word'], token['pos']


From IN
: :
nigel.allen@canrem.com NNP
-LRB- -LRB-
Nigel NNP
Allen NNP
-RRB- -RRB-
Subject NNP
: :
library NN
of IN
congress NN
to TO
host NN
dead JJ
sea NN
scroll NN
symposium NN
april NNP
21-22 CD
Lines NNPS
: :
96 CD
Library NNP
of IN
Congress NNP
to TO
Host NNP
Dead NNP
Sea NNP
Scroll NNP
Symposium NNP
April NNP
21-22 CD
To TO
: :
National NNP
and CC
Assignment NNP
desks NNS
, ,
Daybook NNP
Editor NNP
Contact NN
: :
John NNP
Sullivan NNP
, ,
202-707-9216 CD
, ,
or CC
Lucy NNP
Suddreth NNP
, ,
202-707-9191 CD
both DT
of IN
the DT
Library NNP
of IN
Congress NNP
WASHINGTON NNP
, ,
April NNP
19 CD
-- :
A DT
symposium NN
on IN
the DT
Dead NNP
Sea NNP
Scrolls NNP
will MD
be VB
held VBN
at IN
the DT
Library NNP
of IN
Congress NNP
on IN
Wednesday NNP
, ,
April NNP
21 CD
, ,
and CC
Thursday NNP
, ,
April NNP
22 CD
. .

For our purposes, we're going to treat each document as one long sequence of words rather than a sequence of sequences (e.g. a list of sentences, each a list of words). We'll do this by passing the ssplit.isOneSentence option.


In [5]:
docs = []
labels = []
for doc, label in zip(newsgroups_train.data, newsgroups_train.target)[:100]:
    try:
        annotation = client.annotate(doc, properties={'annotators': 'tokenize,ssplit', 'ssplit.isOneSentence': True})
        docs.append([t['word'] for t in annotation['sentences'][0]['tokens']])
        labels.append(label)
    except Exception as e:
        pass  # we're going to punt and ignore unicode errors...
print len(docs), len(labels)


99 99
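
One of the 100 documents was dropped by the except clause above (the comment suggests a Unicode issue). Rather than discarding documents, one option, sketched below but not used in the rest of this tutorial, is to strip problematic characters before annotating:

safe_doc = doc.encode('ascii', 'ignore') if isinstance(doc, unicode) else doc
annotation = client.annotate(safe_doc, properties={'annotators': 'tokenize,ssplit', 'ssplit.isOneSentence': True})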

We'll create a lightweight dataset object out of this. A Dataset is really a glorified dictionary of fields, where each field corresponds to an attribute of the examples in the dataset.


In [6]:
from stanza.text.dataset import Dataset
dataset = Dataset({'X': docs, 'Y': labels})

# dataset supports, amongst other functionalities, shuffling:
dataset.shuffle()


Out[6]:
Dataset(Y, X)

In [7]:
# indexing of a single element
print dataset[0].keys()


['Y', 'X']

In [8]:
# indexing of multiple elements
n_train = int(0.7 * len(dataset))
train = Dataset(dataset[:n_train])
test = Dataset(dataset[n_train:])

print 'train: {}, test: {}'.format(len(train), len(test))


train: 69, test: 30

Creating a vocabulary and mapping to vector space

Stanza provides tools for building vocabularies (e.g. mapping words to indices and back). It also provides convenient ways to load pretrained embeddings such as Senna and GloVe.


In [9]:
from stanza.text.vocab import Vocab
vocab = Vocab('***UNK***')
vocab


Out[9]:
OrderedDict([('***UNK***', 0)])

We'll try our hand at some conversions:


In [13]:
sents = ['I like cats and dogs', 'I like nothing', 'I like cats and nothing else']
inds = []

# `vocab.update` adds the list of words to the Vocab object.
# It also returns the list of words as ints.
for s in sents[:2]:
    inds.append(vocab.update(s.split()))

# `vocab.words2indices` converts the list of words to ints (but does not update the vocab)
inds.append(vocab.words2indices(sents[2].split()))

for s, ind in zip(sents, inds):
    print '{:50}{}\nrecovered: {}'.format(s, ind, vocab.indices2words(ind))
    print


I like cats and dogs                              [1, 2, 3, 4, 5]
recovered: ['I', 'like', 'cats', 'and', 'dogs']

I like nothing                                    [1, 2, 6]
recovered: ['I', 'like', 'nothing']

I like cats and nothing else                      [1, 2, 3, 4, 6, 0]
recovered: ['I', 'like', 'cats', 'and', 'nothing', '***UNK***']

A common operation on vocabulary objects is to replace rare words with the UNKNOWN token. We'll replace words that occurred fewer than some number of times.


In [14]:
vocab.counts


Out[14]:
Counter({'***UNK***': 0,
         'I': 6,
         'and': 3,
         'cats': 3,
         'dogs': 3,
         'like': 6,
         'nothing': 3})

In [15]:
# this is actually a copy operation, because indices change when words are removed from the vocabulary
vocab = vocab.prune_rares(cutoff=6)
for s in sents:
    inds = vocab.words2indices(s.split())
    print vocab.indices2words(inds)


['I', 'like', '***UNK***', '***UNK***', '***UNK***']
['I', 'like', '***UNK***']
['I', 'like', '***UNK***', '***UNK***', '***UNK***', '***UNK***']

Now we'll convert the entire dataset. The convert function applies a transform to the specified field of the dataset; here the transform will use the vocabulary.


In [16]:
from stanza.text.vocab import SennaVocab
vocab = SennaVocab()

# we'll only use the first 200 tokens of each document
max_len = 200
train = train.convert({'X': lambda x: x[:max_len]}, in_place=True)
test = test.convert({'X': lambda x: x[:max_len]}, in_place=True)
    
# make a backup
train_orig = train
test_orig = test

# first pass: run the training text through vocab.update to populate word counts
# (this converted copy is discarded once the vocab has been pruned)
train = train_orig.convert({'X': vocab.update}, in_place=False)
vocab = vocab.prune_rares(cutoff=3)
# second pass: map words to indices using the pruned vocabulary
train = train_orig.convert({'X': vocab.words2indices}, in_place=False)
test = test_orig.convert({'X': vocab.words2indices}, in_place=False)
# add a padding token and remember its index
pad_index = vocab.add('***PAD***', count=100)

max_len = max([len(x) for x in train.fields['X'] + test.fields['X']])

print 'train: {}, test: {}'.format(len(train), len(test))
print 'vocab size: {}'.format(vocab)
print 'sequence max len: {}'.format(max_len)
print
print test[:2]


train: 69, test: 30
vocab size: Vocab(668 words)
sequence max len: 200

OrderedDict([('Y', [0, 1]), ('X', [[1, 2, 339, 4, 340, 341, 342, 7, 8, 2, 9, 2, 0, 0, 0, 80, 2, 346, 347, 12, 348, 39, 349, 14, 2, 0, 350, 2, 0, 320, 0, 0, 16, 2, 182, 69, 70, 41, 481, 96, 136, 578, 153, 39, 58, 126, 138, 355, 182, 0, 43, 38, 0, 124, 39, 111, 69, 0, 67, 37, 77, 49, 69, 70, 257, 269, 257, 12, 31, 182, 0, 138, 89, 151, 73, 0, 39, 28, 58, 37, 399, 596, 90, 43, 49, 17, 0, 194, 0, 0, 0, 0, 195, 37, 73, 0, 131, 31, 0, 39, 28, 31, 0, 17, 48, 73, 0, 12, 31, 0, 39, 0, 48, 0, 17, 0, 283, 33, 0, 245, 111, 0, 49, 479, 39, 51, 481, 0, 31, 0, 26, 226, 149, 147, 0, 12, 31, 0, 26, 17, 238, 0, 39, 69, 52, 425, 67, 133, 61, 0, 66, 0, 49, 17, 0, 39, 142, 323, 0, 0, 131, 95, 455, 39, 159, 67, 37, 231, 142, 17, 481, 149, 52, 0, 33, 590, 49, 104, 37, 0, 367, 148, 26, 10, 0, 0, 26, 0, 75, 89, 0, 0, 26, 107, 0, 0, 26, 10, 542, 217], [1, 2, 0, 4, 0, 486, 7, 8, 2, 0, 2, 124, 85, 61, 73, 639, 26, 4, 105, 9, 2, 409, 0, 0, 0, 26, 80, 2, 486, 0, 14, 2, 0, 0, 4, 0, 4, 0, 7, 0, 7, 16, 2, 17, 18, 0, 0, 39, 31, 0, 90, 31, 0, 12, 31, 0, 37, 17, 194, 0, 496, 0, 195, 375, 67, 37, 0, 0, 33, 0, 4, 0, 7, 28, 31, 17, 0, 0, 49, 479, 37, 31, 0, 0, 4, 0, 0, 0, 7, 0, 17, 0, 152, 75, 0, 31, 145, 639, 26, 69, 105, 151, 267, 0, 470, 0, 28, 0, 31, 0, 131, 231, 12, 31, 0, 105, 0, 39, 0, 28, 0, 49, 18, 31, 0, 194, 0, 90, 0, 195, 31, 0, 0, 133, 105, 194, 0, 496, 0, 49, 195, 524, 87, 0, 0, 67, 151, 31, 486, 0, 28, 0, 0, 33, 0, 155, 0, 90, 0, 49, 468, 0, 33, 31, 0, 36, 61, 67, 0, 0, 13, 12, 0, 0, 28, 31, 237, 2, 69, 76, 0, 461, 69, 500, 0, 131, 67, 0, 305, 31, 0, 48, 13, 12, 0, 0, 0, 31, 0, 107]])])
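
Note that the documents still have different lengths (up to 200 tokens). The training loop below pads each batch to a fixed length with Dataset.pad and the ***PAD*** index we added above; as a quick sanity check (this just reuses the same call made in the training loop):

padded = Dataset.pad(test.fields['X'][:2], pad_index, max_len)
print [len(x) for x in padded]  # each padded sequence should have length max_len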

Training a model

At this point, you're welcome to use whatever program/model/package you like to run your experiments. We'll go with TensorFlow. In particular, we'll define an LSTM classifier.

Model definition

We'll define a lookup table, an LSTM, and a linear classifier.


In [17]:
import tensorflow as tf    
from tensorflow.models.rnn import rnn    
from tensorflow.models.rnn.rnn_cell import LSTMCell
from stanza.ml.tensorflow_utils import labels_to_onehots
import numpy as np

np.random.seed(42)      
embedding_size = 50
hidden_size = 100
seq_len = max_len
vocab_size = len(vocab)
class_size = len(classes)

# symbolic variable for word indices
indices = tf.placeholder(tf.int32, [None, seq_len])
# symbolic variable for labels
labels = tf.placeholder(tf.float32, [None, class_size])

# lookup table
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    E = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="emb")
    embeddings = tf.nn.embedding_lookup(E, indices)
    # the old rnn.rnn API expects a list of per-timestep inputs, so split the
    # embeddings along the time dimension and squeeze out the singleton axis
    embeddings_list = [tf.squeeze(t, [1]) for t in tf.split(1, seq_len, embeddings)]

# rnn
cell = LSTMCell(hidden_size, embedding_size)  
outputs, states = rnn.rnn(cell, embeddings_list, dtype=tf.float32)
final_output = outputs[-1]

# classifier
def weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))
scores = tf.matmul(final_output, weights((hidden_size, class_size)))

# objective
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(scores, labels))

# operations
train_op = tf.train.AdamOptimizer(0.001, 0.9).minimize(cost)
predict_op = tf.argmax(scores, 1)
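
To get a feel for the model's size, we can tally the trainable variables; a quick sketch (tf.trainable_variables() lists the embedding matrix, the LSTM parameters and the classifier weights defined above):

n_params = 0
for v in tf.trainable_variables():
    shape = v.get_shape().as_list()
    print v.name, shape
    n_params += np.prod(shape)
print 'total trainable parameters: {}'.format(n_params)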

Training

We'll train the network for a fixed number of epochs and then evaluate on the test set. This is a relatively simple procedure without tuning, regularization, or early stopping.


In [18]:
from sklearn.metrics import accuracy_score
from time import time
batch_size = 128
num_epochs = 10

def run_epoch(split, train=False):
    epoch_cost = 0
    epoch_pred = []
    for i in xrange(0, len(split), batch_size):
        batch = split[i: i+batch_size]
        n = len(batch['Y'])
        X = Dataset.pad(batch['X'], pad_index, seq_len)
        # build a one-hot label matrix: row i has a 1 in column batch['Y'][i]
        Y = np.zeros((n, class_size))
        Y[np.arange(n), np.array(batch['Y'])] = 1
        if train:
            batch_cost, batch_pred, _ = session.run(
                [cost, predict_op, train_op], {indices: X, labels: Y})
        else:
            batch_cost, batch_pred = session.run(
                [cost, predict_op], {indices: X, labels: Y})
        epoch_cost += batch_cost * n
        epoch_pred += batch_pred.flatten().tolist()
    return epoch_cost, epoch_pred

def train_eval(session):
    for epoch in xrange(num_epochs):
        start = time()
        print 'epoch: {}'.format(epoch)
        epoch_cost, epoch_pred = run_epoch(train, True)
        print 'train cost: {}, acc: {}'.format(epoch_cost/len(train),
                                               accuracy_score(train.fields['Y'], epoch_pred))
        print 'time elapsed: {}'.format(time() - start)
    
    test_cost, test_pred = run_epoch(test, False)
    print '-' * 20
    print 'test cost: {}, acc: {}'.format(test_cost/len(test),
                                          accuracy_score(test.fields['Y'], test_pred))

with tf.Session() as session:
    tf.set_random_seed(123)
    session.run(tf.initialize_all_variables())
    train_eval(session)


epoch: 0
train cost: 0.693798243999, acc: 0.463768115942
time elapsed: 2.13884997368
epoch: 1
train cost: 0.689658164978, acc: 0.608695652174
time elapsed: 0.947463989258
epoch: 2
train cost: 0.685618042946, acc: 0.608695652174
time elapsed: 0.931604862213
epoch: 3
train cost: 0.681350648403, acc: 0.594202898551
time elapsed: 0.989146947861
epoch: 4
train cost: 0.676672458649, acc: 0.608695652174
time elapsed: 0.967782974243
epoch: 5
train cost: 0.6715965271, acc: 0.608695652174
time elapsed: 0.938482046127
epoch: 6
train cost: 0.666440963745, acc: 0.594202898551
time elapsed: 1.01319694519
epoch: 7
train cost: 0.661608576775, acc: 0.594202898551
time elapsed: 0.951257944107
epoch: 8
train cost: 0.656547665596, acc: 0.594202898551
time elapsed: 0.969254016876
epoch: 9
train cost: 0.64949887991, acc: 0.594202898551
time elapsed: 0.979185819626
--------------------
test cost: 0.700526297092, acc: 0.566666666667

Remember how we used SennaVocab? Let's see what happens if we pre-initialize our embeddings with the pretrained Senna vectors:


In [19]:
preinit_op = E.assign(vocab.get_embeddings())
with tf.Session() as session:
    tf.set_random_seed(123)
    session.run(tf.initialize_all_variables())
    session.run(preinit_op)
    train_eval(session)


epoch: 0
train cost: 0.691315352917, acc: 0.536231884058
time elapsed: 2.2313709259
epoch: 1
train cost: 0.685225009918, acc: 0.565217391304
time elapsed: 0.93435382843
epoch: 2
train cost: 0.679915368557, acc: 0.594202898551
time elapsed: 0.934975862503
epoch: 3
train cost: 0.675073981285, acc: 0.594202898551
time elapsed: 0.968421220779
epoch: 4
train cost: 0.670495569706, acc: 0.594202898551
time elapsed: 0.991052865982
epoch: 5
train cost: 0.666101515293, acc: 0.594202898551
time elapsed: 0.95667886734
epoch: 6
train cost: 0.661866605282, acc: 0.594202898551
time elapsed: 0.931576013565
epoch: 7
train cost: 0.657688558102, acc: 0.594202898551
time elapsed: 0.932205915451
epoch: 8
train cost: 0.653176009655, acc: 0.594202898551
time elapsed: 1.01794791222
epoch: 9
train cost: 0.647635579109, acc: 0.594202898551
time elapsed: 0.996000051498
--------------------
test cost: 0.698161303997, acc: 0.566666666667
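
To close the loop, here is a sketch of how you could classify a brand new document with the trained model. It assumes you run it inside one of the with tf.Session() blocks above, after train_eval(session), while the trained variables are still live; the example sentence is made up:

new_doc = 'Is there archaeological evidence for the events described in Exodus?'
ann = client.annotate(new_doc, properties={'annotators': 'tokenize,ssplit', 'ssplit.isOneSentence': True})
words = [t['word'] for t in ann['sentences'][0]['tokens']][:seq_len]
X = Dataset.pad([vocab.words2indices(words)], pad_index, seq_len)
pred = session.run(predict_op, {indices: X})
print classes[pred[0]]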