Word Frequencies

Can we identify different types of text documents based on the frequency of their words? Can we identify different authors, styles, or disciplines like medical versus information technology?

We can start by counting the occurrences of words in a document. To do so, all words should be converted to one case (e.g. lower case), and all punctuation characters should be removed.

Our program reads a (plain) text file, isolates individual words, and computes their frequencies in the document.

The following steps outline the process:

  1. load text data
  2. clean up text, convert characters, and transform to a list of words
  3. count the occurrences of words
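
Steps 2 and 3 can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the program developed in this notebook: the function names clean_text and word_frequencies and the sample sentence are made up for this example, and only ASCII punctuation is handled.

```python
import string
from collections import Counter


def clean_text(text):
    """Lower-case the text and replace ASCII punctuation with spaces.

    Hypothetical helper for illustration only.
    """
    # Map every punctuation character to a space so that "sat,too" still
    # splits into two words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def word_frequencies(text):
    """Return a Counter mapping each word to its number of occurrences."""
    # split() without arguments splits on any run of whitespace.
    return Counter(clean_text(text).split())


# Made-up sample text, used only to demonstrate the calls.
sample = "The cat sat. The dog sat, too!"
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Replacing punctuation with spaces, rather than deleting it, keeps words apart when no blank follows a comma or period. The price is that contractions such as "don't" are split into two tokens, which may or may not be acceptable depending on the analysis.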

Load Text Data

The following shows how to load data from a website, the local file system, and the Hadoop Distributed File System (HDFS).

Pull text documents from the web

Instead of saving documents on the local file system, we can also load them directly from the web. The mechanism for loading from a URL is quite different from opening a local file. Fortunately, libraries like urllib make this operation fairly easy.
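
As a rough sketch of what loading from a URL can look like: urlopen returns a file-like object whose read() method yields bytes, which then have to be decoded into text. The helper name load_text and the UTF-8 fallback are assumptions made for this example. To keep the example runnable without network access, it uses a data: URL; an http:// or https:// address can be passed in the same way.

```python
from urllib.request import urlopen


def load_text(url, default_encoding="utf-8"):
    """Fetch url and return its body as a str.

    Hypothetical helper for illustration; the fallback encoding is an
    assumption, not something the server guarantees.
    """
    with urlopen(url) as response:
        raw = response.read()  # the body arrives as bytes
        # Use the charset announced in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset)


# A data: URL carries its content inline, so no network is needed here.
text = load_text("data:text/plain;charset=utf-8,Hello%20word%20frequencies")
print(text)
```

Unlike open() on a local file in text mode, the response is not decoded automatically, which is why the explicit decode step is needed.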


In [43]:
from urllib.request import urlopen
# from urllib.request import *

In [45]:
# To view the module's help text, import the subpackage itself so that the name urllib.request is bound.
import urllib.request
help(urllib.request)


Help on module urllib.request in urllib:

NAME
    urllib.request - An extensible library for opening URLs using a variety of protocols

DESCRIPTION
    The simplest way to use this module is to call the urlopen function,
    which accepts a string containing a URL or a Request object (described
    below).  It opens the URL and returns the results as file-like
    object; the returned object has some extra methods described below.
    
    The OpenerDirector manages a collection of Handler objects that do
    all the actual work.  Each Handler implements a particular protocol or
    option.  The OpenerDirector is a composite object that invokes the
    Handlers needed to open the requested URL.  For example, the
    HTTPHandler performs HTTP GET and POST requests and deals with
    non-error returns.  The HTTPRedirectHandler automatically deals with
    HTTP 301, 302, 303 and 307 redirect errors, and the HTTPDigestAuthHandler
    deals with digest authentication.
    
    urlopen(url, data=None) -- Basic usage is the same as original
    urllib.  pass the url and optionally data to post to an HTTP URL, and
    get a file-like object back.  One difference is that you can also pass
    a Request instance instead of URL.  Raises a URLError (subclass of
    OSError); for HTTP errors, raises an HTTPError, which can also be
    treated as a valid response.
    
    build_opener -- Function that creates a new OpenerDirector instance.
    Will install the default handlers.  Accepts one or more Handlers as
    arguments, either instances or Handler classes that it will
    instantiate.  If one of the argument is a subclass of the default
    handler, the argument will be installed instead of the default.
    
    install_opener -- Installs a new opener as the default opener.
    
    objects of interest:
    
    OpenerDirector -- Sets up the User Agent as the Python-urllib client and manages
    the Handler classes, while dealing with requests and responses.
    
    Request -- An object that encapsulates the state of a request.  The
    state can be as simple as the URL.  It can also include extra HTTP
    headers, e.g. a User-Agent.
    
    BaseHandler --
    
    internals:
    BaseHandler and parent
    _call_chain conventions
    
    Example usage:
    
    import urllib.request
    
    # set up authentication info
    authinfo = urllib.request.HTTPBasicAuthHandler()
    authinfo.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='geheim$parole')
    
    proxy_support = urllib.request.ProxyHandler({"http" : "http://ahad-haam:3128"})
    
    # build a new opener that adds authentication and caching FTP handlers
    opener = urllib.request.build_opener(proxy_support, authinfo,
                                         urllib.request.CacheFTPHandler)
    
    # install it
    urllib.request.install_opener(opener)
    
    f = urllib.request.urlopen('http://www.python.org/')

CLASSES
    builtins.object
        AbstractBasicAuthHandler
            HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)
            ProxyBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)
        AbstractDigestAuthHandler
        BaseHandler
            DataHandler
            FTPHandler
                CacheFTPHandler
            FileHandler
            HTTPCookieProcessor
            HTTPDefaultErrorHandler
            HTTPDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)
            HTTPErrorProcessor
            HTTPRedirectHandler
            ProxyDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)
            ProxyHandler
            UnknownHandler
        HTTPPasswordMgr
            HTTPPasswordMgrWithDefaultRealm
        OpenerDirector
        Request
        URLopener
            FancyURLopener
    AbstractHTTPHandler(BaseHandler)
        HTTPHandler
        HTTPSHandler
    
    class AbstractBasicAuthHandler(builtins.object)
     |  Methods defined here:
     |  
     |  __init__(self, password_mgr=None)
     |  
     |  http_error_auth_reqed(self, authreq, host, req, headers)
     |  
     |  retry_http_basic_auth(self, host, req, realm)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  rx = re.compile('(?:.*,)*[ \t]*([^ \t]+)[ \t]+realm=(["\']?)([^"\']*)\...
    
    class AbstractDigestAuthHandler(builtins.object)
     |  Methods defined here:
     |  
     |  __init__(self, passwd=None)
     |  
     |  get_algorithm_impls(self, algorithm)
     |  
     |  get_authorization(self, req, chal)
     |  
     |  get_cnonce(self, nonce)
     |  
     |  get_entity_digest(self, data, chal)
     |  
     |  http_error_auth_reqed(self, auth_header, host, req, headers)
     |  
     |  reset_retry_count(self)
     |  
     |  retry_http_digest_auth(self, req, auth)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class BaseHandler(builtins.object)
     |  Methods defined here:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  handler_order = 500
    
    class CacheFTPHandler(FTPHandler)
     |  Method resolution order:
     |      CacheFTPHandler
     |      FTPHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self)
     |      # XXX would be nice to have pluggable cache strategies
     |      # XXX this stuff is definitely not thread safe
     |  
     |  check_cache(self)
     |  
     |  clear_cache(self)
     |  
     |  connect_ftp(self, user, passwd, host, port, dirs, timeout)
     |  
     |  setMaxConns(self, m)
     |  
     |  setTimeout(self, t)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from FTPHandler:
     |  
     |  ftp_open(self, req)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class DataHandler(BaseHandler)
     |  Method resolution order:
     |      DataHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  data_open(self, req)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class FTPHandler(BaseHandler)
     |  Method resolution order:
     |      FTPHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  connect_ftp(self, user, passwd, host, port, dirs, timeout)
     |  
     |  ftp_open(self, req)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class FancyURLopener(URLopener)
     |  Derived class with handlers for errors we can handle (perhaps).
     |  
     |  Method resolution order:
     |      FancyURLopener
     |      URLopener
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, *args, **kwargs)
     |  
     |  get_user_passwd(self, host, realm, clear_cache=0)
     |  
     |  http_error_301(self, url, fp, errcode, errmsg, headers, data=None)
     |      Error 301 -- also relocated (permanently).
     |  
     |  http_error_302(self, url, fp, errcode, errmsg, headers, data=None)
     |      Error 302 -- relocated (temporarily).
     |  
     |  http_error_303(self, url, fp, errcode, errmsg, headers, data=None)
     |      Error 303 -- also relocated (essentially identical to 302).
     |  
     |  http_error_307(self, url, fp, errcode, errmsg, headers, data=None)
     |      Error 307 -- relocated, but turn POST into error.
     |  
     |  http_error_401(self, url, fp, errcode, errmsg, headers, data=None, retry=False)
     |      Error 401 -- authentication required.
     |      This function supports Basic authentication only.
     |  
     |  http_error_407(self, url, fp, errcode, errmsg, headers, data=None, retry=False)
     |      Error 407 -- proxy authentication required.
     |      This function supports Basic authentication only.
     |  
     |  http_error_default(self, url, fp, errcode, errmsg, headers)
     |      Default error handling -- don't raise an exception.
     |  
     |  prompt_user_passwd(self, host, realm)
     |      Override this in a GUI environment!
     |  
     |  redirect_internal(self, url, fp, errcode, errmsg, headers, data)
     |  
     |  retry_http_basic_auth(self, url, realm, data=None)
     |  
     |  retry_https_basic_auth(self, url, realm, data=None)
     |  
     |  retry_proxy_http_basic_auth(self, url, realm, data=None)
     |  
     |  retry_proxy_https_basic_auth(self, url, realm, data=None)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from URLopener:
     |  
     |  __del__(self)
     |  
     |  addheader(self, *args)
     |      Add a header to be used by the HTTP interface only
     |      e.g. u.addheader('Accept', 'sound/basic')
     |  
     |  cleanup(self)
     |  
     |  close(self)
     |  
     |  http_error(self, url, fp, errcode, errmsg, headers, data=None)
     |      Handle http errors.
     |      
     |      Derived class can override this, or provide specific handlers
     |      named http_error_DDD where DDD is the 3-digit error code.
     |  
     |  open(self, fullurl, data=None)
     |      Use URLopener().open(file) instead of open(file, 'r').
     |  
     |  open_data(self, url, data=None)
     |      Use "data" URL.
     |  
     |  open_file(self, url)
     |      Use local file or FTP depending on form of URL.
     |  
     |  open_ftp(self, url)
     |      Use FTP protocol.
     |  
     |  open_http(self, url, data=None)
     |      Use HTTP protocol.
     |  
     |  open_https(self, url, data=None)
     |      Use HTTPS protocol.
     |  
     |  open_local_file(self, url)
     |      Use local file.
     |  
     |  open_unknown(self, fullurl, data=None)
     |      Overridable interface to open unknown URL type.
     |  
     |  open_unknown_proxy(self, proxy, fullurl, data=None)
     |      Overridable interface to open unknown URL type.
     |  
     |  retrieve(self, url, filename=None, reporthook=None, data=None)
     |      retrieve(url) returns (filename, headers) for a local object
     |      or (tempfilename, headers) for a remote object.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from URLopener:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from URLopener:
     |  
     |  version = 'Python-urllib/3.4'
    
    class FileHandler(BaseHandler)
     |  Method resolution order:
     |      FileHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  file_open(self, req)
     |      # Use local file or FTP depending on form of URL
     |  
     |  get_names(self)
     |  
     |  open_local_file(self, req)
     |      # not entirely sure what the rules are here
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  names = None
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)
     |  Method resolution order:
     |      HTTPBasicAuthHandler
     |      AbstractBasicAuthHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_error_401(self, req, fp, code, msg, headers)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  auth_header = 'Authorization'
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from AbstractBasicAuthHandler:
     |  
     |  __init__(self, password_mgr=None)
     |  
     |  http_error_auth_reqed(self, authreq, host, req, headers)
     |  
     |  retry_http_basic_auth(self, host, req, realm)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from AbstractBasicAuthHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from AbstractBasicAuthHandler:
     |  
     |  rx = re.compile('(?:.*,)*[ \t]*([^ \t]+)[ \t]+realm=(["\']?)([^"\']*)\...
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class HTTPCookieProcessor(BaseHandler)
     |  Method resolution order:
     |      HTTPCookieProcessor
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, cookiejar=None)
     |  
     |  http_request(self, request)
     |  
     |  http_response(self, request, response)
     |  
     |  https_request = http_request(self, request)
     |  
     |  https_response = http_response(self, request, response)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class HTTPDefaultErrorHandler(BaseHandler)
     |  Method resolution order:
     |      HTTPDefaultErrorHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_error_default(self, req, fp, code, msg, hdrs)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class HTTPDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)
     |  An authentication protocol defined by RFC 2069
     |  
     |  Digest authentication improves on basic authentication because it
     |  does not transmit passwords in the clear.
     |  
     |  Method resolution order:
     |      HTTPDigestAuthHandler
     |      BaseHandler
     |      AbstractDigestAuthHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_error_401(self, req, fp, code, msg, headers)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  auth_header = 'Authorization'
     |  
     |  handler_order = 490
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from AbstractDigestAuthHandler:
     |  
     |  __init__(self, passwd=None)
     |  
     |  get_algorithm_impls(self, algorithm)
     |  
     |  get_authorization(self, req, chal)
     |  
     |  get_cnonce(self, nonce)
     |  
     |  get_entity_digest(self, data, chal)
     |  
     |  http_error_auth_reqed(self, auth_header, host, req, headers)
     |  
     |  reset_retry_count(self)
     |  
     |  retry_http_digest_auth(self, req, auth)
    
    class HTTPErrorProcessor(BaseHandler)
     |  Process HTTP error responses.
     |  
     |  Method resolution order:
     |      HTTPErrorProcessor
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_response(self, request, response)
     |  
     |  https_response = http_response(self, request, response)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  handler_order = 1000
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class HTTPHandler(AbstractHTTPHandler)
     |  Method resolution order:
     |      HTTPHandler
     |      AbstractHTTPHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_open(self, req)
     |  
     |  http_request = do_request_(self, request)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from AbstractHTTPHandler:
     |  
     |  __init__(self, debuglevel=0)
     |  
     |  do_open(self, http_class, req, **http_conn_args)
     |      Return an HTTPResponse object for the request, using http_class.
     |      
     |      http_class must implement the HTTPConnection API from http.client.
     |  
     |  do_request_(self, request)
     |  
     |  set_http_debuglevel(self, level)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class HTTPPasswordMgr(builtins.object)
     |  Methods defined here:
     |  
     |  __init__(self)
     |  
     |  add_password(self, realm, uri, user, passwd)
     |  
     |  find_user_password(self, realm, authuri)
     |  
     |  is_suburi(self, base, test)
     |      Check if test is below base in a URI tree
     |      
     |      Both args must be URIs in reduced form.
     |  
     |  reduce_uri(self, uri, default_port=True)
     |      Accept authority or URI and extract only the authority and path.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class HTTPPasswordMgrWithDefaultRealm(HTTPPasswordMgr)
     |  Method resolution order:
     |      HTTPPasswordMgrWithDefaultRealm
     |      HTTPPasswordMgr
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  find_user_password(self, realm, authuri)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from HTTPPasswordMgr:
     |  
     |  __init__(self)
     |  
     |  add_password(self, realm, uri, user, passwd)
     |  
     |  is_suburi(self, base, test)
     |      Check if test is below base in a URI tree
     |      
     |      Both args must be URIs in reduced form.
     |  
     |  reduce_uri(self, uri, default_port=True)
     |      Accept authority or URI and extract only the authority and path.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from HTTPPasswordMgr:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class HTTPRedirectHandler(BaseHandler)
     |  Method resolution order:
     |      HTTPRedirectHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_error_301 = http_error_302(self, req, fp, code, msg, headers)
     |  
     |  http_error_302(self, req, fp, code, msg, headers)
     |      # Implementation note: To avoid the server sending us into an
     |      # infinite loop, the request object needs to track what URLs we
     |      # have already seen.  Do this by adding a handler-specific
     |      # attribute to the Request object.
     |  
     |  http_error_303 = http_error_302(self, req, fp, code, msg, headers)
     |  
     |  http_error_307 = http_error_302(self, req, fp, code, msg, headers)
     |  
     |  redirect_request(self, req, fp, code, msg, headers, newurl)
     |      Return a Request or None in response to a redirect.
     |      
     |      This is called by the http_error_30x methods when a
     |      redirection response is received.  If a redirection should
     |      take place, return a new Request to allow http_error_30x to
     |      perform the redirect.  Otherwise, raise HTTPError if no-one
     |      else should try to handle this url.  Return None if you can't
     |      but another Handler might.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  inf_msg = 'The HTTP server returned a redirect error that w...n infini...
     |  
     |  max_redirections = 10
     |  
     |  max_repeats = 4
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class HTTPSHandler(AbstractHTTPHandler)
     |  Method resolution order:
     |      HTTPSHandler
     |      AbstractHTTPHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, debuglevel=0, context=None, check_hostname=None)
     |  
     |  https_open(self, req)
     |  
     |  https_request = do_request_(self, request)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from AbstractHTTPHandler:
     |  
     |  do_open(self, http_class, req, **http_conn_args)
     |      Return an HTTPResponse object for the request, using http_class.
     |      
     |      http_class must implement the HTTPConnection API from http.client.
     |  
     |  do_request_(self, request)
     |  
     |  set_http_debuglevel(self, level)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class OpenerDirector(builtins.object)
     |  Methods defined here:
     |  
     |  __init__(self)
     |  
     |  add_handler(self, handler)
     |  
     |  close(self)
     |  
     |  error(self, proto, *args)
     |  
     |  open(self, fullurl, data=None, timeout=<object object at 0x7f4cd4cc8130>)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class ProxyBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)
     |  Method resolution order:
     |      ProxyBasicAuthHandler
     |      AbstractBasicAuthHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_error_407(self, req, fp, code, msg, headers)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  auth_header = 'Proxy-authorization'
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from AbstractBasicAuthHandler:
     |  
     |  __init__(self, password_mgr=None)
     |  
     |  http_error_auth_reqed(self, authreq, host, req, headers)
     |  
     |  retry_http_basic_auth(self, host, req, realm)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from AbstractBasicAuthHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from AbstractBasicAuthHandler:
     |  
     |  rx = re.compile('(?:.*,)*[ \t]*([^ \t]+)[ \t]+realm=(["\']?)([^"\']*)\...
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500
    
    class ProxyDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)
     |  Method resolution order:
     |      ProxyDigestAuthHandler
     |      BaseHandler
     |      AbstractDigestAuthHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  http_error_407(self, req, fp, code, msg, headers)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  auth_header = 'Proxy-Authorization'
     |  
     |  handler_order = 490
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from AbstractDigestAuthHandler:
     |  
     |  __init__(self, passwd=None)
     |  
     |  get_algorithm_impls(self, algorithm)
     |  
     |  get_authorization(self, req, chal)
     |  
     |  get_cnonce(self, nonce)
     |  
     |  get_entity_digest(self, data, chal)
     |  
     |  http_error_auth_reqed(self, auth_header, host, req, headers)
     |  
     |  reset_retry_count(self)
     |  
     |  retry_http_digest_auth(self, req, auth)
    
    class ProxyHandler(BaseHandler)
     |  Method resolution order:
     |      ProxyHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, proxies=None)
     |  
     |  proxy_open(self, req, proxy, type)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  handler_order = 100
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class Request(builtins.object)
     |  Methods defined here:
     |  
     |  __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
     |  
     |  add_header(self, key, val)
     |  
     |  add_unredirected_header(self, key, val)
     |  
     |  get_full_url(self)
     |  
     |  get_header(self, header_name, default=None)
     |  
     |  get_method(self)
     |      Return a string indicating the HTTP request method.
     |  
     |  has_header(self, header_name)
     |  
     |  has_proxy(self)
     |  
     |  header_items(self)
     |  
     |  remove_header(self, header_name)
     |  
     |  set_proxy(self, host, type)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  data
     |  
     |  full_url
    
    class URLopener(builtins.object)
     |  Class to open URLs.
     |  This is a class rather than just a subroutine because we may need
     |  more than one set of global protocol-specific options.
     |  Note -- this is a base class for those who don't want the
     |  automatic handling of errors type 302 (relocated) and 401
     |  (authorization needed).
     |  
     |  Methods defined here:
     |  
     |  __del__(self)
     |  
     |  __init__(self, proxies=None, **x509)
     |      # Constructor
     |  
     |  addheader(self, *args)
     |      Add a header to be used by the HTTP interface only
     |      e.g. u.addheader('Accept', 'sound/basic')
     |  
     |  cleanup(self)
     |  
     |  close(self)
     |  
     |  http_error(self, url, fp, errcode, errmsg, headers, data=None)
     |      Handle http errors.
     |      
     |      Derived class can override this, or provide specific handlers
     |      named http_error_DDD where DDD is the 3-digit error code.
     |  
     |  http_error_default(self, url, fp, errcode, errmsg, headers)
     |      Default error handler: close the connection and raise OSError.
     |  
     |  open(self, fullurl, data=None)
     |      Use URLopener().open(file) instead of open(file, 'r').
     |  
     |  open_data(self, url, data=None)
     |      Use "data" URL.
     |  
     |  open_file(self, url)
     |      Use local file or FTP depending on form of URL.
     |  
     |  open_ftp(self, url)
     |      Use FTP protocol.
     |  
     |  open_http(self, url, data=None)
     |      Use HTTP protocol.
     |  
     |  open_https(self, url, data=None)
     |      Use HTTPS protocol.
     |  
     |  open_local_file(self, url)
     |      Use local file.
     |  
     |  open_unknown(self, fullurl, data=None)
     |      Overridable interface to open unknown URL type.
     |  
     |  open_unknown_proxy(self, proxy, fullurl, data=None)
     |      Overridable interface to open unknown URL type.
     |  
     |  retrieve(self, url, filename=None, reporthook=None, data=None)
     |      retrieve(url) returns (filename, headers) for a local object
     |      or (tempfilename, headers) for a remote object.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  version = 'Python-urllib/3.4'
    
    class UnknownHandler(BaseHandler)
     |  Method resolution order:
     |      UnknownHandler
     |      BaseHandler
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  unknown_open(self, req)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from BaseHandler:
     |  
     |  __lt__(self, other)
     |  
     |  add_parent(self, parent)
     |  
     |  close(self)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from BaseHandler:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes inherited from BaseHandler:
     |  
     |  handler_order = 500

FUNCTIONS
    build_opener(*handlers)
        Create an opener object from a list of handlers.
        
        The opener will use several default handlers, including support
        for HTTP, FTP and when applicable HTTPS.
        
        If any of the handlers passed as arguments are subclasses of the
        default handlers, the default handlers will not be used.
    
    getproxies = getproxies_environment()
        Return a dictionary of scheme -> proxy server URL mappings.
        
        Scan the environment for variables named <scheme>_proxy;
        this seems to be the standard convention.  If you need a
        different way, you can pass a proxies dictionary to the
        [Fancy]URLopener constructor.
    
    install_opener(opener)
    
    pathname2url(pathname)
        OS-specific conversion from a file system path to a relative URL
        of the 'file' scheme; not recommended for general use.
    
    url2pathname(pathname)
        OS-specific conversion from a relative URL of the 'file' scheme
        to a file system path; not recommended for general use.
    
    urlcleanup()
    
    urlopen(url, data=None, timeout=<object object at 0x7f4cd4cc8130>, *, cafile=None, capath=None, cadefault=False, context=None)
    
    urlretrieve(url, filename=None, reporthook=None, data=None)
        Retrieve a URL into a temporary location on disk.
        
        Requires a URL argument. If a filename is passed, it is used as
        the temporary file location. The reporthook argument should be
        a callable that accepts a block number, a read size, and the
        total file size of the URL target. The data argument should be
        valid URL encoded data.
        
        If a filename is passed and the URL points to a local resource,
        the result is a copy from local file to new file.
        
        Returns a tuple containing the path to the newly created
        data file as well as the resulting HTTPMessage object.

DATA
    __all__ = ['Request', 'OpenerDirector', 'BaseHandler', 'HTTPDefaultErr...

VERSION
    3.4

FILE
    /usr/lib64/python3.4/urllib/request.py



In [46]:
help(urlopen)


Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x7f4cd4cc8130>, *, cafile=None, capath=None, cadefault=False, context=None)

For example, load the collected works of Shakespeare and print a few lines. (The first 244 lines of this particular document are copyright information and should be skipped.)


In [12]:
with urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt') as src:
    txt = src.readlines()
    for t in txt[244:250]:
        print(t.decode())


1609



THE SONNETS



by William Shakespeare



Load everything at once:


In [73]:
data = urlopen('http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt').read().decode()
data[0:100]


Out[73]:
'This is the 100th Etext file presented by Project Gutenberg, and\nis presented in cooperation with Wo'

Note: read and readlines differ. read loads the entire content into a single bytes object, whereas readlines returns a list of lines, split at the new-line character(s), that we can iterate over. In both cases the bytes have to be decoded to obtain strings.
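
The difference can be tried out without a network connection. The following is a minimal sketch in which io.BytesIO stands in for the object returned by urlopen (an assumption made purely for illustration; the sample content is made up):

```python
import io

# io.BytesIO stands in for the response object returned by urlopen
# (an assumption, so that the example runs without network access).
src = io.BytesIO(b"first line\nsecond line\nthird line\n")

whole = src.read()          # one bytes object holding the entire content
src.seek(0)                 # rewind the in-memory buffer for a second pass
lines = src.readlines()     # list of bytes objects, one per line

text = whole.decode()       # bytes -> str
first = lines[0].decode()   # each line keeps its trailing new-line character

print(type(whole).__name__, len(whole))
print(len(lines), repr(first))
```

Both calls consume the stream, so with a real HTTP response only one of them can be used per request.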

Pull text from local files

Alternatively, we may just read from a local file.


In [73]:
with open('textfiles/shakespeare.txt', 'r') as src:
    txt = src.readlines()
    for t in txt[0:10]:
        print(t)       ## Note: we don't need to decode the string


1609



THE SONNETS



by William Shakespeare







                     1

  From fairest creatures we desire increase,

Read everything at once...


In [76]:
txt = open('textfiles/shakespeare.txt', 'r').read()
txt[0:100]


Out[76]:
'1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desi'

Pull text from Hadoop File System (HDFS)

We are usually interested in fairly big data sets, which we keep on the Hadoop File System. Hadoop and Spark can decompress gzip-compressed text files on the fly, so the files are stored in a compressed format (.gz).


In [79]:
import zlib
from hdfs import InsecureClient
client = InsecureClient('http://backend-0-0:50070')

In [80]:
with client.read('/user/pmolnar/data/20news/20news-bydate-test/talk.politics.mideast/77239.gz') as reader:
  txt = zlib.decompress(reader.read(), 16+zlib.MAX_WBITS).decode()
txt[0:100]


Out[80]:
'From: oaf@zurich.ai.mit.edu (Oded Feingold)\nSubject: Re: To All My Friends on T.P.M., I send Greetin'
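
The second argument of zlib.decompress deserves a remark: adding 16 to zlib.MAX_WBITS tells zlib to expect gzip-formatted data. A minimal sketch with a made-up sample string that is compressed in memory, so no HDFS access is needed:

```python
import gzip
import zlib

# Hypothetical sample text, compressed in memory to mimic the bytes of a .gz file.
original = "This is an outrage!  I don't even own a dog.\n"
gz_bytes = gzip.compress(original.encode())

# 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer.
via_zlib = zlib.decompress(gz_bytes, 16 + zlib.MAX_WBITS).decode()

# gzip.decompress handles the same format without the extra argument.
via_gzip = gzip.decompress(gz_bytes).decode()

print(via_zlib == original, via_gzip == original)
```

Either call should work for the bytes returned by reader.read(); which one to use is a matter of taste.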

In [81]:
txt.split('\n')


Out[81]:
['From: oaf@zurich.ai.mit.edu (Oded Feingold)',
 'Subject: Re: To All My Friends on T.P.M., I send Greetings',
 'Organization: M.I.T. Artificial Intelligence Lab.',
 'Lines: 1',
 'Reply-To: oaf@zurich.ai.mit.edu',
 'NNTP-Posting-Host: klosters.ai.mit.edu',
 "In-reply-to: szljubi@chip.ucdavis.edu's message of Thu, 6 May 1993 22:47:00 GMT",
 '',
 "This is an outrage!  I don't even own a dog.",
 '']

To read all text files in a directory, we first have to get the list of file names and then iterate through it.


In [83]:
dir_list = client.list('/user/pmolnar/data/20news/20news-bydate-test/talk.politics.mideast/')
dir_list[0:10]


Out[83]:
['76355.gz',
 '76366.gz',
 '76367.gz',
 '76368.gz',
 '76369.gz',
 '76370.gz',
 '76372.gz',
 '76373.gz',
 '76374.gz',
 '76375.gz']

In [87]:
text_docs = []
for f in dir_list:
    with client.read('/user/pmolnar/data/20news/20news-bydate-test/talk.politics.mideast/%s' % f) as reader:
        txt = zlib.decompress(reader.read(), 16+zlib.MAX_WBITS).decode()
        text_docs.append(txt)
print("Read %d text files." % len(text_docs))


Read 376 text files.
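
The same list-then-iterate pattern can be tried on the local file system. This is only a sketch under assumptions: the helper read_gz_dir, the file names, and their content are hypothetical, and gzip.open replaces the HDFS client.

```python
import gzip
import os
import tempfile

def read_gz_dir(path):
    """Return the decoded text of every .gz file in path, in sorted file-name order."""
    docs = []
    for name in sorted(os.listdir(path)):
        if not name.endswith('.gz'):
            continue                      # skip anything that is not a gzip file
        with gzip.open(os.path.join(path, name), 'rt', encoding='utf-8') as reader:
            docs.append(reader.read())
    return docs

# Build a small throw-away directory with hypothetical files to read back.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [('1.gz', 'first doc'), ('2.gz', 'second doc')]:
        with gzip.open(os.path.join(tmp, name), 'wt', encoding='utf-8') as out:
            out.write(body)
    local_docs = read_gz_dir(tmp)

print("Read %d text files." % len(local_docs))
```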

In [86]:
text_docs[1:3]


Out[86]:
['From: rj3s@Virginia.EDU ("Get thee to a nunnery.....")\nSubject: Re: Deir Yassin\nOrganization: University of Virginia\nLines: 65\n\nhm@cs.brown.edu  writes:\n> In article <martinb.735590895@brise.ERE.UMontreal.CA> aurag@ERE.UMontreal.CA (Aurag Hassan) writes:\n> \n>    Are you trying to say that there were no massacres in Deir Yassin\n>    or in Sabra and Shatila? If so then let me tell you some good jokes:\n> \n>    There is not and was not any such thing like jewish killing in WWII\n> \n>    Palestinians just did what Davidians did for fourty years and more.\n> \n>    In fact no one was killed in any war at any time or any place.\n> \n>    People die that is all. No one gets killed.\n> \n>    Maybe also vietamiese didn\'t die in Vietnam war killed by american\n>    napalm they were just pyromaniacs and that\'s all.\n> \n>    Maybe jews just liked gas chambers and no one forced them to get in there.they \n>    may be thought it was like snifing cocaine. No?\n> \n>    What do you think of this ? Isn\'t it stupid to say so?\n>    Well it is as stupid as what you said .Next time you want to lie do it\n>    intelligently.\n> \n>    Sincerely yours.\n> \n>    Hassan\n> \n> Arab civilians did die at Dir yassin. But there was no massacre. First\n> of all, the village housed many *armed* troops. Secondly, the Irgun\n> and Stern fighters had absolutely no intentions of killing civilians.\n> The village was attacked only for its military significance. In fact,\n> a warning was given to the occupants of the village to leave before\n> the attack was to begin.\n> \n> By all rational standards, Dir Yassin was not a massacre. The killing\n> was unintentional. The village housed Arab snipers and Arab troops.\n> Thus it was attacked for its military significance. It was not\n> attacked with intentions of killing any civilians.\n> \n> To even compare Dir Yassin, in which some 120 or so Arabs died, to the\n> Holocaust is absurd. The Irgun did not want to kill any civilians. 
The\n> village had almost 1000 inhabitants, most of whom survived.\n> \n> Harry.\nThis is such Bullshit.  Deir Yassin was an unprovoked attack on\nthe part of the Jews, and a massacre defines it best in my\nopinion.  The village of Deir Yassin had had a pact with the\nJews, a peace pact, but the Irgun purposely broke this\nagreement in order to scare off the Palestinians.  I might\ngrant that this village housed armed Arabs [I doubt it] but\nnothing in the archives and available literature indicates that\nthis was a motivating force amongst the Irgun.  The Deir Yassin\nMASSACRE was part of an over all strategy to intimidate the\nPalestinians to flee the Jewish Homeland.,...and contrary to\nyour belief, many civilians were killed.  Deir Yassin was later\nadvertized by the very Jews who perpetrated it because it was\nuseful in getting many Palestinians to leave.  The Palestinians\nwere rightfully scared off, because they did not want another\nDeir Yassin.  \n\tI\'m not necessarily condemning the Israelites here;\natrocities were aslo committed on the part of the Arabs.\nIsraelophiles should just be careful in thinking that they are\nand were the good guys in the middle east.  Both Arab and Jew\nsuck equally.\n',
 'From: ohayon@jcpltyo.JCPL.CO.JP (Tsiel Ohayon)\nSubject: Re: rejoinder. Questions to Israelis\nOrganization: James Capel Pacific Limited, Tokyo Japan\nLines: 31\n\nIn article <1993Apr26.211905.28317@freenet.carleton.ca> aa229@Freenet.carleton.ca (Steve Birnbaum) writes:\n\n[SB] Oh yeah, Israel was really ready to "expand its borders" on the holiest day\n[SB] of the year (Yom Kippur) when the Arabs attacked in 1973.  Oh wait, you\n[SB] chose to omit that war...perhaps because it 100% supports the exact \n[SB] OPPOSITE to the point you are trying to make?  I don\'t think that it\'s\n[SB] because it was the war that hit Israel the hardest.  Also, in 1967 it was\n[SB] Egypt, not Israel who kicked out the UN force.  In 1948 it was the Arabs\n[SB] who refused to accept the existance of Israel BASED ON THE BORDERS SET\n[SB] BY THE UNITED NATIONS.  In 1956, Egypt closed off the Red Sea to Israeli\n[SB] shipping, a clear antagonistic act.  And in 1982 the attack was a response\n[SB] to years of constant shelling by terrorist organizations from the Golan\n\t\t\t\t\t\t\t     ^^^^^^^^^^^^^^^^\n[SB] Heights. Children were being murdered all the time by terrorists and Israel\n^^^^^^^^^^^^\n[SB] finally retaliated.  Nowhere do I see a war that Israel started so that \n[SB] the borders could be expanded.\n\nI agree with all you write except that Terrorist orgs. were not shelling\nIsrael from the Golan Heights in 1982, but rather from Lebanon. The Golan\nHeights have been held by Israel since 1967, and therefore the PLO could\nnot have been shelling Israel from there, unless there is something I am\nnot aware of.\n\n\nTsiel\n-- \n----8<--------------------------------------------------------------->8------\nTsiel:ohayon@jcpl.co.jp\t   | If you do not receive this E-mail, please let me\nEmployer may not have same | know as soon as possible, if possible.\nopinions, if any !         | Two percent of zero is almost nothing.\n']

Clean up text

We need to know about some string operations

In particular how to change to lower case and replace special characters.


In [93]:
import string
help(string)


Help on module string:

NAME
    string - A collection of string constants.

DESCRIPTION
    Public module variables:
    
    whitespace -- a string containing all ASCII whitespace
    ascii_lowercase -- a string containing all ASCII lowercase letters
    ascii_uppercase -- a string containing all ASCII uppercase letters
    ascii_letters -- a string containing all ASCII letters
    digits -- a string containing all ASCII decimal digits
    hexdigits -- a string containing all ASCII hexadecimal digits
    octdigits -- a string containing all ASCII octal digits
    punctuation -- a string containing all ASCII punctuation characters
    printable -- a string containing all ASCII characters considered printable

CLASSES
    builtins.object
        Formatter
        Template
    
    class Formatter(builtins.object)
     |  Methods defined here:
     |  
     |  check_unused_args(self, used_args, args, kwargs)
     |  
     |  convert_field(self, value, conversion)
     |  
     |  format(self, format_string, *args, **kwargs)
     |  
     |  format_field(self, value, format_spec)
     |  
     |  get_field(self, field_name, args, kwargs)
     |      # given a field_name, find the object it references.
     |      #  field_name:   the field being looked up, e.g. "0.name"
     |      #                 or "lookup[3]"
     |      #  used_args:    a set of which args have been used
     |      #  args, kwargs: as passed in to vformat
     |  
     |  get_value(self, key, args, kwargs)
     |  
     |  parse(self, format_string)
     |      # returns an iterable that contains tuples of the form:
     |      # (literal_text, field_name, format_spec, conversion)
     |      # literal_text can be zero length
     |      # field_name can be None, in which case there's no
     |      #  object to format and output
     |      # if field_name is not None, it is looked up, formatted
     |      #  with format_spec and conversion and then used
     |  
     |  vformat(self, format_string, args, kwargs)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    
    class Template(builtins.object)
     |  A string class for supporting $-substitutions.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, template)
     |  
     |  safe_substitute(self, *args, **kws)
     |  
     |  substitute(self, *args, **kws)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  delimiter = '$'
     |  
     |  flags = 2
     |  
     |  idpattern = '[_a-z][_a-z0-9]*'
     |  
     |  pattern = re.compile('\n    \\$(?:\n      (?P<escaped>\\$)..._a-z][_a-...

FUNCTIONS
    capwords(s, sep=None)
        capwords(s [,sep]) -> string
        
        Split the argument into words using split, capitalize each
        word using capitalize, and join the capitalized words using
        join.  If the optional second argument sep is absent or None,
        runs of whitespace characters are replaced by a single space
        and leading and trailing whitespace are removed, otherwise
        sep is used to split and join the words.

DATA
    ascii_letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
    ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    digits = '0123456789'
    hexdigits = '0123456789abcdefABCDEF'
    octdigits = '01234567'
    printable = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTU...
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    whitespace = ' \t\n\r\x0b\x0c'

FILE
    /usr/lib64/python3.4/string.py



In [96]:
txt = open("textfiles/shakespeare.txt").read()
txt[0:100]


Out[96]:
'1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desi'

In [97]:
txt = txt.lower()

In [98]:
for c in '.;!\'" ':
    txt = txt.replace(c, '\n')
txt[0:100]


Out[98]:
'1609\n\nthe\nsonnets\n\nby\nwilliam\nshakespeare\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1\n\n\nfrom\nfairest\ncreatures\nwe\ndesi'

In [100]:
word_list = txt.split('\n')
word_list[0:10]


Out[100]:
['1609', '', 'the', 'sonnets', '', 'by', 'william', 'shakespeare', '', '']
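
The empty strings in the list come from runs of consecutive separators. As a possible alternative, here is a sketch that maps all ASCII punctuation to spaces with str.translate and then calls split() without an argument, which never returns empty strings. The sample text is a short excerpt chosen for illustration.

```python
import string

sample = "From fairest creatures we desire increase,\nThat thereby beauty's rose might never die!"

# Translation table that maps every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# lower() normalizes the case; split() without arguments splits on any run
# of whitespace and therefore produces no empty strings.
words = sample.lower().translate(table).split()
print(words[0:6])
```

Note that the apostrophe is treated as a separator here as well, so "beauty's" becomes the two words "beauty" and "s".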

Lists and Tuples

Review list operations, such as appending elements and concatenating lists. Python also provides tuples, which are quite useful for grouping values.


In [ ]:
help(list)

In [ ]:
help(tuple)

In [98]:
# Example
a = []
a.append('a')
a.append('z')
a += ['b', 'x', 'c']
a.sort()
a[0:2]


Out[98]:
['a', 'b']
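
Tuples are not shown in the example above, so here is a small sketch. It uses (word, count) pairs, the shape that the word counts will take later on.

```python
# A tuple groups a fixed number of values; here a (word, count) pair.
pair = ('the', 27531)

# Unpack the tuple into separate names.
word, count = pair

# Tuples are immutable: item assignment raises a TypeError.
try:
    pair[1] = 0
    changed = True
except TypeError:
    changed = False

# A list of tuples can be extended like any other list.
pairs = [pair]
pairs.append(('and', 26658))
pairs += [('i', 22430)]

print(word, count, changed, len(pairs))
```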

Dictionaries

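Dictionaries serve as associative arrays that bind keys to values. They can be used to keep track of the counts of the individual words. Retrieving a value by its key is fast (constant time on average); finding the keys that belong to a given value, however, requires scanning the whole dictionary and can be time-consuming.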


In [ ]:
help(dict)

In [19]:
f = { 'one': 1, 'two': 2}
f['a'] = 0

In [20]:
f


Out[20]:
{'a': 0, 'one': 1, 'two': 2}

In [22]:
f['one']


Out[22]:
1

In [23]:
f.keys()


Out[23]:
dict_keys(['one', 'two', 'a'])

In [24]:
f.values()


Out[24]:
dict_values([1, 2, 0])

In [25]:
Ω = 17

In [26]:
Ω


Out[26]:
17

In [55]:
'a' in f.keys()


Out[55]:
True

In [56]:
f['b']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-56-6202e5beb3a1> in <module>()
----> 1 f['b']

KeyError: 'b'
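
A KeyError can be avoided in several ways. The following sketch shows three common options from the standard library; the word list is a made-up sample.

```python
from collections import defaultdict

f = {'one': 1, 'two': 2, 'a': 0}

# Membership test on the dictionary itself; f.keys() is not needed.
has_b = 'b' in f

# get() returns a default value instead of raising a KeyError.
b_value = f.get('b', 0)

# defaultdict(int) creates missing entries with the value 0 on first access,
# which is convenient for counting.
counts = defaultdict(int)
for w in ['to', 'be', 'or', 'not', 'to', 'be']:
    counts[w] += 1

print(has_b, b_value, dict(counts))
```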

Sorting

Here's an example for sorting a list of tuples.


In [30]:
l2 = [3,4,1,45,7,234,123]
l2.sort()
l2


Out[30]:
[1, 3, 4, 7, 45, 123, 234]

In [35]:
l = [(3,'a'), (9, 'z'), (1, 'y'), (1, 'b'), (5, 'd'), (7, 'x')]
l


Out[35]:
[(3, 'a'), (9, 'z'), (1, 'y'), (1, 'b'), (5, 'd'), (7, 'x')]

In [37]:
def take_first(x):
    return x[0]

l.sort(key=take_first)
l


Out[37]:
[(1, 'b'), (1, 'y'), (3, 'a'), (5, 'd'), (7, 'x'), (9, 'z')]

In [92]:
l.sort(key=lambda x: x[0], reverse=True)
l


Out[92]:
[(9, 'z'), (7, 'x'), (5, 'd'), (3, 'a'), (1, 'b'), (1, 'y')]

In [87]:
sorted(l, key=lambda x: x[0], reverse=True)


Out[87]:
[(9, 'z'), (7, 'x'), (5, 'd'), (3, 'a'), (1, 'b'), (1, 'y')]

In [77]:
l


Out[77]:
[(9, 'z'), (7, 'x'), (5, 'd'), (3, 'a'), (1, 'b'), (1, 'y')]

In [41]:
l3 = [10, 110, 12, 1203]
l3.sort(key=lambda x: str(x))
l3


Out[41]:
[10, 110, 12, 1203]
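
A key function may also return a tuple, which sorts by several criteria at once. The sketch below orders made-up (word, count) pairs by descending count and breaks ties alphabetically.

```python
# Hypothetical (word, count) pairs, some with tied counts.
pairs = [('you', 3), ('and', 5), ('my', 3), ('the', 5), ('a', 1)]

# The key returns a tuple: the negated count first (so that larger counts
# come first), then the word itself to break ties alphabetically.
ranked = sorted(pairs, key=lambda p: (-p[1], p[0]))
print(ranked)
```

Negating the count only works for numeric values; for other types, two successive stable sorts are an alternative.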

In [82]:
help(sorted)


Help on built-in function sorted in module builtins:

sorted(iterable, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customise the sort order, and the
    reverse flag can be set to request the result in descending order.


In [ ]:
# For comparison, roughly the same word count as a shell pipeline (not executed in this notebook):
# curl http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt | tail -n +245 | tr 'A-Z' 'a-z'| tr ' .?:,;' '\n' | sort | uniq -c | sort -rn | more

In [1]:
txt = open('textfiles/shakespeare.txt', 'r').read()
txt[0:100]


Out[1]:
'1609\n\nTHE SONNETS\n\nby William Shakespeare\n\n\n\n                     1\n  From fairest creatures we desi'

In [22]:
txt2 = txt.replace(',', '\n').replace('.', '\n').replace('?', '\n').replace('!', '\n').replace('\'', '\n').replace('"', '\n').lower()
txt2[0:100]


Out[22]:
'1609\n\nthe sonnets\n\nby william shakespeare\n\n\n\n                     1\n  from fairest creatures we desi'

In [23]:
wordlist = txt2.split()

# Sort the words so that identical words end up next to each other.
wordlist.sort()
results = []
current_word = wordlist[0]
current_counter = 1
for w in wordlist[1:]:
    if w != current_word:
        # A new word starts: store the finished run and reset the counter.
        results.append((current_word, current_counter))
        current_word = w
        current_counter = 1
    else:
        current_counter += 1
# Store the last run, which the loop itself does not append.
results.append((current_word, current_counter))

In [25]:
results[0:10]


Out[25]:
[('&', 3),
 ('&c', 18),
 ('(1)', 218),
 ('(2)', 218),
 ('(a', 3),
 ('(alack', 1),
 ('(all', 4),
 ('(although', 2),
 ('(always', 1),
 ('(as', 17)]

In [27]:
results.sort(key=lambda x: x[1], reverse=True)

In [28]:
results[0:10]


Out[28]:
[('the', 27531),
 ('and', 26658),
 ('i', 22430),
 ('to', 18937),
 ('of', 18103),
 ('a', 14554),
 ('you', 13475),
 ('my', 12474),
 ('that', 11457),
 ('in', 11010)]
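
The sort-then-count loop can also be written with itertools.groupby from the standard library. This is a sketch on a made-up word list, not the code used to produce the output above.

```python
from itertools import groupby

wordlist = ['the', 'and', 'the', 'i', 'and', 'the']   # hypothetical sample

# groupby only merges adjacent equal items, so the list must be sorted first.
wordlist.sort()
results = [(word, len(list(group))) for word, group in groupby(wordlist)]

# Order by count, largest first.
results.sort(key=lambda x: x[1], reverse=True)
print(results)
```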

In [33]:
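wordlist = txt2.split()

# Count the words with a dictionary: word -> number of occurrences.
reshash = {}
for w in wordlist:
    if w in reshash:
        reshash[w] += 1
    else:
        reshash[w] = 1

# Turn the dictionary into a list of (word, count) tuples and sort by count.
results = list(reshash.items())
results.sort(key=lambda x: x[1], reverse=True)
results[0:10]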


Out[33]:
[('the', 27531),
 ('and', 26658),
 ('i', 22430),
 ('to', 18937),
 ('of', 18103),
 ('a', 14554),
 ('you', 13475),
 ('my', 12474),
 ('that', 11457),
 ('in', 11010)]
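
The standard library offers collections.Counter for exactly this kind of dictionary-based counting. The sketch below uses a made-up word list. It also computes relative frequencies, which is an addition for illustration: dividing by the total number of words makes documents of different lengths comparable.

```python
from collections import Counter

wordlist = ['the', 'and', 'the', 'i', 'and', 'the']   # hypothetical sample

counts = Counter(wordlist)
top = counts.most_common(2)        # list of (word, count), highest count first

# Relative frequency: count divided by the total number of words.
total = sum(counts.values())
rel_freq = {w: c / total for w, c in counts.items()}

print(top, rel_freq['the'])
```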

In [31]:



Out[31]:
[('misuse', 8),
 ('julia', 153),
 ('legacy', 5),
 ('unhand', 1),
 ('nine-', 1),
 ('long-ingraffed', 1),
 ('substances', 2),
 ('profound;', 1),
 ('austerely', 2),
 ('executed', 18)]
