In [1]:
from setup import *
import sys
# if DATA_PATH not in sys.path: sys.path.append(DATA_PATH)
%matplotlib inline
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 200)



In [4]:
import csv  # stdlib csv constants instead of the pandas-internal pd.io.common.csv
tfdf = pd.read_csv(os.path.join(DATA_PATH, 'tweet_vocab.csv.gz'), index_col=0, compression='gzip',
                   quotechar='"', quoting=csv.QUOTE_NONNUMERIC, low_memory=False)
tfdf.describe().round().astype(int)
tfdf.describe().round().astype(int)


Out[4]:
          tf     df
count  16039  16039
mean      85     78
...      ...    ...
75%       31     31
max    41983  34885

8 rows × 2 columns

If you try to allocate a 16k-word by 100k-document DataFrame of 64-bit integers, you'll get a memory error on a 16 GB laptop.
Later we'll learn about "constant RAM" tools that can handle an unlimited stream of documents with a large (1M-word) vocabulary. But first let's be frugal and see what we can do with robust, mature tools like Pandas.
Rather than cutting back on those 100k tweets, let's cut back on the words. What are all those 16k words, and how often is each one used? Maybe we can ignore the infrequent ones.


In [5]:
GB = 8 * (100 * 1000 * len(tfdf)) / 1.e9  # 8 bytes per int64 * 100k tweets * 16k words
GB


Out[5]:
12.8312

In [6]:
tfdf


Out[6]:
              tf    df
0           1417  1395
0.0          355   354
...          ...   ...
zzp           14     9
zzrkwxgbqv    21    21

16039 rows × 2 columns

Fortunately the odd words sort to the top and bottom of an alphabetical index!
And it does look like the less useful tokens aren't used many times or in many documents.
What do you notice that might help distinguish "natural" words (zoom, zoos, zope, zynga) from URLs and machine codes (000, zzp, zsl107)?
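One rough heuristic (a sketch, not from the notebook): natural words tend to contain vowels and avoid digits and long consonant runs. Something like this separates the examples above:

```python
import re

def looks_natural(token):
    """Rough heuristic: natural English words contain vowels,
    few digits, and no long consonant runs."""
    if re.search(r'\d', token):             # digits suggest codes, hashes, URLs
        return False
    if not re.search(r'[aeiouy]', token):   # no vowels at all
        return False
    # runs of 4+ consonants are rare in English words
    if re.search(r'[bcdfghjklmnpqrstvwxz]{4,}', token):
        return False
    return True

tokens = ['zoom', 'zoos', 'zope', 'zynga', '000', 'zzp', 'zsl107', 'zzrkwxgbqv']
print([t for t in tokens if looks_natural(t)])  # keeps only the "natural" four
```

This is far from perfect (it would reject legitimate acronyms), but it is cheap and needs no training data.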


In [7]:
# Progressively stricter filters; each subsumes the previous assignment,
# so the final line (df > 20 plus the df/tf ratio test) is the one that binds.
tfdf = tfdf[tfdf.df > 9]
tfdf = tfdf[(tfdf.df > 9) & (((tfdf.df - tfdf.tf) / tfdf.tf) < 0.15)]
tfdf = tfdf[(tfdf.df > 20) & (((tfdf.df - tfdf.tf) / tfdf.tf) < 0.15)]
tfdf


Out[7]:
              tf    df
0           1417  1395
0.0          355   354
...          ...   ...
zy0nsstslv    27    27
zzrkwxgbqv    21    21

5391 rows × 2 columns


NumPy arrays (the guts of a Pandas DataFrame) require 8 bytes for each 64-bit integer (int64) value.

In [9]:
GB = 8 * (100 * 1000 * len(tfdf)) / 1.e9
GB


Out[9]:
4.3128
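A further frugality trick (an aside, not in the notebook): our counts max out at 41983, well within int32 (and even uint16) range, so downcasting from int64 halves or quarters the per-cell cost. A toy sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tfdf (made-up rows)
counts = pd.DataFrame({'tf': [1417, 355, 27, 21],
                       'df': [1395, 354, 27, 21]})
print(counts.memory_usage().sum())    # int64: 8 bytes per cell, plus the index
counts32 = counts.astype(np.int32)
print(counts32.memory_usage().sum())  # data columns now take half the bytes
```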

Memory requirements (about 4 GB) are now doable.
But we've lost important words like "zoom".
And there's still a bit of garbage, like "zh3gs0wbno".
These look like keys, slugs, hashes, or URLs.
Even though the tweets.json format includes a column for URLs, the URLs are left within the raw text as well.
Let's use a formal but simple grammar engine:

Extended regular expressions


In [10]:
# NB: in fqdn_popular the [.] binds only to the first TLD alternative, so later
# TLDs can match without a leading dot (visible in Out[11]: '.com' is captured
# with its dot, but 'gov' without)
url_scheme_popular = r'(\b(' + '|'.join(uri_schemes_popular) + r')[:][/]{2})'
fqdn_popular = r'(\b[a-zA-Z0-9-.]+\b([.]' + r'|'.join(tld_popular) + r'\b)\b)'
url_path = r'(\b[\w/?=+#-_&%~\'"\\.,]*\b)'

pd.set_option('display.max_rows', 14)
pd.Series(uri_schemes_iana)


Out[10]:
0      ms-secondary-screen-controller
1      ms-settings-connectabledevices
2       ms-settings-displays-topology
3        ms-settings-emailandaccounts
4         ms-settings-nfctransactions
5          ms-settings-screenrotation
6           ms-secondary-screen-setup
                    ...              
237                                aw
238                                gg
239                                go
240                                im
241                                ni
242                                tv
243                                ws
dtype: object

In [11]:
url_popular = r'(\b' + r'(http|https|svn|git|apt)[:]//' + fqdn_popular + url_path + r'\b)'
tweet = "Play the [positive sum game](http://totalgood.com/a/b?c=42) of life instead of svn://us.gov."
import re
re.findall(url_popular, tweet)


Out[11]:
[('http://totalgood.com/a/b?c=42',
  'http',
  'totalgood.com',
  '.com',
  '/a/b?c=42'),
 ('svn://us.gov', 'svn', 'us.gov', 'gov', '')]
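Note that `re.findall` returns one tuple per match, with one element per capture group. To recover just the full URL text, use `re.finditer` and take the outermost group. A self-contained sketch, with a small stand-in `tld_popular` list (the real list comes from the setup module):

```python
import re

# Stand-in for the tld_popular list loaded by setup.py (assumption)
tld_popular = ['com', 'org', 'net', 'gov']
fqdn_popular = r'(\b[a-zA-Z0-9-.]+\b([.]' + r'|'.join(tld_popular) + r'\b)\b)'
url_path = r'(\b[\w/?=+#-_&%~\'"\\.,]*\b)'
url_popular = r'(\b' + r'(http|https|svn|git|apt)[:]//' + fqdn_popular + url_path + r'\b)'

tweet = "See http://totalgood.com/a/b?c=42 and svn://us.gov."
# group(1) is the outermost capture group: the whole URL
urls = [m.group(1) for m in re.finditer(url_popular, tweet)]
print(urls)  # ['http://totalgood.com/a/b?c=42', 'svn://us.gov']
```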

In [12]:
# email = re.compile(r'^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)')
fqdn = r'(\b[a-zA-Z0-9-.]+([.]' + r'|'.join(tld_iana) + r')\b)'
fqdn_popular = r'(\b[a-zA-Z0-9-.]+\b([.]' + r'|'.join(tld_popular) + r'\b)\b)'
username = r'(\b[a-zA-Z0-9-.!#$%&*+-/=?^_`{|}~]+\b)'

email = re.compile(r'(\b' + username + r'\b@\b' + fqdn + r'\b)')
email_popular = re.compile(r'(\b' + username + r'\b@\b' + fqdn_popular + r'\b)')

# TODO: unmatched surrounding symbols are accepted/consumed, likewise for multiple dots/ats
at = r'(([-@="_(\[{\|\s]+(at|At|AT)[-@="_)\]\}\|\s]+)|[@])'
dot = r'(([-.="_(\[{\|\s]+(dot|dt|Dot|DOT)[-.="_)\]\}\|\s]+)|[.])'
fqdn_obfuscated = r'(\b(([a-zA-Z0-9-]+' + dot + r'){1,7})(' + r'|'.join(tld_iana) + r')\b)'
fqdn_popular_obfuscated = r'(\b(([a-zA-Z0-9-]+' + dot + r'){1,7})(' + r'|'.join(tld_popular) + r')\b)'
username_obfuscated = r'(([a-zA-Z0-9!#$%&*+/?^`~]+' + dot + r'?){1,7})'
email_obfuscated = re.compile(r'(\b' + username_obfuscated + at + fqdn_obfuscated + r'\b)')
email_popular_obfuscated = re.compile(r'(\b' + username_obfuscated + at + fqdn_popular_obfuscated + r'\b)')

url_path = r'(\b[^\s]+)'
url_scheme = r'(\b(' + '|'.join(uri_schemes_iana) + r')[:][/]{2})'
url_scheme_popular = r'(\b(' + '|'.join(uri_schemes_popular) + r')[:][/]{2})'
url = r'(\b' + url_scheme + fqdn + url_path + r'?\b)'
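As a quick sanity check of the `email` pattern, here is a self-contained sketch with a stand-in `tld_iana` list (the real list is loaded by the setup module):

```python
import re

# Stand-in for the tld_iana list from setup.py (assumption)
tld_iana = ['com', 'org', 'net']
username = r'(\b[a-zA-Z0-9-.!#$%&*+-/=?^_`{|}~]+\b)'
fqdn = r'(\b[a-zA-Z0-9-.]+([.]' + r'|'.join(tld_iana) + r')\b)'
email = re.compile(r'(\b' + username + r'\b@\b' + fqdn + r'\b)')

m = email.search('Contact hobs@totalgood.com for details.')
print(m.group(1))  # hobs@totalgood.com
```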
