In [1]:
from setup import *
import sys
# if DATA_PATH not in sys.path: sys.path.append(DATA_PATH)
%matplotlib inline
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 200)
In [4]:
import csv  # stdlib module that provides the QUOTE_NONNUMERIC constant
tfdf = pd.read_csv(os.path.join(DATA_PATH, 'tweet_vocab.csv.gz'), index_col=0, compression='gzip',
                   quotechar='"', quoting=csv.QUOTE_NONNUMERIC, low_memory=False)
tfdf.describe().round().astype(int)
Out[4]:
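For orientation: `tf` presumably holds each token's total count across the corpus and `df` the number of documents (tweets) containing it. A minimal sketch of how such a table could be built, assuming a `tweets` iterable of raw strings (not part of this notebook's data loading):
In [ ]:
from collections import Counter
tf_counts, df_counts = Counter(), Counter()
for text in tweets:                      # `tweets` is an assumed iterable of raw tweet strings
    toks = text.lower().split()
    tf_counts.update(toks)               # total occurrences across the whole corpus
    df_counts.update(set(toks))          # each document counted at most once per token
tfdf_demo = pd.DataFrame({'tf': pd.Series(tf_counts), 'df': pd.Series(df_counts)}).sort_index()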
If you try to allocate a 16k-word by 100k-document DataFrame of 64-bit integers, that's 16,000 × 100,000 × 8 bytes ≈ 12.8 GB, enough to trigger a memory error on a 16 GB laptop.
Later we'll learn about "constant RAM" tools that can handle an unlimited stream of documents with a large (1M word) vocabulary. But first let's be frugal and see what we can do with robust, mature tools like Pandas.
Rather than cutting back on those 100k tweets, let's cut back on the words. What are all those 16k words, and how often is each one used? Maybe we can ignore the infrequent ones.
In [5]:
GB = 8 * (100 * 1000 * len(tfdf)) / 1.e9  # 8 bytes per int64 cell * 100k documents * vocabulary size
GB
Out[5]:
In [6]:
tfdf
Out[6]:
Fortunately the odd words are at the top and bottom of an alphabetical index!
And it does look like the less useful tokens aren't used many times or in many documents.
What do you notice that might help distinguish "natural" words (zoom, zoos, zope, zynga) from URLs and machine-code (000, zzp, zsl107)?
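One way to probe that question with code: purely alphabetic tokens tend to recur across many tweets, while tokens containing digits (hashes, slugs, shortened-URL fragments) rarely repeat. A rough sketch, assuming the token strings are in the DataFrame's index:
In [ ]:
alpha = tfdf[tfdf.index.str.match(r'^[a-z]+$', na=False)]   # "natural"-looking words
mixed = tfdf[tfdf.index.str.contains(r'\d', na=False)]      # tokens containing digits
alpha.df.median(), mixed.df.median()                        # compare typical document counts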
In [7]:
tfdf = tfdf[tfdf.df > 9]                                                 # drop tokens appearing in fewer than 10 tweets
tfdf = tfdf[(tfdf.df > 9) & (((tfdf.df - tfdf.tf) / tfdf.tf) < 0.15)]    # and keep df within 15% of tf
tfdf = tfdf[(tfdf.df > 20) & (((tfdf.df - tfdf.tf) / tfdf.tf) < 0.15)]   # tighten further: 21+ tweets
tfdf
Out[7]:
NumPy arrays (the guts of a Pandas DataFrame) require 8 bytes for each 64-bit value (int64), so the memory footprint scales directly with vocabulary size:
In [9]:
GB = 8 * (100 * 1000 * len(tfdf)) / 1.e9  # same estimate, now with the trimmed vocabulary
GB
Out[9]:
Memory requirements (~4 GB) are now doable.
But we've lost important words like "zoom", and there's still a bit of garbage like "zh3gs0wbno": these look like keys, slugs, hashes, or URLs.
Even though the tweets.json format includes a separate column for URLs, the URLs are also left within the raw text.
Let's extract them with a formal but simple grammar engine: regular expressions.
In [10]:
url_scheme_popular = r'(\b(' + '|'.join(uri_schemes_popular) + r')[:][/]{2})'
fqdn_popular = r'(\b[a-zA-Z0-9.-]+([.](' + r'|'.join(tld_popular) + r'))\b)'  # parenthesize the TLD alternation so [.] and \b apply to every TLD, not just the first
url_path = r'(\b[\w/?=+#_&%~\'"\\.,-]*\b)'  # hyphen moved to the end of the class so "#-_" isn't treated as a character range
pd.set_option('display.max_rows', 14)
pd.Series(uri_schemes_iana)
Out[10]:
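A quick sanity check of the domain pattern so far (a sketch; the exact tuples returned depend on the contents of `tld_popular` from `setup`):
In [ ]:
import re
re.findall(fqdn_popular, "See totalgood.com or follow the bit.ly links")
# re.findall returns one tuple of capture groups per match; the first element is the full domain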
In [11]:
import re
url_popular = r'(\b' + r'(http|https|svn|git|apt)[:][/]{2}' + fqdn_popular + url_path + r'\b)'
tweet = "Play the [positive sum game](http://totalgood.com/a/b?c=42) of life instead of svn://us.gov."
re.findall(url_popular, tweet)
Out[11]:
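Because the pattern nests several capture groups, `re.findall` returns a tuple of groups per match. To recover just the full URL text, one option (a sketch) is `re.finditer` with `group(0)`:
In [ ]:
[m.group(0) for m in re.finditer(url_popular, tweet)]  # group(0) is the entire matched span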
In [12]:
# email = re.compile(r'^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)')
fqdn = r'(\b[a-zA-Z0-9.-]+([.](' + r'|'.join(tld_iana) + r'))\b)'            # parenthesize the TLD alternation, as above
fqdn_popular = r'(\b[a-zA-Z0-9.-]+([.](' + r'|'.join(tld_popular) + r'))\b)'
username = r'(\b[a-zA-Z0-9.!#$%&*+/=?^_`{|}~-]+\b)'                          # hyphen at the end so "+-/" isn't a range
email = re.compile(r'(\b' + username + r'\b@\b' + fqdn + r'\b)')
email_popular = re.compile(r'(\b' + username + r'\b@\b' + fqdn_popular + r'\b)')
# TODO: unmatched surrounding symbols are accepted/consumed, likewise for multiple dots/ats
at = r'(([-@="_(\[{\|\s]+(at|At|AT)[-@="_)\]\}\|\s]+)|[@])'
dot = r'(([-.="_(\[{\|\s]+(dot|dt|Dot|DOT)[-.="_)\]\}\|\s]+)|[.])'
fqdn_obfuscated = r'(\b(([a-zA-Z0-9-]+' + dot + r'){1,7})(' + r'|'.join(tld_iana) + r')\b)'
fqdn_popular_obfuscated = r'(\b(([a-zA-Z0-9-]+' + dot + r'){1,7})(' + r'|'.join(tld_popular) + r')\b)'
username_obfuscated = r'(([a-zA-Z0-9!#$%&*+/?^`~]+' + dot + r'?){1,7})'
email_obfuscated = re.compile(r'(\b' + username_obfuscated + at + fqdn_obfuscated + r'\b)')
email_popular_obfuscated = re.compile(r'(\b' + username_obfuscated + at + fqdn_popular_obfuscated + r'\b)')
url_path = r'(\b[^\s]+)'  # redefine url_path permissively: any run of non-whitespace
url_scheme = r'(\b(' + '|'.join(uri_schemes_iana) + r')[:][/]{2})'
url_scheme_popular = r'(\b(' + '|'.join(uri_schemes_popular) + r')[:][/]{2})'
url = r'(\b' + url_scheme + fqdn + url_path + r'?\b)'
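To exercise the obfuscated-email patterns, a hedged usage sketch (whether a given string matches depends on the `tld_iana` list from `setup` and the spacing rules in the `at` and `dot` sub-patterns above):
In [ ]:
text = "Contact me: john dot doe AT example DOT com"
[m.group(0) for m in email_obfuscated.finditer(text)]  # already compiled, so call finditer directly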
In [ ]: