In [1]:
from setup import *
import sys
# if DATA_PATH not in sys.path: sys.path.append(DATA_PATH)
%matplotlib inline
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 200)
In [4]:
import csv  # stdlib module that provides the QUOTE_NONNUMERIC constant
tfdf = pd.read_csv(os.path.join(DATA_PATH, 'tweet_vocab.csv.gz'), index_col=0, compression='gzip',
                   quotechar='"', quoting=csv.QUOTE_NONNUMERIC, low_memory=False)
tfdf.describe().round().astype(int)
Out[4]:
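For orientation: `tf` presumably holds each token's total count across the corpus and `df` the number of documents (tweets) containing it. A minimal sketch of how such a table could be built, assuming a `tweets` iterable of raw strings (not part of this notebook's data loading):
In [ ]:
from collections import Counter
tf_counts, df_counts = Counter(), Counter()
for text in tweets:                      # `tweets` is an assumed iterable of raw tweet strings
    toks = text.lower().split()
    tf_counts.update(toks)               # total occurrences across the whole corpus
    df_counts.update(set(toks))          # each document counted at most once per token
tfdf_demo = pd.DataFrame({'tf': pd.Series(tf_counts), 'df': pd.Series(df_counts)}).sort_index()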
If you try to allocate a 16k-word by 100k-document DataFrame of 64-bit integers, that's 16,000 × 100,000 × 8 bytes ≈ 12.8 GB, enough to trigger a memory error on a 16 GB laptop.
Later we'll learn about "constant RAM" tools that can handle an unlimited stream of documents with a large (1M word) vocabulary. But first let's be frugal and see what we can do with robust, mature tools like Pandas.
Rather than cutting back on those 100k tweets, let's cut back on the words. What are all those 16k words, and how often is each one used? Maybe we can ignore the infrequent ones.
In [5]:
GB = 8 * (100 * 1000 * len(tfdf)) / 1.e9  # 8 bytes per int64 cell * 100k documents * vocabulary size
GB
Out[5]:
In [6]:
tfdf
Out[6]:
Fortunately the odd words are at the top and bottom of an alphabetical index!
And it does look like the less useful tokens aren't used many times or in many documents.
What do you notice that might help distinguish "natural" words (zoom, zoos, zope, zynga) from URLs and machine-code (000, zzp, zsl107)?
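One way to probe that question with code: purely alphabetic tokens tend to recur across many tweets, while tokens containing digits (hashes, slugs, shortened-URL fragments) rarely repeat. A rough sketch, assuming the token strings are in the DataFrame's index:
In [ ]:
alpha = tfdf[tfdf.index.str.match(r'^[a-z]+$', na=False)]   # "natural"-looking words
mixed = tfdf[tfdf.index.str.contains(r'\d', na=False)]      # tokens containing digits
alpha.df.median(), mixed.df.median()                        # compare typical document counts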
In [7]:
tfdf = tfdf[tfdf.df > 9]                                                 # drop tokens appearing in fewer than 10 tweets
tfdf = tfdf[(tfdf.df > 9) & (((tfdf.df - tfdf.tf) / tfdf.tf) < 0.15)]    # and keep df within 15% of tf
tfdf = tfdf[(tfdf.df > 20) & (((tfdf.df - tfdf.tf) / tfdf.tf) < 0.15)]   # tighten further: 21+ tweets
tfdf
Out[7]:
NumPy arrays (the guts of a Pandas DataFrame) require 8 bytes for each 64-bit value (int64), so the memory footprint scales directly with vocabulary size:
In [9]:
GB = 8 * (100 * 1000 * len(tfdf)) / 1.e9  # same estimate, now with the trimmed vocabulary
GB
Out[9]:
Memory requirements (~4 GB) are now doable.
But we've lost important words like "zoom", and there's still a bit of garbage like "zh3gs0wbno": these look like keys, slugs, hashes, or URLs.
Even though the tweets.json format includes a separate column for URLs, the URLs are also left within the raw text.
Let's extract them with a formal but simple grammar engine: regular expressions.
In [10]:
url_scheme_popular = r'(\b(' + '|'.join(uri_schemes_popular) + r')[:][/]{2})'
fqdn_popular = r'(\b[a-zA-Z0-9.-]+([.](' + r'|'.join(tld_popular) + r'))\b)'  # parenthesize the TLD alternation so [.] and \b apply to every TLD, not just the first
url_path = r'(\b[\w/?=+#_&%~\'"\\.,-]*\b)'  # hyphen moved to the end of the class so "#-_" isn't treated as a character range
pd.set_option('display.max_rows', 14)
pd.Series(uri_schemes_iana)
Out[10]:
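A quick sanity check of the domain pattern so far (a sketch; the exact tuples returned depend on the contents of `tld_popular` from `setup`):
In [ ]:
import re
re.findall(fqdn_popular, "See totalgood.com or follow the bit.ly links")
# re.findall returns one tuple of capture groups per match; the first element is the full domain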
In [11]:
import re
url_popular = r'(\b' + r'(http|https|svn|git|apt)[:][/]{2}' + fqdn_popular + url_path + r'\b)'
tweet = "Play the [positive sum game](http://totalgood.com/a/b?c=42) of life instead of svn://us.gov."
re.findall(url_popular, tweet)
Out[11]:
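Because the pattern nests several capture groups, `re.findall` returns a tuple of groups per match. To recover just the full URL text, one option (a sketch) is `re.finditer` with `group(0)`:
In [ ]:
[m.group(0) for m in re.finditer(url_popular, tweet)]  # group(0) is the entire matched span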
In [12]:
# email = re.compile(r'^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)')
fqdn = r'(\b[a-zA-Z0-9.-]+([.](' + r'|'.join(tld_iana) + r'))\b)'            # parenthesize the TLD alternation, as above
fqdn_popular = r'(\b[a-zA-Z0-9.-]+([.](' + r'|'.join(tld_popular) + r'))\b)'
username = r'(\b[a-zA-Z0-9.!#$%&*+/=?^_`{|}~-]+\b)'                          # hyphen at the end so "+-/" isn't a range
email = re.compile(r'(\b' + username + r'\b@\b' + fqdn + r'\b)')
email_popular = re.compile(r'(\b' + username + r'\b@\b' + fqdn_popular + r'\b)')
# TODO: unmatched surrounding symbols are accepted/consumed, likewise for multiple dots/ats
at = r'(([-@="_(\[{\|\s]+(at|At|AT)[-@="_)\]\}\|\s]+)|[@])'
dot = r'(([-.="_(\[{\|\s]+(dot|dt|Dot|DOT)[-.="_)\]\}\|\s]+)|[.])'
fqdn_obfuscated = r'(\b(([a-zA-Z0-9-]+' + dot + r'){1,7})(' + r'|'.join(tld_iana) + r')\b)'
fqdn_popular_obfuscated = r'(\b(([a-zA-Z0-9-]+' + dot + r'){1,7})(' + r'|'.join(tld_popular) + r')\b)'
username_obfuscated = r'(([a-zA-Z0-9!#$%&*+/?^`~]+' + dot + r'?){1,7})'
email_obfuscated = re.compile(r'(\b' + username_obfuscated + at + fqdn_obfuscated + r'\b)')
email_popular_obfuscated = re.compile(r'(\b' + username_obfuscated + at + fqdn_popular_obfuscated + r'\b)')
url_path = r'(\b[^\s]+)'  # redefine url_path permissively: any run of non-whitespace
url_scheme = r'(\b(' + '|'.join(uri_schemes_iana) + r')[:][/]{2})'
url_scheme_popular = r'(\b(' + '|'.join(uri_schemes_popular) + r')[:][/]{2})'
url = r'(\b' + url_scheme + fqdn + url_path + r'?\b)'
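To exercise the obfuscated-email patterns, a hedged usage sketch (whether a given string matches depends on the `tld_iana` list from `setup` and the spacing rules in the `at` and `dot` sub-patterns above):
In [ ]:
text = "Contact me: john dot doe AT example DOT com"
[m.group(0) for m in email_obfuscated.finditer(text)]  # already compiled, so call finditer directly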
In [ ]: