Pattern

Pattern is a web mining module for the Python programming language.

It has tools for data mining (Google, Twitter and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.


In [3]:
!pip install Pattern3


Collecting Pattern3
  Downloading pattern3-3.0.0.tar.gz (23.7MB)
Collecting beautifulsoup4 (from Pattern3)
  Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
Collecting cherrypy (from Pattern3)
  Downloading CherryPy-11.0.0-py2.py3-none-any.whl (435kB)
Collecting docx (from Pattern3)
  Downloading docx-0.2.4.tar.gz (54kB)
Collecting feedparser (from Pattern3)
  Downloading feedparser-5.2.1.zip (1.2MB)
Collecting pdfminer3k (from Pattern3)
  Downloading pdfminer3k-1.3.1.tar.gz (4.1MB)
Collecting simplejson (from Pattern3)
  Downloading simplejson-3.11.1.tar.gz (78kB)
Collecting pdfminer.six (from Pattern3)
  Downloading pdfminer.six-20170720.tar.gz (12.0MB)
Collecting cheroot>=5.2.0 (from cherrypy->Pattern3)
  Downloading cheroot-5.8.3-py2.py3-none-any.whl (65kB)
Requirement already satisfied: six in /Users/sampathm/miniconda3/lib/python3.6/site-packages (from cherrypy->Pattern3)
Collecting portend>=2.1.1 (from cherrypy->Pattern3)
  Downloading portend-2.1.2-py2.py3-none-any.whl
Collecting lxml (from docx->Pattern3)
  Downloading lxml-4.0.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
Requirement already satisfied: Pillow>=2.0 in /Users/sampathm/miniconda3/lib/python3.6/site-packages (from docx->Pattern3)
Collecting pytest>=2.0 (from pdfminer3k->Pattern3)
  Downloading pytest-3.2.2-py2.py3-none-any.whl (187kB)
Requirement already satisfied: ply>=3.4 in /Users/sampathm/miniconda3/lib/python3.6/site-packages (from pdfminer3k->Pattern3)
Collecting pycryptodome (from pdfminer.six->Pattern3)
  Downloading pycryptodome-3.4.7.tar.gz (6.5MB)
Collecting chardet (from pdfminer.six->Pattern3)
  Using cached chardet-3.0.4-py2.py3-none-any.whl
Collecting tempora>=1.8 (from portend>=2.1.1->cherrypy->Pattern3)
  Downloading tempora-1.9-py2.py3-none-any.whl
Requirement already satisfied: olefile in /Users/sampathm/miniconda3/lib/python3.6/site-packages (from Pillow>=2.0->docx->Pattern3)
Collecting py>=1.4.33 (from pytest>=2.0->pdfminer3k->Pattern3)
  Downloading py-1.4.34-py2.py3-none-any.whl (84kB)
Requirement already satisfied: setuptools in /Users/sampathm/miniconda3/lib/python3.6/site-packages (from pytest>=2.0->pdfminer3k->Pattern3)
Requirement already satisfied: pytz in /Users/sampathm/miniconda3/lib/python3.6/site-packages (from tempora>=1.8->portend>=2.1.1->cherrypy->Pattern3)
Building wheels for collected packages: Pattern3, docx, feedparser, pdfminer3k, simplejson, pdfminer.six, pycryptodome
  Running setup.py bdist_wheel for Pattern3 ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/f3/11/2a/96e925779b6f0e9323b7a28b020dedc76afc3315f6bdb46898
  Running setup.py bdist_wheel for docx ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/43/43/f7/ae02727f01b27dd92d5ba84982cfd8da9484b7179e263253a0
  Running setup.py bdist_wheel for feedparser ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/15/ce/10/b500f745822ea6db6ea8ed225c06b15c000d71016b89ef9037
  Running setup.py bdist_wheel for pdfminer3k ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/cd/84/67/3eb20c984d51d38db1ca65ecba0d866407f46d8b3a9e72f7b2
  Running setup.py bdist_wheel for simplejson ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/bf/0a/27/5d5e337ed16a175fd483a8c1486b4343ea2632be7ac57bad5d
  Running setup.py bdist_wheel for pdfminer.six ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/92/af/bf/158b037892b25aa7768ce93127397530910704ad8e2f15f67f
  Running setup.py bdist_wheel for pycryptodome ... done
  Stored in directory: /Users/sampathm/Library/Caches/pip/wheels/f7/cb/16/1ed6dc50b92888af9051bda6f59f335dfd966a1ce5edd8a4af
Successfully built Pattern3 docx feedparser pdfminer3k simplejson pdfminer.six pycryptodome
Installing collected packages: beautifulsoup4, cheroot, tempora, portend, cherrypy, lxml, docx, feedparser, py, pytest, pdfminer3k, simplejson, pycryptodome, chardet, pdfminer.six, Pattern3
Successfully installed Pattern3-3.0.0 beautifulsoup4-4.6.0 chardet-3.0.4 cheroot-5.8.3 cherrypy-11.0.0 docx-0.2.4 feedparser-5.2.1 lxml-4.0.0 pdfminer.six-20170720 pdfminer3k-1.3.1 portend-2.1.2 py-1.4.34 pycryptodome-3.4.7 pytest-3.2.2 simplejson-3.11.1 tempora-1.9
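
Before importing anything, it can help to confirm that the freshly installed package is visible to the notebook's interpreter. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and the module name `pattern3` is taken from the import used later in this notebook.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path.

    Hypothetical helper for illustration; it only locates the module and
    does not import it, so import-time errors are not detected here.
    """
    return importlib.util.find_spec(module_name) is not None


# `pattern3` is the module name assumed from the import cell in this notebook.
print("pattern3 available:", is_installed("pattern3"))
```

If this prints `False` right after a successful install, the notebook kernel is probably running a different Python environment than the one `pip` installed into.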

In [8]:
from pattern3.en import tokenize

In [9]:
word_tokenize_test = "this's pattern word tokenize"
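
To make the idea of word tokenization concrete, here is a rough sketch that splits the test string with a regular expression. It is a standard-library stand-in, not Pattern's tokenizer: the function name `simple_word_tokenize` and the regular expression are assumptions made for illustration, and `tokenize` from `pattern3.en` may split the same input differently.

```python
import re

# Illustrative pattern (an assumption, not Pattern's rules):
#   '\w+      an apostrophe followed by letters, e.g. the clitic 's
#   \w+       a run of word characters
#   [^\w\s]   any single punctuation character
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN_RE.findall(text)


word_tokenize_test = "this's pattern word tokenize"
print(simple_word_tokenize(word_tokenize_test))
# ['this', "'s", 'pattern', 'word', 'tokenize']
```

The sketch separates the `'s` from `this` so that the clitic becomes its own token, which is the kind of case the test string appears designed to exercise.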
