First, let's make sure Python is installed. If you receive no errors in this section, your computer is ready to run the materials for this course! If you do receive an error, read the message closely, since it offers clues to resolving the issue.
Jupyter notebooks consist of cells. The cell we're reading now is formatted as a "Markdown" cell, and is used for text.
To run Python code we need cells that are formatted as "Code" cells. A code cell has an "In [ ]" to the left of the cell. Running code in a Jupyter notebook is relatively easy. Click on the cell you wish to run (a segment of code with a gray background) to highlight it. Then either click the "Play" button in the toolbar above the code window or press CTRL+RETURN on your keyboard.
Here is a quick check to make sure that we are running Python 3. If the number "2" is printed below, it means you're running Python 2. Because we're using a shared server, everyone should see Python 3. If you install Python on your own machine, you can run this cell to see which version you're running.
In [ ]:
import sys
sys.version_info.major
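If you would like more detail than the major version number, the following is a small sketch (not part of the original course materials) that prints the full version and stops with a readable message when the interpreter is older than Python 3. The helper name `describe_python` is made up for illustration.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return a 'major.minor.micro' string for the given version tuple."""
    return "{}.{}.{}".format(version_info[0], version_info[1], version_info[2])


# Fail early, with a readable message, if this is not Python 3 or later.
if sys.version_info.major < 3:
    raise RuntimeError("This course needs Python 3, found " + describe_python())

print("Running Python", describe_python())
```

Raising an error here is a design choice: it makes a version mismatch obvious right away instead of surfacing later as a confusing syntax error.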
In [ ]:
import os
import string
import numpy
import matplotlib
import pandas
import sklearn
import scipy
import nltk
print("Success!")
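The cell above stops at the first import that fails, so it only reports one missing package at a time. As a hedged alternative, the sketch below tries each package separately and reports which ones are available. The function name `check_packages` is hypothetical, and the `__version__` attribute it reads is a common convention rather than a guarantee, so the code falls back to a placeholder when it is absent.

```python
import importlib


def check_packages(names):
    """Try to import each package; map its name to a version string or None."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None  # package is not installed
        else:
            # Most packages expose __version__, but it is only a convention.
            results[name] = getattr(module, "__version__", "unknown version")
    return results


course_packages = ["numpy", "matplotlib", "pandas", "sklearn", "scipy", "nltk"]
for name, version in check_packages(course_packages).items():
    print(name, "->", version if version is not None else "NOT INSTALLED")
```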
In [ ]:
%pylab inline
In order to fully use the NLTK package for Natural Language Processing, we need to download several language models and data sets that give Python extra instructions. For example, the 'punkt' model below tells Python how to break strings of text into individual words or sentences. Running this cell requires a stable internet connection and perhaps a little patience. If it completes successfully, it will print the word "True" at the bottom.
In [ ]:
nltk_data = ["punkt", "words", "stopwords", "averaged_perceptron_tagger", "maxent_ne_chunker", "wordnet"]
nltk.download(nltk_data)
As a quick opening toy example to see Python in action, let's find the present participles used in Jane Austen's Pride and Prejudice, approximated here as every word that ends in "ing". A plain text file containing the book is included with the course materials, and the code below reads it from the ../Data folder. Part of the reason why people use Python to work on human-language texts (natural language processing) is that it makes tasks like this relatively simple.
In [ ]:
# every line that starts with a hash is a comment
# the computer ignores these lines, they are meant to address a human reader
# here is some starter code to make sure everything is set up (don't worry about understanding everything here)
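# open the file with a "with" block so that it is closed automatically when we are done
with open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8') as book:
    for line in book:
        for word in line.split():
            if word.endswith('ing'):
                print(word)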
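The starter code above misses words that are followed by punctuation (for example "walking,") and prints every repeat. As a rough sketch of one possible refinement, the code below strips surrounding punctuation and counts each word instead. The function name `ing_words` and the sample sentence are invented for illustration; the sentence is not a quotation from the novel.

```python
import string
from collections import Counter


def ing_words(text):
    """Count lower-cased words ending in 'ing', ignoring surrounding punctuation."""
    counts = Counter()
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word.endswith("ing"):
            counts[word] += 1
    return counts


# A made-up sample sentence, not a quotation from the novel.
sample = "She was walking, talking and laughing; walking was her favourite thing."
print(ing_words(sample).most_common())
```

Note that "thing" is counted too: matching on the "ing" ending is only an approximation of finding present participles, and a part-of-speech tagger would be needed to tell the two apart.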
If you double click on a markdown cell you will see the syntax behind the cell, and you can modify the cell.
Here are some basic formatting tags you might use in markdown. This is borrowed from here, which has more formatting tips. Google is also your friend, if you have a particular question about formatting in Markdown.
And another item.
You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we'll use three here to also align the raw Markdown).
To have a line break without a paragraph, you will need to use two trailing spaces.
Note that this line is separate, but within the same paragraph.
Blockquotes are very handy if you are including longer quotes. This line is part of the same quote.
Quote break.
This is a very long line that will still be quoted properly when it wraps. Oh boy let's keep writing to make sure this is long enough to actually wrap for everyone. Oh, you can put Markdown into a blockquote.