Introduction to Python for Natural Language Processing

Adapted from a lesson by Teddy Roland

Lesson Goals

  • Learn the basics of the Python programming language
  • Start to build an intuition about how computers may help us analyze text

Outline

Key Terms

  • coding or programming:
    • The purpose of programming is to find a sequence of instructions that will enable a computer to perform a specific task or solve a given problem. It involves writing those instructions in a specific programming language, in our case, Python.
  • script:
    • A block of executable code, typically saved in an executable file, for example script1.py.
  • packages and modules:
    • Python files, or collections of files, that implement a set of pre-made functions (so we don't have to write all of the functions ourselves). To use a module, we load it with the import statement.
  • parse:
    • the process of analysing a string of symbols, in this case the symbols that make up natural language. This can also include understanding, or parsing, computer code.
  • variable:
    • A variable is something that holds a value that may change. In simplest terms, a variable is just a box that you can put stuff in. You can use variables to store all kinds of stuff, including numbers and letters.
  • assigning a variable:
    • telling Python what you want to name the variable, and what is stored in the variable.
  • string:
    • a type of variable that consists of a sequence of characters in a particular order. Characters can be anything, including letters or numbers. The order of a string is fixed.
  • list:
    • a type of variable that consists of a sequence of elements. The order is fixed.
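
Several of the terms above can be seen together in one short sketch. The variable names, the example title, and the choice of the standard-library 'string' module are illustrative assumptions, not part of the original lesson.

```python
# Import a module: 'string' ships with Python's standard library
import string

# Assign a variable: the name 'title' now refers to a string
title = "Moby Dick"

# A list is an ordered sequence of elements (here, two strings)
words = ["Moby", "Dick"]

# Use something the module provides: a ready-made string of punctuation marks
has_punctuation = any(char in string.punctuation for char in title)

print(title, words, has_punctuation)
```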

Basic Types and Operations

Arithmetic


In [ ]:
# Addition

2+5

In [ ]:
# Let's have Python report the results from three operations at the same time

print(2-5)
print(2*5)
print(2/5)

In [ ]:
# If we have all of our operations in the last line of the cell, Jupyter will print them together

2-5, 2*5, 2/5

In [ ]:
# And let's compare values

2>5

Variable assignment

Assigning variables is something that we do all the time in programming. These aren't quite like the variables from high school algebra, where x represents an unknown to solve for. Instead these are like notes to ourselves that we want to save some value(s) for later use.

Note that the equals sign is directional, like an arrow, telling the computer to give a certain value to a certain label.
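
One way to see that direction at work is to reuse a name on both sides of the equals sign. This is a minimal sketch; the name 'count' is a hypothetical choice for illustration.

```python
# The right-hand side is evaluated first, then stored under the name on the left
count = 2
count = count + 1   # not an equation: take the old value, add 1, store the result

print(count)
```

Read as algebra, the second line would be impossible; read as an instruction, it simply updates the stored value.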


In [ ]:
# 'a' is being given the value 2; 'b' is given 5

a = 2
b = 5

In [ ]:
# Let's perform an operation on the variables

a+b

In [ ]:
# Variables can have many different kinds of names

this_number = 2
b/this_number

Strings

In Python, human language text gets represented as a string. A string contains a sequence of characters and is set off by quotation marks, either double (") or single (').

We will explore different kinds of operations in Python that are specific to human language objects, but it is useful to start by trying to see them as the computer does, as numerical representations.
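
As a rough illustration of that numerical view, Python's built-in ord() and chr() functions convert between a single character and its integer code point. The variable names here are assumptions made for this sketch.

```python
# ord() returns the integer code point behind a single character
upper_code = ord("H")
lower_code = ord("h")
print(upper_code, lower_code)

# chr() goes the other way, from a code point back to a character
print(chr(upper_code))

# Different numbers mean different characters, so case matters to the computer
print("H" == "h")
```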


In [ ]:
# The iconic string

print("Hello, World!")

In [ ]:
# Assign these strings to variables

a = "Hello"
b = 'World'

In [ ]:
# Try out arithmetic operations.
# When we add strings we call it 'concatenation'

print(a+" "+b)
print(a*5)

In [ ]:
# Unlike a number that consists of a single value, a string is an ordered
# sequence of characters. We can find out the length of that sequence.

len("Hello, World!")

In [ ]:
## EX. How long is the string below?

this_string = "It was the best of times; it was the worst of times."
len(this_string)

Lists

The numbers and strings we have just looked at are the two basic data types that we will focus our attention on in this workshop. (In a few days, we will look at a third data type, boolean, which consists of just True/False values.) When we are working with just a few numbers or strings, it is easy to keep track of them, but as we collect more we will want a system to organize them.

One such organizational system is a list. A list holds values (regardless of type) in order, and we can perform operations on it in much the same way as we did with numbers and strings.
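
A small sketch of what "in order" means in practice. The values in 'mixed' are made up for illustration.

```python
# A list can hold values of different types side by side
mixed = [1851, "Moby Dick", 3.5]
print(len(mixed))
print(mixed[1])

# Order is part of a list's identity: the same values in a different
# order make a different list
print([1, 2] == [2, 1])
```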

A list in which each element is a string



In [ ]:
# Let's assign a couple lists to variables

list1 = ['Call', 'me', 'Ishmael']
list2 = ['In', 'the', 'beginning']

In [ ]:
## Q. Predict what will happen when we perform the following operations

print(list1+list2)
print(list1*5)

In [ ]:
# As with a string, we can find out the length of a list

len(list1)

In [ ]:
# Sometimes we just want a single value from the list at a time

print(list1[0])
print(list1[1])
print(list1[2])

In [ ]:
# Or maybe we want the first few

print(list1[0:2])
print(list1[:2])

In [ ]:
# Of course, lists can contain numbers or even a mix of numbers and strings

list3 = [7,8,9]
list4 = [7,'ate',9]

In [ ]:
# And Python is smart with numbers, so we can add them easily!

sum(list3)

In [ ]:
## EX. Concatenate 'list1' and 'list2' into a single list.
##     Retrieve the third element from the combined list.
##     Retrieve the fourth through sixth elements from the combined list.
new_list = list1+list2
print(new_list[2])    # third element (indexing starts at 0)
print(new_list[3:6])  # fourth through sixth elements

A couple of useful tricks

String Methods

The creators of Python recognize that human language has many important yet idiosyncratic features, so they have tried to make it easy for us to identify and manipulate them. For example, in the demonstration at the very beginning of the workshop, we referred to the idea of the suffix: the final letters of a word tell us something about its grammatical role and potentially the author's argument.

We can analyze or manipulate certain features of a string using its methods. These are basically internal functions that every string automatically possesses. Note that even though a method may appear to transform the string at hand, it doesn't change the string permanently!
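
Both points can be sketched briefly: a method can answer a question about a word's suffix, and a method that returns a changed copy leaves the original alone. The example word and variable names are hypothetical.

```python
word = "Walking"

# endswith() answers a yes/no question about the final characters
has_ing_suffix = word.endswith("ing")

# lower() returns a new string; 'word' itself is left untouched
lowered = word.lower()

print(has_ing_suffix, lowered, word)
```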


In [ ]:
# Let's assign a variable to perform methods upon

greeting = "Hello, World!"

In [ ]:
# We saw the 'endswith' method at the very beginning
# Note the type of output that gets printed

greeting.startswith('H'), greeting.endswith('d')

In [ ]:
# We can check whether the string is a letter or a number

this_string = 'f'

this_string.isalpha()

In [ ]:
# When there are multiple characters, it checks whether *all*
# of the characters belong to that category

greeting.isalpha(), greeting.isdigit()

In [ ]:
# Similarly, we can check whether the string is lower or upper case

greeting.islower(), greeting.isupper(), greeting.istitle()

In [ ]:
# Sometimes we want not just to check, but to change the string

greeting.lower(), greeting.upper()

In [ ]:
# The case of the string hasn't changed!

greeting

In [ ]:
# But if we want to permanently make it lower case we re-assign it

greeting = greeting.lower()

greeting

In [ ]:
# Oh hey. And strings are kind of like lists, so we can slice them similarly

greeting[:3]

In [ ]:
# Strings may be like lists of characters, but as humans we often treat them as
# lists of words. We can tell the computer to perform that conversion.

greeting.split()

In [ ]:
## EX. Return the second through eighth characters in 'greeting'

## EX. Split the string below into a list of words and assign this to a new variable
## Note: A backslash at the end of a line allows a string to continue unbroken onto the next

In [ ]:
new_string = "It seems very strange that one must turn back, \
and be transported to the very beginnings of history, \
in order to arrive at an understanding of humanity as it is at present."

print(greeting[1:8])
new_string_list = new_string.split() 
new_string_list

List Comprehension

List comprehensions are a fairly advanced programming technique that we will spend more time talking about tomorrow. For now, you can think of them as list filters. Often, we don't need every value in a list, just a few that fulfill certain criteria.
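
To make the filter idea concrete, the sketch below writes the same filter twice: once as an explicit loop and once as a list comprehension. The length cutoff, the variable names, and the use of the list method append() are assumptions for illustration and are not covered in this lesson.

```python
words = ["Call", "me", "Ishmael"]

# Long form: build the filtered list step by step
short_words_loop = []
for word in words:
    if len(word) <= 4:
        short_words_loop.append(word)

# The same filter written as a list comprehension
short_words = [word for word in words if len(word) <= 4]

print(short_words_loop == short_words, short_words)
```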


In [ ]:
# 'list1' had contained three words, two of which were in title case.
# We can automatically return those words using a list comprehension

[word for word in list1 if word.istitle()]

In [ ]:
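# Or we can include all the words in the list but just take their first letters
# (we use print here because Jupyter only displays a cell's last expression)

print([word[0] for word in list1])

# The same letters, produced one at a time with an explicit loop
for word in list1:
    print(word[0])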

In [ ]:
## EX. Using the list of words you produced by splitting 'new_string', create
##     a new list that contains only the words whose last letter is "y" 
y_list = [word for word in new_string_list if word.endswith('y')]
print(y_list)
## EX. Create a new list that contains the first letter of each word.
first_letter = [word[0] for word in new_string_list]
print(first_letter)
## EX. Create a new list that contains only words longer than two letters.
long_words = [word for word in new_string_list if len(word)>2]
print(long_words)