Python is a flexible and powerful programming language that can help you work with texts and corpora. Python's philosophy emphasizes code readability, and the language features a simple and very expressive syntax. It is actually easy to master the basic aspects of Python's syntax: it is amazing how much you can do even with just the most basic concepts... The aim of these two lectures is to introduce you to some of these basic operations, let you see some code in action and also give you some exercises where you can apply what you've seen.
It is also amazing how many things you can accomplish with a few well-written lines of Python! By the end of this class, we'd like to show you how to use Python to perform (some) Natural Language Processing. But of course, you can also just use Python to do something as easy as...
In [1]:
2 + 3
Out[1]:
5
Here we go! We've written our first line of code... But I guess we want to do something a little more interesting, right? Well, for a start, we might want to use Python to execute some operation (say: sum two numbers like 2 and 3) and then take the result to print it on the screen, process it further, and reuse it as many times as we want...
Variables are what we use to store values. Think of a variable as a shoebox where you place your content; the next time you need that content (e.g. the result of a previous operation, or some input you've read from a file) you simply call the shoebox by its name...
In [3]:
result = 2 - 2
In [4]:
#now we print the result
print(result)
In [5]:
# by the way, I'm a comment. I'm not executed
# every line of code following the sign # is ignored:
# print("I'm line n. 3: do you see me?")
# see? You don't see me...
print("I'm line nr. 5 and you DO see me!")
That's it! As easy as that (yes, in some programming languages you have to create or declare the variable first and then use it to fill the shoebox; in Python, you go ahead and simply use it!)
Now, what do you think we will get when we execute the following code?
In [7]:
result + 8
Out[7]:
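To make the shoebox idea a bit more concrete, here is a small sketch (the variable names total and doubled are made up for this illustration): we store a value once, reuse it, and then replace the content of the shoebox.

```python
# store the result of an operation in a variable
total = 2 + 3

# reuse the stored value as many times as we like
doubled = total * 2
print(total)
print(doubled)

# assigning to the same name again replaces the content of the shoebox
total = total + 8
print(total)
```

Note that typing total + 8 alone only computes a value; the shoebox changes only when we assign with =.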
What types of values can we put into a variable? What goes into the shoebox? We can start with the members of this list:
If you're not sure what type of value you're dealing with, you can use the function type(). Yes, it works with variables too...!
In [6]:
type("I am the α and the ω!")
Out[6]:
str
In [7]:
type(2.7182818284590452353602874713527)
Out[7]:
float
In [9]:
type(True)
Out[9]:
bool
In [15]:
result = "hello"
In [16]:
type(result)
Out[16]:
str
You declare strings with single ('') or double ("") quotes: it makes no difference! But now two questions:
In [10]:
mionome = "lucia"
print(mionome)
In [11]:
print("ciao")
In [13]:
type("4")
Out[13]:
str
String, integer, float... Why is that so important? Well, try to sum two strings and see what happens...
In [22]:
"2" + "3"
Out[22]:
'23'
In [23]:
#probably you wanted this...
int("2") + int("3")
Out[23]:
5
But if we are working with strings, then the "+" sign is used to concatenate the strings:
In [24]:
a = "interesting!"
print("not very " + a)
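Conversions work in both directions. As a small sketch (the variable names are invented for this example), int() turns a string into a number, and str() turns a number back into a string so that it can be concatenated:

```python
number_as_text = "4"           # a string, even if it looks like a number
number = int(number_as_text)   # convert the string to an integer

print(number + 1)              # arithmetic now works as expected

# to glue a number onto a string, convert it back with str()
message = "The Beatles were " + str(number) + " people"
print(message)
```

Without the call to str(), the last concatenation would raise a TypeError, because Python does not mix strings and integers with "+".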
Lists and dictionaries are two very useful types to store whole collections of data
In [14]:
beatles = ["John", "Paul", "George", "Ringo"]
type(beatles)
Out[14]:
list
In [15]:
# dictionaries are collections of key : value pairs
beatles_dictionary = { "john" : "John Lennon" ,
"paul" : "Paul McCartney",
"george" : "George Harrison",
"ringo" : "Ringo Starr"}
type(beatles_dictionary)
Out[15]:
dict
Items in a list are accessible using their index. Do remember that indexing starts from 0!
In [20]:
print(beatles[3], beatles[1])
In [28]:
#indexes can be negative!
beatles[-1]
Out[28]:
Dictionaries are collections of key : value pairs. You access the value using the key as index
In [21]:
beatles_dictionary["paul"]
Out[21]:
'Paul McCartney'
In [22]:
beatles_dictionary[0]
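Dictionaries have no positions, so asking for a key that is not there (like 0 above) raises a KeyError. Here is a minimal sketch of two ways to look before you leap; it uses a shortened copy of the dictionary, and the fallback text is made up for the example:

```python
beatles_dictionary = {"john": "John Lennon", "paul": "Paul McCartney"}

# "in" tells us whether a key exists
has_paul = "paul" in beatles_dictionary
has_zero = 0 in beatles_dictionary

# .get() returns None (or a default of your choice) instead of raising KeyError
missing = beatles_dictionary.get("pete")
fallback = beatles_dictionary.get("pete", "not in the dictionary")

print(has_paul, has_zero, missing, fallback)
```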
There are a bunch of methods that you can apply to lists to work with them.
You can append items at the end of a list
In [23]:
beatles.append("randomname")
beatles
Out[23]:
You can learn the index of an item
In [24]:
beatles.index("Paul")
Out[24]:
You can insert elements at a specified index:
In [33]:
beatles.insert(0, "Pete Best")
print(beatles.index("George"))
beatles
Out[33]:
But most importantly, you can slice lists, producing sub-lists by specifying the range of indexes you want:
In [34]:
beatles[1:5]
Out[34]:
Do you notice something strange? Yes, the limit index is not inclusive (i.e. item beatles[5] is not included)
In [35]:
beatles[5]
Out[35]:
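Slices can also leave out one of the two limits, or count from the end. A minimal sketch, using a copy of the list as it stands at this point (the name band is made up for this example):

```python
band = ["Pete Best", "John", "Paul", "George", "Ringo", "randomname"]

first_two = band[:2]     # from the start up to (not including) index 2
from_third = band[2:]    # from index 2 to the end
last_two = band[-2:]     # the last two items

print(first_two)
print(from_third)
print(last_two)
```

Slicing always produces a new list; the original list is left untouched.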
What happens if you specify an index that is too high?
In [36]:
beatles[7]
Out[36]:
How can you know how long a list is?
In [37]:
len(beatles)
Out[37]:
Do remember that indexing starts at 0, so don't make the mistake of thinking that len(yourlist) will give you the index of the last item of your list!
In [38]:
beatles[len(beatles)]
This will work!
In [39]:
beatles[len(beatles) -1]
Out[39]:
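As a quick sketch of the two ways to reach the last item (the list band is just a small example list): the index len(...) - 1 and the negative index -1 point at the same element.

```python
band = ["John", "Paul", "George", "Ringo"]

last_index = len(band) - 1    # 4 items, so the last index is 3
print(band[last_index])
print(band[-1])               # shorter, and works for any list length
```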
Most of the time, what you want to do when you program is to check a value and execute some operation depending on whether the value matches some condition. That's where if statements help!
In its simplest form, an if statement is a syntactic construction that checks whether a condition is met; if it is, some part of the code is executed.
In [43]:
bassist = "Paul McCartney"
if bassist == "Paul McCartney":
    print("Paul played bass with the Beatles!")
Mind the indentation! It is the essential element in the syntax of the statement.
In [44]:
bassist = "Bill Wyman"
if bassist == "Paul McCartney":
    print("I'm part of the if statement...")
    print("Paul played bass in the Beatles!")
What happens if the condition is not met? Nothing! The indented code is not executed, because the condition is not met, so both print lines are simply skipped.
But what happens if we de-indent the last line? Can you guess why this is what happens?
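Here is a small sketch of the difference that indentation makes. Instead of printing, we collect the messages in a list (the name executed is made up for this illustration), so that we can check afterwards which lines actually ran:

```python
bassist = "Bill Wyman"
executed = []

if bassist == "Paul McCartney":
    # indented: belongs to the if block, skipped when the condition is False
    executed.append("inside the if block")
# not indented: no longer part of the if block, so it always runs
executed.append("outside the if block")

print(executed)
```

Only the de-indented line runs, because Python uses the indentation alone to decide where the if block ends.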
Most of the time, we need to specify what happens if the conditions are not met
In [45]:
bassist = ""
if bassist == "Paul McCartney":
    print("Paul played bass in the Beatles!")
else:
    print("This guy did not play for the Beatles...")
This is the flow:
Or we can specify many different conditions...
In [52]:
bassist = "Mike"
if bassist == "Paul McCartney":
    print("Paul played bass in the Beatles!")
elif bassist == "Bill Wyman":
    print("Bill Wyman played for the Rolling Stones!")
elif bassist == "Lucia":
    print("Lucia played for Oasis")
elif bassist == "Mike":
    print("Mike played for xxx")
else:
    print("I don't know what band this guy played for...")
The greatest thing about lists is that they are iterable, that is, you can loop through them. What do we do if we want to apply some line of code to each element in a list? Try a for loop!
A for loop can be paraphrased as: "for each element named x in an iterable (e.g. a list): do some code (e.g. print the value of x)"
In [55]:
for b in beatles:
    print(b + " was one of the Beatles")
Let's break the code down to its parts:
Now, let's join if statements and for loop to do something nice...
In [57]:
beatles = ["John", "Paul", "George", "Ringo", "Lucia"]
for b in beatles:
    if b == "Paul":
        instrument = "bass"
    elif b == "John":
        instrument = "rhythm guitar"
    elif b == "George":
        instrument = "lead guitar"
    elif b == "Ringo":
        instrument = "drum"
    elif b == "Lucia":
        instrument = "piano"
    print(b + " played " + instrument + " with the Beatles")
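In the loop above, every name matches one of the branches. If a name matched none of them, instrument would keep the value from the previous round of the loop (or not exist at all on the first round). A minimal sketch of a safer variant, which adds a final else as a fallback (the list people and the fallback text are assumptions made for this example):

```python
people = ["John", "Ringo", "Lucia"]
sentences = []

for p in people:
    if p == "John":
        instrument = "rhythm guitar"
    elif p == "Ringo":
        instrument = "drums"
    else:
        # fallback for every name that no branch above recognizes
        instrument = "an unknown instrument"
    sentences.append(p + " played " + instrument)

for s in sentences:
    print(s)
```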
One of the most frequent tasks that programmers do is reading data from files and writing some of the output of their programs to a file.
In Python (as in many languages), we first need to open a file handler with the appropriate mode in order to process a file. Files can be opened in:
Let's try to read the content of one of the txt files of our Sunoikisis directory
First, we open the file handler in read mode:
In [58]:
#see? we assign the file-handler to a variable, or we wouldn't be able
#to do anything with that!
f = open("NOTES.md", "r")
Note that "r" is optional: read is the default mode!
Now there are a bunch of things we can do:
content = f.read()
lines = f.readlines()
In [ ]:
for l in f:
    print(l)
Once you're done, don't forget to close the handle:
In [60]:
f.close()
In [ ]:
#all together
f = open("NOTES.md")
for l in f:
    print(l)
f.close()
Now, there's a shortcut statement, which you'll often see and is very convenient, because it takes care of opening, closing and cleaning up the mess, in case there's some error:
In [61]:
with open("NOTES.md") as f:
    #mind the indent!
    for l in f:
        #double indent, of course!
        print(l)
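Two details are easy to miss when you loop over a file: every line still carries its final new-line character, and the with statement closes the file for you. The sketch below shows both; so that it can run anywhere, it first creates a small throwaway file (the file name example.txt and its content are made up for this example):

```python
import os
import tempfile

# create a small throwaway file so that the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

lines = []
with open(path) as f:
    for l in f:
        # each line ends with "\n"; rstrip removes it
        lines.append(l.rstrip("\n"))

print(lines)
print(f.closed)   # the with statement has already closed the file
```

The trailing "\n" is also why print(l) in the loop above shows an empty line after each line of the file.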
Now, how about writing to a file? Let's try to write a simple message to a file; first, we open the handler in write mode:
In [ ]:
out = open("test.txt", "w")
In [ ]:
#the file is now open; let's write something in it
out.write("This is a test!\nThis is a second line (separated with a new-line feed)")
The file has been created! Let's check this out
In [ ]:
#don't worry if you don't understand this code!
#We're simply listing the content of the current directory...
import os
os.listdir()
But before we can do anything (e.g. open it with your favorite text editor) you have to close the file-handler!
In [ ]:
out.close()
Let's look at its content
In [ ]:
with open("test.txt") as f:
    print(f.read())
Again, also for writing we can use a with statement, which is very handy.
But let's have a look at what happens here, so we understand a bit better why "write mode" must be used carefully!
In [ ]:
with open("test.txt", "w") as out:
    out.write("Oooops! new content")
Let's have a look at the content of "test.txt" now
In [ ]:
with open("test.txt") as f:
    print(f.read())
See? After we opened the file in "write mode" for the second time, all content of the file was erased and replaced with the new content that we wrote!!!
So keep in mind: when you open a file in "w" mode:
If you want to write content to an existing file without losing its previous content, you have to open the file with the "a" mode:
In [69]:
with open("test.txt", "a") as out:
    out.write('''\nAnd this is some additional content.
The new content is appended at the bottom of the existing file''')
In [70]:
with open("test.txt") as f:
    print(f.read())
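To see the two modes side by side, here is a compact sketch that writes to a throwaway file in a temporary directory (the file name modes.txt and the strings written are invented for the example), so that nothing in your working directory is touched:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as out:
    out.write("one\n")

# opening in "w" mode again erases what was there
with open(path, "w") as out:
    out.write("two\n")

# "a" mode keeps the existing content and adds to the end
with open(path, "a") as out:
    out.write("three\n")

with open(path) as f:
    content = f.read()

print(content)
```

The first line we wrote ("one") is gone, while the appended line sits right after "two".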
Above, we have opened a file several times to inspect its content. Each time, we had to type the same code over and over. This is the typical case where you would like to save some typing (and write code that is much easier to maintain!) by defining a function.
A function is a block of reusable code that can be invoked to perform a definite task. Most often (but not necessarily), it accepts one or more arguments and returns a certain value.
We have already seen one of the built-in functions of Python: print("some str")
But it's actually very easy to define your own. Let's define the function to print out the file content, as we said before. Note that this function takes one argument (the file name) and prints out some text, but doesn't return any value.
In [71]:
def printFileContent(file_name):
    #the function takes one argument: file_name
    with open(file_name) as f:
        print(f.read())
As usual, mind the indent!
file_name (line 1) is the placeholder that we use in the function for any argument that we want to pass to the function in our real-life reuse of the code.
Now, if we want to use our function we simply call it with the file name that we want to print out
In [73]:
printFileContent("README.md")
Now, let's see an example of a function that returns some value to the user. Such functions typically take some arguments, process them and give back the result of this processing.
Here's the easiest example possible: a function that takes two numbers as arguments, sums them and returns the result.
In [75]:
def sumTwoNumbers(first_int, second_int):
    s = first_int + second_int
    return s
In [ ]:
#could be even shorter:
def sumTwoNumbers(first_int, second_int):
    return first_int + second_int
In [76]:
sumTwoNumbers(5, 6)
Out[76]:
11
Most often, you want to assign the result returned to a variable, so that you can go on working with the results...
In [78]:
s = sumTwoNumbers(5,6)
s * 2
Out[78]:
22
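The difference between printing and returning is easy to overlook. As a sketch (the two function names below are hypothetical, chosen just for this comparison): a function that only prints hands back None, so there is nothing to go on working with.

```python
def print_sum(a, b):
    # shows the result on the screen, but returns nothing
    print(a + b)

def return_sum(a, b):
    # hands the result back to the caller
    return a + b

shown = print_sum(5, 6)    # the sum is printed, but shown is None
kept = return_sum(5, 6)    # nothing is printed, but kept holds the sum

print(shown)
print(kept * 2)
```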
Things can go wrong, especially when you're a beginner. But don't panic! Errors and exceptions are actually a good thing! Python gives you detailed reports about what is wrong, so read them carefully and try to figure out what is not right.
Once you get better, you'll actually learn that you can do something good with exceptions: you'll learn how to handle them, and to anticipate some of the most common problems that dirty data can confront you with...
Now, what happens if you forget the all-important syntactic constraint of the code indent?
In [80]:
if 1 > 0:
print("Well, we know that 1 is bigger than 0!")
Pretty clear, isn't it? What you get is a syntax error: a construct that is not grammatical in Python's syntax. Note that you're also told where (at what line, and at what point of the code) your error is occurring. That is not always perfect (there are cases where the problem actually occurs earlier than where Python reports it), but in this case it's pretty OK.
What if you forget to define a variable (or you misspell the name of a variable)?
In [84]:
var = "bla bla"
if var1:
    print("If you see me, then I was defined...")
You get an exception! The syntax of your code is right, but the execution met with a problem that caused the program to stop.
Now, in your program, you can handle selected exceptions: this means that you can write your code in such a way that the program keeps running even if a certain exception is raised.
Let's see what happens if we use our function to try to print the content of a file that doesn't exist:
In [85]:
printFileContent("file_that_is_not_there.txt")
We get a FileNotFoundError! Now, let's re-write the function so that this event (somebody uses the function with a wrong file name) is taken care of...
In [86]:
def printFileContent(file_name):
    #the function takes one argument: file_name
    try:
        with open(file_name) as f:
            print(f.read())
    except FileNotFoundError:
        print("The file does not exist.\nNevertheless, I do like you, and I will print something to you anyway...")
In [87]:
printFileContent("file_that_doesnt_exist.txt")
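The same try/except pattern works for other exceptions as well. As a sketch (the function name to_int_or_none and the sample strings are hypothetical), here we guard against text that cannot be converted to a number, a typical problem with dirty data:

```python
def to_int_or_none(text):
    # try the conversion; if the text is not a number, give back None
    try:
        return int(text)
    except ValueError:
        return None

values = []
for t in ["2", "three", "4"]:
    values.append(to_int_or_none(t))

print(values)
```

Catching only ValueError is a deliberate choice: any other, unexpected exception will still stop the program and tell you that something is wrong.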
If you're using Mac OS X or Linux, you already have (at least one version of) Python installed. Anyway, it's very easy to install Python or upgrade your version. See:
Python and Jupyter also come in a pre-packaged environment (designed especially for data science) called Anaconda. You might be interested in looking at that.
Python 3 is the latest version of Python (currently, 3.6.1). It's a major upgrade from Python 2, but the language changed quite dramatically in the passage from 2 to 3 and there are some backward-compatibility problems. Some versions of Linux or Mac OS X still come with Python 2.7 (the final version of Python 2).
Anyway, Python 3 is currently in active development: it's where the cutting-edge improvements and new stuff are being developed (especially for NLP and the NLTK library). In this code, we assume Python 3!
Would you like a book that is a great introduction to Python for absolute beginners, is a wonderful resource to learn the basics of Natural Language Processing and gives you a thorough introduction to the NLTK library to do NLP in Python? Oh, yeah, I was forgetting: one that can be read for free on the internet? Yes, it's Christmas time!